ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font color=red>Working with an ADSDataset Object</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

## Overview of this Notebook
One of the most important elements of any data science project is the data itself. This notebook demonstrates how to work with the `ADSDataset` class.

---

## Objectives:
To demonstrate the following methods and features of the `ADSDataset` class:
 - <a href='#columnbasedops'>Column Based Operations</a>
    - <a href='#rename_columns'>Renaming Columns</a>
    - <a href='#feature_types'>Feature Type</a>
    - <a href='#assign_column'>Assigning Values to a Column</a>
    - <a href='#astype'>Altering a Column's Data Type</a>
    - <a href='#merge'>Merging ADSDataset Objects</a>
 - <a href='#target'>Target</a>
 - <a href='#correlation'>Correlation</a>
    - <a href='#corr'>Correlation Calculation</a>
    - <a href='#showcorr'>Correlation Visualization</a>
 - <a href='#reference'>References</a>
 ***

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party Content and are not considered Materials 
under your agreement with Oracle applicable to the Services.  
    
The `wine` dataset license is available [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING)    
</font>

***

In [None]:
import ads
import logging
import pandas as pd
import numpy as np
import warnings
from ads.dataset.dataset_browser import DatasetBrowser
from ads.dataset.factory import DatasetFactory
from os import path

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)


<a id='columnbasedops'></a>
# Column Based Operations

The sections below demonstrate the common operations that a data scientist will perform. For the purposes of the example, a small `ADSDataset` will be created with some basic student information.

In [None]:
data = [["ID", "Name", "GPA"], 
        [1, "Bob", 3.7], 
        [2, "Sam", 4.3], 
        [3, "Erin", 2.6]]
df = pd.DataFrame(data[1:], columns=data[0])
ds = DatasetFactory.open(df)
ds.head()

<a id='rename_columns'></a>
## Renaming Columns

The `ADSDataset` class has a `rename_columns` method for renaming columns. It takes a dictionary in which each key is an existing column name and each value is the new column name. Keys that do not match an existing column are ignored. The column names are processed in the order listed in the dictionary, so it is possible to swap column names as long as no intermediate rename conflicts with an existing column name.

In this example, the column `ID` will be changed to `StudentID` and `Name` will be changed to `GivenName`.

In [None]:
rename = ds.rename_columns({'ID': "StudentID", 'Name': 'GivenName'})
rename.columns
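
For comparison, the same mapping style can be tried on a plain pandas `DataFrame`. This is only a sketch of the analogous pandas call, not the ADS implementation: pandas also skips keys that are not column names (the `Age` key below is a made-up example), but it applies the whole mapping at once instead of in listed order.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "Name": ["Bob", "Sam", "Erin"],
                   "GPA": [3.7, 4.3, 2.6]})

# "Age" is not a column, so pandas silently skips that key.
renamed = df.rename(columns={"ID": "StudentID", "Name": "GivenName", "Age": "Years"})
print(list(renamed.columns))  # ['StudentID', 'GivenName', 'GPA']

# pandas applies the mapping in one step, so two names can be swapped directly.
swapped = df.rename(columns={"ID": "Name", "Name": "ID"})
print(list(swapped.columns))  # ['Name', 'ID', 'GPA']
```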

<a id='feature_types'></a>
## Feature Type

Each column in an `ADSDataset` object has a feature type. This determines how the information will be used in modeling. The `feature_types` attribute provides information on them. Common feature types are ordinal, categorical and continuous.

In [None]:
for key in ds.feature_types:
    print('{} is {}'.format(key, ds.feature_types[key]['type']))

The `feature_types` attribute is a dictionary with information about each feature. This includes its type, the percentage of missing values, the low-level data type and summary statistics. The summary statistics vary by feature type. For example, continuous features report measures of centrality (mode, median, mean), kurtosis, skewness, variance, standard deviation, count, the percentage of outliers, the observed range (minimum and maximum) and quantiles (25th, 50th and 75th percentiles).

In [None]:
ds.feature_types['GPA']
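
To get a feel for the kind of statistics listed above, several of them can be computed directly with pandas on the same `GPA` values. This is a rough sketch, assuming plain pandas defaults. The keys in the `stats` dictionary are illustrative and do not necessarily match the keys ADS uses.

```python
import pandas as pd

gpa = pd.Series([3.7, 4.3, 2.6], name="GPA")

# Illustrative summary of a continuous column using pandas defaults.
stats = {
    "count": int(gpa.count()),
    "missing_pct": gpa.isna().mean() * 100,
    "mean": gpa.mean(),
    "median": gpa.median(),
    "std": gpa.std(),        # sample standard deviation (ddof=1)
    "variance": gpa.var(),   # sample variance (ddof=1)
    "skewness": gpa.skew(),
    "min": gpa.min(),
    "max": gpa.max(),
}

# 25th, 50th and 75th percentiles with linear interpolation.
quartiles = gpa.quantile([0.25, 0.5, 0.75])

print(stats)
print(quartiles)
```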

<a id='assign_column'></a>
## Assigning Values to a Column

The `assign_column` method adds a new column to an `ADSDataset` object or updates the values of an existing column. The method accepts the column name followed by a list of values.

In [None]:
surname = ds.assign_column("Surname", ['Smith', 'Jones', 'Allan'])
surname

It is possible to update the values in a column using a lambda function. In this example, the student IDs will be increased by 1000.

In [None]:
new_id = ds.assign_column("ID", lambda x:  int(x+1000))
new_id

When values are updated, the `feature_types` information is automatically updated.

In [None]:
new_id.feature_types['ID']

<a id='astype'></a>
## Altering a Column's Data Type

The student ID column is an ordinal data type. The `astype` method allows this to be changed. It takes a dictionary in which each key is a column name and each value is the new column type. If the column name does not exist, an error is raised.

In this example, the column `ID` will be changed to `categorical`.

In [None]:
new_type = ds.astype({'ID': 'categorical'})
new_type.feature_types['ID']['type']

<a id='merge'></a>

## Merging ADSDataset Objects

Two `ADSDataset` objects can be merged with the `merge` method. In this example, an `ADSDataset` will be created with two columns that represent each student's surname and whether they are on the Honors list.

In [None]:
data = [["Surname", "Honors"], 
        ['Smith', True], 
        ['Jones', True], 
        ['Allan', False]]
df = pd.DataFrame(data[1:], columns=data[0])
honors = DatasetFactory.open(df)
honors.head()

The two datasets can be merged with:

In [None]:
honors_list = ds.merge(honors, left_index=True, right_index=True)
honors_list
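
The `left_index` and `right_index` arguments have the same names as those of the pandas `DataFrame.merge` method. The sketch below shows how an index-on-index merge behaves in plain pandas, on the assumption that the ADS method follows the same semantics: rows are matched by their index labels (0, 1, 2) and not by a key column.

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3],
                         "Name": ["Bob", "Sam", "Erin"],
                         "GPA": [3.7, 4.3, 2.6]})
honors_df = pd.DataFrame({"Surname": ["Smith", "Jones", "Allan"],
                          "Honors": [True, True, False]})

# Join on the row index of both frames; the default join type is "inner".
merged = students.merge(honors_df, left_index=True, right_index=True)
print(merged)
```

Because both frames share the default index, every row finds a match and the result has three rows and five columns.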

<a id='target'></a>

# Target

The target in a dataset is the value that is to be predicted. The target can be set with the `set_target` method, which accepts the name of a column. Setting the target changes the class of the object to one that is customized for the properties of the target. In the example below, the target will be set to the `Honors` column. Since `Honors` is binary-valued, the `ADSDataset` will be converted to a `BinaryClassificationDataset` object.

ADS will attempt to determine which class is the positive class for the `BinaryClassificationDataset` object. However, this can be set manually with the `set_positive_class()` method. The parameter is the value of the positive class.

In [None]:
honors_target = honors_list.set_target("Honors")
type(honors_target)

The `target.show_in_notebook()` method generates a plot of the target column. The type of plot that is generated is specific to the data type. Generally, this is a bar plot for categorical and ordinal data. A histogram is used for continuous data.

In [None]:
honors_target.target.show_in_notebook()

The `target.show_in_notebook` method also takes a list of columns in the `feature_names` parameter. It will generate one plot for each listed feature against the target column.

In [None]:
honors_target.target.show_in_notebook(feature_names=["GPA", "Name"])

The `Honors` target resulted in a `BinaryClassificationDataset` because it had true and false values and any prediction that would be done on that type of data would be a binary classification problem. However, changing the target to the `GPA` column will result in a `RegressionDataset` because `GPA` is a continuous variable.

In [None]:
type(honors_list.set_target("GPA"))

Changing the target to `Name` will result in a `MultiClassClassificationDataset` object because the data is categorical, like the `Honors` column, but there are more than two categories. Therefore, the prediction problem would be a multiclass classification problem.

In [None]:
type(honors_list.set_target("Name"))
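
The reasoning in the three examples above can be summarized as a small decision rule. The helper below, `guess_problem_type`, is hypothetical and deliberately crude. It is not how ADS chooses the dataset class, and it only mirrors the cases discussed here: two classes, continuous values, and more than two categories.

```python
import pandas as pd
from pandas.api import types as ptypes


def guess_problem_type(target: pd.Series) -> str:
    """Hypothetical helper: map a target column to a problem type."""
    n_classes = target.nunique(dropna=True)
    # Boolean columns, or any column with exactly two distinct values.
    if ptypes.is_bool_dtype(target) or n_classes == 2:
        return "binary classification"
    # Floating-point columns are treated as continuous.
    if ptypes.is_float_dtype(target):
        return "regression"
    # Everything else is treated as a set of categories.
    return "multiclass classification"


students = pd.DataFrame({"Name": ["Bob", "Sam", "Erin"],
                         "GPA": [3.7, 4.3, 2.6],
                         "Honors": [True, True, False]})

for column in ["Honors", "GPA", "Name"]:
    print(column, "->", guess_problem_type(students[column]))
```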

<a id='correlation'></a>
# Correlation
The `corr()` method uses the following techniques to calculate the correlation based on the data types:
- Continuous-continuous: `Pearson` method [(link)](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). The correlations range from -1 to 1.
- Categorical-categorical: `Cramer's V`[(link)](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V). The correlations range from 0 to 1.
- Continuous-categorical: `Correlation Ratio`[(link)](https://en.wikipedia.org/wiki/Correlation_ratio). The correlations range from 0 to 1.

**Note:** `Continuous` features consist of the `continuous` and `ordinal` types.
          `Categorical` features consist of the `categorical` and `zipcode` types.
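
Of the three measures, the correlation ratio is probably the least familiar. The sketch below computes it from its textbook definition: the square root of the between-group sum of squares divided by the total sum of squares. It is an illustration in pandas and NumPy, not the ADS implementation, and the function name `correlation_ratio` is chosen here for the example.

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Return eta for a categorical series and a continuous series."""
    grouped = values.groupby(categories)
    overall_mean = values.mean()
    # Between-group sum of squares: group size times the squared
    # distance of each group mean from the overall mean.
    between = (grouped.count() * (grouped.mean() - overall_mean) ** 2).sum()
    # Total sum of squares around the overall mean.
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0


groups = pd.Series(["a", "a", "b", "b"])

# The category fully determines the value, so eta is 1.
eta_separated = correlation_ratio(groups, pd.Series([1.0, 1.0, 3.0, 3.0]))
# Both groups have the same mean, so eta is 0.
eta_unrelated = correlation_ratio(groups, pd.Series([1.0, 3.0, 1.0, 3.0]))
print(eta_separated, eta_unrelated)
```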

In [None]:
ds = DatasetFactory.open(
    path.join("/", "opt", "notebooks", "ads-examples", "oracle_data", "orcl_attrition.csv"), 
    target="Attrition").set_positive_class('Yes')

To check the data type of each feature, use the `summary()` method.

In [None]:
ds.summary()

<a id='corr'></a>
## Correlation Calculation

The `corr()` method accepts the following parameters:

- `correlation_methods`: The methods used to calculate the correlation. Defaults to 'pearson'.
- `frac`: The portion of the data used to calculate the correlation. It must be greater than 0 and less than or equal to 1. Defaults to 1.
- `nan_threshold`: Drop a column from the analysis if the proportion of values that are NaN exceeds this threshold. Valid values are 0 to 1. Default is 0.8.
- `force_recompute`: Force a recompute of the cached correlation matrices. Default is False.
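
The effect of `frac` and `nan_threshold` can be pictured with plain pandas. The sketch below is an assumption about what the two parameters mean, based only on the descriptions above, and not the ADS code. The column names and values are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],                   # no missing values
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],          # 60% missing
    "c": [np.nan, np.nan, np.nan, np.nan, np.nan],    # 100% missing
})

nan_threshold = 0.8
frac = 0.6

# Drop columns whose share of NaN values exceeds the threshold.
nan_share = df.isna().mean()
kept = df.loc[:, nan_share <= nan_threshold]

# Use only a fraction of the rows; random_state makes the sketch repeatable.
sampled = kept.sample(frac=frac, random_state=0)

print(list(kept.columns))
print(len(sampled))
```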

**Note**: For Cramer's V, a bias correction is applied, but Yates's correction for continuity ([link](https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity)) is not.
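
The note does not say which bias correction is used. A common choice is the one proposed by Bergsma, and the sketch below assumes that variant. It computes the chi-squared statistic without a continuity correction and then applies the correction to both the statistic and the table dimensions. Treat it as an illustration of the idea and not as the ADS implementation; the function name is chosen for this example.

```python
import numpy as np
import pandas as pd


def cramers_v_corrected(x: pd.Series, y: pd.Series) -> float:
    """Bias-corrected Cramer's V (Bergsma variant, assumed), no Yates correction."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    r, k = observed.shape
    # Expected counts under independence.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    phi2 = chi2 / n
    # Bias-corrected phi squared and table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    k_corr = k - (k - 1) ** 2 / (n - 1)
    r_corr = r - (r - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0


x = pd.Series(["a", "a", "b", "b"] * 10)
v_identical = cramers_v_corrected(x, x)                                     # close to 1
v_independent = cramers_v_corrected(x, pd.Series(["c", "d", "c", "d"] * 10))  # 0
print(v_identical, v_independent)
```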

In [None]:
cts_vs_cts = ds.corr()

A single method can be selected from 'pearson', 'cramers v', and 'correlation ratio' by passing it to `correlation_methods` as a string.

In [None]:
cat_vs_cat = ds.corr(correlation_methods='cramers v')

Multiple methods can be selected from 'pearson', 'cramers v', and 'correlation ratio' by passing them as a list.

In [None]:
cat_vs_cat, cat_vs_cts = ds.corr(correlation_methods=['cramers v', 'correlation ratio'])

Setting `correlation_methods` to 'all' is equivalent to passing ['pearson', 'cramers v', 'correlation ratio'].

In [None]:
cts_vs_cts, cat_vs_cat, cat_vs_cts = ds.corr(correlation_methods='all')

The Pearson correlations between continuous features:

In [None]:
cts_vs_cts

The Cramer's V correlations between categorical features:

In [None]:
cat_vs_cat

The correlation ratios between categorical and continuous features:

In [None]:
cat_vs_cts

<a id='showcorr'></a>
## Correlation Visualization

The `show_corr()` method creates a heatmap or bar plot of the pairwise correlations. The `correlation_methods` parameter specifies which correlation methods to use. By default, this is 'pearson'. You can select one or more of 'pearson', 'cramers v', and 'correlation ratio', or set it to 'all' to show all correlation charts. The usage is the same as in the `corr()` method.

The `correlation_threshold` parameter applies a filter to the correlation matrices so that only the pairs whose correlation values are greater than or equal to `correlation_threshold` are shown.
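
The filtering step can be sketched on a small, made-up correlation matrix with pandas. The feature names and values below are hypothetical, and the masking approach is an assumption about what such a filter looks like, not the ADS implementation.

```python
import pandas as pd

features = ["age", "income", "tenure"]  # hypothetical feature names
corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.4],
     [0.1, 0.4, 1.0]],
    index=features,
    columns=features,
)
correlation_threshold = 0.4

# Keep values at or above the threshold; the rest become NaN,
# which a heatmap would typically leave blank.
filtered = corr.where(corr >= correlation_threshold)
print(filtered)
```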

The correlation results are cached because they are expensive to compute. If the parameters `nan_threshold`, `frac` or `correlation_target` are changed, the `force_recompute` parameter needs to be set to `True`. Otherwise, the cached result is shown.

When the `correlation_target` parameter is not `None`, the `plot_type` can be `'bar'` or `'heatmap'`. However, when `plot_type` is set to be `'bar'`, `correlation_target` also has to be set.

The following cell will show all the correlation charts.

In [None]:
ds.show_corr(plot_type='heatmap', correlation_methods='all')

<a id='reference'></a>
# References
 - <a href="https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html">Oracle ADS</a>