ADS Sample Notebook.

Copyright (c) 2019, 2021 Oracle, Inc. All rights reserved. Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.

***
# <font> Introduction to Dataset Factory Transformations </font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

## Overview of this Notebook
The most important element in any data science project is the data itself. The data should be as clean as possible so that we do not misinterpret any structure inherent to the problem. This notebook demonstrates the core functionality of the Dataset Factory class.

In this notebook, you will learn some of the many ways to clean and transform data in an `ADSDatasetFactory` object.

---

## Objectives:
By the end of this tutorial, we will know how to:

0. <a href='#setup'>Setup</a>
1. <a href='#load'>Load in datasets</a> from a multitude of sources and formats using Oracle's ADS Dataset Factory class.
2. <a href='#clean'>Auto Clean</a> the entire dataset with built-in recommendation engines.
3. <a href='#rowops'>Examples of Row Operations</a> valid for any `ADSDataset`.
    - 3.1. <a href='#delrow'>Delete a row</a>
    - 3.2. <a href='#addrow'>Add a row</a>
    - 3.3. <a href='#filterrow'>Filter by row</a>
    - 3.4. <a href='#deldup'>Delete duplicate rows</a>
4. <a href='#colops'>Examples of Column Operations</a> valid for any `ADSDataset`.
     - 4.1. <a href='#delcol'>Delete a column</a>
     - 4.2. <a href='#addcol'>Add a column</a>
     - 4.3. <a href='#filtercol'>Filter by column</a>
     - 4.4. <a href='#rename'>Rename a column</a>
     - 4.5. <a href='#convert'>Convert the data type of a column</a>
     - 4.6. <a href='#norm'>Normalize a column</a>
     - 4.7. <a href='#strops'>Operations on a column of strings</a>
5. <a href='#dsops'>Examples of General Dataset Operations</a> valid for any `ADSDataset`.
     - 5.1. <a href='#catenc'>Categorical encoding</a>
     - 5.2. <a href='#onehotenc'>One-hot encoding</a>
     - 5.3. <a href='#null'>Getting all null values from a dataset</a>
     - 5.4. <a href='#impute'>Imputation</a>
     - 5.5. <a href='#merge'>Merging two datasets</a>
6. <a href='#ref'> References</a>
 ***

<a id='setup'></a>
## Setup

In [None]:
import warnings
import logging
warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from ads.dataset.factory import DatasetFactory
from ads.dataset.dataset_browser import DatasetBrowser


<a id='load'></a>
## Load The Dataset

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party
Content and are not considered Materials under Your agreement with Oracle
applicable to the Services.  You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING). 
Dataset `iris` is distributed under a BSD license. 
</font>

In [None]:
sklearn = DatasetBrowser.sklearn()
iris_ds = sklearn.open('iris').set_target("target")

Let us see what this dataset looks like

In [None]:
iris_ds.show_in_notebook()

And its shape

In [None]:
iris_ds.shape

<a id='clean'></a>
## Auto Clean the Data 

We can automatically transform the data using one of ADS's auto-transform tools. First, let's look at `suggest_recommendations`, a built-in method of any `ADSDataset`. It shows every issue detected in the dataset and recommends a change to apply. Applying a change is as easy as selecting it from the drop-down menu. Once we have applied all of the changes we like, we can retrieve the transformed dataset by calling `get_transformed_dataset`.

In [None]:
iris_ds.suggest_recommendations()

In this case, there are no recommendations to apply, so we can continue.

<a id='rowops'></a>
## Row Operations

Next, let's go through several examples of row operations. Any operation you can apply to a `Pandas DataFrame`, you can also apply to an `ADSDataset`. We will walk through a few in this section.

<a id='delrow'></a>
### Delete a Row

Deleting a row, similar to filtering a row, can be done using the `.loc[]` indexer. We will demonstrate how to delete the first two rows over the next couple of cells. First we check what the first 5 rows look like using `.head()`, then we select all but the first two. Because `ADSDataset` objects are immutable, we never really add or delete rows; we only create new datasets with different rows.

In [None]:
iris_ds.head()

In [None]:
iris_minus_2 = iris_ds.df.loc[2:150]
iris_minus_2.head()

#### Reset the Index
We can reset the index to start from 0 after our various row operations with a simple call to `.reset_index()`.

In [None]:
iris_minus_2 = iris_minus_2.reset_index()
iris_minus_2.head()
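
The same two steps can be sketched in plain Pandas. This is a minimal illustration on a small hand-made DataFrame (the values and the name `df` are made up for the example, not taken from the iris data). It assumes the Pandas behavior carries over to the DataFrame held by an `ADSDataset`.

```python
import pandas as pd

# Hypothetical stand-in data: five rows with the default 0..4 index.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "target": ["setosa"] * 5,
})

# "Delete" the first two rows by selecting everything from label 2 onward.
# .loc slices by label, and the original frame is left untouched.
trimmed = df.loc[2:]

# drop=True discards the old index; without it, reset_index() keeps the
# old labels in a new column named "index".
trimmed_reset = trimmed.reset_index(drop=True)

print(list(trimmed.index))
print(list(trimmed_reset.index))
```

Passing `drop=True` is a design choice: keep the old labels if you need to trace rows back to the original dataset, and drop them otherwise.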

<a id='filterrow'></a>
### Filter By Row

Filtering by row is easy, and there are several ways to do it. In the following cell, we filter the rows so that we keep only those whose target is 'setosa'. We then collect the mean and variance in a DataFrame for readability.

In [None]:
iris_setosa = iris_ds[iris_ds['target']=='setosa']
pd.DataFrame([iris_setosa.mean(), iris_setosa.var()], index=['mean', 'var'])
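
As a rough Pandas-only sketch of the same idea, the following example filters a small hand-made DataFrame and summarizes the numeric columns. The data is hypothetical. Passing `numeric_only=True` is an assumption added here so that the string `target` column is skipped, since some Pandas versions raise an error when asked for the mean of a text column.

```python
import pandas as pd

# Hypothetical stand-in data with two classes.
df = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
    "target": ["setosa", "setosa", "virginica", "virginica"],
})

# A boolean mask keeps only the rows where the condition holds.
setosa = df[df["target"] == "setosa"]

# Stack the mean and (sample) variance into one small summary table.
summary = pd.DataFrame(
    [setosa.mean(numeric_only=True), setosa.var(numeric_only=True)],
    index=["mean", "var"],
)
print(summary)
```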

<a id='deldup'></a>
### Remove Duplicate Rows

Duplicate rows slow down model training without adding any information, so we should remove them. We can call the `drop_duplicates` function to return a dataset with all of the duplicates removed.

In [None]:
iris_ds.shape

In [None]:
iris_without_dup = iris_ds.drop_duplicates()
iris_without_dup.shape

In the previous cell, the shape changed by 1 row. To check that this is correct, we can use the `duplicated` function to count the duplicate rows in our dataset. The following cell confirms that there was only 1 duplicate row.

In [None]:
sum(iris_ds.to_pandas_dataframe().duplicated())
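
The consistency check described above can be sketched in plain Pandas on a tiny hand-made DataFrame (hypothetical data): the number of rows flagged by `duplicated()` should equal the number of rows removed by `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical data: the third row repeats the second one exactly.
df = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates()

print(n_duplicates, len(df) - len(deduped))
```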

<a id='colops'></a>
## Column Operations

Next, let's go through several examples of column operations. Any operation you can apply to a `Pandas DataFrame`, you can also apply to an `ADSDataset`. We will walk through a few in this section.

<a id='delcol'></a>
### Delete a Column

Deleting a column can be done using the `drop_columns` method. We can pass in a list of all of the columns we want to delete, and the method will return a dataset without those columns. We demonstrate this in the following cell, where we drop all petal data from our dataset.

In [None]:
iris_sepal = iris_ds.drop_columns(['petal_width_(cm)', 'petal_length_(cm)'])
iris_sepal.head()

<a id='addcol'></a>
### Adding a Column
We can add columns using the `assign_column` method. `assign_column` takes anything array-like (Pandas, numpy, dask, etc.), a dictionary, or a function to build the new column. In the following cell, we will create a column of differences between petal and sepal lengths. We will assign this column into the dataset with the name 'petal_minus_sepal', and display the first 5 rows of our new dataset.

In [None]:
petal_minus_sepal_col = (iris_ds['petal_length_(cm)'] - iris_ds['sepal_length_(cm)'])
iris_petal_minus_sepal = iris_ds.assign_column('petal_minus_sepal', petal_minus_sepal_col)
iris_petal_minus_sepal.head()

<a id='filtercol'></a>
### Filter by Column
We can also filter by column in all of the same ways we can in Pandas. Rather than deleting columns, we can create a new dataset with only the columns we require. In the following cell, we will build a dataset with only 2 of the original columns: petal and sepal lengths. Then we will print out the first 5 rows to double-check it worked.

In [None]:
iris_filtered = iris_ds[['petal_length_(cm)', 'sepal_length_(cm)']]
iris_filtered.head()

We can also filter based on value ranges. In the following cell, we keep only the rows of our filtered dataset where the petal length is greater than 6 and the sepal length is less than 7.5. Finally, we display the first of the remaining rows.

In [None]:
iris_filtered = iris_filtered[iris_ds['petal_length_(cm)'] > 6][iris_ds['sepal_length_(cm)'] < 7.5]
iris_filtered.head()
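
An alternative to chaining two boolean selections is to combine both conditions into a single mask with `&`. The sketch below shows this in plain Pandas on hypothetical values, and it assumes the same indexing semantics apply to the dataset object.

```python
import pandas as pd

# Hypothetical measurements, not taken from the iris data.
df = pd.DataFrame({
    "petal_length": [6.1, 6.7, 5.9, 6.4],
    "sepal_length": [7.2, 7.7, 7.1, 7.4],
})

# Each comparison needs its own parentheses because & binds more tightly
# than > and <.
mask = (df["petal_length"] > 6) & (df["sepal_length"] < 7.5)
selected = df[mask]

print(int(mask.sum()), "rows match both conditions")
```

A single mask is built against one frame, which avoids the index realignment that can happen when the second condition in a chain refers to the unfiltered data.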

<a id='rename'></a>
### Rename Columns
We often want to rename columns. Maybe we have columns with duplicate names, or maybe (as in this case) the column names are just bulky. We would like to drop the "\_(cm)" from our column names. We can do this using `rename_columns`, passing in a dictionary that maps each old name to the new one. We demonstrate this in the following cells.

In [None]:
iris_ds.columns

In [None]:
iris_renamed = iris_ds.rename_columns({'sepal_length_(cm)': 'sepal_length', 
                                    'sepal_width_(cm)': 'sepal_width', 
                                    'petal_length_(cm)': 'petal_length', 
                                    'petal_width_(cm)': 'petal_width'})

In [None]:
iris_renamed.columns

<a id='convert'></a>
### Convert a column to a different data type
For various reasons, we might like to recast our data as a different type. Over the following cells, we will walk through casting columns using `astype`, and the various requirements for each.

Starting off, we can see that `sepal_length_(cm)` is typed as a continuous variable because it holds floating point numbers. However, if we would like to make it categorical, we can do that easily using `astype`.

In [None]:
iris_ds.feature_types['sepal_length_(cm)']

In [None]:
iris_string_features = iris_ds.astype({'sepal_length_(cm)': 'categorical'})
iris_string_features.feature_types['sepal_length_(cm)']

Maybe the data isn't truly categorical but is actually ordinal, which here means that it consists of positive integers. If we first convert this column to positive integers (by multiplying by 10), we can assign the column to be ordinal.

In [None]:
iris_ordinal = iris_ds.assign_column('sepal_length_(cm)', lambda x: x*10)
iris_ordinal = iris_ordinal.astype({'sepal_length_(cm)':'ordinal'})
iris_ordinal.feature_types['sepal_length_(cm)']

Finally, maybe ordinal isn't exactly right, because our column is actually timeseries data. We can type cast that as well, as demonstrated in the following cell:

In [None]:
iris_datetime = iris_ds.astype({'sepal_length_(cm)': 'datetime'})
iris_datetime.feature_types['sepal_length_(cm)']['type']

In [None]:
iris_datetime.head()

<a id='norm'></a>
### Normalize a Column
To demonstrate applying our own functions to columns in a dataset, we will show a potential route to making a min-max normalization column for sepal_length. First we gather our min and max values, then we create our normalized column, and finally we can assign this column to our new dataset.

In [None]:
sepal_length_max = iris_ds['sepal_length_(cm)'].max()
sepal_length_min = iris_ds['sepal_length_(cm)'].min()
sepal_length_range = sepal_length_max - sepal_length_min
sepal_length_norm = (iris_ds['sepal_length_(cm)'] - sepal_length_min) / sepal_length_range
iris_norm = iris_ds.assign_column('sepal_length_norm', sepal_length_norm)
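
The min-max steps above can be wrapped in a small reusable function. The helper name `min_max_normalize` is hypothetical, and the handling of a constant column (returning zeros) is an assumption made for this sketch rather than something the notebook specifies.

```python
import pandas as pd

def min_max_normalize(col: pd.Series) -> pd.Series:
    """Rescale a numeric column linearly so its values fall in [0, 1]."""
    col_min = col.min()
    col_range = col.max() - col_min
    if col_range == 0:
        # A constant column has no spread; return zeros instead of dividing by 0.
        return pd.Series(0.0, index=col.index)
    return (col - col_min) / col_range

# Hypothetical values chosen so the result is easy to check by hand.
lengths = pd.Series([4.0, 5.0, 8.0])
normalized = min_max_normalize(lengths)
print(normalized.tolist())
```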

<a id='strops'></a>
### Operations on String Columns
Strings often require specific attention, and therefore the purpose of this section is to explore some of the ways to manipulate strings. As always, all Pandas functions are valid on any `ADSDataset` object. 

We can use `value_counts` to get the class label and frequency for a specific column in our dataset.

In [None]:
iris_ds['target'].value_counts()

We might want to shorten the labels in our dataset. We can use the `apply` function with a lambda to achieve this:

In [None]:
iris_ds['target'].apply(lambda x: str(x)[:-5]).value_counts()

Or, if we have more specific names in mind, we can use a dictionary:

In [None]:
iris_target_label_map = {'virginica': 'vi', 'versicolor': 've', 'setosa': 's'}
iris_ds['target'].apply(lambda x: iris_target_label_map[x]).value_counts()

Finally, we can apply built-in string methods, such as converting all strings to upper or lower case:

In [None]:
iris_ds['target'].str.upper().value_counts()
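
For the dictionary-based relabeling, Pandas also offers `Series.map`, which accepts the dictionary directly. The sketch below uses a small hypothetical Series of labels. One difference from the lambda approach is that a label missing from the dictionary becomes `NaN` instead of raising a `KeyError`.

```python
import pandas as pd

# Hypothetical label column.
labels = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

label_map = {"virginica": "vi", "versicolor": "ve", "setosa": "s"}

# map() looks each value up in the dictionary.
short_labels = labels.map(label_map)

# Vectorized string methods live under the .str accessor.
upper_labels = labels.str.upper()

print(short_labels.tolist())
print(upper_labels.tolist())
```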

<a id='dsops'></a>
## General Dataset Manipulation Functions

Our last main section will cover functions that we can apply to entire datasets. 

<a id='catenc'></a>
### Categorical Encoding
ADS has a built-in categorical encoder. We can access it directly using the import in the following cell. Simply pass in our dataset as a Pandas DataFrame, and it will be automatically encoded. We demonstrate this in the following cell and use the `value_counts` function as verification.

In [None]:
from ads.dataset.label_encoder import DataFrameLabelEncoder
iris_encoded = DataFrameLabelEncoder().fit_transform(iris_ds.to_pandas_dataframe())
iris_encoded['target'].value_counts()

In [None]:
iris_encoded.sample(frac=.034)
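
To illustrate what label encoding does in general, here is a plain Pandas sketch using the `category` dtype. This is not the implementation behind `DataFrameLabelEncoder`, and the integer codes it assigns may differ from the ones ADS produces. It only shows the idea of replacing each distinct string with an integer.

```python
import pandas as pd

# Hypothetical label column.
labels = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

# Converting to the category dtype assigns one integer code per distinct
# value; for plain strings the categories are sorted alphabetically.
as_category = labels.astype("category")
codes = as_category.cat.codes

# Keep the code-to-label mapping so the encoding can be reversed later.
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```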


<a id='onehotenc'></a>
### One-hot Encoding
One-hot encoding transforms one categorical column with "n" categories into "n" or "n-1" columns with indicator variables. Let's prepare one of the columns to be categorical with the categories short, medium, and long.

In [None]:
def convert_to_level(value):
    if value < 5:
        return 'short'
    elif value > 6:
        return 'long'
    else:
        return 'medium'

iris_copy = iris_ds
iris_copy = iris_copy.assign_column('sepal_length_(cm)', convert_to_level)
iris_copy

You can use the Pandas method `get_dummies()` to perform one-hot encoding on a column. Use the `prefix` parameter to assign a prefix to the new columns that contain the indicator variables. Here is an example on how to create "n" columns with one-hot encoding:

In [None]:
data = iris_copy.to_pandas_dataframe()['sepal_length_(cm)'] # data of which to get dummy indicators
onehot = pd.get_dummies(data, prefix='sepal')
onehot

To create "n-1" columns, use `drop_first=True` when converting the categorical column.

In [None]:
data = iris_copy.to_pandas_dataframe()['sepal_length_(cm)'] # data of which to get dummy indicators
onehot = pd.get_dummies(data, prefix='sepal', drop_first=True)
onehot

Add the one-hot columns to the initial dataset with the `merge()` method and drop the categorical column that you transformed into one-hot columns:

In [None]:
iris_onehot = iris_copy.merge(onehot, left_index=True, right_index=True).drop_columns('sepal_length_(cm)')
iris_onehot['sepal_short'].value_counts()
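
The whole one-hot workflow can be sketched end to end in plain Pandas and NumPy. The data is hypothetical, and `np.select` is used here as a vectorized stand-in for the `convert_to_level` function with the same thresholds (below 5 is short, above 6 is long, everything else is medium).

```python
import numpy as np
import pandas as pd

# Hypothetical sepal lengths covering all three levels.
sepal_length = pd.Series([4.6, 5.0, 5.8, 6.0, 6.7])
values = sepal_length.to_numpy()

# Conditions are checked in order; rows matching none get the default.
levels = pd.Series(
    np.select([values < 5, values > 6], ["short", "long"], default="medium"),
    index=sepal_length.index,
)

# "n" indicator columns, one per category.
onehot_n = pd.get_dummies(levels, prefix="sepal")

# "n-1" columns: the first category (alphabetically) is dropped.
onehot_n_minus_1 = pd.get_dummies(levels, prefix="sepal", drop_first=True)

print(list(onehot_n.columns))
print(list(onehot_n_minus_1.columns))
```

With "n-1" columns, the dropped category is implied whenever all remaining indicators are zero, which avoids perfectly collinear features in linear models.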

<a id='null'></a>
### Get Null Values from Dataset
We might wish to detect all nulls in our dataset. We can simply use the `isnull` function to return a boolean dataset matching the dimensions of our input. In the following cell we will demonstrate this, and then check whether the dataset contains any nulls at all.

In [None]:
iris_null = iris_ds.isnull()
np.any(iris_null)
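
Beyond a single yes-or-no answer, the boolean result of `isnull` can be summarized per column or per row. The following plain Pandas sketch uses a small hypothetical DataFrame with two missing values.

```python
import pandas as pd

# Hypothetical data with one null in each numeric column.
df = pd.DataFrame({
    "sepal_length": [5.1, None, 4.7],
    "petal_length": [1.4, 1.4, None],
    "target": ["setosa", "setosa", "setosa"],
})

null_mask = df.isnull()

# Number of nulls in each column.
nulls_per_column = null_mask.sum()

# True for every row that contains at least one null.
rows_with_null = null_mask.any(axis=1)

# A single flag for the whole dataset.
has_any_null = bool(null_mask.values.any())

print(nulls_per_column.to_dict())
print(rows_with_null.tolist())
```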

<a id='impute'></a>
### Imputation
If we find data that is null, imputation is easy. We already know how to remove rows from prior cells in this notebook, so in this section we will focus on the `fillna` function. In the following cells we first insert nulls into a dataset, then use `fillna` to fill them. Lastly, we print out the first 5 rows again to verify that our imputation was successful.


In [None]:
iris_with_null = iris_ds.assign_column("sepal_length_(cm)", lambda x: None if x < 5 else x)
iris_null1 = iris_with_null.isnull()
np.any(iris_null1)

In [None]:
iris_with_null.head()

In [None]:
iris_impute = iris_with_null.fillna(method='pad')
iris_impute.head()

In [None]:
iris_null2 = iris_impute.isnull()
np.any(iris_null2)
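
The `'pad'` method is a forward fill: each null takes the last non-null value that precedes it. The sketch below shows this behavior in plain Pandas with `ffill()` on hypothetical values, including the edge case of a null with nothing before it.

```python
import pandas as pd

# Hypothetical column with gaps in the middle.
lengths = pd.Series([5.1, None, 5.4, None, None, 5.0])

# Forward fill: every gap is filled from the nearest earlier value.
filled = lengths.ffill()

# A null in the first position has no earlier value, so it stays null.
leading_gap = pd.Series([None, 5.0]).ffill()

print(filled.tolist())
print(leading_gap.isnull().tolist())
```

Forward fill suits ordered data such as time series. For unordered rows like these measurements, filling with a column statistic such as the median may be a more defensible choice.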

<a id='merge'></a>
### Merging Datasets
We may wish to merge datasets for a variety of reasons. In the following cell we will merge two datasets: one containing all of the 'setosa' data, and the other containing all of the 'virginica' data.

In [None]:
iris_merged1 = iris_ds[iris_ds['target'] == 'setosa'].merge(iris_ds[iris_ds['target'] == 'virginica'], how='outer')
iris_merged1['target'].value_counts()
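
When two datasets share all of their columns and have no rows in common, an outer merge ends up containing the rows of both. The plain Pandas sketch below shows this on hypothetical data and compares it with `pd.concat`, which stacks the frames directly.

```python
import pandas as pd

# Two hypothetical subsets with identical columns and no shared rows.
setosa = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "target": ["setosa", "setosa"],
})
virginica = pd.DataFrame({
    "sepal_length": [6.3, 5.8],
    "target": ["virginica", "virginica"],
})

# With no "on" argument, merge joins on every column the frames share.
merged = setosa.merge(virginica, how="outer")

# concat appends the rows without any key matching.
stacked = pd.concat([setosa, virginica], ignore_index=True)

print(merged["target"].value_counts().to_dict())
print(len(merged), len(stacked))
```

For simply appending rows, `pd.concat` states the intent more directly, while `merge` is the better fit when rows must be matched on key columns.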

## Moving Forward

Dataset Factory is a powerful tool that applies to a wide range of data science problems. It takes only a few lines of code to perform sophisticated data cleaning and transformation.

<a id='ref'></a>
## References
 - <a href="https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html">Oracle ADS</a>