<font color=gray>ADS Sample Notebook.

Copyright (c) 2019, 2021 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font>Data Visualization with ADS SDK </font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Service Team </font></p>

***

## Overview of this Notebook

Data visualization is an important component of data exploration and data analysis in modern data science practices. An efficient and flexible data visualization tool can provide more insight about the data for data scientists.

This notebook provides an overview of the data visualizations that you can perform with ADS. It focuses on smart data visualization technology that uses the column types and other settings to automatically create an intuitive plot for your data.

---

## Objectives:
By the end of this tutorial, you will know how to:
 - <a href='#setup'>0. Setup</a> the required packages.
 - <a href='#data'>1. Source the Dataset</a> from a host of filesystems and formats.
     - <a href='#sinb'>1.1 Visualize the Dataset Overall</a>: auto-generate the most popular plots for your data types
 - <a href='#explore'>2. Dataset Exploration using Visualization</a> to interpret and internalize the data.
 - <a href='#custom'>3. Custom Plotting Examples </a> Using `ADSDataset`'s built in methods
     - <a href='#lambda'>3.1 Using Lambdas to Plot </a>
     - <a href='#3d'>3.2 Rendering a 3D Plot </a> 
     - <a href='#pair'>3.3 Using Seaborn's PairPlot Function </a>
     - <a href='#mat'>3.4 Using Matplotlib Functions </a>
     - <a href='#pie'>3.5 Pie Chart </a>
     - <a href='#gis'>3.6 GIS Plot </a> 
 - <a href='#ref'>4. References </a> 
 ***

 <a id='setup'></a>
## 0. Setup
Import the necessary packages:

In [None]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
from numpy.random import randn

from sklearn.utils import Bunch
from sklearn.datasets import load_iris

from ads.dataset.factory import DatasetFactory




<a id='data'></a>
## 1. Source the Dataset

<font color=gray> (You can load from: a local or network file system, Hadoop Distributed File System, Amazon S3, Google Cloud Service, Pandas, Dask, or H2O. And in any of the following formats: CSV, TSV, Parquet, libsvm, json, Excel, HDF5, SQL, xml, apache server log files (clf, log) and arff.)</font>

We're working with the Oracle Classification Dataset, which has a set of features and a binary (`1`/`0`) target named `class`.

The `oracle_classification_dataset1_150K.csv` file is stored here on Oracle ArtiFactory, but the source could be any number of locations, such as Oracle Storage, HDFS, or Git. The format and additional options are inferred, but many options are available to control how `open` works. `open` can also convert any local Pandas DataFrame to a Dataset.

The data is sampled down to 1500 rows and 21 columns. The columns describe different attributes of each row.

<font color=gray> If you don't yet know the target in your project, you can explore the data first and set the target later.</font>

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party
Content and are not considered Materials under Your agreement with Oracle
applicable to the Services.  You can access the `oracle_classification_dataset1` dataset license [here](oracle_data/UPL.txt). 
Dataset `oracle_classification_dataset1` is distributed under UPL license. 
</font>

In [None]:
ds_preview = DatasetFactory.open("/opt/notebooks/ads-examples/oracle_data/oracle_classification_dataset1_150K.csv", target="class")
ds_preview = ds_preview[['class', 'col01', 'col02', 'col03', 'col04', 'col05', 'col06', 'col07',
       'col08', 'col09', 'col010', 'col011', 'col012', 'col013', 'col014',
       'col015', 'col016', 'col017', 'col018', 'col019', 'col020']].sample(frac=0.01)

 <a id='explore'></a>
## 2. Dataset Exploration using Visualization

### Plot target distribution

Let's take a look at the distribution of the target column. 

In [None]:
ds_preview.target.show_in_notebook()

In the above cell, the target column `class` is categorical, so the smart data visualization tool selected a count plot. The plot shows that class 1 occurs more often than class 0.

### Plot a distribution for a set of features vs target variable

Next, we plot a set of features against the target by passing a list of feature names to the `feature_names` parameter of the `show_in_notebook` method.

In [None]:
ds_preview.target.show_in_notebook(
    feature_names=["col01", "col02", "col03", "col09"])

The above cell demonstrates that given different types of features, the ADS SDK selected different plotting methods. When plotting `col01` (a continuous variable) against `class` (a categorical variable) a family of PDF curves was the most appropriate plot. Meanwhile, when plotting `col02` against `class`, in which both are categorical variables, a count plot was created.

### Automatic plotting between features using ADS SDK

The `plot()` method is an automatic plotting method. Users pass in a variable for the x axis and, optionally, a variable for the y axis, and then call the `show_in_notebook()` method to render the plot. Here are some examples using the Oracle classification synthetic dataset:

In [None]:
ds_preview.plot("col02").show_in_notebook(figsize=(4,4))

In the above cell, we passed only the x variable `col02`, which is categorical, so the automatic plotting routine used a count plot, a simple and straightforward visualization.

In [None]:
ds_preview.plot("col02", y="col01").show_in_notebook(figsize=(4,4))

In the above example, we plot `col02` against `col01`. One is a categorical feature and the other is a continuous feature, so the plotting routine selected a violin plot.

In [None]:
ds_preview.plot("col01").show_in_notebook(figsize=(4,4))

The automatic plotting routine used a histogram to plot `col01` as it was a continuous variable.
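The choices seen so far follow a pattern keyed on feature types. As a rough illustration only, the sketch below encodes that pattern as a lookup table. The function `choose_plot`, the table `PLOT_RULES`, and the type labels are hypothetical and do not reflect how ADS implements its selection; only the plot names come from the examples above.

```python
# Hypothetical sketch: map feature types to the plot kinds seen in this notebook.
# This is NOT the ADS implementation; choose_plot and PLOT_RULES are illustrative.
PLOT_RULES = {
    ("categorical", None): "count plot",
    ("continuous", None): "histogram",
    ("categorical", "continuous"): "violin plot",
    ("categorical", "categorical"): "count plot",
}


def choose_plot(x_type, y_type=None):
    """Return the plot kind for a pair of feature types, or None if no rule is known."""
    return PLOT_RULES.get((x_type, y_type))


print(choose_plot("categorical"))                # count plot
print(choose_plot("categorical", "continuous"))  # violin plot
print(choose_plot("continuous"))                 # histogram
```

A table like this makes it explicit that the plot kind depends only on the types of the inputs, not on their values.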

In [None]:
ds_preview.plot("col01", y="col03").show_in_notebook()

When plotting `col01` against `col03`, which are both continuous features, the ADS SDK used a Gaussian heatmap to visualize the data. It generates a scatter plot and assigns a color to each data point based on the local density (Gaussian kernel) of points.
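To make the idea of density-based coloring concrete, here is a minimal sketch that scores each point by summing a Gaussian kernel over all points. The notebook does not show how ADS computes the density, so the function `gaussian_density`, the `bandwidth` parameter, and the sample points are all assumptions made for illustration.

```python
import math


def gaussian_density(points, bandwidth=1.0):
    """Return an unnormalized Gaussian kernel density score for each (x, y) point."""
    densities = []
    for xi, yi in points:
        total = 0.0
        for xj, yj in points:
            sq_dist = (xi - xj) ** 2 + (yi - yj) ** 2
            # Nearby points contribute close to 1, distant points close to 0.
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / len(points))
    return densities


# Three points in a tight cluster and one isolated point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
density = gaussian_density(points)
print(density)
```

The clustered points receive a higher score than the isolated one, so mapping the scores to a color scale would highlight the dense region of a scatter plot.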

As these examples show, the ADS SDK picks a plot type based on the data types involved. Here are some showcase examples using the Oracle traffic timeseries dataset.

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party
Content and are not considered Materials under Your agreement with Oracle
applicable to the Services.  You can access the `oracle_traffic_timeseries_dataset1` dataset license [here](oracle_data/UPL.txt). 
Dataset `oracle_traffic_timeseries_dataset1` is distributed under UPL license. 
</font>

In [None]:
oracle_traffic_timeseries = DatasetFactory.open("/opt/notebooks/ads-examples/oracle_data/oracle_traffic_timeseries_dataset1.csv")

In [None]:
oracle_traffic_timeseries.head()

Plotting `date` against `cloud_coverage` yields a scatter plot that shows how the value of the ordinal variable `cloud_coverage` changes across different years.

In [None]:
oracle_traffic_timeseries.plot("weather", y="cloud_coverage").show_in_notebook(figsize=(4,4))

By plotting `weather` against `cloud_coverage`, we can see how often each kind of weather occurred at each level of cloud coverage.

 <a id='custom'></a>
## 3. Custom Plotting
The `call()` method gives users a more flexible way to plot with their preferred plotting libraries or packages.
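The examples in this section all pass a function plus keyword arguments to `call()`, and the function receives the data as its first argument. The sketch below mimics that calling convention with a hypothetical stand-in class. `TinyDataset` and `column_range` are invented for illustration and are not part of ADS; the real method may do more than this.

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset wrapper; not the ADS class."""

    def __init__(self, data):
        self.data = data

    def call(self, func, **kwargs):
        # Pass the wrapped data as the first argument, then any keyword arguments.
        return func(self.data, **kwargs)


def column_range(data, column):
    """Example user function: return (min, max) of one column."""
    values = data[column]
    return min(values), max(values)


# Made-up values; only the column names echo the traffic dataset.
ds_sketch = TinyDataset({"sensor4": [3, 9, 4], "cloud_coverage": [1, 0, 2]})
result = ds_sketch.call(column_range, column="sensor4")
print(result)  # (3, 9)
```

Because the user function is an ordinary callable, anything that accepts the data and some keyword arguments can be plugged in, including a lambda.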

<a id='lambda'></a>
### Using Lambdas to Plot

Here is an example of a simple matplotlib scatter plot, with a lambda as the custom function.

In [None]:
oracle_traffic_timeseries.call(lambda df, x,y: plt.scatter(df[x], df[y]), x='cloud_coverage', y='sensor4')

<a id='3d'></a>
### Rendering a 3D Plot

Here we showcase 3D plotting with the `iris` dataset.

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party
Content and are not considered Materials under Your agreement with Oracle
applicable to the Services. You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).  
</font>

In [None]:
%matplotlib inline

#load iris dataset
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)

In [None]:
def my_3d_plot(df, figsize=None):
    plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=1, hspace=1)
    plt.style.use('seaborn-white')

    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection='3d')

    ax.scatter(df['sepal_length_(cm)'], df['sepal_width_(cm)'], df['petal_length_(cm)'])

    ax.set_xlabel('sepal length')
    ax.set_ylabel('sepal width')
    ax.set_zlabel('petal length')

In [None]:
ds = DatasetFactory.from_dataframe(iris_df)
ds.call(my_3d_plot, figsize=(10,10));

<a id='pair'></a>
### Using Seaborn's `pairplot` function 

In this cell we show how the dataframe is passed directly to the Seaborn `pairplot` function, which plots pairwise relationships in the dataset. This function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently: each one draws a plot showing the univariate distribution of the variable in that column.
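The grid layout described above can be sketched without any plotting at all. The helper below only builds text labels for each cell of the grid; `pair_grid_layout` and the shortened variable names are hypothetical and are not part of Seaborn.

```python
def pair_grid_layout(variables):
    """Return a grid of labels describing what each cell of a pair plot shows."""
    grid = []
    for row_var in variables:      # every cell in a row shares this y-axis variable
        row = []
        for col_var in variables:  # every cell in a column shares this x-axis variable
            if row_var == col_var:
                # Diagonal cells show a single variable's distribution.
                row.append(f"distribution of {col_var}")
            else:
                row.append(f"{row_var} vs {col_var}")
        grid.append(row)
    return grid


layout = pair_grid_layout(["sepal length", "sepal width", "petal length"])
for row in layout:
    print(row)
```

With three variables this yields a 3 by 3 grid, where off-diagonal labels read as "y variable vs x variable".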


In [None]:
sns.set(style="ticks", color_codes=True)
DatasetFactory.from_dataframe(iris_df).call(lambda df: sns.pairplot(df.dropna()))

<a id='mat'></a>
### Using any Matplotlib Function

In [None]:
df = pd.DataFrame(randn(1000, 4), columns=list('ABCD'))

def ts_plot(df, figsize):
    # Index the rows by date so that the x-axis is a time axis
    df = df.set_index(pd.date_range('1/1/2000', periods=len(df)))
    df = df.cumsum()
    # DataFrame.plot creates its own figure, so no plt.figure() call is needed
    df.plot(figsize=figsize)
    plt.legend(loc='best')
    
ds = DatasetFactory.from_dataframe(df, target='A')
ds.call(ts_plot, figsize=(7,7))

<a id='pie'></a>
### Pie Chart Example

In this example, we build a customized pie chart and render it through a dataset created with `DatasetFactory`.

In [None]:
data = {'data': [1109, 696, 353, 192, 168, 86, 74, 65, 53]}
df = pd.DataFrame(data, index = ['20-50 km', '50-75 km', '10-20 km', '75-100 km', '3-5 km', '7-10 km', '5-7 km', '>100 km', '2-3 km'])


explode = (0, 0, 0, 0.1, 0.1, 0.2, 0.3, 0.4, 0.6)
colors = ['#191970', '#001CF0', '#0038E2', '#0055D4', '#0071C6', '#008DB8', '#00AAAA',
          '#00C69C', '#00E28E', '#00FF80', ]

def pie_plot(df, figsize):
    # Draw the pie chart at the requested figure size
    df["data"].plot(kind='pie', figsize=figsize, fontsize=17, colors=colors, explode=explode)
    plt.axis('equal')
    plt.ylabel('')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()


ds = DatasetFactory.from_dataframe(df)
ds.call(pie_plot, figsize=(7,7))

<a id='gis'></a>
### GIS Plot Example

Here are some examples where you can visualize geographical data using ADS SDK visualizations.

For this example the dataset used is the California earthquake data retrieved from the USGS earthquake catalog. 

<font color=gray>Datasets are provided as a convenience.  Datasets are considered Third Party
Content and are not considered Materials under Your agreement with Oracle
applicable to the Services. The `earthquake` dataset is public domain coming from the United States Geological Survey (USGS) Earthquake Hazards program. Public Domain License [here](https://creativecommons.org/publicdomain/zero/1.0/).
</font>

In [None]:
earthquake = DatasetFactory.open("/opt/notebooks/ads-examples/3P_data/earthquake_01.csv", target="depth")

Here is a brief overview that visualizes the major places where earthquakes happened.

In [None]:
earthquake.plot_gis_scatter(lon="longitude", lat="latitude")

In [None]:
earthquake.head()

Here we do minor transformations using Pandas so that our earthquake dataset contains the column `location`, which has the format `"(latitude, longitude)"`.
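To see what that format looks like, the sketch below pairs latitudes with longitudes using plain Python lists and renders each pair as a string. The coordinates are made up for illustration and are not taken from the earthquake dataset.

```python
latitudes = [36.5, 34.05]
longitudes = [-121.0, -118.25]  # made-up coordinates for illustration

# Pair each latitude with its longitude, then render each tuple as a string.
locations = [str(pair) for pair in zip(latitudes, longitudes)]
print(locations)  # ['(36.5, -121.0)', '(34.05, -118.25)']
```

The transformation in this notebook follows the same idea on DataFrame columns: `zip` builds the tuples and the string conversion produces the final `location` text.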

In [None]:
df=earthquake.compute()
earthquake_df=df.assign(location=[*zip(df.latitude, df.longitude)]).astype(str)

Now we can plot the column `location` using `.plot`, which outputs an interactive map that gives you the flexibility to zoom in/out, identify outliers/inliers etc. 

In [None]:
earthquake_02 = DatasetFactory.open(earthquake_df)
earthquake_02.plot("location").show_in_notebook()