In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2019, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---
# <font color="red">Visualizing Data</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal>Oracle Cloud Infrastructure Data Science Service Team</font></p>

---
    
# Overview:

Data visualization is an important component of data exploration and data analysis in modern data science practices. An efficient and flexible data visualization tool can provide more insight into the data for data scientists.

This notebook provides an overview of the data visualizations that you can perform with ADS. It focuses on smart data visualization technology that uses the column types and other settings to automatically create an intuitive plot for your data.

Compatible with: [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for CPU on Python 3.8 (version 1.0)

---
    
## Contents:

- <a href='#data'>Dataset</a>
- <a href='#eda'>Exploratory Data Analysis</a>
    - <a href='#eda_target'>Plot Target Distribution</a>
    - <a href='#eda_feature'>Plot Feature Distributions</a>
    - <a href='#eda_target'>Automatic Feature Plotting</a>
- <a href='#custom'>Custom Plotting</a>
    - <a href='#custom_lambda'>Plotting with Lambdas</a>
    - <a href='#custom_3d'>3D Plots</a> 
    - <a href='#custom_pairplot'>Seaborn's `pairplot` Method</a>
    - <a href='#custom_matplotlib'>Matplotlib</a>
    - <a href='#custom_pie'>Pie Chart</a>
    - <a href='#custom_gis'>GIS Plot</a> 
- <a href='#ref'>References </a> 
 
---
 
Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.
    
You can access the `earthquake` dataset license [here](https://creativecommons.org/publicdomain/zero/1.0/).    

You can access the `iris` dataset license [here](https://github.com/scikit-learn/scikit-learn/blob/master/COPYING).  
    
You can access the `oracle_classification_dataset1` dataset license [here](https://oss.oracle.com/licenses/upl). 
    
You can access the `oracle_traffic_timeseries_dataset1.csv` dataset license [here](https://oss.oracle.com/licenses/upl). 

---


In [None]:
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

from ads.dataset.factory import DatasetFactory
from mpl_toolkits.mplot3d import Axes3D
from numpy.random import randn
from os.path import join
from sklearn.utils import Bunch
from sklearn.datasets import load_iris

warnings.filterwarnings("ignore")
logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.ERROR)

<a id='data'></a>
# Dataset

You are working with the Oracle Classification Dataset. This has a set of features and a binary (`1`/`0`) target called `class`.

The `oracle_classification_dataset1_150K.csv` file is stored here on Oracle ArtiFactory, but the source could be any of a number of locations, such as Oracle Storage, HDFS, or Git. The format and additional options are inferred; however, there are many options to control how the `.open()` method works. The `DatasetFactory` class can also convert any local Pandas DataFrame to a Dataset.

The data is downsampled to 1,500 rows and 21 columns. The columns describe the different attributes of each row.

If you don't yet know the target in your project, you can explore the data first and set the target later.

In this notebook, you will be working with a CSV file that is stored on the network. The `DatasetFactory` class allows you to load from both the local and network file system. You can read many different file formats such as CSV, TSV, Parquet, libsvm, JSON, Excel, HDF5, SQL, XML, Apache server log files (clf, log), and ARFF.

In [None]:
data_path = join(
    "/",
    "opt",
    "notebooks",
    "ads-examples",
    "oracle_data",
    "oracle_classification_dataset1_150K.csv",
)
# Load the CSV from its path; .from_dataframe() expects a DataFrame, not a path.
ds_preview = DatasetFactory.open(data_path, target="class")
ds_preview = ds_preview[
    [
        "class",
        "col01",
        "col02",
        "col03",
        "col04",
        "col05",
        "col06",
        "col07",
        "col08",
        "col09",
        "col010",
        "col011",
        "col012",
        "col013",
        "col014",
        "col015",
        "col016",
        "col017",
        "col018",
        "col019",
        "col020",
    ]
].sample(frac=0.01)
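
The downsampling step can be illustrated with plain Pandas. This is a minimal sketch that assumes a synthetic 150,000-row frame (the names `full` and `preview` and the generated values are hypothetical, not part of the Oracle dataset); it only shows how `frac=0.01` yields 1,500 rows and how `random_state` makes the sample repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 150,000-row CSV: a binary target and one feature.
rng = np.random.default_rng(0)
full = pd.DataFrame(
    {
        "class": rng.integers(0, 2, size=150_000),
        "col01": rng.normal(size=150_000),
    }
)

# frac=0.01 keeps 1% of the rows; random_state makes the draw repeatable.
preview = full.sample(frac=0.01, random_state=42)
print(preview.shape)
```

Without `random_state`, each run draws a different 1% of the rows, so plots may vary slightly between runs.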

<a id='eda'></a>
# Exploratory Data Analysis

<a id='eda_target'></a>
## Plot Target Distribution

Let's take a look at the distribution of the target column. 

In [None]:
ds_preview.target.show_in_notebook()

In the above cell, the target column `class` is a categorical value, so the smart data visualization tool selected a count plot. The plot shows that class 1 has more rows than class 0.
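
The numbers behind a count plot can be checked with plain Pandas. The sketch below uses a small hypothetical target series (the values are invented for illustration) and counts the rows per class, which is the quantity a count plot draws as bars.

```python
import pandas as pd

# Hypothetical binary target values, for illustration only.
target = pd.Series([1, 0, 1, 1, 0, 1], name="class")

# Count the rows per class, ordered by class label.
counts = target.value_counts().sort_index()
print(counts.to_dict())
```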

<a id='eda_feature'></a>
## Plot Feature Distributions

The next cell plots a set of features against the target by passing a list of feature names to the `feature_names` parameter of the `show_in_notebook` method.

In [None]:
ds_preview.target.show_in_notebook(feature_names=["col01", "col02", "col03", "col09"])

The above cell demonstrates that the ADS SDK selects different plotting methods for different types of features. When plotting `col01` (a continuous variable) against `class` (a categorical variable), a family of probability density function (PDF) curves was the most appropriate plot. When plotting `col02` against `class`, both of which are categorical variables, a count plot was created.

<a id='eda_target'></a>
## Automatic Feature Plotting

The `.plot()` method is an automatic plotting method. Pass in a variable for the x-axis and, optionally, a variable for the y-axis, and then call the `show_in_notebook()` method to render the plot. Here are some examples using the Oracle Classification Dataset.

In [None]:
ds_preview.plot("col02").show_in_notebook(figsize=(4, 4))

In the above cell, you passed only the x variable `col02`, which is a categorical variable, so ADS automatic plotting used a count plot, which is a simple and straightforward visualization.

In [None]:
ds_preview.plot("col02", y="col01").show_in_notebook(figsize=(4, 4))

In the above example, you plotted `col02` against `col01`. One is a categorical feature and the other is a continuous feature, so the plotting method selected is a violin plot.

In [None]:
ds_preview.plot("col01").show_in_notebook(figsize=(4, 4))

The automatic plotting routine used a histogram to plot `col01` as it was a continuous variable.

In [None]:
ds_preview.plot("col01", y="col03").show_in_notebook()

When plotting `col01` against `col03`, which are both continuous features, the ADS SDK used a Gaussian heatmap to visualize the data. It generates a scatter plot and colors each data point according to the local density of points, estimated with a Gaussian kernel.
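
To make the density coloring concrete, here is a minimal sketch of a density-colored scatter plot built with `scipy.stats.gaussian_kde` and Matplotlib. It is not the ADS implementation; the function name `density_scatter` and the synthetic data are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde


def density_scatter(x, y, ax=None):
    """Scatter x against y, coloring each point by its estimated local density."""
    xy = np.vstack([x, y])
    # Evaluate a Gaussian kernel density estimate at every data point.
    density = gaussian_kde(xy)(xy)
    # Draw the densest points last so they are not hidden by sparse ones.
    order = density.argsort()
    if ax is None:
        _, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(x[order], y[order], c=density[order], s=10)
    return ax, density


# Hypothetical correlated data standing in for two continuous columns.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
ax, density = density_scatter(x, y)
```

Sorting by density before drawing is a design choice: it keeps the high-density core visible when many points overlap.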

As these examples show, the ADS SDK analyzes the data and selects a plot type based on the data types of the variables. The remaining examples in this section use the Oracle traffic time series dataset.
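
The cases shown so far can be summarized as a small lookup. The function below is a hypothetical sketch, not the ADS selection logic: the name `choose_plot` and the kind labels `"categorical"` and `"continuous"` are assumptions, and the categorical-versus-categorical case is carried over from the count plot reported earlier for `col02` against `class`.

```python
def choose_plot(x_kind, y_kind=None):
    """Return the plot type this notebook reports for the given variable kinds.

    Kinds are "categorical" or "continuous". This lookup only summarizes the
    cases described in the notebook; it is not the ADS implementation.
    """
    allowed = {"categorical", "continuous"}
    if x_kind not in allowed or (y_kind is not None and y_kind not in allowed):
        raise ValueError("kinds must be 'categorical' or 'continuous'")
    if y_kind is None:
        # A single variable: bars for categories, bins for continuous values.
        return "count plot" if x_kind == "categorical" else "histogram"
    kinds = {x_kind, y_kind}
    if kinds == {"categorical"}:
        return "count plot"
    if kinds == {"continuous"}:
        return "Gaussian heatmap"
    # One categorical and one continuous variable.
    return "violin plot"


print(choose_plot("categorical", "continuous"))
```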

In [None]:
data_path = join(
    "/",
    "opt",
    "notebooks",
    "ads-examples",
    "oracle_data",
    "oracle_traffic_timeseries_dataset1.csv",
)
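# Load the CSV from its path; .from_dataframe() expects a DataFrame, not a path.
oracle_traffic_timeseries = DatasetFactory.open(data_path)
oracle_traffic_timeseries.plot("date", y="cloud_coverage").show_in_notebook(
    figsize=(4, 4)
)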

The above cell visualizes the relationship between `date` and `cloud_coverage` using a scatter plot. It shows how the value of the ordinal variable `cloud_coverage` changes across different years.

In [None]:
oracle_traffic_timeseries.plot("weather", y="cloud_coverage").show_in_notebook(
    figsize=(4, 4)
)

By plotting `weather` against `cloud_coverage`, you can visualize how often each kind of weather occurred at each level of cloud coverage.

<a id='custom'></a>
# Custom Plotting

The `.call()` method gives you a more flexible way to plot by letting you use your preferred plotting libraries and packages.

<a id='custom_lambda'></a>
## Plotting with Lambdas

Here is an example of a matplotlib scatter plot with the custom lambda function.

In [None]:
oracle_traffic_timeseries.call(
    lambda df, x, y: plt.scatter(df[x], df[y]), x="cloud_coverage", y="sensor4"
)
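
A named function can be easier to reuse and label than a lambda. The sketch below defines a hypothetical helper, `scatter_xy`, with the same `df, x, y` signature as the lambda above and tries it on a small invented DataFrame rather than the traffic dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_xy(df, x, y, figsize=(4, 4)):
    """Scatter column x against column y and label both axes."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.scatter(df[x], df[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return ax


# Hypothetical values; the real columns come from the traffic dataset.
toy = pd.DataFrame({"cloud_coverage": [0, 2, 4, 6, 8], "sensor4": [5, 7, 6, 9, 11]})
ax = scatter_xy(toy, x="cloud_coverage", y="sensor4")
```

Assuming `.call()` forwards keyword arguments to a named function as it does for the lambda, the same helper could be passed as `oracle_traffic_timeseries.call(scatter_xy, x="cloud_coverage", y="sensor4")`.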

<a id='custom_3d'></a>
## 3D Plots

This section showcases 3D plotting using the `iris` dataset.

In [None]:
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)

In [None]:
def my_3d_plot(df, figsize=None):
    plt.style.use("seaborn-white")

    fig = plt.figure(figsize=figsize)
    # Adjust the margins of this figure; calling plt.subplots_adjust() before
    # the figure exists would create and adjust a separate, empty figure.
    fig.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=1, hspace=1)
    ax = fig.add_subplot(111, projection="3d")

    ax.scatter(df["sepal_length_(cm)"], df["sepal_width_(cm)"], df["petal_length_(cm)"])

    ax.set_xlabel("Sepal Length (cm)")
    ax.set_ylabel("Sepal Width (cm)")
    ax.set_zlabel("Petal Length (cm)")

In [None]:
ds = DatasetFactory.from_dataframe(iris_df)
ds.call(my_3d_plot, figsize=(10, 10));

<a id='custom_pairplot'></a>
## Seaborn's `pairplot` Method

The next cell passes the dataframe directly to the Seaborn `.pairplot()` method, which plots pairwise relationships in the dataset. It creates a grid of Axes in which each variable shares the y-axis across a single row and the x-axis across a single column. The diagonal Axes are treated differently: each one shows the univariate distribution of the variable in that column.

In [None]:
sns.set(style="ticks", color_codes=True)
DatasetFactory.from_dataframe(iris_df).call(lambda df: sns.pairplot(df.dropna()))

<a id='custom_matplotlib'></a>
## Matplotlib

In [None]:
df = pd.DataFrame(randn(1000, 4), columns=list("ABCD"))


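def ts_plot(df, figsize):
    # Index the rows by consecutive dates so the x-axis shows time.
    # set_index() returns a new frame, so the result must be assigned.
    df = df.set_index(pd.date_range("1/1/2000", periods=len(df)))
    # Plot the running total of each column as a time series.
    df = df.cumsum()
    df.plot(figsize=figsize)
    plt.legend(loc="best")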


ds = DatasetFactory.from_dataframe(df, target="A")
ds.call(ts_plot, figsize=(7, 7))

<a id='custom_pie'></a>
## Pie Chart

In this example, you load data using the `DatasetFactory` class and draw a customized pie chart.

In [None]:
data = {"data": [1109, 696, 353, 192, 168, 86, 74, 65, 53]}
df = pd.DataFrame(
    data,
    index=[
        "20-50 km",
        "50-75 km",
        "10-20 km",
        "75-100 km",
        "3-5 km",
        "7-10 km",
        "5-7 km",
        ">100 km",
        "2-3 km",
    ],
)


explode = (0, 0, 0, 0.1, 0.1, 0.2, 0.3, 0.4, 0.6)
colors = [
    "#191970",
    "#001CF0",
    "#0038E2",
    "#0055D4",
    "#0071C6",
    "#008DB8",
    "#00AAAA",
    "#00C69C",
    "#00E28E",
    "#00FF80",
]


def pie_plot(df, figsize):
    # Draw the "data" column as an exploded pie chart at the requested size.
    df["data"].plot(
        kind="pie", fontsize=17, colors=colors, explode=explode, figsize=figsize
    )
    plt.axis("equal")
    plt.ylabel("")
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
    plt.show()


ds = DatasetFactory.from_dataframe(df)
ds.call(pie_plot, figsize=(7, 7))

<a id='custom_gis'></a>
## GIS Plot

Here are some examples where you can visualize geographical data using ADS SDK visualizations.

For this example, the dataset used is the California earthquake data retrieved from the USGS earthquake catalog.

In [None]:
data_path = join(
    "/", "opt", "notebooks", "ads-examples", "3P_data", "earthquake_01.csv"
)
# Load the CSV from its path; .from_dataframe() expects a DataFrame, not a path.
earthquake = DatasetFactory.open(data_path, target="depth")

The next cell gives a brief visual overview of the major places where earthquakes happened.

In [None]:
earthquake.plot_gis_scatter(lon="longitude", lat="latitude")

In [None]:
earthquake.head()

In the next cell, you make minor transformations using Pandas so that the earthquake dataset contains a `location` column in the format `"(latitude, longitude)"`.

In [None]:
df = earthquake.compute()
earthquake_df = df.assign(location=[*zip(df.latitude, df.longitude)]).astype(str)
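
To see what this transformation produces, the sketch below applies the same `assign` and `astype(str)` steps to a small hypothetical DataFrame (the coordinates are invented for illustration, not taken from the earthquake data).

```python
import pandas as pd

# Hypothetical coordinates, for illustration only.
toy = pd.DataFrame({"latitude": [34.05, 37.77], "longitude": [-118.25, -122.42]})

# Pair each latitude with its longitude, then cast every column to string.
toy_loc = toy.assign(location=[*zip(toy.latitude, toy.longitude)]).astype(str)
print(toy_loc.location.tolist())
```

Note that `.astype(str)` converts every column to strings, not only `location`, so numeric columns in the result are no longer numeric.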

In the next cell, you plot the `location` column using `.plot()`, which outputs an interactive map that lets you zoom in and out and identify outliers and inliers.

In [None]:
earthquake_02 = DatasetFactory.from_dataframe(earthquake_df)
earthquake_02.plot("location").show_in_notebook()

<a id='ref'></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)