In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Exploratory Data Analysis
A key to any data science project is to fully understand your data and the relationships between potential predictors and the predictand. The insights gained should help identify likely paths toward machine learning (if that is the goal). As you perform exploratory data analysis (EDA) you'll want to **keep asking questions and be inquisitive about the data**. This will be a very iterative process and likely cause you to break down data in numerous ways.

https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15

## Read Data
We have been using the Pandas module to read in data this semester because the DataFrame object and its methods make completing EDA an easier, more enjoyable task. A DataFrame can also be fed directly into the Seaborn module for visual EDA. Seaborn is built on top of Matplotlib and integrates closely with Pandas.

For this example of EDA, we'll use the Iris dataset, which contains 150 rows of data about three different species of Iris flowers. The data can be found at https://datahub.io/machine-learning/iris/r/iris.csv and doesn't require any special keyword arguments to read in with the `read_csv` function from Pandas.

In [None]:
df = pd.read_csv('https://datahub.io/machine-learning/iris/r/iris.csv')

Let's first take a look at the data by representing the DataFrame as a table. Note that to do this we don't want to use the print function, but rather just type the variable name into a cell and run that cell.

In [None]:
df

The default view is a table showing the first and last five rows of the DataFrame. Since it is a small enough dataset we see all of the columns, which include the species class, the sepal length, sepal width, petal length, and petal width.

*What part of the flower is the sepal?*

It is the part of a flower that covers a bud.

![image](https://leafyplace.com/wp-content/uploads/2019/05/flower-diagram.jpg)

## Data Insights

Let's start off with some insights from the size and shape of the dataset and describing some of the numeric characteristics of the dataset.
* shape of the dataset
* common statistics (five numbers)
* correlation between numeric fields

In [None]:
df.shape
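
Beyond `shape`, a couple of other quick checks are often useful at this stage. The following is a minimal sketch on a small synthetic stand-in DataFrame (the values are randomly generated, not real Iris measurements, and the class labels are placeholders); substituting `df` for `demo` would apply the same calls to the Iris data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Iris measurements); column names mirror the Iris file
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sepallength': rng.normal(5.8, 0.8, 30),
    'sepalwidth': rng.normal(3.0, 0.4, 30),
    'class': np.repeat(['setosa', 'versicolor', 'virginica'], 10),
})

n_rows, n_cols = demo.shape             # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(demo.dtypes)                      # data type of each column
print(demo['class'].value_counts())     # number of rows in each category
```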

## Five Number Summary
With our DataFrame we can use the `describe()` method to get a quick look at the observation count, mean, standard deviation, minimum, 25%, median, 75%, and max values.

You can also alter the percentile values shown with the `percentiles` keyword argument by supplying a list of the fractions (between 0 and 1) that you would like to be shown (e.g., `[.05, .10, .25, .50, .75, .90, .95]`).

**What notable things stand out from your five number summary?**

In [None]:
df.describe()
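
To try the `percentiles` keyword described above, here is a minimal sketch on synthetic stand-in data (randomly generated values, not real Iris measurements). The same call should work on `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Iris measurements)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sepallength': rng.normal(5.8, 0.8, 100),
    'petallength': rng.normal(3.8, 1.7, 100),
})

# Percentiles are given as fractions between 0 and 1
summary = demo.describe(percentiles=[.05, .25, .50, .75, .95])
print(summary)
```

The resulting table gains rows labeled `5%` and `95%` alongside the usual quartiles.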

## Correlations
With a DataFrame it is also very easy to get a table of correlation values among all of the numeric columns within your dataset. In addition, you have a choice of correlation method: `pearson` (the default and most common), `kendall`, or `spearman`. To change the correlation method, use the `method` keyword.

Kendall and Spearman Rank Correlation Information: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/kendalls-tau-and-spearmans-rank-correlation-coefficient/

**What jumps out about the correlation between all of the variables?**

In [None]:
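# numeric_only=True skips the text 'class' column, which recent Pandas versions
# no longer drop automatically
df.corr(numeric_only=True)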
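
To see how the choice of `method` can matter, here is a small sketch using two hypothetical columns, `x` and `y_curved`, that are perfectly monotonic but not linearly related. The data are synthetic and only meant to illustrate the difference between a linear and a rank-based correlation.

```python
import numpy as np
import pandas as pd

# Synthetic data: y_curved always increases with x, but not along a straight line
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 50)
demo = pd.DataFrame({'x': x, 'y_curved': np.exp(x)})

pearson = demo.corr(method='pearson')     # measures linear association
spearman = demo.corr(method='spearman')   # measures monotonic (rank) association

print(pearson.loc['x', 'y_curved'])
print(spearman.loc['x', 'y_curved'])
```

Because the ranks of the two columns agree exactly, the Spearman value is 1, while the Pearson value is lower since the relationship is curved.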

## Visual Exploratory Data Analysis
While we have previously used Matplotlib to make some basic graphics, for exploratory data analysis we are going to use a module that has been built on top of it, which will make the creation of graphics easier, while still allowing all of the hooks to refine the figures using the details of Matplotlib.

Seaborn (https://seaborn.pydata.org/index.html) is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Much of the Matplotlib functionality is exposed through keyword arguments on each plotting function. The rest is reachable by assigning the plot to a variable, accessing its axis (`ax`), and setting parameters as we did with native Matplotlib axis calls.

Seaborn API: https://seaborn.pydata.org/api.html

### Histogram
In Seaborn, `displot` is a figure-level plotting function that can draw several types of plots for univariate and bivariate data. The default is `histplot`. You'll need to feed it your dataset, preferably as a Pandas DataFrame, and specify the column of data you wish to plot (e.g., sepallength). The lower-level plot for a histogram is `sns.histplot`.

Displot Documentation: https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot

Histplot Documentation: https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot

**What does the distribution of sepal length among the iris types look like?**

In [None]:
sns.displot(df, x='sepallength')

What makes this plotting function (and all of Seaborn) really powerful is that it is easy to subset and plot data from different categories. For example, with the Iris dataset, we have three different types of the Iris flower, and by adding one keyword argument, `hue`, we can get a visual representation of the similarities and differences of a particular characteristic across the different categories.

**Are the distributions different for the different iris classes?**

In [None]:
plot = sns.displot(df, x='sepallength', hue='class')

### Kernel Density Estimate
Another option from `displot` is to plot a kernel density estimate (KDE), a smoothed version of the histogram distribution. To plot the KDE, set the `kind` keyword to `'kde'`.

kdeplot Documentation: https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot

In [None]:
sns.displot(df, x='sepallength', kind='kde', hue='class')
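
To build some intuition for what a KDE is, here is a minimal hand-rolled sketch using NumPy: one Gaussian bump is centered on every observation and the bumps are averaged. The function name `gaussian_kde_1d` and the fixed bandwidth of 0.3 are assumptions for illustration; Seaborn chooses a bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# Synthetic stand-in values (NOT real Iris measurements)
rng = np.random.default_rng(0)
samples = rng.normal(5.8, 0.8, 200)

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Evaluate the estimate on a grid that extends well past the data
grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 (simple Riemann sum check)
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, which is the same trade-off as choosing histogram bin widths.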

### Dot Plots
In Seaborn there is a simple method for plotting dot plots to visualize the distribution of a characteristic within each category: the `stripplot` plotting function. This is not a figure-level plotting function, and the DataFrame should be passed with the keyword argument `data` rather than as the first value in the function arguments, since not every Seaborn version accepts it positionally. Here let's plot `sepalwidth` by `class`.

**What kind of difference in distribution of sepal widths are there between the different iris classes?**

In [None]:
sns.stripplot(data=df, x='sepalwidth', y='class')
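
A numeric companion to the dot plot is a per-category summary table. The sketch below uses `groupby` with `describe` on synthetic stand-in data (the group means are made up, not real Iris statistics); replacing `demo` with `df` would summarize the actual sepal widths by class.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Iris measurements); class labels are placeholders
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'class': np.repeat(['setosa', 'versicolor', 'virginica'], 20),
    'sepalwidth': np.concatenate([
        rng.normal(3.4, 0.4, 20),
        rng.normal(2.8, 0.3, 20),
        rng.normal(3.0, 0.3, 20),
    ]),
})

# One row per class, with count, mean, std, min, quartiles, and max as columns
by_class = demo.groupby('class')['sepalwidth'].describe()
print(by_class.round(2))
```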

### Box and Whisker Plots
A box plot is a very common method to visualize the five-number summary. This is a part of the Seaborn categorical plotting, which includes dot plots, box plots, swarm plots, etc. Here we use the individual plotting function and not the figure-level plotting function (`catplot`).

boxplot Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot

For more information on the figure-level categorical plotting see the `catplot` documentation at https://seaborn.pydata.org/generated/seaborn.catplot.html#seaborn.catplot

**What kind of range is there for petal length? Are there any outliers?**

In [None]:
fig = plt.figure(figsize=(10, 12))
sns.boxplot(data=df, y='petallength')
plt.title('Box and Whisker Plot\nPetal Length');
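
By default the whiskers of a box plot reach to the most extreme points within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an individual outlier point. The sketch below reproduces that rule numerically on a handful of made-up values (not real Iris measurements), so the outliers can be listed as well as seen.

```python
import pandas as pd

# Made-up values for illustration; 9.0 is deliberately far from the rest
values = pd.Series([4.0, 4.2, 4.4, 4.5, 4.6, 4.8, 5.0, 9.0])

q1, q3 = values.quantile([0.25, 0.75])   # edges of the box
iqr = q3 - q1                            # interquartile range

# Points outside these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

Applying the same steps to `df['petallength']` would give a numeric answer to the outlier question above.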

Switch up which way the box and whiskers are oriented by swapping the x and y values.

In [None]:
sns.boxplot(data=df, x='sepalwidth', y='class')

### Scatter Plots

There are many times when you might be trying to determine the relationship between predictor variables or between the predictor variables and the predictand. Simple scatter plots allow a visual comparison that could lead to the identification of clustering or of a linear or other relationship between two variables. This is part of the relational plotting in Seaborn.

scatterplot Documentation: https://seaborn.pydata.org/generated/seaborn.scatterplot.html#seaborn.scatterplot

**What is the relationship between petal length and sepal length?**

In [None]:
sns.scatterplot(data=df, x='petallength', y='sepallength')

Combine a KDE plot with a scatterplot of the points going into the KDE.

In [None]:
sns.displot(df, x='petallength', y='sepallength', kind='kde', hue='class')
sns.scatterplot(data=df, x='petallength', y='sepallength', hue='class', alpha=0.5)
plt.title('This is a title')

### Pair Plots
A quick way to visualize all of the pairwise relationships within your dataset is to use the `pairplot` function. This will create a matrix of plots that shows the histogram of each variable on the diagonal and a scatterplot of each variable against every other (in a similar fashion to a correlation matrix).

pairplot Documentation: https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot

**What are the distribution and relationships between all of the numeric variables in my dataset?**

In [None]:
sns.pairplot(df)

There are many more ways to customize the pairplot to visualize the data in a meaningful way. If you need more control over the pairplot, you can use the PairGrid class (https://seaborn.pydata.org/generated/seaborn.PairGrid.html#seaborn.PairGrid), but even without going to that level, there is great ability to modify the pairplot with some mapping.

In [None]:
g = sns.pairplot(df, hue='class', diag_kind="kde", height=2.5)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, fill=False)
g.map_diag(sns.histplot, multiple='stack', stat='count', kde=True)

### Heat Map

Heat maps are a matrix plot and are a great way to add some visualization to otherwise text-based analysis. Specifically, you can plot a correlation matrix and fill it in with colors representing the value of the correlation of each pair of observation vectors. By using the `annot` keyword you can even add the numeric value (correlation in the example below) to the plot.

heatmap Documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap

**What variables have a high correlation to one another?**

In [None]:
fig = plt.figure(figsize=(10, 10))
# numeric_only=True keeps the text 'class' column out of the correlation matrix
sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='vlag_r')
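
To answer the correlation question numerically as well as visually, the correlation matrix can be flattened into a ranked list of variable pairs. This is a minimal sketch on synthetic stand-in data (not real Iris measurements), in which `petalwidth` is deliberately constructed from `petallength` so that the pair is strongly correlated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT real Iris measurements)
rng = np.random.default_rng(0)
petallength = rng.normal(3.8, 1.7, 100)
demo = pd.DataFrame({
    'petallength': petallength,
    'petalwidth': 0.4 * petallength + rng.normal(0, 0.1, 100),  # built to correlate
    'sepalwidth': rng.normal(3.0, 0.4, 100),                    # independent
})

corr = demo.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Rank the pairs by the strength of the correlation, ignoring sign
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```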

### Additional Resources
There are many other types of plots and some additional packages that might help make the visualization that you need for a particular part of your work. Here are a couple of links to other tutorials/help pages that might be helpful.

https://www.python-graph-gallery.com

https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed

https://nbviewer.jupyter.org/github/PBPatil/Exploratory_Data_Analysis-Wine_Quality_Dataset/blob/master/winequality_white.ipynb

https://realpython.com/pandas-plot-python/

https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html