

# Data Exploration and Visualizations

*Exploratory Data Analysis (EDA)* is the process by which a data scientist gathers information 
from a dataset. This includes knowing the source of the data, understanding 
what the data itself represents, the features that describe the data, the type 
of data in each feature, and what stories come out of the data. 

Data visualization is one of the quickest ways to acquire information from a dataset.





In [1]:
# Our data management libraries
import pandas as pd
import numpy as np

# A basic visualization library
import matplotlib.pyplot as plt

# A great visualization library
import seaborn as sns
# This command tells python to use seaborn for its styling.
sns.set()

# Very important, this will make your charts appear in your notebook instead of in a new window.
%matplotlib inline

# Provides z-score helper function
import scipy.stats as stats


# Ignore this, this is just for displaying images.
from IPython.display import Image


# First things first... load the data

In [None]:
df = pd.read_csv('data/iris.csv')
df.head()

In [None]:
df['class'].value_counts()

<img src='https://ib.bioninja.com.au/_Media/flower-labelled_med.jpeg'>

<img src='https://i.imgur.com/RcxYYBA.png'>

### Let's see how 'big' our data is by printing its shape.

In [None]:
print(df.shape)

# Let's look at the description and information about our dataset.
* Why? It gives us a high-level summary view of our data.

In [None]:
df.describe()

In [None]:
df.info()

# Let's check if we have any null values in our data.

In [None]:
df.isnull().sum()

### Remove, aka Drop, our null values

In [None]:
df = df.dropna()

# Sanity Check
print(df.shape)

# Print how many null values there are
df.isnull().sum()
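
To see what the null check and the drop do on a small scale, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not from iris.csv)
toy = pd.DataFrame({
    'sepal_length': [5.1, np.nan, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, np.nan, 2.7],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

Note that `dropna()` removes a whole row even if only one of its columns is missing, so it is worth comparing the shape before and after.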

# Check for duplicate rows.

In [None]:
print(df.duplicated().sum())
df.duplicated()

# Drop said duplicates...
`df.duplicated()` returns a boolean selection mask that is `True` wherever a row is a duplicate of an earlier row. How would we use that mask to filter the duplicated rows out of our data set?

In [None]:
# Ask students to solve this one...


# Another way to do the same thing
# df = df.drop_duplicates()
print(df.shape)
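
One possible answer, sketched on a hypothetical `toy` frame rather than the iris data: invert the mask with `~` so that only the non-duplicated rows are selected.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 1.4],
    'class': ['setosa', 'setosa', 'versicolor', 'setosa'],
})

# True for every row that repeats an earlier row
mask = toy.duplicated()

# ~ flips the mask, so we keep only the rows that are NOT duplicates
deduped = toy[~mask]

# The built-in helper gives the same result
same_as_helper = deduped.equals(toy.drop_duplicates())
print(deduped.shape, same_as_helper)
```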

# Plotting
`df.plot(kind='scatter', x='COLUMN_NAME', y='COLUMN_NAME')`

The kinds of plot we can produce:
* 'line' : line plot (default)
* 'bar' : vertical bar plot
* 'barh' : horizontal bar plot
* 'hist' : histogram
* 'box' : boxplot
* 'kde' : Kernel Density Estimation plot
* 'density' : same as 'kde'
* 'area' : area plot
* 'pie' : pie plot
* 'scatter' : scatter plot
* 'hexbin' : hexbin plot

In [None]:
df.plot(kind='scatter', x='sepal_width', y='sepal_length', figsize=(13,8), alpha=0.2);

# How do we remove this outlier...?

In [None]:
# Ask students to see if they can answer this. 



# More plotting.

In [None]:
# How to remove the outlier...
df = df[df.sepal_width < 20].copy()

df.plot(kind='scatter', x='petal_width', y='petal_length', figsize=(13,8));

# Bar Charts

In [None]:
# Find the mean sepal_length for each of the classes 
gb = df.groupby('class')['sepal_length'].mean()

# Make a bar plot of said means
gb.plot(kind='bar');

# Histograms are great for seeing what type of distribution the data follows.

In [None]:
df.plot(kind='hist', bins=10, figsize=(8,5));

In [None]:
df.plot(bins=20, kind='hist', subplots=True, figsize=(5,8));

# Enter [Seaborn](https://seaborn.pydata.org/)
Seaborn is a visualization library that makes pretty plotting easy and fun. 
* Gallery of Examples:  https://seaborn.pydata.org/examples/index.html


### Scatter plots with Seaborn
* [Scatterplot Reference](https://seaborn.pydata.org/generated/seaborn.scatterplot.html#seaborn.scatterplot)

In [None]:
import seaborn as sns
# Don't forget to call sns.set()! 
sns.set()
# This tells the program to use the seaborn styles 
# Which make our graphs awesome looking


# Call using sns, and pass in the data frame.
ax = sns.scatterplot(data=df, x='petal_width', y='petal_length');

### Seaborn's axes-level plotting functions (such as `scatterplot`) return a matplotlib `Axes` object, aka `ax`, which you can then set chart options on. 
* All the options you can set using the `Axes` object:
    * https://matplotlib.org/3.3.1/api/axes_api.html

In [None]:
ax = sns.scatterplot(data=df, x='petal_width', y='petal_length');
ax.set_title("Relationship of Iris flowers petal length to width");

In [None]:
# Say you wanted to change the x-axis

ax = sns.scatterplot(data=df, x='petal_width', y='petal_length');
ax.set_title("Relationship of Iris flowers petal length to width");
ax.set_xlim(left=-5, right=5);

In [None]:
ax = sns.scatterplot(data=df, x='petal_width', y='petal_length');
ax.set_title("Relationship of Iris flowers petal length to width");
ax.set_xticks(ticks=[0,1,2,3]);
ax.set_yticks(ticks=[4]);

In [None]:
# Changing the size of the dots based on a column of our data.
sns.scatterplot(data=df, 
                x='petal_width', 
                y='petal_length',
                size='sepal_length');

# What if we wanted to change the color of the markers based on the type of flower?
* Have students read the documentation to see if they can figure it out.
    * https://seaborn.pydata.org/generated/seaborn.scatterplot.html#seaborn.scatterplot


In [None]:
sns.scatterplot(data=df, 
                x='petal_width', 
                y='petal_length', 
                ......................??? );

In [None]:
sns.scatterplot(data=df, 
                x='petal_width', 
                y='petal_length', 
                hue='class', 
                palette='Dark2');

# The super scatter plot

In [None]:
sns.jointplot(data=df, x='petal_width', y='petal_length', hue='class');

## Plotting the line of best fit
* Using `sns.regplot` and `sns.lmplot` you can easily plot regression analyses.

In [None]:
sns.regplot(data=df, x='petal_width', y='petal_length');

In [None]:
sns.lmplot(data=df, x='petal_width', y='petal_length', hue='class');

# Box plots are great for catching outliers
A box plot can tell you about your outliers and what their values are. It can also tell you whether your data is symmetrical, how tightly it is grouped, and if and how it is skewed.

* Minimum = Q1 - 1.5 * IQR 
* Q1 = median of lower half of data
* Q2 = median of data
* Q3 = median of upper half of data
* Maximum = Q3 + 1.5 * IQR
* IQR = Q3 - Q1
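
The formulas above can be sketched in a few lines of pandas. The `values` series is hypothetical, and `quantile` uses pandas' default linear interpolation, which can differ slightly from the "median of each half" definition on small samples.

```python
import pandas as pd

# Hypothetical measurements; 9.0 is a deliberately extreme value
values = pd.Series([2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 9.0])

# Quartiles and the interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The "minimum" and "maximum" limits from the list above
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the limits would be drawn as an outlier point
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(outliers.tolist())
```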


<img src='https://miro.medium.com/max/1400/1*2c21SkzJMf3frPXPAR_gZA.png' width=500>

In [None]:
plt.figure(figsize=(8,8))
ax = sns.boxplot(data=df, y='class', x='petal_width');

# The 'catch all' plotting function.

In [None]:
sns.pairplot(df);

# Finding Correlations in your data.
In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. [Learn more here](https://en.wikipedia.org/wiki/Correlation_and_dependence)

<img src='https://www.onlinemathlearning.com/image-files/correlation-coefficient.png' width=500>
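
As a small illustration of the two extremes, the sketch below builds a hypothetical `toy` frame in which one column rises perfectly with `x` and another falls perfectly as `x` rises. The default method of `DataFrame.corr` is the Pearson correlation.

```python
import pandas as pd

# Hypothetical data chosen to show perfect positive and negative correlation
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'up': [2, 4, 6, 8, 10],     # rises perfectly with x
    'down': [10, 8, 6, 4, 2],   # falls perfectly as x rises
})

# Pairwise Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

A value near 1 means the two columns move together, a value near -1 means they move in opposite directions, and a value near 0 means there is little linear relationship.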

In [None]:
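# Correlate only the numeric columns; recent pandas versions raise an
# error if the string 'class' column is included.
df.select_dtypes(include='number').corr()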

# Easily visualize your correlations with a heatmap.

In [None]:
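# Use only the numeric columns (the string 'class' column cannot be correlated)
sns.heatmap(df.select_dtypes(include='number').corr())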

In [None]:
# annot=True writes each correlation value inside its cell
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='Spectral')

# Advanced: filtering outliers using z-scores.
A z-score (or standard score) represents the number of standard deviations a given value x falls from the mean.
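
As a worked example of that definition, the sketch below computes z-scores by hand with NumPy on a hypothetical array. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical values with mean 5.0 and population standard deviation 2.0
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
std = values.std()  # NumPy defaults to ddof=0 (population std)

# z = (x - mean) / std: how many standard deviations x is from the mean
z_scores = (values - mean) / std
print(z_scores)
```

Here the value 9.0 gets a z-score of 2.0 because it sits two standard deviations above the mean.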

In [None]:
# Let's reload our data to have those outliers in it
df = pd.read_csv('data/iris.csv')
df = df.dropna()

f, axes = plt.subplots(1, 4, figsize=(13,5))

ax = sns.boxplot(data=df, x='petal_length', ax=axes[0]);
ax = sns.boxplot(data=df, x='petal_width', ax=axes[1]);
ax = sns.boxplot(data=df, x='sepal_width', ax=axes[2]);
ax = sns.boxplot(data=df, x='sepal_length', ax=axes[3]);
# f.set_figsize=(13,8)

In [None]:
df = df.dropna()

# Create a list of just our numerical columns
numerical_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Create an empty list that we will store our new z-score column names in
z_score_cols = []

# Loop through the numerical columns
for col in numerical_cols:
    
    # Create a new column name that is the old column name + '_zscore'
    new_col_name = col + '_zscore'
    
    # Call the zscore function on the numerical column in our dataframe
    # And set it equal to our new column name.
    df[new_col_name] = stats.zscore(df[col])
    
    # Convert all values into absolute values. 
    df[new_col_name] = abs(df[new_col_name])
    
    # Append the new column name to our z_score_cols list for easier access later.
    z_score_cols.append(new_col_name)
    

df.head()

In [None]:
new_df = df.copy()

c1 = new_df['sepal_length_zscore'] < 3
c2 = new_df['sepal_width_zscore'] < 3
c3 = new_df['petal_length_zscore'] < 3
c4 = new_df['petal_width_zscore'] < 3

# Combine the four masks so a row is kept only if every z-score is below 3
new_df = new_df[c1 & c2 & c3 & c4]


In [None]:
# Make our selection mask: True anywhere the z-score is less than 3
condition = df[z_score_cols] < 3

# Keep a row only if every one of its z-scores is below 3
condition = condition.all(axis=1)

# Apply our condition mask
newdf = df[condition]

# Print our results
newdf.head()
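
When several per-column checks are collapsed into one row mask, the choice between `any` and `all` matters. The sketch below uses a hypothetical frame of absolute z-scores to show the difference: `all(axis=1)` keeps a row only when every column passes, while `any(axis=1)` keeps it as soon as one column passes.

```python
import pandas as pd

# Hypothetical absolute z-scores for three rows
z = pd.DataFrame({
    'a_zscore': [0.5, 4.0, 3.5],
    'b_zscore': [1.0, 0.2, 5.0],
})

# Element-wise check: True where the z-score is below 3
below = z < 3

# Row is kept if at least one column is below 3
keep_any = below.any(axis=1)

# Row is kept only if every column is below 3
keep_all = below.all(axis=1)

print(keep_any.tolist())
print(keep_all.tolist())
```

For outlier removal, `all` is usually the stricter and safer choice, since a row with even one extreme value is dropped.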

In [None]:
f, axes = plt.subplots(1, 4, figsize=(13,5))

ax = sns.boxplot(data=newdf, x='petal_length', ax=axes[0]);
ax = sns.boxplot(data=newdf, x='petal_width', ax=axes[1]);
ax = sns.boxplot(data=newdf, x='sepal_width', ax=axes[2]);
ax = sns.boxplot(data=newdf, x='sepal_length', ax=axes[3]);


# Now it's time for you to practice some EDA. Open up the Exercise notebook and begin coding!