<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/notebooks/data-explore-iris-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data exploration - Iris Data Set

This example explores the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) [[csv](https://github.com/venky14/Machine-Learning-with-Iris-Dataset/blob/master/Iris.csv)] from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) using [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). This kind of [EDA (Exploratory Data Analysis)](https://en.wikipedia.org/wiki/Exploratory_data_analysis) is one of the key data science components of machine learning applications.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/iris.jpg?raw=1" /></center>
</figure>

Some of the example code here is pulled from the book [Neural Network Projects with Python](https://www.oreilly.com/library/view/neural-network-projects/9781789138900/) by James Loy. The majority is from the GitHub project [Machine Learning with Iris Dataset](https://github.com/venky14/Machine-Learning-with-Iris-Dataset) by [Veky Rathod](https://github.com/venky14).

## Data Set Information:

This is perhaps the best known dataset to be found in the pattern recognition literature. Fisher's 1936 paper is a classic in the field and is referenced frequently to this day. (See [Duda & Hart](https://drive.google.com/open?id=1fq7usrZu7nmi0co5urq60h9P6w0WkDbY), for example.) 

- The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. 
- One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 
- Predicted attribute: class of iris plant. 
- This is an exceedingly simple domain. 
- [Here's Jesin Fahad's Analysis of the dataset](https://www.kaggle.com/jesyfax/iris-analysis) on [Kaggle](https://www.kaggle.com/).
- [Fisher's 1936 paper](https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf)

## Attribute Information:

1. [sepal](https://en.wikipedia.org/wiki/Sepal) length in cm 
2. [sepal](https://en.wikipedia.org/wiki/Sepal) width in cm 
3. [petal](https://en.wikipedia.org/wiki/Petal) length in cm 
4. [petal](https://en.wikipedia.org/wiki/Petal) width in cm 
5. class: 
  - Iris Setosa 
  - Iris Versicolour 
  - Iris Virginica
  
 <figure>
  <img src="../images/iris-petal-sepal.jpg?raw=1" align="right"/>
</figure>


# Python code

**Usage NOTE!** Use `Shift+Enter` to step through this notebook, executing the code as you go.

In [None]:
#@title Welcome
import datetime
print(f"Welcome to exploring this notebook at {datetime.datetime.now()}! ")

In [None]:
class Context:
  DATA = 'https://raw.githubusercontent.com/jfogarty/machine-learning-intro-workshop/master/data/'

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # a data processing and CSV I/O library

import warnings  # the current version of seaborn generates a bunch of warnings that will be ignored
warnings.filterwarnings('ignore')

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style='white', color_codes=True)

import matplotlib.cm as cm

colors = cm.rainbow(np.linspace(0, 1, 4))

In [None]:
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'


# Visualization Tools

## <center>[matplotlib - Python 2D plotting library](https://matplotlib.org/) </center>

## <center>[seaborn - Visualizing the distribution of a dataset](https://seaborn.pydata.org/tutorial/distributions.html) </center>

## <center>[pandas - DataFrame visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) </center>


### Read the Iris Data set from the UC Irvine ML databases

In [None]:
colnames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(URL, names=colnames)

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.plot(kind='scatter',x='sepal_length', y='sepal_width', color=[colors[0]]) # use this to make a scatterplot of the Iris features.

In [None]:
colors

## Read a CSV file with column headings from the workshop data repository


In [None]:
# load Iris Flower dataset
URL = Context.DATA + 'Iris.csv'
iris = pd.read_csv(URL)

Display the [pandas DataFrame head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) of the first few elements.

In [None]:
iris.head(10)

In [None]:
iris['Species'].value_counts()

In [None]:
blue = [0, 0, 1]  # RGB triple for pure blue
iris.plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color=[blue])  # use this to make a scatterplot of the Iris features.

### seaborn jointplot

A [seaborn jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html) shows bivariate scatterplots and univariate histograms in the same figure.

In [None]:

# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.jointplot(x='SepalLengthCm', y='SepalWidthCm', data=iris, height=5)

In [None]:
iris.shape


In [None]:
iris.info()

### seaborn FacetGrid

A [seaborn FacetGrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) can be used to color the scatterplot by species.

In [None]:
# use seaborn's FacetGrid to color the scatterplot by species
sns.FacetGrid(iris, hue='Species', height=5) \
    .map(plt.scatter, 'SepalLengthCm','SepalWidthCm') \
    .add_legend()

### seaborn boxplot

A [seaborn boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) lets us look at an individual feature, split out by species.

The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
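
The quantities behind a box plot can be sketched with NumPy alone. The sample values below are made up for illustration, and the 1.5 × IQR cutoff is an assumption (it is the conventional rule and matches seaborn's default `whis` setting); treat this as a rough picture of the idea rather than seaborn's own code.

```python
import numpy as np

# Hypothetical petal lengths in cm (illustrative values, not read from the data set).
values = np.array([1.0, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 3.0])

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed rule: points more than 1.5 * IQR beyond the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = (values >= lower_fence) & (values <= upper_fence)
inliers = values[inside]
outliers = values[~inside]

# Whiskers stop at the most extreme points that are still inside the fences.
whisker_low, whisker_high = inliers.min(), inliers.max()

print(f"box: {q1:.3f} to {q3:.3f}, median {median:.3f}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```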

In [None]:
# We can look at an individual feature in Seaborn through a boxplot
sns.boxplot(x='Species', y='PetalLengthCm', data=iris)

### seaborn stripplot

One way we can extend this plot is adding a layer of individual points on top of it through Seaborn's [stripplot](https://seaborn.pydata.org/generated/seaborn.stripplot.html).

- use `jitter=True` so that all the points don't fall in single vertical lines above the species

- Both calls draw on the current matplotlib axes, so the strip plot is layered on top of the box plot; `ax` simply keeps a reference to those axes
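
What jitter does can be sketched in a few lines: each point keeps its category, but its horizontal position is nudged by a small random offset. The category indices and the 0.1 half-width below are assumptions chosen for illustration; seaborn picks its own jitter amount.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical category index per observation: 0, 1, 2 stand for the three species.
category_index = np.repeat([0, 1, 2], 5)

# Without jitter, every point of a species shares the same x position.
# With jitter, a small uniform offset is added (0.1 is an assumed half-width).
jitter_width = 0.1
offsets = rng.uniform(-jitter_width, jitter_width, size=category_index.size)
x_positions = category_index + offsets

print(np.round(x_positions, 3))
```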

In [None]:
ax = sns.boxplot(data=iris, x = 'Species', y = 'PetalLengthCm')
ax = sns.stripplot(data=iris, x='Species', y='PetalLengthCm', jitter=True, edgecolor='green')

### seaborn violinplot

A violin plot combines the benefits of the previous two plots in a single, simpler figure.

Denser regions of the data are **fatter**, and sparser regions are **thinner**, in a [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

In [None]:
# violinplot draws on the current axes and has no figure-size option, so `size` is dropped
sns.violinplot(x='Species', y='PetalLengthCm', data=iris)

### seaborn kdeplot

A useful seaborn plot for looking at the distribution of a single feature is the [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html),
which creates and visualizes a kernel density estimate of the underlying feature.
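
A kernel density estimate can be sketched directly: place a small Gaussian bump on every sample and average the bumps. The function name `gaussian_kde_1d`, the sample values, and the fixed bandwidth of 0.4 are all assumptions for illustration; seaborn chooses its own bandwidth, so its curves will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centred on every sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return bumps.sum(axis=1) / (samples.size * bandwidth)

# Hypothetical petal lengths in cm (illustrative values only).
petal_lengths = np.array([1.3, 1.4, 1.4, 1.5, 1.7, 4.5, 4.7, 5.1])
grid = np.linspace(-2.0, 9.0, 1101)
density = gaussian_kde_1d(petal_lengths, grid, bandwidth=0.4)

# A density should integrate to roughly 1 over a wide enough grid.
step = grid[1] - grid[0]
area = density.sum() * step
print(f"area under the estimate: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve that follows individual samples; a larger one gives a smoother curve that can hide separate clusters.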


In [None]:
sns.FacetGrid(iris, hue="Species", height=6) \
   .map(sns.kdeplot, "PetalLengthCm") \
   .add_legend()

In [None]:
iris.head()

### seaborn pairplot

A useful seaborn plot for looking at bivariate relations is the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html),
which draws a grid of plots with one panel for every pair of features.

The pairplot shows the bivariate relation between each pair of features.

From the pairplot, we'll see that the Iris-setosa species is separated from the other two across all feature combinations.

In [None]:
sns.pairplot(iris.drop('Id', axis=1), hue='Species', height=3, diag_kind='hist')

#### KDE - Kernel density estimation

The diagonal elements of a pairplot are often shown as histograms. We can change them to show other things, such as a [kde](https://en.wikipedia.org/wiki/Kernel_density_estimation) [(kernel density estimation)](https://mglerner.github.io/posts/histograms-and-kernel-density-estimation-kde-2.html).


In [None]:
sns.pairplot(iris.drop('Id', axis=1), hue='Species', height=3, diag_kind='kde')

### boxplot on each feature split out by species

Make a boxplot with Pandas on each feature split out by species.

In [None]:
iris.drop('Id', axis=1).boxplot(by='Species', figsize=(12,6))

### Pandas andrews_curves

A more sophisticated technique available in pandas is called [Andrews Curves](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.andrews_curves.html).

- Andrews Curves involve using attributes of samples as coefficients for Fourier series and then plotting these.
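
The Fourier-series idea can be sketched as follows, assuming the form usually given for Andrews curves: f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ... for t between -π and π. The helper name `andrews_curve` and the sample vector are illustrative; this is not pandas' own implementation.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, coefficient in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2                  # 1, 1, 2, 2, 3, 3, ...
        wave = np.sin if i % 2 == 1 else np.cos  # sin and cos alternate
        result += coefficient * wave(harmonic * t)
    return result

# One illustrative sample: sepal length, sepal width, petal length, petal width (cm).
sample = [5.1, 3.5, 1.4, 0.2]
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve(sample, t)

# At t = 0 every sine term vanishes, leaving x1/sqrt(2) + x3.
value_at_zero = andrews_curve(sample, np.array([0.0]))[0]
print(f"f(0) = {value_at_zero:.4f}")
```

Each sample becomes one curve, so samples with similar measurements produce curves that stay close together across the whole range of t.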


In [None]:
from pandas.plotting import andrews_curves
andrews_curves(iris.drop("Id", axis=1), "Species")

### Pandas parallel_coordinates

Another multivariate visualization technique pandas has is [parallel_coordinates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.parallel_coordinates.html).

- Parallel coordinates plots each feature on a separate column & then draws lines connecting the features for each data sample.
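
The geometry of a parallel coordinates plot is simple enough to sketch in plain Python: every feature gets a vertical axis at x = 0, 1, 2, ..., and every sample becomes one polyline through its value on each axis. The measurements below are illustrative values only.

```python
feature_names = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Illustrative measurements in cm, one inner list per sample.
samples = [
    [5.1, 3.5, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
]

# Each feature gets its own vertical axis at x = 0, 1, 2, ...
axis_positions = list(range(len(feature_names)))

# Each sample becomes a polyline whose vertices are (axis position, feature value).
polylines = [list(zip(axis_positions, row)) for row in samples]

for line in polylines:
    print(line)
```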

In [None]:
from pandas.plotting import parallel_coordinates
parallel_coordinates(iris.drop("Id", axis=1), "Species")

### Pandas radviz

A final multivariate visualization technique pandas has is [radviz](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.radviz.html).

- Radviz puts each feature as a point on a 2D plane, and then simulates
having each sample attached to those points through a spring weighted
by the relative value for that feature
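
The spring analogy can be sketched with NumPy, assuming the usual RadViz construction: rescale each feature to the range 0 to 1, place one anchor per feature evenly around the unit circle, and put each sample at the weighted average of the anchors. The function name `radviz_positions` and the sample values are illustrative, and the sketch assumes no feature is constant and no sample sits at the minimum of every feature (either case would divide by zero).

```python
import numpy as np

def radviz_positions(features):
    """Return one (x, y) point per row using the spring analogy."""
    features = np.asarray(features, dtype=float)
    n_features = features.shape[1]

    # Rescale every column to [0, 1] so each spring strength is a relative value.
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    weights = (features - col_min) / (col_max - col_min)

    # One anchor per feature, evenly spaced on the unit circle.
    angles = 2.0 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Spring equilibrium: the weighted average of the anchor positions.
    return weights @ anchors / weights.sum(axis=1, keepdims=True)

# Illustrative measurements in cm, one row per sample.
samples = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
])
positions = radviz_positions(samples)
print(np.round(positions, 3))
```

A sample that is large in only one feature is pulled toward that feature's anchor, while a sample with balanced relative values ends up near the center.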

In [None]:
from pandas.plotting import radviz
radviz(iris.drop("Id", axis=1), "Species")

### seaborn factorplot

Draw a categorical plot onto a FacetGrid with [factorplot](https://kite.com/python/docs/seaborn.factorplot), which seaborn 0.9 renamed to `catplot`.
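
With `kind='count'`, the bars show how many rows share each x value, split by the `hue` column. The numbers behind such a plot can be sketched with `pd.crosstab`; the small frame below holds made-up rows for illustration, not values read from the data set.

```python
import pandas as pd

# Hypothetical rows (illustrative values only).
sample = pd.DataFrame(
    {
        "SepalLengthCm": [5.0, 5.0, 5.0, 5.7, 5.7, 6.3],
        "Species": [
            "Iris-setosa", "Iris-setosa", "Iris-versicolor",
            "Iris-versicolor", "Iris-virginica", "Iris-virginica",
        ],
    }
)

# One row per distinct sepal length, one column per species; cells hold row counts.
counts = pd.crosstab(sample["SepalLengthCm"], sample["Species"])
print(counts)
```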

In [None]:
# factorplot was renamed to catplot in seaborn 0.9; x is passed by keyword
sns.catplot(x='SepalLengthCm', data=iris, hue='Species', kind='count', aspect=2.5)

### End of notebook.