# Pandas Profiling - Iris Dataset

*October 2019 | Hilary Goh, Perth - Australia*

---

Blog: https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625

EDA: Exploratory Data Analysis

"Instead of just giving you a single output, pandas-profiling enables its user to quickly generate a very broadly structured HTML file containing most of what you might need to know before diving into a more specific and individual data exploration." - Lukas Frie

The Iris dataset is a multivariate dataset of Iris flowers. It contains 50 samples from each of three species: Iris setosa, Iris virginica, and Iris versicolor. Using the Pandas Profiling report below, can you determine how many classes there are and what they are?

https://en.wikipedia.org/wiki/Iris_flower_data_set

Examine the data for:

- Number of variables/features/classes
- Duplicate columns/rows
- NaNs/Missing values
- Variable type
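
For comparison, the same checklist can be worked through by hand in plain pandas. The block below is a minimal sketch rather than what the profiler does internally: it uses the first five Iris rows shown later in this notebook, plus one duplicated row and one missing value that were added purely for illustration. The name `sample` is hypothetical.

```python
import pandas as pd

# First five Iris rows as displayed later in this notebook.
# The sixth row (a copy of the first) and the None value are NOT part of the
# real dataset; they are added here only so the checks have something to find.
sample = pd.DataFrame(
    {
        "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0, 5.1],
        "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6, 3.5],
        "petal length (cm)": [1.4, 1.4, 1.3, 1.5, None, 1.4],
        "petal width (cm)": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
    }
)

# Number of observations and variables
n_rows, n_columns = sample.shape

# Fully duplicated rows (the first occurrence is not counted)
n_duplicate_rows = int(sample.duplicated().sum())

# Missing values per column
missing_per_column = sample.isna().sum()

# Variable type of each column
column_types = sample.dtypes

print(n_rows, n_columns, n_duplicate_rows)
print(missing_per_column)
print(column_types)
```

A profiling report gathers these checks, and more, into a single HTML page, which is what makes it convenient as a first pass.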

---
### Pandas Profiling Package, V2.3.0

https://anaconda.org/conda-forge/pandas-profiling

License: MIT

Home: http://github.com/pandas-profiling/pandas-profiling

Development: http://github.com/pandas-profiling/pandas-profiling

Documentation: https://pandas-profiling.github.io/pandas-profiling/docs/

---
To install this package with conda, run one of the following:

* conda install -c conda-forge pandas-profiling
* conda install -c conda-forge/label/cf201901 pandas-profiling

---

### Install pandas-profiling package

In [None]:
!pip install pandas-profiling

In [1]:
!pip show pandas-profiling  # check which version is installed

Name: pandas-profiling
Version: 2.3.0
Summary: Generate profile report for pandas DataFrame
Home-page: https://github.com/pandas-profiling/pandas-profiling
Author: Jos Polfliet, Simon Brugman
Author-email: pandasprofiling@gmail.com
License: MIT
Location: c:\users\hilary goh\anaconda3\lib\site-packages
Requires: astropy, confuse, jinja2, phik, htmlmin, pandas, missingno, matplotlib
Required-by: 


In [None]:
!pip install pandas-profiling --upgrade  # then re-run the previous cell to see which version is now installed
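
The installed version can also be read from inside Python, which is handy when a notebook should record its own environment. The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if pandas-profiling is installed, otherwise None.
print(installed_version("pandas-profiling"))
```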

### Load libraries

In [2]:
%matplotlib inline

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [4]:
import pandas_profiling

### Load Iris dataset

In [5]:
from sklearn.datasets import load_iris

In [6]:
iris = load_iris()

In [7]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

### Examine data

In [8]:
df.describe()  # the typical pandas summary

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


#### Use Pandas Profiling instead

In [9]:
pandas_profiling.ProfileReport(df)
# will also provide warnings and labels



In [10]:
df[:5]  # compare with the plain pandas view; the column names now contain underscores

,sepal_length_(cm),sepal_width_(cm),petal_length_(cm),petal_width_(cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [11]:
x = iris['data']
y = iris['target']

In [12]:
y
# class 0 = setosa
# class 1 = versicolor
# class 2 = virginica

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [13]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
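
To answer the question from the introduction (how many classes there are and what they are), the labels can be counted directly. This is a small sketch that rebuilds the target array shown above (50 zeros, 50 ones, 50 twos) with NumPy instead of loading the dataset, so it runs without scikit-learn. The names `target` and `class_counts` are chosen here for illustration.

```python
import numpy as np

# Same layout as the `y` array printed above: 50 samples per class, in order.
target = np.repeat([0, 1, 2], 50)
target_names = np.array(["setosa", "versicolor", "virginica"])

# Distinct class labels and how often each one occurs
labels, counts = np.unique(target, return_counts=True)

class_counts = {
    str(target_names[label]): int(count) for label, count in zip(labels, counts)
}
print(class_counts)
# {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```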

In [14]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [15]:
df['target'] = [iris.target_names[t] for t in iris.target]
df

,sepal_length_(cm),sepal_width_(cm),petal_length_(cm),petal_width_(cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
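
The list comprehension above maps each integer label to its species name. One alternative, sketched here on a handful of made-up codes rather than the full dataset, is `pd.Categorical.from_codes`, which stores the column as a categorical type instead of plain strings. This is a different technique from the one used in the notebook; the `codes` values are illustrative only.

```python
import pandas as pd

# Illustrative integer labels (not taken from the dataset) and the class names
# reported by iris.target_names.
codes = [0, 0, 1, 2, 1]
names = ["setosa", "versicolor", "virginica"]

# Map each code to its category name
species = pd.Categorical.from_codes(codes, categories=names)

print(list(species))
# ['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor']

# Count how many samples fall in each class
counts = pd.Series(species).value_counts()
print(counts)
```

A categorical column keeps the full set of class names even when a class has no rows, which can make later group-by summaries more predictable.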


### Export the report

In [18]:
profile = df.profile_report(title='Iris Pandas Profiling Report')
profile.to_file(output_file="Iris Pandas Profiling Report.html")

FIN