# Pandas Profiling - Iris Dataset

*October 2019 | Hilary Goh, Perth - Australia*

---

Blog: https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625

EDA: Exploratory Data Analysis

"Instead of just giving you a single output, pandas-profiling enables its user to quickly generate a very broadly structured HTML file containing most of what you might need to know before diving into a more specific and individual data exploration." - Lukas Frei

The Iris dataset is a multivariate dataset of Iris flowers. It contains 50 samples from each of three species: Iris setosa, Iris virginica and Iris versicolor. Using the Pandas Profiling report below, can you determine how many classes there are and what they are?

https://en.wikipedia.org/wiki/Iris_flower_data_set

Examine the data for:

- Number of variables/features/classes
- Duplicate columns/rows
- NaNs/Missing values
- Variable type
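
For comparison, here is a rough sketch of how the checks listed above could be done by hand in plain pandas. The small `toy` frame and its values are made up for illustration (they are not the real Iris measurements), and the duplicate-column check only looks at numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, NOT the real Iris measurements
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 5.1, 6.3],
    "petal length (cm)": [1.4, 1.4, 1.4, np.nan],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

# Number of rows and variables
n_rows, n_cols = toy.shape

# Number of classes in the label column
n_classes = toy["species"].nunique()

# Duplicate rows (every repeat after the first occurrence is counted)
n_duplicate_rows = int(toy.duplicated().sum())

# Duplicate columns, checked on the numeric columns only
numeric = toy.select_dtypes(include="number")
dup_mask = numeric.T.duplicated()
duplicate_columns = dup_mask[dup_mask].index.tolist()

# NaNs / missing values per column
missing_per_column = toy.isna().sum()

# Variable types
print(toy.dtypes)
print(n_rows, n_cols, n_classes, n_duplicate_rows, duplicate_columns)
print(missing_per_column)
```

The profiling report gathers this kind of information in one place, which is why it is used in the rest of this notebook.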

---
### Pandas Profiling Package, V2.3.0

https://anaconda.org/conda-forge/pandas-profiling

License: MIT

Home: http://github.com/pandas-profiling/pandas-profiling

Development: http://github.com/pandas-profiling/pandas-profiling

Documentation: https://pandas-profiling.github.io/pandas-profiling/docs/

---
To install this package with conda run one of the following:

* `conda install -c conda-forge pandas-profiling`
* `conda install -c conda-forge/label/cf201901 pandas-profiling`
---

### Install pandas-profiling package

In [None]:
!pip install pandas-profiling

In [None]:
!pip show pandas-profiling  # check which version is installed

In [None]:
!pip install pandas-profiling --upgrade  # then re-run the previous cell to confirm the installed version

### Load libraries

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
import pandas_profiling

### Load Iris dataset

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()

In [None]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

### Examine data

In [None]:
df.describe()  # the typical pandas summary

#### Use Pandas Profiling instead

In [None]:
# the report also flags warnings and labels each variable's type
pandas_profiling.ProfileReport(df)

In [None]:
df.head()  # first five rows in plain pandas, for comparison with the report

In [None]:
x = iris['data']
y = iris['target']

In [None]:
y
# class 0 = setosa
# class 1 = versicolor
# class 2 = virginica

In [None]:
iris.target_names

In [None]:
iris.feature_names

In [None]:
df['target'] = [iris.target_names[t] for t in iris.target]
df
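
As a cross-check on the question posed at the top of the notebook, the classes can also be counted directly in pandas. This is a minimal, self-contained sketch that rebuilds the same `df` as above; the names `class_counts` and `n_classes` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = [iris.target_names[t] for t in iris.target]

# Number of distinct classes and the sample count for each
n_classes = df["target"].nunique()
class_counts = df["target"].value_counts()

print(n_classes)
print(class_counts)
```

The counts should agree with the categorical summary the profiling report shows for the `target` column.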

### Export the report

In [None]:
profile = df.profile_report(title='Iris Pandas Profiling Report')
profile.to_file(output_file="Iris Pandas Profiling Report.html")

FIN