# Fast Pandas Profiling
 _like...Panda Express fast_ 

## Python Package Demo: pandas_profiling

### The Demonstration
The heart of the presentation is six cells of code showing a use case for the `pandas_profiling` package. These six cells demonstrate:
* creating a `ProfileReport` object
* presenting the object as a Jupyter Notebook widget and an HTML embedded iframe
* changing the report configuration
* exporting the report to a standalone HTML file

### The Package
> Generates profile reports from a pandas `DataFrame`. The pandas `df.describe()` function is great but a little basic for serious exploratory data analysis. `pandas_profiling` extends the pandas DataFrame with `df.profile_report()` for quick data analysis.

You can read about the `pandas_profiling` package at this [PyPI link](https://pypi.org/project/pandas-profiling/).

### The Data 
The demo uses data that is purposely unfamiliar to participants, so the profile is of a genuinely novel data set. The data comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Horse+Colic) and is stored as space-delimited files. The metadata notes:

> **Abstract**
>- Well documented attributes
>- 368 instances with 28 attributes* (continuous, discrete, and nominal)
>- 30% missing values

*Note: For Attribute 28 `cp_data`, the data dictionary notes*
>this variable is of no significance since pathology data is not included or collected for these cases

It is ultimately dropped from the data in pre-processing.

In [None]:
# packages for data ETL
import urllib.request  # urlopen lives in the urllib.request submodule
import my_simplifying_module as msm  

# packages to create required DataFrame object
import numpy as np
import pandas as pd

# package to profile the DataFrame
from pandas_profiling import ProfileReport

pd.options.display.max_columns = 30
%load_ext watermark

In [None]:
%watermark -v -iv -n -p pandas_profiling

### Data Pre-processing

The `horse-colic.data` file on the external data repository dates from 1989. It is stored as a space-delimited file without a header row of column names. The data dictionary (at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Horse+Colic)) describes the 28 attributes in free text, which is read as bytes and must be decoded.

The cell below handles all the pre-processing of the column names and data. It is provided for full reproducibility, but is not the focus of this demo. The pre-processing steps are packaged in the module `my_simplifying_module`:

**Column Names**
1. create a regular expression pattern to isolate and extract the variable names
2. open the file `horse-colic.names` and step through each line
  - decode bytes to utf-8
  - for lines containing a variable (matches the regular expression)
    - extract the required regular expression group
    - transform variable names to lowercase and snake case
3. add in two variable names not captured in the loop
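
The column-name steps above could be sketched roughly as follows. This is not the code in `my_simplifying_module`, which is not shown here. The sample lines, the regular expression, and the helper `to_snake_case` are all assumptions made for illustration, and the real layout of `horse-colic.names` may differ.

```python
import re

# Hypothetical excerpt mimicking numbered attribute lines, read as bytes.
sample_lines = [
    b"  1:  surgery?\n",
    b"          1 = Yes, it had surgery\n",
    b"  2:  Age \n",
    b"  4:  rectal temperature\n",
    b"          - linear\n",
]

# Assumed pattern: optional indent, attribute number, colon, then the name.
pattern = re.compile(r"^\s*\d+:\s+(.+?)\s*$")


def to_snake_case(name):
    """Lowercase a name and join its alphanumeric words with underscores."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "_".join(words)


col_names = []
for raw in sample_lines:
    line = raw.decode("utf-8")       # decode bytes to text
    match = pattern.match(line)      # only attribute lines match
    if match:
        col_names.append(to_snake_case(match.group(1)))

print(col_names)
```

Lines that describe coded values, such as `1 = Yes, it had surgery`, have no colon after the leading number, so the pattern skips them.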

**Data**
1. open the file `horse-colic.data` using `pd.read_csv()`
  - pass parameters to handle the space-delimited format and the `?` placeholder used for missing values (`NaN`)
  - pass a dictionary of columns to coerce to string type
2. drop the variable `cp_data`, as noted above  
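
A minimal sketch of the data steps above, using a small in-memory sample rather than the real file. The sample rows, the shortened column list, and the choice of which columns to read as strings are assumptions for illustration. The actual parameters used inside `my_simplifying_module` may differ.

```python
import io

import pandas as pd

# Hypothetical five-column excerpt; the real file has 28 columns.
sample = (
    "2 1 530101 38.50 2\n"
    "1 1 534817 ? 2\n"
    "2 9 530334 37.30 1\n"
)
col_names = ["surgery", "age", "hospital_number", "rectal_temperature", "cp_data"]

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",        # split on whitespace, tolerating trailing spaces
    header=None,       # the file has no header row
    names=col_names,   # supply the column names extracted earlier
    na_values="?",     # treat '?' as missing
    dtype={"surgery": str, "age": str, "hospital_number": str},
)

# Drop the attribute the data dictionary flags as having no significance.
df = df.drop(columns="cp_data")

print(df.dtypes)
```

Reading coded columns as strings keeps pandas from treating category codes and identifiers as quantities.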

In [None]:
names_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.names'
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data'

names_file = urllib.request.urlopen(names_url)
col_names = msm.column_name_processing(names_file)

data = msm.data_file_processing(data_url, col_names)

In [None]:
data.sample(5)

## A standard pandas approach to data profiling

The `<dataframe>.describe()` method, along with parameters such as `include='object'`, does a good job of summarizing numerical and categorical data. Both summaries are shown below.

In [None]:
data.describe()

In [None]:
data.describe(include='object')

## Generating Profiling Reports for EDA

Next steps in Exploratory Data Analysis might include:
* finding missing values
* checking if data types need adjusting
* finding correlations
* plotting histograms
* and more
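
For comparison, the first few of these steps can be done by hand in plain pandas and NumPy. The sketch below uses a hypothetical toy DataFrame in place of the demo data. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the demo data.
df = pd.DataFrame({
    "pulse": [66.0, 88.0, np.nan, 164.0],
    "rectal_temperature": [38.5, 39.2, 38.3, np.nan],
    "outcome": ["1", "3", "1", "2"],
})

# Fraction of missing values per column
missing_share = df.isna().mean()

# Data types, to spot columns that need adjusting
dtypes = df.dtypes

# Pairwise correlations between the numeric columns only
correlations = df.select_dtypes("number").corr()

# Histogram counts for one numeric column (no plotting library needed)
counts, bin_edges = np.histogram(df["pulse"].dropna(), bins=3)

print(missing_share, dtypes, correlations, counts, sep="\n\n")
```

Each step is short on its own, but repeating them for every column is the work a profiling report automates.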

The `pandas_profiling` package is an early generation of these types of off-the-shelf tools. Let's look at those six cells of code, which will:

- create a ProfileReport object
- present the object as a Jupyter Notebook widget and an HTML embedded iframe
- change the report configuration to customize the output, to some degree
- export the report to a standalone HTML file

In [None]:
# Create a ProfileReport object
# minimal=True is used here to suppress correlations

profile = ProfileReport(data, minimal=True,
                        title='Profile of Horse Colic Data v1.0'
                       )

In [None]:
# View in the notebook as a widget
profile.to_widgets()

In [None]:
profile_frame = ProfileReport(data, minimal=True,
                              title='Profile of Horse Colic Data v1.0'
                            )

# View in the notebook as an HTML iframe
profile_frame.to_notebook_iframe()

## Configuring the report

It is well worth exploring the [Advanced Usage](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html) documentation for report configuration options.

In our demonstration, the HTML output parameters are changed to use brand colors and a logo. Those settings are found at the end of the [YAML configuration file](./custom_config.yml).

In [None]:
# change the report configuration to customize the output
full_profile = ProfileReport(data,
                             title='Profile of Horse Colic Data',
                             config_file='custom_config.yml'
                            )

In [None]:
full_profile.to_notebook_iframe()

In [None]:
# export the report to a standalone HTML file
full_profile.to_file("./horse_colic_report.html")