# Advanced pandas - Going Beyond the Basics

## ydata-profiling
- Formerly known as `pandas-profiling`
___

### Table of Contents
1. [Import dependencies](#section1)
2. [Import dataset](#section2)
3. [General quickstart](#section3)
4. [Further features](#section4)
5. [Export reports](#section5)

___
<a id='section1'></a>
# (1) Import dependencies

In [1]:
# Install dependencies (if not already installed)
# !pip install pandas==2.0.3
# !pip install ydata-profiling==4.4.0

In [1]:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

___
<a id='section2'></a>
# (2) Import dataset
- Data Source: https://www.kaggle.com/datasets/datascientistanna/customers-dataset (Database Contents License (DbCL) v1.0)

In [2]:
# Import and read CSV file
df = pd.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/Customers_Mini.csv')

# Set CustomerID as index
df = df.set_index('CustomerID')

# View entire DataFrame
df

| CustomerID | Gender | Age | AnnualIncome | SpendingScore | Profession | WorkExperience | FamilySize |
|------------|--------|-----|--------------|---------------|------------|----------------|------------|
| 1 | Male | 19 | 15000 | 39 | Scientist | 1 | 4 |
| 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 5 | Female | 31 | 38000 | 40 | Artist | 2 | 6 |
| 6 | Female | 22 | 58000 | 76 | Engineer | 0 | 2 |
| 7 | Female | 35 | 31000 | 6 | Scientist | 1 | 3 |


___
<a id='section3'></a>
# (3) General quickstart
`ydata-profiling` can be used outside of Jupyter Notebooks (e.g., from the command line), but we use a Jupyter Notebook here as an easy way to showcase its capabilities.

The key features of `ydata-profiling` include the following:

- **Type inference**: automatic detection of columns’ data types (Categorical, Numerical, Date, etc.)
- **Warnings**: a summary of the problems and challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
- **Univariate analysis**: descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- **Multivariate analysis**: correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise variable interactions
- **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic), and blocks (ASCII, Cyrillic)
- **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images, and existence of EXIF metadata

From these features, we can clearly see that `ydata-profiling` is capable of handling a wide range of data types, such as boolean, numerical, categorical, time series, URL, paths, and images.

To get started right away, we can run the following code to generate a standard profiling report on our DataFrame:

In [3]:
# Generate ProfileReport instance
profile = ProfileReport(df, title='Profiling Report')

In [4]:
# View profiling report
profile


A comprehensive report with rich insights is generated quickly, and the best part is that it is interactive and user-friendly: we can adjust the settings and views via the tabs, buttons, and drop-down menus available.

The report also includes an "Alerts" tab (in the Overview section) that shows a comprehensive and automated list of potential data quality issues. However, the decision on whether an alert is in fact a data quality issue always requires domain validation by the user.

___
<a id='section4'></a>
# (4) Further features
While the general quickstart is already helpful in providing us with a detailed EDA of our data, there are further features in `ydata-profiling` that are highly useful as well. Let us take a look at some of them.

## Comparing two datasets
`ydata-profiling` provides a quick way to generate a report that compares two different datasets. Suppose we perform a series of data transformations on our original dataset to generate a modified DataFrame, as shown below:

In [5]:
# Generate copy of DataFrame
df_transformed = df.copy()

In [6]:
# Filter to engineers (copy so that later assignments do not trigger SettingWithCopyWarning)
df_transformed = df_transformed[df_transformed['Profession'] == 'Engineer'].copy()

# Divide annual income by 1000
df_transformed['AnnualIncome'] = df_transformed['AnnualIncome'] / 1000

# Standardize spending score values
mean = df_transformed['SpendingScore'].mean()
std = df_transformed['SpendingScore'].std()
df_transformed['SpendingScore'] = (df_transformed['SpendingScore'] - mean) / std

# Introduce random NaN values into DataFrame
n = 3 
row_indices = np.random.randint(low=0, high=df_transformed.shape[0], size=n)
col_indices = np.random.randint(low=0, high=df_transformed.shape[1], size=n)
for i in range(n):
    df_transformed.iat[row_indices[i], col_indices[i]] = np.nan

In [7]:
# View modified DataFrame
df_transformed

| CustomerID | Gender | Age | AnnualIncome | SpendingScore | Profession | WorkExperience | FamilySize |
|------------|--------|-----|--------------|---------------|------------|----------------|------------|
| 2 | Male | 21 | 35.0 | 0.635943 | Engineer | 3 | 3 |
| 3 | Female | 20 | NaN | -1.152647 | Engineer | 1 | 1 |
| 6 | Female | 22 | 58.0 | NaN | NaN | 0 | 2 |
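
Because the `np.random.randint` calls above are unseeded, the NaN positions change on every run, and two draws can land on the same cell. A minimal sketch of a reproducible variant follows; the helper name `inject_nans`, the seed value, and the small stand-in DataFrame are assumptions made for illustration and are not part of the original notebook:

```python
import numpy as np
import pandas as pd


def inject_nans(frame: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `frame` with exactly `n` distinct cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Choose n distinct flat positions so that no cell is picked twice
    flat_mask = np.zeros(frame.size, dtype=bool)
    flat_mask[rng.choice(frame.size, size=n, replace=False)] = True
    # DataFrame.mask replaces values with NaN wherever the condition is True
    return frame.mask(flat_mask.reshape(frame.shape))


# Tiny stand-in frame (values taken from the engineer rows shown earlier)
demo = pd.DataFrame(
    {"Age": [21, 20, 22], "Profession": ["Engineer"] * 3, "FamilySize": [3, 1, 2]},
    index=pd.Index([2, 3, 6], name="CustomerID"),
)
demo_with_nans = inject_nans(demo, n=3, seed=0)
print(demo_with_nans.isna().sum().sum())  # total NaN count equals n
```

Sampling flat positions without replacement guarantees exactly `n` missing cells, and fixing the seed keeps the comparison report stable between runs.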


With our transformed DataFrame ready, we can now generate a profiling report for it before comparing it with the original profiling report, as shown below:

In [8]:
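# Generate a profiling report for the transformed DataFrame
transformed_report = ProfileReport(df_transformed, title='Transformed Data')

# Compare the original report against the transformed one
comparison_report = profile.compare(transformed_report)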

In [9]:
# View comparison report
comparison_report


The output above shows how we can now easily compare the details and characteristics of both DataFrames in a single report.

## Time series data
`ydata-profiling` also works well for time series data. For such data, the report adds statistics specific to time-dependent data, such as autocorrelation and seasonality, along with ACF and PACF plots.

As an example, let us explore the eBay stock price data from 2013 to 2018.

In [58]:
# Read all stock prices dataset into pandas
df_ts = pd.read_csv('https://raw.githubusercontent.com/kennethleungty/Educative-Advanced-Pandas/main/data/csv/all_stocks_5yr.csv')

# Filter to eBay stocks
df_ts = df_ts[df_ts['Name'] == 'EBAY']

# View head
df_ts.head(10)

|        | date | open | high | low | close | volume | Name |
|--------|------|------|------|-----|-------|--------|------|
| 186847 | 2013-02-08 | 56.46 | 57.08 | 56.39 | 56.62 | 8066626 | EBAY |
| 186848 | 2013-02-11 | 56.52 | 56.58 | 55.75 | 56.41 | 5150867 | EBAY |
| 186849 | 2013-02-12 | 56.4 | 57.18 | 56.11 | 56.78 | 10023081 | EBAY |
| 186850 | 2013-02-13 | 56.86 | 57.26 | 56.41 | 57.05 | 9095970 | EBAY |
| 186851 | 2013-02-14 | 56.79 | 57.12 | 56.63 | 56.83 | 7054543 | EBAY |
| 186852 | 2013-02-15 | 56.81 | 57.15 | 56.41 | 56.7 | 9130168 | EBAY |
| 186853 | 2013-02-19 | 56.86 | 56.98 | 56.36 | 56.68 | 5701679 | EBAY |
| 186854 | 2013-02-20 | 56.9 | 57.1 | 55.48 | 55.53 | 7395537 | EBAY |
| 186855 | 2013-02-21 | 55.34 | 55.58 | 53.9 | 54.62 | 10735036 | EBAY |
| 186856 | 2013-02-22 | 54.96 | 55.13 | 54.57 | 55.02 | 5087109 | EBAY |


If we already know the data types of the DataFrame columns, we can specify them in the `type_schema` parameter so that non-time-series data are not processed unnecessarily.

In [53]:
# Specify which variables are time series
type_schema = {
    "open": "timeseries",
    "high": "timeseries",
    "low": "timeseries",
    "close": "timeseries",
    "volume": "timeseries",
    "Name": "categorical",
}
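
For wider tables, writing the schema by hand becomes tedious. As a rough sketch, the same kind of dictionary can be derived from the column dtypes. The helper name `build_type_schema` is hypothetical, and treating every numeric column as a time series is an assumption that suits this stock dataset but not every dataset:

```python
import pandas as pd


def build_type_schema(frame: pd.DataFrame, categorical=()) -> dict:
    """Map numeric columns to "timeseries" and the listed columns to "categorical"."""
    numeric_cols = frame.select_dtypes(include="number").columns
    schema = {col: "timeseries" for col in numeric_cols}
    # Explicitly listed columns override the numeric default
    schema.update({col: "categorical" for col in categorical})
    return schema


# Two rows copied from the eBay preview above, used only to demonstrate the helper
sample_rows = pd.DataFrame(
    {
        "date": ["2013-02-08", "2013-02-11"],
        "open": [56.46, 56.52],
        "close": [56.62, 56.41],
        "volume": [8066626, 5150867],
        "Name": ["EBAY", "EBAY"],
    }
)
schema = build_type_schema(sample_rows, categorical=["Name"])
print(schema)
```

The `date` column is stored as text here, so it is left out of the schema and remains available as the sort key.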

We can then generate a profiling report tailored to time series data by setting the `tsmode` parameter to `True`:

In [54]:
ts_profile = ProfileReport(df_ts,
                           tsmode=True,
                           type_schema=type_schema,
                           sortby="date",
                           title="eBay Time Series Profiling")

In [55]:
# View time series profiling report
ts_profile


From the output, we can see that the report includes checks specific to time series data, such as seasonality and stationarity.
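
To make one of these statistics concrete: autocorrelation measures how strongly a series correlates with a lagged copy of itself, and a peak at lag k hints at a cycle of length k. The sketch below uses pandas' `Series.autocorr` on synthetic data. The helper name `autocorrelations` and the toy series are made up for illustration, and this is not necessarily how the report computes its ACF plot:

```python
import pandas as pd


def autocorrelations(series: pd.Series, max_lag: int) -> pd.Series:
    """Return the lag-k Pearson autocorrelation for k = 1..max_lag."""
    values = {lag: series.autocorr(lag=lag) for lag in range(1, max_lag + 1)}
    return pd.Series(values, name="autocorr")


# Synthetic series that repeats every 4 steps (illustrative data only)
periodic = pd.Series([0.0, 1.0, 2.0, 1.0] * 10)
acf = autocorrelations(periodic, max_lag=8)
print(acf.round(3))
```

For this toy series the autocorrelation peaks at lags 4 and 8, which matches its period of 4.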

## Sensitive data
In certain data-sensitive contexts (for instance, private health records), sharing a report that includes actual data samples would violate privacy constraints. The following configuration (using the `sensitive` parameter) groups various options together so that only aggregate information is provided in the report and no individual records are shown:

In [56]:
report = df.profile_report(sensitive=True)

In [57]:
# View report
report


We can see that the output above no longer includes sample rows from the actual dataset.

## Profiling large datasets

For guidance on profiling large datasets, refer to the official documentation: https://ydata-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html
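
Two approaches commonly used for large datasets are profiling a random sample and switching on minimal mode (`minimal=True`), which turns off the most expensive computations. The following is a minimal sketch under stated assumptions: the helper names, the 10,000-row cap, and the seed are arbitrary choices for illustration, not recommendations from the library.

```python
import pandas as pd


def sample_for_profiling(frame: pd.DataFrame, max_rows: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Return `frame` unchanged if it is small enough, else a random sample of `max_rows` rows."""
    if len(frame) <= max_rows:
        return frame
    return frame.sample(n=max_rows, random_state=seed)


def profile_large(frame: pd.DataFrame, max_rows: int = 10_000):
    """Build a minimal-mode report on a sample of `frame` (requires ydata-profiling)."""
    # Imported lazily so the sampling helper works without ydata-profiling installed
    from ydata_profiling import ProfileReport

    sample = sample_for_profiling(frame, max_rows=max_rows)
    return ProfileReport(sample, minimal=True, title="Profiling Report (sampled, minimal)")


# Synthetic frame standing in for a large dataset
big = pd.DataFrame({"value": range(50_000)})
subset = sample_for_profiling(big, max_rows=1_000)
print(len(subset))
```

Sampling trades accuracy for speed, so rare categories or outliers may be missing from the report.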

## Customize appearance
`ydata-profiling` offers two major customization dimensions: 
- Styling of the HTML report 
- Styling of the visualizations and plots contained within

Given that the report is HTML-based, the following table shows the various parameters we can utilize to make changes to the report's appearance:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `html.minify_html` | boolean | `True` | If `True`, the output HTML is minified with the `htmlmin` package. Minifying refers to removing unnecessary characters from code (e.g., whitespace, comments) without affecting code functionality. This can reduce the HTML file size, resulting in faster loading. |
| `html.use_local_assets` | boolean | `True` | If `True`, all assets (stylesheets, scripts, images) are stored locally. If `False`, a CDN is used for some stylesheets and scripts. |
| `html.inline` | boolean | `True` | If `True`, all assets are contained in the report. If `False`, a web export is created, where all assets are stored in the `[REPORT_NAME]_assets/` directory. |
| `html.navbar_show` | boolean | `True` | If `True`, a navigation bar is included in the report. |
| `html.style.theme` | string | `None` | Defines the bootswatch theme. Available options: `'flatly'` (dark) and `'united'` (orange). |
| `html.style.logo` | string | `None` | Defines a base64-encoded logo to display in the navigation bar. |
| `html.style.primary_color` | string | `#337ab7` | Specifies the primary color to use in the report. |
| `html.style.full_width` | boolean | `False` | If `True`, the full width of the screen is used. If `False`, the width of the report is fixed. |

For example, we can modify the primary color of the report and include a logo within the navigation bar through the `html.style.primary_color` and `html.style.logo` settings.
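
A sketch of how such settings might be assembled is shown below. The nested dictionary mirrors the `html.*` keys in the table above. The helper names, the placeholder logo bytes, and the color value are hypothetical, and whether a given release accepts the `primary_color` key exactly as listed is an assumption worth checking against the installed version:

```python
import base64


def build_html_settings(logo_bytes: bytes, primary_color: str) -> dict:
    """Return a nested dict mirroring the `html.*` keys listed in the table above."""
    # The logo setting expects a base64-encoded string rather than raw bytes
    logo_b64 = base64.b64encode(logo_bytes).decode("ascii")
    return {
        "navbar_show": True,
        "style": {
            "logo": logo_b64,
            "primary_color": primary_color,  # key name taken from the table (assumption)
        },
    }


def styled_report(frame, html_settings: dict):
    """Create a report with custom HTML settings (requires ydata-profiling)."""
    # Imported lazily so the settings helper can be used on its own
    from ydata_profiling import ProfileReport

    return ProfileReport(frame, title="Styled Report", html=html_settings)


# Placeholder bytes stand in for the contents of a real image file
html_settings = build_html_settings(b"placeholder-logo-bytes", "#2a9d8f")
print(html_settings["style"]["primary_color"])
```

In practice, the logo bytes would come from reading an image file in binary mode before passing them to the helper.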

___
<a id='section5'></a>
# (5) Export reports
`ydata-profiling` supports flexible output formats: the full analysis can be exported as an HTML report that is easy to share with different parties, as JSON for integration into automated systems, or as a widget in a Jupyter Notebook.
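
As a minimal sketch of these export paths, the helper below writes a report to both HTML and JSON using `ProfileReport.to_file`, which selects the format from the file extension. The helper name `export_report` and the `reports` directory are assumptions for illustration:

```python
from pathlib import Path


def export_report(profile, out_dir, stem: str = "report"):
    """Write `profile` to HTML and JSON files and return both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    html_path = out_dir / f"{stem}.html"
    json_path = out_dir / f"{stem}.json"
    # to_file() picks the output format based on the file extension
    profile.to_file(html_path)
    profile.to_file(json_path)
    return html_path, json_path


# Usage (assumes `profile` is the ProfileReport created in section 3):
# html_path, json_path = export_report(profile, "reports")
# profile.to_widgets()  # render the report as a widget inside Jupyter
```

The HTML file suits sharing with colleagues, while the JSON file is easier to consume from scripts or pipelines.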