<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@acca-logo.jpg" alt="ACCA logo" style="width: 400px;"/>

# Python for data analysis
## Part 6 - One more thing

* **Course:** __Machine learning with Python for finance professionals__ by ACCA
* **Instructor:** [Coefficient](https://coefficient.ai) / [@CoefficientData](https://twitter.com/CoefficientData)

---

In this module you have seen how powerful pandas is for manipulating data efficiently, and how combining it with Seaborn or matplotlib lets us produce insightful visualisations for analysis.

Many Python modules extend the functionality of others. In this notebook we demonstrate one that is invaluable for quickly producing detailed reports and automated analysis of any pandas DataFrame.

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
pandas-profiling
</h2><br>
</div>

<a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/"><img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@pandas-profiling.png" alt="pandas-profiling" style="width: 800px;"/></a>

**[pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/)** is a Python module that performs automated exploratory data analysis with very little code. It generates a report from a pandas DataFrame showing key insights and visualisations, which can be shared as an interactive document so that stakeholders can explore the data without any coding skills. Newer releases of the package are published as `ydata-profiling`, and `ProfileReport` is then imported from `ydata_profiling`.

For data scientists, `pandas-profiling` is an invaluable tool for gaining quick insights into unfamiliar datasets. This is a critical part of any machine learning project: it highlights potential issues (e.g. missing values, outliers, collinearity) and opportunities (e.g. correlations, distribution fitting), and it guides the approach we may want to take. For example, understanding non-linearities and the relationships between features helps to identify which models may be appropriate for a given problem.
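
To make these checks concrete, the sketch below computes three of them by hand with plain pandas. The DataFrame `toy`, its column names and its values are made up for illustration and are not taken from the hotel data. A profiling report automates checks of this kind for every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'nights' has a missing value and 'spend' an obvious outlier
toy = pd.DataFrame(
    {
        "nights": [1, 2, 3, np.nan, 5, 6],
        "guests": [2, 4, 6, 8, 10, 12],
        "spend": [100.0, 210.0, 290.0, 405.0, 5000.0, 610.0],
    }
)

# Missing values per column
missing_counts = toy.isna().sum()

# Pairwise Pearson correlations (rows with a missing value are skipped per pair)
correlations = toy.corr()

# Flag outliers in one column with the 1.5 * IQR rule
q1, q3 = toy["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["spend"] < q1 - 1.5 * iqr) | (toy["spend"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(correlations.round(2))
print(outliers)
```

Here `nights` and `guests` are perfectly correlated, which is the kind of collinearity a report would flag, and the 1.5 × IQR rule is only one of several possible outlier definitions.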

---

In [10]:
# Import modules as required
import numpy as np
import pandas as pd

# pandas-profiling was renamed to ydata-profiling, so try the new name first
try:
    from ydata_profiling import ProfileReport
except ImportError:
    from pandas_profiling import ProfileReport

In [11]:
# Let's read in the Dream Destination hotel data.
orders = pd.read_excel(
    # Forward slashes work on Windows as well as on macOS and Linux
    "Source/Hotel Industry - Orders Database - 2019.xlsx", sheet_name="Order Database"
)

orders.head()

| Unnamed: 0 | Booking ID | Date of Booking | Year | Time | Customer ID | Gender | Age | Origin Country | State | Location | Destination Country | Destination City | No. Of People | Check-in date | No. Of Days | Check-Out Date | Rooms | Hotel Name | Hotel Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DDID57035 | 2019-01-01 | 2019.0 | 13:23:47 | ID10297 | Female | 51 | Indonesia | Tambora | Jakarta | Ireland | Tallaght | 2 | 2019-03-24 | 1 | 2019-03-25 | 1 | Blooming Bed And Breakfast | 4.2 |
| 1 | DDSG57036 | 2019-01-01 | 2019.0 | 16:14:22 | SG10307 | Male | 46 | Singapore | Central | Novena | Maldives | Viligili | 4 | 2019-01-15 | 2 | 2019-01-17 | 2 | Four Points | 4.3 |
| 2 | DDMY57037 | 2019-01-01 | 2019.0 | 09:49:48 | MY10283 | Female | 25 | Malaysia | Johor | Johor Bahru | Canada | North York | 5 | 2019-01-16 | 9 | 2019-01-25 | 3 | Hotel Joy Stick | 3.8 |
| 3 | DDSG57038 | 2019-01-01 | 2019.0 | 11:46:28 | SG10308 | Male | 22 | Singapore | North-East | Hougang | Maldives | Fuvahmulah | 5 | 2019-01-18 | 1 | 2019-01-19 | 3 | Classio Hotel | 3.7 |
| 4 | DDID57039 | 2019-01-01 | 2019.0 | 13:57:50 | ID10298 | Male | 45 | Indonesia | Bekasi | West Java | France | Nice | 7 | 2019-01-02 | 1 | 2019-01-03 | 4 | Adam Lake B&B | 4.5 |


In [12]:
# Let's now generate a report with pandas-profiling
profile = ProfileReport(orders, title="Hotel Industry - Orders Report - 2019")


In [None]:
profile.to_notebook_iframe()

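In [None]:
# If importing ProfileReport fails, first remove the deprecated package
!pip uninstall -y pandas-profiling

In [None]:
# Then install its successor (the quotes keep the shell from interpreting the brackets)
!pip install "ydata-profiling[notebook]"

In [None]:
# Confirm which version is installed, then restart the kernel and re-run the imports
!pip show ydata-profiling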

---

<div class="alert alert-block alert-warning">
<b><i class="fa fa-check-square" aria-hidden="true"></i>&nbsp; Check</b><br>
pandas-profiling can take a long time to process large datasets. However, there are methods to reduce the amount of information analysed in order to profile large datasets more efficiently.
</div>

---

The following examples showcase two methods to reduce processing when working with large datasets.

In [8]:
# By calling the sample method on the orders DataFrame, we can choose the number of rows to report on
small_profile = ProfileReport(orders.sample(n=100), title="Hotel Industry - Orders Report - 2019")
small_profile


In [3]:
# Alternatively, ProfileReport accepts minimal=True, which produces a simplified report by skipping the most expensive computations
minimal_profile = ProfileReport(orders, title="Hotel Industry - Orders Report - 2019", minimal=True)
minimal_profile

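
Sampling can be made safer and repeatable with a small helper along the lines of the sketch below. The function name `sample_for_profiling`, the `max_rows` cap and the seed value are illustrative assumptions rather than part of pandas-profiling; only `DataFrame.sample` and its `random_state` argument come from pandas. The `demo` DataFrame is stand-in data.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Return at most `max_rows` randomly chosen rows of `df` (hypothetical helper)."""
    if len(df) <= max_rows:
        # Small enough already: profile the full DataFrame.
        # (Asking sample() for more rows than exist would raise a ValueError.)
        return df
    # A fixed random_state returns the same rows on every run,
    # so the report can be regenerated consistently
    return df.sample(n=max_rows, random_state=seed)


# Stand-in data; in the notebook you would pass `orders` instead
demo = pd.DataFrame({"value": range(1_000)})
subset = sample_for_profiling(demo, max_rows=100)
print(len(subset))
```

The result can be passed to `ProfileReport` in place of `orders.sample(n=100)`. Bear in mind that a random sample may miss rare values, so the statistics in the report are estimates for the full dataset.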

---

Viewing the report inside the Jupyter notebook is convenient for exploring the results, but to distribute it we need to save it as an interactive HTML document:

In [4]:
# Save the report to a file
profile.to_file('Hotel Industry - Orders Report - 2019.html')


This can then be sent to stakeholders and viewed in any web browser (right-click the file and open in your browser of choice).

---
<div class="alert alert-block alert-success">
<b>🎉 Congratulations</b><br>
You have reached the end of this module.
</div>