# Pandas Profiling - Titanic Dataset

*October 2019 | Hilary Goh, Perth - Australia*

---

Blog: https://towardsdatascience.com/speed-up-your-exploratory-data-analysis-with-pandas-profiling-88b33dc53625

EDA: Exploratory Data Analysis

"Instead of just giving you a single output, pandas-profiling enables its user to quickly generate a very broadly structured HTML file containing most of what you might need to know before diving into a more specific and individual data exploration." - Lukas Frei

The Titanic dataset is the ship's passenger manifest with an added 'Survived' variable. The ship carried roughly 2,200 passengers and crew; the file loaded below holds 887 passenger records.
Use pandas-profiling to examine the dataset.

https://www.kaggle.com/c/titanic

Examine the data for:

- Number of variables/features/classes
- Duplicate columns/rows
- NaNs/Missing values
- Variable type
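
The four checks listed above can also be done by hand in plain pandas, which makes it easier to read the matching sections of the profiling report. A minimal sketch, assuming a small made-up frame (`sample` is hypothetical and its values are not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic file; values are invented for illustration.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 22.0],
})

# Number of observations (rows) and variables (columns)
n_rows, n_cols = sample.shape

# Rows that repeat an earlier row exactly (the last row repeats the first)
n_duplicate_rows = int(sample.duplicated().sum())

# Columns whose values repeat an earlier column exactly
duplicate_columns = [
    later
    for i, later in enumerate(sample.columns)
    if any(sample[later].equals(sample[earlier]) for earlier in sample.columns[:i])
]

# NaN count and pandas dtype for each column
missing_per_column = sample.isna().sum()
column_types = sample.dtypes

print(n_rows, n_cols, n_duplicate_rows, duplicate_columns)
print(missing_per_column)
print(column_types)
```

Swapping `sample` for the loaded `data` frame should give the same kind of summary for the real file.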

---
### Pandas Profiling Package, V2.3.0

https://anaconda.org/conda-forge/pandas-profiling

License: MIT

Home: http://github.com/pandas-profiling/pandas-profiling

Development: http://github.com/pandas-profiling/pandas-profiling

Documentation: https://pandas-profiling.github.io/pandas-profiling/docs/

---
To install this package with conda run one of the following:

* conda install -c conda-forge pandas-profiling 
* conda install -c conda-forge/label/cf201901 pandas-profiling 
---

### Install pandas-profiling package

In [None]:
!pip install pandas-profiling

In [1]:
!pip show pandas-profiling #check which version

Name: pandas-profiling
Version: 2.3.0
Summary: Generate profile report for pandas DataFrame
Home-page: https://github.com/pandas-profiling/pandas-profiling
Author: Jos Polfliet, Simon Brugman
Author-email: pandasprofiling@gmail.com
License: MIT
Location: c:\users\hilary goh\anaconda3\lib\site-packages
Requires: pandas, htmlmin, matplotlib, missingno, astropy, phik, confuse, jinja2
Required-by: 


In [None]:
!pip install pandas-profiling --upgrade

### Load libraries

In [2]:
%matplotlib inline 

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd   

In [4]:
import pandas_profiling

### Load Titanic dataset

In [5]:
data = pd.read_csv('3_Day_MLDL/data/titanic.csv')
# the path is relative to the notebook's working directory; adjust it to wherever titanic.csv is saved

### Examine data

In [6]:
data.describe() #this is the typical pandas way

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 |
| mean | 0.385569 | 2.305524 | 29.471443 | 0.525366 | 0.383315 | 32.30542 |
| std | 0.487004 | 0.836662 | 14.121908 | 1.104669 | 0.807466 | 49.78204 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.25 | 0.0 | 0.0 | 7.925 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.1375 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
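
By default `describe()` summarises only the numeric columns, which is why Name and Sex are missing from the summary above. A small sketch of how text columns can be included, again on a made-up frame (`sample` and its values are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic file; not the real data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [36.0, 40.0, 25.0, 21.0],
})

# Default behaviour: numeric columns only, so just Age appears
numeric_summary = sample.describe()

# include="all" adds count, unique, top and freq for the text column
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example `mean` for Sex) are shown as NaN.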


#### Use Pandas Profiling instead

In [7]:
pandas_profiling.ProfileReport(data)
# will also provide warnings and labels
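
To get a feel for the kind of conditions such warnings point at, here is a hand-rolled sketch in plain pandas. It is not pandas-profiling's own logic: the function name `rough_flags`, the two thresholds, and the sample values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; not the real Titanic data.
sample = pd.DataFrame({
    "Name": ["Mr. A", "Mr. B", "Mrs. C", "Miss D"],
    "SibSp": [0, 0, 1, 0],
    "Fare": [0.0, 7.25, 71.28, 8.05],
})


def rough_flags(frame, zero_share=0.5, distinct_share=0.9):
    """Flag columns by simple rules; the thresholds are illustrative assumptions."""
    flags = {}
    for column in frame.columns:
        series = frame[column]
        notes = []
        if pd.api.types.is_numeric_dtype(series):
            # Share of values equal to zero in a numeric column
            if (series == 0).mean() > zero_share:
                notes.append("many zeros")
        elif series.nunique() / len(series) > distinct_share:
            # Nearly every value is distinct in a text column
            notes.append("high cardinality")
        if notes:
            flags[column] = notes
    return flags


flags = rough_flags(sample)
print(flags)
```

Running the same function on `data` gives a quick cross-check against the warnings shown at the top of the generated report.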



In [9]:
data[:8]  # first eight rows in plain pandas, for comparison with the report's sample

|  | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Lionel Leonard | male | 36.0 | 0 | 0 | 0.0 |
| 1 | 0 | 1 | Mr. William Harrison | male | 40.0 | 0 | 0 | 0.0 |
| 2 | 1 | 3 | Mr. William Henry Tornquist | male | 25.0 | 0 | 0 | 0.0 |
| 3 | 0 | 2 | Mr. Francis Parkes | male | 21.0 | 0 | 0 | 0.0 |
| 4 | 0 | 3 | Mr. William Cahoone Jr Johnson | male | 19.0 | 0 | 0 | 0.0 |
| 5 | 0 | 2 | Mr. Alfred Fleming Cunningham | male | 22.0 | 0 | 0 | 0.0 |
| 6 | 0 | 2 | Mr. William Campbell | male | 21.0 | 0 | 0 | 0.0 |
| 7 | 0 | 2 | Mr. Anthony Wood Frost | male | 37.0 | 0 | 0 | 0.0 |


## Export the report

In [14]:
profile = data.profile_report(title='Titanic Pandas Profiling Report')
profile.to_file(output_file="Titanic Pandas Profiling Report.html")