# Exploratory Data Analysis
This Jupyter notebook surveys several tools that can be used for exploratory data analysis (EDA). <br>

The <b>Titanic</b> dataset from Kaggle is used as the example. <br>

The following topics are covered:
<ol>
<li>Getting Data</li>
<li>Data Cleaning</li>
<li>Exploratory Data Analysis (EDA)
<ul>
<li>Traditional</li>
<li>TensorFlow Data Validation</li>
<li>Lux</li>
</ul>
</li>
<li>Takeaways</li>
</ol>

In [1]:
# import libraries
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow_data_validation as tfdv

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

import warnings
warnings.filterwarnings('ignore')

# 1. Getting Data
## Titanic dataset column information

Column description:
<ol>
<li><b>Survived</b> - Survival (0 = No; 1 = Yes)</li>
<li><b>Pclass</b>   - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)</li>
<li><b>Name</b>     - Name</li>
<li><b>Sex</b>      - Sex</li>
<li><b>Age</b>      - Age</li>
<li><b>SibSp</b>    - Number of Siblings/Spouses Aboard</li>
<li><b>Parch</b>    - Number of Parents/Children Aboard</li>
<li><b>Ticket</b>   - Ticket Number</li>
<li><b>Fare</b>     - Passenger Fare</li>
<li><b>Cabin</b>    - Cabin</li>
<li><b>Embarked</b> - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)</li>
</ol>


In [2]:
# import data into pandas
df = pd.read_csv("./Dataset/train.csv")

# check first 5 records
df.head(n=5)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# 2. Data Cleaning

## Handling missing data


In [3]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
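
Raw null counts are easier to judge as a share of all rows. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the column names `n_missing` and `pct_missing` are assumptions for illustration, not part of the notebook); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# Count and percentage of missing values per column.
# isnull().mean() is the fraction of True values in each column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```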

### 1. Handling null values in the <b>Age</b> column
<ol>
<li>We could replace every null Age with the overall mean age.</li>
<li>Here, however, each null is replaced with the mean age of passengers who share the same Sex and Pclass.</li>
</ol>

In [4]:
# Age
# Number of missing Age values in each passenger class
df[df.Age.isnull()].Pclass.groupby(df.Pclass).count()

Pclass
1     30
2     11
3    136
Name: Pclass, dtype: int64

In [5]:
#mean calculation
mean_1_male = df.loc[(df.Age.notnull()) & (df.Pclass == 1) & (df.Sex == 'male')].Age.mean()
mean_2_male = df.loc[(df.Age.notnull()) & (df.Pclass == 2) & (df.Sex == 'male')].Age.mean()
mean_3_male = df.loc[(df.Age.notnull()) & (df.Pclass == 3) & (df.Sex == 'male')].Age.mean()

mean_1_female = df.loc[(df.Age.notnull()) & (df.Pclass == 1) & (df.Sex == 'female')].Age.mean()
mean_2_female = df.loc[(df.Age.notnull()) & (df.Pclass == 2) & (df.Sex == 'female')].Age.mean()
mean_3_female = df.loc[(df.Age.notnull()) & (df.Pclass == 3) & (df.Sex == 'female')].Age.mean()

#function for age data cleaning
def replaceAge(dataframe):
    #fill null value with mean based on sub-group
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==1) & (dataframe['Sex']=='male')), mean_1_male, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==2) & (dataframe['Sex']=='male')), mean_2_male, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==3) & (dataframe['Sex']=='male')), mean_3_male, dataframe['Age'])

    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==1) & (dataframe['Sex']=='female')), mean_1_female, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==2) & (dataframe['Sex']=='female')), mean_2_female, dataframe['Age'])
    dataframe['Age'] = np.where(((dataframe.Age.isnull()) & (dataframe['Pclass']==3) & (dataframe['Sex']=='female')), mean_3_female, dataframe['Age'])

    return dataframe

# cleaned data (replaceAge modifies df in place, so df_Cleaned and df refer to the same DataFrame)
df_Cleaned = replaceAge(df)

# Age no longer has null values
df_Cleaned.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
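
The six sub-group means above can also be expressed with a single `groupby`. The following is a minimal sketch on a hypothetical toy frame; the helper name `fill_age_by_group` and the `toy` data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two (Pclass, Sex) sub-groups
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

def fill_age_by_group(frame):
    """Return a copy with missing Age filled by the (Pclass, Sex) group mean."""
    out = frame.copy()  # leave the caller's frame untouched
    # transform("mean") returns one value per row, aligned with the original index;
    # the mean skips NaN by default
    group_mean = out.groupby(["Pclass", "Sex"])["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(group_mean)
    return out

filled = fill_age_by_group(toy)
print(filled)
```

Note one difference in behaviour: this sketch computes the means from the frame it receives, whereas `replaceAge` reuses the means computed from `df`, which matters if the same function is later applied to a separate test set.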

### 2. Handling null values in the <b>Cabin</b> column
<ol>
<li>Missing cabins are filled with the placeholder value X.</li>
</ol>
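
A placeholder is only safe if it cannot be confused with a real value. The following is a minimal sketch on a hypothetical toy column (the `toy` data and the `PLACEHOLDER` name are assumptions for illustration) that checks for a collision before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for the Cabin data
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123"]})

PLACEHOLDER = "X"  # same marker the notebook uses

# Make sure the marker does not already occur as a real cabin value
assert PLACEHOLDER not in set(toy["Cabin"].dropna())

# Assign the result back rather than calling fillna(inplace=True) on a column selection
toy["Cabin"] = toy["Cabin"].fillna(PLACEHOLDER)
print(toy["Cabin"].value_counts())
```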

In [6]:
# Cabin has 687 missing values; count them per passenger class
df_Cleaned[df_Cleaned['Cabin'].isnull()].Pclass.groupby(df_Cleaned['Pclass']).count()

Pclass
1     40
2    168
3    479
Name: Pclass, dtype: int64

In [7]:
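# assign back instead of inplace=True on a column selection, which newer pandas versions do not apply to df_Cleaned
df_Cleaned['Cabin'] = df_Cleaned['Cabin'].fillna('X')

# Cabin no longer has null values
df_Cleaned.isnull().sum() # All Cabin null values replaced by X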

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       2
dtype: int64

### 3. Handling null values in the <b>Embarked</b> column
Two possibilities:
<ol>
<li>Drop the rows with a null value.</li>
<li>The missing values belong to 1st-class passengers, so replace them with the port where most 1st-class passengers boarded (S).</li>
</ol>
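
The two options can be compared side by side. The following is a minimal sketch on a hypothetical toy frame; the `toy` data and the variable names are assumptions for illustration, and `mode()` is used here as one way to find the most frequent port.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one first-class passenger has no Embarked value
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3],
    "Embarked": ["S", "S", "C", np.nan, "Q", "Q"],
})

# Option 1: drop rows whose Embarked value is missing
dropped = toy.dropna(subset=["Embarked"])

# Option 2: fill with the most frequent port among first-class passengers
# (mode() ignores NaN by default)
first_class_mode = toy.loc[toy["Pclass"] == 1, "Embarked"].mode()[0]
filled = toy.assign(Embarked=toy["Embarked"].fillna(first_class_mode))

print(len(dropped), first_class_mode)
print(filled)
```

Option 1 loses rows, while option 2 keeps every row at the cost of an assumed value.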

In [8]:
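# most common embarkation port among first-class passengers (value_counts() ignores nulls)
top_port_first_class = df_Cleaned.loc[df_Cleaned.Pclass == 1, 'Embarked'].value_counts().index[0]
df_Cleaned['Embarked'] = df_Cleaned['Embarked'].fillna(top_port_first_class)

df_Cleaned.isnull().sum() # No null values remain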

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64