# Process

Here we will process the Titanic dataset.  We will look for:
* Missing Data
* Data outliers

### Part 1: Missing Data
First, let's load the libraries and data.

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Workshop Functions
import sys
sys.path.append('..')
from Wksp722_functions import * 

In [None]:
df = pd.read_csv("titanic_train.csv")
df.head(2)

As we did in a previous lecture, let's set the index to ***PassengerId***.

In [None]:
df.set_index('PassengerId', inplace=True)

Now let's start by looking for missing data

In [None]:
df.isnull().sum()
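Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to get the percentage of missing values per column. The `toy` dataframe and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic dataframe (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column that is mostly empty is a candidate for dropping, while a column with only a few gaps is a candidate for dropping rows or filling values.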

Only 2 passengers are missing a value for "Embarked", so we will drop them from the dataframe.

In [None]:
print(df.shape) # size of df before 
df = df.loc[df.loc[:,'Embarked'].notnull()]
print(df.shape) # size of df after
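The same filtering can also be written with `dropna` and its `subset` argument, which some readers find easier to scan. This is a minimal sketch on a made-up `toy` dataframe, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic dataframe (values are made up)
toy = pd.DataFrame(
    {"Embarked": ["S", None, "C"], "Age": [22.0, 38.0, np.nan]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Drop only the rows where 'Embarked' is missing; missing ages are left alone
toy_kept = toy.dropna(subset=["Embarked"])
print(toy.shape, "->", toy_kept.shape)
```

Restricting `dropna` to one column matters here, because a plain `dropna()` would also discard every row with a missing age.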

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.isna())
plt.ylabel('Passenger ID')
plt.show()

"Cabin" stands for the cabin number or the room number on the ship.  

We can hypothesize that location on the ship is not strongly correlated with whether a passenger survived.  

Other factors such as the passenger's wealth, which may be related to their Cabin type, can be deduced from other variables such as 'Pclass', and 'Fare'.  

So we will remove this column from the dataframe

In [None]:
df.drop(['Cabin'], axis=1, inplace=True)
df.head()

In [None]:
df.describe()

## If we wanted to replace all NaNs with the median value, use this code:
#temp = df.copy()
#temp['Age'] = temp['Age'].fillna(temp['Age'].median())
#temp.isnull().sum()

Now let's theorize how to replace the missing ages.  A glance at the names shows that each name has a salutation such as "Mr.", "Mrs.", "Miss", "Master", etc.  It is a reasonable assumption that some salutations are correlated to age.  For instance, 
* "Miss" would generally be younger than a "Mrs.".  
* "Master" (salutation for a small child) would be younger than a "Mr."

So let's find the median age for each salutation and then replace missing age values with that for the passenger's respective category

Split the strings in the Name column into separate names.

Using the n=3 option restricts the splitting to at most 3 splits, which gives up to 4 sub-strings.  Some people have very long names with double first names, and we are only interested in the 2nd column, which holds the salutation.

Also using the "expand=True" option, we get the resulting strings back as separate columns, which makes accessing the data easier.  

In [None]:
split_name = df.loc[:,'Name'].str.split(n=3, expand=True)

In [None]:
# Store the salutation (2nd sub-string of each name) in a new column
df.loc[:,'Salutation']=split_name[1]
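To make the split concrete, here is a small sketch on three names written in the dataset's "Surname, Salutation. First names" style. It also shows a limitation: when the surname contains a space, the 2nd sub-string is part of the surname rather than the salutation. The regex variant is an alternative offered as an assumption, not the approach this notebook uses.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",
])

# Whitespace split: column 1 is the salutation only for one-word surnames
by_split = names.str.split(n=3, expand=True)[1]

# Regex alternative: capture the text between the comma and the next period
by_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip() + "."

print(pd.DataFrame({"split": by_split, "regex": by_regex}))
```

Checking the passenger counts per salutation is a quick way to spot values that look like surnames rather than titles.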

In [None]:
#  Count the number of passengers for each salutation
df.groupby('Salutation').count().loc[:,'Name']

In [None]:
# Now let's calculate the median age for each salutation
# But first let's filter out the entries that have null values in the "Age" column
df_clean = df.loc[pd.notna(df.loc[:,'Age']),:]
df_clean.head()

### Uncomment the code below to verify that all null records in the "Age" column were removed
# df_clean['Age'].isnull().sum()

In [None]:
# Now let's calculate the median_age per salutation
median_age = df_clean.groupby('Salutation').median(numeric_only=True).loc[:,'Age']
print(median_age)

Now go back to the original dataframe and replace any missing values in the "Age" column with their respective median values

In [None]:
df.loc[:,'Age'].isnull().sum()

In [None]:
for ind in df.index:
    if np.isnan(df.loc[ind,'Age']): 
        df.loc[ind,'Age'] = median_age[df.loc[ind,'Salutation']]

***Curiosity Points (10 points)***
Use Method Chaining to replace the for loop above.  Verify that the resulting dataframe is the same as that with the for loop.
(Hint) use the ***index*** and ***map*** dataframe functions

If you're stuck and want to see one possible solution, run the code below.  Remove the '#' and then run

In [None]:
# M3L2_1()

### Part 2: Outliers

In [None]:
df.head()

Reviewing the current dataframe, we see that we do not need to consider the Survived, Pclass, Name, Sex, Ticket, and Embarked columns for outlier analysis.  This is because the columns either have unique values for each passenger (e.g. Name, Ticket) or have too few categories to warrant an analysis (e.g. Survived, Pclass, Sex, Embarked).  We will examine these later.   

We also cannot use the boxplot function for Salutation, as it requires numeric values and there is no meaningful order to assign to the values in this column.  However, earlier in this exercise we saw the count of passengers for each salutation.  

In [None]:
# Review the Age column
sns.boxplot(y=df.loc[:,'Age'])
plt.show()

# We see several outliers over 55 at the top of the boxplot.  These are outliers, but they are likely correctly recorded.
# We will keep these records in the dataframe and count how many there are:
print('Number of Age outliers are: ', df.loc[df.loc[:,'Age']>55 , 'Name'].count())
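For context on where a boxplot draws the line: by default, matplotlib and seaborn mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. The sketch below computes those limits by hand. The helper name `iqr_bounds` and the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits; points outside them count as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = pd.Series([2, 19, 22, 24, 28, 30, 35, 40, 71])  # made-up ages
lower, upper = iqr_bounds(ages)

# Keep only the values that fall outside the limits
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, list(outliers))
```

Applying the same helper to a real column gives the numeric threshold behind the points drawn individually on its boxplot.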

In [None]:
# SibSp is the number of siblings and spouses that the passenger had on the Titanic.
sns.boxplot(y=df['SibSp'])
plt.show()

SibSpSurvivors = df.loc[df.loc[:,'SibSp']>2 , ['SibSp','Survived']]
Sorted_SibSpSurvivors = SibSpSurvivors.sort_values('SibSp')
print(Sorted_SibSpSurvivors)

We see that most passengers had either 1 or 0 siblings or spouses on board.  One big outlier here is a family whose members each have a SibSp value of 8, none of whom survived.  

Next, we can look at the Parch column, which states the number of parents or children on board.  This will be very similar to the SibSp column.

In [None]:
# Parch is the number of parents and children that the passenger had on the Titanic.
sns.boxplot(y=df.loc[:,'Parch'])
plt.show()

Here the boxplot shows that most passengers have a Parch value of 0.  

In [None]:
sns.boxplot(y=df.loc[:,'Fare'])
plt.show()

This boxplot is interesting.  It flags fares higher than $62 as outliers.  However, many passengers paid much more, probably for higher-class lodging.  Let's see the range of values grouped by class.

The boxplots below show that most of the largest outliers in fare came from First Class, which is expected.  

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=3)
axes[0].boxplot(df.loc[df.loc[:,'Pclass']==1 , 'Fare'])
axes[1].boxplot(df.loc[df.loc[:,'Pclass']==2 , 'Fare'])
axes[2].boxplot(df.loc[df.loc[:,'Pclass']==3 , 'Fare'])
axes[0].set_title('First Class')
axes[1].set_title('Second Class')
axes[2].set_title('Third Class')
plt.show()

### Part 3: Exploratory Data Analysis
Let's dive a little deeper before heading to model building, to see if certain factors improved a passenger's chance of survival.

We learned from the Titanic movie (though fictional) that women and children were allowed on lifeboats first.  Does the data confirm this?

In [None]:
# Number of passengers of each sex
totalWomen = (df.loc[:,'Sex']=='female').sum()
totalMen = (df.loc[:,'Sex']=='male').sum()

WomenSurvived = df.loc[(df.loc[:,'Survived']==1) & (df.loc[:,'Sex']=='female') , 'Name'].count()
MenSurvived = df.loc[(df.loc[:,'Survived']==1) & (df.loc[:,'Sex']=='male') , 'Name'].count()

# Survival rate within each sex, in percent
PcntWomenSurvived = 100 * WomenSurvived / totalWomen
PcntMenSurvived = 100 * MenSurvived / totalMen

print(PcntWomenSurvived, PcntMenSurvived)

***Curiosity Points (5 Points)***
Using the ***groupby*** function, recreate the analysis above on percent of passengers survived based on their sex.  

If you're stuck and want to see one possible solution, run the code below.  Remove the '#' and then run

In [None]:
# M3L2_2()

***Curiosity Points (5 Points)***
Using the ***groupby*** function, what percentage of each class survived and was that a significant distinguisher for survival?

If you're stuck and want to see one possible solution, run the code below.  Remove the '#' and then run

In [None]:
# M3L2_3()

Let's look at histograms of age, split by survival, to see if younger passengers were more likely to survive.  

In [None]:
AgeSurvived = df.loc[df.loc[:,'Survived']==1 , 'Age']
AgeNotSurvived = df.loc[df.loc[:,'Survived']==0 , 'Age']

fig, axes = plt.subplots(figsize=(15,4))
sns.histplot(x=df.loc[:,'Age'],hue=df.loc[:,'Survived'], kde=True, bins=20)

plt.show()

In [None]:

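# Use the same bin edges for both groups so the counts are comparable bin by bin
binEdges = np.histogram_bin_edges(df.loc[:,'Age'], bins=16)
survivedCounts, _ = np.histogram(AgeSurvived, bins=binEdges)
notSurvivedCounts, _ = np.histogram(AgeNotSurvived, bins=binEdges)
print(binEdges)

# Percent of passengers in each age bin who survived
pcntCounts = survivedCounts/(survivedCounts+notSurvivedCounts)*100
print(pcntCounts)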
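One thing to check when dividing two histograms is that both were counted over the same bins. When `np.histogram` is given only a bin count, it derives the edges from each array's own minimum and maximum, so two groups with different age ranges get different bins. The sketch below uses made-up ages to show the mismatch and one way to share edges with `np.histogram_bin_edges`.

```python
import numpy as np

survived = np.array([1, 4, 25, 30, 62, 80])        # made-up ages
not_survived = np.array([20, 22, 35, 40, 45, 74])  # made-up ages

# Separate calls pick edges from each array's own min and max
_, edges_a = np.histogram(survived, bins=4)
_, edges_b = np.histogram(not_survived, bins=4)
print(np.allclose(edges_a, edges_b))

# Shared edges computed from all ages make the bins line up
edges = np.histogram_bin_edges(np.concatenate([survived, not_survived]), bins=4)
s_counts, _ = np.histogram(survived, bins=edges)
n_counts, _ = np.histogram(not_survived, bins=edges)

# Percent of each bin's passengers who survived
rate = 100 * s_counts / (s_counts + n_counts)
print(edges, rate)
```

A bin with no passengers in either group would produce a division by zero, so it is worth checking the combined counts before computing the rate.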

In [None]:
fig, ax = plt.subplots()
ax.stem(binEdges[1:], pcntCounts)
ax.set_title('Survival Rate Per Age Group')
ax.set_xlabel('Age')
ax.set_ylabel('Survival Rate (%)')
plt.show()

We see from this chart that a higher percentage of younger passengers survived the Titanic.  

***Curiosity Points (5 Points)*** Check out different notebooks on Kaggle.com to see how other data scientists have explored this data.  

Full list is here: https://www.kaggle.com/competitions/titanic/code?competitionId=3136&sortBy=voteCount

One good example is here: https://www.kaggle.com/code/ash316/eda-to-prediction-dietanic

### Save cleaned dataset

In [None]:
df.to_csv('titanic_train_cleaned.csv')
median_age.to_csv('median_age.csv')
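Because `to_csv` writes the index as the first column, the saved file can be loaded back with ***PassengerId*** as the index by passing `index_col` to `read_csv`. This is a minimal round-trip sketch. The `toy` dataframe is made up, and an in-memory buffer stands in for the file on disk.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned dataframe (values are made up)
toy = pd.DataFrame(
    {"Age": [22.0, 38.0], "Salutation": ["Mr.", "Mrs."]},
    index=pd.Index([1, 2], name="PassengerId"),
)

buffer = io.StringIO()   # in-memory stand-in for a file on disk
toy.to_csv(buffer)       # the index is written as the first column
buffer.seek(0)           # rewind so the buffer can be read from the start

# index_col restores PassengerId as the index instead of a regular column
restored = pd.read_csv(buffer, index_col="PassengerId")
print(restored)
```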

### Repeat for titanic_test.csv
Go back to the top of the notebook and repeat the steps with titanic_test.csv.