### CCBM 2023 Summer Programming Workshop
# Day 8 - Segment 2

### Machine Learning Mini-Challenge
Use the data in the file "titanic_dataset.csv" (adapted from Kaggle) to predict survival on the Titanic using logistic regression and the data preparation methods described above. Evaluate the model by analyzing its accuracy.

The variables in the dataset consist of:
- PassengerID - Identification number
- Survived - Survival (1 = Yes, 0 = No)
- PClass - Passenger Class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name - Name
- Sex - Gender
- Age - Age
- SiblingSpouse - Number of siblings and spouses on board with the passenger
- ParentChild - Number of parents and children on board with the passenger
- Ticket - Ticket number
- Fare - Fare
- Cabin - Cabin
- Embarked - Geographic location from which the passenger departed

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing

In [None]:
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

# Import libraries to score our predictive models
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score

# Import the encoder used to one-hot encode our categorical variables
from sklearn.preprocessing import OneHotEncoder

In [None]:
%matplotlib inline
rcParams['figure.figsize'] = 5, 4
sns.set_style('whitegrid')

In [None]:
titanic = pd.read_csv("data/titanic_dataset.csv")
titanic

In [None]:
# Assess the dataframe for missing values and data types
titanic.info()

In [None]:
# Confirm the target variable is binary
# (display() is needed because a cell only shows its last expression)
display(titanic["Survived"].unique())

# Is survival a balanced class?
display(titanic["Survived"].value_counts())

# Use seaborn's countplot to observe this
sns.countplot(x="Survived", data=titanic)

In [None]:
# Drop irrelevant columns. 
# Assume the PassengerID, Name and Ticket are not relevant to survival
df = titanic.drop(columns=['PassengerId', 'Name', 'Ticket'])

In [None]:
# Assess the dataset and check for missing values
df.info()
df.isnull().sum()

In [None]:
# The "Cabin" column contains too many null values. Drop it.
df.drop(columns="Cabin", inplace=True)
df.info()

In [None]:
# Impute missing values in the "Age" variable
df.hist(column="Age");

In [None]:
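# Use the ParentChild variable to guide imputing Age
# numeric_only=True skips the text columns (Sex, Embarked), which cannot be averaged
df.groupby(by=df.ParentChild).mean(numeric_only=True)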

In [None]:
# Should we impute according to the ParentChild AND Sex variables?
display(df.groupby(by=[df.ParentChild, df.Sex]).mean(numeric_only=True))

df[["ParentChild", "Sex"]].value_counts()

Passengers traveling with three or more parents and children make up less than 2% of the passengers in this dataset. Let's keep it simple: ignore gender and impute each missing Age with the mean Age of passengers who have the same ParentChild value.
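
As a point of comparison, here is a minimal sketch of the same idea (fill a missing Age with the mean Age of the passenger's ParentChild group) using `groupby(...).transform("mean")`. The `toy` dataframe and the `Age_imputed` column are made up for illustration and are not part of the workshop dataset. Note that a group whose ages are all missing would stay null under this approach.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this sketch (not the Titanic dataset)
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 10.0, 20.0, np.nan, 40.0],
    "ParentChild": [0, 0, 1, 1, 1, 0],
})

# Mean Age within each ParentChild group, aligned row by row with toy
group_mean_age = toy.groupby("ParentChild")["Age"].transform("mean")

# Fill only the missing ages; recorded ages are left unchanged
toy["Age_imputed"] = toy["Age"].fillna(group_mean_age)
print(toy)
```

The function-based approach used in this notebook hard-codes rounded group means instead, which makes the imputed values easy to read at the cost of not updating if the data change.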

In [None]:
# Write a function to impute values for Age according to presence of null and
# the ParentChild value

# Approximate mean Age for each ParentChild value
MEAN_AGE_BY_PARENT_CHILD = {0: 32, 1: 24, 2: 17, 3: 33, 4: 45, 5: 39}

def approx_age(row):
    # Look values up by column name rather than by position
    age = row["Age"]
    parent_child = row["ParentChild"]

    # Keep the recorded age when there is one
    if not pd.isnull(age):
        return age

    # Otherwise use the group mean (43 for any other ParentChild value)
    return MEAN_AGE_BY_PARENT_CHILD.get(parent_child, 43)

In [None]:
# Use .apply() with axis=1 to run approx_age on each row and overwrite the Age column
df['Age'] = df[['Age', 'ParentChild']].apply(approx_age, axis=1)

# Check for the existence of null values in the dataframe
df.isnull().sum()

In [None]:
# Drop the remaining two null values in the Embarked column
df.dropna(inplace=True)

display(df.head())
df.info()

In [None]:
df.index

In [None]:
# Reset the index to more accurately reflect the cleaned dataset
df.reset_index(inplace=True, drop=True)

print(df.index)
list(df.index)

Now that the null values have been removed from the dataset, we need to encode the remaining categorical variables as dummy variables (1,0). The two remaining categorical variables are:
- Sex
- Embarked

We can use the pandas .replace() method to encode "Sex" with dummy variables (1,0).
We'll need to one-hot encode the "Embarked" variable.
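
To make the two kinds of encoding concrete, here is a small sketch on made-up data. It uses `pd.get_dummies` as a compact alternative to the `OneHotEncoder` workflow in the cells that follow; the `toy` dataframe and the `encoded` name are assumptions for illustration only.

```python
import pandas as pd

# Toy data, invented for this sketch (not the Titanic dataset)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["Southampton", "Cherbourg", "Queenstown", "Southampton"],
})

# Binary variable: a single 1/0 column carries all the information
toy["male"] = (toy["Sex"] == "male").astype(int)

# Three-category variable: one indicator column per category,
# with columns in alphabetical order
embarked_dummies = pd.get_dummies(toy["Embarked"], dtype=int)

encoded = pd.concat([toy[["male"]], embarked_dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 among the three Embarked columns, which is what "one-hot" refers to.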

In [None]:
# Confirm that the "Sex" and "Embarked" variables are categorical and assess the number of
# categories for each
df.Sex.unique(), df.Embarked.unique()

In [None]:
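# Convert male/female to 1/0 in the "Sex" feature
# Assign the result back: calling replace(inplace=True) on a selected column
# may not update df in newer pandas versions
df["Sex"] = df["Sex"].replace({"male": 1, "female": 0})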
df.head()

In [None]:
# Don't forget to rename the "Sex" column to "male" since "1" means the passenger is male and
# "0" means the passenger is female
df.rename(columns = {"Sex": "male"}, inplace=True)
df.head()

In [None]:
# With the "Embarked" variable, we can encode the categories, in alphabetical order, as
# "Cherbourg" = 0, "Queenstown" = 1, and "Southampton" = 2, and then one-hot encode the
# result since it is a multinomial variable.
# replace() returns a new Series here, so df itself is left untouched
embarked = df["Embarked"].replace({"Cherbourg": 0, "Queenstown": 1, "Southampton": 2})

# One-hot encode the embarked Series
embarked_encoder = OneHotEncoder(categories="auto")
embarked_ohe = embarked_encoder.fit_transform(embarked.to_numpy().reshape(-1,1))
embarked_ohe

embarked_mx = embarked_ohe.toarray()
embarked_mx[:10]

# Convert the array into a pandas dataframe
embarked_df = pd.DataFrame(embarked_mx, columns = ["Cherbourg", "Queenstown", "Southampton"])
embarked_df

In [None]:
# Drop the "Embarked" column in the original titanic dataframe, df
# and concatenate embarked_df into the original

df.drop(columns="Embarked", inplace=True)
df

In [None]:
titanic_df = pd.concat([df, embarked_df], axis=1)
titanic_df

In [None]:
# Check for independence between the explanatory variables
# (all rows, every column except the target "Survived" in column 0)
titanic_df.iloc[:, 1:].corr()

In [None]:
# Visualize the correlations between the explanatory variables as a heatmap
# (all rows, every column except the target "Survived" in column 0)
sns.heatmap(titanic_df.iloc[:, 1:].corr())

In [None]:
# Drop the Cherbourg feature since there is collinearity between it and Southampton
titanic_df.drop(columns="Cherbourg", inplace=True)
titanic_df
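
Why is it safe to drop one of the one-hot columns? A short sketch on made-up rows: because every passenger has exactly one port, the three indicator columns always sum to 1, so any one of them can be rebuilt from the other two. The `one_hot` array below is an assumption for illustration, with columns in the order Cherbourg, Queenstown, Southampton.

```python
import numpy as np

# Made-up one-hot rows; columns are Cherbourg, Queenstown, Southampton
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

# Every row contains exactly one 1, so each row sums to 1
row_sums = one_hot.sum(axis=1)

# Cherbourg is fully determined by the other two columns
cherbourg_rebuilt = 1 - one_hot[:, 1] - one_hot[:, 2]

print(row_sums)
print(cherbourg_rebuilt)
```

Since the dropped column adds no new information, the model can still distinguish all three ports from the two that remain.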

In [None]:
# Train and evaluate the model. First, split the data into train and test sets
# Use the default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    titanic_df.iloc[:, 1:],
    titanic_df.iloc[:, 0])
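
The split above is random, so results will differ from run to run. As an optional variation, the sketch below shows two `train_test_split` parameters that are often useful: `random_state` for a repeatable shuffle and `stratify` to keep the class ratio the same in both sets. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features, 8 of them in the positive class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# random_state makes the shuffle repeatable; stratify keeps the
# 60% / 40% class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when the classes are imbalanced, which is worth keeping in mind given the countplot of Survived earlier in the notebook.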

In [None]:
X_train.shape, y_train.shape

In [None]:
X_train

In [None]:
# Instantiate the Logistic Regression model
lr = LogisticRegression(solver="liblinear")

# Fit the model to our data
lr.fit(X_train, y_train)

In [None]:
# Use the model to make a prediction on the X_test data
y_pred = lr.predict(X_test)

# and evaluate model performance
print(classification_report(y_test, y_pred))
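
The challenge asks for an analysis of accuracy. The sketch below, on made-up labels, shows how accuracy relates to the confusion matrix: it is the share of predictions on the diagonal (true negatives plus true positives). The lists `y_true` and `y_hat` are assumptions for illustration; in the notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the fraction of correct predictions
accuracy = accuracy_score(y_true, y_hat)
manual_accuracy = (tn + tp) / (tn + fp + fn + tp)

print(cm)
print(accuracy, manual_accuracy)
```

Accuracy alone can hide which kind of error the model makes, so reading it alongside the precision and recall in the classification report gives a fuller picture.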