# Class Competition

# Who survived the sinking of the Titanic?

The goal of this competition is to predict who survived the Titanic sinking in 1912.

## Data set description

<ul>
<li><b>Survived</b>: binary attribute that indicates whether the passenger survived. This is the dependent variable that we will attempt to explain
<li><b>Pclass</b>: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
<li><b>Age</b>: Passenger age
<li><b>SibSp</b>: The number of the passenger's siblings/spouses aboard the Titanic
<li><b>Parch</b>: The number of the passenger's parents/children aboard the Titanic
<li><b>Fare</b>: The ticket fare
<li><b>Male</b>: binary attribute that indicates the gender (1 = Male, 0 = Female)
<li><b>Embarked_C</b>: binary attribute that indicates whether the passenger embarked in Cherbourg
<li><b>Embarked_Q</b>: binary attribute that indicates whether the passenger embarked in Queenstown
<li><b>Embarked_S</b>: binary attribute that indicates whether the passenger embarked in Southampton
</ul>
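
The description above lists indicator columns (`Male`, `Embarked_C`, `Embarked_Q`, `Embarked_S`), whereas the file loaded later in this notebook stores `Sex` and `Embarked` as strings. The following is a minimal sketch of how the string columns could be turned into those indicators; the three rows are made up for illustration and the variable names `raw` and `encoded` are assumptions.

```python
import pandas as pd

# Hypothetical rows in the raw layout (Sex and Embarked stored as strings)
raw = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Male = 1 for male passengers, 0 otherwise
encoded = pd.DataFrame({"Male": (raw["Sex"] == "male").astype(int)})

# One 0/1 indicator column per port of embarkation
for port in ["C", "Q", "S"]:
    encoded[f"Embarked_{port}"] = (raw["Embarked"] == port).astype(int)

print(encoded)
```

The notebook itself takes a different route and lets `OneHotEncoder` build the indicators inside the pipeline.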

## Instruction

Clean the data set if necessary.

Use everything you know to find the machine learning model that achieves the highest possible AUC score. Two test sets have been reserved: TestA.csv and TestB.csv, and your model will be evaluated on both. 70% of the grade is based on the AUC score on TestA.csv, and 30% is based on the ranking of your AUC score on TestB.csv among the groups. Specifically, your grade on TestA.csv equals the final AUC score multiplied by 70, and your grade on TestB.csv equals 30 * (number of groups - your ranking)/(number of groups - 1). You must submit the same model for both sets, with a clear explanation of your code. You must also include the code that evaluates your model on TestA.csv and TestB.csv; failure to do so will result in a 20% loss of grade (10% for each test).
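
As a worked example of the grading rule, the sketch below computes the total for one set of inputs. The function name `competition_grade` and the example values (an AUC of 0.85, rank 3 out of 11 groups) are assumptions chosen for illustration, and the penalty for missing evaluation code is not modelled.

```python
def competition_grade(auc_a: float, rank_b: int, n_groups: int) -> float:
    """Total grade out of 100: 70 * AUC on TestA plus the rank-based TestB part."""
    if not 0.0 <= auc_a <= 1.0:
        raise ValueError("auc_a must be in [0, 1]")
    if n_groups < 2 or not 1 <= rank_b <= n_groups:
        raise ValueError("need n_groups >= 2 and 1 <= rank_b <= n_groups")
    grade_a = 70 * auc_a
    grade_b = 30 * (n_groups - rank_b) / (n_groups - 1)
    return grade_a + grade_b


# Hypothetical inputs: AUC 0.85 on TestA, ranked 3rd of 11 groups on TestB
example = competition_grade(auc_a=0.85, rank_b=3, n_groups=11)
print(round(example, 2))
```

Under this rule the top-ranked group receives the full 30 points for TestB and the last-ranked group receives 0.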

TestB.csv is private, which means you will never see it. The ranking will be revealed only after the deadline. TestA.csv is semi-private: at most once per day, I will check your model's performance on TestA.csv using your code, tell you the AUC score, and post it on the discussion board. I will save your notebook file in the same folder as the data files. If your code does not run on my computer, you lose that day's attempt.

In [1]:
# Common imports
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn: preprocessing, pipelines, model selection, models, metrics
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree

# Gradient-boosting libraries
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'


In [2]:
# Read the training data
df = pd.read_csv("Titanic_0.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            141
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          549
Embarked         1
dtype: int64

In [6]:
df.Embarked.value_counts()

S    517
C    130
Q     65
Name: Embarked, dtype: int64

In [7]:
# Replace missing ages with the median age
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# Replace the missing port of embarkation with the mode ("S")
df['Embarked'] = df['Embarked'].fillna("S")
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          549
Embarked         0
dtype: int64

In [8]:
# Extract titles from 'Name' column (the word that precedes the first period)
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Map titles to common categories
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Dr': 'Dr',
    'Rev': 'Rev',
    'Col': 'Other',
    'Major': 'Other',
    'Mlle': 'Miss',
    'Countess': 'Other',
    'Ms': 'Miss',
    'Lady': 'Other',
    'Jonkheer': 'Other',
    'Don': 'Other',
    'Mme': 'Mrs',
    'Capt': 'Other',
    'Sir': 'Other'
}

df['Title'] = df['Title'].map(title_mapping)
df.Title.value_counts()

Mr        418
Miss      146
Mrs       101
Master     31
Dr          7
Other       6
Rev         4
Name: Title, dtype: int64
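
The extraction and mapping step can be checked on a few names. The sketch below is a minimal illustration with made-up names and a shortened mapping (`title_map` is an assumed name); it also shows one possible way to handle a title that is absent from the mapping, by sending it to `'Other'`. The notebook does not do this: there, an unmapped title becomes `NaN` and is later filled by the categorical imputer.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Miss. Jane",
    "Poe, Dona. Maria",  # 'Dona' is deliberately missing from the mapping below
])

# Shortened mapping, for illustration only
title_map = {"Mr": "Mr", "Miss": "Miss", "Mrs": "Mrs", "Master": "Master"}

# Same regular expression as in the notebook, written as a raw string
raw_titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Titles that are not in the mapping come back as NaN; send them to 'Other'
titles = raw_titles.map(title_map).fillna("Other")
print(titles.tolist())
```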

In [9]:
df.Age.max()

80.0

In [10]:
df.Age.min()

0.42

In [11]:
# Define the age bins
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# Define custom names for the age buckets
age_labels = ['0-10', '11-20', '21-30', '31-40','41-50','51-60','61-70','71-80']

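# Create a new column 'Age_Bucket' with the age bins.
# right=False makes each bin left-closed, e.g. [20, 30), so an age of exactly 80
# falls outside the last bin and becomes NaN.
df['Age_Bucket'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)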

In [12]:
# Drop columns that are not used in the model
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], axis=1)

In [13]:
df

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked,Title,Age_Bucket
0,0,3,male,1,0,7.2500,S,Mr,21-30
1,1,1,female,1,0,71.2833,C,Mrs,31-40
2,1,3,female,0,0,7.9250,S,Miss,21-30
3,1,1,female,1,0,53.1000,S,Mrs,31-40
4,0,3,male,0,0,8.4583,Q,Mr,21-30
...,...,...,...,...,...,...,...,...,...
708,0,3,female,0,5,29.1250,Q,Mrs,31-40
709,0,2,male,0,0,13.0000,S,Rev,21-30
710,1,1,female,0,0,30.0000,S,Miss,11-20
711,0,3,female,1,2,23.4500,S,Miss,21-30


In [15]:
df.Age_Bucket.value_counts()

21-30    323
31-40    133
11-20     79
41-50     72
0-10      49
51-60     35
61-70     15
71-80      6
Name: Age_Bucket, dtype: int64
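
How `pd.cut` treats values that sit exactly on a bin edge depends on the `right` argument. The sketch below uses the same bins and labels as the notebook with four hand-picked ages (0.42 and 80.0 are the minimum and maximum reported above) to compare the two settings; the variable names are assumptions.

```python
import pandas as pd

ages = pd.Series([0.42, 10.0, 70.0, 80.0])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

# right=False: bins are [0, 10), [10, 20), ..., [70, 80)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True (the pandas default): bins are (0, 10], (10, 20], ..., (70, 80]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(pd.DataFrame({"age": ages, "right=False": left_closed, "right=True": right_closed}))
```

With `right=False` an age of 80.0 is not assigned to any bucket, and an age of 70.0 lands in the bucket labelled `71-80`; with `right=True` both edge values match their labels.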

In [16]:
# Define features (X) and target variable (y)
X_train = df.drop('Survived', axis=1)
y_train = df['Survived']

In [17]:
X_train.columns

Index(['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Title',
       'Age_Bucket'],
      dtype='object')

In [18]:
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import xgboost as xgb

# Categorical and numeric feature columns
col_cat = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Title', 'Age_Bucket']
col_num = ['Fare']

# Separate pipelines for categorical and numeric features
# (missing categories are filled with a constant placeholder before one-hot encoding)
pipe_cat = make_pipeline(SimpleImputer(strategy='constant'), OneHotEncoder(handle_unknown='ignore'))
# Impute missing numeric values first, then standardize
pipe_num = make_pipeline(SimpleImputer(), StandardScaler())

# Create the column transformer
preprocessor = make_column_transformer(
    (pipe_cat, col_cat),
    (pipe_num, col_num)
)

# Create the complete pipeline
pipe_xboost = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', xgb.XGBClassifier(eval_metric='auc', use_label_encoder=False, random_state=0))
])

# Define the parameter grid for XGBoost
param_grid_xgb = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 7],
    'classifier__min_child_weight': [1, 3, 5]
}

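# Perform 5-fold GridSearchCV for XGBoost.
# Note: candidates are ranked by accuracy here, although the competition metric is AUC;
# scoring='roc_auc' would rank them by AUC instead.
grid_search_xgb = GridSearchCV(pipe_xboost, param_grid_xgb, cv=5, scoring='accuracy')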
grid_search_xgb.fit(X_train, y_train)

# Get the best XGBoost model
best_model_xgb = grid_search_xgb.best_estimator_


In [19]:
grid_search_xgb.best_params_

{'classifier__max_depth': 7,
 'classifier__min_child_weight': 1,
 'classifier__n_estimators': 50}
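
The grid search above ranks parameter combinations by accuracy, while the competition is graded on AUC. The following is a minimal sketch of a grid search that ranks candidates by cross-validated AUC instead. It uses synthetic data from `make_classification` and a `DecisionTreeClassifier` so that it runs quickly on its own; neither is the notebook's actual data or model, and no results from it apply to the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=5,
    scoring="roc_auc",  # rank candidates by mean cross-validated AUC
)
search.fit(X_demo, y_demo)

# best_score_ is the mean AUC over the 5 validation folds for the best candidate
print(search.best_params_, round(search.best_score_, 4))
```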

In [20]:
# Report the training set accuracy of the best XGBoost model
train_accuracy_xgb = best_model_xgb.score(X_train, y_train)
print(f'Train set accuracy of the best XGBoost model: {train_accuracy_xgb:.4f}')

Train set accuracy of the best XGBoost model: 0.9439


In [21]:
from sklearn.metrics import roc_auc_score

# AUC on the training data itself, so this is an optimistic estimate
auc_score = roc_auc_score(y_train, best_model_xgb.predict_proba(X_train)[:,1])
print("AUC Score: ", round(auc_score,4))

AUC Score:  0.984
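
The AUC above is measured on the same rows the model was fitted on, so it is likely to overstate performance on unseen data. One common way to get a less optimistic estimate is to score out-of-fold predictions. The sketch below shows the idea on synthetic data with a `GradientBoostingClassifier`; both are stand-ins chosen so the example runs on its own, not the notebook's data or tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
model = GradientBoostingClassifier(random_state=0)

# Each row is predicted by a model that did not see it during fitting
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_proba = cross_val_predict(model, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]
cv_auc = roc_auc_score(y_demo, oof_proba)

# For comparison: AUC on the rows used for fitting
train_auc = roc_auc_score(y_demo, model.fit(X_demo, y_demo).predict_proba(X_demo)[:, 1])

print(f"Out-of-fold AUC: {cv_auc:.4f}")
print(f"Training AUC:    {train_auc:.4f}")
```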


## Testing on TestA data

In [None]:
# Read the TestA data
df_A = pd.read_csv("TestA.csv")

In [None]:
# Data preprocessing (same steps as for the training data)
# Replace missing values with the training-set median age and the most frequent port
df_A['Age'] = df_A['Age'].fillna(median_age)
df_A['Embarked'] = df_A['Embarked'].fillna("S")

# Extract titles from 'Name' column
df_A['Title'] = df_A['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Map some titles to common categories
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Dr': 'Dr',
    'Rev': 'Rev',
    'Col': 'Other',
    'Major': 'Other',
    'Mlle': 'Miss',
    'Countess': 'Other',
    'Ms': 'Miss',
    'Lady': 'Other',
    'Jonkheer': 'Other',
    'Don': 'Other',
    'Mme': 'Mrs',
    'Capt': 'Other',
    'Sir': 'Other'
}

df_A['Title'] = df_A['Title'].map(title_mapping)



# Define the age bins
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
# Define custom names for the age buckets
age_labels = ['0-10', '11-20', '21-30', '31-40','41-50','51-60','61-70','71-80']

# Create a new column 'Age_Bucket' with the age bins
df_A['Age_Bucket'] = pd.cut(df_A['Age'], bins=age_bins, labels=age_labels, right=False)

# Drop columns that are not used in the model
df_A = df_A.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], axis=1)



X = df_A.drop('Survived', axis=1)
y = df_A['Survived']

In [None]:
from sklearn.metrics import roc_auc_score

# Evaluate the selected model (best_model_xgb) on TestA
# Hard class predictions; the AUC below uses predicted probabilities instead
y_pred = best_model_xgb.predict(X)

auc_score = roc_auc_score(y, best_model_xgb.predict_proba(X)[:,1])
print("AUC Score:", round(auc_score,4))

## Testing on TestB data

In [None]:
# Read the TestB data
df_B = pd.read_csv("TestB.csv")

In [None]:
# Data preprocessing (same steps as for the training data)
# Replace missing values with the training-set median age and the most frequent port
df_B['Age'] = df_B['Age'].fillna(median_age)
df_B['Embarked'] = df_B['Embarked'].fillna("S")

# Extract titles from 'Name' column
df_B['Title'] = df_B['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Map some titles to common categories
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Dr': 'Dr',
    'Rev': 'Rev',
    'Col': 'Other',
    'Major': 'Other',
    'Mlle': 'Miss',
    'Countess': 'Other',
    'Ms': 'Miss',
    'Lady': 'Other',
    'Jonkheer': 'Other',
    'Don': 'Other',
    'Mme': 'Mrs',
    'Capt': 'Other',
    'Sir': 'Other'
}

df_B['Title'] = df_B['Title'].map(title_mapping)



# Define the age bins
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
# Define custom names for the age buckets
age_labels = ['0-10', '11-20', '21-30', '31-40','41-50','51-60','61-70','71-80']

# Create a new column 'Age_Bucket' with the age bins
df_B['Age_Bucket'] = pd.cut(df_B['Age'], bins=age_bins, labels=age_labels, right=False)

# Drop columns that are not used in the model
df_B = df_B.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], axis=1)



X = df_B.drop('Survived', axis=1)
y = df_B['Survived']

In [None]:
from sklearn.metrics import roc_auc_score

# Evaluate the selected model (best_model_xgb) on TestB
# Hard class predictions; the AUC below uses predicted probabilities instead
y_pred = best_model_xgb.predict(X)

auc_score = roc_auc_score(y, best_model_xgb.predict_proba(X)[:,1])
print("AUC Score:", round(auc_score,4))
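
The TestA and TestB cells repeat the preprocessing applied to the training data. As a possible simplification, the steps could be collected into one helper and called for each file. The sketch below is one way to do that; the function name `preprocess`, the constants, the shortened title mapping, the two demo rows, and the median age of 28.0 are all assumptions for illustration.

```python
import pandas as pd

AGE_BINS = [0, 10, 20, 30, 40, 50, 60, 70, 80]
AGE_LABELS = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
# Shortened copy of the title mapping, for illustration only
TITLE_MAP = {'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master'}


def preprocess(frame, median_age, embarked_mode='S', title_map=TITLE_MAP):
    """Apply the notebook's cleaning and feature steps to a copy of `frame`."""
    out = frame.copy()
    # Fill missing values with statistics taken from the training data
    out['Age'] = out['Age'].fillna(median_age)
    out['Embarked'] = out['Embarked'].fillna(embarked_mode)
    # Title and age bucket features, as in the cells above
    out['Title'] = out['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False).map(title_map)
    out['Age_Bucket'] = pd.cut(out['Age'], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    # Drop columns the model does not use; columns absent from the frame are skipped
    return out.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], errors='ignore')


# Two made-up rows; the median age of 28.0 is an assumed value
demo = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 0],
    'Name': ['Doe, Mr. John', 'Roe, Mr. Richard'],
    'Age': [22.0, None],
    'Embarked': ['S', None],
})
clean = preprocess(demo, median_age=28.0)
print(clean)
```

Calling the same function on the training, TestA, and TestB frames keeps the three preprocessing paths from drifting apart.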