## Machine Learning Exercise: Titanic

1. Load the Dataset
2. Explore the Data:
   - Perform EDA (Exploratory Data Analysis) to understand feature distributions and missing data.
   - Visualize survival rates across different features.
3. Preprocess the Data:
   - Handle missing values (e.g., age).
   - Encode categorical variables using techniques like One-Hot Encoding.
   - Apply feature scaling where appropriate.
4. Feature Engineering:
   - Create new features (e.g., family size, title extraction from names).
5. Split the Data:
   - Divide into training and testing sets.
6. Train Models:
   - Apply classification algorithms such as:
     - Logistic Regression
     - Decision Trees
     - Random Forest
     - SVM
7. Evaluate Models:
   - Use accuracy, precision, recall, and ROC-AUC scores.
   - Perform cross-validation.
8. Model Improvement:
   - Hyperparameter tuning.
   - Ensemble methods.


In [79]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

In [80]:
# Load dataset
titanic = pd.read_csv('titanic.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
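
The exploration step from the outline (missing data, survival rates across features) is not shown in the notebook. A minimal sketch of what it might look like follows. The `sample` frame is toy data invented for illustration and stands in for `titanic`; only the column names come from the dataset above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (illustrative values, not the real dataset)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [22.0, 38.0, 26.0, np.nan, 35.0, np.nan],
})

# Fraction of missing values per column
missing_share = sample.isna().mean()

# Survival rate within each category of a feature
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
survival_by_class = sample.groupby("Pclass")["Survived"].mean()

print(missing_share)
print(survival_by_sex)
print(survival_by_class)
```

On the real data, the corresponding calls would be `titanic.isna().mean()` and `titanic.groupby('Sex')['Survived'].mean()`.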


In [81]:
# Turning Cabin into Deck: the deck letter is the first character of the cabin code
# (rows without a cabin stay missing and are imputed in the next cell)
titanic['Deck'] = titanic['Cabin'].str[0]
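
The outline also lists family size and title extraction, which the notebook does not compute. A possible sketch is below, using three names copied from the `head()` output above. The column names `FamilySize` and `Title` and the regular expression are assumptions, and the step would have to run before `Name` is dropped.

```python
import pandas as pd

# Three rows copied from the head() output above
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# Family size: siblings/spouses + parents/children + the passenger
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Title: the text between the comma and the first period, e.g. "Mr"
sample["Title"] = (sample["Name"]
                   .str.extract(r",\s*([^.]+)\.", expand=False)
                   .str.strip())

print(sample[["FamilySize", "Title"]])
```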

In [82]:
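# Preprocessing: impute missing values, then drop columns that are not used as features
fill_dict = {'Age': titanic['Age'].median(),
             'Embarked': titanic['Embarked'].mode()[0],
             'Deck': titanic['Deck'].mode()[0]}  # most cabins are missing, so most Deck values end up imputed
titanic = titanic.fillna(fill_dict)
titanic = titanic.drop(['PassengerId', 'Cabin', 'Ticket', 'Name'], axis=1)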

In [83]:
# Encoding categorical variables (drop='first' omits one reference level per feature)
categorical_features = ['Sex', 'Embarked', 'Deck']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_cols = pd.DataFrame(encoder.fit_transform(titanic[categorical_features]),
                            columns=encoder.get_feature_names_out(),
                            index=titanic.index)  # keep row labels aligned for the concat
titanic = pd.concat([titanic.drop(categorical_features, axis=1), encoded_cols], axis=1)

In [102]:
titanic.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T
0,0,3,22.0,1,0,7.25,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1,1,38.0,1,0,71.2833,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1,3,26.0,0,0,7.925,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1,1,35.0,1,0,53.1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0,3,35.0,0,0,8.05,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [93]:
# Feature scaling
# Note: the scaler is fit on all rows, so test-set statistics influence the scaling used for training
features = titanic.drop('Survived', axis=1)
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
y = titanic['Survived']
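
Fitting `StandardScaler` on the whole table before the train-test split lets statistics from the test rows influence training. One common alternative, sketched here as an assumption and not as the notebook's method, is to wrap scaling and encoding in a scikit-learn `Pipeline` so that both are fit on the training rows only. The toy frame `df` and the choice of columns are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with Titanic-like columns (illustrative values, not the real dataset)
df = pd.DataFrame({
    "Age":  [22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20],
    "Fare": [7.25, 71.28, 7.93, 53.1, 8.05, 51.86,
             21.08, 11.13, 30.07, 16.7, 26.55, 8.05],
    "Sex":  ["male", "female", "female", "female", "male", "male",
             "male", "female", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})
X_raw, y_raw = df.drop("Survived", axis=1), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.25, random_state=42, stratify=y_raw)

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Sex"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaling and encoding from the training rows only
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)

# The fitted scaler's mean matches the training mean, not the full-data mean
fitted_scaler = model.named_steps["preprocess"].named_transformers_["num"]
print("Scaler mean for Age:", fitted_scaler.mean_[0])
print("Training mean for Age:", X_train["Age"].mean())
```

The same `model` object can be passed to `cross_val_score`, in which case the preprocessing is refit inside every fold.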

In [94]:
X.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_T
0,0.827377,-0.565736,0.432793,-0.473674,-0.502445,0.737695,-0.307562,0.615838,-0.235981,0.440874,-0.196116,-0.193009,-0.121681,-0.067153,-0.03352
1,-1.566107,0.663861,0.432793,-0.473674,0.786845,-1.355574,-0.307562,-1.623803,-0.235981,0.440874,-0.196116,-0.193009,-0.121681,-0.067153,-0.03352
2,0.827377,-0.258337,-0.474545,-0.473674,-0.488854,-1.355574,-0.307562,0.615838,-0.235981,0.440874,-0.196116,-0.193009,-0.121681,-0.067153,-0.03352
3,-1.566107,0.433312,0.432793,-0.473674,0.42073,-1.355574,-0.307562,0.615838,-0.235981,0.440874,-0.196116,-0.193009,-0.121681,-0.067153,-0.03352
4,0.827377,0.433312,-0.474545,-0.473674,-0.486337,0.737695,-0.307562,0.615838,-0.235981,0.440874,-0.196116,-0.193009,-0.121681,-0.067153,-0.03352


In [103]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [104]:
# Training
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [105]:
# Prediction
y_pred = logreg.predict(X_test)

In [106]:
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1]))

Accuracy: 0.832089552238806
ROC AUC Score: 0.8782459425717852
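
The outline also calls for precision and recall, which the evaluation cell does not report. A small sketch with hand-made labels shows the calls; `y_true` and `y_hat` are invented for illustration and would be replaced by `y_test` and `y_pred` in the notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_hat  = [0, 1, 0, 0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 4 / 5
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 4 / 5

print(cm)
print("Precision:", precision)
print("Recall:", recall)
```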


### With cross-validation

In [107]:
model = LogisticRegression()
print("Accuracy under cross validation:", cross_val_score(model, X, y, cv=10, scoring='accuracy').mean())
print("ROC AUC Score under cross validation:", cross_val_score(model, X, y, cv=10, scoring='roc_auc').mean())

Accuracy under cross validation: 0.7946317103620475
ROC AUC Score under cross validation: 0.85354576578106
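
Each `cross_val_score` call above refits the model on all ten folds, once per metric. scikit-learn's `cross_validate` accepts several scorers in a single pass, and the same loop can compare the classifiers listed in the outline. The sketch below runs on synthetic data from `make_classification`, which is an assumption standing in for `X` and `y`; the fold count and model settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale-sensitive models get a scaler that is refit inside each fold
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, estimator in candidates.items():
    # One pass over the folds yields both metrics
    scores = cross_validate(estimator, X_demo, y_demo, cv=cv,
                            scoring=["accuracy", "roc_auc"])
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "roc_auc": scores["test_roc_auc"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, roc_auc={metrics['roc_auc']:.3f}")
```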


### With XGBoost

In [108]:
import xgboost as xgb
clf = xgb.XGBClassifier(tree_method="hist", early_stopping_rounds=2)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])
y_pred = clf.predict(X_test)

[0]	validation_0-logloss:0.53307
[1]	validation_0-logloss:0.46404
[2]	validation_0-logloss:0.42841
[3]	validation_0-logloss:0.41112
[4]	validation_0-logloss:0.39441
[5]	validation_0-logloss:0.38305
[6]	validation_0-logloss:0.37767
[7]	validation_0-logloss:0.37156
[8]	validation_0-logloss:0.36878
[9]	validation_0-logloss:0.36816
[10]	validation_0-logloss:0.37182
[11]	validation_0-logloss:0.36622
[12]	validation_0-logloss:0.36557
[13]	validation_0-logloss:0.36602
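
Passing `(X_test, y_test)` as `eval_set` means early stopping chooses the number of boosting rounds from the same rows that are later used to report accuracy, which can make the reported score optimistic. One way around this, sketched below, is to carve a separate validation set out of the training data. The names `X_fit` and `X_val` and the 25% validation fraction are assumptions, and synthetic data stands in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# First split: hold out the test set, untouched until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Second split: carve a validation set out of the training rows
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

print(len(X_fit), len(X_val), len(X_test))  # 120 40 40

# With these splits, the fit call from the notebook would become:
# clf.fit(X_fit, y_fit, eval_set=[(X_val, y_val)])
```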


In [109]:
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
# Score the XGBoost classifier here; calling logreg.predict_proba would only repeat the logistic-regression AUC
print("ROC AUC Score:", roc_auc_score(y_test, clf.predict_proba(X_test)[:,1]))

Accuracy: 0.8582089552238806
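
The last step of the outline, hyperparameter tuning, is not covered in the notebook. A minimal sketch using scikit-learn's `GridSearchCV` over the regularization strength `C` of a logistic regression follows. The grid values are arbitrary, and synthetic data from `make_classification` stands in for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Candidate regularization strengths (smaller C means stronger regularization)
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training rows picks the best C
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", round(search.best_score_, 3))
print("Held-out ROC AUC:", round(search.score(X_test, y_test), 3))
```

Because the test rows play no part in the search, the held-out score remains an independent check on the selected model.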
