## Titanic - Machine Learning from Disaster

In [185]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, OneHotEncoder, StandardScaler

In [121]:
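# Set local path
local_path = '/Users/katedao/Desktop/'

# Load the data (assumption: the standard Kaggle file names train.csv and test.csv live in local_path)
train_data = pd.read_csv(local_path + 'train.csv')
test_data = pd.read_csv(local_path + 'test.csv')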

In [123]:
# Select features and target
features = ['Age', 'Fare', 'Sex', 'Pclass', 'Parch']
target = 'Survived'

X_train = train_data[features]
y_train = train_data[target]
X_test = test_data[features]
passenger_ids = test_data['PassengerId']

In [21]:
# Data Inspection
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [27]:
# Check that PassengerId is unique, so it could serve as the index column:
print(len(set(train_data["PassengerId"])) == len(train_data))

True


In [29]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [35]:
# Numerical attributes:
train_data.describe().round(2)

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 257.35 | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |


In [43]:
# Categorical attributes/outcomes
train_data['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

In [45]:
train_data["Pclass"].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [47]:
train_data["Sex"].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [49]:
train_data["Embarked"].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

Sex is a natural attribute to include in the model. It takes only two values, male and female, so we can encode one of them as 1 and the other as 0.

One possible approach is to transform the attribute into a boolean value. Booleans behave as 0 and 1 in numeric computations, so the column can be used as a model input once it is cast to a numeric type.

In [55]:
(train_data['Sex'] == 'male').head()

0     True
1    False
2    False
3    False
4     True
Name: Sex, dtype: bool

LabelEncoder, from the sklearn.preprocessing module, maps each distinct category to an integer. scikit-learn documents it as a tool for encoding target labels (y) rather than input features, for which OrdinalEncoder or OneHotEncoder is the intended choice, but for a two-valued column such as Sex it yields the same 0/1 coding. Some encoding of this kind is needed because most scikit-learn estimators expect numeric input, not strings.

In [60]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

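We create an instance of the LabelEncoder class and use it to transform the values in the categorical column ('Sex' in this case). fit_transform learns the distinct categories, sorts them, and assigns integers in that order, so female becomes 0 and male becomes 1.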

In [63]:
print(encoder.fit_transform(train_data['Sex'])[:5])
print(train_data['Sex'].iloc[:5])

[1 0 0 0 1]
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
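To see which integer each category received, the fitted encoder can be inspected. The following is a minimal sketch on a toy series that mirrors the first five rows above; the variable names `sex`, `mapping`, and `decoded` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column mirroring the first five rows of train_data['Sex']
sex = pd.Series(["male", "female", "female", "female", "male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted categories; the position of each one is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original strings from the integer codes
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

Checking `classes_` this way avoids guessing which category ended up as 0 and which as 1.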


Missing values:

The Age attribute is missing for about 20% of the training rows (177 of 891), so we have to decide how to handle them. Age is plausibly a strong predictor of survival, so dropping the column is unattractive. We could train the model only on the rows where Age is present, but that discards about a fifth of the training data, and the model would still be unable to predict for new passengers whose Age is missing.

A more effective approach might involve imputing the missing values (e.g., using the median age) or leveraging a model that can handle missing data directly, rather than simply dropping the rows with missing values.
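As a minimal sketch of median imputation, the example below fills a toy Age column with scikit-learn's SimpleImputer. The values are illustrative, not taken from the dataset, and the names `ages`, `ages_filled`, and `median_age` are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age column with two missing entries (illustrative values only)
ages = np.array([[22.0], [38.0], [np.nan], [35.0], [np.nan]])

# The imputer learns the median of the observed values during fit
imputer = SimpleImputer(strategy="median")
ages_filled = imputer.fit_transform(ages)

# statistics_ stores the learned fill value for each column
median_age = imputer.statistics_[0]

print(median_age)
print(ages_filled.ravel())
```

Because the median is learned in `fit`, the same fitted imputer can later fill missing values in the test set with the training median via `transform`.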

Data preparation pipeline:

After analyzing the dataset, I selected the following five attributes: Sex, Pclass, Parch, Age, and Fare.

The data will be processed as follows:

1. Impute missing values in the Age attribute using the median.
2. Convert the Sex attribute into a binary value (e.g., male as 1 and female as 0).
3. Transform the Pclass attribute into one-hot encoded vectors to represent each class as a separate binary column.
4. Encode the Parch attribute so that any value greater than 0 is set to 1, and 0 remains as 0.
5. Standardize the Age and Fare attributes (zero mean, unit variance) so that they are on comparable scales.

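These steps are chained into a pipeline: within each branch the output of one step is the input to the next (for example, imputation followed by scaling for Age and Fare), and the branches for the different columns are combined into a single feature matrix.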

### 1. Using a Linear Regression Model

In [79]:
# Define columns for numerical and categorical transformations
numerical_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Pclass', 'Parch']

# Numerical transformations: Impute missing values and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing values with median
    ('scaler', StandardScaler())  # Standardize to zero mean and unit variance
])

# Categorical transformations: Encode features as required
categorical_transformer = ColumnTransformer(transformers=[
    ('sex_binary', FunctionTransformer(lambda x: (x == 'male').astype(int)), ['Sex']),  # Encode 'Sex' as binary
    ('pclass_onehot', OneHotEncoder(), ['Pclass']),  # One-hot encode 'Pclass'
    ('parch_binary', FunctionTransformer(lambda x: (x > 0).astype(int)), ['Parch'])  # Encode 'Parch' as binary
])

# Combine numerical and categorical transformers into a single preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),  # Apply numerical transformations
    ('cat', categorical_transformer, categorical_features)  # Apply categorical transformations
])

# Define the complete pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('regressor', LinearRegression())  # Linear regression model
])

In [105]:
# Fit pipeline
full_pipeline.fit(X_train, y_train)

In [107]:
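# Predict: LinearRegression returns continuous scores, not classes,
# so threshold them at 0.5 to obtain 0/1 survival labels
predictions = full_pipeline.predict(X_test)
predicted_labels = (predictions >= 0.5).astype(int)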
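The thresholding step can be illustrated on a toy problem. This is a sketch under assumed data (a single feature with four points), not the Titanic pipeline; the names `X_toy`, `y_toy`, `raw_scores`, and `labels` are hypothetical. It shows that a linear regression fitted to a 0/1 target can produce scores outside the [0, 1] range, which is why the scores are cut at 0.5 rather than read as probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one feature, binary target (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = LinearRegression().fit(X_toy, y_toy)

# Continuous scores from the regression; they are not bounded to [0, 1]
raw_scores = model.predict(X_toy)

# Same rule as in the notebook: scores of 0.5 or more become class 1
labels = (raw_scores >= 0.5).astype(int)

print(raw_scores)
print(labels)
```

A classifier such as logistic regression would return bounded probabilities directly; the sketch keeps linear regression only to mirror the approach used here.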

In [127]:
# Save submission
submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predicted_labels
})

submission.to_csv(local_path + "submission_linear.csv", index=False)

### 2. Using the KNN Algorithm

In [197]:
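# Encode categorical variables as integers.
# Note: this overwrites the string columns in train_data and test_data in place.
label_encoders = {}
for col in ["Sex", "Embarked"]:
    print(col)
    le = LabelEncoder()
    train_data[col] = le.fit_transform(train_data[col])
    test_data[col] = le.transform(test_data[col])  # reuse the mapping learned on train
    label_encoders[col] = le  # keep the fitted encoder so the mapping can be inspected later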

Sex
Embarked


In [199]:
# Fix the feature order (Age ends up at column index 2)
select_features = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]

train_data = train_data[select_features + ["Survived"]]
test_data = test_data[select_features]

# Split features and target variable
X = train_data.drop(columns=["Survived"])
y = train_data["Survived"]

In [201]:
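# Impute missing values (e.g. Age) with the training median, because KNN cannot handle NaN.
# Assumption: the original cells do not show how missing values were filled.
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
test_imputed = imputer.transform(test_data)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
X_test_scaled = scaler.transform(test_imputed)

# After evaluating all features, I decided to up-weight 'Age' (column index 2)
# so that it contributes more to the KNN distance
age_weight = 1.5
X_scaled[:, 2] *= age_weight
X_test_scaled[:, 2] *= age_weight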

Why do we need to scale? Because:

1. KNN is distance-based: it classifies a point from the distances to its nearest neighbours (Euclidean distance by default in scikit-learn).

2. Features like Fare and Pclass can be on very different scales (e.g., Fare in the hundreds, Pclass just 1–3).

3. Without scaling, features with larger values dominate the distance metric, making the model biased toward them.

4. So, standardizing (mean = 0, std = 1) makes all features contribute comparably to the distance calculation.
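The effect can be checked numerically. The sketch below uses four made-up passengers with only Pclass and Fare (illustrative values, not rows from the dataset) and a small hypothetical helper, `euclidean`, to compare distances before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Pclass, Fare (toy values chosen for illustration)
X_toy = np.array([
    [1.0, 80.0],
    [3.0, 75.0],
    [1.0, 10.0],
    [3.0, 8.0],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Passenger 0 vs 1: different class, similar fare
# Passenger 0 vs 2: same class, very different fare
raw_d01 = euclidean(X_toy[0], X_toy[1])
raw_d02 = euclidean(X_toy[0], X_toy[2])

# After standardization both columns have mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_toy)
std_d01 = euclidean(X_std[0], X_std[1])
std_d02 = euclidean(X_std[0], X_std[2])

print(f"raw:          d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"standardized: d(0,1)={std_d01:.2f}  d(0,2)={std_d02:.2f}")
```

On the raw values the fare difference swamps the class difference; after standardization a change of class and a large change of fare produce distances of similar size.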

In [205]:
# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [209]:
# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=40)
knn.fit(X_train, y_train)
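Before predicting on the test set, it is usually worth checking the model on the held-out validation split and comparing a few values of k. The following is a sketch on synthetic data generated with make_classification, used here only as a stand-in for the scaled Titanic features; the names `X_demo`, `knn_demo`, `val_accuracy`, and `cv_accuracy`, and the candidate values of k, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix (not the real Titanic data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, n_redundant=0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hold-out accuracy for the k used in the notebook
knn_demo = KNeighborsClassifier(n_neighbors=40)
knn_demo.fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_va, knn_demo.predict(X_va))

# 5-fold cross-validated accuracy on the training split for a few candidate k values
cv_accuracy = {
    k: float(cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean())
    for k in (5, 20, 40)
}

print(f"validation accuracy (k=40): {val_accuracy:.3f}")
print(cv_accuracy)
```

In the notebook itself, the same calls would take X_train, X_val, y_train, and y_val from the split above instead of the synthetic arrays.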

In [225]:
# Predict
knn_predictions = knn.predict(X_test_scaled)
knn_predicted_labels = knn_predictions.astype(int)

# Create and save submission file
submission_knn = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': knn_predicted_labels
})

submission_knn.to_csv(local_path + "submission_knn.csv", index=False)
