# 3 Feature Selection
In this problem, you will work with a dataset describing the passengers of the Titanic and whether they survived. The objective is to build a predictive model that answers the question: “what sorts of people were more likely to survive?” You can use the passenger characteristics provided in the dataset. Download the train and test datasets for this problem from here.

(a) Impute your dataset’s missing values.

(b) Which features from the dataset do you think will be useful in your prediction? Do you think adding ticket number as a feature is a good idea? Explain your reasoning.

(c) Using forward feature selection, find the best subset of features for a logistic regression model with L2 regularization.

(d) Report the AUC in the test set for your selected model.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics

np.random.seed(0)

In [6]:
test_data = pd.read_csv("./test.csv")
train_data = pd.read_csv("./train.csv")

In [7]:
test_data.head(5)
print(len(test_data))
print(len(train_data))

418
891


In [8]:
train_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## PART A

In [9]:
# Mean imputation: used for numerical variables, replaces missing values with the column mean
# Mode imputation: used for categorical variables, replaces missing values with the most frequent value

# Mean imputation for numerical columns in train data
num_columns_train = train_data.select_dtypes(include='number').columns 
for col in num_columns_train:
    train_data[col] = train_data[col].fillna(train_data[col].mean())

# Mean imputation for numerical columns in test data
num_columns_test = test_data.select_dtypes(include='number').columns 
for col in num_columns_test:
    test_data[col] = test_data[col].fillna(test_data[col].mean())

# Mode imputation for categorical columns in train data
cat_columns_train = train_data.select_dtypes(include='object').columns 
for col in cat_columns_train:
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])

# Mode imputation for categorical columns in test data
cat_columns_test = test_data.select_dtypes(include='object').columns 
for col in cat_columns_test:
    test_data[col] = test_data[col].fillna(test_data[col].mode()[0])

In [10]:
# Test data post imputing missing values
test_data.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,B57 B59 B63 B66,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,B57 B59 B63 B66,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,B57 B59 B63 B66,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,B57 B59 B63 B66,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,B57 B59 B63 B66,S


In [11]:
# Train data post imputing missing values
train_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S
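The cells above impute the test set with its own means and modes. A common alternative is to learn the fill values from the training data only and reuse them on the test data, so that no test-set statistics feed into preprocessing. The following is a minimal sketch of that idea, assuming two toy frames (`toy_train`, `toy_test`) with made-up values; the names and data are illustrative and not taken from the Titanic files.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test frames (values are illustrative only)
toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", "S", np.nan],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 47.0],
    "Embarked": [np.nan, "Q"],
})

num_cols = ["Age"]
cat_cols = ["Embarked"]

# Learn the fill values from the training frame only
train_means = toy_train[num_cols].mean()          # column means, NaN ignored
train_modes = toy_train[cat_cols].mode().iloc[0]  # most frequent value per column
fill_values = {**train_means.to_dict(), **train_modes.to_dict()}

# Apply the same training-derived values to both frames
train_imputed = toy_train.fillna(fill_values)
test_imputed = toy_test.fillna(fill_values)

print(test_imputed)
```

With this layout the test rows are filled with the training mean age and the training mode for `Embarked`, regardless of what the test distribution looks like.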


## PART B

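The features most likely to help in our prediction are Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. Pclass and Fare act as proxies for socio-economic status, Sex and Age plausibly capture who was given priority during the evacuation, SibSp and Parch describe family size, and Embarked records the port of embarkation. PassengerId and Name are identifiers, and Cabin was missing in three of the first five training rows before imputation, so these columns are left out.

Ticket number would not be a good feature to use as is. It behaves like a high-cardinality identifier: nearly every value is unique, apart from tickets that may be shared by passengers travelling together. Used as a raw categorical feature, it would let the model memorize individual rows rather than find patterns that generalize to new passengers.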
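One way to check this kind of claim is to compare the share of distinct values in a column with that of an ordinary categorical feature. The sketch below uses a small made-up frame (the ticket strings and the helper `cardinality_ratio` are hypothetical); the same calls could be run on `train_data["Ticket"]`.

```python
import pandas as pd

# Made-up example: two passengers share ticket "A1", the rest are unique
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "D4", "E5"],
    "Pclass": [1, 1, 3, 3, 2, 3],
})

def cardinality_ratio(series):
    """Fraction of distinct values; values near 1 indicate an identifier-like column."""
    return series.nunique() / len(series)

ticket_ratio = cardinality_ratio(toy["Ticket"])
pclass_ratio = cardinality_ratio(toy["Pclass"])

# Number of passengers travelling on each row's ticket
group_size = toy.groupby("Ticket")["Ticket"].transform("size")

print(ticket_ratio, pclass_ratio)
print(group_size.tolist())
```

A ratio close to 1 suggests the column mostly identifies rows. If tickets are shared, a derived count such as `group_size` may carry more signal than the raw ticket string, though that is an assumption that would need to be validated on the real data.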





In [12]:
# Filter test and train data for variables specified above
features_train = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
features_test = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

filtered_train_data = train_data[features_train] 
filtered_test_data = test_data[features_test]

filtered_train_data.head(5)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [13]:
# Convert categorical features to 0-1 indicator variables
filtered_train_data = pd.get_dummies(
    filtered_train_data, 
    columns = ['Sex', 'Embarked'], 
    dtype = int,
    drop_first = True
)

filtered_test_data = pd.get_dummies(
    filtered_test_data, 
    columns = ['Sex', 'Embarked'],
    dtype = int, 
    drop_first = True
)

X = filtered_train_data.drop('Survived', axis = 1)
y = filtered_train_data.Survived

print("Shape of X is: ", X.shape)
print("Shape of y is: ", y.shape)

filtered_train_data.head(5)

Shape of X is:  (891, 8)
Shape of y is:  (891,)


Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,0,1
3,1,1,35.0,1,0,53.1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,1


In [15]:
filtered_train_data.dtypes

Survived        int64
Pclass          int64
Age           float64
SibSp           int64
Parch           int64
Fare          float64
Sex_male        int64
Embarked_Q      int64
Embarked_S      int64
dtype: object
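Calling `pd.get_dummies` separately on the train and test frames produces matching columns only when both frames contain every category. A hedged sketch of a more defensive encoding is shown below: the category lists are taken from the training frame and applied to both frames through `pd.Categorical`. The toy frames and the helper `encode` are assumptions for illustration.

```python
import pandas as pd

# Toy frames: the test frame is missing the 'C' and 'Q' ports entirely
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})
toy_test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Embarked": ["S", "S"],
})

# Category lists are learned from the training frame only
categories = {col: sorted(toy_train[col].unique()) for col in ["Sex", "Embarked"]}

def encode(df):
    """One-hot encode using the fixed training categories (first level dropped)."""
    df = df.copy()
    for col, cats in categories.items():
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=list(categories), dtype=int, drop_first=True)

train_enc = encode(toy_train)
test_enc = encode(toy_test)

print(list(train_enc.columns))
print(test_enc)
```

Because the categories are fixed up front, the test frame gets an `Embarked_Q` column of zeros and a correct `Embarked_S` indicator even though it never contains the dropped baseline level.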

## PART C

In [93]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hold-out validation - split data into a training set (80%) and a validation set (20%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize the l2 logistic regression model
model = LogisticRegression(penalty='l2', solver='liblinear')

features = filtered_test_data.columns.tolist()

best_features = []
best_auc = 0

while len(features) > 0:
    best_auc_feature = None
    best_auc_value = 0  # Initialize the best AUC value
    
    for feature in features:
        # Try adding the feature to the selected features
        current_features = best_features + [feature]
        
        # Train the logistic regression model with the current features
        model.fit(X_train[current_features], y_train)
        
        # Predict probabilities on the validation set
        y_pred_proba = model.predict_proba(X_val[current_features])[:, 1]
        
        # Calculate AUC
        auc = roc_auc_score(y_val, y_pred_proba)

        # Print AUC progress
        print(f"AUC with {current_features}: {auc}")
        
        # Update the best feature if needed
        if auc > best_auc_value:
            best_auc_value = auc
            best_auc_feature = feature
    
    if best_auc_feature is None:
        break
    
    # Update the best feature and AUC value if the new AUC is higher
    if best_auc_value > best_auc:
        best_auc = best_auc_value
        best_features.append(best_auc_feature)
        features.remove(best_auc_feature)
    else:
        break

# Print the best feature set and its corresponding AUC value
print("Best features:", best_features)
print("Best AUC:", best_auc)

AUC with ['Pclass']: 0.7333333333333334
AUC with ['Age']: 0.4627799736495388
AUC with ['SibSp']: 0.4293148880105402
AUC with ['Parch']: 0.5693017127799737
AUC with ['Fare']: 0.776679841897233
AUC with ['Sex_male']: 0.7732542819499342
AUC with ['Embarked_Q']: 0.4762845849802371
AUC with ['Embarked_S']: 0.5938076416337286
AUC with ['Fare', 'Pclass']: 0.7803030303030303
AUC with ['Fare', 'Age']: 0.7377470355731226
AUC with ['Fare', 'SibSp']: 0.7363636363636364
AUC with ['Fare', 'Parch']: 0.7754281949934124
AUC with ['Fare', 'Sex_male']: 0.8712779973649538
AUC with ['Fare', 'Embarked_Q']: 0.7399868247694334
AUC with ['Fare', 'Embarked_S']: 0.7593544137022399
AUC with ['Fare', 'Sex_male', 'Pclass']: 0.8763504611330697
AUC with ['Fare', 'Sex_male', 'Age']: 0.8646903820816865
AUC with ['Fare', 'Sex_male', 'SibSp']: 0.8549407114624507
AUC with ['Fare', 'Sex_male', 'Parch']: 0.8607378129117259
AUC with ['Fare', 'Sex_male', 'Embarked_Q']: 0.8703557312252964
AUC with ['Fare', 'Sex_male', 'Embarke
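The loop above scores each candidate feature on a single 80/20 split, so the ranking can depend on which rows happen to land in the validation set. A cross-validated variant can be sketched with scikit-learn's `SequentialFeatureSelector`. The synthetic data from `make_classification` and the fixed choice of three features are assumptions made for illustration only; they are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (X, y): 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# L2 is the default penalty of LogisticRegression
model = LogisticRegression(solver="liblinear")

# Greedy forward selection, scoring each candidate by 5-fold cross-validated AUC
selector = SequentialFeatureSelector(
    model,
    n_features_to_select=3,  # assumed stopping point for this sketch
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X_toy, y_toy)

selected_mask = selector.get_support()              # boolean mask over columns
selected_idx = selector.get_support(indices=True)   # integer column indices
print("Selected column indices:", selected_idx)
```

Unlike the hand-written loop, this selector stops at a fixed number of features rather than when the score stops improving, so the two approaches are related but not identical.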

In [126]:
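# Hold-out validation - same split (random_state=0) as in the feature selection above
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize the l2 logistic regression model
model = LogisticRegression(penalty='l2', solver='liblinear')

# Train the logistic regression model on the training split with the selected features
model.fit(X_train[best_features], y_train)

# test.csv has no Survived column, so there are no labels to score its predictions against.
# Scoring against y[:418] would compare the test predictions with the labels of unrelated
# training passengers, so the AUC is computed on the held-out validation split instead.
# This split was also used to select the features, so the value is likely optimistic.
y_val_pred_proba = model.predict_proba(X_val[best_features])[:, 1]
auc = roc_auc_score(y_val, y_val_pred_proba)
print("AUC on held-out validation set with best features:", auc)

# Predicted survival probabilities for the unlabeled test passengers
y_test_pred_proba = model.predict_proba(filtered_test_data[best_features])[:, 1]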
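AUC is only meaningful when the labels and the predicted scores refer to the same rows, because it measures how often a positive example is ranked above a negative one. The sketch below makes that definition concrete on four made-up points and checks it against `roc_auc_score`; the helper `pairwise_auc` is hypothetical and written for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four made-up examples: true labels and predicted scores for the same rows
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

manual = pairwise_auc(y_true, scores)
library = roc_auc_score(y_true, scores)
print(manual, library)  # both are 0.75: three of the four pairs are ordered correctly
```

If the labels came from different rows than the scores, the pairs would be matched arbitrarily and the result would be expected to hover around 0.5.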
