## Assignment: Cross Validation of Logistic Regression Model

Following assignment-1, we have a dataset of Titanic passengers. Here are the first 5 rows of the dataset:

The dataset has 6 input features ["Age","SibSp","Parch","male","Q","S"] and 1 binary output, "Survived". In assignment-1, we fitted a logistic regression model using all 6 features. In this assignment, iterate over all possible non-empty feature subsets and use cross validation to find the best one.

Here are some guidelines:
1. Generate all possible non-empty feature subsets;
2. Split the data into train, validation and test sets in the proportion $80\%\times80\%:80\%\times20\%:20\%$, i.e. $64\%:16\%:20\%$;
3. Fit a logistic regression model on the train set for each feature subset and compute its accuracy on the validation set;
4. Find the feature subset with the highest validation accuracy;
5. Fit a logistic regression model with the optimal feature subset and calculate its accuracy on the test set.
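
The split in guideline 2 is two nested 80/20 splits, so the shares of the full dataset can be checked by hand. A minimal sketch of that arithmetic using only the standard library; the helper name `nested_split_shares` is made up for illustration and is not part of the assignment code:

```python
from fractions import Fraction

def nested_split_shares(outer_test=Fraction(1, 5), inner_valid=Fraction(1, 5)):
    """Return (train, valid, test) shares of the full dataset for two nested splits."""
    test = outer_test
    remaining = 1 - outer_test        # share left over for train + validation
    valid = remaining * inner_valid   # validation is carved out of the remainder
    train = remaining - valid
    return train, valid, test

train_share, valid_share, test_share = nested_split_shares()
print(float(train_share), float(valid_share), float(test_share))  # 0.64 0.16 0.2
```

Actual row counts depend on how the splitting function rounds, so they may differ from these shares by a row or so.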

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')
from itertools import combinations

from sklearn.model_selection import train_test_split
from sklearn import linear_model

In [2]:
data = pd.read_csv('titanic_cross_validation.csv')

In [3]:
train_valid, test = train_test_split(data, test_size = 0.2, random_state = 0)
train, valid = train_test_split(train_valid, test_size = 0.2, random_state = 0)

train = train.reset_index(drop = True)
valid = valid.reset_index(drop = True)
test = test.reset_index(drop = True)
features = data.drop(['Survived'], axis = 1).columns
target = ['Survived']

In [4]:
features

Index(['Age', 'SibSp', 'Parch', 'male', 'Q', 'S'], dtype='object')

In [5]:
def all_subsets(my_list):
    # every non-empty subset as a list, ordered by increasing subset size
    return [list(combo)
            for size in range(1, len(my_list) + 1)
            for combo in combinations(my_list, size)]
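
As a sanity check, 6 features should give $2^6 - 1 = 63$ non-empty subsets. The sketch below repeats the enumeration on a plain list of the feature names shown above, without pandas; the variable names are illustrative only:

```python
from itertools import combinations

feature_names = ["Age", "SibSp", "Parch", "male", "Q", "S"]

# Enumerate every non-empty subset, smallest subsets first.
subsets = [
    list(combo)
    for size in range(1, len(feature_names) + 1)
    for combo in combinations(feature_names, size)
]

print(len(subsets))  # 63, i.e. 2**6 - 1
print(subsets[0])    # ['Age']
print(subsets[-1])   # the full feature list
```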

In [6]:
features_subs = all_subsets(features)
accu_cv = np.array([])

In [7]:
def CrossValidation(train_cross, valid_cross, target, proportion):
    # note: `proportion` is accepted but not used; the data is split before this call

    # extract X and Y to be fit in a model
    X_train = train_cross.drop(target, axis = 1)
    Y_train = train_cross[target]
    X_valid = valid_cross.drop(target, axis = 1)
    Y_valid = valid_cross[target]

    logisticReg = linear_model.LogisticRegression()

    # fit model using training data
    logisticReg.fit(X_train, Y_train)

    # return the accuracy on the validation data
    return logisticReg.score(X_valid, Y_valid)
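
`CrossValidation` scores one fixed train/validation split, which is the hold-out method. k-fold cross validation would instead rotate the validation role across k folds and average the scores. A minimal sketch of how such fold indices could be generated, without shuffling; `k_fold_indices` is a hypothetical helper, not a scikit-learn function:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, valid_idx in folds:
    print(valid_idx)  # each sample lands in exactly one validation fold
```

In practice, `KFold` or `cross_val_score` from `sklearn.model_selection` cover this, including shuffling.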

In [8]:
for sub in features_subs:
    # create a sub dataframe
    train_sub = train[sub + target]
    valid_sub = valid[sub + target]
    sub_accu_cv = CrossValidation(train_sub,valid_sub, 'Survived', 0.2)
    accu_cv = np.append(accu_cv, sub_accu_cv)
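
One detail worth knowing when picking the winner: NumPy's `argmax` returns the index of the first maximum, and the subsets are generated smallest first, so ties go to the earliest (and never a larger) subset. The sketch below mirrors that tie-breaking with the standard library; the scores are made up for illustration and are not results from this dataset:

```python
# Made-up validation accuracies, purely for illustration.
subsets = [["Age"], ["male"], ["Age", "male"]]
scores = [0.61, 0.79, 0.79]

# max() with a key returns the first index that attains the maximum,
# which matches the tie-breaking behaviour of numpy's argmax.
best_index = max(range(len(scores)), key=scores.__getitem__)
print(subsets[best_index])  # ['male']
```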

In [13]:
features_selected = features_subs[(accu_cv.argmax())]
titanic_data_selected = train[features_selected + target]
features_selected

['SibSp', 'male']

In [14]:
# build a logistic regression model
model_selected = linear_model.LogisticRegression()

# split features and target
X = train[features_selected]
Y = train[target]

# fit model
model_selected.fit(X, Y)

# prepare the test features and target
X_test_selected = test[features_selected]
Y_test_selected = test[target]

In [15]:
from sklearn.metrics import confusion_matrix

confusion_matrix(Y_test_selected, model_selected.predict(X_test_selected))

array([[85, 20],
       [30, 43]], dtype=int64)

In [16]:
from sklearn.metrics import classification_report

print("The accuracy on this set is: ", model_selected.score(X_test_selected, Y_test_selected))
print(classification_report(Y_test_selected, model_selected.predict(X_test_selected)))

The accuracy on this set is:  0.7191011235955056
              precision    recall  f1-score   support

           0       0.74      0.81      0.77       105
           1       0.68      0.59      0.63        73

    accuracy                           0.72       178
   macro avg       0.71      0.70      0.70       178
weighted avg       0.72      0.72      0.72       178
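
The numbers in the report can be reproduced from the confusion matrix printed earlier, which is a useful check that the two outputs agree. A minimal sketch in plain Python, using scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative:

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions.
cm = [[85, 20],
      [30, 43]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of passengers predicted to survive, share that did
recall_1 = tp / (tp + fn)      # of passengers that survived, share predicted correctly
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.7191 0.68 0.59 0.63
```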

