## Titanic Project - Survival Prediction

CSI-22 (Object-Oriented Programming)

Group members: André Diogo, Antônio Gustavo and Lucca Haddad

### Importing Libraries

In [537]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


### Defining Parameters

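Here, we define the hyperparameter values to try for each model: four regularization strengths (`C`) for logistic regression and two split criteria for the random forest. Naive Bayes runs with its default settings. This lets us compare how each model behaves under different hyperparameters.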

In [538]:
log_reg_params = [{"C":0.01}, {"C":0.1}, {"C":1}, {"C":10}]
rand_for_params = [{"criterion": "gini"}, {"criterion": "entropy"}]
naive_bayes_params = [{}]
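
The lists above are written out by hand. As a minimal sketch of an alternative, scikit-learn's `ParameterGrid` can expand a dictionary of candidate values into the same list-of-dicts shape. The names `log_reg_grid` and `rand_for_grid` are illustrative and do not appear in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical names; each grid expands to one dict per hyperparameter combination.
log_reg_grid = list(ParameterGrid({"C": [0.01, 0.1, 1, 10]}))
rand_for_grid = list(ParameterGrid({"criterion": ["gini", "entropy"]}))

print(log_reg_grid)
print(rand_for_grid)
```

This becomes more useful once a model has several hyperparameters, since the grid then enumerates every combination.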

### List of Models

In [539]:
modelclasses = [
    ["log regression", LogisticRegression, log_reg_params],
    ["random forest", RandomForestClassifier, rand_for_params],
    ["naive bayes", GaussianNB, naive_bayes_params],
]

### Importing Problem's Data

In [540]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

### Variable Replacement, Invalid Variable Removal

In this section, we replace the string values of the 'Sex' and 'Embarked' features with numbers, remove the training rows with NaN in the 'Embarked' column, and drop the columns we do not consider useful for the analysis (Name, Cabin and Ticket). Missing 'Age' values are filled with the column mean.

In [541]:
train = train.replace('male', 0)\
    .replace('female', 1)\
    .replace('S', 0)\
    .replace('C', 1)\
    .replace('Q', 2)\
    .dropna(subset=['Embarked'])\
    .drop(['Name','Cabin','Ticket'], axis=1)

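# Assign the result back: inplace=True on a column selection may not update the frame in recent pandas
train['Age'] = train['Age'].fillna(train['Age'].mean())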

test = test.replace('male', 0)\
    .replace('female', 1)\
    .replace('S', 0)\
    .replace('C', 1)\
    .replace('Q', 2)\
    .drop(['Name','Cabin','Ticket'], axis=1)

# Assign the results back instead of relying on inplace=True on a column selection
test['Age'] = test['Age'].fillna(test['Age'].mean())
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

In the test dataframe, we need a specific number of rows after prediction, so we did not remove the rows with NaN; instead, the missing values were replaced with the mean of the respective column.
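
The encoding and imputation steps described above can also be written column by column. The sketch below uses a made-up toy frame, not the Titanic data, and swaps the frame-wide `replace` for `Series.map`, so each substitution is restricted to the intended column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, np.nan, 30.0],
})

# Map each categorical column explicitly, so only that column is touched.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Fill the missing age with the mean of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df)
```

One caveat with `map`: any value missing from the dictionary becomes NaN, which makes unexpected categories visible instead of leaving them as strings.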

### Min-Max Normalization (Age and Fare)

To bring the Fare and Age features onto a common [0, 1] scale, we replaced their values with their min-max normalization.

In [542]:
train.Fare = (train.Fare - train.Fare.min())/(train.Fare.max() - train.Fare.min())
train.Age = (train.Age - train.Age.min())/(train.Age.max() - train.Age.min())

test.Fare = (test.Fare - test.Fare.min())/(test.Fare.max() - test.Fare.min())
test.Age = (test.Age - test.Age.min())/(test.Age.max() - test.Age.min())
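
Min-max normalization maps a value x to (x - min) / (max - min). The cell above scales the train and test sets with their own minimum and maximum. A variant, sketched here on made-up numbers, reuses the training minimum and maximum for the test set so that both sets share one scale. The helper `min_max_scale` is hypothetical and not part of the notebook.

```python
import pandas as pd

def min_max_scale(values: pd.Series, lo: float, hi: float) -> pd.Series:
    """Scale values relative to the reference range [lo, hi].

    Values inside the range land in [0, 1]; values outside it fall outside [0, 1].
    """
    return (values - lo) / (hi - lo)

# Made-up fares for illustration.
train_fare = pd.Series([0.0, 10.0, 50.0])
test_fare = pd.Series([5.0, 25.0])

# Take the reference range from the training data only.
lo, hi = train_fare.min(), train_fare.max()
train_scaled = min_max_scale(train_fare, lo, hi)
test_scaled = min_max_scale(test_fare, lo, hi)

print(train_scaled.tolist())
print(test_scaled.tolist())
```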

### Feature Selection

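Select the feature and target columns, then randomly split the labelled data 80/20 into a training part and a held-out part. Note that `x_test` and `y_test` here come from train.csv; they are not the competition test set loaded from test.csv.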

In [543]:
feature_cols = ['PassengerId','Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
x = train[feature_cols]
y = train.Survived

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=123)
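
The split above is purely random. When the classes are unbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below shows this on synthetic labels; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 30 rows, one third of them in the positive class.
X = np.arange(30).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Both parts keep the one-third share of positives.
print(y_tr.mean(), y_te.mean())
```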

Fit each model with each of its hyperparameter settings and record its accuracy on the held-out split.

In [544]:
insights = []
for modelname, Model, params_list in modelclasses:
    for params in params_list:
        model = Model(**params)
        model.fit(x_train, y_train)
        score = model.score(x_test, y_test)
        insights.append((modelname, model, params, score))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
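
The warning above suggests raising `max_iter` or scaling the data. One plausible contributor, which the notebook does not confirm, is the unscaled `PassengerId` column, whose range is far wider than that of the normalized features. The sketch below applies both suggestions to synthetic data, using a `StandardScaler` in a pipeline. It is an illustration under those assumptions, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one wide-range column (like an ID) and one already in [0, 1].
X = np.column_stack([np.arange(1, 201), rng.random(200)])
y = (X[:, 1] > 0.5).astype(int)

# Standardize every feature, then fit with a larger iteration budget.
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1, max_iter=1000))
clf.fit(X, y)
score = clf.score(X, y)

print(score)
print(clf.named_steps["logisticregression"].n_iter_)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can be reused on a held-out split.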

Print the held-out accuracy of each model and hyperparameter setting, best first.

In [545]:
insights.sort(key=lambda x:x[-1], reverse=True)
for modelname, model, params, score in insights:
    print(modelname, params, score)

random forest {'criterion': 'gini'} 0.8258426966292135
random forest {'criterion': 'entropy'} 0.8202247191011236
log regression {'C': 0.1} 0.7752808988764045
naive bayes {} 0.7752808988764045
log regression {'C': 10} 0.7696629213483146
log regression {'C': 1} 0.7640449438202247
log regression {'C': 0.01} 0.702247191011236
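
The scores above come from a single held-out split and are consistent with 178 held-out rows, so the two random forest results differ by one passenger (147 versus 146 correct). Averaging over several folds gives a steadier comparison. The sketch below does this with `cross_val_score` on synthetic data; it is an optional variant and does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the labelled data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

results = {}
for criterion in ("gini", "entropy"):
    model = RandomForestClassifier(criterion=criterion, random_state=0)
    # Five-fold cross-validation: one accuracy value per fold.
    scores = cross_val_score(model, X, y, cv=5)
    results[criterion] = scores
    print(criterion, scores.mean(), scores.std())
```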


Here, we refit the best-performing configuration (random forest with the Gini criterion) and use it to predict survival for the competition test set loaded from test.csv.

In [546]:
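final_model = RandomForestClassifier(criterion='gini')
final_model.fit(x_train, y_train)
# Select the feature columns explicitly so the input matches what the model was fitted on
test['Survived'] = final_model.predict(test[feature_cols])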
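
The cell above fits the final model on `x_train`, the 80% portion of the labelled data. A common variant, sketched here on synthetic data and not what the notebook does, is to refit the chosen configuration on all labelled rows before predicting the unlabelled ones. All variable names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: 200 labelled rows and 50 unlabelled rows, 8 features each.
X_all, y_all = make_classification(n_samples=250, n_features=8, random_state=0)
X_labelled, y_labelled = X_all[:200], y_all[:200]
X_unlabelled = X_all[200:]

# Refit the selected configuration on every labelled row, then predict.
final = RandomForestClassifier(criterion="gini", random_state=0)
final.fit(X_labelled, y_labelled)
predictions = final.predict(X_unlabelled)

print(predictions.shape)
```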

Generate a CSV file containing the PassengerId and predicted Survived columns.

In [547]:
final = test[['PassengerId','Survived']]
final.to_csv('final.csv',index=False)