# DEVELOPMENT OF ML MODEL

According to the analysis we carried out in Tableau to gain insight into the dataset, the most relevant features are:

- AGE (CHILDREN (<= 12 YEARS), ADULTS (> 12 YEARS))
- SEX
- CLASS
- #SIB/SPOUSES (2 GROUPS: 0, OR 1 OR MORE)
- #PARENTS/CHILDREN (2 GROUPS: 0, OR 1 OR MORE)
- PORT OF EMBARKATION

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('Titanic_preprocessed.csv')

In [2]:
data.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived
0,3,male,22.0,1,0,7.25,S,0
1,1,female,38.0,1,0,71.2833,C,1
2,3,female,26.0,0,0,7.925,S,1
3,1,female,35.0,1,0,53.1,S,1
4,3,male,35.0,0,0,8.05,S,0


In [3]:
X = data.iloc[:, :-1].copy()  # copy so later column assignments do not write into a slice of data
y = data.iloc[:, -1]

### Handling Data Variables

Based on our analysis, we will treat each variable as follows:

- Pclass: 3 categories (1, 2, 3)
- Sex: 2 categories (Male, Female)
- Age: 2 categories (Children <= 12 years, Adults > 12 years)
- SibSp: 2 categories (None = 0, 1 or more)
- Parch: 2 categories (None = 0, 1 or more)
- Fare: treated as a continuous variable, but rescaled
- Embarked: 3 categories

### Create the categories

In [4]:
X['Age'] = pd.cut(data['Age'], bins=[0., 12, np.inf], labels=['Children', 'Adult'], right=True)

In [5]:
X['SibSp'] = data['SibSp'].clip(0, 1)
X['Parch'] = data['Parch'].clip(0, 1)
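
To make the boundary behaviour of the two grouping steps explicit, here is a minimal sketch on a hypothetical toy frame (the values are invented for illustration and are not taken from the dataset). It assumes the same `pd.cut` and `clip` arguments as the cells above.

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on and around the category boundaries.
toy = pd.DataFrame({'Age': [0.5, 12.0, 12.5, 40.0], 'SibSp': [0, 1, 3, 8]})

# right=True makes each bin closed on the right: (0, 12] and (12, inf).
age_group = pd.cut(toy['Age'], bins=[0., 12, np.inf],
                   labels=['Children', 'Adult'], right=True)

# clip(0, 1) maps every count of 1 or more to 1 and leaves 0 unchanged.
has_sibsp = toy['SibSp'].clip(0, 1)

print(age_group.tolist())  # ['Children', 'Children', 'Adult', 'Adult']
print(has_sibsp.tolist())  # [0, 1, 1, 1]
```

A passenger aged exactly 12 therefore lands in the Children group, which matches the criteria listed above.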

In [6]:
X

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,Adult,1,0,7.2500,S
1,1,female,Adult,1,0,71.2833,C
2,3,female,Adult,0,0,7.9250,S
3,1,female,Adult,1,0,53.1000,S
4,3,male,Adult,0,0,8.0500,S
...,...,...,...,...,...,...,...
886,2,male,Adult,0,0,13.0000,S
887,1,female,Adult,0,0,30.0000,S
888,3,female,Adult,1,1,23.4500,S
889,1,male,Adult,0,0,30.0000,C


### Encoding Categorical Variables

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 1, 2, 6])],
                                       remainder='passthrough')

X = column_transformer.fit_transform(X)
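
The following sketch shows how the encoded matrix is laid out, using a hypothetical toy frame with the same column order as `X` (the rows are invented for illustration). It assumes, as in the cell above, that positions 0, 1, 2 and 6 are the categorical columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same column order as X above.
toy = pd.DataFrame({
    'Pclass':   [1, 2, 3, 3],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Age':      ['Adult', 'Children', 'Adult', 'Adult'],
    'SibSp':    [0, 1, 1, 0],
    'Parch':    [0, 0, 1, 1],
    'Fare':     [71.28, 13.0, 7.25, 8.05],
    'Embarked': ['C', 'Q', 'S', 'S'],
})

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1, 2, 6])],
    remainder='passthrough')
encoded = ct.fit_transform(toy)

# 3 (Pclass) + 2 (Sex) + 2 (Age) + 3 (Embarked) one-hot columns come first,
# followed by the passthrough columns SibSp, Parch and Fare.
print(encoded.shape)  # (4, 13)
```

This accounts for the 13 columns reported after the split: 10 one-hot indicators plus SibSp, Parch and Fare, with Fare ending up as the last column.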

## TRAIN AND TEST SPLIT

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [9]:
X_train.shape

(712, 13)

### FEATURE SCALING

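In this case only the Fare column actually needs scaling. The scaler is applied to the whole matrix, but the remaining columns are already 0/1 indicators, so min-max scaling leaves them effectively unchanged.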

In [10]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
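
As a small illustration of what the scaler does, the sketch below uses hypothetical toy matrices (two indicator columns followed by a fare column; the numbers are invented). It also shows the fit-on-train, transform-on-test pattern that is used later for the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy matrices: two 0/1 indicator columns followed by a fare column.
toy_train = np.array([[1., 0., 7.25],
                      [0., 1., 71.28],
                      [1., 0., 13.00]])
toy_test = np.array([[0., 1., 30.00]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns min/max from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min/max

# Indicator columns are unchanged; only the fare column is rescaled to [0, 1].
print(toy_train_scaled)
print(toy_test_scaled)
```

Because the minimum and maximum come from the training data only, a test fare outside the training range would map outside [0, 1].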

# TRAIN MODEL

Based on previous experience, and given that the available dataset is small, we will try the following models, tuning each one with GridSearchCV to find its best hyperparameters, and compare their performance. As recommended, we will use a single metric to evaluate performance: the F1 score.
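
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on hypothetical labels (invented for illustration) and compares the result with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual, f1_score(y_true, y_pred))
```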

### LIST OF MODELS TO TRY OUT

- Logistic Regression
- Support Vector Machine
- Decision Tree
- Random Forest
- KNN
- Naive Bayes
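
The sections that follow repeat the same grid-search pattern once per model. As a compact alternative, the sketch below loops over a dictionary of candidates. It runs on synthetic data from `make_classification` rather than the Titanic set, and the `candidates` dictionary and its small grids are hypothetical choices made only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search spaces, kept small so the sketch runs quickly.
candidates = {
    'logistic_regression': (LogisticRegression(max_iter=1000), {'C': [0.1, 0.5, 1.0]}),
    'decision_tree': (DecisionTreeClassifier(random_state=0), {'min_samples_leaf': [1, 10, 20]}),
    'knn': (KNeighborsClassifier(), {'n_neighbors': [1, 5, 10]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring='f1')
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)

# Print the candidates from best to worst cross-validated F1.
for name, (best_f1, best_params) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f'{name}: F1={best_f1:.4f} with {best_params}')
```

Keeping every result in one dictionary makes the final comparison less error-prone than copying scores by hand into headings.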

### LOGISTIC REGRESSION

For Logistic Regression we will try the following hyperparameters:

- Penalty (L1, L2, or none)
- C (inverse of regularization strength, where smaller values mean stronger regularization)

In [11]:
X_train_log = X_train_scaled
y_train_log = y_train

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'penalty': ['l1'], 'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'solver': ['liblinear']},
    {'penalty': ['l2'], 'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},
    {'penalty': ['none']},  # no regularization; newer scikit-learn releases expect None instead of 'none'
]

log_reg = LogisticRegression()
grid_search = GridSearchCV(log_reg, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)

grid_search.fit(X_train_log, y_train_log)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                          'penalty': ['l1'], 'solver': ['liblinear']},
                         {'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                          'penalty': ['l2']},
                         {'penalty': ['none']}],
             pre_dispatch='2*n

In [13]:
print(f'Parameters: {grid_search.best_params_}')
print(f'F1 score:{grid_search.best_score_}')

Parameters: {'C': 0.3, 'penalty': 'l2'}
F1 score:0.7255836189953838


##### BEST SCORE FOR LOGISTIC REGRESSION WAS 0.7256
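
Since the searches are run with `return_train_score=True`, the full results can be inspected beyond the single best score. The sketch below shows one way to do that through `cv_results_`. It uses synthetic data and a hypothetical three-value grid for `C`, so the numbers it prints are unrelated to the Titanic results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(LogisticRegression(max_iter=1000),
                           {'C': [0.1, 0.3, 0.9]}, cv=5,
                           scoring='f1', return_train_score=True)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ['param_C', 'mean_train_score', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_table)
```

A large gap between `mean_train_score` and `mean_test_score` for a given row is a hint that the combination overfits.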

### SUPPORT VECTOR MACHINE SVC

For Support Vector Machine we will use the following hyperparameters:

- C (Inverse Regularization Strength)
- Kernel (‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’) 

In [14]:
X_train_svc = X_train_scaled
y_train_svc = y_train

In [15]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
]

svc = SVC()
grid_search = GridSearchCV(svc, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)

grid_search.fit(X_train_svc, y_train_svc)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                          'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='f1', verbose=0)

In [16]:
print(f'Parameters: {grid_search.best_params_}')
print(f'F1 score:{grid_search.best_score_}')

Parameters: {'C': 0.1, 'kernel': 'poly'}
F1 score:0.7382819607599487


##### BEST SCORE FOR SVC WAS 0.7383

### DECISION TREE

For the Decision Tree classifier we will tune the following hyperparameter:

- Min Samples Leaf

In [17]:
X_train_tree = X_train_scaled
y_train_tree = y_train

In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'min_samples_leaf': [1, 10, 20, 30]},
]

tree = DecisionTreeClassifier()
grid_search = GridSearchCV(tree, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)

grid_search.fit(X_train_tree, y_train_tree)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'min_samples_leaf': [1, 10, 20, 30]}],
             

In [19]:
print(f'Parameters: {grid_search.best_params_}')
print(f'F1 score:{grid_search.best_score_}')

Parameters: {'min_samples_leaf': 1}
F1 score:0.7344584562813863


##### BEST SCORE FOR DECISION TREE WAS 0.7345

### RANDOM FOREST

For Random Forest classifier we will use the following hyperparameters:

- N_estimators (Number of trees)
- Min_samples_leaf
- Bootstrap
- Max Features

In [20]:
X_train_forest = X_train_scaled
y_train_forest = y_train

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 20, 30, 40], 'min_samples_leaf': [1, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False],'n_estimators': [3, 10, 20, 30, 40], 'min_samples_leaf': [1, 10, 20, 30], 'max_features': [2, 4, 6, 8]},
]

forest = RandomForestClassifier()
grid_search = GridSearchCV(forest, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)

grid_search.fit(X_train_forest, y_train_forest)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,...
                                              random_state=None, verbose=0,
                                   

In [22]:
print(f'Parameters: {grid_search.best_params_}')
print(f'F1 score:{grid_search.best_score_}')

Parameters: {'max_features': 6, 'min_samples_leaf': 1, 'n_estimators': 30}
F1 score:0.7408939703972093


##### BEST SCORE FOR RANDOM FOREST WAS 0.7409
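
The tree and forest estimators above are created with `random_state=None`, so repeated runs can select different parameters and scores. The sketch below shows one possible way to make such a search repeatable by seeding both the estimator and the cross-validation splitter. It runs on synthetic data, and the helper `run_search` and its small grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def run_search(seed):
    """Run a small forest grid search with every source of randomness seeded."""
    forest = RandomForestClassifier(random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(forest, {'n_estimators': [3, 10], 'max_features': [2, 4]},
                          cv=cv, scoring='f1')
    search.fit(X_demo, y_demo)
    return search.best_params_, search.best_score_

first = run_search(seed=0)
second = run_search(seed=0)
print(first == second)  # True: same seed, same selected parameters and score
```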

### KNN

For KNN we will use the following hyperparameters:

- N_NEIGHBORS

In [23]:
X_train_knn = X_train_scaled
y_train_knn = y_train

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_neighbors': [1, 5, 10, 15, 20]},
]

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)

grid_search.fit(X_train_knn, y_train_knn)

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'n_neighbors': [1, 5, 10, 15, 20]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='f1', verbose=0)

In [25]:
print(f'Parameters: {grid_search.best_params_}')
print(f'F1 score:{grid_search.best_score_}')

Parameters: {'n_neighbors': 5}
F1 score:0.7211610845295057


##### BEST SCORE FOR KNN WAS 0.7212

### Naive Bayes

For Naive Bayes we will try different variants:

- Gaussian
- Categorical
- Bernoulli

In [26]:
X_train_nb = X_train_scaled
y_train_nb = y_train

In [27]:
from sklearn.naive_bayes import GaussianNB, CategoricalNB, BernoulliNB

gnb = GaussianNB()
cnb = CategoricalNB()
bnb = BernoulliNB()

gnb.fit(X_train_nb, y_train_nb)
cnb.fit(X_train_nb, y_train_nb)
bnb.fit(X_train_nb, y_train_nb)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [28]:
from sklearn.metrics import f1_score

print(f1_score(y_train_nb, gnb.predict(X_train_nb)))
print(f1_score(y_train_nb, cnb.predict(X_train_nb)))
print(f1_score(y_train_nb, bnb.predict(X_train_nb)))

0.7291666666666666
0.7376146788990826
0.7352941176470589


##### BEST SCORE FOR NAIVE BAYES WAS 0.7376, USING CATEGORICAL NB (MEASURED ON THE TRAINING SET)
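
The Naive Bayes scores above are computed on the training set, whereas the other models are scored by 5-fold cross-validation, so the numbers are not directly comparable. A minimal sketch of how the three variants could be scored the same way with `cross_val_score` follows. It uses synthetic 0/1 features rather than the Titanic data, so the values it prints are only illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

# Synthetic 0/1 indicator features (not the Titanic data); the label copies
# the first column, with roughly 10% of the labels flipped as noise.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 6))
noise = rng.random(300) < 0.1
y_demo = np.where(noise, 1 - X_demo[:, 0], X_demo[:, 0])

variants = [('gaussian', GaussianNB()),
            ('categorical', CategoricalNB()),
            ('bernoulli', BernoulliNB())]

cv_f1 = {}
for name, model in variants:
    # Same protocol as the grid searches: 5 folds, F1 as the metric.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1')
    cv_f1[name] = scores.mean()

print(cv_f1)
```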

# EVALUATE WITH TEST SET

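According to the results of the previous section, the best algorithms were:

- SVC with parameters: {'C': 0.1, 'kernel': 'poly'}
- Random Forest with parameters: {'max_features': 4, 'min_samples_leaf': 1, 'n_estimators': 10} (these differ from the grid-search output shown above, which selected {'max_features': 6, 'min_samples_leaf': 1, 'n_estimators': 30}; the search is not seeded, so results can vary between runs)
- Categorical Naive Bayes

We will therefore evaluate all three on the test set and choose the one with the best score.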

In [29]:
X_test_prepared = scaler.transform(X_test)

In [30]:
from sklearn.metrics import f1_score

svc = SVC(C=0.1, kernel='poly')
forest = RandomForestClassifier(max_features=4, n_estimators=10)
cnb = CategoricalNB()

models = [svc, forest, cnb]
y_predict = []
score = []

for i, model in enumerate(models):
    model.fit(X_train_scaled, y_train)
    y_predict.append(model.predict(X_test_prepared))
    score.append(f1_score(y_test, y_predict[i]))

In [31]:
score

[0.746031746031746, 0.7703703703703704, 0.7464788732394365]
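
Because `score` is a plain list in the same order as `models`, it helps to pair each value with a model name before comparing. The sketch below does this with the three values printed above; the `model_names` labels are added here for readability.

```python
# Labels in the same order as models = [svc, forest, cnb].
model_names = ['SVC', 'Random Forest', 'Categorical NB']

# Test-set F1 scores copied from the output above.
score = [0.746031746031746, 0.7703703703703704, 0.7464788732394365]

# Sort from highest to lowest F1.
ranking = sorted(zip(model_names, score), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f'{name}: {value:.4f}')
# Random Forest: 0.7704
# Categorical NB: 0.7465
# SVC: 0.7460
```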

### GIVEN THE RESULTS ABOVE, THE BEST ALGORITHM ON THE TEST SET WAS RANDOM FOREST (F1 = 0.7704), AHEAD OF CATEGORICAL NB (0.7465) AND SVC (0.7460)

### Save the model for future predictions

In [32]:
import joblib

joblib.dump(forest, 'Titanic_model.pkl')

['Titanic_model.pkl']
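
A saved classifier on its own expects inputs that have already been encoded and scaled in the same way as the training data. One possible approach, sketched below on synthetic data, is to bundle the scaler and the classifier in a scikit-learn pipeline and persist that single object. The file name `demo_model.pkl` is hypothetical, and the SVC here is only a stand-in: any of the classifiers above could take its place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Bundling the scaler with the classifier means one file holds everything
# needed to go from encoded features to predictions.
pipeline = make_pipeline(MinMaxScaler(), SVC(C=0.1, kernel='poly'))
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

Files written by `joblib.dump` are best loaded with the same scikit-learn version that produced them.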