## Model

In this final step, we use our notebooks to fit several machine learning models and compare how well each one predicts survival on the Titanic.

Importing the regression and classification estimators to compare from scikit-learn:

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

Loading our pre-processed data:

In [2]:
df_train = pd.read_csv('../data/interim/train.csv')
df_test = pd.read_csv('../data/interim/test.csv')

In [3]:
df_train.head()

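|   | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|----------|--------|------|-----|------|-------|-------|---------|----------|
| 0 | 0 | 3 | 2 | 1 | 22.0 | 1 | 0 | 7.25 | 2 |
| 1 | 1 | 1 | 3 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 1 | 0 | 26.0 | 0 | 0 | 7.925 | 2 |
| 3 | 1 | 1 | 3 | 0 | 35.0 | 1 | 0 | 53.1 | 2 |
| 4 | 0 | 3 | 2 | 1 | 35.0 | 0 | 0 | 8.05 | 2 |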


In [4]:
df_test.head()

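|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|-------------|--------|------|-----|------|-------|-------|---------|----------|
| 0 | 892 | 3 | 2 | 1 | 34.5 | 0 | 0 | 7.8292 | 1 |
| 1 | 893 | 3 | 3 | 0 | 47.0 | 1 | 0 | 7.0 | 2 |
| 2 | 894 | 2 | 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 1 |
| 3 | 895 | 3 | 2 | 1 | 27.0 | 0 | 0 | 8.6625 | 2 |
| 4 | 896 | 3 | 3 | 0 | 22.0 | 1 | 1 | 12.2875 | 2 |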


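Here, we split the training set into the target column (Survived) and the feature columns, and set aside the PassengerId column of the test set for the submission file: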

In [5]:
y_train = df_train['Survived']
x_train = df_train.drop(['Survived'], axis=1)
del df_train

In [6]:
passengerId = df_test['PassengerId']
x_test = df_test.drop(['PassengerId'], axis=1)
del df_test
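Because the model is fitted on `x_train` and later applied to `x_test`, both frames need the same feature columns in the same order. A minimal sketch of such a check follows. The two small frames are made-up stand-ins for the real CSV files, and `columns_match` is a name introduced here for illustration only:

```python
import pandas as pd

# Tiny stand-in frames (illustrative values, not the project's data).
df_train = pd.DataFrame(
    {"Survived": [0, 1], "Pclass": [3, 1], "Sex": [1, 0], "Age": [22.0, 38.0]}
)
df_test = pd.DataFrame(
    {"PassengerId": [892, 893], "Pclass": [3, 3], "Sex": [1, 0], "Age": [34.5, 47.0]}
)

# Same split as in the notebook cells above.
y_train = df_train["Survived"]
x_train = df_train.drop(columns=["Survived"])
passengerId = df_test["PassengerId"]
x_test = df_test.drop(columns=["PassengerId"])

# The feature columns must match exactly, including their order.
columns_match = list(x_train.columns) == list(x_test.columns)
assert columns_match, "train and test feature columns differ"
```

Running a check like this before fitting can surface a mismatch early, rather than at prediction time.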

Next, we define a function that fits each candidate model and records its score on the training data:

In [7]:
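def test_models(x, y):
    """Fit each candidate model on (x, y) and return its training score in percent."""
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge': Ridge(),
        'Lasso': Lasso(),
        'Decision Tree': DecisionTreeRegressor(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVM': SVR(),
    }
    results = {}

    for name, model in models.items():
        model.fit(x, y)
        # score() returns R^2 for the regressors and accuracy for
        # LogisticRegression, evaluated here on the training data itself.
        results[name] = round(model.score(x, y) * 100, 2)

    return results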

In [8]:
test_models(x_train, y_train)

{'Linear Regression': 40.14,
 'Ridge': 40.14,
 'Lasso': 6.35,
 'Decision Tree': 95.73,
 'Logistic Regression': 80.99,
 'SVM': 11.93}
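The figures above are computed on the same data the models were fitted on, and `score` reports R² for the regressors but accuracy for `LogisticRegression`, so they are not directly comparable. One common alternative is to compare classifiers by cross-validated accuracy. The sketch below illustrates the idea on synthetic data generated with `make_classification`. It is an assumption-laden stand-in, not the notebook's actual pipeline, and it swaps in `DecisionTreeClassifier` for the regressor used above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train (assumption: not the Titanic data).
x, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

cv_results = {}
for name, clf in classifiers.items():
    # 5-fold cross-validation: each fold is scored on rows the model did not see.
    scores = cross_val_score(clf, x, y, cv=5, scoring="accuracy")
    cv_results[name] = round(scores.mean() * 100, 2)

print(cv_results)
```

With the real data, `x` and `y` would be replaced by `x_train` and `y_train`. An unpruned decision tree often scores noticeably lower under cross-validation than on its own training data.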

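The Decision Tree obtains the highest score. Note, however, that these scores are measured on the training data itself, and that score returns R² for the regressors but accuracy for Logistic Regression, so the 95.73 likely overstates how well the tree generalizes to unseen passengers. With that caveat, we train the Decision Tree and apply it to our test set to generate our submission file: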

In [9]:
model = DecisionTreeRegressor()
model.fit(x_train, y_train)

In [10]:
y_test = model.predict(x_test)

In [11]:
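# Threshold at 0.5 so that fractional predictions map to the nearer class;
# a plain astype(int) would truncate values such as 0.67 down to 0.
submission = pd.DataFrame({"PassengerId": passengerId,
                           "Survived": (y_test >= 0.5).astype(int)})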

submission.to_csv('../data/processed/result-test.csv', index=False)
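When a regressor's continuous predictions are converted to 0/1 labels for a submission, the conversion method matters: an integer cast truncates toward zero, while thresholding at 0.5 assigns the nearer class. A small sketch with made-up prediction values (not output from the model above):

```python
import numpy as np

# Made-up regressor outputs; fractional values can arise when a tree leaf
# contains passengers with different outcomes.
y_pred = np.array([0.0, 0.4, 0.5, 0.67, 1.0])

truncated = y_pred.astype(int)             # cast truncates toward zero
thresholded = (y_pred >= 0.5).astype(int)  # 0.5 and above counts as survived

print(truncated)    # [0 0 0 0 1]
print(thresholded)  # [0 0 1 1 1]
```

Using a classifier such as `DecisionTreeClassifier` would avoid the conversion entirely, since its `predict` returns class labels directly.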

We use Python's pickle module to serialize the fitted model, so that it can be loaded later by an application that makes survival predictions:

In [12]:
import pickle

In [13]:
# The with-block closes the file even if serialization fails.
with open('../data/external/param_model.sav', 'wb') as f:
    pickle.dump(model, f)
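Loading the model back is the mirror image of saving it. The sketch below shows a full round trip under stated assumptions: a toy `DecisionTreeClassifier` stands in for the notebook's fitted model, and a temporary file stands in for the project's own path:

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Toy model standing in for the notebook's fitted model (assumption).
model = DecisionTreeClassifier(random_state=0).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# A temporary path is used here; the notebook writes to its own data folder.
path = os.path.join(tempfile.mkdtemp(), "model.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# The restored model reproduces the original model's predictions.
same = bool((loaded.predict([[0], [3]]) == model.predict([[0], [3]])).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and a model is best unpickled with the same scikit-learn version that created it.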