# Training the model

Now that we have preprocessed our data, we can finally train the ML model using logistic regression.



We import our libraries as before.

In [None]:
import sklearn
import pandas as pd

We import the preprocessed data that we created in the previous notebook.

In [None]:
titanic_df = pd.read_csv('datasets/titanic_processed.csv')

titanic_df.head()

In [None]:
titanic_df.shape

When training an ML model, we split our source data into two parts: one for training the model and one for evaluating the trained model. An 80:20 split is a common choice.

The features that we will use to train our model are all the features (columns) except the 'Survived' feature.

We drop the 'Survived' column from the dataframe and save the rest of the features in the 'X' variable.

The target variable or label, in our case 'Survived', will be saved to the 'Y' variable.






In [None]:
X = titanic_df.drop('Survived', axis=1)
Y = titanic_df['Survived']


Scikit-learn provides a handy function that will split our data into training and testing subsets.

When called, the `train_test_split()` function shuffles the data by default, then splits it based on a parameter that we provide. As we want an 80:20 split, we pass `test_size=0.2`.

The output of calling this function will be saved into four variables:
- x_train
- x_test
- y_train
- y_test

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

We can now take a look at the shape of each of the variables. Let's look at the training features and labels first.

In [None]:
x_train.shape, y_train.shape

We'll use 569 rows of our dataset to train our model.

Now let's look at the test features and labels.

In [None]:
x_test.shape, y_test.shape

The remaining 143 rows will be used to test the accuracy of our model.
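
The row counts above follow directly from the split ratio. As a minimal sketch, assuming a stand-in dataframe with 712 rows (569 + 143) and hypothetical column names in place of the real Titanic features, the following reproduces the arithmetic. The `random_state=0` argument is an addition that this notebook does not use; it fixes the shuffle so that repeated runs give the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 712 rows, matching 569 + 143 from the cells above.
# The column names are hypothetical, not the real Titanic features.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feature_a": rng.normal(size=712),
    "feature_b": rng.normal(size=712),
    "label": rng.integers(0, 2, size=712),
})

demo_X = demo_df.drop("label", axis=1)
demo_Y = demo_df["label"]

# random_state fixes the shuffle so the split is reproducible (an assumption;
# the notebook's own call leaves it unset, so its rows differ between runs).
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_X, demo_Y, test_size=0.2, random_state=0
)

# 20% of 712 is 142.4, which is rounded up to 143 test rows, leaving 569.
print(dx_train.shape, dx_test.shape)
```

Only the selection of rows depends on the shuffle; the sizes of the two subsets are determined by `test_size` and the number of rows.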

A logistic regression model can be easily built and trained in scikit-learn using the logistic regression estimator.

An estimator in scikit-learn is a high-level object that makes it easy to build and train models for prediction.

We create an instance of the estimator and pass in a few parameters to 'design' our model in a certain way.

* `penalty='l2'` is the default value used by the estimator. It means that we are regularizing our model by penalizing models that are overly complex.
    * The 'l2' penalty is based on the l2 norm of the model's coefficients: it adds the sum of the squares of the coefficients (the squared l2 norm) to the loss.
* `C=1.0` specifies the strength of the regularization. 'C' stands for the inverse of regularization strength, so smaller values mean stronger regularization.
* `solver='liblinear'` specifies the optimisation algorithm that the estimator will use. 'liblinear' is a good choice for small datasets.

We then call the `.fit()` method on our estimator to perform the training.

More info on the LogisticRegression estimator can be found at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
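
To make the role of `C` more concrete, here is a rough sketch that fits the same estimator with three values of `C` and compares the sum of squared coefficients. It assumes synthetic data from `make_classification` rather than the Titanic dataset, and the names `demo_X`, `demo_y` and `coef_norms` are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Titanic dataset).
demo_X, demo_y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

coef_norms = {}
for c_value in (0.01, 1.0, 100.0):
    # The penalty is left at its default, which is 'l2'.
    model = LogisticRegression(C=c_value, solver="liblinear")
    model.fit(demo_X, demo_y)
    # Sum of squared coefficients: the quantity the 'l2' penalty discourages.
    coef_norms[c_value] = float(np.sum(model.coef_ ** 2))

for c_value, norm in coef_norms.items():
    print(f"C={c_value:>6}: sum of squared coefficients = {norm:.4f}")
```

With a small `C` the penalty dominates and the coefficients are pulled towards zero; with a large `C` the model is freer to fit the training data.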

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear').fit(x_train, y_train)

Now that we have trained our model it's time to see how the predictions turn out.


In [None]:
y_pred = logistic_model.predict(x_test)
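
For readers wondering where the 0/1 predictions come from, the sketch below suggests one way to look at it: the model estimates a probability for each class, and for a binary problem `predict()` agrees with thresholding the class-1 probability at 0.5. It assumes synthetic data rather than the Titanic dataset, and `demo_model` and the other names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the Titanic dataset).
demo_X, demo_y = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(solver="liblinear").fit(demo_X, demo_y)

# Column 1 of predict_proba holds the estimated probability of class 1.
prob_class_1 = demo_model.predict_proba(demo_X)[:, 1]

# For a binary problem, predict() matches thresholding that probability at 0.5.
manual_pred = (prob_class_1 > 0.5).astype(int)
print(np.array_equal(manual_pred, demo_model.predict(demo_X)))
```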

We'll set up the actual 'Survived' values from our data and the predicted values from our model in a dataframe so that we can see them side by side.

In [None]:
pred_results = pd.DataFrame({'y_test': y_test,
                             'y_pred': y_pred})

Let's take a sample of actual vs predicted results.

In [None]:
pred_results.head()

We must set up an objective way to measure the performance of our model.

We can create a confusion matrix to view the results.

In [None]:
titanic_crosstab = pd.crosstab(pred_results.y_pred, pred_results.y_test)

titanic_crosstab

We can see from this that most of the results are in the true-positive and true-negative cells.

There are, however, some false-positive and false-negative values.
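
Scikit-learn also has its own `confusion_matrix` function, and its layout differs from the crosstab call used in this notebook. The sketch below uses a handful of hand-made labels (not the Titanic predictions, and the `demo_` names are hypothetical) to show the two layouts side by side.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Small hand-made labels for illustration (not the Titanic predictions).
demo_true = [1, 0, 1, 1, 0, 0, 1, 0, 0]
demo_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1]

# scikit-learn layout: rows are actual labels, columns are predicted labels.
cm = confusion_matrix(demo_true, demo_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# The notebook's crosstab call uses the opposite layout (rows predicted,
# columns actual), so it is the transpose of scikit-learn's matrix.
demo_crosstab = pd.crosstab(
    pd.Series(demo_pred, name="y_pred"), pd.Series(demo_true, name="y_test")
)
print(demo_crosstab)
```

Knowing which axis holds the predictions matters as soon as individual cells are read out as false positives or false negatives.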

Scikit-learn provides us with the tools to properly measure the accuracy of our model.

* Accuracy score indicates the fraction of predictions that the model got right. For a binary classifier with roughly balanced classes, anything higher than 50% is better than random guessing.
* Precision score indicates how many of the passengers the model predicted as survivors actually did survive. 80% shows us there were few false positives.
* Recall score indicates how many of the actual survivors the model correctly predicted. This lower score tells us there were many false negatives.
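
The difference between precision and recall is easier to see on a tiny example. This is only an illustrative sketch with hand-made labels (not the Titanic results): a cautious model that rarely predicts 'survived' makes no false positives but misses most of the actual survivors.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels for illustration: 4 actual survivors, 4 non-survivors.
demo_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A cautious model that predicts 'survived' only once, and is right that time.
demo_pred = [1, 0, 0, 0, 0, 0, 0, 0]

demo_acc = accuracy_score(demo_true, demo_pred)     # (1 TP + 4 TN) / 8 = 0.625
demo_prec = precision_score(demo_true, demo_pred)   # 1 TP / (1 TP + 0 FP) = 1.0
demo_recall = recall_score(demo_true, demo_pred)    # 1 TP / (1 TP + 3 FN) = 0.25

print(demo_acc, demo_prec, demo_recall)
```

Here precision is perfect while recall is poor, which is why the scores are usually read together rather than in isolation.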

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy Score: ", acc)
print("Precision Score: ", prec)
print("Recall Score: ", recall)

We can calculate these scores manually by breaking down the confusion matrix into true positives, true negatives, false positives and false negatives, then performing some basic arithmetic.

In [None]:
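# Rows of the crosstab are predicted labels; columns are actual labels.
TP = titanic_crosstab.loc[1, 1]  # predicted 1, actual 1
TN = titanic_crosstab.loc[0, 0]  # predicted 0, actual 0
FP = titanic_crosstab.loc[1, 0]  # predicted 1, actual 0
FN = titanic_crosstab.loc[0, 1]  # predicted 0, actual 1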

In [None]:
accuracy_score_verified = (TP + TN) / (TP + FP + TN + FN)
accuracy_score_verified

In [None]:
precision_score_verified = TP / (TP + FP)
precision_score_verified

In [None]:
recall_score_verified = TP / (TP + FN)
recall_score_verified

As we can see, the scores we calculated manually are exactly the same as the scores given to us by the scikit-learn functions. It never hurts to double-check.