## Training a logistic regression classifier


First import the dataset we prepared earlier.

In [None]:
import pandas as pd

In [None]:
titanic_df = pd.read_csv('datasets/titanic_processed.csv')

titanic_df.head()

In [None]:
titanic_df.shape

When training an ML model, we split the source data into two parts: one is used to train the model, and the other is used to evaluate the trained model on data it has not seen. An 80:20 split is a common choice.

Scikit-learn provides a handy function to do just this.

In [None]:
from sklearn.model_selection import train_test_split

# Features are every column except the target; the target is 'Survived'
X = titanic_df.drop('Survived', axis=1)
Y = titanic_df['Survived']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
x_train.shape, y_train.shape

In [None]:
x_test.shape, y_test.shape

We'll use 569 records from our dataset to train the model and the remaining 143 records to test the model.
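Because `train_test_split` shuffles the rows randomly, the split (and every score computed from it) can change from run to run. Here is a minimal sketch on made-up data, assuming we want a repeatable split that also keeps the class ratio the same in both parts. The toy DataFrame, its column names and the seed value 42 are illustrative assumptions, not part of the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40 positives and 60 negatives.
toy_df = pd.DataFrame({
    'feature': range(100),
    'label': [1] * 40 + [0] * 60,
})

toy_X = toy_df.drop('label', axis=1)
toy_y = toy_df['label']

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the proportion of each class equal in both parts.
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(tx_train.shape, tx_test.shape)
print(ty_train.mean(), ty_test.mean())
```

With `stratify` set, both parts should contain the same share of positive labels as the full toy dataset.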

A logistic regression model can be easily built and trained in scikit-learn using the logistic regression estimator.

An estimator in scikit-learn is a high-level object that makes it easy to build and train models for prediction.

When we create an instance of the estimator, we pass in a few parameters to 'design' our model in a certain way.

* `penalty='l2'` is the estimator's default value. It means we are regularizing our model by applying a penalty to models that are overly complex, that is, models with large coefficients.
    * The 'l2' penalty is based on the l2 norm of the model's coefficients: it adds the sum of the squares of the coefficients (the squared l2 norm) to the loss.
* `C=1.0` specifies the strength of the regularization. 'C' stands for the inverse of regularization strength, so smaller values mean stronger regularization.
* `solver='liblinear'` specifies the optimisation algorithm that the estimator will use. 'liblinear' is a good choice for small datasets.
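To make the role of `C` more concrete, the sketch below fits the same kind of model with several values of `C` and prints the size of the learned coefficients. It uses synthetic data from `make_classification` rather than the Titanic dataset, and the particular values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

coef_norms = {}
for c in [0.01, 1.0, 100.0]:
    # l2 is the default penalty; a smaller C means a stronger penalty.
    model = LogisticRegression(C=c, solver='liblinear').fit(X_demo, y_demo)
    coef_norms[c] = np.linalg.norm(model.coef_)
    print(f"C={c}: l2 norm of coefficients = {coef_norms[c]:.3f}")
```

A smaller `C` should pull the coefficients towards zero, so the printed norm is expected to be smallest for `C=0.01`.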

Calling the `fit()` method starts the training.

More info on the LogisticRegression estimator can be found at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear').fit(x_train, y_train)

In [None]:
y_pred = logistic_model.predict(x_test)

In [None]:
pred_results = pd.DataFrame({'y_test': y_test,
                             'y_pred': y_pred})

In [None]:
pred_results.head()

We must set up an objective way to measure the performance of our model.

We can create a confusion matrix to view the results.

In [None]:
titanic_crosstab = pd.crosstab(pred_results.y_pred, pred_results.y_test)

titanic_crosstab
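Scikit-learn also has a `confusion_matrix` function. The small sketch below uses hand-made labels (not the Titanic predictions) to show how its layout relates to a crosstab built as above: `confusion_matrix` puts the actual values on the rows and the predictions on the columns, which is the transpose of `pd.crosstab(y_pred, y_test)`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
actual = pd.Series([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], name='y_test')
predicted = pd.Series([1, 1, 0, 0, 1, 0, 0, 0, 0, 0], name='y_pred')

# Rows are predictions, columns are actual values.
toy_crosstab = pd.crosstab(predicted, actual)

# Rows are actual values, columns are predictions.
toy_cm = confusion_matrix(actual, predicted)

print(toy_crosstab)
print(toy_cm)

# The two tables hold the same counts, transposed.
print((toy_crosstab.values.T == toy_cm).all())
```

Keeping track of which axis holds the predictions matters when reading false positives and false negatives off the table.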

Scikit-learn provides us with the tools to properly measure the accuracy of our model.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [None]:
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy Score: ", acc)
print("Precision Score: ", prec)
print("Recall Score: ", recall)

* Accuracy score indicates what fraction of the predictions the model got right. Higher than 50% is better than random guessing for a binary classifier with balanced classes.
* Precision score indicates what fraction of the passengers the model predicted as survivors actually did survive. A score of 80% shows us there were few false positives.
* Recall score indicates what fraction of the actual survivors the model correctly predicted. A lower score here tells us there were many false negatives.
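A tiny worked example can make these definitions easier to check by hand. The labels below are made up for illustration: there are 4 actual survivors, the model finds 2 of them, and it wrongly predicts 1 other passenger as a survivor.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 1 = survived, 0 = did not survive.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# 7 of the 10 predictions match the true labels.
toy_acc = accuracy_score(toy_true, toy_pred)
# 3 passengers were predicted to survive, 2 of them actually did.
toy_prec = precision_score(toy_true, toy_pred)
# 4 passengers actually survived, the model found 2 of them.
toy_recall = recall_score(toy_true, toy_pred)

print("Accuracy: ", toy_acc)
print("Precision:", toy_prec)
print("Recall:   ", toy_recall)
```

Here precision (2 out of 3) is higher than recall (2 out of 4) because the model makes only one false positive but misses two of the four survivors.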

We can calculate these scores manually by breaking the confusion matrix down into true positives, true negatives, false positives and false negatives, then applying some basic arithmetic.

In [None]:
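# Columns of the crosstab are actual values (y_test) and rows are predictions (y_pred),
# so titanic_crosstab[actual][predicted] selects one cell.
TP = titanic_crosstab[1][1]  # actual 1, predicted 1
TN = titanic_crosstab[0][0]  # actual 0, predicted 0
FP = titanic_crosstab[0][1]  # actual 0, predicted 1
FN = titanic_crosstab[1][0]  # actual 1, predicted 0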

In [None]:
accuracy_score_verified = (TP + TN) / (TP + FP + TN + FN)
accuracy_score_verified

In [None]:
precision_score_verified = TP / (TP + FP)
precision_score_verified

In [None]:
recall_score_verified = TP / (TP + FN)
recall_score_verified