## Introduction to classification with logistic regression

Having seen examples of regression modelling with different libraries, we now turn to the second type of predictive analysis problem: classification with Python. From the general introduction to the topic, you already know the basics of logistic regression, working with training and test sets, and performance evaluation. Here we look at how to do this with the tools in the sklearn library.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np

In [None]:
# As an example, we will use the Titanic passenger data, originally from Kaggle
# https://www.kaggle.com/c/titanic

titanic_df = pd.read_csv('train.csv')

In [None]:
# Looking at the data, we see the familiar variables
# In this example, the only preparation we do is handling missing values
# In particular, we have to fill in the missing values of the Age variable
titanic_df.info()

In [None]:
# We can use the median as the replacement value

age_value = titanic_df['Age'].median()
titanic_df['Age'] = titanic_df['Age'].fillna(age_value)
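The median replacement above can be checked on a small example. The following is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration, they are not taken from the Titanic file); it counts the missing values before and after the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy_df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 58.0, np.nan]})

# Count the missing values before the replacement
n_missing_before = toy_df['Age'].isna().sum()

# median() skips missing values by default, so it is computed from 22, 35 and 58
toy_median = toy_df['Age'].median()
toy_df['Age'] = toy_df['Age'].fillna(toy_median)

# Count the missing values after the replacement
n_missing_after = toy_df['Age'].isna().sum()

print(n_missing_before, toy_median, n_missing_after)
```

Running `titanic_df['Age'].isna().sum()` after the fill is a quick way to confirm the same result on the real data.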

In [None]:
# To make use of the Sex column, we turn it into a numeric variable
# where 1 corresponds to females and 0 to males

titanic_df['IsFemale'] = (titanic_df['Sex'] == 'female').astype(int)

In [None]:
# In the prediction model, we will make use of three predictor variables, Pclass, Age and IsFemale
# and we try to predict Survived, so we create a separate dataframe for this purpose

prediction_data = titanic_df[['Pclass', 'IsFemale', 'Age', 'Survived']]

In [None]:
# To build the model properly, we need to create a training set and a test set
# For this purpose, sklearn offers a useful function in the model_selection module

from sklearn.model_selection import train_test_split

In [None]:
# When we create the training and test sets, we pass the predictor columns, the outcome column
# and the proportion of the data we want to be in the test set (in this example 20%)
# This will create four things:
# X_train and X_test: predictors for the training and test sets
# y_train and y_test: outcome for the training and test sets

X_train, X_test, y_train, y_test = train_test_split(prediction_data[['Pclass', 'IsFemale', 'Age']], prediction_data.Survived, test_size=0.2)
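The split above is random, so every run produces different training and test sets. As a sketch of one way to make it repeatable, the example below uses the `random_state` and `stratify` parameters of `train_test_split` on hypothetical simulated data (the `toy` dataframe and the seed values are assumptions for illustration, not part of the original example).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical simulated data with the same column names as in the example
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=100),
    'IsFemale': rng.integers(0, 2, size=100),
    'Age': rng.uniform(1, 80, size=100),
})
toy['Survived'] = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffling, so the same split is obtained on every run
# stratify keeps the share of survivors similar in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['Pclass', 'IsFemale', 'Age']], toy['Survived'],
    test_size=0.2, random_state=42, stratify=toy['Survived'])

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing the seed is convenient for teaching and debugging, but the performance measured on a single fixed split is still only one possible outcome.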

In [None]:
# Now we can import the class that will build the model for us
# and create the model object

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver = 'lbfgs')

In [None]:
# Next we can train the model using fit, with the predictor variables and the outcome

model.fit(X_train, y_train)

In [None]:
# As we did with linear regression, we can alternatively use the statsmodels package
# Note that sm.Logit does not add an intercept by itself, so we add a constant column
# As we can observe from the coefficients, being female has a positive effect on survival,
# having a higher class number (in this case worse, as class 1 is the best) has a negative effect,
# and age has a small negative effect

import statsmodels.api as sm
logit_model = sm.Logit(y_train, sm.add_constant(X_train))
result = logit_model.fit()
print(result.summary())
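The signs of the coefficients discussed above can also be read as odds ratios by exponentiating them. The sketch below illustrates this with the sklearn model on hypothetical simulated data: the data-generating coefficients (2.5, -1.0, -0.02) and all variable names are assumptions chosen for illustration, not estimates from the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical simulated passengers; the "true" effects below are made up
rng = np.random.default_rng(1)
n = 1000
pclass = rng.integers(1, 4, size=n)
is_female = rng.integers(0, 2, size=n)
age = rng.uniform(1, 80, size=n)

# Log-odds of survival: positive for females, negative for higher class number and age
true_logit = 2.5 * is_female - 1.0 * (pclass - 2) - 0.02 * (age - 30)
survived = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(int)

X_sim = np.column_stack([pclass, is_female, age])
sim_model = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_sim, survived)

# coef_ holds one row of coefficients for a binary problem
coefs = sim_model.coef_[0]
# exp(coefficient) is the multiplicative change in the odds for a one-unit increase
odds_ratios = np.exp(coefs)

for name, coef, ratio in zip(['Pclass', 'IsFemale', 'Age'], coefs, odds_ratios):
    print(f'{name}: coefficient {coef:.3f}, odds ratio {ratio:.3f}')
```

An odds ratio above 1 means the predictor raises the odds of the positive class, and a value below 1 means it lowers them. Note that sklearn applies L2 regularization by default, so its coefficients are typically somewhat smaller in magnitude than the unregularized statsmodels estimates.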

In [None]:
# We can create a prediction for our test data based on the created model

y_predict = model.predict(X_test)
y_predict[:10]
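Besides class labels, a fitted `LogisticRegression` can return the estimated class probabilities through `predict_proba`. The following sketch, on hypothetical data generated with `make_classification`, shows that for a binary model the labels from `predict` correspond to a 0.5 cut-off on the probability of class 1, and how a different cut-off could be applied by hand (the 0.7 threshold is an arbitrary illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data with three predictors and a binary outcome
X_demo, y_demo = make_classification(n_samples=200, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

demo_model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# One row per observation, one column per class (class 0, class 1)
proba = demo_model.predict_proba(X_demo)
labels = demo_model.predict(X_demo)

# Reproduce the default labels with a 0.5 cut-off on the class 1 probability
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, labels_from_proba))

# A stricter, arbitrary cut-off predicts the positive class less often
strict_labels = (proba[:, 1] > 0.7).astype(int)
print(labels.sum(), strict_labels.sum())
```

Changing the cut-off trades false positives against false negatives, which is useful when the two kinds of error are not equally costly.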

In [None]:
# In order to assess the quality of the model, we can look at the confusion matrix

from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_predict)
print(matrix)

In [None]:
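# Accuracy calculated by hand from the confusion matrix:
# correctly classified observations (the diagonal) divided by all test observations
# The numbers come from one particular run and will differ when the random split changes
(93 + 47) / (93 + 15 + 24 + 47)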
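The manual calculation above can be extended to other measures. The sketch below assumes that the confusion matrix of that run was `[[93, 15], [24, 47]]` in sklearn's layout (rows are the true classes, columns the predicted classes); the position of 15 and 24 is an assumption, as only the four numbers appear in the calculation.

```python
import numpy as np

# Assumed confusion matrix: rows = true class, columns = predicted class
cm = np.array([[93, 15],
               [24, 47]])

# For a binary problem, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted survivors who did survive
recall = tp / (tp + fn)           # share of actual survivors who were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

With the `matrix` object from the notebook, `matrix.trace() / matrix.sum()` gives the accuracy without typing the numbers by hand.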

In [None]:
# Based on this array, we could also manually calculate different measures,
# but there is also a useful function for this

from sklearn.metrics import classification_report

report = classification_report(y_test, y_predict)

print(report)

In [None]:
# Finally, an example of performing cross-validation
# Looking at the accuracy values, we can see quite big differences between the folds,
# indicating that how the (single) test set is selected really impacts the results

from sklearn.model_selection import cross_val_score
model = LogisticRegression(solver = 'lbfgs')
scores = cross_val_score(model, X_train, y_train, cv=10, scoring = 'accuracy')
scores
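The fold-by-fold accuracies are usually summarized by their mean and standard deviation. Below is a minimal sketch on hypothetical data generated with `make_classification` (the data and the variable names are assumptions for illustration); the same two lines at the end can be applied to the `scores` array above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data with three predictors and a binary outcome
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

# 10-fold cross-validation: each fold serves once as the test set
cv_scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_demo, y_demo,
                            cv=10, scoring='accuracy')

# The mean estimates the expected accuracy, the standard deviation its variability
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
print(f'accuracy: {cv_mean:.3f} +/- {cv_std:.3f}')
```

A large standard deviation relative to the mean is another way of seeing that a single train/test split can give a misleading picture of model quality.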