# Supervised learning - Classification

In [1]:
%pylab
%matplotlib inline

%config InlineBackend.figure_format = 'retina'

import numpy as np

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


### Logistic regression

We are going to fit a logistic regression model with scikit-learn. The library provides the `make_classification()` function to create a random dataset, which we will use here:

In [4]:
from sklearn.datasets              import make_classification
from sklearn.linear_model          import LogisticRegression
from sklearn.metrics               import confusion_matrix
from sklearn.model_selection       import train_test_split

In [7]:
# Create a random dataset for training
x, y = make_classification(n_samples    = 2500, # size of the dataset
                           n_features   = 3,    # number of independent variables
                           n_redundant  = 0,    # no redundant features
                           random_state = 1)    # fixed seed, so the dataset is the same on every run

# Split dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 1)

# Create and fit the model
classifier = LogisticRegression().fit(x_train, y_train)

# Predict the values
y_train_pred = classifier.predict(x_train)
y_test_pred = classifier.predict(x_test)

# Get the confusion matrices
cm_train = confusion_matrix(y_train, y_train_pred)
cm_test = confusion_matrix(y_test, y_test_pred)

print('Training confusion matrix:')
print(cm_train)

print('\nTesting confusion matrix:')
print(cm_test)

Training confusion matrix:
[[862  90]
 [ 97 826]]

Testing confusion matrix:
[[268  35]
 [ 28 294]]
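
In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so for a binary problem the four cells are TN, FP, FN and TP in reading order. As a minimal sketch, the counts can be unpacked and turned into summary metrics; the matrix below is copied by hand from the testing output above, and the metric names are only illustrative:

```python
import numpy as np

# Testing confusion matrix, copied from the output above
cm_test = np.array([[268,  35],
                    [ 28, 294]])

# Rows are true classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = cm_test.ravel()

accuracy  = (tp + tn) / cm_test.sum()  # share of all samples predicted correctly
precision = tp / (tp + fp)             # share of predicted positives that are correct
recall    = tp / (tp + fn)             # share of true positives that are found

print('FP:', fp, 'FN:', fn)
print('Accuracy: ', round(accuracy, 4))
print('Precision:', round(precision, 4))
print('Recall:   ', round(recall, 4))
```

Here 63 of the 625 test samples are misclassified, which gives an accuracy of 0.8992.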


These confusion matrices show that the predictions are good, because the numbers of false negatives (FN) and false positives (FP) are small. To compare the two results and check for overfitting, we can normalize the confusion matrices, dividing each column by its total:

In [14]:
print('Normalized training confusion matrix:')
print(cm_train / cm_train.sum(axis=0))
print('\nNormalized testing confusion matrix:')
print(cm_test / cm_test.sum(axis=0))

Normalized training confusion matrix:
[[ 0.89885297  0.09825328]
 [ 0.10114703  0.90174672]]

Normalized testing confusion matrix:
[[ 0.90540541  0.10638298]
 [ 0.09459459  0.89361702]]


Both matrices show similar results, so there is no sign of overfitting.
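
The same comparison can be sketched more compactly. This assumes a scikit-learn version in which `confusion_matrix()` accepts the `normalize` argument (0.22 or later); `normalize='pred'` divides each column by its total, and `score()` returns the accuracy. The exact numbers may differ slightly between library versions, so none are quoted here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Same dataset, split and model as in the cells above
x, y = make_classification(n_samples=2500, n_features=3,
                           n_redundant=0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
classifier = LogisticRegression().fit(x_train, y_train)

# Column-normalized confusion matrices (each column sums to 1)
cm_train_norm = confusion_matrix(y_train, classifier.predict(x_train),
                                 normalize='pred')
cm_test_norm = confusion_matrix(y_test, classifier.predict(x_test),
                                normalize='pred')

# Accuracy on each split; a large gap would hint at overfitting
train_acc = classifier.score(x_train, y_train)
test_acc = classifier.score(x_test, y_test)

print(cm_train_norm)
print(cm_test_norm)
print('Accuracy gap (train - test):', train_acc - test_acc)
```

A training accuracy much higher than the testing accuracy would suggest overfitting; a gap close to zero supports the conclusion above.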