# Logistic Regression binary classification exercise

In this exercise you will work with the [affairs dataset](http://statsmodels.sourceforge.net/stable/datasets/generated/fair.html), based on a 1974 survey of women who were asked whether they had had extramarital affairs.

To work with the dataset correctly, we split the data into training data (the X_train and y_train matrices) and test data (X_test and y_test).

We ask you to:

1) Build a binary classifier trained on the training data, and compute its classification accuracy

2) Test the classification accuracy on the test data given the model you just trained

3) Create a new sample modeling a virtual survey respondent (you can set her parameters randomly) and see whether the model predicts that she would have an affair.

Consider plotting and printing the data along the way to get a feel for what you are looking at.
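
As one way to get a first look at the data, the sketch below groups respondents by marriage rating and computes the share with an affair in each group. It runs on a small synthetic frame, not on the real survey: the values and the `had_affair` column name are assumptions made for illustration, and only `rate_marriage` and `affairs` are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data (hypothetical values, NOT the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rate_marriage": rng.integers(1, 6, size=200),  # ratings 1..5
    "affairs": rng.exponential(0.5, size=200) * rng.integers(0, 2, size=200),
})

# Same binarization as in the exercise: any positive value counts as an affair.
toy["had_affair"] = (toy["affairs"] > 0).astype(int)

# Share of respondents with an affair, per marriage rating.
affair_rate = toy.groupby("rate_marriage")["had_affair"].mean()
print(affair_rate)
```

On the real data, the same grouping could be applied to `data`, and `affair_rate.plot(kind="bar")` would turn the result into a bar chart.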


In [33]:
%matplotlib inline

import pandas as pd #used for reading/writing data
import numpy as np #numerical computing library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
from sklearn.model_selection import train_test_split #creation of train/test sets

#loading and splitting the data into train/test sets
data = pd.read_csv('data/affairs_dataset/fair.csv', sep=',')
y = (data.affairs > 0).astype(int)
X = data.drop('affairs', axis=1)

#split the data into train and test sets, with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.describe()

| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb |
|---|---|---|---|---|---|---|---|---|
| count | 4456.0 | 4456.0 | 4456.0 | 4456.0 | 4456.0 | 4456.0 | 4456.0 | 4456.0 |
| mean | 4.116248 | 29.09807 | 9.017504 | 1.3989 | 2.414946 | 14.2307 | 3.419659 | 3.865575 |
| std | 0.959071 | 6.788485 | 7.226163 | 1.427461 | 0.877166 | 2.19748 | 0.935788 | 1.344939 |
| min | 1.0 | 17.5 | 0.5 | 0.0 | 1.0 | 9.0 | 1.0 | 1.0 |
| 25% | 4.0 | 22.0 | 2.5 | 0.0 | 2.0 | 12.0 | 3.0 | 3.0 |
| 50% | 4.0 | 27.0 | 6.0 | 1.0 | 2.0 | 14.0 | 3.0 | 4.0 |
| 75% | 5.0 | 32.0 | 16.5 | 2.0 | 3.0 | 16.0 | 4.0 | 5.0 |
| max | 5.0 | 42.0 | 23.0 | 5.5 | 4.0 | 20.0 | 6.0 | 6.0 |


Your code starts here...

In [34]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

## Without normalization

In [35]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
predictedAffair = lr.predict(X_test)
comparison = np.logical_xor(y_test, predictedAffair)
(y_test.shape[0] - np.sum(comparison))/y_test.shape[0]

0.73036649214659688

In [37]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictedAffair)

0.73036649214659688
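
The two cells above arrive at the same number because XOR marks exactly the positions where prediction and label disagree. The sketch below checks this on a tiny hand-made example. The arrays are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical labels and predictions, chosen only to illustrate the arithmetic.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

# XOR is True wherever the two arrays differ, so its sum counts the errors.
mismatches = np.logical_xor(y_true, y_pred)
acc_xor = (y_true.shape[0] - np.sum(mismatches)) / y_true.shape[0]

# Equivalent and shorter: the fraction of positions where they agree.
acc_mean = np.mean(y_true == y_pred)

print(acc_xor, acc_mean)
```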

In [38]:
predictedAffair

array([1, 0, 0, ..., 0, 0, 0])

## With normalization

In [22]:
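# Standardize each feature to zero mean and unit variance.
# Note: the test set is scaled here with its own mean and std. Scaling it with
# the training-set statistics instead would keep test information out of preprocessing.
X_train = (X_train - X_train.mean())/ X_train.std()
X_test = (X_test - X_test.mean())/ X_test.std()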
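
A common alternative to normalizing by hand is to let scikit-learn learn the scaling from the training data only and reuse it on the test data. The sketch below does this with `StandardScaler` inside a pipeline. It is not what the cells in this notebook run, and the data is synthetic (the names `X_demo`, `y_demo` and the generated values are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (hypothetical, NOT the affairs dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[30.0, 9.0], scale=[7.0, 7.0], size=(500, 2))
y_demo = (X_demo[:, 1] + rng.normal(0, 5, size=500) > 9.0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fitted on the training split only; the same mean and std
# are then applied to the test split when scoring.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(Xtr, ytr)
test_accuracy = model.score(Xte, yte)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_, test_accuracy)
```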

In [23]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [24]:
predictedAffair = lr.predict(X_test)
probs = lr.predict_proba(X_test)
comparison = np.logical_xor(y_test, predictedAffair)
(y_test.shape[0] - np.sum(comparison))/y_test.shape[0]

0.73141361256544501

In [25]:
lr.score(X_test, y_test)

0.73141361256544501

In [26]:
y_test.mean()

0.31780104712041884
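
The mean of y_test is the share of positive labels in the test set, which gives a useful reference point: a classifier that always predicts the majority class (no affair) would already be right for the remaining share. The short calculation below uses only the two numbers reported in the outputs above. The variable names are chosen for illustration.

```python
# Values copied from the cell outputs above.
positive_rate = 0.31780104712041884   # y_test.mean()
model_accuracy = 0.73141361256544501  # lr.score(X_test, y_test)

# Accuracy of always predicting the more frequent class.
baseline_accuracy = max(positive_rate, 1 - positive_rate)

# How much the fitted model improves on that trivial baseline.
gain = model_accuracy - baseline_accuracy

print(round(baseline_accuracy, 4), round(gain, 4))
```

The baseline is about 68.2%, so the logistic regression improves on it by roughly 5 percentage points.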

In [27]:
# generate evaluation metrics
from sklearn import metrics

# use the % operator so the value is actually substituted into the format string
print("Accuracy: %f" % metrics.accuracy_score(y_test, predictedAffair))
print("AUC: %f" % metrics.roc_auc_score(y_test, probs[:, 1]))
print("Classification confusion matrix:")
print(metrics.confusion_matrix(y_test, predictedAffair))
print("Classification report:")
print(metrics.classification_report(y_test, predictedAffair))

Accuracy: 0.731414
AUC: 0.745079
Classification confusion matrix:
[[1174  129]
 [ 384  223]]
Classification report:
             precision    recall  f1-score   support

          0       0.75      0.90      0.82      1303
          1       0.63      0.37      0.47       607

avg / total       0.72      0.73      0.71      1910
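
The figures in the report for class 1 can be recomputed directly from the confusion matrix, which makes it easier to see what each one measures. The sketch below does so, using scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[1174, 129],
               [384, 223]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # of the predicted affairs, how many were real
recall_pos = tp / (tp + fn)      # of the real affairs, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 4), round(precision_pos, 2),
      round(recall_pos, 2), round(f1_pos, 2))
```

The low recall for class 1 (about 0.37) shows that the model misses most of the actual positives, even though the overall accuracy looks reasonable.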



In [28]:
from sklearn.model_selection import cross_val_score

# note: cross-validation is run on the test split here (10 folds, fresh model per fold)
scores = cross_val_score(LogisticRegression(), X_test, y_test, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())

[ 0.72916667  0.72395833  0.70833333  0.7486911   0.70680628  0.7382199
  0.7486911   0.70526316  0.68947368  0.74736842]
0.724597197345
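
Cross-validation is more often used on the training data, with any preprocessing refitted inside each fold. The sketch below shows that variant by passing a scaler-plus-classifier pipeline to `cross_val_score`. It is an assumption about how one might restructure the step, not what the cell above does, and it uses synthetic data (`X_demo`, `y_demo`) for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, NOT the affairs dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
noise = rng.normal(scale=1.0, size=300)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

# The pipeline is cloned and refitted per fold, so the scaler only ever
# sees the training part of each fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(pipeline, X_demo, y_demo, scoring="accuracy", cv=10)

print(cv_scores)
print(cv_scores.mean())
```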


In [29]:
predictedAffair


array([1, 0, 0, ..., 0, 0, 0])
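
Task 3 asks for a prediction on a new, made-up respondent. A minimal sketch of that step follows. To stay self-contained it trains on synthetic data that only borrows the column names and value ranges from the X_train.describe() output, so the labels, the fitted model, and the respondent's values are all hypothetical and its prediction says nothing about the real survey.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

columns = ["rate_marriage", "age", "yrs_married", "children",
           "religious", "educ", "occupation", "occupation_husb"]

# Synthetic training frame (hypothetical values within the observed ranges).
rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "rate_marriage": rng.integers(1, 6, n),
    "age": rng.uniform(17.5, 42.0, n),
    "yrs_married": rng.uniform(0.5, 23.0, n),
    "children": rng.integers(0, 6, n),
    "religious": rng.integers(1, 5, n),
    "educ": rng.integers(9, 21, n),
    "occupation": rng.integers(1, 7, n),
    "occupation_husb": rng.integers(1, 7, n),
})[columns]
# Made-up labels: lower marriage ratings are more likely to be positive.
labels = (train["rate_marriage"] + rng.normal(0, 1, n) < 3).astype(int)

clf = LogisticRegression(max_iter=1000).fit(train, labels)

# One virtual respondent, as a single-row frame with the same columns.
new_woman = pd.DataFrame([{
    "rate_marriage": 2, "age": 32.0, "yrs_married": 9.0, "children": 2,
    "religious": 2, "educ": 14, "occupation": 3, "occupation_husb": 4,
}])[columns]

prediction = clf.predict(new_woman)[0]             # 0 or 1
probability = clf.predict_proba(new_woman)[0, 1]   # estimated P(class 1)
print(prediction, probability)
```

With the notebook's own model, the same single-row frame would be passed to `lr.predict` instead, after applying whatever normalization was used for training.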