# Chapter 7: Linear and Logistic Regression
Training a Classifier on the *Salammbô* Dataset with scikit-learn

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Modules

We first import the modules we need:

In [1]:
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score, LeaveOneOut

## Dataset
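We can create the NumPy arrays directly. Each row of `X` is an observation with two features (named `cnt_chars` and `cnt_a` in the files loaded later), and `y` holds the class of each observation, 0 or 1.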

In [2]:
X = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]
     ])

y = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

## Loading the Dataset
### Pandas

In [3]:
import pandas as pd

You may have to adjust the path to the dataset folder:

In [4]:
PATH = '../datasets/salammbo_char_a/'

In [5]:
dataset_pd = pd.read_csv(PATH + 'salammbo_a_binary.tsv',
                         sep='\t',
                         names=['cnt_chars', 'cnt_a', 'class'])

In [6]:
dataset_pd

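|   | cnt_chars | cnt_a | class |
|---|-----------|-------|-------|
| 0 | 35680 | 2217 | 0 |
| 1 | 42514 | 2761 | 0 |
| 2 | 15162 | 990 | 0 |
| 3 | 35298 | 2274 | 0 |
| 4 | 29800 | 1865 | 0 |
| 5 | 40255 | 2606 | 0 |
| 6 | 74532 | 4805 | 0 |
| 7 | 37464 | 2396 | 0 |
| 8 | 31030 | 1993 | 0 |
| 9 | 24843 | 1627 | 0 |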


In [7]:
X = dataset_pd.to_numpy()[:, :2]
X

array([[35680,  2217],
       [42514,  2761],
       [15162,   990],
       [35298,  2274],
       [29800,  1865],
       [40255,  2606],
       [74532,  4805],
       [37464,  2396],
       [31030,  1993],
       [24843,  1627],
       [36172,  2375],
       [39552,  2560],
       [72545,  4597],
       [75352,  4871],
       [18031,  1119],
       [36961,  2503],
       [43621,  2992],
       [15694,  1042],
       [36231,  2487],
       [29945,  2014],
       [40588,  2805],
       [75255,  5062],
       [37709,  2643],
       [30899,  2126],
       [25486,  1784],
       [37497,  2641],
       [40398,  2766],
       [74105,  5047],
       [76725,  5312],
       [18317,  1215]])

In [8]:
y = dataset_pd.to_numpy()[:, 2]
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])
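As an alternative to slicing by position, the columns can be selected by name. The sketch below is only an illustration: it assumes the TSV file has no header row (consistent with the `names=` argument used above) and rebuilds a four-row sample in memory from values shown in this notebook. The names `tsv_sample`, `sample_pd`, `X_sample`, and `y_sample` are hypothetical.

```python
import io
import pandas as pd

# Four rows copied from the arrays above, as tab-separated text without a header
tsv_sample = (
    "35680\t2217\t0\n"
    "42514\t2761\t0\n"
    "36961\t2503\t1\n"
    "43621\t2992\t1\n"
)
sample_pd = pd.read_csv(io.StringIO(tsv_sample), sep='\t',
                        names=['cnt_chars', 'cnt_a', 'class'])

# Select the features and the labels by column name instead of by position
X_sample = sample_pd[['cnt_chars', 'cnt_a']].to_numpy()
y_sample = sample_pd['class'].to_numpy()
print(X_sample.shape, y_sample)
```

Selecting by name keeps the code correct if columns are later added or reordered.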

### svmlight
We can also read the data from a file in the svmlight format. `load_svmlight_file` returns `X` as a SciPy sparse matrix, so we convert it to a dense array to make it easier to read.

In [9]:
X, y = load_svmlight_file(PATH + 'salammbo_a_binary.libsvm')

In [10]:
X = X.toarray()  # dense ndarray, same result as np.array(X.todense())
print(type(X))
X[:4]

<class 'numpy.ndarray'>


array([[35680.,  2217.],
       [42514.,  2761.],
       [15162.,   990.],
       [35298.,  2274.]])

In [11]:
print(type(y))
y

<class 'numpy.ndarray'>


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
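To see what the svmlight format looks like, here is a minimal sketch that writes two observations from this notebook to an in-memory buffer and reads them back. Each line of the format has the form `label index:value index:value ...`. The buffer and the names `X_small`, `y_small`, and `svmlight_text` are illustrative, and the choice of one-based indices (`zero_based=False`) is an assumption, not necessarily the convention of the `salammbo_a_binary.libsvm` file.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Two observations taken from the dataset above, one per class
X_small = np.array([[35680, 2217], [36961, 2503]], dtype=float)
y_small = np.array([0, 1])

# Write the observations in svmlight format to an in-memory binary buffer
buffer = io.BytesIO()
dump_svmlight_file(X_small, y_small, buffer, zero_based=False)
svmlight_text = buffer.getvalue().decode('utf-8')
print(svmlight_text)

# Read the buffer back: X comes back sparse, y as a float array
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, zero_based=False)
X_loaded = X_loaded.toarray()
print(X_loaded, y_loaded)
```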

## Fitting the Data
We create a classifier with the default settings of scikit-learn and train a model:

In [12]:
classifier = LogisticRegression()
classifier

In [13]:
m = classifier.fit(X, y)

## Predicting Classes
We now apply the model to the training set and predict the classes for the whole dataset:

In [14]:
y_hat = classifier.predict(X)
y_hat

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

We predict the class of two individual observations: the last one of the dataset, then the first one, given explicitly.

In [15]:
classifier.predict([X[-1]])

array([1.])

In [16]:
classifier.predict(np.array([[35680, 2217]]))

array([0.])

We can also output the probabilities instead of the classes. Each row contains the probability of class 0, then of class 1:

In [17]:
y_predicted = classifier.predict_proba(X)
y_predicted[:4]

array([[1.00000000e+00, 4.70168695e-30],
       [1.00000000e+00, 8.59479008e-11],
       [9.76197588e-01, 2.38024122e-02],
       [1.00000000e+00, 1.00087464e-12]])

All the predictions are correct, but this is not a sound evaluation because we measured it on the data we used for training.

In [18]:
classifier.predict_proba([X[2]])

array([[0.97619759, 0.02380241]])

In [19]:
classifier.predict_proba([X[-1]])

array([[0.0165252, 0.9834748]])
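The link between `predict_proba` and `predict` can be checked directly: the predicted class is the one with the highest probability. The sketch below is a minimal illustration on six rows copied from the dataset above. The names `X_small`, `y_small`, `clf`, and `y_from_proba` are hypothetical, and `max_iter=1000` is an arbitrary choice to give the solver room on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Six rows from the dataset above: three per class
X_small = np.array([[35680, 2217], [42514, 2761], [15162, 990],
                    [36961, 2503], [43621, 2992], [15694, 1042]])
y_small = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# One row per observation, one column per class; each row sums to 1
proba = clf.predict_proba(X_small)

# The predicted class is the class whose column holds the largest probability
y_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(y_from_proba, clf.predict(X_small)))
```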

## The Model
We print the model weights: the intercept, then the coefficients of the two features.

In [20]:
'Model weights: {}, {}'.format(classifier.intercept_, classifier.coef_)

'Model weights: [3.05895793], [[-0.03210407  0.48483739]]'

Using these weights, we can recompute a predicted probability ourselves with the logistic function.

The weight vector

In [21]:
w = np.append(classifier.intercept_, classifier.coef_)
w

array([ 3.05895793, -0.03210407,  0.48483739])

The feature vector of one observation

In [22]:
x = np.append([1.0], X[-1])
x

array([1.0000e+00, 1.8317e+04, 1.2150e+03])

The prediction

In [23]:
1/(1 + np.exp(-w @ x))

0.9834748033489703
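The same computation extends to all the observations at once. The following sketch, under the assumption of a binary model fitted with the default solver, prepends a column of ones to the feature matrix and applies the logistic function to every row. It uses `scipy.special.expit`, a numerically stable logistic function, rather than the explicit `1/(1 + np.exp(...))`. The six-row sample and the names `X_ones`, `p_manual`, and `p_sklearn` are illustrative.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Six rows from the dataset above: three per class
X_small = np.array([[35680, 2217], [42514, 2761], [15162, 990],
                    [36961, 2503], [43621, 2992], [15694, 1042]])
y_small = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# Weight vector: intercept first, then the two coefficients
w = np.append(clf.intercept_, clf.coef_)

# Feature matrix with a leading column of ones for the intercept
X_ones = np.hstack((np.ones((X_small.shape[0], 1)), X_small))

# Logistic function applied to w . x for every observation
p_manual = expit(X_ones @ w)
p_sklearn = clf.predict_proba(X_small)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```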

## The Loss

In [24]:
classifier.predict_proba(X)[:, 1]

array([4.70168695e-30, 8.59479008e-11, 2.38024122e-02, 1.00087464e-12,
       3.44315694e-22, 6.21540855e-12, 8.10231988e-27, 3.08500676e-17,
       2.18523269e-12, 3.42943036e-03, 1.20445330e-03, 8.11464193e-12,
       6.54115891e-43, 2.35921417e-24, 3.53373575e-15, 1.00000000e+00,
       1.00000000e+00, 9.88088912e-01, 1.00000000e+00, 9.99999987e-01,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 9.83474803e-01])

In [25]:
# Mean negative log-likelihood: use P(y=1) when the class is 1, else 1 - P(y=1)
-np.mean(np.log([p if y_obs else 1 - p
                 for y_obs, p in
                 zip(y, classifier.predict_proba(X)[:, 1])]))

0.0019125545724898248

In [26]:
metrics.log_loss(y, classifier.predict_proba(X))

0.0019125545724899507
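For reference, the value computed by hand and by `metrics.log_loss` above is the mean binary cross-entropy. With $n$ observations, $y_i$ the true class, and $\hat{y}_i$ the predicted probability of class 1 (notation assumed here for illustration), it can be written as:

$$
L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]
$$

For each observation, only one of the two terms is nonzero, which is what the conditional expression in the list comprehension selects.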

## Multinomial Logistic Regression

In [27]:
X_de = np.array(
    [[37599, 1771], [44565, 2116], [16156, 715], [37697, 1804],
     [29800, 1865], [42606, 2146], [78242, 3813], [40341, 1955],
     [31030, 1993], [26676, 1346], [39250, 1902], [41780, 2106],
     [72545, 4597], [79195, 3988], [19020, 928]
     ])

In [28]:
X = np.vstack((X, X_de))
X

array([[35680.,  2217.],
       [42514.,  2761.],
       [15162.,   990.],
       [35298.,  2274.],
       [29800.,  1865.],
       [40255.,  2606.],
       [74532.,  4805.],
       [37464.,  2396.],
       [31030.,  1993.],
       [24843.,  1627.],
       [36172.,  2375.],
       [39552.,  2560.],
       [72545.,  4597.],
       [75352.,  4871.],
       [18031.,  1119.],
       [36961.,  2503.],
       [43621.,  2992.],
       [15694.,  1042.],
       [36231.,  2487.],
       [29945.,  2014.],
       [40588.,  2805.],
       [75255.,  5062.],
       [37709.,  2643.],
       [30899.,  2126.],
       [25486.,  1784.],
       [37497.,  2641.],
       [40398.,  2766.],
       [74105.,  5047.],
       [76725.,  5312.],
       [18317.,  1215.],
       [37599.,  1771.],
       [44565.,  2116.],
       [16156.,   715.],
       [37697.,  1804.],
       [29800.,  1865.],
       [42606.,  2146.],
       [78242.,  3813.],
       [40341.,  1955.],
       [31030.,  1993.],
       [26676.,  1346.],


In [29]:
X.shape

(45, 2)

In [30]:
y = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [31]:
cls_de = LogisticRegression(max_iter=200)

In [32]:
cls_de.fit(X, y)

In [33]:
y_hat = cls_de.predict(X)
y_hat

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2,
       2])
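With more than two classes, the probabilities can be recomputed from the weights with the softmax function, in the same way as the logistic function in the binary case. This is a sketch that assumes a recent scikit-learn, where the default `lbfgs` solver fits a multinomial model for three or more classes. The nine-row sample (three rows per class, copied from the arrays above) and the names `class_scores` and `p_manual` are illustrative.

```python
import numpy as np
from scipy.special import softmax
from sklearn.linear_model import LogisticRegression

# Nine rows from the arrays above: three per class
X_small = np.array([[35680, 2217], [42514, 2761], [15162, 990],
                    [36961, 2503], [43621, 2992], [15694, 1042],
                    [37599, 1771], [44565, 2116], [16156, 715]])
y_small = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# One linear score per class: coef_ has shape (3, 2), intercept_ has shape (3,)
class_scores = X_small @ clf.coef_.T + clf.intercept_

# The softmax turns the scores of each row into probabilities that sum to 1
p_manual = softmax(class_scores, axis=1)
print(np.allclose(p_manual, clf.predict_proba(X_small)))
```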

## Evaluation

We first evaluate the model on the training set:

In [34]:
print(metrics.classification_report(y, y_hat))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        15
           1       1.00      1.00      1.00        15
           2       1.00      0.80      0.89        15

    accuracy                           0.93        45
   macro avg       0.94      0.93      0.93        45
weighted avg       0.94      0.93      0.93        45
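A confusion matrix shows where the errors behind these figures come from. The sketch below rebuilds the true and predicted labels from the outputs printed above, so that it runs on its own. The names `y_true`, `y_pred`, and `cm` are illustrative.

```python
import numpy as np
from sklearn import metrics

# True labels: 15 observations for each of the classes 0, 1, and 2
y_true = np.repeat([0, 1, 2], 15)

# Predicted labels, copied from the output of cls_de.predict(X) above
y_pred = np.array([0] * 15 + [1] * 15 +
                  [2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
```

The matrix has rows `[15, 0, 0]`, `[0, 15, 0]`, and `[3, 0, 12]`: three observations of class 2 are predicted as class 0. This accounts for the precision of 15/18 ≈ 0.83 for class 0 and the recall of 12/15 = 0.80 for class 2 in the report.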



Scores measured on the training set are optimistic, so we use cross-validation instead:

In [35]:
scores = cross_val_score(cls_de, X, y, cv=5, scoring='accuracy')
scores

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([0.77777778, 0.88888889, 0.88888889, 1.        , 0.88888889])

In [36]:
scores.mean()

0.888888888888889
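The solver message printed above suggests either raising `max_iter` or scaling the data. One possible way to follow the second suggestion is to standardize the features inside a pipeline, so that the scaler is fitted on the training folds only. This is a sketch, not the notebook's method: the pipeline and the names `X_all`, `y_all`, `scaled_cls`, and `scaled_scores` are assumptions, and the resulting scores may differ from the ones above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The 45 observations used above: classes 0, 1, and 2, 15 rows each
X_all = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215], [37599, 1771], [44565, 2116],
     [16156, 715], [37697, 1804], [29800, 1865], [42606, 2146],
     [78242, 3813], [40341, 1955], [31030, 1993], [26676, 1346],
     [39250, 1902], [41780, 2106], [72545, 4597], [79195, 3988],
     [19020, 928]])
y_all = np.repeat([0, 1, 2], 15)

# Standardize each feature to zero mean and unit variance, then classify
scaled_cls = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Same 5-fold protocol as above, with the scaler refitted on each training split
scaled_scores = cross_val_score(scaled_cls, X_all, y_all,
                                cv=5, scoring='accuracy')
print(scaled_scores, scaled_scores.mean())
```

Putting the scaler in the pipeline, rather than scaling `X_all` once beforehand, avoids leaking statistics of the test fold into training.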

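With an integer `cv` and a classifier, `cross_val_score` splits the data into stratified folds, so that each fold keeps the class proportions of the whole dataset. We can list the train and test indices of these folds: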

In [37]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train, test in skf.split(X, y):
    print(train, test)

[ 3  4  5  6  7  8  9 10 11 12 13 14 18 19 20 21 22 23 24 25 26 27 28 29
 33 34 35 36 37 38 39 40 41 42 43 44] [ 0  1  2 15 16 17 30 31 32]
[ 0  1  2  6  7  8  9 10 11 12 13 14 15 16 17 21 22 23 24 25 26 27 28 29
 30 31 32 36 37 38 39 40 41 42 43 44] [ 3  4  5 18 19 20 33 34 35]
[ 0  1  2  3  4  5  9 10 11 12 13 14 15 16 17 18 19 20 24 25 26 27 28 29
 30 31 32 33 34 35 39 40 41 42 43 44] [ 6  7  8 21 22 23 36 37 38]
[ 0  1  2  3  4  5  6  7  8 12 13 14 15 16 17 18 19 20 21 22 23 27 28 29
 30 31 32 33 34 35 36 37 38 42 43 44] [ 9 10 11 24 25 26 39 40 41]
[ 0  1  2  3  4  5  6  7  8  9 10 11 15 16 17 18 19 20 21 22 23 24 25 26
 30 31 32 33 34 35 36 37 38 39 40 41] [12 13 14 27 28 29 42 43 44]
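The effect of stratification can be checked by counting the classes in each test fold. In this sketch, `X_dummy` is a placeholder feature matrix, since the split depends only on the labels, and the names `y_labels` and `fold_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same labels as above: 15 observations for each of the three classes
y_labels = np.repeat([0, 1, 2], 15)

# Placeholder features: StratifiedKFold only needs the number of rows
X_dummy = np.zeros((len(y_labels), 1))

skf = StratifiedKFold(n_splits=5)

# Number of observations of each class in every test fold
fold_counts = [np.bincount(y_labels[test]).tolist()
               for _, test in skf.split(X_dummy, y_labels)]
print(fold_counts)
```

Each of the five test folds contains three observations per class, which matches the index lists printed above.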


### Leave one out

We train on all the observations except one, which serves as the test set. We repeat this evaluation with a different held-out observation each time, as many times as there are observations.

In [38]:
loo = LeaveOneOut()
predictions = 0
correct_predictions = 0
for train_index, test_index in loo.split(X):
    predictions += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    cls_de.fit(X_train, y_train)
    if cls_de.predict(X_test)[0] == y_test[0]:
        correct_predictions += 1
'Leave-one-out cross-validation accuracy: {}'.format(
    correct_predictions / predictions)

'Leave-one-out cross-validation accuracy: 0.8666666666666667'
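The explicit loop can also be written with `cross_val_score` by passing a `LeaveOneOut` object as `cv`. The sketch below is a hedged variant rather than a reproduction of the result above: to stay short, it uses only the 30 observations of classes 0 and 1, and it standardizes the features in a pipeline, which is an added assumption. The names `X_bin`, `y_bin`, `loo_model`, and `loo_scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The 30 observations of classes 0 and 1 from the beginning of the notebook
X_bin = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]])
y_bin = np.repeat([0, 1], 15)

# Scaling is an assumption added here to help the solver converge
loo_model = make_pipeline(StandardScaler(), LogisticRegression())

# One fit per observation: each score is 1.0 (correct) or 0.0 (wrong)
loo_scores = cross_val_score(loo_model, X_bin, y_bin, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```

The mean of the scores is the leave-one-out accuracy, since each fold tests a single observation.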