# Training a Classifier on the *Salammbô* Dataset with scikit-learn
Author: Pierre Nugues

We first import a few modules:

In [None]:
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score, LeaveOneOut

## Reading the Dataset
We can read the data from a file in the svmlight format. `load_svmlight_file` returns $\mathbf{X}$ as a sparse matrix, so we convert it to a dense array to make it easy to inspect.

In [None]:
X, y = load_svmlight_file('../salammbo/salammbo_a_binary.libsvm')
X = X.toarray()  # sparse matrix -> dense ndarray
print(type(X))
print(X)
print(type(y))
y
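
For reference, each line of an svmlight file holds a label followed by `index:value` pairs. The sketch below parses a two-line sample from memory instead of from disk. The sample lines are hypothetical and are not claimed to match the actual content of `salammbo_a_binary.libsvm`.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Hypothetical two-line sample: a label, then index:value pairs
sample = b"0 1:35680 2:2217\n1 1:36961 2:2503\n"

# load_svmlight_file also accepts a file-like object opened in binary mode
X_sample, y_sample = load_svmlight_file(BytesIO(sample))
X_sample = X_sample.toarray()  # sparse matrix -> dense ndarray
print(X_sample)
print(y_sample)
```

Since no feature index is 0 in the sample, the loader's default heuristic treats the indices as one-based, so the result has two columns.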

We can also create the NumPy arrays directly:

In [None]:
y = np.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

X = np.array(
    [[35680, 2217], [42514, 2761], [15162, 990], [35298, 2274],
     [29800, 1865], [40255, 2606], [74532, 4805], [37464, 2396],
     [31030, 1993], [24843, 1627], [36172, 2375], [39552, 2560],
     [72545, 4597], [75352, 4871], [18031, 1119], [36961, 2503],
     [43621, 2992], [15694, 1042], [36231, 2487], [29945, 2014],
     [40588, 2805], [75255, 5062], [37709, 2643], [30899, 2126],
     [25486, 1784], [37497, 2641], [40398, 2766], [74105, 5047],
     [76725, 5312], [18317, 1215]
     ])

## Fitting the Data
We create a classifier and fit a model:

In [None]:
classifier = LogisticRegression()
model = classifier.fit(X, y)
model

## Predicting Classes
We now apply the model to the training set. We first predict the classes for the whole dataset:

In [None]:
y_hat = classifier.predict(X)
y_hat

We predict the classes of two observations, the last one in the dataset and the first one, given explicitly:

In [None]:
print(classifier.predict([X[-1]]))
print(classifier.predict(np.array([[35680, 2217]])))

We predict the class probabilities for the training set:

In [None]:
y_predicted = classifier.predict_proba(X)
y_predicted

This is a perfect prediction, but evaluating on the training set is not good practice.

## The Model

In [None]:
print('Model weights:', classifier.intercept_, classifier.coef_)

Using this model, we predict the classes with the logistic function.
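
Written out, with the intercept stored as the first weight in $\mathbf{w}$ and a leading 1 in the feature vector $\mathbf{x}$ (the convention of the next cells), the probability of class 1 is:

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$$

The classifier assigns class 1 when this probability exceeds 0.5, that is, when $\mathbf{w} \cdot \mathbf{x} > 0$.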

The weight vector, with the intercept as its first element:

In [None]:
w = np.append(classifier.intercept_, classifier.coef_)
w

The feature vector of the last observation, with a leading 1 for the intercept:

In [None]:
x = np.append([1.0], X[-1])
x

The predicted probability of class 1:

In [None]:
1/(1 + np.exp(-(np.dot(w, x))))
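
The same computation can be applied to all the rows at once and compared with `predict_proba`. The following is a minimal sketch on a hypothetical toy dataset (`X_toy`, `y_toy`, and `clf` are illustrative names, not part of the notebook), assuming a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features, two classes
X_toy = np.array([[0.5, 1.0], [1.0, 1.5], [1.5, 0.5],
                  [3.0, 3.5], [3.5, 3.0], [4.0, 4.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Logistic function applied to intercept + w . x for every row at once
z = clf.intercept_ + X_toy @ clf.coef_.ravel()
p_manual = 1 / (1 + np.exp(-z))

# The second column of predict_proba is the probability of class 1
p_sklearn = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```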

## Evaluation

We first evaluate on the training set:

In [None]:
print("Classification report for classifier on the training set:\n",
      metrics.classification_report(y, y_hat))

We use cross-validation instead, here with five folds:

In [None]:
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
scores

In [None]:
scores.mean()
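
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds. This matters here because `y` lists all the examples of class 0 before those of class 1. The sketch below shows the split such a label vector produces. `y_sorted` and `X_dummy` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same layout as y in this notebook: 15 zeros followed by 15 ones
y_sorted = np.array([0] * 15 + [1] * 15)
X_dummy = np.zeros((len(y_sorted), 1))  # feature values do not affect the split

fold_counts = []
for train_index, test_index in StratifiedKFold(n_splits=5).split(X_dummy, y_sorted):
    # Number of test examples of each class in this fold
    fold_counts.append(np.bincount(y_sorted[test_index], minlength=2).tolist())
print(fold_counts)
```

Each test fold contains both classes, which a plain sequential split of the sorted labels would not guarantee.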

### Leave-One-Out

We train on all the examples except one, which serves as the test set.

In [None]:
loo = LeaveOneOut()
predictions = 0
correct_predictions = 0
for train_index, test_index in loo.split(X):
    predictions += 1
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    classifier.fit(X_train, y_train)
    if classifier.predict(X_test)[0] == y_test[0]:
        correct_predictions += 1
'Leave-one-out cross-validation accuracy: {}'.format(correct_predictions / predictions)
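
The explicit loop can also be written with `cross_val_score` by passing a `LeaveOneOut` splitter as `cv`. The following is a minimal sketch on a hypothetical toy dataset (`X_toy` and `y_toy` are illustrative, not the *Salammbô* data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical toy data: two features, two classes
X_toy = np.array([[0.5, 1.0], [1.0, 1.5], [1.5, 0.5],
                  [3.0, 3.5], [3.5, 3.0], [4.0, 4.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# One fit per held-out example; each score is either 0.0 or 1.0
loo_scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                             cv=LeaveOneOut(), scoring='accuracy')
print(len(loo_scores), loo_scores.mean())
```

The mean of the scores is the same ratio of correct predictions to total predictions that the loop computes.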