# An Introduction to sklearn with Quinlan's dataset
Author: Pierre Nugues

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn import linear_model
from sklearn import svm
import csv

### The dataset in a matrix format

$$
\mathbf{X} =
\begin{bmatrix}
Sunny & Hot & High & False\\
Sunny & Hot & High & True\\
Overcast & Hot & High & False\\
Rain & Mild & High & False\\
Rain & Cool & Normal & False\\
Rain & Cool & Normal & True\\
Overcast & Cool & Normal & True\\
Sunny & Mild & High & False\\
Sunny & Cool & Normal & False\\
Rain & Mild & Normal & False\\
Sunny & Mild & Normal & True\\
Overcast & Mild & High & True\\
Overcast & Hot & Normal & False\\
Rain & Mild & High & True
\end{bmatrix}
; \mathbf{y} =
\begin{bmatrix}
N\\
N\\
P\\
P\\
P\\
N\\
P\\
N\\
P\\
P\\
P\\
P\\
P\\
N
\end{bmatrix}
$$

### Reading the dataset
We read the file with `DictReader()` from the `csv` module, passing the column names explicitly:

In [None]:
column_names = ['outlook', 'temperature', 'humidity', 'windy', 'play']
# Open the file in a context manager so that it is closed after reading
with open('weather-nominal.csv', newline='') as f:
    dataset = list(csv.DictReader(f, fieldnames=column_names))
dataset
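
To see what `DictReader()` returns without needing the file, here is a minimal sketch that reads from an in-memory string. The two rows are a hypothetical excerpt, not the actual content of `weather-nominal.csv`.

```python
import csv
import io

# Hypothetical two-row excerpt standing in for the real CSV file
sample = io.StringIO("sunny,hot,high,FALSE,no\novercast,hot,high,FALSE,yes\n")
column_names = ['outlook', 'temperature', 'humidity', 'windy', 'play']

# With fieldnames given, every line is treated as data, none as a header
rows = list(csv.DictReader(sample, fieldnames=column_names))
print(rows[0]['outlook'], rows[0]['play'])
```

Each row is a mapping from column name to the string read from the file, which is the shape `DictVectorizer` expects later on.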

### Extracting the features 

We extract the features and the classes, and we store them in `X_dict` and `y`, respectively.

The name of the class column:

In [None]:
target_class = 'play'

We extract the features, i.e. every column except the class:

In [None]:
X_dict = [{k: v for k, v in obs.items() if k != target_class}
          for obs in dataset]
X_dict

We extract the class of each observation:

In [None]:
y = [obs[target_class] for obs in dataset]
y

### Vectorizing the Dataset

#### The Features

We vectorize the feature dictionaries with a one-hot encoding: each feature/value pair becomes one binary column.

In [None]:
# Dense output is easier to inspect here; prefer sparse=True on large datasets
vec = DictVectorizer(sparse=False)
# vec = DictVectorizer(sparse=True)
vec.fit(X_dict)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out()

In [None]:
vec.vocabulary_

In [None]:
X = vec.transform(X_dict)
X
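
As an illustration of what this encoding produces, here is a hand-written sketch in plain Python. It is not the library's implementation; the function `one_hot` and the toy observations are hypothetical. It only mimics the `feature=value` column naming and the sorted column order that `DictVectorizer` uses by default for string values.

```python
def one_hot(observations):
    """Return (column_names, matrix) for a list of {feature: value} dicts."""
    # One column per distinct feature=value pair, in sorted order
    columns = sorted({f"{k}={v}" for obs in observations for k, v in obs.items()})
    index = {name: i for i, name in enumerate(columns)}
    matrix = []
    for obs in observations:
        row = [0.0] * len(columns)
        for k, v in obs.items():
            row[index[f"{k}={v}"]] = 1.0  # switch on the column of this pair
        matrix.append(row)
    return columns, matrix


# Hypothetical toy observations
toy = [{'outlook': 'sunny', 'windy': 'FALSE'},
       {'outlook': 'rainy', 'windy': 'TRUE'}]
columns, matrix = one_hot(toy)
print(columns)  # ['outlook=rainy', 'outlook=sunny', 'windy=FALSE', 'windy=TRUE']
print(matrix)   # [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
```

Each row has exactly one 1 per feature, in the column of the value it takes.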

#### The class vector

scikit-learn accepts strings as class labels, so `y` needs no further encoding:

In [None]:
y

### The Classifier

#### Building the model

Now that the features are numerical, we can fit a linear classifier to them:

In [None]:
classifier = linear_model.LogisticRegression()
# classifier = svm.SVC()
model = classifier.fit(X, y)
model
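
To make explicit what a fitted binary logistic regression computes, the following sketch recomputes the decision scores by hand from `coef_` and `intercept_`, then compares them with `decision_function()` and `predict()`. The toy data and the names `toy_vec`, `clf`, `scores`, and `manual` are assumptions made for this illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: the class depends on outlook only
toy_X = [{'outlook': 'sunny', 'windy': 'FALSE'},
         {'outlook': 'sunny', 'windy': 'TRUE'},
         {'outlook': 'overcast', 'windy': 'FALSE'},
         {'outlook': 'overcast', 'windy': 'TRUE'}]
toy_y = ['no', 'no', 'yes', 'yes']

toy_vec = DictVectorizer(sparse=False)
X_toy = toy_vec.fit_transform(toy_X)
clf = LogisticRegression().fit(X_toy, toy_y)

# Linear score w . x + b for each observation
scores = X_toy @ clf.coef_[0] + clf.intercept_[0]
# In the binary case, a positive score selects the second class in classes_
manual = np.where(scores > 0, clf.classes_[1], clf.classes_[0])
print(manual)
print(clf.predict(X_toy))
```

The two printed arrays should agree, since `predict()` thresholds the same linear score.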

#### Predicting the classes

With the trained classifier, we predict the classes:

In [None]:
y_hat = classifier.predict(X)
y_hat

### Using a test set

To carry out predictions on a new dataset, we must vectorize it with `transform()` only, and not `fit_transform()`, so that it is encoded with the columns learned from the training set.

Although this is not good practice, here we apply the prediction to the training set:

In [None]:
# Copy the list so that later changes to the test set do not modify X_dict
X_test_dict = list(X_dict)
X_test = vec.transform(X_test_dict)
X_test

In [None]:
y_hat = classifier.predict(X_test)
y_hat

Note that sklearn outputs the predictions as strings, like the labels in `y`.
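
For comparison, here is a sketch of the more usual workflow with a held-out test set: the vectorizer is fitted on the training part only and the test part goes through `transform()`. The observations, the split sizes, and the names prefixed with `demo_` are assumptions for this illustration; with so few examples the score itself means little.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy observations, four per class
demo_obs = [{'outlook': o, 'windy': w}
            for o in ('sunny', 'overcast')
            for w in ('TRUE', 'FALSE')] * 2
demo_labels = ['no', 'no', 'yes', 'yes'] * 2

# Hold out a quarter of the observations for testing
train_dicts, test_dicts, demo_y_train, demo_y_test = train_test_split(
    demo_obs, demo_labels, test_size=0.25, random_state=0)

demo_vec = DictVectorizer(sparse=False)
demo_X_train = demo_vec.fit_transform(train_dicts)  # fit on the training set only
demo_X_test = demo_vec.transform(test_dicts)        # reuse the learned columns

demo_clf = LogisticRegression().fit(demo_X_train, demo_y_train)
print(demo_clf.score(demo_X_test, demo_y_test))     # accuracy on held-out data
```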

### One more word on vectorizing

sklearn transformers must be fitted only once, on the training set. Here are a few cells that show why.

We build a test set containing new values:

In [None]:
new_elt = {'outlook': 'rainy',
  'temperature': 'mild',
  'humidity': 'normal',
  'windy': 'RATHER'}

In [None]:
X_test_dict.append(new_elt)
X_test_dict

In [None]:
X_test_correct = vec.transform(X_test_dict)
X_test_correct

In [None]:
X_test_correct.shape

In [None]:
X_test_wrong = vec.fit_transform(X_test_dict)
X_test_wrong

Note that refitting creates three columns to vectorize the values of _windy_, so this matrix no longer matches the columns the classifier was trained on.

In [None]:
X_test_wrong.shape
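
The same contrast can be reproduced on a minimal, self-contained example. The data and the names `demo_train`, `demo_test`, `kept`, and `refit` are assumptions for this sketch. With a fitted `DictVectorizer`, `transform()` silently ignores feature values it did not see during fitting, whereas refitting adds a column for them.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and test observations with a single feature
demo_train = [{'windy': 'TRUE'}, {'windy': 'FALSE'}]
demo_test = demo_train + [{'windy': 'RATHER'}]  # 'RATHER' is unseen in training

fitted_vec = DictVectorizer(sparse=False)
fitted_vec.fit(demo_train)

# Correct: reuse the fitted columns; the unseen value gives an all-zero row
kept = fitted_vec.transform(demo_test)
print(kept.shape)   # (3, 2)

# Wrong: refitting on the test set changes the number of columns
refit = DictVectorizer(sparse=False).fit_transform(demo_test)
print(refit.shape)  # (3, 3)
```

A classifier trained on two columns cannot be applied to the three-column matrix, which is why the test set must go through `transform()` only.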