# Classification task - Room occupancy 2

Some machine learning methods, such as Support Vector Machines, are sensitive to feature scale and work best with standardized features. Standardization means that every numerical feature is rescaled to have zero mean and unit standard deviation. Implement standardization using standard Scikit-learn tools.
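
To make the definition concrete, the following minimal sketch standardizes a small array by hand and compares the result with `StandardScaler`. The array `X_toy` and its values are made up for illustration and are not taken from the occupancy dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_toy = np.array([[20.0, 400.0],
                  [22.0, 800.0],
                  [24.0, 1200.0]])

# Standardize by hand: z = (x - mean) / std, computed per column.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0), which StandardScaler also uses
X_manual = (X_toy - mean) / std

# The same transformation with Scikit-learn.
X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

After the transformation both columns are on the same scale, so neither feature dominates simply because of its units.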

In [0]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/mlcollege/introduction-to-ml/master/data/occupancy.csv', sep=',')
data.head()

## Standardized classifier
Implement a classifier with standardized features.

### Data preparation

In [0]:
from sklearn.model_selection import train_test_split

X_all = data[['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']]
y_all = data['Occupancy']

X_train, X_test, y_train, y_test = train_test_split(
    X_all, 
    y_all,
    random_state=1,
    test_size=0.1)

print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))

### Training a standardized classifier

Implement a [Support Vector Machines](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) classifier with feature standardization. You can use the standard Scikit-learn tools:
* [sklearn.preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
* [sklearn.pipeline.Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

Remember that the standardization parameters (per-feature mean and standard deviation) have to be estimated on the training set only and then applied to both the training and test sets.
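
The rule above can be sketched on synthetic data as follows. The array `X_demo` and all `demo`-prefixed names are hypothetical stand-ins for the occupancy features, chosen so that the snippet runs without downloading anything.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the occupancy features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_demo_train, X_demo_test = train_test_split(
    X_demo, test_size=0.1, random_state=1)

# Estimate mean and standard deviation on the training split only.
demo_scaler = StandardScaler().fit(X_demo_train)

# Apply the same, already fitted, transformation to both splits.
X_demo_train_std = demo_scaler.transform(X_demo_train)
X_demo_test_std = demo_scaler.transform(X_demo_test)

print(X_demo_train_std.mean(axis=0))  # zero up to floating-point error
print(X_demo_test_std.mean(axis=0))   # near zero, but generally not exactly zero
```

The test split is not expected to have exactly zero mean after the transformation, because its statistics were never used to fit the scaler. That is the intended behaviour: it prevents information from the test set leaking into training.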


In [0]:
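from sklearn.preprocessing import StandardScaler

# Estimate the per-feature mean and standard deviation on the training set only
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the fitted transformation to the training features
print(scaler.transform(X_train))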

In [0]:
# Sanity check: the first standardized feature should have mean ~0 and std ~1
X_train_std = scaler.transform(X_train)
print(X_train_std[:, 0].mean())
print(X_train_std[:, 0].std())

In [0]:
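from sklearn import svm
from sklearn.pipeline import Pipeline

# Chain standardization and the classifier so that both are fitted together
clf_pipeline = Pipeline([('std', StandardScaler()),
                         ('svm', svm.SVC(kernel='linear'))])

# fit() estimates the scaler on X_train only, then trains the SVM on the scaled data
clf_pipeline.fit(X_train, y_train)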
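
A fitted pipeline keeps its steps accessible by name, which makes it possible to check what the scaler learned. The sketch below assumes synthetic data from `make_classification` in place of the occupancy dataset; `X_syn`, `y_syn`, and `demo_pipeline` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic binary classification data with five features.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([('std', StandardScaler()),
                          ('svm', SVC(kernel='linear'))])
demo_pipeline.fit(X_syn, y_syn)

# Steps are accessible by the names given in the Pipeline constructor.
fitted_scaler = demo_pipeline.named_steps['std']
fitted_svm = demo_pipeline.named_steps['svm']

print(fitted_scaler.mean_)   # per-feature means learned during fit
print(fitted_scaler.scale_)  # per-feature standard deviations learned during fit
print(fitted_svm.coef_)      # linear SVM weights, in standardized feature units
```

Because the scaler lives inside the pipeline, calling `predict` on new data applies the stored training statistics automatically, so there is no separate `transform` call to forget.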

### Evaluate the models

Implement all evaluation methods you have learned in the Scikit-learn tutorial. Decide which model performs best on the given problem.

In [0]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

y_pred = clf_pipeline.predict(X_test)

print("Test accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))
print()
print(metrics.classification_report(y_test, y_pred))

In [0]:
# Use a separate name so the test-set predictions in y_pred are not overwritten
y_pred_train = clf_pipeline.predict(X_train)

print("Train accuracy: {:.2f}".format(accuracy_score(y_train, y_pred_train)))
print()
print(metrics.classification_report(y_train, y_pred_train))
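
Accuracy and the classification report can be complemented by a confusion matrix and by cross-validation. The following is a minimal sketch on synthetic data, not a record of results on the occupancy dataset. Which evaluation methods the referenced tutorial covers is not stated here, so the choice of these two is an assumption, and `X_eval`, `y_eval`, and `eval_pipeline` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic binary classification data.
X_eval, y_eval = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_eval, y_eval, test_size=0.1, random_state=1)

eval_pipeline = Pipeline([('std', StandardScaler()),
                          ('svm', SVC(kernel='linear'))])
eval_pipeline.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, eval_pipeline.predict(X_te), labels=[0, 1])
print(cm)

# 5-fold cross-validation on the training data; the pipeline is refitted
# in every fold, so the scaler never sees the held-out fold.
cv_scores = cross_val_score(eval_pipeline, X_tr, y_tr, cv=5)
print("CV accuracy: {:.2f} +/- {:.2f}".format(cv_scores.mean(), cv_scores.std()))
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaled features, keeps the standardization inside each fold and avoids leaking validation statistics into training.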