# Introduction to scikit-learn
* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license
* Documentation: [http://scikit-learn.org/stable/documentation.html](http://scikit-learn.org/stable/documentation.html)

## Installation
`$ pip install scikit-learn`

## Loading an example dataset

Scikit-learn comes with a few standard datasets, for instance the [iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset for classification.

In [0]:
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data[:5])

In [2]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


![Sepal vs. Petal](https://github.com/mlcollege/introduction-to-ml/blob/master/src/images/sepal-petal.jpg?raw=1)

In [0]:
print(iris.target)

In [0]:
print(iris.target_names)
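
The cells above print the raw arrays. As a minimal sketch of how the pieces fit together, the following inspects the shape of `iris.data` and counts the samples per class; the variable names `n_samples`, `n_features`, and `class_counts` are illustrative and do not come from the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# data is a 2-D array: one row per flower, one column per feature
n_samples, n_features = iris.data.shape
print("samples:", n_samples, "features:", n_features)

# target holds one integer class label per row; target_names maps
# each integer label to the species name
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```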

## Data preparation
To measure the performance of an estimator, we split the data into a training set and a test set. `train_test_split` shuffles the samples by default, which matters here because the Iris samples are ordered by class.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.1)
print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))
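
The split above differs on every run. A possible variation, sketched here under the assumption that a reproducible and class-balanced split is wanted, passes `random_state` and `stratify` to `train_test_split`. The seed value `0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.1,
    random_state=0,        # assumption: any fixed integer gives a repeatable split
    stratify=iris.target,  # keep the class proportions equal in both subsets
)

print('Train size: {}'.format(len(X_train)))
print('Test size: {}'.format(len(X_test)))
# Number of samples per class in each subset
print(np.bincount(y_train), np.bincount(y_test))
```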

## Learning and predicting
For the Iris dataset, the task is to predict the species of a flower from its feature vector. We are given samples of each of the 3 possible classes, and we fit an estimator on them so that it can predict the species of unseen samples.

An example of an estimator is the class [sklearn.naive_bayes.GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) that implements Gaussian Naive Bayes classification.

In [0]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(y_pred)
print(iris.target_names[y_pred])

In [0]:
print(clf.predict_proba(X_test))
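
To make the relationship between `predict_proba` and `predict` concrete, here is a small self-contained sketch. It refits the classifier on its own split (the fixed `random_state=0` is an assumption made for repeatability) and checks that each row of probabilities sums to 1 and that the most probable column corresponds to the predicted label.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# One row per test sample, one column per class (ordered as clf.classes_)
proba = clf.predict_proba(X_test)
print(proba.shape)

# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
print(rows_sum_to_one)

# Picking the most probable class reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(labels_from_proba, clf.predict(X_test))
print(matches_predict)
```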

## Model persistence
It is possible to save a model in Scikit-learn by using Python’s built-in persistence model, namely [pickle](https://docs.python.org/2/library/pickle.html):

In [0]:
import pickle

with open('/tmp/model.pkl', 'wb') as f:
    pickle.dump(clf, f)
    
with open('/tmp/model.pkl', 'rb') as f:
    clf2 = pickle.load(f)
    print(iris.target_names[clf2.predict(X_test)])
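
The cell above writes to a fixed path under `/tmp`, which may not exist on every platform. As a hedged alternative, the sketch below does the same round trip inside a temporary directory from the standard library and checks that the restored model predicts exactly what the original does. The names `model_path` and `restored` are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
clf = GaussianNB().fit(iris.data, iris.target)

# The temporary directory is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    with open(model_path, "wb") as f:
        pickle.dump(clf, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored estimator should behave exactly like the original
same_predictions = np.array_equal(
    clf.predict(iris.data), restored.predict(iris.data))
print(same_predictions)
```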

## Model evaluation
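Scikit-learn implements the common evaluation tools used below: accuracy, per-class precision/recall reports, confusion matrices, cross-validation, and learning curves.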

In [0]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)

print("Test accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))
print()
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))

In [0]:
y_pred = clf.predict(X_train)

print("Train accuracy: {:.2f}".format(accuracy_score(y_train, y_pred)))
print()
print(metrics.classification_report(y_train, y_pred, target_names=iris.target_names))

In [0]:
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
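
In the matrix returned by `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, so correct predictions lie on the diagonal. The sketch below, which uses its own stratified split (`test_size=0.3` and `random_state=0` are assumptions chosen so that every class appears in the test set), recovers accuracy and per-class recall from the matrix.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0,
    stratify=iris.target)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Accuracy is the share of samples on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall of a class is its diagonal entry divided by its row total
recall_from_cm = np.diag(cm) / cm.sum(axis=1)

print(accuracy_from_cm, recall_from_cm)
```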

In [0]:
from sklearn.model_selection import cross_val_score

folds = 10
accuracies = cross_val_score(clf, iris.data, iris.target, cv=folds, scoring='accuracy')
print('Cross-validated accuracy: {:.2f} with standard deviation {:.2f}'.format(accuracies.mean(), accuracies.std()))
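
To show what `cross_val_score` does internally, here is a rough sketch that runs the same procedure as an explicit loop. It assumes a `StratifiedKFold` splitter, which is what scikit-learn uses for classifiers when `cv` is an integer, and fits a fresh copy of the estimator on each fold. The names `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
clf = GaussianNB()
cv = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in cv.split(iris.data, iris.target):
    # Fit an unfitted copy so that no state leaks between folds
    fold_clf = clone(clf)
    fold_clf.fit(iris.data[train_idx], iris.target[train_idx])
    fold_pred = fold_clf.predict(iris.data[test_idx])
    manual_scores.append(accuracy_score(iris.target[test_idx], fold_pred))
manual_scores = np.array(manual_scores)

# The library call with the same splitter should give the same fold scores
library_scores = cross_val_score(
    clf, iris.data, iris.target, cv=cv, scoring='accuracy')

print(np.allclose(manual_scores, library_scores))
```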

In [0]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, ShuffleSplit

cv = ShuffleSplit(n_splits=20, test_size=0.1, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    clf,
    iris.data, iris.target,
    train_sizes=range(5,99,5),
    n_jobs=-1,
    cv=cv,
)

plt.figure(figsize=(15,6))

# Average the scores over the cross-validation splits (axis=1)
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

train_scores_std = np.std(train_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# Shade one standard deviation around each mean curve
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Train accuracy")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Test accuracy")

plt.xlabel("Training examples")

plt.ylabel("Accuracy")

plt.grid()

plt.legend(loc="lower right")

plt.show()
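
The plot averages the scores along `axis=1`. As a small sketch of why, the following calls `learning_curve` without plotting and prints the shapes of its outputs: each score array has one row per training size and one column per cross-validation split. The training sizes `[20, 50, 100]` and `n_splits=5` are arbitrary values chosen to keep the example short.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(),
    iris.data, iris.target,
    train_sizes=[20, 50, 100],  # absolute numbers of training samples
    cv=cv,
)

# One row per training size, one column per cross-validation split
print(train_sizes)
print(train_scores.shape, test_scores.shape)

# Averaging over the splits leaves one value per training size
test_scores_mean = np.mean(test_scores, axis=1)
print(test_scores_mean.shape)
```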