In [1]:
%matplotlib inline

In [2]:
from sklearn import datasets

In [3]:
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### The scikit-learn estimator API
<img src="figures/supervised_workflow.svg" width="100%">

Every algorithm in scikit-learn is exposed through an *Estimator* object. For instance, logistic regression is available as:

In [5]:
from sklearn.linear_model import LogisticRegression

All models in scikit-learn have a very consistent interface.
First, we instantiate the estimator object.

In [6]:
classifier = LogisticRegression()

To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` method with the training data and the corresponding training labels (the desired output for each training data point):

In [7]:
classifier.fit(X_train, y_train)

LogisticRegression()

We can then apply the model to unseen data and predict the outcome for each sample using the ``predict`` method:

In [8]:
prediction = classifier.predict(X_test)

We can compare these predictions against the true labels:

In [9]:
print(prediction)
print(y_test)

[1 1 0 2 0 2 0 2 2 1 1 2 1 2 1 0 1 1 0 0 1 1 0 0 2 0 0 2 1 0 2 1 0 2 2 1 0
 1]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]

We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**. All scikit-learn classifiers provide a convenience method, ``score``, that computes it directly from the test data:

In [10]:
classifier.score(X_test, y_test)

0.7894736842105263
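
As a cross-check, the accuracy can also be computed by hand as the fraction of predictions that match the true labels. The sketch below rebuilds the same data split and model so that it runs on its own; the name `manual_accuracy` is introduced here for illustration only, and the exact value may vary with the scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same setup as in the cells above: first two iris features, fixed split.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

# Accuracy by hand: the mean of a boolean array is the fraction of True entries.
manual_accuracy = np.mean(prediction == y_test)

print(manual_accuracy)
print(classifier.score(X_test, y_test))
```

Both printed numbers should agree, since ``score`` reports the mean accuracy for classifiers.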

It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:

In [11]:
classifier.score(X_train, y_train)

0.8392857142857143
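
The same instantiate, ``fit``, ``predict``, ``score`` sequence should carry over to other classifiers unchanged. As a minimal sketch, the block below swaps in `KNeighborsClassifier`, which is not used elsewhere in this notebook and is chosen here purely as an illustration; `n_neighbors=5` is an arbitrary choice, and the variable names prefixed with `knn` are hypothetical.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as above.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Instantiate the estimator (illustrative choice of model and parameter).
knn = KNeighborsClassifier(n_neighbors=5)

# 2. Learn from the training data.
knn.fit(X_train, y_train)

# 3. Predict labels for unseen data.
knn_prediction = knn.predict(X_test)

# 4. Compare performance on the training set and the test set.
knn_train_score = knn.score(X_train, y_train)
knn_test_score = knn.score(X_test, y_test)
print(knn_train_score, knn_test_score)
```

Only the import and the constructor call differ from the logistic regression cells; the remaining steps are identical.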

**Estimated parameters**: All the estimated parameters are attributes of the estimator object whose names end with an underscore. Here, these are the coefficients and the intercept (offset) of the linear function learned for each class, one row per class:

In [12]:
print(classifier.coef_)
print(classifier.intercept_)

[[-2.5238339   2.04057617]
 [ 0.48047069 -1.37876396]
 [ 2.04336321 -0.66181221]]
[ 7.85542031  1.90440376 -9.75982407]


In [13]:
print(classifier.classes_)

[0 1 2]
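
To make the role of these attributes concrete, the sketch below recomputes the per-class linear scores from ``coef_`` and ``intercept_`` and maps the highest-scoring column back to a label through ``classes_``. It rebuilds the data and model so that it runs on its own; `scores` and `manual_prediction` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same setup as in the cells above.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# One linear score per class: shape (n_samples, n_classes).
scores = X_test @ classifier.coef_.T + classifier.intercept_

# These should match what decision_function returns for this model.
print(np.allclose(scores, classifier.decision_function(X_test)))

# The predicted label is the class whose score is largest;
# column i of the scores corresponds to classes_[i].
manual_prediction = classifier.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_prediction, classifier.predict(X_test)))
```

Each row of ``coef_`` and each entry of ``intercept_`` thus belongs to the class at the same position in ``classes_``.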
