In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [2]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an array of shape (n_samples, n_features). In supervised problems, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
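As a quick illustration of the layout described above, the following sketch inspects the digits dataset. It assumes only that scikit-learn is installed; the shapes in the comments are those of the bundled digits dataset.

```python
from sklearn import datasets

digits = datasets.load_digits()

# The returned object behaves like a dictionary: its members are also keys.
print(sorted(digits.keys()))

# Features: one row per sample, one column per feature.
print(digits.data.shape)    # (1797, 64)

# Targets: one label per sample.
print(digits.target.shape)  # (1797,)
```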

For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples:

In [3]:
print(digits.data)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]


and digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image that we are trying to learn:


In [4]:
digits.target 


array([0, 1, 2, ..., 8, 9, 8])

##### Shape of the data arrays

The data is always a 2D array, shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:

In [5]:
digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
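To make the relationship between the two shapes concrete, this small sketch flattens each (8, 8) image into a row of 64 features and checks that the result matches digits.data. The variable names n_samples and flat are illustrative only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a single row of 64 pixel values.
n_samples = len(digits.images)
flat = digits.images.reshape((n_samples, -1))

print(digits.images.shape)  # (1797, 8, 8)
print(flat.shape)           # (1797, 64)

# The flattened images hold the same values as the 2D data array.
print(np.array_equal(flat, digits.data))
```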

#### Learning and Predicting


In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong.

In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).

An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator’s constructor takes as arguments the model’s parameters.

For now, we will consider the estimator as a black box:

In [6]:
from sklearn import svm
clf = svm.SVC(gamma=0.001,  C=100.)
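Even when treating the estimator as a black box, it can help to confirm which parameters the constructor recorded. A minimal sketch using get_params(), which every scikit-learn estimator provides:

```python
from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)

# get_params() returns the constructor arguments as a dictionary,
# including defaults that were not set explicitly (such as kernel).
params = clf.get_params()
print(params["gamma"], params["C"], params["kernel"])
```

Nothing has been learned at this point; the parameters only describe how the model will be fitted.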

The clf (for classifier) estimator instance is first fitted to the training data; that is, it must learn a model from it. This is done by passing our training set to the fit method. For the training set, we'll use all the images from our dataset except the last one, which we'll reserve for prediction. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data:

In [7]:
clf.fit(digits.data[:-1], digits.target[:-1])  # train on every sample except the last

SVC(C=100.0, gamma=0.001)

Now we can predict new values. In this case, we predict using the last image from digits.data. By predicting, we determine which digit class learned from the training set best matches the last image:

In [8]:
clf.predict(digits.data[-1:])

array([8])
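Since the last sample was held out of training, its true label is still available for comparison. The sketch below repeats the fit and checks the prediction against digits.target[-1]; the names predicted and actual are illustrative.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Fit on everything except the last sample, as in the cells above.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

# Compare the prediction for the held-out sample with its true label.
predicted = clf.predict(digits.data[-1:])[0]
actual = digits.target[-1]
print(predicted, actual)
```

A single held-out sample only serves as a sanity check; it says little about how well the classifier generalizes.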

### Conventions 

###### Type Casting 
Unless otherwise specified, input will be cast to float64:

In [13]:
from sklearn import random_projection

rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
X.dtype

dtype('float32')

In [15]:
transformer = random_projection.GaussianRandomProjection()
X_new = transformer.fit_transform(X)
X_new.dtype

dtype('float64')

In this example, X is float32 and is cast to float64 by fit_transform(X).
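The output dtype can depend on the installed scikit-learn version: as far as I am aware, some newer releases keep float32 input as float32 for certain transformers, so the float64 result shown above should not be taken for granted. A hedged sketch that prints whatever dtype comes back and then casts explicitly when float64 is required (X_new64 is an illustrative name):

```python
import numpy as np
from sklearn import random_projection

rng = np.random.RandomState(0)
X = rng.rand(10, 2000).astype(np.float32)

transformer = random_projection.GaussianRandomProjection()
X_new = transformer.fit_transform(X)

# Report the dtype actually returned by this scikit-learn version.
print(X.dtype, "->", X_new.dtype)

# Cast explicitly if downstream code needs float64;
# copy=False avoids a copy when the array is already float64.
X_new64 = X_new.astype(np.float64, copy=False)
print(X_new64.dtype)  # float64
```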

Regression targets are cast to float64 and classification targets are maintained:

In [18]:
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
clf.fit(iris.data, iris.target)




SVC()

In [19]:
list(clf.predict(iris.data[:3]))

[0, 0, 0]

In [21]:
clf.fit(iris.data, iris.target_names[iris.target])

SVC()

In [24]:
list(clf.predict(iris.data[:3]))

['setosa', 'setosa', 'setosa']

Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The second predict() returns a string array, since iris.target_names was used for fitting.
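The cells above cover the classification half of the rule. For the regression half, a minimal sketch with a made-up toy dataset (the X and y values here are hypothetical, chosen only for illustration) shows integer targets producing float64 predictions:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical toy data: integer-valued regression targets.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 2, 3])

reg = SVR().fit(X, y)
pred = reg.predict(X)

print(y.dtype.kind)  # 'i' (integer input targets)
print(pred.dtype)    # float64
```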

#### Refitting and updating parameters

Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit():

In [28]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)



SVC(kernel='linear')

In [29]:
clf.predict(X[:5])

array([0, 0, 0, 0, 0])

Here, the default kernel rbf is changed to linear via SVC.set_params() after the estimator has been constructed. Calling set_params(kernel='rbf') followed by fit() would change the kernel back, refit the estimator, and allow a second prediction with the new model.
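A minimal sketch of the full round trip, assuming the same iris data: the estimator is fitted with a linear kernel, then switched back to rbf and refitted. The names linear_pred and rbf_pred are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC()

# First fit: override the default kernel after construction.
clf.set_params(kernel="linear").fit(X, y)
linear_pred = clf.predict(X[:5])

# Second fit: switch back to rbf; this discards what the first fit learned.
clf.set_params(kernel="rbf").fit(X, y)
rbf_pred = clf.predict(X[:5])

print(clf.get_params()["kernel"])  # rbf
print(linear_pred, rbf_pred)
```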

### Multiclass vs. multilabel fitting

When using multiclass classifiers, the learning and prediction task that is performed is dependent on the format of the target data fit upon:

In [33]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
classif.fit(X,y).predict(X)

array([0, 0, 1, 1, 2])

In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:

In [35]:
y = LabelBinarizer().fit_transform(y)
classif.fit(X,y).predict(X)

array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 0]])

Here, the classifier is fit on the 2d binary label representation of y, produced by the LabelBinarizer. In this case predict() returns a 2d array representing the corresponding multilabel predictions.

Note that the fourth and fifth instances returned all zeros, indicating that they matched none of the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:
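To see what the classifier was actually fitted on, it can help to look at the binarized targets themselves. This sketch keeps a reference to the LabelBinarizer (named lb here for illustration) so the encoding can be inspected and reversed:

```python
from sklearn.preprocessing import LabelBinarizer

y = [0, 0, 1, 1, 2]

lb = LabelBinarizer()
Y = lb.fit_transform(y)

# One column per class; each row has a single 1 marking its class.
print(lb.classes_)  # [0 1 2]
print(Y)
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 1 0]
#  [0 0 1]]

# inverse_transform maps indicator rows back to the original labels.
print(lb.inverse_transform(Y))  # [0 0 1 1 2]
```

Every row of this target matrix contains exactly one 1, whereas a multilabel prediction row is free to contain none or several.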

In [38]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X,y).predict(X)

array([[1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0]])

In this case, the classifier is fit upon instances that are each assigned multiple labels. The MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple predicted labels for each instance.
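The indicator rows returned by predict() can be mapped back to label sets by keeping the fitted MultiLabelBinarizer around. A hedged sketch, reusing the same toy data; the names mlb, Y, pred and label_sets are illustrative:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]

# Keep the binarizer so its column order (classes_) stays available.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)
print(mlb.classes_)  # [0 1 2 3 4]

classif = OneVsRestClassifier(estimator=SVC(random_state=0))
pred = classif.fit(X, Y).predict(X)

# Convert each predicted indicator row back to a tuple of labels.
label_sets = mlb.inverse_transform(pred)
print(label_sets)
```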
