#Supervised learning: predicting an output variable from high-dimensional observations

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.

###Vocabulary: classification and regression
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target variable, it is said to be a regression task.
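As a quick illustration of the two kinds of target, the sketch below loads the two scikit-learn datasets used in these notes and inspects their shapes and target values. The variable names are chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()          # classification: y holds class labels
diabetes = datasets.load_diabetes()  # regression: y holds a continuous score

# X is 2D (n_samples, n_features); y is 1D with one entry per sample
iris_shapes = (iris.data.shape, iris.target.shape)
diabetes_shapes = (diabetes.data.shape, diabetes.target.shape)

# A classification target takes a small, finite set of values ...
iris_labels = np.unique(iris.target)
# ... while a regression target takes many distinct numeric values
n_distinct_targets = len(np.unique(diabetes.target))

print(iris_shapes, iris_labels)
print(diabetes_shapes, n_distinct_targets)
```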

##1.Nearest neighbor and the curse of dimensionality

In [1]:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target

###k-Nearest neighbors classifier

In [2]:
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test  = iris_X[indices[-10:]]
iris_y_test  = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train) 

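# Predict the labels of the held-out samples
knn.predict(iris_X_test)
# True labels of the held-out samples; as the last expression
# in the cell, this is the value the notebook displays
iris_y_test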

array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
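To compare the predictions with the true labels on the held-out samples, one option is to print both and compute the fraction that agree. This is a minimal sketch that repeats the split above; the names `y_pred` and `accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Same random split as above: the last 10 shuffled samples are held out
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy = np.mean(y_pred == y_test)  # fraction of correct predictions

print("predicted:", y_pred)
print("expected: ", y_test)
print("accuracy: ", accuracy)
```

The same fraction is returned by `knn.score(X_test, y_test)`. With only 10 test samples the estimate is coarse, so it should be read as a sanity check rather than a benchmark.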

##2.Linear model: from regression to sparsity

In [4]:
# Diabetes dataset
'''
The diabetes dataset consists of 10 physiological variables
(including age, sex, weight and blood pressure) measured on 442 patients,
and an indication of disease progression after one year.
'''
diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test  = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test  = diabetes.target[-20:]

###Linear regression
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.

In [6]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.coef_)

# The mean squared error
print(np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

# Coefficient of determination R^2: 1 is perfect prediction,
# and 0 means the model does no better than always
# predicting the mean of y.
print(regr.score(diabetes_X_test, diabetes_y_test))

[  3.03499549e-01  -2.37639315e+02   5.10530605e+02   3.27736980e+02
  -8.14131709e+02   4.92814588e+02   1.02848452e+02   1.84606489e+02
   7.43519617e+02   7.60951722e+01]
2004.56760269
0.585075302269
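To make the two quantities above less of a black box, the following sketch recomputes them by hand. It assumes the same train/test split as above, solves the least-squares problem with `np.linalg.lstsq`, and rebuilds the score from the residuals. The names `A`, `beta` and `r2_manual` are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]
X_test, y_test = diabetes.data[-20:], diabetes.target[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)

# Least squares by hand: a column of ones stands in for the intercept
A = np.hstack([X_train, np.ones((len(X_train), 1))])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
residuals = y_test - regr.predict(X_test)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(np.allclose(beta[:-1], regr.coef_), np.isclose(beta[-1], regr.intercept_))
print(r2_manual, regr.score(X_test, y_test))
```

The first line checks that the hand-computed coefficients and intercept agree with the fitted model; the second prints the manual score next to the value from `regr.score`.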


###Shrinkage
If there are few data points per dimension, noise in the observations induces high variance in the fitted coefficients.

A solution in high-dimensional statistical learning is to shrink the regression coefficients toward zero: any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge regression.

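Note on shrinkage: with high-dimensional data there are very few samples along each dimension, so even a little noise changes the fit a lot (the model overfits by trying to fit every noisy point). Shrinkage pulls the coefficients toward zero, which limits the influence of any single training sample and makes the model less sensitive to noise. Ridge regression is one way to do this.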
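The effect can be seen on a toy problem. The sketch below is an illustration under assumed settings (two training points, noise scale 0.1, `alpha=0.1`, 200 refits): it repeatedly perturbs the observations, refits an ordinary and a ridge regression, and compares how much the fitted slope moves around.

```python
import numpy as np
from sklearn import linear_model

# Two training points on the line y = x (one feature, very few samples)
X = np.array([[0.5], [1.0]])
y = np.array([0.5, 1.0])

rng = np.random.RandomState(0)
ols_slopes, ridge_slopes = [], []
for _ in range(200):
    # Perturb the observations with a little noise and refit both models
    X_noisy = X + 0.1 * rng.normal(size=X.shape)
    ols_slopes.append(linear_model.LinearRegression().fit(X_noisy, y).coef_[0])
    ridge_slopes.append(linear_model.Ridge(alpha=0.1).fit(X_noisy, y).coef_[0])

# Spread of the fitted slope across the noisy refits
ols_spread = np.std(ols_slopes)
ridge_spread = np.std(ridge_slopes)
print("OLS slope std:  ", ols_spread)
print("Ridge slope std:", ridge_spread)
```

The ridge slopes vary much less from one refit to the next, but they also sit below the noise-free slope of 1, which is the bias paid for the lower variance.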

This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance.
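One way to see the role of alpha on the diabetes data is to fit `Ridge` over a range of values and watch the size of the coefficient vector. The grid below is an illustrative assumption, not a tuned choice, and the split is the same as in the cells above.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]
X_test, y_test = diabetes.data[-20:], diabetes.target[-20:]

alphas = np.logspace(-4, 1, 6)  # illustrative grid, not tuned
coef_norms, test_scores = [], []
for alpha in alphas:
    ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
    coef_norms.append(np.linalg.norm(ridge.coef_))   # size of the coefficients
    test_scores.append(ridge.score(X_test, y_test))  # R^2 on held-out data

for alpha, norm, score in zip(alphas, coef_norms, test_scores):
    print(f"alpha={alpha:g}  ||coef||={norm:.1f}  R^2={score:.3f}")
```

The coefficient norm falls as alpha grows. The held-out score need not move in one direction, which is why alpha is usually chosen by cross-validation rather than fixed in advance.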

###Sparsity
A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one for the target variable). It is hard to develop an intuition for such a representation, but it may be useful to keep in mind that it would be a fairly empty space.

In high dimensions, the space is almost empty.

Some features may carry little or no information about the target, and we would like to keep only the informative ones by setting the coefficients of the others to zero. Ridge regression will decrease their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and selection operator), can set some coefficients to zero.
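The difference between the two penalties shows up when counting coefficients that are exactly zero. The sketch below fits both models on the diabetes training split; the value `alpha=0.5` is an assumption picked for illustration, not a recommended setting.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]

# alpha values are illustrative assumptions, not tuned
ridge = linear_model.Ridge(alpha=0.5).fit(X_train, y_train)
lasso = linear_model.Lasso(alpha=0.5).fit(X_train, y_train)

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 1))
print("lasso coefficients:", np.round(lasso.coef_, 1))
print("exact zeros -> ridge:", n_zero_ridge, " lasso:", n_zero_lasso)
```

Ridge leaves every feature with a small but nonzero weight, while Lasso drops some features entirely, which makes the fitted model easier to read as a form of feature selection.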