# Linear Regression (LinearRegression)

**Data description:**

Description of the house price dataset for the Boston area, USA

In [1]:
# Load the data
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

data = load_boston()

print(data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [2]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
import numpy as np

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

In [3]:
# Inspect the range and mean of the regression target
print("The max target value:", np.max(y))
print("The min target value:", np.min(y))
print("The average target value:", np.mean(y))

The max target value: 50.0
The min target value: 5.0
The average target value: 22.5328063241


In [4]:
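# Standardize the features and the target values
from sklearn.preprocessing import StandardScaler

ss_X = StandardScaler()
ss_y = StandardScaler()

X_train = ss_X.fit_transform(X_train)
X_test  = ss_X.transform(X_test)
# StandardScaler expects 2-D input: reshape the 1-D targets to a column, then flatten back
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test  = ss_y.transform(y_test.reshape(-1, 1)).ravel()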
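
The standardization step above can be illustrated with a minimal plain-Python sketch. It assumes the default `StandardScaler` behavior (subtract the mean, divide by the population standard deviation); the helper names and the price values are hypothetical and used only for illustration.

```python
import math

def fit_scaler(values):
    """Return the mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def transform(values, mean, std):
    """Map each value to its z-score: (v - mean) / std."""
    return [(v - mean) / std for v in values]

def inverse_transform(z_scores, mean, std):
    """Undo the z-score mapping and return values in the original units."""
    return [z * std + mean for z in z_scores]

# Hypothetical target values, for illustration only
train_prices = [10.0, 20.0, 30.0, 40.0]
test_prices = [15.0, 35.0]

# The statistics are estimated from the training data only
mean, std = fit_scaler(train_prices)
train_scaled = transform(train_prices, mean, std)
# The test data reuses the training statistics, mirroring fit_transform / transform
test_scaled = transform(test_prices, mean, std)
# Mapping back recovers the original units, as inverse_transform does
restored = inverse_transform(test_scaled, mean, std)
```

Fitting on the training split and only transforming the test split keeps information about the test data out of the scaling statistics.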



In [5]:
# Fit LinearRegression and predict on the test set
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)

# Fit SGDRegressor and predict on the test set
from sklearn.linear_model import SGDRegressor

sgdr = SGDRegressor()
sgdr.fit(X_train, y_train)
sgdr_y_predict = sgdr.predict(X_test)

In [7]:
# Evaluation
print("The value of default measurement of LinearRegression:", lr.score(X_test, y_test))

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
print("The value of R-squared of LinearRegression", r2_score(y_test, lr_y_predict))

# Map back to the original units (inverse_transform expects 2-D input) before computing MSE and MAE
y_test_orig = ss_y.inverse_transform(y_test.reshape(-1, 1))
lr_pred_orig = ss_y.inverse_transform(lr_y_predict.reshape(-1, 1))
print("The mean squared error of LinearRegression", mean_squared_error(y_test_orig, lr_pred_orig))
print("The mean absolute error of LinearRegression", mean_absolute_error(y_test_orig, lr_pred_orig))

The value of default measurement of LinearRegression: 0.6763403831
The value of R-squared of LinearRegression 0.6763403831
The mean squared error of LinearRegression 25.0969856921
The mean absolute error of LinearRegression 3.5261239964


In [8]:
print("The value of default measurement of SGDRegressor:", sgdr.score(X_test, y_test))
print("The value of R-squared of SGDRegressor", r2_score(y_test, sgdr_y_predict))

# Map back to the original units (inverse_transform expects 2-D input) before computing MSE and MAE
y_test_orig = ss_y.inverse_transform(y_test.reshape(-1, 1))
sgdr_pred_orig = ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))
print("The mean squared error of SGDRegressor", mean_squared_error(y_test_orig, sgdr_pred_orig))
print("The mean absolute error of SGDRegressor", mean_absolute_error(y_test_orig, sgdr_pred_orig))

The value of default measurement of SGDRegressor: 0.658208724197
The value of R-squared of SGDRegressor 0.658208724197
The mean squared error of SGDRegressor 26.5029379959
The mean absolute error of SGDRegressor 3.5141659101
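
The cells above convert the targets back to the original units before computing MSE and MAE. A minimal plain-Python sketch of why this matters follows; the functions `mse` and `mae`, the values, and the scaling statistics are hypothetical stand-ins, not the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values in the original units, for illustration only
y_true = [20.0, 25.0, 30.0]
y_pred = [22.0, 24.0, 27.0]

# Hypothetical scaling statistics (mean and standard deviation)
mean, std = 25.0, 5.0
scaled_true = [(t - mean) / std for t in y_true]
scaled_pred = [(p - mean) / std for p in y_pred]

mse_original = mse(y_true, y_pred)
mae_original = mae(y_true, y_pred)
mse_scaled = mse(scaled_true, scaled_pred)
mae_scaled = mae(scaled_true, scaled_pred)

# On standardized values the errors shrink by std**2 (MSE) and std (MAE),
# so they are no longer expressed in the units of the target.
print(mse_original, mse_scaled * std ** 2)
print(mae_original, mae_scaled * std)
```

R-squared, by contrast, is unchanged by this rescaling, which is why the cells above compute it directly on the standardized values.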


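$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

$SS_{res} = \sum_{i=1}^{m}(y_i - f(x_i))^2$

$SS_{tot} = \sum_{i=1}^{m}(y_i - \bar{y})^2$

$SS_{tot}$ is the total sum of squares of the true test values around their mean $\bar{y}$ (that is, $m$ times their variance), and $SS_{res}$ is the sum of squared differences between the predicted values $f(x_i)$ and the true values $y_i$.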
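
The definition of $R^2$ above can be checked by hand. The following is a minimal plain-Python sketch; the function name `r_squared` and the sample values are hypothetical, and on the same inputs the result should agree with `r2_score`.

```python
def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot from two equal-length lists."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared gaps between truth and prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between truth and its mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

score = r_squared(y_true, y_pred)

# A model that always predicts the mean of y_true has SS_res == SS_tot, so R^2 is 0
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])
```

A score near 1 means the residuals are small relative to the spread of the true values; a score of 0 means the model does no better than predicting the mean.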