**Model Optimization Basics**
![jupyter](images/task3-优化模型基础.png) 

**Loading the Data**

In [1]:
from sklearn import datasets
import pandas as pd

In [2]:
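# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this cell needs scikit-learn < 1.2
boston = datasets.load_boston()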
X = boston.data
y = boston.target
features = boston.feature_names
boston_data = pd.DataFrame(X, columns=features)
boston_data['Price'] = y
boston_data.head()

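| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |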


**Feature Selection Example: Forward Stepwise Regression**

In [3]:
from statsmodels.formula.api import ols  # formula-based OLS, used inside forward_select

def forward_select(data, target):
    """Forward stepwise selection: greedily add the predictor that lowers AIC the most."""
    variate = set(data.columns)  # candidate predictors
    variate.remove(target)

    selected = []  # predictors accepted so far
    current_score, best_new_score = float('inf'), float('inf')

    while variate:
        aic_with_variate = []
        for candidate in variate:
            # Join the already-selected predictors and the candidate into one formula
            formula = "{}~{}".format(target, "+".join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic  # fit OLS and read its AIC
            aic_with_variate.append((aic, candidate))
        aic_with_variate.sort(reverse=True)  # descending order, so the smallest AIC is last
        best_new_score, best_candidate = aic_with_variate.pop()
        if current_score > best_new_score:  # the best candidate lowers the current AIC
            variate.remove(best_candidate)  # drop it from the candidates for later rounds
            selected.append(best_candidate)  # add it to the model
            current_score = best_new_score  # the new best score becomes the current score
            print("aic is {},continuing!".format(current_score))  # smallest AIC so far
        else:
            print("for selection over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))  # final model formula
    print("final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model

In [5]:
import statsmodels.api as sm  # statsmodels, which provides ordinary least squares
from statsmodels.formula.api import ols  # load the formula-based OLS model

forward_select(data=boston_data, target='Price')

aic is 3286.974956900157,continuing!
aic is 3171.5423142992013,continuing!
aic is 3114.0972674193326,continuing!
aic is 3097.359044862759,continuing!
aic is 3069.438633167217,continuing!
aic is 3057.9390497191152,continuing!
aic is 3048.438382711162,continuing!
aic is 3042.274993098419,continuing!
aic is 3040.1545621751425,continuing!
aic is 3032.0687017003256,continuing!
aic is 3021.726387825062,continuing!
for selection over!
final formula is Price~LSTAT+RM+PTRATIO+DIS+NOX+CHAS+B+ZN+CRIM+RAD+TAX


<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f7cab3f0340>

In [6]:
lm = ols("Price~LSTAT+RM+PTRATIO+DIS+NOX+CHAS+B+ZN+CRIM+RAD+TAX",data=boston_data).fit()
lm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Price | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.735 |
| Method: | Least Squares | F-statistic: | 128.2 |
| Date: | Mon, 22 Mar 2021 | Prob (F-statistic): | 5.54e-137 |
| Time: | 23:26:41 | Log-Likelihood: | -1498.9 |
| No. Observations: | 506 | AIC: | 3022.0 |
| Df Residuals: | 494 | BIC: | 3072.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 36.3411 | 5.067 | 7.171 | 0.000 | 26.385 | 46.298 |
| LSTAT | -0.5226 | 0.047 | -11.019 | 0.000 | -0.616 | -0.429 |
| RM | 3.8016 | 0.406 | 9.356 | 0.000 | 3.003 | 4.600 |
| PTRATIO | -0.9465 | 0.129 | -7.334 | 0.000 | -1.200 | -0.693 |
| DIS | -1.4927 | 0.186 | -8.037 | 0.000 | -1.858 | -1.128 |
| NOX | -17.3760 | 3.535 | -4.915 | 0.000 | -24.322 | -10.430 |
| CHAS | 2.7187 | 0.854 | 3.183 | 0.002 | 1.040 | 4.397 |
| B | 0.0093 | 0.003 | 3.475 | 0.001 | 0.004 | 0.015 |
| ZN | 0.0458 | 0.014 | 3.390 | 0.001 | 0.019 | 0.072 |

| | | | |
|---|---|---|---|
| Omnibus: | 178.43 | Durbin-Watson: | 1.078 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 787.785 |
| Skew: | 1.523 | Prob(JB): | 8.6e-172 |
| Kurtosis: | 8.3 | Cond. No. | 14700.0 |
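
The AIC that drives the forward selection can be checked by hand from the summary. For a model fitted by maximum likelihood, AIC = 2k - 2 ln(L), where k is the number of estimated coefficients. The sketch below assumes k is the 11 regressors plus the intercept (which is how statsmodels counts parameters for OLS) and uses the rounded log-likelihood from the summary, so it only agrees with the exact value to about one decimal place.

```python
# Values copied (rounded) from the OLS summary above
log_likelihood = -1498.9
df_model = 11            # regressors in the final formula
k = df_model + 1         # assumption: the intercept counts as one more parameter

aic = 2 * k - 2 * log_likelihood
print(round(aic, 1))     # 3021.8
```

This matches the value printed by the last selection step (3021.726...) up to the rounding of the log-likelihood.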


**Regularization**

![jupyter](images/正则化.png) 

**Ridge Regression**

In [7]:
from sklearn.linear_model import Ridge

reg_ridge = Ridge(alpha=0.5)
reg_ridge.fit(X, y)
reg_ridge.score(X, y)

0.739957023371629
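
Note that `score(X, y)` here is R² on the training data, so it measures fit rather than generalization. To make the effect of `alpha` concrete, the following is a minimal closed-form sketch of ridge regression in NumPy, assuming the usual objective ||y - Xw||² + alpha·||w||² with an unpenalized intercept. The helper names `ridge_fit` and `r_squared` and the synthetic data are illustrative assumptions, not part of the notebook and not scikit-learn's implementation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit; the intercept is left unpenalized by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r_squared(X, y, w, b):
    """Coefficient of determination of the linear prediction X @ w + b."""
    residual = y - (X @ w + b)
    centered = y - y.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)

# Synthetic stand-in data (hypothetical, not the Boston table)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.5, 1.0])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.5, size=200)

results = []
for alpha in (0.0, 0.5, 50.0, 5000.0):
    w, b = ridge_fit(X_demo, y_demo, alpha)
    results.append((alpha, float(np.linalg.norm(w)), float(r_squared(X_demo, y_demo, w, b))))
    print(results[-1])
```

As `alpha` grows, the coefficient norm shrinks toward zero and the training R² can only go down; with `alpha=0` the fit reduces to ordinary least squares.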

**Lasso Regression**

In [8]:
from sklearn.linear_model import Lasso

reg_lasso = Lasso(alpha=0.5)
reg_lasso.fit(X, y)
reg_lasso.score(X, y)

0.7140164719858566
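
The lower training score compared with ridge at the same `alpha` reflects the L1 penalty, which can set coefficients exactly to zero. The sketch below illustrates that mechanism with a plain coordinate descent and soft-thresholding, assuming the objective (1/(2n))·||y - Xw||² + alpha·||w||₁ with an unpenalized intercept. The names `soft_threshold` and `lasso_fit`, the fixed iteration count, and the synthetic data are illustrative assumptions; this is not scikit-learn's solver.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning 0 when |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_fit(X, y, alpha, n_iter=200):
    """Coordinate-descent lasso with a fixed number of sweeps (no convergence check)."""
    n_samples, n_features = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centering handles the intercept
    w = np.zeros(n_features)
    z = (Xc ** 2).sum(axis=0) / n_samples     # per-feature scale
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            partial_residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / z[j]
    b = y_mean - x_mean @ w
    return w, b

# Synthetic stand-in data (hypothetical): two of the five true coefficients are zero
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.5, size=200)

w_lasso, b_lasso = lasso_fit(X_demo, y_demo, alpha=0.5)
print(w_lasso)                    # the two irrelevant features come out exactly zero
print(np.count_nonzero(w_lasso))  # number of features kept by the penalty
```

The surviving coefficients are also pulled toward zero, which is why a lasso fit usually scores lower on the training data than an unpenalized fit while being sparser.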