### Boston Housing Data

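To get a better feel for the metrics used in regression, we will work with the Boston housing dataset.

First, import the dataset in the cell below and split it into training and test data for later use.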

In [1]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
import tests2 as t

boston = load_boston()
y = boston.target
X = boston.data

X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=42)
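
For intuition about what the split above produces, here is a minimal NumPy sketch of a shuffled hold-out split. It is not scikit-learn's implementation: the helper name `simple_split` and the synthetic arrays are assumptions made for illustration. The Boston data has 506 rows and 13 features, so a test share of 0.33 leaves 167 rows for testing and 339 for training.

```python
import math
import numpy as np

def simple_split(X, y, test_size=0.33, seed=42):
    """Hypothetical helper: shuffle row indices, then slice off a test set."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # round the test share up
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in with the same shape as the Boston data (506 x 13)
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Fixing the seed plays the same role as `random_state=42` in the cell above: the same rows land in the test set on every run, so metric values are reproducible.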

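> **Step 1:** Before getting started, let's quickly check which models can be used for regression problems. Match each entry in the model dictionary below with the appropriate letter (the problem type).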

In [2]:
# When can you use the model - use each option as many times as necessary
a = 'regression'
b = 'classification'
c = 'both regression and classification'

models = {
    'decision trees': c,
    'random forest': c,
    'adaptive boosting': c,
    'logistic regression': b,
    'linear regression': a,
}

#checks your answer, no need to change this code
t.q1_check(models)

That's right!  All but logistic regression can be used for predicting numeric values.  And linear regression is the only one of these that you should not use for predicting categories.  Technically sklearn won't stop you from doing most of anything you want, but you probably want to treat cases in the way you found by answering this question!


> **Step 2:** Now import from sklearn the models identified above as usable for regression.

In [3]:
# Import models from sklearn - notice you will want to use 
# the regressor version (not classifier) - googling to find 
# each of these is what we all do!
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

> **Step 3:** Now that you have imported the four models that can be used for regression, instantiate each of them.

In [4]:
# Instantiate each of the models you imported
# For now use the defaults for all the hyperparameters
tree_mod = DecisionTreeRegressor()
rf_mod = RandomForestRegressor()
ada_mod = AdaBoostRegressor()
reg_mod = LinearRegression()

> **Step 4:** Fit each of your models on the training data.

In [5]:
# Fit each of your models using the training data
tree_mod.fit(X_train, y_train)
rf_mod.fit(X_train, y_train)
ada_mod.fit(X_train, y_train)
reg_mod.fit(X_train, y_train)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

> **Step 5:** Use each fitted model to predict on the test data.

In [6]:
# Predict on the test values for each model
preds_tree = tree_mod.predict(X_test) 
preds_rf = rf_mod.predict(X_test)
preds_ada = ada_mod.predict(X_test)
preds_reg = reg_mod.predict(X_test)

> **Step 6:** Now for the material specific to this lesson. Match each entry in the metrics dictionary below with the appropriate letter (the problem type).

In [7]:
# potential model options
a = 'regression'
b = 'classification'
c = 'both regression and classification'

# Match each metric to the problem type it applies to
metrics = {
    'precision': b,
    'recall': b,
    'accuracy': b,
    'r2_score': a,
    'mean_squared_error': a,
    'area_under_curve': b,
    'mean_absolute_area': a,
}

#checks your answer, no need to change this code
t.q6_check(metrics)

That's right! Looks like you know your metrics!


> **Step 6 (continued):** Now that you have identified the metrics that can be used for regression, import them from sklearn.

In [8]:
# Import the metrics from sklearn
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

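> **Step 7:** As in the classification exercise, let's make sure you know how each of these metrics is calculated. We can then check each value against the result from the corresponding sklearn function.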

In [9]:
def r2(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the r-squared score as a float
    '''
    sse = np.sum((actual-preds)**2)
    sst = np.sum((actual-np.mean(actual))**2)
    return 1 - sse/sst

# Check solution matches sklearn
print(r2(y_test, preds_tree))
print(r2_score(y_test, preds_tree))
print("Since the above match, we can see that we have correctly calculated the r2 value.")

0.7524360632365259
0.7524360632365259
Since the above match, we can see that we have correctly calculated the r2 value.
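
To make the formula 1 - SSE/SST more concrete, the sketch below evaluates it on a tiny hand-checkable array. The name `r2_demo` and the toy values are assumptions for illustration only. It shows three reference points: a perfect prediction scores 1, always predicting the mean scores 0, and a model worse than the mean scores below 0.

```python
import numpy as np

def r2_demo(actual, preds):
    """Same formula as the r2 function above: 1 - SSE/SST."""
    sse = np.sum((actual - preds) ** 2)
    sst = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - sse / sst

actual = np.array([1.0, 2.0, 3.0, 4.0])  # mean 2.5, SST = 5.0

perfect = r2_demo(actual, actual)                       # SSE = 0
baseline = r2_demo(actual, np.full(4, actual.mean()))   # SSE = SST
worse = r2_demo(actual, actual[::-1])                   # SSE = 20, four times SST

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```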


> **Step 8:** Fill in the function below and check whether your result matches the built-in mean_squared_error function.

In [10]:
def mse(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean squared error as a float
    '''
    
    return np.sum((actual-preds)**2)/len(actual) # calculate mse here


# Check your solution matches sklearn
print(mse(y_test, preds_tree))
print(mean_squared_error(y_test, preds_tree))
print("If the above match, you are all set!")

18.735269461077845
18.735269461077845
If the above match, you are all set!


> **Step 9:** One last time: complete the function below for the mean absolute error, then compare its result with the sklearn metric to make sure they match.

In [11]:
def mae(actual, preds):
    '''
    INPUT:
    actual - numpy array or pd series of actual y values
    preds - numpy array or pd series of predicted y values
    OUTPUT:
    returns the mean absolute error as a float
    '''
    
    return np.sum(np.abs(actual-preds))/len(actual) # calculate the mae here

# Check your solution matches sklearn
print(mae(y_test, preds_tree))
print(mean_absolute_error(y_test, preds_tree))
print("If the above match, you are all set!")

2.9802395209580834
2.9802395209580834
If the above match, you are all set!
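
MSE and MAE weigh errors differently, which is why they can disagree about which set of predictions is better. The toy example below is a sketch with made-up values (the names `preds_even`, `preds_spike`, `mse_demo`, and `mae_demo` are illustrative assumptions): one set of predictions is off by a moderate amount everywhere, the other is exact except for a single large miss.

```python
import numpy as np

def mse_demo(actual, preds):
    """Mean of squared errors: large misses dominate."""
    return np.mean((actual - preds) ** 2)

def mae_demo(actual, preds):
    """Mean of absolute errors: every unit of error counts equally."""
    return np.mean(np.abs(actual - preds))

actual = np.zeros(4)
preds_even = np.array([2.0, 2.0, 2.0, 2.0])   # four moderate errors
preds_spike = np.array([0.0, 0.0, 0.0, 6.0])  # one large error

print(mse_demo(actual, preds_even), mae_demo(actual, preds_even))    # 4.0 2.0
print(mse_demo(actual, preds_spike), mae_demo(actual, preds_spike))  # 9.0 1.5
```

MSE prefers `preds_even` (4.0 versus 9.0) while MAE prefers `preds_spike` (1.5 versus 2.0), because squaring makes the single miss of 6 count far more than four misses of 2.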


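> **Step 10:** Which model performed best on each metric? Note that R2 and MSE will always agree on the best model, because for a fixed test set R2 goes up exactly when MSE goes down, but MAE may pick a different one. Match each entry in the metrics dictionary below with the appropriate letter (the model).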

In [12]:
#match each metric to the model that performed best on it
a = 'decision tree'
b = 'random forest'
c = 'adaptive boosting'
d = 'linear regression'


best_fit = {
    'mse': b,
    'r2': b,
    'mae': b,
}

#Tests your answer - don't change this code
t.check_ten(best_fit)

That's right!  The random forest was best in terms of all the metrics this time!
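
The reason R2 and MSE always rank models the same way on a given test set is that R2 = 1 - SSE/SST = 1 - MSE / Var(y), where Var(y) = SST / n depends only on the true values. The sketch below checks this identity numerically on synthetic data; the arrays and the helper names `mse_demo` and `r2_demo` are assumptions for illustration, not part of the exercise.

```python
import numpy as np

def mse_demo(actual, preds):
    return np.mean((actual - preds) ** 2)

def r2_demo(actual, preds):
    sse = np.sum((actual - preds) ** 2)
    sst = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - sse / sst

rng = np.random.default_rng(0)
actual = rng.normal(size=50)
preds_a = actual + rng.normal(scale=0.5, size=50)  # smaller noise
preds_b = actual + rng.normal(scale=1.0, size=50)  # larger noise

var_y = np.var(actual)  # population variance, i.e. SST / n

mse_a, mse_b = mse_demo(actual, preds_a), mse_demo(actual, preds_b)
r2_a, r2_b = r2_demo(actual, preds_a), r2_demo(actual, preds_b)

# R2 recomputed from MSE should match the direct calculation
print(np.isclose(r2_a, 1 - mse_a / var_y), np.isclose(r2_b, 1 - mse_b / var_y))
```

Since `var_y` is the same for every model scored on the same test set, the model with the lowest MSE necessarily has the highest R2.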


In [13]:
# cells for work

In [14]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the y values that are actually true in the dataset (numpy array or pandas series)
    preds - the predictions for those values from some model (numpy array or pandas series)
    model_name - (str - optional) a name associated with the model if you would like to add it to the print statements 
    
    OUTPUT:
    None - prints the mse, mae, r2
    '''
    if model_name is None:
        print('Mean Squared Error: ', format(mean_squared_error(y_true, preds)))
        print('Mean Absolute Error: ', format(mean_absolute_error(y_true, preds)))
        print('R2 Score: ', format(r2_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Mean Squared Error ' + model_name + ' :' , format(mean_squared_error(y_true, preds)))
        print('Mean Absolute Error ' + model_name + ' :', format(mean_absolute_error(y_true, preds)))
        print('R2 Score ' + model_name + ' :', format(r2_score(y_true, preds)))
        print('\n\n')

In [15]:
# Print Decision Tree scores
print_metrics(y_test, preds_tree, 'tree')

# Print Random Forest scores
print_metrics(y_test, preds_rf, 'random forest')

# Print AdaBoost scores
print_metrics(y_test, preds_ada, 'adaboost')

# Linear Regression scores
print_metrics(y_test, preds_reg, 'linear reg')

Mean Squared Error tree : 18.735269461077845
Mean Absolute Error tree : 2.9802395209580834
R2 Score tree : 0.7524360632365259



Mean Squared Error random forest : 11.220932934131737
Mean Absolute Error random forest : 2.271497005988024
R2 Score random forest : 0.8517289363196189



Mean Squared Error adaboost : 16.817016037852756
Mean Absolute Error adaboost : 2.987831262717019
R2 Score adaboost : 0.7777834632379077



Mean Squared Error linear reg : 20.72402343733974
Mean Absolute Error linear reg : 3.1482557548168217
R2 Score linear reg : 0.7261570836552478
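
Rather than reading the best model off the printed output by eye, the scores can be collected in a dictionary and compared programmatically. The sketch below copies the values printed above; the `results` and `best` names are assumptions introduced for illustration.

```python
# Scores copied from the printed output above
results = {
    'decision tree': {
        'mse': 18.735269461077845, 'mae': 2.9802395209580834, 'r2': 0.7524360632365259},
    'random forest': {
        'mse': 11.220932934131737, 'mae': 2.271497005988024, 'r2': 0.8517289363196189},
    'adaptive boosting': {
        'mse': 16.817016037852756, 'mae': 2.987831262717019, 'r2': 0.7777834632379077},
    'linear regression': {
        'mse': 20.72402343733974, 'mae': 3.1482557548168217, 'r2': 0.7261570836552478},
}

# Lower is better for mse and mae; higher is better for r2
best = {
    'mse': min(results, key=lambda name: results[name]['mse']),
    'mae': min(results, key=lambda name: results[name]['mae']),
    'r2': max(results, key=lambda name: results[name]['r2']),
}

print(best)
# {'mse': 'random forest', 'mae': 'random forest', 'r2': 'random forest'}
```

This agrees with the answer to Step 10: the random forest is best on all three metrics for this split.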



