# Multiple Linear Regression / Log-Linear Regression (Choose One)

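## I. Multiple Linear Regression
In this part you will complete a multiple linear regression. We first walk through 10-fold cross-validation of a simple (single-feature) linear regression with sklearn; you can then follow the same pattern for the multiple-feature case.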

### 1. Load the data

In [1]:
import numpy as np

In [2]:
import pandas as pd

# Load the data
data = pd.read_csv('data/kaggle_house_price_prediction/kaggle_hourse_price_train.csv')

# Drop features (columns) that contain missing values
data.dropna(axis = 1, inplace = True)

# Keep only the integer-valued (int64) features
data = data.select_dtypes(include = 'int64')
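
To make the two filtering steps concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names `a` to `d` are illustrative assumptions and are not part of the Kaggle data.

```python
import pandas as pd

# Toy frame (hypothetical columns, not from the Kaggle dataset)
toy = pd.DataFrame({
    'a': pd.Series([1, 2, 3], dtype='int64'),  # integer, no missing values
    'b': [1.0, None, 3.0],                     # float with one missing value
    'c': ['x', 'y', 'z'],                      # string column
    'd': [1.5, 2.5, 3.5],                      # float, no missing values
})

# Drop every column that contains at least one missing value
no_missing = toy.dropna(axis=1)

# Keep only the int64 columns
int_only = no_missing.select_dtypes(include='int64')

print(list(no_missing.columns))
print(list(int_only.columns))
```

Note that the second step also discards complete float and string columns, so it is a deliberately simple way to get purely numeric inputs rather than a general preprocessing recipe.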

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   BsmtFinSF1     1460 non-null   int64
 8   BsmtFinSF2     1460 non-null   int64
 9   BsmtUnfSF      1460 non-null   int64
 10  TotalBsmtSF    1460 non-null   int64
 11  1stFlrSF       1460 non-null   int64
 12  2ndFlrSF       1460 non-null   int64
 13  LowQualFinSF   1460 non-null   int64
 14  GrLivArea      1460 non-null   int64
 15  BsmtFullBath   1460 non-null   int64
 16  BsmtHalfBath   1460 non-null   int64
 17  FullBath       1460 non-null   int64
 18  HalfBath       1460 non-null   int64
 19  Bedroo

### 2. Import the model

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

### 3. 10-fold cross-validation of simple linear regression with sklearn

#### Create the model

In [6]:
model = LinearRegression()

#### Select the data

In [20]:
features = ['LotArea']
x = data[features]
y = data['SalePrice']
x.shape

(1460, 1)

#### Predict with 10-fold cross-validation

In [17]:
prediction = cross_val_predict(model, x, y, cv = 10)

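This 10-fold cross-validation proceeds in row order: the first 10% of the rows serve as the test set, then the rows from 10% to 20%, and so on. The predictions for the ten test folds are then concatenated in that order and returned, so `prediction` has one entry per row of `x`.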
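
The ordering described above can be checked directly. The sketch below assumes that an integer `cv` on a regressor behaves like an unshuffled `KFold`, and uses a placeholder array with the same number of rows (1460) instead of the real data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1460          # number of rows reported by data.info()
kf = KFold(n_splits=10)   # shuffle=False by default, so folds are contiguous

# Record the first and last row index of each test fold
test_ranges = []
for train_idx, test_idx in kf.split(np.zeros((n_samples, 1))):
    test_ranges.append((int(test_idx[0]), int(test_idx[-1])))

print(test_ranges[0])    # first test fold
print(test_ranges[-1])   # last test fold
```

Because the folds follow the file order, any ordering in the data carries over into the folds. If that is a concern, one option is to pass a shuffled splitter such as `KFold(n_splits=10, shuffle=True, random_state=0)` as `cv`.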

In [22]:
prediction.shape

(1460,)

### 4. Compute the evaluation metrics

#### MAE

In [23]:
mean_absolute_error(data['SalePrice'], prediction)

55394.44195244894

#### RMSE

In [24]:
mean_squared_error(data['SalePrice'], prediction) ** 0.5

77868.51337752414
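
For reference, the two metrics can be written out by hand. This is a minimal pure-Python sketch on made-up numbers, not the sklearn implementation; the helper names `mae` and `rmse` are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [3.0, 5.0, 2.0]   # made-up targets
y_pred = [2.0, 5.0, 4.0]   # made-up predictions

print(mae(y_true, y_pred))    # errors are 1, 0, 2, so the mean is 1.0
print(rmse(y_true, y_pred))   # sqrt((1 + 0 + 4) / 3)
```

Both metrics are in the units of the target (here, the sale price). RMSE is never smaller than MAE and grows faster when a few predictions are far off, which is why the two are usually reported together.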

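### 5. Combine several features to build multiple linear regression models, and compare how the MAE and RMSE under 10-fold cross-validation differ across feature combinations. Complete at least 3 combinations.

###### Extension: polynomial regression (an extension of simple linear regression). Try transforming some features, for example feeding their squares and cubes to the model as extra features, and observe how the model's predictive ability changes.
###### Hint: for multiple linear regression, just add the names of other features to the `features` list above.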
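
One way to approach the polynomial extension is to add powers of a feature as new columns and fit the same `LinearRegression`. The sketch below uses synthetic data with a quadratic relationship; the frame `toy`, its columns, and the helper `cv_rmse` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic data (an assumption for illustration): y depends on x quadratically
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.uniform(0, 10, size=200)})
toy['y'] = 3 * toy['x'] ** 2 + rng.normal(0, 1, size=200)

# Add the second and third powers as extra feature columns
toy['x_sq'] = toy['x'] ** 2
toy['x_cube'] = toy['x'] ** 3

def cv_rmse(features):
    """10-fold cross-validated RMSE of a linear model on the given columns."""
    pred = cross_val_predict(LinearRegression(), toy[features], toy['y'], cv=10)
    return mean_squared_error(toy['y'], pred) ** 0.5

rmse_linear = cv_rmse(['x'])
rmse_poly = cv_rmse(['x', 'x_sq', 'x_cube'])
print(rmse_linear, rmse_poly)
```

On the housing data the same idea would look like `data['LotArea_sq'] = data['LotArea'] ** 2` (the column name `LotArea_sq` is hypothetical), followed by adding that name to the feature list. Whether it helps there has to be checked with cross-validation.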

In [54]:
features1 = ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd']
x = data[features1]
y = data['SalePrice']

# Out-of-fold predictions from 10-fold cross-validation
prediction1 = cross_val_predict(model, x, y, cv = 10)

# sklearn metrics take (y_true, y_pred)
mae1 = mean_absolute_error(y, prediction1)
rmse1 = mean_squared_error(y, prediction1) ** 0.5

print('MAE1:', mae1)
print('RMSE1:', rmse1)

MAE1: 30901.33384336828
RMSE1: 45629.63763073459


In [55]:
features2 = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']
x = data[features2]
y = data['SalePrice']

# Out-of-fold predictions from 10-fold cross-validation
prediction2 = cross_val_predict(model, x, y, cv = 10)

# sklearn metrics take (y_true, y_pred)
mae2 = mean_absolute_error(y, prediction2)
rmse2 = mean_squared_error(y, prediction2) ** 0.5

print('MAE2:', mae2)
print('RMSE2:', rmse2)

MAE2: 31040.859234506654
RMSE2: 49952.87170209668


In [56]:
# YOUR CODE HERE
features3 = ['LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']
x = data[features3]
y = data['SalePrice']

# Out-of-fold predictions from 10-fold cross-validation
prediction3 = cross_val_predict(model, x, y, cv = 10)

# sklearn metrics take (y_true, y_pred)
mae3 = mean_absolute_error(y, prediction3)
rmse3 = mean_squared_error(y, prediction3) ** 0.5

print('MAE3:', mae3)
print('RMSE3:', rmse3)

MAE3: 34266.78516313289
RMSE3: 51790.84507999503
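
The three cells above repeat the same steps, so the comparison could also be wrapped in a small helper and run over a dictionary of feature sets. This is a sketch on synthetic data; the function `evaluate`, the frame `toy`, and the columns `f1` to `f3` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_predict

def evaluate(df, features, target, cv=10):
    """Return (MAE, RMSE) of a linear model under k-fold cross-validation."""
    pred = cross_val_predict(LinearRegression(), df[features], df[target], cv=cv)
    mae = mean_absolute_error(df[target], pred)
    rmse = mean_squared_error(df[target], pred) ** 0.5
    return mae, rmse

# Synthetic stand-in for the housing data (columns f1..f3 are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
toy['target'] = 2 * toy['f1'] - toy['f2'] + rng.normal(0, 0.1, size=100)

feature_sets = {
    'set 1': ['f1'],
    'set 2': ['f1', 'f2'],
    'set 3': ['f1', 'f2', 'f3'],
}
results = {name: evaluate(toy, cols, 'target') for name, cols in feature_sets.items()}
for name, (mae, rmse) in results.items():
    print(f'{name}: MAE={mae:.3f}, RMSE={rmse:.3f}')
```

With the notebook's data the call would be along the lines of `evaluate(data, features1, 'SalePrice')`.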


###### Double-click here to fill in
1. Features used by Model 1: 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd'
2. Features used by Model 2: 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'
3. Features used by Model 3: 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath'

Model | MAE | RMSE
-|-|-
Model 1 | 30901.33 | 45629.64
Model 2 | 31040.86 | 49952.87
Model 3 | 34266.79 | 51790.85

Of the three, Model 1 has the lowest MAE and RMSE under 10-fold cross-validation, and Model 3 has the highest of both.