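### Used Car Price Prediction - T1: Understanding the Task
Train a model on used-car transaction data and predict each car's transaction price.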

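#### Data Overview
The data contains the following attributes:
- SaleID - sale sample ID
- name - car code
- regDate - car registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox type
- power - engine power
- kilometer - distance the car has been driven
- notRepairedDamage - whether the car has damage that has not yet been repaired
- regionCode - code of the region where the car can be viewed
- seller - seller
- offerType - offer type
- creatDate - date the ad was posted
- price - car price (the prediction target)
- v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8, v_9, v_10, v_11, v_12, v_13, v_14 - 15 anonymous features, v_0 through v_14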
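
As a quick sanity check on the attribute list above, the sketch below rebuilds the expected header in plain Python. It assumes the test file carries every column except `price`; the names `named_cols`, `anon_cols`, `train_cols` and `test_cols` are illustrative and not part of the dataset.

```python
# Named columns, in the order listed above
named_cols = [
    "SaleID", "name", "regDate", "model", "brand", "bodyType", "fuelType",
    "gearbox", "power", "kilometer", "notRepairedDamage", "regionCode",
    "seller", "offerType", "creatDate", "price",
]
# Anonymous features v_0 .. v_14
anon_cols = [f"v_{i}" for i in range(15)]

train_cols = named_cols + anon_cols
# Assumption: the test set has the same columns minus the target
test_cols = [c for c in train_cols if c != "price"]

print(len(train_cols), len(test_cols))
```

The two counts can be compared with the column counts in the shapes printed when the files are loaded later in this notebook.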

#### Evaluation Metric
This competition is scored by **MAE (Mean Absolute Error)**:

$ MAE=\frac{\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|}{n} $

where $y_{i}$ is the true value of the $i$-th sample and $\hat{y}_{i}$ is its predicted value.
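
One practical consequence of scoring by MAE is that, among all constant predictions, the median of the targets gives the lowest MAE, so it makes a natural baseline to beat. The sketch below illustrates this on made-up prices; the values in `prices` are hypothetical rather than taken from the competition data, and `mae` is a helper defined here only for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical prices, including one very expensive car
prices = np.array([1850.0, 3600.0, 6222.0, 2400.0, 5200.0, 99999.0])

# Two constant baselines: predict the median, or predict the mean
median_pred = np.full_like(prices, np.median(prices))
mean_pred = np.full_like(prices, np.mean(prices))

mae_median = mae(prices, median_pred)
mae_mean = mae(prices, mean_pred)
print("MAE of median baseline:", mae_median)
print("MAE of mean baseline:  ", mae_mean)
```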

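**General evaluation metrics:**
An evaluation metric quantifies how well a model performs as a single number. (It is a bit like a rating for a product, except that it scores the gap between the model's actual performance and the ideal.)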


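Common evaluation metrics for classification and regression problems include:
- Common metrics for classification algorithms:
  1. For binary classifiers, the main metrics are accuracy, [precision, recall, F-score, PR curve], and the ROC curve with its AUC.
  2. For multi-class classifiers, the main metrics are accuracy and [macro and micro averages, F-score].
- Common metrics for regression:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - Mean Absolute Percentage Error (MAPE)
  - Root Mean Squared Error (RMSE)
  - R2 (R-Square)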
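
The macro and micro averages listed for multi-class problems differ only in where the averaging happens. Here is a minimal NumPy sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical): macro-F1 averages the per-class F1 scores, while micro-F1 pools the counts across classes first.

```python
import numpy as np

# Hypothetical 3-class labels and predictions
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])
labels = np.unique(y_true)

# One-vs-rest counts for each class
tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in labels])
fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in labels])
fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in labels])

# Macro: compute F1 per class, then average the scores
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

# Micro: sum the counts over classes, then compute a single F1
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

accuracy = np.mean(y_true == y_pred)
print("macro F1:", macro_f1)
print("micro F1:", micro_f1, "accuracy:", accuracy)
```

For single-label multi-class data, micro-F1 coincides with accuracy, which the last line lets you check. scikit-learn exposes the same choice through the `average` argument of `f1_score`.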


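**Mean Absolute Error** (MAE): the average of the absolute differences between the predicted and true values. Because positive and negative errors cannot cancel out and the result is in the same units as the target, it reflects the actual size of the prediction error well. It is computed as: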

$MAE=\frac{1}{N} \sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$

**Mean Squared Error** (MSE): the average of the squared errors, computed as:

$MSE=\frac{1}{N} \sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}$
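
To make the two formulas concrete, the sketch below evaluates MAE, MSE and RMSE directly with NumPy on the same toy arrays this notebook uses in its regression cell, then adds one large error to show how differently the two metrics react. The corrupted prediction `y_bad` is a hypothetical added for illustration.

```python
import numpy as np

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

err = y_true - y_pred
mae = np.mean(np.abs(err))   # (1/N) * sum |y_i - y_hat_i|
mse = np.mean(err ** 2)      # (1/N) * sum (y_i - y_hat_i)^2
rmse = np.sqrt(mse)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)

# Hypothetical: one prediction is off by 10
y_bad = y_pred.copy()
y_bad[0] = 11.0
err_bad = y_true - y_bad
mae_bad = np.mean(np.abs(err_bad))
mse_bad = np.mean(err_bad ** 2)

# Squaring makes MSE grow much faster than MAE under a single large error
print("MAE grows by a factor of", mae_bad / mae)
print("MSE grows by a factor of", mse_bad / mse)
```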

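**Mean Absolute Percentage Error** (MAPE): divides each sample's error by its true value, which normalizes the errors and reduces the influence of the large absolute errors caused by individual outliers. It is undefined when any $y_i$ is 0.

$MAPE=\frac{100}{N} \sum_{i=1}^{N}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|$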
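
A minimal sketch of the MAPE formula as a percentage, assuming we want to reject zero targets explicitly rather than let the division produce `inf`. The function name `mape_percent` is illustrative; the arrays are the toy values from this notebook's regression cell.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """MAPE in percent; raises if any true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when a true value is 0")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
result = mape_percent(y_true, y_pred)
print("MAPE (%):", result)

# Normalization: the same relative error counts equally at any scale
small = mape_percent([100.0], [110.0])
large = mape_percent([10000.0], [11000.0])
print(small, large)
```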

**R2** (R-Square):
- The residual sum of squares is computed as:
$SS_{res}=\sum\left(y_{i}-\hat{y}_{i}\right)^{2}$
- The total sum of squares is computed as:
$SS_{tot}=\sum\left(y_{i}-\overline{y}\right)^{2}$

where $\overline{y}$ is the mean of the $y_i$. The expression for $R^2$ is then:

$R^{2}=1-\frac{SS_{res}}{SS_{tot}}=1-\frac{\sum\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum\left(y_{i}-\overline{y}\right)^{2}}$
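
The expression can be checked by hand. The sketch below computes $SS_{res}$, $SS_{tot}$ and $R^2$ with NumPy on the toy arrays from this notebook's regression cell; the variable names are illustrative.

```python
import numpy as np

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# Residual sum of squares: squared distance from predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: squared distance from the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print("SS_res:", ss_res, "SS_tot:", ss_tot, "R2:", r2)
```

The result should agree with the `r2_score` value printed in the regression cell.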

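$R^2$ measures the proportion of the variation in the dependent variable that is explained by the independent variables. For a least-squares fit with an intercept, evaluated on its own training data, it lies between 0 and 1; on other data, or for a model that fits worse than predicting the mean, it can be negative.
The closer $R^2$ is to 1, the larger the regression sum of squares is relative to the total sum of squares, the closer the regression line is to the observations, and the more of the variation in y is explained by variation in x, that is, the better the fit. For this reason
$R^2$ is also called the goodness-of-fit statistic.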
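
Three reference points help when reading an $R^2$ value computed as $1-SS_{res}/SS_{tot}$: a perfect prediction scores 1, always predicting the mean of the targets scores 0, and anything worse than the mean scores below 0. A small sketch on made-up numbers (the arrays and the helper `r2` are hypothetical):

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0])

r2_perfect = r2(y, y)                       # predictions equal the targets
r2_mean = r2(y, np.full_like(y, y.mean()))  # always predict the mean
r2_reversed = r2(y, y[::-1])                # worse than predicting the mean

print(r2_perfect, r2_mean, r2_reversed)
```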

In [6]:
# Load the data
import pandas as pd
import numpy as np

# Read the training and test sets (the files are space-separated)
path = '../../dataset/user_car_data/'
train_data = pd.read_csv(path + 'used_car_train.csv', sep=' ')
test_data = pd.read_csv(path + 'used_car_testA.csv', sep=' ')

print('Training set shape:', train_data.shape)
print('Test set shape:', test_data.shape)

train_data.head()

Training set shape: (150000, 31)
Test set shape: (50000, 30)


|  | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
| 1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
| 2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.25141 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.56533 | -0.832687 | -0.229963 |
| 3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.1103 | 0.121964 | 0.033395 | 0.0 | -4.509599 | 1.28594 | -0.501868 | -2.438353 | -0.478699 |
| 4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.09188 | 0.078819 | 0.121534 | -1.89624 | 0.910783 | 0.93111 | 2.834518 | 1.923482 |


In [11]:
# Computing classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Hard class labels for accuracy, precision, recall and F1
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
# True labels and predicted scores for AUC
y1_true = np.array([0, 0, 1, 1])
y1_pred = np.array([0.1, 0.4, 0.35, 0.8])
# accuracy
print('accuracy: ', accuracy_score(y_true, y_pred))
# precision
print('precision: ', precision_score(y_true, y_pred))
# recall
print('recall: ', recall_score(y_true, y_pred))
# f1-score
print('f1 score: ', f1_score(y_true, y_pred))
# AUC
print('AUC score: ', roc_auc_score(y1_true, y1_pred))

accuracy:  0.75
precision:  1.0
recall:  0.6666666666666666
f1 score:  0.8
AUC score:  0.75
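
The numbers printed above can be traced back to simple counts. The sketch below recomputes them with NumPy only, using the same toy inputs: the first four metrics come from the confusion-matrix counts, and AUC is computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. This pair-counting view is one standard way to interpret AUC, not necessarily how scikit-learn computes it internally.

```python
import numpy as np

y_true = np.array([0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1])

# Confusion-matrix counts for the positive class (label 1)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# AUC by pair counting: compare every positive score with every negative score
y1_true = np.array([0, 0, 1, 1])
y1_score = np.array([0.1, 0.4, 0.35, 0.8])
pos = y1_score[y1_true == 1]
neg = y1_score[y1_true == 0]
diff = pos[:, None] - neg[None, :]
# Correctly ordered pairs count 1, ties count 0.5
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print(auc)
```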


In [14]:
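from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Computing regression metrics
# MAPE, returned as a fraction (multiply by 100 for a percentage);
# undefined if y_true contains 0
def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true))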

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE
print('MSE: ', mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE: ', np.sqrt(mean_squared_error(y_true, y_pred)))
# MAE
print('MAE: ', mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE: ', mape(y_true, y_pred))

# R2-Score
print('R2-Score: ', r2_score(y_true, y_pred))

MSE:  0.2871428571428571
RMSE:  0.5358571238146014
MAE:  0.4142857142857143
MAPE:  0.1461904761904762
R2-Score:  0.957874251497006


#### References
> [Introduction to Data Mining from Scratch: Used Car Price Prediction (零基础入门数据挖掘 - 二手车交易价格预测)](https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.15.1cd8593aw4bbL5&postId=95456)