# Step 1: Understanding the Competition Task

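# 1.1 What needs to be done
Much like the four diagnostic steps of traditional Chinese medicine (look, listen, ask, and feel the pulse), the first thing to do with an unfamiliar competition task is to examine the data. Datasets are often large, with hundreds of thousands or millions of rows being common, so opening them directly in Excel is slow and still does not readily give the information we want. Reading the data with Python's pandas library and calling a few simple functions quickly reveals the columns, data types, row counts, and similar information.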
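
As a minimal sketch of this first look, the snippet below parses only the first few rows of a space-separated file and prints its shape, column types, and missing-value counts. The inline sample data and its values are made up for illustration; with the real files you would pass the file path instead of the `StringIO` buffer.

```python
import io
import pandas as pd

# Hypothetical miniature dataset standing in for a large competition file.
sample = io.StringIO(
    "SaleID brand power price\n"
    "0 6 60 1850\n"
    "1 1 0 3600\n"
    "2 15 163 6222\n"
    "3 10 193 2400\n"
)

# nrows limits how many rows are parsed, which keeps a first look fast on big files.
preview = pd.read_csv(sample, sep=" ", nrows=3)

print(preview.shape)           # (rows, columns)
print(preview.dtypes)          # data type of each column
print(preview.isnull().sum())  # number of missing values per column
```

Limiting `nrows` is only a convenience for a quick preview; the full file is still needed for any real analysis.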

Competition page:
https://tianchi.aliyun.com/competition/entrance/231784/information

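# 1.2 Understanding the task

There are three things to understand:

- Understand what the task is about; look up reference material to learn the background of the competition

- Understand the data: which attributes it has, how many rows and columns, and what each one represents

- Understand the evaluation metric and how submissions are scored, since this sets the direction for designing algorithms and training models in the next step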

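## 1.2.1 Understanding the problem

Official description: The task is to predict the transaction price of used cars. The dataset is visible and downloadable after registration. The data comes from used-car transaction records on a trading platform; it contains more than 400,000 records with 31 columns of variables, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are desensitized.

My reading: this is a regression problem. The training data consists of used-car attributes, and the label is the price of the corresponding car; the test data consists of the cars whose prices we must predict. We therefore need to use the given data and a suitable algorithm to train a regression model that predicts the price for each test record.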
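
To make the regression framing concrete, here is a minimal sketch on synthetic data. The two features, the linear "price" formula, and the choice of `LinearRegression` are assumptions made purely for illustration; they are not the competition data or a recommended model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for car attributes; not the real data.
X_train = rng.uniform(0, 1, size=(200, 2))
# Synthetic "price" label: a linear function of the features plus small noise.
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(0, 0.01, size=200)

model = LinearRegression()
model.fit(X_train, y_train)               # learn the mapping from features to price

X_test = rng.uniform(0, 1, size=(50, 2))  # unlabeled rows, like the test set
y_pred = model.predict(X_test)            # one predicted price per test row

print(y_pred.shape)
```

The workflow is the same for any regressor: fit on labeled training rows, then predict on the unlabeled test rows.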

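## 1.2.2 Understanding the data

The fields of each record and their descriptions are as follows:

SaleID - sale record ID

name - car code

regDate - car registration date

model - car model code

brand - brand

bodyType - body type

fuelType - fuel type

gearbox - gearbox

power - engine power

kilometer - distance driven

notRepairedDamage - whether the car has unrepaired damage

regionCode - code of the region where the car can be viewed

seller - seller

offerType - offer type

creatDate - date the listing was published

price - car price

v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8, v_9, v_10, v_11, v_12, v_13, v_14 - anonymous features (15 in total, v_0 through v_14)

The values are all desensitized and given in label-encoded form, i.e., as numbers.

The train data has 31 attributes; the one extra attribute compared with the test data is price, the price of the used car.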
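
Some of the fields listed above are stored as plain integers rather than in their natural type; regDate, for instance, appears in the data preview later in this section as values such as 20040402. The sketch below shows one way such a value could be parsed into a date, assuming the YYYYMMDD layout that these sample values suggest. `errors="coerce"` is used so that any value that does not fit the pattern becomes `NaT` instead of raising.

```python
import pandas as pd

# Sample values taken from the regDate column of the data preview.
reg_date = pd.Series([20040402, 20030301, 19960908])

# Convert to strings first, then parse as year-month-day.
parsed = pd.to_datetime(reg_date.astype(str), format="%Y%m%d", errors="coerce")

print(parsed.dt.year.tolist())
```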

## 1.2.3 Understanding the evaluation metric

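# 1.3 Exploring the data with code

## 1.3.1 Reading the data with pandas
Make sure the path matches where the files are stored. The first number in each printed shape is the row count, i.e., how many records the training and test sets contain. The training set has 31 columns and the test set has 30; as the analysis below confirms, the column missing from the test set is price, which is the label we need to predict.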

In [4]:
import pandas as pd
import numpy as np

# 1) Load the training set and the test set
path = './'
Train_data = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')

print('Train data shape:',Train_data.shape)
print('TestA data shape:',Test_data.shape)

Train data shape: (150000, 31)
TestA data shape: (50000, 30)
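
The shapes above differ by one column. A quick way to see which column that is would be to compare the column sets directly. The sketch below uses two tiny hypothetical frames (the values are invented; only the column names matter) so that it runs on its own.

```python
import pandas as pd

# Tiny hypothetical frames; only the column names matter here.
train_df = pd.DataFrame({"SaleID": [0], "power": [60], "price": [1850]})
test_df = pd.DataFrame({"SaleID": [150000], "power": [313]})

# Columns present in train but absent from test.
extra_cols = sorted(set(train_df.columns) - set(test_df.columns))
print(extra_cols)
```

With the frames loaded above, the same expression applied to `Train_data.columns` and `Test_data.columns` should list the label column.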


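## head shows the first n rows for a quick look, info shows basic information about the data, and describe shows summary statistics for the numeric columns
The basic information includes:
column names, number of entries, whether null values are present, and data types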

In [9]:
Train_data.head(10)
Train_data.info()
Train_data.describe()

|  | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
| 1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
| 2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.25141 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.56533 | -0.832687 | -0.229963 |
| 3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.1103 | 0.121964 | 0.033395 | 0.0 | -4.509599 | 1.28594 | -0.501868 | -2.438353 | -0.478699 |
| 4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.09188 | 0.078819 | 0.121534 | -1.89624 | 0.910783 | 0.93111 | 2.834518 | 1.923482 |
| 5 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 0.260246 | 0.000518 | 0.119838 | 0.090922 | 0.048769 | 1.885526 | -2.721943 | 2.45766 | -0.286973 | 0.206573 |
| 6 | 6 | 2402 | 19990411 | 13.0 | 4 | 0.0 | 0.0 | 1.0 | 150 | 15.0 | ... | 0.267998 | 0.117675 | 0.142334 | 0.025446 | 0.028174 | -4.9022 | 1.610616 | -0.834605 | -1.996117 | -0.10318 |
| 7 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 0.239506 | 0.0 | 0.122943 | 0.039839 | 0.082413 | 3.693829 | -0.245014 | -2.19281 | 0.236728 | 0.195567 |
| 8 | 8 | 2974 | 20030205 | 19.0 | 1 | 2.0 | 1.0 | 1.0 | 179 | 15.0 | ... | 0.263833 | 0.116583 | 0.144255 | 0.039851 | 0.024388 | -4.925234 | 1.587796 | 0.075348 | -1.551098 | 0.069433 |
| 9 | 9 | 82021 | 19980101 | 7.0 | 7 | 5.0 | 0.0 | 0.0 | 88 | 15.0 | ... | 0.262473 | 0.068267 | 0.012176 | 0.010291 | 0.098727 | -1.089584 | 0.600683 | -4.18621 | 0.198273 | -1.025822 |


In [8]:
Test_data.head()
Test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non
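
In the `info()` output above, bodyType, fuelType, and gearbox report fewer than 50000 non-null entries, so these columns contain missing values (for example, 50000 - 48587 = 1413 missing bodyType values). A minimal sketch of counting missing values directly is shown below; the mini-frame and its values are hypothetical and only mimic gaps in those columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with gaps in the same columns that show gaps in info().
df = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0],
    "fuelType": [0.0, 0.0, np.nan, np.nan],
    "gearbox": [0.0, 1.0, 0.0, np.nan],
    "power": [60, 0, 163, 193],
})

missing = df.isnull().sum()        # missing values per column
missing_ratio = missing / len(df)  # share of rows that are missing

print(missing[missing > 0])        # only the columns that have gaps
print(missing_ratio.round(2))
```

Applied to `Test_data` or `Train_data`, the same two lines give the per-column counts without subtracting by hand.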

## 1.3.2 评价指标计算
This covers both classification metrics and regression metrics.

In [11]:
# Classification
# Accuracy
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('ACC:',accuracy_score(y_true, y_pred))

ACC: 0.75
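
Accuracy is simply the fraction of predictions that match the true labels. As a sketch, the same number can be reproduced by hand with NumPy on the lists used above (three of the four predictions are correct):

```python
import numpy as np

y_pred = np.array([0, 1, 0, 1])
y_true = np.array([0, 1, 1, 1])

# Accuracy = number of correct predictions / total number of predictions.
acc_manual = np.mean(y_pred == y_true)
print('ACC (manual):', acc_manual)
```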


### This competition is scored with MAE (mean absolute error)

In [12]:
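# Regression
import numpy as np
from sklearn import metrics

# MAPE is implemented by hand here (scikit-learn only gained
# mean_absolute_percentage_error in version 0.24).
# Note: it divides by y_true, so y_true must not contain zeros.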
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))

MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.1461904761904762
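
For reference, the regression metrics printed above have the following standard definitions, where $y_i$ is the true value, $\hat{y}_i$ the prediction, and $n$ the number of samples. The MAPE line matches the hand-written `mape` function, and it is undefined whenever some $y_i$ is zero.

$$
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \\
\mathrm{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i-y_i}{y_i}\right|
\end{aligned}
$$

As a check against the seven sample pairs above, the absolute errors are 0, 0.5, 0.2, 0.2, 1.0, 0.2, and 0.8, which sum to 2.9, so MAE = 2.9 / 7 ≈ 0.414; the squared errors sum to 2.01, so MSE = 2.01 / 7 ≈ 0.287.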


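### r2_score is another metric for evaluating regression models
R2 (the coefficient of determination) indicates how well a regression fits the data, or in other words how reliable the regression's estimates are.
The closer R2 is to 1, the better the fit; a model that always predicts the mean of the true values scores 0, and R2 can be negative for a model that does worse than that.

Reference: https://blog.csdn.net/Dear_D/article/details/86144696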

In [14]:
## R2-score
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',r2_score(y_true, y_pred))

R2-score: 0.9486081370449679
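
The value above can be reproduced from the definition R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal sketch with the same sample values (the variable names `ss_res`, `ss_tot`, and `r2_manual` are just illustrative):

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print('R2-score (manual):', r2_manual)
```

Here SS_res = 1.5 and SS_tot = 29.1875, which gives approximately 0.9486, in agreement with `r2_score`.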


# References
https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12281978.0.0.6802593aCd2UYt&postId=95456