# Vegetable Price Prediction Model

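April 25, 2018

In this project we use vegetable price data from Chengdu to build a model that predicts the highest and lowest prices. We explore several commonly used supervised learning algorithms and pick the best-performing one.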

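## 1. Data Preparation

### 1.1 Importing the Data

We first load the data from a CSV file, compute for each record the average price over the previous 3 days and the previous 9 days, and then split the data into features and targets.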

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [29]:
data = pd.read_csv('./vegetable_prices.csv')

# mean price over the previous 3 and 9 rows (shift(1) excludes the current row)
data['h_3'] = data['h_price'].shift(1).rolling(window=3).mean()
data['h_9'] = data['h_price'].shift(1).rolling(window=9).mean()
data['l_3'] = data['l_price'].shift(1).rolling(window=3).mean()
data['l_9'] = data['l_price'].shift(1).rolling(window=9).mean()
data = data.dropna()

prices = data[['h_price', 'l_price']]
features = data.drop(['l_price', 'h_price', 'v_price'], axis=1)

print('chengdu vegetable dataset has {} data points with {} variables each'.format(*data.shape))

chengdu vegetable dataset has 7371 data points with 22 variables each
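
Note that `shift(1).rolling(...)` walks over rows in file order. The `data.head()` output in section 1.2 shows consecutive rows holding different vegetables recorded on the same date, so each window appears to mix several vegetables rather than cover earlier days of a single one. If per-vegetable history is what is wanted, one possible variant groups by `v_name` first. The following is a minimal sketch on made-up data; the toy frame and the helper `lagged_mean` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Made-up prices for two vegetables, interleaved by day
# (the layout is an assumption based on the data preview).
toy = pd.DataFrame({
    'v_name': ['a', 'b'] * 4,
    'updateTime': ['2016-03-01', '2016-03-01', '2016-03-02', '2016-03-02',
                   '2016-03-03', '2016-03-03', '2016-03-04', '2016-03-04'],
    'h_price': [1.0, 10.0, 2.0, 20.0, 3.0, 30.0, 4.0, 40.0],
})


def lagged_mean(df, column, window):
    """Mean of the previous `window` rows of `column`, computed per vegetable."""
    return df.groupby('v_name')[column].transform(
        lambda s: s.shift(1).rolling(window=window).mean())


# Sort by date so that "previous rows" means "previous days" within each group.
toy = toy.sort_values('updateTime', kind='stable')
toy['h_3'] = lagged_mean(toy, 'h_price', 3)
print(toy)
```

With this grouping, the window for vegetable `a` on the last day only sees the earlier prices of `a`, and the first three rows of each vegetable are left as `NaN`.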


### 1.2 Exploring the Data

The dataset contains the following 22 variables:

In [30]:
data.head()

| | v_name | v_price | h_price | l_price | v_market | area | source | updateTime | insertTime | yWendu | ... | fengli | fengxiang | aqi | aqiLevel | aqiInfo | cpi | h_3 | h_9 | l_3 | l_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 生姜 | 7.2 | 8.0 | 6.6 | 四川成都龙泉聚和(国际)果蔬菜交易中心 | cd | vegnet.com.cn | 2016-03-01 | 2018-04-19 | 6 | ... | 微风 | 南风 | 137 | 3 | 轻度污染 | 102.3 | 3.666667 | 3.377778 | 2.9 | 2.666667 |
| 10 | 大蒜 | 8.4 | 8.8 | 7.8 | 四川成都龙泉聚和(国际)果蔬菜交易中心 | cd | vegnet.com.cn | 2016-03-01 | 2018-04-19 | 6 | ... | 微风 | 南风 | 137 | 3 | 轻度污染 | 102.3 | 5.6 | 4.066667 | 4.533333 | 3.266667 |
| 11 | 芹菜 | 4.8 | 5.2 | 4.5 | 四川成都龙泉聚和(国际)果蔬菜交易中心 | cd | vegnet.com.cn | 2016-03-01 | 2018-04-19 | 6 | ... | 微风 | 南风 | 137 | 3 | 轻度污染 | 102.3 | 7.6 | 4.855556 | 6.466667 | 4.0 |
| 12 | 莴笋 | 2.4 | 3.0 | 2.1 | 四川成都龙泉聚和(国际)果蔬菜交易中心 | cd | vegnet.com.cn | 2016-03-01 | 2018-04-19 | 6 | ... | 微风 | 南风 | 137 | 3 | 轻度污染 | 102.3 | 7.333333 | 4.833333 | 6.3 | 4.033333 |
| 13 | 蒜薹 | 9.5 | 10.0 | 8.0 | 四川成都龙泉聚和(国际)果蔬菜交易中心 | cd | vegnet.com.cn | 2016-03-01 | 2018-04-19 | 6 | ... | 微风 | 南风 | 137 | 3 | 轻度污染 | 102.3 | 5.666667 | 4.555556 | 4.8 | 3.733333 |


In [31]:
data.describe()

| | v_price | h_price | l_price | yWendu | bWendu | aqi | aqiLevel | cpi | h_3 | h_9 | l_3 | l_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 | 7371.0 |
| mean | 4.278974 | 4.882214 | 3.843664 | 12.125085 | 19.888211 | 90.249084 | 2.277574 | 101.695387 | 4.881142 | 4.880273 | 3.843135 | 3.842783 |
| std | 2.945466 | 3.31187 | 2.756567 | 7.613284 | 7.914107 | 47.855732 | 0.961405 | 0.563581 | 1.998666 | 1.037054 | 1.626428 | 0.746328 |
| min | 0.8 | 0.9 | 0.7 | -4.0 | 4.0 | 25.0 | 1.0 | 100.8 | 1.166667 | 2.1 | 0.9 | 1.666667 |
| 25% | 2.0 | 2.3 | 1.8 | 6.0 | 13.0 | 54.0 | 2.0 | 101.3 | 3.333333 | 4.277778 | 2.533333 | 3.422222 |
| 50% | 3.6 | 4.0 | 3.2 | 12.0 | 19.0 | 79.0 | 2.0 | 101.4 | 4.5 | 4.8 | 3.533333 | 3.888889 |
| 75% | 5.0 | 6.0 | 4.5 | 19.0 | 27.0 | 117.0 | 3.0 | 102.3 | 6.133333 | 5.511111 | 4.7 | 4.3 |
| max | 16.5 | 20.0 | 16.0 | 26.0 | 36.0 | 318.0 | 6.0 | 102.6 | 12.333333 | 9.7 | 8.366667 | 6.811111 |


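The table above summarizes the numeric variables in the dataset, showing basic descriptive statistics such as the mean, standard deviation, median, and extremes.

### 1.3 Splitting and Shuffling the Data
Next, the vegetable names are converted to a one-hot encoding, and the data is shuffled and split into training and test subsets. Shuffling removes any bias caused by the order of the records. 80% of the data is used for training and 20% for testing.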

In [32]:
from sklearn.model_selection import train_test_split
from pandas import get_dummies

features_encoded = get_dummies(features, columns=['v_name'])
X_train, X_test, y_train, y_test = train_test_split(features_encoded, prices, test_size=0.2, random_state=42)

print('dataset train: {} {}, test: {} {}'.format(
    X_train.shape, y_train.shape, X_test.shape, y_test.shape))

dataset train: (5896, 41) (5896, 2), test: (1475, 41) (1475, 2)
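
The subset sizes above follow from how `train_test_split` rounds a fractional `test_size`. As a small check, the sketch below splits a stand-in array of 7371 row numbers (not the real records) with the same `test_size` and `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 7371 records: just their row numbers.
rows = np.arange(7371)

# Same settings as in the notebook: 20% test, fixed seed, shuffling on by default.
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))
```

The test subset is rounded up to 1475 rows and the remaining 5896 rows go to training, which matches the shapes printed above.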


After one-hot encoding, the number of variables grows to 41, with the vegetable name now represented by a series of numeric columns.
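
To illustrate what `get_dummies` does to a categorical column, here is a minimal sketch on a made-up three-row frame. The category labels `ginger` and `garlic` are placeholders chosen for the example, not values from the dataset.

```python
import pandas as pd

# Made-up frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    'v_name': ['ginger', 'garlic', 'ginger'],
    'yWendu': [6, 6, 7],
})

# Each distinct value of v_name becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['v_name'])
print(encoded)
```

The original `v_name` column disappears and is replaced by one indicator column per distinct name, which is why the column count grows with the number of vegetables.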

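## 2. Training the Models

### 2.1 Linear Regression Model

Based on the exploratory analysis of the data above, we reach the following conclusions:

1. The highest price is correlated with the "lowest price", "daytime temperature", "nighttime temperature", "3-day average of the highest price", "9-day average of the highest price", "3-day average of the lowest price", and "9-day average of the lowest price";
2. The lowest price is correlated with the "highest price", "3-day average of the highest price", "9-day average of the highest price", "3-day average of the lowest price", and "9-day average of the lowest price".

The natural first choice is therefore a multiple linear regression model. To start, we drop the unrelated variables from the split datasets and keep only the ones that are needed.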

In [34]:
# Columns that are not used as predictors in the linear model
unused_columns = [
    'v_market',
    'area',
    'source',
    'updateTime',
    'insertTime',
    'tianqi',
    'fengli',
    'fengxiang',
    'aqi',
    'aqiInfo',
    'aqiLevel',
    'cpi',
]

X_train_lm = X_train.drop(unused_columns, axis=1)
X_test_lm = X_test.drop(unused_columns, axis=1)

X_train_lm.head()

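| | yWendu | bWendu | h_3 | h_9 | l_3 | l_9 | v_name_冬瓜 | v_name_南瓜 | v_name_土豆 | v_name_大白菜 | ... | v_name_茄子 | v_name_莴笋 | v_name_葱头 | v_name_蒜薹 | v_name_西红柿 | v_name_金针菇 | v_name_青椒 | v_name_韭菜 | v_name_香菇 | v_name_黄瓜 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6241 | 0 | 12 | 3.666667 | 4.611111 | 2.166667 | 2.6 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2532 | 14 | 23 | 2.333333 | 2.866667 | 1.933333 | 2.4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5247 | 14 | 20 | 4.766667 | 5.988889 | 3.6 | 4.288889 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5727 | 8 | 13 | 4.166667 | 5.355556 | 3.2 | 4.177778 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3299 | 8 | 13 | 2.1 | 3.8 | 1.7 | 3.088889 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |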


Next, we use grid search with K-fold cross-validation to build a price predictor based on the linear regression model.

In [35]:
from sklearn.metrics import r2_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def r2_score_metric(y_true, y_pred):
    score = r2_score(y_true, y_pred)
    return score

def fit_model(X, y, regressor, params):
    cross_validator = KFold(n_splits=5, shuffle=True, random_state=42)
    scoring_func = make_scorer(r2_score_metric)
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_func, cv=cross_validator)
    grid = grid.fit(X, y)
    return grid.best_estimator_

# Note: the `normalize` option was removed from LinearRegression in
# scikit-learn 1.2; on newer versions, drop it from the grid.
linear_regressor = fit_model(X_train_lm, y_train, LinearRegression(), {
        'fit_intercept': [True, False],
        'normalize': [True, False],
    })

print('parameters for the optimal model {}'.format(linear_regressor.get_params()))

parameters for the optimal model {'copy_X': True, 'normalize': False, 'n_jobs': 1, 'fit_intercept': True}
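
The `fit_model` helper wraps `GridSearchCV`: each parameter combination is scored with 5-fold cross-validation on the training data, and the best combination is refit on all of it. The sketch below reproduces that pattern on synthetic data so it can be run without the CSV file. The data from `make_regression` and the single-parameter grid are assumptions for illustration, and `scoring='r2'` is used in place of the `make_scorer(r2_score_metric)` wrapper.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in: 5 features, 2 targets, and a non-zero intercept (bias).
X, y = make_regression(n_samples=200, n_features=5, n_targets=2,
                       bias=5.0, noise=1.0, random_state=0)

# Same cross-validation scheme as fit_model: 5 shuffled folds, fixed seed.
cross_validator = KFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(estimator=LinearRegression(),
                    param_grid={'fit_intercept': [True, False]},
                    scoring='r2',
                    cv=cross_validator)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))  # mean R^2 over the 5 validation folds
```

Because the synthetic targets have a non-zero offset, the search should prefer `fit_intercept=True`, mirroring the choice reported for the price data.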


In [36]:
y_pred = linear_regressor.predict(X_test_lm)
r2 = r2_score_metric(y_test, y_pred)

print('optimal model has R^2 score {:,.2f} on test data'.format(r2))

optimal model has R^2 score 0.88 on test data
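
Because `y_test` has two columns (`h_price` and `l_price`), the single score above is an average: by default `r2_score` computes R² per target and takes the unweighted mean. A minimal sketch of that calculation follows. The true values are the five price pairs from the data preview, the predictions are made up, and `r2_by_hand` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import r2_score

# Columns: highest price, lowest price.
y_true = np.array([[8.0, 6.6], [8.8, 7.8], [5.2, 4.5], [3.0, 2.1], [10.0, 8.0]])
# Made-up predictions, for illustration only.
y_hat = np.array([[7.5, 6.0], [9.0, 7.5], [5.0, 5.0], [3.5, 2.5], [9.0, 8.5]])


def r2_by_hand(t, p):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


per_target = r2_score(y_true, y_hat, multioutput='raw_values')
overall = r2_score(y_true, y_hat)  # default: uniform average over targets
by_hand = [r2_by_hand(y_true[:, j], y_hat[:, j]) for j in range(2)]

print(per_target, overall)
```

Reporting `per_target` alongside the average shows whether one of the two prices is predicted noticeably better than the other.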


### 2.2 Decision Tree Model

In [37]:
features_encoded = get_dummies(features, columns=['v_name', 'tianqi'])
X_train, X_test, y_train, y_test = train_test_split(features_encoded, prices, test_size=0.2, random_state=42)

# Columns that are not used as predictors in the decision tree model
unused_columns = [
    'v_market',
    'area',
    'source',
    'updateTime',
    'insertTime',
    'fengli',
    'fengxiang',
    'aqiLevel',
    'aqiInfo',
    'cpi',
]

X_train = X_train.drop(unused_columns, axis=1)
X_test = X_test.drop(unused_columns, axis=1)

print('dataset train: {} {}, test: {} {}'.format(
    X_train.shape, y_train.shape, X_test.shape, y_test.shape))

dataset train: (5896, 68) (5896, 2), test: (1475, 68) (1475, 2)


In [38]:
from sklearn.tree import DecisionTreeRegressor

dt_regressor = fit_model(X_train, y_train, DecisionTreeRegressor(), {
        'max_depth': range(1, 11),
    })

print('parameters for the optimal model {}'.format(dt_regressor.get_params()))

parameters for the optimal model {'presort': False, 'splitter': 'best', 'min_impurity_decrease': 0.0, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'criterion': 'mse', 'random_state': None, 'min_impurity_split': None, 'max_features': None, 'max_depth': 10}


In [39]:
y_pred = dt_regressor.predict(X_test)
r2 = r2_score_metric(y_test, y_pred)

print('optimal model has R^2 score {:,.2f} on test data'.format(r2))

optimal model has R^2 score 0.90 on test data
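
The selected `max_depth` of 10 is the upper bound of the searched range `range(1, 11)`, so deeper trees were not considered. Whether they would score higher is not tested here. One way to judge this is to look at the mean cross-validated score for every depth rather than only the winner. The sketch below does this on synthetic data; the data and all variable names are assumptions and not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: a noisy sine curve of the first feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

grid = GridSearchCV(estimator=DecisionTreeRegressor(random_state=0),
                    param_grid={'max_depth': list(range(1, 11))},
                    scoring='r2',
                    cv=KFold(n_splits=5, shuffle=True, random_state=42))
grid.fit(X, y)

# Mean validation R^2 for each candidate depth.
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params['max_depth'], round(score, 3))
```

If the scores are still rising at the largest depth, widening the range is a reasonable next step; if they have flattened or started to fall, the current range is probably sufficient.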


## 3. Conclusion