<font size=5>1. Reading the data</font>

In [2]:
import pandas as pd
import numpy as np

In [3]:
stationID = "1316A"
datadate = "201901"

In [4]:
filename = "data/"+stationID+"/"+datadate+".csv"
filename

'data/1316A/201901.csv'

In [5]:
data = pd.read_csv(filename, header=0, encoding="utf-8")

<font size=5>2. Inspecting the data format and checking for missing values</font>

In [6]:
# Show the first five rows
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4


In [7]:
# Check the column dtypes
data.dtypes

date           int64
hour           int64
stationID     object
AQI          float64
PM2.5        float64
PM2.5_24h    float64
PM10         float64
PM10_24h     float64
SO2          float64
SO2_24h      float64
NO2          float64
NO2_24h      float64
O3           float64
O3_24h       float64
O3_8h        float64
O3_8h_24h    float64
CO           float64
CO_24h       float64
dtype: object

In [8]:
# Check the shape of the data
data.shape

(744, 18)

In [9]:
# Check for missing values: count() returns the number of non-null entries per column
data.count()

date         744
hour         744
stationID    744
AQI          709
PM2.5        706
PM2.5_24h    740
PM10         702
PM10_24h     740
SO2          738
SO2_24h      743
NO2          733
NO2_24h      743
O3           735
O3_24h       743
O3_8h        743
O3_8h_24h    743
CO           737
CO_24h       743
dtype: int64

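The counts show that some values are missing: several columns have fewer than 744 non-null entries.
**<font color=red>Missing values must be handled before the data is fed into any system.</font>**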

In [10]:
# Find the columns that contain missing values
hasNAN = data.isnull().any()
hasNAN[hasNAN]

AQI          True
PM2.5        True
PM2.5_24h    True
PM10         True
PM10_24h     True
SO2          True
SO2_24h      True
NO2          True
NO2_24h      True
O3           True
O3_24h       True
O3_8h        True
O3_8h_24h    True
CO           True
CO_24h       True
dtype: bool
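
A complementary view is the number of missing values per column rather than a True/False flag. The sketch below uses a small made-up frame (the `toy` DataFrame and its values are illustrative, not taken from the station data) and relies only on `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the station data (values are made up)
toy = pd.DataFrame({
    "AQI": [152.0, np.nan, 134.0, np.nan],
    "PM2.5": [116.0, 101.0, np.nan, 107.0],
    "hour": [0, 1, 2, 3],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Keep only the columns that actually have gaps
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Applied to `data`, the same idea would report 744 - 709 = 35 missing AQI values, consistent with the `count()` output above.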

<font size=5>3. Handling missing values</font>

The plan: find the columns that contain NaN values, compute the mean of each of those columns, and fill the gaps with that mean.

In [11]:
# nancolslist holds the names of the columns that contain NaN values
nancolslist = list(hasNAN[hasNAN].index)
len(nancolslist)

15

In [115]:
# Compute the mean of each indicator
for indicator in nancolslist:
    print(data[indicator].mean())

163.04090267983074
121.79603399433428
123.16216216216216
180.5242165242165
181.13783783783785
17.90379403794038
18.04306864064603
66.65484311050477
67.21668909825034
22.64625850340136
53.00134589502019
22.08613728129206
25.970390309555853
1.4303934871099038
1.4347240915208608


In [116]:
# Fill the missing values with the column mean
for indicator in nancolslist:
    colsmean = data[indicator].mean()
    data[indicator] = data[indicator].fillna(colsmean)
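
The loop can also be written as a single `fillna` call, since `fillna` accepts a Series of per-column fill values. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration, not the station data):

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are made up)
toy = pd.DataFrame({
    "AQI": [150.0, np.nan, 130.0],
    "CO": [1.6, 1.4, np.nan],
})

# mean() skips NaN by default; each column's NaN is replaced by that column's mean
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

Keep in mind that a column mean ignores the time ordering of the rows. For hourly series, interpolating between neighbouring hours (for example with `Series.interpolate()`) is a common alternative, although it is not what this notebook does.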

In [117]:
# Inspect the data after filling
data[50:100]

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
50,20190103,2,1316A,159.0,121.0,140.0,182.0,205.0,22.0,20.0,86.0,77.0,13.0,94.0,16.0,20.0,1.5,1.6
51,20190103,3,1316A,159.0,121.0,138.0,174.0,203.0,19.0,20.0,84.0,76.0,14.0,94.0,16.0,20.0,1.4,1.6
52,20190103,4,1316A,165.0,125.0,136.0,176.0,201.0,18.0,20.0,89.0,76.0,8.0,94.0,14.0,20.0,1.5,1.6
53,20190103,5,1316A,166.0,126.0,133.0,189.0,199.0,17.0,20.0,87.0,76.0,7.0,94.0,12.0,20.0,1.5,1.6
54,20190103,6,1316A,163.0,124.0,131.0,178.0,196.0,16.0,20.0,85.0,75.0,6.0,94.0,11.0,20.0,1.5,1.6
55,20190103,7,1316A,168.0,127.0,129.0,181.0,193.0,17.0,20.0,82.0,75.0,6.0,94.0,11.0,20.0,1.6,1.6
56,20190103,8,1316A,176.0,133.0,127.0,188.0,190.0,19.0,20.0,84.0,74.0,6.0,94.0,10.0,20.0,1.8,1.5
57,20190103,9,1316A,188.0,141.0,125.0,215.0,188.0,24.0,20.0,82.0,74.0,9.0,94.0,9.0,20.0,1.8,1.5
58,20190103,10,1316A,198.0,148.0,124.0,228.0,186.0,21.0,20.0,80.0,74.0,10.0,94.0,8.0,20.0,1.9,1.5
59,20190103,11,1316A,203.0,153.0,124.0,224.0,186.0,20.0,20.0,83.0,74.0,13.0,94.0,8.0,20.0,2.0,1.5


The missing values are now filled, and we can move on to the next step.

<font size=5>4. Date handling</font>

Since this is a time series, the time information needs to be processed.

In [118]:
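# Check the type of the date column (type() reports the container, a Series; the element dtype is still int64 here)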
type(data.date)

pandas.core.series.Series

In [119]:
# Convert the date column to datetime
data["date"]= pd.to_datetime(data["date"], format='%Y%m%d')

In [120]:
data.shape

(744, 18)

In [121]:
# Look at the date column after conversion to datetime
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
0,2019-01-01,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3
1,2019-01-01,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3
2,2019-01-01,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
3,2019-01-01,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
4,2019-01-01,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4


In [122]:
# Check the dtypes again
data.dtypes

date         datetime64[ns]
hour                  int64
stationID            object
AQI                 float64
PM2.5               float64
PM2.5_24h           float64
PM10                float64
PM10_24h            float64
SO2                 float64
SO2_24h             float64
NO2                 float64
NO2_24h             float64
O3                  float64
O3_24h              float64
O3_8h               float64
O3_8h_24h           float64
CO                  float64
CO_24h              float64
dtype: object

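The date column has indeed been converted to the datetime type.

The year, month, and day are currently packed together in the datetime value. To make later operations easier, we split them into three separate columns.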

In [123]:
data["year"] = pd.DatetimeIndex(data.date).year
data["month"] = pd.DatetimeIndex(data.date).month
data["day"] = pd.DatetimeIndex(data.date).day

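People's routines on weekends may differ from those on working days, so we add a separate dayOfWeek column that records which day of the week each row falls on.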

In [124]:
data["dayOfWeek"] = pd.DatetimeIndex(data.date).dayofweek

**<font color=red size=3>Note: dayOfWeek starts at 0, which stands for Monday (not 1, as one might expect), and ends at 6, which stands for Sunday.</font>**
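
To make the numbering concrete, the sketch below computes the day of week for a few January 2019 dates using the `.dt` accessor, which exposes the same fields as `pd.DatetimeIndex`. The `isWeekend` column is a hypothetical addition for illustration and is not part of this notebook's feature set:

```python
import pandas as pd

# Tuesday, Saturday, Sunday, Monday in January 2019
dates = pd.Series(
    pd.to_datetime(["20190101", "20190105", "20190106", "20190107"], format="%Y%m%d")
)

# Monday=0 ... Sunday=6
day_of_week = dates.dt.dayofweek

# Hypothetical weekend flag: Saturday (5) or Sunday (6)
is_weekend = day_of_week >= 5

print(pd.DataFrame({"date": dates, "dayOfWeek": day_of_week, "isWeekend": is_weekend}))
```

2019-01-01 was a Tuesday, which is why the first rows of the data show a dayOfWeek of 1.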

In [125]:
# Look at the current data again
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,...,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,year,month,day,dayOfWeek
0,2019-01-01,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,...,6.0,48.0,13.0,35.0,1.6,1.3,2019,1,1,1
1,2019-01-01,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,...,6.0,48.0,6.0,6.0,1.6,1.3,2019,1,1,1
2,2019-01-01,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,...,6.0,48.0,6.0,6.0,1.7,1.3,2019,1,1,1
3,2019-01-01,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,...,6.0,48.0,6.0,6.0,1.7,1.3,2019,1,1,1
4,2019-01-01,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,...,6.0,48.0,6.0,6.0,1.8,1.4,2019,1,1,1


In [126]:
# Check the shape of the updated data:
data.shape

(744, 22)

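The frame has grown from 18 columns to 22, because we added four new ones: year, month, day, and dayOfWeek.

The date and stationID columns are no longer very useful at this point, so we can drop them. <font color=red>To be safe, though, let's keep a backup first.</font>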

In [127]:
# Use copy() so the backup is an independent object, not a second name for data
initialData = data.copy()
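
When keeping a backup, note that plain assignment (`b = a`) only binds a second name to the same DataFrame, whereas `.copy()` creates an independent one. Since `drop` in the next cell returns a new frame, either form preserves the 22-column data here, but in-place edits would behave differently. A small sketch on a toy frame (names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"AQI": [152.0, 133.0]})

alias = df            # second name for the same object
backup = df.copy()    # independent copy of the data

# An in-place change through df is visible via alias but not via backup
df.loc[0, "AQI"] = -1.0

print(alias.loc[0, "AQI"])
print(backup.loc[0, "AQI"])
```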

In [128]:
# Drop the date and stationID columns
data = data.drop(["date", "stationID"], axis=1)

In [129]:
# Check the shape of data:
data.shape

(744, 20)

Since we dropped two columns, the data went from 22 columns to 20.

In [130]:
# Look at the data again
data.head()

Unnamed: 0,hour,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,year,month,day,dayOfWeek
0,0,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,2019,1,1,1
1,1,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,2019,1,1,1
2,2,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019,1,1,1
3,3,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019,1,1,1
4,4,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,2019,1,1,1


The data looks much better now; only the column order still needs adjusting.

In [132]:
orderlist = ["year", "month", "day", "hour", "AQI", "PM2.5", "PM2.5_24h", "PM10", "PM10_24h","SO2","SO2_24h","NO2","NO2_24h","O3", "O3_24h", "O3_8h", "O3_8h_24h", "CO", "CO_24h", "dayOfWeek"]
data = data[orderlist]
data.head()

Unnamed: 0,year,month,day,hour,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,dayOfWeek
0,2019,1,1,0,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,1
1,2019,1,1,1,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,1
2,2019,1,1,2,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
3,2019,1,1,3,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
4,2019,1,1,4,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,1


The data is now much tidier.

In [133]:
# Check the dtypes once more
data.dtypes

year           int64
month          int64
day            int64
hour           int64
AQI          float64
PM2.5        float64
PM2.5_24h    float64
PM10         float64
PM10_24h     float64
SO2          float64
SO2_24h      float64
NO2          float64
NO2_24h      float64
O3           float64
O3_24h       float64
O3_8h        float64
O3_8h_24h    float64
CO           float64
CO_24h       float64
dayOfWeek      int64
dtype: object

<font size=5>5. Converting the DataFrame to ndarray</font>

<font color=red>The target value is AQI, the Air Quality Index (the figure commonly shown in phone weather apps).</font>

In [134]:
data_target = data["AQI"].values

In [135]:
data_target.shape

(744,)

<font color=red>The remaining columns serve as the feature data.</font>

In [136]:
data_features = data.drop(["AQI"], axis=1).values
data_features

array([[2.019e+03, 1.000e+00, 1.000e+00, ..., 1.600e+00, 1.300e+00,
        1.000e+00],
       [2.019e+03, 1.000e+00, 1.000e+00, ..., 1.600e+00, 1.300e+00,
        1.000e+00],
       [2.019e+03, 1.000e+00, 1.000e+00, ..., 1.700e+00, 1.300e+00,
        1.000e+00],
       ...,
       [2.019e+03, 1.000e+00, 3.100e+01, ..., 1.100e+00, 1.100e+00,
        3.000e+00],
       [2.019e+03, 1.000e+00, 3.100e+01, ..., 1.000e+00, 1.100e+00,
        3.000e+00],
       [2.019e+03, 1.000e+00, 3.100e+01, ..., 1.000e+00, 1.100e+00,
        3.000e+00]])

In [137]:
data_features.shape

(744, 19)

**<font size=6>Machine learning algorithms</font>**

In [138]:
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

We use K-fold cross-validation.

In [139]:
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

In [140]:
# Three-fold cross-validation
kfold.get_n_splits(data_features)

3
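
To see what the splitter actually produces, the sketch below runs the same `KFold` configuration on nine toy samples (the array `X` is made up for illustration). Each iteration yields a pair of index arrays, and every sample lands in exactly one test fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)   # nine toy samples
kf = KFold(n_splits=3, shuffle=True, random_state=0)

test_folds = []
for train_index, test_index in kf.split(X):
    # train_index and test_index are disjoint arrays of row positions
    test_folds.append(test_index)
    print("train:", train_index, "test:", test_index)
```

With the 744 rows of this dataset and three splits, each test fold holds 248 rows and each training fold 496.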

We use SVR (support vector regression) from the svm module to see how it performs on the data.

In [144]:
for train_index, test_index in kfold.split(data_features):
    data_train, data_test = data_features[train_index], data_features[test_index]
    target_train, target_test = data_target[train_index], data_target[test_index]
    svr = svm.SVR(kernel='rbf', C=10, gamma=0.001)
    svr.fit(data_train, target_train)
    # For a regressor, score() returns the coefficient of determination R^2
    print("R^2 on the training set: {:.2f}".format(svr.score(data_train, target_train)))
    print("R^2 on the test set: {:.2f}".format(svr.score(data_test, target_test)))

R^2 on the training set: 0.56
R^2 on the test set: 0.51
R^2 on the training set: 0.62
R^2 on the test set: 0.47
R^2 on the training set: 0.57
R^2 on the test set: 0.44
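
The manual loop reports scores fold by fold. `cross_val_score`, which is imported above but not used, produces the same kind of per-fold test scores in one call. A minimal sketch on synthetic data (the arrays `X` and `y` are made up for illustration; only the SVR settings and the KFold configuration mirror the notebook):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for data_features / data_target
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
model = svm.SVR(kernel="rbf", C=10, gamma=0.001)

# With no scoring argument, cross_val_score uses the estimator's own score(),
# which is R^2 for SVR; one value is returned per fold
scores = cross_val_score(model, X, y, cv=kfold)
print(scores, scores.mean())
```

On the real data this would be called as `cross_val_score(model, data_features, data_target, cv=kfold)`. It returns only the test-fold scores, so the training scores printed by the loop are not reproduced.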
