**<font size=6>Spatio-Temporal Sequence Analysis</font>**

<font size=5>1. Reading the data</font>

In [488]:
import pandas as pd
import numpy as np

As a first trial, we use the January 2019 data from station 1316A.

In [489]:
stationID = "1316A"
datadate = "201901"

In [490]:
filename = "data/"+stationID+"/"+datadate+".csv"
filename

'data/1316A/201901.csv'

In [491]:
data = pd.read_csv(filename, header=0, encoding="utf-8")

<font size=5>2. Checking the data format and looking for missing values</font>

In [492]:
# View the first five rows
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4


<font color="red">Note: PM2.5_24h is the 24-hour mean of PM2.5, whereas PM2.5 is the real-time concentration.</font>

In [493]:
# Check the column data types
data.dtypes

date           int64
hour           int64
stationID     object
AQI          float64
PM2.5        float64
PM2.5_24h    float64
PM10         float64
PM10_24h     float64
SO2          float64
SO2_24h      float64
NO2          float64
NO2_24h      float64
O3           float64
O3_24h       float64
O3_8h        float64
O3_8h_24h    float64
CO           float64
CO_24h       float64
dtype: object

In [494]:
# Check the shape of the data
data.shape

(744, 18)

In [495]:
# Check for missing values
data.count()

date         744
hour         744
stationID    744
AQI          709
PM2.5        706
PM2.5_24h    740
PM10         702
PM10_24h     740
SO2          738
SO2_24h      743
NO2          733
NO2_24h      743
O3           735
O3_24h       743
O3_8h        743
O3_8h_24h    743
CO           737
CO_24h       743
dtype: int64

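Many columns have fewer than 744 non-null values, so the data does contain missing values.
**<font color=red>Missing values must be handled before the data is fed into the system.</font>**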

In [496]:
# Find the columns that contain missing values
hasNAN = data.isnull().any()
hasNAN[hasNAN==True]

AQI          True
PM2.5        True
PM2.5_24h    True
PM10         True
PM10_24h     True
SO2          True
SO2_24h      True
NO2          True
NO2_24h      True
O3           True
O3_24h       True
O3_8h        True
O3_8h_24h    True
CO           True
CO_24h       True
dtype: bool
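
As a possible shortcut that is not part of the original notebook, `isnull().sum()` reports how many values are missing in each column, which is easier to read than comparing `count()` against the row count. The frame `demo` below is made-up illustration data, not the station file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the station data
demo = pd.DataFrame({
    "AQI": [152.0, np.nan, 134.0, np.nan],
    "PM2.5": [116.0, 101.0, np.nan, 107.0],
    "hour": [0, 1, 2, 3],
})

# Number of missing values in each column
missing_counts = demo.isnull().sum()

# Keep only the columns that actually contain missing values
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

The index of `missing_counts` can serve the same purpose as the list of NaN column names collected by hand.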

<font size=5>3. 处理缺失值</font>

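My idea: first find the columns that contain NaN values, then compute the mean of each of those columns, and finally fill the gaps with that mean.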

In [497]:
# nancolslist holds the names of the columns that contain NaN values
nancolslist = list(hasNAN[hasNAN==True].index)
len(nancolslist)

15

Here round(value, ndigits) is used to round each mean to two decimal places.

In [498]:
# Compute the mean of each indicator
for indicator in nancolslist:
    print(round(data[indicator].mean(),2))

163.04
121.8
123.16
180.52
181.14
17.9
18.04
66.65
67.22
22.65
53.0
22.09
25.97
1.43
1.43


In [499]:
# Fill the missing values with the column mean
for indicator in nancolslist:
    colsmean = round(data[indicator].mean(),2)
    data[indicator] = data[indicator].fillna(colsmean)

In [500]:

# View the data after filling
data[50:70]

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
50,20190103,2,1316A,159.0,121.0,140.0,182.0,205.0,22.0,20.0,86.0,77.0,13.0,94.0,16.0,20.0,1.5,1.6
51,20190103,3,1316A,159.0,121.0,138.0,174.0,203.0,19.0,20.0,84.0,76.0,14.0,94.0,16.0,20.0,1.4,1.6
52,20190103,4,1316A,165.0,125.0,136.0,176.0,201.0,18.0,20.0,89.0,76.0,8.0,94.0,14.0,20.0,1.5,1.6
53,20190103,5,1316A,166.0,126.0,133.0,189.0,199.0,17.0,20.0,87.0,76.0,7.0,94.0,12.0,20.0,1.5,1.6
54,20190103,6,1316A,163.0,124.0,131.0,178.0,196.0,16.0,20.0,85.0,75.0,6.0,94.0,11.0,20.0,1.5,1.6
55,20190103,7,1316A,168.0,127.0,129.0,181.0,193.0,17.0,20.0,82.0,75.0,6.0,94.0,11.0,20.0,1.6,1.6
56,20190103,8,1316A,176.0,133.0,127.0,188.0,190.0,19.0,20.0,84.0,74.0,6.0,94.0,10.0,20.0,1.8,1.5
57,20190103,9,1316A,188.0,141.0,125.0,215.0,188.0,24.0,20.0,82.0,74.0,9.0,94.0,9.0,20.0,1.8,1.5
58,20190103,10,1316A,198.0,148.0,124.0,228.0,186.0,21.0,20.0,80.0,74.0,10.0,94.0,8.0,20.0,1.9,1.5
59,20190103,11,1316A,203.0,153.0,124.0,224.0,186.0,20.0,20.0,83.0,74.0,13.0,94.0,8.0,20.0,2.0,1.5


The missing values are now filled, so we can move on to the next step.
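
One caveat about mean filling: every gap receives the same monthly mean, which can show up as sudden jumps in an hourly series (the value 163.04 sitting between 344 and 360 in the AQI array further down is one such case). A common alternative, shown here only as a sketch and not as what this notebook does, is linear interpolation between the neighbouring observations. The series `aqi` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with a two-hour gap
idx = pd.date_range("2019-01-01 00:00", periods=5, freq="h")
aqi = pd.Series([100.0, np.nan, np.nan, 130.0, 140.0], index=idx)

# Mean filling: every gap receives the same value
mean_filled = aqi.fillna(round(aqi.mean(), 2))

# Linear interpolation: gaps follow the neighbouring observations
interpolated = aqi.interpolate(method="linear")

print(mean_filled.values)
print(interpolated.values)
```

Which strategy is better depends on how long the gaps are; for long outages neither is likely to be reliable.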

<font size=5>4. Handling dates</font>

Since this is a time series, the time information has to be processed.

*<font size=4 color="red">Goal: merge date and hour into a new column, time, and add it to the data</font>*

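① The current hour column is of type int and has one or two digits, so concatenating it directly could produce malformed timestamps.<br>
We therefore use map to convert each value into a zero-padded two-character string.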

In [501]:
data["hour"] = data["hour"].map(lambda x:("%02d")%x)
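
The same zero padding can also be written with pandas string methods instead of a lambda. This is only an equivalent sketch on a made-up series `hours`:

```python
import pandas as pd

hours = pd.Series([0, 1, 9, 10, 23])

# Convert to string, then left-pad with zeros to a width of 2
padded = hours.astype(str).str.zfill(2)
print(padded.tolist())  # ['00', '01', '09', '10', '23']
```

Either way the column ends up with dtype object (strings), not int.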

In [502]:
# View the modified hour column
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4


In [503]:
data.dtypes

date           int64
hour          object
stationID     object
AQI          float64
PM2.5        float64
PM2.5_24h    float64
PM10         float64
PM10_24h     float64
SO2          float64
SO2_24h      float64
NO2          float64
NO2_24h      float64
O3           float64
O3_24h       float64
O3_8h        float64
O3_8h_24h    float64
CO           float64
CO_24h       float64
dtype: object

② Concatenate date and hour to create a time column.

In [429]:
data["time"] = data["date"].map(str) + data["hour"].map(str)
data[0:25]

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,time
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,2019010100
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,2019010101
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019010102
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019010103
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,2019010104
5,20190101,5,1316A,147.0,112.0,93.0,170.0,150.0,18.0,21.0,80.0,68.0,6.0,48.0,6.0,6.0,1.8,1.4,2019010105
6,20190101,6,1316A,148.0,113.0,95.0,178.0,153.0,17.0,21.0,79.0,68.0,6.0,48.0,6.0,6.0,1.7,1.4,2019010106
7,20190101,7,1316A,156.0,119.0,96.0,182.0,155.0,17.0,21.0,82.0,68.0,5.0,48.0,6.0,6.0,1.8,1.4,2019010107
8,20190101,8,1316A,145.0,111.0,96.0,189.0,157.0,20.0,21.0,81.0,69.0,5.0,48.0,6.0,6.0,1.7,1.5,2019010108
9,20190101,9,1316A,138.0,105.0,96.0,185.0,158.0,20.0,21.0,78.0,69.0,6.0,48.0,6.0,6.0,1.7,1.5,2019010109


Convert the time column to the datetime type.

In [430]:
# Convert the column to the datetime type
data["time"]= pd.to_datetime(data["time"], format='%Y%m%d%H')
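
If the hour column is still an integer, the timestamp can also be built without any string concatenation: parse the day, then add the hour as a time offset. The frame `demo` is hypothetical and only mimics the layout of the date and hour columns.

```python
import pandas as pd

# Hypothetical rows with the same layout as the date and hour columns
demo = pd.DataFrame({"date": [20190101, 20190101, 20190131],
                     "hour": [0, 13, 23]})

# Parse the calendar day, then add the hour as a time offset
day = pd.to_datetime(demo["date"].astype(str), format="%Y%m%d")
demo["time"] = day + pd.to_timedelta(demo["hour"], unit="h")
print(demo["time"])
```

This variant avoids the padding problem entirely, because the hour never passes through a string.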

In [431]:
data.shape

(744, 19)

In [432]:
# View the time column after conversion to datetime
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,time
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,2019-01-01 00:00:00
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,2019-01-01 01:00:00
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019-01-01 02:00:00
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,2019-01-01 03:00:00
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,2019-01-01 04:00:00


Next, we add a feature column that represents the day of the week.

In [433]:
data["dayOfWeek"] = pd.DatetimeIndex(data.time).dayofweek

**<font color="red">Note: dayOfWeek is zero-based. 0 stands for Monday (not 1, as one might expect) and 6 stands for Sunday.</font>**
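
To double-check the numbering, the `.dt` accessor offers both the numeric weekday and the weekday name. The three dates below are picked only for illustration:

```python
import pandas as pd

times = pd.Series(pd.to_datetime(["2019-01-01", "2019-01-06", "2019-01-07"]))

# Numeric weekday: Monday=0 ... Sunday=6
weekday_numbers = times.dt.dayofweek.tolist()
# Human-readable weekday names
weekday_names = times.dt.day_name().tolist()

print(weekday_numbers)  # [1, 6, 0]
print(weekday_names)    # ['Tuesday', 'Sunday', 'Monday']
```

1 January 2019 was a Tuesday, which matches the value 1 in the dayOfWeek column of the tables in this notebook.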

In [434]:
# Check the shape
data.shape

(744, 20)

The dataset now has 20 columns: the original 18 plus the newly added time and dayOfWeek.

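The year, month and day are currently packed together in the datetime value. To simplify later steps, we split them into three separate columns.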

In [435]:
data["year"] = pd.DatetimeIndex(data.time).year
data["month"] = pd.DatetimeIndex(data.time).month
data["day"] = pd.DatetimeIndex(data.time).day
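
An equivalent formulation, sketched on a hypothetical two-row frame, uses the `.dt` accessor of the datetime column and adds all three columns in one `assign` call:

```python
import pandas as pd

demo = pd.DataFrame(
    {"time": pd.to_datetime(["2019-01-01 00:00", "2019-01-31 23:00"])}
)

# Derive the calendar parts in a single assign call via the .dt accessor
demo = demo.assign(
    year=demo["time"].dt.year,
    month=demo["time"].dt.month,
    day=demo["time"].dt.day,
)
print(demo)
```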

In [436]:
# Look at the data again
data.head()

Unnamed: 0,date,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,...,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,time,dayOfWeek,year,month,day
0,20190101,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,...,48.0,13.0,35.0,1.6,1.3,2019-01-01 00:00:00,1,2019,1,1
1,20190101,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,...,48.0,6.0,6.0,1.6,1.3,2019-01-01 01:00:00,1,2019,1,1
2,20190101,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,...,48.0,6.0,6.0,1.7,1.3,2019-01-01 02:00:00,1,2019,1,1
3,20190101,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,...,48.0,6.0,6.0,1.7,1.3,2019-01-01 03:00:00,1,2019,1,1
4,20190101,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,...,48.0,6.0,6.0,1.8,1.4,2019-01-01 04:00:00,1,2019,1,1


In [437]:
# Check the shape of the updated data:
data.shape

(744, 23)

As you can see, the data has grown from the original 18 columns to 23, because we added five columns: time, dayOfWeek, year, month and day.

At this point the date column is no longer very useful, so we can drop it. <font color=red>To be safe, though, let's keep a copy of the data first.</font>

In [438]:
initialData = data.copy()  # independent copy; plain assignment would only create an alias

In [439]:
# Drop the date column
data = data.drop(["date"], axis=1)

In [440]:
# Check the shape of data:
data.shape

(744, 22)

Because we removed one column, the data goes from 23 columns to 22.

In [441]:
data.head()

Unnamed: 0,hour,stationID,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,SO2,SO2_24h,NO2,...,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,time,dayOfWeek,year,month,day
0,0,1316A,152.0,116.0,85.0,184.0,135.0,21.0,21.0,79.0,...,48.0,13.0,35.0,1.6,1.3,2019-01-01 00:00:00,1,2019,1,1
1,1,1316A,133.0,101.0,87.0,178.0,139.0,21.0,21.0,79.0,...,48.0,6.0,6.0,1.6,1.3,2019-01-01 01:00:00,1,2019,1,1
2,2,1316A,134.0,102.0,88.0,162.0,142.0,23.0,21.0,79.0,...,48.0,6.0,6.0,1.7,1.3,2019-01-01 02:00:00,1,2019,1,1
3,3,1316A,140.0,107.0,90.0,172.0,145.0,21.0,22.0,78.0,...,48.0,6.0,6.0,1.7,1.3,2019-01-01 03:00:00,1,2019,1,1
4,4,1316A,143.0,109.0,92.0,171.0,148.0,19.0,21.0,77.0,...,48.0,6.0,6.0,1.8,1.4,2019-01-01 04:00:00,1,2019,1,1


The data looks much better now; only the column order still needs adjusting.

In [442]:
orderlist = ["stationID", "time", "year", "month", "day", "hour", "AQI", "PM2.5", "PM2.5_24h", "PM10", "PM10_24h","SO2","SO2_24h","NO2","NO2_24h","O3", "O3_24h", "O3_8h", "O3_8h_24h", "CO", "CO_24h", "dayOfWeek"]
data = data[orderlist]
data.head()

Unnamed: 0,stationID,time,year,month,day,hour,AQI,PM2.5,PM2.5_24h,PM10,...,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,dayOfWeek
0,1316A,2019-01-01 00:00:00,2019,1,1,0,152.0,116.0,85.0,184.0,...,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,1
1,1316A,2019-01-01 01:00:00,2019,1,1,1,133.0,101.0,87.0,178.0,...,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,1
2,1316A,2019-01-01 02:00:00,2019,1,1,2,134.0,102.0,88.0,162.0,...,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
3,1316A,2019-01-01 03:00:00,2019,1,1,3,140.0,107.0,90.0,172.0,...,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
4,1316A,2019-01-01 04:00:00,2019,1,1,4,143.0,109.0,92.0,171.0,...,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,1


Use data.set_index("time") to make time the index, and store the time-indexed data in data_indexed.

In [443]:
data_indexed = data.set_index("time")
data_indexed.head()

time,stationID,year,month,day,hour,AQI,PM2.5,PM2.5_24h,PM10,PM10_24h,...,SO2_24h,NO2,NO2_24h,O3,O3_24h,O3_8h,O3_8h_24h,CO,CO_24h,dayOfWeek
2019-01-01 00:00:00,1316A,2019,1,1,0,152.0,116.0,85.0,184.0,135.0,...,21.0,79.0,65.0,6.0,48.0,13.0,35.0,1.6,1.3,1
2019-01-01 01:00:00,1316A,2019,1,1,1,133.0,101.0,87.0,178.0,139.0,...,21.0,79.0,66.0,6.0,48.0,6.0,6.0,1.6,1.3,1
2019-01-01 02:00:00,1316A,2019,1,1,2,134.0,102.0,88.0,162.0,142.0,...,21.0,79.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
2019-01-01 03:00:00,1316A,2019,1,1,3,140.0,107.0,90.0,172.0,145.0,...,22.0,78.0,67.0,6.0,48.0,6.0,6.0,1.7,1.3,1
2019-01-01 04:00:00,1316A,2019,1,1,4,143.0,109.0,92.0,171.0,148.0,...,21.0,77.0,67.0,6.0,48.0,6.0,6.0,1.8,1.4,1


In [470]:
# Store the index of the time-indexed data in dataIndex
dataIndex = data_indexed.index
dataIndex

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00', '2019-01-01 07:00:00',
               '2019-01-01 08:00:00', '2019-01-01 09:00:00',
               ...
               '2019-01-31 14:00:00', '2019-01-31 15:00:00',
               '2019-01-31 16:00:00', '2019-01-31 17:00:00',
               '2019-01-31 18:00:00', '2019-01-31 19:00:00',
               '2019-01-31 20:00:00', '2019-01-31 21:00:00',
               '2019-01-31 22:00:00', '2019-01-31 23:00:00'],
              dtype='datetime64[ns]', name='time', length=744, freq=None)
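
A DatetimeIndex is useful beyond plotting. The sketch below, on made-up hourly values, shows two things it enables: selecting a whole day with a date string, and resampling to daily means.

```python
import numpy as np
import pandas as pd

# Hypothetical two days of hourly AQI values
idx = pd.date_range("2019-01-01", periods=48, freq="h", name="time")
demo = pd.DataFrame({"AQI": np.arange(48, dtype=float)}, index=idx)

# Partial-string indexing: all rows of one calendar day
jan2 = demo.loc["2019-01-02"]

# Resampling: daily mean of the hourly values
daily_mean = demo["AQI"].resample("D").mean()
print(len(jan2), daily_mean.tolist())
```

Applied to the station data, the same two calls would work on data_indexed, assuming its index is the DatetimeIndex shown above.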

<font size=5>5. Plotting this month's AQI trend</font>

In [445]:
# View the AQI values
data_indexed.AQI.values

array([152.  , 133.  , 134.  , 140.  , 143.  , 147.  , 148.  , 156.  ,
       145.  , 138.  , 137.  , 133.  , 125.  , 124.  , 129.  , 129.  ,
       130.  , 137.  , 152.  , 168.  , 169.  , 170.  , 175.  , 185.  ,
       193.  , 203.  , 210.  , 217.  , 228.  , 236.  , 230.  , 226.  ,
       232.  , 233.  , 225.  , 210.  , 175.  , 155.  , 150.  , 156.  ,
       166.  , 160.  , 152.  , 156.  , 150.  , 142.  , 142.  , 148.  ,
       153.  , 160.  , 159.  , 159.  , 165.  , 166.  , 163.  , 168.  ,
       176.  , 188.  , 198.  , 203.  , 203.  , 200.  , 198.  , 133.  ,
       189.  , 189.  , 211.  , 237.  , 240.  , 225.  , 218.  , 222.  ,
       234.  , 248.  , 250.  , 254.  , 259.  , 255.  , 253.  , 256.  ,
       245.  , 283.  , 368.  , 395.  , 395.  , 395.  , 395.  , 395.  ,
       395.  , 395.  , 395.  , 206.  , 361.  , 352.  , 344.  , 163.04,
       163.04, 360.  , 368.  , 341.  , 306.  , 232.  , 182.  , 155.  ,
       133.  , 113.  ,  97.  ,  83.  ,  73.  ,  66.  ,  66.  ,  66.  ,
      

In [446]:
import matplotlib.pyplot as plt

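**<font color="red">Note: the data argument of drawAQI is a pandas object (here the Series data_indexed.AQI) that carries its own time index, so passing data alone to plt.plot is enough: the index supplies the x values and the column supplies the y values.<br>
    dataIndex is that index, of type DatetimeIndex.</font>**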

In [504]:
def drawAQI(data, dataIndex, msg):
    # data: pandas Series indexed by time; dataIndex: its DatetimeIndex; msg: title suffix
    plt.figure(figsize=(10,3))
    # One x-axis tick per day across the plotted range
    xticks = pd.date_range(start=dataIndex.min(), end=dataIndex.max(), freq="D")
    plt.xticks(xticks, xticks.strftime("%Y/%m/%d"), rotation=75, ha="left")
    plt.plot(data, linewidth=1)
    plt.xlabel("Time")
    plt.ylabel("scaler quantity")
    plt.title("the trend of AQI with " + msg)

In [505]:
%matplotlib

Using matplotlib backend: Qt5Agg


In [506]:
# Plot the original AQI data
drawAQI(data_indexed.AQI, dataIndex, "(initial data)")
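
As an alternative to computing the tick positions by hand, matplotlib's date locators and formatters can place one tick per day. This is only a sketch using the object-oriented API on a hypothetical three-day series; the Agg backend is chosen here so it runs without a display and is not what the notebook uses.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical hourly series covering three days
idx = pd.date_range("2019-01-01", periods=72, freq="h")
series = pd.Series(np.linspace(100, 200, 72), index=idx)

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(series.index, series.values, linewidth=1)

# One major tick per day, formatted like the tick labels in drawAQI
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y/%m/%d"))
fig.autofmt_xdate(rotation=75)

ax.set_xlabel("Time")
ax.set_ylabel("AQI")
```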

<font size=5>6. Converting the DataFrame data to an ndarray</font>

To standardize the AQI values with a library function below, we first convert the AQI column of the DataFrame into an ndarray.

In [450]:
AQIarray = data_indexed.AQI.values

In [451]:
AQIarray

array([152.  , 133.  , 134.  , 140.  , 143.  , 147.  , 148.  , 156.  ,
       145.  , 138.  , 137.  , 133.  , 125.  , 124.  , 129.  , 129.  ,
       130.  , 137.  , 152.  , 168.  , 169.  , 170.  , 175.  , 185.  ,
       193.  , 203.  , 210.  , 217.  , 228.  , 236.  , 230.  , 226.  ,
       232.  , 233.  , 225.  , 210.  , 175.  , 155.  , 150.  , 156.  ,
       166.  , 160.  , 152.  , 156.  , 150.  , 142.  , 142.  , 148.  ,
       153.  , 160.  , 159.  , 159.  , 165.  , 166.  , 163.  , 168.  ,
       176.  , 188.  , 198.  , 203.  , 203.  , 200.  , 198.  , 133.  ,
       189.  , 189.  , 211.  , 237.  , 240.  , 225.  , 218.  , 222.  ,
       234.  , 248.  , 250.  , 254.  , 259.  , 255.  , 253.  , 256.  ,
       245.  , 283.  , 368.  , 395.  , 395.  , 395.  , 395.  , 395.  ,
       395.  , 395.  , 395.  , 206.  , 361.  , 352.  , 344.  , 163.04,
       163.04, 360.  , 368.  , 341.  , 306.  , 232.  , 182.  , 155.  ,
       133.  , 113.  ,  97.  ,  83.  ,  73.  ,  66.  ,  66.  ,  66.  ,
      

In [452]:
AQIarray.shape

(744,)

In [453]:
# Reshape AQIarray into a column array
AQIarray = AQIarray.reshape(-1,1)
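
The reshape matters because scikit-learn estimators expect a two-dimensional input of shape (n_samples, n_features). A small sketch with three made-up values shows what `reshape(-1, 1)` does to the shape:

```python
import numpy as np

values = np.array([152.0, 133.0, 134.0])  # shape (3,): one-dimensional
column = values.reshape(-1, 1)            # shape (3, 1): 3 samples, 1 feature

print(values.shape, column.shape)  # (3,) (3, 1)

# ravel() goes back to one dimension, e.g. for plotting
flat = column.ravel()
print(flat.shape)                  # (3,)
```

The -1 tells NumPy to infer that dimension from the array length, so the same call works for the 744 hourly values.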

<font size=5>7. Trying to standardize the data</font>

In [476]:
from sklearn.preprocessing import StandardScaler

In [477]:
scaler = StandardScaler().fit(AQIarray)
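
With its default settings, StandardScaler learns the mean and the population standard deviation of each column during fit, and transform then computes z = (x - mean) / std. The NumPy-only sketch below reproduces that arithmetic by hand on three made-up values; it is meant as an illustration of the formula, not as a replacement for the scaler.

```python
import numpy as np

x = np.array([[100.0], [150.0], [200.0]])  # hypothetical column of AQI values

mean = x.mean(axis=0)
std = x.std(axis=0)       # population standard deviation (ddof=0)
z = (x - mean) / std      # standardized values: mean 0, standard deviation 1

# Undo the scaling to get back to the original units
restored = z * std + mean
print(z.ravel(), restored.ravel())
```

The last step corresponds to what `scaler.inverse_transform` does, which is handy when standardized predictions need to be reported as AQI values again.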

In [478]:
AQIarray_scalered = scaler.transform(AQIarray)
AQIarray_scalered

array([[-0.14],
       [-0.38],
       [-0.37],
       [-0.29],
       [-0.25],
       [-0.2 ],
       [-0.19],
       [-0.09],
       [-0.23],
       [-0.32],
       [-0.33],
       [-0.38],
       [-0.48],
       [-0.49],
       [-0.43],
       [-0.43],
       [-0.42],
       [-0.33],
       [-0.14],
       [ 0.06],
       [ 0.08],
       [ 0.09],
       [ 0.15],
       [ 0.28],
       [ 0.38],
       [ 0.51],
       [ 0.59],
       [ 0.68],
       [ 0.82],
       [ 0.92],
       [ 0.85],
       [ 0.8 ],
       [ 0.87],
       [ 0.89],
       [ 0.78],
       [ 0.59],
       [ 0.15],
       [-0.1 ],
       [-0.17],
       [-0.09],
       [ 0.04],
       [-0.04],
       [-0.14],
       [-0.09],
       [-0.17],
       [-0.27],
       [-0.27],
       [-0.19],
       [-0.13],
       [-0.04],
       [-0.05],
       [-0.05],
       [ 0.02],
       [ 0.04],
       [-0.  ],
       [ 0.06],
       [ 0.16],
       [ 0.32],
       [ 0.44],
       [ 0.51],
       [ 0.51],
       [ 0.47],
       [

The output used scientific notation, so we adjust how ndarrays are displayed:

In [479]:
import sys
np.set_printoptions(precision=2, suppress=True, threshold=sys.maxsize)  # threshold=np.nan raises an error in newer NumPy versions
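
Note that `np.set_printoptions` changes the display for the rest of the session. If only a single output needs the compact format, the `np.printoptions` context manager (available since NumPy 1.15) limits the change to one block. The array `arr` is made up for illustration.

```python
import numpy as np

arr = np.array([0.000012, 0.5, 123.456])

# The options apply only inside the with-block and are restored afterwards
with np.printoptions(precision=2, suppress=True):
    text = str(arr)

print(text)
```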

In [480]:
# View the result again
AQIarray_scalered

array([[-0.14],
       [-0.38],
       [-0.37],
       [-0.29],
       [-0.25],
       [-0.2 ],
       [-0.19],
       [-0.09],
       [-0.23],
       [-0.32],
       [-0.33],
       [-0.38],
       [-0.48],
       [-0.49],
       [-0.43],
       [-0.43],
       [-0.42],
       [-0.33],
       [-0.14],
       [ 0.06],
       [ 0.08],
       [ 0.09],
       [ 0.15],
       [ 0.28],
       [ 0.38],
       [ 0.51],
       [ 0.59],
       [ 0.68],
       [ 0.82],
       [ 0.92],
       [ 0.85],
       [ 0.8 ],
       [ 0.87],
       [ 0.89],
       [ 0.78],
       [ 0.59],
       [ 0.15],
       [-0.1 ],
       [-0.17],
       [-0.09],
       [ 0.04],
       [-0.04],
       [-0.14],
       [-0.09],
       [-0.17],
       [-0.27],
       [-0.27],
       [-0.19],
       [-0.13],
       [-0.04],
       [-0.05],
       [-0.05],
       [ 0.02],
       [ 0.04],
       [-0.  ],
       [ 0.06],
       [ 0.16],
       [ 0.32],
       [ 0.44],
       [ 0.51],
       [ 0.51],
       [ 0.47],
       [

<font size=5>8. Plotting the AQI trend from the standardized data</font>

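drawScaleredAQI is a modified version of the earlier drawAQI function. The data passed in is now an ndarray, which carries no time index of its own, so the previously saved dataIndex is passed to plt.plot explicitly as the x-axis values.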

In [484]:
dataIndex

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
               '2019-01-01 02:00:00', '2019-01-01 03:00:00',
               '2019-01-01 04:00:00', '2019-01-01 05:00:00',
               '2019-01-01 06:00:00', '2019-01-01 07:00:00',
               '2019-01-01 08:00:00', '2019-01-01 09:00:00',
               ...
               '2019-01-31 14:00:00', '2019-01-31 15:00:00',
               '2019-01-31 16:00:00', '2019-01-31 17:00:00',
               '2019-01-31 18:00:00', '2019-01-31 19:00:00',
               '2019-01-31 20:00:00', '2019-01-31 21:00:00',
               '2019-01-31 22:00:00', '2019-01-31 23:00:00'],
              dtype='datetime64[ns]', name='time', length=744, freq=None)

In [507]:
def drawScaleredAQI(data, dataIndex, msg):
    # data: ndarray of standardized values; dataIndex: DatetimeIndex used as x values
    plt.figure(figsize=(10,3))
    # One x-axis tick per day across the plotted range
    xticks = pd.date_range(start=dataIndex.min(), end=dataIndex.max(), freq="D")
    plt.xticks(xticks, xticks.strftime("%Y/%m/%d"), rotation=75, ha="left")
    plt.plot(dataIndex, data, linewidth=1)
    plt.xlabel("Time")
    plt.ylabel("scaler quantity")
    plt.title("the trend of AQI with " + msg)

In [509]:
drawScaleredAQI(AQIarray_scalered, dataIndex, "(Scalered Data)")