Using historical disk data, apply time-series analysis to predict the amount of used disk space on the application system's servers.

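Set warning levels according to user requirements, compare the predicted values with the disk capacity, and raise warnings based on the result, so that system administrators receive customized alerts.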
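The comparison of predicted usage against capacity can be sketched as follows. The function name `warning_level`, the level names, and the threshold ratios are all hypothetical; the text only states that levels are set according to user requirements.

```python
def warning_level(predicted_used, capacity, thresholds=None):
    """Return a warning label for a predicted used-space value.

    thresholds maps a label to the minimum used/capacity ratio that
    triggers it. The defaults below are illustrative assumptions only.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if thresholds is None:
        thresholds = {"critical": 0.95, "warning": 0.85, "notice": 0.75}
    ratio = predicted_used / capacity
    # Check the strictest level first so the highest matching level wins.
    for label, limit in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if ratio >= limit:
            return label
    return "normal"


# Example with made-up numbers (both arguments in the same unit).
print(warning_level(88_000_000, 100_000_000))
```

Passing a different `thresholds` dictionary per administrator is one way to make the alerts customizable.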

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

In [2]:
dicdata = pd.read_excel('F:/pydata/Data/chapter11/demo/data/discdata.xls')

**Attribute transformation**

In [4]:
data = dicdata[dicdata['TARGET_ID'] == 184].copy()
data_group = data.groupby('COLLECTTIME')

In [5]:
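def attr_trans(x):
    # Collapse one COLLECTTIME group (two VALUE rows) into a single wide row.
    # Assumes the first row is the C: drive and the second is the D: drive.
    result = pd.Series(index=['SYS_NAME','CWXT_DB:184:C:\\',
                              'CWXT_DB:184:D:\\','COLLECTTIME'], dtype=object)
    result['SYS_NAME'] = x['SYS_NAME'].iloc[0]
    result['COLLECTTIME'] = x['COLLECTTIME'].iloc[0]
    result['CWXT_DB:184:C:\\'] = x['VALUE'].iloc[0]
    result['CWXT_DB:184:D:\\'] = x['VALUE'].iloc[1]
    return result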

In [7]:
data_processed = data_group.apply(attr_trans)
data_processed.to_excel('charpter11/discdata_processed.xlsx',
                       index=False)
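The transformation above depends on the two VALUE rows of each day appearing in a fixed order. As an alternative sketch, the long table can be pivoted on a column that names the disk. The `ENTITY` column and the toy rows below are assumptions for illustration; they are not taken from the source file.

```python
import pandas as pd

# Toy long-format data: one row per (day, disk). Values are made up.
long_df = pd.DataFrame({
    'COLLECTTIME': ['2014-10-01', '2014-10-01', '2014-10-02', '2014-10-02'],
    'ENTITY': ['D:\\', 'C:\\', 'C:\\', 'D:\\'],   # hypothetical disk-name column
    'VALUE': [200.0, 100.0, 110.0, 210.0],
})

# One row per day and one column per disk, regardless of row order.
wide = long_df.pivot(index='COLLECTTIME', columns='ENTITY', values='VALUE')
wide.columns = ['CWXT_DB:184:' + c for c in wide.columns]
print(wide)
```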

**Model construction**

(1) Stationarity test

In [9]:
data = data_processed[:len(data_processed)-5]

In [12]:
from statsmodels.tsa.stattools import adfuller as ADF

In [13]:
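diff = 0
series = data['CWXT_DB:184:D:\\']
adf = ADF(series)
# Difference repeatedly until the ADF test rejects a unit root (p < 0.05)
while adf[1] >= 0.05:
    diff = diff + 1
    series = series.diff().dropna()
    adf = ADF(series)
print('The original series is stationary after %s order(s) of differencing, p-value: %s' %(diff,adf[1]))

The original series is stationary after 1 order(s) of differencing, p-value: 4.79259126339e-07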
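If more than one round of differencing is ever needed, note that pandas `Series.diff(k)` takes the difference at lag k rather than applying differencing k times. A small sketch on a made-up quadratic series shows the distinction:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # toy data: squares of 1..5

lag2 = s.diff(2).dropna()                # x[t] - x[t-2]: lag-2 difference
second_order = s.diff().diff().dropna()  # differencing applied twice

print(lag2.tolist())          # [8.0, 12.0, 16.0]
print(second_order.tolist())  # [2.0, 2.0, 2.0]
```

For a first-order difference the two forms coincide, which is the only case reached in this notebook.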


(2) White-noise test

In [17]:
from statsmodels.stats.diagnostic import acorr_ljungbox
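# Tuple unpacking matches older statsmodels releases, where
# acorr_ljungbox returns (statistics, p-values) as arrays.
[[lb],[p]] = acorr_ljungbox(data['CWXT_DB:184:D:\\'],lags=1)
if p < 0.05:
    print('The original series is not white noise, p-value: %s' %p)
else:
    print('The original series is white noise, p-value: %s' %p)

[[lb],[p]] = acorr_ljungbox(data['CWXT_DB:184:D:\\'].diff().dropna(),lags=1)
if p < 0.05:
    print('The first-order differenced series is not white noise, p-value: %s' %p)
else:
    print('The first-order differenced series is white noise, p-value: %s' %p)

The original series is not white noise, p-value: 9.95850372977e-06
The first-order differenced series is white noise, p-value: 0.114330259776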
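To make the test statistic concrete: at lag 1 the Ljung-Box statistic is Q = n(n+2) r1^2 / (n-1), where r1 is the lag-1 sample autocorrelation, and under the white-noise hypothesis Q follows a chi-square distribution with one degree of freedom. The sketch below computes it with NumPy on synthetic data. The helper name `ljung_box_lag1` is hypothetical, and the code is meant only to illustrate the kind of quantity that `acorr_ljungbox(..., lags=1)` reports.

```python
import math
import numpy as np

def ljung_box_lag1(x):
    """Lag-1 Ljung-Box statistic and its chi-square(1) p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    r1 = np.sum(d[1:] * d[:-1]) / np.sum(d * d)  # lag-1 sample autocorrelation
    q = n * (n + 2) * r1 ** 2 / (n - 1)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(q / 2.0))
    return q, p

rng = np.random.default_rng(0)
noise = rng.normal(size=200)   # synthetic white noise
walk = np.cumsum(noise)        # synthetic random walk, strongly autocorrelated

print(ljung_box_lag1(walk))    # a very small p-value is expected here
print(ljung_box_lag1(noise))
```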


In [18]:
from statsmodels.tsa.arima_model import ARIMA

(3) Model identification

In [21]:
xdata = data['CWXT_DB:184:D:\\']
# Order selection: choose (p, q) by minimum BIC
pmax = int(len(xdata)/10)
qmax = int(len(xdata)/10)
bic_matrix = []
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        try:
            tmp.append(ARIMA(xdata,(p,1,q)).fit().bic)
        except Exception:
            tmp.append(np.nan)  # fit failed for this (p, q); ignored by idxmin
    bic_matrix.append(tmp)
bic_matrix = pd.DataFrame(bic_matrix)
p,q = bic_matrix.stack().idxmin()
print('p and q with the minimum BIC: %s,%s' %(p,q))

p and q with the minimum BIC: 1,1
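How `stack().idxmin()` recovers the (p, q) pair can be seen on a small hand-made BIC table. The numbers below are invented for illustration and are not results from the disk data; NaN marks a combination whose fit failed.

```python
import numpy as np
import pandas as pd

# Rows are p = 0..2, columns are q = 0..2 (made-up BIC values).
bic = pd.DataFrame([
    [1280.5, 1275.1, 1279.8],
    [1276.3, np.nan, 1281.0],
    [1279.9, 1282.4, np.nan],
])

# Flatten to a Series indexed by (p, q); idxmin skips NaN entries by default.
p, q = bic.stack().idxmin()
print(p, q)  # 0 1
```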


(4) Model checking

In [26]:
from statsmodels.tsa.arima_model import ARIMA
lagnum = 12  # number of residual lags to test
arima = ARIMA(xdata,(0,1,1)).fit()
xdata_pred = arima.predict(typ = 'levels')
pred_error = (xdata_pred - xdata).dropna()
from statsmodels.stats.diagnostic import acorr_ljungbox
lb, p = acorr_ljungbox(pred_error, lags=lagnum)
h = (p < 0.05).sum()  # number of lags with significant residual autocorrelation
if h > 0:
    print('Model ARIMA(0,1,1) fails the white-noise test')
else:
    print('Model ARIMA(0,1,1) passes the white-noise test')

Model ARIMA(0,1,1) passes the white-noise test


(5) Model forecasting

In [3]:
data = pd.read_excel('charpter11/discdata_processed.xlsx', 
                     index_col='COLLECTTIME')
data = data.iloc[len(data)-5:]

In [4]:
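# Requires `arima`, the model fitted in step (4). Running this cell in a
# fresh kernel without refitting raises the NameError shown below.
ydata_pred = arima.forecast(5)
ydata_pred[0]  # element 0 holds the point forecasts

NameError: name 'arima' is not defined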

In [51]:
df = pd.DataFrame()
df['实际值'] = data['CWXT_DB:184:D:\\']  # 实际值: actual values for the 5 held-out days
df['预测值'] = ydata_pred[0]  # 预测值: predicted values

In [52]:
df

COLLECTTIME,实际值,预测值
2014-11-12,87249335.55,88034300.0
2014-11-13,86986142.2,88217010.0
2014-11-14,86678240.0,88399710.0
2014-11-15,89766600.0,88582420.0
2014-11-16,89377527.25,88765130.0


**Model evaluation**

In [2]:
import pandas as pd

In [None]:
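abs_ = (df['预测值'] - df['实际值']).abs()  # absolute error per day
mae_ = abs_.mean()  # mean absolute error
rmse_ = (abs_**2).mean()**0.5  # root mean squared error
mape_ = (abs_/df['实际值']).mean()  # mean absolute percentage error, as a fraction
print('Mean absolute error: %0.4f, \nRoot mean squared error: %0.4f, \nMean absolute percentage error: %0.6f.'%(mae_,rmse_,mape_))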
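As a self-contained check, the three error measures can be recomputed directly from the five rows of the actual-versus-predicted table shown earlier. The English column names `actual` and `predicted` and the frame name `df_eval` are stand-ins chosen for this sketch.

```python
import pandas as pd

# Values copied from the actual-versus-predicted table.
df_eval = pd.DataFrame(
    {
        'actual': [87249335.55, 86986142.2, 86678240.0, 89766600.0, 89377527.25],
        'predicted': [88034300.0, 88217010.0, 88399710.0, 88582420.0, 88765130.0],
    },
    index=pd.to_datetime(['2014-11-12', '2014-11-13', '2014-11-14',
                          '2014-11-15', '2014-11-16']),
)

err = (df_eval['predicted'] - df_eval['actual']).abs()
mae = err.mean()                         # mean absolute error
rmse = (err ** 2).mean() ** 0.5          # root mean squared error
mape = (err / df_eval['actual']).mean()  # mean absolute percentage error (fraction)
print('MAE: %.1f, RMSE: %.1f, MAPE: %.2f%%' % (mae, rmse, mape * 100))
```

On these five days the mean absolute error comes to roughly 1.1 million in the units of the data, or about 1.3% of the actual values.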