## Synapse.Hack
<p>
    Only the CPU data is used; the traffic data is excluded.
</p>

In [1]:
%matplotlib inline
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from datetime import datetime, timedelta

<p>
Load the training set and take a look at it:
</p>

In [2]:
df_train = pd.read_csv('train.csv')
df_train.head()

Unnamed: 0,date,service_name,cpu,recieved_bytes,transmitted_bytes
0,1598325000000000000,srv0,0.014633,39752.63,76176.21
1,1598325300000000000,srv0,0.014996,43341.84,76860.8
2,1598325600000000000,srv0,0.014508,36184.17,77465.36
3,1598325900000000000,srv0,0.015153,38901.52,77920.4
4,1598326200000000000,srv0,0.015373,42180.09,81457.19


In [3]:
df_train.describe()

Unnamed: 0,date,cpu,recieved_bytes,transmitted_bytes
count,69120.0,56051.0,56049.0,56049.0
mean,1.600917e+18,0.06665,169642.9,339279.3
std,1496503000000000.0,0.047884,140124.4,252234.0
min,1.598325e+18,0.001008,2203.431,2137.374
25%,1.599621e+18,0.024231,56548.66,114484.2
50%,1.600917e+18,0.05091,121965.2,252602.2
75%,1.602213e+18,0.106674,249349.9,560193.0
max,1.603509e+18,0.359843,1504770.0,2281394.0


<p>
Note that the number of *cpu* values (56,051) is somewhat lower than the number of *date* values (69,120). This points to missing *cpu* values, which we will replace with zeros. But first, let's drop the columns we don't need:
</p>
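
The gap between the counts can also be checked directly. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not rows from train.csv); it counts missing cells per column and then applies the same zero-fill as the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "date": [1, 2, 3, 4],
    "cpu": [0.01, np.nan, 0.03, np.nan],
})

# Number of NaN cells in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same zero-fill as in the notebook; returns a new frame.
filled = toy.fillna(value=0)
print(filled)
```

On the real data, the same call on `df_train` should report the difference between the *date* and *cpu* counts shown by `describe()`.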

In [4]:
df_train.drop(['recieved_bytes', 'transmitted_bytes'], axis=1, inplace=True)
df_train.fillna(value=0, inplace=True)

In [5]:
df_train.describe(include='all')

Unnamed: 0,date,service_name,cpu
count,69120.0,69120,69120.0
unique,,4,
top,,srv0,
freq,,17280,
mean,1.600917e+18,,0.054048
std,1496503000000000.0,,0.050403
min,1.598325e+18,,0.0
25%,1.599621e+18,,0.011536
50%,1.600917e+18,,0.036352
75%,1.602213e+18,,0.096691


In [6]:
df_train['service_name'].value_counts()

srv0    17280
srv2    17280
srv3    17280
srv1    17280
Name: service_name, dtype: int64

<p>
Convert *date* to a __human-readable__ format:
</p>

In [7]:
df_train['timestamp'] = pd.to_datetime(df_train['date'])

In [8]:
df_train.head()

Unnamed: 0,date,service_name,cpu,timestamp
0,1598325000000000000,srv0,0.014633,2020-08-25 03:10:00
1,1598325300000000000,srv0,0.014996,2020-08-25 03:15:00
2,1598325600000000000,srv0,0.014508,2020-08-25 03:20:00
3,1598325900000000000,srv0,0.015153,2020-08-25 03:25:00
4,1598326200000000000,srv0,0.015373,2020-08-25 03:30:00


We can see that measurements are taken every five minutes, i.e. we have twelve measurements per hour.
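
The five-minute spacing can be verified rather than read off the table. This is a small sketch that uses only the first five *date* values printed above; the variable names are illustrative.

```python
import pandas as pd

# The first five `date` values shown in the head() output (nanoseconds since the epoch).
dates_ns = pd.Series([
    1598325000000000000,
    1598325300000000000,
    1598325600000000000,
    1598325900000000000,
    1598326200000000000,
])

# Integers are interpreted as nanoseconds by default.
timestamps = pd.to_datetime(dates_ns)

# Gap between consecutive measurements; the first row has no predecessor.
steps = timestamps.diff().dropna()
print(steps.unique())
```

Running the same `diff()` per service on the full frame would show whether any interval deviates from five minutes.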


> The core idea of the whole solution is that the system load on, say, Monday at 10 a.m. is on average the same as on any other Monday at 10 a.m.


Let's take *timestamp* apart. We extract the day of the week and the hour in which the measurement was taken, plus the index of the measurement within the hour:

In [10]:
df_train['day_of_week'] = df_train['timestamp'].dt.dayofweek.astype(int)
df_train['hour_of_measure'] = df_train['timestamp'].dt.hour.astype(int)
df_train['index_of_measure'] = (df_train['timestamp'].dt.minute/5).astype(int)

In [11]:
df_train.head()

Unnamed: 0,date,service_name,cpu,timestamp,day_of_week,hour_of_measure,index_of_measure
0,1598325000000000000,srv0,0.014633,2020-08-25 03:10:00,1,3,2
1,1598325300000000000,srv0,0.014996,2020-08-25 03:15:00,1,3,3
2,1598325600000000000,srv0,0.014508,2020-08-25 03:20:00,1,3,4
3,1598325900000000000,srv0,0.015153,2020-08-25 03:25:00,1,3,5
4,1598326200000000000,srv0,0.015373,2020-08-25 03:30:00,1,3,6


Compute the mean *cpu* for each service per day of the week, per hour, and per measurement:

In [12]:
df_daily_mean = df_train[['service_name','day_of_week','cpu']].groupby(['day_of_week', 'service_name']).mean()
df_daily_mean

Unnamed: 0_level_0,Unnamed: 1_level_0,cpu
day_of_week,service_name,Unnamed: 2_level_1
0,srv0,0.081854
0,srv1,0.027882
0,srv2,0.070363
0,srv3,0.096629
1,srv0,0.063721
1,srv1,0.021171
1,srv2,0.052546
1,srv3,0.072746
2,srv0,0.076898
2,srv1,0.026012


In [13]:
df_hourly_mean = df_train[['service_name','day_of_week','hour_of_measure','cpu']].groupby(['day_of_week','hour_of_measure','service_name']).mean()
df_hourly_mean

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,cpu
day_of_week,hour_of_measure,service_name,Unnamed: 3_level_1
0,0,srv0,0.020308
0,0,srv1,0.007427
0,0,srv2,0.014913
0,0,srv3,0.021074
0,1,srv0,0.015006
...,...,...,...
6,22,srv3,0.048720
6,23,srv0,0.028613
6,23,srv1,0.009995
6,23,srv2,0.022408


In [14]:
df_measure_mean = df_train[['service_name','day_of_week','hour_of_measure','index_of_measure','cpu']].groupby(['day_of_week','hour_of_measure','index_of_measure','service_name']).mean()
df_measure_mean

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,cpu
day_of_week,hour_of_measure,index_of_measure,service_name,Unnamed: 4_level_1
0,0,0,srv0,0.024524
0,0,0,srv1,0.008829
0,0,0,srv2,0.017659
0,0,0,srv3,0.025043
0,0,1,srv0,0.024028
...,...,...,...,...
6,23,10,srv3,0.026879
6,23,11,srv0,0.024872
6,23,11,srv1,0.008760
6,23,11,srv2,0.018352


Add the resulting values to the training set:

In [15]:
df_train['cpu_daily_mean'] = df_train[['service_name','day_of_week']].apply(lambda x: df_daily_mean.loc[x[1],x[0]][0], axis=1)
df_train['cpu_hourly_mean'] = df_train[['service_name','day_of_week','hour_of_measure']].apply(lambda x: df_hourly_mean.loc[x[1],x[2],x[0]][0], axis=1)
df_train['cpu_measure_mean'] = df_train[['service_name','day_of_week','hour_of_measure','index_of_measure']].apply(lambda x: df_measure_mean.loc[x[1],x[2],x[3],x[0]][0], axis=1)
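
Row-wise `apply` with a lookup per row can be slow on tens of thousands of rows. As a hedged alternative (not the approach used in this notebook), group means can be attached with `groupby(...).transform("mean")`. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy data: two services on the same weekday, made-up cpu values.
toy = pd.DataFrame({
    "service_name": ["srv0", "srv0", "srv1", "srv1"],
    "day_of_week": [0, 0, 0, 0],
    "cpu": [0.10, 0.30, 0.02, 0.04],
})

# transform("mean") returns one value per original row, aligned on the index,
# so no per-row lookup into a separate table is needed.
toy["cpu_daily_mean"] = (
    toy.groupby(["day_of_week", "service_name"])["cpu"].transform("mean")
)
print(toy)
```

This shortcut only fits the training frame. For the test set, the means still have to come from the training data, for example by merging the precomputed tables on their key columns.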

According to the task, we need to determine the *cpu* values of the current hour from the values of the previous hour. We create such an attribute with the *shift* operation. For the *shift* to be correct, we split the training set into four parts (one per service), apply the *shift*, and join everything back together.

In [16]:
df_train_srv0 = df_train[df_train['service_name'] == 'srv0'].copy()


In [17]:
df_train_srv0['prev_cpu'] = df_train_srv0['cpu'].shift(12)
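
The split, shift, and re-join steps can also be expressed in one call. The sketch below assumes the rows are already ordered by time within each service; the toy values and the shortened lag are made up for illustration.

```python
import pandas as pd

LAG = 2  # the notebook uses 12 (one hour of 5-minute samples); 2 keeps the toy small

toy = pd.DataFrame({
    "service_name": ["srv0"] * 3 + ["srv1"] * 3,
    "cpu": [0.1, 0.2, 0.3, 1.0, 2.0, 3.0],
})

# shift() inside a groupby never carries values across service boundaries:
# the first LAG rows of every service get NaN.
toy["prev_cpu"] = toy.groupby("service_name")["cpu"].shift(LAG)
print(toy)
```

The first `LAG` rows of each service have no predecessor, which matches the empty *prev_cpu* cells at the start of each per-service frame in the notebook.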

> If you look closely at the test set (test.csv), you can see that the *cpu* values given for the previous hour are never zero.

Hence we simply drop the zero values (the former gaps) from the training set:

In [18]:
df_train_srv0 = df_train_srv0[df_train_srv0['cpu']>0]

In [19]:
df_train_srv0.tail()

Unnamed: 0,date,service_name,cpu,timestamp,day_of_week,hour_of_measure,index_of_measure,cpu_daily_mean,cpu_hourly_mean,cpu_measure_mean,prev_cpu
17275,1603507500000000000,srv0,0.017758,2020-10-24 02:45:00,5,2,9,0.034683,0.009441,0.009433,0.019466
17276,1603507800000000000,srv0,0.015819,2020-10-24 02:50:00,5,2,10,0.034683,0.009441,0.009334,0.019245
17277,1603508100000000000,srv0,0.001977,2020-10-24 02:55:00,5,2,11,0.034683,0.009441,0.00769,0.019102
17278,1603508400000000000,srv0,0.002625,2020-10-24 03:00:00,5,3,0,0.034683,0.00841,0.007958,0.018842
17279,1603508700000000000,srv0,0.0129,2020-10-24 03:05:00,5,3,1,0.034683,0.00841,0.009338,0.018812


Repeat the same operations for the remaining services:

In [20]:
df_train_srv1 = df_train[df_train['service_name'] == 'srv1'].copy()
df_train_srv2 = df_train[df_train['service_name'] == 'srv2'].copy()
df_train_srv3 = df_train[df_train['service_name'] == 'srv3'].copy()

In [21]:
df_train_srv1['prev_cpu'] = df_train_srv1['cpu'].shift(12)
df_train_srv2['prev_cpu'] = df_train_srv2['cpu'].shift(12)
df_train_srv3['prev_cpu'] = df_train_srv3['cpu'].shift(12)

In [22]:
df_train_srv1 = df_train_srv1[df_train_srv1['cpu']>0]
df_train_srv2 = df_train_srv2[df_train_srv2['cpu']>0]
df_train_srv3 = df_train_srv3[df_train_srv3['cpu']>0]

Assemble the full training set:

In [23]:
df_dataset = pd.concat([df_train_srv0, df_train_srv1, df_train_srv2, df_train_srv3])
df_dataset

Unnamed: 0,date,service_name,cpu,timestamp,day_of_week,hour_of_measure,index_of_measure,cpu_daily_mean,cpu_hourly_mean,cpu_measure_mean,prev_cpu
0,1598325000000000000,srv0,0.014633,2020-08-25 03:10:00,1,3,2,0.063721,0.016518,0.015438,
1,1598325300000000000,srv0,0.014996,2020-08-25 03:15:00,1,3,3,0.063721,0.016518,0.015564,
2,1598325600000000000,srv0,0.014508,2020-08-25 03:20:00,1,3,4,0.063721,0.016518,0.015759,
3,1598325900000000000,srv0,0.015153,2020-08-25 03:25:00,1,3,5,0.063721,0.016518,0.016020,
4,1598326200000000000,srv0,0.015373,2020-08-25 03:30:00,1,3,6,0.063721,0.016518,0.016564,
...,...,...,...,...,...,...,...,...,...,...,...
69115,1603507500000000000,srv3,0.021527,2020-10-24 02:45:00,5,2,9,0.044508,0.011322,0.011266,0.022668
69116,1603507800000000000,srv3,0.021254,2020-10-24 02:50:00,5,2,10,0.044508,0.011322,0.011243,0.022644
69117,1603508100000000000,srv3,0.021408,2020-10-24 02:55:00,5,2,11,0.044508,0.011322,0.011285,0.022446
69118,1603508400000000000,srv3,0.021849,2020-10-24 03:00:00,5,3,0,0.044508,0.011131,0.011626,0.022138


So, we now have a dataset for training. We will use the following attributes as inputs:
* *service_name* - the service name (the only categorical attribute)
* *day_of_week* - the day of the week (0 - Monday, 1 - Tuesday, etc.)
* *hour_of_measure* - the hour in which the measurement was taken (0-23)
* *index_of_measure* - the index of the measurement within the hour (12 measurements are taken every hour)
* *prev_cpu* - the previous *cpu* value (the *cpu* value in the previous hour at the same measurement index)
* *cpu_daily_mean* - the mean *cpu* per day (for each service and day of the week)
* *cpu_hourly_mean* - the mean *cpu* per hour (for each service, day of the week, and hour)
* *cpu_measure_mean* - the mean *cpu* per measurement (for each service, day of the week, hour, and measurement index)

First, split the training dataset into input and output.

In [25]:
X_train = df_dataset[['service_name','day_of_week','hour_of_measure','index_of_measure','prev_cpu','cpu_daily_mean','cpu_hourly_mean','cpu_measure_mean']].copy()
Y_train = df_dataset['cpu'].copy()

The data files come with a file that computes the metric. We use its function for further calculations:

In [26]:
Np = 20
p = 0.02
def CountScore(test, referenceModel):
    size = len(referenceModel)
    if len(test) != size:
        print("dimensions are not equal")
        return 0
    counters = [0] * Np
    for ref, t in zip(referenceModel, test):
        dev = abs(ref - t)
        for j in range(Np):
            if dev < (j + 1) * p * t:
                counters[j] += 1
    metric = sum(counters) / size / Np
    return metric
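
The metric averages, over 20 tolerance levels from 2% to 40%, the share of points whose absolute deviation is below that fraction of the test value. A vectorized NumPy version is sketched below; `count_score_vectorized` and its parameter names are assumptions for illustration and not part of the organizers' file.

```python
import numpy as np

def count_score_vectorized(test, reference, n_levels=20, step=0.02):
    """Vectorized sketch of the CountScore metric defined above."""
    test = np.asarray(test, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if test.shape != reference.shape:
        raise ValueError("dimensions are not equal")
    dev = np.abs(reference - test)
    # Tolerance levels 0.02, 0.04, ..., 0.40, each relative to the test value.
    levels = np.arange(1, n_levels + 1) * step
    hits = dev[:, None] < levels[None, :] * test[:, None]
    # Mean over all (point, level) pairs equals sum(counters) / size / Np.
    return float(hits.mean())

perfect_and_far = count_score_vectorized([1.0, 1.0], [1.0, 2.0])
ten_percent_off = count_score_vectorized([1.0], [1.1])
print(perfect_and_far, ten_percent_off)
```

Unlike the original, this sketch raises an error on mismatched lengths instead of printing a message and returning 0. That is a design choice of the example, not of the organizers' code.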

Let's add a degree of automation. First we create a couple of classes that will be useful for the transformer pipeline, which converts the input values into a form suitable for machine learning.


The first class is *DataFrameSelector*. Its job is to convert a pandas dataframe into a numpy array.


The second class is *CycleTransformator*. It encodes cyclic attributes. What do I call cyclic attributes? Attributes that describe events recurring with a fixed period. For example, Monday comes around every week, and 10 a.m. comes around every day. Even our 12 *cpu* measurements per hour repeat every hour. The trigonometric functions *sin* and *cos* are convenient for encoding such attributes.
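
To see why *sin* and *cos* help, here is a minimal standalone sketch (the `encode_cyclic` helper is hypothetical and separate from the notebook's class). After encoding, hour 23 ends up close to hour 0, while hour 12 is as far from hour 0 as possible.

```python
import numpy as np

def encode_cyclic(values, period):
    """Map values in [0, period) onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = [0, 12, 23]
encoded = encode_cyclic(hours, period=24)

# Euclidean distances in the encoded space.
dist_23_to_0 = np.linalg.norm(encoded[2] - encoded[0])
dist_12_to_0 = np.linalg.norm(encoded[1] - encoded[0])
print(dist_23_to_0, dist_12_to_0)
```

With the raw integer encoding, 23 and 0 would look like the two most distant hours, which is the opposite of what the data means.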


In [28]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score

date_ix = 0

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attributes_names):
        self.attributes_names = attributes_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attributes_names].values # convert to NumPy array

class CycleTransformator( BaseEstimator, TransformerMixin):
    #Class Constructor 
    def __init__( self,  cycle_columns ):
        self._cycle_columns = cycle_columns
        self._cycle_stats = {}
    
    #Record each cyclic column's min and max; transform uses them to set the period
    def fit( self, X, y = None ):
        for column in self._cycle_columns:
            self._cycle_stats[column] = { 'max': X[column].max(), 'min': X[column].min() }
        return self
    
    #Method that describes what we need this transformer to do
    def transform( self, X, y = None ):
        self._df = pd.DataFrame(index=X.index)
        for column in self._cycle_columns:
            self._df[column+'_sin'] = np.sin(2*np.pi/(self._cycle_stats[column]['max']+1)*X[column].fillna(self._cycle_stats[column]['min']-1))
            self._df[column+'_cos'] = np.cos(2*np.pi/(self._cycle_stats[column]['max']+1)*X[column].fillna(self._cycle_stats[column]['min']-1))

        #self._df = self._df.reset_index(drop=True)
        #print('Cycle transform shape is {}'.format(self._df.values.shape))
        return self._df.values

Now let's build a pipeline that transforms the input values into a numpy array, which we will use to train the models. I don't use *StandardScaler*, since the *cpu* values already lie within [0..1].

In [29]:
num_attribs = ['prev_cpu','cpu_daily_mean','cpu_hourly_mean','cpu_measure_mean']
cat_attribs = ['service_name']
date_attribs = ['day_of_week', 'hour_of_measure', 'index_of_measure']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
#    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', OneHotEncoder(sparse=False)),
])
counted_pipeline = Pipeline([
    ('cycle_transformator', CycleTransformator(cycle_columns=date_attribs)),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
    ('counted_pipeline', counted_pipeline),
])

Since there isn't much data, I won't split the set into training and validation parts "by the book"; I'll simply use cross-validation.

In [30]:
X_train_prepared = full_pipeline.fit_transform(X_train)
X_train_prepared.shape

(56051, 14)

Train several models and look at the metric and the root-mean-square error.

In [32]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Std Deviation:", scores.std())

# train
tree_reg = DecisionTreeRegressor(random_state=57)
tree_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = tree_reg.predict(X_train_prepared)

scores = cross_val_score(tree_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
tree_rmse_scores = np.sqrt(-scores)

display_scores(tree_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
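# note: r2_score's signature is (y_true, y_pred); the arguments here are passed in the opposite order
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )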

Scores: [0.0129452  0.00732512 0.01034512 0.013506   0.02645976]
Mean: 0.014116240344880907
Std Deviation: 0.006548998079222805
Metric: 0.99
R2-score: 1.00


In [33]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=57)
forest_reg.fit(X_train_prepared, Y_train.values)
# predict
hackAI_predictions = forest_reg.predict(X_train_prepared)

scores = cross_val_score(forest_reg, X_train_prepared, Y_train.values, scoring="neg_mean_squared_error")
forest_rmse_scores = np.sqrt(-scores)
display_scores(forest_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.00977249 0.00571492 0.00737922 0.00857103 0.02085083]
Mean: 0.010457697138573627
Std Deviation: 0.005366882701147156
Metric: 0.98
R2-score: 1.00


Let's see which attributes contribute most to the predictions:

In [34]:
feature_importances = forest_reg.feature_importances_
feature_importances

array([8.17093441e-01, 1.42818939e-03, 7.65837639e-03, 1.30799717e-01,
       3.43366683e-04, 1.61429897e-04, 2.27335264e-04, 1.39907522e-03,
       3.46704728e-03, 1.21518899e-03, 1.11014135e-02, 2.31570845e-02,
       8.64106430e-04, 1.08422803e-03])

In [35]:
cat_encoder = cat_pipeline.named_steps["cat_encoder"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
counted_encoder = counted_pipeline.named_steps["cycle_transformator"]
counted_attribs = list(counted_encoder._df.columns)
attributes = num_attribs + cat_one_hot_attribs + counted_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.8170934413723446, 'prev_cpu'),
 (0.13079971705869553, 'cpu_measure_mean'),
 (0.02315708447126322, 'hour_of_measure_cos'),
 (0.011101413523226227, 'hour_of_measure_sin'),
 (0.007658376386239411, 'cpu_hourly_mean'),
 (0.003467047282835506, 'day_of_week_sin'),
 (0.0014281893878993347, 'cpu_daily_mean'),
 (0.0013990752192332952, 'srv3'),
 (0.0012151889922201323, 'day_of_week_cos'),
 (0.0010842280313858184, 'index_of_measure_cos'),
 (0.0008641064303826466, 'index_of_measure_sin'),
 (0.00034336668338013254, 'srv0'),
 (0.00022733526379498342, 'srv2'),
 (0.000161429897099018, 'srv1')]

Train a few more models.

In [36]:
from sklearn import svm

# train
svm_reg = svm.SVR(gamma='auto')
svm_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = svm_reg.predict(X_train_prepared)

scores = cross_val_score(svm_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
svm_rmse_scores = np.sqrt(-scores)

display_scores(svm_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.03648103 0.059947   0.04088903 0.04698442 0.04262454]
Mean: 0.04538520310305193
Std Deviation: 0.00802152999509224
Metric: 0.24
R2-score: -0.39


In [37]:
from sklearn.linear_model import LinearRegression

# train
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = lin_reg.predict(X_train_prepared)

scores = cross_val_score(lin_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
lin_rmse_scores = np.sqrt(-scores)

display_scores(lin_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.01172065 0.00743654 0.00869142 0.00962229 0.01359725]
Mean: 0.010213630366287477
Std Deviation: 0.0021954041751608656
Metric: 0.73
R2-score: 0.96


In [38]:
from sklearn.linear_model import Ridge

rid_reg = Ridge()
rid_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = rid_reg.predict(X_train_prepared)

scores = cross_val_score(rid_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
rid_rmse_scores = np.sqrt(-scores)

display_scores(rid_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.01187289 0.0075139  0.00886511 0.00976063 0.01392988]
Mean: 0.010388480576662789
Std Deviation: 0.002268172995648455
Metric: 0.73
R2-score: 0.95


In [39]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso()
lasso_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = lasso_reg.predict(X_train_prepared)

scores = cross_val_score(lasso_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
lasso_rmse_scores = np.sqrt(-scores)

display_scores(lasso_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.04462129 0.0492479  0.04199967 0.04384942 0.06781241]
Mean: 0.04950613868326836
Std Deviation: 0.00945931903538782
Metric: 0.12
R2-score: 0.00


In [40]:
from sklearn.linear_model import ElasticNet

net_reg = ElasticNet(random_state=57)
net_reg.fit(X_train_prepared, Y_train)
# predict
hackAI_predictions = net_reg.predict(X_train_prepared)

scores = cross_val_score(net_reg, X_train_prepared, Y_train, scoring="neg_mean_squared_error")
net_rmse_scores = np.sqrt(-scores)

display_scores(net_rmse_scores)
print("Metric: %.2f" % CountScore(Y_train.values, hackAI_predictions) )
print("R2-score: %.2f" % r2_score(hackAI_predictions , Y_train.values) )

Scores: [0.04462129 0.0492479  0.04199967 0.04384942 0.06781241]
Mean: 0.04950613868326836
Std Deviation: 0.00945931903538782
Metric: 0.12
R2-score: 0.00


Judging by the metric computed with the hackathon organizers' function, the best model is *DecisionTreeRegressor* (0.99 versus 0.98). Note, however, that this metric is computed here on the training data itself. I choose *RandomForestRegressor* instead, because its cross-validated mean error and standard deviation are better, which is confirmed by my 26 submissions (26 of them, Carl!!!).
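
Since the metric values above come from predictions on the same rows the models were fitted on, an out-of-fold estimate may give a fairer comparison. The sketch below uses scikit-learn's `cross_val_predict` on synthetic data; the data, the small forest size, and the `rmse` helper are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(57)
X = rng.uniform(0.01, 0.3, size=(200, 1))           # synthetic stand-in for prev_cpu
y = X[:, 0] * (1 + rng.normal(0, 0.05, size=200))   # noisy made-up target

model = RandomForestRegressor(n_estimators=20, random_state=57)

# Each out-of-fold prediction comes from a model that did not see that row.
oof_pred = cross_val_predict(model, X, y, cv=5)

# In-sample predictions, as in the cells above, for comparison.
in_sample_pred = model.fit(X, y).predict(X)

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

print("in-sample RMSE:  ", rmse(y, in_sample_pred))
print("out-of-fold RMSE:", rmse(y, oof_pred))
```

The out-of-fold predictions could then be passed to `CountScore` in place of `hackAI_predictions` to compare models on data they were not fitted on.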


Use GridSearchCV to fine-tune our random forest:

In [41]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [30, 50, 70, 100], 'max_features': ['auto', 'sqrt', 'log2']}, 
]

forest_reg = RandomForestRegressor(random_state=57)

grid_search = GridSearchCV(forest_reg, param_grid, scoring="neg_mean_squared_error", verbose=5, n_jobs=2)
grid_search.fit(X_train_prepared, Y_train.values)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  14 tasks      | elapsed:  2.7min
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:  7.7min finished


GridSearchCV(cv=None, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=57,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_j

In [42]:
grid_search.best_params_

{'max_features': 'auto', 'n_estimators': 100}

In [43]:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=57, verbose=0, warm_start=False)

So, we have a model, and it's time to make predictions.


To do this, we open the test set, transform it in exactly the same way, and predict the missing *cpu* values.

In [44]:
df_test = pd.read_csv('test.csv')
df_test.head()

Unnamed: 0,date,service_name,cpu,recieved_bytes,transmitted_bytes
0,1603512000000000000,srv0,0.020587,34285.98,95601.43
1,1603512300000000000,srv0,0.021137,42330.43,96762.45
2,1603512600000000000,srv0,0.021553,35407.8,99925.01
3,1603512900000000000,srv0,0.021571,36284.38,98606.03
4,1603513200000000000,srv0,0.021617,42170.67,101624.8


In [45]:
df_test.fillna(value=0, inplace=True)

In [46]:
df_test['service_name'].value_counts()

srv1    2640
srv2    2640
srv3    2616
srv0    2616
Name: service_name, dtype: int64

In [47]:
df_test['timestamp'] = pd.to_datetime(df_test['date'])
df_test['day_of_week'] = df_test['timestamp'].dt.dayofweek.astype(int)
df_test['hour_of_measure'] = df_test['timestamp'].dt.hour.astype(int)
df_test['index_of_measure'] = (df_test['timestamp'].dt.minute/5).astype(int)

In [48]:
df_test['day_of_week'].value_counts()

6    1920
0    1872
5    1728
1    1344
4    1248
2    1248
3    1152
Name: day_of_week, dtype: int64

In [49]:
df_test['hour_of_measure'].value_counts()

13    1008
12    1008
5      912
4      912
9      864
1      864
8      864
0      864
17     840
16     840
21     768
20     768
Name: hour_of_measure, dtype: int64

We reuse the per-day, per-hour, and per-measurement means already computed:

In [50]:
df_test['cpu_daily_mean'] = df_test[['service_name','day_of_week']].apply(lambda x: df_daily_mean.loc[x[1],x[0]][0], axis=1)
df_test['cpu_hourly_mean'] = df_test[['service_name','day_of_week','hour_of_measure']].apply(lambda x: df_hourly_mean.loc[x[1],x[2],x[0]][0], axis=1)
df_test['cpu_measure_mean'] = df_test[['service_name','day_of_week','hour_of_measure','index_of_measure']].apply(lambda x: df_measure_mean.loc[x[1],x[2],x[3],x[0]][0], axis=1)

Likewise, split the set by service, compute the previous values, and put it back together:

In [52]:
df_test_srv0 = df_test[df_test['service_name'] == 'srv0'].copy()
df_test_srv1 = df_test[df_test['service_name'] == 'srv1'].copy()
df_test_srv2 = df_test[df_test['service_name'] == 'srv2'].copy()
df_test_srv3 = df_test[df_test['service_name'] == 'srv3'].copy()

In [53]:
df_test_srv0['prev_cpu'] = df_test_srv0['cpu'].shift(12)
df_test_srv1['prev_cpu'] = df_test_srv1['cpu'].shift(12)
df_test_srv2['prev_cpu'] = df_test_srv2['cpu'].shift(12)
df_test_srv3['prev_cpu'] = df_test_srv3['cpu'].shift(12)

In [54]:
df_test_dataset = pd.concat([df_test_srv0, df_test_srv1, df_test_srv2, df_test_srv3])

In [55]:
X_test = df_test_dataset[['service_name','day_of_week','hour_of_measure','index_of_measure','prev_cpu','cpu_daily_mean','cpu_hourly_mean','cpu_measure_mean']].copy()
Y_test = df_test_dataset['cpu'].copy()

Since only the missing *cpu* values need to be determined, select them from the full set:

In [56]:
X_test_missed = X_test[Y_test==0]
Y_test_missed = Y_test[Y_test==0]
X_test_missed

Unnamed: 0,service_name,day_of_week,hour_of_measure,index_of_measure,prev_cpu,cpu_daily_mean,cpu_hourly_mean,cpu_measure_mean
12,srv0,5,5,0,0.020587,0.034683,0.015027,0.012960
13,srv0,5,5,1,0.021137,0.034683,0.015027,0.013192
14,srv0,5,5,2,0.021553,0.034683,0.015027,0.013571
15,srv0,5,5,3,0.021571,0.034683,0.015027,0.014070
16,srv0,5,5,4,0.021617,0.034683,0.015027,0.014404
...,...,...,...,...,...,...,...,...
10507,srv3,2,13,7,0.162288,0.090530,0.152080,0.149263
10508,srv3,2,13,8,0.161447,0.090530,0.152080,0.148725
10509,srv3,2,13,9,0.161522,0.090530,0.152080,0.147570
10510,srv3,2,13,10,0.161806,0.090530,0.152080,0.148279


Use our model to predict:

In [57]:
from sklearn.metrics import mean_squared_error


final_model = grid_search.best_estimator_
X_test_prepared = full_pipeline.transform(X_test_missed)
Y_test_missed = final_model.predict(X_test_prepared)
Y_test_missed

array([0.02715811, 0.02723461, 0.02821832, ..., 0.1656307 , 0.16701092,
       0.17275344])

Fill in the gaps:

In [58]:
df_test_dataset.loc[df_test_dataset['cpu']==0,['cpu']] = Y_test_missed
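
The notebook marks gaps by filling them with zero and later selecting `cpu == 0`. A hedged alternative is to remember the missing-value mask before any fill, so a genuine zero reading would not be overwritten. The toy values and the `predictions` array below are placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: two gaps and one genuine zero reading.
toy = pd.DataFrame({"cpu": [0.02, np.nan, 0.0, np.nan]})

# Remember where the gaps are before any fillna.
missing = toy["cpu"].isna()

# Placeholder for the output of model.predict(...) on the missing rows.
predictions = np.array([0.5, 0.7])

# Boolean-mask assignment writes the predictions in row order.
toy.loc[missing, "cpu"] = predictions
print(toy)
```

In this dataset the smallest observed training *cpu* value is above zero, so the zero-marker approach gives the same selection here. The mask only matters if real zeros can occur.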

In [59]:
df_test_dataset

Unnamed: 0,date,service_name,cpu,recieved_bytes,transmitted_bytes,timestamp,day_of_week,hour_of_measure,index_of_measure,cpu_daily_mean,cpu_hourly_mean,cpu_measure_mean,prev_cpu
0,1603512000000000000,srv0,0.020587,34285.98,95601.43,2020-10-24 04:00:00,5,4,0,0.034683,0.011245,0.010018,
1,1603512300000000000,srv0,0.021137,42330.43,96762.45,2020-10-24 04:05:00,5,4,1,0.034683,0.011245,0.010391,
2,1603512600000000000,srv0,0.021553,35407.80,99925.01,2020-10-24 04:10:00,5,4,2,0.034683,0.011245,0.010577,
3,1603512900000000000,srv0,0.021571,36284.38,98606.03,2020-10-24 04:15:00,5,4,3,0.034683,0.011245,0.010436,
4,1603513200000000000,srv0,0.021617,42170.67,101624.80,2020-10-24 04:20:00,5,4,4,0.034683,0.011245,0.010651,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10507,1605706500000000000,srv3,0.169182,0.00,0.00,2020-11-18 13:35:00,2,13,7,0.090530,0.152080,0.149263,0.162288
10508,1605706800000000000,srv3,0.164649,0.00,0.00,2020-11-18 13:40:00,2,13,8,0.090530,0.152080,0.148725,0.161447
10509,1605707100000000000,srv3,0.165631,0.00,0.00,2020-11-18 13:45:00,2,13,9,0.090530,0.152080,0.147570,0.161522
10510,1605707400000000000,srv3,0.167011,0.00,0.00,2020-11-18 13:50:00,2,13,10,0.090530,0.152080,0.148279,0.161806


Write the solution to a file:

In [61]:
df_test_dataset.to_csv('submission_hackAI.csv', columns=['date','service_name','cpu'], index=False)

---
That's about it.


Don't judge too harshly: the pianist plays as well as he can.



<p>
**Best regards**,<br> 
Andrey Vladimirovich Vest<br>
Chief Development Engineer<br>
IT Support for AS, CBP Tribe<br>
Saint Petersburg<br>
Ext.: 8-789-19382.<br>
</p>