## Regressão

Utilizando as bases de treinamento e testes disponivel no link:

https://archive.ics.uci.edu/ml/machine-learning-databases/00492/

Teste dois métodos de aprendizagem de máquina para identificar o que apresenta melhores resultados na previsão do volume de tráfego. Avaliem o impacto de usar validação cruzada (com 10 grupos) e split (70% para treinamento e 30% para teste). Para esse último, execute o experimento duas vezes selecionando as instâncias aleatoriamente para o treinamento e teste e avalie o impacto nos resultados.

Outras informações como dia da semana podem ser inclusas.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pickle
import sklearn
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler
import xgboost as xgb
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error

In [43]:
df = pd.read_csv('metro.csv')
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


In [3]:
df.tail()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
48199,,283.45,0.0,0.0,75,Clouds,broken clouds,2018-09-30 19:00:00,3543
48200,,282.76,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 20:00:00,2781
48201,,282.73,0.0,0.0,90,Thunderstorm,proximity thunderstorm,2018-09-30 21:00:00,2159
48202,,282.09,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 22:00:00,1450
48203,,282.12,0.0,0.0,90,Clouds,overcast clouds,2018-09-30 23:00:00,954


In [4]:
df.describe()

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,traffic_volume
count,48204.0,48204.0,48204.0,48204.0,48204.0
mean,281.20587,0.334264,0.000222,49.362231,3259.818355
std,13.338232,44.789133,0.008168,39.01575,1986.86067
min,0.0,0.0,0.0,0.0,0.0
25%,272.16,0.0,0.0,1.0,1193.0
50%,282.45,0.0,0.0,64.0,3380.0
75%,291.806,0.0,0.0,90.0,4933.0
max,310.07,9831.3,0.51,100.0,7280.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB


In [6]:
for column in df.columns:
    print(f'{column} info:\n{df[column].describe()}\n\n')

holiday info:
count     48204
unique       12
top        None
freq      48143
Name: holiday, dtype: object


temp info:
count    48204.000000
mean       281.205870
std         13.338232
min          0.000000
25%        272.160000
50%        282.450000
75%        291.806000
max        310.070000
Name: temp, dtype: float64


rain_1h info:
count    48204.000000
mean         0.334264
std         44.789133
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       9831.300000
Name: rain_1h, dtype: float64


snow_1h info:
count    48204.000000
mean         0.000222
std          0.008168
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          0.510000
Name: snow_1h, dtype: float64


clouds_all info:
count    48204.000000
mean        49.362231
std         39.015750
min          0.000000
25%          1.000000
50%         64.000000
75%         90.000000
max        100.000000
Name: clouds_all, dtype: float64


wea

In [7]:
df.isnull().sum()

holiday                0
temp                   0
rain_1h                0
snow_1h                0
clouds_all             0
weather_main           0
weather_description    0
date_time              0
traffic_volume         0
dtype: int64

In [8]:
df.duplicated().unique()

array([False,  True])

In [44]:
df.drop_duplicates(inplace = True)

df.duplicated().sum()

0

In [10]:
df.drop(columns = ['weather_description'],inplace = True)

In [11]:
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,2012-10-02 13:00:00,4918


In [45]:
issue_index = df[df['date_time'].duplicated()].index

In [46]:
issue_index

Int64Index([  179,   181,   183,   270,   271,   273,   275,   277,   278,
              280,
            ...
            48050, 48052, 48054, 48065, 48066, 48069, 48072, 48112, 48193,
            48195],
           dtype='int64', length=7612)

In [47]:
df.drop(index = issue_index,inplace = True)

In [48]:
for column in df.columns:
    print(f'{column} info:\n{df[column].describe()}\n\n')
    x = df[column].value_counts()
    print(f'Value counts: \n{x/len(df)}\n\n\n')

holiday info:
count     40575
unique       12
top        None
freq      40522
Name: holiday, dtype: object


Value counts: 
None                         0.998694
Christmas Day                0.000123
State Fair                   0.000123
Columbus Day                 0.000123
New Years Day                0.000123
Washingtons Birthday         0.000123
Thanksgiving Day             0.000123
Memorial Day                 0.000123
Labor Day                    0.000123
Veterans Day                 0.000123
Independence Day             0.000123
Martin Luther King Jr Day    0.000074
Name: holiday, dtype: float64



temp info:
count    40575.000000
mean       281.316763
std         13.816618
min          0.000000
25%        271.840000
50%        282.860000
75%        292.280000
max        310.070000
Name: temp, dtype: float64


Value counts: 
274.150    0.001972
276.793    0.001922
291.150    0.001503
271.150    0.001380
272.150    0.001331
             ...   
284.811    0.000025
252.930    0.000

In [49]:
df['date_time'] = pd.to_datetime(df['date_time'],format='%Y-%m-%d %H:%M:%S')
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


In [50]:
df.corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,traffic_volume
temp,1.0,0.008395,-0.015351,-0.098291,0.136924
rain_1h,0.008395,1.0,-9e-05,0.00503,0.005402
snow_1h,-0.015351,-9e-05,1.0,0.024202,-0.002086
clouds_all,-0.098291,0.00503,0.024202,1.0,0.07837
traffic_volume,0.136924,0.005402,-0.002086,0.07837,1.0


In [62]:
df.drop(columns = ['holiday','rain_1h','snow_1h'],inplace = True)

for column in df.columns:
    print(f'{column} info:\n{df[column].describe()}\n\n')
    x = df[column].value_counts()
    print(f'Value counts: \n{x/len(df)}\n\n\n')

temp info:
count    40575.000000
mean       281.316763
std         13.816618
min          0.000000
25%        271.840000
50%        282.860000
75%        292.280000
max        310.070000
Name: temp, dtype: float64


Value counts: 
274.150    0.001972
276.793    0.001922
291.150    0.001503
271.150    0.001380
272.150    0.001331
             ...   
284.811    0.000025
252.930    0.000025
256.690    0.000025
254.680    0.000025
253.410    0.000025
Name: temp, Length: 5841, dtype: float64



clouds_all info:
count    40575.000000
mean        44.199162
std         38.683447
min          0.000000
25%          1.000000
50%         40.000000
75%         90.000000
max        100.000000
Name: clouds_all, dtype: float64


Value counts: 
1      0.298534
90     0.271596
75     0.110881
40     0.091115
0      0.048404
20     0.043795
64     0.033173
5      0.026322
92     0.015354
12     0.006556
8      0.006506
32     0.004387
24     0.004239
48     0.004116
80     0.004091
88     0.003993
36    

In [51]:
le = preprocessing.LabelEncoder()

for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = le.fit_transform(df[column])
        
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,7,288.28,0.0,0.0,40,1,24,2012-10-02 09:00:00,5545
1,7,289.36,0.0,0.0,75,1,2,2012-10-02 10:00:00,4516
2,7,289.58,0.0,0.0,90,1,19,2012-10-02 11:00:00,4767
3,7,290.13,0.0,0.0,90,1,19,2012-10-02 12:00:00,5026
4,7,291.14,0.0,0.0,75,1,2,2012-10-02 13:00:00,4918


In [52]:
df_corr = df.corr()

In [53]:
df_corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume
holiday,1.0,-0.001756,6.3e-05,0.000299,0.007741,-0.002802,-0.002507,0.017067
temp,-0.001756,1.0,0.008395,-0.015351,-0.098291,-0.038347,-0.068917,0.136924
rain_1h,6.3e-05,0.008395,1.0,-9e-05,0.00503,0.009474,0.00997,0.005402
snow_1h,0.000299,-0.015351,-9e-05,1.0,0.024202,0.043872,0.007017,-0.002086
clouds_all,0.007741,-0.098291,0.00503,0.024202,1.0,0.481227,-0.314441,0.07837
weather_main,-0.002802,-0.038347,0.009474,0.043872,0.481227,1.0,-0.139632,-0.02896
weather_description,-0.002507,-0.068917,0.00997,0.007017,-0.314441,-0.139632,1.0,-0.077234
traffic_volume,0.017067,0.136924,0.005402,-0.002086,0.07837,-0.02896,-0.077234,1.0


In [54]:
df.set_index(df['date_time'],inplace = True)
df.drop(columns = ['date_time'],inplace = True)
df.sort_values(by = 'date_time', ascending = True, inplace = True)
df.head()

Unnamed: 0_level_0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,traffic_volume
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2012-10-02 09:00:00,7,288.28,0.0,0.0,40,1,24,5545
2012-10-02 10:00:00,7,289.36,0.0,0.0,75,1,2,4516
2012-10-02 11:00:00,7,289.58,0.0,0.0,90,1,19,4767
2012-10-02 12:00:00,7,290.13,0.0,0.0,90,1,19,5026
2012-10-02 13:00:00,7,291.14,0.0,0.0,75,1,2,4918


In [23]:
'''df.sort_index(inplace = True)
df['traffic_volume'].rolling(4000).mean().plot(figsize=(8,8))'''

"df.sort_index(inplace = True)\ndf['traffic_volume'].rolling(4000).mean().plot(figsize=(8,8))"

In [63]:
df.head()

Unnamed: 0_level_0,temp,clouds_all,weather_main,weather_description,traffic_volume
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2012-10-02 09:00:00,288.28,40,1,24,5545
2012-10-02 10:00:00,289.36,75,1,2,4516
2012-10-02 11:00:00,289.58,90,1,19,4767
2012-10-02 12:00:00,290.13,90,1,19,5026
2012-10-02 13:00:00,291.14,75,1,2,4918


In [64]:

X = df.drop('traffic_volume', axis = 1).values
X = preprocessing.scale(X)
y = df['traffic_volume'].values

print(f'X:{X}\n\n\ny:{y}')

X:[[ 0.50398167 -0.10855325 -0.42827505  0.87193277]
 [ 0.58214938  0.79623767 -0.42827505 -1.66932318]
 [ 0.59807243  1.1840052  -0.42827505  0.2943746 ]
 ...
 [ 0.10228654  1.1840052   2.95514791  0.52539787]
 [ 0.05596493  1.1840052  -0.42827505  0.2943746 ]
 [ 0.05813626  1.1840052  -0.42827505  0.2943746 ]]


y:[5545 4516 4767 ... 2159 1450  954]


## Train Test Split (0.7,0.3)

In [68]:
xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(xgb_model,
                   {'max_depth': [2,4,6],
                    'n_estimators': [50,100,200]}, verbose=1)
clf.fit(X, y)

print(clf.best_score_)
print(clf.best_params_)


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.07101454208962128
{'max_depth': 2, 'n_estimators': 50}


[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:   14.4s finished


In [67]:
accuracy = clf.score(X_test, y_test) # Teste

print(f'Squared Error: {round(accuracy*100, 3)}%')

Squared Error: 7.966%


## Cross Validation

In [65]:
rng = np.random.RandomState(31337)

kf = KFold(n_splits=10, shuffle=True, random_state=rng)

sme = []
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index])
    predictions = xgb_model.predict(X[test_index])
    actuals = y[test_index]
    mse.append(mean_squared_error(actuals, predictions))

print(f'mse list :{mse}\navg mse:{mean(mse)}')

[8.52548096185397,
 8.334576539113037,
 7.468091117812725,
 8.035840713759779,
 7.987176889192704,
 8.136824157787236,
 9.636363306530471,
 8.269014471782965,
 8.863007289834822,
 9.387610680633497]

In [58]:
results = model_selection.cross_val_score(xgb_model, X, y, cv=kf)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Accuracy: 8.92% (0.88%)
