# 08 - Kaggle - bike share system - Adding customer-value-logarithm average to the features


For problem formulation refer to **"01 - Kaggle - bike share system - problem formulation.ipynb"**.
In section **"02 - Kaggle - bike share system - Data preprocessing.ipynb"** we transformed the raw data and extracted time, date, and dummy matrices. The results are stored in two formats:
 * In `train_prep_orig.csv` and `test_prep_orig.csv` the categorical data are in the original form.
 * In `train_prep_dum.csv` and `test_prep_dum.csv` the categorical data are converted to dummy matrices. 

In section **03 - Kaggle - bike share system - data visualization.ipynb** we plotted the average number of customers at different time periods over 2011 and 2012 and discussed the patterns of customer behavior. We concluded by deciding to add these average values as new features, so that the machine learning model can use them as baseline values while the other features apply the corrections needed to bring the predictions closer to the actual values. That is the job of this section.

We split the hours of the day into 6 chunks of 4 hours each. These were the observations of customer behavior:
* 1: **[2:00 am, 3:00 am, 4:00 am, 5:00 am]**  ------> Both casual and registered (and therefore total) customers are highest during the weekend.
* 2: **[6:00 am, 7:00 am, 8:00 am, 9:00 am]**  ------> Casual customers stay within almost the same range on all days, below the week average. Registered customers use the system more on workdays, above the week average, while their usage during the weekends is below the week average.
* 3: **[10:00 am, 11:00 am, 12:00 pm, 1:00 pm]**  --> Both casual and registered (and therefore total) customers are highest during the weekend.
* 4: **[2:00 pm, 3:00 pm, 4:00 pm, 5:00 pm]**  ------> All are above the week average. The casual and total counts are highest during the weekend. For registered customers, all days are of the same order.
* 5: **[6:00 pm, 7:00 pm, 8:00 pm, 9:00 pm]**  ------> No comment.
* 6: **[10:00 pm, 11:00 pm, 12:00 am, 1:00 am]** --> All are below the week average, and the average number of customers is highest during the weekend.
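The chunking described in the list can be written as a small helper. This is a minimal sketch, assuming the chunks start at 2:00 am and each cover four consecutive hours, so that the last chunk wraps around midnight. The function name `period_of_hour` is hypothetical and does not appear in the notebook.

```python
def period_of_hour(hour):
    """Map an hour of the day (0-23) to one of six 4-hour periods.

    Period 1 starts at 2:00 am; period 6 covers hours 22, 23, 0 and 1.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    # Shift the clock so that 2:00 am becomes 0, then cut into blocks of 4.
    return ((hour - 2) % 24) // 4 + 1


# Period label for every hour of the day (illustration only).
periods = [period_of_hour(h) for h in range(24)]
```

Note that the `np.where` expression used later in this notebook assigns hours 22 and 23 to period 5, so it is not identical to this helper.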

Since we do not have the number of customers from day 20 to the end of each month, we have about three instances of each weekday for each month and year. We take the average of the log-transformed (`log1p`) number of customers over these days and use it as a basis. We call the new features **avg_l_casual**, **avg_l_registered** and **avg_l_tot**. Finally, we update the data sets as:
 * In `train_prep_orig_l_avg.csv` and `test_prep_orig_l_avg.csv` the categorical data are in the original form.
 * In `train_prep_dum_avg.csv` and `test_prep_dum_avg.csv` the categorical data are converted to dummy matrices. 
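
To illustrate the idea of using a group average as a feature, the sketch below broadcasts a group mean back to every row with `groupby(...).transform("mean")`. It assumes a toy frame with made-up values, not the Kaggle data, and it is not the approach the notebook itself takes further down.

```python
import numpy as np
import pandas as pd

# Toy data: two Mondays (weekday 0) and one Tuesday in the same month and period.
toy = pd.DataFrame({
    "year":    [2011, 2011, 2011],
    "month":   [1, 1, 1],
    "weekday": [0, 0, 1],
    "periods": [2, 2, 2],
    "casual":  [3, 8, 5],
})

# Work on the log1p scale, as the rest of this section does.
toy["l_casual"] = np.log1p(toy["casual"])

# Each row receives the mean of its (year, month, weekday, periods) group.
keys = ["year", "month", "weekday", "periods"]
toy["avg_l_casual"] = toy.groupby(keys)["l_casual"].transform("mean")
```

Both Monday rows end up with the same `avg_l_casual`, while the single Tuesday row keeps its own value.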



### Basic settings and importing the libraries

In [33]:
# Resets the namespace by removing all names defined by the user without asking for confirmation
%reset -f


# Pandas is used for DataFrame
import pandas as pd

# NumPy is used for manipulating arrays
import numpy as np

# MatPlotLib is used for plotting
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# the output of plotting commands is displayed inline directly below the code cell that produced it.
%matplotlib inline

# Seaborn is used for statistical plotting
import seaborn as sns

# Used to display DataFrames as HTML tables
from IPython.display import display

### Importing the train data from `train_prep_orig.csv`

In [34]:
#Load train data
data_train = pd.read_csv('data/train_prep_orig.csv')

print("The shape of the train dataset: {}".format(data_train.shape))
display(data_train.head())


The shape of the train dataset: (10886, 15)


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot
0,9.84,14.395,81.0,0.0,2011,1,1,5,0,0,0,1,3,13,16
1,9.02,13.635,80.0,0.0,2011,1,1,5,1,0,0,1,8,32,40
2,9.02,13.635,80.0,0.0,2011,1,1,5,2,0,0,1,5,27,32
3,9.84,14.395,75.0,0.0,2011,1,1,5,3,0,0,1,3,10,13
4,9.84,14.395,75.0,0.0,2011,1,1,5,4,0,0,1,0,1,1


## Adding the log1p values of target variables

In [35]:
data_train['l_casual']=np.log1p(data_train['casual'])
data_train['l_registered']=np.log1p(data_train['registered'])
data_train['l_tot']=np.log1p(data_train['tot'])

display(data_train.head())


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,l_casual,l_registered,l_tot
0,9.84,14.395,81.0,0.0,2011,1,1,5,0,0,0,1,3,13,16,1.386294,2.639057,2.833213
1,9.02,13.635,80.0,0.0,2011,1,1,5,1,0,0,1,8,32,40,2.197225,3.496508,3.713572
2,9.02,13.635,80.0,0.0,2011,1,1,5,2,0,0,1,5,27,32,1.791759,3.332205,3.496508
3,9.84,14.395,75.0,0.0,2011,1,1,5,3,0,0,1,3,10,13,1.386294,2.397895,2.639057
4,9.84,14.395,75.0,0.0,2011,1,1,5,4,0,0,1,0,1,1,0.0,0.693147,0.693147
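
For reference, `np.log1p(x)` computes ln(1 + x), so a count of 0 maps to 0 rather than to minus infinity. The small check below uses the standard-library `math` module instead of NumPy and the counts from the first rows of the table above. It is only a sketch; the use of `expm1` to map values back to counts is an assumption about how the transformed targets would be used later, not something shown in this section.

```python
import math

# log1p(x) = ln(1 + x); counts taken from the first rows shown above.
counts = [3, 13, 16, 0]
logged = [math.log1p(c) for c in counts]

# expm1 is the inverse transform: expm1(log1p(x)) == x up to rounding.
restored = [math.expm1(v) for v in logged]
```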


## features and dictionaries

In [40]:
cat_var = ['year','season', 'month', 'weekday', 'hour', 'workingday', 'holiday', 'weather']
num_var = ['temp', 'atemp', 'humidity', 'windspeed']
target_var = ['casual', 'registered', 'tot']
target_l_var = ['l_casual', 'l_registered', 'l_tot']

weekday_dic = {0:'Monday',
              1:'Tuesday',
              2:'Wednesday',
              3:'Thursday',
              4:'Friday',
              5:'Saturday',
              6:'Sunday'}

### Grouping by [`year`, `month`, `weekday`, `periods`]

Now we would like to learn how the number of customers changes across different months, weekdays and day periods. We only work with the `mean` here.

In [41]:
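# Assign each hour to a period. Note: as written, hours 22 and 23 fall into period 5
# (18 <= hour <= 23) and only hours 0 and 1 fall into period 6, which differs from
# the 4-hour chunks listed in the introduction.
hours = np.array(data_train.hour)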

data_train['periods']=np.where( (2 <= hours) &  (hours <= 5), 1,
                        np.where( (6 <= hours) &  (hours <= 9), 2,
                                 np.where( (10 <= hours) &  (hours <= 13), 3,
                                          np.where( (14 <= hours) &  (hours <= 17), 4,
                                                  np.where( (18 <= hours) &  (hours <= 23), 5, 6)
                                                  )
                                         )
                                )
                        )
display(data_train.head(20))

groupby_year_month_weekday_periods = data_train.groupby(['year','month','weekday','periods']).mean()[target_l_var]
display(groupby_year_month_weekday_periods.head(40))

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,l_casual,l_registered,l_tot,periods
0,9.84,14.395,81.0,0.0,2011,1,1,5,0,0,0,1,3,13,16,1.386294,2.639057,2.833213,6
1,9.02,13.635,80.0,0.0,2011,1,1,5,1,0,0,1,8,32,40,2.197225,3.496508,3.713572,6
2,9.02,13.635,80.0,0.0,2011,1,1,5,2,0,0,1,5,27,32,1.791759,3.332205,3.496508,1
3,9.84,14.395,75.0,0.0,2011,1,1,5,3,0,0,1,3,10,13,1.386294,2.397895,2.639057,1
4,9.84,14.395,75.0,0.0,2011,1,1,5,4,0,0,1,0,1,1,0.0,0.693147,0.693147,1
5,9.84,12.88,75.0,6.0032,2011,1,1,5,5,0,0,2,0,1,1,0.0,0.693147,0.693147,1
6,9.02,13.635,80.0,0.0,2011,1,1,5,6,0,0,1,2,0,2,1.098612,0.0,1.098612,2
7,8.2,12.88,86.0,0.0,2011,1,1,5,7,0,0,1,1,2,3,0.693147,1.098612,1.386294,2
8,9.84,14.395,75.0,0.0,2011,1,1,5,8,0,0,1,1,7,8,0.693147,2.079442,2.197225,2
9,13.12,17.425,76.0,0.0,2011,1,1,5,9,0,0,1,8,6,14,2.197225,1.94591,2.70805,2


year,month,weekday,periods,l_casual,l_registered,l_tot
2011,1,0,1,0.138629,1.173139,1.230675
2011,1,0,2,1.03878,3.816931,3.88224
2011,1,0,3,1.969823,3.866267,4.005175
2011,1,0,4,1.959106,4.312191,4.413583
2011,1,0,5,1.077602,3.653726,3.712849
2011,1,0,6,0.529676,1.647078,1.849811
2011,1,1,1,0.0,1.34668,1.34668
2011,1,1,2,0.89588,4.507533,4.521242
2011,1,1,3,1.542084,3.59704,3.691943
2011,1,1,4,1.631812,4.191619,4.279062


In [43]:
data_train['avg_l_casual'] = data_train['avg_l_registered'] = data_train['avg_l_tot'] = np.zeros(data_train.shape[0])
print(data_train.shape[0])
data_train.head()

10886


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,...,casual,registered,tot,l_casual,l_registered,l_tot,periods,avg_l_casual,avg_l_registered,avg_l_tot
0,9.84,14.395,81.0,0.0,2011,1,1,5,0,0,...,3,13,16,1.386294,2.639057,2.833213,6,0.0,0.0,0.0
1,9.02,13.635,80.0,0.0,2011,1,1,5,1,0,...,8,32,40,2.197225,3.496508,3.713572,6,0.0,0.0,0.0
2,9.02,13.635,80.0,0.0,2011,1,1,5,2,0,...,5,27,32,1.791759,3.332205,3.496508,1,0.0,0.0,0.0
3,9.84,14.395,75.0,0.0,2011,1,1,5,3,0,...,3,10,13,1.386294,2.397895,2.639057,1,0.0,0.0,0.0
4,9.84,14.395,75.0,0.0,2011,1,1,5,4,0,...,0,1,1,0.0,0.693147,0.693147,1,0.0,0.0,0.0


In [44]:
for i in range(data_train.shape[0]):
    if i % 1000 == 0: print(i)
    data_train.loc[i,'avg_l_casual'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)]['l_casual']

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [45]:
for i in range(data_train.shape[0]):
    if i % 1000 == 0: print(i)
    data_train.loc[i,'avg_l_registered'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), 
                                                                               level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)].l_registered


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [46]:
for i in range(data_train.shape[0]):
    if i % 1000 == 0: print(i)
    data_train.loc[i,'avg_l_tot'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)].l_tot

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
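
The three loops above look up one value per row, which is slow for roughly 11,000 rows. A possible vectorized alternative is to turn the grouped means into a regular table and join it back with `DataFrame.merge`. The sketch below runs on hypothetical toy data: the frames `means` and `rows` are stand-ins for `groupby_year_month_weekday_periods` and `data_train`, and their values are made up.

```python
import pandas as pd

keys = ["year", "month", "weekday", "periods"]

# Stand-in for the grouped means (values are made up for illustration).
means = pd.DataFrame({
    "year": [2011, 2011],
    "month": [1, 1],
    "weekday": [3, 3],
    "periods": [4, 5],
    "l_casual": [1.9, 0.8],
    "l_registered": [4.3, 4.0],
    "l_tot": [4.4, 4.1],
}).set_index(keys)

# Stand-in for the rows that need the new features.
rows = pd.DataFrame({
    "year": [2011, 2011, 2011],
    "month": [1, 1, 1],
    "weekday": [3, 3, 3],
    "periods": [4, 5, 5],
})

# Rename the mean columns to avg_*, move the keys back into columns, and join.
lookup = means.add_prefix("avg_").reset_index()
rows = rows.merge(lookup, on=keys, how="left")
```

With `how="left"` the row order of `rows` is preserved, and all three `avg_l_*` columns are filled in a single pass.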


In [47]:
data_train.loc[130:140]


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,...,casual,registered,tot,l_casual,l_registered,l_tot,periods,avg_l_casual,avg_l_registered,avg_l_tot
130,10.66,12.88,38.0,11.0014,2011,1,1,3,16,1,...,12,74,86,2.564949,4.317488,4.465908,4,1.929471,4.330747,4.420036
131,9.02,11.365,51.0,11.0014,2011,1,1,3,17,1,...,9,163,172,2.302585,5.099866,5.153292,4,1.929471,4.330747,4.420036
132,9.02,11.365,51.0,8.9981,2011,1,1,3,18,1,...,5,158,163,1.791759,5.068904,5.099866,5,0.803294,4.032818,4.06603
133,9.02,12.88,55.0,6.0032,2011,1,1,3,19,1,...,3,109,112,1.386294,4.70048,4.727388,5,0.803294,4.032818,4.06603
134,8.2,10.605,51.0,11.0014,2011,1,1,3,20,1,...,3,66,69,1.386294,4.204693,4.248495,5,0.803294,4.032818,4.06603
135,9.02,10.605,55.0,15.0013,2011,1,1,3,21,1,...,0,48,48,0.0,3.89182,3.89182,5,0.803294,4.032818,4.06603
136,9.02,10.605,51.0,19.0012,2011,1,1,3,22,1,...,1,51,52,0.693147,3.951244,3.970292,5,0.803294,4.032818,4.06603
137,8.2,9.85,59.0,12.998,2011,1,1,3,23,1,...,4,19,23,1.609438,2.995732,3.178054,5,0.803294,4.032818,4.06603
138,8.2,9.85,64.0,12.998,2011,1,1,4,0,1,...,4,13,17,1.609438,2.639057,2.890372,6,0.677013,2.232657,2.367406
139,8.2,9.85,69.0,15.0013,2011,1,1,4,1,1,...,2,5,7,1.098612,1.791759,2.079442,6,0.677013,2.232657,2.367406


In [48]:
data_train=data_train.astype('float')

In [49]:
data_train.dtypes

temp                float64
atemp               float64
humidity            float64
windspeed           float64
year                float64
season              float64
month               float64
weekday             float64
hour                float64
workingday          float64
holiday             float64
weather             float64
casual              float64
registered          float64
tot                 float64
l_casual            float64
l_registered        float64
l_tot               float64
periods             float64
avg_l_casual        float64
avg_l_registered    float64
avg_l_tot           float64
dtype: object

### Saving `train_prep_orig_l_avg.csv`

In [50]:
cat_var = ['year','season', 'month', 'weekday', 'hour', 'workingday', 'holiday', 'weather']
num_var = ['temp', 'atemp', 'humidity', 'windspeed']
avg_var = ['avg_casual','avg_registered','avg_tot']
avg_l_var = ['avg_l_casual','avg_l_registered','avg_l_tot']
target_var = ['casual', 'registered', 'tot']
target_l_var = ['l_casual', 'l_registered', 'l_tot']



display(data_train[num_var+cat_var+avg_l_var+target_l_var].head())

data_train[num_var+cat_var+avg_l_var+target_l_var].to_csv('data/train_prep_orig_l_avg.csv', index=False)

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,avg_l_casual,avg_l_registered,avg_l_tot,l_casual,l_registered,l_tot
0,9.84,14.395,81.0,0.0,2011.0,1.0,1.0,5.0,0.0,0.0,0.0,1.0,1.242453,3.054927,3.174986,1.386294,2.639057,2.833213
1,9.02,13.635,80.0,0.0,2011.0,1.0,1.0,5.0,1.0,0.0,0.0,1.0,1.242453,3.054927,3.174986,2.197225,3.496508,3.713572
2,9.02,13.635,80.0,0.0,2011.0,1.0,1.0,5.0,2.0,0.0,0.0,1.0,0.438125,1.787425,1.847208,1.791759,3.332205,3.496508
3,9.84,14.395,75.0,0.0,2011.0,1.0,1.0,5.0,3.0,0.0,0.0,1.0,0.438125,1.787425,1.847208,1.386294,2.397895,2.639057
4,9.84,14.395,75.0,0.0,2011.0,1.0,1.0,5.0,4.0,0.0,0.0,1.0,0.438125,1.787425,1.847208,0.0,0.693147,0.693147


The next step is to add the new features to the test data set. First we load the data set and assign each row its period, so that the train-set averages grouped by [`year`, `month`, `weekday`, `periods`] can be looked up.

In [52]:
data_test_avg = pd.read_csv('data/test_prep_orig.csv')
data_test_avg.head()

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather
0,10.66,11.365,56.0,26.0027,2011,1,1,3,0,1,0,1
1,10.66,13.635,56.0,0.0,2011,1,1,3,1,1,0,1
2,10.66,13.635,56.0,0.0,2011,1,1,3,2,1,0,1
3,10.66,12.88,56.0,11.0014,2011,1,1,3,3,1,0,1
4,10.66,12.88,56.0,11.0014,2011,1,1,3,4,1,0,1


In [53]:
# Same period assignment as for the train set (hours 22 and 23 fall into period 5).
hours = np.array(data_test_avg.hour)

data_test_avg['periods']=np.where( (2 <= hours) &  (hours <= 5), 1,
                           np.where( (6 <= hours) &  (hours <= 9), 2,
                                 np.where( (10 <= hours) &  (hours <= 13), 3,
                                          np.where( (14 <= hours) &  (hours <= 17), 4,
                                                  np.where( (18 <= hours) &  (hours <= 23), 5, 6)
                                                  )
                                         )
                                )
                        )
display(data_test_avg.head(10))

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods
0,10.66,11.365,56.0,26.0027,2011,1,1,3,0,1,0,1,6
1,10.66,13.635,56.0,0.0,2011,1,1,3,1,1,0,1,6
2,10.66,13.635,56.0,0.0,2011,1,1,3,2,1,0,1,1
3,10.66,12.88,56.0,11.0014,2011,1,1,3,3,1,0,1,1
4,10.66,12.88,56.0,11.0014,2011,1,1,3,4,1,0,1,1
5,9.84,11.365,60.0,15.0013,2011,1,1,3,5,1,0,1,1
6,9.02,10.605,60.0,15.0013,2011,1,1,3,6,1,0,1,2
7,9.02,10.605,55.0,15.0013,2011,1,1,3,7,1,0,1,2
8,9.02,10.605,55.0,19.0012,2011,1,1,3,8,1,0,1,2
9,9.84,11.365,52.0,15.0013,2011,1,1,3,9,1,0,2,2


In [54]:
data_test_avg['avg_l_casual'] = data_test_avg['avg_l_registered'] = data_test_avg['avg_l_tot'] = np.zeros(data_test_avg.shape[0])
print(data_test_avg.shape[0])
data_test_avg.head(10)

6493


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods,avg_l_casual,avg_l_registered,avg_l_tot
0,10.66,11.365,56.0,26.0027,2011,1,1,3,0,1,0,1,6,0.0,0.0,0.0
1,10.66,13.635,56.0,0.0,2011,1,1,3,1,1,0,1,6,0.0,0.0,0.0
2,10.66,13.635,56.0,0.0,2011,1,1,3,2,1,0,1,1,0.0,0.0,0.0
3,10.66,12.88,56.0,11.0014,2011,1,1,3,3,1,0,1,1,0.0,0.0,0.0
4,10.66,12.88,56.0,11.0014,2011,1,1,3,4,1,0,1,1,0.0,0.0,0.0
5,9.84,11.365,60.0,15.0013,2011,1,1,3,5,1,0,1,1,0.0,0.0,0.0
6,9.02,10.605,60.0,15.0013,2011,1,1,3,6,1,0,1,2,0.0,0.0,0.0
7,9.02,10.605,55.0,15.0013,2011,1,1,3,7,1,0,1,2,0.0,0.0,0.0
8,9.02,10.605,55.0,19.0012,2011,1,1,3,8,1,0,1,2,0.0,0.0,0.0
9,9.84,11.365,52.0,15.0013,2011,1,1,3,9,1,0,2,2,0.0,0.0,0.0


In [55]:
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print(i)
    data_test_avg.loc[i,'avg_l_casual'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].l_casual

0
1000
2000
3000
4000
5000
6000


In [56]:
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print(i)
    data_test_avg.loc[i,'avg_l_registered'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), 
                                                                               level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].l_registered

0
1000
2000
3000
4000
5000
6000


In [57]:
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print(i)
    data_test_avg.loc[i,'avg_l_tot'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].l_tot

0
1000
2000
3000
4000
5000
6000


In [58]:
data_test_avg.head(10)

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods,avg_l_casual,avg_l_registered,avg_l_tot
0,10.66,11.365,56.0,26.0027,2011,1,1,3,0,1,0,1,6,0.173287,1.784717,1.8181
1,10.66,13.635,56.0,0.0,2011,1,1,3,1,1,0,1,6,0.173287,1.784717,1.8181
2,10.66,13.635,56.0,0.0,2011,1,1,3,2,1,0,1,1,0.0,1.268834,1.268834
3,10.66,12.88,56.0,11.0014,2011,1,1,3,3,1,0,1,1,0.0,1.268834,1.268834
4,10.66,12.88,56.0,11.0014,2011,1,1,3,4,1,0,1,1,0.0,1.268834,1.268834
5,9.84,11.365,60.0,15.0013,2011,1,1,3,5,1,0,1,1,0.0,1.268834,1.268834
6,9.02,10.605,60.0,15.0013,2011,1,1,3,6,1,0,1,2,0.777822,4.523827,4.53671
7,9.02,10.605,55.0,15.0013,2011,1,1,3,7,1,0,1,2,0.777822,4.523827,4.53671
8,9.02,10.605,55.0,19.0012,2011,1,1,3,8,1,0,1,2,0.777822,4.523827,4.53671
9,9.84,11.365,52.0,15.0013,2011,1,1,3,9,1,0,2,2,0.777822,4.523827,4.53671
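
The lookup for the test set relies on every (`year`, `month`, `weekday`, `periods`) combination in the test data also occurring in the train data. One hedged way to check this is a left merge with `indicator=True`, sketched here on hypothetical toy frames (`train_keys` and `test_rows` are made-up stand-ins, not the notebook's data).

```python
import pandas as pd

keys = ["year", "month", "weekday", "periods"]

# Hypothetical key combinations seen in the train set.
train_keys = pd.DataFrame({
    "year": [2011, 2011], "month": [1, 1], "weekday": [3, 3], "periods": [1, 2],
})

# Hypothetical test rows; the last one has a combination absent from train_keys.
test_rows = pd.DataFrame({
    "year": [2011, 2011, 2011], "month": [1, 1, 1],
    "weekday": [3, 3, 4], "periods": [1, 2, 1],
})

# indicator=True adds a `_merge` column: "both" if the key was found, else "left_only".
checked = test_rows.merge(train_keys.drop_duplicates(), on=keys,
                          how="left", indicator=True)
missing = checked[checked["_merge"] == "left_only"]
```

An empty `missing` frame would mean every test row has a matching train-set average.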


In [59]:
# Checking the last three. Interestingly, the table below has no row for hour == 3

data_train.loc[115:124]


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,...,casual,registered,tot,l_casual,l_registered,l_tot,periods,avg_l_casual,avg_l_registered,avg_l_tot
115,7.38,12.12,55.0,0.0,2011.0,1.0,1.0,3.0,0.0,1.0,...,0.0,11.0,11.0,0.0,2.484907,2.484907,6.0,0.173287,1.784717,1.8181
116,6.56,11.365,64.0,0.0,2011.0,1.0,1.0,3.0,1.0,1.0,...,0.0,4.0,4.0,0.0,1.609438,1.609438,6.0,0.173287,1.784717,1.8181
117,6.56,11.365,64.0,0.0,2011.0,1.0,1.0,3.0,2.0,1.0,...,0.0,2.0,2.0,0.0,1.098612,1.098612,1.0,0.0,1.268834,1.268834
118,6.56,9.85,64.0,6.0032,2011.0,1.0,1.0,3.0,4.0,1.0,...,0.0,1.0,1.0,0.0,0.693147,0.693147,1.0,0.0,1.268834,1.268834
119,5.74,9.09,69.0,6.0032,2011.0,1.0,1.0,3.0,5.0,1.0,...,0.0,4.0,4.0,0.0,1.609438,1.609438,1.0,0.0,1.268834,1.268834
120,5.74,8.335,63.0,7.0015,2011.0,1.0,1.0,3.0,6.0,1.0,...,0.0,36.0,36.0,0.0,3.610918,3.610918,2.0,0.777822,4.523827,4.53671
121,6.56,11.365,59.0,0.0,2011.0,1.0,1.0,3.0,7.0,1.0,...,0.0,95.0,95.0,0.0,4.564348,4.564348,2.0,0.777822,4.523827,4.53671
122,6.56,11.365,59.0,0.0,2011.0,1.0,1.0,3.0,8.0,1.0,...,3.0,216.0,219.0,1.386294,5.379897,5.393628,2.0,0.777822,4.523827,4.53671
123,7.38,12.12,51.0,0.0,2011.0,1.0,1.0,3.0,9.0,1.0,...,6.0,116.0,122.0,1.94591,4.762174,4.812184,2.0,0.777822,4.523827,4.53671
124,8.2,12.88,47.0,0.0,2011.0,1.0,1.0,3.0,10.0,1.0,...,3.0,42.0,45.0,1.386294,3.7612,3.828641,3.0,1.52359,3.959058,4.035925
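
The comment in the cell above notes that hour 3 is absent from this slice. A gap like that can be located with a set difference. The snippet below is a minimal sketch that hard-codes the `hour` values visible in the displayed table, rather than reading them from `data_train`.

```python
# Hour values of rows 115..124 as displayed in the table above.
observed_hours = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10]

# Every hour between the first and last observed hour should be present.
expected_hours = set(range(min(observed_hours), max(observed_hours) + 1))
missing_hours = sorted(expected_hours - set(observed_hours))
```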


In [62]:
display(data_test_avg[num_var+cat_var+avg_l_var].head())

data_test_avg[num_var+cat_var+avg_l_var].to_csv('data/test_prep_orig_l_avg.csv', index=False)


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,avg_l_casual,avg_l_registered,avg_l_tot
0,10.66,11.365,56.0,26.0027,2011,1,1,3,0,1,0,1,0.173287,1.784717,1.8181
1,10.66,13.635,56.0,0.0,2011,1,1,3,1,1,0,1,0.173287,1.784717,1.8181
2,10.66,13.635,56.0,0.0,2011,1,1,3,2,1,0,1,0.0,1.268834,1.268834
3,10.66,12.88,56.0,11.0014,2011,1,1,3,3,1,0,1,0.0,1.268834,1.268834
4,10.66,12.88,56.0,11.0014,2011,1,1,3,4,1,0,1,0.0,1.268834,1.268834
