# 04 - Kaggle - bike share system - Adding customer average to the features


For problem formulation refer to **"01 - Kaggle - bike share system - problem formulation.ipynb"**.
In section **"02 - Kaggle - bike share system - Data preprocessing.ipynb"** we transformed the raw data and extracted time, date, and dummy matrices. The results are stored in two formats:
 * In `train_prep_orig.csv` and `test_prep_orig.csv` the categorical data are in the original form.
 * In `train_prep_dum.csv` and `test_prep_dum.csv` the categorical data are converted to dummy matrices. 

In section **03 - Kaggle - bike share system - data visualization.ipynb** we plotted the average number of customers at different time periods over 2011 and 2012 and discussed the patterns in customer behavior. We concluded by deciding to use these average values as new features, so that the machine learning model can treat them as baseline values and let the other features apply the corrections needed to bring the predictions closer to the actual values. That is the job of this section.

We split the hours of the day into 6 periods of 4 hours each. These were the observations of customer behavior:
* 1: **[2:00 am, 3:00 am, 4:00 am, 5:00 am]** --> Both casual and registered (and therefore total) customers are highest during the weekend.
* 2: **[6:00 am, 7:00 am, 8:00 am, 9:00 am]** --> Casual customers stay within almost the same range on all days, below the week average. Registered customers use the system more on workdays (above the week average) and less on weekends (below the week average).
* 3: **[10:00 am, 11:00 am, 12:00 pm, 1:00 pm]** --> Both casual and registered (and therefore total) customers are highest during the weekend.
* 4: **[2:00 pm, 3:00 pm, 4:00 pm, 5:00 pm]** --> All are above the week average. The casual and total counts are highest during the weekend. For registered customers, all days are of the same order.
* 5: **[6:00 pm, 7:00 pm, 8:00 pm, 9:00 pm]** --> No comment.
* 6: **[10:00 pm, 11:00 pm, 12:00 am, 1:00 am]** --> All are below the week average, and the average number of customers is highest during the weekend.
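The six 4-hour periods listed above can be computed with a little modular arithmetic. The sketch below is only an illustration of that description; `hour_to_period` is a hypothetical helper name, not something defined in this notebook. Note that the cell further down, as written, assigns hours 18-23 to period 5 and only hours 0-1 to period 6, so the outputs shown in this notebook follow that binning rather than this one.

```python
import numpy as np

def hour_to_period(hours):
    """Map hours 0-23 to periods 1-6, where period 1 starts at 2:00 am.

    Hypothetical helper: it follows the 4-hour split described in the text.
    """
    hours = np.asarray(hours)
    # Shift so that 2:00 am becomes 0, wrap around midnight,
    # then cut the day into six blocks of four hours.
    return ((hours - 2) % 24) // 4 + 1

# Hours 22, 23, 0 and 1 all land in period 6 under this scheme.
late_hours = hour_to_period([22, 23, 0, 1])
print(late_hours.tolist())
```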

Since we do not have the number of customers from day 20 to the end of each month, we have about three instances of each weekday for each month and year. We take the average number of customers over these days, for each period, and use it as a baseline. We call the new features **avg_casual**, **avg_registered** and **avg_tot**. Finally, we update the data sets as:
 * In `train_prep_orig_avg.csv` and `test_prep_orig_avg.csv` the categorical data are in the original form.
 * In `train_prep_dum_avg.csv` and `test_prep_dum_avg.csv` the categorical data are converted to dummy matrices. 



### Basic settings and importing the libraries

In [162]:
# Resets the namespace by removing all names defined by the user without asking for confirmation
%reset -f


# Pandas is used for DataFrame
import pandas as pd

# NumPy is used for manipulating arrays
import numpy as np

# MatPlotLib is used for plotting
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# the output of plotting commands is displayed inline directly below the code cell that produced it.
%matplotlib inline

# Seaborn is used for statistical plotting
import seaborn as sns

# Used to display DataFrames as HTML tables
from IPython.display import display

### Importing the train data from `train_prep_orig.csv`

In [163]:
#Load train data
data_train = pd.read_csv('data/train_prep_orig.csv')

print "The shape of the train dataset:", data_train.shape
display(data_train.head())


The shape of the train dataset: (10886, 15)


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot
0,9.84,14.395,81,0,2011,1,1,5,0,0,0,1,3,13,16
1,9.02,13.635,80,0,2011,1,1,5,1,0,0,1,8,32,40
2,9.02,13.635,80,0,2011,1,1,5,2,0,0,1,5,27,32
3,9.84,14.395,75,0,2011,1,1,5,3,0,0,1,3,10,13
4,9.84,14.395,75,0,2011,1,1,5,4,0,0,1,0,1,1


### Features and dictionaries

In [164]:
cat_var = ['year','season', 'month', 'weekday', 'hour', 'workingday', 'holiday', 'weather']
num_var = ['temp', 'atemp', 'humidity', 'windspeed']
target_var = ['casual', 'registered', 'tot']

weekday_dic = {0:'Monday',
              1:'Tuesday',
              2:'Wednesday',
              3:'Thursday',
              4:'Friday',
              5:'Saturday',
              6:'Sunday'}

### Grouping by ['`year`','`month`','`weekday`','`periods`']

Now we would like to learn how the number of customers changes across different months, weekdays and day periods. We only work with the `mean` here.

In [165]:
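# Assign each hour to a period. Note that, as written, hours 18-23 fall in
# period 5 and only hours 0-1 fall in period 6.
hours = np.array(data_train.hour)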

data_train['periods']=np.where( (2 <= hours) &  (hours <= 5), 1,
                        np.where( (6 <= hours) &  (hours <= 9), 2,
                                 np.where( (10 <= hours) &  (hours <= 13), 3,
                                          np.where( (14 <= hours) &  (hours <= 17), 4,
                                                  np.where( (18 <= hours) &  (hours <= 23), 5, 6)
                                                  )
                                         )
                                )
                        )
display(data_train.head(20))

groupby_year_month_weekday_periods = data_train.groupby(['year','month','weekday','periods']).mean()[target_var]
display(groupby_year_month_weekday_periods.head(40))

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,periods
0,9.84,14.395,81,0.0,2011,1,1,5,0,0,0,1,3,13,16,6
1,9.02,13.635,80,0.0,2011,1,1,5,1,0,0,1,8,32,40,6
2,9.02,13.635,80,0.0,2011,1,1,5,2,0,0,1,5,27,32,1
3,9.84,14.395,75,0.0,2011,1,1,5,3,0,0,1,3,10,13,1
4,9.84,14.395,75,0.0,2011,1,1,5,4,0,0,1,0,1,1,1
5,9.84,12.88,75,6.0032,2011,1,1,5,5,0,0,2,0,1,1,1
6,9.02,13.635,80,0.0,2011,1,1,5,6,0,0,1,2,0,2,2
7,8.2,12.88,86,0.0,2011,1,1,5,7,0,0,1,1,2,3,2
8,9.84,14.395,75,0.0,2011,1,1,5,8,0,0,1,1,7,8,2
9,13.12,17.425,76,0.0,2011,1,1,5,9,0,0,1,8,6,14,2


year,month,weekday,periods,casual,registered,tot
2011,1,0,1,0.2,2.6,2.8
2011,1,0,2,2.666667,66.0,68.666667
2011,1,0,3,7.666667,48.833333,56.5
2011,1,0,4,8.166667,80.0,88.166667
2011,1,0,5,2.722222,54.833333,57.555556
2011,1,0,6,0.833333,6.833333,7.666667
2011,1,1,1,0.0,3.2,3.2
2011,1,1,2,1.75,108.5,110.25
2011,1,1,3,5.7,44.0,49.7
2011,1,1,4,6.0,76.833333,82.833333
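The grouped table above is indexed by the four grouping keys, so a single average can be read off with one tuple lookup. The following is a minimal sketch on made-up toy data (the values are not taken from the bike share data set); `toy` and `group_means` are illustrative names.

```python
import pandas as pd

# Toy data: two Mondays (weekday 0) and two Tuesdays (weekday 1) in period 1.
toy = pd.DataFrame({
    'year':    [2011, 2011, 2011, 2011],
    'month':   [1, 1, 1, 1],
    'weekday': [0, 0, 1, 1],
    'periods': [1, 1, 1, 1],
    'casual':  [0, 2, 1, 3],
})

# Same kind of grouping as in the notebook, restricted to one target column.
group_means = toy.groupby(['year', 'month', 'weekday', 'periods'])[['casual']].mean()

# A fully specified key tuple returns a scalar.
monday_avg = group_means.loc[(2011, 1, 0, 1), 'casual']
print(monday_avg)
```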


In [166]:
data_train['avg_casual'] = data_train['avg_registered'] = data_train['avg_tot'] = np.zeros(data_train.shape[0])
print data_train.shape[0]
data_train.head()

10886


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,periods,avg_casual,avg_registered,avg_tot
0,9.84,14.395,81,0,2011,1,1,5,0,0,0,1,3,13,16,6,0,0,0
1,9.02,13.635,80,0,2011,1,1,5,1,0,0,1,8,32,40,6,0,0,0
2,9.02,13.635,80,0,2011,1,1,5,2,0,0,1,5,27,32,1,0,0,0
3,9.84,14.395,75,0,2011,1,1,5,3,0,0,1,3,10,13,1,0,0,0
4,9.84,14.395,75,0,2011,1,1,5,4,0,0,1,0,1,1,1,0,0,0


In [167]:
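# For every row, look up the group mean of `casual` that matches the row's
# (year, month, weekday, periods) and store it in `avg_casual`.
for i in range(data_train.shape[0]):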
    if i % 1000 == 0: print i
    data_train.loc[i,'avg_casual'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)].casual

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [168]:
for i in range(data_train.shape[0]):
    if i % 1000 == 0: print i
    data_train.loc[i,'avg_registered'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), 
                                                                               level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)].registered


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [169]:
for i in range(data_train.shape[0]):
    if i % 1000 == 0: print i
    data_train.loc[i,'avg_tot'] = groupby_year_month_weekday_periods.xs(int(data_train.loc[i].year), level ='year'
                                     ).xs(int(data_train.loc[i].month), level ='month'
                                         ).xs(int(data_train.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_train.loc[i].periods)].tot

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
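The three row-by-row loops above each perform one lookup per row. As a possible alternative, not the method used in this notebook, the same averages can be attached in one step by merging the group means back onto the rows. The sketch below uses made-up toy data; `keys`, `targets`, `means` and `with_avg` are illustrative names.

```python
import pandas as pd

keys = ['year', 'month', 'weekday', 'periods']
targets = ['casual', 'registered', 'tot']

# Toy data standing in for data_train (values are invented).
toy = pd.DataFrame({
    'year':       [2011, 2011, 2011, 2011],
    'month':      [1, 1, 1, 1],
    'weekday':    [5, 5, 5, 6],
    'periods':    [1, 1, 2, 1],
    'casual':     [2, 4, 1, 5],
    'registered': [10, 20, 7, 9],
})
toy['tot'] = toy['casual'] + toy['registered']

# Group means, renamed to avg_casual / avg_registered / avg_tot,
# with the grouping keys turned back into ordinary columns.
means = toy.groupby(keys)[targets].mean().add_prefix('avg_').reset_index()

# A left merge keeps every original row, in order, and adds the averages.
with_avg = toy.merge(means, on=keys, how='left')
print(with_avg[['weekday', 'periods', 'avg_casual', 'avg_tot']])
```

A merge is used here rather than a group-wise transform because the same table of means can later be joined onto a different data set, such as the test set.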


In [170]:
data_train.loc[130:140]


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,periods,avg_casual,avg_registered,avg_tot
130,10.66,12.88,38,11.0014,2011,1,1,3,16,1,0,1,12,74,86,4,6.75,84.75,91.5
131,9.02,11.365,51,11.0014,2011,1,1,3,17,1,0,1,9,163,172,4,6.75,84.75,91.5
132,9.02,11.365,51,8.9981,2011,1,1,3,18,1,0,1,5,158,163,5,1.75,69.833333,71.583333
133,9.02,12.88,55,6.0032,2011,1,1,3,19,1,0,1,3,109,112,5,1.75,69.833333,71.583333
134,8.2,10.605,51,11.0014,2011,1,1,3,20,1,0,1,3,66,69,5,1.75,69.833333,71.583333
135,9.02,10.605,55,15.0013,2011,1,1,3,21,1,0,2,0,48,48,5,1.75,69.833333,71.583333
136,9.02,10.605,51,19.0012,2011,1,1,3,22,1,0,2,1,51,52,5,1.75,69.833333,71.583333
137,8.2,9.85,59,12.998,2011,1,1,3,23,1,0,2,4,19,23,5,1.75,69.833333,71.583333
138,8.2,9.85,64,12.998,2011,1,1,4,0,1,0,2,4,13,17,6,1.5,9.25,10.75
139,8.2,9.85,69,15.0013,2011,1,1,4,1,1,0,2,2,5,7,6,1.5,9.25,10.75


In [171]:
data_train=data_train.astype('float')

In [172]:
data_train.dtypes

temp              float64
atemp             float64
humidity          float64
windspeed         float64
year              float64
season            float64
month             float64
weekday           float64
hour              float64
workingday        float64
holiday           float64
weather           float64
casual            float64
registered        float64
tot               float64
periods           float64
avg_casual        float64
avg_registered    float64
avg_tot           float64
dtype: object

### Saving `train_prep_orig_avg.csv`

In [173]:
cat_var = ['year','season', 'month', 'weekday', 'hour', 'workingday', 'holiday', 'weather']
num_var = ['temp', 'atemp', 'humidity', 'windspeed']
avg_var = ['avg_casual','avg_registered','avg_tot']
target_var = ['casual', 'registered', 'tot']


display(data_train[num_var+cat_var+avg_var+target_var].head())

data_train[num_var+cat_var+avg_var+target_var].to_csv('data/train_prep_orig_avg.csv', index=False)

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,avg_casual,avg_registered,avg_tot,casual,registered,tot
0,9.84,14.395,81,0,2011,1,1,5,0,0,0,1,3,21.166667,24.166667,3,13,16
1,9.02,13.635,80,0,2011,1,1,5,1,0,0,1,3,21.166667,24.166667,8,32,40
2,9.02,13.635,80,0,2011,1,1,5,2,0,0,1,1,7.5,8.5,5,27,32
3,9.84,14.395,75,0,2011,1,1,5,3,0,0,1,1,7.5,8.5,3,10,13
4,9.84,14.395,75,0,2011,1,1,5,4,0,0,1,1,7.5,8.5,0,1,1


### Saving `train_prep_dum_avg.csv`

In [174]:
dummy_train = pd.read_csv('data/train_prep_dum.csv')

display(dummy_train.head())
# `.ix` has been removed from recent pandas versions; `.loc` performs the same label-based slice
dummy_train_avg = dummy_train.loc[:,'temp':'d_weather_4'].copy()

dummy_train_avg['avg_casual'] = data_train['avg_casual']
dummy_train_avg['avg_registered'] = data_train['avg_registered']
dummy_train_avg['avg_tot'] = data_train['avg_tot']

dummy_train_avg['casual'] = dummy_train['casual']
dummy_train_avg['registered'] = dummy_train['registered']
dummy_train_avg['tot'] = dummy_train['tot']

display(dummy_train_avg.head())
display(data_train[num_var+cat_var+avg_var+target_var].head())

dummy_train_avg.to_csv('data/train_prep_dum_avg.csv', index=False)

Unnamed: 0,temp,atemp,humidity,windspeed,is2011,d_season_1,d_season_2,d_season_3,d_season_4,d_month_1,...,d_workingday_1,d_holiday_0,d_holiday_1,d_weather_1,d_weather_2,d_weather_3,d_weather_4,casual,registered,tot
0,9.84,14.395,81,0,1,1,0,0,0,1,...,0,1,0,1,0,0,0,3,13,16
1,9.02,13.635,80,0,1,1,0,0,0,1,...,0,1,0,1,0,0,0,8,32,40
2,9.02,13.635,80,0,1,1,0,0,0,1,...,0,1,0,1,0,0,0,5,27,32
3,9.84,14.395,75,0,1,1,0,0,0,1,...,0,1,0,1,0,0,0,3,10,13
4,9.84,14.395,75,0,1,1,0,0,0,1,...,0,1,0,1,0,0,0,0,1,1


Unnamed: 0,temp,atemp,humidity,windspeed,is2011,d_season_1,d_season_2,d_season_3,d_season_4,d_month_1,...,d_weather_1,d_weather_2,d_weather_3,d_weather_4,avg_casual,avg_registered,avg_tot,casual,registered,tot
0,9.84,14.395,81,0,1,1,0,0,0,1,...,1,0,0,0,3,21.166667,24.166667,3,13,16
1,9.02,13.635,80,0,1,1,0,0,0,1,...,1,0,0,0,3,21.166667,24.166667,8,32,40
2,9.02,13.635,80,0,1,1,0,0,0,1,...,1,0,0,0,1,7.5,8.5,5,27,32
3,9.84,14.395,75,0,1,1,0,0,0,1,...,1,0,0,0,1,7.5,8.5,3,10,13
4,9.84,14.395,75,0,1,1,0,0,0,1,...,1,0,0,0,1,7.5,8.5,0,1,1


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,avg_casual,avg_registered,avg_tot,casual,registered,tot
0,9.84,14.395,81,0,2011,1,1,5,0,0,0,1,3,21.166667,24.166667,3,13,16
1,9.02,13.635,80,0,2011,1,1,5,1,0,0,1,3,21.166667,24.166667,8,32,40
2,9.02,13.635,80,0,2011,1,1,5,2,0,0,1,1,7.5,8.5,5,27,32
3,9.84,14.395,75,0,2011,1,1,5,3,0,0,1,1,7.5,8.5,3,10,13
4,9.84,14.395,75,0,2011,1,1,5,4,0,0,1,1,7.5,8.5,0,1,1
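The feature columns of the dummy data set are selected above with a label-based column slice from `temp` to `d_weather_4`. A small sketch of how such a slice behaves with `.loc`, the current pandas spelling for this kind of selection, on a made-up frame (the column set here is a shortened stand-in, not the real file layout):

```python
import pandas as pd

frame = pd.DataFrame({
    'row_id':      [0, 1],
    'temp':        [9.84, 9.02],
    'd_weather_1': [1, 0],
    'd_weather_4': [0, 0],
    'tot':         [16, 40],
})

# Label slices in .loc include BOTH endpoints, unlike positional slices.
features = frame.loc[:, 'temp':'d_weather_4'].copy()
print(list(features.columns))
```

Because both endpoints are included, the slice relies on the column order in the CSV file: everything between the two labels is kept, and columns outside them (here `row_id` and `tot`) are dropped.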


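The next step is to add the new features to the test data set. First we load the data set and assign each row its `periods` value. The averages are then looked up, per ['`year`','`month`','`weekday`','`periods`'], in the group means computed from the training data above, since the test set has no customer counts of its own.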

In [175]:
data_test_avg = pd.read_csv('data/test_prep_orig.csv')
data_test_avg.head()

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather
0,10.66,11.365,56,26.0027,2011,1,1,3,0,1,0,1
1,10.66,13.635,56,0.0,2011,1,1,3,1,1,0,1
2,10.66,13.635,56,0.0,2011,1,1,3,2,1,0,1
3,10.66,12.88,56,11.0014,2011,1,1,3,3,1,0,1
4,10.66,12.88,56,11.0014,2011,1,1,3,4,1,0,1


In [176]:
hours = np.array(data_test_avg.hour)

data_test_avg['periods']=np.where( (2 <= hours) &  (hours <= 5), 1,
                           np.where( (6 <= hours) &  (hours <= 9), 2,
                                 np.where( (10 <= hours) &  (hours <= 13), 3,
                                          np.where( (14 <= hours) &  (hours <= 17), 4,
                                                  np.where( (18 <= hours) &  (hours <= 23), 5, 6)
                                                  )
                                         )
                                )
                        )
display(data_test_avg.head(10))

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods
0,10.66,11.365,56,26.0027,2011,1,1,3,0,1,0,1,6
1,10.66,13.635,56,0.0,2011,1,1,3,1,1,0,1,6
2,10.66,13.635,56,0.0,2011,1,1,3,2,1,0,1,1
3,10.66,12.88,56,11.0014,2011,1,1,3,3,1,0,1,1
4,10.66,12.88,56,11.0014,2011,1,1,3,4,1,0,1,1
5,9.84,11.365,60,15.0013,2011,1,1,3,5,1,0,1,1
6,9.02,10.605,60,15.0013,2011,1,1,3,6,1,0,1,2
7,9.02,10.605,55,15.0013,2011,1,1,3,7,1,0,1,2
8,9.02,10.605,55,19.0012,2011,1,1,3,8,1,0,1,2
9,9.84,11.365,52,15.0013,2011,1,1,3,9,1,0,2,2


In [177]:
data_test_avg['avg_casual'] = data_test_avg['avg_registered'] = data_test_avg['avg_tot'] = np.zeros(data_test_avg.shape[0])
print data_test_avg.shape[0]
data_test_avg.head(10)

6493


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods,avg_casual,avg_registered,avg_tot
0,10.66,11.365,56,26.0027,2011,1,1,3,0,1,0,1,6,0,0,0
1,10.66,13.635,56,0.0,2011,1,1,3,1,1,0,1,6,0,0,0
2,10.66,13.635,56,0.0,2011,1,1,3,2,1,0,1,1,0,0,0
3,10.66,12.88,56,11.0014,2011,1,1,3,3,1,0,1,1,0,0,0
4,10.66,12.88,56,11.0014,2011,1,1,3,4,1,0,1,1,0,0,0
5,9.84,11.365,60,15.0013,2011,1,1,3,5,1,0,1,1,0,0,0
6,9.02,10.605,60,15.0013,2011,1,1,3,6,1,0,1,2,0,0,0
7,9.02,10.605,55,15.0013,2011,1,1,3,7,1,0,1,2,0,0,0
8,9.02,10.605,55,19.0012,2011,1,1,3,8,1,0,1,2,0,0,0
9,9.84,11.365,52,15.0013,2011,1,1,3,9,1,0,2,2,0,0,0


In [178]:
# The test rows reuse the group means computed from the training data.
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print i
    data_test_avg.loc[i,'avg_casual'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].casual

0
1000
2000
3000
4000
5000
6000


In [179]:
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print i
    data_test_avg.loc[i,'avg_registered'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), 
                                                                               level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].registered

0
1000
2000
3000
4000
5000
6000


In [180]:
for i in range(data_test_avg.shape[0]):
    if i % 1000 == 0: print i
    data_test_avg.loc[i,'avg_tot'] = groupby_year_month_weekday_periods.xs(int(data_test_avg.loc[i].year), level ='year'
                                     ).xs(int(data_test_avg.loc[i].month), level ='month'
                                         ).xs(int(data_test_avg.loc[i].weekday), level ='weekday'
                                             ).loc[int(data_test_avg.loc[i].periods)].tot

0
1000
2000
3000
4000
5000
6000
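The test-set lookups above rely on every (`year`, `month`, `weekday`, `periods`) combination in the test data also being present in the training group means. A hedged sketch of how that assumption could be checked with a left merge and the `indicator` option is shown below, on invented toy values; `train_means`, `test` and `merged` are illustrative names.

```python
import pandas as pd

keys = ['year', 'month', 'weekday', 'periods']

# Toy stand-in for the training group means (values are invented).
train_means = pd.DataFrame({
    'year':    [2011, 2011],
    'month':   [1, 1],
    'weekday': [3, 3],
    'periods': [1, 2],
    'avg_tot': [2.5, 100.0],
})

# Toy stand-in for the test rows; the last row has no match in train_means.
test = pd.DataFrame({
    'year':    [2011, 2011, 2011],
    'month':   [1, 1, 1],
    'weekday': [3, 3, 4],
    'periods': [1, 2, 1],
})

# indicator=True adds a `_merge` column telling where each row was found.
merged = test.merge(train_means, on=keys, how='left', indicator=True)
n_missing = int((merged['_merge'] == 'left_only').sum())
print(n_missing)
```

Rows flagged `left_only` end up with `NaN` in the average columns, so they would need a fallback value before modelling.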


In [181]:
data_test_avg.head(10)

Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,periods,avg_casual,avg_registered,avg_tot
0,10.66,11.365,56,26.0027,2011,1,1,3,0,1,0,1,6,0.25,5.75,6.0
1,10.66,13.635,56,0.0,2011,1,1,3,1,1,0,1,6,0.25,5.75,6.0
2,10.66,13.635,56,0.0,2011,1,1,3,2,1,0,1,1,0.0,2.714286,2.714286
3,10.66,12.88,56,11.0014,2011,1,1,3,3,1,0,1,1,0.0,2.714286,2.714286
4,10.66,12.88,56,11.0014,2011,1,1,3,4,1,0,1,1,0.0,2.714286,2.714286
5,9.84,11.365,60,15.0013,2011,1,1,3,5,1,0,1,1,0.0,2.714286,2.714286
6,9.02,10.605,60,15.0013,2011,1,1,3,6,1,0,1,2,2.0,112.125,114.125
7,9.02,10.605,55,15.0013,2011,1,1,3,7,1,0,1,2,2.0,112.125,114.125
8,9.02,10.605,55,19.0012,2011,1,1,3,8,1,0,1,2,2.0,112.125,114.125
9,9.84,11.365,52,15.0013,2011,1,1,3,9,1,0,2,2,2.0,112.125,114.125


In [182]:
# Checking the last three columns. Interestingly, the table below has no row for hour == 3

data_train.loc[115:124]


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,casual,registered,tot,periods,avg_casual,avg_registered,avg_tot
115,7.38,12.12,55,0.0,2011,1,1,3,0,1,0,1,0,11,11,6,0.25,5.75,6.0
116,6.56,11.365,64,0.0,2011,1,1,3,1,1,0,1,0,4,4,6,0.25,5.75,6.0
117,6.56,11.365,64,0.0,2011,1,1,3,2,1,0,1,0,2,2,1,0.0,2.714286,2.714286
118,6.56,9.85,64,6.0032,2011,1,1,3,4,1,0,2,0,1,1,1,0.0,2.714286,2.714286
119,5.74,9.09,69,6.0032,2011,1,1,3,5,1,0,2,0,4,4,1,0.0,2.714286,2.714286
120,5.74,8.335,63,7.0015,2011,1,1,3,6,1,0,2,0,36,36,2,2.0,112.125,114.125
121,6.56,11.365,59,0.0,2011,1,1,3,7,1,0,2,0,95,95,2,2.0,112.125,114.125
122,6.56,11.365,59,0.0,2011,1,1,3,8,1,0,1,3,216,219,2,2.0,112.125,114.125
123,7.38,12.12,51,0.0,2011,1,1,3,9,1,0,2,6,116,122,2,2.0,112.125,114.125
124,8.2,12.88,47,0.0,2011,1,1,3,10,1,0,1,3,42,45,3,4.25,53.875,58.125


In [183]:
display(data_test_avg[num_var+cat_var+avg_var].head())

data_test_avg[num_var+cat_var+avg_var].to_csv('data/test_prep_orig_avg.csv', index=False)


Unnamed: 0,temp,atemp,humidity,windspeed,year,season,month,weekday,hour,workingday,holiday,weather,avg_casual,avg_registered,avg_tot
0,10.66,11.365,56,26.0027,2011,1,1,3,0,1,0,1,0.25,5.75,6.0
1,10.66,13.635,56,0.0,2011,1,1,3,1,1,0,1,0.25,5.75,6.0
2,10.66,13.635,56,0.0,2011,1,1,3,2,1,0,1,0.0,2.714286,2.714286
3,10.66,12.88,56,11.0014,2011,1,1,3,3,1,0,1,0.0,2.714286,2.714286
4,10.66,12.88,56,11.0014,2011,1,1,3,4,1,0,1,0.0,2.714286,2.714286


### Saving `test_prep_dum_avg.csv`

In [184]:
dummy_test = pd.read_csv('data/test_prep_dum.csv')

display(dummy_test.head())
# `.ix` has been removed from recent pandas versions; `.loc` performs the same label-based slice
dummy_test_avg = dummy_test.loc[:,'temp':'d_weather_4'].copy()

dummy_test_avg['avg_casual'] = data_test_avg['avg_casual']
dummy_test_avg['avg_registered'] = data_test_avg['avg_registered']
dummy_test_avg['avg_tot'] = data_test_avg['avg_tot']

display(dummy_test_avg.head())

dummy_test_avg.to_csv('data/test_prep_dum_avg.csv', index=False)

Unnamed: 0,temp,atemp,humidity,windspeed,is2011,d_season_1,d_season_2,d_season_3,d_season_4,d_month_1,...,d_hour_22,d_hour_23,d_workingday_0,d_workingday_1,d_holiday_0,d_holiday_1,d_weather_1,d_weather_2,d_weather_3,d_weather_4
0,10.66,11.365,56,26.0027,1,1,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0
1,10.66,13.635,56,0.0,1,1,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0
2,10.66,13.635,56,0.0,1,1,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0
3,10.66,12.88,56,11.0014,1,1,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0
4,10.66,12.88,56,11.0014,1,1,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0


Unnamed: 0,temp,atemp,humidity,windspeed,is2011,d_season_1,d_season_2,d_season_3,d_season_4,d_month_1,...,d_workingday_1,d_holiday_0,d_holiday_1,d_weather_1,d_weather_2,d_weather_3,d_weather_4,avg_casual,avg_registered,avg_tot
0,10.66,11.365,56,26.0027,1,1,0,0,0,1,...,1,1,0,1,0,0,0,0.25,5.75,6.0
1,10.66,13.635,56,0.0,1,1,0,0,0,1,...,1,1,0,1,0,0,0,0.25,5.75,6.0
2,10.66,13.635,56,0.0,1,1,0,0,0,1,...,1,1,0,1,0,0,0,0.0,2.714286,2.714286
3,10.66,12.88,56,11.0014,1,1,0,0,0,1,...,1,1,0,1,0,0,0,0.0,2.714286,2.714286
4,10.66,12.88,56,11.0014,1,1,0,0,0,1,...,1,1,0,1,0,0,0,0.0,2.714286,2.714286
