# Data analysis exercise 1: ANOVA

## Dataset: Bike sharing
I am using the Bike Sharing dataset from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset, which records how many bikes were rented out in each hour of each day, the weather on those days, whether the day was a weekday or not, etc.

## Research question
Does the mean number of rented-out bikes differ between seasons?

In [73]:
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 

data = pandas.read_csv('hour.csv', low_memory=False)
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0,0,1,1


In [76]:
# The season variable in the data is just the number of the quarter,
# so I recode my own season name from the month.
recode = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
data['seasonname'] = data['mnth'].map(recode)
sub = data[['seasonname', 'cnt']]
# Mean hourly rental count per season
ct = sub.groupby('seasonname').mean()
print(ct)

                   cnt
seasonname            
Fall        213.450636
Spring      188.653934
Summer      236.766750
Winter      116.934164


In [77]:
model = smf.ols(formula='cnt ~ C(seasonname)', data=sub)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    cnt   R-squared:                       0.061
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     374.4
Date:                Sun, 10 Apr 2016   Prob (F-statistic):          1.20e-235
Time:                        15:51:15   Log-Likelihood:            -1.1450e+05
No. Observations:               17379   AIC:                         2.290e+05
Df Residuals:                   17375   BIC:                         2.290e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------
Intercept                 

## Result of ANOVA
The ANOVA shows a significant difference in the mean number of rented-out bikes between the seasons (F = 374.4, p = 1.20e-235). Because the F-test only says that at least one seasonal mean differs, I run a post hoc test to find out which pairs of seasons differ.
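
For reference, the F-statistic in the regression summary above is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch of that calculation using only the standard library. The three groups and their values are made-up toy numbers, not the bike data, and `one_way_f` is a helper name I chose for illustration.

```python
from statistics import mean

# Made-up toy observations for three groups (NOT the bike sharing data)
groups = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": [5, 6, 7],
}


def one_way_f(groups):
    """Return the one-way ANOVA F-statistic and its degrees of freedom."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean,
    # weighted by the group size
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_f(groups)
print(f_stat, df_between, df_within)
```

Applied to the bike data, the same degrees-of-freedom logic gives 4 - 1 = 3 and 17379 - 4 = 17375, which matches `Df Model` and `Df Residuals` in the summary.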

In [78]:
mc = multi.MultiComparison(sub['cnt'], sub['seasonname'])
res = mc.tukeyhsd()
print(res.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
 Fall  Spring  -24.7967  -34.4698 -15.1236  True 
 Fall  Summer  23.3161   13.6458  32.9865   True 
 Fall  Winter  -96.5165  -106.271 -86.762   True 
Spring Summer  48.1128    38.483  57.7426   True 
Spring Winter  -71.7198  -81.434  -62.0055  True 
Summer Winter -119.8326 -129.5441 -110.121  True 
-------------------------------------------------
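
To read the table: `meandiff` is the mean of `group2` minus the mean of `group1` (consistent with the seasonal means printed earlier), and a pair is marked `reject = True` when its confidence interval excludes zero. The sketch below re-derives the `reject` column from the printed bounds. The numbers are copied by hand from the table above, and the names `rows` and `excludes_zero` are my own.

```python
# (group1, group2, meandiff, lower, upper), copied from the Tukey HSD table above
rows = [
    ("Fall", "Spring", -24.7967, -34.4698, -15.1236),
    ("Fall", "Summer", 23.3161, 13.6458, 32.9865),
    ("Fall", "Winter", -96.5165, -106.271, -86.762),
    ("Spring", "Summer", 48.1128, 38.483, 57.7426),
    ("Spring", "Winter", -71.7198, -81.434, -62.0055),
    ("Summer", "Winter", -119.8326, -129.5441, -110.121),
]


def excludes_zero(lower, upper):
    """True when the interval [lower, upper] lies entirely on one side of zero."""
    return lower > 0 or upper < 0


# Re-derive the reject decision for every pair of seasons
reject = {(g1, g2): excludes_zero(lo, hi) for g1, g2, _, lo, hi in rows}

# Pair with the largest absolute difference in mean hourly rentals
largest = max(rows, key=lambda r: abs(r[2]))

print(reject)
print(largest[:3])
```

All six intervals exclude zero, so every pair of seasons differs at a family-wise error rate of 0.05. The largest gap is between Summer and Winter, at about 120 bikes per hour.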
