# Regression Modeling in Practice, exercise 4: Logistic Regression

## Dataset: Bike sharing

I am using the bike sharing dataset from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset, which records how many bikes are rented out to casual and registered users on specific hours of specific days, the weather on those days, whether the day was a working day, etc.

## Hypotheses
So far I have taken the number of rented-out bikes as the response variable, but for this exercise I decided to flip that around and try to predict whether it was a working day based on the number of bikes rented out to casual and registered users.

Hypotheses:
 * It is more likely to be a working day if more bikes are rented out by registered users
 * It is less likely to be a working day if more bikes are rented out by casual users
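
Written against the logistic model fitted later in this notebook, the two hypotheses become sign constraints on the coefficients. The symbols $p$, $\beta_i$, $x_{\mathrm{reg}}$ and $x_{\mathrm{cas}}$ are notation introduced here for illustration, with $x_{\mathrm{reg}}$ and $x_{\mathrm{cas}}$ standing for the centered registered and casual counts:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\, x_{\mathrm{reg}} + \beta_2\, x_{\mathrm{cas}},
\qquad p = P(\text{workingday} = 1)
$$

$$
\text{Hypothesis 1: } \beta_1 > 0 \iff e^{\beta_1} > 1,
\qquad
\text{Hypothesis 2: } \beta_2 < 0 \iff e^{\beta_2} < 1
$$

Here $e^{\beta_1}$ and $e^{\beta_2}$ are the odds ratios for one additional rental by a registered or casual user, respectively.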

In [1]:
%matplotlib inline
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.api as sm
import seaborn
import matplotlib.pyplot as plt


data = pandas.read_csv('day.csv', low_memory=False)
data.head()

   instant      dteday  season  yr  mnth  holiday  weekday  workingday  weathersit      temp     atemp       hum  windspeed  casual  registered   cnt
0        1  2011-01-01       1   0     1        0        6           0           2  0.344167  0.363625  0.805833   0.160446     331         654   985
1        2  2011-01-02       1   0     1        0        0           0           2  0.363478  0.353739  0.696087   0.248539     131         670   801
2        3  2011-01-03       1   0     1        0        1           1           1  0.196364  0.189405  0.437273   0.248309     120        1229  1349
3        4  2011-01-04       1   0     1        0        2           1           1  0.200000  0.212122  0.590435   0.160296     108        1454  1562
4        5  2011-01-05       1   0     1        0        3           1           1  0.226957  0.229270  0.436957   0.186900      82        1518  1600


## Centering explanatory variables

In [4]:
data['casual_c']=data['casual']-numpy.mean(data['casual'])
print("mean centered casual_c: ",numpy.mean(data['casual_c']))
data['registered_c']=data['registered']-numpy.mean(data['registered'])
print("mean centered registered_c: ",numpy.mean(data['registered_c']))

mean centered casual_c:  -1.6253640780145131e-12
mean centered registered_c:  -5.497404021591906e-12
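
The printed means are of the order 1e-12 rather than exactly 0 because of floating-point round-off. The sketch below illustrates this using only the standard library and the first five `casual` counts from the `data.head()` output above; the variable names are chosen here for illustration and are not part of the notebook.

```python
import math
import statistics

# First five `casual` counts from the data.head() output; any numeric column would do.
casual = [331, 131, 120, 108, 82]

# Center the values by subtracting their mean.
mean_casual = statistics.fmean(casual)
casual_c = [x - mean_casual for x in casual]

# The mean of the centered values may differ from 0 by round-off error,
# so compare against 0 with a tolerance instead of testing exact equality.
mean_centered = sum(casual_c) / len(casual_c)
print("mean of centered values:", mean_centered)
print("effectively zero:", math.isclose(mean_centered, 0.0, abs_tol=1e-9))
```

On the full pandas columns the same check could be done with `numpy.isclose`, again with an explicit absolute tolerance.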


## Fitting a logistic regression model

In [20]:
lreg = smf.logit(formula = 'workingday ~ registered_c + casual_c', data = data).fit()
print (lreg.summary())

Optimization terminated successfully.
         Current function value: 0.230172
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:             workingday   No. Observations:                  731
Model:                          Logit   Df Residuals:                      728
Method:                           MLE   Df Model:                            2
Date:                Sat, 21 May 2016   Pseudo R-squ.:                  0.6310
Time:                        16:01:42   Log-Likelihood:                -168.26
converged:                       True   LL-Null:                       -456.01
                                        LLR p-value:                1.068e-125
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        1.4212      0.187      7.607      0.000         1.055     1.787
registered_c     0.0022

### Odds ratios:

In [21]:
print (numpy.exp(lreg.params))

Intercept       4.142021
registered_c    1.002232
casual_c        0.993828
dtype: float64
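
These odds ratios are per single additional rental, which is why they sit so close to 1. As a rough aid to interpretation, the sketch below rescales them to a step of 100 rentals and converts the intercept's odds into a probability. The step size `k = 100` is an arbitrary choice made here, and the numbers are copied from the output above rather than recomputed from the model.

```python
import math

# Odds ratios copied from the output above (per one additional rental).
odds_ratios = {"Intercept": 4.142021, "registered_c": 1.002232, "casual_c": 0.993828}

# The odds ratio for a k-unit increase is the per-unit odds ratio raised to the power k.
k = 100  # illustrative step size, not taken from the notebook
or_registered_k = odds_ratios["registered_c"] ** k
or_casual_k = odds_ratios["casual_c"] ** k

# Because the predictors are centered, exp(Intercept) is the odds of a working day
# when both counts are at their mean; odds / (1 + odds) turns that into a probability.
baseline_odds = odds_ratios["Intercept"]
baseline_prob = baseline_odds / (1.0 + baseline_odds)

print(f"OR per {k} registered rentals: {or_registered_k:.2f}")
print(f"OR per {k} casual rentals:     {or_casual_k:.2f}")
print(f"P(working day) at mean counts: {baseline_prob:.3f}")
```

Under these numbers, 100 extra registered rentals multiply the odds of a working day by roughly 1.25, while 100 extra casual rentals multiply them by roughly 0.54.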


### Odds ratios with 95% confidence intervals:

In [22]:
params = lreg.params
conf = lreg.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))

              Lower CI  Upper CI        OR
Intercept     2.872011  5.973633  4.142021
registered_c  1.001843  1.002621  1.002232
casual_c      0.992678  0.994980  0.993828
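
As a sanity check, the intercept's interval can be approximately reproduced by hand from the rounded coefficient and standard error in the summary table. This sketch assumes that `conf_int()` returns a normal-approximation (Wald) interval on the log-odds scale; the variable names are introduced here for illustration.

```python
import math
from statistics import NormalDist

# Intercept estimate and standard error as printed (rounded) in the summary table.
coef = 1.4212
std_err = 0.187

# Two-sided 95% interval on the log-odds scale: coef +/- z * std_err.
z = NormalDist().inv_cdf(0.975)
lower_log_odds = coef - z * std_err
upper_log_odds = coef + z * std_err

# Exponentiating the endpoints moves the interval to the odds scale.
lower_or = math.exp(lower_log_odds)
upper_or = math.exp(upper_log_odds)
print(f"exp(coef) = {math.exp(coef):.3f}, 95% CI = ({lower_or:.3f}, {upper_or:.3f})")
```

The result agrees with the `Intercept` row above up to the rounding of the inputs, which is why exponentiating the whole `conf` frame at once gives valid interval endpoints for the odds ratios.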


## Conclusion
My hypotheses were supported. Whether a day is a working day is significantly associated with the number of bikes rented by casual and registered users: it is more likely to be a working day if more bikes are rented by registered users (OR=1.002232, 95% CI=1.001843-1.002621, p<0.0001) or if fewer bikes are rented by casual users (OR=0.993828, 95% CI=0.992678-0.994980, p<0.0001). Both odds ratios are per one additional rental.

There was no evidence of confounding.