## One-Way ANOVA
When we need to compare more than two samples, we use ANOVA (analysis of variance). ANOVA compares the means of different groups.

For example, we may want to check whether the average weight of babies born in 3 different states is the same or different.
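
The baby-weight example can be sketched in a few lines. The data below is synthetic and the three states are hypothetical; the means, spread and sample sizes are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: birth weights (kg) for three made-up states.
rng = np.random.default_rng(0)
state_a = rng.normal(loc=3.2, scale=0.4, size=200)
state_b = rng.normal(loc=3.3, scale=0.4, size=200)
state_c = rng.normal(loc=3.6, scale=0.4, size=200)

# H0: all three states have the same mean birth weight.
f_stat, p_value = stats.f_oneway(state_a, state_b, state_c)
print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

A small p-value here would suggest that at least one state's mean differs from the others.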

Before any kind of hypothesis testing, we should have a clear question in mind. The question determines which test to use, and we then test our hypothesis against the data.

Here we will work with the bike sharing data again. The question we are asking is: is the average number of bike rentals the same across weather situations, or does it differ?

### Pre-processing
We have already covered preprocessing, so let's dive directly into hypothesis testing.

In [0]:
#import the libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import warnings
from scipy import stats
%matplotlib inline


In [0]:
#read the data
df = pd.read_csv('/content/drive/My Drive/data_set/bike_sharing.csv')

In [0]:
#drop the datetime and atemp columns
df.drop(['datetime','atemp'],axis = 1,inplace=True)

In [0]:
df['weather'].value_counts()

1    7192
2    2834
3     859
4       1
Name: weather, dtype: int64

Only 1 record falls in the 4th weather category, which is too few to compare against the other groups, so we drop it.

In [0]:
df.drop(df[df['weather']==4].index,axis=0,inplace=True)  #remove the records where weather == 4

In [0]:
df.groupby('weather')['count'].describe()  #groupby weather situation and check the description

weather,count,mean,std,min,25%,50%,75%,max
1,7192.0,205.236791,187.959566,1.0,48.0,161.0,305.0,977.0
2,2834.0,178.95554,168.366413,1.0,41.0,134.0,264.0,890.0
3,859.0,118.846333,138.581297,1.0,23.0,71.0,161.0,891.0


The means of the 3 groups clearly differ. But are these differences statistically significant? We will use one-way ANOVA to find out.

So the question is: does the weather situation have any impact on the number of bikes rented?

In [0]:
#perform one-way ANOVA using the stats module from the scipy library
#H0 : The mean number of rentals is the same in all weather situations
#H1 : At least one weather situation has a different mean
#Alpha : 0.05

alpha = 0.05
f_stat,p_value = stats.f_oneway(df['count'][df['weather']==1],
                                df['count'][df['weather']==2],
                                df['count'][df['weather']==3])

if p_value > alpha :
  print(f' Failed to reject null hypothesis \n Weather situation has no significant impact on bike rentals \n p-value : {p_value}')
else:
  print(f' Reject null hypothesis \n Weather situation has impact on bike rentals \n p-value : {p_value}')

 Reject null hypothesis 
 Weather situation has impact on bike rentals 
 p-value : 4.976448509904196e-43


Here the p-value is less than alpha, so we reject the null hypothesis: the mean number of bike rentals is not the same across weather situations.
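
To see what the F statistic behind this test measures, here is a minimal sketch that computes it by hand on a toy dataset. The helper name `one_way_f` and the toy numbers are made up for illustration; this is the textbook formula (between-group mean square divided by within-group mean square), not scipy's internal code.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f(toy_groups)
print(f_value)
```

For this toy data the between-group mean square is 13 and the within-group mean square is 1, so F is 13 up to floating-point rounding. A large F means the group means are far apart relative to the noise inside each group.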

One-way ANOVA only tells us that the group means are not all the same. It does not tell us which group means differ.

We use a post-hoc test to find out which group means are not equal.

### Tukey HSD Post-Hoc Test

In [0]:
#Use Tukey HSD to find out which group means differ.
from statsmodels.stats.multicomp import MultiComparison 
mul_comp = MultiComparison(df['count'],df['weather'])   
mul_result = mul_comp.tukeyhsd()
print(mul_result) 

 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     1      2 -26.2813 0.001  -35.6159 -16.9466   True
     1      3 -86.3905 0.001 -101.5842 -71.1968   True
     2      3 -60.1092 0.001   -76.502 -43.7164   True
------------------------------------------------------


Look at the last column: every row has reject = True, which means we reject the null hypothesis (that the two means are the same) for every pair. The means of all three groups are significantly different from one another.
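
The reject column can also be read programmatically instead of from the printed table. The sketch below uses `pairwise_tukeyhsd` from statsmodels on synthetic data; the group labels `a`, `b`, `c` and their means are assumptions made up for illustration.

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data: three hypothetical groups with clearly different means.
rng = np.random.default_rng(1)
values = np.concatenate([
    rng.normal(10, 2, 30),
    rng.normal(20, 2, 30),
    rng.normal(40, 2, 30),
])
labels = np.repeat(["a", "b", "c"], 30)

tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)

# Pairs are reported in the same order as combinations of the sorted group labels.
pairs = list(combinations(tukey.groupsunique, 2))
for (g1, g2), diff, rejected in zip(pairs, tukey.meandiffs, tukey.reject):
    print(f"{g1} vs {g2}: mean difference = {diff:.2f}, reject H0 = {rejected}")
```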

## Two-Way ANOVA
Two-way ANOVA is used to examine the influence of 2 independent categorical variables on 1 continuous dependent variable.

In two-way ANOVA we have 3 null hypotheses.
1. There is no effect of independent variable 1 on the dependent variable.
2. There is no effect of independent variable 2 on the dependent variable.
3. There is no interaction between variable 1 and variable 2.
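
The three hypotheses can be written against one common form of the two-way model. The notation here is introduced only for illustration and does not come from the original notebook: $y_{ijk}$ is observation $k$ at level $i$ of variable 1 and level $j$ of variable 2, $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects, and $(\alpha\beta)_{ij}$ is the interaction.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \\
H_0^{(1)} &: \alpha_i = 0 \quad \text{for all } i \\
H_0^{(2)} &: \beta_j = 0 \quad \text{for all } j \\
H_0^{(3)} &: (\alpha\beta)_{ij} = 0 \quad \text{for all } i, j
\end{align*}
```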

Before hypothesis testing, we fit a regression model using the two variables. We will go one step at a time and keep it simple.

Let's examine whether season and weather situation have any effect on bike rentals. We have 4 seasons and 3 weather situations.

In [0]:
#check the description of groups of different weather situations
df.groupby('weather')['count'].describe()

weather,count,mean,std,min,25%,50%,75%,max
1,7192.0,205.236791,187.959566,1.0,48.0,161.0,305.0,977.0
2,2834.0,178.95554,168.366413,1.0,41.0,134.0,264.0,890.0
3,859.0,118.846333,138.581297,1.0,23.0,71.0,161.0,891.0


We checked this before and found that the mean rentals differ significantly across weather situations.

In [0]:
#check the description of groups of different seasons
df.groupby('season')['count'].describe()

season,count,mean,std,min,25%,50%,75%,max
1,2685.0,116.325512,125.293931,1.0,24.0,78.0,164.0,801.0
2,2733.0,215.251372,192.007843,1.0,49.0,172.0,321.0,873.0
3,2733.0,234.417124,197.151001,1.0,68.0,195.0,347.0,977.0
4,2734.0,198.988296,177.622409,1.0,51.0,161.0,294.0,948.0


We can see that the group means differ across seasons too.

To perform the ANOVA analysis, we first fit a regression model.

In [0]:
#Perform regression analysis with weather situation, season and their interaction
model = ols('count ~ C(weather) * C(season)',df).fit()  #fit the regression model
print(model.summary())  #print summary

                            OLS Regression Results                            
Dep. Variable:                  count   R-squared:                       0.080
Model:                            OLS   Adj. R-squared:                  0.079
Method:                 Least Squares   F-statistic:                     85.66
Date:                Sat, 18 Apr 2020   Prob (F-statistic):          8.63e-187
Time:                        07:31:15   Log-Likelihood:                -71587.
No. Observations:               10885   AIC:                         1.432e+05
Df Residuals:                   10873   BIC:                         1.433e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept   

The regression summary contains 2 tables.

---

Table 1 (the top block) tells us whether the regression as a whole is significant. Table 2 (the coefficient block) tells us whether each individual term is significant.

In table 1, the p-value associated with the F-statistic is very low, which means the regression is significant. Similarly, in table 2 the p-value associated with the t-statistic is close to zero for most of the variables.
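
Both kinds of p-value can be pulled out of a fitted statsmodels model directly instead of being read off the printed summary. A minimal sketch on synthetic data follows; the column names `group` and `y` and the effect sizes are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: one hypothetical categorical factor with three levels.
rng = np.random.default_rng(2)
toy = pd.DataFrame({"group": np.repeat(["x", "y", "z"], 50)})
shift = toy["group"].map({"x": 0.0, "y": 2.0, "z": 4.0})
toy["y"] = rng.normal(0, 1, len(toy)) + shift

toy_model = ols("y ~ C(group)", data=toy).fit()

# Table 1: p-value of the overall F-test for the regression.
overall_p = toy_model.f_pvalue
# Table 2: one p-value per coefficient, from the t-tests.
term_p = toy_model.pvalues

print(f"overall F-test p-value: {overall_p:.3g}")
print(term_p)
```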

In [0]:
#H0: There is no difference in mean rentals between weather situations
#    There is no difference in mean rentals between seasons
#    There is no interaction between weather and season

sm.stats.anova_lm(model)  #perform two-way ANOVA

,df,sum_sq,mean_sq,F,PR(>F)
C(weather),2.0,6337309.0,3168655.0,104.81881,8.148093e-46
C(season),3.0,21587080.0,7195692.0,238.032851,1.350921e-149
C(weather):C(season),6.0,558835.2,93139.2,3.081036,0.005150817
Residual,10873.0,328688900.0,30229.83,,


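Looking at the last column (the p-values), all three are below 0.05: they are close to zero for weather and season, and about 0.005 for the interaction. So we reject all three null hypotheses: mean rentals differ across weather situations and across seasons, and the effect of weather depends on the season.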
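
The same decision can be made in code by comparing each p-value with alpha. In this small sketch the p-values are copied by hand from the PR(>F) column above; the variable names are made up for illustration.

```python
import pandas as pd

alpha = 0.05

# p-values copied from the PR(>F) column of the ANOVA table above.
p_values = pd.Series({
    "C(weather)": 8.148093e-46,
    "C(season)": 1.350921e-149,
    "C(weather):C(season)": 0.005150817,
})

# Reject H0 for a term when its p-value is below alpha.
decisions = p_values.lt(alpha).map({True: "reject H0", False: "fail to reject H0"})
print(decisions)
```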