<h2> ANOVA</h2>

<b> By Michael Kumakech </b>

<b> ANOVA: Analysis of Variance</b>

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

<b> F-test score:</b> ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score: the ratio of the variation between the group means to the variation within the groups. A larger score means the group means differ more relative to the spread inside each group.

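<b> P-value: </b> The p-value tells us how statistically significant the calculated score is: it is the probability of seeing an F-test score at least this large if the group means were in fact all the same.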

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
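To make the two returned values concrete, here is a minimal sketch that computes the F-test score and p-value by hand and compares them with 'f_oneway' from scipy. The three groups are made-up numbers, not values from the automobile dataset, and the variable names are chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic values for three hypothetical groups (not taken from the dataset)
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([20.0, 22.0, 19.0, 21.0]),
    np.array([15.0, 14.0, 16.0, 15.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# p-value: upper tail of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

Both lines should print the same pair of numbers, which shows that the F-test score is simply the between-group variation divided by the within-group variation.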

<b> Import libraries</b>

In [20]:
import itertools
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

<b> Load data and store in dataframe df:</b>

This dataset is hosted on IBM Cloud Object Storage.

In [21]:

path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,13495.0,11.190476,Medium,0,1
1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,16500.0,11.190476,Medium,0,1
2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,...,9.0,154.0,5000.0,19,26,16500.0,12.368421,Medium,0,1
3,2,164,audi,std,four,sedan,fwd,front,99.8,0.84863,...,10.0,102.0,5500.0,24,30,13950.0,9.791667,Medium,0,1
4,2,164,audi,std,four,sedan,4wd,front,99.4,0.84863,...,8.0,115.0,5500.0,18,22,17450.0,13.055556,Medium,0,1


<h2> Drive Wheels</h2>

Since ANOVA compares the different groups of a single variable, the groupby function will come in handy. Because the ANOVA function computes the group means itself, we do not need to take the average beforehand.

Let's see whether the different types of 'drive-wheels' impact 'price'. To do so, we group the data.
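As a small illustration of what grouping does, the sketch below builds a toy table (hypothetical prices, not rows from the dataset) and groups it by 'drive-wheels'. The grouped object keeps the raw rows of every group, and the mean is only computed when we ask for it.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'drive-wheels' and 'price' columns
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "rwd", "4wd"],
    "price": [8000.0, 20000.0, 9000.0, 10000.0, 22000.0, 12000.0],
})

toy_grouped = toy.groupby("drive-wheels")

# Each group still holds its raw rows; nothing has been averaged yet
group_sizes = toy_grouped.size()
# The mean is only computed on request
group_means = toy_grouped["price"].mean()

print(group_sizes)
print(group_means)
```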

In [3]:

%%capture
! pip install seaborn

<b> Visualisation</b>

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<h2> Grouping Methods</h2>

In [9]:
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

Unnamed: 0,drive-wheels,body-style,price
0,4wd,hatchback,7603.0
1,4wd,sedan,12647.333333
2,4wd,wagon,9095.75
3,fwd,convertible,11595.0
4,fwd,hardtop,8249.0
5,fwd,hatchback,8396.387755
6,fwd,sedan,9811.8
7,fwd,wagon,9997.333333
8,rwd,convertible,23949.6
9,rwd,hardtop,24202.714286


In [10]:
# group by the column name (a plain string) so that each group key is a scalar such as 'rwd'
grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')
grouped_test2.head(2)

Unnamed: 0,drive-wheels,price
0,rwd,13495.0
1,rwd,16500.0
3,fwd,13950.0
4,4wd,17450.0
5,fwd,15250.0
136,4wd,7603.0


In [11]:
# average price for each body style
df_gptest2 = df_gptest[['body-style','price']]
grouped_test_bodystyle = df_gptest2.groupby(['body-style'],as_index= False).mean()
grouped_test_bodystyle

Unnamed: 0,body-style,price
0,convertible,21890.5
1,hardtop,22208.5
2,hatchback,9957.441176
3,sedan,14459.755319
4,wagon,12371.96



In [13]:
df_gptest

Unnamed: 0,drive-wheels,body-style,price
0,rwd,convertible,13495.0
1,rwd,convertible,16500.0
2,rwd,hatchback,16500.0
3,fwd,sedan,13950.0
4,4wd,sedan,17450.0
...,...,...,...
196,rwd,sedan,16845.0
197,rwd,sedan,19045.0
198,rwd,sedan,21485.0
199,rwd,sedan,22470.0


We can obtain the values of a single group using the method "get_group".

In [14]:

grouped_test2.get_group('4wd')['price']

4      17450.0
136     7603.0
140     9233.0
141    11259.0
144     8013.0
145    11694.0
150     7898.0
151     8778.0
Name: price, dtype: float64
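The sketch below shows the same idea on a toy table (hypothetical prices, not rows from the dataset): we list the available group labels and then pull out the prices of one group. Note that the original row index is preserved, just as in the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'drive-wheels' and 'price' columns
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "rwd", "4wd"],
    "price": [8000.0, 20000.0, 9000.0, 10000.0, 22000.0, 12000.0],
})

toy_grouped = toy.groupby("drive-wheels")

# Labels of the groups that can be passed to get_group
labels = sorted(toy_grouped.groups)
print(labels)

# Prices of the 'rwd' rows only; the index values are those of the original rows
rwd_prices = toy_grouped.get_group("rwd")["price"]
print(rwd_prices)
```

Grouping by the column name as a plain string, rather than by a one-element list, keeps the group keys as simple labels. Some newer pandas versions may warn or expect a tuple key in "get_group" when the grouper is a one-element list.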


We can use the function 'f_oneway' from the module 'stats' to obtain the <b>F-test score and P-value</b>.

In [19]:
# ANOVA
from scipy import stats
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
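When a variable has many categories, listing every group by hand becomes tedious. One possible alternative, sketched here on hypothetical toy data, is to collect one price sample per group in a list and unpack that list into 'f_oneway'. The result is the same as naming the groups one by one.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the 'drive-wheels' and 'price' columns
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "price": [8000.0, 9000.0, 8500.0, 20000.0, 22000.0, 19000.0,
              10000.0, 12000.0, 9500.0],
})

# One Series of prices per drive-wheels category
samples = [prices for _, prices in toy.groupby("drive-wheels")["price"]]
f_all, p_all = stats.f_oneway(*samples)

# The same test with the groups listed by hand
g = toy.groupby("drive-wheels")
f_explicit, p_explicit = stats.f_oneway(
    g.get_group("fwd")["price"],
    g.get_group("rwd")["price"],
    g.get_group("4wd")["price"],
)

print("ANOVA results: F=", f_all, ", P =", p_all)
```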


<b> Comments</b>

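This is a strong result: the large F-test score shows that the mean price differs considerably between the drive-wheels groups, and a P-value of almost 0 implies almost certain statistical significance. But a single ANOVA over all three groups does not tell us which groups differ. Does this mean that all three tested groups differ from one another this strongly?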
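One way to look into this question is to repeat the test for every pair of groups. The sketch below does so on made-up samples (the numbers and the dictionary 'samples' are hypothetical, not taken from the dataset), so it illustrates the mechanics only and says nothing about the real result.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical price samples per drive-wheels category (not from the dataset)
samples = {
    "fwd": np.array([8000.0, 9000.0, 8500.0, 9500.0]),
    "rwd": np.array([20000.0, 22000.0, 19000.0, 21000.0]),
    "4wd": np.array([10000.0, 12000.0, 9500.0, 11000.0]),
}

# Run a separate one-way ANOVA for every pair of categories
pairwise = {}
for a, b in combinations(samples, 2):
    f_val, p_val = stats.f_oneway(samples[a], samples[b])
    pairwise[(a, b)] = (f_val, p_val)
    print(a, "vs", b, ": F=", f_val, ", P =", p_val)

# With exactly two groups, F equals the square of the equal-variance t statistic
t_val, t_p = stats.ttest_ind(samples["fwd"], samples["rwd"])
print("t squared:", t_val ** 2)
```

With only two groups, a one-way ANOVA gives the same p-value as a two-sample t-test with equal variances, so either test can be used for the pairwise comparisons.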

<h2> ACTIVITY</h2>

<b> Separately: fwd and rwd</b>

In [None]:
# Write your code for ANOVA of fwd and rwd

In [None]:
# Write your Comment here!

<b> 4wd and rwd</b>

In [None]:
# Write your code for ANOVA of 4wd and rwd

In [None]:
# Write your Comment here!

<b> 4wd and fwd</b>

In [None]:
# Write your code for ANOVA of 4wd and fwd

In [None]:
# Write your Comment here!

<h2> Conclusion: Important Variables</h2>


We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

<b>Continuous numerical variables: Using Pearson Correlation</b>

Length,
Width,
Curb-weight,
Engine-size,
Horsepower,
City-mpg,
Highway-mpg,
Wheel-base,
Bore
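The continuous variables above were selected with the Pearson correlation. As a reminder of how such a check can look, here is a minimal sketch using 'pearsonr' from scipy on synthetic data. The arrays 'engine_size' and 'price' below are randomly generated for illustration and are not the columns of the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data: a hypothetical feature and a price that depends on it plus noise
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 300, size=50)
price = 100 * engine_size + rng.normal(0, 2000, size=50)

# Pearson correlation coefficient and its p-value
r, p = stats.pearsonr(engine_size, price)
print("Pearson r =", r, ", P =", p)
```

A coefficient close to 1 or -1 together with a small p-value suggests a strong linear relationship, which is the criterion behind the list above.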

<b> Categorical variables: ANOVA</b>

In [None]:
# write the Name of the feature(s) here

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

<h2> References</h2>

<b> 1. IBM Developer Skills Network</b>

<b>2.  MIT </b>