# Case Study on ANOVA

XYZ Company has offices in four different zones. The company wishes to investigate the following:

- The mean sales generated by each zone.
- Total sales generated by all the zones for each month.
- Check whether all the zones generate the same amount of sales.

Help the company carry out this study using the data provided.

In [1]:
# Import the pandas, NumPy, matplotlib.pyplot and seaborn libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read the dataset into the Python environment
data = pd.read_csv('Sales_data_zone_wise.csv')

In [3]:
# Display the first 5 rows of the dataset
data.head()

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D
0,Month - 1,1483525,1748451,1523308,2267260
1,Month - 2,1238428,1707421,2212113,1994341
2,Month - 3,1860771,2091194,1282374,1241600
3,Month - 4,1871571,1759617,2290580,2252681
4,Month - 5,1244922,1606010,1818334,1326062


In [4]:
# Show the data type of each column and its non-null count
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


<div class="alert alert-block alert-info"><b>Note:</b><br>  
    
- The dataset appears to have no empty/NaN values: the non-null count for each column is 29, which equals the number of rows.<br>
- The sales columns for all four zones are stored as integers (`int64`).
</div>

In [5]:
# Count the null values in each column using isnull().sum()
data.isnull().sum()

Month       0
Zone - A    0
Zone - B    0
Zone - C    0
Zone - D    0
dtype: int64

**Note-**

<mark>No null values</mark>


In [6]:
# Statistical summary of the numeric columns
data.describe()

Unnamed: 0,Zone - A,Zone - B,Zone - C,Zone - D
count,29.0,29.0,29.0,29.0
mean,1540493.0,1755560.0,1772871.0,1842927.0
std,261940.1,168389.9,333193.7,375016.5
min,1128185.0,1527574.0,1237722.0,1234311.0
25%,1305972.0,1606010.0,1523308.0,1520406.0
50%,1534390.0,1740365.0,1767047.0,1854412.0
75%,1820196.0,1875658.0,2098463.0,2180416.0
max,2004480.0,2091194.0,2290580.0,2364132.0


## The company wishes to investigate:
### 1. The mean sales generated by each zone.

In [7]:
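# numeric_only=True skips the text column 'Month', which recent pandas versions would otherwise reject
data.mean(numeric_only=True)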

Zone - A    1.540493e+06
Zone - B    1.755560e+06
Zone - C    1.772871e+06
Zone - D    1.842927e+06
dtype: float64

In [8]:
print("Mean of Zone - A",round(data['Zone - A'].mean(),2))
print("Mean of Zone - B",round(data['Zone - B'].mean(),2))
print("Mean of Zone - C",round(data['Zone - C'].mean(),2))
print("Mean of Zone - D",round(data['Zone - D'].mean(),2))

Mean of Zone - A 1540493.14
Mean of Zone - B 1755559.59
Mean of Zone - C 1772871.03
Mean of Zone - D 1842926.76


In [9]:
print("Mean of each zone\n ",round(data.mean(numeric_only=True),2))

Mean of each zone
  Zone - A    1540493.14
Zone - B    1755559.59
Zone - C    1772871.03
Zone - D    1842926.76
dtype: float64


The same summary, with the standard deviation, in a single call:

In [10]:
# Aggregate only the numeric zone columns; 'Month' is text
data.select_dtypes('number').agg(['std', 'mean']).round(3)

Unnamed: 0,Zone - A,Zone - B,Zone - C,Zone - D
std,261940.062,168389.886,333193.725,375016.479
mean,1540493.138,1755559.586,1772871.034,1842926.759


<div class="alert alert-block alert-info"><b>Note:</b><br>  
    
- Zone - D generates the highest mean sales.<br>
- Zone - A generates the lowest mean sales.
</div>
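The highest and lowest zones can also be picked out programmatically rather than by reading the table. The following is a minimal sketch, assuming only the first five rows shown in the `data.head()` output above; the names `sample`, `zone_means`, `highest_zone` and `lowest_zone` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# First five rows of the dataset, copied from the data.head() output above.
# This is a small illustrative sample, not the full 29-month dataset.
sample = pd.DataFrame(
    {
        "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
        "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
        "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
        "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
        "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
    }
)

# Keep only the numeric zone columns before averaging.
zone_means = sample.select_dtypes("number").mean()

# idxmax/idxmin return the column label with the largest/smallest mean.
highest_zone = zone_means.idxmax()
lowest_zone = zone_means.idxmin()

print("Highest mean sales:", highest_zone, round(zone_means[highest_zone], 2))
print("Lowest mean sales:", lowest_zone, round(zone_means[lowest_zone], 2))
```

Because the sketch uses five months only, its ranking need not match the full-data result in the note above. Running the same calls on `data` gives the ranking for all 29 months.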

## The company wishes to investigate:
### 2. Total sales generated by all the zones for each month.

In [11]:
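# Sum across the zone columns for each month; numeric_only=True skips the text column 'Month'
data["Total sales"] = data.sum(axis=1, numeric_only=True)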
data

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D,Total sales
0,Month - 1,1483525,1748451,1523308,2267260,7022544
1,Month - 2,1238428,1707421,2212113,1994341,7152303
2,Month - 3,1860771,2091194,1282374,1241600,6475939
3,Month - 4,1871571,1759617,2290580,2252681,8174449
4,Month - 5,1244922,1606010,1818334,1326062,5995328
5,Month - 6,1534390,1573128,1751825,2292044,7151387
6,Month - 7,1820196,1992031,1786826,1688055,7287108
7,Month - 8,1625696,1665534,2161754,2363315,7816299
8,Month - 9,1652644,1873402,1755290,1422059,6703395
9,Month - 10,1852450,1913059,1754314,1608387,7128210


In [27]:

# Columns 1 to 4 are the four zones; column 0 ('Month') is text and is excluded
data['Total Sales'] = data.iloc[:, 1:5].sum(axis=1)
data


Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D,Total sales,total,Total Sales
0,Month - 1,1483525,1748451,1523308,2267260,7022544,7022544,7022544
1,Month - 2,1238428,1707421,2212113,1994341,7152303,7152303,7152303
2,Month - 3,1860771,2091194,1282374,1241600,6475939,6475939,6475939
3,Month - 4,1871571,1759617,2290580,2252681,8174449,8174449,8174449
4,Month - 5,1244922,1606010,1818334,1326062,5995328,5995328,5995328
5,Month - 6,1534390,1573128,1751825,2292044,7151387,7151387,7151387
6,Month - 7,1820196,1992031,1786826,1688055,7287108,7287108,7287108
7,Month - 8,1625696,1665534,2161754,2363315,7816299,7816299,7816299
8,Month - 9,1652644,1873402,1755290,1422059,6703395,6703395,6703395
9,Month - 10,1852450,1913059,1754314,1608387,7128210,7128210,7128210


In [12]:
#zones = ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D']
#df['Total_Sales'] = df[zones].sum(axis=1)
#df

In [30]:
data['Total Sales'].sort_values()

12    5925424
14    5934156
4     5995328
24    6095918
11    6111084
26    6267918
2     6475939
15    6506659
25    6512360
22    6687919
8     6703395
28    6772277
18    6971953
0     7022544
10    7032783
17    7083490
19    7124599
9     7128210
16    7149383
5     7151387
1     7152303
13    7155515
6     7287108
20    7389597
27    7470920
21    7560001
23    7784747
7     7816299
3     8174449
Name: Total Sales, dtype: int64

<div class="alert alert-block alert-info"><b>Find:</b><br>  
    
- The highest total sales were generated in Month - 4.<br>
- The lowest total sales were generated in Month - 13.
</div>
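Instead of scanning the sorted output, the best and worst months can be looked up directly. This is a minimal sketch, again assuming only the first five rows from the `data.head()` output; `sample`, `zones`, `best` and `worst` are illustrative names.

```python
import pandas as pd

# First five rows of the dataset, copied from the data.head() output above.
sample = pd.DataFrame(
    {
        "Month": ["Month - 1", "Month - 2", "Month - 3", "Month - 4", "Month - 5"],
        "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922],
        "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010],
        "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334],
        "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062],
    }
)

# Name the zone columns explicitly so the row sum never touches 'Month'.
zones = ["Zone - A", "Zone - B", "Zone - C", "Zone - D"]
sample["Total Sales"] = sample[zones].sum(axis=1)

# idxmax/idxmin give the row label of the largest/smallest total.
best = sample.loc[sample["Total Sales"].idxmax()]
worst = sample.loc[sample["Total Sales"].idxmin()]

print("Highest total sales:", best["Month"], best["Total Sales"])
print("Lowest total sales:", worst["Month"], worst["Total Sales"])
```

Within these five months the lowest total falls in Month - 5. Over all 29 months the sorted output above shows the minimum in Month - 13.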

## The company wishes to investigate:
### 3. Check whether all the zones generate the same amount of sales.

We check whether all the zones generate the same amount of sales using an **ANOVA test**.<br>
There are more than two groups (four zones) to compare, and they are all levels of a single categorical variable (zone), so we use a **one-way ANOVA test**.

>Set the hypothesis 

Null Hypothesis (H0): all the zones generate the same mean sales.

Alternate Hypothesis (H1): at least one zone's mean sales differs from the others.


>Set the significance level 

Significance level, α = 0.05


>One-way ANOVA Test.

In [39]:
from scipy import stats

F, p = stats.f_oneway(data['Zone - A'], data['Zone - B'], data['Zone - C'],data['Zone - D'])
print('F-value:',round(F,4),'p-value:',round(p,4))

F-value: 5.6721 p-value: 0.0012
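To make clearer what `stats.f_oneway` computes, the F statistic can be rebuilt by hand as the ratio of the between-group mean square to the within-group mean square. The sketch below assumes only the first five months of each zone from the `data.head()` output, so its F and p values will differ from the full-data result above. All variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Illustrative groups: first five months of each zone, from the data.head() output.
groups = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922], dtype=float),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010], dtype=float),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334], dtype=float),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062], dtype=float),  # Zone - D
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean lies from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom.
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against scipy's implementation.
f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", round(f_manual, 4), round(p_manual, 4))
print("scipy: ", round(f_scipy, 4), round(p_scipy, 4))
```

The two printed lines should agree, which shows that a large F value means the zone means are spread widely compared with the month-to-month variation inside each zone.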


In [40]:
# A test never proves a hypothesis: it either rejects H0 or fails to reject it
if p > 0.05:
    print('Fail to reject the null hypothesis')
else:
    print('Reject the null hypothesis')

Reject the null hypothesis


>Making a Decision 

<div class="alert alert-block alert-info"><b>Find:</b><br>  
    
From the one-way ANOVA test performed,<br>  

p-value (= 0.0012) < alpha (= 0.05)<br>  

We `reject the null hypothesis H0`<br> 

So we conclude that the zones do not all generate the same amount of sales: at least one zone's mean sales differs from the others.
                        
</div>



