# Hypothesis Testing the University of Malaysia Paper

## Claims

* That the distributions do not differ between 2020 and 2019
* That the means do not differ between 2020 and 2019

## What we will be testing

* That the data are normally distributed: Test for normality
    * Shapiro-Wilk Test
* That the means between 2019 and 2020 do not differ: Parametric Statistical Hypothesis Tests
    * t-test, because we have fewer than 25 observations
* If nonparametric: That the distributions between 2019 and 2020 do not differ
    * Mann-Whitney U Test
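
The plan above can be sketched as a small helper. This is a minimal sketch rather than the notebook's actual code: the function name `choose_test`, the `alpha` parameter, and the choice of `ttest_ind` (an independent-samples t-test) are assumptions, and the input arrays are synthetic.

```python
import numpy as np
import scipy.stats as stats


def choose_test(sample_a, sample_b, alpha=0.05):
    """Run Shapiro-Wilk on both samples, then pick a two-sample test.

    Hypothetical helper: returns (test_name, statistic, p_value).
    """
    normal_a = stats.shapiro(sample_a).pvalue > alpha
    normal_b = stats.shapiro(sample_b).pvalue > alpha
    if normal_a and normal_b:
        # Both samples look Gaussian: compare the means with a t-test.
        stat, p = stats.ttest_ind(sample_a, sample_b)
        return "t-test", stat, p
    # Otherwise compare the distributions with the rank-based test.
    stat, p = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    return "mann-whitney", stat, p


# Synthetic, strongly right-skewed data with 23 values per sample.
skewed_a = np.exp(np.linspace(0.0, 10.0, 23))
skewed_b = skewed_a * 1.05

name, stat, p = choose_test(skewed_a, skewed_b)
print(name, 'stat=%.3f, p=%.3f' % (stat, p))
```

Returning the name of the chosen test alongside the statistic keeps it explicit which hypothesis (equal means or equal distributions) the p-value refers to.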



## Data Import

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats


In [2]:
filepath = "/Users/jnapolitano/Projects/wattime-takehome/data/ch4_2015-2021.xlsx"

hypothesis_testing_df = pd.read_excel(filepath)

### Drop total row from the data

In [3]:
# Copy to avoid modifying a slice of the original frame.
# Rebinding the name also lets the old frame drop from memory in a production environment.
hypothesis_testing_df = hypothesis_testing_df.loc[hypothesis_testing_df['country_name'] != "Total"].copy()

In [4]:
hypothesis_testing_df

Unnamed: 0,iso3_country,country_name,tCH4_2015,tCH4_2016,tCH4_2017,tCH4_2018,tCH4_2019,tCH4_2020,tCH4_2021
0,BGD,Bangladesh,2344420.0,2278158.0,2098958.0,2141231.0,2070985.0,2106781.0,1983974.0
1,BRA,Brazil,341023.3,310418.9,372517.3,371703.0,329471.3,490287.4,454487.4
2,CHN,China,6133647.0,5859531.0,6355071.0,5413962.0,5603352.0,6402353.0,6068210.0
3,ESP,Spain,11414.64,13348.03,12172.99,14054.1,11483.24,13054.61,8531.579
4,IDN,Indonesia,1283649.0,1023129.0,961532.7,1176982.0,1266668.0,1188195.0,1009936.0
5,IND,India,6219887.0,5309413.0,6228451.0,6589798.0,7501556.0,7599764.0,6567960.0
6,IRN,Iran (Islamic Republic of),87744.07,91801.21,96202.17,88757.44,95001.99,96002.54,90535.25
7,ITA,Italy,49959.68,49377.85,54436.79,44699.02,45669.14,51015.47,50897.59
8,JPN,Japan,230546.5,228413.3,270893.5,154825.2,233205.6,283516.7,157400.7
9,KHM,Cambodia,495469.8,573169.8,451704.5,559261.0,594727.7,641280.2,564489.1


### Test for Normality: Shapiro-Wilk

#### 2019

In [5]:
# Select the 2019 column from the University of Malaysia data
data_2019 = hypothesis_testing_df['tCH4_2019']
data_2019

0     2.070985e+06
1     3.294713e+05
2     5.603352e+06
3     1.148324e+04
4     1.266668e+06
5     7.501556e+06
6     9.500199e+04
7     4.566914e+04
8     2.332056e+05
9     5.947277e+05
10    1.327782e+05
11    1.461058e+04
12    8.476088e+04
13    1.256888e+06
14    1.056287e+05
15    1.164235e+05
16    6.528548e+05
17    3.584550e+05
18    9.655062e+04
19    1.305046e+06
20    8.990870e+04
21    1.691351e+05
22    1.269751e+06
Name: tCH4_2019, dtype: float64

In [6]:
results = stats.shapiro(data_2019)
print('stat=%.3f, p=%.3f' % (results.statistic, results.pvalue))
if results.pvalue > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=0.567, p=0.000
Probably not Gaussian


##### Results

The distribution is not Gaussian, so a non-parametric test must be used. It is not necessary to perform this test on the 2020 data, but I will do so anyway for practice.

#### 2020

In [7]:
# Select the 2020 column from the University of Malaysia data
data_2020 = hypothesis_testing_df['tCH4_2020']

In [8]:
results = stats.shapiro(data_2020)
print('stat=%.3f, p=%.3f' % (results.statistic, results.pvalue))
if results.pvalue > 0.05:
	print('Probably Gaussian')
else:
	print('Probably not Gaussian')

stat=0.565, p=0.000
Probably not Gaussian


##### Results

The 2020 data are not Gaussian either, which confirms that we need to perform a non-parametric test.

### Independence of Samples
We have to assume that the samples are independent of each other, although we know that both depend on hectares.
The correlations between years are rather high, but this reflects the similarity of hectares from year to year, which makes the amount of CH4 similar as well.
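
The correlation between the two years can be quantified with a rank correlation. The sketch below uses only the ten rows visible in the printed table, so it illustrates the check rather than reproducing the full result; the variable names are assumptions.

```python
import scipy.stats as stats

# tCH4 values for the ten countries shown in the printed table
# (BGD, BRA, CHN, ESP, IDN, IND, IRN, ITA, JPN, KHM).
tch4_2019 = [2070985.0, 329471.3, 5603352.0, 11483.24, 1266668.0,
             7501556.0, 95001.99, 45669.14, 233205.6, 594727.7]
tch4_2020 = [2106781.0, 490287.4, 6402353.0, 13054.61, 1188195.0,
             7599764.0, 96002.54, 51015.47, 283516.7, 641280.2]

# Spearman's rho compares how the countries rank in the two years.
rho, p = stats.spearmanr(tch4_2019, tch4_2020)
print('rho=%.3f, p=%.3f' % (rho, p))
```

For these ten rows the countries rank in the same order in both years, which is what a strong dependence on a slowly changing quantity such as hectares would produce.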


### Distribution Similarity

#### Mann-Whitney U Test

In [9]:
# Example of the Mann-Whitney U Test

stat, p = stats.mannwhitneyu(data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=266.000, p=0.982
Probably the same distribution
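
To see why the p-value is so close to 1, the normal approximation can be worked through by hand. This is a sketch under stated assumptions: 23 observations in each sample (the length of `data_2019` printed earlier), no tied values, and a continuity-corrected normal approximation. Whether SciPy takes exactly this route depends on the installed version and its defaults.

```python
import math

import scipy.stats as stats

n1 = n2 = 23   # rows in data_2019 and data_2020 (assumed equal)
u = 266.0      # statistic printed above

mean_u = n1 * n2 / 2                             # expected U under the null hypothesis
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction (assumes no ties)

# Continuity-corrected z score, then a two-sided p-value.
z = (abs(u - mean_u) - 0.5) / sd_u
p = 2 * stats.norm.sf(z)
print('mean_u=%.1f, z=%.3f, p=%.3f' % (mean_u, z, p))
```

Under the null hypothesis U is expected to be 264.5, so the observed value of 266 sits almost exactly at the centre of its null distribution.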


### Kruskal-Wallis Test

In [10]:

stat, p = stats.kruskal(data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=0.001, p=0.974
Probably the same distribution
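
With only two groups, the Kruskal-Wallis test is equivalent to the Mann-Whitney U test without the continuity correction, so it is a cross-check rather than independent evidence. The sketch below derives H from the Mann-Whitney statistic of 266 reported earlier, assuming 23 observations per sample and no ties.

```python
import scipy.stats as stats

n1 = n2 = 23   # rows in each sample (assumed equal, no ties)
u = 266.0      # Mann-Whitney statistic reported earlier

mean_u = n1 * n2 / 2
var_u = n1 * n2 * (n1 + n2 + 1) / 12

# With two groups, H equals the squared z score without continuity correction.
h = (u - mean_u) ** 2 / var_u
p = stats.chi2.sf(h, df=1)   # H is compared against chi-square with 1 degree of freedom
print('h=%.3f, p=%.3f' % (h, p))
```

The small gap between the two reported p-values (0.982 and 0.974) is consistent with the continuity correction being applied in one test and not in the other.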


### Friedman Test

For the sake of completeness, I will compare the data across the years 2015 through 2020.

In [11]:
# Example of the Friedman Test
#data_2014 = hypothesis_testing_df['tCH4_2014']
data_2015 = hypothesis_testing_df['tCH4_2015']
data_2016 = hypothesis_testing_df['tCH4_2016']
data_2017 = hypothesis_testing_df['tCH4_2017']
data_2018 = hypothesis_testing_df['tCH4_2018']

stat, p = stats.friedmanchisquare(data_2015, data_2016, data_2017, data_2018, data_2019, data_2020)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

stat=11.472, p=0.043
Probably different distributions


#### Results

Some distributions differ from one another. Which ones differ has yet to be determined. For the sake of this analysis I will not attempt to identify them.
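
Should that follow-up ever be needed, one common approach is pairwise Wilcoxon signed-rank tests with a Bonferroni correction. The sketch below is not part of this analysis: it runs on synthetic data, and the `data` dictionary, the seed, and the 0.05 threshold are assumptions.

```python
from itertools import combinations

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's columns: 23 countries, 6 years.
years = [2015, 2016, 2017, 2018, 2019, 2020]
base = rng.lognormal(mean=12.0, sigma=2.0, size=23)
data = {year: base * rng.normal(1.0, 0.1, size=23) for year in years}

pairs = list(combinations(years, 2))
results = {}
for year_a, year_b in pairs:
    # Paired test: the same countries are measured in both years.
    stat, p = stats.wilcoxon(data[year_a], data[year_b])
    # Bonferroni: multiply by the number of comparisons, cap at 1.
    results[(year_a, year_b)] = min(p * len(pairs), 1.0)

for pair, p_adj in results.items():
    flag = 'differs' if p_adj < 0.05 else 'no evidence of difference'
    print(pair, 'adjusted p=%.3f' % p_adj, flag)
```

Six years give 15 pairwise comparisons, which is why the correction matters: without it, roughly one spurious "difference" would be expected at the 0.05 level even if no year truly differed.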

The statement that the distributions of the 2019 and 2020 data do not differ cannot be rejected. That said, we also cannot claim that the means are statistically equivalent, as the data are not normally distributed and a parametric test of the means is therefore not appropriate.

