
# Hypothesis Testing

A useful procedure when making statistical decisions about a dataset is *hypothesis testing*. Hypothesis testing lets us assess an assumption about a population parameter: we evaluate two mutually exclusive statements about the population and determine which one is better supported by the gathered sample data.

**Null hypothesis** $H_{0}$: a statement of no difference between sample means or proportions, or of no difference between a sample mean or proportion and a population mean or proportion.


**Alternative hypothesis** $H_{\alpha}$: a claim about the population that contradicts $H_{0}$; it is what we conclude when we reject $H_{0}$.


After you have determined which hypothesis the sample supports, you make a decision. There are two options: "reject $H_{0}$" if the sample information favors the alternative hypothesis, or "do not reject $H_{0}$" (equivalently, "decline to reject $H_{0}$") if the sample information is insufficient to reject the null hypothesis.
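
For a test about a population mean, which is the kind of test this lab asks for, the two hypotheses and the test statistic can be written out explicitly. This is a sketch under assumptions: the symbols $\mu$ (the population mean under test), $\mu_{0}$ (the baseline mean), and $\bar{x}$, $s$, $n$ (sample mean, sample standard deviation, sample size) are introduced here for illustration, and a two-sided alternative is assumed.

$$
H_{0}:\ \mu = \mu_{0} \qquad \text{versus} \qquad H_{\alpha}:\ \mu \neq \mu_{0}, \qquad z = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}}
$$

A value of $z$ far from zero in either direction is evidence against $H_{0}$; a value near zero is not.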

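Another useful quantity in hypothesis testing is the **significance level**, $\alpha$. It is the probability of rejecting $H_{0}$ when $H_{0}$ is actually true, usually quoted as a percentage. We reject $H_{0}$ when the $p$-value of the test falls below $\alpha$, and do not reject it otherwise.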
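
To make the role of $\alpha$ concrete, here is a minimal sketch of the decision step for a two-sided test based on a standard normal statistic. The function names `two_sided_p_value` and `decide` and the example value `z = 2.5` are hypothetical, chosen only for illustration; `scipy.stats.norm.sf` gives the upper-tail probability of the standard normal distribution.

```python
from scipy.stats import norm


def two_sided_p_value(z):
    """Probability, assuming H0 is true, of a z-statistic at least as extreme as |z|."""
    # norm.sf is the survival function 1 - CDF; doubling it covers both tails.
    return 2 * norm.sf(abs(z))


def decide(p_value, alpha=0.05):
    """Return the decision about H0 at significance level alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"


# Hypothetical example: a z-statistic of 2.5 tested at alpha = 5%.
p = two_sided_p_value(2.5)
print(f"p-value = {p:.4f}: {decide(p, alpha=0.05)}")
```

The same $p$-value can lead to different decisions under different choices of $\alpha$, which is why $\alpha$ should be fixed before looking at the data.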


**Problem:** Hypothesis Testing: Worldwide Nutrition

In this problem you will work with the same dataset as in Lab IV. The goal of this lab is to perform hypothesis testing on a food category from the dataset. Your baseline year will be $1961$, which means that your worldwide population mean will be computed from that year's data. The hypothesis testing will be performed with samples from the other years.

Your mission, if you choose to accept it, is to develop a function that will:


1.   Present your null hypothesis $H_{0}$.
2.   Given a sample from a different year, present your alternative hypothesis $H_{\alpha}$.
3.   Perform the hypothesis testing, and output a decision regarding $H_0$.
4.   Indicate the type of error that one could make given said decision. (Don't just say 'Type 1' or 'Type 2', but actually articulate it.)

Make sure that your $\alpha$ is a parameter in your function. A good choice for it is $\alpha  = 5 \%$, however feel free to use other values as long as you're confident about their meaning.
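
One way to build confidence about what $\alpha$ means is a small simulation. The sketch below repeatedly draws samples from a population in which $H_{0}$ is true and counts how often a two-sided z-test rejects it anyway. Everything here is an assumption made for illustration: the function name `type_one_error_rate`, the synthetic normal population with mean 0 and standard deviation 1, and the trial count. It does not use the lab dataset.

```python
import numpy as np
from scipy.stats import norm


def type_one_error_rate(alpha, n_trials=20_000, sample_size=50, seed=0):
    """Estimate how often a two-sided z-test rejects H0 when H0 is true."""
    rng = np.random.default_rng(seed)
    # Every row is one sample drawn from a population whose true mean is 0,
    # so H0 (mean = 0) holds in every trial.
    samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, sample_size))
    x_bar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)  # sample standard deviation
    z = (x_bar - 0.0) / (s / np.sqrt(sample_size))
    p_values = 2 * norm.sf(np.abs(z))
    # Each rejection here is a Type I error, because H0 is true by construction.
    return np.mean(p_values < alpha)


rate = type_one_error_rate(alpha=0.05)
print(f"Fraction of true null hypotheses rejected: {rate:.3f}")
```

The printed fraction should land roughly at $\alpha$. It tends to sit slightly above it for moderate sample sizes, because the test treats the estimated standard deviation as if it were exact.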

Once you have your function ready consider running it by using samples from the following years:

`years = np.arange(1962, 2013, 5)`
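
As a quick check of which years that expression selects, the snippet below prints the array. `np.arange` excludes its stop value, so with a stop of 2013 and a step of 5 the last year included is 2012.

```python
import numpy as np

# Start at 1962, step by 5 years, stop before 2013.
years = np.arange(1962, 2013, 5)
print(years)
print(f"{len(years)} sample years")
```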

Analyze and describe the results that you're observing. In other words, imagine that you're working for the WHO and your goal is to summarize your results for the general public.



In [None]:
# Your Solution Goes Here

# Feel free to use as many cells as your heart desires

In [137]:
from google.colab import files 

import numpy as np
import pandas as pd
import statsmodels.api as sm

from scipy.integrate import quad 
from scipy.stats import norm, sem

import random

import matplotlib.pyplot as plt

In [146]:
def standard_normal(x):
    f_x = (1 / np.sqrt(2*np.pi) ) * np.exp( - (x)*(x) / (2) ) 
    return f_x

def hypothesis_test(sample, sample_size, mu_baseline, alpha):

    x_bar = np.mean(sample)
    print(f'H0: The average consumption per person is {np.round(mu_baseline,2)} Cal.')
    print(f'H\u03B1: Given a sample mean of {np.round(x_bar,2)} Cal, the average consumption per person has changed.\n')
    s = np.std(sample, ddof=1)  # sample standard deviation (n - 1 in the denominator)
    z_bar = (x_bar - mu_baseline)/(s/np.sqrt(sample_size))

    # Two-tailed p-value: total probability in both tails beyond |z_bar|.
    # By symmetry of the standard normal density, a sample mean above or below
    # the baseline leads to the same expression, so no branching is needed.
    z_abs = abs(z_bar)
    p_val = (quad(standard_normal, -np.inf, -z_abs)[0] + quad(standard_normal, z_abs, np.inf)[0])

    if p_val < alpha:
        print(f'Your p_value = {np.round(p_val*100,2)} % is smaller than the indicated confidence of {alpha*100} %\nYour null hypothesis H0 must be rejected.')
        print(f'Type I error: We think that the average consumption has changed, when in fact it has remained the same.\n')
    else:
        print(f'Your p_value = {np.round(p_val*100,2)} % is greater than the indicated confidence of {alpha*100} %\nYour null hypothesis H0 should not be rejected.')
        print(f'Type II error: We think that the average consumption has remained the same, when in fact it has changed.\n')
    
    return 0
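
The two-tailed $p$-value above is obtained by numerically integrating the standard normal density. As a sanity check, that integral can be compared with the closed-form tail probability from `scipy.stats.norm.sf`. This is a minimal sketch: the helper name `p_value_by_integration` and the example z-values are hypothetical, and `standard_normal` is redefined here only so the snippet runs on its own.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def standard_normal(x):
    """Standard normal probability density."""
    return (1 / np.sqrt(2 * np.pi)) * np.exp(-x * x / 2)


def p_value_by_integration(z):
    """Two-tailed p-value: twice the area under the density beyond |z|."""
    upper_tail = quad(standard_normal, abs(z), np.inf)[0]
    return 2 * upper_tail


# Compare numerical integration with the survival function for a few z-values.
for z in (0.5, 1.96, -2.97):
    by_quad = p_value_by_integration(z)
    by_sf = 2 * norm.sf(abs(z))
    print(f"z = {z:5.2f}: quad = {by_quad:.6f}, norm.sf = {by_sf:.6f}")
```

If the two columns agree, `2 * norm.sf(abs(z))` can stand in for the integral wherever a shorter expression is preferred.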

In [139]:
# Here, we will upload our data from a local machine directly into our colab notebook

uploaded = files.upload()
df = pd.read_csv('dietary-compositions-by-commodity-group.csv')
df

Saving dietary-compositions-by-commodity-group.csv to dietary-compositions-by-commodity-group (2).csv


Unnamed: 0,Entity,Code,Year,Cereals and Grains (FAO (2017)) (kilocalories per person per day),Pulses (FAO (2017)) (kilocalories per person per day),Starchy Roots (FAO (2017)) (kilocalories per person per day),Sugar (FAO (2017)) (kilocalories per person per day),Oils & Fats (FAO (2017)) (kilocalories per person per day),Meat (FAO (2017)) (kilocalories per person per day),Dairy & Eggs (FAO (2017)) (kilocalories per person per day),Fruit and Vegetables (FAO (2017)) (kilocalories per person per day),Other (FAO (2017)) (kilocalories per person per day),Alcoholic Beverages (FAO (2017)) (kilocalories per person per day)
0,Afghanistan,AFG,1961,2530,16,25,51,92,88,102,82,13,0.0
1,Afghanistan,AFG,1962,2458,17,22,45,98,88,101,76,12,0.0
2,Afghanistan,AFG,1963,2212,17,23,47,106,91,110,79,13,0.0
3,Afghanistan,AFG,1964,2445,18,24,55,102,93,110,95,11,0.0
4,Afghanistan,AFG,1965,2431,18,24,57,105,95,118,95,13,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8149,Zimbabwe,ZWE,2009,1222,56,56,209,345,94,53,28,17,67.0
8150,Zimbabwe,ZWE,2010,1233,49,54,223,342,94,56,32,19,66.0
8151,Zimbabwe,ZWE,2011,1257,39,56,232,343,98,59,32,18,66.0
8152,Zimbabwe,ZWE,2012,1210,35,55,264,353,99,55,31,16,79.0


In [150]:
category = 'Sugar (FAO (2017)) (kilocalories per person per day)'
year_bl = 1961

print(f'Compared to the global average consumption in {year_bl}:\n')

df_bl = df[df.Year == year_bl]
df_category_bl = list(df_bl[category])

mu_bl = np.mean(df_category_bl)

years = np.arange(1962, 2013, 5)
alpha  = 0.003
sample_size = 50

for year in years:

    print(f'In {year}:\n')

    next_df = list(df[df.Year == year][category])
    sample = random.sample(next_df, sample_size)

    hypothesis_test(sample, sample_size, mu_bl, alpha)
    print('____________________________________________________')

Compared to the global average consumption in 1961:

In 1962:

H0: The average consumption per person is 227.61 Cal.
Hα: Given a sample mean of 216.0 Cal, the average consumption per person has changed.

Your p_value = 61.26 % is greater than the indicated confidence of 0.3 %
Your null hypothesis H0 should not be rejected.
Type II error: We think that the average consumption has remained the same, when in fact it has changed.

____________________________________________________
In 1967:

H0: The average consumption per person is 227.61 Cal.
Hα: Given a sample mean of 285.38 Cal, the average consumption per person has changed.

Your p_value = 0.51 % is greater than the indicated confidence of 0.3 %
Your null hypothesis H0 should not be rejected.
Type II error: We think that the average consumption has remained the same, when in fact it has changed.

____________________________________________________
In 1972:

H0: The average consumption per person is 227.61 Cal.
Hα: Given a sample me