# **Hypothesis Testing**


The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (for example, the t statistic, or the F statistic used in ANOVA). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of seeing an effect at least as large as the observed one if the null hypothesis is true. Finally, interpret the p-value: if it is low, the effect is said to be statistically significant, which means the data are unlikely under the null hypothesis and the null hypothesis may not be accurate.
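The four steps above can be sketched with a permutation test, which estimates the p-value directly by shuffling group labels. This is only an illustration: the scores in `group_a` and `group_b` are hypothetical, and the helper names `mean_difference` and `permutation_p_value` are not part of the lab.

```python
import random
import statistics

# Hypothetical evaluation scores for two groups (illustrative, not from the lab data).
group_a = [4.1, 4.5, 3.9, 4.8, 4.3, 4.6]
group_b = [3.2, 3.6, 3.1, 3.8, 3.4, 3.0]


def mean_difference(a, b):
    """Test statistic: absolute difference between the group means."""
    return abs(statistics.mean(a) - statistics.mean(b))


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Estimate how often shuffled labels give an effect at least as large as observed."""
    rng = random.Random(seed)
    observed = mean_difference(a, b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        shuffled = mean_difference(pooled[:len(a)], pooled[len(a):])
        # Small tolerance so floating-point rounding does not hide exact ties.
        if shuffled >= observed - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(group_a, group_b)
alpha = 0.05
print(f"p-value = {p_value:.4f}; reject null at alpha={alpha}: {p_value < alpha}")
```

The parametric tests used in the rest of this lab replace the shuffling with a known reference distribution, but the interpretation of the p-value is the same.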


In [1]:
# install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
#! mamba install scipy=1.7.1 -y
#! mamba install seaborn=0.9.0 -y
#! mamba install matplotlib=3.4.3 -y
#! mamba install statsmodels=0.12.0 -y

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats

In [3]:
URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'

In [4]:
ratings_df = pd.read_csv(URL)

## Practice Questions

### Question 1: Using the teachers rating data set, does tenure affect teaching evaluation scores?

*   Use α = 0.05

In [6]:
## insert code here
scipy.stats.ttest_ind(ratings_df[ratings_df['tenure'] == 'yes']['eval'],
                      ratings_df[ratings_df['tenure'] == 'no']['eval'],
                      equal_var=True)

TtestResult(statistic=np.float64(-2.8046798258451777), pvalue=np.float64(0.005249471210198793), df=np.float64(461.0))

Note: <u>*The p-value is less than 0.05, so we reject the null hypothesis: there is evidence that being tenured affects teaching evaluation scores.*</u>
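To make the `statistic` and `df` fields of the result less opaque, here is a rough sketch of the pooled-variance t statistic that a two-sample test with `equal_var=True` is based on. The helper name `pooled_t_statistic` and the input scores are hypothetical and are used only to keep the arithmetic easy to follow.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Return (t, df) for a two-sample t-test that assumes equal variances."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / standard_error
    return t, df


# Hypothetical scores, not taken from the teaching ratings data.
t_stat, df = pooled_t_statistic([1, 2, 3], [2, 3, 4])
print(t_stat, df)
```

The degrees of freedom are the two group sizes added together minus 2, which is how the `df=461.0` in the output above arises from the two tenure groups.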


### Question 2: Using the teachers rating data set, is there an association between age and tenure?

*   Discretize age into three groups: 40 years and younger, between 40 and 57 years, and 57 years and older. (This is done for you in the next code cell.)
*   What is your conclusion at α = 0.01 and α = 0.05?


In [12]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'

In [21]:
## insert code here
## state your hypothesis
# Null Hypothesis: There is no association between age and tenure
# Alternative Hypothesis: There is an association between age and tenure


cont_table = pd.crosstab(ratings_df['tenure'], ratings_df['age_group'])
cont_table


| tenure | 40 years and younger | 57 years and older | between 40 and 57 years |
|--------|----------------------|--------------------|-------------------------|
| no     | 15                   | 25                 | 62                      |
| yes    | 98                   | 97                 | 166                     |


In [14]:
## use the chi-square function
scipy.stats.chi2_contingency(cont_table)


Chi2ContingencyResult(statistic=np.float64(8.749576239010711), pvalue=np.float64(0.012590809706820845), dof=2, expected_freq=array([[ 24.89416847,  26.87688985,  50.22894168],
       [ 88.10583153,  95.12311015, 177.77105832]]))

1. <u>At α = 0.01, the p-value (about 0.0126) is greater than α, so we fail to reject the null hypothesis: there is insufficient evidence of an association between age and tenure.</u>
2. <u>At α = 0.05, the p-value (about 0.0126) is less than α, so we reject the null hypothesis: there is evidence of an association between age and tenure.</u>
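As a cross-check on the two conclusions, the chi-square statistic can be recomputed by hand from the counts in the cross-tab above. This is a minimal sketch using only the standard library; the shortcut for the p-value is valid only because this table has exactly 2 degrees of freedom.

```python
import math

# Observed counts copied from the cross-tab output above.
# Rows: tenure no, yes. Columns: 40 and younger, 57 and older, between 40 and 57.
observed = [[15, 25, 62],
            [98, 97, 166]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2_stat = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For exactly 2 degrees of freedom the chi-square tail probability is exp(-x / 2).
assert dof == 2
p_value = math.exp(-chi2_stat / 2)

for alpha in (0.01, 0.05):
    decision = "reject" if p_value < alpha else "fail to reject"
    print(f"alpha={alpha}: {decision} the null hypothesis")
```

The same p-value leads to different decisions at the two significance levels, which is why α should be fixed before looking at the result.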

### Question 3: Test for equality of variance for beauty scores between tenured and non-tenured instructors

*   Use α = 0.05

In [15]:
## insert code here
### use the levene function to find the p-value and conclusion
scipy.stats.levene(ratings_df[ratings_df['tenure'] == 'yes']['beauty'],
                   ratings_df[ratings_df['tenure'] == 'no']['beauty'], 
                   center='mean')

LeveneResult(statistic=np.float64(0.4884241652750426), pvalue=np.float64(0.4849835158609811))

<u>Since the p-value is greater than 0.05, we fail to reject the null hypothesis of equal variances and will assume equality of variance for both groups.</u>
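For intuition about what the Levene statistic measures, the sketch below computes it by hand for the `center='mean'` variant: it is a one-way ANOVA F statistic applied to each value's absolute deviation from its own group mean. The helper name `levene_statistic` and the sample values are hypothetical, and the p-value (which needs the F distribution) is left out.

```python
import statistics


def levene_statistic(*groups):
    """Levene's W with center='mean': a one-way ANOVA F on absolute deviations."""
    # Absolute deviation of every value from its own group mean.
    deviations = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    k = len(groups)
    n_total = sum(len(z) for z in deviations)
    group_means = [statistics.mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means)
                 for v in z)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical scores: the first two groups have the same spread, `wide` does not.
narrow_a = [1.0, 2.0, 3.0]
narrow_b = [11.0, 12.0, 13.0]
wide = [0.0, 10.0, 20.0, 30.0]

print(levene_statistic(narrow_a, narrow_b))  # same spread, different location
print(levene_statistic(narrow_a, wide))      # clearly different spread
```

A shift in location alone leaves the statistic near zero, while a difference in spread makes it large. That is why a small statistic, like the 0.49 above, gives no reason to doubt equal variances.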

### Question 4: Using the teachers rating data set, is there an association between visible minorities and tenure?

*   Use α = 0.05


In [17]:
## insert code here
## State your hypothesis and create a cross-tab:
# Null Hypothesis: There is no association between visible minorities and tenure
# Alternative Hypothesis: There is an association between visible minorities and tenure

cont_table = pd.crosstab(ratings_df['vismin'], ratings_df['tenure'])

In [18]:
## run chi2_contingency() on the contingency table
scipy.stats.chi2_contingency(cont_table, correction = True)

Chi2ContingencyResult(statistic=np.float64(1.3675127484429763), pvalue=np.float64(0.24223968800237183), dof=1, expected_freq=array([[ 87.90064795, 311.09935205],
       [ 14.09935205,  49.90064795]]))

<u>Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is insufficient evidence of an association between visible minorities and tenure.</u>
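Because this table is 2x2 (one degree of freedom), `correction=True` applies Yates' continuity correction. The sketch below shows roughly what that adjustment does, assuming the standard form of the correction. The helper `chi2_2x2` and the counts in `table` are hypothetical, since the lab's vismin-by-tenure cross-tab is not printed.

```python
import math


def chi2_2x2(table, correction=True):
    """Chi-square test of independence for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if correction:
                # Yates: shrink each |observed - expected| by 0.5, but not below zero.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # With 1 degree of freedom the tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, not the lab's data.
table = [[30, 70],
         [20, 80]]
print(chi2_2x2(table, correction=True))
print(chi2_2x2(table, correction=False))
```

The correction always lowers the statistic and raises the p-value, so it makes the test more conservative. The effect is most noticeable when cell counts are small.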