# Hypothesis Testing using Python

- Hypothesis testing is a statistical method for making inferences or predictions about a population based on a sample of data.

### Basics of Hypothesis Testing
- Null Hypothesis (H0): The null hypothesis assumes no effect or difference in the population. It serves as the default assumption that we try to reject.
- Alternative Hypothesis (Ha or H1): The claim that there is an effect or a difference. This is what you are seeking evidence for.
- Significance Level (α): The probability of rejecting the null hypothesis when it is actually true (a Type I error). Common choices are 0.01, 0.05, and 0.1.
- Test Statistic: This is a measure calculated from the sample data that is used to assess the evidence against the null hypothesis.
- P-value: The probability of observing a test statistic at least as extreme as the one calculated, assuming that the null hypothesis is true.
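
To make the test statistic and p-value concrete, here is a minimal sketch that computes both by hand for a one-sample t-test and cross-checks them against SciPy. The sample values and the hypothesized mean `mu_0` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized mean, chosen only for illustration
sample = np.array([52.1, 48.3, 55.7, 50.9, 47.2, 53.8, 51.4, 49.6])
mu_0 = 50.0

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)            # sample standard deviation (n - 1 denominator)
standard_error = sample_std / np.sqrt(n)

# Test statistic: how many standard errors the sample mean lies from mu_0
t_manual = (sample_mean - mu_0) / standard_error

# Two-sided p-value: probability of a |t| at least this large if H0 is true
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two printed lines should agree up to floating-point rounding, which shows that the p-value is simply the tail area of the t distribution beyond the observed statistic.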

### Steps in Hypothesis Testing
1. State the Hypotheses: Null and alternative
2. Choose significance level: Typically 0.05
3. Collect Data: gather the sample data
4. Calculate the Test Statistic and P-value
5. Make a decision: Reject or fail to reject the null hypothesis based on the p-value.
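
One way to see what the significance level means in practice is to repeat these steps many times on data for which the null hypothesis is true by construction. The sketch below does that with simulated samples. The seed, sample size, and number of experiments are arbitrary choices, and the share of rejections should land close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only
alpha = 0.05                      # step 2: significance level
n_experiments = 2000              # assumed number of repeated experiments
sample_size = 30                  # assumed sample size per experiment

rejections = 0
for _ in range(n_experiments):
    # Step 3: collect data. H0 is true here: the population mean really is 50.
    sample = rng.normal(loc=50, scale=10, size=sample_size)
    # Step 4: test statistic and p-value
    _, p_value = stats.ttest_1samp(sample, 50)
    # Step 5: decision
    if p_value < alpha:
        rejections += 1

false_positive_rate = rejections / n_experiments
print("Share of experiments that rejected a true H0:", false_positive_rate)
```

Every rejection in this simulation is a false positive, so the printed share is an estimate of the Type I error rate.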

## One-sample t-test

- We compare the mean of a sample against a known value, or theoretical expectation.
- Suppose we have exam scores for a class and want to test whether the average is significantly different from 50.

In [1]:
import numpy as np
from scipy import stats

#Generate some example data

np.random.seed(0)
scores = np.random.normal(52,10,50)  # mean=52, std_dev=10, sample_size=50

# Null hypothesis: the mean is 50
# Alternative hypothesis: the mean is not 50

alpha = 0.05 #significance level

t_stat, p_value = stats.ttest_1samp(scores, 50)  # one-sample t-test

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Making a decision
if p_value < alpha:
    print("Reject the null hypothesis: The mean is significantly different from 50.")
else:
    print("Fail to reject the null hypothesis: The mean is not significantly different from 50.")

t-statistic: 2.1180509633837428
p-value: 0.0392675594917796
Reject the null hypothesis: The mean is significantly different from 50.


## Two-sample t-test


- The two-sample t-test is used to compare the means of two independent samples to see if they are significantly different.


In [5]:
# Implementation (reuses the np and stats imports from the one-sample example)

# Generating some example data
group1_scores = np.random.normal(52, 10, 50)  # mean=52, std_dev=10, sample_size=50
group2_scores = np.random.normal(48, 10, 50)  # mean=48, std_dev=10, sample_size=50

# Null hypothesis: the means are equal
# Alternative hypothesis: the means are not equal
alpha = 0.05  # significance level
t_stat, p_value = stats.ttest_ind(group1_scores, group2_scores)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Making a decision
if p_value < alpha:
    print("Reject the null hypothesis: The means are significantly different.")
else:
    print("Fail to reject the null hypothesis: The means are not significantly different.")

t-statistic: 3.0615775843215727
p-value: 0.0028416162064019927
Reject the null hypothesis: The means are significantly different.
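
By default, `stats.ttest_ind` pools the two sample variances (`equal_var=True`), which assumes both populations have the same variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on made-up groups with deliberately different spreads and sizes. The group parameters and seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Hypothetical groups with different standard deviations and sample sizes
group_a = rng.normal(loc=52, scale=5, size=40)
group_b = rng.normal(loc=48, scale=15, size=25)

# Student's t-test: pools the variances (SciPy's default)
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student: t =", t_student, "p =", p_student)
print("Welch:   t =", t_welch, "p =", p_welch)
```

The two tests generally give different statistics and p-values when the spreads and sample sizes differ, so it is worth deciding which assumption fits the data before running the test.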


## Chi-Square Test for Independence

- The Chi-Square test for independence checks whether there is a significant association between two categorical variables.
- For example, you may want to determine whether there is a relationship between gender and preference for a specific product, political party affiliation, etc.

### Example Scenario
Let's consider a hypothetical scenario where a college offers two majors: Science and Arts. We have surveyed 100 students to find out their gender and major. We want to know if the choice of major is independent of gender.

Here is the observed data:

- Science major: 40 females, 30 males
- Arts major: 10 females, 20 males

The hypotheses are:

1. The null hypothesis states that there is no association between the choice of major and gender (they are independent).
2. The alternative hypothesis states that there is an association (they are not independent).


### Contingency Table
| Major / Gender  | Female | Male  | Row Total |
|-----------------|--------|-------|-----------|
| Science         |   40   |  30   |     70    |
| Arts            |   10   |  20   |     30    |
| **Column Total**|   50   |  50   |    100    |
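
The test compares these observed counts with the counts we would expect if major and gender were independent. As a minimal sketch, each expected count can be computed from the table's margins as row total × column total / grand total:

```python
import numpy as np

# Observed counts from the contingency table above
observed = np.array([
    [40, 30],  # Science: female, male
    [10, 20],  # Arts: female, male
])

row_totals = observed.sum(axis=1)   # totals per major
col_totals = observed.sum(axis=0)   # totals per gender
grand_total = observed.sum()

# Under independence: expected = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

For this table the expected counts are 35 per gender for Science and 15 per gender for Arts.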


In [7]:
# Implementation
# Define the observed frequencies
observed = np.array([
    [40, 30],  # Science
    [10, 20]   # Arts
])

# Run the Chi-Square test using SciPy's chi2_contingency function.
# Note: for a 2x2 table, it applies Yates' continuity correction by default.

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi2 Stat:", chi2_stat)
print("Degrees of Freedom:", dof)
print("P-Value:", p_value)
print("Expected Frequencies Table:")
print(expected)

# Decision Making
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between gender and choice of major.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between gender and choice of major.")


Chi2 Stat: 3.857142857142857
Degrees of Freedom: 1
P-Value: 0.04953461343562649
Expected Frequencies Table:
[[35. 35.]
 [15. 15.]]
Reject the null hypothesis: There is a significant association between gender and choice of major.


- Given that the p-value (about 0.0495) is less than the significance level α = 0.05, we reject the null hypothesis. There is enough evidence to conclude that there is a significant association between the choice of major and gender in this sample of students, although the p-value is only just below the threshold.

- This means that, in this particular example, gender and choice of major (Science or Arts) are associated among the surveyed students. However, remember that an association does not imply causation: the test does not show that one variable causes the other.
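
The statistic reported above (3.857) includes Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables. As a sketch of how much that choice matters here, the code below reruns the test with `correction=False` and also computes the uncorrected Pearson statistic by hand from the observed and expected counts.

```python
import numpy as np
from scipy import stats

observed = np.array([
    [40, 30],  # Science: female, male
    [10, 20],  # Arts: female, male
])

# Default behaviour: Yates' continuity correction is applied (2x2 table, dof = 1)
chi2_corrected, p_corrected, _, expected = stats.chi2_contingency(observed)

# Same test without the continuity correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

# Pearson's statistic by hand: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print("With correction:    chi2 =", chi2_corrected, "p =", p_corrected)
print("Without correction: chi2 =", chi2_plain, "p =", p_plain)
print("Manual (no correction):", chi2_manual)
```

Without the correction the statistic is larger (about 4.76) and the p-value smaller, so the borderline result above depends partly on which variant of the test is used.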