# Hypothesis Testing in Statistics

## Introduction to Hypothesis Testing
Hypothesis testing is a method used to determine if there is enough statistical evidence in a sample to infer that a certain condition is true for the entire population. It is a key technique in inferential statistics, helping to make decisions based on sample data.

**Key Terms:**
  - *Null Hypothesis (H₀)*: The hypothesis that there is no effect or no difference.

  - *Alternative Hypothesis (H₁)*: The hypothesis that there is an effect or a difference.

  - *Significance Level (α)*: The probability of rejecting the null hypothesis when it is actually true (a Type I error). Common values are 0.05 and 0.01.

  - *p-value*: The probability of observing the data, or something more extreme, if the null hypothesis is true. If the p-value is less than α, reject the null hypothesis; otherwise, fail to reject it.
  
<div align="center">  
  
| Condition (α = 0.05) | Decision                                  |
|----------------------|-------------------------------------------|
| p-value < 0.05       | Reject H₀ (evidence supports H₁)          |
| p-value ≥ 0.05       | Fail to reject H₀ (insufficient evidence) |

</div>
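The decision rule above can be sketched as a small helper. This is only an illustration: `decide` is a hypothetical function name, not part of SciPy or any other library, and the default α of 0.05 simply mirrors the table.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p_value < alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return "Reject H0" if p_value < alpha else "Fail to reject H0"


print(decide(0.03))              # below the default alpha of 0.05
print(decide(0.60))              # above alpha
print(decide(0.03, alpha=0.01))  # a stricter alpha changes the decision
```

Note that the second outcome is worded "fail to reject": a large p-value means the sample did not provide enough evidence against H₀, not that H₀ has been shown to be true.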

## Types of Hypothesis Tests

###  `Z-Test`
A `Z-test` is used when the population variance is known, or the sample size is large (n > 30). It checks whether a sample mean differs from a hypothesized population mean, or whether the means of two large samples are different.

---

**Example: Testing the Average Height of Males in India**


Assume you want to test whether the average height of males in India differs from 5'7" (170 cm).



- *Null Hypothesis (H₀)*: The average height of males in India is 170 cm.

- *Alternative Hypothesis (H₁)*: The average height of males in India is not 170 cm.

In [None]:
import numpy as np
from scipy import stats

# Sample data: heights (cm) of 10 males in India
# (illustrative only; a real Z-test needs n > 30 or a known population variance)
heights = np.array([172, 167, 171, 175, 169, 170, 168, 173, 174, 166])

# Hypothesized population mean, plus the sample mean and sample standard deviation
population_mean = 170
sample_mean = np.mean(heights)
sample_std = np.std(heights, ddof=1)
n = len(heights)

# Z-test calculation
z_statistic = (sample_mean - population_mean) / (sample_std / np.sqrt(n))
p_value = 2 * (1 - stats.norm.cdf(abs(z_statistic)))

print(f"Z-statistic: {z_statistic}")
print(f"p-value: {p_value}")


Z-statistic: 0.5222329678670935
p-value: 0.6015081344405899


Since the p-value (≈ 0.60) is far greater than 0.05, we fail to reject H₀: the sample provides no evidence that the average height differs from 170 cm.
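The Z-test is also described above as a way to compare the means of two large samples. A minimal sketch of that case follows, assuming the sample variances can stand in for the population variances because both samples are large. The function name `two_sample_z_test` is hypothetical, and the data are synthetic values generated only for illustration, not real measurements.

```python
import numpy as np
from scipy import stats


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means of two large samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standard error of the difference in means; sample variances are used
    # in place of population variances (reasonable only for large n).
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    # sf(x) = 1 - cdf(x); doubling it gives the two-sided p-value
    p = 2 * stats.norm.sf(abs(z))
    return z, p


# Synthetic heights in cm (assumed values, generated for illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=170, scale=7, size=200)
group_b = rng.normal(loc=172, scale=7, size=200)

z_statistic, p_value = two_sample_z_test(group_a, group_b)
print(f"Z-statistic: {z_statistic:.3f}")
print(f"p-value: {p_value:.3f}")
```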

##  `T-Test`

A `T-test` is used when:

- The sample size is small (n < 30).

- The population standard deviation is unknown.

There are three types of `T-tests`:

- `One-sample T-test`
- `Independent two-sample T-test`
- `Paired T-test`


###  `One-Sample T-Test`

A `one-sample T-test` tests whether the mean of a single sample differs significantly from a known or hypothesized population mean.

---

**Example: Testing the Average Age of Employees in an Indian Tech Company**

Let's assume you want to test whether the average age of employees in a company differs from 30 years.

- *Null Hypothesis (H₀)*: The average age of employees is 30.

- *Alternative Hypothesis (H₁)*: The average age of employees is not 30.

In [None]:
from scipy import stats

# Sample data: ages of employees in an Indian tech company
ages = [28, 32, 29, 31, 27, 33, 34, 30, 29, 32]

# Population mean
population_mean = 30

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(ages, population_mean)

print(f"T-statistic: {t_statistic}")
print(f"p-value: {p_value}")

T-statistic: 0.6956083436402524
p-value: 0.5042379030441878


Since the p-value (≈ 0.50) is greater than the significance level of 0.05, we fail to reject H₀: the sample provides no evidence that the average age differs from 30.
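To see what `ttest_1samp` computes, the same statistic can be rebuilt by hand from the sample mean, the sample standard deviation, and the t distribution with n − 1 degrees of freedom. This is a sketch for checking the result above; the variable names `t_manual` and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

ages = np.array([28, 32, 29, 31, 27, 33, 34, 30, 29, 32])
population_mean = 30

n = ages.size
# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = ages.std(ddof=1) / np.sqrt(n)
t_manual = (ages.mean() - population_mean) / standard_error
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(ages, population_mean)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```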

###  `Independent Two-Sample T-Test`

An `independent two-sample T-test` compares the means of two independent groups to see if there is a statistically significant difference between them.

---

**Example: Comparing Test Scores Between Students in Two Indian Cities**

Assume you want to compare the test scores of students from Delhi and Bangalore.

- *Null Hypothesis (H₀)*: The average test scores are the same for both cities.
- *Alternative Hypothesis (H₁)*: The average test scores are different.

In [None]:
from scipy import stats

# Sample test scores for students in Delhi and Bangalore
delhi_scores = [88, 92, 85, 90, 91, 87]
bangalore_scores = [84, 86, 89, 83, 85, 88]

# Perform independent t-test
t_statistic, p_value = stats.ttest_ind(delhi_scores, bangalore_scores)

print(f"T-statistic: {t_statistic}")
print(f"p-value: {p_value}")

T-statistic: 2.092457497388747
p-value: 0.06286899974610866


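Since the p-value (≈ 0.063) is greater than 0.05, we fail to reject H₀: the difference in average scores between the two cities is not statistically significant at the 5% level.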
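By default, `ttest_ind` assumes that both groups have the same population variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two variants on the same data; whether Welch's test is the better choice here is a judgment call that the example does not settle.

```python
from scipy import stats

delhi_scores = [88, 92, 85, 90, 91, 87]
bangalore_scores = [84, 86, 89, 83, 85, 88]

# Student's t-test (default): assumes equal population variances
t_pooled, p_pooled = stats.ttest_ind(delhi_scores, bangalore_scores)

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(
    delhi_scores, bangalore_scores, equal_var=False
)

print(f"Pooled: t = {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.4f}, p = {p_welch:.4f}")
```

Because both groups have the same size, the two T-statistics coincide; only the degrees of freedom, and therefore the p-value, differ.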

###  `Paired T-Test`

A `paired T-test` is used to compare the means of the same group at two different times or under two different conditions. This test is often used in "before and after" experiments.

---

**Example: Effect of a New Teaching Method on Students’ Scores**

Suppose you want to test if a new teaching method has improved students' test scores. You have test scores before and after applying the new method.

  - *Null Hypothesis (H₀)*: The new teaching method has no effect on test scores.
  - *Alternative Hypothesis (H₁)*: The new teaching method improves test scores.

In [None]:
from scipy import stats

# Sample data: test scores before and after the new teaching method
before_scores = [75, 78, 72, 74, 77, 80]
after_scores = [78, 82, 76, 79, 81, 85]

# Perform paired t-test (two-sided by default)
t_statistic, p_value = stats.ttest_rel(before_scores, after_scores)

print(f"T-statistic: {t_statistic}")
print(f"p-value: {p_value}")

T-statistic: -13.558153613666013
p-value: 3.911248690451021e-05


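Because the p-value (≈ 3.9e-05) is far below the 0.05 significance level, we reject H₀. Note that `ttest_rel` runs a two-sided test by default; the negative T-statistic (before minus after) shows that scores were higher after the new method, which supports H₁.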
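H₁ in this example is one-sided (scores improve), so a one-sided test matches it more closely. A minimal sketch, assuming SciPy 1.6 or later, where `ttest_rel` accepts the `alternative` parameter:

```python
from scipy import stats

before_scores = [75, 78, 72, 74, 77, 80]
after_scores = [78, 82, 76, 79, 81, 85]

# alternative="less" tests H1: mean(before - after) < 0,
# i.e. scores are higher after the new teaching method
t_statistic, p_one_sided = stats.ttest_rel(
    before_scores, after_scores, alternative="less"
)

# Two-sided test for comparison (the default)
_, p_two_sided = stats.ttest_rel(before_scores, after_scores)

print(f"T-statistic: {t_statistic:.4f}")
print(f"one-sided p-value: {p_one_sided:.3e}")
print(f"two-sided p-value: {p_two_sided:.3e}")
```

Since the observed difference lies in the direction stated by H₁, the one-sided p-value is half of the two-sided one.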

##  `Chi-Square Test`

A `Chi-square test` is used to examine the association between two categorical variables or to test the goodness of fit.

---

**Example: Testing the Relationship Between Education Level and Job Preference**

Suppose you want to examine if education level (Graduate or Postgraduate) is related to job preference (Private sector or Government job) in a sample of 100 individuals.


  - *Null Hypothesis (H₀)*: There is no relationship between education level and job preference.

  - *Alternative Hypothesis (H₁)*: There is a relationship between education level and job preference.

In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of education level and job preference
data = np.array([[30, 20], [25, 25]])

# Perform Chi-square test
chi2, p_value, dof, expected = chi2_contingency(data)

print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p_value}")


Chi-square statistic: 0.6464646464646464
p-value: 0.4213795037428696


Since the p-value (≈ 0.42) is greater than 0.05, we fail to reject H₀: the sample shows no evidence of a relationship between education level and job preference.
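The test compares the observed counts with the counts expected if the two variables were independent. The sketch below prints those expected counts and also shows the effect of Yates' continuity correction, which `chi2_contingency` applies by default to 2×2 tables. Treating rows as education levels and columns as job preferences is an assumption about the table layout; the test result does not depend on it.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed layout: rows = education level, columns = job preference
data = np.array([[30, 20], [25, 25]])

# Default: Yates' continuity correction is applied when dof == 1
chi2_yates, p_yates, dof, expected = chi2_contingency(data)

# Same test without the continuity correction
chi2_plain, p_plain, _, _ = chi2_contingency(data, correction=False)

print("Expected counts under independence:")
print(expected)
print(f"Degrees of freedom: {dof}")
print(f"With correction:    chi2 = {chi2_yates:.4f}, p = {p_yates:.4f}")
print(f"Without correction: chi2 = {chi2_plain:.4f}, p = {p_plain:.4f}")
```

The correction shrinks the statistic slightly, so the uncorrected p-value is smaller, although both lead to the same decision for this table.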

##  `ANOVA` (Analysis of Variance)

`ANOVA` is used to compare means across multiple groups to see if there is a significant difference. It is useful when comparing more than two groups.

---

**Example: Comparing Monthly Salaries Across Three Industries**

Assume you want to compare the monthly salaries of employees across three industries: IT, Manufacturing, and Healthcare.

  - *Null Hypothesis (H₀)*: The average monthly salary is the same across all industries.

  - *Alternative Hypothesis (H₁)*: The average monthly salary is different for at least one industry.

In [None]:
from scipy import stats

# Sample monthly salaries for employees in three industries
it_salaries = [50000, 55000, 52000, 58000, 57000]
manufacturing_salaries = [48000, 49000, 51000, 53000, 52000]
healthcare_salaries = [54000, 56000, 55000, 57000, 58000]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(it_salaries, manufacturing_salaries, healthcare_salaries)

print(f"F-statistic: {f_statistic}")
print(f"p-value: {p_value}")


F-statistic: 6.375690607734808
p-value: 0.012986461774327282


Since the p-value (≈ 0.013) is below 0.05, we reject H₀ and conclude that the average monthly salary differs for at least one industry.
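The F-statistic is the ratio of the variation between group means to the variation within groups. As a check on the result above, it can be rebuilt by hand from the sums of squares. The names `ss_between`, `ss_within`, `f_manual`, and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([50000, 55000, 52000, 58000, 57000]),  # IT
    np.array([48000, 49000, 51000, 53000, 52000]),  # Manufacturing
    np.array([54000, 56000, 55000, 57000, 58000]),  # Healthcare
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# Upper-tail probability of the F distribution with (k - 1, n_total - k) dof
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f"F-statistic: {f_manual:.4f}")
print(f"p-value: {p_manual:.4f}")
```

A significant F-statistic only says that at least one group mean differs; it does not identify which one.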

## Conclusion

Hypothesis testing provides a framework for making decisions about population parameters based on sample data. Different tests like Z-tests, T-tests, Chi-square tests, and ANOVA are used based on the nature of the data and the hypothesis. The results of these tests are typically expressed in terms of the test statistic and the p-value, which help decide whether to reject or fail to reject the null hypothesis.


**Which test to use when**

<div align = 'center'>

|Independent variable ↓ / Dependent variable →|Categorical          |Continuous            |
|---------------------------------------------|---------------------|----------------------|
|Categorical                                  | Chi-Square test     | T-Test/ANOVA         |
|Continuous                                   | Logistic Regression | Correlation Test     |


</div>