# What is Hypothesis Testing and Techniques?


Hypothesis testing is a statistical method for making decisions about population parameters based on sample data. 
It involves formulating null and alternative hypotheses, collecting data, and using a statistical test to decide 
whether the data provide enough evidence to reject the null hypothesis.

# Understanding Hypothesis Testing Techniques and their Implementation using Python

# One-Sample t-Test
The one-sample t-test determines if the mean (average) of a single group or sample is significantly different 
from a known population mean. It involves comparing the sample mean to the known population mean while considering 
the variability within the sample.

Example: Suppose you have a sample of test scores from a class. You want to test if their average is significantly 
different from the national average of 70.

In [23]:
import scipy.stats as stats
import numpy as np

# Sample data: test scores of a class
sample_scores = np.array([65, 78, 67, 72, 74, 62, 76, 70, 68, 71])

In [22]:
import statistics

n = [65, 78, 67, 72, 74, 62, 76, 70, 68, 71]
mean_value = statistics.mean(n)
print(mean_value)


70.3


In [24]:
# Known population mean (hypothesized)
population_mean = 70

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_scores, population_mean)
print(f"t_statistic = {t_statistic}, P-value = {p_value}")

t_statistic = 0.19097135526615505, P-value = 0.8527865916734706


Given the high p-value (0.853), much greater than the common alpha level of 0.05, 
we do not have sufficient evidence to reject the null hypothesis. It suggests that the sample mean (70.3) 
is not significantly different from the population mean (70).
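
To make the comparison behind the one-sample t-test concrete, the statistic can be recomputed by hand as the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean. This is a minimal sketch; the names `t_manual` and `p_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

sample_scores = np.array([65, 78, 67, 72, 74, 62, 76, 70, 68, 71])
population_mean = 70

n = sample_scores.size
sample_mean = sample_scores.mean()
sample_std = sample_scores.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)
standard_error = sample_std / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - population_mean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(f"t = {t_manual:.4f}, p = {p_manual:.4f}")
```

The values should agree with the `stats.ttest_1samp` output above (t ≈ 0.191, p ≈ 0.853).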

# Understanding P-values
In hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one observed, 
assuming the null hypothesis is true. A smaller p-value suggests stronger evidence against the null hypothesis.

The interpretation of p-values is as follows:

If p-value ≤ alpha (α): You have enough evidence to reject the null hypothesis. It suggests that the observed data is unlikely 
to have occurred under the assumption that the null hypothesis is true.

If p-value > alpha (α): You do not have enough evidence to reject the null hypothesis. It suggests that the observed data is 
consistent with the null hypothesis.
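
The decision rule above can be sketched as a small helper. The function name `decide` and its return strings are hypothetical and used only for illustration; the default `alpha` of 0.05 follows the convention used in this article.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.853))  # p-value from the one-sample t-test example
print(decide(0.01))   # a p-value below alpha
```

Note that "fail to reject" is not the same as proving the null hypothesis true; it only means the data are consistent with it.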

# Two-Sample t-Test
The two-sample t-test helps determine if there’s a significant difference between the means of two independent groups or samples.
It assesses if the difference in sample means is statistically significant while accounting for the variability within each 
group.

Example: Comparing the average heights of two different groups of plants treated with different fertilizers.

In [25]:
# Sample data: heights of plants with different fertilizers
heights_fertilizer1 = np.array([15, 16, 17, 14, 16, 15, 16, 17])
heights_fertilizer2 = np.array([14, 15, 15, 15, 16, 14, 15, 15])

In [26]:
# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(heights_fertilizer1, heights_fertilizer2)
print(f"t_statistic = {t_statistic}, P-value = {p_value}")

t_statistic = 2.032862543430305, P-value = 0.06148225337599243


Given the p-value (0.061), slightly higher than the conventional alpha level of 0.05, we do not have enough evidence to 
reject the null hypothesis at a 5% significance level. It suggests that while there is a tendency towards a difference between
the two group means, this difference is not statistically significant at the 5% level.
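
By default, `stats.ttest_ind` pools the two sample variances, which assumes the groups have equal variances. If that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs both on the same data; the names `t_pooled`, `t_welch`, and so on are illustrative.

```python
import numpy as np
from scipy import stats

heights_fertilizer1 = np.array([15, 16, 17, 14, 16, 15, 16, 17])
heights_fertilizer2 = np.array([14, 15, 15, 15, 16, 14, 15, 15])

# Sample variances (ddof=1) give a quick look at whether the spreads are similar
var1 = heights_fertilizer1.var(ddof=1)
var2 = heights_fertilizer2.var(ddof=1)
print(f"variance 1 = {var1:.3f}, variance 2 = {var2:.3f}")

# Student's t-test: pooled variance (SciPy's default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(heights_fertilizer1, heights_fertilizer2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(
    heights_fertilizer1, heights_fertilizer2, equal_var=False
)

print(f"pooled: t = {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.4f}, p = {p_welch:.4f}")
```

Because both groups have the same size here, the two tests give the same t-statistic; only the degrees of freedom, and therefore the p-value, differ.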

# Chi-Square Test
The chi-square test of independence assesses whether two categorical variables are associated. It compares the 
observed frequencies with the frequencies expected if the variables were independent. The larger the chi-square statistic, the 
stronger the evidence against independence.

Example: Testing if there is an association between gender (male/female) and preference for a new product (like/dislike).

In [27]:
# Rows: Gender, Columns: Product Preference
data = np.array([[30, 10],  # 30 males like, 10 dislike
                 [35, 5]])  # 35 females like, 5 dislike

In [28]:
# Perform Chi-Square Test (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(data)
print(f"chi2_statistic = {chi2_statistic:.4f}, P-value = {p_value}, Degrees of Freedom = {dof}, Expected frequencies = {expected}")

chi2_statistic = 1.3128, P-value = 0.2518846204641586, Degrees of Freedom = 1, Expected frequencies = [[32.5  7.5]
 [32.5  7.5]]


The degrees of freedom depend on the dimensions of the contingency table: (rows - 1) * (columns - 1), which is 1 for a 2×2 table. The expected frequencies are the counts we would expect in each cell if there were no association between the variables, that is, if the null hypothesis were true. In our case, they are 32.5 (like) and 7.5 (dislike) for each gender.
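
The expected frequencies and degrees of freedom can be reproduced from the row and column totals. The sketch below also computes the chi-square statistic with Yates' continuity correction, which `chi2_contingency` applies by default to 2×2 tables. Variable names such as `observed` and `chi2_yates` are illustrative.

```python
import numpy as np

# Rows: gender (male, female); columns: preference (like, dislike)
observed = np.array([[30, 10],
                     [35, 5]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# Expected count in each cell under independence:
# (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Chi-square statistic with Yates' continuity correction (2x2 tables):
# sum over cells of (|observed - expected| - 0.5)^2 / expected
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(expected)
print(f"dof = {dof}, chi2 (Yates) = {chi2_yates:.4f}")
```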

In this case, the p-value (0.252) is greater than the common alpha level of 0.05, so we do not have sufficient evidence to reject the null hypothesis. It suggests that the data do not provide strong evidence of a significant association between the two categorical variables.

# ANOVA (Analysis of Variance)
ANOVA is used to analyze the differences among means of three or more groups. It tells you if there are statistically significant differences between these groups. ANOVA examines the variance within each group and between groups. It calculates an F-statistic to test if group means are equal.

Example: Testing if three different diets have different effects on weight loss.

In [29]:
from scipy.stats import f_oneway

# Sample data: weight loss for three different diets
diet1 = np.array([2, 3, 1, 2, 2])
diet2 = np.array([4, 5, 4, 4, 5])
diet3 = np.array([5, 6, 7, 6, 5])

In [30]:
# Perform ANOVA
f_statistic, p_value = f_oneway(diet1, diet2, diet3)
print(f"f_statistic = {f_statistic}, P-value = {p_value}")

f_statistic = 36.933333333333294, P-value = 7.449718327740603e-06


The F-statistic is the ratio of the variance between the group means to the variance within the groups. The larger the F-statistic, the stronger the evidence that the group means are not all equal.

Given the extremely low p-value (far below the conventional alpha level of 0.05), there is strong evidence to reject the null hypothesis. It indicates that at least one group mean differs significantly from the others.
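
To show where the F-statistic comes from, the between-group and within-group variances can be computed directly from the diet data. This is a minimal sketch of the one-way ANOVA arithmetic; names such as `ss_between` and `f_manual` are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([2, 3, 1, 2, 2]),  # diet1
    np.array([4, 5, 4, 4, 5]),  # diet2
    np.array([5, 6, 7, 6, 5]),  # diet3
]

k = len(groups)                          # number of groups
n_total = sum(g.size for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # between-group variance
ms_within = ss_within / (n_total - k)    # within-group variance

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f"F = {f_manual:.4f}, p = {p_manual:.3e}")
```

The result should agree with the `f_oneway` output above (F ≈ 36.93).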

These are the hypothesis testing techniques, with their Python implementations, that you should know as a data science professional.