# Statistics Part 2


1. What is hypothesis testing in statistics?
- Hypothesis testing is a statistical method used to make decisions based on data. It tests assumptions (hypotheses) about population parameters. It helps determine whether there is enough evidence to support a specific claim or if results happened by random chance.

2. What is the null hypothesis, and how does it differ from the alternative hypothesis?
- The null hypothesis (H₀) assumes no effect or difference, while the alternative hypothesis (H₁) proposes a change or effect. Hypothesis testing aims to reject or fail to reject the null hypothesis based on the data and evidence.

3. What is the significance level in hypothesis testing, and why is it important?
- The significance level (α) is the probability of rejecting the null hypothesis when it is true, commonly set at 0.05. It represents the tolerance for error and helps determine the threshold for deciding whether results are statistically significant.

4. What does a P-value represent in hypothesis testing?
- The P-value measures the probability of observing the data, or something more extreme, assuming the null hypothesis is true. A small P-value indicates strong evidence against the null hypothesis and supports the alternative hypothesis.

5. How do you interpret the P-value in hypothesis testing?
- If the P-value is less than the significance level (e.g., 0.05), you reject the null hypothesis. If it is greater, you fail to reject the null hypothesis. The smaller the P-value, the stronger the evidence against the null.
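
The decision rule above can be sketched in a few lines. This is a minimal illustration for a two-tailed Z-test; the helper name `z_test_decision` and the example Z-scores are assumptions made for the sketch, not part of the original questions.

```python
from scipy.stats import norm

def z_test_decision(z, alpha=0.05):
    """Return (p_value, decision) for a two-tailed Z-test (hypothetical helper)."""
    # norm.sf is the survival function, 1 - cdf, and is more accurate in the tails
    p_value = 2 * norm.sf(abs(z))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return p_value, decision

# Illustrative Z-scores only
for z in (0.5, 1.96, 2.5):
    p, decision = z_test_decision(z)
    print(f"z = {z}: p = {p:.4f} -> {decision}")
```

The comparison is always between the P-value and the α chosen before looking at the data; changing α after seeing the P-value defeats the purpose of the threshold.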

6. What are Type 1 and Type 2 errors in hypothesis testing?
- A Type 1 error occurs when the null hypothesis is rejected when it is actually true. A Type 2 error occurs when the null hypothesis is not rejected when it is false. Both errors affect the reliability of conclusions.
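
The two error rates can be put into numbers for a simple case. The sketch below assumes a one-tailed (upper) Z-test with known σ and a specific alternative mean; the function name `type2_error` and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def type2_error(mu0, mu1, sigma, n, alpha=0.05):
    """Type 2 error rate (beta) of an upper-tailed Z-test when the true mean is mu1."""
    se = sigma / np.sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff; alpha is the Type 1 error rate
    critical_value = mu0 + norm.ppf(1 - alpha) * se
    # Beta: probability the sample mean stays below the cutoff even though mu1 is true
    return norm.cdf(critical_value, loc=mu1, scale=se)

beta = type2_error(mu0=100, mu1=105, sigma=10, n=30)
print(f"Type 2 error (beta): {beta:.3f}")
print(f"Power (1 - beta):    {1 - beta:.3f}")
```

For a fixed sample size, lowering α moves the cutoff further from `mu0` and raises β, which is why the two errors are usually traded off against each other.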

7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?
- A one-tailed test checks for a deviation in one specific direction, while a two-tailed test checks for deviations in both directions. Use one-tailed when the effect is expected in a specific direction and two-tailed when any direction is possible.
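
To make the difference concrete, the same test statistic can be evaluated both ways. The Z-score of 1.8 below is an arbitrary illustrative value, and the one-tailed case assumes the effect was predicted to be positive before seeing the data.

```python
from scipy.stats import norm

z = 1.8          # illustrative test statistic
alpha = 0.05

p_one_tailed = norm.sf(z)           # H1: mean is greater than the null value
p_two_tailed = 2 * norm.sf(abs(z))  # H1: mean differs in either direction

print(f"One-tailed P-value: {p_one_tailed:.4f}")
print(f"Two-tailed P-value: {p_two_tailed:.4f}")
print("One-tailed rejects H0:", p_one_tailed < alpha)
print("Two-tailed rejects H0:", p_two_tailed < alpha)
```

The two-tailed P-value is twice the one-tailed value for a symmetric distribution, so a result can be significant in a one-tailed test and not in a two-tailed one.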

8. What is the Z-test, and when is it used in hypothesis testing?
- The Z-test compares a sample mean to a population mean when the population variance is known, or when the sample is large enough (commonly n > 30) for the sample variance to stand in for it. It assumes the sample mean is approximately normally distributed and tests whether the observed difference is statistically significant.

9. How do you calculate the Z-score, and what does it represent in hypothesis testing?
- The Z-score is calculated as (sample mean − population mean) divided by the standard error, where the standard error is the population standard deviation divided by √n. It represents how many standard errors the sample mean lies from the hypothesized population mean. A large absolute value suggests statistical significance.

10. What is the T-distribution, and when should it be used instead of the normal distribution?
- The T-distribution is used when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample is small (typically n < 30). Its heavier tails account for the extra uncertainty in that estimate, and it is appropriate for inference about means when the data are approximately normal.

11. What is the difference between a Z-test and a T-test?
- The Z-test is used for large samples with known population variance, while the T-test is for small samples with unknown population variance. Both test differences in means, but the T-test uses the sample standard deviation.

12. What is the T-test, and how is it used in hypothesis testing?
- A T-test compares the means of one or two groups to determine if there is a significant difference. It's used when the sample size is small and the population standard deviation is unknown, assuming data is approximately normal.

13. What is the relationship between Z-test and T-test in hypothesis testing?
- Both the Z-test and T-test assess differences in means. The Z-test is used with known variance and large samples; the T-test is used with estimated variance and small samples. As sample size increases, the T-distribution approaches the Z-distribution.
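
The convergence of the T-distribution to the standard normal can be seen by comparing two-tailed 5% critical values. This is a small sketch; the list of degrees of freedom is chosen only for illustration.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-tailed 5% critical value of the standard normal
print(f"Z critical value: {z_crit:.4f}")

# Illustrative degrees of freedom (df = n - 1 for a one-sample T-test)
t_crits = {df: t.ppf(0.975, df) for df in (5, 10, 30, 100, 1000)}
for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: T critical value = {t_crit:.4f}")
```

With few degrees of freedom the T critical value is noticeably larger than the Z value, so the T-test demands stronger evidence; the gap shrinks as the sample grows.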

14. What is a confidence interval, and how is it used to interpret statistical results?
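- A confidence interval gives a range of plausible values for a population parameter. It provides more information than a single point estimate because it shows the uncertainty in the result. At a 95% confidence level, about 95% of intervals built this way from repeated samples would contain the true parameter; 95% and 99% are the most commonly used levels.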
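
One way to read the confidence level is as a long-run coverage rate. The simulation below is a sketch under assumed values (true mean 50, standard deviation 5, samples of size 30, 2000 repetitions, fixed seed); it builds a 95% T-interval for each sample and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
true_mean, sigma, n, reps = 50, 5, 30, 2000  # assumed simulation settings

samples = rng.normal(true_mean, sigma, size=(reps, n))
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)
margins = t.ppf(0.975, df=n - 1) * std_errs

# An interval covers the true mean when the mean lies within one margin of the estimate
covered = np.abs(means - true_mean) <= margins
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure, not one particular interval.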

15. What is the margin of error, and how does it affect the confidence interval?
- The margin of error is the half-width of a confidence interval: the critical value multiplied by the standard error. It bounds the expected difference between the sample estimate and the true population parameter at the chosen confidence level. A larger margin results in a wider confidence interval, indicating more uncertainty, while a smaller margin gives more precision.
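
The effect of sample size on the margin of error can be sketched as follows. The example assumes a known population standard deviation (so a Z critical value applies); the function name `margin_of_error_known_sigma` and the numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

def margin_of_error_known_sigma(sigma, n, confidence=0.95):
    """Margin of error for a mean, assuming the population sigma is known."""
    z = norm.ppf((1 + confidence) / 2)  # two-tailed critical value
    return z * sigma / np.sqrt(n)

# Illustrative values: sigma = 15, increasing sample sizes
for n in (25, 100, 400):
    moe = margin_of_error_known_sigma(sigma=15, n=n)
    print(f"n = {n:>3}: margin of error = {moe:.2f}")
```

Because the margin scales with 1/√n, quadrupling the sample size only halves the margin, while raising the confidence level widens it.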

16. How is Bayes' Theorem used in statistics, and what is its significance?
- Bayes' Theorem updates the probability of a hypothesis based on new evidence. It is useful in fields like machine learning, diagnostics, and decision-making where prior knowledge is combined with observed data to make informed predictions.

17. What is the Chi-square distribution, and when is it used?
- The Chi-square distribution is used in hypothesis testing for categorical data. It is applied in tests like the Chi-square test of independence or goodness of fit to assess whether observed data differs from expected data.

18. What is the Chi-square goodness of fit test, and how is it applied?
- The Chi-square goodness of fit test evaluates whether a sample distribution matches an expected distribution. It compares observed frequencies to expected ones across categories and determines if differences are due to chance or a real effect.
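
The comparison of observed and expected frequencies reduces to one formula, the sum of (observed − expected)² / expected over the categories. The sketch below computes it by hand for made-up counts and checks the result against `scipy.stats.chisquare`.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# Illustrative counts; both arrays must sum to the same total
observed = np.array([50, 30, 20])
expected = np.array([40, 40, 20])

stat_manual = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1               # categories minus one
p_manual = chi2.sf(stat_manual, dof)  # upper-tail probability

stat_scipy, p_scipy = chisquare(f_obs=observed, f_exp=expected)

print(f"Manual: statistic = {stat_manual:.3f}, P-value = {p_manual:.4f}")
print(f"SciPy:  statistic = {stat_scipy:.3f}, P-value = {p_scipy:.4f}")
```

The test is upper-tailed because only large discrepancies between observed and expected counts count as evidence against the expected distribution.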

19. What is the F-distribution, and when is it used in hypothesis testing?
- The F-distribution describes the ratio of two variance estimates. It's most commonly used in ANOVA and regression analysis, where it helps determine whether two variances are equal or whether the variation between groups is large relative to the variation within them.

20. What is an ANOVA test, and what are its assumptions?
- ANOVA (Analysis of Variance) tests whether there are significant differences between means of three or more groups. Assumptions include normality, equal variances, and independent samples. A significant result suggests at least one group differs.
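
The F-statistic behind a one-way ANOVA is the ratio of between-group to within-group variance. The sketch below works it out by hand for three tiny made-up groups and compares the result with `scipy.stats.f_oneway`; the data are illustrative only.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Illustrative data: three small groups
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual: F = {f_manual:.3f}, P-value = {p_manual:.4f}")
print(f"SciPy:  F = {f_scipy:.3f}, P-value = {p_scipy:.4f}")
```

A significant F only says that at least one group mean differs; identifying which one requires a follow-up (post hoc) comparison.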

21. What are the different types of ANOVA tests?
- There are three main types: one-way ANOVA (one independent variable), two-way ANOVA (two independent variables), and repeated measures ANOVA (same subjects measured multiple times). Each tests different kinds of group differences in means.

22. What is the F-test, and how does it relate to hypothesis testing?
- The F-test compares the variances of two populations to assess if they are significantly different. It is commonly used in ANOVA and regression analysis. A significant F-test indicates that group variances or model fits differ meaningfully.

# Practical

In [None]:
# 1. Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results
import numpy as np
from scipy.stats import norm

data = [67, 70, 72, 65, 68, 75, 74, 69, 73, 71]
sample_mean = np.mean(data)
population_mean = 70
std_dev = 3
n = len(data)

z = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
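p_value = 2 * norm.sf(abs(z))  # two-tailed; sf is 1 - cdf
print("Z-score:", z)
print("P-value:", p_value)

# Interpret the result at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from", population_mean)
else:
    print("Fail to reject H0: no significant difference from", population_mean)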


In [None]:
# 2. Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python
import numpy as np
from scipy.stats import ttest_1samp

data = np.random.normal(loc=100, scale=15, size=50)
t_stat, p_value = ttest_1samp(data, 100)
print("T-statistic:", t_stat)
print("P-value:", p_value)


In [None]:
# 3. Implement a one-sample Z-test using Python to compare the sample mean with the population mean
import numpy as np
from scipy.stats import norm

sample = np.array([98, 100, 102, 99, 97, 101, 103])
pop_mean = 100
std = 2
z_score = (np.mean(sample) - pop_mean) / (std / np.sqrt(len(sample)))
p_val = 2 * (1 - norm.cdf(abs(z_score)))
print("Z-score:", z_score)
print("P-value:", p_val)


In [None]:
# 4. Perform a two-tailed Z-test using Python and visualize the decision region on a plot
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean, std, n = 100, 15, 36
sample_mean = 105
z = (sample_mean - mean) / (std / np.sqrt(n))
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x)

plt.plot(x, y)
plt.axvline(x=z, color='r', linestyle='--')
plt.fill_between(x, 0, y, where=(x < -1.96) | (x > 1.96), color='gray', alpha=0.5)
plt.title(f"Z = {z}")
plt.show()


In [None]:
# 5. Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def visualize_errors(mu0=100, mu1=105, sigma=10, n=30, alpha=0.05):
    se = sigma / np.sqrt(n)
    z_alpha = norm.ppf(1 - alpha)
    x = np.linspace(80, 120, 500)
    y0 = norm.pdf(x, mu0, se)
    y1 = norm.pdf(x, mu1, se)

    critical_value = mu0 + z_alpha * se
    plt.plot(x, y0, label='H0')
    plt.plot(x, y1, label='H1')
    plt.fill_between(x, 0, y0, where=(x > critical_value), color='red', alpha=0.3, label='Type I Error')
    plt.fill_between(x, 0, y1, where=(x <= critical_value), color='blue', alpha=0.3, label='Type II Error')
    plt.legend()
    plt.title("Type I and II Errors")
    plt.show()

visualize_errors()


In [None]:
# 6. Write a Python program to perform an independent T-test and interpret the results
import numpy as np
from scipy.stats import ttest_ind

group1 = np.random.normal(100, 10, 30)
group2 = np.random.normal(105, 10, 30)
t_stat, p_val = ttest_ind(group1, group2)
print("T-statistic:", t_stat)
print("P-value:", p_val)


In [None]:
# 7. Perform a paired sample T-test using Python and visualize the comparison results
import numpy as np
from scipy.stats import ttest_rel
import matplotlib.pyplot as plt

before = np.random.normal(90, 10, 20)
after = before + np.random.normal(5, 5, 20)
t_stat, p_val = ttest_rel(before, after)
plt.plot(before, label="Before")
plt.plot(after, label="After")
plt.legend()
plt.title("Paired Sample T-Test")
plt.show()
print("T-statistic:", t_stat)
print("P-value:", p_val)


In [None]:
# 8. Simulate data and perform both Z-test and T-test, then compare the results using Python
import numpy as np
from scipy.stats import ttest_1samp, norm

data = np.random.normal(100, 10, 25)
pop_mean = 100
sample_mean = np.mean(data)
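sigma = 10  # population standard deviation, known here because the data were simulated with it
# The Z-test uses the known sigma; ttest_1samp estimates the standard deviation from the sample
z = (sample_mean - pop_mean) / (sigma / np.sqrt(len(data)))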
t_stat, t_pval = ttest_1samp(data, pop_mean)
z_pval = 2 * (1 - norm.cdf(abs(z)))
print("Z-test P-value:", z_pval)
print("T-test P-value:", t_pval)


In [None]:
# 9. Write a Python function to calculate the confidence interval for a sample mean and explain its significance
import numpy as np
from scipy.stats import t

def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = np.std(data, ddof=1) / np.sqrt(n)
    margin = t.ppf((1 + confidence) / 2, n - 1) * std_err
    return (mean - margin, mean + margin)

data = np.random.normal(50, 5, 30)
print("Confidence Interval:", confidence_interval(data))


In [None]:
# 10. Write a Python program to calculate the margin of error for a given confidence level using sample data

import numpy as np
from scipy.stats import norm

def margin_of_error(data, confidence=0.95):
    n = len(data)
    std_err = np.std(data, ddof=1) / np.sqrt(n)
    z_score = norm.ppf((1 + confidence) / 2)
    return z_score * std_err

sample_data = np.random.normal(100, 15, 50)
moe = margin_of_error(sample_data, 0.95)
print("Margin of Error:", moe)


In [None]:
# 11. Implement a Bayesian inference method using Bayes' Theorem in Python and explain the process

# P(A|B) = (P(B|A) * P(A)) / P(B)

def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b

# Example values
p_a = 0.01               # Prior probability of having disease
p_b_given_a = 0.9        # Probability of testing positive given disease
p_b_given_not_a = 0.05   # Probability of testing positive without disease

posterior = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)
print("Posterior probability P(A|B):", posterior)


In [None]:
# 12. Perform a Chi-square test for independence between two categorical variables in Python

import pandas as pd
from scipy.stats import chi2_contingency

# Sample contingency table
data = pd.DataFrame({
    'Male': [20, 30],
    'Female': [30, 20]
}, index=['Likes Product', 'Dislikes Product'])

# Perform Chi-square test
chi2, p, dof, expected = chi2_contingency(data)

print("Chi-square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)
print("Expected Frequencies:\n", expected)


In [None]:
# 13. Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 30], [30, 20]])
chi2, p, dof, expected = chi2_contingency(observed)
print("Expected Frequencies:\n", expected)


In [None]:
# 14. Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution
from scipy.stats import chisquare

observed = [50, 30, 20]
expected = [40, 40, 20]
stat, p = chisquare(f_obs=observed, f_exp=expected)
print("Chi-square Statistic:", stat, "\nP-value:", p)


In [None]:
# 15. Create a Python script to simulate and visualize the Chi-square distribution and discuss its characteristics
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2

x = np.linspace(0, 30, 500)
df = 4
plt.plot(x, chi2.pdf(x, df), label=f'df={df}')
plt.title('Chi-square Distribution')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid()
plt.show()


In [None]:
# 16. Implement an F-test using Python to compare the variances of two random samples
import numpy as np
import scipy.stats as stats

data1 = np.random.normal(10, 2, 50)
data2 = np.random.normal(10, 3, 50)

f_stat = np.var(data1, ddof=1) / np.var(data2, ddof=1)
dof1 = len(data1) - 1
dof2 = len(data2) - 1
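# Two-sided P-value: the ratio can land in either tail, depending on which sample has the larger variance
p_value = 2 * min(stats.f.cdf(f_stat, dof1, dof2), stats.f.sf(f_stat, dof1, dof2))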

print("F-statistic:", f_stat, "\nP-value:", p_value)


In [None]:
# 17. Write a Python program to perform an ANOVA test to compare means between multiple groups and interpret the results
import scipy.stats as stats

group1 = np.random.normal(5, 1, 30)
group2 = np.random.normal(5.5, 1, 30)
group3 = np.random.normal(6, 1, 30)

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_stat, "\nP-value:", p_value)


In [None]:
# 18. Perform a one-way ANOVA test using Python to compare the means of different groups and plot the results
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "value": np.concatenate([group1, group2, group3]),
    "group": ["A"] * 30 + ["B"] * 30 + ["C"] * 30
})

sns.boxplot(x="group", y="value", data=data)
plt.title("One-way ANOVA Boxplot")
plt.show()


In [None]:
# 19. Write a Python function to check the assumptions (normality, independence, and equal variance) for ANOVA
from scipy.stats import shapiro, levene

def check_anova_assumptions(*groups):
    # Independence depends on how the data were collected and cannot be tested from the values alone
    for i, group in enumerate(groups):
        stat, p = shapiro(group)
        print(f"Group {i+1} Shapiro-Wilk p-value (Normality):", p)
    stat, p = levene(*groups)
    print("Levene's Test p-value (Equal variance):", p)

check_anova_assumptions(group1, group2, group3)


In [None]:
# 20. Perform a two-way ANOVA test using Python to study the interaction between two factors and visualize the results
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    'score': np.random.rand(60),
    'factor1': ['A', 'A', 'B', 'B'] * 15,
    'factor2': ['X', 'Y'] * 30
})

model = ols('score ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


In [None]:
# 21. Write a Python program to visualize the F-distribution and discuss its use in hypothesis testing
x = np.linspace(0.01, 5, 500)
dfn, dfd = 5, 20
plt.plot(x, stats.f.pdf(x, dfn, dfd), label='F-distribution')
plt.title('F-distribution Curve')
plt.xlabel('F value')
plt.ylabel('Probability Density')
plt.grid()
plt.legend()
plt.show()


In [None]:
# 22. Perform a one-way ANOVA test in Python and visualize the results with boxplots to compare group means
# Reuse group1, group2, group3 and previous boxplot code from Q18
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("One-way ANOVA result:\nF-statistic:", f_stat, "\nP-value:", p_value)


In [None]:
# 23. Simulate random data from a normal distribution, then perform hypothesis testing to evaluate the means
sample = np.random.normal(loc=50, scale=10, size=100)
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("T-statistic:", t_stat, "\nP-value:", p_value)


In [None]:
# 24. Perform a hypothesis test for population variance using a Chi-square distribution and interpret the results
sample_var = np.var(sample, ddof=1)
n = len(sample)
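sigma_sq = 100  # hypothesized population variance under H0
chi2_stat = (n - 1) * sample_var / sigma_sq
# Two-sided P-value for H0: variance == sigma_sq (use only the upper tail to test variance > sigma_sq)
cdf = stats.chi2.cdf(chi2_stat, df=n - 1)
p_value = 2 * min(cdf, 1 - cdf)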
print("Chi-square Statistic:", chi2_stat, "\nP-value:", p_value)


In [None]:
# 25. Write a Python script to perform a Z-test for comparing proportions between two datasets or groups
from statsmodels.stats.proportion import proportions_ztest

success = np.array([40, 30])
nobs = np.array([100, 90])
z_stat, p_val = proportions_ztest(success, nobs)
print("Z-statistic:", z_stat, "\nP-value:", p_val)


In [None]:
# 26. Implement an F-test for comparing the variances of two datasets, then interpret and visualize the results
f_stat = np.var(group1, ddof=1) / np.var(group2, ddof=1)
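# Two-sided P-value: the ratio can land in either tail, depending on which group has the larger variance
p_value = 2 * min(stats.f.cdf(f_stat, len(group1) - 1, len(group2) - 1),
                  stats.f.sf(f_stat, len(group1) - 1, len(group2) - 1))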
print("F-statistic:", f_stat, "\nP-value:", p_value)


In [None]:
# 27. Perform a Chi-square test for goodness of fit with simulated data and analyze the results
observed = [18, 22, 20]
expected = [20, 20, 20]
chi2_stat, p_val = chisquare(observed, expected)
print("Chi-square Statistic:", chi2_stat, "\nP-value:", p_val)
