# **Chapter 2. Fundamentals of Statistics**

## **2.3. Probability Distributions**

Probability distributions describe how values of a random variable are distributed. Common distributions include:

1. **Uniform Distribution:** All outcomes are equally likely.
2. **Normal Distribution:** Data is symmetrically distributed around the mean (bell-shaped curve).
3. **Binomial Distribution:** Models the number of successes in a sequence of independent trials.
4. **Poisson Distribution:** Models the number of events in a fixed interval of time or space.

### **2.3.1. Uniform distribution**

The **uniform distribution** is the simplest probability distribution, where all outcomes within a specific range \([a, b]\) are equally likely. 

**Probability Density Function (PDF):**

$$
f(x) = \begin{cases} 
\frac{1}{b-a}, & a \leq x \leq b \\ 
0, & \text{otherwise} 
\end{cases}
$$

**Use Cases:**
- Modeling scenarios where all outcomes have equal likelihood, such as drawing random numbers in a fixed range. Rolling a fair die is the discrete analogue, with each of the six faces having probability 1/6.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform

# Generate data for a uniform distribution
a, b = 0, 1  # Define range [a, b]
x = np.linspace(a, b, 100)
y = uniform.pdf(x, loc=a, scale=b-a)

# Plot the uniform distribution
plt.plot(x, y, label='Uniform PDF')
plt.fill_between(x, y, alpha=0.2, color='blue')
plt.title('Uniform Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Random sampling
samples = uniform.rvs(loc=a, scale=b-a, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
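
The sample statistics printed above can be checked against closed-form values. The following is a minimal sketch, assuming the standard results that a continuous uniform distribution on $[a, b]$ has mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$. The interval $[2, 6]$ and the seed are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.stats import uniform

# Hypothetical interval chosen for illustration
a, b = 2, 6

# Closed-form moments of the continuous uniform distribution
theoretical_mean = (a + b) / 2
theoretical_var = (b - a) ** 2 / 12

# SciPy parameterizes the interval as loc=a, scale=b-a
scipy_mean = uniform.mean(loc=a, scale=b - a)
scipy_var = uniform.var(loc=a, scale=b - a)

# Sample moments from a seeded draw (the seed is an arbitrary choice)
samples = uniform.rvs(loc=a, scale=b - a, size=10_000, random_state=0)

print(f"Mean: formula = {theoretical_mean}, SciPy = {scipy_mean}, sample = {samples.mean():.3f}")
print(f"Variance: formula = {theoretical_var:.4f}, SciPy = {scipy_var:.4f}, sample = {samples.var(ddof=1):.4f}")
```

The sample values should be close to the theoretical ones, but not identical, because they are computed from a finite random sample.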

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

1. Generate 1,000 random numbers from a uniform distribution over the interval $[0, 5]$:
   - Compute the sample mean and variance.
   - Compare the sample values with the theoretical mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$; they should agree approximately, not exactly.

2. Plot the probability density function (PDF) of a uniform distribution over $[10, 20]$. Overlay a histogram of 500 samples from the same distribution.

### **2.3.2. Normal Distribution**

The **normal distribution** (or **Gaussian distribution**) is the most commonly used probability distribution. It is defined by its mean ($\mu$) and standard deviation ($\sigma$).

**Probability Density Function (PDF):**
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

**Properties:**
- Bell-shaped curve, symmetric about the mean ($\mu$).
- Approximately 68% of data falls within 1 standard deviation ($\sigma$) from the mean, 95% within 2, and 99.7% within 3 (empirical rule).

**Use Cases:**
- Modeling natural phenomena like heights, weights, test scores, and measurement errors.

In [None]:
from scipy.stats import norm

# Parameters for the normal distribution
mean, std_dev = 0, 1

# Generate data for normal distribution
x = np.linspace(-4, 4, 100)
y = norm.pdf(x, loc=mean, scale=std_dev)

# Plot the normal distribution
plt.plot(x, y, label='Normal PDF')
plt.fill_between(x, y, alpha=0.2, color='green')
plt.title('Normal Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Random sampling
samples = norm.rvs(loc=mean, scale=std_dev, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
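
The empirical rule stated in the properties can be checked numerically from the cumulative distribution function. This is a short sketch using `norm.cdf` on the standard normal distribution; the dictionary name `coverage` is just an illustrative choice.

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean (standard normal)
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"Within {k} standard deviation(s) of the mean: {prob:.4f}")
```

Because any normal distribution can be standardized, the same proportions hold for every choice of $\mu$ and $\sigma$.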

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 2</b></p>

1. Generate 1,000 random numbers from a normal distribution with a mean of 50 and a standard deviation of 5:
   - Compute the sample mean and standard deviation.
   - Compare these values with the theoretical mean and standard deviation.

2. Plot the probability density function (PDF) of a normal distribution with $\mu = 0$ and $\sigma = 2$. Overlay a histogram of 1,000 samples drawn from the same distribution.

3. Create a cumulative distribution function (CDF) plot for a standard normal distribution. Highlight the region corresponding to values within 1 standard deviation of the mean.

### **2.3.3. Student's t-Distribution**

The **Student's t-distribution** (or simply **t-distribution**) is a probability distribution that is used to estimate population parameters when the sample size is small and/or the population variance is unknown. It is defined by its degrees of freedom (df).

**Probability Density Function (PDF):**
$$
f(x) = \frac{\Gamma\left( \frac{df + 1}{2} \right)}{\sqrt{df\pi} \, \Gamma\left( \frac{df}{2} \right)} \left( 1 + \frac{x^2}{df} \right)^{-\frac{df + 1}{2}}
$$
where $\Gamma$ is the gamma function.

$$
\Gamma(n) = \int_0^\infty t^{n-1} e^{-t} \, dt \quad (n > 0)
$$

**Properties:**
- Symmetrical and bell-shaped, similar to the normal distribution, but with heavier tails.
- As degrees of freedom increase, the t-distribution approaches the normal distribution.
- The shape of the t-distribution depends on the degrees of freedom (df): with lower df, the distribution has heavier tails.

**Use Cases:**
- Often used in hypothesis testing (e.g., t-tests) when the sample size is small.
- Useful for estimating confidence intervals for the mean of a normally distributed population when the sample size is small.

In [None]:
from scipy.stats import t
import numpy as np
import matplotlib.pyplot as plt

# Parameters for the Student's t-distribution
df = 5  # degrees of freedom

# Generate data for t-distribution
x = np.linspace(-4, 4, 100)
y = t.pdf(x, df)

# Plot the t-distribution
plt.plot(x, y, label=f'Student\'s t-distribution (df={df})')
plt.fill_between(x, y, alpha=0.2, color='blue')
plt.title('Student\'s t-Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Random sampling
samples = t.rvs(df, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Standard Deviation: {np.std(samples)}")
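
One way to see the heavier tails, and the convergence to the normal distribution, is to compare critical values. The sketch below computes the two-sided 95% critical value of the t-distribution for a few hypothetical degrees of freedom and compares it with the normal value; the particular values of `dof` are chosen only for illustration.

```python
from scipy.stats import norm, t

# Two-sided 95% critical value of the standard normal distribution
z_crit = norm.ppf(0.975)

# The t critical value shrinks toward the normal value as the degrees of freedom grow
t_crits = {dof: t.ppf(0.975, dof) for dof in (2, 5, 30, 1000)}

for dof, crit in t_crits.items():
    print(f"df = {dof:>4}: t critical value = {crit:.4f}")
print(f"normal   : z critical value = {z_crit:.4f}")
```

A larger critical value means that more extreme observations are needed to reach the same tail probability, which is what "heavier tails" amounts to in practice.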

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 3</b></p>

1. Generate 1,000 random numbers from a Student's t-distribution with 10 degrees of freedom:
   - Compute the sample mean and standard deviation.
   - Compare these values with the theoretical mean and standard deviation (which are 0 and $\sqrt{\frac{df}{df-2}}$ for df > 2).

2. Plot the probability density function (PDF) of a Student's t-distribution with 3 degrees of freedom. Overlay a histogram of 1,000 samples drawn from the same distribution.

3. Create a cumulative distribution function (CDF) plot for a Student's t-distribution with 5 degrees of freedom. Highlight the region corresponding to values within 1 standard deviation of the mean.

### **2.3.4. F Distribution**

The **F Distribution** is a probability distribution that arises frequently in the context of statistical inference, particularly in the analysis of variance (ANOVA) and regression analysis. It describes the ratio of two scaled chi-squared distributions and is defined by two sets of degrees of freedom: $\text{df}_1$ (numerator degrees of freedom) and $\text{df}_2$ (denominator degrees of freedom).

**Probability Density Function (PDF):**
$$
f(x; \text{df}_1, \text{df}_2) = \frac{\left(\frac{\text{df}_1}{\text{df}_2}\right)^{\frac{\text{df}_1}{2}} \frac{x^{\frac{\text{df}_1}{2}-1}}{(1+\frac{\text{df}_1}{\text{df}_2} x)^{\frac{\text{df}_1+\text{df}_2}{2}}}}{\text{B}\left(\frac{\text{df}_1}{2}, \frac{\text{df}_2}{2}\right)} \quad \text{for } x \geq 0
$$
where $\text{B}$ is the beta function.

**Properties:**
- The F distribution is right-skewed, especially for smaller degrees of freedom, and approaches a normal distribution as both degrees of freedom increase.
- The mean of the F distribution is given by $\frac{\text{df}_2}{\text{df}_2 - 2}$ for $\text{df}_2 > 2$, and the variance is $\frac{2 \cdot \text{df}_2^2(\text{df}_1 + \text{df}_2 - 2)}{\text{df}_1(\text{df}_2 - 2)^2 (\text{df}_2 - 4)}$ for $\text{df}_2 > 4$.

**Use Cases:**
- Commonly used in ANOVA tests to compare variances among groups.
- Essential in regression analysis to compare the fits of different models.

In [None]:
from scipy.stats import f
import numpy as np
import matplotlib.pyplot as plt

# Parameters for the F distribution
df1 = 5  # degrees of freedom numerator
df2 = 2  # degrees of freedom denominator

# Generate data for F distribution
x = np.linspace(0, 5, 100)
y = f.pdf(x, df1, df2)

# Plot the F distribution
plt.plot(x, y, label=f'F Distribution (df1={df1}, df2={df2})')
plt.fill_between(x, y, alpha=0.2, color='purple')
plt.title('F Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.xlim(0, 5)
plt.show()

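# Random sampling
# Note: with df2 = 2 the theoretical mean and variance of the F distribution are not finite,
# so these sample statistics can vary widely from run to run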
samples = f.rvs(df1, df2, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
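
The moments of the F distribution depend mainly on the denominator degrees of freedom: the mean is $\frac{\text{df}_2}{\text{df}_2 - 2}$ (for $\text{df}_2 > 2$) and the variance is $\frac{2\,\text{df}_2^2(\text{df}_1 + \text{df}_2 - 2)}{\text{df}_1(\text{df}_2 - 2)^2(\text{df}_2 - 4)}$ (for $\text{df}_2 > 4$). The sketch below evaluates these formulas and compares them with SciPy's `f.mean` and `f.var`. The degrees of freedom `dfn = 5` and `dfd = 10` are hypothetical values, chosen so that both moments are finite.

```python
from scipy.stats import f

# Hypothetical degrees of freedom; dfd > 4 so that both moments are finite
dfn, dfd = 5, 10

# Closed-form moments: the mean depends only on the denominator degrees of freedom
mean_formula = dfd / (dfd - 2)
var_formula = 2 * dfd**2 * (dfn + dfd - 2) / (dfn * (dfd - 2) ** 2 * (dfd - 4))

print(f"Mean: formula = {mean_formula:.4f}, SciPy = {f.mean(dfn, dfd):.4f}")
print(f"Variance: formula = {var_formula:.4f}, SciPy = {f.var(dfn, dfd):.4f}")
```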

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 4</b></p>

1. Generate 1,000 random numbers from an F distribution with 10 and 5 degrees of freedom respectively for the numerator and the denominator:
   - Compute the sample mean and variance.
   - Compare these values with the theoretical mean (\(\frac{df_2}{df_2 - 2}\) for \(df_2 > 2\)) and variance.

2. Plot the probability density function (PDF) of an F distribution with 3 and 2 degrees of freedom. Overlay a histogram of 1,000 samples drawn from the same distribution.

3. Create a cumulative distribution function (CDF) plot for an F distribution with 5 and 10 degrees of freedom. Highlight the region corresponding to values within 1 standard deviation of the mean.

### **2.3.5. Chi-Squared Distribution**

The **Chi-Squared distribution** is a probability distribution that is commonly used in hypothesis testing, particularly in tests of independence and goodness-of-fit. It is defined by its degrees of freedom (df) and is used in scenarios where we are dealing with the sum of the squares of independent standard normal random variables.

**Probability Density Function (PDF):**
$$
f(x; df) = \frac{1}{2^{df/2} \Gamma\left(\frac{df}{2}\right)} x^{\frac{df}{2}-1} e^{-\frac{x}{2}} \quad \text{for } x \geq 0
$$

**Properties:**
- The Chi-Squared distribution is non-negative and right-skewed, especially for low degrees of freedom.
- As degrees of freedom increase, the distribution approaches a normal distribution.
- The mean of the Chi-Squared distribution is equal to its degrees of freedom (\(df\)), and the variance is equal to \(2 \times df\).

**Use Cases:**
- Widely used in statistical tests (e.g., Chi-Squared tests for independence, goodness-of-fit tests).
- Useful in the construction of confidence intervals for variance.

In [None]:
from scipy.stats import chi2
import numpy as np
import matplotlib.pyplot as plt

# Parameters for the Chi-Squared distribution
df = 5  # degrees of freedom

# Generate data for Chi-Squared distribution
x = np.linspace(0, 20, 100)
y = chi2.pdf(x, df)

# Plot the Chi-Squared distribution
plt.plot(x, y, label=f'Chi-Squared Distribution (df={df})')
plt.fill_between(x, y, alpha=0.2, color='orange')
plt.title('Chi-Squared Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.xlim(0, 20)
plt.show()

# Random sampling
samples = chi2.rvs(df, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
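
The definition as a sum of squared standard normal variables can be illustrated by simulation. This is a sketch under assumed settings: `k = 4` degrees of freedom, 100,000 draws, and a fixed seed are all arbitrary choices, and the results are simulated estimates, not exact values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
k = 4                           # hypothetical degrees of freedom
n_draws = 100_000

# Each row holds k independent standard normal draws; sum their squares per row
z = rng.standard_normal(size=(n_draws, k))
chi2_samples = (z**2).sum(axis=1)

print(f"Sample mean: {chi2_samples.mean():.3f} (theory: {k})")
print(f"Sample variance: {chi2_samples.var(ddof=1):.3f} (theory: {2 * k})")
```

The simulated mean and variance should land near $df$ and $2 \times df$, matching the properties listed for this distribution.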

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 5</b></p>

1. Generate 1,000 random numbers from a Chi-Squared distribution with 10 degrees of freedom:
   - Compute the sample mean and variance.
   - Compare these values with the theoretical mean (df) and variance (2 * df).

2. Plot the probability density function (PDF) of a Chi-Squared distribution with 3 degrees of freedom. Overlay a histogram of 1,000 samples drawn from the same distribution.

3. Create a cumulative distribution function (CDF) plot for a Chi-Squared distribution with 5 degrees of freedom. Highlight the region corresponding to values within one standard deviation from the mean.

### **2.3.6. Binomial Distribution**

The **binomial distribution** models the number of successes in \(n\) independent trials, where each trial has two possible outcomes (success or failure) and the same success probability \(p\).

**Probability Mass Function (PMF):**
$$
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
$$

Where:
- \(n\): Number of trials.
- \(p\): Probability of success.
- \(k\): Number of successes.

**Properties:**
- Discrete distribution.
- The mean is $np$ and the variance is $np(1-p)$.

**Use Cases:**
- Modeling pass/fail scenarios like flipping a coin, quality control, or survey results.

In [None]:
from scipy.stats import binom

# Parameters for the binomial distribution
n, p = 10, 0.5  # Number of trials, probability of success

# Generate data for binomial distribution
x = np.arange(0, n+1)
y = binom.pmf(x, n, p)

# Plot the binomial distribution
plt.bar(x, y, label='Binomial PMF', color='purple', alpha=0.7)
plt.title('Binomial Distribution (PMF)')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.legend()
plt.show()

# Random sampling
samples = binom.rvs(n, p, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
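
To connect the PMF formula with the library call, the following sketch evaluates $\binom{n}{k} p^k (1-p)^{n-k}$ directly and compares it with `binom.pmf`. The values $n = 10$, $p = 0.5$, and $k = 3$ are illustrative choices.

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.5, 3  # illustrative values

# Direct evaluation of the PMF formula
pmf_manual = comb(n, k) * p**k * (1 - p) ** (n - k)

# The same probability from SciPy
pmf_scipy = binom.pmf(k, n, p)

print(f"P(X = {k}) by formula: {pmf_manual:.6f}")
print(f"P(X = {k}) by SciPy:   {pmf_scipy:.6f}")
print(f"Mean n*p = {n * p}, variance n*p*(1-p) = {n * p * (1 - p)}")
```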

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 6</b></p>

1. A die is rolled 10 times. Assuming a success is rolling a 6:
   - Plot the probability mass function (PMF) for the number of successes.
   - Compute the probability of rolling exactly 3 sixes.

2. Simulate 1,000 trials of flipping a coin 20 times, where $p=0.4$ (probability of heads). Compute the mean and variance of the sample and compare them to the theoretical values.

3. Write a function that computes the cumulative probability of at least $k$ successes in $n$ trials, given $p$. Test it for $n=15$, $p=0.3$, and $k=5$.

### **2.3.7. Poisson Distribution**

The **Poisson distribution** models the number of events occurring in a fixed interval of time or space, given a known average rate ($\lambda$) and independence of events.

**Probability Mass Function (PMF):**
$$
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
$$

Where:
- $k$: Number of events.
- $\lambda$: Average rate of occurrence.

**Properties:**
- Discrete distribution.
- The mean and variance are both equal to $\lambda$.

**Use Cases:**
- Modeling rare events like the number of calls at a call center, traffic accidents, or radioactive decay.

In [None]:
from scipy.stats import poisson

# Parameter for the Poisson distribution
lambda_ = 4  # Average rate of occurrence

# Generate data for Poisson distribution
x = np.arange(0, 15)
y = poisson.pmf(x, lambda_)

# Plot the Poisson distribution
plt.bar(x, y, label='Poisson PMF', color='orange', alpha=0.7)
plt.title('Poisson Distribution (PMF)')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.legend()
plt.show()

# Random sampling
samples = poisson.rvs(lambda_, size=1000)
print(f"Sample Mean: {np.mean(samples)}, Sample Variance: {np.var(samples)}")
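
The PMF formula can also be evaluated by hand and compared with SciPy. In this sketch the rate `lam = 2` and the count `k = 3` are hypothetical values; `poisson.sf` is the survival function, which gives $P(X > k)$.

```python
from math import exp, factorial
from scipy.stats import poisson

lam, k = 2, 3  # hypothetical rate and event count

# Direct evaluation of the PMF formula
pmf_manual = lam**k * exp(-lam) / factorial(k)
pmf_scipy = poisson.pmf(k, lam)

# P(X > k) via the survival function, which equals 1 - CDF(k)
tail_prob = poisson.sf(k, lam)

print(f"P(X = {k}) by formula: {pmf_manual:.6f}")
print(f"P(X = {k}) by SciPy:   {pmf_scipy:.6f}")
print(f"P(X > {k}): {tail_prob:.6f}")
```

For "more than $k$" questions, the survival function avoids computing `1 - cdf` by hand and is usually more accurate far out in the tail.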

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 7</b></p>

1. A call center receives an average of 4 calls per hour. Assuming calls follow a Poisson process:
   - Plot the PMF for the number of calls received in an hour.
   - Compute the probability of receiving exactly 5 calls in an hour.
   - Compute the probability of receiving more than 7 calls in an hour.

2. Simulate 1,000 random samples from a Poisson distribution with $\lambda = 3$:
   - Compute the mean and variance of the sample.
   - Compare these values to the theoretical mean and variance.

3. Generate 100 random Poisson-distributed data points with $\lambda = 5$. Create a histogram and overlay it with the theoretical PMF.

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 8</b></p>

**Task 1: Load the Dataset**
1. Load the Iris dataset (`IrisFlower.csv`) into a Pandas DataFrame.
2. Display the first 10 rows of the dataset to understand its structure.

**Task 2: Descriptive Statistics**
1. Compute the mean, median, variance, and standard deviation for the `sepal length` feature for the entire dataset.
2. Group the data by species and compute the mean and variance of all numeric features for each species.

**Task 3: Probability Distributions**
1. Plot the histogram and probability density function (PDF) for the `sepal length` feature.
2. Compute the skewness and kurtosis for the `sepal length` feature.

**Task 4: Data Visualization**
1. Create a box plot to compare the `sepal length` distribution across species.
2. Create a scatter plot of `sepal length` vs. `petal length`, coloring points by species.


## **2.4. Inferential Statistics**

Inferential statistics involves drawing conclusions about a population based on a sample. Key concepts include:

1. **Confidence Intervals**: Estimating the range within which a population parameter lies.
2. **Hypothesis Testing**: Testing assumptions or claims about a population using sample data.
3. **p-values**: Measuring the strength of evidence against the null hypothesis.

In this section, we will explore these concepts with examples.

### **2.4.1. Confidence Intervals**

A **confidence interval (CI)** is a range of values that is likely to contain a population parameter (e.g., mean) with a specified level of confidence.

**Formula for CI of the Mean:**
$$
CI = \bar{x} \pm Z \cdot \frac{s}{\sqrt{n}}
$$

Where:
- $\bar{x}$: Sample mean
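- $Z$: Z-critical value for the desired confidence level (e.g., 1.96 for 95%). For small samples with an unknown population variance, the t-critical value with $n - 1$ degrees of freedom is used instead.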
- $s$: Sample standard deviation
- $n$: Sample size

**Interpretation:**
- A 95% confidence interval means that if we repeated the sampling process many times, approximately 95% of the calculated intervals would contain the true population mean.

**Use Cases:**
- Estimating population parameters (e.g., mean, proportion).
- Assessing the precision of sample estimates.
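
The repeated-sampling interpretation can be illustrated with a small simulation. This is a sketch under assumed settings: the population parameters, the sample size of 50, the 5,000 repetitions, and the seed are all hypothetical choices. Each repetition builds a 95% interval with the formula above and records whether it contains the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
true_mean, true_std = 10.0, 2.0       # hypothetical population parameters
sample_size, n_repeats = 50, 5000
z_crit = norm.ppf(0.975)              # Z-critical value for 95% confidence

# Draw n_repeats independent samples, one per row
draws = rng.normal(true_mean, true_std, size=(n_repeats, sample_size))
means = draws.mean(axis=1)
margins = z_crit * draws.std(axis=1, ddof=1) / np.sqrt(sample_size)

# Fraction of intervals that contain the true mean
covered = (means - margins <= true_mean) & (true_mean <= means + margins)
coverage = covered.mean()
print(f"Empirical coverage: {coverage:.3f}")
```

The empirical coverage should come out close to 0.95. With much smaller samples, intervals based on the Z-critical value tend to cover slightly less often than the nominal level.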

In [None]:
import numpy as np
from scipy.stats import norm

# Example: Confidence Interval for the Mean
data = [15, 18, 22, 20, 18, 30, 28, 22, 18]  # Sample data
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
n = len(data)
confidence_level = 0.95

# Compute the margin of error
z_critical = norm.ppf((1 + confidence_level) / 2)  # Z-critical value for 95% confidence
margin_of_error = z_critical * (sample_std / np.sqrt(n))

# Confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print(f"Sample Mean: {sample_mean}")
print(f"95% Confidence Interval: [{lower_bound}, {upper_bound}]")

In [None]:
# Visualization
x = np.linspace(sample_mean - 3 * sample_std, sample_mean + 3 * sample_std, 1000)
y = norm.pdf(x, loc=sample_mean, scale=sample_std / np.sqrt(n))

plt.plot(x, y, label="Sampling Distribution")
plt.axvline(sample_mean, color='blue', linestyle='--', label="Sample Mean")
plt.axvline(lower_bound, color='red', linestyle='--', label="Lower Bound (95% CI)")
plt.axvline(upper_bound, color='red', linestyle='--', label="Upper Bound (95% CI)")
plt.fill_between(x, y, where=(x >= lower_bound) & (x <= upper_bound), color='red', alpha=0.2, label="95% CI Region")
plt.title("Confidence Interval Visualization")
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
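
Because the sample above has only 9 observations, a t-based interval is a common alternative to the Z-based one. The sketch below recomputes the margin of error with the t-critical value for $n - 1$ degrees of freedom and compares it with the Z-based margin; the variable names `z_margin` and `t_margin` are illustrative.

```python
import numpy as np
from scipy.stats import norm, t

data = [15, 18, 22, 20, 18, 30, 28, 22, 18]  # same sample as above
n = len(data)
sample_mean = np.mean(data)
standard_error = np.std(data, ddof=1) / np.sqrt(n)

# Margin of error with the Z-critical and the t-critical value (95% confidence)
z_margin = norm.ppf(0.975) * standard_error
t_margin = t.ppf(0.975, n - 1) * standard_error

print(f"Z-based 95% CI: [{sample_mean - z_margin:.2f}, {sample_mean + z_margin:.2f}]")
print(f"t-based 95% CI: [{sample_mean - t_margin:.2f}, {sample_mean + t_margin:.2f}]")
```

The t-based interval is wider, reflecting the extra uncertainty from estimating the standard deviation from a small sample.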

### **2.4.2. Statistical Hypothesis Testing**

**Statistical hypothesis testing** is a method used to make inferences or draw conclusions about a population based on sample data. It involves evaluating two competing hypotheses: the null hypothesis and the alternative hypothesis.

**Key Components:**
1. **Null Hypothesis ($H_0$)**: The statement being tested, usually positing that there is no effect or no difference. It serves as the default assumption.

2. **Alternative Hypothesis ($H_1$ or $H_a$)**: The statement that is accepted if the null hypothesis is rejected. It represents a new effect or difference, suggesting that there is something happening in the population.

3. **Significance Level ($\alpha$)**: The probability of rejecting the null hypothesis when it is actually true, commonly set at values like 0.05 (5%) or 0.01 (1%).

4. **Test Statistic**: A standardized value derived from sample data that is used to determine whether to reject the null hypothesis. The choice of test statistic depends on the hypothesis test being conducted.

5. **P-Value**: The probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. A p-value below $\alpha$ is taken as sufficient evidence to reject the null hypothesis.

6. **Decision Rule**: A systematic method to decide whether to reject the null hypothesis based on the comparison of the p-value with the significance level. If the p-value is less than $\alpha$, you reject $H_0$.

**Steps in Hypothesis Testing:**
1. **State the Hypotheses**: Clearly articulate the null and alternative hypotheses.
2. **Choose a Significance Level**: Decide on the threshold for rejecting the null hypothesis.
3. **Collect Data**: Gather and prepare the data required for the test.
4. **Calculate the Test Statistic and P-Value**: Determine the test statistic from the sample data and compute the corresponding p-value.
5. **Make a Decision**: Compare the p-value with the significance level and decide whether to reject or fail to reject the null hypothesis.
6. **Draw Conclusions**: Interpret the results in the context of the research question.

**Considerations:**
- Type I Error: The error made when the null hypothesis is rejected when it is true (false positive).
- Type II Error: The error made when the null hypothesis is not rejected when it is false (false negative).
- Power of the Test: The probability of correctly rejecting the null hypothesis when it is false. Higher power indicates a greater ability to detect an effect.

Statistical hypothesis testing is a critical tool in inferential statistics, providing a formal framework for making decisions based on data. Understanding the principles of hypothesis testing is essential for analyzing data and interpreting results in various fields, including science, business, and social sciences.
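
The meaning of the significance level as a Type I error rate can be illustrated by simulation. In this sketch every sample is drawn from a population for which the null hypothesis is true, so each rejection is a false positive. The number of experiments, the sample size, and the seed are hypothetical choices, and a one-sample t-test is used as the example test.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)         # arbitrary seed for reproducibility
alpha = 0.05
n_experiments, sample_size = 5000, 20  # hypothetical simulation settings

# H0 is true by construction: every sample comes from a population with mean 0
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One t-test per row, each testing H0: mean = 0
p_values = ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Fraction of experiments in which H0 was (wrongly) rejected
type_i_rate = np.mean(p_values < alpha)
print(f"Empirical Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```

The empirical rejection rate should be close to $\alpha$, which is what choosing a significance level of 0.05 is meant to guarantee.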

### **2.4.3. t-Test**

The **t-test** is used to determine whether there is a significant difference between the means of one or two groups.

**Assumptions:**
- Data is approximately normally distributed.
- Variances of the groups are equal (can be relaxed for Welch's t-test).

**Use Cases:**
- Comparing the effectiveness of treatments.
- Testing claims about population means.

#### ***2.4.3.1. One-Sample t-Test***

Tests whether the mean of a sample is significantly different from a hypothesized population mean.

$$
t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}
$$

Where:
- $\bar{x}$: Sample mean
- $\mu_0$: Hypothesized population mean
- $s$: Sample standard deviation
- $n$: Sample size

In [None]:
from scipy.stats import ttest_1samp

# Example: One-Sample t-Test
# Null Hypothesis (H0): The mean of the sample is equal to a hypothesized value (e.g., 20)
hypothesized_mean = 20
t_stat, p_value = ttest_1samp(data, popmean=hypothesized_mean)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The sample mean is significantly different from the hypothesized mean.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the sample mean and the hypothesized mean.")
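
To make the link between the formula and `ttest_1samp` explicit, the sketch below computes the t-statistic by hand and obtains a two-sided p-value from the t-distribution with $n - 1$ degrees of freedom. It reuses the sample from the confidence-interval example; the names `t_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

data = [15, 18, 22, 20, 18, 30, 28, 22, 18]
mu0 = 20  # hypothesized population mean
n = len(data)

# t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation
t_manual = (np.mean(data) - mu0) / (np.std(data, ddof=1) / np.sqrt(n))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * t.sf(abs(t_manual), n - 1)

result = ttest_1samp(data, popmean=mu0)
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```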

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 9</b></p>

**Exercise:** One-Sample t-Test

1. **Data Collection**: Suppose you have collected the following sample data representing the weights (in kg) of a group of individuals:

   ```python
   data = [65, 70, 75, 80, 60, 58, 90, 77, 85, 68]
   ```

2. **Hypothesize**: Your research question is to determine if the average weight of this group is significantly different from a known population mean weight of 70 kg.

3. **Perform the One-Sample t-Test**:
   - State the null hypothesis (\(H_0\)) and alternative hypothesis (\(H_a\)).
   - Use the One-Sample t-Test to test the hypothesis, calculating the t-statistic and p-value.
   - Choose a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the results of the t-test and the chosen significance level, conclude whether to reject or fail to reject the null hypothesis.
   - What does this imply about the average weight of your sample compared to the population mean?

5. **Bonus Question**: If you repeat the test with a significance level of 0.01, how does your conclusion change?

#### ***2.4.3.2. Two-Sample t-Test***

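Tests whether the means of two independent samples are significantly different. The statistic below uses a separate variance estimate for each group (Welch's form); the classical pooled-variance version assumes equal variances and replaces $s_1^2$ and $s_2^2$ with a single pooled estimate.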

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

Where:
- $\bar{x}_1, \bar{x}_2$: Sample means
- $s_1^2, s_2^2$: Sample variances
- $n_1, n_2$: Sample sizes

In [None]:
from scipy.stats import ttest_ind

# Example: Two-Sample t-Test
# Sample data for two groups
group1 = [15, 18, 22, 20, 18]
group2 = [30, 28, 25, 27, 22]

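# Null Hypothesis (H0): The means of the two groups are equal
# Note: ttest_ind pools the variances by default (equal_var=True);
# pass equal_var=False for Welch's t-test
t_stat, p_value = ttest_ind(group1, group2)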

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpret the results
if p_value < alpha:
    print("Reject the null hypothesis: The means of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the means of the two groups.")
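
The unpooled statistic $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ corresponds to Welch's t-test. The sketch below computes it by hand, assuming the Welch-Satterthwaite approximation for the degrees of freedom, and compares the result with `ttest_ind(..., equal_var=False)`. The names `t_welch`, `df_welch`, and `p_welch` are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_ind

group1 = [15, 18, 22, 20, 18]
group2 = [30, 28, 25, 27, 22]
n1, n2 = len(group1), len(group2)
v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

# Squared standard error of the difference in means (unpooled)
se2 = v1 / n1 + v2 / n2
t_welch = (np.mean(group1) - np.mean(group2)) / np.sqrt(se2)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = se2**2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
p_welch = 2 * t.sf(abs(t_welch), df_welch)

result = ttest_ind(group1, group2, equal_var=False)
print(f"Manual: t = {t_welch:.4f}, df = {df_welch:.2f}, p = {p_welch:.4f}")
print(f"SciPy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

When the two groups have the same size, the pooled and Welch statistics coincide; only the degrees of freedom, and therefore the p-value, differ.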

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 10</b></p>

**Exercise:** Two-Sample t-Test

1. **Data Collection**: You have conducted an experiment to compare the effects of two different treatments on blood pressure reduction. The following data shows the reduction in systolic blood pressure (in mmHg) for each group:

   - **Treatment Group 1**:
   ```python
   group1 = [8, 6, 7, 5, 9]
   ```

   - **Treatment Group 2**:
   ```python
   group2 = [12, 14, 13, 11, 15]
   ```

2. **Hypothesize**: Your research question is to determine if there is a significant difference in the mean blood pressure reduction between the two treatment groups.

3. **Perform the Two-Sample t-Test**:
   - State the null hypothesis (\(H_0\)) and alternative hypothesis (\(H_a\)).
   - Use the Two-Sample t-Test to test the hypothesis, calculating the t-statistic and p-value.
   - Choose a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the results of the t-test and the chosen significance level, conclude whether to reject or fail to reject the null hypothesis.
   - What does this imply about the effectiveness of the two treatments concerning blood pressure reduction?

5. **Bonus Question**: If you were to collect additional data that lowers the variance within the treatment groups, how might that affect your results? Discuss the implications of sample size and variance on the hypothesis test.

### **2.4.4. F-Test**

The **F-Test** is a statistical test used to determine whether there is a significant difference between the variances of two populations. The same F distribution underlies ANOVA (Analysis of Variance), where the ratio of between-group to within-group variance is used to compare group means; the form described here compares the variances of two independent samples.

**Test Statistic:**
$$
F = \frac{s_1^2}{s_2^2}
$$
Where:
- $s_1^2$: Sample variance of the first group
- $s_2^2$: Sample variance of the second group

**Key Components:**
1. **Null Hypothesis ($H_0$)**: The null hypothesis states that the variances of the two populations are equal ($\sigma_1^2 = \sigma_2^2$).

2. **Alternative Hypothesis ($H_a$)**: The alternative hypothesis posits that the variances are not equal ($\sigma_1^2 \neq \sigma_2^2$).

3. **Degrees of Freedom**:
   - For the numerator: $df_1 = n_1 - 1$, where $n_1$ is the sample size of group 1.
   - For the denominator: $df_2 = n_2 - 1$, where $n_2$ is the sample size of group 2.

4. **Significance Level ($\alpha$)**: The threshold used to decide whether to reject the null hypothesis. Common choices for $\alpha$ are 0.05 or 0.01.

**Steps in F-Test:**
1. **State the Hypotheses**: Clearly articulate the null and alternative hypotheses.
2. **Calculate the Sample Variances**: Compute the variances for the two samples.
3. **Calculate the F-Statistic**: Use the F formula to calculate the test statistic.
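4. **Determine the Critical Value**: Find the critical value from the F-distribution table based on the degrees of freedom and chosen significance level. For the two-sided alternative above, a common convention is to place the larger sample variance in the numerator and use the upper $\alpha/2$ critical value.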
5. **Make a Decision**: Compare the calculated F-statistic with the critical value:
   - If $F > F_{\text{critical}}$, reject the null hypothesis.
   - If $F \leq F_{\text{critical}}$, fail to reject the null hypothesis.
6. **Draw Conclusions**: Interpret the results in the context of the research question.

**Use Cases:**
- Comparing variances across two or more groups.
- Checking the assumptions of other statistical tests, such as ANOVA, which assumes equal variances among groups.

**Considerations:**
- The F-Test is sensitive to non-normality; both groups should ideally be normally distributed for reliable results.
- It can also be sensitive to outliers, which may affect the estimated variances.

In summary, the F-Test assesses whether two populations have equal variances, and it is often used to check the equal-variance assumption behind other procedures. Because it is sensitive to non-normality and outliers, its results should be interpreted with the distribution of the data in mind.

In [None]:
import numpy as np
from scipy.stats import f

# Sample data for two groups
group1 = [30, 35, 31, 29, 33]
group2 = [41, 39, 45, 37, 43]

# Calculate the sample variances
var1 = np.var(group1, ddof=1)  # Unbiased estimator (N-1)
var2 = np.var(group2, ddof=1)

# Calculate the F-statistic
F_statistic = var1 / var2

# Determine degrees of freedom
n1 = len(group1)
n2 = len(group2)
df1 = n1 - 1  # Degrees of freedom for group 1
df2 = n2 - 1  # Degrees of freedom for group 2

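# Calculate the two-sided p-value using the cumulative distribution function (CDF),
# matching the alternative hypothesis that the variances are not equal
cdf_value = f.cdf(F_statistic, df1, df2)
p_value = 2 * min(cdf_value, 1 - cdf_value)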

# Output the results
print(f"F-statistic: {F_statistic}")
print(f"P-value: {p_value}")

# Choose a significance level
alpha = 0.05

# Interpret the results
if p_value < alpha:
    print("Reject the null hypothesis: The variances of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the variances of the two groups.")
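
The critical-value route described in the steps can be sketched as follows. This example assumes a two-sided test in which the larger sample variance is placed in the numerator and the statistic is compared with the upper $\alpha/2$ critical value from `f.ppf`; the names `f_stat`, `dfn`, `dfd`, and `f_critical` are illustrative.

```python
import numpy as np
from scipy.stats import f

group1 = [30, 35, 31, 29, 33]
group2 = [41, 39, 45, 37, 43]
alpha = 0.05

var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

# Put the larger variance in the numerator so that F >= 1
if var1 >= var2:
    f_stat, dfn, dfd = var1 / var2, len(group1) - 1, len(group2) - 1
else:
    f_stat, dfn, dfd = var2 / var1, len(group2) - 1, len(group1) - 1

# Upper alpha/2 critical value for a two-sided test
f_critical = f.ppf(1 - alpha / 2, dfn, dfd)
reject = f_stat > f_critical

print(f"F-statistic: {f_stat:.4f}, critical value: {f_critical:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

With only five observations per group, the critical value is large, so even a sizeable ratio of sample variances is not enough to reject equality of variances.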

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 11</b></p>

**Exercise:** F-Test

1. **Data Collection**: You have conducted an experiment to compare the variability in scores of two different teaching methods. The following scores represent the final exam results (out of 100) for each group:

   - **Group 1 (Method A)**:
   ```python
   group1 = [78, 85, 88, 75, 90]
   ```

   - **Group 2 (Method B)**:
   ```python
   group2 = [82, 78, 84, 90, 70]
   ```

2. **Hypothesize**: Your research question is to determine if there is a significant difference in the variances of the two groups' scores.

3. **Perform the F-Test**:
   - State the null hypothesis (\(H_0\)) and alternative hypothesis (\(H_a\)).
   - Use the F-Test to test the hypothesis by calculating the F-statistic and p-value.
   - Choose a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the results of the F-test and the chosen significance level, conclude whether to reject or fail to reject the null hypothesis.
   - What does this imply about the variability in scores between the two teaching methods?

5. **Bonus Question**: If you were to increase the sample sizes for both groups, how would this impact the power of the test? Discuss the relationship between sample size and the ability to detect differences in variances.

### **2.4.5. Chi-Squared Test**

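The **chi-squared test** compares the observed frequencies of categorical data with the frequencies expected under a null hypothesis, for example to determine whether two categorical variables are significantly associated.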

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

Where:
- $O$: Observed frequencies.
- $E$: Expected frequencies under the null hypothesis.

**Types of Chi-Squared Tests:**
1. **Goodness-of-Fit Test**: Determines if a sample matches an expected distribution.
2. **Test for Independence**: Determines if two categorical variables are independent.
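
As a minimal sketch of the goodness-of-fit variant, the snippet below tests whether a die is fair using `scipy.stats.chisquare`. The roll counts are hypothetical values invented for illustration; they are not data from this chapter.

```python
from scipy.stats import chisquare

# Hypothetical counts of each face (1-6) observed in 60 rolls of a die
observed = [8, 12, 9, 11, 10, 10]

# Under H0 (fair die) every face is expected 60 / 6 = 10 times
expected = [10] * 6

# chisquare computes sum((O - E)^2 / E) and a p-value with k - 1 degrees of freedom
stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-Squared Statistic: {stat}")
print(f"P-value: {p_value}")
```

Here the statistic is $(4 + 4 + 1 + 1 + 0 + 0)/10 = 1.0$, which is small relative to a chi-squared distribution with 5 degrees of freedom, so these counts give no evidence against a fair die.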

**Example: Test for Independence**
Given a contingency table:

$$
\begin{bmatrix}
O_{11} & O_{12} \\
O_{21} & O_{22}
\end{bmatrix}
$$

The expected frequencies are computed as:
$$
E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{\text{Grand Total}}
$$
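
The two formulas above can be sketched by hand with NumPy and compared against SciPy. The $2 \times 2$ table is illustrative. `correction=False` is passed to `chi2_contingency` because, for $2 \times 2$ tables, SciPy applies Yates' continuity correction by default, which would not match the plain formula.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Illustrative 2x2 table of observed frequencies
observed = np.array([[50, 30],
                     [20, 40]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# E_ij = (row i total * column j total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

# Same test in SciPy, without the continuity correction
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)

print(f"Expected frequencies:\n{expected}")
print(f"Manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.6f}")
```

For this table the expected counts are 40, 40 in the first row and 30, 30 in the second, giving $\chi^2 = 35/3 \approx 11.67$ with 1 degree of freedom.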

**Interpretation:**
- Small p-value ($p < 0.05$): Reject the null hypothesis of independence; the data provide evidence that the variables are associated.
- Large p-value ($p \geq 0.05$): Fail to reject the null hypothesis; the data show no significant association.

**Use Cases:**
- Analyzing survey data (e.g., preference vs. demographics).
- Testing independence in contingency tables.

In [None]:
from scipy.stats import chi2_contingency

# Example: Chi-Squared Test
# Contingency table (e.g., observed frequencies of categories)
data = [[50, 30], [20, 40]]  # Rows: Categories, Columns: Groups

# Perform the chi-squared test
chi2_stat, p_value, dof, expected = chi2_contingency(data)

print(f"Chi-Squared Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Choose a significance level
alpha = 0.05

# Interpret the results
if p_value < alpha:
    print("Reject the null hypothesis: The variables are not independent.")
else:
    print("Fail to reject the null hypothesis: No significant association between the variables.")

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 12</b></p>

**Exercise:** Chi-Squared Test for Independence

1. **Data Collection**: You conducted a survey to analyze the relationship between pet ownership (Dog, Cat) and gender (Male, Female). The observed frequencies from the survey are as follows:

   |            | Male | Female |
   |------------|------|--------|
   | Dog        | 30   | 10     |
   | Cat        | 20   | 40     |

   Represent this data in a contingency table:

   ```python
   data = [[30, 10], [20, 40]]
   ```

2. **Hypothesize**: Your research question is to determine if there is a significant association between pet ownership and gender.

3. **Perform the Chi-Squared Test for Independence**:
   - State the null hypothesis ($H_0$) and alternative hypothesis ($H_a$).
   - Use the Chi-Squared test to evaluate the hypothesis by calculating the Chi-Squared statistic and p-value.
   - Choose a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the results of the Chi-Squared test and the chosen significance level, conclude whether to reject or fail to reject the null hypothesis.
   - What does this imply about the association between pet ownership and gender?

5. **Bonus Question**: If you were to collect more data and the sample sizes increased significantly, how might this affect your Chi-Squared test results? Discuss the implications of sample size on the power of the test.

## **2.5. Analysis of Variance (ANOVA)**

Analysis of Variance (ANOVA) is a statistical method used to compare means of three or more samples (groups) to understand if at least one sample mean is significantly different from the others. It tests the hypothesis that the means of different groups are equal under the assumption that the samples are drawn from normally distributed populations with equal variances.

ANOVA decomposes the observed variance in a particular variable into components attributable to different sources of variation. This helps in understanding whether the variation between groups is significant compared to the variation within groups.
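
In symbols, this decomposition can be written as follows. The notation is an assumption introduced here: $k$ groups, $n_i$ observations in group $i$, $N = \sum_i n_i$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$.

$$
\underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}_{\text{SST (total)}}
= \underbrace{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}_{\text{SSB (between groups)}}
+ \underbrace{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{\text{SSW (within groups)}}
$$

The degrees of freedom split in the same way, $N - 1 = (k - 1) + (N - k)$. Dividing each sum of squares by its degrees of freedom gives the mean squares $\text{MSB} = \text{SSB}/(k - 1)$ and $\text{MSW} = \text{SSW}/(N - k)$.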

### **2.5.1. One-Factor ANOVA (One-Way ANOVA)**

**Definition:**

One-Factor ANOVA, also known as One-Way ANOVA, is used when comparing the means of more than two groups based on one independent variable (factor). It helps determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

**Hypotheses:**

- **Null Hypothesis ($H_0$)**: All group means are equal.
  $$
  H_0: \mu_1 = \mu_2 = \cdots = \mu_k
  $$
- **Alternative Hypothesis ($H_a$)**: At least one group mean is different.
  $$
  H_a: \text{At least one } \mu_i \text{ is different}
  $$

**Test Statistic:**

ANOVA uses the F-statistic to determine the significance.

The F-statistic is calculated as:
$$
F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{\text{Mean Square Between (MSB)}}{\text{Mean Square Within (MSW)}}
$$

Where:

- **Between-group variance (MSB)**: Measures how far the group means lie from the grand mean, i.e., the variation between the samples.
- **Within-group variance (MSW)**: Measures variation within each sample.
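
To make the two mean squares concrete, here is a minimal sketch that computes MSB, MSW, and the F-statistic by hand and checks the result against `scipy.stats.f_oneway`. The three groups are small illustrative samples, and the variable names (`ssb`, `ssw`, and so on) are chosen here for readability.

```python
import numpy as np
from scipy import stats

# Illustrative data: three small groups
groups = [
    np.array([22, 23, 27, 30, 25]),
    np.array([18, 20, 16, 21, 19]),
    np.array([28, 32, 35, 30, 31]),
]

k = len(groups)                           # number of groups
n_total = sum(len(g) for g in groups)     # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group means around the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations around their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)          # Mean Square Between
msw = ssw / (n_total - k)    # Mean Square Within
f_manual = msb / msw

# Upper-tail probability of the F distribution with (k - 1, N - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Reference result from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```

A large F means the group means are spread out far more than the scatter within each group would explain.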

**Assumptions:**

1. Independence of observations.
2. Normally distributed populations.
3. Homogeneity of variances (equal variances among groups).
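
The last two assumptions can be screened informally before running ANOVA. The sketch below uses `scipy.stats.shapiro` (Shapiro-Wilk test of normality, one group at a time) and `scipy.stats.levene` (Levene's test for equal variances). Using these particular tests is a choice made here, not something prescribed by the chapter, and the data are illustrative.

```python
from scipy import stats

# Illustrative data: three small groups
groups = {
    "group1": [22, 23, 27, 30, 25],
    "group2": [18, 20, 16, 21, 19],
    "group3": [28, 32, 35, 30, 31],
}

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal population)
normality_p = {}
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    normality_p[name] = p_norm
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variances: Levene's test (H0: all groups have equal variances)
lev_stat, p_levene = stats.levene(*groups.values())
print(f"Levene statistic = {lev_stat:.3f}, p = {p_levene:.3f}")
```

With only five observations per group these tests have little power, so a large p-value here does not confirm that the assumptions hold. Independence cannot be tested this way and has to be justified by the study design.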

In [None]:
import numpy as np
from scipy import stats

# Sample data for three groups
group1 = [22, 23, 27, 30, 25]
group2 = [18, 20, 16, 21, 19]
group3 = [28, 32, 35, 30, 31]

# Perform One-Way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: At least one group mean is significantly different.")
else:
    print("Fail to reject the null hypothesis: No significant difference between group means.")

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 13</b></p>

**Exercise:** One-Way ANOVA

1. **Data Collection**: An educator wants to know if three different teaching methods have different effects on student performance. The test scores (out of 100) for students using each method are as follows:

   - **Method A**:
     ```python
     method_A = [85, 88, 90, 87, 86]
     ```
   - **Method B**:
     ```python
     method_B = [78, 82, 80, 76, 79]
     ```
   - **Method C**:
     ```python
     method_C = [92, 95, 93, 91, 94]
     ```

2. **Hypothesize**:
   - State the null hypothesis ($H_0$) and alternative hypothesis ($H_a$).

3. **Perform One-Way ANOVA**:
   - Use Python to perform the ANOVA test.
   - Calculate the F-statistic and p-value.
   - Use a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the ANOVA results, determine whether to reject or fail to reject the null hypothesis.
   - What does this indicate about the effectiveness of the different teaching methods?

5. **Post-Hoc Analysis (Bonus)**:
   - If significant differences are found, perform a post-hoc test (e.g., Tukey's HSD) to determine which groups differ from each other.

### **2.5.2. Multiple Factors (Two-Way ANOVA)**

**Definition:**

Two-Way ANOVA is used when analyzing the effect of two independent variables (factors) on a dependent variable. It also helps in understanding if there is an interaction effect between the two factors.

**Hypotheses:**

1. **Main Effects**:
   - For Factor A:
     - $H_0$: No effect of Factor A on the dependent variable.
     - $H_a$: Significant effect of Factor A.
   - For Factor B:
     - $H_0$: No effect of Factor B on the dependent variable.
     - $H_a$: Significant effect of Factor B.
2. **Interaction Effect**:
   - $H_0$: No interaction between Factor A and Factor B.
   - $H_a$: Significant interaction between Factor A and Factor B.
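
An interaction means that the effect of one factor depends on the level of the other. As a descriptive sketch (not a significance test), the table of cell means shows this directly. The data frame below is a hypothetical balanced $2 \times 2$ design with made-up values, and the names `A`, `B`, and `y` are placeholders.

```python
import pandas as pd

# Hypothetical balanced 2x2 design: two observations per cell (values are made up)
df = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "y": [10, 12, 14, 16, 11, 13, 21, 23],
})

# Mean of y in each cell (rows: levels of A, columns: levels of B)
cell_means = df.pivot_table(values="y", index="A", columns="B", aggfunc="mean")
print(cell_means)

# Effect of B (b2 minus b1) within each level of A
effect_of_B = cell_means["b2"] - cell_means["b1"]
print(effect_of_B)

# Difference of the differences: zero would mean no interaction in the cell means
interaction_contrast = effect_of_B["a2"] - effect_of_B["a1"]
print(f"Interaction contrast: {interaction_contrast}")
```

In this made-up example, moving from `b1` to `b2` raises the mean by 4 at level `a1` but by 10 at level `a2`, so the contrast is 6. Whether such a difference is statistically significant is what the interaction F-test in Two-Way ANOVA assesses.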

**Assumptions:**

1. Independence of observations.
2. Normally distributed populations.
3. Homogeneity of variances.
4. Each observation belongs to exactly one combination of the levels of the two factors (one cell of the design).

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create sample data
data = {
    'Score': [85, 88, 90, 87, 86, 78, 82, 80, 76, 79, 92, 95, 93, 91, 94],
    'Method': ['A']*5 + ['B']*5 + ['C']*5,
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M']
}
df = pd.DataFrame(data)

# Perform Two-Way ANOVA
model = ols('Score ~ C(Method) + C(Gender) + C(Method):C(Gender)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

**Output Interpretation:**

- **sum_sq**: Sum of squares attributable to each source.
- **df**: Degrees of freedom.
- **F**: F-statistic.
- **PR(>F)**: P-value.

**Notes:**

- `C(Method)`: Categorical variable 'Method'.
- `C(Gender)`: Categorical variable 'Gender'.
- `C(Method):C(Gender)`: Interaction between Method and Gender.

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 14</b></p>

**Exercise:** Two-Way ANOVA

1. **Data Collection**: A researcher is studying the effect of diet and exercise on weight loss. Participants are split into groups based on diet type (Diet 1, Diet 2) and exercise level (Low, High). The weight loss (in kg) for each group is recorded:

   | Diet | Exercise Level | Weight Loss (kg) |
   |------|----------------|------------------|
   | 1    | Low            | [2, 3, 1.5, 2.5] |
   | 1    | High           | [4, 4.5, 5, 3.5] |
   | 2    | Low            | [1, 2, 1.5, 1.8] |
   | 2    | High           | [3, 3.5, 4, 2.8] |

2. **Hypothesize**:
   - State the null and alternative hypotheses for both main effects and for the interaction effect.

3. **Perform Two-Way ANOVA**:
   - Organize the data into a pandas DataFrame.
   - Use Python to perform the Two-Way ANOVA.
   - Use a significance level of 0.05.

4. **Interpret the Results**:
   - Based on the ANOVA table, determine the significance of diet, exercise level, and their interaction.
   - What does this indicate about the effects on weight loss?

5. **Conclusions**:
   - Summarize your findings and suggest practical implications.