<a href="https://colab.research.google.com/github/pkmariya/Statistics/blob/main/Hypothesis_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing

Hypothesis testing is a statistical method used across industries to make decisions or draw conclusions about a population based on sample data. It assesses whether the sample provides enough evidence to support a claim (hypothesis) about that population.
Here are several industry-based examples of hypothesis testing:

**1. Pharmaceuticals: Drug Efficacy**

**Scenario**: A pharmaceutical company develops a new drug to treat a specific medical condition and wants to determine if the drug is more effective than the existing treatment. Hypothesis:

* *Null Hypothesis (H0)*: The new drug is equally effective as the existing treatment.
* *Alternative Hypothesis (H1)*: The new drug is more effective than the existing treatment.

**Test**: A clinical trial is conducted, and statistical tests are performed to analyze the data and determine if there is enough evidence to reject the null hypothesis in favor of the alternative.

**2. Finance: Investment Strategy**

**Scenario**: A financial analyst proposes a new investment strategy that claims to outperform the current market average. Hypothesis:

* *Null Hypothesis (H0)*: The new investment strategy does not outperform the market average.
* *Alternative Hypothesis (H1)*: The new investment strategy outperforms the market average.

**Test**: Historical data is collected and analyzed using statistical tests to assess whether the returns from the proposed strategy are significantly higher than the market average.

**3. Manufacturing: Production Process Improvement**

**Scenario**: A manufacturing plant implements changes to its production process with the goal of reducing defects in the final product. Hypothesis:

* *Null Hypothesis (H0)*: The changes to the production process do not reduce defects.
* *Alternative Hypothesis (H1)*: The changes to the production process reduce defects.

**Test**: Data on defect rates before and after the changes are collected, and statistical tests are performed to determine if there is a significant improvement.

In each of these examples, hypothesis testing provides a structured approach to assess claims or changes within different industries, helping decision-makers make informed choices based on statistical evidence.

Hypothesis Testing starts with the formulation of these two hypotheses:

**Null hypothesis (H₀)**: There is no significant difference between the observed sample and the general population or between two samples.

**Alternate hypothesis (H₁)**: There is a statistically significant difference or effect, i.e., one that is unlikely to be due to chance alone.

Defining the hypotheses can be tricky, so we can follow these rules to formulate both of them correctly.

The null hypothesis always has the following signs: =  OR   ≤   OR    ≥

The alternate hypothesis always has the following signs: ≠   OR  >   OR    <

**Example 1**: Zomato claimed that its total valuation in December 2023 was at least $500 million. Here, the claim contains ≥ sign (i.e. the at least sign), so the null hypothesis is the original claim.

**Example 2**: Zomato claimed that its total valuation in December 2023 was greater than $500 million. Here, the claim contains > sign (i.e. the ‘more than’ sign), so the null hypothesis is the complement of the original claim.


## Making a Decision
Once you have formulated the null and alternate hypotheses, you have to decide whether to reject or fail to reject the null hypothesis.

Let's say Maruti has to buy tires for its cars from a tire manufacturer. The tire manufacturer claims that the life of its tires is 36 months. Now we have to find out whether this claim holds.


* Null hypothesis (H₀) : Life of tire = 36 months
* Alternate hypothesis (H₁) : Life of tire ≠ 36 months

Now Maruti tests 100 tires and the average life comes out to 32 months. Do we reject the claim or not? What if the average comes out to 28 months?

For this, we define a critical region, bounded by a lower and an upper critical value. If the sample average is less than the lower critical value or more than the upper critical value, we reject the null hypothesis.

The formulation of the null and alternate hypotheses determines the type of the test and the position of the critical regions in the normal distribution.

You can tell the type of the test and the position of the critical region on the basis of the ‘sign’ in the alternate hypothesis.

   ≠ in H₁    →   Two-tailed test        →     Rejection region on both sides of distribution
   < in H₁    →   Lower-tailed test     →     Rejection region on left side of distribution
   > in H₁    →   Upper-tailed test     →     Rejection region on right side of distribution
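
As a rough illustration of the mapping above, the critical values for a test statistic that follows the standard normal distribution can be looked up with `scipy.stats.norm.ppf`. The significance level of 0.05 is an assumed value for this sketch, not something fixed by the examples above.

```python
from scipy.stats import norm

alpha = 0.05  # assumed significance level, for illustration only

# Two-tailed test (≠ in H1): alpha is split equally between both tails
two_tailed = (norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))

# Lower-tailed test (< in H1): the whole rejection region is on the left
lower_tailed = norm.ppf(alpha)

# Upper-tailed test (> in H1): the whole rejection region is on the right
upper_tailed = norm.ppf(1 - alpha)

print(f"Two-tailed critical values: {two_tailed[0]:.2f}, {two_tailed[1]:.2f}")
print(f"Lower-tailed critical value: {lower_tailed:.2f}")
print(f"Upper-tailed critical value: {upper_tailed:.2f}")
```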

## Steps to perform Hypothesis Testing:

1.   **Formulate the Hypotheses** The first step is to clearly define the null hypothesis (H₀) and the alternative hypothesis (H₁).

* Null Hypothesis (H₀): There is no significant difference between the observed sample and the general population or between two samples.
* Alternative Hypothesis (H₁): There is a significant difference or effect that is not due to chance alone.

2.   **Select the Significance Level (α)** The significance level (α) is the probability of rejecting the null hypothesis when it is actually true (Type I error). Commonly used α levels are 0.05 (5%) or 0.01 (1%).

3.   **Collect and Analyze Data** Data collection can be done through random sampling or experimental design. The data must be relevant to the hypotheses and collected in a way that minimizes bias. Once collected, the data is analyzed to summarize its main characteristics, often through descriptive statistics or visualizations.

4.   **Calculate the Test Statistic** The test statistic is a numerical value calculated from the sample data that, under the null hypothesis, follows a known distribution. The choice of test statistic depends on the nature of the data and the hypothesis being tested. Common tests include t-tests, z-tests, chi-square tests, and ANOVA, among others.

5.   **Determine the Critical Region or Calculate the P-value**

* **Critical Region**: This approach involves comparing the test statistic to critical values that define regions where the null hypothesis would be rejected. The critical values are determined based on the significance level and the distribution of the test statistic.
* **P-value**: Alternatively, the p-value approach calculates the probability of observing a test statistic as extreme as, or more extreme than, the value calculated from the sample data, assuming the null hypothesis is true. If the p-value is less than or equal to the significance level (α), the null hypothesis is rejected.

**We prefer using the p-value method over the critical-region method.**

6.   **Make a Decision**

* If the test statistic falls within the critical region or the p-value is less than or equal to α, reject the null hypothesis in favor of the alternative hypothesis.
* If the test statistic does not fall within the critical region or the p-value is greater than α, do not reject the null hypothesis.

7.   **Draw Conclusions** Finally, interpret the results in the context of the original research question or business problem. This involves stating whether the findings support the alternative hypothesis and discussing the implications of the results for the problem at hand.
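
To make the meaning of the significance level in step 2 more concrete, here is a minimal simulation sketch. It assumes a standard normal population, 2000 simulated experiments of 30 observations each, and a fixed random seed; all of these are illustrative choices. Because the null hypothesis is true in every simulated experiment, the share of rejections should land close to α.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
alpha = 0.05
n_experiments = 2000

# Draw every sample from a population where H0 (mean = 0) is actually true
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, 30))

# One-sample t-test for each simulated experiment (axis=1 tests each row)
_, p_values = ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of experiments that wrongly reject a true null hypothesis
false_positive_rate = np.mean(p_values <= alpha)
print(f"Observed Type I error rate: {false_positive_rate:.3f}")
```

The observed rate will not equal 0.05 exactly, since it is itself estimated from a finite number of simulated experiments.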

### **Z-Test**
Jeep, a well-known car maker, claims that its car 'Compass' gives a mileage of at least 17 km/litre.

* Null Hypothesis: μ ≥ 17
* Alternate Hypothesis: μ < 17

Google claims that its internet browser ‘Chrome’ is the best in the industry, as it has an optimum boot time of only 250 ms, with a standard deviation of 9 ms. Sam, a tech geek, wanted to test Google's claim. So, he randomly collected boot time data from 165 devices running Chrome and got a sample mean of 247 ms.

* H₀: μ = 250, i.e., the mean boot time is 250 ms.
* H₁: μ ≠ 250, i.e., the mean boot time is not 250 ms.

For a Z-test, these two conditions should be met:

Condition 1: n > 30, which means that the sample size should be greater than 30 observations.

Condition 2: σ is known, i.e., the population standard deviation is known.

If either of these conditions is not met, we use a t-test instead.

The next step is to determine the test statistic. A test statistic, in simple terms, is a value calculated from the sample data, which is then compared with tabulated critical values.

The test statistic for a Z-test is defined as:

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Here $\bar{x}$ is the sample mean, μ is the hypothesized population mean, σ is the population standard deviation and n is the sample size.

$$Z = \frac{247 - 250}{9 / \sqrt{165}} \approx -4.3$$

We will now test our hypothesis at a 5% significance level (i.e., a 95% confidence level). For a two-tailed test at this level, the Z critical values are +1.96 and -1.96; these are the upper and lower critical values, respectively. The test statistic value we calculated is -4.3.

The region between +1.96 and -1.96 is called the acceptance region, and the region outside it is called the critical region.

If the calculated Z-statistic is in the region of acceptance, you fail to reject the null hypothesis. If the calculated Z-statistic lies outside the region of acceptance, i.e., in the critical region, you reject the null hypothesis.

In our case, the test statistic value is -4.3, which lies outside the region of acceptance of ±1.96. So, you reject the null hypothesis.
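
The hand calculation above can be cross-checked directly from the summary statistics. The following is a minimal sketch that uses only the numbers quoted in the Chrome example; the variable names are illustrative.

```python
import math
from scipy.stats import norm

mu_0 = 250    # hypothesized mean boot time (ms)
sigma = 9     # population standard deviation (ms)
n = 165       # sample size
x_bar = 247   # sample mean (ms)

# Z-statistic from the summary statistics
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Two-tailed p-value and the critical value at the 5% significance level
p_value = 2 * norm.sf(abs(z))
z_crit = norm.ppf(1 - 0.05 / 2)

print(f"Z = {z:.2f}, critical values = ±{z_crit:.2f}, p-value = {p_value:.6f}")
```

Both approaches lead to the same decision: |Z| exceeds the critical value, and the p-value is below 0.05.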

In [None]:
from statsmodels.stats.weightstats import ztest
import numpy as np

# Simulated boot times in ms (illustrative only; this is not Sam's actual sample)
boot_times = np.random.randint(180, 200, size=165)

# Perform a one-sample Z-test against the hypothesized mean of 250 ms
# Note: ztest estimates the standard deviation from the sample itself
z_statistic, p_value = ztest(boot_times, value=250)

# Interpret the results at the 5% significance level
print(f"Z-statistic: {z_statistic:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("We have sufficient evidence to reject the null hypothesis.")
    print("The mean boot time is significantly different from 250 ms.")
else:
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("The data are consistent with a mean boot time of 250 ms.")


The simulated boot times all lie between 180 and 199 ms, far below 250 ms, so the test rejects the null hypothesis on every run; the exact Z-statistic depends on the random draw.


### **T-test**
The t-distribution resembles the normal distribution: it is also symmetric and single-peaked, but less concentrated around its peak. In layman’s terms, a t-distribution is shorter and flatter around the centre, with heavier tails, than a normal distribution. It is used to study the mean of a population that has a distribution fairly close to a normal distribution (but not an exact normal distribution).

**Two simple conditions to determine when to use the t-statistic are as follows:**

* The population standard deviation is unknown.
* The sample size is less than 30.

Even if only one of them applies, you can comfortably go for a t-test. The formula to determine the t-statistic is:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Here $\bar{x}$ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation and n is the sample size.

A company claims that its new algorithm can process a specific dataset in an average of 20 minutes, which is faster than the current average processing time of 22 minutes using the standard algorithm. To validate this claim, a data scientist decides to conduct a t-test. The data scientist collects a sample of processing times using the new algorithm. The sample consists of 10 processing times (in minutes):

Sample Data : 19,18,21,20,19,22,18,17,21,20

The data scientist wants to test if the new algorithm significantly reduces the processing time compared to the standard average of 22 minutes. The hypothesis for the t-test would be set up as follows:

* Null Hypothesis (H₀): The mean processing time using the new algorithm is equal to or greater than 22 minutes. (μ ≥ 22)
* Alternative Hypothesis (H₁): The mean processing time using the new algorithm is less than 22 minutes. (μ < 22)

#### one-sample t-test

In [1]:
from scipy.stats import ttest_1samp

global_average_score = 22
sample_scores = [19,18,21,20,19,22,18,17,21,20]

t_stat, p_value = ttest_1samp(sample_scores, global_average_score)

In [2]:
p_value

0.0007389679098032424

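The p-value reported above (about 0.0007) is two-sided, because `ttest_1samp` runs a two-sided test by default. Since the sample mean (19.5 minutes) lies below 22, the one-sided p-value for H₁: μ < 22 is half of that, about 0.0004, which is well below 0.05. The test therefore provides significant evidence that the new algorithm reduces the processing time compared to the standard processing time of 22 minutes, which supports the claim that the new algorithm is more efficient.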
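
Because the alternative hypothesis here is directional (μ < 22), the test can also be run as a lower-tailed test. The sketch below assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument; the variable names are illustrative.

```python
from scipy.stats import ttest_1samp

processing_times = [19, 18, 21, 20, 19, 22, 18, 17, 21, 20]

# alternative="less" tests H1: mean < 22 (lower-tailed test)
t_stat, p_one_sided = ttest_1samp(processing_times, popmean=22, alternative="less")

print(f"t-statistic: {t_stat:.2f}")
print(f"one-sided p-value: {p_one_sided:.6f}")
```

The t-statistic is unchanged by the choice of alternative; only the p-value differs from the two-sided run.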

#### two-sample t-test
A two-sample t-test is often used to determine if there is a significant difference between the means of two groups. Let's consider a scenario where a company wants to test if a new training program improves employee productivity. The productivity scores (measured in units of work completed per day) of employees who underwent the training are compared to those who did not.

**Hypothesis**

* Null Hypothesis (H₀): There is no difference in productivity between trained and untrained employees.
* Alternative Hypothesis (H₁): There is a difference in productivity between trained and untrained employees.

In [8]:
import numpy as np
from scipy import stats

# Sample data
trained = np.array([10, 12, 11, 15, 14])
untrained = np.array([8, 9, 7, 6, 10])

# Calculate means and standard deviations
mean_trained = np.mean(trained)
mean_untrained = np.mean(untrained)
std_trained = np.std(trained, ddof=1)  # ddof=1 for sample standard deviation
std_untrained = np.std(untrained, ddof=1)

# Sample sizes
n_trained = len(trained)
n_untrained = len(untrained)

# Perform Welch's two-sample t-test (does not assume equal variances)
t_statistic, p_value = stats.ttest_ind(trained, untrained, equal_var=False)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 3.772968873135195
P-value: 0.0061614896845513705


This Python code uses the ttest_ind function from the scipy.stats module to perform a two-sample t-test assuming unequal variances (Welch's t-test). The equal_var=False parameter is set to handle cases where the two groups have different variances.
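
To show what `ttest_ind` computes with `equal_var=False`, here is a sketch that rebuilds Welch's t-statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand from the same two samples. The helper names `v1`, `v2`, `t_manual`, `dof` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

trained = np.array([10, 12, 11, 15, 14])
untrained = np.array([8, 9, 7, 6, 10])

# Variance of each group mean (sample variance divided by sample size)
v1 = trained.var(ddof=1) / len(trained)
v2 = untrained.var(ddof=1) / len(untrained)

# Welch's t-statistic and Welch-Satterthwaite degrees of freedom
t_manual = (trained.mean() - untrained.mean()) / np.sqrt(v1 + v2)
dof = (v1 + v2) ** 2 / (v1 ** 2 / (len(trained) - 1) + v2 ** 2 / (len(untrained) - 1))

# Two-sided p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

print(f"t = {t_manual:.4f}, dof = {dof:.2f}, p = {p_manual:.4f}")
```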

**Another Example of two sample t-test**

Consider a dataset from a recent health survey that includes participants' gender (male or female) and their cholesterol levels (a quantitative variable). A data scientist wants to investigate whether there is a significant difference in the mean cholesterol levels between male and female participants.

In [9]:
import numpy as np
from scipy import stats

gender = np.array(["Male", "Female", "Female", "Male", "Female", "Male", "Male",
                   "Female", "Male", "Female"])

cholesterol = np.array([200, 220, 210, 190, 205, 195, 180, 230, 175, 225])

# Since the gender array contains categorical data, we need to separate the cholesterol data by gender
male_cholesterol = cholesterol[gender == "Male"]
female_cholesterol = cholesterol[gender == "Female"]

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(male_cholesterol, female_cholesterol)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -4.57495710997814
P-value: 0.0018139585097282133


The ttest_ind function is used to compare the means of two independent samples, which in this case are the cholesterol levels of males and females. The gender array is used to filter the cholesterol array into two groups: male_cholesterol and female_cholesterol.


### **Chi-Squared Test (Test of Independence)**
The chi-squared test of independence is used to determine whether or not there is a significant relationship between two nominal (categorical) variables.

The **Degrees of Freedom (DoF)** of the test are determined from the shape of the contingency table:

dof = (No. of Rows - 1) * (No. of Cols - 1)



In [10]:
import numpy as np
from scipy.stats import chi2_contingency

In [16]:
data = np.array([[30, 20], [40, 110]])

In [17]:
test_stat, p, dof, exp_val = chi2_contingency(data)

In [18]:
exp_val

array([[17.5, 32.5],
       [52.5, 97.5]])

In [19]:
print(test_stat)

16.879120879120876


In [20]:
print(f'p-value: {p}')

p-value: 3.983738939937843e-05


In [21]:
if p < 0.05:
    print("We have sufficient evidence to reject the null hypothesis.")
else:
    print("We do not have sufficient evidence to reject the null hypothesis")

We have sufficient evidence to reject the null hypothesis.


Finally, a Chi-Square test evaluates whether the observed contingency table is significantly different from the table that would be expected if there were no association between the variables.
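
The expected table and the test statistic can be rebuilt by hand, which may help clarify what "expected if there were no association" means. This sketch reuses the same 2x2 table; note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so the manual calculation applies it as well.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[30, 20], [40, 110]])

# Expected count for each cell = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
stat = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, dof)

print(expected)
print(f"chi2 = {stat:.4f}, dof = {dof}, p = {p_value:.2e}")
```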

### ANOVA (Analysis of Variance)
Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if there are any statistically significant differences between them. It helps to identify whether the observed differences among group means are due to actual differences or random chance. ANOVA uses F-tests to statistically test the equality of means.

Two sample t-tests can validate a hypothesis containing only two groups at a time. For samples involving three or more groups, the t-test becomes tedious, as you have to perform the tests for each combination of the groups. Also, the possibility of a type-1 error increases in this process.

**One-Way ANOVA**

One-Way ANOVA is used when there is one independent variable with three or more levels (groups) and one dependent variable. It tests whether the means of the groups are equal.

Before running ANOVA, we run **Levene's test** to check the assumption that the groups have similar variances.

**Example**: Suppose a researcher wants to test the effect of three different diets on weight loss. The independent variable is the type of diet (Diet A, Diet B, Diet C), and the dependent variable is the weight loss.

In [27]:
from scipy.stats import f_oneway
from scipy.stats import levene

# Sample data
diet_A = [2.5, 3.0, 2.8, 3.2, 2.9]
diet_B = [3.1, 3.3, 3.0, 3.5, 3.2]
diet_C = [2.7, 2.8, 2.9, 3.0, 2.6]

# Levene's test for equal variances (H0: all groups have the same variance)
f_statistic, p_val = levene(diet_A, diet_B, diet_C)

print(f"F-Statistic: {f_statistic}")
print(f"P-Value: {p_val}")

F-Statistic: 0.29787234042553357
P-Value: 0.7477292735219543


In [28]:
# Perform One-Way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print(f"F-Statistic: {f_statistic}")
print(f"P-Value: {p_value}")

F-Statistic: 5.782945736434107
P-Value: 0.01743348596415625
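
As a sketch of what `f_oneway` does internally, the F-statistic for the three diets can be rebuilt from the between-group and within-group sums of squares. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [
    np.array([2.5, 3.0, 2.8, 3.2, 2.9]),  # diet_A
    np.array([3.1, 3.3, 3.0, 3.5, 3.2]),  # diet_B
    np.array([2.7, 2.8, 2.9, 3.0, 2.6]),  # diet_C
]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)

print(f"F = {f_manual:.4f}, p = {p_manual:.4f}")
```

A large F means the group means differ by more than the spread within each group would explain.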


In [30]:
from scipy.stats import f_oneway

# Sample data: three groups with different values
group1 = [20, 21, 19, 20, 22]
group2 = [28, 32, 30, 29, 27]
group3 = [35, 38, 40, 35, 36]

# Perform the ANOVA test
f_value, p_value = f_oneway(group1, group2, group3)

print("F-value:", f_value)
print("P-value:", p_value)

# Determine if the results are significant
if p_value < 0.05:
    print("There is a statistically significant difference between the groups.")
else:
    print("There is no statistically significant difference between the groups.")

F-value: 104.16494845360837
P-value: 2.6100383513609835e-08
There is a statistically significant difference between the groups.


In this example, group1, group2, and group3 represent three different samples with their respective values. These could be, for example, the blood pressure levels of patients under three different treatments, or any other set of quantitative data.

The f_oneway function calculates the F-value and the P-value. The F-value is the test statistic, and the P-value tells us whether the observed differences in means across the groups are statistically significant.

Remember, ANOVA tells us if there's at least one significant difference but doesn't specify where it is. If the test is significant, you would typically follow up with post-hoc tests to find out which specific groups differ from each other.

**Two-Way ANOVA**

Two-Way ANOVA is used when there are two independent variables. It tests the effect of each independent variable on the dependent variable and also examines the interaction effect between the two independent variables.

**Scenario**

Suppose a financial analyst wants to study the effect of two factors: investment strategy (Strategy A, Strategy B) and market condition (Bull, Bear) on the return on investment (ROI). The independent variables are the investment strategy and market condition, and the dependent variable is the ROI.

In [31]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = {
    'Strategy': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Market': ['Bull', 'Bull', 'Bear', 'Bear', 'Bull', 'Bull', 'Bear', 'Bear'],
    'ROI': [10, 12, 8, 7, 15, 14, 9, 10]
}

df = pd.DataFrame(data)

# Perform Two-Way ANOVA
model = ols('ROI ~ C(Strategy) + C(Market) + C(Strategy):C(Market)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                       sum_sq   df          F    PR(>F)
C(Strategy)            15.125  1.0  17.285714  0.014173
C(Market)              36.125  1.0  41.285714  0.003016
C(Strategy):C(Market)   1.125  1.0   1.285714  0.320188
Residual                3.500  4.0        NaN       NaN


#### **ANOVA and Tukey’s range test**
When the categorical variable has three or more categories, an ANOVA can be used to see if there is a significant difference between any of the groups. Then, if at least one pair of groups are significantly different, Tukey’s range test can be used to determine which groups are different. This is better than running multiple two-sample t-tests because it leads to a lower probability of making a type I error.

For example, if we want to compare the heights of three different tree species, in order to test the hypothesis that average tree heights vary by species, we can use an ANOVA. Then, if the p-value from the ANOVA is below our significance threshold, we can run Tukey’s range test to determine which tree species have significantly different heights.

In [32]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data
data = {
    'Strategy': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Market': ['Bull', 'Bull', 'Bear', 'Bear', 'Bull', 'Bull', 'Bear', 'Bear', 'Bull', 'Bull', 'Bear', 'Bear'],
    'ROI': [10, 12, 8, 7, 15, 14, 9, 10, 20, 18, 11, 12]
}

df = pd.DataFrame(data)

# Perform Two-Way ANOVA
model = ols('ROI ~ C(Strategy) + C(Market) + C(Strategy):C(Market)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("ANOVA Table:")
print(anova_table)

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['ROI'], groups=df['Strategy'], alpha=0.05)
print("\nTukey's HSD Test Results:")
print(tukey)

ANOVA Table:
                          sum_sq   df          F    PR(>F)
C(Strategy)            72.166667  2.0  36.083333  0.000452
C(Market)              85.333333  1.0  85.333333  0.000091
C(Strategy):C(Market)   8.166667  2.0   4.083333  0.075972
Residual                6.000000  6.0        NaN       NaN

Tukey's HSD Test Results:
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B     2.75 0.4989 -3.8143  9.3143  False
     A      C      6.0 0.0726 -0.5643 12.5643  False
     B      C     3.25 0.3894 -3.3143  9.8143  False
----------------------------------------------------


### **Binomial Test**
The Binomial test, also known as the Binomial exact test, is a statistical test used to determine whether the proportion of a binary outcome in a sample is significantly different from a hypothesized value. It is particularly useful when the sample size is small, and the data follows a binomial distribution, which is a distribution where there are only two possible outcomes for each trial, often labeled as "success" and "failure".

The Null Hypothesis typically states that the population proportion of one outcome is equal to a specific hypothesized value, while the Alternative Hypothesis suggests that the population proportion is different from the hypothesized value.

The binomial test does not produce a test statistic like many other tests; instead, the p-value is calculated directly from the binomial distribution.

It can be used to analyze experimental results, such as the effectiveness of a new feature on a website, the success rate of a marketing campaign, or the reliability of a manufacturing process.

Example: A website has a conversion rate of 5%. After introducing a new feature, out of 100 visitors, 8 converted. Is the new conversion rate significantly different from the old rate?

In [33]:
from scipy.stats import binomtest

# Number of successes (conversions)
k = 8
# Number of trials (visitors)
n = 100
# Hypothesized probability of success (old conversion rate)
p = 0.05

# Perform the binomial test (binomtest returns a result object, not a bare p-value)
result = binomtest(k, n, p, alternative='two-sided')

print(f"The p-value of the binomial test is: {result.pvalue}")

The p-value of the binomial test is: 0.16504168794773402
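
To illustrate how the p-value comes "directly from the binomial distribution", the sketch below sums the probabilities of every outcome that is no more likely than the observed 8 conversions under the null hypothesis. The small relative tolerance is an assumption added to guard against floating-point ties.

```python
import numpy as np
from scipy.stats import binom, binomtest

k, n, p = 8, 100, 0.05

# Probability of every possible number of conversions (0..n) under H0
pmf = binom.pmf(np.arange(n + 1), n, p)

# Two-sided p-value: total probability of outcomes no more likely than k
p_manual = pmf[pmf <= pmf[k] * (1 + 1e-7)].sum()

print(f"manual p-value:    {p_manual:.4f}")
print(f"binomtest p-value: {binomtest(k, n, p).pvalue:.4f}")
```

Since the p-value is above 0.05, the data do not provide sufficient evidence that the conversion rate has changed.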


## How to choose the Hypothesis Test
**Frequently used hypothesis tests:**

#### **1. Z-Test**
* Use When: Comparing the mean of a sample to a known population mean when the population variance is known and the sample size is large (n > 30).
* Example: Testing if the average height of a sample of students is different from the known average height of the population.


---


#### **2. T-Test**
* One-Sample T-Test: Compare the sample mean to a known value.
* Two-Sample T-Test: Compare the means of two independent samples.
* Paired T-Test: Compare means from the same group at different times.
* Use When: The population variance is unknown or the sample size is small (n < 30).
* Example: Testing if the average test scores of two different classes are significantly different.
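
The paired t-test is the only variant listed above without an example elsewhere in this notebook, so here is a minimal sketch using `scipy.stats.ttest_rel`. The before and after scores are hypothetical values made up for illustration.

```python
from scipy.stats import ttest_rel

# Hypothetical test scores for the same six students before and after a course
before = [62, 70, 58, 75, 66, 80]
after = [68, 74, 63, 79, 65, 86]

# Paired test: works on the per-student differences (after - before)
t_stat, p_value = ttest_rel(after, before)

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
```

Unlike `ttest_ind`, the two lists must have the same length and the same ordering of subjects.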


---


#### **3. ANOVA (Analysis of Variance)**
* **One-Way ANOVA**: Compare means of three or more independent groups.
* **Two-Way ANOVA**: Compare means with two independent variables.
* Use When: Comparing the means of three or more groups to see if at least one group mean is different.
* Example: Testing if different teaching methods result in different student performance.


---


#### **4. Chi-Square Test**
* Chi-Square Goodness of Fit Test: Determine if a sample matches a population.
* Chi-Square Test of Independence: Determine if two categorical variables are independent.
* Use When: Dealing with categorical data to test relationships between variables.
* Example: Testing if there is an association between gender and voting preference.
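
The goodness of fit variant can be sketched with `scipy.stats.chisquare`. The die-roll counts below are hypothetical, and the example assumes the hypothesized distribution is uniform, which is the function's default when no expected frequencies are passed.

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a die (faces 1 to 6)
observed = [18, 22, 16, 25, 19, 20]

# With no f_exp argument, chisquare assumes all categories are equally likely
stat, p_value = chisquare(observed)

print(f"chi2 = {stat:.3f}, p-value = {p_value:.3f}")
```

Here each face is expected 20 times, and the large p-value means there is no evidence that the die is unfair.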

#### **Other Tests:**

**1. Mann-Whitney U Test:**
* Use When: Comparing differences between two independent groups when the data is not normally distributed.
* Example: Testing if the distribution of scores differs between two different teaching methods.
---
**2. Wilcoxon Signed-Rank Test:**
* Use When: Comparing two related groups (e.g., before and after a treatment) when the data is not normally distributed.
* Example: Testing if there is a difference in performance before and after a training program.
---
**3. Kruskal-Wallis H Test:**
* Use When: Comparing more than two independent groups when the data is not normally distributed.
* Example: Testing if there are differences in customer satisfaction scores across multiple stores.
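
The three non-parametric tests above are all available in `scipy.stats`. The sketch below runs each of them once; every data value is hypothetical and chosen only for illustration.

```python
from scipy.stats import kruskal, mannwhitneyu, wilcoxon

# Hypothetical satisfaction scores from three stores
store_a = [7, 8, 6, 9, 7, 8]
store_b = [5, 6, 5, 7, 6, 4]
store_c = [8, 9, 9, 7, 10, 8]

# Mann-Whitney U: two independent groups
u_stat, p_mw = mannwhitneyu(store_a, store_b, alternative="two-sided")

# Wilcoxon signed-rank: two related measurements on the same subjects
before = [72, 65, 80, 58, 69, 75, 61, 83]
after = [75, 70, 86, 65, 77, 84, 71, 94]
w_stat, p_w = wilcoxon(before, after)

# Kruskal-Wallis H: three or more independent groups
h_stat, p_kw = kruskal(store_a, store_b, store_c)

print(f"Mann-Whitney U: p = {p_mw:.4f}")
print(f"Wilcoxon signed-rank: p = {p_w:.4f}")
print(f"Kruskal-Wallis H: p = {p_kw:.4f}")
```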
---
**4. Shapiro-Wilk Test or D'Agostino's K^2 Test or Anderson-Darling Test:**
* Use When: Checking whether the data follows a normal distribution, for example before running a test that assumes normality such as a t-test or ANOVA.
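
As a small sketch of a normality check, the Shapiro-Wilk test is available as `scipy.stats.shapiro`. The two samples are simulated with an assumed seed and assumed distribution parameters, one normal and one clearly skewed.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

normal_sample = rng.normal(loc=50, scale=5, size=200)
skewed_sample = rng.exponential(scale=5, size=200)

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # H0 for Shapiro-Wilk: the data come from a normal distribution
    stat, p_value = shapiro(sample)
    verdict = "reject normality" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W = {stat:.3f}, p = {p_value:.4f} -> {verdict}")
```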
---
**5. Augmented Dickey-Fuller Test or Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test:**
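* Use When: Checking whether a time series is stationary. Note that the two tests have opposite null hypotheses: the ADF test assumes the series is non-stationary (has a unit root), while the KPSS test assumes it is stationary.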


#### Important Note
Beyond choosing a hypothesis test, it is important to understand whether the data you have meets the assumptions of the test you want to run. Each hypothesis test has its own set of assumptions. However, there is one assumption that all hypothesis tests share: the data was randomly sampled from the population of interest.

This is important because random sampling ensures that the sample is representative of the population in terms of observed (and unobserved) characteristics. Unfortunately, there may be situations where random sampling is impossible, but it is important to understand how this can bias results of a test.

For example, consider a yogurt company called “The Dairy Culture”. Let’s say the company has multiple factories, but the quality assurance team only collects yogurts from one specific factory. The data is thus not randomly sampled from the entire population that we care about (all factories), and could be biased if the quality of yogurt differs at each one.

There can also be ethical issues that arise when a sample is not representative of a population. When developing and testing a vaccine, for example, researchers must make sure to find volunteers from an appropriate proportion of genders, races, age ranges, pre-existing conditions, and so on to test efficacy for the entire population that the vaccine will be used on. If the vaccine manufacturers test on a sample that doesn’t include sufficient data for one race, there is a risk that efficacy for that group is reduced or simply unknown.

It can often be challenging to find a representative sample or even to recognize when there is biased data, but it is essential to think about when designing an experiment.