
# Khipus.ai

## Applied Statistics with Python

### Inferential Statistics

<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>

# Assignment 2



## Instructions
1. Complete each task below by filling in the appropriate code in the provided cells.
2. Ensure that your code runs without errors and produces the expected outputs.
3. Submit your completed notebook as instructed by your instructor.

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib for making easy and beautiful data visualizations.
import matplotlib.pyplot as plt
from scipy import stats # This module contains a large number of probability distributions as well as a growing library of statistical functions.

In [None]:
# Load the sales_data.csv file

sales_data = pd.read_csv('sales_data.csv')

# Display the first few rows of the dataset
sales_data.head()



## Task 1: Simulating Probability distributions

1. Simulate a Poisson distribution with lambda (lam) parameter 3 and size 10000

In [None]:
# Task 1
# Simulate Poisson distribution with lambda (lam) parameter 3 and size 10000


poisson_samples = np.random.poisson(lam=3, size=10000)
#<your code goes here>
# Plot histogram
plt.figure(figsize=(10, 5))
sns.histplot(poisson_samples, kde=False, bins=30)
plt.title('Poisson Distribution (λ=3)')
plt.xlabel('Number of Events')
plt.ylabel('Frequency')
plt.show()

2. Simulate an exponential distribution with scale parameter 1 and size 1000


In [None]:
# Simulate exponential distribution with scale parameter 1 and size 1000

exponential_samples = np.random.exponential(scale=1, size=1000)
#<your code goes here>
# Plot histogram
plt.figure(figsize=(10, 5))
sns.histplot(exponential_samples, kde=True, bins=30)
plt.title('Exponential Distribution (λ=1)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
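
As a quick sanity check on the two simulations: a Poisson distribution has mean and variance both equal to λ, and an exponential distribution with scale β has mean β and variance β². The sketch below compares sample moments with those values. It assumes a seeded `np.random.default_rng` generator rather than the legacy `np.random` functions used in the cells above, and the names `poisson_check` and `exponential_check` are illustrative. Both simulations use 10,000 draws here so that the estimates are tight.

```python
import numpy as np

# Assumption: a seeded Generator is used for reproducibility;
# the notebook cells above call the legacy np.random functions instead.
rng = np.random.default_rng(2024)

lam = 3    # Poisson rate, as in Task 1
scale = 1  # exponential scale (equal to the mean), as in Task 1

poisson_check = rng.poisson(lam=lam, size=10_000)
exponential_check = rng.exponential(scale=scale, size=10_000)

# Poisson: theoretical mean = variance = lam
print(f"Poisson     mean: {poisson_check.mean():.3f}  variance: {poisson_check.var():.3f}")

# Exponential: theoretical mean = scale, variance = scale**2
print(f"Exponential mean: {exponential_check.mean():.3f}  variance: {exponential_check.var():.3f}")
```

The printed values should land close to 3 and 3 for the Poisson samples and close to 1 and 1 for the exponential samples; the exact figures depend on the seed and the NumPy version.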


## Task 2: Central Limit Theorem (CLT)
### Sampling Distribution Generation


Objective:

The objective of this task is to understand the sampling distribution of the mean. By drawing many samples from the population and calculating their means, you will observe how the sample means are distributed. According to the Central Limit Theorem, this distribution should be approximately normal.

1. Load the data from the CSV file (coffee_quality.csv)

In [None]:
# Load the data from the CSV file
coffee_quality = pd.read_csv('coffee_quality.csv')
# Display the first few rows of the data
coffee_quality.head()

Note: The `aroma` column in the `coffee_quality` dataset is the aroma quality score of each coffee sample. The score evaluates the fragrance and smell of the coffee, an important attribute in determining its overall quality. Scores are usually assigned by coffee experts or through sensory evaluation and range from 0 to 10, with higher scores indicating better aroma.

In [None]:
coffee_quality['aroma']

2. Extract Population Data:

Extract the aroma column from the coffee_quality DataFrame.
Drop any missing values and convert the data to a NumPy array called population.

3. Set Parameters:

Define the sample size as 50.
Define the number of samples to draw as 1000.

4. Set Random Seed:

Set the random seed to 2024 using np.random.seed(2024) to ensure reproducibility of the random sampling.

5. Simulate Sampling Distribution:

Initialize an empty list sample_means to store the means of the samples.
Loop 1000 times (as specified by num_samples):
In each iteration, draw a random sample of size 50 from the population using np.random.choice.
Calculate the mean of the sample and append it to the sample_means list.

6. Plot the Sampling Distribution:

Create a histogram of the sample_means list with 30 bins.

Set the figure size to 10x5 inches.

Add a title to the plot: "Sampling Distribution of the Mean (Sample Size = 50)".

Label the x-axis as "Sample Mean" and the y-axis as "Frequency".

Rotate the x-axis tick labels by 45 degrees for better readability.

Display the plot using plt.show().

In [None]:

# Task 2
# Think of our coffee_quality data as a population to draw from
population = coffee_quality['aroma'].dropna().values

# <your code goes here>
# Parameters
sample_size = 50
num_samples = 1000

# Set random seed
np.random.seed(2024)

# Simulate sampling distribution of the mean
sample_means = []
for _ in range(num_samples):
    sample = np.random.choice(population, sample_size)
    sample_means.append(np.mean(sample))

# Plot the sampling distribution of the sample means
plt.figure(figsize=(10, 5))
plt.hist(sample_means, bins=30, edgecolor='k', alpha=0.7)
plt.title('Sampling Distribution of the Mean (Sample Size = 50)')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
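
The Central Limit Theorem also predicts the centre and spread of the histogram: the sample means should centre on the population mean, with a standard deviation (the standard error) of roughly σ/√n. The sketch below checks this numerically. Since `coffee_quality.csv` is not available here, it uses a hypothetical right-skewed population (`synthetic_population`) as a stand-in, and a seeded `np.random.default_rng` generator instead of `np.random.seed`. Both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical stand-in population: right-skewed exponential values,
# NOT the coffee aroma scores.
synthetic_population = rng.exponential(scale=2.0, size=100_000)

sample_size = 50
num_samples = 1000

# Draw all samples at once: each row is one sample of size 50,
# drawn with replacement (the default for choice).
samples = rng.choice(synthetic_population, size=(num_samples, sample_size))
synthetic_sample_means = samples.mean(axis=1)

# CLT predictions: centre = population mean, spread = sigma / sqrt(n)
predicted_se = synthetic_population.std() / np.sqrt(sample_size)
observed_se = synthetic_sample_means.std(ddof=1)

print(f"Population mean:      {synthetic_population.mean():.4f}")
print(f"Mean of sample means: {synthetic_sample_means.mean():.4f}")
print(f"Predicted SE:         {predicted_se:.4f}")
print(f"Observed SE:          {observed_se:.4f}")
```

Even though the stand-in population is strongly skewed, the observed standard error should sit close to the predicted one. The same comparison can be run on `population` and `sample_means` from the cell above.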





## Task 3: Hypothesis Testing

Hypothesis testing is a fundamental concept in statistics that allows researchers to draw conclusions about a population from sample data. It begins with a null hypothesis, which states that there is no effect or difference, and an alternative hypothesis, which states otherwise. Researchers choose a significance level in advance and reject the null hypothesis when the p-value computed from the sample falls below it.

Null Hypothesis (H₀): The population mean flavor score is equal to 7.8.

Alternative Hypothesis (H₁): The population mean flavor score is less than 7.8.

Note: The `flavor` column in the `coffee_quality` dataset is the flavor quality score of each coffee sample. The score evaluates the taste characteristics of the coffee, including its richness, balance, and complexity. Scores are typically assigned by coffee experts or through sensory evaluation and range from 0 to 10, with higher scores indicating better flavor.

In [None]:
coffee_quality['flavor']

1. Define Hypothesized Mean:

Set the hypothesized population mean for the flavor column to 7.8 and store it in the variable flavor_mean.

2. Perform One-Sample t-Test:

Use the ttest_1samp function from the scipy.stats module to perform a one-sample t-test.

The test will compare the sample mean of the flavor column in the coffee_quality DataFrame to the hypothesized mean (7.8).

Drop any missing values from the flavor column using .dropna().

Set the alternative parameter to 'less' to test the alternative hypothesis that the population mean is less than the hypothesized mean.

3. Store Test Results:

Store the t-statistic and p-value returned by the ttest_1samp function in the variables t_stat and p_value, respectively.

4. Print Results:

Print the t-statistic and p-value using formatted strings.

Expected output:

T-statistic: -2.8437004395767462

P-value: 0.0024543417348696964

In [None]:

# Task 3
# One-sample t-test checking for evidence that mu flavor < 7.8
#<your code goes here>

flavor_mean = 7.8
t_stat, p_value = stats.ttest_1samp(
    coffee_quality['flavor'].dropna(),
    flavor_mean,
    alternative='less')

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


Analysis:
The t-statistic is negative because the sample mean is below the hypothesized mean of 7.8.

The p-value (0.0025) is well below the common significance level of 0.05: if the population mean were really 7.8, a sample mean this low or lower would occur only about 0.25% of the time.

Conclusion:

Since the p-value is below 0.05, we reject the null hypothesis. The data provide strong evidence that the population mean flavor score is less than 7.8.
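
To see where the t-statistic and p-value come from, the one-sample test can be reproduced by hand: t = (x̄ − μ₀) / (s / √n), and the left-tailed p-value is the probability of a value at or below t under a t distribution with n − 1 degrees of freedom. The sketch below does this on hypothetical scores (`synthetic_scores`, generated here as a stand-in for the `flavor` column, which is an assumption) and compares the result with `stats.ttest_1samp`. It assumes a SciPy version in which `ttest_1samp` accepts the `alternative` argument, as the cell above already does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Hypothetical scores standing in for coffee_quality['flavor']
synthetic_scores = rng.normal(loc=7.7, scale=0.4, size=200)
hypothesized_mean = 7.8

n = synthetic_scores.size
sample_mean = synthetic_scores.mean()
sample_sd = synthetic_scores.std(ddof=1)  # sample standard deviation

# t = (x_bar - mu_0) / (s / sqrt(n))
t_manual = (sample_mean - hypothesized_mean) / (sample_sd / np.sqrt(n))

# Left-tailed p-value: P(T <= t) with n - 1 degrees of freedom
p_manual = stats.t.cdf(t_manual, df=n - 1)

# Same test via SciPy, for comparison
t_scipy, p_scipy = stats.ttest_1samp(
    synthetic_scores, hypothesized_mean, alternative='less')

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

The two lines of output should agree up to floating-point rounding, which confirms that `alternative='less'` corresponds to the left tail of the t distribution.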