Q1.Generate a list of 100 integers containing values between 90 to 130 and store it in the variable `int_list`.
After generating the list, find the following:

  

  (i) Write a Python function to calculate the mean of a given list of numbers.

Create a function to find the median of a list of numbers.

  

  (ii) Develop a program to compute the mode of a list of integers.

  

  (iii) Implement a function to calculate the weighted mean of a list of values and their corresponding weights.

  

  (iv) Write a Python function to find the geometric mean of a list of positive numbers.

  

  (v) Create a program to calculate the harmonic mean of a list of values.

  

  (vi) Build a function to determine the midrange of a list of numbers (average of the minimum and maximum).

  

  (vii) Implement a Python program to find the trimmed mean of a list, excluding a certain percentage of
outliers.

Ans1:To address the requirements, we will break down the task into several steps. First, we will generate a list of 100 integers between 90 and 130. Then, we will implement various statistical functions as specified.

Step 1: Generate a List of Integers
We can use Python’s random module to generate a list of integers within the specified range.

In [None]:
import random

# Generate a list of 100 integers between 90 and 130
int_list = [random.randint(90, 130) for _ in range(100)]

Step 2: Calculate the Mean
The mean (average) is calculated by summing all elements in the list and dividing by the number of elements.

In [None]:
def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

Step 3: Calculate the Median
The median is found by sorting the list and selecting the middle value. If there is an even number of observations, it is the average of the two middle numbers.

In [None]:
def calculate_median(numbers):
    sorted_numbers = sorted(numbers)
    n = len(sorted_numbers)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_numbers[mid - 1] + sorted_numbers[mid]) / 2
    else:
        return sorted_numbers[mid]

Step 4: Compute the Mode
The mode is the number that appears most frequently in a dataset. We can use Python’s collections.Counter for this purpose.

In [None]:
from collections import Counter

def calculate_mode(numbers):
    count = Counter(numbers)
    max_count = max(count.values())
    modes = [num for num, freq in count.items() if freq == max_count]
    return modes


Step 5: Calculate Weighted Mean
The weighted mean takes into account weights assigned to each value.

In [None]:
def calculate_weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

Step 6: Calculate Geometric Mean
The geometric mean is calculated as the nth root of the product of n numbers. For positive numbers:

In [None]:
import math

def calculate_geometric_mean(numbers):
    product = math.prod(numbers)
    return product ** (1 / len(numbers))

Step 7: Calculate Harmonic Mean
The harmonic mean is defined as the reciprocal of the average of reciprocals:

In [None]:
def calculate_harmonic_mean(numbers):
    return len(numbers) / sum(1 / x for x in numbers)

Step 8: Determine Midrange
Midrange is calculated as the average of the maximum and minimum values in a list.

In [None]:
def calculate_midrange(numbers):
    return (max(numbers) + min(numbers)) / 2

Step 9: Calculate Trimmed Mean
A trimmed mean excludes a certain percentage of outliers from both ends before calculating the mean.

In [None]:
def calculate_trimmed_mean(numbers, trim_percent):
    sorted_numbers = sorted(numbers)
    trim_amount = int(len(sorted_numbers) * trim_percent / 100)
    trimmed_list = sorted_numbers[trim_amount:-trim_amount]
    
    return sum(trimmed_list) / len(trimmed_list) if trimmed_list else None

Summary Code Implementation
Here’s how you would put it all together:

In [None]:
import random
import math
from collections import Counter

# Generate a list of integers between 90 and 130.
int_list = [random.randint(90, 130) for _ in range(100)]

# Functions defined above...
mean_value = calculate_mean(int_list)
median_value = calculate_median(int_list)
mode_value = calculate_mode(int_list)

# Example usage for weighted mean with dummy weights.
weights_example = [1] * len(int_list) # Equal weights for demonstration.
weighted_mean_value = calculate_weighted_mean(int_list, weights_example)

geometric_mean_value = calculate_geometric_mean(int_list)
harmonic_mean_value = calculate_harmonic_mean(int_list)
midrange_value = calculate_midrange(int_list)

# Example usage for trimmed mean with trimming top/bottom 10%.
trimmed_mean_value_10_percent = calculate_trimmed_mean(int_list, trim_percent=10)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Weighted Mean: {weighted_mean_value}")
print(f"Geometric Mean: {geometric_mean_value}")
print(f"Harmonic Mean: {harmonic_mean_value}")
print(f"Midrange: {midrange_value}")
print(f"Trimmed Mean (10%): {trimmed_mean_value_10_percent}")

This code generates a list of integers between specified limits and computes various statistical measures according to your request.


2. Generate a list of 500 integers containing values between 200 to 300 and store it in the variable `int_list2`.
After generating the list, find the following:


  (i) Compare the given list of visualization for the given data:

    

    1. Frequency & Gaussian distribution

    2. Frequency smoothened KDE plot

    3. Gaussian distribution & smoothened KDE plot


  (ii) Write a Python function to calculate the range of a given list of numbers.


  (iii) Create a program to find the variance and standard deviation of a list of numbers.


  (iv) Implement a function to compute the interquartile range (IQR) of a list of values.


  (v) Build a program to calculate the coefficient of variation for a dataset.

  

  (vi) Write a Python function to find the mean absolute deviation (MAD) of a list of numbers.


  (vii) Create a program to calculate the quartile deviation of a list of values.

  

  (viii) Implement a function to find the range-based coefficient of dispersion for a dataset.

  Ans2: To address the requirements, we will break down the task into several steps. First, we will generate a list of 500 integers between 200 and 300. Then, we will implement various statistical functions as specified. Additionally, we will visualize the data using frequency distributions and Kernel Density Estimation (KDE) plots.

Step 1: Generate a List of Integers
We can use Python’s random module to generate a list of integers within the specified range.

In [None]:
import random

# Generate a list of 500 integers between 200 and 300
int_list2 = [random.randint(200, 300) for _ in range(500)]

Step 2: Visualizations
For visualizations, we will use libraries such as matplotlib and seaborn. We will create:

1.A histogram to show frequency distribution.
2.A Gaussian distribution overlay.
3.A Kernel Density Estimate (KDE) plot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm

# Frequency distribution
plt.figure(figsize=(12, 6))
sns.histplot(int_list2, bins=30, kde=False, stat='density', color='blue', label='Frequency')
plt.title('Frequency Distribution')
plt.xlabel('Value')
plt.ylabel('Density')

# Gaussian distribution overlay
mu, std = norm.fit(int_list2)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2, label='Gaussian Distribution')
plt.legend()
plt.show()

# KDE plot
plt.figure(figsize=(12, 6))
sns.kdeplot(int_list2, bw_adjust=0.5)
plt.title('Smoothened KDE Plot')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

# Combined Gaussian and KDE plot
plt.figure(figsize=(12, 6))
sns.histplot(int_list2, bins=30, kde=True)
plt.plot(x, p, 'k', linewidth=2)
plt.title('Gaussian Distribution & Smoothened KDE Plot')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend(['Gaussian Distribution', 'KDE'])
plt.show()

Step 3: Calculate the Range
The range is calculated by subtracting the minimum value from the maximum value in the list.

In [None]:
def calculate_range(numbers):
    return max(numbers) - min(numbers)

Step 4: Variance and Standard Deviation
Variance measures how far each number in the set is from the mean and thus from every other number in the set. The standard deviation is simply the square root of variance.

In [None]:
def calculate_variance_and_std(numbers):
    mean_value = sum(numbers) / len(numbers)
    variance = sum((x - mean_value) ** 2 for x in numbers) / len(numbers)
    std_deviation = variance ** 0.5
    return variance, std_deviation

Step 5: Interquartile Range (IQR)
The IQR is calculated by finding the difference between the first quartile (Q1) and third quartile (Q3).

In [None]:
def calculate_iqr(numbers):
    sorted_numbers = sorted(numbers)
    n = len(sorted_numbers)
    Q1 = sorted_numbers[n // 4]
    Q3 = sorted_numbers[3 * n // 4]
    return Q3 - Q1

Step 6: Coefficient of Variation (CV)
The coefficient of variation is a measure of relative variability calculated as the ratio of standard deviation to mean.

In [None]:
def calculate_coefficient_of_variation(numbers):
    mean_value = sum(numbers) / len(numbers)
    variance = sum((x - mean_value) ** 2 for x in numbers) / len(numbers)
    std_deviation = variance ** 0.5
    return std_deviation / mean_value if mean_value != 0 else None 

Step 7: Mean Absolute Deviation (MAD)
MAD measures how much values deviate from their average without considering direction.

In [None]:
def calculate_mad(numbers):
    mean_value = sum(numbers) / len(numbers)
    return sum(abs(x - mean_value) for x in numbers) / len(numbers)

Step 8: Quartile Deviation
Quartile deviation is half of the interquartile range.

In [None]:
def calculate_quartile_deviation(numbers):
    iqr_value = calculate_iqr(numbers)
    return iqr_value / 2 

Step 9: Range-based Coefficient of Dispersion
This coefficient measures relative dispersion based on range.

In [None]:
def calculate_range_based_coefficient_of_dispersion(numbers):
    return calculate_range(numbers) / (sum(numbers) / len(numbers)) if len(numbers) > 0 else None 

Summary Code Implementation
Here’s how you would put it all together:

In [None]:
import random

# Generate a list of integers between 200 and 300.
int_list2 = [random.randint(200, 300) for _ in range(500)]

# Calculate statistics using defined functions.
range_value = calculate_range(int_list2)

variance_value, std_deviation_value = calculate_variance_and_std(int_list2)

iqr_value = calculate_iqr(int_list2)

cv_value = calculate_coefficient_of_variation(int_list2)

mad_value = calculate_mad(int_list2)

quartile_deviation_value = calculate_quartile_deviation(int_list2)

range_based_cd_value = calculate_range_based_coefficient_of_dispersion(int_list2)

print(f"Range: {range_value}")
print(f"Variance: {variance_value}, Standard Deviation: {std_deviation_value}")
print(f"IQR: {iqr_value}")
print(f"Coefficient of Variation: {cv_value}")
print(f"Mean Absolute Deviation: {mad_value}")
print(f"Quartile Deviation: {quartile_deviation_value}")
print(f"Range-based Coefficient of Dispersion: {range_based_cd_value}")

This code generates a list of integers between specified limits and computes various statistical measures according to your request while also providing visualizations for better understanding.

Q3.Write a Python class representing a discrete random variable with methods to calculate its expected value and variance.

Ans3:The class has methods to calculate the expected value (mean) and variance of the random variable.

Explanation:
1.Discrete Random Variable: In probability theory, a discrete random variable takes on a countable number of values. Each value has an associated probability. The expected value (mean) of a discrete random variable is the sum of each possible value multiplied by its probability. The variance measures how far the values are spread from the expected value.

2.Class Design:

Initialization: The class will take a dictionary where keys are possible outcomes and values are their associated probabilities.
Methods:
Expected Value: This is calculated as the sum of each value multiplied by its probability.
Variance: The variance is calculated as the sum of the squared differences between each value and the expected value, weighted by the probability.
Python Class:

In [None]:
class DiscreteRandomVariable:
    def __init__(self, values_and_probabilities):
        """
        Initialize the DiscreteRandomVariable with a dictionary of values and their corresponding probabilities.
        :param values_and_probabilities: Dictionary where keys are values and values are probabilities.
        """
        if not values_and_probabilities:
            raise ValueError("Input must be a non-empty dictionary.")
        
        if sum(values_and_probabilities.values()) != 1:
            raise ValueError("Probabilities must sum to 1.")
        
        self.values_and_probabilities = values_and_probabilities

    def expected_value(self):
        """
        Calculate the expected value (mean) of the discrete random variable.
        :return: Expected value
        """
        return sum(value * probability for value, probability in self.values_and_probabilities.items())

    def variance(self):
        """
        Calculate the variance of the discrete random variable.
        :return: Variance
        """
        expected_val = self.expected_value()
        return sum(probability * (value - expected_val) ** 2 for value, probability in self.values_and_probabilities.items())

# Example usage:
values_and_probabilities = {1: 0.2, 2: 0.5, 3: 0.3}  # Example: P(X=1)=0.2, P(X=2)=0.5, P(X=3)=0.3
random_var = DiscreteRandomVariable(values_and_probabilities)

# Calculate expected value and variance
expected_val = random_var.expected_value()
variance_val = random_var.variance()

print(f"Expected Value: {expected_val}")
print(f"Variance: {variance_val}")


How It Works:
1.The class DiscreteRandomVariable is initialized with a dictionary of values and their corresponding probabilities.
2.Expected Value: The method expected_value() computes the sum of each value multiplied by its probability.
3.Variance: The method variance() computes the sum of squared differences between each value and the expected value, weighted by their respective probabilities.

Example:
For the input dictionary {1: 0.2, 2: 0.5, 3: 0.3}, the expected value and variance would be calculated based on the formulas for expected value and variance.

This class can be easily modified or extended to include additional methods for other statistical measures if needed.





Q4.Implement a program to simulate the rolling of a fair six-sided die and calculate the expected value and variance of the outcomes.

Ans4: To simulate the rolling of a fair six-sided die and calculate the expected value and variance of the outcomes, we can break down the problem into the following steps:

1. **Simulate the Rolling of the Die**: Since the die has six faces, we simulate rolling the die by randomly selecting a number from 1 to 6.
2. **Expected Value**: The expected value (mean) of rolling a fair six-sided die is calculated by averaging the outcomes weighted by their probabilities. Since each face has a probability of \( \frac{1}{6} \), the expected value \( E(X) \) is:
   \[
   E(X) = \sum_{i=1}^{6} i \times \frac{1}{6}
   \]
   This simplifies to:
   \[
   E(X) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5
   \]

3. **Variance**: The variance of the die roll is calculated using the formula:
   \[
   \text{Var}(X) = E(X^2) - (E(X))^2
   \]
   Where \( E(X^2) \) is the expected value of the square of the outcome. For a fair six-sided die:
   \[
   E(X^2) = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6}
   \]
   After calculating \( E(X^2) \), we can compute the variance.



In [None]:
import random

def roll_die():
    """Simulate rolling a fair six-sided die."""
    return random.randint(1, 6)

def expected_value():
    """Calculate the expected value of a fair six-sided die."""
    # For a fair die, expected value is the average of the numbers 1 to 6.
    return sum(i for i in range(1, 7)) / 6

def variance(expected_val):
    """Calculate the variance of a fair six-sided die."""
    # First, calculate E(X^2) - (E(X))^2
    E_X_squared = sum(i**2 for i in range(1, 7)) / 6
    return E_X_squared - expected_val**2

# Simulate rolling the die and calculate expected value and variance
expected_val = expected_value()
variance_val = variance(expected_val)

# Output results
print(f"Expected Value: {expected_val}")
print(f"Variance: {variance_val}")


### Explanation:
1. **`roll_die` Function**: This function simulates rolling a fair six-sided die by returning a random integer between 1 and 6.
2. **`expected_value` Function**: This function computes the expected value for a fair die by summing the numbers from 1 to 6 and dividing by 6.
3. **`variance` Function**: This function calculates the variance by first computing \( E(X^2) \) and subtracting the square of the expected value from it.

### Output:
For a fair six-sided die, the **expected value** should be 3.5, and the **variance** should be \( \frac{35}{12} \approx 2.9167 \).



This program calculates the expected value and variance for the outcome of a fair six-sided die roll using basic probability principles. The values are derived mathematically, so no actual die rolls are needed for the calculations.





Q5.Create a Python function to generate random samples from a given probability distribution (e.g.,binomial, Poisson) and calculate their mean and variance.

Ans5:To create a Python function that generates random samples from a given probability distribution (such as binomial or Poisson), and calculates their mean and variance, we can utilize libraries like numpy for generating random samples and calculating statistics.

Here's how we can implement the function:

Code Implementation

In [None]:
import numpy as np

def generate_random_samples(distribution, **params):
    """
    Generates random samples from a specified probability distribution and calculates the mean and variance.

    Parameters:
    - distribution (str): The name of the distribution ('binomial', 'poisson', etc.)
    - **params: The parameters required for the specified distribution.
    
    Returns:
    - mean (float): The mean of the generated samples.
    - variance (float): The variance of the generated samples.
    - samples (array): The array of generated samples.
    """
    
    # Number of samples to generate
    num_samples = params.get('num_samples', 1000)
    
    # Generate samples based on the distribution
    if distribution == 'binomial':
        n = params.get('n')  # Number of trials
        p = params.get('p')  # Probability of success
        samples = np.random.binomial(n, p, num_samples)
    elif distribution == 'poisson':
        lam = params.get('lam')  # Rate (lambda)
        samples = np.random.poisson(lam, num_samples)
    elif distribution == 'normal':
        mu = params.get('mu')  # Mean
        sigma = params.get('sigma')  # Standard deviation
        samples = np.random.normal(mu, sigma, num_samples)
    else:
        raise ValueError(f"Unsupported distribution: {distribution}")

    # Calculate mean and variance
    mean = np.mean(samples)
    variance = np.var(samples)
    
    return mean, variance, samples

# Example usage:
distribution = 'binomial'  # Choose between 'binomial', 'poisson', 'normal', etc.
params = {
    'num_samples': 1000,
    'n': 10,   # For binomial: number of trials
    'p': 0.5   # For binomial: probability of success
}

mean, variance, samples = generate_random_samples(distribution, **params)
print(f"Mean: {mean}")
print(f"Variance: {variance}")


Explanation:
Function Parameters:

distribution: A string that specifies the type of distribution to sample from (e.g., 'binomial', 'poisson', 'normal').
**params: Additional parameters required for the specific distribution (such as n, p, lam, etc.). The function automatically adapts based on the distribution chosen.
Supported Distributions:

Binomial Distribution: Uses np.random.binomial(n, p, size), where n is the number of trials and p is the probability of success.
Poisson Distribution: Uses np.random.poisson(lam, size), where lam is the rate (mean) of occurrence of the event.
Normal Distribution: Uses np.random.normal(mu, sigma, size), where mu is the mean and sigma is the standard deviation.
Statistical Calculations:

The function calculates the mean and variance of the generated samples using np.mean(samples) and np.var(samples) respectively.
Example:
For a binomial distribution with 10 trials and a probability of success of 0.5:

In [None]:
mean, variance, samples = generate_random_samples('binomial', num_samples=1000, n=10, p=0.5)
print(f"Mean: {mean}")
print(f"Variance: {variance}")


This will generate 1000 random samples from a binomial distribution, and then calculate and display their mean and variance.

You can adapt the function for different distributions by specifying the appropriate distribution name and its parameters.

Q6.Write a Python script to generate random numbers from a Gaussian (normal) distribution and computethe mean, variance, and standard deviation of the samples.

Ans6: Here is a Python script to generate random numbers from a Gaussian (normal) distribution and compute the mean, variance, and standard deviation of the samples. The script uses the numpy library for generating random numbers and calculating the statistics:

Python Script

In [None]:
import numpy as np

def generate_normal_samples(mu, sigma, num_samples):
    """
    Generates random samples from a normal (Gaussian) distribution and calculates their mean, variance, and standard deviation.

    Parameters:
    - mu (float): Mean (mu) of the distribution
    - sigma (float): Standard deviation (sigma) of the distribution
    - num_samples (int): Number of random samples to generate

    Returns:
    - mean (float): Mean of the generated samples
    - variance (float): Variance of the generated samples
    - std_deviation (float): Standard deviation of the generated samples
    - samples (array): The array of generated samples
    """
    
    # Generate random samples from the normal distribution
    samples = np.random.normal(mu, sigma, num_samples)
    
    # Calculate mean, variance, and standard deviation
    mean = np.mean(samples)
    variance = np.var(samples)
    std_deviation = np.std(samples)
    
    return mean, variance, std_deviation, samples

# Example usage:
mu = 0          # Mean of the distribution
sigma = 1       # Standard deviation of the distribution
num_samples = 1000  # Number of samples to generate

# Generate samples and compute statistics
mean, variance, std_deviation, samples = generate_normal_samples(mu, sigma, num_samples)

# Print the results
print(f"Generated {num_samples} samples from a normal distribution with mu={mu} and sigma={sigma}.")
print(f"Mean: {mean}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")


Explanation:
Function Parameters:

mu: The mean of the Gaussian distribution.
sigma: The standard deviation of the Gaussian distribution.
num_samples: The number of random samples you want to generate.
np.random.normal(mu, sigma, num_samples):

This function generates num_samples random numbers from a normal distribution with mean mu and standard deviation sigma.
Statistical Calculations:

The mean is calculated using np.mean(samples).
The variance is calculated using np.var(samples).
The standard deviation is calculated using np.std(samples).
Example usage:

This example generates 1000 samples from a normal distribution with mu = 0 and sigma = 1 (standard normal distribution).
It then calculates and prints the mean, variance, and standard deviation of the generated samples.
Example Output:

In [None]:
Generated 1000 samples from a normal distribution with mu=0 and sigma=1.
Mean: 0.023456789
Variance: 0.998876543
Standard Deviation: 0.999438205


The exact values of the mean, variance, and standard deviation will vary slightly every time you run the script because the samples are randomly generated.

Q7.Use seaborn library to load tips dataset. Find the following from the dataset for the columns total_bill
and tip`:

  

  (i) Write a Python function that calculates their skewness.


  (ii) Create a program that determines whether the columns exhibit positive skewness, negative skewness, or is
approximately symmetric.


  (iii) Write a function that calculates the covariance between two columns.


  (iv) Implement a Python program that calculates the Pearson correlation coefficient between two columns.


  (v) Write a script to visualize the correlation between two specific columns in a Pandas DataFrame using
scatter plots.

Ans7: Let's tackle each part of your request step by step using Python and libraries like seaborn, pandas, scipy, and matplotlib.

Step 1: Loading the tips dataset from Seaborn
We'll first load the dataset using Seaborn's load_dataset function.

In [None]:
import seaborn as sns

# Load the tips dataset
tips = sns.load_dataset('tips')

# Display the first few rows of the dataset
print(tips.head())


Part (i) - Calculate the Skewness of total_bill and tip
To calculate the skewness of a column, we can use the skew() function from the scipy.stats library or directly from the pandas DataFrame. Skewness measures the asymmetry of the distribution of a dataset.

Here's a function to calculate the skewness:

In [None]:
import pandas as pd

def calculate_skewness(df, column_name):
    """
    Calculate the skewness of a given column in a DataFrame.

    Parameters:
    - df: The DataFrame containing the data.
    - column_name: The name of the column to calculate skewness for.

    Returns:
    - Skewness of the column.
    """
    return df[column_name].skew()

# Calculate skewness for 'total_bill' and 'tip'
total_bill_skewness = calculate_skewness(tips, 'total_bill')
tip_skewness = calculate_skewness(tips, 'tip')

print(f"Skewness of 'total_bill': {total_bill_skewness}")
print(f"Skewness of 'tip': {tip_skewness}")


Part (ii) - Determine if the Columns Exhibit Positive Skewness, Negative Skewness, or is Symmetric
Now, we'll create a function that checks the skewness and classifies it:

Positive skew if skewness > 0
Negative skew if skewness < 0
Symmetric if skewness is approximately 0 (say between -0.5 and 0.5)

In [None]:
def classify_skewness(skewness):
    """
    Classifies the skewness of the data.

    Parameters:
    - skewness: The skewness value of a dataset.

    Returns:
    - A string classifying the skewness.
    """
    if skewness > 0:
        return "Positive Skewness"
    elif skewness < 0:
        return "Negative Skewness"
    else:
        return "Symmetric"

# Classify skewness for 'total_bill' and 'tip'
total_bill_classification = classify_skewness(total_bill_skewness)
tip_classification = classify_skewness(tip_skewness)

print(f"Skewness of 'total_bill': {total_bill_classification}")
print(f"Skewness of 'tip': {tip_classification}")


Part (iii) - Calculate the Covariance Between total_bill and tip
Covariance is a measure of the relationship between two random variables. We can calculate it using numpy.cov() or the pandas .cov() method.

In [None]:
def calculate_covariance(df, column1, column2):
    """
    Calculate the covariance between two columns in a DataFrame.

    Parameters:
    - df: The DataFrame containing the data.
    - column1: The first column name.
    - column2: The second column name.

    Returns:
    - Covariance value between the two columns.
    """
    return df[[column1, column2]].cov().iloc[0, 1]

# Calculate covariance between 'total_bill' and 'tip'
covariance = calculate_covariance(tips, 'total_bill', 'tip')
print(f"Covariance between 'total_bill' and 'tip': {covariance}")


Part (iv) - Calculate the Pearson Correlation Coefficient Between total_bill and tip
The Pearson correlation coefficient measures the linear correlation between two variables. It can be calculated using pandas's .corr() method.

In [None]:
def calculate_pearson_correlation(df, column1, column2):
    """
    Calculate the Pearson correlation coefficient between two columns.

    Parameters:
    - df: The DataFrame containing the data.
    - column1: The first column name.
    - column2: The second column name.

    Returns:
    - Pearson correlation coefficient between the two columns.
    """
    return df[column1].corr(df[column2])

# Calculate Pearson correlation coefficient between 'total_bill' and 'tip'
pearson_corr = calculate_pearson_correlation(tips, 'total_bill', 'tip')
print(f"Pearson correlation coefficient between 'total_bill' and 'tip': {pearson_corr}")


Part (v) - Visualize the Correlation Between total_bill and tip Using a Scatter Plot
We can use matplotlib and seaborn to visualize the relationship between total_bill and tip using a scatter plot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_scatter(df, column1, column2):
    """
    Create a scatter plot to visualize the correlation between two columns.

    Parameters:
    - df: The DataFrame containing the data.
    - column1: The first column name.
    - column2: The second column name.
    """
    sns.scatterplot(data=df, x=column1, y=column2)
    plt.title(f"Scatter Plot of {column1} vs {column2}")
    plt.xlabel(column1)
    plt.ylabel(column2)
    plt.show()

# Plot scatter plot between 'total_bill' and 'tip'
plot_scatter(tips, 'total_bill', 'tip')


Summary of the Code:
1.Skewness Calculation: We used pandas .skew() method to calculate skewness for total_bill and tip.
2.Skewness Classification: Based on the skewness value, we classified the data as positively skewed, negatively skewed, or symmetric.
3.Covariance Calculation: We calculated covariance between total_bill and tip using pandas .cov() method.
4.Pearson Correlation Coefficient: The correlation was computed using pandas .corr() method, which provides the linear relationship between the two columns.
5.Scatter Plot: A scatter plot was generated using Seaborn's scatterplot function to visually examine the correlation.
Expected Output Example:
Skewness: Displays skewness values for both columns.
Skewness Classification: Tells whether the distribution is positively skewed, negatively skewed, or symmetric.
Covariance: Shows the covariance value between the two columns.
Pearson Correlation: Displays the correlation coefficient (typically between -1 and 1).
Scatter Plot: A plot showing the relationship between total_bill and tip.
Let me know if you'd like any further clarifications!





Q8.Write a Python function to calculate the probability density function (PDF) of a continuous random variable for a given normal distribution.

Ans8: To calculate the Probability Density Function (PDF) of a continuous random variable for a normal distribution, we can use the formula for the normal distribution's PDF:

\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]

Where:
- \( \mu \) is the mean of the distribution.
- \( \sigma \) is the standard deviation.
- \( x \) is the point at which we are evaluating the PDF.
- \( \exp() \) is the exponential function.

To implement this in Python, we can make use of the `numpy` library for mathematical operations, particularly `np.exp()`, and `np.pi` for the constant \( \pi \).

Here's a Python function to calculate the PDF of a normal distribution:

### Python Function



In [None]:
import numpy as np

def normal_pdf(x, mu, sigma):
    """
    Calculate the Probability Density Function (PDF) of a normal distribution at point x.

    Parameters:
    - x (float or array-like): The point(s) at which to evaluate the PDF.
    - mu (float): Mean of the normal distribution.
    - sigma (float): Standard deviation of the normal distribution.

    Returns:
    - pdf (float or array): The PDF evaluated at x.
    """
    # Calculate the PDF using the normal distribution formula
    pdf = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return pdf

# Example usage:
mu = 0        # Mean of the distribution
sigma = 1     # Standard deviation of the distribution
x = 0         # Point at which to evaluate the PDF

pdf_value = normal_pdf(x, mu, sigma)
print(f"PDF at x = {x}: {pdf_value}")



### Explanation:

1. **Parameters:**
   - `x`: The point or an array of points where we want to evaluate the PDF.
   - `mu`: Mean of the normal distribution.
   - `sigma`: Standard deviation of the normal distribution.

2. **PDF Formula:**
   The formula used in the function is the standard formula for the normal distribution’s PDF:
   \[
   f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
   \]
   We use numpy functions to calculate the exponential part (`np.exp()`) and the square root (`np.sqrt()`).

3. **Output:**
   - The function returns the PDF value for the given `x`.

### Example Usage:

For example, if you have a normal distribution with a mean of 0 and standard deviation of 1, and you want to find the PDF at \( x = 0 \):


In [None]:
mu = 0
sigma = 1
x = 0

pdf_value = normal_pdf(x, mu, sigma)
print(f"PDF at x = {x}: {pdf_value}")


Handling Multiple Points (Vectorized Input):
The function can also handle arrays of x values. For example, to evaluate the PDF over a range of values:

In [None]:
x_values = np.linspace(-5, 5, 100)  # Generate 100 points between -5 and 5
pdf_values = normal_pdf(x_values, mu, sigma)

# Print the first 10 values
print(pdf_values[:10])


This will give you the PDF values for a range of x values between -5 and 5.

Conclusion:
This function computes the probability density for a normal distribution at a given point or set of points. It is based on the standard formula for the normal distribution, and can handle both scalar and array inputs for x.

Q9.Create a program to calculate the cumulative distribution function (CDF) of exponential distribution.

Ans9: The **Cumulative Distribution Function (CDF)** of an exponential distribution gives the probability that a random variable \( X \) takes a value less than or equal to a given value \( x \). The formula for the CDF of an exponential distribution with rate parameter \( \lambda \) (which is the inverse of the mean) is:

\[
F(x) = 1 - e^{-\lambda x}
\]

Where:
- \( F(x) \) is the CDF at \( x \).
- \( \lambda \) is the rate parameter, \( \lambda = \frac{1}{\text{mean}} \).
- \( x \) is the point where we want to calculate the CDF.

We can implement this in Python using the `numpy` library to calculate the exponential term.

### Python Program to Calculate the CDF of an Exponential Distribution



In [None]:
import numpy as np

def exponential_cdf(x, lambda_):
    """
    Calculate the Cumulative Distribution Function (CDF) of an exponential distribution.

    Parameters:
    - x (float or array-like): The point(s) at which to evaluate the CDF.
    - lambda_ (float): The rate parameter (1/mean) of the exponential distribution.

    Returns:
    - cdf (float or array): The CDF evaluated at x.
    """
    # Calculate the CDF using the exponential distribution formula
    cdf = 1 - np.exp(-lambda_ * x)
    return cdf

# Example usage:
lambda_ = 1.0   # Rate parameter (lambda)
x = 2.0         # Point at which to evaluate the CDF

cdf_value = exponential_cdf(x, lambda_)
print(f"CDF at x = {x}: {cdf_value}")




### Explanation:
1. **Parameters:**
   - `x`: The value or array of values where the CDF will be evaluated.
   - `lambda_`: The rate parameter \( \lambda \), which is the inverse of the mean of the distribution. 

2. **CDF Formula:**
   The CDF of an exponential distribution is calculated using the formula:
   \[
   F(x) = 1 - e^{-\lambda x}
   \]
   The function calculates this using `np.exp()` for the exponential term.

3. **Output:**
   - The function returns the CDF value for the given `x`.

### Example Usage:

For an exponential distribution with rate \( \lambda = 1.0 \), and \( x = 2.0 \), the CDF can be calculated as:


In [None]:
lambda_ = 1.0   # Rate parameter (lambda)
x = 2.0         # Point at which to evaluate the CDF

cdf_value = exponential_cdf(x, lambda_)
print(f"CDF at x = {x}: {cdf_value}")


This means that the probability that the random variable 
𝑋
X takes a value less than or equal to 2.0 is approximately 0.865.

Handling Multiple Points (Vectorized Input):
The function can also handle arrays of x values. For example, to evaluate the CDF over a range of values:

In [None]:
x_values = np.linspace(0, 5, 100)  # Generate 100 points between 0 and 5
cdf_values = exponential_cdf(x_values, lambda_)

# Print the first 10 CDF values
print(cdf_values[:10])


This will give you the CDF values for a range of x values from 0 to 5.

Visualizing the CDF:
To visualize the CDF of the exponential distribution, you can use matplotlib to plot the function:

In [None]:
import matplotlib.pyplot as plt

# Generate a range of x values from 0 to 5
x_values = np.linspace(0, 5, 100)
cdf_values = exponential_cdf(x_values, lambda_)

# Plot the CDF
plt.plot(x_values, cdf_values, label="CDF of Exponential Distribution")
plt.title("Cumulative Distribution Function (CDF) of Exponential Distribution")
plt.xlabel("x")
plt.ylabel("CDF")
plt.grid(True)
plt.legend()
plt.show()


This will display a plot of the CDF of the exponential distribution.

Conclusion:
This Python program calculates the CDF of an exponential distribution and can handle both scalar and array inputs for 
𝑥
x. The code also provides an option to visualize the CDF with a plot. The CDF gives the probability that the random variable takes a value less than or equal to a given point.

Q10.write a Python function to calculate the probability mass function (PMF) of Poisson distribution.\

Ans10: The **Probability Mass Function (PMF)** of a Poisson distribution gives the probability that a discrete random variable \( X \) takes a particular value \( x \), given a mean rate \( \lambda \) (also called the rate parameter). The formula for the PMF of a Poisson distribution is:

\[
P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}
\]

Where:
- \( x \) is the value of the random variable.
- \( \lambda \) is the mean or rate parameter of the distribution.
- \( e \) is Euler's number (approximately 2.71828).
- \( x! \) is the factorial of \( x \).

To implement this in Python, we can use the `math.factorial()` function to calculate the factorial, and `numpy` for the exponential function.

### Python Function to Calculate the PMF of a Poisson Distribution



In [None]:
import numpy as np
import math

def poisson_pmf(x, lambda_):
    """
    Calculate the Probability Mass Function (PMF) of a Poisson distribution at a specific point x.

    Parameters:
    - x (int): The point at which to evaluate the PMF (must be a non-negative integer).
    - lambda_ (float): The rate parameter (mean) of the Poisson distribution.

    Returns:
    - pmf (float): The PMF evaluated at x.
    """
    if x < 0 or not isinstance(x, int):
        raise ValueError("x must be a non-negative integer.")
    
    # Calculate the Poisson PMF using the formula
    pmf = (lambda_ ** x * np.exp(-lambda_)) / math.factorial(x)
    return pmf

# Example usage:
lambda_ = 3.0   # Rate parameter (mean)
x = 2           # Value at which to evaluate the PMF

pmf_value = poisson_pmf(x, lambda_)
print(f"PMF at x = {x}: {pmf_value}")



### Explanation:

1. **Parameters:**
   - `x`: The value at which you want to calculate the PMF. It must be a non-negative integer.
   - `lambda_`: The rate parameter \( \lambda \) of the Poisson distribution.

2. **PMF Formula:**
   The PMF of a Poisson distribution is given by:
   \[
   P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}
   \]
   - `lambda_ ** x` calculates \( \lambda^x \).
   - `np.exp(-lambda_)` computes \( e^{-\lambda} \).
   - `math.factorial(x)` computes \( x! \) (factorial of \( x \)).

3. **Return:**
   The function returns the calculated PMF value.

### Example Usage:

For example, let's calculate the PMF of a Poisson distribution with rate \( \lambda = 3 \) at \( x = 2 \):


In [None]:
lambda_ = 3.0   # Rate parameter (mean)
x = 2           # Value at which to evaluate the PMF

pmf_value = poisson_pmf(x, lambda_)
print(f"PMF at x = {x}: {pmf_value}")


This means that the probability of observing exactly 2 events in a Poisson distribution with a rate of 3 events is approximately 0.224.

Handling Multiple Points (Vectorized Input):
If you want to calculate the PMF for a range of values, you can easily extend the function to handle lists or arrays of 
𝑥
x values

In [None]:
def poisson_pmf_array(x_values, lambda_):
    """
    Calculate the PMF for a list or array of values.

    Parameters:
    - x_values (list or array): The points at which to evaluate the PMF.
    - lambda_ (float): The rate parameter (mean) of the Poisson distribution.

    Returns:
    - pmf_values (array): The PMF values evaluated at each point in x_values.
    """
    return [poisson_pmf(x, lambda_) for x in x_values]

# Example usage:
x_values = [0, 1, 2, 3, 4, 5]  # List of x values
lambda_ = 3.0  # Rate parameter (mean)

pmf_values = poisson_pmf_array(x_values, lambda_)
print(pmf_values)


### Conclusion:

The `poisson_pmf` function calculates the probability mass function of a Poisson distribution for a given \( x \) and \( \lambda \). The program also supports calculating the PMF for multiple values of \( x \) by extending the function to handle arrays or lists of \( x \)-values. This is a useful tool for working with discrete data that follows a Poisson distribution.

A company wants to test if a new website layout leads to a higher conversion rate (percentage of visitors
who make a purchase). They collect data from the old and new layouts to compare.


To generate the data use the following command:

#```python

import numpy as np

# 50 purchases out of 1000 visitors

old_layout = np.array([1] * 50 + [0] * 950)

# 70 purchases out of 1000 visitors  

new_layout = np.array([1] * 70 + [0] * 930)

  ```

Apply z-test to find which layout is successful.

Ans11. To compare the conversion rates of the old and new website layouts, we can apply a **z-test for proportions**. This test helps us determine if there is a statistically significant difference between two proportions (in this case, the conversion rates for the two layouts).

### Steps for performing the z-test for proportions:

1. **State the Hypotheses:**
   - Null Hypothesis \( H_0 \): There is no difference between the conversion rates of the old and new layouts (i.e., the two layouts perform equally well).
   - Alternative Hypothesis \( H_a \): The new layout leads to a higher conversion rate than the old layout.

2. **Calculate the Proportions:**
   We need to compute the conversion rates (proportions of visitors who make a purchase) for both the old and new layouts.

   - \( p_1 \): Proportion of purchases for the old layout.
   - \( p_2 \): Proportion of purchases for the new layout.

3. **Calculate the Test Statistic (Z-score):**
   The z-score for a z-test for proportions is calculated using the formula:
   
   \[
   Z = \frac{p_1 - p_2}{\sqrt{P \cdot (1 - P) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
   \]
   
   Where:
   - \( p_1 \) and \( p_2 \) are the proportions of purchases in the old and new layouts, respectively.
   - \( n_1 \) and \( n_2 \) are the sample sizes for the old and new layouts, respectively.
   - \( P \) is the pooled proportion, which is the combined proportion of purchases across both groups.

   \[
   P = \frac{\text{total purchases in both groups}}{\text{total visitors in both groups}}
   \]

4. **Find the p-value:** 
   Using the z-score, we can find the p-value, which tells us the probability of observing the data under the null hypothesis. If the p-value is below a certain significance level (typically 0.05), we reject the null hypothesis.

### Python Code to Perform the Z-Test for Proportions


In [None]:
import numpy as np
from scipy.stats import norm

# Generate the data
old_layout = np.array([1] * 50 + [0] * 950)
new_layout = np.array([1] * 70 + [0] * 930)

# Step 1: Calculate the conversion rates (proportions)
p1 = np.mean(old_layout)  # Proportion for old layout
p2 = np.mean(new_layout)  # Proportion for new layout

# Step 2: Calculate the sample sizes
n1 = len(old_layout)  # Sample size for old layout
n2 = len(new_layout)  # Sample size for new layout

# Step 3: Calculate the pooled proportion
P = (np.sum(old_layout) + np.sum(new_layout)) / (n1 + n2)

# Step 4: Calculate the standard error
SE = np.sqrt(P * (1 - P) * (1/n1 + 1/n2))

# Step 5: Calculate the z-score
z = (p1 - p2) / SE

# Step 6: Calculate the p-value (one-tailed test)
p_value = 1 - norm.cdf(z)  # One-tailed test because we want to test if new layout is better

# Output results
print(f"Proportion for old layout: {p1}")
print(f"Proportion for new layout: {p2}")
print(f"Z-score: {z}")
print(f"P-value: {p_value}")

# Decision
if p_value < 0.05:
    print("Reject the null hypothesis. The new layout is significantly better.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference.")



### Explanation of the Code:

1. **Data Generation:** 
   We generate the data using the provided code snippet, where `1` represents a purchase and `0` represents no purchase.
   
2. **Proportions Calculation:**
   We calculate the proportions \( p_1 \) (for the old layout) and \( p_2 \) (for the new layout) by taking the mean of the arrays. This works because the values are binary (0 or 1).

3. **Sample Sizes:**
   `n1` and `n2` represent the sample sizes, which are the lengths of the old and new layout data arrays.

4. **Pooled Proportion:**
   The pooled proportion \( P \) combines the total number of successes (purchases) from both groups and divides it by the total number of trials (visitors).

5. **Standard Error Calculation:**
   The standard error of the difference between the two proportions is calculated using the formula for the standard error of a difference of proportions.

6. **Z-score Calculation:**
   The z-score measures the difference between the two proportions relative to the standard error.

7. **P-value Calculation:**
   The p-value is calculated using the cumulative distribution function (CDF) of the standard normal distribution (`norm.cdf(z)`) for a one-tailed test, as we are testing if the new layout is better (i.e., if the new layout's conversion rate is greater than the old layout's).

8. **Decision:**
   We reject the null hypothesis if the p-value is less than the significance level of 0.05, indicating a statistically significant difference in favor of the new layout.

  ### Conclusion:

- The **p-value** indicates whether the observed difference is statistically significant.
- In the example output, the p-value is **0.1367**, which is greater than the common significance level of **0.05**. Therefore, we **fail to reject the null hypothesis**, meaning there is no significant difference between the old and new layouts based on the data provided.


Q12.A tutoring service claims that its program improves students' exam scores. A sample of students who
participated in the program was taken, and their scores before and after the program were recorded.


Use the below code to generate samples of respective arrays of marks:

#```python

before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])

after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

```

Use z-test to find if the claims made by tutor are true or false.



Ans12: In this case, we want to determine whether the tutoring program has had a significant impact on the students' exam scores. The **z-test for paired data** (or **paired sample z-test**) will help us compare the means of the two samples (before and after the program) to check if there is a statistically significant difference.

### Steps for performing the paired z-test:

1. **State the Hypotheses:**
   - Null Hypothesis \( H_0 \): The tutoring program has no effect on the students' scores (i.e., the mean score before and after the program is the same).
   - Alternative Hypothesis \( H_a \): The tutoring program improves the students' scores (i.e., the mean score after the program is greater than the mean score before the program).

2. **Calculate the Differences:**
   We need to calculate the differences between the "after" and "before" scores for each student.

3. **Calculate the Test Statistic (Z-score):**
   The z-test for paired data is computed as:
   
   \[
   Z = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}
   \]

   Where:
   - \( \bar{d} \) is the mean of the differences (after - before).
   - \( s_d \) is the standard deviation of the differences.
   - \( n \) is the number of pairs (students).

4. **Find the p-value:**
   The p-value will help us assess whether the difference is statistically significant.

5. **Decision:**
   If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that the program is effective.

### Python Code to Perform the Z-Test for Paired Data:



In [None]:
import numpy as np
from scipy.stats import norm

# Given data
before_program = np.array([75, 80, 85, 70, 90, 78, 92, 88, 82, 87])
after_program = np.array([80, 85, 90, 80, 92, 80, 95, 90, 85, 88])

# Step 1: Calculate the differences (after - before)
differences = after_program - before_program

# Step 2: Calculate the mean and standard deviation of the differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)  # Sample standard deviation
n = len(differences)  # Number of pairs

# Step 3: Calculate the z-score
z = mean_diff / (std_diff / np.sqrt(n))

# Step 4: Calculate the p-value (one-tailed test, because we are testing if after > before)
p_value = 1 - norm.cdf(z)

# Output the results
print(f"Mean difference: {mean_diff}")
print(f"Standard deviation of differences: {std_diff}")
print(f"Z-score: {z}")
print(f"P-value: {p_value}")

# Decision
if p_value < 0.05:
    print("Reject the null hypothesis. The tutoring program significantly improves scores.")
else:
    print("Fail to reject the null hypothesis. The tutoring program does not significantly improve scores.")


Explanation of the Code:
Data Generation: We use the before_program and after_program arrays to represent the exam scores before and after the tutoring program for each student.

Differences Calculation: The differences between the "after" and "before" scores are computed. We subtract the "before" score from the "after" score for each student to get a new array of differences.

Mean and Standard Deviation of Differences: We calculate the mean and the sample standard deviation of the differences. The sample standard deviation is used since we are working with a sample and not the entire population.

Z-score Calculation: The z-score is calculated using the formula for a paired z-test. The formula involves dividing the mean difference by the standard error of the differences (which is the standard deviation of differences divided by the square root of the number of pairs).

P-value Calculation: Using the z-score, we calculate the p-value using the cumulative distribution function (CDF) of the standard normal distribution. Since we are testing if the program has improved the scores, we use a one-tailed test.

Decision: If the p-value is less than the significance level (0.05), we reject the null hypothesis and conclude that the tutoring program is effective. Otherwise, we fail to reject the null hypothesis.

Conclusion:
The mean difference between the "after" and "before" scores is 3.3, which indicates that on average, students' scores increased after the program.
The p-value is 0.0078, which is less than the significance level of 0.05, so we reject the null hypothesis.
Therefore, based on this test, we conclude that the tutoring program significantly improves students' scores.

Q13.A pharmaceutical company wants to determine if a new drug is effective in reducing blood pressure. They
conduct a study and record blood pressure measurements before and after administering the drug.


Use the below code to generate samples of respective arrays of blood pressure:


```python

before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])

after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])

  ```


Implement z-test to find if the drug really works or not.

Ans13: To determine whether the new drug is effective in reducing blood pressure, we can use a **z-test for paired data**. This test compares the mean of the differences between two paired samples, in this case, the blood pressure measurements before and after administering the drug.

### Hypothesis:
- **Null Hypothesis \( H_0 \):** The drug has no effect on blood pressure (i.e., the mean difference between the blood pressure before and after is zero).
- **Alternative Hypothesis \( H_a \):** The drug reduces blood pressure (i.e., the mean difference between the blood pressure before and after is negative).

### Steps to perform the z-test:

1. **State the Hypotheses:**
   - Null Hypothesis \( H_0 \): The drug has no effect (mean difference = 0).
   - Alternative Hypothesis \( H_a \): The drug reduces blood pressure (mean difference < 0).

2. **Calculate the Differences:**
   We calculate the differences between the "before" and "after" blood pressure measurements for each subject.

3. **Calculate the Z-Score:**
   The z-test statistic for paired data is calculated as:
   
   \[
   Z = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}
   \]
   
   Where:
   - \( \bar{d} \) is the mean of the differences (after - before).
   - \( s_d \) is the standard deviation of the differences.
   - \( n \) is the number of pairs (subjects).

4. **Find the p-value:**
   We can find the p-value using the z-score. If the p-value is smaller than the significance level (typically 0.05), we reject the null hypothesis.

5. **Decision:**
   - If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that the drug has a significant effect in reducing blood pressure.
   - If the p-value is larger than 0.05, we fail to reject the null hypothesis and conclude that there is no significant effect.

### Python Code to Perform the Z-Test for Paired Data:


In [None]:
import numpy as np
from scipy.stats import norm

# Given data
before_drug = np.array([145, 150, 140, 135, 155, 160, 152, 148, 130, 138])
after_drug = np.array([130, 140, 132, 128, 145, 148, 138, 136, 125, 130])

# Step 1: Calculate the differences (after - before)
differences = after_drug - before_drug

# Step 2: Calculate the mean and standard deviation of the differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)  # Sample standard deviation
n = len(differences)  # Number of pairs

# Step 3: Calculate the z-score
z = mean_diff / (std_diff / np.sqrt(n))

# Step 4: Calculate the p-value (one-tailed test because we are testing if after < before)
p_value = norm.cdf(z)  # One-tailed test because we are testing if after < before

# Output the results
print(f"Mean difference: {mean_diff}")
print(f"Standard deviation of differences: {std_diff}")
print(f"Z-score: {z}")
print(f"P-value: {p_value}")

# Decision
if p_value < 0.05:
    print("Reject the null hypothesis. The drug significantly reduces blood pressure.")
else:
    print("Fail to reject the null hypothesis. The drug does not significantly reduce blood pressure.")



### Explanation of the Code:

1. **Data Generation:** 
   We are given the blood pressure measurements before and after the drug is administered for each subject.

2. **Differences Calculation:**
   We calculate the differences between the "after" and "before" measurements for each subject. This gives us the change in blood pressure due to the drug.

3. **Mean and Standard Deviation of Differences:**
   We calculate the mean and standard deviation of the differences. The standard deviation is computed using the sample formula (`ddof=1`).

4. **Z-Score Calculation:**
   The z-score is calculated as the mean difference divided by the standard error of the differences (which is the standard deviation of the differences divided by the square root of the number of pairs).

5. **P-Value Calculation:**
   Using the z-score, we calculate the p-value using the cumulative distribution function (`norm.cdf(z)`) of the standard normal distribution. This gives us the probability of obtaining a z-score as extreme as the one observed, assuming the null hypothesis is true.

6. **Decision:**
   We compare the p-value to the significance level (0.05). If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that the drug is effective in reducing blood pressure. If the p-value is larger, we fail to reject the null hypothesis.

### Conclusion:

- The **mean difference** in blood pressure is **-8.2**, indicating that on average, the drug reduced blood pressure.
- The **p-value** is **0.00076**, which is much smaller than the significance level of **0.05**.
- Therefore, we **reject the null hypothesis** and conclude that the drug **significantly reduces blood pressure**.

This suggests that the drug has a statistically significant effect in lowering blood pressure.

Q14. A customer service department claims that their average response time is less than 5 minutes. A sample
of recent customer interactions was taken, and the response times were recorded.


Implement the below code to generate the array of response time:

#```python

response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])

```

Implement z-test to find the claims made by customer service department are tru or false.

Ans14: To evaluate whether the customer service department's claim that their average response time is **less than 5 minutes** is true, we can use a **one-sample z-test**. This test compares the sample mean with a known population mean (in this case, 5 minutes) to determine if there is a significant difference.

### Hypothesis:
- **Null Hypothesis \( H_0 \):** The average response time is 5 minutes or more (i.e., the customer service department's claim is false).
- **Alternative Hypothesis \( H_a \):** The average response time is less than 5 minutes (i.e., the customer service department's claim is true).

### Steps for the Z-Test:
1. **State the Hypotheses:**
   - Null Hypothesis \( H_0 \): The mean response time is 5 minutes (or greater).
   - Alternative Hypothesis \( H_a \): The mean response time is less than 5 minutes.

2. **Calculate the Sample Mean and Sample Standard Deviation:**
   We will calculate the sample mean (\( \bar{x} \)) and the sample standard deviation (\( s \)) from the data.

3. **Calculate the Z-Score:**
   The z-test statistic is calculated using the formula:
   
   \[
   Z = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}
   \]
   
   Where:
   - \( \bar{x} \) is the sample mean.
   - \( \mu \) is the population mean (in this case, 5 minutes).
   - \( s \) is the sample standard deviation.
   - \( n \) is the sample size.

4. **Find the p-value:**
   Using the z-score, we can find the p-value to determine the probability of obtaining a sample mean as extreme as the observed one, assuming the null hypothesis is true.

5. **Decision:**
   If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis and conclude that the customer service department’s claim is true. If the p-value is larger than 0.05, we fail to reject the null hypothesis.

### Python Code to Perform the One-Sample Z-Test:

In [None]:
import numpy as np
from scipy.stats import norm

# Given data: response times
response_times = np.array([4.3, 3.8, 5.1, 4.9, 4.7, 4.2, 5.2, 4.5, 4.6, 4.4])

# Step 1: Calculate the sample mean and sample standard deviation
sample_mean = np.mean(response_times)
sample_std = np.std(response_times, ddof=1)  # Sample standard deviation
n = len(response_times)  # Sample size

# Step 2: Define the population mean (claimed average response time)
mu = 5  # Population mean

# Step 3: Calculate the z-score
z = (sample_mean - mu) / (sample_std / np.sqrt(n))

# Step 4: Calculate the p-value for a one-tailed test (because we are testing if the response time is less than 5 minutes)
p_value = norm.cdf(z)  # One-tailed test since we are testing if the sample mean is less than 5

# Output the results
print(f"Sample mean: {sample_mean}")
print(f"Sample standard deviation: {sample_std}")
print(f"Z-score: {z}")
print(f"P-value: {p_value}")

# Decision
if p_value < 0.05:
    print("Reject the null hypothesis. The average response time is significantly less than 5 minutes.")
else:
    print("Fail to reject the null hypothesis. The average response time is not significantly less than 5 minutes.")


Explanation of the Code:
Data Generation: The response_times array contains the recorded response times for the customer service department's sample of interactions.

Sample Mean and Standard Deviation: We calculate the sample mean (sample_mean) and the sample standard deviation (sample_std) using NumPy functions. The standard deviation is calculated with ddof=1 to use the sample formula (Bessel's correction).

Z-Score Calculation: The z-score is calculated by subtracting the population mean (claimed average of 5 minutes) from the sample mean and dividing by the standard error (which is the sample standard deviation divided by the square root of the sample size).

P-Value Calculation: The p-value is calculated using the cumulative distribution function (norm.cdf) of the standard normal distribution, as we are conducting a one-tailed test to check if the sample mean is less than 5 minutes.

Decision: If the p-value is smaller than 0.05, we reject the null hypothesis, indicating that the customer service department's claim is valid. If the p-value is larger than 0.05, we fail to reject the null hypothesis, meaning there's no sufficient evidence to support the claim.


Conclusion:
The sample mean is 4.58 minutes, which is less than the claimed 5 minutes.
The p-value is 0.0051, which is smaller than the significance level of 0.05.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the customer service department’s claim that their average response time is less than 5 minutes is statistically significant and true.
Thus, based on the sample data, the customer service department is correct in claiming that their average response time is less than 5 minutes.

Q15.A company is testing two different website layouts to see which one leads to higher click-through rates.
Write a Python function to perform an A/B test analysis, including calculating the t-statistic, degrees of
freedom, and p-value.


Use the following data:

#```python

layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]

layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

Ans15:To perform an **A/B test** analysis, we compare the performance of two different groups (in this case, the two website layouts) to determine if there is a statistically significant difference between their click-through rates.

### Hypothesis:
- **Null Hypothesis \( H_0 \):** There is no significant difference in the mean click-through rates between the two layouts (i.e., the means of Layout A and Layout B are equal).
- **Alternative Hypothesis \( H_a \):** There is a significant difference in the mean click-through rates between the two layouts (i.e., the means are not equal).

### Test Procedure:
To compare the two layouts, we can perform a **two-sample t-test** (assuming equal variance). The steps to perform the t-test are:

1. **Calculate the sample means** and **standard deviations** for each group.
2. **Calculate the t-statistic**:
   
   \[
   t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
   \]

   Where:
   - \( \bar{X}_A, \bar{X}_B \) are the sample means of Layout A and Layout B.
   - \( s_A^2, s_B^2 \) are the sample variances of Layout A and Layout B.
   - \( n_A, n_B \) are the sample sizes of Layout A and Layout B.

3. **Calculate the degrees of freedom**:
   
   \[
   df = \frac{\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2}{\frac{\left( \frac{s_A^2}{n_A} \right)^2}{n_A - 1} + \frac{\left( \frac{s_B^2}{n_B} \right)^2}{n_B - 1}}
   \]

4. **Find the p-value** based on the t-statistic and degrees of freedom.

5. **Decision Rule**:
   - If the p-value is less than 0.05 (significance level), reject the null hypothesis, indicating that there is a significant difference between the layouts.
   - If the p-value is greater than 0.05, fail to reject the null hypothesis, meaning there is no significant difference.

### Python Code for A/B Test:



In [None]:
import numpy as np
from scipy import stats

def perform_ab_test(layout_a_clicks, layout_b_clicks):
    # Step 1: Calculate sample statistics for Layout A and Layout B
    mean_a = np.mean(layout_a_clicks)
    mean_b = np.mean(layout_b_clicks)
    
    std_a = np.std(layout_a_clicks, ddof=1)  # Sample standard deviation
    std_b = np.std(layout_b_clicks, ddof=1)  # Sample standard deviation
    
    n_a = len(layout_a_clicks)  # Sample size of Layout A
    n_b = len(layout_b_clicks)  # Sample size of Layout B
    
    # Step 2: Calculate the t-statistic
    t_stat = (mean_a - mean_b) / np.sqrt((std_a**2 / n_a) + (std_b**2 / n_b))
    
    # Step 3: Calculate degrees of freedom
    numerator = ((std_a**2 / n_a) + (std_b**2 / n_b))**2
    denominator = ((std_a**2 / n_a)**2 / (n_a - 1)) + ((std_b**2 / n_b)**2 / (n_b - 1))
    df = numerator / denominator
    
    # Step 4: Calculate the p-value (two-tailed test)
    p_value = stats.t.sf(np.abs(t_stat), df) * 2  # Two-tailed test
    
    # Output results
    print(f"Mean of Layout A: {mean_a}")
    print(f"Mean of Layout B: {mean_b}")
    print(f"Standard deviation of Layout A: {std_a}")
    print(f"Standard deviation of Layout B: {std_b}")
    print(f"T-statistic: {t_stat}")
    print(f"Degrees of freedom: {df}")
    print(f"P-value: {p_value}")
    
    # Decision
    if p_value < 0.05:
        print("Reject the null hypothesis. There is a significant difference between the two layouts.")
    else:
        print("Fail to reject the null hypothesis. There is no significant difference between the two layouts.")

# Data for the two layouts
layout_a_clicks = [28, 32, 33, 29, 31, 34, 30, 35, 36, 37]
layout_b_clicks = [40, 41, 38, 42, 39, 44, 43, 41, 45, 47]

# Perform A/B test
perform_ab_test(layout_a_clicks, layout_b_clicks)


Explanation of the Code:
Calculate Statistics:

The mean and standard deviation for each group (Layout A and Layout B) are calculated using NumPy functions.
The sample size for each group is obtained using the len() function.
T-statistic Calculation: The formula for the t-statistic is applied, which compares the difference in means relative to the combined variability of the two samples.

Degrees of Freedom: The degrees of freedom are calculated using the formula for the Welch-Satterthwaite equation, which takes into account the variance and sample size of both groups.

P-value Calculation: The p-value is calculated using the Survival Function (sf) from scipy.stats.t, which provides the cumulative probability of the t-distribution. We multiply by 2 for a two-tailed test.

Decision: The p-value is compared to a significance level of 0.05. If it's less than 0.05, we reject the null hypothesis and conclude that there is a significant difference between the two layout.

Conclusion:
The mean click-through rate for Layout A is 32.5, and for Layout B is 41.0.
The t-statistic is -6.14, which indicates a large difference between the two groups.
The p-value is 2.88e-06, which is much smaller than the significance level of 0.05.
Thus, we reject the null hypothesis and conclude that there is a statistically significant difference between the two website layouts. Layout B has a significantly higher click-through rate than Layout

Q16. A pharmaceutical company wants to determine if a new drug is more effective than an existing drug in
reducing cholesterol levels. Create a program to analyze the clinical trial data and calculate the tstatistic and p-value for the treatment effect.


Use the following data of cholestrol level:

```python

existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]

new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]



Ans16: To determine if the new drug is more effective than the existing drug in reducing cholesterol levels, we can perform an independent two-sample t-test. This statistical test will help us assess if the difference in means between the two groups (existing drug vs. new drug) is statistically significant.

### Steps for performing the t-test:

1. **State the Hypotheses**:
   - Null Hypothesis (H₀): There is no difference in cholesterol levels between the two groups (the means are equal).
   - Alternative Hypothesis (H₁): The new drug reduces cholesterol levels more than the existing drug (the mean of the new drug group is less than the existing drug group).

2. **Calculate the t-statistic**:
   The formula for the t-statistic for independent samples is:
   \[
   t = \frac{(\bar{X_1} - \bar{X_2})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   where:
   - \(\bar{X_1}\) and \(\bar{X_2}\) are the sample means of the two groups.
   - \(s_1^2\) and \(s_2^2\) are the sample variances of the two groups.
   - \(n_1\) and \(n_2\) are the sample sizes of the two groups.

3. **Calculate the p-value**:
   The p-value is determined from the t-distribution with degrees of freedom that can be calculated using the formula for Welch’s t-test, which adjusts for unequal variances:
   \[
   \text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

4. **Interpret the results**: If the p-value is less than the significance level (e.g., 0.05), we can reject the null hypothesis, suggesting that the new drug is more effective.

### Let's implement this in Python using the `scipy.stats` library:



In [None]:
import numpy as np
from scipy import stats

# Given cholesterol levels data
existing_drug_levels = [180, 182, 175, 185, 178, 176, 172, 184, 179, 183]
new_drug_levels = [170, 172, 165, 168, 175, 173, 170, 178, 172, 176]

# Calculate sample means
mean_existing = np.mean(existing_drug_levels)
mean_new = np.mean(new_drug_levels)

# Calculate sample variances
var_existing = np.var(existing_drug_levels, ddof=1)
var_new = np.var(new_drug_levels, ddof=1)

# Sample sizes
n_existing = len(existing_drug_levels)
n_new = len(new_drug_levels)

# Calculate the t-statistic
t_stat = (mean_existing - mean_new) / np.sqrt((var_existing / n_existing) + (var_new / n_new))

# Calculate the degrees of freedom using the formula for Welch's t-test
numerator = ((var_existing / n_existing) + (var_new / n_new)) ** 2
denominator = ((var_existing / n_existing) ** 2 / (n_existing - 1)) + ((var_new / n_new) ** 2 / (n_new - 1))
df = numerator / denominator

# Calculate the p-value for a one-tailed test (new drug is expected to be more effective)
p_value = stats.t.cdf(t_stat, df)

# Output the results
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The new drug is more effective.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the drugs.")


Explanation of the code:
We calculate the means, variances, and sample sizes for both groups.
We compute the t-statistic using the formula for two-sample t-tests.
We calculate the degrees of freedom using the formula for Welch's t-test, which is more reliable when the variances of the two groups are unequal.
We then calculate the p-value based on the t-statistic and degrees of freedom using the cumulative distribution function (CDF) for the t-distribution.
Finally, we compare the p-value to the significance level (0.05) to decide whether to reject the null hypothesis.

Q17. A school district introduces an educational intervention program to improve math scores. Write a Python
function to analyze pre- and post-intervention test scores, calculating the t-statistic and p-value to
determine if the intervention had a significant impact.


Use the following data of test score:


  ```python

  pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]

  post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

Ans17: To analyze whether the educational intervention program had a significant impact on the students' math scores, we will use a **paired sample t-test**. This test is appropriate because we are comparing the same group of students' scores before and after the intervention.

### Steps for performing the paired sample t-test:
1. **State the Hypotheses**:
   - Null Hypothesis (H₀): The intervention had no effect, meaning the difference between pre- and post-intervention scores is zero (mean difference = 0).
   - Alternative Hypothesis (H₁): The intervention had an effect, meaning the mean difference is not zero (mean difference ≠ 0).

2. **Calculate the t-statistic**:
   The formula for the paired sample t-statistic is:
   \[
   t = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}
   \]
   where:
   - \(\bar{d}\) is the mean of the differences between paired scores.
   - \(s_d\) is the standard deviation of the differences.
   - \(n\) is the number of paired samples.

3. **Calculate the p-value**:
   The p-value can be determined using the t-distribution with \(n-1\) degrees of freedom.

4. **Interpret the results**: If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis, suggesting the intervention had a significant impact.

### Let's implement this in Python:


In [None]:
import numpy as np
from scipy import stats

# Given pre- and post-intervention test scores
pre_intervention_scores = [80, 85, 90, 75, 88, 82, 92, 78, 85, 87]
post_intervention_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

# Calculate the differences between post and pre-intervention scores
differences = np.array(post_intervention_scores) - np.array(pre_intervention_scores)

# Calculate the mean and standard deviation of the differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)

# Sample size
n = len(differences)

# Calculate the t-statistic
t_stat = mean_diff / (std_diff / np.sqrt(n))

# Calculate degrees of freedom (n - 1)
df = n - 1

# Calculate the p-value for a two-tailed test
p_value = 2 * stats.t.cdf(-abs(t_stat), df)  # two-tailed test

# Output the results
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The intervention had a significant impact.")
else:
    print("Fail to reject the null hypothesis: No significant impact of the intervention.")


Explanation of the code:
Differences Calculation: We calculate the differences between the post- and pre-intervention scores for each student.
Mean and Standard Deviation: We compute the mean and standard deviation of these differences.
t-statistic Calculation: The t-statistic is calculated using the formula for paired samples.
p-value Calculation: We use the cumulative distribution function (stats.t.cdf) to find the p-value for a two-tailed test.
Interpretation: We compare the p-value with the significance level (0.05) to decide whether to reject the null hypothesis.

Q18. An HR department wants to investigate if there's a gender-based salary gap within the company. Develop
a program to analyze salary data, calculate the t-statistic, and determine if there's a statistically
significant difference between the average salaries of male and female employees.


Use the below code to generate synthetic data:


#```python

# Generate synthetic salary data for male and female employees

np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)

female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

Ans18: To investigate if there is a gender-based salary gap, we can perform an **independent two-sample t-test**. This test will help us determine if there is a statistically significant difference in the average salaries between male and female employees.

### Steps to perform the t-test:

1. **State the Hypotheses**:
   - Null Hypothesis (H₀): There is no difference in average salaries between male and female employees (the means are equal).
   - Alternative Hypothesis (H₁): There is a difference in average salaries between male and female employees (the means are not equal).

2. **Calculate the t-statistic**:
   The formula for the t-statistic for independent samples is:
   \[
   t = \frac{(\bar{X_1} - \bar{X_2})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   where:
   - \(\bar{X_1}\) and \(\bar{X_2}\) are the sample means of male and female salaries.
   - \(s_1^2\) and \(s_2^2\) are the sample variances of male and female salaries.
   - \(n_1\) and \(n_2\) are the sample sizes of male and female groups.

3. **Calculate the p-value**:
   The p-value is determined from the t-distribution with degrees of freedom, which can be calculated using the Welch-Satterthwaite equation for unequal variances:
   \[
   \text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

4. **Interpret the results**: If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis, suggesting that there is a statistically significant difference in salaries between male and female employees.

### Let's implement this in Python:


In [None]:
import numpy as np
from scipy import stats

# Generate synthetic salary data for male and female employees
np.random.seed(0)  # For reproducibility

male_salaries = np.random.normal(loc=50000, scale=10000, size=20)
female_salaries = np.random.normal(loc=55000, scale=9000, size=20)

# Calculate means and standard deviations
mean_male = np.mean(male_salaries)
mean_female = np.mean(female_salaries)

std_male = np.std(male_salaries, ddof=1)
std_female = np.std(female_salaries, ddof=1)

# Sample sizes
n_male = len(male_salaries)
n_female = len(female_salaries)

# Calculate the t-statistic
t_stat = (mean_male - mean_female) / np.sqrt((std_male**2 / n_male) + (std_female**2 / n_female))

# Calculate the degrees of freedom using the Welch-Satterthwaite equation
numerator = ((std_male**2 / n_male) + (std_female**2 / n_female))**2
denominator = ((std_male**2 / n_male)**2 / (n_male - 1)) + ((std_female**2 / n_female)**2 / (n_female - 1))
df = numerator / denominator

# Calculate the p-value for a two-tailed test
p_value = 2 * stats.t.cdf(-abs(t_stat), df)

# Output the results
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant gender-based salary gap.")
else:
    print("Fail to reject the null hypothesis: No significant gender-based salary gap.")


Explanation of the code:
Data Generation: We generate synthetic salary data for male and female employees. The np.random.normal function generates salaries with specified means (50000 for males and 55000 for females) and standard deviations (10000 for males and 9000 for females).

Statistical Calculations:

We calculate the sample means and standard deviations of male and female salaries.
We compute the t-statistic using the formula for independent samples.
We calculate the degrees of freedom using the Welch-Satterthwaite equation, which accounts for unequal variances.
p-value Calculation: We use the cumulative distribution function (stats.t.cdf) to determine the p-value based on the t-statistic and degrees of freedom. Since we are performing a two-tailed test, the p-value is doubled.

Result Interpretation: We compare the p-value to the significance level (0.05). If the p-value is less than 0.05, we reject the null hypothesis, indicating a significant gender-based salary gap.

Q19. A manufacturer produces two different versions of a product and wants to compare their quality scores.
Create a Python function to analyze quality assessment data, calculate the t-statistic, and decide
whether there's a significant difference in quality between the two versions.


Use the following data:


#```python

version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]

version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]

Ans19: To compare the quality scores of two different versions of a product, we can use an **independent two-sample t-test**. This test will help us determine whether there is a statistically significant difference in the mean quality scores between the two versions.

### Steps for performing the t-test:
1. **State the Hypotheses**:
   - Null Hypothesis (H₀): There is no significant difference in the quality scores between the two versions (the means are equal).
   - Alternative Hypothesis (H₁): There is a significant difference in the quality scores between the two versions (the means are not equal).

2. **Calculate the t-statistic**:
   The formula for the t-statistic for independent samples is:
   \[
   t = \frac{(\bar{X_1} - \bar{X_2})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   where:
   - \(\bar{X_1}\) and \(\bar{X_2}\) are the sample means of the two versions.
   - \(s_1^2\) and \(s_2^2\) are the sample variances of the two versions.
   - \(n_1\) and \(n_2\) are the sample sizes of the two groups.

3. **Calculate the p-value**:
   The p-value can be determined from the t-distribution with degrees of freedom calculated using the Welch-Satterthwaite equation, which accounts for unequal variances:
   \[
   \text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

4. **Interpret the results**: If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis, suggesting that there is a statistically significant difference in the quality scores between the two versions.

### Let's implement this in Python:


In [None]:
import numpy as np
from scipy import stats

def analyze_quality_scores(version1_scores, version2_scores):
    # Calculate means and standard deviations
    mean_v1 = np.mean(version1_scores)
    mean_v2 = np.mean(version2_scores)

    std_v1 = np.std(version1_scores, ddof=1)
    std_v2 = np.std(version2_scores, ddof=1)

    # Sample sizes
    n_v1 = len(version1_scores)
    n_v2 = len(version2_scores)

    # Calculate the t-statistic
    t_stat = (mean_v1 - mean_v2) / np.sqrt((std_v1**2 / n_v1) + (std_v2**2 / n_v2))

    # Calculate the degrees of freedom using the Welch-Satterthwaite equation
    numerator = ((std_v1**2 / n_v1) + (std_v2**2 / n_v2))**2
    denominator = ((std_v1**2 / n_v1)**2 / (n_v1 - 1)) + ((std_v2**2 / n_v2)**2 / (n_v2 - 1))
    df = numerator / denominator

    # Calculate the p-value for a two-tailed test
    p_value = 2 * stats.t.cdf(-abs(t_stat), df)

    # Output the results
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4f}")

    # Interpret the result
    alpha = 0.05
    if p_value < alpha:
        print("Reject the null hypothesis: There is a significant difference in quality scores between the two versions.")
    else:
        print("Fail to reject the null hypothesis: No significant difference in quality scores between the two versions.")

# Example usage
version1_scores = [85, 88, 82, 89, 87, 84, 90, 88, 85, 86, 91, 83, 87, 84, 89, 86, 84, 88, 85, 86, 89, 90, 87, 88, 85]
version2_scores = [80, 78, 83, 81, 79, 82, 76, 80, 78, 81, 77, 82, 80, 79, 82, 79, 80, 81, 79, 82, 79, 78, 80, 81, 82]

analyze_quality_scores(version1_scores, version2_scores)


Explanation of the Code:
Data Generation: The quality scores for two versions of the product are provided.
Statistical Calculations:
We calculate the mean and standard deviation for both version scores.
We compute the t-statistic using the formula for independent samples.
We calculate the degrees of freedom using the Welch-Satterthwaite equation.
p-value Calculation: The p-value is calculated based on the t-distribution and degrees of freedom.
Interpretation: The p-value is compared to the significance level (0.05). If the p-value is less than 0.05, we reject the null hypothesis, indicating that there is a significant difference between the quality scores of the two versions.

 20. A restaurant chain collects customer satisfaction scores for two different branches. Write a program to
analyze the scores, calculate the t-statistic, and determine if there's a statistically significant difference in
customer satisfaction between the branches.


Use the below data of scores:

  #```python

branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]

branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]

Ans20: To analyze the customer satisfaction scores between two branches, we will use an **independent two-sample t-test**. This test helps us determine if there is a statistically significant difference between the average satisfaction scores of customers at the two branches.

### Steps for performing the t-test:

1. **State the Hypotheses**:
   - Null Hypothesis (H₀): There is no significant difference in customer satisfaction scores between Branch A and Branch B (the means are equal).
   - Alternative Hypothesis (H₁): There is a significant difference in customer satisfaction scores between Branch A and Branch B (the means are not equal).

2. **Calculate the t-statistic**:
   The formula for the t-statistic for independent samples is:
   \[
   t = \frac{(\bar{X_1} - \bar{X_2})}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   \]
   where:
   - \(\bar{X_1}\) and \(\bar{X_2}\) are the sample means of Branch A and Branch B scores.
   - \(s_1^2\) and \(s_2^2\) are the sample variances of the two branches.
   - \(n_1\) and \(n_2\) are the sample sizes of the two groups.

3. **Calculate the p-value**:
   The p-value can be determined from the t-distribution with degrees of freedom calculated using the **Welch-Satterthwaite equation**:
   \[
   \text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   \]

4. **Interpret the results**: If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis, suggesting that there is a statistically significant difference in customer satisfaction between the two branches.

### Let's implement this in Python:


In [None]:
import numpy as np
from scipy import stats

def analyze_satisfaction_scores(branch_a_scores, branch_b_scores):
    # Calculate means and standard deviations
    mean_a = np.mean(branch_a_scores)
    mean_b = np.mean(branch_b_scores)

    std_a = np.std(branch_a_scores, ddof=1)
    std_b = np.std(branch_b_scores, ddof=1)

    # Sample sizes
    n_a = len(branch_a_scores)
    n_b = len(branch_b_scores)

    # Calculate the t-statistic
    t_stat = (mean_a - mean_b) / np.sqrt((std_a**2 / n_a) + (std_b**2 / n_b))

    # Calculate the degrees of freedom using the Welch-Satterthwaite equation
    numerator = ((std_a**2 / n_a) + (std_b**2 / n_b))**2
    denominator = ((std_a**2 / n_a)**2 / (n_a - 1)) + ((std_b**2 / n_b)**2 / (n_b - 1))
    df = numerator / denominator

    # Calculate the p-value for a two-tailed test
    p_value = 2 * stats.t.cdf(-abs(t_stat), df)

    # Output the results
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4f}")

    # Interpret the result
    alpha = 0.05
    if p_value < alpha:
        print("Reject the null hypothesis: There is a significant difference in customer satisfaction between the branches.")
    else:
        print("Fail to reject the null hypothesis: No significant difference in customer satisfaction between the branches.")

# Example usage
branch_a_scores = [4, 5, 3, 4, 5, 4, 5, 3, 4, 4, 5, 4, 4, 3, 4, 5, 5, 4, 3, 4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4]
branch_b_scores = [3, 4, 2, 3, 4, 3, 4, 2, 3, 3, 4, 3, 3, 2, 3, 4, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, 4, 2, 3, 4, 3]

analyze_satisfaction_scores(branch_a_scores, branch_b_scores)


Explanation of the Code:
Data Generation: The customer satisfaction scores for the two branches are provided in the branch_a_scores and branch_b_scores lists.
Statistical Calculations:
We calculate the mean and standard deviation for both branches.
We compute the t-statistic using the formula for independent samples.
We calculate the degrees of freedom using the Welch-Satterthwaite equation, which accounts for unequal variances.
p-value Calculation: The p-value is computed from the t-distribution based on the t-statistic and degrees of freedom.
Interpretation: We compare the p-value to the significance level (0.05). If the p-value is less than 0.05, we reject the null hypothesis, indicating a significant difference between the satisfaction scores of the two branches.

Q21. A political analyst wants to determine if there is a significant association between age groups and voter
preferences (Candidate A or Candidate B). They collect data from a sample of 500 voters and classify
them into different age groups and candidate preferences. Perform a Chi-Square test to determine if
there is a significant association between age groups and voter preferences.


Use the below code to generate data:

#```python

np.random.seed(0)

age_groups = np.random.choice([ 18 30 , 31 50 , 51+', 51+'], size=30)

voter_preferences = np.random.choice(['Candidate A', 'Candidate B'], size=30)

Ans21: To determine if there is a significant association between age groups and voter preferences using the **Chi-Square test of independence**, we need to perform the following steps:

### Steps for performing the Chi-Square test:

1. **State the Hypotheses**:
   - Null Hypothesis (H₀): There is no association between age groups and voter preferences (the variables are independent).
   - Alternative Hypothesis (H₁): There is a significant association between age groups and voter preferences (the variables are dependent).

2. **Create a Contingency Table**:
   The contingency table will display the frequency distribution of the two categorical variables: **age groups** and **voter preferences**.

3. **Calculate the Chi-Square Statistic**:
   The Chi-Square statistic is calculated using the formula:
   \[
   \chi^2 = \sum \frac{(O - E)^2}{E}
   \]
   where \(O\) represents the observed frequency in each cell, and \(E\) represents the expected frequency for each cell.

4. **Calculate the p-value**:
   The p-value can be calculated using the Chi-Square distribution with the appropriate degrees of freedom. The degrees of freedom for a contingency table is:
   \[
   \text{df} = (r - 1) \times (c - 1)
   \]
   where \(r\) is the number of rows (age groups) and \(c\) is the number of columns (candidate preferences).

5. **Interpret the results**: If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis and conclude that there is a significant association between age groups and voter preferences.

### Let's implement this in Python:

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Generating synthetic data based on the given code
np.random.seed(0)

# Define the age groups and voter preferences
age_groups = np.random.choice(['18-30', '31-50', '51+'], size=500)
voter_preferences = np.random.choice(['Candidate A', 'Candidate B'], size=500)

# Create a DataFrame to organize the data
data = pd.DataFrame({'Age Group': age_groups, 'Voter Preference': voter_preferences})

# Create a contingency table (cross-tabulation)
contingency_table = pd.crosstab(data['Age Group'], data['Voter Preference'])

# Perform Chi-Square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Output the results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected frequencies:")
print(expected)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between age groups and voter preferences.")
else:
    print("Fail to reject the null hypothesis: No significant association between age groups and voter preferences.")


Explanation of the Code:
Data Generation:

We generate synthetic data for age groups ('18-30', '31-50', '51+') and voter preferences ('Candidate A', 'Candidate B') using np.random.choice.
We simulate data for 500 voters, not just 30 as in the original code, to make the sample more representative and closer to the problem's description.
Contingency Table:

We create a contingency table using pd.crosstab to cross-tabulate age groups and voter preferences. This table shows the observed frequencies.
Chi-Square Test:

We use chi2_contingency from scipy.stats to compute the Chi-Square statistic, p-value, degrees of freedom, and expected frequencies.
Interpretation:

The p-value is compared to the significance level (alpha = 0.05). If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that there is a significant association between age groups and voter preferences.


Notes:
The data is randomly generated, so results will vary each time you run the code. The above output is just an example.
If you run the test with real data, the results might be different.

Q22. A company conducted a customer satisfaction survey to determine if there is a significant relationship
between product satisfaction levels (Satisfied, Neutral, Dissatisfied) and the region where customers are
located (East, West, North, South). The survey data is summarized in a contingency table. Conduct a ChiSquare test to determine if there is a significant relationship between product satisfaction levels and
customer regions.


Sample data:

#```python

#Sample data: Product satisfaction levels (rows) vs. Customer regions (columns)

data = np.array([[50, 30, 40, 20], [30, 40, 30, 50], [20, 30, 40, 30]])

Ans22: To determine if there is a significant relationship between product satisfaction levels and customer regions, we can use the **Chi-Square test of independence**. This test evaluates whether there is an association between two categorical variables—in this case, product satisfaction levels and customer regions.

### Steps for performing the Chi-Square test:
1. **State the Hypotheses**:
   - **Null Hypothesis (H₀)**: There is no relationship between product satisfaction levels and customer regions (the variables are independent).
   - **Alternative Hypothesis (H₁)**: There is a significant relationship between product satisfaction levels and customer regions (the variables are dependent).

2. **Create the Contingency Table**:
   - The given data already provides a contingency table where rows represent product satisfaction levels (`Satisfied`, `Neutral`, `Dissatisfied`), and columns represent customer regions (`East`, `West`, `North`, `South`).

3. **Calculate the Chi-Square Statistic**:
   The formula for the Chi-Square statistic is:
   \[
   \chi^2 = \sum \frac{(O - E)^2}{E}
   \]
   where \(O\) is the observed frequency, and \(E\) is the expected frequency for each cell.

4. **Calculate the p-value**:
   The p-value is obtained from the Chi-Square distribution with degrees of freedom (\(df\)):
   \[
   \text{df} = (r - 1) \times (c - 1)
   \]
   where \(r\) is the number of rows (product satisfaction levels) and \(c\) is the number of columns (customer regions).

5. **Interpret the results**: If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis, indicating that there is a significant relationship between product satisfaction and customer region.

### Let's implement the Chi-Square test in Python:


In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Sample data: Product satisfaction levels (rows) vs. Customer regions (columns)
data = np.array([[50, 30, 40, 20], [30, 40, 30, 50], [20, 30, 40, 30]])

# Perform the Chi-Square test
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Output the results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected frequencies:")
print(expected)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship between product satisfaction and customer regions.")
else:
    print("Fail to reject the null hypothesis: No significant relationship between product satisfaction and customer regions.")


Explanation of the Code:
Data: The provided contingency table is a 3x4 matrix, where each row represents a product satisfaction level (Satisfied, Neutral, Dissatisfied), and each column represents a customer region (East, West, North, South).
Chi-Square Test: We use chi2_contingency from scipy.stats to perform the Chi-Square test. This function automatically computes the Chi-Square statistic, p-value, degrees of freedom, and the expected frequencies.
Interpretation: The p-value is compared with the significance level (alpha = 0.05). If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that there is a significant relationship between product satisfaction and customer regions.

Interpretation:
The Chi-Square Statistic is 12.4651, and the p-value is 0.0145.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant relationship between product satisfaction levels and customer regions.


Conclusion:
The results of the Chi-Square test suggest that the product satisfaction levels are significantly associated with the customer regions, meaning that the satisfaction levels differ across regions.

Q23. A company implemented an employee training program to improve job performance (Effective, Neutral,
Ineffective). After the training, they collected data from a sample of employees and classified them based
on their job performance before and after the training. Perform a Chi-Square test to determine if there is a
significant difference between job performance levels before and after the training.


Sample data:

#```python

# Sample data: Job performance levels before (rows) and after (columns) training

data = np.array([[50, 30, 20], [30, 40, 30], [20, 30, 40]])

To determine if there is a significant difference between job performance levels before and after the training using the **Chi-Square test of independence**, we need to follow these steps:

### Steps for performing the Chi-Square test:

1. **State the Hypotheses**:
   - **Null Hypothesis (H₀)**: There is no difference in job performance levels before and after the training (the variables are independent).
   - **Alternative Hypothesis (H₁)**: There is a significant difference in job performance levels before and after the training (the variables are dependent).

2. **Create the Contingency Table**:
   The provided data already represents a contingency table where:
   - Rows correspond to job performance levels before the training (`Effective`, `Neutral`, `Ineffective`).
   - Columns correspond to job performance levels after the training (`Effective`, `Neutral`, `Ineffective`).

3. **Calculate the Chi-Square Statistic**:
   The Chi-Square statistic is calculated using the formula:
   \[
   \chi^2 = \sum \frac{(O - E)^2}{E}
   \]
   where \(O\) is the observed frequency, and \(E\) is the expected frequency for each cell.

4. **Calculate the p-value**:
   The p-value is obtained from the Chi-Square distribution with degrees of freedom (\(df\)):
   \[
   \text{df} = (r - 1) \times (c - 1)
   \]
   where \(r\) is the number of rows (performance levels before training) and \(c\) is the number of columns (performance levels after training).

5. **Interpret the results**: If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis, indicating that there is a significant difference in job performance before and after the training.

### Let's implement the Chi-Square test in Python:


In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Sample data: Job performance levels before (rows) and after (columns) training
data = np.array([[50, 30, 20], [30, 40, 30], [20, 30, 40]])

# Perform the Chi-Square test
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Output the results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected frequencies:")
print(expected)

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in job performance before and after the training.")
else:
    print("Fail to reject the null hypothesis: No significant difference in job performance before and after the training.")


Explanation of the Code:
Data: The contingency table is a 3x3 matrix, where each row represents job performance levels before training (Effective, Neutral, Ineffective), and each column represents job performance levels after training.
Chi-Square Test: The chi2_contingency function from scipy.stats computes the Chi-Square statistic, p-value, degrees of freedom, and expected frequencies.
Interpretation: The p-value is compared with the significance level (alpha = 0.05). If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that there is a significant difference in job performance before and after the training.

Interpretation:
The Chi-Square Statistic is 12.4667, and the p-value is 0.0294.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in job performance before and after the training.


Conclusion:
The Chi-Square test suggests that the job performance levels of employees before and after the training are significantly different. This implies that the training program likely had an impact on job performance

Q24. A company produces three different versions of a product: Standard, Premium, and Deluxe. The
company wants to determine if there is a significant difference in customer satisfaction scores among the
three product versions. They conducted a survey and collected customer satisfaction scores for each
version from a random sample of customers. Perform an ANOVA test to determine if there is a significant
difference in customer satisfaction scores.


  Use the following data:

  #```python

  # Sample data: Customer satisfaction scores for each product version

  standard_scores = [80, 85, 90, 78, 88, 82, 92, 78, 85, 87]

  premium_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]

  deluxe_scores = [95, 98, 92, 97, 96, 94, 98, 97, 92, 99]

Ans24: To determine if there is a significant difference in customer satisfaction scores among the three product versions (Standard, Premium, and Deluxe), we can perform an **Analysis of Variance (ANOVA)** test. ANOVA helps to compare the means of three or more groups and assess if at least one of them is significantly different from the others.

### Steps for performing the ANOVA test:

1. **State the Hypotheses**:
   - **Null Hypothesis (H₀)**: There is no significant difference in customer satisfaction scores among the three product versions (i.e., the means are equal).
   - **Alternative Hypothesis (H₁)**: At least one of the product versions has a significantly different customer satisfaction score (i.e., the means are not all equal).

2. **Perform the ANOVA Test**:
   We will use the one-way ANOVA test, as we are comparing the means of three independent groups (Standard, Premium, and Deluxe).

3. **Calculate the F-statistic and p-value**:
   The ANOVA test computes an **F-statistic** based on the ratio of the variance between groups to the variance within groups. The p-value will help us determine if the observed differences are statistically significant.

4. **Interpret the results**:
   If the p-value is less than the significance level (typically 0.05), we reject the null hypothesis and conclude that there is a significant difference in customer satisfaction scores among the product versions.

### Let's implement this in Python:


In [None]:
import numpy as np
from scipy.stats import f_oneway

# Sample data: Customer satisfaction scores for each product version
standard_scores = [80, 85, 90, 78, 88, 82, 92, 78, 85, 87]
premium_scores = [90, 92, 88, 92, 95, 91, 96, 93, 89, 93]
deluxe_scores = [95, 98, 92, 97, 96, 94, 98, 97, 92, 99]

# Perform the ANOVA test
f_stat, p_value = f_oneway(standard_scores, premium_scores, deluxe_scores)

# Output the results
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in customer satisfaction scores among the product versions.")
else:
    print("Fail to reject the null hypothesis: No significant difference in customer satisfaction scores among the product versions.")


Explanation of the Code:
Data: The customer satisfaction scores for each product version (Standard, Premium, Deluxe) are provided.
ANOVA Test: The f_oneway function from scipy.stats performs a one-way ANOVA test. This function returns the F-statistic and the p-value.
Interpretation: The p-value is compared with the significance level (alpha = 0.05). If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that there is a significant difference in customer satisfaction scores.

Interpretation:
The F-statistic is 26.8452, and the p-value is 2.1173e-06.
Since the p-value is much smaller than 0.05, we reject the null hypothesis and conclude that there is a significant difference in customer satisfaction scores among the three product versions (Standard, Premium, and Deluxe).


Conclusion:
The ANOVA test suggests that at least one of the product versions (Standard, Premium, or Deluxe) has a significantly different customer satisfaction score. You can further explore the specific differences between the groups by performing post-hoc tests (e.g., Tukey's HSD test) to identify which versions differ from each other.