# Statistics Advance Part 1
1. What is a random variable in probability theory?
 - In probability theory, a random variable is a variable whose value is a numerical outcome of a random phenomenon, where a random phenomenon is one whose outcome cannot be predicted with certainty.
2. What are the types of random variables?
 - Random variables are broadly classified as discrete (taking countable values) or continuous (taking any value within an interval). Mixed random variables, which combine both, also exist.
3. What is the difference between discrete and continuous distributions?
 - Discrete distributions describe variables that can take only a finite or countably infinite number of distinct values, while continuous distributions describe variables that can take any value within a given range.
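4. What are probability distribution functions (PDF)?
 - A probability distribution function describes how probability is assigned to the possible values of a random variable. For a discrete random variable it is the probability mass function (PMF), which gives the probability of each individual value. For a continuous random variable it is the probability density function (PDF), which gives the density at each value; probabilities are obtained as areas under the density curve.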
5. How do cumulative distribution functions (CDF) differ from probability distribution functions (PDF)?
 - A cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a specific value, F(x) = P(X ≤ x), while a probability distribution function (PDF) gives the probability (discrete case) or the probability density (continuous case) at a specific value. The CDF is the running sum or integral of the PDF.
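
As a rough illustration of the distinction above, the sketch below uses Python's standard-library `statistics.NormalDist` (a choice of convenience, not something these notes prescribe) to show that the CDF at a point matches the accumulated area under the density curve up to that point. The helper `area_under_pdf` and its integration bounds are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_under_pdf(dist, upper, lower=-8.0, steps=20000):
    """Approximate the integral of dist.pdf from lower to upper (trapezoid rule)."""
    width = (upper - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(upper))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

x = 1.0
density_at_x = std_normal.pdf(x)   # height of the curve at x (not a probability)
prob_up_to_x = std_normal.cdf(x)   # P(X <= x)
approx_area = area_under_pdf(std_normal, x)

print(f"Density at x = {x}: {density_at_x:.4f}")
print(f"CDF at x = {x}: {prob_up_to_x:.4f}")
print(f"Area under the density up to x = {x}: {approx_area:.4f}")
```

The last two printed values should agree closely, while the density is a different quantity altogether.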
6. What is a discrete uniform distribution?
 - In a discrete uniform distribution, the random variable takes a finite number of distinct values, each with the same probability. With n possible values, each has probability 1/n; for example, each face of a fair six-sided die has probability 1/6.
7. What are the key properties of a Bernoulli distribution?
 - A Bernoulli distribution models a single trial with exactly two possible outcomes (success or failure), where success occurs with a fixed probability p. In more detail:
   - Binary outcomes: the random variable takes only two values, typically 0 (failure) and 1 (success).
   - Fixed probability of success: the probability of success is p and the probability of failure is 1 - p.
   - Independence: when the trial is repeated, the trials are independent and p stays the same for each one.
8. What is the binomial distribution, and how is it used in probability?
 - The binomial distribution is a discrete probability distribution that models the probability of a certain number of successes in a fixed number of independent trials, each with only two possible outcomes (success or failure), where the probability of success remains constant for each trial.
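
The definition above can be made concrete with the binomial probability mass function, P(X = k) = C(n, k) p^k (1 - p)^(n - k). The following is a minimal sketch using only the standard library; the function name `binomial_pmf` and the values n = 10, p = 0.5 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5  # illustrative: 10 fair coin flips
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(f"P(X = 5) = {pmf[5]:.4f}")
print(f"Sum over all k = {sum(pmf):.4f}")  # probabilities over all outcomes add up to 1
```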
9. What is the Poisson distribution, and where is it applied?
 - The Poisson distribution is a discrete probability distribution that describes the probability of a certain number of events occurring within a fixed interval of time or space, given the average rate of occurrence, and it's applied in scenarios like modeling customer arrivals, call center calls, or rare events.
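
To illustrate, the Poisson probability of observing k events when the average rate is λ is e^(-λ) λ^k / k!. The sketch below evaluates it with the standard library only; the function name `poisson_pmf` and the rate of 3 events per interval are hypothetical choices for the example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): exp(-lam) * lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # hypothetical average of 3 arrivals per interval

p_zero = poisson_pmf(0, lam)                               # no arrivals at all
p_at_most_5 = sum(poisson_pmf(k, lam) for k in range(6))   # P(X <= 5)

print(f"P(X = 0)  = {p_zero:.4f}")
print(f"P(X <= 5) = {p_at_most_5:.4f}")
```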
10. What is a continuous uniform distribution?
 - The continuous uniform distribution is one of the simplest probability distributions in statistics. The random variable can take any value within a specified range [a, b], for example between 0 and 1, and every value in that range is equally likely: the density is constant at 1 / (b - a).
11. What are the characteristics of a normal distribution?
 - A normal distribution, also known as a Gaussian distribution, is characterized by its bell-shaped, symmetrical curve, where the mean, median, and mode are equal, and data values cluster around the mean.
12. What is the standard normal distribution, and why is it important?
 - The standard normal distribution is a specific type of normal distribution with a mean of 0 and a standard deviation of 1, often used as a reference for other normal distributions because it simplifies probability calculations and comparisons. It's important because it allows for standardized analysis and interpretation of data, facilitating statistical inference and hypothesis testing.
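
A small sketch of why the standard normal works as a reference: a probability under any normal distribution equals the corresponding probability under the standard normal once the value is converted to standard units. The parameters below (mean 100, standard deviation 15, value 130) are made up for illustration, and `statistics.NormalDist` is used as a convenient stand-in for a Z-table.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical distribution parameters
x = 130               # hypothetical value of interest

original = NormalDist(mu, sigma)
standard = NormalDist(0, 1)

z = (x - mu) / sigma            # convert x to standard units
p_direct = original.cdf(x)      # P(X <= x) under the original distribution
p_via_z = standard.cdf(z)       # P(Z <= z) under the standard normal

print(f"z = {z}")
print(f"P(X <= {x}) computed directly:       {p_direct:.6f}")
print(f"P(Z <= {z}) from the standard normal: {p_via_z:.6f}")
```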
13. What is the Central Limit Theorem (CLT), and why is it critical in statistics?
 - The Central Limit Theorem (CLT) is a fundamental statistical concept stating that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the original population's distribution. This is critical because it allows us to make inferences about populations using sample data, even when the population distribution isn't normal.
14. How does the Central Limit Theorem relate to the normal distribution?
 - The Central Limit Theorem (CLT) explains why the normal distribution is so prevalent in real-world scenarios. It states that the distribution of sample means approximates a normal distribution as the sample size increases, regardless of the original population's distribution. For a population with mean μ and finite standard deviation σ, the mean of a sample of n observations is approximately normal with mean μ and standard deviation σ/√n.
15. What is the application of Z statistics in hypothesis testing?
 - In hypothesis testing, z-statistics (or z-tests) are used to determine if there's a statistically significant difference between a sample mean and a population mean (or if two sample means are different) when the population standard deviation is known or the sample size is large enough.
16. How do you calculate a Z-score, and what does it represent?
 - A Z-score, or standard score, measures how many standard deviations a data point is away from the mean, and is calculated using the formula: Z = (x - μ) / σ, where 'x' is the data point, 'μ' is the population mean, and 'σ' is the population standard deviation.
17. What are point estimates and interval estimates in statistics?
 - In statistics, a point estimate is a single value used to estimate an unknown population parameter, while an interval estimate provides a range of values within which the parameter is likely to fall, often with a specified level of confidence.
18. What is the significance of confidence intervals in statistical analysis?
 - Confidence intervals in statistical analysis are crucial because they provide a range of plausible values for an unknown population parameter, allowing researchers to estimate the true value with a certain level of confidence and quantify the uncertainty in their estimates.
19. What is the relationship between a Z-score and a confidence interval?
 - The critical Z-score for a chosen confidence level determines the margin of error, and therefore the width, of a confidence interval. When the population standard deviation σ is known, the interval for the mean is x̄ ± z* · σ/√n, where z* is the critical value (about 1.96 for 95% confidence).
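
One way to see how a Z-score enters a confidence interval is the sketch below. It assumes the population standard deviation is known, so the critical value comes from the standard normal distribution. The function name `z_confidence_interval` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper, z_critical) for a mean when sigma is known."""
    # Critical Z-score: the point leaving (1 - confidence) / 2 in each tail
    z_critical = NormalDist().inv_cdf((1 + confidence) / 2)
    margin_of_error = z_critical * sigma / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error, z_critical

# Hypothetical inputs: sample mean 50, known sigma 10, sample size 100
lower, upper, z_critical = z_confidence_interval(sample_mean=50, sigma=10, n=100)

print(f"Critical Z-score: {z_critical:.4f}")
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

A higher confidence level gives a larger critical Z-score and therefore a wider interval.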
20. How are Z-scores used to compare different distributions?
 - Z-scores, by standardizing data, allow for meaningful comparisons between different distributions, even those with varying means and standard deviations, by expressing each data point's position relative to its distribution's mean in terms of standard deviations.
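
For instance, Z-scores make it possible to compare scores from two exams that were graded on different scales. The exam names, scores, means, and standard deviations below are invented purely for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Hypothetical exams with different means and spreads
math_z = z_score(80, mean=70, std_dev=5)
physics_z = z_score(85, mean=75, std_dev=10)

print(f"Math Z-score:    {math_z}")
print(f"Physics Z-score: {physics_z}")

# The raw physics score is higher, but the math score is further above its own mean
better = "math" if math_z > physics_z else "physics"
print(f"Relatively stronger result: {better}")
```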
21. What are the assumptions for applying the Central Limit Theorem?
 - To apply the Central Limit Theorem (CLT), the data should be sampled randomly, the observations should be independent and drawn from the same population with finite variance, and the sample size should be large enough (a common rule of thumb is 30 or more).
22. What is the concept of expected value in a probability distribution?
 - In a probability distribution, the expected value (also known as the mean or mathematical expectation) is the weighted average of all possible outcomes, where each outcome is weighted by its probability of occurrence.
23. How does a probability distribution relate to the expected outcome of a random variable?
 - A probability distribution describes the likelihood of different outcomes for a random variable, and the expected outcome is calculated by weighting each possible value by its probability and summing the results.
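
The weighting-and-summing step described above can be sketched for a fair six-sided die. The helper name `expected_value` is hypothetical, and exact fractions are used so the result is not affected by floating-point rounding.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(value * prob for value, prob in distribution.items())

# Fair six-sided die: each face has probability 1/6
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

ev = expected_value(fair_die)
print("Expected value of a fair die:", ev, "=", float(ev))
```

The result, 3.5, is not itself a possible roll; the expected value is a long-run average rather than a likely single outcome.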

# Practical
                    

# 1. Write a Python program to generate a random variable and display its value

import random

random_variable = random.randint(1, 10)
print("The random variable is:", random_variable)

# 2. Generate a discrete uniform distribution using Python and plot the probability mass function (PMF)

import numpy as np
import matplotlib.pyplot as plt
low = 1
high = 6  # a six-sided die has faces 1 through 6
size = 1000
random_numbers = np.random.randint(low, high +1, size)
unique_elements, counts = np.unique(random_numbers, return_counts=True)
pmf = counts / size
plt.bar(unique_elements, pmf)
plt.title('Probability Mass Function (PMF) of a Discrete Uniform Distribution')
plt.xlabel('Potential Value of a Die Roll')
plt.ylabel('Probability')
plt.xticks(unique_elements)
plt.grid(axis='y', alpha=0.75)
plt.show()

# 3. Write a Python function to calculate the probability distribution function (PDF) of a Bernoulli distribution

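def bernoulli_pdf(p, x):
    """Return P(X = x) for a Bernoulli random variable with success probability p."""
    if x == 0:
        return 1 - p
    elif x == 1:
        return p
    else:
        return 0  # a Bernoulli variable only takes the values 0 and 1

# Example: probability of failure and of success when p = 0.3
print("P(X = 0):", bernoulli_pdf(0.3, 0))
print("P(X = 1):", bernoulli_pdf(0.3, 1))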

# 4. Write a Python script to simulate a binomial distribution with n=10 and p=0.5, then plot its histogram

n = 10
p = 0.5
size = 1000
random_numbers = np.random.binomial(n, p, size)

plt.hist(random_numbers, bins=range(n + 2), align='left', rwidth=0.8, density=True)

plt.title('Histogram of a Binomial Distribution')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.xticks(range(n + 1))
plt.grid(axis='y', alpha=0.75)
plt.show()

# 5. Create a Poisson distribution and visualize it using Python

lambda_value = 5
size = 1000
random_numbers = np.random.poisson(lambda_value, size)

plt.hist(random_numbers, bins=range(int(np.max(random_numbers)) + 2), align='left', rwidth=0.8, density=True)

plt.title('Histogram of a Poisson Distribution')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.xticks(range(int(np.max(random_numbers)) + 1))
plt.grid(axis='y', alpha=0.75)
plt.show()

# 6. Write a Python program to calculate and plot the cumulative distribution function (CDF) of a discrete uniform distribution

low = 1
high = 6
size = 1000
random_numbers = np.random.randint(low, high + 1, size)
unique_elements, counts = np.unique(random_numbers, return_counts=True)
pmf = counts / size
cdf = np.cumsum(pmf)
# A discrete CDF is a step function, so draw it with steps rather than sloped lines
plt.step(unique_elements, cdf, where='post', marker='o')
plt.title('Cumulative Distribution Function (CDF) of a Discrete Uniform Distribution')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.xticks(unique_elements)
plt.grid(axis='y', alpha=0.75)
plt.show()

# 7. Generate a continuous uniform distribution using NumPy and visualize it

low = 0
high = 1
size = 1000
random_numbers = np.random.uniform(low, high, size)
plt.hist(random_numbers, bins=30, density=True, alpha=0.75)
plt.title('Histogram of a Continuous Uniform Distribution')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.grid(axis='y', alpha=0.75)
plt.show()

# 8. Simulate data from a normal distribution and plot its histogram

mean = 0
std_dev = 1
size = 1000
random_numbers = np.random.normal(mean, std_dev, size)
plt.hist(random_numbers, bins=30, density=True, alpha=0.75)
plt.title('Histogram of a Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.grid(axis='y', alpha=0.75)
plt.show()

# 9. Write a Python function to calculate Z-scores from a dataset and plot them

def calculate_z_scores(data):
    mean = np.mean(data)
    std_dev = np.std(data)
    z_scores = (data - mean) / std_dev
    return z_scores

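data = np.array([10, 15, 20, 25, 30])
z_scores = calculate_z_scores(data)
print("Z-scores:", z_scores)

# Plot the Z-score of each data point
plt.bar(range(len(data)), z_scores)
plt.title('Z-scores of the Dataset')
plt.xlabel('Data Point')
plt.ylabel('Z-score')
plt.xticks(range(len(data)), data)
plt.show()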

# 10. Implement the Central Limit Theorem (CLT) using Python for a non-normal distribution.

import numpy as np
import matplotlib.pyplot as plt
scale = 1
num_samples = 10000
sample_size = 30
sample_means = []
for _ in range(num_samples):
    sample = np.random.exponential(scale, size=sample_size)
    sample_mean = np.mean(sample)
    sample_means.append(sample_mean)
plt.hist(sample_means, bins=30, density=True, alpha=0.75)
plt.title('Central Limit Theorem Demonstration')
plt.xlabel('Sample Mean')
plt.ylabel('Probability Density')
plt.grid(axis='y', alpha=0.75)
plt.show()

# 11. Simulate multiple samples from a normal distribution and verify the Central Limit Theorem

num_samples = 1000
sample_size = 30
sample_means = []
for _ in range(num_samples):
    sample = np.random.normal(0, 1, size=sample_size)
    sample_mean = np.mean(sample)
    sample_means.append(sample_mean)

# Plot once, after all sample means have been collected
plt.hist(sample_means, bins=30, density=True, alpha=0.75)
plt.title('Central Limit Theorem Demonstration')
plt.xlabel('Sample Mean')
plt.ylabel('Probability Density')
plt.grid(axis='y', alpha=0.75)
plt.show()

# 12. Write a Python function to calculate and plot the standard normal distribution (mean = 0, std = 1)

def plot_standard_normal_distribution():
    mean = 0
    std_dev = 1
    x = np.linspace(-4, 4, 1000)
    y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev)**2)
    plt.plot(x, y)
    plt.title('Standard Normal Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability Density')
    plt.grid(axis='y', alpha=0.75)
    plt.show()

plot_standard_normal_distribution()

# 13. Generate random variables and calculate their corresponding probabilities using the binomial distribution

import numpy as np
from scipy.stats import binom
n = 10
p = 0.5
random_variables = np.random.binomial(n, p, size=5)
probabilities = [binom.pmf(k, n, p) for k in random_variables]
for rv, prob in zip(random_variables, probabilities):
    print(f"Random Variable: {rv}, Probability: {prob}")

# 14. Write a Python program to calculate the Z-score for a given data point and compare it to a standard normal distribution

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def calculate_z_score(data_point, mean, std_dev):
    """Calculates the Z-score for a data point."""
    z_score = (data_point - mean) / std_dev
    return z_score

def plot_z_score_comparison(data_point, mean, std_dev):
    """Plots the Z-score comparison against the standard normal distribution."""
    z_score = calculate_z_score(data_point, mean, std_dev)
    x = np.linspace(mean - 3 * std_dev, mean + 3 * std_dev, 100)
    y = norm.pdf(x, mean, std_dev)

    plt.figure(figsize=(8, 6))
    plt.plot(x, y, label=f'Normal Distribution (mean={mean}, std={std_dev})')
    plt.axvline(x=data_point, color='red', linestyle='--', label=f'Data Point: {data_point}')
    plt.axvline(x=mean, color='black', linestyle='--', label='Mean')
    plt.title('Data Point Position on the Normal Distribution')
    plt.xlabel('Values')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True)
    plt.show()

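    print(f"Z-score for data point {data_point}: {z_score}")
    # Compare against the standard normal distribution: P(Z <= z_score)
    print(f"Proportion of a standard normal below this Z-score: {norm.cdf(z_score):.4f}")

# Example call with illustrative values
plot_z_score_comparison(data_point=75, mean=70, std_dev=5)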

# 15. Implement hypothesis testing using Z-statistics for a sample dataset

import numpy as np
from scipy import stats
sample_data = np.array([10, 12, 15, 13, 18, 11, 16, 14, 17, 19])
population_mean = 13
population_std_dev = 2.5
alpha = 0.05
sample_mean = np.mean(sample_data)
sample_std_error = population_std_dev / np.sqrt(len(sample_data))
z_statistic = (sample_mean - population_mean) / sample_std_error
p_value = 2 * (1 - stats.norm.cdf(abs(z_statistic)))
print("Sample Mean:", sample_mean)
print("Z-statistic:", z_statistic)
print("P-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# 16. Create a confidence interval for a dataset using Python and interpret the result

import numpy as np
from scipy import stats
data = np.array([25, 30, 35, 38, 42, 45, 48, 50, 52, 55])
confidence_level = 0.95
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
margin_of_error = stats.t.ppf((1 + confidence_level) / 2, len(data) - 1) * sample_std / np.sqrt(len(data))
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
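print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)
# Interpretation: intervals built this way capture the true mean in about 95% of repeated samples
print(f"We are {confidence_level * 100:.0f}% confident that the true population mean falls within the interval {confidence_interval}.")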

# 17. Generate data from a normal distribution, then calculate and interpret the confidence interval for its mean

import numpy as np
from scipy import stats

mean = 50
std_dev = 10
sample_size = 100
data = np.random.normal(mean, std_dev, sample_size)
confidence_level = 0.95
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)
margin_of_error = stats.t.ppf((1 + confidence_level) / 2, sample_size - 1) * sample_std / np.sqrt(sample_size)
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)
print(f"We are {confidence_level * 100:.0f}% confident that the true population mean falls within the interval {confidence_interval}.")

# 18. Write a Python script to calculate and visualize the probability density function (PDF) of a normal distribution

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean = 0
std_dev = 1
x = np.linspace(mean - 3 * std_dev, mean + 3 * std_dev, 100)
pdf_values = norm.pdf(x, loc=mean, scale=std_dev)
plt.figure(figsize=(8, 6))
plt.plot(x, pdf_values, label='Normal Distribution PDF')
plt.title('Probability Density Function of a Normal Distribution')
plt.xlabel('x')
plt.ylabel('PDF(x)')
plt.legend()
plt.grid(True)
plt.show()

# 19. Use Python to calculate and interpret the cumulative distribution function (CDF) of a Poisson distribution

import scipy.stats as stats

mean = 3
value = 5
cdf_value = stats.poisson.cdf(value, mean)
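print(f"The CDF for a Poisson distribution with mean {mean} at value {value} is: {cdf_value}")
# Interpretation: the CDF is the probability of observing at most `value` events
print(f"There is a {cdf_value:.2%} chance of observing {value} or fewer events.")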

# 20. Simulate a random variable using a continuous uniform distribution and calculate its expected value

import numpy as np

low = 0
high = 1
size = 1000
random_numbers = np.random.uniform(low, high, size)
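expected_value = np.mean(random_numbers)  # the sample mean estimates the expected value
theoretical_value = (low + high) / 2  # expected value of a continuous uniform distribution on [low, high]
print("Expected Value (sample estimate):", expected_value)
print("Expected Value (theoretical):", theoretical_value)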

# 21. Write a Python program to compare the standard deviations of two datasets and visualize the difference

import numpy as np
import matplotlib.pyplot as plt

def compare_std_dev(data1, data2, label1="Dataset 1", label2="Dataset 2"):

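    std_dev1 = np.std(data1)
    std_dev2 = np.std(data2)
    print(f"{label1} standard deviation: {std_dev1:.2f}")
    print(f"{label2} standard deviation: {std_dev2:.2f}")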

    means = [np.mean(data1), np.mean(data2)]
    std_devs = [std_dev1, std_dev2]
    labels = [label1, label2]

    x_pos = np.arange(len(labels))

    fig, ax = plt.subplots()
    ax.bar(x_pos, means, yerr=std_devs, align='center', alpha=0.5, ecolor='black', capsize=10)
    ax.set_ylabel('Value')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(labels)
    ax.set_title('Comparison of Standard Deviations')
    ax.yaxis.grid(True)

    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    dataset1 = [20, 22, 19, 25, 28, 21, 23, 22]
    dataset2 = [10, 30, 15, 35, 20, 25, 12, 40]
    compare_std_dev(dataset1, dataset2, "Group A", "Group B")

    dataset3 = np.random.normal(50, 10, 100)
    dataset4 = np.random.normal(50, 5, 100)
    compare_std_dev(dataset3, dataset4, "Sample C", "Sample D")

# 22. Calculate the range and interquartile range (IQR) of a dataset generated from a normal distribution

import numpy as np
mean = 50
std_dev = 10
size = 100
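data = np.random.normal(mean, std_dev, size)
data_range = np.max(data) - np.min(data)  # range: largest value minus smallest value
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1
print("Range:", data_range)
print("Interquartile Range (IQR):", iqr)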

# 23. Implement Z-score normalization on a dataset and visualize its transformation

import matplotlib.pyplot as plt
from scipy import stats

data = [10, 15, 20, 25, 30, 35, 40, 45, 50]
normalized_data = stats.zscore(data)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.hist(data, bins=10, edgecolor='black')
ax1.set_title('Original Dataset')
ax1.set_xlabel('Values')
ax1.set_ylabel('Frequency')
ax2.hist(normalized_data, bins=10, edgecolor='black')
ax2.set_title('Z-score Normalized Dataset')
ax2.set_xlabel('Z-scores')
ax2.set_ylabel('Frequency')
plt.show()

# 24. Write a Python function to calculate the skewness and kurtosis of a dataset generated from a normal distribution.

import numpy as np
from scipy import stats

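def calculate_skewness_and_kurtosis(data):
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)  # excess (Fisher) kurtosis: close to 0 for normal data
    return skewness, kurtosis

# Generate data from a normal distribution and apply the function
data = np.random.normal(0, 1, 1000)
skewness, kurtosis = calculate_skewness_and_kurtosis(data)
print("Skewness:", skewness)
print("Kurtosis:", kurtosis)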

















