# Statistics and Probability with Python

This notebook provides a comprehensive introduction to statistics and probability concepts using Python. We'll explore descriptive statistics, probability distributions, hypothesis testing, and data visualization techniques.

## 1. Import Required Libraries

First, let's import all the necessary libraries for statistical analysis and visualization.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, binom, poisson, t, chi2, f_oneway
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("All libraries imported successfully!")

<details>
<summary><b>Summary</b></summary>

- Successfully imported all essential libraries for statistical analysis and data visualization
- **NumPy**: Powerful numerical computing capabilities for array operations and mathematical functions
- **Pandas**: Efficient data manipulation and analysis through DataFrames
- **SciPy**: Extensive collection of statistical functions, tests, and probability distributions
- **Matplotlib & Seaborn**: Complementary visualization libraries: Matplotlib provides foundational plotting while Seaborn adds statistical graphics with better aesthetics
- Warning filter suppresses all warnings (including deprecation notices), which keeps output clean but can hide real problems; remove it when debugging
- Seaborn style settings ensure professional-looking visualizations with appropriate default sizes and grid backgrounds
- These libraries together form the complete toolkit needed for comprehensive statistical analysis in Python

</details>

## 2. Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. Let's calculate various descriptive statistics on sample data.

In [None]:
# Create a sample dataset
np.random.seed(42)
data = np.random.normal(100, 15, 1000)  # Mean=100, SD=15, n=1000

# Calculate descriptive statistics
mean = np.mean(data)
median = np.median(data)
# Caution: continuous draws are almost all unique, so this "mode" is just the smallest value
mode = stats.mode(data, keepdims=True)[0][0]
variance = np.var(data, ddof=1)  # Sample variance
std_dev = np.std(data, ddof=1)   # Sample standard deviation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
data_range = np.max(data) - np.min(data)
iqr = q3 - q1

print("Descriptive Statistics:")
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Mode: {mode:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Range: {data_range:.2f}")
print(f"Q1 (25th percentile): {q1:.2f}")
print(f"Q3 (75th percentile): {q3:.2f}")
print(f"IQR (Interquartile Range): {iqr:.2f}")

<details>
<summary><b>Summary</b></summary>

- Generated sample dataset of 1,000 normally distributed values (mean=100, SD=15)
- **Measures of Central Tendency**:
  - Mean: Arithmetic average of all values
  - Median: Middle value when data is sorted
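  - Mode: Most frequently occurring value (informative for discrete or binned data; in continuous data nearly every value is unique)
- **Measures of Spread** (quantify data variability):
  - Variance: Average squared deviation from the mean (the sample version, `ddof=1`, divides by n-1 rather than n)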
  - Standard Deviation: Square root of variance (in original units)
  - Range: Difference between maximum and minimum values
- **Quartiles** (divide data into quarters):
  - Q1 (25th percentile): Marks the lower quarter
  - Q3 (75th percentile): Marks the upper quarter
  - IQR (Interquartile Range = Q3 - Q1): Spread of middle 50%, robust to outliers
- These statistics provide complete picture of data distribution, central location, and variability
- Fundamental for understanding any dataset before advanced analysis

</details>
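
The following is a minimal sketch of two points from the summary above: why the mode is only informative once continuous values are rounded or binned, and how `ddof` changes the variance. The seed, the variable names, and the choice to round to whole numbers are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
sample = rng.normal(100, 15, 1000)

# Continuous draws are almost surely all distinct, so no value repeats.
n_unique = np.unique(sample).size

# Rounding to whole numbers creates repeats, which makes a mode meaningful.
values, counts = np.unique(np.round(sample), return_counts=True)
rounded_mode = values[np.argmax(counts)]

# ddof=0 divides by n (population formula); ddof=1 divides by n - 1 (sample formula).
var_population = np.var(sample, ddof=0)
var_sample = np.var(sample, ddof=1)

print(f"Distinct values: {n_unique} of {sample.size}")
print(f"Mode of rounded data: {rounded_mode:.0f}")
print(f"Variance with ddof=0: {var_population:.2f}, with ddof=1: {var_sample:.2f}")
```

With 1,000 observations the two variance formulas differ by a factor of only 1000/999, so the distinction matters mostly for small samples.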

## 3. Probability Distributions

A probability distribution describes how the values of a random variable are distributed. Let's explore basic probability concepts.

In [None]:
# Example: Rolling a fair die
# Discrete uniform distribution
outcomes = np.arange(1, 7)  # Die faces: 1, 2, 3, 4, 5, 6
probabilities = np.ones(6) / 6  # Each outcome has probability 1/6

# Create a probability mass function (PMF)
plt.figure(figsize=(10, 5))
plt.bar(outcomes, probabilities, color='steelblue', alpha=0.7, edgecolor='black')
plt.xlabel('Die Face')
plt.ylabel('Probability')
plt.title('Probability Mass Function - Fair Die')
plt.xticks(outcomes)
plt.ylim(0, 0.3)
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"Expected value (mean): {np.sum(outcomes * probabilities):.2f}")
print(f"Sum of all probabilities: {np.sum(probabilities):.2f}")

<details>
<summary><b>Summary</b></summary>

- Created probability mass function (PMF) for a fair six-sided die
- Demonstrates **discrete uniform distribution** where each outcome has equal probability (1/6 ≈ 0.1667)
- Visualization shows all six outcomes have identical probabilities, illustrating fairness
- **Expected Value** of 3.5: Theoretical mean outcome if rolling die infinitely many times
  - Calculated as sum of each outcome multiplied by its probability
  - Though you can never roll 3.5 on single throw, crucial for long-term predictions
- All probabilities sum to exactly 1.0, validating proper probability distribution
- **Probability Axiom**: Total probability across all possible outcomes must equal 1 (certainty)
- Establishes core probability concepts: PMFs for discrete variables, probability axioms, and expected values

</details>
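
As a rough check on the expected-value idea, the sketch below computes the mean and variance of a fair die directly from the PMF and compares the mean with a simulated long-run average. The seed and the number of simulated rolls are arbitrary choices, not part of the original example.

```python
import numpy as np

outcomes = np.arange(1, 7)
probabilities = np.full(6, 1 / 6)

# E[X] = sum of x * P(x); Var(X) = sum of (x - E[X])^2 * P(x)
expected_value = np.sum(outcomes * probabilities)
variance = np.sum((outcomes - expected_value) ** 2 * probabilities)

# Simulate many rolls; the upper bound of integers() is exclusive.
rng = np.random.default_rng(0)  # assumed seed
rolls = rng.integers(1, 7, size=100_000)

print(f"Theoretical mean: {expected_value:.4f}")
print(f"Theoretical variance: {variance:.4f}")
print(f"Simulated mean over {rolls.size} rolls: {rolls.mean():.4f}")
```

The simulated mean should land close to 3.5 without matching it exactly, which is what "long-run average" means in practice.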

## 4. Normal Distribution

The normal (Gaussian) distribution is one of the most important probability distributions in statistics. It's characterized by its bell-shaped curve.

In [None]:
# Generate normal distribution
mu = 100  # Mean
sigma = 15  # Standard deviation

x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
pdf = norm.pdf(x, mu, sigma)

# Plot the normal distribution
plt.figure(figsize=(12, 6))
plt.plot(x, pdf, 'b-', linewidth=2, label=f'μ={mu}, σ={sigma}')
plt.fill_between(x, pdf, alpha=0.2)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normal Distribution (Probability Density Function)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Calculate probabilities
prob_below_85 = norm.cdf(85, mu, sigma)
prob_above_115 = 1 - norm.cdf(115, mu, sigma)
prob_between = norm.cdf(115, mu, sigma) - norm.cdf(85, mu, sigma)

print(f"Probability X < 85: {prob_below_85:.4f}")
print(f"Probability X > 115: {prob_above_115:.4f}")
print(f"Probability 85 < X < 115: {prob_between:.4f}")

# Z-scores
z_score_85 = (85 - mu) / sigma
print(f"\nZ-score for 85: {z_score_85:.2f}")

<details>
<summary><b>Summary</b></summary>

- Visualized normal (Gaussian) distribution with mean μ=100 and standard deviation σ=15
- Characteristic **bell-shaped curve** that is symmetric around the mean
- Fundamental in statistics because many natural phenomena approximate normality
- **Cumulative Distribution Function (CDF)**: Probability that random variable ≤ specific value
- Results consistent with **Empirical Rule** (68-95-99.7 rule):
  - ~68% of data falls within one standard deviation of mean (85 to 115)
  - ~16% below 85, ~16% above 115
- **Z-scores**: Standardize values by measuring standard deviations from mean
  - Formula: z = (x - μ) / σ
  - z = -1.0 for 85 means it's one SD below the mean
  - Enables comparison across different scales
  - Allows use of standard normal tables
- Understanding normal distribution essential for hypothesis tests, confidence intervals, and regression analysis

</details>
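
The Empirical Rule percentages quoted above can be reproduced from the CDF. This sketch reuses μ=100 and σ=15 from the cell above; the dictionary name `coverages` is introduced here only for illustration.

```python
from scipy.stats import norm

mu, sigma = 100, 15

# Probability mass within k standard deviations of the mean, for k = 1, 2, 3.
coverages = {}
for k in (1, 2, 3):
    lower, upper = mu - k * sigma, mu + k * sigma
    coverages[k] = norm.cdf(upper, mu, sigma) - norm.cdf(lower, mu, sigma)
    print(f"Within {k} SD ({lower} to {upper}): {coverages[k]:.4f}")

# ppf is the inverse of cdf: it maps a probability back to a value.
x_back = norm.ppf(norm.cdf(85, mu, sigma), mu, sigma)
print(f"Value recovered from its own CDF probability: {x_back:.2f}")
```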

## 5. Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success.

In [None]:
# Example: Flipping a coin 10 times
n = 10  # Number of trials
p = 0.5  # Probability of success (heads)

k = np.arange(0, n+1)
pmf = binom.pmf(k, n, p)

# Plot binomial distribution
plt.figure(figsize=(10, 6))
plt.bar(k, pmf, color='coral', alpha=0.7, edgecolor='black')
plt.xlabel('Number of Heads')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xticks(k)
plt.grid(axis='y', alpha=0.3)
plt.show()

# Calculate specific probabilities
prob_5_heads = binom.pmf(5, n, p)
prob_at_least_7 = 1 - binom.cdf(6, n, p)

print(f"Probability of exactly 5 heads: {prob_5_heads:.4f}")
print(f"Probability of at least 7 heads: {prob_at_least_7:.4f}")
print(f"Expected number of heads: {n * p}")
print(f"Variance: {n * p * (1-p)}")

<details>
<summary><b>Summary</b></summary>

- Modeled binomial experiment: 10 coin flips with probability p=0.5 for heads
- **Binomial Distribution** applies when:
  - Fixed number of independent trials
  - Each trial has same probability of success
- PMF visualization reveals highest probability at 5 heads (expected value = n×p = 10×0.5 = 5)
- Probabilities decrease symmetrically toward extremes (0 or 10 heads)
- **Key Probabilities**:
  - Exactly 5 heads: ~0.246 (approximately 25%)
  - 7 or more heads: ~0.172
- **Variance** = n×p×(1-p) = 2.5, measuring spread around expected value
- **Applications**: Quality control, survey analysis, medical trials, any binary outcome scenarios (success/failure, yes/no)
- Binomial distribution is well approximated by a normal distribution when n is large and p is not too close to 0 or 1 (a common rule of thumb asks for n×p and n×(1-p) to both be at least 5 to 10)
- Illustrates connection between discrete and continuous probability models

</details>
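
To make the PMF less of a black box, here is a small sketch that evaluates the binomial formula P(X = k) = C(n, k) · p^k · (1-p)^(n-k) by hand and compares it with SciPy. The helper name `binomial_pmf` is hypothetical and not part of any library.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

manual_5 = binomial_pmf(5, n, p)
# "At least 7" is the sum of the PMF over k = 7, 8, 9, 10.
manual_at_least_7 = sum(binomial_pmf(k, n, p) for k in range(7, n + 1))

print(f"Manual P(X = 5): {manual_5:.4f}  SciPy: {binom.pmf(5, n, p):.4f}")
print(f"Manual P(X >= 7): {manual_at_least_7:.4f}  SciPy: {binom.sf(6, n, p):.4f}")
```

`binom.sf(6, n, p)` is the survival function, equivalent to `1 - binom.cdf(6, n, p)` used in the cell above.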

## 6. Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space when events occur independently at a constant average rate.

In [None]:
# Example: Average of 3 emails per hour
lambda_param = 3  # Average rate

k = np.arange(0, 15)
pmf_poisson = poisson.pmf(k, lambda_param)

# Plot Poisson distribution
plt.figure(figsize=(10, 6))
plt.bar(k, pmf_poisson, color='lightgreen', alpha=0.7, edgecolor='black')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.title(f'Poisson Distribution (λ={lambda_param})')
plt.xticks(k)
plt.grid(axis='y', alpha=0.3)
plt.show()

# Calculate probabilities
prob_exactly_3 = poisson.pmf(3, lambda_param)
prob_less_than_2 = poisson.cdf(1, lambda_param)
prob_more_than_5 = 1 - poisson.cdf(5, lambda_param)

print(f"Probability of exactly 3 emails: {prob_exactly_3:.4f}")
print(f"Probability of less than 2 emails: {prob_less_than_2:.4f}")
print(f"Probability of more than 5 emails: {prob_more_than_5:.4f}")
print(f"Expected value: {lambda_param}")
print(f"Variance: {lambda_param}")

<details>
<summary><b>Summary</b></summary>

- Illustrated Poisson distribution modeling email arrivals with average rate λ=3 per hour
- Used for **counting rare events** occurring independently over continuous interval (time/space)
- **Key Characteristics**:
  - Events occur independently
  - Average rate (λ) is constant
  - Events cannot occur simultaneously
- Visualization shows right-skewed distribution with mode near λ=3
- **Calculated Probabilities**:
  - Exactly 3 emails: ~22.4%
  - Fewer than 2 emails: ~19.9%
  - More than 5 emails: ~8.4%
- **Unique Property**: Both expected value and variance equal λ (here, both are 3)
- **Applications**: Operations research (call centers, customer traffic), reliability engineering (equipment failures), natural sciences (radioactive decay, mutations)
- Poisson approximates binomial when n is large and p is small (rare events)
- Computationally efficient for modeling unlikely occurrences over many opportunities

</details>
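
The claim that Poisson approximates the binomial for large n and small p can be checked numerically. The values n=1000 and p=0.003 below are assumptions chosen so that n×p matches the λ=3 used above.

```python
import numpy as np
from scipy.stats import binom, poisson

n_trials, p_success = 1000, 0.003  # assumed values, chosen so that n * p = 3
lam = n_trials * p_success

k = np.arange(0, 15)
pmf_binomial = binom.pmf(k, n_trials, p_success)
pmf_poisson = poisson.pmf(k, lam)

# Largest pointwise gap between the two PMFs over the plotted range.
max_abs_diff = np.max(np.abs(pmf_binomial - pmf_poisson))

print(f"lambda = n * p = {lam:.1f}")
print(f"Largest PMF difference for k = 0..14: {max_abs_diff:.6f}")
```

Repeating this with a larger p (for example n=10, p=0.3) should show a visibly larger gap, which is why the approximation is reserved for rare events.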

## 7. Hypothesis Testing

Hypothesis testing is a statistical method to make decisions about population parameters based on sample data.

In [None]:
# One-sample t-test
# H0: Population mean = 100
# H1: Population mean ≠ 100
sample_data = np.random.normal(105, 15, 50)
t_statistic, p_value = stats.ttest_1samp(sample_data, 100)

print("One-Sample T-Test")
print(f"Sample mean: {np.mean(sample_data):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Result: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} at α=0.05\n")

# Two-sample t-test (independent samples; equal variances assumed by default, pass equal_var=False for Welch's test)
group1 = np.random.normal(100, 15, 50)
group2 = np.random.normal(110, 15, 50)
t_stat_2, p_val_2 = stats.ttest_ind(group1, group2)

print("Two-Sample T-Test (Independent)")
print(f"Group 1 mean: {np.mean(group1):.2f}")
print(f"Group 2 mean: {np.mean(group2):.2f}")
print(f"T-statistic: {t_stat_2:.4f}")
print(f"P-value: {p_val_2:.4f}")
print(f"Result: {'Reject H0' if p_val_2 < 0.05 else 'Fail to reject H0'} at α=0.05\n")

# Chi-square test for independence (for a 2x2 table, SciPy applies Yates' continuity correction by default)
observed = np.array([[30, 10], [20, 40]])
chi2_stat, p_val_chi, dof, expected = stats.chi2_contingency(observed)

print("Chi-Square Test for Independence")
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"P-value: {p_val_chi:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Result: {'Reject H0' if p_val_chi < 0.05 else 'Fail to reject H0'} at α=0.05")

<details>
<summary><b>Summary</b></summary>

- Performed three fundamental types of hypothesis tests:
  1. **One-Sample t-Test**:
     - Tests if sample mean differs from hypothesized population value (H₀: μ=100)
     - Useful when testing if group differs from known standard
  2. **Two-Sample Independent t-Test**:
     - Compares means between two separate groups
     - Applicable in experimental designs (treatment vs. control, different populations)
  3. **Chi-Square Test for Independence**:
     - Analyzes categorical data in contingency tables
     - Determines if two categorical variables are related or independent
     - Common in survey analysis and association studies
- **P-value**: Probability of obtaining results at least as extreme as observed, assuming null hypothesis true
- P-values below significance level (typically α=0.05) suggest rejecting null hypothesis
- **Statistical Significance**: Observed data would be unlikely if the null hypothesis were true; this alone does not prove the effect is real or practically important
- **T-statistic**: Measures how many standard errors sample mean is from hypothesized value
- **Chi-square statistic**: Quantifies deviation from expected frequencies under independence
- Essential for making data-driven decisions and drawing valid statistical inferences

</details>
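
The t-statistic and p-value reported by `ttest_1samp` can be rebuilt from their definitions. The sketch below is illustrative only: the seed and variable names are assumptions, and the data are freshly simulated rather than taken from the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # assumed seed
sample = rng.normal(105, 15, 50)
mu0 = 100  # hypothesized population mean under H0

n_obs = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n_obs)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_obs - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```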

## 8. Correlation and Covariance

Correlation and covariance measure the relationship between two variables.

In [None]:
# Generate correlated data
np.random.seed(42)
x = np.random.normal(50, 10, 100)
y = 2 * x + np.random.normal(0, 10, 100)  # Positively correlated
z = -1.5 * x + np.random.normal(100, 15, 100)  # Negatively correlated

# Calculate correlation coefficients
corr_xy = np.corrcoef(x, y)[0, 1]
corr_xz = np.corrcoef(x, z)[0, 1]

# Calculate covariance
cov_xy = np.cov(x, y)[0, 1]
cov_xz = np.cov(x, z)[0, 1]

print(f"Correlation between X and Y: {corr_xy:.4f}")
print(f"Correlation between X and Z: {corr_xz:.4f}")
print(f"Covariance between X and Y: {cov_xy:.2f}")
print(f"Covariance between X and Z: {cov_xz:.2f}")

# Visualize correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(x, y, alpha=0.6, color='blue')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title(f'Positive Correlation (r={corr_xy:.2f})')
axes[0].grid(True, alpha=0.3)

axes[1].scatter(x, z, alpha=0.6, color='red')
axes[1].set_xlabel('X')
axes[1].set_ylabel('Z')
axes[1].set_title(f'Negative Correlation (r={corr_xz:.2f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation matrix
data_df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
corr_matrix = data_df.corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

<details>
<summary><b>Summary</b></summary>

- Demonstrated correlation and covariance measuring **linear relationships** between variables
- **Correlation Coefficients** (Pearson's r):
  - Standardized measure ranging from -1 to +1
  - Near +1: Strong positive linear relationship (both variables increase together)
  - Near -1: Strong negative relationship (one increases, other decreases)
  - Near 0: Weak or no linear relationship
- **Example Results**:
  - Strong positive correlation between X and Y (theoretical r ≈ 0.89 for this setup; the sample value varies slightly)
  - Moderately strong negative correlation between X and Z (theoretical r ≈ -0.71)
- **Covariance**: Same concept in unstandardized units
  - Harder to interpret but useful in certain calculations
- **Scatter Plots**: Visual confirmation of relationships
  - Positive correlation: Upward trend
  - Negative correlation: Downward trend
- **Correlation Matrix**: Comprehensive view of all pairwise relationships in dataset
- **Important Caveats**:
  - Correlation doesn't imply causation
  - Only captures linear relationships
  - Outliers can heavily influence correlation
- Foundational in regression analysis, portfolio theory, data exploration, identifying multicollinearity

</details>
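
Correlation is covariance rescaled by the two standard deviations, r = cov(X, Y) / (s_X · s_Y). The sketch below checks this against `np.corrcoef` and against the value implied by the data-generating model, using a much larger simulated sample than the cell above so that sampling noise is small. The helper `pearson_r`, the seed, and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed
size = 100_000                   # large sample so estimates sit near theory
x_sim = rng.normal(50, 10, size)
y_sim = 2 * x_sim + rng.normal(0, 10, size)
z_sim = -1.5 * x_sim + rng.normal(100, 15, size)

def pearson_r(a, b):
    """Pearson correlation as covariance divided by both standard deviations."""
    return np.cov(a, b)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1))

# For y = b*x + noise: r = b*sd_x / sqrt(b^2 * sd_x^2 + sd_noise^2)
theory_xy = 2 * 10 / np.sqrt(2**2 * 10**2 + 10**2)
theory_xz = -1.5 * 10 / np.sqrt(1.5**2 * 10**2 + 15**2)

print(f"r(X, Y): manual = {pearson_r(x_sim, y_sim):.4f}, theory = {theory_xy:.4f}")
print(f"r(X, Z): manual = {pearson_r(x_sim, z_sim):.4f}, theory = {theory_xz:.4f}")
```

Note that the constant 100 added to Z shifts its mean but has no effect on the correlation.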

## 9. Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, for independent draws from any population with finite variance, whatever the shape of that population.

In [None]:
# Demonstrate Central Limit Theorem
# Start with a non-normal distribution (uniform)
population = np.random.uniform(0, 100, 10000)

# Take many samples and calculate their means
sample_sizes = [5, 10, 30, 100]
n_samples = 1000

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, sample_size in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, sample_size)) 
                    for _ in range(n_samples)]
    
    axes[idx].hist(sample_means, bins=30, density=True, 
                   alpha=0.7, color='skyblue', edgecolor='black')
    
    # Overlay normal distribution
    mu_sampling = np.mean(sample_means)
    sigma_sampling = np.std(sample_means)
    x_range = np.linspace(min(sample_means), max(sample_means), 100)
    axes[idx].plot(x_range, norm.pdf(x_range, mu_sampling, sigma_sampling), 
                   'r-', linewidth=2, label='Normal fit')
    
    axes[idx].set_title(f'Sample Size: {sample_size}')
    axes[idx].set_xlabel('Sample Mean')
    axes[idx].set_ylabel('Density')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Central Limit Theorem Demonstration', fontsize=16, y=1.00)
plt.tight_layout()
plt.show()

print(f"Population mean: {np.mean(population):.2f}")
print(f"Population std: {np.std(population):.2f}")
print(f"\nTheoretical standard error (n=30): {np.std(population)/np.sqrt(30):.2f}")

<details>
<summary><b>Summary</b></summary>

- Illustrated **Central Limit Theorem (CLT)**, one of the most important concepts in statistics
- **CLT Statement**: Distribution of sample means approaches normal distribution as sample size increases, *regardless of original population distribution* (provided its variance is finite)
- **Demonstration**:
  - Sampled from uniform distribution (rectangular, non-normal shape)
  - Sample means become increasingly normal as size grows from n=5 to n=100
  - Small samples (n=5): Wider sampling distribution; for a symmetric population like the uniform it is already roughly bell-shaped
  - By n=30: Very close to normal (n=30 is a common rule of thumb, not a guarantee; strongly skewed populations may need more)
- **Standard Error** (σ/√n):
  - Measures standard deviation of sampling distribution
  - Decreases as sample size increases
  - Explains why larger samples provide more precise estimates
- **Why CLT is Fundamental**:
  - Justifies using normal-based methods (t-tests, confidence intervals) for sufficiently large samples even when population isn't normal
  - Explains why averages are more reliable than individual observations
  - Underlies much of inferential statistics
- Enables probability statements about sample means and construction of confidence intervals
- Forms theoretical foundation for hypothesis testing and estimation procedures

</details>
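
The standard-error formula σ/√n can be compared with the spread of simulated sample means. This is a sketch under assumed settings (seed, 5,000 repetitions per sample size), not a reproduction of the plot above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
population = rng.uniform(0, 100, 10_000)
sigma = population.std()

results = {}
for n in (5, 10, 30, 100):
    # Each row is one sample of size n, drawn with replacement.
    samples = rng.choice(population, size=(5000, n))
    empirical_se = samples.mean(axis=1).std()
    theoretical_se = sigma / np.sqrt(n)
    results[n] = (empirical_se, theoretical_se)
    print(f"n={n:>3}: empirical SE = {empirical_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
```

Quadrupling the sample size should roughly halve the standard error, which is the square-root relationship described above.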

## 10. Confidence Intervals

A confidence interval is a range of values computed from sample data by a procedure that captures the true population parameter in a specified proportion of repeated samples (the confidence level).

In [None]:
# Calculate confidence interval for the mean
sample = np.random.normal(100, 15, 50)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)
n = len(sample)

# 95% Confidence Interval
confidence_level = 0.95
alpha = 1 - confidence_level
df = n - 1  # Degrees of freedom

# t-critical value for 95% CI
t_critical = t.ppf(1 - alpha/2, df)

# Standard error
se = sample_std / np.sqrt(n)

# Margin of error
margin_of_error = t_critical * se

# Confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
print(f"Sample size: {n}")
print(f"Standard error: {se:.2f}")
print(f"T-critical value (α=0.05, df={df}): {t_critical:.4f}")
print(f"Margin of error: {margin_of_error:.2f}")
print(f"\n95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"\nInterpretation: We are 95% confident that the true population mean")
print(f"lies between {ci_lower:.2f} and {ci_upper:.2f}")

# Visualize confidence interval
plt.figure(figsize=(10, 6))
plt.errorbar(1, sample_mean, yerr=margin_of_error, fmt='o', 
             markersize=10, capsize=10, capthick=2, 
             color='darkblue', ecolor='red', linewidth=2)
plt.axhline(y=sample_mean, color='blue', linestyle='--', alpha=0.5, label='Sample Mean')
plt.axhline(y=ci_lower, color='red', linestyle='--', alpha=0.5, label='95% CI Bounds')
plt.axhline(y=ci_upper, color='red', linestyle='--', alpha=0.5)
plt.xlim(0.5, 1.5)
plt.ylabel('Value')
plt.title('95% Confidence Interval for Population Mean')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks([])
plt.show()

<details>
<summary><b>Summary</b></summary>

- Calculated 95% confidence interval for population mean using **interval estimation**
- **Purpose**: Estimate population parameters while quantifying uncertainty
- **95% Confidence Level Interpretation**:
  - If repeated sampling many times, ~95% of constructed intervals contain true population mean
  - NOT that there's 95% probability specific interval contains it
  - True mean either is or isn't in our interval
- **Calculation Components**:
  1. **Standard Error** (SE = s/√n): Measures sampling variability
  2. **t-Critical Value**: From t-distribution (not normal, because we estimated σ with s)
  3. **Margin of Error** (ME = t* × SE): Creates interval width
- **Why t-Distribution?**:
  - Accounts for additional uncertainty when estimating population SD from sample
  - Especially important for small samples
- **Visualization**: Point estimate (sample mean) with error bars showing margin of error
- **Advantages over Point Estimates**:
  - Convey uncertainty explicitly
  - More informative for decision-making
- **Applications**: Research reporting, A/B testing, quality control, any inference from samples to populations
- Wider intervals indicate more uncertainty (reduce by increasing sample size)

</details>
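
The repeated-sampling interpretation can be illustrated by building many intervals from a population whose mean is known and counting how often they cover it. The number of repetitions and the seed below are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)  # assumed seed
true_mean, true_sd = 100, 15
n, n_repeats = 50, 2000

# Each row is an independent sample of size n.
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

t_critical = t.ppf(0.975, df=n - 1)  # two-sided 95% interval
lower = means - t_critical * standard_errors
upper = means + t_critical * standard_errors

# Fraction of intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The observed coverage should sit near 95% but will not equal it exactly, since it is itself an estimate from a finite number of repetitions.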

## 11. Data Visualization for Probability

Visual representations help us understand statistical concepts and distributions better.

In [None]:
# Generate sample data
np.random.seed(42)
data1 = np.random.normal(100, 15, 1000)
data2 = np.random.normal(110, 20, 1000)

# Create comprehensive visualization
fig = plt.figure(figsize=(16, 10))

# 1. Histogram with KDE
ax1 = plt.subplot(2, 3, 1)
plt.hist(data1, bins=30, density=True, alpha=0.7, color='skyblue', edgecolor='black')
from scipy.stats import gaussian_kde
kde = gaussian_kde(data1)
x_range = np.linspace(data1.min(), data1.max(), 100)
plt.plot(x_range, kde(x_range), 'r-', linewidth=2, label='KDE')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram with KDE')
plt.legend()
plt.grid(True, alpha=0.3)

# 2. Box Plot
ax2 = plt.subplot(2, 3, 2)
box_data = [data1, data2]
plt.boxplot(box_data, patch_artist=True,
            boxprops=dict(facecolor='lightblue', alpha=0.7))
plt.xticks([1, 2], ['Data 1', 'Data 2'])  # set labels via xticks; `labels=` is deprecated in Matplotlib 3.9+
plt.ylabel('Value')
plt.title('Box Plot Comparison')
plt.grid(True, alpha=0.3, axis='y')

# 3. Q-Q Plot
ax3 = plt.subplot(2, 3, 3)
stats.probplot(data1, dist="norm", plot=plt)
plt.title('Q-Q Plot (Normal Distribution)')
plt.grid(True, alpha=0.3)

# 4. Violin Plot
ax4 = plt.subplot(2, 3, 4)
df_combined = pd.DataFrame({
    'Value': np.concatenate([data1, data2]),
    'Group': ['Data 1']*len(data1) + ['Data 2']*len(data2)
})
sns.violinplot(data=df_combined, x='Group', y='Value', palette='Set2')
plt.title('Violin Plot')
plt.grid(True, alpha=0.3, axis='y')

# 5. Cumulative Distribution Function
ax5 = plt.subplot(2, 3, 5)
sorted_data = np.sort(data1)
cumulative = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
plt.plot(sorted_data, cumulative, linewidth=2, color='green')
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Function')
plt.grid(True, alpha=0.3)

# 6. Scatter plot with regression line
ax6 = plt.subplot(2, 3, 6)
x_scatter = np.random.normal(50, 10, 100)
y_scatter = 2 * x_scatter + np.random.normal(0, 10, 100)
plt.scatter(x_scatter, y_scatter, alpha=0.6, color='purple')
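# Add regression line (distinct names avoid overwriting `z` and `p` from earlier cells)
slope, intercept = np.polyfit(x_scatter, y_scatter, 1)
x_line = np.sort(x_scatter)
plt.plot(x_line, slope * x_line + intercept, "r-", linewidth=2, label=f'y={slope:.2f}x{intercept:+.2f}')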
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot with Regression Line')
plt.legend()
plt.grid(True, alpha=0.3)

plt.suptitle('Statistical Visualization Dashboard', fontsize=16, y=0.995)
plt.tight_layout()
plt.show()

print("Visualization complete!")

<details>
<summary><b>Summary</b></summary>

- Created comprehensive statistical visualization dashboard with six essential plot types:
  1. **Histogram with KDE** (Kernel Density Estimation):
     - Shows data distribution shape and smoothed probability density
     - Identifies modality and skewness
  2. **Box Plot**:
     - Displays quartiles, median, and outliers (points beyond 1.5×IQR from quartiles)
     - Enables quick comparison of central tendency and spread between groups
  3. **Q-Q Plot** (Quantile-Quantile):
     - Assesses normality by comparing sample vs. theoretical normal quantiles
     - Points following diagonal line indicate normality
     - Deviations suggest non-normality
  4. **Violin Plot**:
     - Combines box plot information with KDE
     - Shows full distribution shape
     - Easier to identify bimodal or multimodal distributions
  5. **Empirical CDF** (Cumulative Distribution Function):
     - Displays cumulative probabilities
     - Useful for finding percentiles and comparing distributions
  6. **Scatter Plot with Regression Line**:
     - Visualizes relationships between variables
     - Fits linear model (y = mx + b) to quantify relationship
- **Together**: Complete exploratory data analysis toolkit
- **Purpose**: Identify patterns, outliers, distribution shapes, relationships, departures from assumptions
- Effective visualization crucial for communicating findings and discovering insights summary statistics might miss
- Each plot type serves specific purposes and provides complementary information

</details>
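
The 1.5×IQR rule that box plots use to flag outliers can be computed directly. This sketch assumes Matplotlib's default whisker setting (`whis=1.5`) and uses freshly simulated data with an assumed seed.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed
data = rng.normal(100, 15, 1000)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: ({lower_fence:.2f}, {upper_fence:.2f})")
print(f"Flagged outliers: {outliers.size} of {data.size}")
```

For normally distributed data only a small fraction of points (under 1% in theory) falls outside the fences, so a much larger share in real data hints at heavy tails or skew.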

## Summary

This notebook covered the fundamental concepts of statistics and probability:

1. **Descriptive Statistics**: Mean, median, mode, variance, standard deviation, quartiles
2. **Probability Distributions**: Understanding PMF and PDF
3. **Normal Distribution**: Bell curve, z-scores, probability calculations
4. **Binomial Distribution**: Discrete probability for fixed trials
5. **Poisson Distribution**: Event modeling over time/space
6. **Hypothesis Testing**: t-tests, chi-square tests, p-values
7. **Correlation & Covariance**: Measuring relationships between variables
8. **Central Limit Theorem**: Sampling distributions
9. **Confidence Intervals**: Estimating population parameters
10. **Data Visualization**: Various plots to understand distributions

These concepts form the foundation of statistical analysis and are widely used in data science, research, and decision-making.