# **Chapter 2. Fundamentals of Statistics**

## **2.1. Population and Sampling**

In statistics, it is often impractical or impossible to collect data from an entire population. Instead, a sample is taken to make inferences about the population.

![Population_Sample](./images/Population_Sample.webp)

Definitions:
- **Population:** The entire set of items or events under study.
- **Sample:** A subset of the population, selected for analysis.

Why Sampling?
- To save time and resources.
- To allow statistical inference about a population based on a smaller dataset.

Random Sampling:
Simple random sampling gives every member of the population an equal chance of being selected. This reduces selection bias and allows results from the sample to be generalized to the population.

Below is an example of how to generate a population and take a random sample using Python.

In [None]:
import numpy as np

# Generate a population of 100 random integers between 1 and 20
np.random.seed(42)  # For reproducibility
population = np.random.randint(1, 21, size=100)
print("Population:", population)

# Take a random sample of 20 elements from the population
sample_indices = np.random.choice(np.arange(len(population)), size=20, replace=False)
sample = population[sample_indices]
print("Sample:", sample)

# Verify population and sample sizes
print(f"Population size: {len(population)}")
print(f"Sample size: {len(sample)}")
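
To illustrate why random sampling supports generalization, the sketch below draws many random samples of 20 values from a synthetic population and compares the average of the sample means with the population mean. It is only an illustration: the seed, the number of repetitions (`n_repeats`), and the `_demo` variable names are assumptions, and it uses NumPy's `default_rng` generator instead of the legacy `np.random.seed` call.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility

# Synthetic population: 100 integers between 1 and 20 (inclusive)
population_demo = rng.integers(1, 21, size=100)

# Draw many random samples of 20 values, each without replacement
n_repeats = 2000
sample_means = np.array([
    rng.choice(population_demo, size=20, replace=False).mean()
    for _ in range(n_repeats)
])

print(f"Population mean:         {population_demo.mean():.3f}")
print(f"Average of sample means: {sample_means.mean():.3f}")
print(f"Spread of sample means:  {sample_means.std():.3f}")
```

Individual sample means vary, but their average should land close to the population mean, which is what makes a random sample a reasonable stand-in for the population.
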

In [None]:
import matplotlib.pyplot as plt

# Create an array of indices for the population
indices = np.arange(len(population))

# Create a mask to identify sampled elements
sample_mask = np.zeros_like(population, dtype=bool)
sample_mask[sample_indices] = True

# Plotting
plt.figure(figsize=(14, 6))

# Plot the population values
plt.scatter(indices, population, color='gray', alpha=0.7, label='Population')

# Highlight the sampled numbers
plt.scatter(indices[sample_mask], population[sample_mask], color='red', label='Sampled Numbers')

# Optional: Connect the sampled points with vertical lines for emphasis
plt.vlines(indices[sample_mask], ymin=0, ymax=population[sample_mask], colors='red', linestyles='dotted', alpha=0.5)

# Labels and title
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Visualization of Random Sampling from Population')

# Legend
plt.legend()

# Grid for better readability
plt.grid(True, linestyle='--', alpha=0.5)

# Show the plot
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 1</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Store the column "Sepal length" as a NumPy array named `sepal_length`.
3. Take a random sample of 20 values from the `sepal_length` array.
4. Visualize the sampling.

## **2.2. Descriptive Statistics**

Descriptive statistics involves summarizing and organizing data to make it easier to understand. Key aspects include:

- **Central Tendency:** Mean, Median, Mode
- **Variability:** Range, Variance, Standard Deviation
- **Distribution Shape:** Skewness, Kurtosis

These measures help us understand the general behavior of data, making them a foundation for statistical analysis.

### **2.2.1. Measures of Central Tendency**

#### ***2.2.1.1. Mean***

**Definition:** The arithmetic average of the data: the sum of all values divided by the number of values.

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

Where:
- $\mu$: Population mean
- $x_i$: Individual data points
- $N$: Total number of data points

For a sample, the mean is denoted as:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Where:
- $\bar{x}$: Sample mean
- $n$: Number of observations in the sample
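
As a small worked example, the formula can be applied directly with plain Python. The data values are hypothetical and chosen only to keep the arithmetic easy to follow; the second part shows how a single extreme value shifts the mean.

```python
values = [4, 8, 15, 16, 23, 42]  # hypothetical data

# Apply the formula directly: sum the values, then divide by their count
n = len(values)
mean_by_formula = sum(values) / n
print(f"n = {n}, sum = {sum(values)}, mean = {mean_by_formula}")

# A single extreme value pulls the mean noticeably upward
values_with_outlier = values + [1000]
mean_with_outlier = sum(values_with_outlier) / len(values_with_outlier)
print(f"Mean with outlier: {mean_with_outlier:.2f}")
```
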

#### ***2.2.1.2. Geometric Mean***

**Definition:**
The geometric mean is a type of average that indicates the central tendency of a set of numbers by using the product of their values. It is especially useful for sets of positive numbers that are interpreted according to their product and is commonly used in financial analysis, growth rates, and environmental studies.

The geometric mean is defined as the $N$th root of the product of all data points.

$$
\text{Geometric Mean} = \sqrt[N]{\prod_{i=1}^{N} x_i}
$$

Where:
- $N$: Total number of data points (for population)
- $x_i$: Individual data points

For a sample, the geometric mean is similarly defined:

$$
\text{Geometric Mean} = \sqrt[n]{\prod_{i=1}^{n} x_i}
$$

Where:
- $n$: Number of observations in the sample

**Usage:**
- Financial Analysis: Calculating average growth rates over multiple periods.
- Environmental Science: Averaging ratios like pollutant concentrations.
- Biology: Determining average rates of population growth.
- Data Normalization: Useful in multiplicative processes and for data that spans several orders of magnitude.
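
The growth-rate use case can be sketched as follows. The yearly growth factors are hypothetical; the point is that compounding the geometric mean over all periods reproduces the total growth, while the arithmetic mean overstates it.

```python
import math

# Hypothetical yearly growth factors: +10%, +20%, -10%
growth_factors = [1.10, 1.20, 0.90]
n = len(growth_factors)

# Geometric mean: n-th root of the product of the values
total_growth = math.prod(growth_factors)
geometric_mean = total_growth ** (1 / n)

# Arithmetic mean, for comparison
arithmetic_mean = sum(growth_factors) / n

print(f"Total growth factor:  {total_growth:.4f}")
print(f"Geometric mean:       {geometric_mean:.4f}")
print(f"Geometric mean ** n:  {geometric_mean ** n:.4f}")
print(f"Arithmetic mean ** n: {arithmetic_mean ** n:.4f}")
```
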

#### ***2.2.1.3. Harmonic Mean***

**Definition:** The harmonic mean is a type of average, typically used when dealing with rates or ratios. It is especially appropriate when the average of rates is desired, and it tends to mitigate the impact of large outliers, providing a better measure for datasets dominated by small values.

The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the data points.

$$
\text{Harmonic Mean}=\frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}
$$

Where:
- $N$: Total number of data points (for population)
- $x_i$: Individual data points

For a sample, the harmonic mean is similarly defined:

$$
\text{Harmonic Mean}=\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}
$$

Where:

- $n$: Number of observations in the sample

**Usage:**
- Rates and Ratios: Ideal for averaging quantities like speeds, efficiencies, or any rates where the denominator is variable.
- Finance: Useful in calculating average multiples like the Price-to-Earnings (P/E) ratio.
- Engineering: Applicable in scenarios involving parallel systems or resistances.
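
A classic illustration of averaging rates is the average speed over two legs of equal distance. The sketch below uses hypothetical speeds and an assumed leg distance, and checks the harmonic mean from the standard-library `statistics` module against the average speed computed from first principles.

```python
import statistics

# Hypothetical trip: the same distance is driven at 60 km/h and then at 40 km/h
speeds = [60, 40]
distance_per_leg = 120  # km, assumed; any positive value gives the same result

# Average speed from first principles: total distance / total time
total_distance = distance_per_leg * len(speeds)
total_time = sum(distance_per_leg / v for v in speeds)
average_speed = total_distance / total_time

harmonic = statistics.harmonic_mean(speeds)
arithmetic = statistics.mean(speeds)

print(f"Average speed (distance / time): {average_speed}")
print(f"Harmonic mean of speeds:         {harmonic}")
print(f"Arithmetic mean of speeds:       {arithmetic}")
```

The arithmetic mean overstates the average speed because more time is spent on the slower leg; the harmonic mean accounts for this.
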


#### ***2.2.1.4. Median***

**Definitions:** Middle value when data is sorted.
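
A short sketch with hypothetical values shows how the median behaves for odd and even counts, and how little it reacts to an extreme value compared with the mean. It uses the standard-library `statistics` module.

```python
import statistics

odd_count = [7, 1, 5]      # sorted: 1, 5, 7 -> the middle value
even_count = [7, 1, 5, 3]  # sorted: 1, 3, 5, 7 -> average of the two middle values

median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)
print(f"Median (odd count):  {median_odd}")
print(f"Median (even count): {median_even}")

# An extreme value barely moves the median but shifts the mean substantially
with_outlier = [1, 3, 5, 7, 1000]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)
print(f"With outlier: median = {median_outlier}, mean = {mean_outlier}")
```
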

#### ***2.2.1.5. Mode***

**Definitions:** Most frequent value in the data.

Let's compute these using Python.

In [None]:
# Example: Computing the mean, geometric mean, harmonic mean, median, and mode
from scipy import stats

# Mean
population_mean = np.mean(population)
print(f'Population mean: {population_mean}')
sample_mean = np.mean(sample)
print(f'Sample mean: {sample_mean}')

# Geometric mean
population_geometric_mean = stats.gmean(population)
print(f'Population Geometric Mean: {population_geometric_mean:.2f}')
sample_geometric_mean = stats.gmean(sample)
print(f'Sample Geometric Mean: {sample_geometric_mean:.2f}')
    
# Harmonic mean
population_harmonic_mean = stats.hmean(population)
print(f'Population harmonic mean: {population_harmonic_mean}')
sample_harmonic_mean = stats.hmean(sample)
print(f'Sample harmonic mean: {sample_harmonic_mean}')

# Median
population_median = np.median(population)
print(f'Population median: {population_median}')
sample_median = np.median(sample)
print(f'Sample median: {sample_median}')

# Mode
population_mode = stats.mode(population)
print(f'Population mode: {population_mode.mode}, Count: {population_mode.count}')
sample_mode = stats.mode(sample)
print(f'Sample mode: {sample_mode.mode}, Count: {sample_mode.count}')

### **2.2.2. Measures of Variability**

#### ***2.2.2.1. Range***

**Definitions:** Difference between the maximum and minimum values.

In [None]:
# Population Range
population_range_value = np.max(population) - np.min(population)
print(f'Population range: {population_range_value}')

# Sample Range
sample_range_value = np.max(sample) - np.min(sample)
print(f'Sample range: {sample_range_value}')

#### ***2.2.2.2. Variance***

**Definitions:** Average of squared differences from the mean.

*Population Variance*

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

Where:
- $\sigma^2$: Population variance
- $\mu$: Population mean
- $N$: Total population size

*Sample Variance*

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

Where:
- $s^2$: Sample variance
- $\bar{x}$: Sample mean
- $n$: Sample size

**Difference:** The sample variance uses $n-1$ in the denominator (degrees of freedom) to correct for bias when estimating the population variance from a sample.
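
The effect of the $n-1$ denominator can be checked with a small simulation. The sketch below assumes a normal population with a known variance of 4, draws many small samples, and averages the two variance estimates; the seed, sample size, and number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, for reproducibility
true_variance = 4.0             # variance of the simulated population
n = 5                           # a small sample size makes the bias visible
n_trials = 20000

# Each row is one sample of size n drawn from the population
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(n_trials, n))

biased = samples.var(axis=1, ddof=0)    # divides by n
unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1

print(f"True variance:                {true_variance}")
print(f"Average estimate with n:      {biased.mean():.3f}")
print(f"Average estimate with n - 1:  {unbiased.mean():.3f}")
```

Dividing by $n$ underestimates the variance by a factor of roughly $(n-1)/n$ on average, while dividing by $n-1$ averages out close to the true value.
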

In [None]:
# Population Variance
population_variance = np.var(population)
print(f'Population Variance: {population_variance}')

# Sample Variance
sample_variance = np.var(sample, ddof=1)  # ddof=1 applies Bessel's correction
print(f'Sample Variance: {sample_variance}')

#### ***2.2.2.3. Standard Deviation***

**Definitions:** Square root of the variance, measuring spread around the mean.

*Population Standard Deviation*
$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
$$

*Sample Standard Deviation*
$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

**Difference:** Like the sample variance, the sample standard deviation uses $n-1$ in the denominator. This reduces the bias, although $s$ is still not an exactly unbiased estimator of $\sigma$.

In [None]:
# Population Standard Deviation
population_std_dev = np.std(population)
print(f'Population Standard Deviation: {population_std_dev}')

# Sample Standard Deviation
sample_std_dev = np.std(sample, ddof=1)  # ddof=1 applies Bessel's correction
print(f'Sample Standard Deviation: {sample_std_dev}')

#### ***2.2.2.4. Standard Error***

**Definition:** The standard error (SE) of the mean measures how precisely a sample mean estimates the population mean. It is the standard deviation of the sample mean across repeated samples of size $n$.

$$
\text{SE} = \frac{\sigma}{\sqrt{n}}
$$

Where $\sigma$ is the population standard deviation and $n$ is the sample size. When $\sigma$ is unknown, it is estimated by the sample standard deviation $s$:

$$
\text{SE} = \frac{s}{\sqrt{n}}
$$

**Usage:**
- Confidence Intervals: SE is used to construct confidence intervals around the sample mean.
- Hypothesis Testing: SE helps in determining how far the sample mean is from the hypothesized population mean.
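
One way to build intuition for the standard error is to simulate it: draw many samples of the same size, compute each sample mean, and compare the spread of those means with the population standard deviation divided by the square root of the sample size. The population parameters, seed, and trial count below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumed seed, for reproducibility
sigma = 2.0                     # population standard deviation (assumed known)
n = 25                          # size of each sample
n_trials = 10000

# Each row is one sample of size n; take the mean of every sample
samples = rng.normal(loc=10.0, scale=sigma, size=(n_trials, n))
sample_means = samples.mean(axis=1)

theoretical_se = sigma / np.sqrt(n)
empirical_se = sample_means.std(ddof=1)

print(f"Theoretical SE (sigma / sqrt(n)):  {theoretical_se:.4f}")
print(f"Std. dev. of simulated means:      {empirical_se:.4f}")
```
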


In [None]:
# Standard error of the sample mean when the population standard deviation is known
population_std_dev = np.std(population, ddof=0)
sample_size = len(sample)  # SE depends on the sample size n, not the population size
population_se = population_std_dev / np.sqrt(sample_size)
print(f'Standard Error (known population sigma): {population_se}')

# Standard error estimated from the sample alone
sample_std_dev = np.std(sample, ddof=1)  # ddof=1 for sample standard deviation
sample_se = sample_std_dev / np.sqrt(sample_size)
print(f'Standard Error (estimated from sample): {sample_se}')

#### ***2.2.2.5. Coefficient of Variation (Cv)***

**Definition:** The coefficient of variation (Cv) is a standardized measure of dispersion of a probability distribution or frequency distribution. It is the ratio of the standard deviation to the mean, and it is expressed as a percentage.

$$
\text{Cv}=\frac{\sigma}{\mu} \times 100\%
$$

For a sample, it is calculated as:

$$
\text{Cv}=\frac{s}{\bar{x}} \times 100\%
$$

**Usage:**

- Comparing Variability: Cv is useful for comparing the degree of variation between datasets with different units or vastly different means.
- Relative Risk: In fields like finance and biology, Cv helps assess relative risk or variability.
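
The comparison use case can be sketched with two hypothetical datasets measured in different units. Both have the same standard deviation in absolute terms, but very different variability relative to their means. The helper `coefficient_of_variation` is an illustrative function, not part of any library.

```python
import statistics

# Hypothetical measurements in different units
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean, as a percentage."""
    return statistics.stdev(data) / statistics.mean(data) * 100

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"Std. dev.: heights = {statistics.stdev(heights_cm):.2f} cm, "
      f"weights = {statistics.stdev(weights_kg):.2f} kg")
print(f"Cv heights: {cv_heights:.2f}%")
print(f"Cv weights: {cv_weights:.2f}%")
```
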


In [None]:
# Coefficient of Variation for Population
population_mean = np.mean(population)
population_cv = (population_std_dev / population_mean) * 100
print(f'Population Coefficient of Variation: {population_cv:.2f}%')

# Coefficient of Variation for Sample
sample_mean = np.mean(sample)
sample_cv = (sample_std_dev / sample_mean) * 100
print(f'Sample Coefficient of Variation: {sample_cv:.2f}%')


<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 2</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Calculate the range, variance, standard deviation, standard error, and coefficient of variation for all numeric columns of the dataset.

### **2.2.3. Distribution Shape**

#### ***2.2.3.1. Skewness***

**Definitions:** Measures the asymmetry of the data distribution around its mean.

$$
\text{Skewness} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^3}{\sigma^3}
$$

Where:
- Positive skewness: Distribution tails off to the right.
- Negative skewness: Distribution tails off to the left.
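
The skewness formula can be implemented directly to see how the sign follows the direction of the tail. The `skewness` helper and the three small datasets are hypothetical; the helper uses population moments (dividing by $N$), matching the formula above.

```python
import numpy as np

def skewness(data):
    """Third central moment divided by the cubed population standard deviation."""
    x = np.asarray(data, dtype=float)
    mu = x.mean()
    sigma = x.std()  # ddof=0, matching the 1/N formula
    return np.mean((x - mu) ** 3) / sigma ** 3

right_tailed = [1, 2, 2, 3, 10]            # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]   # mirror image of the data above
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)
skew_symmetric = skewness(symmetric)

print(f"Right-tailed: {skew_right:.3f}")
print(f"Left-tailed:  {skew_left:.3f}")
print(f"Symmetric:    {skew_symmetric:.3f}")
```
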

#### ***2.2.3.2. Kurtosis***

**Definitions:** Measures the "tailedness" of the data distribution.

$$
\text{Kurtosis} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^4}{\sigma^4} - 3
$$

Where:
- An excess kurtosis of 0 matches that of a normal distribution (mesokurtic).
- Positive kurtosis: Heavy-tailed distribution (leptokurtic).
- Negative kurtosis: Light-tailed distribution (platykurtic).

The $-3$ adjustment is used to make the kurtosis of a normal distribution equal to 0.
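
As with skewness, the kurtosis formula can be written out directly. The sketch below uses a hypothetical `excess_kurtosis` helper built on population moments and two small made-up datasets: evenly spread values, and values concentrated at the center with a few extremes.

```python
import numpy as np

def excess_kurtosis(data):
    """Fourth central moment over the squared population variance, minus 3."""
    x = np.asarray(data, dtype=float)
    mu = x.mean()
    sigma = x.std()  # ddof=0, matching the 1/N formula
    return np.mean((x - mu) ** 4) / sigma ** 4 - 3

evenly_spread = [1, 2, 3, 4, 5]                # no pronounced tails
heavy_tailed = [-10, -1, 0, 0, 0, 0, 1, 10]    # mostly central values, two extremes

kurt_even = excess_kurtosis(evenly_spread)
kurt_heavy = excess_kurtosis(heavy_tailed)

print(f"Evenly spread data: {kurt_even:.3f}")
print(f"Heavy-tailed data:  {kurt_heavy:.3f}")
```
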

Let's compute these using Python.

In [None]:
# Example: Computing Skewness and Kurtosis
# Skewness
skewness = stats.skew(sample)
print(f'Skewness: {skewness}')

# Kurtosis
kurtosis = stats.kurtosis(sample)
print(f'Kurtosis: {kurtosis}')

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 3</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Calculate the skewness and kurtosis for all numeric columns of the dataset.

### **2.2.4. Visualizing Data**

Visualizations help us better understand the data distribution. Common plots include:
- **Histogram:** Shows frequency of data points in intervals.
- **Box Plot:** Highlights spread, median, and outliers.
- **Scatter Plot:** Shows relationships between variables.

#### ***2.2.4.1. Histogram***

In [None]:
# Example: Histogram for sample
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
plt.hist(sample, bins=5, alpha=0.7, color='blue', edgecolor='black')
plt.title('Sample Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Example: Histogram for sample and population
from scipy.stats import norm

# Create figure and axes for subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Histogram of Population with Normal Curve
ax1 = axes[0]
sns.histplot(population, bins=range(1, 22), kde=False, color='skyblue', stat='density', ax=ax1, edgecolor='black')
ax1.axvline(population_mean, color='red', linestyle='dashed', linewidth=1.5, label=f'Mean: {population_mean:.2f}')
ax1.axvline(population_median, color='green', linestyle='dashed', linewidth=1.5, label=f'Median: {population_median}')
ax1.set_title('Population Distribution')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')

# Overlay Normal Distribution Curve for Population
x_population = np.linspace(min(population)-1, max(population)+1, 1000)
y_population = norm.pdf(x_population, population_mean, population_std_dev)
ax1.plot(x_population, y_population, color='blue', linewidth=2, label='Normal Curve')

# Add Standard Deviation Line and Text for Population
x_std_right = population_mean + population_std_dev
y_std_right = norm.pdf(x_std_right, population_mean, population_std_dev)
ax1.hlines(y=y_std_right, xmin=population_mean, xmax=x_std_right, colors='purple', linewidth=2)
ax1.vlines(x=x_std_right, ymin=0, ymax=y_std_right, colors='purple', linestyles='dotted', linewidth=1)
ax1.annotate('σ', xy=((population_mean + x_std_right)/2, y_std_right), color='purple', fontsize=14, ha='center', va='bottom')
ax1.legend()

# Histogram of Sample with Normal Curve
ax2 = axes[1]
sns.histplot(sample, bins=range(1, 22), kde=False, color='orange', stat='density', ax=ax2, edgecolor='black')
ax2.axvline(sample_mean, color='red', linestyle='dashed', linewidth=1.5, label=f'Mean: {sample_mean:.2f}')
ax2.axvline(sample_median, color='green', linestyle='dashed', linewidth=1.5, label=f'Median: {sample_median}')
ax2.set_title('Sample Distribution')
ax2.set_xlabel('Value')
ax2.set_ylabel('Density')

# Overlay Normal Distribution Curve for Sample
x_sample = np.linspace(min(sample)-1, max(sample)+1, 1000)
y_sample = norm.pdf(x_sample, sample_mean, sample_std_dev)
ax2.plot(x_sample, y_sample, color='blue', linewidth=2, label='Normal Curve')

# Add Standard Deviation Line and Text for Sample
x_std_right_sample = sample_mean + sample_std_dev 
y_std_right_sample = norm.pdf(x_std_right_sample, sample_mean, sample_std_dev)
ax2.hlines(y=y_std_right_sample, xmin=sample_mean, xmax=x_std_right_sample, colors='purple', linewidth=2)
ax2.vlines(x=x_std_right_sample, ymin=0, ymax=y_std_right_sample, colors='purple', linestyles='dotted', linewidth=1)
ax2.annotate('σ', xy=((sample_mean + x_std_right_sample)/2, y_std_right_sample), color='purple', fontsize=14, ha='center', va='bottom')
ax2.legend()

# Adjust layout and show the figure
plt.tight_layout()
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 4</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Plot the histogram of all numeric columns in a 2×2 grid.

#### ***2.2.4.2. Box Plot***

In [None]:
# Example: Box plot of sample
sns.boxplot(data=sample, color='lightblue')
plt.title('Sample Box Plot')
plt.show()

In [None]:
# Example: Box plot of sample and population
# Visualization setup (set the style before creating the axes so it takes effect)
sns.set(style='whitegrid')

# Create figure and axes for subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Boxplot of Population
ax1 = axes[0]
sns.boxplot(y=population, color='skyblue', ax=ax1)
ax1.set_title('Population box plot')
ax1.set_ylabel('Value')
# Annotate statistics
ax1.text(-0.2, population_mean, f'Mean: {population_mean:.2f}', verticalalignment='center', color='red')
ax1.text(0.1, population_median, f'Median: {population_median}', verticalalignment='center', color='green')

# Boxplot of Sample
ax2 = axes[1]
sns.boxplot(y=sample, color='orange', ax=ax2)
ax2.set_title('Sample box plot')
ax2.set_ylabel('Value')
# Annotate statistics
ax2.text(-0.2, sample_mean, f'Mean: {sample_mean:.2f}', verticalalignment='center', color='red')
ax2.text(0.1, sample_median, f'Median: {sample_median}', verticalalignment='center', color='green')

# Adjust layout
plt.tight_layout()
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 5</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Plot the boxplot of all numeric columns in a 2×2 grid.

#### ***2.2.4.3. Scatter Plot***

In [None]:
# Example: Scatter plot of sample vs squared sample
squared_sample = np.square(sample)
sns.scatterplot(x=sample, y=squared_sample, color='red')
plt.title('Sample Square Scatter Plot')
plt.show()

<p style="background-color: lightgreen; text-align: center; font-size: 18px; color: red; padding: 5px; border-radius: 10px;"><b>Exercise 6</b></p>

Write Python code to:
1. Load the `IrisFlower.csv` file into a dataframe.
2. Draw the scatter plot of `sepal length` against `sepal width`, using a different color for each species.