## Descriptive Statistics

Descriptive statistics summarize and organize the characteristics of a data set. This tutorial covers key concepts, mathematical background, and numerical examples.

### 1. Measures of Central Tendency

**Measures of central tendency** describe the center of a data set. The three main measures are:

- **Mean** (Arithmetic Average)
- **Median** (Middle Value)
- **Mode** (Most Frequent Value)

#### Mean

The mean is calculated by summing all the values and dividing by the number of values.

$$ \text{Mean} (\mu) = \frac{1}{N} \sum_{i=1}^{N} x_i $$

*Example:*

Given data: 5, 8, 12, 20

$$ \mu = \frac{5 + 8 + 12 + 20}{4} = \frac{45}{4} = 11.25 $$
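The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `arithmetic_mean` is chosen here for illustration and does not come from any library.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [5, 8, 12, 20]
mean_value = arithmetic_mean(data)
print(mean_value)  # 11.25
```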

**Key Properties:**

1. **Affected by Outliers**: The mean can be significantly affected by extreme values (outliers).
2. **Mathematical Manipulation**: The mean is easy to work with algebraically, so it appears in further calculations such as the variance and standard deviation.
3. **Every Data Point Contributes**: Each data point affects the mean.

#### Median

The median is the middle value when the data set is ordered. If the number of observations is odd, the median is the middle value. If even, it is the average of the two middle values.

*Example:*

Given data: 5, 8, 12, 20 (even number of observations)

Ordered data: 5, 8, 12, 20

Median is $\frac{8 + 12}{2} = 10$

Given data: 5, 8, 12 (odd number of observations)

Ordered data: 5, 8, 12

Median is 8
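Both cases (odd and even number of observations) can be handled in one small function. This is an illustrative sketch; the name `median_of` is an assumption made for this example.

```python
def median_of(values):
    """Return the middle value of the sorted data (average of two if even)."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_even = median_of([5, 8, 12, 20])  # average of 8 and 12
median_odd = median_of([5, 8, 12])       # the middle value, 8
print(median_even, median_odd)
```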

**Key Properties:**

1. **Resistant to Outliers**: The median is far less sensitive to outliers and skewed data than the mean.
2. **Positional Measure**: The median depends on the rank order of the values, not on their magnitudes.
3. **Appropriate for Ordinal Data**: Suitable for ordinal data where the mean cannot be defined.
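To see how differently the mean and the median react to an extreme value, the sketch below appends a made-up outlier (200, which is not part of the original example) to the data set and compares both measures before and after. It uses Python's standard `statistics` module.

```python
import statistics

data = [5, 8, 12, 20]
with_outlier = data + [200]  # 200 is a hypothetical extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

# The mean moves a long way toward the outlier; the median barely shifts.
print(f"Mean:   {mean_before} -> {mean_after}")
print(f"Median: {median_before} -> {median_after}")
```

Here the mean jumps from 11.25 to 49, while the median only moves from 10 to 12.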

#### Mode

The mode is the value that appears most frequently in the data set.

*Example:*

Given data: 5, 8, 12, 8, 20

Mode is 8

**Key Properties:**

1. **Multiple Modes**: A data set can have more than one mode (bimodal, multimodal).
2. **Not Affected by Extreme Values**: The mode is not influenced by outliers.
3. **Applicable to Categorical Data**: The mode is the only measure of central tendency that can be used with nominal data.
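Because a data set may have several modes, and because the mode also works for non-numeric categories, it can help to return every most-frequent value rather than just one. The sketch below does this with `collections.Counter`; the function name `modes_of` and the extra data sets are illustrative assumptions.

```python
from collections import Counter


def modes_of(values):
    """Return a list of all values that occur with the highest frequency."""
    if not values:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


single_mode = modes_of([5, 8, 12, 8, 20])                # one mode
two_modes = modes_of([5, 5, 8, 8, 12])                   # hypothetical bimodal data
nominal_mode = modes_of(["red", "blue", "red", "green"])  # hypothetical categories
print(single_mode, two_modes, nominal_mode)
```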

### 2. Measures of Dispersion

**Measures of dispersion** describe the spread of the data. The main measures are:

- **Range**
- **Variance**
- **Standard Deviation**

#### Range

The range is the difference between the maximum and minimum values.

*Example:*

Given data: 5, 8, 12, 20

Range = 20 - 5 = 15

**Key Properties:**

1. **Simple Calculation**: Easy to compute.
2. **Affected by Outliers**: Highly influenced by extreme values.
3. **Ignores Distribution**: Does not provide information about the distribution between the minimum and maximum values.
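The last property can be illustrated with two data sets that share the same minimum and maximum. In this sketch, `value_range` is a name chosen for illustration and the second data set is hypothetical.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


spread_out = [5, 8, 12, 20]   # data set from the example above
clustered = [5, 19, 19, 20]   # hypothetical data with the same extremes

range_spread = value_range(spread_out)
range_clustered = value_range(clustered)
print(range_spread, range_clustered)  # both are 15
```

Both ranges equal 15 even though the values in between are distributed very differently.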

#### Variance

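Variance is the average squared deviation of the values from their mean. The population formula divides the sum of squared deviations by $N$; the sample formula divides by $N-1$ (Bessel's correction), where $\bar{x}$ denotes the sample mean.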

Population Variance:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$

Sample Variance:

$$ s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$

*Example:*

Given data: 5, 8, 12, 20 (Sample)

Mean ($\bar{x}$) = 11.25

$$ s^2 = \frac{1}{3} [(5-11.25)^2 + (8-11.25)^2 + (12-11.25)^2 + (20-11.25)^2] $$

$$ s^2 = \frac{1}{3} [39.0625 + 10.5625 + 0.5625 + 76.5625] $$

$$ s^2 = \frac{1}{3} \times 126.75 = 42.25 $$
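The only difference between the two formulas is the divisor, which the following sketch makes explicit. The function name `variance_of` and its `sample` flag are assumptions made for this example, not a library API.

```python
def variance_of(values, sample=True):
    """Return the sample variance (divisor N-1) or population variance (divisor N)."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("sample variance needs at least two values")
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return squared_deviations / divisor


data = [5, 8, 12, 20]
sample_variance = variance_of(data, sample=True)
population_variance = variance_of(data, sample=False)
print(sample_variance)      # 42.25
print(population_variance)  # 31.6875
```

For the same data, the sample variance is always larger than the population variance because it divides by a smaller number.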

**Key Properties:**

1. **Squared Units**: Variance is expressed in squared units of the original data.
2. **Affected by Outliers**: Sensitive to extreme values.
3. **Foundation for Standard Deviation**: The standard deviation is defined as the square root of the variance.

#### Standard Deviation

Standard deviation is the square root of the variance.

Population Standard Deviation:

$$ \sigma = \sqrt{\sigma^2} $$

Sample Standard Deviation:

$$ s = \sqrt{s^2} $$

*Example:*

Given data: 5, 8, 12, 20 (Sample)

Variance ($s^2$) = 42.25

$$ s = \sqrt{42.25} = 6.5 $$
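As a quick check of the worked example, the sketch below computes the sample standard deviation by hand and compares it with `statistics.stdev` from Python's standard library. The helper name `sample_std` is chosen for illustration.

```python
import math
import statistics


def sample_std(values):
    """Return the square root of the sample variance (divisor N-1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation needs at least two values")
    mean = sum(values) / n
    sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(sample_variance)


data = [5, 8, 12, 20]
manual_std = sample_std(data)
library_std = statistics.stdev(data)  # also uses the N-1 divisor
print(manual_std)  # 6.5
```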

**Key Properties:**

1. **Same Units**: Standard deviation is expressed in the same units as the original data.
2. **Affected by Outliers**: Sensitive to extreme values.
3. **Measure of Spread**: Indicates the typical distance of the data points from the mean.

### 3. Summary

Descriptive statistics provide a summary of data using measures of central tendency and measures of dispersion. The mean, median, and mode describe the center, while range, variance, and standard deviation describe the spread.

By understanding these fundamental concepts, you can effectively summarize and interpret data sets.


In [None]:
# NumPy and SciPy implementation
import numpy as np
from scipy import stats

# Sample data: the worked-example values plus a repeated 8, so the data set has a mode
data = [5, 8, 12, 20, 8]

# Calculate mean
mean = np.mean(data)
print(f"Mean: {mean}")

# Calculate median
median = np.median(data)
print(f"Median: {median}")

# Calculate mode
mode = stats.mode(data)
print(mode)

# Calculate range
data_range = np.ptp(data)
print(f"Range: {data_range}")

# Calculate variance
variance = np.var(data, ddof=1)  # Using ddof=1 for sample variance
print(f"Variance: {variance}")

# Calculate standard deviation
std_deviation = np.std(data, ddof=1)  # Using ddof=1 for sample standard deviation
print(f"Standard Deviation: {std_deviation}")


Mean: 10.6
Median: 8.0
ModeResult(mode=8, count=2)
Range: 15
Variance: 33.8
Standard Deviation: 5.813776741499453


In [None]:
# pandas implementation
import pandas as pd

# Sample data
data = [5, 8, 12, 20, 8]

# Create a Pandas Series
data_series = pd.Series(data)

# Using Series.describe() for a quick summary
summary = data_series.describe()
print("Summary using Series.describe():")
print(summary)

Summary using Series.describe():
count     5.000000
mean     10.600000
std       5.813777
min       5.000000
25%       8.000000
50%       8.000000
75%      12.000000
max      20.000000
dtype: float64
