# Descriptive Statistics

**[Mushfikur R. Mahi](https://x.com/mushfikurahmaan)**  
Department of Economics  
Bangladesh National University  
The second week of October.

## Introduction
**Descriptive statistics** is a branch of statistics that focuses on summarizing and describing the features of a dataset. It provides a way to present large amounts of data in a simplified, more digestible form. Descriptive statistics help you understand the basic characteristics of your data without making conclusions beyond the data at hand (which would be inferential statistics). In this section, we will explore the following key topics, which are essential for data science:

1. Mean, Median, Mode
2. Variance, Standard Deviation
3. Skewness and Kurtosis
4. Percentiles and Quartiles

### Mean
In mathematics and statistics, the **mean** is a measure of central tendency, commonly referred to as the "average." It is calculated by summing all the values in a dataset and then dividing by the number of values. The formula for the mean is:

$$
\text{Mean}(\mu) = \frac{\sum_{i=1}^{N} x_i}{N}
$$
Here, 
- $\mu$ = mean
- $x_i$ = Each data point
- $N$ = Total number of observations
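
As a minimal sketch of the formula above, the mean can be computed in a few lines of Python. The function name `mean` and the `scores` list are illustrative choices, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Illustrative data: ten exam scores.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(mean(scores))  # 77.5
```

Python's standard library offers the same calculation as `statistics.mean`.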

### Median
The **median** is another measure of central tendency that represents the middle value of a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values. The formula for the median is:

**If $N$ is odd:**

$$
\text{Median} = \text{Value of } \left(\frac{N+1}{2}\right)\text{th observation}
$$

**If $N$ is even:**

$$
\text{Median} = \frac{\text{Value of } \left(\frac{N}{2}\right)\text{th observation + Value of } \left(\frac{N}{2}+1\right)\text{th observation}}{2}
$$

Here, 
- $N$ = Total number of observations
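
The two cases above can be sketched in Python as follows. The helper name `median` is an illustrative choice; note that Python lists are 0-based, so the $\left(\frac{N+1}{2}\right)$th observation sits at index `n // 2` when $N$ is odd.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)-th observation (1-based) is index mid (0-based).
        return ordered[mid]
    # Even N: average the (N / 2)-th and (N / 2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```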

### Mode
The **mode** is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode is useful for understanding the most common value within the dataset. For ungrouped data there is no formula: the mode is found by counting the frequency of each value. For grouped data (a frequency table with class intervals), the mode is estimated by:

$$
\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h
$$

Where:
- $L$: The lower boundary of the modal class (the class with the highest frequency).
- $f_1$: The frequency of the modal class (the highest frequency).
- $f_0$: The frequency of the class preceding the modal class.
- $f_2$: The frequency of the class succeeding the modal class.
- $h$: The width of the modal class interval (the difference between the upper and lower boundaries of the modal class).
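
A short sketch of both approaches, assuming plain Python lists as input. The function names `modes` and `grouped_mode` and the class frequencies in the usage lines are hypothetical, chosen only to illustrate the counting rule and the grouped-data formula.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency (ungrouped data).

    If all values are distinct, every value is returned; many textbooks
    describe that situation as "no mode".
    """
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


def grouped_mode(L, f0, f1, f2, h):
    """Mode estimate for grouped data: L + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula undefined when 2*f1 - f0 - f2 is zero")
    return L + (f1 - f0) / denominator * h


print(modes([1, 2, 2, 3, 3]))  # [2, 3]

# Hypothetical classes 10-20, 20-30, 30-40 with frequencies 5, 12, 7:
# the modal class is 20-30, so L = 20, f0 = 5, f1 = 12, f2 = 7, h = 10.
print(round(grouped_mode(20, 5, 12, 7, 10), 2))  # 25.83
```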


### Variance
**Variance** measures the spread or dispersion of a set of data points around the mean. It quantifies how much the data points deviate from the mean value. The formula for variance is:

**Population Variance:**
$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
$$
Here,
- $\sigma^2 = \text{Population variance}$

**Sample Variance:**
$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
$$
Here,
- $s^2 = \text{Sample variance}$
- $\bar{x} = \text{Sample mean}$
- $n = \text{Sample size}$
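
The only difference between the two formulas is the divisor, which the following sketch makes explicit. The function names and the `data` list are illustrative assumptions.

```python
def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5
print(population_variance(data))        # 4.0
print(round(sample_variance(data), 4))  # 4.5714
```

The standard library functions `statistics.pvariance` and `statistics.variance` implement the population and sample versions respectively.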


### Standard Deviation
The **standard deviation** is the square root of the variance and provides a measure of how spread out the values in a dataset are relative to the mean. It is expressed in the same units as the original data. The formulas for standard deviation are:

**Population Standard Deviation:**
$$
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
$$

**Sample Standard Deviation:**
$$
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
$$

Here,
- $\sigma = \text{Population standard deviation}$
- $s = \text{Sample standard deviation}$
- $\mu = \text{Population mean}$
- $\bar{x} = \text{Sample mean}$
- $N = \text{Total number of observations (population)}$
- $n = \text{Total number of observations (sample)}$
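
As a brief illustration, Python's standard `statistics` module provides both versions: `pstdev` divides by $N$ and `stdev` divides by $n-1$. The `data` list is an illustrative assumption.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5

sigma = statistics.pstdev(data)  # population standard deviation (divides by N)
s = statistics.stdev(data)       # sample standard deviation (divides by n - 1)

print(sigma)        # 2.0
print(round(s, 4))  # 2.1381

# Each one is the square root of the corresponding variance.
assert math.isclose(sigma, math.sqrt(statistics.pvariance(data)))
assert math.isclose(s, math.sqrt(statistics.variance(data)))
```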

### Skewness
**Skewness** measures the asymmetry of the probability distribution of a real-valued random variable about its mean. It indicates whether data points are skewed to the left (negative skewness) or to the right (positive skewness).

The formulas for skewness are as follows:

**Population Skewness:**
$$
\text{Skewness} (\gamma) = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^3
$$

**Sample Skewness:**
$$
\text{Skewness} (g_1) = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3
$$

Where:
- $\mu$: Population mean
- $\bar{x}$: Sample mean
- $\sigma$: Population standard deviation
- $s$: Sample standard deviation
- $N$: Total number of observations in the population
- $n$: Total number of observations in the sample
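
The two skewness formulas can be sketched directly in Python. The function names and the small datasets are illustrative assumptions; both functions fail with a division by zero if every value is identical, because the standard deviation is then 0.

```python
import math


def population_skewness(values):
    """(1/N) * sum(((x - mu) / sigma)^3), with sigma computed using N."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sum(((x - mu) / sigma) ** 3 for x in values) / n


def sample_skewness(values):
    """n / ((n-1)(n-2)) * sum(((x - x_bar) / s)^3), with s computed using n - 1."""
    n = len(values)
    if n < 3:
        raise ValueError("sample skewness needs at least three observations")
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 20]      # one very large value
left_tailed = [20, 85, 88, 90, 92, 95, 97]  # one very small value

print(abs(population_skewness(symmetric)) < 1e-12)  # True: no skew
print(sample_skewness(right_tailed) > 0)            # True: positive skew
print(sample_skewness(left_tailed) < 0)             # True: negative skew
```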

**Example:**  
In a dataset of test scores, if most students score high and only a few score very low, the low scores pull the mean below the median and the distribution is negatively skewed (long left tail). Conversely, if a few students score exceptionally high while the majority score lower, the distribution is positively skewed (long right tail).

<div style="text-align: center;">
    <img src="1_bHglrUGg4CGLouOfFn9ZJw.png" alt="Description of Image" style="width: 50%;"/>
</div>

<div style="text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*1kI-dYa3Cpnqwy_L8upRJg.jpeg" alt="Description of Image" style="width: 50%;"/>
</div>

### Kurtosis
**Kurtosis** measures the "tailedness" of the probability distribution of a real-valued random variable. It indicates how much of the variance is due to extreme values (outliers) in the distribution. In essence, kurtosis tells us about the shape of the distribution concerning its tails and peak.

The formulas below give the **excess kurtosis**, that is, kurtosis minus 3, so that a normal distribution has a value of 0:

**Population Kurtosis:**

$$
\text{Kurtosis} (\beta) = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu}{\sigma}\right)^4 - 3
$$

**Sample Kurtosis:**
$$
\text{Kurtosis} (g_2) = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
$$

Where:
- $\mu$: Population mean
- $\bar{x}$: Sample mean
- $\sigma$: Population standard deviation
- $s$: Sample standard deviation
- $N$: Total number of observations in the population
- $n$: Total number of observations in the sample
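
A minimal sketch of both kurtosis formulas, assuming plain Python lists. The function names are illustrative; both return excess kurtosis because of the subtracted correction term.

```python
import math


def population_excess_kurtosis(values):
    """(1/N) * sum(((x - mu) / sigma)^4) - 3, with sigma computed using N."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sum(((x - mu) / sigma) ** 4 for x in values) / n - 3


def sample_excess_kurtosis(values):
    """Bias-adjusted sample formula, with s computed using n - 1."""
    n = len(values)
    if n < 4:
        raise ValueError("sample kurtosis needs at least four observations")
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    fourth_powers = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction


data = [1, 2, 3, 4, 5]  # evenly spread values: lighter tails than a normal curve
print(round(population_excess_kurtosis(data), 4))  # -1.3
print(round(sample_excess_kurtosis(data), 4))      # -1.2
```

Both results are negative, which is what one would expect for evenly spaced data with no extreme values.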

**Types of Kurtosis:**

1. **Mesokurtic:** This type has an excess kurtosis close to 0 (equivalently, a raw kurtosis close to 3, before subtracting 3). The distribution resembles the normal distribution and has moderate tails and a moderate peak.  
   **Example:** The normal distribution itself is a classic example of a mesokurtic distribution.

2. **Leptokurtic:** Leptokurtic distributions have positive excess kurtosis (greater than 0). They have heavier tails and a sharper peak than the normal distribution, indicating a higher likelihood of extreme values (outliers).  
   **Example:** Stock returns often display leptokurtic behavior, as they may have more extreme movements than what a normal distribution would suggest.

3. **Platykurtic:** Platykurtic distributions have negative excess kurtosis (less than 0). They are characterized by lighter tails and a flatter peak compared to the normal distribution, indicating fewer extreme values.  
   **Example:** A uniform distribution is an example of a platykurtic distribution, as values are evenly spread out with no peaks.

<div style="text-align: center;">
    <img src="Kurtosis-in-the-normal-distribution.png" alt="Description of Image" style="width: 50%;"/>
</div>

### Percentiles

A **percentile** is a measure that indicates the relative standing of a value in a dataset. The $p$-th percentile of a dataset is the value below which $p\%$ of the observations fall. Percentiles divide the sorted dataset into 100 equal parts.

The $p$-th percentile $P_p$ of a sorted dataset of size $N$ is the value at the position given below, counting from 1. When the position is not a whole number, interpolate linearly between the two neighbouring values. This is one common convention; software packages may use slightly different rules.

$$
\text{Position of } P_p = \frac{p}{100} \times (N + 1)
$$

Where:
- $P_p$ is the $p$-th percentile.
- $p$ is the desired percentile ($0 < p < 100$).
- $N$ is the total number of observations.


**Example:** Consider the following dataset representing the exam scores of 10 students: 

$$ 
[55, 60, 65, 70, 75, 80, 85, 90, 95, 100] 
$$

To find the 70th percentile $P_{70}$:

$$
P_{70} = \text{Value at position } \left(\frac{70}{100} \times (10 + 1)\right) = \text{Value at position } 7.7
$$

Since position 7.7 is between the 7th and 8th values, we interpolate between the 7th and 8th values (85 and 90):

$$
P_{70} = 85 + (0.7 \times (90 - 85)) = 85 + 3.5 = 88.5
$$

**Interpretation:**
- About 70 percent of the observations (here, the seven scores from 55 to 85) fall below 88.5.
- The remaining 30 percent (90, 95, and 100) lie above 88.5.
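
The worked example above can be reproduced with a small Python function. This is a sketch of the $(N + 1)$ position rule with linear interpolation as used in this section; the function name `percentile` and the clamping of out-of-range positions to the first or last value are assumptions made for illustration.

```python
import math


def percentile(values, p):
    """p-th percentile using position p/100 * (N + 1) and linear interpolation."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100          # 1-based and possibly fractional
    position = min(max(position, 1), n)   # assumption: clamp to the data range
    lower = math.floor(position)          # 1-based rank of the lower neighbour
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    low_value, high_value = ordered[lower - 1], ordered[lower]
    return low_value + fraction * (high_value - low_value)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(round(percentile(scores, 70), 2))  # 88.5
```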

### Quartiles

**Quartiles** divide a dataset into four equal parts. The three quartiles are:
- **Q1 (First Quartile):** The 25th percentile, where 25\% of the data lies below.
- **Q2 (Second Quartile):** The 50th percentile, or the median, where 50\% of the data lies below.
- **Q3 (Third Quartile):** The 75th percentile, where 75\% of the data lies below.

Quartiles are computed with the percentile formula, using specific values of $p$:

$$
Q_1 = P_{25}, \quad Q_2 = P_{50}, \quad Q_3 = P_{75}
$$

**Example:** Using the same dataset:

$$
[55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
$$

To find $Q_1$ (the 25th percentile):

$$
Q_1 = \text{Value at position } \left(\frac{25}{100} \times (10 + 1)\right) = \text{Value at position } 2.75
$$

Interpolating between the 2nd and 3rd values (60 and 65):

$$
Q_1 = 60 + (0.75 \times (65 - 60)) = 60 + 3.75 = 63.75
$$

Similarly, for $Q_2$ (the median, 50th percentile):

$$
Q_2 = \text{Value at position } \left(\frac{50}{100} \times (10 + 1)\right) = \text{Value at position } 5.5
$$

Interpolating between the 5th and 6th values (75 and 80):

$$
Q_2 = 75 + (0.5 \times (80 - 75)) = 75 + 2.5 = 77.5
$$

Finally, for $Q_3$ (the 75th percentile):

$$
Q_3 = \text{Value at position } \left(\frac{75}{100} \times (10 + 1)\right) = \text{Value at position } 8.25
$$

Interpolating between the 8th and 9th values (90 and 95):

$$
Q_3 = 90 + (0.25 \times (95 - 90)) = 90 + 1.25 = 91.25
$$

Thus, the quartiles are:
$$
Q_1 = 63.75, \quad Q_2 = 77.5, \quad Q_3 = 91.25
$$
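
As a quick cross-check, Python's standard `statistics.quantiles` function reproduces these values. Its default `method="exclusive"` uses the same $(N + 1)$ position rule as the calculation above; the variable names are illustrative.

```python
import statistics

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")

print(q1, q2, q3)  # 63.75 77.5 91.25
```

Passing `method="inclusive"` applies a different position rule and returns slightly different quartiles, so it is worth checking which convention a given tool uses.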

**Example:**
Imagine a school is analyzing students' test scores to identify students in different performance bands. If a student scores at the 90th percentile, this means they scored higher than 90\% of the other students. Quartiles could be used to classify students into different performance groups (e.g., low, medium, and high performers) based on how their scores compare to the first, second, and third quartiles.

### Conclusion
In conclusion, descriptive statistics is crucial in data analysis as it provides insights into the characteristics of a dataset. By understanding measures like mean, median, mode, variance, standard deviation, skewness, kurtosis, and percentiles, we can better interpret the data and make informed decisions.

### References
1. Sullivan, M. (2019). Statistics: Informed Decisions Using Data. Pearson.
2. Triola, M. F. (2018). Elementary Statistics. Pearson.
3. McClave, J. T., & Sincich, T. (2018). Statistics. Pearson.
4. Freedman, D., Pisani, R., & Purves, R. (2014). Statistics. W. W. Norton & Company.
