# Measures of Central Tendency: Mean, Median, and Mode

Measures of central tendency are single values that attempt to describe the "center" or "typical" value of a dataset. They provide a summary statistic that represents the central point around which the data tends to cluster.

We can use measures of central tendency to:

* **Summarize Data:** Providing a concise description of a dataset's typical value.
* **Compare Datasets:** Comparing the centers of different datasets to see if they differ significantly.
* **Understand Distributions:** Helping to understand the shape and characteristics of a data distribution.
* **Make Predictions:** Using the central tendency as a simple prediction for future values (though this is often too simplistic).
* **Analyze Data:** Getting a first sense of typical values before further processing or modeling.

---

## Mean (Arithmetic Mean, Average)

The mean is the sum of all values divided by the number of values. It works best for data that is *symmetrically distributed* (like a bell curve) and *doesn't have extreme outliers*.

Because the mean uses every data point, it is mathematically convenient and reflects the whole dataset. However, it's *sensitive to outliers*: a single extreme value can pull the mean far away from where most of the data lies.

**Formula (Population):**  

$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

* $\mu$ (mu): Population mean
* $x_i$: Individual data points
* $N$: Total number of data points in the population
* $\Sigma$ (sigma): Summation (adding up all the values)

**Formula (Sample):**  

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

* $\bar{x}$ (x-bar): Sample mean
* $x_i$: Individual data points
* $n$: Total number of data points in the sample

In [2]:
import numpy as np

data = np.array([1, 2, 4, 6, 7])
mean = np.mean(data)

print(f"Mean: {mean}")

Mean: 4.0
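
To make the outlier sensitivity concrete, the sketch below appends one extreme value to the same array and recomputes the mean. The value 100 is an arbitrary choice for illustration, not part of the original data.

```python
import numpy as np

data = np.array([1, 2, 4, 6, 7])
# Append a single extreme value (100 is an arbitrary illustrative outlier).
data_with_outlier = np.append(data, 100)

mean_clean = np.mean(data)                 # (1 + 2 + 4 + 6 + 7) / 5
mean_outlier = np.mean(data_with_outlier)  # (20 + 100) / 6

print(f"Mean without outlier: {mean_clean}")
print(f"Mean with outlier:    {mean_outlier}")
```

A single added value moves the mean from 4.0 to 20.0, which is larger than five of the six observations.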


### Weighted Mean
The weighted mean is used when data points have different levels of importance. A common case is combining averages from multiple groups of *different* sizes, where each group's average should count in proportion to the group's size.

**Formula:**  

$\bar{x}_w = \frac{\sum_{i=1}^{n} w_ix_i}{\sum_{i=1}^{n} w_i}$

*   $x_i$: Value
*   $w_i$: Weight

In [4]:
values = np.array([8.4, 6.1, 9.1, 7.8])
weights = np.array([20, 7, 13, 25])
weighted_mean = np.average(values, weights=weights)

print(f"Weighted Mean: {weighted_mean}") 

Weighted Mean: 8.061538461538461
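
The group-size case can be sketched as follows. The two groups and their scores are hypothetical; the point is only that weighting each group mean by its size reproduces the mean of all raw values, while a plain average of the group means does not.

```python
import numpy as np

# Hypothetical test scores from two groups of different sizes.
group_a = np.array([70, 80])                  # 2 values, mean 75.0
group_b = np.array([90, 90, 90, 90, 90, 90])  # 6 values, mean 90.0

group_means = np.array([group_a.mean(), group_b.mean()])
group_sizes = np.array([group_a.size, group_b.size])

naive_mean = group_means.mean()                               # ignores group sizes
weighted_mean = np.average(group_means, weights=group_sizes)  # weights by group size
pooled_mean = np.concatenate([group_a, group_b]).mean()       # mean of all raw values

print(f"Mean of group means: {naive_mean}")
print(f"Weighted mean:       {weighted_mean}")
print(f"Pooled mean:         {pooled_mean}")
```

Here the unweighted average of the two group means is 82.5, while the weighted mean and the pooled mean both come out to 86.25.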


### Truncated Mean

The truncated (or trimmed) mean is used when the dataset might contain outliers. A fixed share of the smallest and largest values is *removed* before calculating the mean. Always state the *percentage of data removed*, and whether that percentage applies to each end or to the total.

In [5]:
from scipy import stats  # SciPy has a convenient function for this

data = np.array([16, 18, 21, 27, 32, 32, 33, 91])
truncated_mean = stats.trim_mean(data, 0.25)  # Remove 25% from each end (50% in total)

print(f"Truncated Mean: {truncated_mean}")

Truncated Mean: 28.0
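
To show what the trimming step does, here is a minimal hand-rolled sketch. The helper name `trimmed_mean` is hypothetical, and it assumes the cut fraction applies to each tail and that the number of dropped values is rounded down, which is consistent with the 28.0 result above.

```python
import numpy as np

def trimmed_mean(values, proportion_to_cut):
    """Mean after dropping a fraction of the sorted values from EACH end."""
    if not 0 <= proportion_to_cut < 0.5:
        raise ValueError("proportion_to_cut must be in [0, 0.5)")
    ordered = np.sort(np.asarray(values, dtype=float))
    # Number of values dropped per tail, rounded down.
    k = int(proportion_to_cut * ordered.size)
    kept = ordered[k:ordered.size - k]
    return kept.mean()

data = np.array([16, 18, 21, 27, 32, 32, 33, 91])
result = trimmed_mean(data, 0.25)  # drops 16, 18 and 33, 91; keeps [21, 27, 32, 32]
print(f"Manual truncated mean: {result}")
```

With eight values and a cut of 0.25, two values are dropped from each end, so the mean is taken over the middle four. The plain mean of the same data is 33.75, pulled up by the value 91.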


---

## Median
The median is the middle value when the data is sorted in ascending order. It's best for data that is *skewed* (not symmetrical) or has *outliers*. It's also useful when the exact numerical values are less important than the relative position (e.g., ranking).

The advantage of using the median is that it is *resistant to outliers*. However, it doesn't use all of the information in the data (only the middle value(s)).

*  **Odd number of data points:** The median is the middle value.
*  **Even number of data points:** The median is the average of the two middle values.

In [6]:
data_odd = np.array([1, 2, 4, 6, 7])
median_odd = np.median(data_odd)
print(f"Median (Odd): {median_odd}")


data_even = np.array([1, 2, 4, 6, 7, 9])
median_even = np.median(data_even)
print(f"Median (Even): {median_even}") 

Median (Odd): 4.0
Median (Even): 5.0
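
The odd and even rules can also be written out by hand. This is a minimal pure-Python sketch; the function name `median_by_hand` is hypothetical and the inputs are the same values as above, deliberately shuffled to show that sorting comes first.

```python
def median_by_hand(values):
    """Median via the sort-then-pick-the-middle rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return float(ordered[mid])
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand([7, 1, 6, 2, 4]))     # odd number of values
print(median_by_hand([9, 1, 7, 2, 6, 4]))  # even number of values
```

Both calls should agree with the `np.median` results above (4.0 and 5.0).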


---

## Mode
The mode is the value that occurs *most frequently* in the dataset. It's most useful for *categorical data* (e.g., finding the most popular color) or discrete data with a limited number of values. It can be used for numerical data, but it is often less informative than the mean or median.

Its advantages are that it's easy to understand and applicable to non-numerical data. However, it might not be unique (a dataset can have multiple modes), may not exist (if all values are unique), and may not be a good representation of the "center".

In [14]:
data = np.array([1, 2, 4, 4, 4, 6, 7])
mode_result = stats.mode(data)  # Returns both mode and count
print(f"Mode: {mode_result.mode}, Count: {mode_result.count}")  # Output: Mode: 4, Count: 3

data_no_mode = np.array([1, 2, 3, 4, 5])
mode_result_no_mode = stats.mode(data_no_mode)
# When every value is unique there is no real mode; SciPy returns the smallest value and a count of 1.
print(f"Mode (No Mode): {mode_result_no_mode.mode}, Count: {mode_result_no_mode.count}") # Output: Mode (No Mode): 1, Count: 1

data_bimodal = np.array([1, 1, 2, 3, 3, 4])
mode_result_bimodal = stats.mode(data_bimodal)
print(f"Mode (Bimodal - SciPy): {mode_result_bimodal.mode}, Count: {mode_result_bimodal.count}") # Output: Mode (Bimodal - SciPy): 1, Count: 2

# Note: SciPy's mode only returns *one* mode (the smallest) even if there are multiple.
# For true multimodal detection, you'd need a different approach.

Mode: 4, Count: 3
Mode (No Mode): 1, Count: 1
Mode (Bimodal - SciPy): 1, Count: 2
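
One possible approach to finding every mode is to count occurrences and keep all values that share the highest count. The sketch below uses `collections.Counter` from the standard library; the helper name `all_modes` and the color list are hypothetical. Because it only counts values, it also works for categorical data.

```python
from collections import Counter

def all_modes(values):
    """Return (modes, count): every value tied for the highest frequency."""
    counts = Counter(values)
    if not counts:
        return [], 0
    top = max(counts.values())
    # Counter keeps first-seen order, so modes are listed in order of appearance.
    modes = [value for value, count in counts.items() if count == top]
    return modes, top

print(all_modes([1, 1, 2, 3, 3, 4]))                        # two values tied
print(all_modes(["red", "blue", "red", "green", "blue"]))   # categorical data
print(all_modes([1, 2, 3, 4, 5]))                           # every value appears once
```

When the returned count is 1, every value is unique, and a caller can treat that as "no mode". Python's standard library also provides `statistics.multimode`, which returns a list of the most frequent values.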


---

## Choosing the Right Measure
When deciding which measure best describes the center of your data, first ask, _"Can it be calculated for this type of data?"_

* To calculate the mean, you need numerical data.
* To calculate the median, you need data that can be ordered (ordinal or numerical).
* The mode can be used with any data type (nominal, ordinal, numerical).

After this filtering step, you should decide which measure is more appropriate for your data. Here are some things to consider:

*   **Symmetrical data, no outliers:** Mean is usually best.
*   **Skewed data or outliers:** Median is usually better.
*   **Categorical data:** Mode is often the only meaningful measure.
*   **Specific Context:** Sometimes, the specific question you're asking dictates the best measure (e.g., median household income is preferred over mean because of the skew in income distributions).
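
The income case can be illustrated with a small made-up example. The figures below are hypothetical household incomes in thousands, chosen so that one very high value skews the distribution to the right.

```python
import numpy as np

# Hypothetical annual household incomes (in thousands); one very high earner.
incomes = np.array([30, 35, 40, 45, 50, 1000])

mean_income = np.mean(incomes)      # 1200 / 6
median_income = np.median(incomes)  # average of the two middle values, 40 and 45

print(f"Mean income:   {mean_income}")
print(f"Median income: {median_income}")
```

The mean of 200.0 is higher than five of the six incomes, while the median of 42.5 sits among the typical households, which is why the median is usually reported for skewed quantities like income.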