# Central Tendency - Finding the "Middle" of Your Data

When analyzing data, one of the first questions is: **What's a typical value?**

Central tendency gives us different ways to answer this:
- **Mean** - The average of all values
- **Median** - The middle value when sorted
- **Mode** - The most frequently occurring value

Let's see each one in action with our sales data.

In [1]:
import pandas as pd

In [2]:
toyota_sales = pd.read_csv("data/car_sales/toyota_sales_data.csv")

In [None]:
toyota_sales.head()

   sale_id  sale_rep_id   sale_date car_model  sale_amount  commission_pct sale_status
0        1           16  2024-11-18    Tundra     44496.88            0.05   Completed
1        2           11  2024-11-08    Tacoma     34824.72             NaN     Pending
2        3            5  2024-11-03   Corolla     20275.08             NaN   Completed
3        4           20  2024-11-06   Corolla     20068.93             NaN   Completed
4        5            1  2024-11-26    Tundra     49811.99            0.03   Completed


## 1. Mean - The Average

The mean adds up all values and divides by the count.

**When to use:** Symmetric data without extreme outliers
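
The definition above can be checked by hand. Here is a minimal sketch, using a small made-up list of amounts (not taken from the dataset), that computes the mean as sum divided by count and compares it with what pandas returns:

```python
import pandas as pd

# Hypothetical sale amounts, chosen only for illustration
amounts = [20000.0, 25000.0, 30000.0, 35000.0, 40000.0]

# Mean by definition: the sum of the values divided by how many there are
manual_mean = sum(amounts) / len(amounts)

# pandas computes the same quantity
pandas_mean = pd.Series(amounts).mean()

print(f"Manual mean: ${manual_mean:,.2f}")
print(f"pandas mean: ${pandas_mean:,.2f}")
```

Both lines should print the same amount, since `Series.mean()` implements the same sum-over-count definition.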

In [4]:
mean_sale = toyota_sales["sale_amount"].mean()

In [5]:
print(f"Mean sale amount: ${mean_sale:,.2f}")

Mean sale amount: $32,979.71


In [6]:
# Mean by car model
(
    toyota_sales
    .groupby('car_model')['sale_amount']
    .mean()
    .sort_values()
)

car_model
Corolla       22442.176808
Camry         27470.874825
RAV4          30985.724349
Tacoma        34998.464068
Highlander    40002.871658
Tundra        42478.580428
Name: sale_amount, dtype: float64

## 2. Median - The Middle Value

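The median is the middle value once the data is sorted: half the values fall at or below it and half at or above it. With an even number of values, it is the average of the two middle ones.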

**When to use:** Data with outliers or skewed distributions
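
To make the "middle value" idea concrete, here is a small sketch that computes the median by hand for an odd and an even number of values. The helper `manual_median` and the sample lists are hypothetical, written only for this illustration; in practice `Series.median()` does the work.

```python
import pandas as pd

def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical samples, deliberately unsorted
odd_sample = [35000, 20000, 30000]
even_sample = [35000, 20000, 30000, 25000]

odd_median = manual_median(odd_sample)
even_median = manual_median(even_sample)

print(f"Odd count:  {odd_median:,.0f} (pandas: {pd.Series(odd_sample).median():,.0f})")
print(f"Even count: {even_median:,.0f} (pandas: {pd.Series(even_sample).median():,.0f})")
```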

In [7]:
median_sale = toyota_sales['sale_amount'].median()

In [8]:
print(f"Median sale amount: ${median_sale:,.2f}")

Median sale amount: $32,613.76


In [9]:
# Compare with mean
print(f"Mean sale amount: ${mean_sale:,.2f}")
print(f"Difference: ${abs(mean_sale - median_sale):,.2f}")

Mean sale amount: $32,979.71
Difference: $365.94


In [11]:
# Show why median resists outliers
sample_sales = [20000, 25000, 30000, 35000, 40000]
sample_with_outlier = [20000, 25000, 30000, 35000, 500000]

In [12]:
print("Without outlier:")
print(f"  Mean: ${pd.Series(sample_sales).mean():,.0f}")
print(f"  Median: ${pd.Series(sample_sales).median():,.0f}")

Without outlier:
  Mean: $30,000
  Median: $30,000


In [13]:
print("\nWith $500k outlier:")
print(f"  Mean: ${pd.Series(sample_with_outlier).mean():,.0f}")
print(f"  Median: ${pd.Series(sample_with_outlier).median():,.0f}")


With $500k outlier:
  Mean: $122,000
  Median: $30,000


## 3. Mode - The Most Common Value

The mode is the value that appears most frequently.

**When to use:** Categorical data or finding the most popular item
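
One detail worth knowing before the cells below: `mode()` returns a Series rather than a single value, because several values can tie for most frequent. Indexing with `[0]` takes the first of them. A minimal sketch with a made-up Series (not the sales data) that contains a tie:

```python
import pandas as pd

# Hypothetical data: RAV4 and Camry both appear twice
models = pd.Series(["RAV4", "Camry", "RAV4", "Camry", "Corolla"])

# mode() returns every value tied for the highest count, in sorted order
modes = models.mode()
print(modes.tolist())

# [0] keeps only the first of the tied values
first_mode = modes[0]
print(first_mode)
```

When a tie is possible, checking `len(modes)` or looking at `value_counts()` shows whether `[0]` is hiding other equally common values.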

In [18]:
toyota_sales['car_model'].mode()[0]

'RAV4'

In [19]:
# Mode for car model
mode_car = toyota_sales['car_model'].mode()[0]
print(f"Most popular car model: {mode_car}")

Most popular car model: RAV4


In [20]:
# Show the counts
toyota_sales['car_model'].value_counts()

car_model
RAV4          860
Camry         856
Corolla       827
Tacoma        826
Tundra        817
Highlander    814
Name: count, dtype: int64

In [21]:
# Mode for commission percentage
mode_commission = toyota_sales['commission_pct'].mode()[0]
print(f"Most common commission rate: {mode_commission * 100}%")

Most common commission rate: 5.0%


In [22]:
toyota_sales['commission_pct'].value_counts()

commission_pct
0.05    1288
0.03    1250
0.02    1188
Name: count, dtype: int64
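
The commission counts above add up to 3,726, while the car model counts add up to 5,000, which suggests that a sizeable share of rows have no commission value (the blank cells in `head()` point the same way). By default, `mean()`, `median()`, `mode()`, and `value_counts()` all skip missing values. A small sketch on a made-up Series shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical commission rates with two missing entries
rates = pd.Series([0.05, np.nan, 0.03, np.nan, 0.05])

# Defaults: missing values are ignored
mean_skip = rates.mean()        # average of the three non-missing values
median_skip = rates.median()
mode_skip = rates.mode()[0]

# Opting out of skipping makes the mean NaN
mean_with_nan = rates.mean(skipna=False)

# Count the gaps explicitly, and include them in the frequency table
n_missing = rates.isna().sum()
print(f"Missing values: {n_missing}")
print(rates.value_counts(dropna=False))
```

Whether skipping is appropriate depends on what a missing commission means in this dataset, which the data alone does not tell us.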

In [23]:
# All three measures for different data types
commission = toyota_sales['commission_pct']

print("=== Sale Amount (Numerical) ===")
print(f"Mean:   ${toyota_sales['sale_amount'].mean():,.2f}")
print(f"Median: ${toyota_sales['sale_amount'].median():,.2f}")
print("Mode:   Not meaningful (all unique values)")

print("\n=== Car Model (Categorical) ===")
print("Mean:   Not applicable (can't average text)")
print("Median: Not applicable (can't sort meaningfully)")
print(f"Mode:   {toyota_sales['car_model'].mode()[0]} (best seller)")

print("\n=== Commission % (Discrete Numerical) ===")
# Percentages are computed from the data rather than typed in by hand
print(f"Mean:   {commission.mean():.4f} ({commission.mean():.2%})")
print(f"Median: {commission.median():.4f} ({commission.median():.0%})")
print(f"Mode:   {commission.mode()[0]:.2f} ({commission.mode()[0]:.0%})")

=== Sale Amount (Numerical) ===
Mean:   $32,979.71
Median: $32,613.76
Mode:   Not meaningful (all unique values)

=== Car Model (Categorical) ===
Mean:   Not applicable (can't average text)
Median: Not applicable (can't sort meaningfully)
Mode:   RAV4 (best seller)

=== Commission % (Discrete Numerical) ===
Mean:   0.0337 (3.37%)
Median: 0.0300 (3%)
Mode:   0.05 (5%)


In [24]:
# Summary by car model
summary = (
    toyota_sales
    .groupby('car_model')['sale_amount']
    .agg(['mean', 'median', 'count'])
)

summary.columns = ['Mean', 'Median', 'Sales Count']

summary.sort_values('Mean')

                    Mean     Median  Sales Count
car_model
Corolla     22442.176808   22477.28          827
Camry       27470.874825  27448.575          856
RAV4        30985.724349   30922.64          860
Tacoma      34998.464068  34989.815          826
Highlander  40002.871658   39988.89          814
Tundra      42478.580428   42306.67          817


## Summary: Central Tendency

**Three ways to find the "center":**
- **Mean** - Average of all values (affected by outliers)
- **Median** - Middle value (resistant to outliers)  
- **Mode** - Most frequent value (best for categorical data)
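
As a recap in code, here is a rough sketch of a helper that reports only the measures that make sense for a column's type. The function `describe_center` is hypothetical (it is not part of pandas), and the sample Series are made up for illustration.

```python
import pandas as pd

def describe_center(series):
    """Hypothetical helper: return the measures of center that apply to a Series."""
    # The mode works for any data type; keep every tied value
    result = {"mode": series.mode().tolist()}
    # Mean and median only make sense for numeric data
    if pd.api.types.is_numeric_dtype(series):
        result["mean"] = series.mean()
        result["median"] = series.median()
    return result

# Made-up examples: one numeric column with an outlier, one categorical column
amounts = pd.Series([20000, 25000, 30000, 35000, 500000])
models = pd.Series(["RAV4", "Camry", "RAV4"])

amounts_center = describe_center(amounts)
models_center = describe_center(models)

print(amounts_center)
print(models_center)
```

For `amounts`, every value is unique, so the mode lists all of them, which is one way to see why the mode says little about continuous data.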

**Next lectures:** We'll dive deeper into median and mode, then learn exactly when to use each one!