# Central Tendency - Finding the "Middle" of Your Data

When analyzing data, one of the first questions is: **What's a typical value?**

Central tendency gives us different ways to answer this:
- **Mean** - The sum of all values divided by how many there are
- **Median** - The middle value when sorted
- **Mode** - The most frequently occurring value

Let's see each one in action with our sales data.

In [None]:
import pandas as pd

In [None]:
toyota_sales = pd.read_csv("data/car_sales/toyota_sales_data.csv")

In [None]:
toyota_sales.head()

## 1. Mean - The Average

The mean adds up all values and divides by the count.

**When to use:** Symmetric data without extreme outliers
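
To make the arithmetic concrete, here is a minimal sketch that computes the mean by hand and checks it against Python's built-in `statistics.mean`. The numbers in `sample` are made up for illustration and do not come from the sales file.

```python
import statistics

# Hypothetical sale amounts, invented for this illustration
sample = [20000, 25000, 30000, 35000, 40000]

# Mean by hand: add up the values, then divide by how many there are
manual_mean = sum(sample) / len(sample)

# The standard library computes the same quantity
library_mean = statistics.mean(sample)

print(f"Manual mean:  ${manual_mean:,.2f}")
print(f"Library mean: ${library_mean:,.2f}")
```

Calling `.mean()` on a pandas column, as in the cells that follow, performs the same calculation over the whole column.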

In [None]:
mean_sale = toyota_sales["sale_amount"].mean()

In [None]:
print(f"Mean sale amount: ${mean_sale:,.2f}")

In [None]:
# Mean by car model
(
    toyota_sales
    .groupby('car_model')['sale_amount']
    .mean()
    .sort_values()
)

## 2. Median - The Middle Value

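The median is the middle value once the data is sorted: half the values fall at or below it and half at or above it. With an even number of values, it is the average of the two middle values.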

**When to use:** Data with outliers or skewed distributions
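
The sort-then-pick-the-middle idea can be sketched in a few lines. `median_by_hand` is a hypothetical helper written for this illustration, and the sample values are invented. It handles both an odd and an even number of values.

```python
import statistics

def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical sale amounts, invented for this illustration
odd_sample = [35000, 20000, 40000, 25000, 30000]
even_sample = [35000, 20000, 40000, 26000]

print(median_by_hand(odd_sample))
print(median_by_hand(even_sample))

# Cross-check against the standard library
print(statistics.median(even_sample))
```

Note that the input does not need to be sorted in advance. The helper sorts a copy, which is also what happens conceptually when you call `.median()` on a pandas column.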

In [None]:
median_sale = toyota_sales['sale_amount'].median()

In [None]:
print(f"Median sale amount: ${median_sale:,.2f}")

In [None]:
# Compare with mean
print(f"Mean sale amount: ${mean_sale:,.2f}")
print(f"Difference: ${abs(mean_sale - median_sale):,.2f}")

In [None]:
# Show why median resists outliers
sample_sales = [20000, 25000, 30000, 35000, 40000]
sample_with_outlier = [20000, 25000, 30000, 35000, 500000]

In [None]:
print("Without outlier:")
print(f"  Mean: ${pd.Series(sample_sales).mean():,.0f}")
print(f"  Median: ${pd.Series(sample_sales).median():,.0f}")

In [None]:
print("\nWith $500k outlier:")
print(f"  Mean: ${pd.Series(sample_with_outlier).mean():,.0f}")
print(f"  Median: ${pd.Series(sample_with_outlier).median():,.0f}")

## 3. Mode - The Most Common Value

The mode is the value that appears most frequently.

**When to use:** Categorical data or finding the most popular item
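
One detail worth knowing: a dataset can have more than one mode when several values tie for the highest count. The sketch below counts values with `collections.Counter` and returns every tied value. `all_modes` is a hypothetical helper and the model names are made-up sample data, not the contents of the sales file.

```python
from collections import Counter

def all_modes(values):
    """Return every value that ties for the highest count, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical car models, invented for this illustration
models = ["Camry", "Corolla", "RAV4", "Corolla", "Camry"]

print(all_modes(models))
```

pandas behaves similarly: `Series.mode()` returns a Series containing all tied values, so indexing it with `[0]` keeps only the first one. When ties are possible, looking at `value_counts()` as well gives the fuller picture.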

In [None]:
toyota_sales['car_model'].mode()[0]

In [None]:
# Mode for car model
mode_car = toyota_sales['car_model'].mode()[0]
print(f"Most popular car model: {mode_car}")

In [None]:
# Show the counts
toyota_sales['car_model'].value_counts()

In [None]:
# Mode for commission percentage
mode_commission = toyota_sales['commission_pct'].mode()[0]
print(f"Most common commission rate: {mode_commission * 100:g}%")

In [None]:
toyota_sales['commission_pct'].value_counts()

In [None]:
# All three for different data types
print("=== Sale Amount (Numerical) ===")
print(f"Mean:   ${toyota_sales['sale_amount'].mean():,.2f}")
print(f"Median: ${toyota_sales['sale_amount'].median():,.2f}")
print("Mode:   Not meaningful (continuous amounts rarely repeat)")

print("\n=== Car Model (Categorical) ===")
print("Mean:   Not applicable (can't average text)")
print("Median: Not applicable (can't sort meaningfully)")
print(f"Mode:   {toyota_sales['car_model'].mode()[0]} (best seller)")

print("\n=== Commission % (Discrete Numerical) ===")
# Format the computed values as percentages instead of hard-coding them
print(f"Mean:   {toyota_sales['commission_pct'].mean():.2%}")
print(f"Median: {toyota_sales['commission_pct'].median():.2%}")
print(f"Mode:   {toyota_sales['commission_pct'].mode()[0]:.2%}")

In [None]:
# Summary by car model
summary = (
    toyota_sales
    .groupby('car_model')['sale_amount']
    .agg(['mean', 'median', 'count'])
)

summary.columns = ['Mean', 'Median', 'Sales Count']

summary.sort_values('Mean')

## Summary: Central Tendency

**Three ways to find the "center":**
- **Mean** - Average of all values (affected by outliers)
- **Median** - Middle value (resistant to outliers)  
- **Mode** - Most frequent value (best for categorical data)

**Next lectures:** We'll dive deeper into median and mode, then learn exactly when to use each one!