In [None]:
import pandas as pd

In [None]:
toyota_sales = pd.read_csv('data/car_sales/toyota_sales_data.csv')

# Understanding Variance

**Variance measures how spread out your data is from the mean.**

## The Concept:
- Low variance = Data points are close to the mean (consistent)
- High variance = Data points are spread far from the mean (variable)

## Formula:
Variance = Average of (each value - mean)²

**Why squared?** So negative and positive differences don't cancel out.
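
To make the formula concrete, here is a minimal sketch that computes variance by hand on a small made-up series (the values are illustrative and not taken from the Toyota data). It also surfaces one detail worth knowing: pandas' `Series.var()` defaults to the sample variance (`ddof=1`, dividing by n - 1), while a plain average of squared differences is the population variance (`ddof=0`, dividing by n).

```python
import pandas as pd

# Illustrative values only (not from the Toyota dataset)
values = pd.Series([1, 2, 3, 4, 5])

mean = values.mean()
deviations = values - mean

# Raw deviations cancel each other out, which is why they get squared
print(f"Sum of raw deviations: {deviations.sum()}")

squared = deviations ** 2
population_variance = squared.sum() / len(values)    # divide by n
sample_variance = squared.sum() / (len(values) - 1)  # divide by n - 1

print(f"Population variance (by hand): {population_variance}")
print(f"Sample variance (by hand):     {sample_variance}")

# pandas equivalents of the two hand calculations
print(f"values.var(ddof=0): {values.var(ddof=0)}")
print(f"values.var():       {values.var()}")
```

With only a handful of values the two versions differ noticeably; with hundreds of sales per model the gap becomes small.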

## Simple Example: Two Datasets with Same Mean

Let's compare two sets of sales with the same average but different spreads.

In [None]:
# Two sets of sales - same mean, different spread
consistent_sales = pd.Series([29000, 30000, 31000, 32000, 33000])
variable_sales = pd.Series([15000, 25000, 31000, 40000, 44000])

In [None]:
print("=== Consistent Sales ===")
print(f"Values: {consistent_sales.tolist()}")
print(f"Mean: ${consistent_sales.mean():,.0f}")
print(f"Variance: {consistent_sales.var():,.0f} (dollars squared)")

In [None]:
print("\n=== Variable Sales ===")
print(f"Values: {variable_sales.tolist()}")
print(f"Mean: ${variable_sales.mean():,.0f}")
print(f"Variance: {variable_sales.var():,.0f} (dollars squared)")

## Variance in Toyota Sales Data

Let's calculate variance for our actual sale amounts.

In [None]:
# Variance by car model
variance_by_model = (
    toyota_sales
    .groupby('car_model')['sale_amount']
    .var()
    .sort_values(ascending=False)
)

In [None]:

print("Variance by Car Model:")
print(variance_by_model.round(0))

In [None]:
# See mean and variance together
comparison = toyota_sales.groupby('car_model')['sale_amount'].agg(
    Mean='mean',
    Variance='var',
    Count='count'
).round(0)

In [None]:
comparison.sort_values('Variance', ascending=False)

## Interpreting Variance in Business Context

**High Variance means:**
- Less predictable outcomes
- Wide range of values
- Could indicate multiple product tiers or configurations
- More risk/uncertainty

**Low Variance means:**
- Consistent, predictable outcomes
- Tight clustering around mean
- Standardized pricing or product
- More stability

In [None]:
# Compare Tundra vs Camry spreads
print("=== Tundra ===")
tundra = toyota_sales[toyota_sales['car_model'] == 'Tundra']['sale_amount']
print(f"Mean: ${tundra.mean():,.0f}")
print(f"Variance: {tundra.var():,.0f} (dollars squared)")
print(f"Min: ${tundra.min():,.0f}")
print(f"Max: ${tundra.max():,.0f}")

In [None]:
print("\n=== Camry ===")
camry = toyota_sales[toyota_sales['car_model'] == 'Camry']['sale_amount']
print(f"Mean: ${camry.mean():,.0f}")
print(f"Variance: {camry.var():,.0f} (dollars squared)")
print(f"Min: ${camry.min():,.0f}")
print(f"Max: ${camry.max():,.0f}")

## Summary: Variance

**Key Points:**
- Variance measures spread from the mean
- Higher variance = more variability
- Lower variance = more consistency
- Uses squared differences (so units are squared dollars)
- More informative than range because it uses all data points
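
The "squared dollars" point can be checked directly. The sketch below reuses the consistent sales figures from the earlier example and re-expresses them in thousands of dollars; the rescaling is an illustration added here, not part of the Toyota analysis. Dividing every value by 1,000 should divide the variance by 1,000² = 1,000,000, which is the behaviour you would expect from a quantity measured in squared units.

```python
import pandas as pd

# Same figures as the "consistent sales" example, in dollars
sales_dollars = pd.Series([29000, 30000, 31000, 32000, 33000])

# The same sales expressed in thousands of dollars
sales_thousands = sales_dollars / 1000

var_dollars = sales_dollars.var()
var_thousands = sales_thousands.var()

# Variance scales with the square of the unit change
ratio = var_dollars / var_thousands

print(f"Variance in dollars squared:           {var_dollars:,.0f}")
print(f"Variance in thousand-dollars squared:  {var_thousands:,.2f}")
print(f"Ratio between the two:                 {ratio:,.0f}")
```

This is why a variance figure cannot be read as a dollar amount.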

**Business Application:**
- High variance products: Expect diverse pricing, less predictability
- Low variance products: Expect consistent pricing, more predictability

**Next:** Standard deviation, which is the square root of variance and is easier to interpret because it is back in the original units (dollars).