In [1]:
import pandas as pd

In [2]:
toyota_sales = pd.read_csv('data/car_sales/toyota_sales_data.csv')

# Understanding Variance

**Variance measures how spread out your data is from the mean.**

## The Concept:
- Low variance = Data points are close to the mean (consistent)
- High variance = Data points are spread far from the mean (variable)

## Formula:
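Variance = Average of (each value - mean)²

Note: pandas `.var()` computes the *sample* variance by default (`ddof=1`), so it divides the sum of squared differences by n - 1 rather than n. Pass `ddof=0` to get the population variance.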

**Why squared?** So negative and positive differences don't cancel out.
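To make the squaring step concrete, here is a minimal sketch that computes variance by hand for a small series. The variable names (`sales`, `deviations`, `squared`) are chosen for illustration. It shows that the raw deviations sum to zero while the squared deviations do not, and that dividing by n - 1 reproduces what pandas returns by default.

```python
import pandas as pd

sales = pd.Series([29000, 30000, 31000, 32000, 33000])

# Raw differences from the mean: positives and negatives cancel out
deviations = sales - sales.mean()
print(f"Sum of deviations: {deviations.sum()}")

# Squared differences are all non-negative, so nothing cancels
squared = deviations ** 2
print(f"Sum of squared deviations: {squared.sum():,.0f}")

# Divide by n for population variance, by n - 1 for sample variance
population_variance = squared.sum() / len(sales)
sample_variance = squared.sum() / (len(sales) - 1)

print(f"Population variance (divide by n):   {population_variance:,.0f}")
print(f"Sample variance (divide by n - 1):   {sample_variance:,.0f}")
print(f"pandas .var() default:               {sales.var():,.0f}")
```

The last two lines should agree, since `.var()` uses `ddof=1` unless told otherwise.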

## Simple Example: Two Datasets with Same Mean

Let's compare two sets of sales with the same average but different spreads.

In [3]:
# Two sets of sales - same mean, different spread
consistent_sales = pd.Series([29000, 30000, 31000, 32000, 33000])
variable_sales = pd.Series([15000, 25000, 31000, 40000, 44000])

In [4]:
print("=== Consistent Sales ===")
print(f"Values: {consistent_sales.tolist()}")
print(f"Mean: ${consistent_sales.mean():,.0f}")
print(f"Variance: ${consistent_sales.var():,.0f}")

=== Consistent Sales ===
Values: [29000, 30000, 31000, 32000, 33000]
Mean: $31,000
Variance: $2,500,000


In [6]:
print("\n=== Variable Sales ===")
print(f"Values: {variable_sales.tolist()}")
print(f"Mean: ${variable_sales.mean():,.0f}")
print(f"Variance: ${variable_sales.var():,.0f}")


=== Variable Sales ===
Values: [15000, 25000, 31000, 40000, 44000]
Mean: $31,000
Variance: $135,500,000
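
The two results are easier to compare side by side. The sketch below is one possible way to do that: it puts both series into a single DataFrame (the name `both` and the column labels are illustrative) and computes how many times larger the second variance is.

```python
import pandas as pd

consistent_sales = pd.Series([29000, 30000, 31000, 32000, 33000])
variable_sales = pd.Series([15000, 25000, 31000, 40000, 44000])

# One column per dataset, so mean and variance line up in a single table
both = pd.DataFrame({"consistent": consistent_sales, "variable": variable_sales})
summary = both.agg(["mean", "var"])
print(summary)

# How many times larger is the variance of the variable dataset?
ratio = summary.loc["var", "variable"] / summary.loc["var", "consistent"]
print(f"Variance ratio: {ratio:.1f}x")
```

The means are identical, so the difference between the two datasets shows up only in the variance row.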


## Variance in Toyota Sales Data

Let's calculate variance for our actual sale amounts.

In [7]:
# Variance by car model
variance_by_model = (
    toyota_sales
    .groupby('car_model')['sale_amount']
    .var()
    .sort_values(ascending=False)
)

In [8]:

print("Variance by Car Model:")
print(variance_by_model.round(0))

Variance by Car Model:
car_model
Tundra        17944408.0
Highlander     8481920.0
Tacoma         8448003.0
RAV4           5551942.0
Corolla        2132979.0
Camry          2047450.0
Name: sale_amount, dtype: float64


In [9]:
# See mean and variance together
comparison = toyota_sales.groupby('car_model')['sale_amount'].agg([
    ('Mean', 'mean'),
    ('Variance', 'var'),
    ('Count', 'count')
]).round(0)

In [11]:
comparison.sort_values('Variance', ascending=False)

               Mean    Variance  Count
car_model
Tundra      42479.0  17944408.0    817
Highlander  40003.0   8481920.0    814
Tacoma      34998.0   8448003.0    826
RAV4        30986.0   5551942.0    860
Corolla     22442.0   2132979.0    827
Camry       27471.0   2047450.0    856


## Interpreting Variance in Business Context

**High Variance means:**
- Less predictable outcomes
- Wide range of values
- Could indicate multiple product tiers or configurations
- More risk/uncertainty

**Low Variance means:**
- Consistent, predictable outcomes
- Tight clustering around mean
- Standardized pricing or product
- More stability
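
As a rough illustration of the "multiple product tiers" point, the sketch below compares two sets of prices with the same mean. All prices and the tier labels (`base`, `premium`) are invented for this example and do not come from the Toyota data. Mixing two tiers produces a large overall variance even though prices within each tier are tightly clustered.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only
single_tier = pd.Series([30000, 30500, 31000, 31500, 32000, 32500])
two_tiers = pd.Series([26000, 26500, 27000, 35500, 36000, 36500])

# Hypothetical tier label for each price in two_tiers
tiers = pd.Series(["base"] * 3 + ["premium"] * 3)

print(f"Single tier: mean={single_tier.mean():,.0f}, variance={single_tier.var():,.0f}")
print(f"Two tiers:   mean={two_tiers.mean():,.0f}, variance={two_tiers.var():,.0f}")

# Variance inside each tier is small; the gap between tiers drives the total
within_tier_variance = two_tiers.groupby(tiers).var()
print(within_tier_variance)
```

If a product shows high variance, splitting it by a trim or configuration column (when one is available) is a reasonable first check.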

In [12]:
# Compare Tundra vs Camry spreads
print("=== Tundra ===")
tundra = toyota_sales[toyota_sales['car_model'] == 'Tundra']['sale_amount']
print(f"Mean: ${tundra.mean():,.0f}")
print(f"Variance: ${tundra.var():,.0f}")
print(f"Min: ${tundra.min():,.0f}")
print(f"Max: ${tundra.max():,.0f}")

=== Tundra ===
Mean: $42,479
Variance: $17,944,408
Min: $35,027
Max: $49,996


In [13]:
print("\n=== Camry ===")
camry = toyota_sales[toyota_sales['car_model'] == 'Camry']['sale_amount']
print(f"Mean: ${camry.mean():,.0f}")
print(f"Variance: ${camry.var():,.0f}")
print(f"Min: ${camry.min():,.0f}")
print(f"Max: ${camry.max():,.0f}")


=== Camry ===
Mean: $27,471
Variance: $2,047,450
Min: $25,006
Max: $29,982

Tundra sale amounts span roughly $15,000 (from $35,027 to $49,996), while Camry sale amounts span roughly $5,000 (from $25,006 to $29,982). That wider spread is why Tundra's variance is almost nine times larger than Camry's.


## Summary: Variance

**Key Points:**
- Variance measures spread from the mean
- Higher variance = more variability
- Lower variance = more consistency
- Uses squared differences, so its units are squared dollars (the `$` in the printed variances above is only a formatting convenience)
- More informative than range because it uses all data points
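
To see why variance can say more than range, the following sketch uses two small made-up datasets (the values are illustrative, not taken from the Toyota data). Both have the same minimum, maximum, and mean, so their ranges match, yet their variances differ because variance also accounts for the points in between.

```python
import pandas as pd

# Illustrative values only: same min, max, and mean in both series
datasets = {
    "clustered": pd.Series([20000, 30000, 30000, 30000, 40000]),
    "spread_out": pd.Series([20000, 20000, 30000, 40000, 40000]),
}

results = {}
for name, values in datasets.items():
    value_range = values.max() - values.min()
    variance = values.var()
    results[name] = (value_range, variance)
    print(f"{name}: range={value_range:,}, variance={variance:,.0f}")
```

Range looks only at the two extreme values, so it cannot tell these datasets apart.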

**Business Application:**
- High variance products: Expect diverse pricing, less predictability
- Low variance products: Expect consistent pricing, more predictability

**Next:** Standard deviation, which is the square root of variance. It is expressed in the original units (dollars), so it is easier to interpret!