# Chapter 1.3: Statistical Summaries

Goal: Practice computing and interpreting summary statistics to understand data distributions.

### Topics:
- Using `.describe()` for quick summaries
- Computing individual statistics (mean, median, std)
- Understanding mean vs. median
- Grouping with `.groupby()` to compare across categories
- Identifying outliers using percentiles

In [None]:
import pandas as pd
import numpy as np

## Loading the Data

We'll use the California Housing dataset, which contains information about housing in various districts in California. Each row represents a district, with features like median income, house age, and median house value.

In [None]:
# Load the California Housing dataset
df = pd.read_csv('https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv')
df.head()

In [None]:
# How many rows and columns?
df.shape

## Using `.describe()`

The `.describe()` method gives you a quick summary of all numeric columns in your DataFrame. It tells you:
- **count**: how many non-missing values
- **mean**: the average
- **std**: standard deviation (how spread out the data is)
- **min/max**: smallest and largest values
- **25%, 50%, 75%**: the quartiles (50% is the median)
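
To make the list above concrete, here is a minimal sketch on a tiny made-up Series (the values are hypothetical and are not taken from the housing data). It shows that **count** skips missing values and that the **50%** row matches `.median()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four numbers and one missing value
s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

summary = s.describe()
print(summary)

# count ignores the NaN, so it is 4 rather than 5
print(summary['count'])

# The 50% row is the median
print(summary['50%'], s.median())
```

If a column's count is lower than the number of rows in the DataFrame, that column likely contains missing values.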

In [None]:
# Get summary statistics for the entire DataFrame
df.describe()

See how that gives you a quick overview of every numeric column? You can also run `.describe()` on a single column:

In [None]:
# Summary of just the median_house_value column
df['median_house_value'].describe()

## Individual Statistics

Sometimes you want just one statistic. Pandas gives you methods for each. Take a guess at what each one is:

In [None]:
# Mean (average) house value
df['median_house_value'].???()

In [None]:
# Median (middle value) house value
df['median_house_value'].???()

In [None]:
# Standard deviation
df['median_house_value'].???()

### Mean vs. Median

Notice that the mean and median are different! The mean is about $207,000 and the median is about $180,000. When the mean is higher than the median, it usually means the distribution is skewed to the right: a relatively small number of very high values pull the average up. The median is more "robust" because it depends only on the middle of the sorted data, so a few extreme values barely move it.
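
A small sketch of this effect, using five made-up house values (hypothetical numbers, not from the dataset), one of which is extreme:

```python
import pandas as pd

# Hypothetical toy data: four ordinary values and one very large one
values = pd.Series([100_000, 120_000, 150_000, 180_000, 2_000_000])

print(values.mean())    # 510000.0 (pulled far up by the one large value)
print(values.median())  # 150000.0 (the middle value, unaffected)

# Remove the extreme value and compare again
trimmed = values[values < 1_000_000]
print(trimmed.mean())    # 137500.0
print(trimmed.median())  # 135000.0
```

Dropping a single value changes the mean by more than $370,000 but moves the median by only $15,000.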

## Practice: Basic Statistics

1. Compute the mean of `median_income`
2. Compute the median of `housing_median_age`
3. Compare mean vs. median for `total_rooms`: which is larger, and why?
4. Find the standard deviation of `population`

In [None]:
# 1. Compute the mean of median_income
# Step 1: Select the median_income column using df['column_name']
# Step 2: Apply the .mean() method


In [None]:
# 2. Compute the median of housing_median_age
# Step 1: Select the housing_median_age column
# Step 2: Apply the .median() method


In [None]:
# 3. Compare mean vs median for total_rooms
# Compute both and print them out
# Which is larger? Why do you think that is?


In [None]:
# 4. Find the standard deviation of population
# Step 1: Select the population column
# Step 2: Apply the .std() method


## Identifying Outliers

Outliers are extreme values that might be errors or just unusual cases. One simple way to find them is using percentiles. The `.quantile()` method lets you find any percentile. It takes a fraction between 0 and 1, so `0.01` is the 1st percentile and `0.99` is the 99th:

In [None]:
# 1st percentile (only 1% of values are below this)
df['median_house_value'].quantile(0.01)

In [None]:
# 99th percentile (only 1% of values are above this)
df['median_house_value'].quantile(0.99)

Values below the 1st percentile or above the 99th percentile might be outliers worth investigating. You can combine this with filtering to actually see those rows:

In [None]:
# Find districts with extremely high house values (above 99th percentile)

# Start by finding the 99th percentile of the median_house_value column
high_value_threshold = ???

# Create a mask for districts with a median_house_value above this threshold
mask = ???

# Use this mask to filter to only these rows
expensive_districts = df[???]

# Inspect the results
expensive_districts.head()
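
The same threshold, mask, filter pattern can be sketched on a small made-up DataFrame (the `toy` and `capped` data below are hypothetical). The second half illustrates a caveat: when many rows share the maximum value, a high percentile can equal that maximum, and a strict `>` comparison then matches nothing.

```python
import pandas as pd

# Hypothetical toy data (not from the housing dataset)
toy = pd.DataFrame({
    'district': list('ABCDEFGHIJ'),
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000],
})

# Step 1: compute a threshold (here the 90th percentile)
threshold = toy['value'].quantile(0.90)

# Step 2: build a boolean mask with one True/False per row
mask = toy['value'] > threshold

# Step 3: keep only the rows where the mask is True
extreme = toy[mask]
print(extreme)

# Caveat: ties at the top. Most of these hypothetical values sit at a cap of 5,
# so the 90th percentile is the cap itself.
capped = pd.Series([1, 2, 3, 5, 5, 5, 5, 5, 5, 5])
cap_threshold = capped.quantile(0.90)
print((capped > cap_threshold).sum())   # strict comparison matches no rows
print((capped >= cap_threshold).sum())  # inclusive comparison matches the tied rows
```

This can matter for a column like `median_house_value` if its largest values were capped at a common ceiling. If your filter comes back empty, checking `.max()` and `.value_counts()` on the column is a reasonable first step.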

## Grouping with `.groupby()`

Here's where things get interesting. What if we want to compare statistics across different groups? For example, what's the average house value for districts near the ocean vs. inland?

The `ocean_proximity` column tells us how close each district is to the ocean. Let's see what values it has:

In [None]:
# What are the unique values in ocean_proximity?
df['ocean_proximity'].unique()

Now let's find the average house value for each of these categories:

In [None]:
# Average house value by ocean_proximity
df.groupby('ocean_proximity')['median_house_value'].mean()

The syntax is: `df.groupby('grouping_column')['column_to_summarize'].statistic()`

You can use any statistic: `.mean()`, `.median()`, `.max()`, `.min()`, `.count()`, etc.
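
As a minimal sketch of that syntax, here is the pattern on a tiny made-up DataFrame (the `region` and `price` columns are hypothetical). It also shows `.agg()`, which computes several statistics per group in one call.

```python
import pandas as pd

# Hypothetical toy data: two groups with a numeric column to summarize
toy = pd.DataFrame({
    'region': ['coast', 'coast', 'inland', 'inland', 'inland'],
    'price': [300, 500, 100, 200, 150],
})

# Group by region, then select the column to summarize
by_region = toy.groupby('region')['price']

print(by_region.mean())  # coast: 400.0, inland: 150.0
print(by_region.max())   # coast: 500, inland: 200

# Several statistics at once, one column per statistic
print(by_region.agg(['mean', 'median', 'count']))
```

The result of each call is indexed by the group labels, so you can look up one group with, for example, `by_region.mean()['coast']`.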

In [None]:
# Maximum house value by ocean_proximity
df.groupby('ocean_proximity')['median_house_value'].max()

## Practice: Grouped Statistics

1. Find the average `median_income` by `ocean_proximity`
2. Find the maximum `population` for each `ocean_proximity` category
3. Find the median `housing_median_age` for each `ocean_proximity`
4. Which `ocean_proximity` category has the highest average house value? (You already computed this above - just answer the question!)

In [None]:
# 1. Average median_income by ocean_proximity
# Step 1: Group by 'ocean_proximity'
# Step 2: Select the 'median_income' column
# Step 3: Apply .mean()


In [None]:
# 2. Maximum population for each ocean_proximity category
# Same pattern as above, but use .max() instead of .mean()


In [None]:
# 3. Median housing_median_age for each ocean_proximity


In [None]:
# 4. Which ocean_proximity category has the highest average house value?