## The Average
As of Q2 2024, the [*average U.S. home value*](https://www.zillow.com/home-values/102001/united-states/)
according to *Zillow* is $\textdollar354,179$. But what does this [*average*](https://en.wikipedia.org/wiki/Mean) really tell us? What insight about the *U.S. housing market* can we gain from it? By itself, the average only gives us a sense of the *central tendency* of the data: *U.S. housing prices* tend to cluster around this $\textdollar354,179$ price, but the average says nothing about how tightly they cluster.
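
To make the limits of the average concrete, here is a minimal sketch using two made-up price samples (the numbers are hypothetical, not taken from the Zillow or Wikipedia data). Both samples share the same average, yet they describe very different markets:

```python
from statistics import fmean, pstdev

# Two hypothetical price samples (USD), invented purely for illustration.
tight_market = [340_000, 350_000, 360_000]
wide_market = [100_000, 350_000, 600_000]

# Both samples have the same average ...
tight_mean = fmean(tight_market)
wide_mean = fmean(wide_market)

# ... but very different spreads around that average.
tight_spread = pstdev(tight_market)
wide_spread = pstdev(wide_market)

print(f"tight market: mean={tight_mean:,.0f}, std={tight_spread:,.0f}")
print(f"wide market:  mean={wide_mean:,.0f}, std={wide_spread:,.0f}")
```

The average alone cannot distinguish these two cases; some measure of spread is needed as well.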

In [None]:
# libs for downloading US states shape files
import pathlib
import tempfile
import urllib.request
import zipfile

# US state shape file URL
shape_file_url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_20m.zip"

# directory to save the downloaded file
data_dir = pathlib.Path("./data/geo/")

# directory to extract the contents of the ZIP file
extract_dir = data_dir / "cb_2018_us_state_20m"

# check for previous download
if not extract_dir.exists():
    # create the parent tree (in case it doesn't exist)
    data_dir.mkdir(parents=True, exist_ok=True)
    
    # create a temporary directory to download the file
    with tempfile.TemporaryDirectory() as temp_dir:
        # filename for the downloaded ZIP file
        zip_filename = pathlib.Path(temp_dir) / "cb_2018_us_state_20m.zip"
    
        # download the ZIP file
        urllib.request.urlretrieve(shape_file_url, zip_filename)
    
        # extract the contents of the ZIP file
        with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)

In [None]:
# now libs for housing data
import pandas as pd

# gist URL
housing_prices_data_url = (
    "https://gist.githubusercontent.com/DiogenesAnalytics/679e24616670afb65b25e96bc697940a/raw/ecebe4176c6fc0e09fc949d05c12c3abc3d0306f/housing_median_data_2024.csv"
)

# get state housing data
housing_price_data = pd.read_csv(housing_prices_data_url)

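# pull out the national figure by label instead of a hard-coded row position
average_us_house_price = housing_price_data.loc[
    housing_price_data["State or territory"] == "United States",
    "Median home price in US$",
].iloc[0]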

# now just get median prices
median_us_housing_prices = housing_price_data[housing_price_data["State or territory"] != "United States"]

In [None]:
# now libs for combining US state housing prices and geo data
import geopandas as gpd

# get state geo data
us_state_map = gpd.read_file(extract_dir)

# keep only the contiguous states (drop Alaska, Hawaii, and Puerto Rico)
contiguous_us_state_map = us_state_map[~us_state_map["NAME"].isin(["Alaska", "Hawaii", "Puerto Rico"])]

In [None]:
# change default theme
import matplotlib.pyplot as plt

# set the style to a dark theme
plt.style.use("dark_background")

# match website background
plt.rcParams["figure.facecolor"] = "#181818"
plt.rcParams["axes.facecolor"] = "#181818"
plt.rcParams["axes.edgecolor"] = "#181818"

# turn off axes
plt.axis("off")

# set title
plt.suptitle(
    "Figure 1. Choropleth Map of Median U.S. Housing Prices (2024)", y=0.05, fontsize=10)

# finally merge ...
state_map_w_prices = contiguous_us_state_map.merge(
    median_us_housing_prices,
    how="left",
    left_on="NAME",
    right_on="State or territory",
)

# ... and plot
state_map_w_prices.plot(
    ax=plt.gca(),
    cmap="viridis",
    column="Median home price in US$",
    legend=True,
    legend_kwds={"label": "USD", "orientation": "horizontal"},
);

In [None]:
# plot histogram
median_us_housing_prices[
    "Median home price in US$"
].hist(bins=30, color=plt.cm.viridis(0.55), edgecolor="#181818")
plt.axvline(average_us_house_price, color='red', linestyle='--', linewidth=2, label="Average")
plt.legend()
plt.grid(False)
plt.gca().yaxis.set_visible(False)
plt.xlabel("USD")
plt.suptitle(
    "Figure 2. Histogram of Median U.S. Housing Prices (2024)", y=0.05, fontsize=10
)
plt.subplots_adjust(bottom=0.17)
plt.show();

But what can it tell us about the *probability* that a house for sale in the U.S. market will have a given price (or fall within a certain *price range*)? Or that a given price *deviates significantly* from the average?

## Normal Distribution
As seen in Figures 1 and 2 above, which use [*median U.S. housing price data from Wikipedia*](https://en.wikipedia.org/wiki/List_of_U.S._states_by_median_home_price), the average can only give us a sense of the *central tendency*, as mentioned before. To answer questions about *probability*, we must fit a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) to the data.
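
Fitting a normal distribution comes down to estimating two numbers: a mean and a standard deviation. The sketch below computes the maximum-likelihood estimates by hand with the standard library, which, as far as I understand, is what an unconstrained `norm.fit` call returns. The `fit_normal` helper and the sample prices are hypothetical and only meant to illustrate the idea:

```python
from math import sqrt
from statistics import fmean

def fit_normal(values):
    """Return (mean, std) maximum-likelihood estimates for a normal distribution."""
    mean = fmean(values)
    # Maximum likelihood divides by n (not n - 1) when averaging squared deviations.
    variance = fmean([(v - mean) ** 2 for v in values])
    return mean, sqrt(variance)

# Hypothetical prices (USD), invented for illustration.
sample_prices = [200_000, 300_000, 400_000, 500_000]

mu_hat, sigma_hat = fit_normal(sample_prices)
print(f"mean={mu_hat:,.0f}, std={sigma_hat:,.0f}")
```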

In [None]:
# libs for generating the distribution
import numpy as np
from scipy.stats import norm

# get mean and standard dev
mu, std = norm.fit(median_us_housing_prices["Median home price in US$"])

# create evenly spaced (x-axis) data in housing prices range
lin_prices = np.linspace(
    median_us_housing_prices["Median home price in US$"].min(),
    median_us_housing_prices["Median home price in US$"].max(), 
    100,
)

# get the probability densities (y-axis) for the distribution
probabilities = norm.pdf(lin_prices, mu, std)

# plot
plt.plot(lin_prices, probabilities, linewidth=2, color=plt.cm.viridis(0.74))
plt.xlabel("USD")
plt.grid(False)
plt.suptitle(
    "Figure 3. Normal Distribution of Median U.S. Housing Prices (2024)", y=0.05, fontsize=10
)
plt.subplots_adjust(bottom=0.17)
plt.gca().yaxis.set_visible(False)
plt.show()

Figure 3 above shows the *normal distribution* of the 2024 *median U.S. housing prices* by state. It is generated by the following equation:

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
$$
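
As a rough check on the formula, it can be evaluated directly with nothing but the standard library. In the sketch below, `normal_pdf` is a hypothetical helper, and `MU` and `SIGMA` are illustrative round numbers (assumptions, not the values fitted in this notebook):

```python
from math import exp, pi, sqrt

# Illustrative parameters (assumptions, not the fitted values).
MU = 350_000.0
SIGMA = 150_000.0

def normal_pdf(x, mu, sigma):
    """Evaluate the normal density f(x) for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))

# The density peaks at the mean and is symmetric around it.
peak = normal_pdf(MU, MU, SIGMA)
one_sigma_above = normal_pdf(MU + SIGMA, MU, SIGMA)
one_sigma_below = normal_pdf(MU - SIGMA, MU, SIGMA)

print(f"density at the mean: {peak:.3e}")
print(f"density one sigma away, relative to the peak: {one_sigma_above / peak:.4f}")
```

Note that the values of $f(x)$ are *densities*, not probabilities: a probability only appears once the density is integrated over a range of prices.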

Since the distribution is fit to state-level [*median*](https://en.wikipedia.org/wiki/Median) U.S. housing prices rather than to individual sale prices, it is missing the *lower-end* housing price values.

With this distribution, we can begin to *estimate* the probability of *seeing a given range of prices*. For example, what if we wanted to know the probability of seeing a house in the U.S. listed in the range $[\textdollar200,000 - \textdollar300,000]$? What we want to know is actually the *area* under the normal distribution.
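
The link between "area under the curve" and probability can be sketched numerically. The example below approximates the area with trapezoids and compares it to the difference of two cumulative distribution function (CDF) values. All helper names are hypothetical, and `MU` and `SIGMA` are illustrative round numbers rather than the fitted parameters:

```python
from math import erf, exp, pi, sqrt

# Illustrative parameters (assumptions, not the fitted values).
MU = 350_000.0
SIGMA = 150_000.0

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))

def normal_cdf(x, mu, sigma):
    """Probability that a normally distributed value is at most x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def area_by_trapezoids(lower, upper, mu, sigma, steps=1_000):
    """Approximate the area under the density between lower and upper."""
    width = (upper - lower) / steps
    heights = [normal_pdf(lower + i * width, mu, sigma) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

exact = normal_cdf(300_000, MU, SIGMA) - normal_cdf(200_000, MU, SIGMA)
approx = area_by_trapezoids(200_000, 300_000, MU, SIGMA)

print(f"CDF difference: {exact:.4%}, trapezoid area: {approx:.4%}")
```

The two numbers agree to several decimal places, which is why subtracting CDF values is the usual shortcut for computing the area.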

In [None]:
# define a few utility functions
def between_bounds(lower: float, upper: float) -> np.ndarray:
    """Return a boolean mask selecting the precomputed prices within [lower, upper]."""
    return (lin_prices >= lower) & (lin_prices <= upper)

def probability_range(lower: float, upper: float) -> float:
    """Use the fitted mean and standard deviation to find the probability of a range."""
    return norm.cdf(upper, mu, std) - norm.cdf(lower, mu, std)

# create initial plot
plt.plot(lin_prices, probabilities, linewidth=2, color=plt.cm.viridis(0.74))

# fill in shaded area
shaded_region = plt.fill_between(
    lin_prices,
    0,
    probabilities,
    where=between_bounds(2 * 10**5, 3 * 10**5),
    color='orange',
    alpha=0.5
)

# get coordinates of shaded region
(x0, y0), (x1, y1) = shaded_region.get_paths()[0].get_extents().get_points()

# add text to shaded region
plt.text(
    (x0 + x1) / 2, 
    (y0 + y1) / 2,
    f"{probability_range(2 * 10**5, 3 * 10**5):.2%}", 
    ha='center',
    va='center',
    fontsize=8
)

# label, adjust, title, and show plot
plt.xlabel("USD")
plt.grid(False)
plt.suptitle(
    r"Figure 4. Probability of a \$200k to \$300k House (2024)", y=0.05, fontsize=10
)
plt.subplots_adjust(bottom=0.17)
plt.gca().yaxis.set_visible(False)
plt.show()

As Figure 4 indicates, under this fitted distribution there is a $20.45\%$, or $\sim\frac{1}{5}$, *probability* of encountering a house priced in the range $[\textdollar200,000 - \textdollar300,000]$ within the U.S.

## Return of the Average
Now we are ready to think about what the *average* actually means. Because a normal distribution is symmetric, its average is also the value that splits the population in half: $50\%$ of the population lies below it and $50\%$ lies above it.
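
This 50/50 split is a property of the symmetric normal model, not of averages in general. A minimal sketch, assuming illustrative parameters and a made-up skewed sample (neither comes from the housing data):

```python
from statistics import NormalDist, fmean, median

# Illustrative normal model (assumed parameters, not the fitted values).
dist = NormalDist(mu=350_000, sigma=150_000)

# For a normal distribution, exactly half of the probability lies below the mean.
share_below_mean = dist.cdf(dist.mean)

# Hypothetical skewed sample (USD): one very expensive house pulls the mean up.
skewed_prices = [150_000, 200_000, 250_000, 300_000, 1_600_000]
skewed_mean = fmean(skewed_prices)
skewed_median = median(skewed_prices)
share_below_skewed_mean = sum(p < skewed_mean for p in skewed_prices) / len(skewed_prices)

print(f"normal model, share below the mean: {share_below_mean:.0%}")
print(f"skewed sample: mean={skewed_mean:,.0f}, median={skewed_median:,.0f}")
print(f"skewed sample, share below the mean: {share_below_skewed_mean:.0%}")
```

In the skewed sample most prices sit below the mean, so the "half below, half above" reading only holds as long as the normal model is a reasonable fit.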

In [None]:
# create initial plot
plt.plot(lin_prices, probabilities, linewidth=2, color=plt.cm.viridis(0.74))

# add the average
plt.axvline(mu, color=plt.cm.viridis(0.35), linestyle="--", linewidth=2, label="Average")

# fill in shaded area
shaded_region = plt.fill_between(
    lin_prices,
    0,
    probabilities,
    where=between_bounds(0, lin_prices.max()),
    color='orange',
    alpha=0.5
)


# add text to shaded region
plt.text(
    (mu - 100000), 
    (y0 + y1) / 2,
    f"{probability_range(0, mu):.2%}", 
    ha='center',
    va='center',
    fontsize=8
)

# add text to shaded region
plt.text(
    (mu + 100000), 
    (y0 + y1) / 2,
    f"{probability_range(mu, lin_prices.max()):.2%}", 
    ha='center',
    va='center',
    fontsize=8
)

# label, adjust, title, and show plot
plt.xlabel("USD")
plt.grid(False)
plt.suptitle(
    "Figure 5. The Interpretation of the Average U.S. Housing Price (2024)", y=0.05, fontsize=10
)
plt.subplots_adjust(bottom=0.17)
plt.gca().yaxis.set_visible(False)
plt.legend()
plt.show()

Basically, half the time when searching housing prices in the U.S. you will encounter a price *below* the average (i.e. about $\textdollar354,179$), and half the time the price will be *above* the average. This can be useful as a *high-level* view of a distribution, but it cannot tell us what price can truly be *expected*.

## Expected Price
The truth about the [*expected*](https://en.wikipedia.org/wiki/Expected_value) price is simply that it only makes sense to consider a *range of prices*. For example, does any **potential buyer** care about finding a house priced at *exactly* $\textdollar300,000$? What about $\textdollar300,100$ or $\textdollar299,900$? So again we must turn to the *normal distribution* to satisfy our query: what is the range of prices we can expect in the U.S. housing market?
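
One way to turn "a range of prices we can expect" into numbers is to ask for the central interval that contains a chosen share of the distribution. The sketch below uses the standard library's `NormalDist.inv_cdf`; the `central_interval` helper and the parameters are hypothetical, chosen only for illustration:

```python
from statistics import NormalDist

def central_interval(mu, sigma, coverage):
    """Return (low, high) bounds of the central interval holding `coverage` of the mass."""
    dist = NormalDist(mu, sigma)
    tail = (1.0 - coverage) / 2.0  # probability left in each tail
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Illustrative parameters (assumptions, not the fitted values).
MU = 350_000.0
SIGMA = 150_000.0

low, high = central_interval(MU, SIGMA, coverage=0.90)
print(f"90% of prices are expected between {low:,.0f} and {high:,.0f} USD")
```

Wider coverage produces a wider interval, which is the trade-off a buyer faces when deciding how much of the market to consider.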

In [None]:
# create initial plot
plt.plot(lin_prices, probabilities, linewidth=2, color=plt.cm.viridis(0.74))

# fill in shaded area
shaded_region = plt.fill_between(
    lin_prices,
    0,
    probabilities,
    where=between_bounds(mu - std, mu + std),
    color='orange',
    alpha=0.5
)

# get coordinates of shaded region
(x0, y0), (x1, y1) = shaded_region.get_paths()[0].get_extents().get_points()

# add text to shaded region
plt.text(
    (x0 + x1) / 2, 
    (y0 + y1) / 2,
    f"{probability_range(mu - std, mu + std):.2%}", 
    ha='center',
    va='center',
    fontsize=8
)

# label, adjust, title, and show plot
plt.xlabel("USD")
plt.grid(False)
plt.suptitle(
    "Figure 6. One Standard Deviation from the Average\nMedian U.S. Housing Price (2024)", y=0.05, fontsize=10
)
plt.subplots_adjust(bottom=0.17)
plt.gca().yaxis.set_visible(False)
plt.show()

Figure 6 shows that about $68\%$, or $\sim\frac{2}{3}$, of the houses listed in the *U.S. housing market* can be expected to have a price within one standard deviation of the average, roughly the range $[\textdollar200,000 - \textdollar500,000]$. This is the *lion's share* of the market, with the remaining $32\%$, or $\sim\frac{1}{3}$, consisting of houses priced below $\textdollar200,000$ or above $\textdollar500,000$.
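
The $68\%$ figure does not depend on the particular housing data. For any normally distributed quantity $X$ with mean $\mu$ and standard deviation $\sigma$, standardizing with $Z = \frac{X-\mu}{\sigma}$ and writing $\Phi$ for the standard normal cumulative distribution function (notation introduced here for illustration) gives:

$$
P(\mu-\sigma \le X \le \mu+\sigma) = P(-1 \le Z \le 1) = \Phi(1) - \Phi(-1) = 2\,\Phi(1) - 1 \approx 0.6827
$$

So the probability mass within one standard deviation is the same for every normal distribution; only the dollar values of the endpoints change with $\mu$ and $\sigma$.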

## Moral
The *average* (i.e. $\textdollar354,179$) is not a good representation of the *expected* value that a **potential buyer** will encounter. Instead, a better representation of what to expect is the $68\%$ or $\sim\frac{2}{3}$ probability that a house listed in the *U.S. housing market* will have a price in the range $[\textdollar200,000 - \textdollar500,000]$. The average is best understood as a *condensed* estimate of where values *cluster* (i.e. around $\textdollar354,179$). It is a *high-level* view of the population, but without the *standard deviation* it cannot be used to reason about the values that will be obtained [*empirically*](https://en.wikipedia.org/wiki/Empirical_research).