# DS EDA Project ⟡ King County House Prices ⟡ *Amy Williams* Edition

## Outline of the Problem

Amy Williams (Seller, Mafiosa) sells several central houses (top10%) over time, needs average outskirt houses over time to hide from the FBI.

Corrected version (since Miss Williams is quite upset that her smart tricks are misconstrued as organised crime ...):

Amy Williams, a businesswoman by trade
- wants to sell several central houses (top 10%) over time and
- needs average outskirt houses over time to avoid unwanted attention from the FBI.

## Usual Setup Tasks

To analyse the data set, we begin by loading the required modules:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Required for some plot fine tuning.
from matplotlib.patches import Patch

Let us import the data set under consideration:

In [None]:
# Path to the csv file containing the data.
DATAPATH = "./data/King_County_House_prices_dataset.csv"

df_0 = pd.read_csv(DATAPATH)

## First Inspection and Data Cleaning

For an initial overview of this data set, let us perform the usual steps of inspection.

In [None]:
# The first 10 rows of the table.
df_0.head(10)

In [None]:
# Information about column dtypes and NaN values. 
df_0.info()

In [None]:
# Common statistical quantities for the numerical columns.
df_0.describe()

We start with the necessary cleaning tasks. For convenience, let us create a deep copy of the original data that will incorporate our edits.

In [None]:
df_1 = df_0.copy()

We fix the data types for several columns where necessary.

### Ad "date"

As we saw, the displayed `dtype` of "date" is `object`. 

Dealing with dates becomes much easier with the dedicated tools. In pandas, this means casting the affected columns to the datetime type.

Fortunately, there are no further intricacies (like exotic date formats) so that the following command suffices:

In [None]:
df_1["date"] = pd.to_datetime(df_0["date"])
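Should the date strings turn out to be less uniform than assumed, a more defensive conversion might look as follows. This is only a minimal sketch on made-up sample strings; in particular, the month/day/year format is an assumption for illustration and has not been verified against the data set.

```python
import pandas as pd

# Hypothetical sample strings; the format "%m/%d/%Y" is an assumption.
sample = pd.Series(["10/13/2014", "02/25/2015", "not a date"])

# An explicit format avoids silently swapping day and month;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(sample, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a cheap way to see whether any rows failed to parse.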

### Ad "bedrooms", "bathrooms", "floors"

Let us have a look at the columns representing the number of specific rooms. We saw that they share a floating point data type. Now we check which fractional parts arise in these columns:

In [None]:
room_cols = [
    "bedrooms",
    "bathrooms",
    "floors",
]

pd.DataFrame(
    [(col, (df_0[col] % 1).unique()) for col in room_cols],
    columns=["Room Column", "Fractional Parts"]
)

Hence, it is fine to leave these columns as they are (although the distinction between values like 0.5 and 0.75 bathrooms is likely not precise ...).

### Ad "yr_renovated"

We saw that "yr_renovated" has dtype `float64`. When inspecting its unique values:

In [None]:
df_0["yr_renovated"].unique() 

Two issues can be detected:

- Floating point numbers are not necessary, integers suffice.
- Among the values, we find `nan` and `0`. It is unclear what `0` represents: It could represent "no renovation" or an alternative encoding for `nan`. We will treat `0` as equivalent to `nan`.

Let us address these problems:

In [None]:
df_1["yr_renovated"] = df_1["yr_renovated"].replace(0.0, np.nan)
# To retain NaN values, "Int64" instead of "int64" must be used.
df_1["yr_renovated"] = df_1["yr_renovated"].astype("Int64")
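To see the effect of these two steps in isolation, here is a small sketch on invented values (not taken from the data set). It illustrates that the nullable dtype `"Int64"` keeps missing entries as `<NA>` while the remaining values become integers.

```python
import numpy as np
import pandas as pd

# Toy values mimicking the situation: years, zeros and missing entries.
years = pd.Series([1991.0, 0.0, np.nan, 2014.0])

# Treat 0 as missing, then switch to the nullable integer dtype.
cleaned = years.replace(0.0, np.nan).astype("Int64")

print(cleaned.dtype)         # Int64
print(cleaned.isna().sum())  # 2
```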

### Ad "waterfront", "view"

It is very likely that the columns "waterfront" and "view" are, in actual fact, categorical (or even Boolean), meaning floating point numbers are unnecessary. Their unique values are:

In [None]:
maybe_cat_cols = [
    "waterfront",
    "view",
]

pd.DataFrame(
    [(col, df_0[col].unique()) for col in maybe_cat_cols],
    columns=["Maybe Categorical Column", "Unique Values"]
)

Hence, the columns are categorical ("waterfront" is even Boolean). Thus, we convert these columns:

In [None]:
for col in maybe_cat_cols:
    # Conversion to "int64" will not work since NaN values are present,
    # we have to use "Int64".
    df_1[col] = df_0[col].astype("Int64")

df_1.info()

### Ad "sqft_*"

Finally, at Miss Williams's request, we convert the columns representing square footage (those matching "sqft_*") into square metres (she advocates for metric units!).

In [None]:
# Number of square metres per square foot.
SQ_M_TO_SQ_F_RATIO = 0.09290304

# The column sqft_basement has dtype object, 
# we convert it to integer (and interpret missing values as 0).
df_0["sqft_basement"] = df_0["sqft_basement"].replace("?", np.nan)
df_0["sqft_basement"] = pd.to_numeric(df_0["sqft_basement"], errors="coerce")
df_0["sqft_basement"] = df_0["sqft_basement"].fillna(0).astype("int64")

sqft_cols = [col for col in df_0.columns if col.startswith("sqft_")]

for col in sqft_cols:
    # Drop sqft column.
    df_1 = df_1.drop(columns=[col], errors="ignore")
    # Add sqm column.
    col_new = col.replace("sqft_", "sqm_")
    df_1[col_new] = SQ_M_TO_SQ_F_RATIO * df_0[col]
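The same conversion can be packaged as a small function that leaves its input untouched. The following is only a sketch: the helper name `sqft_to_sqm` and the toy frame are hypothetical and not part of the analysis above.

```python
import pandas as pd

# Square metres per square foot (exact, since 1 ft = 0.3048 m).
SQ_M_PER_SQ_FT = 0.09290304


def sqft_to_sqm(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every "sqft_*" column is replaced by "sqm_*"."""
    out = df.copy()
    for col in [c for c in out.columns if c.startswith("sqft_")]:
        # pop() removes the old column and hands back its values.
        out[col.replace("sqft_", "sqm_", 1)] = SQ_M_PER_SQ_FT * out.pop(col)
    return out


# Invented toy data.
toy = pd.DataFrame({"id": [1, 2], "sqft_living": [1000, 2000]})
converted = sqft_to_sqm(toy)
print(converted)
```

Working on a copy keeps the original frame available for comparison, in the same spirit as the split between `df_0` and `df_1`.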



## Warm Up: Three Haphazard Hypotheses on the Data

To become acquainted with the data set, we formulate three arbitrary (not well-thought-out) hypotheses and check their consistency with the data.

**CAVEAT:** We do *not* perform hypothesis testing in a (statistically) rigorous sense!

### Hypothesis #1: Dependence of Two Discrete Variables

**Hypothesis:** The correlation between "bathrooms" and "bedrooms" is strictly positive (in particular, they are dependent).

**Consistency Check:** Let us create a scatterplot with "bathrooms" as the x- and "bedrooms" as the y-variable.

In [None]:
plt.scatter(
    df_1["bathrooms"], 
    df_1["bedrooms"], 
    c="blue", 
    alpha=0.005, 
    label="Houses"
)

plt.title("Bathrooms vs. Bedrooms")
plt.xlabel("bathrooms")
plt.ylabel("bedrooms")
plt.xlim(0,6)
plt.ylim(0,8)

plt.savefig("./plots/bathrooms-vs-bedrooms.png")
plt.show()


The scatterplot suggests a slightly positive correlation. Hence, the hypothesis should be *accepted*.

**REMARK:** The following computation further corroborates the hypothesis:

In [None]:
df_1["bathrooms"].corr(df_1["bedrooms"])
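For reference, `Series.corr` computes Pearson's correlation coefficient by default. The following sketch on made-up numbers (not taken from the data set) spells out the underlying formula and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up toy values, not taken from the data set.
x = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0])
y = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0])

# Pearson's r: sum of products of the deviations from the mean,
# divided by the square root of the product of the summed squares.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r_manual)
print(np.isclose(r_manual, x.corr(y)))
```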

### Hypothesis #2: Dependence of a Continuous and a Discrete Variable 

**Hypothesis:** The variables "price" and "waterfront" are not independent.

**Consistency Check:** Let us compare the distribution of "price" with the distribution of "price" conditioned on the event that "waterfront" equals 1.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].hist(df_1["price"], bins=100, density=True)
ax[0].set_title("Price")
ax[0].set_xlabel("Price")
ax[0].set_ylabel("Density")

is_wf_1 = (df_1["waterfront"] == 1) 
ax[1].hist(df_1["price"][is_wf_1], bins=100, density=True)
ax[1].set_title("Price | Waterfront == 1")
ax[1].set_xlabel("Price")
ax[1].set_ylabel("Density")

plt.savefig("./plots/price-on-waterfront.png")
plt.show()


Obviously, the two distributions are distinct. Thus, the hypothesis should be *accepted*.
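The visual comparison can be complemented by group-wise summary statistics. The following is a sketch on invented numbers; only the column names follow the data set.

```python
import pandas as pd

# Invented toy data; only the column names follow the data set.
toy = pd.DataFrame({
    "price": [300_000, 450_000, 500_000, 1_200_000, 2_000_000],
    "waterfront": [0, 0, 0, 1, 1],
})

# Median and number of observations of "price" per "waterfront" value.
summary = toy.groupby("waterfront")["price"].agg(["median", "count"])
print(summary)
```

Note that `groupby` drops rows with a missing group key by default, so houses without a "waterfront" entry would not appear in such a table.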

### Hypothesis #3: Dependence of Two Continuous Variables

**Hypothesis:** The variables "lat" and "long" are not correlated.

**Consistency Check:** Let us look at the scatterplot of "lat" versus "long":

In [None]:
plt.scatter(
    df_1["lat"], 
    df_1["long"], 
    c="blue", 
    alpha=0.1, 
    label="Location of Houses"
)

plt.title("Location of Houses")
plt.xlabel("Latitude")
plt.ylabel("Longitude")

plt.savefig("./plots/lat-vs-long.png")
plt.show()

The visual evidence against the hypothesis is insufficient, so it should be *accepted*.

**REMARK:** The following computation also shows that the correlation does not deviate drastically from zero:

In [None]:
df_1["lat"].corr(df_1["long"])

## Definitions

Recall that our client's description (which she and her lawyer find partly denigrating!) was as follows:

> Amy Williams | Seller | Mafiosa, sells several central houses (top 10%) over time, needs average outskirt houses over time to hide from the FBI.

By detokenising her requirements, we see that we have to answer the following questions:

1. When is a house considered central or on the outskirts? (We need a measure of **centrality**.)
2. What does "top 10%" refer to: (non-)peripherality, price, size, ...?
3. When is an (outskirt) house considered average? (We need a measure of **exceptionality**.)
4. When is an (outskirt) house suitable for avoiding inconveniences with the FBI? (We need a measure of ... **privacy**.)

There are definitely no unique answers to these questions. We have to make educated guesses (based on some plots) ...

### Ad (Top 10%) "Centrality"

We attack the first two questions simultaneously.

As a first approximation, we ignore matters of curvature and consider the earth as (locally) flat. We treat latitude and longitude as if they were lengths, i.e. as x- and y-coordinates (we are not in Arctic zones). The distribution of the house locations was shown in the scatterplot for Hypothesis #3.

As a measure of centrality for a house, we take the number of houses that lie in a "rectangular" vicinity of the house under consideration. 

In [None]:
# These are guesses based on the plot.
# Of course, one can imagine more sophisticated methods to
# determine these quantities ...
DIST_X = 0.1
DIST_Y = 0.1

# Matrices of latitudinal and longitudinal distances
# (kudos to NumPy's broadcasting magic).
lat_diff = np.abs(df_1["lat"].values[:, np.newaxis] - df_1["lat"].values)
long_diff = np.abs(df_1["long"].values[:, np.newaxis] - df_1["long"].values)
inside_rectangle = (lat_diff <= DIST_X) & (long_diff <= DIST_Y)

# Sum each row to get the number of neighbours (the house itself included).
df_1["centrality"] = inside_rectangle.sum(axis=1)
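The broadcasting approach builds full n × n matrices, which can become memory-hungry for large n. A chunked variant is sketched below; the function name `count_neighbours` and the parameter `chunk_size` are hypothetical, and the coordinates are invented. It is meant to give the same counts as the cell above while only holding a block of rows in memory at a time.

```python
import numpy as np


def count_neighbours(lat, long, dist_lat, dist_long, chunk_size=1024):
    """Count, for each point, the points inside its rectangle (itself included)."""
    lat = np.asarray(lat, dtype=float)
    long = np.asarray(long, dtype=float)
    counts = np.empty(lat.size, dtype=np.int64)
    for start in range(0, lat.size, chunk_size):
        stop = start + chunk_size
        # Only a (chunk_size x n) block is held in memory at a time.
        near_lat = np.abs(lat[start:stop, np.newaxis] - lat) <= dist_lat
        near_long = np.abs(long[start:stop, np.newaxis] - long) <= dist_long
        counts[start:stop] = (near_lat & near_long).sum(axis=1)
    return counts


# Toy coordinates (invented): three nearby points and one far away.
lat = [47.50, 47.55, 47.52, 48.50]
long = [-122.30, -122.25, -122.28, -121.00]
print(count_neighbours(lat, long, 0.1, 0.1, chunk_size=2))  # [3 3 3 1]
```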

Let us create a histogram for the distribution of centrality:

In [None]:
plt.hist(
    df_1["centrality"], 
    bins=128, 
    alpha=0.75, 
    density=True,
    color="blue", 
    edgecolor="blue",
)

plt.title("Distribution of Centrality")
plt.xlabel("Centrality")
plt.ylabel("Density")

plt.savefig("./plots/centrality-hist.png")
plt.show()

We show how the centrality is reflected in a scatterplot of latitude and longitude in a moment.

But before then, let us give a precise meaning to *top 10% central house* and *house on the outskirts*:

- Central houses (top 10%): Houses with centrality above the 0.9-quantile. 
- Houses on the outskirts: Houses with centrality below the 0.1-quantile.
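To make these two definitions concrete, here is a small sketch on invented centrality values. It uses `np.select` to assign the three groups; the group names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Invented centrality values for ten hypothetical houses.
centrality = pd.Series([5, 12, 20, 33, 41, 58, 64, 79, 85, 120])

low = centrality.quantile(0.1)
high = centrality.quantile(0.9)

# Strict inequalities, matching the two definitions above.
group = np.select(
    [centrality > high, centrality < low],
    ["central", "outskirts"],
    default="middle",
)
print(group.tolist())
```

Because centrality is an integer count, several houses may share the threshold value. With strict inequalities, the two extreme groups can therefore contain somewhat fewer than 10% of the houses.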

Now, we can visualise the groups of houses on a scatterplot:

In [None]:
centrality_10pc = df_1["centrality"].quantile(0.1)
centrality_90pc = df_1["centrality"].quantile(0.9)

colours = np.where(
    df_1["centrality"] > centrality_90pc, 
    "green",
    np.where(
        df_1["centrality"] < centrality_10pc, 
        "red", 
        "blue",
    )
)

plt.scatter(
    df_1["lat"], 
    df_1["long"], 
    c=colours, 
    alpha=0.1, 
    label="Location of Houses"
)

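legend_elements = [
    Patch(facecolor='green', edgecolor='g', label="Top 10%"),
    Patch(facecolor='blue', edgecolor='b', label="Middle"),
    Patch(facecolor='red', edgecolor='r', label="Bottom 10%"),
]

# loc names the corner of the legend box that is placed at bbox_to_anchor,
# which is given in axes coordinates; x > 1 moves the legend beyond the
# right edge of the plot area.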
plt.legend(
    handles=legend_elements, 
    loc="lower right", 
    bbox_to_anchor=(1.33,0),
)

plt.title("Location of Houses, Coloured by Centrality")
plt.xlabel("Latitude")
plt.ylabel("Longitude")

plt.savefig("./plots/centrality-plot.png")
plt.show()

### Ad "Exceptionality" (on the Outskirts)

Fortunately, the data frame includes information on the neighbourhood mean for living space ("sqm_living15") and lot size ("sqm_lot15"). To measure exceptionality, we sum the absolute differences from these neighbourhood means:

In [None]:
df_1["exceptionality"] = (
    + (df_1["sqm_living"] - df_1["sqm_living15"]).abs()
    + (df_1["sqm_lot"] - df_1["sqm_lot15"]).abs()
)
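A worked example on invented numbers may help to read this measure. The values below are not taken from the data set; only the column names follow the notebook.

```python
import pandas as pd

# Invented values in square metres; column names follow the notebook.
toy = pd.DataFrame({
    "sqm_living": [150.0, 200.0],
    "sqm_living15": [160.0, 120.0],
    "sqm_lot": [500.0, 900.0],
    "sqm_lot15": [480.0, 400.0],
})

toy["exceptionality"] = (
    (toy["sqm_living"] - toy["sqm_living15"]).abs()
    + (toy["sqm_lot"] - toy["sqm_lot15"]).abs()
)
print(toy["exceptionality"].tolist())  # [30.0, 580.0]
```

In this toy case, the second house owes most of its score to the lot term (500 of 580). Whenever lots are much larger than living areas, the lot difference will tend to dominate the sum; whether that is desirable is a modelling choice.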

So, exceptionality is distributed on the map as follows:

In [None]:
df_1["exceptionality_q"] = pd.qcut(df_1["exceptionality"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
colours = {"Q1": "green", "Q2": "yellow", "Q3": "orange", "Q4": "red"}

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
axes = axes.flatten()  

for i, quartile in enumerate(["Q1", "Q2", "Q3", "Q4"]):
    ax = axes[i]
    subset = df_1[df_1["exceptionality_q"] == quartile]
    ax.scatter(df_1["lat"], df_1["long"], color="gray", alpha=0.1, label=quartile)
    ax.scatter(subset["lat"], subset["long"], color=colours[quartile], alpha=0.1, label=quartile)
    ax.set_title(f"Exceptionality, {quartile}")
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Longitude")
    # ax.legend()

fig.suptitle("Location of Houses, Coloured by Exceptionality", fontsize=16)

# Adjust layout to prevent overlapping
plt.tight_layout(rect=[0, 0, 1, 0.95])

plt.savefig("./plots/normality-plot.png")
plt.show()

There are no significant visible differences between the quartiles.

*The (dis-)recommended regions will be indicated by the cursor.*

### Ad "Privacy"

For privacy, we consider the size of the basement ("sqm_basement") as a proxy measure:

In [None]:
df_1["privacy"] = df_1["sqm_basement"]

Recall that we interpreted `nan` values in the column "sqm_basement" as `0`. For the sake of simplicity, let us simply visualise where "privacy" is (not) strictly positive:

In [None]:
df_priv0 = df_1[df_1["privacy"] == 0]
df_priv1 = df_1[df_1["privacy"] > 0]

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)

# All houses in grey as background, houses without a basement in red.
axes[0].scatter(df_1["lat"], df_1["long"], color="gray", alpha=0.1, label="All houses")
axes[0].scatter(df_priv0["lat"], df_priv0["long"], color="red", alpha=0.01, label="Without privacy")
axes[0].set_title("Without Privacy")
axes[0].set_xlabel("Latitude")
axes[0].set_ylabel("Longitude")

# All houses in grey as background, houses with a basement in green.
axes[1].scatter(df_1["lat"], df_1["long"], color="gray", alpha=0.1, label="All houses")
axes[1].scatter(df_priv1["lat"], df_priv1["long"], color="green", alpha=0.01, label="With privacy")
axes[1].set_title("With Privacy")
axes[1].set_xlabel("Latitude")
axes[1].set_ylabel("Longitude")

fig.suptitle("Location of Houses, With And Without Privacy", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])

plt.savefig("./plots/privacy-plot.png")
plt.show()

*The (dis-)recommended regions will be indicated by the cursor.*