# DS EDA Project ⟡ King County House Prices ⟡ *Amy Williams* Edition

## Outline of the Problem

<!-- | Amy Williams        | Seller      | Mafiosi, sells several central houses(top10%) over time, needs average outskirt houses over time to hide from the FBI    -->
<!-- 
Amy Williams, businesswoman by trade (very upset that her very smart tricks are construed as organised crime and ... )  
- wants to sells several central houses (top 10%) over time, 
- needs average outskirt houses over time to hide from the FBI -->



## Usual Setup Tasks

For analysing the data set, we first load the required modules.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let us import the data under consideration:

In [None]:
# Path to the csv file containing the data.
DATAPATH = "./data/King_County_House_prices_dataset.csv"

df_0 = pd.read_csv(DATAPATH)

## First Inspection and Data Cleaning

For a bird's-eye view of this data set, let us go through the usual inspection steps.

In [None]:
# The first 10 rows of the table.
df_0.head(10)

In [None]:
# Information about column dtypes and NaN values. 
df_0.info()

In [None]:
# Common statistical quantities for the numerical columns.
df_0.describe()

We start with the necessary cleaning tasks. For convenience, let us create a deep copy of the original data which will incorporate the edits.

In [None]:
df_1 = df_0.copy()

We fix the data types of those columns where they are not yet appropriate.

### Ad "date"

As we saw, the displayed `dtype` of "date" is `object`. 

When working with dates, it is wise to use the tools that the library provides. In pandas, this means casting the affected columns to the datetime type.

Luckily, there are no further intricacies (like exotic date formats), so the following command is sufficient:

In [None]:
df_1["date"] = pd.to_datetime(df_0["date"])
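
If the date strings were less regular, an explicit format string together with `errors="coerce"` would make the parsing stricter and easier to audit. The following is a minimal sketch on made-up values; the format `%m/%d/%Y` is an assumption and would have to be checked against the actual file.

```python
import pandas as pd

# Made-up sample values; the last entry is deliberately malformed.
raw_dates = pd.Series(["10/13/2014", "2/25/2015", "not a date"])

# An explicit format avoids silent guessing (assumed format, see above);
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed_dates = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Number of entries that could not be parsed.
n_unparsed = int(parsed_dates.isna().sum())
```

Counting the `NaT` entries afterwards shows how many rows would need manual attention.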

### Ad "bedrooms", "bathrooms", "floors"

Let us check whether the columns encoding the number of specific rooms really require floating point numbers. To this end, we look at the fractional parts that arise in these columns.

In [None]:
room_cols = [
    "bedrooms",
    "bathrooms",
    "floors",
]

pd.DataFrame(
    [(col, (df_0[col] % 1).unique()) for col in room_cols],
    columns=["Room Column", "Fractional Parts"]
)

So, it is perfectly fine to leave these columns as they are (although the discriminating criterion between 0.5 bathrooms and 0.75 bathrooms is almost surely not sharp ...).

### Ad "yr_renovated"

We saw that "yr_renovated" has dtype `float64`. When we inspect the values of this column by

In [None]:
df_0["yr_renovated"].unique() 

, we immediately recognise three aspects:

- It is very likely that the years are ten times too high.
- Floating point numbers are not necessary; integers are sufficient.
- Among the values, there are `nan` and `0`. It is not definitively clear what `0` should mean: It could mean no renovation but it could also be an alternative encoding for `nan`. We will stick to the second interpretation. 

Let us address these problems:

In [None]:
# To retain NaN values, "Int64" instead of "int64" must be used.
df_1["yr_renovated"] = (df_0["yr_renovated"] / 10).map(
    lambda x: x if x > 0 else np.nan
).astype("Int64")
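
To see the effect of this cleaning step in isolation, here is a small sketch on made-up values that mimic the raw column. It uses `Series.where` instead of `map`, which is merely an alternative way to write the same idea, not the code used above.

```python
import numpy as np
import pandas as pd

# Made-up values mimicking the raw column: 0 and NaN both mean "unknown".
raw = pd.Series([0.0, 19910.0, np.nan, 20140.0])

# Scale down by 10, blank out non-positive entries, then cast to the
# nullable integer dtype so that the missing values survive the cast.
cleaned = (raw / 10).where(raw > 0).astype("Int64")
```

The zeros and the original `nan` both end up as `<NA>`, while the remaining entries become plain integer years.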

### Ad "waterfront", "view"

It is very likely that the columns "waterfront" and "view" are, in actual fact, categorical (or even Boolean) so that floating point numbers are not necessary. Indeed, we have:

In [None]:
maybe_cat_cols = [
    "waterfront",
    "view",
]

pd.DataFrame(
    [(col, df_0[col].unique()) for col in maybe_cat_cols],
    columns=["Maybe Categorical Column", "Unique Values"]
)

Hence, they are categorical ("waterfront" is even Boolean). So let us convert these columns:

In [None]:
for col in maybe_cat_cols:
    # Conversion to "int64" will not work since NaN values are present,
    # we have to use "Int64".
    df_1[col] = df_0[col].astype("Int64")

df_1.info()

### Ad "sqft_*"

As a last step, we follow a wish of Miss Williams and convert the columns measured in square feet (those matching "sqft_*") into square metres (she is an advocate of metric units!).

In [None]:
# Square metres per square foot (1 ft = 0.3048 m exactly).
SQ_M_TO_SQ_F_RATIO = 0.09290304

sqft_cols = [col for col in df_0.columns if col.startswith("sqft_")]

for col in sqft_cols:
    # Dropping sqft column.
    df_1 = df_1.drop(columns=[col], errors="ignore")
    # Adding sqm column.
    col_new = col.replace("sqft_", "sqm_")
    df_1[col_new] = SQ_M_TO_SQ_F_RATIO * df_0[col]

df_1.info()
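
As a sanity check of the conversion factor, and as a more compact way to write the conversion, consider the following sketch on made-up areas. The names `SQM_PER_SQFT`, `toy` and `toy_sqm` are introduced only for this illustration.

```python
import math

import pandas as pd

# One international foot is exactly 0.3048 m, so one square foot
# corresponds to 0.3048 ** 2 square metres.
SQM_PER_SQFT = 0.3048 ** 2

# Made-up areas in square feet.
toy = pd.DataFrame(
    {"sqft_living": [1000.0, 2500.0], "sqft_lot": [5000.0, 7200.0]}
)

# Convert all "sqft_*" columns at once and rename them to "sqm_*".
toy_sqm = (toy.filter(regex="^sqft_") * SQM_PER_SQFT).rename(
    columns=lambda name: name.replace("sqft_", "sqm_", 1)
)

# The factor agrees with the constant used above up to rounding.
factor_matches = math.isclose(SQM_PER_SQFT, 0.09290304)
```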

## Warm Up: Three Haphazard Hypotheses on the Data

To get some acquaintance with the data set, we state three arbitrary (not well-thought-out) hypotheses on the given data and check their consistency with the data afterwards.

### Hypothesis #1: Dependence of Two Discrete Variables

**Hypothesis:** The correlation between "bathrooms" and "bedrooms" is strictly positive (particularly, they are dependent).

**Consistency Check:** Let us draw a scatterplot with "bathrooms" as x- and "bedrooms" as y-variable.

In [None]:
plt.scatter(
    df_1["bathrooms"], 
    df_1["bedrooms"], 
    c="blue", 
    alpha=0.005, 
    label="Houses"
)

plt.title("Bedrooms vs. Bathrooms")
plt.xlabel("bathrooms")
plt.ylabel("bedrooms")

plt.show()

A slightly positive correlation can be recognised. Hence, the hypothesis should be *accepted*.

**REMARK:** The following computation corroborates the hypothesis, too:

In [None]:
df_1["bathrooms"].corr(df_1["bedrooms"])
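
Since both variables take only a few discrete values, a rank-based coefficient is a reasonable cross-check of the Pearson correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks; the numbers are made up and only illustrate the mechanics.

```python
import pandas as pd

# Made-up room counts, for illustration only.
toy = pd.DataFrame(
    {
        "bathrooms": [1.0, 1.5, 2.0, 2.5, 2.5, 3.0],
        "bedrooms": [2, 3, 3, 4, 3, 5],
    }
)

# Pearson correlation of the raw values.
pearson = toy["bathrooms"].corr(toy["bedrooms"])

# Spearman correlation: Pearson correlation of the (average) ranks.
spearman = toy["bathrooms"].rank().corr(toy["bedrooms"].rank())
```

If both coefficients have the same sign on the real data, the conclusion does not hinge on the exact spacing of the room counts.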

### Hypothesis #2: Dependence of a Continuous and a Discrete Variable 

**Hypothesis:** The variables "price" and "waterfront" are not independent.

**Consistency Check:** Let us compare the distribution of "price" with the distribution of "price" conditioned on the event that "waterfront" is 1.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].hist(df_1["price"], bins=100, density=True)
ax[0].set_title("Price")
ax[0].set_xlabel("Price")
ax[0].set_ylabel("Density")

is_wf_1 = (df_1["waterfront"] == 1) 
ax[1].hist(df_1["price"][is_wf_1], bins=100, density=True)
ax[1].set_title("Price | Waterfront == 1")
ax[1].set_xlabel("Price")
ax[1].set_ylabel("Density")

plt.show()


Evidently, these distributions are very different. The hypothesis should be *accepted*.
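
The visual comparison can be complemented by a few summary statistics per group. The following sketch uses made-up prices; the frame `toy` and its values are purely illustrative and say nothing about the real data.

```python
import pandas as pd

# Made-up prices; one house has an unknown waterfront status.
toy = pd.DataFrame(
    {
        "price": [300_000, 450_000, 520_000, 1_200_000, 2_100_000, 380_000],
        "waterfront": pd.array([0, 0, 0, 1, 1, pd.NA], dtype="Int64"),
    }
)

# Rows with a missing "waterfront" value are dropped by groupby by default.
summary = toy.groupby("waterfront")["price"].agg(["count", "median", "mean"])
```

A large gap between the group medians points in the same direction as the two histograms.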

### Hypothesis #3: Dependence of Two Continuous Variables

**Hypothesis:** "lat" and "long" are not correlated.

**Consistency Check:** Let us look at the scatterplot of "lat" and "long":

In [None]:
plt.scatter(
    df_1["lat"], 
    df_1["long"], 
    c="blue", 
    alpha=0.1, 
    label="Location of Houses"
)

plt.title("Location of Houses")
plt.xlabel("Latitude")
plt.ylabel("Longitude")

plt.show()

There is not enough visual evidence against the hypothesis, so it should be *accepted*.

**REMARK:** The following computation also shows that the correlation does not deviate drastically from zero:

In [None]:
df_1["lat"].corr(df_1["long"])
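
A word of caution: a correlation near zero does not imply that two variables are independent. The following synthetic sketch (unrelated to the house data) shows two perfectly dependent variables whose correlation vanishes.

```python
import numpy as np

# x is symmetric around zero and y is a deterministic function of x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# The linear correlation is zero up to rounding errors,
# although y is completely determined by x.
corr_xy = np.corrcoef(x, y)[0, 1]
```

So the check above only addresses linear dependence between "lat" and "long".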

## Definitions

Recall that our client's description (which is partially denigrating in her and her lawyer's opinion!) was as follows:

> Amy Williams | Seller | Mafiosa, sells several central houses (top 10%) over time, needs average outskirt houses over time to hide from the FBI.

By detokenising the demands of our client, we see that we have to answer the following questions:

1. When is a house considered central or on the outskirts? I.e., we need here an appropriate measure for **centrality**.
2. What does "top 10%" refer to: To (non-)peripherality, price, size, ...? I.e., we have to attribute "top 10%" to some appropriate measure.
3. When is an (outskirt) house considered average? I.e., we need here an appropriate measure for **normality**.
4. When is an (outskirt) house suitable for avoiding inconveniences with the FBI? I.e., we need here an appropriate measure for ... **privacy**.

There are definitely no unique answers to these questions. We have to make educated guesses (based on some plots) ...

### Ad (Top 10%) "Centrality"

We attack the first two questions simultaneously.

As a first approximation, we will ignore matters of curvature and consider the earth as (locally) flat. We will treat latitude and longitude as if they were lengths, i.e. as x- and y-coordinates (we are not in Arctic zones). The distribution of the house locations was shown in the scatterplot for Hypothesis #3.

As a measure of centrality for a house, we take the number of houses that lie in a "rectangular" vicinity of the original house. 

In [None]:
# These are guesses, based on the plot, on how to choose the half
# side lengths of the rectangular vicinity (in degrees).
# Of course, one can imagine more sophisticated methods to
# determine these quantities.
DIST_X = 0.1
DIST_Y = 0.1

# Matrices of latitudinal and longitudinal distances
# (kudos to NumPy's broadcasting magic).
lat_diff = np.abs(df_1["lat"].values[:, np.newaxis] - df_1["lat"].values)
long_diff = np.abs(df_1["long"].values[:, np.newaxis] - df_1["long"].values)
inside_rectangle = (lat_diff <= DIST_X) & (long_diff <= DIST_Y)

# Sum each row to get the number of houses in the vicinity
# (the house itself is included in the count).
df_1["centrality"] = inside_rectangle.sum(axis=1)
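
The full distance matrices have one entry per pair of houses, which can become heavy on memory for larger data sets. A possible variation, sketched here under the assumption that the same rectangular vicinity is wanted, processes the houses in chunks. The function name `count_neighbours` and the parameter `chunk_size` are hypothetical.

```python
import numpy as np


def count_neighbours(lat, long, dist_lat, dist_long, chunk_size=1024):
    """Count, for every house, the houses inside its rectangular vicinity.

    The house itself is included in its own count.
    """
    lat = np.asarray(lat, dtype=float)
    long = np.asarray(long, dtype=float)
    counts = np.empty(lat.shape[0], dtype=np.int64)

    # Only a (chunk_size x n) slice of the distance matrix exists at a time.
    for start in range(0, lat.shape[0], chunk_size):
        stop = start + chunk_size
        near_lat = np.abs(lat[start:stop, np.newaxis] - lat) <= dist_lat
        near_long = np.abs(long[start:stop, np.newaxis] - long) <= dist_long
        counts[start:stop] = (near_lat & near_long).sum(axis=1)

    return counts


# Made-up coordinates: the first two houses are close to each other,
# the other two are isolated.
toy_lat = np.array([47.5, 47.55, 47.9, 47.5])
toy_long = np.array([-122.3, -122.25, -122.0, -121.5])
toy_counts = count_neighbours(toy_lat, toy_long, 0.1, 0.1, chunk_size=2)
```

On the toy coordinates, the two neighbouring houses get a count of 2 and the isolated ones a count of 1.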

Let us create a histogram for the distribution of centrality:

In [None]:
plt.hist(
    df_1["centrality"], 
    bins=128, 
    alpha=0.75, 
    density=True,
    color="blue", 
    edgecolor="blue",
)

plt.title("Distribution of Centrality")
plt.xlabel("Centrality")
plt.ylabel("Frequency")

plt.show()

We will show in a moment how the centrality is reflected in a scatterplot of latitude and longitude.

Before that, let us give a precise meaning to "top 10% central house" and "house on the outskirts". 
We will simply define that 

- central houses (top 10%) are those houses that lie above the 0.9-quantile with respect to centrality.
- houses on the outskirts are those houses that lie below the 0.1-quantile with respect to centrality.
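
These two definitions can be turned into an explicit class label per house. The sketch below uses a made-up centrality series; the label names "central", "middle" and "outskirts" are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up centrality values.
centrality = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

lower = centrality.quantile(0.1)
upper = centrality.quantile(0.9)

# Strict inequalities, as in the definitions above.
location_class = pd.Series(
    np.select(
        [(centrality > upper).to_numpy(), (centrality < lower).to_numpy()],
        ["central", "outskirts"],
        default="middle",
    ),
    index=centrality.index,
)
```

Because the inequalities are strict, houses lying exactly on a quantile count as "middle", so each extreme class may contain fewer than 10% of the houses when there are ties.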

Now we are able to visualise central and peripheral houses:

In [None]:
bottom_10pc_threshold = df_1["centrality"].quantile(0.1)
top_10pc_threshold = df_1["centrality"].quantile(0.9)

colours = np.where(
    df_1["centrality"] > top_10pc_threshold, 
    "green",
    np.where(
        df_1["centrality"] < bottom_10pc_threshold, 
        'red', 
        'blue',
    )
)

plt.scatter(
    df_1["lat"], 
    df_1["long"], 
    c=colours, 
    alpha=0.1, 
    label="Location of Houses"
)

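# TODO Replace this ugly ad-hoc solution.
from matplotlib.patches import Patch

# The labels must match the colours assigned above:
# green = top 10%, blue = middle, red = bottom 10%.
legend_elements = [
    Patch(facecolor='green', edgecolor='g', label="Top 10%"),
    Patch(facecolor='blue', edgecolor='b', label="Middle"),
    Patch(facecolor='red', edgecolor='r', label="Bottom 10%"),
]

# `loc` names the corner of the legend box that is pinned to the point
# `bbox_to_anchor` (given in axes coordinates); x = 1.33 lies to the
# right of the plot area.
plt.legend(handles=legend_elements, loc="lower right", bbox_to_anchor=(1.33,0))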

plt.title("Location of Houses, Coloured by Centrality")
plt.xlabel("Latitude")
plt.ylabel("Longitude")

plt.show()
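
One possible way to avoid hand-made legend entries is to draw one scatter per class and let matplotlib build the legend from the `label` arguments. This is only a sketch on made-up coordinates; the column `location_class` and the dictionary `class_colours` are hypothetical names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up coordinates and class labels, for illustration only.
toy = pd.DataFrame(
    {
        "lat": [47.50, 47.52, 47.61, 47.70, 47.35],
        "long": [-122.30, -122.28, -122.20, -121.90, -122.00],
        "location_class": [
            "central", "central", "middle", "outskirts", "outskirts",
        ],
    }
)
class_colours = {"central": "green", "middle": "blue", "outskirts": "red"}

fig, ax = plt.subplots()

# One scatter call per class; each call contributes one legend entry.
for name, colour in class_colours.items():
    subset = toy[toy["location_class"] == name]
    ax.scatter(subset["lat"], subset["long"], c=colour, alpha=0.5, label=name)

legend = ax.legend(loc="lower right", bbox_to_anchor=(1.33, 0))
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
```

With this layout, colours and labels are defined in one place and cannot drift apart.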

### Ad "Normality" (on the Outskirts)

Fortunately, the data frame contains two columns that carry information about the mean of a house's nearest neighbourhood, namely "sqm_living15" and "sqm_lot15". As a measure of normality, we take the sum of the absolute distances between a house and its neighbourhood means (so small values indicate an average house), more precisely:

In [None]:
df_1["normality"] = (
    + (df_1["sqm_living"] - df_1["sqm_living15"]).abs()
    + (df_1["sqm_lot"] - df_1["sqm_lot15"]).abs()
)
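
Lot areas are typically much larger than living areas, so the lot term may dominate this sum. One possible variation, which is not part of the analysis above, divides each deviation by its standard deviation before adding. The helper `scaled_deviation` and the toy values are hypothetical.

```python
import pandas as pd

# Made-up areas in square metres; the last house is a clear outlier.
toy = pd.DataFrame(
    {
        "sqm_living": [100.0, 180.0, 250.0],
        "sqm_living15": [120.0, 170.0, 150.0],
        "sqm_lot": [500.0, 900.0, 4000.0],
        "sqm_lot15": [550.0, 800.0, 1000.0],
    }
)


def scaled_deviation(own, neighbours):
    """Absolute deviation from the neighbourhood mean, in units of its std."""
    deviation = (own - neighbours).abs()
    return deviation / deviation.std()


# Both terms are now dimensionless and of comparable magnitude.
toy["normality_scaled"] = (
    scaled_deviation(toy["sqm_living"], toy["sqm_living15"])
    + scaled_deviation(toy["sqm_lot"], toy["sqm_lot15"])
)
```

As before, small values indicate a house that resembles its neighbourhood.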

### Ad "Privacy"

The simplest way to stay out of sight is to lock oneself in the basement. So, as a measure of privacy, we simply choose "sqm_basement".

In [None]:
df_1["privacy"] = df_1["sqm_basement"]

What to do?