# Data Analysis

## Outline of the Problem

Let us cite the original description of our client: 

> Amy Williams (Seller, Mafiosa) sells several central houses (top $10\%$) over time, needs average outskirt houses over time to hide from the FBI.

As Miss Williams is somewhat distressed that her astute manoeuvres are misconstrued as organised crime, let us recast the version proposed by her solicitor:

> Amy Williams, a businesswoman by trade, …
> - wishes to sell several central houses (top $10\%$) over time, and
> - requires average outskirt houses over time to avoid unwelcome attention from the FBI.

## Setup

To analyse the data set, we begin with loading the required modules:

In [None]:
import os
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Required for some plot fine tuning.
import matplotlib.patches as patches

In this notebook, we employ the following environment variables. As in the preceding notebook, the variable `DF_PKL_PATH` denotes the path to the pickled data frame.

In [None]:
# Path to root directory of the repo.
root_dir_ = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
)
ROOT_DIR = root_dir_.strip()

# Path to pickled dataframe.
DF_PKL_PATH = os.path.join(ROOT_DIR, "data",  "df.pkl")

# Path to plots directory.
PLOTS_DIR = os.path.join(ROOT_DIR, "notebooks", "plots")

We define two auxiliary functions; the first offers a convenient wrapper for `plt.savefig`.

In [None]:
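def plt_savefig(
    plot_corename,
    plot_extension=".png",
    base_directory=PLOTS_DIR,
) -> None:
    """Save the current figure as <base_directory>/<plot_corename><plot_extension>."""
    # Create the target directory if it does not exist yet.
    os.makedirs(base_directory, exist_ok=True)
    plot_name = f"{plot_corename}{plot_extension}"
    plot_path = os.path.join(base_directory, plot_name)
    plt.savefig(plot_path, bbox_inches="tight", dpi=600)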

The second function serves to annotate regions within plots.

In [None]:
def draw_rectangle(ax, x, y, width, height, color="blue", linewidth=1):
    # (x, y) is the centre of the rectangle; Rectangle expects the lower-left corner.
    rect = patches.Rectangle(
        (x - width / 2, y - height / 2),
        width,
        height,
        linewidth=linewidth,
        edgecolor=color,
        facecolor="none",
    )
    ax.add_patch(rect)

As the final setup step, we import the data set under consideration:

In [None]:
df = pd.read_pickle(DF_PKL_PATH)

## First Inspection and Data Cleaning

As a preliminary inspection, let us invoke the `info` and `sample` methods of the data frame.

In [None]:
df.info()

In [None]:
df.sample(10)

The assignment was accompanied by a table clarifying the meanings of the column headings. Let us reproduce it here.

| Identifier | Description |
| --- | --- |
| id | unique identifier for a house |
| date | Date house was sold |
| price | Price is prediction target |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | square footage of the home |
| sqft_lot | square footage of the lot |
| floors | Total floors (levels) in house |
| waterfront | House which has a view to a waterfront |
| view | quality of view |
| condition | How good the condition is ( Overall ) |
| grade | overall grade given to the housing unit, based on King County grading system |
| sqft_above | square footage of house apart from basement |
| sqft_basement | square footage of the basement |
| yr_built | Built Year |
| yr_renovated | Year when house was renovated |
| zipcode | zip |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15 | The square footage of the land lots of the nearest 15 neighbors |

Moreover, we consider the statistical properties of the numerical features:

In [None]:
df.describe()

### Data Cleaning

Let us now begin with the necessary cleaning tasks. In particular, we correct the data types for several columns where necessary.

### Ad "date"

As previously observed, the displayed `dtype` of "date" is `object`. Working with dates is considerably simplified if the column is cast to the appropriate type.

It is always advisable to employ the tools available for working with dates. In pandas, this entails casting the relevant columns to the datetime type. 

Fortunately, there are no additional complications—such as unconventional date formats—so the following command shall suffice:

In [None]:
df["date"] = pd.to_datetime(df["date"])
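To illustrate what the cast buys us, here is a minimal sketch on a hypothetical three-row frame. The name `toy` and its ISO-formatted date strings are invented for illustration and are not taken from the data set. Once the column has a datetime type, the `.dt` accessor exposes components such as year and month.

```python
import pandas as pd

# Hypothetical toy data; the strings mimic dates stored with dtype "object".
toy = pd.DataFrame({"date": ["2014-10-13", "2015-02-25", "2014-12-09"]})
toy["date"] = pd.to_datetime(toy["date"])

# With a datetime dtype, date components are available via ".dt".
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
print(toy.dtypes)
```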

### Ad "bedrooms", "bathrooms", "floors"

Let us examine the columns representing the number of specific rooms. We observed that they share a floating-point data type. We shall now identify which fractional parts occur in these columns:

In [None]:
room_cols = [
    "bedrooms",
    "bathrooms",
    "floors",
]

pd.DataFrame(
    [(col, (df[col] % 1).unique()) for col in room_cols],
    columns=["Room Column", "Fractional Parts"]
)

Hence, it is permissible to leave these columns as they stand—although the distinction between values like 0.5 and 0.75 bathrooms appears somewhat artificial.

### Ad "yr_renovated"

We observed that "yr_renovated" possesses the data type `float64`. The unique values are as follows:

In [None]:
df["yr_renovated"].unique() 

We note two issues:

- The use of floating-point numbers is unnecessary; integers suffice.
- Among the values appear `nan` and `0`. The meaning of `0` remains obscure: It may denote "no renovation" or serve as an alternative encoding for `nan`. We shall regard `0` as equivalent to `nan`.

We shall now proceed to address these matters.

In [None]:
# Treat 0 as "missing".
df["yr_renovated"] = df["yr_renovated"].replace(0.0, np.nan)
# To preserve NaN values, "Int64" instead of "int64" must be employed.
df["yr_renovated"] = df["yr_renovated"].astype("Int64")
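The difference between the two integer types can be seen in a small sketch. The series `years` below is hypothetical and merely imitates the structure of "yr_renovated"; the behaviour shown is that of recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical renovation years; 0.0 plays the role of a missing value.
years = pd.Series([1991.0, 0.0, 2005.0, np.nan])
years = years.replace(0.0, np.nan)

# NumPy's "int64" cannot represent NaN, so the plain cast fails.
try:
    years.astype("int64")
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False

# The nullable extension type "Int64" stores missing values as <NA>.
years_nullable = years.astype("Int64")
print(plain_cast_ok)
print(years_nullable)
```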

### Ad "waterfront", "view"

It is highly probable that the columns "waterfront" and "view" are, in truth, categorical—or even Boolean—so floating-point numbers are unnecessary. The unique values are as follows:

In [None]:
maybe_cat_cols = [
    "waterfront",
    "view",
]

pd.DataFrame(
    [(col, df[col].unique()) for col in maybe_cat_cols],
    columns=["Maybe Categorical Column", "Unique Values"]
)

Hence, the columns are categorical, and "waterfront" is even Boolean. Accordingly, we shall convert these columns:

In [None]:
for col in maybe_cat_cols:
    # Conversion to "int64" will not work since NaN values are present,
    # we have to use "Int64".
    df[col] = df[col].astype("Int64")

df.info()

### Ad "sqft_*"

Finally, at Miss Williams's request, we convert the columns representing square footage (those matching "sqft_*") into square metres as she advocates the use of metric units.

In [None]:
# Square metres per square foot (1 ft = 0.3048 m, hence 0.3048 ** 2).
SQ_M_TO_SQ_F_RATIO = 0.09290304

sqft_cols = [col for col in df.columns if col.startswith("sqft_")]

if "sqft_basement" in sqft_cols:
    # The column sqft_basement has data type object.
    # We convert it to integer and interpret missing values as 0.
    df["sqft_basement"] = df["sqft_basement"].replace("?", np.nan)
    df["sqft_basement"] = pd.to_numeric(df["sqft_basement"], errors="coerce")
    df["sqft_basement"] = df["sqft_basement"].fillna(0).astype("int64")

for col in sqft_cols:
    # Add sqm column.
    col_new = col.replace("sqft_", "sqm_")
    df[col_new] = SQ_M_TO_SQ_F_RATIO * df[col]
    # Drops sqft column.
    df = df.drop(columns=[col], errors="ignore")
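As a quick sanity check of the conversion factor, the following sketch derives it from the definition of the international foot (exactly 0.3048 metres) and applies it to a hypothetical area of 2,000 square feet. The names `METRES_PER_FOOT`, `SQ_M_PER_SQ_FT` and `area_sqft` are chosen for illustration only.

```python
# One international foot is defined as exactly 0.3048 metres.
METRES_PER_FOOT = 0.3048
SQ_M_PER_SQ_FT = METRES_PER_FOOT ** 2

# Hypothetical living area of 2,000 square feet.
area_sqft = 2000
area_sqm = SQ_M_PER_SQ_FT * area_sqft
print(round(SQ_M_PER_SQ_FT, 8), round(area_sqm, 1))
```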



## Three Haphazard Hypotheses Regarding the Dataset

To familiarise ourselves with the dataset, we shall formulate three haphazard hypotheses and examine their consistency with the data.

**Caveat:** We do *not* conduct hypothesis testing in a statistically rigorous sense!

### Hypothesis #1: Dependence of Two Discrete Variables

**Hypothesis:** The correlation between "bathrooms" and "bedrooms" is strictly positive. In particular, the two variables are dependent.

**Consistency Check:** Let us produce a scatterplot, taking "bathrooms" as the $x$- and "bedrooms" as the $y$-variable.

In [None]:
plt.scatter(
    df["bathrooms"],
    df["bedrooms"],
    c="blue",
    alpha=0.005,
    label="Houses",
)

plt.title("Bathrooms vs. Bedrooms")
plt.xlabel("bathrooms")
plt.ylabel("bedrooms")
plt.xlim(0,6)
plt.ylim(0,8)

plt_savefig("bathrooms-vs-bedrooms")
plt.show()


The correlation appears to be mildly positive. Hence, the hypothesis may be *accepted*.

**Remark:** The following computation further substantiates the hypothesis:

In [None]:
df["bathrooms"].corr(df["bedrooms"])
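By default, `Series.corr` computes Pearson's correlation coefficient. The following sketch reproduces it by hand, as the covariance divided by the product of the standard deviations, on invented toy values that are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the data set.
bathrooms = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 1.0])
bedrooms = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 3.0])

# Pearson's r: sum of products of the centred values, normalised
# by the square root of the product of the sums of squares.
x = bathrooms - bathrooms.mean()
y = bedrooms - bedrooms.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = bathrooms.corr(bedrooms)
print(r_manual, r_pandas)
```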

### Hypothesis #2: Dependence of a Continuous and a Discrete Variable 

**Hypothesis:** The variables "price" and "waterfront" are not independent.

**Consistency Check:** Let us compare the (unconditional) distribution of "price" with the distribution of "price" conditioned on the event that "waterfront" equals $1$.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

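ax[0].hist(df["price"], bins=100, density=True)
ax[0].set_title("Price")
ax[0].set_xlabel("Price")
# With density=True, the histogram shows a density, not raw frequencies.
ax[0].set_ylabel("Density")

# Missing "waterfront" entries are treated as not lying on the waterfront.
is_wf_1 = (df["waterfront"] == 1).fillna(False)
ax[1].hist(df.loc[is_wf_1, "price"], bins=100, density=True)
ax[1].set_title("Price | Waterfront == 1")
ax[1].set_xlabel("Price")
ax[1].set_ylabel("Density")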

plt_savefig("price-on-waterfront")
plt.show()

Evidently, the two distributions differ. Therefore, the hypothesis shall be *accepted*.
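A numerical complement to the histograms is to compare a summary statistic of the conditional and the unconditional distribution. The sketch below does so for the median on a hypothetical frame `toy` with invented prices; it shows the mechanics only and makes no claim about the actual data.

```python
import pandas as pd

# Hypothetical toy data with one missing "waterfront" entry.
toy = pd.DataFrame({
    "price": [300_000, 450_000, 520_000, 1_900_000, 2_400_000, 380_000],
    "waterfront": pd.array([0, 0, None, 1, 1, 0], dtype="Int64"),
})

# Missing entries are treated as "not on the waterfront".
is_wf = toy["waterfront"].eq(1).fillna(False)

overall_median = toy["price"].median()
waterfront_median = toy.loc[is_wf, "price"].median()
print(overall_median, waterfront_median)
```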

### Hypothesis #3: Dependence of Two Continuous Variables

**Hypothesis:** The variables "lat" and "long" are not correlated.

**Consistency Check:** Let us examine the scatterplot of "lat" versus "long":

In [None]:
plt.scatter(
    df["long"],
    df["lat"],
    c="blue",
    alpha=0.1,
    label="Location of Houses",
)

plt.title("Location of Houses")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

plt_savefig("long-vs-lat")
plt.show()

The visual evidence disfavouring the hypothesis is insufficient; hence, it shall be *accepted*.

**Remark:** The subsequent computation likewise demonstrates that the correlation does not deviate significantly from zero:

In [None]:
df["long"].corr(df["lat"])

## Definitions

Recall our client's description—which she and her solicitor deem somewhat denigratory—was as follows:

> Amy Williams | Seller | Mafiosa, sells several central houses (top $10\%$) over time, needs average outskirt houses over time to hide from the FBI.

By unpacking her requirements, we discern that the following questions must be addressed:

1. When is a house considered central or located on the outskirts? — We require a measure of **centrality**.
2. To what does "top $10\%$" refer? (Non-)peripherality, price, size, or other factors?
3. When is an (outskirt) house considered average? — We require a measure of **exceptionality**.
4. When is an (outskirt) house suitable for evading scrutiny by the FBI? — We require a measure of **privacy**.

There are certainly no unique answers to these questions. We must make educated guesses based on various plots.

### Ad (Top 10%) "Centrality"

We shall address the first two questions simultaneously.

As a first approximation, we shall disregard considerations of curvature and regard the Earth as (locally) flat. Longitude and latitude shall be treated as ordinary lengths, that is, as $x$- and $y$-coordinates (King County lies far from the Arctic regions).
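To gauge how coarse this approximation is, the following sketch estimates the ground distance covered by a tenth of a degree in each direction. It assumes a spherical Earth with a mean radius of 6371 km and a representative latitude of 47.5° for King County; both values are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption
LAT_DEG = 47.5            # assumed representative latitude

# One degree of latitude has (nearly) the same length everywhere,
# whereas one degree of longitude shrinks with the cosine of the latitude.
km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180
km_per_deg_long = km_per_deg_lat * math.cos(math.radians(LAT_DEG))

print(f"0.1 deg latitude  ~ {0.1 * km_per_deg_lat:.1f} km")
print(f"0.1 deg longitude ~ {0.1 * km_per_deg_long:.1f} km")
```

Under these assumptions, a box of equal extent in degrees is about one and a half times as tall as it is wide on the ground, which is acceptable for a rough measure of centrality.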

As a measure of centrality for a house, we select the number of houses that lie within a "rectangular" vicinity of the house under consideration.

In [None]:
# These are guesses based on the plot.
# Of course, one may invent more sophisticated methods to
# determine these quantities ...
DIST_X = 0.1  # half-width of the rectangle in degrees of longitude
DIST_Y = 0.1  # half-height of the rectangle in degrees of latitude

# Matrices of pairwise longitudinal and latitudinal distances
# (kudos to NumPy's broadcasting magic).
long_diff = np.abs(df["long"].values[:, np.newaxis] - df["long"].values)
lat_diff = np.abs(df["lat"].values[:, np.newaxis] - df["lat"].values)
inside_rectangle = (long_diff <= DIST_X) & (lat_diff <= DIST_Y)

# Sum along the columns to get the number of neighbours.
df["centrality"] = inside_rectangle.sum(axis=1)
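The broadcasting approach above builds full $n \times n$ matrices, so its memory footprint grows quadratically with the number of houses. Should this become a problem, the same counts can be obtained in chunks. The helper `count_neighbours` below is a hypothetical sketch, checked against the full-matrix computation on synthetic coordinates. Note that, as in the cell above, every house counts itself as a neighbour.

```python
import numpy as np

def count_neighbours(long, lat, dist_x, dist_y, chunk_size=1000):
    """Count points inside an axis-aligned box around each point.

    Hypothetical helper: rows are processed in chunks so that only a
    (chunk_size x n) block is held in memory at any time.
    """
    long = np.asarray(long)
    lat = np.asarray(lat)
    counts = np.empty(len(long), dtype=np.int64)
    for start in range(0, len(long), chunk_size):
        stop = start + chunk_size
        long_diff = np.abs(long[start:stop, np.newaxis] - long)
        lat_diff = np.abs(lat[start:stop, np.newaxis] - lat)
        inside = (long_diff <= dist_x) & (lat_diff <= dist_y)
        counts[start:stop] = inside.sum(axis=1)
    return counts

# Synthetic coordinates for a self-check against the full matrix.
rng = np.random.default_rng(0)
long = rng.uniform(-122.5, -121.5, size=500)
lat = rng.uniform(47.2, 47.8, size=500)

full = (
    (np.abs(long[:, np.newaxis] - long) <= 0.1)
    & (np.abs(lat[:, np.newaxis] - lat) <= 0.1)
).sum(axis=1)
chunked = count_neighbours(long, lat, 0.1, 0.1, chunk_size=64)
print(np.array_equal(full, chunked))
```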

Let us create a histogram illustrating the distribution of centrality:

In [None]:
plt.hist(
    df["centrality"], 
    bins=128, 
    alpha=0.75, 
    density=True,
    color="blue", 
    edgecolor="blue",
)

plt.title("Distribution of Centrality")
plt.xlabel("Centrality")
plt.ylabel("Density")

plt_savefig("centrality-hist")
plt.show()

Next, we visualise the spatial distribution of centrality across the map of King County.

To do so, we first assign a precise meaning to the expressions "top $10\%$ central house" and "house on the outskirts":

- Central houses (top 10%): Houses with centrality exceeding the $0.9$-quantile.
- Houses on the outskirts: Houses with centrality falling below the $0.1$-quantile.
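One subtlety of these definitions is worth illustrating: centrality is an integer count, so ties are common, and with strict inequalities the two groups may contain fewer than 10% of the houses each. The sketch below uses invented centrality values, not the actual data.

```python
import pandas as pd

# Hypothetical centrality counts with ties (20 houses).
centrality = pd.Series(
    [1, 2, 2, 3, 5, 8, 8, 9, 9, 9, 12, 12, 15, 20, 20, 20, 31, 40, 40, 55]
)

low = centrality.quantile(0.1)
high = centrality.quantile(0.9)

# Strict inequalities, as in the definitions above.
is_central = centrality > high
is_outskirt = centrality < low
print(low, high, is_central.mean(), is_outskirt.mean())
```

In this toy example, each group contains a single house, that is, 5% rather than 10% of the sample.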

We may now visualise the respective groups of houses on a scatterplot:

In [None]:
centrality_10pc = df["centrality"].quantile(0.1)
centrality_90pc = df["centrality"].quantile(0.9)

colours = np.where(
    df["centrality"] > centrality_90pc,
    "green",
    np.where(
        df["centrality"] < centrality_10pc,
        "red",
        "blue",
    ),
)

plt.scatter(df["long"], df["lat"], c=colours, alpha=0.1, label="Location of Houses")

legend_elements = [
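    patches.Patch(facecolor="green", label="Top 10%"),
    patches.Patch(facecolor="blue", label="Middle"),
    patches.Patch(facecolor="red", label="Bottom 10%"),
]

# `loc` names the corner of the legend box that is pinned to the point
# `bbox_to_anchor`, given in axes coordinates. Since x = 1.33 lies to the
# right of the axes, the legend is placed outside the plot area.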
plt.legend(
    handles=legend_elements,
    loc="lower right",
    bbox_to_anchor=(1.33, 0),
)

plt.title("Location of Houses, Coloured by Centrality")
plt.xlabel("Longitude")
plt.ylabel("Latitude")

plt_savefig("centrality-plot")
plt.show()

### Ad "Exceptionality" (on the Outskirts)

Fortunately, the data frame includes information on the neighbourhood means for living space ("sqm_living15") and lot size ("sqm_lot15"). To measure exceptionality, we sum the absolute deviations from these neighbourhood means:

In [None]:
df["exceptionality"] = (
    + (df["sqm_living"] - df["sqm_living15"]).abs()
    + (df["sqm_lot"] - df["sqm_lot15"]).abs()
)

The following plot visualises exceptionality on the map. Regions we advise against are framed in blue.

In [None]:
df["exceptionality_q"] = pd.qcut(df["exceptionality"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
colours = {"Q1": "green", "Q2": "yellow", "Q3": "orange", "Q4": "red"}

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
axes = axes.flatten()  

for i, quartile in enumerate(["Q1", "Q2", "Q3", "Q4"]):
    ax = axes[i]
    subset = df[df["exceptionality_q"] == quartile]
    ax.scatter(df["long"], df["lat"], color="gray", alpha=0.1, label=quartile)
    ax.scatter(subset["long"], subset["lat"], color=colours[quartile], alpha=0.1, label=quartile)
    ax.set_title(f"Exceptionality, {quartile}")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    # ax.legend()

fig.suptitle("Location of Houses, Coloured by Exceptionality", fontsize=16)

# Adjusts layout to prevent overlapping.
plt.tight_layout(rect=[0, 0, 1, 0.95])

draw_rectangle(axes[0], x=-122.00, y=47.25, width=0.10, height=0.05, color="blue")
draw_rectangle(axes[0], x=-122.45, y=47.40, width=0.10, height=0.05, color="blue")
draw_rectangle(axes[0], x=-122.00, y=47.42, width=0.10, height=0.05, color="blue")
draw_rectangle(axes[0], x=-122.10, y=47.76, width=0.10, height=0.05, color="blue")

plt_savefig("exceptionality-plot")
plt.show()
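The panels rely on `pd.qcut`, which bins values by sample quantiles so that each bin receives roughly the same number of observations. A minimal sketch on invented scores (the series `scores` is hypothetical):

```python
import pandas as pd

# Hypothetical exceptionality scores, sorted for readability.
scores = pd.Series([0.5, 1.0, 2.0, 3.5, 4.0, 6.0, 9.0, 20.0])

# Four quantile-based bins with human-readable labels.
quartiles = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
labels = quartiles.astype(str).tolist()
print(labels)
```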

### Ad "Privacy"

As a proxy for privacy, we consider the size of the basement ("sqm_basement"):

In [None]:
df["privacy"] = df["sqm_basement"]

Recall that we interpreted the `nan` values in the column "sqm_basement" as `0`. For the sake of simplicity, let us visualise solely where "privacy" is, or is not, strictly positive. Regions we advise against are framed in blue.

In [None]:
df_priv0 = df[df["privacy"] == 0]
df_priv1 = df[df["privacy"] > 0]

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)

# Grey: all houses; red: houses without a basement.
axes[0].scatter(df["long"], df["lat"], color="gray", alpha=0.1, label="All houses")
axes[0].scatter(df_priv0["long"], df_priv0["lat"], color="red", alpha=0.01, label="Without privacy")
axes[0].set_title("Without Privacy")
axes[0].set_xlabel("Longitude")
axes[0].set_ylabel("Latitude")

# Grey: all houses; green: houses with a basement.
axes[1].scatter(df["long"], df["lat"], color="gray", alpha=0.1, label="All houses")
axes[1].scatter(df_priv1["long"], df_priv1["lat"], color="green", alpha=0.01, label="With privacy")
axes[1].set_title("With Privacy")
axes[1].set_xlabel("Longitude")
axes[1].set_ylabel("Latitude")

fig.suptitle("Location of Houses, With And Without Privacy", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])

draw_rectangle(axes[0], x=-122.05, y=47.37, width=0.10, height=0.05, color="blue")
draw_rectangle(axes[0], x=-122.16, y=47.47, width=0.10, height=0.05, color="blue")
draw_rectangle(axes[0], x=-122.00, y=47.58, width=0.10, height=0.05, color="blue")

plt_savefig("privacy-plot")
plt.show()