In [1]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
HOUSING_PATH = os.path.join("datasets", "housing")


def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


housing = load_housing_data()

# Exploring the dataset

Each row in the dataset represents a district.

In [5]:
housing.head()

Note below that there are 20640 entries in the dataset. However, the feature ``total_bedrooms`` has only 20433 non-null entries.
This means that 207 districts are missing this feature.
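
A quick way to confirm which columns have gaps is to count the nulls per column. The sketch below uses a tiny made-up frame (the values are hypothetical and not taken from `housing.csv`); on the real data the equivalent call would be `housing.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows where total_bedrooms is missing.
rows_missing_bedrooms = toy[toy["total_bedrooms"].isna()]
print(len(rows_missing_bedrooms))
```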

In [6]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
housing.info()

In [7]:
housing.ocean_proximity.value_counts()

In [8]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
housing.describe()

In [9]:
%matplotlib inline
housing.hist(bins=50, figsize=(20, 15))
plt.show()

# Train-Test split

We set part of the data aside as a test set and train only on the rest. Evaluating on the held-out test set estimates how well the model generalizes to unseen data.

In [10]:
from sklearn.model_selection import train_test_split

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
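
To make the idea of a random split concrete, here is a minimal sketch of one done by hand. The helper `split_train_test` is hypothetical and only illustrates the principle (shuffle the row positions, then slice); it is not how scikit-learn implements `train_test_split`.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio, seed=42):
    """Hypothetical helper: shuffle row positions, then slice off a test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]


# Toy frame with 100 rows, standing in for the housing data.
toy = pd.DataFrame({"value": range(100)})
toy_train, toy_test = split_train_test(toy, test_ratio=0.2)
print(len(toy_train), len(toy_test))
```

Fixing the seed plays the same role as `random_state=42` above: repeated runs produce the same split, so the test set does not leak into training across runs.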

## A note on representativeness

When splitting the data, purely random sampling is not always a good idea. The sample must be representative of the population, otherwise the evaluation is biased.
One remedy is _stratified sampling_: the population is divided into homogeneous subgroups called _strata_, and each stratum is sampled in proportion to its share of the population.

Let's use `StratifiedShuffleSplit` to create another test dataset and compare it with the random sampling (i.e., plain `train_test_split`).

First, let's create an income category to identify our strata:

In [11]:
housing["income_category"] = pd.cut(housing.median_income,
                                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                    labels=[1, 2, 3, 4, 5])
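
To see how `pd.cut` assigns the labels, the sketch below applies the same bins to a few hypothetical income values. By default the bins are closed on the right, so a value of exactly 1.5 falls into category 1 and a value of exactly 0.0 would get no category at all.

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values chosen to hit bin edges.
incomes = pd.Series([0.5, 1.5, 2.0, 4.5, 7.0])

categories = pd.cut(incomes,
                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                    labels=[1, 2, 3, 4, 5])
print(categories.tolist())  # [1, 1, 2, 3, 5]
```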

Now we pass the newly created ``income_category`` to ``StratifiedShuffleSplit``, so that each income category appears in the train and test sets in the same proportion as in the full dataset. This makes the train-test split "representative".

In [12]:
from sklearn.model_selection import StratifiedShuffleSplit


# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing.income_category):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let's compare the distribution of median income in the full dataset, the random training set, and the stratified training set:

In [13]:
housing.median_income.describe()

In [14]:
train_set.median_income.describe()

In [15]:
strat_train_set.median_income.describe()
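
The `describe()` summaries compare the splits only indirectly. A more direct check is to compare the share of each stratum in the full data and in each test set. The sketch below does this on synthetic data: the `category` column and its probabilities are made up for illustration and merely stand in for `income_category`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical data: 1000 rows with an uneven category distribution.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "category": rng.choice([1, 2, 3, 4, 5], size=1000,
                           p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

# Purely random split.
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

# Stratified split on the category column.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["category"]))
strat_test = toy.iloc[test_idx]


def proportions(frame):
    """Share of rows in each category, ordered by category label."""
    return frame["category"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": proportions(toy),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
# Absolute deviation of each test set from the overall proportions.
comparison["random_error"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_error"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The `strat_error` column should stay close to zero, because the stratified splitter allocates test rows per category, while `random_error` depends on the luck of the draw.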

Removing the temporary `income_category` feature...

In [16]:
for df in (strat_train_set, strat_test_set):
    df.drop("income_category", axis=1, inplace=True)

# Discover and Visualize the Data to Gain Insights

In [17]:
# Work on a copy so that exploration does not modify the training set
housing = strat_train_set.copy()

In [18]:
# alpha helps to visualize the density
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()

Checking median house values: each circle's size reflects the district's population and its color the median house value.

In [19]:
# https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing.population/100, label="population", figsize=(12, 8),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.show()

## Exploring linear correlations

In the previous two figures, prices and population density **seem** correlated (...please, never fall into the trap of "eye-balling" statistics).
Let's explore some correlations using the _standard correlation coefficient_ (aka Pearson's r) between pairs of features:

In [20]:
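# Restrict to numeric columns: ocean_proximity is text, and pandas 2.0 and later
# raise an error if a non-numeric column reaches corr()
corr_matrix = housing.select_dtypes(include="number").corr()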

In [21]:
corr_matrix.median_house_value.sort_values(ascending=False)
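
For reference, Pearson's r is the covariance of two variables divided by the product of their standard deviations. It ranges from -1 to 1 and captures only linear relationships. The sketch below computes it by hand on hypothetical paired values and checks the result against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r = cov(x, y) / (std(x) * std(y)), written out with centered values.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# NumPy's built-in version for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```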

Another way to explore correlations is to use a _scatter matrix_, which plots every selected feature against every other one:

In [25]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age", "total_bedrooms"]
scatter_matrix(housing[attributes], figsize=(16,10))
plt.show()

In [30]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.2, figsize=(10,6))
plt.show()

### Exploring some feature engineering

In [31]:
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
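
One thing to keep in mind: because some districts have no `total_bedrooms` value, the derived ratio is missing for those districts too. The sketch below shows this on a hypothetical three-row frame (values made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the second row has no bedroom count.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Dividing by a valid number still yields NaN when the numerator is NaN.
toy["bedrooms_per_room"] = toy["total_bedrooms"] / toy["total_rooms"]
print(toy)
print(toy["bedrooms_per_room"].isna().sum())  # 1
```

pandas' `corr()` excludes missing values when computing each pairwise correlation, so these rows do not break the correlation step, but they do need handling before training a model.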

In [33]:
scatter_matrix(housing[["median_house_value", "bedrooms_per_room", "median_income"]], figsize=(16,10))
plt.show()

In [37]:
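# Numeric columns only: pandas 2.0 and later raise on the text column ocean_proximity
housing.select_dtypes(include="number").corr()["median_house_value"].sort_values(ascending=False)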

# Data Preparation for Machine Learning

We explored the data to get some initial insights and analyzed some characteristics of the dataset.
Before working on the Machine Learning model, let's do some cleanup and data preparation.

In [38]:
# TODO