# California Housing

Much of the work in an ML project lies outside the modeling itself: preparing the data, building monitoring tools, setting up human evaluation pipelines, and automating regular model training. The machine learning algorithms are important, of course, but it is probably preferable to be comfortable with the overall process and know three or four algorithms well than to spend all your time exploring advanced algorithms.

Let's analyze how housing prices across California vary according to a number of factors.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Getting Data

In [None]:
HOUSING_FILE = "../data/ca_housing.csv"
if not Path(HOUSING_FILE).exists():
    # stop here: continuing would fail later with an undefined `raw`
    raise FileNotFoundError(f"{HOUSING_FILE} does not exist")

raw = pd.read_csv(HOUSING_FILE)
raw.shape

## Checking out Data

In [None]:
raw.head(5)

In [None]:
raw.info()

In [None]:
raw.describe()

In [None]:
raw.value_counts("ocean_proximity")

## Data Visualization

In [None]:
import matplotlib.pyplot as plt

def saveImage(filename, format="png", dpi=300):
    # save the current figure as "<filename>.<format>"
    plt.savefig(f"{filename}.{format}", format=format, dpi=dpi)
    
plt.rc('font', size=12)
plt.rc('axes', labelsize=12, titlesize=12)
plt.rc('legend', fontsize=12)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10) 

In [None]:
housing = raw.copy()

housing.hist(bins=50, figsize=(12, 10))

# saving the image to a file
saveImage("ca_housing_histograms")
plt.show()

In [None]:
housing.plot(kind="scatter", 
             x="longitude", 
             y="latitude", 
             grid=True)
saveImage("ca_housing_map1")
plt.show()

In [None]:
housing.plot(kind="scatter", 
             x="longitude", 
             y="latitude", 
             grid=True, 
             alpha=0.3)
saveImage("ca_housing_map2")
plt.show()

`matplotlib` colormaps ([reference](https://matplotlib.org/stable/tutorials/colors/colormaps.html)). Some common choices:
- rainbow
- jet
- turbo

`DataFrame.plot()` parameters used in the next plot:
- "kind" : type of graph
- "s" : size of bubble
- "c" : color variable

In [None]:
housing.columns

In [None]:
housing.plot(kind="scatter", 
             x="longitude", 
             y="latitude", 
             grid=True,
             s=housing["population"] / 100, 
             label="population",
             c="median_house_value", 
             cmap="turbo", 
             colorbar=True,
             legend=True, 
             figsize=(10, 7))
saveImage("ca_housing_map3")
plt.show()

## Stratified Sampling

Stratified random sampling is a technique used to ensure that the training and test datasets represent the overall population. This technique is especially important when you're dealing with an imbalanced dataset, or when the dataset's categorical variables have different levels with varying frequencies.

In a stratified random sample, the population is divided into subgroups called strata, and instances are drawn from each stratum in proportion to its size. The sample therefore preserves the proportions found in the overall population.
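The idea can be sketched with pandas alone, on synthetic data. The `stratum` column, its A/B/C labels, and the 70/25/5 split below are made up for illustration, and `groupby(...).sample()` assumes pandas 1.1 or later.

```python
import pandas as pd

# Synthetic population (illustrative only): three strata of very different sizes.
population = pd.DataFrame({
    "stratum": ["A"] * 700 + ["B"] * 250 + ["C"] * 50,
    "value": range(1000),
})

# Draw 20% from *each* stratum, so the sample keeps the 70/25/5 proportions.
stratified_sample = population.groupby("stratum").sample(frac=0.2, random_state=0)

# A plain random sample of the same size has no such guarantee.
random_sample = population.sample(n=len(stratified_sample), random_state=0)

strat_counts = stratified_sample["stratum"].value_counts().sort_index()
random_counts = random_sample["stratum"].value_counts().sort_index()
print(strat_counts)
print(random_counts)
```

The stratified counts match the population proportions exactly, while the random counts only approximate them. The gap tends to matter most for small strata such as C.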

Let's say you've talked with some domain experts and they've explained that median income is a very important attribute for predicting median house values. You would want to make sure that your test set is representative of the various income ranges in the whole dataset.

Since median income is a continuous numerical attribute, we first need to create income categories. Looking at the histogram of median incomes, most values are clustered between 1.5 and 6 (that is, between $15,000 and $60,000), but some median incomes are much higher than that.

To get a good estimate of the importance of each income level (or stratum), we need enough instances in each one. This means we shouldn't have too many income categories, and each category should be large enough.

We can use the `pd.cut()` function to create an income category attribute with five categories (labeled 1 to 5). Category 1 ranges from 0 to 1.5 (that is, up to $15,000), category 2 from 1.5 to 3, and so on, with category 5 covering everything above 6. This way, each category has enough data and isn't too specific.
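To see how the bin edges map values to categories, here is a small sketch on a handful of made-up income values (they are not taken from the dataset). By default `pd.cut()` uses right-closed intervals, so a value exactly on an edge, such as 1.5, falls into the lower category.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes, in tens of thousands of dollars.
incomes = pd.Series([0.5, 1.5, 2.0, 3.0, 4.0, 5.9, 6.0, 15.0])

# Same bin edges and labels as used for the housing data.
income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

categories = income_cat.astype(int).tolist()
print(categories)  # [1, 1, 2, 2, 3, 4, 4, 5]
```

Note that a value of exactly 0 would fall outside the first interval and become `NaN` unless `include_lowest=True` is passed.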

In [None]:
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning (default='warn')
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing['income_cat'].value_counts().sort_index()

In [None]:
housing["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.xlabel("Income category")
plt.ylabel("Number of districts")
plt.show()


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

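Scikit-learn provides several ways to split datasets. The `StratifiedShuffleSplit` class imported above can generate multiple stratified splits; for a single split, `train_test_split()` with its `stratify` argument is simpler.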
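As a minimal sketch of the class-based API, the snippet below runs `StratifiedShuffleSplit` on a synthetic frame. The `toy` DataFrame and its `group` column are hypothetical; with the housing data, the stratification labels would be `housing["income_cat"]`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data (illustrative only): 100 rows in three groups of 60/30/10.
toy = pd.DataFrame({
    "group": ["low"] * 60 + ["mid"] * 30 + ["high"] * 10,
    "x": range(100),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)

# split() yields (train_indices, test_indices) pairs of positional indices.
train_idx, test_idx = next(splitter.split(toy, toy["group"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

test_counts = toy_test["group"].value_counts().to_dict()
print(test_counts)
```

Raising `n_splits` yields several independent stratified splits from the same splitter, which is the main reason to prefer the class over `train_test_split()`.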

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
housing.columns

In [None]:
strat_train_set, strat_test_set = train_test_split(
    housing, 
    test_size=0.2, 
    stratify=housing["income_cat"], 
    random_state=1)

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

In [None]:
# random train_test split
rand_train_set, rand_test_set = train_test_split(
    housing, 
    test_size=0.2, 
    random_state=1
)

In [None]:
# comparing income category proportions: full dataset vs. random and stratified test sets

def cat_proportions(df, cat):
    return df[cat].value_counts() / len(df)

compare_df = pd.DataFrame({
    "Target %": cat_proportions(housing, "income_cat"),
    "Random %": cat_proportions(rand_test_set, "income_cat"),
    "Stratified %": cat_proportions(strat_test_set, "income_cat"),
}).sort_index()
compare_df.index.name = "Income Category"
(compare_df*100).round(2)


In [None]:
# "income_cat" was only needed for stratification, so drop it from both sets

for s in (strat_train_set, strat_test_set):
    s.drop("income_cat", axis=1, inplace=True)