We use California census data to build a model of housing prices.<br>
The model should learn from the data and predict the median housing price of any district, given all the other predictors.<br>
This is a supervised regression task using batch learning.

### Selecting a Performance Measure

We use the Root Mean Squared Error (RMSE) as our performance measure. <br><br>
\begin{equation*}
RMSE(X, h) = \sqrt{\frac 1m \sum_{i = 1}^m (h(x^{(i)}) - y^{(i)})^2}
\end{equation*}
- $m$ is the number of observations in the dataset
- $x^{(i)}$ is the feature vector of the $i^{th}$ instance of the dataset, and $y^{(i)}$ is its label
- $X$ is the matrix containing the predictor values of all the observations in the dataset (excluding the labels)
- $h$ is the estimated prediction function of our learning algorithm.

We call $RMSE(X,h)$ the cost function measured on our observations using the hypothesis $h$. <br>
We could also have used the Mean Absolute Error (MAE), which is preferable when the dataset has many outliers. RMSE corresponds to the L2 norm and MAE to the L1 norm; the higher the norm index, the more the measure focuses on large values, so RMSE is more sensitive to outliers than MAE.
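
To make the difference between the two norms concrete, here is a minimal sketch that computes RMSE and MAE on a handful of made-up labels and predictions. The helper names `rmse` and `mae` and all the numbers are illustrative assumptions, not part of the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root of the mean squared error (L2-style measure)
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    # Mean of the absolute errors (L1-style measure)
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Hypothetical labels and two sets of predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_small = y_true + np.array([10.0, -10.0, 10.0, -10.0])     # all errors equal in size
y_pred_outlier = y_true + np.array([10.0, -10.0, 10.0, -100.0])  # one large error

print("no outlier  :", rmse(y_true, y_pred_small), mae(y_true, y_pred_small))
print("with outlier:", rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

When all errors have the same magnitude the two measures agree; with a single large error, RMSE grows faster than MAE because the error is squared before averaging.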


### Getting the Data

In [None]:
import os 
import tarfile
import urllib.request  # the request submodule must be imported explicitly

DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml2/raw/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [None]:
HOUSING_URL

In [None]:
# function to fetch the data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # download the archive from housing_url to tgz_path
    urllib.request.urlretrieve(housing_url, tgz_path)
    # extract the archive into housing_path; the file is closed automatically
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [None]:
#loading the data
import pandas as pd

#function to load the data
def load_housing_data(housing_path = HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
fetch_housing_data()

### Inspect the Data

In [None]:
df = load_housing_data()

In [None]:
df.head()

In [None]:
# a quick description of the data
df.info()

In [None]:
#summary of the categorical variable
df['ocean_proximity'].value_counts()

In [None]:
#summary of the numerical datatypes
df.describe()

The 25%, 50% and 75% rows show the corresponding percentiles: a percentile is the value below which a given percentage of the observations in a group fall.
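
As a small illustration of what a percentile means, the sketch below computes the 25th, 50th and 75th percentiles of a made-up array with NumPy's `np.percentile` and then checks what fraction of the values lie at or below each one. The array `values` is an assumption for illustration; it is not taken from the housing data.

```python
import numpy as np

# Hypothetical sample: the integers 1 through 101
values = np.arange(1, 102)

# 25th, 50th (median) and 75th percentiles
q25, q50, q75 = np.percentile(values, [25, 50, 75])
print(q25, q50, q75)

# Fraction of observations at or below each percentile
for q in (q25, q50, q75):
    print(q, np.mean(values <= q))
```

The printed fractions should be close to 0.25, 0.50 and 0.75; they are not exact because the sample is finite.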

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
df.hist(bins = 50, figsize = (20,15))
plt.show()

### Creating a Test Set
We pick some instances at random, typically around 20% of the data, and set them aside.

In [None]:
import numpy as np

#function for test train split
def split_test_train(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) *test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

This method of splitting works, but we get a different split every time we run it, so over time our ML model will get to see the entire dataset.<br>
Alternatives are to fix the random seed or to save the test set on the first run, but both break down if we fetch an updated dataset. <br>
The workaround is to use a **hash function**: we compute the hash of each instance's identifier, and if the hash value is below 20% of the maximum hash value the instance goes into the test set; otherwise it goes into the training set. This way no instance that was previously in the training set ever ends up in the test set, and the split keeps roughly the desired ratio.

In [None]:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

In [None]:
#adding index column to the dataframe
df_with_index = df.reset_index()
train_set, test_set = split_train_test_by_id(df_with_index, 0.2, "index")

If the row index is used as the unique identifier, we must make sure that new data is only ever appended to the end of the dataset and that no row is ever deleted. Otherwise, we can build the unique identifier from the most stable features of the dataset.
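
The following is a minimal sketch of the second option, assuming that a district's longitude and latitude are stable enough to serve as an identifier. The `id` formula, the helper names `in_test_set`, `split_by_id` and `make_districts`, and the synthetic coordinates are all illustrative assumptions. The sketch also checks that appending new rows does not move any old row between the training and test sets.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio):
    # Hash the (integer-truncated) id and compare with the ratio threshold
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_by_id(data, test_ratio, id_column):
    mask = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~mask], data.loc[mask]

def make_districts(n, rng):
    # Synthetic coordinates, NOT the real housing data
    frame = pd.DataFrame({
        "longitude": rng.uniform(-124.0, -114.0, n),
        "latitude": rng.uniform(32.0, 42.0, n),
    })
    # Hypothetical identifier built from two stable features
    frame["id"] = frame["longitude"] * 1000 + frame["latitude"]
    return frame

rng = np.random.default_rng(0)
old = make_districts(1000, rng)
_, old_test = split_by_id(old, 0.2, "id")

# Simulate fetching an updated dataset with 200 rows appended at the end
updated = pd.concat([old, make_districts(200, rng)], ignore_index=True)
_, updated_test = split_by_id(updated, 0.2, "id")

# Old rows keep positions 0..999, so their test membership can be compared
old_rows_in_new_test = updated_test.index[updated_test.index < len(old)]
unchanged = set(old_test.index) == set(old_rows_in_new_test)
print("old rows kept their assignment:", unchanged)
```

Because the id is truncated to an integer before hashing, nearby districts can share an identifier and will then always land in the same set.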

In [None]:
# we can also use sklearn's built-in train_test_split function
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 42)

We can use different sampling methods to divide our dataset. So far we have only considered simple random sampling, but it can sometimes introduce sampling bias. For example, if the population consists of 53% of one class and 47% of another, we can divide the data into different strata and sample from each one, so that the sample keeps the same ratio and is representative of the original population. This is called ***stratified random sampling***. <br>
In our example we will do stratified random sampling on the median income, by first binning the income into different categories.

In [None]:
# binning the median incomes into different categories
df['income_cat'] = pd.cut(df['median_income'],
                         bins = [0., 1.5, 3.0, 4.5, 6, np.inf],
                         labels = [1,2,3,4,5])

In [None]:
df['income_cat'].hist()

In [None]:
#doing stratified sampling based on median income categories
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1,test_size=0.2, random_state=42)

for train_index, test_index in split.split(df, df['income_cat']):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]   

In [None]:
# proportion of each median income category in the train set
strat_train_set['income_cat'].value_counts() / len(strat_train_set)

In [None]:
# proportion of each median income category in the entire set
df['income_cat'].value_counts() / len(df)

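We can see that the income category proportions in the stratified training set closely match those in the full dataset, so the stratified split is representative of the population with respect to the median income category.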
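
To see how stratified sampling compares with simple random sampling, here is a rough sketch that measures how far each test set's category proportions deviate from the overall proportions. It runs on synthetic, log-normally distributed incomes rather than the real `median_income` column, and the helper `proportions` and the `comparison` table are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the median income column (NOT the real data)
rng = np.random.default_rng(42)
data = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=10_000)})
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

def proportions(frame):
    # Share of each income category, ordered by category
    return frame["income_cat"].value_counts(normalize=True).sort_index()

# Stratified test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["income_cat"]))
strat_test = data.iloc[test_idx]

# Purely random test set
_, random_test = train_test_split(data, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": proportions(data),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
comparison["strat_abs_error"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_abs_error"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions almost exactly, while the random column can drift, depending on the seed and the sample size.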

In [None]:
# removing the income_cat attribute to get the data back in the same state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)