# Guided self-learning project from Hands-On Introduction to Machine Learning

This is a *guided* self-learning project that I completed using the book "Hands-on Introduction to Machine Learning[...]".

<b> Project summary </b> <br>
The goal is to find a model that predicts house prices from a sample dataset. We test Linear Regression, Decision Tree and Random Forest. <br>
<b> The plan of the project:</b>
1. Basic data exploration. <br>
2. Cleaning the data - adding/removing attributes, imputing missing values, encoding categorical attributes as numerical. <br>
3. Performing regression - splitting into train and test set, running models, testing performance and overfit. <br>
4. Final answer.

## Reading the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
#read and copy the dataset
housing_orig = pd.read_csv('housing.csv')
housing = housing_orig.copy()

## Data Exploration

Let's take a quick look at the data.

In [None]:
housing.head()

There are 9 numeric attributes and one categorical attribute, ocean_proximity.

Missing values: only in attribute total_bedrooms, 207 missing values (ca. 1% of this column). <br>
Imputing strategy: median (done later).
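
As a small illustration of counting missing values and of the median strategy, here is a sketch on a toy frame. The values below are made up for the example and are not taken from housing.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Number of missing values per column and their share of each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```

The median is computed from the non-missing values only, so the missing entry is filled with the median of the three known values.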

A problem with the dataset that we ignore (<b> Fig. 1 </b>): some numeric attributes have a suspiciously large number of observations at their maximal value (housing_median_age at about 50 and median_house_value at about 500,000).

A problem with the dataset that we fix later: there are very few observations with the value ISLAND in the categorical attribute ocean_proximity (5 out of ca. 20 000, ca. 0.03%). As a consequence, this value often does not occur in the test set; as a solution we manually pick the random seed used for the split into train and test set.

In [None]:
housing.ocean_proximity.value_counts()

#### Fig. 1: Too many maximal values in housing_median_age and median_house_value

In [None]:
housing.hist(bins=50,figsize=(20,15));

## Split into train and test set

In [None]:
test_ratio = 0.2 #size of test set w.r.t. whole dataset

In [None]:
from sklearn.model_selection import train_test_split
#seed 6 is chosen so that every value of the (single) categorical attribute
#occurs in both the train set and the test set
#(explicitly: the ISLAND value of the ocean_proximity attribute)
train_set3, test_set3 = train_test_split(housing, test_size=test_ratio, random_state=6)

In [None]:
#Checking that all values of ocean_proximity occur in both the train and the test set
all_values = set(housing.ocean_proximity.unique())
if set(train_set3.ocean_proximity.unique()) == all_values and set(test_set3.ocean_proximity.unique()) == all_values:
    print("OK - train and test set contain all values of categorical attributes.")
else:
    print("Bad - train or test set misses some values of categorical attributes.")
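
The seed above was picked manually. A suitable seed could also be searched for automatically; the following is only a sketch, and the helper name `find_seed_covering_all_categories`, its parameters and the toy data are hypothetical and not part of the original project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def find_seed_covering_all_categories(df, column, test_size=0.2, max_seed=1000):
    """Return the first seed whose split keeps every category in both sets."""
    all_values = set(df[column].unique())
    for seed in range(max_seed):
        train, test = train_test_split(df, test_size=test_size, random_state=seed)
        if set(train[column]) == all_values and set(test[column]) == all_values:
            return seed
    return None  # no suitable seed found in the searched range

# Toy data (hypothetical): 95 common rows and 5 rare ones
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND"] * 95 + ["ISLAND"] * 5,
    "value": np.arange(100),
})
seed = find_seed_covering_all_categories(toy, "ocean_proximity")
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=seed)
```

Searching for a seed stays close to the approach used in this project; it does not change how the split itself is drawn.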

## Cleaning the train set

### Adding/removing attributes

A glance at the train set.

In [None]:
train_set3.head()

We <b> add/drop attributes </b> that appear <b> logically more/less relevant </b> as predictors of house prices. <br>
We <b> add attributes: </b> rooms per household, bedrooms per household, and bedrooms per room (the last one equals bedrooms per household divided by rooms per household). <br>
We <b> drop attributes: </b> total_rooms, total_bedrooms, longitude, latitude.
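
The three added attributes are plain ratios of existing columns. A minimal sketch on toy values (made up for illustration) shows how they are computed and checks that bedrooms per room follows from the other two:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "total_rooms": [880.0, 1467.0],
    "total_bedrooms": [129.0, 190.0],
    "households": [126.0, 177.0],
})

rooms_per_household = toy.total_rooms / toy.households
bedrooms_per_household = toy.total_bedrooms / toy.households
bedrooms_per_room = toy.total_bedrooms / toy.total_rooms

# households cancels out, so this ratio reproduces bedrooms_per_room
ratio_of_previous_two = bedrooms_per_household / rooms_per_household
is_determined = bool(np.allclose(bedrooms_per_room, ratio_of_previous_two))
```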

#### Our attribute transformer

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class AttributeAdderAndRemover(BaseEstimator, TransformerMixin):
    def __init__(self, before_cat_encoding):
        self.before_cat_encoding = before_cat_encoding
    def fit(self, X, y=None):
        #nothing to learn; y is accepted for compatibility with sklearn pipelines
        return self
    def transform(self, X):
        #work on a copy so that the caller's DataFrame is not modified
        X = X.copy()
        #add three new attributes
        X["rooms_per_household"] = X.total_rooms/X.households
        X["bedrooms_per_household"] = X.total_bedrooms/X.households
        X["bedrooms_per_room"] = X.total_bedrooms/X.total_rooms
        #drop four attributes that we consider irrelevant
        return X.drop(columns=["total_rooms", "total_bedrooms", "longitude", "latitude"])

#### Auxiliary functions (attribute lists and array-to-dataframe function)

In [None]:
#columns
## categorical columns ##
old_cat_columns = ["ocean_proximity"]
new_cat_columns = list(np.unique(train_set3.copy()[old_cat_columns[0]].values)) #or statically: ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
all_cat_columns = old_cat_columns + new_cat_columns
# auxiliary function
def wo_cat_columns(L:list): #without categorical columns
    return [col_name for col_name in L if col_name not in all_cat_columns]

## numerical columns ##
old_columns = list(housing_orig.columns)
old_num_columns = wo_cat_columns(old_columns)
new_num_columns = wo_cat_columns(list(AttributeAdderAndRemover(before_cat_encoding=True).fit_transform(housing_orig.copy()).columns))

## new columns ##
new_columns = new_num_columns + new_cat_columns

In [None]:
def our_to_DataFrame(X, before_cat_encoding:bool, before_attr_transforming:bool):
    """
    Turns an array to a DataFrame by adding column names.
    """
    if before_cat_encoding:
        if before_attr_transforming:
            raise NotImplementedError
        else:
            return pd.DataFrame(X, columns=old_columns, index = range(len(X)))
    else:
        return pd.DataFrame(X, columns=new_columns, index = range(len(X)))

### Imputing and encoding categorical attributes

Numerical attributes are imputed using the median strategy. <br>
Categorical attributes are encoded into vectors using one-hot encoding.
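
To make the one-hot encoding concrete, here is a sketch on a toy column (the rows are made up for illustration). Each category becomes one column, and each row gets a 1 in the column of its category and 0 elsewhere.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical rows)
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for inspection
encoded = encoder.fit_transform(toy[["ocean_proximity"]]).toarray()

# The learned categories (sorted) give the order of the output columns
categories = list(encoder.categories_[0])
```

The column order of the encoded array follows `encoder.categories_`, which is why the column names for the encoded attributes have to be listed in the same sorted order.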

In [None]:
#define transformers that build the pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
my_attribute_transformer = AttributeAdderAndRemover(before_cat_encoding=True)
imputer = SimpleImputer(strategy="median")
cat_encoder = OneHotEncoder()

In [None]:
#define pipelines
col_pipeline = ColumnTransformer([
    ('num', imputer, new_num_columns), #impute
    ('cat', cat_encoder, old_cat_columns) #encode categorical as numerical
])
full_pipeline = Pipeline([
    ('attr_transformer', my_attribute_transformer), #add/remove attributes
    ('imput-cat_encode', col_pipeline)
])

In [None]:
#run pipeline, obtaining an array
housing_train = train_set3.copy()
housing_prepared_plus_labels = full_pipeline.fit_transform(housing_train)

### Final objects, ready for analysis

##### Creating final objects

In [None]:
#convert array to DataFrame
housing_prepared_plus_labels_as_df = our_to_DataFrame(housing_prepared_plus_labels, before_cat_encoding=False, before_attr_transforming=False)
#final objects
label_attrs = ["median_house_value"]
housing_prepared = housing_prepared_plus_labels_as_df.drop(label_attrs, axis=1)
housing_labels = housing_prepared_plus_labels_as_df[label_attrs]

## Building models

We train three models: Linear Regression, Decision Tree, Random Forest.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
names = ["Linear Regressor", "Decision Tree Regressor", "Random Forest Regressor"]
models_list = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor()]#mutable objects (a reminder)
models = dict(zip(names, models_list)) #dictionary name:model

In [None]:
for model in models_list:
    #pass the labels as a 1-D array, the shape sklearn regressors expect
    model.fit(housing_prepared, housing_labels.values.ravel())

### Testing performance and overfit

In [None]:
#auxiliary function
def name_of_min(d:dict):
    #return the key with the smallest value (we assume values are comparable)
    return min(d, key=d.get)

#### Overfit test

We measure the overfit of a model by its performance on the train set: the better the performance, the more overfitted we consider the model. This is <b> not exactly what overfit means </b> (overfit is usually judged by how much worse a model performs on unseen data than on the train set); nevertheless, in this project we apply this simple but inaccurate method.
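
For comparison, a more conventional check looks at the gap between the error on the train set and the error on held-out data. The sketch below uses synthetic data and is not part of the original project; the data-generating formula and the `rmse` helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

# Gap = held-out RMSE minus train RMSE; a large positive gap suggests overfit
gaps = {}
candidates = [("linear", LinearRegression()),
              ("tree", DecisionTreeRegressor(random_state=0))]
for name, model in candidates:
    model.fit(X_train, y_train)
    gaps[name] = rmse(model, X_test, y_test) - rmse(model, X_train, y_train)
```

An unpruned decision tree can fit the train set almost perfectly, so its gap is typically much larger than that of the linear model.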

In [None]:
rmses_on_train_set = dict()
from sklearn.metrics import mean_squared_error
print("Overfit test (RMSE on train set).")
for nazwa, model in models.items():
    predicted_labels = model.predict(housing_prepared)
    mse = mean_squared_error(housing_labels, predicted_labels)
    rmse = np.sqrt(mse)
    rmses_on_train_set[nazwa] = rmse
    print('RMSE of model', nazwa, ':', rmse)

#### Performance test

In [None]:
#read data
housing_test = test_set3.copy().reset_index(drop=True)
#run pipeline; we only transform here, so the imputer and encoder fitted on the train set are reused
housing_test_prepared_plus_labels = full_pipeline.transform(housing_test)

In [None]:
#final object
housing_test_prepared_plus_labels_as_df = our_to_DataFrame(housing_test_prepared_plus_labels, before_cat_encoding=False, before_attr_transforming=False)
#recall: label_attrs = ["median_house_value"]
housing_test_prepared = housing_test_prepared_plus_labels_as_df.drop(label_attrs, axis=1)
housing_test_labels = housing_test_prepared_plus_labels_as_df[label_attrs]

In [None]:
rmses_on_test_set = dict()
print("Performance on test set (RMSE) ")
for nazwa, model in models.items():
    predicted_test_labels = model.predict(housing_test_prepared)
    mse = mean_squared_error(housing_test_labels, predicted_test_labels)
    rmse = np.sqrt(mse)
    rmses_on_test_set[nazwa] = rmse
    print('RMSE of model', nazwa, ':', rmse)

## Final answer

In [None]:
name_of_best_model = name_of_min(rmses_on_test_set)
print('Best performing model: ', name_of_best_model, '.',sep="")
print('Most overfitted model: ', name_of_min(rmses_on_train_set), '.', sep="")

In [None]:
import joblib
joblib.dump(models[name_of_best_model], "best_model.pkl");