### Predicting California House Prices with Linear Regression

### Objective
* To predict California housing prices using the simplest linear regression model and see how it performs.
* To understand the modeling workflow using mlpack.

### About the Data
 This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
 
 This dataset is also used in the book [Hands-On Machine Learning](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/), a very good book that is highly recommended.
 
 The dataset in this directory is almost identical to the original, with two differences:
* 207 values were randomly removed from the `total_bedrooms` column, so we can discuss what to do with missing data.
* An additional categorical attribute called `ocean_proximity` was added, indicating (very roughly) whether each block group is near the ocean, near the Bay Area, inland or on an island. This allows discussing what to do with categorical data.

Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing.

Let's look at the columns of the dataset:
* Longitude : Longitude coordinate of the houses.
* Latitude : Latitude coordinate of the houses.
* Housing Median Age : Median age of the houses in a location.
* Total Rooms : Total number of rooms in a location.
* Total Bedrooms : Total number of bedrooms in a location.
* Population : Population in that location.
* Households : Number of households in a location.
* Median Income : Median income of households in a location.
* Median House Value : Median house value in a location. This is the value we want to predict.
* Ocean Proximity : Closeness to shore.

### Approach
 Here, we will try to recreate the workflow from the book mentioned above.
 * Look at the big picture.
 * Get the data.
 * Discover and visualize the data to gain insights.
 * Pre-process the data for the ML algorithm.
 * Create new features.
 * Split the data.
 * Train the ML model using mlpack.
 * Residuals, errors and conclusion.


### Big Picture

Suppose you work at a real estate agency as an analyst or data scientist, and your boss wants you to predict the housing prices in a certain location. You are provided with a dataset. So, what is the first thing to do?

If you are tempted to jump right into analysing the data and ML algorithms, that is the wrong step. It's a big "NO".
 <h5> The first thing is to ask questions. </h5>
 
 Questions like: What will the predictions be used for? Will they be fed into some other system or not? And many more, just to have concrete goals.
 
 So, your boss says that the data will be used to get predictions so that another team can work on some investment strategies.
 
So, let's get started.

<h3> Importing Header Files </h3>

In [1]:
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/methods/linear_regression/linear_regression.hpp>
#include <mlpack/core/data/split_data.hpp>

In [2]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"

/* CPython Api Scripts for Plots */

#include "../utils/histogram.hpp"
#include "../utils/impute.hpp"
#include "../utils/pandasscatter.hpp"
#include "../utils/heatmap.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [3]:
using namespace mlpack;
using namespace mlpack::data;

<h3> Let's download the dataset. </h3>

In [4]:
!wget -q https://datasets.mlpack.org/examples/housing.csv

### Loading the Data
Now we need to load the dataset as an Armadillo matrix for further operations. Our dataset has a total of 9 features, 8 numerical and 1 categorical (ocean proximity), plus the target column (median house value). We need to map the categorical feature to numeric values, as Armadillo operates on numeric values only.

Before loading the dataset as an Armadillo matrix, however, we need to deal with the missing values. Since 207 values were removed from the `total_bedrooms` column of the original dataset, we need to fill them in using either the mean or the median of that feature (for numerical features) or the mode (for categorical features).

In [49]:
// The imputing function has the following signature:
// Impute(inputFile, outputFile, kind);
// Here, inputFile is our raw file, outputFile is our new file with the imputations,
// and kind refers to the imputation method.

Impute("housing.csv", "housing_imputed.csv", "median");

0
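The `Impute` helper comes from `../utils/impute.hpp`, which is not shown here. As a rough illustration of what median imputation does to a single numeric column, here is a minimal sketch using only the standard library. `MedianImpute` is a hypothetical name, and representing missing values as NaN is an assumption made for this sketch, not necessarily how the notebook's helper works.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the notebook's Impute()): replaces NaN entries of
// one numeric column with the median of the non-missing entries.
std::vector<double> MedianImpute(std::vector<double> column)
{
  // Collect the observed (non-NaN) values.
  std::vector<double> observed;
  for (const double v : column)
    if (!std::isnan(v))
      observed.push_back(v);
  assert(!observed.empty());

  // Median: the middle value, or the mean of the two middle values when the
  // number of observed values is even.
  std::sort(observed.begin(), observed.end());
  const std::size_t n = observed.size();
  const double median = (n % 2 == 1)
      ? observed[n / 2]
      : 0.5 * (observed[n / 2 - 1] + observed[n / 2]);

  // Fill the gaps with the median.
  for (double& v : column)
    if (std::isnan(v))
      v = median;
  return column;
}
```

For example, `MedianImpute({1.0, std::nan(""), 3.0, 10.0})` fills the gap with 3.0, the median of the three observed values. The median is often preferred over the mean for skewed columns because a few very large values do not pull it upwards.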

Let's drop the header row using sed, a Unix utility used to parse and transform text.

In [50]:
!sed 1d housing_imputed.csv > housing_without_header.csv

// Here, we used sed to delete the first row which is indicated by "1d" and created a new file with name
// housing_without_header.csv

In [51]:
arma::mat dataset;
data::DatasetInfo info;
info.Type(9) = mlpack::data::Datatype::categorical;
data::Load("housing_without_header.csv", dataset, info);

In [52]:
// Print the first 6 data points. mlpack stores one data point per column,
// so these are the first 6 columns of the matrix.
std::cout << dataset.submat(0, 0, dataset.n_rows - 1, 5) << std::endl;

  -1.2223e+02  -1.2222e+02  -1.2224e+02  -1.2225e+02  -1.2225e+02  -1.2225e+02
   3.7880e+01   3.7860e+01   3.7850e+01   3.7850e+01   3.7850e+01   3.7850e+01
   4.1000e+01   2.1000e+01   5.2000e+01   5.2000e+01   5.2000e+01   5.2000e+01
   8.8000e+02   7.0990e+03   1.4670e+03   1.2740e+03   1.6270e+03   9.1900e+02
   1.2900e+02   1.1060e+03   1.9000e+02   2.3500e+02   2.8000e+02   2.1300e+02
   3.2200e+02   2.4010e+03   4.9600e+02   5.5800e+02   5.6500e+02   4.1300e+02
   1.2600e+02   1.1380e+03   1.7700e+02   2.1900e+02   2.5900e+02   1.9300e+02
   8.3252e+00   8.3014e+00   7.2574e+00   5.6431e+00   3.8462e+00   4.0368e+00
   4.5260e+05   3.5850e+05   3.5210e+05   3.4130e+05   3.4220e+05   2.6970e+05
            0            0            0            0            0            0



Did you notice something? Yes, the last row looks like it is entirely filled with '0'. Let's check our dataset to see what it corresponds to.
It corresponds to Ocean Proximity, which is a categorical value, but here it is zero.
Why? It's because the load function stores numerical values only, so each category is mapped to an integer code; these six points all share the same category, whose code is 0. This is exactly why we marked Ocean Proximity as categorical earlier.
So, let's deal with this.

In [53]:
#include<mlpack/core/data/one_hot_encoding.hpp>
arma::mat encoded_dataset; 
data::OneHotEncoding(dataset, encoded_dataset, info);

Here, we chose mlpack's built-in encoding method, "One Hot Encoding", to deal with the categorical values.

In [54]:
encoded_dataset.n_rows
// The above code prints the number of rows(features + labels) in current dataset.

14

You can see that the number of rows changed from 10 to 14: one-hot encoding replaced the single ocean proximity row with five indicator rows, one per category.
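To make the change in row count concrete, here is a minimal standard-library sketch of one-hot encoding for a single categorical row. `OneHot` is a hypothetical helper written for illustration only; it is not mlpack's `data::OneHotEncoding`, and it assumes the categories have already been mapped to integer codes starting at 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: expands integer category codes
// (0 .. numCategories - 1) into one indicator row per category. Each column
// is one data point, matching mlpack's column-per-point layout.
std::vector<std::vector<double>> OneHot(const std::vector<std::size_t>& codes,
                                        const std::size_t numCategories)
{
  std::vector<std::vector<double>> rows(
      numCategories, std::vector<double>(codes.size(), 0.0));
  for (std::size_t point = 0; point < codes.size(); ++point)
  {
    assert(codes[point] < numCategories);
    // Exactly one indicator row is set to 1 for each data point.
    rows[codes[point]][point] = 1.0;
  }
  return rows;
}
```

Calling `OneHot({0, 2, 1, 0}, 3)` turns one row of codes into three indicator rows. In the same way, a single categorical row with five categories becomes five rows, which accounts for a net gain of four rows.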

<h3>Visualization</h3>

Let's plot a histogram. 

In [8]:
// Hist(inputFile, bins, width, height, outputFile);
Hist("housing.csv", 50, 20, 15, "histogram.png");
auto im = xw::image_from_file("histogram.png").finalize();
im

Let's plot a scatter plot with longitude and latitude as x and y coordinates respectively.

In [9]:
// PandasScatter(inputFile, x, y, outputFile);
PandasScatter("housing.csv", "longitude", "latitude", "output.png");
auto im = xw::image_from_file("output.png").finalize();
im

Let's add some colour to the scatter plot.

In [2]:
// PandasScatterColor(inputFile, x, y, label, c, outputFile);
PandasScatterColor("housing.csv","longitude","latitude","Population","median_house_value","output1.png");
auto im = xw::image_from_file("output1.png").finalize();
im

Let's take it a step further and plot this on top of California map.

In [11]:
//PandasScatterMap(inputFile, imgFile, x, y, label, c, outputFile);
PandasScatterMap("housing.csv","california.png","longitude","latitude","Population","median_house_value","output2.png");
auto im = xw::image_from_file("output2.png").finalize();
im

<h3>Correlation</h3>

In [2]:
// HeatMap(inputFile, outputFile);
HeatMap("housing.csv", "heatmap.png");
auto im = xw::image_from_file("heatmap.png").finalize();
im

<h3>Train-Test Split</h3>
The dataset needs to be split into a training set and a test set before training.

In [13]:
// Labels are median_house_value which is row 8
arma::rowvec labels =
    arma::conv_to<arma::rowvec>::from(encoded_dataset.row(8));
encoded_dataset.shed_row(8);

In [14]:
arma::mat trainSet, testSet;
arma::rowvec trainLabels, testLabels;

In [15]:
// Split dataset randomly into training set and test set.
data::Split(encoded_dataset, labels, trainSet, testSet, trainLabels, testLabels,
    0.2 /* Fraction of the dataset (20%) to use for the test set. */);
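As a rough picture of what a random split does, the sketch below shuffles the point indices and reserves a fraction of them for the test set. `SplitIndices` is a hypothetical helper and not how `data::Split` is implemented; in particular, truncating the test-set size and taking a seed argument are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: shuffles the indices 0 .. numPoints - 1 and returns
// {trainIndices, testIndices}, where the test set receives
// floor(numPoints * testRatio) points.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
SplitIndices(const std::size_t numPoints,
             const double testRatio,
             const unsigned seed)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);

  // Start from the identity ordering and shuffle it.
  std::vector<std::size_t> indices(numPoints);
  std::iota(indices.begin(), indices.end(), std::size_t(0));
  std::mt19937 rng(seed);
  std::shuffle(indices.begin(), indices.end(), rng);

  // The first numTrain shuffled indices form the training set, the rest the
  // test set, so the two sets never overlap.
  const std::size_t numTest =
      static_cast<std::size_t>(static_cast<double>(numPoints) * testRatio);
  const std::size_t numTrain = numPoints - numTest;
  std::vector<std::size_t> train(indices.begin(), indices.begin() + numTrain);
  std::vector<std::size_t> test(indices.begin() + numTrain, indices.end());
  return {train, test};
}
```

Because features and labels are selected with the same indices, each label stays attached to its data point, which is why the labels are passed to the split together with the dataset.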

### Training the linear model

Regression analysis is one of the most widely used methods of prediction. Linear regression is used when the features have a roughly linear relationship with the target. As the name suggests, multiple linear regression has several independent variables (predictors) and one dependent variable (response).

The multiple linear regression equation is represented as $y = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + ... + b_{n}x_{n}$, where $x_{i}$ is the ith explanatory variable, $y$ is the dependent variable, $b_{i}$ is the ith coefficient and $a$ is the intercept.
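The equation above can be evaluated for a single data point with a few lines of standard C++. `PredictOne` is a hypothetical helper for illustration; it only evaluates the formula and does not fit the coefficients.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Evaluates y = a + b_1 x_1 + ... + b_n x_n for one data point.
// a is the intercept, b holds the coefficients and x the feature values.
double PredictOne(const double a,
                  const std::vector<double>& b,
                  const std::vector<double>& x)
{
  assert(b.size() == x.size());
  // inner_product accumulates the sum of b_i * x_i, starting from a.
  return std::inner_product(b.begin(), b.end(), x.begin(), a);
}
```

With intercept 1, coefficients (2, 3) and features (4, 5), the prediction is 1 + 2·4 + 3·5 = 24.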

To perform linear regression we'll be using the `LinearRegression` API from mlpack.

In [34]:
using namespace mlpack::regression;
LinearRegression lr(trainSet, trainLabels, 0.5);
// The above line creates and trains the model. The third argument (0.5) is
// lambda, the regularization parameter for ridge regression.

In [35]:
// Let's create a output vector for storing the results.
arma::rowvec output; 
lr.Predict(testSet, output);

In [36]:
lr.ComputeError(trainSet, trainLabels);

In [37]:
std::cout<<lr.ComputeError(trainSet, trainLabels);

4.74874e+09

Let's manually check some predictions.

In [38]:
std::cout << testLabels[1] << std::endl;
std::cout << output[1] << std::endl;

174300
190507


In [39]:
std::cout << testLabels[7] << std::endl;
std::cout << output[7] << std::endl;

170500
203024


In [40]:
arma::mat preds;
preds.insert_rows(0, testLabels);
preds.insert_rows(1, output);

In [41]:
mlpack::data::Save("preds.csv", preds);

### Model Evaluation
The test data is visualized using `testLabels` and `output`. The blue points indicate the data points and the blue line indicates the regression line, or best-fit line.

In [42]:
lmplot("preds.csv", "predsScatter");
auto img = xw::image_from_file("predsScatter.png").finalize();    
img

In [43]:
histplot("preds.csv", "Distribution of residuals");
auto img = xw::image_from_file("Distribution of residuals.png").finalize();
img

## Evaluation Metrics for Regression model

In the previous cells we visualized our model's performance by plotting the best-fit line. Now we will use several evaluation metrics to understand how well our model has performed.

* Mean Absolute Error (MAE) is the mean of the absolute differences between actual and predicted values, without considering their direction.
$$ MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_{i} - \hat{y}_{i} \rvert $$
* Mean Squared Error (MSE) is the mean of the squared differences between actual and predicted values. A lower value is better.
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 $$
* Root Mean Squared Error (RMSE) is the square root of the MSE and indicates the spread of the residual errors. It is never negative, and a lower value indicates better performance.
$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} $$
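As a sanity check of the three formulas on numbers that can be verified by hand, here is a small standard-library sketch. `RegressionErrors` and `ComputeErrors` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three metrics defined above.
struct RegressionErrors
{
  double mae;
  double mse;
  double rmse;
};

// Computes MAE, MSE and RMSE between actual and predicted values.
RegressionErrors ComputeErrors(const std::vector<double>& actual,
                               const std::vector<double>& predicted)
{
  assert(actual.size() == predicted.size() && !actual.empty());

  double absSum = 0.0;
  double squaredSum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double residual = actual[i] - predicted[i];
    absSum += std::abs(residual);
    squaredSum += residual * residual;
  }

  const double n = static_cast<double>(actual.size());
  const double mse = squaredSum / n;
  return {absSum / n, mse, std::sqrt(mse)};
}
```

For actual values (10, 20) and predictions (13, 16) the residuals are -3 and 4, so the MAE is (3 + 4) / 2 = 3.5 and the MSE is (9 + 16) / 2 = 12.5. Squaring penalizes the larger residual more heavily, which is why the RMSE (the square root of 12.5) is slightly larger than the MAE.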

In [44]:
// Model evaluation metrics.

std::cout << "Mean Absolute Error: " << arma::mean(arma::abs(output - testLabels)) << std::endl;
std::cout << "Mean Squared Error: " << arma::mean(arma::pow(output - testLabels,2)) << std::endl;
std::cout << "Root Mean Squared Error: " << sqrt(arma::mean(arma::pow(output - testLabels,2))) << std::endl;

Mean Absolute Error: 49434.7
Mean Squared Error: 4.78e+09
Root Mean Squared Error: 69137.5


We can see that the MAE is about 49,435, which is large when compared with the median house values, so the model does not seem to be a good fit.

Thus we can conclude that the simple linear regression model is not able to capture all the relationships in the features.
So, maybe it's time for you to try other algorithms.
<h5>NOTE : </h5> In the entire ML workflow, you never know exactly which model will perform the best. So, usually you try a lot of different algorithms to see which one fits the data best.