## Predicting Salary using Linear Regression

### Objective
* We have to predict the salary of an employee given how many years of experience they have.

### Dataset
* Salary_Data.csv has 2 columns — “Years of Experience” (feature) and “Salary” (target) for 30 employees in a company

### Approach
* In this example, we train a linear regression model to learn the relationship between each employee's years of experience and their salary.
* Once the model is trained, we use it to make some sample predictions.

In [1]:
!wget -q https://datasets.mlpack.org/Salary_Data.csv

In [2]:
// Include the necessary library headers.
#include <mlpack/xeus-cling.hpp>

#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/linear_regression/linear_regression.hpp>
#include <cmath>

In [3]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"

namespace plt = matplotlibcpp;

In [4]:
using namespace mlpack;
using namespace mlpack::regression;

In [5]:
// Load the dataset into an Armadillo matrix.
// mlpack loads data column-major: each column is one observation.

arma::mat inputs;
data::Load("Salary_Data.csv", inputs);

In [6]:
// Drop the first column, which holds the CSV header (columns are observations here).

inputs.shed_col(0);

In [7]:
// Display the first 6 observations (columns 0 to 5), transposed so each prints as a row.

std::cout << std::setw(18) << "Years Of Experience" << std::setw(10) << "Salary" << std::endl;
std::cout << inputs.submat(0, 0, inputs.n_rows-1, 5).t() << std::endl;

Years Of Experience    Salary
   1.1000e+00   3.9343e+04
   1.3000e+00   4.6205e+04
   1.5000e+00   3.7731e+04
   2.0000e+00   4.3525e+04
   2.2000e+00   3.9891e+04
   2.9000e+00   5.6642e+04



In [8]:
// Plot the input data.

std::vector<double> x = arma::conv_to<std::vector<double>>::from(inputs.row(0));
std::vector<double> y = arma::conv_to<std::vector<double>>::from(inputs.row(1));

plt::figure_size(800, 800);

plt::scatter(x, y, 12, {{"color","coral"}});
plt::xlabel("Years of Experience");
plt::ylabel("Salary in $");
plt::title("Experience vs. Salary");

plt::save("./scatter.png");
auto img = xw::image_from_file("scatter.png").finalize();
img


In [9]:
// Split the data into features (X) and target (y) variables
// targets are the last row.

arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(inputs.row(inputs.n_rows - 1));

In [10]:
// Labels are dropped from the originally loaded data to be used as features.

inputs.shed_row(inputs.n_rows - 1);

### Train Test Split
The dataset has to be split into a training set and a test set.
This can be done using the `data::Split()` API from mlpack.
Here the dataset has 30 observations and `testRatio` is set to 0.4, i.e. 40% of the total observations.
The test set should therefore have 40% * 30 = 12 observations, and the training set the remaining 18.
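
The arithmetic above can be sketched in plain C++. `SplitSizes` is a hypothetical helper written for illustration, not part of mlpack, and it assumes the test count is the ratio times the number of observations, truncated to an integer.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper (not an mlpack function): returns {trainCount, testCount}.
// Assumption: the test count is total * testRatio, truncated to an integer.
std::pair<std::size_t, std::size_t> SplitSizes(std::size_t total, double testRatio)
{
  const std::size_t testCount = static_cast<std::size_t>(total * testRatio);
  return {total - testCount, testCount};
}
```

Calling `SplitSizes(30, 0.4)` gives 18 training observations and 12 test observations, matching the counts stated above.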

In [11]:
// Split the dataset into train and test sets using mlpack.

arma::mat Xtrain;
arma::mat Xtest;
arma::Row<size_t> Ytrain;
arma::Row<size_t> Ytest;
data::Split(inputs, targets, Xtrain, Xtest, Ytrain, Ytest, 0.4);

In [12]:
// Convert the Armadillo rows to arma::rowvec (the format required by mlpack's LinearRegression API).

arma::rowvec yTrain = arma::conv_to<arma::rowvec>::from(Ytrain);
arma::rowvec yTest = arma::conv_to<arma::rowvec>::from(Ytest);

## Linear Model

Regression analysis is a widely used method of prediction. Linear regression is used when the data shows a linear relationship and, as the name suggests, simple linear regression has one independent variable (predictor) and one dependent variable (response).

The simple linear regression equation is $y = a + bx$, where $x$ is the explanatory variable, $y$ is the dependent variable, $b$ is the coefficient (slope) and $a$ is the intercept.

To perform linear regression, we'll use the `LinearRegression` class from mlpack.

In [13]:
// Create and train the linear regression model.
// The third argument (0.5) is lambda, the ridge (L2) regularization parameter.

regression::LinearRegression lr(Xtrain, yTrain, 0.5);

In [14]:
// Make predictions for test data points.

arma::rowvec yPreds;
lr.Predict(Xtest, yPreds);

In [15]:
// Convert the Armadillo vectors and matrices to std::vector for plotting.

std::vector<double> XtestPlot = arma::conv_to<std::vector<double>>::from(Xtest);
std::vector<double> yTestPlot = arma::conv_to<std::vector<double>>::from(yTest);
std::vector<double> yPredsPlot = arma::conv_to<std::vector<double>>::from(yPreds);

In [16]:
// Visualize Predicted datapoints.
plt::figure_size(800, 800);

plt::scatter(XtestPlot, yTestPlot, 12, {{"color", "coral"}});
plt::plot(XtestPlot,yPredsPlot);
plt::xlabel("Years of Experience");
plt::ylabel("Salary in $");
plt::title("Predicted Experience vs. Salary");

plt::save("./scatter1.png");
auto img = xw::image_from_file("scatter1.png").finalize();
img


The test data is visualized using `XtestPlot`, `yTestPlot` and `yPredsPlot`: the coral points are the actual test data points, and the blue line is the regression line (the best-fit line).

## Evaluation Metrics for Regression model

In the previous cell we visualized the model's performance by plotting the best-fit line. Now we will use several evaluation metrics to understand how well the model has performed.

* Mean Absolute Error (MAE) is the mean of the absolute differences between actual and predicted values, without considering their direction.
$$ MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_{i} - \hat{y}_{i} \rvert $$
* Mean Squared Error (MSE) is the mean of the squared differences between predicted and actual target values in a dataset. A lower value is better.
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 $$
* Root Mean Squared Error (RMSE) is the square root of the MSE. It indicates the spread of the residual errors. It is always non-negative, and a lower value indicates better performance.
$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} $$
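
The three formulas above can be written out as plain loops, which may help when reading the Armadillo one-liners. This is a minimal sketch using only the standard library; the function names are hypothetical and both vectors are assumed to be non-empty and of equal length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of |y_i - yhat_i|.
double MeanAbsoluteError(const std::vector<double>& actual,
                         const std::vector<double>& predicted)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
    sum += std::abs(actual[i] - predicted[i]);
  return sum / actual.size();
}

// Mean of (y_i - yhat_i)^2.
double MeanSquaredError(const std::vector<double>& actual,
                        const std::vector<double>& predicted)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double diff = actual[i] - predicted[i];
    sum += diff * diff;
  }
  return sum / actual.size();
}

// Square root of the MSE.
double RootMeanSquaredError(const std::vector<double>& actual,
                            const std::vector<double>& predicted)
{
  return std::sqrt(MeanSquaredError(actual, predicted));
}
```

With actual values {1, 2, 3} and predictions {2, 2, 5}, the errors are 1, 0 and 2, so the MAE is 1, the MSE is 5/3 and the RMSE is the square root of 5/3.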

In [101]:
// Model evaluation metrics.

std::cout << "Mean Absolute Error: " << arma::mean(arma::abs(yPreds - yTest)) << std::endl;
std::cout << "Mean Squared Error: " << arma::mean(arma::pow(yPreds - yTest,2)) << std::endl;
std::cout << "Root Mean Squared Error: " << sqrt(arma::mean(arma::pow(yPreds - yTest,2))) << std::endl;

Mean Absolute Error: 5753.06
Mean Squared Error: 3.9482e+07
Root Mean Squared Error: 6283.47


From the above metrics we can see that the model's MAE is about 5.75K, which is relatively small compared to the average salary of $76003. From this we can conclude that the model is a reasonably good fit.
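
As a rough check on that comparison, using only the two figures reported above, the MAE relative to the average salary works out to:

$$ \frac{MAE}{\bar{y}} = \frac{5753.06}{76003} \approx 0.076 $$

That is, the typical prediction error is roughly 7.6% of the average salary.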