### Predicting Avocado's Average Price using Linear Regression

### Objective
* Our goal is to predict the future price of avocados from various features (Type, Region, Total Bags, ...).

### Dataset
The Avocado Prices dataset has the following features:
* PLU - Product Lookup Code used by the Hass Avocado Board.
* Date - The date of the observation.
* AveragePrice - Observed average price of a single avocado.
* Total Volume - Total number of avocados sold.
* 4046 - Total number of avocados with PLU 4046 sold.
* 4225 - Total number of avocados with PLU 4225 sold.
* 4770 - Total number of avocados with PLU 4770 sold.
* Total Bags = Small Bags + Large Bags + XLarge Bags.
* Type - Conventional or organic.
* Year - Year of observation.
* Region - City or region of observation.

### Approach
* First, we will perform exploratory data analysis (EDA) on the dataset to find correlations between the features.
* Then we will use one-hot encoding to encode the categorical features.
* Next, we will use the LinearRegression API from mlpack to learn the relationship between the features and the target, i.e., AveragePrice.
* After training the model, we will use it to make predictions and then compute several evaluation metrics to quantify how well it performs.

In [None]:
!wget -q https://mlpack.org/datasets/avocado.csv.gz

In [None]:
!gzip -d avocado.csv.gz

In [2]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/core/data/one_hot_encoding.hpp>
#include <mlpack/methods/linear_regression/linear_regression.hpp>

In [3]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [4]:
using namespace mlpack;
using namespace mlpack::data;

Drop the dataset header using sed, a Unix utility that parses and transforms text.

In [5]:
!cat avocado.csv | sed 1d > avocado_trim.csv

Drop columns 1 and 2 ("Unnamed: 0", "Date"), as they are not required and their presence causes issues while loading the data.

In [6]:
!cut -d, -f1-2 --complement avocado_trim.csv > avocado_trim2.csv

Rename the newly created csv file.

In [7]:
!rm avocado_trim.csv

In [8]:
!mv avocado_trim2.csv avocado_trim.csv

### Loading the Data
Armadillo matrices can hold only numeric data, but features 9 (avocado type) and 11 (region of observation) are strings (categorical). We therefore have to mark them explicitly as categorical in `DatasetInfo`.
This allows mlpack to map each distinct string to a numeric value, which can later be unmapped back to the original string.

In [9]:
// Load the dataset into armadillo matrix.

arma::mat matrix;
mlpack::data::DatasetInfo info;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(11) = mlpack::data::Datatype::categorical;
data::Load("avocado_trim.csv", matrix, info);
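
The string-to-number mapping described above can be sketched in plain C++. This is only an illustration of the idea, not mlpack's implementation: `CategoryMapper`, `Map` and `Unmap` are hypothetical names, and assigning codes in order of first appearance is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): gives each distinct string the
// next unused integer code, in order of first appearance (an assumption).
struct CategoryMapper
{
  std::unordered_map<std::string, std::size_t> toCode;
  std::vector<std::string> toString;

  // Return the code for `value`, creating a new code if it is unseen.
  std::size_t Map(const std::string& value)
  {
    auto it = toCode.find(value);
    if (it != toCode.end())
      return it->second;

    const std::size_t code = toString.size();
    toCode.emplace(value, code);
    toString.push_back(value);
    return code;
  }

  // Recover the original string from a previously assigned code.
  const std::string& Unmap(std::size_t code) const
  {
    assert(code < toString.size());
    return toString[code];
  }
};
```

With this sketch, mapping "conventional" and then "organic" yields the codes 0 and 1, and `Unmap(1)` gives back "organic".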

In [10]:
// Printing header for dataset.
std::cout << std::setw(10) << "AveragePrice" << std::setw(14) 
    << "Total Volume" << std::setw(9) << "4046" << std::setw(13) 
    << "4225" << std::setw(13) << "4770" << std::setw(17) << "Total Bags" 
    << std::setw(13) << "Small Bags" << std::setw(13) << "Large Bags" 
    << std::setw(17) << "XLarge Bags" << std::setw(10) << "Type" 
    << std::setw(10) << "Year" << std::setw(15) << "Region" <<  std::endl;

std::cout << matrix.submat(0, 0, matrix.n_rows-1, 5).t() << std::endl;

AveragePrice  Total Volume     4046         4225         4770       Total Bags   Small Bags   Large Bags      XLarge Bags      Type      Year         Region
   1.3300e+00   6.4237e+04   1.0367e+03   5.4455e+04   4.8160e+01   8.6969e+03   8.6036e+03   9.3250e+01            0            0   2.0150e+03            0
   1.3500e+00   5.4877e+04   6.7428e+02   4.4639e+04   5.8330e+01   9.5056e+03   9.4081e+03   9.7490e+01            0            0   2.0150e+03            0
   9.3000e-01   1.1822e+05   7.9470e+02   1.0915e+05   1.3050e+02   8.1454e+03   8.0422e+03   1.0314e+02            0            0   2.0150e+03            0
   1.0800e+00   7.8992e+04   1.1320e+03   7.1976e+04   7.2580e+01   5.8112e+03   5.6774e+03   1.3376e+02            0            0   2.0150e+03            0
   1.2800e+00   5.1040e+04   9.4148e+02   4.3838e+04   7.5780e+01   6.1839e+03   5.9863e+03   1.9769e+02            0            0   2.0150e+03            0
   1.2600e+00   5.5980e+04   1.1843e+03   4.8068e+04   4.3

### Exploratory Data Analysis

* The visualization below shows whether the prices of conventional avocados follow any trend over time.

In [11]:
scatter("avocado.csv", "Date", "AveragePrice", "Date", "type", "conventional", "AveragePrice", "Date", "Average Price (USD)", "Average Price of Conventional Avocados Over Time");
auto img = xw::image_from_file("Average Price of Conventional Avocados Over Time.png").finalize();
img

* The visualization below shows whether the prices of organic avocados follow any trend over time.

In [12]:
scatter("avocado.csv", "Date", "AveragePrice", "Date", "type", "organic", "AveragePrice", "Date", "Average Price (USD)", "Average Price of Organic Avocados Over Time");
auto img = xw::image_from_file("Average Price of Organic Avocados Over Time.png").finalize();
img

### Observations
* Avocados appear to be most expensive between August and November every year.
* There is a steep rise in price in 2017.
* December to February seem to be the best months to purchase avocados.

In [13]:
barplot("avocado.csv", "AveragePrice", "region", "Date", "Avg.Price of Avocado by Region", 8, 10);
auto img = xw::image_from_file("Avg.Price of Avocado by Region.png").finalize();
img

In [14]:
barplot("avocado.csv", "type", "AveragePrice", "Date", "Avg.Price of Avocado by Type");
auto img = xw::image_from_file("Avg.Price of Avocado by Type.png").finalize();
img

### Correlation
There is high correlation between:
* 4046 & Total Volume.
* 4225 & Total Volume.
* 4770 & Total Volume.
* Total Bags & Total Volume.
* Small Bags & Total Bags.

From these we can observe:
* 4046 avocados are the most sold type in the US.
* Since Total Bags, Total Volume and Small Bags are highly correlated, we assume most sales come from small bags.

In [15]:
heatmap("avocado.csv","coolwarm", "Correlation Heatmap", true);
auto img = xw::image_from_file("Correlation Heatmap.png").finalize();
img

As we can see from the heatmap above, the features show little correlation with the AveragePrice column; instead, most of them are correlated with each other.
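
To make "correlation" concrete, here is a minimal sketch of the Pearson correlation coefficient between two features. It assumes the heatmap shows pairwise Pearson coefficients, which the notebook does not state for the `heatmap` helper, and `PearsonCorrelation` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Pearson correlation coefficient of two equally long
// samples. The result lies in [-1, 1]; values near +/-1 indicate a strong
// linear relationship.
double PearsonCorrelation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
  assert(x.size() == y.size() && x.size() > 1);
  const double n = static_cast<double>(x.size());

  // Sample means of both features.
  double meanX = 0.0, meanY = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    meanX += x[i];
    meanY += y[i];
  }
  meanX /= n;
  meanY /= n;

  // Sums of products of deviations from the means.
  double cov = 0.0, varX = 0.0, varY = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    const double dx = x[i] - meanX;
    const double dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  // The coefficient is undefined for a constant feature.
  assert(varX > 0.0 && varY > 0.0);
  return cov / std::sqrt(varX * varY);
}
```

Two features that rise together in exact proportion give a coefficient of 1, and a feature paired with its mirror image gives -1.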

### Handling Categorical Features

* One-hot encoding is used to perform “binarization” of a categorical feature so that it can be included as a feature when training the model.
* The dataset has 54 regions and 2 unique types, so the type and region features are easy to transform.

In [16]:
arma::mat output;
data::OneHotEncoding(matrix, output, info);
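
The idea behind one-hot encoding can be sketched without mlpack. The function below is a hypothetical illustration, not the library routine: it takes integer category codes and produces one 0/1 indicator row per category, with observations as columns to match the column-per-observation layout used in this notebook.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand integer category codes in [0, numCategories)
// into indicator rows. encoded[c][j] is 1 when observation j belongs to
// category c, and 0 otherwise.
std::vector<std::vector<double>> OneHot(const std::vector<std::size_t>& codes,
                                        std::size_t numCategories)
{
  std::vector<std::vector<double>> encoded(
      numCategories, std::vector<double>(codes.size(), 0.0));

  for (std::size_t j = 0; j < codes.size(); ++j)
  {
    assert(codes[j] < numCategories);
    encoded[codes[j]][j] = 1.0;
  }
  return encoded;
}
```

For example, the codes {0, 2, 1, 2} with three categories become three rows of four values, where each column contains exactly one 1.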

In [17]:
// Split the data into features (X) and target (y) variables.
// The target (AveragePrice) is the first row.

arma::Row<double> targets = arma::conv_to<arma::Row<double>>::from(output.row(0));

In [18]:
// The target row is dropped from the encoded data so that only the features remain.

output.shed_row(0);

### Train Test Split

The dataset has to be split into a training set and a test set. Here the dataset has 18249 observations and the test ratio is 20% of the total observations. The test set should therefore have 20% * 18249 ≈ 3649 observations (rounded down) and the training set should have the remaining 14600 observations.

In [19]:
// Split the dataset into train and test sets using mlpack.

arma::mat Xtrain;
arma::mat Xtest;
arma::Row<double> Ytrain;
arma::Row<double> Ytest;
data::Split(output, targets, Xtrain, Xtest, Ytrain, Ytest, 0.2);
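
The split sizes quoted above can be checked with a few lines of arithmetic. This is a sketch only: `SplitSizes` is a hypothetical helper, and truncating the test size toward zero is an assumption about how the rounding is done.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns {trainSize, testSize} for n observations.
// The test size is truncated toward zero (an assumption about the rounding).
std::pair<std::size_t, std::size_t> SplitSizes(std::size_t n, double testRatio)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);
  const std::size_t testSize = static_cast<std::size_t>(n * testRatio);
  return {n - testSize, testSize};
}
```

Calling `SplitSizes(18249, 0.2)` gives 14600 training observations and 3649 test observations, matching the numbers in the text.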

In [20]:
// Convert the Armadillo rows into arma::rowvec (the format used by mlpack's LinearRegression API).

arma::rowvec yTrain = arma::conv_to<arma::rowvec>::from(Ytrain);
arma::rowvec yTest = arma::conv_to<arma::rowvec>::from(Ytest);

### Training the linear model

Regression analysis is one of the most widely used methods of prediction. Linear regression is used when the relationship between the features and the target is approximately linear. As the name suggests, multiple linear regression has one dependent variable (response) and more than one independent variable (predictor).

The multiple linear regression equation is $y = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + ... + b_{n}x_{n}$, where $x_{i}$ is the ith explanatory variable, $y$ is the dependent variable, $b_{i}$ is the ith coefficient and $a$ is the intercept.
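
As a minimal sketch, the equation can be evaluated for a single observation as follows. `PredictLinear` is a hypothetical helper written for illustration; it is not how mlpack computes predictions internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: evaluates y = a + b_1 x_1 + ... + b_n x_n
// for one observation.
double PredictLinear(double intercept,
                     const std::vector<double>& coefficients,
                     const std::vector<double>& features)
{
  assert(coefficients.size() == features.size());

  double y = intercept;
  for (std::size_t i = 0; i < coefficients.size(); ++i)
    y += coefficients[i] * features[i];
  return y;
}
```

For instance, with $a = 1$, $b = (2, -0.5)$ and $x = (3, 4)$ the prediction is $1 + 6 - 2 = 5$.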

To perform linear regression we'll be using the `LinearRegression()` API from mlpack.

In [21]:
// Create and train the linear regression model.
// The third argument (0.5) is lambda, the ridge (L2) regularization parameter.

regression::LinearRegression lr(Xtrain, yTrain, 0.5);

### Making Predictions on Test set

In [22]:
// Make predictions on test data points.

arma::rowvec yPreds;
lr.Predict(Xtest, yPreds);

In [23]:
// Save the yTest and yPreds into csv for generating plots.
arma::mat preds;
preds.insert_rows(0, yTest);
preds.insert_rows(1, yPreds);

In [24]:
mlpack::data::Save("predictions.csv", preds);

### Model Evaluation
The test data is visualized using `yTest` and `yPreds`. The blue points are the data points and the blue line is the regression (best-fit) line.

In [25]:
lmplot("predictions.csv", "predsScatter");
auto img = xw::image_from_file("predsScatter.png").finalize();    
img

In [26]:
histplot("predictions.csv", "Distribution of residuals");
auto img = xw::image_from_file("Distribution of residuals.png").finalize();    
img

### Evaluation Metrics for the Regression Model

In the previous cells we visualized our model's performance by plotting the best-fit line. Now we will use several evaluation metrics to quantify how well the model has performed.

* Mean Absolute Error (MAE) is the mean of the absolute differences between actual and predicted values, without considering their direction. A lower value is better.
$$ MAE = \frac{\sum_{i=1}^n\lvert y_{i} - \hat{y}_{i}\rvert} {n} $$
* Mean Squared Error (MSE) is the mean of the squared differences between actual and predicted target values. A lower value is better.
$$ MSE = \frac {1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 $$
* Root Mean Squared Error (RMSE) is the square root of the MSE and indicates the spread of the residual errors. It is always non-negative, and a lower value indicates better performance.
$$ RMSE = \sqrt{\frac {1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} $$
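
The three formulas can be sketched in plain C++ on a small hand-checkable example. `RegressionErrors` and `ComputeErrors` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three metrics.
struct RegressionErrors
{
  double mae;
  double mse;
  double rmse;
};

// Hypothetical helper: computes MAE, MSE and RMSE from actual and
// predicted values of equal, non-zero length.
RegressionErrors ComputeErrors(const std::vector<double>& actual,
                               const std::vector<double>& predicted)
{
  assert(actual.size() == predicted.size() && !actual.empty());

  double absSum = 0.0, sqSum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double residual = actual[i] - predicted[i];
    absSum += std::abs(residual);
    sqSum += residual * residual;
  }

  const double n = static_cast<double>(actual.size());
  const double mse = sqSum / n;
  return {absSum / n, mse, std::sqrt(mse)};
}
```

With actual values (1, 2, 3, 4) and predictions (1.5, 2, 2, 6), the residuals are (-0.5, 0, 1, -2), so MAE = 3.5 / 4 = 0.875, MSE = 5.25 / 4 = 1.3125 and RMSE is the square root of 1.3125.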

In [27]:
// Model evaluation metrics.

std::cout << "Mean Absolute Error: " << arma::mean(arma::abs(yPreds - yTest)) << std::endl;
std::cout << "Mean Squared Error: " << arma::mean(arma::pow(yPreds - yTest,2)) << std::endl;
std::cout << "Root Mean Squared Error: " << sqrt(arma::mean(arma::pow(yPreds - yTest,2))) << std::endl;

Mean Absolute Error: 0.201418
Mean Squared Error: 0.0721066
Root Mean Squared Error: 0.268527


From the above metrics, we can see that the model's MAE is ~0.2, which is relatively small compared to the average price of $1.405. From this and the above plot, we can conclude that the model is a reasonably good fit.
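
As a rough check of that comparison, using only the MAE and average price reported above:

$$ \frac{MAE}{\text{average price}} = \frac{0.201418}{1.405} \approx 0.143 $$

So the typical absolute error is about 14% of the average price.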