### Predicting the Average Price of Avocados Using Linear Regression

### Objective
* Our goal is to predict the future price of avocados based on various features (Type, Region, Total Bags, ...)

### Dataset
The Avocado Prices dataset has the following features (PLU is the Product Lookup Code used by the Hass Avocado Board):

* Date - the date of the observation
* AveragePrice - observed average price of a single avocado
* Total Volume - total number of avocados sold
* 4046 - total number of avocados with PLU 4046 sold
* 4225 - total number of avocados with PLU 4225 sold
* 4770 - total number of avocados with PLU 4770 sold
* Total Bags - sum of Small Bags, Large Bags, and XLarge Bags
* Type - conventional or organic
* Year - year of the observation
* Region - city or region of the observation
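
The Total Bags relationship can be checked against a single observation. The sketch below uses only the standard library; `bagsAreConsistent` is a hypothetical helper (not part of mlpack), and the sample values are those of the first observation printed later in this notebook.

```cpp
#include <cmath>

// Hypothetical helper: returns true when the three bag counts add up to
// totalBags, within a small tolerance for floating-point rounding.
bool bagsAreConsistent(double totalBags, double smallBags, double largeBags,
                       double xlargeBags, double tolerance = 1e-6)
{
  const double sum = smallBags + largeBags + xlargeBags;
  return std::abs(totalBags - sum) <= tolerance;
}
```

For the first observation, `bagsAreConsistent(8696.87, 8603.62, 93.25, 0.0)` holds, since 8603.62 + 93.25 + 0 = 8696.87.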

### Approach
* First, we perform exploratory data analysis (EDA) on the dataset to find correlations between the features
* Then we use one-hot encoding to encode the categorical features
* Next, we use the LinearRegression API from mlpack to learn the relationship between the features and the target, i.e. AveragePrice
* After training the model, we use it to make predictions and then compute several evaluation metrics to quantify how well the model performs

In [1]:
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/core/data/one_hot_encoding.hpp>
#include <mlpack/methods/linear_regression/linear_regression.hpp>

In [2]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [3]:
using namespace mlpack;
using namespace mlpack::data;

In [4]:
!cat avocado.csv | sed 1d > avocado_trim.csv

In [5]:
!cut -d, -f1-2 --complement avocado_trim.csv > avocado_trim2.csv

In [6]:
!rm avocado_trim.csv

In [7]:
!mv avocado_trim2.csv avocado_trim.csv

In [8]:
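// After trimming, column 9 is Type and column 11 is Region (0-indexed);
// mark both as categorical before loading.
arma::mat matrix;
mlpack::data::DatasetInfo info;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(11) = mlpack::data::Datatype::categorical;
data::Load("avocado_trim.csv", matrix, info);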
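
Marking a column as categorical means its string values are replaced by numeric codes when the file is loaded. The sketch below illustrates the idea with the standard library only. It assumes codes are assigned in order of first appearance, starting from 0, which is consistent with the zeros shown for Type and Region in the first rows of the printed dataset. `CategoryMapper` is a hypothetical name, not an mlpack class.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical illustration of mapping category strings to numeric codes.
// Assumption: each new string gets the next unused code, starting at 0.
class CategoryMapper
{
 public:
  // Returns the code for `value`, assigning a new code on first sight.
  std::size_t Map(const std::string& value)
  {
    auto it = codes.find(value);
    if (it != codes.end())
      return it->second;
    const std::size_t code = codes.size();
    codes.emplace(value, code);
    return code;
  }

  // Number of distinct categories seen so far.
  std::size_t NumCategories() const { return codes.size(); }

 private:
  std::unordered_map<std::string, std::size_t> codes;
};
```

With this sketch, mapping "conventional" and then "organic" yields the codes 0 and 1, and mapping "conventional" again returns 0.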

In [9]:
// Printing header for dataset.
std::cout << std::setw(10) << "AveragePrice" << std::setw(14) << "Total Volume" << std::setw(9) << "4046" << std::setw(13) << "4225" << std::setw(13) << "4770" 
    << std::setw(17) << "Total Bags" << std::setw(13) << "Small Bags" << std::setw(13) << "Large Bags" << std::setw(17) << "XLarge Bags" << 
    std::setw(10) << "Type" << std::setw(10) << "Year" << std::setw(15) << "Region" <<  std::endl;

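// mlpack stores each observation as a column, so take the first 6 columns
// and transpose them to print one observation per line.
std::cout << matrix.submat(0, 0, matrix.n_rows-1, 5).t() << std::endl;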

AveragePrice  Total Volume     4046         4225         4770       Total Bags   Small Bags   Large Bags      XLarge Bags      Type      Year         Region
   1.3300e+00   6.4237e+04   1.0367e+03   5.4455e+04   4.8160e+01   8.6969e+03   8.6036e+03   9.3250e+01            0            0   2.0150e+03            0
   1.3500e+00   5.4877e+04   6.7428e+02   4.4639e+04   5.8330e+01   9.5056e+03   9.4081e+03   9.7490e+01            0            0   2.0150e+03            0
   9.3000e-01   1.1822e+05   7.9470e+02   1.0915e+05   1.3050e+02   8.1454e+03   8.0422e+03   1.0314e+02            0            0   2.0150e+03            0
   1.0800e+00   7.8992e+04   1.1320e+03   7.1976e+04   7.2580e+01   5.8112e+03   5.6774e+03   1.3376e+02            0            0   2.0150e+03            0
   1.2800e+00   5.1040e+04   9.4148e+02   4.3838e+04   7.5780e+01   6.1839e+03   5.9863e+03   1.9769e+02            0            0   2.0150e+03            0
   1.2600e+00   5.5980e+04   1.1843e+03   4.8068e+04   4.3

In [10]:
scatter("avocado.csv", "conventional");
auto img = xw::image_from_file("cscatter_conventional.png").finalize();
img


In [11]:
scatter("avocado.csv", "organic");
auto img = xw::image_from_file("cscatter_organic.png").finalize();
img


In [12]:
barplot("avocado.csv", "AveragePrice", "region", "Avg.Price of Avocado by Region", 8, 10);
auto img = xw::image_from_file("cbarplot_Avg.Price of Avocado by Region.png").finalize();
img


In [13]:
barplot("avocado.csv", "type", "AveragePrice", "Avg.Price of Avocado by Type");
auto img = xw::image_from_file("cbarplot_Avg.Price of Avocado by Type.png").finalize();
img


In [14]:
heatmap("avocado.csv","coolwarm", "Correlation Heatmap", true);
auto img = xw::image_from_file("cheatmap_Correlation Heatmap.png").finalize();
img


In [15]:
arma::mat output;
data::OneHotEncoding(matrix, output, info);

In [16]:
output.n_rows

66
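
The 66 rows follow from the encoding: the 10 numeric features are kept as they are, and each categorical feature is replaced by one indicator row per category. Assuming Type contributes 2 rows (conventional and organic), the remaining 54 rows come from Region. A minimal standard-library sketch of the idea follows; `oneHot` and `encodedRowCount` are hypothetical helpers, not mlpack functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one-hot encode a category code in [0, numCategories).
std::vector<double> oneHot(std::size_t code, std::size_t numCategories)
{
  assert(code < numCategories);
  std::vector<double> encoded(numCategories, 0.0);
  encoded[code] = 1.0;
  return encoded;
}

// Hypothetical helper: number of rows after encoding, given the number of
// numeric features and the category count of each categorical feature.
std::size_t encodedRowCount(std::size_t numNumeric,
                            const std::vector<std::size_t>& categoryCounts)
{
  std::size_t rows = numNumeric;
  for (std::size_t count : categoryCounts)
    rows += count;
  return rows;
}
```

Here `oneHot(1, 3)` gives `{0, 1, 0}`, and `encodedRowCount(10, {2, 54})` gives 66, matching the row count above under the stated assumption.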

In [17]:
arma::Row<double> targets = arma::conv_to<arma::Row<double>>::from(output.row(0));

In [18]:
// Remove the target row (AveragePrice) so that only features remain.
output.shed_row(0);

In [19]:
output.col(0)

{ 64236.620, 1036.7400, 54454.850, 48.160000, 8696.8700, 8603.6200, 93.250000, 0.0000000, 1.0000000, 0.0000000, 2015.0000, 1.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000 }

In [20]:
arma::mat Xtrain;
arma::mat Xtest;
arma::Row<double> Ytrain;
arma::Row<double> Ytest;

In [21]:
data::Split(output, targets, Xtrain, Xtest, Ytrain, Ytest, 0.2);
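
`data::Split` with a test ratio of 0.2 holds out roughly 20% of the observations for testing and keeps the rest for training. The idea can be sketched on plain indices with the standard library. `splitIndices` is a hypothetical helper, and the truncating size computation and the fixed seed are assumptions of this sketch rather than a description of mlpack's internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: shuffle the indices 0..n-1 and split them into
// (train, test), where test receives floor(n * testRatio) indices.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
splitIndices(std::size_t n, double testRatio, unsigned seed = 42)
{
  std::vector<std::size_t> indices(n);
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // Shuffle so that the held-out set is a random sample.
  std::mt19937 generator(seed);
  std::shuffle(indices.begin(), indices.end(), generator);

  const std::size_t testSize = static_cast<std::size_t>(n * testRatio);
  const std::size_t trainSize = n - testSize;

  std::vector<std::size_t> train(indices.begin(),
                                 indices.begin() + trainSize);
  std::vector<std::size_t> test(indices.begin() + trainSize, indices.end());
  return {train, test};
}
```

For example, `splitIndices(8, 0.25)` returns 6 training indices and 2 test indices, and every index appears in exactly one of the two sets.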

In [22]:
// arma::rowvec is an alias for arma::Row<double>, so these are plain copies.
arma::rowvec yTrain = arma::conv_to<arma::rowvec>::from(Ytrain);
arma::rowvec yTest = arma::conv_to<arma::rowvec>::from(Ytest);

In [23]:
// Train the model; the third argument (0.5) is the ridge regularization parameter lambda.
regression::LinearRegression lr(Xtrain, yTrain, 0.5);
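
Because a non-zero regularization parameter is passed to the constructor, the fitted model is a ridge (L2-regularized) linear regression rather than plain least squares. As a sketch, writing $x_{i}$ for the feature vector of observation $i$, $\beta$ for the weights, $b$ for the intercept, and $\lambda$ for the regularization parameter (notation introduced here, not taken from mlpack), the general ridge objective is:

$$ \min_{\beta, b} \sum_{i=1}^n (y_{i} - \beta^\top x_{i} - b)^2 + \lambda \lVert \beta \rVert_2^2 $$

Setting $\lambda = 0$ recovers ordinary least squares, while larger values shrink the weights toward zero. Whether the intercept is included in the penalty depends on the implementation.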

In [24]:
arma::rowvec yPreds;
lr.Predict(Xtest, yPreds);

In [25]:
// Stack actual values (row 0) and predicted values (row 1) for saving.
arma::mat preds;
preds.insert_rows(0, yTest);
preds.insert_rows(1, yPreds);

In [26]:
mlpack::data::Save("predictions.csv", preds);

In [27]:
lmplot("predictions.csv");
auto img = xw::image_from_file("clmplot_predictions.csv.png").finalize();    
img


In [28]:
histplot("predictions.csv", "Distribution of residuals");
auto img = xw::image_from_file("chistplot_Distribution of residuals.png").finalize();    
img


In [None]:
std::vector<double> yTestPlot = arma::conv_to<std::vector<double>>::from(Ytest);
std::vector<double> yPredsPlot = arma::conv_to<std::vector<double>>::from(yPreds);

In [None]:
// Visualize Predicted datapoints.
plt::figure_size(800, 800);

plt::scatter(yTestPlot, yPredsPlot, 12);
plt::xlabel("Y Test");
plt::ylabel("Pred");
plt::title("AveragePrice vs Predicted Average Price");

plt::save("./scatter1.png");
auto img = xw::image_from_file("scatter1.png").finalize();
img

## Evaluation Metrics for Regression model

In the previous cells we visualized our model's performance by plotting the best-fit line. Now we will use several evaluation metrics to measure how well the model has performed.

* Mean Absolute Error (MAE) is the mean of the absolute differences between actual and predicted values, ignoring the direction of the error. A lower value is better.
$$ MAE = \frac{1}{n} \sum_{i=1}^n \lvert y_{i} - \hat{y}_{i} \rvert $$
* Mean Squared Error (MSE) is the mean of the squared differences between actual and predicted target values. A lower value is better.
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2 $$
* Root Mean Squared Error (RMSE) is the square root of the MSE and indicates the spread of the residual errors in the same units as the target. It is never negative, and a lower value indicates better performance.
$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2} $$
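
The three metrics can be computed directly from their formulas. The following is a minimal sketch using only the standard library; the function names are illustrative and not part of mlpack or Armadillo.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean Absolute Error: average of |y_i - yhat_i|.
double meanAbsoluteError(const std::vector<double>& actual,
                         const std::vector<double>& predicted)
{
  assert(actual.size() == predicted.size() && !actual.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
    sum += std::abs(actual[i] - predicted[i]);
  return sum / actual.size();
}

// Mean Squared Error: average of (y_i - yhat_i)^2.
double meanSquaredError(const std::vector<double>& actual,
                        const std::vector<double>& predicted)
{
  assert(actual.size() == predicted.size() && !actual.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double error = actual[i] - predicted[i];
    sum += error * error;
  }
  return sum / actual.size();
}

// Root Mean Squared Error: square root of the MSE.
double rootMeanSquaredError(const std::vector<double>& actual,
                            const std::vector<double>& predicted)
{
  return std::sqrt(meanSquaredError(actual, predicted));
}
```

As a small worked example, for actual values {1.0, 2.0, 3.0} and predictions {1.5, 2.0, 2.0} the errors are 0.5, 0 and 1.0, so MAE = 1.5 / 3 = 0.5, MSE = 1.25 / 3 ≈ 0.4167 and RMSE ≈ 0.6455.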

In [29]:
// Model evaluation metrics.

std::cout << "Mean Absolute Error: " << arma::mean(arma::abs(yPreds - yTest)) << std::endl;
std::cout << "Mean Squared Error: " << arma::mean(arma::pow(yPreds - yTest,2)) << std::endl;
std::cout << "Root Mean Squared Error: " << sqrt(arma::mean(arma::pow(yPreds - yTest,2))) << std::endl;

Mean Absolute Error: 0.204229
Mean Squared Error: 0.0749513
Root Mean Squared Error: 0.273772
