### Using Random Forest to predict rainfall

### What is our objective?
* To reliably predict the next day's rainfall using possible determinants

### Getting to know our WeatherAus dataset! 

The WeatherAus dataset contains about 10 years of daily weather observations from many locations across Australia. It has the following features:

* Date - The date of observation
* Location - Location of the weather station
* MinTemp - Minimum temperature of the observed day in degrees Celsius
* MaxTemp - Maximum temperature of the observed day in degrees Celsius
* Rainfall - The amount of rainfall recorded for the day in mm
* Evaporation - Class A pan evaporation (mm) in the 24 hours to 9am
* Sunshine - The number of hours of bright sunshine in the day.
* WindGustDir - The direction of the strongest wind gust in the 24 hours to midnight
* WindGustSpeed - The speed (km/h) of the strongest wind gust in the 24 hours to midnight
* WindDir9am - Direction of the wind at 9am
* WindDir3pm - Direction of the wind at 3pm
* WindSpeed9am - Wind speed (km/h) averaged over 10 minutes prior to 9am
* WindSpeed3pm - Wind speed (km/h) averaged over 10 minutes prior to 3pm
* Humidity9am - Humidity (percent) at 9am
* Humidity3pm - Humidity (percent) at 3pm
* Pressure9am - Atmospheric pressure (hPa) reduced to mean sea level at 9am
* Pressure3pm - Atmospheric pressure (hPa) reduced to mean sea level at 3pm
* Cloud9am - Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud.
* Cloud3pm - Fraction of sky obscured by cloud at 3pm. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud.
* Temp9am - Temperature (degrees C) at 9am
* Temp3pm - Temperature (degrees C) at 3pm
* RainToday - 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
* RainTomorrow - Whether it rained the next day (Yes/No). This is the target variable

### Approach
* In this example, we will balance an imbalanced dataset using random oversampling, random undersampling and SMOTE.
* Then we'll label encode the categorical features.
* Handle missing values in the dataset by imputation (the "mean" imputation policy is used here).
* Split the preprocessed dataset and train on it using the RandomForest API from mlpack.
* Finally, we'll use metrics such as accuracy, F1-score and ROC AUC to judge the performance of our model.
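
The label-encoding step in the list above can be pictured as assigning each distinct category string an integer code. The following is a minimal standalone sketch using only the standard library; `LabelEncode` is a hypothetical helper written for illustration, and the codes mlpack assigns through `DatasetInfo` may differ.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): map each distinct string to an
// integer code, assigned in order of first appearance.
std::vector<std::size_t> LabelEncode(const std::vector<std::string>& values)
{
    std::unordered_map<std::string, std::size_t> codes;
    std::vector<std::size_t> encoded;
    encoded.reserve(values.size());
    for (const std::string& v : values)
    {
        // emplace() inserts only when the key is new, so a repeated
        // category keeps the code it was given the first time.
        auto it = codes.emplace(v, codes.size()).first;
        encoded.push_back(it->second);
    }
    return encoded;
}
```

For example, encoding the wind directions `{"N", "W", "N", "SE", "W"}` with this sketch gives the codes `0, 1, 0, 2, 1`.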

#### NOTE: This example has 4 parts implementing the above approach using the raw imbalanced data, random oversampling, SMOTE and random undersampling

In [None]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/random_forest/random_forest.hpp>
#include <mlpack/core/data/scaler_methods/standard_scaler.hpp>

In [None]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/preprocess.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [None]:
using namespace mlpack;

In [None]:
using namespace mlpack::data;

In [None]:
using namespace mlpack::tree;

### Part 1 - Modelling using Imbalanced Dataset

### Visualize the Missing Values

In [None]:
MissingPlot("weatherAUS.csv", "PuBu", "Part-1 Missing values pre-imputation");
auto img = xw::image_from_file("./plots/Part-1 Missing values pre-imputation.png").finalize();
img

The above visualization shows a high number of missing values in Sunshine, Evaporation, Cloud9am and Cloud3pm.
Even the worst-affected features are missing at most about 50% of their values, so instead of discarding them we will fill them in with a suitable imputation method.
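
To make the idea of mean imputation concrete, here is a minimal sketch that fills the gaps in a single numeric column. It assumes missing values are represented as NaN; `ImputeMean` is a hypothetical name, and the `Impute` helper from `../utils/preprocess.hpp` used below may work differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: replace every NaN in a column with the mean of the
// observed (non-NaN) values. Returns the mean that was used, or NaN when
// the column has no observed values at all.
double ImputeMean(std::vector<double>& column)
{
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : column)
    {
        if (!std::isnan(v))
        {
            sum += v;
            ++count;
        }
    }
    if (count == 0)
        return std::nan("");

    const double mean = sum / static_cast<double>(count);
    for (double& v : column)
    {
        if (std::isnan(v))
            v = mean;
    }
    return mean;
}
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind for features where roughly half of the values are filled in.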

In [None]:
// Perform imputation on the original dataset using "mean" imputation policy.
Impute("weatherAUS.csv");

Drop the dataset header using sed, a Unix utility that parses and transforms text.

In [None]:
!cat ./data/weatherAUS_mean_imputed.csv | sed 1d > ./data/weatherAUS_trim.csv

Drop column 1 ("Date"), as it is not required and causes issues while loading the data.

In [None]:
!cut -d, -f1 --complement ./data/weatherAUS_trim.csv > ./data/weatherAUS_trim2.csv

Rename the newly created csv file.

In [None]:
!rm ./data/weatherAUS_trim.csv

In [None]:
!mv ./data/weatherAUS_trim2.csv ./data/weatherAUS_trim.csv

In [None]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat weatherData;
mlpack::data::DatasetInfo info;

// Manually set the columns which contain categorical data in DatasetInfo.
info.Type(0) = mlpack::data::Datatype::categorical;
info.Type(6) = mlpack::data::Datatype::categorical;
info.Type(8) = mlpack::data::Datatype::categorical;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(20) = mlpack::data::Datatype::categorical;
info.Type(21) = mlpack::data::Datatype::categorical;

data::Load("./data/weatherAUS_trim.csv", weatherData, info);

In [None]:
data::Save("./data/weatherAUSEnc.csv", weatherData);

In [None]:
!sed -i '1iLocation,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow' ./data/weatherAUSEnc.csv

In [None]:
// Inspect the first 5 examples in the dataset
std::cout.precision(4);
std::cout.setf(std::ios::fixed);
std::cout << std::setw(15) << "Location" << std::setw(10) << "MinTemp" << std::setw(13) << "MaxTemp" 
          << std::setw(12) << "Rainfall" << std::setw(15) << "Evaporation" << std::setw(12) 
          << "Sunshine" << std::setw(14) << "WindGustDir" << std::setw(15) << "WindGustSpeed"
          << std::setw(12) << "WindDir9am" << std::setw(12) << "WindDir3pm" << std::setw(13)
          << "WindSpeed9am" << std::setw(14) << "WindSpeed3pm" << std::setw(13) 
          << "Humidity9am" << std::setw(12) << "Humidity3pm" << std::setw(14)
          << "Pressure9am" << std::setw(14) << "Pressure3pm" << std::setw(10) 
          << "Cloud9am" << std::setw(14) << "Cloud3pm" << std::setw(15)
          << "Temp9am" << std::setw(12) << "Temp3pm" << std::setw(16)
          << "RainToday" << std::setw(15) << "RainTomorrow" << std::endl;
// Each column of the matrix is one example, so columns 0 to 4 are the first 5 examples.
std::cout << weatherData.submat(0, 0, weatherData.n_rows-1, 4).t() << std::endl;

In [None]:
// Visualize the distribution of target classes.
CountPlot("./data/weatherAUS_mean_imputed.csv", "RainTomorrow", "", "Part-1 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-1 Distribution of target class.png").finalize();
img

### EDA

In [None]:
CountPlot("./data/weatherAUS_mean_imputed.csv", "WindDir9am", "", "Part-1 Direction of wind at 9 am");
auto img = xw::image_from_file("./plots/Part-1 Direction of wind at 9 am.png").finalize();
img

In [None]:
CountPlot("./data/weatherAUS_mean_imputed.csv", "WindDir3pm", "", "Part-1 Direction of wind at 3 pm");
auto img = xw::image_from_file("./plots/Part-1 Direction of wind at 3 pm.png").finalize();
img

In [None]:
CountPlot("./data/weatherAUS_mean_imputed.csv", "WindGustDir", "", "Part-1 Direction of wind Gust");
auto img = xw::image_from_file("./plots/Part-1 Direction of wind Gust.png").finalize();
img

### Visualize Correlation

In [None]:
HeatMapPlot("./data/weatherAUSEnc.csv", "coolwarm", "Part-1 Correlation Heatmap", 1);
auto img = xw::image_from_file("./plots/Part-1 Correlation Heatmap.png").finalize();
img

As we can observe from the above heatmap, there is high correlation between the following features:
* MinTemp & MaxTemp
* MinTemp & Temp9am
* MinTemp & Temp3pm
* MaxTemp & Temp9am
* MaxTemp & Temp3pm
* Temp3pm & Temp9am
* Pressure9am & Pressure3pm
* Evaporation & MaxTemp
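
For reference, the kind of pairwise statistic such a heatmap typically displays is the Pearson correlation coefficient. The sketch below computes it for two feature columns. It assumes that `HeatMapPlot` shows Pearson correlations, which is not confirmed here since the helper's implementation is not shown. `PearsonCorrelation` is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Pearson correlation coefficient of two equally sized
// samples. Returns 0 when the inputs are empty, differ in size, or when
// either sample is constant (the coefficient is undefined in that case).
double PearsonCorrelation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    const std::size_t n = x.size();
    if (n == 0 || y.size() != n)
        return 0.0;

    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= static_cast<double>(n);
    meanY /= static_cast<double>(n);

    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    if (varX == 0.0 || varY == 0.0)
        return 0.0;
    return cov / std::sqrt(varX * varY);
}
```

A value close to 1 or -1, as for the temperature and pressure pairs listed above, means the two features carry largely redundant information.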

In [None]:
// Split the data into features (X) and target (y) variables, targets are the last row.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(weatherData.row(weatherData.n_rows - 1));
// Targets are dropped from the loaded matrix.
weatherData.shed_row(weatherData.n_rows-1);

### Train Test Split

The dataset has to be split into a training set and a test set. Here the dataset has 145460 observations and the test ratio is 25%, so the test set should have 25% * 145460 = 36365 observations and the training set the remaining 109095 observations.
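
The arithmetic above can be checked with a few lines of standalone C++. `SplitSizes` is a hypothetical helper, and truncating the test count towards zero is an assumption of this sketch rather than a statement about how mlpack rounds.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: returns {trainCount, testCount} for n observations.
// The test count is n * testRatio truncated towards zero (an assumption).
std::pair<std::size_t, std::size_t> SplitSizes(std::size_t n, double testRatio)
{
    const std::size_t testCount =
        static_cast<std::size_t>(static_cast<double>(n) * testRatio);
    return {n - testCount, testCount};
}
```

Calling `SplitSizes(145460, 0.25)` gives 109095 training and 36365 test observations, matching the figures quoted above.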

In [None]:
// Split the dataset into train and test sets using mlpack Split API.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(weatherData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [None]:
// Standardize the train & test features.
arma::mat XtrainScaled, XtestScaled;
StandardScaler scale;
scale.Fit(Xtrain);
scale.Transform(Xtrain, XtrainScaled);
scale.Transform(Xtest, XtestScaled);

### Training the Random Forest model
To create the model we'll be using RandomForest() API from mlpack.
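
Conceptually, a random forest trains many decision trees on random subsets of the data and combines their outputs. The sketch below illustrates one common way of combining them, averaging the per-tree class probabilities and picking the most probable class. It is a toy illustration with a hypothetical `ForestVote` function, not mlpack's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: treeProbs[t][c] is the probability that tree t
// assigns to class c (all rows are assumed to have the same length).
// Fills avgProbs with the per-class average and returns the index of the
// class with the highest average probability.
std::size_t ForestVote(const std::vector<std::vector<double>>& treeProbs,
                       std::vector<double>& avgProbs)
{
    avgProbs.clear();
    if (treeProbs.empty())
        return 0;

    avgProbs.assign(treeProbs[0].size(), 0.0);
    const double numTrees = static_cast<double>(treeProbs.size());
    for (const std::vector<double>& probs : treeProbs)
    {
        for (std::size_t c = 0; c < avgProbs.size(); ++c)
            avgProbs[c] += probs[c] / numTrees;
    }

    std::size_t best = 0;
    for (std::size_t c = 1; c < avgProbs.size(); ++c)
    {
        if (avgProbs[c] > avgProbs[best])
            best = c;
    }
    return best;
}
```

With three trees giving `{0.9, 0.1}`, `{0.4, 0.6}` and `{0.3, 0.7}`, the averaged probabilities favour class 0 even though two of the three trees individually prefer class 1, which shows how averaging differs from a plain majority vote.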

In [None]:
// Create and train a Random Forest model with 2 classes and 100 trees.
RandomForest<> rf(XtrainScaled, Ytrain, 2, 100);

In [None]:
// Predict the values for test data using previously trained model as input.
arma::Row<size_t> output;
arma::mat probs;
rf.Classify(XtestScaled, output, probs);

In [None]:
// Save predicted probabilities and ground truth as csv for generating ROC AUC curve.
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [None]:
double Accuracy(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    const size_t correct = arma::accu(yPreds == yTrue);
    return (double)correct / (double)yTrue.n_elem;
}

In [None]:
double Precision(const size_t truePos, const size_t falsePos)
{
    return (double)truePos / (double)(truePos + falsePos);
}

In [None]:
double Recall(const size_t truePos, const size_t falseNeg)
{
    return (double)truePos / (double)(truePos + falseNeg);
}

In [None]:
double F1Score(const size_t truePos, const size_t falsePos, const size_t falseNeg)
{
    double prec = Precision(truePos, falsePos);
    double rec = Recall(truePos, falseNeg);
    return 2 * (prec * rec) / (prec + rec);
}

In [None]:
void ClassificationReport(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    arma::Row<size_t> uniqs = arma::unique(yTrue);
    std::cout << std::setw(29) << "precision" << std::setw(15) << "recall" 
              << std::setw(15) << "f1-score" << std::setw(15) << "support" 
              << std::endl << std::endl;
    
    for(auto val: uniqs)
    {
        // Per-class confusion counts, treating `val` as the positive class.
        size_t truePos = arma::accu(yTrue == val && yPreds == val);
        size_t falsePos = arma::accu(yTrue != val && yPreds == val);
        size_t falseNeg = arma::accu(yTrue == val && yPreds != val);
        // Support is the number of true instances of the class.
        size_t support = truePos + falseNeg;
        
        std::cout << std::setw(15) << val
                  << std::setw(12) << std::setprecision(2) << Precision(truePos, falsePos) 
                  << std::setw(16) << std::setprecision(2) << Recall(truePos, falseNeg) 
                  << std::setw(14) << std::setprecision(2) << F1Score(truePos, falsePos, falseNeg)
                  << std::setw(16) << support
                  << std::endl;
    }
}

### Model Evaluation
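
Besides accuracy and the per-class report, the model is judged by ROC AUC. As a rough illustration of what that number summarises, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. `RocAuc` is a hypothetical function. The `RocAucPlot` helper used in this notebook is not shown here and may compute the curve differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: labels[i] is 0 or 1 and scores[i] is the predicted
// probability of class 1 (both vectors are assumed to have the same size).
// Ties between a positive and a negative score count as half a win.
// Returns 0.5 when one of the classes is absent.
double RocAuc(const std::vector<int>& labels, const std::vector<double>& scores)
{
    double wins = 0.0;
    std::size_t pairs = 0;
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        if (labels[i] != 1)
            continue;
        for (std::size_t j = 0; j < labels.size(); ++j)
        {
            if (labels[j] != 0)
                continue;
            ++pairs;
            if (scores[i] > scores[j])
                wins += 1.0;
            else if (scores[i] == scores[j])
                wins += 0.5;
        }
    }
    return pairs == 0 ? 0.5 : wins / static_cast<double>(pairs);
}
```

For labels `{0, 0, 1, 1}` and scores `{0.1, 0.4, 0.35, 0.8}`, three of the four positive/negative pairs are ranked correctly, so the sketch returns 0.75. The pairwise loop is quadratic, so a sort-based formulation is preferable for large test sets.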

In [None]:
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

In [None]:
RocAucPlot("ytest.csv", "probabilities.csv", "Part-1 Imbalanced Targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-1 Imbalanced Targets ROC AUC Curve.png").finalize();
img

### Part 2 - Modelling using Random Oversampling
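
Random oversampling balances the classes by duplicating randomly chosen minority-class rows until both classes are the same size. The sketch below shows the idea on row indices with binary 0/1 labels. `RandomOversample` is a hypothetical function, and the `resample` helper called below is not shown here and may differ in detail.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns the row indices of a balanced sample, i.e.
// every original index plus minority-class indices drawn with replacement
// until both classes have the same count. Labels are assumed to be 0 or 1.
std::vector<std::size_t> RandomOversample(const std::vector<int>& labels,
                                          unsigned seed)
{
    std::vector<std::size_t> byClass[2];
    std::vector<std::size_t> sample;
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        byClass[labels[i] == 1 ? 1 : 0].push_back(i);
        sample.push_back(i);
    }

    const std::size_t minorityClass =
        byClass[1].size() < byClass[0].size() ? 1 : 0;
    const std::vector<std::size_t>& minority = byClass[minorityClass];
    const std::vector<std::size_t>& majority = byClass[1 - minorityClass];
    if (minority.empty())
        return sample;

    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, minority.size() - 1);
    for (std::size_t k = minority.size(); k < majority.size(); ++k)
        sample.push_back(minority[pick(rng)]);
    return sample;
}
```

Because the added rows are exact copies, oversampling before the train/test split can place duplicates of the same observation in both sets, which tends to make test scores look better than they are.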

In [None]:
resample("weatherAUS.csv", "RainTomorrow", "No", "Yes", "oversample", "Date", 123);

In [None]:
// Visualize the distribution of target classes.
CountPlot("weatherAUS_oversampled.csv", "RainTomorrow", "", "Part-2 Oversampled Population");
auto img = xw::image_from_file("./plots/Part-2 Oversampled Population.png").finalize();
img

### Visualize the Missing Values

In [None]:
MissingPlot("weatherAUS_oversampled.csv", "PuBu", "Part-2 Missing values before imputation");
auto img = xw::image_from_file("./plots/Part-2 Missing values before imputation.png").finalize();
img

In [None]:
// Imputation using mean (same helper as in Part 1).
Impute("weatherAUS_oversampled.csv");

In [None]:
!cat weatherAUS_oversampled_mean_imputed.csv | sed 1d > weatherAUS_os_imp.csv

In [None]:
!cut -d, -f1 --complement weatherAUS_os_imp.csv > weatherAUS_trim2.csv

In [None]:
!rm weatherAUS_trim.csv

In [None]:
!mv weatherAUS_trim2.csv weatherAUS_trim.csv

In [None]:
arma::mat overSampled;
mlpack::data::DatasetInfo info;

info.Type(0) = mlpack::data::Datatype::categorical;
info.Type(6) = mlpack::data::Datatype::categorical;
info.Type(8) = mlpack::data::Datatype::categorical;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(20) = mlpack::data::Datatype::categorical;
info.Type(21) = mlpack::data::Datatype::categorical;

data::Load("weatherAUS_trim.csv", overSampled, info);

In [None]:
data::Save("weatherAUSEnc.csv", overSampled);

In [None]:
!sed -i '1iLocation,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow' weatherAUSEnc.csv

### Visualize Correlation

In [None]:
HeatMapPlot("weatherAUSEnc.csv", "coolwarm", "Part-2 Correlation Heatmap", 1);
auto img = xw::image_from_file("./plots/Part-2 Correlation Heatmap.png").finalize();
img

In [None]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(overSampled.row(overSampled.n_rows - 1));
overSampled.shed_row(overSampled.n_rows-1);

In [None]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [None]:
mlpack::data::Split(overSampled, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [None]:
arma::mat XtrainScaled, XtestScaled;

In [None]:
StandardScaler scale;
scale.Fit(Xtrain);
scale.Transform(Xtrain, XtrainScaled);
scale.Transform(Xtest, XtestScaled);

In [None]:
RandomForest<> rf(XtrainScaled, Ytrain, 2, 100);

In [None]:
arma::Row<size_t> output;
arma::mat probs;
rf.Classify(XtestScaled, output, probs);

In [None]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [None]:
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

In [None]:
RocAucPlot("ytest.csv", "probabilities.csv", "Part-2 Random Oversampled Targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-2 Random Oversampled Targets ROC AUC Curve.png").finalize();
img

### Part 3 - Modelling using Synthetic Minority Over Sampling Technique
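
Instead of duplicating rows, SMOTE creates new minority-class points by interpolating between a minority sample and one of its minority-class neighbours. The sketch below shows a single simplified step. It uses only the nearest neighbour and takes the interpolation factor `gap` as a parameter, whereas SMOTE normally picks one of the k nearest neighbours at random and draws the factor uniformly from [0, 1]. `SmoteSample` and `SquaredDistance` are hypothetical names.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double SquaredDistance(const Point& a, const Point& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Hypothetical helper: one simplified SMOTE step. Finds the nearest other
// minority point to minority[index] and returns a synthetic point on the
// segment between the two; gap = 0 gives the sample, gap = 1 the neighbour.
Point SmoteSample(const std::vector<Point>& minority, std::size_t index,
                  double gap)
{
    std::size_t nearest = index;
    double best = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < minority.size(); ++j)
    {
        if (j == index)
            continue;
        const double d = SquaredDistance(minority[index], minority[j]);
        if (d < best)
        {
            best = d;
            nearest = j;
        }
    }

    Point synthetic(minority[index]);
    for (std::size_t i = 0; i < synthetic.size(); ++i)
        synthetic[i] += gap * (minority[nearest][i] - minority[index][i]);
    return synthetic;
}
```

For minority points `{0, 0}`, `{2, 2}` and `{10, 10}`, the synthetic point for the first sample with `gap = 0.5` is `{1, 1}`, halfway to its nearest neighbour.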

In [None]:
// Imputation using mean (same helper as in Part 1).
Impute("weatherAUS.csv");

In [None]:
!cat weatherAUS_mean_imputed.csv | sed 1d > weatherAUS_mean_imp.csv

In [None]:
!cut -d, -f1 --complement weatherAUS_mean_imp.csv > weatherAUS_trim2.csv

In [None]:
!rm weatherAUS_trim.csv

In [None]:
!mv weatherAUS_trim2.csv weatherAUS_trim.csv

In [None]:
arma::mat smote;
mlpack::data::DatasetInfo info;

info.Type(0) = mlpack::data::Datatype::categorical;
info.Type(6) = mlpack::data::Datatype::categorical;
info.Type(8) = mlpack::data::Datatype::categorical;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(20) = mlpack::data::Datatype::categorical;
info.Type(21) = mlpack::data::Datatype::categorical;

data::Load("weatherAUS_trim.csv", smote, info);

In [None]:
mlpack::data::Save("smote_in.csv", smote);

In [None]:
resample("smote_in.csv", "RainTomorrow", "No", "Yes", "smote", "Date", 123);

In [None]:
!cat smote_in_smotesampled.csv | sed 1d > smote_in_smotesampled_woh.csv

In [None]:
arma::mat smoteEnc;
mlpack::data::DatasetInfo info;

info.Type(0) = mlpack::data::Datatype::categorical;
info.Type(6) = mlpack::data::Datatype::categorical;
info.Type(8) = mlpack::data::Datatype::categorical;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(20) = mlpack::data::Datatype::categorical;
info.Type(21) = mlpack::data::Datatype::categorical;

data::Load("smote_in_smotesampled_woh.csv", smoteEnc, info);

In [None]:
data::Save("weatherAUSEnc.csv", smoteEnc);

In [None]:
!sed -i '1iLocation,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow' weatherAUSEnc.csv

In [None]:
CountPlot("weatherAUSEnc.csv", "RainTomorrow", "", "Part-3 Distribution of target class");
auto img = xw::image_from_file("./plots/Part-3 Distribution of target class.png").finalize();
img

### Visualize Correlation

In [None]:
HeatMapPlot("weatherAUSEnc.csv", "coolwarm", "Part-3 Correlation Heatmap", 1);
auto img = xw::image_from_file("./plots/Part-3 Correlation Heatmap.png").finalize();
img

In [None]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(smoteEnc.row(smoteEnc.n_rows - 1));
smoteEnc.shed_row(smoteEnc.n_rows-1);

In [None]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [None]:
mlpack::data::Split(smoteEnc, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [None]:
arma::mat XtrainScaled, XtestScaled;

In [None]:
StandardScaler scale;
scale.Fit(Xtrain);
scale.Transform(Xtrain, XtrainScaled);
scale.Transform(Xtest, XtestScaled);

In [None]:
RandomForest<> rf(XtrainScaled, Ytrain, 2, 100);

In [None]:
arma::Row<size_t> output;
arma::mat probs;
rf.Classify(XtestScaled, output, probs);

In [None]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [None]:
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

In [None]:
RocAucPlot("ytest.csv", "probabilities.csv", "Part-3 SMOTE ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-3 SMOTE ROC AUC Curve.png").finalize();
img

### Part 4 - Modelling using Random Undersampling
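
Random undersampling goes the other way and discards randomly chosen majority-class rows until both classes are the same size. The sketch below shows the idea on row indices with binary 0/1 labels. `RandomUndersample` is a hypothetical function, and the `resample` helper called below is not shown here and may differ in detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns row indices of a balanced sample in which
// each class keeps as many randomly chosen rows as the smaller class has.
// Labels are assumed to be 0 or 1.
std::vector<std::size_t> RandomUndersample(const std::vector<int>& labels,
                                           unsigned seed)
{
    std::vector<std::size_t> byClass[2];
    for (std::size_t i = 0; i < labels.size(); ++i)
        byClass[labels[i] == 1 ? 1 : 0].push_back(i);

    const std::size_t keep = std::min(byClass[0].size(), byClass[1].size());
    std::mt19937 rng(seed);
    std::vector<std::size_t> sample;
    for (std::vector<std::size_t>& group : byClass)
    {
        // Shuffle the class, then keep only its first `keep` indices.
        std::shuffle(group.begin(), group.end(), rng);
        sample.insert(sample.end(), group.begin(),
                      group.begin() + static_cast<std::ptrdiff_t>(keep));
    }
    return sample;
}
```

The trade-off is that the model trains on fewer observations, since every discarded majority-class row is information lost.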

In [None]:
resample("weatherAUS.csv", "RainTomorrow", "No", "Yes", "undersample", "Date", 123);

In [None]:
// Visualize the distribution of target classes.
CountPlot("weatherAUS_undersampled.csv", "RainTomorrow", "", "Part-4 Undersampled Population");
auto img = xw::image_from_file("./plots/Part-4 Undersampled Population.png").finalize();
img

In [None]:
MissingPlot("weatherAUS_undersampled.csv", "PuBu", "Part-4 Missing values pre-imputation");
auto img = xw::image_from_file("./plots/Part-4 Missing values pre-imputation.png").finalize();
img

In [None]:
// Imputation using mean (same helper as in Part 1).
Impute("weatherAUS_undersampled.csv");

In [None]:
!cat weatherAUS_undersampled_mean_imputed.csv | sed 1d > weatherAUS_us_imp.csv

In [None]:
!cut -d, -f1 --complement weatherAUS_us_imp.csv > weatherAUS_trim2.csv

In [None]:
!rm weatherAUS_trim.csv

In [None]:
!mv weatherAUS_trim2.csv weatherAUS_trim.csv

In [None]:
arma::mat underSampled;
mlpack::data::DatasetInfo info;

info.Type(0) = mlpack::data::Datatype::categorical;
info.Type(6) = mlpack::data::Datatype::categorical;
info.Type(8) = mlpack::data::Datatype::categorical;
info.Type(9) = mlpack::data::Datatype::categorical;
info.Type(20) = mlpack::data::Datatype::categorical;
info.Type(21) = mlpack::data::Datatype::categorical;

data::Load("weatherAUS_trim.csv", underSampled, info);

In [None]:
data::Save("weatherAUSEnc.csv", underSampled);

In [None]:
!sed -i '1iLocation,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow' weatherAUSEnc.csv

In [None]:
HeatMapPlot("weatherAUSEnc.csv", "coolwarm", "Part-4 Correlation Heatmap", 1);
auto img = xw::image_from_file("./plots/Part-4 Correlation Heatmap.png").finalize();
img

In [None]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(underSampled.row(underSampled.n_rows - 1));
underSampled.shed_row(underSampled.n_rows-1);

In [None]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [None]:
mlpack::data::Split(underSampled, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [None]:
arma::mat XtrainScaled, XtestScaled;

In [None]:
StandardScaler scale;
scale.Fit(Xtrain);
scale.Transform(Xtrain, XtrainScaled);
scale.Transform(Xtest, XtestScaled);

In [None]:
RandomForest<> rf(XtrainScaled, Ytrain, 2, 100);

In [None]:
arma::Row<size_t> output;
arma::mat probs;
rf.Classify(XtestScaled, output, probs);

In [None]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [None]:
std::cout << "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

In [None]:
RocAucPlot("ytest.csv", "probabilities.csv", "Part-4 Random Undersampled targets ROC AUC Curve");
auto img = xw::image_from_file("./plots/Part-4 Random Undersampled targets ROC AUC Curve.png").finalize();
img