### Predicting Chances of Admission for Graduate Programs in Universities.

### Our Objective:
* Determine the most important factors that contribute to a student's chance of admission, and select the most accurate model to predict the probability of admission.
* The predicted output gives students a fair idea of their admission chances at a particular university.

### Getting to know the dataset!
The GA dataset contains various parameters that are important for admission into graduate programs at universities. The features included are:
* GRE Scores ( out of 340 ).
* TOEFL Scores ( out of 120 ).
* University Rating ( out of 5 ).
* Statement of Purpose and Letter of Recommendation Strength ( out of 5 ).
* Undergraduate GPA ( out of 10 ).
* Research Experience ( either 0 or 1 ).
* Chance of Admit ( ranging from 0 to 1 ).

### Approach
* Explore our data to check for imbalance and missing values.
* Explore the correlation between various features in the dataset.
* Split the preprocessed dataset into train and test sets.
* Create and train an AdaBoost classifier using mlpack.
* Evaluate the model on the test set using metrics such as accuracy and ROC AUC to quantify its performance.

In [None]:
!wget -q http://datasets.mlpack.org/Admission_Predict.csv

In [1]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/decision_tree/decision_tree.hpp>
#include <mlpack/methods/adaboost/adaboost.hpp>

In [2]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [3]:
using namespace mlpack;

In [4]:
using namespace mlpack::data;

In [5]:
using namespace mlpack::tree;

In [6]:
using namespace mlpack::adaboost;

In [7]:
// Utility functions for evaluation metrics.
double Accuracy(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    const size_t correct = arma::accu(yPreds == yTrue);
    return (double)correct / (double)yTrue.n_elem;
}

In [8]:
double Precision(const size_t truePos, const size_t falsePos)
{
    return (double)truePos / (double)(truePos + falsePos);
}

In [9]:
double Recall(const size_t truePos, const size_t falseNeg)
{
    return (double)truePos / (double)(truePos + falseNeg);
}

In [10]:
double F1Score(const size_t truePos, const size_t falsePos, const size_t falseNeg)
{
    double prec = Precision(truePos, falsePos);
    double rec = Recall(truePos, falseNeg);
    return 2 * (prec * rec) / (prec + rec);
}

In [11]:
void ClassificationReport(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    arma::Row<size_t> uniqs = arma::unique(yTrue);
    std::cout << std::setw(29) << "precision" << std::setw(15) << "recall" 
              << std::setw(15) << "f1-score" << std::setw(15) << "support" 
              << std::endl << std::endl;
    
    for(auto val: uniqs)
    {
        // Count outcomes for the current class, treating it as the positive class.
        size_t truePos = arma::accu(yTrue == val && yPreds == val);
        size_t falsePos = arma::accu(yTrue != val && yPreds == val);
        size_t falseNeg = arma::accu(yTrue == val && yPreds != val);
        // Support is the number of true samples of this class.
        size_t support = truePos + falseNeg;

        std::cout << std::setw(15) << val
                  << std::setw(12) << std::setprecision(2) << Precision(truePos, falsePos) 
                  << std::setw(16) << std::setprecision(2) << Recall(truePos, falseNeg) 
                  << std::setw(14) << std::setprecision(2) << F1Score(truePos, falsePos, falseNeg)
                  << std::setw(16) << support
                  << std::endl;
    }
}

In [12]:
! mkdir -p data && sed 1d Admission_Predict.csv > ./data/Admission_Predict_trim.csv

In [13]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat gradData;
data::Load("./data/Admission_Predict_trim.csv", gradData);

In [14]:
// Examine the first 6 samples (columns 0 through 5) from our dataset.
std::cout.precision(4);
std::cout.setf(std::ios::fixed);
std::cout << std::setw(13) << "GRE Score" << std::setw(13) << "TOEFL Score" 
          << std::setw(18) << "University Rating" << std::setw(5) << "SOP" 
          << std::setw(13) << "LOR" << std::setw(13) << "CGPA" 
          << std::setw(15) << "Research" << std::setw(17) << "Chance of Admit" 
          << std::endl;
std::cout << gradData.submat(0, 0, gradData.n_rows-1, 5).t() << std::endl;

    GRE Score  TOEFL Score University Rating  SOP          LOR         CGPA       Research  Chance of Admit
   3.3700e+02   1.1800e+02   4.0000e+00   4.5000e+00   4.5000e+00   9.6500e+00   1.0000e+00   9.2000e-01
   3.2400e+02   1.0700e+02   4.0000e+00   4.0000e+00   4.5000e+00   8.8700e+00   1.0000e+00   7.6000e-01
   3.1600e+02   1.0400e+02   3.0000e+00   3.0000e+00   3.5000e+00   8.0000e+00   1.0000e+00   7.2000e-01
   3.2200e+02   1.1000e+02   3.0000e+00   3.5000e+00   2.5000e+00   8.6700e+00   1.0000e+00   8.0000e-01
   3.1400e+02   1.0300e+02   2.0000e+00   2.0000e+00   3.0000e+00   8.2100e+00            0   6.5000e-01
   3.3000e+02   1.1500e+02   5.0000e+00   4.5000e+00   3.0000e+00   9.3400e+00   1.0000e+00   9.0000e-01



In [15]:
// Plot the correlation matrix as heatmap.
HeatMapPlot("Admission_Predict.csv", "coolwarm", "Correlation Heatmap", 1, 12, 12);
auto img = xw::image_from_file("./plots/Correlation Heatmap.png").finalize();
img

As we can observe from the above heatmap, there is high correlation between the following features:

* Chance of Admit & GRE Score.
* Chance of Admit & TOEFL Score.
* Chance of Admit & CGPA.
* GRE & TOEFL Score.

We can infer that GRE Score, TOEFL Score and CGPA are strong predictors of the chance of admit, since it varies almost linearly with each of them.
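
Each heatmap cell summarizes how strongly two columns move together. As a rough illustration, and assuming the heatmap reports Pearson correlation coefficients (the plotting helper's internals are not shown in this notebook), the sketch below computes that coefficient for two columns in plain C++. `PearsonCorrelation` is a hypothetical helper, not part of mlpack or the notebook utilities.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equally sized samples.
// Returns 0.0 when the sizes differ, the input is empty, or either
// sample has zero variance (the coefficient is undefined in that case).
double PearsonCorrelation(const std::vector<double>& x,
                          const std::vector<double>& y)
{
    if (x.size() != y.size() || x.empty())
        return 0.0;

    // Compute the mean of each sample.
    double sumX = 0.0, sumY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        sumX += x[i];
        sumY += y[i];
    }
    const double meanX = sumX / static_cast<double>(x.size());
    const double meanY = sumY / static_cast<double>(y.size());

    // Accumulate the covariance and both variances (unnormalized).
    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }

    if (varX == 0.0 || varY == 0.0)
        return 0.0;
    return cov / std::sqrt(varX * varY);
}
```

A value close to 1 means the two columns rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship.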

### Exploratory Data Analysis
#### Univariate Analysis

In [17]:
HistPlot("Admission_Predict.csv", "Chance of Admit", "Distribution of Chance of Admit", 10, 6);
auto img = xw::image_from_file("./plots/Distribution of Chance of Admit.png").finalize();
img

* Most students have a chance of admit above 70%.
* More than 50% of students have a chance of admit above 72%.
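
Statements of the form "more than 50% of students are above some value" are statements about the median of a column. The sketch below shows one way to check them in plain C++; `Median` and `ShareAbove` are hypothetical helpers introduced for illustration, not notebook or mlpack functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a sample. The vector is taken by value so the caller's data
// is left unsorted. Returns 0.0 for an empty sample.
double Median(std::vector<double> values)
{
    if (values.empty())
        return 0.0;
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    // Odd size: middle element. Even size: mean of the two middle elements.
    return values.size() % 2 == 1 ? values[mid]
                                  : 0.5 * (values[mid - 1] + values[mid]);
}

// Fraction of values strictly greater than the threshold.
double ShareAbove(const std::vector<double>& values, double threshold)
{
    if (values.empty())
        return 0.0;
    const auto count = std::count_if(values.begin(), values.end(),
        [threshold](double v) { return v > threshold; });
    return static_cast<double>(count) / static_cast<double>(values.size());
}
```

For the six samples printed earlier, whose chance of admit values are 0.92, 0.76, 0.72, 0.80, 0.65 and 0.90, `Median` gives 0.78 and `ShareAbove` with a threshold of 0.70 gives 5/6. The bullets above refer to the full dataset, so their numbers differ.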

In [19]:
HistPlot("Admission_Predict.csv", "GRE Score", "GRE Score Distribution", 10, 6);
auto img = xw::image_from_file("./plots/GRE Score Distribution.png").finalize();
img

* A large number of students scored between 308 and 325 on the GRE.
* More than 50% of students scored above 316 on the GRE.

In [20]:
HistPlot("Admission_Predict.csv", "TOEFL Score", "TOEFL Score Distribution", 10, 6);
auto img = xw::image_from_file("./plots/TOEFL Score Distribution.png").finalize();
img

* A large number of students scored between 103 and 112 on the TOEFL.
* More than 50% of students scored above 107 on the TOEFL.

In [16]:
CountPlot("Admission_Predict.csv", "University Rating", "", "Distribution of University Rating", 8, 6);
auto img = xw::image_from_file("./plots/Distribution of University Rating.png").finalize();
img

* From the above plot we can infer that students from universities rated 3 make up the largest group among those who applied to the MS program.
* More than 50% of the university ratings in the dataset are 3 or above.

In [21]:
CountPlot("Admission_Predict.csv", "SOP", "", "Distribution of SOP", 8, 6);
auto img = xw::image_from_file("./plots/Distribution of SOP.png").finalize();
img

* From the above plot we can infer that students with an SOP score of 4 are the most numerous.
* A large number of students have SOP scores between 2.5 and 4.
* More than 50% of students have SOP scores of 3.5 and above.

In [22]:
CountPlot("Admission_Predict.csv", "LOR", "", "Distribution of LOR", 8, 6);
auto img = xw::image_from_file("./plots/Distribution of LOR.png").finalize();
img

* From the above plot we can infer that students with an LOR score of 3 are the most numerous.
* A large number of students have LOR scores between 3 and 4.
* More than 50% of students have LOR scores of 3.5 and above.

In [23]:
HistPlot("Admission_Predict.csv", "CGPA", "CGPA Score Distribution", 10, 6);
auto img = xw::image_from_file("./plots/CGPA Score Distribution.png").finalize();
img

* A large number of students have a CGPA between 8.0 and 9.0.
* More than 50% of students have CGPA of 8.5 and above.

In [24]:
CountPlot("Admission_Predict.csv", "Research", "", "Distribution of Researchers", 6, 6);
auto img = xw::image_from_file("./plots/Distribution of Researchers.png").finalize();
img

* From the above figure we can infer that most students have some research experience.

#### Bivariate Analysis

In [32]:
LmPlot("Admission_Predict.csv", "GRE Score", "Chance of Admit", "GRE Score vs Chance of Admit");
auto img = xw::image_from_file("./plots/GRE Score vs Chance of Admit.png").finalize();
img

* The higher the GRE score, the higher the chance of admit.
* From the above plot it is clear that most students score above 310 on the GRE. The highest scores fall in the 320-340 range.

In [33]:
LmPlot("Admission_Predict.csv", "TOEFL Score", "Chance of Admit", "TOEFL Score vs Chance of Admit");
auto img = xw::image_from_file("./plots/TOEFL Score vs Chance of Admit.png").finalize();
img

* Students with a higher TOEFL score have a greater chance of admit.

In [6]:
LinePlot("Admission_Predict.csv", "University Rating", "Chance of Admit", "Rating vs Admission");
auto img = xw::image_from_file("./plots/Rating vs Admission.png").finalize();
img

Students from universities rated 5 have an average chance of admit of 88.8%, while students from universities rated 1 average only 56.2%.
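
Per-rating averages like these come from grouping the samples by rating and taking the mean of each group. A minimal sketch in plain C++ follows; `MeanByGroup` is a hypothetical helper, and it does not reproduce how the plotting utility computes its line.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Mean of `values` for each distinct key in `groups`.
// groups[i] is the group (for example the university rating) of values[i].
std::map<int, double> MeanByGroup(const std::vector<int>& groups,
                                  const std::vector<double>& values)
{
    // For every group keep a running (sum, count) pair.
    std::map<int, std::pair<double, std::size_t>> totals;
    const std::size_t n = std::min(groups.size(), values.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        totals[groups[i]].first += values[i];
        totals[groups[i]].second += 1;
    }

    // Turn each (sum, count) pair into a mean.
    std::map<int, double> means;
    for (const auto& entry : totals)
    {
        means[entry.first] =
            entry.second.first / static_cast<double>(entry.second.second);
    }
    return means;
}
```

Applied to the six samples printed earlier (ratings 4, 4, 3, 3, 2, 5 with chances 0.92, 0.76, 0.72, 0.80, 0.65, 0.90), the group means are 0.84 for rating 4, 0.76 for rating 3, 0.65 for rating 2 and 0.90 for rating 5.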

In [7]:
LmPlot("Admission_Predict.csv", "SOP", "Chance of Admit", "SOP vs Chance of Admit");
auto img = xw::image_from_file("./plots/SOP vs Chance of Admit.png").finalize();
img

* Students with a higher Statement of Purpose (SOP) score have a better chance of admit.

In [10]:
LmPlot("Admission_Predict.csv", "LOR", "Chance of Admit", "LOR vs Chance of Admit");
auto img = xw::image_from_file("./plots/LOR vs Chance of Admit.png").finalize();
img

* Students with a higher Letter of Recommendation (LOR) score have a better chance of admit.

In [13]:
ScatterPlot("Admission_Predict.csv", "CGPA", "Chance of Admit", "", "", "", "", "", "", "CGPA vs Chance of Admit");
auto img = xw::image_from_file("./plots/CGPA vs Chance of Admit.png").finalize();
img

* Students with a high CGPA tend to have a higher chance of admit than those with a low CGPA.

In [14]:
ScatterPlot("Admission_Predict.csv", "University Rating", "CGPA", "", "", "", "", "", "", "University Rating vs CGPA");
auto img = xw::image_from_file("./plots/University Rating vs CGPA.png").finalize();
img

University rating tends to increase with CGPA.

In [None]:
// Split the data into features (X) and target (y); the target is the last row.
// Binarize the target: a chance of admit above 0.8 becomes class 1, otherwise class 0.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(gradData.row(gradData.n_rows - 1) > 0.8);
// Drop the target row from the loaded matrix so that only features remain.
gradData.shed_row(gradData.n_rows - 1);

### Train Test Split
The dataset has to be split into a training set and a test set. The dataset has 400 observations and the test ratio is 25%, so the test set has 25% * 400 = 100 observations and the training set has the remaining 300.
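
The arithmetic above can be sketched as a shuffled index split in plain C++. This is an illustration under stated assumptions, not mlpack's `Split` implementation: `TestSetSize` and `SplitIndices` are hypothetical names, the seed parameter is an addition for reproducibility, and truncating the product to an integer is an assumed rounding rule.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Number of test observations for a dataset size and test ratio.
// The product is truncated toward zero (an assumption of this sketch).
std::size_t TestSetSize(std::size_t numObservations, double testRatio)
{
    return static_cast<std::size_t>(numObservations * testRatio);
}

// Shuffle the indices 0..n-1 and split them into (train, test) index lists.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
SplitIndices(std::size_t numObservations, double testRatio, unsigned seed)
{
    std::vector<std::size_t> indices(numObservations);
    std::iota(indices.begin(), indices.end(), 0);

    // Shuffle so that the test set is a random sample of the data.
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    const std::size_t testSize = TestSetSize(numObservations, testRatio);
    const std::size_t trainSize = numObservations - testSize;

    // The first trainSize shuffled indices form the training set,
    // the remaining testSize indices form the test set.
    std::vector<std::size_t> train(indices.begin(),
                                   indices.begin() + trainSize);
    std::vector<std::size_t> test(indices.begin() + trainSize,
                                  indices.end());
    return {train, test};
}
```

Calling `SplitIndices(400, 0.25, 42)` yields 300 training indices and 100 test indices, matching the counts worked out above.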

In [None]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(gradData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

### Training the AdaBoost Classifier model
* Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
* AdaBoost is a boosting approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.
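
To make the idea of combining weak rules concrete, here is a toy sketch of binary AdaBoost with one-feature threshold stumps in plain C++. It is not mlpack's implementation and handles only a single feature with labels +1 and -1; `Stump`, `TrainAdaBoost`, `Classify` and `TrainingErrors` are hypothetical names chosen for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One-feature decision stump: predicts `polarity` when x > threshold and
// -polarity otherwise. Labels are +1 / -1.
struct Stump
{
    double threshold;
    int polarity;
    double alpha; // Weight of this stump in the final vote.

    int Predict(double x) const { return x > threshold ? polarity : -polarity; }
};

// Train a toy AdaBoost ensemble. Assumes x and y have the same size.
std::vector<Stump> TrainAdaBoost(const std::vector<double>& x,
                                 const std::vector<int>& y,
                                 std::size_t rounds)
{
    const std::size_t n = x.size();
    std::vector<double> w(n, 1.0 / n); // Start with uniform sample weights.
    std::vector<Stump> ensemble;
    const double eps = 1e-10; // Keeps alpha finite for a perfect stump.

    for (std::size_t r = 0; r < rounds; ++r)
    {
        // Weak learner: pick the stump with the lowest weighted error.
        Stump best{0.0, 1, 0.0};
        double bestErr = std::numeric_limits<double>::max();
        for (double t : x)
        {
            for (int p : {1, -1})
            {
                const Stump s{t, p, 0.0};
                double err = 0.0;
                for (std::size_t i = 0; i < n; ++i)
                    if (s.Predict(x[i]) != y[i])
                        err += w[i];
                if (err < bestErr)
                {
                    bestErr = err;
                    best = s;
                }
            }
        }

        // A stump that is no better than chance cannot help; stop.
        if (bestErr >= 0.5)
            break;

        // Stumps with lower weighted error get a larger say in the vote.
        best.alpha = 0.5 * std::log((1.0 - bestErr + eps) / (bestErr + eps));
        ensemble.push_back(best);
        if (bestErr < eps)
            break; // Perfect stump: nothing left to correct.

        // Raise the weight of misclassified samples, lower the rest,
        // then renormalize so the weights sum to one.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
        {
            w[i] *= std::exp(-best.alpha * y[i] * best.Predict(x[i]));
            sum += w[i];
        }
        for (double& wi : w)
            wi /= sum;
    }
    return ensemble;
}

// Final prediction: sign of the alpha-weighted vote of all stumps.
int Classify(const std::vector<Stump>& ensemble, double x)
{
    double score = 0.0;
    for (const Stump& s : ensemble)
        score += s.alpha * s.Predict(x);
    return score >= 0.0 ? 1 : -1;
}

// Number of training samples the ensemble misclassifies.
std::size_t TrainingErrors(const std::vector<Stump>& ensemble,
                           const std::vector<double>& x,
                           const std::vector<int>& y)
{
    std::size_t errors = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (Classify(ensemble, x[i]) != y[i])
            ++errors;
    return errors;
}
```

On a made-up input with x = 0..9 and labels +1, +1, +1, -1, -1, -1, +1, +1, +1, -1, the best single stump misclassifies three points, while three boosting rounds classify all ten correctly. The notebook's model instead uses mlpack's `AdaBoost` with `ID3DecisionStump` on all seven features.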

In [None]:
// Create a DecisionStump with two classes.
ID3DecisionStump ds(Xtrain, Ytrain, 2);

In [None]:
// Create and train an AdaBoost Classifier with DecisionStump as weak learner.
AdaBoost<ID3DecisionStump> ab(Xtrain, Ytrain, 2, ds, 50, 1e-10);

### Making Predictions on Test set

In [None]:
// Predict the values for test data using previously trained model as input.
arma::Row<size_t> output;
arma::mat probs;
ab.Classify(Xtest, output, probs);

In [None]:
// Save predicted probabilities and ground truth as csv for generating ROC AUC curve.
data::Save("./data/probabilities.csv", probs);
data::Save("./data/ytest.csv", Ytest);

### Evaluation metrics

* True Positive - The actual value was true & the model predicted true.
* False Positive - The actual value was false & the model predicted true, Type I error.
* True Negative - The actual value was false & the model predicted false.
* False Negative - The actual value was true & the model predicted false, Type II error.
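
The four outcomes above can be counted directly from predicted and true labels. The sketch below does so for binary labels in plain C++, treating 1 as the positive class; `ConfusionCounts` and `CountOutcomes` are hypothetical names, separate from the notebook's own utility functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts of the four possible outcomes for a binary classifier.
struct ConfusionCounts
{
    std::size_t truePos = 0;
    std::size_t falsePos = 0;
    std::size_t trueNeg = 0;
    std::size_t falseNeg = 0;
};

// Compare predictions with ground truth; label 1 is the positive class
// and any other label is treated as negative.
ConfusionCounts CountOutcomes(const std::vector<int>& predicted,
                              const std::vector<int>& actual)
{
    ConfusionCounts counts;
    const std::size_t n = std::min(predicted.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        const bool predictedPositive = predicted[i] == 1;
        const bool actuallyPositive = actual[i] == 1;
        if (predictedPositive && actuallyPositive)
            ++counts.truePos;
        else if (predictedPositive && !actuallyPositive)
            ++counts.falsePos;  // Type I error.
        else if (!predictedPositive && !actuallyPositive)
            ++counts.trueNeg;
        else
            ++counts.falseNeg;  // Type II error.
    }
    return counts;
}
```

For example, predictions 1, 1, 0, 0, 1, 0 against true labels 1, 0, 0, 1, 1, 0 give two true positives, one false positive, two true negatives and one false negative.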

`Accuracy`: is a metric that generally describes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio between the number of correct predictions to the total number of predictions.

$$Accuracy = \frac{True_{positive} + True_{negative}}{True_{positive} + True_{negative} + False_{positive} + False_{negative}}$$

`Precision`: is calculated as the ratio between the number of positive samples correctly classified to the total number of samples classified as Positive. The precision measures the model's accuracy in classifying a sample as positive.

$$Precision = \frac{True_{positive}}{True_{positive} + False_{positive}}$$

`Recall`: is calculated as the ratio between the number of positive samples correctly classified as Positive to the total number of Positive samples. The recall measures the model's ability to detect Positive samples. The higher the recall, the more positive samples detected.

$$Recall = \frac{True_{positive}}{True_{positive} + False_{negative}}$$

* The decision of whether to use precision or recall depends on the type of problem being solved.
* If the goal is to detect all positive samples, use recall.
* Use precision when false positives are costly, that is, when wrongly classifying a sample as Positive must be avoided.
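
As a worked example with hypothetical counts (these are not results from this notebook), suppose a test set of 100 samples yields $True_{positive} = 40$, $False_{positive} = 10$, $True_{negative} = 45$ and $False_{negative} = 5$. Plugging these into the formulas above gives:

$$Accuracy = \frac{40 + 45}{40 + 45 + 10 + 5} = 0.85$$

$$Precision = \frac{40}{40 + 10} = 0.80$$

$$Recall = \frac{40}{40 + 5} \approx 0.89$$

Here recall is higher than precision: the model misses few actual positives (5 false negatives) but raises false alarms more often (10 false positives).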

* The ROC graph plots the True Positive rate on the y-axis against the False Positive rate on the x-axis.
* The area under the ROC curve (ROC AUC) is the primary metric used here to judge the classifier: the higher the value, the better the model performs.
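
The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain C++. `RocAuc` is a hypothetical helper, the pairwise loop is chosen for clarity rather than speed, and it is not the routine behind `RocAucPlot`.

```cpp
#include <cstddef>
#include <vector>

// ROC AUC from predicted scores and binary labels (1 = positive, 0 = negative).
// Counts, over all positive/negative pairs, how often the positive sample
// has the higher score; ties contribute one half.
// Returns 0.5 as a neutral placeholder when either class is missing,
// since the AUC is undefined in that case.
double RocAuc(const std::vector<double>& scores, const std::vector<int>& labels)
{
    double wins = 0.0;
    std::size_t pairs = 0;
    for (std::size_t i = 0; i < scores.size() && i < labels.size(); ++i)
    {
        if (labels[i] != 1)
            continue;
        for (std::size_t j = 0; j < scores.size() && j < labels.size(); ++j)
        {
            if (labels[j] != 0)
                continue;
            ++pairs;
            if (scores[i] > scores[j])
                wins += 1.0;
            else if (scores[i] == scores[j])
                wins += 0.5;
        }
    }
    return pairs == 0 ? 0.5 : wins / static_cast<double>(pairs);
}
```

With made-up scores 0.9, 0.8, 0.4, 0.3 and labels 1, 0, 1, 0, three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. A perfect ranking gives 1.0 and a random one gives about 0.5.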

In [None]:
// Classification report.
std::cout <<  "Accuracy: " << Accuracy(output, Ytest) << std::endl;
ClassificationReport(output, Ytest);

In [None]:
// Plot ROC AUC Curve to visualize the performance of the model on TP & FP.
RocAucPlot("./data/ytest.csv", "./data/probabilities.csv", "ROC AUC Curve");
auto img = xw::image_from_file("./plots/ROC AUC Curve.png").finalize();
img

### Conclusion
From the above ROC AUC curve, we can infer that our AdaBoost model performs well at predicting student admissions. There is still room for improvement. Feel free to play around with the hyperparameters, split ratio, admission threshold, etc.