<div id="container" style="position:relative;">
<div style="float:left">

***Kazi Shahid***

***BrainStation Data Science Diploma Candidate***

***Capstone Project***

=============================================================

***Project SteamBuzz: Will Our Game Create a Buzz in the Steam community?***

***Part 4: Sentiment Analysis - Supervised Machine Learning Modelling Overview***
</div>
<div style="position:relative; float:right"><img style="height:100px" src ="https://i.ibb.co/mcvpL4Z/Steam-Buzz-logo.png" />
</div>
</div>

---
# Overview

In this part, we discuss the key concepts driving the primary component of the project: the sentiment analysis. The discussion covers a high-level overview of sentiment analysis, supervised machine learning ("ML") for sentiment analysis, the ML classifiers selected for this project, the performance measures considered and those selected, and hyperparameters and their tuning process.

---
# Sentiment Analysis

Sentiment analysis is [defined](https://towardsdatascience.com/sentiment-analysis-using-logistic-regression-and-naive-bayes-16b806eb4c4b) as the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. It is also sometimes referred to as opinion mining or emotion AI.

---
# Supervised Machine Learning for Sentiment Analysis

[Machine learning](https://en.wikipedia.org/wiki/Machine_learning) is the study of computer algorithms that can improve automatically through experience and by the use of data. It is also recognized as the use and development of computer systems that are able to learn and adapt without following explicit instructions. It does so by using algorithms and statistical models with a view to analyzing and drawing inferences from patterns in data.

A [popular definition](https://mnassar.github.io/deeplearninghandbook/slides/05_ml.pdf) is that, in ML, "**A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.**" *(Tom M. Mitchell, 1997)*

In supervised ML, a mathematical model is built from a set of data that contains both the inputs and the desired outputs. Supervised learning has a few categories, such as regression, classification, and active learning. Sentiment analysis falls under the **classification** category, where the outputs are restricted to a limited set of values. By training the classifier on how the input variables are associated with the output values, we aim to enable it to predict the output values for unseen inputs. Depending on how many unique values this limited set contains, the problem is either a binary classification (i.e., two unique output values) or a multi-class classification (i.e., more than two unique output values).

---
# Selection of Machine Learning Classifiers for Sentiment Analysis

The sentiment analysis part of the project is essentially a binary classification problem, where we are trying to predict whether the text of a review points to a "positive" or a "negative" sentiment.

It is widely noted that ["traditional machine learning methods such as Naïve Bayes, Logistic Regression and Support Vector Machines (SVM) are widely used for large-scale sentiment analysis because they scale well."](https://underthehood.meltwater.com/blog/2019/08/22/deep-learning-models-for-sentiment-analysis/)

We therefore chose the ML classifiers listed below for the sentiment analysis part of this project:
- Logistic Regression
- Random Forest Classifier
- Naive Bayes Classifier
- Support Vector Machine

Below are overviews of each of the chosen ML classifiers and our rationale for choosing them for this project.


## Logistic Regression

### Overview

Logistic regression is [defined](https://www.sciencedirect.com/topics/computer-science/logistic-regression) as a process of modelling the probability of a discrete outcome given an input variable. Most commonly, logistic regression classifiers model a binary outcome, i.e., outputs that constitute two unique values such as positive/negative, true/false, yes/no, etc.

Logistic regression is a fundamental classifier and a widely popular baseline due to its simplicity. However, that simplicity also means it does not always perform well on more complex problems. To account for this, we will also employ some more complex classification techniques, as described further below.

### Rationale Behind Choosing Logistic Regression Classifier for this Project

- Logistic regression is one of the most widely used classifiers for binary classification problems.


- Logistic regression is among the simplest classifiers, making it a prime baseline classifier candidate.


- Insights can be readily derived from a logistic regression, thanks to its easily-interpretable coefficients.


## Random Forest Classifier

### Overview

A Random Forest ("RF") model is [defined](https://dl.acm.org/doi/10.1145/3357384.3357891) as a set of decision trees, each of which is trained using random subsets of features. It uses [bagging and feature randomness](https://askinglot.com/is-random-forest-good-for-text-classification) when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Given an instance, the RF's prediction is obtained via majority voting over the predictions of all the trees in the forest.

### Rationale Behind Choosing Random Forest Classifier for this Project

- RF classifiers are suitable for dealing with the high dimensional noisy data in text classification.


- RF classifiers' tree-based algorithm suits text classification problems: it builds decision trees (DTs) on randomly selected data samples, obtains a prediction from each DT, and selects the final prediction by means of voting. It also provides a fairly good indicator of feature importance.


## Naive Bayes Classifier

### Overview

Naive Bayes ("NB") classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (hence "naïve") independence assumptions between the features.

The NB classifier, more specifically the **multinomial NB** classifier, is mostly used in NLP. According to [its documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): "***The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.***"

The NB is called "naïve" due to the [assumptions it makes](https://towardsdatascience.com/sentiment-analysis-using-logistic-regression-and-naive-bayes-16b806eb4c4b):

- **Independence Assumption:** NB assumes that the words in a sentence / document are independent of one another given the class, so each word contributes to the prediction on its own, regardless of word order or context.

- **Relative Frequencies:** In reality, there can be more positive-sentiment-bearing reviews than negative ones. However, when cleaning datasets we tend to rid the dataset of class imbalance (e.g., gravitating towards a 50/50 ratio of classes in a binary classification problem). We need to be cognizant of the fact that the real-world class ratio may differ from this balanced one.
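To make both points concrete, here is a minimal sketch using scikit-learn's `MultinomialNB` on word counts. The toy reviews are invented for illustration and are deliberately imbalanced (four positive, two negative) so that the class priors learned from relative frequencies are visible.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, deliberately imbalanced toy data: 4 positive (1), 2 negative (0).
reviews = [
    "great game love it",
    "love the great story",
    "fun and great",
    "great fun love it",
    "terrible game hate it",
    "boring and terrible",
]
labels = [1, 1, 1, 1, 0, 0]

# Word counts are the discrete features multinomial NB expects.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

nb = MultinomialNB()  # default alpha=1.0 applies additive (Laplace) smoothing
nb.fit(X, labels)

# Class priors are estimated from the relative class frequencies in the training data.
priors = np.exp(nb.class_log_prior_)
print(dict(zip(nb.classes_, priors)))

# Each word contributes independently to the class score (the "naive" assumption).
pred = nb.predict(vectorizer.transform(["great fun"]))
print(pred)
```

With this data the learned priors are 1/3 for the negative class and 2/3 for the positive class, which shows how the class ratio in the training set feeds directly into the predictions.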


### Rationale Behind Choosing a Naive Bayes Classifier for this Project

[Some key rationale](https://www.upgrad.com/blog/multinomial-naive-bayes-explained/) behind choosing a Naive Bayes algorithm for a NLP project such as sentiment analysis:

- A NB classifier is **highly scalable** and can handle large datasets fairly easily.


- Since a NB classifier only has to calculate probability, it is **easy to implement**.


- A NB classifier can be used on **both continuous and discrete data**.


- A NB classifier is not suitable for regression (it cannot be used for numeric value prediction); it is used for classification, most commonly **text data classification**.


## Support Vector Machine

### Overview

A Support Vector Machine, or SVM, is a classifier that finds an optimal hyperplane that maximizes the margin between two classes. Though it can also be applied as a multi-class classifier by running the necessary number of binary "one-versus-rest" SVMs, it remains fundamentally a binary classifier run once per class.

A SVM strives to find a line (more generally, a hyperplane) that is evenly spaced between the two classes. This is achieved by maximizing the distance between the *decision boundary* and the *closest points* of each class. This distance is called the *margin*, and the closest points themselves, which alone determine where the boundary lies, are called the **support vectors**.

The particularities of the SVM, including its hyperparameters, are outlined in its [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
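As an illustrative sketch, the snippet below fits a linear `SVC` on six hand-picked 2-D points and inspects the training points that end up defining the boundary. The data, the linear kernel, and the large `C` (which approximates a hard margin) are assumptions chosen to keep the geometry easy to verify by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points: class 0 near the origin, class 1 further out.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C leaves almost no room for margin violations (close to a hard-margin SVM).
svm = SVC(kernel="linear", C=1000.0)
svm.fit(X, y)

# The training points closest to the boundary, which determine where it lies.
print(svm.support_vectors_)

# For a linear SVM the margin width is 2 / ||w||, where w is the weight vector.
w = svm.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print(round(margin_width, 3))
```

Here the boundary sits midway between the line through (0, 1) and (1, 0) and the point (3, 3), so the margin width comes out at roughly 3.54; the points further from the boundary play no part in placing it.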

### Rationale Behind Choosing SVM for this Project

Some key rationale behind choosing SVM for a NLP project such as sentiment analysis:

- SVM is less commonly used for regression problems, but is very effective in **binary classification**, i.e., working with a binary target variable.


- Many algorithms do not work reliably when the **number of features is higher (or *much* higher) than the number of data points (i.e., rows)**, but SVM generally does very well in such cases. NLP-type analyses frequently face this kind of scenario, where the number of tokens is so high that it is sometimes close to, or greater than, the number of rows in the dataset.


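- As the decision boundary of a SVM is determined only by the support vectors, i.e., the data points closest to the boundary, points lying far from it on the correct side (including many outliers) do not affect where it is placed, so SVM **works well with datasets that have a lot of outliers** too.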


- SVM does quite well when the algorithm needs to **unravel complex relationships** in the data. NLP-based projects can be prime examples of such cases.

## Selection of Performance Measure

After training a ML model, we need to evaluate how well it learned from the input variables by measuring how well it predicts the output values. A number of performance measures are available for this. Two of the most common and useful are the accuracy score and precision-recall, along with some related measures derived from them.

### Accuracy Score

Accuracy is [defined](https://mnassar.github.io/deeplearninghandbook/slides/05_ml.pdf) as the proportion of examples for which the model produces the correct output.

One downside of the accuracy score is that it is not reliable for imbalanced classes: a model that always predicts the majority class can still score highly. However, for our particular dataset, where the class imbalance has been addressed to the point of achieving perfect balance, the accuracy score should portray the model's performance reliably.
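A short worked example in plain Python may help here. The label lists are made up for illustration; the second half shows how accuracy can mislead on an imbalanced set.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 6 of the 8 predictions match, so 0.75

# Imbalanced case: 95 negatives, 5 positives, and a "model" that always predicts negative.
y_true_imbalanced = [0] * 95 + [1] * 5
y_pred_always_negative = [0] * 100
always_negative_accuracy = sum(
    t == p for t, p in zip(y_true_imbalanced, y_pred_always_negative)
) / len(y_true_imbalanced)
print(always_negative_accuracy)  # 0.95, despite never finding a single positive
```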

### Precision and Recall

Precision and recall are two of the most important performance measures. They are especially useful for the imbalanced datasets we often come across in reality, where the accuracy score is not reliable.

**Precision** measures the proportion of positive identifications that were actually correct.

**Recall** measures the proportion of actual positives that were identified correctly.

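Expressed as formulas, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives respectively: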

$$Precision = \frac{TP}{TP+FP}$$

$$Recall = \frac{TP}{TP+FN}$$
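The two formulas can be traced by hand on a small example. The labels below are hypothetical and chosen so that precision and recall come out different.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(tp, fp, fn)            # 2 1 2
print(round(precision, 3))   # 0.667
print(recall)                # 0.5
```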

### F1-score

The F1-score is another popular performance measure, which balances the precision and recall measures. F1-score is the harmonic mean of the precision and recall scores, calculated as follows:

$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$
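To see why the harmonic mean is used rather than a simple average, consider a hypothetical classifier with high precision but very low recall (the two values below are made up for illustration):

```python
# Hypothetical scores: very precise, but misses most of the actual positives.
precision = 0.9
recall = 0.1

arithmetic_mean = (precision + recall) / 2
f1 = 2 * (precision * recall) / (precision + recall)

print(round(arithmetic_mean, 2))  # 0.5
print(round(f1, 2))               # 0.18
```

The simple average suggests a mediocre model, whereas the F1-score is pulled down towards the weaker of the two values, so a model must do well on both precision and recall to score highly.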


### Area Under the Receiver-Operating-Characteristic Curve (AUC)

The Receiver Operating Characteristic (ROC) curve is formed by plotting the [true positive rate (TPR)](https://www.split.io/glossary/false-positive-rate/#:~:text=The%20true%20positive%20rate%20%28TPR,as%20TN%2FTN%2BFP.) against the [false positive rate (FPR)](https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_latest/wsj/model/wos-quality-fpr.html) at varying classification thresholds. The area under the ROC curve (AUC) summarizes the curve in a single number: the area between the curve and the FPR axis.

AUC=0 represents the worst case: the ROC curve hugs the bottom and right edges of the plot (hence the area under it is zero), which means the classifier ranks every negative example above every positive one, i.e., its predictions are consistently inverted. AUC=1 is at the opposite end of the spectrum: the curve rises straight up to TPR=1 at FPR=0 and then runs horizontally, so the area under it is 1 and the associated classifier separates the classes perfectly. Midway stands a random-guessing classifier with AUC=0.5, whose ROC curve follows the diagonal.
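These three reference points can be reproduced with scikit-learn's `roc_auc_score`. The labels and scores below are hypothetical and chosen to hit each case exactly.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels: two negatives followed by two positives.
y_true = [0, 0, 1, 1]

# Every positive is scored above every negative: perfect ranking.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Every positive is scored below every negative: perfectly inverted ranking.
auc_inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

# Identical scores for all examples carry no ranking information.
auc_uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(auc_perfect, auc_inverted, auc_uninformative)  # 1.0 0.0 0.5
```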

## Selection of Hyperparameters

Different classifiers employ distinctly different algorithms, utilizing different sets of parameters. Depending on the values these parameters take, the respective classifier's classification accuracy can vary. Such parameters are referred to as "hyperparameters": they are characteristics of a model that are set externally rather than learned, so their values cannot be estimated from the data but instead need to be iterated through and tuned prior to testing.
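The distinction between hyperparameters and learned parameters can be seen directly on a scikit-learn estimator. The sketch below uses `LogisticRegression` with arbitrarily chosen values of `C` and `max_iter` and a tiny made-up dataset; none of these values come from the project.

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters are chosen by us before any data is seen (values are arbitrary here).
model = LogisticRegression(C=0.5, max_iter=200)
hyperparameters = model.get_params()
print(hyperparameters["C"], hyperparameters["max_iter"])  # 0.5 200

# Learned parameters do not exist until the model is fitted.
fitted_before = hasattr(model, "coef_")
print(fitted_before)  # False

# Tiny hypothetical dataset: one feature, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model.fit(X, y)

# After fitting, the coefficient and intercept have been estimated from the data.
print(model.coef_, model.intercept_)
```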

We will discuss the relevant hyperparameters in each of the respective ML modelling parts of this project. As a quick reference, all the hyperparameters of our selected ML models can be found in the respective documentation links below:
- [Hyperparameters for Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


- [Hyperparameters for RF](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)


- [Hyperparameters for multinomial NB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)


- [Hyperparameters for SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

## Hyperparameter Tuning

Hyperparameter tuning is the process of iterating through different values for each (or most) of the chosen hyperparameters in order to determine an optimal set of hyperparameters for a ML model.

A widely popular method of hyperparameter tuning is to select the desired hyperparameters and a range of values for each, and then use grid search with cross validation, as discussed below.


### Grid Search

A [grid search](https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e) builds a model for every combination of the specified hyperparameter values and evaluates each model. It is used to find the optimal hyperparameters of a model, i.e., those that result in the most "accurate" predictions.
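A minimal sketch of a grid search with scikit-learn's `GridSearchCV` is shown below. The synthetic data from `make_classification`, the choice of `LogisticRegression`, and the particular grid values are all assumptions made for illustration; the grids actually used in the project may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the project works with vectorized review text).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 4 values of C x 2 values of fit_intercept = 8 candidate combinations.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}

# Each candidate is scored by 5-fold cross validation on accuracy.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)                  # 8
print(search.best_params_)           # the best-scoring combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

Because every combination is fitted once per fold, the cost grows multiplicatively with each added hyperparameter, which is worth keeping in mind when choosing the value ranges.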


### Cross Validation

Cross validation is the method of splitting the train data into multiple "folds". In each iteration, a single fold is held back, the model is trained on the remaining folds, and it is then tested on the held-back fold, which stands in for actual test data. This process is depicted at a high level in the image below, and is explained further below:

<img src="https://miro.medium.com/proxy/0*KH3dnbGNcmyV_ODL.png">


#### Validation Set

A validation set is extremely useful, e.g., for tuning the hyperparameters of a classifier. We fit the model on the train set for different hyperparameter values, and then have to test the resulting models to identify the optimal values. We cannot use the train set to do so (as the model has already learned from it), and if we use the test set, then the model selection implicitly learns from the test set as well. The validation set comes in between: it is used to evaluate the hyperparameter values and find the optimal ones, so that the test set is used only once, at those optimal values. We obtain the validation set by further splitting the original train set into a revised train set (a subset of the original) and a validation set.
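The two-stage split described above can be sketched with scikit-learn's `train_test_split`. The 100 placeholder rows and the 60/20/20 proportions are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with a balanced binary target (not project data).
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Stage 1: hold back 20 rows as the test set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0
)

# Stage 2: split the remaining 80 rows into a revised train set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=20, stratify=y_train_full, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters would then be compared on `X_val`, while `X_test` stays untouched until the final evaluation.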

#### K-Fold Cross Validation

One issue with having only one validation set is that the tuning process can still implicitly learn from it and overfit to it. Cross validation addresses this issue by splitting the dataset (excluding the test set) in multiple ways, so that in each iteration the train and validation sets take up different sections of it. We can choose how many splits are performed, hence the term "K-Fold" (i.e., K equal splits).

In the image above, we see that the whole training dataset was divided into 5 sets, with one experiment per set. It therefore constitutes a 5-fold (K=5) cross validation, where in each experiment the orange block of data represents the validation set for that experiment.
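The same 5-fold scheme can be sketched with scikit-learn's `KFold`. The ten placeholder rows are an assumption chosen so that every fold is easy to inspect by eye.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder training data: 10 rows (the test set is assumed to be held out already).
X_train = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for fold_number, (train_idx, val_idx) in enumerate(kfold.split(X_train), start=1):
    folds.append((train_idx, val_idx))
    print(f"fold {fold_number}: train={train_idx.tolist()} validation={val_idx.tolist()}")

# Across the 5 folds, every row serves as validation data exactly once.
validated = sorted(int(i) for _, val_idx in folds for i in val_idx)
print(validated == list(range(10)))  # True
```

Each iteration trains on 8 rows and validates on the remaining 2, and averaging the 5 validation scores gives a steadier estimate than a single validation set would.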