# Motivation

The main goal of this project is to predict the sentiment of sentences. <br>
The sentiment of a sentence is whether it sounds positive (e.g. "This movie was excellent!") or negative (e.g. "Special effects were terrible..."). <br>
We work with the IMDB movie reviews dataset. Movies are rated from 1 to 10. We consider only the reviews with rating 1 (the worst ones) and rating 10 (the best ones).  <br>
The assumption is that reviews with rating 1 have a negative sentiment and reviews with rating 10 have a positive sentiment.  <br>
An additional goal is to test some heuristic approaches: when we spot a negation or an adverb, we take the actions described in the Heuristic Naive Bayes section. <br>
In sentences like the ones above ("This movie was excellent!" and "Special effects were terrible..."), you can feel that the words "excellent" and "terrible" are the most important. We'd like to extract the words on which the meaning depends the most. <br>

# Baseline

Our training dataset consists of 4732 reviews with rating 10 and 5100 reviews with rating 1. <br>
A simple classifier that always votes for the majority class gives us a baseline accuracy of 51.87%.
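
The baseline figure follows directly from the class counts. A minimal check, using only the numbers quoted above (the variable names are ours):

```python
# Class counts taken from the text above.
n_positive = 4732  # reviews with rating 10
n_negative = 5100  # reviews with rating 1

# A majority-class classifier always predicts the larger class.
baseline_accuracy = max(n_positive, n_negative) / (n_positive + n_negative)
print(f"{baseline_accuracy:.2%}")  # 51.87%
```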

# Data preprocessing

Our model cannot work on raw data. Somebody could write words like TThiSSS or ..use..too..:many;.. punctuation marks. <br>
Apart from obvious procedures like removing punctuation marks and lowercasing all sentences, we try different preprocessing techniques such as: <br>

1. Stemming (reducing words to their root form, which decreases the number of words in our dictionary).
2. Stop words removal (we assume that words like 'the', 'and', 'a/an' etc. don't give much information).
3. N-grams (considering contiguous sequences of n items from a given sample of text).
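
The steps above can be sketched in a few lines of plain Python. This is only an illustration, not the project's actual preprocessing code: the stop-word list and the function names are hypothetical, and stemming is left out because it would normally come from a library (for example NLTK's `PorterStemmer`).

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "a", "an", "is", "was", "were"}

def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("..This..movie;; was EXCELLENT!")
print(tokens)                     # ['this', 'movie', 'was', 'excellent']
print(remove_stop_words(tokens))  # ['this', 'movie', 'excellent']
print(ngrams(tokens, 2))          # ['this movie', 'movie was', 'was excellent']
```

Note that this sketch does not repair repeated letters such as TThiSSS; that would need an extra normalization step.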

# Implemented classes

1. CountVectorizer 
2. Naive Bayes
3. Logistic Regression
4. SVM
5. Heuristic Naive Bayes

# Count Vectorizer

The first question is how we want to store our data. Most models work with data points. <br>
Points usually live in some space. We can represent a point easily using a tuple/vector (x, y, z, ...),  <br>
where consecutive elements of this tuple correspond to the values on consecutive axes. <br>
For example, a point on a 2D plane can be described as a pair (x, y). <br>
<br>
So how do we transform a sentence into a vector? <br>
A simple idea is to use the bag-of-words model. The bag-of-words model is commonly used in document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.  <br>
Basically, we have vectors of the size of the whole dictionary that we are using. <br>
Every position in this vector corresponds to a different word from the dictionary. <br>
If, say, a word occurs twice in our text, there should be a 2 at the position that corresponds to this word. <br>
<br>
We want to make those vectors as short as possible to save memory. <br>
If we had 10000 words in our dataset every vector would have to be that long. <br>
Sentences usually have several dozen words so those vectors are mainly filled with zeros. <br>
That's why we use methods like stemming, which decrease the number of words significantly. Every inflected form of a word is reduced to its root.
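
A minimal sketch of the bag-of-words idea follows. It is not the project's CountVectorizer class; the function names are hypothetical and the documents are assumed to be already tokenized.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct token to a fixed vector position."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def vectorize(tokens, vocab):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector

docs = [["good", "movie", "good", "plot"], ["bad", "movie"]]
vocab = build_vocabulary(docs)
print(vocab)                      # {'good': 0, 'movie': 1, 'plot': 2, 'bad': 3}
print(vectorize(docs[0], vocab))  # [2, 1, 1, 0]
print(vectorize(docs[1], vocab))  # [0, 1, 0, 1]
```

Dense lists are used here for readability. With a real vocabulary most entries are zero, so a sparse representation is usually preferable.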


# Naive Bayes

First, we want to test the simplest model, which works very well when it comes to classifying text documents.<br>
In this model, there is no need to use bag-of-words vectors, because we can keep the word counts in Python dictionaries. <br>
A Naive Bayes classifier uses the Bayes theorem to classify data.  <br>

Bayes theorem:

$$P(A|B) = \frac{P(B|A) * P(A)}{P(B)}$$

Let $c=1$ be the class of good movies with rating 10. <br>
Given the sentence "this is a good movie" we can say that:

$$ P(c=1 | \text{this is a good movie}) = \frac{P(\text{this is a good movie}|c=1) * P(c=1)}{P(\text{this is a good movie})}  $$
where $P(\text{this is a good movie})$ is a constant equal for both classes, so we can skip it.

Going further we can assume that

$$ P(\text{this is a good movie}|c=1) * P(c=1) \approx P(\text{this}|c=1) * P(\text{is}|c=1) * P(\text{a}|c=1) * P(\text{good}|c=1) * P(\text{movie}|c=1) * P(c=1) $$

We can use our training data to estimate $P(\text{word} | c)$ for every word and every class. <br>
Then we just choose the class with the highest probability. <br>

We also have to tune the alpha parameter. <br>
The alpha parameter corresponds to the technique called Laplace smoothing. <br>
Let's say we train our model on some data and, when we predict a sentence, it contains a word that our model hasn't seen yet. <br>
We cannot give it a count of 0, because that would zero out the whole probability. <br>
The assumption is that each word in the vocabulary was seen a small additional number of times (possibly a fraction) in each class of documents.<br>
This number is called the alpha parameter.
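
The classifier described above can be sketched as follows. This is a simplified illustration rather than the project's implementation: the function names are hypothetical, log-probabilities are summed instead of multiplying probabilities (to avoid numerical underflow), and the smoothed estimate is assumed to be the standard $(\text{count} + \alpha) / (\text{total} + \alpha |V|)$.

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels, alpha=1.0):
    """Count words per class and return everything needed for prediction."""
    word_counts = {c: Counter() for c in set(labels)}
    class_counts = Counter(labels)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab, alpha

def predict(model, tokens):
    """Return the class with the highest log P(c) + sum of log P(word | c)."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)  # log prior
        for w in tokens:
            # Laplace smoothing: unseen words get a small non-zero probability.
            p = (word_counts[c][w] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["excellent", "movie"], ["great", "plot"],
        ["terrible", "movie"], ["terrible", "effects"]]
labels = [1, 1, 0, 0]
model = train_naive_bayes(docs, labels, alpha=1.0)
print(predict(model, ["excellent", "plot"]))   # 1
print(predict(model, ["terrible", "unseen"]))  # 0
```

The second call contains a word that never occurred in training. Thanks to smoothing it contributes the same small factor to both classes instead of zeroing the score.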

![alt text](Images/naive_bayes1.png)

## Plots overall

We plot:
1. Accuracy - the percentage of correctly classified data.
2. FN - False Negatives - the percentage of the data that was incorrectly classified as negative (bad rating) when it should have been classified as positive (good rating).
3. FP - False Positives - the percentage of the data that was incorrectly classified as positive (good rating) when it should have been classified as negative (bad rating).
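
As an illustration, the three quantities can be computed like this. The sketch assumes labels are encoded as 1 (positive) and 0 (negative) and that FN and FP are reported as a share of all samples, which is our reading of the definitions above; the function name is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, FN and FP as percentages of all samples."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # positive predicted as negative
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # negative predicted as positive
    return {"accuracy": 100 * correct / n, "FN": 100 * fn / n, "FP": 100 * fp / n}

metrics = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)  # {'accuracy': 60.0, 'FN': 20.0, 'FP': 20.0}
```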

Our Naive Bayes accuracy. <br>
![alt text](Images/naive_bayes2.png)

Sklearn Multinomial Naive Bayes accuracy. <br>
![alt text](Images/naive_bayes3.png)

Comparison between our implementation and sklearn's. <br>
![alt text](Images/naive_bayes4.png)

Naive Bayes worked better than we expected. <br>
With some parameter tuning it scored even better than sklearn's implementation. <br>
This simple model set the bar quite high, yielding 89.95% accuracy.

# Logistic Regression

The goal of logistic regression is to calculate the probability that a sample $x$ belongs to class $y=1$:
$$p(y=1|x) = \sigma(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$$  

We can observe that:  
$$ p(y=y^{(i)}|x^{(i)};\theta) = \sigma(\theta^Tx^{(i)})^{y^{(i)}}(1-\sigma(\theta^Tx^{(i)}))^{(1-y^{(i)})}$$  

Therefore the negative log likelihood ($nll$) is:$$
\begin{split}
nll(\theta) &= -\sum_{i=1}^{N} \left[ y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)})\log(1-\sigma(\theta^Tx^{(i)})) \right] = \\
&= -\sum_{i=1}^{N} \left[ y^{(i)}\log p(y=1|x^{(i)}; \theta) + (1-y^{(i)})\log  p(y=0|x^{(i)}; \theta) \right]
\end{split}
$$

So we are searching for $\theta$:
$$\theta = \arg\min_{\theta} \ nll(\theta) $$  
  
We can further consider logistic regression with regularization, where:$$
\begin{split}
nll(\theta) &= -\sum_{i=1}^{N} \left[ y^{(i)}\log p(y=1|x^{(i)}; \theta) + (1-y^{(i)})\log  p(y=0|x^{(i)}; \theta) \right] + \frac{\lambda}{2} \sum_{j}\theta_{j}^{2}
\end{split}
$$

There are a few ways to find $\theta$. We use the L-BFGS-B solver and then compare the results with sklearn's LogisticRegression.
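
A minimal sketch of how the regularized objective can be minimized with L-BFGS-B is shown below. It assumes NumPy and SciPy's `scipy.optimize.minimize` are used; the function names, the toy data and the choice `lam=1.0` are ours and not taken from the project code.

```python
import numpy as np
from scipy.optimize import minimize

def nll_and_grad(theta, X, y, lam):
    """Regularized negative log likelihood and its gradient."""
    z = X @ theta
    # Per-sample nll equals log(1 + e^z) - y*z; logaddexp keeps it numerically stable.
    nll = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (theta @ theta)
    sigma = 1.0 / (1.0 + np.exp(-z))
    grad = X.T @ (sigma - y) + lam * theta
    return nll, grad

def fit(X, y, lam=1.0):
    """Minimize the objective with L-BFGS-B, starting from theta = 0."""
    theta0 = np.zeros(X.shape[1])
    result = minimize(nll_and_grad, theta0, args=(X, y, lam),
                      jac=True, method="L-BFGS-B")
    return result.x

def predict(theta, X):
    """Predict class 1 when sigma(theta^T x) >= 0.5, i.e. theta^T x >= 0."""
    return (X @ theta >= 0).astype(int)

# Toy count vectors: column 0 is a "positive" word, column 1 a "negative" one.
X = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = fit(X, y, lam=1.0)
print(predict(theta, X))  # [1 1 0 0]
```

Passing `jac=True` tells the solver that the objective function returns the gradient together with the value, which avoids slow numerical differentiation.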

Our Logistic Regression accuracy. <br>
![alt text](Images/lr1.png)

Sklearn Logistic Regression accuracy. <br>
![alt text](Images/lr2.png)

Comparison between our implementation and sklearn's. <br>
![alt text](Images/lr3.png)

While testing Logistic Regression we struggled mostly with performance issues. <br>
The sklearn implementation ran and converged much faster than ours. This doesn't bother us, because minimizing functions is quite a complex task and the sklearn authors probably know more useful tricks.<br>
Our model scored 89.92% and sklearn's 92.16%. <br>
Interestingly, our model worked better on stemmed data without stop words, contrary to sklearn's. <br>

## Ngrams and TF-IDF

Here we test whether n-grams give us more information and improve accuracy. <br>
An n-gram is a contiguous sequence of n items from a given sample of text or speech.  <br>
It means that we additionally consider contiguous sequences of up to n words instead of just single words. <br>

We also check whether using a TF-IDF vectorizer yields better scores than CountVectorizer. <br>
The TF-IDF vectorizer weights each count by how rare the term is across all documents and normalizes the resulting vector, so very common words contribute less.
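
For illustration, both vectorizers can be configured in sklearn as shown here. The two toy reviews are made up, and `ngram_range=(1, 2)` is used only to keep the output small; the comparison in this section uses `(1, 5)`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["this movie was not good", "this movie was good"]

# Unigrams and bigrams: the bigram "not good" becomes a feature of its own.
cv = CountVectorizer(ngram_range=(1, 2))
counts = cv.fit_transform(reviews)
print(sorted(cv.vocabulary_))
print(counts.shape)  # (2, 10): 5 unigrams + 5 bigrams

# Same n-grams, but weighted by TF-IDF and L2-normalized per document.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
weights = tfidf.fit_transform(reviews)
print(weights.toarray().round(2))
```

With bigrams the first review contains the feature "not good", which a single-word model cannot distinguish from an occurrence of "good".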

Comparison between:
1. Sklearn Logistic Regression + CV
2. Sklearn Logistic Regression + CV + n-gram (1, 5)
3. Sklearn Logistic Regression + TF-IDF + n-gram (1, 5)

<br>


![alt text](Images/lr5.png)

As it turned out, using n-grams improves our accuracy a bit. <br>
For ngram_range=(1, 4) and every range with a larger upper limit, we obtained 92.86% (+0.39%) accuracy for the original data and 92.77% (+0.61%) accuracy for stemmed data without stop words. <br>
So n-grams give the best accuracy when we neither stem nor remove stop words, although the gain over single words is larger on the stemmed data. <br>
The TF-IDF vectorizer wasn't better than CountVectorizer, scoring around 3% less on test data.

# SVM

The goal of the SVM is to predict the class of $x^{(i)}$ using $\text{signum}(w^T x^{(i)} + b)$.

An SVM tries to find weights $w\in\mathbb{R}^n$ and a bias $b\in\mathbb{R}$ that maximize the separation margin. This corresponds to solving the following quadratic optimization problem:

$$\begin{split}
  \min_{w,b,\xi}  &\frac{1}{2}w^Tw  + C\sum_{i=1}^m \xi_i  \\
  \text{s.t. } & y^{(i)}(w^T x^{(i)} + b) \geq 1- \xi_i\;\; \forall_i \\
  & \xi_i \geq 0 \;\; \forall_i.
\end{split}$$

We used the dual form of this problem together with the kernel trick, which corresponds to solving the following problem:

$$\begin{aligned}
    \min_{\alpha} \quad & \frac{1}{2}  \alpha^T H  \alpha - 1^T \alpha
    \\
    \text{s.t. } \quad & - \alpha_i \leq 0 \;\; \forall_i
    \\
    & \alpha_i \leq C \;\; \forall_i
    \\
    & y^T \alpha = 0
    \\
    \text{where } \quad & H_{i,j} = y^{(i)} y^{(j)} K_{i,j}
    \\
    & K_{i,j} = K(x^{(i)},x^{(j)})
\end{aligned}$$
  
We used a QP solver to solve it.
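
The sketch below shows how the matrices of the dual problem can be assembled in the standard QP form $\min \frac{1}{2}\alpha^T H \alpha + q^T\alpha$ subject to $G\alpha \leq h$ and $A\alpha = b$. It is an illustration under our own assumptions: the function names are hypothetical, a linear kernel is used, and the text does not say which solver was used (a package such as CVXOPT with `cvxopt.solvers.qp` would be one option).

```python
import numpy as np

def linear_kernel(a, b):
    """K(a, b) = a^T b; any other kernel function could be passed instead."""
    return a @ b

def build_dual_qp(X, y, C, kernel=linear_kernel):
    """Return (H, q, G, h, A, b) describing the dual SVM problem."""
    m = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)],
                 dtype=float)
    H = np.outer(y, y) * K                  # H_ij = y_i * y_j * K(x_i, x_j)
    q = -np.ones(m)                         # the -1^T alpha term
    G = np.vstack([-np.eye(m), np.eye(m)])  # -alpha_i <= 0 and alpha_i <= C
    h = np.concatenate([np.zeros(m), C * np.ones(m)])
    A = y.reshape(1, -1).astype(float)      # y^T alpha = 0
    b = np.zeros(1)
    return H, q, G, h, A, b

def decision_function(alpha, X_train, y_train, bias, x, kernel=linear_kernel):
    """w^T x + b written with kernels: sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha, y_train, X_train)) + bias

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
H, q, G, h, A, b = build_dual_qp(X, y, C=1.0)
print(H)  # [[2. 2.] [2. 2.]]

# For this two-point toy problem the dual optimum can be found by hand:
# alpha = (0.25, 0.25) and bias = 0, so the training points lie on the margin.
alpha = np.array([0.25, 0.25])
print(decision_function(alpha, X, y, 0.0, X[0]))  # 1.0
```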

Our SVM accuracy.

![alt text](Images/svm1.png)

Sklearn SVM accuracy. <br>
![alt text](Images/svm2.png)

Comparison between our implementation and sklearn's on the test set. <br>
![alt text](Images/svm4.png)

SVM achieved the best accuracy so far. <br>
With some parameter tuning our SVM was sometimes better than sklearn's, which is something to be proud of. <br>
It scored 92.16%, which is a bit better than Logistic Regression. <br>

# Heuristic Naive Bayes

We came up with some heuristic approaches and implemented them in Naive Bayes. <br>

1. If we come across a negation word, we negate the $k$ words after it, i.e. we replace the probability of each of the next $k$ words with its probability under the other class.
2. For words that intensify sentiment (certain adverbs), we multiply the probabilities of the next $k$ words to increase their weight.
3. There are also lists of positive and negative words, because sometimes a positive word ends up with a higher probability in the negative class and vice versa. In that case we swap its probabilities.
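
The first heuristic can be sketched as follows. This is a simplified illustration, not the project's Heuristic Naive Bayes class: the negation list, the function name and the toy probabilities are hypothetical, class priors are omitted, and we assume that the negation word itself and unknown words do not contribute to the score.

```python
import math

# Hypothetical negation list for illustration only.
NEGATIONS = {"not", "no", "never"}

def score_with_negation(tokens, log_prob, k=2):
    """Sum per-class log-probabilities, swapping classes for k words after a negation.

    log_prob maps word -> (log P(word | negative), log P(word | positive)).
    Returns [negative score, positive score].
    """
    scores = [0.0, 0.0]
    remaining = 0  # how many upcoming words are still negated
    for token in tokens:
        if token in NEGATIONS:
            remaining = k
            continue
        if token not in log_prob:
            continue
        neg, pos = log_prob[token]
        if remaining > 0:
            neg, pos = pos, neg  # use the probability from the other class
            remaining -= 1
        scores[0] += neg
        scores[1] += pos
    return scores

log_prob = {
    "good": (math.log(0.1), math.log(0.4)),
    "movie": (math.log(0.3), math.log(0.3)),
}
plain = score_with_negation(["good", "movie"], log_prob, k=1)
negated = score_with_negation(["not", "good", "movie"], log_prob, k=1)
print(plain[1] > plain[0])      # True: positive wins
print(negated[0] > negated[1])  # True: negative wins after "not"
```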

Accuracies we got. <br>

![alt text](Images/heura1.png)

Comparison between our Heuristic Naive Bayes and Naive Bayes. <br>
![alt text](Images/heura2.png)

Our Heuristic Naive Bayes can be considered quite a success.<br> 
The results support our hypotheses about changing the weight of words after a negation and
increasing the weight of words after certain adverbs. <br>
It also scored better than Naive Bayes, yielding a solid 90.32% accuracy. <br> <br>

# Summary


We definitely learned a lot doing this project. <br>
First of all, we are very happy with our results, and the final accuracies are a big success for us. <br>
Our own implementations sometimes scored a bit better than sklearn's, <br>
which we wouldn't have believed before the project started. <br>



![alt text](Images/res.png)