<h1>JanataHack Steam Reviews Sentiment Analysis Hackathon</h1>
<h2>Approach 2 - Naive Bayes Classification</h2>
Computed the probability of a review belonging to each of the two classes ('recommended' and 'not recommended') using the Naive Bayes method, and chose the class with the higher probability.
Got an F1-score of 0.79 on the leaderboard.

<h3>Importing relevant libraries</h3>

In [1]:
from time import time
from math import log
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

<h3>Load train and test datasets</h3>

In [2]:
df = pd.read_csv('/home/hithesh/Desktop/ML Competitions/Sentiment_Analysis_Analytics_vidhya/train.csv')
df_test = pd.read_csv('/home/hithesh/Desktop/ML Competitions/Sentiment_Analysis_Analytics_vidhya/test.csv')
df.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,1,Spooky's Jump Scare Mansion,2016.0,I'm scared and hearing creepy voices. So I'll...,1
1,2,Spooky's Jump Scare Mansion,2016.0,"Best game, more better than Sam Pepper's YouTu...",1
2,3,Spooky's Jump Scare Mansion,2016.0,"A littly iffy on the controls, but once you kn...",1
3,4,Spooky's Jump Scare Mansion,2015.0,"Great game, fun and colorful and all that.A si...",1
4,5,Spooky's Jump Scare Mansion,2015.0,Not many games have the cute tag right next to...,1


<h3>Combining reviews to make documents.</h3>
Creating three documents: one with all reviews in the training data, one with all reviews in which the user recommends the game, and one with all reviews in which the user doesn't recommend the game.

In [3]:
# Join the reviews into one space-separated string per group.
reviews = df['user_review']
recommended = df['user_suggestion'] != 0
doc = ' '.join(reviews)                 # all training reviews
doc0 = ' '.join(reviews[~recommended])  # reviews that don't recommend the game
doc1 = ' '.join(reviews[recommended])   # reviews that recommend the game

<h3>Preprocessing the text in the three documents</h3>
Tokenizing each document, keeping only purely alphabetic tokens, converting them to lowercase, and removing stopwords (words that add little meaning to a sentence).

In [4]:
# A set makes the stopword membership test fast.
stop_words = set(stopwords.words('english'))

# All training reviews
tokens = word_tokenize(doc)
all_words = [token.lower() for token in tokens if token.isalpha()]
all_words = [word for word in all_words if word not in stop_words]

# Reviews that don't recommend the game
tokens0 = word_tokenize(doc0)
all_words0 = [token.lower() for token in tokens0 if token.isalpha()]
all_words0 = [word for word in all_words0 if word not in stop_words]

# Reviews that recommend the game
tokens1 = word_tokenize(doc1)
all_words1 = [token.lower() for token in tokens1 if token.isalpha()]
all_words1 = [word for word in all_words1 if word not in stop_words]
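
The three preprocessing passes above repeat the same steps, so they could be folded into one helper. The sketch below is a stand-in that runs without NLTK data: `preprocess`, the regex tokenizer and the tiny `STOP_WORDS` set are illustrative assumptions, not the notebook's `word_tokenize` and `stopwords.words('english')`, and the regex splits contractions slightly differently.

```python
import re

# Hypothetical, deliberately tiny stop list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "i"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    # The regex stands in for word_tokenize followed by the isalpha() filter.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

sample = "This game is GREAT, and the controls are tight!!! 10/10"
print(preprocess(sample))
```

With NLTK available, swapping the regex for `word_tokenize` plus `isalpha()` and the set for `set(stopwords.words('english'))` would give the same pipeline as the cell above, called once per document.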

<h3>Defining and testing Conditional Probabilities</h3>
Specifically, the conditional probability of a word given that it comes from a negative review, and given that it comes from a positive review.<br>

$P(word|class) = \frac{n+\alpha}{\text{Total no. of words in class } + \alpha \times \text{vocabulary size}}$
where $n$ is the number of times the word occurs in the class document and $\alpha$ is the smoothing parameter.

<h3>Add 1 smoothing ($\alpha = 1$)</h3>
We add 1 to the numerator and the vocabulary size to the denominator to accommodate words that never occurred in the training text of a class. Without smoothing, such a word would get probability zero, which would bring the probability of the class given the whole sentence down to zero.
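
A small worked example may help here. The counts below are made up for illustration (a class with 8 words in total and a 5-word vocabulary), and `smoothed_prob` is a hypothetical helper that simply mirrors the formula above.

```python
def smoothed_prob(n, class_word_count, vocab_size, alpha=1):
    """Additive smoothing: (n + alpha) / (class_word_count + alpha * vocab_size)."""
    return (n + alpha) / (class_word_count + alpha * vocab_size)

# Toy numbers, not taken from the Steam data.
class_word_count = 8   # total words in the class document
vocab_size = 5         # distinct words across all training text

print(smoothed_prob(3, class_word_count, vocab_size, alpha=0))  # seen word, no smoothing: 3/8
print(smoothed_prob(0, class_word_count, vocab_size, alpha=0))  # unseen word, no smoothing: 0
print(smoothed_prob(0, class_word_count, vocab_size, alpha=1))  # unseen word, add-1: 1/13
```

The unseen word moves from exactly zero to a small positive value, so it can lower a class score without eliminating the class outright.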

In [5]:
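def cond_prob(word,class_doc,main_doc_word_no,class_doc_word_no, alpha=1):
    # n: occurrences of `word` in the class document. str.split matches substrings,
    # so 'bad' is also counted inside words such as 'badge'.
    n = len(class_doc.lower().split(word)) - 1
    # main_doc_word_no is passed as len(all_words) (a token count), used here as the vocabulary size.
    prob = (n + alpha)/(class_doc_word_no + alpha*main_doc_word_no)
    return prob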

words = ['good', 'bad', 'cool', 'gross', 'paid', 'sucks','worst','best','yes','amazing','horrible']
for word in words:
    cond_p0 = cond_prob(word,doc0,len(all_words),len(all_words0))
    cond_p1 = cond_prob(word,doc1,len(all_words),len(all_words1))
    if(cond_p1 > cond_p0):
        print(word,"Good_review")
    else:
        print(word,"Bad_review")

good Good_review
bad Bad_review
cool Good_review
gross Bad_review
paid Bad_review
sucks Bad_review
worst Bad_review
best Good_review
yes Good_review
amazing Good_review
horrible Bad_review
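
One caveat worth checking: `cond_prob` counts occurrences with `str.split`, which matches substrings rather than whole words. The snippet below contrasts that with counting whole tokens via `collections.Counter`; the toy sentence and the whitespace tokenizer are illustrative assumptions.

```python
from collections import Counter

text = "bad controls but the badge system is not bad"  # toy example

# Substring counting, as in cond_prob: 'badge' contributes a match for 'bad'.
substring_count = len(text.lower().split("bad")) - 1

# Token counting: only the standalone word 'bad' is counted.
token_counts = Counter(text.lower().split())
token_count = token_counts["bad"]

print(substring_count, token_count)  # 3 2
```

A `Counter` built once from all training tokens also gives the number of distinct words through `len()`, which is what "vocabulary size" in the formula refers to.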


<h3>Defining conditional (log) probabilities for sentences</h3><br>
By Bayes' theorem,
$P(class|word) = \frac{P(word|class)P(class)}{P(word)}$<br>
Since the denominator is the same for both classes, we can ignore it when comparing them<br>
$P(class|word) \ \propto \ P(word|class)P(class)$

One big assumption that the Naive Bayes classifier makes is that the words in a document occur independently of each other, given the class. <br><br>Then the conditional probability of a class given a sentence is proportional to the class prior times the product (over all words) of the conditional probabilities of each word given the class. (Log probabilities are used to avoid computational issues such as the product underflowing to 0, which is quite likely when you multiply a lot of small probabilities.) <br><br>
$ P(class|sentence) \ \propto \ P(class) \prod_{w \in \text{sentence}} P(w|class) $<br>
$ \log P(class|sentence) = \log P(class) + \sum_{w \in \text{sentence}} \log P(w|class) + \text{const} $
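
To see why logs matter, the sketch below multiplies many small probabilities directly and then sums their logs instead. The value 1e-4 and the count of 100 words are arbitrary illustrative choices.

```python
from math import log

probs = [1e-4] * 100  # 100 words, each with a made-up probability of 1e-4

product = 1.0
for p in probs:
    product *= p      # underflows: the true value 1e-400 is below float range

log_sum = sum(log(p) for p in probs)  # stays finite

print(product)   # 0.0
print(log_sum)   # about -921.03
```

Once both class products have underflowed to 0.0 they can no longer be compared, whereas the two log sums still can.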

In [6]:
def NB_prob(paragraph,doc1,doc0,p0,p1,main_doc_word_no,class1_doc_word_no,class0_doc_word_no,alpha=1):
    # Start each class score from the log prior.
    log_cond_p1_para = log(p1)
    log_cond_p0_para = log(p0)
    # Each distinct token of the paragraph contributes once.
    words = set(word_tokenize(paragraph.lower()))
    
    # Add the log conditional probability of every word under each class.
    for w in words:
        log_cond_p1_para = log_cond_p1_para + log(cond_prob(w, doc1, main_doc_word_no, class1_doc_word_no,alpha))
        log_cond_p0_para = log_cond_p0_para + log(cond_prob(w, doc0, main_doc_word_no, class0_doc_word_no,alpha))
    
    # Print the first five characters as a progress marker, then the decision.
    print(paragraph[:5])
    if(log_cond_p1_para>=log_cond_p0_para):
        print("Recommend")
        return 1
    else:
        print("Nope")
        return 0

<h3>Testing the function</h3>
p0 and p1 are the class proportions in the training data (estimates of the class prior probabilities).

In [7]:
p1 = df['user_suggestion'].mean()
p0 = 1 - p1

NB_prob('Bad bad one, i don\'t know what to say about this one. it is absolutely terrible',doc1,doc0,p0,p1,len(all_words),len(all_words1),len(all_words0))
NB_prob('his is the best game I have ever played. I\'m really excited for the next' ,doc1,doc0,p0,p1,len(all_words),len(all_words1),len(all_words0))

Bad b
Nope
his i
Recommend


1

The code segment below took very long to execute (around 22 hours), so I had to break the work up and predict for one group of test examples at a time. That is why the cell reads earlier partial predictions back with 'read_csv' and keeps the commented-out zero-series line.

In [10]:
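# Reload the predictions saved by earlier partial runs.
data = pd.read_csv('/home/hithesh/Desktop/ML Competitions/Sentiment_Analysis_Analytics_vidhya/nlp_sub2.csv')
df_test['user_suggestion'] = data['user_suggestion']
# First run only: start from an all-zero column instead of saved predictions.
# df_test['user_suggestion'] = pd.Series(np.zeros(df_test.shape[0]))
# Assign via .loc: the chained form df_test['user_suggestion'][8000:] = ... is what triggered the SettingWithCopyWarning shown after the output.
df_test.loc[8000:, 'user_suggestion'] = df_test.loc[8000:, 'user_review'].apply(lambda x: int(NB_prob(x,doc1,doc0,p0,p1,len(all_words),len(all_words1),len(all_words0))))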

Early
Recommend
Early
Recommend
Inter
Recommend
Thank
Recommend
For a
Recommend
A rat
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
This 
Recommend
Playe
Nope
This 
Recommend
Early
Recommend
Guns 
Recommend
Early
Nope
Early
Recommend
Early
Recommend
This 
Recommend
(said
Recommend
im go
Recommend
This 
Recommend
For f
Recommend
Early
Recommend
Early
Recommend
TL;DR
Recommend
I lov
Recommend
pile 
Nope
New u
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
This 
Recommend
This 
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
Early
Recommend
It is
Recommend
Early
Recommend
After
Recommend
Pros:
Recommend
Actua
Recommend
see p
Recommend


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
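
Much of the running time likely comes from `cond_prob` lowercasing and rescanning the whole class document for every word of every review. A possible speed-up is to count tokens once and look the counts up afterwards. The sketch below is one way to do that under stated assumptions: `train_nb` and `predict_nb` are hypothetical names, tokenization is a plain whitespace split, the toy reviews are invented, and the smoothing term uses the number of distinct words, so its predictions would not necessarily match the submission above.

```python
from collections import Counter
from math import log

def train_nb(reviews, labels):
    """Count tokens per class once; returns counts, totals, priors and vocab size."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(reviews, labels):
        counts[label].update(text.lower().split())
    totals = {c: sum(counts[c].values()) for c in counts}
    priors = {c: labels.count(c) / len(labels) for c in counts}
    vocab_size = len(set(counts[0]) | set(counts[1]))
    return counts, totals, priors, vocab_size

def predict_nb(text, model, alpha=1):
    """Return 1 if the 'recommended' log score is at least the other, else 0."""
    counts, totals, priors, vocab_size = model
    scores = {}
    for c in (0, 1):
        score = log(priors[c])
        for w in set(text.lower().split()):  # each distinct word once, as in NB_prob
            score += log((counts[c][w] + alpha) / (totals[c] + alpha * vocab_size))
        scores[c] = score
    return 1 if scores[1] >= scores[0] else 0

# Invented toy data, only to show the call pattern (labels must be a list).
reviews = ["great fun game", "fun and great story", "terrible buggy mess", "buggy and boring"]
labels = [1, 1, 0, 0]
model = train_nb(reviews, labels)
print(predict_nb("great story", model), predict_nb("boring buggy game", model))  # 1 0
```

On the real data the model could be trained from `df['user_review'].tolist()` and `df['user_suggestion'].tolist()`, after which each prediction costs only a handful of dictionary lookups.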


In [11]:
final = pd.concat([df_test['review_id'],df_test['user_suggestion']],axis=1)
final.to_csv('/home/hithesh/Desktop/ML Competitions/Sentiment_Analysis_Analytics_vidhya/nlp_sub2.csv',header=True , index=False)