# SENTIMENT CLASSIFICATION WITH THE TF-IDF VECTORIZER
Sentiment classification/analysis is one of the vital areas of Natural Language Processing (NLP). It is at the very centre of interaction of Machine Learning and NLP. Sentiment classification or sentiment analysis fall into broad category of text classification where a model is developed to take in text (document) and tell if the sentiment behind the text is positive, negetive or neutral. Sentiment Analysis is a supervised learning problem as it require a bunch of labelled texts as input. 
### Statement of Problem
Given a document D and a set of fixed classes C = {c1, c2, ...,cn}, the problem of sentiment classification is that of determining the class c of D in C.
### Definition of Important Terms
In sentiment classification and NLP generally, the following terms are very important:
 - Document: Document in sentiment classification is group of words, phrase, sentence or a comlete article. A document could mean a tweet, part of news article, a whole news article, a product manual, a story etc. In tabular data, a document is the collection of words in text field of a particular record.
 - Corpus:  A corpus which a the list of all unique words in in all documents of a text dataset.
### Method
The Naive Bayes classification algorithm is favoured for this work as it is one of the most popular methods for sentiment classification. Naive Bayes method is easy and fast to use, suitable for multi-class classification and if its assumption of independence is satisfied, it perform better than other models and require much less training dataset.In this project, sentiment is classified as either negetive or positive, hence, the task is a binary problem. Consider a set of movie review, each with positive or negetive lable (that is, C = {c+, c-}), given a new review (document) d whose label (class) is not known, 
 - p(c+/d) = p(d/c+)p(c+)/p(d) - (Bayes Rule)
    - p(c+/d) is called the posterior probability
    - p(d/c+) is called the likelihood
    - p(c) is called the posterior probability
    - p(d) is called the maximum likelihood
A review d is assigned to class c+ if 
 - p(d/c+)p(c+)/p(d)>p(d/c-)p(c-)/p(d)
p(d) is the same across all classes, hence it can be dropped so the rule becomes to assign a review d to class c+ if
 - p(d/c+)p(c+)>p(d/c-)p(c-)
p(c+) and P(c-) are the probabilities of positive and negetive reviews respectively in the training dataset> Hence,
 - p(c+) = r+/n where r+ = number of positive review and n is total number of review
 - P(c-) = r-/n where r- = number of negetive review and n -s total number of review
To calculate p(d/c+) and p(d/c-), we vectorize the text dataset by either using the bag of words model or Term 
Frequency - Inverse Word Frequency (TF-IDF). Having vectorized the features and given that x1, x2, ...,xn are the numerical features
 - p(d/c+) = p(x1, x2, x3,...,xn/c+) and
 - p(d/c-) = p(x1, x2, x3,...,xn/c-)

Assuming independence of x1, x2, ...,xn, the probabilities become
 - p(d/c+) = p(x1, x2, x3,...,xn/c+) = p(x1/c+)p(x2/c+)...p(xn/c+) -(Naive Bayes)
 - p(d/c-) = p(x1, x2, x3,...,xn/c-) = p(x1/c-)p(x2/c-)...p(xn/c-) - (Naive Bayes)
When independence of the features are assumed, the Bayes rule becomes the Navive Bayes Rule as seen above. In this work, I will use the Multinomial Naive Bayes algorithm built into python sklearn for implementation.
### Dataset
In the project, a sentiment classification model will be applied to the Internet Movie Database (IMDB) Dataset of 50K Movie Reviews  downloaded from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews. 


## LIBRARY IMPORTATION AND LOADING OF DATASET

In [2]:
#library importation
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
import numpy as np

In [3]:
#load dataset
df = pd.read_csv('../data/IMDB Dataset.csv')

In [4]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


## DATA CLEANING

In [5]:
#removal of unwanted characters and spaces
def clean_data(data):
    data = str(data).lower() #lower case
    data = re.sub(r"@\S+ ", r' ',data) #remove all mentions and replace with a single empty space
    data = re.sub('https://.*','',data) #remove all urls
    data = re.sub("\s+",' ',data) #remove multiple spaces or tabs and replace with a single space
    data = re.sub("\n+",' ',data) #remove multiple empty lines
    letters = re.sub("[^a-zA-Z]",' ',data) #take ontly text and ignore other non text characters
    return letters

In [6]:
#apply clean_data function
df['review'] = df['review'].apply(lambda x:clean_data(x))

##### Note that stopword removal, tokenization, stemming and conversion of text to lower cases are all performed by the TfidfVectorizer. For anyone using any other vectorizer, the cleaning process can be continued with the following code

In [7]:
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf = True, lowercase = True, strip_accents = 'ascii', stop_words = stopset)

In [8]:
#definition of dependent variable
y = df.sentiment

In [9]:
#definition of independent variable
x = vectorizer.fit_transform(df.review)

In [10]:
y.head()

0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object

In [66]:
y.shape

(50000,)

In [67]:
x.shape

(50000, 99248)

In [12]:
#split dataset into training and testing dataset
x_train,x_test, y_train,y_test = train_test_split(x,y, random_state=42)

## MODEL CREATION AND TRAINING

In [13]:
#fit the Multinomial Naive Mayes Model
model = naive_bayes.MultinomialNB()
model.fit(x_train,y_train)

## MODEL EVALUATION

In [15]:
#check acuracy of model
roc_auc_score(y_test,model.predict_proba(x_test)[:,1])

0.9408122901887448

## APPLICATION OF MODEL

In [17]:
#apply model to new review
print(model.predict(vectorizer.transform(np.array(["The movie is a nice movie and I enjoyed watchin"]))))

['positive']


In [18]:
pred = print(model.predict(vectorizer.transform(np.array(["The movie is a bad movie but I wasted time watching"]))))

['negative']


## SAVING AND LOADING MODEL

In [19]:
import joblib

In [21]:
joblib.dump(model, 'agada_sentiment_classifier')

['agada_sentiment_classifier']

In [23]:
joblib.dump(vectorizer, 'agada_sentiment_vectorizer')

['agada_sentiment_vectorizer']

In [28]:
model = joblib.load('agada_sentiment_classifier')

In [29]:
vectorizer = joblib.load('agada_sentiment_vectorizer')

In [30]:
model.predict(vectorizer.transform(np.array(['"The movie is a bad movie but I wasted time watching"'])))[0]

'negative'

In [31]:
text1 = 'Good movie'

In [32]:
model.predict(vectorizer.transform(np.array([text1])))[0]

'negative'