**Deep Learning Project - Movie Reviews Sentiment Analysis** <br>
Names : 
1. Anoop Reddy Kallem - VU21CSEN0500062
2. V Geetha Priya - VU21CSEN0500065
3. G Siva Sai Rushyendra - VU21CSEN0500066

Frameworks used : 
1. spaCy
2. sklearn
3. Pandas

Machine Learning model used : LinearSVC - Linear Support Vector Machine Classifier

Metrics : Precision, Recall, F1-score, Support


**Import spaCy**<br>
**Import displacy for displaying word dependencies**

In [31]:
import spacy
from spacy import displacy

Load the English language model (`en_core_web_sm`) from spaCy

In [32]:
nlp=spacy.load('en_core_web_sm')

Using `nlp()` over a pre-defined text

In [33]:
text="This movie is very bad. This is worst than the one I watch a week ago."
doc=nlp(text)
doc

This movie is very bad. This is worst than the one I watch a week ago.

Tokenization of the pre-defined text

In [34]:
for token in doc:
    print(token)

This
movie
is
very
bad
.
This
is
worst
than
the
one
I
watch
a
week
ago
.


Adding a Sentencizer to the pipeline for separating sentences.

In [35]:
# In spaCy v3, add_pipe() takes the component name and creates the component itself
nlp.add_pipe('sentencizer', before='parser')

<spacy.pipeline.sentencizer.Sentencizer at 0x2c094e9ff80>

In [36]:
for sent in doc.sents:
    print(sent)

This movie is very bad.
This is worst than the one I watch a week ago.


Import `STOP_WORDS` from spaCy's English language data. <br>Stop words are very common words (for example 'a', 'is', 'than') that mainly serve to join sentences and keep them grammatical, and that carry little meaning on their own.

In [37]:
from spacy.lang.en.stop_words import STOP_WORDS

In [38]:
stopwords=list(STOP_WORDS)
print(stopwords)


['see', 'whole', 'except', 'a', 'quite', 'them', 'himself', 'hereafter', 'beyond', 'always', 'call', 'was', 'too', 'yourself', 'nevertheless', 'from', 'as', 'using', 'been', 'these', 'show', 'i', "'m", 'perhaps', 'your', 'all', 'whereas', 'have', 'upon', 'along', 'other', 'nor', 'somehow', 'about', 'forty', 'yourselves', 'whereafter', 'with', 'since', 'mostly', 'not', '‘ll', 'wherein', 'third', 'name', 'then', 'still', 'everyone', 'between', 'something', 'because', 'behind', 'many', 'again', 'seem', 'towards', 'twelve', '‘d', 'whereupon', 'until', 'under', 'had', 'up', 'here', '’re', 'indeed', 'what', 'whenever', 'amongst', 'than', 'however', 'throughout', 'he', 'we', 'first', 'yours', 'if', 'must', 'regarding', "'re", 'becoming', 'why', 'of', "'ll", 'herself', 'amount', 'others', 'whither', 'enough', 'before', 'thereupon', 'him', 'whereby', 'front', 'but', 'is', 'never', '’m', 'while', 'above', 'could', 'hers', 'am', 'keep', 'side', 'four', 'ever', 'just', 'several', 'no', 'afterwards

Drop STOP_WORDS


In [39]:
for token in doc:
    if not token.is_stop:
        print(token)

movie
bad
.
worst
watch
week
ago
.
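
The filtering above can also be sketched without loading a spaCy model. This is a minimal illustration, assuming a tiny hand-picked stop list (`SMALL_STOP_WORDS`, a hypothetical name holding a few words that the output above shows spaCy treating as stop words) and the token list printed earlier. The comparison is made on the lowercased token, which is why `This` is dropped as well.

```python
# Hypothetical mini stop list: a few words that spaCy flagged as stop words above.
SMALL_STOP_WORDS = {"this", "is", "very", "than", "the", "one", "i", "a"}

def drop_stop_words(tokens, stop_words):
    """Return the tokens whose lowercased form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Tokens of the sample text, as printed by the tokenization cell.
tokens = ["This", "movie", "is", "very", "bad", ".",
          "This", "is", "worst", "than", "the", "one",
          "I", "watch", "a", "week", "ago", "."]

kept = drop_stop_words(tokens, SMALL_STOP_WORDS)
print(kept)
```

Punctuation tokens survive this step because they are not stop words; they are removed separately during cleaning.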


Lemmatization - reducing each word in the text to its base (dictionary) form

In [40]:
for lem in doc:
    print(lem.text,lem.lemma_)

This this
movie movie
is be
very very
bad bad
. .
This this
is be
worst bad
than than
the the
one one
I I
watch watch
a a
week week
ago ago
. .


Tagging each word in the text with a Parts-Of-Speech tag

In [41]:
# Print each token with its coarse part-of-speech tag and a short explanation of the tag
for token in doc:
    print(token.text,token.pos_,spacy.explain(token.pos_))

This DET determiner
movie NOUN noun
is AUX auxiliary
very ADV adverb
bad ADJ adjective
. PUNCT punctuation
This PRON pronoun
is AUX auxiliary
worst ADJ adjective
than ADP adposition
the DET determiner
one NOUN noun
I PRON pronoun
watch VERB verb
a DET determiner
week NOUN noun
ago ADV adverb
. PUNCT punctuation


`displacy.render(doc)` draws the processed `doc` with `displacy`, spaCy's built-in visualization tool. With the default style (`dep`), the visualization shows each token with its part-of-speech tag and arcs for the syntactic dependencies between tokens. Passing `style='ent'` instead highlights the named entities within the original text.

In [42]:
doc=nlp(text)
displacy.render(doc)

In [43]:
doc=nlp(text)
displacy.render(doc,style='ent')

Importing `TfidfVectorizer`, `Pipeline`, `train_test_split`, the metric functions and pandas.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
import pandas as pd

Reading IMDB Dataset with 50,000 records and mapping `positive` and `negative` values to `1` and `0` respectively.

In [45]:
df=pd.read_csv('IMDB Dataset.csv')
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df

       review                                                 sentiment
0      One of the other reviewers has mentioned that ...      1
1      A wonderful little production. <br /><br />The...      1
2      I thought this was a wonderful way to spend ti...      1
3      Basically there's a family where a little boy ...      0
4      Petter Mattei's "Love in the Time of Money" is...      1
...    ...                                                    ...
49995  I thought this movie did a down right good job...      1
49996  Bad plot, bad dialogue, bad acting, idiotic di...      0
49997  I am a Catholic taught in parochial elementary...      0
49998  I'm going to have to disagree with the previou...      0


In [46]:
column_names=['Reviews','Sentiments']
df.columns=column_names
df

       Reviews                                                Sentiments
0      One of the other reviewers has mentioned that ...      1
1      A wonderful little production. <br /><br />The...      1
2      I thought this was a wonderful way to spend ti...      1
3      Basically there's a family where a little boy ...      0
4      Petter Mattei's "Love in the Time of Money" is...      1
...    ...                                                    ...
49995  I thought this movie did a down right good job...      1
49996  Bad plot, bad dialogue, bad acting, idiotic di...      0
49997  I am a Catholic taught in parochial elementary...      0
49998  I'm going to have to disagree with the previou...      0


In [47]:
df.shape
df['Sentiments'].value_counts()

Sentiments
1    25000
0    25000
Name: count, dtype: int64

In [48]:
import string

Cleaning the text data: lemmatizing and lowercasing each token, then removing punctuation and stop words

In [49]:
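# All ASCII punctuation characters, used to filter out punctuation tokens
punct=string.punctuation

def text_data_cleaning(sentence):
    """Lemmatize, lowercase and strip each token, then drop stop words and punctuation."""
    doc=nlp(sentence)
    tokens=[]
    for token in doc:
        # "-PRON-" is the pronoun placeholder lemma used by spaCy v2; keep the word itself then
        if token.lemma_ != "-PRON-":
            temp=token.lemma_.lower().strip()
        else:
            temp=token.lower_
        tokens.append(temp)
    cleaned_tokens=[]
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens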

In [50]:
text_data_cleaning("hello how are you. i am fine.")

['hello', 'fine']
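
One detail of the filter above is worth a note: testing a token against the punctuation string with `in` is a substring test, so a multi-character token such as `...` is not recognised as punctuation. The sketch below shows the difference and one possible alternative, a hypothetical helper `is_punctuation` that checks every character against a set. It is an illustration only, not part of the pipeline above, and adopting it would change which tokens reach the vectorizer.

```python
import string

# Set of single punctuation characters for exact, per-character membership tests.
PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Substring test against the punctuation string: the ellipsis is not found.
print("..." in string.punctuation)
# Per-character test: the ellipsis counts as punctuation, a word does not.
print(is_punctuation("..."), is_punctuation("fine"))
```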

Importing LinearSVC

In [51]:
from sklearn.svm import LinearSVC

In [52]:
tfidf=TfidfVectorizer(tokenizer=text_data_cleaning)
classifier=LinearSVC()
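
To make the vectorization step concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each document vector. The helper name `tfidf_vectors` and the two toy documents are assumptions made for illustration; the real vectorizer also builds a vocabulary and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy documents, already tokenized and cleaned.
docs = [["movie", "bad"], ["movie", "good", "good"]]
for vec in tfidf_vectors(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document, `bad` receives a higher weight than `movie` because `movie` appears in both documents and is therefore less distinctive.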

Declaring Feature column and Target column

In [53]:
X=df['Reviews']
y=df['Sentiments']

Splitting data into training and testing <br>
`Training - 70%`
`Testing - 30%`
`random_state=32`

In [54]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=32)
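
As a rough picture of what the split does, the sketch below shuffles row indices and cuts them 70/30 by hand. It is a simplified stand-in, assuming the 50,000 rows loaded above; `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state=32`.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded shuffle, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)    # number of test rows, rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50_000, 0.3, seed=32)
print(len(train_idx), len(test_idx))
```

This gives 35,000 training rows and 15,000 test rows, which matches the total support of 15,000 in the classification report.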

`Fitting the model to the data`

In [55]:
clf=Pipeline([('tfidf',tfidf),('clf',classifier)])
clf.fit(X_train,y_train)
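
Once fitted, a linear SVM classifies a review by taking a weighted sum of its TF-IDF features plus a bias and checking the sign of the result. The sketch below illustrates that rule in plain Python. The weights, bias and feature values are made-up numbers for illustration, not values learned by the model above, and `linear_decision` and `predict_label` are hypothetical helper names.

```python
def linear_decision(features, weights, bias):
    """Score = w . x + b, summed over the terms present in the document."""
    return sum(weights.get(term, 0.0) * value
               for term, value in features.items()) + bias

def predict_label(features, weights, bias):
    """Return 1 (positive) when the score is above zero, otherwise 0 (negative)."""
    return 1 if linear_decision(features, weights, bias) > 0 else 0

# Made-up weights: positive values push the score towards class 1.
weights = {"good": 1.2, "bad": -1.5, "movie": 0.1}
bias = -0.05

print(predict_label({"movie": 0.6, "good": 0.8}, weights, bias))
print(predict_label({"movie": 0.6, "bad": 0.8}, weights, bias))
```

Training amounts to choosing the weights and bias so that this rule separates the two classes with as wide a margin as possible.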





`Predicting the review sentiment`


In [56]:
y_pred=clf.predict(X_test)


In [57]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.88      0.89      7385
           1       0.89      0.90      0.89      7615

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000



In [58]:
confusion_matrix(y_test,y_pred)

array([[6521,  864],
       [ 760, 6855]], dtype=int64)
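
The figures in the classification report can be recomputed from this confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the sketch below reads the counts and derives per-class precision, recall and F1 plus the overall accuracy. The helper name `class_metrics` is illustrative.

```python
# Confusion matrix from above: rows = true label, columns = predicted label
cm = [[6521, 864],
      [760, 6855]]

def class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix."""
    other = 1 - cls
    tp = cm[cls][cls]      # correctly predicted as cls
    fp = cm[other][cls]    # predicted cls, actually the other class
    fn = cm[cls][other]    # actually cls, predicted the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for cls in (0, 1):
    p, r, f1 = class_metrics(cm, cls)
    print(cls, round(p, 2), round(r, 2), round(f1, 2))
print("accuracy", round(accuracy, 2))
```

For class 0, for example, precision is 6521 / (6521 + 760) and recall is 6521 / (6521 + 864), which round to the 0.90 and 0.88 shown in the report.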

**`Take custom input review from end user to classify it as Positive or Negative`**

In [69]:
review_input = []
print("Enter the number of reviews to input:")
n = int(input())

for i in range(n):
    print("\n")
    print("Please enter your review:")
    print("\n")
    x = input()
    print(x)
    review_input.append(x)

# print(review_input)

predictions = clf.predict(review_input)

# Print each review followed by its predicted label
for review, prediction in zip(review_input, predictions):
    label = "Negative" if prediction == 0 else "Positive"
    print("\n")
    print(review)
    print("===> " + label)

Enter the number of reviews to input:


Please enter your review:


This movie is better than the one I watched a week ago


Please enter your review:


It could be better


This movie is better than the one I watched a week ago
===> Positive


It could be better
===> Negative
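
For scripted use or testing, the interactive loop can be replaced by a small function. The sketch below assumes only that the model exposes a `predict` method returning 0 or 1 per review, as the pipeline `clf` does. `label_reviews` and the stand-in `KeywordModel` are hypothetical names, used so that the example runs on its own without the trained model.

```python
def label_reviews(model, reviews):
    """Pair each review with 'Negative' (prediction 0) or 'Positive' (otherwise)."""
    predictions = model.predict(reviews)
    return [(review, "Negative" if pred == 0 else "Positive")
            for review, pred in zip(reviews, predictions)]

class KeywordModel:
    """Stand-in for the trained pipeline: predicts 0 if a review contains 'bad', else 1."""
    def predict(self, reviews):
        return [0 if "bad" in review.lower() else 1 for review in reviews]

for review, label in label_reviews(KeywordModel(), ["Great film", "Bad plot"]):
    print(review, "===>", label)
```

With the trained pipeline, the equivalent call would be `label_reviews(clf, review_input)`.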
