# Movie Reviews Classification

### Method
- **Load scraped review data from CSV files**
- **Data processing**
      *NLTK is used* |
      *Removal of scraped HTML remnants (`<br />`)* |
      *Tokenizing* |
      *Stopword removal* |
      *Stemming*
- **Pipelining of data**
      *Using count vectorization (`CountVectorizer`)* |
      *Due to memory limitations, only unigrams are used (`ngram_range=(1, 1)`)*
- **Model training**
      *Naive Bayes classifiers are chosen because the features are discrete* |
      *MultinomialNB()* |
      *BernoulliNB()* |
      *GaussianNB()*
- **Model accuracy**
      *Scores are observed for each NB classifier* |
      *MultinomialNB() fits well*
- **Value prediction for test data**
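
The vectorizing and training steps listed above can also be chained into a single scikit-learn pipeline. The following is only a minimal sketch under stated assumptions: the reviews and labels are made-up toy data (already lowercased and stemmed), and `toy_pipeline` is a hypothetical name that does not appear in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up reviews standing in for the cleaned training data.
toy_reviews = [
    "great film love act",
    "wonder stori great cast",
    "terribl plot bad act",
    "bore film wast time",
]
toy_labels = [1, 1, 0, 0]  # 1 = pos, 0 = neg

# Unigram counts feed directly into a multinomial Naive Bayes model.
toy_pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
toy_pipeline.fit(toy_reviews, toy_labels)

toy_predictions = toy_pipeline.predict(["great cast", "bad plot"])
print(toy_predictions)
```

A pipeline keeps the vocabulary and the classifier together, so the same object can be applied to raw cleaned text at prediction time.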

### Load necessary modules

In [1]:
import numpy as np
import pandas as pd

In [38]:
df = pd.read_csv('Train/train2.csv')
df.head()

Unnamed: 0,review,label
0,mature intelligent and highly charged melodram...,pos
1,http://video.google.com/videoplay?docid=211772...,pos
2,Title: Opera (1987) Director: Dario Argento Ca...,pos
3,I think a lot of people just wrote this off as...,pos
4,This is a story of two dogs and a cat looking ...,pos


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  100 non-null    object
 1   label   100 non-null    object
dtypes: object(2)
memory usage: 1.7+ KB


In [40]:
df.isnull().sum()

review    0
label     0
dtype: int64

In [41]:
reviews_raw_data = df.values

In [42]:
reviews_raw_data.shape

(100, 2)

In [43]:
reviews_raw_data[0]

array(["mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948. wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it",
       'pos'], dtype=object)

### Splitting into train_x and train_y

In [44]:
reviews_rawX = reviews_raw_data[:, :-1]
reviews_rawY = reviews_raw_data[:, -1]

In [45]:
print(reviews_rawX.shape)
print(reviews_rawY.shape)

(100, 1)
(100,)


### NLTK modules

In [46]:
from nltk.corpus import stopwords

In [47]:
stpwords = set(stopwords.words('english'))
negationwords = {"aren't", "can't", "couldn't", "no", "not", "nor", "didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "won't", "wouldn't"}
stpwords = stpwords - negationwords
print(len(stpwords))

158
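
Negation words are kept out of the stopword set because dropping them can flip the apparent sentiment of a review. The sketch below illustrates this with a small hypothetical stopword list (`toy_stopwords`) rather than the NLTK list, and a helper `drop_stopwords` that is not part of the notebook.

```python
# Small hypothetical stopword list; the notebook uses the NLTK English list.
toy_stopwords = {"this", "is", "a", "the", "not", "was"}
toy_negations = {"not", "no", "nor"}


def drop_stopwords(text, stopword_set):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return " ".join(w for w in text.lower().split() if w not in stopword_set)


sentence = "This film is not good"

# With "not" treated as a stopword, the negation disappears.
with_negations_removed = drop_stopwords(sentence, toy_stopwords)

# Subtracting the negation words keeps "not" in the cleaned text.
with_negations_kept = drop_stopwords(sentence, toy_stopwords - toy_negations)

print(with_negations_removed)  # film good
print(with_negations_kept)     # film not good
```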


### Processing training reviews: tokenization, stopword removal, stemming

In [48]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
# from nltk.stem.lancaster import LancasterStemmer

In [49]:
tokenizer = RegexpTokenizer(r'\w+')
ps = PorterStemmer()
def ProcessReview(review):
    review = review.lower()
    review = review.replace("<br />", "")
    tokenized_review = tokenizer.tokenize(review)
    stemmed_words = [ps.stem(w) for w in tokenized_review if w not in stpwords]
    cleaned_review = " ".join(stemmed_words)
    return cleaned_review

In [50]:
sample_text = "mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948.<br /><br /> wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it"
ProcessReview(sample_text)

'matur intellig highli charg melodrama unbelivebl film china 1948 wei wei stun perform catylast love triangl simpli stun oppurun see magnific film take'
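
One thing worth checking is how the `\w+` pattern treats contractions: it splits on the apostrophe, so a token such as `don't` becomes `don` and `t`, and those fragments no longer match apostrophe-containing entries in `negationwords`. The sketch below uses `re.findall`, which is assumed here to split text the same way as `RegexpTokenizer(r'\w+')`; the second pattern is a hypothetical alternative, not something the notebook uses.

```python
import re

sample = "I don't like it, and it isn't good"

# re.findall with r'\w+' is used here as a stand-in for RegexpTokenizer(r'\w+').
word_tokens = re.findall(r"\w+", sample.lower())
print(word_tokens)

# Hypothetical alternative pattern that keeps one internal apostrophe,
# so contractions survive as single tokens.
contraction_tokens = re.findall(r"\w+(?:'\w+)?", sample.lower())
print(contraction_tokens)
```

Whether the fragments are later discarded depends on the stopword list in use; plain words such as `not` and `no` are unaffected by this splitting.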

In [51]:
train_x_cleaned = []
for review in reviews_rawX:
    r = ProcessReview(review[0])  # slow step: avoid re-running it once the processed reviews are saved
    train_x_cleaned.append(r)
    # pass

In [52]:
train_x_cleaned = np.asarray(train_x_cleaned)

### Save processed reviews to avoid repeating pre-processing

In [53]:
newx = train_x_cleaned.reshape((-1, 1))
newx = np.concatenate((newx, reviews_rawY.reshape((-1, 1))), axis=1)
df = pd.DataFrame(newx, columns=["reviews", "label"])
df.to_csv('Train/train_processed.csv')

In [54]:
# Alternatively, load the previously saved processed reviews from here
train_x_cleaned = pd.read_csv('Train/train_processed.csv')
train_x_cleaned = train_x_cleaned['reviews'].values

In [55]:
train_x_cleaned = train_x_cleaned.reshape((-1,))
train_x_cleaned.shape

(100,)

In [56]:
train_x_cleaned[:2]

array(['matur intellig highli charg melodrama unbelivebl film china 1948 wei wei stun perform catylast love triangl simpli stun oppurun see magnific film take',
       'http video googl com videoplay docid 211772166650071408 hl en distribut tri opt mass appeal want best possibl view rang forgo profit continu manual labor job gladli entertain work view texa tale pleas write like not like alex not like stuie texa texa tale write opinion rule'],
      dtype=object)

In [57]:
reviews_rawY

array(['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos',
       'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg',
       'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos',
       'neg', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg',
       'neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg',
       'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos',
       'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg',
       'neg', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg',
       'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg',
       'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos',
       'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg',
       'neg'], dtype=object)

### Change textual data to numeric data

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
cv = CountVectorizer(ngram_range=(1, 1))

In [60]:
vectorized_train_x = cv.fit_transform(train_x_cleaned[:10000])

In [61]:
vectorized_train_x = vectorized_train_x.toarray()

In [62]:
vectorized_train_x.shape

(100, 3567)

In [63]:
len(cv.vocabulary_)

3567
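
Since memory is the stated constraint, it may help that `CountVectorizer.fit_transform` returns a SciPy sparse matrix and that `MultinomialNB` can be fitted on it directly, without calling `.toarray()`. The sketch below shows this on made-up toy data; the names `toy_cv`, `sparse_counts`, and `toy_model` are hypothetical. `GaussianNB` does need a dense array, so the conversion is still required for that classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up cleaned reviews (not taken from the data set).
toy_docs = [
    "great film love act",
    "wonder stori great cast",
    "terribl plot bad act",
    "bore film wast time",
]
toy_labels = [1, 1, 0, 0]  # 1 = pos, 0 = neg

toy_cv = CountVectorizer(ngram_range=(1, 1))
sparse_counts = toy_cv.fit_transform(toy_docs)  # sparse matrix, only non-zeros stored

# MultinomialNB accepts the sparse matrix as-is.
toy_model = MultinomialNB().fit(sparse_counts, toy_labels)

print(sparse_counts.shape)  # (documents, vocabulary size)
print(sparse_counts.nnz)    # number of stored non-zero counts
```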

In [64]:
np.unique(reviews_rawY)

array(['neg', 'pos'], dtype=object)

In [65]:
train_y = [1 if w == 'pos' else 0 for w in reviews_rawY]
train_y = np.asarray(train_y[:10000])
train_y.shape

(100,)

In [66]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

In [67]:
bnb, mnb, gnb = BernoulliNB(), MultinomialNB(), GaussianNB()

In [68]:
vectorized_train_x.shape

(100, 3567)

In [69]:
bnb.fit(vectorized_train_x, train_y)

In [70]:
mnb.fit(vectorized_train_x, train_y)

In [71]:
gnb.fit(vectorized_train_x, train_y)

In [72]:
bnb.score(vectorized_train_x, train_y)

0.97

In [73]:
mnb.score(vectorized_train_x, train_y)

1.0

In [74]:
gnb.score(vectorized_train_x, train_y)

1.0
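
The scores above are computed on the same rows the models were fitted on, so they measure fit rather than generalization. A held-out estimate can be sketched with cross-validation as below. This is an illustration under assumptions: the reviews are made-up toy data, and `toy_pipeline` and `cv_scores` are hypothetical names not used in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up cleaned reviews: four positive, four negative.
toy_reviews = np.array([
    "great film love act",
    "wonder stori great cast",
    "love great stori",
    "wonder act great film",
    "terribl plot bad act",
    "bore film wast time",
    "bad bore plot",
    "terribl wast bad film",
])
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Putting the vectorizer inside the pipeline means the vocabulary is
# learned from the training folds only, never from the held-out fold.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# 4-fold stratified cross-validation: each fold holds out one review per class.
cv_scores = cross_val_score(toy_pipeline, toy_reviews, toy_labels, cv=4)
print(cv_scores.mean())
```

On the real data, the same call with the cleaned reviews and `train_y` would give a less optimistic comparison between the three classifiers than training accuracy does.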

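### Predicting test data with MultinomialNB()

On the training data, MultinomialNB() and GaussianNB() both score 1.0 and BernoulliNB() scores 0.97. MultinomialNB() is used for the test predictions.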

In [75]:
df_test = pd.read_csv('Test/Test.csv')
df_test

Unnamed: 0,review
0,Remember those old kung fu movies we used to w...
1,This movie is another one on my List of Movies...
2,How in the world does a thing like this get in...
3,"""Queen of the Damned"" is one of the best vampi..."
4,The Caprica episode (S01E01) is well done as a...
...,...
9995,Watched this piece ONDEMAND because the descri...
9996,A nurse travels to a rural psychiatric clinic ...
9997,Although this small film kind of got lost in t...
9998,I first saw this film in the early 80's on cab...


In [76]:
df_test = df_test.values

In [77]:
test_x_raw = df_test[:1000]
test_x_raw.shape

(1000, 1)

In [78]:
test_x = []
for review in test_x_raw:
    r = ProcessReview(review[0])
    test_x.append(r)

test_x[0]

'rememb old kung fu movi use watch friday saturday late night babysitt thought charg well movi play exactli like one movi patsi kensit biggest claim fame love interest mel gibson charact lethal weapon 2 perform one reason never made big terribl actress lethal weapon 2 thought cute cute enough check movi includ love music love danc anoth big let obvious not impress either attract eye soul scream turn play anoth cheap predict role done badli movi kensit star comedienn not good one either work club franc cut homeland make ear bleed luck even wors french govern want throw expir visa mayb caught act get marri casanova freiss luck predict begin terribl way give movi neg rate 1 10 star rate'

### Convert test_x into numeric data and Predict Y values

In [79]:
test_x = np.asarray(test_x)
test_x.shape

(1000,)

In [80]:
vectorized_test_x = cv.transform(test_x).toarray()

In [81]:
predicted_test_y = mnb.predict(vectorized_test_x)
predicted_test_y.shape

(1000,)

In [82]:
test_y = ['pos' if i == 1 else 'neg' for i in predicted_test_y]
test_y = np.asarray(test_y)

In [83]:
df = pd.DataFrame(test_y, columns=["label"])
df.head()

Unnamed: 0,label
0,neg
1,pos
2,neg
3,pos
4,pos


In [84]:
df.to_csv('Test/Sample_submission.csv')
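
The `Unnamed: 0` column that shows up when these CSV files are read back comes from the DataFrame index, which `to_csv` writes by default. The sketch below shows the effect of `index=False` on a made-up two-row frame; it writes to in-memory strings rather than to the notebook's files.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({"label": ["neg", "pos"]})

# Default: the index is written as an unnamed first column.
with_index = toy_df.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(with_index))
print(list(reloaded_with_index.columns))

# index=False writes only the data columns.
without_index = toy_df.to_csv(index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index))
print(list(reloaded_without_index.columns))
```

Whether the index column is wanted depends on the expected submission format, which the notebook does not state.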

In [88]:
import numpy as np
import pandas as pd

df = pd.read_csv('Train/train_reviews.csv')
print(df.head())

                                              review label
0  mature intelligent and highly charged melodram...   pos
1  http://video.google.com/videoplay?docid=211772...   pos
2  Title: Opera (1987) Director: Dario Argento Ca...   pos
3  I think a lot of people just wrote this off as...   pos
4  This is a story of two dogs and a cat looking ...   pos


In [91]:
reviews_raw_data = df.values
reviews_raw_data[:2]

array([["mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948. wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it",
        'pos'],
       ['http://video.google.com/videoplay?docid=211772166650071408&hl=en Distribution was tried.<br /><br />We opted for mass appeal.<br /><br />We want the best possible viewing range so, we forgo profit and continue our manual labor jobs gladly to entertain you for working yours.<br /><br />View Texas tale, please write about it... If you like it or not, if you like Alex or not, if you like Stuie, Texas or Texas tale... Just write about it.<br /><br />Your opinion rules.',
        'pos']], dtype=object)
