# Data Pipelining for Classification

### Method:
- Collect the scraped review data from CSV files
- Data processing (using NLTK)
  - Remove scraping residue such as `<br />` tags
  - Tokenize
  - Remove stop words
  - Stem the tokens
- Pipelining of data
  - Use count vectorization
  - Because of memory limitations, only unigrams are used (`ngram_range=(1, 1)`)
- Model training
  - Because the features are discrete, Naive Bayes classifiers are chosen
  - MultinomialNB()
  - BernoulliNB()
  - GaussianNB()
- Model accuracy
  - Scores are observed for each NB classifier
  - MultinomialNB() fits best
- Value prediction for test data
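
The model-comparison step listed above can be sketched on a tiny example. This is only an illustration under stated assumptions: `toy_texts` and `toy_labels` are a made-up corpus (not the review data), and the 2-fold split is chosen only because the corpus is so small. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy corpus (not the review data): 1 = positive, 0 = negative.
toy_texts = [
    "great film love it", "stun perform great act", "love the stori",
    "magnific film great cast", "terribl film bad act", "bad stori not good",
    "wast of time terribl", "not good bad cast",
]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Unigram counts; GaussianNB needs a dense array, so convert once here.
features = CountVectorizer(ngram_range=(1, 1)).fit_transform(toy_texts).toarray()

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "GaussianNB": GaussianNB(),
}

# Mean accuracy over 2 stratified folds for each classifier.
scores = {
    name: cross_val_score(clf, features, toy_labels, cv=2).mean()
    for name, clf in classifiers.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

MultinomialNB models word counts, BernoulliNB models word presence or absence, and GaussianNB treats each count as a continuous value, which is why the first two are usually the more natural fit for count features.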

### Download the training data and load necessary modules

In [1]:
import requests

# Download link for the training reviews CSV
url = 'https://drive.google.com/uc?id=1n73WPNPUNoE9S7GLz0T6FvQ5uaxUq9hg'
response = requests.get(url)

if response.status_code == 200:
    # Save the raw bytes to a local file
    with open('train_reviews.csv', 'wb') as f:
        f.write(response.content)
else:
    print(f'Failed to download file. Status code: {response.status_code}')

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('train_reviews.csv')
df.head()

|   | review | label |
|---|--------|-------|
| 0 | mature intelligent and highly charged melodram... | pos |
| 1 | http://video.google.com/videoplay?docid=211772... | pos |
| 2 | Title: Opera (1987) Director: Dario Argento Ca... | pos |
| 3 | I think a lot of people just wrote this off as... | pos |
| 4 | This is a story of two dogs and a cat looking ... | pos |


In [6]:
df.info(), df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  10000 non-null  object
 1   label   10000 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB


(None,
 review    0
 label     0
 dtype: int64)

In [7]:
reviews_raw_data = df.values
reviews_raw_data.shape

(10000, 2)

In [8]:
reviews_raw_data[0]

array(["mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948. wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it",
       'pos'], dtype=object)

### Splitting train_x, train_y

In [11]:
reviews_rawX = reviews_raw_data[:, :-1]
reviews_rawY = reviews_raw_data[:, -1]

In [12]:
print("raw \n", reviews_rawX[:2])
print(reviews_rawY[:2])

raw 
 [["mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948. wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it"]
 ['http://video.google.com/videoplay?docid=211772166650071408&hl=en Distribution was tried.<br /><br />We opted for mass appeal.<br /><br />We want the best possible viewing range so, we forgo profit and continue our manual labor jobs gladly to entertain you for working yours.<br /><br />View Texas tale, please write about it... If you like it or not, if you like Alex or not, if you like Stuie, Texas or Texas tale... Just write about it.<br /><br />Your opinion rules.']]
['pos' 'pos']


In [13]:
print(reviews_rawX.shape)
print(reviews_rawY.shape)

(10000, 1)
(10000,)


### NLTK modules

In [15]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Processing training reviews - Tokenization, Stopwords_Removal, Stemming

In [16]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

stpwords = set(stopwords.words('english'))
# Negation words carry sentiment, so they are taken out of the stop-word set
negationwords = {"aren't", "can't", "couldn't", "no", "not", "nor", "didn't", "doesn't", "don't", "hadn't", "hasn't",
                 "haven't", "isn't", "mightn't", "mustn't", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "won't", "wouldn't"}
stpwords = stpwords - negationwords

# Split on runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')


def ProcessReview(review):
    """Lowercase a review, strip <br /> tags, tokenize, and drop stop words."""
    review = review.lower()
    review = review.replace("<br />", "")
    tokenized_review = tokenizer.tokenize(review)
    cleaned_review = [w for w in tokenized_review if w not in stpwords]
    return cleaned_review
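
One detail of the `\w+` pattern is worth checking: an apostrophe is not a word character, so a contraction such as `couldn't` is split into `couldn` and `t` and never matches the entries in `negationwords`. Only `no`, `not` and `nor` survive as whole tokens. The sketch below uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer`, which should behave the same way for this pattern. The second pattern, `[\w']+`, is a hypothetical alternative and not something the notebook uses.

```python
import re

text = "i couldn't cut it, but it's not bad"

# Same pattern as the tokenizer above: the apostrophe is not a word
# character, so each contraction breaks into two tokens.
word_tokens = re.findall(r"\w+", text)
print(word_tokens)

# Hypothetical alternative pattern that keeps apostrophes inside tokens.
apostrophe_tokens = re.findall(r"[\w']+", text)
print(apostrophe_tokens)
```

This matches what the processed test review later in the notebook shows: the phrase "couldn't cut it" is reduced to `cut`, with no trace of the negation.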

In [28]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()


def StemmedReviews(reviews_raw):
    """Clean and stem each review; return one space-joined string per review."""
    train_x_cleaned = []

    for review in reviews_raw:
        # Each row is a one-element array, so the text is at index 0
        cleaned_words = ProcessReview(review[0])
        stemmed_words = [ps.stem(w) for w in cleaned_words]
        cleaned_sentence = " ".join(stemmed_words)
        train_x_cleaned.append(cleaned_sentence)

    return train_x_cleaned

train_x_cleaned = StemmedReviews(reviews_rawX)
train_x_cleaned = np.asarray(train_x_cleaned)

train_y = [1 if w == 'pos' else 0 for w in reviews_rawY]

print("\ntrain_x_cleaned: \n", train_x_cleaned[:2])
print("\ntrain_y: ", train_y[:10])


train_x_cleaned: 
 ['matur intellig highli charg melodrama unbelivebl film china 1948 wei wei stun perform catylast love triangl simpli stun oppurun see magnific film take'
 'http video googl com videoplay docid 211772166650071408 hl en distribut tri opt mass appeal want best possibl view rang forgo profit continu manual labor job gladli entertain work view texa tale pleas write like not like alex not like stuie texa texa tale write opinion rule']

train_y:  [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]


In [29]:
sample_text = [["mature intelligent and highly charged melodrama unbelivebly filmed in China in 1948.<br /><br /> wei wei's stunning performance as the catylast in a love triangle is simply stunning if you have the oppurunity to see this magnificent film take it"]]
StemmedReviews(sample_text)

['matur intellig highli charg melodrama unbelivebl film china 1948 wei wei stun perform catylast love triangl simpli stun oppurun see magnific film take']

### Change textual data to numeric data

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 1))

vectorized_train_x = cv.fit_transform(train_x_cleaned).toarray()

print("vectorized_train_x dimension: ", vectorized_train_x.shape)
print("\nvectorized_train_x[0] unique values, counts: ", np.unique(vectorized_train_x[0], return_counts=True))
print("vectorized_train_x[1] unique values, counts: ", np.unique(vectorized_train_x[1], return_counts=True))
print("\nvectorized_train_x: \n", vectorized_train_x[:2])

vectorized_train_x dimension:  (10000, 36320)

vectorized_train_x[0] unique values, counts:  (array([0, 1, 2]), array([36300,    17,     3]))
vectorized_train_x[1] unique values, counts:  (array([0, 1, 2, 3]), array([36282,    32,     4,     2]))

vectorized_train_x: 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
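
The memory limitation mentioned in the method largely comes from `.toarray()`: a dense array of shape (10000, 36320) holds about 363 million entries, roughly 2.9 GB assuming 8-byte integers, although almost all of them are zero. A minimal sketch of the alternative is to keep the SciPy sparse matrix that `fit_transform` returns, since `MultinomialNB` accepts sparse input. The names `toy_texts`, `toy_labels` and `toy_cv` below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for train_x_cleaned / train_y.
toy_texts = ["great film love", "stun perform", "terribl film", "bad act not good"]
toy_labels = np.array([1, 1, 0, 0])

toy_cv = CountVectorizer(ngram_range=(1, 1))
sparse_x = toy_cv.fit_transform(toy_texts)  # SciPy sparse matrix, no .toarray()
dense_x = sparse_x.toarray()                # dense copy, only for comparison

# The sparse matrix stores only the non-zero counts.
print("stored values (sparse):", sparse_x.nnz)
print("stored values (dense): ", dense_x.size)

# MultinomialNB can be fitted on the sparse matrix directly.
sparse_model = MultinomialNB().fit(sparse_x, toy_labels)
dense_model = MultinomialNB().fit(dense_x, toy_labels)

new_x = toy_cv.transform(["great perform", "bad film"])
same = np.array_equal(sparse_model.predict(new_x),
                      dense_model.predict(new_x.toarray()))
print("same predictions:", same)
```

Note that GaussianNB does require a dense array, so this saving applies to the multinomial and Bernoulli variants only.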


In [40]:
train_y = np.asarray(train_y, dtype='int32')
vectorized_train_x.shape, train_y.shape

((10000, 36320), (10000,))

In [41]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

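mnb = MultinomialNB()
mnb.fit(vectorized_train_x, train_y)

# Accuracy on the data the model was trained on (optimistic)
print(mnb.score(vectorized_train_x, train_y))

# Mean accuracy over 5 cross-validation folds (closer to held-out performance)
score = cross_val_score(mnb, vectorized_train_x, train_y, cv=5).mean()
print("Approximate score: ", score)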


0.9214
Approximate score:  0.8423999999999999
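
In the cell above, the vectorizer was fitted on all reviews before cross-validation, so each held-out fold already contributed to the vocabulary. One common way to avoid this, sketched below, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are refitted inside every fold. The corpus and the name `text_clf` are made up for illustration, and the effect on the score reported above is not measured here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for train_x_cleaned / train_y.
toy_texts = np.array([
    "great film love it", "stun perform great act", "love the stori",
    "magnific film great cast", "terribl film bad act", "bad stori not good",
    "wast of time terribl", "not good bad cast",
])
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The vectorizer is refitted on the training part of every fold,
# so held-out reviews never contribute to the vocabulary.
text_clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())

fold_scores = cross_val_score(text_clf, toy_texts, toy_labels, cv=2)
print("fold scores:", fold_scores)
print("mean score: ", fold_scores.mean())

# After a final fit, the pipeline accepts cleaned strings directly.
text_clf.fit(toy_texts, toy_labels)
toy_predictions = text_clf.predict(["great stori", "terribl cast"])
print(toy_predictions)
```

A side benefit is that prediction needs a single `predict` call on cleaned text, with no separate `transform` step.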


### Predicting on test data with MultinomialNB(), which performs better than the other two classifiers

In [49]:
test_x_raw = np.array([["Remember those old kung fu movies we used to watch on Friday and Saturday late nights when our babysitters THOUGHT we were in charge? Well, this movie plays exactly like one of those movies. Patsy Kensit's biggest claim to fame was the love interest to Mel Gibson's character in Lethal Weapon 2, and this performance was one of the reasons why she's never made it big: she's a terrible actress.<br /><br />In Lethal Weapon 2, I thought she was cute. Cute enough to check out some of the other movies she'd been in, including Loves Music, Loves to Dance another big let down, which I, obviously, was not impressed with, either. But, as attractive as she is to my eyes, my soul screamed at me to turn it off because she played another cheap, predictable role, and done it very badly.<br /><br />In this movie, Kensit stars as a comedienne (and not a good one, either) who's working the clubs of France (couldn't cut it in her own homeland, so she's making THEIR ears bleed), who's down on her luck, but, even worse, the French government wants to throw her out because of an expired visa (or maybe they just caught her act). But she gets married to this Casanova (Freiss), who is just as down on his luck, and the predictability begins...terribly! Is there any way to give this movie a NEGATIVE rating? 1 out of 10 stars is over rating it!"]])

test_x = StemmedReviews(test_x_raw)
test_x = np.asarray(test_x)
test_x, test_x.shape

(array(['rememb old kung fu movi use watch friday saturday late night babysitt thought charg well movi play exactli like one movi patsi kensit biggest claim fame love interest mel gibson charact lethal weapon 2 perform one reason never made big terribl actress lethal weapon 2 thought cute cute enough check movi includ love music love danc anoth big let obvious not impress either attract eye soul scream turn play anoth cheap predict role done badli movi kensit star comedienn not good one either work club franc cut homeland make ear bleed luck even wors french govern want throw expir visa mayb caught act get marri casanova freiss luck predict begin terribl way give movi neg rate 1 10 star rate'],
       dtype='<U691'),
 (1,))

### Convert test_x into numeric data and Predict Y values

In [51]:
vectorized_test_x = cv.transform(test_x).toarray()

In [52]:
predicted_test_y = mnb.predict(vectorized_test_x)

In [53]:
test_y = ['pos' if i == 1 else 'neg' for i in predicted_test_y]
test_y = np.asarray(test_y)

In [54]:
df = pd.DataFrame(test_y, columns=["label"])
df.head()

|   | label |
|---|-------|
| 0 | neg |
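
Beyond the hard label, `MultinomialNB` also exposes `predict_proba`, which can indicate how confident a prediction is. The sketch below assumes a made-up toy corpus and the hypothetical names `toy_cv` and `toy_mnb`. It does not reproduce the notebook's fitted model or its probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the cleaned training reviews.
toy_texts = [
    "great film love it", "stun perform great act", "love the stori",
    "magnific film great cast", "terribl film bad act", "bad stori not good",
    "wast of time terribl", "not good bad cast",
]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

toy_cv = CountVectorizer(ngram_range=(1, 1))
toy_mnb = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

# Reuse the fitted vocabulary for new text: transform, never fit_transform.
new_texts = ["terribl act bad stori"]
new_x = toy_cv.transform(new_texts)

predicted = toy_mnb.predict(new_x)
probabilities = toy_mnb.predict_proba(new_x)  # columns follow toy_mnb.classes_

label_names = {1: "pos", 0: "neg"}
predicted_names = [label_names[int(p)] for p in predicted]

for text, name, probs in zip(new_texts, predicted_names, probabilities):
    print(f"{text!r} -> {name}, P(neg)={probs[0]:.3f}, P(pos)={probs[1]:.3f}")
```

Naive Bayes probabilities tend to be pushed towards 0 or 1, so they are better read as a ranking signal than as calibrated confidence.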
