#Scenario: **IMDB Movie Reviews Classification (or Text Classification)**

###**Dataset Description:**

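This is a dataset for binary sentiment classification. The file used here, `imdb_labelled.txt`, holds 748 labelled reviews in two tab-separated columns:

- **Review**: the text of the review
- **Sentiment**: the label, `0` for a negative review and `1` for a positive one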


###**Tasks to be performed:**

- Download the dataset from Dropbox
- Import the required libraries and load the dataset
- Preprocess and clean the dataset
- Split the dataset into training and test sets using the `train_test_split` function from sklearn
- Create a `LinearSVC` classifier and fit the model
- Evaluate the model


###**Downloading the dataset from Dropbox**

In [1]:
!wget https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt

--2023-05-20 04:21:30--  https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/dctsk9k67x2jgnb/imdb_labelled.txt [following]
--2023-05-20 04:21:30--  https://www.dropbox.com/s/raw/dctsk9k67x2jgnb/imdb_labelled.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uced36f1f4ebe45c3c4fdfebcaf5.dl.dropboxusercontent.com/cd/0/inline/B8bd8GHlqPWwOiCn9pA0Q1KIfWw4WgZX9jhueP_TOZb6j43WjWQ4kyY5tx3oPwzxJsoXeR7LdQmFfVPnzRp14CEacueH7BFUGDpML19c7gb3dASA36oge3gXg8yYdtgPkMRM0MwqeZf5Udf_Q5ur3VCv5DSXFqAbqxe6d8xaDYgmVA/file# [following]
--2023-05-20 04:21:31--  https://uced36f1f4ebe45c3c4fdfebcaf5.dl.dropboxusercontent.com/cd/0/inline/B8bd8GHlqPWwOiCn9pA0Q1KIfWw4WgZX9jhueP_TOZb6j43WjWQ4kyY5tx3oPwzxJ

###**Importing required libraries and loading the dataset**

In [2]:
import pandas as pd
import spacy 
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

nlp = spacy.load('en_core_web_sm')

In [3]:
data_imdb = pd.read_csv('/content/imdb_labelled.txt', names=['Review', 'Sentiment'], sep = '\t', header = None)
print(data_imdb.shape)
data_imdb.head()

(748, 2)


|  | Review | Sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [4]:
data_imdb.Sentiment.value_counts()

1    386
0    362
Name: Sentiment, dtype: int64

###**Data pre-processing** 

In [5]:
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
punct = string.punctuation
print(len(stopwords))
print(stopwords)
print(punct)

326
['‘re', 'ever', 'first', 'make', 'an', 'their', 'part', 'take', 're', 'quite', 'indeed', 'n’t', 'therein', 'well', 'moreover', 'please', 'when', 'or', 'thereby', 'across', 'each', 'made', 'per', 'at', 'because', 'us', 'otherwise', 'until', 'whoever', 'serious', 'hereafter', 'twenty', 'sometimes', 'which', 'cannot', '’re', 'would', 'ten', 'but', 'was', 'empty', 'so', 'full', 'anyone', 'behind', 'here', 'whereas', 'much', 'could', 'herein', 'everywhere', 'whom', 'seem', 'enough', 'say', 'meanwhile', 'often', 'will', 'themselves', 'done', 'why', 'your', '‘ll', 'former', 'put', 'where', 'even', 'whenever', 'keep', 'while', 'see', 'i', 'both', 'besides', 'only', 'upon', 'already', 'such', 'been', 'used', 'through', 'himself', 'from', 'me', 'nor', 'what', 'thus', 'really', 'therefore', 'during', 'someone', 'again', 'into', 'ours', 'anything', 'that', 'less', 'along', 'latter', 'not', 'against', 'is', 'its', 'hence', 'up', 'seemed', 'these', '‘s', 'between', 'whereby', 'for', '‘d', 'hers'

In [6]:
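def text_data_cleaning(sentence):
    """Tokenize a review with spaCy and return lowercased lemmas, minus stop words and punctuation."""
    tokenCollection = nlp(sentence)

    cleaned_tokens = []
    for tokenObj in tokenCollection:
        # spaCy v2 lemmatizes every pronoun to "-PRON-"; keep the original word in that case.
        if tokenObj.lemma_ == "-PRON-":
            word = tokenObj.lower_
        else:
            word = tokenObj.lemma_.lower().strip()
        # Keep the token only if it is neither a stop word nor punctuation.
        # `punct` is a string, so `in` is a substring test, which also rejects empty tokens.
        if (word not in stopwords) and (word not in punct):
            cleaned_tokens.append(word)
    return cleaned_tokens

# Use the spaCy-based cleaner as the tokenizer of the TF-IDF vectorizer.
tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)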
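
For intuition about what the `tfidf` step computes, here is a minimal pure-Python sketch of the scores `TfidfVectorizer` produces with its default settings (smoothed IDF, L2-normalized rows). The helper name `tfidf_rows` and the two toy documents are made up for illustration; in the notebook the vectorizer itself does this work.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF, matching scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {term: counts[term] * idf[term] for term in counts}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Toy, already-tokenized documents (not taken from the dataset).
docs = [["slow", "movie"], ["great", "movie"]]
rows = tfidf_rows(docs)
print(rows[0])
```

The term `movie` occurs in both toy documents, so it receives a lower weight than `slow`, which occurs in only one.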

In [None]:
# print(data_imdb.loc[0,'Review'])
# tokenCollection = nlp(data_imdb.loc[0,'Review'])
# type(tokenCollection)

# for tokenObj in tokenCollection:
#   print(tokenObj.text, tokenObj.lemma_, tokenObj.is_alpha, tokenObj.is_punct)

# text_data_cleaning(data_imdb.loc[0,'Review'])

###**Split the data set into training and testing set using the train_test_split function from sklearn**

In [7]:
X = data_imdb['Review']
y = data_imdb['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, X_test.shape

((598,), (150,))
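
The shapes above follow from how a fractional `test_size` is turned into row counts: scikit-learn rounds the test share up and gives the remaining rows to the training set. A quick arithmetic check in plain Python (the variable names are illustrative):

```python
import math

n_samples = 748   # rows in data_imdb
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test share is rounded up
n_train = n_samples - n_test               # the rest goes to training
print(n_train, n_test)
```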

###**Creating the pipeline and fitting the model**
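- The pipeline has two steps:
  - `tfidf`: converts each review into a TF-IDF feature vector, using `text_data_cleaning` as the tokenizer
  - `svm`: a `LinearSVC` classifier trained on those vectors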

In [8]:
from sklearn.svm import LinearSVC
# classifier = LinearSVC()
svm = LinearSVC()

# clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])
clf = Pipeline([('tfidf', tfidf), ('svm', svm)])

In [9]:
clf.fit(X_train, y_train)



In [10]:
y_pred = clf.predict(X_test)

In [11]:
print(y_pred)

[1 1 0 0 1 0 1 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 1 0 0 1
 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1
 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0
 0 1 1 0 0 1 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0
 1 0]


In [12]:
print(y_test)

580    0
356    0
133    0
250    1
299    1
      ..
627    1
90     1
642    0
683    1
69     0
Name: Sentiment, Length: 150, dtype: int64


###**Evaluating the model**

In [13]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.82      0.82        76
           1       0.81      0.82      0.82        74

    accuracy                           0.82       150
   macro avg       0.82      0.82      0.82       150
weighted avg       0.82      0.82      0.82       150



In [14]:
confusion_matrix(y_test, y_pred)

array([[62, 14],
       [13, 61]])
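
The figures in the classification report can be recomputed by hand from this confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix reported above: rows = true labels, columns = predictions.
cm = [[62, 14],
      [13, 61]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)

precision_neg = tn / (tn + fn)  # of everything predicted 0, how much was truly 0
recall_neg = tn / (tn + fp)     # of everything truly 0, how much was predicted 0
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

print(f"accuracy      {accuracy:.2f}")
print(f"class 0  precision {precision_neg:.2f}  recall {recall_neg:.2f}")
print(f"class 1  precision {precision_pos:.2f}  recall {recall_pos:.2f}")
```

The rounded values agree with the report: accuracy 0.82, class 0 precision 0.83 and recall 0.82, class 1 precision 0.81 and recall 0.82.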

In [15]:
clf.predict(['Wow, this sucks'])

array([0])

In [16]:
clf.predict(['Worth watching the movie. Please like it'])

array([1])

In [17]:
clf.predict(['Loved it. amazing'])

array([1])

In [22]:
clf.predict(['The movie was not good'])

array([1])
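
The positive prediction for this negated sentence is plausibly a side effect of the cleaning step: `not` appears in the spaCy stop-word list printed earlier, so it is dropped before vectorization. The sketch below imitates that filtering with a plain whitespace split and a tiny hand-picked subset of stop words. Both are simplifications, since the real pipeline works on spaCy tokens and lemmas, and the names `stop_subset` and `drop_stop_words` are made up for illustration.

```python
# Tiny hand-picked subset of spaCy's English stop words, for illustration only.
stop_subset = {"the", "was", "not"}

def drop_stop_words(sentence):
    """Lowercase, split on whitespace, and drop stop words (simplified cleaner)."""
    return [word for word in sentence.lower().split() if word not in stop_subset]

print(drop_stop_words("The movie was not good"))  # ['movie', 'good']
print(drop_stop_words("The movie was good"))      # ['movie', 'good']
```

Both sentences reduce to the same tokens, so a model trained on these features cannot tell them apart. Keeping negation words such as `not` out of the stop-word list before building the vectorizer is one option worth trying.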

In [23]:
rv = '''
I was reluctantly dragged into the theater, thinking that they didn't need to make a Top Gun 2 and that the first one was where that story needed to end.

I could write a couple paragraphs to summarize my feelings after walking out of the theater, but I'm going to leave it with just one sentence.

I was wrong.
'''
clf.predict([rv])

array([0])