# Preparing The Data

## Reading the dataset

In [1]:
!gdown "1WMfGVZ6W3EWTI8HI2DzcJjIpcl1bp9xf"

Downloading...
From: https://drive.google.com/uc?id=1WMfGVZ6W3EWTI8HI2DzcJjIpcl1bp9xf
To: /content/IMDB Dataset.csv
100% 66.2M/66.2M [00:00<00:00, 79.2MB/s]


In [2]:
import pandas as pd

df_review = pd.read_csv('IMDB Dataset.csv')

In [3]:
df_review

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [4]:
df_positive = df_review[df_review['sentiment']=='positive'][:1000]
df_negative = df_review[df_review['sentiment']=='negative'][:1000]

df_review_bal = pd.concat([df_positive, df_negative])

In [5]:
df_review_bal['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

## Preprocessing


* Removing punctuation such as . , ! $ ( ) * % @
* Removing URLs
* Removing Stop words
* Lower casing
* Tokenization
* Lemmatization


In [6]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [7]:
import re
import string

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def process_data(corpus):
    """Clean every review in `corpus` and return the results as a pandas Series."""
    # Build these once instead of once per review
    puncs = string.punctuation
    stopwords = set(nltk.corpus.stopwords.words('english'))
    wordnet_lemmatizer = WordNetLemmatizer()

    data = []
    for sent in corpus:
        # remove the HTML line breaks (<br />)
        sent = sent.replace('<br />', '')

        # remove links, then split the text into tokens
        sent = re.sub(r'http\S+', '', sent)
        tokens = word_tokenize(sent)

        # remove punctuation tokens
        tokens = [x for x in tokens if x not in puncs]

        # remove stop words (tokens are not lowercased yet,
        # so capitalised forms such as 'The' are kept)
        tokens = [x for x in tokens if x not in stopwords]

        # lemmatization
        tokens = [wordnet_lemmatizer.lemmatize(x) for x in tokens]

        # lowercase and join the tokens back into one string
        data.append(" ".join(x.lower() for x in tokens))

    return pd.Series(data)

In [8]:
original_data = df_review_bal['review']
processed_data = process_data(original_data)

In [9]:
original_data[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [10]:
type(original_data)

pandas.core.series.Series

In [11]:
processed_data[1]

"a wonderful little production the filming technique unassuming- old-time-bbc fashion give comforting sometimes discomforting sense realism entire piece the actor extremely well chosen- michael sheen `` got polari '' voice pat you truly see seamless editing guided reference williams diary entry well worth watching terrificly written performed piece a masterful production one great master 's comedy life the realism really come home little thing fantasy guard rather use traditional 'dream technique remains solid disappears it play knowledge sens particularly scene concerning orton halliwell set particularly flat halliwell 's mural decorating every surface terribly well done"

In [12]:
type(processed_data)

pandas.core.series.Series
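
The processed review shown earlier in this section still contains words such as `a`, `the`, `you` and `it`. A plausible explanation is that `process_data` filters stop words before lowercasing, so capitalised forms like `The` never match the lowercase stop-word list. The sketch below illustrates the effect under simplifying assumptions: `STOP_WORDS` is a tiny hypothetical set and `str.split` stands in for `word_tokenize`, so neither matches the NLTK resources the notebook actually uses.

```python
# Hypothetical, tiny stop-word set used only for illustration;
# the notebook uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"the", "a", "it", "is", "you"}


def filter_then_lower(text):
    """Same order as the notebook: drop stop words, then lowercase."""
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [t.lower() for t in tokens]


def lower_then_filter(text):
    """Alternative order: lowercase first, then drop stop words."""
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]


sample = "The realism really comes home"
print(filter_then_lower(sample))  # ['the', 'realism', 'really', 'comes', 'home']
print(lower_then_filter(sample))  # ['realism', 'really', 'comes', 'home']
```

Because `TfidfVectorizer(stop_words='english')` lowercases and filters stop words again later in the notebook, the effect on the final features is probably small.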

## Encoding labels

In [13]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df_review_bal['sentiment'])

LabelEncoder()

In [14]:
le.classes_

array(['negative', 'positive'], dtype=object)

In [15]:
encoded_labels = le.transform(df_review_bal['sentiment'])
encoded_labels

array([1, 1, 1, ..., 0, 0, 0])

In [16]:
le.inverse_transform(encoded_labels)

array(['positive', 'positive', 'positive', ..., 'negative', 'negative',
       'negative'], dtype=object)
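
To make the mapping above concrete, here is a minimal pure-Python sketch of what the encoder does conceptually: the classes are sorted, and each label is replaced by its position in that sorted list. The function names are hypothetical helpers, not part of scikit-learn.

```python
def fit_classes(labels):
    """Return the sorted unique labels (the role played by le.classes_)."""
    return sorted(set(labels))


def encode(classes, labels):
    """Map each label to its index in the sorted class list."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]


def decode(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


labels = ["positive", "positive", "negative"]
classes = fit_classes(labels)
codes = encode(classes, labels)

print(classes)                 # ['negative', 'positive']
print(codes)                   # [1, 1, 0]
print(decode(classes, codes))  # ['positive', 'positive', 'negative']
```

This is why `negative` becomes 0 and `positive` becomes 1: the order is alphabetical, not the order in which the labels appear in the data.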

## Splitting data into train and test set

In [17]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

In [22]:
train_x, train_y = process_data(train['review']), pd.Series(le.transform(train['sentiment']))
test_x, test_y = process_data(test['review']), pd.Series(le.transform(test['sentiment']))

In [23]:
train_x

0       the golden door story sicilian family 's journ...
1       this movie start hilarious 15 second mark cont...
2       the plot death little child hopper one investi...
3       the three short included compilation issued 19...
4       the hills have eyes ii would expect nothing of...
                              ...                        
1335    apparently the mutilation man guy wanders land...
1336    this movie pretty much sucked i 'm army soldie...
1337    it 's unbelievable fourth better second third ...
1338    `` zzzzzzzzzzzzzzzzzz '' if imdb would allow o...
1339    when opening shot u.s. marines seriously disre...
Length: 1340, dtype: object

In [24]:
train_y

0       1
1       1
2       0
3       1
4       0
       ..
1335    0
1336    0
1337    1
1338    0
1339    0
Length: 1340, dtype: int64
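
The `Length: 1340` shown above follows from the split arithmetic. A small sketch, assuming the test share is rounded up and the remainder goes to the training set (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1000 positive + 1000 negative reviews, test_size=0.33
print(split_sizes(2000, 0.33))  # (1340, 660)
```

The 660 test reviews match the `support` totals in the classification reports further down.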

# Text Representation (Bag of Words)

## Turning our text data into numerical vectors

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
train_x_vector

<1340x18854 sparse matrix of type '<class 'numpy.float64'>'
	with 110510 stored elements in Compressed Sparse Row format>

In [26]:
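# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
pd.DataFrame.sparse.from_spmatrix(train_x_vector, index=train_x.index, columns=tfidf.get_feature_names_out())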



Unnamed: 0,00,000,007,02,06,08,10,100,1000,100th,...,zp,zu,zuber,zucker,zulu,zwick,zzzzzzzzzzzzzzzzzz,æon,élan,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.048952,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.111927,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1335,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1336,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1337,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1338,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.18721,0.0,0.0,0.0


In [27]:
test_x_vector = tfidf.transform(test_x)
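
The cells above fit the vectorizer on the training text only and then reuse it on the test text. The sketch below illustrates that fit/transform split in pure Python, assuming the default smoothed weighting `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation. It is a simplification: `fit_idf` and `tfidf_vector` are hypothetical helpers, tokens come from `str.split`, and no stop words are removed.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from training docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Weight a document with a fitted IDF table; unseen terms are ignored."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train_docs = ["great movie", "terrible movie", "great acting"]
idf = fit_idf(train_docs)

# 'unseen' is not in the training vocabulary, so it gets no weight
vec = tfidf_vector("great unseen movie", idf)
print(vec)
```

Terms that appear only in the test text are dropped, which is why the test matrix keeps the same columns as the training matrix.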

# Model Selection

## SVM

In [54]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

SVC(kernel='linear')

In [29]:
print(svc.predict(tfidf.transform(['A good movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))

[1]
[1]
[0]


## Decision Tree

In [30]:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

DecisionTreeClassifier()

## Naive Bayes

In [31]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)

GaussianNB()

## Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)

LogisticRegression()

# Model Evaluation

## Mean Accuracy

In [55]:
# score(X, y) takes the test samples and their true labels and returns the mean accuracy

print('SVM: ', svc.score(test_x_vector, test_y))
print('Decision tree: ', dec_tree.score(test_x_vector, test_y))
print('Naive Bayes: ', gnb.score(test_x_vector.toarray(), test_y))
print('Logistic Regression: ', log_reg.score(test_x_vector, test_y))

SVM:  0.8424242424242424
Decision tree:  0.703030303030303
Naive Bayes:  0.6333333333333333
Logistic Regression:  0.8166666666666667


## F1 Score

In [37]:
# F1 Score = 2*(Recall * Precision) / (Recall + Precision)
# F1 score reaches its best value at 1 and worst score at 0.

from sklearn.metrics import f1_score

f1_score(test_y, svc.predict(test_x_vector), average=None)

array([0.8369906 , 0.84750733])

## Classification report

In [42]:
# Letícia's favorite
from sklearn.metrics import classification_report

y_true = le.inverse_transform(test_y)
y_pred = le.inverse_transform(svc.predict(test_x_vector))

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

    negative       0.88      0.80      0.84       335
    positive       0.81      0.89      0.85       325

    accuracy                           0.84       660
   macro avg       0.85      0.84      0.84       660
weighted avg       0.85      0.84      0.84       660



## Confusion Matrix

In [45]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_true, y_pred)
conf_mat

array([[267,  68],
       [ 36, 289]])
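
The accuracy, precision, recall and F1 values reported earlier can all be recovered from this matrix. A minimal sketch, assuming the usual layout in which rows are true labels and columns are predicted labels, ordered as `negative`, `positive` (`per_class_metrics` is a hypothetical helper):

```python
# Confusion matrix from the cell above:
# rows = true class, columns = predicted class, order = [negative, positive]
conf_mat = [[267, 68],
            [36, 289]]


def per_class_metrics(matrix, k):
    """Return (precision, recall, f1) for the class at index k."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp  # predicted k, but true class differs
    fn = sum(matrix[k]) - tp                 # true k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in conf_mat)
accuracy = sum(conf_mat[i][i] for i in range(len(conf_mat))) / total
print(round(accuracy, 4))  # 0.8424

for k, name in enumerate(["negative", "positive"]):
    precision, recall, f1 = per_class_metrics(conf_mat, k)
    print(name, round(precision, 2), round(recall, 2), round(f1, 4))
```

The accuracy agrees with the SVM score of 0.8424, and the per-class F1 values agree with the `f1_score` output of roughly 0.837 and 0.8475.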

# Tuning the Model

## GridSearchCV

In [46]:
from sklearn.model_selection import GridSearchCV

# set the parameters
parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)

svc_grid.fit(train_x_vector, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})
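
To get a feel for how much work this search does, the sketch below enumerates the parameter combinations with `itertools.product`. It only illustrates the size of the grid; the order in which scikit-learn visits the candidates may differ.

```python
from itertools import product

parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
n_folds = 5  # cv=5 in the cell above

# Every combination of one C value with one kernel
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

print(len(candidates))            # 10
print(len(candidates) * n_folds)  # 50
print(candidates[:2])             # [{'C': 1, 'kernel': 'linear'}, {'C': 1, 'kernel': 'rbf'}]
```

Each of the 10 candidates is trained and scored once per fold, giving 50 fits, plus one final fit of the best candidate on the whole training set when `refit` is left at its default.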

In [47]:
print(svc_grid.best_params_)
print(svc_grid.best_estimator_)

{'C': 4, 'kernel': 'rbf'}
SVC(C=4)


In [48]:
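# Retrain with the best parameters found above ('rbf' is the default kernel of SVC).
# svc_grid.best_estimator_ holds the same model, already refit on the full training set.
svc = SVC(C=4)
svc.fit(train_x_vector, train_y)
print('SVM: ', svc.score(test_x_vector, test_y))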

SVM:  0.8348484848484848


In [52]:
y_true = le.inverse_transform(test_y)
y_pred = le.inverse_transform(svc.predict(test_x_vector))
print(classification_report(y_true, y_pred, labels=['positive', 'negative']))

              precision    recall  f1-score   support

    positive       0.81      0.87      0.84       325
    negative       0.87      0.80      0.83       335

    accuracy                           0.83       660
   macro avg       0.84      0.84      0.83       660
weighted avg       0.84      0.83      0.83       660



In [53]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_true, y_pred, labels=['positive', 'negative'])
conf_mat

array([[284,  41],
       [ 68, 267]])
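
Because this matrix was built with `labels=['positive', 'negative']`, its rows and columns are in the opposite order from the earlier confusion matrix. The sketch below reorders it so the two can be compared directly; `reorder` is a hypothetical helper, and the numbers are copied from the outputs in this notebook.

```python
def reorder(matrix, order):
    """Permute rows and columns of a square confusion matrix."""
    return [[matrix[i][j] for j in order] for i in order]


baseline = [[267, 68], [36, 289]]         # linear SVC, order: negative, positive
tuned_pos_first = [[284, 41], [68, 267]]  # SVC(C=4), order: positive, negative

tuned = reorder(tuned_pos_first, [1, 0])  # back to negative, positive
print(tuned)  # [[267, 68], [41, 284]]

tuned_accuracy = (tuned[0][0] + tuned[1][1]) / sum(map(sum, tuned))
print(round(tuned_accuracy, 4))  # 0.8348
```

In the common order, the negative row is identical for both models, while the tuned model misclassifies 41 positive reviews instead of 36. That accounts for the slightly lower test accuracy (0.8348 versus 0.8424), even though the grid search preferred these parameters under cross-validation on the training set.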