# Part II: Modelling in Python

## II.a Processing the reviews and obtaining final vocabulary

First, import all libraries:

In [1]:
import pandas as pd
import sklearn
import string
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras import metrics
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

Then, import and load the vocabulary with 1005 words to a list (the number of terms will be trimmed down to 980 later):

In [2]:
words = pd.read_csv("myvocab2.csv")
word_list = list(words['x'])

Now, pre-process the vocabulary terms: replace underscores with spaces, remove all punctuation, and convert everything to lowercase.

In [3]:
vocabR = []
for word in word_list:
    vocabR.append(word.translate(str.maketrans('_',' ','')).lower())
vocab2R = []
for word in vocabR:
    vocab2R.append(word.translate(str.maketrans('','',string.punctuation)).lower())
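
As a quick check of the two translation steps above, here is a minimal sketch on a few made-up terms. The sample terms are illustrative assumptions and are not taken from "myvocab2.csv".

```python
import string

# Hypothetical vocabulary terms (not from myvocab2.csv)
sample_terms = ["not_funny", "waste_of_time", "one_of_the_best", "don't"]

# Translation tables: underscores become spaces, then punctuation is deleted
underscore_to_space = str.maketrans('_', ' ')
strip_punctuation = str.maketrans('', '', string.punctuation)

cleaned_terms = [
    term.translate(underscore_to_space).translate(strip_punctuation).lower()
    for term in sample_terms
]
print(cleaned_terms)
```

The order of the two steps matters: the underscore is itself part of string.punctuation, so removing punctuation first would glue the words of a multi-word term together.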

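Now, we read all review texts from "alldata.tsv" and remove punctuation. Once punctuation is removed, HTML line-break tags leave a stray "br" that is often glued to the end of the preceding word. Since it is HTML code and not part of the text, we also strip a trailing "br" from every word.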

In [4]:
df = pd.read_table("alldata.tsv")
sentiment_list = list(df['sentiment'])
corpus = []

for document in df['review']:
    # Strip punctuation and lowercase the review
    clean = document.translate(str.maketrans('','',string.punctuation)).lower()
    tokens = clean.split()
    filtered = []
    for word in tokens:
        # Drop a trailing "br" left over from HTML line breaks
        if word.endswith('br'):
            word = word[:-2]
        filtered.append(word)
    new_string = ' '.join(filtered)
    corpus.append(new_string)
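
To see what this cleaning does to a single review, here is a small sketch that wraps the same steps in a helper. The function name `clean_review` and the sample sentence are assumptions made for illustration.

```python
import string

def clean_review(document):
    """Lowercase, strip punctuation, and drop a trailing 'br' from each word."""
    clean = document.translate(str.maketrans('', '', string.punctuation)).lower()
    filtered = []
    for word in clean.split():
        if word.endswith('br'):
            word = word[:-2]
        filtered.append(word)
    return ' '.join(filtered)

# Made-up review text containing HTML line breaks
sample = "It was great.<br /><br />The end!"
print(repr(clean_review(sample)))
```

A word that consists only of "br" becomes an empty string, which leaves a double space in the joined text. This should be harmless here, because the vectorizer tokenizes the text again. Note also that a genuine word ending in "br" would lose its last two letters.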

Now, we vectorize all reviews. Each review is mapped to a vector with 18613240 features, one for each distinct N-gram (N = 1 to 4) found in the corpus. The value of each feature is the number of times that N-gram occurs in the review.

In [5]:
vectorizer = CountVectorizer(ngram_range = (1,4))
Xc = vectorizer.fit_transform(corpus)
names = vectorizer.get_feature_names_out()

Now, from these 18613240 features, we will only keep the ones that are inside the vocabulary found in R:

In [6]:
vocab_set = set(vocab2R)  # set membership is much faster than scanning a list
inds = []

for i in range(len(names)):
    if names[i] in vocab_set:
        inds.append(i)
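
As a small illustration of how the selected indices are used, the sketch below filters a toy feature list and then slices the matching columns out of a toy count matrix. All names and values here are assumptions made up for the example.

```python
import numpy as np

# Toy feature names and a hypothetical vocabulary
toy_names = np.array(["bad", "bad movie", "good", "good movie", "movie"], dtype=object)
toy_vocab = {"bad movie", "good"}

# Column indices of the features that appear in the vocabulary
toy_inds = [i for i, name in enumerate(toy_names) if name in toy_vocab]

# Toy count matrix: 2 reviews x 5 features
toy_counts = np.array([[0, 0, 1, 1, 1],
                       [1, 1, 0, 0, 1]])

# Keep only the vocabulary columns
toy_X = toy_counts[:, toy_inds]
print(toy_inds)
print(toy_names[toy_inds])
print(toy_X)
```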

The number of terms of the vocabulary in Python is 980:

In [7]:
print('The Python dictionary has',len(names[inds]),'terms')


The Python dictionary has 980 terms


Let's save the final vocabulary to a csv file:

In [8]:
final_vocab = pd.DataFrame(names[inds])
final_vocab.to_csv('final_vocab.csv', index = False)

## II.b Neural Network model

The Python model is a neural network. To run the final model, we need the inputs X and outputs y.

The inputs are the columns in the vectorized review matrix which correspond to the terms in the vocabulary. The output is the sentiment of each review. Later we split X and y into train and test data.

Both X and y are transformed to numpy arrays:

In [9]:
X = np.array(Xc[:,inds].toarray())
y = np.array(sentiment_list)

Now let's separate X and y into train and test data. Let's take the first split, which is found by taking the review indices in the first column of 'project3_splits.csv':


In [10]:
train_test_df = pd.read_csv('project3_splits.csv')
test_ind = np.array(list(train_test_df['split_1']))-1
train_ind = np.array(list(set(list(range(0,50000)))-set(test_ind)))
traindata = X[train_ind]
trainlabels = y[train_ind]
testdata = X[test_ind]
testlabels = y[test_ind]
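
The index arithmetic in this cell can be sketched on a toy example. The subtraction of 1 suggests that the split file stores 1-based review IDs, which is treated as an assumption here. The IDs and sizes below are made up.

```python
import numpy as np

n_reviews = 10                               # toy size; the real data has 50000 reviews
test_ids_one_based = np.array([2, 5, 9])     # hypothetical IDs as they might appear in the split file

# Convert to 0-based row positions
test_idx = test_ids_one_based - 1

# Every row that is not a test row becomes a training row (result is sorted)
train_idx = np.setdiff1d(np.arange(n_reviews), test_idx)

print(test_idx)
print(train_idx)
```

Using np.setdiff1d returns the training indices in sorted order, whereas converting a Python set back to a list does not guarantee any particular order.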

Finally, let's run the neural network (NN) model. It has one hidden layer with 20 units and ReLU activation. The output is a single number between 0 and 1, produced by the sigmoid function. We train each NN for 20 epochs. Each split takes only a few seconds to run.

In [11]:
model = Sequential()
model.add(Dense(20, input_shape=(980,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', metrics=[metrics.AUC()])
model.fit(traindata, trainlabels, epochs=20, verbose=2)

Epoch 1/20
782/782 - 1s - loss: 0.3181 - auc: 0.9429 - 1s/epoch - 2ms/step
Epoch 2/20
782/782 - 1s - loss: 0.2420 - auc: 0.9648 - 641ms/epoch - 820us/step
Epoch 3/20
782/782 - 1s - loss: 0.2337 - auc: 0.9673 - 642ms/epoch - 822us/step
Epoch 4/20
782/782 - 1s - loss: 0.2293 - auc: 0.9685 - 632ms/epoch - 808us/step
Epoch 5/20
782/782 - 1s - loss: 0.2263 - auc: 0.9691 - 648ms/epoch - 829us/step
Epoch 6/20
782/782 - 1s - loss: 0.2230 - auc: 0.9698 - 643ms/epoch - 822us/step
Epoch 7/20
782/782 - 1s - loss: 0.2211 - auc: 0.9704 - 634ms/epoch - 811us/step
Epoch 8/20
782/782 - 1s - loss: 0.2191 - auc: 0.9709 - 638ms/epoch - 816us/step
Epoch 9/20
782/782 - 1s - loss: 0.2179 - auc: 0.9712 - 648ms/epoch - 828us/step
Epoch 10/20
782/782 - 1s - loss: 0.2169 - auc: 0.9714 - 651ms/epoch - 833us/step
Epoch 11/20
782/782 - 1s - loss: 0.2156 - auc: 0.9717 - 650ms/epoch - 831us/step
Epoch 12/20
782/782 - 1s - loss: 0.2144 - auc: 0.9720 - 657ms/epoch - 840us/step
Epoch 13/20
782/782 - 1s - loss: 0.2136 - 

<keras.callbacks.History at 0x1dc71698a30>
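
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass that such a network computes: 980 inputs, 20 ReLU units, and one sigmoid output. The weights are random stand-ins, not the trained Keras weights, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 980, 20

# Randomly initialised parameters stand in for the trained ones
W1 = rng.normal(scale=0.05, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)    # hidden layer with ReLU activation
    logits = hidden @ W2 + b2                # single output unit
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid maps the output into (0, 1)

X_toy = rng.integers(0, 3, size=(4, n_features))   # four fake count vectors
probs = forward(X_toy)

# Total number of trainable parameters for this layout
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

With this layout the model has 980 x 20 + 20 weights and biases in the hidden layer and 20 + 1 in the output layer, 19641 parameters in total, which is small enough to train quickly.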

Finally, let's evaluate the AUC of the trained NN on the test data:


In [12]:
Ypred = model.predict(testdata)
print(roc_auc_score(testlabels, Ypred))

0.9658117636913647
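
For intuition about the metric, the AUC equals the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all pairs on a toy example. The helper `pairwise_auc` is a made-up name for illustration, and the notebook itself uses roc_auc_score.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]                   # every positive vs every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()    # ties count as half a win
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

This pairwise version needs memory proportional to the number of positive-negative pairs, so it is only suitable for small examples.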


For all 5 splits, the AUC is greater than 0.96.

