I found a spreadsheet containing 480,000 reviews from Rotten Tomatoes. My goal here is to build a model that reads a review and predicts whether it is positive (fresh) or negative (rotten).

In [1]:
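# Import data from the CSV file using pandas
import pandas as pd
# Note: error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines='skip' instead
data = pd.read_csv("rt_reviews.csv", engine='python', encoding='ISO-8859-1', error_bad_lines=False)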


In [2]:
data

Unnamed: 0,Freshness,Review
0,fresh,"Manakamana doesn't answer any questions, yet ..."
1,fresh,Wilfully offensive and powered by a chest-thu...
2,rotten,It would be difficult to imagine material mor...
3,rotten,Despite the gusto its star brings to the role...
4,rotten,If there was a good idea at the core of this ...
...,...,...
479995,rotten,Zemeckis seems unable to admit that the motio...
479996,fresh,Movies like The Kids Are All Right -- beautif...
479997,rotten,Film-savvy audiences soon will catch onto Win...
479998,fresh,An odd yet enjoyable film.


In [6]:
import numpy as np
np.unique(data.Freshness)

array(['fresh', 'rotten'], dtype=object)

The data is fairly clean: all reviews seem to have a valid label (fresh or rotten). Some reviews might contain special characters or out-of-vocabulary words; these are handled later by the built-in TensorFlow tokenizer.
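A quick way to back up the "valid label" observation is to count the labels and look for empty review texts. The sketch below runs on a few made-up rows rather than the real CSV; the helper name `label_report` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def label_report(rows, valid_labels=("fresh", "rotten")):
    """Count labels and list rows with an unknown label or empty review text."""
    counts = Counter(label for label, _ in rows)
    bad_rows = [
        i for i, (label, text) in enumerate(rows)
        if label not in valid_labels or not isinstance(text, str) or not text.strip()
    ]
    return counts, bad_rows

# Toy (label, review) rows, invented for illustration
toy_rows = [
    ("fresh", "Enjoyable film."),
    ("rotten", "Long and boring."),
    ("fresh", ""),                 # empty review text
    ("rotton", "Misspelled label"),  # invalid label
]
counts, bad_rows = label_report(toy_rows)
print(counts, bad_rows)
```

The same counts also show whether the two classes are balanced, which matters when accuracy is used as the metric later on.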

Data preprocessing

In [7]:
## Function that returns the length of the longest sequence (most tokens)
def longest_sentece(sentences):
    max_l = 0    
    for s in sentences:
        length = len(s)
        if max_l < length:
            max_l = length
    return max_l
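The loop above can also be written with the built-in max(). This is only a compact equivalent under a different, hypothetical name (`longest_sequence_length`). Note that the length is counted in list items, so it only means "number of words" when it is applied to tokenized sequences rather than raw strings.

```python
def longest_sequence_length(sequences):
    """Return the number of items in the longest sequence (0 if there are none)."""
    return max((len(s) for s in sequences), default=0)

# Three toy token sequences of length 2, 3 and 0
print(longest_sequence_length([[1, 2], [3, 4, 5], []]))
```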

Text classification problems are usually tackled with word embeddings.

In [8]:
## First we need to tokenize all words in our reviews. Each unique word will be assigned a unique integer
## We can use the Tokenizer class from TensorFlow

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [9]:
# Split our data into sentences and labels
sentences,labels = data.Review, data.Freshness

from sklearn import preprocessing
## We need to label positive reviews as 0 and negative reviews as 1
## We can use the LabelEncoder class from sklearn (it sorts the class names, so fresh -> 0 and rotten -> 1)

le = preprocessing.LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

# Data cleaning and tokenization
## We want to remove all special characters, turn to lowercase, and replace all out-of-vocabulary words with <OOV>
tokenizer = Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token="<OOV>",
    document_count=0)
tokenizer.fit_on_texts(sentences)

# Turning text to int sequences
sequences = tokenizer.texts_to_sequences(sentences)
max_len = longest_sentece(sequences)

# Pad all sequences to have the same length (add zeros to the end)
padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded[4])

[   39    60    69     3    52   342    27     2   919     5    14    16
    17    77  2400     8    18 29790  2926     5 11733   362  1025  5201
   126  7114     4     3   824  2927 40291   139     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0]
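LabelEncoder assigns integers to the sorted class names, which is why fresh becomes 0 and rotten becomes 1 in the cell above. A minimal pure-Python sketch of that mapping follows; the helper name `encode_labels` is hypothetical and this is not scikit-learn's implementation.

```python
def encode_labels(labels):
    """Map each label to the position of its class name in sorted order."""
    classes = sorted(set(labels))
    class_to_id = {name: i for i, name in enumerate(classes)}
    return [class_to_id[label] for label in labels], classes

encoded, classes = encode_labels(["fresh", "rotten", "rotten", "fresh"])
print(encoded, classes)
```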


In [10]:
## Longest sequence/sentence
max_len

55

The longest review has 55 tokens.
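Every review is padded with zeros up to this length. To make the tokenize-and-pad step less of a black box, here is a simplified pure-Python sketch of the same idea. It is an approximation for illustration, not the Keras code: the function names are hypothetical, index 1 is reserved for the OOV token, more frequent words get smaller indices, and 0 is used only for padding.

```python
from collections import Counter

# Same filter characters as in the Tokenizer call above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts, oov_token="<OOV>"):
    """Assign 1 to the OOV token, then 2, 3, ... to words by descending frequency."""
    counts = Counter(w for t in texts for w in split_words(t))
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_padded(texts, word_index, max_len, oov_token="<OOV>"):
    """Convert texts to integer rows, truncated and zero-padded at the end."""
    oov = word_index[oov_token]
    rows = []
    for t in texts:
        seq = [word_index.get(w, oov) for w in split_words(t)][:max_len]
        rows.append(seq + [0] * (max_len - len(seq)))
    return rows

word_index_toy = build_word_index(["A good film, a GOOD cast!", "Bad film."])
padded_toy = to_padded(["A bad, unknown film"], word_index_toy, max_len=6)
print(word_index_toy)
print(padded_toy)
```

The word "unknown" never appeared when the index was built, so it is mapped to the OOV index 1 instead of being dropped.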

In [11]:
# There are pre-trained models that can save us time with word embeddings.
# We can use gensim to access them
## The model loaded below is 'glove-wiki-gigaword-50', which transforms every word into a vector of dimension 50
import gensim.downloader as api

word2vec = api.load('glove-wiki-gigaword-50')

In [12]:
## Dictionary of our tokenized words
word_index = tokenizer.word_index
## Calculate the mean and std of the imported embedding matrix
## These values will be used to generate random vectors for words that are not included in the pre-trained model
emb_mean = word2vec.vectors.mean()
emb_std = word2vec.vectors.std()

We can also check how many of our words are included in the downloaded pre-trained model.

In [13]:
# Function that finds number of included words in pretrained embedding model
def check_words(word_index, word_matrix_model):
    count = 0
    for word in word_index:
        if word in word_matrix_model:
            count += 1
    return count

In [14]:
check_words(word_index, word2vec)

64873

In [15]:
## How many different words we have in our reviews
len(word_index)

102046

Only about 64% of our words (64,873 of 102,046) are in the pre-trained model. In this case it is probably better to just train an embedding layer ourselves.
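The share of covered words follows directly from the two counts printed above. A small worked calculation (the variable names are chosen here for illustration):

```python
words_in_pretrained = 64873   # output of check_words(word_index, word2vec)
vocabulary_size = 102046      # output of len(word_index)

coverage = words_in_pretrained / vocabulary_size
missing_words = vocabulary_size - words_in_pretrained

print(f"coverage: {coverage:.1%}, missing words: {missing_words}")
```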

In [16]:
## Split data into training and validation sets
# Very simple split: the first 1/3 of the data is used for validation and the remaining 2/3 for training

x_test,y_test = padded[:len(padded)//3],labels[:len(labels)//3]
x_train,y_train = padded[len(padded)//3:],labels[len(labels)//3:]
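The slice-based split assumes the rows are not ordered in any meaningful way. If that is uncertain, shuffling the row indices before splitting is a common precaution. A sketch under that assumption, using only the standard library (the function name `shuffled_split`, the parameter `val_parts` and the seed are illustrative, not part of the original notebook):

```python
import random

def shuffled_split(n_rows, val_parts=3, seed=42):
    """Shuffle row indices reproducibly and hold out 1/val_parts of them for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = n_rows // val_parts
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffled_split(9)
print(len(train_idx), len(val_idx))
```

With NumPy arrays such as padded and labels, the returned index lists can be used directly for selection, for example padded[val_idx].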

In [17]:
## We don't need this function since we are training the embedding layer ourselves
"""
import numpy as np
def pretrained_embedding_matrix(word_to_vec_map, word_to_index, emb_mean, emb_std):
    
    np.random.seed(1)
    
    # adding 1 to fit Keras embedding (requirement) (=102046 + 1)
    vocab_size = len(word_to_index) + 1
    # define dimensionality of your pre-trained word vectors (= 300)
    emb_dim = word_to_vec_map.get_vector('news').shape[0]
    
    # initialize the matrix with generic normal distribution values
    embed_matrix = np.random.normal(emb_mean, emb_std, (vocab_size, emb_dim))
    
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        if word in word_to_vec_map:
            embed_matrix[idx] = word_to_vec_map.get_vector(word)
            
    return embed_matrix
"""


Time to construct our model

In [18]:
from tcn import TCN, tcn_full_summary
from tensorflow.keras.layers import Conv1D, Embedding, Dense, Dropout, SpatialDropout1D, Input
from tensorflow.keras.layers import concatenate, GlobalAveragePooling1D, MaxPooling1D, Flatten
from tensorflow.keras.models import Model
import tensorflow as tf

In [21]:
class MyModel(tf.keras.Model):

    def __init__(self, voc_size, max_length, output_dim=25):
        super(MyModel, self).__init__()
        self.max_length = max_length
        self.embedding = Embedding(input_dim=voc_size, output_dim=output_dim, input_length=max_length, 
                                   trainable = True)
        
        self.conv1d = Conv1D(16, 8, activation='relu',padding='same')
        self.max_pool = MaxPooling1D()
        ## Flatten the matrix to a vector
        self.flat = Flatten()
        self.dense1 = Dense(16, activation="relu")
        # Output needs to be between 0 and 1 therefore final activation with sigmoid function
        self.dense2 = Dense(1, activation="sigmoid")

    def call(self, inputs):
        x = self.embedding(inputs)
        x = self.conv1d(x)
        x = self.max_pool(x)
        x = self.flat(x)
        x = self.dense1(x)
        x = self.dense2(x)
        return x
    
    def model(self):
        x = Input(shape=(self.max_length,))
        return Model(inputs=[x], outputs=self.call(x))

    
    
model = MyModel(voc_size=len(word_index)+1, max_length=max_len, output_dim=25)
model.model().summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 55)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 55, 25)            2551175   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 55, 16)            3216      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 27, 16)            0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 432)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                6928      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17  
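The parameter counts in the summary can be reproduced by hand from the layer settings, which is a useful sanity check on the architecture. The calculation below assumes the default MaxPooling1D pool size of 2; the variable names are chosen here for illustration.

```python
vocab_size = 102046 + 1        # len(word_index) + 1, index 0 is the padding value
emb_dim, seq_len = 25, 55
filters, kernel_size = 16, 8

embedding_params = vocab_size * emb_dim                    # one vector per token index
conv_params = kernel_size * emb_dim * filters + filters   # weights + biases
pooled_len = seq_len // 2                                  # pooling halves the sequence length
flat_size = pooled_len * filters                           # size after Flatten
dense1_params = flat_size * 16 + 16
dense2_params = 16 * 1 + 1

print(embedding_params, conv_params, flat_size, dense1_params, dense2_params)
```

Almost all trainable parameters sit in the embedding layer, so the vocabulary size dominates the model size.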

In [22]:
model.compile( loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

Train model on training data and validate on validation data.

In [23]:
model.fit(x_train,y_train, batch_size = 10, epochs = 3, validation_data=(x_test, y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x1c24880c820>

Our model reached an accuracy of 93.4% on the training set and 87.0% on the validation set.

Let's have a look at some examples of misclassifications.

In [24]:
## Predictions for the first 10 sentences from the validation set
model.predict(x_test[0:10]).round()

array([[1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]], dtype=float32)

In [25]:
# Labels of the first 10 sentences in the validation set
y_test[0:10]

array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0])

In the first 10 examples we can find 2 misclassifications: the predictions differ from the labels at indices 0 and 8.
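Rather than comparing the two arrays by eye, the mismatching positions can be computed. The lists below are copied from the two outputs above; the variable names are chosen here for illustration.

```python
predictions = [1, 0, 1, 1, 1, 1, 0, 1, 0, 0]   # rounded model output for x_test[0:10]
true_labels = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]   # y_test[0:10]

# Positions where the predicted class differs from the true class
mismatches = [i for i, (p, t) in enumerate(zip(predictions, true_labels)) if p != t]
accuracy = 1 - len(mismatches) / len(true_labels)

print(mismatches, accuracy)
```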

In [26]:
# A review that is hard for the model to understand
# (index 7; in the predictions printed above it matches its label)
sentences[7]

' Everyone in "The Comedian" deserves a better movie than "The Comedian."'

In [27]:
# Another short, difficult review (index 9; in the predictions printed above it also matches its label)
sentences[9]

' Slight, contained, but ineffably soulful.'

We can also try to classify a custom sentence/review.

In [28]:
test_example = ['Not the worst but it was very long and boring']
## First tokenize it 
test_sequence = tokenizer.texts_to_sequences(test_example)
## Pad it to length = 55
test_sequence= pad_sequences(test_sequence, maxlen=max_len, padding='post', truncating='post')
test_sequence

array([[ 24,   2, 349,  11,   9,  69,  83, 135,   4, 490,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0]])

In [29]:
model.predict(test_sequence[0:1],).round()

array([[1.]], dtype=float32)

The model predicts that our test_example is a negative review (label 1, rotten), which is correct.
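To make the output easier to read, the sigmoid value can be mapped back to a label name. This is only a sketch: `probability_to_label` is a hypothetical helper, and the 0.5 threshold is an assumption that mirrors the .round() call above (NumPy rounds exactly 0.5 down to 0, so a strict comparison is used).

```python
def probability_to_label(probability, threshold=0.5, classes=("fresh", "rotten")):
    """Map a sigmoid output in [0, 1] to a class name (class 1 if above the threshold)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return classes[1] if probability > threshold else classes[0]

# Toy probabilities, not actual model outputs
print(probability_to_label(0.93), probability_to_label(0.12))
```

With the trained model, the input would be the single value returned for one review, for example float(model.predict(test_sequence)[0, 0]).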