# Amazon Reviews Dataset

This dataset contains several million reviews of Amazon products, with the reviews separated into two classes for positive and negative reviews. The two classes are evenly balanced here.

This is a large dataset, and the version that I am using here only has the text as a feature with no other metadata. This makes this an interesting dataset for doing NLP work. It is data written by users, so it's like that there are various typos, nonstandard spellings, and other variations that you may not find in curated sets of published text.

In this notebook, I will do some very simple text processing and then try out two fairly unoptimized deep learning models:
1. A convolutional neural net
2. A recurrent neural net
These models should achieve results that are within a couple percent of state of the art at predicting the binary sentiment of the reviews.

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import re

%matplotlib inline
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

#After you've uploaded your dataset, you can check the current directory by running:
# Print the current directory contents
import os
print(os.listdir())

['.config', 'test.ft.txt', 'train.ft.txt', 'sample_data']


## Reading the text

The text is held in a compressed format. Luckily, we can still read it line by line. The first word gives the label, so we have to convert that into a number and then take the rest to be the comment.

In [35]:
# Define a function to extract labels and texts from a file
def get_labels_and_texts(file):
    # Initialize empty lists to store labels and texts
    labels = []
    texts = []

    # Open the specified file for reading using 'utf-8' encoding
    with open(file, 'r', encoding='utf-8') as f:
        # Iterate through each line in the file
        for line in f:
            # Extract the label from the line (assuming a specific format in the file)
            # Convert the label to an integer, subtract 1, and append to the labels list
            labels.append(int(line[9]) - 1)

            # Extract the text from the line, starting from the 10th character to the end
            # Remove leading and trailing whitespaces and append to the texts list
            texts.append(line[10:].strip())

    # Convert the labels list to a NumPy array and return the tuple of labels and texts
    return np.array(labels), texts

# Call the function to get labels and texts for the training data
train_labels, train_texts = get_labels_and_texts('train.ft.txt')

# Call the function to get labels and texts for the test data
test_labels, test_texts = get_labels_and_texts('test.ft.txt')

## Text Preprocessing

The first thing I'm going to do to process the text is to lowercase everything and then remove non-word characters. I replace these with spaces since most are going to be punctuation. Then I'm going to just remove any other characters (like letters with accents). It could be better to replace some of these with regular ascii characters but I'm just going to ignore that here. It also turns out if you look at the counts of the different characters that there are very few unusual characters in this corpus.

In [37]:
import re
NON_ALPHANUM = re.compile(r'[\W]')
NON_ASCII = re.compile(r'[^a-z0-1\s]')

def normalize_texts(texts):
    normalized_texts = []
    for text in texts:
        lower = text.lower()
        no_punctuation = NON_ALPHANUM.sub(r' ', lower)
        no_non_ascii = NON_ASCII.sub(r'', no_punctuation)
        normalized_texts.append(no_non_ascii)
    return normalized_texts

# Ensure train_texts and test_texts are lists
train_texts = normalize_texts(train_texts)
test_texts = normalize_texts(test_texts)

## Train/Validation Split
Now I'm going to set aside 20% of the training set for validation.

In [38]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, random_state=57643892, test_size=0.2)


Keras provides some tools for converting text to formats that are useful in deep learning models. I've already done some processing, so now I will just run a Tokenizer using the top 12000 words as features.

In [39]:
MAX_FEATURES = 12000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)
train_texts = tokenizer.texts_to_sequences(train_texts)
val_texts = tokenizer.texts_to_sequences(val_texts)
test_texts = tokenizer.texts_to_sequences(test_texts)


## Padding Sequences
In order to use batches effectively, I'm going to need to take my sequences and turn them into sequences of the same length. I'm just going to make everything here the length of the longest sentence in the training set. I'm not dealing with this here, but it may be advantageous to have variable lengths so that each batch contains sentences of similar lengths. This might help mitigate issues that arise from having too many padded elements in a sequence. There are also different padding modes that might be useful for different models.


In [40]:
MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)

train_texts = tf.keras.preprocessing.sequence.pad_sequences(train_texts, maxlen=MAX_LENGTH)
val_texts = tf.keras.preprocessing.sequence.pad_sequences(val_texts, maxlen=MAX_LENGTH)
test_texts = tf.keras.preprocessing.sequence.pad_sequences(test_texts, maxlen=MAX_LENGTH)


## Convolutional Neural Net Model

I'm just using fairly simple models here. This CNN has an embedding with a dimension of 64, 3 convolutional layers with the first two having match normalization and max pooling and the last with global max pooling. The results are then passed to a dense layer and then the output.

In [41]:
def build_model():
    sequences = tf.keras.layers.Input(shape=(MAX_LENGTH,))
    embedded = tf.keras.layers.Embedding(MAX_FEATURES, 64)(sequences)
    x = tf.keras.layers.Conv1D(64, 3, activation='relu')(embedded)

    # BatchNormalization is now part of tf.keras.layers
    x = tf.keras.layers.BatchNormalization()(x)

    x = tf.keras.layers.MaxPool1D(3)(x)
    x = tf.keras.layers.Conv1D(64, 5, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPool1D(5)(x)
    x = tf.keras.layers.Conv1D(64, 5, activation='relu')(x)
    x = tf.keras.layers.GlobalMaxPool1D()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(100, activation='relu')(x)
    predictions = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model

model = build_model()


In [43]:
# Evaluate on validation data at the end of each epoch
val_loss, val_acc = model.evaluate(val_texts, val_labels, batch_size=batch_size, verbose=0)

print(f'Epoch {epoch + 1}/{epochs}, Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_acc:.4f}')


Epoch 1/2, Validation Loss: 0.6930, Validation Accuracy: 0.5065


Once this finishes training, we should find that we get an accuracy of around 94% for this model.

In [44]:
preds = model.predict(test_texts)
print('Accuracy score: {:0.4}'.format(accuracy_score(test_labels, 1 * (preds > 0.5))))
print('F1 score: {:0.4}'.format(f1_score(test_labels, 1 * (preds > 0.5))))
print('ROC AUC score: {:0.4}'.format(roc_auc_score(test_labels, preds)))


Accuracy score: 0.5065
F1 score: 0.6644
ROC AUC score: 0.5228


## Recurrent Neural Net Model
For an RNN model I'm also going to use a simple model. This has an embedding, two GRU layers, followed by 2 dense layers and then the output layer. I'm using the CuDNNGRU rather than GRU because the former will run much faster (over a factor of 10 I think on Kaggle's servers.

In [46]:
def build_rnn_model():
    sequences = layers.Input(shape=(MAX_LENGTH,))
    embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)
    x = layers.GRU(128, return_sequences=True)(embedded)
    x = layers.GRU(128)(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    predictions = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model

rnn_model = build_rnn_model()


In [None]:
# Assuming train_texts, val_texts, and test_texts are already tokenized and padded sequences
rnn_model.fit(
    np.array(train_texts),
    np.array(train_labels),
    batch_size=128,
    epochs=1,
    validation_data=(np.array(val_texts), np.array(val_labels))
)




And we should find that this model will end up with an accuracy similar to the CNN model. I haven't bothered to set the seeds, but it can go as high as 95%.

In [None]:
preds = rnn_model.predict(test_texts)
print('Accuracy score: {:0.4}'.format(accuracy_score(test_labels, 1 * (preds > 0.5))))
print('F1 score: {:0.4}'.format(f1_score(test_labels, 1 * (preds > 0.5))))
print('ROC AUC score: {:0.4}'.format(roc_auc_score(test_labels, preds)))

Accuracy score: 0.9502
F1 score: 0.9504
ROC AUC score: 0.9881


## What else could we do?

There are lots of things I haven't tried here. I think the original data from Amazon has other fields that could be added to the model. Additionally, we haven't added any global features from the samples such as length, character level features, and more. We could even attempt to run character-level deep learning models, which might be able to reduce sensitivity to misspellings. In online reviews, character level features could be quite important as users could intentionally misspell things to avoid moderation. However, these models are already performing at well over 90% so at this point any gains are going to be pretty small.