# 1. Download and prepare data

The data consists of Amazon book reviews and was downloaded from<br>
https://nijianmo.github.io/amazon/index.html

Due to the size of the dataset, it was sometimes necessary to use generators in preprocessing.
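
As an illustration of why generators help here, the sketch below streams a gzip-compressed file one review at a time instead of loading everything into memory. It is a minimal example, not the notebook's own `parse` function: the name `stream_reviews` is hypothetical, and it assumes that every line of the file is a strict JSON object.

```python
import gzip
import json
from typing import Dict, Iterator


def stream_reviews(path: str, keep_ratings=(1, 3, 5)) -> Iterator[Dict]:
    """Yield one review at a time so the whole file never sits in memory."""
    # Assumption: every line is a strict JSON object.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            review = json.loads(line)
            # Keep only the ratings of interest and the two fields we need.
            if review.get("overall") in keep_ratings:
                yield {"overall": review["overall"],
                       "reviewText": review.get("reviewText", "")}
```

Because the function is a generator, nothing is read until the caller iterates over it, for example with `for review in stream_reviews(path): ...`.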

For a bigger contrast in sentiment, only reviews with an overall rating of 1, 3 or 5 were selected. As 5-rated reviews made up nearly 80% of the collection, the classes were balanced by undersampling. Then missing data and duplicates were removed. In the next step, the target variable was created from the `overall` variable.
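
The balancing step can be sketched in plain Python as follows. This is only an illustration on toy data: the helper `undersample` and the constant `RATING_TO_CLASS` are hypothetical names, and the sketch draws a random sample per class, whereas the notebook itself takes the first 165,000 rows of each class.

```python
import random
from collections import defaultdict

# Mapping from star rating to class id: 0 = negative, 1 = neutral, 2 = positive.
RATING_TO_CLASS = {1: 0, 3: 1, 5: 2}


def undersample(reviews, per_class, seed=0):
    """Return at most `per_class` randomly chosen reviews for each rating."""
    by_rating = defaultdict(list)
    for review in reviews:
        by_rating[review["overall"]].append(review)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for rating in sorted(by_rating):
        group = by_rating[rating]
        balanced.extend(rng.sample(group, min(per_class, len(group))))
    return balanced


# Toy data: eight 5-star reviews and two 1-star reviews.
toy = [{"overall": 5.0, "reviewText": f"good {i}"} for i in range(8)]
toy += [{"overall": 1.0, "reviewText": f"bad {i}"} for i in range(2)]

balanced = undersample(toy, per_class=2)
labels = [RATING_TO_CLASS[r["overall"]] for r in balanced]
print(labels)  # [0, 0, 2, 2]
```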

Then, the texts of the reviews were cleaned up. The following were removed:
* URLs
* newline characters
* special characters
* numbers
* tags

Then, `gensim.utils.simple_preprocess()` was used to further clean and tokenize the texts. Stopwords were removed using the `spacy` module and the `en_core_web_md` language model. Keras `pad_sequences` was then used for padding, so that every review is represented by a sequence of equal length.<br>
Finally, the dataset was split into train (75%) and test (25%) sets.
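
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long ones lose their first elements. The function `pad_pre` is a hypothetical stand-in written for illustration, not part of Keras, and it assumes a positive `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_pre([[7, 8], [1, 2, 3, 4, 5]], maxlen=4))
# [[0, 0, 7, 8], [2, 3, 4, 5]]
```

This is why the padded review matrix printed later in the notebook starts each row with zeros.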

In [None]:
# Download the language model from spacy
!python -m spacy download en_core_web_md

# NOTE: the runtime must be restarted after this installation

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051302 sha256=53dc18a5c967b09542b58c87465c07def3425e393b468cda1f8a2f8059a54130
  Stored in directory: /tmp/pip-ephem-wheel-cache-f_ga9_4s/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
# Download the data - amazon books reviews
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz

--2021-12-23 08:49:13--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3223678899 (3.0G) [application/x-gzip]
Saving to: ‘reviews_Books_5.json.gz’


2021-12-23 09:00:16 (4.64 MB/s) - ‘reviews_Books_5.json.gz’ saved [3223678899/3223678899]



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import gzip
import re
import string
import numpy as np
import pandas as pd
import pickle

import gensim
from gensim.utils import simple_preprocess
import spacy
from nltk.tokenize.treebank import TreebankWordDetokenizer

import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

In [None]:
def parse(load_path, filter=True):
    """
    Generator that parses a compressed .json.gz file line by line,
    keeps only the 'overall' and 'reviewText' fields,
    and, if `filter` is True, yields only reviews rated 1, 3 or 5.

    Parameters:
    load_path - path to the compressed .json.gz file
    filter - if True, skip reviews rated 2 or 4
    """
    # The context manager closes the file once the generator is exhausted
    with gzip.open(load_path, 'rb') as g:
        for l in g:
            # NOTE: eval() executes arbitrary code; use it only on trusted files
            l = eval(l)
            l = {key: l[key] for key in l.keys() & {'overall', 'reviewText'}}
            if filter and l['overall'] in [2, 4]:
                continue
            yield l

def parse_to_df(load_path, filter):
    """
    Parses the file at load_path line by line
    and collects the reviews into a DataFrame.

    Parameters:
    load_path - path to the compressed .json.gz file
    filter - passed to parse(); if True, keep only reviews rated 1, 3 or 5
    """
    i = 0
    df = {}
    for d in parse(load_path, filter):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

In [None]:
df = parse_to_df('reviews_Books_5.json.gz', filter=True)
df

Unnamed: 0,overall,reviewText
0,5.0,Spiritually and mentally inspiring! A book tha...
1,5.0,This is one my must have books. It is a master...
2,5.0,This book provides a reflection that you can a...
3,5.0,I first read THE PROPHET in college back in th...
4,5.0,A timeless classic. It is a very demanding an...
...,...,...
6259832,5.0,Yasss hunny! This is a great read. That Dre is...
6259833,5.0,I ENJOYED THIS BOOK FROM BEGINNING TO END NOW ...
6259834,5.0,Great book! Cherika was a fool. She let that m...
6259835,5.0,When I say this was an excellent book please b...


In [None]:
# How many reviews are in each class?
df['overall'].value_counts()

5.0    4980815
3.0     955189
1.0     323833
Name: overall, dtype: int64

In [None]:
# The target is imbalanced: class '5' is far more frequent than the others.
# The data are shuffled, so we undersample by simply taking the first 165,000 reviews of each class instead of sampling at random.
n_ = 165000
print(n_)

df = pd.concat([
    df[df['overall'] == 5][:n_],
    df[df['overall'] == 3][:n_],
    df[df['overall'] == 1][:n_]
])

165000


In [None]:
# Check if there are missing values
df.isnull().sum()

overall       0
reviewText    0
dtype: int64

In [None]:
# Check if there are duplicated reviews, if so remove them
print('Number of duplicates:', df.duplicated().sum())

df.drop_duplicates(inplace=True)

Number of duplicates: 125


In [None]:
# Create target variable - one-hot encode 'overall'
# 0 - negative
# 1 - neutral
# 2 - positive

# Map each rating to its class id, then one-hot encode the class ids
y = df['overall'].map({1.0: 0, 3.0: 1, 5.0: 2}).to_numpy()
labels_matrix = tf.keras.utils.to_categorical(y, 3, dtype="int32")
del y

In [None]:
# Function to clean text data
def clean_data(data):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub(r'', data)             # URLs
    data = re.sub(r'\n', ' ', data)               # newline characters
    data = re.sub(r"'", "", data)                 # apostrophes
    data = re.sub(r'\$[a-zA-Z0-9]*', ' ', data)   # $-prefixed tags
    data = re.sub(r'@[a-zA-Z0-9]*', ' ', data)    # @-prefixed tags
    data = re.sub(r'\d+', ' ', data)              # numbers
    data = re.sub(r'[^\w]', ' ', data)            # remaining special characters
    return data



# Clean every review text
data_to_list = [clean_data(text) for text in df['reviewText'].values.tolist()]

del df
data_to_list[:5]

['Spiritually and mentally inspiring  A book that allows you to question your morals and will help you discover who you really are ',
 'This is one my must have books  It is a masterpiece of spirituality  Ill be the first to admit  its literary quality isnt much  It is rather simplistically written  but the message behind it is so powerful that you have to read it  It will take you to enlightenment ',
 'This book provides a reflection that you can apply to your own life And  a way for you to try and assess whether you are truly doing the right thing and making the most of your short time on this plane ',
 'I first read THE PROPHET in college back in the  s  The book had a revival as did anything metaphysical in the turbulent  s  It had a profound effect on me and became a book I always took with me  After graduation I joined the Peace Corps and during stressful training in country  Liberia  at times of illness and the night before I left  this book gave me great comfort  I read it befo

In [None]:
# Convert each document to a list of lowercase tokens (tokens shorter than 2 characters are dropped):
# ['I like the item'] --> [['like', 'the', 'item']]
def sent_to_words(data_to_list):
    for sentence in data_to_list:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

with open('after_tokenization.txt', 'w') as f:
    for line in sent_to_words(data_to_list):
        f.write("%s\n" % line)

del data_to_list

!ls -lh
!head -5 after_tokenization.txt

total 3.7G
-rw-r--r-- 1 root root 675M Dec 23 09:28 after_tokenization.txt
drwx------ 6 root root 4.0K Dec 23 09:02 drive
-rw-r--r-- 1 root root 3.1G Apr 26  2016 reviews_Books_5.json.gz
drwxr-xr-x 1 root root 4.0K Dec  3 14:33 sample_data
['spiritually', 'and', 'mentally', 'inspiring', 'book', 'that', 'allows', 'you', 'to', 'question', 'your', 'morals', 'and', 'will', 'help', 'you', 'discover', 'who', 'you', 'really', 'are']
['this', 'is', 'one', 'my', 'must', 'have', 'books', 'it', 'is', 'masterpiece', 'of', 'spirituality', 'ill', 'be', 'the', 'first', 'to', 'admit', 'its', 'literary', 'quality', 'isnt', 'much', 'it', 'is', 'rather', 'simplistically', 'written', 'but', 'the', 'message', 'behind', 'it', 'is', 'so', 'powerful', 'that', 'you', 'have', 'to', 'read', 'it', 'it', 'will', 'take', 'you', 'to', 'enlightenment']
['this', 'book', 'provides', 'reflection', 'that', 'you', 'can', 'apply', 'to', 'your', 'own', 'life', 'and', 'way', 'for', 'you', 'to', 'try', 'and', 'assess', 'wheth

In [None]:
import ast

with open('after_tokenization.txt', 'r') as f:
    # literal_eval parses the saved list literals without executing arbitrary code
    data_words = [ast.literal_eval(elem) for elem in f]

print(*data_words[:10], sep='\n')

['spiritually', 'and', 'mentally', 'inspiring', 'book', 'that', 'allows', 'you', 'to', 'question', 'your', 'morals', 'and', 'will', 'help', 'you', 'discover', 'who', 'you', 'really', 'are']
['this', 'is', 'one', 'my', 'must', 'have', 'books', 'it', 'is', 'masterpiece', 'of', 'spirituality', 'ill', 'be', 'the', 'first', 'to', 'admit', 'its', 'literary', 'quality', 'isnt', 'much', 'it', 'is', 'rather', 'simplistically', 'written', 'but', 'the', 'message', 'behind', 'it', 'is', 'so', 'powerful', 'that', 'you', 'have', 'to', 'read', 'it', 'it', 'will', 'take', 'you', 'to', 'enlightenment']
['this', 'book', 'provides', 'reflection', 'that', 'you', 'can', 'apply', 'to', 'your', 'own', 'life', 'and', 'way', 'for', 'you', 'to', 'try', 'and', 'assess', 'whether', 'you', 'are', 'truly', 'doing', 'the', 'right', 'thing', 'and', 'making', 'the', 'most', 'of', 'your', 'short', 'time', 'on', 'this', 'plane']
['first', 'read', 'the', 'prophet', 'in', 'college', 'back', 'in', 'the', 'the', 'book', 'ha

In [None]:
# load eng language model from spacy - medium
nlp = spacy.load('en_core_web_md')

# create stopwords list with spacy 
stopwordlist = nlp.Defaults.stop_words

In [None]:
# remove stopwords
for i in range(len(data_words)):
  data_words[i] = [word for word in data_words[i] if word not in stopwordlist]

print(*data_words[:10], sep='\n')

['spiritually', 'mentally', 'inspiring', 'book', 'allows', 'question', 'morals', 'help', 'discover']
['books', 'masterpiece', 'spirituality', 'ill', 'admit', 'literary', 'quality', 'isnt', 'simplistically', 'written', 'message', 'powerful', 'read', 'enlightenment']
['book', 'provides', 'reflection', 'apply', 'life', 'way', 'try', 'assess', 'truly', 'right', 'thing', 'making', 'short', 'time', 'plane']
['read', 'prophet', 'college', 'book', 'revival', 'metaphysical', 'turbulent', 'profound', 'effect', 'book', 'took', 'graduation', 'joined', 'peace', 'corps', 'stressful', 'training', 'country', 'liberia', 'times', 'illness', 'night', 'left', 'book', 'gave', 'great', 'comfort', 'read', 'married', 'children', 'born', 'near', 'fatal', 'illnesses', 'amazed', 'chapter', 'reaches', 'grabs', 'offers', 'comfort', 'hope', 'future', 'gibran', 'offers', 'timeless', 'insights', 'love', 'word', 'think', 'nation', 'read', 'learn', 'lessons', 'definitely', 'time', 'thought', 'reflection', 'book', 'guid

In [None]:
# detokenization
def detokenize(text):
    return TreebankWordDetokenizer().detokenize(text)


data = []
for i in range(len(data_words)):
    data.append(detokenize(data_words[i]))

del data_words
print(*data[:10], sep='\n')

spiritually mentally inspiring book allows question morals help discover
books masterpiece spirituality ill admit literary quality isnt simplistically written message powerful read enlightenment
book provides reflection apply life way try assess truly right thing making short time plane
read prophet college book revival metaphysical turbulent profound effect book took graduation joined peace corps stressful training country liberia times illness night left book gave great comfort read married children born near fatal illnesses amazed chapter reaches grabs offers comfort hope future gibran offers timeless insights love word think nation read learn lessons definitely time thought reflection book guide
timeless classic demanding assuming title gibran backs excellent style content means publish century earlier inspired new religion mouth old man sail away far away destination hear wisdom life important aspects messege guide book sufi sermon perspective hint dogma hints birth place lebanon 

In [None]:
# tokenize and padding to max_len=600 with keras
max_words = 5000
max_len = 600

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
reviews = pad_sequences(sequences, maxlen=max_len)
del data
del sequences
print(reviews)

[[   0    0    0 ... 4393  101 1068]
 [   0    0    0 ...  622    2 4340]
 [   0    0    0 ...  119    6 2074]
 ...
 [   0    0    0 ... 2854 1104  904]
 [   0    0    0 ... 2980   87  562]
 [   0    0    0 ... 4421 4525  334]]


In [None]:
print(labels_matrix)

[[0 0 1]
 [0 0 1]
 [0 0 1]
 ...
 [1 0 0]
 [1 0 0]
 [1 0 0]]


In [None]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(reviews, labels_matrix, random_state=24, shuffle=True)

print('TRAIN:', len(y_train))
print('TEST:', len(y_test))
del reviews

TRAIN: 371156
TEST: 123719


In [None]:
# Save the split data for modeling
with open('/content/drive/MyDrive/INL_PROJEKT/preprocessed_data/X_train.npy', 'wb') as f:
    np.save(f, X_train)

with open('/content/drive/MyDrive/INL_PROJEKT/preprocessed_data/X_test.npy', 'wb') as f:
    np.save(f, X_test)

with open('/content/drive/MyDrive/INL_PROJEKT/preprocessed_data/y_train.npy', 'wb') as f:
    np.save(f, y_train)

with open('/content/drive/MyDrive/INL_PROJEKT/preprocessed_data/y_test.npy', 'wb') as f:
    np.save(f, y_test)

# save tokenizer to file with pickle
with open('/content/drive/MyDrive/INL_PROJEKT/preprocessed_data/tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)