In [1]:
# To ignore warning messages when filtering data
from warnings import filterwarnings
filterwarnings('ignore')

# Basic dataset preparation

## Columns of the full dataset

In [2]:
import pandas as pd
# The whole data set
data = pd.read_csv("data/amazonConsumerReviews.csv")
print("COLUMN NAMES\n------------")
for c in data.columns: print(c)

COLUMN NAMES
------------
id
dateAdded
dateUpdated
name
brand
categories
primaryCategories
manufacturer
manufacturerNumber
reviews.date
reviews.doRecommend
reviews.numHelpful
reviews.rating
reviews.text
reviews.title


## Only keeping relevant columns

In [3]:
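# Only selecting relevant columns
# (.copy() gives an independent DataFrame, so columns added later
# are not written to a view of 'data')
reviewsData = data[['id',
                  'reviews.doRecommend',
                  'reviews.rating',
                  'reviews.text',
                  'reviews.title']].copy()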
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price |


# Text processing

The steps below are presented in separate sections for clarity. In the actual implementation, they will probably be combined to save time, reduce the number of lines and improve efficiency, and the original text columns will be replaced by their processed versions. Here, the original columns are kept for demonstration purposes.

## Removing punctuations

### Importing and viewing punctuations

In [4]:
# Obtaining a string containing all possible punctuations
# This will aid in recognising and removing punctuations
from string import punctuation
print("Punctuations available: ", punctuation)
print("Punctuations variable type:", type(punctuation))

Punctuations available:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Punctuations variable type: <class 'str'>


### Demonstration

In [5]:
text = reviewsData['reviews.text'][2]
# List of non-punctuation characters
textNoPunctuation = [c for c in text if c not in punctuation]
# Combining the above list into a string
textNoPunctuation = ''.join(textNoPunctuation)
print(textNoPunctuation)

Didnt know how much id use a kindle so went for the lower end im happy with it even if its a little dark


### Applying the above for all available reviews

In [6]:
reviewsNoPunctuation = []
for text in reviewsData['reviews.text']:
    textNoPunctuation = [c for c in text if c not in punctuation]
    textNoPunctuation = ''.join(textNoPunctuation)
    reviewsNoPunctuation.append(textNoPunctuation)

# Creating a new column (for demo purposes... normally, I will simply replace the original)
reviewsData['reviews.text (no punctuation)'] = reviewsNoPunctuation

# Viewing some rows
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title | reviews.text (no punctuation) |
|---|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small | I thought it would be as big as small paper bu... |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach | This kindle is light and easy to use especiall... |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price | Didnt know how much id use a kindle so went fo... |
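
The character-by-character filter above can also be written with Python's built-in `str.translate`, which deletes all punctuation in a single pass. The following is a minimal sketch rather than the method used in this notebook; the helper name `strip_punctuation` and the sample sentence are assumptions made for illustration.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', punctuation)

def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

# Hypothetical sample sentence, loosely modelled on the third review
sample = "Didnt know how much i'd use a kindle, so went for the lower end."
print(strip_punctuation(sample))
```

Assuming the column contains only strings, the helper could be applied to the whole column with `reviewsData['reviews.text'].apply(strip_punctuation)`.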


## Bringing to lowercase

This step makes it easy to compare words, especially when removing stopwords.

In [7]:
# Vectorised lowercasing of the whole column
# (avoids chained-indexing assignment, which pandas may apply to a temporary copy)
reviewsData['reviews.text (no punctuation)'] = reviewsData['reviews.text (no punctuation)'].str.lower()
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title | reviews.text (no punctuation) |
|---|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small | i thought it would be as big as small paper bu... |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach | this kindle is light and easy to use especiall... |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price | didnt know how much id use a kindle so went fo... |


## Removing stopwords

Stopwords are words that can be ignored because, within the scope of our analysis, they add little meaning to a sentence, for example 'a', 'the', 'an' and 'in'. Since the analysis does not use these words, keeping them in the data would waste time and space.

I will also strip punctuation from the stopwords themselves. Because the review text has already had its punctuation removed, a stopword such as "you're" must be reduced to 'youre' in order to match. This also makes the matching more tolerant of reviews that omit or misplace apostrophes.

### Importing and viewing stopwords

In [8]:
# List of recognised stopwords in English
# (available through 'stopwords.words' in the 'nltk.corpus' module;
# the corpus may first need to be fetched with nltk.download('stopwords'))
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

All stopwords are given in lowercase, so once the review text has been lowercased with the '.lower()' method, it can be compared against the list directly.

### Removing punctuations from stopwords

In [9]:
stopwordsNoPunctuation = []
for s in stopwords.words('english'):
    wordNoPunctuation = [c for c in s if c not in punctuation]
    # 'punctuation' was imported from 'string' in the section above
    wordNoPunctuation = ''.join(wordNoPunctuation)
    stopwordsNoPunctuation.append(wordNoPunctuation)
print(stopwordsNoPunctuation)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre', 'youve', 'youll', 'youd', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'shes', 'her', 'hers', 'herself', 'it', 'its', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'thatll', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', '

### Demonstration

In [10]:
import re # For more inclusive split function
# Dividing the text into individual words...
text = re.split(r'\s', reviewsData['reviews.text (no punctuation)'][2])
# List of words that are not stopwords (here, each 'c' is a whole word, not a character)
textNoStopwords = [c for c in text if c not in stopwordsNoPunctuation]
# Combining the above list into a string
textNoStopwords = ' '.join(textNoStopwords) # Join with space
print(textNoStopwords)

know much id use kindle went lower end im happy even little dark


**Limitations**: Tokens such as 'im' and 'id' are left over, because they do not match any entry in the punctuation-stripped stopword list. Removing punctuation and lowercasing can also make distinct words collide. For example, 'ID' and 'I'd' both become 'id', although their meanings are very different.
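
The collision described above can be reproduced in a few lines. This is only an illustrative sketch; the helper name `normalise` is an assumption and does not appear elsewhere in this notebook.

```python
from string import punctuation

def normalise(token):
    """Strip ASCII punctuation from `token`, then lowercase it."""
    return ''.join(c for c in token if c not in punctuation).lower()

# Two different words collapse to the same token after normalisation
for original in ["ID", "I'd", "it's", "its"]:
    print(repr(original), "->", repr(normalise(original)))
```

Once two words share a token, no later step can tell them apart, so any disambiguation would have to happen before punctuation is removed.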

### Applying the above for all available reviews

In [11]:
reviewsNoStopwords = []
# Dividing the text into individual words...
for text in reviewsData['reviews.text (no punctuation)']:
    text = re.split(r'\s', text)
    textNoStopwords = [c for c in text if c not in stopwordsNoPunctuation]
    textNoStopwords = ' '.join(textNoStopwords)
    reviewsNoStopwords.append(textNoStopwords)

# Creating a new column (for demo purposes... normally, I will simply replace the original)
reviewsData['reviews.text (no stopwords)'] = reviewsNoStopwords

# Viewing some rows
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title | reviews.text (no punctuation) | reviews.text (no stopwords) |
|---|---|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small | i thought it would be as big as small paper bu... | thought would big small paper turn like palm t... |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach | this kindle is light and easy to use especiall... | kindle light easy use especially beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price | didnt know how much id use a kindle so went fo... | know much id use kindle went lower end im happ... |
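
As noted at the start of this section, the three steps (removing punctuation, lowercasing, removing stopwords) would probably be combined in practice. The sketch below shows one possible way to do so. The function names `build_stopword_set` and `clean_review` and the short `demo_stopwords` list are hypothetical; in the notebook, the list from `stopwords.words('english')` would take its place. Unlike `re.split(r'\s', ...)`, `str.split()` collapses runs of whitespace, so the two approaches can differ on text with repeated spaces.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', punctuation)

def build_stopword_set(stopword_list):
    """Normalise stopwords the same way as the text, so "you're" matches 'youre'."""
    return {s.translate(PUNCT_TABLE).lower() for s in stopword_list}

def clean_review(text, stopword_set):
    """Remove punctuation, lowercase, and drop stopwords in one pass."""
    words = text.translate(PUNCT_TABLE).lower().split()
    return ' '.join(w for w in words if w not in stopword_set)

# Hypothetical stand-in for stopwords.words('english')
demo_stopwords = ['a', 'the', 'so', 'for', 'how', "i'd"]
stopword_set = build_stopword_set(demo_stopwords)

print(clean_review("Didnt know how much I'd use a Kindle, so went for the lower end",
                   stopword_set))
```

A set is used instead of a list because membership tests on a set take roughly constant time, which matters when every word of every review is checked.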


# Tokenization


In [12]:
# Tokenising the words within the reviews
from tensorflow.keras.preprocessing.text import Tokenizer
reviews = reviewsData['reviews.text (no stopwords)'].values
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(reviews)

The 'Tokenizer' class tokenizes text, that is, it breaks a text into tokens (usually individual words). As described in the notes on tokenization using TensorFlow, a fitted Tokenizer object holds several dictionaries, including 'word_index' (relating words to indices) and 'index_word' (relating indices to words). With 'num_words = 5000', only the most frequent words are kept when texts are later converted to sequences.
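
To make the idea of a word-to-index dictionary concrete, here is a simplified pure-Python sketch. It is not Keras's implementation: `build_word_index` and `encode` are hypothetical helpers, and the three short documents are invented for illustration. The sketch assumes that more frequent words receive smaller indices, that indices start at 1, and that only indices below `num_words` are kept when encoding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words=None):
    """Replace words by their indices, dropping unknown or too-rare words."""
    return [word_index[w] for w in text.split()
            if w in word_index
            and (num_words is None or word_index[w] < num_words)]

# Hypothetical mini-corpus
docs = ["kindle light easy use", "kindle easy read", "kindle great"]
word_index = build_word_index(docs)

print(word_index)
print(encode("kindle easy read", word_index))               # all words kept
print(encode("kindle easy read", word_index, num_words=3))  # rarer words dropped
```

Dropping rare words caps the vocabulary size, at the cost of losing those words from the encoded sequences.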

## Encoding texts

In [13]:
# Replacing words with their respective indices
# (The indices can be seen using the 'word_index' or 'index_word' attributes of the 'Tokenizer' object)
encodedDocs = tokenizer.texts_to_sequences(reviews)
# NOTE: reviews = reviewsData['reviews.text (no stopwords)'].values (defined in the previous cell)

# Comparing element of 'encodedDocs' to corresponding element of 'reviews'
print("ENCODED:")
print(encodedDocs[0], "\n")
print("ORIGINAL:")
print(reviews[0])

ENCODED:
[233, 16, 179, 105, 675, 131, 12, 3377, 148, 105, 35, 563, 352, 8, 16, 164, 51, 243, 297] 

ORIGINAL:
thought would big small paper turn like palm think small read comfortable regular kindle would definitely recommend paperwhite instead
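
The encoded and original versions of this review have the same length (19 items each), so pairing them position by position recovers part of the index-to-word mapping. The sketch below does this with plain Python, using only the two outputs shown above. The pairing works here because no word of this review was dropped during encoding, which would not hold for words outside the `num_words` limit. The variable names are chosen for illustration.

```python
# Outputs copied from the cell above
encoded = [233, 16, 179, 105, 675, 131, 12, 3377, 148, 105,
           35, 563, 352, 8, 16, 164, 51, 243, 297]
original = ("thought would big small paper turn like palm think small read "
            "comfortable regular kindle would definitely recommend paperwhite instead")

words = original.split()

# Partial index -> word mapping, recovered by aligning the two sequences
index_word = dict(zip(encoded, words))

# Decoding the sequence with that mapping reproduces the review
decoded = ' '.join(index_word[i] for i in encoded)

print(index_word[16], index_word[105])  # the two repeated words
print(decoded == original)
```

On a fitted Keras 'Tokenizer', the full mapping is available through the 'index_word' attribute mentioned earlier.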
