In [1]:
# Suppress warning messages for cleaner output
from warnings import filterwarnings
filterwarnings('ignore')

# Basic dataset preparation

## Columns of the full dataset

In [1]:
import pandas as pd
# The whole data set

# Generalized code for accessing the data directory
# (Meant to work even if this file is within some other subdirectory)
path = "data/amazonConsumerReviews.csv"
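for _ in range(10):  # Give up after climbing 10 directory levels
    try:
        data = pd.read_csv(path)
        break
    except FileNotFoundError:
        # Not found here, so look one directory higher
        path = "../" + path
else:
    raise FileNotFoundError("Could not locate data/amazonConsumerReviews.csv")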
print("COLUMN NAMES\n------------")
for c in data.columns: print(c)

COLUMN NAMES
------------
id
dateAdded
dateUpdated
name
brand
categories
primaryCategories
manufacturer
manufacturerNumber
reviews.date
reviews.doRecommend
reviews.numHelpful
reviews.rating
reviews.text
reviews.title


## Only keeping relevant columns

In [2]:
# Only selecting relevant columns
reviewsData = data[['id',
                  'reviews.doRecommend',
                  'reviews.rating',
                  'reviews.text',
                  'reviews.title']]
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price |


# Tokenization

TensorFlow offers a tokenizer class that abstracts the exact implementation of tokenization for a collection of texts. An instance of this class offers many methods and attributes related to the tokenization of texts. In particular, it keeps track of the following:
- The number of texts processed (`document_count`)
- A unique index number for each word (`word_index`, `index_word`)
- The number of times each word appears across all the texts (`word_counts`)
- The number of different texts in which each word appears (`word_docs`)
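
To make these quantities concrete, here is a minimal pure-Python sketch of the same bookkeeping on a toy corpus. It is a simplified stand-in, not the Keras implementation: it assumes lowercasing and whitespace splitting only, and the toy texts are made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
texts = ["the cat saw the dog", "the dog sat", "a cat"]

word_counts = Counter()  # times each word appears across all texts
word_docs = Counter()    # number of texts in which each word appears
document_count = 0       # number of texts processed

for text in texts:
    words = text.lower().split()
    document_count += 1
    word_counts.update(words)
    word_docs.update(set(words))  # count each word once per text

# Rank words by frequency; indices start at 1, as in the notebook output
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i for i, w in enumerate(ranked, start=1)}

print(document_count)
print(word_counts["the"], word_docs["the"])
print(word_index)
```

Note how "the" is counted three times in `word_counts` but only twice in `word_docs`, since it appears twice within the first text.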

<br><br>
**WHY USE TENSORFLOW TOKENIZER?**<br>
There are many libraries available for natural language processing that also provide text mining capabilities such as tokenization. These libraries include:
- Natural Language Toolkit (NLTK)
- Gensim
- polyglot
- TextBlob
- CoreNLP
- spaCy
- Pattern
- Vocabulary

The main advantage of the TensorFlow tokenizer class is that it handles tokenization for an entire collection of texts rather than a single text, as is the case with the NLTK tokenizer function. It also stores information about these texts and the occurrence of each word in them. Lastly, it provides attributes and methods that facilitate processes such as:
- Looking up the index for a word, and the word for an index
- Encoding texts (facilitated by the above features as well as built-in functions)

In [3]:
# Instantiating tokenizer object
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)

# Viewing all the attributes of this object
for d in dir(tokenizer): print(d)

__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__
__weakref__
_keras_api_names
_keras_api_names_v1
char_level
document_count
filters
fit_on_sequences
fit_on_texts
get_config
index_docs
index_word
lower
num_words
oov_token
sequences_to_matrix
sequences_to_texts
sequences_to_texts_generator
split
texts_to_matrix
texts_to_sequences
texts_to_sequences_generator
to_json
word_counts
word_docs
word_index


## Some noteworthy attributes

**filters**<br>
A string containing all the characters that are removed from the texts before tokenization.
<br><br>
**split**<br>
The separator string on which the texts are split into words.
<br><br>
**index_word, word_index**<br>
'index_word' is a dictionary associating every index with the respective word, while 'word_index' is a dictionary associating every word with the respective index.
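
The roles of 'filters' and 'split' can be sketched in plain Python. The helper `to_words` below is hypothetical, and the filter string is only modelled on the tokenizer's default (punctuation, tab and newline), so treat its exact contents as an assumption rather than the library's definition.

```python
# Filter string modelled on the tokenizer default; exact contents are an assumption
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text, filters=FILTERS, split=" ", lower=True):
    """Hypothetical helper: remove filtered characters, then split into words."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split string
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Drop the empty strings produced by consecutive separators
    return [w for w in text.split(split) if w]

# A sentence taken from the first review in the dataset
words = to_words("I think it is too small to read on it... not very comfortable as regular Kindle.")
print(words)
```

Because the apostrophe is not in this filter string, a word such as "i'd" would survive as a single token.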

## Some noteworthy methods

**.fit_on_texts(\<collection of texts\>)**<br>
Updates the internal vocabulary of the tokenizer object based on a given collection of texts. This method creates a vocabulary index (based on word frequency), i.e. it fills the dictionaries **index_word** and **word_index**.
<br><br>
**.texts_to_sequences(\<collection of texts\>)**<br>
Transforms each text in the collection of texts to a sequence of integers, using the index assigned to each word. Note that 'sequence' here means an iterable collection; in this case, a list of the indices corresponding to the words in the text.
<br><br>
**.get_config( )**<br>
Returns a dictionary containing each attribute of the tokenizer object and its respective value.
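
As a rough illustration of how fitting and encoding relate, the sketch below builds a frequency-ranked index and then encodes texts with it. The helpers `fit` and `encode` are hypothetical, and the handling of `num_words` (skipping any word whose index is at or beyond the limit) reflects my understanding of the tokenizer's behaviour when no out-of-vocabulary token is set, so treat it as an assumption.

```python
from collections import Counter

def fit(texts):
    """Hypothetical helper: build a frequency-ranked word -> index mapping."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}

def encode(texts, word_index, num_words=None):
    """Hypothetical helper: replace each known word with its index.

    Unknown words, and words ranked at or beyond num_words, are skipped.
    """
    sequences = []
    for t in texts:
        seq = []
        for w in t.lower().split():
            i = word_index.get(w)
            if i is None or (num_words and i >= num_words):
                continue
            seq.append(i)
        sequences.append(seq)
    return sequences

# Toy corpus (hypothetical, for illustration only)
texts = ["great tablet great price", "great screen"]
word_index = fit(texts)
print(word_index)
print(encode(texts, word_index))
print(encode(texts, word_index, num_words=3))
```

With `num_words=3`, only the words with indices 1 and 2 survive, which is why rarer words disappear from the encoded sequences.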

## Fitting indices on texts

In [4]:
# Creating internal vocabulary and word indices
reviews = reviewsData['reviews.text'].values
tokenizer.fit_on_texts(reviews)

# Viewing some index-word pairs --only for demo--
for i, x in enumerate(tokenizer.index_word.items()):
    if i > 4: break
    print(x)

(1, 'the')
(2, 'and')
(3, 'to')
(4, 'it')
(5, 'i')


## Converting texts to sequences (i.e. encoding text)

In [5]:
# Replacing words with their respective indices
# (Indices can be seen in the 'word_index' or 'index_word' attributes)
encodedDocs = tokenizer.texts_to_sequences(reviews)
# NOTE: reviews = reviewsData['reviews.text'].values

# Comparing element of 'encodedDocs' to corresponding element of 'reviews'
print("ENCODED:")
print(encodedDocs[0], "\n")
print("ORIGINAL:")
print(reviews[0])

ENCODED:
[5, 330, 4, 53, 46, 25, 267, 25, 181, 804, 23, 215, 107, 3, 46, 54, 42, 9, 3444, 5, 238, 4, 8, 128, 181, 3, 82, 15, 4, 28, 29, 692, 25, 483, 32, 53, 245, 109, 7, 342, 404] 

ORIGINAL:
I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to read on it... not very comfortable as regular Kindle. Would definitely recommend a paperwhite instead.
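
The encoding can be reversed by looking each index up in 'index_word'. The sketch below uses only the five index-word pairs printed earlier in this notebook, so any other index falls back to a placeholder; the helper `decode` and the placeholder string are hypothetical. With the fitted tokenizer, the full `tokenizer.index_word` dictionary would be used instead.

```python
# First five index-word pairs printed earlier in this notebook
index_word = {1: "the", 2: "and", 3: "to", 4: "it", 5: "i"}

def decode(sequence, index_word, unknown="<?>"):
    """Hypothetical helper: map indices back to words."""
    return " ".join(index_word.get(i, unknown) for i in sequence)

# First three indices of the encoded review shown above
print(decode([5, 330, 4], index_word))
```

Index 330 is not among the five pairs listed here, so it is shown as the placeholder rather than guessed.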
