# Dataset

## Full dataset

In [169]:
import pandas as pd
# The whole data set
data = pd.read_csv("data/amazonConsumerReviews.csv")
print("COLUMN NAMES\n------------")
for c in data.columns:
    print(c)

COLUMN NAMES
------------
id
dateAdded
dateUpdated
name
brand
categories
primaryCategories
manufacturer
manufacturerNumber
reviews.date
reviews.doRecommend
reviews.numHelpful
reviews.rating
reviews.text
reviews.title


## Only keeping relevant columns

In [92]:
# Only selecting relevant columns
reviewsData = data[['id',
                  'reviews.doRecommend',
                  'reviews.rating',
                  'reviews.text',
                  'reviews.title']]
reviewsData.head(3)

|   | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price |
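
Selecting a subset of columns and later adding a column to the result can make pandas emit a `SettingWithCopyWarning`, depending on the pandas version. A minimal sketch of one common way to avoid this, taking an explicit copy of the selection; the `toy` frame and the `is_positive` column are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the full data set
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "reviews.rating": [3, 5, 1],
    "reviews.text": ["ok", "great", "bad"],
})

# .copy() makes the selection an independent DataFrame,
# so later column assignments cannot be mistaken for writes to 'toy'
subset = toy[["id", "reviews.rating"]].copy()
subset["is_positive"] = subset["reviews.rating"] > 3

print(subset)
print(toy.columns.tolist())  # the original frame is unchanged
```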


## Converting ratings into sentiment labels

In [93]:
print("REVIEWS RATING INFO\n------------")
ratings = reviewsData['reviews.rating']
print("Minimum:", ratings.min())
print("Maximum:", ratings.max())
print("Mean:", ratings.mean())

REVIEWS RATING INFO
------------
Minimum: 1
Maximum: 5
Mean: 4.5968


For our purposes, let rating < 3 mean negative, rating > 3 mean positive, and rating = 3 be neutral.

In [94]:
# Converting ratings to sentiment labels
sentiment = []
for r in ratings:
    if r < 3: sentiment.append(0)   # Negative
    elif r > 3: sentiment.append(1) # Positive
    else: sentiment.append('n')     # Neutral
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
reviewsData = reviewsData.assign(sentiment=sentiment)
# The ratings are now encoded in 'sentiment', so the original column can be dropped
reviewsData = reviewsData.drop(columns=['reviews.rating'], errors='ignore')
reviewsData.head(3)


|   | id | reviews.doRecommend | reviews.text | reviews.title | sentiment |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | I thought it would be as big as small paper bu... | Too small | n |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach | 1 |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | Didnt know how much i'd use a kindle so went f... | Great for the price | 1 |
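
The rating-to-label rule (below 3 negative, above 3 positive, exactly 3 neutral) can also be written without an explicit loop. A rough sketch using `numpy.select` on a made-up `toy` frame; the string labels here are an illustrative choice, standing in for the `0`, `1` and `'n'` used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, for illustration only
toy = pd.DataFrame({"reviews.rating": [3, 5, 4, 1, 2]})
rating = toy["reviews.rating"]

# Conditions are checked in order; rows matching neither get the default
toy["sentiment"] = np.select(
    [rating < 3, rating > 3],
    ["negative", "positive"],
    default="neutral",
)

print(toy)
```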


## Removing neutral sentiment rows

In [97]:
# Removing neutral rows
reviewsData = reviewsData[reviewsData['sentiment'] != 'n']
# The first row (index 0) had a neutral label, so it no longer appears below
reviewsData.head(3)

|   | id | reviews.doRecommend | reviews.text | reviews.title | sentiment |
|---|---|---|---|---|---|
| 1 | AVqVGZNvQMlgsOJE6eUY | True | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach | 1 |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | Didnt know how much i'd use a kindle so went f... | Great for the price | 1 |
| 3 | AVqVGZNvQMlgsOJE6eUY | True | I am 100 happy with my purchase. I caught it o... | A Great Buy | 1 |
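
Because the label column mixed integers with the string `'n'`, pandas stores it with `object` dtype, and it stays that way after the neutral rows are filtered out. A small sketch, on a made-up `toy` frame, of converting the remaining labels to integers and (optionally) renumbering the rows; whether either step is needed depends on what consumes the labels later:

```python
import pandas as pd

# Hypothetical labelled reviews: 'n' neutral, 1 positive, 0 negative
toy = pd.DataFrame({
    "reviews.text": ["ok", "great", "bad"],
    "sentiment": ["n", 1, 0],
})
print(toy["sentiment"].dtype)  # mixed values are stored as generic Python objects

# Drop neutral rows, then store the remaining 0/1 labels as integers
binary = toy[toy["sentiment"] != "n"].copy()
binary["sentiment"] = binary["sentiment"].astype(int)

# Optional: renumber rows from 0 after the filter
binary = binary.reset_index(drop=True)
print(binary)
```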


# Tokenization

In [100]:
# Fitting the tokenizer: builds a word index from the review texts
from tensorflow.keras.preprocessing.text import Tokenizer
reviews = reviewsData['reviews.text'].values
# num_words=5000 limits encoding to the most frequent words (indices below 5000)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(reviews)

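The 'Tokenizer' class tokenizes text, that is, it breaks a text into tokens (usually individual words). Calling 'fit_on_texts' builds a vocabulary from the reviews in which each word gets an integer index, with more frequent words receiving lower indices.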
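
To make the idea concrete, here is a rough pure-Python sketch of frequency-ranked word indexing. It is not the Keras implementation: the helper names `build_word_index` and `encode` are hypothetical, the reviews are made up, and punctuation filtering is left out.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count, highest first; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index, num_words=None):
    """Replace words with their indices, dropping unknown or out-of-range words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# Made-up reviews, for illustration only
toy_reviews = [
    "the kindle is light",
    "the screen is too small",
    "the kindle is great",
]
word_index = build_word_index(toy_reviews)
print(word_index)
print(encode(toy_reviews, word_index, num_words=4))
```

With `num_words=4`, only indices 1 to 3 survive, so rarer words such as "light" are dropped from the encoded sequences.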

In [183]:
# Replacing words with their respective indices
# (The indices can be seen using the 'word_index' attribute of the 'Tokenizer' object)
encodedDocs = tokenizer.texts_to_sequences(reviews)

# Comparing element of 'encodedDocs' to corresponding element of 'reviews'
print("ENCODED:")
print(encodedDocs[0], "\n")
print("ORIGINAL:")
print(reviews[0])

ENCODED:
[1, 42, 2, 21, 12, 13, 43, 13, 22, 44, 14, 45, 23, 6, 12, 46, 47, 5, 48, 1, 49, 2, 9, 50, 22, 6, 24, 25, 2, 51, 52, 53, 13, 54, 7, 21, 55, 56, 3, 26, 57] 

ORIGINAL:
I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to read on it... not very comfortable as regular Kindle. Would definitely recommend a paperwhite instead.
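
One way to sanity-check an encoding is to map the indices back to words. A minimal sketch, assuming the tokenizer lowercases text (the Keras default); the partial mapping below is read off the first eight tokens of the output above, and everything else about the real word index is unknown here:

```python
# Word-to-index pairs inferred from the first eight tokens of the output above
partial_word_index = {
    "i": 1, "it": 2, "be": 12, "as": 13,
    "would": 21, "thought": 42, "big": 43,
}

# Invert the mapping so that indices look up words
index_word = {idx: word for word, idx in partial_word_index.items()}

encoded_prefix = [1, 42, 2, 21, 12, 13, 43, 13]
decoded = " ".join(index_word.get(i, "<unknown>") for i in encoded_prefix)
print(decoded)
```

This prints `i thought it would be as big as`, matching the start of the original review apart from letter case.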
