# References

1. Importance of normalization:<br>https://towardsdatascience.com/text-normalization-7ecc8e084e31
2. Intro to stemming:<br>https://www.geeksforgeeks.org/python-stemming-words-with-nltk/
3. Under-stemming & over-stemming:<br>https://towardsdatascience.com/stemming-of-words-in-natural-language-processing-what-is-it-41a33e8996e2
4. Comparing NLTK and spaCy:<br>
    - https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2
    - https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/
5. Intro to spaCy:<br>https://spacy.io/usage/spacy-101
6. Tokenization & lemmatization using spaCy:<br>https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/
7. Definition of language models:<br>https://towardsdatascience.com/the-beginners-guide-to-language-models-aa47165b57f9
8. 'Language' class (spaCy):<br>https://spacy.io/api/language

In [1]:
# To ignore warning messages when filtering data
from warnings import filterwarnings
filterwarnings('ignore')

# Basic dataset preparation

## Columns of the full dataset

In [2]:
import pandas as pd
# The whole data set
data = pd.read_csv("data/amazonConsumerReviews.csv")
print("COLUMN NAMES\n------------")
for c in data.columns: print(c)

COLUMN NAMES
------------
id
dateAdded
dateUpdated
name
brand
categories
primaryCategories
manufacturer
manufacturerNumber
reviews.date
reviews.doRecommend
reviews.numHelpful
reviews.rating
reviews.text
reviews.title


## Only keeping relevant columns

In [3]:
# Only selecting relevant columns
reviewsData = data[['id',
                  'reviews.doRecommend',
                  'reviews.rating',
                  'reviews.text',
                  'reviews.title']]
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price |


# Tokenization

## Overview

Tokenizing text is the process of breaking a text into tokens (usually individual words). The simplest approach is the built-in string method **str.split**, which splits the text on whitespace. For more control over the delimiters, we can use the **split** function from the **re** (regular expressions) module. However, rather than writing our own tokenizer, we will use an existing implementation.
<br><br>
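As a minimal sketch of the two do-it-yourself approaches mentioned above, the snippet below tokenizes a made-up review sentence (an illustrative assumption, not taken from the dataset) first with **str.split** and then with **re.split**. The delimiter pattern is only one possible choice.

```python
import re

# Made-up example sentence (not from the dataset)
text = "Didn't expect much, but it's great!  Easy to use."

# str.split() with no arguments splits on runs of whitespace,
# so punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# re.split() splits on any pattern; here, on runs of characters that are
# neither word characters nor apostrophes. Empty strings are filtered out.
regex_tokens = [t for t in re.split(r"[^\w']+", text) if t]

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace version keeps tokens such as 'much,' and 'great!' with their punctuation attached, while the regular-expression version drops the punctuation entirely. Neither keeps punctuation as separate tokens, which is one reason to prefer a dedicated tokenizer.
<br><br>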
**CHOOSING A SUITABLE IMPLEMENTATION**<br>
Two widely used Python libraries for language processing are **NLTK** and **spaCy**. NLTK's tokenizer functions take a string (a single text) as input and return a list of strings (sentences or words, depending on the tokenizer function used). This is a simple form of tokenization that we could largely reproduce ourselves using **str.split** or regular expressions.
<br><br>
spaCy has a tokenizer class that is optimized for word tokenization. An instance of this tokenizer is a callable object that accepts a string as an argument and returns an object of type **Doc** (more precisely, **spacy.tokens.doc.Doc**). This object offers attributes and methods that apply to the tokenized text, including conversion to JSON format and text vectorization (i.e. converting tokenized text into its numerical representation).
<br><br>
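The following sketch shows what such a **Doc** object looks like in practice. It assumes spaCy is installed and uses **spacy.blank("en")**, which builds an English pipeline containing only the tokenizer, so that no language model has to be downloaded; the example sentence is our own choice.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained model)
nlp = spacy.blank("en")
doc = nlp("Manchester United is looking to sign Harry Kane for $90 million.")

# The callable returns a Doc, which can be iterated token by token
print(type(doc).__name__)
print([token.text for token in doc])

# Each Token carries lexical attributes that need no trained model
for token in doc:
    if token.is_punct or token.is_currency or token.like_num:
        print(token.text, token.is_punct, token.is_currency, token.like_num)

# Doc.to_json() returns a plain dict that can be passed to json.dumps()
as_json = doc.to_json()
print(as_json["text"])
```

Features that depend on a trained model, such as entities, are simply empty for a blank pipeline.
<br><br>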
Furthermore, spaCy provides language models that offer powerful features, such as part-of-speech tagging (i.e. identifying the grammatical role of each word in a sentence), noun chunk extraction (i.e. detecting noun phrases) and named entity recognition (i.e. detecting references to real-world entities such as people, organizations and monetary amounts), all of which lead to easier and more comprehensive language processing.
<br><br>
Based on our perusal of NLTK and spaCy code, we have concluded that spaCy is the more attractive choice for the following reasons:
- **Doc** objects are easier to inspect, transform and extract information from
- spaCy's language models help produce more sophisticated and appropriate tokenization
- Lemmatization of **Doc** object elements is very straightforward using language models
- spaCy aims to provide fewer tools that deliver better performance and developer experience

**NOTE ON THE LAST POINT**<br>
spaCy is not a comprehensive package for natural language processing (NLP) methods. In other words, it is not suited for researching or learning about different NLP methods and concepts. Rather, it is focused on enhancing implementation, providing what the developers of spaCy think is the best method to achieve a certain NLP task.
<br><br>
**NOTE ON LANGUAGE MODEL**<br>
A statistical language model is a probability distribution over words or word sequences. In practice, a language model gives the probability of a certain word sequence being 'valid' in a given context. Note that validity here is not grammatical validity, but validity with respect to the actual usage of language. In other words, it aims to model how people use language.
<br><br>
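To make the definition above concrete, here is a toy bigram model built from three made-up sentences. It is only a sketch of the idea of assigning probabilities to word sequences; the corpus, the function names and the `<s>` / `</s>` boundary markers are illustrative assumptions, and spaCy's models are far more elaborate.

```python
from collections import Counter

# Tiny made-up corpus (illustrative assumption)
corpus = [
    "the kindle is easy to use",
    "the kindle is light",
    "the screen is easy to read",
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # <s> and </s> mark the start and end of a sentence
    words = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_probability(previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]

def sequence_probability(sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(words, words[1:]):
        probability *= bigram_probability(previous, word)
    return probability

print(sequence_probability("the screen is light"))        # unseen but plausible
print(sequence_probability("use to easy is kindle the"))  # implausible word order
```

The sentence 'the screen is light' never appears in the corpus, yet it receives a non-zero probability because each of its word pairs has been seen, whereas the scrambled sentence receives zero.
<br><br>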
**MAIN FACTOR FOR CHOOSING SPACY**<br>
While the above points make spaCy the more attractive choice for developing our application, the main reason for ultimately choosing spaCy for tokenization is the availability of its language models, which enhance the quality and effectiveness of lemmatization (discussed later).

## spaCy installation using terminal commands

- **pip install spacy** _(Installs spaCy library)_<br>
- **python -m spacy download en_core_web_sm** _(Downloads the language model for English)_

_(For installation in the Jupyter notebook environment, prefix these commands with !)_

## spaCy tokenization using language model

### Demonstration

_**Creating instance of tokenizer that uses English language model**_

In [139]:
from spacy import load
# Loading the language model
tokenizer = load("en_core_web_sm")
type(tokenizer)

spacy.lang.en.English

As we can see, the tokenizer above is not an instance of the **Tokenizer** class from spaCy's **tokenizer** module. Instead, it is an instance of the class **English**, which is itself a subclass of the class **Language**. An **English** object is a complete processing pipeline: calling it on a text first splits the text into tokens and then applies the components of the loaded English language model (such as the entity recognizer) to those tokens.

_**Tokenization + simultaneous entity extraction**_

In [192]:
text = "Manchester United is looking to sign Harry Kane for $90 million."
tokens = tokenizer(text)

# Inspecting the returned object
print("Object's type:", type(tokens))
print("Object's element's type:", type(tokens[0]))
print("Size:", len(tokens))

print("\nText:")
print(tokens.text)

# Printing the tokens
print("\nTokens:")
for word in tokens: print("-", word)
"""
Alternate iteration method (using indices):
for i in range(0, len(tokens)): print("-", tokens[i])
"""

# Printing the entities
print("\nEntities:")
print(tokens.ents)

Object's type: <class 'spacy.tokens.doc.Doc'>
Object's element's type: <class 'spacy.tokens.token.Token'>
Size: 13

Text:
Manchester United is looking to sign Harry Kane for $90 million.

Tokens:
- Manchester
- United
- is
- looking
- to
- sign
- Harry
- Kane
- for
- $
- 90
- million
- .

Entities:
(Manchester United, Harry Kane, $90 million)


Hence, we see that calling the instance of the class **English** returns a **Doc** object, which is a sequence of **Token** objects. Furthermore, we see that the original text is preserved and that the entities are extracted (accurately in this case) and stored in the tuple **ents**, which is an attribute of the **Doc** object. However, since a **Doc** keeps the original text together with the annotations for every token, the memory requirement of this tokenization process grows with the size of the text and can be high for larger texts.

# Normalization

Normalization is the process of converting a token into its base form, which involves
- Removing inflections from a word to obtain the root word
- Replacing abbreviations with the words they stand for
- Identifying informal intensifiers such as all-caps and character repetitions
- Replacing special tokens such as hashtags, user tags, and URLs with placeholders<br>_**NOTE**: These placeholders indicate the token type that has been substituted._
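
The last three steps in the list above can be sketched with a few regular expressions. This is a rough illustration only: the abbreviation table, the placeholder names and the flag names are hypothetical, and removing inflections is left to stemming and lemmatization.

```python
import re

# Hypothetical, deliberately tiny abbreviation table
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}

URL_PATTERN = re.compile(r"https?://\S+")
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")  # same character 3+ times in a row

def normalize_token(token):
    """Return (normalized_token, flags) for one whitespace-delimited token."""
    flags = []
    # Special tokens are replaced by placeholders naming their type
    if URL_PATTERN.fullmatch(token):
        return "<URL>", flags
    if token.startswith("#") and len(token) > 1:
        return "<HASHTAG>", flags
    if token.startswith("@") and len(token) > 1:
        return "<USER>", flags
    # Informal intensifiers are recorded as flags before being removed
    if token.isalpha() and token.isupper() and len(token) > 1:
        flags.append("ALLCAPS")
    if REPEAT_PATTERN.search(token):
        flags.append("ELONGATED")
        # Crude: collapses every run to one character ('cooool' -> 'col')
        token = REPEAT_PATTERN.sub(r"\1", token)
    token = token.lower()
    # Abbreviations are expanded last, on the lower-cased token
    return ABBREVIATIONS.get(token, token), flags

text = "thx @amazon LOVE it sooooo much #kindle https://example.com"
for raw in text.split():
    print(raw, "->", normalize_token(raw))
```

Recording the intensifiers as flags rather than discarding them keeps the emphasis available to a later sentiment step.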

The goal of normalization is to reduce redundancy in the data, thereby facilitating the deep learning process that uses this data. Redundant or irrelevant tokens contribute little to the sentiment of a text, yet processing them can reduce both the speed and the accuracy of learning. When such tokens are fed to the neural network as inputs, its weights may still be adjusted in response to them; because these tokens carry little sentiment, the adjustments do not improve the accuracy of the model and may even increase the error of the output.

## Stemming

Stemming is the process of replacing a word with its root word, i.e. its stem. For example, a stemming algorithm reduces the words 'running', 'runs' and 'run' to the stem 'run', and the words 'retrieval', 'retrieved' and 'retrieves' to a common stem (ideally 'retrieve', although the stem a stemmer produces need not be a dictionary word). Since stemmers work by stripping affixes, irregular forms such as 'ran' are generally left unchanged. There are many stemming algorithms (also called stemmers), each with its own advantages and disadvantages, such as:
- Porter stemmer
- Lovins stemmer
- Dawson stemmer
- N-gram stemmer
- Snowball stemmer

**POTENTIAL ISSUES**:<br>
**Over-stemming** occurs when a word is reduced more than appropriate, so that two or more words are incorrectly reduced to the same stem when they should have been reduced to different stems. For example, 'university' and 'universe' should be treated as different roots, since their meanings are significantly distinct, yet an aggressive stemmer may reduce both to 'univers'.
<br><br>
**Under-stemming** occurs when two or more words that share the same root are wrongly reduced to different stems. For example, 'data' and 'datum' should both be reduced to the stem 'dat', but some algorithms may reduce them to 'dat' and 'datu' respectively. Note, however, that making a stemmer more aggressive in order to fix under-stemming may lead to over-stemming.
<br><br>
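Both failure modes can be reproduced with a deliberately naive suffix-stripping stemmer. The function below is a hypothetical toy written for this illustration; it is not one of the algorithms listed above, and its suffix list is an arbitrary assumption.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ity", "ing", "ed", "es", "e", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

# Over-stemming: two unrelated words collapse to the same stem.
# Under-stemming: two related words keep different stems.
for pair in [("university", "universe"), ("data", "datum")]:
    print(pair, "->", tuple(naive_stem(w) for w in pair))
```

Adding rules that strip 'a' and 'um' would merge 'data' and 'datum', but the same rules would also merge unrelated words, which illustrates the trade-off noted above.
<br><br>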
**CHOOSING THE RIGHT STEMMING ALGORITHM**<br>
Some stemmers are designed for a particular language (e.g. the Porter stemmer was designed for English), while others are available for several languages (e.g. the Snowball stemmers) or are language-independent (e.g. the N-gram stemmer). Since our application is only intended for English-language text sources, and since the sentiment analysis techniques we are using are designed only for English-language texts, we will use a stemmer designed for English.
<br><br>
Furthermore, the expression of sentiment can be quite complex and subtle, and having an excess of words is preferable to not having key words that could potentially affect the sentiment significantly. Hence, for analysing sentiment, under-stemming is preferable to over-stemming. Keeping this in mind, we shall use a less aggressive stemmer, such as **Porter stemmer**.
<br><br>
_A Python implementation of Porter stemmer is available in the 'stem' module of the 'nltk' library._

In [67]:
from nltk.stem import PorterStemmer