# References
1. Intro to spaCy:<br>https://spacy.io/usage/spacy-101
2. Tokenization & lemmatization using spaCy:<br>https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/
3. 'Language' class (spaCy):<br>https://spacy.io/api/language
4. Comparing NLTK and spaCy:<br>
    - https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2
    - https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/

# Basic dataset preparation

## Columns of the full dataset

In [35]:
import pandas as pd
# The whole data set

# Generalized code for accessing the data directory
# (Meant to work even if this file is within some other subdirectory)
path = "data/amazonConsumerReviews.csv"
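for _ in range(10):  # Look through at most 10 parent directories
    try:
        data = pd.read_csv(path)
        break
    except FileNotFoundError:
        # Not found at this level, so try one directory level higher
        path = "../" + path
else:
    raise FileNotFoundError("data/amazonConsumerReviews.csv not found in any parent directory")
print("COLUMN NAMES\n------------")
for c in data.columns:
    print(c)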

COLUMN NAMES
------------
id
dateAdded
dateUpdated
name
brand
categories
primaryCategories
manufacturer
manufacturerNumber
reviews.date
reviews.doRecommend
reviews.numHelpful
reviews.rating
reviews.text
reviews.title
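
The parent-directory search used above can also be written with `pathlib`. The following is a minimal sketch rather than the notebook's own method; the helper name `find_upwards` is hypothetical and introduced only for illustration.

```python
from pathlib import Path
from typing import Optional


def find_upwards(relative_path: str, start: Optional[Path] = None) -> Path:
    """Return the first existing file `relative_path` found while walking
    from `start` (default: current working directory) up through its parents.

    Hypothetical helper, not part of pandas or spaCy.
    """
    start = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent in turn
    for directory in (start, *start.parents):
        candidate = directory / relative_path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{relative_path} not found in {start} or any parent directory"
    )
```

Assuming the data file exists somewhere above the notebook, the result can be passed directly to pandas, e.g. `pd.read_csv(find_upwards("data/amazonConsumerReviews.csv"))`. Unlike an unbounded loop, the search stops with an error once the filesystem root is reached.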


## Only keeping relevant columns

In [34]:
# Only selecting relevant columns
reviewsData = data[['id',
                  'reviews.doRecommend',
                  'reviews.rating',
                  'reviews.text',
                  'reviews.title']]
reviewsData.head(3)

| | id | reviews.doRecommend | reviews.rating | reviews.text | reviews.title |
|---|---|---|---|---|---|
| 0 | AVqVGZNvQMlgsOJE6eUY | False | 3 | I thought it would be as big as small paper bu... | Too small |
| 1 | AVqVGZNvQMlgsOJE6eUY | True | 5 | This kindle is light and easy to use especiall... | Great light reader. Easy to use at the beach |
| 2 | AVqVGZNvQMlgsOJE6eUY | True | 4 | Didnt know how much i'd use a kindle so went f... | Great for the price |


# spaCy overview

spaCy has a tokenizer class whose instances are callable objects that accept a string as an argument and return an object of type **Doc** (more precisely, **spacy.tokens.doc.Doc**). A **Doc** object exposes attributes and methods that apply to the tokenized text, including conversion to JSON format (**to_json**) and text vectorization (**vector**), i.e. converting tokenized text into its numerical representation.
<br><br>
Furthermore, spaCy also provides language models that offer powerful features, such as part-of-speech tagging (i.e. labelling each token with its grammatical category, such as noun or verb), noun chunk extraction (i.e. detecting base noun phrases, such as "the price") and named entity recognition (i.e. detecting and labelling real-world entities, such as people, organizations and monetary amounts), all of which lead to easier and more comprehensive language processing.
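
As a small illustration of the tokenizer returning a **Doc**, the sketch below uses `spacy.blank("en")`, which builds an English pipeline containing only the tokenizer, so no language model download is needed. The sample sentence is made up for this example.

```python
import spacy
from spacy.tokens import Doc

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

text = "This kindle is light and easy to use."
# The Tokenizer instance is callable: it takes a string and returns a Doc
doc = nlp.tokenizer(text)

print(isinstance(doc, Doc))            # the result is a spacy.tokens.doc.Doc
print([token.text for token in doc])   # the individual tokens
print(doc.text == text)                # the original text is preserved
print(doc.to_json()["text"])           # to_json() gives a JSON-serializable dict
```

Features such as part-of-speech tags, noun chunks and named entities are not available from this blank pipeline; they require a trained language model such as `en_core_web_sm`.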

# spaCy vs. NLTK

Based on our perusal of NLTK and spaCy code, we have come to the conclusion that spaCy is the more attractive choice for the following reasons:
- **Doc** objects are easier to inspect, transform and extract information from
- spaCy's language models help produce more sophisticated and appropriate tokenization
- Lemmatization of **Doc** object elements is very straightforward using language models
- spaCy aims to provide fewer tools that deliver better performance and developer experience

**NOTE ON THE LAST POINT**<br>
spaCy is not a comprehensive package for natural language processing (NLP) methods. In other words, it is not suited for researching or learning about different NLP methods and concepts. Rather, it is focused on enhancing implementation, providing what the developers of spaCy think is the best method to achieve a certain NLP task.
<br><br>
**NOTE ON LANGUAGE MODEL**<br>
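A statistical language model is a probability distribution over words or word sequences. In practice, a language model gives the probability of a certain word sequence being 'valid' in a given context. Note that validity here is not grammatical validity, but validity with respect to the actual usage of language. In other words, it aims to model how people use language. In spaCy's case, what we call a language model (e.g. **en_core_web_sm**) is more precisely a trained pipeline package, i.e. a set of statistical components (tagger, parser, entity recognizer, etc.) trained on annotated text.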
<br><br>
**MAIN FACTORS FOR CHOOSING SPACY**<br>
While the above points make spaCy a more attractive choice for developing our application, the main reasons for ultimately choosing spaCy to perform tokenization are two-fold.
<br><br>
Firstly, spaCy provides language models that enhance the quality and effectiveness of lemmatization (discussed later). Secondly, processing a text with a language model performs lemmatization, entity extraction and part-of-speech tagging in the same call as tokenization.

# spaCy installation using terminal commands

- **pip install spacy** _(Installs spaCy library)_<br>
- **python -m spacy download en_core_web_sm** _(Downloads the language model for English)_

_(For installation in the Jupyter notebook environment, prefix these commands with !)_
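
To confirm from Python that both installation steps succeeded, a short check along the following lines may help. It is a sketch that relies on `spacy.util.is_package`, which reports whether a package with the given name is installed.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"

# Reaching this line at all means the spaCy library itself is installed
print("spaCy version:", spacy.__version__)

# Check whether the English language model package is installed
if is_package(MODEL_NAME):
    print(MODEL_NAME, "is installed")
else:
    print(MODEL_NAME, "is missing; run: python -m spacy download", MODEL_NAME)
```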

# Tokenization

**(spaCy tokenization using language model)**

## Creating instance of tokenizer

**(This instance uses English language model)**

In [1]:
from spacy import load
# Loading the language model
tokenizer = load("en_core_web_sm")
type(tokenizer)

spacy.lang.en.English

As we can see, the above tokenizer is not an instance of the **Tokenizer** class present in the **tokenizer** module of spaCy. Instead, it is an instance of the class **English**, which is itself a subclass of the class **Language**. Calling an **English** instance first tokenizes the text and then passes the tokens through the components of the loaded English language model.
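
The class relationships described above can be checked directly. The sketch below uses `spacy.blank("en")` instead of `load("en_core_web_sm")` so that it runs without the downloaded model; this is an assumption made for the example, and the blank pipeline has no trained components.

```python
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer

# spacy.blank("en") builds an instance of the same English class,
# but without any trained pipeline components
nlp = spacy.blank("en")

print(type(nlp))                             # the English class
print(isinstance(nlp, Language))             # English is a subclass of Language
print(isinstance(nlp, Tokenizer))            # the pipeline itself is not a Tokenizer
print(isinstance(nlp.tokenizer, Tokenizer))  # its tokenizer attribute is one
print(nlp.pipe_names)                        # components applied after tokenization
```

For a pipeline created with `load("en_core_web_sm")`, `pipe_names` lists the model's components instead of being empty.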

## Tokenization

**(... along with the analyses performed in the same call)**

In [44]:
text = "Manchester United is looking to sign Harry Kane for $90 million."
tokens = tokenizer(text)

# Inspecting the returned object
print("Object's type:", type(tokens))
print("Object's element's type:", type(tokens[0]))
print("Size:", len(tokens))

print("\nText:")
print(tokens.text)

# Printing the tokens
print("\nTokens:")
for word in tokens: print("-", word)
# Alternate iteration method (using indices):
# for i in range(len(tokens)): print("-", tokens[i])

# Printing the entities
print("\nEntities:")
print(tokens.ents)

# Printing the part-of-speech tags
print("\nPOS tags:")
for word in tokens: print(word, ":", word.pos_)
    
# Printing the lemmas for each word
print("\nLemmas:")
for word in tokens: print(word, ":", word.lemma_)

Object's type: <class 'spacy.tokens.doc.Doc'>
Object's element's type: <class 'spacy.tokens.token.Token'>
Size: 13

Text:
Manchester United is looking to sign Harry Kane for $90 million.

Tokens:
- Manchester
- United
- is
- looking
- to
- sign
- Harry
- Kane
- for
- $
- 90
- million
- .

Entities:
(Manchester United, Harry Kane, $90 million)

POS tags:
Manchester : PROPN
United : PROPN
is : AUX
looking : VERB
to : PART
sign : VERB
Harry : PROPN
Kane : PROPN
for : ADP
$ : SYM
90 : NUM
million : NUM
. : PUNCT

Lemmas:
Manchester : Manchester
United : United
is : be
looking : look
to : to
sign : sign
Harry : Harry
Kane : Kane
for : for
$ : $
90 : 90
million : million
. : .


Hence, we see that the return value of the callable instance of the class **English** is a **Doc** object that is a collection of **Token** objects. Furthermore, we see that the original text is preserved, and multiple analyses have been performed on the tokens, such as:
- Entity extraction
- POS tagging
- Lemmatization
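
The three analyses listed above can be gathered into plain Python lists for further processing. The sketch below assumes spaCy v3 and builds a **Doc** by hand with hypothetical annotations, so that it runs without the language model; in the notebook, the **Doc** would come from `tokenizer(text)` instead. The helper names `token_rows` and `entity_rows` are made up for illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built Doc with hypothetical annotations (spaCy v3 constructor arguments);
# a language model would normally assign these values itself
doc = Doc(
    Vocab(),
    words=["Harry", "Kane", "signs", "for", "United", "."],
    pos=["PROPN", "PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
    lemmas=["Harry", "Kane", "sign", "for", "United", "."],
    ents=["B-PERSON", "I-PERSON", "O", "O", "B-ORG", "O"],
)


def token_rows(doc):
    """Return one (text, POS tag, lemma) tuple per token."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


def entity_rows(doc):
    """Return one (text, entity label) tuple per named entity."""
    return [(ent.text, ent.label_) for ent in doc.ents]


for row in token_rows(doc):
    print(row)
for row in entity_rows(doc):
    print(row)
```

Each entity in `doc.ents` is a span of one or more tokens, and its `label_` attribute holds the entity type, which the earlier output of `tokens.ents` does not show.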

# Normalization (lemmatization)

As we have seen above, lemmatization occurs as tokenization is performed. Hence, the lemma of each token is already available in the **lemma_** attribute of the **Token** objects (which are elements of the **Doc** object that contains these tokens). 

# Code to perform text mining for all reviews...

In [39]:
# Tokenization
#------------
# Lemmatization (along with entity extraction and POS tagging)
# is also performed in the same call.
from spacy import load
# Loading the language model
tokenizer = load("en_core_web_sm")

# Tokenizing each review
# (pipe() processes the texts as a stream, in batches)
tokenizedDocs = list(tokenizer.pipe(reviewsData['reviews.text']))
#========================
# Lemmatization
#------------
# Collecting the lemma of each token in each tokenized text
lemmatizedDocs = [[token.lemma_ for token in doc] for doc in tokenizedDocs]
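
A possible variant of the steps above keeps the lemmas next to their reviews in the DataFrame. It is a sketch under several assumptions: the helper names `load_pipeline` and `lemmatize_reviews` are hypothetical, the three sample reviews are made up, missing review texts are treated as empty strings, and a blank English pipeline (which produces no meaningful lemmas) is used as a fallback if `en_core_web_sm` is not installed.

```python
import pandas as pd
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load the named language model, falling back to a blank pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # Fallback (assumption for this sketch): tokenizer-only pipeline,
        # which has no trained lemmatizer
        return spacy.blank("en")


def lemmatize_reviews(texts, nlp):
    """Return one list of lemmas per review text."""
    # Treat missing reviews as empty strings so that every value is a string
    texts = pd.Series(texts).fillna("").astype(str)
    # nlp.pipe() processes the texts as a stream, in batches
    return [[token.lemma_ for token in doc] for doc in nlp.pipe(texts)]


# Made-up sample data with the same column name as the reviews dataset
reviews = pd.DataFrame(
    {"reviews.text": ["Too small for reading.", None, "Great light reader."]}
)
nlp = load_pipeline()
reviews["lemmas"] = lemmatize_reviews(reviews["reviews.text"], nlp)
print(reviews)
```

Storing the lemmas as a column keeps each list aligned with its review, which makes later filtering by rating or recommendation simpler than working with a separate list.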