# NLP Basics

![nlp](https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2019/01/NLP-Technology-in-Healthcare.jpg)

# What is NLP?

NLP (natural language processing) is an area of computer science and artificial intelligence concerned with the interaction between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

To add context: much of the data used in analysis is numerical, measurement-based and quantifiable, and computers are exceptional at dealing with this type of data. Textual data is a different matter. Humans can tell that text documents hold a wealth of information, but a computer needs specialised processing in order to 'understand' raw text. Text data is highly unstructured and can mix multiple languages.

NLP is therefore a collection of techniques for creating structure from text data. It is an area of active development, and advances in the field are constant.

Typical use cases for NLP: 
- Email classification between spam and legitimate emails
- Sentiment analysis of movie review text
- Analyzing trends from written feedback forms
- Understanding text commands (Siri, Alexa, Google assistant)

# Introduction to Libs

## Spacy

Spacy is an open source NLP library for Python. It implements common algorithms for dealing with language processing problems effectively and efficiently. For many NLP tasks Spacy implements only one method, typically the most efficient algorithm available when that version of the library was published. This means there is often only one way to do things with Spacy.

## NLTK

NLTK, or the Natural Language Toolkit, is another hugely popular open source option. It dates back to 2001, whereas Spacy was first published in 2015. NLTK offers a more comprehensive suite of options for a given task, but that also means it includes implementations that are not the most efficient, or approaches that are suboptimal for certain problem sets.

## Comparison notes
- For many tasks Spacy is faster and more efficient than NLTK, at the cost of offering fewer choices of algorithmic implementation.
- Spacy does not include pre-created models for some applications such as sentiment analysis, which is easier to achieve with NLTK.
- The approach taken here is to default to Spacy where it covers the use case, and to fall back to NLTK for problem sets where NLTK is better or offers resources and tooling that are not available in Spacy.

[NLTK Vs Spacy speed comparison](https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2#:~:text=NLTK%20is%20a%20string%20processing,spaCy%20uses%20object%2Doriented%20approach.&text=As%20we%20can%20see%20below,sentence%20tokenization%2C%20NLTK%20outperforms%20spaCy)

## Additional installation instructions
- For conda envs: `conda install -c conda-forge spacy`
- For the English language model (the one loaded by name below): `python -m spacy download en_core_web_sm`
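
A quick way to confirm both installs worked is to check that the packages are importable. This is a minimal sketch that relies on the English model being installed as a regular Python package named `en_core_web_sm`; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the library and of the English model package.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name:<20}{status}")
```

If either line reports `missing`, re-run the matching install command above in the same environment as the notebook kernel.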


# 3.1.0 - Spacy Basics

- Loading the language libraries
- Building a pipeline object
- Using tokens
- Parts-of-speech tagging
- Understanding token attributes 

Spacy works with a `pipeline object`. This takes raw text and automatically performs a series of operations to tag, parse and describe the text data.

In [1]:
import spacy

In [2]:
# loads the language pack (or model).
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [4]:
# Inspect the doc created above. Each word has been tokenised, and for every
# token we print its integer POS id, the readable POS label, its syntactic
# dependency and its fine-grained tag.

def doc_analysis(doc):
    print(f"{'Token':<25}{'POS':>5} {'POS_desc':<8}{'Syntactic_Dependency':>25} {'tag':<25}")
    print("-" * 75)
    for token in doc:
        print(f"{token.text:<25}{token.pos:>5} {token.pos_:<8}{token.dep_:>25} {token.tag_:<25}")

In [5]:
# run the analysis on our test text 
doc_analysis(doc)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
Tesla                       96 PROPN                       nsubj NNP                      
is                          87 AUX                           aux VBZ                      
looking                    100 VERB                         ROOT VBG                      
at                          85 ADP                          prep IN                       
buying                     100 VERB                        pcomp VBG                      
U.S.                        96 PROPN                    compound NNP                      
startup                     92 NOUN                         dobj NN                       
for                         85 ADP                          prep IN                       
$                           99 SYM                      quantmod $                        
6             

In [6]:
# lists the series of operations performed, in order, on any submitted text:
# tagger
# parser
# named entity recogniser
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fb5bde9af90>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fb5be46b8a0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fb5be52c520>)]
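
The list above is an ordered series of `(name, component)` pairs: each component receives the document, annotates it and hands it on to the next one. The toy sketch below mimics that flow in plain Python. It is not how Spacy implements its components; `ToyDoc`, `toy_tagger`, `toy_entity_marker` and `run_pipeline` are hypothetical names used only for illustration.

```python
class ToyDoc:
    """A stand-in for a processed document: tokens plus annotations."""

    def __init__(self, text):
        self.text = text
        self.tokens = text.split()  # naive whitespace tokenization
        self.annotations = {}       # filled in by the pipeline components


def toy_tagger(doc):
    # Pretend "tagging": mark tokens made only of digits as numbers.
    doc.annotations["tags"] = ["NUM" if t.isdigit() else "X" for t in doc.tokens]
    return doc


def toy_entity_marker(doc):
    # Pretend "NER": treat capitalised tokens as candidate entities.
    doc.annotations["ents"] = [t for t in doc.tokens if t[0].isupper()]
    return doc


# Same shape as nlp.pipeline: a list of (name, component) pairs.
pipeline = [("tagger", toy_tagger), ("ner", toy_entity_marker)]


def run_pipeline(text, pipeline):
    doc = ToyDoc(text)
    for name, component in pipeline:
        doc = component(doc)  # each step annotates the doc and passes it on
    return doc


doc = run_pipeline("Tesla is looking at buying a startup for 6 million", pipeline)
print(doc.annotations["tags"])
print(doc.annotations["ents"])  # ['Tesla']
```

The point of the design is that later components can build on annotations written by earlier ones, which is why the order of the list matters.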

In [7]:
doc2 = nlp(u"Tesla isn't looking into startups anymore.")

In [8]:
doc_analysis(doc2)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
Tesla                       96 PROPN                       nsubj NNP                      
is                          87 AUX                           aux VBZ                      
n't                         94 PART                          neg RB                       
looking                    100 VERB                         ROOT VBG                      
into                        85 ADP                          prep IN                       
startups                    92 NOUN                         pobj NNS                      
anymore                     86 ADV                        advmod RB                       
.                           97 PUNCT                       punct .                        


We can follow up on the syntactic dependency details with the official docs: https://spacy.io/usage/linguistic-features#dependency-parse

In [9]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [10]:
life_quote = doc3[16:30]
life_quote

"Life is what happens to us while we are making other plans"

In [11]:
# slicing a Doc returns a Span: spacy treats the two as distinct types
print(type(doc3))
print(type(life_quote))

<class 'spacy.tokens.doc.Doc'>
<class 'spacy.tokens.span.Span'>


In [12]:
doc4 = nlp(u"This is the first sentence. This is another sentence. This is the last sentence.")

In [13]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [14]:
# traverse each token in the doc sample to test whether it is the
# start of a new sentence.
for token in doc4:
    print(token.is_sent_start)

True
None
None
None
None
None
True
None
None
None
None
True
None
None
None
None
None


# 3.2.0 - Tokenization

**What is it?** Tokenization is the process of splitting a body of text into its component parts. Spacy is intelligent enough to split on:
- `whitespace` - space between tokens / words
- `prefix` - characters at the beginning
- `infix` - characters in between
- `exception` - special case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied
- `suffix` - characters at the end
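
The rule types listed above can be sketched as a toy tokenizer. This is a simplified illustration of the idea, not Spacy's actual implementation: the rule tables below are tiny hand-picked samples, and `toy_tokenize` is a hypothetical helper.

```python
import re

PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", "!", "?", ",", ".")
INFIX_PATTERN = re.compile(r"(-)")  # capture group keeps the hyphen as a token
# Special cases: strings split in a fixed way, or protected from splitting.
EXCEPTIONS = {
    "we're": ["we", "'re"],
    "U.S.": ["U.S."],
}


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():               # 1. split on whitespace
        trailing = []                        # suffixes peeled off this chunk
        while chunk:
            if chunk in EXCEPTIONS:          # 2. exception rules win
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk.startswith(PREFIXES):  # 3. peel one prefix character
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):    # 4. peel one suffix character
                trailing.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:                            # 5. split the remainder on infixes
                tokens.extend(p for p in INFIX_PATTERN.split(chunk) if p)
                chunk = ""
        tokens.extend(trailing)              # suffixes go after the core token
    return tokens


print(toy_tokenize('"we\'re moving to L.A!"'))
# ['"', 'we', "'re", 'moving', 'to', 'L.A', '!', '"']
print(toy_tokenize("A snail-mail to the U.S. costs $6!"))
# ['A', 'snail', '-', 'mail', 'to', 'the', 'U.S.', 'costs', '$', '6', '!']
```

Checking the exception table before peeling punctuation is what stops `U.S.` from losing its final dot, while `L.A!` still gives up its exclamation mark.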

Tokenization yields tokens that are pieces of the original text; there is no conversion to word stems or lemmas at this stage. Named Entity Recognition will come later. Tokens are the building blocks of a Doc object: everything that helps us to understand the meaning of the text is derived from tokens and their relationships to one another.

In [15]:
doc = nlp(u'"we\'re moving to L.A!"')
doc_analysis(doc)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
"                           97 PUNCT                       punct ``                       
we                          95 PRON                        nsubj PRP                      
're                         87 AUX                           aux VBP                      
moving                     100 VERB                         ROOT VBG                      
to                          85 ADP                          prep IN                       
L.A                         96 PROPN                        pobj NNP                      
!                           97 PUNCT                       punct .                        
"                           97 PUNCT                       punct ''                       


In [16]:
doc2 = nlp(u"We're here to help! Send Smail-mail, email support@oursite.com or visit us at http://www.oursite.com")

# demonstrate how the library handles modern strings such as
# web addresses and email addresses: the dots inside them are
# not split off as punctuation tokens.
doc_analysis(doc2)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
We                          95 PRON                        nsubj PRP                      
're                         87 AUX                          ROOT VBP                      
here                        86 ADV                        advmod RB                       
to                          94 PART                          aux TO                       
help                       100 VERB                        advcl VB                       
!                           97 PUNCT                       punct .                        
Send                       100 VERB                         ROOT VB                       
Smail                       96 PROPN                    compound NNP                      
-                           97 PUNCT                       punct HYPH                     
mail          

In [17]:
doc3 = nlp(u"A 5km NYC cab ride costs $10.30")

# demonstrate spacy being smart enough to keep the
# amount 10.30 intact (its dot is not split off) while
# separating the number from the distance unit in 5km.
doc_analysis(doc3)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
A                           90 DET                           det DT                       
5                           93 NUM                        nummod CD                       
km                          92 NOUN                     compound NN                       
NYC                         96 PROPN                    compound NNP                      
cab                         92 NOUN                     compound NN                       
ride                        92 NOUN                        nsubj NN                       
costs                      100 VERB                         ROOT VBZ                      
$                           99 SYM                          nmod $                        
10.30                       93 NUM                          dobj CD                       


In [18]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year")
doc_analysis(doc4)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
Let                        100 VERB                         ROOT VB                       
's                          95 PRON                        nsubj PRP                      
visit                      100 VERB                        ccomp VB                       
St.                         96 PROPN                    compound NNP                      
Louis                       96 PROPN                        dobj NNP                      
in                          85 ADP                          prep IN                       
the                         90 DET                           det DT                       
U.S.                        96 PROPN                        pobj NNP                      
next                        84 ADJ                          amod JJ                       
year          

In [19]:
doc5 = nlp(u"It is better to give than receive.")
print(doc5[2:5])
print(doc5[4], doc5[6])

better to give
give receive


In [20]:
print(doc5[7])

.


**Note:** Spacy derives a lot of information from the tokens of a doc, so a Doc `does not support item assignment`: to change a token you have to run the entire operation again on the modified text.

**What are ents?** Beyond individual tokens, a Doc also exposes `named entities` through its `.ents` attribute. These are spans of one or more tokens that the library is smart enough to recognise as, for example, organisation names, place names or monetary amounts.

In [21]:
doc6 = nlp(u"Apple to build Scottish factory in Glasgow creating 400 jobs at a cost of $290 million")

doc_analysis(doc6)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
Apple                       92 NOUN                        nsubj NN                       
to                          94 PART                          aux TO                       
build                      100 VERB                        relcl VB                       
Scottish                    84 ADJ                          amod JJ                       
factory                     92 NOUN                         dobj NN                       
in                          85 ADP                          prep IN                       
Glasgow                     96 PROPN                        pobj NNP                      
creating                   100 VERB                         ROOT VBG                      
400                         93 NUM                        nummod CD                       
jobs          

In [22]:
# show the entity info
for ent in doc6.ents:
    entity = ent
    lab = ent.label_
    print(f"{str(entity):<15}{str(lab):>10} {ent.label} {str(spacy.explain(ent.label_))}")

Apple                 ORG 383 Companies, agencies, institutions, etc.
Scottish             NORP 381 Nationalities or religious or political groups
Glasgow               GPE 384 Countries, cities, states
400              CARDINAL 397 Numerals that do not fall under another type
$290 million        MONEY 394 Monetary values, including unit


In [23]:
doc7 = nlp(u"Autonomous cars shify insurance liability toward manufacturers.")
doc_analysis(doc7)

Token                      POS POS_desc     Syntactic_Dependency tag                      
---------------------------------------------------------------------------
Autonomous                  84 ADJ                          amod JJ                       
cars                        92 NOUN                        nsubj NNS                      
shify                      100 VERB                         ROOT VBP                      
insurance                   92 NOUN                     compound NN                       
liability                   92 NOUN                         dobj NN                       
toward                      85 ADP                          prep IN                       
manufacturers               92 NOUN                         pobj NNS                      
.                           97 PUNCT                       punct .                        


In [24]:
# demonstrate noun chunks

for chunk in doc7.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


In [25]:
# built-in visualiser
from spacy import displacy

In [26]:
doc8 = nlp(u"Apple to build factory in Glasgow creating 500 jobs costing $210 million")

In [27]:
displacy.render(doc8, style='dep', jupyter=True, options={'distance':70})

In [28]:
doc9 = nlp(u"Over the last quarter Apple sold nearly 6 hundred thousand iPad Pros for a profit of $312 million.")

In [29]:
displacy.render(doc9, style='ent', jupyter=True)

In [30]:
doc10 = nlp(u"This is being served up to you now.")
#displacy.serve(doc10, style='dep')

In [31]:
displacy.render(doc10, style='dep')

# 3.3.0 - Stemming 

**What's stemming?** Stemming is a crude method of cataloguing related words by chopping off letters at the end until a common stem is reached. It is generally accepted to work "quite well", but the English language has many quirks and irregularities, so there are many cases where a more sophisticated method is required. For example, a search for the word 'boat' might also return 'boats', 'boating' and 'boater', so we can say that 'boat' is the `stem` of boat, boater, boating and boats.
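
The 'boat' example can be reproduced with a deliberately crude suffix chopper. This is a toy sketch, not one of the stemmers covered later: the suffix list and the `crude_stem` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Suffixes to strip, tried in this order.
SUFFIXES = ("ing", "er", "s")


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


# Catalogue related words under their shared stem.
groups = defaultdict(list)
for w in ["boat", "boats", "boating", "boater"]:
    groups[crude_stem(w)].append(w)

print(dict(groups))
# {'boat': ['boat', 'boats', 'boating', 'boater']}

# The quirks show up quickly: these stems are not real words.
print(crude_stem("running"), crude_stem("ponies"))
# runn ponie
```

The second print is the kind of case that motivates more carefully designed rule sets, and eventually lemmatization.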

The `Spacy` library doesn't include a `stemmer`, opting instead to rely on lemmatization. Stemming is a huge topic in NLP and the subject of many discussions and debates. It is helpful to understand the basics of stemming before moving on to lemmatization.

For the purposes of covering stemming we're going to have a look at the `NLTK` library. We will look at:
- Porter Stemming
- Snowball Stemming

[Porter's algorithm](https://tartarus.org/martin/PorterStemmer/def.txt) is one of the most widely used and effective stemming algorithms available. It employs five phases of word reduction, each with its own set of mapping rules.
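
To make "phases with mapping rules" concrete, the sketch below implements only step 1a of the algorithm, following the linked definition: an ordered list of suffix rewrites in which the longest matching suffix wins. The remaining steps and their extra conditions are left out, so this is an illustration rather than a usable stemmer. `STEP_1A_RULES` and `porter_step_1a` are names chosen for this example.

```python
# Step 1a rules from Porter's definition, longest suffix first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            # Only the first (longest) matching rule is applied.
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(f"{w:<10}{porter_step_1a(w)}")
```

Note that the `ss -> ss` rule changes nothing by itself; it exists so that words like "caress" match it first and never reach the plain `s` rule.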