# 2. NLP Glossary

An overview of the most common linguistic terms that are useful when processing natural language, 
focusing on those related to today's task.

![Linguistic Pyramid](https://cdn-images-1.medium.com/max/1600/1*Qtk5pN8n_BcYUsosrKFrFg.png)

Generally, linguistic tasks can be separated into stages:
1. Morphology - understanding prefixes/suffixes, word forms, etc.
2. Syntax - relationships between words in a sentence, grammar
3. Semantics - extracting the meaning of sentences
4. Pragmatics - understanding the text as a whole

These can be thought of as abstraction layers for NLP tasks: 
the higher layers give the most insight, and precision in the lower layers
can greatly improve the results of higher-level tasks.

Today, we will aim to perform sentiment analysis, which is a high-level task.
Therefore, we first have to go through some lower-level tasks in order to understand what we're doing.

## Tokenization

The process of splitting text into **tokens**.

Tokens are units of text that can carry meaning in some context.
Some of the most obvious tokens are:
- words
- punctuation
- emojis

Tokenization is a relatively simple process, and for most languages it can be performed using simple rules,
although there are differences between languages, the most notable being abbreviations and multi-word names.

spaCy uses the same tokenization algorithm for all languages but allows each language to add its own custom exceptions.

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I'm in 💙 with N.Y. :)")
print(list(doc))  # doc = sequence of tokens

[I, 'm, in, 💙, with, N.Y., :)]
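
To make the idea of "simple rules plus exceptions" concrete, here is a minimal sketch of a rule-based tokenizer in plain Python. It is not how spaCy is implemented; the `EXCEPTIONS` list, the `TOKEN_PATTERN` regex and the `tokenize` helper are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical exception list: strings that must survive as a single token.
EXCEPTIONS = ["N.Y.", "U.K.", ":)"]

# Try exceptions first (longest first), then contraction suffixes such as 'm,
# then plain words, and finally any single non-space character (punctuation).
_exceptions = "|".join(
    re.escape(e) for e in sorted(EXCEPTIONS, key=len, reverse=True)
)
TOKEN_PATTERN = re.compile(rf"{_exceptions}|'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens using the rules above."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I'm in love with N.Y. :)")
print(tokens)
```

Without the exception list, `N.Y.` would be split at every period and `:)` into two punctuation marks, which is the kind of case where language-specific exceptions help.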


## Lemmatization

Sometimes it is important to know the base form of a word (token),
that is, the form you would find in a dictionary. This base form is called a **lemma**.

Knowing this base form can help with use cases such as:
- counting **word frequency** (how many times each word appears in the text)
- computing likelihood of two words being in one sentence

We will discuss why lemmatization is important later in this course; for now, let's just remember that it exists.

As for the implementation, lemmatization is a much harder task than tokenization and requires much more information as input. Luckily, spaCy gives us easy access to every token's lemma:

In [3]:
doc = nlp("Apple is looking at buying U.K. startups for a total of $1 billion")
for token in doc:
    # show only the tokens whose lemma differs from their surface text
    if token.text != token.lemma_:
        print(f"{token} -> {token.lemma_}")

Apple -> apple
is -> be
looking -> look
buying -> buy
U.K. -> u.k.
startups -> startup
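
The word-frequency use case can be sketched without any NLP library. The example below is a toy under stated assumptions: `LEMMAS` is a hypothetical hand-written lookup table standing in for a real lemmatizer, and `lemma_counts` is a helper name made up for this illustration.

```python
from collections import Counter

# Hypothetical hand-written lemma table; a real lemmatizer covers the
# whole vocabulary and also takes context into account.
LEMMAS = {
    "is": "be",
    "was": "be",
    "looking": "look",
    "looked": "look",
    "startups": "startup",
}


def lemma_counts(tokens):
    """Count tokens after mapping each one to its (lowercased) lemma."""
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)


tokens = "She looked at startups and he is looking at a startup".split()
raw_counts = Counter(t.lower() for t in tokens)
counts = lemma_counts(tokens)

print(raw_counts["look"], counts["look"])
```

Counting raw tokens treats "looked" and "looking" as unrelated words, while counting lemmas groups them under "look".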


## Stop words

While most tokens carry some meaning, some of them carry very little.

In particular, words that appear very frequently often carry almost no meaning on their own;
you can think of them as syntactic sugar that makes natural language flow better.
These words are called **stop words**.

Whenever we are preparing to apply statistical methods (such as ML models) to natural language,
it is often worth removing stop words, as they mostly add noise.

![English Word Frequency](http://robslink.com/SAS/democd82/word_frequency.png)

In [4]:
from spacy.lang.en.stop_words import STOP_WORDS
print({"the", "of", "and", "to", "in", "a"} - STOP_WORDS)  # empty set: all six are stop words

set()
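
Removing stop words is just a filter over the token list. The sketch below uses a tiny hypothetical `STOP_WORDS` set (not spaCy's list) and a made-up helper `remove_stop_words` so that it runs without any dependencies.

```python
# Tiny hypothetical stop-word list; real lists contain several hundred entries.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for"}


def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "The price of the phone is great for a student".split()
content_words = remove_stop_words(tokens)
print(content_words)
```

With spaCy itself, the same filter can be written using the `is_stop` attribute that every token exposes.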


## Word vectors

Because we will be training machine learning models, we need a compact, memory-efficient representation of words and sentences.

Representing words or sentences as vectors can help us achieve exactly that.
If the vectors are constructed well, we should be able to express relationships between words
using standard vector operations.

![Linear relationships between vectors](https://www.tensorflow.org/images/linear-relationships.png)
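
The classic illustration of such a relationship is the analogy "king - man + woman ≈ queen". The sketch below reproduces it with hand-picked 2-D vectors; the numbers in `VECTORS` and the helper names are assumptions chosen so that the arithmetic works out exactly, whereas real embeddings have hundreds of dimensions and only approximate this behaviour.

```python
# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def add(u, v):
    return tuple(a + b for a, b in zip(u, v))


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_word(vector):
    """Return the vocabulary word whose vector is closest to `vector`."""
    return min(VECTORS, key=lambda w: squared_distance(VECTORS[w], vector))


# king - man + woman
target = add(sub(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])
print(nearest_word(target))
```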

There are many algorithms for embedding words into vectors, the most important of which are:

- **one-hot encoding** (traditional approach, cannot capture any relationships between words)
- **word2vec skip-gram** (simple neural network trained to predict the surrounding context words from a given word)
- **word2vec CBOW** (simple neural network trained to predict a word from its surrounding context)
- **FastText** (word2vec extended with character n-grams, which improves vectors for words that are outside of the training dataset)
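
The limitation of one-hot encoding is easy to see numerically: any two different one-hot vectors are orthogonal, so their cosine similarity is always zero. The sketch below compares that with a set of dense vectors; the values in `dense` are hand-picked assumptions, not the output of any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


vocabulary = ["cat", "dog", "car"]

# One-hot: a 1.0 at the word's own index and 0.0 everywhere else.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

# Hypothetical dense vectors in which the two animals point the same way.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

print(cosine(one_hot["cat"], one_hot["dog"]))
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))
```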

For more info about vector embedding algorithms, see [this article](https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c).

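While most of these models can be easily imported and trained using the Gensim library, 
we will use the pre-trained vectors supplied with spaCy models.
Note that small models such as `en_core_web_sm` do not ship with real word vectors, so the similarity they report is only a rough approximation; the medium and large models give more meaningful results.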

In [14]:
print(f"{doc[2]} <-> {doc[3]} --> similarity={doc[2].similarity(doc[3]):.2f}")

looking <-> at --> similarity=0.20


## POS Tagging

Part of speech (POS) is one of the most obvious attributes we can extract from words.

Tagging is, however, not a trivial task and is most often performed using neural networks. 
In spaCy, POS tagging, dependency parsing and named entity recognition are all performed by the same network,
trained to perform multiple tasks simultaneously. This multi-task setup proved to achieve the highest scores across all 3 tasks.

POS tags are often used to filter text and to support higher-level tasks. 
We have already seen them a few times in our visualizations - tag names always appear below words:

In [15]:
doc = nlp("spaCy is the best way to prepare text for deep learning.")
spacy.displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
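
Filtering by POS tag can be sketched on a hand-tagged sentence. The tags in `tagged` are assigned by hand for illustration (a real tagger would predict them, and might disagree on individual tokens), and `words_with_tag` is a hypothetical helper.

```python
# Hand-assigned (word, coarse POS tag) pairs for the sentence above.
tagged = [
    ("spaCy", "PROPN"), ("is", "AUX"), ("the", "DET"), ("best", "ADJ"),
    ("way", "NOUN"), ("to", "PART"), ("prepare", "VERB"), ("text", "NOUN"),
    ("for", "ADP"), ("deep", "ADJ"), ("learning", "NOUN"), (".", "PUNCT"),
]


def words_with_tag(tagged_tokens, wanted):
    """Return the words whose tag is in the `wanted` set, in sentence order."""
    return [word for word, tag in tagged_tokens if tag in wanted]


adjectives = words_with_tag(tagged, {"ADJ"})
nouns = words_with_tag(tagged, {"NOUN", "PROPN"})
print(adjectives)
print(nouns)
```

With a spaCy `doc`, the same filter would compare each token's `pos_` attribute instead of a hand-written tag.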

## Dependency Trees

A dependency tree is a representation of the relations between words in a sentence.
Trees are, as we all know, used a lot in computer science, and this representation makes sentences easy to analyze.

Just like POS tags, this is something you should all remember from primary school, 
and you have already seen how spaCy does it - take a closer look at the arrows in our visualizations:

In [15]:
doc = nlp("spaCy is the best way to prepare text for deep learning.")
spacy.displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
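
One common way to store such a tree is a list of head indices: every word points at the word it depends on. The sketch below uses a hand-annotated toy parse of a shortened sentence; the `heads` and `labels` values are assumptions made for illustration, and in this sketch the root is marked by pointing at itself.

```python
words = ["spaCy", "is", "the", "best", "way"]
heads = [1, 1, 4, 4, 1]  # index of each word's head; the root is its own head
labels = ["nsubj", "ROOT", "det", "amod", "attr"]


def root_index(heads):
    """Return the index of the word that is its own head."""
    return next(i for i, h in enumerate(heads) if h == i)


def children(heads, parent):
    """Return the indices of the words directly attached to `parent`."""
    return [i for i, h in enumerate(heads) if h == parent and i != parent]


def subtree(heads, start):
    """Return the indices of `start` and all its descendants, in sentence order."""
    result = [start]
    for child in children(heads, start):
        result.extend(subtree(heads, child))
    return sorted(result)


root = root_index(heads)
print(words[root], [(words[i], labels[i]) for i in children(heads, root)])
print([words[i] for i in subtree(heads, 4)])
```

Walking a subtree like this is how a whole phrase, such as "the best way", can be recovered from a single head word.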

## Named Entities

The last small task we will need to perform in order to analyze sentiment is named entity recognition - 
after all, we will want to look for adjectives relating to a specific named entity.

It is important to remember that this is not a trivial task and is also performed using neural networks.
This allows us to take context into account, so that we don't confuse apple (the fruit) with Apple (the company).

In [19]:
doc = nlp("Apple is looking at buying U.K. startups for a total of $1 billion")
spacy.displacy.render(doc, style='ent', jupyter=True)
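
To see why context matters, consider the naive alternative: looking every word up in a fixed list of names. The sketch below is deliberately simplistic; `GAZETTEER` and `naive_entities` are hypothetical names, and this is not how spaCy recognizes entities.

```python
# Hypothetical lookup table of known names and their entity labels.
GAZETTEER = {"Apple": "ORG", "U.K.": "GPE"}


def naive_entities(text):
    """Label every whitespace-separated word that appears in the gazetteer."""
    return [(word, GAZETTEER[word]) for word in text.split() if word in GAZETTEER]


company = naive_entities("Apple is looking at buying U.K. startups")
dessert = naive_entities("Apple pie is my favourite dessert")
print(company)
print(dessert)  # the fruit is wrongly labelled as an organisation
```

The lookup gets the first sentence right but also tags the fruit in the second one, because it never looks at the surrounding words.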

## Sentiment analysis

Sentiment analysis is the process of determining how positive or negative a text is about a given subject.

This is a high-level task with plenty of real-world use cases, such as:
- analyzing product reviews
- predicting election outcomes
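
As a first intuition, sentiment can be approximated by counting positive and negative words. The sketch below is a toy baseline under stated assumptions: the `POSITIVE` and `NEGATIVE` word lists and the `sentiment_score` helper are made up for this example, and the approach ignores negation, sarcasm and which entity the opinion is about.

```python
# Tiny hypothetical sentiment lexicons.
POSITIVE = {"good", "great", "best", "love"}
NEGATIVE = {"bad", "terrible", "worst", "hate"}


def sentiment_score(text):
    """Return a score in [-1, 1]: +1 all positive, -1 all negative, 0 neutral."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positive = sum(w in POSITIVE for w in words)
    negative = sum(w in NEGATIVE for w in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


print(sentiment_score("spaCy is the best way to prepare text for deep learning."))
print(sentiment_score("I hate the terrible battery but love the screen"))
```

The second sentence shows the weakness of a single score: the text is negative about the battery and positive about the screen, which is why tying sentiment to a specific entity is useful.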

Today, after learning the nuts and bolts of language analysis using spaCy, we will try to
create an application that analyzes sentiment regarding a given named entity.