# 2. NLP Glossary

An overview of the most common NLP terms, focusing on those relevant to today's task.

## Tokenization

The process of splitting text into **tokens**.

Tokens are the units of text that can carry meaning in some context.
Some of the most obvious tokens are:
- words
- punctuation
- emojis

Tokenization is a relatively simple process, and for most languages it can be performed using simple rules.
The rules do differ between languages, though, most notably in how they handle abbreviations and multi-word names.

spaCy applies the same tokenization algorithm to every language, but lets each language define its own exceptions.

In [27]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I'm in 💙 with N.Y. :)")
print(list(doc))  # doc = sequence of tokens

[I, 'm, in, 💙, with, N.Y., :)]
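To make the idea of "simple rules plus exceptions" more concrete, here is a toy tokenizer written with the standard library only. It is a rough sketch of the general approach, not spaCy's actual algorithm; the names `EXCEPTIONS`, `TOKEN_RE` and `toy_tokenize` are made up for this illustration.

```python
import re

# Hypothetical exception table: strings the generic rules would split incorrectly.
EXCEPTIONS = {
    "I'm": ["I", "'m"],
    "N.Y.": ["N.Y."],
    ":)": [":)"],
}

# Generic rule: a run of word characters, or any single non-space character.
TOKEN_RE = re.compile(r"\w+|\S")


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():       # rule 1: split on whitespace
        if chunk in EXCEPTIONS:      # rule 2: exceptions win over the generic rule
            tokens.extend(EXCEPTIONS[chunk])
        else:                        # rule 3: separate words from punctuation
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


print(toy_tokenize("I'm in 💙 with N.Y. :)"))
print(TOKEN_RE.findall("N.Y."))  # what the generic rule does without the exception
```

Without the exception entry, the generic rule breaks `N.Y.` into four separate tokens, which is why per-language exception lists are useful.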


## Lemmatization

Sometimes it is important to know the base form of a word (token),
which is the form you would find in a dictionary. This base form is called the **lemma**.

Knowing this base form can help with use cases such as:
- counting **word frequency** (how many times each word appears in the text)
- computing the likelihood of two words appearing in the same sentence

We will discuss why lemmatization is important later in this course. For now, let's just remember that it exists.

As for the implementation, lemmatization is a much harder task than tokenization and requires much more information as input. Luckily, spaCy is our friend and gives us easy access to every token's lemma:

In [22]:
doc = nlp("Apple is looking at buying U.K. startups for a total of $1 billion")
for token in doc:
    if token.text != token.lemma_:  # show only tokens whose lemma differs from the text
        print(f"{token} -> {token.lemma_}")

Apple -> apple
is -> be
looking -> look
buying -> buy
U.K. -> u.k.
startups -> startup
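The word-frequency use case mentioned above can be sketched without spaCy. The example below assumes a tiny hand-written lemma table (`LEMMAS`, hypothetical and far from complete) in place of a real lemmatizer, and compares counts over surface forms with counts over lemmas.

```python
from collections import Counter

# Hypothetical, hand-written lemma table; a real lemmatizer covers the whole vocabulary.
LEMMAS = {
    "is": "be",
    "was": "be",
    "looking": "look",
    "looked": "look",
    "startups": "startup",
}


def lemma(word):
    # Fall back to the lowercased word when the table has no entry for it.
    return LEMMAS.get(word.lower(), word.lower())


words = "She was looking at startups and he looked at one startup".split()

surface_counts = Counter(w.lower() for w in words)  # counts of the words as written
lemma_counts = Counter(lemma(w) for w in words)     # counts after mapping to lemmas

print(surface_counts["look"], lemma_counts["look"])
print(surface_counts["startup"], lemma_counts["startup"])
```

Counting lemmas groups `looking` and `looked` under `look`, so the frequency reflects the word rather than each of its inflected forms.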


## Stop words

While most tokens carry some meaning, some of them don't.

In particular, words that appear very frequently often carry little or no meaning on their own.
You can think of them as syntactic sugar for a natural language, there to make it prettier.
These words are called **stop words**.

Whenever we prepare to apply statistical methods (such as ML models) to natural language,
it is usually worth removing the stop words first, as they mostly add noise.

![English Word Frequency](http://robslink.com/SAS/democd82/word_frequency.png)

In [38]:
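from spacy.lang.en.stop_words import STOP_WORDS  # explicit import of spaCy's English stop-word set
print({"the", "of", "and", "to", "in", "a"} - STOP_WORDS)  # words NOT in the stop-word set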

set()
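To see why stop words behave like noise in frequency-based statistics, here is a small standard-library sketch. The stop-word set below is a hypothetical mini list chosen for the example sentence, not spaCy's full `STOP_WORDS`.

```python
from collections import Counter

# Hypothetical mini stop-word list, just large enough for this example.
MINI_STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is"}

text = "the cat sat in the hat and the cat is a friend of the dog"
tokens = text.split()

all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in MINI_STOP_WORDS)

print(all_counts.most_common(3))      # dominated by "the"
print(content_counts.most_common(3))  # the most frequent content word comes first
```

Before filtering, the most frequent token says nothing about the topic of the text; after filtering, the top entry is a content word.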


## POS Tagging

## Dependency Trees

## Named Entities

## Sentiment analysis