# Natural Language Processing (NLP)
Written language is composed of **strings of symbols** (alphabet), which are grouped into **words**, which are grouped into **sentences**.  
Computers can parse strings of symbols into words, and even words into sentences, but extracting meaning from sentences, with all the **nuances** of expression, emotion and context, can be extremely challenging.  
To simplify computation, most machine-learning algorithms use **vector representations (embeddings)** of words and/or sentences.  
Computers are very well suited to processing vectors: a vector is just an ordered list of numbers of fixed dimensionality, and **GPUs** are hardware designed specifically to perform **massively parallel** computations on vectors.  

## Tools

### NLTK (Natural Language Toolkit)
https://www.nltk.org/  
NLTK is an NLP package that is designed to provide a suite of different NLP tools, models, algorithms and corpora.  
It is primarily meant for teaching and research.  

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

### spaCy
https://spacy.io/  
spaCy is an NLP package that is designed to provide out-of-the-box NLP tools.  
It is designed to have "industrial strength".  

In [None]:
import spacy
# The model must be installed first: python -m spacy download en_core_web_sm
spacy_parser = spacy.load("en_core_web_sm")

### scikit-learn
https://scikit-learn.org/  
scikit-learn is an ML package that is designed to provide a suite of classical ML tools.  
It is designed to be very accessible and simple.  

In [None]:
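# Import the submodules used below explicitly; a bare `import sklearn` does not load them in every version
import sklearn.preprocessing, sklearn.feature_extraction.text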

## Preprocessing: Parsing, Wrangling and Cleaning
Raw text is a classic example of **unstructured data**. It's generally quite dirty, and contains a lot of unnecessary information.  
The main preprocessing steps of a typical NLP pipeline are as follows:  
1. **Tokenization**: Break the text up into words.
2. **Lemmatization or Stemming**: Convert the words into their base forms.
3. **Stop Word Filtering**: Remove words with little meaning.

### Tokenization
https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization  
https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation  
Letters by themselves aren't particularly meaningful, but full words are.  
**Punctuation** should be considered its own token: it adds to the meaning of the sentence, not to the individual word it is adjacent to.  
Tokenizers are almost always constructed from a **deterministic white-box ruleset**.  
Most commonly, tokenizers use a complex **regular expression** (regex) to parse the text into words.
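
As a rough illustration of a regex-based tokenizer, here is a minimal sketch using only Python's built-in `re` module. The pattern and the name `regex_tokenize` are assumptions made for this example; real tokenizers such as NLTK's Treebank tokenizer use far more elaborate rules (for instance, splitting contractions).

```python
import re

# Hypothetical pattern for illustration: a word (optionally containing one
# internal apostrophe), or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

example = "If you knew how nuanced this gets, you wouldn't believe it!"
print(regex_tokenize(example))
```

Note that this sketch keeps "wouldn't" as one token, which is a simplification rather than a recommendation.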


Some NLP algorithms don't tokenize sentences into distinct words: byte-pair encoding, for example, splits text into subword units.  
https://en.wikipedia.org/wiki/Byte_pair_encoding  
There also exist phonetic tokenizers, which can be good at capturing slang and acronyms.
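
To make byte-pair encoding more concrete, the following is a simplified sketch of how merge rules could be learned from a list of words. The function name `learn_bpe_merges` and the toy word list are assumptions for illustration; production implementations add details such as end-of-word markers and work on much larger corpora.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the vocabulary.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite every word, fusing each occurrence of the most frequent pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings remain split into characters.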

In [None]:
doc = "If you knew how nuanced this gets, you wouldn't believe it!"
#doc = "Why use many words, when fewer words do the trick?"

In [None]:
# NLTK has a wide variety of tokenizers
# https://www.nltk.org/api/nltk.tokenize.html
nltk_space_tokenizer = nltk.tokenize.SpaceTokenizer()
nltk_space_tokenized_doc = nltk_space_tokenizer.tokenize(doc)
print(nltk_space_tokenized_doc)

nltk_treebank_tokenizer = nltk.tokenize.treebank.TreebankWordTokenizer()
nltk_treebank_tokenized_doc = nltk_treebank_tokenizer.tokenize(doc)
print(nltk_treebank_tokenized_doc)

In [None]:
# SpaCy automatically tokenizes your document as part of its standard procedure
spacy_doc = spacy_parser(doc)
spacy_tokenized_doc = list(map(str, spacy_doc))
print(spacy_tokenized_doc)

### Lemmatization and Stemming
https://en.wikipedia.org/wiki/Morphology_(linguistics)  
https://en.wikipedia.org/wiki/Lemma_(morphology)  
https://en.wikipedia.org/wiki/Stemming  
https://en.wikipedia.org/wiki/Lemmatisation  
Often the exact **morphology** of a word isn't very important to the meaning of a sentence. We just want the root of the word to capture its meaning.  
**Stemming** obtains the root of a word by matching a known suffix and removing it; it is fast, but the result is not always a real word.  
**Lemmatization** obtains the root of a word by looking it up in a dictionary that maps each inflected form to its root (lemma).
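
The difference between the two approaches can be sketched in a few lines of plain Python. The suffix list, the dictionary, and the names `toy_stem` and `toy_lemmatize` are hypothetical and deliberately tiny; they are not how NLTK or spaCy implement these steps.

```python
# Hypothetical, deliberately tiny rule set and dictionary for illustration.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_DICTIONARY = {"knew": "know", "gets": "get", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMA_DICTIONARY.get(word, word)

for word in ["gets", "knew", "believing", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer handles regular suffixes without any dictionary but misses irregular forms such as "knew", whereas the lemmatizer only knows the words in its dictionary.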

In [None]:
nltk_snowball_stemmer = nltk.stem.snowball.EnglishStemmer()
nltk_stemmed_doc = list(map(nltk_snowball_stemmer.stem, nltk_treebank_tokenized_doc))
print(nltk_stemmed_doc)

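nltk_wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()
# lemmatize() treats every word as a noun by default (pos='n'), so verbs such as 'knew' are left unchanged
nltk_lemmatized_doc = list(map(nltk_wordnet_lemmatizer.lemmatize, nltk_treebank_tokenized_doc))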
print(nltk_lemmatized_doc)

In [None]:
spacy_lemmatized_doc = list(map(lambda token: token.lemma_, spacy_doc))
print(spacy_lemmatized_doc)

### Stop Words
https://en.wikipedia.org/wiki/Stop_words  
Some words, such as "the", "is" and "of", are considered insignificant to the meaning of the document. The solution is simply to filter those tokens out.

In [None]:
#nltk.download('stopwords')
nltk_stopwords = set(nltk.corpus.stopwords.words('english'))
print(nltk_stopwords)
print('')
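# The NLTK stop word list is lower-case, so compare lower-cased tokens (otherwise 'If' slips through)
nltk_stopword_doc = [token for token in nltk_lemmatized_doc if token.lower() not in nltk_stopwords]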
print(nltk_stopword_doc)

In [None]:
spacy_stopwords = spacy_parser.Defaults.stop_words
print(spacy_stopwords)
print('')
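# Option 1: use the is_stop flag that spaCy sets on each token
spacy_doc_stopword_doc = [token.text for token in spacy_doc if not token.is_stop]
print(spacy_doc_stopword_doc)
print('')
# Option 2: filter the lemmas against the stop word set (lower-cased to match its entries)
spacy_lemma_stopword_doc = [lemma for lemma in spacy_lemmatized_doc if lemma.lower() not in spacy_stopwords]
print(spacy_lemma_stopword_doc)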

### Spelling Correction and Edit Distance
https://en.wikipedia.org/wiki/Spell_checker  
https://en.wikipedia.org/wiki/Edit_distance  
https://en.wikipedia.org/wiki/Levenshtein_distance  
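Raw data often comes with typos, slang, jargon, acronyms/initialisms, and other features of real-life writing.  
An edit distance, such as the Levenshtein distance, counts how many single-character edits separate two strings, which makes it a common basis for suggesting corrections.  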
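
As an illustration of edit distance, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence, followed by a naive corrector that picks the closest entry of a known vocabulary. The names `levenshtein` and `correct`, and the example vocabulary, are assumptions for this example; practical spell checkers also weigh word frequency and context.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character insertions, deletions
    and substitutions needed to turn source into target."""
    # previous_row[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]

def correct(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

print(levenshtein("kitten", "sitting"))
print(correct("nuansed", ["nuanced", "believe", "knew"]))
```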

## Bag Of Words
https://en.wikipedia.org/wiki/Bag-of-words_model  
https://en.wikipedia.org/wiki/Vector_space_model  
A Bag Of Words (BOW) model assumes the order of words isn't important to a sentence's meaning: all that matters is which words are present and how many times each appears.  
The motivation behind BOW models is to easily vectorize documents, with the dimensions corresponding to different tokens. Note how orthogonality can be thought of as representing the 'difference in meaning' of different tokens. Also note how independence between the vector dimensions means context is lost in this type of encoding.  
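
The loss of word order can be seen in a small sketch built on `collections.Counter`. The helper `bag_of_words` and the two example sentences are assumptions for illustration only.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lower-cased, whitespace-separated token to its count."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the dog bites the man")
bag_b = bag_of_words("the man bites the dog")
print(bag_a)
# The two sentences mean different things, yet their bags are identical.
print(bag_a == bag_b)
```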

In [None]:
corpus = ['The man is a man', 
          'The woman is the woman']
# Build a list rather than a one-shot map iterator so the corpus can be reused
tokenized_corpus = [doc.lower().split() for doc in corpus]

### One-Hot Encoding
https://en.wikipedia.org/wiki/One-hot  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html  
One-Hot Encoding is a technique for vectorizing discrete finite sets.  
Each possible element of the universal set corresponds to a dimension of the vector.  
Thus, a given set can be encoded as a vector by placing a 1 in each dimension for which its corresponding element appears in the set.

Intuitively, each dimension of the vector is a boolean value corresponding to "does this element appear in the set?"

In [None]:
OneHotEncoder = sklearn.preprocessing.MultiLabelBinarizer()
OneHotCorpus = OneHotEncoder.fit_transform(tokenized_corpus)
print(OneHotEncoder.classes_)
print(OneHotCorpus)

### Term Frequency
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html  
Term frequency, also known as "word count", vectorizes a document by mapping every word in the corpus to the number of times it appears in that document.


In [None]:
CountEncoder = sklearn.feature_extraction.text.CountVectorizer(token_pattern=r'[a-z]+')
CountCorpus = CountEncoder.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(CountEncoder.get_feature_names_out())
print(CountCorpus.toarray())

### TF-IDF
https://en.wikipedia.org/wiki/Tf%E2%80%93idf  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html  
TF-IDF: Term Frequency–Inverse Document Frequency  
TF-IDF uses both the count of a word in its document and its prevalence in the corpus to determine its relevance.  
Thus terms that appear very frequently everywhere and don't provide much information are weighted lower, whereas words that appear in very few documents, but quite frequently within a document, are weighted highly.  

The formula for the TF-IDF value, $\mathrm{tfidf}$, of a term $t$ in a document $d$ is given by  
$$ \mathrm{tfidf}_{t, d} = \frac{\mathrm{f}_{t, d}}{\sum_{t' \in T} \mathrm{f}_{t', d}} \log \left ( \frac{\left | D \right |}{\left | \left \{ d' \in D \mid \mathrm{f}_{t, d'} > 0 \right \} \right |} \right )$$  
where:  
$\mathrm{tfidf}_{t, d}$ is the TF-IDF value of term $t$ in document $d$  
$\mathrm{f}_{t, d}$ is the number of times term $t$ appears in document $d$  
$T$ is the universal set of terms  
$D$ is the universal set of documents  
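
The formula above can be evaluated directly in plain Python, as in the following sketch. The function name `tfidf` is hypothetical, and the natural logarithm is an assumption since the formula does not fix a base. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its output will not match these numbers.

```python
import math
from collections import Counter

def tfidf(corpus_tokens):
    """Return one {term: tf-idf} dict per document, following the formula above."""
    num_documents = len(corpus_tokens)
    # Number of documents in which each term appears at least once.
    document_frequency = Counter(
        term for tokens in corpus_tokens for term in set(tokens)
    )
    weights = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(num_documents / document_frequency[term])
            for term, count in counts.items()
        })
    return weights

example_corpus = ["the man is a man", "the woman is the woman"]
example_weights = tfidf([doc.split() for doc in example_corpus])
for doc_weights in example_weights:
    print(doc_weights)
```

Terms that occur in every document, such as "the" and "is", receive a weight of zero, while "man" and "woman" receive the highest weight in their respective documents.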


In [None]:
TFIDFEncoder = sklearn.feature_extraction.text.TfidfVectorizer(token_pattern=r'[a-z]+')
TFIDFCorpus = TFIDFEncoder.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(TFIDFEncoder.get_feature_names_out())
print(TFIDFCorpus.toarray())

### n-grams
https://en.wikipedia.org/wiki/N-gram  
n-grams are contiguous sequences of n tokens in a document.  
A small amount of local context can be added to the bag-of-words model by extending tokens into n-grams.  
For example, marking the existence of the bigram 'not good' in a document would give more information about its content than the tokens 'not' and 'good' separately: this small amount of local context can have significant meaning.
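
Extracting n-grams from a token list takes only a sliding window, as in this minimal sketch. The helper name `ngrams` and the example sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the food was not good".split()
print(ngrams(tokens, 2))
```

The bigram `('not', 'good')` appears in the result as a single feature, which a unigram model would have split into two unrelated tokens.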

In [None]:
NGramCountEncoder = sklearn.feature_extraction.text.CountVectorizer(token_pattern=r'[a-z]+', ngram_range=(1,2))
NGramCountCorpus = NGramCountEncoder.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(NGramCountEncoder.get_feature_names_out())
print(NGramCountCorpus.toarray())

## Contextual Models

### Word2Vec

### BERT and ELMo

### Syntax Trees and Grammars

## Non-Vector Models

### Princeton WordNet