# 1. NLP: A Primer

### NLP Tasks

**Language modeling**: predicting the next word in a sentence from the preceding words, i.e., assigning a probability to each possible continuation. Applied to speech recognition, optical character recognition (OCR), handwriting recognition, etc.

**Text classification**: bucketing text elements into a set of categories. Applied to spam identification and sentiment analysis.

**Information retrieval**: finding the documents relevant to a query within a collection.

**Conversational agent**: building systems able to converse in natural language. E.g., Alexa and Siri.

**Text summarization**: producing short, meaningful summaries of long documents.

**Question answering**: building systems able to answer questions asked in human language.

**Machine translation**: automatically translating a document from one language to another.

**Topic modeling**: extracting main topics out of a collection of documents.
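Language modeling, the first task above, can be illustrated with a bigram model that estimates the probability of a word given only the previous word. This is a minimal sketch under stated assumptions: the toy corpus, the `<s>`/`</s>` boundary markers, and the function names are invented for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers for sentence start and end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Toy corpus, invented for illustration only.
corpus = [
    "the girl laughed",
    "the girl smiled",
    "the monkey laughed",
]
model = train_bigram_model(corpus)
p_girl = next_word_probability(model, "the", "girl")      # "girl" follows "the" in 2 of 3 cases
p_monkey = next_word_probability(model, "the", "monkey")  # "monkey" follows "the" in 1 of 3 cases
```

Practical language models condition on a much longer history and smooth their estimates so that unseen word pairs do not get zero probability.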

### Language

Four major building blocks: <u>phonemes</u>, <u>morphemes and lexemes</u>, <u>syntax</u>, and <u>context</u>.

|      **Block**      |     **Concerns**    |                    **Applications**                    |
|:-------------------:|:-------------------:|:------------------------------------------------------:|
| Context             | Meaning             | Summarization, Topic Modeling, Sentiment Analysis      |
| Syntax              | Phrases & Sentences | Parsing, Entity Extraction, Relation Extraction        |
| Morphemes & Lexemes | Words               | Tokenization, Word Embeddings, POS Tagging             |
| Phonemes            | Speech & Sounds     | Speech to Text, Speaker Identification, Text to Speech |

**Phonemes**: smallest unit of sound. E.g., `/k/` as in *cat*.

**Morphemes and lexemes**:

- Morphemes: smallest meaningful unit of language. E.g., *unbreakable* -> *un + break + able*.

- Lexemes: structural variations of morphemes that are related to one another by meaning. E.g., *run* and *running* belong to the same lexeme.

**Syntax**: set of grammar rules to build sentences.

- Level 3 (sentence): *The girl laughed at the monkey*.

- Level 2 (phrases):
    - Noun phrase (NP): *The girl*;
    - Verb phrase (VP): *laughed at the monkey*.

- Level 1 (parts of speech):
    - Determiner (Det): *The*;
    - Noun (N): *girl*;
    - Verb (V): *laughed*;
    - Preposition (P): *at*;
    - Determiner (Det): *the*;
    - Noun (N): *monkey*.

- Level 0 (words): *The*, *girl*, *laughed*, *at*, *the*, *monkey*.
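The four levels above can be sketched as a single tree, from which each level is read off by traversal. The nested-tuple encoding and the helper names are assumptions made for this example; the verb phrase is kept flat to mirror the list above, whereas a fuller grammar would nest a prepositional phrase inside it.

```python
# A constituency tree as nested tuples: (label, child, child, ...).
# Leaves are (part_of_speech, word) pairs.
tree = (
    "S",
    ("NP", ("Det", "The"), ("N", "girl")),
    ("VP", ("V", "laughed"), ("P", "at"), ("Det", "the"), ("N", "monkey")),
)

def is_leaf(node):
    """A leaf pairs a part-of-speech tag with a single word."""
    return len(node) == 2 and isinstance(node[1], str)

def pos_tags(node):
    """Level 1: (word, part-of-speech) pairs in sentence order."""
    if is_leaf(node):
        return [(node[1], node[0])]
    return [pair for child in node[1:] for pair in pos_tags(child)]

def words(node):
    """Level 0: the words alone."""
    return [word for word, _ in pos_tags(node)]

def phrases(node):
    """Level 2: each direct child of the sentence node with the text it spans."""
    return [(child[0], " ".join(words(child))) for child in node[1:]]

sentence = " ".join(words(tree))  # Level 3: the whole sentence
```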

**Context**: interactions between the elements of the language that convey meaning. Usually composed of *semantics* and *pragmatics*.

- Semantics: meaning of the words without external context.

- Pragmatics: takes world knowledge and external context into consideration.

### Challenges

- Ambiguity
- Common knowledge
- Creativity
- Language diversity

### Approaches

**Heuristics-based NLP**: requires expertise in the domain to formulate the rules. Tools: [`regex`](https://docs.python.org/3/library/re.html), [`pregex`](https://github.com/insperatum/pregex), [context-free grammars](https://hackage.haskell.org/package/Earley), [JAPE](https://en.wikipedia.org/wiki/JAPE_(linguistics)).
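As a rough illustration of the heuristic approach, the sketch below uses Python's standard `re` module for two hand-written rules: a pattern that pulls e-mail addresses out of text, and a word-list lookup for sentiment. The word lists, the patterns, and the function names are all assumptions chosen for the example; a real rule set would be written by someone who knows the domain.

```python
import re

# Hypothetical, deliberately tiny lexicons.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

TOKEN = re.compile(r"[a-z']+")
# Simplified e-mail pattern; it does not cover every valid address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_emails(text):
    """Return every substring that looks like an e-mail address."""
    return EMAIL.findall(text)

def rule_based_sentiment(text):
    """Count lexicon hits and compare positive against negative."""
    tokens = TOKEN.findall(text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

emails = extract_emails("Contact alice@example.com or bob.smith@mail.example.org.")
label = rule_based_sentiment("I love this great movie")
```

Rules like these are easy to inspect, but they ignore negation and context: "not good" would still count as a positive hit.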

**Machine Learning for NLP**:

- Naive Bayes: algorithm for classification, based on Bayes' Theorem; calculates the probability of each label given the input data; assumes the features are independent of one another given the label.

- Support Vector Machine: algorithm for classification, aimed at learning a (linear or non-linear) decision boundary that maximizes the margin, i.e., the distance between the boundary and the closest points of each class. Strength: robustness; weakness: scalability.

- Hidden Markov Model: statistical model which assumes that the observed data is generated by an underlying Markov process whose states are hidden (not directly observable).

- Conditional Random Fields: classification algorithm used for sequential data; assigns a label to each element of the sequence while taking the neighboring elements into account.
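The Naive Bayes idea from the list above can be sketched in a few lines: score each label by its prior probability multiplied by the probability of every word given that label, treating the words as independent. The training data, the function names, and the choice of add-one (Laplace) smoothing are assumptions made for this example, not something the notes prescribe.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log P(label)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # log P(word | label) with add-one smoothing
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set, invented for illustration only.
training_data = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train_naive_bayes(training_data)
spam_guess = predict(model, "claim free money")
ham_guess = predict(model, "lunch meeting tomorrow")
```

Working with log-probabilities avoids numerical underflow when many small probabilities are multiplied together.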

**Deep Learning for NLP**:

- Recurrent Neural Networks (RNN): read and process input data sequentially; have a short "memory", so context from far back in the sequence tends to be lost.

- Long Short-Term Memory (LSTM): a type of RNN; discards irrelevant context, only keeping the necessary part of it.

- Convolutional Neural Networks (CNN): use convolution and pooling layers to represent text in a condensed manner; are able to analyze groups of words.

- Transformers: model textual context non-sequentially, through [self-attention](https://towardsdatascience.com/understand-self-attention-in-bert-intuitively-cd480cbff30b) rather than recurrence; this gives them a higher representation capacity. E.g., [BERT](https://huggingface.co/transformers/model_doc/bert.html).

- Autoencoders: learn a compressed vector representation of the input by training the network to reconstruct that input.
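The self-attention mechanism behind Transformers can be sketched without any library: every token scores itself against every other token, the scores are turned into weights with a softmax, and each output is the weighted average of all token vectors. This is a simplified sketch under loud assumptions: queries, keys, and values are all the raw input vectors (a real Transformer applies learned projections, multiple heads, and positional information), and the 2-dimensional embeddings are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        output = [
            sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up embeddings: tokens 0 and 1 are similar, token 2 is different.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
```

Because every token attends to every other token in one step, no information has to be carried through a sequential memory, which is the contrast with RNNs drawn in the list above.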