### Examples of what NLP can do

* What's the topic of this text? (text classification)
* Does this text contain abuse? (moderation)
* Does this text sound positive or negative? (sentiment analysis)
* What should be the next word in this incomplete sentence? (language modelling)
* How would you say this in Dutch? (translation)
* Produce a summary of this article in one paragraph. (summarization)

# What needs to be done to process text for neural networks?
* Standardize: convert to lower case and remove punctuation (although this is lossy!)
* Split the text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc.
* Convert all tokens to a tensor. This (typically) means indexing the tokens, i.e. mapping each token to an integer.

### Example
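Original text: The cat sat on the mat.

Standardized: the cat sat on the mat

Tokens (here with "the" dropped): ["cat", "sat", "on", "mat"]

Indices: [2, 34, 53, 8]

Standardization can also strip accents:

é -> e

è -> e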
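
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the helper names (`standardize`, `tokenize`, `build_vocab`, `encode`) and the `[UNK]` placeholder are assumptions made for this example, the sketch keeps every word (including "the"), and the resulting indices depend entirely on the vocabulary, so they will not match the numbers shown above.

```python
import re
import unicodedata


def standardize(text):
    """Lower-case, strip accents, and remove punctuation (lossy!)."""
    text = text.lower()
    # Decompose accented characters (é -> e + combining accent), then drop the accents.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep only word characters and whitespace.
    return re.sub(r"[^\w\s]", "", text)


def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()


def build_vocab(token_lists):
    """Assign an integer index to every distinct token; 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to indices, falling back to [UNK] for unseen tokens."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


tokens = tokenize(standardize("The cat sat on the mat."))
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)     # [1, 2, 3, 4, 1, 5]
```

The list of indices is what would then be wrapped in a tensor before being fed to a network.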

# Three ways of handling tokens
## Word-level tokenization
Tokens are space-separated substrings (or punctuation-separated if appropriate). A variant also splits words into subwords, which is especially important for agglutinative and compounding languages, such as Finnish or Swedish.
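
As a rough illustration of subword splitting, the sketch below greedily matches the longest known piece from the left. The function name `split_subwords`, the toy vocabulary, and the `[UNK]` fallback are all assumptions for this example; real subword tokenizers learn their vocabulary from a corpus and use more refined rules.

```python
def split_subwords(word, vocab):
    """Greedy longest-match split of a word into known subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: give up on the whole word.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary for a Swedish compound word.
toy_vocab = {"sjuk", "hus", "sjukhus", "personal"}

print(split_subwords("sjukhuspersonal", toy_vocab))  # ['sjukhus', 'personal']
print(split_subwords("xyz", toy_vocab))              # ['[UNK]']
```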
## N-gram tokenization
Tokens are groups of N consecutive words. For example, "the cat", "he was", "over there" -- these are 2-grams or "bigrams".
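
A minimal sketch of word N-grams, assuming the text has already been split into word tokens (the helper name `ngrams` is made up for this example):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

print(ngrams(tokens, 2))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(ngrams(tokens, 3))
# ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```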
## Character-level tokenization
Each character is its own token. In practice, this is useful for languages with rich writing systems or logographic writing (Cyrillic, hanzi, hangul, abjads, abugidas, Devanagari, etc.). Some of these benefit from, or even require, N-character tokenization. Others should be trained with radicals and subsets of partial characters (hangul and hanzi in particular).
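
The sketch below shows character tokens, character N-grams, and one way to get at partial characters for hangul: Unicode NFD normalization splits each precomposed syllable block into its jamo. The helper names are assumptions for this example, and note that this trick does not work for hanzi, since radicals are not exposed through Unicode normalization.

```python
import unicodedata


def char_tokens(text):
    """Character-level tokenization: every character is a token."""
    return list(text)


def char_ngrams(text, n):
    """Every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def decompose_hangul(text):
    """Split precomposed hangul syllables into their jamo using NFD."""
    return list(unicodedata.normalize("NFD", text))


print(char_tokens("mat"))            # ['m', 'a', 't']
print(char_ngrams("tomten", 3))      # ['tom', 'omt', 'mte', 'ten']

jamo = decompose_hangul("한글")
print(len(jamo))                     # 6 (two syllable blocks, three jamo each)
```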
## Other tokenizations
Also worth mentioning are the linguistic concepts "morpheme", "lexeme", "grapheme" and "phoneme". You can tokenize text in several layers (embeddings) if needed!

Morpheme
> a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming)

Lexeme
> a basic lexical unit of a language consisting of one word or several words, the elements of which do not separately convey the meaning of the whole.

Grapheme
> the smallest meaningful contrastive unit in a writing system.

Many languages written in Latin alphabet(s) have complex graphemes:
- samhailchomhartha (Irish 'symbol', Celtic spelling: saṁaılċoṁarṫa)
- guillemet (French, also Norwegian, quotation marks «comme ça»)
- przybyszewszczyzna (guess what language this is)

Phoneme
> any of the perceptually distinct units of sound in a specified language that distinguish one word from another, for example p, b, d, and t in the English words pad, pat, bad, and bat.

Note that phonemes are incredibly complex. Diphthongs, elision, lenition, pitch accent and tones make spoken language very difficult to generalize. For example, Swedish has tonal words despite not being a tonal language:
- tòmten (definite form of 'plot', the plot of land around a house)
- tómten (definite form of 'gnome', Santa Claus)



## Embeddings

Tokenization itself isn't enough. Token indices are arbitrary, so tokens end up basically anywhere in token-space (the space of the vectors we defined), and distance there says nothing about meaning. We can side-step this by attaching a linear/MLP layer to each token vector (the one-hot encoding of each token) and _learning_ the output of that layer together with all the other tokens. This is called an _embedding_, specifically a _learned embedding_. The network learns to produce outputs such that similar features are close to each other, regardless of how far apart the tokens are in token-space. If we apply the same logic as we did to CNNs, we can say that similar _meanings_ or _concepts_ are clustered by the embeddings.

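In PyTorch, there's a much more efficient version: <code>nn.Embedding</code>. It stores the same weight matrix as a bias-free linear layer, but looks up rows by token index instead of multiplying by a one-hot vector. This bypasses the one-hot encoding entirely and thus saves a lot of memory and computation!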
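
To make the connection concrete, the sketch below checks that an `nn.Embedding` lookup gives the same result as a bias-free `nn.Linear` layer applied to one-hot vectors when both share the same weights. The vocabulary size, embedding dimension and token indices are arbitrary values chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4  # arbitrary sizes for illustration

embedding = nn.Embedding(vocab_size, embed_dim)
linear = nn.Linear(vocab_size, embed_dim, bias=False)

# Give the linear layer the same weights as the embedding table.
# nn.Linear stores its weight as (out_features, in_features), hence the transpose.
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

token_ids = torch.tensor([2, 7, 2, 5])

# Route 1: build one-hot vectors, then multiply by the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_linear = linear(one_hot)

# Route 2: look the rows up directly by index.
via_lookup = embedding(token_ids)

same = torch.allclose(via_linear, via_lookup)
print(same)              # True
print(via_lookup.shape)  # torch.Size([4, 4])
```

Both layers hold a `vocab_size x embed_dim` weight matrix; the lookup simply never builds the `vocab_size`-wide one-hot vectors.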

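NOTE: a traditional way to encode relevance is TF-IDF, Term Frequency-Inverse Document Frequency. This is still used in some neural networks. It's a statistical measure that counts the number of occurrences of a token in one sample (the term frequency) and multiplies it by the inverse document frequency, typically the logarithm of the total number of samples divided by the number of samples that contain the token. Tokens that are frequent in one sample but rare across all samples get the highest scores.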
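
A minimal TF-IDF sketch in plain Python, using one common variant: term frequency normalized by sample length, and `log(N / df)` as the inverse document frequency. Libraries often use smoothed variants, so their numbers will differ. The function name `tf_idf` and the three toy samples are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: score} dict per tokenized sample."""
    n_docs = len(docs)
    # Document frequency: in how many samples does each token occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return scores


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
scores = tf_idf(docs)

print(scores[1]["the"])  # 0.0, since "the" occurs in every sample
print(scores[1]["dog"] > scores[1]["sat"])  # True, "dog" is rarer than "sat"
```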

There are many tokenizers and embeddings available. Training your own is a large undertaking and comes with many particular concerns.

#### Alignment
A famous example is that token embeddings of the English language will learn that 'King'-'Man'+'Woman' = 'Queen'. However, the same embeddings will learn 'Doctor'-'Man'+'Woman'='Nurse'. Here it has learned a pattern that can be interpreted as a historical stereotype that is not aligned with current society, especially in a country like Sweden where 58% of graduates with an MD are women. These embeddings will of course vary greatly with the corpus, which labels are present in the training data, and the relevance of those features to the corpus. It may not be appropriate to learn "all of the UFO conspiracy forums" if the system is supposed to understand manuals for industrial machines. Similarly, deriving semantic gender or socio-economic relationships from hospital dramas or historical fiction is probably not aligned with factual society.
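
The arithmetic behind such analogies can be sketched as "subtract, add, then find the nearest remaining vector by cosine similarity". The three-dimensional vectors below are hand-made, hypothetical values (roughly "royalty", "gender", "food") chosen only so the example works; real embeddings are learned and have hundreds of dimensions, and whatever bias is in the corpus ends up in exactly this kind of lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Hypothetical toy embeddings: (royalty, gender, food)
toy_embeddings = {
    "king":  (0.9, 0.8, 0.0),
    "queen": (0.9, -0.8, 0.0),
    "man":   (0.1, 0.9, 0.0),
    "woman": (0.1, -0.9, 0.0),
    "pizza": (0.0, 0.0, 1.0),
}

result = analogy("king", "man", "woman", toy_embeddings)
print(result)  # queen
```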

#### Safety
An embedding may also learn entirely unsafe relationships. A famous example from the early days of computers is the meaning of traffic lights derived from behavioural data: "Red - stop, Green - go, Yellow - go fast". If these embeddings already exist in the tokenization, the later portions of the system will have a very difficult time reinterpreting the meaning. The infamous "glue on pizza" Google recommendation is partially an embedding failure; "food safe glue" is not an ingredient in cooking and doesn't belong in a recipe. You glue together a broken bowl with food-safe glue. You don't eat it. Doing so is harmful. _Edible_ glue, better known as gum arabic, is a product used for decorative desserts, but that makes little sense on a pizza.

### Huggingface tokenizers

### Sentiment analysis

Code Along!