### Examples of what NLP can do

* What's the topic of this text? (text classification)
* Does this text contain abuse? (moderation)
* Does this text sound positive or negative? (sentiment analysis)
* What should be the next word in this incomplete sentence? (language modelling)
* How would you say this in Dutch? (translation)
* Produce a summary of this article in one paragraph. (summarization)

# What needs to be done to process text for neural networks?
* Standardize: convert to lower case and remove punctuation, although this is lossy and can enable Unicode injection!
* Ignore some very common words and glyphs, e.g. "the", "a/an", the so-called _stop words_. The assumption is that these don't carry significant meaning, but some linguistics research disagrees.
* Split the text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc.
* Convert all tokens to a tensor. This means (typically) indexing the tokens.

### Example
The cat sat on the mat.

the cat sat on the mat

["cat", "sat", "on", "mat"]

[2, 34, 53, 8]
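
The four steps in the example above can be sketched in plain Python. This is a minimal sketch, not a production pipeline: the stop-word set, the reserved `<pad>`/`<unk>` entries and the helper names are assumptions made for illustration, and the indices come out as `[2, 3, 4, 5]` rather than `[2, 34, 53, 8]` because the vocabulary here is built from this one sentence only.

```python
import string

# Assumption: a tiny hand-picked stop-word list, only for this example.
STOP_WORDS = {"the", "a", "an"}

def standardize(text: str) -> str:
    """Lower-case the text and strip ASCII punctuation (lossy!)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text: str) -> list[str]:
    """Word-level tokens: split on whitespace, then drop stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Index tokens in order of first appearance; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

clean = standardize("The cat sat on the mat.")
tokens = tokenize(clean)
vocab = build_vocab(tokens)
ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]

print(clean)   # the cat sat on the mat
print(tokens)  # ['cat', 'sat', 'on', 'mat']
print(ids)     # [2, 3, 4, 5]
```

The list of ids is what would be turned into a tensor in the final step.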

é -> e

è -> e

but also:

ĸ (kra, U+0138, which resembles the Greek kappa κ) -> k

𝑎 (Mathematical Italic Small A, Unicode U+1D44E) -> a

These characters are missing from many fonts, and some font renderers draw nothing at all for a character without a glyph, which makes them a security concern.
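
Two of the mappings above can be reproduced with Python's standard `unicodedata` module. This is a sketch of one possible standardization step, not necessarily what any particular tokenizer does; `strip_accents` is a hypothetical helper name. The characters are written as escapes so the example does not depend on the font used to display it.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

accented = "\u00e9 \u00e8"      # e with acute accent, e with grave accent
math_a = "\U0001D44E"           # Mathematical Italic Small A

print(strip_accents(accented))                 # e e
print(unicodedata.normalize("NFKC", math_a))   # a
print(unicodedata.name(math_a))                # MATHEMATICAL ITALIC SMALL A
```

Mappings that are not Unicode decompositions, such as kra to k, need an explicit transliteration table on top of this.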

Then you have things like:

U̶̺̦͐͛n̵̻̠͋i̸̍͠c̵̗̬̮̬̥̐͐ò̶̮̼̦̉͒ḏ̴̰̓̈́e̶̒̕ ̵̮͎̥̲̬̦̒̂i̷͘͝ŝ̸̜͘ ̶̡̬̙̭̲͋̌a̴̳̿̆̔ ̵͖̎v̸͒̒̃ͅé̶̘̇̔r̷̛̗͐y̷̭͚̐ ̸͚͔̲͓̦̉͑g̸̿e̵͚͇̘̍͆̇n̶̳͒̚ẻ̵̞͔͍̦̤̓r̵̛͆̆a̶̛̝͗͋l̴̺̚ ̸͉͎̮͙̉t̴̨̅̃e̷̜͝x̵͎̤͆̑̈́t̴͔̻͝ ̴̧̩̘̈f̴̞̮̮͋̕o̴̭̫͛͠r̷̩̀̔͝m̵̙͍̟͑̋̈́á̴͎̯̥̇ẗ̸̢̫̼͈͎́

"Unicode is a very general text format"

Should the tokenizer be able to parse the text or not?

# Three ways of handling tokens
## Word-level tokenization
Tokens are space-separated substrings (or punctuation-separated if appropriate). A variant also splits words into subwords, which is especially important for agglutinative and compounding languages, such as Finnish or Swedish.
## N-gram tokenization
Tokens are groups of N consecutive words. For example, "the cat", "he was", "over there" -- these are 2-grams or "bigrams".
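
As a minimal sketch of the idea, here is one way to build word N-grams in plain Python; `ngrams` is a hypothetical helper written for this note, not a library function.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
bigrams = ngrams(words, 2)
print(bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

The same function gives character N-grams if it is passed `list(text)` instead of a list of words.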
## Character-level tokenization
Each character is its own token. In practice, this is useful for languages with rich writing systems or logographic writing (Cyrillic, hanzi, hangul, abjads, abugidas, Devanagari, etc.). Some of these benefit from, or even require, N-character tokenization. Others should be trained with radicals and subsets of partial characters (hangul and hanzi in particular).
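
To illustrate the difference between whole characters and partial characters, here is a small sketch using hangul. It assumes that Unicode NFD decomposition is an acceptable stand-in for splitting syllable blocks into their component letters (jamo); a real tokenizer may well do this differently.

```python
import unicodedata

# The word "hangul" written in hangul: two syllable blocks (U+D55C, U+AE00).
text = "\ud55c\uae00"

# Character-level tokens: one token per code point.
char_tokens = list(text)

# NFD normalization splits each syllable block into its jamo (partial characters).
jamo_tokens = list(unicodedata.normalize("NFD", text))

print(len(char_tokens))  # 2
print(len(jamo_tokens))  # 6
print([unicodedata.name(jamo) for jamo in jamo_tokens[:3]])
```

Recomposing with `unicodedata.normalize("NFC", ...)` gives back the original two syllable blocks, so the split is lossless.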
## Other tokenizations
Also worth mentioning are the linguistic concepts "morpheme", "lexeme", "grapheme" and "phoneme". You can tokenize text in several layers (embeddings) if needed!

Morpheme
> a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming)

Lexeme
> a basic lexical unit of a language consisting of one word or several words, the elements of which do not separately convey the meaning of the whole. For example 'run', 'runs', 'ran' and 'running' are all inflections of the lexeme RUN. 

Also consider "take care", "take care of", "take care of the", "care for", "care for a". These are separate _lexemes_, and the individual words "take", "care", "of", "for", "the", "a" do not separately convey the meaning of the whole lexemes: HANDLE, NURTURE, WANT, CONCERN. You can see why linguists are sceptical about stop words: they _do_ drastically change some lexemes!

Grapheme
> the smallest meaningful contrastive unit in a writing system. English: 's', 'sh', 'ch', 'oo', 'th', 'b', 'a'

Many languages written in Latin alphabet(s) have complex graphemes:
- samhailchomhartha (Irish 'symbol', Celtic spelling: saṁaılċoṁarṫa)
- guillemet (French, also Norwegian, quotation marks «comme ça»)
- przybyszewszczyzna (Polish art movement)

Phoneme
> any of the perceptually distinct units of sound in a specified language that distinguish one word from another, for example p, b, d, and t in the English words pad, pat, bad, and bat.

Note that phonemes are incredibly complex. Diphthongs, elision, lenition, pitch accent and tones make spoken language very difficult to generalize. For example, Swedish has tonal words despite not being a tonal language:
- tòmten (the plot of land around a house)
- tómten (definite form of 'gnome', Santa Claus)



## Embeddings

Tokenization itself isn't enough. Token indices are arbitrary, so tokens end up basically anywhere in token-space (the vector space spanned by the vectors we defined). We can side-step this by attaching a linear/MLP layer to each token vector (the one-hot encoding of each token) and _learning_ the output of that layer together with all the other tokens. This is called an _embedding_, specifically a _learned embedding_. The network learns to produce outputs such that similar features are close to each other, regardless of how far apart the tokens are in token-space. If we apply the same logic as we did to CNNs, we can say that similar _meanings_ or _concepts_ are clustered by the embeddings. Also note the similarity to the "NatureCNN" feature embeddings in PPO.

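In PyTorch, there's a much more efficient version: <code>nn.Embedding</code>. It bypasses the one-hot encoding entirely and looks up rows of the weight matrix by index. The number of parameters is the same as for a bias-free linear layer, but this saves a lot of memory and computation!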
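
The relationship between the two routes can be checked directly. The sketch below uses toy sizes chosen for illustration and compares a one-hot encoding multiplied by the embedding's weight matrix (i.e. a linear layer without bias) with the direct lookup done by `nn.Embedding`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 10, 4          # toy sizes, chosen for illustration
token_ids = torch.tensor([2, 5, 5, 9])

embedding = nn.Embedding(vocab_size, embed_dim)

# Route 1: one-hot vectors multiplied by the weight matrix (a bias-free linear layer).
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_one_hot = one_hot @ embedding.weight

# Route 2: nn.Embedding looks the rows up directly; no one-hot tensor is built.
via_lookup = embedding(token_ids)

print(torch.allclose(via_one_hot, via_lookup))  # True
print(via_lookup.shape)                         # torch.Size([4, 4])
```

Both routes share the same `vocab_size x embed_dim` weight matrix; the lookup just avoids materializing and multiplying a mostly-zero tensor.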

NOTE: a traditional way to encode relevance is to use TF-IDF, Term Frequency-Inverse Document Frequency. This is still used in some neural networks. It's a statistical measure that counts the occurrences of a token in one sample (the term frequency) and multiplies that by the inverse document frequency, typically the logarithm of the total number of samples divided by the number of samples containing the token.
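
A minimal sketch of that calculation, assuming the plain unsmoothed variant `tf * log(N / df)`; libraries often use smoothed or normalized variants, so their numbers will differ. The three-document corpus and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Term frequency in `doc` times the log inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0, since "the" occurs in every document
print(tf_idf("cat", docs[0], docs))  # small: "cat" occurs in two of three documents
print(tf_idf("mat", docs[0], docs))  # largest: "mat" occurs only in this document
```

Note how a stop word like "the" gets zero weight without ever being put on an explicit stop-word list.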

There are many tokenizers and embeddings available. Training your own is a large undertaking and raises several particular concerns.

#### Alignment
A famous example is that token embeddings of the English language will learn that 'King'-'Man'+'Woman' = 'Queen'. However, the same embeddings will learn 'Doctor'-'Man'+'Woman' = 'Nurse'. Here it has learned a pattern that can be interpreted as a historical stereotype that is not aligned with current society, especially in a country like Sweden where 58% of graduates with an MD are women. These embeddings will of course vary greatly with the corpus, with which labels are present in the training data, and with the relevance of those features to the corpus. It may not be appropriate to learn "all of the UFO conspiracy forums" if the system is supposed to understand manuals for industrial machines. Similarly, deriving semantic gender or socio-economic relationships from hospital dramas or historical fiction is probably not aligned with factual society.
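
The vector arithmetic behind such analogies can be shown with a toy example. The vectors below are hand-made, not learned: two axes that loosely stand for "royalty" and "gender" are assumed purely for illustration. With real embeddings such as word2vec, the same arithmetic is applied to vectors with hundreds of dimensions.

```python
import math

# Assumption: hand-made 2-D vectors (axis 0 = "royalty", axis 1 = "gender"),
# not learned embeddings.
toy = {
    "king":     (1.0,  1.0),
    "queen":    (1.0, -1.0),
    "prince":   (0.7,  1.0),
    "princess": (0.7, -1.0),
    "man":      (0.0,  1.0),
    "woman":    (0.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(toy[a], toy[b], toy[c]))
    candidates = [word for word in toy if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(toy[word], target))

print(analogy("king", "man", "woman"))  # queen
```

The result depends entirely on where the vectors sit, which is why a biased corpus yields biased analogies.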

#### Safety
An embedding may also learn entirely unsafe relationships. A famous example from the early days of computers is the meaning of traffic lights derived from behavioural data: "Red - stop, Green - go, Yellow - go fast". If these embeddings already exist in the tokenization, the later portions of the system will have a very difficult time reinterpreting the meaning. The infamous "glue on pizza" Google recommendation is partially an embedding failure: "food-safe glue" is not an ingredient in cooking and doesn't belong in a recipe. You glue together a broken bowl with food-safe glue. You don't eat it. Doing so is harmful. _Edible_ glue (better known as gum arabic) is a product used for decorative desserts, but that makes little sense on a pizza.

### Hugging Face tokenizers

In [1]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")

split = imdb_dataset["train"].train_test_split(train_size=0.8)

imdb_train_set, imdb_validation_set = split["train"], split["test"]

imdb_test_set = imdb_dataset["test"]

In [2]:
imdb_train_set[1]

{'text': 'Prom Night is shot with the artistic eye someone gives while finely crafting a Lifetime original film. You know the one. This October, Lifetime takes a break from the courageous tale of a woman surviving (insert disease name here) to tell the somewhat creepy tale of a woman pursued by a stalker ex-boyfriend. It\'s dramatic \x85 it\'s sappy \x85 it\'s immensely dull. It does nothing to further a genre, tell an original story, or strive for ANY sort of newness. Prom Night shares this plight. Watching the killer poke holes in his victims, we sit silently as they slump to the floor with not a drop of blood spilled. It occurred to me that this was the cleanest killer in movie history.<br /><br />Our director is working with a fairly good-looking killer so he is forced to pour on the camera angles to make him appear creepier. Think about Matthew McConaughey coming at you with a knife. You\'d probably go \x85 "OH! Good lookin guy is going to kill me? Naaaa." Not scary even for a sec

In [3]:
import tokenizers

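# Byte-pair encoding (BPE); tokens missing from the vocabulary map to <unk>
bpe_model = tokenizers.models.BPE(unk_token="<unk>")
bpe_tokenizer = tokenizers.Tokenizer(bpe_model)
# Split on whitespace and punctuation before learning the merges
bpe_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()
# Special tokens get the first ids: <pad> = 0, <unk> = 1
special_tokens = ["<pad>", "<unk>"]
bpe_trainer = tokenizers.trainers.BpeTrainer(vocab_size=1000, special_tokens=special_tokens)

# Standardize (lower-case) the reviews, then train the tokenizer on them
train_reviews = [review["text"].lower() for review in imdb_train_set]
bpe_tokenizer.train_from_iterator(train_reviews, bpe_trainer)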






In [4]:
my_text = "what 𝑎 dreadfully awesome movie!"

bpe_encoding = bpe_tokenizer.encode(my_text)
bpe_encoding

Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [5]:
bpe_encoding.tokens, bpe_encoding.ids, bpe_encoding.offsets

(['what', '<unk>', 'd', 'read', 'fully', 'aw', 'es', 'ome', 'movie', '!'],
 [302, 1, 45, 574, 985, 374, 148, 223, 209, 4],
 [(0, 4),
  (5, 6),
  (7, 8),
  (8, 12),
  (12, 17),
  (18, 20),
  (20, 22),
  (22, 25),
  (26, 31),
  (31, 32)])

In [6]:
bpe_tokenizer.encode_batch(train_reviews[:3])


[Encoding(num_tokens=377, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=803, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=627, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

In [7]:
bpe_tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
bpe_tokenizer.enable_truncation(max_length=500)

In [8]:
import torch

bpe_encodings = bpe_tokenizer.encode_batch_fast(train_reviews[:3])


In [9]:
bpe_encodings

[Encoding(num_tokens=500, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=500, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=500, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

In [10]:
bpe_batch_ids = torch.tensor([encoding.ids for encoding in bpe_encodings])
bpe_batch_ids

tensor([[509,  10,  60,  ...,   0,   0,   0],
        [394, 164, 817,  ..., 150, 980, 274],
        [167, 323,  55,  ..., 317, 189, 439]])

In [11]:
attention_mask = torch.tensor([encoding.attention_mask for encoding in bpe_encodings])
attention_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])

In [12]:
lengths = attention_mask.sum(dim=-1)
lengths

tensor([377, 500, 500])

### Byte-level Byte-pair Encoding (BBPE)

In [13]:
import transformers

gpt2_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
gpt2_encoding = gpt2_tokenizer(train_reviews[:3], truncation=True, max_length=500)

In [14]:
gpt2_token_ids = gpt2_encoding["input_ids"][0][:10]
gpt2_token_ids

[1456, 338, 262, 922, 1705, 717, 13, 366, 38685, 1]

In [15]:
gpt2_tokenizer.decode(gpt2_token_ids)

'here\'s the good news first. "spirit"'

* BBPE
  > GPT, Llama, RoBERTa, BLOOM
* WordPiece
  > BERT, DistilBERT, ELECTRA
* Unigram
  > ALBERT, mBART, hànzì, hangul
* SentencePiece
  > Subword tokenization, e.g. Arabic, Finnish, German, Hungarian, Polish, Swedish, Turkish


There are also pre-trained embeddings: 

In [16]:
bert_model = transformers.AutoModel.from_pretrained("bert-base-uncased")

In [17]:
bert_model.embeddings.word_embeddings

Embedding(30522, 768, padding_idx=0)

Note the PyTorch embedding layer; this model was implemented in PyTorch.

The 'most famous' embeddings are:
* word2vec (Google)
* GloVe (Stanford)
* FastText (Facebook/Meta)

## Positional encodings

Natural language is full of ordered information, but unlike in a computer system, that information isn't held in nice structures like stacks, trees or lists. An obvious solution for an NN is to simply add an encoding of each token's _position_ within the text and attach an embedding to it, as with the tokens themselves (i.e. the position starts out as a sparse vector). See pp. 584-585 in the book, and the corresponding example in the handson-mlp repo for an example implementation.
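
As a rough sketch of the idea (not the book's implementation, and not the code in the handson-mlp repo), a learned positional embedding can be a second `nn.Embedding` indexed by position and added to the token embedding. The class name and `embed_dim=32` are assumptions; the vocabulary size and maximum length mirror the BPE tokenizer trained above.

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Sum of a token embedding and a learned position embedding."""

    def __init__(self, vocab_size: int, max_length: int, embed_dim: int):
        super().__init__()
        # padding_idx=0 keeps the <pad> token's vector at zero.
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_length, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids has shape (batch, sequence_length).
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The position vectors broadcast over the batch dimension.
        return self.token_embedding(token_ids) + self.position_embedding(positions)

embed = TokenAndPositionEmbedding(vocab_size=1000, max_length=500, embed_dim=32)
batch = torch.randint(1, 1000, (3, 500))  # random ids standing in for bpe_batch_ids
print(embed(batch).shape)  # torch.Size([3, 500, 32])
```

Because positions are looked up in a table of `max_length` rows, this variant cannot handle sequences longer than it was built for, which is one reason truncation was enabled in the tokenizer.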