# One Flew over NLP

A quick look through current NLP.
We will look at Large Language Models (LLMs), which have become so popular in recent years,
and at SpaCy, a more practical library for NLP on a smaller scale.

## LLMs

GPT, PaLM, Galactica, you name it.
Training one of these popular models takes hundreds of years of compute time, spread across many machines.
They learn language patterns which they can later replicate.
The complete models contain billions of parameters,
and are often too expensive to just run, not to mention to train on small problems.
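
To get a feel for why the full models are expensive to even load, a back-of-the-envelope sketch can estimate the memory taken by the weights alone. The helper `weights_size_gb` is hypothetical, the parameter counts are the commonly quoted ones (about 110 million for BERT base, 175 billion for GPT-3) and are used here as assumptions, and activations and optimizer state are ignored.

```python
def weights_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# commonly quoted parameter counts, used here as assumptions
bert_base = weights_size_gb(110e6)  # stored as 32-bit floats
gpt3_fp16 = weights_size_gb(175e9, bytes_per_param=2)  # stored as 16-bit floats

print(f"BERT base: {bert_base:.2f} GB")
print(f"GPT-3 (fp16): {gpt3_fp16:.0f} GB")
```

Under these assumptions the small model fits comfortably on a laptop, while the large one needs several accelerators just to hold its weights.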

The [huggingface][hug] hub is one place where one can download
reasonably sized pre-trained versions of popular models.
These are often cut-to-size versions of the big trained models
but with little loss of accuracy.

The `transformers` library downloads models from the huggingface hub and runs them on top of [PyTorch][torch].
Please note that this is often experimental code, so one needs to fiddle with
the dependencies oneself.
For example, the examples below currently need:

    pip install torch
    pip install transformers

[hug]: https://huggingface.co/models
[torch]: https://pytorch.org/

In [None]:
from pprint import pprint

from torch import nn
from transformers import pipeline

Similar to `sklearn`, the `pipeline` builds an easy-to-use interface to the models.

The important argument is the `task`.
One needs to know which model to use for a given task,
as most models are only capable of a few of the tasks.

In [2]:
help(pipeline)

Help on function pipeline in module transformers.pipelines:

pipeline(task: Optional[str] = None, model: Union[str, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel'), NoneType] = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, ForwardRef('PreTrainedTokenizerFast'), NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, image_processor: Union[str, transformers.image_processing_utils.BaseImageProcessor, NoneType] = None, processor: Union[str, transformers.processing_utils.ProcessorMixin, NoneType] = None, framework: Optional[str] = None, revision: Optional[str] = None, use_fast: bool = True, token: Union[str, bool, NoneType] = None, device: Union[int, str, ForwardRef('torch.device'), NoneType] = None, device_map=None, torch_dtype=None, trust_remote_code: Optional[bool] = None, model_kwargs: Optional[Dict[s

One common task is mask-filling:
based on the context, which word is most likely to stand in the place of the mask?

Pretty much any BERT-based model is capable of this task.

In [3]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


And we can try a few masked sentences.

The return from the `unmasker` is a list of ranked possibilities for the mask.

In [4]:
cuckoo = "One flew over the [MASK] nest"
print(cuckoo)
pprint(unmasker(cuckoo))

elephant = "A humongous [MASK] entered the porcelain shop"
print(elephant)
pprint(unmasker(elephant))

One flew over the [MASK] nest
[{'score': 0.10374893993139267,
  'sequence': 'one flew over the entire nest',
  'token': 2972,
  'token_str': 'entire'},
 {'score': 0.033855464309453964,
  'sequence': 'one flew over the empty nest',
  'token': 4064,
  'token_str': 'empty'},
 {'score': 0.03381901606917381,
  'sequence': 'one flew over the whole nest',
  'token': 2878,
  'token_str': 'whole'},
 {'score': 0.027772599831223488,
  'sequence': 'one flew over the tree nest',
  'token': 3392,
  'token_str': 'tree'},
 {'score': 0.018109258264303207,
  'sequence': 'one flew over the old nest',
  'token': 2214,
  'token_str': 'old'}]
A humongous [MASK] entered the porcelain shop
[{'score': 0.3986216187477112,
  'sequence': 'a humongous figure entered the porcelain shop',
  'token': 3275,
  'token_str': 'figure'},
 {'score': 0.14282794296741486,
  'sequence': 'a humongous voice entered the porcelain shop',
  'token': 2376,
  'token_str': 'voice'},
 {'score': 0.06942196190357208,
  'sequence': 'a hum

Masking is often used to find new terms for specific words.
For example, to find synonyms in text.
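
The list returned by the `unmasker` can be post-processed with plain Python. A minimal sketch, assuming we only want candidate words above some score: the `results` list copies two entries (scores rounded) from the output above, and `candidates` is a hypothetical helper, not part of `transformers`.

```python
def candidates(results: list[dict], min_score: float = 0.05) -> list[str]:
    """Return the proposed words whose score is at least `min_score`."""
    return [r["token_str"] for r in results if r["score"] >= min_score]


# entries copied (scores rounded) from the `unmasker` output above
results = [
    {
        "score": 0.1037,
        "sequence": "one flew over the entire nest",
        "token": 2972,
        "token_str": "entire",
    },
    {
        "score": 0.0339,
        "sequence": "one flew over the empty nest",
        "token": 4064,
        "token_str": "empty",
    },
]
print(candidates(results))
```

The threshold is arbitrary; a sensible value depends on the model and on how flat the score distribution is for the sentence at hand.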

Another task we can build a pipeline for is text generation.
This is the more "popular" way to use LLMs: for conversations.

In [5]:
generator = pipeline("text-generation", model="gpt2")

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use mps:0


A text generator produces a conversation-like text based on the input.

In [6]:
conversation = "Yesterday I went to town to"
pprint(generator(conversation))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Yesterday I went to town to talk with you about your new '
                    'book, The Great American Detective. Do you think that you '
                    "can become a detective someday? I haven't read a book in "
                    'four years. Why? Why did this happen?\n'
                    '\n'
                    'James'}]


Note that the LLM does not know whether its statements are true or not.
The LLM only appears to be clever:
it only knows what a human would like to read,
not whether that statement is logical or not.

In [7]:
conversation = "Humans landed on Mars on"
pprint(generator(conversation))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Humans landed on Mars on Dec. 9, 1999. Credit: NASA '
                    'Spitzer Space Telescope (USA)\n'
                    '\n'
                    'A new NASA Spitzer orbiter will bring the first '
                    "high-resolution images of the world's most distant star, "
                    'the far-'}]
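
The point can be illustrated without a neural network at all. The toy below is a sketch only, and in no way how GPT-2 works internally: a bigram table that always continues with the word it has most often seen next. The tiny `corpus` and the helper names are made up for the illustration.

```python
from collections import Counter, defaultdict

# made-up training text; the model can only replicate patterns found here
corpus = (
    "probes landed on mars in 1997 . "
    "rovers landed on mars in 1997 . "
    "humans landed on the moon in 1969 ."
).split()

# count which word follows which
follows: dict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1


def generate(prompt: str, n_words: int = 3) -> str:
    """Greedily append the most frequent next word, `n_words` times."""
    words = prompt.lower().split()
    for _ in range(n_words):
        options = follows.get(words[-1])
        if not options:  # the last word was never seen in the corpus
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


print(generate("Humans landed on"))
```

No sentence in the corpus claims that humans landed on Mars, yet the toy produces one, simply because "on" is most often followed by "mars". It follows patterns and has no notion of truth.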


### The pipeline

We call the `pipeline` a pipeline in the cases above because it is composed of more than one step.
Similar to the way we chained `PCA` and `k-means` in `sklearn`,
the transformers pipeline glues a `tokenizer` and a `model` together.

The tokenizer splits the text into tokens, and the first layer of the model turns every token into a vector of hundreds of dimensions.
This is not too different from `tf-idf`;
the difference is that the dimensions tell more about the context of the words than in plain `tf-idf`.
These high dimensional representations are called word embeddings.
The models themselves are then huge attention neural nets which take the input
tokenised and mapped into these embeddings.
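
The two stages can be mimicked in a few lines of plain Python. This is a toy sketch under loud assumptions: a whitespace tokenizer with a made-up vocabulary, and random 4-dimensional vectors standing in for learned embeddings of several hundred dimensions. Real tokenizers split text into sub-word pieces and real embeddings are learned during training.

```python
import random

# hypothetical vocabulary; id 0 is reserved for unknown words
vocab = {"[UNK]": 0, "one": 1, "flew": 2, "over": 3, "the": 4, "nest": 5}
EMBEDDING_DIM = 4

random.seed(0)
# one vector per vocabulary entry; random here, learned in a real model
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in vocab
]


def tokenize(text: str) -> list[int]:
    """Map each whitespace-separated word to its id in the vocabulary."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up the vector of every token id."""
    return [embedding_table[i] for i in token_ids]


ids = tokenize("One flew over the cuckoo's nest")
vectors = embed(ids)
print(ids)
print(len(vectors), "vectors of", len(vectors[0]), "dimensions")
```

The word "cuckoo's" is not in the made-up vocabulary, so it falls back to the unknown id. Sub-word tokenizers exist largely to avoid this kind of loss.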

## Down to Earth NLP

A more practical set of NLP tools for common problems is provided by the SpaCy library.
SpaCy uses a pre-trained model to build embeddings and parts of speech for text.

Instead of trying to be an input-output model, SpaCy is a library of tools built around
a pipeline of components trained for several tasks.
One can then choose with ease how to manually craft an NLP system.

Here `en_core_web_sm` is the small pre-trained English pipeline that SpaCy should
use to tokenize, parse and understand the given text.
Note that this small pipeline is a compact convolutional model, not a BERT-style transformer;
the transformer based English pipeline is `en_core_web_trf`.

In [None]:
import spacy  # noqa

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks");

One common task for SpaCy is to identify the action within a sentence,

i.e. find the verb.

In [None]:
sentence = nlp("We Flew Over the Cuckoo's Nest")
print(list(sentence))
[x.pos_ for x in sentence]

Note that the model is context based.
In a different context the word "flew" is not a verb.

In [None]:
sentence = nlp("One Flew Over the Cuckoo's Nest")
print(list(sentence))
[x.pos_ for x in sentence]

With a few tricks we can now find the action of the sentence.

We will assume that the nouns closest to the verb are the actor and the actioned.
There are better ways to do it, but this way is easy enough to write.

In [None]:
NOUNS = ["NOUN", "PRON", "PROPN"]  # noun, pronoun, proper name
sentence = nlp("We Flew Over the Cuckoo's Nest")
verbs = [x for x in sentence if x.pos_ == "VERB"]
actions = []
for v in verbs:
    left = [le for le in v.lefts if le.pos_ in NOUNS]
    right = [ri for ri in v.rights if ri.pos_ in NOUNS]
    if left and right:
        actions.append(f"{left[-1]} - {v} -> {right[0]}")
actions

Let's make a real example.

Project Gutenberg is a collection of many books free of copyright.
We take Alice's Adventures in Wonderland by Lewis Carroll from there.

In [None]:
# the context manager closes the file once the text has been read
with open("lewis-carol-alice.txt", "r", encoding="utf-8") as f:
    alice_in_wonderland = f.read()
len(alice_in_wonderland)

In plain text the entire book is about 160 KB.

We can easily pass that through SpaCy.

In [None]:
alice = nlp(alice_in_wonderland)

Let's have a look at what we got.

SpaCy has already separated the text into sentences for us.

In [None]:
from itertools import islice

list(islice(alice.sents, 3, 10))

With the exact same technique for finding actions
we will search across the entire book.

A helper class will aid us in the search later.

In [None]:
NOUNS = ["NOUN", "PRON", "PROPN"]  # noun, pronoun, proper name


class Action:
    def __init__(self, left: str, verb: str, right: str) -> None:
        self.left = left
        self.verb = verb
        self.right = right

    def __repr__(self) -> str:
        return f"{self.left} - {self.verb} -> {self.right}"

    __str__ = __repr__


def actions_in_sentence(sentence: spacy.tokens.Span) -> list[Action]:
    actions = []
    verbs = [x for x in sentence if x.pos_ == "VERB"]
    for v in verbs:
        left = [ll for ll in v.lefts if ll.pos_ in NOUNS]
        right = [rr for rr in v.rights if rr.pos_ in NOUNS]
        if left and right:
            actions.append(
                Action(
                    left[-1].text.replace("\n", " "),
                    v.text,
                    right[0].text.replace("\n", " "),
                )
            )
    return actions

For the time being, run it over the first 100 sentences only.

In [None]:
actions = []
for sentence in islice(alice.sents, 100):
    actions += actions_in_sentence(sentence)
len(actions)

That is a number of actions small enough to print in full.

In [None]:
for a in actions:
    print(a)

Across the full book we got quite a number of actions.

We can now try to figure out how active Alice is.

In [None]:
actions = []
for sentence in alice.sents:
    actions += actions_in_sentence(sentence)
len(actions)

This is another simplification:
we will not distinguish between the actor and the actioned upon.

Statistically this should even out.

In [None]:
len(
    [
        a
        for a in actions
        if "alice" in a.left.lower() or "alice" in a.right.lower()
    ]
)

Alice is indeed quite active.

We can compare this against other prominent characters in the book:

In [None]:
cheshire_cat = len(
    [a for a in actions if "cat" in a.left.lower() or "cat" in a.right.lower()]
)
hatter = len(
    [
        a
        for a in actions
        if "hatter" in a.left.lower() or "hatter" in a.right.lower()
    ]
)
white_rabbit = len(
    [
        a
        for a in actions
        if "rabbit" in a.left.lower() or "rabbit" in a.right.lower()
    ]
)
print(f"Cheshire Cat: {cheshire_cat}")
print(f"The Hatter: {hatter}")
print(f"White Rabbit: {white_rabbit}")
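
The counts above repeat the same comprehension for every character, which makes copy-and-paste slips easy. A possible refactoring, sketched with a stand-in `Triple` tuple and made-up sample data so the block runs on its own: `count_mentions` is a hypothetical helper that works on any object with `left` and `right` attributes, including the `Action` class above.

```python
from collections import namedtuple

# stand-in for the Action class so that the sketch is self-contained
Triple = namedtuple("Triple", ["left", "verb", "right"])


def count_mentions(actions, name: str) -> int:
    """Count actions where `name` appears on either side of the verb."""
    name = name.lower()
    return sum(
        1 for a in actions if name in a.left.lower() or name in a.right.lower()
    )


# made-up sample data in the shape produced by actions_in_sentence
sample = [
    Triple("Alice", "followed", "the White Rabbit"),
    Triple("the Hatter", "asked", "Alice"),
    Triple("the Cat", "watched", "the Queen"),
]
for character in ["alice", "rabbit", "hatter", "cat"]:
    print(character, count_mentions(sample, character))
```

Like the original comprehensions, this matches substrings, so a short name such as "cat" would also match any longer word that contains it.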

The above is quite a crude NLP system.
A more complete system would require a way to recognise whether the nouns we find
are indeed the characters we seek.
It would also need to recognise pronouns referring to the characters.

These two systems are most often called Named Entity Recognition (NER)
and Coreference Resolution (Coref).
Both systems are in active research.
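
To see why both problems are hard, consider the naive baseline sketched below. It is not how real NER or coreference systems work: characters are recognised by lookup in a fixed, hand-written set, and every pronoun is resolved to the most recently mentioned character. All names in the block are hypothetical.

```python
PRONOUNS = {"she", "her", "he", "him"}


def resolve_pronouns(words: list[str], characters: set[str]) -> list[str]:
    """Replace each pronoun with the most recently mentioned character."""
    last_seen = None
    resolved = []
    for word in words:
        if word in characters:
            last_seen = word
        if word.lower() in PRONOUNS and last_seen is not None:
            resolved.append(last_seen)
        else:
            resolved.append(word)
    return resolved


words = "Alice saw the Rabbit and she followed him".split()
print(" ".join(resolve_pronouns(words, {"Alice", "Rabbit"})))
```

The heuristic resolves "him" to the Rabbit as intended, but it also resolves "she" to the Rabbit instead of Alice. Getting such cases right needs grammatical and world knowledge, which is what the research systems try to learn.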

## Andrej Karpathy's minGPT (simplified):

In [None]:
import math
from collections import namedtuple

import torch
import torch.nn.functional as F

Config = namedtuple("Config", ["n_embd", "n_head", "attn_pdrop", "resid_pdrop"])


class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer
    with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here
    but I am including an explicit implementation here
    to show that there is nothing too scary about transformers.
    """

    def __init__(self, config: Config) -> None:
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        # key, query, value projections for all heads, but in a batch
        self.c_attn_k = nn.Linear(config.n_embd, config.n_embd)
        self.c_attn_q = nn.Linear(config.n_embd, config.n_embd)
        self.c_attn_v = nn.Linear(config.n_embd, config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # batch size, sequence length, embedding dimensionality (n_embd)
        B, T, C = x.size()
        nh, hs = self.n_head, C // self.n_head  # number of heads, head size

        # calculate query, key, values for all heads in batch
        # and move head forward to be the batch dim: (B, T, C) -> (B, nh, T, hs)
        k = self.c_attn_k(x).view(B, T, nh, hs).transpose(1, 2)
        q = self.c_attn_q(x).view(B, T, nh, hs).transpose(1, 2)
        v = self.c_attn_v(x).view(B, T, nh, hs).transpose(1, 2)

        # causal self-attention;
        # Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        # causal mask: a position may only attend to itself and to earlier ones
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        # re-assemble all head outputs side by side: (B, nh, T, hs) -> (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
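
The core computation of the layer is small enough to follow by hand. The sketch below redoes scaled dot-product attention with a causal restriction in plain Python, for a single head, without batches, dropout or learned projections. The function names and the tiny input are made up for the illustration.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention; position t only sees positions 0..t."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for t, q_t in enumerate(q):
        # score the query against the current and all earlier keys only;
        # slicing has the same effect as masking later positions with -inf
        scores = [
            scale * sum(a * b for a, b in zip(q_t, k_s)) for k_s in k[: t + 1]
        ]
        weights = softmax(scores)
        # weighted average of the visible value vectors
        out.append(
            [
                sum(w * v_s[d] for w, v_s in zip(weights, v))
                for d in range(len(v[0]))
            ]
        )
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two dimensions
y = causal_attention(x, x, x)
print(y[0])  # the first position can only attend to itself: [1.0, 0.0]
```

Every output row is a weighted average of the value vectors the position is allowed to see, with weights that sum to one.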