<a href="https://colab.research.google.com/github/kasparvonbeelen/data-culture-newspapers/blob/llms/1_Introduction_and_PLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This session reflects on the application of language models to research in digital humanities. It will be interactive, so you can play around with the examples and code snippets yourself.

- This is a gentle introduction, open to everyone!
- Also, still under construction.


## What's on the menu?

- Intro: What are language models, actually

- Language Models as Models of Language: Experiments with historical and queer GPT-2 and BERT models
- Using instruction-tuned (or "chat") models for distant reading
 - "Don't count, summarize?"
 - A simple RAG pipeline to investigate accidents in the news
 - Using LLMs for annotating and structuring newspaper data


# Language Models as Models of Language

## What are language models?

LMs tell us what word is likely to follow a given sequence. More technically:

> “[Language models] assign a probability* to each possible next word.” (Jurafsky & Martin)
- Given a vocabulary *V*, which *w* (word) in *V* is likely to follow a sequence *s*?

Given the sentence **“Predicting the future is hard, but not …”**

- P(“impossible” | sentence) is greater than P(“aardvark” | sentence)


> Read P(“impossible” | sentence) as the probability of observing the token “impossible” given the sequence “Predicting the future is hard, but not …”.


> Each probability is a value between 0 and 1, and the probabilities of all possible next tokens in the vocabulary sum to 1.
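
To make this concrete, here is a minimal sketch of how raw model scores can be turned into a probability distribution. The four-word vocabulary and the scores are made up for illustration and do not come from a real model; `toy_softmax` is a hypothetical helper that applies the standard softmax function.

```python
import math

# Hypothetical scores ("logits") for a tiny, made-up vocabulary.
# A real model produces one score per token in a vocabulary of tens of thousands.
toy_logits = {"impossible": 4.0, "easy": 2.5, "today": 1.0, "aardvark": -3.0}

def toy_softmax(scores):
    """Turn arbitrary scores into probabilities between 0 and 1 that sum to 1."""
    highest = max(scores.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(score - highest) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

toy_probs = toy_softmax(toy_logits)
for token, p in sorted(toy_probs.items(), key=lambda item: item[1], reverse=True):
    print(f"P({token!r} | sentence) = {p:.4f}")

print("sum of all probabilities:", round(sum(toy_probs.values()), 6))
```

Higher scores become higher probabilities, so in this toy setting P(“impossible” | sentence) ends up larger than P(“aardvark” | sentence).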

## How are language models created?

- **Pre-training** using a language modelling task: iterate over a large collection of text and improve the model's performance on the next token prediction task. In the process the model learns a lot about language, society and the "real" world.
- **Instruction tuning** improves the model so that it follows instructions correctly.

E.g. [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) vs. [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
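
As a rough illustration of what "improving performance on next token prediction" can mean: during pre-training, a model is typically penalised by the negative log-probability it assigned to the token that actually came next (the cross-entropy loss), and its weights are adjusted to lower that penalty. The sketch below only computes this loss for two made-up probability distributions. The numbers and the helper `next_token_loss` are hypothetical, and no real training takes place.

```python
import math

def next_token_loss(predicted_probs, actual_next_token):
    """Cross-entropy loss for one prediction: -log P(actual next token)."""
    return -math.log(predicted_probs[actual_next_token])

# Made-up predictions for the token following "Predicting the future is hard, but not"
before_training = {"impossible": 0.05, "easy": 0.05, "aardvark": 0.90}
after_training = {"impossible": 0.70, "easy": 0.25, "aardvark": 0.05}

# Suppose the training text continues with "impossible"
loss_before = next_token_loss(before_training, "impossible")
loss_after = next_token_loss(after_training, "impossible")

print(f"loss before training: {loss_before:.3f}")
print(f"loss after training:  {loss_after:.3f}")
```

The more probability the model puts on the token that really follows, the lower the loss.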

# Language Models One Token at a Time
## Next word prediction with GPT-2

Next word prediction is the principal building block of generative AI applications. We'll also encounter it when playing with larger language models.

In the following example, we generate just one token. We will show how a language model produces a probability distribution over the vocabulary from which it can sample the next token.

In [None]:
%%bash
pip install -q transformers accelerate datasets shap

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model, pipeline
import numpy as np
import shap
from torch.nn import Softmax
import pandas as pd
softmax = Softmax(dim=0) # initialize softmax function

In [None]:
#prompt = 'Hello my name is John' # define a prompt
prompt = 'Predicting the future is hard, but not'

In [None]:
# the tokenizer splits a text into the units (tokens) the LM is built on
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# load GPT-2 with its language modelling head; reuse the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
# run the model on the tokenized prompt
predictions = model(**tokenizer(prompt, return_tensors='pt'))
# predictions.logits has shape (batch size, sequence length, vocabulary size)
# keep only the logits at the last position: the scores for the next token
next_token_logits = predictions.logits[0, -1, :].detach()
# print the single most likely next token
print(repr(tokenizer.decode(next_token_logits.argmax().item())))
# turn the logits into probabilities and sort them from high to low
series = pd.Series(softmax(next_token_logits).numpy()).sort_values(ascending=False)
# replace the token ids in the index with the actual tokens
series.index = [tokenizer.decode(int(x)) for x in series.index]
# plot the 100 most likely next tokens
series[:100].plot(kind='bar', figsize=(20, 5))

# From predicting the next token to generating complete documents

To generate longer text we repeat the next token prediction multiple times until we encounter a stop symbol (or the limit of tokens to generate).

Text generation involves the following steps:

- Create a probability distribution over the next token
- Sample a token based on this distribution*
- Add the sampled token to the input sequence, and repeat...


\* We have a few tricks up our sleeves here: we can manipulate the sampling procedure using specific hyperparameters, such as `temperature`, `top_k` and `top_p`.
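
The three steps above can be sketched with a toy "model". This is an illustration under strong assumptions: `TOY_MODEL` is a made-up lookup table that only looks at the previous token, whereas a real language model conditions on the whole sequence, and `generate_toy_text` is a hypothetical helper.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"future": 0.5, "past": 0.3, "dog": 0.2},
    "a": {"dog": 0.7, "future": 0.3},
    "future": {"is": 0.8, "<stop>": 0.2},
    "past": {"is": 0.7, "<stop>": 0.3},
    "dog": {"is": 0.5, "<stop>": 0.5},
    "is": {"hard": 0.6, "here": 0.4},
    "hard": {"<stop>": 1.0},
    "here": {"<stop>": 1.0},
}

def generate_toy_text(max_new_tokens=10, seed=None):
    rng = random.Random(seed)
    sequence = ["<start>"]
    for _ in range(max_new_tokens):
        # 1. look up the probability distribution over the next token
        distribution = TOY_MODEL[sequence[-1]]
        # 2. sample one token according to that distribution
        candidates, weights = zip(*distribution.items())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        # stop as soon as we sample the stop symbol
        if next_token == "<stop>":
            break
        # 3. add the sampled token to the sequence and repeat
        sequence.append(next_token)
    return " ".join(sequence[1:])

for seed in range(3):
    print(generate_toy_text(seed=seed))
```

Generation ends either when the stop symbol is sampled or when `max_new_tokens` is reached, mirroring the two stopping conditions described above.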

#### An example of temperature
Increasing the temperature flattens the probability distribution over the next token, which can make predictions more creative (or random, if you [like](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71#:~:text=Temperature%20is%20a%20hyperparameter%20of%20LSTMs%20(and%20neural%20networks%20generally,utilize%20the%20Softmax%20decision%20layer.))).

![temperature](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*7xj72SjtNHvCMQlV.jpeg)

Image taken from this [blog post](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71#:~:text=Temperature%20is%20a%20hyperparameter%20of%20LSTMs%20(and%20neural%20networks%20generally,utilize%20the%20Softmax%20decision%20layer.)) on temperature in Softmax.
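
A small sketch of the effect of temperature, assuming the common formulation in which the logits are divided by the temperature before the softmax is applied. The three candidate tokens and their scores are made up, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply the softmax."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for three candidate next tokens
candidates = ["impossible", "easy", "aardvark"]
toy_scores = [3.0, 2.0, 0.5]

for temperature in (0.1, 1.0, 2.0):
    probabilities = softmax_with_temperature(toy_scores, temperature)
    formatted = ", ".join(f"{tok}: {p:.3f}" for tok, p in zip(candidates, probabilities))
    print(f"temperature={temperature}: {formatted}")
```

A low temperature concentrates almost all probability on the top-scoring token, while a higher temperature spreads it out, so sampling picks unlikely tokens more often.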

The generation steps outlined above are neatly integrated in the Hugging Face 'text-generation' `pipeline`.

First we instantiate the `pipeline` using the GPT-2 model, the predecessor of the now famous GPT-3.

In [None]:
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
generator_gpt2 = pipeline('text-generation', # define the task
                     model='gpt2', # define the model
                     pad_token_id=tokenizer_gpt2.eos_token_id) # reuse the EOS token as PAD token to avoid warnings


Then we have to define a prompt (or input text) for which we want to predict the next token.

In [None]:
prompt = 'Predicting the future is hard, but not' # 'Hello, my name is'

Lastly, we can generate a few texts to follow the given prompt. We limit each text to 30 tokens in total (`max_length` counts the prompt as well as the newly generated tokens). We use a low temperature so things won't get too weird!

In [None]:
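completions = generator_gpt2(prompt,
            do_sample=True, # sample from the distribution instead of always taking the most likely token
            temperature=0.1, # increase the temperature for more creative generations
            max_length=30,  # max length of each generated text (prompt included)
            num_return_sequences=3 # how many texts to generate
         )
completions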

# Models ❤️ Data



- **Models "mimic" or "parrot" the data they are trained on.** Some scholars argue they should be referred to as **"corpus models"** instead of "language models". On this view, language models are a **"compression of the data"**: their weights encode important patterns in the training corpus. This means language models can be used to investigate complex patterns and regularities in language, which I think is of interest to scholars in the digital humanities.
- **Different training data results in different models (and outputs).** We can compare models trained on different data to study how language changes over time or by context. In these scenarios, the language models themselves are valuable objects of study, especially when combined with tools/methods that allow us to **interpret** the behaviour of these models.
- Below we have a closer look at some examples. We compare the vanilla GPT-2 model to one fine-tuned on newspaper articles about Brexit (based on this wonderful Programming Historian [tutorial](https://programminghistorian.org/en/lessons/interrogating-national-narrative-gpt)) and one fine-tuned on a selection of [Queer 80s literature](https://www.london.ac.uk/about/services/senate-house-library/exhibitions/seized-books).
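
The idea that a model reflects its training corpus can be sketched with a deliberately simple stand-in: a count-based bigram model rather than a neural network like GPT-2. The two corpora below are tiny and invented purely for illustration, and both helper functions are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count which word follows which in a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts

def next_word_probabilities(bigram_model, word):
    """Relative frequencies of the words observed after `word`."""
    followers = bigram_model[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

# Two tiny, invented corpora standing in for different training data
corpus_a = ["we will vote to leave", "they vote to leave", "i vote to remain"]
corpus_b = ["we will vote to remain", "they vote to remain", "i vote to leave"]

model_a = train_bigram_model(corpus_a)
model_b = train_bigram_model(corpus_b)

print("model A:", next_word_probabilities(model_a, "to"))
print("model B:", next_word_probabilities(model_b, "to"))
```

Given the same context word, the two models prefer different continuations, simply because their training data differ.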


We need to import some required tools and libraries.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import shap # we use this tool to visualize model prediction

We load the different language models: `gpt2`, `QueerGPT2` and `gpt-brexit` (the latter two are still under construction).

In [None]:
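explainer_dict = {}

# all three checkpoints use the GPT-2 tokenizer, so we load it once
tokenizer = AutoTokenizer.from_pretrained('gpt2', use_fast=True)

for checkpoint in ['gpt2', 'Kaspar/QueerGPT2', 'Kaspar/gpt-brexit']:
  model = AutoModelForCausalLM.from_pretrained(checkpoint)
  # wrap the model so that it returns scores for the 25 most likely next tokens
  wrapped_model = shap.models.TopKLM(model, tokenizer, k=25)
  # the masker hides parts of the input (shown as "...") to measure their influence
  masker = shap.maskers.Text(tokenizer, mask_token="...", collapse_mask_token=True)
  explainer = shap.Explainer(wrapped_model, masker)
  # store one explainer per checkpoint
  explainer_dict[checkpoint] = explainer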

# Interpreting model predictions with shap

The `shap` library nicely visualises which parts of the input segment influence the prediction of the next token.
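
To give a rough idea of what the numbers behind such a visualisation mean: Shapley values distribute a prediction score over the input tokens by averaging each token's marginal contribution across all orders in which the tokens could be added. The sketch below computes exact Shapley values for a made-up scoring function over three tokens. Both `toy_score` and its numbers are hypothetical, and the `shap` library works with a real model and approximations rather than this brute-force enumeration.

```python
from itertools import permutations

toy_tokens = ["walk", "the", "dog"]

def toy_score(present):
    """Hypothetical model score for predicting 'park', given the visible tokens."""
    score = 0.0
    if "walk" in present:
        score += 0.2
    if "dog" in present:
        score += 0.3
    if "walk" in present and "dog" in present:
        score += 0.4  # the two words together are especially informative
    return score

def shapley_values(players, value_function):
    """Average each player's marginal contribution over all orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        present = set()
        for player in ordering:
            before = value_function(present)
            present.add(player)
            totals[player] += value_function(present) - before
    return {player: total / len(orderings) for player, total in totals.items()}

attributions = shapley_values(toy_tokens, toy_score)
for token, value in attributions.items():
    print(f"{token}: {value:+.3f}")
```

The attributions add up to the difference between the score with all tokens visible and the score with none, and a token that never changes the score (here "the") receives a value of zero.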

### Exercise:

Change the prompt and look at how the predictions differ in relation to the input text.

In [None]:
prompt = "First I went to walk the dog, then I had breakfast, now I will go to the" # change the prompt
expl = 'gpt2'
shap_values = explainer_dict[expl]([prompt])
shap.plots.text(shap_values)

Importantly, if we fine-tune the model on different data, it gains novel knowledge and will change its behaviour and predictions.

In [None]:
prompt = "I want the United Kingdom to stay in the European Union. I will vote for" # change the prompt
expl = 'gpt2' # gpt2 | Kaspar/gpt-brexit
shap_values = explainer_dict[expl]([prompt])
shap.plots.text(shap_values)

In [None]:
# tokens predicted by QueerGPT
prompt = "When I grow up, I want to become a" # change the prompt
expl = 'Kaspar/QueerGPT2' # 'Kaspar/QueerGPT2' | gpt2
shap_values = explainer_dict[expl]([prompt])
shap.plots.text(shap_values)

Lastly, we can generate longer outputs and assess how these relate to the input tokens.

In [None]:
prompt = "I want the United Kingdom to stay in the European Union. Therefore, I will vote for" # change the prompt
checkpoint = 'Kaspar/gpt-brexit' #  choose gpt2 or Kaspar/QueerGPT2
tokenizer = AutoTokenizer.from_pretrained('gpt2', use_fast=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
# set model decoder to true
model.config.is_decoder = True
# set text-generation params under task_specific_params
model.config.task_specific_params["text-generation"] = {
    "do_sample": True,
    "max_length": 200,
    "temperature": 0.1,
    "top_k": 50,
    #"no_repeat_ngram_size": 2,
}
explainer = shap.Explainer(model, tokenizer)
shap_values = explainer([prompt])

In [None]:
shap.plots.text(shap_values)

# Historical Language Models

Comparing model predictions allows us to study linguistic and societal changes. In the Living with Machines project, we investigated the concept of atypical animacy, focussing on the portrayal of machines as being 'alive'.

You can read the technical paper [here](https://arxiv.org/abs/2005.11140) and the historical article [here](https://muse.jhu.edu/pub/1/article/903976/summary).

In these papers, we used a slightly different technique and type of model, namely autoencoding/masked language models (as opposed to the GPT series, which falls under the autoregressive/causal category).

We masked the word 'machine' and investigated what the model predicts, focussing on examples where it predicts humans or animals instead of mechanical objects.

Moreover, we used historical language models to study how these predictions change by period.

We fine-tuned BERT models on 19th-century book collections as described in [our paper](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.48).

In [None]:
from transformers import pipeline
sentence = "Our sewing [MASK] stood near the wall where grated windows admitted sunshine, and their hymn to Labour was the only sound that broke the brooding silence."

In [None]:
masker = pipeline("fill-mask", model='bert-base-uncased')
masker(sentence)

In [None]:
victorian_masker = pipeline("fill-mask", model='Livingwithmachines/bert_1760_1850')
victorian_masker(sentence)
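
To compare the two models more systematically, one option is to line up their predictions side by side. The sketch below assumes the usual output format of the `fill-mask` pipeline (a list of dictionaries with `token_str` and `score` keys). The helper `compare_predictions` is hypothetical, and the two example lists are placeholders with invented tokens and scores, not real model output.

```python
def compare_predictions(outputs_a, outputs_b):
    """Return rows (token, score_a, score_b) for every token predicted by either model."""
    scores_a = {item["token_str"]: item["score"] for item in outputs_a}
    scores_b = {item["token_str"]: item["score"] for item in outputs_b}
    all_tokens = sorted(set(scores_a) | set(scores_b))
    # None marks a token that is missing from a model's top predictions
    return [(token, scores_a.get(token), scores_b.get(token)) for token in all_tokens]

# Placeholder outputs with invented tokens and scores;
# replace them with masker(sentence) and victorian_masker(sentence)
placeholder_modern = [{"token_str": "alpha", "score": 0.6}, {"token_str": "beta", "score": 0.1}]
placeholder_historical = [{"token_str": "gamma", "score": 0.4}, {"token_str": "alpha", "score": 0.2}]

for token, score_a, score_b in compare_predictions(placeholder_modern, placeholder_historical):
    print(f"{token:10} modern={score_a} historical={score_b}")
```

Tokens that appear in only one model's list are good candidates for a closer look, since they point to contexts where the two models diverge.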

### Exercise

Can you think of another example where we might observe interesting historical differences? Change the `sentence` variable to another sentence that contains a "[MASK]" token.

# Fin.