# 2. Poking at Ever Larger Language Models
## An introduction for (digital) humanists


## Pretrained Language Models (10 mins)

- Transition from N-Gram to Neural Language Models (ca. 2013)

- [Don't count, predict!](https://aclanthology.org/P14-1023.pdf) (when **training** a language model)

## Terminology

<img src="https://soundgas.com/wp-content/uploads/2021/02/Vintage-mixers-from-Roland-Yamaha-1024x576.jpg" alt="knobs" width="500">

- Parameters are "knobs" you can adjust to transform an input to the output you want
- Deep Learning algorithms attempt to find the optimal setting of these knobs. The more knobs, the more complex stuff you can do (but equally, it becomes harder to understand how the machine actually works).
![simpleNN](https://miro.medium.com/v2/resize:fit:624/1*U3FfvaDbIjr7VobJj89fCQ.png)
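
To make the "knobs" picture concrete, here is a minimal sketch of a model with just two knobs, `w` and `b`, tuned by gradient descent. The toy data, the learning rate and the number of steps are assumptions chosen for illustration; real language models apply the same idea to millions or billions of knobs.

```python
import numpy as np

# Toy data produced by the "true" settings w=2 and b=1 (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Two knobs, both starting at zero.
w, b = 0.0, 0.0
learning_rate = 0.05

for step in range(2000):
    error = (w * x + b) - y          # how far off is the current setting?
    grad_w = 2 * np.mean(error * x)  # slope of the mean squared error w.r.t. w
    grad_b = 2 * np.mean(error)      # slope of the mean squared error w.r.t. b
    w -= learning_rate * grad_w      # turn each knob a little in the
    b -= learning_rate * grad_b      # direction that reduces the error

print(float(w), float(b))
```

After training, the two knobs should sit very close to the values that generated the data.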

- LM pretraining and fine-tuning (Why it works better)

## Common PLM variants
- Causal/Autoregressive language models (GPT series): Predict the next [BLANK]
- Masked Language Models (BERT and family): Predict the [BLANK] word.
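
The difference between the two variants comes down to what the model is allowed to see when it makes a prediction. The sketch below builds toy training examples for each case; the helper names `causal_examples` and `masked_example` are made up for this illustration and are not part of any library.

```python
tokens = ["our", "sewing", "machines", "stood", "near", "the", "wall"]

def causal_examples(tokens):
    """Pair every left-hand context with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, position, mask_token="[MASK]"):
    """Hide one token; the context on both sides stays visible."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

# Causal: predict "machines" from the words to its left only.
print(causal_examples(tokens)[1])

# Masked: predict "machines" from the words on both sides.
print(masked_example(tokens, 2))
```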


## Text Generation with GPT-2

While more complex, GPT-2 operates similarly to our simple N-Gram LM.
- Given a prompt, it computes a probability distribution over the next token (a word or a piece of a word)
- We then sample a token from this distribution, append it to the prompt, and repeat!
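
The predict, sample, append loop described above can be sketched without any neural network at all. The probability table below is invented purely for illustration and stands in for the model; GPT-2 replaces the table lookup with a forward pass over the whole prompt.

```python
import random

# Hypothetical next-word probabilities, made up for illustration only.
next_word_probs = {
    "the": {"uk": 0.5, "duke": 0.3, "wall": 0.2},
    "uk": {"is": 0.9, "the": 0.1},
    "duke": {"of": 1.0},
    "of": {"the": 1.0},
    "is": {"the": 1.0},
    "wall": {"is": 1.0},
}

def generate(prompt, n_words, rng):
    words = prompt.lower().split()
    for _ in range(n_words):
        dist = next_word_probs[words[-1]]  # distribution given the last word
        choice = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(choice)               # extend the prompt and repeat
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
print(generate("The UK", 5, rng))
```

Unlike this toy table, which only looks at the last word, GPT-2 conditions on the entire prompt at every step.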

Materials inspired by this [blog post](https://huggingface.co/blog/how-to-generate) and the excellent Programming Historian lesson.


In [None]:
!pip install transformers xformers

## Next word prediction with GPT-2

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model
import numpy as np
from torch.nn import Softmax
import pandas as pd

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
# load the bare GPT-2 model (no language-modelling head); it returns hidden states rather than next-token logits
gpt2 = GPT2Model.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
prompt = 'Hello my name is' # define a prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt') # tokenize prompt as input for language model
input_ids

In [None]:
predictions = model(**tokenizer(prompt, return_tensors='pt')) # get logits from model

In [None]:
predictions.logits.shape # the predictions as logits

In [None]:
# get the single token with the highest score at the last position (the greedy choice)
tokenizer.decode(np.argmax(predictions.logits[0,-1,:].detach().numpy()))

In [None]:
softmax = Softmax(dim=0) # softmax turns logits into probabilities that sum to 1
probabilities = softmax(predictions.logits[0,-1,:]).detach().numpy() # convert to a NumPy array for pandas
series = pd.Series(probabilities).sort_values(ascending=False)
index = [tokenizer.decode(x) for x in series.index] # change index to tokens
series.index = index # set tokens as index
series[:100].plot(kind='bar',figsize=(20,5)) # plot the 100 most probable next tokens

## Generating texts from prompts

In [None]:
#sequence = 'the duke of'
#sequence = 'A no deal Brexit'
sequence = 'The UK is'

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

Increasing the temperature flattens the probability distribution over the next token, which makes predictions more creative (or more random, if you [like](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71#:~:text=Temperature%20is%20a%20hyperparameter%20of%20LSTMs%20(and%20neural%20networks%20generally,utilize%20the%20Softmax%20decision%20layer.))). Lowering it concentrates probability on the most likely tokens.

![temperature](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*7xj72SjtNHvCMQlV.jpeg)

Image taken from this [blog post](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71#:~:text=Temperature%20is%20a%20hyperparameter%20of%20LSTMs%20(and%20neural%20networks%20generally,utilize%20the%20Softmax%20decision%20layer.)) on temperature in Softmax.
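
The effect of temperature can be checked by hand: the logits are divided by the temperature before the softmax is applied. The sketch below uses three made-up logits and a hypothetical helper, `softmax_with_temperature`; it illustrates the idea and is not the code that `transformers` runs internally.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.1, 1.0, 10.0):
    print(t, softmax_with_temperature(logits, t).round(3))
```

A low temperature pushes almost all probability onto the top candidate, while a high temperature spreads it nearly evenly across the three.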

In [None]:
import torch
torch.manual_seed(0)
generator(sequence, 
          max_length = 30, 
          num_return_sequences=5,
          do_sample=True, 
          top_k = 0,
          temperature=.000000001, # near-greedy: almost always picks the top token; try changing the temperature to .7
         )


### Top k sampling

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=50)
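
With `top_k=50`, sampling is restricted to the 50 most probable next tokens. The sketch below shows the filtering step on a made-up distribution over five tokens; `top_k_filter` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]  # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token probabilities
print(top_k_filter(probs, k=2).round(3))
```

Tokens outside the top `k` receive zero probability and can never be sampled, however many of them there are.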

### Top p or nucleus sampling

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=0,
          top_p=.92)
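
With `top_p=.92`, sampling is restricted to the smallest set of tokens whose probabilities add up to at least 0.92, so the number of candidates changes from step to step; setting `top_k=0` switches top-k filtering off. The sketch below illustrates the idea on a made-up distribution; `top_p_filter` is a hypothetical helper, and the library's own implementation may differ in detail.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cumulative = np.cumsum(probs[order])     # running total of probability mass
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # tokens needed to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()         # renormalise the survivors

probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token probabilities
print(top_p_filter(probs, p=0.8).round(3))
```

When the model is confident, the nucleus contains only a few tokens; when the distribution is flat, it grows to include many more.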

## Adapting a language model

Building a GPT-Brexit model on top of GPT-2.

Question: How does it change the behaviour of this model?

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'Kaspar/gpt-brexit',tokenizer='gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

# Bias and Toxicity in PLMs (and LLMs)

An excellent [HuggingFace tutorial](https://colab.research.google.com/drive/1-HDJUcPMKEF-E7Hapih0OmA1xTW2hdAv#scrollTo=MOsHUjgdIrIW) to be integrated later.

# Masked Language Models and History 

In [1]:
from transformers import pipeline
sentence = "Our sewing [MASK] stood near the wall where grated windows admitted sunshine, and their hymn to Labour was the only sound that broke the brooding silence."

masker = pipeline("fill-mask", model='bert-base-uncased')
masker(sentence)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.5995776057243347,
  'token': 6681,
  'token_str': 'machines',
  'sequence': 'our sewing machines stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.11765312403440475,
  'token': 3698,
  'token_str': 'machine',
  'sequence': 'our sewing machine stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.02953513339161873,
  'token': 2282,
  'token_str': 'room',
  'sequence': 'our sewing room stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.017924729734659195,
  'token': 4734,
  'token_str': 'rooms',
  'sequence': 'our sewing rooms stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0

In [3]:
from transformers import pipeline

victorian_masker = pipeline("fill-mask", model='Livingwithmachines/bert_1760_1850')
victorian_masker(sentence)

[{'score': 0.44799289107322693,
  'token': 3057,
  'token_str': 'girls',
  'sequence': 'our sewing girls stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.13632763922214508,
  'token': 2308,
  'token_str': 'women',
  'sequence': 'our sewing women stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.03960563242435455,
  'token': 5208,
  'token_str': 'sisters',
  'sequence': 'our sewing sisters stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score': 0.03331466764211655,
  'token': 2336,
  'token_str': 'children',
  'sequence': 'our sewing children stood near the wall where grated windows admitted sunshine, and their hymn to labour was the only sound that broke the brooding silence.'},
 {'score':
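
One simple way to compare the two models is to set their top predictions side by side. The sketch below copies the first four predictions from each output above (scores rounded to three decimals); the `compare` helper is made up for this illustration.

```python
# Top four predictions copied from the two outputs above, scores rounded.
bert_base = {"machines": 0.600, "machine": 0.118, "room": 0.030, "rooms": 0.018}
bert_1760_1850 = {"girls": 0.448, "women": 0.136, "sisters": 0.040, "children": 0.033}

def compare(a, b):
    """Return the tokens both models predict, and those unique to each one."""
    shared = sorted(set(a) & set(b))
    only_a = sorted(set(a) - set(b), key=a.get, reverse=True)
    only_b = sorted(set(b) - set(a), key=b.get, reverse=True)
    return shared, only_a, only_b

shared, only_base, only_historical = compare(bert_base, bert_1760_1850)
print("shared:", shared)
print("bert-base-uncased only:", only_base)
print("bert_1760_1850 only:", only_historical)
```

Among these top four predictions the two models share no tokens: the contemporary model fills the gap with objects and places, while the historical model fills it with people.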

# Fin.