# 2. Poking at Ever Larger Language Models
## An introduction for (digital) humanists


## Pretrained Language Models (10 mins)

- Transition from N-Gram to Neural Language Models (ca. 2013)

- [Don't count, predict!](https://aclanthology.org/P14-1023.pdf) (when **training** a language model)

## Terminology
- Parameters are "knobs" you can adjust to transform an input to the output you want
- Deep learning algorithms attempt to find the optimal setting of these knobs. The more knobs, the more complex the things you can do (but equally, the harder it becomes to understand how the machine actually works).
![simpleNN](https://miro.medium.com/v2/resize:fit:624/1*U3FfvaDbIjr7VobJj89fCQ.png)
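
To make the "knobs" metaphor concrete, here is a minimal sketch in plain Python (not part of the original notebook). A model with just two parameters, `w` and `b`, is fitted to toy data by gradient descent. The data, the learning rate and the number of steps are all illustrative assumptions.

```python
# Toy data generated by the rule y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

# Two "knobs" (parameters), starting from arbitrary values
w, b = 0.0, 0.0
learning_rate = 0.05

for step in range(2000):
    # Gradients of the mean squared error with respect to each knob
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Turn each knob a little in the direction that reduces the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# The knobs end up very close to the values that generated the data (2 and 1)
print(round(w, 2), round(b, 2))
```

A neural language model is trained in the same spirit, only with many millions of knobs instead of two.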

- LM pretraining and fine-tuning (Why it works better)

## Common PLM variants
- Causal/Autoregressive language models (GPT series): Predict the next [BLANK]
- Masked Language Models (BERT and family): Predict the [BLANK] word.
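
The difference between the two variants shows up in the training examples they learn from. The sketch below builds both kinds from one sentence. It is a simplification under stated assumptions: real models work on subword tokens rather than whole words, and `mask_word` is a hypothetical helper written for this illustration.

```python
sentence = "the cat sat on the mat".split()

# Causal / autoregressive: predict each word from the words to its left only
causal_examples = [
    (" ".join(sentence[:i]), sentence[i]) for i in range(1, len(sentence))
]

# Masked: hide one word and predict it from the context on both sides
def mask_word(words, position, mask_token="[MASK]"):
    masked = words[:position] + [mask_token] + words[position + 1:]
    return " ".join(masked), words[position]

masked_example = mask_word(sentence, 2)

print(causal_examples[-1])  # ('the cat sat on the', 'mat')
print(masked_example)       # ('the cat [MASK] on the mat', 'sat')
```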


## Text Generation with GPT-2

While more complex, GPT-2 operates similarly to our simple N-Gram LM.
- Given a prompt, it computes a probability distribution over the next word
- Then we sample a word from this distribution, add it to the prompt and repeat!
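
The two steps above can be sketched with a toy model before we bring in GPT-2. The probability table below is made up for illustration, and it only looks at the last word, whereas GPT-2 conditions on the whole prompt. The loop itself (look up a distribution, sample, append, repeat) is the same idea.

```python
import random

# Hypothetical next-word probabilities; a real LM computes these from the whole prompt
next_word_probs = {
    "the": {"UK": 0.5, "EU": 0.5},
    "UK": {"is": 0.7, "and": 0.3},
    "EU": {"is": 0.6, "and": 0.4},
    "is": {"leaving": 0.5, "the": 0.5},
    "and": {"the": 1.0},
    "leaving": {"the": 1.0},
}

def generate(prompt, steps, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    words = prompt.split()
    for _ in range(steps):
        distribution = next_word_probs[words[-1]]
        # Sample one word according to its probability
        next_word = rng.choices(
            list(distribution), weights=list(distribution.values())
        )[0]
        words.append(next_word)  # add it to the prompt and repeat
    return " ".join(words)

print(generate("the UK", steps=5))
```

Changing `seed` gives a different continuation, which is why sampled generations vary from run to run.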

Materials inspired by this [blog post](https://huggingface.co/blog/how-to-generate) and the excellent Programming Historian lesson.


In [15]:
!pip install transformers



## Next word prediction with GPT-2

In [58]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model
import numpy as np


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [40]:
# load the gpt-2 model
gpt2 = GPT2Model.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [60]:
prompt = 'Hello my names is' # define a prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt') # tokenize prompt as input for language model
input_ids

tensor([[15496,   616,  3891,   318]])

In [61]:
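sequence = 'The UK is' # the three-token prompt that the outputs below were computed for
predictions = model(**tokenizer(sequence, return_tensors='pt')) # forward pass: one row of logits per input token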

In [62]:
predictions.logits.shape

torch.Size([1, 3, 50257])

In [64]:
# decode the single token with the highest score (logit) at the last position
tokenizer.decode(np.argmax(predictions.logits[0,-1,:].detach().numpy()))

' the'
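
The `argmax` above picks one winner, but the logits describe a score for every item in the vocabulary. A softmax turns those scores into probabilities. Below is a minimal sketch on a four-word toy vocabulary; the tokens and logit values are hypothetical, not taken from GPT-2.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical vocabulary and logits for the position after the prompt
vocab = [" the", " a", " not", " in"]
logits = [3.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, prob in ranked[:3]:
    print(repr(token), round(prob, 3))
```

With the real model, applying `torch.softmax(predictions.logits[0, -1, :], dim=-1)` should give the corresponding distribution over all 50257 tokens.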

## Generating texts from prompts

In [16]:
#sequence = 'the duke of'
#sequence = 'A no deal Brexit'
sequence = 'The UK is'

In [48]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

[{'generated_text': 'The UK is "worrying about the UK\'s future exit from the EU after Brexit, and have a strong interest in moving forward at such a'},
 {'generated_text': "The UK is one of the largest producers and exporters of the world's most sought-after luxury goods, which is used for clothing, accessories,"},
 {'generated_text': 'The UK is planning to go into an aggressive phase to boost the domestic trade deficit - though its approach will also depend on how high up the tax-'},
 {'generated_text': "The UK is on track to take full advantage of the Government's commitment to address the issue of the health and safety benefits for all British Columbians,"},
 {'generated_text': 'The UK is one of the most technologically advanced nations on Earth with population of just over 80 billion worldwide, and a huge amount of this has been due'},
 {'generated_text': 'The UK is one of our leading suppliers of modern hardware.\n\n"To help us fulfil this objective, we are making an investment i

In [55]:
generator(sequence, 
          max_length = 30, 
          num_return_sequences=2,
          num_beams=5, 
          no_repeat_ngram_size=2, )

[{'generated_text': 'The UK is set to leave the European Union in 2019.\n\nIn a statement, Prime Minister David Cameron said he was "deeply disappointed"'},
 {'generated_text': 'The UK is the only country in the world that does not have a minimum wage. The UK has one of the lowest minimum wages of any OECD country'}]
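
Setting `num_beams` switches on beam search: instead of committing to one word at a time, the model keeps several partial sequences alive and ranks them by their overall probability. The toy sketch below shows why that can matter. The probability table is invented for illustration, and the function is a simplified version of the idea, not the `transformers` implementation.

```python
import math

# Hypothetical next-word probabilities (illustrative values, not from GPT-2)
next_word_probs = {
    "The": {"UK": 0.6, "EU": 0.4},
    "UK": {"is": 0.5, "has": 0.5},
    "EU": {"is": 0.9, "has": 0.1},
    "is": {"leaving": 1.0},
    "has": {"left": 1.0},
}

def beam_search(start, steps, num_beams):
    beams = [([start], 0.0)]  # (words so far, summed log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            for word, prob in next_word_probs.get(words[-1], {}).items():
                candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the num_beams most probable partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return [(" ".join(words), math.exp(score)) for words, score in beams]

print(beam_search("The", steps=2, num_beams=1))  # greedy: commits to "UK" first
print(beam_search("The", steps=2, num_beams=2))  # also keeps "EU" in play
```

With one beam the search follows the locally best word "UK" and ends on a sequence with probability 0.3. With two beams it finds "The EU is" (0.4 × 0.9 = 0.36), which is more probable overall.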

In [56]:
generator(sequence, 
          max_length = 30, 
          num_return_sequences=2,
          do_sample=True, 
          top_k = 0,
          temperature=.7 )

[{'generated_text': 'The UK is attempting to negotiate an end to the £20bn deficit that has dogged the government for years and is now thought to be due to excessive'},
 {'generated_text': 'The UK is estimated to have more than half a million dogs on the streets of England but only 2.6 per cent have been sold to other states'}]
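
In the cell above, `top_k = 0` switches off top-k filtering, so words are sampled from the whole vocabulary, and `temperature` rescales the logits before sampling. The sketch below shows the effect of temperature on a toy distribution. The logit values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing logits by a temperature below 1 sharpens the distribution;
    # a temperature above 1 flattens it
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words

for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass moves to the top-scoring word, so the generated text becomes more predictable and less varied.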

In [57]:
generator(sequence, 
          max_length = 30, 
          num_return_sequences=2,
          top_k = 50,)

[{'generated_text': 'The UK is leading the way in developing this approach and, in particular, a range of development and marketing tools. It enables us to help developers start'},
 {'generated_text': "The UK is on course to create 300,000 jobs a year as Brexit negotiations approach. The government's chief trade negotiator, Liam Fox, said this"}]
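
Top-k sampling keeps only the `k` most probable next words and samples among those, which cuts off the long tail of unlikely continuations. Here is a minimal sketch on a made-up distribution; `top_k_sample` is a hypothetical helper for illustration, not a `transformers` function.

```python
import random

def top_k_sample(distribution, k, rng):
    # Keep the k most probable words and discard the rest
    top = sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [prob for _, prob in top]  # random.choices treats these as relative weights
    return rng.choices(words, weights=weights)[0]

# Hypothetical next-word distribution with a tail of unlikely words
distribution = {
    "leaving": 0.4, "set": 0.3, "one": 0.2, "banana": 0.07, "xylophone": 0.03,
}

rng = random.Random(0)
samples = [top_k_sample(distribution, k=3, rng=rng) for _ in range(10)]
print(samples)  # only "leaving", "set" and "one" can appear
```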

## Adapting a language model

Building a GPT-Brexit model on top of GPT-2.

Question: How does it change the behaviour of this model?
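
One way to build an intuition before running the adapted model is a count-based analogy. This is only an analogy: fine-tuning GPT-2 updates neural network weights by gradient descent rather than adding counts, and the two miniature corpora below are invented.

```python
from collections import Counter, defaultdict

def train(counts, text):
    # Update bigram counts in place: counts[previous][following] += 1
    words = text.split()
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1

def most_likely_next(counts, word):
    return counts[word].most_common(1)[0][0]

# Hypothetical miniature corpora
general_text = "the deal is good . the deal is good . the deal is done"
brexit_text = "the deal is off . the deal is off . the deal is off . the deal is off"

counts = defaultdict(Counter)
train(counts, general_text)   # "pretraining" on general text
before = most_likely_next(counts, "is")
train(counts, brexit_text)    # continued training on domain text
after = most_likely_next(counts, "is")

print(before, "->", after)  # good -> off
```

The model keeps what it learned earlier, but its predictions shift towards the patterns of the new text.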

In [49]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'Kaspar/gpt-brexit',tokenizer='gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

[{'generated_text': 'The UK is already more than three-quarters of the way through the legal phase of talks. If Brexit talks end on October 31, they go through'},
 {'generated_text': 'The UK is set to leave the EU for six months from the 29th Feb with no deal - though the EU - on 31 October it will be'},
 {'generated_text': 'The UK is not the only European country to be affected by a disruption to the border between Northern Ireland and the Republic of Ireland. This is likely to'},
 {'generated_text': 'The UK is set to leave the EU on 29 March 2019.\nThe transition period is the length of the transition period to allow for a smooth transition'},
 {'generated_text': 'The UK is at "a crossroads" in its Brexit negotiations.\nThe EU said it would only provide assurances about the country\'s legal status after'},
 {'generated_text': 'The UK is in an economic and social limbo with no clear plan at the moment.\n"With time running out and a trade deal so far under'},
 {'generated_text': 'The U

# Bias and Toxicity in PLMs (and LLMs)

An excellent [HuggingFace tutorial](https://colab.research.google.com/drive/1-HDJUcPMKEF-E7Hapih0OmA1xTW2hdAv#scrollTo=MOsHUjgdIrIW) to be integrated later.

# Fin.