# Poking at Ever Larger Language Models
## An introduction for (digital) humanists


## Pretrained Language Models

- Transition from N-Gram to Neural Language Models

- [Don't count, predict!](https://aclanthology.org/P14-1023.pdf) (when training a language model)

## Terminology
- Parameters are "knobs" you can adjust to get the output you want
- Deep Learning algorithms attempt to find the optimal setting of these knobs
![simpleNN](https://miro.medium.com/v2/resize:fit:624/1*U3FfvaDbIjr7VobJj89fCQ.png)
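
To make the "knobs" idea concrete, here is a minimal sketch with a single parameter `w`, tuned by gradient descent so that `w * x` matches some toy data. The data, the learning rate, and all function names are illustrative assumptions, not part of the original materials; real networks adjust millions of such knobs at once.

```python
# Toy data generated by y = 3 * x, so the best setting of the knob is w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def mean_squared_error(w):
    """How wrong the predictions w * x are, averaged over the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # start with the knob in an arbitrary position
learning_rate = 0.01  # how far to turn the knob at each step (assumed value)
for step in range(200):
    w -= learning_rate * gradient(w)  # turn the knob to reduce the error

print("learned w:", w, "error:", mean_squared_error(w))
```

The loop never sees the "right answer" for `w` directly; it only follows the slope of the error, which is the same principle deep learning applies at scale.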

- LM pretraining and fine-tuning (Why it works better)

## Common PLM variants
- Causal/Autoregressive language models (GPT series)
- Masked Language Models (BERT and family)
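
The difference between the two variants lies in what the model is asked to predict. The sketch below builds toy training examples for each; the helper names and the `[MASK]` placeholder are assumptions for illustration, and real models work on subword tokens and (in BERT's case) mask a random subset of positions rather than a single chosen one.

```python
tokens = ["The", "UK", "is", "leaving", "the", "EU"]

def causal_examples(tokens):
    """Causal LM: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, position, mask_token="[MASK]"):
    """Masked LM: hide one token; the model sees context on both sides."""
    corrupted = tokens[:position] + [mask_token] + tokens[position + 1:]
    return corrupted, tokens[position]

for context, target in causal_examples(tokens):
    print("causal:", context, "->", target)

corrupted, target = masked_example(tokens, position=3)
print("masked:", corrupted, "->", target)
```

Because a causal model only ever looks leftwards, it can be run repeatedly to generate text, whereas a masked model is better suited to filling gaps and producing representations of whole sentences.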


## Text Generation with GPT-2

Although far more complex, GPT-2 does essentially the same thing as our simple N-gram LM:
- Given a prompt, it computes the probability of the next word
- Then we sample a word, add it to the prompt, and repeat!
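
The predict, sample, append loop described above can be sketched with a tiny bigram model. The corpus and the function names here are made up for illustration; GPT-2 replaces the count table with a neural network and conditions on the whole prompt rather than just the last word.

```python
import random
from collections import Counter, defaultdict

corpus = "the uk is leaving the eu . the uk is a member of the eu .".split()

# Count how often each word follows each other word (a bigram model).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """Turn the follow-up counts for `word` into probabilities."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

def generate(prompt, n_words=5, seed=0):
    """Repeatedly sample a next word and append it to the prompt."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(n_words):
        probs = next_word_probs(words[-1])
        if not probs:  # the last word was never seen with a successor
            break
        candidates, weights = zip(*probs.items())
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)

print(next_word_probs("the"))
print(generate("the uk", n_words=6))
```

Changing `seed` gives different continuations from the same prompt, which is why the pipeline calls in this notebook return several distinct sequences.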

Materials are inspired by this [Blog Post](https://huggingface.co/blog/how-to-generate) and the excellent Programming Historian post.


In [15]:
!pip install transformers



In [16]:
#sequence = 'the duke of'
#sequence = 'A no deal Brexit'
sequence = 'The UK is'

In [30]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'gpt2')
generator(sequence, max_length = 30, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The UK is still grappling with the aftermath of Brexit in a period when millions are left living without the means to support themselves and their children."\n\n'},
 {'generated_text': 'The UK is leading the way with free wireless and home wireless. We have some of the best wireless coverage, including the largest networks in the world.'},
 {'generated_text': 'The UK is no longer able to buy new passports in any area, which means that people can get their British passport through the UK with no hassle.'},
 {'generated_text': "The UK is the UK's largest commercial oil producer and its financial sector is the second largest in the world after China, and currently accounts for only 15"},
 {'generated_text': "The UK is the leading producer of global high quality health services and public health services that are designed to tackle the most pressing challenges and to ensure people's"},
 {'generated_text': 'The UK is a leading market in the digital space, but as the num

In [31]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'Kaspar/gpt-brexit',tokenizer='gpt2')
generator(sequence, max_length = 30, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The UK is a member of the European Union as it is outside it.\nWhen Parliament takes the final vote on a transition deal, if a majority'},
 {'generated_text': 'The UK is due to leave the EU on 29 March 2019. It will have a very narrow role in EU affairs – but also have a significant role'},
 {'generated_text': 'The UK is set to leave the European Economic Area in March 2019.\nThe EU has said it wants the UK to leave the bloc on 29 March'},
 {'generated_text': 'The UK is committed to having a long and prosperous post-Brexit relationship and the Government’s focus should be on our common interests in trade,'},
 {'generated_text': 'The UK is prepared to consider “alternative arrangements” to avoid a hard border on the island of Ireland. However, this would create uncertainty'},
 {'generated_text': 'The UK is due to leave the EU on March 29 and Mr Johnson is set to announce the terms and conditions of Brexit on Tuesday in an unprecedented address'},
 {'generated_text':

## Behind the pipeline: Next word prediction

In [39]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model
import numpy as np


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [40]:
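# GPT2Model is the same network without the language-modelling head:
# it returns hidden states rather than next-token logits
model2 = GPT2Model.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)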

In [41]:
input_ids = tokenizer.encode(sequence, return_tensors='pt')


In [42]:
input_ids

tensor([[ 464, 3482,  318]])

In [43]:
predictions = model(**tokenizer(sequence, return_tensors='pt'))

In [44]:
predictions.logits.shape

torch.Size([1, 3, 50257])

In [45]:
# take the logits at the last position, pick the highest-scoring token id, and decode it
tokenizer.decode(np.argmax(predictions.logits[0,-1,:].detach().numpy()))

' the'
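
Taking the `argmax` only reveals the single most likely token. To see probabilities, the logits need to pass through a softmax first. The sketch below does this on a hand-made toy vocabulary and made-up logit values (both are assumptions for illustration); in the notebook the same steps would apply to the 50257 values in `predictions.logits[0, -1, :]`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical four-token vocabulary with invented scores.
toy_vocab = [" the", " a", " not", " in"]
toy_logits = [2.0, 1.0, 0.5, -1.0]

for token, prob in top_k(toy_logits, toy_vocab):
    print(repr(token), round(prob, 3))
```

Sampling from these probabilities, instead of always picking the top token, is what makes generated continuations differ from run to run.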

# Bias and Toxicity in PLMs (and LLMs)

See this excellent [Hugging Face tutorial](https://colab.research.google.com/drive/1-HDJUcPMKEF-E7Hapih0OmA1xTW2hdAv#scrollTo=MOsHUjgdIrIW).