# Language Models 1 | Inference

For more, see the Huggingface [text generation task page](https://huggingface.co/tasks/text-generation) and the [generation strategies guide](https://huggingface.co/docs/transformers/generation_strategies).

## Install & Workflow

#### Drive

If you need to load/save to your drive:

```python
import os
import sys

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')
    # change to another directory (relative to /content, Colab's default working directory)
    os.chdir('drive/My Drive/IS53055B-DMLCP/DMLCP/python')
```

#### Huggingface login

Some models and datasets require you to be logged into your Huggingface (HF) account, and so does pushing your own model to HF (same as GitHub, but for models).

For that, create an account [here](https://huggingface.co/), then go to ['/settings/tokens'](https://huggingface.co/settings/tokens) to create an access token.

```python
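from pathlib import Path
from huggingface_hub import notebook_login

# The token location depends on the huggingface_hub version: check the older and the newer default
token_paths = [Path.home()/'.huggingface'/'token', Path.home()/'.cache'/'huggingface'/'token']
if not any(p.exists() for p in token_paths):
    notebook_login()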
```

#### Install

1. On Colab, you need to install `transformers`:

```python
!pip install -Uq transformers
```

2. Locally, I recommend creating a new environment when working with Huggingface, simply because it'll be faster and because the preferred library behind HF is PyTorch, which can conflict with TensorFlow. I have detailed the steps [in the PyTorch part of `setup.md`](https://github.com/jchwenger/DMLCP/blob/main/setup.md#pytorch--huggingfacegradio).

## Imports

In [None]:
import torch

# Get cpu, gpu or mps device (used here for inference).
# See: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

from transformers import pipeline
from transformers import GenerationConfig

from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel

In [None]:
# device = "cpu"

## Out-of-the-box Generation: the `pipeline`

In [None]:
import textwrap # The textwrap module automatically formats text for you

tw = textwrap.TextWrapper(   # many more options, see them with textwrap.TextWrapper?
    width=79,                # the formatted width we want
    replace_whitespace=False # this will keep whitespace & line breaks in the original text
)

def wrap_print(s):
    """Format text into Textwrapped lines and print it"""
    print("\n".join(tw.wrap(s)))

In [None]:
generator = pipeline(
    'text-generation', # the specific task, which is also the tag on huggingface
    model='gpt2',      # so many more models here: https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads
    device=device      # the default is just cpu, see here: https://huggingface.co/docs/transformers/pipeline_tutorial#device
)

See [here](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.from_pretrained.example) for an example using `GenerationConfig` and [here](https://github.com/huggingface/transformers/issues/19853#issuecomment-1290759818) for the `pad_token_id` fix.

In [None]:
generation_config = GenerationConfig.from_pretrained("gpt2")
generation_config.pad_token_id = generation_config.eos_token_id

Huggingface is transitioning towards the use of generation config files (good for automation).

In [None]:
generation_config.max_length = 25
generation_config.do_sample = True
generation_config.top_p = 0.95
generation_config.temperature = .9
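
To build an intuition for what `temperature` and `top_p` do, here is a minimal pure-Python sketch on a made-up four-token distribution. The logits and the helper names (`softmax_with_temperature`, `top_p_filter`) are assumptions for illustration; this is not the library's implementation, which works on tensors over the whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens

# Compare how peaked the distribution is at different temperatures
for t in (0.5, 0.9, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])

# Token indices that survive nucleus (top-p) filtering, with renormalised probabilities
nucleus = top_p_filter(softmax_with_temperature(toy_logits, 0.9), top_p=0.95)
print(nucleus)
```

Sampling then draws the next token from the filtered distribution, so a low `temperature` or a low `top_p` both make the output more predictable, in different ways.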

### Quick vocab note:

`bos`: beginning of sentence  
`eos`: end of sentence  
`pad`: padding

These are special tokens that have been inserted into the text at training time.

For instance, in our case the 'beginning of sentence' token is `<|endoftext|>`: during training, the texts are fed to the network one after the other, with this special token between them, so the end of one text is also the beginning of the next:
```python
print(generator.tokenizer.bos_token) # '<|endoftext|>'
```
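
As an illustration of that idea, the sketch below joins a few made-up documents into one stream with the special token between them. The `documents` list is invented, and real preprocessing works on token ids rather than raw strings, so take this as a picture of the principle only.

```python
EOT = "<|endoftext|>"  # the GPT-2 tokenizer uses this one token as bos and eos

documents = [  # made-up examples
    "Once upon a time, there was a model.",
    "Breaking news: nothing happened today.",
    "Recipe: boil water, add pasta.",
]

# Texts are laid end to end, with the special token marking each boundary
training_stream = EOT.join(documents)
print(training_stream)

# Splitting on the token recovers the original documents
recovered = training_stream.split(EOT)
print(recovered == documents)
```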

### Generate text!

In [None]:
generation_config

In [None]:
# torch.manual_seed(1)
generator(
    "Once upon a time,",
    generation_config=generation_config
)

Parallel generation!

In [None]:
# torch.manual_seed(1)
generator(
    ["Once upon a time,"] * 2,
    generation_config=generation_config
)
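
The pipeline returns nested Python structures rather than plain strings: typically a list of dictionaries with a `generated_text` key for a single prompt, and one such list per prompt when given a list of prompts. The helper below is a small sketch for flattening either shape into a list of strings. `extract_texts` is a hypothetical name, and the sample outputs are hand-written stand-ins so that the block runs without loading a model.

```python
def extract_texts(outputs):
    """Flatten text-generation pipeline outputs into a flat list of strings."""
    texts = []
    for item in outputs:
        if isinstance(item, dict):   # single prompt: [{"generated_text": ...}, ...]
            texts.append(item["generated_text"])
        else:                        # several prompts: [[{...}], [{...}], ...]
            texts.extend(extract_texts(item))
    return texts

# Hand-written stand-ins mimicking the two shapes (not real model output)
single = [{"generated_text": "Once upon a time, a fox"}]
batch = [
    [{"generated_text": "Once upon a time, a king"}],
    [{"generated_text": "Once upon a time, the sea"}],
]

for text in extract_texts(single) + extract_texts(batch):
    print(text)
    print("-" * 40)
```

With the cells above, the strings returned by `extract_texts(generator(...))` could then be passed one by one to `wrap_print`.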

---

## Deeper: `Tokenizer` and `Model` classes

What does the pipeline do under the hood?

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2",
    pad_token_id=tokenizer.eos_token_id # add the EOS token as PAD token to avoid warnings
).to(device) # to GPU/MPS/CPU

### Note on model classes

Huggingface can pick the right classes automatically from the model name, so instead of `GPT2LMHeadModel` and `GPT2Tokenizer` you can use `AutoModelForCausalLM` and `AutoTokenizer`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(device)
```

Letting the library choose the right model architecture has become so popular that these auto classes are now used almost everywhere.

### The tokenizer

See the [Preprocess tutorial](https://huggingface.co/docs/transformers/preprocessing) on Huggingface for more details.

In [None]:
toks = tokenizer.encode("Oh sweet midnight")
print(toks)
print(tokenizer.decode(toks))
print()

toks = tokenizer(["Oh sweet midnight", "harbinger of doom"])
print(toks)
print(tokenizer.batch_decode(toks['input_ids']))

In [None]:
input_ids = tokenizer.encode('Once upon a time', return_tensors='pt') # pytorch tensors
print(input_ids)

batched_input_ids = torch.tile(input_ids, (4,1)).to(device) # just copying the tensor 4 times
print(batched_input_ids)
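
Tiling works because every row is the same prompt. Prompts of different lengths, like the two phrases tokenized above, need padding before they can be stacked into one tensor, together with an attention mask telling the model which positions are real. The sketch below shows the idea in plain Python on made-up token ids. `pad_batch` is a hypothetical helper, and left padding is chosen on the assumption that a decoder-only model like GPT-2 should see the real tokens last when generating.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token ids to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 50256  # id of GPT-2's <|endoftext|> token, reused as padding
toy_batch = [[11, 22, 33], [44, 55, 66, 77, 88]]  # made-up ids of unequal length

padded = pad_batch(toy_batch, PAD_ID)
print(padded["input_ids"])       # rows now have the same length
print(padded["attention_mask"])  # 0 marks padding, 1 marks real tokens
```

In `transformers`, the tokenizer can do this itself once a pad token is set (for example `tokenizer.pad_token = tokenizer.eos_token`, then calling it with `padding=True`).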

### Generate Text!

In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('Once upon a time', return_tensors='pt') # pytorch tensors

batched_input_ids = torch.tile(input_ids, (4,1)).to(device) # copy and place on GPU/MPS/CPU

# same logic as before
generation_config = GenerationConfig.from_pretrained("gpt2")
generation_config.pad_token_id = generation_config.eos_token_id
generation_config.max_length = 25
generation_config.do_sample = True
generation_config.top_p = 0.95
generation_config.temperature = .9

# generate text until the output length (which includes the context length) reaches max_length (25 here)
output = model.generate(
    batched_input_ids, # try input_ids as well for a single strand
    generation_config=generation_config,
)
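
Conceptually, `generate` runs a loop: predict a distribution over the next token, sample from it, append the sample to the context, and repeat until `max_length` or an end token is reached. The sketch below mimics that loop with a tiny hand-written bigram table in place of a neural network. The table, the `<eos>` marker and `toy_generate` are all invented for illustration; the real method works on tensors of token ids and supports many more strategies.

```python
import random

# Made-up "model": for each word, the possible next words and their weights
BIGRAMS = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "king": 0.3},
    "time": {"there": 0.6, "<eos>": 0.4},
    "there": {"was": 1.0},
    "was": {"a": 1.0},
    "king": {"<eos>": 1.0},
}

def toy_generate(prompt, max_length=25, seed=None):
    """Sample one word at a time until <eos> or until max_length words (prompt included)."""
    rng = random.Random(seed)
    tokens = prompt.lower().split()
    while tokens and len(tokens) < max_length:
        choices = BIGRAMS.get(tokens[-1])
        if not choices:              # context unknown to the toy table: stop
            break
        words, weights = zip(*choices.items())
        next_word = rng.choices(words, weights=weights, k=1)[0]
        if next_word == "<eos>":     # the end token stops generation
            break
        tokens.append(next_word)     # the sample becomes part of the context
    return " ".join(tokens)

for seed in range(3):
    print(toy_generate("Once upon a time", max_length=12, seed=seed))
```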

In [None]:
texts = tokenizer.batch_decode(output, skip_special_tokens=True)

for t in texts:
    wrap_print(t)
    print("-" * 40)

---

# Experiments

1. Test everything! Make sure you understand and develop an intuition for:
 - the various parameters: `temperature`, `top_k`, `top_p`;
 - the `tokenizer` object to convert text into tokens and back;
 - how to handle the whole pipeline.
   Also, you can search for different [models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)! Beware, some of them may exceed your GPU capacity. People have finetuned language models on many types of texts.
2. Can you think of a way to introduce computational thinking into this? Ideas:
  - First, you could explore ways of making things look nicer than a plain list of objects. You could write a print function that knows exactly how to take the model output and display it nicely. The specialised Python package with many text functionalities is [textwrap](https://docs.python.org/3/library/textwrap.html) (see also [here](https://www.geeksforgeeks.org/textwrap-text-wrapping-filling-python/));
  - Can you think of ways to construct a writing **loop**? By that, I mean:  
    a. Prepare prompt  
    b. Generate one or more strands of text  
    c. Select text from strands, go back to a.  
    This could simply mean writing a system of helper functions and classes to assist you in the writing...
  - One could imagine all sorts of strange ways to work with text, from programmatically chunking the generated text and scrambling it before using it again as a prompt, to exploring what the model does if you use unreasonable parameters (e.g. a very high or low `temperature`).
  - Also, can you think of ways to work with various strands of text (taking advantage of the fact that a model can generate in parallel)?

3. The exploration of the *biases* of these models (and there are tons!) has already been the subject of a lot of debate and controversy. GPT-2 was trained mostly on Internet text, namely web pages linked from well-rated Reddit posts (see [this open-source replication](https://github.com/jcpeterson/openwebtext)). Unsurprisingly, the topics and points of view reflect that corner of human activities...
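
The writing loop suggested in point 2 (prepare a prompt, generate strands, select one, go back to the start) could be organised as below. This is only a sketch: `writing_loop`, `fake_generate` and `pick_longest` are hypothetical names, and `fake_generate` is a stand-in returning canned continuations so that the block runs without a model.

```python
import random

def fake_generate(prompt, n_strands=3, rng=None):
    """Stand-in for a language model: the prompt plus a random canned continuation."""
    rng = rng or random.Random()
    endings = [" the sea rose.", " nobody spoke of it again.", " a door opened", " and then,"]
    return [prompt + rng.choice(endings) for _ in range(n_strands)]

def pick_longest(strands):
    """Toy selection rule: keep the longest strand."""
    return max(strands, key=len)

def writing_loop(prompt, generate_fn, select_fn, rounds=3):
    """a. prepare prompt, b. generate strands, c. select one and feed it back in."""
    history = [prompt]
    for _ in range(rounds):
        strands = generate_fn(history[-1])   # b. several candidate continuations
        history.append(select_fn(strands))   # c. the selection becomes the next prompt
    return history

rng = random.Random(0)  # fixed seed so runs are repeatable
history = writing_loop(
    "Once upon a time,",
    generate_fn=lambda p: fake_generate(p, n_strands=3, rng=rng),
    select_fn=pick_longest,
    rounds=3,
)
for step, text in enumerate(history):
    print(step, text)
```

In practice one might swap `fake_generate` for a function wrapping `generator(...)` or `model.generate(...)`, and replace `pick_longest` with a manual choice (for instance via `input()`) or a scoring function.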