# CS6493 - Tutorial 7: Language Model for Text Generation

In this tutorial, we will first introduce how to fine-tune a language model using Hugging Face.

There are two types of language modeling: causal and masked. This tutorial illustrates causal language modeling.
Causal language models are frequently used for text generation. You can use these models for creative applications like
a choose-your-own text adventure or an intelligent coding assistant like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on
the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
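
The "attend only to the left" rule can be pictured as a lower-triangular mask. The following is a minimal sketch in plain Python; `causal_mask` is a hypothetical helper written for illustration, not part of any library.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len matrix of 0/1 flags.

    Entry [i][j] is 1 when position i may attend to position j,
    which for a causal model means j <= i (the token itself and
    everything to its left).
    """
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row corresponds to one position in the sequence: the first token sees only itself, while the last token sees the whole sequence.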

This tutorial will show you how to:

1. Finetune [DistilGPT2](https://huggingface.co/distilgpt2) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5_category) dataset.
2. Use your finetuned model for inference.


<Tip>
You can finetune other architectures for causal language modeling following the same steps in this guide.
Choose one of the following architectures:

[OpenAI GPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/openai-gpt), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [Llama](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/llama), [CodeLlama](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/code_llama), etc.



</Tip>


## Preparing Data

Before you begin, make sure you have all the necessary libraries installed:


In [None]:
! pip install accelerate


In [None]:
! pip install transformers datasets evaluate


### Load ELI5 dataset

Start by loading a small portion of the r/askscience subset of the [ELI5_category dataset](https://huggingface.co/datasets/eli5_category) from the 🤗 Datasets library.
This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.

To save training time, we only use part of the dataset for illustration purposes. You can use the whole dataset for better model performance in your own environment.

In [None]:
from datasets import load_dataset

eli5 = load_dataset("eli5_category", split="train[:500]")


The repository for eli5_category contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eli5_category.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

Split the loaded `train` subset into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [None]:
eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

In [None]:
eli5["train"][0]

{'q_id': '5lj9ns',
 'title': 'Why is the HD quality that comes out of my $25 indoor antenna better than what I used to get through a cable box?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dbw4z5c'],
  'text': ['Your cable company is compressing the signal. Over-the-air transmissions are 100% pure, uncompressed video signals.'],
  'score': [8],
  'text_urls': [[]]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

While this may look like a lot, you're only really interested in the `text` field. What's convenient about language modeling
tasks is that you don't need separate labels (the task is self-supervised, often loosely called unsupervised) because the next word *is* the label.
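
To make "the next word is the label" concrete, here is a small sketch in plain Python. The word-level `tokens` list is a made-up example; a real tokenizer would produce subword IDs instead.

```python
# Hypothetical word-level tokens, for illustration only.
tokens = ["The", "cable", "company", "compresses", "the", "signal"]

# Each training pair is (everything seen so far, the token that comes next).
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

A sequence of six tokens therefore yields five prediction targets without any manual annotation.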

## Preprocess

The next step is to load a DistilGPT2 tokenizer to process the `text` subfield:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")


You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to
extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:

In [None]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '5lj9ns',
 'title': 'Why is the HD quality that comes out of my $25 indoor antenna better than what I used to get through a cable box?',
 'selftext': '',
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dbw4z5c'],
 'answers.text': ['Your cable company is compressing the signal. Over-the-air transmissions are 100% pure, uncompressed video signals.'],
 'answers.score': [8],
 'answers.text_urls': [[]],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead
of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

In [None]:
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need:

In [None]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/400 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1801 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1543 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1209 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1145 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/100 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1647 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1062 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1433 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1695 > 1024). Running this sequence through the model will result in indexing errors


This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.

In [None]:
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. You could pad instead if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
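
To see what the chunking does, the sketch below runs the same logic on a toy batch. `group_texts_demo` is a hypothetical copy of `group_texts` that takes `block_size` as an argument, and the token IDs are invented for illustration.

```python
def group_texts_demo(examples, block_size):
    # Same steps as group_texts, with block_size passed in explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three toy "tokenized answers" with 3 + 4 + 3 = 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # two blocks of 4; tokens 9 and 10 are dropped
```

Note that a block can span the boundary between two answers, which is the intended effect of concatenating before splitting.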

Apply the `group_texts` function over the entire dataset:

In [None]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)


Now create a batch of examples using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling). It's more efficient to *dynamically pad* the
sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

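Use the end-of-sequence token as the padding token and set `mlm=False`. With this setting, the collator uses the inputs themselves as labels; the model shifts them by one position when computing the loss, so each token is predicted from the tokens to its left: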

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
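
The idea of dynamic padding can be sketched without the library. The code below is an illustration only, not the collator's actual implementation: `pad_batch` is a hypothetical helper, `pad_id=0` is a placeholder value, and `-100` is the label value that the loss ignores.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions get -100 so they
        # do not contribute to the loss.
        labels.append(seq + [-100] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```

In this tutorial every example already has exactly `block_size` tokens after `group_texts`, so padding only matters if you change the preprocessing to keep variable-length sequences.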

## Train

<Tip>

Transformers provides a Trainer class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

</Tip>

You're ready to start training your model now! Load DistilGPT2 with [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM):

In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")


Before fine-tuning, evaluate the pretrained model on the test split to get a baseline perplexity:

In [None]:
import math
training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")



Perplexity: 55.37


For later comparison, also generate text from the pretrained model before fine-tuning:

In [None]:
from transformers import pipeline
prompt = "Somatic hypermutation allows the immune system to"
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
generator(prompt)

Device set to use cuda:0


[{'generated_text': 'Somatic hypermutation allows the immune system to generate the immune system to generate the defense systems to function. When antibodies target immune cells, immune cells start to suppress the attack. Once the immune system gets weakened, the immune system becomes more vulnerable.'}]

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. Optionally, you can push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model); the cell below does not set it.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, datasets, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 3.910408 |
| 2 | No log | 3.90252 |
| 3 | No log | 3.903068 |


TrainOutput(global_step=393, training_loss=3.89877389223521, metrics={'train_runtime': 82.0318, 'train_samples_per_second': 38.107, 'train_steps_per_second': 4.791, 'total_flos': 102101705293824.0, 'train_loss': 3.89877389223521, 'epoch': 3.0})

Once training is completed, use the [evaluate()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.evaluate) method to evaluate your model and get its perplexity:

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 49.55
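
Perplexity is the exponential of the average cross-entropy loss, so the numbers reported above can be checked by hand. The sketch below plugs in the final-epoch validation loss from the training log and inverts the baseline perplexity of 55.37; the variable names are chosen for illustration.

```python
import math

# Validation loss after the last epoch, taken from the training log above.
validation_loss = 3.903068
perplexity = math.exp(validation_loss)
print(f"Perplexity after fine-tuning: {perplexity:.2f}")

# Going the other way: the baseline perplexity implies this average loss.
baseline_perplexity = 55.37
baseline_loss = math.log(baseline_perplexity)
print(f"Implied baseline loss: {baseline_loss:.3f}")
```

A lower perplexity means the model assigns higher probability to the held-out text, so the drop from 55.37 to 49.55 reflects the modest improvement from fine-tuning on 400 examples.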


## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from:

In [None]:
prompt = "Somatic hypermutation allows the immune system to"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for text generation with your model, and pass your text to it:

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
generator(prompt)

Device set to use cuda:0


[{'generated_text': 'Somatic hypermutation allows the immune system to detect patterns of infection (for example a number of things), but not in one or two cases, the immune response from multiple cells. This means that a lot of the time that this happens causes antibodies'}]

## Inference via OpenAI API

Large pretrained models such as DeepSeek-R1 and GPT-4 can also be used for inference through the OpenAI API client. The example below sends a request to DeepSeek's OpenAI-compatible endpoint.

First, we install the openai package.

In [None]:
!pip install openai



In [None]:
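import os

from openai import OpenAI

# Read the key from an environment variable instead of hardcoding it in the notebook.
# DEEPSEEK_API_KEY is an assumed variable name; set it before running this cell.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")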

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "tell a short joke"},
    ],
    stream=False,
    temperature=0
)

print(response.choices[0].message.content)

Sure! Here's a short one for you:

Why don’t skeletons fight each other?  
They don’t have the guts! 😄


We can set other decoding parameters to obtain different answers.

(1) Temperature

We introduce a *temperature* parameter, denoted $t$, that rescales the logits before the softmax and so reshapes the probability distribution over the vocabulary:

$P(x = V_l \mid x_{1:i-1}) = \frac{\exp(u_l / t)}{\sum_{l'} \exp(u_{l'} / t)}$

Here $u_l$ is the logit of the $l$-th vocabulary token $V_l$ and $x_{1:i-1}$ is the context generated so far. Values of $t < 1$ bias the distribution towards high-probability words. As $t \to 0$, sampling becomes greedy decoding; as $t \to \infty$, it becomes uniform sampling. Lowering $t$ therefore reduces how often tokens are sampled from the low-probability tail of the distribution.

See the paper [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751).
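
The effect of temperature is easy to see on a toy example. The sketch below applies a temperature-scaled softmax to three made-up logits; `softmax_with_temperature` is a hypothetical helper written for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [u / temperature for u in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up logits for a 3-token vocabulary
for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(f"t={t}: {[round(p, 3) for p in probs]}")

sharp = softmax_with_temperature(logits, 0.1)
flat = softmax_with_temperature(logits, 10.0)
```

With a small temperature almost all of the probability mass sits on the highest-logit token, while a large temperature spreads it nearly evenly.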

(2) Top-p Sampling (also known as Nucleus Sampling)

Top-p sampling is another method used for controlling the diversity of generated text. Instead of selecting a fixed number of tokens (as in top-k sampling), top-p sampling selects the minimum number of tokens required to reach a cumulative probability threshold, p.

Here's how top-p sampling works:

1. Sort the tokens by probability in descending order.
2. Accumulate probabilities from the top until the running sum reaches the threshold p.
3. Retain only the tokens that contributed to that sum and discard the rest.
4. Renormalize the probabilities of the retained tokens so they sum to 1.

By adjusting the value of p, you can control the diversity of the generated text. Higher p values allow for more diverse outputs, as more tokens are considered, while lower p values result in more focused outputs, as fewer tokens are considered.
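
The four steps above can be sketched in a few lines of plain Python. `top_p_filter` is a hypothetical helper and the probabilities are invented; real implementations operate on tensors of logits.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_probability} for the nucleus."""
    # Step 1: token indices sorted by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    # Steps 2 and 3: keep tokens until the running sum reaches p.
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Step 4: renormalize over the retained tokens.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.05, 0.5, 0.15, 0.3]  # made-up distribution over 4 tokens
nucleus = top_p_filter(probs, p=0.75)
print(nucleus)
```

Here only the two most likely tokens survive, because together they already cover more than 75% of the probability mass; the next token would then be sampled from this smaller, renormalized distribution.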

(3) n (number of candidates)

The parameter 'n' indicates the number of candidates or completions generated by the language model. When generating text, the model produces 'n' different completions based on the input prompt. These completions can provide alternative responses or variations in the generated text.

By increasing the value of 'n', you can obtain a larger set of candidate responses, allowing for more diverse options or alternative completions. However, note that generating more candidates may require additional API usage and could impact response times.


In [None]:
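import os

from openai import OpenAI

# Read the key from an environment variable instead of hardcoding it in the notebook.
# DEEPSEEK_API_KEY is an assumed variable name; set it before running this cell.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")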

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "tell a short joke"},
    ],
    top_p=0.8

)

print(response.choices[0].message.content)

Sure! Here's a short one for you:

Why don’t skeletons fight each other?  
They don’t have the guts! 😄



## Question
However, LLMs may struggle with arithmetic on very large numbers (e.g., 638726816387261768327866*3926328917872368391). Can you come up with some ideas to solve this problem?

In [None]:
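import os

from openai import OpenAI

# Read the key from an environment variable instead of hardcoding it in the notebook.
# DEEPSEEK_API_KEY is an assumed variable name; set it before running this cell.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")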

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "calculate 638726816387261768327866*3926328917872368391"},
    ]

)

print(response.choices[0].message.content)

To calculate \(638,\!726,\!816,\!387,\!261,\!768,\!327,\!866 \times 3,\!926,\!328,\!917,\!872,\!368,\!391\), we can use the distributive property of multiplication. However, due to the large size of these numbers, the calculation is typically done using computational tools or programming languages like Python.

Here is the result:

\[
638,\!726,\!816,\!387,\!261,\!768,\!327,\!866 \times 3,\!926,\!328,\!917,\!872,\!368,\!391 = 2,\!507,\!955,\!315,\!973,\!043,\!944,\!620,\!832,\!797,\!126,\!006
\]

Let me know if you'd like further clarification!
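
One possible direction, hinted at in the response itself, is tool use: rather than trusting the model's digits, hand the arithmetic to a program. The sketch below is only an illustration of that idea. It relies on Python's built-in arbitrary-precision integers and compares the exact product with the value from the response above; the variable names are chosen for this example.

```python
a = 638726816387261768327866
b = 3926328917872368391

# Python integers have arbitrary precision, so this product is exact.
product = a * b

# The value printed in the model response above, with separators removed.
claimed = 2507955315973043944620832797126006

print("Exact product:", product)
print("Digits in exact product:", len(str(product)))
print("Digits in model's answer:", len(str(claimed)))
print("Model's answer is correct:", product == claimed)
```

The two values do not even have the same number of digits, which shows why delegating exact arithmetic to a calculator or code interpreter is a natural remedy to explore.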
