## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

## 14. Natural Language Generation and Large Language Models

Generating text with HuggingFace and open-weight LLMs takes only a few lines of code.
We will work with a small model (Qwen2-1.5B-Instruct), which we load in a quantized format. You can find information on the model here: https://ai.azure.com/catalog/models/qwen-qwen2-1-5b-instruct

### Quantization
In full precision, each parameter (weight) of the model takes 32 bits of memory. Quantization is a technique that compresses the model weights so that each parameter takes only 16 or 8 bits (or 4 bits if we really want to downsize). Output quality usually degrades only slightly, though quantized LLMs occasionally exhibit "weird" behavior.
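
To make the savings concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at each precision. The parameter count is a nominal 1.5 billion taken from the model name (an assumption, the exact count differs slightly), and activations and other runtime overhead are ignored.

```python
# Rough weight-memory estimate at different precisions.
# NUM_PARAMS is a nominal value read off the model name (assumption).
NUM_PARAMS = 1.5e9


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions, halving the bits per parameter halves the memory footprint of the weights.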

Want to learn more about this? --> You are welcome to join our "Search Engines and Neural Information Retrieval" course in your Master's studies! 😀


In [None]:
# Imports
import timeit

import torch
from transformers import pipeline, BitsAndBytesConfig

# Should print True: the quantized model below needs a GPU runtime.
print(torch.cuda.is_available())

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)  # Should print "cuda:0"

In [None]:
# This will take a while! You may have to restart the environment after installing.
!pip install -U bitsandbytes

In [None]:
# For our project, models will be provided in Licca scratch - talk to Fabio about which models you need!
model_path = "Qwen/Qwen2-1.5B-Instruct"

# Define how we want to quantize the model.
# bitsandbytes supports 8-bit (load_in_8bit) and 4-bit (load_in_4bit) loading.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Define the pipeline that we use to access the model
pipe = pipeline("text-generation", model=model_path, model_kwargs={"quantization_config": quantization_config}, device_map="auto")

Inspect the data structure returned by the model.

In [None]:
# Define the messages that will be added to the prompt
messages = [
    {"role": "system", "content": "Hello"},
    {"role": "user", "content": "Who are you?"},
]

# Calling the pipeline passes the messages to the model.
# do_sample=False selects greedy decoding, so no temperature setting is needed.
generated_text = pipe(messages, max_new_tokens=100, do_sample=False)

print(generated_text)

# Task: print only the generated response text.
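
As a hint for the task, the sketch below works on a mock of the structure that a chat-style text-generation pipeline typically returns: a list with one dictionary per input, whose `generated_text` entry holds the full conversation including the new assistant message. The mock and the helper name `extract_response` are illustrative assumptions; check them against the output printed above, since the layout may differ between library versions.

```python
# Mock of the returned structure (the assistant text is a placeholder,
# not real model output).
mock_output = [
    {
        "generated_text": [
            {"role": "system", "content": "Hello"},
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "<model response>"},
        ]
    }
]


def extract_response(pipeline_output):
    """Return the content of the last message of the first conversation."""
    conversation = pipeline_output[0]["generated_text"]
    return conversation[-1]["content"]


print(extract_response(mock_output))
```

If the real output matches this layout, `print(extract_response(generated_text))` would print only the response.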

The pipeline can also process batches of instances as illustrated in the following code segment.

You can ignore the following warning (it is due to an [inconsistency in the HuggingFace library](https://github.com/Vision-CAIR/MiniGPT-4/issues/129)):


> A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Try out various batch sizes and observe how long it takes.

💡 Make use of batch processing in your software project!
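
If you would rather address the padding warning quoted above than ignore it, one option is to switch the pipeline's tokenizer to left padding before running a batch. The helper below is a minimal sketch; the name `use_left_padding` is hypothetical, and it assumes the pipeline object exposes its tokenizer as `.tokenizer`.

```python
def use_left_padding(text_generation_pipe):
    """Switch the pipeline's tokenizer to left padding for batched generation."""
    tokenizer = text_generation_pipe.tokenizer
    tokenizer.padding_side = "left"
    # Batching needs a padding token; fall back to the end-of-sequence
    # token if the tokenizer does not define one.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return text_generation_pipe
```

It would be called once, for example as `pipe = use_left_padding(pipe)`, before the batch is passed to the pipeline.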

In [None]:
BATCH_SIZE = 2

# Prepare a batch of messages
batch_messages = [
    [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a short story about a dog."}],
    [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a short story about a cat."}],
    [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Write a poem about the ocean."}],
    [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Explain quantum physics in simple terms."}],
]

# Use the pipeline to process the batch of messages
start = timeit.default_timer()
batch_results = pipe(batch_messages, max_new_tokens=256, do_sample=False, batch_size=BATCH_SIZE)
end = timeit.default_timer()

print(f"Batch processing took {end - start:.1f} seconds")

# Print the generated text for each item in the batch
for i, result in enumerate(batch_results):
    print(f"Result for item {i+1}:")
    print(result)
    print("-" * 20)
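
To compare several batch sizes in one go, the timing logic above can be wrapped in a small loop. The following is a sketch under stated assumptions: `time_batch_sizes` and `dummy_run` are hypothetical names, and the dummy workload only splits a list into chunks so that the example runs without a model.

```python
import timeit


def time_batch_sizes(run_batch, items, batch_sizes):
    """Time run_batch(items, batch_size) per batch size; return {batch_size: seconds}."""
    timings = {}
    for batch_size in batch_sizes:
        start = timeit.default_timer()
        run_batch(items, batch_size)
        timings[batch_size] = timeit.default_timer() - start
    return timings


# Stand-in workload so the sketch runs without a model;
# replace it with a call to the pipeline.
def dummy_run(items, batch_size):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


timings = time_batch_sizes(dummy_run, list(range(8)), [1, 2, 4])
for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size}: {seconds:.4f} s")
```

With the objects from this notebook, `run_batch` could be `lambda msgs, bs: pipe(msgs, max_new_tokens=256, do_sample=False, batch_size=bs)` and `items` could be `batch_messages`.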

This coding approach should work similarly with other open-weight models when you access them via HuggingFace.

Hint: To use Llama or Gemma, first create and set a HuggingFace access token by following this tutorial:
https://pyimagesearch.com/2025/04/04/configure-your-hugging-face-access-token-in-colab-environment/

It may take anywhere from several hours to 1–2 days for your access request to be approved.
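
Independently of the linked tutorial, one common way to keep the token out of the notebook itself is to store it in an environment variable. The sketch below assumes the variable is called `HF_TOKEN`; the helper name `read_hf_token` is hypothetical.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in an environment variable, or None if unset."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hf_token()
if token is None:
    print("No token found; set HF_TOKEN before loading gated models.")
else:
    print("Token found.")
    # With the huggingface_hub package installed, the token can be registered:
    # from huggingface_hub import login
    # login(token=token)
```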