# Exercise 4: (Multimodal) Large Language Models

Install all requirements:

In [None]:
! pip install accelerate
! pip install transformers
! pip install bitsandbytes
! pip install duckduckgo_search
! pip install sentencepiece

We prepared a convenient wrapper class called `Model` that makes it easy to load an LLM from 🤗 Hugging Face. We will use the 8B version of **Llama 3** because it is among the state of the art in the class of open-source LLMs with 8 billion parameters.

Remember to set the `HUGGING_FACE_USER_ACCESS_TOKEN` as explained in the task instructions PDF.

In [None]:
import torch
from transformers import pipeline, BitsAndBytesConfig

HUGGING_FACE_USER_ACCESS_TOKEN = ""


class Model:
    """Convenient wrapper, embodying the LLM."""

    def __init__(self, name: str, task: str = 'text-generation',
                 temperature: float = 0.2, top_k: int = 50,
                 top_p: float = 0.9, revision=None):
        self.name = name
        self.temperature = temperature
        self.top_k = top_k
        self.top_p = top_p
        self.max_prompt_len = 512
        self.max_output_len = 512
        self.task = task
        self.revision = revision

        self.pipeline = self.load(name)

    def load(self, model_name: str):
        """Takes the Hugging Face model identifier and loads the corresponding model."""

        # Load the model in 4-bit precision to save computational resources
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        )

        pl = pipeline(
            self.task,
            max_new_tokens=self.max_output_len,
            temperature=self.temperature,
            top_k=self.top_k,
            top_p=self.top_p,
            model=model_name,
            device_map="auto",
            token=HUGGING_FACE_USER_ACCESS_TOKEN,
            truncation=True,
            model_kwargs={"quantization_config": quantization_config},
            revision=self.revision,
        )

        # Define the padding token
        pl.tokenizer.pad_token_id = pl.tokenizer.eos_token_id

        return pl

    def generate(self, prompt: str, **kwargs) -> str:
        """Takes the prompt string (and optionally hyperparameters like temperature,
        top_k, and top_p) and returns the string sequence continued by the LLM."""

        # Turn prompt into adequately formatted message
        message = [{"role": "user", "content": prompt.strip()}]
        message_formatted = self.pipeline.tokenizer.apply_chat_template(
            message,
            tokenize=False,
            add_generation_prompt=True
        )

        # Insert prompt message into the LLM pipeline to generate an output
        output = self.pipeline(message_formatted,
            do_sample=True,
            eos_token_id=self.pipeline.tokenizer.eos_token_id,
            pad_token_id=self.pipeline.tokenizer.pad_token_id,
            **kwargs
        )

        # Extract the response from the output message
        return output[0]['generated_text'][len(message_formatted):]


In [None]:
model = Model("meta-llama/Meta-Llama-3-8B-Instruct")

In [None]:
print(model.generate("Write me a 4-line poem about Star Wars."))

## Task 4.1: LLM Prompt Engineering

### Task 4.1a) Basic Prompt Best Practices

Each of the following prompt best practices is accompanied by a scenario and a bad prompt. Your task is to come up with a better prompt according to the respective best practice. Execute the old and the new prompt to observe the differences.

#### Write Clear Instructions

**Scenario**: You want to implement merge sort as a recursive function in Python. The function takes a list of integers and returns the list in ascending order. You want the function to be readable, so it should have short comments and type hints.

❌ Bad prompt:

In [None]:
prompt = "Implement merge sort."
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))

#### Provide Context

**Scenario**: As an IT consultant, you are going to meet business executives from the food production industry. You want to prepare a handout for the meeting that explains machine learning to them in the context of food production, including opportunities and risks.

❌ Bad prompt:

In [None]:
prompt = """Explain machine learning."""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))

#### Assign a Role

**Scenario**: You are an elementary school teacher and want to explain the solar system to an 8-year-old.

❌ Bad prompt:

In [None]:
prompt = """Describe the solar system to an 8 year old."""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))

#### Guide the Model's Attention

**Scenario**: You work in the support team of a large tech company, receiving many customer requests. You just received the following request:

*"I recently purchased a laptop from your store, and it arrived with a cracked screen. Additionally, the battery life is much shorter than advertised."*

You want to categorize the request as 'urgent' if it needs an immediate response, 'medium' if it is okay for a response to take a few days, or 'spam' if the request is not legitimate.

❌ Bad prompt:

In [None]:
prompt = """You work in the support team of a large tech company, receiving many
customer requests. You just received the following request: "I recently
purchased a laptop from your store, and it arrived with a cracked screen.
Additionally, the battery life is much shorter than advertised." Your task is to
categorize the request into 'urgent' if it needs immediate response, 'medium' if
it is okay if a response takes a few days, 'spam' if the request is not legit."""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))

#### Use Numbered Steps

**Scenario**: You are wondering if Nvidia shares would be a good investment. Since such an investment is a crucial decision, you want to get a well-informed recommendation which thoroughly analyzes the company's financial performance, market trends, competitive landscape, and potential risks.

❌ Bad prompt:

In [None]:
prompt = """Analyze whether it is a good idea to buy Nvidia shares right now.
Consider factors like the company's financial performance, market trends,
competitive landscape, and potential risks."""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))

#### Specify the Output Format

**Scenario**: You are debating with your friend about the impact of social media on mental health. You both realize that your discussion lacks arguments. You want a human-readable analysis of a few aspects, positive as well as negative, with an overall conclusion. However, you don't want to read more than 200 words.

❌ Bad prompt:

In [None]:
prompt = """Analyze the impact of social media on mental health."""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
prompt = """"""
print(model.generate(prompt))
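
The best practices above can also be combined in a small prompt-building helper. The following is only a sketch: `build_prompt` and its parameters are hypothetical names that are not part of the exercise code, and the example values are illustrative rather than a reference solution.

```python
from typing import Sequence


def build_prompt(task: str, role: str = "", context: str = "",
                 steps: Sequence[str] = (), output_format: str = "") -> str:
    """Assemble a prompt from optional building blocks (hypothetical helper)."""
    parts = []
    if role:
        parts.append(f"You are {role}.")  # Assign a role
    if context:
        parts.append(f"Context: {context}")  # Provide context
    parts.append(f"Task: {task}")  # Write a clear instruction
    if steps:
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
        parts.append(f"Follow these steps:\n{numbered}")  # Use numbered steps
    if output_format:
        parts.append(f"Output format: {output_format}")  # Specify the output format
    return "\n\n".join(parts)


example_prompt = build_prompt(
    task="Explain the solar system.",
    role="an elementary school teacher talking to an 8-year-old",
    steps=["Name the planets in order.", "Give one fun fact per planet."],
    output_format="At most 100 words, simple language.",
)
print(example_prompt)
```

The resulting string can be passed to `model.generate()` like any other prompt.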

### Task 4.1b) Top-k, Top-p, and Temperature

Run the following code to see the impact of the hyperparameters on the generated output.

In [None]:
import itertools

temperatures = [0.01, 0.2, 2.0]
top_ps = [0.1, 0.9, 0.999]
top_ks = [1, 5, 100]

prompt = "Write a one-paragraph technical report about the GPT-4 model."

# Use distinct names for the lists and the loop variables so the lists are not overwritten
for temperature, top_p, top_k in itertools.product(temperatures, top_ps, top_ks):
    print(f"Temperature: {temperature}, top-p: {top_p}, top-k: {top_k}")
    print(model.generate(prompt, temperature=temperature, top_p=top_p, top_k=top_k) + "\n")
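
To build intuition for what the three hyperparameters do, the following sketch applies them to a toy next-token distribution. It is a simplified illustration, not the code that runs inside the pipeline: the function names, the toy logits, and the order of the steps (temperature, then top-k, then top-p) are assumptions made for this example.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn logits into probabilities (numerically stable)."""
    max_logit = max(logits.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def filter_distribution(logits: dict[str, float], temperature: float,
                        top_k: int, top_p: float) -> dict[str, float]:
    """Return the renormalized probabilities of the tokens that survive filtering."""
    # Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
    probs = softmax({tok: v / temperature for tok, v in logits.items()})

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / total_k
        if cumulative >= top_p:
            break

    # Renormalize so that the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"the": 2.0, "a": 1.0, "GPT": 0.5, "banana": -1.0}  # made-up values
for t in (0.01, 0.2, 2.0):
    print(t, filter_distribution(toy_logits, temperature=t, top_k=3, top_p=0.9))
```

A low temperature or a small top-k or top-p leaves few candidate tokens, which makes the output nearly deterministic; high values keep unlikely tokens in play.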

### Task 4.1c) Chain-of-Thought (CoT) Reasoning

We begin with defining the context.

In [None]:
context = """I baked 16 muffins. My friends ate three quarters of them. I ate 1
muffin and gave 1 muffin to a neighbor. My partner then bought 6 more muffins
and ate twice as many as I did. How many muffins do we have now? """

❌ Bad prompt:

In [None]:
instruction = "I'm a lazy reader, so just write down the resulting number of muffins."
print(model.generate(context + instruction))

The right answer is 6. Most likely, the model outputs the wrong number here because it didn't get the chance to "think through" the problem. In a way, the model is forced to state its gut feeling, giving just an estimate and not a well-developed solution.
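
The stated answer can be verified step by step. Spelling out intermediate results like this is what a CoT prompt asks the model to do. The variable names below are illustrative.

```python
baked = 16
after_friends = baked - baked * 3 // 4  # Friends ate three quarters: 4 left
eaten_by_me = 1
given_away = 1
after_me = after_friends - eaten_by_me - given_away  # 2 left
after_purchase = after_me + 6  # Partner bought 6 more: 8
eaten_by_partner = 2 * eaten_by_me  # Partner ate twice as many as I did: 2
remaining = after_purchase - eaten_by_partner
print(remaining)  # 6
```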

❌ Still a bad prompt:

In [None]:
instruction = """Immediately state the number of remaining muffins before you
write down how you got to that number."""
print(model.generate(context + instruction))

✅ Better prompt:

Rewrite the above prompt to apply CoT. The prompt length should be less than 2 lines.

In [None]:
instruction = """"""
print(model.generate(context + instruction))

### Task 4.1d) In-Context Learning (ICL)

**Scenario**: You work at a kitchen equipment retailer and want your LLM to act as a chatbot called "Melinda" that responds to customer requests. You want the bot to be friendly and brief, and to always answer in a consistent format as follows:

*"Hi, I'm Melinda!*
*...*
*Sincerely, Melinda"*

Furthermore, you want the chatbot to ask for the order number whenever a customer has a problem.

Write a prompt that teaches the LLM to behave as described not via an explicit instruction but just via ICL exemplars. 

❌ Bad prompt:

In [None]:
prompt = """You are a friendly, helpful chatbot, responding to customer requests.
Please answer the following request.

Customer: "I received a blender, and it makes a loud noise and smells like
burning when I use it. What should I do?"
"""
print(model.generate(prompt))

✅ Better prompt:

In [None]:
exemplars = """"""
print(model.generate(prompt + exemplars))
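
One possible way to lay out ICL exemplars is as request/reply pairs followed by the new request with an open reply slot. The helper `format_exemplars` and the two example dialogues below are hypothetical and only illustrate the format; they are not a reference solution.

```python
def format_exemplars(exemplars: list[tuple[str, str]], request: str) -> str:
    """Render (customer request, desired reply) pairs as few-shot exemplars,
    followed by the new request with an open reply slot (hypothetical helper)."""
    blocks = [f'Customer: "{question}"\nMelinda: "{answer}"'
              for question, answer in exemplars]
    blocks.append(f'Customer: "{request}"\nMelinda:')
    return "\n\n".join(blocks)


# Made-up exemplars that demonstrate the greeting, the sign-off,
# and the request for an order number when there is a problem.
demo_exemplars = [
    ("Do you sell cast-iron pans?",
     "Hi, I'm Melinda!\nYes, we do. You can find them in the cookware section.\n"
     "Sincerely, Melinda"),
    ("My toaster stopped working after two days.",
     "Hi, I'm Melinda!\nI'm sorry to hear that. Could you please tell me your "
     "order number?\nSincerely, Melinda"),
]

few_shot_prompt = format_exemplars(
    demo_exemplars,
    "I received a blender, and it makes a loud noise when I use it.",
)
print(few_shot_prompt)
```

Because the exemplars already show the desired behavior, such a prompt needs no explicit instruction about tone or format.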

If you lack exemplars, you may apply a trick by letting the model first dynamically determine and state the best solving practices for the task at hand before solving it. See the following for an example:

❌ Bad prompt:

In [None]:
context = """You are the Digital Marketing Specialist of a travel agency. Your
goal is to produce a short Instagram Reel to promote travels to Greece. """
instruction = """Draft a creative, funny and catching screenplay for that Reel. """
print(model.generate(context + instruction))

✅ Better prompt:

In [None]:
expertise = """Begin with a professional expert-level GUIDE which
summarizes how to produce a successful Instagram promotion. Then,
draft the screenplay by taking the GUIDE into account."""
print(model.generate(context + instruction + expertise))

This approach differs from ICL but has a similar goal, namely to teach the model through the context. It is especially useful for more complex problems with larger outputs.

## Task 4.2: Implementing an LLM-based Fact-Checker

### Task 4.2a) A Simple Veracity Classifier

Implement the `verify()` method of the `FactChecker` class below. The method takes a `claim` as a string and returns two strings: the model's verdict on the claim's veracity and a one-paragraph justification. The verdict should equal
* `'supported'` if the claim holds true
* `'refuted'` if the claim is false
* `'not enough info'` if the knowledge is insufficient to come to a conclusion

You're provided with the helper function `extract_delimited()` which you can use to get the predicted label from the LLM's response.

In [None]:
import re


class FactChecker:
    valid_verdicts = ["supported", "refuted", "not enough info"]

    def __init__(self, llm: Model):
        self.llm = llm

    def verify(self, claim: str) -> tuple[str, str]:
        """Checks a given claim. Returns 'supported', 'refuted' or 'not enough info',
        depending on the claim's veracity, along with a short justification."""
        pass


def extract_delimited(text: str, delimiter: str = '`') -> str:
    """Extracts the (last) string that is enclosed by the specified delimiter.
    Returns an empty string if no matches were found."""
    # Escape the delimiter so that regex metacharacters such as '*' or '|' are matched literally
    escaped = re.escape(delimiter)
    pattern = f"{escaped}(.*?){escaped}"
    matches = re.findall(pattern, text)
    return matches[-1].strip(' \n') if matches else ''
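
For the extraction to work, the prompt has to ask the model to wrap its verdict in the delimiter. The extracted string may still differ in case or not be a valid verdict at all, so some normalization can help. The sketch below shows one way to do this; `parse_verdict`, its fallback behavior, and the sample response are assumptions for illustration and not part of the provided code.

```python
import re

VALID_VERDICTS = ["supported", "refuted", "not enough info"]


def parse_verdict(response: str, fallback: str = "not enough info") -> str:
    """Return the last backtick-delimited valid verdict in `response`, or
    `fallback` if no delimited string is a valid verdict (hypothetical helper)."""
    candidates = re.findall(r"`(.*?)`", response)
    for candidate in reversed(candidates):
        verdict = candidate.strip().lower()
        if verdict in VALID_VERDICTS:
            return verdict
    return fallback


# Made-up model response for demonstration
sample_response = "The claim contradicts well-established measurements. Verdict: `Refuted`"
print(parse_verdict(sample_response))
```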

Run the following to see if your model works:

In [None]:
fc = FactChecker(model)
print(fc.verify("The earth is flat!"))

In order to test your fact-checker, we implemented an evaluation function with a mini-benchmark consisting of 10 test examples:

In [None]:
benchmark = [
    ("Bananas are berries, but strawberries aren't.", "supported"),
    ("According to a study, people who wear mismatched socks are 32% more creative.", "not enough info"),
    ("In Switzerland, it is illegal to own just one guinea pig.", "supported"),
    ("Ukraine was involved in the Crocus City attack in Russia.", "refuted"),
    ("Scientists have discovered that eating pizza upside down enhances its flavor by 18%.", "not enough info"),
    ("Donald Trump deported fewer immigrants than Barack Obama did.", "supported"),
    ("All of the satellite weather data for the day of the Iranian president's crash has been removed.", "refuted"),
    ("Statistically, giraffes are much more likely to get hit by lightning than people.", "supported"),
    ("The sea level has not risen in Rio de Janeiro since 1880.", "refuted"),
    ("There has been a 60% drop in government revenue.", "not enough info"),
]


def evaluate(fact_checker: FactChecker):
    n_correct = n_wrong = 0
    for i, (claim, ground_truth) in enumerate(benchmark):
        prediction, justification = fact_checker.verify(claim)
        print(f'{i + 1}. Claim: "{claim}"\n'
              f'Prediction: "{prediction}"\n'
              f'Ground truth: "{ground_truth}"\n'
              f'Justification: "{justification}"\n')
        if prediction == ground_truth:
            n_correct += 1
        else:
            n_wrong += 1

    print(f"{n_correct} out of {len(benchmark)} claims correctly verified.")

Evaluate your `FactChecker` by running the following snippet. Don't worry, it's actually not that easy to get all 10 predictions correct. Your model is already good if it is correct for at least 5 out of 10 instances.

In [None]:
evaluate(fc)

### Task 4.2b) Add Retrieval-Augmented Generation (RAG)

We prepared the following `DuckDuckGo` helper class for you to easily perform web searches with the DuckDuckGo search engine.

In [None]:
from duckduckgo_search import DDGS


class DuckDuckGo:
    """Class for querying the DuckDuckGo search engine."""

    def __init__(self):
        self.max_tries = 3

    def search(self, query: str, limit: int = 10) -> str:
        """Run a search query and return structured results."""
        attempt = 0
        while attempt < self.max_tries:
            parsed = ""  # Reset so that a failed request cannot leave `parsed` undefined
            try:
                response = DDGS().text(query, max_results=limit)
                parsed = self._parse_results(response)
            except Exception:
                pass  # Ignore the error and retry with a modified query

            if parsed:
                return parsed

            attempt += 1
            query += '?'  # Modify the query to increase chance that DuckDuckGo behaves differently

        return ""

    def _parse_results(self, response: list[dict[str, str]]) -> str:
        """Parse results from DuckDuckGo search and return them as a string."""
        results = []
        for i, result in enumerate(response):
            url = result.get('href', '')
            title = result.get('title', '')
            body = result.get('body', '')
            text = f"{title}: {body}"
            results.append(f'{i + 1}. From {url}\n{text}')
        return "\n\n".join(results)


Searching the web is now as easy as that:

In [None]:
ddg = DuckDuckGo()
print(ddg.search("Strangest things sold on eBay"))

Implement the `verify()` method of the `FactCheckerWithRAG` class so that it employs Retrieval-Augmented Generation (RAG).

In [None]:
class FactCheckerWithRAG(FactChecker):
    def __init__(self, llm: Model):
        super().__init__(llm)
        self.ddg = DuckDuckGo()

    def verify(self, claim: str) -> tuple[str, str]:
        """Checks a given claim. Returns 'supported', 'refuted' or 'not enough info',
        depending on the claim's veracity, along with a short justification."""
        pass
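
In RAG, the retrieved evidence is placed into the prompt so that the model can ground its verdict in it. A minimal sketch of such a prompt builder follows. The helper `build_rag_prompt`, the prompt wording, and the character budget of 1500 are assumptions chosen for illustration, not a reference solution.

```python
def build_rag_prompt(claim: str, evidence: str, max_evidence_chars: int = 1500) -> str:
    """Combine retrieved evidence and the claim into one prompt (hypothetical helper).
    The evidence is truncated to keep the prompt short."""
    evidence = evidence.strip()[:max_evidence_chars] or "No evidence was found."
    return (
        "Search results:\n"
        f"{evidence}\n\n"
        f'Claim: "{claim}"\n\n'
        "Based only on the search results above, briefly justify whether the claim is "
        "supported or refuted, or whether there is not enough info. "
        "End with the verdict enclosed in backticks, e.g. `supported`."
    )


# Made-up evidence string in the format produced by a search helper
demo_evidence = "1. From https://example.org\nEarth: Earth is an oblate spheroid."
print(build_rag_prompt("The earth is flat!", demo_evidence))
```

Inside `verify()`, the evidence could come from `self.ddg.search(claim)`, and the resulting prompt could be passed to `self.llm.generate()`.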


Again, test your fact-checking model:

In [None]:
fc_rag = FactCheckerWithRAG(model)
evaluate(fc_rag)

Your model is good if it achieves at least 7 out of 10 correct predictions.

## Task 4.3: Geolocating Photos with an MLLM

You are provided with a mini-benchmark consisting of 10 photos captured by one of our research associates. The goal is to implement a `Geolocator` that predicts the country where each image was captured.

First, load the benchmark by executing the following snippet:

In [None]:
from PIL import Image
import os

IMG_WIDTH = 500
IMG_DIR = "img/"

def load_image(path: str) -> Image.Image:
    img = Image.open(path)
    img = img.convert("RGB")
    aspect_ratio = img.height / img.width
    new_height = int(IMG_WIDTH * aspect_ratio)
    img = img.resize((IMG_WIDTH, new_height), Image.LANCZOS)
    return img

# Build the benchmark
geolocation_benchmark = []
img_paths = os.listdir(IMG_DIR)
for img_path in img_paths:
    img = load_image(IMG_DIR + img_path)
    label = img_path.split(' ')[1].split('.')[0]
    geolocation_benchmark.append((img, label))


Run the next snippet to see if the benchmark was loaded correctly. You should see each image along with its label. The label represents the 2-letter ISO country code, for example, "es" = "Spain".

In [None]:
for img, label in geolocation_benchmark:
    display(img)
    print(f"Location: {label}\n")


For your convenience, we implemented the following `MultimodalModel` which extends the original `Model` class.

In [None]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

class MultimodalModel(Model):
    """Usage is the same as with Model, except that the generate() method
    now accepts an Image in addition to the prompt."""
    
    def load(self, model_name: str):
        self.processor = LlavaNextProcessor.from_pretrained(model_name)
        self.processor.tokenizer.pad_token_id = self.processor.tokenizer.eos_token_id
        model = LlavaNextForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
        )
        model.to("cuda:0")
        return model

    def generate(self, image: Image.Image, prompt: str, **kwargs) -> str:
        prompt = f"[INST] <image>\n{prompt} [/INST]"
        assert len(prompt) / 3 < self.max_prompt_len, "Prompt is too long."
        # Pass inputs by keyword so the call does not depend on positional argument order
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
        output = self.pipeline.generate(
            **inputs,
            max_new_tokens=self.max_output_len,
            pad_token_id=self.processor.tokenizer.pad_token_id
        )
        return self.processor.decode(output[0], skip_special_tokens=True)[len(prompt)-6:]

Load the **LLaVA 1.6 7B** multimodal model (also called LLaVA-NeXT). This will take a few minutes.

In [None]:
mm = MultimodalModel("llava-hf/llava-v1.6-mistral-7b-hf")

Run the MLLM on the first image to see if it works:

In [None]:
img, _ = geolocation_benchmark[0]
response = mm.generate(img, "Where was that image captured?")
print(response)

Now, we're at the core of this task: Implement the `locate()` method of the `Geolocator` class below. It receives an `Image` and is supposed to return the 2-letter ISO country code of the image's location as a string.

In [None]:
class Geolocator:
    def __init__(self, mllm: MultimodalModel):
        self.mllm = mllm
    
    def locate(self, img: Image.Image) -> str:
        """Returns the predicted 2-letter ISO country code of the image's location."""
        pass
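
Since the evaluation compares strings exactly, the free-text response of the MLLM has to be reduced to a lowercase 2-letter code. The sketch below shows one possible post-processing step. The helper `extract_country_code` and the tiny name-to-code mapping are assumptions for illustration; a real mapping would need to cover all countries.

```python
import re

# Small illustrative subset of ISO 3166-1 alpha-2 codes
COUNTRY_TO_CODE = {"spain": "es", "germany": "de", "france": "fr", "italy": "it"}


def extract_country_code(response: str) -> str:
    """Map a free-text model response to a lowercase 2-letter code
    (hypothetical helper). Returns '' if nothing can be recognized."""
    text = response.strip().lower()
    # Case 1: the response already is a bare 2-letter code such as "ES".
    if re.fullmatch(r"[a-z]{2}", text):
        return text
    # Case 2: the response mentions a known country name.
    for country, code in COUNTRY_TO_CODE.items():
        if re.search(rf"\b{country}\b", text):
            return code
    return ""


print(extract_country_code("This photo was likely taken in Spain."))
```

Asking the model in the prompt to answer with the code only makes the first case more likely and the post-processing less fragile.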


Let's see again where your `Geolocator` locates the first image.

In [None]:
geolocator = Geolocator(mm)
img, _ = geolocation_benchmark[0]
print(geolocator.locate(img))

Again, we prepared an evaluation function for your `Geolocator`.

In [None]:
def evaluate_geolocator(geolocator: Geolocator):
    n_correct = n_wrong = 0
    for i, (image, ground_truth) in enumerate(geolocation_benchmark):
        prediction = geolocator.locate(image)
        print(f'{i + 1}. Prediction: "{prediction}"\n'
              f'Ground truth: "{ground_truth}"\n')
        if prediction == ground_truth:
            n_correct += 1
        else:
            n_wrong += 1

    print(f"{n_correct} out of {len(geolocation_benchmark)} images correctly located.")

Run the following snippet to test your `Geolocator`'s performance on the mini benchmark.

In [None]:
evaluate_geolocator(geolocator)

Geolocation is a hard task. Even humans need to invest considerable effort to perform it accurately. Since there are about 200 countries on the globe (creating a high risk of confusion), your model is already good if it achieves a few (3 or 4) correct predictions.