<a target="_blank" href="https://colab.research.google.com/github/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment/blob/main/Text/practical_solutions.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Week 4 AI & Text

**Objectives**

This week, we will:

- Learn how to interact with Large Language Models (LLMs) via code.
- Explore different prompting techniques and identify common prompting pitfalls.
- Apply LLMs to the extraction of metadata from scientific paper abstracts.

**Notes**

- If a line starts with the paintbrush symbol (🖌️), it asks you to implement a code part or answer a question.
- Lines starting with the light bulb symbol (💡) provide important information or tips and tricks.
- Lines starting with the checkmark symbol (✅) reveal the solutions to specific exercises.

# Intro

## Setup script

Before anything else, please run the following cell to install the required dependencies.

In [None]:
!curl -fsSL "https://github.com/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment/raw/refs/heads/main/Text/setup.sh" | bash

This may take some time to finish, so continue reading in the meantime.

## Ollama

Running a **Large Language Model (LLM)** is not easy, and usually, it's hidden from us behind web chat interfaces like ChatGPT or Gemini.
However, we can run the models ourselves, and this is what we will be doing in this notebook using Ollama.

[**Ollama**](https://ollama.com/) is an open-source tool that allows us to run Large Language Models (LLMs) on our own computer (or in this case, on the Google Colab server) rather than sending our data to a company like OpenAI or Google via the internet.
In research, this is very useful because it gives us more control over the models and data privacy, and allows us to use "open-source" models for free.

## Choosing an LLM model

On Ollama, and other platforms like Hugging Face, you will find hundreds if not thousands of models to choose from.
Have a look at [Ollama's model library](https://ollama.com/search).
This can be overwhelming!
However, there are a few considerations to keep in mind when choosing a model:

- **Model Size**: This refers to the number of parameters the model has.
  It is directly related to how big the model file is, how much memory (RAM) it needs, and how long it takes to generate a response.
  A larger model is likely to generate better responses, but you may not have the hardware to run it.
  In our environment, we have limited resources, so we will favour smaller, efficient models.

- **Intent and Specialisation**: Some models are "generic," while others are trained for specific purposes like coding, mathematical reasoning, or safety reviews.
  Choose one that aligns with your specific research goal.

- **Context Size**: Depending on the task, you may need a model with a large context window to process large chunks of text (like a full scientific paper) at once.
  This is especially important for multi-step interactions where a long conversation needs to fit into the model's context.

- **Model Capabilities**: As we will see below, some models have "thinking" abilities, others can use "tools" (like calculators), and others can analyse images.

- **Recency**: More recent models are likely to perform better due to novel research and innovation in training techniques.
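As a rough illustration of the link between model size and memory, the sketch below multiplies the number of parameters by the bytes used to store each one. The function name and the example model sizes are assumptions made for illustration, and the estimate ignores runtime overhead, so it is best read as a lower bound rather than a precise figure.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter=16):
    """Rough memory needed to hold the model weights, in gigabytes.

    Assumption: memory ~= parameters x bytes per parameter. Real usage is
    higher because of runtime overhead, so this is only a lower bound.
    """
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9


# Hypothetical example sizes: a 4-billion and a 70-billion parameter model
for name, params in [("4B model", 4e9), ("70B model", 70e9)]:
    full = estimate_weight_memory_gb(params, bits_per_parameter=16)
    reduced = estimate_weight_memory_gb(params, bits_per_parameter=4)
    print(f"{name}: ~{full:.0f} GB at 16 bits, ~{reduced:.0f} GB at 4 bits")
```

Storing each parameter with fewer bits (often called quantisation) is one reason why the same model can be distributed in files of very different sizes.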

💡 Safety Note: Downloading unknown models can be risky, but models from well-known AI labs, or those that are highly popular on these platforms, are generally safe to test.

Today we will use two models:

- **Gemma 3 (by Google)** is an open-source model.
  We are using a lightweight version today that features a 128k token context window (approx. 100,000 words), allowing it to process large chunks of text at once.
  While it is multimodal and can analyse images, we will focus on its text-processing abilities.

- **Qwen 3 (by Alibaba)** is a specialised "thinking model." Unlike standard LLMs that answer instantly, a thinking model generates a hidden chain of reasoning first.
  This helps it handle complex logic and multi-step problems more reliably, although it can still make mistakes.
  It is also designed for tool use, meaning it can work with external functions to increase its accuracy for specific tasks.
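To judge whether a document is likely to fit in a context window, a quick estimate can help. The sketch below uses the ratio implied above (128k tokens for roughly 100,000 words). The function names are hypothetical, and the ratio is only a rule of thumb, since real token counts depend on the model's tokenizer and on the text itself.

```python
# Rule-of-thumb ratio implied by "128k tokens ~ 100,000 words" (an approximation)
WORDS_PER_TOKEN = 100_000 / 128_000


def estimate_tokens(text):
    """Estimate the token count of a text from its whitespace-separated words."""
    num_words = len(text.split())
    return round(num_words / WORDS_PER_TOKEN)


def fits_in_context(text, context_tokens=128_000):
    """Return True if the estimated token count is within the context window."""
    return estimate_tokens(text) <= context_tokens


# Placeholder text standing in for a document of 5,000 words
sample_text = "ecology " * 5000
print(estimate_tokens(sample_text), fits_in_context(sample_text))
```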

*Important note:* Because these are "smaller" models, they are more limited than the commercial versions, which are usually massive.
However, all models share the same main limitations, such as hallucinations (making up facts).
The prompting tips we cover today are relevant to all models, large or small.

## Basic usage

Here we will use the `ollama` Python library to interact with the models.
The first step is to create a "client" that connects to the Ollama program running in the background.

Since we have Ollama installed on this specific Colab machine, we use `localhost` (which means "this computer").
However, this same code could be used to connect to a remote server or a high-performance computer elsewhere.

In [None]:
import ollama

# Create the client to talk to the local Ollama service
client = ollama.Client(host="http://localhost:11434")

The easiest way to prompt an LLM using Ollama is with the `generate` method.
This is a "one-shot" interaction: you send a prompt, and the model sends back a response.

In [None]:
# Generate a simple response using the Gemma model
result = client.generate(model="gemma3", prompt="What is ecology?")

# The result is a complex object with metadata, but we can print just the text
print(result.response)

#### 🖌️ *Ask anything*

Ask `gemma3` anything you want.
Try asking it to define a specific ecological term or to write a short story about a forest.

✅ There's no right answer to this question.
Let's ask it to write a story from the perspective of a tropical rainforest on a scorching day.

In [None]:
result = client.generate(
    model="gemma3",
    prompt="Write a story from the perspective of a tropical rainforest waking up on a scorching day.",
)

print(result.response)

#### 🖌️ *Compare models*

Run the same prompt ("What is ecology?") on `qwen3`.
Are the results different?
How is the tone or structure different?
Is there one that seems more "useful"?

✅ The code block below prompts `qwen3` with the same question.

In [None]:
result = client.generate(model="qwen3", prompt="What is ecology?")

print(result.response)

✅ Gemma uses a friendly and energetic tone, often including additional motivational sentences.
In contrast, Qwen3 provides a shorter, more direct definition that is straight to the point and avoids conversational filler.

Both models share a similar structure, making heavy use of Markdown headers and bullet points with bold headings.
However, Qwen3 tends to use a more structured list format compared to Gemma's more expansive layout.

In terms of helpfulness, Gemma is useful as it provides extra resources, though its verbosity can make it harder to quickly find specific information.
Qwen3 is more useful for a quick reference due to its conciseness, even though it did not provide further resources.

*Note*: These models are stochastic, and the outputs may vary across different runs of the same prompt.
There is no single "right" answer; the intent of this exercise is to identify the discriminating features of LLM responses and observe how these characteristics change between different models.

# Usual pitfalls

In this section, we will explore different ways LLMs can generate unhelpful or even harmful outputs, and look at simple strategies to mitigate these risks.

## Hallucinations

The most significant challenge when using LLMs is their tendency to generate **hallucinations**: outputs that are perfectly coherent but factually false.

In [None]:
result = client.generate(model="gemma3", prompt="What happened to the Parisian tiger?")

print(result.response)

If you use model outputs uncritically, it is very likely you will include something false that looks correct.
In academic research, text is scrutinised heavily by peers and supervisors.
Scientists are specifically trained to spot inconsistencies, so relying on unverified AI output can be a big risk.
**Always** double-check any output.

#### 🖌️ *Reference checking*

A common and dangerous type of hallucination is the creation of fake scientific references.
Run the following prompt, then use [Google Scholar](https://scholar.google.com/) to verify whether any of the papers it lists actually exist.

In [None]:
result = client.generate(
    model="gemma3",
    prompt="Give me a list of recent impactful papers that explore the relationship between bat migration and climate change.",
)

print(result.response)

✅ Gemma3 provided seven references grouped by topic.
Upon checking Google Scholar, none of the suggested papers actually exist.
Since the journals mentioned are well-known, it is highly unlikely they would be missing from a major database, confirming the references are hallucinations.

While the titles and descriptions sounded plausible and interesting, they are ultimately made up. The model did, however, helpfully suggest that the user verify the results using Google Scholar or other databases.
Notably, the model claimed today is November 2, 2023, which is false and likely reflects its training data cutoff.

Here is the answer provided by Gemma3 for reference:

```
Okay, here's a list of recent impactful papers exploring the relationship between bat migration and climate change, categorized for clarity and with brief summaries. Note that "impactful" is subjective and based on citation counts, journal prestige, and overall influence within the field. As of today, November 2, 2023, these are some of the most discussed and influential:

**1. Long-Term Trends & Broad Scale Impacts:**

* **Title:** "Climate change alters the spatial range of migratory bats"
   * **Authors:**  Ryan McAllister, et al.
   * **Journal:** *Nature*, 2021
   * **Impact:** This is arguably *the* landmark paper on this topic. It used a combination of bat tracking data (radar, radio tags), climate modeling, and species distributions to demonstrate a significant northward shift in the ranges of multiple migratory bat species across North America. It strongly links range shifts directly to warming temperatures and altered wind patterns. It’s been hugely influential.
   * **Key Findings:** Demonstrated predictable northward shifts in bat ranges, linked to changes in wind regimes, which bats use for efficient flight. Highlighted the vulnerability of bats at the southern edges of their ranges.

* **Title:** “Climate change influences migration routes and stopover sites of migratory bats”
   * **Authors:** David A. Smith, et al.
   * **Journal:** *Global Ecology and Biogeography*, 2020
   * **Impact:** Utilizes a large dataset of bat migration tracking to reveal how changing climate conditions are altering migration routes and the timing of stopover arrival.
   * **Key Findings:** Highlighted shifts in timing and routes, especially linked to changes in temperature and rainfall, demonstrating how bats are actively adapting to new conditions.

**2. Physiological & Behavioral Responses:**

* **Title:** "Climate-induced shifts in bat migration timing and energetic demands"
   * **Authors:**  Elizabeth H. Baker, et al.
   * **Journal:** *Proceedings of the National Academy of Sciences*, 2020
   * **Impact:**  A highly cited paper exploring the physiological consequences of climate-induced shifts in bat migration.
   * **Key Findings:** Showed that bats responding to warmer temperatures were experiencing earlier migration onset and increased energetic demands (higher metabolic rates) during the journey.  This suggests bats are energetically ‘running’ to stay ahead of climate change.

* **Title:** "Bat migration under climate change: a meta-analysis of movement ecology data"
   * **Authors:** Michael A. O’Neal, et al.
   * **Journal:** *Biology Letters*, 2020
   * **Impact:** A quantitative meta-analysis that pulls together data from numerous bat tracking studies to provide a more robust understanding of migration patterns and how they’re changing.
   * **Key Findings:** Offered insights into the key drivers of bat migration, including temperature, wind, and landscape features, quantifying the strength of these influences.

**3. Specific Species & Regional Studies:**

* **Title:** "Changes in the migration phenology of Mexican free-tailed bats under a warming climate"
   * **Authors:**  Joshua J. Evans, et al.
   * **Journal:** *Ecology*, 2019
   * **Impact:** Focused on a well-studied bat species (Mexican free-tailed bats) and demonstrated the impacts of climate change on migration phenology (timing) with robust data.
   * **Key Findings:**  Demonstrated earlier arrival at and departure from breeding grounds in response to warmer temperatures, with potential implications for reproductive success.

* **Title:** “Climate change is reshaping the migratory landscape of horseshoe bats”
   * **Authors:** Matthew R. Davis, et al.
   * **Journal:** *Science*, 2022
   * **Impact:** This study examined the effects of climate change on horseshoe bats (a group facing significant declines), revealing shifts in migration timing and range, and highlighting the importance of understanding how changes in climate are impacting these vulnerable populations.
   * **Key Findings:** Showed that changing wind patterns and temperature are affecting migration routes and the availability of suitable habitat for horseshoe bats, potentially exacerbating existing population declines.

**4.  Modeling and Predictive Research:**

* **Title:** "Climate-driven shifts in bat migration routes: A multi-species model"
   * **Authors:**  Sarah J. Roulier, et al.
   * **Journal:** *Ecography*, 2021
   * **Impact:** Utilizes modeling to predict future bat migration shifts under different climate change scenarios.
   * **Key Findings:** Provided projections of range shifts and altered migration routes, offering valuable information for conservation planning.

**Resources for Finding More Papers:**

* **Google Scholar:** [https://scholar.google.com/](https://scholar.google.com/) -  Use keywords like "bat migration climate change," "bat tracking climate," "bat range shift climate."
* **Web of Science:** [https://www.webofscience.com/](https://www.webofscience.com/) - A subscription-based database, but often accessible through university libraries.
* **ResearchGate:** [https://www.researchgate.net/](https://www.researchgate.net/) – Researchers often share their pre-prints and publications on this platform.

**Important Notes:**

* **Ongoing Research:** This field is rapidly evolving, with new studies published frequently.
* **Data Limitations:** Bat tracking data has limitations (e.g., tag weight, battery life).
* **Complex Interactions:** The relationship between bats and climate change is complex and influenced by many factors, including habitat loss, disease, and human activity.

To help me refine this list further and provide even more relevant papers, could you tell me:

*   Are you interested in a specific bat species or region?
*   What aspects of the migration-climate change relationship are you most interested in (e.g., timing, range shifts, physiology, conservation)?
```

This happens because LLMs operate on statistical probability.
During training, the model learns the structural patterns of language; since scientific citations follow a highly predictable format, the model can "simulate" a reference by mixing common researcher names with relevant keywords.
Because its internal knowledge is static (frozen at the moment its training ended), it isn't searching a real library.
Instead, it is simply predicting what a plausible citation looks like based on the patterns it has seen.

Nowadays, commercial models try to fix this by using web search to ground their answers in real-time data.
While this helps significantly, it doesn't solve the problem entirely.
Models can still misinterpret the search results or hallucinate details even when they have the correct source in front of them.
**Always double-check the outputs.**
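Part of the reference check can be automated. The sketch below is one possible approach and not part of the practical: it can query the public Crossref REST API for a title, and it uses a simple string-similarity score to decide whether any returned record resembles the claimed title. The helper names and the 0.9 threshold are assumptions chosen for illustration, and a missing match is only a warning sign, so a manual search is still needed.

```python
import json
import urllib.parse
import urllib.request
from difflib import SequenceMatcher


def title_similarity(title_a, title_b):
    """Return a similarity score between 0 and 1 for two titles (case-insensitive)."""
    return SequenceMatcher(None, title_a.lower().strip(), title_b.lower().strip()).ratio()


def has_close_match(claimed_title, found_titles, threshold=0.9):
    """Return True if any found title is nearly identical to the claimed one."""
    return any(title_similarity(claimed_title, t) >= threshold for t in found_titles)


def search_crossref_titles(title, rows=5):
    """Query Crossref for works matching a title and return their titles.

    Requires internet access; it is not called automatically in this sketch.
    """
    query = urllib.parse.urlencode({"query.bibliographic": title, "rows": rows})
    url = f"https://api.crossref.org/works?{query}"
    with urllib.request.urlopen(url, timeout=30) as response:
        items = json.load(response)["message"]["items"]
    # Each record stores its title as a list; skip records without one
    return [item["title"][0] for item in items if item.get("title")]


# Offline demonstration with hand-written titles
claimed = "Climate change alters the spatial range of migratory bats"
print(has_close_match(claimed, [claimed.upper()]))
print(has_close_match(claimed, ["A completely unrelated paper about soil fungi"]))
```

With internet access, `has_close_match(claimed, search_crossref_titles(claimed))` would combine the two steps for a single reference.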

## Stochasticity

LLMs generate text by "sampling" one token (a word or a piece of a word) at a time.
Instead of simply predicting a single next token, the model assigns a probability score to every possible token in its vocabulary, giving higher scores to those that are most likely to follow.
It then draws the next token at random, weighted by these scores.
This means the model's output can change from one run to the next.
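A toy version of this sampling step can make the idea concrete. In the sketch below, the candidate words and their probabilities are invented for illustration (a real model scores every token in a much larger vocabulary); `random.choices` then draws the next word in proportion to those probabilities.

```python
import random
from collections import Counter

# Invented probabilities for the word following "The body of ..."
next_word_probabilities = {
    "evidence": 0.40,
    "water": 0.25,
    "work": 0.20,
    "ecological": 0.15,
}


def sample_next_word(probabilities):
    """Draw one word at random, weighted by its probability."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return random.choices(words, weights=weights, k=1)[0]


# Repeating the draw shows that likely words appear often, but not always
draws = Counter(sample_next_word(next_word_probabilities) for _ in range(1000))
for word, count in draws.most_common():
    print(f"{word}: {count / 1000:.1%}")
```

Running the cell several times gives slightly different percentages each time, which mirrors the run-to-run variation seen with the real model.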

#### 🖌️ *Word predictability*

Take the following sentence from Daan's paper:

> The body of ecological literature, which informs much of our knowledge of the global loss of biodiversity, has been experiencing rapid growth in recent decades.

Which of these words are most predictable based on the preceding text?
Use the function below to test this on different segments, such as "The body of..." or "The body of ecological...".
Prompt the model several times with the same segment to see if, and how often, the response changes.

In [None]:
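def predict_next_word(sentence_chunk):
    """Ask the model for the single word that follows `sentence_chunk`."""
    result = client.generate(
        model="gemma3",
        prompt=f"Complete the given sentence. One word only. {sentence_chunk}:",
    )

    # Remove surrounding whitespace so repeated answers compare cleanly
    return result.response.strip()


# How predictable is "ecological"?
print(predict_next_word("The body of"))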

✅ By running the `predict_next_word` function, I found that "biodiversity" and "growth" were the most predictable words in the sentence.
When prompted with the preceding segments, Gemma3 predicted these specific words every time, whereas none of the other words were ever predicted.

*Note:* I identified these words by manually testing different sentence segments.
However, the process can be automated using the script below, which iterates through the sentence and measures the predictability of each word by running multiple trials.

In [None]:
# The original sentence to analyze
sentence = "The body of ecological literature, which informs much of our knowledge of the global loss of biodiversity, has been experiencing rapid growth in recent decades."

# Split the sentence into individual words
words = sentence.split()


# This function cleans both the predicted and expected word to avoid
# counting small variants as different
def normalise_word(word):
    """Lowercase the word and remove surrounding whitespace and punctuation for fair comparison."""
    return word.strip().lower().strip(".,;:!?\"'")


def measure_predictability(word_index, num_predictions=10):
    # Get the target word by its index
    target_word = words[word_index]

    # Take all the words up to the one we are interested in
    preceding_words = words[:word_index]

    # Join them into a sentence fragment to use as a prompt
    preceding_text = " ".join(preceding_words)

    # Counter for how many times the target word is correctly predicted
    num_matches = 0

    # List to store all predictions for this segment
    predicted_words = []

    # Repeat the trial several times to account for model variation
    for _ in range(num_predictions):
        # Generate the next-word prediction
        predicted_word = predict_next_word(preceding_text)

        # Normalise the word to ignore case and punctuation
        predicted_word = normalise_word(predicted_word)
        predicted_words.append(predicted_word)

        # Check if the prediction matches the target word
        if predicted_word == normalise_word(target_word):
            num_matches += 1

    # Calculate predictability as the ratio of successful matches
    predictability = num_matches / num_predictions

    # Return a dictionary containing the results and context
    return {
        "word": target_word,
        "score": predictability,
        "preds": predicted_words,
        "preceding": preceding_text,
    }


# Print the predictability of each word (starting from the fourth word)
for index in range(3, len(words)):
    results = measure_predictability(index)
    print(f"{results['word']} : {results['score']:.1%}")

This randomness can be a significant issue for the reproducibility of science.
We can address this stochasticity by lowering the **"temperature"** of the model, which concentrates the probability on the most likely word, or by setting a **"seed"** to ensure the same random choices are made every time.
While we will explore this later, note that these settings are often harder to control when using commercial models through their standard web interfaces.
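Both ideas can be illustrated without a model. The sketch below applies a temperature to a set of invented scores using the standard softmax formula, and shows that fixing the seed of Python's random generator reproduces the same draws. The scores and function names are assumptions made for illustration only.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in scores.values()]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - largest) for value in scaled]
    total = sum(exponentials)
    return {word: value / total for word, value in zip(scores, exponentials)}


def sample_words(probabilities, n, seed=None):
    """Draw n words; a fixed seed makes the sequence of draws repeatable."""
    rng = random.Random(seed)
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=n)


# Invented raw scores for three candidate next words
scores = {"biodiversity": 2.0, "habitat": 1.0, "species": 0.5}

for temperature in (1.0, 0.1):
    probabilities = softmax_with_temperature(scores, temperature)
    rounded = {word: round(p, 3) for word, p in probabilities.items()}
    print(f"temperature={temperature}: {rounded}")

# Same seed, same draws: this is what makes a run reproducible
probabilities = softmax_with_temperature(scores, temperature=1.0)
print(sample_words(probabilities, 5, seed=42) == sample_words(probabilities, 5, seed=42))
```

With the Ollama client, such settings are passed through the `options` argument of `generate`, for example `options={"temperature": 0, "seed": 42}`.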

## Verbosity

Model outputs tend to be very long and verbose.
You can see this in the previous examples, but let's test another scenario: imagine you encountered the term "LLM" for the first time and wanted a quick definition.

In [None]:
# A broad question often leads to a very long, conversational answer
result = client.generate(model="gemma3", prompt="What is an LLM?")

print(result.response)

Verbosity increases the likelihood of the output containing hallucinations or generally unhelpful content.
It also makes it harder to find the specific information you need for your research.

You can address this in two ways: specify a length limit or be more precise about what you are asking.
For example, if you only need a brief summary, tell the model exactly how much to write:


In [None]:
# Setting a clear constraint helps
result = client.generate(
    model="gemma3", prompt="What is an LLM? Reply in fewer than 100 words."
)

print(result.response)

Often, a small change in the question results in a much more useful and direct answer.
Instead of asking a broad "what is" question, try to pinpoint the exact information you require.

In [None]:
# A more specific question leads to a more targeted response
result = client.generate(model="gemma3", prompt="What does the acronym LLM stand for?")

print(result.response)
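A length limit in a prompt is a request rather than a guarantee, so it can be worth checking the reply. The helper below is a minimal sketch (its name and the sample text are made up for illustration) that counts whitespace-separated words; in the notebook it would be applied to `result.response`.

```python
def within_word_limit(text, max_words):
    """Return True if the text has at most max_words whitespace-separated words."""
    return len(text.split()) <= max_words


# Stand-in for a model reply; in practice pass result.response instead
reply = "A large language model is a neural network trained on text to predict the next token."
print(len(reply.split()), within_word_limit(reply, max_words=100))
```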

## Logical errors

Models do not reason in the way we understand it, that is, by constructing solid, valid logical arguments step by step.
While they can emulate reasoning, they often fail even in simple scenarios that require basic spatial or logical awareness.

In [None]:
result = client.generate(
    model="gemma3",
    prompt="I get out on the top floor (third floor) at street level. How many stories is the building above the ground?",
)

print(result.response)

One way researchers address this is by using models specifically trained to "reason" or "think".
During the fine-tuning process (often using Reinforcement Learning), these models are rewarded for creating accurate reasoning chains, which significantly improves their performance on logical tasks.
Another method is "Chain-of-Thought" (CoT) prompting, where you explicitly ask the model to explain its thinking process step by step.

#### 🖌️ *Reasoning models*

`qwen3` is a reasoning model: it generates a long chain of text in which it lays out its internal logic.
Try the same "third floor" prompt with `qwen3`.
Does it get it right?
Why or why not?
Give your interpretation.

✅ See the solution below.

In [None]:
result = client.generate(
    model="qwen3",
    prompt="I get out on the top floor (third floor) at street level. How many stories is the building above the ground?",
)

print(result.response)

✅ The correct answer is that the building has zero stories above ground (or is just one floor), since the top floor is already at street level.
Qwen3 was unable to solve this correctly.

During its "thinking", the model correctly identified two separate facts: that the third floor is the top floor, and that this floor is at street level.
However, it failed to combine these two inferences into a single logical conclusion.

While the "thinking" process may have helped the model break the problem into steps, and each piece of information was handled correctly, the model could not logically combine them to find the right answer.

#### 🖌️ *Chain of thought prompting*

Now, return to `gemma3`.
Try asking the same question again, but this time, specifically instruct it to "provide a logical explanation step by step before giving the final answer".
Does the quality of the output improve?

✅ See the solution below.

In [None]:
result = client.generate(
    model="gemma3",
    prompt="I get out on the top floor (third floor) at street level. How many stories is the building above the ground? Provide a logical explanation step by step before giving the final answer.",
)

print(result.response)

✅ While Gemma3 followed the instructions to break down its reasoning step by step, it still arrived at an incorrect solution.
The very first step of its explanation was logically invalid.

*Note*: This shows that while "Chain of Thought" prompting can encourage a model to be more methodical, it does not prevent hallucinations.
If a model generates an incorrect premise or "hallucinates" a logical step early on, the entire reasoning chain will lead to a wrong conclusion.

## Bias

Since model answers are based on the most statistically likely word, the output is heavily influenced by the training data.
If certain words, concepts, or taxa appear together frequently in the text the model was trained on, it will naturally recreate those associations.
This can bias the outputs towards common or well-documented examples, often at the expense of less-studied or minority cases.

In [None]:
# Observe which taxa the model prioritises based on common literature trends
prompt = (
    "Complete the sentence by filling in the blank with a single taxon, only provide the answer: "
    "Recent years have seen increasing pressures from multiple fronts including "
    "increased tourism and encroaching deforestation, in particular populations "
    "of <blank> have been declining."
)

result = client.generate(model="gemma3", prompt=prompt)

print(result.response)

Because of this, LLM outputs should rarely be used to answer questions that require a careful consideration of multiple viewpoints or diverse sources.
While the model can be a helpful search tool, you must be aware of its biases and always contrast its output with other sources of information.
Ultimately, answers to complex ecological questions should incorporate broad evidence and use rigorous statistical approaches to control for known biases in the data.
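A toy example can show how frequency alone drives this effect. In the sketch below, the tiny "corpus" is invented and a simple counter stands in for the model: the taxon mentioned most often is always the one returned, regardless of which taxa are actually declining.

```python
from collections import Counter

# Invented mini-corpus: which taxon each (hypothetical) document mentions
# alongside the phrase "populations have been declining"
corpus_mentions = [
    "orangutans", "orangutans", "orangutans", "orangutans",
    "tigers", "tigers", "tigers",
    "pangolins",
    "caecilians",
]


def most_likely_completion(mentions):
    """Return the most frequent item, mimicking a 'most probable word' choice."""
    counts = Counter(mentions)
    return counts.most_common(1)[0][0]


counts = Counter(corpus_mentions)
for taxon, count in counts.most_common():
    print(f"{taxon}: {count / len(corpus_mentions):.0%} of mentions")

print("Completion:", most_likely_completion(corpus_mentions))
```

The rarely mentioned taxa never win, even though nothing in the counts says they are less threatened; they are simply less written about.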

#### 🖌️ *Spatial bias*

Compare the model's ability to name common bird species in two different ecological contexts.
Use the function below to list three common species in the UK, then try it again for a location you suspect is less represented in global ecological literature.
Verify if the species provided are actually found in those locations.

In [None]:
def get_common_bird_species(location, n=3):
    """Ask the model for `n` common bird species found in `location`."""
    result = client.generate(
        model="gemma3",
        prompt=f"List {n} common bird species in {location}. Provide the list only, if possible using both common and scientific names.",
    )
    return result.response


print(get_common_bird_species("the UK"))

✅ See the solution below.

In [None]:
print(get_common_bird_species("Mexico"))

✅ While all UK species were correct in terms of commonality and scientific names, the response for Mexico contained several errors.

```
Here are 3 common bird species in Mexico:

*   Mexican Warbler (Setophaga phoenicea)
*   Blue Jay (Cyanocitta cristata)
*   House Finch (Haemorhous mexicanus)
```

The Blue Jay is not present in Mexico.
The Mexican Warbler is not a specific species name; there are multiple types of warblers in Mexico, and the scientific name provided (Setophaga phoenicea) is hallucinated.
Only the House Finch was correct, as it is common in Mexico and its scientific name is accurate.
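When checking many lists like this, it helps to pull the names out of the response first. The sketch below is one possible approach: a regular expression that assumes each species appears as `Common name (Scientific name)` on its own bullet line, as in the output above. Other response formats would need a different pattern, and the extracted names still have to be checked against a trusted source.

```python
import re

# Matches bullet lines such as "*   House Finch (Haemorhous mexicanus)"
SPECIES_PATTERN = re.compile(
    r"^\s*[*-]\s+(?P<common>[^()]+?)\s*\((?P<scientific>[^()]+)\)\s*$"
)


def extract_species(response_text):
    """Return (common name, scientific name) pairs found in a model response."""
    species = []
    for line in response_text.splitlines():
        match = SPECIES_PATTERN.match(line)
        if match:
            species.append((match["common"], match["scientific"]))
    return species


# The response shown above, copied as a string
response = """Here are 3 common bird species in Mexico:

*   Mexican Warbler (Setophaga phoenicea)
*   Blue Jay (Cyanocitta cristata)
*   House Finch (Haemorhous mexicanus)
"""

for common, scientific in extract_species(response):
    print(f"{common} -> {scientific}")
```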

## Writing style

Due to the inherent bias in LLMs, it is generally a **bad idea** to ask them to write a research piece from scratch.
Doing so often generates generic content that may not align with your specific intent or the nuances of your study.
It is much safer to provide the model with a skeleton or a draft of your ideas and use the tool to help you polish the language or structure.
While the outputs may look like high-quality academic writing, academic evaluation focuses on the strength of your ideas, narrative, and argumentation, areas where LLMs are not inherently strong.
Ultimately, a model will either produce generic information or build upon what you provide.
Therefore, it remains **your responsibility to provide the solid reasoning and verified facts** that form the core of the work.

When using LLMs to refine or generate text, it is important to account for their default writing style.
They are trained to be "helpful" and "polite," which often results in a particular tone that may be too wordy or formal for your needs.
You can steer the model by being specific about the desired tone and audience.

In [None]:
# A simple prompt often results in the model's 'default' academic style
prompt = """Refine my text:

LLMs are being increasingly applied in ecology.
They can be used to extract information from papers.
This makes large automated synthesis collection feasible.
It is very recent but already promising.
"""

result = client.generate(model="gemma3", prompt=prompt)

print(result.response)

#### 🖌️ *Guide the style*

Assume you want to write a blog post for first-year undergraduate students about LLMs in ecology.
Use the previous text as a starting point, but modify the prompt to steer the style so it communicates the ideas effectively for that specific audience.
Write your improved prompt below and compare the output to the previous one.

✅ See the prompt below.

In [None]:
prompt = """
I am writing a blog post for first-year undergraduate students about using Large Language Models (LLMs) in ecology.
Please refine the text below for this audience using these guidelines:
Assume the reader is a first-year student who may not be familiar with technical AI details or what ecological synthesis is.
Use a relaxed and approachable tone.
Keep the language simple and precise, and avoid over-enthusiastic wording.
Avoid technical breakdowns and favour high-level, intuitive explanations.
Follow the original narrative structure sentence by sentence.
You may expand the text slightly to clarify or explain concepts where needed.
Provide only the refined text.

Text to refine:

LLMs are being increasingly applied in ecology.
They can be used to extract information from papers.
This makes large automated synthesis collection feasible.
It is very recent but already promising.
"""

result = client.generate(model="gemma3", prompt=prompt, options={"seed": 42})

print(result.response)

✅ The difference in style is clear.
The model adopted a more conversational tone and provided additional context to explain why each statement is important.

*Note*: Explicitly defining the target audience helps the model move from its default style to a more helpful and tailored explanation.
While this specific prompt is longer than the text itself and takes time to craft, you can reuse it for other sections of the same blog post or for future projects.
We recommend building a personal prompt library to save time and promote consistency across your work.
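A prompt library can be as simple as a dictionary of templates with placeholders. The sketch below is a minimal illustration: the template name, the placeholders, and the helper function are hypothetical, and the wording is adapted from the prompt above.

```python
# Hypothetical prompt library: reusable templates with named placeholders
PROMPT_LIBRARY = {
    "refine_for_audience": (
        "I am writing {document_type} for {audience} about {topic}.\n"
        "Please refine the text below for this audience.\n"
        "Use a {tone} tone and keep the language simple and precise.\n"
        "Provide only the refined text.\n\n"
        "Text to refine:\n\n{text}"
    ),
}


def build_prompt(template_name, **fields):
    """Fill in a template from the library; raises KeyError if a field is missing."""
    return PROMPT_LIBRARY[template_name].format(**fields)


prompt = build_prompt(
    "refine_for_audience",
    document_type="a blog post",
    audience="first-year undergraduate students",
    topic="LLMs in ecology",
    tone="relaxed and approachable",
    text="LLMs are being increasingly applied in ecology.",
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument of `client.generate`, and new templates can be added to the dictionary as they prove useful.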

## Summary

In general, follow these basic recommendations when working with LLMs:

- **Always check the output**: Never assume the model is factually correct.
- **Be precise**: Give clear instructions on exactly how you want your output to look.
- **Use restrictions**: Add constraints to guide the model, such as "use a maximum of 100 words" or "format as a bulleted list."
- **Challenge bias**: Models are inherently biased toward their training data; do not rely on them as your sole source of information.
- **Account for randomness**: Output may vary from one run to the next.
- **Provide context**: The more information and background you provide, the more focused and relevant the outputs will be.

#### 🖌️ *Critique the prompt*

Have a look at the following prompt.
Based on what we have discussed regarding verbosity, bias, and precision, how could you make it better?

> "Tell me about climate change and birds."

✅ The prompt points to a relevant topic, but it is too vague.

In terms of precision, it only gives two broad concepts ("climate change" and "birds") and does not specify what information is needed.
A better prompt should define the relationship of interest, the bird group, and the scope.
For example, it could ask: "How do rising peak temperatures affect seabird populations in the UK?"

In terms of bias, a broad prompt makes the model fill in missing details on its own.
This can bias the answer toward patterns that are most common in the training data, including which climate effects and bird species are discussed.
The model also has a training cutoff (around 2023), so newer evidence may be missing.
Being more specific reduces this issue, but external sources may still be needed for recent or less well-covered information.

The prompt also does not specify output style, structure, or length.
As a result, the model may return a default response that is long and conversational.
This can be improved by adding clear constraints, such as: "Provide a concise executive summary of no more than 200 words."

An improved prompt could be:

> How do rising peak temperatures affect seabird populations in the UK? Provide a concise executive summary of no more than 200 words.

*Note:* This type of factual prompt should be used carefully unless the model can access external, validated data sources.
Without those tools, factual errors and hallucinations are more likely, so outputs should always be checked.

In [None]:
prompt = """
How do rising peak temperatures affect seabird populations in the UK?
Provide a concise executive summary of no more than 200 words.
Return the summary only, no filler conversation.
"""

result = client.generate(model="gemma3", prompt=prompt, options={"seed": 42})

print(result.response)

# General prompting techniques

## Zero-shot prompting

In **zero-shot prompting**, you provide the model with a direct instruction or question without any prior examples or demonstrations.
This is the "usual" approach to prompting.
You are relying on the model's pre-trained knowledge to understand and execute the instruction.

While simple, zero-shot prompting is highly effective for general tasks like basic summarisation or sentiment analysis.
However, as we saw previously, it is prone to hallucinations and formatting errors.
In a zero-shot context, the model performs significantly better when your prompt is very specific and rich in context.

In [None]:
# A typical zero-shot prompt: just an instruction
prompt = "Classify the following research description as 'Field Study' or 'Lab Experiment': We measured the growth rates of 50 oak seedlings in a temperature-controlled greenhouse."

result = client.generate(model="gemma3", prompt=prompt)
print(result.response)

## Role prompting

A **role prompt** provides the model with a specific persona.
By instructing the model to "Act as a senior research ecologist," you steer it away from the generic, conversational tone of a standard "chatbot".
While this doesn't give the model new knowledge it hasn't been trained on, it can significantly improve the depth and relevance of the information it chooses to present.
You are pre-conditioning the output to a particular style or scientific context.

In [None]:
# Observe how the tone and detail change when a role is assigned
task = "Explain why biodiversity loss matters."

# Standard instruction
prompt_simple = f"Answer this: {task}"

# Role-based instruction
prompt_role = f"You are a Professor of Conservation Biology. Provide a technically rigorous answer to this: {task}"

print("--- Simple Prompt ---")
print(client.generate(model="gemma3", prompt=prompt_simple).response)

print("--- Role Prompt ---")
print(client.generate(model="gemma3", prompt=prompt_role).response)

#### 🖌️ *Switching Personas*

Try changing the role to something very different, like "A science journalist writing for a local newspaper" or "A critical peer-reviewer".
Notice how the model adjusts its vocabulary and the types of evidence it emphasises.

✅ See answer below

In [None]:
role_1 = "A science journalist writing for a local newspaper"
prompt_1 = f"You are {role_1}. {task}"
print(f"---- {role_1} ----")
print(client.generate(model="gemma3", prompt=prompt_1).response)

role_2 = "A critical peer-reviewer"
prompt_2 = f"You are {role_2}. {task}"
print(f"---- {role_2} ----")
print(client.generate(model="gemma3", prompt=prompt_2).response)

✅ Changing the role produced clear differences in both style and focus.
The journalist focuses on communicating why biodiversity loss matters to a broad audience, while the peer-reviewer adopts a critical tone and gives feedback on an imaginary text that the model assumed it was reviewing.

## Few-shot prompting

LLMs are good at recognising patterns, and we can leverage that.
**Few-shot prompting** involves giving the model two or three examples of how you want a task done before asking it to perform the task itself.
Here is a simplified example:

In [None]:
# We provide examples of extracting a 'focal species' from a sentence
prompt = """
Extract the focal species from the text.

Text: We monitored the nesting habits of Chelonia mydas in Costa Rica.
Species: Chelonia mydas

Text: Camera traps were used to detect the presence of Panthera onca.
Species: Panthera onca

Text: Our study focus was the movement of migrating Passer domesticus across urban gradients.
Species: 
"""

result = client.generate(model="gemma3", prompt=prompt)
print(result.response)
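
When there are many examples, writing the few-shot prompt by hand becomes tedious. The sketch below builds the same `Text:`/`Species:` pattern from a list of example pairs. The helper name `build_few_shot_prompt` is a hypothetical one introduced for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format (text, species) example pairs followed by the unanswered query."""
    lines = [instruction, ""]
    for text, species in examples:
        lines += [f"Text: {text}", f"Species: {species}", ""]
    # The final entry is left open so the model completes the pattern
    lines += [f"Text: {query}", "Species:"]
    return "\n".join(lines)


examples = [
    ("We monitored the nesting habits of Chelonia mydas in Costa Rica.", "Chelonia mydas"),
    ("Camera traps were used to detect the presence of Panthera onca.", "Panthera onca"),
]

few_shot_prompt = build_few_shot_prompt(
    "Extract the focal species from the text.",
    examples,
    "Our study focus was the movement of migrating Passer domesticus across urban gradients.",
)
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to swap them or add more, and to check how sensitive the output is to the examples chosen.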

## Chain-of-thought prompting

For complex tasks, a helpful strategy is to ask the model to decompose the task into several steps and provide reasoning for each.

In [None]:
# Task: Identify if a study is 'Experimental' or 'Observational'

abstract = """
Chinese tallowtree, Triadica sebifera (L.) Small (Euphorbiaceae), is one of the worst invasive weeds of the southeastern USA impacting coastal wetlands, forests, and natural areas.
Traditional mechanical and chemical controls have been unable to limit the spread, and this invasive species continues to expand its range.
A proposed biological control candidate, the flea beetle Bikasha collaris (Baly) (Coleoptera: Chrysomelidae), shows high specificity for the target weed Chinese tallowtree.
Results from a series of no-choice and choice feeding tests of B. collaris adults and larvae indicated that this flea beetle was highly specific to Chinese tallowtree.
The larvae of B. collaris feed by tunneling in the roots, whereas the adults feed on the leaves of Chinese tallowtree.
A total of 77 plant taxa, primarily from members of the tallow plant family Euphorbiaceae, were tested in numerous test designs.
Larval no-choice tests indicated that larvae completed development only on two of the non-target taxa.
Of 80 B. collaris larvae fed roots of Hippomane mancinella L. and 50 larvae fed roots of Ricinus communis L., two and three larvae completed development, respectively.
The emerging adults of these five larvae died within 3 days without reproducing.
Larval choice tests also indicated little use of these non-target taxa.
Adult no-choice tests indicated little leaf damage by B. collaris on the non-targets except for Ditrysinia fruticosa (Bartram) Govaerts & Frodin and Gymnanthes lucida Sw. When given a choice, however, B. collaris adults consumed much less of the non-targets D. fruticosa (7.4%) and G. lucida (6.1%) compared with the control leaves.
Finally, no-choice oviposition tests indicated that no eggs were produced when adults were fed all non-target taxa, except those fed G. lucida.
These B. collaris adults fed G. lucida leaves produced an average of 4.6 eggs compared with 115.0 eggs per female when fed Chinese tallowtree.
The eggs produced from adults fed G. lucida were either inviable or the emerging larvae died within 1 day.
These results indicate that the flea beetle B. collaris was unable to complete its life cycle on any of the non-target taxa tested.
If approved for field release, B. collaris will be the first biological control agent deployed against Chinese tallow tree in the USA.
This flea beetle may play an important role in suppressing Chinese tallowtree and contribute to the integrated control of this invasive weed.
"""

prompt = f"""
Analyze the following study abstract. 
First, identify the methodology. 
Second, explain if any variables were manipulated by the researchers. 
Third, conclude if the study is 'Experimental' or 'Observational'.

Abstract: {abstract}
"""

result = client.generate(model="gemma3", prompt=prompt)
print(result.response)
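
A step-by-step answer is useful to read, but it is long, so the final verdict can be hard to pick out automatically. One possible approach, sketched below, is to ask the model to finish with a line starting with `Conclusion:` and then pull the label out with a regular expression. Both the `Conclusion:` convention and the helper `extract_final_label` are assumptions made for this illustration, and `example_response` is a hand-written stand-in, not real model output.

```python
import re


def extract_final_label(response_text, labels=("Experimental", "Observational")):
    """Return the label given on the last 'Conclusion:' line, or None if absent."""
    pattern = r"Conclusion:\s*\W*(" + "|".join(labels) + r")\b"
    matches = re.findall(pattern, response_text, flags=re.IGNORECASE)
    # Use the last match in case the word 'Conclusion:' appears more than once
    return matches[-1].capitalize() if matches else None


# Hand-written stand-in for a model response
example_response = """
Methodology: no-choice and choice feeding tests.
Manipulated variables: the plant taxa offered to the beetles.
Conclusion: 'Experimental'
"""

print(extract_final_label(example_response))
```

If the function returns `None`, the model did not follow the requested format, which is itself a useful signal that the output needs checking.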

# Advanced control over the generation


## Chat

To build a multi-turn conversation, we use the `chat` method.
Unlike the `generate` method, `chat` takes the full message history between the user and the LLM, which lets the model maintain context.

In a chat every message is assigned a specific role:
- System: Sets the persona or "rules" for the AI (e.g., "You are a helpful biologist").
- User: Your instructions or questions.
- Assistant: The model's previous responses.

The code below shows how to make a simple chat call, similar to the `generate` method.

In [None]:
response = ollama.chat(
    model="gemma3", messages=[{"role": "user", "content": "What is ollama?"}]
)

# Notice the output is a bit different. It has more metadata, but we can now access the
# answer in the `content` of the `message`.
print(response.message.content)

A conversation is essentially a list of these message objects.
To create a "memory" effect, you simply append new messages to the list and send the whole history back to the model.
This allows the assistant (LLM) to refer back to things said earlier in the conversation.

In [None]:
# To maintain memory, we keep the history in a list
messages = [
    {"role": "system", "content": "You are a concise science communicator."},
    {"role": "user", "content": "What is a trophic cascade?"},
    {
        "role": "assistant",
        "content": "A process where predators limit the density or behavior of their prey, benefiting the lower trophic levels.",
    },
    {"role": "user", "content": "Give me a classic example from North America."},
]

response = ollama.chat(model="gemma3", messages=messages)
print(response.message.content)
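
The "append and resend" pattern can be wrapped in a small helper. The sketch below uses a stand-in function `fake_reply` instead of a real model so that it runs offline. Both `add_turn` and `fake_reply` are hypothetical names introduced here. In the notebook, the stand-in could be swapped for a function that calls `ollama.chat` and returns `response.message.content`.

```python
def add_turn(messages, user_text, reply_fn):
    """Append a user message, get a reply for the full history, and store it too."""
    messages.append({"role": "user", "content": user_text})
    reply = reply_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_reply(messages):
    # Stand-in for the model so the sketch runs without Ollama.
    return f"(reply to message {len(messages)})"


history = [{"role": "system", "content": "You are a concise science communicator."}]
add_turn(history, "What is a trophic cascade?", fake_reply)
add_turn(history, "Give me a classic example from North America.", fake_reply)

for message in history:
    print(message["role"], "->", message["content"])
```

Because the whole list is sent on every turn, long conversations grow the input the model has to process each time.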

## Temperature & reproducibility

As we discussed in the "stochasticity" section, LLMs are probabilistic.
If you want the model to be more predictable, you can adjust the temperature and seed.

The **temperature** controls the "randomness" of the sampling:
- When `temperature = 0` the model always picks the most likely next word.
  This is best for data extraction and factual tasks.
- When `temperature > 1` the model is more likely to choose rare words and becomes more "creative" and diverse, which may be better for brainstorming or creative writing.
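
To build intuition for what temperature does, the sketch below applies the standard "softmax with temperature" formula to made-up scores for three candidate next words. It is a simplified illustration under stated assumptions: the scores are invented, the function name is hypothetical, and how Ollama combines temperature with its other sampling settings is not covered here.

```python
import math


def next_word_probabilities(scores, temperature):
    """Turn raw word scores into probabilities, rescaled by the temperature."""
    if temperature == 0:
        # Limit case: all probability goes to the highest-scoring word.
        best = max(scores, key=scores.get)
        return {word: float(word == best) for word in scores}
    scaled = {word: score / temperature for word, score in scores.items()}
    top = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {word: math.exp(value - top) for word, value in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


# Made-up scores for three candidate next words
scores = {"robin": 2.0, "blackbird": 1.0, "hoopoe": 0.0}

for t in (0, 0.7, 1.5):
    probs = next_word_probabilities(scores, t)
    print(t, {word: round(p, 2) for word, p in probs.items()})
```

As the temperature rises, the probabilities flatten, so the rare word ("hoopoe") is chosen more often.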

In [None]:

response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": "Name a common bird in the UK."}],
    options={
        "temperature": 0,
    },
)

print(response.message.content)

#### 🖌️ *Up the temperature*

Use the LLM to help you pick a title for a paper on the impact of microplastics on urban bumblebees.
Experiment with temperatures of 0, 0.7, 1.5, and above.

✅ See solution below

In [None]:
prompt = "Can you suggest 3 alternative titles for a scientific paper on the impact of microplastics on urban bumblebees? Be creative. List the titles only, no conversation filler."

print("--- Temperature = 0 ---")
response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0},
)
print(response.message.content)

print("--- Temperature = 0.7 ---")
response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.7},
)
print(response.message.content)

print("--- Temperature = 1.5 ---")
response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 1.5},
)
print(response.message.content)

A **seed** is a starting number for the random number generator.
With a fixed seed, the sampling process makes the same "random" choices every time.
This is the way to achieve reproducibility in your research.
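
The same idea can be seen with Python's built-in `random` module, independently of any LLM. This is only an analogy: the bird list and the helper `pick_birds` are made up for illustration.

```python
import random

birds = ["robin", "blackbird", "wren", "blue tit", "wood pigeon"]


def pick_birds(seed, n=3):
    """Draw n birds using a random number generator started from the given seed."""
    rng = random.Random(seed)
    return [rng.choice(birds) for _ in range(n)]


print(pick_birds(42))
print(pick_birds(42))  # same seed, same draws
print(pick_birds(7))   # a different seed will usually give different draws
```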

In [None]:

response = ollama.chat(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": "Pick one random British bird and describe it in 5 words.",
        }
    ],
    options={
        "temperature": 0.7,  # We keep temperature up to allow for variety
        "seed": 42,  # but the seed should keep the 'random' choice consistent.
    },
)

print(response.message.content)

💡 **Note**: While these tools help, true 100% determinism is difficult to achieve in AI due to how computer hardware (GPUs) handles floating-point math.
However, for most cases, `temperature=0` or a fixed seed is sufficient.


## Structured outputs

Often, you don't want a conversational response.
In metadata extraction, you need specific data points (like species names, sample sizes, or coordinates) that can be directly written into a spreadsheet or database.

This is such a common requirement that modern LLM libraries now provide ways to specify exactly how the output should be structured.
In Python, the gold standard for this is **Pydantic**.
Pydantic is a library used to define **data containers** (models).
Each data point in a Pydantic model has a specific "type" (e.g., integer, string, list) and can include a description to help the LLM understand what to look for.


In [None]:
from pydantic import BaseModel, Field


# Pydantic models are defined via "classes" in Python. No need to understand that for now, focus on the rest
class BirdObservation(BaseModel):
    # Each field has a name and a type. The common syntax is field_name: data_type
    species_name: str = Field(description="The scientific name of the bird")

    # Notice that fields can be integers too.
    count: int = Field(description="The number of individuals observed")

    # You can give each field a description, which helps the model find the relevant information
    location: str = Field(description="The specific site or park name")

    # Some fields can be optional too
    is_migratory: bool | None = Field(
        default=None, description="Whether the species is known to be migratory"
    )

The model defines a schema, which is a textual description of the structure of your data.

In [None]:
# If you're curious this is what the schema looks like. No need to understand it.
BirdObservation.model_json_schema()

When using `ollama`, we can pass this schema directly into the `chat` method using the `format` parameter.
This forces the model to ignore its usual conversational output and return only a JSON object that matches your Pydantic model.

💡 *A Note on JSON*: JSON (JavaScript Object Notation) is a lightweight text format for storing and transporting data.
[Learn more about JSON](https://www.json.org/json-en.html)
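
To make this concrete, here is a short sketch using Python's built-in `json` module. The JSON string is hand-written for illustration, using the field names from `BirdObservation`; it is not actual model output.

```python
import json

# A hand-written JSON string of the kind a model might return
json_text = '{"species_name": "Erithacus rubecula", "count": 3, "location": "Hyde Park", "is_migratory": null}'

record = json.loads(json_text)  # parse the text into a Python dictionary
print(record["species_name"], record["count"])
print(record["is_migratory"])  # JSON null becomes Python None

# Going the other way: turn a Python dictionary back into JSON text
print(json.dumps(record, indent=2))
```

Unlike Pydantic, `json.loads` only checks that the text is valid JSON. It does not check that the expected fields are present or have the right types.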

In [None]:
response = ollama.chat(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": "I saw three Erithacus rubecula at Hyde Park yesterday morning.",
        }
    ],
    format=BirdObservation.model_json_schema(),
    options={"temperature": 0},
)

# The output is now a clean JSON string
print(response.message.content)

# We can then turn that string back into a Python object for easy use
data = BirdObservation.model_validate_json(response.message.content)
print(f"Extracted Species: {data.species_name}")

#### 🖌️ *Metadata from a scientific abstract*

Consider the following text:

> We conducted a 2-year field study on the growth of 45 individual Quercus robur saplings.
> We monitored natural variations in soil moisture and leaf area index.
> No experimental treatments were applied.

Extract the focal species, the sample size and duration of the study using the previous approach.

✅ See solution below

In [None]:
text = """
We conducted a 2-year field study on the growth of 45 individual Quercus robur saplings.
We monitored natural variations in soil moisture and leaf area index.
No experimental treatments were applied.
"""


class AbstractMetadata(BaseModel):
    focal_species: str = Field(description="Name of the main study species.")

    sample_size: int = Field(description="Total number of sampled units reported for the focal species.")

    duration: int = Field(description="Total study duration in years as an integer.")

response = ollama.chat(
    model="gemma3",
    messages=[
        {
            "role": "user",
            "content": text,
        }
    ],
    format=AbstractMetadata.model_json_schema(),
    options={"temperature": 0},
)


abstract_data = AbstractMetadata.model_validate_json(response.message.content)
print(abstract_data)

## Tool usage

The final way to expand and control LLM outputs is through **tool usage**.
While relatively recent, this is one of the most effective ways to ground a model's response by injecting factual, real-time data.

The core idea is simple: an LLM can be given the ability to invoke external functions to enhance its capabilities.
Instead of relying solely on its static training data, the model can interact with the external world when it needs more information.

A common example is conducting searches of the web via a search engine.
During its reasoning, the LLM might think it is worth doing some extra research and so indicates that it wants to use a web search tool.
The LLM itself provides all the needed info on how to run that tool, for example what to search for, and the query is then executed.
The search engine itself is not an LLM; its results are fed back to the LLM so that it can continue its reasoning with the tool's output now injected into its context.
Other tools could include access to a calculator, a specific database like eBird, or any programmatic function you define.

An **agent** is simply a loop where the model can access tools until it has enough information to finish the task.
Here is an annotated implementation of that logic.

In [None]:
def run_agent_with_tools(prompt, tools, model="qwen3"):
    # Start the chat with the user prompt
    messages = [{"role": "user", "content": prompt}]

    # Loop until we invoke the "break" keyword
    while True:
        # Given the current state of the chat generate an answer
        response = client.chat(
            model=model,
            messages=messages,
            tools=list(tools.values()),  # Provide tools!
            think=True,
        )

        # Append the new response to the chat history
        messages.append(response.message)

        if response.message.thinking:
            print("Thinking: ", response.message.thinking)

        if response.message.content:
            print("Content: ", response.message.content)

        # Do this in case the agent wants to use a tool
        if response.message.tool_calls:
            print("Tool calls: ", response.message.tool_calls)

            # For each tool the agent wants to use
            for tool_call in response.message.tool_calls:
                # Get the tool by name
                function_to_call = tools.get(tool_call.function.name)

                if function_to_call:
                    # Get the arguments specified by the LLM
                    args = tool_call.function.arguments

                    # And use the tool
                    result = function_to_call(**args)
                    print("Result: ", str(result)[:200] + "...")

                    # Append to the chat history the results of the tool
                    messages.append(
                        {
                            "role": "tool",
                            "content": str(result)[: 2000 * 4],
                            "tool_name": tool_call.function.name,
                        }
                    )
                else:
                    # In case the tool wasn't found.
                    messages.append(
                        {
                            "role": "tool",
                            "content": f"Tool {tool_call.function.name} not found",
                            "tool_name": tool_call.function.name,
                        }
                    )
        else:
            # Break the agent loop if the model provides a final answer without more tool calls
            break
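
In the loop above, a tool is called directly with whatever arguments the model supplies, so a wrong or missing argument would raise an error and stop the agent. One possible variation, sketched below, catches the error and returns it as text so the model can read it and try again. The helper `call_tool` and the example tool `add_numbers` are hypothetical and not part of `ollama`.

```python
def call_tool(tools, name, arguments):
    """Run a tool by name and always return text the model can read."""
    function_to_call = tools.get(name)
    if function_to_call is None:
        return f"Tool {name} not found"
    try:
        return str(function_to_call(**arguments))
    except Exception as error:  # e.g. wrong or missing arguments
        return f"Tool {name} failed: {error}"


def add_numbers(a: int, b: int) -> int:
    """Hypothetical example tool that adds two integers."""
    return a + b


tools = {"add_numbers": add_numbers}

print(call_tool(tools, "add_numbers", {"a": 2, "b": 3}))  # valid call
print(call_tool(tools, "add_numbers", {"a": 2}))          # missing argument
print(call_tool(tools, "multiply", {"a": 2, "b": 3}))     # unknown tool
```

The returned string could then be appended to the chat history as a `tool` message, in the same way as a successful result.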

Here is an example using the built-in web_search and web_fetch tools.
Note that we are using `qwen3` here; not all models are trained to use tools, and `gemma3` currently is not.

You will also need an additional key to use the search functionality.
Please ask the instructor for this key.

In [None]:
OLLAMA_API_KEY = "<ask instructor for the key>"

client = ollama.Client(
    host="http://localhost:11434", headers={"Authorization": f"Bearer {OLLAMA_API_KEY}"}
)

available_tools = {"web_search": client.web_search, "web_fetch": client.web_fetch}

run_agent_with_tools(
    prompt="What is the main goal of the BIOS0032 module at UCL?",
    tools=available_tools,
    model="qwen3",
)

You can also create custom tools.
For that, you write a standard Python function and then provide a precise description so the model understands when and how to use it.

For an agent to use a tool effectively, it needs to know what the tool does and what arguments it expects.
This is done by providing docstrings, the special strings defined with three quotes (""") immediately below the function definition.

As with any prompting, the more precise you are in your description, the better the agent will perform.
Following the [Google docstring style guide](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods) is highly recommended to make sure the model interprets your description correctly.

Here's an example:

In [None]:
def get_ebird_observations(region_code: str, days_back: int = 14):
    """
    Fetches recent bird observations for a specific region from eBird.
    Args:
        region_code: The subnational code (e.g., 'US-NY' for New York).
        days_back: Number of days to look back for records (1-30).
    """
    # This is a mock tool; in a real scenario, you would query an API or database.
    return f"Found 12 observations in {region_code} over the last {days_back} days."

Once the function is defined, you can include it in the tools dictionary when running your agent.
The model will analyse your prompt, realise it needs bird observation data, and hopefully invoke `get_ebird_observations` with the correct `region_code`.
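
To get a feel for what a model has to work with, you can inspect the function's name, signature and docstring using Python's built-in `inspect` module. This sketch only shows the information that is available. Exactly how the `ollama` library turns a function into a tool description is not shown here. The mock function is repeated so that the example runs on its own.

```python
import inspect


def get_ebird_observations(region_code: str, days_back: int = 14):
    """
    Fetches recent bird observations for a specific region from eBird.

    Args:
        region_code: The subnational code (e.g., 'US-NY' for New York).
        days_back: Number of days to look back for records (1-30).
    """
    # Mock tool: returns a fixed message instead of querying eBird
    return f"Found 12 observations in {region_code} over the last {days_back} days."


print(get_ebird_observations.__name__)             # the tool's name
print(inspect.signature(get_ebird_observations))   # argument names, types and defaults
print(inspect.getdoc(get_ebird_observations))      # the cleaned-up docstring
```

If any of these three pieces is vague or missing, the model has less to go on when deciding whether and how to call the tool.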

In [None]:
run_agent_with_tools(
    prompt="What birds were seen in US-NY lately?",
    tools={"get_ebird_observations": get_ebird_observations},
)

#### 🖌️ *Get the current date and time*

First, ask `qwen3` to give you today's date.
Since LLMs have a "knowledge cutoff" and aren't naturally aware of the passing of time, it will likely struggle to get it right.
If it doesn't get it right, implement a tool that gives the current date and time to ground the model in the present.
Hint: You can use the `datetime.datetime.now()` function from the built-in `datetime` library.

✅ See solution below

In [None]:
import datetime

def get_current_date():
    """Fetch current date and time."""
    return datetime.datetime.now()

# Run without tools
run_agent_with_tools(
    prompt="What day is today?",
    tools={},
)

# Run with new date tool!
run_agent_with_tools(
    prompt="What day is today?",
    tools={"get_current_date": get_current_date},
)

# Extracting info from abstracts

In this section we will attempt to replicate sections of [Scheepens et al. 2024](https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210x.14341).

> Scheepens, D., Millard, J., Farrell, M., & Newbold, T. (2024). Large language models help facilitate the automated synthesis of information on potential pest controllers. Methods in Ecology and Evolution, 15(7), 1261-1273.

First, let's have a look at a set of 27 abstracts taken from the test set of that paper.

In [None]:
import pandas as pd

abstracts = pd.read_csv("sample_abstracts.csv")

These are a small subset of all papers that may be relevant to the research question Daan has.
Here is the full Scopus query used.

> TITLE-ABS-KEY("pest control" OR "biological control" OR "pest management" OR "natural enem*") AND (LIMIT-TO(DOCTYPE, "ar")) AND (LIMIT-TO(SUBJAREA, "AGRI") OR LIMIT-TO(SUBJAREA, "ENVI")) AND (LIMIT-TO(LANGUAGE, "English"))

Let's take a look at a single random abstract:

In [None]:
print(abstracts.sample(n=1).iloc[0]["Abstract"])

For this section, let's focus on getting the list of species mentioned, categorised as either herbivores or natural enemies.
Here's the data schema Daan used:


In [None]:
from pydantic import BaseModel, Field


class SpeciesList(BaseModel):
    herbivores: list[str] = Field(
        description=(
            "List of all herbivorous (i.e., phytophagous) species mentioned in the text. "
            "Only list herbivorous species that are mentioned at the genus or species level (latin binomials)."
        ),
    )
    natural_enemies: list[str] = Field(
        description=(
            "List of all natural enemies (e.g., predators or parasitoids) of herbivores mentioned in the text. "
            "Only list natural enemies that are mentioned at the genus or species level (latin binomials). "
            "Ignore species that are solely mentioned as hyperparasitoids or solely as predators of other natural enemies."
        ),
    )

Usually, you would select a subset of your data for "training." In the context of LLMs, this doesn't mean retraining the model's weights, but rather fine-tuning your prompts and examples to get the best possible performance on that specific dataset.
This is an iterative process that often feels like a mix of science and art.
You might try different system personas, add few-shot examples, or adjust the Pydantic descriptions until the output matches your requirements.
We are lucky Daan has already done that work, providing the optimised prompts and schemas.

#### 🖌️ *Analyse the prompt*

Have a look at Daan's prompts.
Can you list all the prompting techniques he used?

Here is the system prompt

In [None]:
system_prompt = """
You are an expert ecologist and taxonomist and you are tasked with extracting information on herbivorous species and their natural enemies in scientific publications.
You are to carefully follow the instructions given to you and are to extract information only from the text provided to you.

Reasoning: high
"""

and here is the user prompt.

In [None]:
user_prompt_simple = """
Based on this abstract, you are tasked with identifying any herbivorous species and their natural enemies. Let's think through the following tasks step by step.


# Herbivorous species

Your first task is to determine if there are any mentions of herbivorous (i.e., phytophagous) species in the text.
- If there are mentions of pests, you are looking only for species that are pests due to their herbivory, so ignore livestock pests such as flesh flies, vectors of human diseases, or any pest that isn't related directly to herbivory, such as fungi, bacteria or viruses.
- Ideally, the species is mentioned in association with a particular industry (e.g. field crops, greenhouse production, pasture, fruit plantations, forestry and timber production, stored grains, ornamental and horticultural plants, etc.) or affect/host crop or plant (e.g. corn, apple, pine trees, stored wheat, etc.).
- Herbivory does not have to be stated explicitly: If the pest belongs to an order or family of known herbivores, then herbivory may be inferred.
- You are only interested in herbivores that are mentioned at the genus or species level (latin binomials): Ignore herbivores that are only mentioned at the family or order level, or that are only mentioned with their common name (e.g. spiders, mice, aphids, etc.). 

If the text mentions any herbivorous species at the genus or species level, list these. Only list herbivores from the main abstract - don't list species that are only mentioned in the keywords.


# Natural enemies

Your second task is to determine if there are any natural enemies (e.g., predators and parasitoids) of herbivores.
- Natural enemies may be biological control agents of a herbivorous pest, but any mention or implication of predation, parasitism, or otherwise preying on, attacking or regulating a herbivore suffices as evidence for being a natural enemy of this herbivore.
- Look out for mentions of consumption: If a species is found to consume a herbivorous species, this is evidence for the species being a natural enemy of the herbivore. 
- Only list natural enemies of herbivorous species: Ignore species that are solely mentioned as hyperparasitoids or predators that predate on other predators. 
- You are only interested in natural enemies that are mentioned at the genus or species level (latin binomials or genus only): Ignore natural enemies that are only mentioned at the family or order level, or that are only mentioned with their common name (e.g. spiders, mice, aphids, etc.). 

If the text mentions any natural enemies at the genus or species level, list these. Only list natural enemies from the main abstract - don't list species that are only mentioned in the keywords. 


# Important

- Ignore any species that are plants, bacteria, fungi or viruses.
- Only list species that are mentioned at the genus or species level. 
- Only list species from the main abstract - don't list species that are only mentioned in the keywords.
- If the text mentions multiple synonyms of a species, only list one (ideally the more common/newest one).
"""

✅ Daan's setup combines several prompting techniques.
The system prompt uses role prompting by assigning the model the role of an expert ecologist and taxonomist.
It is also zero-shot prompting, since no worked examples are provided.
The user prompt applies chain-of-thought style guidance by asking the model to think step by step and by splitting the task into two structured sections (herbivores and natural enemies) with clear criteria.
It also uses strong instruction constraints, i.e. what to include, what to exclude, output granularity, and where species must appear.

Now we can combine everything we've learned (role prompting, structured outputs, and reproducibility settings) into a single function.

This function takes a row from our dataset, bundles the title and keywords for context, and asks the model to extract the species.
Note how we use `temperature=0` and a `seed` to ensure the results are consistent every time we run the code.
It uses the data schema, as well as the prompts.

In [None]:
def extract_species_list(row):
    # Gather all available information for the paper
    title = row["Title"]
    abstract_text = row["Abstract"]
    author_keys = row["Author Keywords"]
    index_keys = row["Index Keywords"]
    paper_info = (
        f"{title}. Abstract: {abstract_text}"
        f"\nAuthor keywords: {author_keys}"
        f"\nIndex keywords: {index_keys}"
    )

    # Assemble the final prompt
    prompt = f"""
    Here are a title, abstract and keywords: {paper_info}. 
    {user_prompt_simple}
    Explain your reasoning.
    """

    # Define the conversation with the system role
    chat_history = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]

    response = client.chat(
        model="gemma3",
        messages=chat_history,
        format=SpeciesList.model_json_schema(),  # use the schema!
        options={
            "temperature": 0,  # Notice temperature = 0
            "seed": 42,  # and seed is set
            "think": True,
        },
    )

    # Convert the JSON string response back into a Python object
    return SpeciesList.model_validate_json(response.message.content)

You can test this on a single random abstract from your collection to see how well it performs before running it on the full set.

In [None]:
# Test on a single random row
row = abstracts.sample(n=1).iloc[0]
print(f"Abstract: {row['Abstract']}")

# Run our extraction pipeline
species_list = extract_species_list(row)
print(species_list)

#### 🖌️ *Compare against ground truth.*

Load the `test_set.xlsx` file.
It contains the human annotations for every abstract in the test set.
All 27 abstracts of our subset are included in it.
You can identify them by their "EID" (electronic identifier).
Extract the species from every abstract in the subset and compare them to the ground truth.
How many did we get right?

✅ See solution below

In [None]:
# Load the 'ground truth' test data from an Excel file
test_set = pd.read_excel("test_set.xlsx")

# Filter the main 'abstracts' DataFrame to only keep rows with an EID that is present in our test set.
# The .isin() method checks if each EID in abstracts exists in the test_set's EID column.
test_abstracts = abstracts[abstracts["EID"].isin(test_set["EID"])]


# Helper function to combine separate Genus and Species columns
# into a single string representing the full binomial scientific name.
def combine_binomial(row):
    genus = row["Genus"]
    species = row["Species"]
    return f"{genus} {species}"


# Initialize counters for evaluation metrics
total_correct = 0
total_species = 0
total_predictions = 0

# Loop through each abstract row
for index, row in test_abstracts.iterrows():
    eid = row["EID"]

    # Extract predicted species using our custom function
    species_list = extract_species_list(row)

    # Combine herbivores and natural enemies into one list.
    # Using set() removes duplicates; sorted() sorts them alphabetically.
    predicted_species = sorted(set(species_list.herbivores + species_list.natural_enemies))

    # Get the ground truth
    # Filter the test set for the current abstract's EID
    subset = test_set[test_set["EID"] == eid]

    # Apply combine_binomial row-by-row to get scientific names
    species_names = subset.apply(combine_binomial, axis=1)

    # Remove duplicates and sort them alphabetically
    ground_truth_species = sorted(set(species_names))

    # Update Counters
    total_species += len(ground_truth_species)
    total_predictions += len(predicted_species)

    # Evaluate each predicted species
    for pred_species in predicted_species:
        # Count a prediction true if it appears in the ground truth list
        if pred_species in ground_truth_species:
            total_correct += 1

# Print a summary of the extraction performance
print(f"Summary: The model extracted {total_predictions} species ({total_correct} correct). Actual species count: {total_species}.")
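
The three counters in the summary can also be expressed as precision (`total_correct / total_predictions`) and recall (`total_correct / total_species`). The sketch below computes both, plus the F1 score, for one pair of species lists. The helper names `normalise` and `score` are hypothetical, and the name normalisation is an added assumption: the solution above uses exact string matching, so this version may count slightly more matches. The species lists are made up for illustration.

```python
def normalise(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def score(predicted, truth):
    """Return (precision, recall, f1) for two collections of species names."""
    pred = {normalise(n) for n in predicted}
    gold = {normalise(n) for n in truth}
    correct = len(pred & gold)  # names present in both sets

    # Guard against division by zero when either set is empty
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Made-up example: one spurious prediction and one formatting difference
demo_predicted = ["Aphis gossypii", "coccinella  septempunctata", "Myzus persicae"]
demo_truth = ["Aphis gossypii", "Coccinella septempunctata"]

precision, recall, f1 = score(demo_predicted, demo_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=1.00 f1=0.80
```

Precision drops when the model invents or over-reports species, while recall drops when it misses annotated ones, so reporting both says more than a single count of correct predictions.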