# 🔍 Using LLMs for Scientific Information Extraction

In this notebook, we focus on making calls to a Large Language Model (LLM) using [LiteLLM](https://github.com/BerriAI/litellm), a lightweight abstraction layer for various LLM providers (e.g. OpenAI, Anthropic, Azure).

We define a simple function that takes a **system prompt** and a **user prompt** ([explanation of user vs. system prompt](https://chatgptnavigator.com/chatgpt-system-prompt-vs-user-prompt)) as parameters and returns the model's response.

This allows us to flexibly test different prompt formulations and system instructions – a key step in the design of robust extraction workflows.


We will use [Groq](https://console.groq.com/home) for the LLM call. You can get a free API key from the [Groq console](https://console.groq.com/home) after logging in.

In [None]:
import os
os.environ["GROQ_API_KEY"] = "___" # TODO: add your groq api key here

In [None]:
from litellm import completion

def call_llm(system_prompt, user_prompt, model="groq/llama-3.3-70b-versatile"):
    """
    Sends a prompt to a Groq-hosted LLM via LiteLLM and returns the response.

    Parameters:
        system_prompt (str): The system message (sets model behavior).
        user_prompt (str): The actual user query or input.
        model (str): The LiteLLM model name to use (default: Llama 3.3 70B Versatile hosted on Groq).

    Returns:
        str: The text response from the model.
    """
    response = completion(
        model=model, 
        messages=[
            {"role": "system", "content": ___},  # TODO: insert system prompt
            {"role": "user", "content": ___}     # TODO: insert user prompt
        ]
    )
    return response['choices'][0]['message']['content']
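
The return statement above indexes into an OpenAI-style response layout. As a minimal offline sketch of that indexing, the cell below uses a hand-written dictionary in place of a real response. The dictionary contents and the helper name `extract_text` are assumptions made for illustration, not output from Groq or part of LiteLLM.

```python
# Hand-written stand-in for a chat completion response (illustrative only).
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "polystyrene, PMMA"}}
    ]
}

def extract_text(response):
    """Return the text of the first choice, mirroring the return line in call_llm."""
    return response["choices"][0]["message"]["content"]

print(extract_text(mock_response))
```

Only the first entry of `choices` is read, which is sufficient when a single completion is requested.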

## 📄 Running the LLM on a Text Document

Now that we have our cleaned and chunked text files, we will try a first simple LLM call.

We load one text file and prompt the model to extract all polymer names mentioned in it.

This serves as a first test of the LLM’s capabilities and helps us evaluate the basic prompt structure.

In [None]:
import os

# Load the .txt file
file_path = "static/example_paper.txt"
with open(file_path, "r", encoding="utf-8") as f:
    document_text = f.read()
    
print(document_text[:500])  # Preview first 500 characters

In [None]:
# Define a basic system and user prompt
system_prompt = "You are an expert in polymer science. Extract relevant scientific information."
user_prompt = f"Extract all polymer names from the following text:\n\n{document_text}"

# Call the model
response = call_llm(___, ___) # TODO: pass the system prompt and the user prompt to call_llm

print(response)

The wording of the system and user prompts can change the extraction results significantly. Try different combinations of system and user prompts and compare the responses to see which formulation works best.
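
One way to compare prompt formulations systematically is to loop over every pairing. The cell below is a rough sketch under stated assumptions: the alternative prompt texts, the helper `run_prompt_grid`, and the offline stand-in `fake_llm` are all hypothetical and exist only so the loop can run without an API key.

```python
from itertools import product

# Candidate prompts; the second entry in each list is an invented alternative.
system_prompts = [
    "You are an expert in polymer science. Extract relevant scientific information.",
    "You are a chemistry assistant. Return only polymer names, one per line.",
]
user_templates = [
    "Extract all polymer names from the following text:\n\n{text}",
    "List every polymer mentioned in the text below. If there are none, answer 'none'.\n\n{text}",
]

def run_prompt_grid(text, llm):
    """Call llm(system_prompt, user_prompt) once for every prompt combination."""
    results = []
    for system_prompt, template in product(system_prompts, user_templates):
        user_prompt = template.format(text=text)
        results.append({
            "system_prompt": system_prompt,
            "user_template": template,
            "response": llm(system_prompt, user_prompt),
        })
    return results

def fake_llm(system_prompt, user_prompt):
    """Offline stand-in for call_llm; it does not contact any model."""
    return f"(stub) user prompt length: {len(user_prompt)}"

results = run_prompt_grid("Polystyrene was blended with PMMA.", fake_llm)
for result in results:
    print(result["system_prompt"][:40], "|", result["response"])
```

Once `call_llm` is completed, passing it in place of `fake_llm` would run the same grid against the real model. Each combination costs one API call.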

## ✳️ One-Shot Prompting

Instead of only giving the model an instruction, we can include **a single example** in the prompt to help guide its output.

This technique is called **one-shot prompting** and can improve the quality and consistency of the model's responses.

Below, we provide the model with a sample input-output pair and then ask it to perform the same extraction on a new text.


In [None]:
# User prompt that includes one-shot example inline
user_prompt = f"""
Here is an example of what I want:

Input:  # TODO: add a short example text that contains at least one polymer name


Output: # TODO: add the desired output to your example text 


---

Now do the same for this input:

Input:
{document_text}

Output:
"""

system_prompt = "You are a chemistry assistant. Extract all polymer names from the given text."

# Run the LLM
response = call_llm(system_prompt, user_prompt)

print(response)
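
A consistent output format makes the response easier to process downstream. The cell below is a minimal sketch that assumes the model returns one polymer name per line, optionally with bullet or number prefixes. The helper `parse_polymer_names` and the example response string are hypothetical, and real model output may need different handling.

```python
import re

def parse_polymer_names(response_text):
    """Split a one-name-per-line response into a de-duplicated list of names."""
    names = []
    # Split on newlines only: names such as poly(3,4-ethylenedioxythiophene)
    # contain commas, so splitting on commas would break them apart.
    for line in response_text.splitlines():
        # Drop a leading bullet ("-", "*", "•") or list number ("1.", "2)").
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        if cleaned and cleaned not in names:
            names.append(cleaned)
    return names

# Invented response text, used here instead of a real model reply.
example_response = "- polystyrene\n- poly(3,4-ethylenedioxythiophene)\n- polystyrene"
print(parse_polymer_names(example_response))
```

If the example in the one-shot prompt shows the output as one name per line, the model is more likely to follow that layout, which is what this parser relies on.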

## 🔭 Beyond Basic LLM Calls

This notebook demonstrated the basic workflow of using a Large Language Model to extract structured information from scientific text.

However, LLMs can be integrated into **more advanced workflows** that go far beyond single-prompt extraction:

- 🤖 **Agent systems** that combine multiple reasoning steps, tools, and memory to guide extraction over many documents.
- 🧪 **Vision–Language Models (VLMs)** that can process both text and images (e.g. extract data from figures, tables, or chemical diagrams).

You can explore more examples, notebooks, and real-world use cases in the **[Matextract project](https://matextract.pub)**.
