# Preparing annotation examples

## Step 1: Load Example Data from the STI Unlabelled Dataset

We begin by selecting a sample of research abstracts from the STI (Science, Technology, and Innovation) unlabelled dataset. This data will serve as the input for our information extraction pipeline.


In [15]:
import pandas as pd
from datasets import load_dataset

# load dataset 
ds_dict = load_dataset("SIRIS-Lab/unlabelled-sti-corpus")

# get train split
ds_train = ds_dict["train"]

# draw 30 random examples (fixed seed for reproducibility)
df_sampled_examples = ds_train.shuffle(seed=42).select(range(30)).to_pandas()

display(df_sampled_examples.head(4))

Unnamed: 0,id,title,abstract,type
0,220752,Enhancing innovation management capacities of ...,Enterprise Europe Network is a main support in...,project/european
1,W2124685859,Mutual cooperative effects between single- and...,Two types of resonance between the spontaneous...,publication
2,US 2012/0048877 W,SYSTEM AND METHOD FOR PERFORMING WELLBORE FRAC...,Methods for performing oilfield operations are...,patent
3,W2600571758,Sales Tax Competition among State–Local Govern...,ABSTRACTThis article estimates tax reaction fu...,publication
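
The corpus mixes projects, publications, and patents, so it can be worth checking how the sample is spread across `type` before annotating. A minimal sketch, using a toy list in place of the real column (the helper name `type_counts` and the toy values are assumptions for illustration; in the notebook you would pass `df_sampled_examples["type"]`):

```python
from collections import Counter

def type_counts(types):
    """Count how many sampled documents fall under each document type."""
    return Counter(types)

# Toy stand-in for df_sampled_examples["type"]; the values are illustrative only.
toy_types = ["publication", "patent", "publication", "project/european"]
counts = type_counts(toy_types)
print(counts.most_common())
```

If one type dominates the sample, a different seed or a per-type sample may give annotators a more balanced set.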


## Step 2: Annotate Each Example Using Three LLMs

For each title and abstract, we generate key information extractions using **three different large language models (LLMs)**: Llama 3.1 405B (via Together AI), GPT-4 (OpenAI), and Claude Opus 4 (Anthropic). Each model produces its own structured output for the defined dimensions (motivations, objectives, methods, results, and research topic). This gives annotators several candidates to choose from and lets us compare output quality across models.


In [2]:
prompt_base = '''Given the title and abstract of a research proposal, extract and summarize the following information in a clear, structured, and programmatically friendly JSON format:

- **Motivations:**  
  Provide a list of 1 to 3 clear and concise sentences summarizing the main motivations or problems addressed by the research. Each motivation should be independent and not reference others. Avoid redundancy—if motivations are similar, combine or rephrase for clarity.

- **Objectives:**  
  Provide a list of 1 to 3 clear and concise sentences summarizing the main objectives of the research. Each objective should be independent and not reference others. If two objectives are highly similar, combine them into a single sentence to avoid redundancy.

- **Methods:**  
  Provide a list of 1 to 3 clear and concise sentences summarizing the main methods, techniques, or approaches. Omit specific details such as sample sizes, participant numbers, and timeframes.

- **Results:**  
  Provide a list of 1 to 3 clear and concise sentences summarizing the main results, expected results or impact. Omit details such as sample sizes, participant numbers, and timeframes.

- **Research Topic:**  
  Summarize the central research topic as a single, concise phrase or sentence.

**Formatting instructions:**  
Return only the output as valid JSON, following this structure (with no additional explanation or text):

```json
{
  "motivations": ["First motivation here", "Second motivation here", ...],
  "objectives": ["First research objective here", "Second research objective here", ...],
  "methods": ["First research method here", "Second research method here", ...],
  "results": ["First result/impact here", "Second result/impact here", ...],
  "research_topic": "Specific research topic here"
}
```

***Return only the JSON content, and nothing else.***

-----------------
Input:

Title: 
Democratising and making sense out of heterogeneous scholarly content

Abstract: 
SciLake's mission is to build upon the OpenAIRE ecosystem and EOSC services to (a) facilitate and empower the creation, interlinking and maintenance of Scientific/Scholarly Knowledge Graphs (SKGs) and the execution of data science and graph mining queries on top of them, (b) contribute to the democratization of scholarly content and the related added value services implementing a community-driven management approach, and (c) offer advanced, AI-assisted services that exploit customised perspectives of scientific merit to assist the navigation of the vast scientific knowledge space. In brief, SciLake will develop, support, and offer customisable services to the research community following a two-tier service architecture. First, it will offer a comprehensive, open, transparent, and customisable scientific data-lake-as-a-service (service tier 1), empowering and facilitating the creation, interlinking, and maintenance of SKGs both across and within different scientific disciplines. On top of that, it will build and offer a tier of customisable, AI-assisted services that facilitate the navigation of scholarly content following a scientific merit-driven approach (tier 2), focusing on two merit aspects which are crucial for the research community at large: impact and reproducibility. The services in both tiers will leverage advanced AI techniques (text and graph mining) that are going to exploit and extend existing technologies provided by SciLake's technology partners. Finally, to showcase the value of the provided services and their capability to address current and anticipated needs of different research communities, four scientific domains (neuroscience, cancer research, transportation, and energy) have been selected to serve as pilots. For each, the developed services will be customised, to accommodate differences in research procedures, practices, impact measures and types of research objects, and will be validated and evaluated through real-world use cases.

Output:

```json
{
  "motivations": [
    "Researchers face challenges in navigating, interlinking, and maintaining heterogeneous scholarly content, which limits the accessibility and usability of scientific knowledge.",
    "There is a need for democratized access and advanced tools to analyze and evaluate scholarly content based on scientific merit such as impact and reproducibility."
  ],
  "objectives": [
    "Facilitate the creation, interlinking, and maintenance of Scientific Knowledge Graphs across and within various scientific disciplines.",
    "Develop and provide AI-assisted, customizable services to support navigation and analysis of scholarly content based on scientific merit.",
    "Promote the democratization of scholarly content through a community-driven management approach."
  ],
  "methods": [
    "Implement a scientific data-lake-as-a-service to support the creation and maintenance of knowledge graphs.",
    "Leverage advanced AI techniques, including text and graph mining, to enable enhanced navigation and analysis of scholarly content.",
    "Customize and validate the developed services in pilot domains such as neuroscience, cancer research, transportation, and energy."
  ],
  "results": [
    "Empower researchers with effective tools to navigate, analyze, and utilize scholarly content.",
    "Enhance the reproducibility and impact assessment of scientific research.",
    "Support diverse research communities through tailored AI-driven services and promote wider access to scientific knowledge."
  ],
  "research_topic": "Democratization and advanced analysis of heterogeneous scholarly content using AI and knowledge graphs"
}
```
'''
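
The prompt above fixes a small schema: four list-valued keys with 1 to 3 sentences each, plus a single `research_topic` string. A minimal sketch of a check that could be run on each parsed model output before it is shown to annotators (the function name `validate_extraction` and the wording of the messages are assumptions, not part of the original pipeline):

```python
from typing import Any

LIST_KEYS = ("motivations", "objectives", "methods", "results")
TOPIC_KEY = "research_topic"

def validate_extraction(data: Any) -> list[str]:
    """Return a list of problems found in one model output (empty if valid)."""
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for key in LIST_KEYS:
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(s, str) for s in value):
            problems.append(f"{key}: expected a list of strings")
        elif not 1 <= len(value) <= 3:
            problems.append(f"{key}: expected 1 to 3 items, got {len(value)}")
    topic = data.get(TOPIC_KEY)
    if not isinstance(topic, str) or not topic.strip():
        problems.append(f"{TOPIC_KEY}: expected a non-empty string")
    return problems

example = {
    "motivations": ["m"], "objectives": ["o"], "methods": ["x"],
    "results": ["r"], "research_topic": "topic",
}
print(validate_extraction(example))                     # no problems
print(validate_extraction(dict(example, methods=[])))   # flags the empty list
```

Returning a list of problems rather than raising makes it easy to log every issue for a record and still keep the raw output for inspection.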

In [None]:
import os
import re
import json
from tqdm.notebook import tqdm
import openai
from together import Together
import anthropic
from dotenv import load_dotenv

# Read the API keys (TOGETHER_AI, OPENAI_API_KEY, ANTHROPIC_API_KEY) from a local .env file
load_dotenv()

# Directory to store interim results, one JSON file per document
interim_dir = "../data/interim/pseodoannotation"
os.makedirs(interim_dir, exist_ok=True)

# Fail early with a KeyError if any key is missing
llama_client = Together(api_key=os.environ["TOGETHER_AI"])
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

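def extract_json(text):
    """Parse the model's JSON answer, whether or not it is wrapped in a code fence."""
    json_match = re.search(r'```(?:json)?(.*?)```', text, re.DOTALL)
    # Fall back to the raw text: the prompt asks for JSON only, so some models omit the fence
    json_str = (json_match.group(1) if json_match else text).strip()
    try:
        return json.loads(json_str)
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)
    return None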

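for i, row in tqdm(df_sampled_examples.iterrows(), total=len(df_sampled_examples)):
    doc_id = str(row['id'])
    # Patent ids such as "US 2012/0048877 W" contain "/" and spaces, which are unsafe in file names
    safe_id = re.sub(r'[^\w.-]+', '_', doc_id)
    interim_path = os.path.join(interim_dir, f"{safe_id}.json")
    if os.path.exists(interim_path):
        # Skip if this example is already processed
        continue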

    title = row['title']
    abstract = row['abstract'].replace('\n', ' ').replace('  ', ' ')
    prompt = f"""{prompt_base}
-----------------
Title:
{title}

Abstract:
{abstract}"""

    # --- Llama call ---
    llama_response = llama_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[{"role": "system", "content": prompt}],
        temperature=0.1,
        top_p=0.1,
        top_k=50,
        repetition_penalty=1,
        stop=["<|eot_id|>", "<|eom_id|>"],
        stream=False
    )
    llama_output = llama_response.choices[0].message.content

    # --- OpenAI call ---
    openai_response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful academic data assistant that extracts information from research documents."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.1,
        max_tokens=1500,
    )
    openai_output = openai_response.choices[0].message.content.strip()

    # --- Anthropic (Claude) call ---
    anthropic_response = anthropic_client.messages.create(
        model="claude-opus-4-20250514",  # or another Claude model if you prefer
        max_tokens=1500,
        temperature=0.1,
        system="You are a helpful academic data assistant that extracts information from research documents.",
        messages=[{"role": "user", "content": prompt}]
    )
    # The Messages API returns a list of content blocks; take the text of the first one
    anthropic_output = anthropic_response.content[0].text if anthropic_response.content else ""

    # Extract JSON outputs
    llama_json = extract_json(llama_output)
    openai_json = extract_json(openai_output)
    anthropic_json = extract_json(anthropic_output)

    result = {
        "index": i,
        "title": title,
        "abstract": abstract,
        "llama_json": llama_json,
        "openai_json": openai_json,
        "anthropic_json": anthropic_json,
        "llama_raw": llama_output,
        "openai_raw": openai_output,
        "anthropic_raw": anthropic_output,
    }

    # Save each result as a JSON file (UTF-8, because ensure_ascii=False keeps non-ASCII characters)
    with open(interim_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
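
The loop above makes three remote API calls per document, and a single transient error (rate limit, timeout) stops the whole run. Results are cached per document, so rerunning the cell resumes where it stopped, but a small retry wrapper can avoid the interruption altogether. A minimal sketch with exponential backoff; the helper `call_with_retries` and its parameters are assumptions and not part of any of the client libraries:

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a function that fails twice before succeeding.
# Delays are recorded in a list instead of actually sleeping.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []
outcome = call_with_retries(flaky, max_attempts=3, base_delay=1.0, sleep=waits.append)
print(outcome, waits)
```

In the loop, each client call could then be wrapped in a lambda, for example `call_with_retries(lambda: openai_client.chat.completions.create(...))`.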


In [20]:
# To Combine Results Later:
import glob
import json

results = []
for filename in sorted(glob.glob("../data/interim/pseodoannotation/*.json")):
    with open(filename, "r", encoding="utf-8") as f:
        results.append(json.load(f))
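
Before uploading, it may help to see which records have a missing parsed output, since `extract_json` returns `None` when a model's answer could not be decoded. A minimal sketch, assuming the result layout saved above (`index` plus one `*_json` key per model); the helper name `find_missing_outputs` and the toy records are illustrative:

```python
def find_missing_outputs(results, model_keys=("llama_json", "openai_json", "anthropic_json")):
    """Map each record index to the model keys whose parsed JSON is missing or empty."""
    missing = {}
    for record in results:
        empty = [key for key in model_keys if not record.get(key)]
        if empty:
            missing[record["index"]] = empty
    return missing

# Toy records; real ones come from the interim JSON files.
toy_results = [
    {"index": 0, "llama_json": {"methods": ["x"]}, "openai_json": None, "anthropic_json": {"methods": ["y"]}},
    {"index": 1, "llama_json": {"methods": ["x"]}, "openai_json": {"methods": ["z"]}, "anthropic_json": {"methods": ["y"]}},
]
missing = find_missing_outputs(toy_results)
print(missing)
```

Records listed here can be re-queried by deleting their interim file and rerunning the annotation cell, which skips only documents that already have a file.
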

## Step 3: Push LLM Outputs to Argilla for Human Review

The LLM-generated annotations from every model are uploaded to **Argilla** as candidate responses. Annotators review them and either choose the best candidate or write their own, producing the high-quality, human-verified annotations that make up our training dataset.

In [35]:
import logging
import pandas as pd
import argilla as rg
import os
from dotenv import load_dotenv

load_dotenv()

logger = logging.getLogger(__name__)

def ensure_workspace(workspace_name="default"):
    # Ensure the workspace exists
    existing_workspaces = client.workspaces.list()
    if not any(ws.name == workspace_name for ws in existing_workspaces):
        workspace = rg.Workspace(name=workspace_name)
        client.workspaces.add(workspace)
        logger.info(f"Created new workspace: {workspace_name}")
    else:
        logger.info(f"Workspace '{workspace_name}' exists.")

# argilla settings
URL = 'http://argilla2.unics.cloud/'
KEY = os.getenv("ARGILLA")

client = rg.Argilla(api_url=URL, api_key=KEY)

# create the workspace if it doesn't exist
ARGILLA_WORKSPACE = 'mutomo'
ensure_workspace(ARGILLA_WORKSPACE)

In [36]:
# metadata_overlap = rg.TermsMetadataProperty(
#     name="overlap_status",
#     options=["Partial overlap", "Total overlap", "Zero overlap"],
#     title="Model agreement",
#     visible_for_annotators=True,
# )

guidelines = '''# STI Document Classification Guidelines

This task involves classifying science, technology, and innovation (STI) documents (publications, projects, and patents) according to the HORIZON Clusters and Areas of Intervention and the SDGs.

## Your Task

For each record, you will:
1. Review the title and abstract
2. Select the most adequate categories for the record (suggestions are provided to speed up the process and to generate debate)
3. Optionally provide feedback on the classification criteria or the category definitions

## Classification Categories
https://docs.google.com/document/d/1WlSAu9rzs515uuNNQ0AiTtt8LsSWX_1in81tz86GMk8/edit?tab=t.0#heading=h.im41vt96h3g9
'''

metadata_doctype = rg.TermsMetadataProperty(
    name="doctype",
    title="Type of document/source",
    visible_for_annotators=True,
)

metadata_id = rg.TermsMetadataProperty(
    name="id",
    title="Document Id",
    visible_for_annotators=True,
)

# argilla dataset settings
settings = rg.Settings(
    guidelines="HOLA!",  # placeholder text; the `guidelines` string defined above is not used here
    fields=[
        #rg.TextField(name="id"),  # unique identifier for each record (stored as metadata instead)
        rg.TextField(name="title"),  # title of the document
        rg.TextField(name="abstract"),  # abstract of the document
        rg.TextField(name="responses", use_markdown=True),  # shuffled LLM outputs rendered as markdown

    ],
    questions=[
        # single-choice question per dimension: which anonymised model gave the best answer
        rg.LabelQuestion(
            name="motivations",
            labels=['model-1','model-2','model-3','none'],  # anonymised labels used in the "responses" field
            title='Motivations (choose the most accurate, if none, write your own)',
            description="Choose the most accurate, if none, write your own.",
            required=True,
        ),
        rg.TextQuestion(name='motivations_feedback',
                        title="If none of the responses is helpful and coherent, provide your own response as bullet points",
                        required=False),
        rg.LabelQuestion(
            name="objectives",
            labels=['model-1','model-2','model-3','none'],  # anonymised labels used in the "responses" field
            title="Objectives (choose the most accurate, if none, write your own)",
            description="Choose the most accurate, if none, write your own.",
            required=True,
        ),
        rg.TextQuestion(name="objectives_feedback",
                        title='If none of the responses is helpful and coherent, provide your own response as bullet points',
                        required=False),
        rg.LabelQuestion(
            name="methods",
            labels=['model-1','model-2','model-3','none'],  # anonymised labels used in the "responses" field
            title="Methods (choose the most accurate, if none, write your own)",
            description="Choose the most accurate, if none, write your own.",
            required=True,
        ),
        rg.TextQuestion(name="methods_feedback",
                        title='If none of the responses is helpful and coherent, provide your own response as bullet points',
                        required=False),

        rg.LabelQuestion(
            name="results",
            labels=['model-1','model-2','model-3','none'],  # anonymised labels used in the "responses" field
            title="Results (choose the most accurate, if none, write your own)",
            description="Choose the most accurate, if none, write your own.",
            required=True,
        ),
        rg.TextQuestion(name="results_feedback",
                        title='If none of the responses is helpful and coherent, provide your own response as bullet points',
                        required=False),

        rg.LabelQuestion(
            name="research-subject",
            labels=['model-1','model-2','model-3','none'],  # anonymised labels used in the "responses" field
            title="Research Subject (choose the most accurate, if none, write your own)",
            description="Choose the most accurate, if none, write your own.",
            required=True,
        ),
        rg.TextQuestion(name="research-subject_feedback",
                        title='If none of the responses is helpful and coherent, provide your own response as bullet points',
                        required=False),
        
    ],
    #suggestions =  [rg.Suggestion("Comments/Feedback", "default text", agent="Majority Vote")]
    metadata=[metadata_id, metadata_doctype]#, metadata_overlap]
)

In [37]:
# creating the argilla dataset
dataset_name = "mutomo"
# remove if exists
#dataset_to_delete = client.datasets(name=dataset_name)

dataset = rg.Dataset(name=dataset_name,
                     settings=settings,
                     client=client,
                     workspace=ARGILLA_WORKSPACE
)

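# try to create the dataset; if that fails (typically because it already exists),
# delete the existing dataset and recreate it.
# Note: deleting the dataset also discards any annotations already collected in it.
try:
    dataset.create()
except Exception as e:
    print(f"Dataset already exists or cannot be created: {e}")
    dataset_to_delete = client.datasets(name=dataset_name, workspace=ARGILLA_WORKSPACE)
    dataset_deleted = dataset_to_delete.delete()
    dataset.create()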

In [38]:
import random

def markdown_aggregate_shuffled(row, dimensions=None, model_keys=None, seed=None):
    if dimensions is None:
        # keys must match the JSON schema requested in prompt_base
        dimensions = ["motivations", "objectives", "methods", "results", "research_topic"]
    if model_keys is None:
        model_keys = [k for k in row.keys() if k.endswith("_json")]

    rnd = random.Random(seed)
    shuffled_model_keys = model_keys.copy()
    rnd.shuffle(shuffled_model_keys)

    out = []
    for dim in dimensions:
        out.append(f"### {dim.replace('_', ' ').capitalize()}\n")
        for i, mk in enumerate(shuffled_model_keys, start=1):
            model_label = f"model-{i}"
            model_data = row.get(mk) or {}  # Handle None by using empty dict
            vals = model_data.get(dim, None)
            out.append(f"- **{model_label}**")
            if not vals:  # If None or empty, add a placeholder
                out.append("  - *(empty)*")
            elif isinstance(vals, list):
                for s in vals:
                    out.append(f"  - {s}")
            else:  # string or single value
                out.append(f"  - {vals}")
            out.append("")  # Blank line for spacing
        out.append("")  # Blank line for next dimension
    return "\n".join(out), shuffled_model_keys

# Example usage on one combined result (any entry of `results` works):
md, shuffle_order = markdown_aggregate_shuffled(results[0], seed=42)
print(md)
print("Shuffle order:", shuffle_order)  # Needed to map model-N back to the real model


### Motivations

- **model-1**
  - There is a need for coordinated management and protection of biodiversity in Italian and Croatian coastal wetlands.
  - There is a lack of public awareness about the value of wetlands ecosystems and the need for active engagement in territorial governance.

- **model-2**
  - There is a need for coordinated management and protection of coastal wetlands in the Italy-Croatia cross-border region to preserve biodiversity.
  - Lack of public awareness about the value of wetland ecosystems hinders effective governance and conservation efforts.

- **model-3**
  - Coastal wetlands in the Italy-Croatia cross-border region lack coordinated management approaches, threatening biodiversity conservation.
  - There is insufficient public awareness and engagement regarding the ecological value of wetland ecosystems among stakeholders and the general public.


### Objectives

- **model-1**
  - Establish a cross-border observatory to monitor best practices and data on I
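
Because annotators only see the anonymised labels `model-1` to `model-3`, their choices have to be translated back to the real models afterwards. A minimal sketch of that lookup, assuming the shuffle order returned by `markdown_aggregate_shuffled` was kept for each record or can be regenerated with the same seed (the helper name `resolve_choice` and the example order are illustrative):

```python
def resolve_choice(choice, shuffle_order):
    """Translate an annotator's 'model-N' pick into the underlying model key."""
    if choice == "none":
        return None  # the annotator wrote their own answer
    position = int(choice.split("-", 1)[1])  # 'model-2' -> 2
    return shuffle_order[position - 1]       # labels are 1-based

# Illustrative order only; the real one comes from markdown_aggregate_shuffled.
order = ["openai_json", "anthropic_json", "llama_json"]
picked = resolve_choice("model-2", order)
print(picked)
```

Without a stored order or a fixed seed per record, this mapping cannot be recovered, so it is worth deciding on one of the two before uploading.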

In [39]:
records = []

for row in results:
    source_row = df_sampled_examples.iloc[row['index']]
    doctype = source_row['type']
    docid = source_row['id']
    # Seed the shuffle with the record index so the model-N order can be regenerated later
    md, shuffle_order = markdown_aggregate_shuffled(row, seed=row['index'])

    suggestions = [rg.Suggestion("motivations_feedback", "default text", agent="default")]

    record = rg.Record(
        fields={
            #"id": str(row['id']),
            "title": row['title'],
            "abstract": row['abstract'],
            "responses": md,
        },
        metadata={
            "id": docid,
            "doctype": doctype,
            #'overlap_status': row['overlap_status']
        },
        suggestions=suggestions,
    )
    records.append(record)

In [None]:
batch_size = 10  # Adjust based on server limits

for i in range(0, len(records), batch_size):
    try:
        dataset.records.log(records[i:i+batch_size])
        print(f"✅ Uploaded batch {i // batch_size + 1} ({len(records[i:i+batch_size])} records)")
    except Exception as e:
        print(f"⚠️ Failed to upload batch {i // batch_size + 1}: {e}")
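
The upload loop above reports failed batches but does not keep them, so they would have to be found again by hand. A minimal sketch of a variant that returns the failed slices for a later retry; `upload_in_batches` is a hypothetical helper, and `log_fn` stands in for `dataset.records.log`:

```python
def upload_in_batches(records, log_fn, batch_size=10):
    """Send records in slices; return the slices that raised so they can be retried."""
    failed = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        try:
            log_fn(batch)
        except Exception as exc:
            print(f"Batch starting at {start} failed: {exc}")
            failed.append(batch)
    return failed

# Demo with a fake upload function that rejects any batch containing the value 5.
sent = []

def fake_log(batch):
    if 5 in batch:
        raise ValueError("rejected")
    sent.extend(batch)

failed_batches = upload_in_batches(list(range(7)), fake_log, batch_size=3)
print(failed_batches, sent)
```

In the notebook this would be called as `upload_in_batches(records, dataset.records.log)`, and any returned batches could be passed through the same function again.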

In [122]:
# list existing users
client.users

username,id,role,updated_at
argilla,7ee2325f-05b4-4d11-a4a3-07b5be8fc14d,owner,2025-01-09T09:51:02.614050
annotator_smart_cities,5593bf77-d9f0-4210-8676-58471ff82808,annotator,2025-01-28T13:03:35.266504
annotator_ict,440e7203-81fd-4f53-b899-fdfd727af6fe,annotator,2025-01-28T13:08:26.425012
annotator_smart_buildings,f49deb3c-75a2-4056-9e90-df4ce77791e3,annotator,2025-01-28T13:08:26.875474
annotator_circular_economy,bc60a264-bada-46de-b0f0-b0c2712c2c3f,annotator,2025-01-28T13:08:27.282235
annotator_sustainable_food,32b0dcc3-cda7-46f0-b2d3-099a526dc1fc,annotator,2025-01-28T13:08:27.692671
annotator_sustainable_tourism,c9a9d53c-67de-4ae3-a09e-25bf86ff2518,annotator,2025-01-28T13:08:28.097751
annotator_health_medicine,494668b1-f973-47b4-85bc-0a1e4e01347c,annotator,2025-01-28T13:08:28.509356
annotator_mobility,344987ea-3a5b-4fd1-83a2-f33131668f85,annotator,2025-01-28T13:08:28.919467
annotator_materials_end_products,4e99a761-5f33-49fb-b8c2-64ca826f557f,annotator,2025-01-28T13:08:29.330462


In [None]:
workspace = client.workspaces('mutomo')

# berta
user = client.users('berta.grimau@sirisacademic.com')
added_user = user.add_to_workspace(workspace)

# nicolau
user = client.users('nicolau.duransilva@sirisacademic.com')
added_user = user.add_to_workspace(workspace)

# theodore
user = client.users('theodore.hervieux@sirisacademic.com')
added_user = user.add_to_workspace(workspace)

In [132]:
# (username, password) pairs for the accounts to create.
# The password is read from the environment instead of being hard-coded in the notebook;
# NEW_USER_PASSWORD is a hypothetical variable name, set it in your .env file.
users_argilla = [("benedikt.schmidt@sirisacademic.com", os.environ["NEW_USER_PASSWORD"])]

workspace = client.workspaces('mutomo')

# create each user as an annotator and add them to the workspace
for user_name, user_pswd in users_argilla:
    new_user = rg.User(
        username=user_name,
        password=user_pswd,
        role="annotator",
        client=client
    )

    created_user = new_user.create()

    user = client.users(user_name)
    added_user = user.add_to_workspace(workspace)