# SciLake Energy Pilot – GeoTagging Scientific Publications

This repository provides examples and workflows for **geotagging scientific publications** in the context of the **Energy Planning Pilot** of the [SciLake European Project](https://www.scilake.eu/).  

The notebook demonstrates how to extract and normalize **geographical information** from scientific publications using two main components:

- [**AffilGood**](https://github.com/sirisacademic/affilgood/tree/main/docs)  
  Identifies and geolocates geographical components of **affiliations** extracted from publications.  
  📄 Paper: [*AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis* (ACL 2024)](https://aclanthology.org/2024.sdp-1.13/)

- [**GEORDIE**](https://github.com/sirisacademic/geordie/tree/dev)  
  Extracts **geographical mentions** in text (title, abstract, full text), normalizes them, and characterizes them by semantic role.  
  📄 Paper: under development


**0. Imports**

In [18]:
%load_ext autoreload
%autoreload 2

import sys, os, json
from datasets import load_dataset
import geordie
from affilgood import AffilGood

sys.path.append(os.path.abspath("..")) 
from scripts.utils import build_all_payload 

def proc_text(g, text: str | None):
    if not text or (isinstance(text, str) and not text.strip()):
        return None
    try:
        return g.process_text(text)
    except Exception as e:
        print("⚠️ GEORDIE error on text snippet:", str(e)[:200])
        return None

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**1. Load dataset & pick one publication**

In [38]:
# Download the sample of publications on energy planning
ds = load_dataset("nicolauduran45/scilake-additional-fulltext-corpus")
pubs = ds["energy_planning"]

row = pubs[18]   # ← first item
row.get("doi"), row.get("title")[:120] if row.get("title") else None

('https://doi.org/10.5278/ijsepm.5400',
 'Application of a cost-benefit model to evaluate the investment viability of the small-scale cogeneration systems in the ')

**2. Init GEORDIE & AffilGood**

In [None]:
# Init geordie, cpu is fine for few examples
g = geordie.Geordie(device="cpu")

# Init basic config of affilgood
affil_good = AffilGood(
    span_separator='',  # Use model-based span identification
    span_model_path='SIRIS-Lab/affilgood-span-multilingual',  # Custom span model
    ner_model_path='SIRIS-Lab/affilgood-NER-multilingual',  # Custom NER model
    entity_linkers=['Whoosh'],#, 'DenseLinker'],  # Use multiple linkers
    return_scores=True,  # Return confidence scores with predictions
    metadata_normalization=True,  # Enable location normalization
    verbose=False,  # Detailed logging
    device='cpu'  # Auto-detect device (CPU or CUDA)
)

**3. Process title, abstract, and full text sections**

In [52]:
def entities_only(geo_result):
    """
    Accepts:
      - dict like {"text": "...", "entities": [...]}
      - list of entity dicts
      - None
    Returns: list[dict] (entity dicts as GEORDIE emits them)
    """
    if geo_result is None:
        return []
    if isinstance(geo_result, dict):
        return geo_result.get("entities") or geo_result.get("spans") or []
    if isinstance(geo_result, list):
        return geo_result
    return []

# GEORDIE runs
title_geo_raw    = proc_text(g, row.get("title"))
abstract_geo_raw = proc_text(g, row.get("abstract"))

# Extract **lists of entities**
title_entities    = entities_only(title_geo_raw)
abstract_entities = entities_only(abstract_geo_raw)

# Sections: build a list-of-lists aligned with `fulltext_sections`
sections = row.get("fulltext_sections") or []
ft_entities = []
for sec in sections:
    content = (sec or {}).get("section_content")
    sec_geo_raw = proc_text(g, content)
    ft_entities.append(entities_only(sec_geo_raw))

out = {
    "doi": row.get("doi"),
    "affilgood": row.get('affiliations'),                 # <-- raw affilgood list
    "title": row.get("title") or "",
    "title_geordie": title_entities,            # <-- list of entity dicts
    "abstract": row.get("abstract") or "",
    "abstract_geordie": abstract_entities,      # <-- list of entity dicts
    "fulltext_sections": sections,              # <-- original sections (with text)
    "fulltext_sections_geordie": ft_entities    # <-- list-of-lists (aligned)
}

# Postpocess output
payload = build_all_payload(out)

# Store output 
os.makedirs("../examples", exist_ok=True)
out_path = f"../examples/{row['title']}.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)
print(f"Output file stored in : {out_path}")

# print output
def compact(p):
    c = dict(p)
    c["title"] = {"text": "....", "entities": p["title"]["entities"]}
    c["abstract"] = {"text": "....", "entities": p["abstract"]["entities"]}
    c["fulltext"] = [
        {
            "section_num": s.get("section_num"),
            "section_name": s.get("section_name"),
            "text": "....",
            "entities": s.get("entities", [])[:5],  # show a few
        }
        for s in p.get("fulltext", [])[:2]  # show first two sections
    ]
    return c

print(json.dumps(compact(payload), ensure_ascii=False, indent=2))

Output file stored in : ../examples/Application of a cost-benefit model to evaluate the investment viability of the small-scale cogeneration systems in the Portuguese context.json
{
  "doi": "https://doi.org/10.5278/ijsepm.5400",
  "affiliations": [],
  "title": {
    "text": "....",
    "entities": [
      {
        "raw_entity": "Portuguese",
        "role": "Object of study",
        "osm_entity": "Portugal",
        "osm_link": "https://www.openstreetmap.org/relation/295480",
        "osm_id": "295480",
        "place_id": "263154752"
      }
    ]
  },
  "abstract": {
    "text": "....",
    "entities": [
      {
        "raw_entity": "Portuguese",
        "role": "Object of study",
        "osm_entity": "Portugal",
        "osm_link": "https://www.openstreetmap.org/relation/295480",
        "osm_id": "295480",
        "place_id": "263154752"
      }
    ]
  },
  "fulltext": [
    {
      "section_num": "1.",
      "section_name": "Introduction",
      "text": "....",
      "entit