# 01 - Analysing SHS Projects

In this tutorial, we work step by step through a set of research projects and classify whether each one belongs to the **Social Sciences and Humanities (SSH)**.  

---

## Step 1: Data Gathering

We will collect datasets from three main sources:  

1. **ANR Projects (France)**  
   - Source: [data.gouv.fr](https://www.data.gouv.fr/api/1/datasets/r/74a59cc0-ef79-458a-83e0-f181f9da459f)  

2. **CORDIS Horizon Projects (EU)**  
   - Source: [CORDIS Portal](https://cordis.europa.eu/data/cordis-HORIZONprojects-csv.zip)  

3. **Irish Research Council (IRC)**  
   - Source: [research.ie](https://research.ie/awardees_search/download.php?file=irc-awardees-search--.csv)  

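We will store the raw files under the following folder, which the setup cell creates one level above the notebook (`../data/analysing_SHS_projects`):  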

```
data/analysing_SHS_projects/
```
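
Both download cells below follow the same pattern: skip the request when the file is already cached, otherwise fetch it and write it to disk. A minimal sketch of that pattern as one reusable helper follows. The name `download_if_missing` is hypothetical, and the sketch uses the standard library's `urllib` rather than the `requests` calls used in the notebook, so it runs without extra dependencies.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest_path, timeout=60):
    """Download `url` to `dest_path` unless the file already exists.

    Hypothetical helper, not part of the original notebook.
    Returns True if a download happened, False if the cached file was reused.
    """
    if os.path.exists(dest_path):
        return False

    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    tmp_path = dest_path + ".part"
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file that looks like a valid cache entry.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, dest_path)
    return True
```

With the variables defined in the cells below, a call would look like `download_if_missing(anr_url, anr_path)`.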


In [1]:
# Step 1: Setup
import os
import pandas as pd
import zipfile
import requests

# Create data folder
data_dir = "../data/analysing_SHS_projects"
os.makedirs(data_dir, exist_ok=True)

### 1.1 Download ANR Projects

In [2]:
anr_url = "https://www.data.gouv.fr/api/1/datasets/r/74a59cc0-ef79-458a-83e0-f181f9da459f"
anr_path = os.path.join(data_dir, "anr_projects.csv")

if not os.path.exists(anr_path):
    r = requests.get(anr_url, timeout=60)
    r.raise_for_status()  # fail loudly instead of caching an HTTP error page
    with open(anr_path, "wb") as f:
        f.write(r.content)

df_anr = pd.read_csv(anr_path, sep=";", low_memory=False)
print("ANR shape:", df_anr.shape)
df_anr.head()

ANR shape: (6478, 10)


| | Projet.Code_Decision_ANR | AAP.Edition | Projet.Acronyme | Projet.Titre.Francais | Projet.Titre.Anglais | Projet.Resume.Francais | Projet.Resume.Anglais | Programme.Acronyme | Projet.Montant.AF.Aide_allouee.ANR | Projet.T0 scientifique |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ANR-09-VPTT-0015 | 2009 | ATAC-CONCEPT | Convertisseur auxiliaire avancé à refroidissem... | | L’objectif du Projet ATAC-CONCEPT est de dével... | | VTT | 481373.0 | 2009-12-01 |
| 1 | ANR-09-VPTT-0014 | 2009 | VOLHAND | VOLant pour personne âgée et/ou HANDicapée : D... | | A ce jour, il n’existe pas de système de direc... | | VTT | 959412.0 | 2009-10-01 |
| 2 | ANR-09-VPTT-0013 | 2009 | TAYLRUB | Gérer les compromis de propriétés des caoutcho... | | Les silices précipitées sont utilisées comme c... | | VTT | 490545.0 | 2009-10-01 |
| 3 | ANR-09-VPTT-0012 | 2009 | SYNERGIE | SYstème d'admission Novateur pour des Emission... | | Les transports terrestres et en particulier au... | | VTT | 1574420.0 | 2009-10-01 |
| 4 | ANR-09-VPTT-0011 | 2009 | SPEEDCAM | Détermination de la Limitation de Vitesse par ... | | Afin d'assurer une sécurité maximale sur les r... | | VTT | 672464.0 | 2009-10-01 |


### 1.2 Download CORDIS Horizon Projects

In [3]:
cordis_url = "https://cordis.europa.eu/data/cordis-HORIZONprojects-csv.zip"
cordis_zip = os.path.join(data_dir, "cordis_projects.zip")
cordis_path = os.path.join(data_dir, "project.csv")

if not os.path.exists(cordis_path):
    r = requests.get(cordis_url, timeout=120)
    r.raise_for_status()  # fail loudly instead of saving an HTTP error page as a ZIP
    with open(cordis_zip, "wb") as f:
        f.write(r.content)
    with zipfile.ZipFile(cordis_zip, "r") as zip_ref:
        zip_ref.extractall(data_dir)

# If the archive ships the projects table under a longer "projects*.csv" name, use that file
for file in os.listdir(data_dir):
    if file.startswith("projects") and file.endswith(".csv"):
        cordis_path = os.path.join(data_dir, file)

# Note: on_bad_lines="skip" silently drops malformed rows
df_cordis = pd.read_csv(cordis_path, low_memory=False, on_bad_lines="skip", sep=";")
print("CORDIS shape:", df_cordis.shape)
df_cordis.head()

CORDIS shape: (18110, 20)


| | id | acronym | status | title | startDate | endDate | totalCost | ecMaxContribution | legalBasis | topics | ecSignatureDate | frameworkProgramme | masterCall | subCall | fundingScheme | objective | contentUpdateDate | rcn | grantDoi | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101234994 | OPTIMALMINE | SIGNED | OPTIMALMINE: slope optimal design for a paradi... | 2025-09-01 | 2029-08-31 | 0 | 1072140 | HORIZON.1.2 | HORIZON-MSCA-2024-SE-01-01 | 2025-07-10 | HORIZON | HORIZON-MSCA-2024-SE-01 | HORIZON-MSCA-2024-SE-01 | HORIZON-TMA-MSCA-SE | The European Union is currently addressing cha... | 2025-07-25 11:08:05 | 274682 | 10.3030/101234994 | mine optimisation, rock slope engineering, o... |
| 1 | 101232577 | HSAFE | SIGNED | Innovative high-sensitivity avalanche field-ef... | 2025-09-01 | 2029-08-31 | 0 | 1618230 | HORIZON.1.2 | HORIZON-MSCA-2024-SE-01-01 | 2025-07-22 | HORIZON | HORIZON-MSCA-2024-SE-01 | HORIZON-MSCA-2024-SE-01 | HORIZON-TMA-MSCA-SE | The focus of the HSAFE project, which aligns w... | 2025-07-25 11:08:05 | 274696 | 10.3030/101232577 | Field-effect transistor-based biosensors, Canc... |
| 2 | 101236527 | DRU | SIGNED | Democratic Roles of Universities (DRU): Practi... | 2026-02-01 | 2030-01-31 | 0 | 1593180 | HORIZON.1.2 | HORIZON-MSCA-2024-SE-01-01 | 2025-07-10 | HORIZON | HORIZON-MSCA-2024-SE-01 | HORIZON-MSCA-2024-SE-01 | HORIZON-TMA-MSCA-SE | DRU’s objective is to find new ways that unive... | 2025-07-25 11:08:04 | 274676 | 10.3030/101236527 | Universities, Citizen science, Civic engagement |
| 3 | 101236483 | 3D-TOPO | SIGNED | 3D topological states in solid and soft ferroe... | 2026-02-01 | 2030-01-31 | 0 | 1117230 | HORIZON.1.2 | HORIZON-MSCA-2024-SE-01-01 | 2025-07-22 | HORIZON | HORIZON-MSCA-2024-SE-01 | HORIZON-MSCA-2024-SE-01 | HORIZON-TMA-MSCA-SE | The 3D-TOPO project uncovers novel topological... | 2025-07-25 11:08:06 | 274700 | 10.3030/101236483 | Ferroelectrics, Liquid Crystals, Topological s... |
| 4 | 101235387 | I-TEXGEO | SIGNED | IoT Supported Electronic Geotextiles for Susta... | 2026-01-01 | 2029-12-31 | 0 | 1683360 | HORIZON.1.2 | HORIZON-MSCA-2024-SE-01-01 | 2025-07-22 | HORIZON | HORIZON-MSCA-2024-SE-01 | HORIZON-MSCA-2024-SE-01 | HORIZON-TMA-MSCA-SE | This project aims to develop an innovative IoT... | 2025-07-25 11:08:06 | 274694 | 10.3030/101235387 | |
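
The detection loop above guesses the extracted file name from a prefix. An alternative is to ask the archive itself which CSV files it contains before extracting anything. The sketch below does that with `zipfile.ZipFile.namelist()`; `list_csv_members` is a hypothetical helper, not part of the original notebook.

```python
import zipfile


def list_csv_members(zip_source):
    """Return the names of all .csv members in a ZIP archive, sorted.

    `zip_source` may be a path or a binary file-like object.
    Hypothetical helper, not part of the original notebook.
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".csv")
        )
```

Printing `list_csv_members(cordis_zip)` once shows which member should be passed to `pd.read_csv`.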


## Step 2: Data Exploration & Cleaning

We now have two datasets loaded (the IRC file listed in Step 1 is not downloaded in this notebook):  
- **ANR Projects** (French National Research Agency)  
- **CORDIS Projects** (EU Horizon)  

The schemas are different, so we will:  
1. Inspect the columns  
2. Select comparable fields (id, title, abstract)  
3. Harmonize them into a single dataset  
4. Add a `source` column  
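
Steps 2 to 4 above can be sketched as one function applied to each source. This is a minimal sketch, assuming pandas is available as in the rest of the notebook; `harmonize`, `TARGET_COLUMNS` and the toy column names `Code`, `Titre` and `Resume` are hypothetical and only stand in for the real schemas.

```python
import pandas as pd

TARGET_COLUMNS = ["project_id", "title", "abstract"]


def harmonize(df, column_map, source):
    """Rename source-specific columns to the shared schema and tag the origin.

    Hypothetical helper: `column_map` maps original column names to the
    names in TARGET_COLUMNS.
    """
    missing = set(column_map) - set(df.columns)
    if missing:
        raise KeyError(f"Columns not found in {source!r}: {sorted(missing)}")
    out = df.rename(columns=column_map)[TARGET_COLUMNS].copy()
    out["project_id"] = out["project_id"].astype(str)  # ids are numeric in some sources
    out["source"] = source
    return out


# Toy frames standing in for the real datasets
toy_a = pd.DataFrame({"Code": ["A-1"], "Titre": ["Un titre"], "Resume": ["Un résumé"]})
toy_b = pd.DataFrame({"id": [101], "title": ["A title"], "objective": ["An abstract"]})

merged = pd.concat(
    [
        harmonize(toy_a, {"Code": "project_id", "Titre": "title", "Resume": "abstract"}, "a"),
        harmonize(toy_b, {"id": "project_id", "objective": "abstract"}, "b"),
    ],
    ignore_index=True,
)
print(merged)
```

Raising on a missing column makes a schema change in either source visible immediately instead of surfacing later as an empty field.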

In [4]:
# Inspect the column names of both datasets
print("ANR columns:", df_anr.columns.tolist())
print("CORDIS columns:", df_cordis.columns.tolist())

# Select & rename relevant columns for ANR
df_anr_clean = df_anr.rename(
    columns={
        "Projet.Code_Decision_ANR": "project_id",
        "Projet.Titre.Francais": "title",
        "Projet.Resume.Francais": "abstract"
    }
)[["project_id", "title", "abstract"]]

df_anr_clean["source"] = "anr"

print("ANR clean shape:", df_anr_clean.shape)
df_anr_clean.head(2)

# Select & rename relevant columns for CORDIS
df_cordis_clean = df_cordis.rename(
    columns={
        "id": "project_id",
        "title": "title",
        "objective": "abstract"
    }
)[["project_id", "title", "abstract"]]

df_cordis_clean["source"] = "cordis"

print("CORDIS clean shape:", df_cordis_clean.shape)
df_cordis_clean.head(2)

# Merge both datasets
df_projects = pd.concat([df_anr_clean, df_cordis_clean], ignore_index=True)

print("Merged shape:", df_projects.shape)
df_projects.sample(5)

ANR columns: ['Projet.Code_Decision_ANR', 'AAP.Edition', 'Projet.Acronyme', 'Projet.Titre.Francais', 'Projet.Titre.Anglais', 'Projet.Resume.Francais', 'Projet.Resume.Anglais', 'Programme.Acronyme', 'Projet.Montant.AF.Aide_allouee.ANR', 'Projet.T0 scientifique']
CORDIS columns: ['id', 'acronym', 'status', 'title', 'startDate', 'endDate', 'totalCost', 'ecMaxContribution', 'legalBasis', 'topics', 'ecSignatureDate', 'frameworkProgramme', 'masterCall', 'subCall', 'fundingScheme', 'objective', 'contentUpdateDate', 'rcn', 'grantDoi', 'keywords']
ANR clean shape: (6478, 4)
CORDIS clean shape: (18110, 4)
Merged shape: (24588, 4)


| | project_id | title | abstract | source |
|---|---|---|---|---|
| 10304 | 101080164 | Deep Ultraviolet Laser For Quantum Technology | Lasers are the heart of today’s quantum scienc... | cordis |
| 15085 | 101188025 | An Application for leveraging large-scale hist... | HistText is a groundbreaking application devel... | cordis |
| 13009 | 101158232 | Universal Data Compression with Circular Conte... | Within the ITUL project, funded by an ERC Cons... | cordis |
| 10090 | 101052653 | Why late earliest occupation of Western Europe ? | The project aims to question human migrations ... | cordis |
| 20812 | 101087829 | Straintronic control of correlations in twiste... | Correlations and topology are the cornerstones... | cordis |


## Step 3: Classification into Social Sciences & Humanities (SSH)

We will now use a pretrained classifier from **SIRIS-Lab** to detect whether each project belongs to the *Social Sciences and Humanities (SSH)* domain.  

### Steps:
1. **Dataset preparation** — Concatenate title + abstract into a single field  
2. **Load the model** — Use Hugging Face `transformers`  
3. **Run the classification** — Apply the model to our dataset  


In [None]:
# Step 3.1 Dataset preparation
df_projects['text'] = (
    'Title: ' + df_projects['title'].fillna('').str.strip() +
    '\n\nAbstract: ' + df_projects['abstract'].fillna('').str.strip()
)

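# Drop exact duplicate rows. `title` and `abstract` are kept because Step 5 reads them again,
# and there is no `id` column to drop (it was renamed to `project_id` in Step 2).
df_projects = df_projects.drop_duplicates().reset_index(drop=True)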

df_projects[['project_id', 'source', 'text']].sample(5)

| | project_id | source | text |
|---|---|---|---|
| 12928 | 101097433 | cordis | Title: Ultimate fracture toughness through thi... |
| 17267 | 101110350 | cordis | Title: The Drought Impact on the Climate Benef... |
| 10685 | 101084481 | cordis | Title: The ForestWard Observatory to Secure Re... |
| 17939 | 101151931 | cordis | Title: VIA-TARIQ: Analysing the long-term chan... |
| 20104 | 101181208 | cordis | Title: MYcotoxin MAnagement (AI)platform To fa... |


In [6]:
# Step 3.2 Load the model

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TextClassificationPipeline
)

# ⚠️ IMPORTANT: if the model is private, you must be logged in with your Hugging Face token:
#   from huggingface_hub import login
#   login("YOUR_HF_TOKEN")

model_name = "SIRIS-Lab/ssh_binary_classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

import torch

pipeline_ssh = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    top_k=1,
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
    padding=True,
    truncation=True,
    max_length=512
)

Device set to use cuda:0


In [None]:
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

# Only include text column
dataset = Dataset.from_pandas(df_projects[['text']])

# Run the pipeline
outputs = []
for out in tqdm(pipeline_ssh(KeyDataset(dataset, "text"), batch_size=32), total=len(dataset)):
    outputs.append(out)

# Store results back into df_projects
# LABEL_0 is treated as non-SSH; `p` is the score of the predicted label, not P(SSH)
df_projects['ssh'] = [o[0]['label'] != 'LABEL_0' for o in outputs]
df_projects['p'] = [o[0]['score'] for o in outputs]

df_projects[['project_id', 'source', 'ssh', 'p']].head()

| | project_id | source | ssh | p |
|---|---|---|---|---|
| 0 | ANR-09-VPTT-0015 | anr | False | 0.999436 |
| 1 | ANR-09-VPTT-0014 | anr | False | 0.950545 |
| 2 | ANR-09-VPTT-0013 | anr | False | 0.999514 |
| 3 | ANR-09-VPTT-0012 | anr | False | 0.998463 |
| 4 | ANR-09-VPTT-0011 | anr | False | 0.985343 |
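
Note that `p` is the score of whichever label won, so a row with `ssh = False` and `p = 0.99` is a confident non-SSH prediction. If a single SSH probability per project is more convenient (for example to sort projects or to apply a custom cut-off), it can be derived as sketched below. This assumes a two-class softmax output, so that the two scores sum to 1, and that `LABEL_1` means SSH as in the mapping above; `ssh_probability` is a hypothetical helper.

```python
def ssh_probability(label, score, positive_label="LABEL_1"):
    """Convert a top-1 (label, score) pair into P(SSH).

    Assumes a binary softmax classifier, so the losing class has
    probability 1 - score. Hypothetical helper.
    """
    return score if label == positive_label else 1.0 - score


# Illustrative values, not taken from the model
examples = [("LABEL_0", 0.95), ("LABEL_1", 0.80)]
for label, score in examples:
    print(label, round(ssh_probability(label, score), 2))
```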


In [None]:
df_projects['project_id'] = df_projects['project_id'].astype(str)
df_projects.to_parquet('../data/analysing_SHS_projects/projects_with_ssh_labels.parquet', index=False, engine='pyarrow')

## Step 4: ERC Discipline Classification (Multilabel)

For the projects classified as SSH, we now assign them to one or more **ERC panels/disciplines** using a pretrained multilabel classifier.  


In [None]:
df_projects = pd.read_parquet('../data/analysing_SHS_projects/projects_with_ssh_labels.parquet',engine='pyarrow')


In [26]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# Step 4.1 Filter SSH projects
df_ssh = df_projects[df_projects.ssh].reset_index(drop=True).copy()
print("SSH projects:", len(df_ssh))

# Step 4.2 Load ERC classifier (multilabel)
erc_model_name = "nicolauduran45/erc_classifier_demo"

erc_tokenizer = AutoTokenizer.from_pretrained(erc_model_name)
erc_model = AutoModelForSequenceClassification.from_pretrained(erc_model_name)

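import torch

pipeline_erc = TextClassificationPipeline(
    model=erc_model,
    tokenizer=erc_tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
    padding=True,
    truncation=True,
    max_length=512,
    top_k=None  # return scores for all labels (replaces the deprecated return_all_scores=True)
)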

# Step 4.3 Run predictions
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

dataset_ssh = Dataset.from_pandas(df_ssh[['text']])

outputs = []
for out in tqdm(pipeline_erc(KeyDataset(dataset_ssh, "text"), batch_size=16), total=len(dataset_ssh)):
    outputs.append(out)

# Step 4.4 Convert predictions to dataframe columns
# Collect labels from model config
erc_labels = list(erc_model.config.id2label.values())

# Convert each prediction row into dict {label: score}
erc_preds = [
    {pred['label']: pred['score'] for pred in row}
    for row in outputs
]

df_erc = pd.DataFrame(erc_preds, index=df_ssh.index)

# Join predictions back to main df
df_ssh = pd.concat([df_ssh, df_erc], axis=1)

# Step 4.5 Decide labels (e.g., threshold 0.5)
threshold = 0.5
df_ssh['erc_labels'] = df_erc.apply(lambda row: [label for label, score in row.items() if score >= threshold], axis=1)

df_ssh[['project_id', 'source', 'erc_labels']].head()


| | project_id | source | erc_labels |
|---|---|---|---|
| 0 | ANR-09-VPTT-0004 | anr | [Neuroscience and Disorders of the Nervous Sys... |
| 1 | ANR-09-VILL-0011 | anr | [Products and Processes Engineering] |
| 2 | ANR-09-VILL-0009 | anr | [Earth System Science, Products and Processes ... |
| 3 | ANR-09-VILL-0008 | anr | [] |
| 4 | ANR-09-VILL-0006 | anr | [Individuals, Markets and Organisations] |
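
Because each label is thresholded independently, a project can end up with several labels or with none, as in row 3 above. One option for the empty case is to fall back to the single highest-scoring label. The sketch below shows that rule on plain dictionaries of scores; `select_labels` and the fallback policy are assumptions for illustration, not part of the classifier, and the scores in the example are made up.

```python
def select_labels(scores, threshold=0.5, fallback_top1=False):
    """Pick labels whose score reaches `threshold`.

    `scores` maps label -> score in [0, 1]. If nothing passes and
    `fallback_top1` is True, return the single best label instead.
    Hypothetical helper.
    """
    selected = [label for label, score in scores.items() if score >= threshold]
    if not selected and fallback_top1 and scores:
        selected = [max(scores, key=scores.get)]
    return selected


# Made-up scores for illustration
row = {"Panel A": 0.41, "Panel B": 0.12, "Panel C": 0.33}
print(select_labels(row))                      # nothing reaches the threshold
print(select_labels(row, fallback_top1=True))  # best label only
```

With the notebook's data this could be applied row-wise, for example `df_erc.apply(lambda r: select_labels(r.to_dict(), fallback_top1=True), axis=1)`. Whether a forced label is preferable to an empty list depends on how the labels are used downstream.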


## Step 5: Valorisation Channels Extraction

Beyond SSH/discipline classification, we now check whether projects describe **valorisation channels** — i.e. pathways through which research can generate societal or cultural impact.  

Based on the literature, we look for the following channels:
- Public and academic dissemination  
- Policy advice and consultation  
- Stakeholder engagement and co-creation  
- Citizen science and participatory research  
- Education and training initiatives  
- Cultural production and advocacy  
- Expert services and open licensing  
- Institutional and social innovation  

We will use a Large Language Model (LLM) API to classify each project description (title + abstract) and return which channels are mentioned.  


In [82]:
import os
from dotenv import load_dotenv
from together import Together

# --- CONFIG ---
load_dotenv()
api_key = os.getenv("TOGETHER_API_KEY")
if not api_key:
    raise ValueError("Please set TOGETHER_API_KEY in your .env")

client = Together(api_key=api_key)

# List of models to test
MODELS = [
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
]

# --- Valorisation channels ---
VALORISATION_CHANNELS = [
    "Public and academic dissemination",
    "Policy advice and consultation",
    "Stakeholder engagement and co-creation",
    "Citizen science and participatory research",
    "Education and training initiatives",
    "Cultural production and advocacy",
    "Expert services and open licensing",
    "Institutional and social innovation"
]

# --- Prompt builder ---
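def build_prompt(channels, text):
    # Render the channels as a bullet list rather than a Python list repr
    channel_list = "\n".join(f"- {c}" for c in channels)
    return f"""You are a scientific tagger.

Goal:
Identify mentions of knowledge valorisation channels described in the project text.

Channels to check:
{channel_list}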

Project text:
{text}

Output format:
- For each channel, output a bullet point in the form:
  - <Channel Label>: "<verbatim mention from the text>"
- If multiple mentions exist for the same channel, include multiple bullets (one per mention).
- If a channel is not mentioned at all, output its label followed by a colon and nothing else.

Example:
- Public and academic dissemination: "publication of conference proceedings"
- Public and academic dissemination: "journal article"
- Policy advice and consultation: "consultancy to the European Commission in the climate policies"
- Stakeholder engagement and co-creation: "organisation of stakeholder workshops with local actors"

Rules:
1. Use verbatim spans from the input text (no paraphrasing, no translation).
2. If ambiguous/generic terms appear, include them only if clearly tied to the channel by context.
3. Do not add explanations, headers, or prose — only the bullet points.
4. Keep the channel labels in English, even if the input text is in another language; quoted spans stay in the original language.
"""

# --- Test with a single project ---
id = 50
text = df_ssh.loc[id, "text"]

TEST_PROMPT = build_prompt(VALORISATION_CHANNELS, text)

# --- MAIN LOOP ---
for model_name in MODELS:
    print(f"\n=== Testing model: {model_name} ===")
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": TEST_PROMPT}],
            temperature=0,
            stop=["<|eot_id|>", "<|eom_id|>"],  # optional, helps truncate
        )
        output = response.choices[0].message.content
        print("Response:", output, "\n")
    except Exception as e:
        print(f"Error with model {model_name}: {e}")



=== Testing model: meta-llama/Llama-3.3-70B-Instruct-Turbo ===
Response: - Public and academic dissemination: 
- Policy advice and consultation: 
- Stakeholder engagement and co-creation: "les acteurs de la société civile"
- Citizen science and participatory research: 
- Education and training initiatives: 
- Cultural production and advocacy: 
- Expert services and open licensing: 
- Institutional and social innovation: "meilleure structuration de la gouvernance du risque" 
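
The cell above tags a single project. Running the same call over every row of `df_ssh` means thousands of requests, some of which are likely to fail transiently (timeouts, rate limits). A generic retry wrapper with exponential backoff is one way to handle that. The sketch below is deliberately independent of the Together SDK: `call_with_retries` is a hypothetical helper that wraps any zero-argument callable, and the delays are arbitrary defaults.

```python
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once `max_attempts` is exhausted.
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a function that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


waits = []  # records the requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook, the request inside the loop above would be wrapped in a `lambda` and passed as `fn`, keeping the same arguments.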



In [83]:
import re
import json
from IPython.display import display, HTML

# --- Highlighter for valorisation spans ---
def highlight_entities_html(title, abstract, entities):
    # Sort entities by start offset so the abstract can be rebuilt left to right
    entities = sorted(entities, key=lambda e: e['start'])

    # fixed palette
    palette = {
        "Public and academic dissemination": "#ff9999",
        "Policy advice and consultation": "#99ff99",
        "Stakeholder engagement and co-creation": "#9999ff",
        "Citizen science and participatory research": "#ffcc99",
        "Education and training initiatives": "#cc99ff",
        "Cultural production and advocacy": "#99ffff",
        "Expert services and open licensing": "#ffb366",
        "Institutional and social innovation": "#66b3ff"
    }

    # Build highlighted abstract
    highlighted = ""
    last_idx = 0
    for ent in entities:
        start, end, label = ent["start"], ent["end"], ent["label"]
        if start < last_idx:
            continue  # skip spans that overlap one already highlighted
        highlighted += abstract[last_idx:start]
        color = palette.get(label, "#dddddd")
        span = (
            f"<mark style='background-color:{color}'>{abstract[start:end]}"
            f"</mark><sub><b style='color:{color}'>{label}</b></sub>"
        )
        highlighted += span
        last_idx = end
    highlighted += abstract[last_idx:]

    # Add title + abstract sections with color headings
    html = f"""
    <div style="font-family:Arial, sans-serif; line-height:1.5;">
      <div><b style="color:darkblue;">Title:</b> {title}</div>
      <div style="margin-top:10px;"><b style="color:darkgreen;">Abstract:</b> {highlighted}</div>
    </div>
    """
    return html

# --- Parser for bullet outputs ---
def parse_valorisation_output(output_text):
    entities = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line.startswith("-"):
            continue
        match = re.match(r'-\s*(.*?):\s*"(.*)"', line)
        if match:
            label, mention = match.groups()
            if mention.strip():  # an empty quote would match at every position later on
                entities.append({"label": label.strip(), "mention": mention.strip()})
    return entities

# --- Expand mentions into spans ---
def mentions_to_entities(text, mentions):
    entities = []
    for m in mentions:
        label, mention = m["label"], m["mention"]
        for match in re.finditer(re.escape(mention), text):
            entities.append({"label": label, "start": match.start(), "end": match.end()})
    return entities

# --- Example usage ---
title = df_ssh.loc[id, "title"]
abstract = df_ssh.loc[id, "abstract"]
model_output = output  # text returned by model

# Parse model output → mentions
mentions = parse_valorisation_output(model_output)

# Convert mentions → spans (on abstract only)
entities = mentions_to_entities(abstract, mentions)

# Highlight Title + Abstract
display(HTML(highlight_entities_html(title, abstract, entities)))

# Print ERC labels
pred_labels_html = ', '.join([
    f"<b style='color:orange'>{label}</b>"
    for label in df_ssh.loc[id, 'erc_labels']
])
erc_html = f"<div style='margin-top:10px;'><b>Predicted ERC labels:</b> {pred_labels_html}</div>"
display(HTML(erc_html))
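
The prompt asks for verbatim spans, but nothing guarantees that the model complies: a paraphrased or translated mention simply produces no match in `mentions_to_entities` and disappears silently. A small check that separates matched from unmatched mentions makes such cases visible. The sketch below assumes the `{"label", "mention"}` dictionaries produced by `parse_valorisation_output`; `split_verbatim` is a hypothetical helper and the example text is invented.

```python
def split_verbatim(mentions, text):
    """Split mentions into those found verbatim in `text` and those not found.

    Matching is exact and case-sensitive, like the span search above.
    Empty mentions count as not found. Hypothetical helper.
    """
    found, missing = [], []
    for m in mentions:
        if m["mention"] and m["mention"] in text:
            found.append(m)
        else:
            missing.append(m)
    return found, missing


# Invented example
sample_text = "We will organise stakeholder workshops with local actors."
sample_mentions = [
    {"label": "Stakeholder engagement and co-creation", "mention": "stakeholder workshops"},
    {"label": "Policy advice and consultation", "mention": "advice to ministries"},
]
found, missing = split_verbatim(sample_mentions, sample_text)
print(len(found), "verbatim,", len(missing), "not found in text")
```

Logging the `missing` list per project gives a rough measure of how often the model departs from the verbatim rule.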

