# Assigning SDG labels using structured output

SDG: [Sustainable Development Goals](https://sdgs.un.org/goals)

- Read a subset of publications together with their abstracts
- Use [structured outputs](https://openai.com/index/introducing-structured-outputs-in-the-api/) to retrieve an SDG label for each title and for each abstract
- Structured outputs are [not supported by all models](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/structured-outputs?tabs=python-secure%2Cdotnet-entra-id&pivots=programming-language-csharp#supported-models)

## 1. Read data, rebuild abstracts from the inverted index

In [1]:
import json
import pandas as pd

In [2]:
n_subsets = 20

In [3]:
df = pd.read_csv("publications2024.csv", nrows=n_subsets)

In [4]:
with open("abstracts_inverted.json", "r") as f:
    abstracts_inverted = json.load(f)

abstracts = []
for item in abstracts_inverted[:n_subsets]:
    # Skip publications without an abstract
    if not item["abstract"]:
        continue

    # The inverted index maps each word to the positions where it occurs.
    # Collect (position, word) pairs instead of filling a fixed-size list,
    # so abstracts of any length are handled
    positions = [
        (index, word)
        for word, indexes in item["abstract"].items()
        for index in indexes
    ]

    # Sort by position and join the words to restore the original text
    abstract = " ".join(word for _, word in sorted(positions) if word)

    abstracts.append({"id": item["id"], "abstract": abstract})
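
To make the data format concrete, here is a minimal sketch of the round trip between plain text and an inverted index (word mapped to the list of positions where it occurs). The helper names `invert` and `rebuild` and the toy sentence are hypothetical and not part of the notebook's data.

```python
def invert(text):
    """Build an inverted index: word -> list of positions in the text."""
    inverted = {}
    for position, word in enumerate(text.split()):
        inverted.setdefault(word, []).append(position)
    return inverted


def rebuild(inverted):
    """Restore the text by sorting all (position, word) pairs."""
    positions = [
        (position, word)
        for word, word_positions in inverted.items()
        for position in word_positions
    ]
    return " ".join(word for _, word in sorted(positions))


# Hypothetical toy sentence, used only to illustrate the format
sentence = "the cat chased the mouse"
index = invert(sentence)
print(index)           # "the" appears at positions 0 and 3
print(rebuild(index))  # the original sentence again
```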

In [5]:
df = df.merge(pd.DataFrame(abstracts))  # only keep publications with abstracts
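
The merge drops publications without abstracts because `DataFrame.merge` performs an inner join on the shared columns by default. The sketch below illustrates this with hypothetical toy frames, assuming `id` is the only column the two frames have in common.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from publications2024.csv
publications = pd.DataFrame({"id": ["a", "b", "c"], "title": ["T1", "T2", "T3"]})
toy_abstracts = pd.DataFrame({"id": ["a", "c"], "abstract": ["First.", "Third."]})

# merge() defaults to how="inner" and joins on the shared column ("id"),
# so publication "b", which has no abstract, is dropped
merged = publications.merge(toy_abstracts)
print(merged)

# Stating the key and the join type explicitly gives the same result
explicit = publications.merge(toy_abstracts, on="id", how="inner")
```

Writing `on="id", how="inner"` explicitly may be preferable in practice, since it keeps the join stable if another shared column is added later.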

## 2. Assign labels

In [6]:
import requests
import json

from config import api_key

In [7]:
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {api_key}"
}

In [None]:
# We define a json schema to structure our response (label + sdg number)
# https://json-schema.org/
response_schema = {
    "name": "sdg_response",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "sdg": {"type": "integer", "minimum": 1, "maximum": 17},
            "label": {"type": "string"}  # we should restrict this using enum: https://json-schema.org/understanding-json-schema/reference/enum
        },
        "required": ["sdg", "label"],
        "additionalProperties": False
    }
}
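
The comment in the schema suggests restricting `label` with an enum. A minimal sketch of what that could look like follows; the names `SDG_LABELS`, `response_schema_enum` and `is_consistent` are hypothetical, and the capitalization of the labels is a choice made here. Note that an enum fixes the spelling of the label but does not by itself guarantee that the label matches the `sdg` number, hence the small consistency check.

```python
# Short titles of the 17 SDGs, in goal order (index 0 is goal 1)
SDG_LABELS = [
    "No Poverty",
    "Zero Hunger",
    "Good Health and Well-being",
    "Quality Education",
    "Gender Equality",
    "Clean Water and Sanitation",
    "Affordable and Clean Energy",
    "Decent Work and Economic Growth",
    "Industry, Innovation and Infrastructure",
    "Reduced Inequalities",
    "Sustainable Cities and Communities",
    "Responsible Consumption and Production",
    "Climate Action",
    "Life Below Water",
    "Life on Land",
    "Peace, Justice and Strong Institutions",
    "Partnerships for the Goals",
]

# Same structure as the schema above, with "label" limited to the fixed list
response_schema_enum = {
    "name": "sdg_response",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "sdg": {"type": "integer", "minimum": 1, "maximum": 17},
            "label": {"type": "string", "enum": SDG_LABELS},
        },
        "required": ["sdg", "label"],
        "additionalProperties": False,
    },
}


def is_consistent(result):
    """Check that the returned label belongs to the returned SDG number."""
    return SDG_LABELS[result["sdg"] - 1] == result["label"]
```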

In [9]:
url = "https://ai-research-proxy.azurewebsites.net/chat/completions"
persona = "You are an expert on Sustainable Development Goals. For a given publication you can assign the most suitable SDG."

def assign_sdg(text):
    """Send one text to the chat completions endpoint and return the raw response."""
    data = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": response_schema
        }
    }
    return requests.post(url=url, data=json.dumps(data), headers=headers)

# Label each publication twice: once from its title, once from its abstract
responses_titles = [assign_sdg(title) for title in df.title]
responses_abstracts = [assign_sdg(abstract) for abstract in df.abstract]

In [10]:
len(responses_titles), len(responses_abstracts)

(14, 14)

In [11]:
sdgs_titles = [json.loads(response.json()["choices"][0]["message"]["content"]) for response in responses_titles]
sdgs_abstracts = [json.loads(response.json()["choices"][0]["message"]["content"]) for response in responses_abstracts]
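
The list comprehensions above assume that every request succeeded and returned valid JSON. A more defensive variant could look like the sketch below; `parse_sdg` and `FakeResponse` are hypothetical names, and `FakeResponse` only mimics the two members of a `requests` response used here (`status_code` and `json()`) so the example runs without network access.

```python
import json


def parse_sdg(response):
    """Return the parsed SDG dict, or None if the call failed or the content is unusable."""
    if response.status_code != 200:
        return None
    try:
        content = response.json()["choices"][0]["message"]["content"]
        return json.loads(content)
    except (KeyError, IndexError, TypeError, ValueError):
        # Unexpected response layout, empty content, or invalid JSON
        return None


class FakeResponse:
    """Hypothetical stand-in for a requests response, for illustration only."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


ok = FakeResponse(
    200,
    {"choices": [{"message": {"content": '{"sdg": 13, "label": "Climate Action"}'}}]},
)
failed = FakeResponse(500, {"error": "server error"})

print(parse_sdg(ok))
print(parse_sdg(failed))
```

Entries that come back as `None` would need to be filtered out or replaced before the results are turned into a DataFrame.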

In [12]:
pd.concat([df.sustainable_development_goals, pd.DataFrame(sdgs_titles), pd.DataFrame(sdgs_abstracts)], axis=1)

| | sustainable_development_goals | sdg (title) | label (title) | sdg (abstract) | label (abstract) |
|---|---|---|---|---|---|
| 0 | Partnerships for the goals | 3 | Good Health and Well-being | 3 | Good Health and Well-being |
| 1 | Good health and well-being | 3 | Good Health and Well-Being | 3 | Good Health and Well-Being |
| 2 | Good health and well-being | 3 | Good Health and Well-being | 3 | Good Health and Well-Being |
| 3 | Partnerships for the goals | 3 | Good Health and Well-being | 3 | Good Health and Well-being |
| 4 | Climate action | 13 | Climate Action | 13 | Climate Action |
| 5 | Industry, innovation and infrastructure | 9 | Industry, Innovation and Infrastructure | 9 | Industry, Innovation and Infrastructure |
| 6 | Good health and well-being | 3 | Good Health and Well-being | 3 | Good Health and Well-Being |
| 7 | Partnerships for the goals | 3 | Good Health and Well-being | 3 | Good Health and Well-being |
| 8 | | 9 | Industry, Innovation, and Infrastructure | 9 | Industry, Innovation, and Infrastructure |
| 9 | | 9 | Industry, Innovation and Infrastructure | 9 | Industry, Innovation and Infrastructure |
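
The labels in the output differ in capitalization ("Well-being" versus "Well-Being"), so a direct string comparison with the reference column would undercount matches. The sketch below shows one way to compare after normalizing case, using three rows copied from the output above (rows 0, 1 and 4); the column names `label_title` and `label_abstract` and the helper `normalize` are hypothetical.

```python
import pandas as pd

# Rows 0, 1 and 4 from the output above
results = pd.DataFrame({
    "sustainable_development_goals": [
        "Partnerships for the goals",
        "Good health and well-being",
        "Climate action",
    ],
    "label_title": [
        "Good Health and Well-being",
        "Good Health and Well-Being",
        "Climate Action",
    ],
    "label_abstract": [
        "Good Health and Well-being",
        "Good Health and Well-Being",
        "Climate Action",
    ],
})


def normalize(labels):
    """Strip whitespace and lower-case labels so only real differences remain."""
    return labels.str.strip().str.lower()


reference = normalize(results["sustainable_development_goals"])
title_match = normalize(results["label_title"]) == reference
abstract_match = normalize(results["label_abstract"]) == reference

# Share of publications where the predicted label equals the reference label
print(f"title-based agreement:    {title_match.mean():.2f}")
print(f"abstract-based agreement: {abstract_match.mean():.2f}")
```

Case normalization does not cover punctuation variants such as "Industry, Innovation, and Infrastructure"; restricting the label with an enum in the schema would avoid those at the source.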
