# Labelling relevant vs non-relevant data

- Load the data
- Define prompt
- OpenAI function calling
- Save the labels in a standard format
- Keep track of data that's already labelled

Open questions:
- Should we run the labelling at least twice and compare the results?
- Alternatively, we could shuffle the order of the categories in the prompt to reduce ordering bias.
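
One way to approach the first open question is to label the same texts in two separate runs and measure how often the labels agree. The snippet below is a minimal sketch of that comparison: the helper names `agreement_rate` and `disagreements` are hypothetical, and the two example runs are made-up values for illustration, not real labelling results.

```python
from collections import Counter
from typing import List


def agreement_rate(labels_a: List[str], labels_b: List[str]) -> float:
    """Return the fraction of texts that received the same label in both runs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both runs must label the same texts in the same order")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def disagreements(labels_a: List[str], labels_b: List[str]) -> Counter:
    """Count each (run A label, run B label) pair where the two runs differ."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Made-up labels for four texts, purely for illustration
run_1 = ["Relevant", "Non-relevant", "Unclear", "Relevant"]
run_2 = ["Relevant", "Non-relevant", "Relevant", "Relevant"]

print(agreement_rate(run_1, run_2))
print(disagreements(run_1, run_2))
```

A low agreement rate, or disagreements concentrated in one pair of categories, would suggest that the category descriptions need tightening before labelling at scale.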

In [6]:
from discovery_child_development.utils.openai_utils import client
import json
import random

def create_category_description_string(categories: dict, randomise: bool=False) -> str:
    """Create the category descriptions for the prompt
    
    Args:
        categories (dict): The categories, in the format {category: description}
        randomise (bool, optional): Whether to randomise the order of the categories. Defaults to False.

    Returns:
        str: The category descriptions with each category and description in a new line
    """
    category_descriptions = ""
    all_categories = list(categories.keys())
    if randomise:
        # Shuffle the category order so that it is not the same in every prompt
        all_categories = random.sample(all_categories, len(all_categories))
    for category in all_categories:
        category_descriptions += f"{category}: {categories[category]}\n"
    return category_descriptions

In [7]:
# Test texts
texts = [
    # relevant
    "A fun activity for babies aged 3-6 months to help development and language learning. Try blowing bubbles with your baby and see how they react. Talk to them about what they're seeing.",
    # non-relevant (child is too old)
    "A fun activity for 6 year old children to help development and language learning. Try blowing bubbles with your child and see how they react. Talk to them about what they're seeing.",
    # non-relevant (non human)
    "A fun activity for a piglet to help development and learning. Try blowing bubbles with your little one and see how they react. Talk to them.",
    # unclear (age not specified)
    "A fun activity for a child to help development and learning. Try blowing bubbles and see how they react. Talk to them.",
]

category_descriptions = {
    "Relevant": "Text that describes an innovation, technology or aspect related to human child development and developmental needs, where the child is between 0 and 5 years old (including 5 year olds). If the age is not specified, texts about infants, babies, toddlers or preschool are also relevant.",
    "Non-relevant": "Text about any other topic than child development, or if the children are too old (older than 5 years old and/or already going to school), or if the text is about non-human children.",
    "Unclear": "Text that is about human child development and developmental needs, but the age of the children has not been explicitly specified."
}

categories = list(category_descriptions.keys())
n_categories = len(categories)

In [3]:
print(create_category_description_string(category_descriptions))

Relevant: Text that describes an innovation, technology or aspect related to human child development and developmental needs, where the child is between 0 and 5 years old (including 5 year olds). If the age is not specified, texts about infants, babies, toddlers or preschool are also relevant.
Non-relevant: Text about any other topic than child development, or if the children are too old (older than 5 years old and/or already going to school), or if the text is about non-human children.
Unclear: Text that is about human child development and developmental needs, but the age of the children has not been explicitly specified.



In [4]:
function = {
    "name": "predict_relevance",
    "description": "Predict the relevance of a given text",
    "parameters": {
        "type": "object",
        "properties": {
            "prediction": {
                "type": "string",
                "enum": ["Relevant", "Non-relevant", "Unclear"],
                "description": "The predicted relevance of the given text. Infer this from the provided relevance criteria."
            }
        },
        "required": ["prediction"]
    }
}


In [5]:
text = texts[3]
prompt = {"role": "user", "content": f"###Relevance criteria###\nTexts can be categorised in the following {n_categories} categories.\n{create_category_description_string(category_descriptions)}\n\n###Instructions###\nCategorise the following text to one relevance category.\n{text}\n"}
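
The prompt above is built inline for a single text. When labelling many texts, it may be convenient to wrap the same template in a function that can also shuffle the category order. The following is a sketch under that assumption: `build_prompt` and its `seed` parameter are hypothetical names, and the category formatting is re-implemented inside the block only to keep it self-contained.

```python
import random
from typing import Optional


def build_prompt(
    text: str,
    categories: dict,
    randomise: bool = False,
    seed: Optional[int] = None,
) -> dict:
    """Build a user message using the same template as the inline prompt."""
    names = list(categories)
    if randomise:
        # A seeded generator makes the shuffled order reproducible
        rng = random.Random(seed)
        names = rng.sample(names, len(names))
    descriptions = "".join(f"{name}: {categories[name]}\n" for name in names)
    content = (
        "###Relevance criteria###\n"
        f"Texts can be categorised in the following {len(names)} categories.\n"
        f"{descriptions}\n\n"
        "###Instructions###\n"
        "Categorise the following text to one relevance category.\n"
        f"{text}\n"
    )
    return {"role": "user", "content": content}


# Toy categories, for illustration only
example = build_prompt("Some text", {"A": "first", "B": "second"}, randomise=True, seed=0)
print(example["content"])
```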


In [8]:
r = client.chat.completions.create(
    model="gpt-4",
    temperature=0.0,
    messages=[prompt],
    functions=[function],
    function_call={"name": "predict_relevance"},
)

2023-11-29 13:04:24,263 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [18]:
json.loads(r.choices[0].message.function_call.arguments)

{'prediction': 'Unclear'}
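
The `arguments` field is a JSON string produced by the model, so it may occasionally be malformed or contain a label outside the allowed set. Below is a minimal defensive parser, written as a sketch: `parse_prediction` and `ALLOWED_LABELS` are hypothetical names, and the label set is copied from the `enum` in the function definition.

```python
import json
from typing import Optional

# Labels copied from the "enum" in the function definition
ALLOWED_LABELS = {"Relevant", "Non-relevant", "Unclear"}


def parse_prediction(arguments: str) -> Optional[str]:
    """Return the predicted label, or None if the arguments cannot be trusted."""
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError:
        # The model returned something that is not valid JSON
        return None
    if not isinstance(parsed, dict):
        return None
    prediction = parsed.get("prediction")
    if isinstance(prediction, str) and prediction in ALLOWED_LABELS:
        return prediction
    return None


print(parse_prediction('{"prediction": "Unclear"}'))
```

Returning `None` rather than raising lets a labelling loop skip the text and retry it later.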

In [9]:
import pandas as pd
df = (
    pd.read_csv("relevance_labels.csv")
    .drop(columns=["Unnamed: 0"])
    .astype({"timestamp": str})
)
df.head()

,prediction,id,source,model,timestamp,text
0,Relevant,https://openalex.org/W3214165237,openalex,gpt-4-1106-preview,20231231000000,Parenting Contributions to Latinx Children’s D...
1,Not-relevant,https://openalex.org/W4205457572,openalex,gpt-4-1106-preview,20231231000000,Vitamin D Reduces Covid-19 Mortality and Serio...
2,Not-specified,https://openalex.org/W4285223789,openalex,gpt-4-1106-preview,20231231000000,Child development and artistic development in ...
3,Relevant,https://openalex.org/W3156698247,openalex,gpt-4-1106-preview,20231231000000,Changes in the social-emotional functioning of...
4,Relevant,https://openalex.org/W3213749934,openalex,gpt-4-1106-preview,20231231000000,Child Protection Measures as a Legal Instrumen...


In [10]:
import json
# Save as a JSONL file, one record per line
dicts = df.to_dict(orient="records")
with open("relevance_labels.jsonl", "w") as f:
    for d in dicts:
        json.dump(d, f)
        f.write("\n")

In [15]:
from discovery_child_development.utils.utils import load_jsonl

data = load_jsonl("relevance_labels.jsonl")
pd.DataFrame(data).tail(6)

,prediction,id,source,model,timestamp,text
2116,Not-specified,CN-216418200-U,patents,gpt-4-1106-preview,20240104181548,A kind of rehabilitation physical training ins...
2117,Not-specified,https://openalex.org/W4384627278,openalex,gpt-4-1106-preview,20240105104801,Back to Actual Behavior – A Modest Proposal on...
2118,Not-relevant,https://openalex.org/W2969279225,openalex,gpt-4-1106-preview,20240105104801,Fertility Preservation Using GnRH Agonists: Ra...
2119,Not-relevant,https://openalex.org/W4249457706,openalex,gpt-4-1106-preview,20240105104801,A framework for professional learning and deve...
2120,Not-relevant,https://openalex.org/W3156141421,openalex,gpt-4-1106-preview,20240105104802,The roles of noninvasive mechanical ventilatio...
2121,Relevant,https://openalex.org/W4200308700,openalex,gpt-4-1106-preview,20240105104802,Parents’ Attitudes Toward Domestic Violence as...
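
To keep track of data that is already labelled, one option is to read the `id` values from the JSONL file before each run and append new labels one line at a time. The sketch below assumes every saved record has an `id` field, as in the table above; `load_labelled_ids` and `append_label` are hypothetical helpers, not functions from `discovery_child_development`.

```python
import json
from pathlib import Path


def load_labelled_ids(path: str) -> set:
    """Return the set of ids already present in a JSONL labels file."""
    file = Path(path)
    if not file.exists():
        # Nothing has been labelled yet
        return set()
    with file.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def append_label(path: str, record: dict) -> None:
    """Append one labelled record to the JSONL file."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def select_unlabelled(candidates: list, labelled_ids: set) -> list:
    """Keep only the candidate records whose id has not been labelled yet."""
    return [c for c in candidates if c["id"] not in labelled_ids]
```

Appending one line per record means an interrupted run loses at most the record in progress, and the next run can resume from the ids already on disk.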
