# Labelling relevant vs non-relevant data

- Load the data
- Define prompt
- OpenAI function calling
- Save the labels in a standard format
- Keep track of data that's already labelled
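
The last two steps could be sketched as below, assuming labels are stored one JSON object per line (JSONL). The helper names `load_labelled_ids` and `append_label` are hypothetical, and the `id` and `prediction` field names only mirror the columns that appear later in this notebook; the actual file layout is an assumption.

```python
import json
from pathlib import Path


def load_labelled_ids(path: str) -> set:
    """Return the ids already present in a JSONL label file (empty set if the file is missing)."""
    file = Path(path)
    if not file.exists():
        return set()
    with file.open() as f:
        # skip blank lines so a trailing newline does not break parsing
        return {json.loads(line)["id"] for line in f if line.strip()}


def append_label(path: str, text_id: str, prediction: str) -> None:
    """Append one label to the file as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps({"id": text_id, "prediction": prediction}) + "\n")
```

Before labelling a text, its id can be checked against `load_labelled_ids(path)` so that the same text is not sent to the API twice.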

Open questions:
- Should we run the labelling at least twice and compare the results?
- Alternatively, we could shuffle the order of the categories in the prompt to reduce ordering bias
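
If the labelling is repeated, one way to compare two runs is the raw agreement rate together with Cohen's kappa, which corrects for agreement expected by chance. This is a minimal sketch using only the standard library; the function names and the example labels are illustrative and not part of the notebook.

```python
from collections import Counter


def agreement_rate(run_a: list, run_b: list) -> float:
    """Share of texts that received the same label in both runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("Both runs must be non-empty and of equal length")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


def cohens_kappa(run_a: list, run_b: list) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(run_a)
    observed = agreement_rate(run_a, run_b)
    counts_a, counts_b = Counter(run_a), Counter(run_b)
    # chance agreement: probability that both runs pick the same label independently
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels from two hypothetical runs over the same four texts
run_1 = ["Relevant", "Relevant", "Non-relevant", "Unclear"]
run_2 = ["Relevant", "Non-relevant", "Non-relevant", "Unclear"]
print(agreement_rate(run_1, run_2), cohens_kappa(run_1, run_2))
```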

In [6]:
from discovery_child_development.utils.openai_utils import client
import json
import random

def create_category_description_string(categories: dict, randomise: bool = False) -> str:
    """Create the category descriptions for the prompt

    Args:
        categories (dict): The categories, in the format {category: description}
        randomise (bool, optional): Whether to randomise the order of the categories. Defaults to False.

    Returns:
        str: The category descriptions, with each category and its description on a new line
    """
    all_categories = list(categories.keys())
    if randomise:
        # shuffle the categories so that their order in the prompt is not always the same
        all_categories = random.sample(all_categories, len(all_categories))
    return "".join(f"{category}: {categories[category]}\n" for category in all_categories)
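
If a shuffled prompt needs to be reproduced later (for example, when re-running a labelling pass), one option is to draw the order from a seeded `random.Random` instance. The sketch below is a standalone variant; `shuffled_category_lines` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random


def shuffled_category_lines(categories: dict, seed: int) -> str:
    """Return 'category: description' lines in an order determined by the seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    names = rng.sample(list(categories), len(categories))
    return "".join(f"{name}: {categories[name]}\n" for name in names)
```

Storing the seed alongside each label would make it possible to reconstruct the exact prompt that produced it.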

In [7]:
# Test texts
texts = [
    # relevant
    "A fun activity for babies aged 3-6 months to help development and language learning. Try blowing bubbles with your baby and see how they react. Talk to them about what they're seeing.",
    # non-relevant (child is too old)
    "A fun activity for 6 year old children to help development and language learning. Try blowing bubbles with your child and see how they react. Talk to them about what they're seeing.",
    # non-relevant (non human)
    "A fun activity for a piglet to help development and learning. Try blowing bubbles with your little one and see how they react. Talk to them.",
    # unclear (age not specified)
    "A fun activity for a child to help development and learning. Try blowing bubbles and see how they react. Talk to them.",
]

category_descriptions = {
    "Relevant": "Text that describes an innovation, technology or aspect related to human child development and developmental needs, where the child is between 0 and 5 years old (including 5 year olds). If the age is not specified, texts about infants, babies, toddlers or preschool are also relevant.",
    "Non-relevant": "Text about any other topic than child development, or if the children are too old (older than 5 years old and/or already going to school), or if the text is about non-human children.",
    "Unclear": "Text that is about human child development and developmental needs, but the age of the children has not been explicitly specified."
}

categories = list(category_descriptions.keys())
n_categories = len(categories)

In [3]:
print(create_category_description_string(category_descriptions))

Relevant: Text that describes an innovation, technology or aspect related to human child development and developmental needs, where the child is between 0 and 5 years old (including 5 year olds). If the age is not specified, texts about infants, babies, toddlers or preschool are also relevant.
Non-relevant: Text about any other topic than child development, or if the children are too old (older than 5 years old and/or already going to school), or if the text is about non-human children.
Unclear: Text that is about human child development and developmental needs, but the age of the children has not been explicitly specified.



In [4]:
function = {
    "name": "predict_relevance",
    "description": "Predict the relevance of a given text",
    "parameters": {
        "type": "object",
        "properties": {
            "prediction": {
                "type": "string",
                "enum": ["Relevant", "Non-relevant", "Unclear"],
                "description": "The predicted relevance of the given text. Infer this from the provided relevance criteria."
            }
        },
        "required": ["prediction"]
    }
}


In [5]:
text = texts[3]
prompt = {"role": "user", "content": f"###Relevance criteria###\nTexts can be categorised in the following {n_categories} categories.\n{create_category_description_string(category_descriptions)}\n\n###Instructions###\nCategorise the following text to one relevance category.\n{text}\n"}


In [8]:
r = client.chat.completions.create(
   model="gpt-4",
   temperature=0.0,
   messages=[prompt],
   functions=[function],
   function_call={"name": "predict_relevance"},
)

2023-11-29 13:04:24,263 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [18]:
json.loads(r.choices[0].message.function_call.arguments)

{'prediction': 'Unclear'}
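
The function-call arguments arrive as a model-generated JSON string, so it may be worth guarding against malformed JSON or a label outside the expected set before saving anything. A minimal sketch, where `parse_prediction` and `ALLOWED_LABELS` are hypothetical names and the allowed labels are copied from the `enum` above:

```python
import json

ALLOWED_LABELS = {"Relevant", "Non-relevant", "Unclear"}


def parse_prediction(arguments: str, allowed=ALLOWED_LABELS):
    """Return the predicted label, or None if the arguments are malformed or the label is unexpected."""
    try:
        payload = json.loads(arguments)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("prediction")
    # only accept string labels that belong to the allowed set
    return label if isinstance(label, str) and label in allowed else None
```

In the cell above, this would be called as `parse_prediction(r.choices[0].message.function_call.arguments)`; a `None` result can be logged and the text retried or skipped.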

## Preparing a test sample of Patents and OpenAlex data

In [25]:
from discovery_child_development.getters import openalex
from discovery_child_development.utils.utils import load_jsonl
from discovery_child_development.getters import patents
import pandas as pd

openalex_sample = "data/relevance/relevance_openalex.jsonl"
patents_sample = "data/relevance/relevance_patents.jsonl"

output_file = 'data/relevance/relevance_test_sample.csv'

In [18]:
openalex_df = openalex.get_abstracts()

patents_df = (
    patents.get_patents_from_s3()
    .assign(text=lambda df: df["title"] + ". " + df["abstract"])
)

In [36]:
df_openalex = (
    pd.DataFrame(load_jsonl(openalex_sample))
    .assign(source='openalex')
    .merge(openalex_df[['id', 'text']], how='left', on='id')
)
df_patents = (
    pd.DataFrame(load_jsonl(patents_sample))
    .assign(source='patents')
    .merge(patents_df[['publication_number', 'text']].rename(columns={'publication_number': 'id'}), how='left', on='id')
)
df_labels = pd.concat([df_openalex, df_patents]).reset_index(drop=True)

In [24]:
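# sample up to n random rows for each unique combination of 'prediction' and 'source'
def sample_labelled_data(df, n=10):
    return (
        df.groupby(['prediction', 'source'])
        # cap the sample size at the group size so that small groups do not raise an error
        .apply(lambda x: x.sample(min(n, len(x))))
        .reset_index(drop=True)
    )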

sample_labelled_data(df_labels).to_csv(output_file, index=False)

In [38]:
len(df_labels)

1262

In [39]:
df_labels.groupby(['prediction', 'source']).count().reset_index(drop=False)

      prediction    source   id  text
0   Not-relevant  openalex  380   380
1   Not-relevant   patents   94    94
2  Not-specified  openalex   68    66
3  Not-specified   patents  225   225
4       Relevant  openalex  302   302
5       Relevant   patents  193   193
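
The Not-specified/openalex row shows 68 ids but only 66 texts: `count()` skips missing values, so two of those rows have no text after the left merge. One way to tell whether an id had no match at all or matched a record with an empty text is the `indicator` option of `merge`. The sketch below uses small made-up tables rather than the real data.

```python
import pandas as pd

# Toy stand-ins for the label and text tables (hypothetical ids and texts)
labels = pd.DataFrame(
    {"id": ["a", "b", "c"], "prediction": ["Relevant", "Not-specified", "Not-specified"]}
)
text_table = pd.DataFrame({"id": ["a", "b"], "text": ["first abstract", None]})

# indicator=True adds a '_merge' column recording where each row came from
merged = labels.merge(text_table, how="left", on="id", indicator=True)

# ids with no matching record in the text table
no_match = merged[merged["_merge"] == "left_only"]
# ids that matched a record whose text is missing
empty_text = merged[(merged["_merge"] == "both") & merged["text"].isna()]

print(no_match["id"].tolist(), empty_text["id"].tolist())
```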


In [41]:
from discovery_child_development import S3_BUCKET

In [44]:
from nesta_ds_utils.loading_saving import S3

S3.upload_obj(df_labels, S3_BUCKET, 'data/labels/afs_relevance/relevance_labels_20231212.csv')