# Categorization of abstracts

In [1]:
import os
from dotenv import load_dotenv
from openai import OpenAI

%load_ext autoreload
%autoreload 2

In [2]:
# load environment variables
load_dotenv()

# get API key
api_key = os.getenv("OPENAI_API_KEY")

# init OpenAI client
client = OpenAI(api_key=api_key)


def prompt_gpt(
    model: str,
    prompt: str,
    temperature: float = 0.0,
    max_tokens: int = 200,
):
    # send the system and user messages to the chat completions endpoint
    chat_completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant designed to carefully analyze academic abstracts based on specific inclusion and exclusion criteria.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return chat_completion.choices[0].message.content

In [3]:
model = "gpt-4o-mini"
prompt = "Say this is a test and nothing else."
response = prompt_gpt(model, prompt)
print(response)

This is a test.


In [31]:
import pandas as pd
from gpt.config import get_data_path


file_name = "limit_n=30_text_answers_gpt-4o.csv"
input_path = f"{get_data_path()}/generated_answers/{file_name}"
df = pd.read_csv(input_path)
df.head(5)

Unnamed: 0,title,generated_answers,reasoning
0,atlaslang mts 1- arabic text language into ara...,1. Challenges of morphological complexity A...,The abstract provides a clear overview of the ...
1,"utilizing lexical similarity between related, ...",1. Challenges of morphological complexity A...,The abstract primarily focuses on the effectiv...
2,structural biases for improving transformers o...,1. Challenges of morphological complexity A...,The abstract provides a comprehensive overview...
3,comparative study of low resource digaru langu...,1. Challenges of morphological complexity ...,The abstract provides a general overview of th...
4,hybrid approaches for augmentation of translat...,,


In [5]:
answer_outputs = df["generated_answers"].tolist()
answer_outputs[:5]

['1. Challenges of morphological complexity   Answer: Yes, the paper discusses challenges posed by the morphological complexity of the Arabic language.   Evidence: The abstract mentions that "Arabic is a Semitic language written from right to left. It is a derivational and flexional language, which is morphologically complex," indicating that the complexity of Arabic morphology presents challenges for machine translation.  2. Proposed techniques   Answer: The paper proposes a combination of rule-based and example-based techniques to address the challenges of morphological complexity in low-resource machine translation.   Evidence: The abstract states that the system "is based on rule-based Interlingua and example-based approaches," highlighting the specific methodologies employed to tackle the morphological challenges.  3. Morphology-aware techniques   Answer: Yes, the paper utilizes morphology-aware techniques, specifically a morphological analyzer for processing Arabic text.   Eviden

In [6]:
import re

# regular expressions for extracting answers
rq_patterns = {
    "rq1": r"1\. Challenges of morphological complexity\s+Answer: (.*?)\s+Evidence:",
    "rq2": r"2\. Proposed techniques\s+Answer: (.*?)\s+Evidence:",
    "rq3": r"3\. Morphology-aware techniques\s+Answer: (.*?)\s+Evidence:",
    "rq4": r"4\. Specific findings per morphological typology.*?Answer: (.*?)\s+Evidence:",
}


# function to extract answers based on patterns
def extract_answers(row, patterns):
    answers = {}
    for key, pattern in patterns.items():
        # rows without a generated answer are NaN (a float), which re.search rejects
        match = re.search(pattern, row) if isinstance(row, str) else None
        answers[key] = match.group(1) if match else None
    return answers


# apply the extraction function to each row (yields a Series of dicts)
answers_df = df["generated_answers"].apply(lambda x: extract_answers(x, rq_patterns))
# expand the dicts into one column per research question
answers_df = pd.DataFrame(answers_df.tolist(), index=df.index)
# drop rq columns left over from a previous run so the cell can be re-executed
df = df.drop(columns=[col for col in df.columns if col.startswith("rq")])
# merge the extracted answers with the original dataframe
df = pd.concat([df, answers_df], axis=1)
df.head(3)

Unnamed: 0,title,generated_answers,reasoning,rq1,rq2,rq3,rq4
0,atlaslang mts 1- arabic text language into ara...,1. Challenges of morphological complexity An...,The abstract reveals that the paper addresses ...,"Yes, the paper discusses challenges posed by t...",The paper proposes a combination of rule-based...,"Yes, the paper utilizes morphology-aware techn...",The abstract does not provide specific finding...
1,"utilizing lexical similarity between related, ...",1. Challenges of morphological complexity An...,The abstract primarily focuses on the effectiv...,The abstract does not explicitly discuss the c...,The paper proposes using subword-level pivot-b...,"The paper employs subword modeling techniques,...",The abstract indicates that subword-level mode...
2,structural biases for improving transformers o...,1. Challenges of morphological complexity An...,The abstract provides a clear overview of the ...,"Yes, the paper discusses challenges posed by t...",The paper proposes two techniques to address t...,"Yes, the paper uses morphology-aware technique...",The paper provides findings specific to agglut...
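
The extraction above relies on every response following the numbered `Answer: ... Evidence:` layout. The sketch below applies the same non-greedy pattern to a made-up response, under two assumptions that are not in the notebook: `re.DOTALL` is passed so an answer may span line breaks, and non-string rows (an empty cell is read by pandas as NaN) yield `None`. The names `RQ1_PATTERN`, `extract_answer`, and `sample` are illustrative only.

```python
import re

# same shape as the rq1 pattern used in the notebook
RQ1_PATTERN = r"1\. Challenges of morphological complexity\s+Answer: (.*?)\s+Evidence:"


def extract_answer(text, pattern):
    """Return the captured answer, or None for missing or non-matching input."""
    if not isinstance(text, str):  # NaN cells from pandas are floats
        return None
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


# made-up response, split over several lines on purpose
sample = (
    "1. Challenges of morphological complexity\n"
    "Answer: Yes, the paper discusses\nmorphological challenges.\n"
    "Evidence: The abstract mentions ..."
)

print(extract_answer(sample, RQ1_PATTERN))
print(extract_answer(float("nan"), RQ1_PATTERN))
```

The non-greedy `(.*?)` stops at the first `Evidence:` marker, so the evidence text of the same question is never pulled into the answer.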


In [7]:
# extract the answers into separate lists
rq1_answers = df["rq1"].tolist()
rq2_answers = df["rq2"].tolist()
rq3_answers = df["rq3"].tolist()
rq4_answers = df["rq4"].tolist()

# print the first extracted rq1 answer, followed by the raw output it came from
print(rq1_answers[0])
print(answer_outputs[0])

Yes, the paper discusses challenges posed by the morphological complexity of the Arabic language.
1. Challenges of morphological complexity   Answer: Yes, the paper discusses challenges posed by the morphological complexity of the Arabic language.   Evidence: The abstract mentions that "Arabic is a Semitic language written from right to left. It is a derivational and flexional language, which is morphologically complex," indicating that the complexity of Arabic morphology presents challenges for machine translation.  2. Proposed techniques   Answer: The paper proposes a combination of rule-based and example-based techniques to address the challenges of morphological complexity in low-resource machine translation.   Evidence: The abstract states that the system "is based on rule-based Interlingua and example-based approaches," highlighting the specific methodologies employed to tackle the morphological challenges.  3. Morphology-aware techniques   Answer: Yes, the paper utilizes morphol

In [8]:
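def combine_answers(answers: list) -> str:
    # skip answers that could not be extracted (None or NaN), which str.join rejects
    combined = ", ".join(a for a in answers if isinstance(a, str))
    return combined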


# combine all answers into a single string
rq1_answers_combined = combine_answers(rq1_answers)
rq2_answers_combined = combine_answers(rq2_answers)
rq3_answers_combined = combine_answers(rq3_answers)
rq4_answers_combined = combine_answers(rq4_answers)
# print the combined answers
rq1_answers_combined[:200]

'Yes, the paper discusses challenges posed by the morphological complexity of the Arabic language., The abstract does not explicitly discuss the challenges posed by the morphological complexity of the '

In [9]:
def get_prompt(answer: str) -> str:
    prompt = f"""
    Based on the following analysis, identify five **thematic categories** that summarize the key concepts, methods, or contributions mentioned.

    --- BEGIN ANALYSIS ---
    {answer}
    --- END ANALYSIS ---

    The categories should reflect actual *topics or themes* found in the answers, such as techniques, challenges, or typologies discussed (e.g., "Cross-lingual Transfer Learning", "Morphological Generation", "Polysynthetic Language Challenges").

    Only return the categories, comma-separated, in the following format:
    OUTPUT: Category 1, Category 2, Category 3, Category 4, Category 5
    """
    return prompt


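def extract_output(result: str) -> str:
    # re.DOTALL lets '.' match newlines, so a multi-line category list is captured whole
    match = re.search(r"OUTPUT:\s*(.*)", result, re.DOTALL)
    if match:
        # flatten line breaks, then trim whitespace and stray markdown asterisks
        output_text = match.group(1).replace("\n", " ").strip().strip("*")
        return output_text
    else:
        # fall back to the raw response when the OUTPUT marker is missing
        return result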


responses = {}
combined_answers = [
    rq1_answers_combined,
    rq2_answers_combined,
    rq3_answers_combined,
    rq4_answers_combined,
]
# loop through each answer and query ChatGPT
for i, answer in enumerate(combined_answers):
    prompt = get_prompt(answer)
    # query ChatGPT
    response = prompt_gpt(model, prompt, max_tokens=500)
    response = extract_output(response)
    responses[f"rq{i+1}_categories"] = response
    print(f"Response {i+1}/{len(combined_answers)}...")

Response 1/4...
Response 2/4...
Response 3/4...
Response 4/4...


In [10]:
for key, value in responses.items():
    print(f"{key}: {value}")

rq1_categories: Morphological Complexity Challenges, Agglutinative Morphology, Low-Resource Machine Translation, Language-Specific Challenges, Cross-Linguistic Analysis
rq2_categories: Morphological Complexity in Low-Resource Machine Translation, Techniques for Machine Translation, Evaluation of Translation Models, Subword-Level Approaches, Linguistic Preprocessing Methods
rq3_categories: Morphology-aware Techniques, Subword Modeling, Morphological Analysis, Comparative Effectiveness, Pre-annotation Techniques
rq4_categories: Agglutinative Language Performance, Morphological Segmentation Techniques, Challenges in Morphological Typologies, Subword-Level Models in Translation, Comparative Analysis of Morphological Complexity
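
Each value in `responses` is a single comma-separated string. For counting or comparing categories later, it may help to split the string into a list. A minimal sketch, assuming category names never contain commas themselves (the prompt does not guarantee this); `parse_categories` is a hypothetical helper, not part of the notebook.

```python
def parse_categories(raw: str) -> list[str]:
    """Split a comma-separated category string into a list of trimmed names."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# rq3 categories as printed above
rq3_raw = (
    "Morphology-aware Techniques, Subword Modeling, Morphological Analysis, "
    "Comparative Effectiveness, Pre-annotation Techniques"
)
rq3_categories = parse_categories(rq3_raw)
print(len(rq3_categories), rq3_categories[:2])
```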


In [20]:
# create a DataFrame to store the responses
responses_df = pd.DataFrame([responses])
responses_df.head()

Unnamed: 0,rq1_categories,rq2_categories,rq3_categories,rq4_categories
0,"Morphological Complexity Challenges, Agglutina...",Morphological Complexity in Low-Resource Machi...,"Morphology-aware Techniques, Subword Modeling,...","Agglutinative Language Performance, Morphologi..."
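
The section that follows reads the categories back from a CSV file, but the cell that writes that file is not shown. One way it could look, as a sketch only: a temporary folder stands in for the project's `get_data_path()`, the `responses` values are shortened copies of the ones printed earlier, and `index=False` is an assumption about how the file was saved.

```python
import tempfile
from pathlib import Path

import pandas as pd

# stand-in for the `responses` dict built in the notebook (values shortened)
responses = {
    "rq1_categories": "Morphological Complexity Challenges, Agglutinative Morphology",
    "rq2_categories": "Techniques for Machine Translation, Subword-Level Approaches",
}
responses_df = pd.DataFrame([responses])

# assumption: a temporary folder replaces the project's data path
output_dir = Path(tempfile.mkdtemp()) / "categories"
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "limit_n=30_text_answers_gpt-4o.csv"

# index=False keeps the row index out of the file
responses_df.to_csv(output_path, index=False)

# reading the file back returns the same single row
restored = pd.read_csv(output_path)
print(restored.columns.tolist())
```

Writing without the index means the file contains only the `rq*_categories` columns, so no extra `Unnamed: 0` column appears when it is loaded again.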


## Read the saved categories

In [33]:
file_name = "limit_n=30_text_answers_gpt-4o.csv"
input_path = f"{get_data_path()}/categories/{file_name}"
cat_df = pd.read_csv(input_path)

# print the stored categories for each research question
for col in cat_df.columns:
    print(f"{col}: {cat_df[col].iloc[0]}")

rq1_categories: Morphological Complexity Challenges, Agglutinative Language Issues, Low-Resource Language Contexts, Polysynthetic Language Challenges, Rich Morphology and Data Sparsity
rq2_categories: Morphological Complexity in Low-Resource Machine Translation, Subword and Morphological Segmentation Techniques, Neural and Statistical Machine Translation Methods, Rule-Based and Example-Based Approaches, Multimodal and Pivot-Based Translation Techniques
rq3_categories: Morphology-Aware Techniques, Subword Modeling Techniques, Morphological Analysis and Tokenization, Evaluation of Morphological Techniques, Machine Translation Improvement
rq4_categories: Agglutinative Language Processing, Morphological Segmentation Techniques, Translation Performance Improvements, Polysynthetic Language Challenges, Fusional Language Handling
