# Categorization of abstracts

In [1]:
import os
from dotenv import load_dotenv
from openai import OpenAI

%load_ext autoreload
%autoreload 2

In [None]:
# load environment variables
load_dotenv()

# get API key
api_key = os.getenv("OPENAI_API_KEY")

# init OpenAI client
client = OpenAI(api_key=api_key)


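def prompt_gpt(
    model: str,
    prompt: str,
    temperature: float = 0.0,
    max_tokens: int = 200,
):
    """Send one user prompt to the chat completions API and return the reply text."""
    # query the chat completions endpoint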
    chat_completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant designed to carefully analyze academic abstracts based on specific inclusion and exclusion criteria.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return chat_completion.choices[0].message.content

In [3]:
model = "gpt-4o-mini"
prompt = "Say this is a test and nothing else."
response = prompt_gpt(model, prompt)
print(response)

This is a test.


In [None]:
from gpt.config import get_dataframe


file_name = "output_results_gpt-4o.csv"
df = get_dataframe(file_name, read_csv=True)
df.head(5)

Unnamed: 0,abstract,output,reasoning
0,"""low resource languages"" usually refers to lan...",1. Challenges of morphological complexity A...,The abstract provides insights into a transfer...
1,"byte pair encoding (bpe) (sennrich et al., 201...",1. Challenges of morphological complexity A...,The abstract provides insights into the compar...
2,the fast growth of communication technology ha...,1. Challenges of morphological complexity A...,The abstract provides insights into the challe...
3,"besides word order, word choice is a key stumb...",1. Challenges of morphological complexity A...,The abstract primarily addresses challenges in...
4,it has been shown that the performance of neur...,1. Challenges of morphological complexity A...,The abstract primarily addresses the challenge...


In [9]:
abstract_outputs = df["output"].tolist()
abstract_outputs[:3]

['1. Challenges of morphological complexity    Answer: The abstract does not explicitly discuss the challenges posed by varying degrees of morphological complexity (e.g., isolating, fusional, agglutinative, and polysynthetic languages) to machine translation systems in a low-resource language context.    Evidence: The abstract focuses on the morphological complexity of the Omotic languages Gofa and Wolaytta but does not detail specific challenges related to different morphological typologies.  2. Proposed techniques    Answer: The abstract proposes a transfer-based approach using a neural morphological generator and a finite state transducer (FST) for learning the morphology of a severely under-resourced language by leveraging a related language with more resources.    Evidence: The abstract describes creating an FST for Wolaytta and training an encoder-decoder neural network for morphological generation, which is then adapted for Gofa using bilingual data.  3. Morphology-aware techniq

In [21]:
def get_prompt(abstract: str):
    prompt = f"""
    Based on the following analysis, identify five **thematic categories** that summarize the key concepts, methods, or contributions mentioned.

    --- BEGIN ANALYSIS ---
    {abstract}
    --- END ANALYSIS ---

    The categories should reflect actual *topics or themes* found in the answers, such as techniques, challenges, or typologies discussed (e.g., "Cross-lingual Transfer Learning", "Morphological Generation", "Polysynthetic Language Challenges").

    Do NOT reuse the question labels like "Challenges of morphological complexity" as categories.

    Only return the categories, comma-separated, in the following format:
    OUTPUT: Category 1, Category 2, Category 3, Category 4, Category 5
    """
    return prompt


responses = []
for i, abstract in enumerate(abstract_outputs[:5]):
    prompt = get_prompt(abstract)
    # query ChatGPT
    response = prompt_gpt(model, prompt, max_tokens=500)
    responses.append(response)
    # The denominator is the full list length, although only the first 5 outputs are queried here
    print(f"Response {i+1}/{len(abstract_outputs)}...")

Response 1/184...
Response 2/184...
Response 3/184...
Response 4/184...
Response 5/184...
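The triple-quoted f-string in `get_prompt` keeps the four-space indentation of the function body on every line sent to the model. A minimal sketch of one way to avoid that, assuming the wording should otherwise stay the same: dedent a plain template once with `textwrap.dedent`, then fill it with `str.format`. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and not part of this notebook, and the template text is shortened for illustration.

```python
import textwrap

# Hypothetical template (shortened); dedented once, before any text is inserted.
PROMPT_TEMPLATE = textwrap.dedent(
    """\
    Based on the following analysis, identify five **thematic categories**.

    --- BEGIN ANALYSIS ---
    {analysis}
    --- END ANALYSIS ---

    Only return the categories, comma-separated, in the following format:
    OUTPUT: Category 1, Category 2, Category 3, Category 4, Category 5
    """
)


def build_prompt(analysis: str) -> str:
    # Fill the already-dedented template; the analysis text is inserted verbatim.
    return PROMPT_TEMPLATE.format(analysis=analysis)


example = build_prompt("1. Proposed techniques\n   Answer: ...")
print(example)
```

Dedenting before interpolation matters here: a multi-line analysis inserted first would have unindented continuation lines, and `textwrap.dedent` would then find no common indentation to remove.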


In [22]:
responses

['OUTPUT: Morphological Complexity in Low-Resource Languages, Transfer-Based Approaches, Morphology-Aware Techniques, Neural Network Applications in Morphological Generation, Limitations in Morphological Typology Analysis',
 'OUTPUT: Subword Segmentation Techniques, Machine Translation in Low-Resource Languages, Effectiveness of Morphology-Aware Approaches, Comparison of Language Models, Morphological Typology in Translation',
 'OUTPUT: Morphological Complexity in Machine Translation, Techniques for Unsupervised Neural Machine Translation, Addressing Ambiguity in Translation, Quality Enhancement of Pseudo Parallel Sentences, General Challenges in Morphologically Rich Languages',
 'OUTPUT: Morphological Complexity in Machine Translation, Unsupervised Word Sense Disambiguation Techniques, Translation Quality Improvement Strategies, Lack of Morphological Typology Analysis, Challenges in Translating Morphologically Rich Languages',
 'OUTPUT: Neural Machine Translation Architecture, Bilingu

In [28]:
import re
from typing import Optional


def extract_output(result: str) -> Optional[str]:
    """Return the text after 'OUTPUT:' as a single line, or None if the marker is missing."""
    # Use re.DOTALL to make '.' match newline characters
    match = re.search(r"OUTPUT:\s*(.*)", result, re.DOTALL)
    if match:
        return match.group(1).replace("\n", " ").strip().strip("*")
    return None
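`extract_output` returns one comma-separated string per response. For later counting or deduplication, a list of category names is often easier to work with. The following is a minimal sketch under that assumption; `parse_categories` is a hypothetical helper, not part of the notebook, and it reuses the same regular expression while returning an empty list when the `OUTPUT:` marker is missing.

```python
import re


def parse_categories(result: str) -> list[str]:
    """Return the categories after 'OUTPUT:' as a list (empty if the marker is missing)."""
    match = re.search(r"OUTPUT:\s*(.*)", result, re.DOTALL)
    if match is None:
        return []
    text = match.group(1).replace("\n", " ").strip().strip("*")
    # Split on commas and drop empty fragments left by trailing commas.
    return [part.strip() for part in text.split(",") if part.strip()]


sample = "OUTPUT: Subword Segmentation Techniques, Machine Translation in Low-Resource Languages"
print(parse_categories(sample))
print(parse_categories("no marker here"))
```

Splitting on commas assumes that no category name contains a comma, which holds for the responses shown in this notebook but is not guaranteed by the prompt.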

In [35]:
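# Join the extracted category strings, skipping responses without an OUTPUT: marker
extracted = [extract_output(response) for response in responses]
outputs = ", ".join(text for text in extracted if text is not None)
outputs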

'Morphological Complexity in Low-Resource Languages, Transfer-Based Approaches, Morphology-Aware Techniques, Neural Network Applications in Morphological Generation, Limitations in Morphological Typology Analysis, Subword Segmentation Techniques, Machine Translation in Low-Resource Languages, Effectiveness of Morphology-Aware Approaches, Comparison of Language Models, Morphological Typology in Translation, Morphological Complexity in Machine Translation, Techniques for Unsupervised Neural Machine Translation, Addressing Ambiguity in Translation, Quality Enhancement of Pseudo Parallel Sentences, General Challenges in Morphologically Rich Languages, Morphological Complexity in Machine Translation, Unsupervised Word Sense Disambiguation Techniques, Translation Quality Improvement Strategies, Lack of Morphological Typology Analysis, Challenges in Translating Morphologically Rich Languages, Neural Machine Translation Architecture, Bilingual Similarity Lexicon, Low-Resource Language Translat
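The combined string above contains exact repeats; "Morphological Complexity in Machine Translation" appears in two responses. A minimal sketch of counting and deduplicating such labels with the standard library before passing them to the next prompt follows. The variable names are hypothetical, and the four labels are copied from the responses shown above as sample data.

```python
from collections import Counter

# Sample labels copied from responses 3 and 4 above (not the full list).
categories = [
    "Morphological Complexity in Machine Translation",
    "Techniques for Unsupervised Neural Machine Translation",
    "Morphological Complexity in Machine Translation",
    "Unsupervised Word Sense Disambiguation Techniques",
]

# Frequency of each exact label.
counts = Counter(categories)

# Drop exact duplicates while keeping first-seen order.
unique_categories = list(dict.fromkeys(categories))

print(counts.most_common(1))
print(unique_categories)
```

This only catches identical strings. Near-duplicates such as "Morphology-Aware Techniques" and "Effectiveness of Morphology-Aware Approaches" would still need the model-based consolidation step.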

In [36]:
prompt = f"""
    Based on the following analysis, identify FIVE **thematic categories** that summarize the key concepts, methods, or contributions mentioned.
    --- BEGIN ANALYSIS ---
    {outputs}
    --- END ANALYSIS ---
"""
response = prompt_gpt(model, prompt, max_tokens=500)
print(response)

Based on the analysis provided, here are five thematic categories that summarize the key concepts, methods, or contributions mentioned:

1. **Morphological Complexity and Typology**: This category encompasses the challenges and characteristics of morphological complexity in low-resource languages, including the limitations and analysis of morphological typology in translation.

2. **Neural Network and Machine Translation Techniques**: This includes the application of neural network architectures, such as encoder-decoder frameworks, and the development of techniques for unsupervised neural machine translation, particularly in the context of low-resource languages.

3. **Morphology-Aware Approaches**: This theme focuses on the effectiveness of morphology-aware techniques and their role in enhancing translation quality, addressing ambiguity, and improving the handling of morphologically rich languages.

4. **Subword Segmentation and Linguistic Similarity**: This category highlights the im
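The consolidated response above is free-form Markdown rather than the `OUTPUT:` format, so `extract_output` would not match it. A minimal sketch of pulling out the bold category names follows, assuming the model keeps the `N. **Name**:` pattern seen above. The names `response_text` and `category_names` are hypothetical, and the sample text is abbreviated from the response.

```python
import re

# Two list items abbreviated from the response above; descriptions are cut short.
response_text = (
    "1. **Morphological Complexity and Typology**: This category encompasses ...\n\n"
    "3. **Morphology-Aware Approaches**: This theme focuses on ...\n"
)

# Match a numbered list item at the start of a line and capture the bold title.
pattern = re.compile(r"^[ \t]*\d+\.\s+\*\*(.+?)\*\*", re.MULTILINE)
category_names = pattern.findall(response_text)

print(category_names)
```

Requesting the `OUTPUT:` format in the consolidation prompt as well would likely make this parsing step unnecessary, at the cost of losing the per-category descriptions.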