# **Article Relevancy Extraction**

## Project Content <a id = 0></a>

### First Step: First Organization

1. [Introduction](#1)
2. [Loading The Relevant Libraries and Packages](#2)

### Second Step: Data Preprocessing

3. [Loading and Preprocessing The Dataset](#3)

### Third and Final Step: Prompt Engineering

4. [Prompt Engineering and Article Relevancy Extraction Using The OpenAI GPT 3.5 Turbo API](#4)

***

# First Step: First Organization

***

## 1. Introduction <a id = 1></a>

### Project Overview

This project studies how well the OpenAI API can judge the relevancy of research papers to a review question.<br>
The work began by converting an RTF query file into the more adaptable TXT format, which forms the basis of the experimental dataset.<br>
The dataset comprises paper titles, abstracts, and metadata.<br>
Python was used to extract the metadata and to store the data in an Excel sheet.<br>
Using the OpenAI GPT-3.5 Turbo chat API, each paper was assessed against each prompt, and the binary results were recorded in a Pandas dataframe.<br>
This approach allows prompt relevancy to be evaluated for each individual paper.

### Data Description

The dataset used in this study comprises 330 research papers.<br>
Each document includes a title, an abstract, and associated metadata.<br>
Metadata fields include attributes such as publication dates, author information, and citation counts.<br>
This dataset serves as the foundation for the analysis of paper relevancy.

[Project Content](#0)

## 2. Loading The Relevant Libraries and Packages <a id = 2></a>

In [None]:
# Basic Python packages
import os
import time

# Data preprocessing libraries
import pandas as pd
from IPython.display import display

# OpenAI library
import openai

In [None]:
api_keys = ["sk-****", # 0
            "sk-****", # 1
            "sk-****", # 2
            "sk-****", # 3
            "sk-****", # 4
            "sk-****", # 5
            "sk-****"  # 6
            ]
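
Hard-coding keys in a notebook risks leaking them when the file is shared. A minimal sketch of an alternative, assuming the keys are stored comma-separated in an environment variable; the name `OPENAI_API_KEYS` and the helper `load_api_keys` are hypothetical and not part of the original project:

```python
import os


def load_api_keys(env_var="OPENAI_API_KEYS"):
    """Return the API keys stored comma-separated in an environment variable.

    The variable name is an assumption made for illustration only.
    """
    raw = os.environ.get(env_var, "")
    # Drop surrounding whitespace and ignore empty entries
    return [key.strip() for key in raw.split(",") if key.strip()]


# Placeholder values set here only so that the example runs on its own
os.environ["OPENAI_API_KEYS"] = "sk-first, sk-second"
api_keys = load_api_keys()
print(api_keys)
```

The resulting list can be indexed exactly like the hard-coded one, for example `api_keys[0]`.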

In [None]:
def generate_prompt(prompt, api_key):
    """
    Generate a prompt completion using the OpenAI GPT-3.5 Turbo model.

    Args:
        prompt (str): The text prompt or input for the model.
        api_key (str): Your OpenAI API key for authentication.

    Returns:
        str: The generated text completion as a response to the input prompt.
    """
    
    # Set the OpenAI API key for authentication
    openai.api_key = api_key
    
    # Create a chat completion request using the GPT-3.5 Turbo model
    # (openai.ChatCompletion is the interface of openai package versions before 1.0)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": prompt}],
        max_tokens=100,         # Set a limit on the response length
        temperature=0,          # Set the temperature for randomness (0 means deterministic)
        n=1,                    # Generate a single response
        stop=None,              # Stop generation at a specific token (None means no stop condition)
        timeout=None            # Set a timeout for the request (None means no timeout)
    )
    
    # Extract the generated content from the response and remove leading/trailing whitespace
    return response.choices[0].message.content.strip()
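
A long screening run can fail midway on a transient network or API error. The sketch below shows one way to retry a call with exponential backoff. It is not part of the original notebook: `call_with_retries` is a hypothetical helper, and it catches every exception for simplicity, which a real run would probably narrow to the errors worth retrying.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on any exception.

    `retries` is the number of additional attempts after the first failure.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # Out of attempts: surface the last error
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Under these assumptions, a call in the screening loops could be written as `call_with_retries(generate_prompt, prompt_1, api_keys[0])`.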

[Project Content](#0)

***

# Second Step: Data Preprocessing

***

## 3. Loading and Preprocessing The Dataset <a id = 3></a>

Save the articles' information in a list object.

In [None]:
# Define the path to the data file, assuming it's located in a directory named "data"
data_path = os.path.join("data", "data.txt")

# Open the data file in read-only mode and assign it to the 'file' variable
with open(data_path, "r") as file:
    # Read the entire content of the file and store it in 'text_content'
    text_content = file.read()

# Split the text_content into individual articles using double newline characters as separators
articles = text_content.split("\n\n")

print(f"There are {len(articles)} articles in the dataset.")
print(40 * "-")

for article in articles[:5]:
    print(article)
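
Splitting on exactly `"\n\n"` can yield empty or merged entries when the file uses Windows line endings or more than one blank line between articles. A more tolerant variant might look like the sketch below; `split_articles` is a hypothetical helper, and the sample text is invented for illustration rather than taken from the dataset.

```python
import re


def split_articles(text_content):
    """Split raw text into article blocks separated by one or more blank lines."""
    # Normalize Windows and old Mac line endings to "\n"
    normalized = text_content.replace("\r\n", "\n").replace("\r", "\n")
    # A separator is a newline, optional whitespace-only lines, then another newline
    blocks = re.split(r"\n\s*\n", normalized)
    # Discard empty blocks produced by leading or trailing blank lines
    return [block.strip() for block in blocks if block.strip()]


# Invented sample with mixed line endings and extra blank lines
sample = 'Doe J (2020). "Title A"\nAbstract A.\r\n\r\n\r\nRoe K (2021). "Title B"\nAbstract B.\n\n'
print(len(split_articles(sample)))
```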

Create a Pandas dataframe to save the information and the experiment's results.

In [None]:
# Define the column names for the DataFrame
columns = ["title",
           "publication_date",
           "authors"]

# Create an empty DataFrame with the specified columns
df = pd.DataFrame(columns=columns)

df

Now, extract all the information from the txt file and organize it into a clean table.<br>
To achieve this goal, create a dictionary object for each article and save them in a list object.

In [None]:
# Initialize an empty list to store organized articles
organized_articles = []

# Loop through each article in the 'articles' list
for article in articles:

    # Split the article into lines based on newline characters
    lines = article.split("\n")

    # The first line contains article information, including title, publication date, and authors
    article_info = lines[0]

    # The remaining lines constitute the abstract, which we join into a single string and remove leading/trailing whitespace
    abstract = " ".join(lines[1:]).strip()

    # Create a dictionary representing the organized article with title, publication date, authors, and abstract
    # (named 'organized' so that it does not shadow the loop variable 'article')
    organized = {
        "title": article_info.split('"')[1],  # Extract the title (text within double quotes)
        "publication_date": int(article_info.split(" (")[1][:4]),  # Extract the publication year (first four characters within parentheses)
        "authors": article_info.split(" (")[0],  # Extract the authors (text before the first parenthesis)
        "abstract": abstract  # Store the abstract
    }

    # Append the organized article dictionary to the 'organized_articles' list
    organized_articles.append(organized)

# Print the first 5 organized articles for demonstration
for article in organized_articles[:5]:

    for value in article.values():
        print(value)

    print()
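
The string splits above raise an `IndexError` or `ValueError` as soon as one header line deviates from the expected layout. A regular expression can make the assumed layout explicit and report malformed lines instead of crashing. The sketch below assumes headers of the form `Authors (YYYY...) ... "Title" ...`, which is inferred from the splitting logic rather than documented anywhere; `parse_header` and the sample line are hypothetical.

```python
import re

# Assumed header layout, inferred from the splitting logic: Authors (YYYY...) ... "Title" ...
HEADER_PATTERN = re.compile(r'^(?P<authors>.+?) \((?P<year>\d{4}).*?"(?P<title>[^"]*)"')


def parse_header(article_info):
    """Return title, publication_date, and authors as a dict, or None if the line does not match."""
    match = HEADER_PATTERN.match(article_info)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "publication_date": int(match.group("year")),
        "authors": match.group("authors"),
    }


# Hypothetical header line, not taken from the dataset
print(parse_header('Doe J, Roe K (2020, March). "Bright light and sleep." Journal of Examples.'))
```

Lines for which `parse_header` returns `None` can be collected and inspected by hand before the table is built.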

Now, fill the table created before.

In [None]:
# Iterate through the organized articles together with their positions
# (enumerate avoids list.index, which is slow and returns the first match for duplicate entries)
for idx, article in enumerate(organized_articles):

    # Use the index to assign the values of the 'article' to the corresponding row in the DataFrame 'df'
    df.loc[idx] = article

df.head(5)

Save the table as an Excel file.

In [None]:
# Make sure the output directory exists before writing to it
os.makedirs("output", exist_ok=True)

# Define the path where the output Excel file will be saved.
output_path = os.path.join("output", "output_table.xlsx")

# Export the DataFrame 'df' to an Excel file at the specified 'output_path'.
df.to_excel(output_path)

[Project Content](#0)

***

# Third and Final Step: Prompt Engineering

***

## 4. Prompt Engineering and Article Relevancy Extraction Using The OpenAI GPT 3.5 Turbo API <a id = 4></a>

Everything is now in place to query the OpenAI API.

### Prompt 1

In [None]:
for i in range(len(df))[159:]:
    
    prompt_1 = f"""
        We are trying to conduct a PRISMA systematic review study titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Our inclusion criteria are the following:\
        - Criterion A: Must have enrolled only patients who had been diagnosed with chronic insomnia disorder. We also include studies in which insomnia diagnosis was not based on proper classification criteria, but was based on “sleep troubles” including at least one insomnia complaint, without any other diagnosis of sleep disorder.\
        - Criterion B: Studies assessing the efficacy of light therapy.\
        - Criterion C: Publication written in English.\
        - Criterion D: Study design could be observational or interventional.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
    
    output = generate_prompt(prompt_1, api_keys[0])
            
    df.at[i, "prompt_1"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_1.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_1.xlsx")
df.to_excel(output_path)

df.head(5)
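
In the prompt strings, a backslash at the end of a line removes the newline, so the model receives the text as one long line with the source indentation embedded. If real line breaks and no indentation are wanted, the prompt can be assembled from a dedented template. The sketch below is one possible approach, not the notebook's method: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the wording is abridged from the prompts in this section.

```python
import textwrap

# Template with real line breaks; dedent strips the common leading indentation
PROMPT_TEMPLATE = textwrap.dedent("""\
    We are trying to conduct a PRISMA systematic review study titled:
    "Light therapy in insomnia disorder"

    Check the title and abstract if it should be selected for further assessment and full-text screening.
    Provide your answer only as 0 or 1.

    Our inclusion criteria are the following:
    {criteria}

    Article Title: {title}

    Article Abstract: {abstract}
    """)


def build_prompt(title, abstract, criteria):
    """Fill the template with one article and a list of (label, text) criteria."""
    criteria_block = "\n".join(f"- Criterion {label}: {text}" for label, text in criteria)
    return PROMPT_TEMPLATE.format(criteria=criteria_block, title=title, abstract=abstract)


criteria = [
    ("A", "Studies including patients with sleep troubles."),
    ("B", "Studies assessing the efficacy of light therapy."),
]
print(build_prompt("Example title", "Example abstract.", criteria))
```

Keeping the criteria as data also makes the differences between prompt variants easier to see than comparing near-identical string literals.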

### Prompt 2

In [None]:
for i in range(len(df)):
    
    prompt_2 = f"""
        We are trying to conduct a PRISMA systematic review study titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Our inclusion criteria are the following:\
        - Criterion A: Studies including patients with sleep troubles without any other diagnosis of sleep disorder.\
        - Criterion B: Studies assessing the efficacy of light therapy.\
        - Criterion C: Publication written in English.\
        - Criterion D: Study design could be observational or interventional.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
    
    output = generate_prompt(prompt_2, api_keys[0])
            
    df.at[i, "prompt_2"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_2.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_2.xlsx")
df.to_excel(output_path)

df.head(5)
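
The prompts ask for `0` or `1`, but the raw reply is stored as text, and a chat model may still add punctuation or a short explanation. A small normalization step can convert replies to integers and flag anything ambiguous for manual review. This is a sketch under that assumption; `parse_binary_label` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_binary_label(output):
    """Return 0 or 1 if the reply contains exactly one kind of standalone label, else None."""
    # Match 0 or 1 only when not adjacent to other digits, so "10" or "2021" is ignored
    labels = set(re.findall(r"(?<!\d)[01](?!\d)", output))
    if len(labels) == 1:
        return int(labels.pop())
    return None  # Missing or ambiguous label: flag for manual review


print(parse_binary_label("1."))
print(parse_binary_label("0 or 1"))
```

Storing the parsed value next to the raw reply keeps the original model output available for auditing.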

### Prompt 3

In [None]:
for i in range(len(df))[296:]:
    
    prompt_3 = f"""
        We are trying to conduct a PRISMA systematic review study titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Our inclusion criteria are the following:\
        - Criterion A: Studies including patients with sleep troubles.\
        - Criterion B: Studies assessing the efficacy of light therapy.\
        - Criterion C: Publication written in English.\
        - Criterion D: Study design could be observational or interventional.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
    
    output = generate_prompt(prompt_3, api_keys[0])

            
    df.at[i, "prompt_3"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_3.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_3.xlsx")
df.to_excel(output_path)

df.head(5)
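
Each loop pauses 21 seconds after every request, presumably to stay under a requests-per-minute limit; the notebook does not state the reason. A fixed sleep adds the full pause on top of the time the request itself took. The sketch below paces calls by the interval between request starts instead. `Pacer` is a hypothetical helper, and the 21-second interval is simply copied from the loops.

```python
import time


class Pacer:
    """Enforce a minimum interval between the starts of consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_start = None

    def wait(self):
        """Sleep only as long as still needed since the previous call started."""
        now = self._clock()
        if self._last_start is not None:
            remaining = self.min_interval - (now - self._last_start)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_start = now


pacer = Pacer(21)
pacer.wait()  # The first call never sleeps
```

In a screening loop, `pacer.wait()` would be called immediately before each `generate_prompt(...)` call, replacing the trailing `time.sleep(21)`.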

### Prompt 4

In [None]:
for i in range(len(df))[203:]:
    
    prompt_4 = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Our inclusion criteria are the following:\
        - Criterion A: Studies including patients with sleep troubles.\
        - Criterion B: Studies assessing the efficacy of light therapy.\
        - Criterion C: Publication written in English.\
        - Criterion D: Study design could be observational or interventional.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_4, api_keys[0])
    
    df.at[i, "prompt_4"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_4.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_4.xlsx")
df.to_excel(output_path)

df.head(5)
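
The hard-coded start offsets in these loops (such as `[203:]`) suggest that runs were resumed by hand after an interruption, although the notebook does not say so. One way to make resumption automatic is to process only the rows that have no stored answer yet. The sketch below uses plain dictionaries as stand-ins for DataFrame rows; `pending_indices` and `screen_pending` are hypothetical helpers, and with pandas the emptiness check would more likely rely on `pd.isna`.

```python
def pending_indices(rows, column):
    """Return the indices of rows that have no stored answer in `column` yet."""
    return [i for i, row in enumerate(rows) if row.get(column) in (None, "")]


def screen_pending(rows, column, screen):
    """Apply `screen` to every pending row and store the answer in place."""
    for i in pending_indices(rows, column):
        rows[i][column] = screen(rows[i])
    return rows


# Stand-in rows: the first article was screened in an earlier run
rows = [{"title": "a", "prompt_4": "1"}, {"title": "b"}, {"title": "c"}]
print(pending_indices(rows, "prompt_4"))
```

Rerunning the same cell after a crash then continues where the previous run stopped, without editing the loop bounds.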

### Prompt 5

In [None]:
for i in range(len(df))[293:]:
    
    prompt_5 = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Our inclusion criteria are the following:\
        - Criterion A: Studies including patients with sleep troubles.\
        - Criterion B: Studies assessing the efficacy of light therapy.\
        - Criterion C: Publication written in English.\
        - Criterion D: Study design could be observational or interventional.\
        
        If the title and abstract were not clear about criterion A, B, or D, it is preferred to include the study.
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_5, api_keys[0])
    
    df.at[i, "prompt_5"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_5.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_5.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt 6

In [None]:
for i in range(len(df))[312:]:
    
    prompt_6 = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled:\
        "Light therapy in insomnia disorder"\
        
        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        You are looking for English original studies working on the effect of light therapy in patients with sleeping problems.\
        It is important not to miss possible relevant studies.\
        It is preferred to include studies you are not sure about.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_6, api_keys[0])
    
    df.at[i, "prompt_6"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_6.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_6.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt 7

In [None]:
for i in range(len(df))[317:]:
    
    prompt_7 = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled:\
        "Light therapy in insomnia disorder"\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        You are looking for English original studies working on the effect of light therapy in patients with sleeping problems.\
        It is important not to miss possible relevant studies.\
        It is preferred not to include studies you are not sure about.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_7, api_keys[0])
    
    df.at[i, "prompt_7"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_7.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_7.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt 1 (Revised)

In [None]:
for i in range(len(df))[287:]:
    
    prompt_1r = f"""
        Title: "Light therapy in insomnia disorder"\

        You are conducting a PRISMA systematic review and meta-analysis to assess the efficacy of light therapy in patients with insomnia disorder.\
        Your goal is to identify as many relevant studies as possible to ensure a comprehensive review.\

        Check the title and abstract if it should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\
            
        Inclusion Criteria:\
        Criterion A: Studies involving patients with any form of sleep disturbances or insomnia disorder, regardless of sleep disorder diagnoses.\
        Criterion B: Studies assessing the efficacy of light therapy.\
        Criterion C: Publications written in English.\
        Criterion D: Study design could be observational or interventional.\
            
        Your primary objective is to avoid missing any potentially relevant studies.\
        If the title and abstract are not explicit about meeting the inclusion criteria, it is preferred to err on the side of inclusivity and include the study for further assessment.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_1r, api_keys[0])
    
    df.at[i, "prompt_1r"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_1r.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_1r.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt 6 (Revised)

In [None]:
for i in range(len(df))[164:]:
    
    prompt_6r = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled: "Light therapy in insomnia disorder."\

        Check the title and abstract to determine if the study should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Your primary objective is to identify relevant studies on the effect of light therapy in patients with sleeping problems.\
        You are looking for English original studies focusing on this topic.\

        Please be cautious and prioritize inclusivity.\
        If the title and abstract hint at potential relevance, even if the criteria are not explicitly stated, include the study for further assessment.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_6r, api_keys[0])
    
    df.at[i, "prompt_6r"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_6r.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_6r.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt 7 (Revised)

In [None]:
for i in range(len(df))[288:]:
    
    prompt_7r = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled: "Light therapy in insomnia disorder."\

        Check the title and abstract to determine if the study should be selected for further assessment and full-text screening.\
        Provide your answer only as 0 or 1.\

        Your primary objective is to identify English original studies on the effect of light therapy in patients with sleeping problems.\

        Please aim for high sensitivity in your screening process.\
        While you should exercise discretion, prioritize inclusivity.\
        If the abstract suggests potential relevance, include the study for further assessment.\
            
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_7r, api_keys[0])
    
    df.at[i, "prompt_7r"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_7r.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_7r.xlsx")
df.to_excel(output_path)

df.head(5)

### Prompt ST

In [None]:
for i in range(len(df))[239:]:
    
    prompt_st = f"""
        You are an experienced systematic review researcher and you are trying to conduct a PRISMA systematic review and meta-analysis titled: "Light therapy in insomnia disorder."\

        Based on the guideline below, check the title and abstract to determine if the study should be selected for further assessment and full-text screening.\
        ** Provide your answer only as 0 or 1 **\
            
        Guideline:\
            
        1. Does the title or abstract use English?
        
        a. Yes: continue screening
        b. No: stop screening\
            
        2. Does the title or abstract indicate that the study is original and not a review?
        
        a. Yes: continue screening
        b. No: stop screening\
            
        3. Does the abstract indicate that patients with sleep troubles were included?
        
        a. Yes or Unsure/Unclear: continue screening
        b. No: stop screening\
        
        4. Does the abstract indicate that any type of light therapy was studied?
        
        a. Yes or Unsure/Unclear: continue screening
        b. No: stop screening\
            
        Decision: Should this article be included?
        
        a. 1, all 4 screening questions answered Yes or Unclear
        b. 0, at least one question answered definitely “No”\
            
        ** Your answer should be a binary number that determines the final decision output. **\
                
        Article Title: {organized_articles[i]["title"]}\

        Article Abstract: {organized_articles[i]["abstract"]}\
        """
            
    output = generate_prompt(prompt_st, api_keys[0])
    
    df.at[i, "prompt_st"] = output
                
    print(f"The article {i} has been screened. The output is {output}.")

    time.sleep(21)
        
output_path = os.path.join("output", f"prompt_st.xlsx")
df.to_excel(output_path)

df.head(5)

In [None]:
output_path = os.path.join("output", f"prompt_st.xlsx")
df.to_excel(output_path)

df.head(5)
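
The guideline in Prompt ST amounts to a simple rule: an article is excluded as soon as one question is answered with a definite "No", and included otherwise. Written out as code, the rule could look like the sketch below, which is useful for checking hand-labelled answers against the model's decision. `st_decision` is a hypothetical helper; it accepts "unclear" for all four questions, following the final decision rule, even though questions 1 and 2 list only Yes and No.

```python
VALID_ANSWERS = {"yes", "no", "unclear"}


def st_decision(answers):
    """Return 1 (include) or 0 (exclude) for the four guideline answers.

    `answers` holds "yes", "no", or "unclear" for questions 1 to 4, in order.
    """
    normalized = [answer.strip().lower() for answer in answers]
    if len(normalized) != 4 or any(a not in VALID_ANSWERS for a in normalized):
        raise ValueError("expected four answers, each 'yes', 'no', or 'unclear'")
    # A definite "no" on any question excludes the article; otherwise it is included
    return 0 if "no" in normalized else 1


print(st_decision(["yes", "yes", "unclear", "yes"]))
print(st_decision(["yes", "no", "yes", "yes"]))
```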

### Final Table

In [None]:
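# Load the final results table
# (final.xlsx is not written by this notebook, so it must already exist in the "output" directory)
output_path = os.path.join("output", "final.xlsx")

df = pd.read_excel(output_path)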
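
Once the answers of all prompts sit in one table, a first comparison is how many articles each prompt selected. The sketch below assumes the table has been converted to a list of row dictionaries (for example with `df.to_dict("records")`) and that its columns carry the names used in this notebook, such as `prompt_1` and `prompt_st`. `is_included` and `count_inclusions` are hypothetical helpers, and the sample rows are invented.

```python
def is_included(value):
    """Return True if a stored answer represents the label 1 (int, float, or string)."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False  # Missing or non-numeric answers do not count as inclusions


def count_inclusions(rows, prompt_columns):
    """Count, per prompt column, how many rows carry the label 1."""
    return {
        column: sum(1 for row in rows if is_included(row.get(column)))
        for column in prompt_columns
    }


# Invented rows standing in for the final table
rows = [
    {"prompt_1": 1, "prompt_st": "0"},
    {"prompt_1": "1", "prompt_st": 1.0},
    {"prompt_1": None, "prompt_st": "n/a"},
]
print(count_inclusions(rows, ["prompt_1", "prompt_st"]))
```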