# Convert to Dataframe
#### Script Purpose
In this sample script, we set up the environment for the ChatGPT API, then run ChatGPT on documents in the sample data. Note that this is a pilot, and we welcome all feedback and suggestions.


#### API Usage
Models available: 
<br>gpt_35_turbo, 
<br>gpt_4o, 
<br>gpt_4o_2024_08_06, 
<br>gpt_4o_mini,
<br>o1_mini,
<br>o1_preview


Limitations:
<br>50,000 Tokens/Minute on LLM Access
<br>10 Requests/Second across all API endpoints
<br>$5 spend/day 

Cost per 1k tokens:
| Model                     | In / 1k Tok | Out / 1k Tok |
|:---------------------------|:-------------|:--------------|
| gpt_35_turbo     | 0.0000005     | 0.0000015       |
| gpt_4o      | 0.000005      | 0.000015       |
| gpt_4o_2024_08_06         |  0.0000025       | 0.00001        |
| gpt_4o_mini         | 0.00000015       | 0.0000006        |
| o1_mini         |  0.000003       | 0.0000012        |
| o1_preview         |  0.000015       | 0.00006        |
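
To relate token counts to the daily spend limit, the rates can be turned into a small lookup. This is a sketch, assuming the figures in the table are US dollars per 1,000 tokens exactly as its header states; the names `RATES_PER_1K` and `estimate_cost` are made up for illustration, and the values are copied from the table unchanged.

```python
# (input rate, output rate) per model, copied from the table above
# and read as dollars per 1,000 tokens (assumption based on the table header)
RATES_PER_1K = {
    "gpt_35_turbo": (0.0000005, 0.0000015),
    "gpt_4o": (0.000005, 0.000015),
    "gpt_4o_2024_08_06": (0.0000025, 0.00001),
    "gpt_4o_mini": (0.00000015, 0.0000006),
    "o1_mini": (0.000003, 0.0000012),
    "o1_preview": (0.000015, 0.00006),
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the cost of one request from its input and output token counts."""
    rate_in, rate_out = RATES_PER_1K[model]
    return prompt_tokens / 1000 * rate_in + completion_tokens / 1000 * rate_out
```

With the OpenAI client, the two counts are available on a response as `response.usage.prompt_tokens` and `response.usage.completion_tokens`.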






# Prepare sample data 

In [2]:
# Libraries for parsing data
import os
import pandas as pd
from lxml import etree
from bs4 import BeautifulSoup

In the cell below, you can change the corpus directory to the dataset which you are working with.

In [3]:
# Set corpus to the folder of files you want to use
corpus = '/home/ec2-user/SageMaker/data/SAMPLEDATA/'

# Read in files
input_files = os.listdir(corpus)

# Select the number of articles to sample
sample_size = 30

# Generate a sample of articles
# Slicing never raises here: if there are fewer files than sample_size, all of them are returned
sample_input_files = input_files[:sample_size]
    
print("Currently sampling", len(sample_input_files), "documents.")

Currently sampling 30 documents.
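
The cell above takes the first files in whatever order `os.listdir` returns them, which is not guaranteed to be stable or representative. If a random but repeatable sample is preferred, something like the following could be used instead. It is a sketch: `sample_files` and its `seed` parameter are hypothetical names, not part of the original notebook.

```python
import random


def sample_files(file_names, sample_size, seed=42):
    """Return a reproducible random sample of file names."""
    # Sort first so the result does not depend on directory listing order
    files = sorted(file_names)
    if sample_size >= len(files):
        return files
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    return rng.sample(files, sample_size)
```

Calling `sample_files(input_files, sample_size)` would then stand in for the slice.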


In [4]:
# Function to strip html tags from text portion
def strip_html_tags(text):
    # Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    stripped = BeautifulSoup(text, 'lxml').get_text().replace('\n', ' ').replace('\\', '').strip()
    return stripped

# Retrieve metadata from XML document
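def getxmlcontent(corpus, file, strip_html=True):
    # Default every field to None so the return statement still works if parsing fails
    goid = title = date = publisher = text = None
    try:
        tree = etree.parse(os.path.join(corpus, file))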
        root = tree.getroot()

        if root.find('.//GOID') is not None:
            goid = root.find('.//GOID').text
        else:
            goid = None

        if root.find('.//Title') is not None:
            title = root.find('.//Title').text
        else:
            title = None

        if root.find('.//NumericDate') is not None:
            date = root.find('.//NumericDate').text
        else:
            date = None
            
        if root.find('.//PublisherName') is not None:
            publisher = root.find('.//PublisherName').text
        else:
            publisher = None

        if root.find('.//FullText') is not None:
            text = root.find('.//FullText').text

        elif root.find('.//HiddenText') is not None:
            text = root.find('.//HiddenText').text

        elif root.find('.//Text') is not None:
            text = root.find('.//Text').text

        else:
            text = None

        # Strip html from text portion
        if text is not None and strip_html:
            text = strip_html_tags(text)
    
    except Exception as e:
        print(f"Error while parsing file {file}: {e}")
    
    return goid, title, date, publisher, text
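
The repeated find-then-read pattern in `getxmlcontent` can be condensed with `findtext`, which returns `None` when no element matches. The sketch below uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without extra packages; the helper name `first_text` and the sample record are hypothetical, and the record is only loosely shaped like the documents parsed above.

```python
import xml.etree.ElementTree as ET


def first_text(root, tags):
    """Return the text of the first tag in `tags` found anywhere under root, else None."""
    for tag in tags:
        value = root.findtext(f".//{tag}")
        if value is not None:
            return value
    return None


# Hypothetical record for illustration only
sample_xml = (
    "<Record><GOID>123</GOID><Obj><Title>Example</Title></Obj>"
    "<HiddenText>Body text</HiddenText></Record>"
)
root = ET.fromstring(sample_xml)

goid = first_text(root, ["GOID"])
text = first_text(root, ["FullText", "HiddenText", "Text"])  # same fallback order as above
publisher = first_text(root, ["PublisherName"])  # absent in this record
```

One difference to keep in mind: in `xml.etree.ElementTree`, `findtext` returns an empty string for an element that exists but has no text, whereas reading `.text` directly gives `None`.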

In [5]:
# Create lists to store GOIDs and text
goid_list = []
date_list = []
text_list = []

for file in sample_input_files:
    
    goid, title, date, publisher, text = getxmlcontent(corpus, file, strip_html=True)
    
    if text is not None:
        goid_list.append(goid)
        date_list.append(date)
        text_list.append(text)


Create Dataframe: This section uses the collected fields to make a dataframe.

In [6]:
df_text = pd.DataFrame({'GOID': goid_list, 'Date':date_list, 'Text': text_list})
df_text.set_index('GOID')

GOID,Date,Text
1323379997,2012-01-01,About the Authors: Steffen Dommerich Contr...
1671014263,2015-04-01,About the Authors: Sirinart Techa Affiliat...
2276833012,2019-08-01,About the Authors: David Benrimoh Roles Co...
1289270399,2010-02-01,About the Authors: Vibha Gupta Contributed...
1344508114,2012-09-01,About the Authors: Jing Zhao Contributed e...
1304981084,2011-06-01,About the Authors: Jesse A. Solomon Affili...
2186082215,2019-02-01,About the Authors: Jonathan Steinke Roles ...
1875828310,2017-03-01,About the Authors: Björn Hansson Contribut...
1325499152,2012-07-01,About the Authors: Joëlle K. Muhlemann Aff...
1764879334,2016-02-01,About the Authors: Sameh Rabhi Affiliation...


# Using ChatGPT for Information Extraction

In [1]:
import openai
from openai import OpenAI, OpenAIError
import tiktoken
import time
import json

In [None]:
# Set your key to your Academic AI Platform Key
client = OpenAI(api_key="",
    base_url="https://agai-proxy.prod.int.tdmstudio.proquest.com/large-language-models-openai-compatible/")
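
Rather than pasting the key into the notebook, it can be read from an environment variable so that it is not saved with the file. This is a minimal sketch: the variable name `ACADEMIC_AI_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration, not names defined by the platform.

```python
import os


def load_api_key(env_var="ACADEMIC_AI_API_KEY"):
    """Read the API key from an environment variable (hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key
```

The client could then be created with `OpenAI(api_key=load_api_key(), base_url=...)`, keeping the same `base_url` as in the cell above.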

## Approach 1: Explore prompts on a single data point

In [2]:
# Modify the prompt as needed
prompt = 'Summarize the text.'

In [None]:
sample_text = df_text['Text'][0]

In [69]:
response = client.chat.completions.create(
    # Modify the model here with values from "Models available" of API Usage section above         
    model='gpt_4o',
    messages=[
    # The system message sets the assistant's behavior; the user messages carry the prompt and the text
        {'role': 'system', 'content': "You are a helpful assistant."},
        {'role': 'user', 'content': prompt},
        {"role": "user", "content": sample_text}
    ],
)
result = response.choices[0].message.content
print(result)

The authors Clare E. McElcheran, Benson Yang, Kevan J. T. Anderson, Laleh Golenstani-Rad, and Simon J. Graham have conducted a study on the safety and feasibility of using parallel radiofrequency transmission (pTx) in 3 Tesla magnetic resonance imaging (MRI) to reduce heating in long conductive leads. Such leads are used in deep brain stimulation (DBS) implants to treat neurological disorders. 

MRI procedures can cause localized heating around these leads due to the electric component of the RF transmission field, posing a risk of tissue damage. The study investigates using pTx with static RF shimming, where different coil elements transmit independently with adjusted amplitudes and phases, to minimize this heating effect while maintaining the necessary B1-field homogeneity for effective imaging.

Simulations and experimental validation showed that using pTx with optimized amplitude and phase settings could significantly reduce E-field at the lead tip and along the wire, resulting in 

## Approach 2: Run ChatGPT on a batch of data using JSON output

### Create records to keep track of tokens used
Run this section to manually update the token usage every 24 hours

In [8]:
# Modify cost record output to desired save name
token_file = '/home/ec2-user/SageMaker/chatgpt_token.csv'

In [9]:
# call this function if you want to refresh/clear token file and restart the count  
def clear_token_record(token_file):
    # refresh/clear token file only if you want to restart the count
    if os.path.exists(token_file):
        os.remove(token_file)
        print(f"{token_file} has been deleted.")

clear_token_record(token_file)

In [10]:
def get_token_record(token_file):
    # get the previous token usage info and keep adding onto it
    if os.path.exists(token_file):
        # If the file exists, read the CSV file into a DataFrame
        runs_record = pd.read_csv(token_file).reset_index(drop=True)
    else:
        # If the file does not exist, return an empty DataFrame
        runs_record = pd.DataFrame()
    
    return runs_record

# get the previous token usage info and keep adding onto it
token_record = get_token_record(token_file)
token_record
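
To see how many tokens have accumulated since the record was last cleared, the saved file can be summed. The sketch below assumes the file has a single `Token` column, which is how the notebook writes it further down; the helper name `total_tokens_recorded` is hypothetical, and the standard `csv` module is used only to keep the example free of extra dependencies.

```python
import csv
import os


def total_tokens_recorded(token_file):
    """Sum the 'Token' column of the usage record; return 0 if the file does not exist."""
    if not os.path.exists(token_file):
        return 0
    with open(token_file, newline="") as handle:
        reader = csv.DictReader(handle)
        # Skip blank cells; values may have been written as floats by pandas
        return sum(int(float(row["Token"])) for row in reader if row.get("Token"))
```

Because the record carries no timestamps, the total covers everything since the last call to `clear_token_record`, not a specific 24-hour window.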

### Create records to keep track of all output from ChatGPT
Run this section to load the previous output and populate it with the new results

In [11]:
# Modify result output file to desired save name
output_file = '/home/ec2-user/SageMaker/chatgpt_output.csv'

In [12]:
# call this function if you want to refresh/clear output file and restart the count     
def clear_output_record(output_file):
    # refresh/clear output file only if you want to restart the count
    if os.path.exists(output_file):
        os.remove(output_file)
        print(f"{output_file} has been deleted.")

clear_output_record(output_file)

In [13]:
def get_output_record(output_file):
    # get the previous output info and keep adding onto it
    if os.path.exists(output_file):
        # If the file exists, read the CSV file into a DataFrame
        output_record = pd.read_csv(output_file).reset_index(drop=True)
    else:
        # If the file does not exist, return an empty DataFrame
        output_record = pd.DataFrame()
    
    return output_record

# get the previous output info and keep adding onto it
output_record = get_output_record(output_file)
output_record

### Run ChatGPT on a batch of data

In [6]:
# Modify the prompt as needed
prompt = 'Output one keyword for the topic and one keyword for the methodology of the given text.'

In [15]:
# lists for chatgpt output
topic_list = []
methodology_list = []

# lists for document date, token cost, and index of the document in original dataset
date_list = []
cost_list = []
index_list = []


if output_record.empty:
    start_index = 0
else:
    start_index = output_record['Index'].iloc[-1] + 1
print(f"Starting from row {start_index}.")
stop = False

for index, (row_index, row_data) in enumerate(df_text[start_index:].iterrows()):
# for index, row in df_text.iterrows():
    if stop:
        break
    text = row_data['Text']
    date = row_data['Date']
    
    while True:
        try:
            response = client.chat.completions.create(
                # Modify the model here with values from "Models available" of API Usage section above         
                model='gpt_4o',
                messages=[
                # Modify the output json format                      
                    {'role': 'system', 'content': 
                     """Return in JSON format: 
                        {
                         'topic': ['topic'], 
                         'methodology': ['methodology']
                        }
                    """},
                    {'role': 'user', 'content': prompt},
                    {"role": "user", "content": text}
                ],
                response_format={
                    "type": "json_object"
                }
            )
            result = response.choices[0].message.content
            json_response = json.loads(result)
#             print(json_response)
            
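            # The model may return a list or a bare string. Index only lists;
            # indexing a string would keep just its first character.
            topic = json_response.get("topic") or ""
            if isinstance(topic, list):
                topic = topic[0] if topic else ""
                
            methodology = json_response.get("methodology") or ""
            if isinstance(methodology, list):
                methodology = methodology[0] if methodology else ""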
                
            total_tokens = response.usage.total_tokens
            print(f"Topic: {topic}, Methodology:{methodology}, Index:{index + start_index}, Cost: {total_tokens}")
            
            topic_list.append(topic or "")
            methodology_list.append(methodology or "")
            date_list.append(date)
            cost_list.append(total_tokens)
            index_list.append(index + start_index)
            
            # Break the loop if request is successful
            break
    
        except openai.RateLimitError as e:
            error_message = str(e)
            if "Application token/minute rate exceeded" in error_message:
                print("Error code: 429 - Application token/minute rate exceeded")
                print("Will wait for 60 seconds and retry.")
                time.sleep(60) 
            elif "Application cost/day rate exceeded" in error_message:
                print("Error code: 429 - Application cost/day rate exceeded")
                print("End program.")
                stop = True
                break
            else:
                print("Other rate limit exceeded, will wait and retry.")
                time.sleep(60)
                
        except openai.BadRequestError as e:
            print(f"OpenAI BAD REQUEST error: {e}")
            print("Will skip this prompt.")
            # Modify the code below to handle 400 Bad Request errors, which occur when the server cannot process the request due to Azure OpenAI's content management policy. Implement the necessary logic as required for your program. 
            topic_list.append("")
            methodology_list.append("")
            date_list.append(date)
            cost_list.append(0)
            index_list.append(index + start_index)
            break
            
        except OpenAIError as e:
            print(f"HTTP error: {e}")
            stop = True
            break

 


Starting from row 0.
Topic: C, Methodology:H, Index:0, Cost: 14367
Topic: molting hormones, Methodology:qRT-PCR, Index:1, Cost: 15820
Topic: schizophrenia, Methodology:computational modelling, Index:2, Cost: 15372
Topic: Mtb-BirA, Methodology:crystallography, Index:3, Cost: 12293
Error code: 429 - Application token/minute rate exceeded
Will wait for 60 seconds and retry.
Topic: cardiovascular disease, Methodology:network-based approach, Index:4, Cost: 13332
Topic: ALS, Methodology:mouse models, Index:5, Cost: 19611
Topic: agriculture, Methodology:Positive Deviance, Index:6, Cost: 11676
Topic: Adipocyte Biology, Methodology:RT-qPCR, Index:7, Cost: 11452
Topic: floral_development, Methodology:transcriptome_metabolome_analysis, Index:8, Cost: 16978
Topic: leishmaniasis, Methodology:transcriptomic analysis, Index:9, Cost: 15197
Topic: emotion, Methodology:behavioral-coding, Index:10, Cost: 16598
Topic: food web, Methodology:stable isotope analysis, Index:11, Cost: 10716
Error code: 429 - A
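
The loop above waits a fixed 60 seconds on a rate-limit error and retries without an upper bound. An alternative is to cap the number of attempts and grow the delay after each failure. The following is a generic sketch, not the platform's recommended approach: `call_with_backoff` and all of its parameters are hypothetical names, and whether shorter initial waits help depends on how the per-minute limit is enforced.

```python
import time


def call_with_backoff(func, is_retryable, max_attempts=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise  # give up: out of attempts, or not worth retrying
            # Delay doubles each time: base, 2*base, 4*base, ... capped at max_delay
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay)
```

For the request in this notebook, `func` could be a small function that wraps `client.chat.completions.create(...)`, and `is_retryable` could be `lambda e: isinstance(e, openai.RateLimitError)`.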

In [18]:
topic_list 

['C',
 'molting hormones',
 'schizophrenia',
 'Mtb-BirA',
 'cardiovascular disease',
 'ALS',
 'agriculture',
 'Adipocyte Biology',
 'floral_development',
 'leishmaniasis',
 'emotion',
 'food web',
 'Fatigue',
 'SpatialResolution',
 'forest growth',
 'biomass',
 'AAN',
 'PCNA',
 'structural balance',
 'sex-specific gene expression',
 'Neurostimulation',
 'muscle contraction',
 'root',
 'liver transplantation',
 'non-consent',
 'Behavioral Interactions',
 'glucose_metabolism',
 'Inflammation',
 'Gliomas',
 'MRI']

In [19]:
result_df = pd.DataFrame()
result_df['Index'] = index_list
result_df['Date'] = date_list
result_df['Topic'] = topic_list
result_df['Methodology'] = methodology_list
# result_df['Publication'] = publication_list
result_df

,Index,Date,Topic,Methodology
0,0,2012-01-01,C,H
1,1,2015-04-01,molting hormones,qRT-PCR
2,2,2019-08-01,schizophrenia,computational modelling
3,3,2010-02-01,Mtb-BirA,crystallography
4,4,2012-09-01,cardiovascular disease,network-based approach
5,5,2011-06-01,ALS,mouse models
6,6,2019-02-01,agriculture,Positive Deviance
7,7,2017-03-01,Adipocyte Biology,RT-qPCR
8,8,2012-07-01,floral_development,transcriptome_metabolome_analysis
9,9,2016-02-01,leishmaniasis,transcriptomic analysis


In [20]:
token_df = pd.DataFrame()
token_df['Token'] = cost_list
token_df

,Token
0,14367
1,15820
2,15372
3,12293
4,13332
5,19611
6,11676
7,11452
8,16978
9,15197


### Save Output to CSV

In [21]:
# save output to output_file
output_record = pd.concat([output_record, result_df], ignore_index=True)
output_record.to_csv(output_file, index=False)

In [22]:
# save token usage to token_file
token_record = pd.concat([token_record, token_df], ignore_index=True)
token_record.to_csv(token_file, index=False)

-- End of approach 2 --

# Instructions to set up environment

### Create Virtual Environment

We need to first install openai packages in the environment we want to use. 
In your workbench, find the **New** button (top right), and select **Terminal** from the drop-down menu.
1. Create a new environment: 
     <br><code>conda create -n chatgpt python=3.11 ipykernel</code>
2. Check that the environment was successfully created: 
     <br><code>conda env list</code>
     <br>You should see your environment (chatgpt) listed under "# conda environments".
3. Activate the environment: 
     <br><code>source activate chatgpt</code>
4. Register the new environment as a jupyter kernel:
     <br><code>python -m ipykernel install --prefix=/home/ec2-user/SageMaker/.jupyter --name chatgpt</code>
5. Install the openai package and the other libraries this notebook uses:
     <br><code>conda install openai pandas lxml bs4 tiktoken</code>
     
     
### Select the Environment for Notebook

Switch the kernel of the current notebook to the environment we just created.
1. Click **Kernel** in the top toolbar of your Jupyter notebook. You will see a dropdown with several menu options.
2. In the dropdown menu, mouse over **Change kernel**, and select the name of the environment. It should look like conda_chatgpt.
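
After switching kernels, a quick check from a notebook cell can confirm that the notebook is running in the new environment and that the libraries are importable. This is a small sketch; `missing_packages` is a hypothetical helper, and the package list simply mirrors the install command above.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point inside the chatgpt environment
print("Interpreter:", sys.executable)
print("Missing packages:", missing_packages(["openai", "pandas", "lxml", "bs4", "tiktoken"]))
```

An empty list means every package was found; any names printed would need to be installed in the environment before running the notebook.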
