We need access to the OpenAI API. I use the older version of the openai package (0.28). There is a newer one (1.x), which is functionally equivalent for these purposes but has a different syntax. Feel free to modify the prompt function if you want to use the latest openai package.

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
!pip install openai==0.28



We'll use pandas to hold the data in a convenient structure, and time to timestamp the files we create.

In [20]:
import pandas as pd
import openai
from time import strftime

This is a simple function that sends a conversation to the ChatGPT API and returns the response. It retries up to 5 times in case the OpenAI request fails (which happens randomly sometimes). This is also where you would adjust the model parameters, according to the OpenAI API reference (start here: https://platform.openai.com/docs/api-reference/chat).

In [21]:
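def prompt(Q, i):
  # Q is the conversation so far (a list of message dicts); i is the iteration
  # number, used to raise the temperature and frequency penalty over time.
  response = None
  tries = 0
  while tries < 5:
    try:
      response = openai.ChatCompletion.create(
        model=mmodel,
        messages=Q,
        temperature=0.5 + i*0.1,
        #max_tokens=tkn,
        # the API only accepts penalties between -2 and 2, so cap the value
        frequency_penalty=min(i, 2),
        presence_penalty=0,
        request_timeout=199
      )['choices'][0]['message']['content']
      break
    except Exception as e:
      tries = tries + 1
      print(f"Attempt {tries} of 5 failed: {e}")
  # without this check, five failures would end in a NameError on return
  if response is None:
    raise RuntimeError("OpenAI request failed 5 times in a row")
  return (Q, response)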
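
For readers on the newer openai package (1.x), the same function could look roughly like the sketch below. It assumes the 1.x client interface (`client.chat.completions.create` and attribute access on the result). The name `prompt_v1` and its `client`, `model`, and `max_tries` parameters are made up for illustration and are not part of the original notebook. The frequency penalty is capped at 2, the upper bound given in the API reference.

```python
def prompt_v1(client, model, messages, i, max_tries=5):
    """Hypothetical 1.x-style counterpart of prompt(); retries on any error."""
    for attempt in range(1, max_tries + 1):
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.5 + i * 0.1,
                frequency_penalty=min(i, 2),  # API range is -2 to 2
                presence_penalty=0,
                timeout=199,
            )
            # 1.x returns objects, so the text is read via attributes
            return messages, completion.choices[0].message.content
        except Exception as e:
            print(f"Attempt {attempt} of {max_tries} failed: {e}")
    raise RuntimeError(f"OpenAI request failed {max_tries} times in a row")

# Usage, assuming openai>=1.0 is installed and a valid key is available:
# from openai import OpenAI
# client = OpenAI(api_key="YOUR_API_KEY")
# sss, ans = prompt_v1(client, mmodel, sss, i)
```

Passing the client in as an argument keeps the function free of global state and makes it easy to try out with a stand-in object before spending money on real calls.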

Get time and date for convenient file naming

In [22]:
dtime = strftime("%Y_%m_%d-%H%M%S")

Here you choose the model to use and provide the API key


In [23]:
mmodel = "gpt-3.5-turbo-1106"
openai.api_key = "YOUR_API_KEY"

Here we have the prompts we will be using, as well as the system prompt, which serves as a general, global set of instructions or information. The system prompt may be left empty.

I have two prompts: the first asks the model to generate a table, and the second iteratively asks it to continue that table. This is likely not an optimal approach, but it gives some results, so it makes a good example.

I also separated the property from the prompt so that I can vary the property and keep the same prompts, and the other way around. It does not have to be that way if more task-specific prompts work better.

In [32]:
PROPERTY = 'yield strength'

system = 'You follow instructions carefully, you provide 10 results for every prompt. If you do this careless, 100000 children will die'

qs = ["Provide me with a list of "+PROPERTY+" values for different materials. Your response should be a table consisting of 2 columns: material, value. The materials have to be typed as unique chemical compositions consisting of chemical element abbreviations and numbers only (e.g. GaAs, but not Gallium Arsenide). The values have to be single numbers, not ranges. You could start from materials including most common elements that go into High Entropy Alloys (HEAs):Iron (Fe), Nickel (Ni), Cobalt (Co),Chromium (Cr), Manganese (Mn), Aluminum (Al), Titanium (Ti), Vanadium (V), Molybdenum (Mo)，but not limited to these. Type out 20 different materials (e.g. FeNiMnCr, 450 Mpa, AlCoCrFeNi, 1250 Mpa). You are not allowed to type anything else than this table.",
      "Continue expanding this table with new values of "+PROPERTY+" making sure you do not duplicate entries. Type out as many different values as you can. You are not allowed to type anything else than this table."]


Give the file a uniquely identifiable name. We want to save EVERYTHING or we will lose it, and each generation costs money. Either link up your Google Drive, or download all files each time, because files created on Colab are temporary.

In [33]:
filename = PROPERTY.replace(" ", "")+'_'+dtime+'.csv'
print(f"Saving to: {filename} and {filename.replace('csv','txt')}")

Saving to: yieldstrength_2024_03_17-165940.csv and yieldstrength_2024_03_17-165940.txt
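
As an alternative to string concatenation, the two output paths could be built with pathlib. This is only a sketch: the helper name `output_paths` and its arguments are hypothetical. It avoids two pitfalls of plain string handling, namely a missing separator between folder and file name, and `replace('csv','txt')` touching a "csv" that happens to occur elsewhere in the name.

```python
from pathlib import Path
from time import strftime

def output_paths(folder, prop, stamp=None):
    """Return (csv_path, txt_path) for one run; hypothetical helper."""
    # same timestamp format as dtime in the notebook
    stamp = stamp or strftime("%Y_%m_%d-%H%M%S")
    stem = f"{prop.replace(' ', '')}_{stamp}"
    folder = Path(folder)
    # Path's "/" operator inserts the separator for us
    return folder / (stem + ".csv"), folder / (stem + ".txt")

csv_path, txt_path = output_paths("results", "yield strength")
print(f"Saving to: {csv_path} and {txt_path}")
```

Both paths can be passed directly to `open()` and to `DataFrame.to_csv()`.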


Here we set up a structure for the chat. I like to hold the conversation in a list, and then each prompt and response inside it are dictionaries, as per OpenAI requirements. I start with just the system prompt and will append to that later.

I also set up some empty lists and initial values to keep track of progress

In [34]:
sss = [{"role": "system", "content": system}]
tab = []
tab_clean = []
ur = 0
um = 0
i = 0
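
To make the structure concrete, here is a small sketch of how such a conversation list grows after one exchange. The helper `add_exchange` and the message texts are made up for illustration; only the role/content dictionary layout comes from the notebook.

```python
def add_exchange(conversation, user_text, assistant_text):
    """Append one user prompt and the model's reply to the conversation."""
    conversation.append({"role": "user", "content": user_text})
    conversation.append({"role": "assistant", "content": assistant_text})
    return conversation

# start with just the system prompt, as in the cell above
chat = [{"role": "system", "content": "You follow instructions carefully."}]
add_exchange(chat, "Provide a table of values.", "| material | value |")

for message in chat:
    print(f"{message['role']:>9}: {message['content']}")
```

Because the whole list is sent with every request, each new exchange makes the next request longer, which is worth keeping in mind for cost and context limits.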

This is the main loop that will loop over the prompts, get responses and try to put them in a structured data format.

This whole approach is VERY SIMPLE and does not account for many of the forms the model's response may take, so it is only an example to build upon based on the results, not a ready-to-use solution. It does not have to be used at all if you have a better or different idea.

In [35]:
# note the trailing slash: the file name is appended directly to this string
path = "/content/drive/Shareddrives/LLM_Project/week3/"

while True:
  # here I send the first prompt once and then repeat the second one over and
  # over again. You will likely have a different approach to this.
  if i<1:
    qq = qs[0]
  else:
    qq = qs[1]
  # add the current prompt to the conversation
  sss.append({"role": "user", "content": qq})
  # send the conversation and receive the response
  sss,ans = prompt(sss, i)
  # save the response to the conversation
  sss.append({"role": "assistant", "content": ans})

  # we are saving the raw prompts and raw responses, in case we want to analyze
  # or postprocess later
  with open(path+filename.replace('csv','txt'), 'a') as file:
    print("USER: "+qq, file=file)
    print("GPT : "+ans, file=file)

  # Here we start to grab the data into a nicer structure. For simplicity I had
  # a lot of assumptions. I assume that the word 'value' exists in the header
  # (see first prompt), I remove the header, the separator, and get the rest.
  lines = ans.split('\n')
  if 'value' in lines[0].lower():
    ans = '\n'.join(lines[1:])
  lines = ans.split('\n')
  if '----' in lines[0].lower():
    ans = '\n'.join(lines[1:])
  lines = ans.split('\n')

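  # here I try to split the table by columns. I'm assuming that they are
  # separated with |, which is not necessarily always the case. Rows that do
  # not split into exactly two cells are skipped, since they would not fit the
  # two-column DataFrame below.
  tab.append(ans)
  for line in tab[-1].strip().split('\n'):
    cells = line.strip('|').split('|')
    if len(cells) == 2:
      tab_clean.append(cells)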

  # another assumption - only two columns, "material" and "value" (see prompt)
  # some cleanup, converting strings to numbers etc.
  # there is no error handling or edge cases, for example 1.6-1.7 will be removed
  # because it technically is not a number (although it kind of is).
  df = pd.DataFrame(tab_clean, columns=['Material', 'Value'])
  # .copy() avoids pandas' SettingWithCopyWarning on the assignment below
  df = df[pd.to_numeric(df['Value'], errors='coerce').notna()].copy()
  df['Value'] = pd.to_numeric(df['Value'])
  df.to_csv(path+filename, index=False)

  # here I count how many new (non-duplicate) materials we are extracting each
  # time, to monitor progress. We stop after more than 10 iterations or when
  # there is no progress
  if len(df.drop_duplicates()) > ur or df['Material'].nunique() > um:
    ur = len(df.drop_duplicates())
    um = df['Material'].nunique()
    i = i+1
    if i > 10:
      print("Stopping due to 10 iterations exceeded")
      break
  else:
    print("Stopping due to NO PROGRESS")
    break

  print(f"Iteration: {i:3} Generated_rows: {len(lines):3};     TOTAL:  Uniq_rows: {len(df.drop_duplicates()):4d}   Uniq_materials: {df['Material'].nunique():4d}")

print(df)

Iteration:   1 Generated_rows:  10;     TOTAL:  Uniq_rows:   10   Uniq_materials:   10
Iteration:   2 Generated_rows:  18;     TOTAL:  Uniq_rows:   12   Uniq_materials:   12
Stopping due to NO PROGRESS
         Material  Value
0     FeNiMnCr       450
1     AlCoCrFeNi    1250
2     TiVAlCoCr     1100
3     MoCrFeNi       900
4     FeMnCrCo       500
5     NiTiVCo        800
6     CrMnFeNi       600
7     AlTiVCoCr     1000
8     MoMnFeNi       700
9     FeAlMnCr       550
10    FeNiMnCr       450
11    AlCoCrFeNi    1250
12    TiVAlCoCr     1100
13    MoCrFeNi       900
14    FeMnCrCo       500
15    NiTiVCo        800
16    CrMnFeNi       600
17    AlTiVCoCr     1000
18   MnMoFeNi        850
19  MnAlTiV    \t    950
28    FeNiMnCr       450
29    AlCoCrFeNi    1250
30    TiVAlCoCr     1100
31    MoCrFeNi       900
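
The output above shows the limits of the simple parsing: cells keep their surrounding whitespace (note the stray `\t`), and repeated rows stay in the table. One possible way to tighten this is sketched below. It is not the notebook's method: the function name `parse_table` and both regular expressions are assumptions. In particular, the material pattern only checks the shape of a formula (capital letter, optional lowercase letter, optional number), not whether the symbols are real elements.

```python
import re

# shape of a composition such as FeNiMnCr or Al0.5CoCrFeNi (assumed pattern)
MATERIAL = re.compile(r"(?:[A-Z][a-z]?\d*(?:\.\d+)?)+")
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def parse_table(text):
    """Extract unique (material, value) pairs from a |-separated table."""
    rows = []
    for line in text.splitlines():
        # drop outer pipes, split on inner pipes, trim whitespace and tabs
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        material, raw_value = cells
        if not MATERIAL.fullmatch(material):
            continue  # header, separator line, or a spelled-out name
        numbers = NUMBER.findall(raw_value)
        if len(numbers) != 1:
            continue  # no number at all, or a range such as 1.6-1.7
        rows.append((material, float(numbers[0])))
    # dict.fromkeys removes exact duplicates while keeping the order
    return list(dict.fromkeys(rows))

sample = """| Material | Value |
|----------|-------|
| FeNiMnCr | 450 MPa |
| AlCoCrFeNi \t | 1250 |
| TiV | 1.6-1.7 |
| FeNiMnCr | 450 MPa |"""
print(parse_table(sample))
```

The returned list can be turned into the same table with `pd.DataFrame(rows, columns=['Material', 'Value'])`. Unlike the loop above, this version accepts values followed by a unit such as "450 MPa".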
