# Simulating Write-in Occupations with LLMs

This notebook illustrates how we simulated write-in occupations to supplement the training data for our fine-tuned occupation classifier. Why is this necessary? The Federal Election Commission (FEC) collects occupation data through a free-text box rather than a predefined menu, so entries often contain misspellings or ambiguous job descriptions. The Bureau of Labor Statistics (BLS) does provide a list of alternative spellings and names for occupations, but its coverage is relatively small and uneven.

To improve the performance of our fine-tuned model, we decided to simulate additional examples of how people write out their occupations. After reading through the FEC data ourselves, we designed a prompt that captures common patterns in these write-ins: misspellings, under-specification, over-specification, and the like.

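We then turned this into a dynamic prompt into which each BLS occupation group is substituted in turn. The prompt is written as an analogy: it shows one occupation followed by example write-ins, then asks the model to complete the same pattern for a new occupation.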
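As a minimal sketch of the substitution step, the helper below assembles the same prompt string that the code cell in this notebook builds inline. The names `INSTRUCTION`, `EXAMPLE`, `build_prompt`, and `n_variants` are hypothetical and introduced only for illustration; the prompt wording, including its spelling, is copied verbatim from the notebook so that the sketch reproduces the prompt actually sent.

```python
# Instruction and worked example copied verbatim from the notebook's prompt.
INSTRUCTION = "Please complete the analogy and include mispellings."
EXAMPLE = (
    "Lawyer (13) = Attorney; Atorney; Attorney-at-Law; Lawer; Patent Lawyer; "
    "Corporate Lawyer; Finance Lawyer; Human Rights Attorney; lawyer; lawyr; "
    "attorne; attorney at law; trial lawyer."
)


def build_prompt(occupation: str, n_variants: int = 100) -> str:
    """Return the analogy prompt with one occupation substituted in.

    Hypothetical helper: n_variants is the number shown in parentheses
    after the occupation (100 in the notebook).
    """
    return f"{INSTRUCTION}\n {EXAMPLE}\n {occupation}({n_variants}) ="


# "Registered Nurses" is an illustrative value, not taken from the data.
print(build_prompt("Registered Nurses"))
```

Keeping the template in one function makes it easier to vary the worked example or the requested number of variants without touching the API loop.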

# Load Data

In [1]:
import os
api_key = os.getenv("OPENAI_API_KEY")

import openai
import pandas as pd

# Load the data into a pandas DataFrame
path = '/content/drive/My Drive/FEC_project/occupationsimulation_22sept2023.csv'
blsoccupation = pd.read_csv(path)

# Set up the OpenAI API key (read from the OPENAI_API_KEY environment variable above)
openai.api_key = api_key

# New column initialization
blsoccupation['completion'] = ''

# Generate completions using the Chat completions API
for index, row in blsoccupation.iterrows():
    # Get the text from the current row
    text = row['occupation']

    # Create chat prompt
    messages = [{"role": "user", "content": 'Please complete the analogy and include mispellings.\n Lawyer (13) = Attorney; Atorney; Attorney-at-Law; Lawer; Patent Lawyer; Corporate Lawyer; Finance Lawyer; Human Rights Attorney; lawyer; lawyr; attorne; attorney at law; trial lawyer.\n ' + text + '(100) ='}]

    # Create completion using chat format
    # (legacy interface: openai.ChatCompletion requires openai<1.0)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages
    )

    # Append the response to the 'completion' column
    blsoccupation.at[index, 'completion'] = response.choices[0].message['content']

# Save the initial output
blsoccupation.to_csv("/content/drive/My Drive/FEC_project/occupationsimulation_updated_22sept2023.csv", index=False)

# Tidy up the format of the output: replace tab separators with ', '
blsoccupation['completion'] = blsoccupation['completion'].str.replace('\t', ', ', regex=False)

# Save the tidied DataFrame
blsoccupation.to_csv("/content/drive/My Drive/FEC_project/occupationsimulation_tidy_22sept2023.csv", index=False)
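
To use the completions as training data, each one eventually has to be broken into individual write-in variants. The sketch below shows one way this could be done; it is not part of the original pipeline. The function name `split_variants` is hypothetical, and it assumes that variants are separated by semicolons (as in the prompt's example), commas (as produced by the tidy step), tabs, or newlines, so a variant that itself contains a comma would be split in two.

```python
import re
from typing import List


def split_variants(completion: str) -> List[str]:
    """Split one model completion into a de-duplicated list of variants.

    Assumption: ';', ',', tab, and newline all act as separators.
    Case is preserved, so 'Lawyer' and 'lawyer' count as distinct variants.
    """
    # Drop surrounding whitespace and the sentence-final period, if any.
    text = completion.strip().rstrip(".")
    variants = []
    seen = set()
    for piece in re.split(r"[;,\t\n]", text):
        variant = piece.strip()
        # Skip empty pieces and exact repeats, keeping first-seen order.
        if variant and variant not in seen:
            seen.add(variant)
            variants.append(variant)
    return variants


# Illustrative input in the same style as the prompt's worked example.
print(split_variants("Attorney; Atorney; lawyer; lawyer; trial lawyer."))
```

Applied to the DataFrame, this could look like `blsoccupation['completion'].apply(split_variants)` followed by `DataFrame.explode` on that column to get one row per occupation and variant pair.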
