# Layoff Text Classification with OpenAI GPT

In [1]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas()
import os
import openai
OPENAI_API_KEY = ...
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

## Pre-process
Load the labeled layoff data and keep only the ID, the text, and the four label columns before sending any text to the API.

In [2]:
data = pd.read_csv("data/bwp_layoff_merged_full.csv")[["keydevid", "situation_y", "Blue", "White", "Pink", "Unsure"]]
data = data.fillna(0)
data

Unnamed: 0,keydevid,situation_y,Blue,White,Pink,Unsure
0,3186609,ADC Telecommunications Inc. announced that it ...,1.0,0.0,0.0,0.0
1,4121312,ADC Telecommunications Inc. announced that it ...,0.0,0.0,1.0,0.0
2,4598572,ADC Telecommunications Inc. announced that it ...,0.0,0.0,0.0,1.0
3,64522,AMR Corp. announced that it will layoff approx...,0.0,0.0,0.0,1.0
4,343972,"AMR Corp. announced it would cut 20,000 jobs a...",0.0,0.0,0.0,1.0
...,...,...,...,...,...,...
3782,310365725,Fifth Third Bancorp to shutter 5 branches in M...,0.0,0.0,1.0,0.0
3783,300799490,Safeway announced that it is closing nine loca...,0.0,0.0,1.0,0.0
3784,6123745,Building Materials Holding Corp. announced tha...,0.0,0.0,1.0,0.0
3785,346753390,Republic Airways Holdings Inc. will permanentl...,0.0,0.0,1.0,0.0


In [3]:
# Optional: evaluate on a random subset to limit the number of API calls
sample_size = 20  # number of rows to classify; change as needed

data_sample = data.sample(n=sample_size, random_state=142)
data_sample

Unnamed: 0,keydevid,situation_y,Blue,White,Pink,Unsure
2140,115881805,Whirlpool Corp. will no longer manufacturer ap...,1.0,0.0,0.0,0.0
1743,131975354,Pfizer Inc. has told 900 staff at its Kent pla...,0.0,1.0,0.0,0.0
88,288942748,Alcoa announced it has laid off about 50 worke...,0.0,1.0,0.0,0.0
1708,416160162,J.C. Penney announced that the company is pla...,0.0,0.0,1.0,0.0
3497,6343122,Domtar Corporation employees ready to return t...,0.0,0.0,0.0,1.0
1814,2563121,"Russell Corp. will eliminate 2,300 jobs about ...",1.0,0.0,0.0,0.0
239,261312310,Black Hills Corporation intends to decommissio...,1.0,0.0,0.0,0.0
3172,62187188,Conexant Systems Inc. plans to continue improv...,0.0,0.0,0.0,1.0
1321,319028935,MDU Resources Group Inc. has laid off 7 employ...,0.0,1.0,0.0,0.0
2450,684053,First Indiana Corp. announced that it plans to...,0.0,1.0,0.0,0.0


## Fine-tuning Preparation

In [4]:
prompt_text = "Decide whether the following layoff text primarily concerns white-collar, blue-collar, pink-collar workers, or unsure."

In [5]:
def generate_full_prompt(layoff_text):
    return f"{prompt_text} \n\nText: \"{layoff_text}\"\nWorker type:"

In [6]:
fine_tune_sample_size = 25
data_blue = data[data["Blue"] == 1].sample(n=fine_tune_sample_size, random_state=42)
data_blue["completion"] = ["Blue"] * fine_tune_sample_size
data_white = data[data["White"] == 1].sample(n=fine_tune_sample_size, random_state=42)
data_white["completion"] = ["White"] * fine_tune_sample_size
data_pink = data[data["Pink"] == 1].sample(n=fine_tune_sample_size, random_state=42)
data_pink["completion"] = ["Pink"] * fine_tune_sample_size
data_unsure = data[data["Unsure"] == 1].sample(n=fine_tune_sample_size, random_state=42)
data_unsure["completion"] = ["Unsure"] * fine_tune_sample_size
fine_tune_data = pd.concat([data_blue, data_white, data_pink, data_unsure], axis=0).sample(frac=1, random_state=42)
fine_tune_data = fine_tune_data[["situation_y", "completion"]].rename(columns={"situation_y": "prompt"})
fine_tune_data["prompt"] = fine_tune_data["prompt"].apply(generate_full_prompt)
fine_tune_data

Unnamed: 0,prompt,completion
2962,Decide whether the following layoff text prima...,Unsure
458,Decide whether the following layoff text prima...,Pink
3020,Decide whether the following layoff text prima...,Pink
1599,Decide whether the following layoff text prima...,White
424,Decide whether the following layoff text prima...,White
...,...,...
3530,Decide whether the following layoff text prima...,Pink
2878,Decide whether the following layoff text prima...,Pink
437,Decide whether the following layoff text prima...,Blue
216,Decide whether the following layoff text prima...,Unsure
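The per-class sampling above repeats the same two lines for each label. As a minimal sketch (not the notebook's code), the same balanced draw could be written as a loop. The helper name `balanced_sample`, the constant `LABEL_COLUMNS`, and the toy rows are hypothetical and exist only for illustration.

```python
import pandas as pd

LABEL_COLUMNS = ["Blue", "White", "Pink", "Unsure"]


def balanced_sample(df, n_per_class, seed=42):
    """Draw n_per_class rows for each label column, then shuffle the result."""
    parts = []
    for label in LABEL_COLUMNS:
        # Explicit copy so adding a column never touches the original frame
        part = df[df[label] == 1].sample(n=n_per_class, random_state=seed).copy()
        part["completion"] = label
        parts.append(part)
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Made-up rows for illustration only: three rows per class
toy = pd.DataFrame({
    "situation_y": [f"text {i}" for i in range(12)],
    "Blue": [1] * 3 + [0] * 9,
    "White": [0] * 3 + [1] * 3 + [0] * 6,
    "Pink": [0] * 6 + [1] * 3 + [0] * 3,
    "Unsure": [0] * 9 + [1] * 3,
})
sample = balanced_sample(toy, n_per_class=2)
print(sample["completion"].value_counts())
```

Like the original cell, this assumes every class has at least `n_per_class` rows; otherwise `sample` raises a `ValueError`.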


In [7]:
fine_tune_data.to_json("data/layoff.jsonl", orient='records', lines=True)

In [9]:
!openai tools fine_tunes.prepare_data -f data/layoff.jsonl -q

Analyzing...

- Your file contains 100 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 1 duplicated prompt-completion sets. These are rows: [98]
- All prompts end with suffix `"\nWorker type:`. This suffix seems very long. Consider replacing with a shorter suffix, such as `\n\n###\n\n`
- All prompts start with prefix `Decide whether the following layoff text primarily concerns white-collar, blue-collar, pink-collar workers, or unsure. 

Text: "`. Fine-tuning doesn't require the instruction specifying the task, or a few-shot example scenario. Most of the time you should only add the input data into the prompt, and the desired output into the completion
- The completion should start with a

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["prompt"] = x["prompt"].str[len(prefix) :]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] = x["completion"].apply(
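The analyzer output above suggests dropping the instruction prefix and ending each prompt with a short separator. One way this could look before writing the JSONL file is sketched below. The names `SEPARATOR` and `format_for_fine_tune` are hypothetical, the example text is made up, and the leading space on the completion is an assumption based on the legacy prompt-completion guidelines rather than something this notebook does.

```python
import json

# Assumed separator, taken from the analyzer's suggestion
SEPARATOR = "\n\n###\n\n"


def format_for_fine_tune(layoff_text, label):
    """Build one prompt-completion record without the task instruction."""
    return {
        "prompt": layoff_text.strip() + SEPARATOR,
        # Assumption: completions start with a single space
        "completion": " " + label,
    }


# Made-up text for illustration only
record = format_for_fine_tune("Example Corp. will close its plant.", "Blue")
print(json.dumps(record))
```

If the prompts are changed this way for training, the same separator would also have to end every prompt sent to the fine-tuned model at inference time.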


In [None]:
!openai api fine_tunes.create -t "data/layoff_prepared_train.jsonl" -v "data/layoff_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 4 -m ada

In [11]:
!openai api fine_tunes.follow -i ft-yJ9dv1jpFpSyYynfClngXk8W

[2023-03-22 18:55:52] Created fine-tune: ft-yJ9dv1jpFpSyYynfClngXk8W

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-yJ9dv1jpFpSyYynfClngXk8W



## Get Classification Result

In [23]:
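def get_classification_result(layoff_text):
    """Return the raw completion text for one layoff description, or None on failure."""
    try:
        response = openai.Completion.create(
            model="text-davinci-003",
            # Reuse the prompt template defined in the fine-tuning section
            prompt=generate_full_prompt(layoff_text),
            temperature=0,
            max_tokens=60,
            top_p=1.0,
            frequency_penalty=0.5,
            presence_penalty=0.0,
        )
        return response["choices"][0]["text"]
    except Exception as e:
        # Catch Exception instead of using a bare except, so Ctrl+C still stops the loop
        print(f"API request failed: {e}")
        return None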
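A failed request currently yields `None`, and that row is later dropped from the accuracy calculation. A retry with exponential backoff is one common way to reduce such gaps. The sketch below is not part of the notebook: `call_with_retry`, its parameters, and the `flaky` stand-in function are hypothetical, and no real API is called.

```python
import time


def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception as exc:
            if attempt == max_attempts - 1:
                print(f"Giving up after {max_attempts} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice and then succeeds
attempts = []


def flaky(text):
    attempts.append(text)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "blue"


# Pass a no-op sleep so the demo does not actually wait
result = call_with_retry(flaky, "some layoff text", sleep=lambda seconds: None)
print(result, len(attempts))
```

The wrapper only retries functions that let exceptions propagate. Since `get_classification_result` catches errors itself and returns `None`, the wrapper would need to surround the raw API call instead.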

In [24]:
# This cell sends one API request per row in data_sample.
# Run it with caution!

# The assert below guards against accidental runs; leave it commented out only if you intend to run this cell
# assert 0 == 1, "API fetch aborted. "

gpt_class = []
# Select the text column by name rather than by position
for text in tqdm(data_sample["situation_y"]):
    gpt_class.append(get_classification_result(text))
gpt_class = pd.Series(gpt_class)


## Post-process
Compute accuracy by comparing the GPT labels with the human classification.

In [25]:
def get_accuracy(r):
    if r["gpt_class"] == "blue":
        return r["Blue"] == 1
    elif r["gpt_class"] == "white":
        return r["White"] == 1
    elif r["gpt_class"] == "pink":
        return r["Pink"] == 1
    elif r["gpt_class"] == "unsure":
        return r["Unsure"] == 1
    else:
        return None
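The row-wise check above can also be expressed without a function, by turning the one-hot label columns into a single label and comparing it with the prediction. This is a sketch under one assumption: each row has exactly one label set to 1 (with ties or all-zero rows, `idxmax` would silently pick the first column). The name `LABEL_COLUMNS`, the `human_class` variable, and the toy rows are hypothetical.

```python
import pandas as pd

LABEL_COLUMNS = ["Blue", "White", "Pink", "Unsure"]

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Blue": [1.0, 0.0, 0.0],
    "White": [0.0, 1.0, 0.0],
    "Pink": [0.0, 0.0, 1.0],
    "Unsure": [0.0, 0.0, 0.0],
    "gpt_class": ["blue", "blue", "pink"],
})

# Name of the column holding the 1 in each row, lowercased to match gpt_class
human_class = toy[LABEL_COLUMNS].idxmax(axis=1).str.lower()
match = human_class == toy["gpt_class"]
print(match.mean())
```

Unlike `get_accuracy`, this counts an unrecognized prediction as a mismatch instead of returning `None`.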

In [26]:
data_sample = data_sample.reset_index(drop=True)
data_sample["gpt_class"] = gpt_class.str.lower().str.replace(r"-collar", "").str.replace(" ", "")
data_sample

Unnamed: 0,keydevid,situation_y,Blue,White,Pink,Unsure,gpt_class
0,115881805,Whirlpool Corp. will no longer manufacturer ap...,1.0,0.0,0.0,0.0,blue
1,131975354,Pfizer Inc. has told 900 staff at its Kent pla...,0.0,1.0,0.0,0.0,blue
2,288942748,Alcoa announced it has laid off about 50 worke...,0.0,1.0,0.0,0.0,white
3,416160162,J.C. Penney announced that the company is pla...,0.0,0.0,1.0,0.0,white
4,6343122,Domtar Corporation employees ready to return t...,0.0,0.0,0.0,1.0,blue
5,2563121,"Russell Corp. will eliminate 2,300 jobs about ...",1.0,0.0,0.0,0.0,blue
6,261312310,Black Hills Corporation intends to decommissio...,1.0,0.0,0.0,0.0,blue
7,62187188,Conexant Systems Inc. plans to continue improv...,0.0,0.0,0.0,1.0,white
8,319028935,MDU Resources Group Inc. has laid off 7 employ...,0.0,1.0,0.0,0.0,white
9,684053,First Indiana Corp. announced that it plans to...,0.0,1.0,0.0,0.0,white
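The chained `str.replace` calls work when the model answers with a bare label, but any extra words or punctuation would produce a string that `get_accuracy` does not recognize. A more tolerant normalization could search for the first known label instead. This is only a sketch; `normalize_label` is a hypothetical helper and the sample strings are made up.

```python
import re

LABEL_PATTERN = re.compile(r"\b(blue|white|pink|unsure)\b")


def normalize_label(raw):
    """Return the first known label found in raw, or None if there is none."""
    if raw is None:
        return None
    found = LABEL_PATTERN.search(raw.lower())
    return found.group(1) if found else None


# Made-up completions for illustration only
for raw in [" Blue-collar", "White collar workers.", "n/a", None]:
    print(repr(raw), "->", normalize_label(raw))
```

With pandas, it could be applied as `gpt_class.apply(normalize_label)`.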


In [27]:
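def post_process(data_sample):
    # reset_index() keeps the previous index as an "index" column
    data_sample = data_sample.reset_index()
    # Lowercase the raw completions and strip the "-collar" suffix and spaces
    data_sample["gpt_class"] = gpt_class.str.lower().str.replace(r"-collar", "").str.replace(" ", "")
    # Drop rows without a prediction; copy to avoid a SettingWithCopyWarning below
    data_sample_accuracy = data_sample.dropna().copy()
    accuracy = data_sample_accuracy.apply(get_accuracy, axis=1)
    data_sample_accuracy["match"] = accuracy
    # Share of rows where the GPT label agrees with the human label
    print(np.mean(accuracy))
    return data_sample_accuracy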

In [28]:
data_sample_accuracy = post_process(data_sample)
data_sample_accuracy

0.55


Unnamed: 0,index,keydevid,situation_y,Blue,White,Pink,Unsure,gpt_class,match
0,0,115881805,Whirlpool Corp. will no longer manufacturer ap...,1.0,0.0,0.0,0.0,blue,True
1,1,131975354,Pfizer Inc. has told 900 staff at its Kent pla...,0.0,1.0,0.0,0.0,blue,False
2,2,288942748,Alcoa announced it has laid off about 50 worke...,0.0,1.0,0.0,0.0,white,True
3,3,416160162,J.C. Penney announced that the company is pla...,0.0,0.0,1.0,0.0,white,False
4,4,6343122,Domtar Corporation employees ready to return t...,0.0,0.0,0.0,1.0,blue,False
5,5,2563121,"Russell Corp. will eliminate 2,300 jobs about ...",1.0,0.0,0.0,0.0,blue,True
6,6,261312310,Black Hills Corporation intends to decommissio...,1.0,0.0,0.0,0.0,blue,True
7,7,62187188,Conexant Systems Inc. plans to continue improv...,0.0,0.0,0.0,1.0,white,False
8,8,319028935,MDU Resources Group Inc. has laid off 7 employ...,0.0,1.0,0.0,0.0,white,True
9,9,684053,First Indiana Corp. announced that it plans to...,0.0,1.0,0.0,0.0,white,True
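A single accuracy number does not show which classes get confused. In the rows displayed above, for instance, the Pink and Unsure examples are all labeled blue or white. A confusion table makes that pattern visible. The sketch below uses `pd.crosstab` on made-up rows; the `human_class` column is hypothetical and would first have to be derived from the one-hot label columns.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "human_class": ["blue", "white", "pink", "unsure", "blue"],
    "gpt_class": ["blue", "blue", "white", "white", "blue"],
})

# Rows: human labels, columns: GPT labels, cells: number of examples
confusion = pd.crosstab(toy["human_class"], toy["gpt_class"])
print(confusion)
```

Columns appear only for labels the model actually produced, so a class that is never predicted shows up as a missing column rather than a column of zeros.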


In [31]:
data_sample_accuracy.loc[15, "situation_y"]

"The McGraw-Hill Companies Inc. is cutting more than 600 jobs, resulting in a fourth-quarter charge of $43.7 million. The 611 job cuts will come across the company's divisions and will reduce its after-tax earnings by 8 cents per share. About half of the job cuts will come in its education division. McGraw-Hill attributed the cuts in its financial services division to current business conditions, which were affecting both the credit ratings services and other businesses of Standard & Poor's."