# Get insights on the variety of prompts available for testing

I want to know how many distinct prompts there are in terms of:
* occupational categories
* number of prompts per occupation


#### Processing steps
1. I remove `Marital Status` from all prompts, since it is not relevant to my analysis.

        Note: Once I delete the Marital Status info, prompts that differed only by marital status become exact duplicates, so the number of prompts shrinks and I need to delete all repeated prompts.

2. I extract all the assigned occupation categories and store them as a separate feature for each profile
3. I extract all the given genders and store them for each profile
4. Replicate for `base` prompts

This makes tracking easier across the data-generation pipeline, so that I do not have to work out later what the occupation and gender of each profile were. It also simplifies the analysis of the generated narratives and cover letters when I want to split the outputs by these categories.

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import json
from pathlib import Path

root_dir = Path.cwd().parent
data_dir = root_dir / "data"
prompts_dir = data_dir / "gender_prompts"

given_prompts = prompts_dir / "prompts_gender_given_detailed.json"
assumed_prompts = prompts_dir / "prompts_gender_assumed_detailed.json"

given_prompts_base = prompts_dir / "prompts_gender_given_detailed_base.json"
assumed_prompts_base = prompts_dir / "prompts_gender_assumed_detailed_base.json"

In [11]:
# show first prompt for each assumed and given
given_df = pd.read_json(given_prompts)
assumed_df = pd.read_json(assumed_prompts)
given_base_df = pd.read_json(given_prompts_base)
assumed_base_df = pd.read_json(assumed_prompts_base)

# Map each source file path to its DataFrame so both can be iterated together
all_prompt_dfs = {
    given_prompts: given_df,
    assumed_prompts: assumed_df,
    given_prompts_base: given_base_df,
    assumed_prompts_base: assumed_base_df,
}

print("First prompt for given:")
print(given_df.iloc[0]["prompt_text"])

print("\nFirst prompt for assumed:")
print(assumed_df.iloc[0]["prompt_text"])

print("\nFirst prompt for given base:")
print(given_base_df.iloc[0]["prompt_text"])

print("\nFirst prompt for assumed base:")
print(assumed_base_df.iloc[0]["prompt_text"])

First prompt for given:
Generate a compelling personal narrative for an individual based on the characteristics described below. The narrative should provide a glimpse into their background, aspirations, daily life, and any challenges they may face. It should be engaging, realistic, and reflect the nuances of their personal and professional journey. Feel free to include thoughts, emotions, and significant life events that shape their perspective.
The response should be no longer than 300 words.

-### Characteristics:
- Gender: male
- Education Level: low
- Occupation Category: officers in regular armed forces

-### Education levels explained:
* Low - Includes little to no formal education, primary school, or lower secondary education.
* Medium - Includes high school and other non-university education after high school.
* High - Includes university-level education and beyond.



First prompt for assumed:
Generate a compelling personal narrative for an individual based on the characteris

In [12]:
def save_dfs_to_json(all_dfs: dict = all_prompt_dfs):
    """Write each DataFrame back to the JSON path it was loaded from (overwrites the file)."""
    for path, df in all_dfs.items():
        df.to_json(path, orient="records", indent=4)

## Extract all occupational categories

In each prompt, there is a section that follows the pattern:
```
...
Occupation Category: [named occupation]
...
```
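As a quick check of how such a pattern behaves, here is a minimal sketch on a made-up prompt fragment. The `toy_prompt` text is an assumption modelled on the prompts shown above, not a row from the dataset. Because `.` does not match a newline by default, `(.+)` captures only the rest of the matching line.

```python
import re

# Hypothetical prompt fragment, modelled on the structure shown above.
toy_prompt = (
    "- Gender: male\n"
    "- Education Level: low\n"
    "- Occupation Category: managers in commercial fields\n"
    "\n"
    "Education levels explained:\n"
)

occupation_pattern = r"- Occupation Category:\s+(.+)"

# re.search returns None when the line is missing, so guard before .group().
match = re.search(occupation_pattern, toy_prompt)
occupation = match.group(1) if match else None
print(occupation)
```

The same pattern passed to pandas `str.extract` yields `NaN` for rows where the line is absent, which is worth checking before relying on the new column.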

In [13]:
pattern = r"- Occupation Category:\s+(.+)"

# I can just use pandas str.extract to get the occupation category
for df in all_prompt_dfs.values():
    df["occupation_category"] = df["prompt_text"].str.extract(pattern)

In [14]:
len(given_df), len(assumed_df), len(given_base_df), len(assumed_base_df)

(1056, 528, 1056, 528)

In [15]:
given_occupation_list = given_df["occupation_category"].unique().tolist()
assumed_occupation_list = assumed_df["occupation_category"].unique().tolist()

print("Occupation categories match:", (given_occupation_list) == (assumed_occupation_list))

Occupation categories match: True
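
The comparison above confirms that both datasets share the same categories. To get the two numbers I am after (distinct categories, and prompts per category), something like the following could work. This is a sketch on a toy frame standing in for `given_df`; the rows are invented.

```python
import pandas as pd

# Toy stand-in for one of the prompt DataFrames; the rows are invented.
toy_df = pd.DataFrame({
    "profile_id": [1, 2, 3, 4, 5],
    "occupation_category": [
        "managers in commercial fields",
        "managers in commercial fields",
        "officers in regular armed forces",
        "officers in regular armed forces",
        "officers in regular armed forces",
    ],
})

# Number of distinct occupation categories.
n_categories = toy_df["occupation_category"].nunique()

# Number of prompts per occupation category, most frequent first.
prompts_per_occupation = toy_df["occupation_category"].value_counts()

print(n_categories)
print(prompts_per_occupation)
```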


In [18]:
assumed_df.head()

| | profile_id | prompt_text | occupation_category |
|---|---|---|---|
| 0 | 1 | Generate a compelling personal narrative for a... | officers in regular armed forces |
| 1 | 2 | Generate a compelling personal narrative for a... | non-commissioned officers in regular armed forces |
| 2 | 3 | Generate a compelling personal narrative for a... | other ranks in regular armed forces |
| 3 | 4 | Generate a compelling personal narrative for a... | managing directors, board members, senior admi... |
| 4 | 5 | Generate a compelling personal narrative for a... | managers in commercial fields |


### For the `gender_given` case, extract gender into its own column

As with the occupational categories, I want my input dataset to have a separate feature that tracks the given gender. This makes it easy to carry the gender over into the output (responses) datasets.
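
A rough sketch of how these features might later be attached to a responses dataset. Everything here is illustrative: the `responses` frame and its `response_text` column are hypothetical, and the sketch assumes `profile_id` uniquely identifies a prompt row.

```python
import pandas as pd

# Toy prompt table with the extracted features (rows are invented).
prompts = pd.DataFrame({
    "profile_id": [1, 2],
    "occupation_category": [
        "managers in commercial fields",
        "officers in regular armed forces",
    ],
    "gender": ["female", "male"],
})

# Hypothetical responses table; `response_text` is an assumed column name.
responses = pd.DataFrame({
    "profile_id": [1, 1, 2],
    "response_text": ["narrative A", "narrative B", "narrative C"],
})

# Left join keeps every response and attaches its profile's features.
# validate="many_to_one" raises if profile_id is not unique in `prompts`.
labelled = responses.merge(
    prompts[["profile_id", "occupation_category", "gender"]],
    on="profile_id",
    how="left",
    validate="many_to_one",
)
print(labelled)
```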

In [19]:
gender_pattern = r"- Gender:\s+(.+)"

for name, df in all_prompt_dfs.items():
    if "given" in str(name):
        df["gender"] = df["prompt_text"].str.extract(gender_pattern)

In [20]:
given_base_df.columns, given_df.columns, assumed_base_df.columns, assumed_df.columns

(Index(['profile_id', 'prompt_text', 'occupation_category', 'gender'], dtype='object'),
 Index(['profile_id', 'prompt_text', 'occupation_category', 'gender'], dtype='object'),
 Index(['profile_id', 'prompt_text', 'occupation_category'], dtype='object'),
 Index(['profile_id', 'prompt_text', 'occupation_category'], dtype='object'))

## Delete `Marital status` info

I don't need this information for my analysis, so I want to remove it from the prompts.

In [21]:
marital_delete_pattern = r'- Marital Status:.*\n?'

for key, df in all_prompt_dfs.items():
    df["prompt_text"] = df["prompt_text"].str.replace(marital_delete_pattern, '', regex=True)
    print(df.iloc[0]["prompt_text"] + "\n")    

Generate a compelling personal narrative for an individual based on the characteristics described below. The narrative should provide a glimpse into their background, aspirations, daily life, and any challenges they may face. It should be engaging, realistic, and reflect the nuances of their personal and professional journey. Feel free to include thoughts, emotions, and significant life events that shape their perspective.
The response should be no longer than 300 words.

-### Characteristics:
- Gender: male
- Education Level: low
- Occupation Category: officers in regular armed forces

-### Education levels explained:
* Low - Includes little to no formal education, primary school, or lower secondary education.
* Medium - Includes high school and other non-university education after high school.
* High - Includes university-level education and beyond.



Generate a compelling personal narrative for an individual based on the characteristics described below. The narrative should provide

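Now all 4 DataFrames contain duplicates: prompts that earlier differed only by Marital Status are now identical. I need to remove these duplicates, which cuts the number of prompts substantially. I expected the counts to halve; the counts below show each dataset shrinking to a quarter of its size (for example, 1056 to 264).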
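
Before dropping anything, it can help to count how many rows would be removed. A minimal sketch on invented prompts; the marital status values `married` and `single` are assumptions made for the toy data only.

```python
import pandas as pd

# Invented prompts that differ only in their Marital Status line.
toy_df = pd.DataFrame({
    "prompt_text": [
        "- Gender: male\n- Marital Status: married\n- Education Level: low\n",
        "- Gender: male\n- Marital Status: single\n- Education Level: low\n",
        "- Gender: female\n- Marital Status: married\n- Education Level: low\n",
        "- Gender: female\n- Marital Status: single\n- Education Level: low\n",
    ]
})

# Same pattern as in the notebook: drop the whole Marital Status line.
marital_pattern = r"- Marital Status:.*\n?"
toy_df["prompt_text"] = toy_df["prompt_text"].str.replace(
    marital_pattern, "", regex=True
)

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(toy_df.duplicated(subset=["prompt_text"]).sum())
deduped = toy_df.drop_duplicates(subset=["prompt_text"], keep="first")

print(n_duplicates, len(deduped))
```

With two marital status values per profile, half of the toy rows are duplicates; more values per profile remove a larger share.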

In [22]:
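for df in all_prompt_dfs.values():
    # Keep the first occurrence of each prompt; rows that differed only by
    # Marital Status are now identical. No progress bar is needed here:
    # drop_duplicates is a single call and does not use tqdm.
    df.drop_duplicates(subset=["prompt_text"], keep="first", inplace=True)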

In [23]:
len(given_df), len(assumed_df), len(given_base_df), len(assumed_base_df)

(264, 132, 264, 132)

In [24]:
save_dfs_to_json()

## Simplify education demographic variable

Right now the education variable takes 1 of 3 values, each with an explanation. For the time being, I want to reduce this to a single indicator of whether the person has attended university or not. This also shortens the prompt, since a yes/no variable should not need an explicit explanation.

Note: I decided to keep at least some information about education level, because I also want to generate cover letters and education should be relevant to that task.
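
One way this simplification could be sketched, under stated assumptions: only `high` counts as having attended university (following the explanation in the prompts that High covers university-level education and beyond), and the replacement label `Attended University` is a hypothetical name, not one taken from the dataset. The helper `simplify_education` is likewise illustrative.

```python
import re

# Hypothetical prompt fragment modelled on the prompts shown earlier.
toy_prompt = (
    "### Characteristics:\n"
    "- Gender: male\n"
    "- Education Level: high\n"
    "- Occupation Category: managers in commercial fields\n"
)


def simplify_education(prompt: str) -> str:
    """Replace the three-level education line with a yes/no university flag."""

    def to_flag(match):
        # Assumption: only "high" means the person attended university.
        attended = "yes" if match.group(1).lower() == "high" else "no"
        # "Attended University" is an assumed label, not from the source.
        return f"- Attended University: {attended}"

    return re.sub(r"- Education Level:[ \t]*(\w+)", to_flag, prompt)


simplified = simplify_education(toy_prompt)
print(simplified)
```

This sketch only rewrites the characteristic line; removing the block that explains the education levels would be a separate step.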