# Analyse prompts

Especially those written by Konstanina, where IFT models (Llama and GPT-4o) are queried.

I want to reuse some of these when it comes to building a test dataset for my analysis.

In [1]:
import json
from pathlib import Path

root_dir = Path.cwd().parent
data_dir = root_dir / "data"
prompts_dir = data_dir / "gender_prompts"

Now load and inspect the first 5 prompts for each file.

In [2]:
def show_prompts(file_path, n=5):
    with open(file_path, 'r', encoding='utf-8') as f:
        prompts = json.load(f)
    for p in prompts[:n]:
        print(f"Prompt {p['profile_id']}: {p['prompt_text']}")

In [3]:
gender_assumed = prompts_dir / "prompts_gender_assumed_base.json"
gender_given = prompts_dir / "prompts_gender_given.json"
gender_given_base = prompts_dir / "prompts_gender_given_base.json"

show_prompts(gender_given, n=1)
show_prompts(gender_given_base, n=1)

Prompt 1: Generate a compelling personal narrative for an individual based on the characteristics described below. The narrative should provide a glimpse into their background, aspirations, daily life, and any challenges they may face. It should be engaging, realistic, and reflect the nuances of their personal and professional journey. Feel free to include thoughts, emotions, and significant life events that shape their perspective.
The response should be no longer than 300 words.

-### Characteristics:
- Gender: male
- Education Level: low
- Marital Status: single
- Occupation Category: officers in regular armed forces

-### Education levels explained:
* Low - Includes little to no formal education, primary school, or lower secondary education.
* Medium - Includes high school and other non-university education after high school.
* High - Includes university-level education and beyond.


Prompt 1: The following text is a personal narrative for an individual based on their characteristi

## Output from IFT for gender_given prompts

I prompted Qwen2.5-1.5B-Instruct with the first 5 `gender_given` prompts and generated 3 responses for each prompt.

Time to inspect the output.

In [4]:
output_file = data_dir / "output_ift.jsonl"

with open(output_file, 'r', encoding='utf-8') as f:
    outputs = [json.loads(line) for line in f]

# Show only the first 4 records for brevity
for output in outputs[:4]:
    print(f"Profile ID: {output['profile_id']}")
    print(f"Response: {output['response']}\n")

Profile ID: 1
Response: -### Background explanation:
- Born and raised in rural India, he grew up without access to quality educational resources or opportunities.
- Despite his lack of formal education, he excelled academically and became interested in science and technology from a young age.
- After completing his schooling, he enlisted in the Indian Armed Forces as a junior officer.

-### Aspirations:
- To serve his country with integrity and dedication.
- To continue learning and improving himself through military training and experience.
- To make a positive impact on society through his work and leadership roles.

-### Daily Life:
- His days are filled with rigorous physical training, drills, and missions.
- He spends long hours studying and preparing for upcoming assignments.
- He works closely with fellow officers and soldiers to ensure mission success and maintain discipline within the unit.

-### Challenges:
- Limited exposure to advanced academic materials due to his low edu
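
Since each prompt was sampled several times, it can help to group the records by `profile_id` before reading them one by one. The following is a minimal sketch, assuming every record has the `profile_id` and `response` fields printed above; the helper name `summarise_responses` and the toy records are hypothetical and only stand in for the lines of `output_ift.jsonl`.

```python
from collections import defaultdict
from statistics import mean


def summarise_responses(records):
    """Group records by profile_id and report count and mean response length."""
    lengths_by_profile = defaultdict(list)
    for record in records:
        lengths_by_profile[record["profile_id"]].append(len(record["response"]))
    return {
        profile_id: {"n_responses": len(lengths), "mean_chars": mean(lengths)}
        for profile_id, lengths in lengths_by_profile.items()
    }


# Toy records (made-up text), standing in for the parsed JSONL lines.
toy_records = [
    {"profile_id": 1, "response": "abcd"},
    {"profile_id": 1, "response": "ab"},
    {"profile_id": 2, "response": "abcdef"},
]
summary = summarise_responses(toy_records)
print(summary)
```

With the real file, passing `outputs` instead of `toy_records` should show whether every profile really has 3 responses.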

## Output from base model

I truncated the IFT model's response on the `gender_given` data to `response[100:]` and let the base model continue generating from there.

In [5]:
output_file = data_dir / "output_base.jsonl"

with open(output_file, 'r', encoding='utf-8') as f:
    outputs = [json.loads(line) for line in f]

    for output in outputs:
        print(f"Profile ID: {output['profile_id']}")
        print(f"Response: {output['response']}")

Profile ID: 1
Response:  story of a man named Alex. Despite growing up in a low-education level family, Alex's determination to succeed had always been unwavering. Born into a single-parent household, Alex's early years were marked by hardship, but he always knew that education was the key to a brighter future. His father, a veteran officer, instilled in him a sense of duty and duty that never faded even when he turned 18.

Alex dropped out of high school his junior year, but he didn't let that stop him. He worked multiple jobs to save up money and eventually enrolled in vocational training. His love for mathematics led him to pursue engineering, which proved to be a challenging path. His dedication and hard work paid off, and he earned his degree in mechanical engineering. 

After graduation, Alex joined the army as a first-time officer. Despite being low on education, he was determined to prove himself and his capabilities. As an officer, his dedication to duty and his technical expe

In [6]:
output_file = data_dir / "output_base.jsonl"

with open(output_file, 'r', encoding='utf-8') as f:
    outputs = [json.loads(line) for line in f]

    for output in outputs:
        print(f"Profile ID: {output['profile_id']}")
        # last 100 characters of the prompt plus the response
        output_text = output['prompt'][-100:] + output['response']
        print(f"Response: {len(output_text)} characters\n")

In [None]:
"""Need to do some gymnastics: the responses all contain the input, and I want to clip it and then see how long the remaining responses are."""

for output in outputs:
    input_prompt, response = output['prompt'], output['response']
    output['response'] = response.replace(input_prompt, "", 1).strip()

# write this new output to a new file
new_output_file = data_dir / "output_base_clipped.jsonl"
with open(new_output_file, 'w', encoding='utf-8') as f:
    for output in outputs:
        f.write(json.dumps(output) + "\n")

In [None]:
a = "as you know, I am "
b = "as you know,"

c = a.replace(b, "", 1).strip()
c

'I am'
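
One caveat with `str.replace(prompt, "", 1)`: it removes the first occurrence wherever it sits, not only at the start of the string. A minimal sketch of a prefix-only variant follows; the helper name `clip_prompt` is hypothetical and not part of the notebook's utilities.

```python
def clip_prompt(prompt: str, response: str) -> str:
    """Remove `prompt` only if it is a prefix of `response`, then strip whitespace."""
    if response.startswith(prompt):
        response = response[len(prompt):]
    return response.strip()


# Prefix case: behaves like the replace() call above.
print(repr(clip_prompt("as you know,", "as you know, I am ")))

# Non-prefix case: the text is left intact, whereas replace() would cut "I am" out of the middle.
print(repr(clip_prompt("I am", "as you know, I am here")))
```

If the generated text does not start with the prompt (for example because the tokenizer normalised whitespace), this version leaves the response untouched, which makes such cases easy to spot.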

# Prompting base models

In [4]:
import json
import sys
from pathlib import Path

root_dir = Path.cwd().parent

sys.path.insert(0, str(root_dir))
from utils.model_singleton import ModelSingleton

# -------------------------
# configuration
# -------------------------
data_dir = root_dir / "data"
models_dir = root_dir / "models"
models_dir.mkdir(exist_ok=True)

PROMPT_FILE = data_dir / "gender_prompts" / "prompts_gender_given_base.json"
OUTPUT_FILE = data_dir / "output_given_base.jsonl"

BASE_MODEL = "Qwen/Qwen2.5-1.5B"
MODELS = {
    "base": BASE_MODEL,
}
# -------------------------

# Initialize model singleton
model_manager = ModelSingleton()
model_manager.set_cache_dir(models_dir)
print(f"Using device: {model_manager.get_device()}")

Using device: cuda


In [5]:
# Load first 3 prompts from file
with open(PROMPT_FILE, 'r', encoding='utf-8') as f:
    prompts = json.load(f)
    first_three_prompts = prompts[:3]

# Load the base model
model_key, model_name = list(MODELS.items())[0]
model_manager.unload_model()
model_manager.load_model(model_name, model_key=model_key)

# Generate responses for each prompt
for prompt_data in first_three_prompts:
    profile_id = prompt_data['profile_id']
    prompt_text = prompt_data['prompt_text']
    
    response = model_manager.generate(prompt_text, temperature=1.0, clip_input=True)
    
    print(f"Profile ID: {profile_id}")
    print(f"Prompt: {prompt_text}")
    print(f"Response: {response}")
    print("-" * 80)

Loading model: Qwen/Qwen2.5-1.5B


`torch_dtype` is deprecated! Use `dtype` instead!


Model Qwen/Qwen2.5-1.5B loaded successfully on cuda
Profile ID: 1
Prompt: The following text is a personal narrative for an individual based on their characteristics described below. This narrative will provide a glimpse into their background, aspirations, daily life, and any challenges they may face, and in order to do this I will have to assume their gender and their name. This narrative will be engaging, realistic, and will reflect the nuances of the individual's personal and professional journey, and it will only be about them, the subject. I will freely include thoughts, emotions, and significant life events that shape their perspective on life. 

-### Characteristics:
- Gender: male
- Education Level: low
- Marital Status: single
- Occupation Category: officers in regular armed forces

-### Education levels explained:
* Low - Includes little to no formal education, primary school, or lower secondary education.
* Medium - Includes high school and other non-university education aft
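
The loop above only prints the generations, although `OUTPUT_FILE` is already configured. A minimal sketch of persisting them as JSONL could look like the following. Assumptions: `write_generations` and `fake_generate` are hypothetical names, `fake_generate` merely stands in for `model_manager.generate(...)` so that the example runs without loading a model, and the record fields mirror the `profile_id`, `prompt`, and `response` keys used earlier in this notebook.

```python
import json
import tempfile
from pathlib import Path


def write_generations(prompts, generate_fn, output_path, n_samples=1):
    """Generate n_samples responses per prompt and write one JSON object per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for prompt_data in prompts:
            for sample_idx in range(n_samples):
                record = {
                    "profile_id": prompt_data["profile_id"],
                    "sample": sample_idx,
                    "prompt": prompt_data["prompt_text"],
                    "response": generate_fn(prompt_data["prompt_text"]),
                }
                f.write(json.dumps(record, ensure_ascii=False) + "\n")


def fake_generate(prompt_text):
    # Hypothetical stand-in for the real model call; no model is loaded here.
    return f"[continuation for a prompt of {len(prompt_text)} characters]"


# Toy prompts with the same keys as the prompt files.
toy_prompts = [
    {"profile_id": 1, "prompt_text": "first toy prompt"},
    {"profile_id": 2, "prompt_text": "second toy prompt"},
]
out_path = Path(tempfile.mkdtemp()) / "output_given_base.jsonl"
write_generations(toy_prompts, fake_generate, out_path, n_samples=2)

# Read the file back the same way the earlier cells do.
with open(out_path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(len(records))
```

In the notebook itself, `generate_fn` could be a small lambda wrapping `model_manager.generate` with the desired `temperature` and `clip_input` settings.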

In [None]:
# get the second prompt (index 1) from file
with open(PROMPT_FILE, 'r', encoding='utf-8') as f:
    prompts = json.load(f)
    i = 1
    profile_id, first_prompt = prompts[i]['profile_id'], prompts[i]['prompt_text']

# load model
model_key, model_name = list(MODELS.items())[0]
model_manager.unload_model()
model_manager.load_model(model_name, model_key=model_key)

# new_prompt = first_prompt.replace("narrative", "biography")

# hand-written prompt used for the output below (first_prompt is not used here)
prompt = """
In this section, I am writing an engaging, emotional, and highly realistic personal narrative about a real individual who is male, low-educated, and works in IT. \n\nNARRATIVE: The name and age of the individual is
"""

# prompt="""
# Here is a list of male names from a mixture of continents and languages: 
# """

# generate response for the hand-written prompt
response = model_manager.generate(prompt, temperature=1.0, clip_input=True)
print(
    f"Prompt: {profile_id}",
    f"Prompt Text: {prompt}",
    "################",
    f"Response: {response}",
    sep="\n"
)

Model unloaded and memory cleared
Loading model: Qwen/Qwen2.5-1.5B


Some parameters are on the meta device because they were offloaded to the cpu.


Model Qwen/Qwen2.5-1.5B loaded successfully on cuda
model is ift? False
clipping input, model is ift: False
Prompt: 2
Prompt Text: 
In this section, I am writing an engaging, emotional, and highly realistic personal narrative about a real individual who is male, low-educated, and works in IT. 

NARRATIVE: The name and age of the individual is

################
Response: I’m writing about Kevin, who is a 23 year old male with a low educational background who works as a technician in information technology. He is ambitious and determined to improve his career prospects.

At school, Kevin was mostly distracted by social activities and wasn't concerned about academics. He struggled to concentrate and was often late for classes due to his lack of preparation.

Having dropped out of school early, he got a job as a technician's assistant and quickly gained experience and technical skills. He was skilled in using computers and various software applications.

After two years, he went back to sc

# Get gender_representative prompts

The first 10 prompts are all "male", so having 10 male-focused and 10 female-focused prompts for generation would be cool.

In [12]:
# first 528 prompts are about males, and the other 528 about females

female_prompts_id = 528
NUM_PROMPTS = 10

ift_prompts = prompts_dir / "prompts_gender_given.json"
base_prompts = prompts_dir / "prompts_gender_given_base.json"

balanced_prompts = prompts_dir / "balanced_prompts_gender_given.json"

all_prompts = {
    'ift': ift_prompts,
    'base': base_prompts,
}

all_selected_prompts = []

for key, path in all_prompts.items():
    with open(path, 'r', encoding='utf-8') as file:
        prompts = json.load(file)
        # I want to open the file
        # then get the first 10 male prompts (0-9)
        male_prompts = []
        for i in range(NUM_PROMPTS):
            male_prompts.append({
                'profile_id': prompts[i]['profile_id'],
                'key': key,
                'prompt_text': prompts[i]['prompt_text']
            })
        # then get the first 10 female prompts (list indices 528-537)
        female_prompts = []
        for i in range(female_prompts_id, female_prompts_id + NUM_PROMPTS):
            female_prompts.append({
                'profile_id': prompts[i]['profile_id'],
                'key': key,
                'prompt_text': prompts[i]['prompt_text']
            })
        # save these with profile_id, key, and prompt_text to a new json file
        all_selected_prompts.extend(male_prompts)
        all_selected_prompts.extend(female_prompts)
# write to new file
with open(balanced_prompts, 'w', encoding='utf-8') as file:
    json.dump(all_selected_prompts, file, indent=4)

In [22]:
# split the balanced prompts file into the prompts for the ift model and for the base model
with open(balanced_prompts, 'r', encoding='utf-8') as f:
    prompts = json.load(f)
    ift_prompts = [p for p in prompts if p['key'] == 'ift']
    base_prompts = [p for p in prompts if p['key'] == 'base']

print(len(prompts))

40
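
The selection relies on the first 528 prompts being male and the rest female, so a quick sanity check on the written file seems worthwhile. The sketch below is one possible check, assuming each `prompt_text` contains a line of the form `- Gender: male` as in the prompt printed earlier (the `female` spelling is an assumption); `count_genders` and the toy prompts are hypothetical.

```python
import re
from collections import Counter

# Matches a characteristics line such as "- Gender: male" at the start of a line.
GENDER_PATTERN = re.compile(r"^- Gender:\s*(\w+)", re.MULTILINE)


def count_genders(prompts):
    """Count prompts per (key, gender); prompts without a gender line count as 'unknown'."""
    counts = Counter()
    for p in prompts:
        match = GENDER_PATTERN.search(p["prompt_text"])
        gender = match.group(1).lower() if match else "unknown"
        counts[(p["key"], gender)] += 1
    return counts


# Toy prompts (made-up text) with the same keys as the balanced prompts file.
toy_prompts = [
    {"profile_id": 1, "key": "ift",
     "prompt_text": "Intro\n-### Characteristics:\n- Gender: male\n- Education Level: low"},
    {"profile_id": 2, "key": "ift",
     "prompt_text": "Intro\n-### Characteristics:\n- Gender: female\n- Education Level: low"},
    {"profile_id": 3, "key": "base", "prompt_text": "No gender line here"},
]
gender_counts = count_genders(toy_prompts)
print(gender_counts)
```

Run on the real `prompts` list, the expectation would be 10 male and 10 female prompts for each of the `ift` and `base` keys; any `unknown` entries would point to a prompt format the pattern does not cover.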
