# 70 & 71 · Name-Attribute Classification Demo  
_Last updated: 2025-05-03_

Two micro-tasks → one notebook:

| Task | Labels | Sample names |
|------|--------|--------------|
| **Gender** | Male / Female / Unclear | 10 mixed-gender names, some ambiguous |
| **Race / Ethnicity** | White / Non-White / Unclear | 10 surnames spanning regions |

For each name we call GPT-4o-mini several times,
aggregate the answers with a modal (majority) vote, and compute an agreement metric.

> One helper function + one schema per task keeps the code DRY and makes it
> obvious how the same pattern generalises to any categorical attribute (e.g. age cohort,
> education level, profession, etc.).


## API key

* Checks `OPENAI_API_KEY` in the environment.  
* Else reads `key/openai_key.txt`.  
* Raises a clear error if missing (same rule as all previous notebooks).


In [1]:
# %pip -q install --upgrade openai python-dotenv pandas

import os, pathlib, json, random, pandas as pd
from openai import OpenAI

# Fall back to the key file when the environment variable is not set
key_file = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_file.exists():
    os.environ["OPENAI_API_KEY"] = key_file.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Add OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()


## 1 · Sample name tables


In [2]:
# Slightly larger, more ambiguous sets
df_gender = pd.DataFrame({
    "user_id": range(1, 11),
    "name": [
        "Alex", "Maria", "Jordan", "Wei", "Taylor",
        "Abdul", "Sasha", "Noah", "Casey", "Riley"
    ]
})

df_race = pd.DataFrame({
    "user_id": range(101, 111),
    "name": [
        "Smith", "Li", "Garcia", "Adebayo", "Patel",
        "Ivanov", "Nguyen", "Haddad", "Kimura", "Schmidt"
    ]
})

display(df_gender.head(), df_race.head())


|   | user_id | name |
|---|---------|------|
| 0 | 1 | Alex |
| 1 | 2 | Maria |
| 2 | 3 | Jordan |
| 3 | 4 | Wei |
| 4 | 5 | Taylor |


|   | user_id | name |
|---|---------|------|
| 0 | 101 | Smith |
| 1 | 102 | Li |
| 2 | 103 | Garcia |
| 3 | 104 | Adebayo |
| 4 | 105 | Patel |


## 2 · Generic helper: `classify_name_once`

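The helper takes:

* `name` – the name to classify.  
* `task` – `"gender"` or `"race"`; selects the schema, system prompt, and JSON key.  
* `temp`, `model` – sampling temperature and model name (defaults: 0.7 and `gpt-4o-mini`).

The permissible labels for each task live in the schema `enum` built by `make_schema`,
and the task-specific instructions live in `prompt_gender` and `prompt_race`.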


In [3]:
def make_schema(task_name: str, enum_values: list[str]):
    return {
        "type": "json_schema",
        "json_schema": {
            "name": f"name_{task_name}",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    task_name: {"type": "string", "enum": enum_values}
                },
                "required": [task_name],
                "additionalProperties": False
            }
        }
    }

schema_gender = make_schema("gender", ["Male", "Female", "Unclear"])
schema_race   = make_schema("race",   ["White", "Non-White", "Unclear"])

prompt_gender = (
    "Classify the given first name as most likely Male, Female, or Unclear.\n"
    "Return JSON {\"gender\": value} only."
)
prompt_race = (
    "Based only on the surname, guess racial or ethnic origin.\n"
    "Respond with White, Non-White, or Unclear.\n"
    "Return JSON {\"race\": value} only."
)

def classify_name_once(name: str,
                       task: str,
                       temp: float = 0.7,
                       model: str = "gpt-4o-mini"):
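    if task == "gender":
        schema = schema_gender
        sys_p  = prompt_gender
        key    = "gender"
    elif task == "race":
        schema = schema_race
        sys_p  = prompt_race
        key    = "race"
    else:
        # Fail loudly instead of silently treating an unknown task as "race"
        raise ValueError(f"Unknown task: {task!r}")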
    user_p = f"Name: {name}"
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": sys_p},
            {"role": "user",   "content": user_p}
        ],
        temperature=temp,
        max_tokens=10,
        response_format=schema
    )
    return json.loads(resp.choices[0].message.content)[key]
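
The `if`/`else` above covers exactly two tasks. One way to extend the same pattern to other categorical attributes is a small task registry, sketched below. `TaskSpec`, `TASKS`, and `build_request` are hypothetical names that do not appear in the notebook, the `age_cohort` labels and prompt are purely illustrative, and `make_schema` is repeated only so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Labels and system prompt for one classification task (hypothetical helper)."""
    labels: tuple[str, ...]
    system_prompt: str


def make_schema(task_name: str, enum_values: list[str]) -> dict:
    # Same structure as the notebook's make_schema, repeated for self-containment.
    return {
        "type": "json_schema",
        "json_schema": {
            "name": f"name_{task_name}",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {task_name: {"type": "string", "enum": enum_values}},
                "required": [task_name],
                "additionalProperties": False,
            },
        },
    }


TASKS = {
    "gender": TaskSpec(
        ("Male", "Female", "Unclear"),
        "Classify the given first name as most likely Male, Female, or Unclear.\n"
        'Return JSON {"gender": value} only.',
    ),
    "race": TaskSpec(
        ("White", "Non-White", "Unclear"),
        "Based only on the surname, guess racial or ethnic origin.\n"
        "Respond with White, Non-White, or Unclear.\n"
        'Return JSON {"race": value} only.',
    ),
    # Illustrative only: these labels and this prompt are assumptions.
    "age_cohort": TaskSpec(
        ("Under 30", "30-59", "60+", "Unclear"),
        "Guess the most likely age cohort for the given first name.\n"
        'Return JSON {"age_cohort": value} only.',
    ),
}


def build_request(name: str, task: str, temp: float = 0.7,
                  model: str = "gpt-4o-mini") -> dict:
    """Build keyword arguments for client.chat.completions.create."""
    if task not in TASKS:
        raise ValueError(f"Unknown task: {task!r}")
    spec = TASKS[task]
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": spec.system_prompt},
            {"role": "user", "content": f"Name: {name}"},
        ],
        "temperature": temp,
        "response_format": make_schema(task, list(spec.labels)),
    }
```

With this in place, a call would look like `client.chat.completions.create(**build_request("Alex", "gender"))`, and the label is read from the returned JSON under the key equal to `task`. Adding an attribute then means adding one entry to `TASKS`.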


## 3 · Three passes at temperature 0.7

We call GPT-4o-mini three times per name at the same temperature (0.7).  
Because the model samples its output at a non-zero temperature, repeated calls can
return different labels, and that variation is what makes a modal vote informative.


In [4]:
ITERATIONS = 3          # number of passes
TEMPERATURE = 0.7       # single temperature for all passes

def run_task(df, task):
    out = df.copy()
    for i in range(1, ITERATIONS + 1):
        col = f"{task}_{i}"
        out[col] = out["name"].apply(
            lambda n: classify_name_once(n, task=task, temp=TEMPERATURE)
        )
    return out

df_gender_lab = run_task(df_gender, "gender")
df_race_lab   = run_task(df_race,   "race")

display(df_gender_lab.head(), df_race_lab.head())


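|   | user_id | name | gender_1 | gender_2 | gender_3 |
|---|---------|------|----------|----------|----------|
| 0 | 1 | Alex | Unclear | Unclear | Unclear |
| 1 | 2 | Maria | Female | Female | Female |
| 2 | 3 | Jordan | Unclear | Unclear | Unclear |
| 3 | 4 | Wei | Unclear | Unclear | Unclear |
| 4 | 5 | Taylor | Unclear | Unclear | Unclear |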


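|   | user_id | name | race_1 | race_2 | race_3 |
|---|---------|------|--------|--------|--------|
| 0 | 101 | Smith | White | White | White |
| 1 | 102 | Li | Non-White | Non-White | Non-White |
| 2 | 103 | Garcia | Non-White | Non-White | Non-White |
| 3 | 104 | Adebayo | Non-White | Non-White | Non-White |
| 4 | 105 | Patel | Non-White | Non-White | Non-White |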


## 4 · Modal vote & agreement (three votes)


In [5]:
def modal(row, prefix):
    votes = row[[f"{prefix}_{i}" for i in range(1, ITERATIONS + 1)]].tolist()
    return max(set(votes), key=votes.count)

def agreement(row, prefix):
    uniq = len(set(row[[f"{prefix}_{i}" for i in range(1, ITERATIONS + 1)]]))
    return 1.0 if uniq == 1 else 0.5 if uniq == 2 else 0.0

for df_, p in [(df_gender_lab, "gender"), (df_race_lab, "race")]:
    df_[f"modal_{p}"] = df_.apply(modal, axis=1, prefix=p)
    df_[f"agree_{p}"] = df_.apply(agreement, axis=1, prefix=p)

display(df_gender_lab[["name", "modal_gender", "agree_gender"]],
        df_race_lab[["name", "modal_race", "agree_race"]])


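|   | name | modal_gender | agree_gender |
|---|------|--------------|--------------|
| 0 | Alex | Unclear | 1.0 |
| 1 | Maria | Female | 1.0 |
| 2 | Jordan | Unclear | 1.0 |
| 3 | Wei | Unclear | 1.0 |
| 4 | Taylor | Unclear | 1.0 |
| 5 | Abdul | Male | 1.0 |
| 6 | Sasha | Unclear | 1.0 |
| 7 | Noah | Male | 1.0 |
| 8 | Casey | Unclear | 1.0 |
| 9 | Riley | Unclear | 1.0 |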


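|   | name | modal_race | agree_race |
|---|------|------------|------------|
| 0 | Smith | White | 1.0 |
| 1 | Li | Non-White | 1.0 |
| 2 | Garcia | Non-White | 1.0 |
| 3 | Adebayo | Non-White | 1.0 |
| 4 | Patel | Non-White | 1.0 |
| 5 | Ivanov | Non-White | 1.0 |
| 6 | Nguyen | Non-White | 1.0 |
| 7 | Haddad | Non-White | 1.0 |
| 8 | Kimura | Non-White | 1.0 |
| 9 | Schmidt | White | 1.0 |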
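
One caveat about `modal`: `max(set(votes), key=votes.count)` returns whichever tied label the set happens to yield first, so a three-way split (one vote per label) can resolve differently between runs. The sketch below is a deterministic alternative, assuming that a tie should fall back to `Unclear`. `modal_vote` and `tie_label` are hypothetical names, not part of the notebook.

```python
from collections import Counter


def modal_vote(votes: list[str], tie_label: str = "Unclear") -> str:
    """Return the most frequent label, or tie_label when the top count is shared."""
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    top = max(counts.values())
    # Collect every label that reaches the top count; more than one means a tie.
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else tie_label
```

Whether a tie should map to `Unclear` or be flagged for manual review is a design choice; the fallback here simply reuses a label that both schemas already allow.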


## 5 · Agreement stats


In [6]:
print("Gender agreement mean:", df_gender_lab.agree_gender.mean().round(2))
print("Race   agreement mean:", df_race_lab.agree_race.mean().round(2))


Gender agreement mean: 1.0
Race   agreement mean: 1.0
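
The `agreement` function scores a row by its number of distinct labels, which fits three passes but becomes coarse if `ITERATIONS` grows: with five passes, a 4-1 split and a 3-2 split would both score 0.5. A minimal sketch of an alternative metric is the share of votes that match the modal label. Note that this is a different metric from the notebook's (it gives 2/3 instead of 0.5 for a 2-1 split), and `agreement_share` is a hypothetical name.

```python
from collections import Counter


def agreement_share(votes: list[str]) -> float:
    """Fraction of votes that equal the most frequent label."""
    if not votes:
        raise ValueError("votes must not be empty")
    # most_common(1) returns [(label, count)] for the top label.
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)
```

For a DataFrame row this could be applied to the same `{prefix}_{i}` columns that `agreement` reads, for example by passing `row[cols].tolist()`.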


## 6 · Extension ideas

1. **Confidence-weighted voting**: ask the model to output a score (0-100)
   alongside the label, then weight each vote by its score instead of counting
   all votes equally.
2. **Hybrid approach**: fall back to a deterministic library
   (e.g. `gender-guesser`, `ethnicolr`) when the LLM returns `Unclear`.
3. **Explainability**: add an optional `"reason"` field to the schema for an
   audit trail (“Jordan is unisex; common for both genders”).  
4. **Bias test**: join with the stance-classification output from Notebook 60 and
   check whether mis-classification rates differ by guessed demographic group.  
5. **Batch API**: scale to 1 M names: write each request as a JSONL line with
   `url` set to `/v1/chat/completions`, upload the file, create a batch job via
   `/v1/batches`, then post-process the results with the same modal logic.
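
For the batch idea, the request file can be sketched as follows. Each JSONL line carries a `custom_id`, the HTTP `method`, the target `url`, and a `body` with the same parameters as a normal chat-completions call. `batch_lines` and the `custom_id` format are hypothetical choices for illustration, and whether every parameter used here is accepted in batch mode should be checked against the current API documentation.

```python
import json


def batch_lines(names, task, system_prompt, response_format,
                iterations=3, model="gpt-4o-mini", temperature=0.7):
    """Return one JSONL string per (name, pass) pair."""
    lines = []
    for idx, name in enumerate(names):
        for i in range(1, iterations + 1):
            request = {
                # custom_id must be unique; it is used to match results to inputs.
                "custom_id": f"{task}-{idx}-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"Name: {name}"},
                    ],
                    "temperature": temperature,
                    "response_format": response_format,
                },
            }
            lines.append(json.dumps(request))
    return lines


# Usage sketch (not executed here; requires a configured OpenAI client):
# with open("requests.jsonl", "w", encoding="utf-8") as fh:
#     fh.write("\n".join(batch_lines(df_gender["name"], "gender",
#                                    prompt_gender, schema_gender)) + "\n")
# batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
# batch = client.batches.create(input_file_id=batch_file.id,
#                               endpoint="/v1/chat/completions",
#                               completion_window="24h")
```

Once the batch completes, each output line can be matched back to its name and pass number through `custom_id`, after which the per-name votes feed into the same modal and agreement steps.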
