# Transcript Generation

Generate mock interview transcripts between an interviewer and physicians from diverse locales across the US.

## Requirements
#### Packages
This notebook was run with the following packages:
- python                        3.11
- pandas                        2.3.3
- langchain                     0.3.27
- langchain-google-genai        2.1.12
- langchain-openai              0.3.34

#### Other requirements
- Environment variable `OPENAI_API_KEY` for accessing OpenAI GPT models (the gpt-4o or gpt-5 model family is required for web search).
- Environment variable `GEMINI_API_KEY` for accessing Google Gemini models.
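
Before running the notebook, it can help to confirm that the keys listed above are actually set. The following is a minimal sketch; the helper name `missing_api_keys` is hypothetical and not part of the notebook, and in practice only the key for the model family you select is needed.

```python
import os

# Variable names are taken from the requirements list; the helper is illustrative.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")

def missing_api_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

missing = missing_api_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```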

In [1]:
import pandas as pd

import llms
%pip list

Package                      Version
---------------------------- --------------
annotated-types              0.7.0
anyio                        4.11.0
appnope                      0.1.4
argon2-cffi                  25.1.0
argon2-cffi-bindings         25.1.0
arrow                        1.3.0
asttokens                    3.0.0
async-lru                    2.0.5
attrs                        25.3.0
babel                        2.17.0
beautifulsoup4               4.14.2
bleach                       6.2.0
cachetools                   6.2.0
certifi                      2025.8.3
cffi                         2.0.0
charset-normalizer           3.4.3
comm                         0.2.3
debugpy                      1.8.17
decorator                    5.2.1
defusedxml                   0.7.1
distro                       1.9.0
executing                    2.2.1
fastjsonschema               2.21.2
filelock                     3.19.1
filetype                     1.2.0
fqdn 

## Set up Environment

Set up environment-specific parameters. Modify these to suit your local environment.

In [2]:
# Locations of the data sources
#

data_root = "../data"         # Directory to the data


In [3]:
# Tweak these values

# Physician specialty
specialty = "pulmonologist"

# Disease or condition to discuss
disease = "COPD"

# Number of questions in the interview
num_questions = 30

# Number of physicians
num_physicians = 20

# LLM Model to use
# model_name = "gpt-4o"
model_name = "gpt-5"
# model_name = "gemini-2.0"
# model_name = "gemini-2.5"

In [4]:
# Minor variables to tweak

# How many questions to send to the LLM at a time. (Larger values tend to make the LLM answer more briefly.)
question_chunk_size = 5

In [5]:
# Steps in this notebook that can be forced to refresh.
# Many of the steps are time-consuming, so their results are saved in the data directory.
# If a saved result exists, it is reloaded instead of being recalculated.
# Setting a step to True forces the code to recalculate the result for that step.
steps = {
    "physician_profile": False,  # Recreate physician profiles from diverse locales in the US
    "questionnaire": False,      # Regenerate interview questions
    "transcripts": False,        # Erase existing transcripts and start over
}

def refresh(what: str) -> bool:
    # True only if the step is listed in `steps` and flagged for refresh
    return steps.get(what, False)
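
The reload-or-recalculate pattern described in the comments above can be sketched as a small reusable helper. This is an illustration only, under the assumption that the cached result is plain text; `load_or_create` and the demo names are hypothetical and do not appear in the notebook.

```python
import tempfile
from pathlib import Path
from typing import Callable

def load_or_create(path: Path, create: Callable[[], str], force: bool = False) -> str:
    """Return cached text from `path`, or call `create()` and cache its result."""
    if path.is_file() and not force:
        return path.read_text(encoding="utf-8")
    text = create()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text

# Demo: the expensive step runs once, is served from the cache, then is forced to rerun.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.txt"
    calls = []

    def expensive() -> str:
        calls.append(1)
        return "result"

    first = load_or_create(cache, expensive)
    second = load_or_create(cache, expensive)              # read from the cache file
    third = load_or_create(cache, expensive, force=True)   # recomputed on request
```

Passing `force=refresh("questionnaire")` would tie such a helper to the `steps` dictionary.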


## Initialization

In [6]:
from pathlib import Path

# Set up data directories
data_dir = Path(data_root) / specialty / disease
data_dir.mkdir(parents=True, exist_ok=True)

physician_profiles = data_dir / "physicians.parquet"
questionnaire = data_dir / "questionnaire.txt"

In [7]:
import llms

# Set up LLM interface
bot = llms.of(model_name)  # use the model selected in the settings cell above

## Create Physician Profiles

Generate fictional physician profiles from diverse locales in the US.

In [8]:
physician_prompt = f"""
<task>
Generate a mock dataset of {num_physicians} {specialty} physicians practicing in diverse regions across the United States.
</task>

<instructions>
- Use web search or external knowledge to identify a variety of U.S. ZIP codes and locations with different demographics (e.g., urban, rural, coastal, Midwest, South).
- For each physician:
  * Assign a plausible name (fictional, but realistic).
  * Specify gender and years in practice (e.g., "12 years").
  * Provide the physician’s ZIP code, city, and state.
  * Summarize key local demographics (e.g., age distribution, socioeconomic factors, ethnic diversity).
  * Describe potential implications for how {disease} may be treated or managed in that region, noting similarities and differences across locales.
- Ensure diversity across the dataset in geography, demographics, and care implications.
- Output only valid JSON (a list of physician objects) with the following fields:
  - zip_code
  - location
  - key_demographics
  - {disease}_care_implications
  - doctor_name
  - gender
  - years_in_practice
- Do not include ```json or any code block delimiters.
- Do not include explanations, comments, or extra text.
</instructions>
"""

In [10]:
import json
import pandas as pd

if not physician_profiles.is_file() or refresh("physician_profile"):
    print("(re)create physician profiles")
    answer = bot.invoke(physician_prompt)

    physicians_df = pd.json_normalize(json.loads(answer["content"]))
    physicians_df.to_parquet(physician_profiles)

else:
    print(f"Read from existing {physician_profiles}")
    physicians_df = pd.read_parquet(physician_profiles)

physicians_df

(re)create physician profiles


| | zip_code | location | key_demographics | COPD_care_implications | doctor_name | gender | years_in_practice |
|---|---|---|---|---|---|---|---|
| 0 | 10001 | New York, NY | Diverse urban population with a wide range of ... | Availability of advanced medical facilities an... | Dr. Emily Chen | Female | 15 years |
| 1 | 60614 | Chicago, IL | Urban, predominantly young and ethnically dive... | High levels of air pollution may contribute to... | Dr. Michael Jones | Male | 20 years |
| 2 | 73301 | Austin, TX | Rapidly growing city with a young, diverse, an... | Emphasis on innovative healthcare solutions an... | Dr. Jessica Patel | Female | 10 years |
| 3 | 85001 | Phoenix, AZ | Large urban population with significant Hispan... | High temperatures and dust storms may exacerba... | Dr. Thomas Nguyen | Male | 18 years |
| 4 | 30301 | Atlanta, GA | Diverse metropolitan area with a significant A... | Focus on accessible healthcare services for un... | Dr. Sarah Lee | Female | 12 years |
| 5 | 48201 | Detroit, MI | Predominantly African American city with consi... | Economic constraints may limit access to healt... | Dr. Robert Adams | Male | 25 years |
| 6 | 98101 | Seattle, WA | Economically thriving city with a highly educa... | Access to innovative technologies and healthca... | Dr. Anne Kim | Female | 22 years |
| 7 | 33101 | Miami, FL | Majority Hispanic population, urban setting wi... | Cultural diversity may require multilingual an... | Dr. Carlos Rodriguez | Male | 17 years |
| 8 | 96801 | Honolulu, HI | High cost of living, ethnically diverse with s... | Cultural practices and access to outdoor activ... | Dr. Maile Tanuvasa | Female | 8 years |
| 9 | 80201 | Denver, CO | Young and active population with a significant... | High elevation may affect respiratory conditio... | Dr. John Rivera | Male | 21 years |
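
The prompt asks the model not to wrap its JSON in a code fence, but models do not always comply, and `json.loads` fails on a fenced reply. A defensive parser could look like the sketch below; `parse_json_reply` is a hypothetical helper, not something the notebook or the `llms` module provides.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built here to avoid writing it literally

def parse_json_reply(text: str):
    """Parse a JSON reply, tolerating an optional Markdown code fence around it."""
    cleaned = text.strip()
    # Drop a leading fence line (with an optional language tag) and a trailing fence line.
    cleaned = re.sub(rf"^{FENCE}[a-zA-Z]*\s*\n", "", cleaned)
    cleaned = re.sub(rf"\n{FENCE}\s*$", "", cleaned)
    return json.loads(cleaned)

fenced_reply = FENCE + 'json\n[{"doctor_name": "Dr. A"}]\n' + FENCE
plain_reply = '[{"doctor_name": "Dr. A"}]'
records = parse_json_reply(fenced_reply)
```

Both the fenced and the plain reply parse to the same list, so the helper could replace the bare `json.loads` call without changing behavior for well-formed replies.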


## Generate Questionnaire

Create the list of questions the interviewer will ask.

In [11]:
questionnaire_prompt = f"""
<task>
Generate at least {num_questions} realistic interview questions for an interviewer to ask a {specialty} about their practices in managing {disease}.
</task>

<instructions>
- Use web search or external knowledge to reflect current standards, challenges, and emerging trends in {disease} care.
- Cover a broad range of themes, including: diagnosis methods, treatment options, patient management, preventive strategies, health disparities, and relevant public health policies.
- Phrase each question exactly as it would be spoken in an interview.
- Ensure questions are clear, self-contained, and understandable without additional context.
- Arrange the questions so that consecutive items naturally build on or complement each other.
- Output format:
  * Plain text only
  * One question per line
  * No numbering, bullet points, or extra characters
  * No explanations, comments, or code fences
- Include the questions in the examples
</instructions>

<examples>
How do you typically approach a new patient presenting with {disease}?
What is your preferred first-line treatment and why?
What has been your experience with [a_drug_name] in managing {disease}?
</examples>
"""


In [12]:
if not questionnaire.is_file() or refresh("questionnaire"):
    print("(re)create the questionnaire")
    answer = bot.invoke(questionnaire_prompt)

    # Keep non-blank lines only; whitespace-only lines are dropped as well
    questions = [q.strip() for q in answer["content"].splitlines() if q.strip()]
    questionnaire.write_text(answer["content"])

else:
    print(f"Read questions from existing {questionnaire}")
    questions = [q.strip() for q in questionnaire.read_text().splitlines() if q.strip()]

questions

(re)create the questionnaire


['How do you typically approach a new patient presenting with COPD?',
 'What diagnostic tests do you find most effective in confirming a COPD diagnosis?',
 'Can you describe the role of spirometry in the management of COPD?',
 'How do you differentiate COPD from asthma in patients?',
 'What is your preferred first-line treatment and why?',
 'How do you decide when to escalate a patient’s COPD treatment?',
 'Can you talk about your experience with the use of bronchodilators in COPD management?',
 'What role do corticosteroids play in your COPD treatment plans?',
 'How do you approach the use of combination inhalers in treating COPD?',
 'What has been your experience with recent advancements in COPD medications?',
 'What strategies do you use for preventing COPD exacerbations?',
 'How do you monitor and assess the progression of COPD in your patients?',
 'Can you discuss the importance of smoking cessation in COPD management?',
 'How do you address potential barriers to adherence with pr
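
Although the prompt requests one unnumbered question per line, a model may still add numbering or bullets. The sketch below shows one way to normalize such output; `clean_questions` is a hypothetical helper and the set of prefixes it strips is an assumption about what a model might emit.

```python
import re

# Matches leading "1." / "2)" style numbering or a "-", "*", "•" bullet.
_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")

def clean_questions(raw: str) -> list[str]:
    """Split raw LLM output into questions, dropping blanks and list prefixes."""
    questions = []
    for line in raw.splitlines():
        line = _PREFIX.sub("", line).strip()
        if line:
            questions.append(line)
    return questions

sample = "1. How are you?\n\n- What next?\n  Plain question?  \n"
cleaned = clean_questions(sample)
```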

## Generate Transcripts

Generate a mock interview transcript for each physician.

In [13]:
transcript_system_prompt = f"""
<task>
Simulate a realistic interview transcript between an interviewer and a {specialty}, focusing on how the physician manages {disease} in their specific regional and demographic context.
</task>

<instructions>
- Use web search or external medical knowledge to understand how regional demographics, socioeconomic factors, and healthcare infrastructure influence {disease} care in the physician’s area.
- Incorporate these contextual factors naturally into the physician’s responses (e.g., access to care, climate, environment, patient population characteristics, insurance coverage, public health resources, or cultural attitudes toward treatment).
- The transcript must read as a natural, professional, back-and-forth conversation between the interviewer and the physician.
- The physician’s replies should sound grounded in practice experience — balancing clinical knowledge, local realities, and personal insight.
- Use the provided physician profile to guide tone, expertise, and geographic perspective.
- The interviewer will ask roughly {num_questions} questions, provided iteratively by the user.
- Maintain conversational continuity across iterations, referencing previous remarks where appropriate.
- Rephrase, merge, or slightly expand user-provided questions when needed for smoother flow or deeper insight.
- Only generate dialogue lines (no narration, commentary, or stage directions).
- Use the following structure exactly:
  Interviewer: [question or comment]
  [doctor_name]: [response]
- The user will mark the interview boundaries with [start interview] and [end interview].
  Continue the conversation naturally between these markers as directed.
</instructions>

<doctor_profile>
{{doctor_profile}}
</doctor_profile>
"""

doctor_profile_essential_fields = ["doctor_name", "gender", "years_in_practice", "location", "zip_code"]
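
The system prompt mixes two kinds of placeholders: single braces are filled immediately by the f-string, while the doubled braces around `doctor_profile` survive as a single-brace placeholder to be filled later. The sketch below illustrates the mechanism with plain `str.format` standing in for the later step; how the `llms` wrapper actually substitutes `arguments` is not shown in this notebook, so that part is an assumption.

```python
specialty = "pulmonologist"  # same value as in the settings cell

# The f-string fills {specialty} now; {{doctor_profile}} collapses to {doctor_profile}.
template = f"You are a {specialty}.\nProfile: {{doctor_profile}}"

# A later formatting step fills the remaining placeholder (str.format is a stand-in here).
prompt = template.format(doctor_profile="Dr. Example, 10 years")
```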


In [14]:
def get_doctor(index: int):
    return physicians_df.iloc[index][doctor_profile_essential_fields]

In [15]:
from tqdm.notebook import tqdm

def make_transcript(doctor_profile):

    transcript_prompt = [
        ("system", transcript_system_prompt),
    ]
    doctor_name = doctor_profile['doctor_name']

    transcript = "## Physician's Profile:\n"
    # .items() works for both a dict and a pandas Series (iterating a Series yields values, not keys)
    for key, value in doctor_profile.items():
        transcript += f"   {key}: {value}\n"

    transcript += "\n## Transcript:\n\n"
    for i in tqdm(range(0, len(questions), question_chunk_size), desc=f"Interviewing {doctor_name}"):
        next_questions = questions[i: i + question_chunk_size]
        start_interview = "[start interview]\n" if i == 0 else ""
        # Leading newline keeps the marker on its own line instead of glued to the last question
        end_interview = "\n[end interview]" if i + question_chunk_size >= len(questions) else ""
        transcript_prompt.append(("user", start_interview + '\n'.join(next_questions) + end_interview))

        answer = bot.invoke(transcript_prompt, arguments={"doctor_profile": doctor_profile})

        transcript_prompt.append(("assistant", answer["content"]))
        # Separate consecutive chunks so dialogue lines do not run together
        transcript += answer["content"].strip() + "\n\n"

    return transcript

# Testing
# make_transcript(get_doctor(0))
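
The chunking logic inside `make_transcript` can be looked at in isolation. The following sketch builds only the user messages, with no LLM call, so the placement of the `[start interview]` and `[end interview]` markers is easy to inspect; `chunk_messages` is a hypothetical helper written for this illustration.

```python
def chunk_messages(questions, chunk_size):
    """Group questions into user messages and mark the interview boundaries."""
    messages = []
    for i in range(0, len(questions), chunk_size):
        lines = list(questions[i:i + chunk_size])
        if i == 0:
            lines.insert(0, "[start interview]")   # first chunk opens the interview
        if i + chunk_size >= len(questions):
            lines.append("[end interview]")        # last chunk closes it
        messages.append("\n".join(lines))
    return messages

demo = chunk_messages(["q1", "q2", "q3"], 2)
```

With 30 questions and `question_chunk_size = 5`, this produces six messages per physician.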

In [16]:
def get_transcript_file_name(doctor_profile):
    doctor_name = doctor_profile["doctor_name"].replace("Dr.", "").replace("MD", "").strip()
    return doctor_name.replace(" ", "_") + '_' + doctor_profile["location"].replace(",", "").replace(" ", "_") + ".txt"

# get_transcript_file_name(get_doctor(5))
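
Generated names and locations may contain characters that are awkward in file names, such as apostrophes, slashes, or periods. A stricter alternative is sketched below, assuming ASCII-only file names are acceptable; `safe_file_stem` is hypothetical, and it discards non-ASCII letters, which may not suit every dataset.

```python
import re

def safe_file_stem(*parts: str) -> str:
    """Join parts with underscores and collapse anything non-alphanumeric into one underscore."""
    joined = "_".join(parts)
    stem = re.sub(r"[^A-Za-z0-9]+", "_", joined)
    return stem.strip("_")

stem = safe_file_stem("Emily Chen", "New York, NY")
```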

In [20]:
import shutil

transcript_dir = data_dir / "transcripts"
# generating_range = range(num_physicians)
generating_range = range(5)

if transcript_dir.is_dir() and refresh("transcripts"):
    shutil.rmtree(transcript_dir)

transcript_dir.mkdir(parents=True, exist_ok=True)

for i in tqdm(generating_range, desc="Physician Progress"):
    doctor_profile = dict(get_doctor(i))
    transcript_file = transcript_dir / get_transcript_file_name(doctor_profile)

    if not transcript_file.is_file():
        transcript = make_transcript(doctor_profile)
        transcript_file.write_text(transcript)


Physician Progress:   0%|          | 0/5 [00:00<?, ?it/s]

Interviewing Dr. Emily Chen:   0%|          | 0/6 [00:00<?, ?it/s]

Interviewing Dr. Michael Jones:   0%|          | 0/6 [00:00<?, ?it/s]

Interviewing Dr. Jessica Patel:   0%|          | 0/6 [00:00<?, ?it/s]

Interviewing Dr. Thomas Nguyen:   0%|          | 0/6 [00:00<?, ?it/s]

Interviewing Dr. Sarah Lee:   0%|          | 0/6 [00:00<?, ?it/s]