## Automated AI agent for topic discovery and topic assignment for a set of documents

This workflow automates two steps: discovering topics within a set of documents, and assigning those topics to the documents. It uses LLMs to analyze the content of the documents, extract relevant topics, and assign up to three topics to each document.

### Generating a synthetic dataset for testing

The `generate_synthetic_docs()` function generates a synthetic dataset of documents, which can be used to test the topic discovery and assignment workflow.

In [1]:
from docagent import synthetic as sn
import pandas as pd

pd.set_option("display.max_colwidth", 800)

# Generate synthetic documents for a specific industry and context
investment_bank_employee_survey = sn.generate_synthetic_docs(
    industry="investment banking",
    context="comments provided by employees on their work experience",
    num_docs=100,
)


    Generating 100 synthetic documents for the investment banking industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 100 synthetic documents successfully in 81 seconds.


Viewing a few of the documents in the synthetic dataset:

In [2]:
investment_bank_employee_survey.head()

Unnamed: 0,docID,content
0,1,"Working in investment banking has been challenging but rewarding. The long hours can be difficult to manage, but the exposure to complex financial transactions has accelerated my professional growth significantly."
1,2,"The competitive environment pushes me to perform at my best. While the pressure can be intense, I appreciate the meritocratic culture where hard work is recognized and rewarded."
2,3,"Balancing work-life has been nearly impossible during deal closings. The expectation to be available 24/7 is unsustainable long-term, though the compensation somewhat makes up for it."
3,4,"The learning curve was steep when I started, but the structured training program helped me adapt quickly. Senior colleagues have been surprisingly supportive despite their busy schedules."
4,5,Client interactions are the most fulfilling aspect of my role. Helping companies navigate complex financial decisions and seeing the tangible impact of our work makes the long hours worthwhile.
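
The synthetic corpus above is a pandas DataFrame with `docID` and `content` columns. A corpus of real documents could presumably be prepared in the same shape. The sketch below assumes that these two columns are all the downstream functions need, which this notebook does not confirm; `build_corpus` is a hypothetical helper and is not part of `docagent`.

```python
import pandas as pd


def build_corpus(texts):
    """Hypothetical helper: wrap raw strings in a DataFrame shaped like the synthetic corpus."""
    # Drop empty or whitespace-only entries and trim the rest
    cleaned = [t.strip() for t in texts if t and t.strip()]
    return pd.DataFrame(
        {
            "docID": range(1, len(cleaned) + 1),  # 1-based IDs, as in the output above
            "content": cleaned,
        }
    )


my_corpus = build_corpus(
    [
        "The onboarding process was smooth.",
        "   ",
        "Too many approval steps slow everything down.",
    ]
)
print(my_corpus)
```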


### Running topic discovery

Topic discovery can be run using the `discover_topics()` function. This function takes a DataFrame of documents and returns a DataFrame of the topics discovered in them, along with a description, keywords, example documents, and prevalence for each topic.

In [3]:
from docagent import topic_discovery as td

# Discover between 6 and 10 topics in the synthetic documents
investment_bank_employee_survey_topics = await td.discover_topics(
    corpus=investment_bank_employee_survey,
    min_topics=6,
    max_topics=10,
)

Starting topic discovery from 100 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 10 topics in 36 seconds.


We can view the details of the topics discovered in the synthetic dataset:

In [4]:
investment_bank_employee_survey_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,2,Career Development & Learning,"Professional growth opportunities through challenging work, skill development, mentorship, and exposure to complex financial transactions. Includes training programs and knowledge acquisition.","[professional growth, learning curve, skill development, mentorship, training, career development, expertise]","[1, 4, 6, 10, 15, 17, 29, 36, 45, 56, 63, 71, 82, 90]",0.14
1,9,Analytical & Technical Skills,"Development of financial modeling, valuation expertise, and analytical capabilities. Includes the intellectual stimulation from complex problem-solving and quantitative analysis.","[financial modeling, analytical skills, valuation, quantitative analysis, technical expertise]","[6, 17, 21, 29, 36, 56, 85, 88, 93, 97]",0.1
2,5,Team Dynamics & Culture,"Workplace relationships, collaboration challenges, competitive environment, and organizational culture. Includes issues with hierarchy, diversity, and internal competition.","[team dynamics, collaboration, competition, culture, hierarchy, diversity, workplace relationships]","[2, 7, 13, 14, 18, 33, 54, 73, 83]",0.09
3,1,Work-Life Balance,"Concerns about long working hours, burnout, and the difficulty of maintaining personal life while meeting demanding job expectations. Includes issues with 24/7 availability and unpredictable schedules.","[work-life balance, long hours, burnout, 24/7 availability, personal life, demanding schedule, overtime]","[3, 12, 19, 32, 43, 52, 62, 75]",0.08
4,10,Organizational Processes,"Internal procedures, compliance requirements, documentation, and bureaucratic challenges. Includes issues with approval processes and administrative inefficiencies.","[compliance, documentation, processes, bureaucracy, procedures, risk management, approvals]","[26, 42, 46, 63, 67, 76, 94]",0.07
5,3,Compensation & Performance,"Issues related to salary structure, bonuses, performance reviews, and recognition systems. Includes concerns about evaluation metrics and the competitive nature of compensation.","[compensation, bonus, performance review, salary, recognition, evaluation, metrics]","[11, 22, 41, 58, 64, 65]",0.06
6,6,Client Relations & Service,"Managing client expectations, building relationships, and balancing client demands with quality work. Includes challenges with unrealistic deadlines and service delivery.","[client interaction, client expectations, relationship management, service quality, client demands]","[5, 16, 39, 53, 66, 91]",0.06
7,4,Technology & Innovation,"Challenges with outdated systems, slow technology adoption, and the need for digital transformation. Includes concerns about falling behind competitors in technological advancement.","[technology, systems, digital transformation, automation, innovation, modernization, AI]","[9, 38, 51, 60, 72]",0.05
8,8,Ethics & Values,"Conflicts between personal values and business objectives, ethical considerations in deal-making, and concerns about revenue pressure versus doing what's right.","[ethics, values, integrity, ESG, sustainable finance, ethical dilemmas]","[25, 44, 57, 69, 95]",0.05
9,7,Global Exposure & Travel,"International experience, working across different markets and regions, and the impact of travel requirements on personal life. Includes benefits and challenges of global work.","[global exposure, international, travel, cross-border, regional offices, global markets]","[8, 20, 40, 50]",0.04


### Assigning topics to documents

Once a topic DataFrame is available, we can assign topics to documents using the `assign_topics()` function. This function takes a DataFrame of topics and a DataFrame of documents, and returns a DataFrame of documents with assigned topics. Where possible, primary, secondary, and tertiary topics are assigned to each document, drawn from the topics discovered in the previous step.

Because discovery and assignment are separate steps, the topics can be edited by hand before assignment, if needed. Topics are assigned according to their relevance to the content of each document.
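
As an illustration of editing topics before assignment, the sketch below renames one topic and drops low-prevalence topics using plain pandas. The small `topics` DataFrame is a stand-in that reuses column names and values from the output above; the 0.05 cut-off is arbitrary, and whether `assign_topics()` accepts a DataFrame with rows removed is an assumption that this notebook does not confirm.

```python
import pandas as pd

# Stand-in for the discovered-topics DataFrame (a subset of its columns)
topics = pd.DataFrame(
    {
        "topicID": [2, 9, 7],
        "topicName": [
            "Career Development & Learning",
            "Analytical & Technical Skills",
            "Global Exposure & Travel",
        ],
        "topicPrevalence": [0.14, 0.10, 0.04],
    }
)

# Work on a copy so the original discovery output stays untouched
edited = topics.copy()

# Rename a topic by its ID
edited.loc[edited["topicID"] == 9, "topicName"] = "Technical Skills"

# Drop topics below an (arbitrary) prevalence threshold
edited = edited[edited["topicPrevalence"] >= 0.05].reset_index(drop=True)

print(edited)
```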

Assignment runs over the documents in consecutive batches. For each batch, the LLM uses a tool to fetch the documents and then assigns each document in the batch to the most relevant topics.
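
The batching itself can be pictured with a few lines of plain Python. This is only a sketch of splitting document IDs into consecutive chunks, not the `docagent` implementation; `consecutive_batches` is a hypothetical name, and the batch size of 20 is taken from the log output shown in this notebook.

```python
def consecutive_batches(doc_ids, batch_size=20):
    """Yield consecutive slices of doc_ids, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(doc_ids), batch_size):
        yield doc_ids[start:start + batch_size]


doc_ids = list(range(1, 101))  # IDs 1..100, as in the synthetic corpus
batches = list(consecutive_batches(doc_ids))

for batch in batches:
    # In the real workflow, each batch would be handed to the LLM at this point
    print(f"Batch covers docs {batch[0]} to {batch[-1]}")
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size.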

In [5]:
from docagent import topic_assignment as ta

# Assign topics to the synthetic documents
synthetic_docs_with_topics = await ta.assign_topics(
    corpus_name="investment_bank_employee_survey",
    corpus=investment_bank_employee_survey,
    topics=investment_bank_employee_survey_topics,
)

Commencing assignment of 100 documents to 10 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 61 to 80...
Document assignment

We can view the assigned topics for the documents in the synthetic dataset:

In [6]:
synthetic_docs_with_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"Working in investment banking has been challenging but rewarding. The long hours can be difficult to manage, but the exposure to complex financial transactions has accelerated my professional growth significantly.",2,Career Development & Learning,1,Work-Life Balance,0,No topic assigned
1,2,"The competitive environment pushes me to perform at my best. While the pressure can be intense, I appreciate the meritocratic culture where hard work is recognized and rewarded.",5,Team Dynamics & Culture,3,Compensation & Performance,0,No topic assigned
2,3,"Balancing work-life has been nearly impossible during deal closings. The expectation to be available 24/7 is unsustainable long-term, though the compensation somewhat makes up for it.",1,Work-Life Balance,3,Compensation & Performance,0,No topic assigned
3,4,"The learning curve was steep when I started, but the structured training program helped me adapt quickly. Senior colleagues have been surprisingly supportive despite their busy schedules.",2,Career Development & Learning,5,Team Dynamics & Culture,0,No topic assigned
4,5,Client interactions are the most fulfilling aspect of my role. Helping companies navigate complex financial decisions and seeing the tangible impact of our work makes the long hours worthwhile.,6,Client Relations & Service,1,Work-Life Balance,0,No topic assigned
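
One way to summarize an assignment table like this is to count how many documents mention each topic at any rank. The sketch below uses the five rows shown above and treats topic ID 0 as the "No topic assigned" placeholder, which is how it appears in this output; the variable names are illustrative.

```python
import pandas as pd

# Topic IDs copied from the five rows displayed above
assigned = pd.DataFrame(
    {
        "docID": [1, 2, 3, 4, 5],
        "primaryTopicID": [2, 5, 1, 2, 6],
        "secondaryTopicID": [1, 3, 3, 5, 1],
        "tertiaryTopicID": [0, 0, 0, 0, 0],
    }
)

# Reshape to one row per (document, rank) pair
long_form = assigned.melt(
    id_vars="docID",
    value_vars=["primaryTopicID", "secondaryTopicID", "tertiaryTopicID"],
    var_name="rank",
    value_name="topicID",
)

# Drop the placeholder ID that marks an empty slot
long_form = long_form[long_form["topicID"] != 0]

# Number of distinct documents per topic, most frequent first
counts = long_form.groupby("topicID")["docID"].nunique().sort_values(ascending=False)
print(counts)
```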


### Running the agent to fully automate the workflow

The agent is designed to automate all steps of the process: it takes a corpus of documents, discovers topics, and assigns these topics to the documents.
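
Conceptually, the agent chains the two steps shown earlier. The sketch below illustrates that chaining with stand-in coroutines so it runs without an LLM. `run_pipeline`, `fake_discover`, and `fake_assign` are hypothetical names, and nothing here reflects how `docagent` is actually implemented.

```python
import asyncio


async def run_pipeline(corpus, discover, assign, **discovery_kwargs):
    """Hypothetical sketch: run a discovery step, then an assignment step."""
    topics = await discover(corpus, **discovery_kwargs)
    assignments = await assign(corpus, topics)
    return topics, assignments


# Stand-in coroutines; a real workflow would call an LLM here
async def fake_discover(corpus, max_topics=3):
    return ["Wait Times", "Communication", "Billing"][:max_topics]


async def fake_assign(corpus, topics):
    # Assign every document to the first topic, purely for illustration
    return {doc_id: topics[0] for doc_id in corpus}


# In a notebook cell you would `await run_pipeline(...)` directly instead
topics, assignments = asyncio.run(
    run_pipeline(
        {1: "long wait", 2: "no updates"},
        fake_discover,
        fake_assign,
        max_topics=2,
    )
)
print(topics, assignments)
```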

First, let's create a new synthetic dataset of documents to test the agent:


In [7]:
# Generate synthetic documents for a specific industry and context
hospital_patient_survey = sn.generate_synthetic_docs(
    industry="hospital emergency care",
    context="comments provided by patients on their experience in the emergency department",
    num_docs=200,
    min_doc_length=50,
    max_doc_length=200,
)

hospital_patient_survey.head()


    Generating 200 synthetic documents for the hospital emergency care industry with the requested context using model claude-3-7-sonnet-latest. 
    This usually takes about one second per document....
              
Generated 200 synthetic documents successfully in 212 seconds.


Unnamed: 0,docID,content
0,1,"The wait time in the emergency department was too long. I sat there for over 3 hours before being seen, despite being in severe pain. The nurses seemed understaffed and overwhelmed. Once I finally saw the doctor, the care was good, but the initial wait was unbearable."
1,2,"I was impressed with how quickly I was triaged when I arrived with chest pain. The staff took my symptoms seriously and I was seen within minutes. The emergency team worked efficiently, running tests and explaining everything they were doing. The facility was clean and everyone was professional."
2,3,"My experience at the emergency department was mixed. The medical care was excellent, but the communication was poor. No one updated me on test results or next steps. I waited for hours between interactions with staff, not knowing what was happening. Better communication would have made a huge difference."
3,4,"The emergency department was chaotic and disorganized. I had to repeat my symptoms to multiple staff members, and there seemed to be no coordination between them. The doctor who eventually treated me seemed rushed and barely listened to my concerns. I left feeling that my condition wasn't properly addressed."
4,5,"I brought my child in with a high fever and was impressed by how the pediatric emergency team handled everything. They were gentle, thorough, and kept us informed throughout the process. The child-friendly room with colorful walls and toys helped keep my little one calm during a scary situation."


In [8]:
# Note: this rebinds the alias `ta`, which previously referred to `topic_assignment`
from docagent import topic_agent as ta

# Perform full topic analysis on the hospital patient survey documents, based on 6-8 topics
discovered_topics, assigned_topics = await ta.full_topic_analysis(
    corpus=hospital_patient_survey,
    corpus_name="hospital_patient_survey",
    min_topics=6,
    max_topics=8,
)

Starting topic discovery from 200 documents using model claude-opus-4-20250514...
Topic discovery complete. Discovered 8 topics in 27 seconds.
Commencing assignment of 200 documents to 8 topics. This will be done in chunks of 20 documents at a time...
Assigning topics for docs 1 to 20...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 21 to 40...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained successfully, continuing with topic assignment...
Processed 20 documents with tool use.
Assigning topics for docs 41 to 60...
Document assignment under way using model claude-3-7-sonnet-latest...
Tool use detected.  Obtaining the documents from the corpus...
Documents obtained suc

The agent produces two DataFrames: the first contains the topics discovered in the documents, and the second contains the documents with assigned topics.

In [9]:
discovered_topics

Unnamed: 0,topicID,topicName,topicDescription,topicKeywords,topicExamples,topicPrevalence
0,5,Special Populations Care,"Experiences of patients with specific needs including pediatric, geriatric, mental health, disabled, and culturally diverse patients requiring specialized approaches.","[pediatric, elderly, mental health, disability, cultural, special needs, accommodation]","[5, 11, 14, 21, 28, 40, 46, 60, 68, 82]",0.2
1,3,Quality of Medical Care,"Patient assessments of the actual medical treatment received, including diagnostic accuracy, treatment effectiveness, and clinical competence of healthcare providers.","[treatment, diagnosis, medical care, doctor, misdiagnosed, clinical, expertise]","[2, 6, 16, 20, 29, 33, 54, 78, 104]",0.18
2,2,Staff Communication and Professionalism,"Issues related to how emergency department staff communicate with patients, including clarity of information, bedside manner, and professional behavior.","[communication, staff, professional, explain, information, update, coordination]","[3, 4, 22, 26, 31, 43, 59, 76, 143]",0.15
3,1,Wait Times and Triage,"Patient experiences related to waiting times in the emergency department and the triage process, including concerns about prioritization and delays in receiving care.","[wait time, triage, hours, delay, prioritization, waiting room, queue]","[1, 10, 13, 23, 25, 51, 113, 127]",0.12
4,7,Pain Management and Medication Safety,"Patient experiences with pain control, medication administration, allergy management, and overall pharmaceutical care in the emergency setting.","[pain, medication, allergy, pain management, pharmaceutical, dosing, drug]","[23, 24, 33, 36, 39, 58, 85, 106, 110]",0.12
5,4,Facility Conditions and Resources,"Physical aspects of the emergency department including cleanliness, equipment quality, comfort amenities, and overall facility maintenance.","[facility, clean, equipment, room, bathroom, comfortable, environment]","[7, 12, 15, 27, 35, 47, 65, 71, 99]",0.1
6,6,Administrative and Billing Issues,"Problems related to registration, insurance, billing processes, and administrative coordination including discharge planning and follow-up care.","[billing, insurance, registration, discharge, administrative, paperwork, follow-up]","[8, 30, 38, 50, 52, 57, 91, 108]",0.08
7,8,Technology and Innovation,"Use of modern technology, telemedicine, electronic systems, and innovative approaches to improve emergency care delivery and patient experience.","[technology, telemedicine, digital, innovation, electronic, app, portal]","[27, 37, 44, 48, 69, 77, 92, 115, 128]",0.05


In [10]:
assigned_topics.head()

Unnamed: 0,docID,docContent,primaryTopicID,primaryTopicName,secondaryTopicID,secondaryTopicName,tertiaryTopicID,tertiaryTopicName
0,1,"The wait time in the emergency department was too long. I sat there for over 3 hours before being seen, despite being in severe pain. The nurses seemed understaffed and overwhelmed. Once I finally saw the doctor, the care was good, but the initial wait was unbearable.",1,Wait Times and Triage,3,Quality of Medical Care,0,No topic assigned
1,2,"I was impressed with how quickly I was triaged when I arrived with chest pain. The staff took my symptoms seriously and I was seen within minutes. The emergency team worked efficiently, running tests and explaining everything they were doing. The facility was clean and everyone was professional.",1,Wait Times and Triage,3,Quality of Medical Care,2,Staff Communication and Professionalism
2,3,"My experience at the emergency department was mixed. The medical care was excellent, but the communication was poor. No one updated me on test results or next steps. I waited for hours between interactions with staff, not knowing what was happening. Better communication would have made a huge difference.",2,Staff Communication and Professionalism,3,Quality of Medical Care,0,No topic assigned
3,4,"The emergency department was chaotic and disorganized. I had to repeat my symptoms to multiple staff members, and there seemed to be no coordination between them. The doctor who eventually treated me seemed rushed and barely listened to my concerns. I left feeling that my condition wasn't properly addressed.",2,Staff Communication and Professionalism,3,Quality of Medical Care,0,No topic assigned
4,5,"I brought my child in with a high fever and was impressed by how the pediatric emergency team handled everything. They were gentle, thorough, and kept us informed throughout the process. The child-friendly room with colorful walls and toys helped keep my little one calm during a scary situation.",5,Special Populations Care,2,Staff Communication and Professionalism,4,Facility Conditions and Resources
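
Since the agent returns the two DataFrames separately, a quick consistency check between them can be useful. The sketch below flags any document whose assigned topic IDs do not appear in the discovered topics. It uses the IDs from the rows displayed above and assumes that 0 is the placeholder for "No topic assigned"; `rows_with_unknown_topics` is a hypothetical helper, not a `docagent` function.

```python
import pandas as pd

ID_COLUMNS = ["primaryTopicID", "secondaryTopicID", "tertiaryTopicID"]


def rows_with_unknown_topics(assigned, discovered, unassigned_id=0):
    """Return rows of `assigned` that reference a topic ID missing from `discovered`."""
    valid_ids = list(discovered["topicID"]) + [unassigned_id]
    is_valid = assigned[ID_COLUMNS].isin(valid_ids).all(axis=1)
    return assigned[~is_valid]


# Topic IDs 1..8, as in the discovered topics above
discovered = pd.DataFrame({"topicID": [1, 2, 3, 4, 5, 6, 7, 8]})

# Assigned IDs copied from the five rows displayed above
assigned = pd.DataFrame(
    {
        "docID": [1, 2, 3, 4, 5],
        "primaryTopicID": [1, 1, 2, 2, 5],
        "secondaryTopicID": [3, 3, 3, 3, 2],
        "tertiaryTopicID": [0, 2, 0, 0, 4],
    }
)

problems = rows_with_unknown_topics(assigned, discovered)
print(f"{len(problems)} documents reference an unknown topic ID")
```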
