In [None]:
"""I will provide you with a document containing question and answer pairs related to pediatric dentistry. Your task is to categorize these QA pairs into around 20 distinct topics. Here are the steps to follow:

<document>
{{DOCUMENT}}
</document>

First, carefully read through the entire document and identify the main topics covered by the question and answer pairs. Aim to come up with approximately 20 topics that effectively capture the key themes without too much overlap between topics. 

Next, go through the document again and categorize each question and answer pair under one and only one of the topics you identified. A given QA pair should not be repeated under multiple topics.

Finally, output the categorized QA pairs with the following format:

<topic>Topic 1 Name</topic>
<qa>
Q: Question 1 text 
A: Answer 1 text
</qa>
<qa>
Q: Question 2 text
A: Answer 2 text 
</qa>

<topic>Topic 2 Name</topic>
<qa>
Q: Question 3 text
A: Answer 3 text
</qa>

And so on for all 20 topics and all QA pairs. Make sure each QA pair is only listed once under its most relevant topic."""

## Load and Test

In [None]:
! pip install anthropic

In [2]:
from dotenv import load_dotenv
import os

# Load the environment variables from the .env file
load_dotenv()

# Get the Claude API key from the environment variables
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

In [4]:
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key=anthropic_api_key,
)

prompt = "What is the capital of France?"
response = client.completions.create(
    prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
    stop_sequences=[anthropic.HUMAN_PROMPT],
    model="claude-v1",
    max_tokens_to_sample=100
)

print(response.completion)

 The capital of France is Paris.


In [3]:
# read the clean_faq_dataset.txt file and store content in a variable called DOCUMENT
with open('/Users/acrobat/Documents/GitHub/extract_html/clean_faq_dataset.txt', 'r') as file:
    DOCUMENT = file.read()

In [4]:
# Print the total number of characters in the document
print(len(DOCUMENT))

243454


In [None]:
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key=anthropic_api_key,
)

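# NOTE: the {{DOCUMENT}} placeholder in the prompt below is sent literally.
# The SDK does not substitute variables, so the placeholder must be replaced
# with the real document text before the request is made.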
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=4000,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "I will provide you with a document containing question and answer pairs related to pediatric dentistry. Your task is to categorize these QA pairs into around 10 distinct topics. Here are the steps to follow:\n\n<document>\n{{DOCUMENT}}\n</document>\n\nFirst, carefully read through the entire document and identify the main topics covered by the question and answer pairs. Aim to come up with approximately 10 topics that effectively capture the key themes without too much overlap between topics. \n\nNext, go through the document again and categorize each question and answer pair under one and only one of the topics you identified. A given QA pair should not be repeated under multiple topics.\n\nFinally, output the categorized QA pairs with the following format:\n\n<topic>Topic 1 Name</topic>\n<qa>\nQ: Question 1 text \nA: Answer 1 text\n</qa>\n<qa>\nQ: Question 2 text\nA: Answer 2 text \n</qa>\n\n<topic>Topic 2 Name</topic>\n<qa>\nQ: Question 3 text\nA: Answer 3 text\n</qa>\n\nAnd so on for all 10 topics and all QA pairs. Make sure each QA pair is only listed once under its most relevant topic."
                }
            ]
        }
    ]
)
print(message.content)
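
The prompt in the cell above still contains the literal `{{DOCUMENT}}` placeholder. A minimal sketch of one way to fill it in before sending the request is shown here. The helper name `fill_template`, the shortened `PROMPT_TEMPLATE`, and the sample text are assumptions made for illustration; they are not part of the Anthropic SDK or of the original notebook.

```python
# Hypothetical, shortened template; the real prompt is the long string in the cell above.
PROMPT_TEMPLATE = (
    "Categorize the QA pairs in this document.\n\n"
    "<document>\n{{DOCUMENT}}\n</document>"
)


def fill_template(template, **values):
    """Replace each {{NAME}} placeholder with the matching keyword value."""
    filled = template
    for name, value in values.items():
        filled = filled.replace("{{" + name + "}}", value)
    return filled


# Stand-in for the text read from clean_faq_dataset.txt
sample_document = "Question: When do baby teeth appear?\nAnswer: Around 6 months."
prompt_text = fill_template(PROMPT_TEMPLATE, DOCUMENT=sample_document)
print(prompt_text)
```

The resulting `prompt_text` would then be passed as the `"text"` value of the user message instead of the raw template.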

In [6]:
message.content[0].text

'Here is my attempt at categorizing the pediatric dentistry QA pairs into 10 topics:\n\n<topic>Dental Visits and Checkups</topic>\n<qa>\nQ: When should I schedule my child\'s first dental visit?\nA: The American Academy of Pediatric Dentistry (AAPD) recommends that a child go to the dentist by age 1 or within six months after the first tooth erupts. Primary teeth typically begin growing in around 6 months of age.\n</qa>\n<qa>\nQ: How often should a child see the dentist?\nA: Children should visit the dentist every 6 months for regular dental cleanings and checkups. Some dentists may schedule interim visits for every 3 months when the child is very young to build up a comfort level or to treat a developing problem.\n</qa>\n\n<topic>Teething and Tooth Eruption</topic>\n<qa>\nQ: Which teeth will my baby get first?\nA: The two bottom front teeth (lower central incisors) are usually the first to appear, typically around 6 months of age. The two top front teeth (upper central incisors) usual

In [7]:
questions = message.content[0].text

In [84]:
questions = message.content[0].text
print(type(questions))

<class 'str'>


In [8]:
print(questions)

Here is my attempt at categorizing the pediatric dentistry QA pairs into 10 topics:

<topic>Dental Visits and Checkups</topic>
<qa>
Q: When should I schedule my child's first dental visit?
A: The American Academy of Pediatric Dentistry (AAPD) recommends that a child go to the dentist by age 1 or within six months after the first tooth erupts. Primary teeth typically begin growing in around 6 months of age.
</qa>
<qa>
Q: How often should a child see the dentist?
A: Children should visit the dentist every 6 months for regular dental cleanings and checkups. Some dentists may schedule interim visits for every 3 months when the child is very young to build up a comfort level or to treat a developing problem.
</qa>

<topic>Teething and Tooth Eruption</topic>
<qa>
Q: Which teeth will my baby get first?
A: The two bottom front teeth (lower central incisors) are usually the first to appear, typically around 6 months of age. The two top front teeth (upper central incisors) usually come in shortl

In [10]:
import re
import os

def parse_excerpt(excerpt, folder_path):
    # Split the excerpt into topics
    topics = re.split(r'(?=<topic>)', excerpt)
    
    for topic in topics:
        # Extract the topic name
        topic_match = re.search(r'<topic>(.*?)</topic>', topic, re.DOTALL)
        if topic_match:
            topic_name = topic_match.group(1).strip()
            # Replace special characters in the topic name
            topic_name = re.sub(r'[^a-zA-Z0-9\s]', '_', topic_name)
            
            # Create the folder if it doesn't exist
            os.makedirs(folder_path, exist_ok=True)
            
            # Create the file path
            file_path = os.path.join(folder_path, f"{topic_name}.txt")
            
            # Extract the Q&A pairs
            qa_pairs = re.findall(r'<qa>\s*Q:(.*?)\s*A:(.*?)</qa>', topic, re.DOTALL)
            
            # Create a new file for the topic and write the Q&A pairs
            with open(file_path, 'w') as file:
                file.write(f"Topic: {topic_name}\n\n")
                for qa_pair in qa_pairs:
                    question, answer = qa_pair
                    file.write(f"Q: {question.strip()}\nA: {answer.strip()}\n\n")
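
To check the parsing logic without touching the file system, the same regular expressions can be run on a small sample string. This is only a sketch: `parse_topics` and the sample text are hypothetical names and data introduced here, and the function returns a dictionary instead of writing files.

```python
import re


def parse_topics(text):
    """Map each topic name to its list of (question, answer) tuples."""
    result = {}
    for block in re.split(r"(?=<topic>)", text):
        topic_match = re.search(r"<topic>(.*?)</topic>", block, re.DOTALL)
        if not topic_match:
            continue  # skip any text that precedes the first <topic> tag
        name = topic_match.group(1).strip()
        pairs = re.findall(r"<qa>\s*Q:(.*?)\s*A:(.*?)</qa>", block, re.DOTALL)
        result.setdefault(name, []).extend((q.strip(), a.strip()) for q, a in pairs)
    return result


sample = """Here is my attempt:

<topic>Dental Visits and Checkups</topic>
<qa>
Q: How often should a child see the dentist?
A: Every 6 months.
</qa>

<topic>Teething and Tooth Eruption</topic>
<qa>
Q: Which teeth will my baby get first?
A: The lower central incisors.
</qa>
"""

parsed = parse_topics(sample)
for topic_name, qa_list in parsed.items():
    print(topic_name, len(qa_list))
```

Counting the parsed pairs per topic this way makes it easier to spot a response that was cut off by the output token limit before any files are written.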

In [11]:
folder_path = "/Users/acrobat/Documents/GitHub/extract_html/data_cleaned_faq_by_topic"
parse_excerpt(questions, folder_path)

#### Summary

I had great plans to do this with the Claude API, but the output is capped at 4,000 tokens per response (`max_tokens=4000` in the call above), which is too small to return all of the categorized Q&A pairs at once. I would have to feed the content in piece by piece, which I did not want to do because I want the LLM to consider the entire document. So I moved to OpenAI and tried it through the ChatGPT interface, one batch at a time. That also did not work: it would only process about 16 questions at a time. Total FAILURE!!!! Neither Anthropic nor OpenAI worked for the whole-document approach.


I want to group about 750 Q&A pairs into 20 groups. Okay, different approach: let's start with the cheese and work back to the mouse. My goal is to add these separately into Voiceflow and chunk them by topic. The default Voiceflow chunk size is 1000 tokens, with an allowed range of 500 to 1500 tokens. I used tiktoken and calculated that about 1000 tokens corresponds to 14 question-and-answer pairs, so 14 pairs per document is a good target. What if I split my ~750 questions into 14-question documents? Actually, for ease of calculation, 15-question chunks.
Then I provide Claude with the 20 topics I generated and ask it to place the pairs from each document into a topic. Each topic will have its own document in my folder, and as Claude analyzes each question, the pair is dropped into the relevant document.
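
The chunk-size reasoning above can be checked with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted in the notes (about 750 pairs, about 1000 tokens per 14 pairs, a 500 to 1500 token range); the variable names are made up for illustration.

```python
import math

TOTAL_PAIRS = 750            # approximate number of Q&A pairs from the notes
PAIRS_PER_CHUNK = 15         # chosen chunk size
TOKENS_PER_14_PAIRS = 1000   # tiktoken estimate quoted in the notes

# Number of split documents needed to hold every pair
chunk_count = math.ceil(TOTAL_PAIRS / PAIRS_PER_CHUNK)

# Estimated tokens in one 15-pair chunk, scaled from the 14-pair measurement
tokens_per_pair = TOKENS_PER_14_PAIRS / 14
tokens_per_chunk = round(tokens_per_pair * PAIRS_PER_CHUNK)

# Check the estimate against the 500-1500 token range
in_range = 500 <= tokens_per_chunk <= 1500

print(chunk_count, tokens_per_chunk, in_range)  # prints: 50 1071 True
```

So 15 pairs per file gives about 50 files of roughly 1,070 tokens each, which stays inside the allowed range.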

In [None]:
from dotenv import load_dotenv
import os

# Load the environment variables from the .env file
load_dotenv()

# Get the Claude API key from the environment variables
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

In [None]:
# read the clean_faq_dataset.txt file and store content in a variable called DOCUMENT
with open('/Users/acrobat/Documents/GitHub/extract_html/clean_faq_dataset.txt', 'r') as file:
    DOCUMENT = file.read()

- Topic 1: Preventive Care and Check-ups ~	44
- Topic 2: Tooth Decay and Cavities	65
- Topic 3: Dental Procedures and Treatments	51
- Topic 4: Oral Hygiene Practices	62
- Topic 5: Diet and Nutrition's Impact	47
- Topic 6: Infant Dental Care	87
- Topic 7: Early Childhood Oral Health	78
- Topic 8: Orthodontics and Teeth Alignment	29
- Topic 9: Pediatric Dentistry	31
- Topic 10: Dental Emergencies	22
- Topic 11: Dental Anxiety and Comfort	34
- Topic 12: Special Needs Dentistry	18
- Topic 13: Oral Health Education	24
- Topic 14: Cultural and Historical Perspectives	14
- Topic 15: Parental Guidance and Involvement	43
- Topic 16: Dental Health Milestones	21
- Topic 17: Symptoms and Diagnosis	37
- Topic 18: Dental Innovations and Research	15
- Topic 19: Public Health and Dental Policies	12
- Topic 20: Miscellaneous

- Topic 1: Preventive Care and Check-ups
- Topic 2: Tooth Decay and Cavities	
- Topic 3: Dental Procedures and Treatments	
- Topic 4: Oral Hygiene Practices	
- Topic 5: Diet and Nutrition's Impact	
- Topic 6: Infant Dental Care	
- Topic 7: Early Childhood Oral Health	
- Topic 8: Orthodontics and Teeth Alignment	
- Topic 9: Pediatric Dentistry	
- Topic 10: Dental Emergencies	
- Topic 11: Dental Anxiety and Comfort	
- Topic 12: Special Needs Dentistry	
- Topic 13: Oral Health Education	
- Topic 14: Cultural and Historical Perspectives	
- Topic 15: Parental Guidance and Involvement	
- Topic 16: Dental Health Milestones	
- Topic 17: Symptoms and Diagnosis	
- Topic 18: Dental Innovations and Research	
- Topic 19: Public Health and Dental Policies	
- Topic 20: Miscellaneous

In [13]:
# Split the document into smaller documents of 15 Q&A pairs each.
def split_qa_document(filepath, output_dir, qa_per_file=15):
    """
    Splits a large document containing Q&A pairs into multiple smaller documents,
    each containing a specified number of Q&A pairs.

    Args:
    - filepath (str): Path to the input file containing the Q&A pairs.
    - output_dir (str): Directory to save the output files.
    - qa_per_file (int): Number of Q&A pairs per output file.
    """
    import os

    # Create the output directory if it doesn't exist
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Initialize variables to store Q&A pairs and manage file writing
    qa_count = 0
    file_index = 1
    qa_pairs = []

    # Process each line in the input file
    for line in lines:
        if line.strip():  # If line is not empty
            qa_pairs.append(line)
        else:  # Empty line indicates the end of a Q&A pair
            qa_count += 1
            if qa_count == qa_per_file:  # Check if current document has enough Q&A pairs
                # Write the current set of Q&A pairs to a file
                output_filepath = os.path.join(output_dir, f"qa_part_{file_index}.txt")
                with open(output_filepath, 'w', encoding='utf-8') as output_file:
                    output_file.writelines(qa_pairs)
                # Reset for the next file
                qa_pairs = []
                qa_count = 0
                file_index += 1

    # Check if there are any remaining Q&A pairs to write after the last full set
    if qa_pairs:
        output_filepath = os.path.join(output_dir, f"qa_part_{file_index}.txt")
        with open(output_filepath, 'w', encoding='utf-8') as output_file:
            output_file.writelines(qa_pairs)
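
The function above counts every blank line as the end of a Q&A pair, so two blank lines in a row would be counted as two pairs. A possible variant that tolerates repeated blank lines is sketched here, assuming pairs are separated by one or more blank lines. `chunk_qa_blocks` and the sample text are hypothetical, and the sketch works in memory instead of writing files.

```python
def chunk_qa_blocks(text, qa_per_file=15):
    """Group blank-line-separated Q&A blocks into lists of at most qa_per_file."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes a block only if one is currently open,
            # so runs of blank lines are not counted as extra pairs.
            blocks.append("\n".join(current))
            current = []
    if current:
        # Last block when the text does not end with a blank line
        blocks.append("\n".join(current))
    return [blocks[i:i + qa_per_file] for i in range(0, len(blocks), qa_per_file)]


sample = (
    "Question: A?\nAnswer: a.\n\n\n"   # two blank lines after the first pair
    "Question: B?\nAnswer: b.\n\n"
    "Question: C?\nAnswer: c."
)
chunks = chunk_qa_blocks(sample, qa_per_file=2)
print([len(chunk) for chunk in chunks])  # prints: [2, 1]
```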




In [14]:
split_qa_document('/Users/acrobat/Documents/GitHub/extract_html/clean_faq_dataset.txt', '/Users/acrobat/Documents/GitHub/extract_html/data_cleaned_faq_by_topic/raw_split_docs')
#done! in data_cleaned_faq_by_topic/raw_split_docs/qa_part_1.txt

In [17]:
# Read the first split file (qa_part_1.txt) and store its content in a variable called DOCUMENT
with open('data_cleaned_faq_by_topic/raw_split_docs/qa_part_1.txt', 'r') as file:
    DOCUMENT = file.read()

In [18]:
len(DOCUMENT)

4725

In [25]:
print("DOCUMENT content:")
print(DOCUMENT)

DOCUMENT content:
Question: What are common causes of snoring in children?
Answer: Common causes of snoring in children include large tonsils or adenoids, allergies, asthma, a deviated septum, throat infections, and sleep apnea.
Question: When should parents be concerned about their child's snoring?
Answer: Parents should seek medical advice if their child's snoring is very loud, occurs most nights, involves gasping, pausing while sleeping, or if the child sleeps with an extended neck or open mouth.
Question: What is laryngomalacia, and how does it affect newborns?
Answer: Laryngomalacia is a condition where the baby's voice box collapses when they breathe in, causing noisy breathing or stridor. Symptoms include inward chest pulling, difficulty feeding, poor weight gain, periodic breathing stops, and cyanosis. It often resolves by 20 months of age but requires medical evaluation.
Question: How can parents differentiate between normal newborn snoring and laryngomalacia?
Answer: Normal n

In [63]:
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key=anthropic_api_key,
)

# Read the current split file (here qa_part_10.txt) and store its content in a variable called DOCUMENT
with open('data_cleaned_faq_by_topic/raw_split_docs/qa_part_10.txt', 'r') as file:
    DOCUMENT = file.read()

# Use the DOCUMENT variable directly in the message content
message_content = f"""
You will be provided with a document containing question and answer pairs related to pediatric dentistry. Your task is to categorize these QA pairs into the following 20 distinct topics:

- Topic 1: Preventive Care and Check-ups
- Topic 2: Tooth Decay and Cavities
- Topic 3: Dental Procedures and Treatments
- Topic 4: Oral Hygiene Practices
- Topic 5: Diet and Nutrition's Impact
- Topic 6: Infant Dental Care
- Topic 7: Early Childhood Oral Health
- Topic 8: Orthodontics and Teeth Alignment
- Topic 9: Pediatric Dentistry
- Topic 10: Dental Emergencies
- Topic 11: Dental Anxiety and Comfort
- Topic 12: Special Needs Dentistry
- Topic 13: Oral Health Education
- Topic 14: Cultural and Historical Perspectives
- Topic 15: Parental Guidance and Involvement
- Topic 16: Dental Health Milestones
- Topic 17: Symptoms and Diagnosis
- Topic 18: Dental Innovations and Research
- Topic 19: Public Health and Dental Policies
- Topic 20: Miscellaneous

Here is the document containing the QA pairs:

<document>
{DOCUMENT}
</document>

First, carefully read through the entire document and identify the main topics covered by the question and answer pairs. Aim to place each question and answer pair into the topic that most effectively categorizes it. 

Note that a given QA pair should not be repeated under multiple topics. Each QA pair should only be listed once, under its most relevant topic.

Once you have categorized all the QA pairs, output them using the following format:

<topic>Topic 1 Name</topic>
<qa>
Q: Question 1 text
A: Answer 1 text
</qa>
<qa>
Q: Question 2 text
A: Answer 2 text 
</qa>

<topic>Topic 2 Name</topic>
<qa>
Q: Question 3 text
A: Answer 3 text
</qa>

And so on for all 20 topics and all QA pairs. Make sure each QA pair is only listed once under its most relevant topic. Do not return Topic 1, Topic 2, etc., as the topic names. Instead, provide the actual names of the topics that you have identified.
"""

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=4000,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": message_content
                }
            ]
        }
    ]
)

print(message.content)

[TextBlock(text="Here is the document with the QA pairs categorized into relevant topics:\n\n<topic>Oral Hygiene Practices</topic>\n<qa>\nQ: How can parents ensure their child develops good oral hygiene habits?\nA: Parents can ensure good oral hygiene by cleaning their baby's mouth after each feeding, supervising and assisting older children with brushing using fluoride toothpaste, and instilling healthy dietary habits.\n</qa>\n<qa>\nQ: When is the right time to start using fluoridated toothpaste for a child, and why?\nA: Parents should start using fluoridated toothpaste when the child is at least 2 years old to avoid fluorosis, which can cause tooth discoloration and damage.\n</qa>\n\n<topic>Tooth Decay and Cavities</topic>\n<qa>\nQ: What should parents do if they suspect their child has Baby Bottle Tooth Decay?\nA: If parents suspect Baby Bottle Tooth Decay, they should immediately consult with a pediatric dentist to assess the severity of the decay and discuss appropriate treatment 

In [64]:
# Split the response into topics and QA pairs
topics = message.content[0].text.split("<topic>")[1:]
print(topics)

["Oral Hygiene Practices</topic>\n<qa>\nQ: How can parents ensure their child develops good oral hygiene habits?\nA: Parents can ensure good oral hygiene by cleaning their baby's mouth after each feeding, supervising and assisting older children with brushing using fluoride toothpaste, and instilling healthy dietary habits.\n</qa>\n<qa>\nQ: When is the right time to start using fluoridated toothpaste for a child, and why?\nA: Parents should start using fluoridated toothpaste when the child is at least 2 years old to avoid fluorosis, which can cause tooth discoloration and damage.\n</qa>\n\n", 'Tooth Decay and Cavities</topic>\n<qa>\nQ: What should parents do if they suspect their child has Baby Bottle Tooth Decay?\nA: If parents suspect Baby Bottle Tooth Decay, they should immediately consult with a pediatric dentist to assess the severity of the decay and discuss appropriate treatment options to prevent further damage.\n</qa>\n\n', 'Preventive Care and Check-ups</topic>\n<qa>\nQ: Why 

In [65]:
import os

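# Define the base directory for saving the topic files
base_dir = "/Users/acrobat/Documents/GitHub/extract_html/data_cleaned_faq_by_topic/topics"
# Create the directory if it doesn't exist, so the open() calls below don't fail
os.makedirs(base_dir, exist_ok=True)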

# Split the response into topics and QA pairs
topics = message.content[0].text.split("<topic>")[1:]

for topic in topics:
    topic_name, qa_pairs = topic.split("</topic>")
    topic_name = topic_name.strip().replace(" ", "_")
    
    # Create the file path for the topic
    file_path = os.path.join(base_dir, f"{topic_name}.txt")
    
    # Remove the <qa> and </qa> tags from the QA pairs
    qa_pairs_cleaned = qa_pairs.replace("<qa>\n", "").replace("</qa>", "")
    
    # Write the cleaned QA pairs to the topic file
    with open(file_path, "a") as file:
        file.write(qa_pairs_cleaned.strip() + "\n\n")  # Add a newline for spacing between entries

print("QA pairs have been written to topic files.")


QA pairs have been written to topic files.
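
Because the file name comes straight from the topic label in the response, a label such as "Topic 2: Tooth Decay and Cavities" would create a separate file from "Tooth Decay and Cavities". One possible guard is to normalize each label against the fixed topic list before building the file path. This is a sketch only: `normalize_topic` is a hypothetical helper, and returning `None` for unknown labels (to flag them for manual review) is a design assumption.

```python
import re

KNOWN_TOPICS = [
    "Preventive Care and Check-ups",
    "Tooth Decay and Cavities",
    "Dental Procedures and Treatments",
    "Oral Hygiene Practices",
    "Diet and Nutrition's Impact",
    "Infant Dental Care",
    "Early Childhood Oral Health",
    "Orthodontics and Teeth Alignment",
    "Pediatric Dentistry",
    "Dental Emergencies",
    "Dental Anxiety and Comfort",
    "Special Needs Dentistry",
    "Oral Health Education",
    "Cultural and Historical Perspectives",
    "Parental Guidance and Involvement",
    "Dental Health Milestones",
    "Symptoms and Diagnosis",
    "Dental Innovations and Research",
    "Public Health and Dental Policies",
    "Miscellaneous",
]


def normalize_topic(raw_name):
    """Return the canonical topic name, or None if the label is not recognized."""
    name = raw_name.strip()
    match = re.match(r"Topic\s+(\d+)\s*:?\s*(.*)", name, re.IGNORECASE)
    if match:
        number, rest = int(match.group(1)), match.group(2).strip()
        if rest:
            name = rest                      # "Topic 2: Tooth Decay and Cavities"
        elif 1 <= number <= len(KNOWN_TOPICS):
            name = KNOWN_TOPICS[number - 1]  # bare "Topic 10"
    lookup = {topic.lower(): topic for topic in KNOWN_TOPICS}
    return lookup.get(name.lower())


print(normalize_topic("Topic 2: Tooth Decay and Cavities"))
print(normalize_topic("Topic 10"))
print(normalize_topic("Gum Disease"))
```

Labels that come back as `None` can be printed and reviewed by hand instead of silently creating a new topic file.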


### Summary: 
- Updates to the Claude prompt were required. For example, after the 7th txt file it started to produce "Topic 1", "Topic 2", etc. instead of just returning the topic name, so I added: "Do not return Topic 1, Topic 2, etc., as the topic names. Instead, provide the actual names of the topics that you have identified." There were also issues with newlines and spaces that took a few tries to fix, hence the printouts at each step.
- Another issue was reading the DOCUMENT. The Claude SDK sends the prompt string as-is and does not fill in `{{DOCUMENT}}`-style variables, so the file read was moved into the same cell as the API call: `with open('data_cleaned_faq_by_topic/raw_split_docs/qa_part_9.txt', 'r') as file:` followed by `DOCUMENT = file.read()`, with the prompt built as an f-string.
- Another improvement to this notebook would be to write a .py file and feed each of the raw txt files one by one. I did not spend time on this because I wanted to review each block before I moved it into its category txt file.

- I did document 10; I need to continue after 10 with 11.
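
The batch idea from the notes above could look roughly like the following sketch. It only covers picking up the remaining `qa_part_N.txt` files in numeric order, starting from a given part. The function names, the `categorize` callable (a stand-in for the `client.messages.create` call plus parsing), and the temporary files are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def part_number(path):
    """Extract N from a file named qa_part_N.txt, or -1 if the name does not match."""
    match = re.search(r"qa_part_(\d+)\.txt$", path.name)
    return int(match.group(1)) if match else -1


def pending_parts(split_dir, start_at=1):
    """Return qa_part_N.txt files with N >= start_at, sorted numerically."""
    files = [p for p in Path(split_dir).glob("qa_part_*.txt") if part_number(p) >= start_at]
    return sorted(files, key=part_number)


def run_batch(split_dir, categorize, start_at=1):
    """Call categorize(text) for each pending part and collect the results by file name."""
    results = {}
    for path in pending_parts(split_dir, start_at):
        results[path.name] = categorize(path.read_text(encoding="utf-8"))
    return results


# Demo with throwaway files and a stub in place of the API call
with tempfile.TemporaryDirectory() as tmp:
    for n in (2, 10, 11, 12):
        Path(tmp, f"qa_part_{n}.txt").write_text(
            f"Question: {n}?\nAnswer: {n}.\n", encoding="utf-8"
        )
    results = run_batch(tmp, categorize=lambda text: len(text), start_at=11)

print(list(results))  # prints: ['qa_part_11.txt', 'qa_part_12.txt']
```

Sorting by the extracted number avoids the alphabetical order in which `qa_part_10.txt` would come before `qa_part_2.txt`, and `start_at=11` resumes after the last reviewed document. Keeping the review step would still mean inspecting each result before appending it to a topic file.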