# Healthcare Chat bot

- Get the MedQuAD dataset from the GitHub repository: https://github.com/abachaa/MedQuAD

- Extract question-answer pairs: Parse the XML files in the dataset to extract the question-answer pairs. Each pair should contain the question text and its corresponding answer.

- Clean and normalize the text: Remove any special characters, HTML tags, or irrelevant formatting.

- Convert all text to lowercase for consistency. Remove extra whitespace.

- Tokenize the text: Split the questions and answers into individual words or subwords, depending on your model's requirements.
    - Use specialized biomedical tokenizers: BioBERT or other biomedical-specific models are recommended for tokenization of medical text, as they are pre-trained on large biomedical corpora and can better handle domain-specific terminology
    - Handle biomedical abbreviations and terminology: Develop a custom dictionary or use medical NLP tools to expand abbreviations and standardize medical terminology . Pay special attention to ambiguous abbreviations and punctuation commonly used in medical text (e.g. "m.g.", "d/c", "–>")
    - Implement domain-specific preprocessing: Clean and normalize the text by removing special characters, standardizing formats, and handling medical-specific patterns .
    - Use advanced NLP techniques: Employ Named Entity Recognition (NER) to identify and classify medical entities like conditions, treatments, drugs, and symptoms .
    - Utilize Part-of-Speech (POS) tagging to better understand the syntactic structure of clinical text . Consider sentence-level tokenization: Use specialized clinical sentence tokenizers like clinitokenizer, which are designed to handle the unique challenges of medical text .
    - Implement word-level tokenization: Use techniques like WordPiece tokenization, which is employed by BERT-based models and can handle out-of-vocabulary words common in medical text .
    - Handle punctuation carefully: Pay special attention to punctuation, as it can be ambiguous in medical text (e.g., periods in abbreviations vs. sentence endings) .
    - Address spelling and typographical errors: Implement spelling correction mechanisms, as medical texts often contain misspellings and typos .
    - Use machine learning approaches: Consider using supervised or unsupervised machine learning techniques to improve tokenization accuracy, especially for handling domain-specific challenges
    - Evaluate and refine: Test your tokenization approach on a subset of the MedQuAD dataset and refine as needed. Consider using multiple tokenization strategies and comparing their performance on your specific tasks .

- Create training splits: Divide the dataset into training, validation, and test sets (e.g., 80% training, 10% validation, 10% test).
- Format for your model: Prepare the data in the format required by your chosen model architecture (e.g., input-output pairs for sequence-to-sequence models).
- Encode the data: Convert the text to numerical representations (e.g., using word embeddings or subword tokenization). Create attention masks and other model-specific inputs if needed.
- Save the preprocessed data: Store the processed data in a format that's easy to load for training (e.g., JSON, CSV, or a custom binary format).

Remember to handle any dataset-specific annotations or metadata that may be relevant to your chatbot's performance. Additionally, consider using the provided UMLS Concept Unique Identifiers (CUIs) and Semantic Types if they can enhance your model's understanding of medical concepts.



In [2]:
import requests
!pip install -q -U lxml
import pandas as pd
import xml.etree.ElementTree as ET
import os

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.0/5.0 MB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
headers = {
    'Authorization':'token '
}
"""
Download xml files and process them
"""

list_with_ans = ['https://api.github.com/repos/komus/MedQuAD/contents/1_CancerGov_QA?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/2_GARD_QA?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/3_GHR_QA?ref=master',
                'https://api.github.com/repos/komus/MedQuAD/contents/4_MPlus_Health_Topics_QA?ref=master',
                'https://api.github.com/repos/komus/MedQuAD/contents/5_NIDDK_QA?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/6_NINDS_QA?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/7_SeniorHealth_QA?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/8_NHLBI_QA_XML?ref=master',
                 'https://api.github.com/repos/komus/MedQuAD/contents/9_CDC_QA?ref=master',

                 ]

def download_process_xml(url):
  #print(url)
  try:
    resp = requests.get(url, headers)
    resp.raise_for_status()

    xml_content = resp.content
    root = ET.fromstring(xml_content)
    return parse_xml_key_pair(root)
  except Exception as e:
    print(f"Error {e}")
    return None

def parse_xml_key_pair(root):
  df = pd.DataFrame(columns=['focus','synonyms', 'semanticgroup', 'question', 'answer'])

  synonyms = [sy.text.strip() for sy in root.findall(".//Synonyms/Synonym")]
  focus = [sy.text.strip() for sy in root.findall(".//Focus")]
  #print(synonyms)
  semanticgroup = [se.text.strip() for se in root.findall(".//UMLS/SemanticGroup")]
  #print(semanticgroup)
  for qapair in root.findall(".//QAPair"):
    question = qapair.find("Question").text.strip() if qapair.find("Question") is not None else ""
    answer = qapair.find("Answer").text.strip() if qapair.find("Answer") is not None else ""

    temp_df = pd.DataFrame({
        'question': question,
        'answer': answer,
        'semanticgroup': ', '.join(semanticgroup),
        'synonyms': [synonyms],
        'focus': ', '.join(focus)
    })
    df = pd.concat([df, temp_df], ignore_index=True)
  return df


def parse_xml_to_dict(root):
  data = {}
  for elem in root.iter():
    if elem.text:
      data[elem.tag] = elem.text.strip()
    else:
      data[elem.tag] = None
  return data

output_path = "output_medplus.jsonl"

"""
Using the url, loop through the content of the repo and get the xml files
"""
def process_github_xml_files(url):
  df = pd.DataFrame()
  resp = requests.get(url, headers=headers)
  contents = resp.json()
  #print(contents)
  if contents:
    for item in contents:
      if item['type']:
        if item['type'] == 'file' and item['name'].endswith('.xml'):
          xml_data = download_process_xml(item['download_url'])
          if xml_data is not None:
          #print(xml_data)
            with open(output_path, "w") as f:
              f.write(xml_data.to_json(orient='records', lines=True, force_ascii=False))
            xml_data.to_csv('output_file1.csv', mode='a', header=not os.path.exists('output_file1.csv'), index=False)


In [5]:
for d in list_with_ans:
  process_github_xml_files(d)

Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' object has no attribute 'strip'
Error 'NoneType' obj