# CHILD Knowledge Management
## Seattle Children's Hospital
### July 2025
#### Jerome Massot (jeromemassot@google.com)

## 00- Setup and Import Modules

requirements.txt

- html-to-markdown
- beautifulsoup4
- google-cloud-discoveryengine
- google-cloud-storage
- google-genai

In [None]:
%%writefile requirements.txt
html-to-markdown
beautifulsoup4
google-cloud-discoveryengine
google-cloud-storage
google-genai

Overwriting requirements.txt


In [None]:
! pip install -q -r "/content/requirements.txt"


In [1]:
from google.cloud import discoveryengine_v1beta
from google.cloud import storage
from google import genai

In [2]:
from html_to_markdown import convert_to_markdown
from bs4 import XMLParsedAsHTMLWarning
from bs4 import BeautifulSoup
import requests

In [3]:
from pydantic import BaseModel, Field
from collections import defaultdict
from typing import List

In [4]:
import warnings
import pprint
import tqdm
import json
import uuid
import re

In [5]:
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

## 01- Patient Education Website

The patient education pages are essentially collections of PDF documents and links to internal and external websites. Beyond these links, the pages carry almost no informational content.

The extraction consists of retrieving the PDF documents and the URLs pointing to internal webpages, external webpages, and video content.

### 01-01- Pages Extraction

We start by parsing the patient education landing page to extract all references to the condition pages.

An atomic page data contains:
- a title (str)
- a url (str)
- a collection of PDFs (list of URIs using the pattern gs://...)
- a collection of videos (YouTube)
- a collection of internal links
- a collection of external links
- a collection of links that could not be retrieved
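The record described above can be sketched as a typed dictionary with a small factory. This is a minimal illustration only: `PageRecord` and `new_page_record` are hypothetical names that do not appear in the notebook, and the field names mirror the keys used in the cells that follow.

```python
from typing import List, TypedDict


class PageRecord(TypedDict):
    """Hypothetical typed view of one entry of the pages dictionary."""
    title: str
    url: str
    pdfs: List[dict]
    videos: List[str]
    internal_links: List[str]
    external_links: List[str]
    failed_links: List[str]


def new_page_record(title: str, url: str) -> PageRecord:
    """Return an empty record; each call creates fresh, unshared lists."""
    return {
        "title": title,
        "url": url,
        "pdfs": [],
        "videos": [],
        "internal_links": [],
        "external_links": [],
        "failed_links": [],
    }


record = new_page_record(
    "Vascular Access",
    "https://www.seattlechildrens.org/clinics/vascular-access-service/",
)
print(sorted(record))
```

A factory of this kind would avoid repeating the same dictionary literal for every page, at the cost of one extra helper.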

In [None]:
url = "https://www.seattlechildrens.org/patients-families/patient-education/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
# a plain dict is enough: every entry is created explicitly below
pages = {}
for link in soup.find_all("a"):
    href = link.get("href")
    # skip anchors without an href attribute
    if href and href.startswith("/clinics/"):
        # take care of the website naming inconsistencies
        section = href.split("/")[-2]
        is_resources = "resources" in section or "education" in section
        if is_resources:
            pages[link.text] = {
                "title": link.text,
                "url": "https://www.seattlechildrens.org" + link.get("href"),
                "pdfs": [],
                "videos": [],
                "internal_links": [],
                "external_links": [],
                "failed_links": []
            }

# manually add the Vascular Access page, which does not follow the pattern of the other pages
pages["Vascular Access"] = {
    "title": "Vascular Access",
    "url": "https://www.seattlechildrens.org/clinics/vascular-access-service/",
    "pdfs": [],
    "videos": [],
    "internal_links": [],
    "external_links": [],
    "failed_links": []
}

The dictionary holds one record per condition page, including its URL; the keys are the condition titles.
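The selection rule applied to the landing page links can be isolated as a small predicate, which makes it easy to see why the Vascular Access page has to be added by hand. This is a sketch under assumptions: `is_resource_page` is a hypothetical helper name, and it mirrors the test on the second-to-last path segment used in the loop.

```python
from typing import Optional


def is_resource_page(href: Optional[str]) -> bool:
    """Return True when href points to a clinic resources or education page."""
    if not href or not href.startswith("/clinics/"):
        return False
    # for an href ending with "/", the second-to-last item is the last path segment
    section = href.split("/")[-2]
    return "resources" in section or "education" in section


# matches the naming pattern of most condition pages
print(is_resource_page("/clinics/airway-esophageal-center/patient-and-family-resources/"))
# does not match the pattern, hence the manual entry
print(is_resource_page("/clinics/vascular-access-service/"))
```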

In [None]:
print(f"Number of pages found: {len(pages)}...")
for condition_name, page in pages.items():
    print(condition_name, page)
    break

Number of pages found: 72...
Airway and Esophageal Resources {'title': 'Airway and Esophageal Resources', 'url': 'https://www.seattlechildrens.org/clinics/airway-esophageal-center/patient-and-family-resources/', 'pdfs': [], 'videos': [], 'internal_links': [], 'external_links': [], 'failed_links': []}


### 01-02- Embedded PDFs, Videos, and Hyperlinks extraction

- A PDF object data contains:
  - a title (str)
  - a url (from the SCH website)
  - a gs location (str using the gs:// pattern)
  - a language (language used in the PDF)

- A Video data contains:
  - a url (YouTube format)

- A Hyperlink data contains:
  - a url
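As an illustration of the PDF record, the sketch below builds the gs location from the bucket name, the condition name, and the PDF URL, using the same normalization the notebook applies when uploading (lower case, spaces replaced by underscores, non-ASCII characters dropped). The helper names `to_ascii_slug`, `build_gs_location`, and `make_pdf_record` are hypothetical, and the example URL is a placeholder rather than a real document.

```python
def to_ascii_slug(text: str) -> str:
    """Replace spaces with underscores and drop non-ASCII characters."""
    return text.replace(" ", "_").encode("ascii", "ignore").decode("ascii")


def build_gs_location(bucket_name: str, condition_name: str, pdf_url: str) -> str:
    """Return the gs:// URI under which the PDF would be stored."""
    folder_name = to_ascii_slug(condition_name.lower())
    file_name = to_ascii_slug(pdf_url.split("/")[-1].lower())
    return f"gs://{bucket_name}/{folder_name}/{file_name}"


def make_pdf_record(
    title: str, url: str, bucket_name: str, condition_name: str, language: str = "en"
) -> dict:
    """Assemble one PDF record with the four fields listed above."""
    return {
        "title": title,
        "url": url,
        "gs_location": build_gs_location(bucket_name, condition_name, url),
        "language": language,
    }


# placeholder URL, used only to show the shape of the result
example = make_pdf_record(
    "Sample title",
    "https://example.org/docs/Sample File.pdf",
    "sch_patient_education_pdfs",
    "Heart Resources",
)
print(example["gs_location"])
```

Deriving the gs location and the upload path from one function keeps the recorded URI and the actual object name in sync.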

In [None]:
PDFS_BUCKET_NAME = "sch_patient_education_pdfs"

In [None]:
def extract_pdfs_videos_hyperlinks(
    condition_name: str, pages: dict, bucket_name: str, debug: bool=False
) -> None:
    """
    Extract all the PDF files, videos urls, and hyperlinks from a webpage.
    :param condition_name: name of the condition
    :param pages: dictionary containing pages information
    :param bucket_name: name of the bucket where the PDFs are stored
    :param debug: debug flag
    :return: None
    """

    # retrieve the condition title and page url
    title = pages[condition_name]["title"]
    url = pages[condition_name]["url"]

    # create the bucket if it does not exist
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    if not bucket.exists():
        bucket.create()

    # extracting the knowledge embedded as links
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # pdfs list to be added to the page information
    pdfs = []
    mapping_for_multilingual_pdfs = {}

    # hyperlinks and video lists to the page
    internal_links = []
    external_links = []
    failed_links = []
    videos = []

    for link in tqdm.tqdm(soup.find_all("a")):

        # skip anchors without an href, then make relative urls absolute
        href = link.get("href")
        if not href:
            continue
        if debug:
            print(f"Working on {href}...")
        if href.startswith("http"):
            url = href
        else:
            url = "https://www.seattlechildrens.org" + href

        # check if the link points to a PDF file
        if url[-3:].lower() == "pdf":

            # get the pdf name
            file_name = url.split("/")[-1].lower()

            # get the pdf language
            language = link.get("lang", "en")

            # English PDFs define the reference title; PDFs written in another
            # language reuse the title of the English PDF whose name they contain
            if language == "en":
                mapping_for_multilingual_pdfs[file_name[:-4]] = link.text
                title = link.text
            else:
                # fall back to the link text when no English reference matches
                title = link.text
                for reference_pdf, reference_title in mapping_for_multilingual_pdfs.items():
                    if reference_pdf in file_name:
                        title = reference_title
                        break

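            # append the current pdf to the list of page's pdfs; the gs location
            # uses the same sanitized folder and file names as the upload below
            gs_folder = condition_name.lower().replace(" ", "_").encode("ascii", "ignore").decode("ascii")
            gs_file = file_name.replace(" ", "_").encode("ascii", "ignore").decode("ascii")
            pdfs.append(
                {
                    "title": title,
                    "url": url,
                    "gs_location": f"gs://{bucket_name}/{gs_folder}/{gs_file}",
                    "language": language
                }
            )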

            # download the pdf in the gs bucket
            if debug:
                print(f"Downloading {url}...")

            try:
                response = requests.get(url, timeout=10)
                if response.status_code != 200:
                    failed_links.append(url)
                else:
                    pdf_content = response.content
                    folder_name = condition_name.lower().replace(" ", "_").encode("ascii", "ignore").decode("ascii")
                    file_name = file_name.replace(" ", "_").encode("ascii", "ignore").decode("ascii")
                    blob = bucket.blob(f"{folder_name}/{file_name}")
                    blob.upload_from_string(pdf_content, content_type="application/pdf")
            except Exception:
                # any download or upload error marks the link as failed
                failed_links.append(url)
        else:
            # a YouTube url is recorded as a video; note that, as written, it is
            # also recorded as an external link by the test that follows
            if url.startswith("https://www.youtube.com"):
                videos.append(url)
            if url.startswith("https://www.seattlechildrens.org"):
                internal_links.append(url)
            else:
                external_links.append(url)

    # update the pages dictionary
    pages[condition_name]["pdfs"] = pdfs
    pages[condition_name]["internal_links"] = internal_links
    pages[condition_name]["external_links"] = external_links
    pages[condition_name]["failed_links"] = failed_links

    # videos list to be added to the page information
    for link in soup.find_all("a", class_="video-link"):
        videos.append(link.get("href"))
    pages[condition_name]["videos"] = videos

    print(f"Extracted {len(pdfs)} PDFs for condition {condition_name}...")
    print(f"Extracted {len(videos)} videos for condition {condition_name}...")
    print(f"Extracted {len(internal_links)} internal links for condition {condition_name}...")
    print(f"Extracted {len(external_links)} external links for condition {condition_name}...")
    print(f"Failed for {len(failed_links)} links for condition {condition_name}...")
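The routing of each link in the function above can be summarized as a pure function that is easy to test without any network access. This is a minimal sketch, not the notebook's implementation: `classify_link` and `BASE_URL` are hypothetical names, the example URLs are placeholders, and, unlike the loop above (which also appends YouTube URLs to the external links), it assigns exactly one category per link.

```python
from typing import Tuple

BASE_URL = "https://www.seattlechildrens.org"


def classify_link(href: str) -> Tuple[str, str]:
    """Return (category, absolute_url) for one href found on a page."""
    # relative hrefs are resolved against the hospital website
    url = href if href.startswith("http") else BASE_URL + href
    if url[-3:].lower() == "pdf":
        return "pdf", url
    if url.startswith("https://www.youtube.com"):
        return "video", url
    if url.startswith(BASE_URL):
        return "internal", url
    return "external", url


for sample in ["/pdf/sample.pdf", "/clinics/sample/", "https://example.org/page"]:
    print(classify_link(sample))
```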

In [None]:
extract_pdfs_videos_hyperlinks("Craniofacial Resources", pages, PDFS_BUCKET_NAME, False)

100%|██████████| 224/224 [00:19<00:00, 11.33it/s]

Extracted 32 PDFs for condition Craniofacial Resources...
Extracted 2 videos for condition Craniofacial Resources...
Extracted 135 internal links for condition Craniofacial Resources...
Extracted 56 external links for condition Craniofacial Resources...
Failed for 0 links for condition Craniofacial Resources...





In [None]:
for condition_name in pages.keys():
    print(f"Working on {condition_name}...")
    extract_pdfs_videos_hyperlinks(condition_name, pages, PDFS_BUCKET_NAME)

Working on Airway and Esophageal Resources...


100%|██████████| 169/169 [00:14<00:00, 11.85it/s]


Extracted 51 PDFs for condition Airway and Esophageal Resources...
Extracted 1 videos for condition Airway and Esophageal Resources...
Extracted 100 internal links for condition Airway and Esophageal Resources...
Extracted 18 external links for condition Airway and Esophageal Resources...
Failed for 0 links for condition Airway and Esophageal Resources...
Working on Anesthesiology Resources...


100%|██████████| 161/161 [00:28<00:00,  5.56it/s]


Extracted 28 PDFs for condition Anesthesiology Resources...
Extracted 3 videos for condition Anesthesiology Resources...
Extracted 108 internal links for condition Anesthesiology Resources...
Extracted 25 external links for condition Anesthesiology Resources...
Failed for 0 links for condition Anesthesiology Resources...
Working on Apheresis Resources...


100%|██████████| 123/123 [00:47<00:00,  2.57it/s]


Extracted 17 PDFs for condition Apheresis Resources...
Extracted 1 videos for condition Apheresis Resources...
Extracted 91 internal links for condition Apheresis Resources...
Extracted 15 external links for condition Apheresis Resources...
Failed for 1 links for condition Apheresis Resources...
Working on Audiology Resources...


100%|██████████| 219/219 [01:38<00:00,  2.21it/s]


Extracted 69 PDFs for condition Audiology Resources...
Extracted 7 videos for condition Audiology Resources...
Extracted 100 internal links for condition Audiology Resources...
Extracted 50 external links for condition Audiology Resources...
Failed for 1 links for condition Audiology Resources...
Working on Autism Resources...


100%|██████████| 735/735 [04:53<00:00,  2.50it/s]


Extracted 264 PDFs for condition Autism Resources...
Extracted 54 videos for condition Autism Resources...
Extracted 129 internal links for condition Autism Resources...
Extracted 342 external links for condition Autism Resources...
Failed for 11 links for condition Autism Resources...
Working on Biofeedback Resources...


100%|██████████| 199/199 [01:50<00:00,  1.80it/s]


Extracted 88 PDFs for condition Biofeedback Resources...
Extracted 1 videos for condition Biofeedback Resources...
Extracted 95 internal links for condition Biofeedback Resources...
Extracted 16 external links for condition Biofeedback Resources...
Failed for 0 links for condition Biofeedback Resources...
Working on Brachial Plexus Resources...


100%|██████████| 127/127 [00:02<00:00, 48.87it/s]


Extracted 10 PDFs for condition Brachial Plexus Resources...
Extracted 1 videos for condition Brachial Plexus Resources...
Extracted 99 internal links for condition Brachial Plexus Resources...
Extracted 18 external links for condition Brachial Plexus Resources...
Failed for 0 links for condition Brachial Plexus Resources...
Working on Cancer and Blood Disorders Resources...


100%|██████████| 257/257 [00:35<00:00,  7.31it/s]


Extracted 52 PDFs for condition Cancer and Blood Disorders Resources...
Extracted 8 videos for condition Cancer and Blood Disorders Resources...
Extracted 132 internal links for condition Cancer and Blood Disorders Resources...
Extracted 73 external links for condition Cancer and Blood Disorders Resources...
Failed for 0 links for condition Cancer and Blood Disorders Resources...
Working on Child Life Resources...


100%|██████████| 119/119 [00:02<00:00, 54.17it/s]


Extracted 9 PDFs for condition Child Life Resources...
Extracted 7 videos for condition Child Life Resources...
Extracted 89 internal links for condition Child Life Resources...
Extracted 21 external links for condition Child Life Resources...
Failed for 0 links for condition Child Life Resources...
Working on Child Wellness Resources...


100%|██████████| 144/144 [00:03<00:00, 44.73it/s]


Extracted 14 PDFs for condition Child Wellness Resources...
Extracted 1 videos for condition Child Wellness Resources...
Extracted 102 internal links for condition Child Wellness Resources...
Extracted 28 external links for condition Child Wellness Resources...
Failed for 0 links for condition Child Wellness Resources...
Working on Childhood Communication Resources...


100%|██████████| 212/212 [00:30<00:00,  6.86it/s]


Extracted 16 PDFs for condition Childhood Communication Resources...
Extracted 3 videos for condition Childhood Communication Resources...
Extracted 100 internal links for condition Childhood Communication Resources...
Extracted 96 external links for condition Childhood Communication Resources...
Failed for 0 links for condition Childhood Communication Resources...
Working on Colorectal Resources...


100%|██████████| 194/194 [00:49<00:00,  3.89it/s]


Extracted 46 PDFs for condition Colorectal Resources...
Extracted 2 videos for condition Colorectal Resources...
Extracted 113 internal links for condition Colorectal Resources...
Extracted 35 external links for condition Colorectal Resources...
Failed for 0 links for condition Colorectal Resources...
Working on Cranial Base Resources...


100%|██████████| 162/162 [00:00<00:00, 547083.13it/s]


Extracted 0 PDFs for condition Cranial Base Resources...
Extracted 1 videos for condition Cranial Base Resources...
Extracted 125 internal links for condition Cranial Base Resources...
Extracted 37 external links for condition Cranial Base Resources...
Failed for 0 links for condition Cranial Base Resources...
Working on Craniofacial Resources...


100%|██████████| 224/224 [00:12<00:00, 17.52it/s]


Extracted 32 PDFs for condition Craniofacial Resources...
Extracted 2 videos for condition Craniofacial Resources...
Extracted 135 internal links for condition Craniofacial Resources...
Extracted 56 external links for condition Craniofacial Resources...
Failed for 0 links for condition Craniofacial Resources...
Working on Critical Care Medicine Resources...


100%|██████████| 130/130 [00:00<00:00, 490341.29it/s]


Extracted 0 PDFs for condition Critical Care Medicine Resources...
Extracted 1 videos for condition Critical Care Medicine Resources...
Extracted 114 internal links for condition Critical Care Medicine Resources...
Extracted 16 external links for condition Critical Care Medicine Resources...
Failed for 0 links for condition Critical Care Medicine Resources...
Working on Cystic Fibrosis Resources...


100%|██████████| 202/202 [01:58<00:00,  1.70it/s]


Extracted 72 PDFs for condition Cystic Fibrosis Resources...
Extracted 1 videos for condition Cystic Fibrosis Resources...
Extracted 105 internal links for condition Cystic Fibrosis Resources...
Extracted 25 external links for condition Cystic Fibrosis Resources...
Failed for 0 links for condition Cystic Fibrosis Resources...
Working on Dental Resources...


100%|██████████| 144/144 [00:06<00:00, 23.44it/s]


Extracted 13 PDFs for condition Dental Resources...
Extracted 1 videos for condition Dental Resources...
Extracted 114 internal links for condition Dental Resources...
Extracted 17 external links for condition Dental Resources...
Failed for 0 links for condition Dental Resources...
Working on Dermatology Resources...


100%|██████████| 257/257 [02:25<00:00,  1.76it/s]


Extracted 128 PDFs for condition Dermatology Resources...
Extracted 1 videos for condition Dermatology Resources...
Extracted 95 internal links for condition Dermatology Resources...
Extracted 34 external links for condition Dermatology Resources...
Failed for 0 links for condition Dermatology Resources...
Working on Diabetes Resources...


100%|██████████| 441/441 [04:41<00:00,  1.57it/s]


Extracted 276 PDFs for condition Diabetes Resources...
Extracted 2 videos for condition Diabetes Resources...
Extracted 120 internal links for condition Diabetes Resources...
Extracted 45 external links for condition Diabetes Resources...
Failed for 0 links for condition Diabetes Resources...
Working on Dialysis Resources...


100%|██████████| 249/249 [01:50<00:00,  2.26it/s]


Extracted 104 PDFs for condition Dialysis Resources...
Extracted 9 videos for condition Dialysis Resources...
Extracted 94 internal links for condition Dialysis Resources...
Extracted 51 external links for condition Dialysis Resources...
Failed for 0 links for condition Dialysis Resources...
Working on Differences in Sex Development Resources...


100%|██████████| 225/225 [01:20<00:00,  2.80it/s]


Extracted 63 PDFs for condition Differences in Sex Development Resources...
Extracted 1 videos for condition Differences in Sex Development Resources...
Extracted 113 internal links for condition Differences in Sex Development Resources...
Extracted 49 external links for condition Differences in Sex Development Resources...
Failed for 0 links for condition Differences in Sex Development Resources...
Working on Eating Disorders Recovery Resources...


100%|██████████| 192/192 [02:06<00:00,  1.51it/s]


Extracted 73 PDFs for condition Eating Disorders Recovery Resources...
Extracted 1 videos for condition Eating Disorders Recovery Resources...
Extracted 101 internal links for condition Eating Disorders Recovery Resources...
Extracted 18 external links for condition Eating Disorders Recovery Resources...
Failed for 0 links for condition Eating Disorders Recovery Resources...
Working on Endocrinology Resources...


100%|██████████| 204/204 [01:07<00:00,  3.03it/s]


Extracted 47 PDFs for condition Endocrinology Resources...
Extracted 3 videos for condition Endocrinology Resources...
Extracted 120 internal links for condition Endocrinology Resources...
Extracted 37 external links for condition Endocrinology Resources...
Failed for 1 links for condition Endocrinology Resources...
Working on Fetal Care and Treatment Resources...


100%|██████████| 211/211 [00:47<00:00,  4.48it/s]


Extracted 36 PDFs for condition Fetal Care and Treatment Resources...
Extracted 1 videos for condition Fetal Care and Treatment Resources...
Extracted 143 internal links for condition Fetal Care and Treatment Resources...
Extracted 32 external links for condition Fetal Care and Treatment Resources...
Failed for 2 links for condition Fetal Care and Treatment Resources...
Working on Gastroenterology and Hepatology Resources...


100%|██████████| 302/302 [03:49<00:00,  1.31it/s]


Extracted 146 PDFs for condition Gastroenterology and Hepatology Resources...
Extracted 5 videos for condition Gastroenterology and Hepatology Resources...
Extracted 102 internal links for condition Gastroenterology and Hepatology Resources...
Extracted 46 external links for condition Gastroenterology and Hepatology Resources...
Failed for 3 links for condition Gastroenterology and Hepatology Resources...
Working on General and Thoracic Surgery Resources...


100%|██████████| 166/166 [00:03<00:00, 45.92it/s]


Extracted 8 PDFs for condition General and Thoracic Surgery Resources...
Extracted 5 videos for condition General and Thoracic Surgery Resources...
Extracted 122 internal links for condition General and Thoracic Surgery Resources...
Extracted 36 external links for condition General and Thoracic Surgery Resources...
Failed for 0 links for condition General and Thoracic Surgery Resources...
Working on Genetics Resources...


100%|██████████| 173/173 [00:03<00:00, 47.34it/s]


Extracted 8 PDFs for condition Genetics Resources...
Extracted 1 videos for condition Genetics Resources...
Extracted 103 internal links for condition Genetics Resources...
Extracted 62 external links for condition Genetics Resources...
Failed for 0 links for condition Genetics Resources...
Working on Heart Resources...


100%|██████████| 207/207 [00:27<00:00,  7.64it/s]


Extracted 39 PDFs for condition Heart Resources...
Extracted 2 videos for condition Heart Resources...
Extracted 124 internal links for condition Heart Resources...
Extracted 43 external links for condition Heart Resources...
Failed for 0 links for condition Heart Resources...
Working on Home Care Resources...


100%|██████████| 303/303 [02:57<00:00,  1.71it/s]


Extracted 154 PDFs for condition Home Care Resources...
Extracted 21 videos for condition Home Care Resources...
Extracted 90 internal links for condition Home Care Resources...
Extracted 59 external links for condition Home Care Resources...
Failed for 2 links for condition Home Care Resources...
Working on Immunology Resources...


100%|██████████| 119/119 [00:01<00:00, 80.28it/s]


Extracted 3 PDFs for condition Immunology Resources...
Extracted 1 videos for condition Immunology Resources...
Extracted 90 internal links for condition Immunology Resources...
Extracted 26 external links for condition Immunology Resources...
Failed for 0 links for condition Immunology Resources...
Working on Infectious Diseases and Virology Resources...


100%|██████████| 172/172 [00:54<00:00,  3.18it/s]


Extracted 39 PDFs for condition Infectious Diseases and Virology Resources...
Extracted 1 videos for condition Infectious Diseases and Virology Resources...
Extracted 111 internal links for condition Infectious Diseases and Virology Resources...
Extracted 22 external links for condition Infectious Diseases and Virology Resources...
Failed for 0 links for condition Infectious Diseases and Virology Resources...
Working on Inpatient Rehabilitation Resources...


100%|██████████| 144/144 [00:04<00:00, 29.06it/s]


Extracted 4 PDFs for condition Inpatient Rehabilitation Resources...
Extracted 1 videos for condition Inpatient Rehabilitation Resources...
Extracted 108 internal links for condition Inpatient Rehabilitation Resources...
Extracted 32 external links for condition Inpatient Rehabilitation Resources...
Failed for 0 links for condition Inpatient Rehabilitation Resources...
Working on Interventional Radiology Resources...


100%|██████████| 129/129 [00:14<00:00,  9.12it/s]


Extracted 11 PDFs for condition Interventional Radiology Resources...
Extracted 1 videos for condition Interventional Radiology Resources...
Extracted 101 internal links for condition Interventional Radiology Resources...
Extracted 17 external links for condition Interventional Radiology Resources...
Failed for 0 links for condition Interventional Radiology Resources...
Working on Laboratory Medicine and Pathology Resources...


100%|██████████| 143/143 [00:16<00:00,  8.74it/s]


Extracted 28 PDFs for condition Laboratory Medicine and Pathology Resources...
Extracted 2 videos for condition Laboratory Medicine and Pathology Resources...
Extracted 97 internal links for condition Laboratory Medicine and Pathology Resources...
Extracted 18 external links for condition Laboratory Medicine and Pathology Resources...
Failed for 0 links for condition Laboratory Medicine and Pathology Resources...
Working on Mitochondrial Medicine and Metabolism Resources...


100%|██████████| 142/142 [00:08<00:00, 17.43it/s]


Extracted 14 PDFs for condition Mitochondrial Medicine and Metabolism Resources...
Extracted 1 videos for condition Mitochondrial Medicine and Metabolism Resources...
Extracted 93 internal links for condition Mitochondrial Medicine and Metabolism Resources...
Extracted 35 external links for condition Mitochondrial Medicine and Metabolism Resources...
Failed for 0 links for condition Mitochondrial Medicine and Metabolism Resources...
Working on Neonatology Resources...


100%|██████████| 143/143 [00:00<00:00, 170.48it/s]


Extracted 2 PDFs for condition Neonatology Resources...
Extracted 1 videos for condition Neonatology Resources...
Extracted 115 internal links for condition Neonatology Resources...
Extracted 26 external links for condition Neonatology Resources...
Failed for 0 links for condition Neonatology Resources...
Working on Nephrology Resources...


100%|██████████| 263/263 [01:50<00:00,  2.37it/s]


Extracted 105 PDFs for condition Nephrology Resources...
Extracted 1 videos for condition Nephrology Resources...
Extracted 112 internal links for condition Nephrology Resources...
Extracted 46 external links for condition Nephrology Resources...
Failed for 1 links for condition Nephrology Resources...
Working on Neurodevelopmental Resources...


100%|██████████| 249/249 [01:22<00:00,  3.03it/s]


Extracted 61 PDFs for condition Neurodevelopmental Resources...
Extracted 1 videos for condition Neurodevelopmental Resources...
Extracted 109 internal links for condition Neurodevelopmental Resources...
Extracted 79 external links for condition Neurodevelopmental Resources...
Failed for 1 links for condition Neurodevelopmental Resources...
Working on Neurosciences Resources...


100%|██████████| 379/379 [03:20<00:00,  1.89it/s]


Extracted 172 PDFs for condition Neurosciences Resources...
Extracted 2 videos for condition Neurosciences Resources...
Extracted 137 internal links for condition Neurosciences Resources...
Extracted 68 external links for condition Neurosciences Resources...
Failed for 0 links for condition Neurosciences Resources...
Working on Nutrition Resources...


100%|██████████| 276/276 [02:11<00:00,  2.09it/s]


Extracted 125 PDFs for condition Nutrition Resources...
Extracted 1 videos for condition Nutrition Resources...
Extracted 115 internal links for condition Nutrition Resources...
Extracted 36 external links for condition Nutrition Resources...
Failed for 0 links for condition Nutrition Resources...
Working on Occupational Therapy Resources...


100%|██████████| 230/230 [02:27<00:00,  1.56it/s]


Extracted 111 PDFs for condition Occupational Therapy Resources...
Extracted 1 videos for condition Occupational Therapy Resources...
Extracted 104 internal links for condition Occupational Therapy Resources...
Extracted 15 external links for condition Occupational Therapy Resources...
Failed for 0 links for condition Occupational Therapy Resources...
Working on Odessa Brown Children's Clinic Resources...


100%|██████████| 300/300 [01:32<00:00,  3.24it/s]


Extracted 71 PDFs for condition Odessa Brown Children's Clinic Resources...
Extracted 31 videos for condition Odessa Brown Children's Clinic Resources...
Extracted 147 internal links for condition Odessa Brown Children's Clinic Resources...
Extracted 81 external links for condition Odessa Brown Children's Clinic Resources...
Failed for 0 links for condition Odessa Brown Children's Clinic Resources...
Working on Ophthalmology Resources...


100%|██████████| 133/133 [00:41<00:00,  3.18it/s]


Extracted 26 PDFs for condition Ophthalmology Resources...
Extracted 1 videos for condition Ophthalmology Resources...
Extracted 90 internal links for condition Ophthalmology Resources...
Extracted 17 external links for condition Ophthalmology Resources...
Failed for 0 links for condition Ophthalmology Resources...
Working on Oral and Maxillofacial Surgery Resources...


100%|██████████| 136/136 [00:06<00:00, 21.57it/s]


Extracted 16 PDFs for condition Oral and Maxillofacial Surgery Resources...
Extracted 1 videos for condition Oral and Maxillofacial Surgery Resources...
Extracted 102 internal links for condition Oral and Maxillofacial Surgery Resources...
Extracted 18 external links for condition Oral and Maxillofacial Surgery Resources...
Failed for 0 links for condition Oral and Maxillofacial Surgery Resources...
Working on Orthopedics and Sports Medicine Resources...


100%|██████████| 336/336 [02:15<00:00,  2.47it/s]


Extracted 126 PDFs for condition Orthopedics and Sports Medicine Resources...
Extracted 1 videos for condition Orthopedics and Sports Medicine Resources...
Extracted 160 internal links for condition Orthopedics and Sports Medicine Resources...
Extracted 50 external links for condition Orthopedics and Sports Medicine Resources...
Failed for 0 links for condition Orthopedics and Sports Medicine Resources...
Working on Orthotics and Prosthetics Resources...


100%|██████████| 169/169 [01:51<00:00,  1.52it/s]


Extracted 49 PDFs for condition Orthotics and Prosthetics Resources...
Extracted 1 videos for condition Orthotics and Prosthetics Resources...
Extracted 104 internal links for condition Orthotics and Prosthetics Resources...
Extracted 16 external links for condition Orthotics and Prosthetics Resources...
Failed for 0 links for condition Orthotics and Prosthetics Resources...
Working on Otolaryngology Resources...


100%|██████████| 260/260 [02:51<00:00,  1.52it/s]


Extracted 128 PDFs for condition Otolaryngology Resources...
Extracted 1 videos for condition Otolaryngology Resources...
Extracted 107 internal links for condition Otolaryngology Resources...
Extracted 25 external links for condition Otolaryngology Resources...
Failed for 0 links for condition Otolaryngology Resources...
Working on Pain Medicine Resources...


100%|██████████| 175/175 [00:24<00:00,  7.23it/s]


Extracted 50 PDFs for condition Pain Medicine Resources...
Extracted 3 videos for condition Pain Medicine Resources...
Extracted 100 internal links for condition Pain Medicine Resources...
Extracted 25 external links for condition Pain Medicine Resources...
Failed for 0 links for condition Pain Medicine Resources...
Working on Palliative Care Resources...


100%|██████████| 123/123 [00:00<00:00, 478570.86it/s]


Extracted 0 PDFs for condition Palliative Care Resources...
Extracted 2 videos for condition Palliative Care Resources...
Extracted 95 internal links for condition Palliative Care Resources...
Extracted 28 external links for condition Palliative Care Resources...
Failed for 0 links for condition Palliative Care Resources...
Working on Pediatric Feeding Resources...


100%|██████████| 138/138 [00:56<00:00,  2.43it/s]


Extracted 22 PDFs for condition Pediatric Feeding Resources...
Extracted 6 videos for condition Pediatric Feeding Resources...
Extracted 96 internal links for condition Pediatric Feeding Resources...
Extracted 20 external links for condition Pediatric Feeding Resources...
Failed for 0 links for condition Pediatric Feeding Resources...
Working on Pediatric Hypertension Resources...


100%|██████████| 106/106 [00:03<00:00, 28.23it/s]


Extracted 6 PDFs for condition Pediatric Hypertension Resources...
Extracted 1 videos for condition Pediatric Hypertension Resources...
Extracted 84 internal links for condition Pediatric Hypertension Resources...
Extracted 16 external links for condition Pediatric Hypertension Resources...
Failed for 0 links for condition Pediatric Hypertension Resources...
Working on Pediatric Intensive Care Unit (PICU) Resources...


100%|██████████| 158/158 [00:14<00:00, 10.91it/s]


Extracted 11 PDFs for condition Pediatric Intensive Care Unit (PICU) Resources...
Extracted 1 videos for condition Pediatric Intensive Care Unit (PICU) Resources...
Extracted 132 internal links for condition Pediatric Intensive Care Unit (PICU) Resources...
Extracted 15 external links for condition Pediatric Intensive Care Unit (PICU) Resources...
Failed for 0 links for condition Pediatric Intensive Care Unit (PICU) Resources...
Working on Pediatric and Adolescent Gynecology Resources...


100%|██████████| 150/150 [00:17<00:00,  8.67it/s]


Extracted 19 PDFs for condition Pediatric and Adolescent Gynecology Resources...
Extracted 1 videos for condition Pediatric and Adolescent Gynecology Resources...
Extracted 109 internal links for condition Pediatric and Adolescent Gynecology Resources...
Extracted 22 external links for condition Pediatric and Adolescent Gynecology Resources...
Failed for 0 links for condition Pediatric and Adolescent Gynecology Resources...
Working on Physical Therapy Resources...


100%|██████████| 230/230 [02:01<00:00,  1.89it/s]


Extracted 111 PDFs for condition Physical Therapy Resources...
Extracted 1 videos for condition Physical Therapy Resources...
Extracted 104 internal links for condition Physical Therapy Resources...
Extracted 15 external links for condition Physical Therapy Resources...
Failed for 1 links for condition Physical Therapy Resources...
Working on Plastic Surgery Resources...


100%|██████████| 144/144 [00:43<00:00,  3.31it/s]


Extracted 31 PDFs for condition Plastic Surgery Resources...
Extracted 1 videos for condition Plastic Surgery Resources...
Extracted 92 internal links for condition Plastic Surgery Resources...
Extracted 21 external links for condition Plastic Surgery Resources...
Failed for 0 links for condition Plastic Surgery Resources...
Working on Psychiatry and Behavioral Medicine Resources...


100%|██████████| 420/420 [04:28<00:00,  1.56it/s]


Extracted 188 PDFs for condition Psychiatry and Behavioral Medicine Resources...
Extracted 10 videos for condition Psychiatry and Behavioral Medicine Resources...
Extracted 153 internal links for condition Psychiatry and Behavioral Medicine Resources...
Extracted 78 external links for condition Psychiatry and Behavioral Medicine Resources...
Failed for 0 links for condition Psychiatry and Behavioral Medicine Resources...
Working on Pulmonary Resources...


100%|██████████| 346/346 [04:19<00:00,  1.33it/s]


Extracted 221 PDFs for condition Pulmonary Resources...
Extracted 3 videos for condition Pulmonary Resources...
Extracted 100 internal links for condition Pulmonary Resources...
Extracted 24 external links for condition Pulmonary Resources...
Failed for 9 links for condition Pulmonary Resources...
Working on Radiology Resources...


100%|██████████| 167/167 [00:24<00:00,  6.83it/s]


Extracted 53 PDFs for condition Radiology Resources...
Extracted 5 videos for condition Radiology Resources...
Extracted 93 internal links for condition Radiology Resources...
Extracted 21 external links for condition Radiology Resources...
Failed for 0 links for condition Radiology Resources...
Working on Reconstructive Pelvic Medicine Resources...


100%|██████████| 194/194 [01:16<00:00,  2.55it/s]


Extracted 46 PDFs for condition Reconstructive Pelvic Medicine Resources...
Extracted 2 videos for condition Reconstructive Pelvic Medicine Resources...
Extracted 113 internal links for condition Reconstructive Pelvic Medicine Resources...
Extracted 35 external links for condition Reconstructive Pelvic Medicine Resources...
Failed for 0 links for condition Reconstructive Pelvic Medicine Resources...
Working on Rehabilitation Medicine Resources...


100%|██████████| 166/166 [00:25<00:00,  6.53it/s]


Extracted 12 PDFs for condition Rehabilitation Medicine Resources...
Extracted 1 videos for condition Rehabilitation Medicine Resources...
Extracted 125 internal links for condition Rehabilitation Medicine Resources...
Extracted 29 external links for condition Rehabilitation Medicine Resources...
Failed for 0 links for condition Rehabilitation Medicine Resources...
Working on Rehabilitation Psychology Resources...


100%|██████████| 164/164 [00:04<00:00, 35.92it/s]


Extracted 9 PDFs for condition Rehabilitation Psychology Resources...
Extracted 1 videos for condition Rehabilitation Psychology Resources...
Extracted 113 internal links for condition Rehabilitation Psychology Resources...
Extracted 42 external links for condition Rehabilitation Psychology Resources...
Failed for 0 links for condition Rehabilitation Psychology Resources...
Working on Reproductive and Sexual Health Resources...


100%|██████████| 126/126 [00:03<00:00, 35.88it/s]


Extracted 10 PDFs for condition Reproductive and Sexual Health Resources...
Extracted 1 videos for condition Reproductive and Sexual Health Resources...
Extracted 98 internal links for condition Reproductive and Sexual Health Resources...
Extracted 18 external links for condition Reproductive and Sexual Health Resources...
Failed for 0 links for condition Reproductive and Sexual Health Resources...
Working on Rheumatology Resources...


100%|██████████| 167/167 [00:09<00:00, 18.13it/s]


Extracted 21 PDFs for condition Rheumatology Resources...
Extracted 1 videos for condition Rheumatology Resources...
Extracted 105 internal links for condition Rheumatology Resources...
Extracted 41 external links for condition Rheumatology Resources...
Failed for 0 links for condition Rheumatology Resources...
Working on Seattle Children’s Resources...


100%|██████████| 149/149 [00:53<00:00,  2.80it/s]


Extracted 44 PDFs for condition Seattle Children’s Resources...
Extracted 1 videos for condition Seattle Children’s Resources...
Extracted 89 internal links for condition Seattle Children’s Resources...
Extracted 16 external links for condition Seattle Children’s Resources...
Failed for 0 links for condition Seattle Children’s Resources...
Working on Sleep Medicine Resources...


100%|██████████| 213/213 [02:06<00:00,  1.68it/s]


Extracted 85 PDFs for condition Sleep Medicine Resources...
Extracted 1 videos for condition Sleep Medicine Resources...
Extracted 103 internal links for condition Sleep Medicine Resources...
Extracted 25 external links for condition Sleep Medicine Resources...
Failed for 0 links for condition Sleep Medicine Resources...
Working on Social Work Resources...


100%|██████████| 125/125 [00:01<00:00, 66.79it/s]


Extracted 8 PDFs for condition Social Work Resources...
Extracted 1 videos for condition Social Work Resources...
Extracted 95 internal links for condition Social Work Resources...
Extracted 22 external links for condition Social Work Resources...
Failed for 0 links for condition Social Work Resources...
Working on Speech and Language Resources...


100%|██████████| 216/216 [00:49<00:00,  4.40it/s]


Extracted 67 PDFs for condition Speech and Language Resources...
Extracted 4 videos for condition Speech and Language Resources...
Extracted 113 internal links for condition Speech and Language Resources...
Extracted 36 external links for condition Speech and Language Resources...
Failed for 0 links for condition Speech and Language Resources...
Working on Sports Physical Therapy Resources...


100%|██████████| 188/188 [01:09<00:00,  2.70it/s]


Extracted 40 PDFs for condition Sports Physical Therapy Resources...
Extracted 1 videos for condition Sports Physical Therapy Resources...
Extracted 125 internal links for condition Sports Physical Therapy Resources...
Extracted 23 external links for condition Sports Physical Therapy Resources...
Failed for 0 links for condition Sports Physical Therapy Resources...
Working on Transplant Resources...


100%|██████████| 150/150 [00:08<00:00, 17.80it/s]


Extracted 10 PDFs for condition Transplant Resources...
Extracted 2 videos for condition Transplant Resources...
Extracted 99 internal links for condition Transplant Resources...
Extracted 41 external links for condition Transplant Resources...
Failed for 0 links for condition Transplant Resources...
Working on Urology Resources...


100%|██████████| 386/386 [05:00<00:00,  1.29it/s]


Extracted 238 PDFs for condition Urology Resources...
Extracted 2 videos for condition Urology Resources...
Extracted 110 internal links for condition Urology Resources...
Extracted 38 external links for condition Urology Resources...
Failed for 0 links for condition Urology Resources...
Working on Vascular Anomalies Resources...


100%|██████████| 168/168 [00:25<00:00,  6.72it/s]


Extracted 24 PDFs for condition Vascular Anomalies Resources...
Extracted 1 videos for condition Vascular Anomalies Resources...
Extracted 114 internal links for condition Vascular Anomalies Resources...
Extracted 30 external links for condition Vascular Anomalies Resources...
Failed for 0 links for condition Vascular Anomalies Resources...
Working on Vascular Access...


100%|██████████| 240/240 [02:25<00:00,  1.65it/s]

Extracted 117 PDFs for condition Vascular Access...
Extracted 17 videos for condition Vascular Access...
Extracted 89 internal links for condition Vascular Access...
Extracted 33 external links for condition Vascular Access...
Failed for 0 links for condition Vascular Access...





In [None]:
print(f"{len(pages)} pages have been parsed from the Patient Education website ...")

72 pages have been parsed from the Patient Education website ...


In [None]:
for artifact_type in ['pdfs', 'videos', 'internal_links', 'external_links']:
    count = sum([len(condition[artifact_type]) for condition in pages.values()])
    print(f"{count} {artifact_type} have been extracted from the Patient Education website ...")

4397 pdfs have been extracted from the Patient Education website ...
267 videos have been extracted from the Patient Education website ...
7829 internal_links have been extracted from the Patient Education website ...
2741 external_links have been extracted from the Patient Education website ...


Let's now save the pages dictionary as a JSON file.

In [None]:
with open("pages.json", "w") as f:
    json.dump(pages, f)
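
A quick sanity check after saving is to reload the file and confirm that the per-type counts survive the round trip. The sketch below assumes the layout of the pages dictionary used above (one entry per condition, each holding `pdfs`, `videos`, `internal_links` and `external_links` lists). The sample dictionary and the `count_artifacts` helper are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile

ARTIFACT_TYPES = ["pdfs", "videos", "internal_links", "external_links"]


def count_artifacts(pages: dict) -> dict:
    """Return the total number of items per artifact type."""
    return {
        artifact_type: sum(len(condition[artifact_type]) for condition in pages.values())
        for artifact_type in ARTIFACT_TYPES
    }


# hypothetical, tiny stand-in for the real pages dictionary
sample_pages = {
    "Condition A": {
        "pdfs": ["a.pdf", "b.pdf"],
        "videos": [],
        "internal_links": ["/x"],
        "external_links": [],
    },
    "Condition B": {
        "pdfs": ["c.pdf"],
        "videos": ["v"],
        "internal_links": [],
        "external_links": ["https://example.org"],
    },
}

# write the dictionary to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pages.json")
    with open(path, "w") as f:
        json.dump(sample_pages, f)
    with open(path) as f:
        reloaded = json.load(f)

# the reloaded dictionary should carry exactly the same counts
assert count_artifacts(reloaded) == count_artifacts(sample_pages)
```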

We can also copy it to the GCS bucket used for the PDF extractions.

In [None]:
# upload pages.json to the GCS bucket
!gsutil cp pages.json gs://{PDFS_BUCKET_NAME}/patient_education_pages.json

Copying file://pages.json [Content-Type=application/json]...
/ [1 files][  1.7 MiB/  1.7 MiB]
Operation completed over 1 objects/1.7 MiB.                                      


## 02- Conditions website

The conditions webpages contain a lot of information available as text content. The number of embedded PDF documents is low.

- The knowledge extraction focuses on the text content, which is organized in chunks created from the page hierarchy.

- The external and internal hyperlinks are also referenced for each page.

### 02-01- Pages Extraction

This section of the notebook extracts the list of conditions pages from the index page available on the Seattle Children's website.

In [None]:
def extract_conditions_collection(url: str) -> dict:
    """
    Extract the collection of conditions pages
    :param url: url of the index page
    :return: a dictionary with the conditions names and urls
    """

    # prefix for the links
    prefix = "https://www.seattlechildrens.org"

    # get the source of the page
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for bad status codes
    soup = BeautifulSoup(response.content, 'html.parser')

    # list the links corresponding to the conditions pages
    conditions_pages = {
        a.text: prefix+a.get('href') for a in soup.find_all('a', href=True) if a.get('href').startswith("/conditions")
    }

    return conditions_pages
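
String concatenation with the prefix works as long as every `href` is site-relative. A slightly more defensive option, shown here only as a sketch, is `urllib.parse.urljoin` from the standard library, which resolves relative paths and leaves absolute URLs untouched. The `to_absolute` helper and the `example.org` URL are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.seattlechildrens.org"


def to_absolute(href: str, base_url: str = BASE_URL) -> str:
    """Resolve a possibly relative href against the site root."""
    return urljoin(base_url, href)


# a site-relative path gets the prefix, an absolute URL is returned unchanged
print(to_absolute("/conditions/a-z/"))
print(to_absolute("https://example.org/file.pdf"))
```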

In [None]:
conditions_pages = extract_conditions_collection("https://www.seattlechildrens.org/conditions/a-z/")

In [None]:
print(f"There are {len(conditions_pages)} Condition pages in the collection...")

There are 392 Condition pages in the collection...


### 02-02- Content Extraction and Chunking

The schema for the page object is slightly different from the one used for the Patient Education pages, because there is more information to extract from the Conditions pages and because each page also references its chunks.
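
For reference, the shape of the page and chunk objects can be written down as typed dictionaries. This is only a descriptive sketch: the field names mirror the dictionaries built by the extraction code in this section, while the `Chunk` and `Page` class names and the example values are hypothetical.

```python
from typing import List, Optional, TypedDict


class Chunk(TypedDict):
    id: int                   # sequential id of the chunk within the page
    unique_id: str            # UUID4 string
    content: str              # heading line followed by the section text
    parent_id: Optional[int]  # id of the parent chunk, None when there is none
    url_links: List[str]      # URLs of the links found in the chunk
    is_root: bool             # True for the level-1 heading chunk
    kind: str                 # always "chunk"


class Page(TypedDict):
    unique_id: str            # UUID4 string
    title: str
    content: str              # full page content in markdown
    copyright: str
    chunks: List[Chunk]
    kind: str                 # always "page"
    external_links: List[str]
    internal_links: List[str]


# hypothetical example values, only to show a conforming object
example_page: Page = {
    "unique_id": "00000000-0000-4000-8000-000000000000",
    "title": "Example Condition",
    "content": "# Example Condition",
    "copyright": "Copyright not found",
    "chunks": [
        {
            "id": 0,
            "unique_id": "00000000-0000-4000-8000-000000000001",
            "content": "# Example Condition\n",
            "parent_id": None,
            "url_links": [],
            "is_root": True,
            "kind": "chunk",
        }
    ],
    "kind": "page",
    "external_links": [],
    "internal_links": [],
}
```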

In [None]:
def chunk_markdown(markdown_content: str, links: dict) -> list:
    """
    Chunk the markdown content
    :param markdown_content: content in markdown format
    :param links: dictionary containing the links extracted from content
    :return: list of chunks
    """

    # list of chunks
    chunks = []

    # chunking is based on the markdown layout
    lines = markdown_content.split("\n")

    # current chunk
    current_id = -1
    def default_chunk():
        return {
            "id": current_id + 1,
            "unique_id": str(uuid.uuid4()),
            "content": "",
            "parent_id": None,
            "url_links": [],
            "is_root": False,
            "kind": "chunk"
        }

    # init the different artifacts
    current_chunk = default_chunk()
    hierarchy_ids = dict()

    # fill the chunks line by line
    for i, line in enumerate(lines):
        if len(line)>0:

            # Markdown layout drives the chunks content
            hashtag_count = line[:10].count("#")

            if hashtag_count == 0:
                current_chunk['content'] += "\n" + line

                # gather the found links
                for link in links:
                    if link in line:
                        current_chunk['url_links'].append(links[link])
            else:
                current_chunk['url_links'] = list(set(current_chunk['url_links']))
                chunks.append(current_chunk)
                current_chunk = default_chunk()
                current_chunk['content'] += line + "\n"
                hierarchy_ids[hashtag_count] = current_chunk['id']

                if hashtag_count == 1:
                    current_chunk['is_root'] = True
                else:
                    if hashtag_count == 2:
                        current_chunk['parent_id'] = 0
                    elif hashtag_count > 2:
                        if hashtag_count-1 in hierarchy_ids:
                            current_chunk['parent_id'] = hierarchy_ids[hashtag_count-1]
                        else:
                            current_chunk['parent_id'] = hierarchy_ids[min(hierarchy_ids.keys())]

                # gather the found links
                for link in links:
                    if link in line:
                        current_chunk['url_links'].append(links[link])

                # incrementing the id counter for next chunk
                current_id = current_chunk['id']
    chunks.append(current_chunk)
    return chunks[1:]
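
Counting `#` characters in the first ten characters of a line treats any early `#` as a heading marker, for instance a line beginning with `Room #12`. If that ever becomes a problem, a stricter test could match only a leading run of `#` followed by whitespace. The sketch below is an alternative, not the logic used by `chunk_markdown`, and `heading_level` is a hypothetical helper.

```python
import re

# a heading line starts with one to six '#' followed by whitespace
HEADING_RE = re.compile(r"^(#{1,6})\s")


def heading_level(line: str) -> int:
    """Return the heading level of a markdown line, or 0 if it is not a heading."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else 0


# print the detected level next to each sample line
for line in ["# Title", "### Symptoms", "Room #12 is closed", "#hashtag"]:
    print(heading_level(line), repr(line))
```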

In [None]:
def extract_from_page(url: str) -> dict:
    """
    Extract knowledge from Conditions Page.
    :param url: url of the condition page
    :return: content and metadata as dictionary
    """

    # prefix for the links
    prefix = "https://www.seattlechildrens.org"

    # get the source of the page
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for bad status codes
    soup = BeautifulSoup(response.content, 'html.parser')

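    # extract the title of the page
    title = soup.find('div', class_="mod page-title")
    if title:
        title = title.find('h1').text
    else:
        title = "Title Not Found"
    # the title is used as the level-1 heading of the markdown content
    # (defined in both cases, so that a missing title does not raise a NameError)
    title_content = f"# {title}"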

    # handle the accordions: promote their headers to <h3> and unwrap their containers
    div_elements = soup.find_all('div', class_='accordion-header heading4 js-accordion-header')
    for div_tag in div_elements:
        new_h3_tag = soup.new_tag('h3')
        new_h3_tag.extend(div_tag.contents)
        div_tag.replace_with(new_h3_tag)
    div_elements = soup.find_all('div', class_='accordion-more js-accordion-more')
    for div_tag in div_elements:
        div_tag.unwrap()
    ul_elements = soup.find_all('ul', class_='accordion accordion--classic js-accordion')
    for ul_element in ul_elements:
        ul_element.unwrap()
    li_elements = soup.find_all('li', class_='accordion-item js-accordion-item')
    for li_element in li_elements:
        li_element.unwrap()

    # extract the main content of the page
    main_content = soup.find('div', class_="main-content-body")

    if main_content:

        markdown_content = convert_to_markdown(
            main_content,
            heading_style='atx',
            escape_asterisks=True
        )

        # make the relative link targets absolute
        # [^\]]* matches the link text (any character except a closing bracket)
        link_regex = re.compile(r'(!?\[[^\]]*\])\((.*?)\)')
        for _, target in set(link_regex.findall(markdown_content)):
            if not target.startswith("http"):
                # plain string replacement, so the target needs no regex escaping
                markdown_content = markdown_content.replace(
                    f"]({target})", f"]({prefix + target})"
                )
    else:
        markdown_content = "Main content not found"

    # copyrights
    copyright = soup.find('p', class_="ho-psmall")
    if copyright:
        copyright = copyright.text
    else:
        copyright = "Copyright not found"

    # extract links
    def clean_link(link_value: str):
        link_value = link_value.replace("\r\n", "").strip()
        return link_value

    # Links parsing for page and chunks
    links = {}
    internal_links = []
    external_links = []
    if main_content:
        for a in main_content.find_all('a', href=True):
            href = a.get('href')
            # external links are already absolute, internal ones need the prefix
            absolute_href = href if href.startswith("http") else prefix + href
            if href.startswith("http"):
                external_links.append(absolute_href)
            else:
                internal_links.append(absolute_href)
            # an empty link text would match every line during the chunking
            link_text = clean_link(a.text)
            if link_text:
                links[link_text] = absolute_href

    # create the page content dictionary
    page_content = {
        "unique_id": str(uuid.uuid4()),
        "title": title,
        "content": title_content + markdown_content,
        "copyright": copyright,
        "chunks": chunk_markdown(title_content + markdown_content, links),
        "kind": "page",
        "external_links": external_links,
        "internal_links": internal_links
    }

    return page_content

In [None]:
test = extract_from_page(conditions_pages['22q11.2-Related Disorders'])

In [None]:
pprint.pprint(test['chunks'])

[{'content': '# 22q11.2-Related Disorders## What are 22q11\\.2\\-related '
             'disorders?\n'
             '\n'
             '22q11\\.2\\-related disorders are caused by differences in part '
             'of chromosome 22, called the q11\\.2 region. Chromosomes contain '
             'genes, which tell our cells how to work and what proteins to '
             'make. There are 23 pairs of chromosomes in each cell of the '
             'body.\n'
             '22q11\\.2\\-related disorders happen in at least 1 in 1,000 '
             'newborns.\n'
             'The symptoms differ widely, even among members of the same '
             'family. There may be small differences in how your child’s '
             'eyelids, nose and ears look.\n'
             'These conditions are linked to many health issues. They can '
             'affect your child’s growth, feeding, breathing, speaking, '
             'hearing, learning and mental health. But most children with '
             '22q

In [None]:
with open("one_condition.jsonl", "w") as fp:
    json.dump(test, fp)
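
The `parent_id` field makes it possible to rebuild the heading path of any chunk, which can be handy when a chunk is shown out of context. The sketch below assumes the chunk layout produced by `chunk_markdown` (an `id`, a `parent_id` and a `content` whose first line is the heading). The `heading_path` helper and the sample chunks are hypothetical.

```python
def heading_path(chunks: list, chunk_id: int) -> list:
    """Return the headings from the top-most ancestor down to the given chunk."""
    by_id = {chunk["id"]: chunk for chunk in chunks}
    path = []
    seen = set()  # guards against a cycle in the parent references
    current = by_id.get(chunk_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])
        # the heading is the first line of the chunk content
        first_line = current["content"].split("\n", 1)[0]
        path.append(first_line.lstrip("# ").strip())
        parent_id = current["parent_id"]
        current = by_id.get(parent_id) if parent_id is not None else None
    return list(reversed(path))


# hypothetical chunks, reduced to the fields used by the helper
sample_chunks = [
    {"id": 0, "parent_id": None, "content": "# Asthma\n\nIntro text"},
    {"id": 1, "parent_id": 0, "content": "## Symptoms\n\nText"},
    {"id": 2, "parent_id": 1, "content": "### When to call\n\nText"},
]

print(heading_path(sample_chunks, 2))
```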

Let's extract the content of all the conditions pages available in the collection.

In [None]:
pages_content = []
pages_w_issue = []
for condition, page_url in tqdm.tqdm(conditions_pages.items()):
    if condition != "Conditions" and condition != "\r\n        All Symptoms\r\n    ":
        try:
            pages_content.append(extract_from_page(page_url))
        except Exception:
            pages_w_issue.append(condition)

100%|██████████| 392/392 [11:15<00:00,  1.72s/it]


In [None]:
print(f"{len(pages_content)} Condition pages extracted...")

389 Condition pages extracted...


In [None]:
pages_w_issue

['Mental Health Problems']
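
Only the name of the failing page is recorded, so the reason for the failure is not visible here. A variant of the loop could keep the exception message next to the page name. The sketch below is one way to do it; `extract_all` and `fake_extractor` are hypothetical names, and the fake extractor merely stands in for `extract_from_page` so that the example runs without network access.

```python
def extract_all(pages: dict, extractor) -> tuple:
    """Run the extractor on every page and collect the results and the failures."""
    results, failures = [], {}
    for name, page_url in pages.items():
        try:
            results.append(extractor(page_url))
        except Exception as error:
            # keep going, but remember why the page failed
            failures[name] = f"{type(error).__name__}: {error}"
    return results, failures


# hypothetical extractor standing in for extract_from_page
def fake_extractor(page_url: str) -> dict:
    if page_url.endswith("/broken"):
        raise ValueError("main content not found")
    return {"url": page_url}


results, failures = extract_all(
    {
        "Good page": "https://example.org/good",
        "Bad page": "https://example.org/broken",
    },
    fake_extractor,
)
print(failures)
```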

Let's save the pages content as a JSON Lines file, with one page per line.

In [None]:
jsonl_content = ""
for page_content in pages_content:
    if "unique_id" not in page_content.keys():
        print(page_content)
    jsonl_content += json.dumps(page_content) + "\n"


with open("pages_content.jsonl", "w") as fp:
    fp.write(jsonl_content)
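
Since each line of the file is an independent JSON document, the file can be read back line by line. The following is a minimal sketch of the write and read sides; `write_jsonl`, `read_jsonl` and the sample records are hypothetical and only meant to show the format.

```python
import json
import os
import tempfile


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON document per line."""
    with open(path, "w") as fp:
        for record in records:
            fp.write(json.dumps(record) + "\n")


def read_jsonl(path: str) -> list:
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path) as fp:
        return [json.loads(line) for line in fp if line.strip()]


# hypothetical records; newlines inside strings are escaped by json.dumps
sample_records = [
    {"unique_id": "a", "title": "Page A", "content": "# Page A\n\nText"},
    {"unique_id": "b", "title": "Page B", "content": "# Page B"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pages_content.jsonl")
    write_jsonl(sample_records, path)
    reloaded_records = read_jsonl(path)

print(len(reloaded_records))
```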

In [None]:
# upload pages_content.jsonl to the GCS bucket
! gsutil cp pages_content.jsonl gs://sch_conditions_pages/pages_content.jsonl

Copying file://pages_content.jsonl [Content-Type=application/octet-stream]...
/ [1 files][ 10.3 MiB/ 10.3 MiB]                                                
Operation completed over 1 objects/10.3 MiB.                                     


Let's summarize the amount of information extracted.

In [None]:
for page_content in pages_content:
    print(page_content)
    break

{'unique_id': 'bde8c4d6-693a-46f9-8cc1-b204eaf34d07', 'title': '22q11.2-Related Disorders', 'content': '# 22q11.2-Related Disorders## What are 22q11\\.2\\-related disorders?\n\n22q11\\.2\\-related disorders are caused by differences in part of chromosome 22, called the q11\\.2 region. Chromosomes contain genes, which tell our cells how to work and what proteins to make. There are 23 pairs of chromosomes in each cell of the body.\n\n22q11\\.2\\-related disorders happen in at least 1 in 1,000 newborns.\n\nThe symptoms differ widely, even among members of the same family. There may be small differences in how your child’s eyelids, nose and ears look.\n\nThese conditions are linked to many health issues. They can affect your child’s growth, feeding, breathing, speaking, hearing, learning and mental health. But most children with 22q11\\.2\\-related disorders only have problems in some of these areas.\n\n### What causes 22q11\\.2\\-related disorders?\n\nThese disorders happen because of cha

In [None]:
print(f"{len(pages_content)} Condition pages analyzed...")
for artifact_type in ['chunks', 'internal_links', 'external_links']:
    count = sum([len(page_content[artifact_type]) for page_content in pages_content])
    print(f"{count} {artifact_type} have been extracted from the Conditions website ...")

389 Condition pages analyzed...
8281 chunks have been extracted from the Conditions website ...
7207 internal_links have been extracted from the Conditions website ...
390 external_links have been extracted from the Conditions website ...
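
These totals are plain sums over the pages, so a URL is counted every time it appears. If the number of distinct targets is of interest, the links can be gathered in a set first. The sketch below assumes the page layout used above; `count_unique` and the sample data are hypothetical.

```python
def count_unique(pages_content: list, artifact_type: str) -> int:
    """Count the distinct links of a given type across all the pages."""
    return len({link for page in pages_content for link in page[artifact_type]})


# hypothetical pages, reduced to the link fields
sample_pages_content = [
    {
        "internal_links": ["https://example.org/a", "https://example.org/b"],
        "external_links": ["https://example.com/x"],
    },
    {
        "internal_links": ["https://example.org/a"],
        "external_links": [],
    },
]

for artifact_type in ["internal_links", "external_links"]:
    total = sum(len(page[artifact_type]) for page in sample_pages_content)
    unique = count_unique(sample_pages_content, artifact_type)
    print(f"{artifact_type}: {total} in total, {unique} distinct")
```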


## 03- Pathways website

The pathways website contains only PDF documents, which are extracted using the following code.

In [10]:
url = "https://www.seattlechildrens.org/healthcare-professionals/community-providers/pathways/"

In [11]:
PDFS_BUCKET_NAME = "sch_pathways_pdfs"

In [21]:
def extract_pdfs(url: str, bucket_name: str, debug: bool=False) -> None:
    """
    Extract all the PDF files
    :param url: url of the root webpage
    :param bucket_name: name of the bucket where the PDFs are stored
    :param debug: debug flag
    :return: None
    """

    # create the bucket if it does not exist
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    if not bucket.exists():
        bucket.create()

    # extracting the knowledge embedded as links
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # placeholder for the failed PDF urls
    failed_links = []

    # extracted PDFs counter
    success_count = 0

    for link in tqdm.tqdm(soup.find_all("a")):

        # skip the anchors without target
        href = link.get("href")
        if not href:
            continue

        # enrich the url if needed (absolute urls are kept as they are)
        pdf_url = href if href.startswith("http") else "https://www.seattlechildrens.org" + href

        # check if the link points to a PDF file
        if pdf_url[-3:].lower() == "pdf":

            if debug:
                print(f'Working on {href}...')
                print(f'URL: {pdf_url}')

            try:
                response = requests.get(pdf_url, timeout=10)
                if debug:
                    # debug mode stops after the first PDF request
                    print(f'Status code: {response.status_code}')
                    return
                if response.status_code != 200:
                    failed_links.append(pdf_url)
                else:
                    pdf_content = response.content
                    file_name = pdf_url.split("/")[-1].lower()
                    file_name = file_name.replace(" ", "_").encode("ascii", "ignore").decode("ascii")
                    blob = bucket.blob(file_name)
                    blob.upload_from_string(pdf_content, content_type="application/pdf")
                    success_count += 1
            except Exception:
                failed_links.append(pdf_url)

    print(f"Extracted {success_count} PDFs from pathways ...")
    print(f"Failed to extract {len(failed_links)} PDFs from pathways ...")
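
The blob name is derived from the last segment of the URL, so two PDFs sharing a file name in different folders would end up in the same blob, the later one overwriting the earlier one. The naming step can also be isolated and made a little more tolerant of query strings and percent-encoded characters. The sketch below is an assumption about how this could look, not the code used for the extraction above; `pdf_blob_name` and the sample URLs are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def pdf_blob_name(pdf_url: str) -> str:
    """Derive an ASCII, lower-case blob name from the URL of a PDF."""
    # keep only the path, which drops any ?query or #fragment
    path = urlsplit(pdf_url).path
    # decode percent-encoded characters such as %20 before the clean-up
    file_name = unquote(path.split("/")[-1]).lower()
    file_name = file_name.replace(" ", "_")
    # drop the remaining non-ASCII characters
    return file_name.encode("ascii", "ignore").decode("ascii")


print(pdf_blob_name("https://example.org/pdf/My%20File.PDF?v=2"))
```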

In [22]:
extract_pdfs(url, PDFS_BUCKET_NAME)

100%|██████████| 201/201 [02:08<00:00,  1.57it/s]

Extracted 87 PDFs from pathways ...
Failed to extract 0 PDFs from pathways ...





## 04- Emergency and Urgent Care

The Emergency or Urgent Care page contains information about conditions and the most appropriate service to contact for each of them.

The main source of knowledge is embedded in a table that needs to be parsed consistently.

In [23]:
url = "https://www.seattlechildrens.org/clinics/urgent-care-clinic/emergency-or-urgent-care"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

### 04-01- Table Extraction

In [24]:
# prefix for the links
prefix = "https://www.seattlechildrens.org"

In [62]:
def extract_table(url: str)-> dict:
    """
    Extract table content from url
    :param url: url of the table
    :return: dictionary containing the table content
    """

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    conditions_services = {}

    # table extraction
    table_content = soup.find("table")
    tbody = table_content.find("tbody")

    for tr in tbody.find_all("tr"):
        tds = list(tr.find_all("td"))

        # condition parsing
        condition_td = tds[0]
        if condition_td.find("a"):
            # removing the information for the em anchor
            if condition_td.find("em"):
                condition_td.find("em").decompose()
                condition_url = None
            else:
                condition_url = prefix + condition_td.find("a").get("href")
            condition_name = condition_td.text.strip()
            # removing all non ascii characters from condition_name
            condition_name = condition_name.encode("ascii", "ignore").decode("ascii")
        else:
            condition_name = condition_td.text.strip()
            condition_url = None
        condition = {
            "name": condition_name,
            "url": condition_url
        }

        # service parsing
        service_td = tds[1]
        if service_td.find("a"):
            service_name = service_td.find("a").text.strip()
            service_url = prefix + service_td.find("a").get("href")
        else:
            service_name = service_td.text.strip()
            service_url = None
        service = {
            "name": service_name,
            "url": service_url
        }

        conditions_services[condition_name] = {
            "condition": condition,
            "service": service
        }

    print(f"{len(conditions_services)} rows have been extracted ...")
    return conditions_services

In [63]:
conditions_services = extract_table(url)

39 rows have been extracted ...


In [68]:
with open("emergency_or_urgent_care.json", "w") as fp:
    json.dump(conditions_services, fp)
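
The extracted dictionary is keyed by condition. For some uses it may be more convenient to look at it the other way around, with the conditions listed under each service. The sketch below assumes the row layout returned by `extract_table`; the `group_by_service` helper and the sample rows, including the service names, are hypothetical.

```python
from collections import defaultdict


def group_by_service(conditions_services: dict) -> dict:
    """Group the condition names by the name of the recommended service."""
    grouped = defaultdict(list)
    for row in conditions_services.values():
        grouped[row["service"]["name"]].append(row["condition"]["name"])
    return dict(grouped)


# hypothetical rows following the layout of the extracted table
sample_rows = {
    "Condition A": {
        "condition": {"name": "Condition A", "url": None},
        "service": {"name": "Urgent Care", "url": None},
    },
    "Condition B": {
        "condition": {"name": "Condition B", "url": None},
        "service": {"name": "Emergency Department", "url": None},
    },
    "Condition C": {
        "condition": {"name": "Condition C", "url": None},
        "service": {"name": "Urgent Care", "url": None},
    },
}

print(group_by_service(sample_rows))
```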

In [69]:
! gsutil cp emergency_or_urgent_care.json gs://sch_emergency_or_urgent_care_pdfs/emergency_or_urgent_care.json

Copying file://emergency_or_urgent_care.json [Content-Type=application/json]...
/ [1 files][  8.5 KiB/  8.5 KiB]
Operation completed over 1 objects/8.5 KiB.                                      


### 04-02- PDFs Extraction

In [65]:
PDFS_BUCKET_NAME = "sch_emergency_or_urgent_care_pdfs"

In [66]:
extract_pdfs(url, PDFS_BUCKET_NAME)

100%|██████████| 227/227 [00:13<00:00, 17.19it/s]

Extracted 16 PDFs from pathways ...
Failed to extract 0 PDFs from pathways ...





### 04-03- Text Content Extraction

In [None]:
# TODO: keep only the content in the <div class="main-content-body">
# TODO: decompose <ul class="block-featured-buttons" data-block-type="FeaturedButtonBlockData" data-block-name="Buttons" data-block-id="85609">
# TODO: manage the accordion <li id="" class="accordion-item js-accordion-item">