# Document Intelligence Processor Walkthrough
This notebook walks through the Document Intelligence processor component included in this repository. The processor converts the raw API response from Document Intelligence into a more usable form, which can then be easily converted into the format required for sending to an LLM endpoint.

Some features include:
* Automatic extraction of rich content in a PDF, including tables, figures and more.
* Output of text content in a PDF, images for each page and figure within the PDF, and pandas dataframes for tables within the PDF.
* Automatic correction of image rotation when extracting page and figure images (left uncorrected, rotated images can severely degrade LLM extraction accuracy).
* Custom definition of the content outputs, allowing for completely dynamic formatting of all content in a file.
* Chunking of content into smaller parts (e.g. into chunks of X pages) which can then be processed as part of a Map Reduce pattern.
* Automatic conversion of the content to the OpenAI message format, ready for processing with an LLM.

In [19]:
import json
import os
from dataclasses import dataclass

from IPython.display import Markdown as md
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.formrecognizer import DocumentAnalysisClient, AnalysisFeature
from azure.ai.formrecognizer import AnalyzeResult as FormRecognizerAnalyzeResult
from azure.ai.documentintelligence.models import (
    AnalyzeResult,
    AnalyzeDocumentRequest,
    DocumentAnalysisFeature,
)

# ignore cryptography version warnings
import warnings
warnings.filterwarnings(action='ignore', module='.*cryptography.*')

# Append src module to system path to import from src module
import sys
sys.path.append(os.path.abspath("../function_app"))

from src.components.customizedDocumentFigureProcessor import MarkdownImageTagDocumentFigureProcessor

from src.components.doc_intelligence import (
    DefaultDocumentPageProcessor, DefaultDocumentKeyValuePairProcessor,
    DefaultDocumentTableProcessor, DefaultDocumentFigureProcessor,
    DefaultDocumentParagraphProcessor, DefaultDocumentLineProcessor,
    DefaultDocumentWordProcessor, DefaultSelectionMarkFormatter,
    DefaultDocumentSectionProcessor, DocumentIntelligenceProcessor, 
    PageDocumentListSplitter, convert_processed_di_docs_to_openai_message,
    convert_processed_di_docs_to_markdown,
)
from src.helpers.data_loading import load_pymupdf_pdf, extract_pdf_page_images

# Load environment variables
load_dotenv(override=True)

# Auto-reload modules
%load_ext autoreload
%autoreload 2

# Display all outputs of a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



### Set Constants and Define the Configuration

In [None]:
# Select whether to use the Preview Document Intelligence version (v4.0) or the GA version (v3.3). Each version has some slight differences
# - V4.0 is in preview and is only available in a handful of regions. It has some additional features, particularly the ability to process figures within an image.
# - V3.3 is Generally Available and available in most Azure regions. It does not have support for extracting/processing figures from within an image.
USE_DOC_INTEL_PREVIEW_VERSION = True

# Select the model type. 
# More info here: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/model-overview
DOC_INTEL_MODEL_ID = "prebuilt-layout" # E.g. "prebuilt-read", "prebuilt-layout", or "prebuilt-document"

# Possible Document Intelligence features
# v4.0 (Preview): ['ocrHighResolution', 'languages', 'barcodes', 'formulas', 'keyValuePairs', 'styleFont', 'queryFields']
# v3.3 (GA):      ['ocrHighResolution', 'languages', 'barcodes', 'formulas', 'styleFont']

DOC_INTEL_FEATURES = ['ocrHighResolution', 'languages', 'styleFont']

### Load environment variables and setup the Document Intelligence client

In [None]:
# Load environment variables from Function App local settings file
with open("../function_app/local.settings.json", "rb") as f:
    local_settings = json.load(f)
    os.environ.update(local_settings["Values"])

DOC_INTEL_ENDPOINT = os.getenv("DOC_INTEL_ENDPOINT")
DOC_INTEL_API_KEY = os.getenv("DOC_INTEL_API_KEY")

# Construct the Document Intelligence clients
if USE_DOC_INTEL_PREVIEW_VERSION:
    # Doc Intelligence v4.0 (preview) - only available in selected regions
    di_client = DocumentIntelligenceClient(
        endpoint=DOC_INTEL_ENDPOINT, 
        credential=AzureKeyCredential(DOC_INTEL_API_KEY),
        api_version="2024-07-31-preview",
    )
    enabled_features = [DocumentAnalysisFeature(feature) for feature in DOC_INTEL_FEATURES]
else:
    # Doc Intelligence v3.3 (GA) - Available globally
    di_client = DocumentAnalysisClient(
        endpoint=DOC_INTEL_ENDPOINT, 
        credential=AzureKeyCredential(DOC_INTEL_API_KEY),
        api_version="2023-07-31",
    )
    enabled_features = [AnalysisFeature(feature) for feature in DOC_INTEL_FEATURES]
print("Selected Document Intelligence Features:", [feature.value for feature in enabled_features])

In [None]:
# Define helper objects/functions
from typing import Optional, Union
import base64


@dataclass
class SamplePdfFileInfo:
    name: str
    description: str
    url_source: Optional[str] = None
    file_path: Optional[str] = None

def convert_pdf_to_base64(pdf_path: str):
    # Read the PDF file in binary mode, encode it to base64, and decode to string
    with open(pdf_path, "rb") as file:
        base64_encoded_pdf = base64.b64encode(file.read()).decode()
    return base64_encoded_pdf

def get_analyze_document_result(
    sample_pdf_file_info: SamplePdfFileInfo,
    di_client: Union[DocumentIntelligenceClient, DocumentAnalysisClient],
    model_id: str = "prebuilt-layout",
    **kwargs
) -> Union[AnalyzeResult, FormRecognizerAnalyzeResult]:
    """
    Gets the AnalyzeResult for a sample PDF file using a Document Intelligence
    client.
    """
    if isinstance(di_client, DocumentIntelligenceClient):
        if sample_pdf_file_info.url_source:
            analyze_request = AnalyzeDocumentRequest(url_source=sample_pdf_file_info.url_source)
        elif sample_pdf_file_info.file_path:
            analyze_request = AnalyzeDocumentRequest(bytes_source=convert_pdf_to_base64(sample_pdf_file_info.file_path))
        else:
            raise ValueError("No valid source provided.")
    
        poller = di_client.begin_analyze_document(
            model_id=model_id,
            analyze_request=analyze_request,
            **kwargs
        )
        new_result = poller.result()
    else:
        if sample_pdf_file_info.url_source:
            poller = di_client.begin_analyze_document_from_url(
                model_id=model_id,
                document_url=sample_pdf_file_info.url_source,
                **kwargs
            )
        elif sample_pdf_file_info.file_path:
            with open(sample_pdf_file_info.file_path, "rb") as document:
                poller = di_client.begin_analyze_document(
                    model_id=model_id,
                    document=document,
                    **kwargs
                )
        else:
            raise ValueError("No valid source provided.")
        new_result = poller.result()
    return new_result

# Set up a list of PDFs for testing
We will use a set of PDFs to showcase how the processor works. These examples include different elements such as tables, inline figures, document structures/hierarchies, and varying page counts. You can add your own files here for testing.

In [None]:
# # Get raw file links
# doc_intelligence_test_files = {
#     "ikea": SamplePdfFileInfo(
#         name="IKEA Installation Manual",
#         description=(
#             "An instruction manual for installing an Ikea kitchen. "
#             "This document contains 12 pages, consisting of a series of diagrams and written instructions. "
#             "This shows how we can extract individual figures and text from the document."
#         ),
#         url_source="https://www.ikea.com/au/en/files/pdf/c7/ef/c7ef4878/kitchen-installation-guide_fy22.pdf",
#     ),
#     "rotated_image_pdf": SamplePdfFileInfo(
#         name="Rotated Image PDF",
#         description=(
#             "A single-page PDF containing an embedded image. "
#             "This document shows how we can automatically extract images from a PDF, "
#             "correcting the rotation of the page and figure images automatically."
#         ),
#         url_source="https://github.com/pymupdf/PyMuPDF/blob/main/tests/resources/test_delete_image.pdf?raw=true",
#     ),
#     "editorial_page": SamplePdfFileInfo(
#         name="Editorial Page (Mixed content)",
#         description=(
#             "A single-page editorial article with a mix of text and figures on the page. "
#             "Another example of extracting both text and figures from the page."
#         ),
#         url_source="https://github.com/pymupdf/PyMuPDF/blob/main/tests/resources/001003ED.pdf?raw=true",
#     ),
#     "multicolumn_pdf": SamplePdfFileInfo(
#         name="Multicolumn PDF with table",
#         description="A 3-page test PDF containing multi-column text and a table to be extracted.",
#         url_source="https://github.com/py-pdf/sample-files/blob/main/026-latex-multicolumn/multicolumn.pdf?raw=true",
#     ),
#     "rotated_proof_of_delivery_pdf": SamplePdfFileInfo(
#         name="Rotated Delivery Receipt",
#         description="A 3-page test PDF containing multi-column text and a table to be extracted.",
#         url_source="https://github.com/Azure/multimodal-ai-llm-processing-accelerator/blob/main/demo_app/demo_files/Rotated%20Proof%20of%20Delivery%20Receipt.jpg?raw=true",
#     ),
# }

doc_intelligence_test_files = {
    "oral_cancer_range1-6": SamplePdfFileInfo(
        name="adobe convert oral_cancer_text_5th_table&image",
        description=(
            "An oral cancer description. "
            "This document contains 1 page, consisting of a series of images and tables. "
            "This shows how we can extract individual figures, text and tables from the document."
        ),
        file_path="/home/azureuser/multimodal-ai-llm-processing-accelerator/multimodel_pdf/oral_cancer_range1-6.pdf",
    )
}

# Print file links
print("Sample PDF files to analyze:")
for name, sample_pdf_file_info in doc_intelligence_test_files.items():
    print(f"- '{name}': {sample_pdf_file_info.description}")

## Define Document Intelligence Processor configuration
Now we will define the configuration for processing the Document Intelligence result.

Document Intelligence returns many different element objects which correspond to different types of content in the document. Some examples are pages, sections, paragraphs, lines, words, figures, tables and more. Each of these elements has a corresponding processor which is designed to handle that type of element. Some processors simply format the output (e.g. for paragraphs and lines), while others perform data processing to load and convert the raw API response into something more usable.

Here are some key things to look at:

##### 1. Processed output dataclass type (Haystack `Document`s)
Because the resulting information comes in different types (e.g. text, images, or dataframes), the processors use Haystack's Document dataclass as the standardised data structure for storing the content. These can then be post-processed into other formats such as OpenAI message dictionaries.

##### 2. Formatting text content
When outputting text, the format of that text can be customized using the `*_text_formats` parameters of each element processor. The format can be defined such that different elements are returned in a specific format (such as italicizing text or adding prefixes or suffixes). The format can feature placeholders for different parts of the content, with the list of possible placeholders shown in the constructor docstring for each processor class.
* If the processor processes an element that does not contain any information for the placeholder, that format will be skipped. 
* For example, if the text formats for a figure processor were `["*Figure Caption:* {caption}", "*Figure Footnotes:* {footnotes}", "*Figure Content:*\n{content}"]` but the figure did not have a caption or footnotes, only the final text format string would be returned in the result.
* Final result: `"*Figure Content:*\n<text_extracted_from_inside_the_image>"`
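
The skipping behaviour described above can be illustrated with a minimal sketch. This is not the repository's implementation: `render_text_formats` is a hypothetical helper written only for illustration, and it assumes that a placeholder counts as "missing" when its value is `None` or empty.

```python
import string


def render_text_formats(text_formats, values):
    """Render each format string, skipping any whose placeholders have no content.

    Hypothetical helper for illustration only; not part of the repository.
    """
    formatter = string.Formatter()
    rendered = []
    for text_format in text_formats:
        # Collect the placeholder names used by this format string
        field_names = [name for _, name, _, _ in formatter.parse(text_format) if name]
        # Skip the whole format string if any placeholder has no content
        if any(not values.get(name) for name in field_names):
            continue
        rendered.append(text_format.format(**values))
    return rendered


figure_formats = [
    "*Figure Caption:* {caption}",
    "*Figure Footnotes:* {footnotes}",
    "*Figure Content:*\n{content}",
]
# A figure with no caption or footnotes (sample values are made up)
figure_values = {"caption": None, "footnotes": None, "content": "Quarterly sales chart"}
print(render_text_formats(figure_formats, figure_values))
```

Only the format strings whose placeholders all have content survive, which keeps empty labels such as a bare "Figure Caption:" out of the final text.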

##### 3. Outputting and rotating images
The Page and Figure processors can automatically output page and figure images into the output, maintaining the correct order of the content. These processors can also automatically adjust the rotation of the image using the `angle` of the page as detected by Document Intelligence. This ensures that all output images are rotated correctly prior to storing and using those images (this makes a big difference in how accurately LLMs can analyze those images).
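
To see why a fill color is needed when correcting rotation, consider the geometry: rotating a rectangular image by an angle that is not a multiple of 90 degrees requires a larger canvas, and the uncovered corners must be painted with something. The sketch below only computes that canvas size. It is an illustration of the geometry, not the processors' actual code, and `rotated_canvas_size` is a hypothetical name.

```python
import math


def rotated_canvas_size(width, height, angle_degrees):
    """Return the (width, height) of the smallest canvas that holds the image
    after rotating it by `angle_degrees` without cropping.

    Hypothetical helper for illustration only.
    """
    theta = math.radians(angle_degrees)
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    # Bounding box of a rotated rectangle
    new_width = width * cos_t + height * sin_t
    new_height = width * sin_t + height * cos_t
    return round(new_width), round(new_height)


# A portrait page rotated by a quarter turn becomes landscape
print(rotated_canvas_size(850, 1100, 90))
# A slightly skewed scan needs a slightly larger canvas; the extra area is
# what a fill color (e.g. white) would cover
print(rotated_canvas_size(850, 1100, 3.5))
```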

##### 4. Splitting output `Document`s into chunks
When processing large documents, it can be useful to split the original document into separate chunks prior to processing. While the processors are not meant to completely replace other chunking tools (e.g. those in `LangChain` and other libraries), most chunkers are designed to work only with text content and will break if they are given multimodal data. To help avoid this, a `Splitter` can be used to split processed results. The default option is the `PageDocumentListSplitter`, which splits the list of full outputs into chunks based on the page number of the original content. This is useful when you want to take a large document (e.g. 50+ pages), process it in chunks, and then combine the chunk results into a final document-level result. This is a common pattern in document processing (known as the Map Reduce pattern).
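
As a rough sketch of page-based splitting, the example below groups content items into chunks of N pages. It does not use the repository's `PageDocumentListSplitter`; `ContentDoc`, `split_by_page` and the `page_number` metadata key are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class ContentDoc:
    """Hypothetical stand-in for a processed content Document."""
    content: str
    meta: dict = field(default_factory=dict)


def split_by_page(docs, pages_per_chunk=1):
    """Group docs into chunks that each cover `pages_per_chunk` pages."""
    def chunk_index(doc):
        # Pages are assumed 1-based: pages 1..N map to chunk 0, N+1..2N to chunk 1, ...
        return (doc.meta["page_number"] - 1) // pages_per_chunk

    # sorted() is stable, so the original order is kept within each chunk
    ordered = sorted(docs, key=chunk_index)
    return [list(group) for _, group in groupby(ordered, key=chunk_index)]


docs = [
    ContentDoc("Intro paragraph", {"page_number": 1}),
    ContentDoc("Table 1", {"page_number": 1}),
    ContentDoc("Figure 1", {"page_number": 2}),
    ContentDoc("Conclusion", {"page_number": 3}),
]
for number, chunk in enumerate(split_by_page(docs, pages_per_chunk=2), start=1):
    print(number, [doc.content for doc in chunk])
```

In a Map Reduce flow, each chunk would be sent to the LLM separately (the map step) and the per-chunk results would then be combined in a final call (the reduce step).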

##### 5. Merging output `Document`s 
By default, each individual element within a document will result in one or more outputs. This means that a single page of paragraphs, lines or words would result in many different output `Document`s. Converting these to LLM messages would then result in hundreds of messages for each document. To prevent this, we can merge adjacent text elements together to reduce the number of text objects prior to converting them into LLM messages. This is accomplished with the processor's `merge_adjacent_text_content_docs` method.
* Example result:
* Original documents: [text, text, image, text, text, text, table (markdown), text, image]
* Merged documents: [text, image, text, image]
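
The merge in the example above can be sketched in a few lines. This is a simplified illustration that works on `(kind, value)` tuples rather than Haystack `Document`s; `merge_adjacent_text` is a hypothetical helper, and it assumes a table rendered as markdown is treated as text.

```python
def merge_adjacent_text(items, separator="\n\n"):
    """Merge runs of adjacent text items; leave images where they are.

    `items` is a list of (kind, value) tuples where kind is "text" or "image".
    Hypothetical helper for illustration only.
    """
    merged = []
    for kind, value in items:
        if kind == "text" and merged and merged[-1][0] == "text":
            # Extend the previous text item instead of adding a new one
            merged[-1] = ("text", merged[-1][1] + separator + value)
        else:
            merged.append((kind, value))
    return merged


original = [
    ("text", "Heading"), ("text", "Paragraph 1"), ("image", b"<page image>"),
    ("text", "Paragraph 2"), ("text", "Paragraph 3"), ("text", "Paragraph 4"),
    ("text", "| a | b |"),  # a table rendered as markdown is just text
    ("text", "Paragraph 5"), ("image", b"<figure image>"),
]
print([kind for kind, _ in merge_adjacent_text(original)])
```

Images act as boundaries, so the relative order of text and images is preserved while the number of items drops from nine to four.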

In [28]:
### Document Intelligence Processor config

# Define processors for individual components
selection_mark_formatter = DefaultSelectionMarkFormatter(
    selected_replacement="[X]", 
    unselected_replacement="[ ]"
)
section_processor = DefaultDocumentSectionProcessor(
    text_format=None,
)
page_processor = DefaultDocumentPageProcessor(
    page_start_text_formats=["\n*Page {page_number} content:*"],
    page_end_text_formats=None,
    page_img_order="after",
    page_img_text_intro="*Page {page_number} Image:*",
    img_export_dpi=100,
    adjust_rotation = True,
    rotated_fill_color = (255, 255, 255),
)
table_processor = DefaultDocumentTableProcessor(
    before_table_text_formats=["**Table {table_number} Info**\n", "*Table Caption:* {caption}", "*Table Footnotes:* {footnotes}", "*Table Content:*"],
    after_table_text_formats=None,
)

# add content to text format
figure_processor = MarkdownImageTagDocumentFigureProcessor(
    before_figure_text_formats=["**Figure {figure_number} Info**\n", "*Figure Caption:* {caption}", "*Figure Footnotes:* {footnotes}", "*Figure Content:*\n{content}"],
    output_figure_img=True,
    figure_img_text_format="*Figure Content:*\n{content}",
    after_figure_text_formats=None,
)
key_value_pair_processor = DefaultDocumentKeyValuePairProcessor(
    text_format = "*Key Value Pair*: {key_content}: {value_content}",
)
paragraph_processor = DefaultDocumentParagraphProcessor(
    general_text_format = "{content}",
    page_header_format = None,
    page_footer_format = None,
    title_format = "\n{heading_hashes} **{content}**",
    section_heading_format = "\n{heading_hashes} **{content}**",
    footnote_format = "*Footnote:* {content}",
    formula_format = "*Formula:* {content}",
    page_number_format = None,
)
line_processor = DefaultDocumentLineProcessor()
word_processor = DefaultDocumentWordProcessor()

# Now construct the DocumentIntelligenceProcessor class which uses each of these sub-processors
doc_intel_result_processor = DocumentIntelligenceProcessor(
    page_processor = page_processor,
    section_processor = section_processor,
    table_processor = table_processor,
    figure_processor = figure_processor,
    paragraph_processor = paragraph_processor,
    line_processor = line_processor,
    word_processor = word_processor,
    selection_mark_formatter = selection_mark_formatter
)

# Now construct a splitter which can separate the outputs into different chunks
chunk_splitter = PageDocumentListSplitter(pages_per_chunk=1)

# Process Documents
We will now process our documents, inspecting the result and converting it into various formats.

In [51]:
# Set the list of sample PDFs to process
# TEST_PDF_NAMES = ["rotated_proof_of_delivery_pdf", "ikea"] # Select specific test PDFs
TEST_PDF_NAMES = list(doc_intelligence_test_files.keys()) # Use all test PDFs

# Set the max number of elements to process before stopping (this prevents the notebook from getting too long)
BREAK_AFTER_ELEMENT_IDX = 10000

for test_pdf_name in TEST_PDF_NAMES:
    print(f"==================== PROCESSING TEST PDF: '{test_pdf_name}' =========================\n")
    # Load the PDF with PyMuPDF and convert the pages to images. We need to do this to get the images for the pages and figures.
    pdf_url = doc_intelligence_test_files[test_pdf_name].url_source
    file_path = doc_intelligence_test_files[test_pdf_name].file_path
    if pdf_url is not None:
        pdf = load_pymupdf_pdf(pdf_path=None, pdf_url=pdf_url)
    elif file_path is not None:
        pdf = load_pymupdf_pdf(pdf_path=file_path, pdf_url=None)
    else:
        raise ValueError(f"No valid source provided for '{test_pdf_name}'.")
    
    doc_page_imgs = extract_pdf_page_images(pdf, img_dpi=100, starting_idx=1)

    # Get Doc Intelligence result
    di_result = get_analyze_document_result(
        sample_pdf_file_info=doc_intelligence_test_files[test_pdf_name],
        di_client=di_client,
        model_id=DOC_INTEL_MODEL_ID,
        features=enabled_features,
    )

    # Process the API response with the processor
    processed_content_docs = doc_intel_result_processor.process_analyze_result(
        di_result,
        doc_page_imgs=doc_page_imgs, 
        on_error="ignore", 
        break_after_element_idx=BREAK_AFTER_ELEMENT_IDX
    )

    # Split the results into chunks
    #page_chunked_content_docs = chunk_splitter.split_document_list(processed_content_docs)

    # By default, each element outputs a separate data class. Converting these to LLM messages would
    # result in hundreds or thousands of messages for each PDF. We can merge adjacent text elements together
    # to reduce their quantity prior to converting them into LLM messages.
    #merged_subchunk_content_docs = doc_intel_result_processor.merge_adjacent_text_content_docs(page_chunked_content_docs, default_text_merge_separator="\n\n")

    # print("Output info:")
    # print("\n- Number of pages in the file:", len(doc_page_imgs))
    # print("- Number of content Documents:", len(processed_content_docs))
    # print("- Number of content Documents after merging adjacent text Documents:", sum([len(l) for l in merged_subchunk_content_docs]))
    # print("- Number of content chunks after splitting the Documents by page:", len(page_chunked_content_docs))

    # Convert content to OpenAI messages
    #all_content_openai_message = convert_processed_di_docs_to_openai_message(processed_content_docs, role="user")
    #first_chunk_openai_message = convert_processed_di_docs_to_openai_message(merged_subchunk_content_docs[0], role="user")
    
    # Optionally print the OpenAI messages
    # print("\nAll content OpenAI messages:")
    # all_content_openai_message
    # print("First chunk OpenAI messages:")
    # first_chunk_openai_message

    # Print content in the notebook
    #print("\nPrinting the content in markdown format")
    filtered_docs = [doc for doc in processed_content_docs if doc.meta["element_type"] != "DocumentPage"]
    print(convert_processed_di_docs_to_markdown(filtered_docs, default_text_merge_separator="\n"))
    #md(convert_processed_di_docs_to_markdown(filtered_docs, default_text_merge_separator="\n"))
    # for chunk_num, chunk_docs in enumerate(merged_subchunk_content_docs, start=1):
    #     #print(f"*** Chunk {chunk_num} Content ***")
    #     md(convert_processed_di_docs_to_markdown(chunk_docs, default_text_merge_separator="\n"))


# **5**




## **Staging Head and Neck Cancers**

William M. Lydiatt, Snehal G. Patel, John A. Ridge, Brian O'Sullivan, and Jatin P. Shah



### **INTRODUCTION AND OVERVIEW OF KEY CONCEPTS**

Cancers of the head and neck may arise from any of the mucosal surfaces of the upper aerodigestive tract. The American Joint Committee on Cancer (AJCC) Cancer Staging Manual, 8th Edition (8th Edition) introduces a num- ber of significant changes. These include a separate staging algorithm for human papilloma virus-associated cancer, restructuring of the head- and neck- specific cutaneous malignancy chapter, division of the pharynx staging system into three components, changes to the tumor (T) categories, addition of depth of invasion as a T characteristic in oral cancer, and the addition of extranodal tumor extension to the node (N) category.
Maintaining a balance between hazard discrimination, hazard consistency, desirable spread in outcomes, predic- tion of cure, and the highest possible compli

In [32]:
markdownStrResult = convert_processed_di_docs_to_markdown(filtered_docs, default_text_merge_separator="\n")

# Send the document contents to Azure OpenAI
With the document now processed, we can easily convert the output into messages that are ready for processing with Azure OpenAI.
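
For reference, a multimodal chat message in the OpenAI format is a dictionary with a `role` and a list of typed content parts. The sketch below builds one from text and image parts. It is not the repository's `convert_processed_di_docs_to_openai_message` function, whose output may differ; `build_user_message` is a hypothetical helper and the PNG media type is an assumption.

```python
import base64


def build_user_message(parts):
    """Build one OpenAI-style chat message from (kind, value) parts.

    kind is "text" (value is a str) or "image" (value is raw PNG bytes).
    Hypothetical helper for illustration only.
    """
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            # Images are embedded inline as base64 data URLs
            encoded = base64.b64encode(value).decode("utf-8")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encoded}"},
            })
        else:
            raise ValueError(f"Unsupported part kind: {kind}")
    return {"role": "user", "content": content}


message = build_user_message([
    ("text", "*Page 1 content:*"),
    ("image", b"\x89PNG placeholder bytes"),  # placeholder, not a real image
])
print([part["type"] for part in message["content"]])
```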

In [31]:
from openai import AzureOpenAI

AOAI_LLM_DEPLOYMENT = os.getenv("AOAI_LLM_DEPLOYMENT")
AOAI_ENDPOINT = os.getenv("AOAI_ENDPOINT")
AOAI_API_KEY = os.getenv("AOAI_API_KEY")

aoai_client = AzureOpenAI(
    azure_endpoint=AOAI_ENDPOINT,
    azure_deployment=AOAI_LLM_DEPLOYMENT,
    api_key=AOAI_API_KEY,
    api_version="2024-06-01",
    timeout=30,
    max_retries=0,
)

In [None]:
messages = [
    {
        "role": "system",
        "content": '''
        Please summarize the content of the following file in 2000 words or less, in markdown format. Do not wrap the result in a ```markdown ``` code fence; just output the content as markdown.
        If there are any images which are useful, please include them in markdown format.
        If an image reference looks like this: ![Figure 1](test1.png), then add the host URL https://stqiahai4753747325064580.blob.core.windows.net/images4rag before the image name, and add the SAS token after the image name,
        like this: ![Figure 1](https://stqiahai4753747325064580.blob.core.windows.net/images4rag/test1.png?sp=racwdli&st=2024-11-22T05:20:41Z&se=2025-05-10T13:20:41Z&spr=https&sv=2022-11-02&sr=c&sig=5zMT2ynXTiKZX2hHhtzPGdMhMca1lwZsyOc%2F9dUoIxQ%3D)
        ''',
    },
    {
        "role": "user",
        "content": markdownStrResult,
    }
]
response = aoai_client.chat.completions.create(
    messages=messages,
    max_tokens=2000,
    model=AOAI_LLM_DEPLOYMENT
)
md(response.choices[0].message.content)

# **Staging Head and Neck Cancers**

**William M. Lydiatt, Snehal G. Patel, John A. Ridge, Brian O'Sullivan, and Jatin P. Shah**

### **Introduction and Overview of Key Concepts**

Cancers of the head and neck arise from various mucosal surfaces of the upper aerodigestive tract. Significant changes in the 8th Edition of the American Joint Committee on Cancer (AJCC) Cancer Staging Manual include new staging algorithms for human papilloma virus (HPV)-associated cancer, adjustments to head and neck cutaneous malignancies, division of the pharynx into distinct staging compartments, changes to tumor (T) categories, addition of depth of invasion in oral cancer, and inclusion of extranodal tumor extension in the node (N) category.

Maintaining a balance between hazard discrimination and prediction of cure while ensuring compliance was challenging. The move toward personalized medicine will demand individualized predictions of risk and outcomes, potentially replacing traditional cancer staging groupings. Simplified systems will aid compliance but may compromise predictive accuracy.

**Key Changes to Head and Neck Cancer Staging**

Major changes reflect shifts in head and neck oncology. Highlights include:
- A restructured head and neck-specific cutaneous malignancy chapter.
- Division of the Pharynx chapter into three regions.
- Inclusion of a new chapter on HPV-associated oropharyngeal cancers due to their increasing incidence.

**New and Restructured Chapters**

The new structure recognizes the distinct biology of nasopharynx, merges HPV-negative oropharynx and hypopharynx, and introduces a chapter for HPV-associated oropharyngeal cancers (OPC). HPV-associated cancers are distinct due to their different biology and risk factors, necessitating separate staging.

**Rules for Classification**

Two systems were developed: pathological TNM (pTNM) for surgically treated cases and clinical TNM (cTNM) for pretreatment assessment. Clinical staging data should be collected for all patients to provide a uniform standard for treatment planning and prognostic prediction.

**Definition of Primary Tumor (T): Changes**

T categories are mostly unchanged, except for skin, nasopharynx, and oral cavity. The TO category is eliminated for most sites. Changes include:
- For skin cancer, depth of invasion and perineural invasion upstage lesions to T3.
- In nasopharynx, muscle involvement is redefined for precise staging.
- For oral cavity, depth of invasion (DOI) modifies T categories to better distinguish between tumor types and behaviors.

**Regional Lymph Node (N) Category: Introduction of Extranodal Extension**

ENE is now included in the N category due to its prognostic significance, except for HR-HPV tumors. Clear criteria based on clinical and pathological examination are required to determine ENE status.

**Principles of Staging**

Staging systems are primarily based on clinical assessment before treatment. Techniques like CT, MR imaging, PET, and ultrasonography can enhance accuracy. Clinical examination alone may suffice if imaging is unavailable, especially in low-resource settings.

**Regional Lymph Nodes**

Cervical lymph nodes are critical for prognosis in head and neck cancer. They are subdivided into levels for simple description (Fig. 5.1, Tables 5.1 and 5.2).

**Collection of Key Patient and Tumor Factors**

Efforts to collect prognostic data include factors like comorbidity, performance status, lifestyle factors, and nutrition. Performance status and smoking history should be documented for future staging revisions.

## **Tables and Figures**

### **Table 5.1 Anatomical structures defining the boundaries of the neck levels and sublevels**

| Boundary Level | Superior | Inferior | Anterior (medial) | Posterior (lateral) |
|----------------|----------|----------|-------------------|--------------------|
| IA             | Symphysis of the mandible | Body of the hyo

In [44]:
print(response.choices[0].message.content)

# Staging Head and Neck Cancers

**Authors**: William M. Lydiatt, Snehal G. Patel, John A. Ridge, Brian O'Sullivan, and Jatin P. Shah

## Introduction and Overview of Key Concepts

Head and neck cancers originate from the mucosal surfaces of the upper aerodigestive tract. The 8th Edition of the American Joint Committee on Cancer (AJCC) Cancer Staging Manual introduces significant changes such as separate staging for HPV-associated cancer, changes in tumor categories, and the addition of depth of invasion as a factor in oral cancer staging. The aim is to balance hazard discrimination, consistency, and the prediction of cure, which is critical as personalized medicine advances. Cancer staging is crucial worldwide, and the effort harmonized with the Union for International Cancer Control (UICC) staging systems.

## Key Changes to Head and Neck Cancer Staging

### New and Restructured Chapters

The cutaneous malignancy chapter specific to the head and neck has been restructured. The pharyn