# Document Intelligence with Markdown

In this notebook we experiment with Document Intelligence and its Markdown output. We read several sample documents and test whether Document Intelligence can handle more advanced cases, such as tables that span multiple pages.


### Notes:
- Getting a 404 when calling `document_intelligence_client.begin_analyze_document`
  - Helpful discussion: [ResourceNotFoundError (404) with DocumentIntelligenceClient · Issue #33312 · Azure/azure-sdk-for-python · GitHub](https://github.com/Azure/azure-sdk-for-python/issues/33312)
  - **Solution:** Make sure the resource is in a supported region - [Document layout analysis - Document Intelligence (formerly Form Recognizer) - Azure AI services | Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0)

- Not all pages are analyzed (`sample_data/2_ai_facts.pdf`)
  - Only the first two pages are analyzed. This does not happen with `sample_data/2_ai_facts.docx`.
  - The same problem is observed in Document Intelligence Studio.
  - **Solution:** Change the Document Intelligence resource pricing tier from F0 to S0.


### Python Imports


In [1]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append('..\\code')


import os
from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, Markdown, HTML
from PIL import Image
from doc_utils import *


def show_img(img_path, width = None):
    if width is not None:
        display(HTML(f'<img src="{img_path}" width={width}>'))
    else:
        display(Image.open(img_path))


[nltk_data] Downloading package punkt_tab to
[nltk_data]     c:\Users\navinparmar\source\repos\multimodal-rag-code-
[nltk_data]     execution\.conda\lib\site-
[nltk_data]     packages\llama_index\core\_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!


### Make sure we have the OpenAI Models information

We will need the GPT-4-Turbo and GPT-4-Vision models for this notebook.

When you run the cell below, the values should reflect the Azure OpenAI resource you configured in the `.env` file.

In [2]:
model_info = {
        'AZURE_OPENAI_RESOURCE': os.environ.get('AZURE_OPENAI_RESOURCE'),
        'AZURE_OPENAI_KEY': os.environ.get('AZURE_OPENAI_KEY'),
        'AZURE_OPENAI_MODEL_VISION': os.environ.get('AZURE_OPENAI_MODEL_VISION'),
        'AZURE_OPENAI_MODEL': os.environ.get('AZURE_OPENAI_MODEL'),
}
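
Since `os.environ.get` returns `None` for any variable that is not set, a quick check before calling a model can catch an incomplete `.env` file early. The sketch below is a minimal example with made-up values; the helper name `missing_settings` is hypothetical and not part of `doc_utils`.

```python
def missing_settings(settings):
    """Return the names of settings whose value is missing or empty."""
    return [name for name, value in settings.items() if not value]


# Example with made-up values; in the notebook you would pass model_info instead
example_info = {
    "AZURE_OPENAI_RESOURCE": "my-resource",
    "AZURE_OPENAI_KEY": None,
}
print(missing_settings(example_info))
```

An empty list means every expected value was found.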

### Experimenting with Document Intelligence

First, make sure the right version of the Document Intelligence SDK is installed.

In [3]:
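## This version corresponds to API Version 2024-02-29-preview
## Visit: https://learn.microsoft.com/en-us/python/api/overview/azure/ai-documentintelligence-readme?view=azure-python-preview
## The requirement is quoted because an unquoted ">" is treated by the shell as an output redirection.
## It is pinned because the code below uses names from this preview release (e.g. ContentFormat, analyze_request).

%pip install "azure-ai-documentintelligence==1.0.0b2"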

Note: you may need to restart the kernel to use updated packages.


Define a helper function that sends a document to the prebuilt layout model and requests Markdown output.

In [4]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, ContentFormat

DI_ENDPOINT = os.environ["DI_ENDPOINT"]
DI_KEY = os.environ["DI_KEY"]

def analyze_document(path):
    """Analyze a local file with the prebuilt layout model and return the result with Markdown content."""
    document_intelligence_client = DocumentIntelligenceClient(endpoint=DI_ENDPOINT, credential=AzureKeyCredential(DI_KEY))

    with open(path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, output_content_format=ContentFormat.MARKDOWN, content_type="application/octet-stream"
        )
    result: AnalyzeResult = poller.result()

    return result



In [5]:
import fitz  # PyMuPDF

# Create a directory to store the outputs
work_dir = "sample_data/pdf_outputs"
os.makedirs(work_dir, exist_ok=True)

# Load a sample PDF document

def read_pdf(pdf_doc):
    doc = fitz.open(pdf_doc)
    print(f"PDF File {os.path.basename(pdf_doc)} has {len(doc)} pages.")
    return doc

def nb_extract_pages_as_png_files(doc):
    png_files = []
    for page in doc:
        page_num = page.number
        img_path = f"{work_dir}/page_{page_num}.png"
        page_pix = page.get_pixmap(dpi=300)
        page_pix.save(img_path)
        print(f"Page {page_num} saved as {img_path}")
        png_files.append(img_path)
    
    return png_files


pdf_doc = "sample_data/1_London_Brochure.pdf"
doc = read_pdf(pdf_doc)
png_files = nb_extract_pages_as_png_files(doc)  


PDF File 1_London_Brochure.pdf has 2 pages.
Page 0 saved as sample_data/pdf_outputs/page_0.png
Page 1 saved as sample_data/pdf_outputs/page_1.png


#### Reading in Sample Documents

In [6]:
path = r"sample_data/1_London_Brochure.docx"
london_docx_result = analyze_document(path)
Markdown(london_docx_result['content'])

Margie’s Travel Presents…


# London

London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.

London's ancient core, the City of London, largely retains its 1.12-square- mile medieval boundaries.

Since at least the 19th century, London has also referred to the metropolis around this core, historically split between Middlesex, Essex, Surrey, Kent, and Hertfordshire, which today largely makes up Greater London, governed by the Mayor of London and the London Assembly.


## Mostly popular for:

Leisure, Outdoors, Historical, Arts & Culture


## Best time to visit:

Jun-Aug

Averag Precipitation: 1.9 in Average Temperature: 56-67°F

London Hotels

Margie’s Travel offers the following accommodation options in London:


## The Buckingham Hotel

Comfortable hotel close to major sights like Buckingham Palace, Regent’s Park, and Trafalgar Square.


## The City Hotel

Luxury rooms in the city, within walking distance of Tower Bridge and the Tower of London..


## The Kensington Hotel

Budget accommodation near Earl’s Court.

To book your trip to London, visit www.margiestravel.com


<table>
<tr>
<th>Category</th>
<th>Information</th>
</tr>
<tr>
<td>Country</td>
<td>United Kingdom</td>
</tr>
<tr>
<td>Capital Of</td>
<td>England</td>
</tr>
<tr>
<td>Currency</td>
<td>Pound Sterling (GBP)</td>
</tr>
<tr>
<td>Population (2021 census)</td>
<td>Approximately 8.8 million</td>
</tr>
<tr>
<td>Famous For</td>
<td>Historical landmarks, museums, cultural diversity</td>
</tr>
</table>


In [7]:
path = r"sample_data/1_London_Brochure.pdf"
london_pdf_result = analyze_document(path)
Markdown(london_pdf_result['content'])

<!-- PageHeader="Margie's Travel Presents ..." -->


# London


<figure>

London is the capital and
most populous city of
England and the United
Kingdom. Standing on the
River Thames in the south
east of the island of Great
Britain, London has been
a major settlement for two
millennia. It was founded
by the Romans, who
named it Londinium.
London's ancient core, the
City of London, largely
retains its 1.12-square-
Tower Bridge
mile medieval boundaries.
Since at least the 19th century, London has also referred to the metropolis around this core,
historically split between Middlesex, Essex, Surrey, Kent, and Hertfordshire, which today largely
makes up Greater London, governed by the Mayor of London and the London Assembly.

</figure>


<figure>

Fountains in Trafalgar Square

</figure>


Mostly popular for:
Leisure, Outdoors, Historical, Arts
& Culture

Best time to visit:
Jun-Aug
Averag Precipitation: 1.9 in
Average Temperature: 56-67ºF


## London Hotels

Margie's Travel offers the following accommodation options in London:

The Buckingham Hotel

Comfortable hotel close to major sights like Buckingham Palace, Regent's Park, and Trafalgar
Square.

The City Hotel

Luxury rooms in the city, within walking distance of Tower Bridge and the Tower of London ..

The Kensington Hotel
Budget accommodation near Earl's Court.

To book your trip to London, visit www.margiestravel.com

<!-- PageBreak -->


<table>
<tr>
<th>Category</th>
<th>Information</th>
</tr>
<tr>
<td>Country</td>
<td>United Kingdom</td>
</tr>
<tr>
<td>Capital Of</td>
<td>England</td>
</tr>
<tr>
<td>Currency</td>
<td>Pound Sterling (GBP)</td>
</tr>
<tr>
<td>Population (2021 census)</td>
<td>Approximately 8.8 million</td>
</tr>
<tr>
<td>Famous For</td>
<td>Historical landmarks, museums, cultural diversity</td>
</tr>
</table>
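
The PDF output marks page boundaries with `<!-- PageBreak -->` comments, as seen above. A minimal sketch of splitting the Markdown into per-page chunks on that marker follows; the function name and the sample string are illustrative only.

```python
PAGE_BREAK = "<!-- PageBreak -->"


def split_pages(markdown_text):
    """Split Markdown content into per-page chunks on the page break marker."""
    return [chunk.strip() for chunk in markdown_text.split(PAGE_BREAK)]


# Made-up sample that mimics the shape of the output above
sample = "# London\n\nSome text\n\n<!-- PageBreak -->\n\n<table>\n</table>\n"
pages = split_pages(sample)
print(len(pages))
```

A document without the marker comes back as a single chunk, which is what the DOCX output above would produce.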


### Figures

For the same document, figures are detected in the PDF version but not in the DOCX version. The PDF result therefore contains an additional `figures` key.

In [8]:
london_pdf_result.keys()

dict_keys(['apiVersion', 'modelId', 'stringIndexType', 'content', 'pages', 'tables', 'paragraphs', 'contentFormat', 'sections', 'figures'])

In [9]:
london_docx_result.keys()

dict_keys(['apiVersion', 'modelId', 'stringIndexType', 'content', 'pages', 'tables', 'paragraphs', 'contentFormat', 'sections'])

In [10]:
london_pdf_result['figures']

[{'id': '1.1', 'boundingRegions': [{'pageNumber': 1, 'polygon': [2.7432, 2.5977, 7.1195, 2.5981, 7.1197, 5.0761, 2.7433, 5.0756]}], 'spans': [{'offset': 64, 'length': 676}], 'elements': ['/paragraphs/2']},
 {'id': '1.2', 'boundingRegions': [{'pageNumber': 1, 'polygon': [0.9692, 5.8349, 5.0643, 5.8345, 5.0639, 8.1202, 0.9687, 8.1205]}], 'spans': [{'offset': 743, 'length': 50}], 'elements': ['/paragraphs/3']}]
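
Because the `figures` key is only present when figures were detected, reading it with a default avoids a `KeyError` for results such as the DOCX one. The sketch below uses plain dictionaries shaped like the output above and assumes the real result behaves like a mapping, as the `.keys()` calls suggest; `figure_regions` is a hypothetical helper. It also converts the 1-based `pageNumber` into a 0-based index, which matches how the page images were numbered earlier.

```python
def figure_regions(result):
    """Yield (figure id, zero-based page index, polygon) for each detected figure."""
    for figure in result.get("figures", []):
        for region in figure["boundingRegions"]:
            yield figure["id"], region["pageNumber"] - 1, region["polygon"]


# Plain dictionaries that mimic the two results shown above
pdf_like = {
    "figures": [
        {
            "id": "1.1",
            "boundingRegions": [
                {"pageNumber": 1, "polygon": [2.7432, 2.5977, 7.1195, 2.5981, 7.1197, 5.0761, 2.7433, 5.0756]}
            ],
        }
    ]
}
docx_like = {"content": "..."}

print(list(figure_regions(pdf_like)))
print(list(figure_regions(docx_like)))
```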

In [11]:
from PIL import Image

def get_dpi(image):
    try:
        dpi = image.info['dpi']
        print("DPI", dpi)
    except KeyError:
        dpi = (300, 300)
    return dpi

def polygon_to_bbox(polygon):
    xs = polygon[::2]  # Extract all x coordinates
    ys = polygon[1::2]  # Extract all y coordinates
    left = min(xs)
    top = min(ys)
    right = max(xs)
    bottom = max(ys)
    return (left, top, right, bottom)


def inches_to_pixels(inches, dpi):
    dpi_x, dpi_y = dpi
    return [int(inches[i] * dpi_x if i % 2 == 0 else inches[i] * dpi_y) for i in range(len(inches))]

polygon = [2.7496, 2.5964, 7.1241, 2.597, 7.1249, 5.0656, 2.7505, 5.0645]


def extract_figure(image_path, polygon):
    bbox_in_inches = polygon_to_bbox(polygon)

    # Load the image
    image = Image.open(image_path)

    filename = os.path.splitext(os.path.basename(image_path))[0].strip()
    extension = os.path.splitext(os.path.basename(image_path))[1].strip()
    crop_name = os.path.join(os.path.dirname(image_path), f"{filename}_{generate_uuid_from_string(str(polygon))}{extension}")
    print(f"Cropped image will be saved under name: {crop_name}")

    # Get DPI from the image
    dpi = get_dpi(image)

    # Convert the bounding box to pixels using the image's DPI
    bbox_in_pixels = inches_to_pixels(bbox_in_inches, dpi)
    print("bbox_in_pixels", bbox_in_pixels)

    # Crop the image
    cropped_image = image.crop(bbox_in_pixels)
    cropped_image.save(crop_name)  
    show_img(crop_name, 400)

for figure in london_pdf_result['figures']:
    region = figure['boundingRegions'][0]
    # pageNumber is 1-based, while png_files is indexed from 0
    extract_figure(png_files[region['pageNumber'] - 1], region['polygon'])

Cropped image will be saved under name: sample_data/pdf_outputs\page_0_583cc04d-aace-3a7c-fa43-c76a6d151339.png
DPI (299.9994, 299.9994)
bbox_in_pixels [822, 779, 2135, 1522]


Cropped image will be saved under name: sample_data/pdf_outputs\page_0_342e7ecd-f4c7-f17a-874e-41eca46b7dfa.png
DPI (299.9994, 299.9994)
bbox_in_pixels [290, 1750, 1519, 2436]
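
The pixel boxes printed above follow directly from the polygons returned by Document Intelligence: each polygon, given in inches, is reduced to its bounding box and every coordinate is multiplied by the image DPI. The standalone sketch below repeats that arithmetic for the two figures using the values shown in the outputs, without needing the image files. The function name is illustrative.

```python
def polygon_to_pixel_bbox(polygon, dpi):
    """Convert a flat [x0, y0, x1, y1, ...] polygon in inches to a pixel bounding box."""
    xs, ys = polygon[::2], polygon[1::2]
    dpi_x, dpi_y = dpi
    return [
        int(min(xs) * dpi_x),  # left
        int(min(ys) * dpi_y),  # top
        int(max(xs) * dpi_x),  # right
        int(max(ys) * dpi_y),  # bottom
    ]


dpi = (299.9994, 299.9994)  # DPI reported for the saved page image
figure_1 = [2.7432, 2.5977, 7.1195, 2.5981, 7.1197, 5.0761, 2.7433, 5.0756]
figure_2 = [0.9692, 5.8349, 5.0643, 5.8345, 5.0639, 8.1202, 0.9687, 8.1205]

print(polygon_to_pixel_bbox(figure_1, dpi))  # [822, 779, 2135, 1522]
print(polygon_to_pixel_bbox(figure_2, dpi))  # [290, 1750, 1519, 2436]
```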


### Document Intelligence with MS Word documents (docx)

Next, we analyze another sample document, this time one that contains a long table.

In [12]:
path = r"sample_data/2_ai_facts.docx"
docx_result = analyze_document(path)
Markdown(docx_result['content'])

# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating developments and applications that stretch the boundaries of what we once thought possible. One of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to learn and mimic human creativity. These applications have not only provided new tools for artists and creators but have also sparked debates about the nature of creativity and whether it is a uniquely human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated world champions in games like Go, which is known for its intricate strategy and depth. These achievements are not just about winning games; they represent significant advancements in AI's problem-solving and strategic planning capabilities. The algorithms developed for these purposes have broader applications, such as solving complex logistical problems and improving decision-making processes in business and science.


## AI Facts


<table>
<tr>
<th>Fact Category</th>
<th>Detail</th>
</tr>
<tr>
<td>Definition</td>
<td>AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans.</td>
</tr>
<tr>
<td>First AI Concept</td>
<td>The concept of artificial intelligence was first theorized by Alan Turing, a British mathematician, in 1950 through the Turing Test.</td>
</tr>
<tr>
<td>First AI Program</td>
<td>The first AI program was developed by Allen Newell and Herbert A. Simon in 1955, which was called the Logic Theorist.</td>
</tr>
<tr>
<td>Machine Learning</td>
<td>Machine learning is a subset of AI that involves the development of algorithms allowing computers to learn from and make decisions based on data.</td>
</tr>
<tr>
<td>Deep Learning</td>
<td>Deep learning is a subset of machine learning involving neural networks with many layers, enabling advanced pattern recognition.</td>
</tr>
<tr>
<td>AI in Healthcare</td>
<td>AI applications in healthcare include diagnostics, personalized medicine, and robot-assisted surgery.</td>
</tr>
<tr>
<td>AI in Autonomous Vehicles</td>
<td>AI enables autonomous vehicles to perceive their environment and make decisions, reducing the need for human intervention.</td>
</tr>
<tr>
<td>Natural Language Processing (NLP)</td>
<td>NLP is a field of AI focused on the interaction between computers and humans through natural language.</td>
</tr>
<tr>
<td>Ethics in AI</td>
<td>Ethical concerns in AI include privacy, bias, accountability, and the potential for job displacement.</td>
</tr>
<tr>
<td>AI in Finance</td>
<td>AI is used in finance for algorithmic trading, fraud detection, and customer service through chatbots.</td>
</tr>
<tr>
<td>Robotics</td>
<td>Robotics is a field related to AI, focusing on the design and manufacturing of robots that can perform tasks autonomously.</td>
</tr>
<tr>
<td>AI in Gaming</td>
<td>AI enhances gaming experiences by enabling more complex and realistic NPC behaviors and game environments.</td>
</tr>
<tr>
<td>AI in Education</td>
<td>AI can personalize learning experiences through adaptive learning systems and automate administrative tasks.</td>
</tr>
<tr>
<td>AI in Entertainment</td>
<td>AI is used in entertainment for content recommendation algorithms, music composition, and visual effects.</td>
</tr>
<tr>
<td>Quantum Computing and AI</td>
<td>Quantum computing has the potential to significantly increase the processing power available for AI tasks, leading to breakthroughs in complex problem-solving.</td>
</tr>
<tr>
<td>Bias in AI</td>
<td>AI systems can exhibit bias if they are trained on biased data sets, leading to unfair outcomes.</td>
</tr>
<tr>
<td>AI and Job Creation</td>
<td>While AI can automate tasks, it also creates new job opportunities in AI development, supervision, and maintenance.</td>
</tr>
<tr>
<td>AI in Agriculture</td>
<td>AI applications in agriculture include crop monitoring, predictive analysis for yield optimization, and automated machinery.</td>
</tr>
<tr>
<td>AI and Climate Change</td>
<td>AI can analyze large datasets to model climate change impacts and optimize energy consumption patterns.</td>
</tr>
<tr>
<td>AI Art and Creativity</td>
<td>AI is used to generate art, music, and literature, challenging traditional notions of creativity.</td>
</tr>
</table>


## Some More Fun Facts

AI has also made unexpected discoveries and contributions to various scientific fields. For instance, AI algorithms have identified new patterns in data that were previously overlooked by human researchers, leading to new scientific discoveries. In the realm of astronomy, AI has helped identify new exoplanets and galaxies by analyzing vast amounts of data from telescopes and space missions. Similarly, in biology, AI has accelerated the discovery of new molecules and drugs by predicting their structures and interactions more efficiently than traditional methods.

The relationship between AI and animals presents another intriguing aspect. Researchers have used AI to understand animal behavior and communication better. For example, AI has been employed to decode the complex language of dolphins and to track and predict the migration patterns of birds and other wildlife. These studies not only deepen our understanding of the natural world but also demonstrate AI's potential to bridge the communication gap between humans and other species. As AI continues to evolve, its applications expand into new and unexpected territories, promising to revolutionize our understanding of the world around us and beyond.


In [13]:
print(docx_result['content'])

# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating developments and applications that stretch the boundaries of what we once thought possible. One of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to learn and mimic human creativity. These applications have not only provided new tools for artists and creators but have also sparked debates about the nature of creativity and whether it is a uniquely human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated world champions in games like Go, which is known for its intricate strategy and depth. These achievements are not just about winning games; they represent signif

### Document Intelligence with PDF Documents

Notice that the table is broken across 3 pages, which results in 3 separate tables in the output.

In [14]:
path = r"sample_data/2_ai_facts.pdf"
pdf_result = analyze_document(path)
Markdown(pdf_result['content'])

# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating
developments and applications that stretch the boundaries of what we once thought possible. One
of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used
to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to
learn and mimic human creativity. These applications have not only provided new tools for artists
and creators but have also sparked debates about the nature of creativity and whether it is a uniquely
human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to
be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated
world champions in games like Go, which is known for its intricate strategy and depth. These
achievements are not just about winning games; they represent significant advancements in AI's
problem-solving and strategic planning capabilities. The algorithms developed for these purposes
have broader applications, such as solving complex logistical problems and improving decision-
making processes in business and science.


## AI Facts


<table>
<tr>
<td>Fact Category</td>
<td>Detail</td>
</tr>
<tr>
<td>Definition</td>
<td>AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans.</td>
</tr>
<tr>
<td>First AI Concept</td>
<td>The concept of artificial intelligence was first theorized by Alan Turing, a British mathematician, in 1950 through the Turing Test.</td>
</tr>
<tr>
<td>First AI Program</td>
<td>The first AI program was developed by Allen Newell and Herbert A. Simon in 1955, which was called the Logic Theorist.</td>
</tr>
<tr>
<td>Machine Learning</td>
<td>Machine learning is a subset of AI that involves the development of algorithms allowing computers to learn from and make decisions based on data.</td>
</tr>
</table>


<!-- PageBreak -->


<table>
<tr>
<td>Deep Learning</td>
<td>Deep learning is a subset of machine learning involving neural networks with many layers, enabling advanced pattern recognition.</td>
</tr>
<tr>
<td>AI in Healthcare</td>
<td>AI applications in healthcare include diagnostics, personalized medicine, and robot-assisted surgery.</td>
</tr>
<tr>
<td>AI in Autonomous Vehicles</td>
<td>AI enables autonomous vehicles to perceive their environment and make decisions, reducing the need for human intervention.</td>
</tr>
<tr>
<td>Natural Language Processing (NLP)</td>
<td>NLP is a field of AI focused on the interaction between computers and humans through natural language.</td>
</tr>
<tr>
<td>Ethics in AI</td>
<td>Ethical concerns in AI include privacy, bias, accountability, and the potential for job displacement.</td>
</tr>
<tr>
<td>AI in Finance</td>
<td>AI is used in finance for algorithmic trading, fraud detection, and customer service through chatbots.</td>
</tr>
<tr>
<td>Robotics</td>
<td>Robotics is a field related to AI, focusing on the design and manufacturing of robots that can perform tasks autonomously.</td>
</tr>
<tr>
<td>AI in Gaming</td>
<td>AI enhances gaming experiences by enabling more complex and realistic NPC behaviors and game environments.</td>
</tr>
<tr>
<td>AI in Education</td>
<td>AI can personalize learning experiences through adaptive learning systems and automate administrative tasks.</td>
</tr>
<tr>
<td>AI in Entertainment</td>
<td>AI is used in entertainment for content recommendation algorithms, music composition, and visual effects.</td>
</tr>
<tr>
<td>Quantum Computing and AI</td>
<td>Quantum computing has the potential to significantly increase the processing power available for AI tasks, leading to breakthroughs in complex problem-solving.</td>
</tr>
<tr>
<td>Bias in AI</td>
<td>AI systems can exhibit bias if they are trained on biased data sets, leading to unfair outcomes.</td>
</tr>
</table>


<!-- PageBreak -->


<table>
<tr>
<td>AI and Job Creation</td>
<td>While AI can automate tasks, it also creates new job opportunities in AI development, supervision, and maintenance.</td>
</tr>
<tr>
<td>AI in Agriculture</td>
<td>AI applications in agriculture include crop monitoring, predictive analysis for yield optimization, and automated machinery.</td>
</tr>
<tr>
<td>AI and Climate Change</td>
<td>AI can analyze large datasets to model climate change impacts and optimize energy consumption patterns.</td>
</tr>
<tr>
<td>AI Art and Creativity</td>
<td>AI is used to generate art, music, and literature, challenging traditional notions of creativity.</td>
</tr>
</table>


## Some More Fun Facts

AI has also made unexpected discoveries and contributions to various scientific fields. For
instance, AI algorithms have identified new patterns in data that were previously overlooked by
human researchers, leading to new scientific discoveries. In the realm of astronomy, AI has helped
identify new exoplanets and galaxies by analyzing vast amounts of data from telescopes and space
missions. Similarly, in biology, AI has accelerated the discovery of new molecules and drugs by
predicting their structures and interactions more efficiently than traditional methods.

The relationship between AI and animals presents another intriguing aspect. Researchers have
used AI to understand animal behavior and communication better. For example, AI has been
employed to decode the complex language of dolphins and to track and predict the migration
patterns of birds and other wildlife. These studies not only deepen our understanding of the natural
world but also demonstrate AI's potential to bridge the communication gap between humans and
other species. As AI continues to evolve, its applications expand into new and unexpected
territories, promising to revolutionize our understanding of the world around us and beyond.


It seems that for the PDF, Document Intelligence did not correctly identify this as a single table, but rather as three separate tables.
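
As a baseline before involving an LLM, tables that are separated only by a page break marker can be joined with a regular expression. This is a swapped-in heuristic, not the approach this notebook takes below: it assumes the HTML table markup and the `<!-- PageBreak -->` comment shown above, and it would also join unrelated tables that happen to sit on either side of a page break. The function name is hypothetical.

```python
import re

# A closing table tag, a page break comment, and an opening table tag,
# with nothing but whitespace in between
SPLIT_TABLE = re.compile(r"</table>\s*<!-- PageBreak -->\s*<table>\s*")


def merge_tables_across_page_breaks(markdown_text):
    """Join tables that are separated only by a page break marker."""
    return SPLIT_TABLE.sub("", markdown_text)


# Made-up sample that mimics the shape of the PDF output above
sample = (
    "<table>\n<tr>\n<td>Definition</td>\n</tr>\n</table>\n\n\n"
    "<!-- PageBreak -->\n\n\n"
    "<table>\n<tr>\n<td>Deep Learning</td>\n</tr>\n</table>\n"
)
merged = merge_tables_across_page_breaks(sample)
print(merged.count("<table>"))
```

Text that sits between two tables, such as a page header or footnote, prevents the pattern from matching, which is one of the cases the prompting approach is meant to cover.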

In [14]:
print(pdf_result['content'])

# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating
developments and applications that stretch the boundaries of what we once thought possible. One
of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used
to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to
learn and mimic human creativity. These applications have not only provided new tools for artists
and creators but have also sparked debates about the nature of creativity and whether it is a uniquely
human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to
be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated
world champions in games like Go, which is known for its intricate strategy and depth. These
achievements are not just about winning games; they represent signif

### Table Merging

Let's try to merge the tables using OpenAI GPT-4

In [15]:
#### Let's try something "hacky"

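## Hacky check for a table in the text. split() always returns at least one element,
## so test membership instead; the output above uses HTML tables, so check '<table>' too.
if '|\n' in pdf_result['content'] or '<table>' in pdf_result['content']:
    print('yes')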


yes


#### First Step

First try to establish the table boundaries

In [16]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE


prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the markdown text above, detect whether there are Markdown tables. If there are, you **MUST** then output the first row and the last row of each table. We define a Markdown table by having consecutive rows with no space between, with each row separated from the next by '|\n'. Do not separate Markdown tables semantically.


JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and following format:

{{
    "total_number_of_markdown_tables_located": "the total number of Markdown tables located",
    "table_first_and_last_rows_array":
    [
        {{
            "first_row": "The first row of the table in Markdown table format. Make sure that the first row of the table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "last_row": "The last row of the table in Markdown table format. Make sure that the last row of the table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes."
        }}
    ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)




Calling OpenAI APIs with 2 messages - Model: gpt-4t - Endpoint: https://admin-m78bg318-swedencentral.openai.azure.com/openai/
Messages: [{'role': 'system', 'content': 'You are a helpful assistant, who helps the user with their query. You are designed to output JSON.'}, {'role': 'user', 'content': '\nYou are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.\n\nYou are given the following Markdown text:\n## START OF MARKDOWN TEXT\n# AI Facts\n\n\n## Fun Facts\n\nArtificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating\ndevelopments and applications that stretch the boundaries of what we once thought possible. One\nof the most engaging aspects of AI is its application in creative fields. For instance, AI has been used\nto compose music, generate unique pieces of art, and even write poetry
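
The prompt template is filled with `str.format`, which is why the literal braces of the JSON example are doubled: `{{` and `}}` come out as single braces, while `{text}` is replaced. A small standalone illustration (the template text here is made up):

```python
# Doubled braces survive formatting as literal braces; {text} is substituted
template = 'Text: {text}\nReply as JSON: {{"count": "number of tables"}}'
filled = template.format(text="| a | b |")
print(filled)
```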

#### Second Step

Second, use the table boundaries to determine whether the tables were split by page breaks or by the OCR extraction, and if so, try to merge them.

In [17]:

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

The below are the edges that define the start and end of each Markdown table. Use the first and last row mentioned in the "table_first_and_last_rows_array" as the definition of a Markdown table:
## START OF TABLE EDGES
{table_edges}
## END OF TABLE EDGES

In the markdown text above, detect whether there are Markdown tables. If there are, you **MUST** follow the Chain of Thought below:
    1. Locate every Markdown table in the text using the provided Table Edges, and write down the number of those tables in your scratchpad.
    2. Work on every two consecutive Markdown tables in a pairwise manner. For every pair of consecutive Markdown tables, you **MUST** check if they have any text between them. If the two tables are immediately consecutive, but there is text between them and this text is a footnote or a page number, you can safely ignore it, and mark the two tables as consecutive in your scratchpad. If, however, there is valid text between the two tables, then you **MUST** mark the two tables as not consecutive in your scratchpad.
    3. Write down the number of checks you have to perform. You will have to perform checks for consecutive tables pair-wise.

 After completing the above steps, you **MUST** then do the following for every two immediately consecutive Markdown tables:
    A. If the separated Markdown tables are immediately consecutive, then check the headers and the data type of the columns of each table to estimate if the two tables originated from the same table but got split into two because of the page break, or because of the OCR extraction process.
    B. If you find that these two Markdown tables likely belong to the same original table, then output **EXACTLY** the full last row of the first table, and the full first row of the second table in Markdown table format. Also output the exact string of characters that exist between the two tables in the JSON output.
    C. Repeat the above steps for **ALL** immediately consecutive Markdown tables located in your scratchpad, to match the number of checks in your scratchpad.

**SUPER IMPORTANT**: We define a Markdown table by having consecutive rows with no space between, with each row separated from the next by '|\n'. Do not separate Markdown tables semantically.


JSON OUTPUT FORMAT:
Remember that you **MUST** process every two immediately consecutive or consecutive Markdown tables. You **MUST** output a JSON object with the following key-value pairs and following format:

{{
    "total_number_of_markdown_tables_located": "the total number of Markdown tables located",
    "num_of_tables_to_be_merged": "the number of tables to be merged",
    "tables_to_be_merged":
    [
        {{
            "first_table_last_row": "The last row of the first table in Markdown table format. Make sure that the last row of the first table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "second_table_first_row": "The first row of the second table in Markdown table format. Make sure that the first row of the second table is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes.",
            "dividing_string": "The exact string of characters that exist between the two tables. Make sure that the string is outputted verbatim word-for-word, including any leading or trailing spaces, and any leading or trailing pipes."
        }}
    ]
}}

"""

p = prompt.format(text = pdf_result['content'], table_edges=table_edges)
output = ask_LLM_with_JSON(p, model_info=model_info)
print(output)





Calling OpenAI APIs with 2 messages - Model: gpt-4t - Endpoint: https://admin-m78bg318-swedencentral.openai.azure.com/openai/
Messages: [{'role': 'system', 'content': 'You are a helpful assistant, who helps the user with their query. You are designed to output JSON.'}, {'role': 'user', 'content': '\nYou are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.\n\nYou are given the following Markdown text:\n## START OF MARKDOWN TEXT\n# AI Facts\n\n\n## Fun Facts\n\nArtificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating\ndevelopments and applications that stretch the boundaries of what we once thought possible. One\nof the most engaging aspects of AI is its application in creative fields. For instance, AI has been used\nto compose music, generate unique pieces of art, and even write poetry

In [18]:
import json
import copy


## Let's try to merge the tables
findings = json.loads(output)
pdf_content = copy.deepcopy(pdf_result['content'])


for table in findings['tables_to_be_merged']:
    first_table_last_row = table['first_table_last_row']
    second_table_first_row = table['second_table_first_row']

    # Check that the LLM quoted both rows verbatim from the original content
    print("Row in first table found: ", first_table_last_row in pdf_result['content'])
    print("Row in second table found: ", second_table_first_row in pdf_result['content'])

    # Keep the text up to the last row of the first table and the text after
    # the first row of the second table, dropping whatever sits between them
    first_part = pdf_content.split(first_table_last_row, 1)[0]
    second_part = pdf_content.split(second_table_first_row, 1)[1]

    pdf_content = first_part + first_table_last_row + '\n' + second_table_first_row + second_part


Markdown(pdf_content)



Row in first table found:  True
Row in second table found:  True
Row in first table found:  True
Row in second table found:  True


# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating
developments and applications that stretch the boundaries of what we once thought possible. One
of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used
to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to
learn and mimic human creativity. These applications have not only provided new tools for artists
and creators but have also sparked debates about the nature of creativity and whether it is a uniquely
human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to
be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated
world champions in games like Go, which is known for its intricate strategy and depth. These
achievements are not just about winning games; they represent significant advancements in AI's
problem-solving and strategic planning capabilities. The algorithms developed for these purposes
have broader applications, such as solving complex logistical problems and improving decision-
making processes in business and science.


## AI Facts


<table>
<tr>
<td>Fact Category</td>
<td>Detail</td>
</tr>
<tr>
<td>Definition</td>
<td>AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans.</td>
</tr>
<tr>
<td>First AI Concept</td>
<td>The concept of artificial intelligence was first theorized by Alan Turing, a British mathematician, in 1950 through the Turing Test.</td>
</tr>
<tr>
<td>First AI Program</td>
<td>The first AI program was developed by Allen Newell and Herbert A. Simon in 1955, which was called the Logic Theorist.</td>
</tr>
<tr>
<td>Machine Learning</td>
<td>Machine learning is a subset of AI that involves the development of algorithms allowing computers to learn from and make decisions based on data.</td>
</tr>
<tr>
<td>Deep Learning</td>
<td>Deep learning is a subset of machine learning involving neural networks with many layers, enabling advanced pattern recognition.</td>
</tr>
<tr>
<td>AI in Healthcare</td>
<td>AI applications in healthcare include diagnostics, personalized medicine, and robot-assisted surgery.</td>
</tr>
<tr>
<td>AI in Autonomous Vehicles</td>
<td>AI enables autonomous vehicles to perceive their environment and make decisions, reducing the need for human intervention.</td>
</tr>
<tr>
<td>Natural Language Processing (NLP)</td>
<td>NLP is a field of AI focused on the interaction between computers and humans through natural language.</td>
</tr>
<tr>
<td>Ethics in AI</td>
<td>Ethical concerns in AI include privacy, bias, accountability, and the potential for job displacement.</td>
</tr>
<tr>
<td>AI in Finance</td>
<td>AI is used in finance for algorithmic trading, fraud detection, and customer service through chatbots.</td>
</tr>
<tr>
<td>Robotics</td>
<td>Robotics is a field related to AI, focusing on the design and manufacturing of robots that can perform tasks autonomously.</td>
</tr>
<tr>
<td>AI in Gaming</td>
<td>AI enhances gaming experiences by enabling more complex and realistic NPC behaviors and game environments.</td>
</tr>
<tr>
<td>AI in Education</td>
<td>AI can personalize learning experiences through adaptive learning systems and automate administrative tasks.</td>
</tr>
<tr>
<td>AI in Entertainment</td>
<td>AI is used in entertainment for content recommendation algorithms, music composition, and visual effects.</td>
</tr>
<tr>
<td>Quantum Computing and AI</td>
<td>Quantum computing has the potential to significantly increase the processing power available for AI tasks, leading to breakthroughs in complex problem-solving.</td>
</tr>
<tr>
<td>Bias in AI</td>
<td>AI systems can exhibit bias if they are trained on biased data sets, leading to unfair outcomes.</td>
</tr>
<tr>
<td>AI and Job Creation</td>
<td>While AI can automate tasks, it also creates new job opportunities in AI development, supervision, and maintenance.</td>
</tr>
<tr>
<td>AI in Agriculture</td>
<td>AI applications in agriculture include crop monitoring, predictive analysis for yield optimization, and automated machinery.</td>
</tr>
<tr>
<td>AI and Climate Change</td>
<td>AI can analyze large datasets to model climate change impacts and optimize energy consumption patterns.</td>
</tr>
<tr>
<td>AI Art and Creativity</td>
<td>AI is used to generate art, music, and literature, challenging traditional notions of creativity.</td>
</tr>
</table>


## Some More Fun Facts

AI has also made unexpected discoveries and contributions to various scientific fields. For
instance, AI algorithms have identified new patterns in data that were previously overlooked by
human researchers, leading to new scientific discoveries. In the realm of astronomy, AI has helped
identify new exoplanets and galaxies by analyzing vast amounts of data from telescopes and space
missions. Similarly, in biology, AI has accelerated the discovery of new molecules and drugs by
predicting their structures and interactions more efficiently than traditional methods.

The relationship between AI and animals presents another intriguing aspect. Researchers have
used AI to understand animal behavior and communication better. For example, AI has been
employed to decode the complex language of dolphins and to track and predict the migration
patterns of birds and other wildlife. These studies not only deepen our understanding of the natural
world but also demonstrate AI's potential to bridge the communication gap between humans and
other species. As AI continues to evolve, its applications expand into new and unexpected
territories, promising to revolutionize our understanding of the world around us and beyond.
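
The merge cell above relies on `str.split`, which raises an `IndexError` when the LLM does not quote a row verbatim. A minimal sketch of a more defensive variant follows. The helper name `merge_split_tables` and the sample text are hypothetical and not part of `doc_utils`; the sketch assumes each entry carries the `first_table_last_row` and `second_table_first_row` keys requested in the prompt.

```python
def merge_split_tables(content, tables_to_be_merged):
    """Join table chunks by removing the text that divides them.

    Hypothetical helper: entries whose rows cannot be found verbatim
    are skipped and returned separately instead of raising an error.
    """
    merged = content
    skipped = []
    for entry in tables_to_be_merged:
        first = entry["first_table_last_row"]
        second = entry["second_table_first_row"]
        start = merged.find(first)
        # Look for the second row only after the end of the first row
        end = merged.find(second, start + len(first)) if start != -1 else -1
        if start == -1 or end == -1:
            skipped.append(entry)
            continue
        # Keep the first row, drop the divider, continue with the second row
        merged = merged[: start + len(first)] + "\n" + merged[end:]
    return merged, skipped


# Made-up sample: one table split in two by a page-break comment
sample = "| a | b |\n| 1 | 2 |\n\n<!-- PageBreak -->\n\n| 3 | 4 |\n| 5 | 6 |\n"
sample_findings = [
    {"first_table_last_row": "| 1 | 2 |", "second_table_first_row": "| 3 | 4 |"}
]
merged_text, skipped_entries = merge_split_tables(sample, sample_findings)
print(merged_text)
```

Returning the skipped entries makes it easy to see which LLM findings could not be applied, rather than failing on the first mismatch.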


In [19]:
print(pdf_content)

# AI Facts


## Fun Facts

Artificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating
developments and applications that stretch the boundaries of what we once thought possible. One
of the most engaging aspects of AI is its application in creative fields. For instance, AI has been used
to compose music, generate unique pieces of art, and even write poetry, demonstrating its ability to
learn and mimic human creativity. These applications have not only provided new tools for artists
and creators but have also sparked debates about the nature of creativity and whether it is a uniquely
human trait.

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to
be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated
world champions in games like Go, which is known for its intricate strategy and depth. These
achievements are not just about winning games; they represent signif

### Using Regex to detect table boundaries

A simpler approach may be to detect table boundaries with a regular expression and then, if the table is not too big, post-process it into the correct format using GPT-4.
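
As an illustration of the idea, here is a minimal sketch of regex-based boundary detection for pipe-delimited Markdown tables. It is not the implementation of `extract_markdown_table` from `doc_utils`; the pattern and the name `find_pipe_tables` are assumptions made for this example. A table is taken to be a run of consecutive lines that start and end with `|`.

```python
import re

# Assumed pattern: one or more consecutive lines that start and end with a pipe
TABLE_PATTERN = re.compile(r"(?:^[ \t]*\|.*\|[ \t]*(?:\n|$))+", re.MULTILINE)


def find_pipe_tables(markdown_text):
    """Return (start, end, table_text) for every run of pipe-delimited lines."""
    return [
        (match.start(), match.end(), match.group(0))
        for match in TABLE_PATTERN.finditer(markdown_text)
    ]


# Made-up sample: two table chunks separated by a paragraph
sample = (
    "# T\n\n"
    "| a | b |\n| - | - |\n| 1 | 2 |\n"
    "\nSome text\n\n"
    "| 3 | 4 |\n"
)
tables = find_pipe_tables(sample)
print(len(tables))
```

Because any non-table line ends a run, text between two chunks always yields two separate matches; deciding whether they belong together still needs a second step.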

In [20]:
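## Expectation: the regex extracts all 3 tables as one contiguous table.
## Result: it finds 0 tables (hence the IndexError below), most likely because
## pdf_result['content'] holds HTML <table> markup rather than pipe-delimited rows.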

tables = extract_markdown_table(pdf_result['content'])

print(len(tables))
Markdown(tables[0][0])

0


IndexError: list index out of range
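
A count of 0 is consistent with the content shown earlier, where the table appears as HTML `<table>` markup rather than pipe-delimited rows. A minimal sketch of locating such blocks with a regex is below. The pattern and the names `find_html_tables` and `count_rows` are hypothetical, and the sketch assumes tables are not nested.

```python
import re

# Assumed pattern: non-greedy match from an opening <table> tag to the next </table>
HTML_TABLE_PATTERN = re.compile(r"<table\b[^>]*>.*?</table>", re.DOTALL | re.IGNORECASE)


def find_html_tables(content):
    """Return the raw HTML of every <table>...</table> block (no nesting support)."""
    return HTML_TABLE_PATTERN.findall(content)


def count_rows(table_html):
    """Count the <tr> elements inside one table block."""
    return len(re.findall(r"<tr\b", table_html, re.IGNORECASE))


# Made-up sample with two small tables
sample_html = (
    "intro\n<table>\n<tr>\n<td>a</td>\n</tr>\n</table>\n\ntext\n\n"
    "<table>\n<tr>\n<td>b</td>\n</tr>\n<tr>\n<td>c</td>\n</tr>\n</table>\n"
)
html_tables = find_html_tables(sample_html)
row_counts = [count_rows(t) for t in html_tables]
print(len(html_tables), row_counts)
```

For anything beyond simple detection, a real HTML parser would be a safer choice than a regex.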

This time we artificially split the same table into two pipe-delimited chunks with a paragraph of text between them. The regex now detects 2 tables instead of one.

In [21]:
cc = """
# AI Facts

|||
| - | - |
| Fact Category | Detail |
| Definition | AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. |
| First AI Concept | The concept of artificial intelligence was first theorized by Alan Turing, a British mathematician, in 1950 through the Turing Test. |
| First AI Program | The first AI program was developed by Allen Newell and Herbert A. Simon in 1955, which was called the Logic Theorist. |
| Machine Learning | Machine learning is a subset of AI that involves the development of algorithms allowing computers to learn from and make decisions based on data. |

Another fun fact about AI is its ability to play and excel in complex games, a domain once believed to be the exclusive realm of human intelligence. AI systems like DeepMind's AlphaGo have defeated world champions in games like Go, which is known for its intricate strategy and depth. These achievements are not just about winning games; they represent significant advancements in AI's problem-solving and strategic planning capabilities. The algorithms developed for these purposes have broader applications, such as solving complex logistical problems and improving decision-making processes in business and science.

|||
| - | - |
| Deep Learning | Deep learning is a subset of machine learning involving neural networks with many layers, enabling advanced pattern recognition. |
| AI in Healthcare | AI applications in healthcare include diagnostics, personalized medicine, and robot-assisted surgery. |
| AI in Autonomous Vehicles | AI enables autonomous vehicles to perceive their environment and make decisions, reducing the need for human intervention. |
| Natural Language Processing (NLP) | NLP is a field of AI focused on the interaction between computers and humans through natural language. |
| Ethics in AI | Ethical concerns in AI include privacy, bias, accountability, and the potential for job displacement. |
| AI in Finance | AI is used in finance for algorithmic trading, fraud detection, and customer service through chatbots. |
| Robotics | Robotics is a field related to AI, focusing on the design and manufacturing of robots that can perform tasks autonomously. |
| AI in Gaming | AI enhances gaming experiences by enabling more complex and realistic NPC behaviors and game environments. |
| AI in Education | AI can personalize learning experiences through adaptive learning systems and automate administrative tasks. |
| AI in Entertainment | AI is used in entertainment for content recommendation algorithms, music composition, and visual effects. |
| Quantum Computing and AI | Quantum computing has the potential to significantly increase the processing power available for AI tasks, leading to breakthroughs in complex problem-solving. |
| Bias in AI | AI systems can exhibit bias if they are trained on biased data sets, leading to unfair outcomes. |

"""


tables = extract_markdown_table(cc)

print(len(tables))
Markdown(tables[0][0])


2


|||
| - | - |
| Fact Category | Detail |
| Definition | AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. |
| First AI Concept | The concept of artificial intelligence was first theorized by Alan Turing, a British mathematician, in 1950 through the Turing Test. |
| First AI Program | The first AI program was developed by Allen Newell and Herbert A. Simon in 1955, which was called the Logic Theorist. |
| Machine Learning | Machine learning is a subset of AI that involves the development of algorithms allowing computers to learn from and make decisions based on data. |
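
Once two chunks have been detected, they still need to be joined. A minimal sketch, assuming both chunks use the layout seen above (an empty `|||` header row and a `| - | - |` separator row) and that cells contain no escaped pipes: drop the layout rows of the second chunk and append its data rows when the column counts agree. All function names are hypothetical.

```python
def split_cells(row):
    """Split one pipe-delimited row into stripped cell values."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def is_layout_row(row):
    """True for empty header rows such as '|||' and separator rows such as '| - | - |'.

    Note: a data row whose cells are all empty is treated as layout too.
    """
    return all(set(cell) <= set("-: ") for cell in split_cells(row))


def merge_chunks(first_chunk, second_chunk):
    """Append the data rows of second_chunk to first_chunk if the column counts agree."""
    first_rows = first_chunk.strip().splitlines()
    second_rows = [r for r in second_chunk.strip().splitlines() if not is_layout_row(r)]
    width = len(split_cells(first_rows[-1]))
    if any(len(split_cells(r)) != width for r in second_rows):
        return None  # column counts differ, so these are probably different tables
    return "\n".join(first_rows + second_rows)


# Made-up, shortened chunks in the same shape as the cell above
first = "|||\n| - | - |\n| Fact Category | Detail |\n| Definition | A |"
second = "|||\n| - | - |\n| Deep Learning | B |\n| Robotics | C |"
merged_table = merge_chunks(first, second)
print(merged_table)
```

A matching column count is only a weak signal that two chunks belong together, so a check of the intervening text (or an LLM judgment, as in the prompts above) is still needed before merging.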



In [22]:
#### Change prompt and provide an example

In [23]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the Markdown text above, detect whether there are Markdown tables. If a table spans multiple pages, you **MUST** combine the table chunks from consecutive pages into a single table **BEFORE** writing the JSON output.

## EXAMPLE
Here is an example of a table that spans pages:

||||
| - | - | - |
| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |

||||
| - | - | - |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## COMBINED TABLE
Here is the result of combining the table chunks from consecutive pages. The first row is always the header row.

| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and the following format. Do **NOT** provide any other comments or unnecessary information in the JSON output:

{{
    "Column1 Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ],
    "Column2 Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ],
    ...,
    "ColumnN Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)


Calling OpenAI APIs with 2 messages - Model: gpt-4t - Endpoint: https://admin-m78bg318-swedencentral.openai.azure.com/openai/
Messages: [{'role': 'system', 'content': 'You are a helpful assistant, who helps the user with their query. You are designed to output JSON.'}, {'role': 'user', 'content': '\nYou are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.\n\nYou are given the following Markdown text:\n## START OF MARKDOWN TEXT\n# AI Facts\n\n\n## Fun Facts\n\nArtificial Intelligence (AI) has evolved dramatically from its inception, leading to fascinating\ndevelopments and applications that stretch the boundaries of what we once thought possible. One\nof the most engaging aspects of AI is its application in creative fields. For instance, AI has been used\nto compose music, generate unique pieces of art, and even write poetry

In [24]:
import pandas as pd
import json

# Parse the JSON string
data = json.loads(table_edges)

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
display(df)


| | Fact Category | Detail |
| - | - | - |
| 0 | Definition | AI, or Artificial Intelligence, refers to the ... |
| 1 | First AI Concept | The concept of artificial intelligence was fir... |
| 2 | First AI Program | The first AI program was developed by Allen Ne... |
| 3 | Machine Learning | Machine learning is a subset of AI that involv... |
| 4 | Deep Learning | Deep learning is a subset of machine learning ... |
| 5 | AI in Healthcare | AI applications in healthcare include diagnost... |
| 6 | AI in Autonomous Vehicles | AI enables autonomous vehicles to perceive the... |
| 7 | Natural Language Processing (NLP) | NLP is a field of AI focused on the interactio... |
| 8 | Ethics in AI | Ethical concerns in AI include privacy, bias, ... |
| 9 | AI in Finance | AI is used in finance for algorithmic trading,... |
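
`pd.DataFrame(data)` raises a `ValueError` when the column lists differ in length, which can happen if the LLM drops or adds a value. A minimal, standard-library sketch of a check that runs before the DataFrame is built is shown below. The helper name `columns_to_rows` and the sample JSON are hypothetical; short columns are padded with `None` as one possible policy.

```python
import json


def columns_to_rows(json_text):
    """Turn {"header": [values, ...], ...} into a list of row dicts, padding short columns."""
    data = json.loads(json_text)
    if not isinstance(data, dict) or not all(isinstance(v, list) for v in data.values()):
        raise ValueError("expected a JSON object that maps column headers to lists")
    n_rows = max((len(values) for values in data.values()), default=0)
    rows = []
    for i in range(n_rows):
        # Pad with None where a column has fewer values than the longest one
        rows.append(
            {header: (values[i] if i < len(values) else None) for header, values in data.items()}
        )
    return rows


# Made-up LLM output in which the second column is one value short
sample_json = '{"Category": ["Country", "Currency"], "Information": ["United Kingdom"]}'
rows = columns_to_rows(sample_json)
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in place of the raw dictionary.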


#### Let's try with a different PDF file

In [25]:
path = r"sample_data/1_London_Brochure.pdf"
pdf_result = analyze_document(path)
Markdown(pdf_result['content'])

<!-- PageHeader="Margie's Travel Presents ..." -->


# London


<figure>

London is the capital and
most populous city of
England and the United
Kingdom. Standing on the
River Thames in the south
east of the island of Great
Britain, London has been
a major settlement for two
millennia. It was founded
by the Romans, who
named it Londinium.
London's ancient core, the
City of London, largely
retains its 1.12-square-
Tower Bridge
mile medieval boundaries.
Since at least the 19th century, London has also referred to the metropolis around this core,
historically split between Middlesex, Essex, Surrey, Kent, and Hertfordshire, which today largely
makes up Greater London, governed by the Mayor of London and the London Assembly.

</figure>


<figure>

Fountains in Trafalgar Square

</figure>


Mostly popular for:
Leisure, Outdoors, Historical, Arts
& Culture

Best time to visit:
Jun-Aug
Averag Precipitation: 1.9 in
Average Temperature: 56-67ºF


## London Hotels

Margie's Travel offers the following accommodation options in London:

The Buckingham Hotel

Comfortable hotel close to major sights like Buckingham Palace, Regent's Park, and Trafalgar
Square.

The City Hotel

Luxury rooms in the city, within walking distance of Tower Bridge and the Tower of London ..

The Kensington Hotel
Budget accommodation near Earl's Court.

To book your trip to London, visit www.margiestravel.com

<!-- PageBreak -->


<table>
<tr>
<th>Category</th>
<th>Information</th>
</tr>
<tr>
<td>Country</td>
<td>United Kingdom</td>
</tr>
<tr>
<td>Capital Of</td>
<td>England</td>
</tr>
<tr>
<td>Currency</td>
<td>Pound Sterling (GBP)</td>
</tr>
<tr>
<td>Population (2021 census)</td>
<td>Approximately 8.8 million</td>
</tr>
<tr>
<td>Famous For</td>
<td>Historical landmarks, museums, cultural diversity</td>
</tr>
</table>


In [26]:
## Experiment with prompting, to see if broken tables can be "merged" based on LLM's output
## This is work in progress .. NOT YET COMPLETE

prompt = """
You are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.

You are given the following Markdown text:
## START OF MARKDOWN TEXT
{text}
## END OF MARKDOWN TEXT

In the Markdown text above, detect whether there are Markdown tables. If a table spans multiple pages, you **MUST** combine the table chunks from consecutive pages into a single table **BEFORE** writing the JSON output.

## EXAMPLE
Here is an example of a table that spans pages:

||||
| - | - | - |
| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |

||||
| - | - | - |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## COMBINED TABLE
Here is the result of combining the table chunks from consecutive pages. The first row is always the header row.

| Column1 | Column2 | ColumnN |
| Row1 Column1 value | Row1 Column2 value | Row1 ColumnN value |
| Row2 Column1 value | Row2 Column2 value | Row2 ColumnN value |
| RowX Column1 value | RowX Column2 value | RowX ColumnN value |
| RowY Column1 value | RowY Column2 value | RowY ColumnN value |

## JSON OUTPUT FORMAT:
You **MUST** output a JSON object with the following key-value pairs and the following format. Do **NOT** provide any other comments or unnecessary information in the JSON output:

{{
    "Column1 Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ],
    "Column2 Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ],
    ...,
    "ColumnN Header":
        [
                "Row1 value",
                "Row2 value",
                ...,
                "RowN value"
        ]
}}
"""

p = prompt.format(text = pdf_result['content'])
table_edges = ask_LLM_with_JSON(p, model_info=model_info)
print(table_edges)


Calling OpenAI APIs with 2 messages - Model: gpt-4t - Endpoint: https://admin-m78bg318-swedencentral.openai.azure.com/openai/
Messages: [{'role': 'system', 'content': 'You are a helpful assistant, who helps the user with their query. You are designed to output JSON.'}, {'role': 'user', 'content': '\nYou are a Markdown expert whose objective is to check whether Markdown tables have been split into two or more tables because of page breaks or because of the OCR extraction process. You are designed to output JSON.\n\nYou are given the following Markdown text:\n## START OF MARKDOWN TEXT\n<!-- PageHeader="Margie\'s Travel Presents ..." -->\n\n\n# London\n\n\n<figure>\n\nLondon is the capital and\nmost populous city of\nEngland and the United\nKingdom. Standing on the\nRiver Thames in the south\neast of the island of Great\nBritain, London has been\na major settlement for two\nmillennia. It was founded\nby the Romans, who\nnamed it Londinium.\nLondon\'s ancient core, the\nCity of Londo

In [27]:
import pandas as pd
import json

# Parse the JSON string
data = json.loads(table_edges)

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
display(df)


| | Category | Information |
| - | - | - |
| 0 | Country | United Kingdom |
| 1 | Capital Of | England |
| 2 | Currency | Pound Sterling (GBP) |
| 3 | Population (2021 census) | Approximately 8.8 million |
| 4 | Famous For | Historical landmarks, museums, cultural diversity |
