Geearl/6323 large tables #429

Merged 134 commits on Jan 20, 2024
Commits (134)
17476f0
Test sku and autoscale changes in automation
ryonsteele Dec 6, 2023
9b04c39
Function Autoscale testing
ryonsteele Dec 7, 2023
b904511
Function Autoscale testing
ryonsteele Dec 7, 2023
7b76303
Function Autoscale testing
ryonsteele Dec 7, 2023
e638504
Adjust enrichment autoscale and remove ui cap
ryonsteele Dec 8, 2023
ca6eaac
Change function host concurrency and enrichment scale metrics
ryonsteele Dec 11, 2023
2556b08
Add sku and autoscale markdown
ryonsteele Dec 12, 2023
e20a586
process flow for v1.0
asbanger Dec 13, 2023
bd9e0f3
Update docs and resolve enrichment concurrency issue
ryonsteele Dec 13, 2023
227b537
Merge branch 'main' into ryonsteele/function-autoscale
ryonsteele Dec 13, 2023
bacd8e1
Update docs and resolve enrichment concurrency issue
ryonsteele Dec 13, 2023
af49ad2
Merge branch 'ryonsteele/function-autoscale' of https://github.com/mi…
ryonsteele Dec 13, 2023
69fb566
Revert "Merge branch 'vNext-Dev' into main"
dayland-ms Dec 14, 2023
2613e58
Merge branch 'main' into ryonsteele/function-autoscale
ryonsteele Dec 14, 2023
aa054d3
Merge pull request #398 from microsoft/asbanger/6309-update-architecture
dayland Dec 14, 2023
589c779
Update docs
ryonsteele Dec 14, 2023
49d129f
Update docs
ryonsteele Dec 14, 2023
e1b5468
Update docs
ryonsteele Dec 14, 2023
7fc09da
Merge branch 'main' into ryonsteele/function-autoscale
asbanger Dec 15, 2023
bd326b3
assign SP ID for Ci/CD shared
dayland-ms Dec 15, 2023
a9739a5
whitespace change to trigger new build
dayland-ms Dec 15, 2023
42cb0df
Merge pull request #403 from microsoft/dayland/ci-cd-pipeline-fixes
ryonsteele Dec 15, 2023
59adee8
Test increased timeout
ryonsteele Dec 15, 2023
90ae382
Merge branch 'ryonsteele/function-autoscale' of https://github.com/mi…
ryonsteele Dec 15, 2023
c826cdc
Merge branch 'main' into ryonsteele/function-autoscale
ryonsteele Dec 15, 2023
393e351
Merge pull request #390 from microsoft/ryonsteele/function-autoscale
ryonsteele Dec 19, 2023
fcba8a8
Add check for existing CUA deployment object and remove before new de…
ryonsteele Dec 21, 2023
6a4a3b6
Merge pull request #415 from microsoft/ryonsteele/6322-cleanup-cua-de…
lon-tierney Dec 22, 2023
fa0c6d0
first round of changes
georearl Dec 22, 2023
9332789
th fix
georearl Dec 23, 2023
c76f067
Update costestimator.md
mausolfj Dec 29, 2023
5f11a68
Merge pull request #418 from microsoft/mausolfj/6312-cost-estimator-md
mausolfj Dec 29, 2023
d2cd0ac
hasattr
georearl Jan 3, 2024
7b7396d
prompt engineering to fix citation with no answer questions and follo…
ArpitaisAn0maly Jan 3, 2024
18ac0ff
Merge branch 'main' into aparmar/prompt-engineering-tuning
dayland Jan 4, 2024
62b94ca
whitespace to trigger new build
dayland Jan 4, 2024
39917f4
Merge pull request #421 from microsoft/aparmar/prompt-engineering-tuning
ryonsteele Jan 4, 2024
fcc6675
Update user_experience.md with optional data sources
mausolfj Jan 4, 2024
d4545e6
Update user_experience.md
mausolfj Jan 4, 2024
f673790
Update deployment.md
asbanger Jan 4, 2024
4f9772d
Implement runtime keyvault secrets for app services
ryonsteele Jan 4, 2024
52ed9e3
Merge branch 'main' into mausolfj/6301-user-experience
ryonsteele Jan 4, 2024
eee5ea6
Fix issue with duplicate storage secrets
ryonsteele Jan 4, 2024
47e20ef
Documenation on considerations for production adoption
ryonsteele Jan 4, 2024
d852785
Resolve npm dependencie vulerabilities
ryonsteele Jan 4, 2024
e1dfbbd
Merge pull request #427 from microsoft/ryonsteele/6359-resolve-npm-vuln
dayland Jan 4, 2024
c06bb5b
Merge branch 'main' into asbanger/6311-Link-broken
dayland Jan 4, 2024
0feae75
Merge pull request #423 from microsoft/asbanger/6311-Link-broken
dayland Jan 4, 2024
05bb0f9
Merge branch 'main' into mausolfj/6301-user-experience
dayland Jan 4, 2024
961f6a1
Merge pull request #422 from microsoft/mausolfj/6301-user-experience
dayland Jan 4, 2024
f37ff9b
Merge branch 'main' into ryonsteele/6348-runtime-keyvault-secrets
dayland Jan 4, 2024
c9b9bec
refinement
georearl Jan 5, 2024
cec9b66
Merge branch 'main' into ryonsteele/6349-production-considerations
dayland Jan 5, 2024
34299ea
Merge pull request #424 from microsoft/ryonsteele/6348-runtime-keyvau…
dayland Jan 5, 2024
9182758
Add to TOC and put page summary into a deployment.md subsection
ryonsteele Jan 5, 2024
dd97be4
Merge branch 'ryonsteele/6349-production-considerations' of https://g…
ryonsteele Jan 5, 2024
370632c
Merge branch 'main' into ryonsteele/6349-production-considerations
dayland Jan 5, 2024
0b97a8e
Add considerations for document sizes and custom branding
ryonsteele Jan 5, 2024
2c67371
Merge branch 'ryonsteele/6349-production-considerations' of https://g…
ryonsteele Jan 5, 2024
c170553
fixes = thead
georearl Jan 5, 2024
73a9d5e
Merge branch 'vNext-Dev' of https://github.com/microsoft/PubSec-Info-…
georearl Jan 5, 2024
21bbe03
revised table max token
georearl Jan 5, 2024
b8bd02f
Update verbiage on document intelligence
ryonsteele Jan 8, 2024
a54dd28
PR comment resolution
ryonsteele Jan 8, 2024
357b5e7
Document Intelligence version bump hotfix
ryonsteele Jan 8, 2024
ce64874
Merge pull request #433 from microsoft/ryonsteele/6362-unicode-fr-iss…
dayland Jan 8, 2024
264342e
Updates to docs based on community feedback
dayland-ms Jan 8, 2024
e7c9bee
Update configure_local_dev_environment.md
ArpitaisAn0maly Jan 8, 2024
bd8435b
Merge pull request #434 from microsoft/dayland/doc-updates-for-1.0-re…
dayland Jan 8, 2024
7552124
Merge branch 'main' into aparmar/deploy-link
dayland Jan 8, 2024
4aedcfe
Update example list with new questions
dayland-ms Jan 9, 2024
7bfd0a5
Resolve issue with aoai key reference when not using existing deployment
ryonsteele Jan 9, 2024
9174139
Merge pull request #436 from microsoft/dayland/6372-update-default-Qs…
dayland Jan 9, 2024
c5a23d0
Merge branch 'main' into aparmar/deploy-link
dayland Jan 9, 2024
45a02ea
Merge pull request #435 from microsoft/aparmar/deploy-link
dayland Jan 9, 2024
d9b0d90
Merge branch 'main' into ryonsteele/6373-aoai-kv-hf
dayland Jan 9, 2024
1127cd6
Merge branch 'main' into ryonsteele/6349-production-considerations
dayland Jan 9, 2024
ce90232
Merge pull request #438 from microsoft/ryonsteele/6373-aoai-kv-hf
ryonsteele Jan 9, 2024
651c975
Merge branch 'main' into ryonsteele/6349-production-considerations
ryonsteele Jan 9, 2024
d8af46d
Merge pull request #426 from microsoft/ryonsteele/6349-production-con…
dayland Jan 9, 2024
82d993e
Updating for video content
lon-tierney Jan 9, 2024
2827cd1
Correct typo
ryonsteele Jan 9, 2024
294f5c1
Add mslearn primer link in the app scalersection
ryonsteele Jan 9, 2024
333a974
Updates to include YouTube video, and change to "guide" terminology.
lon-tierney Jan 9, 2024
c0463e6
1.0 high level architecture
asbanger Jan 10, 2024
53cd0bf
Update README.md
asbanger Jan 10, 2024
029c17c
update: feature documentation
asbanger Jan 10, 2024
5e66ca6
update: feature documentation
asbanger Jan 10, 2024
adaa4d4
update:github codespace documentation
asbanger Jan 10, 2024
ce6a929
update:github codespace documentation
asbanger Jan 10, 2024
5c512b6
Merge pull request #440 from microsoft/ryonsteele/6349-production-con…
ryonsteele Jan 10, 2024
2c49a88
Update process_flow.drawio.png and process_flow.png
dayland-ms Jan 10, 2024
ff0eff4
Merge branch 'main' into asbanger/architecture-updated
asbanger Jan 10, 2024
89a087c
Merge pull request #444 from microsoft/asbanger/architecture-updated
dayland Jan 10, 2024
e4a7286
Merge branch 'main' into asbanger/6387-documentation-fix-bad-typo
ryonsteele Jan 10, 2024
8a03832
Add Architecture Document in /docs
dayland-ms Jan 10, 2024
0c62ed4
update:github codespace documentation
asbanger Jan 10, 2024
d45fea3
Resolve issue with chunks created statuslog
ryonsteele Jan 10, 2024
bb58c9c
Merge branch 'main' into ryonsteele/6389-fix-chunks-logger-hf
dayland Jan 10, 2024
94b89c7
Removing "s" from short link
lon-tierney Jan 10, 2024
5323c14
Merge branch 'main' into ltierney/deploy-video
lon-tierney Jan 10, 2024
a0cad4d
Merge pull request #447 from microsoft/ryonsteele/6389-fix-chunks-log…
ryonsteele Jan 10, 2024
f7db662
Merge branch 'main' into dayland/4817-add-arch-doc
dayland Jan 10, 2024
c9ee668
Merge pull request #445 from microsoft/asbanger/6387-documentation-fi…
dayland Jan 10, 2024
0913645
Merge branch 'main' into ltierney/deploy-video
dayland Jan 10, 2024
20fc3d0
Merge pull request #446 from microsoft/dayland/4817-add-arch-doc
dayland Jan 10, 2024
724e885
Merge pull request #441 from microsoft/ltierney/deploy-video
dayland Jan 10, 2024
de5b6cd
Update costestimator.md
asbanger Jan 11, 2024
9a5f582
Update costestimator.md
asbanger Jan 11, 2024
ae12f7b
Update costestimator.md
asbanger Jan 11, 2024
a9eb035
Update deployment troubleshooting link and improve UX analysis panel …
dayland-ms Jan 11, 2024
1d35714
Merge branch 'main' into asbanger/6390-azure-estimation-1.0-release
asbanger Jan 11, 2024
8b39f53
Update deployment.md with more detailed instructions
dayland-ms Jan 11, 2024
e1fbbe5
Merge pull request #449 from microsoft/asbanger/6390-azure-estimation…
dayland Jan 11, 2024
0079b51
Merge pull request #450 from microsoft/dayland/6394-add-missing-suppo…
dayland Jan 11, 2024
e4fc7e0
Updating hard link to redirect link for YouTube
lon-tierney Jan 11, 2024
48f68a2
Fixed typo and broken image links
KronemeyerJoshua Jan 11, 2024
0b49909
Merge pull request #451 from microsoft/ltierney/shortURLs
dayland Jan 11, 2024
84f0c0f
Merge branch 'main' into patch-1
dayland Jan 11, 2024
4baa702
Merge pull request #452 from KronemeyerJoshua/patch-1
dayland Jan 11, 2024
300b0a5
Update links in user_experience.md
dayland Jan 11, 2024
7138e09
Merge pull request #453 from microsoft/dayland/6397-fix-sample-data-l…
dayland Jan 11, 2024
dfa907f
Update bug_report.md template with additional instructions and details
dayland-ms Jan 12, 2024
e8e627f
Merge pull request #454 from microsoft/geearl/function-flow-doc
dayland-ms Jan 12, 2024
70da66c
Merge pull request #456 from microsoft/dayland/6394-update-function-f…
dayland Jan 12, 2024
27df390
Merge pull request #455 from microsoft/geearl/6255-issue-template
dayland Jan 12, 2024
c3dca96
Update costestimator.md with consistent release version information
mausolfj Jan 17, 2024
3b3acf3
Merge branch 'main' of https://github.com/microsoft/PubSec-Info-Assis…
georearl Jan 17, 2024
0f8b348
Merge branch 'vNext-Dev' of https://github.com/microsoft/PubSec-Info-…
georearl Jan 17, 2024
4fdc07d
Merge branch 'vNext-Dev' of https://github.com/microsoft/PubSec-Info-…
georearl Jan 17, 2024
c6792d7
Merge branch 'geearl/6323-large-tables' of https://github.com/microso…
georearl Jan 17, 2024
fcea368
Revert "Merge branch 'geearl/6323-large-tables' of https://github.com…
dayland-ms Jan 18, 2024
74c8473
Merge branch 'vNext-Dev' of https://github.com/microsoft/PubSec-Info-…
georearl Jan 18, 2024
6934088
Merge branch 'vNext-Dev' into geearl/6323-large-tables
ryonsteele Jan 19, 2024
2 changes: 1 addition & 1 deletion functions/requirements.txt

@@ -18,4 +18,4 @@ azure-ai-vision == 0.15.1b1
unstructured[csv,doc,docx,email,html,md,msg,ppt,pptx,text,xlsx,xml] == 0.10.27
pyoo == 1.4
azure-search-documents == 11.4.0b11
beautifulsoup4 == 4.12.2
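
The new beautifulsoup4 dependency is what the table-handling code below leans on to pull a <thead> out of serialized table HTML and to walk data rows. A minimal sketch of that pattern, using a made-up table string rather than anything from this PR:

from bs4 import BeautifulSoup

table_html = ("<table><thead><tr><th>Year</th><th>Total</th></tr></thead>"
              "<tr><td>2023</td><td>42</td></tr></table>")
soup = BeautifulSoup(table_html, 'html.parser')

# serialized header block: '<thead><tr><th>Year</th><th>Total</th></tr></thead>'
thead = str(soup.find('thead'))
# strip the wrapper to leave just the header row, as utilities.py does below
header_row = thead.replace("<thead>", "").replace("</thead>", "")
# data rows only; with html.parser, rows outside <thead> stay parented to <table>
data_rows = [row for row in soup.find_all('tr') if row.parent.name != "thead"]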
264 changes: 218 additions & 46 deletions functions/shared_code/utilities.py
@@ -15,6 +15,7 @@
import nltk
# Try to download using nltk.download
nltk.download('punkt')
from bs4 import BeautifulSoup

punkt_dir = os.path.join(nltk.data.path[0], 'tokenizers/punkt')

@@ -62,6 +63,7 @@ class MediaType:
    MEDIA = "media"

class Utilities:

    """ Class to hold utility functions """
    def __init__(self,
                 azure_blob_storage_account,
@@ -105,30 +107,97 @@ def get_blob_and_sas(self, blob_path):
        """ Function to retrieve the uri and sas token for a given blob in azure storage"""
        return self.utilities_helper.get_blob_and_sas(blob_path)

    # def table_to_html(self, table):
    #     """ Function to take an output FR table json structure and convert to HTML """
    #     header_processing_complete = False
    #     table_html = "<table>"
    #     rows = [sorted([cell for cell in table["cells"] if cell["rowIndex"] == i],
    #                    key=lambda cell: cell["columnIndex"]) for i in range(table["rowCount"])]
    #     for row_cells in rows:
    #         is_row_a_header = False
    #         row_html = "<tr>"
    #         for cell in row_cells:
    #             tag = "td"
    #             #if hasattr(cell, 'kind'):
    #             if 'kind' in cell:
    #                 if (cell["kind"] == "columnHeader" or cell["kind"] == "rowHeader"):
    #                     tag = "th"
    #                 if (cell["kind"] == "columnHeader"):
    #                     is_row_a_header = True
    #             else:
    #                 # we have encountered a cell that isn't tagged as a header,
    #                 # so assume we have now reached regular table cells
    #                 header_processing_complete = True
    #             cell_spans = ""
    #             #if hasattr(cell, 'columnSpan'):
    #             if 'columnSpan' in cell:
    #                 if cell["columnSpan"] > 1:
    #                     cell_spans += f" colSpan={cell['columnSpan']}"
    #             #if hasattr(cell, 'rowSpan'):
    #             if 'rowSpan' in cell:
    #                 if cell["rowSpan"] > 1:
    #                     cell_spans += f" rowSpan={cell['rowSpan']}"
    #             row_html += f"<{tag}{cell_spans}>{html.escape(cell['content'])}</{tag}>"
    #         row_html += "</tr>"

    #         if is_row_a_header and header_processing_complete == False:
    #             row_html = "<thead>" + row_html + "</thead>"
    #         table_html += row_html
    #     table_html += "</table>"
    #     return table_html

    def table_to_html(self, table):
        """ Function to take an output FR table json structure and convert to HTML """
        table_html = "<table>"
        rows = [sorted([cell for cell in table["cells"] if cell["rowIndex"] == i],
                       key=lambda cell: cell["columnIndex"]) for i in range(table["rowCount"])]
        thead_open_added = False
        thead_closed_added = False

        for i, row_cells in enumerate(rows):
            is_row_a_header = False
            row_html = "<tr>"
            for cell in row_cells:
                tag = "td"
                if 'kind' in cell:
                    if (cell["kind"] == "columnHeader" or cell["kind"] == "rowHeader"):
                        tag = "th"
                    if (cell["kind"] == "columnHeader"):
                        is_row_a_header = True
                cell_spans = ""
                if 'columnSpan' in cell:
                    if cell["columnSpan"] > 1:
                        cell_spans += f" colSpan={cell['columnSpan']}"
                if 'rowSpan' in cell:
                    if cell["rowSpan"] > 1:
                        cell_spans += f" rowSpan={cell['rowSpan']}"
                row_html += f"<{tag}{cell_spans}>{html.escape(cell['content'])}</{tag}>"
            row_html += "</tr>"

            # add the opening thead if this is the first row and the first header row encountered
            if is_row_a_header and i == 0 and not thead_open_added:
                row_html = "<thead>" + row_html
                thead_open_added = True

            # add the closing thead if we have added an opening thead and this is not a header row
            if not is_row_a_header and thead_open_added and not thead_closed_added:
                row_html = "</thead>" + row_html
                thead_closed_added = True

            table_html += row_html
        table_html += "</table>"
        return table_html
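
To make the new thead behaviour concrete, a small usage sketch (the table dict is an assumed Form Recognizer-style structure, and utils stands in for an already-constructed Utilities instance):

table = {
    "rowCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "kind": "columnHeader", "content": "Year"},
        {"rowIndex": 0, "columnIndex": 1, "kind": "columnHeader", "content": "Total"},
        {"rowIndex": 1, "columnIndex": 0, "content": "2023"},
        {"rowIndex": 1, "columnIndex": 1, "content": "42"},
    ],
}
# utils is assumed to be a Utilities instance
print(utils.table_to_html(table))
# <table><thead><tr><th>Year</th><th>Total</th></tr></thead><tr><td>2023</td><td>42</td></tr></table>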
    def build_document_map_pdf(self, myblob_name, myblob_uri, result, azure_blob_log_storage_container, enable_dev_code):
        """ Function to build a json structure representing the paragraphs in a document,
            including metadata such as section heading, title, page number, etc.

@@ -313,7 +382,58 @@ def build_chunk_filepath (self, file_directory, file_name, file_extension, file_
        folder_set = file_directory + file_name + file_extension + "/"
        output_filename = file_name + f'-{file_number}' + '.json'
        return f'{folder_set}{output_filename}'

    previous_table_header = ""

    def chunk_table_with_headers(self, prefix_text, table_html, standard_chunk_target_size,
                                 previous_paragraph_element_is_a_table):
        soup = BeautifulSoup(table_html, 'html.parser')
        thead = str(soup.find('thead'))

        # check if this table is a continuation of a table on a previous page.
        # If yes then apply the header row from the previous table
        if previous_paragraph_element_is_a_table:
            if thead != "":
                # update thead to include the main table header
                thead = thead.replace("<thead>", "<thead>" + self.previous_table_header)
            else:
                # just use the previous thead
                thead = "<thead>" + self.previous_table_header + "</thead>"

        def add_current_table_chunk(chunk):
            # Close the table tag for the current chunk and add it to the chunks list
            if chunk.strip() and not chunk.endswith("<table>"):
                chunk = '<table>' + chunk + '</table>'
                chunks.append(chunk)

        # Initialize chunks list
        chunks = []
        current_chunk = prefix_text
        # set the target size of the first chunk
        chunk_target_size = standard_chunk_target_size - self.token_count(prefix_text)
        rows = soup.find_all('tr')
        # Filter out rows that are part of the thead block
        filtered_rows = [row for row in rows if row.parent.name != "thead"]

        for i, row in enumerate(filtered_rows):
            row_html = str(row)

            # If adding this row to the current chunk exceeds the target size,
            # close out the current chunk and start a new chunk with the header row
            if self.token_count(current_chunk + row_html) > chunk_target_size:
                add_current_table_chunk(current_chunk)
                current_chunk = thead
                chunk_target_size = standard_chunk_target_size

            # Add the current row to the chunk
            current_chunk += row_html

        # Add the final chunk if there's any content left
        add_current_table_chunk(current_chunk)

        return chunks
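
A hypothetical call site (parameter values assumed, not taken from the PR) showing how an oversized table, such as one produced by table_to_html above, is split so that every chunk after the first re-opens with the header row:

# long_table_html is assumed to be a large <table>...</table> string
chunks = utils.chunk_table_with_headers(
    prefix_text="Quarterly totals:",               # text that precedes the table
    table_html=long_table_html,
    standard_chunk_target_size=750,                # target tokens per chunk
    previous_paragraph_element_is_a_table=False)   # not a continuation table
# chunks[0] carries the prefix text; each later chunk starts with the
# <thead> block so column headings survive the split.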

    def build_chunks(self, document_map, myblob_name, myblob_uri, chunk_target_size):
        """ Function to build chunk outputs based on the document map """

@@ -339,47 +459,83 @@ def build_chunks(self, document_map, myblob_name, myblob_uri, chunk_target_size)
            # if the collected tokens in the current in-memory chunk + the next paragraph
            # will be larger than the allowed chunk size, prepare to write out the total chunk
            if (chunk_size + paragraph_size >= chunk_target_size) or section_name != previous_section_name or title_name != previous_title_name or subtitle_name != previous_subtitle_name:

                # If the current paragraph just by itself is larger than CHUNK_TARGET_SIZE,
                # then we need to split this up and treat each slice as a new in-memory chunk
                # that falls under the max size and ensure the first chunk,
                # which will be added to the current
                if paragraph_size >= chunk_target_size:

                    # We will process tables and regular text differently, as text can be split
                    # on sentence boundaries but tables fail because the code sees a table as a
                    # single sentence. We need a speciality way of splitting a table that is
                    # greater than our target chunk size
                    if paragraph_element["type"] == "table":
                        # table processing & splitting
                        table_chunks = self.chunk_table_with_headers(chunk_text,
                                                                     paragraph_text,
                                                                     chunk_target_size,
                                                                     previous_paragraph_element_is_a_table)

                        for i, table_chunk in enumerate(table_chunks):
                            # write out each table chunk, apart from the last, as this will be
                            # less than or equal to CHUNK_TARGET_SIZE; the last chunk will be
                            # processed like a regular paragraph
                            if i < len(table_chunks) - 1:
                                self.write_chunk(myblob_name, myblob_uri,
                                                 f"{file_number}.{i}",
                                                 self.token_count(table_chunk),
                                                 table_chunk, page_list,
                                                 previous_section_name, previous_title_name, previous_subtitle_name,
                                                 MediaType.TEXT)
                                chunk_count += 1
                            else:
                                # Reset the paragraph token count to just the tokens left in the last
                                # chunk and leave the remaining text from the large paragraph to be
                                # combined with the next in the outer loop
                                paragraph_size = self.token_count(table_chunk)
                                paragraph_text = table_chunk
                                chunk_text = ''
                        file_number += 1

                    else:
                        # text processing & splitting
                        # start by keeping the existing in-memory chunk in front of the large paragraph
                        # and begin to process it on sentence boundaries to break it down into
                        # sub-chunks that are below the CHUNK_TARGET_SIZE
                        sentences = sent_tokenize(chunk_text + paragraph_text)
                        chunks = []
                        chunk = ""
                        for sentence in sentences:
                            temp_chunk = chunk + " " + sentence if chunk else sentence
                            if self.token_count(temp_chunk) <= chunk_target_size:
                                chunk = temp_chunk
                            else:
                                chunks.append(chunk)
                                chunk = sentence
                        if chunk:
                            chunks.append(chunk)

                        # Now write out each chunk, apart from the last, as this will be less than or
                        # equal to CHUNK_TARGET_SIZE; the last chunk will be processed like
                        # a regular paragraph
                        for i, chunk_text_p in enumerate(chunks):
                            if i < len(chunks) - 1:
                                # Process all but the last chunk in this large para
                                self.write_chunk(myblob_name, myblob_uri,
                                                 f"{file_number}.{i}",
                                                 self.token_count(chunk_text_p),
                                                 chunk_text_p, page_list,
                                                 previous_section_name, previous_title_name, previous_subtitle_name,
                                                 MediaType.TEXT)
                                chunk_count += 1
                            else:
                                # Reset the paragraph token count to just the tokens left in the last
                                # chunk and leave the remaining text from the large paragraph to be
                                # combined with the next in the outer loop
                                paragraph_size = self.token_count(chunk_text_p)
                                paragraph_text = chunk_text_p
                                chunk_text = ''
                        file_number += 1
                else:
                    # if this para is not large by itself but will put us over the max token count
                    # or it is a new section, then write out the chunk text we have to this point

@@ -404,7 +560,23 @@ def build_chunks(self, document_map, myblob_name, myblob_uri, chunk_target_size)

            # add paragraph to the chunk
            chunk_size = chunk_size + paragraph_size
            chunk_text = chunk_text + "\n" + paragraph_text


            # store the type of paragraph and content, to be used if a table crosses page
            # boundaries and we need to apply the column headings to subsequent pages
            if paragraph_element["type"] == "table":
                previous_paragraph_element_is_a_table = True
                if self.previous_table_header == "":
                    # Stash the current table's heading to apply to subsequent page tables
                    # if they are missing column headings, but only for the first page of
                    # the multi-page table
                    soup = BeautifulSoup(paragraph_text, 'html.parser')
                    # Extract the thead and strip the thead wrapper to just leave the header cells
                    self.previous_table_header = str(soup.find('thead'))
                    self.previous_table_header = self.previous_table_header.replace("<thead>", "").replace("</thead>", "")
            else:
                previous_paragraph_element_is_a_table = False
                self.previous_table_header = ""

            # If this is the last paragraph then write the chunk
            if index == len(document_map['structure'])-1:
                self.write_chunk(myblob_name, myblob_uri, file_number, chunk_size,