First, install the libraries. We start with LangChain because of the amount of documentation available. PyPDF loads the PDFs, tiktoken counts tokens, and Matplotlib graphs the chunk-size experiments.

*todo*: move to requirements.txt

*todo*: consider adding code from here to plot [chunking](https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/lib/common.py)

In [33]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


Load the PDFs, based on the example [here](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents).

In [38]:
import os
from langchain_community.document_loaders import PyPDFLoader

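# build paths with os.path.join so the notebook also runs outside Windows
path_to_sample_docs = os.path.join(".", "pdfs")

file_names = os.listdir(path_to_sample_docs)

# todo: apply this to all files in directory
test_doc = os.path.join(path_to_sample_docs, file_names[0])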

loader = PyPDFLoader(test_doc)
pages = loader.load()

print(len(pages))

9


Now that we have the PDF(s) loaded, we can run some analysis of the token counts.

In [18]:
import tiktoken

# gpt-3.5-turbo resolves to the cl100k_base encoding
model_name = 'gpt-3.5-turbo' # todo: parameterize this
tokenizer = tiktoken.encoding_for_model(model_name)

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

# count the tokens on each page
token_counts = []
for page in pages:
    token_counts.append(tiktoken_len(page.page_content))
min_token_count = min(token_counts)
avg_token_count = int(sum(token_counts) / len(token_counts))
max_token_count = max(token_counts)

# print token counts
print(f"Min: {min_token_count}")
print(f"Avg: {avg_token_count}")
print(f"Max: {max_token_count}")

Min: 891
Avg: 1303
Max: 2093


The smallest page has 891 tokens, the largest has 2,093, and the average page has 1,303. For GPT-3.5 Turbo, the model whose encoding is used above, the maximum token count for any given interaction is [4097](https://platform.openai.com/docs/guides/text-generation/managing-tokens).

1/2 of 4097 = ~2,048

Meaning our largest page alone would take up half of our tokens. Common practice is to reserve at least half of the token count for the response. We should also evaluate how large we expect our prompts to be. For now we'll be generous and assume that 1/4 of the total token count goes towards prompt engineering (e.g. system prompt, few-shot examples, etc.), which leaves the remaining 1/4 for document content.

1/4 of 4097 = ~1,024

This gives us our first maximum chunk size. Notably, it is smaller than our average page size but larger than our minimum page size.
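The budget arithmetic above can be written down as a small helper, which makes it easy to rerun if the context window or the assumed shares change. This is only a sketch: the function name and the share constants are illustrative assumptions, and the numbers are the ones quoted in this notebook.

```python
# Rough token budget for a single request, using the figures quoted above.
CONTEXT_WINDOW = 4097          # limit quoted above for gpt-3.5-turbo
RESPONSE_SHARE = 0.5           # reserve half the window for the model's reply
PROMPT_OVERHEAD_SHARE = 0.25   # assumed share for system prompt, few-shot examples, etc.

def max_chunk_tokens(context_window, response_share, overhead_share):
    """Return the tokens left for document content after the reserved shares."""
    remaining_share = 1.0 - response_share - overhead_share
    return int(context_window * remaining_share)

budget = max_chunk_tokens(CONTEXT_WINDOW, RESPONSE_SHARE, PROMPT_OVERHEAD_SHARE)
print(budget)  # 1024

# Compare the page statistics measured above against the budget.
page_token_counts = {"min": 891, "avg": 1303, "max": 2093}
for label, count in page_token_counts.items():
    print(label, count, "fits" if count <= budget else "needs splitting")
```

With these assumptions only the smallest page fits in one chunk; the average and largest pages need splitting.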

In this [example](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#text-split-skill-example), though, they use page split mode with a maximum page length of 2k tokens and an overlap of 0! More investigation is needed to determine how that compares to our use case and whether we should eschew the [typically recommended overlap size of 10%](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations). Here is an interesting [tool](https://gustavo-espindola.medium.com/chunk-division-and-overlap-understanding-the-process-ade7eae1b2bd) for visualization to check out.
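To make the overlap question concrete, here is a minimal sketch of fixed-size windows over a token list, once with no overlap and once with 10% overlap. It is not how LangChain or Azure AI Search implement splitting; `sliding_chunks` is a hypothetical helper and the integer list stands in for real token ids.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

tokens = list(range(25))  # stand-in for token ids
no_overlap = sliding_chunks(tokens, chunk_size=10, overlap=0)
ten_percent = sliding_chunks(tokens, chunk_size=10, overlap=1)

print([len(c) for c in no_overlap])    # [10, 10, 5]
print([len(c) for c in ten_percent])   # [10, 10, 7]
print(ten_percent[0][-1] == ten_percent[1][0])  # True: neighbours share one token
```

With overlap, each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk, at the cost of slightly more tokens overall.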


It is also good to consider the following [advice](https://dev.to/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338):

> However, in general, it is a good idea to use a small chunk size for tasks that require a fine-grained view of the text and a larger chunk size for tasks that require a more holistic view of the text.

For the document scanner, we first perform some classification (holistic) and then entity extraction (fine-grained). It may behoove us to use two different chunking methods for the two tasks. What will be important is (a) finding an easy way to measure success and (b) experimenting with chunk sizes for both tasks across a variety of inputs. Ideally we would put this data into a graph to visualize where different chunking approaches are efficient for each task. Further, we may find that the best chunking strategy depends on the result of the classification. That is to say, if we classify a document as Type Y, then it may benefit from chunking paradigm Z when pulling its entities.

Interesting [article](https://vectify.ai/blog/LargeDocumentSummarization) on these kinds of calculations specific to summarization - would summarization be helpful for classifying through OpenAI as opposed to Doc Intelligence?

*QUESTION*: Is the encoding for a model different depending on its context size (e.g. 16k vs 32k vs the base token limit)?



In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# split the loaded pages into smaller chunks

chunk_size=1024
chunk_overlap=int(chunk_size*0.1) # overlap must be a whole number

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=chunk_size, 
   chunk_overlap=chunk_overlap,
   length_function=len, # note: len counts characters, not tokens
   is_separator_regex=False
)

chunks = text_splitter.split_documents(pages)

print("# of chunks: ", len(chunks))
print("First chunk: ", chunks[0].page_content)
print("Last chunk: ", chunks[-1].page_content)

# of chunks:  54
First chunk:  © 2014 Blombery and Scully. This work is published by Dove Medical Press Limited, and licensed under Creative Commons Attribution – Non Commercial (unported,  v3.0)  
License. The full terms of the License are available at http://creativecommons.org/licenses/by-nc/3.0 /. Non-commercial uses of the work are permitted without any further 
permission from Dove Medical Press Limited, provided the work is properly attributed. Permissions beyond the scope of the License are administered by Dove Medical Press Limited. Information on 
how to request permission may be found at: http://www.dovepress.com/permissions.phpJournal of Blood Medicine 2014:5 15–23Journal of Blood Medicine Dove press
submit your manuscript | www.dovepress.co m
Dove press  15Reviewopen access to scientific and medical research
Open Access Full T ext Article
http: //dx.doi.org/ 10.2147/JBM.S4645 8Management of thrombotic thrombocytopenic 
purpura: current perspectives
Piers Blombery
Marie Scu

Let's just try some of the tasks we have:
- identify internal study number
- identify external study number
- classify based on document type/subtype & archive type/subtype
- detect handwritten signature (Document Intelligence?)
    - pricing vs OpenAI?
- language detection, e.g. English vs Japanese (Language service)
    - what is pricing?
- stakeholder names
- sentiment analysis of intro & conclusion




Takeda-authored reports always have an internal study number but typically no external one (there might be some one-offs).
An external study report, by contrast, will typically have an external number and often an internal study number as well.
For other types, we might need to detect a label, e.g. study director.

## Semantic Chunking

Per discussion with Joe Karasha, we are attempting to use the Document Intelligence libraries to leverage semantic chunking. Other things worth trying include large vs small chunks, more vs less overlap, markdown vs plain-text chunkers, and the model we use for domain knowledge (e.g. GPT-3.5 Turbo vs GPT-4). For the MVP we'll just try semantic chunking, but we should follow up with tasks to set up testing and metrics so that, if we change our chunking strategy, we can determine what kind of improvements we see.

### Extract Layout

Pulled the following from the example in this [documentation](https://github.com/mirespace/python-azure/blob/bc98fd7949ba6c2d6bc1bd396317e98c50c09d77/sdk/formrecognizer/azure-ai-formrecognizer/README.md). This could be useful for more naive entity extraction, or for verifying which chunks belong to which sections of the document (which would help with our intro/conclusion sentiment analysis tasks).

In [51]:
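import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

load_dotenv()

endpoint = os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT")
key = os.getenv("DOCUMENTINTELLIGENCE_API_KEY")

# fail early with a clear message if the .env values are missing
if not endpoint or not key:
    raise ValueError("Set DOCUMENTINTELLIGENCE_ENDPOINT and DOCUMENTINTELLIGENCE_API_KEY")

print("Using endpoint: ", endpoint)
# print a mask the same length as the key, never the key itself
print("Using key with length: ", len(key) * "*")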

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

with open(test_doc, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document("prebuilt-layout", analyze_request=f, content_type="application/octet-stream")

result = poller.result()

for page in result.pages:
    print(f"----Analyzing layout from page #{page.page_number}----")
    print(f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}")

    if(page.lines):
        for line_idx, line in enumerate(page.lines):
            print(f"...Line # {line_idx} has content '{line.content}' within polygon '{line.polygon}'")

    if(page.words):
        for word in page.words:
            print(f"...Word '{word.content}' has a confidence of {word.confidence}")

    if(page.selection_marks):
        for selection_mark in page.selection_marks:
            print(f"...Selection mark is '{selection_mark.state}' within polygon '{selection_mark.polygon}' and has a confidence of {selection_mark.confidence}")
if(result.tables):
    for table_idx, table in enumerate(result.tables):
        print(f"Table # {table_idx} has {table.row_count} rows and {table.column_count} columns")

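        # BoundingRegion in azure-ai-documentintelligence exposes `polygon` (the older Form Recognizer SDK used `bounding_box`)
        for region in table.bounding_regions:
            print(f"Table # {table_idx} location on page: {region.page_number} is {region.polygon}")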
        for cell in table.cells:
            print(f"...Cell[{cell.row_index}][{cell.column_index}] has content '{cell.content}'")

Using endpoint:  https://codewith-azure-ai-services.cognitiveservices.azure.com/
Using key with length:  ********************************
----Analyzing layout from page #1----
Page has width: 8.5 and height: 11, measured with unit: inch
...Line # 0 has content 'Journal of Blood Medicine' within polygon '[0.9407, 0.6016, 3.0944, 0.592, 3.0992, 0.7782, 0.9407, 0.7973]'
...Line # 1 has content 'Dovepress' within polygon '[6.8621, 0.6016, 7.7551, 0.6063, 7.7551, 0.7925, 6.8621, 0.7878]'
...Line # 2 has content 'open access to scientific and medical research' within polygon '[5.8688, 0.8355, 7.7646, 0.8307, 7.7646, 0.931, 5.8688, 0.9405]'
...Line # 3 has content 'a' within polygon '[0.9694, 1.1029, 1.0649, 1.0981, 1.0649, 1.27, 0.9694, 1.2652]'
...Line # 4 has content 'Open Access Full Text Article' within polygon '[1.1126, 1.1411, 2.1346, 1.1315, 2.1346, 1.2365, 1.1126, 1.2461]'
...Line # 5 has content 'REVIEW' within polygon '[7.1104, 1.1458, 7.7598, 1.1458, 7.7598, 1.27, 7.1104, 1.2747]'

## MarkdownHeaderTextSplitter

It looks like we can use semantic chunking immediately on loading the file, without running layout analysis first as a separate call. Perhaps it is baked into this call? Does our document need to be markdown from the start, or is it converted to markdown using the layout analysis? Based on this [example](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augmented-generation?view=doc-intel-4.0.0#use-case).

In [54]:
# Using SDK targeting 2024-02-29-preview or 2023-10-31-preview, make sure your resource is in one of these regions: East US, West US2, West Europe
# pip install azure-ai-documentintelligence==1.0.0b1
# pip install langchain langchain-community azure-ai-documentintelligence


from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
 
# Initialize the Azure AI Document Intelligence loader. You can specify either file_path or url_path to load the document.
ai_doc_intel_loader = AzureAIDocumentIntelligenceLoader(
    file_path=test_doc,
    api_key=key,
    api_endpoint=endpoint,
    api_model="prebuilt-layout",
)
docs = ai_doc_intel_loader.load()
 
# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
 
docs_string = docs[0].page_content
splits = text_splitter.split_text(docs_string)
splits

[Document(page_content='<!-- PageHeader="Journal of Blood Medicine" -->  \n<!-- PageHeader="Dovepress open access to scientific and medical research" -->  \na Open Access Full Text Article  \nREVIEW  \nManagement of thrombotic thrombocytopeni purpura: current perspectives\n===  \nThis article was published in the following Dove Press journal: Journal of Blood Medicine 5 February 2014 Number of times this article has been viewed  \nPiers Blombery Marie Scully  \nDepartment of Haematology, University College London Hospital, London, UK  \nAbstract: Thrombotic thrombocytopeniarpura (TTP) is a rare, life-threatening thrombotic microangiopathy which causes significant morbidity and mortality unless promptly recognized and treated. The underlying pathogenesis of TTP is a severe deficiency in ADAMTS13 activity, a metalloprotease that cleaves ultralarge von Willebrand factor multimers. This deficiency is either autoantibody mediated (acquired TTP) or due to deleterious mutations in the gene en
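
To reason about what header-based splitting does, here is a minimal standard-library sketch of the idea: group lines under the most recent `#`-style headers and carry the header path as metadata. It is a simplified stand-in, not LangChain's `MarkdownHeaderTextSplitter`; the function name `split_on_headers` and the sample text are hypothetical. Note that it only recognizes `#`-style headers, so a title underlined with `===`, as in the output above, would not start a new section here.

```python
import re

def split_on_headers(markdown, max_level=3):
    """Group markdown lines into sections keyed by their '#'-style header path."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    sections = []
    path = {}   # header level -> current header text
    body = []   # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # a new header replaces any header at the same or a deeper level
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections

sample = "# Title\nIntro text.\n## Methods\nWe did things.\n## Results\nIt worked.\n"
for section in split_on_headers(sample):
    print(section)
```

Each section keeps the headers it sits under, which is what would let us tell intro chunks from conclusion chunks later on.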

## Different Chunkers

It seems we have a couple of different ways to choose our chunking strategy. More "naive" or traditional chunking methods:
- ```RecursiveCharacterTextSplitter```: Recursively chunks based on separators, attempting to reach the desired chunk size for all chunks while preserving natural separators. Viewing this as the 'default'. Documentation [here](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html?highlight=recursive%20text%20splitter#langchain.text_splitter.RecursiveCharacterTextSplitter).
- ```NLTKTextSplitter```: potentially too basic? Documentation [here](https://www.nltk.org/).
- ```SpaCyTextSplitter```: "sophisticated sentence segmentation feature that can efficiently divide the text into separate sentences, enabling better context preservation in the resulting chunks" as described [here](https://www.pinecone.io/learn/chunking-strategies/#:~:text=sophisticated%20sentence%20segmentation%20feature%20that%20can%20efficiently%20divide%20the%20text%20into%20separate%20sentences%2C%20enabling%20better%20context%20preservation%20in%20the%20resulting%20chunks). This might be a decent choice for our entity extraction problems?
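
As a rough illustration of the recursive idea behind the 'default' splitter, the sketch below tries the coarsest separator first and only falls back to finer ones for pieces that are still too long. It is a simplified stand-in written for this notebook, not LangChain's implementation: it has no overlap, measures length in characters, and the name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Pieces are merged back together greedily as long as they fit, so
    paragraph and line boundaries are preserved where possible.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: hard cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # this piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

text = "Intro paragraph one.\n\nSecond paragraph is a bit longer than the first one.\n\nEnd."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

The first and last paragraphs stay whole, while the long middle paragraph is broken at word boundaries, which is the behavior the description above is after.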

More specialized chunking:
- ```MarkdownTextSplitter```: LangChain provides a markdown text splitter, which we can use after converting our PDF into markdown with either Document Intelligence layout extraction or [PDF Miner](https://pdfminersix.readthedocs.io/en/latest/). This likely has an advantage at least for document classification problems, which would benefit from more semantic context / a holistic view of the document.
- ```LatexTextSplitter```: "LaTeX is the de facto standard for the communication and publication of scientific documents" (according to [LaTeX](https://www.latex-project.org/)). Perhaps due to the nature of our documents this might prove superior?

## How to Evaluate

I like the process that [this blog post](https://medium.com/@zilliz_learn/experimenting-with-different-chunking-strategies-via-langchain-694a4bd9f7a5) uses, where you parameterize the whole pipeline and try it out with different sizes. This is good for evaluating the 'naive' chunking, where size is the primary dictator. We should also compare the semantic-aware chunkers, and check the chunk sizes they generate to see whether they might be too large for our prompting. Another consideration is that our 'naive' strategies might be affected by the choice of loader, so that could be a parameter too. I also like how [this blog post](https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a) plotted the distribution of chunk lengths based on different parameters for just one splitter; it also has a funny analogy with a pineapple at the end.
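
A parameterized sweep along those lines could start out like the sketch below, which records the chunk-length distribution for each chunk size. Everything here is an assumption for illustration: `fixed_size_chunks` is a deliberately simple character-based stand-in for whichever splitter is under test, `sweep` is a hypothetical helper, and the repeated placeholder string stands in for a real document.

```python
from statistics import mean

def fixed_size_chunks(text, chunk_size, overlap):
    """Stand-in splitter: fixed character windows with the given overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def sweep(text, chunk_sizes, overlap_ratio=0.1, splitter=fixed_size_chunks):
    """Run the splitter once per chunk size and summarize the chunk lengths."""
    rows = []
    for size in chunk_sizes:
        overlap = int(size * overlap_ratio)
        lengths = [len(chunk) for chunk in splitter(text, size, overlap)]
        rows.append({
            "chunk_size": size,
            "overlap": overlap,
            "n_chunks": len(lengths),
            "min": min(lengths),
            "mean": round(mean(lengths), 1),
            "max": max(lengths),
        })
    return rows

sample_text = "lorem ipsum dolor sit amet " * 200  # placeholder document
for row in sweep(sample_text, [256, 512, 1024]):
    print(row)
```

Because the splitter is passed in as an argument, the same loop could compare the 'naive' and semantic-aware chunkers, and the rows could feed a plot of the chunk-length distributions.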


## Language Detection

Just using `detect_langs` from the langdetect library since we're in Python; why pay for another service unless we find the results unsatisfactory? For performance, we only test the first page. It seems unlikely that the first page would return English and a bunch of the others would be in different languages. It would be easy enough to refactor this to check all chunks and return all results as tags, if that were considered desirable.

Library documentation can be found [here](https://pypi.org/project/langdetect/).

In [22]:
from langdetect import detect_langs
first_chunk_lang = detect_langs(pages[0].page_content)
print(first_chunk_lang)

[en:0.9999963880660385]
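
If we did want to check every chunk and return the results as tags, the aggregation could look roughly like the sketch below. The detector is passed in as a function so the logic can be tried without the library: `toy_detect` is a made-up stand-in, and `language_tags` and the `min_share` threshold are hypothetical. With langdetect, an adapter such as `lambda t: [(l.lang, l.prob) for l in detect_langs(t)]` should fit, and setting `DetectorFactory.seed = 0` makes its results repeatable.

```python
from collections import defaultdict

def language_tags(texts, detect, min_share=0.1):
    """Return languages whose average probability across texts is >= min_share."""
    totals = defaultdict(float)
    counted = 0
    for text in texts:
        if not text.strip():
            continue  # skip empty chunks
        counted += 1
        for lang, prob in detect(text):
            totals[lang] += prob
    if counted == 0:
        return []
    shares = {lang: total / counted for lang, total in totals.items()}
    kept = [lang for lang, share in shares.items() if share >= min_share]
    return sorted(kept, key=lambda lang: -shares[lang])

def toy_detect(text):
    """Stand-in detector: 'ja' if any hiragana/katakana is present, else 'en'."""
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    return [("ja", 1.0)] if has_kana else [("en", 1.0)]

chunk_texts = ["Management of TTP", "current perspectives", "これはテストです"]
print(language_tags(chunk_texts, toy_detect))  # ['en', 'ja']
```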


Resources I need to evaluate/integrate.
- https://learn.microsoft.com/en-us/semantic-kernel/agents/plugins/out-of-the-box-plugins?tabs=Csharp
- https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf
- https://langchain-doc.readthedocs.io/en/latest/modules/indexes/examples/textsplitter.html
- https://www.pinecone.io/learn/chunking-strategies/
- https://medium.com/@ankit941208/generating-summaries-for-large-documents-with-llama2-using-hugging-face-and-langchain-f7de567339d2
- https://vectify.ai/blog/LargeDocumentSummarization
- https://pypi.org/project/langdetect/
- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167
- https://documentintelligence.ai.azure.com/studio
- https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/langchain-data-chunking-example.ipynb
- https://learn.microsoft.com/en-us/training/modules/use-prebuilt-form-recognizer-models/