## Document Analysis using LLMs with Python

By Mauzum Shamil

Document analysis refers to extracting, interpreting, and understanding the information contained within a document. Traditionally, this involved manual review or simple keyword-based techniques. With the rise of Large Language Models (LLMs) such as GPT and BERT, LLM-based approaches are now often preferred because they can take context into account, generate summaries, answer questions, and identify key insights efficiently.

### Extract Text From the PDF

The first step in document analysis is extracting the content from a PDF file. We can use a library like pdfplumber to open the PDF, read the text from each page, and save it into a .txt file for further analysis. You can install pdfplumber in your Python environment using the command: pip install pdfplumber. Here’s how to extract text from the PDF:

In [None]:
pip install pdfplumber


In [2]:
import pdfplumber

pdf_path = r"C:\Users\mauzu\OneDrive\Desktop\Document Analysis using LLms with Python\google_terms_of_service_en_in.pdf"

output_text_file = "extracted_text.txt"

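with pdfplumber.open(pdf_path) as pdf:
    page_texts = []
    for page in pdf.pages:
        # extract_text() can return None or an empty string for pages without a text layer
        page_texts.append(page.extract_text() or "")
    # join pages with a newline so the last line of one page does not run into the next
    extracted_text = "\n".join(page_texts)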

with open(output_text_file, "w") as text_file:
    text_file.write(extracted_text)

print(f"Text extracted and saved to {output_text_file}")

Text extracted and saved to extracted_text.txt


### Preview the Extracted Text 

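After extracting the text, it’s essential to preview the content to check that it was captured correctly. In the preview of this document, for example, some ligatures were lost during extraction ("re ect" for "reflect", "de ne" for "define"), and errors like these carry over into the later steps.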

In [3]:
# read the extracted text back from the .txt file

with open(r"C:\Users\mauzu\OneDrive\Desktop\Document Analysis using LLms with Python\extracted_text.txt", "r") as file:
    document_text = file.read()

# preview the document content 

print(document_text[:500])

GOOGLE TERMS OF SERVICE
Effective May 22, 2024 | Archived versions
What’s covered in these terms
We know it’s tempting to skip these Terms of
Service, but it’s important to establish what you
can expect from us as you use Google services,
and what we expect from you.
These Terms of Service re ect the way Google’s business works, the laws that apply to
our company, and certain things we’ve always believed to be true. As a result, these Terms
of Service help de ne Google’s relationship with you as


### Summarize the Document

To get a high-level overview of the document, you can use a pre-trained summarization model like t5-small. This allows you to condense large pieces of text into shorter summaries, which helps you to grasp the most important information.

In [None]:
pip install transformers -U


In [5]:
from transformers import pipeline

In [6]:
# load the summarization pipeline
summarizer = pipeline("summarization", model="t5-small")

# summarize the document text

summary = summarizer(document_text[:1000], max_length=150, min_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])

Device set to use cpu


Summary: these Terms of Service reect the way Google’s business works, the laws that apply to our company, and certain things we’ve always believed to be true . these terms include: what you can expect from us, which describes how we provide and develop our services What we expect from you, which establishes certain rules for using our services Content in Google services .


The call pipeline("summarization", model="t5-small") sets up a summarization pipeline backed by T5-small, a small pre-trained text-to-text transformer that supports summarization among other tasks. The slice document_text[:1000] selects the portion of the text to summarize (the first 1000 characters), so the summary covers only the opening of the document. The max_length=150 and min_length=30 arguments bound the length of the summary in tokens. Setting do_sample=False disables random sampling during generation, so the same input gives the same summary on every run.
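
Because only the first 1000 characters are summarized, the rest of the document is ignored. One possible way to cover more text is to summarize it chunk by chunk and join the partial summaries. The sketch below assumes fixed-size character chunks, which can cut words in half, and takes the summarizer as a plain function so it runs without downloading a model. The names split_into_chunks, summarize_in_chunks, and first_words are hypothetical.

```python
def split_into_chunks(text, chunk_chars=1000):
    """Split text into consecutive pieces of at most chunk_chars characters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def summarize_in_chunks(text, summarize_fn, chunk_chars=1000):
    """Summarize each non-empty chunk with summarize_fn and join the partial summaries."""
    partial = [
        summarize_fn(chunk)
        for chunk in split_into_chunks(text, chunk_chars)
        if chunk.strip()
    ]
    return " ".join(partial)

# Stand-in summarizer (an assumption for this demo): keeps the first few words of a chunk
def first_words(chunk, n=5):
    return " ".join(chunk.split()[:n])

print(summarize_in_chunks("one two three four five six seven " * 50, first_words, chunk_chars=100))
```

With the pipeline from the cell above, summarize_fn could be a small wrapper such as lambda chunk: summarizer(chunk, max_length=150, min_length=30, do_sample=False)[0]['summary_text'].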

### Split the Document into Sentences and Passages

For more detailed analysis, like question generation, it’s important to split the document into smaller chunks. This step tokenizes the document into sentences and combines them into manageable passages for subsequent steps. 

In [7]:
import nltk
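nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead; download that if sent_tokenize raises a LookupError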
from nltk.tokenize import sent_tokenize

# split text into sentences
sentences = sent_tokenize(document_text)

# combine sentences into passages
passages = []
current_passage = ""
for sentence in sentences:
    if len(current_passage.split()) + len(sentence.split()) < 200:  # adjust the word limit as needed
        current_passage += " " + sentence
    else:
        if current_passage:  # avoid an empty passage when the very first sentence is over the limit
            passages.append(current_passage.strip())
        current_passage = sentence
if current_passage:
    passages.append(current_passage.strip())

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mauzu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In this part of the code, we use the NLTK library to split the extracted document text into individual sentences with the sent_tokenize() function. Then, we combine these sentences into passages of fewer than 200 words each. This keeps each passage at a length suitable for further processing by language models, which often have token limits. When adding the next sentence would bring the current passage to 200 words or more, the current passage is appended to the passages list and a new passage starts with that sentence. The process continues until all sentences are grouped into passages.
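
The same grouping logic can be wrapped in a reusable function, which makes the word limit easy to change and test. This is a minimal sketch under the assumption that sentences are already split; the name build_passages and the max_words parameter are illustrative and not part of the original notebook.

```python
def build_passages(sentences, max_words=200):
    """Group sentences into passages of fewer than max_words words where possible."""
    passages = []
    current = []        # sentences in the passage being built
    current_words = 0   # running word count of `current`
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current passage if this sentence would reach the limit
        if current and current_words + n_words >= max_words:
            passages.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        passages.append(" ".join(current))
    return passages

# Tiny demo with a 4-word limit
print(build_passages(["a b", "c d", "e"], max_words=4))
```

A single sentence longer than the limit still becomes its own passage, so the limit is a target rather than a hard cap.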

### Generate Questions from the Passages Using LLMs

The next step is to generate questions based on the document’s content. This helps in understanding key information points and can be used to check the comprehension of the document.

In [None]:
# pip install tiktoken
# pip install sentencepiece  (T5Tokenizer requires the sentencepiece package)

In [11]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline

# Load the T5 tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("valhalla/t5-base-qg-hl")
model = T5ForConditionalGeneration.from_pretrained("valhalla/t5-base-qg-hl")

# Initialize the pipeline
qg_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Function to generate questions using the pipeline
def generate_questions_pipeline(passage, min_questions=3):
    prompt = "generate questions for the following text: {} Provide relevant questions that help understand the text."

    def run(text, questions):
        # Generate candidates for `text` and append new, well-formed questions to `questions`
        results = qg_pipeline(prompt.format(text), max_length=512, num_beams=10, num_return_sequences=10)
        for result in results:
            for q in result['generated_text'].split('<sep>'):
                q = q.strip()
                # keep unseen candidates that end with a question mark
                if q.endswith('?') and q not in questions:
                    questions.append(q)

    questions = []  # a list keeps the order in which the pipeline returned the candidates
    run(passage, questions)

    # If fewer than min_questions, try to regenerate from smaller parts of the passage
    if len(questions) < min_questions:
        passage_sentences = passage.split('. ')
        for i in range(0, len(passage_sentences), 2):
            if len(questions) >= min_questions:
                break
            run(' '.join(passage_sentences[i:i+2]), questions)

    return questions[:min_questions]  # Return at most min_questions questions

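# Generate questions from passages
# NOTE: these two placeholder passages overwrite the passages built from the document;
# remove the next line to run on the real passages instead
passages = ["Here are the terms of service that govern how Google operates...", "The second passage includes detailed descriptions of user responsibilities..."]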
for idx, passage in enumerate(passages):
    questions = generate_questions_pipeline(passage)
    print(f"Passage {idx+1}:\n{passage}\n")
    print("Generated Questions:")
    for q in questions:
        print(f"- {q}")
    print(f"\n{'-'*50}\n")



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Device set to use cpu


Passage 1:
Here are the terms of service that govern how Google operates...

Generated Questions:
- What are the terms of service that govern how Google operates?
- What are the terms and conditions that govern how Google operates?
- What are the terms of service for Google?

--------------------------------------------------

Passage 2:
The second passage includes detailed descriptions of user responsibilities...

Generated Questions:
- The second passage includes detailed descriptions of user responsibilities... Provide relevant questions that help understand the text?
- The second passage includes detailed descriptions of user responsibilities... Provide relevant questions to help understand the text?
- The second passage of the book includes detailed descriptions of user responsibilities... Provide relevant questions that help understand the text?

--------------------------------------------------



In this part of the code, we use a question generation model (the T5-based valhalla/t5-base-qg-hl) from the Hugging Face transformers library to generate questions from text passages. The example runs on two short placeholder passages rather than the passages built from the document. The function generate_questions_pipeline() takes a passage and returns up to min_questions questions (three by default). If the first attempt yields fewer than that, the function splits the passage into two-sentence parts and generates more questions from each part until it has enough. This is a best-effort fallback rather than a guarantee: short inputs may still produce fewer questions, and, as the output for Passage 2 shows, the model can echo the prompt instead of asking a real question. We print the questions along with the corresponding passage for review.
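
The echoed prompt in Passage 2 may be related to the input format. As far as I understand the model card for valhalla/t5-base-qg-hl, the model was trained on inputs that start with "generate question:" and mark the answer span with <hl> tokens, rather than on free-form instructions. The sketch below only builds such an input string, assuming that format; the helper name build_highlight_input is hypothetical, and the exact format should be checked against the model card before relying on it.

```python
def build_highlight_input(context, answer):
    """Wrap the first occurrence of `answer` in <hl> markers (format assumed from the model card)."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer span not found in context")
    end = start + len(answer)
    # Surround the answer span with highlight tokens and add the task prefix
    highlighted = f"{context[:start]}<hl> {answer} <hl>{context[end:]}"
    return f"generate question: {highlighted}"

example = build_highlight_input("42 is the answer to life, the universe and everything.", "42")
print(example)
```

The resulting string could then be passed to qg_pipeline in place of the free-form prompt. Note that this variant needs a candidate answer span for each question, which the notebook does not currently extract.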

### Answer the Generated Questions Using a QA Model

After generating the questions, we can use a pre-trained question-answering (QA) model to find the answers within the text. The deepset/roberta-base-squad2 model extracts answers based on the context of the passage.

In [12]:
from transformers import pipeline

# Load the extractive QA pipeline; max_answer_len caps the length (in tokens) of the answer span
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2",
                       max_answer_len=512)

# Function to track and answer only unique questions
def answer_unique_questions(passages, qa_pipeline):
    answered_questions = set()  # to store unique questions

    for idx, passage in enumerate(passages):
        questions = generate_questions_pipeline(passage)

        for question in questions:
            if question not in answered_questions:  # check if the question has already been answered
                answer = qa_pipeline(question=question, context=passage)
                print(f"Q: {question}")
                print(f"A: {answer['answer']}\n")
                answered_questions.add(question)  # add the question to the set to avoid repetition
        print(f"{'='*50}\n")

# Example passages
passages = [
    "Here are the terms of service that govern how Google operates...",
    "The second passage includes detailed descriptions of user responsibilities..."
]

answer_unique_questions(passages, qa_pipeline)


Device set to use cpu


Q: What are the terms of service that govern how Google operates?
A: Here

Q: What are the terms and conditions that govern how Google operates?
A: terms of service

Q: What are the terms of service for Google?
A: terms of service that govern how Google operates...


Q: The second passage includes detailed descriptions of user responsibilities... Provide relevant questions that help understand the text?
A: ...

Q: The second passage includes detailed descriptions of user responsibilities... Provide relevant questions to help understand the text?
A: ...

Q: The second passage of the book includes detailed descriptions of user responsibilities... Provide relevant questions that help understand the text?
A: ...




In this part of the code, we use a question-answering (QA) pipeline with the deepset/roberta-base-squad2 model to answer the questions generated from the passages. This is an extractive model: it returns a span copied from the passage rather than writing a new answer. The function answer_unique_questions() generates questions for each passage and tracks them in a set, so a question that appears again for a later passage is answered only once. The very short answers in the output ("Here", "...") reflect the one-sentence placeholder passages used in the example, which contain little for the model to extract.
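
Besides the answer text, the question-answering pipeline returns a confidence score for each answer (the result dictionary has the keys score, start, end, and answer). One possible refinement is to drop low-scoring answers such as the ones above. The sketch below works on already collected results; the function name keep_confident_answers, the 0.3 threshold, and the sample scores are all made-up assumptions for illustration.

```python
def keep_confident_answers(qa_results, min_score=0.3):
    """Return (question, answer) pairs whose score is at least min_score."""
    return [
        (question, result["answer"])
        for question, result in qa_results
        if result["score"] >= min_score
    ]

# Hypothetical results shaped like the pipeline output; the scores are invented for the demo
sample_results = [
    ("What are the terms of service for Google?",
     {"score": 0.62, "start": 13, "end": 29, "answer": "terms of service"}),
    ("What does the second passage include?",
     {"score": 0.04, "start": 74, "end": 77, "answer": "..."}),
]

for question, answer in keep_confident_answers(sample_results):
    print(f"Q: {question}\nA: {answer}\n")
```

A suitable threshold depends on the document and the model, so it is worth checking a few answers by hand before fixing a value.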

## Summary

So, this is how we can analyze documents using LLMs step by step: extract the text, preview it, summarize it, split it into passages, generate questions, and answer them. LLMs handle natural language well, which makes them useful for working through complex documents with little manual effort, although the outputs above show that the results still need review. I hope you liked this article on document analysis using LLMs with Python. Feel free to ask questions in the comments section below.

Here are my profiles for reference:

GitHub:
https://github.com/mauzumshamil

LinkedIn:
http://linkedin.com/in/mauzum-shamil

Portfolio Link: 
https://linktr.ee/mauzum_shamil