In [24]:
import os
import math
import PyPDF2
from transformers import T5ForConditionalGeneration, T5Tokenizer
from concurrent.futures import ProcessPoolExecutor

### File Ingestion

In [25]:
import PyPDF2

# Function to extract text from PDF using PyPDF2
def extract_text_from_pdf(file_path):
    text = ''
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

# Example usage
pdf_file_path = "D:\Programming\Projects\WASSERSTOFF\Data\pdf1.pdf"  # Replace with your PDF file path
document_text = extract_text_from_pdf(pdf_file_path)

# Check the extracted text
print(document_text[:1000])  # Print first 1000 characters


1950 
Patntalal 
Jankid-:u 
v, 
Mohanlal and 
Another, 
Pata1tjali 
S11stri J. 
!950 
Deo, 21. 1008 SUPREME COURT REPORTS [1950] 
of section 14, it seems to me, they would be bringing 
themselves under the bar of section 18 (2). The 
respondents cannot therefore claim that the loss of the 
goods was explosion damage within the meaning of the 
Ordinance so as to bring the case within section 14 and 
at the same time contend that the loss was not "due 
to or did not in any way arise ont of the explosion" in 
order to a void the bar under section 18. Both sec­
tion 14 and section 18 have in view the physical cause 
for the loss or damage to property for which compen­
sation is claimed and not the cause of action in rela­
tion to the person against whom relief is sought. The 
respondents cannot, in my opinion, be allowed to take 
up inconsistent positions in order to bring themselves 
within the one and to get out of the other. 
I would therefore allow the appeal and dismiss the 
counter-c

In [26]:
# document_text="""
# In a world increasingly defined by technology, the relationship between humans and machines has become a focal point of discussion across various domains. The advent of artificial intelligence (AI) has reshaped industries, impacting everything from healthcare to finance, education to entertainment. With the integration of AI in daily life, there emerges a profound question: what does it mean to be human in an age where machines can replicate cognitive functions? This question is particularly relevant in the realm of healthcare, where AI systems are not only assisting doctors in diagnosing diseases but also in predicting patient outcomes based on vast datasets. As a result, the medical community is witnessing a shift in how care is delivered; physicians are now seen as guides who interpret AI-generated insights rather than sole decision-makers. This paradigm shift has its proponents and critics. Advocates argue that AI enhances human capabilities, allowing for more accurate diagnoses and personalized treatment plans. They cite examples of AI systems that analyze medical images with a precision that rivals expert radiologists, thus improving early detection rates of conditions like cancer. On the other hand, critics voice concerns regarding the ethical implications of relying too heavily on machines, fearing that this could undermine the human touch that is integral to patient care. They point out the risk of data bias, as AI systems trained on historical data may perpetuate existing inequalities in healthcare delivery. Moreover, there are apprehensions about patient privacy, as the use of sensitive health data raises questions about who owns this information and how it is used. As healthcare continues to evolve in this direction, the dialogue surrounding the intersection of technology and human experience becomes even more critical. Moving beyond healthcare, the influence of AI permeates other sectors such as finance, where algorithms determine credit scores and assess loan applications. The automation of these processes promises efficiency and speed, yet it also poses significant challenges. For instance, the opaque nature of algorithmic decision-making can lead to a lack of accountability, making it difficult for individuals to understand why they were denied a loan or insurance coverage. In response, policymakers are grappling with the need to regulate AI to ensure fairness and transparency. The challenge lies in striking a balance between innovation and safeguarding the rights of individuals, a debate that reflects broader societal concerns about the pace of technological advancement. In the realm of education, AI offers tools for personalized learning experiences, adapting curricula to meet the unique needs of each student. This potential for customization is promising, particularly for learners who may struggle in traditional educational settings. However, the reliance on technology raises questions about equity, as not all students have equal access to the necessary tools and resources. The digital divide becomes apparent, highlighting disparities that exist within and between communities. As educators explore the integration of AI in classrooms, they must also consider how to ensure that all students benefit from these innovations. In the creative industries, AI is reshaping the landscape of art, music, and literature. Tools powered by AI are now capable of generating original works, blurring the lines between human creativity and machine-generated content. This evolution prompts discussions about authorship and originality. If a machine creates a piece of art, who is the true artist? Furthermore, the rise of deepfakes and AI-generated media has raised ethical concerns about misinformation and the manipulation of public perception. As technology continues to advance, society faces the dual challenge of embracing innovation while mitigating its risks. The overarching theme across these sectors is the need for a thoughtful approach to AI integration, one that considers the ethical, social, and economic implications of these powerful technologies. The dialogue surrounding AI is not merely academic; it has real-world consequences that affect individuals, communities, and global dynamics. As we stand on the precipice of a future increasingly influenced by artificial intelligence, it is crucial that we engage in ongoing discussions about what it means to live in harmony with machines. The future of work, society, and human interaction hinges on our ability to navigate these complexities with care, foresight, and a commitment to preserving the values that define our humanity. Ultimately, the integration of AI into everyday life challenges us to redefine our roles as individuals and as a society, prompting us to ask profound questions about identity, ethics, and the nature of progress. In this ever-evolving landscape, we must prioritize education, policy development, and ethical considerations to ensure that technology serves as a tool for enhancing human potential rather than a force that diminishes it.
# """

# reference_summary = """
# The relationship between humans and machines is increasingly defined by the rise of artificial intelligence (AI), impacting various sectors, including healthcare, finance, education, and the creative industries. In healthcare, AI assists in diagnosing diseases and predicting outcomes, prompting discussions about the balance between technology and the human touch in patient care. While proponents argue that AI enhances capabilities, critics raise concerns about data bias, patient privacy, and the potential loss of human empathy. In finance, AI streamlines processes but introduces challenges related to transparency and accountability. The education sector benefits from personalized learning tools, though disparities in access highlight existing inequalities. In creative fields, AI-generated content raises questions about authorship and originality, particularly concerning misinformation. The overarching theme is the necessity for a thoughtful approach to AI integration, balancing innovation with ethical and social implications. As society navigates these complexities, ongoing dialogue is essential to ensure technology enhances human potential while preserving core values.
# """

### Splitting into chunks

In [27]:
def split_text_by_characters(text, chunk_size_chars=1000):
    """
    Splits text into smaller chunks based on a character limit.
    :param text: The entire document text
    :param chunk_size_chars: The number of characters per chunk
    :return: List of text chunks
    """
    chunks = [text[i:i + chunk_size_chars] for i in range(0, len(text), chunk_size_chars)]
    return chunks

# Split the document into smaller character-based chunks
char_chunks = split_text_by_characters(document_text, chunk_size_chars=1000)

print(f"Number of character-based chunks: {len(char_chunks)}")


Number of character-based chunks: 25


### Token Validation

In [28]:
from transformers import T5Tokenizer

# Load the T5-small tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def ensure_safe_token_length(chunks, tokenizer, max_chunk_length=512):
    validated_chunks = []
    
    for chunk in chunks:
        tokens = tokenizer.encode(chunk)
        if len(tokens) > max_chunk_length:
            # If a chunk exceeds the token limit, split it further
            token_chunks = [tokens[i:i + max_chunk_length] for i in range(0, len(tokens), max_chunk_length)]
            # Convert token chunks back to text and append to validated_chunks
            for token_chunk in token_chunks:
                validated_chunks.append(tokenizer.decode(token_chunk, skip_special_tokens=True))
        else:
            validated_chunks.append(chunk)
    
    return validated_chunks

# Ensure all chunks fit within the token length constraint
validated_text_chunks = ensure_safe_token_length(char_chunks, tokenizer, max_chunk_length=512)

print(f"Number of validated token-safe chunks: {len(validated_text_chunks)}")


Number of validated token-safe chunks: 25


### Summerization

In [29]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")

def summarize_chunk(chunk):
    """
    Summarizes a single chunk of text.
    :param chunk: A chunk of text to summarize
    :return: The summary of the chunk
    """
    inputs = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Summarize each validated chunk
summarized_chunks = [summarize_chunk(chunk) for chunk in validated_text_chunks]

# Print summarized chunks (optional)
for idx, summary in enumerate(summarized_chunks):
    print(f"Summary for Chunk {idx + 1}:\n{summary}\n")


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Summary for Chunk 1:
This is an appeal from a decision of the High Court which dismissed a claim for compensation brought under section 14 of the Explosives Ordinance, 1950, on the ground that the loss of the goods was explosion damage within the meaning of the Ordinance so as to bring the case within section 18.

Summary for Chunk 2:
The High Court of India has the jurisdiction to hear an appeal against a decision of the Appellate Tribunal on facts recorded by the Tribuna.l-Accepting arauments of CO'ltnsel as proved fa.cts and basing decision on them, impropriety of Business expenditnre Payments to avoid disclosure of misfeasance of director.

Summary for Chunk 3:
The High Court of Calcutta is not an advisory court, as it is not under the advisory jurisdiction of the Income-tax Tribunal, as it is under the jurisdiction of the Supreme Court, as it is under the advisory jurisdiction of the High Court, as it is under the jurisdiction of the Supreme Court, as it is under the advisory juri

In [30]:
# Concatenate all the summarized chunks to form the final summary
final_summary = " ".join(summarized_chunks)

# Print the final summary

print(final_summary)


This is an appeal from a decision of the High Court which dismissed a claim for compensation brought under section 14 of the Explosives Ordinance, 1950, on the ground that the loss of the goods was explosion damage within the meaning of the Ordinance so as to bring the case within section 18. The High Court of India has the jurisdiction to hear an appeal against a decision of the Appellate Tribunal on facts recorded by the Tribuna.l-Accepting arauments of CO'ltnsel as proved fa.cts and basing decision on them, impropriety of Business expenditnre Payments to avoid disclosure of misfeasance of director. The High Court of Calcutta is not an advisory court, as it is not under the advisory jurisdiction of the Income-tax Tribunal, as it is under the jurisdiction of the Supreme Court, as it is under the advisory jurisdiction of the High Court, as it is under the jurisdiction of the Supreme Court, as it is under the advisory jurisdiction of the High Court, as it is under the advisory jurisdict

In [34]:
len(final_summary)

6623

### Metrics

In [31]:

from rouge_score import rouge_scorer

# Example summaries
generated_summary = final_summary
reference_summary = reference_summary
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)

print(scores)


{'rouge1': Score(precision=0.03916666666666667, recall=0.3032258064516129, fmeasure=0.06937269372693727), 'rouge2': Score(precision=0.004170141784820684, recall=0.032467532467532464, fmeasure=0.007390983000739098), 'rougeL': Score(precision=0.030833333333333334, recall=0.23870967741935484, fmeasure=0.054612546125461264)}


In [32]:
# Loop through each ROUGE score and print its values
for key, score in scores.items():
    print(f"{key}:")
    print(f"  Precision: {score.precision:.4f}")
    print(f"  Recall: {score.recall:.4f}")
    print(f"  F-measure: {score.fmeasure:.4f}\n")


rouge1:
  Precision: 0.0392
  Recall: 0.3032
  F-measure: 0.0694

rouge2:
  Precision: 0.0042
  Recall: 0.0325
  F-measure: 0.0074

rougeL:
  Precision: 0.0308
  Recall: 0.2387
  F-measure: 0.0546



In [33]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Sample reference text
reference_text = reference_summary

# Sample generated text from your model
generated_text = final_summary

# Tokenize the reference and generated texts
reference = [nltk.word_tokenize(reference_text.lower())]  # List of reference sentences
generated = nltk.word_tokenize(generated_text.lower())

# Use a smoothing function
smoothie = SmoothingFunction()

# Calculate BLEU score with smoothing for all n-gram orders
bleu_score = sentence_bleu(reference, generated, smoothing_function=smoothie.method1)
print(f"BLEU score with smoothing: {bleu_score:.4f}")

# Calculate BLEU score for unigrams, bigrams, trigrams, and 4-grams with smoothing
bleu_score_unigrams = sentence_bleu(reference, generated, weights=(1, 0, 0, 0), smoothing_function=smoothie.method1)
bleu_score_bigrams = sentence_bleu(reference, generated, weights=(0, 1, 0, 0), smoothing_function=smoothie.method1)
bleu_score_trigrams = sentence_bleu(reference, generated, weights=(0, 0, 1, 0), smoothing_function=smoothie.method1)
bleu_score_4grams = sentence_bleu(reference, generated, weights=(0, 0, 0, 1), smoothing_function=smoothie.method1)

print(f"BLEU score (unigrams with smoothing): {bleu_score_unigrams:.4f}")
print(f"BLEU score (bigrams with smoothing): {bleu_score_bigrams:.4f}")
print(f"BLEU score (trigrams with smoothing): {bleu_score_trigrams:.4f}")
print(f"BLEU score (4-grams with smoothing): {bleu_score_4grams:.4f}")


BLEU score with smoothing: 0.0013
BLEU score (unigrams with smoothing): 0.0536
BLEU score (bigrams with smoothing): 0.0079
BLEU score (trigrams with smoothing): 0.0001
BLEU score (4-grams with smoothing): 0.0001
