In [1]:
import os
import math
import PyPDF2
from transformers import T5ForConditionalGeneration, T5Tokenizer
from concurrent.futures import ProcessPoolExecutor

### File Ingestion

In [None]:
# import PyPDF2

# # Function to extract text from PDF using PyPDF2
# def extract_text_from_pdf(file_path):
#     text = ''
#     with open(file_path, 'rb') as file:
#         pdf_reader = PyPDF2.PdfReader(file)
#         for page_num in range(len(pdf_reader.pages)):
#             page = pdf_reader.pages[page_num]
#             text += page.extract_text()
#     return text

# # Example usage
# pdf_file_path = r"D:\Programming\Projects\WASSERSTOFF\Data\pdf1.pdf"  # Replace with your PDF file path (raw string keeps the backslashes literal)
# document_text = extract_text_from_pdf(pdf_file_path)

# # Check the extracted text
# print(document_text[:1000])  # Print first 1000 characters


In [3]:
document_text="""
In a world increasingly defined by technology, the relationship between humans and machines has become a focal point of discussion across various domains. The advent of artificial intelligence (AI) has reshaped industries, impacting everything from healthcare to finance, education to entertainment. With the integration of AI in daily life, there emerges a profound question: what does it mean to be human in an age where machines can replicate cognitive functions? This question is particularly relevant in the realm of healthcare, where AI systems are not only assisting doctors in diagnosing diseases but also in predicting patient outcomes based on vast datasets. As a result, the medical community is witnessing a shift in how care is delivered; physicians are now seen as guides who interpret AI-generated insights rather than sole decision-makers. This paradigm shift has its proponents and critics. Advocates argue that AI enhances human capabilities, allowing for more accurate diagnoses and personalized treatment plans. They cite examples of AI systems that analyze medical images with a precision that rivals expert radiologists, thus improving early detection rates of conditions like cancer. On the other hand, critics voice concerns regarding the ethical implications of relying too heavily on machines, fearing that this could undermine the human touch that is integral to patient care. They point out the risk of data bias, as AI systems trained on historical data may perpetuate existing inequalities in healthcare delivery. Moreover, there are apprehensions about patient privacy, as the use of sensitive health data raises questions about who owns this information and how it is used. As healthcare continues to evolve in this direction, the dialogue surrounding the intersection of technology and human experience becomes even more critical. 
Moving beyond healthcare, the influence of AI permeates other sectors such as finance, where algorithms determine credit scores and assess loan applications. The automation of these processes promises efficiency and speed, yet it also poses significant challenges. For instance, the opaque nature of algorithmic decision-making can lead to a lack of accountability, making it difficult for individuals to understand why they were denied a loan or insurance coverage. In response, policymakers are grappling with the need to regulate AI to ensure fairness and transparency. The challenge lies in striking a balance between innovation and safeguarding the rights of individuals, a debate that reflects broader societal concerns about the pace of technological advancement. In the realm of education, AI offers tools for personalized learning experiences, adapting curricula to meet the unique needs of each student. This potential for customization is promising, particularly for learners who may struggle in traditional educational settings. However, the reliance on technology raises questions about equity, as not all students have equal access to the necessary tools and resources. The digital divide becomes apparent, highlighting disparities that exist within and between communities. As educators explore the integration of AI in classrooms, they must also consider how to ensure that all students benefit from these innovations. In the creative industries, AI is reshaping the landscape of art, music, and literature. Tools powered by AI are now capable of generating original works, blurring the lines between human creativity and machine-generated content. This evolution prompts discussions about authorship and originality. If a machine creates a piece of art, who is the true artist? Furthermore, the rise of deepfakes and AI-generated media has raised ethical concerns about misinformation and the manipulation of public perception. 
As technology continues to advance, society faces the dual challenge of embracing innovation while mitigating its risks. The overarching theme across these sectors is the need for a thoughtful approach to AI integration, one that considers the ethical, social, and economic implications of these powerful technologies. The dialogue surrounding AI is not merely academic; it has real-world consequences that affect individuals, communities, and global dynamics. As we stand on the precipice of a future increasingly influenced by artificial intelligence, it is crucial that we engage in ongoing discussions about what it means to live in harmony with machines. The future of work, society, and human interaction hinges on our ability to navigate these complexities with care, foresight, and a commitment to preserving the values that define our humanity. Ultimately, the integration of AI into everyday life challenges us to redefine our roles as individuals and as a society, prompting us to ask profound questions about identity, ethics, and the nature of progress. In this ever-evolving landscape, we must prioritize education, policy development, and ethical considerations to ensure that technology serves as a tool for enhancing human potential rather than a force that diminishes it.
"""

reference_summary = """
The relationship between humans and machines is increasingly defined by the rise of artificial intelligence (AI), impacting various sectors, including healthcare, finance, education, and the creative industries. In healthcare, AI assists in diagnosing diseases and predicting outcomes, prompting discussions about the balance between technology and the human touch in patient care. While proponents argue that AI enhances capabilities, critics raise concerns about data bias, patient privacy, and the potential loss of human empathy. In finance, AI streamlines processes but introduces challenges related to transparency and accountability. The education sector benefits from personalized learning tools, though disparities in access highlight existing inequalities. In creative fields, AI-generated content raises questions about authorship and originality, particularly concerning misinformation. The overarching theme is the necessity for a thoughtful approach to AI integration, balancing innovation with ethical and social implications. As society navigates these complexities, ongoing dialogue is essential to ensure technology enhances human potential while preserving core values.
 """

### Splitting into chunks

In [4]:
def split_text_by_characters(text, chunk_size_chars=1000):
    """
    Splits text into smaller chunks based on a character limit.
    :param text: The entire document text
    :param chunk_size_chars: The number of characters per chunk
    :return: List of text chunks
    """
    chunks = [text[i:i + chunk_size_chars] for i in range(0, len(text), chunk_size_chars)]
    return chunks

# Split the document into smaller character-based chunks
char_chunks = split_text_by_characters(document_text, chunk_size_chars=1000)

print(f"Number of character-based chunks: {len(char_chunks)}")


Number of character-based chunks: 6
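
Fixed-width character slicing can cut a chunk in the middle of a word or sentence, which gives the summarizer a fragment at each boundary. One possible alternative, sketched below, packs whole sentences into each chunk. The function name `split_text_by_sentences` and the regex-based sentence split are assumptions made for illustration, not part of the original notebook. A single sentence longer than the limit is kept as one oversized chunk.

```python
import re

def split_text_by_sentences(text, chunk_size_chars=1000):
    """Greedily pack whole sentences into chunks of at most chunk_size_chars."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence here. Second one follows! Is this the third? Yes it is."
sentence_chunks = split_text_by_sentences(sample, chunk_size_chars=45)
print(sentence_chunks)
```

Calling `split_text_by_sentences(document_text)` would be a drop-in replacement for the character-based call, at the cost of slightly uneven chunk sizes.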


### Token Validation

In [5]:
from transformers import T5Tokenizer

# Load the T5-small tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")

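def ensure_safe_token_length(chunks, tokenizer, max_chunk_length=512):
    """
    Ensures every chunk encodes to at most max_chunk_length tokens.
    :param chunks: List of text chunks
    :param tokenizer: Tokenizer used to count tokens
    :param max_chunk_length: Maximum number of tokens per chunk
    :return: List of chunks, with oversized chunks split further
    """
    validated_chunks = []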
    
    for chunk in chunks:
        tokens = tokenizer.encode(chunk)
        if len(tokens) > max_chunk_length:
            # If a chunk exceeds the token limit, split it further
            token_chunks = [tokens[i:i + max_chunk_length] for i in range(0, len(tokens), max_chunk_length)]
            # Convert token chunks back to text and append to validated_chunks
            for token_chunk in token_chunks:
                validated_chunks.append(tokenizer.decode(token_chunk, skip_special_tokens=True))
        else:
            validated_chunks.append(chunk)
    
    return validated_chunks

# Ensure all chunks fit within the token length constraint
validated_text_chunks = ensure_safe_token_length(char_chunks, tokenizer, max_chunk_length=512)

print(f"Number of validated token-safe chunks: {len(validated_text_chunks)}")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Number of validated token-safe chunks: 6
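
The validation above checks each chunk against the full 512-token limit, but the summarization step prepends `"summarize: "` before encoding with `max_length=512`, so a chunk sitting exactly at the limit would still be truncated. A minimal sketch of reserving room for the prefix follows. `split_tokens_with_budget` and `toy_encode` are hypothetical names, and the whitespace tokenizer is only a stand-in for the real T5 tokenizer so the example runs without downloading a model.

```python
def split_tokens_with_budget(tokens, max_length=512, reserved=0):
    """Split a token list into pieces of at most max_length - reserved tokens."""
    budget = max_length - reserved
    if budget <= 0:
        raise ValueError("reserved must be smaller than max_length")
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]

# Stand-in tokenizer for illustration only: one token per whitespace-separated word.
def toy_encode(text):
    return text.split()

prefix_tokens = toy_encode("summarize:")            # 1 token with the toy tokenizer
chunk_tokens = toy_encode("a b c d e f g h i j")    # 10 tokens
pieces = split_tokens_with_budget(
    chunk_tokens, max_length=4, reserved=len(prefix_tokens)
)
print([len(p) for p in pieces])
```

With the real tokenizer, `reserved` would be the token count of the prefix plus any special tokens the tokenizer appends, measured once before splitting.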


### Summarization

In [6]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5-small model for summarization
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def summarize_chunk(chunk):
    """
    Summarizes a single chunk of text.
    :param chunk: A chunk of text to summarize
    :return: The summary of the chunk
    """
    inputs = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Summarize each validated chunk
summarized_chunks = [summarize_chunk(chunk) for chunk in validated_text_chunks]

# Print summarized chunks (optional)
for idx, summary in enumerate(summarized_chunks):
    print(f"Summary for Chunk {idx + 1}:\n{summary}\n")


Summary for Chunk 1:
the advent of artificial intelligence (AI) has reshaped industries. there emerges a profound question: what does it mean to be human in an age where machines can replicate cognitive functions? this paradigm shift has its proponents and critics.

Summary for Chunk 2:
critics voice concerns about the ethical implications of relying too heavily on machines. critics fear that this could undermine the human touch that is integral to patient care. apprehensions about patient privacy raise questions about who owns this information and how it is used.

Summary for Chunk 3:
policymakers are grappling with the need to regulate AI to ensure fairness and transparency. the challenge lies in striking a balance between innovation and safeguarding the rights of individuals. the challenge lies in striking a balance between innovation and safeguarding the rights of individuals.

Summary for Chunk 4:
tools powered by AI are now capable of generating original works. this blurring the 

In [7]:
# Concatenate all the summarized chunks to form the final summary
final_summary = " ".join(summarized_chunks)

# Print the final summary

print(final_summary)


the advent of artificial intelligence (AI) has reshaped industries. there emerges a profound question: what does it mean to be human in an age where machines can replicate cognitive functions? this paradigm shift has its proponents and critics. critics voice concerns about the ethical implications of relying too heavily on machines. critics fear that this could undermine the human touch that is integral to patient care. apprehensions about patient privacy raise questions about who owns this information and how it is used. policymakers are grappling with the need to regulate AI to ensure fairness and transparency. the challenge lies in striking a balance between innovation and safeguarding the rights of individuals. the challenge lies in striking a balance between innovation and safeguarding the rights of individuals. tools powered by AI are now capable of generating original works. this blurring the lines between human creativity and machine-generated content. the rise of deepfakes and

In [8]:
len(final_summary)

1608
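
The summary for Chunk 3 repeats the sentence about striking a balance, and that repetition carries over into the concatenated result. One simple mitigation, sketched here as an assumption rather than the notebook's method, is to drop sentences that have already appeared verbatim when joining. The helper name `join_without_repeats` is hypothetical, and the example input is the Chunk 3 summary printed above.

```python
import re

def join_without_repeats(summaries):
    """Join chunk summaries, dropping sentences that already appeared verbatim."""
    seen = set()
    kept = []
    for summary in summaries:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
            key = sentence.lower().strip()
            if key and key not in seen:
                seen.add(key)
                kept.append(sentence.strip())
    return " ".join(kept)

chunk3_summary = (
    "policymakers are grappling with the need to regulate AI to ensure fairness "
    "and transparency. the challenge lies in striking a balance between innovation "
    "and safeguarding the rights of individuals. the challenge lies in striking a "
    "balance between innovation and safeguarding the rights of individuals."
)
deduplicated = join_without_repeats([chunk3_summary])
print(deduplicated)
```

This only catches exact repeats. Near-duplicates with different wording would need a similarity measure, which is outside the scope of this sketch.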

### Metrics

In [9]:

from rouge_score import rouge_scorer

# Compare the generated summary against the reference summary
generated_summary = final_summary
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)

print(scores)


{'rouge1': Score(precision=0.3877551020408163, recall=0.6129032258064516, fmeasure=0.4749999999999999), 'rouge2': Score(precision=0.09836065573770492, recall=0.15584415584415584, fmeasure=0.1206030150753769), 'rougeL': Score(precision=0.1836734693877551, recall=0.2903225806451613, fmeasure=0.225)}


In [10]:
# Loop through each ROUGE score and print its values
for key, score in scores.items():
    print(f"{key}:")
    print(f"  Precision: {score.precision:.4f}")
    print(f"  Recall: {score.recall:.4f}")
    print(f"  F-measure: {score.fmeasure:.4f}\n")


rouge1:
  Precision: 0.3878
  Recall: 0.6129
  F-measure: 0.4750

rouge2:
  Precision: 0.0984
  Recall: 0.1558
  F-measure: 0.1206

rougeL:
  Precision: 0.1837
  Recall: 0.2903
  F-measure: 0.2250
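
To make the numbers above easier to read, here is a simplified sketch of how a ROUGE-1 style score can be computed by hand: precision is the overlap divided by the candidate length, recall is the overlap divided by the reference length, and the F-measure is their harmonic mean. The function `simple_rouge1` is a hypothetical helper that uses lowercase whitespace tokens and no stemming, so it will not reproduce `rouge_score` exactly.

```python
from collections import Counter

def simple_rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F1 (no stemming, whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = simple_rouge1("the cat sat on the mat", "the cat lay on the big mat")
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")

# F-measure recomputed from the rouge1 precision and recall reported above.
reported_f = 2 * 0.3878 * 0.6129 / (0.3878 + 0.6129)
print(f"{reported_f:.4f}")
```

The recomputed value agrees with the reported rouge1 F-measure of 0.4750. Precision being lower than recall is consistent with a generated summary that is longer than the reference.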



In [11]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference summary
reference_text = reference_summary

# Summary generated by the model
generated_text = final_summary

# Tokenize the reference and generated texts
# (word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download("punkt"))
reference = [nltk.word_tokenize(reference_text.lower())]  # List of tokenized references
generated = nltk.word_tokenize(generated_text.lower())

# Use a smoothing function
smoothie = SmoothingFunction()

# Calculate BLEU score with smoothing for all n-gram orders
bleu_score = sentence_bleu(reference, generated, smoothing_function=smoothie.method1)
print(f"BLEU score with smoothing: {bleu_score:.4f}")

# Calculate BLEU score for unigrams, bigrams, trigrams, and 4-grams with smoothing
bleu_score_unigrams = sentence_bleu(reference, generated, weights=(1, 0, 0, 0), smoothing_function=smoothie.method1)
bleu_score_bigrams = sentence_bleu(reference, generated, weights=(0, 1, 0, 0), smoothing_function=smoothie.method1)
bleu_score_trigrams = sentence_bleu(reference, generated, weights=(0, 0, 1, 0), smoothing_function=smoothie.method1)
bleu_score_4grams = sentence_bleu(reference, generated, weights=(0, 0, 0, 1), smoothing_function=smoothie.method1)

print(f"BLEU score (unigrams with smoothing): {bleu_score_unigrams:.4f}")
print(f"BLEU score (bigrams with smoothing): {bleu_score_bigrams:.4f}")
print(f"BLEU score (trigrams with smoothing): {bleu_score_trigrams:.4f}")
print(f"BLEU score (4-grams with smoothing): {bleu_score_4grams:.4f}")


BLEU score with smoothing: 0.0564
BLEU score (unigrams with smoothing): 0.3619
BLEU score (bigrams with smoothing): 0.0936
BLEU score (trigrams with smoothing): 0.0263
BLEU score (4-grams with smoothing): 0.0113
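
The first BLEU figure uses the default uniform weights over 1-grams to 4-grams, which amounts to a geometric mean of the four per-order scores. As a rough check, and assuming the same smoothing and brevity penalty apply to every call, the combined value can be approximated from the four rounded numbers printed above. The helper name `geometric_mean` is an assumption for illustration.

```python
import math

# Per-order scores reported above (unigram through 4-gram, smoothed).
reported = [0.3619, 0.0936, 0.0263, 0.0113]

def geometric_mean(values):
    """Exponential of the average log, i.e. a uniform-weight combination."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

combined = geometric_mean(reported)
print(f"{combined:.4f}")
```

The result lands within rounding of the reported 0.0564, which shows how the low trigram and 4-gram scores pull the overall BLEU down even though unigram overlap is moderate.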
