LLM-Powered Blog Summarizer: Exploring Cutting-Edge Techniques
With so much information online, being able to grasp a long blog post quickly matters. Large language models (LLMs) are changing how we read and condense large amounts of text. This project, "LLM-Powered Blog Summarizer: Exploring Cutting-Edge Techniques," explores how these models handle that task.

The goal is to extract the useful information from long blog posts and summarize it accurately and efficiently using three pretrained models: BART, T5, and Pegasus.

Each model works in its own way, so the notebook generates a short summary of the same blog with BART, T5, and Pegasus, then compares the results to find out what works best.

The summaries are also compared against a reference using cosine similarity, which helps show whether the summarization techniques are working well.
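The cosine similarity check can be illustrated with a minimal sketch. This section does not show how the texts are vectorized, so the version below assumes plain bag-of-words counts; the function name `cosine_similarity_bow` and the two sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity_bow(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw word counts."""
    # Lowercase and split into alphanumeric word tokens.
    counts_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))

    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    # Euclidean norms of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, not taken from the blog.
reference = "citizen science lets the public take part in research"
candidate = "the public can take part in research through citizen science"
print(round(cosine_similarity_bow(reference, candidate), 3))
```

A score near 1 means the two texts use nearly the same words in similar proportions, and a score of 0 means they share no words. Word counts ignore word order and synonyms, so this is a rough signal rather than a full quality measure.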



In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Extracting Blog Content
Read the content of the blog: open the source file, read its contents, and store them in a variable to work with later.

In [11]:
# --- import re,string ----
import re
import string

file_path = './blog.txt'
#--- Read in blog.txt file ----
with open(file_path,'r', encoding='utf-8') as file:
  blog_content = file.read()

#--- Inspect Blog txt ---
blog_content

"The Rise of Citizen Science: How Everyday People Are Contributing to Research\nCitizen science is the involvement of the general public in scientific research. It's a collaborative effort where anyone, regardless of their scientific background, can participate in collecting and analyzing data. This approach has gained significant momentum in recent years, driven by several factors:\n\nAdvancement in technology: Online platforms and mobile apps have made it easier than ever for individuals to contribute to research projects. These tools allow for data collection, analysis, and collaboration on a global scale.\nGrowing public interest in science: There's an increasing public desire to understand and contribute to scientific advancements. Citizen science projects provide accessible entry points for people to engage with science in a meaningful way.\nNeed for broader data collection: Many scientific fields require large datasets to address complex challenges. Citizen science projects can 

## Cleaning Blog Text
Clean up the text to make it easier to work with: convert it to lower case and remove any special characters.




In [12]:
# Convert the blog text to lower case
text = blog_content.lower()

def remove_special_characters(text):
    # Define the pattern to match special characters
    pattern = r'[^a-zA-Z0-9\s]'

    # Use the sub() function to replace matches with an empty string
    cleaned_text = re.sub(pattern, '', text)

    # Remove extra spaces by replacing consecutive spaces with a single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)

    return cleaned_text.strip()
cleaned_text = remove_special_characters(text)

#--- Inspect cleaned_text ---
cleaned_text

'the rise of citizen science how everyday people are contributing to research citizen science is the involvement of the general public in scientific research its a collaborative effort where anyone regardless of their scientific background can participate in collecting and analyzing data this approach has gained significant momentum in recent years driven by several factors advancement in technology online platforms and mobile apps have made it easier than ever for individuals to contribute to research projects these tools allow for data collection analysis and collaboration on a global scale growing public interest in science theres an increasing public desire to understand and contribute to scientific advancements citizen science projects provide accessible entry points for people to engage with science in a meaningful way need for broader data collection many scientific fields require large datasets to address complex challenges citizen science projects can help gather vast amounts 
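To make the effect of the cleaning step easier to see, here is a small worked example on a short string. The function repeats the two regular expressions from the cell above; the `sample` string is hypothetical and not part of the blog.

```python
import re


def remove_special_characters(text: str) -> str:
    # Drop every character that is not a letter, digit, or whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse runs of whitespace (including newlines) into one space.
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned.strip()


sample = "It's a Test -- with  punctuation!\nNew line."
print(remove_special_characters(sample.lower()))
# its a test with punctuation new line
```

Apostrophes and sentence-ending punctuation are removed along with everything else, which is why the cleaned blog text above contains "its" and "theres" and no sentence boundaries.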

## Summarizing Blog Content
Create a summary of the blog using BART. First, load the BART tokenizer and model, then tokenize the cleaned text and feed it into the model to generate a summary. The summary will capture the main points of the blog in a concise form. Once generated, save the summary to a file for future reference.

In [13]:
pip install transformers



In [None]:
# --- import BartTokenizer, TFBartForConditionalGeneration from transformers ---
from transformers import TFBartForConditionalGeneration, BartTokenizer

# Load the tokenizer and model
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Tokenize the input text
inputs = tokenizer.encode(
    cleaned_text,
    return_tensors="tf",
    max_length=1024,
    truncation=True
)
# Generate the summary
summary_ids = model.generate(
    inputs,
    max_length=300,
    num_beams=4,
    early_stopping=True
)
# Decode the generated summary
summary_bart = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Export the generated text to "summary_bart.txt"
output_filepath = "./summary_bart.txt"
with open(output_filepath, "w", encoding="utf-8") as output_file:
    output_file.write(summary_bart)

# Display the summary
summary_bart
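The tokenizer call above uses `truncation=True` with `max_length=1024`, so any tokens beyond that limit are dropped before the model sees them. For blogs longer than the limit, one possible workaround is to split the text into pieces and summarize each piece separately. The sketch below is an assumption, not part of the notebook: `split_into_chunks` is a hypothetical helper, and it counts words, which is only a rough proxy for tokens. The default of 400 words is an arbitrary illustrative value.

```python
def split_into_chunks(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = split_into_chunks("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed through the same tokenize and generate steps, with the partial summaries joined afterwards.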


## Alternative Blog Summarization with T5 Model
Summarize the blog content using T5. As before, load the T5 tokenizer and model, then tokenize the cleaned text and feed it into the model to generate a summary. The summary will capture the main points of the blog in a concise form. Once generated, save the summary to a file for future reference.



In [15]:
pip install SentencePiece



In [None]:
# --- import T5Tokenizer, TFT5ForConditionalGeneration from transformers ---

from transformers import T5Tokenizer, TFT5ForConditionalGeneration

# Load the tokenizer and model

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')

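# Tokenize the input text. T5 is a text-to-text model that expects a task
# prefix, so "summarize: " is prepended to the cleaned text.
inputs = tokenizer.encode(
    "summarize: " + cleaned_text,
    return_tensors="tf",
    max_length=1024,
    truncation=True
)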

# Generate the summary
summary_ids = model.generate(
    inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode the generated summary
summary_t5 = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Export the generated text to "summary_t5.txt"
output_filepath = "summary_t5.txt"
with open(output_filepath, "w", encoding="utf-8") as output_file:
    output_file.write(summary_t5)

# Display the summary
summary_t5

## Employing Pegasus Model for Blog Summarization
Pegasus is the third method. Load the Pegasus tokenizer and model, then tokenize the cleaned text and feed it into the model to generate a summary. The summary gives a condensed version of the blog's key points. Once generated, save the summary to a file for future reference.

In [None]:
# --- import PegasusTokenizer, TFPegasusForConditionalGeneration from transformers ---
from transformers import TFPegasusForConditionalGeneration, PegasusTokenizer
# Load the tokenizer and model
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')
model = TFPegasusForConditionalGeneration.from_pretrained('google/pegasus-large')
# Tokenize the input text
inputs = tokenizer.encode(
    cleaned_text,
    return_tensors="tf",
    max_length=1024,
    truncation=True,
)

# Generate the summary

summary_ids=model.generate(
    inputs,
    max_length=150,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True
)

# Decode the generated summary
summary_pegasus = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Export the generated text to "summary_pegasus.txt"
output_filepath = "summary_pegasus.txt"
with open(output_filepath, "w", encoding="utf-8") as output_file:
    output_file.write(summary_pegasus)

# Display the summary
summary_pegasus

All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

Some layers of TFPegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
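
The BART, T5, and Pegasus cells repeat the same four steps: tokenize, generate, decode, and save. As a sketch of how that pattern could be factored out, the hypothetical helper below takes an already loaded tokenizer and model as arguments. The name `summarize_and_save` and its parameters are assumptions for illustration; the `encode`, `generate`, and `decode` calls mirror the ones used in the cells above.

```python
from pathlib import Path


def summarize_and_save(text, tokenizer, model, output_path,
                       max_input_length=1024, max_summary_length=150,
                       num_beams=4):
    """Tokenize, generate, decode, and save a summary; return the summary."""
    # Encode the text, truncating to the model's input limit.
    inputs = tokenizer.encode(
        text,
        return_tensors="tf",
        max_length=max_input_length,
        truncation=True,
    )
    # Generate with beam search, as in the cells above.
    summary_ids = model.generate(
        inputs,
        max_length=max_summary_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode the first (and only) sequence back to text.
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # Write the summary with an explicit encoding.
    Path(output_path).write_text(summary, encoding="utf-8")
    return summary


# Example call with the BART checkpoint used earlier (downloads the model):
# from transformers import BartTokenizer, TFBartForConditionalGeneration
# tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
# mdl = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
# summary_bart = summarize_and_save(
#     cleaned_text, tok, mdl, "summary_bart.txt", max_summary_length=300
# )
```

Passing the tokenizer and model in explicitly keeps each model's inputs paired with its own tokenizer, so inputs from one model's cell cannot be reused by another by accident.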
