# High-Level Overview of Semantic Search and Summarization Pipeline

This notebook combines semantic search and LLM-based summarization to retrieve and summarize news articles. The overview below outlines the key steps implemented in the Python code that follows.

## 1. **Dataset Loading and Preprocessing**
   - The pipeline uses the **CNN/DailyMail dataset** (version `3.0.0`) of news articles.
   - The first 50 rows of the training split are selected to keep the run small.
   - A **SentenceTransformer model** (`all-mpnet-base-v2`) is loaded to generate embeddings for the article text.

## 2. **Vectorization and Embedding**
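   - Each article is encoded into a dense vector by the SentenceTransformer model, which handles tokenization internally (`all-mpnet-base-v2` produces 768-dimensional embeddings). Text beyond the model's maximum sequence length is truncated, so very long articles are represented by their opening portion only.
   - These embeddings make it possible to compare texts by semantic similarity in the retrieval step.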
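
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hypothetical toy values, not output of `all-mpnet-base-v2`, and the function name `cosine_similarity` is chosen for illustration. The collection in this notebook is created with ChromaDB's default settings, so the metric it actually applies may differ from cosine.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values for illustration only)
royal_article = [0.9, 0.1, 0.0]
sports_article = [0.0, 0.2, 0.9]
royal_query = [0.8, 0.2, 0.1]

# The query should score higher against the article pointing the same way
print(cosine_similarity(royal_query, royal_article))
print(cosine_similarity(royal_query, sports_article))
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.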

## 3. **Semantic Search with ChromaDB**
   - A **ChromaDB client** is initialized to store and query vectorized data.
   - Articles and their embeddings are stored in a collection named `news_collection`.
   - A semantic search function allows querying the database with a given topic and retrieves the most relevant articles based on vector similarity.
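
Conceptually, the retrieval step is a nearest-neighbor lookup. The sketch below does this by brute force over a small in-memory list, using squared Euclidean distance so that smaller values mean closer matches. It is an illustration under stated assumptions, not ChromaDB's implementation: the function name `top_k_by_squared_l2` and the toy vectors are hypothetical.

```python
import heapq

def top_k_by_squared_l2(query, vectors, k=2):
    """Return (index, distance) pairs for the k stored vectors closest to query."""
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pair each distance with its index so the heap can rank by distance
    scored = ((squared_l2(query, vec), idx) for idx, vec in enumerate(vectors))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]

# Toy vector store (hypothetical values for illustration only)
store = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.3, 0.1],
]
query = [0.8, 0.2, 0.1]

hits = top_k_by_squared_l2(query, store, k=2)
for idx, dist in hits:
    print(f"stored vector {idx}: distance {dist:.3f}")
```

A vector database replaces the linear scan with an index so the lookup stays fast as the collection grows.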

## 4. **LLM-Powered Summarization**
   - A **Cohere** large language model (LLM), accessed through LangChain, generates concise summaries.
   - A custom prompt template asks for accuracy, brevity, and strict adherence to the article's content.
   - The prompt follows a one-shot approach: it includes a single example article together with its expected summary.
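
The structure of a one-shot prompt can be sketched with plain string formatting, independent of any LLM library. The helper `build_one_shot_prompt` and the constant `INSTRUCTIONS` are hypothetical names used for illustration; the notebook itself builds its prompt with LangChain's `PromptTemplate`.

```python
INSTRUCTIONS = "Write a concise summary (5 short sentences, max 10 words each)."

def build_one_shot_prompt(example_article, example_summary, article):
    """Assemble instructions, one worked example, and the target article."""
    return (
        f"{INSTRUCTIONS}\n\n"
        f"Example 1:\n{example_article}\n\n"
        f"Summary: {example_summary}\n\n"
        "Task: Generate a summary for the following article.\n"
        f"Article: {article}\n"
        "Summary:"
    )

# Placeholder strings stand in for real article text
prompt = build_one_shot_prompt(
    example_article="<example article text>",
    example_summary="<example summary>",
    article="<article to summarize>",
)
print(prompt)
```

Ending the prompt with `Summary:` mirrors the worked example, which nudges the model to continue in the same format.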

## 5. **End-to-End Query and Summarization**
   - Topics are used as input queries for semantic search.
   - The retrieved articles are summarized using the LLM with the custom prompt.
   - Human-written summaries from the dataset are included for comparison with the generated summaries.

## 6. **Output and Evaluation**
   - The pipeline produces three outputs for each topic:
     - The full article retrieved through semantic search.
     - The human-written summary from the dataset.
     - The machine-generated summary from the LLM.
   - Results are printed for each topic to allow for qualitative evaluation.
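
The notebook compares summaries by reading them side by side. If a rough numeric signal were also wanted, one option would be a word-overlap score between the human-written and generated summaries. The sketch below computes a unigram F1 in the spirit of ROUGE-1; it is a simplified stand-in, not the official ROUGE implementation, and `unigram_f1` is a hypothetical helper that does not appear in the notebook.

```python
import re
from collections import Counter

def unigram_f1(reference, candidate):
    """F1 over word overlap between a reference and a candidate summary."""
    ref_tokens = re.findall(r"\w+", reference.lower())
    cand_tokens = re.findall(r"\w+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up sentences for illustration only
score = unigram_f1("the actor turns 18", "actor turns 18 on monday")
print(f"unigram F1: {score:.2f}")
```

Overlap scores reward shared wording rather than factual accuracy, so they complement the qualitative comparison without replacing it.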

## 7. **Key Features**
   - **Semantic Search:** Retrieval is based on vector embeddings, so articles can match a query by meaning rather than by exact keywords.
   - **LLM Integration:** Cohere is used to generate short, readable summaries.
   - **Scalability:** The modular design allows the dataset size, embedding model, or summarization technique to be changed independently.
   - **Custom Prompting:** The prompt instructs the model to stay within the article's content, which is intended to reduce hallucination.

## 8. **Use Case Example**
   Topics such as the following are processed through semantic search and summarization, returning a short summary of the best-matching article for each query:
   - *Prince Harry's Tribute to Princess Diana on the 10th Anniversary of Her Passing.*
   - *Zoe's Ark Accused of Child Trafficking in Chad Amid Adoption Controversy.*


This pipeline provides a robust and modular foundation for applications in content summarization, semantic search, and natural language understanding.


In [None]:
!pip install chromadb
!pip install langchain
!pip install langchain-community
!pip install transformers datasets scikit-learn sentence-transformers
!pip install cohere

In [2]:
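import getpass
import os

# Prompt for the key instead of hardcoding it, so the secret is never stored in the notebook
if not os.environ.get("COHERE_API_KEY"):
    os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API key: ")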

In [3]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions
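# The Cohere wrapper lives in langchain-community (installed above); langchain.llms is deprecated
from langchain_community.llms import Cohere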
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [None]:
# Step 1: Load the CNN/DailyMail dataset
from datasets import load_dataset
dataset_name = "cnn_dailymail"  
version = "3.0.0"  
data = load_dataset(dataset_name, version)


# Use the first 50 rows of the training split for the vectorized dataset
full_data = data['train'].select(range(50))

# Preprocessing and Embedding with SentenceTransformer
model_name = "all-mpnet-base-v2"
embedding_model = SentenceTransformer(model_name)

In [5]:
# Step 2: Vectorize the news articles
def preprocess_and_embed(data):
    texts = [item['article'] for item in data]
    embeddings = embedding_model.encode(texts, convert_to_tensor=True)
    return texts, embeddings

texts, embeddings = preprocess_and_embed(full_data)


In [6]:
# Step 3: Initialize ChromaDB for semantic search
client = Client(Settings())


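# Recreate the collection if it already exists.
# list_collections() returns Collection objects in some ChromaDB versions and
# plain names in others, so normalize to names before comparing.
existing_collections = [getattr(col, "name", col) for col in client.list_collections()]

if "news_collection" in existing_collections:
    client.delete_collection(name="news_collection")
collection = client.create_collection(name="news_collection")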


# Add data to ChromaDB
for idx, (text, embedding) in enumerate(zip(texts, embeddings)):
    collection.add(
        ids=[str(idx)],  
        documents=[text],
        metadatas=[{"id": str(idx)}],  
        embeddings=[embedding.tolist()],
    )

In [7]:
# Step 4: Define Semantic Search Retriever
def semantic_search(query, top_k=2):
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return results
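
ChromaDB's `query` returns a dictionary whose `ids`, `documents`, `metadatas`, and `distances` entries each hold one inner list per query, which is why later cells index with `[0][0]`. The sketch below flattens that shape into one record per hit. `flatten_hits` is a hypothetical convenience that is not part of the notebook, and `mock_results` is a hand-written stand-in rather than real query output.

```python
def flatten_hits(results, query_index=0):
    """Turn a ChromaDB-style query result into a list of per-hit dicts."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": i, "document": doc, "metadata": meta, "distance": dist}
        for i, doc, meta, dist in zip(ids, documents, metadatas, distances)
    ]

# Hand-written stand-in for one query with two hits (hypothetical values)
mock_results = {
    "ids": [["7", "3"]],
    "documents": [["<article 7 text>", "<article 3 text>"]],
    "metadatas": [[{"id": "7"}, {"id": "3"}]],
    "distances": [[0.42, 0.87]],
}

for hit in flatten_hits(mock_results):
    print(hit["id"], hit["distance"])
```

With real output, `flatten_hits(semantic_search(topic))` would give one dictionary per retrieved article, assuming the default `include` fields are returned.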

In [8]:
# Step 5: Model and Prompt 

llm = Cohere()

# one-shot learning example
examples = """

Write a concise summary (5 short sentences, max 10 words each):
- Adhere strictly to the information in the article.
- Avoid hallucinations or additions beyond the content.
- Focus on key points only effectively and accurately.

Example 1:
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a 
reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists 
the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter 
and the Order of the Phoenix" To the disappointment of gossip columnists around the 
world, the young actor says he has no plans to fritter his cash away on fast cars, drink, 
and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, 
suddenly buy themselves a massive sports car collection or something similar," he told an 
Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. 
The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." 
At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror 
film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. 
Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no 
comment on his plans. "I'll definitely have some sort of party," he said in an interview. 
"Hopefully none of you will be reading about it." Radcliffe's earnings from the first five Potter 
films have been held in a trust fund which he has not been able to touch. Despite his growing fame 
and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking 
to say 'kid star goes off the rails,'" he told reporters last month. "But I try very hard not to 
go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry 
Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will 
reprise the role in the last two films. Watch I-Reporter give her review of Potter's latest » . 
There is life beyond Potter, however. The Londoner has filmed a TV movie called "My Boy Jack," 
about author Rudyard Kipling and his son, due for release later this year. He will also appear in 
"December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he 
made his stage debut playing a tortured teenager in Peter Shaffer's "Equus." Meanwhile, he is braced 
for even closer media scrutiny now that he's legally an adult: "I just think I'm going to be more 
sort of fair game," he told Reuters. E-mail to a friend. Copyright 2007 Reuters. All rights reserved.
This material may not be published, broadcast, rewritten, or redistributed.

Summary: Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

Task: Generate a summary for the following article.
"""

prompt_template = PromptTemplate(
    input_variables=["article"],
    template=f"{examples}\nArticle: {{article}}\nSummary:"
)

chain = LLMChain(llm=llm, prompt=prompt_template)




In [9]:
# Step 6: End-to-End Query and Summarization
def generate_summary(topics):
    summaries = []
    articles = []
    human_summaries = []
    for idx, topic in enumerate(topics):
        # Perform semantic search
        search_results = semantic_search(topic, top_k=1)

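        try:
            # Each result field holds one inner list per query; take the top hit
            article = search_results['documents'][0][0]
            metadata_id = search_results['metadatas'][0][0]['id']
            human_summary = full_data[int(metadata_id)]['highlights']
        except (IndexError, KeyError, ValueError) as e:
            # Report the skipped topic instead of failing silently.
            # Note: a skipped topic shifts the pairing with `topics` when printing.
            print(f"Skipping topic {idx} ({topic!r}): {e}")
            continue
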

        generated_summary = chain.invoke({"article": article})
        articles.append(article)
        human_summaries.append(human_summary)
        summaries.append(generated_summary)

    return articles, human_summaries, summaries

In [10]:
# Example Usage
topics = [
    "Prince Harry's Tribute to Princess Diana on the 10th Anniversary of Her Passing",
    "Zoe's Ark Accused of Child Trafficking in Chad Amid Adoption Controversy"
]


# Generate summaries
articles, human_summaries, generated_summaries = generate_summary(topics)


# Print results
print("\nSample Outputs:")
for idx, (article, human_summary, generated_summary, topic) in enumerate(zip(articles, human_summaries, generated_summaries, topics), start=1):
    print(f"\n{'='*60}")
    print(f"Topic {idx}: {topic}")
    print(f"{'='*60}")
    print(f"\nArticle {idx} (Full Content):\n{'-'*20}\n{article}\n")
    print(f"Human-Written Summary {idx}:\n{'-'*20}\n{human_summary}\n")
    print(f"Generated Summary {idx}:\n{'-'*20}")
    
    if isinstance(generated_summary, dict):
        summary_text = generated_summary.get("text", "")
    else:
        summary_text = generated_summary

    for sentence in summary_text.split('. '):
        print(f"- {sentence.strip()}\n")
    print(f"{'='*60}\n")




Sample Outputs:

Topic 1: Prince Harry's Tribute to Princess Diana on the 10th Anniversary of Her Passing

Article 1 (Full Content):
--------------------
LONDON, England (CNN) -- Prince Harry led tributes to Diana, Princess of Wales on the 10th anniversary of her death, describing her as "the best mother in the world" in a speech at a memorial service. Here is his speech in full: . William and I can separate life into two parts. There were those years when we were blessed with the physical presence beside us of both our mother and father. Princes Harry and William greet guests at a thanksgiving service in memory of their mother. And then there are the 10 years since our mother's death. When she was alive, we completely took for granted her unrivaled love of life, laughter, fun and folly. She was our guardian, friend and protector. She never once allowed her unfaltering love for us to go unspoken or undemonstrated. She will always be remembered for her amazing public work. But behind t