<a href="https://www.kaggle.com/code/nirmit27/gen-ai-intensive-course-capstone-2025q1-project?scriptVersionId=233377375" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# NewsGenius AI

## Overview
Welcome to **NewsGenius AI**, an advanced Capstone Project for the **5-day Gen AI Intensive Course with Google** (March 31 - April 4, 2025). This project summarizes the latest news or YouTube videos on any topic by leveraging generative AI to provide concise, accurate digests of complex information. 📰

### Problem
The information overload in today's digital landscape presents significant challenges for individuals trying to stay informed. Users face difficulties in efficiently processing the overwhelming volume of news articles and video content available across multiple platforms. Traditional consumption methods are time-consuming and often result in missing key insights or context. **NewsGenius AI** addresses this challenge by providing an automated solution that extracts and synthesizes essential information from diverse media sources, enabling users to quickly grasp the core content without sacrificing comprehension.

### Solution
This tool integrates multiple Gen AI capabilities to:
- Efficiently summarize text content from news articles while preserving key information and context.
- Extract and condense important insights from YouTube videos into a well-formatted summary.
- Build a Q&A chatbot leveraging RAG to provide informed answers based on a video summary.

### Gen AI Capabilities
1. **Few-shot Prompting & Structured Output** : Summarization of a news article excerpt with few-shot prompting.
2. **Video Understanding** : Summarization of a YouTube video using the Gemini 2.0 Flash model.
3. **Retrieval augmented generation (RAG)** : Building a Q&A chatbot leveraging RAG to provide informed answers based on a video summary.

> *This notebook has been submitted for the Gen AI Intensive Course Capstone 2025Q1 Project, open April 4 - April 20, 2025.*

## Project setup

### Installing the required dependencies 
... and initializing the environment.

In [None]:
!pip uninstall -qy jupyterlab jupyterlab-lsp

!pip install -qU 'google-genai==1.7.0'
!pip install pytubefix -qU
!pip install chromadb -qU

In [None]:
import os
import re
import requests
from time import sleep
from kaggle_secrets import UserSecretsClient

from pprint import pprint
from IPython.display import Markdown, HTML, display

from google import genai
from google.genai import types
from google.api_core import retry

from pytubefix import YouTube
from pytubefix.cli import on_progress

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings

### Fetching the user secrets

In [None]:
RAPIDAPI_HOST = UserSecretsClient().get_secret("RAPIDAPI_HOST")
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
RAPIDAPI_API_KEY = UserSecretsClient().get_secret("RAPIDAPI_API_KEY")

### Hitting the news API
Fetching the latest news articles about the share market from a third-party API.

In [None]:
article_ids = []
headers = {
	"x-rapidapi-key": RAPIDAPI_API_KEY,
	"x-rapidapi-host": RAPIDAPI_HOST
}

url_ids = f"https://{RAPIDAPI_HOST}/news/v2/list"
querystring_ids = {"size":"10",
               "category":"market-news::all",
               "number":"1",
              }

response = requests.get(url_ids, headers=headers, params=querystring_ids)
response_dict = response.json()

for data in response_dict['data']:
    article_ids.append(data['id'])

article_ids
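
The cell above assumes the API always returns a `data` list. A minimal sketch of a more defensive parser follows; the helper name `extract_article_ids` and the sample payload are hypothetical, and the payload shape is inferred only from how the notebook indexes the response. In the notebook itself, calling `response.raise_for_status()` before `response.json()` would additionally surface HTTP errors early.

```python
from typing import Any


def extract_article_ids(payload: dict[str, Any]) -> list:
    """Return the article IDs from a decoded list response, or raise if the shape is unexpected."""
    items = payload.get("data")
    if not isinstance(items, list):
        raise ValueError(f"Unexpected response shape; top-level keys: {sorted(payload)}")
    # Skip entries without an "id" instead of failing on the whole batch.
    return [item["id"] for item in items if isinstance(item, dict) and "id" in item]


# Hypothetical payload mirroring the {"data": [{"id": ...}, ...]} shape used above.
sample_payload = {"data": [{"id": "101"}, {"id": "102"}, {"title": "entry without an id"}]}
sample_ids = extract_article_ids(sample_payload)
print(sample_ids)
```

Raising on an unexpected shape (for example, a rate-limit message instead of data) makes the failure visible at the request step rather than later as a `KeyError`.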

In [None]:
url_details = f"https://{RAPIDAPI_HOST}/articles/get-details"
querystring_details = {"id":article_ids[-5]}

response = requests.get(url_details, headers=headers, params=querystring_details)
news_article = response.json()['data']['attributes']['content']

In [None]:
def remove_html(text):
    clean_text = re.sub(r'<.*?>', '', text)

    clean_text = clean_text.replace('&amp;', '&')
    clean_text = clean_text.replace('&lt;', '<')
    clean_text = clean_text.replace('&gt;', '>')
    clean_text = clean_text.replace('&quot;', '"')
    clean_text = clean_text.replace('&nbsp;', ' ')
    
    return clean_text.strip()

news_article = remove_html(news_article)
news_article
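
`remove_html` decodes only five entities by hand. As an alternative sketch, the standard library's `html.unescape` decodes all named and numeric entities; the function name `strip_html` is hypothetical, and the tag pattern is still a simple regular expression rather than a real HTML parser.

```python
import html
import re


def strip_html(text: str) -> str:
    """Remove tags first, then decode entities, so escaped markup such as &lt;b&gt; survives as text."""
    no_tags = re.sub(r"<[^>]+>", "", text)
    decoded = html.unescape(no_tags)
    # &nbsp; decodes to a non-breaking space (U+00A0); normalise it to a plain space.
    return decoded.replace("\xa0", " ").strip()


cleaned = strip_html("<p>Stocks &amp; bonds&nbsp;rose &gt; 5%</p>")
print(cleaned)
```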

### Setting up the client for Gemini API

In [None]:
client = genai.Client(api_key=GOOGLE_API_KEY)

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)

model_config_1 = types.GenerateContentConfig(
    temperature=0.1,
    top_p=1,
    max_output_tokens=300,
)

model_name = 'gemini-2.0-flash'
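
The retry wrapper above re-issues a request when the API answers with 429 or 503. The sketch below illustrates the general pattern (a predicate plus exponential backoff) using only the standard library. It is not the `google.api_core.retry.Retry` implementation, which has its own defaults for delays and deadlines; `call_with_retry` and `FakeAPIError` are hypothetical names used for illustration.

```python
import time


def call_with_retry(func, is_retriable, max_attempts=5, initial_delay=1.0,
                    multiplier=2.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff while is_retriable(exc) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on non-retriable errors or once the attempts are exhausted.
            if not is_retriable(exc) or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= multiplier


class FakeAPIError(Exception):
    """Stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


attempts = []


def flaky():
    # Fails twice with a rate-limit error, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError(429)
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(
    flaky,
    lambda e: isinstance(e, FakeAPIError) and e.code in {429, 503},
    sleep=delays.append,
)
print(result, delays)
```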

## Step 1: Few-shot prompting & Structured output
The following code snippet demonstrates summarizing a news article excerpt using the **Gemini 2.0 Flash** model with few-shot prompting.

In [None]:
def summarize_article(article_text):
    examples = [
        {
            "input": """
            The company announced today a new partnership with TechCorp to develop innovative AI solutions for the healthcare industry. This collaboration will leverage the company's expertise in medical imaging and TechCorp's advanced AI algorithms. The joint effort aims to improve diagnostic accuracy and patient outcomes. Clinical trials are expected to begin in the third quarter of this year.
            """,
            "output": """## Article Summary

**Key Points:**
- New partnership between the company and TechCorp
- Focus on AI solutions for healthcare diagnostics
- Combines medical imaging expertise with advanced AI algorithms
- Clinical trials scheduled for Q3 this year

**Impact:** Potential improvements in diagnostic accuracy and patient outcomes"""
        },
        {
            "input": """
            Despite a challenging economic climate, the retailer reported a 5% increase in online sales for the past quarter. Brick-and-mortar store traffic saw a slight decline of 2%, but overall revenue remained stable due to strong e-commerce performance. The company plans to further invest in its online platform and enhance the digital customer experience.
            """,
            "output": """## Article Summary

**Key Points:**
- 5% increase in online sales this quarter
- 2% decline in physical store traffic
- Overall revenue remains stable
- Future investment planned for online platform

**Outlook:** Company shifting focus to enhance digital customer experience amid changing retail patterns"""
        }
    ]
    
    prompt_parts = [
        "Below are examples of news articles and their structured summaries. Use these examples to create a similar well-formatted summary for the provided article:\n\n",
    ]
    
    for example in examples:
        prompt_parts.append(f"ARTICLE:\n{example['input']}\n\nSUMMARY:\n{example['output']}\n\n---\n\n")
    
    prompt_parts.append(f"ARTICLE:\n{article_text}\n\nSUMMARY:")
    prompt = "".join(prompt_parts)
    
    try:
        response_pt1 = client.models.generate_content(
            model=model_name,
            config=model_config_1,
            contents=prompt
        )
        return response_pt1.text
    except Exception as e:
        # Surface the failure reason instead of discarding it.
        return f"## Error Generating Article Summary\n\n`{e}`"
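
The prompt assembly inside `summarize_article` can be inspected without calling the model by isolating it in a pure function. The sketch below is a simplified variant, not the notebook's exact code: `build_few_shot_prompt` is a hypothetical helper, the demo example and article are invented, and it strips surrounding whitespace from each example, which the original does not do.

```python
def build_few_shot_prompt(examples, article_text, instruction):
    """Concatenate an instruction, worked examples, and the new article into one prompt."""
    parts = [instruction.rstrip() + "\n\n"]
    for example in examples:
        parts.append(
            f"ARTICLE:\n{example['input'].strip()}\n\n"
            f"SUMMARY:\n{example['output'].strip()}\n\n---\n\n"
        )
    # End on an open "SUMMARY:" so the model continues in the demonstrated format.
    parts.append(f"ARTICLE:\n{article_text.strip()}\n\nSUMMARY:")
    return "".join(parts)


demo_examples = [
    {
        "input": "Shares of ACME rose 3% after earnings.",
        "output": "## Article Summary\n\n**Key Points:**\n- ACME shares up 3% after earnings",
    }
]
demo_prompt = build_few_shot_prompt(
    demo_examples,
    "Oil prices fell 2% on Monday.",
    "Summarize the article in the same format as the examples:",
)
print(demo_prompt)
```

Printing the prompt before sending it is a cheap way to confirm that the examples and the final article are delimited consistently.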

In [None]:
sample_article = """Donald Trump is facing accusations of market manipulation after posting on social media that it was a “great time to buy” just hours before he made a dramatic U-turn on his trade war that led to big rises in stock markets around the world.Shortly after US markets opened on Wednesday morning, Trump wrote on his social media platform Truth Social: “THIS IS A GREAT TIME TO BUY!!! DJT”.Less than four hours later, he shocked investors by announcing a 90-day pause on additional trade tariffs on most countries except China, sending share indexes soaring.In America the S&P 500 blue chip index closed up by more than 9%, while the technology-focused Nasdaq index shut more than 12% up. Stocks continued to rise in Asia and Europe on Thursday, with Japan’s Nikkei 225 index up by 9%, and London’s FTSE 100 index rising by as much as 4% in early trading.Trump does not usually sign off his post with his initials. Those letters happen to be the same as the ticker for Trump Media & Technology Group, the business that controls Truth Social, whose stock shot up by 22% on Wednesday.The timing of the US president’s posts and subsequent huge share jumps has sparked accusations of market manipulation. The Democratic senator Adam Schiff has called for an investigation, saying: “These constant gyrations in policy provide dangerous opportunities for insider trading.“Who in the administration knew about Trump’s latest tariff flip-flop ahead of time? Did anyone buy or sell stocks, and profit at the public’s expense? I’m writing to the White House – the public has a right to know.”The Democratic senator Chris Murphy also wrote on X that an “insider trading scandal is brewing … Trump’s 9:30am tweet makes it clear he was eager for his people to make money off the private info only he knew. 
So who knew ahead of time and how much money did they make?”The New York Democratic representative Alexandria Ocasio-Cortez called for all members of Congress to disclose any stocks they had bought in the past 24 hours. “I’ve been hearing some interesting chatter on the floor,” she wrote on X. “Disclosure deadline is May 15th. We’re about to learn a few things. It’s time to ban insider trading in Congress.”When asked by US reporters on Wednesday evening when exactly he arrived at his decision to pause the tariffs on most countries for 90 days, Trump said: “For a period of a time. I would say this morning. Over the last few days, I’ve been thinking about it.”However White House officials have argued the shift was part of the strategy all along, with his press secretary, Karoline Leavitt, arguing it was his “art of the deal” at work.Several investors have used volatility in the stock market in recent weeks as a buying opportunity. The US representative for Georgia, Republican and Trump ally Marjorie Taylor Greene, disclosed that she had made several purchases on 3 and 4 April – days when there were sharp market falls after Trump first detailed his “liberation day” tariffs on 2 April – including shares in Amazon.com and Apple. Shares in the technology companies rose by 12% and 15% respectively on Wednesday.While Trump has paused many of the new country-specific tariffs, he has maintained pressure on China, the second biggest economy in the world. He increased the tariff on Chinese imports to 125% from the 104% level that started on Wednesday. Beijing could respond again after hitting US imports with 84% tariffs that began on Thursday."""

Markdown(summarize_article(news_article))

## Step 2: Video understanding
The following code snippets demonstrate the summarization of a YouTube video using the **Gemini 2.0 Flash** model.

### Downloading the video
Downloading a sample video from YouTube in the `.mp4` format.

In [None]:
url = r"https://www.youtube.com/watch?v=c8NEJAsha-s"
 
yt = YouTube(url, on_progress_callback = on_progress)
print(yt.title)
 
ys = yt.streams.get_highest_resolution()
ys.download()

In [None]:
video_filepath = None

for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        if filename.endswith('.mp4'):
            video_filepath = os.path.join(dirname, filename)

video_filepath
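
The directory walk above keeps whichever `.mp4` it visits last, which is arbitrary when several files are present. A possible alternative, sketched with `pathlib`, picks the most recently modified file; the helper name `newest_mp4` and the temporary files in the demo are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def newest_mp4(root):
    """Return the most recently modified .mp4 under root, or None if there is none."""
    candidates = list(Path(root).rglob("*.mp4"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Demo on throwaway files with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.mp4"
    new = Path(tmp) / "nested" / "new.mp4"
    empty_dir = Path(tmp) / "empty"
    new.parent.mkdir()
    empty_dir.mkdir()
    old.write_bytes(b"")
    new.write_bytes(b"")
    os.utime(old, (1_000, 1_000))  # pretend this file was downloaded earlier
    os.utime(new, (2_000, 2_000))
    picked_name = newest_mp4(tmp).name
    empty_result = newest_mp4(empty_dir)

print(picked_name, empty_result)
```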

### Video summarization
Prompting the **Gemini 2.0 Flash** model to summarize the video.

In [None]:
print("Uploading to the File API...")
video_file = client.files.upload(file=video_filepath)
print("Upload complete!")

In [None]:
while video_file.state.name == "PROCESSING":
    print('Waiting for video to be processed...')
    sleep(10)
    video_file = client.files.get(name=video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(f"Video processing failed: {video_file.name}")

print(f"Video processing complete: {video_file.uri}")
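
The polling loop above waits indefinitely if the file never leaves the `PROCESSING` state. A minimal sketch of the same loop with a timeout follows. `wait_until_processed` is a hypothetical helper, and the state strings are passed in by the caller, so the fake state sequence below stands in for repeated `client.files.get(...)` calls.

```python
import time


def wait_until_processed(get_state, timeout_s=600.0, interval_s=10.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it stops returning "PROCESSING", or raise on timeout or failure."""
    deadline = clock() + timeout_s
    state = get_state()
    while state == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        sleep(interval_s)
        state = get_state()
    if state == "FAILED":
        raise ValueError("Processing failed")
    return state


# Fake state sequence; sleeping is skipped so the demo finishes instantly.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_processed(lambda: next(states), sleep=lambda _: None)
print(final_state)
```

In the notebook, `get_state` could be a small lambda that re-fetches the file and returns `video_file.state.name`.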

In [None]:
prompt = """I need you to create a comprehensive summary of this YouTube video transcript. Please analyze the content and provide the output STRICTLY in the format mentioned below:

OUTPUT FORMAT:
## Title: [Extract or infer the title]
- Main Topic: [Identify the central topic/theme of discussion]
- Duration: [If mentioned in transcript]
- Creator/Channel: [If mentioned in transcript]

## Key Points
- Identify 3-5 main points or arguments presented
- List them as bullet points with brief explanations

## Summary
Write a concise 2-3 paragraph summary capturing the essence of the video discussion

## Notable Quotes
Include 1-2 significant quotes from the transcript (if any stand out)

<Do not include any feedback messages like 'Okay, here is the summary...' in the final output.>
"""

model_config_2 = types.GenerateContentConfig(
    temperature=1.0,
    top_p=0.95,
)

response_pt2 = client.models.generate_content(
    model=model_name,
    config=model_config_2,
    contents=[prompt, video_file]
)

Markdown(response_pt2.text)

## Step 3: Retrieval augmented generation (RAG)
This step details building a Q&A chatbot leveraging RAG to provide informed answers based on a video summary.

### Generating the documents
Splitting the generated video summary into section-wise documents.

In [None]:
def split_into_sections(text):
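    # The summary prompt emits "## Title: ..." rather than a top-level "# " heading,
    # so accept either form before falling back to a placeholder.
    main_title_match = re.search(r'^(?:#\s+|##\s+Title:\s*)(.*?)$', text, re.MULTILINE)
    main_title = main_title_match.group(1) if main_title_match else "Untitled Document"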
    
    sections = re.split(r'(?=^##\s+.*?$)', text, flags=re.MULTILINE)
    
    documents = []
    metadatas = []
    
    for section in sections:
        if not section.strip():
            continue
            
        section_title_match = re.search(r'^##\s+(.*?)$', section, re.MULTILINE)
        if not section_title_match:
            continue
            
        section_title = section_title_match.group(1)
        cleaned_section = clean_text(section)
        documents.append(cleaned_section)
        
        metadata = {
            "title": main_title,
            "section": section_title,
            "inferred_topics": infer_topics(section)
        }
        
        source_match = re.search(r'\*\*Source:\*\*\s+(.*?)(?:\n|$)', section, re.MULTILINE)
        if source_match:
            metadata["source"] = source_match.group(1)
        
        metadatas.append(metadata)
    
    return documents, metadatas


def clean_text(text):
    text = re.sub(r'#', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    
    return text


def infer_topics(text):
    topics = []
    bullet_points = re.findall(r'- (.*?)(?:\n|$)', text)
    
    for point in bullet_points:
        clean_point = re.sub(r'\*\*(.*?)\*\*', r'\1', point)
        entities = re.findall(r'[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', clean_point)
        topics.extend(entities)
    
    return list(set(topics))

In [None]:
documents, metadatas = split_into_sections(response_pt2.text)

for document in documents:
    print(document, end='\n\n')
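
To see how the lookahead-based split behaves, here is a simplified, self-contained variant run on a small invented summary. It is a sketch rather than the notebook's `split_into_sections`: the name `split_markdown_sections` is hypothetical, and it returns `(title, body)` pairs without the metadata or topic inference.

```python
import re


def split_markdown_sections(text):
    """Split Markdown into (title, body) pairs, one per level-2 heading."""
    sections = []
    # The zero-width lookahead keeps each "## " heading attached to the chunk it introduces.
    for chunk in re.split(r"(?=^##\s)", text, flags=re.MULTILINE):
        match = re.match(r"##[ \t]+(.*)", chunk)
        if not match:
            continue  # empty chunk or text before the first heading
        title = match.group(1).strip()
        body = " ".join(chunk[match.end():].split())  # collapse whitespace and newlines
        sections.append((title, body))
    return sections


sample_summary = (
    "## Title: Markets Rally\n"
    "- Main Topic: Tariffs\n"
    "\n"
    "## Key Points\n"
    "- Stocks rose\n"
    "- Tariffs paused\n"
)
sample_sections = split_markdown_sections(sample_summary)
for title, body in sample_sections:
    print(f"{title!r}: {body!r}")
```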

### Creating the embeddings
Embedding the sections of the generated video summary and storing them in a Chroma collection for retrieval.

In [None]:
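# generate_content was already wrapped with retry logic during client setup;
# the guard prevents wrapping it a second time if this cell is re-run.
if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
    genai.models.Models.generate_content = retry.Retry(
        predicate=is_retriable)(genai.models.Models.generate_content)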

class GeminiEmbeddingFunction(EmbeddingFunction):
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

In [None]:
DB_NAME = "videosummarydb"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

In [None]:
print(f"DB document count : {db.count()}")

### Retrieval
Finding the relevant documents from the Chroma database.

In [None]:
embed_fn.document_mode = False

query = "How did the stock market respond to President Trump’s pause on parts of his tariff plan?"

result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]

Markdown(all_passages[0])
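
Conceptually, the query step embeds the question and returns the stored document whose vector is closest to it. The toy sketch below illustrates that idea with hand-made three-dimensional vectors and cosine similarity. Both are assumptions for illustration only: real embeddings have far more dimensions, and Chroma's own distance metric and indexing are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, doc_vecs):
    """Return the ID of the document vector most similar to the query vector."""
    return max(doc_vecs, key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]))


# Invented vectors standing in for the embeddings of three summary sections.
toy_docs = {
    "key_points": [0.9, 0.1, 0.0],
    "summary": [0.2, 0.8, 0.1],
    "quotes": [0.0, 0.2, 0.9],
}
toy_query = [0.1, 0.9, 0.2]
best = top_match(toy_query, toy_docs)
print(best)
```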

### Augmented generation
The following code snippet answers the question with the **Gemini 2.0 Flash** model, using the passages retrieved from the Chroma database as context.

In [None]:
query_oneline = query.replace("\n", " ")

prompt = f"""You are a helpful and informative chatbot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
"""

for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

In [None]:
answer = client.models.generate_content(
    model=model_name,
    contents=prompt)

Markdown(answer.text)

### Working demo
A demonstration that simulates a fully-operational chatbot.

In [None]:
def generate_response(query, passages):
    query_oneline = query.replace("\n", " ")
    prompt = f"""You are a helpful and informative chatbot that answers questions using text from the reference passage included below. Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. However, you are talking to a non-technical audience, so be sure to break down complicated concepts and strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.
QUESTION: {query_oneline}
"""
    
    for passage in passages:
        passage_oneline = passage.replace("\n", " ")
        prompt += f"PASSAGE: {passage_oneline}\n"
    
    answer = client.models.generate_content(model=model_name, contents=prompt)
    return answer.text

def chat_loop():
    print("Chatbot Ready 🤖 - Type 'exit' to quit")
    
    while True:
        user_query = input("\nYou: ")
        if user_query.lower() == 'exit':
            print("\nGoodbye! 👋")
            break
            
        result = db.query(query_texts=[user_query], n_results=1)
        all_passages = result["documents"][0]
        
        response = generate_response(user_query, all_passages)
        print(f"\nNewsGenius: {response}")

In [None]:
# chat_loop()
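
Because `chat_loop` reads from `input()`, it cannot be exercised non-interactively. One possible variant, sketched below, takes its input, retrieval, answering, and output steps as parameters so the loop can be driven by a script. The name `run_chat` and the stand-in lambdas are hypothetical; in the notebook they would wrap `input`, `db.query`, `generate_response`, and `print`.

```python
def run_chat(read_input, retrieve, answer, write_output):
    """Run the question/answer loop until the user types 'exit'; return the number of answered turns."""
    turns = 0
    while True:
        user_query = read_input().strip()
        if user_query.lower() == "exit":
            write_output("Goodbye!")
            return turns
        passages = retrieve(user_query)
        write_output(answer(user_query, passages))
        turns += 1


scripted_inputs = iter(["What happened to stocks?", "  EXIT "])
transcript = []
turn_count = run_chat(
    read_input=lambda: next(scripted_inputs),
    retrieve=lambda query: ["Stocks rose sharply."],  # stand-in for the database query
    answer=lambda query, passages: f"Based on the summary: {passages[0]}",  # stand-in for the model call
    write_output=transcript.append,
)
print(turn_count, transcript)
```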

## Conclusion

### How the use of Gen AI fits the use case in question
- **Few-shot Prompting & Structured Output**: Guides the model with worked examples so that news article summaries come back in a consistent, structured format that is easier to process.
- **Video understanding**: Leverages a video file API to upload, process, and generate structured summaries, extracting key information on topics, duration, creator, and notable quotes from the video's content.
- **Retrieval augmented generation (RAG)**: Enhances chatbot responses with external video summaries for more accurate answers.

### Limitations
- **Video Processing API Dependence:** This process relies heavily on the functionality and limitations of the external video file API for uploading, processing, and generating summaries. The quality and availability of the API's features directly affect the chatbot's performance.
- **Structured Output Constraint:** The response quality depends on the model's ability to adhere strictly to the specified output format. The model does not always follow that format perfectly, and any deviation can break downstream processing (such as the section splitting in Step 3) and reduce the usability of the extracted information.
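
One way to detect the format deviations described above is to check the generated summary for the expected headings before splitting it. The sketch below assumes the four `##` headings requested in the Step 2 prompt; `missing_sections` and the sample strings are hypothetical.

```python
import re

REQUIRED_SECTIONS = ("Title", "Key Points", "Summary", "Notable Quotes")


def missing_sections(summary_md, required=REQUIRED_SECTIONS):
    """Return the required level-2 headings that do not appear in the Markdown text."""
    headings = re.findall(r"^##\s+(.+?)\s*$", summary_md, flags=re.MULTILINE)
    # "## Title: ..." carries its value on the heading line, so compare only the part before a colon.
    found = {heading.split(":", 1)[0].strip().lower() for heading in headings}
    return [name for name in required if name.lower() not in found]


well_formed = (
    "## Title: Markets Rally\n- Main Topic: Tariffs\n\n"
    "## Key Points\n- Stocks rose\n\n"
    "## Summary\nMarkets recovered.\n\n"
    "## Notable Quotes\n\"A great time to buy.\"\n"
)
malformed = "Okay, here is the summary:\n\n## Key Points\n- Stocks rose\n\n## Summary\nMarkets recovered.\n"

print(missing_sections(well_formed))
print(missing_sections(malformed))
```

A non-empty result could trigger a retry of the generation call or a fallback to treating the whole summary as a single document.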

### Future Scope
- Improving the robustness and accuracy of the structured summary generation by experimenting with different models and prompt engineering techniques. Additionally, exploring more advanced caching strategies and integrating the system with other video platforms would enhance its scalability and versatility.
- Integrating this chatbot with various communication platforms (e.g., Slack, Discord) and incorporating it into larger workflows for video content management and knowledge sharing. Furthermore, exploring techniques for automatically identifying and summarizing trending videos would enable proactive knowledge discovery and dissemination.

<!-- Check out my [blogpost](#) for more details on this project! -->