
# **Fake News Forensic – Detecting & Summarizing Misinformation with GenAI**

This notebook demonstrates how to detect and summarize misinformation (so-called “fake news”) by combining:

1. **Embeddings** to transform text into vector representations.
2. A **Vector Store** (in this example, using FAISS) to enable similarity search.
3. **Retrieval-Augmented Generation (RAG)** techniques to ground outputs from a Large Language Model (LLM) in factual data.

We will walk through each step—from data loading and vector database creation to prompting a language model for fact-checking responses. Feel free to **experiment** by changing prompts, using different LLMs, or customizing the data!

**Tip:** Before running the first cell, enable the **GPU T4 x2** accelerator for faster embedding and index building: **Settings > Accelerator > GPU T4 x2**.


# **1. Introduction**

Misinformation spreads rapidly online, and manual fact-checking cannot always keep pace. In this project, we show how **Generative AI** methods can help:

1. **Find** relevant fact-check statements from a curated dataset (using embeddings + a vector database).
2. **Generate** short, evidence-based explanations (using a language model like OpenAI’s GPT).
3. **Structure** responses in a user-friendly format, such as JSON or bullet-point summaries.

Our example uses the **ISOT Fake News Dataset** (which labels articles as true or false) and an OpenAI model to produce short explanations referencing the retrieved evidence.


# **2. Libraries & Setup**

We first need to install and import the libraries that will power our pipeline:

- **sentence-transformers**: For creating text embeddings.
- **LangChain** & **ChromaDB**: Common libraries for building LLM applications (LangChain is installed below, but this notebook works with FAISS directly and imports neither).
- **faiss-gpu**: A vector store for similarity search.
- **pandas**, **numpy**: For data manipulation.
- **torch**: Underlying framework (used by sentence-transformers).
- **openai**: To interact with OpenAI’s GPT models.

You may need to restart the environment after installation finishes. Let’s go ahead and set things up.

In [1]:
!pip install -q --no-cache-dir sentence-transformers
!pip install -q --no-cache-dir langchain 
!pip install -q --no-cache-dir faiss-gpu

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.

In [2]:


import pandas as pd
import numpy as np
import torch

# For embeddings (SentenceTransformers)
from sentence_transformers import SentenceTransformer

# For vector database, example: FAISS or Chroma
import faiss  # or use Chroma or another



We import the necessary libraries for data manipulation (**pandas**, **numpy**), generating embeddings (**SentenceTransformers**), creating a vector index (**FAISS**), and working with **PyTorch** (which powers the transformer model).


**Obtain an API Key**

To use OpenAI’s API (e.g., GPT-3.5 or GPT-4), you will need an API key from [https://platform.openai.com/](https://platform.openai.com/). Note that a **paid plan** or billing information might be required, depending on how many requests you make and the rate limits that apply to your account. If you are only testing small requests, free trial credit (where available) might suffice, but consistent or larger-scale usage requires a paid account.

In a Kaggle notebook or a local Jupyter environment, you can store the key in an environment variable or a secrets manager. Below, we demonstrate retrieving it from Kaggle’s `UserSecretsClient`.


In [3]:
# For language model calls (OpenAI, Hugging Face, or local)
# Example with OpenAI:

import os
import openai
from kaggle_secrets import UserSecretsClient

OPENAI_API_KEY = UserSecretsClient().get_secret("OPENAI_API_KEY")
openai.api_key = OPENAI_API_KEY



# ...any additional imports...
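
Outside Kaggle, the `kaggle_secrets` module is not available. A minimal sketch of a fallback, assuming the key is also exported as an environment variable of the same name; `load_secret` is a hypothetical helper and not part of the original notebook:

```python
import os

def load_secret(name, default=None):
    """Read a secret from Kaggle Secrets when possible, else from the environment.

    `load_secret` is a hypothetical helper introduced for illustration.
    """
    try:
        # kaggle_secrets only exists inside Kaggle notebooks
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Not on Kaggle, or the secret is not attached: use the environment instead
        return os.environ.get(name, default)

# Usage (assumes OPENAI_API_KEY is set in one of the two places):
# openai.api_key = load_secret("OPENAI_API_KEY")
```

Catching a broad `Exception` keeps the sketch short; in a real project you would probably want to log which source the key came from.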


# **3. Data Loading & Preprocessing**

In this section, we’ll load our **ISOT Fake News Dataset**, which consists of two CSV files:
- `Fake.csv` for misinformation articles
- `True.csv` for real news articles

We then combine them into a single DataFrame, adding a column `'label'` indicating whether the text is **true** or **false**. You can swap in **your own** data here to create a customized fact-checking pipeline.


In [4]:
# 1) Read the CSVs from the ISOT Fake News Dataset
df_fake = pd.read_csv('/kaggle/input/isot-fake-news-dataset/Fake.csv')
df_true = pd.read_csv('/kaggle/input/isot-fake-news-dataset/True.csv')

# 2) Assign labels
df_fake['label'] = 'false'
df_true['label'] = 'true'

# 3) Concatenate into a single DataFrame
df = pd.concat([df_fake, df_true], ignore_index=True)

# 4) Display 3 rows from each label to get a quick look
df_subset = df.groupby('label').head(3)
df_subset



| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | false |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | false |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | false |
| 23481 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | true |
| 23482 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | true |
| 23483 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | true |


You can see a snippet of the data above. Each row contains the article's `title`, `text` (the article body), `subject`, and `date`, plus the `label` we added to indicate **true** or **false**.


# **4. Building & Populating the Vector Store**

In this stage, we will:

1. **Generate embeddings** for each article/claim using a Sentence Transformers model.
2. **Store** those embeddings in a FAISS index.

A vector store allows us to quickly find articles similar to any new query. This is critical for **retrieval-augmented generation** because we can feed the top-matching articles to the language model to ground its responses in factual data.

---

**Tip: Enable GPU for Faster Embedding & Index Building**

- In the **Kaggle** Notebook, go to **Settings** on the right side of the screen and switch the Hardware Accelerator to **GPU**.  
- If `torch.cuda.is_available()` returns `True`, you can specify `device='cuda'` when creating your `SentenceTransformer` model:




In [5]:
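# Choose a sentence-transformers model (lightweight example).
# Use the GPU if one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)

# Create a list of texts to embed (here, the full article bodies) and their labels
texts = df['text'].tolist()
labels = df['label'].tolist()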

# Generate embeddings
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print("Embeddings shape:", embeddings.shape)


Embeddings shape: (44898, 384)


Here, we use the **all-MiniLM-L6-v2** model, which produces 384-dimensional vector embeddings. If you prefer a different model (e.g., `all-mpnet-base-v2` or a multilingual model), simply replace the string in `SentenceTransformer(...)`.
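
One caveat worth keeping in mind: by default, `all-MiniLM-L6-v2` truncates inputs longer than 256 word pieces, so for long articles only the opening portion contributes to the embedding. A common workaround is to split each article into overlapping chunks and embed those instead. The sketch below is one way to do that; `chunk_words` and its default sizes are illustrative assumptions, not part of the original notebook, and word counts only approximate the model's token counts.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping windows of at most `max_words` words.

    Hypothetical helper: the window sizes are illustrative defaults only.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

If you embed chunks rather than whole articles, remember to keep a mapping from each chunk back to its source row so that labels can still be looked up after retrieval.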


**4.2 Creating a FAISS Index**

Next, we create a **FAISS** index to store these embeddings. FAISS supports efficient similarity search, letting us retrieve the top-k most similar items for any new query.

You could use other vector databases (e.g., **Chroma**, **Milvus**, **Pinecone**, etc.) if you prefer.


In [6]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"FAISS index size: {index.ntotal}")

FAISS index size: 44898


The index is now created and populated. We can quickly query it with any embedding vector to find similar texts from our dataset.
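
To make the search step less of a black box: `IndexFlatL2` performs an exact, brute-force comparison against every stored vector and ranks results by squared Euclidean distance. The pure-Python sketch below reproduces that idea on toy data. `flat_l2_search` is a hypothetical name, and the code illustrates the computation only; it is not how FAISS is implemented.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`.

    Distances are squared Euclidean distances, the quantity IndexFlatL2 reports.
    """
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]

toy_vectors = [[0, 0], [1, 0], [3, 4]]
print(flat_l2_search(toy_vectors, [1, 1], k=2))
# ([1, 2], [1, 0])
```

Because every query touches every vector, this approach is exact but scales linearly with the dataset size, which is acceptable for roughly 45,000 articles.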


# **5. Retrieval Augmented Generation (RAG)**

With the vector store in place, we can **retrieve** the top relevant articles for a new claim/headline. Then, we’ll feed both the user’s query and the retrieved evidence into an LLM to generate a short, fact-based explanation.

This approach helps **reduce hallucinations** by giving the model actual reference text for each claim. Let’s define some helper functions next.


**5.1 Example: Single Query with RAG**

Below, we:
1. Create a function to **retrieve** the top-k similar entries using the FAISS index.
2. Create a function to **generate** a short explanation by referencing those retrieved entries in the LLM prompt.
3. Test the workflow with an example query (e.g., `"Vaccines cause autism."`).

You can swap in **any query** you’d like here. Also, if you want to use GPT-4 or a different model, update the `model` parameter in the `openai.chat.completions.create(...)` call.


In [7]:

import openai

def retrieve_similar_texts(query, k=3):
    """
    Given a query (string), return top k similar text entries
    from the FAISS index.
    """
    query_emb = embedding_model.encode([query])
    distances, indices = index.search(query_emb, k)

    results = []
    for idx in indices[0]:
        item_text = df.loc[idx, 'text']
        item_label = df.loc[idx, 'label']
        results.append((item_text, item_label))
    return results

def generate_explanation(query, retrievals):
    """
    Use the retrieved text & label to produce a short explanation
    via ChatCompletion in openai>=1.0.0.
    """

    references_str = "\n".join([f"Claim: {rt[0]} -- Label: {rt[1]}" for rt in retrievals])

    messages = [
        {
            "role": "system",
            "content": "You are a fact-checking assistant."
        },
        {
            "role": "user",
            "content": f"""
The user says: '{query}'

We found these fact-check references:
{references_str}

Based on these, is the user's claim likely true or false?
Provide a concise explanation (2-3 sentences) referencing the evidence above.
            """
        }
    ]

    # For openai>=1.0.0, we use the 'chat.completions.create' endpoint
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  # or gpt-4, etc.
        messages=messages,
        max_tokens=150,
        temperature=0
    )

    # Access the content by attribute, not by subscripting
    return response.choices[0].message.content.strip()

# Test call
query_test = "Vaccines cause autism."
retrieved = retrieve_similar_texts(query_test, k=3)
llm_explanation = generate_explanation(query_test, retrieved)

print("----Retrieved Fact Checks----")
for i, (txt, lbl) in enumerate(retrieved):
    print(f"{i+1}) [Label: {lbl}] {txt[:80]}...")

print("\n----LLM Explanation----")
print(llm_explanation)


----Retrieved Fact Checks----
1) [Label: false] Because idiots haven t harmed America enough yet.An anti-vaxxer propaganda film ...
2) [Label: false] A 12-year-old boy has just gone viral for absolutely shredding every anti-vaxxer...
3) [Label: false] Donald Trump says he surrounds himself with  the best people  but obviously, tha...

----LLM Explanation----
The claim that vaccines cause autism is false. Multiple sources, including scientific studies and medical professionals, have debunked the link between vaccines and autism. Promoting this misinformation has led to a dangerous anti-vaccination movement that has resulted in preventable disease outbreaks and poses a threat to public health.


The prompt to the LLM includes both the user’s claim and the top matching dataset entries (labeled news articles, which stand in for fact-check references here). Feel free to **tweak** the instructions to get a more structured output, a longer explanation, or a bullet-point summary. This is a key part of **prompt engineering**.
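
Because each retrieved entry is a full article body, the prompt can grow large. One option is to trim every reference before building the prompt. The sketch below assumes the same `(text, label)` tuples returned by `retrieve_similar_texts`; `format_references` and the 500-character limit are illustrative assumptions rather than part of the original notebook.

```python
def format_references(retrievals, max_chars=500):
    """Format (text, label) pairs as numbered, length-limited reference lines.

    Hypothetical helper: max_chars is an arbitrary illustrative limit.
    """
    lines = []
    for n, (text, label) in enumerate(retrievals, start=1):
        snippet = " ".join(text.split())  # collapse newlines and repeated spaces
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"[{n}] (label: {label}) {snippet}")
    return "\n".join(lines)

print(format_references([("a  b\nc", "false")], max_chars=3))
# [1] (label: false) a b...
```

Numbering the references also makes it easier to ask the model to cite which entry supports its explanation.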


# **6. Demonstration of Additional Capabilities**

Below is an example of **few-shot prompting** to ensure that the LLM outputs data in a consistent format (e.g., JSON). This can be helpful if you want to parse the output programmatically later on.


**Few-Shot Prompting Example**

The prompt below shows the model one example response in JSON and then asks it to fill in the same structure for a new claim.

In [8]:
import openai

demo_prompt = """
Below are examples of how to respond to fact-check queries in JSON:

Example 1:
{
  "label": "false",
  "explanation": "This claim is contradicted by evidence in X and Y..."
}

Now, follow that format exactly:

User Claim: "5G towers spread COVID-19."
Evidence: "Claim: 5G networks cause coronavirus. Label: false"

{
   "label": 
   "explanation":
}
"""

# In openai>=1.0.0, you call openai.chat.completions.create(...)
response_json = openai.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": demo_prompt
        }
    ],
    max_tokens=100,
    temperature=0
)

# Access the content via object attributes, not dict keys
print(response_json.choices[0].message.content)



{
  "label": "false",
  "explanation": "This claim that 5G towers spread COVID-19 is false. There is no scientific evidence to support the idea that 5G networks cause coronavirus."
}


Here, we gave the model a single worked example in the prompt (strictly speaking, one-shot prompting; add more examples for a true few-shot prompt) so that it responds in the same JSON format, demonstrating structured output.
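
Since the point of JSON output is to parse it programmatically, it helps to validate the reply before trusting it, because models sometimes wrap the JSON in extra text. A minimal sketch, assuming the reply contains a single JSON object with `label` and `explanation` keys; `parse_verdict` is a hypothetical helper, not part of the original notebook.

```python
import json

def parse_verdict(raw):
    """Extract and validate a {"label", "explanation"} object from a model reply.

    Returns a cleaned dict, or None if the reply cannot be parsed or validated.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = str(data.get("label", "")).strip().lower()
    if label not in {"true", "false"}:
        return None
    return {"label": label, "explanation": str(data.get("explanation", "")).strip()}

reply = 'Sure: {"label": "False", "explanation": " No evidence. "}'
print(parse_verdict(reply))
# {'label': 'false', 'explanation': 'No evidence.'}
```

Returning `None` on failure lets the caller decide whether to retry the request or fall back to the raw text.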

# **7. Live News Detection with Gemini**

Our current model only works on static data. But what if we want to fact-check **breaking news** or **recent headlines**?

Let’s use:
- **NewsAPI** to pull recent articles (e.g., about AI)
- **Gemini** to analyze the truthfulness of these headlines

This allows us to fact-check claims in **real time**.



In [9]:
# Install required libraries (skip if already installed)
!pip install --upgrade pip -q
!pip install newsapi-python -q
!pip install google-generativeai -q







In [10]:
# Setup API keys and libraries
import os
import google.generativeai as genai
from newsapi import NewsApiClient
from kaggle_secrets import UserSecretsClient

# Get your Gemini and NewsAPI keys (from Kaggle Secrets)
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
NEWS_API_KEY = UserSecretsClient().get_secret("NEWS_API_KEY")

genai.configure(api_key=GOOGLE_API_KEY)
newsapi = NewsApiClient(api_key=NEWS_API_KEY)


In [11]:
# List available models (must be authenticated first)
import google.generativeai as genai
genai.configure(api_key=GOOGLE_API_KEY)

for model in genai.list_models():
    print(model.name)


models/chat-bison-001
models/text-bison-001
models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-1.5-flash-8b-exp-0827
models/gemini-1.5-flash-8b-exp-0924
models/gemini-2.5-pro-exp-03-25
models/gemini-2.5-pro-preview-03-25
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
models/gemini-2.0-flash-thinking-exp-01

In [12]:
# 🗞️ Function to fetch live news headlines
def fetch_latest_news(topic="technology", language="en", page_size=5):
    articles = newsapi.get_everything(q=topic, language=language, sort_by="publishedAt", page_size=page_size)
    # NewsAPI may return null for title or description, so fall back to an empty string
    return [(article.get('title') or '') + ". " + (article.get('description') or '') for article in articles['articles']]


In [13]:
# 🤖 Use Gemini to fact-check a headline (without grounding)
def gemini_fact_check(claim_text):
    """
    Uses Gemini to determine whether a claim is true, false, or uncertain.
    """
    model = genai.GenerativeModel('models/gemini-1.5-pro')  # Stable and widely supported
    
    prompt = f"""You are a helpful AI that checks if the following recent news claim is likely accurate, partially true, or false.

Claim: "{claim_text}"

Without guessing, briefly explain your reasoning in 2–3 sentences. If you're unsure, say so clearly."""
    response = model.generate_content(prompt)
    return response.text


In [14]:
# 🧪 Example: Fact-checking 3 live AI news headlines
live_headlines = fetch_latest_news(topic="AI", page_size=3)

for i, headline in enumerate(live_headlines):
    print(f"\n📌 Headline {i+1}: {headline}")
    result = gemini_fact_check(headline)
    print(f"🤖 Gemini Response:\n{result}")



📌 Headline 1: Canada is lagging in innovation, and that’s a problem for funding the programs we care about. If Canadians want to fund education, health care and climate adaptation, Canada must grow its economy. And to do that, it needs smarter innovation policy.
🤖 Gemini Response:
Partially true.  While Canada has been criticized for lagging innovation in some sectors compared to other developed nations (impacting potential economic growth), it's unclear if this is the *sole* or primary impediment to funding social programs.  Connecting innovation directly to the funding of these specific programs simplifies a complex relationship between economic performance, government budgets, and policy priorities.


📌 Headline 2: DNA testing in Mason helps patients find mental health medications faster. A Mason-based company's genetic testing helps patients find effective mental health medications faster, potentially eliminating months of trial and error.
🤖 Gemini Response:
Partially true. While 

> ✅ **Tip:** The more specific the topic you search (e.g., "ChatGPT security", "AI legislation"), the more focused your fact-checking pipeline becomes.


# **8. Enhancing Accuracy: Google Search Grounding**

Gemini is powerful, but it can still hallucinate or guess if it's not grounded in real-world context.
To boost accuracy, we’ll combine it with **Google Programmable Search** to pull top search snippets.

---

## Setup: Get Your Google Search API Key & Search Engine ID (CX)

Follow these steps to enable live search:

### Step 1: Create a Programmable Search Engine
1. Go to [programmablesearchengine.google.com](https://programmablesearchengine.google.com/)
2. Click "Create a search engine"
3. Under "Sites to search," enter:
   `www.google.com`
4. Name it anything (e.g., "NewsVerifier")
5. Click Create

### Step 2: Get Your Search Engine ID (CX)
- Go to [My Search Engines](https://programmablesearchengine.google.com/cse/all)
- Click your engine and copy the Search engine ID (your CX)

### Step 3: Get Your Google API Key
1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
2. Create or select a project
3. Go to APIs & Services > Credentials
4. Click "+ Create Credentials > API key" and copy the result

### Step 4: Enable the Custom Search API
1. In the Cloud Console, go to Library
2. Search for "Custom Search API"
3. Click it → Click "Enable"

---

### You will need two values:
- `GOOGLE_SEARCH_API_KEY` → from Google Cloud Console
- `GOOGLE_SEARCH_ENGINE_ID` (CX) → from the Programmable Search dashboard

Store them securely using `UserSecretsClient()` in Kaggle or environment variables locally.


In [15]:
# Google Programmable Search setup
import requests

# Get your Google Search credentials (from Kaggle Secrets)
GOOGLE_SEARCH_API_KEY = UserSecretsClient().get_secret("GOOGLE_SEARCH_API_KEY")
GOOGLE_SEARCH_ENGINE_ID = UserSecretsClient().get_secret("GOOGLE_SEARCH_ENGINE_ID")

# Search Google and return top search result snippets (with title)
def google_search_snippets(query, num_results=5, site_filter=None):
    """
    Uses Google Custom Search API to retrieve top search result snippets.
    
    Args:
        query (str): The search term or claim.
        num_results (int): Number of search results to return.
        site_filter (str, optional): Domain to restrict search (e.g., "snopes.com").
        
    Returns:
        List[str]: Formatted list of title + snippet strings.
    """
    search_query = f"site:{site_filter} {query}" if site_filter else query
    url = "https://www.googleapis.com/customsearch/v1"

    params = {
        "key": GOOGLE_SEARCH_API_KEY,
        "cx": GOOGLE_SEARCH_ENGINE_ID,
        "q": search_query,
        "num": num_results
    }

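    response = requests.get(url, params=params, timeout=10)  # avoid hanging on a slow response
    items = response.json().get("items", [])  # "items" is absent when there are no results

    # Some results lack a snippet, so default to an empty string
    return [f"{item.get('title', '')}: {item.get('snippet', '')}" for item in items]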


**What This Cell Does**  
- Imports the `requests` library to handle HTTP requests.  
- Retrieves **Google Search credentials** (`GOOGLE_SEARCH_API_KEY` and `GOOGLE_SEARCH_ENGINE_ID`) via `UserSecretsClient`.
- Defines a function `google_search_snippets(query, num_results=5, site_filter=None)`:
  - Constructs a query URL for the Google Custom Search API.
  - Executes the search by sending an HTTP GET request.
  - Extracts each result’s `title` and `snippet` into a list of formatted strings.

**Suggestions / Tips**  
- If you want to limit your search to a specific domain, pass a domain as `site_filter` (e.g., `"snopes.com"`). We will do this next.  
- Adjust `num_results` to retrieve more or fewer search results as needed.  
- Be mindful of API rate limits from Google.
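
One simple way to stay within rate limits is to cache results so that a repeated claim does not trigger a second request. The sketch below wraps any search function that has the same signature as `google_search_snippets`; `make_cached_search` is a hypothetical helper, and the in-memory dictionary only lasts for the current session.

```python
def make_cached_search(search_fn):
    """Wrap a search function so identical queries are only sent once.

    `search_fn` is assumed to accept (query, num_results=..., site_filter=...).
    """
    cache = {}

    def cached_search(query, num_results=5, site_filter=None):
        key = (query, num_results, site_filter)
        if key not in cache:
            cache[key] = search_fn(query, num_results=num_results, site_filter=site_filter)
        return cache[key]

    return cached_search

# Usage (assumes google_search_snippets is defined as in the cell above):
# cached_google_search = make_cached_search(google_search_snippets)
# cached_google_search("5G towers spread COVID-19", site_filter="snopes.com")
```

Note that empty result lists are cached too, so clear the cache if an empty result was caused by a temporary API error.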

In [16]:
def grounded_fact_check_with_snopes_and_gemini(claim_text):
    """
    Uses Google search on snopes.com, politifact.com, and the general web for evidence,
    then sends structured evidence to Gemini for grounded fact-checking.
    """

    # Step 1: Retrieve search evidence
    snopes_evidence = google_search_snippets(claim_text, site_filter="snopes.com", num_results=5)
    politifact_evidence = google_search_snippets(claim_text, site_filter="politifact.com", num_results=5)
    general_evidence = google_search_snippets(claim_text, num_results=5)

    # Step 2: Combine all search snippets
    all_evidence = snopes_evidence + politifact_evidence + general_evidence

    # Format the snippets as bullet points
    context = "\n".join([f"- {snippet}" for snippet in all_evidence])

    # Step 3: Prompt Gemini with reasoning steps
    prompt = f"""
You are a fact-checking assistant analyzing the following claim.

Claim: "{claim_text}"

Below is real-world evidence gathered from Snopes, PolitiFact, and general search results:
{context}

---
Your task:

1. Summarize the key points from the evidence.
2. Analyze whether the evidence supports or contradicts the claim.
3. Give a final verdict using ONE of the following labels:

    - True
    - False
    - Misleading
    - Lacks Evidence

Then explain your reasoning in 2–3 sentences and state your confidence level (High, Medium, or Low).
    """

    # Step 4: Generate Gemini response
    model = genai.GenerativeModel('models/gemini-1.5-pro')
    response = model.generate_content(prompt)
    
    return response.text.strip()



**What This Cell Does**  
- Implements `grounded_fact_check_with_snopes_and_gemini`.
  1. Performs targeted searches on **snopes.com**, **politifact.com**, and the general web using the `google_search_snippets` function defined in the previous cell.
  2. Combines the returned snippets into a single list, then formats them as bullet points.
  3. Constructs a prompt directing Gemini to:
     - Summarize key points from the evidence.
     - Decide if the claim is `True`, `False`, `Misleading`, or `Lacks Evidence`.
     - Provide a brief explanation with a stated confidence level.
  4. Calls **Gemini** (`genai.GenerativeModel('models/gemini-1.5-pro')`) to generate the AI’s verdict.

**Suggestions / Tips**  
- You can extend or customize the prompt for a different style of output, such as bullet points or JSON.  
- Consider adding more sources (e.g., fact-checking websites) to improve reliability.  
- If certain sources are more trustworthy for your domain, prioritize them or filter out irrelevant domains.

In [17]:
topics = ["AI", "Politics", "Healthcare", "Climate", "Finance", "Education", "Technology", "Environment"]
live_headlines = []

# Try fetching 1 headline per topic until you get 5 total
for topic in topics:
    if len(live_headlines) >= 5:
        break
    try:
        headlines = fetch_latest_news(topic=topic, page_size=1)
        if headlines:
            live_headlines.extend(headlines)
    except Exception as e:
        print(f"Error fetching topic '{topic}': {e}")




**What This Cell Does**  
- Creates a list of **topics** (`"AI"`, `"Politics"`, `"Healthcare"`, etc.).  
- Fetches one headline per topic using `fetch_latest_news`, aiming to collect **5 total** headlines.  
- Catches any exceptions raised while fetching (for instance, API errors) and skips topics that return no articles.

**Suggestions / Tips**  
- Adjust the topics to match your specific interests (e.g., `"Sports"`, `"Economy"`, or `"Entertainment"`).  
- Increase `page_size` if you want more headlines per topic.  
- If you’re working on a specialized domain, replace `fetch_latest_news` with your own custom data source.
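
Because the same article can match several topics, the collected list may contain repeats. A minimal sketch of order-preserving de-duplication, assuming exact matches after lowercasing and whitespace normalization are enough; `dedupe_headlines` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_headlines(headlines):
    """Drop empty and repeated headlines while keeping the original order."""
    seen = set()
    unique = []
    for headline in headlines:
        key = " ".join(headline.lower().split())  # normalize case and spacing
        if key and key not in seen:
            seen.add(key)
            unique.append(headline)
    return unique

print(dedupe_headlines(["AI news.  Big update", "ai news. big update", "Other story"]))
# ['AI news.  Big update', 'Other story']
```

Near-duplicates (for example, the same story with a slightly different description) would need fuzzier matching than this.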

In [18]:
for i, headline in enumerate(live_headlines):
    print(f"{i+1}. {headline}\n")


1. Canada is lagging in innovation, and that’s a problem for funding the programs we care about. If Canadians want to fund education, health care and climate adaptation, Canada must grow its economy. And to do that, it needs smarter innovation policy.

2. Why the USS John F. Kennedy Comes to Port Later than Expected. The second USS John F. Kennedy was scheduled for delivery in 2022 but fell behind due to delays caused by the pandemic; now, the vessel is ninety-five percent complete and has a contract delivery date of July 2025.
The post Why the USS John F. Kennedy Comes t…

3. Trump Wants to End Head Start While Boosting Military Spending to Record $1 Trillion. Critics on Monday decried the Trump administration's consideration of a budget proposal that would completely eliminate funding for the early childhood education program Head Start—which serves over 800,000 low-income U.S. families—while increasing military s…

4. Canada is lagging in innovation, and that’s a problem for funding

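This cell simply **displays** each headline from the `live_headlines` list with an index. Note that the same article can appear more than once when it matches several topics, as headlines 1 and 4 appear to do here.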

Looking at what you've generated, what do you think would be the verdict?

In [19]:
for i, headline in enumerate(live_headlines):
    print("=" * 100)
    print(f"\nHeadline {i+1}:\n{headline}\n")

    verdict = grounded_fact_check_with_snopes_and_gemini(headline)
    print(f"GEMINI Verdict:\n{verdict}")
    print("=" * 100)




Headline 1:
Canada is lagging in innovation, and that’s a problem for funding the programs we care about. If Canadians want to fund education, health care and climate adaptation, Canada must grow its economy. And to do that, it needs smarter innovation policy.

GEMINI Verdict:
1. **Summary of Key Points:** The provided evidence repeats the claim itself across several sources and mentions the importance of economic growth for funding social programs and climate adaptation. However,  it lacks specific data or analysis supporting the assertion that Canada is lagging in innovation.  There's tangential mention of Canada's Green Building Strategy, but it doesn't directly address innovation performance.

2. **Analysis:** The evidence primarily reiterates the claim without providing substantiating information.  While the logic of the claim—that economic growth fueled by innovation can fund important programs—is reasonable, there's no factual backing provided for the core assertion about Canad

This cell prints Gemini’s **verdict** and reasoning for each headline, completing a “live” fact-check based on up-to-date search evidence.

**Suggestions / Tips**  
- Enhance trust by storing the raw evidence or displaying it alongside the verdict.  
- If Gemini’s verdict is too general, refine the prompt instructions in `grounded_fact_check_with_snopes_and_gemini`.  
- For repeated queries or large batches of headlines, be aware of API usage and potential costs/limits.
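
To act on the first tip, one option is to keep the claim, the raw evidence, and the verdict together in a single record and write the records out as JSON Lines. This is a sketch under assumptions: `FactCheckRecord` and `to_jsonl` are hypothetical names, and the original pipeline would need to return the evidence list alongside the verdict for this to work.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class FactCheckRecord:
    """One fact-check result with the evidence that produced it (hypothetical)."""
    claim: str
    evidence: list = field(default_factory=list)
    verdict: str = ""

def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)

record = FactCheckRecord(
    claim="Example claim",
    evidence=["Title A: snippet A"],
    verdict="Lacks Evidence",
)
print(to_jsonl([record]))
# {"claim": "Example claim", "evidence": ["Title A: snippet A"], "verdict": "Lacks Evidence"}
```

Storing the evidence next to the verdict makes it possible to audit a result later without repeating the search.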

# **9. Results & Observations**

This section highlights the **strengths** and **limitations** of the pipeline.

### What Works Well
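- **Relevant Retrieval**: In the examples tried here, the vector store returned articles on the same topic as the query, although retrieval quality was not measured systematically.
- **LLM Grounding**: Supplying retrieved articles and search snippets gives the model concrete evidence to reference, which is intended to reduce hallucinations.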
- **Prompt Flexibility**: Can adapt format for different use cases (JSON, bullet points, etc.).

### Limitations
- **Static Dataset**: The base system doesn’t update with new claims unless extended.
- **LLM Bias**: Explanations rely on prompt quality and search evidence.
- **Language Restriction**: Default setup works only for English.

### Suggestions
- Swap in your own dataset for domain-specific tasks.
- Use different LLMs (Claude, GPT-4, LLaMA).
- Deploy as a web app or Slack bot.



# **10. Conclusion & Future Work**

In this notebook, we built a pipeline for detecting and summarizing misinformation using:

- **Sentence Transformers** for embeddings.
- **FAISS** for fast similarity search.
- **OpenAI GPT-3.5** and **Gemini** for generating reasoned explanations.
- **NewsAPI & Google Search** to enable live and grounded fact-checking.

### Possible Next Steps
- Add **multilingual support**.
- Improve UI/UX for non-technical users.
- Deploy via web dashboard or chatbot.
- Add **search filters** (for example, news within 24 hours).
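
As a starting point for the last item, the sketch below computes a start and end timestamp for "the last N hours". It assumes that the `newsapi-python` client's `get_everything` accepts `from_param` and `to` arguments in this timestamp format, which you should confirm against the library's documentation; `last_hours_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def last_hours_window(hours=24, now=None):
    """Return (start, end) timestamps covering the last `hours` hours in UTC."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), now.strftime(fmt)

fixed_now = datetime(2025, 4, 15, 12, 0, 0, tzinfo=timezone.utc)
print(last_hours_window(24, now=fixed_now))
# ('2025-04-14T12:00:00', '2025-04-15T12:00:00')

# Possible usage (argument names are an assumption, see above):
# start, end = last_hours_window(24)
# newsapi.get_everything(q="AI", language="en", from_param=start, to=end)
```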

Feel free to customize and expand this project!


# **11. References & Acknowledgments**

1. **ISOT Fake News Dataset**
2. [Snopes](https://www.snopes.com/)
3. [PolitiFact](https://www.politifact.com/)
4. [SentenceTransformers Documentation](https://www.sbert.net/)
5. [FAISS GitHub](https://github.com/facebookresearch/faiss)
6. [OpenAI API Docs](https://platform.openai.com/docs)
7. [NewsAPI](https://newsapi.org/)
8. [Google Programmable Search](https://programmablesearchengine.google.com/)

Thanks for exploring this with me! 
