# **Fake News Forensic – Detecting & Summarizing Misinformation with GenAI**

# **1. Introduction**
In this project, we address the growing problem of misinformation by leveraging generative AI techniques. Our system retrieves fact-check references from a curated database and uses large language models to generate short, grounded explanations of whether a new claim or headline is likely true or false.

We demonstrate three generative AI capabilities:

1. Embeddings + Vector Store (to store and retrieve similar fact-check statements)

2. Retrieval Augmented Generation (RAG) (to ground AI outputs with actual fact-check snippets)

3. Few-Shot Prompting (to control output style and ensure consistent structure)

# **2. Libraries & Setup**

In [None]:
!pip install -q sentence-transformers langchain chromadb faiss-gpu "openai>=1.0.0"

In [None]:
!pip freeze | grep widgets
!pip freeze | grep jupyter


In [None]:


import pandas as pd
import numpy as np
import torch

# For embeddings (SentenceTransformers)
from sentence_transformers import SentenceTransformer

# For vector database, example: FAISS or Chroma
import faiss  # or use Chroma or another



We import the libraries for data manipulation (pandas), creating embeddings (SentenceTransformers), and building a vector index (FAISS). The OpenAI client used to call the language model is imported in the next code cell.

**Obtain an API Key**

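If you’re calling OpenAI’s API, you need an API key from [platform.openai.com](https://platform.openai.com).

You can assign it directly with `openai.api_key = "YOUR_KEY_HERE"` in the notebook, but a hard-coded key is easy to leak when the notebook is shared.

Alternatively, store it as a Kaggle secret (or an environment variable) and load it at runtime: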

In [None]:
# For language model calls (OpenAI, Hugging Face, or local)
# Example with OpenAI:

import os
import openai
from kaggle_secrets import UserSecretsClient

OPENAI_API_KEY = UserSecretsClient().get_secret("OPENAI_API_KEY")
openai.api_key = OPENAI_API_KEY



# ...any additional imports...
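
Outside Kaggle, where `kaggle_secrets` is not available, the key can be read from an environment variable instead. The sketch below is one hedged way to do that: the helper name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` simply mirrors the secret name used above.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return key


# Example with an explicit mapping (a placeholder value, not a real key)
print(load_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the first API call.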


# **3. Data Loading & Preprocessing**

In [None]:
# 1) Read the CSVs from the ISOT Fake News Dataset
df_fake = pd.read_csv('/kaggle/input/isot-fake-news-dataset/Fake.csv')
df_true = pd.read_csv('/kaggle/input/isot-fake-news-dataset/True.csv')

# 2) Assign labels
df_fake['label'] = 'false'
df_true['label'] = 'true'

# 3) Concatenate into a single DataFrame
df = pd.concat([df_fake, df_true], ignore_index=True)

# 4) Display 3 rows from each label to get a quick look
df_subset = df.groupby('label').head(3)
df_subset



Here’s a quick look at our dataset. The ISOT files provide `title` and `text` columns for the headline and article body, and we add a `label` column (“true” or “false”). No additional cleaning is applied in this notebook; the raw `text` values go straight to the embedding step.

# **4. Building & Populating the Vector Store**
**4.1 Generating Embeddings**

In [None]:
# Choose a sentence-transformers model (lightweight example);
# fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)

# Create a list of texts to embed (full article bodies here; use df['title'] for headlines)
texts = df['text'].tolist()
labels = df['label'].tolist()

# Generate embeddings
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print("Embeddings shape:", embeddings.shape)


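We used a pre-trained model (all-MiniLM-L6-v2) to convert each text into a 384-dimensional embedding vector. By default this model truncates inputs longer than 256 word pieces, so for long articles only the opening portion contributes to the embedding.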
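
Because the index built in the next step ranks neighbours by L2 distance, it helps to know how that relates to cosine similarity. For unit-length vectors, squared L2 distance equals 2 - 2*cos, so both measures produce the same ranking. The sketch below checks this on toy 3-dimensional vectors, which are made up for illustration; whether the real embeddings are unit-length depends on the model configuration and is not verified here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for real 384-dimensional embeddings
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(squared_l2(a, b), 2 - 2 * cosine(a, b))
```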

**4.2 Creating a FAISS Index**

In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"FAISS index size: {index.ntotal}")

We’re using a flat FAISS index (`IndexFlatL2`), which performs exact, exhaustive L2 search over all stored vectors. The embeddings are now stored in a form that supports retrieval of the top-k most similar items.
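
To make the retrieval step concrete, the sketch below reproduces in plain Python what a flat L2 index does conceptually: compare the query against every stored vector and keep the k closest. The function name `flat_l2_search` and the toy vectors are illustrative; like `IndexFlatL2`, it reports squared distances.

```python
import heapq


def flat_l2_search(vectors, query, k=3):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns a list of (squared_distance, index) pairs, closest first.
    """
    scored = (
        (sum((x - q) ** 2 for x, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional "embeddings"
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
for dist, idx in flat_l2_search(stored, query=[1.0, 1.0], k=2):
    print(idx, round(dist, 3))
```

Exhaustive search costs time proportional to the number of stored vectors per query, which is usually acceptable at the scale of a single dataset like this one.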

# **5. Retrieval Augmented Generation (RAG)**
Given a new claim or headline, we will:

1. Convert it to an embedding with the same model.

2. Search in our FAISS index for the most similar text(s).

3. Retrieve relevant fact-check info (label, snippet).

4. Send the combined prompt to a language model (e.g., OpenAI) to generate a short, grounded explanation.

**5.1 Example: Single Query with RAG**

In [None]:

import openai

def retrieve_similar_texts(query, k=3):
    """
    Given a query (string), return top k similar text entries
    from the FAISS index.
    """
    query_emb = embedding_model.encode([query])
    distances, indices = index.search(query_emb, k)

    results = []
    for idx in indices[0]:
        item_text = df.loc[idx, 'text']
        item_label = df.loc[idx, 'label']
        results.append((item_text, item_label))
    return results

def generate_explanation(query, retrievals):
    """
    Use the retrieved text & label to produce a short explanation
    via ChatCompletion in openai>=1.0.0.
    """

    references_str = "\n".join([f"Claim: {rt[0]} -- Label: {rt[1]}" for rt in retrievals])

    messages = [
        {
            "role": "system",
            "content": "You are a fact-checking assistant."
        },
        {
            "role": "user",
            "content": f"""
The user says: '{query}'

We found these fact-check references:
{references_str}

Based on these, is the user's claim likely true or false?
Provide a concise explanation (2-3 sentences) referencing the evidence above.
            """
        }
    ]

    # For openai>=1.0.0, we use the 'chat.completions.create' endpoint
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  # or gpt-4, etc.
        messages=messages,
        max_tokens=150,
        temperature=0
    )

    # Access the content by attribute, not by subscripting
    return response.choices[0].message.content.strip()

# Test call
query_test = "Vaccines cause autism."
retrieved = retrieve_similar_texts(query_test, k=3)
llm_explanation = generate_explanation(query_test, retrieved)

print("----Retrieved Fact Checks----")
for i, (txt, lbl) in enumerate(retrieved):
    print(f"{i+1}) [Label: {lbl}] {txt[:80]}...")

print("\n----LLM Explanation----")
print(llm_explanation)


Observe how we incorporate both the user’s query and the top matching references into a single prompt. This steers the LLM toward answering from the retrieved data rather than from memory alone, which helps reduce hallucinations but does not eliminate them.
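
Since `df['text']` holds full articles, passing them verbatim can make the prompt very long. One possible mitigation, sketched below, is to truncate each retrieved text before formatting the reference lines. The helper name `format_references` and the character limit are assumptions chosen for illustration, not values from the notebook.

```python
def format_references(retrievals, max_chars=300):
    """Format (text, label) pairs as prompt lines, truncating long texts."""
    lines = []
    for text, label in retrievals:
        snippet = " ".join(text.split())  # collapse newlines and repeated spaces
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"Claim: {snippet} -- Label: {label}")
    return "\n".join(lines)


# Made-up retrieval results in the same (text, label) shape used above
sample = [("A very long article body. " * 50, "false"), ("Short claim.", "true")]
print(format_references(sample, max_chars=40))
```

The returned string could stand in for `references_str` inside `generate_explanation`; a token-based limit would be more precise than a character count, at the cost of an extra tokenizer dependency.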

# **6. Demonstration of Additional Capabilities**

**Few-Shot Prompting Example**

We might want the explanation in a more structured JSON format or a “bullet-point” style. Let’s do a few-shot approach to ensure consistent formatting.

In [None]:
import openai

demo_prompt = """
Below are examples of how to respond to fact-check queries in JSON:

Example 1:
{
  "label": "false",
  "explanation": "This claim is contradicted by evidence in X and Y..."
}

Now, follow that format exactly:

User Claim: "5G towers spread COVID-19."
Evidence: "Claim: 5G networks cause coronavirus. Label: false"

{
   "label": 
   "explanation":
}
"""

# In openai>=1.0.0, you call openai.chat.completions.create(...)
response_json = openai.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": demo_prompt
        }
    ],
    max_tokens=100,
    temperature=0
)

# Access the content via object attributes, not dict keys
print(response_json.choices[0].message.content)



Here, we gave a small “few-shot” style prompt (a single example in this case) so the model responds in a consistent JSON format, demonstrating the few-shot prompting capability listed in the introduction.
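
Even with a few-shot prompt, the model's reply is still plain text, so it is worth validating before use. The sketch below parses a reply with the standard `json` module and checks the two expected keys; `parse_fact_check` is a hypothetical helper, and the sample strings are made up rather than real model output.

```python
import json

ALLOWED_LABELS = {"true", "false"}


def parse_fact_check(raw):
    """Parse a model reply into {'label', 'explanation'} or return None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = str(data.get("label", "")).lower()
    explanation = data.get("explanation")
    if label not in ALLOWED_LABELS or not isinstance(explanation, str):
        return None
    return {"label": label, "explanation": explanation.strip()}


print(parse_fact_check('{"label": "False", "explanation": "Contradicted by the evidence."}'))
print(parse_fact_check("Sorry, I cannot answer in JSON."))
```

Returning `None` on any malformed reply lets the caller decide whether to retry the request or fall back to the free-text explanation.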

# **7. Results & Discussion**
**Observations:**

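1. The system retrieves the nearest dataset entries for a given claim; how relevant they are depends on how well the ISOT dataset covers the topic.

2. LLM explanations are conditioned on the references we supply, which keeps them tied to the retrieved evidence.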

**Limitations:**

* If the claim is novel or not in our database, the system might produce uncertain or less accurate results.

* We rely on the original data’s correctness; if the fact-check source is biased or incomplete, our output inherits that limitation.
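
One hedged way to handle novel claims is to treat retrievals whose distance exceeds a threshold as "no evidence" rather than passing them to the LLM. The sketch below filters distances and indices in the shape returned for one query by a FAISS search; `filter_by_distance` is a hypothetical helper, and the threshold of 1.0 is an arbitrary placeholder that would need tuning on held-out queries.

```python
def filter_by_distance(distances, indices, max_distance=1.0):
    """Keep only neighbours whose distance is at most `max_distance`.

    `distances` and `indices` are parallel sequences for a single query,
    e.g. distances[0] and indices[0] from index.search(query_emb, k).
    Entries with index -1 (no neighbour found) are skipped as well.
    """
    return [
        (int(idx), float(dist))
        for dist, idx in zip(distances, indices)
        if idx != -1 and dist <= max_distance
    ]


# Toy values standing in for one row of FAISS search output
kept = filter_by_distance([0.2, 0.9, 1.7], [12, 40, 7], max_distance=1.0)
print(kept if kept else "No sufficiently similar reference found.")
```

When the filtered list is empty, the pipeline could report that no matching reference was found instead of asking the model for a verdict.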

# **8. Conclusion & Future Work**
We demonstrated a pipeline that harnesses embeddings, vector stores, and retrieval augmented generation to tackle misinformation. By bridging knowledge from a fact-check database with an LLM’s language capabilities, we can produce concise, evidence-based judgments. In future iterations, we might:

* Expand to multilingual content.

* Integrate real-time social media feeds.

* Deploy as a web app or browser extension for on-the-fly checks.

# **Appendix: References & Acknowledgments**

1. Snopes – Fact-check references
2. PolitiFact
3. SentenceTransformers docs
4. FAISS docs