# **Fake News Forensic – Detecting & Summarizing Misinformation with GenAI**

This notebook demonstrates how to detect and summarize misinformation (so-called “fake news”) by combining:

1. **Embeddings** to transform text into vector representations.
2. A **Vector Store** (in this example, using FAISS) to enable similarity search.
3. **Retrieval-Augmented Generation (RAG)** techniques to ground outputs from a Large Language Model (LLM) in factual data.

We will walk through each step—from data loading and vector database creation to prompting a language model for fact-checking responses. Feel free to **experiment** by changing prompts, using different LLMs, or customizing the data!



# **1. Introduction**

Misinformation spreads rapidly online, and manual fact-checking cannot always keep pace. In this project, we show how **Generative AI** methods can help:

1. **Find** relevant fact-check statements from a curated dataset (using embeddings + a vector database).
2. **Generate** short, evidence-based explanations (using a language model like OpenAI’s GPT).
3. **Structure** responses in a user-friendly format, such as JSON or bullet-point summaries.

Our example uses the **ISOT Fake News Dataset** (which labels articles as true or false) and an OpenAI model to produce short explanations referencing the retrieved evidence.


# **2. Libraries & Setup**

We first need to install and import the libraries that will power our pipeline:

- **sentence-transformers**: For creating text embeddings.
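- **langchain** & **chromadb**: Common libraries for building LLM applications. They are installed as optional alternatives; the code below uses only FAISS.
- **faiss-gpu**: The GPU build of FAISS, a library for vector similarity search (`faiss-cpu` is the CPU-only package).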
- **pandas**, **numpy**: For data manipulation.
- **torch**: Underlying framework (used by sentence-transformers).
- **openai**: To interact with OpenAI’s GPT models.

You may need to restart the environment after installation. Let’s go ahead and set things up.


In [None]:
!pip install -q sentence-transformers langchain chromadb faiss-gpu "openai>=1.0.0"

In [None]:


import pandas as pd
import numpy as np
import torch

# For embeddings (SentenceTransformers)
from sentence_transformers import SentenceTransformer

# For vector database, example: FAISS or Chroma
import faiss  # or use Chroma or another



We import the necessary libraries for data manipulation (**pandas**, **numpy**), generating embeddings (**SentenceTransformers**), creating a vector index (**FAISS**), and working with **PyTorch** (which powers the transformer model).


**Obtain an API Key**

To use OpenAI’s API (e.g., GPT-3.5 or GPT-4), you will need an API key from [https://platform.openai.com/](https://platform.openai.com/). Depending on your request volume and the rate limits that apply to your account, you may need to add **billing information** or move to a paid plan. Free trial credit, where available, might be enough for a few small test requests, but sustained or larger-scale usage requires a paid account.

In a Kaggle notebook or a local Jupyter environment, you can store the key in an environment variable or a secrets manager. Below, we demonstrate retrieving it from Kaggle’s `UserSecretsClient`.


In [None]:
# For language model calls (OpenAI, Hugging Face, or local)
# Example with OpenAI:

import os
import openai
from kaggle_secrets import UserSecretsClient

OPENAI_API_KEY = UserSecretsClient().get_secret("OPENAI_API_KEY")
openai.api_key = OPENAI_API_KEY



# ...any additional imports...
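
Outside Kaggle, the environment-variable route mentioned above can be sketched as follows. This is a minimal sketch: the helper name `load_api_key` is hypothetical, and it assumes the key was exported as `OPENAI_API_KEY` before the notebook was started.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or None if it is missing or blank.

    `load_api_key` is an illustrative helper, not part of the openai package.
    """
    value = env.get(name, "").strip()
    return value or None

api_key = load_api_key()
if api_key is None:
    print("OPENAI_API_KEY is not set; the LLM cells will fail until it is.")
```

If a key is found, it can be assigned with `openai.api_key = api_key`, exactly as in the Kaggle cell above.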


# **3. Data Loading & Preprocessing**

In this section, we’ll load our **ISOT Fake News Dataset**, which consists of two CSV files:
- `Fake.csv` for misinformation articles
- `True.csv` for real news articles

We then combine them into a single DataFrame, adding a column `'label'` indicating whether the text is **true** or **false**. You can swap in **your own** data here to create a customized fact-checking pipeline.


In [None]:
# 1) Read the CSVs from the ISOT Fake News Dataset
df_fake = pd.read_csv('/kaggle/input/isot-fake-news-dataset/Fake.csv')
df_true = pd.read_csv('/kaggle/input/isot-fake-news-dataset/True.csv')

# 2) Assign labels
df_fake['label'] = 'false'
df_true['label'] = 'true'

# 3) Concatenate into a single DataFrame
df = pd.concat([df_fake, df_true], ignore_index=True)

# 4) Display 3 rows from each label to get a quick look
df_subset = df.groupby('label').head(3)
df_subset



The output above shows a sample of the data. Each row contains the article `text` (the body we embed in the next section) and the `label` we added, indicating **true** or **false**.
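
Before embedding, it may be worth removing empty and duplicated articles. The sketch below uses a toy DataFrame rather than the real dataset, and `clean_articles` is a hypothetical helper. Resetting the index matters because the retrieval step later looks rows up by their position in the FAISS index.

```python
import pandas as pd

def clean_articles(frame, text_col="text"):
    """Drop rows with empty text and exact duplicate texts, then reset the index."""
    out = frame.copy()
    out[text_col] = out[text_col].fillna("").str.strip()
    out = out[out[text_col] != ""]
    out = out.drop_duplicates(subset=text_col)
    # Row labels must run 0..n-1 so they line up with FAISS positions
    return out.reset_index(drop=True)

# Toy data standing in for the combined ISOT DataFrame
toy = pd.DataFrame({
    "text": ["Story A", "  ", "Story A", None, "Story B"],
    "label": ["false", "false", "true", "true", "true"],
})
cleaned = clean_articles(toy)
print(cleaned)
print(cleaned["label"].value_counts())
```

On the real data this would be called as `df = clean_articles(df)` before building `texts` and `labels`.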


# **4. Building & Populating the Vector Store**

In this stage, we will:
1. **Generate embeddings** for each article/claim using a Sentence Transformers model.
2. **Store** those embeddings in a FAISS index.

A vector store allows us to quickly find articles similar to any new query. This is critical for retrieval-augmented generation: we can feed the top-matching articles, together with their labels, to the language model so that its responses are grounded in reference material rather than in its own recall.

**4.1 Generating Embeddings**


In [None]:
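# Choose a sentence-transformers model (lightweight example)
# Use the GPU when one is available; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)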

# Create a list of texts to embed (here, the article bodies) and their labels
texts = df['text'].tolist()
labels = df['label'].tolist()

# Generate embeddings
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print("Embeddings shape:", embeddings.shape)


Here, we use the **all-MiniLM-L6-v2** model, which produces 384-dimensional vector embeddings. If you prefer a different model (e.g., `all-mpnet-base-v2` or a multilingual model), simply replace the string in `SentenceTransformer(...)`.
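
To build some intuition for how such vectors are compared, the sketch below uses three toy 3-dimensional vectors instead of real 384-dimensional embeddings. It shows that once vectors are scaled to unit length, squared L2 distance and cosine similarity carry the same ranking information, since `squared_l2 = 2 - 2 * cosine`. The helper `l2_normalize` is hypothetical.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)

# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both
toy_vectors = l2_normalize([[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 1.0, 0.0]])

cosine = toy_vectors @ toy_vectors.T
squared_l2 = ((toy_vectors[:, None, :] - toy_vectors[None, :, :]) ** 2).sum(axis=-1)

print(np.round(cosine, 3))
print(np.allclose(squared_l2, 2 - 2 * cosine, atol=1e-6))
```

With sentence-transformers, passing `normalize_embeddings=True` to `encode(...)` should give unit-length vectors directly, if you prefer cosine-style comparisons.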


**4.2 Creating a FAISS Index**

Next, we create a **FAISS** index to store these embeddings. FAISS supports efficient similarity search, letting us retrieve the top-k most similar items for any new query.

You could use other vector databases (e.g., **Chroma**, **Milvus**, **Pinecone**, etc.) if you prefer.


In [None]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"FAISS index size: {index.ntotal}")

The index is now created and populated. We can quickly query it with any embedding vector to find similar texts from our dataset.
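
Conceptually, `IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and returns the `k` entries with the smallest squared L2 distance. The NumPy sketch below mimics that behavior on toy 2-dimensional vectors. It is for illustration only, and `brute_force_search` is a hypothetical name, not a FAISS function.

```python
import numpy as np

def brute_force_search(stored, queries, k):
    """Return (squared distances, indices) of the k nearest stored vectors per query."""
    stored = np.asarray(stored, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Squared L2 distance between every query and every stored vector
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    # Keep the k smallest distances for each query, nearest first
    order = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order

toy_stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = brute_force_search(toy_stored, [[0.9, 0.1]], k=2)
print(toy_indices)
print(toy_distances)
```

The return shape matches `index.search(query_emb, k)`: one row per query, with distances and positions ordered from nearest to farthest.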


# **5. Retrieval Augmented Generation (RAG)**

With the vector store in place, we can **retrieve** the top relevant articles for a new claim/headline. Then, we’ll feed both the user’s query and the retrieved evidence into an LLM to generate a short, fact-based explanation.

This approach helps **reduce hallucinations** by giving the model actual reference text for each claim. Let’s define some helper functions next.


**5.1 Example: Single Query with RAG**

Below, we:
1. Create a function to **retrieve** the top-k similar entries using the FAISS index.
2. Create a function to **generate** a short explanation by referencing those retrieved entries in the LLM prompt.
3. Test the workflow with an example query (e.g., `"Vaccines cause autism."`).

You can swap in **any query** you’d like here. Also, if you want to use GPT-4 or a different model, update the `model` parameter in the `openai.chat.completions.create(...)` call.


In [None]:

import openai

def retrieve_similar_texts(query, k=3):
    """
    Given a query (string), return top k similar text entries
    from the FAISS index.
    """
    query_emb = embedding_model.encode([query])
    distances, indices = index.search(query_emb, k)

    results = []
    for idx in indices[0]:
        item_text = df.loc[idx, 'text']
        item_label = df.loc[idx, 'label']
        results.append((item_text, item_label))
    return results

def generate_explanation(query, retrievals):
    """
    Use the retrieved text & label to produce a short explanation
    via ChatCompletion in openai>=1.0.0.
    """

    references_str = "\n".join([f"Claim: {rt[0]} -- Label: {rt[1]}" for rt in retrievals])

    messages = [
        {
            "role": "system",
            "content": "You are a fact-checking assistant."
        },
        {
            "role": "user",
            "content": f"""
The user says: '{query}'

We found these fact-check references:
{references_str}

Based on these, is the user's claim likely true or false?
Provide a concise explanation (2-3 sentences) referencing the evidence above.
            """
        }
    ]

    # For openai>=1.0.0, we use the 'chat.completions.create' endpoint
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  # or gpt-4, etc.
        messages=messages,
        max_tokens=150,
        temperature=0
    )

    # Access the content by attribute, not by subscripting
    return response.choices[0].message.content.strip()

# Test call
query_test = "Vaccines cause autism."
retrieved = retrieve_similar_texts(query_test, k=3)
llm_explanation = generate_explanation(query_test, retrieved)

print("----Retrieved Fact Checks----")
for i, (txt, lbl) in enumerate(retrieved):
    print(f"{i+1}) [Label: {lbl}] {txt[:80]}...")

print("\n----LLM Explanation----")
print(llm_explanation)


The prompt to the LLM includes both the user’s claim and the top matching fact-check entries. Feel free to **tweak** the instructions to get a more structured output, longer explanation, or bullet-point summary. This is a key part of **prompt engineering**.
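
One tweak worth considering: the retrieved entries are full articles, so the prompt can become long. The sketch below shows one possible way to build a compact, numbered reference block. `build_references` and the `max_chars` default are assumptions for illustration, not part of the pipeline above.

```python
def build_references(retrievals, max_chars=500):
    """Format (text, label) pairs as numbered, length-limited reference lines."""
    lines = []
    for number, (text, label) in enumerate(retrievals, start=1):
        # Collapse runs of whitespace and newlines into single spaces
        snippet = " ".join(text.split())
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(f"[{number}] Label: {label} | Text: {snippet}")
    return "\n".join(lines)

example = build_references(
    [("First   article\nbody", "true"), ("x" * 20, "false")],
    max_chars=10,
)
print(example)
```

Inside `generate_explanation`, the result could replace `references_str`, and the instructions could then ask the model to cite entries by number.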


# **6. Demonstration of Additional Capabilities**

Below is an example of **few-shot prompting**, which steers the LLM toward a consistent output format (e.g., JSON). This is helpful if you want to parse the output programmatically later on.


**Few-Shot Prompting Example**

We might want the explanation in a more structured JSON format or a “bullet-point” style. The prompt below includes a worked example of the desired JSON so that the model follows the same format.

In [None]:
import openai

demo_prompt = """
Below are examples of how to respond to fact-check queries in JSON:

Example 1:
{
  "label": "false",
  "explanation": "This claim is contradicted by evidence in X and Y..."
}

Now, follow that format exactly:

User Claim: "5G towers spread COVID-19."
Evidence: "Claim: 5G networks cause coronavirus. Label: false"

{
   "label": 
   "explanation":
}
"""

# In openai>=1.0.0, you call openai.chat.completions.create(...)
response_json = openai.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": demo_prompt
        }
    ],
    max_tokens=100,
    temperature=0
)

# Access the content via object attributes, not dict keys
print(response_json.choices[0].message.content)



Here, we gave a small “few-shot” style prompt (a single example in this case; adding one or two more can make the format more robust) so that the model responds in JSON, demonstrating the Structured Output capability.
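
Even with a format example in the prompt, the reply is still free text, so it is safer to validate it before using it downstream. A minimal sketch of such a check follows. `parse_fact_check` is a hypothetical helper that assumes the two keys used in the prompt above, `label` and `explanation`.

```python
import json

def parse_fact_check(raw):
    """Parse a model reply into {'label', 'explanation'}; return None if invalid."""
    # Tolerate extra text around the JSON object
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = str(data.get("label", "")).strip().lower()
    explanation = data.get("explanation")
    if label not in {"true", "false"} or not isinstance(explanation, str):
        return None
    return {"label": label, "explanation": explanation.strip()}

reply = 'Sure!\n{"label": "False", "explanation": " No supporting evidence. "}'
print(parse_fact_check(reply))
```

Applied to the cell above, this would be `parse_fact_check(response_json.choices[0].message.content)`. A `None` result signals that the request should be retried or the reply logged for inspection.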

# **7. Results & Discussion**

This section highlights the strengths and areas for improvement of the current system. By combining embeddings, a vector store, and an LLM, we demonstrate a practical pipeline for misinformation detection.

---

### ✅ **Observations**

1. **Effective Retrieval**  
   The embedding model and FAISS index retrieve articles that are semantically close to the user's claim, capturing conceptual similarity rather than exact keyword overlap. Retrieval quality still depends on whether the dataset covers the topic (see Limitations).

2. **Grounded LLM Explanations**  
   Because we explicitly provide supporting evidence (via top-k retrieved texts), the LLM's responses tend to be more reliable and less prone to hallucination. The generated explanations often reflect the core sentiment and stance of the reference material.

3. **Flexible Prompt Engineering**  
   The prompt design (e.g., structured JSON, bullet points) can be customized to suit various output formats, improving downstream usability (e.g., in web apps, APIs, or analytics pipelines).
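
One rough way to sanity-check these observations on your own queries is to look at the majority label among the retrieved articles and compare it with the LLM's verdict. The sketch below assumes the `(text, label)` pairs returned by `retrieve_similar_texts`. `majority_label` is a hypothetical helper, and the vote is only a crude baseline, not a substitute for the explanation step.

```python
from collections import Counter

def majority_label(retrievals):
    """Return (label, share) for the most common label among (text, label) pairs."""
    if not retrievals:
        return None, 0.0
    counts = Counter(label for _, label in retrievals)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(retrievals)

sample = [("article one", "false"), ("article two", "false"), ("article three", "true")]
vote, share = majority_label(sample)
print(vote, round(share, 2))
```

A low share (for example, a near-even split between labels) suggests that the retrieved evidence is mixed and the LLM's answer deserves extra scrutiny.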

---

### ⚠️ **Limitations**

1. **Coverage Gaps in the Dataset**  
   If a user provides a **novel claim** that isn’t closely aligned with the indexed dataset, the retrieval process may return irrelevant or weak matches. This limits the LLM’s ability to produce an accurate explanation.

2. **Dependency on Original Labels**  
   The system assumes the original labels (true/false) in the dataset are accurate. If those labels are themselves outdated, biased, or incorrect, the LLM will inherit and reinforce those flaws.

3. **Lack of Temporal Awareness**  
   The system does not account for when a claim was made or verified. In reality, facts evolve (e.g., medical advice, political events), so a static dataset can lead to stale or misleading conclusions.

4. **Limited to English (Default Setup)**  
   The model and dataset used are English-only. Multilingual support would be required to scale this to global misinformation detection.
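
The coverage-gap limitation can be partly mitigated by discarding weak matches before they reach the prompt. The sketch below filters one row of the results from `index.search` by distance. `filter_by_distance` is a hypothetical helper, and the threshold is an assumption that would need tuning on your own data.

```python
def filter_by_distance(distances, indices, max_distance):
    """Keep (index, distance) pairs whose distance is at most max_distance.

    FAISS pads missing results with index -1, so those entries are skipped.
    """
    kept = []
    for dist, idx in zip(distances, indices):
        if idx >= 0 and dist <= max_distance:
            kept.append((int(idx), float(dist)))
    return kept

# Toy values standing in for distances[0] and indices[0] from index.search(...)
strong_matches = filter_by_distance([0.2, 0.9, 1.5], [4, 7, -1], max_distance=1.0)
print(strong_matches)
```

If the filtered list is empty, the pipeline could answer "no relevant evidence found" instead of asking the LLM to judge the claim from unrelated articles.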

---

### 💡 **Suggestions for Users**

- Try using **different queries** to test the system’s generalization.
- Swap in other **LLMs** (like `gpt-4`, `Claude`, or local models via Hugging Face).
- Adjust the **prompt** to change tone, style, format (e.g., tweet-length responses, long explanations, academic citations).
- Plug in your **own dataset** to adapt the pipeline for domains like healthcare, education, or financial scams.

This is not a complete solution, but it’s a solid foundation that highlights how AI can assist in fact-checking workflows.


# **8. Conclusion & Future Work**

In this notebook, we demonstrated how to build a **fact-checking pipeline** using:

- **Sentence Transformers** for embeddings.
- **FAISS** for vector similarity search.
- **OpenAI’s** language model to generate concise, grounded explanations.

### Possible Next Steps
1. **Multilingual Support**: Incorporate models and data for different languages.
2. **Real-Time Data**: Pull live content from social media or news APIs for on-the-fly fact checks.
3. **Improved Prompt Engineering**: Use few-shot prompts or chain-of-thought to yield more rigorous explanations.
4. **Deploy as an App**: Build a simple web or command-line interface to let users query claims interactively.
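
As a starting point for the command-line idea, the sketch below wires a claim and a `--k` option to two pluggable functions. It assumes functions with the same signatures as `retrieve_similar_texts(query, k)` and `generate_explanation(query, retrievals)` from Section 5. The names `build_parser` and `run` are hypothetical.

```python
import argparse

def build_parser():
    """Create the argument parser for the fact-checking command."""
    parser = argparse.ArgumentParser(
        description="Check a claim against the indexed articles."
    )
    parser.add_argument("claim", help="Claim or headline to check")
    parser.add_argument("--k", type=int, default=3,
                        help="Number of articles to retrieve")
    return parser

def run(argv, retrieve, explain):
    """Parse argv, retrieve evidence for the claim, and return the explanation."""
    args = build_parser().parse_args(argv)
    retrieved = retrieve(args.claim, k=args.k)
    return explain(args.claim, retrieved)
```

In a script, this could be invoked as `print(run(sys.argv[1:], retrieve_similar_texts, generate_explanation))` after `import sys`. Passing the two functions in as arguments keeps the command easy to test with stubs.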

Finally, **have fun experimenting**! Change the model, the vector database, or the prompt. This is just one example pipeline—the underlying approach can be adapted to many tasks beyond fake news detection.


# **Appendix: References & Acknowledgments**

1. **ISOT Fake News Dataset**
2. [Snopes](https://www.snopes.com/) – Fact-check references
3. [PolitiFact](https://www.politifact.com/)
4. [SentenceTransformers Documentation](https://www.sbert.net/)
5. [FAISS Documentation](https://github.com/facebookresearch/faiss)
6. [OpenAI API Documentation](https://platform.openai.com/docs/introduction)

If you’d like to switch to other vector stores (e.g., Pinecone, Chroma) or other LLM providers (e.g., Hugging Face’s Transformers, Cohere), feel free to adapt the code to suit your requirements.
