# Fake News Detection  
## Why 99% Accuracy is a Dangerous Lie  
### >>> The Real Number is ~60%

**Mesfin Kebede**  
Data Science Career Track – Capstone Three  
November 19, 2025  

GitHub: https://github.com/mesfin-k/capstone-three

## The Problem Everyone Ignores

Many published results report **98–99.9% accuracy**  

But most make the **same critical mistake**:

→ Train & test on the **same single dataset**

→ Model learns **dataset-specific shortcuts** (style, vocabulary, length), not actual deception

**Real world** = infinite sources, styles, topics

**This project asks:**  
> **What happens when we force models to face multiple domains at once — just like in production?**
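
The single-dataset trap above can be made concrete with a small sketch. It fits a TF-IDF + logistic regression baseline on one source and scores it twice: on the same source and on a source it has never seen. The toy texts, the `fit_and_score` helper, and the label convention (1 = fake, 0 = real) are illustrative assumptions, not the project's actual data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def fit_and_score(train_texts, train_labels, test_texts, test_labels):
    """Fit a TF-IDF + logistic regression baseline and return test accuracy."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return float(accuracy_score(test_labels, model.predict(test_texts)))


# Hypothetical toy corpora (1 = fake, 0 = real); a real run would load ISOT and WELFake.
source_a_texts = [
    "WASHINGTON (Reuters) - The Senate passed the budget bill on Tuesday",
    "LONDON (Reuters) - The prime minister met union leaders last year",
    "BREAKING: you won't believe what this senator just did",
    "SHOCKING video exposes the secret plan they are hiding",
]
source_a_labels = [0, 0, 1, 1]
source_b_texts = [
    "City council approves new transit funding after public hearing",
    "Researchers publish study on regional water quality",
    "Miracle cure doctors don't want you to know about",
    "Celebrity secretly replaced by a clone, insiders claim",
]
source_b_labels = [0, 0, 1, 1]

# Scored on its own rows only because the toy set is tiny; a real run would hold out a split.
in_domain = fit_and_score(source_a_texts, source_a_labels, source_a_texts, source_a_labels)
# Scored on a source the model never saw during fitting.
cross_domain = fit_and_score(source_a_texts, source_a_labels, source_b_texts, source_b_labels)
print(f"same source: {in_domain:.2f}, unseen source: {cross_domain:.2f}")
```

The gap between the two numbers, not either number alone, is what indicates how much the model leans on source-specific cues.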

## Datasets Used

| Dataset   | Articles  | Real   | Fake   | Key Traits                              |
|-----------|-----------|--------|--------|-----------------------------------------|
| ISOT      | 44,898   | 21,417 | 23,481 | 2016-2017 politics; real = Reuters, fake = unreliable sites; very formal |
| WELFake   | 72,134   | 37,106 | 35,028 | Multi-topic, noisy, mixed sources       |
| **Merged**| **117,032** | 58,523 | 58,509 | **Source column removed → true real-world test** |
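
A minimal sketch of the merge step, assuming each dataset has already been loaded into a frame with `text`, `label`, and `source` columns. Those column names, the `merge_datasets` helper, and the shared label encoding are assumptions for illustration; the real files use their own schemas and would need to be aligned first.

```python
import pandas as pd


def merge_datasets(frames, seed=42):
    """Concatenate per-dataset frames, drop the source marker, and shuffle."""
    merged = pd.concat(frames, ignore_index=True)
    # Remove the column that reveals which dataset each article came from.
    merged = merged.drop(columns=["source"], errors="ignore")
    # Shuffle so rows from the different datasets are interleaved.
    return merged.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Hypothetical stand-ins for the two loaded datasets (1 = fake, 0 = real).
isot = pd.DataFrame({
    "text": ["Reuters-style report", "Unreliable site post"],
    "label": [0, 1],
    "source": ["ISOT", "ISOT"],
})
welfake = pd.DataFrame({
    "text": ["Mixed-source article", "Noisy scraped story"],
    "label": [0, 1],
    "source": ["WELFake", "WELFake"],
})

merged = merge_datasets([isot, welfake])
print(merged["label"].value_counts())
```

Dropping `source` only removes the explicit marker; style, length, and vocabulary can still reveal the origin.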

## This Is Why Models Hit 99% on Single Datasets

They exploit **shortcut features**:

- ISOT → Reuters writing style = real  
- WELFake → different length, noise, topics = fake

Even **RoBERTa-Large** gets:

- **99.98%** on ISOT alone  
- **99.82%** on WELFake alone

→ It's basically learning **"which dataset is this?"** not "is this fake?"
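
One way to check the "which dataset is this?" claim is a source probe: train a classifier to predict the dataset of origin from the text alone. The sketch below is one possible version, with a hypothetical `source_probe_accuracy` helper and toy texts. Accuracy near chance suggests the datasets are hard to tell apart; accuracy near 100% suggests dataset identity leaks through style.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def source_probe_accuracy(texts, sources, folds=2):
    """Cross-validated accuracy of predicting which dataset a text came from."""
    probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(probe, texts, sources, cv=folds, scoring="accuracy")
    return float(scores.mean())


# Hypothetical toy texts; the target is the dataset name, not fake vs real.
texts = [
    "WASHINGTON (Reuters) - Lawmakers voted on the spending bill",
    "BRUSSELS (Reuters) - Ministers agreed on new trade terms",
    "MOSCOW (Reuters) - Officials denied the reported talks",
    "TOKYO (Reuters) - The central bank held rates steady",
    "you wont believe this!!! share before they delete it",
    "local blog post: here is what happened at the rally today",
    "opinion: why everything you heard this week is wrong",
    "viral thread claims the story was staged, readers react",
]
sources = ["ISOT"] * 4 + ["WELFake"] * 4

probe_accuracy = source_probe_accuracy(texts, sources)
print(f"source probe accuracy: {probe_accuracy:.2f}")
```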

```python
import pandas as pd
from IPython.display import display, HTML

# Model results
results = pd.DataFrame([
    ["Logistic Regression", "ISOT",     "98.18%"],
    ["SVM",                 "ISOT",     "98.97%"],
    ["DistilBERT",          "ISOT",     "99.97%"],
    ["RoBERTa-Large",       "ISOT",     "99.98%"],
    ["Logistic Regression", "WELFake",  "93.97%"],
    ["SVM",                 "WELFake",  "95.19%"],
    ["DistilBERT",          "WELFake",  "99.10%"],
    ["RoBERTa-Large",       "WELFake",  "99.82%"],
    ["ALL MODELS",          "MERGED",   "≈60%"],
], columns=["Model", "Dataset", "Accuracy"])

# Title banner
display(HTML("""
    <div style="text-align:center; margin: 20px 0;">
        <h1 style="color:#c0392b; font-size:80px; margin-bottom:0;">99% → 60%</h1>
        <h3 style="color:#2c3e50; font-size:30px; margin-top:5px;">Performance Drop When Tested on Combined Dataset</h3>
    </div>
"""))

# Styled table
styled = results.style.set_properties(**{
    'font-size': '18pt',
    'text-align': 'center',
    'border': '1px solid black',
}).set_table_styles([{
    'selector': 'th',
    'props': [('font-size', '20pt'),
              ('background-color', '#34495e'),
              ('color', 'white'),
              ('text-align', 'center')]
}])

display(styled)
```


| Model               | Dataset | Accuracy |
|---------------------|---------|----------|
| Logistic Regression | ISOT    | 98.18%   |
| SVM                 | ISOT    | 98.97%   |
| DistilBERT          | ISOT    | 99.97%   |
| RoBERTa-Large       | ISOT    | 99.98%   |
| Logistic Regression | WELFake | 93.97%   |
| SVM                 | WELFake | 95.19%   |
| DistilBERT          | WELFake | 99.10%   |
| RoBERTa-Large       | WELFake | 99.82%   |
| ALL MODELS          | MERGED  | ≈60%     |


## Proof #1: Text Length Distributions

![](images/text_length_distribution.png)

## Proof #2: Word Clouds (Fake vs Real)

![](images/fake_wordcloud.png)

![](images/real_wordcloud.png)

## Proof #3: Top Bigrams

**Fake news** loves:  
`donald trump` · `hillary clinton` · `white house` · `fake news` · `breaking`

**Real news** loves:  
`new york` · `united states` · `prime minister` · `last year` · `according to`

→ A simple bag-of-words model gets >90% accuracy using just these
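
Bigram lists like the ones above can be produced with a plain frequency count per class. This is a minimal sketch using only the standard library; the `top_bigrams` helper, the tokenizer regex, and the two toy headlines are assumptions, not the project's actual EDA code.

```python
import re
from collections import Counter


def top_bigrams(texts, k=5):
    """Return the k most frequent adjacent word pairs across the texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens (apostrophes allowed).
        tokens = re.findall(r"[a-z']+", text.lower())
        # Pair each token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(k)]


# Hypothetical toy headlines for one class.
fake_texts = [
    "Breaking: Donald Trump slams fake news",
    "Donald Trump and Hillary Clinton clash",
]
print(top_bigrams(fake_texts, k=1))  # → ['donald trump']
```

Running the same function separately on the real and fake subsets gives the two lists to compare.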

## Full DSM Process Completed (All Rubric Boxes Checked)

1. Data wrangling → merged, cleaned, removed source column  
2. EDA → length, word clouds, n-grams, statistical tests  
3. Feature engineering → TF-IDF, numerical features, standardization  
4. Modeling → 5 models (LogReg, SVM, LSTM, DistilBERT, RoBERTa-Large)  
5. Evaluation → separate vs merged → comparison table filled  
6. Final model selected & applied to merged data  
7. 3+ visualizations created  
8. PDF report + model metrics file + clean GitHub repo

**All steps documented and submitted**
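
Steps 3 and 4 (TF-IDF, numerical features, standardization, then a linear model) could be wired together roughly as follows. This is a sketch under assumptions: the column names `text`, `n_words`, and `n_chars`, the helper names, and the toy rows are hypothetical, and the project's actual features and hyperparameters may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def add_numeric_features(df):
    """Add simple length features computed from the raw text."""
    out = df.copy()
    out["n_words"] = out["text"].str.split().str.len()
    out["n_chars"] = out["text"].str.len()
    return out


def build_pipeline():
    """TF-IDF on text plus standardized numeric features, then logistic regression."""
    features = ColumnTransformer([
        # A single column name (not a list) passes 1-D text to the vectorizer.
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),
        ("numeric", StandardScaler(), ["n_words", "n_chars"]),
    ])
    return Pipeline([
        ("features", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Hypothetical toy rows (1 = fake, 0 = real).
toy = add_numeric_features(pd.DataFrame({
    "text": [
        "Officials confirmed the figures in a statement on Monday",
        "The committee published its annual report last year",
        "SHOCKING truth they are hiding from you",
        "You won't believe what happened next",
    ],
    "label": [0, 0, 1, 1],
}))

model = build_pipeline().fit(toy[["text", "n_words", "n_chars"]], toy["label"])
predictions = model.predict(toy[["text", "n_words", "n_chars"]])
```

Keeping the scaler inside the pipeline means it is fitted on training rows only, which avoids leaking test statistics during evaluation.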

## The Real State of Fake News Detection

**99% on a single dataset = misleading (overfit to dataset-specific shortcuts)**

**60% on merged data = honest real-world performance**

**Conclusion:**  
Fake news detection is **nowhere near solved**  
Models validated on a single dataset are likely to fail in production  

This is strong evidence that we need multi-domain training

## 3 Concrete Recommendations for Clients

1. **Never deploy a model trained on one dataset**  
   → Require validation on ≥2 unseen datasets

2. **Use domain-adversarial training or style-transfer augmentation**  
   → Forces model to ignore writing style

3. **Monitor drift monthly & retrain with new sources**
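
For recommendation 3, one simple drift signal is the Population Stability Index (PSI) on a numeric feature such as article word count. PSI is a technique swapped in here for illustration; the slides do not say which drift metric the project uses, and the function name, bin count, and sample data below are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two samples of a numeric feature (e.g. article word counts)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles, so each bin holds roughly equal reference mass.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)[1:-1])
    # Assign every value to a bin index in 0..bins-1 and count per bin.
    ref_counts = np.bincount(np.searchsorted(interior, reference, side="right"), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(interior, current, side="right"), minlength=bins)
    # Clip to a small floor so empty bins do not produce log(0).
    eps = 1e-6
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Hypothetical word counts: training-time articles vs. this month's incoming articles.
training_lengths = np.arange(100, 200)
incoming_lengths = np.arange(300, 400)

psi_same = population_stability_index(training_lengths, training_lengths)
psi_shifted = population_stability_index(training_lengths, incoming_lengths)
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A common rule of thumb treats PSI above roughly 0.2 as a shift worth investigating, though the threshold should be tuned per feature.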

# Thank You!

**Fake News Detection: The 99% Myth & the 60% Reality**

Mesfin Kebede  
GitHub: https://github.com/mesfin-k/capstone-three  

Questions?