# Fake News Detector — Final Project

**Group 4 — Kai Bleuel, Mustafa Sivgin, César Diaz Murga**  
Scientific Programming WIN (2025-FS)


## 1. Introduction

In this project, we developed a "Fake News Detector & Article Analyzer" as part of the Scientific Programming WIN course at ZHAW.

The goal of our project is to automatically analyze live news articles for:

- Clickbait characteristics
- Sentiment (positive/negative/neutral)
- Word count
- Statistical relationships

**Research Question:**  
Can we identify patterns between clickbait headlines, sentiment and article length in real-world news?

Our tool provides an interactive platform to explore the relationship between clickbait headlines, article sentiment, and text length in real-world news. It leverages modern AI techniques (Transformer-based language models), statistical analysis, and user-friendly visualizations to support data-driven conclusions.



## 2. Materials & Methods


### Data Source

We use **NewsAPI** to collect real-world, live news articles based on a user-provided keyword.

Example API call:

`https://newsapi.org/v2/everything?q=iphone&pageSize=20&page=1&sortBy=publishedAt&language=en&apiKey=xxx`
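
As a rough sketch, the example call could be assembled from a user keyword like this. The helper name `build_newsapi_url` and its defaults are our own illustration, not taken from the project code; only the endpoint and query parameters come from the URL above.

```python
from urllib.parse import urlencode

NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_newsapi_url(keyword, api_key, page=1, page_size=20):
    """Return the request URL for one page of English articles about `keyword`.

    Hypothetical helper: the parameter names mirror the example call above.
    """
    params = {
        "q": keyword,
        "pageSize": page_size,
        "page": page,
        "sortBy": "publishedAt",
        "language": "en",
        "apiKey": api_key,
    }
    # urlencode also escapes spaces and special characters in the keyword
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


url = build_newsapi_url("iphone", api_key="xxx")
print(url)
```

The resulting URL can then be fetched with any HTTP client, for example `requests.get(url).json()`; NewsAPI returns the matching articles under the `"articles"` key of the JSON response.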


### Data Preparation

✅ Cleaning and preparing article text:  

- Removed empty entries  
- Filtered articles by keyword occurrence in title or text  
- Converted to pandas DataFrame  
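
The cleaning steps above might look roughly like the following sketch. The function name `prepare_articles` and the `title`/`text` keys are assumptions for illustration; the sample articles are made up.

```python
def prepare_articles(raw_articles, keyword):
    """Drop empty entries and keep articles mentioning `keyword` in title or text."""
    needle = keyword.lower()
    cleaned = []
    for article in raw_articles:
        # `or ""` covers both missing keys and explicit None values
        title = (article.get("title") or "").strip()
        text = (article.get("text") or "").strip()
        if not title or not text:
            continue  # skip entries with a missing title or body
        if needle in title.lower() or needle in text.lower():
            cleaned.append({"title": title, "text": text})
    return cleaned


# Made-up sample input
raw = [
    {"title": "New iPhone announced", "text": "Apple showed a new phone."},
    {"title": "", "text": "An entry without a title."},
    {"title": "Weather update", "text": None},
    {"title": "Football results", "text": "Nothing about phones here."},
    {"title": "Market news", "text": "Analysts expect strong iPhone sales."},
]
rows = prepare_articles(raw, "iphone")
print([r["title"] for r in rows])
```

The cleaned list of dicts can be passed directly to `pd.DataFrame(rows)` for the final conversion step.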

✅ Keyword highlighting implemented via:

The highlight function iterates over the text and applies color formatting (using Colorama in terminal, HTML in Streamlit) to visually emphasize the search keyword. This improves user experience and readability.
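
The body of `highlight_keyword` is not reproduced in this report. Purely as an illustration of the HTML variant, a case-insensitive highlighter could be sketched as follows; the name `highlight_keyword_html` and the choice of `<mark>` tags are assumptions, not the project's actual implementation.

```python
import re


def highlight_keyword_html(text, keyword):
    """Wrap each case-insensitive occurrence of `keyword` in a <mark> tag."""
    if not keyword:
        return text
    # re.escape keeps characters such as "+" or "." in the keyword literal
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    # The lambda reuses the matched text, so the original casing is preserved
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)


print(highlight_keyword_html("The iPhone beats older iphone models", "iphone"))
```

In Streamlit, such a string would be rendered with `st.markdown(..., unsafe_allow_html=True)`. A terminal variant could wrap the match in Colorama codes such as `Fore.YELLOW` and `Style.RESET_ALL` instead of HTML tags.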

```python
def highlight_keyword(text, keyword):
    ...
```

✅ Word count calculated via:

Word count serves as a simple proxy for article length and complexity. It is computed as the number of whitespace-separated tokens in the article body.
```python
word_count = len(row.get('text', '').split())
```

### Analysis Algorithms

✅ Clickbait Detection:

- Rule-based matching against a curated list of common clickbait phrases  
- Example phrases: "shocking", "you won’t believe", "secret", etc.

Example: clickbait detection code

```python
CLICKBAIT_WORDS = ["shocking", "secret", "you won’t believe"]  # ... (list shortened here)

def detect_clickbait_in_title(title):
    """Label a title as clickbait if it contains any listed phrase."""
    title_lower = title.lower()
    # Plain substring match, so "secret" also matches "secretary"
    found_words = [word for word in CLICKBAIT_WORDS if word in title_lower]
    if found_words:
        return "Clickbait"
    else:
        return "Not Clickbait"
```
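
Because the rule-based check uses substring matching, a title containing "Secretary" would also count as a hit for "secret". One possible refinement, sketched here under our own assumptions (the names `CLICKBAIT_PHRASES`, `CLICKBAIT_PATTERN` and `find_clickbait_phrases` are hypothetical, and the phrase list is only a subset), is to match whole words with a regular expression and to normalise typographic apostrophes first:

```python
import re

CLICKBAIT_PHRASES = ["shocking", "secret", "you won't believe"]  # illustrative subset

# One alternation anchored at word boundaries, compiled once
CLICKBAIT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in CLICKBAIT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_clickbait_phrases(title):
    """Return the listed phrases that occur as whole words in `title`."""
    # Map the typographic apostrophe (U+2019) to a plain one so both spellings match
    normalised = title.replace("\u2019", "'")
    return [m.group(0).lower() for m in CLICKBAIT_PATTERN.finditer(normalised)]


print(find_clickbait_phrases("You Won’t Believe This Secret Trick"))
print(find_clickbait_phrases("Secretary of State visits Bern"))
```

The first title yields two hits, while the second yields none because "Secretary" is no longer matched by "secret".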

✅ Sentiment Analysis:

- Using Hugging Face Transformers model:  
`distilbert-base-uncased-finetuned-sst-2-english`

Example: sentiment analysis code

```python
from transformers import pipeline

# Created once at import time; the model is downloaded on first use
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def analyze_article_with_local_llm(text):
    # Only the first 512 characters (not tokens) of the article are classified
    result = classifier(text[:512])[0]
    return {"label": result["label"], "score": result["score"]}
```
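
The SST-2 model returns only the labels `POSITIVE` and `NEGATIVE`, together with a confidence score. How this is turned into a neutral category or a numeric sentiment score is not spelled out in this report. One possible convention, shown purely as an assumption (the helper `to_signed_sentiment` and the `neutral_band` threshold of 0.6 are hypothetical), is to treat low-confidence predictions as neutral and to sign the score by label:

```python
def to_signed_sentiment(result, neutral_band=0.6):
    """Map a {"label", "score"} dict to (category, signed score in [-1, 1]).

    Hypothetical convention: predictions with confidence below `neutral_band`
    are reported as "neutral"; negative labels get a negative score.
    """
    signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
    category = "neutral" if result["score"] < neutral_band else result["label"].lower()
    return category, signed


print(to_signed_sentiment({"label": "NEGATIVE", "score": 0.98}))
print(to_signed_sentiment({"label": "POSITIVE", "score": 0.55}))
```

A signed score of this kind would give correlation analyses a direction as well as a strength, whereas the raw confidence alone only reflects how certain the model is.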

✅ Statistical Analysis:

- Pearson correlation: Sentiment Score ↔ Word Count  
- Spearman correlation: Sentiment Score ↔ Word Count  
- Chi-Square test: Clickbait vs. Sentiment

Example: statistical analysis code (Pearson correlation)

```python
from scipy.stats import pearsonr

sentiment_scores = df_results["Sentiment Score"]
word_counts = df_results["Word Count"]

pearson_corr, pearson_p = pearsonr(sentiment_scores, word_counts)
print(f"Pearson r = {pearson_corr:.2f}, p = {pearson_p:.4f}")
```
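
To make the chi-square test on Clickbait vs. Sentiment more concrete, the sketch below computes the statistic by hand for a 2x2 contingency table. In practice `scipy.stats.chi2_contingency` would normally be used; the function `chi_square_2x2` and the counts are made up for illustration.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability reduces to erfc
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: rows = Clickbait / Not Clickbait, columns = POSITIVE / NEGATIVE
observed = [[10, 20], [30, 40]]
chi2, p = chi_square_2x2(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its output matches this hand calculation only with `correction=False`.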


### Tools used

- Python 3.13
- Visual Studio Code
- Jupyter Notebook
- Streamlit (Web App)
- pandas
- matplotlib
- transformers
- scipy
- colorama
- NewsAPI Web API


### Database / SQL Integration

To fulfill the database requirement of the project, we added an integration with **SQLite**, a lightweight SQL database.

Our program automatically:

✅ Saves the full analysis DataFrame (`df_results`) to a local SQLite database:  
`data/fake_news_analysis.db`

✅ Stores the results in an SQL table called **"articles"**

✅ Executes an example SQL query directly from Python to demonstrate integration:

```sql
SELECT Clickbait, COUNT(*) as count
FROM articles
GROUP BY Clickbait;
```
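
A minimal sketch of the save-and-query round trip using only the standard library is given below. It uses an in-memory database and made-up rows; the `Sentiment` column name is an assumption. With pandas, the insert step would typically be `df_results.to_sql("articles", conn, if_exists="replace")` instead.

```python
import sqlite3

# Made-up rows: (Clickbait, Sentiment, Sentiment Score, Word Count)
rows = [
    ("Clickbait", "NEGATIVE", 0.97, 412),
    ("Not Clickbait", "POSITIVE", 0.88, 655),
    ("Not Clickbait", "NEGATIVE", 0.91, 230),
]

# ":memory:" keeps the example self-contained; the project writes to a file instead
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE articles ("Clickbait" TEXT, "Sentiment" TEXT, '
    '"Sentiment Score" REAL, "Word Count" INTEGER)'
)
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", rows)

# Same aggregation as the example query: number of articles per clickbait label
counts = dict(
    conn.execute("SELECT Clickbait, COUNT(*) AS count FROM articles GROUP BY Clickbait")
)
conn.close()
print(counts)
```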

## 3. Results & Discussion


### ➤ Pie Chart: Clickbait + Sentiment

**Distribution of Clickbait + Sentiment combinations in analyzed articles:**  

![Pie Chart](images/pie_chart.png)
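
The data behind such a chart is a count of each Clickbait + Sentiment combination. A small sketch with made-up results (the variable names are hypothetical) could look like this:

```python
from collections import Counter

# Made-up (clickbait label, sentiment label) pairs, one per analyzed article
results = [
    ("Not Clickbait", "POSITIVE"),
    ("Clickbait", "NEGATIVE"),
    ("Not Clickbait", "POSITIVE"),
    ("Not Clickbait", "NEGATIVE"),
]

# One pie slice per distinct combination
combo_counts = Counter(f"{clickbait} + {sentiment}" for clickbait, sentiment in results)
labels = list(combo_counts)
sizes = list(combo_counts.values())
print(dict(combo_counts))
```

The `labels` and `sizes` lists can then be passed to matplotlib, for example `plt.pie(sizes, labels=labels, autopct="%1.1f%%")`.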

### ➤ Example Terminal Output (main.py)

**Example output of our terminal-based analysis:**  

![Terminal Output](images/terminal_output.png)


### ➤ Statistical Analysis Output

![Statistical Analysis](images/stat_analysis.png)


### ➤ Example Streamlit App

**Screenshot of the interactive Streamlit app:**  

![Streamlit App](images/streamlit_app.png)


### ➤ Interpretation of Results

Our findings suggest that while some sentiment patterns correlate with article length, clickbait features are less predictive of sentiment. This might be due to the variability of journalistic writing styles across different news outlets.

Interestingly, many articles flagged as "Not Clickbait" still contained highly emotional or biased language — indicating that clickbait headlines and article tone are not always aligned.

The flexible, keyword-based search allows users to explore many different topics (sports, politics, tech, etc.) and compare how media tone shifts across domains.


Based on our runs:

- **Sentiment and article length** were correlated for some topics but not for others.
- The chi-square test showed no strong association between **clickbait and sentiment**, which we did not expect.
- The pie chart summarizes the distribution of Clickbait + Sentiment combinations.
- The app allows flexible exploration across topics (e.g. politics, sports, tech).


## 4. Conclusions

✅ Our project successfully implements an automated Fake News Detector:

- Real-world data collection using NewsAPI
- Cleaning and preparation using pandas
- Clickbait detection with a rule-based approach
- Sentiment analysis with a Transformer model (Hugging Face)
- Visualization of results (Pie Chart)
- Statistical analysis with valid p-values
- Interactive web app with Streamlit

**Future Improvements:**

- More sophisticated clickbait detection (ML-based)  
- Support for multilingual analysis  
- More advanced statistical analysis  
- Larger dataset across multiple pages  


## 5. Appendix: Point Mapping ✅


### Minimum points (8):

| Requirement                          | Implemented |
|--------------------------------------|-------------|
| (1) Collection of real-world data    | ✅ NewsAPI |
| (2) Data preparation (regex, string) | ✅ Keyword highlighting + cleaning |
| (3) Python structures & pandas       | ✅ Lists, dicts, DataFrames |
| (4) Loops, conditionals              | ✅ main.py loops & filtering |
| (5) Procedural programming           | ✅ Functions (main.py + llm_helper.py) |
| (6) Visualization (table/chart)      | ✅ Pie Chart |
| (7) Statistical analysis (p-value)   | ✅ Pearson, Spearman, Chi2 |
| (8) Code + data available on Moodle  | ✅ Will upload |


### Bonus points:

| Bonus Feature                        | Implemented |
|--------------------------------------|-------------|
| (1) Creativity                       | ✅ Clickbait + Sentiment + Interactive Search |
| (2) Web API                          | ✅ NewsAPI |
| (3) Database / SQL                   | ✅ SQLite (`articles` table) |
| (4) LLM usage                        | ✅ Sentiment analysis model |
| (5) Streamlit web app                | ✅ Implemented |
| (6) GitHub repo                      | ✅ Planned (optional) |


**Thank you for reading our project report!**
