# Introduction to the Final Project: Fake News Detection

## Introduction
In today's digital world, fake news has become a significant challenge. With the rapid spread of information on social media and online platforms, distinguishing between true and false information is becoming increasingly difficult.  
This project aims to tackle this problem using **Machine Learning** and **Natural Language Processing (NLP)** techniques.

---

## Why Does Fake News Exist?
There are several reasons for the creation of fake news:

- **Manipulation and disinformation**: Created to influence public opinion (e.g., during elections or wars).  
- **Clickbait**: Sensational false headlines attract more views, leading to higher ad revenue.  
- **Errors**: Sometimes false news results from mistakes or lack of verification rather than intentional deception.  

---

## How to Recognize Fake News?
Some key strategies include:

- **Check the source**: Is it reputable and verified?  
- **Cross-verify**: Reliable news will appear in multiple trusted outlets.  
- **Examine language**: Overly emotional, sensational, or grammatically incorrect content may be suspicious.  
- **Look at the publication date**: Old or out-of-context news can be misleading.  

---

## The Final Project – What Will We Do?
Our goal is to build a **machine learning model** that classifies news articles into two categories:

- ✅ **True information**  
- ❌ **Fake news**

### Dataset Overview
The dataset contains the following columns (the last two are created during preprocessing):

- `title` – Article title  
- `text` – Article content  
- `subject` – Article category/topic  
- `isfake` – Target label (0 = true, 1 = fake)  
- `date` – Date the article was published (dropped during cleaning)  
- `title_content` – Combined title and text  
- `processed` – Preprocessed text (lowercased, cleaned, stopwords removed, lemmatized)  

The `processed` column was used to generate Word2Vec embeddings for machine learning models.

### Project Steps:
1. **Data preprocessing**: Clean, normalize, and prepare the text data.  
2. **Text vectorization**: Convert text into numbers using techniques like **Word2Vec**.  
3. **Model building**: Train classification models (e.g., Logistic Regression, SVM, Random Forest).  
4. **Evaluation**: Measure performance to determine the most effective algorithm.  
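
Step 2 has to turn each article into one fixed-length vector. One common approach, and the one this notebook uses later, is to average the Word2Vec vectors of the words in the article. The sketch below illustrates the idea in plain Python with an invented two-dimensional vocabulary; `toy_vectors` and `mean_pool` are hypothetical names, and real vectors would come from a trained model.

```python
# Toy word vectors (invented for illustration); real ones come from a trained Word2Vec model.
toy_vectors = {
    "senate": [1.0, 0.0],
    "vote": [0.0, 1.0],
    "hilarious": [-1.0, 1.0],
}

def mean_pool(tokens, vectors, dim=2):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together
    return [sum(component) / len(known) for component in zip(*known)]

print(mean_pool(["senate", "vote", "unknownword"], toy_vectors))  # [0.5, 0.5]
print(mean_pool(["unknownword"], toy_vectors))                    # [0.0, 0.0]
```

Averaging discards word order, which is usually acceptable for topic-like signals but is worth keeping in mind when interpreting the results.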

---

By the end, we will have a **working tool to detect fake news** based on the text content of articles.

---




### Data cleaning

In [None]:
# -------------------------------
# Import necessary libraries
# -------------------------------
import pandas as pd
import numpy as np
import re  
from ydata_profiling import ProfileReport
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report


# === Working with files ===

# -----------------------------------------
# Removing all ; at the end of each line
# -----------------------------------------

input_file = "news_dataset.csv"
output_file = "news_dataset1.csv"

with open(input_file, "r", encoding="utf-8") as f: 
    linije = f.readlines()

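# Strip the newline first so lines without a trailing ; do not end up with two newlines
linije = [re.sub(r";+\s*$", "", linija.rstrip("\n")) + "\n" for linija in linije]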


with open(output_file, "w", encoding="utf-8") as f: 
    f.writelines(linije)

print(f"The file with all trailing semicolons removed is saved as: {output_file}")



The file with all trailing semicolons removed is saved as: news_dataset1.csv


In [None]:
# ------------------------------------
# Replacing all , with ; in header
# ------------------------------------

input_file = "news_dataset1.csv"
output_file = "news_dataset_semicolon.csv"

with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
    
    header = fin.readline().strip() 
    
    header_semicolon = header.replace(',', ';')
    
    fout.write(header_semicolon + '\n')
    
    
    for line in fin:
        fout.write(line)

print(f"The file with the replaced separator is saved as: {output_file}")


The file with the replaced separator is saved as: news_dataset_semicolon.csv


In [None]:
# -------------------------------------------------------
# Removing all " at the beginning and at the end of each line
# -------------------------------------------------------

input_file = "news_dataset_semicolon.csv"
output_file = "news_dataset_no_quotes.csv"

with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
    for line in fin:                     
        line = line.strip('\n')         
        line = line.lstrip('"').rstrip('"')  
        fout.write(line + '\n')         

print(f"The file with all leading and trailing \" removed from each line is saved as: {output_file}")



The file with all leading and trailing " removed from each line is saved as: news_dataset_no_quotes.csv


In [None]:
# --------------------------------------------------------------------------------------------------------
# Replacing every doubled "" with ; and removing everything except letters, numbers, spaces and ; in each line
# --------------------------------------------------------------------------------------------------------

input_file = "news_dataset_no_quotes.csv"
output_file = "news_dataset_clean.csv"

with open(input_file, "r", encoding="utf-8") as fin, open(output_file, "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.replace('""', ';')                     
        line = re.sub(r'[^a-zA-Z0-9 ;\n]', '', line)        
        fout.write(line)                                    

print(f"The cleaned file is saved as: {output_file}")

The cleaned file is saved as: news_dataset_clean.csv
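
The four cleaning cells above each rewrite the whole file. As a rough illustration of what they do to a single line, the sketch below applies the same transformations in one function. The sample line is invented and the raw file layout is an assumption; `clean_line` is a hypothetical helper, not part of the notebook.

```python
import re

def clean_line(line, is_header=False):
    """Apply the notebook's four line-level cleaning steps to one line."""
    line = line.rstrip("\n")
    line = re.sub(r";+\s*$", "", line)          # 1. drop trailing semicolons
    if is_header:
        line = line.replace(",", ";")           # 2. header: commas become semicolons
    line = line.strip('"')                      # 3. drop quotes at both ends
    line = line.replace('""', ";")              # 4a. doubled quotes become separators
    return re.sub(r"[^a-zA-Z0-9 ;]", "", line)  # 4b. keep letters, digits, spaces, ;

# Invented sample line; the real file may be laid out differently.
raw = '"Some title!""Body text, with commas.""politics""Jan 1 2017""1";;\n'
print(clean_line(raw))
# Some title;Body text with commas;politics;Jan 1 2017;1
print(clean_line("title,text,subject,date,isfake;;\n", is_header=True))
# title;text;subject;date;isfake
```

Note that step 4b also removes apostrophes, hyphens, and sentence punctuation, which is why words such as "Doesnt" appear in the loaded data.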


In [228]:
# --------------------------------------------------
# Loading a clean file without problematic lines
# --------------------------------------------------

df = pd.read_csv(
    "news_dataset_clean.csv",
    sep=';',
    on_bad_lines='skip',
    engine='python' 
)

# -----------------------------------------
# Basic information about the dataset
# -----------------------------------------

print(df.info())
print(df.columns.tolist()) 

# ----------------------------
# Data set display
# ----------------------------
print("═" * 50)
print("FIRST 10 ROWS OF DATASET:")
print("═" * 50)
df.head(10)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10797 entries, 0 to 10796
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   title    10797 non-null  object 
 1   text     10794 non-null  object 
 2   subject  10792 non-null  object 
 3   date     10344 non-null  object 
 4   isfake   10341 non-null  float64
dtypes: float64(1), object(4)
memory usage: 421.9+ KB
None
['title', 'text', 'subject', 'date', 'isfake']
══════════════════════════════════════════════════
FIRST 10 ROWS OF DATASET:
══════════════════════════════════════════════════


Unnamed: 0,title,text,subject,date,isfake
0,WATCH Hypocrite Mike Pence Calls Democratic O...,This is unbelievably outrageousRepublicans are...,News,February 4 2017,1.0
1,Ammon and Ryan Bundy Found Not Guilty in Orego...,21st Century Wire Yesterday Judge Anna Brown h...,USNews,October 29 2016,1.0
2,WATCH HILARIOUS Video Proves CNN Doesnt Even B...,Watch these hilarious examples of CNN having r...,leftnews,Apr 3 2017,1.0
3,TRUMP CHIEF OF STAFF Goes At It With Liberal H...,Jan 29 2017,1,,
4,PRICELESS What Nancy Pelosi Just Said About Tr...,Nancy Pelosi is obviously geographically chall...,politics,Nov 10 2017,1.0
5,LOL Democrat Congressman Says Best Way To Figh...,MSNBC host asks Congressman Ted Leiu a Democra...,leftnews,Mar 18 2017,1.0
6,US Senate Democratic leader dismisses Republic...,WASHINGTON Reuters US Senate Democratic leade...,politicsNews,June 23 2016,0.0
7,UK must produce credible border plan to unlock...,DUBLIN Reuters Irish Prime Minister Leo Varad...,worldnews,December 1 2017,0.0
8,HILLARY CLINTON MEETS BLACK LIVES MATTER Says ...,Clinton pandered to Black Lives Matter while i...,politics,Oct 25 2016,1.0
9,SCOTT BAIO FILES POLICE REPORT Physically Atta...,Scott Baio became a teen idol starring as Chac...,leftnews,Dec 16 2016,1.0


In [None]:
# ----------------------------------
# How many rows were skipped:
# ----------------------------------

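# Note: this count includes the header line (and any blank lines),
# so the difference from len(df) overstates the number of skipped rows.
with open("news_dataset_clean.csv", "r", encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)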
                                    
print("Total lines in file:", total_lines)
print("Rows loaded into DataFrame:", len(df))

Total lines in file: 12006
Rows loaded into DataFrame: 10797
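
The gap between the two numbers does not say why lines were dropped. One way to investigate, sketched below, is to count how many fields each line has: with `on_bad_lines='skip'`, lines with more fields than the header are the ones skipped, while shorter lines are loaded with missing values (as row 3 in the preview above suggests). `field_count_histogram` is a hypothetical helper and the sample lines are invented; the sketch assumes no quoting remains in the cleaned file.

```python
from collections import Counter

def field_count_histogram(lines, sep=";"):
    """Count how many non-blank lines have each number of fields."""
    return Counter(line.count(sep) + 1 for line in lines if line.strip())

# Invented sample lines standing in for the cleaned file.
sample = [
    "title;text;subject;date;isfake",
    "A;B;politics;Jan 1 2017;1",
    "A;B;extra;politics;Jan 1 2017;1",  # 6 fields: too many
    "",                                  # blank line, ignored
    "A;Jan 1 2017;1",                    # 3 fields: too few
]
hist = field_count_histogram(sample)
print(sorted(hist.items()))  # [(3, 1), (5, 2), (6, 1)]
```

To run it on the real file, pass the open file object instead of `sample`, for example `field_count_histogram(open("news_dataset_clean.csv", encoding="utf-8"))`.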


In [230]:
# ---------------------------
# Check for NaN values
# ---------------------------

print(df.isna().sum())

title        0
text         3
subject      5
date       453
isfake     456
dtype: int64


In [231]:
# ---------------------------
# Check for duplicates
# ---------------------------
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 20


In [None]:
# -------------------------------------
# Drop rows with any missing values
# -------------------------------------
df_no_null = df.dropna() 
                         

# ------------------------
# Drop duplicate rows
# ------------------------
df_no_duplicates = df_no_null.drop_duplicates()

# --------------------------------
# Drop unnecessary column - date
# --------------------------------

df_no_duplicates = df_no_duplicates.drop(columns=['date'])

# ------------------
# Reset index
# ------------------
df_no_duplicates = df_no_duplicates.reset_index(drop=True)

# ------------
# Inspect
# ------------
print(df_no_duplicates.shape)
print(df_no_duplicates.head())

# --------------------
# Save cleaned CSV
# --------------------
output_file = "news_dataset_clean2.csv"
df_no_duplicates.to_csv(output_file, index=False, sep=';')
print(f"The cleaned file is saved as: {output_file}")

# --------------------------------------------
# Save the same cleaned DataFrame as JSON
# --------------------------------------------
json_file = "news_dataset_preprocessed.json"
df_no_duplicates.to_json(json_file, orient="records", lines=True)
print(f"The cleaned file is also saved as: {json_file}")



(10321, 4)
                                               title  \
0   WATCH Hypocrite Mike Pence Calls Democratic O...   
1  Ammon and Ryan Bundy Found Not Guilty in Orego...   
2  WATCH HILARIOUS Video Proves CNN Doesnt Even B...   
3  PRICELESS What Nancy Pelosi Just Said About Tr...   
4  LOL Democrat Congressman Says Best Way To Figh...   

                                                text   subject  isfake  
0  This is unbelievably outrageousRepublicans are...      News     1.0  
1  21st Century Wire Yesterday Judge Anna Brown h...    USNews     1.0  
2  Watch these hilarious examples of CNN having r...  leftnews     1.0  
3  Nancy Pelosi is obviously geographically chall...  politics     1.0  
4  MSNBC host asks Congressman Ted Leiu a Democra...  leftnews     1.0  
The cleaned file is saved as: news_dataset_clean2.csv
The cleaned file is also saved as: news_dataset_preprocessed.json


In [None]:
# -----------------------------------------------------------------------
# Project assignment: how many articles are political? First, list the subject categories
# -----------------------------------------------------------------------
print(df_no_duplicates['subject'].unique())
print(df_no_duplicates['subject'].nunique())

['News' 'USNews' 'leftnews' 'politics' 'politicsNews' 'worldnews'
 'Government News' 'Middleeast']
8


In [None]:
# -----------------------------------------------------------------------
# Project assignment: how many articles are political? Count the articles per subject
# -----------------------------------------------------------------------
df_no_duplicates['subject'].value_counts() 

subject
politicsNews       2920
worldnews          2691
News               1931
politics           1270
leftnews            882
Government News     303
USNews              165
Middleeast          159
Name: count, dtype: int64

In [None]:
# Count the articles whose subject is 'politicsNews' or 'politics'
politics = df_no_duplicates[df_no_duplicates['subject'].isin(['politicsNews','politics'])].shape[0]
print(f"Number of politics news: {politics}")                                                                                       

Number of politics news: 4190


In [None]:
# -------------------------
# Data Profiling Report
# -------------------------

# Install once if missing; ydata_profiling is already imported in the first cell
%pip install ydata-profiling

report = ProfileReport(df_no_duplicates, title="Data Profiling Report", explorative=True) 
report.to_notebook_iframe() 





### Data preprocessing

In [None]:
# -------------------------------
# Import necessary libraries
# -------------------------------
import nltk                        
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# ------------------------------------
# Download necessary NLTK resources
# ------------------------------------
nltk.download('stopwords')   
nltk.download('punkt')       # newer NLTK releases may require 'punkt_tab' instead
nltk.download('wordnet')     

# -----------------------------------------
# Initialize stopwords and lemmatizer
# -----------------------------------------
stop_words = set(stopwords.words('english')) 
lemmatizer = WordNetLemmatizer()

# ------------------------------------------------
# Create a new column combining title and text
# ------------------------------------------------
df_no_duplicates['title_content'] = df_no_duplicates['title'].fillna('') + ' ' + df_no_duplicates['text'].fillna('')

# ------------------------------------
#  Define preprocessing function
# ------------------------------------ 

def preprocess_text(text):
    # Lowercase all letters
    text = text.lower()
    # Remove numbers
    text = re.sub(r'\d+', '', text) 
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove redundant spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize text
    words = nltk.word_tokenize(text) 
    # Remove stopwords and lemmatize
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words] 
    # Join back into a single string
    return ' '.join(words) 

# ----------------------------------------------
# Apply preprocessing to the combined column
# ----------------------------------------------
df_no_duplicates['processed'] = df_no_duplicates['title_content'].apply(preprocess_text) 

# ----------------------
# Inspect results
# ----------------------
print(df_no_duplicates[['title_content', 'processed']].head())

json_file = "news_dataset_preprocessed.json"
df_no_duplicates.to_json(json_file, orient="records", lines=True)
print(f"Saved preprocessed JSON with 'processed' column: {json_file}")




[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/milica.antic011/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/milica.antic011/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/milica.antic011/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                       title_content  \
0   WATCH Hypocrite Mike Pence Calls Democratic O...   
1  Ammon and Ryan Bundy Found Not Guilty in Orego...   
2  WATCH HILARIOUS Video Proves CNN Doesnt Even B...   
3  PRICELESS What Nancy Pelosi Just Said About Tr...   
4  LOL Democrat Congressman Says Best Way To Figh...   

                                           processed  
0  watch hypocrite mike penny call democratic obs...  
1  ammon ryan bundy found guilty oregon federal c...  
2  watch hilarious video prof cnn doesnt even bot...  
3  priceless nancy pelosi said trump backfired bi...  
4  lol democrat congressman say best way fight fa...  
Saved preprocessed JSON with 'processed' column: news_dataset_preprocessed.json
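
To make the effect of the normalization steps easier to see, here is a simplified stand-in for `preprocess_text` that uses only the standard library. It keeps the regex steps but replaces NLTK tokenization with `str.split`, uses a tiny invented stopword set (`TOY_STOPWORDS`), and skips lemmatization entirely, so its output will differ from the notebook's on real articles.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"the", "a", "is", "in", "of", "to"}

def normalize(text):
    """Lowercase, strip digits and punctuation, collapse spaces, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)           # remove digits
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return [w for w in text.split() if w not in TOY_STOPWORDS]

print(normalize("The Senate voted 52-48 in 2017!"))  # ['senate', 'voted']
```

The lemmatization step that this sketch omits is not harmless: in the output above, "Pence" becomes "penny" and "Proves" becomes "prof", most likely because the lemmatizer treats them as plural nouns.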


In [258]:
df_no_duplicates

Unnamed: 0,title,text,subject,isfake,title_content,processed
0,WATCH Hypocrite Mike Pence Calls Democratic O...,This is unbelievably outrageousRepublicans are...,News,1.0,WATCH Hypocrite Mike Pence Calls Democratic O...,watch hypocrite mike penny call democratic obs...
1,Ammon and Ryan Bundy Found Not Guilty in Orego...,21st Century Wire Yesterday Judge Anna Brown h...,USNews,1.0,Ammon and Ryan Bundy Found Not Guilty in Orego...,ammon ryan bundy found guilty oregon federal c...
2,WATCH HILARIOUS Video Proves CNN Doesnt Even B...,Watch these hilarious examples of CNN having r...,leftnews,1.0,WATCH HILARIOUS Video Proves CNN Doesnt Even B...,watch hilarious video prof cnn doesnt even bot...
3,PRICELESS What Nancy Pelosi Just Said About Tr...,Nancy Pelosi is obviously geographically chall...,politics,1.0,PRICELESS What Nancy Pelosi Just Said About Tr...,priceless nancy pelosi said trump backfired bi...
4,LOL Democrat Congressman Says Best Way To Figh...,MSNBC host asks Congressman Ted Leiu a Democra...,leftnews,1.0,LOL Democrat Congressman Says Best Way To Figh...,lol democrat congressman say best way fight fa...
...,...,...,...,...,...,...
10316,VIDEO DEAF Team USA Athlete SEXUALLY ASSAULTED...,The inaction by police officers who should hav...,leftnews,1.0,VIDEO DEAF Team USA Athlete SEXUALLY ASSAULTED...,video deaf team usa athlete sexually assaulted...
10317,100 FED UP WITH HILLARY 2016 WEVE GOT THE AWES...,Yada yada yada Hillary Clinton announced her ...,politics,1.0,100 FED UP WITH HILLARY 2016 WEVE GOT THE AWES...,fed hillary weve got awesome answer reason why...
10318,BUILD THE WALL How Terrorists Have Been Coming...,OUR GOOD FRIENDS AT TEXAS BORDER VOLUNTEERS ar...,politics,1.0,BUILD THE WALL How Terrorists Have Been Coming...,build wall terrorist coming across border year...
10319,Russian Twitter accounts promoted Brexit ahead...,LONDON Reuters Russian Twitter accounts poste...,worldnews,0.0,Russian Twitter accounts promoted Brexit ahead...,russian twitter account promoted brexit ahead ...


### Model building

In [None]:

# ------------------------------
# Load preprocessed JSON
# ------------------------------

json_file = "news_dataset_preprocessed.json"
df = pd.read_json(json_file, orient="records", lines=True)

# ------------------------------
# Check columns
# ------------------------------
print(df.columns.tolist())

# -------------------------------------
# Make sure 'processed' column exists
# -------------------------------------
if 'processed' not in df.columns:
    raise ValueError("Column 'processed' not found! Run preprocessing first.")

# -------------------------------------
# Prepare sentences for Word2Vec
# -------------------------------------
sentences = [text.split() for text in df['processed']] 

# -------------------------------
# Train Word2Vec model
# -------------------------------
w2v_model = Word2Vec(
    sentences,         # list of tokenized sentences
    vector_size=100,   # each word → 100-dimensional vector
    window=5,          # context window size - 5 words left and right
    min_count=2,       # ignore words that appear fewer than 2 times
    workers=4,         # number of CPU cores
    epochs=20          # number of passes through the entire dataset
)

# ----------------------------------------
# Function to create document vectors         
# ----------------------------------------
def document_vector(doc):
    doc = [word for word in doc if word in w2v_model.wv.key_to_index]
    if len(doc) == 0:
        return np.zeros(w2v_model.vector_size) 
    return np.mean(w2v_model.wv[doc], axis=0) 

# ------------------------------
# Apply to each article
# ------------------------------
df['vector'] = df['processed'].apply(lambda x: document_vector(x.split())) 

# -------------------------------
# Prepare X and y for ML
# -------------------------------
X = np.vstack(df['vector'].values) 
y = df['isfake']  

# -------------------------------
# Train/test split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------------------------------
# Define models
# -------------------------------
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC()
}

# -------------------------------
# Train, predict, and evaluate
# -------------------------------
for name, model in models.items(): 
    print(f"\n--- {name} ---") 
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

   


['title', 'text', 'subject', 'isfake', 'title_content', 'processed']

--- Logistic Regression ---
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1120
           1       0.97      0.96      0.96       945

    accuracy                           0.97      2065
   macro avg       0.97      0.97      0.97      2065
weighted avg       0.97      0.97      0.97      2065


--- Random Forest ---
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1120
           1       0.96      0.94      0.95       945

    accuracy                           0.95      2065
   macro avg       0.95      0.95      0.95      2065
weighted avg       0.95      0.95      0.95      2065


--- Support Vector Machine ---
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1120
           1       0.97      0.97      0.97       945

    accuracy                 

# Fake News Detection - Final Project


## 1. Model Performance

### Logistic Regression
- **Accuracy:** 97%  
- **F1-score:** 0.97 (true news), 0.96 (fake news)

### Random Forest
- **Accuracy:** 95%  
- **F1-score:** 0.96 (true news), 0.95 (fake news)

### Support Vector Machine (SVM)
- **Accuracy:** 97%  
- **F1-score:** 0.98 (true news), 0.97 (fake news)
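
For readers less familiar with these metrics, the sketch below shows how precision, recall, and F1 for one class follow from the counts of true positives, false positives, and false negatives. The counts are hypothetical and are not taken from the notebook's results; `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the "fake" class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.75 0.82
```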

---

## 2. Observations
- Logistic Regression and SVM slightly outperform Random Forest.  
- All models demonstrate high effectiveness in classifying fake vs true news.  
- Preprocessing and Word2Vec embeddings provide strong features for the models.
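
Whether a two-point gap in accuracy is meaningful depends on the size of the test set. As a rough check, the sketch below computes a 95% confidence interval for each reported accuracy using the normal approximation, with the test-set size of 2065 from the classification reports. This is only an approximation: the accuracies are rounded, and because all models share one test set, a paired test would be more appropriate. `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

n_test = 2065  # support reported in the classification reports
lr_low, lr_high = accuracy_interval(0.97, n_test)
rf_low, rf_high = accuracy_interval(0.95, n_test)
print(f"Logistic Regression: {lr_low:.3f} to {lr_high:.3f}")
print(f"Random Forest:       {rf_low:.3f} to {rf_high:.3f}")
print("Intervals overlap:", rf_high >= lr_low)
```

Under these assumptions the two intervals do not overlap, which is consistent with the observation that Random Forest trails the other models here.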

---

## 3. Conclusion
The pipeline (text preprocessing, Word2Vec embeddings, and ML classifiers) reaches 95% to 97% accuracy on the held-out test set of 2,065 articles.  
Logistic Regression and SVM are recommended for deployment due to their slightly higher accuracy and balanced performance across both classes.
