<a href="https://colab.research.google.com/github/laibaabbas/NLP/blob/main/Fake_new_Classification(NLP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Detection on Social Media using Deep Learning

## Classwork NLP Project (Google Colab Notebook)

---

## 1. Introduction

### Project Title

**Fake News Detection on Social Media using Deep Learning**

### Domain

Social Media • Media Literacy • Security

### Problem Statement

Fake news on social media platforms can mislead users, influence public opinion, and pose serious social and security risks. The objective of this project is to build **deep learning–based Natural Language Processing (NLP) models** that can automatically classify news articles as:

* **Real News (0)**
* **Fake News (1)**

This is a **binary text classification problem**.

---

## 2. Dataset Description

### Dataset Name

**WELFake Dataset**

### Dataset Characteristics

* Total samples: **72,134 news articles**
* Approximately balanced between the two classes
* Textual data only
* Two main text fields:

  * `title`
  * `text` (news content)
* Target label:

  * `0` → Real News
  * `1` → Fake News

---

## 3. Learning Objectives

By completing this notebook, students will learn:

* Text preprocessing for NLP tasks
* Tokenization and sequence padding
* Word embeddings for deep learning
* Building multiple deep learning models for text
* Evaluating and comparing NLP models
* Performing predictions on unseen text data

---

## 4. Environment Setup

### Step 1: Import Required Libraries



In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
import string

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, LSTM, GRU, Bidirectional, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout
from tensorflow.keras.callbacks import EarlyStopping






---

## 5. Load the Dataset

### Step 2: Upload Dataset in Colab



In [None]:
# from google.colab import files
# uploaded = files.upload()

In [None]:
import kagglehub
path = kagglehub.dataset_download("saurabhshahane/fake-news-classification")

### Step 3: Read the Dataset

In [None]:
import os

os.listdir(path)

In [None]:
file_path = os.path.join(path, "WELFake_Dataset.csv")
df = pd.read_csv(file_path)

In [None]:
# df = pd.read_csv('WELFake_Dataset.csv')
df.head()

---

## 6. Data Understanding


In [None]:
print('Dataset Shape:', df.shape)
df.info()

### Check Class Distribution


In [None]:
df['label'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
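The bar chart gives a visual impression of balance; the same check can be done numerically. The sketch below is a minimal illustration on hypothetical toy labels (not the WELFake data), and `class_shares` is a helper name introduced here. In the notebook itself, `df['label'].value_counts(normalize=True)` should report the equivalent fractions.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels (not the WELFake data): 0 = real, 1 = fake
toy_labels = [0, 1, 1, 0, 1, 0, 1, 1]
shares = class_shares(toy_labels)
print(shares)
```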

---

## 7. Text Preprocessing

### Step 4: Combine Title and Text

In [None]:
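# Fill missing titles/texts first; otherwise NaN + str yields NaN and the row's content is lost
df['content'] = df['title'].fillna('') + ' ' + df['text'].fillna('')
df = df[['content', 'label']]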


### Step 5: Text Cleaning Function


In [None]:
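def clean_text(text):
    """Lowercase the text and strip brackets, URLs, HTML tags, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # bracketed segments, e.g. [video]
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', ' ', text)                                   # newlines become spaces so words do not merge
    text = re.sub(r'\w*\d\w*', '', text)                              # tokens containing digits
    return text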

df['content'] = df['content'].fillna('')  # Fill NaN values with empty strings
df['content'] = df['content'].apply(clean_text)
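To see what the cleaning steps do in practice, the sketch below applies the same substitutions to one made-up headline. `clean_text_demo` and the sample string are assumptions introduced for illustration; the function repeats the regex steps so that it runs on its own, and it leaves out the newline step because the sample contains no line breaks.

```python
import re
import string

def clean_text_demo(text):
    # Same substitutions as the cleaning function above, minus the newline step
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # drops "[video]"
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # drops the URL
    text = re.sub(r'<.*?>+', '', text)                                # drops "<b>" and "</b>"
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drops ":", "!", "."
    text = re.sub(r'\w*\d\w*', '', text)                              # drops "5" and "2024"
    return text

sample = "BREAKING [VIDEO]: Visit https://example.com <b>now</b>! Only 5 days left in 2024."
cleaned = clean_text_demo(sample)
tokens = cleaned.split()
print(tokens)  # ['breaking', 'visit', 'now', 'only', 'days', 'left', 'in']
```

The cleaned string still contains runs of spaces where tokens were removed. That is harmless here because the Keras `Tokenizer` splits on whitespace.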

---

## 8. Train-Test Split


In [None]:
X = df['content']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
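The `stratify=y` argument keeps the real/fake ratio the same in the training and test sets. The sketch below illustrates the idea in plain Python on hypothetical toy labels. `stratified_split` is a name introduced here; it is a simplified stand-in, not scikit-learn's actual implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # take the same fraction from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical toy labels: 60 real (0) and 40 fake (1)
toy_labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(toy_labels)
test_fake_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_fake_share)  # 80 20 0.4
```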

---

## 9. Tokenization and Padding


In [None]:
vocab_size = 50000
max_length = 300
oov_token = '<OOV>'


tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding='post')
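The following plain-Python sketch shows roughly what the tokenizer and padding steps produce. The helper names (`build_word_index`, `to_sequence`, `pad_post`) and the toy sentences are assumptions introduced here. The sketch mimics the documented Keras behaviour (index 0 reserved for padding, index 1 for the OOV token, more frequent words getting lower indices, and truncation from the front by default), but it is not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids: 1 is the OOV token, frequent words get low ids."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words):
    """Replace each word by its id; unknown or too-rare words become the OOV id (1)."""
    ids = []
    for word in text.split():
        idx = word_index.get(word, 1)
        ids.append(idx if idx < num_words else 1)
    return ids

def pad_post(seq, maxlen):
    """Keep the last maxlen ids (front truncation), then pad with zeros at the end."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

train_texts = ["fake news spreads fast", "real news spreads slowly"]
word_index = build_word_index(train_texts)

seq = to_sequence("fake news travels fast", word_index, num_words=50000)
print(seq)               # [4, 2, 1, 5]  ("travels" was never seen, so it maps to 1)
print(pad_post(seq, 6))  # [4, 2, 1, 5, 0, 0]
print(pad_post(seq, 3))  # [2, 1, 5]
```

One consequence worth noting: with `max_length = 300` and the default front truncation, articles longer than 300 tokens keep only their last 300 tokens. Passing `truncating='post'` to `pad_sequences` would keep the beginning instead.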

---

## 10. Model 1: Embedding + LSTM

In [None]:
model_lstm = Sequential([
    tf.keras.Input(shape=(max_length,)),  # explicit input shape; `input_length` is deprecated in Keras 3
    Embedding(vocab_size, 128),
    LSTM(128),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_lstm.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model_lstm.summary()
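The parameter counts reported by `model.summary()` can be reproduced by hand, which helps show where the model's capacity sits. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the hyperparameters from this notebook. The helper function names are introduced here for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, units = 50000, 128, 128
emb = embedding_params(vocab_size, embed_dim)  # 6,400,000
lstm = lstm_params(embed_dim, units)           # 131,584
out = dense_params(units, 1)                   # 129
total = emb + lstm + out
print(emb, lstm, out, total)                   # 6400000 131584 129 6531713
print(round(emb / total, 3))                   # share of parameters in the embedding
```

Roughly 98% of the parameters are in the embedding table, so the vocabulary size affects model size far more than the number of LSTM units does.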

In [None]:
history_lstm = model_lstm.fit(
    X_train_pad, y_train,
    validation_split=0.2,
    epochs=5,
    batch_size=64,
    callbacks=[EarlyStopping(patience=2, restore_best_weights=True)]
)

Epoch 1/5
722/722 ━━━━━━━━━━━━━━━━━━━━ 553s 760ms/step - accuracy: 0.7335 - loss: 0.4921 - val_accuracy: 0.8025 - val_loss: 0.3622
Epoch 2/5
722/722 ━━━━━━━━━━━━━━━━━━━━ 570s 770ms/step - accuracy: 0.7838 - loss: 0.3853 - val_accuracy: 0.8966 - val_loss: 0.2692
Epoch 3/5
722/722 ━━━━━━━━━━━━━━━━━━━━ 554s 766ms/step - accuracy: 0.9242 - loss: 0.2140 - val_accuracy: 0.7251 - val_loss: 0.4253
Epoch 4/5
722/722 ━━━━━━━━━━━━━━━━━━━━ 548s 759ms/step - accuracy: 0.8661 - loss: 0.2823 - val_accuracy: 0.9321 - val_loss: 0.1779
Epoch 5/5
406/722 ━━━━━━━━━━━━━━━━━━━━ 3:46 717ms/step - accuracy: 0.9675 - loss: 0.1043
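The `EarlyStopping(patience=2)` callback stops training once the monitored value (validation loss by default) has failed to improve for two consecutive epochs. The sketch below is a simplified model of that rule, not the Keras source. `early_stopping_epoch` is a name introduced here. It is applied to the four validation losses logged above and to a second, hypothetical sequence.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

logged = [0.3622, 0.2692, 0.4253, 0.1779]    # val_loss for epochs 1-4 in the log above
print(early_stopping_epoch(logged))          # None: epoch 3 got worse, but epoch 4 improved

hypothetical = [0.50, 0.40, 0.45, 0.41, 0.42]  # made-up values for illustration
print(early_stopping_epoch(hypothetical))      # 4: epochs 3 and 4 both failed to beat 0.40
```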


---

## 11. Model 2: Embedding + BiLSTM


In [None]:
model_bilstm = Sequential([
    tf.keras.Input(shape=(max_length,)),  # explicit input shape; `input_length` is deprecated in Keras 3
    Embedding(vocab_size, 128),
    Bidirectional(LSTM(128)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_bilstm.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model_bilstm.fit(
    X_train_pad, y_train,
    validation_split=0.2,
    epochs=5,
    batch_size=64
)


---

## 12. Model 3: CNN for Text Classification


In [None]:
model_cnn = Sequential([
    tf.keras.Input(shape=(max_length,)),  # explicit input shape; `input_length` is deprecated in Keras 3
    Embedding(vocab_size, 128),
    Conv1D(128, 5, activation='relu'),
    MaxPooling1D(pool_size=2),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_cnn.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model_cnn.fit(
    X_train_pad, y_train,
    validation_split=0.2,
    epochs=5,
    batch_size=64
)
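It can help to trace how the sequence length changes through the convolutional layers. The sketch below applies the standard output-length formulas for unpadded ("valid") 1D convolution and pooling, which are the Keras defaults, to this model's settings. The helper names are introduced here for illustration.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # 'valid' padding (the Conv1D default): the kernel must fit entirely inside the sequence
    return (input_length - kernel_size) // stride + 1

def maxpool1d_output_length(input_length, pool_size):
    # MaxPooling1D defaults: stride equals pool_size, 'valid' padding
    return (input_length - pool_size) // pool_size + 1

max_length, embed_dim, filters, kernel_size = 300, 128, 128, 5

after_conv = conv1d_output_length(max_length, kernel_size)  # 296 positions x 128 filters
after_pool = maxpool1d_output_length(after_conv, 2)         # 148 positions x 128 filters
conv_params = kernel_size * embed_dim * filters + filters   # weights plus one bias per filter
print(after_conv, after_pool, conv_params)                  # 296 148 82048
```

`GlobalMaxPooling1D` then takes the maximum over all remaining positions, leaving one value per filter (128 features). Because the global maximum is the same with or without the preceding `MaxPooling1D(pool_size=2)`, apart from at most one trailing position dropped when the length is odd, that intermediate pooling layer does not change the result for this configuration.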


---

## 13. Model 4: GRU


In [None]:
model_gru = Sequential([
    tf.keras.Input(shape=(max_length,)),  # explicit input shape; `input_length` is deprecated in Keras 3
    Embedding(vocab_size, 128),
    GRU(128),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model_gru.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model_gru.fit(
    X_train_pad, y_train,
    validation_split=0.2,
    epochs=5,
    batch_size=64
)
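A GRU uses three gated transformations where an LSTM uses four, so it has fewer parameters for the same number of units. The sketch below compares the recurrent layers used in this notebook. It assumes the Keras default `reset_after=True` for `GRU`, which keeps two bias vectors per gate. The helper names are introduced here.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # 3 gates; with reset_after=True each gate has separate input and recurrent biases
    return 3 * (units * (input_dim + units) + 2 * units)

embed_dim, units = 128, 128
lstm = lstm_params(embed_dim, units)
bilstm = 2 * lstm                     # one LSTM per direction
gru = gru_params(embed_dim, units)
print(lstm, bilstm, gru)              # 131584 263168 99072
```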


---

## 14. Model Evaluation

In [None]:
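def evaluate_model(model, X_test, y_test):
    # Threshold the sigmoid outputs at 0.5 and flatten (n, 1) to (n,)
    y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
    print(classification_report(y_test, y_pred, target_names=['Real', 'Fake']))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
                xticklabels=['Real', 'Fake'], yticklabels=['Real', 'Fake'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

# Evaluate every trained model on the same held-out test set so they can be compared
for name, model in [('LSTM', model_lstm), ('BiLSTM', model_bilstm),
                    ('CNN', model_cnn), ('GRU', model_gru)]:
    print(f'{name} Evaluation')
    evaluate_model(model, X_test_pad, y_test)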
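The classification report is derived from the four cells of the confusion matrix. As a minimal sketch, the function below computes the main metrics for the positive (fake) class. `binary_metrics` is a name introduced here, and the counts are hypothetical values chosen for easy arithmetic, not results from this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)  # of everything predicted fake, how much was fake
    recall = tp / (tp + fn)     # of everything actually fake, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts (not results from this notebook)
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For fake news detection the two error types have different costs: a false positive flags a legitimate article, while a false negative lets a fake one through, so precision and recall are worth reading separately rather than relying on accuracy alone.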



---

## 15. Prediction on New News Article


In [None]:
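sample_text = "Breaking: Scientists confirm water found on Mars"

# Apply the same cleaning used for the training data before tokenizing
sample_seq = tokenizer.texts_to_sequences([clean_text(sample_text)])
sample_pad = pad_sequences(sample_seq, maxlen=max_length, padding='post')

# predict() returns an array of shape (1, 1); extract the scalar probability
prediction = model_bilstm.predict(sample_pad)[0][0]

if prediction > 0.5:
    print('Prediction: Fake News')
else:
    print('Prediction: Real News')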
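The final `Dense(1, activation='sigmoid')` layer outputs a probability that the input belongs to class 1 (fake), and the 0.5 threshold turns it into a label. The sketch below illustrates that mapping in isolation. `sigmoid` and `label_from_probability` are helper names introduced here, and the input scores are made-up values.

```python
import math

def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def label_from_probability(p, threshold=0.5):
    """Map the predicted probability of class 1 (fake) to a readable label."""
    return 'Fake News' if p > threshold else 'Real News'

for score in (-2.0, 0.0, 2.0):  # made-up raw scores
    p = sigmoid(score)
    print(f"score={score:+.1f}  p(fake)={p:.3f}  ->  {label_from_probability(p)}")
```

Raising the threshold above 0.5 makes the classifier more conservative about flagging an article as fake, at the cost of missing more fake articles.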


---

## 16. Conclusion

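In this classwork project, four deep learning architectures (LSTM, BiLSTM, CNN, and GRU) were implemented for fake news detection on the WELFake dataset. The notebook covers a complete NLP pipeline: text cleaning, tokenization and padding, model training, evaluation with a classification report and confusion matrix, and prediction on unseen text. Students can compare how the architectures handle textual data by evaluating each one on the same held-out test set. The pipeline can serve as a starting point for social media security applications.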

---

## End of Notebook
