# 🧠 Fake News Detection using NLP & Machine Learning

Welcome, dear students 👋!  
In this fun mini-project, you will build a **Fake News Detector** using **NLP and Machine Learning**.

We'll learn step-by-step how text is processed, understood, and classified by AI systems 💡.

Let's begin our happy learning journey 🚀

## 📘 Step 1: Linguistic Foundation

Before coding, let's recall the **four linguistic levels** of NLP:
- **Syntax** 🧩 — sentence structure (grammar rules)
- **Semantics** 💬 — meaning of words/sentences
- **Pragmatics** 🧠 — meaning based on context
- **Discourse** 📖 — how sentences connect to form a meaningful paragraph

👉 **POS Tagging** and **Chunking** (Step 4) work at the syntax level, and WordNet (Step 5) gives a first look at semantics.

In [1]:
# Example sentence for linguistic analysis
example_sentence = "The intelligent system learns new patterns efficiently."

# TODO: Tokenize the sentence and perform POS tagging using nltk
# HINT: use nltk.word_tokenize() and nltk.pos_tag()

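import nltk
# Newer NLTK releases (3.9 and later) look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if you hit a LookupError
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')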

# --- Write your code below ---
# tokens = ...
# pos_tags = ...
# print(pos_tags)


## 🧹 Step 2: Data Preprocessing with NLTK

We'll clean text data to remove unnecessary symbols, numbers, and stopwords.

### Concepts Covered:
- Lowercasing
- Removing numbers & punctuation
- Stopword removal
- Tokenization
- Stemming & Lemmatization
- Normalization

Let's define our preprocessing function 🧼

In [2]:
import re, string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

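# Build the stopword set and the stemmer/lemmatizer once, not on every call
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r'\d+', '', text)                                   # drop digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop ASCII punctuation
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]               # set lookup is fast
    # Stem each token, then pass the stem through the lemmatizer
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return ' '.join(tokens)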

# Test preprocessing
sample_text = "Artificial Intelligence in 2025 will revolutionize industries!"
print(preprocess(sample_text))


artifici intellig revolution industri
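
To see what each cleaning step contributes, the sketch below applies the first four steps one at a time using only the standard library. The `TINY_STOPWORDS` set and the `clean_steps` helper are hypothetical stand-ins for illustration: NLTK's English stopword list is much longer, a whitespace split replaces `word_tokenize`, and stemming and lemmatization are left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (NLTK's is far larger)
TINY_STOPWORDS = {"in", "will", "the", "a", "of"}

def clean_steps(text):
    """Return the intermediate result after each cleaning step."""
    lowered = text.lower()
    no_digits = re.sub(r"\d+", "", lowered)
    no_punct = no_digits.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace split stands in for word_tokenize
    kept = [t for t in tokens if t not in TINY_STOPWORDS]
    return {
        "lowered": lowered,
        "no_digits": no_digits,
        "no_punct": no_punct,
        "tokens": tokens,
        "kept": kept,
    }

steps = clean_steps("Artificial Intelligence in 2025 will revolutionize industries!")
for name, value in steps.items():
    print(f"{name:>10}: {value}")
```

Stemming is what then shortens the surviving words into forms such as `artifici` and `industri`.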


## 📊 Step 3: Load Dataset

We'll load a Fake News dataset from Hugging Face and sample 1,000 rows to keep things fast. Each row contains a news article and a label (real/fake).

In [8]:

import pandas as pd
from datasets import load_dataset

# Load the dataset from Hugging Face
ds = load_dataset("ErfanMoosaviMonazzah/fake-news-detection-dataset-English")

# Convert the training split to a pandas DataFrame
df = ds['train'].to_pandas()

# Keep the 'text' and 'label' columns, drop missing rows, and sample 1,000 for speed
data = df[['text', 'label']].dropna().sample(1000, random_state=42)

data.head()



| | text | label |
|---|---|---|
| 2308 | The U.S. Senate on Friday backed a plan to nam... | 1 |
| 22404 | What a difference a year makes. A year ago, Pr... | 1 |
| 23397 | The US Supreme Court is set to decide the firs... | 0 |
| 25058 | U.S. President Donald Trump is expected to mak... | 1 |
| 2664 | U.S. Supreme Court Justice Antonin Scalia’s ca... | 1 |


In [10]:
# Apply preprocessing to all text
data['clean_text'] = data['text'].apply(preprocess)
data.head()

| | text | label | clean_text |
|---|---|---|---|
| 2308 | The U.S. Senate on Friday backed a plan to nam... | 1 | u senat friday back plan name plaza front chin... |
| 22404 | What a difference a year makes. A year ago, Pr... | 1 | differ year make year ago prime minist theresa... |
| 23397 | The US Supreme Court is set to decide the firs... | 0 | u suprem court set decid first major abort cas... |
| 25058 | U.S. President Donald Trump is expected to mak... | 1 | u presid donald trump expect make announc week... |
| 2664 | U.S. Supreme Court Justice Antonin Scalia’s ca... | 1 | u suprem court justic antonin scalia ’ caus de... |
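
Notice that the last row's `clean_text` still contains the curly apostrophe `’`. That happens because `string.punctuation` lists ASCII characters only. One possible fix, sketched here with a hypothetical helper named `strip_punct`, is to drop every character whose Unicode category is punctuation (`P*`) or symbol (`S*`), which covers all of `string.punctuation` plus typographic quotes and dashes.

```python
import string
import unicodedata

def strip_punct(text):
    """Drop characters in the Unicode punctuation (P*) or symbol (S*) categories."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )

print(strip_punct("Scalia’s cause of death: unknown!"))
```

Swapping this in for the `translate` call inside `preprocess` would be one way to try it; whether that helps the classifier is something to check experimentally.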


## 🧩 Step 4: POS Tagging, NER, and Chunking

Let's analyze sentence structure using NLTK’s tools.

👉 Complete missing code lines below.

In [None]:
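# Tag the raw text: the cleaned text is lowercased, stemmed, and stripped of
# stopwords (including determiners), which degrades POS tagging and chunking
sample = data['text'].iloc[0]
tokens = word_tokenize(sample)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags[:10])  # show first few tagged words

# Define a chunk grammar: optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(pos_tags)

# TODO: Visualize or print the tree
# HINT: print(result), or result.draw() if you run locally (opens a GUI window)

# TODO: Run named entity recognition (NER) on the tagged tokens
# HINT: nltk.ne_chunk(pos_tags)  # needs the 'maxent_ne_chunker' and 'words' resources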
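
To see what the chunk grammar does without running a tagger, here is a small sketch that feeds hand-written POS tags for the Step 1 example sentence into `RegexpParser`. The tags are typed in by hand as an assumption about what a tagger would produce, and `np_chunks` is a hypothetical helper. It also shows one limitation of the grammar: `<NN>` matches only the singular-noun tag, so a pattern such as `<NN.*>` is needed to catch plurals like "patterns".

```python
import nltk

# Hand-written tags (an assumption, not tagger output) for the Step 1 sentence
tagged = [
    ("The", "DT"), ("intelligent", "JJ"), ("system", "NN"),
    ("learns", "VBZ"), ("new", "JJ"), ("patterns", "NNS"),
    ("efficiently", "RB"), (".", "."),
]

def np_chunks(grammar, tagged_tokens):
    """Return each NP chunk found by the grammar as a plain string."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

strict = np_chunks("NP: {<DT>?<JJ>*<NN>}", tagged)      # singular nouns only
relaxed = np_chunks("NP: {<DT>?<JJ>*<NN.*>}", tagged)   # any noun tag
print(strict)
print(relaxed)
```

`RegexpParser` works on already-tagged tokens, so this runs without downloading any NLTK data.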

## 🌐 Step 5: Explore WordNet
WordNet helps understand word meanings and relations.
Try exploring synonyms, definitions, and examples!

In [None]:
from nltk.corpus import wordnet

word = 'intelligence'
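syns = wordnet.synsets(word)

first = syns[0]  # look at the first sense WordNet lists for the word
print(f"Definition: {first.definition()}")
print(f"Examples: {first.examples()}")
# Synonyms are the lemma names that share this synset
print(f"Synonyms: {first.lemma_names()}")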

## 🧮 Step 6: Feature Extraction (BoW + TF-IDF)
Let's convert text into numerical features that ML models can understand!

In [None]:
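from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag of Words: raw term counts per document
bow_vectorizer = CountVectorizer(max_features=1000)
X_bow = bow_vectorizer.fit_transform(data['clean_text'])

# TF-IDF: counts reweighted so terms found in many documents count less
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['clean_text'])
y = data['label']

print(f"BoW matrix shape: {X_bow.shape}")
print(f"Feature matrix shape: {X.shape}")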
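
For intuition about what the two vectorizers compute, here is a dependency-free sketch on three toy documents. It uses raw counts for Bag of Words and the smoothed weighting `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is how scikit-learn documents the defaults of `TfidfVectorizer`. The documents and the `tfidf` helper are made up for illustration, and tokenization here is a plain whitespace split.

```python
import math
from collections import Counter

docs = [
    "senate backs new plan",
    "court backs new ruling",
    "aliens endorse new plan",
]

def tfidf(documents):
    """Return (vocabulary, BoW count rows, L2-normalized TF-IDF rows)."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n = len(documents)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}

    bow_rows, tfidf_rows = [], []
    for tokens in tokenized:
        counts = Counter(tokens)
        bow = [counts[term] for term in vocab]
        weighted = [count * idf[term] for count, term in zip(bow, vocab)]
        norm = math.sqrt(sum(w * w for w in weighted))
        bow_rows.append(bow)
        tfidf_rows.append([w / norm for w in weighted])
    return vocab, bow_rows, tfidf_rows

vocab, bow_rows, tfidf_rows = tfidf(docs)
print(vocab)
print(bow_rows[0])
print([round(w, 3) for w in tfidf_rows[0]])
```

In the first document, "new" and "senate" both have a count of 1, but "senate" ends up with the larger TF-IDF weight because it appears in only one document while "new" appears in all three.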

## 🧭 Step 7: Document Similarity
Let's calculate how similar two documents are using cosine similarity 🧮.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

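# Compare the first two documents. The result is a 2x2 matrix: the
# off-diagonal entries hold their similarity, and the diagonal compares
# each document with itself
similarity = cosine_similarity(X[0:2])
print(similarity)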
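
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A minimal from-scratch sketch, with a hypothetical `cosine` helper and made-up count vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero document vector
    return dot / (norm_u * norm_v)

# Toy term-count vectors over a hypothetical 4-word vocabulary
doc_a = [1, 1, 0, 2]
doc_b = [1, 0, 0, 1]
doc_c = [0, 0, 3, 0]

print(round(cosine(doc_a, doc_b), 3))  # documents sharing words
print(cosine(doc_a, doc_c))            # documents with no words in common
```

Documents that share no terms score 0, and a document compared with itself scores 1.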

## 🤖 Step 8: Classification Models (SVM, Decision Tree, Random Forest)
Now, let's train and test our models!

We'll compare how each performs using the precision, recall, F1-score, and accuracy reported by `classification_report`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

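# Stratify so both splits keep the same real/fake ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define models (fixed seeds make the tree-based results reproducible)
models = {
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}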

for name, model in models.items():
    print(f"\nTraining {name} model...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Results for {name}:")
    print(classification_report(y_test, y_pred))
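
One caveat with this flow: the TF-IDF vectorizer in Step 6 was fitted on every document before the split, so vocabulary and document-frequency statistics from the test rows influence training. A common way to avoid that is to split the raw text first and wrap the vectorizer and classifier in a single pipeline. The sketch below shows the idea on eight made-up headlines. The texts, the label convention (1 = real, 0 = fake), and the choice of `LinearSVC` are assumptions for illustration, not taken from the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus; the label convention (1 = real, 0 = fake) is an assumption
texts = [
    "senate passes budget bill after long debate",
    "court rules on state election law",
    "president signs trade agreement with canada",
    "central bank holds interest rates steady",
    "shocking secret cure doctors hate revealed",
    "you will not believe what this celebrity did",
    "aliens secretly control the world government",
    "miracle pill melts fat overnight shocking",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vectorizer never sees test documents
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary from the training texts only
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(list(zip(X_test, predictions)))
```

With so few examples the predictions themselves mean little; the point is the order of operations, which carries over unchanged to `data['clean_text']`.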

## ⚖️ Step 9: Handling Imbalanced Dataset
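Sometimes the classes aren't balanced. Let's oversample the minority class using **RandomOverSampler**, on the training split only, so duplicated rows never end up in the test set.

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
# Resample the training split only; the test set must stay untouched
X_res, y_res = ros.fit_resample(X_train, y_train)
print('Before:', X_train.shape, 'After:', X_res.shape)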
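
The idea behind random oversampling is simple: duplicate randomly chosen rows of the smaller classes until every class is as large as the biggest one. Here is a minimal standard-library sketch of that idea. The `random_oversample` helper and the toy data are hypothetical, and this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate random minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # sample with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy imbalanced data: four rows of class 1, one row of class 0
rows = ["real 1", "real 2", "real 3", "real 4", "fake 1"]
labels = [1, 1, 1, 1, 0]

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(labels), "->", Counter(res_labels))
```

Because the added rows are exact copies, oversampling adds no new information; it only changes how much weight the minority class gets during training.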

## 📈 Step 10: Evaluation Metrics & Confusion Matrix
Let's visualize model performance!

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(models['Random Forest'], X_test, y_test)
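
The plot is a 2×2 grid of counts, and the precision, recall, and F1 numbers in `classification_report` all derive from those counts. A hand-rolled sketch with made-up labels, treating 1 as the positive class (an assumption; check which value your dataset uses for each class):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```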

## 🔮 Step 11: Future Directions in NLP
NLP is evolving rapidly! ✨
- **Transformers** like BERT, GPT, and T5 dominate current research.
- **Few-shot & zero-shot learning** are reducing labeled data needs.
- **Ethical AI** work aims to make models fair and transparent.

Keep exploring, learning, and experimenting — the future is yours 🌍!

🎉 **Congratulations!** You've completed the mini-project successfully! 💪

You now understand the NLP pipeline — from syntax to semantics, and from preprocessing to machine learning. 👏