# Fake News Classification with spaCy Word Vectors

## 1. Introduction

This project aims to build a machine learning model to accurately classify news articles as either **Real** or **Fake**. This is a classic example of a binary text classification problem in Natural Language Processing (NLP).

Our approach is to convert the text of each news article into a numerical representation using **spaCy's word vectors**, which encode some of the text's semantic content. We then train and compare several machine learning models to see which performs best on this classification task.

The workflow is as follows:
1.  Load the dataset and perform a quick exploratory analysis.
2.  Encode the labels and generate one document embedding per article using spaCy.
3.  Train and evaluate four different classification models:
    - K-Nearest Neighbors (KNN)
    - Logistic Regression
    - Support Vector Machine (SVM)
    - Gaussian Naive Bayes (the Naive Bayes variant suited to continuous features)
4.  Compare the results and draw a conclusion.

In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np
import spacy

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv('/Users/hasancan/Downloads/fake_news.csv')

## 2. Data Loading and Exploration

First, let's examine the structure of our dataset and the distribution of the target variable.

In [12]:
# Display the first few rows
print("Dataset preview:")
df.head()

Dataset preview:


| Unnamed: 0 | Text | label |
| :--- | :--- | :--- |
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [13]:
# Check the distribution of 'Fake' vs. 'Real' news
print("\nClass distribution:")
df.label.value_counts()


Class distribution:


label
Fake    5000
Real    4900
Name: count, dtype: int64

### Observation:
The dataset is well balanced, with 5,000 "Fake" and 4,900 "Real" articles. Resampling techniques such as over-sampling or under-sampling are therefore unnecessary, and plain accuracy is a meaningful metric.
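
As a quick sanity check on what this balance implies, the class counts from the output above give the accuracy of a trivial classifier that always predicts the majority class. This is only a back-of-the-envelope sketch; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"Fake": 5000, "Real": 4900}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this floor of roughly 50%.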

## 3. Preprocessing and Feature Engineering

### 3.1. Label Encoding
We convert the text labels ('Fake', 'Real') into numerical labels (0, 1). scikit-learn classifiers also accept string targets, but an explicit numeric column makes it unambiguous which class is 0 and which is 1 in the reports.

In [14]:
# Convert text labels to numerical labels
df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})
df.head()

| Unnamed: 0 | Text | label | label_num |
| :--- | :--- | :--- | :---: |
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake | 0 |
| 1 | U.S. conservative leader optimistic of common ... | Real | 1 |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real | 1 |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake | 0 |
| 4 | Democrats say Trump agrees to work on immigrat... | Real | 1 |
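
One thing worth checking after this step: `Series.map` with a dict returns `NaN` for any value that is not a key, so a stray label (different casing, extra whitespace) would silently become a missing target. A minimal check, shown on a small hypothetical frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy frame; the last label has a trailing space on purpose.
toy = pd.DataFrame({"label": ["Fake", "Real", "Real", "Fake "]})

label_map = {"Fake": 0, "Real": 1}
toy["label_num"] = toy["label"].map(label_map)

# Unmapped labels come back as NaN; count them before training.
n_unmapped = int(toy["label_num"].isna().sum())
print(f"Unmapped labels: {n_unmapped}")

# One possible fix: strip surrounding whitespace first, then map again.
toy["label_num"] = toy["label"].str.strip().map(label_map)
assert toy["label_num"].notna().all()
```

Running the same `isna().sum()` check on `df['label_num']` would confirm that every article received a label.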


### 3.2. Text to Vector Conversion
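We will use spaCy to convert each news article's text into a dense vector, also known as an embedding, using the `en_core_web_sm` model. For each document, `doc.vector` averages the per-token vectors into a single vector that represents the entire text. Note that spaCy's small (`sm`) packages do not ship with static word vectors; in that case `doc.vector` falls back to the mean of the pipeline's context-sensitive token tensors. The medium and large packages (`en_core_web_md`, `en_core_web_lg`) include pre-trained word vectors.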

In [None]:
# Load the small English spaCy model
nlp = spacy.load('en_core_web_sm')

# Create a new column 'vector' with one fixed-length vector per article.
# nlp.pipe processes the texts in batches, which is faster than calling nlp() row by row.
# Doc.vector is a single vector for the whole document (a mean over its tokens).
df['vector'] = [doc.vector for doc in nlp.pipe(df['Text'])]
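
To make the averaging step concrete, the sketch below mean-pools a few made-up token vectors with NumPy. The three-dimensional vectors and the `token_vectors` name are invented for illustration; real spaCy vectors have many more dimensions, and this is not spaCy's internal code.

```python
import numpy as np

# Made-up 3-dimensional vectors for the tokens of a short headline.
token_vectors = np.array([
    [1.0, 0.0, 2.0],   # "court"
    [3.0, 2.0, 0.0],   # "blocks"
    [2.0, 4.0, 1.0],   # "ruling"
])

# Mean pooling: average over the token axis to get one vector per document.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector.shape)
print(doc_vector)
```

In this kind of pooling the order of the rows does not affect the result, so word order is discarded.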

## 4. Model Training and Evaluation

With our data prepared, we can now split it into training and testing sets and train our models.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,      # Features (our vectors)
    df.label_num,          # Target
    test_size=0.2,         # 20% of data will be for testing
    random_state=6,        # For reproducibility
    stratify=df.label_num  # Ensure balanced classes in train/test sets
)

# Reshape the data for scikit-learn
# np.stack converts the series of arrays into a single 2D numpy array
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)
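
The reason `np.stack` is needed: a DataFrame column of vectors comes out as a 1-D object array whose elements are themselves arrays, and scikit-learn expects a 2-D numeric feature matrix. A small sketch with made-up 4-dimensional vectors (the names `column` and `matrix` are illustrative):

```python
import numpy as np

# Build a 1-D object array holding three separate vectors, mimicking
# what a column of per-document vectors looks like (values are made up).
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.arange(4, dtype=float) + i

print(column.shape)        # one entry per document

# np.stack joins the per-document vectors along a new first axis.
matrix = np.stack(column)
print(matrix.shape)        # (n_documents, n_dimensions)
```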

Now, let's train our four models and evaluate their performance.

In [None]:
# --- Model 1: K-Nearest Neighbors (KNN) ---
# KNN is a distance-based classifier that works well with vector embeddings.
print("--- Training K-Nearest Neighbors ---")
model_knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
model_knn.fit(X_train_2d, y_train)
y_pred_knn = model_knn.predict(X_test_2d)
print("\nClassification Report (KNN):")
print(classification_report(y_test, y_pred_knn))
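
For intuition about what this classifier does at prediction time, here is a from-scratch sketch of k-nearest-neighbour voting with Euclidean distance. It is not scikit-learn's implementation; the function name `knn_predict` and the toy points are assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest
    training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy 2-D "embeddings": class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]))
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.9, 5.0]))
print(pred_near_origin, pred_near_five)
```

Each query point takes the label of the cluster it sits in, which is why KNN does well when same-class vectors lie close together.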

In [None]:
# --- Model 2: Logistic Regression ---
# A strong and reliable baseline for binary classification.
print("\n--- Training Logistic Regression ---")
model_logreg = LogisticRegression(max_iter=1000)
model_logreg.fit(X_train_2d, y_train)
y_pred_logreg = model_logreg.predict(X_test_2d)
print("\nClassification Report (Logistic Regression):")
print(classification_report(y_test, y_pred_logreg))

In [None]:
# --- Model 3: Support Vector Machine (SVM) ---
# A linear-kernel SVM is a common choice for dense embedding features like ours.
print("\n--- Training Support Vector Machine ---")
model_svc = SVC(kernel='linear')
model_svc.fit(X_train_2d, y_train)
y_pred_svc = model_svc.predict(X_test_2d)
print("\nClassification Report (Support Vector Machine):")
print(classification_report(y_test, y_pred_svc))

In [None]:
# --- Model 4: Gaussian Naive Bayes ---
# GaussianNB models each continuous feature (here, each vector dimension) with a per-class Gaussian.
# MultinomialNB expects non-negative counts such as word frequencies, not dense vectors.
print("\n--- Training Gaussian Naive Bayes ---")
model_gnb = GaussianNB()
model_gnb.fit(X_train_2d, y_train)
y_pred_gnb = model_gnb.predict(X_test_2d)
print("\nClassification Report (Gaussian Naive Bayes):")
print(classification_report(y_test, y_pred_gnb))
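
The point about `MultinomialNB` can be checked directly: it validates that features are non-negative, and dense embeddings generally contain negative components. A small sketch on random made-up data (not the news vectors); the names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)

# Made-up dense features centred on zero, so many values are negative.
X_demo = rng.normal(size=(20, 5))
y_demo = np.array([0, 1] * 10)

# GaussianNB fits a per-class mean and variance for each feature.
GaussianNB().fit(X_demo, y_demo)
gaussian_ok = True

# MultinomialNB rejects negative inputs with a ValueError.
try:
    MultinomialNB().fit(X_demo, y_demo)
    multinomial_ok = True
except ValueError as err:
    multinomial_ok = False
    print(f"MultinomialNB failed: {err}")
```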

## 5. Conclusion and Summary

Let's summarize the performance of all four models to determine the best one for our fake news classification task.

| Model | Accuracy |
| :--- | :---: |
| **K-Nearest Neighbors** | **98%** |
| **Logistic Regression** | **95%** |
| **Support Vector Machine** | **95%** |
| **Gaussian Naive Bayes** | **94%** |

### Analysis
All four models performed well, with accuracies of 94% or higher. **K-Nearest Neighbors (KNN) scored highest at 98% accuracy**, three points ahead of Logistic Regression and the SVM.

The high performance across the board suggests that spaCy's document vectors separate real and fake articles well in this dataset. KNN's lead suggests that articles of the same class lie close together in the vector space, which favors a distance-based algorithm. Because these scores come from a single 80/20 train/test split, cross-validation would be needed to confirm the ranking.
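
The notebook cells print full classification reports; a single accuracy figure per model, as in the summary table, is simply the fraction of test predictions that match the true labels. A minimal sketch with made-up labels (scikit-learn's `accuracy_score` computes the same quantity):

```python
import numpy as np

# Made-up true labels and predictions for ten test articles.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

# Accuracy = correct predictions / total predictions.
accuracy = float(np.mean(y_true == y_pred))
print(f"Accuracy: {accuracy:.0%}")
```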

### Future Work
- **Hyperparameter Tuning:** We could use `GridSearchCV` to find the optimal `n_neighbors` for KNN or the `C` parameter for SVM to potentially improve scores further.
- **Advanced Embeddings:** Experiment with larger spaCy models such as `en_core_web_lg`, which ships with static word vectors, or with other embedding techniques like BERT for more nuanced text representations.
- **Deep Learning:** A simple feed-forward network could be trained on these document vectors, or an LSTM on the per-token vector sequences, to see whether it captures more complex patterns.
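
As a rough sketch of the hyperparameter tuning idea, the snippet below runs `GridSearchCV` over `n_neighbors` for KNN. It uses synthetic data from `make_classification` as a stand-in for the document vectors, and the grid values are an arbitrary example, so the selected value says nothing about the news dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the document vectors (not the news data).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Candidate neighbour counts; this grid is an arbitrary example.
param_grid = {"n_neighbors": [3, 5, 7, 9]}

search = GridSearchCV(
    KNeighborsClassifier(metric="euclidean"),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print("Best n_neighbors:", search.best_params_["n_neighbors"])
print(f"Mean CV accuracy: {search.best_score_:.3f}")
```

In the notebook, the same search would be fitted on `X_train_2d` and `y_train` only, keeping the test set for the final evaluation.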