# EE 344 — Assignment 4: Fake News Classification

In this assignment, you will classify news articles as **fake vs real** using **text features**.
Your tasks for this assignment are as follows:

1. Learn how to extract text features by vectorizing textual inputs using **CountVectorizer (Bag-of-Words)**.
2. Implement **7 classifiers**: Logistic Regression, Perceptron, Linear SVM (LinearSVC), Multinomial Naive Bayes, KNN, Decision Tree, and Random Forest.
3. Evaluate **train and test** performance using **accuracy, precision, recall, and F1-score**.
4. Provide brief answers to discussion questions about (i) the text feature extraction method you implemented and (ii) the effect of using two different KNN distance choices (**Euclidean vs cosine**).


## Submission guidelines
- Complete all **[TODO]** blocks in this notebook.
- Push the finished notebook to your GitHub repository.
- Submit the GitHub link on the Canvas submission page.


**Dataset source (for reference only):**  
Do **not** download data from the link below. Use the provided `evaluation.csv` file that comes with this assignment.
#### https://www.kaggle.com/datasets/aadyasingh55/fake-news-classification



## Setup
Run the next cell to import libraries and define helper functions.


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Reproducibility
RANDOM_STATE = 42

def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1)."""
    acc = accuracy_score(y_true, y_pred)

    # Use 'binary' for binary classification, otherwise fallback to macro.
    avg = "binary" if len(np.unique(y_true)) == 2 else "macro"

    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    return acc, prec, rec, f1


## Load data

Put the dataset file in the same folder as this notebook (recommended), or provide an absolute path.

This dataset uses **semicolon-separated** fields and can contain extra semicolons inside the text.
So we use a custom loader that safely reconstructs the text column.


In [2]:
# === Data path ===
DATA_PATH = "evaluation.csv"

def load_semicolon_dataset(path):
    """
    Handles lines like:
    ;title;text;label
    0;some title;some text that may contain ; ; ; ;0
    """
    rows = []
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        _ = f.readline()  # header
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split(",")
            if len(parts) < 4:
                continue

            idx = parts[0]
            title = parts[1]
            label = parts[-1]
            text = ";".join(parts[2:-1])  # re-join any extra ';' inside text
            rows.append((idx, title, text, label))

    df = pd.DataFrame(rows, columns=["id", "title", "text", "label"])
    df["label"] = pd.to_numeric(df["label"], errors="coerce")
    df = df.dropna(subset=["label"]).reset_index(drop=True)
    df["label"] = df["label"].astype(int)
    return df

df = load_semicolon_dataset(DATA_PATH)
print("Dataset:", df.shape)
print("Label distribution:\n", df["label"].value_counts())

# Combine title + text into one string per document
docs = (df["title"].fillna("") + " " + df["text"].fillna("")).astype(str).tolist()
y = df["label"].values


Dataset: (7815, 4)
Label distribution:
 label
1    4185
0    3630
Name: count, dtype: int64


## Train/test split

We keep a standard **80/20** split with stratification (preserves label ratio).


In [3]:
X_train_text, X_test_text, y_train, y_test = train_test_split(
    docs, y,
    test_size=0.20,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train docs:", len(X_train_text), "Test docs:", len(X_test_text))


Train docs: 6252 Test docs: 1563


## Case Study: Bag-of-Words Features (CountVectorizer)

We need to convert text into numeric features before we can train ML models.

**CountVectorizer** builds a vocabulary from the **training set** and represents each document as a vector of **counts** (one entry per vocabulary term).

We will use:
$$
\texttt{CountVectorizer(}
\texttt{lowercase=True, stopwords="english", ngramrange=(1,2),}
$$
$$
\texttt{ mindf=2, maxdf=0.9, maxfeatures=10000)}
$$

**What each setting means (briefly):**
- `lowercase=True`: convert text to lowercase before building features.
- `stop_words="english"`: remove a predefined list of common English words.
- `ngram_range=(1,2)`: allow 1-word features and 2-word features (bigrams).
- `min_df=2`: keep a term only if it appears in at least 2 training documents.
- `max_df=0.9`: drop a term if it appears in more than 90% of training documents.
- `max_features=10000`: cap the vocabulary size at 10,000 terms (after filtering).

### Tiny example (just to see what it does)

We will build features from 3 short documents and look at the counts.


In [4]:
# CountVectorizer docs (read this once before TODO 1):
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [5]:
toy_docs = [
    "The FAKE news spreads fast",
    "Fake news spreads",
    "Real news spreads",
]

toy_vec = CountVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=2
)

toy_X = toy_vec.fit_transform(toy_docs)

print("Toy vocab:", list(toy_vec.get_feature_names_out()))
print("Toy counts (rows = docs):\n", toy_X.toarray())


Toy vocab: ['fake', 'fake news', 'news', 'news spreads', 'spreads']
Toy counts (rows = docs):
 [[1 1 1 1 1]
 [1 1 1 1 1]
 [0 0 1 1 1]]


## Build Bag-of-Words features

Goal:
1. Create the `CountVectorizer` using the exact settings below.
2. Fit on the training text only.
3. Transform both train and test text into sparse Bag-of-Words features.

Notes:
- `fit_transform` on train, then `transform` on test.
- The output is a **sparse matrix** (CSR). That is normal for text features.


In [6]:
# --- Bag-of-Words settings ---
MAX_FEATURES = 10000
NGRAM_RANGE = (1, 2)

## [ TODO 1 ]
# 1) Create `vectorizer` using CountVectorizer with:
#    lowercase=True
#    stop_words="english"
#    ngram_range=NGRAM_RANGE
#    min_df=2
#    max_df=0.9
#    max_features=MAX_FEATURES
#
# 2) Fit the vectorizer on the training text, then use it to transform:
#    - the training text into BoW features
#    - the test text into BoW features
#
# (Reminder: fit on train only; do NOT fit on test.)
#
# Print the BoW shapes.
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=NGRAM_RANGE,
    min_df=2,
    max_df=0.9,
    max_features=MAX_FEATURES
)
X_train_bow = vectorizer.fit_transform(X_train_text)
X_test_bow = vectorizer.fit_transform(X_test_text)

print("BoW shapes:", X_train_bow.shape, X_test_bow.shape)


BoW shapes: (6252, 10000) (1563, 10000)


## Models

Create **7 classifiers** using the exact hyperparameters below.

**Important:** For KNN in this notebook, start with **Euclidean distance**.

Models to implement:
- Logistic Regression: `solver="saga"`, `max_iter=2000`, `n_jobs=-1`, `random_state=42`
- Perceptron: `max_iter=1000`, `tol=1e-3`, `random_state=42`
- SVM (LinearSVC): `random_state=42`
- Naive Bayes (MultinomialNB): `alpha=1.0`
- KNN (Euclidean): `n_neighbors=7`, `metric="euclidean"`, `n_jobs=-1`
- Decision Tree: `max_depth=40`, `random_state=42`
- Random Forest: `n_estimators=300`, `random_state=42`, `n_jobs=-1`

Put them in a dictionary named `models`.


In [7]:
## [ TODO 2 ]
# Build the `models` dictionary using the exact parameters above.
def weighted_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return acc, prec, rec, f1

models = {
    "Logistic Regression": LogisticRegression(solver="saga", max_iter=2000, n_jobs=-1, random_state=42),
    "Perceptron": Perceptron(max_iter=1000, tol=1e-3, random_state=42),
    "SVM (LinearSVC)": LinearSVC(random_state=42),
    "Naive Bayes (MultinomialNB)": MultinomialNB(alpha=1.0),
    "KNN (euclidian)": KNeighborsClassifier(n_neighbors=7, metric="euclidean", n_jobs=-1),
    "Decision Tree": DecisionTreeClassifier(random_state=42, max_depth=40),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1),
}

models

{'Logistic Regression': LogisticRegression(max_iter=2000, n_jobs=-1, random_state=42, solver='saga'),
 'Perceptron': Perceptron(random_state=42),
 'SVM (LinearSVC)': LinearSVC(random_state=42),
 'Naive Bayes (MultinomialNB)': MultinomialNB(),
 'KNN (euclidian)': KNeighborsClassifier(metric='euclidean', n_jobs=-1, n_neighbors=7),
 'Decision Tree': DecisionTreeClassifier(max_depth=40, random_state=42),
 'Random Forest': RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)}

## Train + evaluate

We will evaluate each model on:
- **Training set**
- **Test set**

Metrics:
- Accuracy
- Precision
- Recall
- F1

We will print a table sorted by **Test F1**.


In [8]:
## [ TODO 3 ]
# Write a loop that:
# 1) fits each model on X_train_bow, y_train
# 2) predicts on train and test
# 3) computes (acc, prec, rec, f1) using metrics(...)
# 4) stores results in a list
# 5) prints a DataFrame sorted by Test F1 (descending)
#
# Use the exact column names below.

results = []
best = {"name": None, "pipe": None, "test_f1": -1.0}

for name, clf in models.items():
    pipe = Pipeline(steps=[
        ("Vectorizer", vectorizer),
        ("Classifier", clf),
    ])

    pipe.fit(X_train_text, y_train)

    yhat_tr = pipe.predict(X_train_text)
    yhat_te = pipe.predict(X_test_text)

    tr = weighted_metrics(y_train, yhat_tr)
    te = weighted_metrics(y_test, yhat_te)

    results.append([name, *tr, *te])

    if te[3] > best["test_f1"]:
        best["name"] = name
        best["pipe"] = pipe
        best["test_f1"] = te[3]

cols = [
    "Model",
    "Train Acc", "Train Prec", "Train Rec", "Train F1",
    "Test Acc", "Test Prec", "Test Rec", "Test F1",
]

out = pd.DataFrame(results, columns=cols).sort_values("Test F1", ascending=False).reset_index(drop=True)

pd.set_option("display.max_colwidth", 80)
print("\n=== Results (sorted by Test F1) ===")
print(out.to_string(index=False, formatters={
    "Train Acc": "{:.4f}".format,
    "Train Prec": "{:.4f}".format,
    "Train Rec": "{:.4f}".format,
    "Train F1": "{:.4f}".format,
    "Test Acc": "{:.4f}".format,
    "Test Prec": "{:.4f}".format,
    "Test Rec": "{:.4f}".format,
    "Test F1": "{:.4f}".format,
}))





=== Results (sorted by Test F1) ===
                      Model Train Acc Train Prec Train Rec Train F1 Test Acc Test Prec Test Rec Test F1
              Random Forest    1.0000     1.0000    1.0000   1.0000   0.9949    0.9949   0.9949  0.9949
              Decision Tree    1.0000     1.0000    1.0000   1.0000   0.9904    0.9904   0.9904  0.9904
        Logistic Regression    0.9989     0.9989    0.9989   0.9989   0.9872    0.9872   0.9872  0.9872
            SVM (LinearSVC)    1.0000     1.0000    1.0000   1.0000   0.9853    0.9853   0.9853  0.9853
                 Perceptron    1.0000     1.0000    1.0000   1.0000   0.9808    0.9808   0.9808  0.9808
Naive Bayes (MultinomialNB)    0.9624     0.9625    0.9624   0.9624   0.9610    0.9610   0.9610  0.9610
            KNN (euclidian)    0.7844     0.8071    0.7844   0.7829   0.7364    0.7666   0.7364  0.7330


## Cosine distance for KNN

With Bag-of-Words, each document becomes a long vector of word counts (mostly zeros).  
To compare two documents, we need a way to measure how “close” two vectors are.

Two common choices:

- **Euclidean distance**: straight-line distance between two vectors.
- **Cosine distance**: based on the angle between two vectors (uses cosine similarity under the hood).

In scikit-learn, KNN uses a **distance**. Cosine distance is:
$$
d_{\text{cosine}}(x, z) \;=\; 1 - \cos(x, z)
\;=\; 1 - \frac{x^\top z}{\|x\|_2 \,\|z\|_2}
$$

(where $\cos(x,z)$ is cosine similarity).

### Tiny numeric example (no text, just vectors)

Let:
- $x = [1, 1]$
- $z_1 = [2, 2]$  (same direction as $x$, just “bigger”)
- $z_2 = [2, 0]$  (different direction)

**Euclidean distances**
$$
\|x - z_1\|_2 = \sqrt{(1-2)^2 + (1-2)^2} = \sqrt{2}
$$
$$
\|x - z_2\|_2 = \sqrt{(1-2)^2 + (1-0)^2} = \sqrt{2}
$$
So Euclidean says $z_1$ and $z_2$ are equally far from $x$ here.

**Cosine distances**
$$
\cos(x, z_1) = \frac{1\cdot 2 + 1\cdot 2}{\sqrt{2}\cdot \sqrt{8}} = 1
\Rightarrow d_{\text{cosine}}(x, z_1)=0
$$
$$
\cos(x, z_2) = \frac{1\cdot 2 + 1\cdot 0}{\sqrt{2}\cdot 2} \approx 0.707
\Rightarrow d_{\text{cosine}}(x, z_2)\approx 0.293
$$
So cosine says $z_1$ is closer to $x$ than $z_2$.

### What you will do

Keep everything the same, but change your KNN metric from `"euclidean"` to `"cosine"`, then re-run your evaluation and compare results.


In [9]:
# Tip: For cosine distance, brute-force search is commonly used.
# Example (do not run until TODO 2/3 are done):
#
# knn_cos = KNeighborsClassifier(
#     n_neighbors=7,
#     metric="cosine",
#     algorithm="brute",
#     n_jobs=-1
# )
models = {
    "KNN (euclidian)": KNeighborsClassifier(n_neighbors=7, metric="euclidean", n_jobs=-1),
    "KNN (cos)": KNeighborsClassifier(n_neighbors=7, metric="cosine", algorithm="brute", n_jobs=-1)
}
results = []
best = {"name": None, "pipe": None, "test_f1": -1.0}

for name, clf in models.items():
    pipe = Pipeline(steps=[
        ("Vectorizer", vectorizer),
        ("Classifier", clf),
    ])

    pipe.fit(X_train_text, y_train)

    yhat_tr = pipe.predict(X_train_text)
    yhat_te = pipe.predict(X_test_text)

    tr = weighted_metrics(y_train, yhat_tr)
    te = weighted_metrics(y_test, yhat_te)

    results.append([name, *tr, *te])

    if te[3] > best["test_f1"]:
        best["name"] = name
        best["pipe"] = pipe
        best["test_f1"] = te[3]

cols = [
    "Model",
    "Train Acc", "Train Prec", "Train Rec", "Train F1",
    "Test Acc", "Test Prec", "Test Rec", "Test F1",
]

out = pd.DataFrame(results, columns=cols).sort_values("Test F1", ascending=False).reset_index(drop=True)

pd.set_option("display.max_colwidth", 80)
print("\n=== Results (sorted by Test F1) ===")
print(out.to_string(index=False, formatters={
    "Train Acc": "{:.4f}".format,
    "Train Prec": "{:.4f}".format,
    "Train Rec": "{:.4f}".format,
    "Train F1": "{:.4f}".format,
    "Test Acc": "{:.4f}".format,
    "Test Prec": "{:.4f}".format,
    "Test Rec": "{:.4f}".format,
    "Test F1": "{:.4f}".format,
}))


=== Results (sorted by Test F1) ===
          Model Train Acc Train Prec Train Rec Train F1 Test Acc Test Prec Test Rec Test F1
      KNN (cos)    0.9128     0.9198    0.9128   0.9120   0.8791    0.8913   0.8791  0.8772
KNN (euclidian)    0.7844     0.8071    0.7844   0.7829   0.7364    0.7666   0.7364  0.7330


## Discussion questions (answer in your own words)

Write short answers below (2–5 sentences each is enough).

### Question A
In your own words, what is the added value of allowing 2-word sequences (bigrams) in `ngram_range`?

### Question B
In your own words, why might someone choose to set both `min_df` and `max_df` when building the vocabulary?

### Question C

After you run KNN with **Euclidean** and then with **Cosine** distance:

- Do you observe any difference in results?
- If yes, why do you think the difference happens (your intuition)?

**Your answers:**

- **A:**  
  *Allowing 2-word sequences can push our models to capture recurring 2-word sequences, which may be helpful for classifying
a news article as fake news or real news. If fake news tend to have go-to/popular bigrams, our model will be able to pick
this up and use this information to classify. If the bigram has significant weight, it may be enough to render news
articles as fake! Essentially, allowing 2-word sequences enables our models to capture bigram trends in fake-news, which may be a significant factor when classifying news.*

- **B:**  
  *The parameters min_df and max_df are great tools for filtering words that are "outliers". In our data, we want to exclude words that are essential for effective English communication (such as: the, he, she, it, too, etc), since these words don't reveal information about the authenticity of news, rather assist in the articulation of ideas. For this reason, max_df will filter out words that are very common. Similarly, we want to exclude words that occur very few times since their scarcity will pose a challenge for the model to generalize from them. Min_df will ensure that the model does not overfit and limit the feature space for these models.*

- **C:**  
  *There is a significant difference in the evalutation metrics of both KNN models. The cosine model reported better accuracy across all evaluation metrics, suggesting a better fit for the data than the euclidian distance model. I believe that the cosine model was more successful in capturing the trends in the data and generalizing them because of the strucutre of the text vectors. For two text vectors with a huge difference in size, the euclidian model will infer a huge distance between them, whereas the cosine model will look past the size differences and focus the comparison on word usage patterns. Since we are more concerned with word trends and not size of documents, the cosine model is the more effective tool for comparing two text vectors, leading to better results for classification.*
