### Imports and Setup

Loads the libraries needed for text vectorization, the SVM classifier, hyperparameter tuning, and evaluation.

In [5]:
import pandas as pd 
import numpy as np 
import re 
import string 

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from sklearn.metrics import (
    classification_report, 
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score
)

from sklearn.pipeline import Pipeline

import nltk
from nltk.corpus import stopwords 

nltk.download("stopwords")

pd.set_option("display.max_colwidth", 300)
sns.set_style("whitegrid")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\himu7\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load Dataset

Loads the same dataset explored in EDA.

This guarantees:
* No hidden data differences
* No pipeline mismatch

In [7]:
df = pd.read_csv("../data/combined_news.csv")

print(df.shape)
df.head(2)

(44898, 10)


Unnamed: 0,title,text,subject,date,label,text_length,title_length,date_parsed,year_month,content
0,"BREAKING: GOP Chairman Grassley Has Had Enough, DEMANDS Trump Jr. Testimony","Donald Trump s White House is in chaos, and they are trying to cover it up. Their Russia problems are mounting by the hour, and they refuse to acknowledge that there are problems surrounding all of this. To them, it s fake news, or a hoax. However, the facts bear things out differently, and ...",News,"July 21, 2017",0,2114,76,2017-07-21,2017-07,"BREAKING: GOP Chairman Grassley Has Had Enough, DEMANDS Trump Jr. Testimony Donald Trump s White House is in chaos, and they are trying to cover it up. Their Russia problems are mounting by the hour, and they refuse to acknowledge that there are problems surrounding all of this. To them, it s ..."
1,Failed GOP Candidates Remembered In Hilarious Mocking Eulogies (VIDEO),"Now that Donald Trump is the presumptive GOP nominee, it s time to remember all those other candidates who tried so hard to beat him in the race to the White House. After all, how can we forget all the missteps, gaffes, weirdness, and sheer idiocies of such candidates as Jeb Bush, Marco Rubio, J...",News,"May 7, 2016",0,2823,71,2016-05-07,2016-05,"Failed GOP Candidates Remembered In Hilarious Mocking Eulogies (VIDEO) Now that Donald Trump is the presumptive GOP nominee, it s time to remember all those other candidates who tried so hard to beat him in the race to the White House. After all, how can we forget all the missteps, gaffes, weir..."


### Use Chosen Preprocessing

Implements the winning preprocessing variant:
  * Stopwords and punctuation removed

This matches what was empirically discovered earlier.

In [8]:
stop_words = set(stopwords.words("english"))

def clean_no_stopwords(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"<[^>]+>", "", text)   # drop HTML tags, one tag per match
    # strip punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    return " ".join(w for w in words if w not in stop_words)
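
As a quick sanity check on the tag-stripping step, the sketch below compares two candidate patterns on a made-up string that contains more than one tag. The sample string and variable names are illustrative assumptions, not rows from the dataset.

```python
import re

sample = "<b>breaking</b> news about <i>taxes</i> today"  # hypothetical input

# Greedy: ".*" runs from the first "<" to the last ">" in the string,
# so the text between the tags is deleted as well.
greedy = re.sub(r"<.*>", "", sample)

# Per-tag: "[^>]+" stops at the first ">", so only the tags are removed.
per_tag = re.sub(r"<[^>]+>", "", sample)

print(repr(greedy))
print(repr(per_tag))
```

If article text contains several tags on one line, the per-tag pattern is the one that keeps the words between them.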

### Train/Test Split

Freezes the train/test split before any tuning happens.

This is critical:
* Hyperparameter tuning must NEVER see test data.

In [10]:
from sklearn.model_selection import train_test_split

X=df['content']
y=df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X,y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train: ", X_train.shape)
print("Test: ", X_test.shape)

Train:  (35918,)
Test:  (8980,)
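
To see what `stratify=y` buys, here is a minimal sketch on toy data. The 60/40 class ratio and the placeholder texts are assumptions for illustration only; they are not the proportions of `combined_news.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df["label"] and df["content"] (made-up values).
y = pd.Series([0] * 60 + [1] * 40, name="label")
X = pd.Series([f"article {i}" for i in range(len(y))], name="content")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, each split keeps approximately the same class ratio.
print(y.value_counts(normalize=True).round(2).to_dict())
print(y_train.value_counts(normalize=True).round(2).to_dict())
print(y_test.value_counts(normalize=True).round(2).to_dict())

# The two index sets are disjoint, so no test row also appears in training.
overlap = set(X_train.index) & set(X_test.index)
print(len(overlap))
```

Because `random_state` is fixed, rerunning the cell reproduces the same split, which is what keeps the test set frozen across experiments.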


### Build Pipeline (Preprocessing + TF-IDF + Model)
Builds a Pipeline that ensures:
> Whatever happens in training happens exactly the same in testing.

No leakage. No mismatches.

In [11]:
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        preprocessor=clean_no_stopwords,
        ngram_range=(1,2),
        min_df=5,
        max_df=0.95,
        sublinear_tf=True
    )),
    # Pipeline steps must be estimator instances, not classes
    ("svm", LinearSVC())
])

### Hyperparameter Grid
Defines what we are tuning:
the SVM's C (inverse regularization strength: smaller values regularize more strongly).

This controls:
* Overfitting vs. generalization

In [12]:
param_grid = {
    "svm__C": [0.01,0.1,1,10]
}
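
The `svm__C` key follows scikit-learn's `<step name>__<parameter>` convention for pipelines. The sketch below uses a stripped-down pipeline (default `TfidfVectorizer`, which is an assumption made to keep the example short) to show where that name comes from.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])

# Every tunable parameter is exposed as "<step name>__<parameter name>".
svm_keys = sorted(k for k in pipe.get_params() if k.startswith("svm__"))
print(svm_keys)

# set_params accepts the same names; a grid entry is applied the same way.
pipe.set_params(svm__C=0.1)
print(pipe.named_steps["svm"].C)
```

A key that does not match a step name and parameter raises an error when the search is fitted, so printing `get_params()` is a cheap way to catch typos in the grid.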

### Cross-Validation
Runs 5-fold cross-validation on the training set.

This prevents:
* Lucky splits
* Overfitting to one fold
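
A minimal sketch of this step, assuming `GridSearchCV` with `StratifiedKFold` and binary `"f1"` scoring. The toy sentences stand in for `X_train` / `y_train`, and the simplified vectorizer settings are assumptions chosen so the example runs on 40 short documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for X_train / y_train (made-up sentences, two classes).
X_toy = [f"shocking secret exposed in leaked memo {i}" for i in range(20)] + [
    f"officials said the report was published on tuesday {i}" for i in range(20)
]
y_toy = [0] * 20 + [1] * 20

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("svm", LinearSVC()),
])
param_grid = {"svm__C": [0.01, 0.1, 1, 10]}

# 5 stratified folds: every fold keeps the class ratio of the training data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each candidate C is fitted 5 times, once per held-out fold.
grid = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
grid.fit(X_toy, y_toy)

# One mean score per candidate C, averaged over the 5 folds.
print(grid.cv_results_["mean_test_score"])
```

Because the vectorizer lives inside the pipeline, TF-IDF vocabulary and IDF weights are refitted on the training part of every fold, so the held-out fold never influences the features.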

### Select Best C

Selects the best C based on average F1 across folds.

This is the true model selection step.
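
One way to inspect the selection, sketched on toy data: after fitting, a `GridSearchCV` object exposes per-candidate fold averages in `cv_results_` and the winner in `best_params_`. The corpus, the `"f1"` scorer, and the simplified pipeline below are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the training split (made-up sentences).
X_toy = [f"shocking secret exposed in leaked memo {i}" for i in range(20)] + [
    f"officials said the report was published on tuesday {i}" for i in range(20)
]
y_toy = [0] * 20 + [1] * 20

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
grid = GridSearchCV(pipeline, {"svm__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5)
grid.fit(X_toy, y_toy)

# Mean and spread of the fold scores for each candidate, best rank first.
results = pd.DataFrame(grid.cv_results_)[
    ["param_svm__C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)

print(grid.best_params_, round(grid.best_score_, 4))

# With the default refit=True, the winner is refitted on all training data.
best_model = grid.best_estimator_
```

Looking at `std_test_score` next to the mean helps judge whether the gap between two values of C is larger than the fold-to-fold noise.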

### Evaluate on Test Set

Evaluates the tuned model on unseen test data. 

This is the only number you should report.
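
A sketch of the evaluation step on toy data. The fixed `C=1` pipeline is a stand-in for whichever estimator cross-validation selected, and the sentences and split are assumptions made so the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (made-up sentences, two classes).
texts = [f"shocking secret exposed in leaked memo {i}" for i in range(20)] + [
    f"officials said the report was published on tuesday {i}" for i in range(20)
]
labels = [0] * 20 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Stand-in for the tuned model; fitted on training data only.
model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC(C=1))])
model.fit(X_tr, y_tr)

# The test split is touched exactly once, here.
y_pred = model.predict(X_te)
print(classification_report(y_te, y_pred, digits=4))

test_f1 = f1_score(y_te, y_pred)
print(round(test_f1, 4))
```

If the test score prompts another round of tuning, the test set has effectively become a validation set, and the reported number stops being an unbiased estimate.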

### Confusion Matrix

Shows how errors are distributed:
* Fake predicted as Real?
* Real predicted as Fake?

This shows which class the model is reluctant to predict.
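
The two error directions can be read straight off a confusion matrix. The sketch below uses hand-written predictions and assumes the encoding 0 = fake, 1 = real; check that against the notebook's labels before reusing the row and column names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration (assumed encoding: 0 = fake, 1 = real).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["true: fake", "true: real"],
    columns=["pred: fake", "pred: real"],
)
print(cm_df)

fake_as_real = cm[0, 1]  # fake articles labelled real
real_as_fake = cm[1, 0]  # real articles labelled fake
print(fake_as_real, real_as_fake)
```

In this toy case the model makes twice as many real-as-fake errors as fake-as-real ones, which is the kind of asymmetry this step is meant to surface.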

### Error Analysis

Read actual misclassified articles.

This is where the following become visible:

* Dataset bias
* Ambiguity
* Label noise
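
One way to pull out the misclassified rows for reading, sketched with made-up texts and predictions. The column names `true_label` and `predicted` are hypothetical; the index values mimic rows kept from the original DataFrame so errors can be traced back to it.

```python
import pandas as pd

# Made-up test rows; the index imitates row ids from the original DataFrame.
texts = pd.Series(
    [
        "you will not believe what happened next",
        "senate committee schedules hearing for thursday",
        "secret memo reveals shocking plot",
        "central bank leaves interest rates unchanged",
        "minister resigns after budget vote",
    ],
    index=[101, 102, 103, 104, 105],
)
y_true = pd.Series([0, 1, 0, 1, 1], index=texts.index)
y_pred = [0, 0, 0, 1, 0]  # stand-in for model.predict(X_test)

review = pd.DataFrame(
    {"content": texts, "true_label": y_true, "predicted": y_pred}
)

# Keep only the rows where the prediction disagrees with the label.
errors = review[review["true_label"] != review["predicted"]]
print(errors)
```

Reading a sample of these rows in full, rather than only counting them, is what distinguishes genuine model mistakes from questionable labels.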

### Save Model

Saves a fully reproducible model.

We can now:
* Deploy it
* Load it
* Use it for prediction
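
A minimal persistence sketch using `joblib`, which is one common choice for scikit-learn models and an assumption here. The file name `svm_pipeline.joblib`, the toy corpus, and the simplified pipeline are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (made-up sentences, two classes).
texts = [f"shocking secret exposed in leaked memo {i}" for i in range(20)] + [
    f"officials said the report was published on tuesday {i}" for i in range(20)
]
labels = [0] * 20 + [1] * 20

model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svm_pipeline.joblib")  # hypothetical name
    joblib.dump(model, model_path)    # vectorizer and classifier in one file
    loaded = joblib.load(model_path)

# The loaded pipeline accepts raw text, exactly like the original.
new_docs = [
    "shocking secret exposed in leaked memo",
    "officials said the report was published",
]
print(loaded.predict(new_docs))

same = list(loaded.predict(texts)) == list(model.predict(texts))
print(same)
```

One caveat for the real pipeline: a custom `preprocessor` function such as `clean_no_stopwords` is pickled by reference, so it has to be importable under the same name wherever the model is loaded. Loading with a different scikit-learn version than the one used for saving is also not guaranteed to work.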