1. Selection
The Selection phase in KDD is about deciding which data sources are relevant to the project goal.
For this project, our goal is to detect whether a news article is fake or not.

We are given two datasets: Fake.csv and True.csv.
We load these datasets and take an initial look at them.

In [21]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

In [22]:
# 1. Load the data
df_fake = pd.read_csv('./datasets/Fake.csv')
df_true = pd.read_csv('./datasets/True.csv')
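
Before combining the two files, it can help to confirm that they share the same columns. The following is a minimal sketch of such a check. It uses small in-memory stand-ins for the two CSV files, and both the toy rows and the helper `summarize` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Toy stand-ins for Fake.csv and True.csv (illustrative rows, not real data)
fake_csv = io.StringIO(
    'title,text,subject,date\n'
    'A,alpha,News,"December 31, 2017"\n'
)
true_csv = io.StringIO(
    'title,text,subject,date\n'
    'B,beta,News,"December 30, 2017"\n'
    'C,gamma,News,"December 29, 2017"\n'
)


def summarize(df, name):
    """Hypothetical helper: return basic facts about a DataFrame."""
    return {"name": name, "rows": len(df), "columns": list(df.columns)}


toy_fake = pd.read_csv(fake_csv)
toy_true = pd.read_csv(true_csv)

fake_summary = summarize(toy_fake, "fake")
true_summary = summarize(toy_true, "true")

# The two frames can only be stacked cleanly if their columns line up
same_columns = fake_summary["columns"] == true_summary["columns"]
print(fake_summary)
print(true_summary)
print("Same columns:", same_columns)
```

The same helper could be applied to `df_fake` and `df_true` once they are loaded.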

2. Preprocessing
The goal of Preprocessing is to clean the raw, selected data so that it is suitable for the next stages. Real-world data is often messy, containing missing values, inconsistent formats, and irrelevant information. We must fix this before the model can learn effectively.

This is a classification problem, and the model needs a label for supervised learning. So we add another column that records which kind of article each row is: a label of 1 indicates fake news and 0 indicates true news.
We then combine the two datasets into one.

In [23]:
df_fake['label'] = 1
df_true['label'] = 0

df_combined = pd.concat([df_fake, df_true], ignore_index=True)

In [24]:
df_combined.shape

(44898, 5)

In [25]:
print("--- Combined Data Head (First 5 Rows) ---")
print(df_combined.head())

--- Combined Data Head (First 5 Rows) ---
                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date  label  
0  December 31, 2017      1  
1  December 31, 2017      1  
2  December 30, 2017      1  
3  December 29, 2017      1  
4  December 25, 2017      1  


In [26]:
print("\n--- Combined Data Information ---")
df_combined.info()  # info() prints its report directly and returns None


--- Combined Data Information ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB


We check whether the dataset has any missing values.
The dataset has no null values in any column, so no imputation is needed.

In [27]:
df_combined.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64
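
Note that `isnull()` only counts true nulls. A text field that is empty or contains only whitespace is not null, so it would pass this check. The sketch below shows one way to look for such rows. The toy DataFrame is an assumption for illustration; whether the real dataset contains blank texts is not shown here.

```python
import pandas as pd

# Toy data: no nulls, but two of the three texts are effectively empty
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "text": ["some article body", "   ", ""],
    "label": [1, 1, 0],
})

# isnull() reports zero missing values for every column
null_counts = toy.isnull().sum()

# Strip whitespace and compare with the empty string to find blank texts
blank_text = toy["text"].str.strip().eq("")
n_blank = int(blank_text.sum())

print("Null texts:", int(null_counts["text"]))
print("Blank texts:", n_blank)
```

Running the same `str.strip().eq("")` test on `df_combined['text']` would show whether any rows need to be dropped for this reason.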

Next, we check whether the dataset has duplicates. Duplicates can bias the model toward the repeated examples, increase computational cost, and inflate the evaluation score if copies of the same row land in both the training and test sets.
There are 209 duplicate rows in our dataset. We simply remove them and proceed to the next step.

In [28]:
df_combined.duplicated().sum()

np.int64(209)

In [29]:
df_combined.drop_duplicates(inplace=True)

In [30]:
df_combined.shape

(44689, 5)
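
`duplicated()` with no arguments only flags rows that match in every column. Since the model is later trained on the `text` column, it may also be worth checking for repeated texts, and for texts that appear under both labels. This is a minimal sketch on toy data (the rows are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "text": ["same body", "same body", "same body", "other body"],
    "label": [1, 1, 0, 0],
})

# Rows identical in every column (what the notebook's check counts)
full_dupes = int(toy.duplicated().sum())

# Rows whose text repeats an earlier row, regardless of other columns
text_dupes = int(toy.duplicated(subset=["text"]).sum())

# Texts that occur with more than one distinct label
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print("Full-row duplicates:", full_dupes)
print("Text-only duplicates:", text_dupes)
print("Texts with conflicting labels:", conflicting)
```

Whether to drop on `subset=["text"]` instead of on all columns is a design choice; the notebook drops full-row duplicates only.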

In [33]:
print(df_combined['label'].value_counts())
print(f"Imbalance ratio (Fake:True) = {df_combined['label'].value_counts()[1] / df_combined['label'].value_counts()[0]:.3f} : 1")

label
1    23478
0    21211
Name: count, dtype: int64
Imbalance ratio (Fake:True) = 1.107 : 1


We determine whether the dataset is imbalanced. The fake-to-true ratio of 1.107:1 is fairly balanced, which is a good thing.
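
The same class counts can also be expressed as shares of the whole dataset, which gives a reference point for later evaluation. The sketch below reuses the counts printed above; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported by value_counts() above (1 = fake, 0 = true)
counts = pd.Series({1: 23478, 0: 21211}, name="count")

# Share of each class in the de-duplicated dataset
share = counts / counts.sum()

# Fake-to-true ratio, matching the notebook's calculation
ratio = counts.loc[1] / counts.loc[0]

print(share.round(4))
print(f"Fake:True = {ratio:.3f} : 1")
```

A model that always predicted "fake" would be right on about 52.5% of these rows, so any useful classifier should clearly beat that baseline.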

3. Transformation

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    df_combined['text'],
    df_combined['label'],
    test_size=0.2,
    random_state=42,
    stratify=df_combined['label']
)

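
The `stratify` argument in the split above asks scikit-learn to keep the class proportions the same in the training and test sets. A minimal sketch to verify this on toy data follows; the 60/40 toy labels are an assumption chosen so the expected shares are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 60 fake (1) and 40 true (0) articles
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "label": [1] * 60 + [0] * 40,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

# With 0/1 labels, the mean of the labels is the share of fake articles
train_share = y_tr.mean()
test_share = y_te.mean()

print(f"Fake share in train: {train_share:.2f}")
print(f"Fake share in test:  {test_share:.2f}")
```

The same two-line check on `y_train` and `y_test` confirms that the real split preserves the fake-to-true ratio.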

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

def process_text(text):
    text = re.sub(r'\s+', ' ', str(text)) # Coerce to str first, then collapse extra white space

    text = re.sub(r'\W', ' ', text) # Replace all the special characters with spaces

    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text) # Remove all single characters from text

    text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove any character that isn't alphabetical

    text = text.lower()

    words = word_tokenize(text)

    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]  # Keep the filtered list (was assigned to an unused name)

    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

   