In [27]:
# In this notebook, we'll clean up the dataset so we can use it for training.

# Importing libraries
import pandas as pd
import numpy as np



# Loading the dataset using Pandas
df = pd.read_csv('data/train.csv')
df.head()

#Label 1: unreliable
#Label 0: reliable

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [28]:
# Checking the dataset's dimensions
df.shape  # Rows, columns

(20800, 5)

In [29]:
# Checking for nulls
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64
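
With 20,800 rows, the counts above correspond to roughly 2.7% of titles, 9.4% of authors, and 0.2% of texts. A minimal sketch of how those shares could be computed directly is shown below. The `toy` frame and its values are made up for illustration and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "text": ["x", "y", None, "w"],
    "label": [1, 0, 1, 0],
})

# isnull() yields booleans, so mean() is the fraction of missing values per column.
missing_share = toy.isnull().mean().mul(100).round(2)
print(missing_share)
```

Looking at shares rather than raw counts makes it easier to judge whether dropping rows (as opposed to filling values) would discard a meaningful part of the data.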

In [30]:
# Inspecting the label distribution in percentages
df['label'].value_counts(normalize=True)  # 0 = reliable, 1 = unreliable

label
1    0.500625
0    0.499375
Name: proportion, dtype: float64

In [31]:
# Printing the first 500 characters of one unreliable (label 1, "FAKE") and one reliable (label 0, "REAL") article
print("FAKE:\n", df[df['label'] == 1]['text'].iloc[0][:500])
print("\nREAL:\n", df[df['label'] == 0]['text'].iloc[0][:500])

FAKE:
 House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) 
With apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It t

REAL:
 Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination? [Hillary Clinton remains the big woman on campus in leafy, liberal Wellesley, Massachusetts. Everywhere else votes her most likely to don her inauguration dress for the remainder of her days the way Miss Havisham forever wore that wedding dress.  Speaking of Great Expectations, Hillary Rodham overflowed with them 48 years ago when she first addressed a Welle

In [32]:
# Dropping rows with missing 'text', since they won't be useful for training
df = df.dropna(subset=['text'])

# Filling missing titles with empty string, so we can still use the text
df['title'] = df['title'].fillna('')

# Dropping the 'author' column, since it's not useful for training
df = df.drop(columns=['author'])

# Creating the 'content' column (title + text) now that nulls are handled
df['content'] = df['title'] + ' ' + df['text']
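
The order of the steps above matters: in pandas, concatenating a string with a missing value yields a missing value, so an unfilled title would wipe out the whole `content` entry. The sketch below illustrates this on a hypothetical two-row frame. The trailing `.str.strip()` is an optional addition, not part of the cell above, that removes the leading space left behind by an empty title.

```python
import pandas as pd

# Hypothetical two-row frame: the second row has no title.
toy = pd.DataFrame({
    "title": ["Headline", None],
    "text": ["Body one", "Body two"],
})

# Without filling, the missing title makes the whole concatenation missing.
unfilled = toy["title"] + " " + toy["text"]

# Filling first keeps the text; strip() is an assumed extra step that
# drops the leading space produced by an empty title.
filled = (toy["title"].fillna("") + " " + toy["text"]).str.strip()

print(unfilled.tolist())
print(filled.tolist())
```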

In [33]:
# Checking for nulls after cleaning
df.isnull().sum()

id         0
title      0
text       0
label      0
content    0
dtype: int64

In [34]:
# Import necessary libraries
import re                           # Regular expressions for text cleaning
import nltk                         # Natural Language Toolkit for text processing
from nltk.corpus import stopwords   # Common English stopwords
from nltk.stem import WordNetLemmatizer  # Lemmatizer to reduce words to base form

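# Download the NLTK resources used below (skipped if they are already present)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Load English stopwords into a set for fast lookup
stop_words = set(stopwords.words('english'))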

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    
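    # Delete every character that is not a lowercase letter or whitespace.
    # Digits and punctuation are removed without leaving a space, so words
    # joined by a dash or hyphen are fused (e.g., "week–FBI" → "weekfbi").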
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Split text into individual words (tokens)
    words = text.split()
    
    # Remove common stopwords (e.g., "the", "is", "and")
    words = [word for word in words if word not in stop_words]
    
    # Lemmatize words as nouns, the lemmatizer's default part of speech
    # (e.g., "cars" → "car"); verb forms such as "running" are left unchanged
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the cleaned words back into a single string
    return ' '.join(words)

In [35]:
# Apply clean_text to the 'content' column and store the result in a new 'clean_content' column

df['clean_content'] = df['content'].apply(clean_text)

In [36]:
df.head(100)

# clean_content is lowercased, stripped of digits, punctuation, and stopwords, and lemmatized.

Unnamed: 0,id,title,text,label,content,clean_content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,House Dem Aide: We Didn’t Even See Comey’s Let...,1,House Dem Aide: We Didn’t Even See Comey’s Let...,house dem aide didnt even see comeys letter ja...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Ever get the feeling your life circles the rou...,0,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",flynn hillary clinton big woman campus breitba...
2,2,Why the Truth Might Get You Fired,"Why the Truth Might Get You Fired October 29, ...",1,Why the Truth Might Get You Fired Why the Trut...,truth might get fired truth might get fired oc...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Videos 15 Civilians Killed In Single US Airstr...,1,15 Civilians Killed In Single US Airstrike Hav...,civilian killed single u airstrike identified ...
4,4,Iranian woman jailed for fictional unpublished...,Print \nAn Iranian woman has been sentenced to...,1,Iranian woman jailed for fictional unpublished...,iranian woman jailed fictional unpublished sto...
...,...,...,...,...,...,...
95,95,White House Confirms More Gitmo Transfers Befo...,President Barack Obama will likely release mor...,0,White House Confirms More Gitmo Transfers Befo...,white house confirms gitmo transfer obama leaf...
96,96,The Geometry of Energy and Meditation of Buddha,License DMCA \nA mandala is a visual symbol of...,1,The Geometry of Energy and Meditation of Buddh...,geometry energy meditation buddha license dmca...
97,97,Poll: Most Voters Have Not Heard of Democratic...,There is a minefield of potential 2020 electio...,0,Poll: Most Voters Have Not Heard of Democratic...,poll voter heard democratic election candidate...
98,98,Migrants Confront Judgment Day Over Old Deport...,There are a little more than two weeks between...,0,Migrants Confront Judgment Day Over Old Deport...,migrant confront judgment day old deportation ...


In [25]:
# Demonstration of the effect of cleaning the text

print("Before:\n", df['content'].iloc[0][:500])
print("\nAfter:\n", df['clean_content'].iloc[0][:500])

Before:
 House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) 
With apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democr

After:
 house dem aide didnt even see comeys letter jason chaffetz tweeted house dem aide didnt even see comeys letter jason chaffetz tweeted darrell lucus october subscribe jason chaffetz stump american fork utah image courtesy michael jolley available creative commonsby license apology keith olbermann doubt worst person world weekfbi director james comey according house democratic aide look like also know secondworst person well turn comey sent nowinfamous letter announcing fbi look
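
The cleaned output above contains fused tokens such as "weekfbi", "commonsby", and "nowinfamous", because the pattern in `clean_text` deletes punctuation instead of replacing it with a space. The sketch below compares the two behaviours on a made-up sentence. The helper `strip_non_letters` is hypothetical and only isolates the regex step; it is not part of the notebook's pipeline.

```python
import re

# Made-up sample containing a dash-joined and a hyphenated word.
sample = "this week–FBI Director sent a now-infamous letter"

def strip_non_letters(text, replacement):
    # Lowercase, swap every character that is not a-z or whitespace for
    # `replacement`, then collapse runs of whitespace via split/join.
    text = re.sub(r"[^a-z\s]", replacement, text.lower())
    return " ".join(text.split())

fused = strip_non_letters(sample, "")       # same substitution as clean_text
separated = strip_non_letters(sample, " ")  # assumed alternative: insert a space

print(fused)      # this weekfbi director sent a nowinfamous letter
print(separated)  # this week fbi director sent a now infamous letter
```

Replacing with a space has its own cost: contractions such as "didn’t" would split into "didn" and "t", so short leftover fragments may need to be filtered out afterwards.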

In [37]:
# Saving the clean dataset

df.to_csv('data/processed.csv', index=False)

# Data is now cleaned and ready for vectorization
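
One thing worth checking when the saved file is loaded again: by default `pd.read_csv` parses empty fields as NaN, so titles filled with `''` (and any `clean_content` that ended up empty) would come back as missing values. The sketch below demonstrates the round trip on a hypothetical two-row frame, using an in-memory buffer instead of `data/processed.csv`.

```python
import io
import pandas as pd

# Hypothetical frame with one empty title, mimicking the fillna('') step.
toy = pd.DataFrame({"id": [0, 1], "title": ["Headline", ""], "label": [1, 0]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty title back into NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["title"].isna().sum())
print(literal_read["title"].tolist())
```

Calling `fillna('')` on the text columns after loading would be an alternative to `keep_default_na=False`; which one fits better depends on whether other columns rely on the default NaN parsing.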