# Data Preprocessing for Fake News Detection  

This notebook preprocesses the datasets used in the **Fake News Detection Project**: BuzzFeed FakeNewsNet, LIAR, PolitiFact Factcheck, and Fake News Corpus. The steps are:
- Cleaning the raw datasets.
- Aligning column structures.
- Handling missing or irrelevant data.
- Standardizing data formats for analysis.
- Preparing the datasets for machine learning and statistical evaluation.

Each dataset is reduced to a shared schema (`id`, `title`, `text`, `label`, `speaker`, `date`, `url`), and the results are merged into a single file, `Fake_News_Dataset.csv`.


In [1]:
# Libraries and Dataset Loading
# Step 1: Importing Necessary Libraries
import pandas as pd

# Step 2: Loading the Datasets
# BuzzFeed FakeNewsNet Dataset
buzzfeed_fake_df = pd.read_csv('Datasets/BuzzFeed_fake_news_content.csv')
buzzfeed_real_df = pd.read_csv('Datasets/BuzzFeed_real_news_content.csv')

# LIAR Datasets
test_df = pd.read_csv('Datasets/test.csv')
train_df = pd.read_csv('Datasets/train.csv')
valid_df = pd.read_csv('Datasets/valid.csv')

# PolitiFact Factcheck Dataset
politifact_factcheck_df = pd.read_csv('Datasets/politifact_factcheck.csv')

# Fake News Corpus Dataset
fnc_df = pd.read_csv('Datasets/Fake News Corpus.csv')

# Step 3: Display Dataset Shapes
print("Dataset Shapes:")
print(f"BuzzFeed Fake News: {buzzfeed_fake_df.shape}")
print(f"BuzzFeed Real News: {buzzfeed_real_df.shape}")
print(f"LIAR Test Set: {test_df.shape}")
print(f"LIAR Train Set: {train_df.shape}")
print(f"LIAR Validation Set: {valid_df.shape}")
print(f"PolitiFact Factcheck: {politifact_factcheck_df.shape}")
print(f"Fake News Corpus: {fnc_df.shape}")


Dataset Shapes:
BuzzFeed Fake News: (91, 12)
BuzzFeed Real News: (91, 12)
LIAR Test Set: (1267, 14)
LIAR Train Set: (10240, 14)
LIAR Validation Set: (1284, 14)
PolitiFact Factcheck: (21152, 8)
Fake News Corpus: (249, 16)
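
Before cleaning, it can help to see which columns actually contain missing values, since every later cell fills them with a placeholder. The following is a minimal sketch, not part of the original pipeline: `summarize_missing` is a hypothetical helper and `toy_df` is made-up data standing in for one of the loaded DataFrames.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype, missing count, and missing share for each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })
    # Share of rows that are missing, as a percentage rounded to one decimal
    summary["missing_pct"] = (summary["missing"] / len(df) * 100).round(1)
    return summary


# Made-up rows for illustration only; not taken from any of the datasets
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "title": ["First", None, "Third"],
    "authors": [None, None, "Jane Doe"],
})

toy_summary = summarize_missing(toy_df)
print(toy_summary)
```

In the notebook, the same call could be applied to each raw DataFrame, for example `summarize_missing(buzzfeed_fake_df)`.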


In [2]:
# Cleaning function for the BuzzFeed datasets
def clean_buzzfeed_dataset(df, label, prefix):
    """Keep the shared columns, add a label, and prefix the numeric id."""
    # Keep only relevant columns
    keep_columns = ['id', 'title', 'text', 'url', 'authors']
    df = df[keep_columns].copy()

    # Rename columns to match the shared schema
    df = df.rename(columns={'authors': 'speaker'})

    # Add label column
    df['label'] = label

    # Extract the numeric part of the id; expand=False returns a Series
    df['id'] = df['id'].str.extract(r'(\d+)', expand=False).astype(str)
    df['id'] = prefix + df['id']  # Add prefix (BF or BR)

    return df

# Step 1: Clean both datasets
buzzfeed_fake_cleaned = clean_buzzfeed_dataset(buzzfeed_fake_df, 'False', 'BF')
buzzfeed_real_cleaned = clean_buzzfeed_dataset(buzzfeed_real_df, 'True', 'BR')

# Step 2: Combine the cleaned datasets
buzzfeed_combined = pd.concat([buzzfeed_fake_cleaned, buzzfeed_real_cleaned], ignore_index=True)

# Step 3: Replace missing values with 'unknown' (same placeholder as the other datasets)
buzzfeed_combined = buzzfeed_combined.fillna(value='unknown')

# Step 4: Save the combined dataset
buzzfeed_combined.to_csv('Buzzfeed.csv', index=False)

# Display the merged dataset
print("Buzzfeed Dataset Processed")
print(buzzfeed_combined.head())


Buzzfeed Dataset Processed
     id                                              title  \
0   BF1  Proof The Mainstream Media Is Manipulating The...   
1  BF10  Charity: Clinton Foundation Distributed “Water...   
2  BF11  A Hillary Clinton Administration May be Entire...   
3  BF12  Trump’s Latest Campaign Promise May Be His Mos...   
4  BF13                    Website is Down For Maintenance   

                                                text  \
0  I woke up this morning to find a variation of ...   
1  Former President Bill Clinton and his Clinton ...   
2  After collapsing just before trying to step in...   
3  Donald Trump is, well, deplorable. He’s sugges...   
4                    Website is Down For Maintenance   

                                                 url  \
0  http://www.addictinginfo.org/2016/09/19/proof-...   
1  http://eaglerising.com/36899/charity-clinton-f...   
2  http://eaglerising.com/36880/a-hillary-clinton...   
3  http://www.addictinginfo.org/2016/09

In [3]:
# Define the function to clean and process the LIAR dataset

def clean_liar_dataset(df):
    # Keep only relevant columns
    keep_columns = ['ID', 'Label', 'Statement', 'Speaker']
    df = df[keep_columns].copy()

    # Rename columns
    df.rename(columns={'ID': 'id', 'Label': 'label', 'Statement': 'title'}, inplace=True)

    # Prefix "LIAR-" to all values in the id column
    df['id'] = df['id'].apply(lambda x: f"LIAR-{x}")

    # Replace missing values with 'unknown'
    df.fillna('unknown', inplace=True)

    # Collapse the fine-grained verdicts into the project's label set.
    # Keys are lowercase because the labels are lowercased before mapping.
    label_mapping = {
        'true': 'True',
        'false': 'False',
        'mostly-false': 'False',
        'half-true': 'partially-true',
        'mostly-true': 'partially-true',
        'barely-true': 'partially-true',
        'pants-fire': 'exaggerated',
        'unknown': 'unknown'
    }

    # Normalize the labels and map them; anything unmapped becomes 'unknown'
    df['label'] = df['label'].str.lower().map(label_mapping).fillna('unknown')

    return df

# Step 1: Clean and process all three datasets (train, test, valid)
test_cleaned = clean_liar_dataset(test_df)
train_cleaned = clean_liar_dataset(train_df)
valid_cleaned = clean_liar_dataset(valid_df)

# Step 2: Combine all cleaned datasets into one
liar_combined = pd.concat([test_cleaned, train_cleaned, valid_cleaned], ignore_index=True)

# Step 3: Display the updated label counts for verification
print("Updated Label Counts in LIAR Dataset:")
print(liar_combined['label'].value_counts())

# Step 4: Save the combined dataset to CSV
liar_combined.to_csv('LIAR.csv', index=False)

# Display the merged dataset
print("LIAR Dataset Processed and Saved")
print(liar_combined.head())


Updated Label Counts in LIAR Dataset:
label
partially-true    7184
False             2507
True              2053
exaggerated       1047
Name: count, dtype: int64
LIAR Dataset Processed and Saved
           id           label  \
0  LIAR-11972            True   
1  LIAR-11685           False   
2  LIAR-11096           False   
3   LIAR-5209  partially-true   
4   LIAR-9524     exaggerated   

                                               title  \
0  Building a wall on the U.S.-Mexico border will...   
1  Wisconsin is on pace to double the number of l...   
2  Says John McCain has done nothing to help the ...   
3  Suzanne Bonamici supports a plan that will cut...   
4  When asked by a reporter whether hes at the ce...   

                            Speaker  
0                        rick-perry  
1                 katrina-shankland  
2                      donald-trump  
3                     rob-cornilles  
4  state-democratic-party-wisconsin  
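
The label mapping used in the LIAR cell is repeated in the PolitiFact and Fake News Corpus cells. One way to keep the three in sync is a single shared helper. The sketch below is an assumption about how that could look, not the notebook's own code: `LABEL_MAPPING` and `normalize_labels` are hypothetical names, and the input series is made up.

```python
import pandas as pd

# Shared mapping from raw verdicts (lowercase) to the project's label set
LABEL_MAPPING = {
    "true": "True",
    "false": "False",
    "mostly-false": "False",
    "half-true": "partially-true",
    "mostly-true": "partially-true",
    "barely-true": "partially-true",
    "pants-fire": "exaggerated",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Lowercase and trim raw labels, then map them; unmapped values become 'unknown'."""
    cleaned = labels.fillna("unknown").astype(str).str.strip().str.lower()
    return cleaned.map(LABEL_MAPPING).fillna("unknown")


# Made-up labels covering mixed case, stray whitespace, a missing value,
# and a value that is not in the mapping
raw_labels = pd.Series(["TRUE", " half-true ", "pants-fire", None, "satire"])
normalized = normalize_labels(raw_labels)
print(normalized.tolist())
```

Because the helper lowercases before mapping, uppercase keys such as `'TRUE'` are not needed in the dictionary.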


In [4]:
# Clean and process the PolitiFact Factcheck dataset

# Step 1: Reload the dataset (also loaded in cell 1) so this cell starts from the raw data
politifact_factcheck_df = pd.read_csv('Datasets/politifact_factcheck.csv')

# Step 2: Select only the required columns
keep_columns = ['verdict', 'statement_originator', 'statement', 'factcheck_date', 'factcheck_analysis_link']
politifact_cleaned = politifact_factcheck_df[keep_columns].copy()

# Step 3: Rename the columns
politifact_cleaned.rename(columns={
    'verdict': 'label',
    'statement_originator': 'speaker',
    'statement': 'title',
    'factcheck_date': 'date',
    'factcheck_analysis_link': 'url'
}, inplace=True)

# Step 4: Add a new 'id' column with sequential values prefixed with 'PF'
politifact_cleaned.insert(0, 'id', ['PF' + str(i) for i in range(1, len(politifact_cleaned) + 1)])

# Step 5: Replace missing values with 'unknown'
politifact_cleaned = politifact_cleaned.fillna('unknown')

# Step 6: Map the PolitiFact verdicts to the project's label set
# (keys are lowercase because the labels are lowercased before mapping)
label_mapping = {
    'true': 'True',
    'false': 'False',
    'mostly-false': 'False',
    'half-true': 'partially-true',
    'mostly-true': 'partially-true',
    'barely-true': 'partially-true',
    'pants-fire': 'exaggerated',
    'unknown': 'unknown'
}
politifact_cleaned['label'] = politifact_cleaned['label'].str.lower().map(label_mapping).fillna('unknown')

# Step 7: Save the cleaned dataset
politifact_cleaned.to_csv('Politifact.csv', index=False)

# Display the processed dataset
print("Politifact Factcheck Dataset Processed")
print(politifact_cleaned.head())


Politifact Factcheck Dataset Processed
    id           label       speaker  \
0  PF1            True  Barack Obama   
1  PF2           False    Matt Gaetz   
2  PF3  partially-true  Kelly Ayotte   
3  PF4           False      Bloggers   
4  PF5  partially-true  Bobby Jindal   

                                               title       date  \
0  John McCain opposed bankruptcy protections for...  6/16/2008   
1  "Bennie Thompson actively cheer-led riots in t...  6/13/2022   
2  Says Maggie Hassan was "out of state on 30 day...  5/27/2016   
3  "BUSTED: CDC Inflated COVID Numbers, Accused o...   2/5/2021   
4  "I'm the only (Republican) candidate that has ...  8/30/2015   

                                                 url  
0  https://www.politifact.com/factchecks/2008/jun...  
1  https://www.politifact.com/factchecks/2022/jun...  
2  https://www.politifact.com/factchecks/2016/may...  
3  https://www.politifact.com/factchecks/2021/feb...  
4  https://www.politifact.com/factchecks/2
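
The `date` column above keeps the source's month/day/year strings (for example `6/16/2008`). If a sortable format is wanted, the dates could be converted to ISO 8601. This is a sketch under the assumption that every real date follows the `%m/%d/%Y` pattern seen in the preview; `standardize_dates` is a hypothetical helper, and the `'unknown'` placeholder is preserved for values that do not parse.

```python
import pandas as pd


def standardize_dates(dates: pd.Series) -> pd.Series:
    """Convert month/day/year strings to YYYY-MM-DD; unparseable values become 'unknown'."""
    # errors="coerce" turns anything that does not match the format into NaT
    parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
    return parsed.dt.strftime("%Y-%m-%d").fillna("unknown")


# Two dates from the preview above plus the placeholder used for missing values
sample_dates = pd.Series(["6/16/2008", "2/5/2021", "unknown"])
iso_dates = standardize_dates(sample_dates)
print(iso_dates.tolist())
```

In the notebook this could be applied as `politifact_cleaned['date'] = standardize_dates(politifact_cleaned['date'])` before saving.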

In [5]:
# Clean and process the Fake News Corpus dataset

# Step 1: Select only the required columns
keep_columns = ['id', 'type', 'url', 'content', 'title', 'authors']
fnc_cleaned = fnc_df[keep_columns].copy()

# Step 2: Rename the columns
fnc_cleaned.rename(columns={
    'type': 'label',
    'content': 'text',
    'authors': 'speaker'
}, inplace=True)

# Step 3: Modify 'id' column by adding a prefix 'FNC' to the existing id values
fnc_cleaned['id'] = 'FNC' + fnc_cleaned['id'].astype(str)

# Step 4: Reorder the columns as per the new structure
fnc_cleaned = fnc_cleaned[['id', 'title', 'text', 'label', 'speaker', 'url']]

# Step 5: Replace missing values with 'unknown'
fnc_cleaned = fnc_cleaned.fillna('unknown')

# Step 6: Apply the label mapping
# Labels are mapped as-is (no lowercasing), so both 'true' and 'TRUE' spellings are listed
label_mapping = {
    'true': 'True',
    'TRUE': 'True',
    'false': 'False',
    'FALSE': 'False',
    'mostly-false': 'False',
    'half-true': 'partially-true',
    'mostly-true': 'partially-true',
    'barely-true': 'partially-true',
    'pants-fire': 'exaggerated',
    'unknown': 'unknown'
}

# Apply the mapping to the 'label' column; anything unmapped becomes 'unknown'
fnc_cleaned['label'] = fnc_cleaned['label'].map(label_mapping).fillna('unknown')

# Step 7: Save the cleaned dataset
fnc_cleaned.to_csv('FNC.csv', index=False)

# Display the processed dataset
print("Fake News Corpus Dataset Processed")
print(fnc_cleaned.head())


Fake News Corpus Dataset Processed
       id                                              title  \
0  FNC141  Church Congregation Brings Gift to Waitresses ...   
1  FNC256  AWAKENING OF 12 STRANDS of DNA – “Reconnecting...   
2  FNC700  Never Hike Alone - A Friday the 13th Fan Film ...   
3  FNC768  Elusive ‘Alien Of The Sea ‘ Caught By Scientis...   
4  FNC791  Trump’s Genius Poll Is Complete & The Results ...   

                                                text        label  \
0  Sometimes the power of Christmas will make you...        False   
1  AWAKENING OF 12 STRANDS of DNA – “Reconnecting...        False   
2  Never Hike Alone: A Friday the 13th Fan Film U...        False   
3  When a rare shark was caught, scientists were ...        False   
4  Donald Trump has the unnerving ability to abil...  exaggerated   

           speaker                                                url  
0      Ruth Harris  http://awm.com/church-congregation-brings-gift...  
1     Zurich Times  h

In [6]:
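# Step 1: Load All Preprocessed Fake News Datasets
# Read 'label' as text: a column holding only True/False (as in Buzzfeed.csv)
# would otherwise be parsed as boolean and mix types after the merge.
buzzfeed_df = pd.read_csv('Buzzfeed.csv', dtype={'label': str})
liar_df = pd.read_csv('LIAR.csv', dtype={'label': str})
politifact_df = pd.read_csv('Politifact.csv', dtype={'label': str})
fnc_df = pd.read_csv('FNC.csv', dtype={'label': str})

# LIAR.csv was saved with a capitalized 'Speaker' column; rename it so the values survive
liar_df = liar_df.rename(columns={'Speaker': 'speaker'})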

# Step 2: Standardize Column Structure Across All Datasets
final_columns = ['id', 'title', 'text', 'label', 'speaker', 'date', 'url']

# Utility Function: Ensure All Datasets Contain Required Columns
def ensure_columns(df, columns):
    df = df.copy()  # avoid modifying the caller's DataFrame
    for col in columns:
        if col not in df.columns:
            df[col] = 'unknown'
    return df[columns]

# Apply Standardization to Each Dataset
buzzfeed_df = ensure_columns(buzzfeed_df, final_columns)
liar_df = ensure_columns(liar_df, final_columns)
politifact_df = ensure_columns(politifact_df, final_columns)
fnc_df = ensure_columns(fnc_df, final_columns)  

# Step 3: Merge All Standardized Datasets into a Unified Dataset
fake_news_dataset = pd.concat([buzzfeed_df, liar_df, politifact_df, fnc_df], ignore_index=True)

# Step 4: Save the Final Unified and Labeled Dataset to CSV
fake_news_dataset.to_csv('Fake_News_Dataset.csv', index=False)

# Step 5: Output a Preview of the Final Unified Dataset
print("Fake News Dataset Created")
print(fake_news_dataset.head())


Fake News Dataset Created
     id                                              title  \
0   BF1  Proof The Mainstream Media Is Manipulating The...   
1  BF10  Charity: Clinton Foundation Distributed “Water...   
2  BF11  A Hillary Clinton Administration May be Entire...   
3  BF12  Trump’s Latest Campaign Promise May Be His Mos...   
4  BF13                    Website is Down For Maintenance   

                                                text  label  \
0  I woke up this morning to find a variation of ...  False   
1  Former President Bill Clinton and his Clinton ...  False   
2  After collapsing just before trying to step in...  False   
3  Donald Trump is, well, deplorable. He’s sugges...  False   
4                    Website is Down For Maintenance  False   

                       speaker     date  \
0              Wendy Gittleson  unknown   
1               View All Posts  unknown   
2  View All Posts,Tony Elliott  unknown   
3                  John Prager  unknown   
4      
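
One detail to watch when intermediate CSVs are reloaded: `pd.read_csv` parses a column that contains only `True` and `False` as boolean by default. `Buzzfeed.csv` has such a `label` column, while the other files also contain labels like `partially-true`, so the merged column can end up holding a mix of booleans and strings. The sketch below reproduces the effect on a small in-memory CSV (made-up rows) and shows that passing `dtype` keeps the labels as text.

```python
import io

import pandas as pd

# A tiny CSV with the same shape of problem: the label column holds only True/False
csv_text = "id,label\nBF1,False\nBR1,True\n"

# Default parsing infers a boolean column
default_df = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the original text labels
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"label": str})

print(default_df["label"].dtype)
print(typed_df["label"].tolist())
```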

In [7]:
print("Dataset Shape:", fake_news_dataset.shape)


Dataset Shape: (34374, 7)
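
The row count can be cross-checked against the shapes printed in the first cell: 91 + 91 + 1267 + 10240 + 1284 + 21152 + 249 = 34374, which matches the shape above. A few further sanity checks on the unified dataset could look like the sketch below. `check_unified` and `ALLOWED_LABELS` are hypothetical names, the label set is an assumption based on the mappings used in this notebook, and the two small DataFrames are made up for illustration.

```python
import pandas as pd

FINAL_COLUMNS = ["id", "title", "text", "label", "speaker", "date", "url"]
# Assumed label set, taken from the mappings used in the cleaning cells
ALLOWED_LABELS = {"True", "False", "partially-true", "exaggerated", "unknown"}


def check_unified(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if list(df.columns) != FINAL_COLUMNS:
        # Later checks rely on the columns, so stop here
        return [f"unexpected columns: {list(df.columns)}"]
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    duplicate_ids = int(df["id"].duplicated().sum())
    if duplicate_ids:
        problems.append(f"{duplicate_ids} duplicated id value(s)")
    unexpected = sorted(set(df["label"].astype(str)) - ALLOWED_LABELS)
    if unexpected:
        problems.append(f"unexpected labels: {unexpected}")
    return problems


# Made-up rows in the unified schema
good = pd.DataFrame({
    "id": ["BF1", "LIAR-2", "PF3"],
    "title": ["a", "b", "c"],
    "text": ["x", "unknown", "unknown"],
    "label": ["False", "partially-true", "True"],
    "speaker": ["s1", "s2", "s3"],
    "date": ["unknown", "unknown", "6/16/2008"],
    "url": ["u1", "unknown", "u3"],
})

# Same rows with a duplicated id and a label outside the allowed set
bad = good.copy()
bad.loc[2, "id"] = "BF1"
bad.loc[1, "label"] = "satire"

good_problems = check_unified(good, expected_rows=3)
bad_problems = check_unified(bad, expected_rows=4)
print(good_problems)
print(bad_problems)
```

In the notebook, the call would be `check_unified(fake_news_dataset, expected_rows=34374)`.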
