# Fake News Dataset

### Generic Purpose

The dataset is meant to support fake news detection. In other words, it provides labeled examples of news articles (real vs. fake) that can be used to train, test, and evaluate machine learning or NLP models. The goal is to help systems automatically distinguish misinformation from reliable reporting.

#### Two Specific Objectives

##### Build and Evaluate a Classifier
<ul>
    <li>Use the title and/or text fields to train a supervised learning model that predicts the label (fake vs. real).</li>
    <li>Benchmark different models (e.g., logistic regression, random forest, transformers) to see which performs best.</li>
</ul>
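
Before comparing real models, it helps to know the score a trivial predictor achieves. The following is a minimal sketch of a majority-class baseline, assuming a made-up `toy` DataFrame and a 70/30 split; none of these names or rows come from the actual dataset.

```python
import pandas as pd

# Hypothetical labeled rows standing in for the real dataset.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": ["fake"] * 6 + ["real"] * 4,
})

# Shuffle reproducibly, then hold out the last 30% as a test set.
shuffled = toy.sample(frac=1.0, random_state=0).reset_index(drop=True)
split = int(len(shuffled) * 0.7)
train, test = shuffled.iloc[:split], shuffled.iloc[split:]

# Majority-class baseline: always predict the most common training label.
majority = train["label"].mode().iloc[0]
baseline_accuracy = (test["label"] == majority).mean()

print("majority label:", majority)
print("baseline accuracy:", baseline_accuracy)
```

Any trained classifier (logistic regression, random forest, transformer) should beat this number on the same test split to be worth keeping.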

##### Analyze Patterns of Fake vs. Real News
<ul>
<li>Explore linguistic and metadata differences (e.g., vocabulary, sentiment, writing style, publishing source, frequency of certain words).</li>
<li>Identify features that strongly correlate with fake news, which could also help in understanding misinformation strategies.</li>
</ul>
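
A minimal sketch of this kind of comparison, assuming a tiny made-up `toy` DataFrame (the rows are illustrative, not taken from the dataset): it computes the average word count per label and counts how often each word appears under each label.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "text": [
        "tax plan vote today",
        "shocking secret cure revealed now",
        "market report released",
        "secret plot shocking truth",
    ],
    "label": ["real", "fake", "real", "fake"],
})

# Words per article, then the mean per label.
toy["word_count"] = toy["text"].str.split().str.len()
mean_words = toy.groupby("label")["word_count"].mean()

# One row per (article, word), then word frequency within each label.
words = toy.assign(word=toy["text"].str.split()).explode("word")
word_counts = words.groupby(["label", "word"]).size()

print(mean_words)
print(word_counts.sort_values(ascending=False).head())
```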

First things first, we import the two libraries used throughout this notebook:

#### Pandas

A toolkit for organizing data into something we can work with.
We have a CSV with thousands of articles, each with columns like <i>title, text, source, and label (fake/real)</i>.
Pandas gives us DataFrames, which are like spreadsheets in Python.

This enables us to: 
<ul>
    <li>Drop duplicates or empty rows (some articles might be missing content).</li>
    <li>Normalize text (e.g., lowercasing, removing weird spacing).</li>
    <li>Filter out specific sources or select only certain date ranges.</li>
    <li>Join/merge multiple datasets if you’re combining sources.</li>
</ul>
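
The four operations above can be sketched on a small example. The `articles` and `sources` tables below are hypothetical and exist only to illustrate the calls.

```python
import pandas as pd

# Hypothetical articles table with one duplicate row and one missing text.
articles = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "text": ["  Some  Text ", "  Some  Text ", None, "Other text"],
    "source": ["CNN", "CNN", "BBC", "Reuters"],
    "date": ["2023-01-05", "2023-01-05", "2022-06-01", "2023-03-10"],
})

# Drop exact duplicates and rows with no text.
articles = articles.drop_duplicates().dropna(subset=["text"]).copy()

# Normalize text: lowercase, trim, collapse repeated whitespace.
articles["text"] = (
    articles["text"].str.lower().str.strip().str.replace(r"\s+", " ", regex=True)
)

# Keep only articles dated 2023-01-01 or later.
articles["date"] = pd.to_datetime(articles["date"])
recent = articles[articles["date"] >= pd.Timestamp("2023-01-01")]

# Merge with a second (hypothetical) table of source metadata.
sources = pd.DataFrame({"source": ["CNN", "Reuters"], "country": ["US", "UK"]})
merged = recent.merge(sources, on="source", how="left")

print(merged)
```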

#### NumPy

Lives underneath Pandas and makes all the number crunching fast. 
Even though text data feels “non-numeric,” eventually you turn it into numbers (word counts, TF-IDF values, embeddings). 

NumPy handles:
<ul>
    <li>Efficient arrays (much faster than vanilla Python lists).</li>
    <li>Math operations on big chunks of data at once, instead of looping line by line.</li>
    <li>The foundations of much of the machine learning stack: scikit-learn works directly on NumPy arrays, and TensorFlow and PyTorch tensors convert to and from them.</li>
</ul>
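
A short sketch of what "math on big chunks of data at once" looks like in practice. The `lengths` array holds made-up character counts, not values from the dataset.

```python
import numpy as np

# Hypothetical character counts for five articles.
lengths = np.array([120, 4500, 30, 980, 15000])

# One vectorized expression replaces an explicit Python loop.
z_scores = (lengths - lengths.mean()) / lengths.std()

# Boolean masks select elements without looping.
in_range = lengths[(lengths > 30) & (lengths < 10000)]

print(z_scores)
print(in_range)
```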

##### Pandas 🐼 for the data cleaning layer.
##### NumPy 🔢 is the engine under the hood that makes it all run efficiently.

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('./fake_news_dataset.csv')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20000 non-null  object
 1   text      20000 non-null  object
 2   date      20000 non-null  object
 3   source    19000 non-null  object
 4   author    19000 non-null  object
 5   category  20000 non-null  object
 6   label     20000 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


#### Number of Rows & Columns

There are 20,000 rows (entries).

There are 7 columns in total.

#### Columns Overview

title: headline of the article

text: main body of the article

date: publication date

source: news source/publisher (some missing values: 19,000 non-null → 1,000 missing)

author: author name (some missing values: 19,000 non-null → 1,000 missing)

category: probably the section of the article (politics, tech, etc.)

label: this is most likely the target (fake or real news indicator)

#### Data Types

All columns are currently object type — so text-based (strings).

Even date is an object right now, not yet converted to a datetime type.

#### Memory Usage

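At least 1.1 MB. The `+` in `1.1+ MB` marks this as a lower bound, because pandas does not measure the string contents of object columns unless asked to. Either way, the dataset is lightweight and easy to work with.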

#### Data Quality Check

No missing values in any column except source and author, which each have 1,000 missing entries.

If you’re doing text analysis, the main columns of interest will be title, text, and label.
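
The counts read off `data.info()` can also be computed directly. The sketch below uses a made-up `toy` frame (not the real data) to show per-column missing counts, missing percentages, and the difference between shallow and deep memory measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a few missing values.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "source": ["CNN", np.nan, "BBC"],
    "author": [np.nan, np.nan, "x"],
})

# Missing values per column, as counts and as percentages.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

# deep=True also measures the strings themselves, not just the references.
shallow_bytes = toy.memory_usage(deep=False).sum()
deep_bytes = toy.memory_usage(deep=True).sum()

print(missing)
print(missing_pct)
print(shallow_bytes, deep_bytes)
```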

In [5]:
data.head()

Unnamed: 0,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake


The ```data.head()``` call shows us the first 5 rows of our DataFrame, so we can peek at what the dataset actually looks like.

*   **Structure & Sample Records**
    
    *   Each row is one **news article**.
        
    *   Columns match what we saw before: title, text, date, source, author, category, and label.
        
*   **Examples of Content**
    
    *   Titles are short headlines like _“Foreign Democrat final.”_ or _“To offer down resource great point.”_
        
    *   text contains longer article bodies (truncated in the preview).
        
    *   date is an ISO-style date string (like 2023-03-10), still stored as text at this point.
        
    *   source names publishers (NY Times, Fox News, CNN, Reuters).
        
    *   author column has individual names.
        
    *   category shows the article section (Politics, Business, Science, Technology).
        
    *   label is the target (either **real** or **fake**).
        
*   **Quick Observations**
    
    *   The preview contains 1 “real” and 4 “fake” labels. Five rows are too few to tell whether the dataset is **balanced in terms of fake/real**; that requires counting the full label column.
        
    *   Text looks a little odd / randomly generated (“Himself church myself carry...”), which suggests this might be a **synthetic dataset** rather than real journalism.
        
    *   Dates span across different years, so temporal patterns could also be analyzed.
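
Class balance is better judged with `value_counts` than by eye. As a sketch, the snippet below counts the five labels visible in the preview; on the real data the same two calls would be applied to `data["label"]`.

```python
import pandas as pd

# The five labels visible in the preview rows.
preview_labels = pd.Series(["real", "fake", "fake", "fake", "fake"], name="label")

# Absolute counts and relative shares per class.
counts = preview_labels.value_counts()
shares = preview_labels.value_counts(normalize=True)

print(counts)
print(shares)
```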

### De-duplication

In [None]:
data = data.drop_duplicates(subset=["title", "text"])

👉 Ensures that the dataset doesn’t have multiple rows with the exact same article (title + text). Prevents model bias from duplicate entries.
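
A small sketch of how `subset` behaves, using a made-up `toy` frame: rows count as duplicates only when both title and text match, and other columns such as source are ignored in the comparison.

```python
import pandas as pd

# Hypothetical rows: rows 0 and 1 share title and text but differ in source.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "text": ["same body", "same body", "different body", "same body"],
    "source": ["CNN", "BBC", "CNN", "CNN"],
})

before = len(toy)
# keep="first" (the default) retains the first occurrence of each pair.
deduped = toy.drop_duplicates(subset=["title", "text"], keep="first")
deduped = deduped.reset_index(drop=True)
removed = before - len(deduped)

print(f"removed {removed} duplicate row(s)")
```

Logging the number of removed rows makes it easy to see whether the step had any effect on a given dataset.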

### Missing-value normalization

In [7]:
data = data.replace(r"^\s*$", np.nan, regex=True)

👉 Converts blank or whitespace-only strings into NaN, making missing values easier to detect and handle.
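
To see what the pattern `^\s*$` matches, here is a sketch on a made-up `author` column: empty strings and strings made only of spaces or tabs become NaN, while names with surrounding spaces are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two real names and three blank variants.
toy = pd.DataFrame({"author": ["Paula George", "", "   ", "\t", " Joseph Hill "]})

# ^\s*$ matches strings that contain nothing but whitespace (or nothing at all).
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
n_missing = cleaned["author"].isna().sum()

print(cleaned)
print("missing:", n_missing)
```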

### Type casting for dates + calendar parts

In [8]:
data['date'] = pd.to_datetime(data['date'], errors='coerce')

data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day

👉 Converts date into a proper datetime type. Any value that cannot be parsed becomes NaT (Not a Time) instead of raising an error.
<br>
<br>
👉 Then extracts year, month, and day → useful for time-based analysis or trends.
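
The behavior of `errors='coerce'` is easiest to see on a few made-up values. In this sketch the `raw` series is hypothetical; it also shows that calendar parts turn into floats when NaT is present, and that the nullable `Int64` type is one way to keep them as integers.

```python
import pandas as pd

# Hypothetical inputs: two valid dates, one junk string, one missing value.
raw = pd.Series(["2023-03-10", "2022-05-25", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Calendar parts; rows with NaT yield NaN here.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
n_unparsed = parsed.isna().sum()

# Nullable integers keep whole-number years alongside missing entries.
year_nullable = parsed.dt.year.astype("Int64")

print(parts)
print("unparsed:", n_unparsed)
```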

In [22]:
data['date'] = data['date'].fillna(pd.NaT)

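👉 This line is effectively a no-op: after pd.to_datetime(..., errors='coerce'), missing or unparseable dates are already NaT, and filling NaT with NaT changes nothing. It only documents the intent that missing dates stay marked as NaT.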

### String cleanup for object columns

In [23]:
object_cols = data.select_dtypes(include="object").columns
for col in object_cols:
    data[col] = data[col].fillna('unknown')
    data[col] = data[col].str.strip().str.lower()

👉 For all categorical/text columns:
    
*   Fills missing values with "unknown".
    
*   Strips leading/trailing whitespace and standardizes everything to lowercase → prevents issues like "CNN", "cnn ", "cnn" being treated differently.
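
A sketch of the effect on a made-up `source` column. The column is created with an explicit `object` dtype here as an assumption, so that `select_dtypes(include="object")` picks it up the same way it does in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three spellings of the same source and one gap.
toy = pd.DataFrame({
    "source": pd.Series(["CNN", "cnn ", " Cnn", np.nan], dtype="object"),
    "text_length": [10, 20, 30, 40],
})
variants_before = toy["source"].nunique()

# Same loop as above: fill gaps, trim whitespace, lowercase.
for col in toy.select_dtypes(include="object").columns:
    toy[col] = toy[col].fillna("unknown").str.strip().str.lower()

variants_after = toy["source"].nunique()

print(toy["source"].tolist())
print(variants_before, "->", variants_after)
```

Numeric columns such as `text_length` are untouched because they are not selected by the dtype filter.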

### Source normalization

In [24]:
source_mapping = {
    "bbc news": "bbc",
    "bbc.com": "bbc",
    "bbc.co.uk": "bbc",
    "bbc": "bbc",

    "cnn news": "cnn",
    "cnn.com": "cnn",
    "cnn": "cnn",

    "foxnews.com": "fox news",
    "fox news": "fox news",
    "fox": "fox news",

    "ny times": "new york times",
    "nytimes.com": "new york times",
    "new york times": "new york times",
    "nyt": "new york times",

    "reuters.com": "reuters",
    "reuters": "reuters",

    "dailynews.com": "daily news",
    "daily news": "daily news",

    "global times": "global times",
    "globaltimes.cn": "global times",

    "guardian.co.uk": "the guardian",
    "the guardian": "the guardian",
    "guardian": "the guardian",

    "unknown": "unknown",
    "": "unknown",
    "n/a": "unknown"
}

👉 Standardizes publisher names into consistent categories. For example:

* "bbc.com", "bbc news", "bbc" → all mapped to "bbc".

* "cnn.com", "cnn" → all mapped to "cnn".

* Empty or placeholder entries like "" or "n/a" → mapped to "unknown".

In [25]:
data["source"] = data["source"].str.lower().map(source_mapping).fillna("unknown")

👉 Cleans the source column:

* Lowercases everything.

* Maps raw source names to the standardized values defined in source_mapping.

* Any unmapped source becomes "unknown".
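
Because unmapped sources are silently folded into "unknown", it can be worth listing them first. The sketch below uses a shortened, hypothetical `mapping` and a made-up `raw` series to show one way to audit them before filling.

```python
import pandas as pd

# Shortened, hypothetical version of the mapping above.
mapping = {"bbc.com": "bbc", "bbc": "bbc", "ny times": "new york times"}
raw = pd.Series(["bbc.com", "ny times", "al jazeera", "bbc"])

# Values missing from the mapping come back as NaN.
mapped = raw.map(mapping)

# Inspect what would be lost before collapsing it into "unknown".
unmapped = raw[mapped.isna()].value_counts()
final = mapped.fillna("unknown")

print(unmapped)
print(final.tolist())
```

If a frequent, legitimate publisher shows up in `unmapped`, adding it to the mapping keeps that information instead of discarding it.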

### Feature engineering (text length)

In [26]:
data["text_length"] = data["text"].str.len()

👉 Creates a new column text_length that stores the number of characters in each article’s text.
* This is useful for filtering out junk rows (like very short texts) and possibly as a feature in classification (fake news might systematically differ in length).

### Content sanity filter

In [27]:
data = data[(data["text_length"] > 30) & (data["text_length"] < 10000)]

👉 Keeps only articles whose text is longer than 30 and shorter than 10,000 characters (both bounds are exclusive).

* Removes noise: texts that are too short (like one-liners) or suspiciously long (maybe corrupted).

* Ensures the dataset has meaningful article content for NLP.
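
The boundary behavior of the filter can be checked on made-up texts of exactly 30, 31, 9,999, and 10,000 characters. The sketch also shows `Series.between` with `inclusive="neither"` as an equivalent spelling (available in pandas 1.3 and later).

```python
import pandas as pd

# Hypothetical texts sitting right at the filter boundaries.
toy = pd.DataFrame({"text": ["x" * 30, "x" * 31, "x" * 9999, "x" * 10000]})
toy["text_length"] = toy["text"].str.len()

# Same condition as the notebook: strictly greater than 30, strictly less than 10000.
strict = toy[(toy["text_length"] > 30) & (toy["text_length"] < 10000)]

# Equivalent filter using between() with both bounds excluded.
same = toy[toy["text_length"].between(30, 10000, inclusive="neither")]

print(strict["text_length"].tolist())
```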

### Final completeness check (QA)

In [28]:
print("Remaining missing values:\n", data.isnull().sum())

Remaining missing values:
 title          0
text           0
date           0
source         0
author         0
category       0
label          0
year           0
month          0
day            0
text_length    0
dtype: int64


👉 Verifies whether any columns still have missing values after cleaning.

* Output shows 0 missing values across all columns → no NaN values remain. Note that missing source and author entries were filled with the placeholder "unknown", not recovered.
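
A zero from `isnull().sum()` only says that no NaN remains. Since the string cleanup step fills gaps with "unknown", a second count of that placeholder shows how much information is still missing in practice. A sketch on a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical cleaned rows where gaps were filled with "unknown".
toy = pd.DataFrame({
    "author": ["paula george", "unknown", "unknown"],
    "source": ["cnn", "unknown", "bbc"],
})

# True NaN values per column (zero after cleaning).
true_missing = toy.isna().sum()

# Placeholder values per column (the formerly missing entries).
placeholders = toy.eq("unknown").sum()

print(true_missing)
print(placeholders)
```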

In [29]:
print("\nUnique Authors (sample):", data["author"].unique()[:10])


Unique Authors (sample): ['paula george' 'joseph hill' 'julia robinson' 'mr. david foster dds'
 'austin walker' 'sherri fry' 'alyssa young' 'tina garrett'
 'heather greene' 'erin hanson']


👉 Prints the first 10 unique authors to check if author cleaning worked.

* Sample shows names like "paula george", "joseph hill", "mr. david foster dds".

* This suggests that author names are standardized (lowercase, no blanks), at least for the sampled values.

In [30]:
print("\nUnique Sources (sample):", data["source"].unique()[:10])


Unique Sources (sample): ['new york times' 'fox news' 'cnn' 'reuters' 'daily news' 'global times'
 'the guardian' 'bbc' 'unknown']


👉 Prints up to the first 10 unique news sources after normalization. Only 9 distinct values remain, so all of them are shown.

* Sample shows: "new york times", "fox news", "cnn", "reuters", "bbc", "unknown".

* Confirms your source_mapping step worked — sources are now consistent categories.

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   title        20000 non-null  object        
 1   text         20000 non-null  object        
 2   date         20000 non-null  datetime64[ns]
 3   source       20000 non-null  object        
 4   author       20000 non-null  object        
 5   category     20000 non-null  object        
 6   label        20000 non-null  object        
 7   year         20000 non-null  int32         
 8   month        20000 non-null  int32         
 9   day          20000 non-null  int32         
 10  text_length  20000 non-null  int64         
dtypes: datetime64[ns](1), int32(3), int64(1), object(6)
memory usage: 1.4+ MB


*   All 20,000 rows are still present, so the de-duplication and length filter did not remove any rows.
    
*   No missing values (cleaned nicely 👍).
    
*   Dates are properly converted into datetime64.
    
*   New engineered columns (year, month, day, text\_length) are stored as numeric types.
    
*   Dataset is small (1.4 MB), so very manageable.

### Save

In [32]:
data.to_csv('cleaned_fake_news.csv', index=False)

👉 Saves the file into the current working directory, which for a notebook is usually the folder it was started from. index=False keeps the row index out of the CSV.
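
CSV files do not store column types, so the datetime column comes back as text unless it is parsed again on load. The sketch below round-trips a made-up `toy` frame through a temporary file (the file name `cleaned_toy.csv` is illustrative) to show the difference.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame with a real datetime column.
toy = pd.DataFrame({
    "title": ["a", "b"],
    "date": pd.to_datetime(["2023-03-10", "2022-05-25"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    toy.to_csv(path, index=False)

    # Plain read: the date column is loaded as strings.
    plain = pd.read_csv(path)

    # parse_dates restores the datetime type on load.
    parsed = pd.read_csv(path, parse_dates=["date"])

print(plain.dtypes)
print(parsed.dtypes)
```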