# Fake & Real News Dataset

#### Purpose

<strong>Fake News Detection</strong>
<br>
![Fake News GIF](gifs/Cnn%20News%20GIF.gif)

This dataset supports fake news detection: each article carries a `label` of either `fake` or `real`.

#### Objectives

##### Build and Evaluate a Classifier

![Predict GIF](gifs/Season%202%20Rolls%20Eyes%20GIF%20by%20BBC%20Three.gif)

<ul>
    <li>Use the title and/or text fields to train a supervised learning model that predicts the label (fake vs. real).</li>
    <li>Benchmark different models (e.g., logistic regression, random forest, transformers) to see which performs best.</li>
</ul>
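The benchmarking objective above could be approached along the following lines. This is a minimal sketch, not the notebook's own method: it assumes scikit-learn is installed, and the `toy` rows, the `candidates` dictionary, and the choice of TF-IDF features with macro F1 are illustrative assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy rows standing in for the dataset's title/text/label columns.
toy = pd.DataFrame({
    "title": ["Budget vote passes", "Miracle cure found", "Council approves plan",
              "Aliens run the bank", "Court issues ruling", "Secret moon base leak",
              "Rates held steady", "Celebrity clone exposed"],
    "text": ["lawmakers passed the budget after a long debate",
             "doctors hate this shocking miracle cure",
             "the council approved the transport plan on monday",
             "shocking proof that aliens secretly run the bank",
             "the court issued its ruling on the appeal",
             "leaked files reveal a shocking secret moon base",
             "the central bank held interest rates steady",
             "shocking photos expose a secret celebrity clone"],
    "label": ["real", "fake"] * 4,
})

# Combine both text fields into one input string per article.
X = (toy["title"] + " " + toy["text"]).tolist()
y = toy["label"].tolist()

# Hypothetical candidate models to compare under identical features and folds.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    # cv=2 only because the toy data is tiny; more folds suit the real dataset.
    scores = cross_val_score(pipeline, X, y, cv=2, scoring="f1_macro")
    results[name] = scores.mean()

print(results)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary statistics leak from the held-out fold into training.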

##### Analyze Patterns of Fake vs. Real News

![Analyze](gifs/Serious%20Episode%205%20GIF%20by%20One%20Chicago.gif)

<ul>
<li>Explore linguistic and metadata differences (e.g., vocabulary, sentiment, writing style, publishing source, frequency of certain words).</li>
<li>Identify features that strongly correlate with fake news, which could also help in understanding misinformation strategies.</li>
</ul>
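One simple way to start on the vocabulary comparison is to count the most frequent words per label. The sketch below uses a hypothetical `toy` frame and a hypothetical helper `top_words`; whitespace tokenization is an assumption and would likely need stop-word removal on real text.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy rows; the real dataset has the same text/label columns.
toy = pd.DataFrame({
    "text": ["tax plan tax vote", "miracle cure shock",
             "vote result tax", "shock claim miracle miracle"],
    "label": ["real", "fake", "real", "fake"],
})


def top_words(texts, n=2):
    """Return the n most common whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)


# One ranked word list per label.
top_by_label = {
    label: top_words(group["text"]) for label, group in toy.groupby("label")
}
print(top_by_label)
```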

#### Import Libraries

In [1]:
import pandas as pd
import numpy as np

#### Reading the File

In [2]:
data = pd.read_csv('fake_news_dataset.csv')

#### Exploring the Dataset

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20000 non-null  object
 1   text      20000 non-null  object
 2   date      20000 non-null  object
 3   source    19000 non-null  object
 4   author    19000 non-null  object
 5   category  20000 non-null  object
 6   label     20000 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


In [4]:
data.head()

|   | title | text | date | source | author | category | label |
|---|---|---|---|---|---|---|---|
| 0 | Foreign Democrat final. | more tax development both store agreement lawy... | 2023-03-10 | NY Times | Paula George | Politics | real |
| 1 | To offer down resource great point. | probably guess western behind likely next inve... | 2022-05-25 | Fox News | Joseph Hill | Politics | fake |
| 2 | Himself church myself carry. | them identify forward present success risk sev... | 2022-09-01 | CNN | Julia Robinson | Business | fake |
| 3 | You unit its should. | phone which item yard Republican safe where po... | 2023-02-07 | Reuters | Mr. David Foster DDS | Science | fake |
| 4 | Billion believe employee summer how. | wonder myself fact difficult course forget exa... | 2023-04-03 | CNN | Austin Walker | Technology | fake |


#### De-Duplication

In [5]:
# Keep the first row for each (title, text) pair and drop later repeats
data = data.drop_duplicates(subset=["title", "text"])
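To see what this step removes, it can help to count duplicates before dropping them. The sketch below runs on a hypothetical `toy` frame; note that when duplicated articles carry conflicting labels, `drop_duplicates` keeps whichever row appears first.

```python
import pandas as pd

# Hypothetical toy rows: rows 0 and 1 share title and text but disagree on label.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "text": ["same body", "same body", "other body", "same body"],
    "label": ["fake", "real", "fake", "real"],
})

# True for every repeat of an already-seen (title, text) pair.
dupe_mask = toy.duplicated(subset=["title", "text"], keep="first")
print("Duplicate rows:", int(dupe_mask.sum()))

# Same call as in the notebook, followed by an index reset for a clean 0..n-1 index.
deduped = toy.drop_duplicates(subset=["title", "text"]).reset_index(drop=True)
print(deduped)
```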

#### Missing-Value Normalization

In [6]:
# Treat empty and whitespace-only strings as missing values
data = data.replace(r"^\s*$", np.nan, regex=True)
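The effect of the regex is easiest to see on a small example. In this sketch the `toy` frame is hypothetical; the pattern `^\s*$` matches only strings that are empty or consist entirely of whitespace, so textual placeholders such as `n/a` are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mixing real values, blanks, and an existing missing value.
toy = pd.DataFrame({
    "source": ["CNN", "", "   ", None],
    "author": ["a", "b", "\t", "n/a"],
})

# Same replacement as in the notebook.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count missing values per column after normalization.
missing = cleaned.isna().sum()
print(missing)
```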

#### Type Casting for Dates and Calendar Parts

In [7]:
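# Parse date strings; values that cannot be parsed become NaT instead of raising
data['date'] = pd.to_datetime(data['date'], errors='coerce')

# Derive calendar parts for later grouping and filtering
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day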

In [8]:
# Effectively a no-op: missing values in a datetime column are already NaT
data['date'] = data['date'].fillna(pd.NaT)
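Because `errors='coerce'` silently turns bad values into `NaT`, it is worth counting them. The sketch below uses a hypothetical `raw` series; it also shows one possible way to keep the calendar parts as integers when some dates are missing, by casting to the nullable `Int64` dtype, since `.dt.year` otherwise falls back to floats in that case.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one unparseable string, one missing.
raw = pd.Series(["2023-03-10", "not a date", None])

# Same parsing call as in the notebook.
parsed = pd.to_datetime(raw, errors="coerce")

# How many values could not be parsed (or were missing to begin with).
n_unparsed = int(parsed.isna().sum())
print("Unparsed dates:", n_unparsed)

# Nullable integer dtype keeps years as integers and marks gaps as <NA>.
year_int = parsed.dt.year.astype("Int64")
print(year_int)
```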

#### String Cleanup for Object Columns

In [9]:
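# 'date' is datetime64 by now, so only the text-like columns are selected
object_cols = data.select_dtypes(include="object").columns
for col in object_cols:
    # Replace missing values with a placeholder, then trim whitespace and lowercase
    data[col] = data[col].fillna('unknown')
    data[col] = data[col].str.strip().str.lower()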
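The cleanup can be checked on a few rows. This sketch uses a hypothetical `toy` frame and an explicit column list rather than `select_dtypes`; it illustrates how trimming and lowercasing collapse label variants that would otherwise count as separate classes.

```python
import pandas as pd

# Hypothetical toy rows with stray whitespace, mixed case, and a missing author.
toy = pd.DataFrame({
    "author": ["  Paula George ", None, "JOSEPH HILL"],
    "label": ["Real", "fake ", " FAKE"],
})

for col in ["author", "label"]:
    # Same three operations as in the notebook, chained.
    toy[col] = toy[col].fillna("unknown").str.strip().str.lower()

print(toy)
print("Distinct labels:", toy["label"].nunique())
```

One design point to weigh: filling `title`, `text`, or `label` with `unknown` hides missing values from later null checks, so for the target column it may be preferable to drop such rows instead.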

#### Source Normalization

In [10]:
source_mapping = {
    "bbc news": "bbc",
    "bbc.com": "bbc",
    "bbc.co.uk": "bbc",
    "bbc": "bbc",

    "cnn news": "cnn",
    "cnn.com": "cnn",
    "cnn": "cnn",

    "foxnews.com": "fox news",
    "fox news": "fox news",
    "fox": "fox news",

    "ny times": "new york times",
    "nytimes.com": "new york times",
    "new york times": "new york times",
    "nyt": "new york times",

    "reuters.com": "reuters",
    "reuters": "reuters",

    "dailynews.com": "daily news",
    "daily news": "daily news",

    "global times": "global times",
    "globaltimes.cn": "global times",

    "guardian.co.uk": "the guardian",
    "the guardian": "the guardian",
    "guardian": "the guardian",

    "unknown": "unknown",
    "": "unknown",
    "n/a": "unknown"
}

In [11]:
# Map each variant to its canonical name; sources absent from source_mapping become "unknown"
data["source"] = data["source"].str.lower().map(source_mapping).fillna("unknown")
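Since `.map()` returns a missing value for any key not in the dictionary, every unlisted outlet ends up as `unknown`. A quick way to spot such cases before mapping is sketched below; the `sources` series and the shortened `source_mapping` are hypothetical stand-ins for the notebook's data.

```python
import pandas as pd

# Shortened, hypothetical subset of the notebook's mapping.
source_mapping = {
    "ny times": "new york times",
    "cnn": "cnn",
    "fox": "fox news",
}

# Hypothetical raw sources, including one outlet the mapping does not cover.
sources = pd.Series(["NY Times", "cnn", "al jazeera", None])
normalized = sources.str.strip().str.lower()

# Values that would silently become "unknown".
unmapped = sorted(set(normalized.dropna()) - set(source_mapping))
print("Unmapped sources:", unmapped)

# Same mapping step as in the notebook.
mapped = normalized.map(source_mapping).fillna("unknown")
print(mapped.tolist())
```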

#### Feature Engineering (Text Length)

In [12]:
data["text_length"] = data["text"].str.len()

#### Content Sanity Filter

In [13]:
# Keep articles whose text is longer than 30 and shorter than 10,000 characters
data = data[(data["text_length"] > 30) & (data["text_length"] < 10000)]
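The same filter can be written with `Series.between`, which makes the exclusive bounds explicit, and it is useful to report how many rows it removes. The `toy` frame below is hypothetical; the thresholds are the notebook's.

```python
import pandas as pd

# Hypothetical texts of length 9, 31, 9999, and 10000 characters.
toy = pd.DataFrame({"text": ["too short", "x" * 31, "y" * 9999, "z" * 10000]})
toy["text_length"] = toy["text"].str.len()

# Equivalent to (length > 30) & (length < 10000): both bounds are exclusive.
keep = toy["text_length"].between(30, 10000, inclusive="neither")
filtered = toy[keep].reset_index(drop=True)

n_dropped = len(toy) - len(filtered)
print("Rows dropped by the length filter:", n_dropped)
```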

#### Final Completeness Check (QA)

In [14]:
print("Remaining missing values:\n", data.isnull().sum())

Remaining missing values:
 title          0
text           0
date           0
source         0
author         0
category       0
label          0
year           0
month          0
day            0
text_length    0
dtype: int64
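The zero counts above partly reflect the earlier `fillna('unknown')` step: `source` and `author` each had 19000 non-null values in the first `data.info()` output. Counting the placeholder gives a fuller picture of completeness. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical cleaned rows where missing values were replaced by "unknown".
toy = pd.DataFrame({
    "source": ["cnn", "unknown", "bbc"],
    "author": ["unknown", "unknown", "a"],
})

# No nulls remain, yet some cells only hold the placeholder.
null_counts = toy.isnull().sum()
placeholder_counts = (toy == "unknown").sum()

print("Nulls:\n", null_counts)
print("Placeholder values:\n", placeholder_counts)
```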


In [15]:
print("\nUnique Authors (sample):", data["author"].unique()[:10])


Unique Authors (sample): ['paula george' 'joseph hill' 'julia robinson' 'mr. david foster dds'
 'austin walker' 'sherri fry' 'alyssa young' 'tina garrett'
 'heather greene' 'erin hanson']


In [16]:
print("\nUnique Sources (sample):", data["source"].unique()[:10])


Unique Sources (sample): ['new york times' 'fox news' 'cnn' 'reuters' 'daily news' 'global times'
 'the guardian' 'bbc' 'unknown']


In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   title        20000 non-null  object        
 1   text         20000 non-null  object        
 2   date         20000 non-null  datetime64[ns]
 3   source       20000 non-null  object        
 4   author       20000 non-null  object        
 5   category     20000 non-null  object        
 6   label        20000 non-null  object        
 7   year         20000 non-null  int32         
 8   month        20000 non-null  int32         
 9   day          20000 non-null  int32         
 10  text_length  20000 non-null  int64         
dtypes: datetime64[ns](1), int32(3), int64(1), object(6)
memory usage: 1.4+ MB


#### Save

In [18]:
data.to_csv('cleaned_fake_news.csv', index=False)
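CSV files do not store column dtypes, so `date` comes back as plain strings unless it is parsed again on load. The round trip below is a sketch that writes to an in-memory buffer instead of `cleaned_fake_news.csv`; the `toy` frame is hypothetical.

```python
import io

import pandas as pd

# Hypothetical cleaned rows with a proper datetime column.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-10", "2022-05-25"]),
    "label": ["real", "fake"],
})

# Write without the index, as in the notebook, then rewind the buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the file back.
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```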