<a href="https://colab.research.google.com/github/olumideadekunle/Data-Sharing-among-Business/blob/main/Through_the_Lens_of_Truth_Analyzing_and_Detecting_Fake_News_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Through the Lens of Truth: Analyzing and Detecting Fake News

## About the organization

![TruthLens](https://drive.google.com/uc?export=view&id=1BSdTj6PVZwEnSCqucDa5DVHUGcPV_6jK)

The TruthLens Institute is a pioneering research organization dedicated to combating misinformation and fostering digital literacy worldwide. Founded in 2015 by a coalition of data scientists, journalists, and social researchers, TruthLens focuses on leveraging cutting-edge technology and interdisciplinary approaches to address the growing challenges of fake news, biased reporting, and disinformation campaigns.

### Mission:

To empower individuals and organizations with tools, insights, and strategies to identify and mitigate the spread of false or misleading information.

### Core Focus Areas:

- Data-Driven Research: Analyzing large datasets to uncover patterns and trends in misinformation.
- Technology Development: Creating AI-driven solutions to detect and counteract fake news in real time.
- Public Education: Offering workshops, webinars, and toolkits to enhance critical thinking and digital literacy.
- Policy Advocacy: Collaborating with governments and tech companies to implement ethical frameworks for content moderation.

### Impact:

Over the years, TruthLens has partnered with global organizations like the United Nations, educational institutions, and social media platforms to amplify its efforts. Their groundbreaking studies have shaped public discourse and influenced policymaking in the realm of digital ethics and media integrity.

**Why the Name "TruthLens"?**

The name reflects the organization’s mission to provide a clear, unbiased lens through which to view the information landscape. By filtering out noise and highlighting the truth, the institute aims to restore trust in media and information ecosystems.

## Project Introduction

As part of your commitment to making the world a better place, you volunteer 8-10 hours a week as a data scientist for TruthLens, a research organization, to help tackle misinformation and understand its viral nature. Your mission is to analyze a dataset containing text and metadata from websites tagged as fake or biased news sources. This project allows you to explore real-world data challenges, build detection models, and develop actionable insights to combat misinformation.

This project focuses on exploring, cleaning, and analyzing a dataset containing text and metadata scraped from 244 websites. You will also build predictive models to detect fake or biased content using natural language processing (NLP) and metadata features. The dataset contains 12,999 posts from the last 30 days, providing a rich resource for analysis.

In addition to technical skills, you will also reflect on the nuances of detecting misinformation, the ethical challenges of labeling data, and potential improvements for the dataset.

## Objectives

The main objectives of this project are:

- Data Exploration: Understand the structure, distribution, and nuances of the dataset.
- Data Cleaning: Handle missing or inconsistent labels and clean text data for analysis.
- Feature Engineering: Extract meaningful features from both text and metadata.
- Model Development: Build and evaluate machine learning models to detect fake or biased news.
- Insights and Recommendations: Provide actionable insights and propose potential improvements for misinformation detection systems.

## About the dataset

The dataset contains:

- Text Data: Articles or posts from websites.
- Metadata: Information such as timestamps, URLs, and labels (e.g., "bs").
- Labels: Predefined tags from the BS Detector extension indicating the type of fake or biased content.

**You can find the dataset here: "[fake_or_real_news.csv](https://drive.google.com/file/d/1m1gRCISgJr0W2TiveQxCxOyw6G9Yiwkj/view?usp=sharing)"**

## Task

**Phase 1: Data Exploration and Cleaning**

- Load the dataset, examine its structure (e.g., columns, data types, missing values), and remove unnecessary columns.
- Preprocess the text data (e.g., remove stopwords and punctuation, and tokenize the text).

**Phase 2: Feature Engineering**

- Extract key features from the text, such as word count, sentiment, and term frequency. You can generate a word cloud for frequently occurring terms in fake news articles.
- Extract metadata-based features (e.g., domain, publication time patterns). Consider identifying if specific domains contribute more fake news than others.

**Phase 3: Model Development**

- Split the data into training and test sets, ensuring balanced distribution of labels.
- Use NLP techniques (e.g., TF-IDF, embeddings) to represent the text data, and compare the performance of different models.
- Evaluate model performance using appropriate metrics (e.g., accuracy, F1-score).

**Phase 4: Insights and Recommendations**

- Analyze the results, discuss the model's strengths and weaknesses, and write a summary of key insights from the model and the dataset.
- Propose ethical considerations and improvements for detecting misinformation. Suggest additional features or external data sources that could enhance model performance.

## Deliverables

- Exploratory Data Analysis (EDA) notebook with visualizations and data cleaning steps. (3 weeks) --> Jupyter notebook
- An organized Jupyter Notebook detailing necessary project phases (2 weeks) --> Jupyter notebook
- Detailed documentation of the entire workflow, insights, and recommendations, including challenges faced and solutions implemented. (2 weeks) --> Microsoft Word document or PDF file

**Timeline = 7 weeks.**

## Phase 1: Data Exploration & Cleaning

### 1. Load and Clean the Data

```python
import pandas as pd

# Load the CSV directly from the Google Drive link given in the dataset section
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1m1gRCISgJr0W2TiveQxCxOyw6G9Yiwkj")

# Keep only the relevant columns and drop rows with missing values
df = df[['title', 'text', 'label']].dropna()

# Display basic info (df.info() prints directly and returns None)
df.info()
display(df.head())
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   6335 non-null   object
 1   text    6335 non-null   object
 2   label   6335 non-null   object
dtypes: object(3)
memory usage: 148.6+ KB
```


|   | title | text | label |
| - | ----- | ---- | ----- |
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
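
Before cleaning, it can help to confirm how balanced the labels are and whether any articles appear more than once. The sketch below is one way to do that; it runs on a small hand-made DataFrame (the `toy` rows are invented for illustration, not taken from the dataset) with the same `title`, `text`, and `label` columns, and `summarize_labels` is a hypothetical helper name.

```python
import pandas as pd

def summarize_labels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-label counts and shares (hypothetical helper)."""
    counts = frame["label"].value_counts()
    shares = frame["label"].value_counts(normalize=True).round(3)
    return pd.DataFrame({"count": counts, "share": shares})

# Invented rows that mimic the notebook's columns; not real data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "C"],
    "text": ["first story", "second story", "third story", "third story"],
    "label": ["FAKE", "REAL", "REAL", "REAL"],
})

summary = summarize_labels(toy)
# Rows whose title and text both repeat an earlier row count as duplicates.
n_duplicates = int(toy.duplicated(subset=["title", "text"]).sum())

print(summary)
print("duplicate rows:", n_duplicates)
```

On the real data, the same two checks could be run on `df` directly before the text is cleaned.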


### 2. Text Cleaning and Preprocessing

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip URLs and ASCII punctuation, tokenize, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return " ".join(word for word in tokens if word not in STOP_WORDS)

df['clean_text'] = df['text'].apply(clean_text)
```

```text
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
```
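
One caveat about the cleaning step: `string.punctuation` covers ASCII characters only, so typographic characters such as the curly apostrophe in "Hillary’s" or a long dash pass through `clean_text` untouched. The sketch below shows one possible alternative using only the standard library; `strip_punctuation` is a hypothetical helper, and the sample sentence is adapted from the first title in the preview for illustration.

```python
import string
import unicodedata

def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )

# \u2019 is a curly apostrophe, \u2014 is an em dash.
sample = "You Can Smell Hillary\u2019s Fear \u2014 really?"

ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
unicode_aware = strip_punctuation(sample)

print(ascii_only)     # the curly apostrophe and the dash are still present
print(unicode_aware)  # both are removed along with the question mark
```

Note that the Unicode categories treat characters like `$` and `+` as symbols rather than punctuation, so a pipeline that also wants those removed would need to apply both filters.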


## Phase 2: Feature Engineering

### 1. Basic Features

```python
# Number of tokens and number of characters in the cleaned text
df['word_count'] = df['clean_text'].apply(lambda x: len(x.split()))
df['char_count'] = df['clean_text'].apply(len)
```
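
To see whether length actually differs between the two classes, the new columns can be summarized per label. The following is a minimal sketch on invented numbers (the `toy` frame is made up for illustration and says nothing about the real dataset); on the notebook's data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Invented rows for illustration; in the notebook these columns come from `df`.
toy = pd.DataFrame({
    "label": ["FAKE", "FAKE", "REAL", "REAL"],
    "word_count": [120, 480, 300, 340],
})

# The median is less sensitive than the mean to a few very long articles.
stats = toy.groupby("label")["word_count"].agg(["median", "mean", "max"])
print(stats)
```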


### 2. TF-IDF Transformation

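```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 5,000 most frequent terms.
# Note: fitting on the full corpus before the train-test split lets the
# vocabulary and IDF weights see the test articles.
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['clean_text'])
```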


## Phase 3: Model Development

### 1. Train-Test Split

```python
from sklearn.model_selection import train_test_split

y = df['label'].map({'FAKE': 0, 'REAL': 1})  # Encode labels
# Stratify so both splits keep the same FAKE/REAL ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
```
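
Because the TF-IDF vectorizer in this notebook is fitted on the whole corpus before the split, the test articles influence the vocabulary and IDF weights. One common way to avoid that, and to compare several models at the same time, is to put the vectorizer inside a scikit-learn pipeline and cross-validate. The sketch below is an illustration under assumptions, not the notebook's method: the mini-corpus is invented, and the choice of `LogisticRegression` as a second candidate and of `cv=2` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented mini-corpus so the sketch runs on its own; not the project data.
texts = [
    "shocking secret cure revealed", "aliens endorse candidate shocking claim",
    "secret plot exposed by insiders", "miracle cure doctors hate",
    "senate passes budget bill", "court rules on appeal today",
    "city council approves budget", "minister visits flood region today",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = FAKE, 1 = REAL, as in the notebook

candidates = {
    "naive_bayes": MultinomialNB(),
    "log_reg": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    # The vectorizer lives inside the pipeline, so each CV fold fits
    # TF-IDF on its own training part only.
    pipe = make_pipeline(TfidfVectorizer(max_features=5000), clf)
    fold_scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_macro")
    scores[name] = fold_scores.mean()

print(scores)
```

With the real data, `texts` would be `df['clean_text']` and `labels` the encoded `y`; the scores from this toy corpus are not meaningful in themselves.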


### 2. Train and Evaluate Models

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Multinomial Naive Bayes on the TF-IDF features
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the confusion matrix are true labels, columns are predictions
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```


```text
              precision    recall  f1-score   support

           0       0.86      0.89      0.87       633
           1       0.88      0.86      0.87       634

    accuracy                           0.87      1267
   macro avg       0.87      0.87      0.87      1267
weighted avg       0.87      0.87      0.87      1267

[[562  71]
 [ 90 544]]
```
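
The figures in the classification report follow directly from the confusion matrix. As a worked check using only the numbers printed above (the helper name `precision_recall` is ours):

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels.
cm = [[562, 71],   # true FAKE (0): 562 predicted FAKE, 71 predicted REAL
      [90, 544]]   # true REAL (1): 90 predicted FAKE, 544 predicted REAL

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall(cm, k):
    """Precision and recall for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum
    return tp / predicted_k, tp / actual_k

for k, name in enumerate(["FAKE", "REAL"]):
    p, r = precision_recall(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
# FAKE: precision=0.86 recall=0.89
# REAL: precision=0.88 recall=0.86
# accuracy=0.87
```

In other words, 71 fake articles were passed as real and 90 real articles were flagged as fake.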


## Phase 4: Insights & Recommendations

### Insights

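- Most fake news articles had high word counts.
- Multinomial Naive Bayes on TF-IDF features reached 87% accuracy and a macro-averaged F1-score of 0.87 on the 1,267-article test set; it misclassified 71 fake articles as real and 90 real articles as fake.
- Certain keywords (e.g., "breaking", "exclusive") are common in fake news.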

### Recommendations

- Include source/domain as a feature in future iterations.
- Improve label quality by combining human and AI annotation.
- Address ethical risks: minimize false positives (real articles flagged as fake) and keep the model's decisions transparent.
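
As a starting point for the source/domain recommendation, a domain can be derived from an article URL with the standard library. This is a sketch under assumptions: the three columns kept in Phase 1 do not include a URL, so the `urls` list below is hypothetical, and `extract_domain` is an illustrative helper name.

```python
from collections import Counter
from urllib.parse import urlparse

def extract_domain(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Hypothetical URLs for illustration only.
urls = [
    "https://www.example.com/politics/story-1",
    "http://example.com/world/story-2",
    "https://news.example.org/a?id=3",
]

# Count how many articles each domain contributes.
domain_counts = Counter(extract_domain(u) for u in urls)
print(domain_counts.most_common())
```

Counting domains separately for each label would then show whether specific sites contribute a disproportionate share of fake articles.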

| Deliverable             | Format            | Tool                  |
| ----------------------- | ----------------- | --------------------- |
| EDA & Cleaning Notebook | `.ipynb`          | Jupyter               |
| Final Modeling Notebook | `.ipynb`          | Jupyter               |
| Project Report          | `.docx` or `.pdf` | MS Word / Google Docs |
