<a href="https://colab.research.google.com/github/micaelasousai/Analyzing-U.S.-Election-Misinformation-on-Reddit/blob/main/Big_Data_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Engagement Metrics - Do False Claims Spread Faster?**
## **Objective**
This project aims to analyze how misinformation spreads on Reddit during the 2024 U.S. Presidential election. Specifically, we will:
- Collect election-related discussions from Reddit.
- Use NLP techniques to classify misinformation.
- Compare engagement levels (comments, virality, shares) between fact-checked and misleading posts.
- Visualize the spread of misinformation vs. factual content.

## **Tools & Libraries**
- **Programming Languages**: Python
- **APIs**: Reddit API (PRAW)
- **Libraries**: NLTK, Transformers, scikit-learn, PRAW, Pandas, Matplotlib, Seaborn, PrettyTable
- **Visualization**: Matplotlib, Seaborn
- **Data Storage**: Google Drive / Local Storage


## **Stage 1: Collecting Data from Reddit**
We used PRAW to fetch data from relevant subreddits like r/politics, r/news, and r/2024elections, then organized the posts and comments into a structured Pandas DataFrame for analysis.

**The code block below takes approximately 70 minutes to run, so we recommend skipping it and using the attached reddit_scrape.csv file for the steps that follow.**

In [None]:
!pip install praw

import praw
import time
import logging
import numpy as np
import pandas as pd
import seaborn as sns
from pprint import pprint
from datetime import datetime
from google.colab import files
import matplotlib.pyplot as plt

logging.getLogger("praw").setLevel(logging.ERROR)

start_date = datetime(2023, 6, 1)
end_date = datetime(2024, 12, 31)
start_ts = start_date.timestamp()
end_ts = end_date.timestamp()

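import os

# Read API credentials from environment variables instead of hard-coding them in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders; set them before running.
user_agent = "Scraper 1.0 by /u/SouthBee4571"
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=user_agent
)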

subreddits = [
    'politics',
    'news',
    'Conservative',
    'liberal',
    'PoliticalDiscussion',
    '2024elections'
]

posts_data = []

for sub in subreddits:
    for submission in reddit.subreddit(sub).hot(limit=None):
        if not (start_ts <= submission.created_utc <= end_ts):
            continue
        time.sleep(3)
        comments = [comment.body for comment in submission.comments if isinstance(comment, praw.models.Comment)][:30]

        score = submission.score
        upvote_ratio = submission.upvote_ratio
        estimated_upvotes = int(score * upvote_ratio)
        estimated_downvotes = score - estimated_upvotes

        posts_data.append({
            "post_id": submission.id,
            "title": submission.title,
            "selftext": submission.selftext,
            "url": submission.url,
            "comments": comments,
            "num_comments": submission.num_comments,
            "upvotes": estimated_upvotes,
            "downvotes": estimated_downvotes,
            "score": score,
            "timestamp": submission.created_utc,
            "subreddit": submission.subreddit.display_name
        })

df = pd.DataFrame(posts_data)
df['datetime'] = pd.to_datetime(df['timestamp'], unit='s').astype(str)
df['content'] = df.apply(
    lambda row: row['title'] + "\n\n" + (row['selftext'] if row['selftext'] else "") + "\n" + row['url'],
    axis=1
)

print(df[['subreddit', 'datetime', 'title']].head())

df.to_csv("reddit_scrape.csv", index=False, encoding='utf-8')
files.download("reddit_scrape.csv")
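The scraping cell above estimates upvotes as `int(score * upvote_ratio)`. As a hedged alternative, if one assumes `score = upvotes - downvotes` and `upvote_ratio = upvotes / (upvotes + downvotes)`, the two counts can be solved for directly. The sketch below uses a hypothetical helper, `estimate_votes`, that is not part of the original notebook. Reddit fuzzes vote counts, so the result is approximate at best.

```python
def estimate_votes(score, upvote_ratio):
    """Hypothetical helper: solve for (upvotes, downvotes).

    Assumes score = ups - downs and upvote_ratio = ups / (ups + downs),
    which gives ups = score * ratio / (2 * ratio - 1).
    Returns None when the ratio is 0.5, where there is no unique solution.
    """
    denominator = 2 * upvote_ratio - 1
    if abs(denominator) < 1e-9:
        return None
    upvotes = round(score * upvote_ratio / denominator)
    downvotes = upvotes - score
    return upvotes, downvotes


example = estimate_votes(score=80, upvote_ratio=0.9)
print(example)  # (90, 10): 90 - 10 = 80 and 90 / 100 = 0.9
```

Under these assumptions, a post with a score of 80 and a 90% upvote ratio has roughly 90 upvotes and 10 downvotes, whereas `int(score * upvote_ratio)` would report 72 and 8.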


## **Stage 2: Cleaning and Preprocessing Data**

We cleaned the title and self-text fields by lowercasing the text, stripping URLs and special characters, tokenizing with NLTK, and removing English stopwords. The cleaned results are stored in two new DataFrame columns, Cleaned_Title and Cleaned_Selftext, for use in the following stages.

**In the code cell below, upload the reddit_scrape.csv file when testing**


In [None]:
# Upload the reddit_scrape.csv file here
from google.colab import files
uploaded = files.upload()




In [None]:
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
nltk.download('punkt_tab', force=True)
nltk.download('stopwords', force=True)
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

# Read CSV file into a DataFrame
df = pd.read_csv("reddit_scrape.csv", encoding='utf-8')

# Apply cleaning function to 'title' and 'selftext' columns
# (fill missing values first so NaN does not turn into the literal string "nan")
df["Cleaned_Title"] = df["title"].fillna("").astype(str).apply(clean_text)
df["Cleaned_Selftext"] = df["selftext"].fillna("").astype(str).apply(clean_text)

# View the cleaned content
print(df[['title', 'Cleaned_Title', 'selftext', 'Cleaned_Selftext']].head())
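To make the cleaning steps concrete, here is a minimal sketch that runs without downloading NLTK data. It assumes a tiny hypothetical stopword set (`TOY_STOP_WORDS`) and uses `str.split` in place of `word_tokenize`, so its output can differ from the notebook's `clean_text` on real posts.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "is", "a", "at", "to", "of"}


def clean_text_sketch(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation and symbols
    words = text.split()                  # whitespace split stands in for word_tokenize
    return " ".join(w for w in words if w not in TOY_STOP_WORDS)


cleaned = clean_text_sketch("The Senate is voting TODAY! Details at https://example.com/vote")
print(cleaned)  # senate voting today details
```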

## **Stage 3: Fact-Checking Misinformation**

We trained a TF-IDF and logistic regression classifier on a labeled fact-checking dataset from Kaggle to rate the accuracy of news content. The model classifies each post on a six-point scale ranging from “true” (1) to “completely-false” (6), which lets us assess the credibility of Reddit posts with more nuance than a binary verified-or-misleading label.

In [None]:
# Upload the test.csv file here
from google.colab import files
uploaded = files.upload()

In [None]:
#Creating an ML algorithm to detect fake news from headlines
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#adding the csv file that will be used to train the ML with fake and real news
#training_df = '/content/drive/My Drive/Reddit_folder/test.csv' # this csv are from my personal drive. Data is from: https://www.kaggle.com/datasets/arashnic/fake-claim-dataset/data

training = pd.read_csv("test.csv")

#mapping the labels in the training dataset to a numeric range (the label names come from the Kaggle dataset)
label_range = {
    "true": 1,
    "mostly-true": 2,
    "half-true": 3,
    "barely-true": 4,
    "false": 5,
    "pants-fire": 6
}

training['label'] = training['label'].map(label_range)

training = training.dropna(subset=['label'])#dropping rows with Na

# splitting dataset into features and labels
X = training['statement']  # using the text to train the ML
y = training['label']

# splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# using TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

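# logistic regression model for training
# (lbfgs already fits a multinomial model for multi-class targets; the explicit
# multi_class argument is deprecated in recent scikit-learn releases)
model = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000)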
model.fit(X_train_tfidf, y_train)

#using the post body (selftext) instead of the Reddit title to check for fake and real news
df["Cleaned_Selftext"] = df["selftext"].fillna("").astype(str).apply(clean_text)

# using the trained TF-IDF Vectorizer
reddit_text_tfidf = vectorizer.transform(df["Cleaned_Selftext"])

# predicting the accuracy label for each Reddit post body
reddit_predictions = model.predict(reddit_text_tfidf)

#adding our own range names:
inverse_label_range = {
    1: "true",
    2: "mostly-true",
    3: "half-true",
    4: "barely-true",
    5: "false",
    6: "completely-false"
}

# adding prediction labels
df['prediction'] = reddit_predictions
df['prediction'] = df['prediction'].map(inverse_label_range)


#printing the first few to check
df.head()
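The cell above imports `accuracy_score` and `classification_report` and builds a held-out split, but never scores the model on it. The following is a minimal sketch of that evaluation step on a hypothetical toy dataset with only two of the six labels; the statements are invented for illustration and say nothing about how the real model performs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy statements; the notebook uses the Kaggle 'statement' column.
statements = [
    "taxes went down last year", "taxes went up last year",
    "unemployment fell to a record low", "unemployment rose to a record high",
    "the bill passed the senate", "the bill failed in the senate",
    "crime dropped in major cities", "crime surged in major cities",
    "the budget was balanced", "the budget was never balanced",
    "voter turnout increased", "voter turnout collapsed",
]
labels = [1, 5] * 6  # 1 = "true", 5 = "false" on the notebook's scale

X_train, X_test, y_train, y_test = train_test_split(
    statements, labels, test_size=0.25, random_state=1, stratify=labels
)

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(class_weight="balanced", max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(X_train), y_train)

# Score the model on statements it did not see during training
y_pred = toy_model.predict(toy_vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"Held-out accuracy: {accuracy:.2f}")
print(report)
```

In the notebook itself, the same two metric calls could be applied to `model.predict(X_test_tfidf)` and `y_test` before the model is used to label Reddit posts.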

## **Stage 4: Classify Posts Using Sentiment Analysis**
We used NLTK for sentiment analysis to evaluate the tone of each post, classifying them as positive, neutral, or negative based on their content.

In [None]:
!pip install nltk
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Sentiment function: map VADER's compound score to a label
def get_sentiment(post):
    scores = analyzer.polarity_scores(post)
    compound = scores['compound']
    if compound >= 0.05:
        sentiment = 'positive'
    elif compound <= -0.05:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'
    return sentiment

# Apply function to get sentiment
df['Sentiment'] = df['Cleaned_Selftext'].apply(get_sentiment)

df.head()
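The thresholding inside `get_sentiment` can be checked without loading VADER. The sketch below uses a hypothetical helper, `label_from_compound`, that applies the same ±0.05 cut-offs to compound scores supplied by hand.

```python
def label_from_compound(compound, threshold=0.05):
    """Hypothetical helper mirroring the thresholds used in get_sentiment."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


sentiment_labels = [label_from_compound(c) for c in (0.6, 0.05, 0.0, -0.04, -0.7)]
print(sentiment_labels)
# ['positive', 'positive', 'neutral', 'neutral', 'negative']
```

Scores exactly at the boundary (0.05 or -0.05) fall into the positive or negative class, so only values strictly between the two thresholds count as neutral.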

## **Stage 5: Analyze Engagement Metrics**
We computed various engagement metrics such as comment activity, virality score, and controversy ratio to understand how users interact with different types of content. These metrics were then compared between factual and misleading posts to assess whether false claims tend to spread faster. Additionally, we analyzed and compared the average sentiment of factual posts versus misinformation to identify notable differences in tone.

**Comment Activity**

In [None]:
#convert num_comments to numeric
df['num_comments'] = pd.to_numeric(df['num_comments'], errors='coerce')
df['Comment_Activity'] = df['num_comments']
#print the results
print("Average number of comments per post:", df['Comment_Activity'].mean())
print("Top 5 most commented posts:")
top5 = df[['title', 'Comment_Activity']].sort_values(by='Comment_Activity', ascending=False).head().reset_index(drop=True)
print(top5)

**Virality Score**

In [None]:
import time
import pandas as pd

# Make sure the timestamp is numeric Unix time (seconds since epoch)
df['timestamp'] = pd.to_numeric(df['timestamp'], errors='coerce')

# Calculate post age in hours (how long the post has been online as of now)
current_time = time.time()
df['post_age_hours'] = (current_time - df['timestamp']) / 3600

# Set weights
W1 = 2  # Weight for comments
W2 = 1  # Added to the denominator to prevent division by zero

# Calculate virality score
df['virality_score'] = (df['upvotes'] + (df['num_comments'] * W1)) / (df['post_age_hours'] + W2)

# Display top viral posts
df[['post_id', 'title', 'virality_score']].sort_values(by='virality_score', ascending=False).head(10)
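As a worked example of the formula above, the sketch below wraps it in a hypothetical function, `virality_score`, and evaluates it for two posts with identical engagement but different ages. The input numbers are made up for illustration.

```python
def virality_score(upvotes, num_comments, post_age_hours, w1=2, w2=1):
    """Same formula as the cell above: weighted engagement per hour online."""
    return (upvotes + num_comments * w1) / (post_age_hours + w2)


fresh = virality_score(upvotes=100, num_comments=25, post_age_hours=9)
old = virality_score(upvotes=100, num_comments=25, post_age_hours=99)
print(fresh, old)  # 15.0 1.5
```

Because the post age is measured at the time the cell runs, the score depends on when the analysis is executed, and older posts are penalized even if they gathered their engagement quickly.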



**Controversy Ratio**

In [None]:
#coercing non-numeric values in upvotes and downvotes to NaN
df['upvotes'] = pd.to_numeric(df['upvotes'], errors='coerce')
df['downvotes'] = pd.to_numeric(df['downvotes'], errors='coerce')

#dropping the NaN rows and resetting the index so it stays contiguous for later cells
df = df.dropna(subset = ['upvotes', 'downvotes']).reset_index(drop=True)

#controversy ratio: upvotes divided by (downvotes + 1); the +1 avoids division by zero
df['controversy_ratio'] = df['upvotes']/(df['downvotes'] + 1)

#checking the first few
print(df.head())
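A small worked example of the ratio, together with the `log1p` transform that the plotting stage applies to it. The helper name `controversy_ratio` and the vote counts are hypothetical.

```python
import math


def controversy_ratio(upvotes, downvotes):
    """Upvotes per downvote, with +1 in the denominator to avoid division by zero."""
    return upvotes / (downvotes + 1)


ratio = controversy_ratio(upvotes=90, downvotes=9)
log_ratio = math.log1p(ratio)  # log(1 + ratio), which compresses large outliers
print(ratio, round(log_ratio, 4))  # 9.0 2.3026
```

With this definition, a higher value means relatively more upvotes than downvotes, so the more contested posts are the ones with lower ratios.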



**Cross-Subreddit Spread**

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a unique identifier for posts based on title and selftext
df['combined_text'] = df['Cleaned_Title'].fillna('') + ' ' + df['Cleaned_Selftext'].fillna('')

# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['combined_text'])

# Compute cosine similarity between posts
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Set a similarity threshold (e.g., 0.7) to consider posts as near-duplicates
similarity_threshold = 0.7

# Identify cross-subreddit misinformation spread
subreddit_spread = {}
for i in range(len(df)):
    similar_subreddits = set()
    for j in range(len(df)):
        if i != j and cosine_sim[i, j] > similarity_threshold:
            similar_subreddits.add(df.iloc[j]['subreddit'])
    subreddit_spread[df.iloc[i]['post_id']] = len(similar_subreddits)

# Convert to DataFrame
spread_df = pd.DataFrame(list(subreddit_spread.items()), columns=['post_id', 'unique_subreddit_count'])

# Calculate Cross-Subreddit Spread Score
# The Cross subreddit score is calculated as number of unique subreddits with similar posts divided by total number of subreddits in dataset
spread_df['cross_subreddit_score'] = spread_df['unique_subreddit_count'] / df['subreddit'].nunique()

# Look up each post's predicted label by post_id so the rows stay aligned
# even if df's index has gaps (e.g. after dropna)
pred_by_post = df.drop_duplicates('post_id').set_index('post_id')['prediction']
spread_labels = spread_df['post_id'].map(pred_by_post)

def average_spread(label):
    return spread_df.loc[spread_labels == label, 'cross_subreddit_score'].mean()

average_score_true = average_spread("true")
average_score_mostly_true = average_spread("mostly-true")
average_score_half_true = average_spread("half-true")
average_score_barely_true = average_spread("barely-true")
average_score_false = average_spread("false")
average_score_completely_false = average_spread("completely-false")

print("Average Cross-Subreddit Spread Score for Posts with True Information:", average_score_true)
print("Average Cross-Subreddit Spread Score for Posts with Mostly-True Information:", average_score_mostly_true)
print("Average Cross-Subreddit Spread Score for Posts with Half-True Information:", average_score_half_true)
print("Average Cross-Subreddit Spread Score for Posts with Barely-True Information:", average_score_barely_true)
print("Average Cross-Subreddit Spread Score for Posts with False Information:", average_score_false)
print("Average Cross-Subreddit Spread Score for Posts with Completely-False Information:", average_score_completely_false)

# Save results
spread_df.to_csv('cross_subreddit_spread.csv', index=False)
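The counting logic above can be illustrated on a toy case. The sketch below assumes a hand-written similarity matrix instead of real TF-IDF output and a hypothetical helper, `similar_subreddit_counts`. It masks the similarity matrix once with NumPy rather than comparing every pair inside nested Python loops.

```python
import numpy as np


def similar_subreddit_counts(sim, subreddits, threshold=0.7):
    """For each post, count the distinct subreddits of its near-duplicates."""
    above = np.asarray(sim, dtype=float) > threshold
    np.fill_diagonal(above, False)  # a post is not its own near-duplicate
    return [
        len({subreddits[j] for j in np.flatnonzero(row)})
        for row in above
    ]


# Hypothetical similarities between four posts
toy_sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.8, 0.2, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
toy_subreddits = ["politics", "news", "Conservative", "liberal"]

counts = similar_subreddit_counts(toy_sim, toy_subreddits)
scores = [c / len(set(toy_subreddits)) for c in counts]
print(counts)  # [2, 1, 1, 0]
print(scores)  # [0.5, 0.25, 0.25, 0.0]
```

The first post has near-duplicates in two other subreddits, giving a score of 2 / 4 = 0.5, while the last post has none and scores 0.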

## **Stage 6: Visualize Findings & Build Dashboard**


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

if df['datetime'].dtype != 'datetime64[ns]':
    df['datetime'] = pd.to_datetime(df['datetime'])

df = df[df['post_id'] != '1bwbuka']

start_date = pd.to_datetime("2023-06-01")
end_date = pd.to_datetime("2024-12-31")
df_filtered = df[(df['datetime'] >= start_date) & (df['datetime'] <= end_date)].copy()

df_filtered['date'] = df_filtered['datetime'].dt.date
daily_activity = df_filtered.groupby('date')['Comment_Activity'].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(data=daily_activity, x='date', y='Comment_Activity')
plt.title("Figure 1. Daily Average Comment Activity (August – December 2024)")
plt.xlabel("Date")
plt.ylabel("Avg num of comments")
plt.xlim([pd.to_datetime("2024-08-01"), pd.to_datetime("2024-12-31")])
plt.tight_layout()
plt.show()




**Average Comment Activity**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df['Comment_Activity'] = df['num_comments']

heatmap_data = df.pivot_table(
    index='subreddit',
    columns='prediction',
    values='Comment_Activity',
    aggfunc='mean'
)

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlGnBu", linewidths=0.5)

plt.title("Figure 2. Average Comment Activity by Subreddit and Verdict")
plt.xlabel("Predicted Verdict")
plt.ylabel("Subreddit")
plt.tight_layout()
plt.show()
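To show what the heatmap is built from, here is a minimal sketch of the same `pivot_table` call on a hypothetical four-row DataFrame. The values are invented; the point is the shape of the resulting table.

```python
import pandas as pd

# Hypothetical posts: two subreddits, two predicted labels
toy = pd.DataFrame({
    "subreddit": ["politics", "politics", "news", "news"],
    "prediction": ["true", "false", "true", "true"],
    "num_comments": [10, 30, 4, 6],
})

toy_table = toy.pivot_table(
    index="subreddit",
    columns="prediction",
    values="num_comments",
    aggfunc="mean",
)
print(toy_table)
```

Each cell holds the mean comment count for one subreddit and label pair. Combinations with no posts, such as r/news with the label "false" here, come out as NaN and are left blank in the heatmap.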



**Visualize Virality Score for Verified Versus Misleading Data**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate mean virality score per fact-check label
avg_virality = df.groupby("prediction")["virality_score"].mean()

# Set wider figure size
plt.figure(figsize=(10, 6))  # You can tweak the width (10) as needed

# Create bar chart (one bar per predicted label; a two-color palette does not cover six labels)
sns.barplot(x=avg_virality.index, y=avg_virality.values, color="steelblue")
plt.xlabel("Fact-Check Label")
plt.ylabel("Average Virality Score")
plt.title("Average Virality Score of Predicted Posts")
plt.show()


**Graphing the controversy ratio**

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#will graph the controversy ratio based on misleading vs verified posts

df['log_controversy_ratio'] = np.log1p(df['controversy_ratio']) #logging the controversy ratio to handle outliers

#making the order of the x-axis (the misleading vs truthful news) to be in order
in_order = ["completely-false", "false", "barely-true", "half-true", "mostly-true", "true"]

sns.boxplot(x='prediction', y = 'log_controversy_ratio', data=df, palette="Set3", order=in_order)
plt.title("Controversy Ratio of Misleading vs Truthful News")
plt.xlabel("Fact Check Label")
plt.ylabel("log(1 + controversy ratio)")
plt.show()

**Cross-Subreddit Spread Table With Results**

In [None]:
from prettytable import PrettyTable

# Create table to store the values
cross_subreddit_spread_table = PrettyTable(['Post Accuracy Prediction', 'Average Cross-Subreddit Spread Score'])

# Add rows to the table
cross_subreddit_spread_table.add_row(['true', average_score_true])
cross_subreddit_spread_table.add_row(['mostly-true', average_score_mostly_true])
cross_subreddit_spread_table.add_row(['half-true', average_score_half_true])
cross_subreddit_spread_table.add_row(['barely-true', average_score_barely_true])
cross_subreddit_spread_table.add_row(['false', average_score_false])
cross_subreddit_spread_table.add_row(['completely-false', average_score_completely_false])

# Print the table with results
print(cross_subreddit_spread_table)
