## Download Webis-TLDR-17

We download Webis-TLDR-17 from Hugging Face Datasets.

The dataset consists of 3,848,330 preprocessed Reddit posts (submissions and comments) that contain a "TL;DR" mention, posted between 2006 and 2016.

The posts span many subreddits. The content averages 270 words and the summary averages 28 words.

More information is available on the [Hugging Face dataset page](https://huggingface.co/datasets/webis/tldr-17).

In [1]:
# !pip install datasets
from datasets import load_dataset
import pandas as pd

# Stream the dataset record by record instead of downloading it in full first
dataset = load_dataset("webis/tldr-17", split="train", streaming=True, trust_remote_code=True)
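Because a streamed dataset is consumed record by record, it can help to look at a few records before running a full pass. The helper below is a minimal sketch: `peek` is a hypothetical name, and the stand-in rows only mimic the `subreddit` and `author` fields used later in this notebook.

```python
from itertools import islice


def peek(stream, n=3, fields=("subreddit", "author")):
    """Return the first n records of an iterable of dicts, keeping only `fields`."""
    return [{k: row.get(k) for k in fields} for row in islice(stream, n)]


# Stand-in rows for illustration only (not real data).
# With the real stream, this would be called as peek(dataset).
sample_stream = iter([
    {"subreddit": "MensRights", "author": "user_a", "content": "example text"},
    {"subreddit": "AskMen", "author": "user_b", "content": "example text"},
    {"subreddit": None, "author": "user_c", "content": "example text"},
])

preview = peek(sample_stream, n=2)
for record in preview:
    print(record)
```

Using `islice` stops after `n` records, so only the start of the stream is read.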

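Starting from the mensrights subreddit, we use user overlap between subreddits (a simple co-participation network) to identify related subreddits that could serve as additional data sources.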
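The overlap measure used in this notebook is the Jaccard similarity between the sets of authors of two subreddits: the number of shared authors divided by the number of distinct authors across both. A toy sketch, with made-up user names and a hypothetical helper `jaccard`:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up author sets for illustration only.
target_authors = {"u1", "u2", "u3", "u4"}
other_authors = {"u3", "u4", "u5"}

# 2 shared authors (u3, u4) out of 5 distinct authors overall.
score = jaccard(target_authors, other_authors)
print(score)
```

Since the union grows with the size of either subreddit, a large subreddit can receive a low score even when it shares many authors with the target in absolute terms.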

In [None]:
from collections import defaultdict
from tqdm import tqdm

subreddit_users = defaultdict(set)

for row in tqdm(dataset, desc="Processing Reddit posts"):
    sub = row["subreddit"]
    auth = row["author"]

    if sub is None or auth is None:
        continue

    sub = sub.lower()
    subreddit_users[sub].add(auth)


Processing Reddit posts: 3848330it [27:28, 2335.15it/s]


In [None]:
# ------------------------------------------------------------
# COMPUTE JACCARD SIMILARITY WITH TARGET SUBREDDIT
# ------------------------------------------------------------
TARGET = "mensrights"
MIN_USERS = 20
target_users = subreddit_users[TARGET]

results = []

for sub, users in subreddit_users.items():
    if sub == TARGET:
        continue
    if len(users) < MIN_USERS:
        continue  # ignore extremely small subreddits

    inter = len(users & target_users)
    union = len(users | target_users)
    
    if union > 0:
        jacc = inter / union
        results.append((sub, jacc, len(users)))

# Sort by similarity
results = sorted(results, key=lambda x: x[1], reverse=True)


In [None]:
# ------------------------------------------------------------
# SAVE RESULTS
# ------------------------------------------------------------
df_sim = pd.DataFrame(results, columns=["subreddit", "jaccard_similarity", "unique_users"])
df_sim.to_csv("../outputs/related_subreddits_user_overlap.csv", index=False)

df_sim

Unnamed: 0,subreddit,jaccard_similarity,unique_users
0,tumblrinaction,0.023218,2681
1,subredditdrama,0.016807,2122
2,feminism,0.016450,562
3,askmen,0.015078,7879
4,libertarian,0.015022,2378
...,...,...,...
4829,enoughtrumpspam,0.000000,29
4830,mobiusff,0.000000,39
4831,frankocean,0.000000,32
4832,dks3builds,0.000000,20


Based on a literature review, we also consider the subreddits theredpill and askmen relevant, because they are often spaces of misogynistic behaviour and anti-feminist debate.

In [2]:
selected_subreddits = [
    "mensrights",
    "theredpill",
    # "femradebates",
    # "tumblrinaction",
    "askmen"
]

selected_rows = []

for row in dataset:
    sub = row["subreddit"]
    
    if sub is None:
        continue
    
    if sub.lower() in selected_subreddits:
        selected_rows.append(row)


In [None]:
df_raw = pd.DataFrame(selected_rows)
df_raw.to_csv("../data/raw/data_raw.csv", index=False)

print("Saved", len(df_raw), "posts.")

Saved 24647 posts.


We now have a dataset of ~25k posts, which is a good starting point for BERTopic modelling, since these models tend to work better with 10k-20k documents or more.
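Before modelling, it may be worth checking how the selected posts are distributed across the subreddits, since one community could dominate the topics. The sketch below counts posts per lowercased subreddit name; `posts_per_subreddit` is a hypothetical helper and the rows are stand-ins, not real data.

```python
from collections import Counter


def posts_per_subreddit(rows):
    """Count rows per lowercased subreddit name, skipping rows without a subreddit."""
    return Counter(
        row["subreddit"].lower()
        for row in rows
        if row.get("subreddit") is not None
    )


# Stand-in rows for illustration only.
# With the real data, this would be called as posts_per_subreddit(selected_rows).
example_rows = [
    {"subreddit": "MensRights"},
    {"subreddit": "mensrights"},
    {"subreddit": "AskMen"},
    {"subreddit": None},
]

counts = posts_per_subreddit(example_rows)
for name, n_posts in counts.most_common():
    print(name, n_posts)
```

Lowercasing mirrors the `sub.lower()` comparison used when selecting the rows, so differently cased names are counted together.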