## Reddit dataset consolidation

This notebook gathers every CSV inside `Reddit Dataset/` (except the large `kaggle_RC_2019-05.csv`) and loads them with the correct headers provided in `headers.txt`.

In [1]:
from pathlib import Path
import pandas as pd
from tqdm.auto import tqdm
import re
import numpy as np
import torch
from sentence_transformers import SentenceTransformer


DATA_DIR = Path("Reddit Dataset")
EXCLUDE_FILES = {"kaggle_RC_2019-05.csv"}  # giant generic dump that dilutes signals

COLUMN_NAMES = [
    "text",
    "id",
    "subreddit",
    "meta",
    "time",
    "author",
    "ups",
    "downs",
    "authorlinkkarma",
    "authorkarma",
    "authorisgold",
]

csv_paths = sorted(
    path for path in DATA_DIR.glob("*.csv") if path.name not in EXCLUDE_FILES
)
meta_groups = sorted({path.stem.split("_", 1)[0] for path in csv_paths})

print(f"Found {len(csv_paths)} subreddit CSV files to combine.")
print("Meta groups:", meta_groups)
print("First 5 files:", [p.name for p in csv_paths[:5]])
print("Last 5 files:", [p.name for p in csv_paths[-5:]])


Found 49 subreddit CSV files to combine.
Meta groups: ['entertainment', 'gaming', 'humor', 'learning', 'lifestyle', 'news', 'television']
First 5 files: ['entertainment_comicbooks.csv', 'entertainment_harrypotter.csv', 'entertainment_movies.csv', 'entertainment_music.csv', 'entertainment_starwars.csv']
Last 5 files: ['television_gameofthrones.csv', 'television_himym.csv', 'television_mylittlepony.csv', 'television_startrek.csv', 'television_thewalkingdead.csv']


In [2]:
frames = []
total_filtered = 0

for csv_path in tqdm(csv_paths, desc="Loading subreddit CSVs"):
    df = pd.read_csv(csv_path)

    # Drop the exported pandas index (always the first column), then a rogue
    # placeholder column if one is still present
    df = df.drop(columns=df.columns[0])
    if len(df.columns) > len(COLUMN_NAMES):
        df = df.drop(columns=df.columns[0])

    if len(df.columns) != len(COLUMN_NAMES):
        raise ValueError(
            f"Unexpected column count {len(df.columns)} in {csv_path.name}."
        )

    df.columns = COLUMN_NAMES

    meta_group = csv_path.stem.split("_", 1)[0]
    before = len(df)
    df = df[df["meta"] == meta_group]
    filtered = before - len(df)
    if filtered:
        total_filtered += filtered
        print(f"Filtered {filtered} malformed rows in {csv_path.name}")

    frames.append(df)

combined_df = pd.concat(frames, ignore_index=True)
print(f"\nCombined shape: {combined_df.shape[0]:,} rows × {combined_df.shape[1]} columns")
print(f"Total malformed rows dropped: {total_filtered:,}")
combined_df.head()


Filtered 1 malformed rows in entertainment_comicbooks.csv
Filtered 1 malformed rows in entertainment_harrypotter.csv
Filtered 1 malformed rows in entertainment_movies.csv



Combined shape: 2,423,702 rows × 11 columns
Total malformed rows dropped: 3


Unnamed: 0,text,id,subreddit,meta,time,author,ups,downs,authorlinkkarma,authorkarma,authorisgold
0,sometimes they have a difference of opinion s...,d01727e,comicbooks,entertainment,1455577000.0,TheStealthBox,5.0,0.0,208.0,32044.0,0.0
1,try polysuede or felt that is acidfree or pass...,d02fswl,comicbooks,entertainment,1455661000.0,mrindustrialist,1.0,0.0,1.0,75.0,0.0
2,take them in to a second hand book store amp ...,d01qm82,comicbooks,entertainment,1455615000.0,matthew_lane,2.0,0.0,250.0,7710.0,0.0
3,a lot of cities have ways of getting comics in...,d01k3vi,comicbooks,entertainment,1455597000.0,Daiteach,3.0,0.0,439.0,11111.0,0.0
4,i m probably in the minority but even the wo...,d01km27,comicbooks,entertainment,1455598000.0,Nejfelt,2.0,0.0,150.0,918.0,0.0
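
The per-file cleanup in the loop above can be exercised on a tiny in-memory frame. This is only a sketch: the helper name `clean_frame`, the shortened column list, and the toy rows are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Shortened, hypothetical column list for the demo (the notebook uses 11 names).
DEMO_COLUMNS = ["text", "id", "subreddit", "meta"]

def clean_frame(raw: pd.DataFrame, meta_group: str, column_names=DEMO_COLUMNS) -> pd.DataFrame:
    """Drop the exported index column, rename positionally, keep rows whose meta matches."""
    df = raw.drop(columns=raw.columns[0])          # exported pandas index
    if len(df.columns) > len(column_names):        # optional placeholder column
        df = df.drop(columns=df.columns[0])
    if len(df.columns) != len(column_names):
        raise ValueError(f"Unexpected column count {len(df.columns)}.")
    df.columns = column_names
    # Rows whose fields were shifted no longer carry the expected meta label.
    return df[df["meta"] == meta_group].reset_index(drop=True)

raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "0": ["first comment", "second comment", "shifted row"],
    "1": ["a1", "a2", "a3"],
    "2": ["comicbooks", "comicbooks", "entertainment"],
    "3": ["entertainment", "entertainment", "1455577000"],  # last row is misaligned
})
cleaned = clean_frame(raw, "entertainment")
print(cleaned)
```

The third toy row has its fields shifted by one position, so its `meta` value is a timestamp string and the equality filter removes it, which is the same mechanism that produces the "Filtered … malformed rows" messages.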


In [3]:
print("Records per meta subreddit (top 10):")
print(combined_df["meta"].value_counts().head(10))

combined_df.sample(3, random_state=42)


Records per meta subreddit (top 10):
meta
gaming           428443
news             408716
lifestyle        384494
humor            382197
television       321794
learning         271179
entertainment    226879
Name: count, dtype: int64


Unnamed: 0,text,id,subreddit,meta,time,author,ups,downs,authorlinkkarma,authorkarma,authorisgold
1767257,i wish this sub would ban dumb shit like this ...,d01yzxb,libertarian,news,1455638000.0,AlCapone564,30.0,0.0,2794.0,1807.0,0.0
237144,if only mmr could get you attitude,d02kli8,dota2,gaming,1455668000.0,ShrikeGFX,1.0,0.0,276.0,2542.0,0.0
1747502,so basically you re fucked out of a good job o...,d02tety,conspiracy,news,1455682000.0,goober_boobz,1.0,0.0,190.0,2997.0,0.0


In [4]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running embeddings on: {DEVICE.upper()}")

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

RECENT_DAYS = 5            # emergence window
BASELINE_DAYS = 15         # compare with the preceding period
FRESHNESS_DAYS = 20        # ignore tokens first seen more than this many days before the end
MIN_RECENT_USES = 15       # ensures terms are genuinely active
TOP_CANDIDATE_POOL = 120   # rank this many before semantic dedup
TARGET_TERM_COUNT = 20     # number of terms to return
COSINE_DUP_THRESHOLD = 0.88  # similarity at or above this counts as a duplicate

TOKEN_REGEX = r"(?P<token>[a-zA-Z][a-zA-Z0-9'#_+\-]{1,24})"

STOPWORDS = {
    "the","and","you","that","with","this","have","your","from","they","them",
    "what","when","were","would","there","could","should","about","because",
    "their","just","like","cant","dont","doesnt","im","ive","ill","lets",
    "was","for","are","but",
}

def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


Running embeddings on: CUDA
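
To see what `TOKEN_REGEX` and the stopword filter actually keep, here is a small standalone sketch using plain `re` instead of pandas. The `tokenize` helper, the three-word stopword subset, and the sample sentence are illustrative assumptions.

```python
import re

TOKEN_REGEX = r"(?P<token>[a-zA-Z][a-zA-Z0-9'#_+\-]{1,24})"
DEMO_STOPWORDS = {"the", "and", "dont"}  # small subset of STOPWORDS for the demo

def tokenize(text: str) -> list[str]:
    """Lowercase, collapse whitespace, extract tokens, drop stopwords."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return [
        m.group("token")
        for m in re.finditer(TOKEN_REGEX, normalized)
        if m.group("token") not in DEMO_STOPWORDS
    ]

sample = "Bernie2016 is   THE #1 pick, C++ fans dont care"
tokens = tokenize(sample)
print(tokens)
# ['bernie2016', 'is', 'pick', 'c++', 'fans', 'care']
```

A token must start with a letter and be at least two characters long, so `#1` and single-letter words never match, while `c++` does because `+` is allowed after the first character. Short function words such as `is` survive unless they are listed in the stopword set.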


In [5]:
df = combined_df.loc[:, ["meta", "time", "text"]].copy()
df["event_dt"] = pd.to_datetime(df["time"], unit="s", utc=True).dt.floor("D")

analysis_end = df["event_dt"].max()
recent_start = analysis_end - pd.Timedelta(days=RECENT_DAYS - 1)
baseline_start = recent_start - pd.Timedelta(days=BASELINE_DAYS)
fresh_cutoff = analysis_end - pd.Timedelta(days=FRESHNESS_DAYS)

window_mask = df["event_dt"].between(baseline_start, analysis_end)
df = df.loc[window_mask].copy()
df["text_norm"] = df["text"].fillna("").map(normalize_text)
df = df.loc[df["text_norm"].str.len() > 0].reset_index(drop=True)

print(f"Filtered to {len(df):,} rows within [{baseline_start.date()} → {analysis_end.date()}].")

token_matches = (
    df["text_norm"]
    .str.extractall(TOKEN_REGEX)
    .reset_index()
    .rename(columns={"level_0": "row_idx"})  # level_0 is the source row position
)

token_df = token_matches.merge(
    df[["meta", "event_dt", "text_norm"]],
    left_on="row_idx",
    right_index=True,
    how="left",
)

token_df = token_df.loc[~token_df["token"].isin(STOPWORDS)].reset_index(drop=True)

print(f"Extracted {len(token_df):,} token-context rows "
      f"({token_df['token'].nunique():,} unique tokens).")


Filtered to 2,183,806 rows within [2016-01-29 → 2016-02-17].
Extracted 59,458,099 token-context rows (60,070 unique tokens).
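
The window arithmetic from the cell above can be checked in isolation. The sketch below wraps it in a hypothetical helper, `window_bounds`, and plugs in the `analysis_end` date reported in the output.

```python
import pandas as pd

RECENT_DAYS, BASELINE_DAYS, FRESHNESS_DAYS = 5, 15, 20

def window_bounds(analysis_end: pd.Timestamp) -> dict:
    """Reproduce the date boundaries derived from the last observed day."""
    recent_start = analysis_end - pd.Timedelta(days=RECENT_DAYS - 1)
    baseline_start = recent_start - pd.Timedelta(days=BASELINE_DAYS)
    fresh_cutoff = analysis_end - pd.Timedelta(days=FRESHNESS_DAYS)
    return {
        "fresh_cutoff": fresh_cutoff,
        "baseline_start": baseline_start,
        "baseline_end": recent_start - pd.Timedelta(days=1),
        "recent_start": recent_start,
        "analysis_end": analysis_end,
    }

bounds = window_bounds(pd.Timestamp("2016-02-17", tz="UTC"))
for name, ts in bounds.items():
    print(f"{name:>15}: {ts.date()}")
```

With these defaults the baseline runs 2016-01-29 to 2016-02-12 and the recent window 2016-02-13 to 2016-02-17, matching the printed range. The freshness cutoff lands on 2016-01-28, one day before the first day that is loaded.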


In [6]:
# Recent window: the last RECENT_DAYS days. Baseline: the BASELINE_DAYS days before it.
recent_mask = token_df["event_dt"] >= recent_start
baseline_mask = token_df["event_dt"].between(baseline_start, recent_start - pd.Timedelta(days=1))

recent_counts = (
    token_df.loc[recent_mask]
    .groupby(["meta", "token"])
    .size()
    .rename("recent_freq")
)

baseline_counts = (
    token_df.loc[baseline_mask]
    .groupby(["meta", "token"])
    .size()
    .rename("baseline_freq")
)

first_seen = (
    token_df.groupby(["meta", "token"])["event_dt"]
    .min()
    .rename("first_seen")
)

last_context = (
    token_df.loc[recent_mask]
    .sort_values("event_dt")
    .groupby(["meta", "token"])
    .agg(
        last_seen=("event_dt", "max"),
        example_context=("text_norm", "last")
    )
)

candidate_stats = (
    recent_counts.to_frame()
    .join(baseline_counts, how="left")
    .join(first_seen, how="left")
    .join(last_context, how="left")
    .fillna({"baseline_freq": 0})
    .reset_index()
)

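# Copy after filtering so the column assignments below do not act on a view.
# Note: first_seen is computed on rows already restricted to the window that
# starts at baseline_start, which is later than fresh_cutoff with the defaults
# above, so the freshness condition excludes nothing as written.
candidate_stats = candidate_stats[
    (candidate_stats["recent_freq"] >= MIN_RECENT_USES) &
    (candidate_stats["first_seen"] >= fresh_cutoff)
].copy()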

candidate_stats["growth_ratio"] = (candidate_stats["recent_freq"] + 1) / (candidate_stats["baseline_freq"] + 1)
candidate_stats["novelty_score"] = candidate_stats["recent_freq"] * candidate_stats["growth_ratio"]

candidate_stats = candidate_stats.sort_values("novelty_score", ascending=False)

print(f"Candidate pool after filters: {len(candidate_stats)} token/meta pairs.")
candidate_stats.head()


Candidate pool after filters: 111075 token/meta pairs.


Unnamed: 0,meta,token,recent_freq,baseline_freq,first_seen,last_seen,example_context,growth_ratio,novelty_score
43065,humor,sanders,188700,0.0,2016-02-13 00:00:00+00:00,2016-02-17 00:00:00+00:00,but they were their party was literally called...,188701.0,35607880000.0
39397,humor,kanye,77245,0.0,2016-02-14 00:00:00+00:00,2016-02-17 00:00:00+00:00,kanye lost his soul when his moma died,77246.0,5966867000.0
45786,humor,west,74898,0.0,2016-02-14 00:00:00+00:00,2016-02-17 00:00:00+00:00,here s a link to the story with a video it s i...,74899.0,5609785000.0
40243,humor,mentally,74298,0.0,2016-02-16 00:00:00+00:00,2016-02-17 00:00:00+00:00,they basically are children mentally they re t...,74299.0,5520267000.0
41241,humor,pablo,67649,0.0,2016-02-15 00:00:00+00:00,2016-02-17 00:00:00+00:00,downpablo if you must but i am going to listen...,67650.0,4576455000.0
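
The two scoring columns can be reproduced by hand from the counts in the table. The sketch below uses a hypothetical helper, `novelty_score`, with the `sanders` and `bernie` counts for `humor` taken from the notebook output.

```python
def novelty_score(recent_freq: int, baseline_freq: int) -> tuple[float, float]:
    """Return (growth_ratio, novelty_score) using the add-one smoothing from the cell above."""
    growth_ratio = (recent_freq + 1) / (baseline_freq + 1)
    return growth_ratio, recent_freq * growth_ratio

examples = [("sanders", 188700, 0), ("bernie", 191947, 50)]
scores = {token: novelty_score(recent, baseline) for token, recent, baseline in examples}

for token, (growth, novelty) in scores.items():
    print(f"{token:>8}: growth_ratio={growth:,.3f}  novelty_score={novelty:,.0f}")
```

Because of the add-one smoothing, a token with a zero baseline gets a growth ratio of `recent_freq + 1`, so its novelty score grows roughly with the square of its recent count. That is why `sanders` (baseline 0) outranks `bernie` (baseline 50) by a wide margin even though `bernie` is slightly more frequent in the recent window.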


In [7]:
if candidate_stats.empty:
    print("❌ No terms satisfy the frequency/freshness criteria. "
          "Relax the thresholds and re-run.")
else:
    pool = candidate_stats.head(TOP_CANDIDATE_POOL).copy()
    pool["context_for_embed"] = pool["token"] + " || " + pool["example_context"].fillna("")

    model = SentenceTransformer(MODEL_NAME, device=DEVICE)
    embeddings = model.encode(
        pool["context_for_embed"].tolist(),
        batch_size=256,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    )
    pool["embedding"] = list(embeddings)

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    selected_rows = []
    for idx, row in pool.iterrows():
        emb = row["embedding"]
        if all(cosine_sim(emb, sel["embedding"]) < COSINE_DUP_THRESHOLD for sel in selected_rows):
            selected_rows.append(row)
        if len(selected_rows) == TARGET_TERM_COUNT:
            break

    if len(selected_rows) < TARGET_TERM_COUNT:
        print(f"⚠️ Only {len(selected_rows)} unique terms after dedup; "
              f"no more high-quality candidates available.")
        selected_rows = selected_rows or pool.head(TARGET_TERM_COUNT).to_dict("records")

    candidates = (
        pd.DataFrame(selected_rows)
          .drop(columns=["embedding", "context_for_embed"])
          .reset_index(drop=True)
          .assign(
              baseline_window=f"{baseline_start.date()} → {(recent_start - pd.Timedelta(days=1)).date()}",
              recent_window=f"{recent_start.date()} → {analysis_end.date()}",
          )
    )

    display(candidates[[
        "meta", "token", "recent_freq", "baseline_freq", "growth_ratio",
        "novelty_score", "first_seen", "last_seen", "example_context",
        "baseline_window", "recent_window",
    ]])

    print(f"\nReturned {len(candidates)} candidate terms ready for SIR modeling, "
          "semantic tracking, and downstream analysis.")




Unnamed: 0,meta,token,recent_freq,baseline_freq,growth_ratio,novelty_score,first_seen,last_seen,example_context,baseline_window,recent_window
0,humor,sanders,188700,0.0,188701.0,35607880000.0,2016-02-13 00:00:00+00:00,2016-02-17 00:00:00+00:00,but they were their party was literally called...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
1,humor,kanye,77245,0.0,77246.0,5966867000.0,2016-02-14 00:00:00+00:00,2016-02-17 00:00:00+00:00,kanye lost his soul when his moma died,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
2,humor,west,74898,0.0,74899.0,5609785000.0,2016-02-14 00:00:00+00:00,2016-02-17 00:00:00+00:00,here s a link to the story with a video it s i...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
3,humor,mentally,74298,0.0,74299.0,5520267000.0,2016-02-16 00:00:00+00:00,2016-02-17 00:00:00+00:00,they basically are children mentally they re t...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
4,humor,pablo,67649,0.0,67650.0,4576455000.0,2016-02-15 00:00:00+00:00,2016-02-17 00:00:00+00:00,downpablo if you must but i am going to listen...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
5,humor,ber,62550,0.0,62551.0,3912565000.0,2016-02-14 00:00:00+00:00,2016-02-17 00:00:00+00:00,ber nie ber nie ber nie ber nie ber nie ber ni...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
6,humor,bernie,191947,50.0,3763.686275,722428300.0,2016-02-12 00:00:00+00:00,2016-02-17 00:00:00+00:00,i had to setup bernie his own deskbed so he wo...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
7,lifestyle,bike,26669,0.0,26670.0,711262200.0,2016-02-13 00:00:00+00:00,2016-02-17 00:00:00+00:00,this was my experience shopping for ducs and b...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
8,humor,time,18972,0.0,18973.0,359955800.0,2016-02-13 00:00:00+00:00,2016-02-17 00:00:00+00:00,ugh these things always make me cringe but eve...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17
9,news,bernie,17583,0.0,17584.0,309179500.0,2016-02-13 00:00:00+00:00,2016-02-17 00:00:00+00:00,kind of sounds like what bernie sanders has be...,2016-01-29 → 2016-02-12,2016-02-13 → 2016-02-17



Returned 20 candidate terms ready for SIR modeling, semantic tracking, and downstream analysis.
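
The greedy deduplication step can be illustrated without loading a model. The following sketch runs the same keep-if-dissimilar rule on toy 2-D vectors; the function name `greedy_dedup` and the vectors are assumptions for illustration, and real embeddings would come from `model.encode`.

```python
import numpy as np

def greedy_dedup(embeddings: np.ndarray, threshold: float, target: int) -> list[int]:
    """Walk rows in rank order; keep a row only if it is below `threshold`
    cosine similarity to every row kept so far. Stop once `target` rows are kept."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-9, None)  # dot product == cosine similarity
    kept: list[int] = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < threshold for j in kept):
            kept.append(i)
        if len(kept) == target:
            break
    return kept

# Rows are assumed to be sorted by novelty score, best first.
toy = np.array([
    [1.00, 0.00],   # kept (first row is always kept)
    [0.99, 0.10],   # near-duplicate of row 0, skipped
    [0.00, 1.00],   # orthogonal to row 0, kept
    [0.70, 0.70],   # about 0.71 similarity to rows 0 and 2, kept
])
kept = greedy_dedup(toy, threshold=0.88, target=3)
print(kept)
# [0, 2, 3]
```

The rule only removes candidates whose token-plus-context text embeds close to an earlier pick. It does not judge whether a token is slang, so frequent everyday words pass through as long as their contexts differ.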


The 20 candidate terms above (the first 10 are shown) were selected by ranking tokens on recent-versus-baseline frequency growth and then deduplicating them with sentence embeddings. However, many of them are still common words rather than true slang, so alternative methods are needed to identify evolving terms.