# Bluesky dataset overview (mounted)

This notebook is a **lightweight, streaming-friendly overview** for the Bluesky dataset from Zenodo record **14669616**.

**Key goal:** follow the same workflow as your Higgs notebook: you **mount the dataset on the server** and point the notebook at the mounted folder, so the notebook itself downloads nothing.

What you'll do here:

1. Locate your repo root and load `SETTINGS.DATA_ROOT` from `src.core.config`
2. Point to the mounted dataset directory
3. Inspect what files are present (top-level)
4. Peek at the main CSV/CSV.GZ files without loading them fully
5. Compute basic graph stats for `followers.csv.gz` **streaming in chunks**
6. Peek inside large `.tar.gz` archives **without extracting them**


In [1]:
# ============================================================
# 0) Project setup (same pattern as Higgs notebook)
#    - find repo root by locating `src/`
#    - import SETTINGS from src.core.config
# ============================================================
from pathlib import Path
import sys

def find_repo_root(start: Path | None = None) -> Path:
    start = (start or Path.cwd()).resolve()
    for p in [start, *start.parents]:
        if (p / "src").is_dir():
            return p
    raise FileNotFoundError("Could not find repo root containing `src/`")

REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from src.core.config import SETTINGS  # noqa: E402

print("REPO_ROOT:", REPO_ROOT)
print("DATA_ROOT:", SETTINGS.DATA_ROOT)


REPO_ROOT: /mnt/c/Users/rescic/PycharmProjects/dezinfo-datasets
DATA_ROOT: /home/rescic/dezinfo_data


## 1) Point to the mounted dataset folder

Mount the Zenodo dataset somewhere under `SETTINGS.DATA_ROOT`.

Set `DATASET_SUBDIR` to the folder name you mounted (for example `bluesky_14669616/` or similar).
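
If you are unsure which folder name to use, a small helper can list the subfolders of the data root that look like this dataset. This is only a sketch: `candidate_dataset_dirs` and its `marker` default are illustrative choices and are not part of `src.core.config`.

```python
from pathlib import Path


def candidate_dataset_dirs(data_root: Path, marker: str = "followers.csv.gz") -> list[str]:
    """Return names of immediate subfolders of `data_root` that contain `marker`."""
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(
        d.name for d in root.iterdir() if d.is_dir() and (d / marker).is_file()
    )


# Possible usage in this notebook (SETTINGS comes from the setup cell):
# print(candidate_dataset_dirs(Path(SETTINGS.DATA_ROOT)))
```

Any name it returns is a reasonable value for `DATASET_SUBDIR`; an empty list suggests the mount is missing or nested one level deeper.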


In [3]:
from pathlib import Path

# Change this to match your mount folder name under DATA_ROOT:
DATASET_SUBDIR = "bluesky"  # <-- EDIT ME

DATASET_DIR = Path(SETTINGS.DATA_ROOT) / DATASET_SUBDIR
assert DATASET_DIR.exists(), f"Mounted dataset folder not found: {DATASET_DIR}"
print("DATASET_DIR:", DATASET_DIR)


DATASET_DIR: /home/rescic/dezinfo_data/bluesky


## 2) Discover what's in the dataset directory

We look for the files we expect from the Zenodo record, but we **don't assume** they're all present, because different mounts or versions can vary.

This cell prints:

- the expected filenames that were found, with their full paths
- a quick top-level directory listing (first ~50 entries)
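
Because some of these files are large, it can help to see sizes next to names before opening anything. The sketch below is not part of the original notebook; `human_size` and `list_with_sizes` are hypothetical helpers.

```python
from pathlib import Path


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} {units[-1]}"


def list_with_sizes(directory: Path, limit: int = 50) -> list[tuple[str, str]]:
    """Return (name, size) pairs for the first `limit` entries, sorted by name."""
    rows = []
    for p in sorted(Path(directory).iterdir())[:limit]:
        # Directories get a placeholder instead of a size.
        size = human_size(p.stat().st_size) if p.is_file() else "<dir>"
        rows.append((p.name, size))
    return rows


# Possible usage in this notebook:
# for name, size in list_with_sizes(DATASET_DIR):
#     print(f"  {name:30s} {size}")
```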


In [4]:
EXPECTED = [
    "followers.csv.gz",
    "interactions.csv.gz",
    "feed_bookmarks.csv",
    "user_posts.tar.gz",
    "graphs.tar.gz",
    "feed_posts.tar.gz",
    "feed_posts_likes.tar.gz",
]

found = {}
for name in EXPECTED:
    p = DATASET_DIR / name
    if p.exists():
        found[name] = p

print("Found expected files:")
for k, v in found.items():
    print(f"  - {k}: {v}")

print("\nTop-level listing (first 50 entries):")
for p in sorted(DATASET_DIR.iterdir())[:50]:
    print(" ", p.name)


Found expected files:
  - followers.csv.gz: /home/rescic/dezinfo_data/bluesky/followers.csv.gz
  - interactions.csv.gz: /home/rescic/dezinfo_data/bluesky/interactions.csv.gz
  - feed_bookmarks.csv: /home/rescic/dezinfo_data/bluesky/feed_bookmarks.csv
  - user_posts.tar.gz: /home/rescic/dezinfo_data/bluesky/user_posts.tar.gz
  - graphs.tar.gz: /home/rescic/dezinfo_data/bluesky/graphs.tar.gz
  - feed_posts.tar.gz: /home/rescic/dezinfo_data/bluesky/feed_posts.tar.gz
  - feed_posts_likes.tar.gz: /home/rescic/dezinfo_data/bluesky/feed_posts_likes.tar.gz

Top-level listing (first 50 entries):
  feed_bookmarks.csv
  feed_posts.tar.gz
  feed_posts_likes.tar.gz
  followers.csv.gz
  graphs.tar.gz
  interactions.csv.gz
  scripts
  scripts.tar.gz
  user_posts.tar.gz


## 3) Peek at CSV / CSV.GZ files (schema + a few rows)

These CSV files can be very large, so we read only the first few rows to check the columns and get a feel for the data.

Notes:

- `pandas.read_csv(..., nrows=...)` stops after `nrows` rows, so a peek stays cheap regardless of file size
- for `.csv.gz`, pandas infers gzip compression from the file extension (the default `compression="infer"`)
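
For a peek that does not depend on pandas, the standard library can do the same job: `gzip.open` in text mode decompresses lazily, and `itertools.islice` stops after the requested number of rows. This is a minimal sketch; `peek_rows` is a hypothetical name and is not used elsewhere in this notebook.

```python
import csv
import gzip
from itertools import islice
from pathlib import Path


def peek_rows(path: Path, nrows: int = 5) -> list[list[str]]:
    """Return the first `nrows` CSV rows as lists of strings, reading lazily."""
    path = Path(path)
    # Choose the opener from the extension, mirroring what pandas infers.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as fh:
        return list(islice(csv.reader(fh), nrows))


# Possible usage in this notebook:
# peek_rows(DATASET_DIR / "followers.csv.gz", nrows=3)
```

Unlike `read_csv`, this returns raw strings with no type inference, which is sometimes what you want when checking whether the first line is a header.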


In [5]:
import pandas as pd

def peek_csv(path: Path, nrows: int = 5, **kwargs) -> pd.DataFrame:
    return pd.read_csv(path, nrows=nrows, **kwargs)

# followers.csv.gz is typically a 2-column edge list: (u, v)
if "followers.csv.gz" in found:
    followers_head = peek_csv(found["followers.csv.gz"], nrows=10, header=None, names=["u", "v"])
    display(followers_head)

# interactions.csv.gz typically has 6 integer columns:
# (user_id, replied_author, thread_root_author, reposted_author, quoted_author, date)
# Author columns that do not apply to a row are empty and show up as NaN in pandas.
if "interactions.csv.gz" in found:
    interactions_head = peek_csv(
        found["interactions.csv.gz"],
        nrows=10,
        header=None,
        names=["user_id", "replied_author", "thread_root_author", "reposted_author", "quoted_author", "date"],
    )
    display(interactions_head)

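# feed_bookmarks.csv: read with pandas' default header inference. If the file has no
# header row, pandas promotes the first record to column names (as in the output below);
# pass header=None to keep that record as data.
if "feed_bookmarks.csv" in found:
    feed_bookmarks_head = peek_csv(found["feed_bookmarks.csv"], nrows=10)
    display(feed_bookmarks_head)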


Unnamed: 0,u,v
0,0,1
1,0,10
2,0,100
3,0,101
4,0,102
5,0,103
6,0,104
7,0,105
8,0,106
9,0,107


Unnamed: 0,user_id,replied_author,thread_root_author,reposted_author,quoted_author,date
0,836672,,,833271,,202309192352
1,836672,,,61971,,202310021913
2,836672,,,47191,,202309231547
3,836672,,,17234,,202309301358
4,836672,,,20490,,202307261536
5,836672,,,61971,,202309162126
6,836672,,,44299,,202309201818
7,836672,,,61971,,202309252042
8,836672,,,20490,,202309232023
9,836672,,,20490,,202308011726


Unnamed: 0,Science,408833,202309192111
0,Science,204992,202307290107
1,Science,1798953,202309232103
2,Science,1428436,202311051321
3,Science,976464,202309131041
4,Science,325292,202310130606
5,Science,472493,202308100948
6,Science,1425781,202307290147
7,Science,147657,202306081030
8,Science,88914,202311091201
9,Science,1667057,202309221235


## 4) Streaming graph stats from `followers.csv.gz`

`followers.csv.gz` is a directed follower graph. To avoid loading everything into RAM, we:

- stream the file in `chunksize` batches
- update `out_degree` and `in_degree` counters incrementally
- report:
  - number of edges
  - number of unique sources/targets seen
  - basic distribution summaries

If the dataset is extremely large, you can **reduce memory** further by:
- computing only summary stats (without storing full counters), or
- sampling edges.
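
The sampling option can be done in a single streaming pass with reservoir sampling (Algorithm R), which keeps a uniform random sample of `k` edges in O(k) memory no matter how long the file is. This is a sketch of one possible approach, not what the cell below does; `reservoir_sample` is a hypothetical helper and the default seed is arbitrary.

```python
import random
from typing import Iterable, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> list[T]:
    """Return up to `k` items chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    sample: list[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy usage on a synthetic edge stream:
edges = ((u, u + 1) for u in range(1_000))
print(reservoir_sample(edges, k=5))
```

To apply it to `followers.csv.gz`, one could feed it `(u, v)` pairs produced chunk by chunk from the same chunked `read_csv` loop. Degrees computed on such a sample are scaled down by roughly `k / m`, so they are useful for distribution shape rather than exact counts.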


In [6]:
from collections import Counter
import numpy as np

def stream_edge_counts_from_csv_gz(path: Path, chunksize: int = 2_000_000) -> tuple[int, Counter, Counter]:
    out_deg = Counter()
    in_deg = Counter()
    m = 0

    for chunk in pd.read_csv(path, header=None, names=["u", "v"], chunksize=chunksize):
        m += len(chunk)
        # Aggregate within the chunk first (vectorized), then merge the per-chunk
        # counts; this avoids building a Python list with one entry per edge.
        out_deg.update(chunk["u"].astype("int64").value_counts().to_dict())
        in_deg.update(chunk["v"].astype("int64").value_counts().to_dict())

    return m, out_deg, in_deg

def summarize_counter(counter: Counter, label: str) -> None:
    vals = np.fromiter(counter.values(), dtype=np.int64)
    if len(vals) == 0:
        print(f"{label}: (empty)")
        return
    qs = np.quantile(vals, [0.5, 0.9, 0.99, 0.999])
    print(f"\n{label} summary")
    print("  nodes:", len(vals))
    print("  mean:", float(vals.mean()))
    print("  median/p90/p99/p99.9:", qs)

if "followers.csv.gz" in found:
    m_edges, out_deg, in_deg = stream_edge_counts_from_csv_gz(found["followers.csv.gz"], chunksize=1_000_000)
    print("Followers edges:", m_edges)
    print("Unique sources (out-degree):", len(out_deg))
    print("Unique targets (in-degree):", len(in_deg))

    summarize_counter(out_deg, "Out-degree")
    summarize_counter(in_deg, "In-degree")
else:
    print("followers.csv.gz not found; skipping follower graph stats.")


Followers edges: 144581603
Unique sources (out-degree): 3956887
Unique targets (in-degree): 3222994

Out-degree summary
  nodes: 3956887
  mean: 36.53922970254142
  median/p90/p99/p99.9: [   9.   77.  394. 1362.]

In-degree summary
  nodes: 3222994
  mean: 44.85940805350553
  median/p90/p99/p99.9: [   5.      70.     581.    3246.014]


## 5) Peek inside `.tar.gz` archives without extracting

Some parts of the dataset are shipped as `.tar.gz` archives (e.g., `user_posts.tar.gz`, `graphs.tar.gz`).

Extracting can be slow and duplicates a lot of data on disk, so instead we:

- list a few member filenames inside the tarball
- read the first few lines of a selected member file
- if the member appears to be JSONL, parse the first few JSON objects
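
The idea can be tried on a tiny in-memory archive first. Iterating over a `tarfile.TarFile` yields members lazily as the archive is read, so returning at the first regular file avoids decompressing the rest. This is a sketch under made-up data: `first_lines_of_first_member`, `make_demo_archive`, and the member name `posts/part-0.jsonl` are all hypothetical.

```python
import io
import tarfile
from itertools import islice


def first_lines_of_first_member(fileobj, nlines: int = 3) -> tuple[str, list[str]]:
    """Return (member name, first lines) for the first regular file in a .tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tf:
        for member in tf:  # lazy: headers are read one at a time
            if not member.isfile():
                continue
            extracted = tf.extractfile(member)
            if extracted is None:
                continue
            text = io.TextIOWrapper(extracted, encoding="utf-8", errors="replace")
            return member.name, [line.rstrip("\n") for line in islice(text, nlines)]
    raise ValueError("archive contains no regular files")


def make_demo_archive() -> io.BytesIO:
    """Build a small .tar.gz in memory with one JSONL member (synthetic data)."""
    payload = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n{"id": 4}\n'
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        info = tarfile.TarInfo(name="posts/part-0.jsonl")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


name, lines = first_lines_of_first_member(make_demo_archive(), nlines=2)
print(name, lines)
```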


In [None]:
import tarfile
import io
import json
from itertools import islice

def _select_member(tf: tarfile.TarFile, member_name: str | None) -> tarfile.TarInfo | None:
    # Iterate lazily: tf.getmembers() would scan (and decompress) the whole archive.
    # Returns the file named `member_name`; if it is None or not found, the first file.
    first = None
    for m in tf:
        if not m.isfile():
            continue
        if first is None:
            first = m
        if member_name is None or m.name == member_name:
            return m
    return first

def list_tar_members(tar_path: Path, max_members: int = 30) -> list[str]:
    with tarfile.open(tar_path, mode="r:gz") as tf:
        # islice stops reading the archive once max_members files have been seen.
        files = (m.name for m in tf if m.isfile())
        return list(islice(files, max_members))

def peek_first_text_lines_in_tar(
    tar_path: Path,
    member_name: str | None = None,
    nlines: int = 5,
    encoding: str = "utf-8",
) -> list[str]:
    lines: list[str] = []
    with tarfile.open(tar_path, mode="r:gz") as tf:
        target = _select_member(tf, member_name)
        assert target is not None, f"No file members in {tar_path}"

        f = tf.extractfile(target)
        assert f is not None, f"Could not extract member: {target.name}"
        txt = io.TextIOWrapper(f, encoding=encoding, errors="replace")
        for _ in range(nlines):
            line = txt.readline()
            if not line:
                break
            lines.append(line.rstrip("\n"))
    return lines

def peek_first_json_objects_in_tar_jsonl(
    tar_path: Path,
    member_name: str | None = None,
    n: int = 3,
) -> list[dict]:
    objs: list[dict] = []
    with tarfile.open(tar_path, mode="r:gz") as tf:
        target = _select_member(tf, member_name)
        assert target is not None, f"No file members in {tar_path}"

        f = tf.extractfile(target)
        assert f is not None, f"Could not extract member: {target.name}"
        txt = io.TextIOWrapper(f, encoding="utf-8", errors="replace")
        for line in txt:
            line = line.strip()
            if not line:
                continue
            try:
                objs.append(json.loads(line))
            except json.JSONDecodeError:
                # Not JSONL (or a malformed line): stop and let the caller fall back.
                break
            if len(objs) >= n:
                break
    return objs

# user posts archive: often JSONL
if "user_posts.tar.gz" in found:
    tarp = found["user_posts.tar.gz"]
    members = list_tar_members(tarp, max_members=20)
    print("user_posts.tar.gz sample members:")
    for n in members:
        print(" ", n)

    objs = peek_first_json_objects_in_tar_jsonl(tarp, member_name=None, n=3)
    if objs:
        print("\nFirst JSON objects from first member:")
        display(objs)
    else:
        print("\nCould not parse JSON objects from first member; showing raw lines instead:")
        lines = peek_first_text_lines_in_tar(tarp, member_name=None, nlines=10)
        print("\n".join(lines))

# graphs archive: could be edge lists / csv / jsonl depending on release
if "graphs.tar.gz" in found:
    tarp = found["graphs.tar.gz"]
    members = list_tar_members(tarp, max_members=20)
    print("\ngraphs.tar.gz sample members:")
    for n in members:
        print(" ", n)

    lines = peek_first_text_lines_in_tar(tarp, member_name=None, nlines=10)
    print("\nFirst 10 lines from first member:")
    print("\n".join(lines))
