# TGDataset (Zenodo 7640712) — Mounted Archive Overview

This notebook inspects **TGDataset** when it is mounted on your server as one or more `*.tar.gz` parts (e.g. `TGDataset_1.tar.gz`, ...). It **does not download or extract** anything; it reads metadata and small samples directly from the archives.

In [1]:
# ============================================================
# 0) Project setup (same pattern as your other notebooks)
# ============================================================
from pathlib import Path
import sys

def find_repo_root(start: Path | None = None) -> Path:
    start = (start or Path.cwd()).resolve()
    for p in [start, *start.parents]:
        if (p / "src").is_dir():
            return p
    raise FileNotFoundError("Could not find repo root containing `src/`")

REPO_ROOT = find_repo_root()
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from src.core.config import SETTINGS  # noqa: E402

print("REPO_ROOT:", REPO_ROOT)
print("DATA_ROOT:", SETTINGS.DATA_ROOT)


REPO_ROOT: /mnt/c/Users/rescic/PycharmProjects/dezinfo-datasets
DATA_ROOT: /home/rescic/dezinfo_data


## Locate the mounted TGDataset archives

We assume your mount folder is named `tgdataset` under `SETTINGS.DATA_ROOT`, and it contains one or more `*.tar.gz` parts (e.g. `TGDataset_1.tar.gz` ...). We collect all archives and sort them for stable processing.
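
A plain `sorted()` compares file names as strings. That is fine for single-digit part numbers, but a hypothetical `TGDataset_10.tar.gz` would sort before `TGDataset_2.tar.gz`. The sketch below shows one possible numeric-aware sort key. The helper name `natural_key` and the part names are illustrative assumptions, not part of this project.

```python
import re
from pathlib import Path

def natural_key(path: Path) -> list:
    """Split the file name into text and integer chunks so that 2 sorts before 10."""
    # re.split with a capture group always alternates text, digits, text, ...
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]

# Hypothetical part names, used only to illustrate the ordering.
parts = [Path(f"TGDataset_{i}.tar.gz") for i in (10, 2, 1)]

lexicographic = [p.name for p in sorted(parts)]
numeric_aware = [p.name for p in sorted(parts, key=natural_key)]
print(lexicographic)
print(numeric_aware)
```

If the archive set ever grows past nine parts, `sorted(DATASET_DIR.glob("*.tar.gz"), key=natural_key)` would keep the processing order aligned with the part numbers.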

In [9]:
# ============================================================
# 1) Mounted dataset path (NO downloading / NO extracting)
# ============================================================
DATASET_SUBDIR = "tgdataset"   # mount folder name under DATA_ROOT
DATASET_DIR = Path(SETTINGS.DATA_ROOT) / DATASET_SUBDIR
assert DATASET_DIR.exists(), f"Mounted dataset folder not found: {DATASET_DIR}"
print("DATASET_DIR:", DATASET_DIR)

# Find *all* tar.gz parts (e.g., TGDataset_1.tar.gz ... TGDataset_4.tar.gz)
TAR_PATHS = sorted(DATASET_DIR.glob("*.tar.gz"))
assert TAR_PATHS, f"No .tar.gz files found in {DATASET_DIR}"

print("\nArchives found:")
for p in TAR_PATHS:
    print(" -", p.name)


DATASET_DIR: /home/rescic/dezinfo_data/tgdataset

Archives found:
 - TGDataset_1.tar.gz
 - TGDataset_2.tar.gz
 - TGDataset_3.tar.gz
 - TGDataset_4.tar.gz


## List archive contents (quick peek)

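We print a short listing of the first few file members in each archive so you can see what types of files are stored inside (JSON/JSONL, CSV, etc.). This step inspects **only tar headers** (names and sizes); file contents are never loaded. Because a gzip stream is not seekable, the archive is still decompressed sequentially to reach each header, so a full listing of a multi-gigabyte archive can take a while.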

In [3]:
import tarfile
from itertools import islice

def list_tar_members(tar_path: Path, max_members: int = 30) -> list[tuple[str, int]]:
    # Iterate lazily and stop early; getmembers() would scan the whole archive first.
    with tarfile.open(tar_path, mode="r:gz") as tf:
        files = (m for m in tf if m.isfile())
        return [(m.name, m.size) for m in islice(files, max_members)]

for tp in TAR_PATHS:
    print(f"\n== {tp.name} ==")
    for name, size in list_tar_members(tp, max_members=15):
        print(f"{size:12d}  {name}")



== TGDataset_1.tar.gz ==
  4729515989  public_db/folder_0/file_28_from_ebaladf_to_elanvirtualgallery.json
  5063177340  public_db/folder_0/file_32_from_FantastArt_to_fenixatelier.json
  3543132657  public_db/folder_0/file_13_from_BlessKingss_to_Boss_proof.json
  3877814451  public_db/folder_0/file_12_from_Bibliotecacir_to_blessher.json
  3537316326  public_db/folder_0/file_22_from_DailyBriefing_to_daytejaby.json
  4366933161  public_db/folder_0/file_17_from_channelforsell6_to_chrir.json
  2788424285  public_db/folder_0/file_19_from_cnnethiopia_to_coronaRSAstatus.json
  3584765176  public_db/folder_0/file_34_from_flowers_in_your_lungs_to_freeJuliane.json
  3782020535  public_db/folder_0/file_27_from_drone_life_to_ebaka1337.json
  3402382031  public_db/folder_0/file_33_from_fenixconservadora_to_flowersunicorn_shop.json
  3618193723  public_db/folder_0/file_18_from_chrisaresoffiziell_to_CNNBrk.json
  3946563459  public_db/folder_0/file_1_from_adhd_recharge_to_ahvvn.json
  3530975727  pub

## JSON footprint analysis inside compressed archives

This section estimates how much JSON-like data exists in TGDataset.

Key points:
- We scan **all tar archives** and look for files ending in
  `.json`, `.jsonl`, `.ndjson` (including `.gz` variants)
- File sizes are taken from tar headers (uncompressed sizes; nothing is extracted to disk, but each archive is decompressed once while scanning)
- We aggregate sizes by **top-level folder inside the archive**

This gives a high-level view of where most of the structured data lives
and how large it is, without loading any actual records.
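
The aggregation logic can be checked on a toy archive before running it against the multi-gigabyte parts. The sketch below builds a tiny in-memory `.tar.gz` and sums the uncompressed sizes of `.json` members per top-level folder. The helper names and the member names `a.json`, `b.json` and `docs/readme.txt` are made up for illustration.

```python
import io
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

def build_toy_archive(files: dict[str, bytes]) -> bytes:
    """Pack {member_name: payload} into an in-memory .tar.gz (toy data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def json_bytes_by_top_folder(archive: bytes) -> dict[str, int]:
    """Sum the uncompressed sizes of .json members per top-level folder."""
    totals: dict[str, int] = defaultdict(int)
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tf:
        for member in tf:  # only headers are parsed; payloads are skipped
            if member.isfile() and member.name.lower().endswith(".json"):
                # Tar member names use forward slashes regardless of the host OS.
                top = PurePosixPath(member.name).parts[0]
                totals[top] += member.size
    return dict(totals)

toy = build_toy_archive({
    "public_db/folder_0/a.json": b"[1, 2, 3]",   # 9 bytes
    "public_db/folder_1/b.json": b"{}",          # 2 bytes
    "docs/readme.txt": b"not json",              # ignored: not a .json member
})
totals = json_bytes_by_top_folder(toy)
print(totals)
```

`member.size` is the size of the stored file, not its compressed size, so the totals describe how much JSON there would be after extraction.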

In [None]:
from collections import defaultdict
from pathlib import Path
import tarfile

def sizeof_fmt(num: int, suffix="B") -> str:
    for unit in ["", "Ki", "Mi", "Gi", "Ti", "Pi"]:
        if abs(num) < 1024.0:
            return f"{num:3.1f} {unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f} Ei{suffix}"

json_like_suffixes = (".json", ".jsonl", ".ndjson", ".json.gz", ".jsonl.gz", ".ndjson.gz")

# We'll store lightweight member refs as dicts so we don't keep TarInfo objects around.
json_members = []  # each: {"tar": Path, "name": str, "size": int}

folder_bytes = defaultdict(int)
total_bytes = 0

for tar_path in TAR_PATHS:
    with tarfile.open(tar_path, mode="r:gz") as tf:
        for m in tf.getmembers():
            if not m.isfile():
                continue
            n = m.name
            ln = n.lower()
            if ln.endswith(json_like_suffixes):
                json_members.append({"tar": tar_path, "name": n, "size": m.size})
                total_bytes += m.size

                parts = Path(n).parts
                top = parts[0] if parts else "."
                folder_bytes[top] += m.size

print("JSON-like members found (across all tars):", len(json_members))
assert json_members, "No JSON-like files were found inside the .tar.gz parts. Inspect archive contents above."

print("Total JSON footprint (inside tars):", sizeof_fmt(total_bytes))

print("\nTop-level folder breakdown (inside tars):")
for k, v in sorted(folder_bytes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  - {k:25s} {sizeof_fmt(v)}")

print("\nExample members (inside tars):")
for m in sorted(json_members, key=lambda d: (d["tar"].name, d["name"]))[:12]:
    print(f"  {m['tar'].name} :: {m['name']}")


## Peek at raw bytes to understand the JSON format

If later sampling reports zero objects, it usually means the files are **not line-delimited JSON**. Here we peek at the first bytes of a representative member to check whether it looks like:
- JSON array (`[`)
- JSON object (`{`)
- JSONL/NDJSON (many lines each parseable)
- gzip stream inside tar (magic bytes `1f 8b`)
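
The four cases above can be told apart from a short byte prefix. The following is a minimal sketch of such a classifier, assuming UTF-8 content; `classify_prefix` and its return labels are illustrative names, not part of the dataset tooling. JSONL is checked first because its lines also start with `{`.

```python
import json

def _parses(line: str) -> bool:
    """True if the line is a complete JSON value on its own."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False

def classify_prefix(raw: bytes) -> str:
    """Guess the container format from the first bytes of a member."""
    if raw[:2] == b"\x1f\x8b":
        return "gzip"  # nested gzip stream: decompress first, then classify again
    text = raw.decode("utf-8", errors="replace").lstrip()
    if not text:
        return "empty"
    # Only lines that end inside the prefix are complete; drop the last fragment.
    complete_lines = [ln for ln in text.split("\n")[:-1] if ln.strip()]
    if len(complete_lines) >= 2 and all(_parses(ln) for ln in complete_lines):
        return "jsonl"
    if text[0] == "[":
        return "array"
    if text[0] == "{":
        return "object"
    return "unknown"

print(classify_prefix(b'{"id": 1}\n{"id": 2}\n{"id"'))
print(classify_prefix(b'{\n  "a": 1,\n  "b"'))
```

This is a heuristic: a file with a single record on one line is ambiguous and is reported as `object` or `array`, and the prefix must be long enough to contain at least two complete lines for JSONL to be recognised.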

In [14]:

from pathlib import Path
import tarfile

# 1) Point to your mounted folder
DATASET_DIR = Path.home() / "dezinfo_data" / "tgdataset"  # <-- adjust if needed
assert DATASET_DIR.exists(), f"Mounted dataset folder not found: {DATASET_DIR}"
print("DATASET_DIR:", DATASET_DIR)

# 2) Discover tarballs (TGDataset_*.tar.gz)
TAR_PATHS = sorted(DATASET_DIR.glob("*.tar.gz"))
assert TAR_PATHS, f"No .tar.gz files found under {DATASET_DIR}"
print("Found tarballs:")
for p in TAR_PATHS:
    print("  -", p.name)

# ---- sniff helper: classify a member by its first bytes ----
def sniff_member_start(tar_path: Path, member_name: str, max_bytes: int = 2048) -> dict:
    with tarfile.open(tar_path, mode="r:*") as tf:  # r:* auto-detects the compression
        m = tf.getmember(member_name)
        f = tf.extractfile(m)
        assert f is not None, f"Could not extract member: {member_name}"
        raw = f.read(max_bytes)

    is_gzip_stream = raw[:2] == b"\x1f\x8b"

    ws = b" \t\r\n"
    i = 0
    while i < len(raw) and raw[i:i+1] in ws:
        i += 1
    first = raw[i:i+1] if i < len(raw) else b""

    return {
        "tar": tar_path.name,
        "member": member_name,
        "first_non_ws_byte": first,
        "first_non_ws_char": first.decode("utf-8", errors="replace") if first else "",
        "looks_like_array": first == b"[",
        "looks_like_object": first == b"{",
        "nested_gzip_stream": is_gzip_stream,
    }

# 3) Collect members across tarballs
def collect_members(tar_path: Path):
    out = []
    with tarfile.open(tar_path, mode="r:*") as tf:
        for m in tf.getmembers():
            if m.isfile():
                out.append({"tar": tar_path, "name": m.name})
    return out

members = []
for tp in TAR_PATHS:
    members.extend(collect_members(tp))

print(f"Collected {len(members):,} file members from {len(TAR_PATHS)} tarballs.")

# 4) Pick "JSON-ish" candidates by filename; fall back to everything
json_members = [d for d in members if ".json" in d["name"].lower()]
print(f"JSON-ish members: {len(json_members):,}")

candidates = json_members if json_members else members

# 5) Sniff one deterministic sample
sample_ref = sorted(candidates, key=lambda d: (d["tar"].name, d["name"]))[0]
sniff = sniff_member_start(sample_ref["tar"], sample_ref["name"])
sniff


DATASET_DIR: /home/rescic/dezinfo_data/tgdataset
Found tarballs:
  - TGDataset_1.tar.gz
  - TGDataset_2.tar.gz
  - TGDataset_3.tar.gz
  - TGDataset_4.tar.gz


ConnectionAbortedError: [Errno 103] Software caused connection abort

## Sampling JSON objects and inferring schema

TGDataset JSON files are not guaranteed to be line-delimited JSON (JSONL).
Some files may be:
- JSON arrays
- Pretty-printed multi-line JSON objects
- Gzipped JSON inside tar archives

To handle this safely, we:
- Sample a small number of objects per file
- Detect JSONL vs JSON-array vs single-object formats
- Parse only a bounded amount of data per file

From these samples we infer:
- Common top-level keys
- Likely identifier fields (IDs, user IDs, message IDs)
- Likely text or content fields

This gives a **schema overview** without loading the full dataset.
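
One caveat of bounded reads: a prefix of a multi-gigabyte array or object almost always ends mid-document, so `json.loads` on the prefix fails. One way to still get samples is to decode complete leading values one at a time with `json.JSONDecoder.raw_decode`. The sketch below illustrates the idea on toy strings; `salvage_from_prefix` is a hypothetical helper, not the method used by the dataset authors.

```python
import json

def salvage_from_prefix(text: str, max_items: int = 3) -> list:
    """Return complete leading values from a possibly truncated JSON array or object.

    Arrays give a list of elements; objects give a list of (key, value) pairs.
    """
    decoder = json.JSONDecoder()
    text = text.lstrip()
    if not text or text[0] not in "[{":
        return []
    is_object = text[0] == "{"
    pos, out = 1, []

    def skip(chars: str) -> None:
        nonlocal pos
        while pos < len(text) and text[pos] in chars:
            pos += 1

    while len(out) < max_items:
        skip(" \t\r\n,")
        try:
            if is_object:
                key, pos = decoder.raw_decode(text, pos)
                skip(" \t\r\n:")
                value, pos = decoder.raw_decode(text, pos)
                out.append((key, value))
            else:
                value, pos = decoder.raw_decode(text, pos)
                out.append(value)
        except json.JSONDecodeError:
            break  # reached the truncated tail or the closing bracket
    return out

from_array = salvage_from_prefix('[{"id": 1}, {"id": 2}, {"id"')
from_object = salvage_from_prefix('{"a": {"x": 1}, "b": {"x"')
print(from_array)
print(from_object)
```

The sketch assumes the elements are objects or other self-delimiting values; a bare number cut off at the end of the prefix could be decoded as a shorter number.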

In [None]:
import json
import tarfile
import gzip
import io
from pathlib import Path
from typing import Iterator, Dict, Any

def iter_json_objects_from_tar_member(
    tar_path: Path,
    member_name: str,
    max_objects: int = 200,
    max_json_bytes: int = 5_000_000,  # cap for array/object sampling
) -> Iterator[Dict[str, Any]]:
    with tarfile.open(tar_path, mode="r:gz") as tf:
        m = tf.getmember(member_name)
        f = tf.extractfile(m)
        if f is None:
            return

        lower = member_name.lower()
        is_nested_gz = lower.endswith(".gz")

        # Build a binary stream (possibly gz-decompressed)
        raw = gzip.GzipFile(fileobj=f) if is_nested_gz else f

        # Peek a prefix
        prefix = raw.read(2048)
        if not prefix:
            return

        # Re-open to restart stream (tar streams are not reliably seekable)
        f2 = tf.extractfile(m)
        if f2 is None:
            return
        raw2 = gzip.GzipFile(fileobj=f2) if is_nested_gz else f2

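        s = prefix.lstrip()
        if s.startswith((b"[", b"{")):
            # Only the first max_json_bytes are read, so large files arrive truncated.
            text = raw2.read(max_json_bytes).decode("utf-8", errors="replace").lstrip()
            try:
                data = json.loads(text)
            except json.JSONDecodeError:
                data = None  # truncated or malformed prefix

            if isinstance(data, dict):
                yield data
            elif isinstance(data, list):
                for x in data:
                    if isinstance(x, dict):
                        yield x
                        max_objects -= 1
                        if max_objects <= 0:
                            break
            elif text.startswith("["):
                # Truncated array: salvage the complete leading elements one by one.
                decoder, pos = json.JSONDecoder(), 1
                while max_objects > 0:
                    while pos < len(text) and text[pos] in " \t\r\n,":
                        pos += 1
                    try:
                        x, pos = decoder.raw_decode(text, pos)
                    except json.JSONDecodeError:
                        break
                    if isinstance(x, dict):
                        yield x
                        max_objects -= 1
            # A truncated top-level object yields nothing here.
            return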

        # Otherwise: JSONL/NDJSON
        txt = io.TextIOWrapper(raw2, encoding="utf-8", errors="replace")
        for line in txt:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except Exception:
                continue
            if isinstance(obj, dict):
                yield obj
                max_objects -= 1
                if max_objects <= 0:
                    break

# Demo: sample a few objects from the first JSON-like member
demo = []
for obj in iter_json_objects_from_tar_member(sample_ref["tar"], sample_ref["name"], max_objects=3):
    demo.append(obj)
demo


## Key/field discovery across all tar parts (sample-based)

We sample up to a few hundred objects across multiple JSON-like members to infer:
- most common top-level keys
- likely identifier fields
- likely text fields

If this still reports `Sampled objects: 0`, it means the JSON format differs (e.g., not JSON at all, or needs a different decoder). In that case, use the earlier sniff/peek sections to identify the true format.
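
Top-level keys say little when each sampled object is a wrapper around nested records. As a possible extension, the sketch below counts dotted key paths down to a limited depth. The field names in `samples` are hypothetical and do not describe the actual TGDataset schema.

```python
from collections import Counter
from typing import Any, Iterator

def iter_key_paths(obj: Any, prefix: str = "", max_depth: int = 3) -> Iterator[str]:
    """Yield dotted key paths; all elements of a list share the marker '[]'."""
    if max_depth <= 0:
        return
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from iter_key_paths(value, path, max_depth - 1)
    elif isinstance(obj, list):
        for item in obj:
            # Lists do not consume depth; only dict nesting does.
            yield from iter_key_paths(item, f"{prefix}[]", max_depth)

# Hypothetical sample objects, for illustration only.
samples = [
    {"id": 1, "messages": [{"text": "a"}, {"text": "b", "date": 0}]},
    {"id": 2, "meta": {"title": "t"}},
]

path_counts = Counter()
for sample in samples:
    path_counts.update(iter_key_paths(sample))

for path, count in path_counts.most_common():
    print(f"  {path:25s} {count}")
```

Counts here are per occurrence rather than per object, so a key inside a long list is counted once for every list element that carries it.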

In [None]:
from collections import Counter
from itertools import islice

likely_id_fields = {"id", "uid", "user_id", "did", "rid", "post_id", "message_id", "cid", "tid", "chat_id"}
likely_text_fields = {"text", "title", "description", "body", "content", "about"}

key_counts = Counter()
id_field_counts = Counter()
text_field_counts = Counter()

sampled = 0
MAX_OBJECTS = 500

# Iterate over a subset of members across all tar parts
for ref in sorted(json_members, key=lambda d: (d["tar"].name, d["name"]))[:50]:
    for obj in iter_json_objects_from_tar_member(ref["tar"], ref["name"], max_objects=100):
        sampled += 1
        key_counts.update(obj.keys())

        for f in likely_id_fields:
            if f in obj and obj.get(f) not in (None, ""):
                id_field_counts[f] += 1
        for f in likely_text_fields:
            if f in obj and obj.get(f) not in (None, ""):
                text_field_counts[f] += 1

        if sampled >= MAX_OBJECTS:
            break
    if sampled >= MAX_OBJECTS:
        break

print("Sampled objects:", sampled)

print("\nMost common top-level keys:")
for k, c in key_counts.most_common(25):
    print(f"  {k:30s} {c}")

print("\nLikely identifier fields (presence counts):")
for k, c in id_field_counts.most_common():
    print(f"  {k:20s} {c}")

print("\nLikely text fields (presence counts):")
for k, c in text_field_counts.most_common():
    print(f"  {k:20s} {c}")
