# Higgs Twitter Dataset

This notebook documents the **Higgs Twitter dataset** from the SNAP collection.

It shows:
- where the data is stored on the DezInfo server
- how it is accessed via WSL + sshfs
- the structure of each file
- the first rows ("head") of each dataset

Source:
https://snap.stanford.edu/data/higgs-twitter.html


In [1]:
import sys
from pathlib import Path

# Walk upwards until we find a folder that contains "src/"
p = Path.cwd().resolve()
while p != p.parent and not (p / "src").exists():
    p = p.parent

if not (p / "src").exists():
    raise RuntimeError("Could not find project root (folder containing 'src').")

sys.path.insert(0, str(p))

print("Project root:", p)
print("Has src?:", (p / "src").exists())


Project root: /mnt/c/Users/rescic/PycharmProjects/PythonProject/dezinfo-datasets
Has src?: True


In [2]:
from pathlib import Path
import pandas as pd

from src.core.config import SETTINGS

In [3]:
DATASET_DIR = Path(SETTINGS.DATA_ROOT) / "higgs-twitter"
DATASET_DIR


PosixPath('/home/rescic/dezinfo_data/higgs-twitter')

In [4]:
FILES = {
    "social_network": "higgs-social_network.edgelist",
    "retweet_network": "higgs-retweet_network.edgelist",
    "reply_network": "higgs-reply_network.edgelist",
    "mention_network": "higgs-mention_network.edgelist",
    "activity_time": "higgs-activity_time.txt",
}

missing = [k for k, fn in FILES.items() if not (DATASET_DIR / fn).exists()]
missing


[]

In [5]:
def read_head_whitespace(path: Path, nrows: int = 5) -> pd.DataFrame:
    compression = "gzip" if path.name.endswith(".gz") else None
    return pd.read_csv(
        path,
        sep=r"\s+",
        header=None,
        nrows=nrows,
        compression=compression,
        engine="python",
    )


In [6]:
def label_edgelist(df: pd.DataFrame) -> pd.DataFrame:
    if df.shape[1] == 2:
        df.columns = ["src", "dst"]
    elif df.shape[1] == 3:
        df.columns = ["src", "dst", "weight"]
    else:
        df.columns = [f"col{i}" for i in range(df.shape[1])]
    return df


In [7]:
from collections import Counter
from typing import Dict

def full_edgelist_stats(path: Path, chunksize: int = 2_000_000) -> Dict:
    """
    Compute full-dataset statistics for a large edge list using streaming.
    Assumes whitespace-separated columns: src dst [weight]
    """
    nodes = set()
    edges = 0
    self_loops = 0
    reciprocal_pairs = set()

    out_deg = Counter()
    in_deg = Counter()

    for chunk in pd.read_csv(
        path,
        sep=r"\s+",
        header=None,
        chunksize=chunksize,
        engine="python",
    ):
        src = chunk.iloc[:, 0]
        dst = chunk.iloc[:, 1]

        edges += len(chunk)
        nodes.update(src)
        nodes.update(dst)

        out_deg.update(src)
        in_deg.update(dst)

        self_loops += (src == dst).sum()

        # reciprocal edge detection (a->b and b->a)
        for a, b in zip(src, dst):
            if (b, a) in reciprocal_pairs:
                continue
            reciprocal_pairs.add((a, b))

    reciprocal_count = sum(
        1 for (a, b) in reciprocal_pairs if (b, a) in reciprocal_pairs and a != b
    ) // 2

    return {
        "edges": edges,
        "nodes": len(nodes),
        "self_loops": int(self_loops),
        "reciprocal_edge_pairs": reciprocal_count,
        "avg_out_degree": edges / len(nodes),
        "avg_in_degree": edges / len(nodes),
        "max_out_degree": max(out_deg.values()),
        "max_in_degree": max(in_deg.values()),
    }


In [8]:
def full_activity_stats(path: Path, chunksize: int = 2_000_000):
    users = set()
    interactions = Counter()
    min_ts = None
    max_ts = None
    rows = 0

    for chunk in pd.read_csv(
        path,
        sep=r"\s+",
        header=None,
        chunksize=chunksize,
        engine="python",
    ):
        userA = chunk.iloc[:, 0]
        userB = chunk.iloc[:, 1]
        ts = chunk.iloc[:, 2]
        itype = chunk.iloc[:, 3]

        rows += len(chunk)
        users.update(userA)
        users.update(userB)
        interactions.update(itype)

        min_ts = ts.min() if min_ts is None else min(min_ts, ts.min())
        max_ts = ts.max() if max_ts is None else max(max_ts, ts.max())

    return {
        "events": rows,
        "unique_users": len(users),
        "interaction_counts": dict(interactions),
        "time_min": int(min_ts),
        "time_max": int(max_ts),
    }


## Social Network (Follower Graph)

This table shows the **directed follower network** extracted from Twitter during the Higgs boson rumor spreading period.

Each row represents a follower relationship:
- `src` — the user who **follows**
- `dst` — the user being **followed**
- `weight` (if present) — number of times the follower relationship appears or is reinforced in the data

In [9]:
df_social = read_head_whitespace(DATASET_DIR / FILES["social_network"], nrows=5)
df_social = label_edgelist(df_social)
df_social

Unnamed: 0,src,dst
0,1,2
1,1,3
2,1,4
3,1,5
4,1,6


In [10]:
social_stats = full_edgelist_stats(DATASET_DIR / FILES["social_network"])
pd.Series(social_stats)


edges                    1.485584e+07
nodes                    4.566260e+05
self_loops               2.300000e+01
reciprocal_edge_pairs    0.000000e+00
avg_out_degree           3.253394e+01
avg_in_degree            3.253394e+01
max_out_degree           1.259000e+03
max_in_degree            5.138600e+04
dtype: float64

## Retweet Network

This table represents the **retweet interaction network**.

Each row corresponds to a retweet event aggregated over time:
- `src` — the user who **retweets**
- `dst` — the user whose content is **retweeted**
- `weight` — number of retweets from `src` to `dst`

In [11]:
df_retweet = read_head_whitespace(DATASET_DIR / FILES["retweet_network"], nrows=5)
df_retweet = label_edgelist(df_retweet)
df_retweet


Unnamed: 0,src,dst,weight
0,298960,105232,1
1,95688,3393,1
2,353237,62217,1
3,4974,3571,1
4,241892,8,1


In [12]:
retweet_stats = full_edgelist_stats(DATASET_DIR / FILES["retweet_network"])
pd.Series(retweet_stats)


edges                    328132.000000
nodes                    256491.000000
self_loops                    0.000000
reciprocal_edge_pairs         0.000000
avg_out_degree                1.279312
avg_in_degree                 1.279312
max_out_degree              134.000000
max_in_degree             14060.000000
dtype: float64

## Reply Network

This table contains the **reply interaction network**.

Each row represents replies between users:
- `src` — the user who **replies**
- `dst` — the user being **replied to**
- `weight` — number of reply interactions between the two users

In [13]:
df_reply = read_head_whitespace(DATASET_DIR / FILES["reply_network"], nrows=5)
df_reply = label_edgelist(df_reply)
df_reply


Unnamed: 0,src,dst,weight
0,161345,8614,1
1,428368,11792,1
2,77904,10701,1
3,124554,286277,1
4,194873,194873,1


In [14]:
reply_stats = full_edgelist_stats(DATASET_DIR / FILES["reply_network"])
pd.Series(reply_stats)


edges                    32523.00000
nodes                    38918.00000
self_loops                 343.00000
reciprocal_edge_pairs        0.00000
avg_out_degree               0.83568
avg_in_degree                0.83568
max_out_degree              35.00000
max_in_degree             1206.00000
dtype: float64

## Mention Network

This table represents the **mention network**.

Each row captures when a user mentions another user in a tweet:
- `src` — the user who **mentions**
- `dst` — the user being **mentioned**
- `weight` — number of mentions from `src` to `dst`

Mentions often signal **attention, endorsement, or confrontation**, and are a key mechanism for targeting information toward specific users.


In [15]:
df_mention = read_head_whitespace(DATASET_DIR / FILES["mention_network"], nrows=5)
df_mention = label_edgelist(df_mention)
df_mention


Unnamed: 0,src,dst,weight
0,316609,5011,1
1,439696,12389,1
2,60059,6929,1
3,161345,8614,1
4,137487,759,1


In [16]:
mention_stats = full_edgelist_stats(DATASET_DIR / FILES["mention_network"])
pd.Series(mention_stats)


edges                    150818.000000
nodes                    116408.000000
self_loops                 5353.000000
reciprocal_edge_pairs         0.000000
avg_out_degree                1.295598
avg_in_degree                 1.295598
max_out_degree              169.000000
max_in_degree             11953.000000
dtype: float64

## Activity Log (Temporal Interactions)

This table contains the **time-resolved interaction log** of Twitter activity related to the Higgs boson topic.

Each row corresponds to a single interaction event:
- `userA` — the **initiating** user
- `userB` — the **target** user
- `timestamp` — Unix timestamp of the interaction
- `interaction` — type of interaction:
  - `RT` — retweet
  - `RE` — reply
  - `MT` — mention

This file enables **temporal analysis**, such as information cascades, burst detection, and diffusion dynamics over time.



In [17]:
df_activity = read_head_whitespace(DATASET_DIR / FILES["activity_time"], nrows=5)
df_activity = label_edgelist(df_activity)
df_activity



Unnamed: 0,col0,col1,col2,col3
0,223789,213163,1341100972,MT
1,223789,213163,1341100972,RE
2,376989,50329,1341101181,RT
3,26375,168366,1341101183,MT
4,376989,13813,1341101192,RT


In [18]:
activity_stats = full_activity_stats(DATASET_DIR / FILES["activity_time"])
pd.Series(activity_stats)


events                                                   563069
unique_users                                             304691
interaction_counts    {'MT': 171237, 'RE': 36902, 'RT': 354930}
time_min                                             1341100972
time_max                                             1341705593
dtype: object

## Summary

The Higgs Twitter dataset provides:
- a **static social backbone** (follower network),
- **interaction networks** capturing different engagement modes,
- and a **temporal event log** enabling dynamic diffusion analysis.

Together, these components allow the study of **how information spreads**, **who amplifies it**, and **how users interact over time**.

#### Strongly Connected Components

In [19]:
import networkx as nx
import random

def sample_scc_stats(path: Path, sample_edges=200_000):
    edges = []
    for chunk in pd.read_csv(path, sep=r"\s+", header=None, chunksize=200_000):
        edges.extend(zip(chunk.iloc[:, 0], chunk.iloc[:, 1]))
        if len(edges) >= sample_edges:
            break

    G = nx.DiGraph()
    G.add_edges_from(edges)

    sccs = list(nx.strongly_connected_components(G))
    return {
        "sample_edges": len(edges),
        "scc_count": len(sccs),
        "largest_scc": max(len(c) for c in sccs),
    }


In [21]:
sample_scc_stats(DATASET_DIR / FILES["social_network"])
sample_scc_stats(DATASET_DIR / FILES["retweet_network"])
sample_scc_stats(DATASET_DIR / FILES["reply_network"])
sample_scc_stats(DATASET_DIR / FILES["mention_network"])


{'sample_edges': 150818, 'scc_count': 110704, 'largest_scc': 1801}

### Strongly Connected Components (SCCs)

We analyze SCCs on edge samples to assess the presence of
reciprocal and cyclic interaction patterns.

Due to the size of the full networks, SCCs are computed on
random edge samples


In [22]:
pd.DataFrame.from_dict(
    {
        "social": sample_scc_stats(DATASET_DIR / FILES["social_network"]),
        "retweet": sample_scc_stats(DATASET_DIR / FILES["retweet_network"]),
        "reply": sample_scc_stats(DATASET_DIR / FILES["reply_network"]),
        "mention": sample_scc_stats(DATASET_DIR / FILES["mention_network"]),
    },
    orient="index"
)


Unnamed: 0,sample_edges,scc_count,largest_scc
social,200000,50580,1997
retweet,200000,175891,400
reply,32523,36132,322
mention,150818,110704,1801
