# BotSim-24 Dataset

**Hard facts:**
* Release year: 2024
* Domain: Synthetic simulation of social media behavior (Twitter/X–like platform)
* Size: configurable (typically 10k–100k user agents)
* Labels: bot vs human (ground truth, generated by simulator)
* Features: behavioral and temporal metrics (posting frequency, burstiness, coordination, interaction ratios)
* Data type: fully synthetic with controlled parameters for activity rate, coordination level, and posting regularity
* Simulation scenarios: include isolated bots, coordinated botnets, and mixed human–bot communities
* Use case: robustness testing for bot detection models trained on real-world data

**Description:**
The BotSim-24 dataset is a large-scale synthetic simulation of social media user behavior. It was created to evaluate bot detection systems under controlled yet realistic conditions, providing clean ground-truth labels. Each simulated agent exhibits time-stamped posting behavior, interaction patterns, and coordination features modeled after real-world distributions of activity and engagement.
Unlike real-world datasets such as BotArtist or Fox8-23, BotSim-24 has exactly known labels, since the simulator assigns them. This makes it well suited to behavioral model training, stress testing, and fusion calibration in multimodal detection systems.

**Source:** arXiv, GitHub

## Features
### Temporal Activity
* mean_inter_event_time – Average time (in seconds/minutes) between two consecutive posts.
* std_inter_event_time – Standard deviation of inter-event times (activity irregularity).
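* burstiness – Normalized measure of posting bursts, calculated as (σ−μ)/(σ+μ), where μ and σ are the mean and standard deviation of the inter-event times; values range from −1 (perfectly regular) to 1 (highly bursty).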
* circadian_entropy – Entropy of activity across 24 hours; lower values indicate more regular posting.
* active_hours_ratio – Fraction of time spent posting during daytime hours.
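
The temporal features above can be illustrated with a minimal sketch. It is not the simulator's own code: the function name `temporal_features`, the use of epoch seconds in UTC, the population standard deviation, and entropy measured in bits are all assumptions made for this example.

```python
import numpy as np

def temporal_features(timestamps):
    """Illustrative temporal features for one account.

    `timestamps` are assumed to be post times in epoch seconds (UTC).
    """
    t = np.sort(np.asarray(timestamps, dtype=float))
    if t.size < 2:
        raise ValueError("need at least two posts to compute inter-event times")

    gaps = np.diff(t)            # inter-event times in seconds
    mu = gaps.mean()
    sigma = gaps.std()           # population standard deviation (assumption)

    # Burstiness as defined in the feature list: (sigma - mu) / (sigma + mu)
    denom = sigma + mu
    burstiness = (sigma - mu) / denom if denom > 0 else 0.0

    # Shannon entropy (bits) of the hour-of-day distribution of posts
    hours = ((t // 3600) % 24).astype(int)
    p = np.bincount(hours, minlength=24) / hours.size
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())

    return {
        "mean_inter_event_time": float(mu),
        "std_inter_event_time": float(sigma),
        "burstiness": float(burstiness),
        "circadian_entropy": entropy,
    }
```

An account that posts exactly once per hour has zero variance in its inter-event times, so this sketch returns a burstiness of −1 for it, while its circadian entropy is at the maximum because every hour of the day is used equally.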

### Interaction Behavior
* reply_ratio – Proportion of posts that are replies.
* retweet_ratio – Proportion of posts that are retweets.
* mention_ratio – Fraction of posts that mention other users.
* url_share_ratio – Fraction of posts containing URLs.
* hashtag_use_rate – Average number of hashtags per post.

### Coordination & Overlap
* shared_url_overlap – Jaccard overlap of URLs shared by this user with others (coordination signal).
* shared_hashtag_overlap – Fraction of hashtags shared with a user’s cluster peers.
* synchrony_index – Proportion of posts occurring simultaneously (within a short time window) with other bots.
* cosine_sim_to_cluster_centroid – Similarity of posting pattern to the cluster mean (behavioral mimicry).
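
As a rough illustration of the coordination signals listed above, the sketch below computes a Jaccard overlap, a synchrony proportion, and a cosine similarity to a cluster centroid. The function names, the 60-second default window, and the choice to compare against any other account's posts are assumptions for this example; the dataset's own definitions may differ in detail.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two collections (e.g. shared URLs)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def synchrony_index(own_times, other_times, window=60.0):
    """Fraction of own posts within `window` seconds of any post in `other_times`."""
    own = np.asarray(own_times, dtype=float)
    other = np.sort(np.asarray(other_times, dtype=float))
    if own.size == 0 or other.size == 0:
        return 0.0
    # Locate the nearest neighbours on either side of each own post
    idx = np.searchsorted(other, own)
    left = other[np.clip(idx - 1, 0, other.size - 1)]
    right = other[np.clip(idx, 0, other.size - 1)]
    nearest = np.minimum(np.abs(own - left), np.abs(own - right))
    return float(np.mean(nearest <= window))

def cosine_sim_to_centroid(vec, cluster_vecs):
    """Cosine similarity between one feature vector and the mean of a cluster."""
    vec = np.asarray(vec, dtype=float)
    centroid = np.asarray(cluster_vecs, dtype=float).mean(axis=0)
    denom = np.linalg.norm(vec) * np.linalg.norm(centroid)
    return float(vec @ centroid / denom) if denom > 0 else 0.0
```

For example, `jaccard({"a", "b"}, {"b", "c"})` gives 1/3, since one URL is shared out of three distinct URLs.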

### Cascade & Diffusion
* cascade_depth_mean – Average depth of reply/retweet cascades initiated by this account.
* cascade_width_mean – Average number of direct replies/retweets per cascade.
* max_cascade_size – Maximum number of participants in a cascade initiated by this user.
* reproduction_number – Approximate branching factor, estimating how fast a cascade spreads.
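
To make the cascade terminology concrete, here is a small sketch that derives depth, width, and size for a single cascade. The `(parent, child)` edge representation, the function name `cascade_stats`, and the simplification that each node is one participant are assumptions for this example, not the simulator's data format. The per-account features would then be the mean (or maximum) of these values over all cascades the account initiated.

```python
from collections import defaultdict, deque

def cascade_stats(root, edges):
    """Depth, width, and size of one cascade.

    `edges` is an iterable of (parent, child) pairs forming a tree rooted
    at `root`; each node is treated as one participant (assumption).
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    depth = 0
    size = 1                       # count the root itself
    queue = deque([(root, 0)])     # breadth-first traversal with depth tracking
    while queue:
        node, d = queue.popleft()
        depth = max(depth, d)
        for child in children[node]:
            size += 1
            queue.append((child, d + 1))

    return {
        "depth": depth,                 # longest root-to-leaf path, in hops
        "width": len(children[root]),   # direct replies/retweets of the root
        "size": size,                   # total nodes in the cascade
    }
```

With edges `[("r", "a"), ("r", "b"), ("a", "c")]` and root `"r"`, the cascade has depth 2, width 2, and size 4.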

### Account Metadata (optional)
* account_age_days – Account lifetime in days.
* followers_count – Number of followers (simulated).
* friends_count – Number of followed accounts.
* statuses_count – Total number of posts.
* activity_rate – Posts per day, calculated as statuses_count / account_age_days.

### Label:
* bot_label – Binary class label (1 = bot, 0 = human).

```python
import os
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

from xgboost import XGBClassifier

# Project-local helpers; fall back to None if the utils package is not importable.
try:
    from utils.data_prepping import evaluate_model, plot_threshold
except ImportError:
    evaluate_model = plot_threshold = None

warnings.filterwarnings("ignore")        # keep notebook output free of library warnings
sns.set(style="whitegrid")               # default plot style
plt.rcParams["figure.figsize"] = (6, 4)  # default figure size in inches
np.random.seed(42)                       # reproducible NumPy randomness
```
