# Prepare Datasets
## Setup:

In [1]:
import os
import subprocess
import sys
from pathlib import Path

import pandas as pd

def _add_notebooks_src_to_path():
    here = Path.cwd().resolve()
    for p in [here, *here.parents]:
        candidate = p / "notebooks" / "src"
        if candidate.is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    raise FileNotFoundError("Could not find 'notebooks/src' from current working directory.")

print("Using helpers from:", _add_notebooks_src_to_path())

from constants import (
    REPO_ROOT, PKG_DIR, RAW_DATA_DIR, PROCESSED_DATA_DIR, MODELS_ROOT, ensure_repo_importable
)
dataset_name = "wikipedia"   # e.g., "uci"
bipartite = True
RAW_DATA_FILE = RAW_DATA_DIR / f"{dataset_name}.csv"
DATASET_DIR = PROCESSED_DATA_DIR / dataset_name
DATASET_DIR.mkdir(parents=True, exist_ok=True)


print("PARENT_DIR       :", REPO_ROOT)
print("RAW_DATA_FILE    :", RAW_DATA_FILE)
print("DATASET_DIR      :", DATASET_DIR)

Using helpers from: /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/notebooks/src
PARENT_DIR       : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain
RAW_DATA_FILE    : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/raw/wikipedia.csv
DATASET_DIR      : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/processed/wikipedia


The first step in running the experiments is to prepare the datasets. They can be downloaded from the 
following sites:

- [Wikipedia](http://snap.stanford.edu/jodie/#datasets)
- [UCI-Messages/UCI-Forums](https://toreopsahl.com/datasets/)

To download Wikipedia, run:

In [2]:
!curl -O http://snap.stanford.edu/jodie/wikipedia.csv



In [1]:
!curl -O http://snap.stanford.edu/jodie/reddit.csv


To download UCI-Messages, run:

In [None]:
!curl http://opsahl.co.uk/tnet/datasets/OCnodeslinks.txt > UCI-Messages.txt
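The same downloads can be scripted from Python so that files land directly in the raw data directory and an interrupted transfer does not leave a truncated file behind. The following is a minimal sketch, not part of the repository: `download_if_missing` is a hypothetical helper name, and the commented usage line assumes the `RAW_DATA_DIR` constant from the setup cell.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first, so that an interrupted transfer
    # never leaves a partial file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    urlretrieve(url, str(tmp))
    tmp.replace(dest)
    return dest


# Usage inside this notebook (assumes RAW_DATA_DIR from the setup cell):
# download_if_missing("http://snap.stanford.edu/jodie/wikipedia.csv",
#                     RAW_DATA_DIR / "wikipedia.csv")
```

Because the helper returns early when the target exists, re-running the cell does not repeat a large download.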

The raw dataset files should have the same format as the Wikipedia dataset; that is: 

- First column: Source node ids
- Second column: Target node ids
- Third column: UNIX timestamp
- Fourth column: State label (not necessary for link prediction task)
- Fifth column and onwards: Comma separated list of edge features
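A quick sanity check of this layout can catch formatting mistakes before preprocessing. The sketch below is an illustration under assumptions, not the repository's own validation: `check_raw_format` is a hypothetical helper, it assumes the first line of the file is a header, and it only checks that there are at least four columns and that the first three are numeric and complete.

```python
import io

import pandas as pd


def check_raw_format(csv_source, n_rows=1000):
    """Return a list of problems found in the first rows of a raw dataset.

    Hypothetical helper; assumes the first line is a header and skips it.
    """
    df = pd.read_csv(csv_source, header=None, skiprows=1, nrows=n_rows)
    problems = []
    if df.shape[1] < 4:
        problems.append(f"expected at least 4 columns, found {df.shape[1]}")
        return problems
    names = ["source id", "target id", "timestamp"]
    for position, name in enumerate(names):
        column = df.iloc[:, position]
        if not pd.api.types.is_numeric_dtype(column):
            problems.append(f"{name} column is not numeric")
        elif column.isna().any():
            problems.append(f"{name} column has missing values")
    return problems


# Made-up sample in the described layout: two edges with two features each.
sample = io.StringIO(
    "user_id,item_id,timestamp,state_label,features\n"
    "0,10,0.0,0,0.1,0.2\n"
    "1,11,36.0,0,0.3,0.4\n"
)
print(check_raw_format(sample))  # an empty list means no problems were found
```

Passing a file path instead of the in-memory sample works the same way, since `pd.read_csv` accepts both.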

## Reformat UCI-Messages:
The UCI datasets do not come in this format by default. To simplify the conversion, the script 
[format_uci_data.py](./format_uci_data.py) is provided. First download the dataset from the website 
mentioned above, then convert the downloaded file to an appropriate .csv file by running:

In [None]:
input_path = "path/to/UCI-Messages.txt"  # placeholder: replace with the path to the downloaded file
output_path = RAW_DATA_DIR / "ucim.csv"

inp = Path(input_path)
outp = Path(output_path)
outp.parent.mkdir(parents=True, exist_ok=True)

# --- Read raw ---
# Original script uses sep=" ". If your file has variable whitespace, consider sep=r"\s+", engine="python".
raw_data = pd.read_csv(inp, sep=" ", header=None)

# --- Transform ---
raw_data.columns = ['timestamp', 'item_id', 'user_id', 'state_label']
raw_data['timestamp'] = pd.to_datetime(raw_data['timestamp'])
raw_data['timestamp'] = (raw_data['timestamp'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
reordered_data = raw_data[['user_id', 'item_id', 'timestamp', 'state_label']]

# --- Save ---
reordered_data.to_csv(outp, index=False)
print(f"Reformatted file saved to {outp}")

# (Optional) Peek at the first few rows
try:
    from IPython.display import display
    display(reordered_data.head())
except Exception:
    print(reordered_data.head())

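To see what the conversion does without the full download, the same steps can be run on a tiny in-memory sample. The two rows below are invented for illustration, and their layout (a quoted date-time followed by three space-separated integers) is an assumption about the downloaded file, not something this notebook verifies.

```python
import io

import pandas as pd

# Two made-up rows in the layout assumed by the conversion cell above.
sample = io.StringIO(
    '"2004-04-15 12:56:01" 1 2 1\n'
    '"2004-04-15 12:57:01" 3 4 1\n'
)
raw = pd.read_csv(sample, sep=" ", header=None)
raw.columns = ["timestamp", "item_id", "user_id", "state_label"]

# Parse the date-time strings, then express them as whole seconds since
# 1970-01-01, which gives the UNIX timestamps the raw format requires.
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
raw["timestamp"] = (raw["timestamp"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# Reorder the columns into the source, target, timestamp, label layout.
converted = raw[["user_id", "item_id", "timestamp", "state_label"]]
print(converted)
```

The two sample rows are one minute apart, so the resulting integer timestamps differ by 60.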

Place the correctly formatted datasets as .csv files in the [/resources/datasets/raw](./resources/datasets/raw) directory.

To prepare the raw datasets for use, preprocess the files:

Wikipedia/UCI-Forums:

In [3]:

preprocess_script = Path(PKG_DIR) / "data" / "preprocess_dataset.py"
py = sys.executable
cmd = [
    py,
    str(preprocess_script),
    "-f", str(RAW_DATA_FILE),
    "-t", str(DATASET_DIR),
]
if bipartite:
    cmd.append("--bipartite")

# Ensure Python can import the repo package when the script runs
env = os.environ.copy()
py_paths = [str(PKG_DIR)]
if env.get("PYTHONPATH"):
    py_paths.append(env["PYTHONPATH"])
env["PYTHONPATH"] = os.pathsep.join(py_paths)

print("\nExecuting…")
subprocess.run(cmd, cwd=str(preprocess_script.parent), env=env, check=True)
print("\n Preprocessing finished.")


Executing…
Dataset wikipedia has been processed and will be saved.

Dataset information:
Edge feature shape: (157474, 172)
Node feature shape: (9227, 172)
Successfully saved the preprocessed dataset to /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/processed/wikipedia

 Preprocessing finished.
