# Prepare Datasets
## Setup:

In [7]:
import os
import subprocess
import sys
from pathlib import Path

import pandas as pd

def _add_notebooks_src_to_path():
    here = Path.cwd().resolve()
    for p in [here, *here.parents]:
        candidate = p / "notebooks" / "src"
        if candidate.is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    raise FileNotFoundError("Could not find 'notebooks/src' from current working directory.")

print("Using helpers from:", _add_notebooks_src_to_path())

from constants import (
    REPO_ROOT, PKG_DIR, RAW_DATA_DIR, PROCESSED_DATA_DIR, MODELS_ROOT, ensure_repo_importable
)
dataset_name = "wikipedia"   # e.g., "uci"
bipartite = True
RAW_DATA_FILE = RAW_DATA_DIR / f"{dataset_name}.csv"
DATASET_DIR = PROCESSED_DATA_DIR / dataset_name
DATASET_DIR.mkdir(parents=True, exist_ok=True)


print("PARENT_DIR       :", REPO_ROOT)
print("RAW_DATA_FILE    :", RAW_DATA_FILE)
print("DATASET_DIR      :", DATASET_DIR)

Using helpers from: /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/notebooks/src
PARENT_DIR       : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain
RAW_DATA_FILE    : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/raw/wikipedia.csv
DATASET_DIR      : /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/processed/wikipedia


The first step in running the experiments is to prepare the datasets, which can be downloaded from the
following sites:

- [Wikipedia](http://snap.stanford.edu/jodie/#datasets)
- [UCI-Messages/UCI-Forums](https://toreopsahl.com/datasets/)

For Wikipedia, run:

In [5]:
!curl -O http://snap.stanford.edu/jodie/wikipedia.csv

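
The download can also be done from Python, so that the file lands directly in the raw data directory and an interrupted transfer does not leave a truncated file behind. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` and the `.part` suffix are hypothetical and not part of the repository.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, target):
    """Download `url` to `target` unless a non-empty file already exists there."""
    target = Path(target)
    if target.is_file() and target.stat().st_size > 0:
        return target  # nothing to do, keep the existing file
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first (hypothetical ".part" suffix)
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as handle:
        shutil.copyfileobj(response, handle)
    partial.replace(target)  # only a completed download gets the final name
    return target
```

With the variables from the setup cell, a call could look like `download_if_missing("http://snap.stanford.edu/jodie/wikipedia.csv", RAW_DATA_FILE)`.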


For UCI-Messages, run:

In [None]:
!curl http://opsahl.co.uk/tnet/datasets/OCnodeslinks.txt > UCI-Messages.txt

The raw dataset files should have the same format as the Wikipedia dataset, that is:

- First column: Source node IDs
- Second column: Target node IDs
- Third column: UNIX timestamp
- Fourth column: State label (not needed for the link prediction task)
- Fifth column and onwards: Comma-separated list of edge features
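
The column layout above can be checked before preprocessing. The following is a minimal sketch using only the standard library; the function name `check_raw_format` is hypothetical and not part of the repository, and it assumes that the file starts with a single header row, that node IDs are integers, and that timestamps and edge features are numeric.

```python
import csv
from pathlib import Path


def check_raw_format(path, max_rows=1000):
    """Return a list of problems found in the first `max_rows` data rows."""
    problems = []
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # assumption: the first row is a header
        for row_number, row in enumerate(reader, start=2):
            if row_number - 1 > max_rows:
                break
            if len(row) < 4:
                problems.append(
                    f"line {row_number}: expected at least 4 columns, got {len(row)}"
                )
                continue
            try:
                int(row[0])    # source node ID
                int(row[1])    # target node ID
                float(row[2])  # timestamp
                # Any columns from the fifth onwards are edge features
                [float(value) for value in row[4:]]
            except ValueError as error:
                problems.append(f"line {row_number}: {error}")
    return problems
```

An empty list means that no problems were found in the inspected rows. The state label in the fourth column is deliberately not checked.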

## Reformat UCI-Messages:
The UCI datasets do not have this format by default. To make the conversion easier, the script
[format_uci_data.py](./format_uci_data.py) is provided. First download the dataset from the website
mentioned above. Then convert the downloaded file to an appropriate .csv file by running the following cell, which mirrors the script:

In [None]:
input_path = "path/to/UCI-Messages.txt"  # placeholder: replace with the location of the downloaded file
output_path = RAW_DATA_DIR / "ucim.csv"

inp = Path(input_path)
outp = Path(output_path)
outp.parent.mkdir(parents=True, exist_ok=True)

# --- Read raw ---
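# The original script uses sep=" ". A regex separator such as sep=r"\s+" (engine="python") tolerates variable
# whitespace, but pandas may then ignore quoting and split a quoted date-time field; check the column count if you switch.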
raw_data = pd.read_csv(inp, sep=" ", header=None)

# --- Transform ---
raw_data.columns = ['timestamp', 'item_id', 'user_id', 'state_label']
raw_data['timestamp'] = pd.to_datetime(raw_data['timestamp'])
raw_data['timestamp'] = (raw_data['timestamp'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
reordered_data = raw_data[['user_id', 'item_id', 'timestamp', 'state_label']]

# --- Save ---
reordered_data.to_csv(outp, index=False)
print(f"Reformatted file saved to {outp}")

# (Optional) Peek at the first few rows
try:
    from IPython.display import display
    display(reordered_data.head())
except Exception:
    print(reordered_data.head())

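
The timestamp step in the cell above turns a date-time into whole seconds since 1970-01-01. The same arithmetic can be sketched with the standard library instead of pandas, which makes it easy to spot-check individual values. The function name `to_unix_seconds` and the default format string are assumptions for illustration. As in the pandas version, a timestamp without time zone information is treated as UTC.

```python
from datetime import datetime, timezone


def to_unix_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return whole seconds since 1970-01-01 UTC."""
    # Assumption: the timestamp carries no time zone of its own and is read as UTC
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(to_unix_seconds("1970-01-02 00:00:00"))  # 86400
```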

Place the correctly formatted datasets as .csv files in the [/resources/datasets/raw](./resources/datasets/raw) directory.

To prepare the raw datasets for use, preprocess the files:

Wikipedia/UCI-Forums:

In [12]:

preprocess_script = PKG_DIR / "data" / "preprocess_dataset.py"
py = sys.executable
# Build the command as a list of strings so that no shell quoting is needed
cmd = [
    py,
    str(preprocess_script),
    "-f", str(RAW_DATA_FILE),
    "-t", str(DATASET_DIR),
]
if bipartite:
    cmd.append("--bipartite")

# Ensure Python can import the repo package when the script runs
env = os.environ.copy()
py_paths = [str(PKG_DIR)]
if env.get("PYTHONPATH"):
    py_paths.append(env["PYTHONPATH"])
env["PYTHONPATH"] = os.pathsep.join(py_paths)

print("\nExecuting…")
subprocess.run(cmd, cwd=str(preprocess_script.parent), env=env, check=True)
print("\n Preprocessing finished.")


Executing…
Dataset wikipedia has been processed and will be saved.

Dataset information:
Edge feature shape: (157474, 172)
Node feature shape: (9227, 172)
Successfully saved the preprocessed dataset to /Users/juliawenkmann/Documents/CodingProjects/master_thesis/time_to_explain/resources/datasets/processed/wikipedia

 Preprocessing finished.
