%%markdown
# Dataset Generation Template

This notebook provides a template for generating and uploading new datasets to MLflow, after you have the raw scraped data.

## Data Format

Your data needs to be organized in a dictionary with three required keys:
- `inputs`: Feature matrix as NumPy array
- `target`: Binary target matrix as NumPy array
- `target_names`: List of strings naming each target feature

### Shape Requirements

| Component | Type | Shape | Description | Example |
|-----------|------|-------|-------------|----------|
| `inputs` | `np.ndarray` | `(n_samples, n_features)` | Each row is one sample, each column a feature | `(1000, 10)` for 1000 samples with 10 features |
| `target` | `np.ndarray` | `(n_samples, n_targets)` | Binary matrix where each column represents one target | `(1000, 3)` for 1000 samples with 3 possible targets |
| `target_names` | `list[str]` | `(n_targets,)` | Names for each target column | `["cat", "dog", "bird"]` for 3 targets |

## Example

```python
data = {
    "inputs": np.array([
        [0.1, 0.2, 0.3],  # Sample 1 with 3 features
        [0.4, 0.5, 0.6],  # Sample 2 with 3 features
        # ... more samples
    ]),
    "target": np.array([
        [1, 0],  # Sample 1: positive for target 1, negative for target 2
        [0, 1],  # Sample 2: negative for target 1, positive for target 2
        # ... more samples
    ]),
    "target_names": ["group_A", "group_B"]  # Names for the two target columns
}
```

## Uploading the dataset
Once your data is prepared, use `upload_dataset()` to save it to MLflow.
This will verify that the data is formatted correctly and then upload it to the server.

```python
upload_dataset(
    data=data,
    dataset_name="ftir_no_bonding_effects",  # Broader dataest for which we can have multiple versions
    version_name="initial_data",  # Description of this version
    description="FTIR dataset downloaded from the FCGFormer paper without any modifications"  # Optional details
)
```

## Accessing the Dataset

After upload, the dataset will be available in MLflow for model training with:
- NumPy arrays saved as `.npy` files
- Target names and counts (number of positive examples) in text files
- The code of this notebook saved for reproducability (so you don't have to upload it anywhere)

You can view your dataset in MLflow by opening the link printed after calling `upload_dataset()`.

In [1]:
# This cell defines upload_dataset. You can ignore it.

# Install required packages for `upload_dataset()`
%pip install numpy mlflow ipynbname requests

import os
import urllib.parse
import mlflow
import numpy as np
from typing import Dict, Any
import jupyter_client
try:
    from notebook import notebookapp
except ImportError:
    from jupyter_server import serverapp as notebookapp

# MLFlow creds
MLFLOW_DOMAIN = "mlflow.gritans.lv"
MLFLOW_USERNAME = "data_user"
MLFLOW_PASSWORD = "ais7Rah2foo0gee9"
MLFLOW_TRACKING_URI = f"https://{MLFLOW_DOMAIN}"

parsed_uri = urllib.parse.urlparse(MLFLOW_TRACKING_URI)
auth_uri = parsed_uri._replace(
    netloc=f"{urllib.parse.quote(MLFLOW_USERNAME)}:{urllib.parse.quote(MLFLOW_PASSWORD)}@{parsed_uri.netloc}"
).geturl()

mlflow.set_tracking_uri(auth_uri)


def upload_dataset(
    data: Dict[str, Any],
    dataset_name: str,
    version_name: str,
    description: str | None = None,
):
    """
    Args:
        data (Dict[str, Any]): Dictionary containing the dataset with keys:
            - "inputs": NumPy array of shape (num_samples, num_input_features)
            - "target": NumPy array of shape (num_samples, num_output_features)
            - "target_names": List of target feature names, in the same order as the target array.
        dataset_name (str): Name of the dataset.
        version_name (str): A descriptive version name for the dataset. Doesn't need to be unique, just for reference.
        description (str): An (optional) description of this dataset version.
    """
    # Check dictionary
    expected_keys = {"inputs", "target", "target_names"}
    assert set(data.keys()) == expected_keys, (
        f"Invalid dataset format. Keys should be {expected_keys}."
    )

    # Check expected types
    assert isinstance(data["inputs"], np.ndarray), (
        f"Inputs must be a numpy.ndarray. Got {type(data['inputs'])}."
    )
    assert isinstance(data["target"], np.ndarray), (
        f"Targets must be a numpy.ndarray. Got {type(data['target'])}."
    )
    assert isinstance(data["target_names"], list), (
        f"target names must be a list. Got {type(data['target_names'])}."
    )
    assert all(isinstance(name, str) for name in data["target_names"]), (
        "All target names must be strings."
    )

    # Check expected shapes
    inputs: np.ndarray = data["inputs"]
    target: np.ndarray = data["target"]
    target_names = data["target_names"]

    assert inputs.ndim == 2, (
        f"Inputs must be a (num_samples, num_input_features) array. "
        f"Got {inputs.ndim} dimensions."
    )
    assert target.ndim == 2, (
        f"Targets must be a (num_samples, num_output_features) array. "
        f"Got {target.ndim} dimensions."
    )

    n_samples = inputs.shape[0]
    assert n_samples > 0, (
        f"Inputs must have at least one sample. Got {n_samples} samples."
    )
    assert n_samples == target.shape[0], (
        f"Inputs and targets must have the same number of samples. "
        f"Got {n_samples} inputs and {target.shape[0]} targets."
    )

    n_outputs = target.shape[1]
    assert n_outputs > 0 and n_outputs == len(target_names), (
        f"Targets must have the same number of features as target names. "
        f"Got {n_outputs} target features and {len(target_names)} target names."
    )

    # Compute number of positive samples per target
    pos_counts = target.sum(axis=0)

    mlflow.set_experiment(experiment_name=dataset_name)
    with mlflow.start_run(run_name=version_name) as run:
        local_dir = os.path.join("./runs", run.info.run_id)
        os.makedirs(local_dir, exist_ok=True)

        # Log the notebook generating this dataset
  
    try:
        # primary: ipynbname often just works
        import ipynbname
        notebook_path = ipynbname.path()

    except Exception:
        # fallback: query the Jupyter server’s /api/sessions
        # 1) get your kernel id
        conn_file = jupyter_client.find_connection_file()
        kernel_id = os.path.basename(conn_file).split('-', 1)[1].split('.')[0]

        # 2) iterate over all running notebook servers
        for srv in notebookapp.list_running_servers():
            # build the URL for sessions
            url = srv['url'].rstrip('/') + '/api/sessions'
            token = srv.get('token', '')
            params = {'token': token} if token else {}

            try:
                resp = requests.get(url, params=params)
                resp.raise_for_status()
            except Exception:
                continue

            # 3) look for our kernel in their active sessions
            for sess in resp.json():
                if sess['kernel']['id'] == kernel_id:
                    # 4) reconstruct the full path
                    rel_path = sess['notebook']['path']       # e.g. "subdir/MyNotebook.ipynb"
                    notebook_path = os.path.join(srv['notebook_dir'], rel_path)
                    break
            else:
                continue
            break
        else:
            raise RuntimeError("Could not locate the current notebook path")

        filename = os.path.basename(notebook_path)
        mlflow.log_artifact(notebook_path, filename)

        # Log the dataset
        inputs_path = os.path.join(local_dir, "inputs.npy")
        target_path = os.path.join(local_dir, "target.npy")
        target_names_path = os.path.join(local_dir, "target_names.txt")
        pos_counts_path = os.path.join(local_dir, "pos_counts.txt")

        ## save locally
        np.save(inputs_path, inputs)
        np.save(target_path, target)

        with open(target_names_path, "w") as f:
            for name in target_names:
                f.write(f"{name}\n")

        with open(pos_counts_path, "w") as f:
            for i, count in enumerate(pos_counts):
                f.write(f"{target_names[i]}: {count}\n")

        ## upload to mlflow
        mlflow.log_artifact(inputs_path)
        mlflow.log_artifact(target_path)
        mlflow.log_artifact(target_names_path)
        mlflow.log_artifact(pos_counts_path)

        # Log parameters for browsing
        mlflow.log_param("target_names", target_names)
        mlflow.log_param("input_features", inputs.shape[1])
        mlflow.log_param("output_features", target.shape[1])
        mlflow.log_param("num_samples", n_samples)

        pos_counts_dict = {name: count for name, count in zip(target_names, pos_counts)}
        mlflow.log_param("pos_counts", pos_counts_dict)

        # Log description
        if description:
            mlflow.set_tag("description", description)
        
        # Print MLflow experiment URL
        experiment = mlflow.get_experiment_by_name(dataset_name)
        if experiment:
            print("\nAccess your dataset:")
            print(f"View dataset at: https://{MLFLOW_USERNAME}:{MLFLOW_PASSWORD}@{MLFLOW_DOMAIN}/#/experiments/{experiment.experiment_id}")
            print(f"View this version at: https://{MLFLOW_USERNAME}:{MLFLOW_PASSWORD}@{MLFLOW_DOMAIN}/#/experiments/{experiment.experiment_id}/runs/{run.info.run_id}")


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import necessary libraries
import numpy as np
import os
from glob import glob
from tqdm import tqdm

class_names = ["alkane", "methyl", "alkene", "alkyne", "alcohols", "amines", "nitriles", "aromatics",
 "alkyl halides", "esters", "ketones", "aldehydes", "carboxylic acids", "ether",
 "acyl halides", "amides", "nitro"]

# Define the data directory
data_dir = "../data/ftir"
splits = ["train", "valid", "test"]

# Initialize arrays to store combined data
all_inputs = []
all_targets = []

# Load data from all splits
print("Loading FTIR dataset from all splits...")
for split in splits:
    split_dir = os.path.join(data_dir, split)
    
    # Get all sample IDs
    npy_paths = glob(os.path.join(split_dir, "*.npy"))
    ids = [int(os.path.splitext(os.path.basename(path))[0]) for path in npy_paths]
    ids.sort()
    
    print(f"Loading {len(ids)} samples from {split} split...")
    
    # Load each sample
    for sample_id in tqdm(ids, desc=f"Loading {split}", unit="sample"):
        npy_path = os.path.join(split_dir, f"{sample_id}.npy")
        txt_path = os.path.join(split_dir, f"{sample_id}.txt")
        
        # Load feature vector
        x = np.load(npy_path)
        
        # Load target labels
        with open(txt_path, "r") as f:
            y = np.array([int(tok) for tok in f.read().strip().split()], dtype=np.int32)
        
        all_inputs.append(x)
        all_targets.append(y)

# Convert lists to numpy arrays
inputs = np.vstack(all_inputs)  # Stack vertically to create (n_samples, n_features)
target = np.vstack(all_targets)  # Stack vertically to create (n_samples, n_classes)

# Print dataset statistics
print(f"\nDataset statistics:")
print(f"Total samples: {inputs.shape[0]}")
print(f"Feature dimension: {inputs.shape[1]}")
print(f"Number of classes: {len(class_names)}")
print(f"Positive samples per class:")
for i, name in enumerate(class_names):
    count = np.sum(target[:, i])
    percent = count / len(target) * 100
    print(f" - {name}: {count} ({percent:.2f}%)")

# Package the data in the required format
data = {
    "inputs": inputs,
    "target": target,
    "target_names": class_names
}

Loading FTIR dataset from all splits...
Loading 6342 samples from train split...


Loading train: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 6342/6342 [00:01<00:00, 4125.93sample/s]


Loading 1387 samples from valid split...


Loading valid: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1387/1387 [00:00<00:00, 2262.90sample/s]


Loading 933 samples from test split...


Loading test: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 933/933 [00:00<00:00, 3430.38sample/s]



Dataset statistics:
Total samples: 8662
Feature dimension: 3602
Number of classes: 17
Positive samples per class:
 - alkane: 5986 (69.11%)
 - methyl: 5557 (64.15%)
 - alkene: 1165 (13.45%)
 - alkyne: 227 (2.62%)
 - alcohols: 2339 (27.00%)
 - amines: 817 (9.43%)
 - nitriles: 375 (4.33%)
 - aromatics: 5018 (57.93%)
 - alkyl halides: 2405 (27.76%)
 - esters: 961 (11.09%)
 - ketones: 787 (9.09%)
 - aldehydes: 207 (2.39%)
 - carboxylic acids: 629 (7.26%)
 - ether: 2155 (24.88%)
 - acyl halides: 96 (1.11%)
 - amides: 165 (1.90%)
 - nitro: 443 (5.11%)


In [3]:
# Upload the FTIR dataset to MLflow
upload_dataset(
    data=data,
    dataset_name="ftir_complete",
    version_name="combined_splits",
    description="""
    Complete FTIR spectroscopy dataset combined from train, validation, and test splits.
    
    This dataset contains FTIR (Fourier-transform infrared) spectroscopy data from the FCG-former paper,
    with 17 functional group classes. Each sample is labeled with the presence/absence of each functional group.
    
    Features: FTIR spectra
    Targets: Binary labels for 17 functional groups (alkane, methyl, alkene, etc.)
    Combined from: train, validation, and test splits
    """
)

🏃 View run combined_splits at: https://data_user:ais7Rah2foo0gee9@mlflow.gritans.lv/#/experiments/6/runs/9660661d08d74b3d90fc3d2ef49fe485
🧪 View experiment at: https://data_user:ais7Rah2foo0gee9@mlflow.gritans.lv/#/experiments/6
