# SciCat workshop exercise

This exercise walks you through downloading a dataset and data files from SciCat and uploading processed data to SciCat.
It uses a basic, contrived workflow to process the data using [Scipp](https://scipp.github.io/).

In [None]:
import scipp as sc
from scitacean import Client, Dataset
from scitacean.transfer.ssh import SSHFileTransfer

%matplotlib widget

## Setup

The next cell contains workshop-specific configuration: it points to the staging instance of SciCat and to a temporary upload folder.
The production instance is currently located at `"https://scicat.ess.eu/api/v3"`,
and the source folder will eventually be under `/ess/data`.
But those are for permanent storage and are not used in this exercise.

In [None]:
scicat_url = "https://staging.scicat.ess.eu/api/v3"
source_folder = "/mnt/groupdata/scicat/upload/workshop/20230322/{pid.pid}"
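
The `{pid.pid}` part of `source_folder` is a Python format field that is presumably filled in for each dataset at upload time, so that every dataset gets its own folder.
As a rough illustration of the formatting mechanics only, the expansion behaves like the sketch below.
`FakePID`, `expand_source_folder`, and the example ID are made-up stand-ins and not Scitacean types or real identifiers.

```python
from dataclasses import dataclass


@dataclass
class FakePID:
    """Hypothetical stand-in for a dataset PID with a prefix and an ID part."""

    prefix: str
    pid: str


def expand_source_folder(template: str, pid: FakePID) -> str:
    # str.format resolves attribute access inside the braces, so
    # "{pid.pid}" is replaced by the `pid` attribute of the object.
    return template.format(pid=pid)


template = "/mnt/groupdata/scicat/upload/workshop/20230322/{pid.pid}"
example = FakePID(prefix="20.500.12269", pid="made-up-id-1234")
print(expand_source_folder(template, example))
```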

Get your access token from SciCat:

1. Log in at `https://staging.scicat.ess.eu`
2. Click on your user icon in the top-right and go to 'Settings'.
3. Copy 'Catamel Token' as a string to the `token` variable below.

In [None]:
token = "4UCWM97oLo0UvqFMPk91uYYqLz1H4llxMPExmXxPi8e6Bi9AKAFA2bTGoJVpVWCP"
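
Pasting the token into the notebook is convenient for a workshop, but it is easy to share the notebook with the token still in it.
One possible alternative is to read the token at runtime, sketched here under the assumption that you have exported it in an environment variable called `SCICAT_TOKEN`.
That name and the `read_token` helper are choices made for this example; neither SciCat nor Scitacean defines them.

```python
import os


def read_token(env_var: str = "SCICAT_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    The variable name is a hypothetical choice for this example.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set or empty")
    return value
```

With this in place, `token = read_token()` would replace the hard-coded string.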

Set the host name that you use to connect to 'login' with SSH.
Your `ssh-agent` must be set up to connect to this host without asking for a password / passphrase on the terminal.

In [None]:
ssh_host = "login.esss.dk"
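
To check the password-free requirement before creating the client, one option is to run `ssh` in batch mode, where it fails instead of prompting.
The helper names below are made up for this sketch; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess


def ssh_check_command(host: str, timeout_s: int = 10) -> list[str]:
    """Build an ssh command that fails rather than prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",  # run a no-op on the remote host
    ]


def can_connect_without_prompt(host: str) -> bool:
    """Return True if a non-interactive SSH login to `host` succeeds."""
    result = subprocess.run(ssh_check_command(host), capture_output=True)
    return result.returncode == 0
```

Calling `can_connect_without_prompt(ssh_host)` should return `True` once your `ssh-agent` is set up for this host.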

## Fetch the input data

Create a client to talk to the SciCat server and file server:

In [None]:
client = Client.from_token(
    url=scicat_url,
    token=token,
    file_transfer=SSHFileTransfer(
        host=ssh_host,
        source_folder=source_folder,
    ),
)

Find the ID of the raw dataset in the web interface of SciCat:

In [None]:
input_pid = "20.500.12269/f5ac29c4-95fa-4bea-bde1-00ea1fbc1b0e"

1. Download the dataset with the given PID.
2. Inspect the dataset to make sure it is the correct one.
3. Download its files to a local folder of your choice.

Check out https://scicatproject.github.io/scitacean/ to find out how these things work.

In [None]:
raw_dataset = client.get_dataset(input_pid)

In [None]:
raw_dataset

In [None]:
raw_dataset = client.download_files(raw_dataset, target="./data")

In [None]:
raw_dataset

In [None]:
(input_file,) = raw_dataset.files
input_file_name = input_file.local_path

## Process the data

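The data is a crude mock-up of a wavelength spectrum.
Your task is to subtract the background and normalize the result by the proton charge, as done in the following cells.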

In [None]:
raw_data = sc.io.open_hdf5(input_file_name)

In [None]:
raw_data.plot(ls="-", marker=None)

In [None]:
background_range = slice(1.3 * sc.Unit("Å"), 1.4 * sc.Unit("Å"))

In [None]:
background = raw_data["wavelength", background_range].mean()
background

In [None]:
ch = raw_dataset.meta["proton_charge"]
proton_charge = sc.scalar(ch["value"], unit=ch["unit"])
proton_charge

In [None]:
corrected = (raw_data - background) / proton_charge
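
To make the arithmetic of this correction concrete, here is a plain-Python sketch on made-up numbers.
It mirrors the steps (mean over a wavelength range, subtraction, division by a scalar) but does not use Scipp, so it carries no units or coordinates.
The function name, the example values, and the half-open range `low <= wavelength < high` are assumptions of this sketch.

```python
def correct_spectrum(wavelengths, counts, low, high, charge):
    """Subtract the mean count in [low, high) and divide by `charge`."""
    # Select the counts whose wavelength falls in the background range.
    in_range = [c for w, c in zip(wavelengths, counts) if low <= w < high]
    if not in_range:
        raise ValueError("No data points in the background range")
    background = sum(in_range) / len(in_range)
    return [(c - background) / charge for c in counts]


# Made-up example data, not taken from the workshop file.
wavelengths = [1.0, 1.25, 1.5, 1.75, 2.0]
counts = [12.0, 2.0, 4.0, 10.0, 3.0]
corrected_counts = correct_spectrum(
    wavelengths, counts, low=1.25, high=1.75, charge=2.0
)
print(corrected_counts)  # [4.5, -0.5, 0.5, 3.5, 0.0]
```

Here the background is the mean of the two counts in the range, `(2.0 + 4.0) / 2 = 3.0`, which is subtracted from every point before dividing by the charge.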

In [None]:
corrected.plot(ls="-", marker=None)

## Save the derived data

1. Use [DataArray.save_hdf5](https://scipp.github.io/generated/classes/scipp.DataArray.html#scipp.DataArray.save_hdf5) to save the corrected data to file.
2. Make a derived dataset from the input dataset and the file you just wrote.
   (Tip: Use [Dataset.derive](https://scicatproject.github.io/scitacean/generated/classes/scitacean.Dataset.html#scitacean.Dataset.derive).)
3. Inspect the derived dataset in Jupyter.
    - Do all fields make sense?
    - Is the file path correct?
4. Upload the derived dataset and data file to SciCat (using the client from above).
5. Inspect the dataset in the web interface and the file with SSH.

In [None]:
corrected.to_hdf5("data/corrected.h5")

In [None]:
# Keep a number of fields from the raw dataset.
# This works here because the authors of this notebook also own the raw data.
# You probably do not!
derived = raw_dataset.derive(
    keep=(
        "contact_email",
        "instrument_id",
        "investigator",
        "orcid_of_owner",
        "owner",
        "owner_email",
        "techniques",
        "license",
        "data_format",
        "proposal_id",
        "sample_id",
    )
)
# Make sure that you are in this group! Otherwise you cannot access your dataset!
derived.owner_group = "ess"
derived.access_groups = ["dmsc"]

In [None]:
derived.add_local_files("data/corrected.h5", base_path="data")
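
The `base_path` argument controls which part of the local path is kept as the file's path within the dataset.
This sketch assumes that the stored path is the local path relative to `base_path`; it reproduces that rule with `pathlib` only and does not call Scitacean.
The function name `remote_path_for` is made up for the example.

```python
from pathlib import PurePosixPath


def remote_path_for(local_path: str, base_path: str) -> str:
    """Return `local_path` relative to `base_path` as a POSIX-style string."""
    return PurePosixPath(local_path).relative_to(base_path).as_posix()


print(remote_path_for("data/corrected.h5", base_path="data"))
print(remote_path_for("data/run1/corrected.h5", base_path="data"))
```

Under that assumption, the file added above would appear as `corrected.h5` in the dataset, without the leading `data/` folder.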

In [None]:
derived

- Upload the new dataset to SciCat and the file to the file server.
- Catch and inspect the return value of `client.upload_new_dataset_now`.

<div class="alert alert-warning">

**Warning**

Every time you call `client.upload_new_dataset_now`, it will create a new dataset in SciCat and upload a copy of the file.
Ideally, do not keep a call to this function around in the notebook so you don't accidentally end up uploading lots of duplicate data.
</div>
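
One way to guard against accidental re-uploads when re-running cells is to wrap the call in a small helper that refuses to run twice.
This is only a sketch: `UploadGuard` is a made-up helper, not part of Scitacean, and it takes the upload function as an argument so that nothing is uploaded until you pass in `client.upload_new_dataset_now` yourself.

```python
class UploadGuard:
    """Call an upload function at most once per guard instance."""

    def __init__(self, upload_fn):
        self._upload_fn = upload_fn
        self._result = None
        self._done = False

    def upload(self, dataset):
        if self._done:
            raise RuntimeError("Dataset was already uploaded in this session")
        self._result = self._upload_fn(dataset)
        # Only mark as done after the upload function returned without error.
        self._done = True
        return self._result
```

You could create the guard once with `guard = UploadGuard(client.upload_new_dataset_now)` and then call `finalized = guard.upload(derived)`.
Note that re-running the cell that creates the guard resets it, so keep that cell separate from the upload call.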

In [None]:
# Uncomment the next line to upload, then comment it out again to avoid duplicate uploads.
# finalized = client.upload_new_dataset_now(derived)

In [None]:
finalized