# Basic workflow with ManGO

## Authentication with iCommands

<div class="alert alert-block alert-info">
<h3>Every seven days</h3>
    
1. Go to https://mango.vscentrum.be/
2. In the tab of your zone, click on "How to connect"
3. Copy the snippet provided under "iCommands Client on Linux".

<font size=3>Then **paste the snippet** in the cell below, right under `%%bash`, as in the (anonymized) example.</font>

(Replacing `USER` in the example with your username and `TOKEN` with the password provided by "How to connect" should also work.)

**You don't need to do this every time: The authentication lasts 7 days.**
</div>

In [12]:
%%bash
mkdir -p ~/.irods
cat > ~/.irods/irods_environment.json <<'EOF'
{
    "irods_host": "vsc.irods.hpc.kuleuven.be",
    "irods_port": 1247,
    "irods_zone_name": "vsc",
    "irods_authentication_scheme": "pam_password",
    "irods_encryption_algorithm": "AES-256-CBC",
    "irods_encryption_salt_size": 8,
    "irods_encryption_key_size": 32,
    "irods_encryption_num_hash_rounds": 8,
    "irods_user_name": "USER",
    "irods_ssl_ca_certificate_file": "",
    "irods_ssl_verify_server": "cert",
    "irods_client_server_negotiation": "request_server_negotiation",
    "irods_client_server_policy": "CS_NEG_REQUIRE",
    "irods_default_resource": "default",
    "irods_cwd": "/vsc/home"
}
EOF
iinit -h | grep Version | grep -v -q 4.2. || sed -i 's/"irods_authentication_scheme": "pam_password"/"irods_authentication_scheme": "PAM"/' ~/.irods/irods_environment.json
echo 'TOKEN' | iinit --ttl 168 >/dev/null && echo You are now authenticated to irods. Your session is valid for 168 hours.
ils


## Setting up

The first step _in each session_ is to set up ManGO (and load any other libraries you need).

In [14]:
import os
import ssl
from irods.session import iRODSSession # to communicate with ManGO
from mango_mdschema import Schema, ValidationError, ConversionError # to add structured metadata
import logging
logger = logging.getLogger("mango_mdschema") # to read the validation
logger.setLevel(logging.INFO)
try:
    env_file = os.environ['IRODS_ENVIRONMENT_FILE']
except KeyError:
    env_file = os.path.expanduser('~/.irods/irods_environment.json')

ssl_context = ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH,
        cafile=None, capath=None, cadata=None
        )
ssl_settings = {'ssl_context': ssl_context}

Since we are working interactively we will create an `irods.session.iRODSSession` object and then close it at the end of the notebook with `session.cleanup()`. If you were working on a script, you could run all your code inside a `with` statement.

In [15]:
session = iRODSSession(irods_env_file=env_file, **ssl_settings)
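
In a script, the `with` form mentioned above could look like the sketch below. It reuses the environment file lookup from the setup cell; `find_env_file` and `print_collection_path` are hypothetical helper names, and the `irods` import sits inside the function only so that the sketch can be loaded where python-irodsclient is not installed.

```python
import os
import ssl


def find_env_file():
    """Return the iRODS environment file, preferring IRODS_ENVIRONMENT_FILE."""
    default = os.path.expanduser("~/.irods/irods_environment.json")
    return os.environ.get("IRODS_ENVIRONMENT_FILE", default)


def print_collection_path(collection_path):
    """Open a session, print the path of one collection, and close the session."""
    # Imported here so the helpers above can be used without python-irodsclient.
    from irods.session import iRODSSession

    ssl_context = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
    # Leaving the `with` block cleans up the session, also after an error.
    with iRODSSession(irods_env_file=find_env_file(), ssl_context=ssl_context) as session:
        coll = session.collections.get(collection_path)
        print(coll.path)
```

A script that calls `print_collection_path()` with the path of a project collection then needs no explicit `session.cleanup()`.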

The final step to set up your environment is to store the path of your working collection in a variable. The cell below uses "/gbiomed/home/BADS/"; other projects have other paths (the training project, for example, is "/vsc/home/ManGO-VIB/"), so replace it with the path of your own project.

In [16]:
home_dir = "/gbiomed/home/BADS/"

You will need all the code above at the start of any notebook that needs to connect to ManGO.

--------------

The code below illustrates the basic functions for communicating with ManGO; treat it as a cheatsheet and use it at your convenience.

## Collections

You can connect to a specific iRODS collection with `session.collections.get("/path/to/collection")`; this can be your home collection, a project collection or any sub-collection. The object it returns gives you basic information about that collection, such as its path.
Let's retrieve our existing home collection:

In [17]:
coll = session.collections.get(home_dir)
coll


In [9]:
coll.path


The subcollections and data objects contained in a collection can be retrieved with the `subcollections` and `data_objects` attributes, respectively. We can also use the `.walk()` method to get the full tree.

<div class="alert alert-block alert-info">
<b>Note</b>: Your output will differ depending on your read permissions; you'll only see the datasets that you have access to and the collection of your team.
</div>

In [None]:
coll.subcollections

In [None]:
for item in coll.subcollections[0].walk():
    print(f"{item[0]} contains {len(item[1])} subcollections and {len(item[2])} data objects.")
    if len(item[1]) == 1:
        print("The subcollection is:", item[1][0])
    if len(item[2]) == 1:
        print("The data object is:", item[2][0])
    print()

## Editing data objects and collections in ManGO

This section shows how to create new collections and data objects, upload local data to ManGO and remove data objects from ManGO.
Here we will create the "input" collection inside the "example" collection.

In [None]:
example_dir = home_dir + "example/"
example_coll = session.collections.get(example_dir)
example_subcoll = session.collections.create(example_dir + "input/") # this won't work if you created it before
example_subcoll.path
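
If you rerun the notebook, the collection may already exist. One way to make the cell repeatable is the sketch below; `get_or_create_collection` is a hypothetical helper name, and it relies on the `exists()`, `get()` and `create()` methods of `session.collections`.

```python
def get_or_create_collection(session, path):
    """Return the collection at `path`, creating it only if it is missing."""
    if session.collections.exists(path):
        # The collection is already there: just retrieve it.
        return session.collections.get(path)
    # Otherwise create it; create() returns the new collection object.
    return session.collections.create(path)
```

With this helper, `example_subcoll = get_or_create_collection(session, example_dir + "input/")` gives the same object on the first and on later runs.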

In [None]:
example_subcoll.data_objects

In [None]:
# create a new data object
session.data_objects.create(example_dir + "input/new_object.txt")
example_subcoll.data_objects

When you have results to upload to ManGO, save your output locally and then send it to ManGO with `session.data_objects.put()` (the Python counterpart of the `iput` iCommand).

In [None]:
testfile = [f for f in os.listdir() if f.endswith("fastq")][0]
testfile

In [None]:
# send a local file
session.data_objects.put(testfile, example_dir + "input/" + testfile)
example_subcoll.data_objects

For example, this is how the input fastq files that you will use in the exercises were sent to ManGO.

In [None]:
# DO NOT RUN
# source_input_dir = "/staging/leuven/stg_00079/teaching/prep_mango/input"
# for file in os.listdir(source_input_dir):
#    session.data_objects.put(f"{source_input_dir}/{file}", f"{home_dir}input/{file}")

Data objects can be removed from ManGO with `unlink()`. The `force` argument indicates whether it should be permanently deleted (`True`) or sent to trash (`False`). Objects in the trash get removed automatically after 14 days.

In [None]:
# remove data objects
session.data_objects.unlink(example_dir + "input/" + testfile, force=True)
example_subcoll.data_objects

## Download data from ManGO

To work with the data you have on ManGO, download it with `session.data_objects.get()`. Called with only the path of a data object, it returns an object with the usual information about it; if you also provide a local path as second argument, the file is downloaded to that path as well.

In [None]:
object_to_download = example_coll.data_objects[0]
source_path = object_to_download.path
filename = object_to_download.name
f"We will download the object in '{source_path}' to (local) '{filename}'."

In [None]:
os.path.exists(filename)

In [None]:
session.data_objects.get(source_path, filename)
os.path.exists(filename)
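
Running the download cell a second time may fail, because by default python-irodsclient does not overwrite an existing local file. A minimal sketch of a guard is shown below; `download_if_missing` is a hypothetical helper name, and skipping the download (rather than forcing an overwrite) is a choice made here for illustration.

```python
import os


def download_if_missing(session, source_path, local_path):
    """Download `source_path` unless `local_path` already exists.

    Returns True when a download happened and False when it was skipped.
    """
    if os.path.exists(local_path):
        # Keep the local copy untouched instead of overwriting it.
        return False
    session.data_objects.get(source_path, local_path)
    return True
```

For instance, `download_if_missing(session, source_path, filename)` can be rerun safely while you experiment with the notebook.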

Once you have downloaded the file you can use normal Python commands to do something with it, like reading the contents of a text file or showing an image.

In [None]:
with open(filename, 'r') as f:
    first_line = f.readline()
first_line

## Checksum

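You can check the sha2 checksum of a data object with the `checksum` attribute, provided it has been computed with the `chksum()` method. The attribute is read when the object is retrieved, so if you have only just computed the checksum you need to retrieve the object again before `checksum` shows it.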

In [None]:
obj = session.data_objects.get(source_path)

In [None]:
obj.checksum
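
To compare a downloaded file against the stored checksum, you can compute the same value locally. The sketch below assumes the usual iRODS representation of sha2 checksums, namely the prefix `sha2:` followed by the base64-encoded SHA-256 digest; `irods_sha2_checksum` is a hypothetical helper name.

```python
import base64
import hashlib


def irods_sha2_checksum(local_path, chunk_size=1024 * 1024):
    """Compute the SHA-256 of a local file in the 'sha2:<base64>' form."""
    digest = hashlib.sha256()
    with open(local_path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")
```

After a download, `irods_sha2_checksum(filename) == obj.checksum` should then hold if the local copy is intact and the checksum has been set on ManGO.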

## Metadata

An important feature of Tier-1 Data/iRODS is the ability to add metadata to collections and data objects. More interestingly, we can use the `mango-mdschema` package to add structured metadata and validate it against a schema.

In [None]:
obj.metadata.items()

In [None]:
fastq_schema = Schema("fastq-3.0.0-published.json")

In [None]:
print(fastq_schema)

In [None]:
fastq_schema.print_requirements("sample")

In [None]:
fastq_schema.print_requirements("organism")

In [None]:
fastq_schema.print_requirements("fastq")

In [None]:
example_metadata = {
    "sample": {
        "sample_id": "18S_amplicon",
        "condition": "normal"
    },
    "organism": "Mouse",
    "fastq" : {
        "encoding": "phred64",
        "no_records": 109831
    }
}
fastq_schema.validate(example_metadata)

In [None]:
example_metadata = {
    "sample": {
        "sample_id": "18S_amplicon",
        "condition": "normal"
    },
    "organism": "mouse",
    "fastq" : {
        "encoding": "Phred+64",
        "no_records": "109831"
    }
}
fastq_schema.validate(example_metadata)

In [None]:
fastq_schema.apply(obj, example_metadata)

In [None]:
obj.metadata.items()

In [None]:
fastq_schema.from_avus(obj.metadata.items())

It is also possible to retrieve a particular metadata item (`iRODSMeta` instance) by its name. The `iRODSMeta` instance has `name`, `value` and `units`:

In [None]:
one_avu = obj.metadata["mgs.fastq.fastq.no_records"]
print("Name: ", one_avu.name)
print("Value: ", one_avu.value)
print("Units: ", one_avu.units)
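
The name `mgs.fastq.fastq.no_records` follows a dotted pattern: a `mgs` prefix, the schema name, and then the path of the field inside the metadata dictionary. The sketch below only reproduces that naming pattern for illustration. It is not how `mango_mdschema` is implemented (the package also validates and converts values), and `flatten_metadata` is a hypothetical helper name.

```python
def flatten_metadata(metadata, schema_name, prefix="mgs"):
    """Map a nested metadata dict to flat, dot-separated AVU names."""
    flat = {}

    def walk(node, parent):
        for key, value in node.items():
            name = f"{parent}.{key}"
            if isinstance(value, dict):
                # Nested dictionaries add one more level to the name.
                walk(value, name)
            else:
                # iRODS stores AVU values as strings.
                flat[name] = str(value)

    walk(metadata, f"{prefix}.{schema_name}")
    return flat
```

Applied to the `fastq` part of the example metadata, this yields names such as `mgs.fastq.fastq.encoding`, which is the form used when filtering on metadata in queries.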

## Queries

We can run queries with `session.query()`, which collects information about collections, data objects, and their metadata through specific classes. More interestingly, we can filter that information based on certain criteria.


Class | Information about | Useful attributes
---- | ------ | ----------
`Collection` | A collection | `name`, `owner_name`, `id` ...
`DataObject` | A data object | `name`, `path`, `size`, `owner_name`, `id` ...
`CollectionMeta` | The metadata of a collection | `name`, `value`, `units`, ...
`DataObjectMeta` | The metadata of a data object | `name`, `value`, `units`, ...

In [None]:
from irods.models import Collection, DataObject, CollectionMeta, DataObjectMeta
from irods.column import Criterion

The following query retrieves all the collections inside our project collection (`home_dir`), regardless of their depth, and prints their paths.

In [None]:
query = session.query(Collection.name)
for result in query:
    if result[Collection.name].startswith(home_dir):
        print(result[Collection.name])

In the cells below, we request the path of our collections and the names, creation dates and paths of our data objects.
Then we filter the results based on the following criteria:

- The collection path has to end in "ple" ('like' + '%ple').
- The data object is smaller than 1GB in size.
- The data object must have been created after '2023-01-27 13:45:25'.
    + In order to define the date-time threshold we use the `datetime` library.
- The data object should have an "encoding" metadata item in the "fastq" field of the "fastq" schema (`mgs.fastq.fastq.encoding`) with value "Phred+64".

In [None]:
import datetime
threshold = datetime.datetime.fromisoformat('2023-01-27 13:45:25')

In [None]:
my_files = session.query(Collection.name, DataObject.name, DataObject.create_time, DataObject.path).filter(
    Criterion('like', Collection.name, '%ple')).filter(
    Criterion('<', DataObject.size, 1000000000)).filter(
    Criterion('>', DataObject.create_time, threshold)).filter(
    Criterion('=', DataObjectMeta.name, 'mgs.fastq.fastq.encoding')).filter(
    Criterion('=', DataObjectMeta.value, 'Phred+64')
    )
for item in my_files:
    print(item[DataObject.name], item[Collection.name], item[DataObject.create_time])

In [None]:
[item[DataObject.path] for item in my_files] # DO NOT USE THIS FOR DOWNLOADING

In [None]:
[f"{item[Collection.name]}/{item[DataObject.name]}" for item in my_files] # THIS WORKS FOR DOWNLOADING
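
The difference between the two cells above is that `DataObject.path` holds the physical location of the replica on the storage resource, whereas downloading needs the logical path (collection path plus data object name). A small sketch of a helper that builds logical paths from query rows is given below; `logical_paths` is a hypothetical name, and the two column arguments are whatever keys the rows are indexed with.

```python
import posixpath


def logical_paths(rows, collection_column, name_column):
    """Build logical iRODS paths from query result rows."""
    # iRODS paths always use forward slashes, hence posixpath and not os.path.
    return [
        posixpath.join(row[collection_column], row[name_column])
        for row in rows
    ]
```

In this notebook it would be called as `logical_paths(my_files, Collection.name, DataObject.name)`, and each resulting path can be passed to `session.data_objects.get()`.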

The `.execute()` method returns a printable table with the columns requested in `.query()`.

In [None]:
print(my_files.execute())

## Clean up
<div class="alert alert-block alert-warning">
    <font size=4><b>Do not forget to clean up your session!</b></font>
</div>

In [None]:
# keep this cell at the end and run it every time you are done
session.cleanup()

In [None]:
# RESET in order to replay the demo
session.collections.get(example_dir + "input/").remove()
os.remove(filename)