# Working with data

On your computer you can open a file explorer that shows you all your files and directories. iRODS has something similar: data objects, which for the moment you can think of as files, and collections, which are similar to directories.

In the course of this and the next tutorials it will become clear that this is only an analogy and that there is more to data objects and collections.

## Create an iRODS session

Again, to work from your laptop with data in iRODS you will need a `session`, as described in [01-Setup-and-connect](01-Setup-and-connect.ipynb).

In [1]:
import os
print(os.getcwd())
os.chdir('..')
print(os.getcwd()) # needs to return ../iBridges

/Users/staig001/git-repos/iBridges/Tutorials
/Users/staig001/git-repos/iBridges


In [2]:
from ibridges.interactive import interactive_auth
session = interactive_auth()

Auth without password


## Create a collection and upload a file

We will now create a new collection on iRODS in your current iRODS working location:

In [3]:
from ibridges.path import IrodsPath
from ibridges.data_operations import create_collection
irods_path = IrodsPath(session, '~')
print("Current working location:", irods_path)
irods_coll_path = irods_path.joinpath('demo')
print("New collection name:", irods_coll_path)
coll = create_collection(session, irods_coll_path)
print("New collection is created:", irods_coll_path.collection_exists())

Current working location: /nluu12p/home/research-test-christine
New collection name: /nluu12p/home/research-test-christine/demo
New collection is created: True


Now we will upload a file to that new iRODS collection. In this tutorial we assume that you have a file called `demofile.txt` in your home directory:
- Linux: `/home/<user>/demofile.txt`
- Mac: `/Users/<user>/demofile.txt`
- Windows: `C:\Users\<user>\demofile.txt`

Let's check that first.

In [4]:
import os
from pathlib import Path
local_file = Path(os.path.expanduser('~')).joinpath('demofile.txt')
if os.path.isfile(local_file):
    print('You are good to follow the next steps!')
else:
    print(f'Please create a file {local_file} before you continue')

You are good to follow the next steps!
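
If the file does not exist yet, a few lines of Python are enough to create it. The following is only a sketch: the helper name `ensure_demo_file` and the file content are assumptions made for this illustration and are not part of iBridges.

```python
from pathlib import Path


def ensure_demo_file(directory: Path, name: str = "demofile.txt") -> Path:
    """Create a small text file in `directory` unless it already exists."""
    target = directory / name
    if not target.exists():
        # The content is arbitrary; any small text file works for this tutorial.
        target.write_text("My first iRODS data object\n", encoding="utf-8")
    return target
```

Calling `ensure_demo_file(Path.home())` would place the file in your home directory and leave an existing `demofile.txt` untouched.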


Now we can upload that file:

In [5]:
from ibridges import upload
upload(session, local_file, irods_coll_path)

OVERWRITE_WITHOUT_FORCE_FLAG: 

We have now created a new data object in iRODS:
<img src="img/DataObject1.png" width="400">

How can we be sure that the file has been uploaded? Remember, when we created the new collection `demo`, we created a Python object called `coll`. Let's inspect that:

In [6]:
print(coll.name)
print(coll.path)

demo
/nluu12p/home/research-test-christine/demo


Apart from information about its name and its path, a collection also carries a list of `subcollections` and `data_objects`.

In [7]:
print(f'Current subcollections in {coll.path}:', coll.subcollections)
print(f'Current data objects in {coll.path}:', coll.data_objects)

Current subcollections in /nluu12p/home/research-test-christine/demo: []
Current data objects in /nluu12p/home/research-test-christine/demo: [<iRODSDataObject 24384113 demofile.txt>]
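
Because a collection exposes its `subcollections` and `data_objects`, a short recursive function can walk a whole collection tree. The following is a minimal sketch: the function name `list_data_object_paths` is hypothetical, and the `SimpleNamespace` stand-ins only mimic the attributes used in this tutorial so that the example runs without an iRODS connection. With a real session you would pass `coll` instead.

```python
from types import SimpleNamespace


def list_data_object_paths(coll):
    """Return the paths of all data objects in `coll` and its subcollections."""
    paths = [obj.path for obj in coll.data_objects]
    for sub in coll.subcollections:
        paths.extend(list_data_object_paths(sub))
    return paths


# Stand-ins that mimic only the attributes used here (not real iRODS objects).
sub = SimpleNamespace(
    path="/zone/home/user/demo/sub",
    subcollections=[],
    data_objects=[SimpleNamespace(path="/zone/home/user/demo/sub/b.txt")],
)
demo = SimpleNamespace(
    path="/zone/home/user/demo",
    subcollections=[sub],
    data_objects=[SimpleNamespace(path="/zone/home/user/demo/demofile.txt")],
)

print(list_data_object_paths(demo))
```

The function lists the data objects of the collection itself first and then descends into each subcollection.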


## Download the file and the collection

Of course we can also download the data object we have just created. We will download it to the `Downloads` directory on your computer:

In [8]:
local_path = Path(os.path.expanduser('~')).joinpath('Downloads')
assert local_path.is_dir()

In [9]:
from ibridges import download
download(session, irods_coll_path.joinpath('demofile.txt'), local_path)

OVERWRITE_WITHOUT_FORCE_FLAG: 

You can verify that the file has been downloaded.
Now, what happens if we download the file again?

In [10]:
download(session, irods_coll_path.joinpath('demofile.txt'), local_path)

OVERWRITE_WITHOUT_FORCE_FLAG: 

You will receive the exception `OVERWRITE_WITHOUT_FORCE_FLAG`. This means that the file already exists at the destination.

**Note: neither the upload nor the download function overwrites existing data by default.**

You can force overwriting of the existing data by setting the flag `overwrite`:

In [11]:
download(session, irods_coll_path.joinpath('demofile.txt'), local_path, overwrite=True)
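
If you would rather decide per file than always pass `overwrite=True`, you can check the local target first. The sketch below uses only the standard library. The helper name `needs_overwrite` is hypothetical, and the sketch assumes that the downloaded file keeps the name of the data object.

```python
from pathlib import Path, PurePosixPath


def needs_overwrite(irods_path: str, local_dir: Path) -> bool:
    """Return True if a file with the data object's name already exists locally."""
    # iRODS paths use forward slashes, so PurePosixPath extracts the name safely.
    name = PurePosixPath(irods_path).name
    return (local_dir / name).exists()
```

For example, `needs_overwrite('/nluu12p/home/research-test-christine/demo/demofile.txt', local_path)` returns `True` once the file has been downloaded, so you can decide whether to skip the download or to set `overwrite=True`.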

## System metadata

Both collections and data objects in iRODS are labeled automatically with some extra information. This information is called *system metadata*. We already saw the `name` and the `path` of the collection. But there is more!

In [12]:
print(f'Collection {coll.name} was created on {coll.create_time}')
print(f'The collection was last modified on {coll.modify_time}')
print(f'The collection was uploaded by and is owned by {coll.owner_name}')

Collection demo was created on 2024-02-28 16:12:55
The collection was last modified on 2024-02-28 16:21:45
The collection was uploaded by and is owned by c.staiger@uu.nl
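
The timestamps can also be used in calculations. The sketch below assumes that `create_time` and `modify_time` are `datetime` objects, which is consistent with how they print above. To run without an iRODS connection, it copies the two values from the output instead of reading them from `coll`.

```python
from datetime import datetime

# Values copied from the output above; with a live session you would use
# coll.create_time and coll.modify_time instead (assumed to be datetime objects).
create_time = datetime(2024, 2, 28, 16, 12, 55)
modify_time = datetime(2024, 2, 28, 16, 21, 45)

elapsed = modify_time - create_time
print(f"Modified {elapsed.total_seconds():.0f} seconds after creation")
```

In this example the collection's modification time is the moment the demo file was uploaded into it.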


We can also get some system information about the data object we created by uploading the demo file. To inspect a data object (or a collection) we have to retrieve it from iRODS. We will use the function `get_dataobject` to do so. Note that for collections the function is called `get_collection`.

In [13]:
from ibridges import get_dataobject, get_collection
obj = get_dataobject(session, irods_coll_path.joinpath('demofile.txt'))

Now we can inspect the object:

In [14]:
print(f'Data object {obj.name} was created on {obj.create_time}')
print(f'Data object {obj.name} full path in iRODS is {obj.path}')
print(f'The data object was created on {obj.create_time}')
print(f'The data object was last modified on {obj.modify_time}')
print(f'The data object was uploaded by and is owned by {obj.owner_name}')

Data object demofile.txt was created on 2024-02-28 16:21:45
Data object demofile.txt full path in iRODS is /nluu12p/home/research-test-christine/demo/demofile.txt
The data object was created on 2024-02-28 16:21:45
The data object was last modified on 2024-02-28 16:21:45
The data object was uploaded by and is owned by c.staiger@uu.nl


**Data objects also carry a `size` and a `checksum`**, with which you can check whether the data reached iRODS completely. Checksums are particularly useful. With the size you can only check whether the length of your local file matches the one in iRODS; you cannot see whether the content is really the same. For example, 'Hello' and 'Hallo' have the same length, but they differ. A checksum, which is a digest of the contents of a file, detects this.

**Note: both functions, `upload` and `download`, always calculate and compare the checksums between your computer and iRODS** to make sure that the data is transferred correctly.

In [15]:
print('Data object size', obj.size)
print('Data object checksum', obj.checksum)

Data object size 27
Data object checksum sha2:S75BqiOKr5pS7pQokupxImlIkHOzETWyNqaCtutCHP8=
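
To make the 'Hello' versus 'Hallo' example concrete, the sketch below computes checksums locally. It assumes that the `sha2:` prefix denotes a base64-encoded SHA-256 digest, which is how iRODS commonly formats this checksum scheme. The helper name `irods_style_sha2` is hypothetical and not part of iBridges.

```python
import base64
import hashlib


def irods_style_sha2(data: bytes) -> str:
    """Return 'sha2:' followed by the base64-encoded SHA-256 digest of `data`."""
    digest = hashlib.sha256(data).digest()
    return "sha2:" + base64.b64encode(digest).decode("ascii")


hello = b"Hello"
hallo = b"Hallo"

print("Same size:", len(hello) == len(hallo))
print("Same checksum:", irods_style_sha2(hello) == irods_style_sha2(hallo))
```

Under the same assumption, `irods_style_sha2(local_file.read_bytes())` could be compared with `obj.checksum` to check a transfer by hand.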


The full iRODS path `obj.path` is the address with which you can retrieve the whole data object. Currently, this is the uploaded file and its system metadata.
<img src="img/DataObject2.png" width="400">

## Data object replicas

Usually when we upload a file to another computer we create one new item which can be found under the path that we used to upload the data. 

Some iRODS systems distribute the uploaded data to different storage systems. That means that the file is stored on several storage systems, but there is a single path under which you can find and address the data. In iRODS, each of these stored copies is called a data object `replica`. When you download data, the system decides for you which replica is the most advantageous in terms of speed and integrity.

<img src="img/DataObject3.png" width="400">

Let us inspect how many replicas of our file we have in iRODS:

In [22]:
from ibridges.data_operations import obj_replicas
obj_replicas(obj)

[(0,
  'lp0054_02',
  'sha2:S75BqiOKr5pS7pQokupxImlIkHOzETWyNqaCtutCHP8=',
  27,
  'good')]

The result is a list in which each element is a tuple structured like this:

In [24]:
item = obj_replicas(obj)[0]
print('Replica index:', item[0])
print('Storage system:', item[1])
print('Replica checksum', item[2])
print('Replica size', item[3])
print('Replica status', item[4])

Replica index: 0
Storage system: lp0054_02
Replica checksum sha2:S75BqiOKr5pS7pQokupxImlIkHOzETWyNqaCtutCHP8=
Replica size 27
Replica status good


The replica status tells you whether this particular copy of the data is verified and intact. It should say `good`. In all other cases, please send the whole replica information, including the path, to your iRODS support.
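
With several replicas, indexing into tuples quickly becomes hard to read. One option is to wrap each tuple in a named tuple and filter on the status. This is a sketch under assumptions: the class `Replica` and its field names are made up for this illustration, and the sample data is copied from the output above so that the code runs without an iRODS connection. With a live session you would use `obj_replicas(obj)` instead.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    """Hypothetical wrapper for one entry of the replica list."""
    index: int
    storage_system: str
    checksum: str
    size: int
    status: str


# Sample data copied from the output above.
raw_replicas = [
    (0, "lp0054_02", "sha2:S75BqiOKr5pS7pQokupxImlIkHOzETWyNqaCtutCHP8=", 27, "good"),
]

replicas = [Replica(*item) for item in raw_replicas]
# Collect every replica whose status is anything other than 'good'.
not_good = [r for r in replicas if r.status != "good"]

print(f"{len(replicas)} replica(s), {len(not_good)} need attention")
```

Any entries in `not_good` are the ones to include, together with the data object's path, in a message to your iRODS support.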