# Working with Large Files Using iRODS

When working with large resource files in HydroShare, the standard method for data transfer (the HydroShare REST API) is not sufficient because of file size constraints. This notebook introduces advanced methods for handling these large files using iRODS tools. More specifically, it demonstrates how to configure and use `iCommands` in Jupyter notebooks to transfer and synchronize data in HydroShare. The commands used in this notebook are:

**`ils`** : display data stored in iRODS  
**`iget`** : retrieve data stored in iRODS  
**`iput`** : upload data into iRODS  
**`irsync`** : synchronize local and iRODS data  

More information on these tools can be found in the iRODS [documentation](https://docs.irods.org/4.2.0/icommands/user/).


Let's start by importing the `hydroshare` and `irods` modules:

In [None]:
import os
from utilities import hydroshare
from utilities.irods import commands

In [None]:
hs = hydroshare.hydroshare()

### Instantiate the iCommands library
This will install and set up the iRODS iCommands if they haven't already been configured. The purpose of this module is to provide a lightweight wrapper around the standard `icommands` to make integration within Jupyter notebooks easy. However, the standard `icommands` can also be used via bash, e.g. `!ils`.

In [None]:
i = commands.iCommands(hs)
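
To confirm that the native commands are reachable from the notebook environment, a quick check against the `PATH` can help. The sketch below is illustrative only: `find_icommands` is a hypothetical helper, not part of the `utilities.irods` module, and it simply reports where each of the four commands used in this notebook is installed, if anywhere.

```python
import shutil

# The four iCommands used in this notebook.
ICOMMANDS = ("ils", "iget", "iput", "irsync")


def find_icommands(names=ICOMMANDS):
    """Map each command name to its executable path, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of utilities.irods.
    """
    return {name: shutil.which(name) for name in names}


for name, path in find_icommands().items():
    print(f"{name}: {path or 'not found'}")
```

A `not found` entry suggests the iCommands have not been installed, or are installed outside the directories listed in `PATH`.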

### List the files in your HydroShare iRODS userspace
List all files in your HydroShare iRODS userspace. For more information on how to use this iRODS userspace, see [HydroShare iRODS Setup](https://pages.hydroshare.org/creating-and-managing-resources/uploading-large-files-into-hydroshare/).

In [None]:
# call the ils function to retrieve files as a Python list
files = i.ils()

In [None]:
for f in files:
    print(f)

Alternatively, you can use `icommands` directly from bash.

In [None]:
!ils

Notice that the output from the Python command (`i.ils`) is slightly different from that of the bash command (`!ils`). This is because the `irods` Python module cleans the `ils` response and organizes it into a native Python list. While either approach is valid, the Python module aims to make iRODS tools easy to use within Python notebooks.
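
To give a feel for what this cleaning might involve, here is a minimal sketch. It assumes the usual `ils` layout (a collection header ending in `:`, indented entry names, and sub-collections prefixed with `C- `). The function `parse_ils_output` and the sample zone, user, and file names are hypothetical; the wrapper's actual implementation is not shown in this notebook and may differ.

```python
def parse_ils_output(text):
    """Split raw `ils` text into a list of entry names.

    Hypothetical helper: skips blank lines and the collection header
    (a line ending in ':'), and keeps the path of each sub-collection
    (lines starting with 'C- ').
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.endswith(":"):
            continue
        if line.startswith("C- "):
            line = line[3:]
        entries.append(line)
    return entries


# Made-up sample output, used only to exercise the parser.
sample = """/exampleZone/home/exampleuser:
  0123abcd.zip
  notes.txt
  C- /exampleZone/home/exampleuser/archive
"""

print(parse_ils_output(sample))
```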

### Get a file from HydroShare iRODS to work with in JupyterHub

Before we can work with files in JupyterHub, they need to be transferred into the JupyterHub [userspace](/user/test/tree/notebooks/data). We can use the `iget` command to perform a parallel transfer of data from HydroShare iRODS into our JupyterHub userspace. For instance, let's get the first file that was returned from the `ils` command above:

In [None]:
myfile = i.iget(files[0])

In [None]:
print(myfile)

Now that we have our file in JupyterHub, we can begin our processing. If the file is a HydroShare resource in BagIt format, it can be loaded into the `hydroshare` module for easier processing:

In [None]:
# get the resource id (minus the extension)
resourceid = os.path.basename(myfile).split('.')[0]

# load the resource into the hydroshare module
hs.loadResource(resourceid)
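
The resource id extraction above relies on the file name having the form `<resourceid>.<extension>`. The short sketch below walks through that step with a hypothetical path standing in for the value returned by `i.iget`, and shows how `split('.')[0]` differs from `os.path.splitext` when a name contains more than one dot.

```python
import os

# Hypothetical path, standing in for the value returned by i.iget().
example_path = "/tmp/data/0123456789abcdef.zip"

# basename() drops the directory; split('.')[0] keeps the text before the first dot.
resource_id = os.path.basename(example_path).split(".")[0]
print(resource_id)  # 0123456789abcdef

# For names with several dots the two approaches give different results:
name = "archive.tar.gz"
first_part = name.split(".")[0]          # text before the first dot
without_ext = os.path.splitext(name)[0]  # only the final extension removed
print(first_part)   # archive
print(without_ext)  # archive.tar
```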

In [None]:
# do something with the files

### Synchronize files
Files can be synchronized between JupyterHub and HydroShare iRODS using the `irsync` command. A wrapper has been provided to make this easy to use outside of bash, but the native command can still be used via `!irsync`.

To demonstrate the `irsync` command, let's start by creating a simple text file using bash:

In [None]:
# print some text into a new file
!echo "this is some text" > testfile

In [None]:
# preview the testfile
!cat testfile

Synchronize this file with the HydroShare iRODS server. Several additional parameters are available, such as `dryrun`, which allows you to preview the synchronization before it is executed:

In [None]:
i.irsync('testfile', 'testfile', dryrun=True)

Additionally, we can control the direction of synchronization using the `source_irods` and `target_irods` parameters. To demonstrate this, let's first make a change to our testfile on the JupyterHub server.

In [None]:
# append some more text to our file
!echo '\nthis is some additional text' >> testfile

In [None]:
# preview the testfile
!cat testfile
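
Synchronization tools typically detect a change like this by comparing file sizes or content checksums. As a purely local illustration (it does not contact iRODS, and `sha256_of` is a hypothetical helper), the sketch below shows that appending a line changes a file's SHA-256 digest. It works in a temporary directory so the notebook's own `testfile` is left untouched.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testfile")

    # Original content.
    with open(path, "w") as fh:
        fh.write("this is some text\n")
    before = sha256_of(path)

    # Append a line, mirroring the edit made above.
    with open(path, "a") as fh:
        fh.write("this is some additional text\n")
    after = sha256_of(path)

print(before != after)  # True
```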

Now our testfile in JupyterHub is out of sync with the version on HydroShare iRODS. Let's revert our JupyterHub version to the HydroShare iRODS version using the `irsync` command in the reverse direction. As before, `dryrun=True` only previews the synchronization:

In [None]:
i.irsync('testfile', 'testfile', 
         source_irods=True, target_irods=False,
         dryrun=True)

In [None]:
# preview the testfile
!cat testfile
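
For readers who prefer the native command, the sketch below shows how wrapper arguments like these could map onto an `irsync` command line. It is an assumption-laden illustration, not the wrapper's actual code: `build_irsync_args` is a hypothetical helper, its defaults are inferred from the calls in this notebook, and it relies on two documented `irsync` conventions, the `i:` prefix for iRODS-side paths and the `-l` option for listing what would be synchronized without transferring anything.

```python
def build_irsync_args(source, target, source_irods=False,
                      target_irods=True, dryrun=False):
    """Assemble an argument list for the native irsync command.

    Hypothetical helper: parameter names mirror the wrapper calls in this
    notebook, and the default direction (local to iRODS) is an assumption.
    """
    args = ["irsync"]
    if dryrun:
        # -l lists the files that need synchronizing without transferring them
        args.append("-l")
    # irsync marks iRODS-side paths with an "i:" prefix
    args.append(f"i:{source}" if source_irods else source)
    args.append(f"i:{target}" if target_irods else target)
    return args


# Preview a local-to-iRODS synchronization.
to_irods = build_irsync_args("testfile", "testfile", dryrun=True)
print(to_irods)

# Reverse direction: iRODS to local.
from_irods = build_irsync_args("testfile", "testfile",
                               source_irods=True, target_irods=False)
print(from_irods)
```

The resulting lists could be passed to `subprocess.run` in an environment where the iCommands are installed and configured.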