You can download and run this notebook locally, or you can run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Aspera_CLI_Downloads.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Aspera_CLI_Downloads.ipynb)

# Summary
Much of the non-DICOM content in [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is provided via links to IBM Aspera Faspex packages.  Aspera’s FASP protocol is designed to move data rapidly across networks with minimal disruption to other traffic.  Aspera’s Faspex application bundles data into packages that can be referenced via a web link (i.e., a URI).  When an Aspera Faspex link is opened in a browser, it presents a GUI that guides the user through installing a browser extension and a local Aspera Connect client, if they are not already installed.  The client then moves the data using FASP between TCIA servers and the computer the browser is running on.

We frequently get requests from researchers for an option to download TCIA Faspex packages using a command line interface that bypasses the GUI.  Although it is not part of the standard Aspera distributions, the IBM Aspera developers have provided an open source tool (Apache 2.0 license) called [ascli (aspera-cli)](https://github.com/IBM/aspera-cli) that allows a client to download an Aspera Faspex package using its URI.  **This notebook is focused on demonstrating how to download TCIA data from Aspera packages via the command line on a Linux system.**

**Note:** The performance of ascli appears to be significantly worse when using the free tier of Google Colab.  


# Setup

First we'll install the prerequisite software.  You need to install Ruby, then the aspera-cli gem (which provides the ascli command), and finally use ascli to install ascp, the Aspera transfer program.  The steps to [install Ruby](https://www.ruby-lang.org/en/downloads/) vary by operating system, but the gem and ascli commands should be the same as the last two lines below.

In [None]:
# consult the link above if you're not on an OS that uses apt
!apt install -y ruby ruby-dev rubygems ruby-json

# these should work in all environments after ruby is installed
!gem install aspera-cli
!ascli conf ascp install

If the ascli command above fails you can try this alternate 'local SDK installation' method described on https://github.com/IBM/aspera-cli:

```
curl -Lso sdk.zip https://ibm.biz/aspera_transfer_sdk
ascli config ascp install --sdk-url=file:///sdk.zip
```

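The --sdk-url value takes the form file:///<path>, where <path> can be either a relative path (not starting with /) or an absolute path.  In the example above, sdk.zip is a relative path, so the archive is expected in the current working directory, which is where the curl command saved it.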
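If you want to confirm that the installation steps above left the expected commands on your PATH, a minimal sketch like the following can help. The function name `check_prerequisites` is hypothetical (it is not part of ascli), and the check only looks for `ruby`, `gem` and `ascli`; it does not look for ascp, which ascli manages itself and which may not be on your PATH.

```python
import shutil

def check_prerequisites(tools=("ruby", "gem", "ascli")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

# Report the status of each prerequisite command.
for tool, path in check_prerequisites().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND points to the installation step that needs to be repeated.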

# Download an entire Aspera package

Now that the prerequisite installations are complete, you will be able to use the ascli **receive** command to transfer a TCIA Faspex package using its link by following these steps:
1.	Browse to the dataset landing page that describes the package,
2.	Find the “Download” button that one normally would use to download the package using the Faspex GUI, but do not click it,
3.	Right click that button icon and select “copy link address” or “copy link” or similar (depends on your browser) to extract the package URI into the clipboard or copy/paste buffer.

Replace the **url** parameter in the following cell with the link of the package you want to download.  

**Note:** If you don't change the package URL below it will only take a few seconds to download an example package (~100MB) from the [Cancer Moonshot Biobank - Gastroesophageal Cancer Collection (CMB-GEC)](https://doi.org/10.7937/E7KH-R486).


In [None]:
!ascli faspex5 packages receive --url='https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6Ijc2MiIsInBhc3Njb2RlIjoiZDg0NmZlM2Q5ZjZjNzliYjUxYWU2MWMzNjJkNmE1ODJmMTc0YmVkYSIsInBhY2thZ2VfaWQiOiI3NjIiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0='
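To sanity-check a link copied from a landing page, you can inspect its `context` query parameter. In the example link this parameter looks like base64-encoded JSON; that is an observation about these links rather than a documented guarantee, so treat the following as a sketch under that assumption. The function name `decode_package_context` and the demonstration link are hypothetical.

```python
import base64
import json
from urllib.parse import urlparse, parse_qs

def decode_package_context(url):
    """Decode the `context` query parameter of a Faspex public package link.

    Assumption: the parameter is base64-encoded JSON.
    Raises ValueError if the parameter is missing.
    """
    query = parse_qs(urlparse(url).query)
    if "context" not in query:
        raise ValueError("URL has no 'context' query parameter")
    # parse_qs turns '+' into a space; restore it before decoding.
    encoded = query["context"][0].replace(" ", "+")
    # Re-add any '=' padding that was lost when the link was copied.
    encoded += "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(encoded))

# Hypothetical link built only for demonstration; use your copied link instead.
demo_context = base64.b64encode(json.dumps({"package_id": "123"}).encode()).decode()
demo_url = "https://faspex.example.org/aspera/faspex/public/package?context=" + demo_context
print(decode_package_context(demo_url))
```

If decoding fails, the link was most likely truncated or altered when it was copied.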

### Optional: Verify your download

Aspera has built-in checksum capabilities and asserts that any download that completes successfully has passed a checksum verification, so this step is generally not necessary.  

However, each package TCIA maintains includes a sums file at the root level of the package.  Not only can this help with troubleshooting issues, it can also be used to quickly understand the contents of a package in case you only want to download a subset of it as shown in the next section.

**Note:** Performing this verification on large datasets may take a very long time depending on how much compute you have available.

Here's an example Python function that will perform the checksum verification.

In [None]:
def verify_sums_file(sums_file_path, data_directory):
    """
    Function to assist with verifying the checksum
    files that TCIA publishes with Aspera packages. Specify
    the path to your *.sums file and the directory where
    you've saved the Aspera package.
    """
    import hashlib
    import os
    if not os.path.exists(sums_file_path):
        print("Sums file not found at the specified path.")
        return

    with open(sums_file_path, 'r') as file:
        for line in file:
            # Split on the first run of whitespace only, so file names
            # that contain spaces stay intact.
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            file_hash, file_name = parts
            file_path = os.path.join(data_directory, file_name)
            if os.path.exists(file_path):
                # Hash in 1 MiB chunks so large files are never loaded
                # into memory all at once.
                md5 = hashlib.md5()
                with open(file_path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1024 * 1024), b''):
                        md5.update(chunk)
                if md5.hexdigest() == file_hash.lower():
                    print(f"{file_name}: OK - Hashes match")
                else:
                    print(f"{file_name}: MISMATCH - Hashes do not match")
            else:
                print(f"{file_name}: NOT FOUND")

You'll need to update the file names/paths for the sums file and data directory if you downloaded a different package.

In [None]:
sums_file_path = '/content/PKG - Biobank_CMB-GEC_v1/Biobank_CMB-GEC_v1.sums'
data_directory = '/content/PKG - Biobank_CMB-GEC_v1'
verify_sums_file(sums_file_path, data_directory)

# Download individual files or directories from an Aspera package

To grab specific parts of a package, you can use the **browse** command to look around inside the package.  Running the command without any path specified will show you the root folder of the package.  

Let's take a look at the [HER2 tumor ROIs](https://doi.org/10.7937/NVA3-N783) collection as an example. Here's the root folder of the package.

In [None]:
!ascli faspex5 packages browse --url="https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6IjczOSIsInBhc3Njb2RlIjoiNzEwNmUzNDFjMDY4MjljNjBkMmM0ZjcxYTBhMTE1ODcxNGIzZWNjNSIsInBhY2thZ2VfaWQiOiI3MzkiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0="

If you want to look at the contents of one of the directories, just add the directory path to the end of this command.  Let's look at the Yale_trastuzumab_response_cohort folder.

In [None]:
!ascli faspex5 packages browse --url="https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6IjczOSIsInBhc3Njb2RlIjoiNzEwNmUzNDFjMDY4MjljNjBkMmM0ZjcxYTBhMTE1ODcxNGIzZWNjNSIsInBhY2thZ2VfaWQiOiI3MzkiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0=" Yale_trastuzumab_response_cohort

When you want to download part of a package, you can once again use the **receive** command.  Let's pretend that we're interested in grabbing the entire Annotations folder from that cohort.  You can do that by simply appending the path of the directory to the end of the **receive** command.

In [None]:
!ascli faspex5 packages receive --url="https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6IjczOSIsInBhc3Njb2RlIjoiNzEwNmUzNDFjMDY4MjljNjBkMmM0ZjcxYTBhMTE1ODcxNGIzZWNjNSIsInBhY2thZ2VfaWQiOiI3MzkiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0=" Yale_trastuzumab_response_cohort/Annotations

You can also download individual files.  We'll demonstrate that by retrieving the included **sums** file, which will provide a full list of the files in the package.  This can help us more easily figure out which specific data we may want to download without having to dig through every folder one at a time using the previously shown **browse** commands.

In [None]:
!ascli faspex5 packages receive --url="https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6IjczOSIsInBhc3Njb2RlIjoiNzEwNmUzNDFjMDY4MjljNjBkMmM0ZjcxYTBhMTE1ODcxNGIzZWNjNSIsInBhY2thZ2VfaWQiOiI3MzkiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0=" HER2_tumor_ROIs_v3.sums

Sometimes the sums files are not sorted in a way that makes them easy to read.  In this case you may want to sort them alphabetically by file path/name.  If you're using a different sums file you just need to update the original and sorted file names accordingly in this next step.

In [None]:
!sort -k2,2 -o HER2_tumor_ROIs_v3_sorted.sums HER2_tumor_ROIs_v3.sums
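For a quicker overview than reading the sorted file line by line, the sums file can also be summarized in Python. The sketch below counts files per top-level folder. It assumes each non-empty line has the form `<hash> <relative path>`, the same layout the verification function earlier in this notebook expects; the function name `summarize_sums_file` is hypothetical.

```python
import os
from collections import Counter

def summarize_sums_file(sums_file_path):
    """Count the files listed in a sums file, grouped by top-level folder.

    Assumption: each non-empty line is '<hash> <relative path>'.
    """
    counts = Counter()
    with open(sums_file_path, "r") as sums_file:
        for line in sums_file:
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            path = parts[1]
            if path.startswith("./"):
                path = path[2:]  # drop a leading './' if present
            # Files without a folder component sit at the package root.
            top_level = path.split("/", 1)[0] if "/" in path else "(package root)"
            counts[top_level] += 1
    return counts

# Only runs if the sums file from the previous step is present.
if os.path.exists("HER2_tumor_ROIs_v3.sums"):
    for folder, n in sorted(summarize_sums_file("HER2_tumor_ROIs_v3.sums").items()):
        print(f"{folder}: {n} files")
```

The folder names it prints can be passed to the **browse** or **receive** commands in the same way as the paths used elsewhere in this section.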

Now let's say that after reviewing the sums file you're interested in obtaining just the first HER2 positive image and annotation file from the Yale HER2 cohort.  We can pull both files at the same time by appending their paths to the end of the previous **receive** command as shown here.

In [None]:
!ascli faspex5 packages receive --url="https://faspex.cancerimagingarchive.net/aspera/faspex/public/package?context=eyJyZXNvdXJjZSI6InBhY2thZ2VzIiwidHlwZSI6ImV4dGVybmFsX2Rvd25sb2FkX3BhY2thZ2UiLCJpZCI6IjczOSIsInBhc3Njb2RlIjoiNzEwNmUzNDFjMDY4MjljNjBkMmM0ZjcxYTBhMTE1ODcxNGIzZWNjNSIsInBhY2thZ2VfaWQiOiI3MzkiLCJlbWFpbCI6ImhlbHBAY2FuY2VyaW1hZ2luZ2FyY2hpdmUubmV0In0=" Yale_HER2_cohort/Annotations/Her2Pos_Case_01.xml Yale_HER2_cohort/SVS/Her2Pos_Case_01.svs
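If you would rather drive these downloads from a Python script than from notebook cells, the same command can be assembled with `subprocess`. This is a minimal sketch: the helper names `paths_matching`, `build_receive_command` and `receive` are hypothetical, it uses only the `faspex5 packages receive --url=...` form shown above, and it assumes the paths listed in the sums file match the paths shown by **browse** (worth confirming for your package).

```python
import subprocess

def paths_matching(sums_file_path, keyword):
    """Return the paths listed in a sums file that contain `keyword`.

    Assumption: each non-empty line is '<hash> <relative path>'.
    """
    matches = []
    with open(sums_file_path, "r") as sums_file:
        for line in sums_file:
            parts = line.strip().split(maxsplit=1)
            if len(parts) == 2 and keyword in parts[1]:
                matches.append(parts[1])
    return matches

def build_receive_command(url, paths=()):
    """Build the argument list for `ascli faspex5 packages receive`.

    With no paths the whole package is requested; otherwise only the
    listed files or folders inside the package are requested.
    """
    return ["ascli", "faspex5", "packages", "receive", f"--url={url}", *paths]

def receive(url, paths=()):
    """Run the download; raises CalledProcessError if ascli exits non-zero."""
    subprocess.run(build_receive_command(url, paths), check=True)
```

For example, `receive(url, paths_matching("HER2_tumor_ROIs_v3.sums", "Her2Pos_Case_01"))` would request every file whose path contains that case name. Passing the arguments as a list rather than a single shell string means paths containing spaces need no extra quoting.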

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/), [Quasar Jarosz](https://www.linkedin.com/in/quasarjarosz/) and [Lawrence Tarbox](https://www.linkedin.com/in/lawrence-tarbox-088335/). Also, big thanks to [@VolodymyrChapman](https://github.com/VolodymyrChapman) for pointing out we can download individual files/directories from a package and to [Laurent Martin](https://www.linkedin.com/in/laurentmartinjp/) for assisting with a variety of our ascli questions!

If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7