You can download and run this notebook locally, or you can run it for free in a cloud environment using Google Colab or Amazon Sagemaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_PathDB_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_PathDB_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers.

**This notebook provides documentation for using the PathDB module of [tcia-utils](https://pypi.org/project/tcia-utils/), which is a package that contains functions to simplify common tasks one might perform when interacting with The Cancer Imaging Archive (TCIA) via Python.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# Introduction
TCIA uses software called [PathDB](https://pathdb.cancerimagingarchive.net/imagesearch) to host its digital pathology datasets. PathDB works together with caMicroscope to let users browse our available datasets and visualize the slides in a browser before downloading them. This notebook focuses on using the PathDB API to access these images and their metadata.

**Note:** Typically, due to the large size of these datasets, we encourage users to download them using the Aspera browser plugin.  We have another notebook which shows how to use [Aspera from the command line](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Aspera_CLI_Downloads.ipynb) if you are interested in bulk downloads while working from a server without a GUI.



# tcia_utils Overview and Installation

The following cells install [**tcia_utils**](https://pypi.org/project/tcia-utils/) and import its **pathdb** module.


In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [None]:
import sys  # needed for the Colab check below if the install cell was skipped
import requests
import pandas as pd
from tcia_utils import pathdb

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# PathDB Query Examples


## getCollections()
The **pathdb** module contains a function called **getCollections()** to get a list of available pathology datasets.  It includes **query** and **format** parameters.  It returns a list of all collections in JSON format if no parameters are specified.  

In [None]:
pathdb.getCollections()

Setting **format = "df"** returns the results as a dataframe instead.

In [None]:
pathdb.getCollections(format = "df")

The **query** parameter searches the **collectionName** field. Let's say we want to find all TCIA datasets that mention "CPTAC" in their collection name. We can do this by specifying **query = "CPTAC"**.  We can also save the output to a CSV file by specifying **format = "csv"**.  Check the contents of the CSV before moving on!

In [None]:
pathdb.getCollections(query = "CPTAC", format = "csv")
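To build intuition for what the **query** parameter does, here is a minimal offline sketch that filters a list of collection records by name. The sample records (including their made-up collectionId values), the `filter_collections` helper, and the case-insensitive substring matching are all assumptions for illustration. This is not the code tcia_utils runs, and real records contain more fields.

```python
# Hypothetical sample records. Real getCollections() output has more fields,
# and the collectionId values here are made up.
sample_collections = [
    {"collectionId": 1, "collectionName": "CPTAC-LUAD"},
    {"collectionId": 2, "collectionName": "CMB-CRC"},
    {"collectionId": 3, "collectionName": "CPTAC-GBM"},
]


def filter_collections(collections, query):
    """Return records whose collectionName contains query (case-insensitive)."""
    needle = query.lower()
    return [c for c in collections if needle in c["collectionName"].lower()]


matches = filter_collections(sample_collections, "cptac")
print([c["collectionName"] for c in matches])
```

The same idea applies to the JSON list returned by **getCollections()** if you prefer to filter locally after one unfiltered call.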

## getImages()

You can use **getImages()** to access metadata about individual images in PathDB.  It includes **query** and **format** parameters.  The **format** parameter works the same way as it does in **getCollections()**.  There are two ways to use the **query** parameter.  The first is to pass a specific collectionId as an integer.  Let's use the Prostate-MRI collectionId (10) that we discovered using **getCollections()** for this example.

**Note:** PathDB returns paginated results.  If your results span more than one page, **getImages()** will keep requesting pages until it reaches the last page of results for your query.

In [None]:
pathdb.getImages(query = 10)
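The pagination behavior described in the note above can be sketched as a simple loop. This is only an illustration of the general pattern, not the tcia_utils implementation: `fetch_all_pages`, `fake_fetch`, the `imageId` field, and the "empty page means done" stopping rule are all assumptions, and a fake in-memory source stands in for the PathDB API.

```python
def fetch_all_pages(fetch_page):
    """Collect records from every page until an empty page is returned.

    fetch_page is a hypothetical callable: page number -> list of records.
    """
    records = []
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:  # an empty page signals the end (assumed convention)
            break
        records.extend(batch)
        page += 1
    return records


# Stand-in for a paginated API: three made-up records split across two pages.
fake_pages = [[{"imageId": 1}, {"imageId": 2}], [{"imageId": 3}]]


def fake_fetch(page):
    """Return the requested fake page, or an empty list past the last page."""
    return fake_pages[page] if page < len(fake_pages) else []


all_records = fetch_all_pages(fake_fetch)
print(len(all_records))
```

Because every page is requested in turn, queries that match many images can take a while to finish.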

The second way we can use the **query** parameter is to search for matches of **collectionName**.  Let's try pulling the metadata for all of the [Cancer Moonshot Biobank (CMB)](https://www.cancerimagingarchive.net/research/cmb/) collections and use a dataframe as the output.  These collections all have the naming structure of CMB-XXXX where the X's represent an abbreviation for the type of cancer.

In [None]:
pathdb.getImages(query = "cmb", format = "df")

# Downloading images

After querying to obtain the metadata for the images you're interested in, you can download them using **downloadImages()**. This function takes the output from **getImages()** and downloads the files concurrently. It checks for existing files to avoid re-downloading them.

    Args:
        images_data (list or pd.DataFrame): A list of dictionaries or a pandas DataFrame
            containing image metadata, including an 'imageUrl' key.
        path (str): The local directory to save the downloaded images.
        max_workers (int): The maximum number of concurrent download threads.
        number (int): The maximum number of new images to download. Defaults to 0 (no limit).

In [None]:
metadata = pathdb.getImages(query = "Prostate-MRI")


In [None]:
pathdb.downloadImages(metadata, path = "ProstateMRI", number = 10)
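For readers curious how concurrent downloads that skip existing files might work in general, here is a minimal offline sketch. It is not the **downloadImages()** implementation: `download_new_files`, its `fetch` argument (a stand-in for an HTTP request), and the example URLs are hypothetical, and only the roles of `path`, `max_workers`, and `number` mirror the arguments documented above.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def download_new_files(urls, path, fetch, max_workers=4, number=0):
    """Save each URL's bytes under path, skipping files that already exist.

    fetch is a hypothetical callable (url -> bytes) standing in for an HTTP GET.
    number caps how many new files are fetched; 0 means no limit.
    Returns the list of file paths written during this call.
    """
    os.makedirs(path, exist_ok=True)
    pending = []
    for url in urls:
        target = os.path.join(path, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):  # skip anything already on disk
            pending.append((url, target))
    if number > 0:
        pending = pending[:number]

    def save(job):
        url, target = job
        with open(target, "wb") as f:
            f.write(fetch(url))
        return target

    # Threads suit this task because the work is I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(save, pending))


# Offline demo with made-up URLs and a fake fetch function.
demo_urls = [f"https://example.org/slides/slide{i}.svs" for i in range(5)]
demo_dir = tempfile.mkdtemp()
first = download_new_files(demo_urls, demo_dir, lambda u: b"data", number=3)
second = download_new_files(demo_urls, demo_dir, lambda u: b"data")
print(len(first), len(second))
```

In the demo, the first call writes three files because of the `number` cap, and the second call only fetches the two that are still missing.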

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7