You can download and run this notebook locally, or you can run it for free in a cloud environment using Google Colab or Amazon SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_PathDB_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_PathDB_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers. 

**This notebook provides documentation for using the PathDB module of [tcia-utils](https://pypi.org/project/tcia-utils/), which is a package that contains functions to simplify common tasks one might perform when interacting with The Cancer Imaging Archive (TCIA) via Python.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1. Introduction
TCIA uses software called [PathDB](https://pathdb.cancerimagingarchive.net/imagesearch) to host its digital pathology datasets. PathDB works together with caMicroscope to let users browse our available datasets and view the slides in a browser before downloading them. This notebook focuses on using the PathDB API to access these images and their metadata.

**Note:** Typically, due to the large size of these datasets, we encourage users to download them using the Aspera browser plugin.  We have another notebook which shows how to use [Aspera from the command line](https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Aspera_CLI_Downloads.ipynb) if you are interested in bulk downloads while working from a server without a GUI.



# 2. tcia_utils Overview and Installation

The following cells install [**tcia_utils**](https://pypi.org/project/tcia-utils/) and import its PathDB module.


In [1]:
!pip install pip --upgrade -q
!pip install tcia-utils --upgrade -q
!pip install pandas --upgrade -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.0.1 which is incompatible.

In [2]:
import requests
import pandas as pd
from tcia_utils import pathdb

### Set logging level to INFO (optional)
This step isn't necessary in a local JupyterLab installation, but Google Colab's root logging handler only shows warnings and errors by default.  If you'd like to see INFO statements, you can run the following code.

In [8]:
import logging

# Check current handlers
#print(logging.root.handlers)

# Remove all handlers associated with the root logger object.
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
#print(logging.root.handlers)

# Set handler with level = info
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', 
                    level=logging.INFO)

print("Logging set to INFO")

Logging set to INFO


# 3. PathDB Query Examples


## 3.1 getCollections()
The **pathdb** module contains a function called **getCollections()** to get a list of available pathology datasets.  It includes **query** and **format** parameters.  It returns a list of all collections in JSON format if no parameters are specified.  

In [None]:
pathdb.getCollections()

Using the **format = "df"** parameter allows you to return a dataframe.

In [4]:
pathdb.getCollections(format = "df")

Unnamed: 0,collectionName,collectionId,updated
0,AML_Cytomorphology_LMU,516,2021-11-29
1,Biobank_CRC,535,2022-04-18
2,Biobank_GEC,536,2022-04-18
3,Biobank_LCA,537,2022-04-18
4,Biobank_MEL,538,2022-04-18
5,Biobank_MML,539,2022-04-18
6,Biobank_PCA,540,2022-04-18
7,Bone,519,2022-01-05
8,CATCH,518,2022-01-04
9,CODEX_imaging_of_HCC,544,2022-11-15
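
Once the collections are in a dataframe, ordinary pandas operations apply. The sketch below rebuilds a few rows of the output above by hand so that it runs without network access, then filters by name and collects the matching collectionId values. The helper `find_collections` is a hypothetical name used for illustration; it is not part of tcia_utils.

```python
import pandas as pd

# A few rows copied from the output above, rebuilt by hand so the example
# runs offline; in practice use pathdb.getCollections(format = "df").
collections = pd.DataFrame({
    "collectionName": ["AML_Cytomorphology_LMU", "Biobank_CRC", "Biobank_GEC", "Bone", "CATCH"],
    "collectionId": [516, 535, 536, 519, 518],
    "updated": ["2021-11-29", "2022-04-18", "2022-04-18", "2022-01-05", "2022-01-04"],
})

def find_collections(df, text):
    """Hypothetical helper: rows whose collectionName contains `text`, ignoring case."""
    mask = df["collectionName"].str.contains(text, case=False, regex=False)
    return df[mask]

biobank = find_collections(collections, "biobank")
biobank_ids = biobank["collectionId"].tolist()
print(biobank_ids)  # [535, 536]
```

The resulting IDs are the kind of integer value that can be used wherever a collectionId is expected.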


The **query** parameter searches the **collectionName** field. For example, to find all TCIA pathology datasets that mention "prostate" in their collection name, specify **query = "prostate"**.  You can also save the output to a CSV file by specifying **format = "csv"**.  Check the contents of the CSV before moving on!

In [5]:
pathdb.getCollections(query = "prostate", format = "csv")
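
One way to check the CSV from Python is to load the most recently modified CSV file in the working directory. This is only a sketch: the file name that **getCollections()** writes is not shown in this notebook, so the hypothetical helper `load_newest_csv` below simply assumes the newest `*.csv` in the folder is the one that was just saved.

```python
import glob
import os

import pandas as pd

def load_newest_csv(folder="."):
    """Hypothetical helper: load the most recently modified CSV in `folder`.

    Assumption: the file just written by getCollections() is the newest
    CSV in that folder. Returns the path and the loaded dataframe.
    """
    paths = glob.glob(os.path.join(folder, "*.csv"))
    if not paths:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")
    newest = max(paths, key=os.path.getmtime)
    return newest, pd.read_csv(newest)
```

Calling `path, df = load_newest_csv()` after the cell above and then inspecting `path` and `df.head()` confirms which file was read and what it contains.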

## 3.2 getImages()

You can use **getImages()** to access metadata about individual images in PathDB.  It accepts **query** and **format** parameters.  The **format** parameter works the same way as in **getCollections()**.  There are two ways to use the **query** parameter.  The first is to pass a specific collectionId as an integer.  Let's use the Prostate_MRI collectionId (10) that we discovered using **getCollections()** for this example.

**Note:** PathDB returns paginated results.  If your results span more than one page, **getImages()** keeps requesting pages until it reaches the last page of results for your query.

In [9]:
pathdb.getImages(query = 10)

2023-05-25 20:58:36,180:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/10?page=0&_format=json
2023-05-25 20:58:36,470:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/10?page=1&_format=json


[{'collectionName': 'Prostate_MRI',
  'collectionId': 10,
  'subjectId': 'MIP-PROSTATE-01-0001',
  'imageId': 'MIP-PROSTATE-01-0001',
  'imageHeight': 6165,
  'imagedWidth': 7324,
  'physicalPixelSizeX': 0,
  'physicalPixelSizeY': 0,
  'imageUrl': 'http://pathdb.cancerimagingarchive.net/system/files/wsi/ross/Prostate-MRI/converted/Prostate-MRI/MIP-PROSTATE-01-0001.tiff'},
 {'collectionName': 'Prostate_MRI',
  'collectionId': 10,
  'subjectId': 'MIP-PROSTATE-01-0002',
  'imageId': 'MIP-PROSTATE-01-0002',
  'imageHeight': 6744,
  'imagedWidth': 7489,
  'physicalPixelSizeX': 0,
  'physicalPixelSizeY': 0,
  'imageUrl': 'http://pathdb.cancerimagingarchive.net/system/files/wsi/ross/Prostate-MRI/converted/Prostate-MRI/MIP-PROSTATE-01-0002.tiff'},
 {'collectionName': 'Prostate_MRI',
  'collectionId': 10,
  'subjectId': 'MIP-PROSTATE-01-0003',
  'imageId': 'MIP-PROSTATE-01-0003',
  'imageHeight': 11006,
  'imagedWidth': 9000,
  'physicalPixelSizeX': 0,
  'physicalPixelSizeY': 0,
  'imageUrl': '
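
The INFO lines above show the pattern of the underlying requests: the same `listofimages` URL is called with `page=0`, then `page=1`. A minimal sketch of such a pagination loop follows. It is not the tcia_utils implementation; the function names `fetch_page` and `fetch_all_pages` are hypothetical, and the stopping rule (an empty JSON list marks the end of the results) is an assumption rather than documented PathDB behavior.

```python
import requests

BASE_URL = "https://pathdb.cancerimagingarchive.net/listofimages"

def fetch_page(collection_id, page):
    """Request one page of image metadata, using the URL pattern from the log above."""
    url = f"{BASE_URL}/{collection_id}"
    response = requests.get(url, params={"page": page, "_format": "json"}, timeout=30)
    response.raise_for_status()
    return response.json()

def fetch_all_pages(collection_id, fetch=fetch_page, max_pages=1000):
    """Collect records page by page until an empty page comes back.

    Assumption: an empty list marks the end of the results.
    `max_pages` is a safety cap so the loop always terminates.
    """
    records = []
    for page in range(max_pages):
        batch = fetch(collection_id, page)
        if not batch:
            break
        records.extend(batch)
    return records

# Offline demonstration with a stand-in for the network call.
fake_pages = {0: [{"imageId": "A"}, {"imageId": "B"}], 1: [{"imageId": "C"}]}
demo = fetch_all_pages(10, fetch=lambda cid, page: fake_pages.get(page, []))
print(len(demo))  # 3
```

Passing the page fetcher as an argument keeps the loop testable without network access; calling `fetch_all_pages(10)` with the default fetcher would issue real requests.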

The second way to use the **query** parameter is to pass a string, which is matched against **collectionName**.  Let's try pulling the metadata for all of the Biobank collections and use a dataframe as the output.

In [10]:
pathdb.getImages(query = "biobank", format = "df")

2023-05-25 21:00:54,539:INFO:Calling... https://pathdb.cancerimagingarchive.net/collections?_format=json
2023-05-25 21:00:54,832:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/535?page=0&_format=json
2023-05-25 21:00:55,143:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/535?page=1&_format=json
2023-05-25 21:00:55,357:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/536?page=0&_format=json
2023-05-25 21:00:55,572:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/536?page=1&_format=json
2023-05-25 21:00:55,785:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/537?page=0&_format=json
2023-05-25 21:00:56,045:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/537?page=1&_format=json
2023-05-25 21:00:56,258:INFO:Calling... https://pathdb.cancerimagingarchive.net/listofimages/538?page=0&_format=json
2023-05-25 21:00:56,571:INFO:Calling... https://pathdb.cancerimagingarchive.

Unnamed: 0,collectionName,collectionId,subjectId,imageId,imageHeight,imagedWidth,physicalPixelSizeX,physicalPixelSizeY,imageUrl
0,CMB_CRC,535,MSB-01262,MSB-01262-02-02,30902,65735,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
1,CMB_CRC,535,MSB-01262,MSB-01262-04-02,41613,45816,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
2,CMB_CRC,535,MSB-01262,MSB-01262-05-02,45838,31871,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
3,CMB_CRC,535,MSB-01262,MSB-01262-08-02,51702,45816,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
4,CMB_CRC,535,MSB-01262,MSB-01262-09-02,39892,31871,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
...,...,...,...,...,...,...,...,...,...
80,CMB_MML,539,MSB-04030,MSB-04030-12-02,36903,33863,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
81,CMB_MML,539,MSB-04030,MSB-04030-12-03,37333,43823,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
82,CMB_PCA,540,MSB-02917,MSB-02917-01-02,40626,45816,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
83,CMB_PCA,540,MSB-03973,MSB-03973-01-02,51886,43823,0.2523,0.2523,http://pathdb.cancerimagingarchive.net/system/...
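
With the image metadata in a dataframe, a quick summary per collection is straightforward. The sketch below uses three rows copied by hand from the output above so that it runs offline. The conversion to millimeters assumes that `physicalPixelSizeX` is given in micrometers per pixel, which the notebook does not state, and it treats a pixel size of 0 (as seen in the Prostate_MRI output) as unknown.

```python
import pandas as pd

# Three rows copied from the output above (imageUrl omitted because it is
# truncated there). Note the column is spelled "imagedWidth" in the output.
images = pd.DataFrame({
    "collectionName": ["CMB_CRC", "CMB_CRC", "CMB_PCA"],
    "collectionId": [535, 535, 540],
    "subjectId": ["MSB-01262", "MSB-01262", "MSB-02917"],
    "imageId": ["MSB-01262-02-02", "MSB-01262-04-02", "MSB-02917-01-02"],
    "imageHeight": [30902, 41613, 40626],
    "imagedWidth": [65735, 45816, 45816],
    "physicalPixelSizeX": [0.2523, 0.2523, 0.2523],
    "physicalPixelSizeY": [0.2523, 0.2523, 0.2523],
})

# Number of images and distinct subjects per collection.
summary = images.groupby("collectionName").agg(
    images=("imageId", "count"),
    subjects=("subjectId", "nunique"),
)
print(summary)

# Assumption: pixel size is in micrometers per pixel, so
# pixels * size / 1000 gives millimeters. Sizes of 0 become NaN.
size_x = images["physicalPixelSizeX"].where(images["physicalPixelSizeX"] > 0)
images["width_mm"] = images["imagedWidth"] * size_x / 1000
print(images[["imageId", "width_mm"]])
```

The same operations apply unchanged to the full dataframe returned by `pathdb.getImages(query = "biobank", format = "df")`.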


# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7