You can download and run this notebook locally, or run it for free in a cloud environment using Colab or SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Radiology_Inventory.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Radiology_Inventory.ipynb)

# Summary

This notebook generates reports that summarize TCIA's radiology datasets. It can also compare reports over time to assess changes in the number of patients, studies, series, and images, and in the disk space used.

# 1 Setup

Install the latest release of [**tcia_utils**](https://pypi.org/project/tcia-utils/) and import the necessary modules.

In [None]:
import sys

# install tcia utils
!{sys.executable} -m pip install --upgrade -q tcia_utils

In [None]:
import datetime
import requests
import pandas as pd
from tcia_utils import nbia

# default to public collections so later cells work if Section 2 is skipped
# (assumes an empty string selects the public API in tcia_utils)
api_url = ""

# set logging level to INFO in Google Colab (not necessary in Jupyter)
if 'google.colab' in sys.modules:
  import logging

  for handler in logging.root.handlers[:]:
      logging.root.removeHandler(handler)

  # Set handler with level = info
  logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                      level=logging.INFO)

  print("Google Colab Logging set to INFO")

# 2 Create a Token (optional)
This step is only necessary if you plan to include restricted NBIA datasets in the reports.


In [None]:
# create token
nbia.getToken()

# set api_url to include restricted collections
api_url = "restricted"

# 3 Generate new report
Next we'll generate reports on what's currently in the system. Run **ONE** of the two cells below, depending on whether you want a report on all collections or only on specific collections.

In [None]:
# For all collections, run this cell
collections_json = nbia.getCollections(api_url = api_url)
collections = [item['Collection'] for item in collections_json]

print(collections)
print(str(len(collections_json)) + " collections were found.")

In [None]:
# For specific collections, update and run this cell
collections = ["LIDC-IDRI", "VICTRE", "BREAST-DIAGNOSIS"]
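
Before querying a hand-typed list, it can help to check it against the names the server reports. The sketch below is only an illustration: `split_known_unknown` is a hypothetical helper, the `available` list is a made-up stand-in for the names returned by `nbia.getCollections()`, and it assumes collection names must match exactly, which we have not verified against the API.

```python
def split_known_unknown(requested, available):
    """Separate requested names into those present in `available` and those not."""
    available_set = set(available)
    known = [name for name in requested if name in available_set]
    unknown = [name for name in requested if name not in available_set]
    return known, unknown

# Made-up stand-in for the names returned by nbia.getCollections()
available = ["LIDC-IDRI", "VICTRE", "BREAST-DIAGNOSIS", "TCGA-BRCA"]
requested = ["LIDC-IDRI", "VICTRE", "Breast-Diagnosis"]

known, unknown = split_known_unknown(requested, available)
print("Will query:", known)
print("Not found (check spelling/case):", unknown)
```

Anything that ends up in `unknown` is worth double-checking before running the inventory cells.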

Next we'll build Patient, Study and Series level inventories of what's currently available in those collections.

In [None]:
# get inventory of studies

studies = pd.DataFrame()

for collection in collections:
    studyDescription = nbia.getStudy(collection, api_url = api_url)
    studies = pd.concat([studies, pd.DataFrame(studyDescription)], ignore_index=True)

studies.to_csv('study_metadata_{}.csv'.format(datetime.date.today()))

In [None]:
# get inventory of series

series = pd.DataFrame()

for collection in collections:
    seriesDescription = nbia.getSeries(collection, api_url = api_url)
    series = pd.concat([series, pd.DataFrame(seriesDescription)], ignore_index=True)

series.to_csv('series_metadata_{}.csv'.format(datetime.date.today()))
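
The loops above grow a DataFrame one collection at a time. For long collection lists, a common alternative is to gather the per-collection frames in a list and concatenate once at the end. The sketch below shows that pattern with `fetch_series_stub`, a hypothetical stand-in for `nbia.getSeries()` that returns invented records; it also assumes a query can come back empty, which may or may not match how the real function behaves.

```python
import pandas as pd

def fetch_series_stub(collection):
    """Hypothetical stand-in for nbia.getSeries(); returns a list of dicts or None."""
    fake_data = {
        "A": [{"Collection": "A", "SeriesInstanceUID": "1.1", "ImageCount": 10}],
        "B": None,  # simulates a collection that returned nothing
        "C": [{"Collection": "C", "SeriesInstanceUID": "3.1", "ImageCount": 5},
              {"Collection": "C", "SeriesInstanceUID": "3.2", "ImageCount": 7}],
    }
    return fake_data.get(collection)

frames = []
skipped = []
for collection in ["A", "B", "C"]:
    records = fetch_series_stub(collection)
    if not records:
        # remember which collections produced no rows so they can be reviewed
        skipped.append(collection)
        continue
    frames.append(pd.DataFrame(records))

# concatenate once instead of on every loop iteration
inventory = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(len(inventory), "series rows; skipped:", skipped)
```

Keeping a `skipped` list makes it easier to tell an empty collection apart from a failed request when the totals look off.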

## Summarize the inventory reports
Here is the summary of those reports:

In [None]:
# Total count of unique PatientID and StudyInstanceUID values in studies
total_unique_patient_ids = studies['PatientID'].nunique()
total_unique_study_instance_uids = studies['StudyInstanceUID'].nunique()

# Total count of values in series
total_unique_series_patient_ids = series['PatientID'].nunique()
total_unique_series_instance_uids = series['SeriesInstanceUID'].nunique()
image_count_sum = series['ImageCount'].sum()
file_size_sum = series['FileSize'].sum()
disk_space = nbia.format_disk_space(file_size_sum)

# Print the summary statistics
print("Summary Statistics:")
print(f"Total currently available subjects (in series report): {total_unique_series_patient_ids}")
print(f"Total currently available subjects (in study report): {total_unique_patient_ids}")
print(f"Total currently available studies: {total_unique_study_instance_uids}")
print(f"Total currently available series: {total_unique_series_instance_uids}")
print(f"Total currently available images: {image_count_sum}")
print(f"Total current FileSize: {file_size_sum} bytes or {disk_space}")
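
If you'd like the same totals broken down per collection, a `groupby` is one way to get them. This is a sketch on invented data: it assumes the series report contains a `Collection` column alongside the `PatientID`, `SeriesInstanceUID`, `ImageCount` and `FileSize` columns used above, and `series_demo` is a toy DataFrame rather than real TCIA output.

```python
import pandas as pd

# Toy stand-in for the series report (values are invented)
series_demo = pd.DataFrame({
    "Collection": ["A", "A", "B"],
    "PatientID": ["p1", "p1", "p2"],
    "SeriesInstanceUID": ["1.1", "1.2", "2.1"],
    "ImageCount": [100, 50, 30],
    "FileSize": [1000, 500, 300],
})

# One row per collection, using pandas named aggregation
per_collection = series_demo.groupby("Collection").agg(
    subjects=("PatientID", "nunique"),
    series=("SeriesInstanceUID", "nunique"),
    images=("ImageCount", "sum"),
    bytes=("FileSize", "sum"),
)
print(per_collection)
```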


# 4 Compare your results against previously generated reports (optional)

Follow these steps if you want to compare what's currently available with older reports.  First, we'll import the old reports you want to compare.  Don't forget to update the file names to match your reports.
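
Because the reports are saved with an ISO date in the file name (for example `study_metadata_2023-11-14.csv`), sorting the names alphabetically also sorts them chronologically. The sketch below uses that to pick the newest report in a folder; `latest_report` is a hypothetical helper, and the demo creates empty placeholder files in a temporary directory rather than touching your real reports.

```python
import tempfile
from pathlib import Path

def latest_report(directory, prefix):
    """Return the newest '<prefix>_YYYY-MM-DD.csv' in `directory`, or None."""
    matches = sorted(Path(directory).glob(f"{prefix}_????-??-??.csv"))
    return matches[-1] if matches else None

# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for day in ["2023-11-14", "2024-01-02", "2023-12-31"]:
        (Path(tmp) / f"study_metadata_{day}.csv").touch()
    newest = latest_report(tmp, "study_metadata")
    newest_name = newest.name
    print(newest_name)
```

Note that a report generated today would also match the pattern, so point the helper at a folder of older reports or exclude today's file when using it for comparisons.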



In [None]:
# import results of getStudy()
oldStudies = pd.read_csv("study_metadata_2023-11-14.csv")

# import results of getSeries()
oldSeries = pd.read_csv("series_metadata_2023-11-14.csv")

This next step produces UID-level reports that show exactly what changed between the current and previous study/series reports, covering both newly added data and data that has been removed.

In [None]:
# Merge the studies and oldStudies dataframes based on the PatientID and StudyInstanceUID columns
studyDiff = pd.merge(studies, oldStudies, on=['PatientID', 'StudyInstanceUID'], how='outer', indicator=True)

# Filter the merged dataframe to keep only the rows that are different
studyDiff = studyDiff[studyDiff['_merge'] != 'both']

# Drop the '_merge' column
studyDiff = studyDiff.drop('_merge', axis=1)

# Print the studyDiff dataframe
display(studyDiff)
studyDiff.to_csv("studyDiff.csv")

In [None]:
# Merge the series and oldSeries dataframes based on the PatientID and SeriesInstanceUID columns
seriesDiff = pd.merge(series, oldSeries, on=['PatientID', 'SeriesInstanceUID'], how='outer', indicator=True)

# Filter the merged dataframe to keep only the rows that are different
seriesDiff = seriesDiff[seriesDiff['_merge'] != 'both']

# Drop the '_merge' column
seriesDiff = seriesDiff.drop('_merge', axis=1)

# Print the seriesDiff dataframe
display(seriesDiff)
seriesDiff.to_csv("seriesDiff.csv")
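
To see how the outer merge with `indicator=True` separates additions from removals, here is a small sketch on invented data. Rows found only in the current report are marked `left_only` and rows found only in the old report are marked `right_only`. The `change` column is our own addition for readability and is not part of the reports produced above.

```python
import pandas as pd

# Toy current and previous reports (values are invented)
current = pd.DataFrame({"PatientID": ["p1", "p2", "p3"],
                        "StudyInstanceUID": ["1.1", "1.2", "1.3"]})
previous = pd.DataFrame({"PatientID": ["p1", "p2", "p4"],
                         "StudyInstanceUID": ["1.1", "1.2", "1.4"]})

diff = pd.merge(current, previous, on=["PatientID", "StudyInstanceUID"],
                how="outer", indicator=True)

# Keep only rows that appear in one report but not the other
diff = diff[diff["_merge"] != "both"].copy()

# Translate the indicator into a readable label instead of discarding it
diff["change"] = diff["_merge"].astype(str).map(
    {"left_only": "added", "right_only": "removed"})
diff = diff.drop(columns="_merge")
print(diff)
```

Keeping a label like this in the saved CSV makes it possible to tell added rows from removed ones without re-running the comparison.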

## Comparison Summary
Here we will summarize the overall changes between the current and previous reports.

In [None]:
# Count of new PatientID values in studies but not in oldStudies
new_patient_ids = studies[~studies['PatientID'].isin(oldStudies['PatientID'])]['PatientID'].nunique()

# Count of new StudyInstanceUID values in studies but not in oldStudies
new_study_instance_uids = studies[~studies['StudyInstanceUID'].isin(oldStudies['StudyInstanceUID'])]['StudyInstanceUID'].nunique()

# Count of PatientID values in oldStudies but not in studies
missing_patient_ids = oldStudies[~oldStudies['PatientID'].isin(studies['PatientID'])]['PatientID'].nunique()

# Count of StudyInstanceUID values in oldStudies but not in studies
missing_study_instance_uids = oldStudies[~oldStudies['StudyInstanceUID'].isin(studies['StudyInstanceUID'])]['StudyInstanceUID'].nunique()

# Total count of unique PatientID and StudyInstanceUID values in studies
total_unique_patient_ids = studies['PatientID'].nunique()
total_unique_study_instance_uids = studies['StudyInstanceUID'].nunique()

# Count of new PatientID values in series but not in oldSeries
new_series_patient_ids = series[~series['PatientID'].isin(oldSeries['PatientID'])]['PatientID'].nunique()

# Count of new SeriesInstanceUID values in series but not in oldSeries
new_series_instance_uids = series[~series['SeriesInstanceUID'].isin(oldSeries['SeriesInstanceUID'])]['SeriesInstanceUID'].nunique()

# Count of PatientID values in oldSeries but not in series
missing_series_patient_ids = oldSeries[~oldSeries['PatientID'].isin(series['PatientID'])]['PatientID'].nunique()

# Count of SeriesInstanceUID values in oldSeries but not in series
missing_series_instance_uids = oldSeries[~oldSeries['SeriesInstanceUID'].isin(series['SeriesInstanceUID'])]['SeriesInstanceUID'].nunique()

# Total file size/images in oldSeries
old_image_count_sum = oldSeries['ImageCount'].sum()
old_file_size_sum = oldSeries['FileSize'].sum()

# Total count of values in series
total_unique_series_patient_ids = series['PatientID'].nunique()
total_unique_series_instance_uids = series['SeriesInstanceUID'].nunique()
image_count_sum = series['ImageCount'].sum()
file_size_sum = series['FileSize'].sum()
disk_space = nbia.format_disk_space(file_size_sum)

# Change in images/size
image_diff = image_count_sum - old_image_count_sum
size_diff = file_size_sum - old_file_size_sum
disk_space_diff = nbia.format_disk_space(size_diff)

# Print the summary statistics
print("Summary Statistics:")
print(f"New subjects added (in study report): {new_patient_ids}")
print(f"Number of subjects removed (in study report): {missing_patient_ids}")
print(f"New subjects added (in series report): {new_series_patient_ids}")
print(f"Number of subjects removed (in series report): {missing_series_patient_ids}")
print(f"New studies: {new_study_instance_uids}")
print(f"Studies removed: {missing_study_instance_uids}")
print(f"New series: {new_series_instance_uids}")
print(f"Series removed: {missing_series_instance_uids}")
print(f"Change in image count: {image_diff}")
print(f"Change in disk size: {size_diff} or {disk_space_diff}")
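
The added/removed counts above all follow the same pattern: unique values present in one report but not the other. A minimal sketch of that pattern as a reusable function is shown below; `count_changes` is a hypothetical helper and the DataFrames are toy data.

```python
import pandas as pd

def count_changes(current, previous, column):
    """Return (added, removed): unique values only in `current` / only in `previous`."""
    current_values = set(current[column].dropna())
    previous_values = set(previous[column].dropna())
    added = len(current_values - previous_values)
    removed = len(previous_values - current_values)
    return added, removed

# Toy data: p2 and p3 are new, p4 has disappeared
demo_current = pd.DataFrame({"PatientID": ["p1", "p2", "p3", "p3"]})
demo_previous = pd.DataFrame({"PatientID": ["p1", "p4"]})

added, removed = count_changes(demo_current, demo_previous, "PatientID")
print(f"added={added}, removed={removed}")
```

Computing both numbers in one call for a given column reduces the chance of printing a count from the wrong report.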


# 5 National Lung Screening Trial (optional)
If you want to include **all** TCIA datasets, you must separately account for the National Lung Screening Trial (NLST) collection, which lives on a separate server due to its size. It is a completed collection and is very unlikely to change at this point, but the following steps can be used to ensure you have the latest inventory of what's available.

In [None]:
# get inventory of studies
collections = ["NLST"]
nlstStudies = pd.DataFrame()

for collection in collections:
    studyDescription = nbia.getStudy(collection, api_url = "nlst")
    nlstStudies = pd.concat([nlstStudies, pd.DataFrame(studyDescription)], ignore_index=True)

nlstStudies.to_csv('nlst_study_metadata_{}.csv'.format(datetime.date.today()))

In [None]:
# get inventory of series
collections = ["NLST"]
nlstSeries = pd.DataFrame()

for collection in collections:
    seriesDescription = nbia.getSeries(collection, api_url = "nlst")
    nlstSeries = pd.concat([nlstSeries, pd.DataFrame(seriesDescription)], ignore_index=True)

nlstSeries.to_csv('nlst_series_metadata_{}.csv'.format(datetime.date.today()))

In [None]:
# Total count of unique NLST PatientID and StudyInstanceUID values
total_unique_nlst_patient_ids = nlstStudies['PatientID'].nunique()
total_unique_nlst_study_instance_uids = nlstStudies['StudyInstanceUID'].nunique()
total_unique_nlst_series_instance_uids = nlstSeries['SeriesInstanceUID'].nunique()
nlst_image_count_sum = nlstSeries['ImageCount'].sum()
nlst_file_size_sum = nlstSeries['FileSize'].sum()
nlst_disk_space = nbia.format_disk_space(nlst_file_size_sum)

# Print the summary statistics
print("NLST Summary Statistics:")
print(f"Total subjects: {total_unique_nlst_patient_ids}")
print(f"Total studies: {total_unique_nlst_study_instance_uids}")
print(f"Total series: {total_unique_nlst_series_instance_uids}")
print(f"Total images: {nlst_image_count_sum}")
print(f"Total FileSize: {nlst_file_size_sum} bytes or {nlst_disk_space}")
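
For readers curious how a byte count becomes a human-readable size, here is a minimal sketch. `human_readable_bytes` is a hypothetical helper written for illustration; it assumes 1024-based units and two decimal places, which may differ from what `nbia.format_disk_space()` actually returns.

```python
def human_readable_bytes(num_bytes):
    """Format a byte count using 1024-based units (an assumption for this sketch)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    size = float(num_bytes)
    for unit in units:
        # stop once the value fits the unit, or when we run out of units
        if abs(size) < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024

print(human_readable_bytes(1536))
print(human_readable_bytes(5 * 1024**3))
print(human_readable_bytes(-2048))  # negative values can occur when comparing reports
```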

Now let's add everything together again with NLST included.

In [None]:
print("TCIA DICOM radiology collections include:",
      total_unique_patient_ids + total_unique_nlst_patient_ids, "subjects,",
      total_unique_study_instance_uids + total_unique_nlst_study_instance_uids, "studies,",
      total_unique_series_instance_uids + total_unique_nlst_series_instance_uids, "series,",
      image_count_sum + nlst_image_count_sum, "images, which requires",
      nbia.format_disk_space(file_size_sum + nlst_file_size_sum), "of storage.")
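
Adding the two subject counts assumes that no PatientID appears on both servers. We have not checked whether that holds for the real data, so the sketch below simply shows, on invented IDs, how to compare the sum of counts with a count over the combined inventories.

```python
import pandas as pd

# Toy inventories (values are invented); "p2" appears in both on purpose
main_demo = pd.DataFrame({"PatientID": ["p1", "p2"]})
nlst_demo = pd.DataFrame({"PatientID": ["p2", "p3"]})

# Sum of per-inventory unique counts (what the cell above does)
sum_of_counts = main_demo["PatientID"].nunique() + nlst_demo["PatientID"].nunique()

# Unique count over both inventories combined
union_count = pd.concat([main_demo["PatientID"], nlst_demo["PatientID"]]).nunique()

print("sum of counts:", sum_of_counts)
print("unique across both:", union_count)
```

If the two numbers match on your real data, the simple sum is safe to report.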

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7