<a href="https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_Series_UID_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary

This notebook can be used to summarize TCIA data given a set of Series Instance UIDs (e.g. from a TCIA manifest file). The output includes:

1.   A detailed report (CSV) containing the Collection Name, Subject ID, Study UID, Study Description, Study Date, Series UID, Series Description, Number of images, File Size (Bytes), Modality, and Manufacturer for each scan
2.   A report summarizing how many Patients/Studies/Series/Images are represented along with a breakdown of Collections, modalities, body parts and manufacturers that are included

You can import Series UIDs into the notebook in the following ways:

1.   Upload a TCIA manifest file
2.   Use wget with the URL of a manifest file on TCIA
3.   Upload a text file with a list of Series UIDs (one per line)

# Create a credential file and token
To make sure you can retrieve information about every Series UID in your list, including those in restricted Collections, provide your TCIA username and password and create a token using the following steps.






In [1]:
# Create the credential file
# NOTE: You must enter your real user name and password before you run this,
# or edit the resulting text file with your real credentials after it's created.

lines = ['userName=YourUserName', 'passWord=YourPassword']
with open('credentials.txt', 'w') as f:
    f.write('\n'.join(lines))

After the file is created, you can find it by clicking the folder icon in the left sidebar. Double-click the file to edit it, enter your login credentials, and then close it to save your changes before proceeding to the next step.

In [11]:
# extract the user/pw from the credential file to variables for use in subsequent API calls and downloads          

credentialFilePath = 'credentials.txt'
mylines = []                                  
with open (credentialFilePath, 'rt') as myfile: 
    for myline in myfile:                     
        mylines.append(myline)   

userName = mylines[0].rstrip('\n').split(r'userName=')[1]
passWord = mylines[1].rstrip('\n').split(r'passWord=')[1] 
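
The positional parsing above assumes `userName=` is on the first line and `passWord=` is on the second. A minimal sketch of a more tolerant parser, which reads `key=value` pairs in any order and skips blank lines, might look like the following. The function name `read_credentials` is hypothetical and is not part of any TCIA tooling.

```python
def read_credentials(path):
    """Parse key=value lines from a credential file into a dict."""
    creds = {}
    with open(path, "rt") as f:
        for raw in f:
            line = raw.rstrip("\r\n")
            if not line.strip() or "=" not in line:
                continue  # skip blank or malformed lines
            # Split only once so an '=' inside the password is preserved.
            key, value = line.split("=", 1)
            creds[key.strip()] = value
    missing = {"userName", "passWord"} - creds.keys()
    if missing:
        raise ValueError(f"Missing keys in {path}: {sorted(missing)}")
    return creds
```

With this helper, the two assignments could be written as `creds = read_credentials('credentials.txt')` followed by `userName, passWord = creds['userName'], creds['passWord']`.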

Now we'll use the credential file you created to generate an access token to query restricted Collections on TCIA.  

***Note:*** Tokens are valid for 2 hours and must be refreshed after that point. See https://wiki.cancerimagingarchive.net/x/X4ATBg for more details. 

In [7]:
# imports

import requests
import pandas as pd
import io

# set API URL
adv_url = "https://services.cancerimagingarchive.net/nbia-api/services/"

In [13]:
# request token
# pass the credentials via `params` so that requests URL-encodes any special characters
token_url = "https://services.cancerimagingarchive.net/nbia-api/oauth/token"
token_params = {
    "username": userName,
    "password": passWord,
    "grant_type": "password",
    "client_id": "nbiaRestAPIClient",
    "client_secret": "ItsBetweenUAndMe",
}
access_token = requests.get(token_url, params=token_params).json()["access_token"]
print('Token created successfully: ', access_token)

# set API call headers to use the access token we created
api_call_headers = {'Authorization': 'Bearer ' + access_token}

Token created successfully:  c1dbfacc-1110-4f6b-b33d-beece6ccd274
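
Because tokens expire after 2 hours, a long session may need to request a new one. One way to handle this, sketched here under assumptions, is to remember when the token was fetched and request a fresh one shortly before the 2-hour mark. `TokenCache`, `fetch_token`, and the 5-minute safety margin are hypothetical choices; `fetch_token` stands for any function that returns a fresh token string, such as a wrapper around the token request above.

```python
import time

TOKEN_LIFETIME_SECONDS = 2 * 60 * 60  # 2 hours, per the note above
SAFETY_MARGIN_SECONDS = 5 * 60        # assumption: refresh 5 minutes early


class TokenCache:
    """Cache a token and re-fetch it once it is close to expiring."""

    def __init__(self, fetch_token, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning a token string
        self._clock = clock              # injectable so it can be tested
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return a token, fetching a new one if none is cached or it is stale."""
        now = self._clock()
        max_age = TOKEN_LIFETIME_SECONDS - SAFETY_MARGIN_SECONDS
        if self._token is None or now - self._fetched_at >= max_age:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token

    def headers(self):
        """Build the Authorization header used for API calls."""
        return {"Authorization": "Bearer " + self.get()}
```

Calling `cache.headers()` before each API request would then supply a current `Authorization` header without tracking the expiry time by hand.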


# Import Series UID file 

***The file must contain only Series Instance UIDs, with one UID per line (no commas).***

To import a file directly to Colab use the menu on the left sidebar to upload it.  Once it's uploaded, right click and rename it to "series-uids.txt".

To import a file that's already posted on TCIA, update the wget command in the next cell with the URL of the manifest you want to analyze.

In [14]:
# optional: use wget to download the manifest
# replace the URL: wget -O /directory_path/series-uids.txt https://URL_on_TCIA/manifest.tcia

!wget -O /content/series-uids.txt https://wiki.cancerimagingarchive.net/download/attachments/52757630/CrowdsCureCancer2018-DICOM.TCIA?api=v2

--2022-10-24 14:00:58--  https://wiki.cancerimagingarchive.net/download/attachments/52757630/CrowdsCureCancer2018-DICOM.TCIA?api=v2
Resolving wiki.cancerimagingarchive.net (wiki.cancerimagingarchive.net)... 144.30.169.13
Connecting to wiki.cancerimagingarchive.net (wiki.cancerimagingarchive.net)|144.30.169.13|:443... connected.
HTTP request sent, awaiting response... 200 
Length: 44706 (44K) [application/x-nbia-manifest-file]
Saving to: ‘/content/series-uids.txt’


2022-10-24 14:00:58 (566 KB/s) - ‘/content/series-uids.txt’ saved [44706/44706]



If using a TCIA manifest file, run the step below to remove the header (6 lines of text that precede the UID list).  You can skip this if you have created a custom UID file in some other application like Excel.  

In [17]:
with open('series-uids.txt') as f:
    first_line = f.readline()

if "downloadServerUrl" in first_line:
    !sed -i -e 1,6d /content/series-uids.txt
    print("Header text removed.")
else:
    print("This is not a TCIA manifest file, or you've already removed the header lines.")

This is not a TCIA manifest file, or you've already removed the header lines.
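
The `sed` call above relies on a shell being available, as it is in Colab. A minimal pure-Python sketch of the same step, assuming the header is exactly the 6 lines described above and that its first line contains `downloadServerUrl`, could look like this. The function names `strip_manifest_header` and `strip_manifest_file` are hypothetical.

```python
HEADER_LINES = 6  # header length stated in the text above


def strip_manifest_header(lines, header_lines=HEADER_LINES):
    """Return the lines without the manifest header, if one is present."""
    if lines and "downloadServerUrl" in lines[0]:
        return list(lines[header_lines:])
    return list(lines)  # not a manifest, or the header was already removed


def strip_manifest_file(path):
    """Rewrite the file at `path` in place with the header removed.

    Returns True if a header was found and removed, otherwise False.
    """
    with open(path, "rt") as f:
        lines = f.readlines()
    stripped = strip_manifest_header(lines)
    if len(stripped) == len(lines):
        return False
    with open(path, "wt") as f:
        f.writelines(stripped)
    return True
```

Because the check looks at the first line each time, running `strip_manifest_file('series-uids.txt')` twice is harmless: the second call finds no header and leaves the file unchanged.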


# Read the file

Read the series-uids.txt file for analysis in subsequent steps.

In [18]:
# initialize variable
uids = []

# open file and collect the UIDs, skipping blank lines
with open("series-uids.txt") as file:
    for line in file:
        if line.strip():
            uids.append(line.strip())

# format the result to submit to the API
csvUids = ",".join(uids)
print(csvUids)

1.3.6.1.4.1.14519.5.2.1.3098.5025.115022530859364225793081095046,1.3.6.1.4.1.14519.5.2.1.3098.4963.220432233339628991344378001240,1.3.6.1.4.1.14519.5.2.1.3098.5025.257760710362197951836270981675,1.3.6.1.4.1.14519.5.2.1.3098.4963.320145454615389714134412028542,1.3.6.1.4.1.14519.5.2.1.3098.5025.645295619891266235526712062459,1.3.6.1.4.1.14519.5.2.1.3098.4963.339510393021164759075669972255,1.3.6.1.4.1.14519.5.2.1.3098.4963.309488074097527031061784217755,1.3.6.1.4.1.14519.5.2.1.3098.5025.275039868724513872765196667070,1.3.6.1.4.1.14519.5.2.1.3098.5025.123912208404617819073663909493,1.3.6.1.4.1.14519.5.2.1.6450.4012.286334895151142675174731411094,1.3.6.1.4.1.14519.5.2.1.3098.4963.206913223260019003716206320688,1.3.6.1.4.1.14519.5.2.1.3098.4963.300710668745440676766442000832,1.3.6.1.4.1.14519.5.2.1.3098.4963.280583513164721911159656704491,1.3.6.1.4.1.14519.5.2.1.3098.4963.198781047577165405831497716983,1.3.6.1.4.1.14519.5.2.1.3098.4963.315303641137904064712770025506,1.3.6.1.4.1.14519.5.2.1.3
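
Since the API receives the UIDs as one comma-separated string, it can help to check the list before submitting it. The sketch below drops blank lines and duplicates and sets aside entries that do not look like DICOM UIDs (dot-separated numeric components, at most 64 characters). `clean_uids` is a hypothetical helper, and the pattern is a simplified shape check rather than a full DICOM validator.

```python
import re

# Simplified UID shape: groups of digits separated by dots.
UID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)+")
MAX_UID_LENGTH = 64  # maximum UID length in DICOM


def clean_uids(lines):
    """Return (valid, rejected) UID lists, de-duplicated in input order."""
    seen = set()
    valid, rejected = [], []
    for raw in lines:
        uid = raw.strip()
        if not uid or uid in seen:
            continue  # skip blank lines and repeats
        seen.add(uid)
        if len(uid) <= MAX_UID_LENGTH and UID_PATTERN.fullmatch(uid):
            valid.append(uid)
        else:
            rejected.append(uid)
    return valid, rejected
```

A non-empty `rejected` list is a hint that the file still contains header text, commas, or other stray content.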

# Create a Report of Series Metadata

Create the detailed metadata report and save it to "scan_metadata.csv".

In [19]:
# get series metadata from API
param = {'list': csvUids}
data_url = adv_url + "getSeriesMetadata2"
data = requests.post(data_url, headers = api_call_headers, data = param)
data.raise_for_status()  # stop here if the request returned an HTTP error

# save output
df = pd.read_csv(io.StringIO(data.text), sep=',')
df.to_csv('scan_metadata.csv')
print("Metadata report saved successfully")

# optional: display sample of csv
#display(df)
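
The cell above sends every UID in a single request. If a list is very large, splitting it into batches may be more reliable; whether the API imposes a size limit is not stated here, so treat the batch size as an assumption to tune. In the sketch below, `fetch_csv` is a hypothetical callable that takes a comma-separated UID string and returns the CSV text of the response (in this notebook it could wrap the `requests.post` call above), and `fetch_metadata_in_batches` is likewise an illustrative name.

```python
import csv
import io


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_metadata_in_batches(uids, fetch_csv, batch_size=500):
    """Fetch metadata batch by batch and return one list of row dicts.

    batch_size=500 is an arbitrary illustrative default.
    """
    rows = []
    for batch in chunked(uids, batch_size):
        text = fetch_csv(",".join(batch))
        # Each batch carries its own header row; DictReader consumes it.
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows
```

The resulting rows can be turned into a DataFrame with `pd.DataFrame(rows)`. Values parsed this way are strings, so numeric columns would need converting (for example with `pd.to_numeric`) before summing.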

# Create a summary of Collections, patients, modalities, body parts and manufacturers

In [21]:
# Summarize Collections
print("Number of Series per Collection:")
print(df['Collection Name'].value_counts(dropna=False),'\n')

# Summarize patients
print('Patient/Study/Series Counts\n')
print('Subjects: ', len(df['Subject ID'].value_counts()), 'subjects')
print('Studies: ', len(df['Study UID'].value_counts()), 'studies')
print('Series: ', len(df['Series ID'].value_counts()), 'series')
print('Images: ', df['Number of images'].sum(), 'images\n')

# Summarize modalities
print("Series Counts - Modalities:")
print(df['Modality'].value_counts(dropna=False),'\n')

# Summarize manufacturers
print("Series Counts - Device Manufacturers:")
print(df['Manufacturer'].value_counts(dropna=False))

Number of Series per Collection:
Anti-PD-1_MELANOMA     207
NSCLC Radiogenomics     94
TCGA-LUSC               62
CPTAC-CCRCC             59
Anti-PD-1_Lung          51
TCGA-BLCA               50
TCGA-COAD               37
CPTAC-PDA               32
CPTAC-UCEC              26
TCGA-UCEC               17
CPTAC-GBM               15
TCGA-HNSC               11
CPTAC-HNSCC             10
CPTAC-CM                 3
Name: Collection Name, dtype: int64 

Patient/Study/Series Counts

Subjects:  324 subjects
Studies:  395 studies
Series:  674 series
Images:  138851 images

Series Counts - Modalities:
CT    674
Name: Modality, dtype: int64 

Series Counts - Device Manufacturers:
GE MEDICAL SYSTEMS                382
SIEMENS                           216
Philips                            36
NaN                                22
TOSHIBA                            17
McKesson Medical Imaging Group      1
Name: Manufacturer, dtype: int64


# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/), and is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/).  If you leverage this notebook or any TCIA datasets in your work please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7