You can download and run this notebook locally, or run it for free in a cloud environment using Google Colab or Amazon SageMaker Studio Lab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kirbyju/TCIA_Notebooks/blob/main/TCIA_DataCite_Queries.ipynb)

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_DataCite_Queries.ipynb)

# Summary

Access to large, high-quality datasets is essential for researchers to understand disease and precision medicine pathways, especially in cancer. However, HIPAA constraints make sharing medical images outside an individual institution complex. [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/) is a public service funded by the National Cancer Institute that addresses this challenge by providing hosting and de-identification services that take major burdens of data sharing off researchers. 

**This notebook provides documentation for using the DataCite module of [tcia-utils](https://pypi.org/project/tcia-utils/), which is a package that contains functions to simplify common tasks one might perform when interacting with The Cancer Imaging Archive (TCIA) via Python.**  If you're interested in additional TCIA notebooks and coding examples, check out the tutorials at https://github.com/kirbyju/TCIA_Notebooks.

# 1. Introduction
TCIA issues a Digital Object Identifier (DOI) for each of its datasets through [DataCite](https://datacite.org/value.html).  The [DataCite API](https://wiki.cancerimagingarchive.net/x/YwD5BQ) can be used to programmatically access metadata for each Collection, such as its DOI URL, title, publication date, licensing information, and abstract.

Please note that **this API was not developed by TCIA.** See https://support.datacite.org/ for any technical questions.  The TCIA Helpdesk may be able to assist if your inquiry is related to the content of the data itself. 
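
For orientation, a direct request to DataCite could be assembled roughly as follows. This is a minimal sketch and not a description of how tcia_utils works internally: it assumes the public DataCite REST API endpoint `https://api.datacite.org/dois` with its `prefix`, `query` and `page[size]` parameters, and it takes the `10.7937` prefix from the TCIA DOIs shown later in this notebook. `build_datacite_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: public DataCite REST API endpoint for DOI records.
DATACITE_DOIS_URL = "https://api.datacite.org/dois"

def build_datacite_url(query=None, prefix="10.7937", page_size=100):
    """Assemble a DataCite /dois request URL (hypothetical helper)."""
    params = {"prefix": prefix, "page[size]": page_size}
    if query:
        params["query"] = query
    return f"{DATACITE_DOIS_URL}?{urlencode(params)}"

url = build_datacite_url(query="lung")
print(url)

# To fetch the records (requires network access):
# import requests
# records = requests.get(url, timeout=30).json()["data"]
```

The rest of this notebook uses tcia_utils instead, which handles the request and the conversion to a dataframe for you.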



# 2. tcia_utils Overview and Installation

The following cells install and import the DataCite module from [**tcia_utils**](https://pypi.org/project/tcia-utils/).


In [9]:
!pip install pip --upgrade -q
!pip install tcia-utils --upgrade -q
!pip install pandas --upgrade -q

In [10]:
import requests
import pandas as pd
from tcia_utils import datacite

# 3. DataCite Query Examples - Rewritten to get all Datasets mentioning "Tumor and Brain"



The datacite module contains a single function called **getDoi()**.  It returns metadata for one or more DOIs based on your query parameters.  If no parameters are specified, it returns all TCIA DOIs in **JSON** format.


In [None]:
# datacite.getDoi()  # returns a very large JSON with the metadata of every dataset

In [11]:
doiDF = datacite.getDoi(format = "df") # same records, returned as a pandas DataFrame


In [12]:
doiDF

Unnamed: 0,DOI,Identifier,CreatorNames,Title,Created,Updated,Related,Version,Rights,RightsURI,Description,FundingReferences,URL,CitationCount,ReferenceCount
0,10.7937/k9/tcia.2018.ow73vlo2/scgvs2,Crowds-Cure-2017,"Jayashree Kalpathy-Cramer, Andrew Beers, Artem...",Crowds-Cure-2017 RSNA2017CCC-doiJNLP-e8nBWDCC....,2019-06-11T15:34:25Z,2023-05-18T15:36:42Z,IsPartOf: 10.7937/k9/tcia.2018.ow73vlo2,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,RSNA2017CCC-doiJNLP-e8nBWDCC.jnlp\nPlease see ...,[],https://wiki.cancerimagingarchive.net/download...,0,0
1,10.7937/k9/tcia.2018.ow73vlo2/otqf2a,Crowds-Cure-2018,Jayashree Kalpathy-Cramer (https://orcid.org/0...,CrowdsCureCancer2018-Results.tab,2019-06-05T18:16:24Z,2023-05-18T15:35:01Z,IsPartOf: 10.7937/k9/tcia.2018.ow73vlo2,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,CrowdsCureCancer2018-Results.csv,[],https://wiki.cancerimagingarchive.net/download...,0,0
2,10.7937/k9/tcia.2018.ow73vlo2/okjdaf,Crowds-Cure-2017,"Jayashree Kalpathy-Cramer, Andrew Beers, Artem...",Crowds-Cure-2017 tabdelimitedtoxml.xsl,2019-06-11T15:34:29Z,2023-05-18T15:33:51Z,IsPartOf: 10.7937/k9/tcia.2018.ow73vlo2,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,DICOM-SR representation of crowd measurements\...,[],https://wiki.cancerimagingarchive.net/x/ZgQGAg,0,0
3,10.7937/k9/tcia.2018.ow73vlo2/k7oh8k,Crowds-Cure-2017,"Jayashree Kalpathy-Cramer, Andrew Beers, Artem...",CrowdsCureCancer-dicomsrfiles_20180830.zip,2019-06-11T15:34:27Z,2023-05-18T15:32:20Z,IsPartOf: 10.7937/k9/tcia.2018.ow73vlo2,1,,,CrowdsCureCancer-dicomsrfiles,[],https://wiki.cancerimagingarchive.net/download...,0,0
4,10.7937/k9/tcia.2018.ow73vlo2/gjilrf,Crowds-Cure-2017,"Jayashree Kalpathy-Cramer, Andrew Beers, Artem...",ccc2017clinical.tab,2019-06-11T15:34:30Z,2023-05-18T15:30:33Z,IsPartOf: 10.7937/k9/tcia.2018.ow73vlo2,1,,,TCGA Clinical Data,[],https://wiki.cancerimagingarchive.net/download...,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,10.7937/k9/tcia.2014.v7cvh1jo,,"Russell C Hardie, Temesguen Messay",Segmentation of Pulmonary Nodules in Computed ...,2014-10-24T17:17:25Z,2020-08-19T16:56:28Z,,,,,,[],https://wiki.cancerimagingarchive.net/x/I4IiAQ,0,0
268,10.7937/k9/tcia.2014.iirmbunx,,Zhang J Mazurowski MA,Radiogenomic Analysis of Breast Cancer: Lumina...,2014-09-10T16:45:06Z,2020-08-19T16:34:01Z,,,,,,[],https://wiki.cancerimagingarchive.net/x/JYAiAQ,0,0
269,10.7937/k9/tcia.2020.85sp-0b40,,"Lucien Beer, Hilal Sahin, Kathleen M. Darcy, G...",Data from Integration of Proteomics with Radio...,2020-05-20T16:04:24Z,2020-08-01T19:47:14Z,,,,,This collection is comprised of imaging segmen...,[],https://wiki.cancerimagingarchive.net/x/vosvB,0,0
270,10.7937/tcia.2020.zx6hmv94,,Ryan Birmingham,Clearly a Test DOI,2020-01-31T17:19:37Z,2020-07-30T21:58:07Z,,,,,A Test DOI which should hopefully not post to ...,[],http://tcia-path-a1:3001/details?doi=10.7937/T...,0,0


In [13]:
doiDF.Title.str.contains("brain", case = False)

0      False
1      False
2      False
3      False
4      False
       ...  
267    False
268    False
269    False
270    False
271    False
Name: Title, Length: 272, dtype: bool

In [14]:
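# Search for "tumor"/"tumour" in Title OR Description (case-insensitive).
# na=False treats rows with a missing Title or Description as non-matches.
dataSetContaningTumorIdxs = doiDF.Title.str.contains("tumor|tumour", case = False, na = False) | doiDF.Description.str.contains("tumor|tumour", case = False, na = False)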
sum(dataSetContaningTumorIdxs)

113

In [15]:
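# Search for "brain" in Title OR Description (case-insensitive).
# na=False treats rows with a missing Title or Description as non-matches.
dataSetContainingBrainIdxs = doiDF.Title.str.contains("brain", case = False, na = False) | doiDF.Description.str.contains("brain", case = False, na = False)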
sum(dataSetContainingBrainIdxs)

17

# Get the final dataframe of "Tumor and Brain" datasets

In [16]:
dataSetTumorAndBrainIdxs = dataSetContaningTumorIdxs & dataSetContainingBrainIdxs
dfTumorAndBrain = doiDF[dataSetTumorAndBrainIdxs]
dfTumorAndBrain

Unnamed: 0,DOI,Identifier,CreatorNames,Title,Created,Updated,Related,Version,Rights,RightsURI,Description,FundingReferences,URL,CitationCount,ReferenceCount
24,10.7937/tcia.2019.zfv154m9,DI-Cubed-Reports,"Hubert Hickman, Wendy Ver Hoef, Smita Hastak, ...",SDTM datasets of clinical data and measurement...,2019-06-06T19:32:08Z,2023-05-17T15:17:09Z,,1,TCIA Restricted License,https://wiki.cancerimagingarchive.net/download...,The Data Integration &amp; Imaging Informatics...,[],https://wiki.cancerimagingarchive.net/x/ihklAw,0,0
39,10.7937/tcia.t905-zq20,GLIS-RT,Nadya Shusharina (https://orcid.org/0000-0003-...,Glioma Image Segmentation for Radiotherapy: RT...,2021-10-22T20:59:12Z,2023-05-11T18:31:09Z,,1,TCIA Limited Access License,https://wiki.cancerimagingarchive.net/download...,The imaging data consists of 230 cases of glio...,[],https://wiki.cancerimagingarchive.net/x/pgKtBQ,0,0
41,10.7937/xb6d-py67,Brain-TR-GammaKnife,Yibin Wang (https://orcid.org/0000-0002-8761-5...,Brain Tumor Recurrence Prediction after Gamma ...,2023-03-10T15:53:37Z,2023-05-11T18:19:41Z,,1,TCIA Limited Access License,https://wiki.cancerimagingarchive.net/download...,Here we release a brain cancer MRI dataset wit...,[],https://wiki.cancerimagingarchive.net/x/BQSwC,0,0
43,10.7937/k9/tcia.2018.15quzvnb/4fvupd,Brain-Tumor-Progression,"Kathleen Schmainda, Melissa Prah",Brain-Tumor-Progression_2018-01-31.tcia,2019-05-08T15:25:23Z,2023-05-11T17:19:23Z,IsPartOf: 10.7937/k9/tcia.2018.15quzvnb,1,TCIA Restricted License,https://wiki.cancerimagingarchive.net/download...,"Images (DICOM, 3.2GB)\nPlease see DOI:10.7937/...",[],https://wiki.cancerimagingarchive.net/download...,0,0
49,10.7937/k9/tcia.2015.588ozuzb,REMBRANDT,"Lisa Scarpace, Adam E. Flanders, Rajan Jain, T...",Data From REMBRANDT,2015-07-16T03:18:12Z,2023-05-11T15:30:56Z,HasPart: 10.7937/k9/tcia.2015.588ozuzb/xqscgc,2,TCIA Restricted License,https://wiki.cancerimagingarchive.net/download...,Finding better therapies for the treatment of ...,[],https://wiki.cancerimagingarchive.net/display/...,31,0
111,10.7937/tcia.bdgf-8v37,UCSF-PDGM,Evan Calabrese (https://orcid.org/0000-0002-14...,The University of California San Francisco Pre...,2022-07-05T17:17:51Z,2023-04-27T20:18:32Z,,4,Creative Commons Attribution 4.0 International,https://creativecommons.org/licenses/by/4.0/le...,The publicly available University of Californi...,[],https://wiki.cancerimagingarchive.net/x/5pAiBw,0,0
112,10.7937/k9/tcia.2017.klxwjj1q,BraTS-TCGA-GBM,"Spyridon Bakas, Hamed Akbari, Aristeidis Sotir...",Segmentation Labels for the Pre-operative Scan...,2017-01-26T13:48:48Z,2023-04-27T20:17:14Z,IsReferencedBy: 10.1007/978-3-030-46640-4_14,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,This data container describes both computer-ai...,[],https://wiki.cancerimagingarchive.net/x/KoZyAQ,33,0
113,10.7937/k9/tcia.2017.gjq7r0ef,BraTS-TCGA-LGG,"Spyridon Bakas, Hamed Akbari, Aristeidis Sotir...",Segmentation Labels for the Pre-operative Scan...,2017-01-26T13:53:42Z,2023-04-27T20:11:10Z,IsReferencedBy: 10.1007/978-3-030-46640-4_14,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,This data container describes both computer-ai...,[],https://wiki.cancerimagingarchive.net/x/LIZyAQ,32,0
119,10.7937/k9/tcia.2015.h1sxnuxl,RIDER Breast MRI,"Charles R Meyer, Thomas L Chenevert, Craig J G...",RIDER Breast MRI,2015-07-16T03:18:33Z,2023-04-26T22:43:55Z,IsReferencedBy: 10.1007/978-3-319-19578-0_10,1,Creative Commons Attribution 3.0 Unported,https://creativecommons.org/licenses/by/3.0/le...,Ideally a patient’s response to neoadjuvant ch...,[],https://wiki.cancerimagingarchive.net/x/dYRXAQ,7,0
125,10.7937/k9/tcia.2018.15quzvnb,Brain-Tumor-Progression,Kathleen Schmainda (https://orcid.org/0000-000...,Data from Brain-Tumor-Progression,2018-05-18T20:07:27Z,2023-04-26T18:55:05Z,HasPart: 10.7937/k9/tcia.2018.15quzvnb/4fvupd,1,TCIA Restricted,https://wiki.cancerimagingarchive.net/download...,This collection includes datasets from 20 subj...,[],https://wiki.cancerimagingarchive.net/x/1wEGAg,5,0
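
The keyword filtering above can be wrapped in a small reusable helper. The sketch below is one possible way to do it, not part of tcia_utils: `mentions` is a hypothetical function name, and the toy dataframe holds made-up rows that only mimic the `Title` and `Description` columns of `doiDF`.

```python
import pandas as pd

def mentions(df, pattern, columns=("Title", "Description")):
    """Return a boolean mask: True where any listed column matches the regex."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        # na=False treats missing values as non-matches
        mask |= df[col].str.contains(pattern, case=False, na=False)
    return mask

# Toy stand-in for doiDF (made-up rows, for illustration only)
toy = pd.DataFrame({
    "Title": ["Brain Tumor MRI", "Lung CT", "Glioma scans"],
    "Description": ["Follow-up imaging", None, "Brain tumour cases"],
})

# A row qualifies only if it mentions both keywords somewhere
both = mentions(toy, "tumor|tumour") & mentions(toy, "brain")
print(toy[both].index.tolist())
```

On the real data, the same call would be `doiDF[mentions(doiDF, "tumor|tumour") & mentions(doiDF, "brain")]`.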


# Further Examples from original code:


The **format** parameter lets you return the DOI records as a dataframe or save them to a CSV spreadsheet.

In [None]:
datacite.getDoi(format = "df")

The **query** parameter searches most of the metadata fields DataCite uses. Suppose we want to find all TCIA datasets that mention "lung" in their titles or abstracts and save the output to a CSV file.

In [None]:
datacite.getDoi(query = "lung", format = "csv")

The **created** parameter lets you filter by the year in which the DOI was created.  It expects a 4-digit year.

In [None]:
datacite.getDoi(created = 2022, format = "df")
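
If you have already downloaded the full dataframe, a similar filter can be applied locally. The following is a minimal sketch that assumes the `Created` column holds ISO 8601 timestamps like those shown in the output above; the toy rows and identifiers are made up for illustration.

```python
import pandas as pd

# Toy stand-in for doiDF; timestamps follow the format of the Created column
toy = pd.DataFrame({
    "Identifier": ["A", "B", "C"],
    "Created": [
        "2021-10-22T20:59:12Z",
        "2022-07-05T17:17:51Z",
        "2023-03-10T15:53:37Z",
    ],
})

# Parse the timestamps, then keep only rows created in 2022
created = pd.to_datetime(toy["Created"], utc=True)
from_2022 = toy[created.dt.year == 2022]
print(from_2022["Identifier"].tolist())
```

Filtering locally avoids a second request, while the **created** parameter keeps the download small when you only need one year.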

The **license** parameter requires an exact match. Most of our datasets use Creative Commons Attribution licenses. The "nc" variants that prevent commercial use are no longer an option for new datasets in TCIA, but a handful of datasets adopted them before the policy changed.  Datasets using the NCTN Data Archive and TCIA Limited Access licenses both require signing data use agreements, but the others do not require special agreements or creating an account.

**Note:** License data is missing for some datasets, and in other cases the license names differ slightly between datasets.  We are working to fix this issue.

In [None]:
datacite.getDoi(license = "NCTN Data Archive License", format = "df")
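
Because the match is exact and license names vary slightly, it can help to list the distinct license strings before filtering. The sketch below assumes the `Rights` column of the dataframe holds the license name, as in the outputs shown earlier; the toy rows reuse license strings that appear in those outputs.

```python
import pandas as pd

# Toy stand-in for doiDF, using license strings seen in the outputs above
toy = pd.DataFrame({
    "Rights": [
        "TCIA Restricted License",
        "TCIA Restricted",
        "Creative Commons Attribution 3.0 Unported",
        None,
        "TCIA Restricted License",
    ]
})

# Count each distinct license string; dropna=False also counts missing values
license_counts = toy["Rights"].value_counts(dropna=False)
print(license_counts)
```

On the real data, `doiDF["Rights"].value_counts(dropna=False)` would show which exact spellings are available to pass to the **license** parameter.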

# Acknowledgements
TCIA is funded by the [Cancer Imaging Program (CIP)](https://imaging.cancer.gov/), a part of the United States [National Cancer Institute (NCI)](https://www.cancer.gov/).  It is managed by the [Frederick National Laboratory for Cancer Research (FNLCR)](https://frederick.cancer.gov/) and hosted by the [University of Arkansas for Medical Sciences (UAMS)](https://www.uams.edu/).

This notebook was created by [Justin Kirby](https://www.linkedin.com/in/justinkirby82/). If you leverage this notebook or any TCIA datasets in your work, please be sure to comply with the [TCIA Data Usage Policy](https://wiki.cancerimagingarchive.net/x/c4hF). In particular, make sure to cite the DOI(s) for the specific TCIA datasets you used in addition to the following paper!

## TCIA Citation

Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7