# **IMPORTING DATA USING THE SWAGGER API CLIENT**

The data was obtained Bravado - a python client library for the Swagger API client. The data was then held in a main directory, named here as 'cbioportal', followed by section headings referring to the cbioportal repository.


In [1]:
from bravado.client import SwaggerClient
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
cbioportal = SwaggerClient.from_url('https://www.cbioportal.org/api/api-docs',
                                    config={"validate_requests":False,"validate_responses":False})

In [2]:
dir(cbioportal)

['Cancer_Types',
 'Clinical_Attributes',
 'Clinical_Data',
 'Clinical_Events',
 'Copy_Number_Segments',
 'Discrete_Copy_Number_Alterations',
 'Gene_Panels',
 'Generic_Assays',
 'Genes',
 'Molecular_Data',
 'Molecular_Profiles',
 'Mutations',
 'Patients',
 'Reference_Genome_Genes',
 'Resource_Data',
 'Resource_Definitions',
 'Sample_Lists',
 'Samples',
 'Structural_Variants',
 'Studies',
 'Treatments']

## USING THE API CLIENT TO PULL SPECIFIC DATA GROUPS

I was interested in data relating to clinical attributes for each cancer type and so this was the data that I targeted; the variable was named 'clinical_data'.

I also wanted to see how many studies and cancer types there were available and so these data were obtained and named 'cancer_types' and 'studies'.

In [78]:
clinical_data = cbioportal.Clinical_Attributes.getAllClinicalAttributesUsingGET().result()
cancer_types = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().result()
studies = cbioportal.Studies.getAllStudiesUsingGET().result()

In [72]:
print("No. of total cancer types: ", len(cancer_types))
print("No. of total studies: ", len(studies))
print("No. of total clinical attr: ", len(clinical_data))
print("No. of total samples: {}".format(sum([x.allSampleCount for x in studies])))


No. of total cancer types:  869
No. of total studies:  303
No. of total clinical attr:  12266
No. of total samples: 118149


In [14]:
cancer_types[0]

TypeOfCancer(cancerTypeId='aa', clinicalTrialKeywords='aggressive angiomyxoma', dedicatedColor='LightYellow', name='Aggressive Angiomyxoma', parent='soft_tissue', shortName='AA')

In [73]:
studies[0]

CancerStudy(allSampleCount=40, cancerType=None, cancerTypeId='hnsc', citation='Pickering et al. Cancer Discov 2013', cnaSampleCount=None, completeSampleCount=None, description='Comprehensive profiling of 40 oral squamous cell carcinoma tumor/normal sample pairs.', groups='', importDate='2019-02-19 00:00:00', methylationHm27SampleCount=None, miRnaSampleCount=None, mrnaMicroarraySampleCount=None, mrnaRnaSeqSampleCount=None, mrnaRnaSeqV2SampleCount=None, name='Oral Squamous Cell Carcinoma (MD Anderson, Cancer Discov 2013)', pmid='23619168', publicStudy=True, referenceGenome='hg19', rppaSampleCount=None, sequencedSampleCount=None, shortName='Head & neck (MDA)', status=0, studyId='hnsc_mdanderson_2013')

In [38]:
clinical_data[0]

ClinicalAttribute(clinicalAttributeId='ADJUVANT_CHEMO', datatype='STRING', description='Adjuvant Chemotherapy', displayName='Adjuvant Chemotherapy', patientAttribute=True, priority='1', studyId='acbc_mskcc_2015')

### EXTRACTING CLINICAL DATA

In order to extract clinical data without losing context of the study and other associated data, I needed to use the 'getAllClinicalDataInStudyUsingGET(studyId=i).result()' method, which requires input of each study ID. Therefore, I extracted each study ID by using the clinical_data variable, saving the list to 'study_ids_raw' and removing the duplicates, leaving 303 study IDs. Finally, I used a for loop to loop through the study_id list and extract clinical data for each study. The result of this was the 'all_clinical_data' variable containing a list studies within which, each element is a single clinical attribute for that particular study. This gave 441,868 clinical attributes.

In [39]:
study_ids_raw = []

for i in range(len(clinical_data)):    
    study_ids_raw.append(clinical_data[i].studyId)

In [40]:
len(study_ids_raw)

12266

In [77]:
def remove_duplicates(l):
    return list(set(l))

study_ids = remove_duplicates(study_ids_raw)
len(study_ids)

303

In [51]:
all_clinical_data=[]

for i in study_ids:
    all_clinical_data.append(cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId=i).result())
    

In [64]:
all_clinical_data[0]

[ClinicalData(clinicalAttribute=None, clinicalAttributeId='CANCER_TYPE', patientId='IC009', sampleId='IC009', studyId='es_iocurie_2014', uniquePatientKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', uniqueSampleKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', value='Bone Cancer'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='CANCER_TYPE_DETAILED', patientId='IC009', sampleId='IC009', studyId='es_iocurie_2014', uniquePatientKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', uniqueSampleKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', value='Ewing Sarcoma'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='FRACTION_GENOME_ALTERED', patientId='IC009', sampleId='IC009', studyId='es_iocurie_2014', uniquePatientKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', uniqueSampleKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', value='0.0005'),
 ClinicalData(clinicalAttribute=None, clinicalAttributeId='FUSION', patientId='IC009', sampleId='IC009', studyId='es_iocurie_2014', uniquePatientKey='SUMwMDk6ZXNfaW9jdXJpZV8yMDE0', uniqueSampleKey='SUMwMDk6ZXNfaW9j

In [55]:
counter = 0
for x in all_clinical_data:
    for i in all_clinical_data:
        counter += 1

counter result = 441868

Next, to extract useful values from each clinical data element (i.e. clinicalAttributeId, patientId, sampleId, studyId and value), I used a function to extract the number of elements within each list (ie study).

In [56]:
#Function to find out length of nested list of each study:
def get_elements_of_nested_list(element):
    count=0
    if isinstance (element, list):
        for each_element in element:
            count += get_elements_of_nested_list(each_element)
    else:
        count+=1
    return count


In [57]:
#Apply function on all clinical data to get nested list lengths:
num_of_patients_per_study =[]

for i in range(len(all_clinical_data)):
    num_of_patients_per_study.append(get_elements_of_nested_list(all_clinical_data[i]))

In [59]:
#Check the it works using the first study:
len(all_clinical_data[0])

892

In [60]:
#Need to minus 1 from length of each nested list to index is not out of range when extracting data groups:
num_of_patients_per_study2=[]

for x in tqdm(num_of_patients_per_study):
    num_of_patients_per_study2.append(x-1)

100%|██████████| 303/303 [00:00<00:00, 479394.23it/s]


In [None]:
#extracting all clinical data using second index.

all_clinical_id=[]
study_list_nums=[]

for i in range(len(all_clinical_data)):
    study_list_nums.append(i)

In [None]:
patients_study_nums = list(enumerate(num_of_patients_per_study2))
patients_study_nums

**EXTRACTING EACH LINE OF DATA FROM EACH GROUP IN 'ALL CLINICAL DATA'**


In [None]:
final_clinical_id=[]

for i, x in patients_study_nums:
    counter = 0
    while counter<x:
        counter +=1
        try:
            final_clinical_id.append(all_clinical_data[i][counter].clinicalAttributeId)
        except:
            final_clinical_id.append(np.nan)
        

len(final_clinical_id) = 1866596

In [None]:
final_patient_id=[]

for i, x in patients_study_nums:
    counter = 0
    while counter<x:
        counter +=1
        try:
            final_patient_id.append(all_clinical_data[i][counter].patientId)
        except:
            final_patient_id.append(np.nan)

len(final_patient_id) = 1866596

In [None]:
final_sample_id=[]

for i, x in patients_study_nums:
    counter = 0
    while counter<x:
        counter +=1
        try:
            final_sample_id.append(all_clinical_data[i][counter].sampleId)
        except:
            final_sample_id.append(np.nan)

In [None]:
final_study_id=[]

for i, x in patients_study_nums:
    counter = 0
    while counter<x:
        counter +=1
        try:
            final_study_id.append(all_clinical_data[i][counter].studyId)
        except:
            final_study_id.append(np.nan)

In [None]:
final_value=[]

for i, x in patients_study_nums:
    counter = 0
    while counter<x:
        counter +=1
        try:
            final_value.append(all_clinical_data[i][counter].value)
        except:
            final_value.append(np.nan)

**DATAFRAME MADE**

In [None]:
all_dict_for_df = {
    'sample_id': final_sample_id,
    'patient_id':final_patient_id,
    'study_id': final_study_id,
    'clinical_id': final_clinical_id,
    'value': final_value}

In [None]:
all_df = pd.DataFrame(all_dict_for_df)
all_df

In [None]:
# all_df.to_csv('all_df_v2')