# Merge Metadata
This script merges the clinical and manifest files collected from the original TCGA paper and the GDC Data Portal.
## Input files
1. From the **GDC Portal**: first add the RNA-seq files for the relevant samples to the cart.
>After finding the TCGA-SKCM project, click "Explore project data", then "view files in repository". Filter for "sequencing reads" in Data Category, "RNA-seq" in Experimental Strategy, and "STAR 2-Pass Transcriptome" in Workflow Type, then add all files to the cart.
### GDC.manifest.TCGA-SKCM
>In the cart, click "(download) Manifest", then convert the downloaded file to TSV.
### GDC.sample.sheet.TCGA-SKCM
>In the cart, click "(download) Sample sheet".
### GDC.Clinical.TCGA-SKCM
>In the cart, click "(download) Clinical"; in the extracted folder, select the "clinical" file.
### GDC.biospecimen.TCGA-SKCM
>In the cart, click "(download) Biospecimen"; the sample information is inside the extracted folder.

2. From the **original paper**: [Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. Cell. Jun 18 2015;161(7):1681-96.](https://www.sciencedirect.com/science/article/pii/S0092867415006340?via%3Dihub#app2)
### TCGA-SKCM Paper S1
>This is Supplemental Table S1 of the paper.

## Output
A sample table, saved in the config folder inside workflow, containing all available metadata plus the file accession addresses.

In [148]:
import pandas as pd
import numpy as np
import re

In [149]:
manifest = pd.read_csv("/Users/phoebefei/Desktop/WCM/Mets Melanoma/TCGA SKCM/GDC.manifest.TCGA-SKCM.tsv", sep = "\t", header = 0)
print("Manifest:")
manifest.head()

Manifest:


Unnamed: 0,id,filename,md5,size,state
0,00b643fc-b919-4032-9001-476ba3e38c8d,59c50e1e-fc77-441e-93c2-f69c013f362c.rna_seq.t...,dd6eb6fbf55617507ab89a337ed12994,17202229572,validated
1,00faac50-3797-45a8-be93-3dd119dfa7d6,22f63ce0-5364-48b0-b424-5b971a1dcd33.rna_seq.t...,6fd70581fc9693534052b669823f5d11,18856615432,validated
2,02362f12-8a62-4164-ac3d-7557cb6b70e0,50f58d2f-7e67-40d8-a35c-a3e2c0bbe457.rna_seq.t...,1e2e8615204b4d0a20aebb046cbd2c07,21238794353,validated
3,02752c18-6b0b-4839-b391-a1557ca4b3c1,486430aa-f745-4353-9e03-6226ac8d9993.rna_seq.t...,1a0dbbcbdf21a15172af56960069b1b8,14090001048,validated
4,028dca47-e908-4295-abd6-c433e3a42384,ae8e545c-f6ab-40e2-8122-162eab4b2724.rna_seq.t...,0527d3ca26056ecb4975a012126af0b6,13039034650,validated


In [150]:
sample_sheet = pd.read_csv("/Users/phoebefei/Desktop/WCM/Mets Melanoma/TCGA SKCM/GDC.sample.sheet.TCGA-SKCM.tsv",sep = "\t", header = 0)
print("Sample Sheet:")
sample_sheet.head()

Sample Sheet:


Unnamed: 0,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type
0,de2e180c-1ae5-4d79-82c2-59715f1b31d4,4e06764d-9cf1-4cff-ad67-4c5e472fbff5.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-EB-A3Y6,TCGA-EB-A3Y6-01A,Primary Tumor
1,5ada3a87-07f4-4692-bfdd-aa9265ad58ef,84ca3236-683a-40f6-b8a6-600ffd43a0da.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-D3-A1Q1,TCGA-D3-A1Q1-06A,Metastatic
2,dcbfefba-812e-480b-b6bf-6bc4390d2a7e,c3cd581e-ebce-4eff-8bda-63903805d617.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-FS-A4F5,TCGA-FS-A4F5-06A,Metastatic
3,93da78f8-a902-42be-84f9-d72123753653,f0e20c4e-ac0b-45e1-a53b-2de9a19843b1.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-EE-A2GJ,TCGA-EE-A2GJ-06A,Metastatic
4,622bc71b-03f6-4149-ba71-c70b16bb075c,fc68896b-f13f-4478-8573-3e18c64c3a2b.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-GN-A4U3,TCGA-GN-A4U3-06A,Metastatic
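
The manifest and the sample sheet list the same files in a different order, so they have to be matched by key rather than by position. The sketch below is a minimal illustration on made-up toy data (the `*_demo` frames and shortened UUIDs are hypothetical), and it assumes that the manifest `id` column and the sample sheet `File ID` column hold the same file UUID.

```python
import pandas as pd

# Toy stand-ins for the manifest and the sample sheet (values are made up)
manifest_demo = pd.DataFrame({
    "id": ["uuid-1", "uuid-2", "uuid-3"],
    "filename": ["a.bam", "b.bam", "c.bam"],
    "md5": ["m1", "m2", "m3"],
})
sample_sheet_demo = pd.DataFrame({
    "File ID": ["uuid-3", "uuid-1", "uuid-2"],
    "Sample ID": ["TCGA-XX-0003-06A", "TCGA-XX-0001-01A", "TCGA-XX-0002-06A"],
})

# Assumption: manifest "id" and sample sheet "File ID" are the same file UUID.
# validate="one_to_one" raises a MergeError if either side repeats a UUID.
files = manifest_demo.merge(
    sample_sheet_demo,
    left_on="id",
    right_on="File ID",
    how="inner",
    validate="one_to_one",
)
print(files[["id", "filename", "Sample ID"]])
```

If the inner join returns fewer rows than the manifest has, some files in the cart have no sample sheet entry and are worth checking by hand.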


In [151]:
clinical = pd.read_csv("/Users/phoebefei/Desktop/WCM/Mets Melanoma/TCGA SKCM/GDC.Clinical.TCGA-SKCM.tsv",sep = "\t", header = 0)
#replace "'--" with NaN, then remove all columns with empty values
clinical = clinical.replace("'--",np.nan).dropna(axis = 1, how = "all")
print("GDC Clinical Data:")
clinical.head()

GDC Clinical Data:


Unnamed: 0,case_id,case_submitter_id,project_id,age_at_index,days_to_birth,days_to_death,ethnicity,gender,race,vital_status,...,prior_malignancy,prior_treatment,progression_or_recurrence,site_of_resection_or_biopsy,synchronous_malignancy,tissue_or_organ_of_origin,tumor_grade,year_of_diagnosis,treatment_or_therapy,treatment_type
0,f6ff5e77-60d8-4987-9a92-654fd4db81f1,TCGA-FS-A1ZZ,TCGA-SKCM,54,-19733,822,hispanic or latino,female,white,Dead,...,no,Yes,not reported,"Lymph node, NOS",No,"Skin, NOS",not reported,2006,yes,"Radiation Therapy, NOS"
1,f6ff5e77-60d8-4987-9a92-654fd4db81f1,TCGA-FS-A1ZZ,TCGA-SKCM,54,-19733,822,hispanic or latino,female,white,Dead,...,no,Yes,not reported,"Lymph node, NOS",No,"Skin, NOS",not reported,2006,no,"Pharmaceutical Therapy, NOS"
2,f7368803-c545-4407-9a14-2220560a8f80,TCGA-ER-A19B,TCGA-SKCM,42,-15437,2993,not hispanic or latino,male,white,Dead,...,no,Yes,not reported,"Connective, subcutaneous and other soft tissue...",No,"Skin, NOS",not reported,2000,yes,"Pharmaceutical Therapy, NOS"
3,f7368803-c545-4407-9a14-2220560a8f80,TCGA-ER-A19B,TCGA-SKCM,42,-15437,2993,not hispanic or latino,male,white,Dead,...,no,Yes,not reported,"Connective, subcutaneous and other soft tissue...",No,"Skin, NOS",not reported,2000,no,"Radiation Therapy, NOS"
4,f788f950-7953-47a6-b833-f9e5daa28fd3,TCGA-ER-A2NG,TCGA-SKCM,43,-15903,1490,not hispanic or latino,female,white,Dead,...,no,No,not reported,Lymph nodes of axilla or arm,No,"Skin, NOS",not reported,2009,no,"Radiation Therapy, NOS"
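
The `'--` placeholder can also be declared as missing at read time. The sketch below is one possible alternative to the `replace` call above, run on a made-up two-row excerpt rather than the real file; passing `na_values` to `pd.read_csv` lets pandas parse the remaining values of a column as numbers instead of strings.

```python
import io
import pandas as pd

# Made-up excerpt in the style of the GDC TSV; "'--" marks a missing value
tsv_text = (
    "case_submitter_id\tdays_to_death\talways_missing\n"
    "TCGA-XX-0001\t822\t'--\n"
    "TCGA-XX-0002\t'--\t'--\n"
)

# Treat "'--" as NaN while parsing, then drop columns that are entirely empty
demo = pd.read_csv(io.StringIO(tsv_text), sep="\t", na_values=["'--"])
demo = demo.dropna(axis=1, how="all")
print(demo.dtypes)
```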


In [152]:
biospec = pd.read_csv("/Users/phoebefei/Desktop/WCM/Mets Melanoma/TCGA SKCM/GDC.biospecimen.TCGA-SKCM.tsv",sep = "\t", header = 0)
biospec = biospec.replace("'--",np.nan).dropna(axis = 1, how = "all")
print("Biospecimen Information:")
biospec.head()

Biospecimen Information:


Unnamed: 0,project_id,case_id,case_submitter_id,sample_id,sample_submitter_id,composition,days_to_collection,days_to_sample_procurement,initial_weight,is_ffpe,oct_embedded,pathology_report_uuid,preservation_method,sample_type,sample_type_id,state,tissue_type,tumor_descriptor
0,TCGA-SKCM,f6ff5e77-60d8-4987-9a92-654fd4db81f1,TCGA-FS-A1ZZ,16756720-fa90-435c-94cd-6d68458bf096,TCGA-FS-A1ZZ-10A,Not Reported,1780.0,,,False,false,,,Blood Derived Normal,10,released,Not Reported,Not Reported
1,TCGA-SKCM,f6ff5e77-60d8-4987-9a92-654fd4db81f1,TCGA-FS-A1ZZ,6fcb5378-d9e3-460e-9409-4298ea610bb6,TCGA-FS-A1ZZ-06A,Not Reported,1780.0,,270.0,False,false,E65D23A7-1FD6-4672-9B26-5B087AE90370,,Metastatic,6,released,Not Reported,Not Reported
2,TCGA-SKCM,f7368803-c545-4407-9a14-2220560a8f80,TCGA-ER-A19B,4a0e82e0-cea9-49a8-9a03-a7269bdb3f08,TCGA-ER-A19B-01Z,,,0.0,,True,No,,FFPE,Primary Tumor,1,released,Not Reported,
3,TCGA-SKCM,f7368803-c545-4407-9a14-2220560a8f80,TCGA-ER-A19B,7ccba921-8483-4005-856b-504a8fc22e3e,TCGA-ER-A19B-10A,Not Reported,3775.0,,,False,false,,,Blood Derived Normal,10,released,Not Reported,Not Reported
4,TCGA-SKCM,f7368803-c545-4407-9a14-2220560a8f80,TCGA-ER-A19B,d58e1a0c-3aba-45ac-b1ef-013c91fd6ef7,TCGA-ER-A19B-06A,Not Reported,3775.0,,450.0,False,true,E9F013EB-73E0-4674-8F3A-C6BB07CCA742,,Metastatic,6,released,Not Reported,Not Reported


In [153]:
#patient information is in Supplemental Table S1D (sheet 4). The first row is the title; header = 1 takes the column names from the second row, so the title row is skipped.
patient = pd.read_excel("/Users/phoebefei/Desktop/WCM/Mets Melanoma/TCGA SKCM/TCGA-SKCM Paper S1.xlsx", sheet_name = "Supplemental Table S1D", header = 1)
print("patient information from supplemental:")
patient.head()

patient information from supplemental:




Unnamed: 0,Name,ALL_SAMPLES,MUTATIONSUBTYPES,ALL_PRIMARY_VS_METASTATIC,REGIONAL_VS_PRIMARY,UV-signature,RNASEQ-CLUSTER_CONSENHIER,MethTypes.201408,MIRCluster,ProteinCluster,...,CURATED_DISTANT_ANATOMIC_SITE,CURATED_VITAL_STATUS,CURATED_DAYS_TO_DEATH_OR_LAST_FU,CURATED_TCGA_DAYS_TO_DEATH_OR_LAST_FU,"CURATED_MELANOMA_SPECIFIC_VITAL_STATUS [0 = ""ALIVE OR CENSORED""; 1 = ""DEAD OF MELANOMA""]",CURATED_TCGA_SPECIMEN_Distant,CC>TT/nTotal.Mut,DIPYRIM.C>T/nTotal.Mut,DIPYRIM.C>T/n(C>T).mut,SHATTERSEEK_Chromothripsis_calls
0,TCGA-BF-A1PU-01,Yes,BRAF_Hotspot_Mutants,All_Primaries,-,UV signature,keratin,normal-like,MIR.type.3,PROT.type.1,...,-,-,-,-,-,-,0.034483,0.793103,0.938776,chr12
1,TCGA-BF-A1PV-01,Yes,RAS_Hotspot_Mutants,All_Primaries,Primary_Disease,UV signature,keratin,CpG island-methylated,MIR.type.2,-,...,[Not Available],Alive,14,13,0,Primary Tumor,0.039216,0.838235,0.982759,negative
2,TCGA-BF-A1PX-01,Yes,BRAF_Hotspot_Mutants,All_Primaries,-,UV signature,keratin,normal-like,MIR.type.1,PROT.type.1,...,-,-,-,-,-,-,0.064777,0.842105,0.995215,negative
3,TCGA-BF-A1PZ-01,Yes,RAS_Hotspot_Mutants,All_Primaries,-,UV signature,keratin,hypo-methylated,MIR.type.2,PROT.type.2,...,-,-,-,-,-,-,0.085227,0.670455,0.991597,negative
4,TCGA-BF-A1Q0-01,Yes,Triple_WT,All_Primaries,Primary_Disease,not UV,immune,CpG island-methylated,MIR.type.2,-,...,[Not Available],Alive,17,17,0,Primary Tumor,0.012698,0.573016,0.991758,"chr7,chr12"


In [154]:
#Merging
#sample sheet + biospecimen
print(sample_sheet.shape)
print(biospec.shape)
sample_sheet.set_index("Sample ID",inplace = True, drop = False)
biospec.set_index("sample_submitter_id", inplace = True, drop = False)
sample_biospec = pd.concat([biospec.loc[sample_sheet.index],sample_sheet],axis = 1)
print(sample_biospec.shape)
sample_biospec.head()

(473, 8)
(1381, 18)
(473, 26)


Sample ID,project_id,case_id,case_submitter_id,sample_id,sample_submitter_id,composition,days_to_collection,days_to_sample_procurement,initial_weight,is_ffpe,...,tissue_type,tumor_descriptor,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type
TCGA-EB-A3Y6-01A,TCGA-SKCM,8b93bdb4-b64b-4940-a814-f5b3f5ae56cc,TCGA-EB-A3Y6,be3b4702-4307-4d22-b69d-2786eb7c204e,TCGA-EB-A3Y6-01A,Not Reported,129,,470.0,False,...,Not Reported,Not Reported,de2e180c-1ae5-4d79-82c2-59715f1b31d4,4e06764d-9cf1-4cff-ad67-4c5e472fbff5.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-EB-A3Y6,TCGA-EB-A3Y6-01A,Primary Tumor
TCGA-D3-A1Q1-06A,TCGA-SKCM,8c54fcfd-999f-43c5-b31f-26d006f5fff3,TCGA-D3-A1Q1,80e29a9a-98c5-4058-914b-1cd86640527a,TCGA-D3-A1Q1-06A,Not Reported,2305,,120.0,False,...,Not Reported,Not Reported,5ada3a87-07f4-4692-bfdd-aa9265ad58ef,84ca3236-683a-40f6-b8a6-600ffd43a0da.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-D3-A1Q1,TCGA-D3-A1Q1-06A,Metastatic
TCGA-FS-A4F5-06A,TCGA-SKCM,87dc5c05-3121-46bf-adba-061f45cecf27,TCGA-FS-A4F5,881cfcc8-bfcf-4404-8742-5be9be653412,TCGA-FS-A4F5-06A,Not Reported,2522,,30.0,False,...,Not Reported,Not Reported,dcbfefba-812e-480b-b6bf-6bc4390d2a7e,c3cd581e-ebce-4eff-8bda-63903805d617.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-FS-A4F5,TCGA-FS-A4F5-06A,Metastatic
TCGA-EE-A2GJ-06A,TCGA-SKCM,a5286cd0-403a-421b-ab58-11e250740b00,TCGA-EE-A2GJ,c78af90c-3469-49c0-a83f-0ab697618d6e,TCGA-EE-A2GJ-06A,Not Reported,2589,,30.0,False,...,Not Reported,Not Reported,93da78f8-a902-42be-84f9-d72123753653,f0e20c4e-ac0b-45e1-a53b-2de9a19843b1.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-EE-A2GJ,TCGA-EE-A2GJ-06A,Metastatic
TCGA-GN-A4U3-06A,TCGA-SKCM,909335d0-bb4b-4ea3-869a-93e4d91c2154,TCGA-GN-A4U3,fc4872fe-e3cf-4f87-be76-b18d3236a9b8,TCGA-GN-A4U3-06A,Not Reported,3001,,180.0,False,...,Not Reported,Not Reported,622bc71b-03f6-4149-ba71-c70b16bb075c,fc68896b-f13f-4478-8573-3e18c64c3a2b.rna_seq.t...,Sequencing Reads,Aligned Reads,TCGA-SKCM,TCGA-GN-A4U3,TCGA-GN-A4U3-06A,Metastatic
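
The same sample sheet + biospecimen join can be written as a keyed merge, which makes unmatched samples visible. This is a minimal sketch on made-up toy frames (the `*_demo` names and barcodes are hypothetical), not the notebook's own method; it assumes each `Sample ID` appears at most once in the biospecimen table.

```python
import pandas as pd

sample_sheet_demo = pd.DataFrame({
    "Sample ID": ["TCGA-XX-0001-01A", "TCGA-XX-0002-06A"],
    "Sample Type": ["Primary Tumor", "Metastatic"],
})
biospec_demo = pd.DataFrame({
    "sample_submitter_id": ["TCGA-XX-0001-01A", "TCGA-XX-0001-10A", "TCGA-XX-0002-06A"],
    "case_submitter_id": ["TCGA-XX-0001", "TCGA-XX-0001", "TCGA-XX-0002"],
    "is_ffpe": [False, False, False],
})

# Left join keeps every sequenced sample; many_to_one fails loudly if the
# biospecimen table lists a sample barcode more than once.
merged = sample_sheet_demo.merge(
    biospec_demo,
    left_on="Sample ID",
    right_on="sample_submitter_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)
# Samples present in the sample sheet but absent from the biospecimen table
unmatched = merged.loc[merged["_merge"] == "left_only", "Sample ID"].tolist()
print(merged.shape, unmatched)
```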


In [156]:
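#+Patient clinical
#keep the column (drop = False) so that re-running this cell does not raise
#KeyError: "None of ['case_submitter_id'] are in the columns"
sample_biospec.set_index("case_submitter_id", inplace = True, drop = False)
sample_biospec["Case Submitter ID"] = sample_biospec.index
print(patient.shape)
#the case ID is the sample name without its 3-character suffix (e.g. "-01")
patient["Submitter ID"] = [i[:-3] for i in patient["Name"]]
patient.set_index("Submitter ID", inplace = True, drop = True)
#kick out those that do not have metadata
out_rows = patient.index.difference(sample_biospec.index)
patient = patient.drop(index = out_rows)
print(patient.shape)
#match rows by case ID with merge; a positional concat misaligns when a case has several samples
sample_biospec_patient = patient.reset_index().merge(sample_biospec.reset_index(drop = True), left_on = "Submitter ID", right_on = "case_submitter_id", how = "inner")
#set_index with inplace = True returns None, so call head() separately
sample_biospec_patient.set_index("Case Submitter ID", inplace = True, drop = False)
sample_biospec_patient.head()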
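
The case ID is derived from the sample name by cutting off the last three characters, which only works while every suffix has exactly that length. As a hedged alternative, the sketch below takes the first three dash-separated fields of a barcode; `case_id_from_barcode` is a hypothetical helper, and the two barcodes are copied from the tables above.

```python
import pandas as pd

def case_id_from_barcode(barcode: str) -> str:
    """Return the first three dash-separated fields of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])

# One name in the paper's style ("-01") and one in the GDC style ("-06A")
names = pd.Series(["TCGA-BF-A1PU-01", "TCGA-D3-A1Q1-06A"])
case_ids = names.map(case_id_from_barcode)
print(case_ids.tolist())
```

Unlike slicing with `[:-3]`, this gives the same case ID whether or not the sample field carries a vial letter.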

In [None]:
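#only selecting the rows with "no" for treatment (rows of the same case share one case ID, so the ID alone cannot distinguish them)
#copy() so that set_index modifies an independent DataFrame rather than a slice of clinical
clinical_pretx = clinical[clinical["treatment_or_therapy"] == "no"].copy()
print(clinical_pretx.shape)
clinical_pretx.set_index("case_submitter_id", inplace = True, drop = False)
#select rows by case ID with .loc (plain [] selects columns); the intersection skips cases without a "no" row
clinical_pretx.loc[clinical_pretx.index.intersection(sample_biospec.index)]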
#sample_biospec_clinical = pd.concat([sample_biospec,clinical[sample_biospec.index]],axis = 1)
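
Because the clinical table holds one row per treatment type, filtering on "no" can still leave zero, one, or two rows per case. One possible way to get exactly one row per case is to spread the treatment types into columns before joining. The sketch below is an illustration of that idea rather than the notebook's method, and `clinical_demo` is a three-row excerpt of the clinical columns shown earlier.

```python
import pandas as pd

# Three rows excerpted from the clinical table above
clinical_demo = pd.DataFrame({
    "case_submitter_id": ["TCGA-FS-A1ZZ", "TCGA-FS-A1ZZ", "TCGA-ER-A2NG"],
    "treatment_type": [
        "Radiation Therapy, NOS",
        "Pharmaceutical Therapy, NOS",
        "Radiation Therapy, NOS",
    ],
    "treatment_or_therapy": ["yes", "no", "no"],
})

# One row per case, one column per treatment type; a case without a row for
# a given treatment type gets NaN in that column.
per_case = clinical_demo.pivot_table(
    index="case_submitter_id",
    columns="treatment_type",
    values="treatment_or_therapy",
    aggfunc="first",
)
print(per_case)
```

The resulting frame is indexed by `case_submitter_id` with unique labels, so it can be joined to the sample table without duplicating samples.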