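As of November 10, 2022, the GDC recorded 54 primary sites, spanning six major programs (including TCGA, TARGET, and CPTAC), 44 projects, and 32 disease types, for a total of 14,531 cases and 16,617 isomiRNA sequencing files. From these we compiled 14,530 cases and 16,616 sequencing files.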

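To select data suitable for the machine learning task, we counted the samples of each type in the GDC data. Of the 44 projects, 28 include a Normal control group.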

| Primary Site             | kidney                                                | bronchus and lung            | bronchus and lung          | thyroid gland                |
| ------------------------ | ----------------------------------------------------- | ---------------------------- | -------------------------- | ---------------------------- |
| Disease Type             | adenomas and adenocarcinomas                          | adenomas and adenocarcinomas | squamous cell neoplasms    | adenomas and adenocarcinomas |
| Project (normal samples) | TCGA-KIRC 71, TCGA-KIRP 34, CPTAC-3 148, TCGA-KICH 25 | TCGA-LUAD 46, CPTAC-3 215    | TCGA-LUSC 45, CPTAC-3 96   | TCGA-THCA 59, REBC-THYR 397  |
| Normal total             | 278                                                   | 261                          | 141                        | 456                          |
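
The Normal totals in the table are the sums of the per-project normal-sample counts. A minimal sketch that reproduces the arithmetic is shown here; the numbers are copied from the table, while the nested-dictionary layout and the names `normal_counts` and `normal_totals` are illustrative assumptions.

```python
# Per-project normal-sample counts, copied from the table above.
# The dictionary layout and variable names are illustrative only.
normal_counts = {
    ("kidney", "adenomas and adenocarcinomas"): {
        "TCGA-KIRC": 71, "TCGA-KIRP": 34, "CPTAC-3": 148, "TCGA-KICH": 25,
    },
    ("bronchus and lung", "adenomas and adenocarcinomas"): {
        "TCGA-LUAD": 46, "CPTAC-3": 215,
    },
    ("bronchus and lung", "squamous cell neoplasms"): {
        "TCGA-LUSC": 45, "CPTAC-3": 96,
    },
    ("thyroid gland", "adenomas and adenocarcinomas"): {
        "TCGA-THCA": 59, "REBC-THYR": 397,
    },
}

# Sum the normal samples of all projects within each (primary site, disease type) group.
normal_totals = {
    group: sum(per_project.values())
    for group, per_project in normal_counts.items()
}

for (site, disease), total in normal_totals.items():
    print(f"{site} / {disease}: {total}")
```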

In [1]:
# !pip install ujson

In [1]:
import pandas as pd
import seaborn as sns
import ujson

In [2]:
with open('../../data/GDC/metadata.cart.2022-11-10_part1.json', 'r', encoding='utf8') as fp:
    part1_json = ujson.load(fp)

with open('../../data/GDC/metadata.cart.2022-11-10_part2.json', 'r', encoding='utf8') as fp:
    part2_json = ujson.load(fp)

Data exploration

In [3]:
len(part1_json)

9425

In [4]:
len(part2_json)

7191
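
The metadata export is split into two files, so it can help to combine the parts and confirm that their sizes add up to the 16,616 files reported earlier (9425 + 7191). The following is a minimal sketch that uses the standard-library `json` module rather than `ujson`; the helper name `load_metadata_parts` is hypothetical.

```python
import json
from pathlib import Path


def load_metadata_parts(paths):
    """Load several metadata JSON files (each holding a list of records) into one list."""
    records = []
    for path in paths:
        with Path(path).open("r", encoding="utf8") as fp:
            part = json.load(fp)
        if not isinstance(part, list):
            raise ValueError(f"{path} does not contain a JSON list")
        records.extend(part)
    return records


# The two part sizes printed above add up to the number of files that were kept.
assert 9425 + 7191 == 16616
```

With the real files, the call would look like `load_metadata_parts(['../../data/GDC/metadata.cart.2022-11-10_part1.json', '../../data/GDC/metadata.cart.2022-11-10_part2.json'])`.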

In [5]:
part1_json[0]

{'data_format': 'TXT',
 'access': 'open',
 'associated_entities': [{'entity_submitter_id': 'TCGA-2J-AAB8-01A-12R-A41G-13',
   'entity_type': 'aliquot',
   'case_id': '2e8f90f4-aed3-43b0-985c-dfdc2581f24f',
   'entity_id': 'f2ab8ac5-fed7-4e49-94d2-9c0ed0c56636'}],
 'file_name': '3e3337e6-c444-4539-8ba8-82ae85a96d15.mirbase21.isoforms.quantification.txt',
 'submitter_id': 'mirna_swap_dr11_36_MirnaExpression49657465-cd5a-4b03-a349-3f55c26bcb73_isoform_profiling',
 'data_category': 'Transcriptome Profiling',
 'analysis': {'input_files': [{'data_format': 'BAM',
    'access': 'controlled',
    'file_name': 'TCGA-2J-AAB8-01A-12R-A41G-13_mirna_gdc_realn.bam',
    'submitter_id': 'mirna_swap_dr11_36_AlignedReads49657465-cd5a-4b03-a349-3f55c26bcb73',
    'data_category': 'Sequencing Reads',
    'platform': 'Illumina',
    'file_size': 245033338,
    'created_datetime': '2018-03-20T06:38:21.377090-05:00',
    'md5sum': '94d49b68a6f8b5c1a8d87a07aeea5b2b',
    'updated_datetime': '2018-11-15T21:47:
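
Each metadata record lists the aliquots a file was derived from under `associated_entities`. The sketch below flattens that structure into one row per file and aliquot. It assumes every record carries the `file_name` and `associated_entities` keys seen in the record above; the function name `file_entity_rows` is hypothetical.

```python
def file_entity_rows(records):
    """Yield one (file_name, entity_submitter_id, case_id) tuple per associated entity."""
    for record in records:
        for entity in record.get("associated_entities", []):
            yield (
                record["file_name"],
                entity["entity_submitter_id"],
                entity["case_id"],
            )


# Example record trimmed to the fields used here; values are copied from the output above.
example = [{
    "file_name": "3e3337e6-c444-4539-8ba8-82ae85a96d15.mirbase21.isoforms.quantification.txt",
    "associated_entities": [{
        "entity_submitter_id": "TCGA-2J-AAB8-01A-12R-A41G-13",
        "entity_type": "aliquot",
        "case_id": "2e8f90f4-aed3-43b0-985c-dfdc2581f24f",
        "entity_id": "f2ab8ac5-fed7-4e49-94d2-9c0ed0c56636",
    }],
}]

rows = list(file_entity_rows(example))
print(rows)
```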

In [6]:
df_part1 = pd.read_csv('../../data/GDC/gdc_sample_sheet.2022-11-10_part1.tsv',sep='\t')
df_part2 = pd.read_csv('../../data/GDC/gdc_sample_sheet.2022-11-10_part2.tsv',sep='\t')

In [7]:
df_tcga_abbrev = pd.read_excel('../../data/GDC/TCGA_Abbrev.xlsx')
df_tcga_abbrev.head()

Unnamed: 0,Study Abbreviation,Study Name
0,LAML,Acute Myeloid Leukemia
1,ACC,Adrenocortical carcinoma
2,BLCA,Bladder Urothelial Carcinoma
3,LGG,Brain Lower Grade Glioma
4,BRCA,Breast invasive carcinoma


In [8]:
df_tcga_abbrev['Study Abbreviation'] = 'TCGA-' + df_tcga_abbrev['Study Abbreviation'] 
df_tcga_abbrev.head()

Unnamed: 0,Study Abbreviation,Study Name
0,TCGA-LAML,Acute Myeloid Leukemia
1,TCGA-ACC,Adrenocortical carcinoma
2,TCGA-BLCA,Bladder Urothelial Carcinoma
3,TCGA-LGG,Brain Lower Grade Glioma
4,TCGA-BRCA,Breast invasive carcinoma


In [9]:
df_sample_sheet = pd.concat([df_part1,df_part2])
print(df_sample_sheet.info())
df_sample_sheet.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16616 entries, 0 to 7190
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   File ID        16616 non-null  object
 1   File Name      16616 non-null  object
 2   Data Category  16616 non-null  object
 3   Data Type      16616 non-null  object
 4   Project ID     16616 non-null  object
 5   Case ID        16616 non-null  object
 6   Sample ID      16616 non-null  object
 7   Sample Type    16616 non-null  object
dtypes: object(8)
memory usage: 1.1+ MB
None


Unnamed: 0,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type
0,806b9787-2366-4dfe-b053-69267fb3e4fc,3e3337e6-c444-4539-8ba8-82ae85a96d15.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2J-AAB8,TCGA-2J-AAB8-01A,Primary Tumor
1,f0d9f59e-06dd-4c16-a31d-37ace1c7edd1,3b219bb6-74eb-45b0-ba74-76c475a69d1b.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7289,TCGA-HZ-7289-01A,Primary Tumor
2,3f2a4877-473e-4595-bb79-3b6c25219201,fa17ecee-b9d7-4305-90f3-8afd3a12f256.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-Z5-AAPL,TCGA-Z5-AAPL-01A,Primary Tumor
3,01e18cae-c606-40c2-a1f7-f14063c2b62d,9f056d34-77f0-42a9-9b08-efaaebb2b03c.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2L-AAQI,TCGA-2L-AAQI-01A,Primary Tumor
4,73dc240c-51ed-43f7-9baf-6c36f12121ef,2c9272ba-2dd6-46cf-9177-51e7db8eefc7.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7926,TCGA-HZ-7926-01A,Primary Tumor


In [10]:
df_sample_sheet  = pd.merge(df_sample_sheet,df_tcga_abbrev,left_on='Project ID',right_on='Study Abbreviation')
df_sample_sheet.head()

Unnamed: 0,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type,Study Abbreviation,Study Name
0,806b9787-2366-4dfe-b053-69267fb3e4fc,3e3337e6-c444-4539-8ba8-82ae85a96d15.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2J-AAB8,TCGA-2J-AAB8-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma
1,f0d9f59e-06dd-4c16-a31d-37ace1c7edd1,3b219bb6-74eb-45b0-ba74-76c475a69d1b.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7289,TCGA-HZ-7289-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma
2,3f2a4877-473e-4595-bb79-3b6c25219201,fa17ecee-b9d7-4305-90f3-8afd3a12f256.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-Z5-AAPL,TCGA-Z5-AAPL-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma
3,01e18cae-c606-40c2-a1f7-f14063c2b62d,9f056d34-77f0-42a9-9b08-efaaebb2b03c.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2L-AAQI,TCGA-2L-AAQI-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma
4,73dc240c-51ed-43f7-9baf-6c36f12121ef,2c9272ba-2dd6-46cf-9177-51e7db8eefc7.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7926,TCGA-HZ-7926-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma
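
`pd.merge` performs an inner join by default, so rows whose `Project ID` has no match in the abbreviation table (for example, non-TCGA projects) are dropped without a warning. The sketch below shows one way to surface such rows with `how='left'` and `indicator=True`. It runs on toy data: the two small frames and the `CPTAC-3` row with its sample ID are illustrative stand-ins, not the real sample sheet.

```python
import pandas as pd

# Toy stand-ins for the sample sheet and the abbreviation table (illustrative only).
sheet = pd.DataFrame({
    "Project ID": ["TCGA-PAAD", "TCGA-PAAD", "CPTAC-3"],
    "Sample ID": ["TCGA-2J-AAB8-01A", "TCGA-HZ-7289-01A", "hypothetical-sample"],
})
abbrev = pd.DataFrame({
    "Study Abbreviation": ["TCGA-PAAD"],
    "Study Name": ["Pancreatic adenocarcinoma"],
})

# A left join keeps every sample-sheet row; `_merge` records whether a match was found.
merged = pd.merge(
    sheet,
    abbrev,
    how="left",
    left_on="Project ID",
    right_on="Study Abbreviation",
    indicator=True,
    validate="many_to_one",  # each project should map to at most one study name
)

# Projects that an inner join would have dropped.
unmatched = merged.loc[merged["_merge"] == "left_only", "Project ID"].unique().tolist()
print(unmatched)
```

Whether to keep the unmatched projects depends on the analysis; the point of the check is to make the drop visible before the later counts are computed.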


In [11]:
df_platform = pd.read_csv('../../data/GDC/PanCanAtlas_miRNA_sample_information_list.txt',sep='\t')[['id','Platform','Disease']]
print(df_platform.shape)
df_platform.head()

(10824, 3)


Unnamed: 0,id,Platform,Disease
0,TCGA-OR-A5J1-01A-11R-A29W-13,HiSeq,ACC
1,TCGA-OR-A5J2-01A-11R-A29W-13,HiSeq,ACC
2,TCGA-OR-A5J3-01A-11R-A29W-13,HiSeq,ACC
3,TCGA-OR-A5J4-01A-11R-A29W-13,HiSeq,ACC
4,TCGA-OR-A5J5-01A-11R-A29W-13,HiSeq,ACC


In [12]:
'-'.join('TCGA-OR-A5J5-01A-11R-A29W-13'.split('-')[:4])

'TCGA-OR-A5J5-01A'
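
The expression above keeps the first four hyphen-separated fields of a barcode, which matches the `Sample ID` format in the sample sheet. A small sketch that wraps it in a function with a length check follows; the name `sample_barcode` and the `n_fields` parameter are hypothetical.

```python
def sample_barcode(aliquot_barcode, n_fields=4):
    """Return the first `n_fields` hyphen-separated fields of a barcode."""
    fields = aliquot_barcode.split("-")
    if len(fields) < n_fields:
        raise ValueError(f"expected at least {n_fields} fields in {aliquot_barcode!r}")
    return "-".join(fields[:n_fields])


print(sample_barcode("TCGA-OR-A5J5-01A-11R-A29W-13"))  # TCGA-OR-A5J5-01A
```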

In [13]:
df_platform["new_id"] =['-'.join(i.split('-')[:4]) for i in df_platform["id"]]
df_platform.drop_duplicates(['new_id','Platform'],inplace=True)
df_platform

Unnamed: 0,id,Platform,Disease,new_id
0,TCGA-OR-A5J1-01A-11R-A29W-13,HiSeq,ACC,TCGA-OR-A5J1-01A
1,TCGA-OR-A5J2-01A-11R-A29W-13,HiSeq,ACC,TCGA-OR-A5J2-01A
2,TCGA-OR-A5J3-01A-11R-A29W-13,HiSeq,ACC,TCGA-OR-A5J3-01A
3,TCGA-OR-A5J4-01A-11R-A29W-13,HiSeq,ACC,TCGA-OR-A5J4-01A
4,TCGA-OR-A5J5-01A-11R-A29W-13,HiSeq,ACC,TCGA-OR-A5J5-01A
...,...,...,...,...
10819,TCGA-YZ-A980-01A-11R-A40B-13,HiSeq,UVM,TCGA-YZ-A980-01A
10820,TCGA-YZ-A982-01A-11R-A40B-13,HiSeq,UVM,TCGA-YZ-A982-01A
10821,TCGA-YZ-A983-01A-11R-A40B-13,HiSeq,UVM,TCGA-YZ-A983-01A
10822,TCGA-YZ-A984-01A-11R-A40B-13,HiSeq,UVM,TCGA-YZ-A984-01A
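
Deduplicating on `['new_id', 'Platform']` still leaves two rows for any sample that appears with more than one platform, and each such sample would be duplicated when the table is merged into the sample sheet. The sketch below lists those samples. It runs on toy data; the `GA` row for `TCGA-OR-A5J1-01A` is a hypothetical example, not a value from the real file.

```python
import pandas as pd

# Toy platform table (illustrative; the GA row is made up to show the effect).
platform = pd.DataFrame({
    "new_id": ["TCGA-OR-A5J1-01A", "TCGA-OR-A5J1-01A", "TCGA-OR-A5J2-01A"],
    "Platform": ["HiSeq", "GA", "HiSeq"],
})
platform = platform.drop_duplicates(["new_id", "Platform"])

# Count distinct platforms per sample and keep the samples with more than one.
n_platforms = platform.groupby("new_id")["Platform"].nunique()
multi_platform = n_platforms[n_platforms > 1].index.tolist()
print(multi_platform)
```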


In [14]:
df_sample_sheet  = pd.merge(df_sample_sheet,df_platform,left_on='Sample ID',right_on='new_id')
df_sample_sheet.head()

Unnamed: 0,File ID,File Name,Data Category,Data Type,Project ID,Case ID,Sample ID,Sample Type,Study Abbreviation,Study Name,id,Platform,Disease,new_id
0,806b9787-2366-4dfe-b053-69267fb3e4fc,3e3337e6-c444-4539-8ba8-82ae85a96d15.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2J-AAB8,TCGA-2J-AAB8-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma,TCGA-2J-AAB8-01A-12R-A41G-13,HiSeq,PAAD,TCGA-2J-AAB8-01A
1,f0d9f59e-06dd-4c16-a31d-37ace1c7edd1,3b219bb6-74eb-45b0-ba74-76c475a69d1b.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7289,TCGA-HZ-7289-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma,TCGA-HZ-7289-01A-11R-2155-13,HiSeq,PAAD,TCGA-HZ-7289-01A
2,3f2a4877-473e-4595-bb79-3b6c25219201,fa17ecee-b9d7-4305-90f3-8afd3a12f256.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-Z5-AAPL,TCGA-Z5-AAPL-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma,TCGA-Z5-AAPL-01A-12R-A41G-13,HiSeq,PAAD,TCGA-Z5-AAPL-01A
3,01e18cae-c606-40c2-a1f7-f14063c2b62d,9f056d34-77f0-42a9-9b08-efaaebb2b03c.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-2L-AAQI,TCGA-2L-AAQI-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma,TCGA-2L-AAQI-01A-12R-A39J-13,HiSeq,PAAD,TCGA-2L-AAQI-01A
4,73dc240c-51ed-43f7-9baf-6c36f12121ef,2c9272ba-2dd6-46cf-9177-51e7db8eefc7.mirbase21...,Transcriptome Profiling,Isoform Expression Quantification,TCGA-PAAD,TCGA-HZ-7926,TCGA-HZ-7926-01A,Primary Tumor,TCGA-PAAD,Pancreatic adenocarcinoma,TCGA-HZ-7926-01A-11R-2155-13,HiSeq,PAAD,TCGA-HZ-7926-01A


In [15]:
df_sample_sheet.loc[df_sample_sheet['Sample Type'].str.contains('Primary Tumor'),'Sample Type'] = 'Primary Tumor'
df_sample_sheet.loc[df_sample_sheet['Sample Type'].str.contains('Solid Tissue Normal'),'Sample Type'] = 'Solid Tissue Normal'
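
The two assignments above collapse any `Sample Type` value that contains `Primary Tumor` or `Solid Tissue Normal` into that single label, with `Primary Tumor` taking precedence because it is applied first. The same rule can be sketched as a plain function; the function name is hypothetical, and the comma-joined inputs in the example are assumed values for illustration, not values taken from the data.

```python
def normalize_sample_type(value):
    """Collapse a sample-type string to a single label, mirroring the two assignments above."""
    # Checked first, so it wins when both labels occur in the same value.
    if "Primary Tumor" in value:
        return "Primary Tumor"
    if "Solid Tissue Normal" in value:
        return "Solid Tissue Normal"
    # Every other sample type is left unchanged.
    return value


for raw in ["Primary Tumor, Primary Tumor", "Solid Tissue Normal, Solid Tissue Normal", "Metastatic"]:
    print(raw, "->", normalize_sample_type(raw))
```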

Sample type distribution

In [16]:
df_sample_sheet['Sample Type'].value_counts()

Primary Tumor                                      9611
Solid Tissue Normal                                 658
Metastatic                                          379
Primary Blood Derived Cancer - Peripheral Blood     188
Recurrent Tumor                                      37
Additional - New Primary                             11
Additional Metastatic                                 1
Name: Sample Type, dtype: int64

In [17]:
sum(df_sample_sheet['Sample Type'].value_counts().index.str.contains('Normal'))

1

In [18]:
df_sample_sheet['Sample Type'].value_counts().loc[df_sample_sheet['Sample Type'].value_counts().index.str.contains('Normal')]

Solid Tissue Normal    658
Name: Sample Type, dtype: int64

**Generate the summary table**

In [21]:
df_sample_sheet.loc[(df_sample_sheet['Project ID']=='TCGA-LUAD')&(df_sample_sheet['Sample Type']=='Solid Tissue Normal')].shape

(45, 14)

In [36]:
df_sample_sheet.loc[df_sample_sheet['Project ID']=='TCGA-LUAD','Study Name'].values[0]

'Lung adenocarcinoma'

In [41]:
df_stats = pd.DataFrame(columns=['研究名称','Primary Tumor','Solid Tissue Normal','Hiseq','GA'])

In [42]:
for project in set(df_sample_sheet['Project ID']):
    in_project = df_sample_sheet['Project ID'] == project
    # Use the study name as the row label; fall back to a blank if it is empty
    study_name = df_sample_sheet.loc[in_project, 'Study Name'].values[0]
    abbrev = study_name if study_name else ' '
    print(abbrev)
    # Count rows per sample type and per sequencing platform within this project
    num_tumor = df_sample_sheet.loc[in_project & (df_sample_sheet['Sample Type'] == 'Primary Tumor')].shape[0]
    num_normal = df_sample_sheet.loc[in_project & (df_sample_sheet['Sample Type'] == 'Solid Tissue Normal')].shape[0]
    num_hiseq = df_sample_sheet.loc[in_project & (df_sample_sheet['Platform'] == 'HiSeq')].shape[0]
    num_ga = df_sample_sheet.loc[in_project & (df_sample_sheet['Platform'] == 'GA')].shape[0]

    df_stats.loc[project] = [abbrev, num_tumor, num_normal, num_hiseq, num_ga]

Head and Neck squamous cell carcinoma
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
Pheochromocytoma and Paraganglioma
Uterine Corpus Endometrial Carcinoma
Kidney renal clear cell carcinoma
Glioblastoma multiforme
Colon adenocarcinoma
Cervical squamous cell carcinoma and endocervical adenocarcinoma
Lung squamous cell carcinoma
Cholangiocarcinoma
Skin Cutaneous Melanoma
Brain Lower Grade Glioma
Lung adenocarcinoma
Liver hepatocellular carcinoma
Pancreatic adenocarcinoma
Esophageal carcinoma
Adrenocortical carcinoma
Bladder Urothelial Carcinoma
Stomach adenocarcinoma
Acute Myeloid Leukemia
Breast invasive carcinoma
Kidney Chromophobe
Ovarian serous cystadenocarcinoma
Mesothelioma
Prostate adenocarcinoma
Rectum adenocarcinoma
Testicular Germ Cell Tumors
Sarcoma
Thyroid carcinoma
Thymoma
Uveal Melanoma
Uterine Carcinosarcoma
Kidney renal papillary cell carcinoma
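
The per-project loop above can also be expressed with `pd.crosstab`, which counts rows for every combination of two columns. The following is a sketch on toy data, not a replacement for the notebook's result; the toy rows and the variable names are illustrative.

```python
import pandas as pd

# Toy sample sheet with the three columns the loop relies on (illustrative rows).
toy = pd.DataFrame({
    "Project ID": ["TCGA-KICH", "TCGA-KICH", "TCGA-KICH", "TCGA-ACC"],
    "Sample Type": ["Primary Tumor", "Primary Tumor", "Solid Tissue Normal", "Primary Tumor"],
    "Platform": ["HiSeq", "GA", "HiSeq", "HiSeq"],
})

# One table of counts per sample type and one per platform, both indexed by project.
by_type = pd.crosstab(toy["Project ID"], toy["Sample Type"])
by_platform = pd.crosstab(toy["Project ID"], toy["Platform"])

# Join the two tables and keep a fixed column order; missing columns become 0.
stats = by_type.join(by_platform).reindex(
    columns=["Primary Tumor", "Solid Tissue Normal", "HiSeq", "GA"],
    fill_value=0,
)
print(stats)
```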


In [48]:
df_stats.to_csv('../../data/GDC/TCGA_miRNA_sample_information_list.tsv',sep='\t')# ,encoding='gb2312'
df_stats

Unnamed: 0,研究名称,Primary Tumor,Solid Tissue Normal,Hiseq,GA
TCGA-HNSC,Head and Neck squamous cell carcinoma,519,44,529,36
TCGA-DLBC,Lymphoid Neoplasm Diffuse Large B-cell Lymphoma,47,0,47,0
TCGA-PCPG,Pheochromocytoma and Paraganglioma,178,3,186,0
TCGA-UCEC,Uterine Corpus Endometrial Carcinoma,527,32,433,127
TCGA-KIRC,Kidney renal clear cell carcinoma,523,70,314,280
TCGA-GBM,Glioblastoma multiforme,0,5,5,0
TCGA-COAD,Colon adenocarcinoma,425,8,265,170
TCGA-CESC,Cervical squamous cell carcinoma and endocervi...,306,3,311,0
TCGA-LUSC,Lung squamous cell carcinoma,467,44,380,131
TCGA-CHOL,Cholangiocarcinoma,36,9,45,0


In [22]:
for project in set(df_sample_sheet['Project ID']):
    print('#'*20)
    print(f'{project}')
    print('#'*20)
    print(df_sample_sheet.loc[df_sample_sheet['Project ID']==project]['Sample Type'].value_counts())
    print('\n')

####################
TARGET-ALL-P3
####################
Primary Blood Derived Cancer - Bone Marrow         28
Recurrent Blood Derived Cancer - Bone Marrow        7
Primary Blood Derived Cancer - Peripheral Blood     4
Name: Sample Type, dtype: int64


####################
TCGA-ACC
####################
Primary Tumor    80
Name: Sample Type, dtype: int64


####################
TCGA-BLCA
####################
Primary Tumor          417
Solid Tissue Normal     19
Metastatic               1
Name: Sample Type, dtype: int64


####################
TCGA-CESC
####################
Primary Tumor          307
Solid Tissue Normal      3
Metastatic               2
Name: Sample Type, dtype: int64


####################
TCGA-KIRP
####################
Primary Tumor               291
Solid Tissue Normal          34
Additional - New Primary      1
Name: Sample Type, dtype: int64


####################
TCGA-LGG
####################
Primary Tumor      512
Recurrent Tumor     18
Name: Sample Type, dtype: int6

In [45]:
threshold = 0  # keep projects that have at least one Normal sample
i = 0

for project in set(df_sample_sheet['Project ID']):
    counts = df_sample_sheet.loc[df_sample_sheet['Project ID']==project]['Sample Type'].value_counts()
    normal_counts = counts.loc[counts.index.str.contains('Normal')]
    if len(normal_counts) > 0:  # the project has a Normal sample type
        if normal_counts.iloc[0] > threshold:
            i += 1
            print('#'*20)
            print(f'{i} - {project}')
            print('#'*20)
            print(counts)
            print('\n')

####################
1 - TCGA-BLCA
####################
Primary Tumor          417
Solid Tissue Normal     19
Metastatic               1
Name: Sample Type, dtype: int64


####################
2 - TCGA-CESC
####################
Primary Tumor          307
Solid Tissue Normal      3
Metastatic               2
Name: Sample Type, dtype: int64


####################
3 - TCGA-KIRP
####################
Primary Tumor               291
Solid Tissue Normal          34
Additional - New Primary      1
Name: Sample Type, dtype: int64


####################
4 - TCGA-THCA
####################
Primary Tumor          506
Solid Tissue Normal     59
Metastatic               8
Name: Sample Type, dtype: int64


####################
5 - TCGA-KICH
####################
Primary Tumor          66
Solid Tissue Normal    25
Name: Sample Type, dtype: int64


####################
6 - TCGA-SKCM
####################
Metastatic               352
Primary Tumor             97
Solid Tissue Normal        2
Additional Metas

In [46]:
threshold = 20  # keep projects with more than 20 Normal samples
i = 0

for project in set(df_sample_sheet['Project ID']):
    counts = df_sample_sheet.loc[df_sample_sheet['Project ID']==project]['Sample Type'].value_counts()
    normal_counts = counts.loc[counts.index.str.contains('Normal')]
    if len(normal_counts) > 0:  # the project has a Normal sample type
        if normal_counts.iloc[0] > threshold:
            i += 1
            print('#'*20)
            print(f'{i} - {project}')
            print('#'*20)
            print(counts)
            print('\n')

####################
1 - TCGA-KIRP
####################
Primary Tumor               291
Solid Tissue Normal          34
Additional - New Primary      1
Name: Sample Type, dtype: int64


####################
2 - TCGA-THCA
####################
Primary Tumor          506
Solid Tissue Normal     59
Metastatic               8
Name: Sample Type, dtype: int64


####################
3 - TCGA-KICH
####################
Primary Tumor          66
Solid Tissue Normal    25
Name: Sample Type, dtype: int64


####################
4 - CPTAC-3
####################
Primary Tumor          1101
Solid Tissue Normal     579
Name: Sample Type, dtype: int64


####################
5 - TCGA-HNSC
####################
Primary Tumor          523
Solid Tissue Normal     44
Metastatic               2
Name: Sample Type, dtype: int64


####################
6 - TCGA-STAD
####################
Primary Tumor          446
Solid Tissue Normal     45
Name: Sample Type, dtype: int64


####################
7 - TCGA-PRAD
#######
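
The threshold filter in the two cells above can be written more compactly by counting Normal samples per project first and filtering afterwards. The sketch below runs on toy data and differs slightly from the loops: it sums all sample types whose name contains `Normal`, whereas the loops compare only the most frequent one. The function name `projects_with_normals` and the project names `A`, `B`, `C` are hypothetical.

```python
import pandas as pd


def projects_with_normals(df, threshold=0):
    """Return the per-project Normal sample counts that exceed `threshold`."""
    is_normal = df["Sample Type"].str.contains("Normal")
    normal_per_project = df.loc[is_normal].groupby("Project ID").size()
    kept = normal_per_project[normal_per_project > threshold]
    return kept.sort_values(ascending=False)


# Toy data: project A has 3 normals, B has 1, C has none (illustrative only).
toy = pd.DataFrame({
    "Project ID": ["A"] * 5 + ["B"] * 2 + ["C"],
    "Sample Type": (
        ["Solid Tissue Normal"] * 3 + ["Primary Tumor"] * 2
        + ["Solid Tissue Normal", "Primary Tumor"]
        + ["Primary Tumor"]
    ),
})

print(projects_with_normals(toy, threshold=0))
print(projects_with_normals(toy, threshold=2))
```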

### Disease statistics


**hematopoietic and reticuloendothelial systems**

1. myeloid leukemias


TARGET-AML
1774

TCGA-LAML
188

TARGET-ALL-P3
2

62 + 0 + 0 = 62

**kidney**

1. adenomas and adenocarcinomas

TCGA-KIRC
516

TCGA-KIRP
291

CPTAC-3
220

TCGA-KICH
66

71 + 34 + 148 + 25 = 278

**bronchus and lung**

1. adenomas and adenocarcinomas

TCGA-LUAD
477

CPTAC-3
229

46 + 215 = 261

2. squamous cell neoplasms

TCGA-LUSC
478

CPTAC-3
108

45 + 96 = 141

**breast**

ductal and lobular neoplasms

TCGA-BRCA
1035

CPTAC-2
103


104 + 0 = 104

**thyroid gland**

adenomas and adenocarcinomas

TCGA-THCA
505

REBC-THYR
436

59 + 397 = 456

**brain**

gliomas

TCGA-LGG
512

CPTAC-3
98

0 + 0 = 0 

**ovary**

cystic, mucinous and serous neoplasms

TCGA-OV
489

CPTAC-2
71

0 + 0 = 0

**colon**

adenomas and adenocarcinomas


TCGA-COAD
371

CPTAC-2
104

TCGA-READ
6

8 + 0 + 3 = 11