## Download cancer gene sets

We want to download the sets of cancer-associated genes from the [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/cosmic/census) and from [Bailey et al. 2018](https://www.sciencedirect.com/science/article/pii/S009286741830237X), so that we can use them in our experiments as a comparison with, or complement to, the Vogelstein et al. gene set.

In [1]:
from pathlib import Path

import pandas as pd

import mpmp.config as cfg
import mpmp.utilities.data_utilities as du

%load_ext autoreload
%autoreload 2

### Download COSMIC CGC data

We downloaded the original CGC data directly from the Sanger Institute website linked above. Downloading the .tsv file requires an account there, so we can't do it programmatically.
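Because the file has to be fetched by hand, it can help to fail early with a clear message when it is missing. The following is a minimal sketch, not part of the original notebook: `load_manual_download` is a hypothetical helper, and the error text is only a suggestion.

```python
from pathlib import Path

import pandas as pd


def load_manual_download(path, **read_kwargs):
    """Read a manually downloaded TSV, failing early with a helpful message.

    Hypothetical helper for illustration; not part of mpmp.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found: download the CGC .tsv from the COSMIC "
            "website (account required) and save it at this location"
        )
    # the CGC export is tab-separated
    return pd.read_csv(path, sep="\t", **read_kwargs)
```

Under these assumptions, the read in the next cell could be written as `load_manual_download(cfg.cosmic_genes_file, index_col=0)`.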

In [2]:
cosmic_df = pd.read_csv(
    cfg.cosmic_genes_file, sep='\t', index_col=0
)

cosmic_df = cosmic_df[
    # use only tier 1 genes
    (cosmic_df.Tier == 1) &
    # drop genes without a catalogued somatic mutation
    (cosmic_df.Somatic == 'yes') &
    # drop genes that are only observed in cancer as fusions
    # (we're not calling fusion genes in our mutation data)
    (cosmic_df['Role in Cancer'] != 'fusion')
].copy()

print(cosmic_df.shape)
cosmic_df.head()

(445, 19)


Gene Symbol,Name,Entrez GeneId,Genome Location,Tier,Hallmark,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
ABI1,abl-interactor 1,10006,10:26746593-26860935,1,Yes,12.1,yes,,AML,,,L,Dom,"TSG, fusion",T,KMT2A,,,"10006,ABI-1,ABI1,E3B1,ENSG00000136754.17,Q8IZP..."
ABL1,v-abl Abelson murine leukemia viral oncogene h...,25,9:130713946-130885683,1,Yes,34.12,yes,,"CML, ALL, T-ALL",,,L,Dom,"oncogene, fusion","T, Mis","BCR, ETV6, NUP214",,,"25,ABL,ABL1,ENSG00000097007.17,JTK7,P00519,c-A..."
ABL2,"c-abl oncogene 2, non-receptor tyrosine kinase",27,1:179099327-179229601,1,,25.2,yes,,AML,,,L,Dom,"oncogene, fusion",T,ETV6,,,"27,ABL2,ABLL,ARG,ENSG00000143322.19,P42684"
ACKR3,atypical chemokine receptor 3,57007,2:236569641-236582358,1,Yes,37.3,yes,,lipoma,,,M,Dom,"oncogene, fusion",T,HMGA2,,,"57007,ACKR3,CMKOR1,CXCR7,ENSG00000144476.5,GPR..."
ACVR1,"activin A receptor, type I",90,2:157736444-157875111,1,Yes,24.1,yes,,DIPG,,,O,Dom,oncogene,Mis,,yes,Fibrodysplasia ossificans progressiva,"90,ACVR1,ACVR1A,ACVRLK2,ALK2,ENSG00000115170.1..."


In [3]:
print(cosmic_df['Role in Cancer'].unique())

# if a gene is annotated as an oncogene/TSG and a fusion gene, just
# get rid of the fusion component
# we'll resolve the dual annotated oncogene/TSG genes later
# (regex=False: ', fusion' is a literal substring, not a pattern)
cosmic_df['Role in Cancer'] = cosmic_df['Role in Cancer'].str.replace(
    ', fusion', '', regex=False
)
print(cosmic_df['Role in Cancer'].unique())

['TSG, fusion' 'oncogene, fusion' 'oncogene' 'TSG' 'oncogene, TSG, fusion'
 'oncogene, TSG']
['TSG' 'oncogene' 'oncogene, TSG']
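The substring replacement above relies on `fusion` always appearing last in the annotation, which holds for the values printed here. As a hedged alternative, the sketch below splits each annotation into tokens and drops the unwanted one regardless of position. `drop_role` is a hypothetical helper, and the example strings are toy values rather than real CGC rows.

```python
import pandas as pd


def drop_role(roles, role_to_drop="fusion"):
    """Remove one role from a comma-separated annotation string."""
    if pd.isna(roles):
        # leave missing annotations untouched
        return roles
    kept = [r.strip() for r in roles.split(",") if r.strip() != role_to_drop]
    return ", ".join(kept)


# toy annotations for illustration only (not real CGC rows)
roles = pd.Series(
    ["TSG, fusion", "fusion, oncogene", "oncogene, TSG, fusion", None]
)
cleaned = roles.map(drop_role)
print(cleaned)
```

With the notebook's data this would be applied as `cosmic_df['Role in Cancer'].map(drop_role)`. A fusion-only annotation would become an empty string, which is why the fusion-only genes are filtered out beforehand.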


### Download Bailey et al. data

This is a supplementary table from [the TCGA Pan-Cancer Atlas driver gene analysis](https://www.sciencedirect.com/science/article/pii/S009286741830237X). The table contains genes identified as cancer drivers by taking the consensus of existing driver identification methods, in addition to manual curation as described in the paper. The table also contains oncogene/TSG predictions for these genes, using the [20/20+ method](https://2020plus.readthedocs.io/en/latest/).

This table (an Excel file) was also downloaded directly from the paper's supplementary data since, as far as I can tell, Cell doesn't provide a straightforward API for fetching it.

In [4]:
class_df = pd.read_excel(
    cfg.data_dir / '1-s2.0-S009286741830237X-mmc1.xlsx', 
    engine='openpyxl', sheet_name='Table S1', index_col='KEY', header=3
)
class_df.rename(columns={'Tumor suppressor or oncogene prediction (by 20/20+)':
                         'classification'},
                inplace=True)

print(class_df.shape)
class_df.head()

(782, 25)



KEY,Gene,Cancer,classification,Decision,Tissue Frequency,Pancan Frequency,Consensus Score,Correlation adusted score,Novel,Rescue Notes,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
ABL1_PANCAN,ABL1,PANCAN,,rescued,,0.011675,0.0,,0.0,Evidence from OncoImpact/DriverNET overlap (SN...,...,,,,,,,,,,
ACVR1_UCEC,ACVR1,UCEC,oncogene,official,0.05303,0.00749,1.5,1.5,0.0,,...,,,,,,,,,,
ACVR1B_PANCAN,ACVR1B,PANCAN,possible tsg,official,,0.010904,1.0,0.0,0.0,,...,,,,,,,,,,
ACVR2A_COADREAD,ACVR2A,COADREAD,tsg,official,0.028481,0.013988,1.5,1.5,0.0,,...,,,,,,,,,,
ACVR2A_LIHC,ACVR2A,LIHC,possible tsg,official,0.031073,0.013988,1.5,1.5,0.0,,...,,,,,,,,,,


In [5]:
bailey_df = (
    class_df[((class_df.Cancer == 'PANCAN') &
             (~class_df.classification.isna()))]
).copy()

# this is the best classification we have to go on for these genes, so if
# a gene is labeled as "possible X", we'll just consider it X
bailey_df['classification'] = (
    bailey_df['classification']
    .str.replace('possible ', '')
    .replace({'tsg': 'TSG', 'oncogene': 'Oncogene'})
)

print(bailey_df.shape)
bailey_df.head()

(170, 25)


KEY,Gene,Cancer,classification,Decision,Tissue Frequency,Pancan Frequency,Consensus Score,Correlation adusted score,Novel,Rescue Notes,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
ACVR1B_PANCAN,ACVR1B,PANCAN,TSG,official,,0.010904,1.0,0.0,0.0,,...,,,,,,,,,,
ACVR2A_PANCAN,ACVR2A,PANCAN,TSG,official,,0.013988,2.0,2.0,0.0,,...,,,,,,,,,,
AJUBA_PANCAN,AJUBA,PANCAN,TSG,official,,0.009032,2.0,1.484499,0.0,,...,,,,,,,,,,
AKT1_PANCAN,AKT1,PANCAN,Oncogene,official,,0.010133,2.5,2.5,0.0,,...,,,,,,,,,,
AMER1_PANCAN,AMER1,PANCAN,TSG,official,,0.027426,2.0,2.0,0.0,,...,,,,,,,,,,
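Collapsing "possible X" into "X" can also be expressed as an explicit lookup table, which has the side benefit of flagging labels we did not anticipate. This is a sketch under assumptions: `LABEL_MAP` and `normalize_labels` are hypothetical names, and the `possible oncogene` entry is an assumed label that does not appear in the rows shown above.

```python
import pandas as pd

# hypothetical mapping; 'possible oncogene' is assumed, not seen in the
# rows displayed in this notebook
LABEL_MAP = {
    "tsg": "TSG",
    "possible tsg": "TSG",
    "oncogene": "Oncogene",
    "possible oncogene": "Oncogene",
}


def normalize_labels(labels):
    """Map raw 20/20+ labels to 'TSG'/'Oncogene', rejecting unknown values."""
    unknown = set(labels.dropna()) - set(LABEL_MAP)
    if unknown:
        raise ValueError(f"unexpected classification labels: {sorted(unknown)}")
    # missing labels stay missing after the mapping
    return labels.map(LABEL_MAP)


# toy labels for illustration only
print(normalize_labels(pd.Series(["tsg", "possible tsg", "oncogene"])))
```

Raising on unknown labels is a design choice: it trades convenience for an early warning if a future version of the table introduces a new category.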


### Load Vogelstein et al. data

This data originally came from [Vogelstein et al. 2013](https://www.science.org/doi/10.1126/science.1235122). Oncogene/TSG annotations also come from 20/20+ predictions.

In [6]:
import mpmp.utilities.data_utilities as du

vogelstein_df = du.load_vogelstein()

print(vogelstein_df.shape)
vogelstein_df.head()

(125, 8)


Unnamed: 0,gene,Gene Name,# Mutated Tumor Samples**,Ocogene score*,Tumor Suppressor Gene score*,classification,Core pathway,Process
0,ABL1,"c-abl oncogene 1, receptor tyrosine kinase",851,0.926904,0.003046,Oncogene,Cell Cycle/Apoptosis,Cell Survival
1,ACVR1B,"activin A receptor, type IB",17,0.0,0.423077,TSG,TGF-b,Cell Survival
2,AKT1,v-akt murine thymoma viral oncogene homolog 1,155,0.929487,0.00641,Oncogene,PI3K,Cell Survival
3,ALK,anaplastic lymphoma receptor tyrosine kinase,189,0.72,0.01,Oncogene,PI3K; RAS,Cell Survival
4,APC,adenomatous polyposis coli,2561,0.024553,0.917222,TSG,APC,Cell Fate
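
Since the goal is to use the COSMIC and Bailey et al. genes as a comparison with the Vogelstein et al. set, a quick look at how much the three sets overlap may be useful. The sketch below is illustrative only: `pairwise_overlap` is a hypothetical helper, and the gene lists are toy values rather than the real contents of the data frames.

```python
from itertools import combinations

import pandas as pd


def pairwise_overlap(gene_sets):
    """Count shared genes for every pair of named gene sets."""
    rows = []
    for (name_a, set_a), (name_b, set_b) in combinations(gene_sets.items(), 2):
        rows.append({
            "set_a": name_a,
            "set_b": name_b,
            "n_shared": len(set_a & set_b),
            "n_union": len(set_a | set_b),
        })
    return pd.DataFrame(rows)


# toy gene lists for illustration only; with the data frames above these
# would be set(cosmic_df.index), set(bailey_df.Gene), set(vogelstein_df.gene)
toy_sets = {
    "cosmic": {"ABL1", "ACVR1", "AKT1"},
    "bailey": {"ACVR1B", "AKT1", "AMER1"},
    "vogelstein": {"ABL1", "ACVR1B", "AKT1", "ALK", "APC"},
}
overlap_df = pairwise_overlap(toy_sets)
print(overlap_df)
```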
