# Data download

1. Make an authentication string
```
echo 'your_email@example.com:your_password' | base64

<BASE64_AUTH_STRING>
```

2. Obtain a download URL
```
curl -H "Authorization: Basic <BASE64_AUTH_STRING>" "https://cancer.sanger.ac.uk/api/mono/products/v1/downloads/scripted?path=grch38/cosmic/v99/Cosmic_CancerGeneCensus_Tsv_v99_GRCh38.tar&bucket=downloads"
```

3. Download the data file from the URL returned by the request in step 2 (shown here as the placeholder `<DOWNLOAD_URL>`)
```
curl "<DOWNLOAD_URL>" --output data/Cosmic_CancerGeneCensus_Tsv_v99_GRCh38.tar
```

4. Extract the data file
```
tar -xvf data/Cosmic_CancerGeneCensus_Tsv_v99_GRCh38.tar -C data/
```
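
The four steps above can also be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, and it assumes that the response to the authenticated request is a JSON object whose `url` key holds the download link. Note that `echo` passes a trailing newline to `base64`, whereas the sketch encodes `email:password` without one.

```python
import base64
import json
import tarfile
import urllib.request

# Endpoint and file path copied from the curl command in step 2.
API_URL = (
    "https://cancer.sanger.ac.uk/api/mono/products/v1/downloads/scripted"
    "?path=grch38/cosmic/v99/Cosmic_CancerGeneCensus_Tsv_v99_GRCh38.tar"
    "&bucket=downloads"
)


def basic_auth_header(email, password):
    """Step 1: build the value of the HTTP Basic Authorization header."""
    token = base64.b64encode(f"{email}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def get_download_url(email, password, api_url=API_URL):
    """Step 2: request a download URL (assumes a JSON body with a "url" key)."""
    request = urllib.request.Request(
        api_url, headers={"Authorization": basic_auth_header(email, password)}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["url"]


def download_and_extract(url, tar_path, target_dir):
    """Steps 3 and 4: save the archive, then unpack it into target_dir."""
    urllib.request.urlretrieve(url, tar_path)
    with tarfile.open(tar_path) as archive:
        archive.extractall(target_dir)


# Only the offline part is run here; the network calls need real credentials.
header = basic_auth_header("your_email@example.com", "your_password")
print(header.startswith("Basic "))
```

With valid credentials, `download_and_extract(get_download_url(email, password), "data/Cosmic_CancerGeneCensus_Tsv_v99_GRCh38.tar", "data/")` would mirror steps 2 to 4.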

# COSMIC CGC

Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census (CGC) data, mapped to GRCh38.

In [8]:
import pandas as pd
cosmic_cgc = pd.read_csv('../data/Cosmic_CancerGeneCensus_v99_GRCh38.tsv', sep='\t')
for column in cosmic_cgc.columns:
    print(column)
cosmic_cgc.head()

GENE_SYMBOL
NAME
COSMIC_GENE_ID
CHROMOSOME
GENOME_START
GENOME_STOP
CHR_BAND
SOMATIC
GERMLINE
TUMOUR_TYPES_SOMATIC
TUMOUR_TYPES_GERMLINE
CANCER_SYNDROME
TISSUE_TYPE
MOLECULAR_GENETICS
ROLE_IN_CANCER
MUTATION_TYPES
TRANSLOCATION_PARTNER
OTHER_GERMLINE_MUT
OTHER_SYNDROME
TIER
SYNONYMS


Unnamed: 0,GENE_SYMBOL,NAME,COSMIC_GENE_ID,CHROMOSOME,GENOME_START,GENOME_STOP,CHR_BAND,SOMATIC,GERMLINE,TUMOUR_TYPES_SOMATIC,...,CANCER_SYNDROME,TISSUE_TYPE,MOLECULAR_GENETICS,ROLE_IN_CANCER,MUTATION_TYPES,TRANSLOCATION_PARTNER,OTHER_GERMLINE_MUT,OTHER_SYNDROME,TIER,SYNONYMS
0,A1CF,APOBEC1 complementation factor,COSG68236,10,50799409.0,50885675.0,10q11.23,y,n,melanoma,...,,E,,oncogene,Mis,,n,,2,"A1CF,ENSG00000148584.14,29974,ACF,ACF64,ACF65,..."
1,ABI1,abl interactor 1,COSG100962,10,26746593.0,26861087.0,10p12.1,y,n,AML,...,,L,Dom,"TSG, fusion",T,KMT2A,n,,1,"ABI1,ENSG00000136754.17,Q8IZP0,10006,ABI-1,E3B1"
2,ABL1,"ABL proto-oncogene 1, non-receptor tyrosine ki...",COSG106650,9,130713946.0,130887675.0,9q34.12,y,n,"CML, ALL, T-ALL",...,,L,Dom,"oncogene, fusion","T, Mis","BCR, ETV6, NUP214",n,,1,"ABL1,ENSG00000097007.17,P00519,25,JTK7,c-ABL,p150"
3,ABL2,"ABL proto-oncogene 2, non-receptor tyrosine ki...",COSG93778,1,179099327.0,179229684.0,1q25.2,y,n,AML,...,,L,Dom,"oncogene, fusion",T,ETV6,n,,1,"ABL2,ENSG00000143322.19,P42684,27,ARG"
4,ACKR3,atypical chemokine receptor 3,COSG97311,2,236567787.0,236582358.0,2q37.3,y,n,lipoma,...,,M,Dom,"oncogene, fusion",T,HMGA2,n,,1,"ACKR3,ENSG00000144476.5,P25106,57007,GPR159,RDC1"


- __GENE_SYMBOL__ [1]: The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.
- __NAME__ [2]: Gene descriptive name.
- __COSMIC_GENE_ID__ [3]: A unique COSMIC gene identifier (COSG) is used to identify a gene within the file. This identifier can be used to retrieve additional Gene information from the Cosmic_Genes file.
- __CHROMOSOME__ [4]: The chromosome on which the gene is located (1-22, X, Y or MT).
- __GENOME_START__ [5]: The genomic start coordinate of the gene.
- __GENOME_STOP__ [6]: The genomic end coordinate of the gene.
- __CHR_BAND__ [7]: Chromosome (1-22, X, Y or MT), arm (p or q) and cytogenetic band.
- __SOMATIC__ [8]: Somatic mutations have been detected (y/n).
- __GERMLINE__ [9]: Germline mutations have been detected (y/n).
- __TUMOUR_TYPES_SOMATIC__ [10]: Somatic mutations in the gene are associated with the following diseases (see abbreviations tab for details: https://cancer.sanger.ac.uk/cosmic/help/census#abbrev).
- __TUMOUR_TYPES_GERMLINE__ [11]: Germline mutations in the gene are associated with the following diseases (see abbreviations tab for details: https://cancer.sanger.ac.uk/cosmic/help/census#abbrev).
- __CANCER_SYNDROME__ [12]: Syndrome associated with germline mutation.
- __TISSUE_TYPE__ [13]: Type of tissue, see abbreviations tab for details: https://cancer.sanger.ac.uk/cosmic/help/census#abbrev.
- __MOLECULAR_GENETICS__ [14]: See abbreviations tab for details: https://cancer.sanger.ac.uk/cosmic/help/census#abbrev.
- __ROLE_IN_CANCER__ [15]: Role in Cancer: oncogene: hyperactivity of the gene drives the transformation; TSG: loss of gene function drives the transformation. Some genes can play either of these roles depending on cancer type. Fusion: the gene is known to be involved in oncogenic fusions.
- __MUTATION_TYPES__ [16]: Types of mutation: See abbreviations tab for details: https://cancer.sanger.ac.uk/cosmic/help/census#abbrev.
- __TRANSLOCATION_PARTNER__ [17]: Gene symbol of fusion partner.
- __OTHER_GERMLINE_MUT__ [18]: Other germline mutations not implicated in cancer.
- __OTHER_SYNDROME__ [19]: Other non-cancerous syndrome.
- __TIER__ [20]: Indicates to which tier of the Cancer Gene Census the gene belongs (1 or 2).
- __SYNONYMS__ [21]: Gene alternative names.
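
Several of these columns (`ROLE_IN_CANCER`, `MUTATION_TYPES`, `TRANSLOCATION_PARTNER`) hold comma-separated lists, so exact string comparison misses rows such as `"oncogene, fusion"`. The sketch below shows one way to split such a column before filtering. The small frame is a stand-in built from the five rows displayed above, and `split_multi` is a hypothetical helper, not part of pandas or COSMIC.

```python
import pandas as pd

# Stand-in for cosmic_cgc, using values from the rows shown above.
sample = pd.DataFrame({
    "GENE_SYMBOL": ["A1CF", "ABI1", "ABL1", "ABL2", "ACKR3"],
    "ROLE_IN_CANCER": ["oncogene", "TSG, fusion", "oncogene, fusion",
                       "oncogene, fusion", "oncogene, fusion"],
    "TIER": [2, 1, 1, 1, 1],
})


def split_multi(value):
    """Split a comma-separated cell into a list; missing values give []."""
    if pd.isna(value):
        return []
    return [item.strip() for item in str(value).split(",")]


# One list of roles per gene, then keep Tier 1 genes annotated as oncogenes.
sample["ROLES"] = sample["ROLE_IN_CANCER"].apply(split_multi)
is_oncogene = sample["ROLES"].apply(lambda roles: "oncogene" in roles)
tier1_oncogenes = sample.loc[is_oncogene & (sample["TIER"] == 1), "GENE_SYMBOL"].tolist()
print(tier1_oncogenes)  # ['ABL1', 'ABL2', 'ACKR3']
```

The same pattern should apply to the full `cosmic_cgc` frame, assuming the columns keep the comma-separated format seen in the preview.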

In [5]:
# List of cancer related genes
# https://cancer.sanger.ac.uk/census
import pandas as pd

cgc = pd.read_csv('../data/Cancer_Gene_Census_COSMIC.csv')
cgc.head()

Unnamed: 0,Gene Symbol,Name,Entrez GeneId,Genome Location,Tier,Hallmark,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
0,A1CF,APOBEC1 complementation factor,29974.0,10:50799421-50885675,2,,11.23,yes,,melanoma,,,E,,oncogene,Mis,,,,"ACF,ACF64,ACF65,APOBEC1CF,ASP,CCDS73133.1,ENSG..."
1,ABI1,abl-interactor 1,10006.0,10:26746593-26860935,1,Yes,12.1,yes,,AML,,,L,Dom,"TSG, fusion",T,KMT2A,,,"ABI-1,CCDS7150.1,E3B1,ENSG00000136754.17,NM_00..."
2,ABL1,v-abl Abelson murine leukemia viral oncogene h...,25.0,9:130713946-130885683,1,Yes,34.12,yes,,"CML, ALL, T-ALL",,,L,Dom,"oncogene, fusion","T, Mis","BCR, ETV6, NUP214",,,"ABL,CCDS35165.1,ENSG00000097007.17,JTK7,NM_007..."
3,ABL2,"c-abl oncogene 2, non-receptor tyrosine kinase",27.0,1:179099327-179229601,1,,25.2,yes,,AML,,,L,Dom,"oncogene, fusion",T,ETV6,,,"ABLL,ARG,CCDS30947.1,ENSG00000143322.19,NM_007..."
4,ACKR3,atypical chemokine receptor 3,57007.0,2:236569641-236582358,1,Yes,37.3,yes,,lipoma,,,M,Dom,"oncogene, fusion",T,HMGA2,,,"CCDS2516.1,CMKOR1,CXCR7,ENSG00000144476.5,GPR1..."


In [2]:
for key in cgc.columns:
    print(key)

Gene Symbol
Name
Entrez GeneId
Genome Location
Tier
Hallmark
Chr Band
Somatic
Germline
Tumour Types(Somatic)
Tumour Types(Germline)
Cancer Syndrome
Tissue Type
Molecular Genetics
Role in Cancer
Mutation Types
Translocation Partner
Other Germline Mut
Other Syndrome
Synonyms


Variables:
- Gene Symbol: Mostly HGNC identifier
- Name: Gene descriptive name
- Entrez GeneId: NCBI Entrez Gene identifier
- Genome Location
- Tier
- Hallmark
- Chr Band
- Somatic
- Germline
- Tumour Types(Somatic)
- Tumour Types(Germline)
- Cancer Syndrome
- Tissue Type
- Molecular Genetics
- Role in Cancer
- Mutation Types
- Translocation Partner
- Other Germline Mut
- Other Syndrome
- Synonyms
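
The older CSV export stores the position in a single `Genome Location` column (`chromosome:start-stop`), whereas the v99 TSV uses separate `CHROMOSOME`, `GENOME_START` and `GENOME_STOP` columns. A minimal sketch of how the old column could be split to match the newer layout is shown below. The sample frame is a stand-in built from the rows displayed above, and the regular expression assumes every value follows the `chromosome:start-stop` pattern; values that do not match become missing.

```python
import pandas as pd

# Stand-in for cgc, using values from the first rows shown above.
old = pd.DataFrame({
    "Gene Symbol": ["A1CF", "ABI1", "ABL1"],
    "Genome Location": ["10:50799421-50885675",
                        "10:26746593-26860935",
                        "9:130713946-130885683"],
})

# Named groups become the column names of the extracted frame.
pattern = r"^(?P<CHROMOSOME>[^:]+):(?P<GENOME_START>\d+)-(?P<GENOME_STOP>\d+)$"
parts = old["Genome Location"].str.extract(pattern)

# Coordinates are extracted as strings; convert them to numbers.
for column in ("GENOME_START", "GENOME_STOP"):
    parts[column] = pd.to_numeric(parts[column], errors="coerce")

old = old.rename(columns={"Gene Symbol": "GENE_SYMBOL"}).join(parts)
print(old[["GENE_SYMBOL", "CHROMOSOME", "GENOME_START", "GENOME_STOP"]])
```

The chromosome is kept as a string because X, Y and MT are valid values.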

In [7]:
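# Tier 1 genes with a hallmark annotation.
# This result is not displayed: a cell only shows its last expression.
cgc.loc[(cgc['Hallmark'] == 'Yes') & (cgc['Tier'] == 1)]
# Number of genes per value of the Somatic column
cgc['Somatic'].value_counts()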

yes    695
Name: Somatic, dtype: int64

In [19]:
# Top genes from hits analysis
top_genes = ['TTN-AS1', 'MUC16', 'TTN', 'CASC8', 'LINC00824', 'TP53', 'USH2A', 'IGH',
             'POU5F1B', 'CCAT1', 'PCDHB@', 'CCAT2', 'RYR2', 'PCDHG@', 'PVT1', 'CASC21',
             'MYHAS', 'CASC19', 'IGK', 'PCDHA@', 'SGCZ', 'CASC11', 'CSMD3']

top_cgc = cgc.loc[cgc['Gene Symbol'].isin(top_genes)]

top_cgc = top_cgc[['Gene Symbol',
                   'Name',
                   'Tier',
                   'Hallmark',
                   'Somatic',
                   'Germline',
                   'Tumour Types(Somatic)',
                   'Tumour Types(Germline)',
                   'Cancer Syndrome',
                   'Molecular Genetics',
                   'Role in Cancer',
                   'Mutation Types',
                   'Translocation Partner']]

table = top_cgc.set_index('Gene Symbol').transpose()
table.to_clipboard()
table

Gene Symbol,CSMD3,IGH,IGK,MUC16,TP53
Name,CUB and Sushi multiple domains 3,immunoglobulin heavy locus,immunoglobulin kappa locus,"mucin 16, cell surface associated",tumor protein p53
Tier,2,1,1,2,1
Hallmark,,Yes,,,Yes
Somatic,yes,yes,yes,yes,yes
Germline,,,,,yes
Tumour Types(Somatic),"ovarian cancer, oral SCC, lung cancer","MM, Burkitt lymphoma, NHL, CLL, B-ALL, MALT, M...","Burkitt lymphoma, B-NHL","HNSCC, melanoma","breast, colorectal, lung, sarcoma, adrenocorti..."
Tumour Types(Germline),,,,,"breast, sarcoma, adrenocortical carcinoma, gli..."
Cancer Syndrome,,,,,Li-Fraumeni syndrome
Molecular Genetics,,Dom,Dom,,Rec
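
Only five of the 23 top genes appear as columns in the table above, because `isin` silently drops symbols that are absent from the census. A short sketch for listing both groups explicitly follows. The `cgc_symbols` set is a stand-in for `set(cgc['Gene Symbol'])` and contains only symbols that appear in the outputs shown in this notebook.

```python
# Top genes from the hits analysis (same list as in the cell above).
top_genes = ['TTN-AS1', 'MUC16', 'TTN', 'CASC8', 'LINC00824', 'TP53', 'USH2A', 'IGH',
             'POU5F1B', 'CCAT1', 'PCDHB@', 'CCAT2', 'RYR2', 'PCDHG@', 'PVT1', 'CASC21',
             'MYHAS', 'CASC19', 'IGK', 'PCDHA@', 'SGCZ', 'CASC11', 'CSMD3']

# Stand-in for set(cgc['Gene Symbol']); the real census has many more entries.
cgc_symbols = {'A1CF', 'ABI1', 'ABL1', 'ABL2', 'ACKR3',
               'CSMD3', 'IGH', 'IGK', 'MUC16', 'TP53'}

# Preserve the ranking order of top_genes in both lists.
in_cgc = [gene for gene in top_genes if gene in cgc_symbols]
not_in_cgc = [gene for gene in top_genes if gene not in cgc_symbols]

print(in_cgc)           # ['MUC16', 'TP53', 'IGH', 'IGK', 'CSMD3']
print(len(not_in_cgc))  # 18
```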


### Abbreviations
``````
ABBREVIATION	TERM
A	amplification
aCML	atypical chronic myeloid leukaemia
AEL	acute eosinophilic leukaemia
AITL	angioimmunoblastic T cell lymphoma
AL	acute leukaemia
ALCL	anaplastic large-cell lymphoma
ALL	acute lymphocytic leukaemia
AML	acute myeloid leukaemia
AML*	acute myeloid leukaemia (primarily treatment associated)
APL	acute promyelocytic leukaemia
B-ALL	B-cell acute lymphocytic leukaemia
B-CLL	B-cell lymphocytic leukaemia
B-NHL	B-cell non-Hodgkin lymphoma
CLL	chronic lymphocytic leukaemia
CML	chronic myeloid leukaemia
CMML	chronic myelomonocytic leukaemia
CNL	chronic neutrophilic leukaemia
CNS	central nervous system
D	large deletion
DFSP	dermatofibrosarcoma protuberans
DGC	diffuse-type gastric carcinoma
DIPG	diffuse intrinsic pontine glioma
DLBCL	diffuse large B-cell lymphoma
DLCL	diffuse large-cell lymphoma
Dom	dominant
E	epithelial
ETP-ALL	early T-cell precursor acute lymphoblastic leukaemia
F	frameshift
GBM	glioblastoma multiforme
GIST	gastrointestinal stromal tumour
HES	hypereosinophilic syndrome
HNSCC	head and neck squamous cell carcinoma
JMML	juvenile myelomonocytic leukaemia
L	leukaemia/lymphoma
M	mesenchymal
MALT	mucosa-associated lymphoid tissue lymphoma
MCL	mantle cell lymphoma
MDS	myelodysplastic syndrome
Mis	missense
MLCLS	mediastinal large cell lymphoma with sclerosis
MM	multiple myeloma
MPN	myeloproliferative neoplasm
N	nonsense
NHL	non-Hodgkin lymphoma
NK/T	natural killer T cell
NSCLC	non small cell lung cancer
O	other
PMBL	primary mediastinal B-cell lymphoma
pre-B ALL	pre-B-cell acute lymphoblastic leukaemia
RCC	renal cell carcinoma
Rec	recessive
S	splice site
sAML	secondary acute myeloid leukaemia
SCC	squamous cell carcinoma
SCCOHT	small cell carcinoma of the ovary, hypercalcaemic type
SM-AHD	systemic mastocytosis associated with other haematological disorder
SMZL	splenic marginal zone lymphoma
T	translocation
T-ALL	T-cell acute lymphoblastic leukaemia
T-CLL	T-cell chronic lymphocytic leukaemia
T-PLL	T-cell prolymphocytic leukaemia
TGCT	testicular germ cell tumour
TSG	Tumour Suppressor Gene
WM	Waldenstrom's macroglobulinaemia
``````
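
The abbreviation table can be applied to columns such as `Tumour Types(Somatic)` to make them readable. Below is a minimal sketch, assuming the terms in a cell are separated by commas. The dictionary holds only a handful of entries copied from the table above, and `expand_terms` is a hypothetical helper. Terms without an entry, such as `Burkitt lymphoma`, are left unchanged.

```python
# A few entries copied from the abbreviation table above.
ABBREVIATIONS = {
    "ALL": "acute lymphocytic leukaemia",
    "AML": "acute myeloid leukaemia",
    "B-NHL": "B-cell non-Hodgkin lymphoma",
    "CML": "chronic myeloid leukaemia",
    "T-ALL": "T-cell acute lymphoblastic leukaemia",
}


def expand_terms(value, table=ABBREVIATIONS):
    """Replace each comma-separated abbreviation with its full term."""
    terms = [term.strip() for term in value.split(",")]
    return ", ".join(table.get(term, term) for term in terms)


print(expand_terms("CML, ALL, T-ALL"))
print(expand_terms("Burkitt lymphoma, B-NHL"))
```

Matching whole terms rather than substrings avoids expanding `ALL` inside `T-ALL`.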