# **Description**
This file describes the workflow to identify regions of constraint in tumor genomes. Specifically, I seek to identify intervals of coding sequences in tumor genomes signficantly devoid of protein-altering mutations. This analysis will be done at a pan-cancer and individual cancer type analysis. 

We will use consensus variant files from **TCGA (whole exome sequencing)** and **ICGC (whole genome sequencing)** working groups. See below for more information on the working groups and their manuscripts: 


1.   [Scalable Open Science Approach for Mutation Calling 
of Tumor Exomes Using Multiple Genomic Pipelines](https://pubmed.ncbi.nlm.nih.gov/29596782/)

    * Use publically available MAF file: [mc3.v0.2.8.PUBLIC.maf.gz](https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc)
    * Coverage information:  
    * Clinical information: Download a tsv file containing TCGA clinical data from the [NCI GDC Data Portal](https://portal.gdc.cancer.gov/repository?searchTableTab=cases) by clicking on the **"Clinical"** button. 
    ```bash
    ## Decompress TCGA clinical information file:
    cd ~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/clinical
    tar -zxvf ~/path/to/downloaded/file > mc3_mapping.tsv
    ```
2.   [Pan-cancer analysis of whole genomes](https://www.nature.com/articles/s41586-020-1969-6) 
    * Use publically available MAF file: [final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz](https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_snv_indel/final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz)
    * Use publically available wig files for coverage information: [coverage_wig_files.tar](https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_snv_indel/wig_files/coverage_wig_files.tar)
    * Clinical information: Click the **"Download Donor Data"** button at [ICGC data portal](https://dcc.icgc.org/search?filters=%7B%22donor%22:%7B%22id%22:%7B%22is%22:%5B%22ES:12b6fcab-467d-4649-8177-0aa41f44d77c%22%5D%7D%7D%7D). Use the **submitted_donor_id** and **project_code** columns.  
    ```bash
    ## Decompress ICGC clinical information file:
    cd ~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/clinical
    gunzip ~/Path/to/sample.tsv.gz > pcawg_mapping.tsv
```

# **Basic Outline of Step #1**

## **Step 1: Format, Filter, and Combine TCGA and ICGC Variants**
### ***Step 1a: Obtain variant and coverage files from working groups***
#### Step 1a.1: Get variant information from TCGA and ICGC 
#### Step 1a.2: Map clinical information (i.e. cancer type) to sample ids from both working group datasets

#### **Generate consensus cancer type mapping file** 
* **GOAL:** Generate a consensus file to map ICGC's 'project_code' to MC3's 'project_id' to the same disease 
* Use [TCGA study abbreviations](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations) and [ICGC study abbreviations](https://docs.icgc.org/submission/projects/) to obtain consensus TCGA and ICGC cancer type name. 
* The consensus cancer type name will be indicated in the **'Simplified Name'** column...

| TCGA Study Code | ICGC Project Code | ICGC Cancer Type | TCGA Cancer Type |Simplified Name | TCGA? | ICGC? |
| --------------- | ----------------- | ---------------- | ---------------- | -------------- | ----- | ----- |
| TCGA-BRCA | BRCA-CN | Breast Triple Negative Cancer |  Breast invasive carcinoma |Breast invasive carcinoma | 1 | 1 |
| TCGA-BRCA | BRCA-EU | Breast ER+ and HER2 | Breast invasive carcinoma | Breast invasive carcinoma | 1 | 1 |
| TCGA-BRCA | BRCA-FR | Breast Cancer | Breast invasive carcinoma | Breast invasive carcinoma | 1 | 1 |
| TCGA-BRCA | BRCA-KR | Breast Cancer | Breast invasive carcinoma | Breast invasive carcinoma | 1 | 1 |
| TCGA-BRCA | BRCA-UK | Breast Triple Negative/Lobular Cancer | Breast invasive carcinoma | Breast invasive carcinoma | 1 | 1 |

* **Some cancer type names are only found in ICGC --> see below for a few examples:**

| TCGA Study Code | ICGC Project Code | ICGC Cancer Type | TCGA Cancer Type |Simplified Name | TCGA? | ICGC? |
| --------------- | ----------------- | ---------------- | ---------------- | -------------- | ----- | ----- |
| NA | ALL-US | Acute Lymphoblastic Lymphoma | NA | Acute Lymphoblastic Lymphoma | 0 | 1 |
| NA | LIAD-US | Benign Liver Tumor | NA | Benign Liver Tumor | 0 | 1 |
| NA | CCSK-US | Clear Cell Sarcomas of the Kidney | NA | Clear Cell Sarcomas of the Kidney | 0 | 1 |
| NA | PEME-CA | Pediatric Medulloblastoma | NA | Pediatric Medulloblastoma | 0 | 1 |
| NA | RT-US | Rhabdoid Tumor | NA | Rhabdoid Tumor | 0 | 1 |
| NA | SKCA-BR | Skin Adenocarcinoma | NA | Skin Adenocarcinoma | 0 | 1 |
| NA | LMS-FR | Soft tissue cancer | NA | Soft tissue cancer | 0 | 1 |
| NA | WT-US | Wilms Tumor | NA | Wilms Tumor | 0 | 1 |

* **Some cancer type names are only found in TCGA --> see below for a few examples:**

| TCGA Study Code | ICGC Project Code | ICGC Cancer Type | TCGA Cancer Type |Simplified Name | TCGA? | ICGC? |
| --------------- | ----------------- | ---------------- | ---------------- | -------------- | ----- | ----- |
| TCGA-UVM | NA | NA | Uveal Melanoma | Uveal Melanoma | 1 | 0 |
| TCGA-MESO | NA | NA | Mesothelioma | Mesothelioma | 1 | 0 |
| TCGA-THYM | NA | NA | Thymoma | Thymoma | 1 | 0 |

#### Step 1a.3 Filter variants based on coverage parameters

### ***Step 1b: Remove variants that overlap with regions of segmental duplications and/or self-chains***

*   **NOTE:** This step removes the entire variant, even if part of the variant (in the case of non-MNVs) does not overlap with segmental duplications and/or self-chains
*   **Note:** This is using the UCSC annotation files for genome build hg19 (NOT hg38)

#### Step 1b.1: Concatenate variants from MC3 and PCAWG MAF files

#### Step 1b.2: Select for specific variant classes (exonic and/or non-intronic)

#### Step 1b.3: Use bedtools to remove variants that overlap with regions of segdups and self-chains (SEE BELOW)

#### Step 1b.4 Use bedtools to select for variants with CDS regions of genes

#### Step 1b.5: Use biomaRt to filter for genes with known CDS lengths (using hg19) 

### ***Step 1c: Visualize variant data post-filtering*** 
#### Step 1c.1: Total number of mutations across each variant class
#### Step 1c.2: Total number of samples per cancer type
#### Step 1c.3: Number of mutations per cancer type given a variant_class
#### Step 1c.4: Number of mutations per gene per cancer type

## ***Step 1a.1 Read in TCGA MAF File***
---

In [1]:
import pandas as pd

## Define columns to include
fields = ['Hugo_Symbol', 'Chromosome', 'Start_Position', 'End_Position', 'Strand', 
            'Variant_Classification', 'Variant_Type', 'Tumor_Sample_Barcode',
            'cDNA_position', 'CDS_position', 'HGVSc', 'HGVSp_Short', 'Transcript_ID', 
            'Exon_Number', 't_depth', 't_ref_count', 't_alt_count', 'n_depth', 'n_ref_count', 'n_alt_count', 
            'all_effects', 'DOMAINS', 'IMPACT']

mc3_df = pd.read_csv("https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc", 
                     sep="\t", compression="gzip", header=0)

  interactivity=interactivity, compiler=compiler, result=result)


## ***Step 1a.2 Map Clinical Information to TCGA variants***
---

In [3]:
## Define columns to read in
fields = ['case_submitter_id', 'project_id']

## Read in the MC3 clinical data
clinical_mc3 = pd.read_csv("~/git/somccr/data/clinical/mc3_mapping.tsv", sep="\t", usecols=fields)

## Obtain dataset with the TCGA barcode and cancer type
clinical_mc3 = clinical_mc3.drop_duplicates()
clinical_mc3.columns = ['Tumor_Sample_Barcode_split', 'TCGA_Project_Code']

## Split the Tumor_Sample_Barcode column so that they match the barcodes in the 'clinical' dataset
barcodes = mc3_df['Tumor_Sample_Barcode']
barcodes_split = barcodes.str.rsplit(pat="-", n=4, expand=True)[0]

## Add the split TCGA barcodes to the mc3 variant dataset
mc3_df['Tumor_Sample_Barcode_split'] = barcodes_split

## Map the barcodes to the cancer type 
mc3_df = pd.merge(mc3_df, clinical_mc3, on = "Tumor_Sample_Barcode_split")
mc3_df.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Variant_Classification,Variant_Type,...,ExAC_AF_SAS,GENE_PHENO,FILTER,COSMIC,CENTERS,CONTEXT,DBVS,NCALLERS,Tumor_Sample_Barcode_split,TCGA_Project_Code
0,TACC2,0,.,GRCh37,10,123810032,123810032,+,Missense_Mutation,SNP,...,.,.,PASS,SITE|p.T38M|c.113C>T|3,MUTECT|RADIA|SOMATICSNIPER|MUSE|VARSCANS,GGACACGCCCG,by1000G,5,TCGA-02-0003,TCGA-GBM
1,JAKMIP3,0,.,GRCh37,10,133967449,133967449,+,Silent,SNP,...,.,.,PASS,NONE,MUTECT|RADIA|SOMATICSNIPER|MUSE|VARSCANS,CTGGACGAGGA,byFrequency,5,TCGA-02-0003,TCGA-GBM
2,PANX3,0,.,GRCh37,11,124489539,124489539,+,Missense_Mutation,SNP,...,.,.,PASS,SITE|p.R296Q|c.887G>A|3,MUTECT|RADIA|SOMATICSNIPER|MUSE|VARSCANS,ATGTCGGTGGG,.,5,TCGA-02-0003,TCGA-GBM
3,SPI1,0,.,GRCh37,11,47380512,47380512,+,Missense_Mutation,SNP,...,.,.,PASS,NONE,RADIA|MUSE,GGCTGGGGACA,.,2,TCGA-02-0003,TCGA-GBM
4,NAALAD2,0,.,GRCh37,11,89868837,89868837,+,Missense_Mutation,SNP,...,.,.,PASS,SITE|p.R65C|c.193C>T|4,MUTECT|RADIA|SOMATICSNIPER|MUSE|VARSCANS,TTCTTCGGTAA,.,5,TCGA-02-0003,TCGA-GBM


## ***Step 1a.1 Read in ICGC MAF File***
---

In [None]:
import pandas as pd

## Define columns to include
fields = ['Hugo_Symbol', 'Chromosome', 'Start_position', 'End_position', 'Strand', 
          'Variant_Classification', 'Variant_Type', 'Tumor_Sample_Barcode', 'Donor_ID']

## Read in the ICGC MAF file
pcawg_df = pd.read_csv("https://dcc.icgc.org/api/v1/download?fn=/PCAWG/consensus_snv_indel/final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz",
                      compression='gzip', sep="\t")

## Rename columns Start/End_position to Start/End_Position
pcawg_df.rename(columns={'Start_position':'Start_Position', 'End_position':'End_Position'}, inplace=True)

## ***Step 1a.2 Map Clinical Information to ICGC Variants***
---

In [None]:
## Define columns to read in
fields = ['project_code', 'icgc_donor_id']

## Read in the MC3 clinical data
clinical_pcawg = pd.read_csv("~/git/somccr/data/clinical/pcawg_mapping.tsv", 
                             sep="\t", usecols=fields)

## Obtain dataset with the TCGA barcode and cancer type
clinical_pcawg = clinical_pcawg.drop_duplicates()
clinical_pcawg.columns = ['ICGC_Project_Code', 'Donor_ID']

# ## Remove the country designation in the cancer type code
# cancer_type = clinical_pcawg['cancer_type_pcawg']
# cancer_type_split = cancer_type.str.rsplit(pat="-", n=2, expand=True)[0]

# ## Add the split ICGC barcodes to the pcawg clinical dataset
# clinical_pcawg['cancer_type_pcawg'] = cancer_type_split

## Map the barcodes to the cancer type 
pcawg_df = pd.merge(pcawg_df, clinical_pcawg, on = "Donor_ID")
pcawg_df.head()

## ***Step 1a.2 Map TCGA and ICGC cancer codes to a consensus disease name***
---

In [None]:
## Read in the clinical information mapping file for icgc and tcga
file = "~/git/somccr/data/clinical/map_icgc_tcga_cancer_codes.xlsx"

## Only select for columns of interest
tcga = ['TCGA_Project_Code', 'Cancer_Type_Simplified']
icgc = ['ICGC_Project_Code', 'Cancer_Type_Simplified']
map_disease_tcga = pd.read_excel(file, 'master', usecols=tcga)
map_disease_icgc = pd.read_excel(file, 'master', usecols=icgc)

## Map the project code to the cancer type name
mc3_df = pd.merge(mc3_df, map_disease_tcga, on="TCGA_Project_Code")
pcawg_df = pd.merge(pcawg_df, map_disease_icgc, on="ICGC_Project_Code")

## Change the column name of the column used to map project codes to TCGA and ICGC samples --> used to plot # of samples for each cancer type
mc3_df.rename(columns={'Tumor_Sample_Barcode':'Sample_ID'}, inplace=True)
pcawg_df.rename(columns={'Donor_ID':'Sample_ID'}, inplace=True)

## ***Step 1b.1 Concatenate TCGA and ICGC variants***
---

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

## Get bare minimum columns for mc3 and pcawg variant files 
fields = ['Chromosome', 'Start_Position', 'End_Position', 'Hugo_Symbol', 
          'Variant_Classification', 'Cancer_Type_Simplified', 'Sample_ID', 
          'Reference_Allele', 'Tumor_Seq_Allele2']
mc3 = mc3_df[fields]
pcawg = pcawg_df[fields]

## Specify which study the variants came from 
mc3['study'] = "mc3_wes"
pcawg['study'] = "pcawg_wgs"

## Concatenate mc3 and pcawg variants 
maf = pd.concat([mc3, pcawg])

## Convert start_position to 0-based positions
maf.loc[:, 'Start_Position'] = maf['Start_Position'].apply(lambda x: x - 1)

## ***Step 1b.2 Select for specific variant classes***
---

In [None]:
## Remove non-cds/exonic variants 
variants_to_keep = ['Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins', 'Missense_Mutation',
                   'Nonsense_Mutation', 'Nonstop_Mutation', 'RNA', 'Silent', 'Splice_Site', 'Translation_Start_Site',
                   'Targeted_Region', 'Start_Codon_Del', 'Start_Codon_Ins', 'Start_Codon_SNP', 'Stop_Codon_Del', 
                   'Stop_Codon_Ins', 'De_novo_Start_InFrame', 'De_novo_Start_OutOfFrame']
keep_variants = maf.loc[maf["Variant_Classification"].isin(variants_to_keep)]

## Add "chr" to the chromosome name 
keep_variants['Chromosome'] = "chr" + keep_variants['Chromosome'].astype(str)

## Write out the concatenated, filtered MAF file
keep_variants.to_csv("~/git/somccr/data/output/somccr_v3/mc3_pcawg.bed", 
                     sep="\t", header=False, index=False)

## Show dataset
keep_variants.head()

## ***Step 1b.3 Use bedtools to remove variants that overlap with regions of segdups and self-chains***
---
**NOTE:** See [bedtools intersect function with the -v option](http://quinlanlab.org/tutorials/bedtools/bedtools.html) for more information on how this step is performed. Briefly, this function will identify variants from MC3 and PCAWG that DO NOT fall within intervals in the provided segmental-duplication/self-chain file. As a reminder, this latter file is sorted and overlapping intervals are merged. 

In [None]:
%%bash
## cd into working directory 
cd ~/git/somccr/data/output/somccr_v3/

## Pre-sort the variant file 
cat mc3_pcawg.bed | sort -k1,1 -k2,2n > mc3_pcawg.sorted.bed

## Use bedtools intersect with -v option to get variants that DO NOT overlap with the regions of interest
bedtools intersect -a mc3_pcawg.sorted.bed \
-b ~/Google\ Drive/Quinlan\ Lab\ -\ PhD/Projects/somatic_ccr/data/reference/segdup_selfchain/merged_segdups_selfchain.hg19.txt -v \
> mc3_pcawg.sorted.filtered.bed

## ***Step 1b.4 Use bedtools to keep variants within CDS regions of genes***
---
**NOTE:** See [bedtools intersect](http://quinlanlab.org/tutorials/bedtools/bedtools.html) for more information on how the intersect step is performed. This step will select for variants from TCGA and ICGA that fall within CDS coordinates. 

**NOTE:** Download coordiantes of CDS regions of known genes from UCSC Genome Table Browser 
* Sort
* Merge

***NOTE:*** Only focusing on single nucleotide variants for now --> will simply k-mer calculation

In [None]:
from IPython.display import Image
from IPython.core.display import HTML 
file1 = '/Users/jasonkunisaki/git/somccr/img/table_browser_query.png'
file2 = '/Users/jasonkunisaki/git/somccr/img/get_output_query.png'
#display(Image(filename=file1), Image(filename=file2))

In [None]:
%%bash
## set working directory
cd ~/Google\ Drive/Quinlan\ Lab\ -\ PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3

cat mc3_pcawg.sorted.filtered.bed | \
awk '{if ($3-$2==1) print $0}' > mc3_pcawg.sorted.filtered.snv.bed

bedtools intersect -a mc3_pcawg.sorted.filtered.snv.bed \
-b ~/Google\ Drive/Quinlan\ Lab\ -\ PhD/Projects/somatic_ccr/data/reference/cds/cds_ucsc_gene_coord_sorted_merged.bed \
> mc3_pcawg.sorted.filtered.snv.CDS.bed

## ***Step 1b.4 Use biomaRt to get genes with known CDS lengths***
---

In [None]:
## Allow for R programming; must use %%R with the `-i [df to import from global environment]` option to use in an R cell
%load_ext rpy2.ipython

In [None]:
%%R
## Define input and output filenames
file <- "~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.snv.CDS.bed"
output_file_name <- "~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/gene_cds_length.txt"

## Source in the code
source("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/script/get-length.R")
get_length(file=file, output_file_name=output_file_name)

In [None]:
## Read in the sorted and filtered variant file
temp = pd.read_csv("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.snv.CDS.bed", 
                   sep="\t", header=None)
temp.columns = ['chromosome', 'start', 'stop', 'gene', 'variant_class', 'cancer_type', 'sample_id', 'ref', 'alt', 'working_group']

## Get the genes with known CDS lengths from the R script above --> read in as pandas dataframe
cds_gene_df = pd.read_csv("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/gene_cds_length.txt", sep="\t")

## Get unique list of cds genes
cds_genes = cds_gene_df['gene'].tolist()
cds_genes = list(set(cds_genes))

## Get rows from the combined MAF dataset with cds genes
cds_maf = temp[temp["gene"].isin(cds_genes)]

## Write the cds MAF dataset
cds_maf.to_csv("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.snv.CDS.knownCDSlength.bed", sep="\t", index=False)

## Show dataset
cds_maf.head()

## ***Step 1c.1 Plot total number of mutations across each variant class***
---

In [2]:
## Allow for R programming; must use %%R with the `-i [df to import from global environment]` option to use in an R cell
%load_ext rpy2.ipython

In [None]:
%%R -w 15 -h 10 -u in
## Set working directory
setwd("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/script/plots/")

## Define input variant file
input_file <- "~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.CDS.bed"

source("p1_variants_per_variant_class.R")
p1 <- plot1(file = input_file)
print(p1[[1]])

## To see the raw data used for the plot --> run: print(p1[[2]])

## ***Step 1c.2 Plot total number of samples per cancer type***
---

In [None]:
%%R -w 15 -h 10 -u in
source("p3_samples_per_cancer_type.R")
p3 <- plot3(file = input_file, log_trans = FALSE)
print(p3[[1]])

## To see the raw data used for the plot --> run: print(p3[[2]])

## ***Step 1c.3 Plot Total number of mutations per cancer type given a variant_class***
---

In [None]:
%%R -w 15 -h 10 -u in
source("p2_variants_per_cancer_type.R")
p2 <- plot2(file = input_file)
print(p2[[1]])

## To see the raw data used for the plot --> run: print(p2[[2]])

## ***Step 1c.4 Plot total number of mutations per gene per cancer type***
---

In [None]:
%%R -w 20 -h 15 -u in
source("p4_mutations_per_gene_per_cancer_type.R")
p4 <- plot4(file = input_file)
print(p4[[1]])

## To see the raw data used for the plot --> run: print(p4[[2]])

# Next Items

* Visualize data in IGV to make sure variants are not in segdups/self-chains

* Make lolliplots for each gene with tracks for S and non-S mutations --> **plot S and non-S mutation density**

* Figure out how to get expected numbers of mutations in an interval

# **Basic Outline of Step #2**
## **Step 2: Calculate k-mer frequencies**
* Reference this [biostars](https://www.biostars.org/p/461455/) post for more information on how this step was completed.

Reference k-mer | Mutated k-mer | Final Label 
:-: | :-: | :-:
ATC | AAC | ATC > AAC
ATC | ACC | ATC > ACC
ATC | AGG | ATC > AGC


In [45]:
%%R
## Read in function
source("~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/script/somCCR - v3 Find Sig CCRs/get_kmer_seq.R")

## Define input variables
file <- "~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.snv.CDS.knownCDSlength.bed"
output_file <- "~/Google Drive/Quinlan Lab - PhD/Projects/somatic_ccr/data/output/mc3_pcawg_v3/mc3_pcawg.sorted.filtered.snv.CDS.knownCDSlength.kmer.bed"
k <- 3

df <- get_kmer_freq(file = file, k = k)
final <- map_kmer_freq(df = df)
head(final)

## Write the final table
#write.table(x = final, file = output_file, quote = FALSE, sep = "\t", row.names = FALSE)

   chromosome  start   stop   gene          variant_class
1:       chr1 861321 861322 SAMD11 Translation_Start_Site
2:       chr1 861335 861336 SAMD11                 Silent
3:       chr1 861335 861336 SAMD11                 Silent
4:       chr1 861341 861342 SAMD11                 Silent
5:       chr1 861341 861342 SAMD11                 Silent
6:       chr1 861348 861349 SAMD11      Missense_Mutation
                             cancer_type                    sample_id ref alt
1:  Uterine Corpus Endometrial Carcinoma TCGA-AX-A2IN-01A-12D-A17W-09   A   G
2:               Skin Cutaneous Melanoma TCGA-DA-A1HY-06A-11D-A19A-08   C   T
3:               Skin Cutaneous Melanoma TCGA-DA-A1HY-06A-11D-A19A-08   C   T
4:               Skin Cutaneous Melanoma TCGA-W3-AA1V-06B-11D-A401-08   G   A
5:               Skin Cutaneous Melanoma TCGA-W3-AA1V-06B-11D-A401-08   G   A
6: Head and Neck squamous cell carcinoma TCGA-CR-6477-01A-11D-1870-08   C   A
   working_group ref_kmer alt_kmer   kmer_id fre