## Process MAF file from MC3 to get sample-specific information

We want to know whether samples with mutations in tumor suppressors carry monoallelic or biallelic knockouts. Our mutation data come from [the MC3 project](https://gdc.cancer.gov/about-data/publications/mc3-2017), which aggregates mutation calls for TCGA samples from several calling algorithms and scoring approaches.

In [1]:
import os
import gzip
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

In [2]:
mc3_filename = Path('./data', 'mc3.v0.2.8.PUBLIC.maf.gz')

mutated_samples_dir = Path('./data', 'mutated_samples')
mutated_samples_dir.mkdir(exist_ok=True)

# gene to get/save mutation info for
gene = 'ARID1A'

In [3]:
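# keep the handle open: the next cell continues reading records from it
mc3 = gzip.open(mc3_filename, "rb")
# MAF is tab-delimited, so split the header on tabs, the same way the records are split
maf_header = mc3.readline().decode('UTF-8').rstrip('\r\n').split('\t')
maf_ixs = {name: ix for ix, name in enumerate(maf_header)}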

print(pd.Series(maf_header).head(10))

0               Hugo_Symbol
1            Entrez_Gene_Id
2                    Center
3                NCBI_Build
4                Chromosome
5            Start_Position
6              End_Position
7                    Strand
8    Variant_Classification
9              Variant_Type
dtype: object
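
The header-to-index mapping can be tried out without the full MC3 download. The following is a minimal sketch on a made-up, in-memory gzipped file: `toy_maf`, `col_ixs`, and `toy_gene` are illustrative names, and the three-column layout is an assumption chosen for brevity, not the real 114-column MC3 header.

```python
import gzip
import io

# Toy MAF-like content: one header line and one record (illustrative only)
toy_maf = (
    "Hugo_Symbol\tChromosome\tTumor_Sample_Barcode\n"
    "ARID1A\t1\tTCGA-05-4382-01A-01D-1931-08\n"
)
buf = io.BytesIO(gzip.compress(toy_maf.encode("utf-8")))

# Text mode ("rt") decodes each line, so no manual .decode() is needed
with gzip.open(buf, "rt", encoding="utf-8") as handle:
    header = handle.readline().rstrip("\n").split("\t")
    col_ixs = {name: ix for ix, name in enumerate(header)}
    first_record = handle.readline().rstrip("\n").split("\t")

# Look up a field by column name instead of a hard-coded position
toy_gene = first_record[col_ixs["Hugo_Symbol"]]
print(col_ixs)
print(toy_gene)
```

Looking fields up by name keeps the parsing code readable and independent of the column order.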


In [4]:
mutated_samples_file = mutated_samples_dir / '{}_mutated_samples.tsv'.format(gene)

if mutated_samples_file.is_file():
    print('file already exists, loading from file')
    mutants_df = pd.read_csv(mutated_samples_file, sep='\t', index_col=0)
else:
    print('generating mutated samples from MC3 maf file')
    mutants = []
    for line in mc3:
        # strip only the line ending so that trailing empty fields are kept
        record = line.decode('UTF-8').rstrip('\r\n').split('\t')
        # keep records whose gene name matches the gene of interest
        if record[maf_ixs['Hugo_Symbol']] == gene:
            mutants.append(record)
    mc3.close()
    mutants_df = pd.DataFrame(mutants, columns=maf_header)
    mutants_df.to_csv(mutated_samples_file, sep='\t')

generating mutated samples from MC3 maf file
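
The same gene filter can also be expressed as a generator over the standard library's `csv.DictReader`, which handles the header-to-column mapping itself. This is a sketch of an alternative, not the approach the notebook uses: `iter_gene_records` is a hypothetical helper, and the two-column toy input is made up for illustration.

```python
import csv
import io

def iter_gene_records(handle, gene):
    """Yield one dict per tab-delimited row whose Hugo_Symbol equals `gene`."""
    # QUOTE_NONE treats quote characters as ordinary text
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["Hugo_Symbol"] == gene:
            yield row

# Toy input standing in for an opened text-mode MAF file (illustrative only)
toy = io.StringIO(
    "Hugo_Symbol\tVariant_Classification\n"
    "ARID1A\tNonsense_Mutation\n"
    "TP53\tMissense_Mutation\n"
    "ARID1A\tMissense_Mutation\n"
)
toy_hits = list(iter_gene_records(toy, "ARID1A"))
print(len(toy_hits))
```

Because the generator yields rows one at a time, only the matching records are ever held in memory, which matters for a file the size of the full MC3 MAF.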


In [5]:
print(mutants_df.shape)
mutants_df.iloc[:5, :20]

(1220, 114)


,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,dbSNP_RS,dbSNP_Val_Status,Tumor_Sample_Barcode,Matched_Norm_Sample_Barcode,Match_Norm_Seq_Allele1,Match_Norm_Seq_Allele2,Tumor_Validation_Allele1
0,ARID1A,0,.,GRCh37,1,27105667,27105667,+,Nonsense_Mutation,SNP,G,G,T,novel,.,TCGA-05-4382-01A-01D-1931-08,TCGA-05-4382-10A-01D-1265-08,G,G,.
1,ARID1A,0,.,GRCh37,1,27089688,27089688,+,Missense_Mutation,SNP,G,G,T,.,.,TCGA-05-4396-01A-21D-1855-08,TCGA-05-4396-10A-01D-1855-08,G,G,.
2,ARID1A,0,.,GRCh37,1,27056307,27056307,+,Nonsense_Mutation,SNP,C,C,T,novel,.,TCGA-05-4425-01A-01D-1753-08,TCGA-05-4425-10A-01D-1753-08,C,C,.
3,ARID1A,0,.,GRCh37,1,27056169,27056169,+,Missense_Mutation,SNP,A,A,C,.,.,TCGA-05-4427-01A-21D-1855-08,TCGA-05-4427-10A-01D-1855-08,A,A,.
4,ARID1A,0,.,GRCh37,1,27088651,27088651,+,Nonsense_Mutation,SNP,C,C,T,.,.,TCGA-05-5715-01A-01D-1625-08,TCGA-05-5715-10A-01D-1625-08,C,C,.
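
The `Tumor_Sample_Barcode` and `Matched_Norm_Sample_Barcode` values in the table follow the TCGA barcode layout, in which the second dash-separated field is the tissue source site (TSS) code and the first two characters of the fourth field are the sample type code (01 to 09 for tumor, 10 to 19 for normal). A minimal sketch of pulling these fields out is shown here; `parse_tcga_barcode` and its returned keys are hypothetical names introduced for illustration.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA aliquot barcode into a few commonly used fields."""
    parts = barcode.split("-")
    sample_type = parts[3][:2]  # e.g. "01" (primary solid tumor) or "10" (blood normal)
    return {
        "patient_id": "-".join(parts[:3]),  # project, TSS, and participant
        "tss_code": parts[1],
        "sample_type": sample_type,
        "is_tumor": 1 <= int(sample_type) <= 9,
        "is_primary_solid_tumor": sample_type == "01",
    }

# Barcodes taken from the first row of the table above
tumor = parse_tcga_barcode("TCGA-05-4382-01A-01D-1931-08")
normal = parse_tcga_barcode("TCGA-05-4382-10A-01D-1265-08")
print(tumor)
print(normal)
```

Both barcodes share the same `patient_id`, which is what pairs a tumor sample with its matched normal.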
