# Processing ICGC Data for Mutational Signatures Analysis

**Goal**: clean the downloaded ICGC donor SSM data, convert it into [Mutation Annotation Format (MAF)](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/), and then convert the MAF data into a mutational spectra matrix for mutational signature analysis.

This notebook assumes that the steps in the notebook `Consuming ICGC Data.ipynb` have already been run and that the downloaded WGS SSM data is stored in the `/data/WGS/` folder of the project directory.

In [76]:
import gzip
from pathlib import Path
import shutil
from typing import Dict, List

import pandas as pd
import pyfaidx
import requests
from tqdm import tqdm


pd.set_option('display.max_columns', None)

In [2]:
dir_data = Path.cwd().parent / "data"
dir_wgs = dir_data / "WGS/"

In [3]:
data = pd.read_csv(dir_wgs / "BRCA-UK_ssm_WGS.tsv.gz", sep="\t")



In [4]:
data.head()

Unnamed: 0,icgc_mutation_id,icgc_donor_id,project_code,icgc_specimen_id,icgc_sample_id,matched_icgc_sample_id,submitted_sample_id,submitted_matched_sample_id,chromosome,chromosome_start,chromosome_end,chromosome_strand,assembly_version,mutation_type,reference_genome_allele,mutated_from_allele,mutated_to_allele,quality_score,probability,total_read_count,mutant_allele_read_count,verification_status,verification_platform,biological_validation_status,biological_validation_platform,consequence_type,aa_mutation,cds_mutation,gene_affected,transcript_affected,gene_build_version,platform,experimental_protocol,sequencing_strategy,base_calling_algorithm,alignment_algorithm,variation_calling_algorithm,other_analysis_algorithm,seq_coverage,raw_data_repository,raw_data_accession,initial_data_release_date
0,MU2050016,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,9,38437800,38437800,1,GRCh37,single base substitution,C,C,T,,,33.0,7.0,not tested,,,,intergenic_region,,,,,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,
1,MU2050016,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,9,38437800,38437800,1,GRCh37,single base substitution,C,C,T,,,,,not tested,,not tested,,intergenic_region,,,,,75.0,Illumina GA sequencing,,WXS,,BWA http://bio-bwa.sourceforge.net,CaVEMan http://www.nature.com/nature/journal/v...,,,EGA,EGAS00001000161,
2,MU2051169,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,8,93543524,93543524,1,GRCh37,single base substitution,C,C,T,,,,,not tested,,not tested,,intergenic_region,,,,,75.0,Illumina GA sequencing,,WXS,,BWA http://bio-bwa.sourceforge.net,CaVEMan http://www.nature.com/nature/journal/v...,,,EGA,EGAS00001000161,
3,MU2051169,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,8,93543524,93543524,1,GRCh37,single base substitution,C,C,T,,,54.0,10.0,not tested,,,,intergenic_region,,,,,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,
4,MU92010295,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,3,59874611,59874611,1,GRCh37,single base substitution,C,C,T,,,58.0,5.0,not tested,,,,intron_variant,,,ENSG00000189283,ENST00000466788,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,


## Resolving multiple entries for the same mutation in a donor

A simple somatic mutation (SSM) donor dataset in the ICGC portal can contain multiple records for the same variant in a donor. These records differ in the annotation fields `consequence_type`, `aa_mutation`, `cds_mutation`, `gene_affected`, and `transcript_affected`, which are produced by [SnpEff](http://pcingola.github.io/SnpEff/), a genome variant annotation and effect prediction tool.

A single variant can have multiple functional effects (`consequence_type`). One reason is the presence of [multiple gene isoforms](https://en.wikipedia.org/wiki/Gene_isoform). These isoforms, while transcribed from the same locus, can differ in transcription start site, coding DNA sequence, and/or untranslated regions. As a result, [these gene isoforms can have different functions](https://en.wikipedia.org/wiki/Protein_isoform). In some isoforms the variant falls within the transcript and introduces a synonymous or missense change; in others it lies outside the transcript but can still influence splice site recognition. For these reasons, the same variant in a donor can have multiple `transcript_affected` values for the same `gene_affected`.

Additionally, sometimes a variant can exist some distance upstream/downstream of another gene and influence its transcription. As a result, `gene_affected` can also differ for the same variant in a donor.
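
The redundancy described above can be illustrated on a toy table. This is only a sketch: the donor, mutation, gene, and transcript identifiers below are made up and do not come from the ICGC data. Counting records per donor-variant pair exposes the repeats, and keeping the first record of each pair removes them.

```python
import pandas as pd

# Hypothetical records: variant MU1 in donor DO1 is annotated against two
# transcripts of the same gene, so it appears twice.
records = pd.DataFrame({
    "icgc_donor_id": ["DO1", "DO1", "DO1", "DO2"],
    "icgc_mutation_id": ["MU1", "MU1", "MU2", "MU1"],
    "gene_affected": ["GENE_A", "GENE_A", None, "GENE_A"],
    "transcript_affected": ["TX_1", "TX_2", None, "TX_1"],
    "consequence_type": ["missense_variant", "intron_variant",
                         "intergenic_region", "missense_variant"],
})

# Number of records for each (donor, variant) pair.
records_per_variant = records.groupby(
    ["icgc_donor_id", "icgc_mutation_id"]
).size()

# Keep the first record of each (donor, variant) pair.
deduplicated = records.drop_duplicates(
    subset=["icgc_donor_id", "icgc_mutation_id"]
).reset_index(drop=True)

print(records_per_variant)
print(deduplicated[["icgc_donor_id", "icgc_mutation_id"]])
```

The same variant ID in a different donor (`MU1` in `DO2`) is kept, because the pair of donor ID and mutation ID is what identifies a record here.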

In [5]:
print("Before cleaning:")

donor_id = "DO1076"

print(f"Donor ID# {donor_id} has {(data.loc[data['icgc_donor_id'] == donor_id].shape[0]):,} records.")
print(f"But they contain only {(data.loc[data['icgc_donor_id'] == donor_id]['icgc_mutation_id'].nunique()):,} unique variants.")

Before cleaning:
Donor ID# DO1076 has 490,103 records.
But they contain only 73,563 unique variants.


In [6]:
select_columns = [
    "icgc_mutation_id", "project_code", "icgc_donor_id",
    "chromosome", "chromosome_start", "chromosome_end",
    "assembly_version", "mutation_type", "reference_genome_allele",
    "mutated_to_allele",
]

In [7]:
data = data[select_columns]

print(f"Data dimensions: {data.shape[0]:,} instances and {data.shape[1]} columns.")

Data dimensions: 1,851,540 instances and 10 columns.


In [8]:
data = data.drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"])
data = data.reset_index(drop=True)

print(f"Number of instances after removing multiple records per variant in a donor: {data.shape[0]:,}")

Number of instances after removing multiple records per variant in a donor: 398,988


In [9]:
print("After cleaning:")

donor_id = "DO1076"

print(f"Donor ID# {donor_id} has {(data.loc[data['icgc_donor_id'] == donor_id].shape[0]):,} records.")
print(f"And they contain {(data.loc[data['icgc_donor_id'] == donor_id]['icgc_mutation_id'].nunique()):,} unique variants.")

After cleaning:
Donor ID# DO1076 has 73,563 records.
And they contain 73,563 unique variants.


In [10]:
data.head()

Unnamed: 0,icgc_mutation_id,project_code,icgc_donor_id,chromosome,chromosome_start,chromosome_end,assembly_version,mutation_type,reference_genome_allele,mutated_to_allele
0,MU2050016,BRCA-UK,DO1007,9,38437800,38437800,GRCh37,single base substitution,C,T
1,MU2051169,BRCA-UK,DO1007,8,93543524,93543524,GRCh37,single base substitution,C,T
2,MU92010295,BRCA-UK,DO1007,3,59874611,59874611,GRCh37,single base substitution,C,T
3,MU2047337,BRCA-UK,DO1007,11,10531779,10531779,GRCh37,single base substitution,C,T
4,MU65416208,BRCA-UK,DO1007,17,46826739,46826739,GRCh37,single base substitution,C,G


## Converting each ICGC SSM dataset into a set of MAF files (one per donor)

For each SSM dataset:
* Select only the columns relevant for creating the mutational spectra matrix.
* Remove the multiple records per variant in a donor that result from SnpEff variant annotation.
* Split the records by donor ID.
* Sort the records for each donor by chromosome name, and then by start position.
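
The sort order in the last step can be sketched with a small key function. This is a minimal illustration on made-up records, not necessarily how the notebook sorts; `chromosome_sort_key` is a hypothetical helper that places numeric chromosome names first, in numeric order, followed by non-numeric names such as `X` and `Y` in alphabetical order.

```python
from typing import List, Tuple


def chromosome_sort_key(chromosome: object) -> Tuple[int, int, str]:
    """Sort key: numeric chromosomes by value, then other names by text."""
    name = str(chromosome)
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


# Made-up (chromosome, start position) pairs for a single donor.
toy_records: List[Tuple[str, int]] = [
    ("X", 100), ("2", 50), ("10", 5), ("2", 7), ("Y", 1), ("1", 900),
]

# Sort by chromosome first, then by start position within a chromosome.
sorted_records = sorted(
    toy_records,
    key=lambda record: (chromosome_sort_key(record[0]), record[1]),
)
print(sorted_records)
```

A plain string sort would place chromosome `10` before chromosome `2`, which is why the numeric names are converted to integers in the key.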

In [11]:
def read_ssm_dataset(filepath: Path) -> pd.DataFrame:
    """
    Reads an ICGC SSM file as a pandas dataframe.

    :param filepath: file path to the SSM dataset.
    :return: pandas dataframe selecting only the columns relevant to mutational
        signatures analysis.
    """
    select_columns = [
        "icgc_mutation_id", "project_code", "icgc_donor_id",
        "chromosome", "chromosome_start", "chromosome_end",
        "assembly_version", "mutation_type", "reference_genome_allele",
        "mutated_to_allele",
    ]

    return pd.read_csv(filepath, usecols=select_columns, sep="\t")


def clean_ssm_dataset(data: pd.DataFrame) -> pd.DataFrame:
    """
    Keeps only one record per variant for each donor and drops the rest. The
    repeats come from the SnpEff annotation tool, whose per-transcript
    annotations are not needed for signature analysis.

    :param data: a dataframe of SSM.
    :return: a dataframe of SSM without repeats.
    """
    return data.drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"]).reset_index(drop=True)


def segregate_ids_and_save_as_maf(data: pd.DataFrame,
                                  dir_output: Path) -> None:
    """
    Takes an ICGC SSM dataset, groups them by donor ID, then for each donor ID,
    sorts the records by chromosome number and then by chromosome start position,
    and finally writes this dataset as an MAF file.

    :param data: SSM dataframe
    :param dir_output: output directory for the MAF files.
    """
    for donor_id, data_id in data.groupby("icgc_donor_id", sort=False):
        # Numeric chromosomes sort first (1, 2, ..., 22); non-numeric names
        # such as X and Y are coerced to NaN and placed last, by name.
        chrom_as_str = data_id["chromosome"].astype(str)
        data_id = (
            data_id.assign(_chrom_num=pd.to_numeric(chrom_as_str, errors="coerce"),
                           _chrom_str=chrom_as_str)
            .sort_values(["_chrom_num", "_chrom_str", "chromosome_start"])
            .drop(columns=["_chrom_num", "_chrom_str"])
            .reset_index(drop=True)
        )
        data_id.to_csv(dir_output / f"{donor_id}", sep="\t", index=False)


def convert_ssms_to_mafs(dir_datasets: Path, dir_output: Path) -> None:
    """
    Converts each SSM dataset in a directory into MAF files.

    :param dir_datasets: directory containing SSM datasets.
    :param dir_output: directory to store MAF files.
    """
    filepaths = list(dir_datasets.glob("*.tsv.gz"))

    dir_output = dir_output / (dir_datasets.name + "_MAF")
    dir_output.mkdir(exist_ok=True)

    for filepath in filepaths:
        data = read_ssm_dataset(filepath)
        data = clean_ssm_dataset(data)
        # The project code is the file name up to the first underscore,
        # e.g. "BRCA-UK_ssm_WGS.tsv.gz" -> "BRCA-UK".
        dir_output_file = dir_output / (filepath.stem.split("_")[0])
        dir_output_file.mkdir(exist_ok=True)
        segregate_ids_and_save_as_maf(data, dir_output_file)

In [12]:
convert_ssms_to_mafs(dir_wgs, dir_data)



## Converting MAF files to mutational spectra matrix

In [50]:
def download_grch37(filepath: Path) -> None:
    """
    Downloads a gzip-compressed FASTA file of hg19, the UCSC build of the
    reference genome GRCh37, from the UCSC download server.

    :param filepath: path of the output file.
    """
    url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz"
    headers = {"Accept": "application/x-gzip"}

    # Stream the response so the large file is never held in memory at once.
    # TLS certificate verification is left at its default (enabled).
    response = requests.get(url, headers=headers, stream=True)
    if response.status_code != 200:
        raise IOError(f"GET {url} resulted in status code {response.status_code}")

    with open(filepath, "wb") as f:
        for chunk in tqdm(response.iter_content(10*1024**2)):
            f.write(chunk)

In [None]:
fpath_compressed_grch37 = dir_data / "hg19.fa.gz"
download_grch37(fpath_compressed_grch37)

In [72]:
def gunzip(gzipped_filepath: Path, gunzipped_filepath: Path) -> None:
    """
    Uncompress a gzipped file.

    :param gzipped_filepath: gzip compressed filepath.
    :param gunzipped_filepath: filepath for unzipped file.
    """
    with gzip.open(gzipped_filepath, "rb") as f_src:
        with open(gunzipped_filepath, "wb") as f_dest:
            shutil.copyfileobj(f_src, f_dest, length=10*1024**2)

In [73]:
fpath_grch37 = dir_data / "hg19.fa"
gunzip(fpath_compressed_grch37, fpath_grch37)

In [None]:
def get_sbs_trinucleotide_contexts() -> List[str]:
    sbs_trinucleotide_contexts = []
    nucleotide_bases = ["A", "C", "G", "T"]
    substitution_types = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

    for base_5 in nucleotide_bases:
        for base_3 in nucleotide_bases:
            for substitution in substitution_types:
                sbs_trinucleotide_contexts.append(f"{base_5}[{substitution}]{base_3}")

    return sbs_trinucleotide_contexts


def init_sbs_mutational_spectra(n_records: int) -> Dict[str, List[int]]:
    sbs_mutational_spectra = dict()
    sbs_trinucleotide_contexts = get_sbs_trinucleotide_contexts()

    for context in sbs_trinucleotide_contexts:
        sbs_mutational_spectra[context] = [0]*n_records

    return sbs_mutational_spectra


def index_reference_genome(ref_fasta_filepath: Path) -> pyfaidx.Fasta:
    # Pass the path as a string, as in the exploratory cell further down.
    return pyfaidx.Fasta(ref_fasta_filepath.as_posix())


def read_sbs_maf_file(filepath: Path) -> pd.DataFrame:
    data = pd.read_csv(filepath, sep="\t")
    data = data.loc[data["mutation_type"] == "single base substitution"].reset_index(drop=True)

    return data


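def add_instance_to_mutational_spectra(data_maf: pd.DataFrame,
                                       mutational_spectra: Dict[str, List[int]],
                                       index: int) -> None:
    """
    Placeholder (not yet implemented) for adding the SBS counts of one donor
    to the mutational spectra.

    :param data_maf: SBS records of a single donor.
    :param mutational_spectra: counts per trinucleotide context, one entry per donor.
    :param index: position of the donor in each list of counts.
    """
    pass


def convert_mafs_to_sbs_mutational_spectra(dir_mafs: Path,
                                           ref_fasta_filepath: Path) -> Dict[str, List[int]]:
    maf_filepaths = list(dir_mafs.glob("*"))
    n_samples = len(maf_filepaths)
    mutational_spectra = init_sbs_mutational_spectra(n_samples)

    ref_fasta = index_reference_genome(ref_fasta_filepath)

    for donor_index, maf_filepath in enumerate(maf_filepaths):
        data_maf = read_sbs_maf_file(maf_filepath)
        add_instance_to_mutational_spectra(data_maf, mutational_spectra, donor_index)

    return mutational_spectra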
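
The counting step is left unimplemented above. As a hedged sketch of one way it could work, the code below assigns a substitution to one of the 96 keys produced by `get_sbs_trinucleotide_contexts`, given its trinucleotide context and mutated allele. It follows the usual SBS-96 convention of reporting the substitution from the pyrimidine (C or T) of the base pair, so a purine reference is reverse-complemented first. The names `sbs96_channel` and `count_sbs96` and the toy variants are assumptions for illustration, not part of the notebook.

```python
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_channel(context: str, alt: str) -> str:
    """
    Map a trinucleotide context (5' base, reference base, 3' base) and a
    mutated allele to a key such as "T[C>T]A".
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1:
        raise ValueError("expected a 3-base context and a 1-base allele")
    if set(context + alt) - set(COMPLEMENT):
        raise ValueError("only A, C, G and T are supported")
    if context[1] == alt:
        raise ValueError("reference and mutated alleles are identical")

    # Purine reference (A or G): switch to the opposite strand so that the
    # substitution is expressed from the pyrimidine.
    if context[1] in "AG":
        context = "".join(COMPLEMENT[base] for base in reversed(context))
        alt = COMPLEMENT[alt]

    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def count_sbs96(variants: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count (context, mutated allele) pairs over all 96 channels."""
    counts = {
        f"{five}[{substitution}]{three}": 0
        for five in "ACGT" for three in "ACGT" for substitution in SUBSTITUTIONS
    }
    for context, alt in variants:
        counts[sbs96_channel(context, alt)] += 1
    return counts


# Made-up variants: the first two are the same event seen on opposite strands.
toy_variants = [("TCA", "T"), ("TGA", "A"), ("aag", "c")]
toy_counts = count_sbs96(toy_variants)
print({channel: n for channel, n in toy_counts.items() if n})
```

Contexts are upper-cased because the UCSC FASTA file contains lowercase (soft-masked) bases.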

In [74]:
temp = pyfaidx.Fasta(fpath_grch37.as_posix())

In [92]:
temp["chrX"][2232001:2232004]

>chrX:2232002-2232004
gtt
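
The output above shows the coordinate convention at work: the Python-style slice `[2232001:2232004]` is 0-based and half-open, and the header reports it as the 1-based inclusive range `2232002-2232004`. ICGC positions such as `chromosome_start` are 1-based, so the trinucleotide centred on position `p` corresponds to the slice `[p - 2:p + 1]`. The sketch below demonstrates this arithmetic on a plain string standing in for a chromosome; `trinucleotide_at` and the toy sequence are hypothetical and used only for illustration.

```python
def trinucleotide_at(sequence: str, position: int) -> str:
    """
    Return the upper-cased trinucleotide centred on a 1-based position.

    :param sequence: reference sequence (a plain string in this sketch).
    :param position: 1-based position of the middle base.
    """
    if position < 2 or position > len(sequence) - 1:
        raise ValueError("position needs a flanking base on both sides")
    # 1-based position p is 0-based index p - 1, so the window spanning
    # indices p - 2 .. p is the half-open slice [p - 2:p + 1].
    return sequence[position - 2:position + 1].upper()


toy_sequence = "acgTTgca"  # lowercase mimics soft-masked bases
print(trinucleotide_at(toy_sequence, 5))
```

Upper-casing matters because soft-masked bases are returned in lowercase, as in `gtt` above, while the alleles in the SSM data are uppercase.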

In [90]:
temp.keys()

odict_keys(['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chrX', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr20', 'chrY', 'chr19', 'chr22', 'chr21', 'chr6_ssto_hap7', 'chr6_mcf_hap5', 'chr6_cox_hap2', 'chr6_mann_hap4', 'chr6_apd_hap1', 'chr6_qbl_hap6', 'chr6_dbb_hap3', 'chr17_ctg5_hap1', 'chr4_ctg9_hap1', 'chr1_gl000192_random', 'chrUn_gl000225', 'chr4_gl000194_random', 'chr4_gl000193_random', 'chr9_gl000200_random', 'chrUn_gl000222', 'chrUn_gl000212', 'chr7_gl000195_random', 'chrUn_gl000223', 'chrUn_gl000224', 'chrUn_gl000219', 'chr17_gl000205_random', 'chrUn_gl000215', 'chrUn_gl000216', 'chrUn_gl000217', 'chr9_gl000199_random', 'chrUn_gl000211', 'chrUn_gl000213', 'chrUn_gl000220', 'chrUn_gl000218', 'chr19_gl000209_random', 'chrUn_gl000221', 'chrUn_gl000214', 'chrUn_gl000228', 'chrUn_gl000227', 'chr1_gl000191_random', 'chr19_gl000208_random', 'chr9_gl000198_random', 'chr17_gl000204_random', 'chrUn_gl000233', 'chrUn_gl

In [38]:
data = pd.read_csv(dir_data / "WGS_MAF/BRCA-UK/DO52539", sep="\t")

In [57]:
temp = data.loc[data["mutation_type"] == "single base substitution"].reset_index(drop=True)

In [58]:
print(len(temp))

4663


In [59]:
temp.head(10)

Unnamed: 0,icgc_mutation_id,icgc_donor_id,project_code,chromosome,chromosome_start,chromosome_end,assembly_version,mutation_type,reference_genome_allele,mutated_to_allele
0,MU92419411,DO52539,BRCA-UK,1,829072,829072,GRCh37,single base substitution,G,C
1,MU63450989,DO52539,BRCA-UK,1,4346087,4346087,GRCh37,single base substitution,T,A
2,MU63450992,DO52539,BRCA-UK,1,4388443,4388443,GRCh37,single base substitution,A,G
3,MU63450995,DO52539,BRCA-UK,1,4864061,4864061,GRCh37,single base substitution,G,A
4,MU63450998,DO52539,BRCA-UK,1,4971737,4971737,GRCh37,single base substitution,C,T
5,MU92419458,DO52539,BRCA-UK,1,7580988,7580988,GRCh37,single base substitution,G,A
6,MU63451004,DO52539,BRCA-UK,1,7996239,7996239,GRCh37,single base substitution,G,A
7,MU63451011,DO52539,BRCA-UK,1,8811763,8811763,GRCh37,single base substitution,G,A
8,MU92419464,DO52539,BRCA-UK,1,9159992,9159992,GRCh37,single base substitution,C,G
9,MU92419479,DO52539,BRCA-UK,1,9459735,9459735,GRCh37,single base substitution,G,C
