# Processing ICGC Data for Mutational Signatures Analysis

**Goal**: clean the downloaded ICGC donor simple somatic mutation (SSM) data, convert it into [Mutation Annotation Format (MAF)](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/), and then convert the MAF data into a mutational spectra matrix for mutational signature analysis.

This notebook assumes that the steps in the notebook `Consuming ICGC Data.ipynb` have already been run and that the downloaded WGS SSM data are stored in the `/data/WGS/` folder of the project directory.

In [1]:
from collections import OrderedDict
import gzip
from pathlib import Path
import shutil

from natsort import natsorted
import numpy as np
import pandas as pd
import pyfaidx
import requests
from tqdm import tqdm


pd.set_option('display.max_columns', None)

In [2]:
dir_data = Path.cwd().parent / "data"
dir_wgs = dir_data / "WGS/"

In [3]:
data = pd.read_csv(dir_wgs / "BRCA-UK_ssm_WGS.tsv.gz", sep="\t")


In [4]:
data.head()

Unnamed: 0,icgc_mutation_id,icgc_donor_id,project_code,icgc_specimen_id,icgc_sample_id,matched_icgc_sample_id,submitted_sample_id,submitted_matched_sample_id,chromosome,chromosome_start,chromosome_end,chromosome_strand,assembly_version,mutation_type,reference_genome_allele,mutated_from_allele,mutated_to_allele,quality_score,probability,total_read_count,mutant_allele_read_count,verification_status,verification_platform,biological_validation_status,biological_validation_platform,consequence_type,aa_mutation,cds_mutation,gene_affected,transcript_affected,gene_build_version,platform,experimental_protocol,sequencing_strategy,base_calling_algorithm,alignment_algorithm,variation_calling_algorithm,other_analysis_algorithm,seq_coverage,raw_data_repository,raw_data_accession,initial_data_release_date
0,MU2050016,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,9,38437800,38437800,1,GRCh37,single base substitution,C,C,T,,,33.0,7.0,not tested,,,,intergenic_region,,,,,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,
1,MU2050016,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,9,38437800,38437800,1,GRCh37,single base substitution,C,C,T,,,,,not tested,,not tested,,intergenic_region,,,,,75.0,Illumina GA sequencing,,WXS,,BWA http://bio-bwa.sourceforge.net,CaVEMan http://www.nature.com/nature/journal/v...,,,EGA,EGAS00001000161,
2,MU2051169,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,8,93543524,93543524,1,GRCh37,single base substitution,C,C,T,,,,,not tested,,not tested,,intergenic_region,,,,,75.0,Illumina GA sequencing,,WXS,,BWA http://bio-bwa.sourceforge.net,CaVEMan http://www.nature.com/nature/journal/v...,,,EGA,EGAS00001000161,
3,MU2051169,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,8,93543524,93543524,1,GRCh37,single base substitution,C,C,T,,,54.0,10.0,not tested,,,,intergenic_region,,,,,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,
4,MU92010295,DO1007,BRCA-UK,SP2150,SA6146,SA6216,PD4085a,PD4085b,3,59874611,59874611,1,GRCh37,single base substitution,C,C,T,,,58.0,5.0,not tested,,,,intron_variant,,,ENSG00000189283,ENST00000466788,75.0,Illumina HiSeq,,WGS,,,PCAWG Consensus SNV-MNV caller,,,,FI43481:FI43480,


## Resolving multiple entries for the same mutation in a donor

A simple somatic mutation (SSM) donor dataset in the ICGC portal can contain multiple records for the same variant in a donor. These records differ in the fields `consequence_type`, `aa_mutation`, `cds_mutation`, `gene_affected`, and `transcript_affected`. The extra records come from annotation by [SnpEff](http://pcingola.github.io/SnpEff/), a genome variant annotation and effect prediction tool.

A single variant can have multiple functional effects (`consequence_type`). One reason is the presence of [multiple gene isoforms](https://en.wikipedia.org/wiki/Gene_isoform). These isoforms, while coming from the same locus, can differ in transcription start site, coding DNA sequence, and/or untranslated regions. As a result, [these gene isoforms can have different functions](https://en.wikipedia.org/wiki/Protein_isoform). Sometimes a variant is transcribed and introduces a synonymous or missense mutation into the transcript. Other times the variant is absent from the transcript isoform but still influences splice site recognition. For these reasons, the same variant in a donor can have multiple `transcript_affected` values for the same `gene_affected`.

Additionally, sometimes a variant can exist some distance upstream/downstream of another gene and influence its transcription. As a result, `gene_affected` can also differ for the same variant in a donor.
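To make the duplication concrete, here is a minimal sketch on a toy table. The donor, mutation, gene, and transcript identifiers are made up for illustration; only the column names come from the ICGC SSM format. One variant is annotated against two transcripts of one gene and against a neighbouring gene, so it appears three times.

```python
import pandas as pd

# Toy records with made-up identifiers: variant MU_1 is annotated three times,
# variant MU_2 only once.
records = pd.DataFrame({
    "icgc_donor_id": ["DO_A", "DO_A", "DO_A", "DO_A"],
    "icgc_mutation_id": ["MU_1", "MU_1", "MU_1", "MU_2"],
    "consequence_type": ["missense_variant", "intron_variant",
                         "upstream_gene_variant", "intergenic_region"],
    "gene_affected": ["GENE_X", "GENE_X", "GENE_Y", None],
    "transcript_affected": ["TX_1", "TX_2", "TX_3", None],
})

# Number of annotation records per (donor, variant) pair.
records_per_variant = records.groupby(["icgc_donor_id", "icgc_mutation_id"]).size()
print(records_per_variant)

# Keeping the first record of each (donor, variant) pair leaves one row per variant.
deduplicated = records.drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"])
print(f"{len(records)} records -> {len(deduplicated)} unique variants")
```

Because the genomic coordinates and alleles are identical across the repeated records, it does not matter which of them is kept when only those columns are needed.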

In [5]:
print("Before cleaning:")

donor_id = "DO1076"

print(f"Donor ID# {donor_id} has {(data.loc[data['icgc_donor_id'] == donor_id].shape[0]):,} records.")
print(f"But they contain only {(data.loc[data['icgc_donor_id'] == donor_id]['icgc_mutation_id'].nunique()):,} unique variants.")

Before cleaning:
Donor ID# DO1076 has 490,103 records.
But they contain only 73,563 unique variants.


In [6]:
select_columns = [
    "icgc_mutation_id", "project_code", "icgc_donor_id",
    "chromosome", "chromosome_start", "chromosome_end",
    "assembly_version", "mutation_type", "reference_genome_allele",
    "mutated_to_allele",
]

In [7]:
data = data[select_columns]

print(f"Data dimensions: {data.shape[0]:,} instances and {data.shape[1]} columns.")

Data dimensions: 1,851,540 instances and 10 columns.


In [8]:
data = data.drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"])
data = data.reset_index(drop=True)

print(f"Number of instances after removing multiple records per variant in a donor: {data.shape[0]:,}")

Number of instances after removing multiple records per variant in a donor: 398,988


In [9]:
print("After cleaning:")

donor_id = "DO1076"

print(f"Donor ID# {donor_id} has {(data.loc[data['icgc_donor_id'] == donor_id].shape[0]):,} records.")
print(f"And they contain {(data.loc[data['icgc_donor_id'] == donor_id]['icgc_mutation_id'].nunique()):,} unique variants.")

After cleaning:
Donor ID# DO1076 has 73,563 records.
And they contain 73,563 unique variants.


In [10]:
data.head()

Unnamed: 0,icgc_mutation_id,project_code,icgc_donor_id,chromosome,chromosome_start,chromosome_end,assembly_version,mutation_type,reference_genome_allele,mutated_to_allele
0,MU2050016,BRCA-UK,DO1007,9,38437800,38437800,GRCh37,single base substitution,C,T
1,MU2051169,BRCA-UK,DO1007,8,93543524,93543524,GRCh37,single base substitution,C,T
2,MU92010295,BRCA-UK,DO1007,3,59874611,59874611,GRCh37,single base substitution,C,T
3,MU2047337,BRCA-UK,DO1007,11,10531779,10531779,GRCh37,single base substitution,C,T
4,MU65416208,BRCA-UK,DO1007,17,46826739,46826739,GRCh37,single base substitution,C,G


## Converting each ICGC SSM dataset into a set of MAF files (one per donor)

For each SSM dataset:
* Select only the columns relevant to building the mutational spectra matrix.
* Remove the multiple records that SnpEff annotation creates for the same variant in a donor.
* Split the records by donor ID.
* Sort the records for each donor by chromosome name, and then by start position.
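The splitting and sorting steps can be sketched on a toy table as follows. This is an illustrative alternative, not the implementation used in this notebook: the helper `sort_by_position`, the ranking of `X`, `Y`, and `MT` after the autosomes, and the donor IDs are all assumptions made for the example.

```python
import pandas as pd

# Toy records for two made-up donors; chromosome names are kept as strings.
toy = pd.DataFrame({
    "icgc_donor_id": ["DO_A", "DO_A", "DO_B", "DO_A", "DO_B"],
    "chromosome": ["X", "10", "2", "2", "1"],
    "chromosome_start": [500, 300, 900, 100, 700],
})


def sort_by_position(records: pd.DataFrame) -> pd.DataFrame:
    """Sorts by chromosome (numeric names first, then X, Y, MT) and then by start."""
    non_numeric_rank = {"X": 23, "Y": 24, "MT": 25}  # assumed ordering
    names = records["chromosome"].astype(str)
    # Numeric names keep their value; the others take their rank from the mapping.
    rank = pd.to_numeric(names, errors="coerce").fillna(names.map(non_numeric_rank))
    return (
        records.assign(chromosome_rank=rank)
        .sort_values(["chromosome_rank", "chromosome_start"])
        .drop(columns="chromosome_rank")
        .reset_index(drop=True)
    )


# One sorted table per donor, mirroring the one-file-per-donor layout.
per_donor = {
    donor_id: sort_by_position(donor_records)
    for donor_id, donor_records in toy.groupby("icgc_donor_id")
}
print(per_donor["DO_A"])
```

Sorting on a numeric rank rather than on the raw names avoids the lexicographic order in which `"10"` would come before `"2"`.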

In [11]:
def read_ssm_dataset(filepath: Path) -> pd.DataFrame:
    """
    Reads an ICGC SSM file as a pandas dataframe.

    :param filepath: file path to the SSM dataset.
    :return: pandas dataframe selecting only the columns relevant to mutational
        signatures analysis.
    """
    select_columns = [
        "icgc_mutation_id", "project_code", "icgc_donor_id",
        "chromosome", "chromosome_start", "chromosome_end",
        "assembly_version", "mutation_type", "reference_genome_allele",
        "mutated_to_allele",
    ]

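    # Read chromosome names as strings so that numeric and non-numeric names
    # (e.g. "1" and "X") share one dtype.
    return pd.read_csv(filepath, usecols=select_columns, sep="\t", dtype={"chromosome": str})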


def clean_ssm_dataset(data: pd.DataFrame) -> pd.DataFrame:
    """
    Keeps only one record per variant per donor and drops the rest. The repeats come
    from SnpEff annotations, which are not needed to build the mutational spectra.

    :param data: a dataframe of SSM.
    :return: a dataframe of SSM without repeats
    """
    return data.drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"]).reset_index(drop=True)


def segregate_ids_and_save_as_maf(data: pd.DataFrame,
                                  dir_output: Path) -> None:
    """
    Takes an ICGC SSM dataset, groups them by donor ID, then for each donor ID,
    sorts the records by chromosome number and then by chromosome start position,
    and finally writes this dataset as an MAF file.

    :param data: SSM dataframe
    :param dir_output: output directory for the MAF files.
    """
    for donor_id in data["icgc_donor_id"].unique():
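        data_id = data.loc[data["icgc_donor_id"] == donor_id]
        # Order rows by numeric chromosome; non-numeric names (e.g. X, Y) become NaN
        # and are placed last.
        data_id = data_id.loc[pd.to_numeric(data_id["chromosome"], errors="coerce").sort_values().index]
        # Within each chromosome, order rows by start position while keeping the
        # chromosome order established above.
        data_id = data_id.groupby("chromosome", sort=False)\
            .apply(pd.DataFrame.sort_values, "chromosome_start")\
            .reset_index(drop=True)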
        data_id.to_csv(dir_output / f"{donor_id}", sep="\t", index=False)


def convert_ssms_to_mafs(dir_datasets: Path, dir_output: Path) -> None:
    """
    Converts each SSM dataset in a directory into MAF files.

    :param dir_datasets: directory containing SSM datasets.
    :param dir_output: directory to store MAF files.
    """
    filepaths = list(dir_datasets.glob("*.tsv.gz"))

    for filepath in filepaths:
        data = read_ssm_dataset(filepath)
        data = clean_ssm_dataset(data)
        dir_output_file = dir_output / (filepath.stem.split("_")[0])
        if not dir_output_file.exists():
            dir_output_file.mkdir()
        segregate_ids_and_save_as_maf(data, dir_output_file)

In [12]:
dir_maf_dirs = dir_data / "WGS_MAFs"
if not dir_maf_dirs.exists():
    dir_maf_dirs.mkdir()

convert_ssms_to_mafs(dir_wgs, dir_maf_dirs)


## Converting MAF files to mutational spectra matrix

For each project dataset directory:
* Read each MAF file, which represents one individual.
* For each individual, parse every row of their MAF file that represents a single base substitution.
* Look up the position of each mutation in the GRCh37 FASTA sequence to find its trinucleotide context (5' base, reference allele, 3' base).
* Tabulate the counts of each substitution type in its trinucleotide context and save them as the mutational spectra matrix.
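The lookup and tabulation steps can be sketched on a toy sequence before applying them to the whole genome. This is a simplified illustration, not the notebook's implementation: `toy_reference` stands in for one chromosome of the reference, the mutations are made up, and a plain Python string replaces the indexed FASTA file.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def trinucleotide_context(sequence: str, position: int) -> str:
    """Returns the three bases centred on a 1-based position of a 0-based string."""
    return sequence[position - 2:position + 1].upper()


def sbs_channel(context: str, ref: str, alt: str) -> str:
    """Builds a pyrimidine-centred label such as 'T[C>T]A'."""
    if ref in ("A", "G"):
        # Purine reference: describe the same mutation on the opposite strand,
        # which means reverse-complementing the context and complementing the alleles.
        context = "".join(COMPLEMENT[base] for base in reversed(context))
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


# Hypothetical stand-in for one chromosome of the reference genome.
toy_reference = "ATCAGTGACC"
# Hypothetical substitutions: (1-based position, reference allele, mutated allele).
toy_mutations = [(3, "C", "T"), (7, "G", "A"), (5, "G", "T")]

counts = Counter()
for position, ref, alt in toy_mutations:
    context = trinucleotide_context(toy_reference, position)
    # The base found in the reference must agree with the reported reference allele.
    assert context[1] == ref
    counts[sbs_channel(context, ref, alt)] += 1

print(dict(counts))  # {'T[C>T]A': 2, 'A[C>A]T': 1}
```

The C>T at position 3 (context `TCA`) and the G>A at position 7 (context `TGA`) land in the same channel, because the second is the first one read from the opposite strand. With 4 possible 5' bases, 6 substitution types, and 4 possible 3' bases, there are 96 such channels in total.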

In [13]:
def download_grch37(filepath: Path) -> None:
    """
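    Downloads a compressed FASTA file of the reference genome GRCh37 (UCSC hg19)
    from the UCSC Genome Browser download server.

    :param filepath: file path at which to save the compressed FASTA file.
    """
    url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz"
    headers = {"Accept": "application/x-gzip"}

    # Stream the response so that the large file is never held in memory at once.
    # TLS certificate verification is left at its default (enabled).
    response = requests.get(url, headers=headers, stream=True)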
    if response.status_code != 200:
        raise IOError(f"GET {url} resulted in status code {response.status_code}")

    with open(filepath, "wb") as f:
        for data in tqdm(response.iter_content(10*1024**2)):
            f.write(data)

In [14]:
fpath_compressed_grch37 = dir_data / "hg19.fa.gz"
download_grch37(fpath_compressed_grch37)

91it [01:28,  1.02it/s]


In [15]:
def gunzip(gzipped_filepath: Path, gunzipped_filepath: Path) -> None:
    """
    Uncompress a gzipped file.

    :param gzipped_filepath: gzip compressed filepath.
    :param gunzipped_filepath: filepath for unzipped file.
    """
    with gzip.open(gzipped_filepath, "rb") as f_src:
        with open(gunzipped_filepath, "wb") as f_dest:
            shutil.copyfileobj(f_src, f_dest, length=10*1024**2)

In [16]:
fpath_grch37 = dir_data / "hg19.fa"
gunzip(fpath_compressed_grch37, fpath_grch37)

In [17]:
def get_sbs_trinucleotide_contexts() -> list[str]:
    """
    Returns a list of trinucleotide context for single base substitutions (SBS)
    for constructing a COSMIC mutational spectra matrix.

    :return: a list of SBS trinucleotide contexts.
    """
    sbs_trinucleotide_contexts = []
    nucleotide_bases = ["A", "C", "G", "T"]
    substitution_types = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

    for base_5 in nucleotide_bases:
        for substitution in substitution_types:
            for base_3 in nucleotide_bases:
                sbs_trinucleotide_contexts.append(f"{base_5}[{substitution}]{base_3}")

    return sbs_trinucleotide_contexts


def init_sbs_mutational_spectra(n_records: int) -> OrderedDict[str, list[int]]:
    """
    Initializes an ordered dictionary with SBS trinucleotide contexts as keys and,
    as values, lists of counts with one entry per sample.

    :param n_records: number of samples to record in the mutational spectra matrix.
    :return: an ordered dictionary of trinucleotide context and a list of counts
        initialized to zeros.
    """
    sbs_mutational_spectra = OrderedDict()
    sbs_trinucleotide_contexts = get_sbs_trinucleotide_contexts()

    for context in sbs_trinucleotide_contexts:
        sbs_mutational_spectra[context] = [0]*n_records

    return sbs_mutational_spectra


def index_reference_genome(ref_fasta_filepath: Path) -> pyfaidx.Fasta:
    """
    Returns an indexed FASTA file to quickly lookup subsequences in a genome.

    :param ref_fasta_filepath: filepath of the FASTA file of the reference genome.
    :return: an indexed FASTA file
    """
    return pyfaidx.Fasta(ref_fasta_filepath.as_posix())


def read_sbs_maf_file(filepath: Path) -> pd.DataFrame:
    """
    Reads only single base substitutions from an MAF file generated from an
    ICGC SSM dataset.

    :param filepath: file path to the MAF file.
    :return: Pandas dataframe with only single base substitutions.
    """
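    # Read chromosome names as strings so that numeric and non-numeric names
    # (e.g. "1" and "X") share one dtype.
    data = pd.read_csv(filepath, sep="\t", dtype={"chromosome": str})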
    data = data.loc[data["mutation_type"] == "single base substitution"].reset_index(drop=True)
    return data


def get_trinucleotide_ref_from_fasta(row: pd.Series,
                                     ref_fasta: pyfaidx.Fasta) -> str:
    """
    Returns the trinucleotides (5' base, reference allele, 3' base) around the
    mutation described by the row.

    :param row: a pandas row of the MAF file.
    :param ref_fasta: an indexed FASTA of the reference genome.
    :return: trinucleotide context for the mutation described by the row.
    """
    pointer = int(row["chromosome_start"])
    # chromosome_start is 1-based, whereas pyfaidx slices are 0-based and
    # end-exclusive like Python sequences. The mutated base is therefore at index
    # pointer-1, its 5' neighbor at pointer-2, and the slice must stop at pointer+1.
    return ref_fasta[f"chr{row['chromosome']}"][(pointer-2):(pointer+1)].seq.upper()


def standardize_trinucleotide(trinucleotide_ref: str) -> str:
    """
    COSMIC signatures define mutations from a pyrimidine allele (C, T) to any
    other base (C>A, C>G, C>T, T>A, T>C, T>G). If a mutation in the MAF file
    is defined from a purine allele (A, G), then we infer the trinucleotide
    context in the complementary sequence, which would be from a pyrimidine
    allele due to purines and pyrimidines complementing each other in a
    double-stranded DNA.

    :param trinucleotide_ref: trinucleotide sequence seen in the reference genome.
    :return: a pyrimidine-centric trinucleotide sequence.
    """
    complement_seq = {
        'A': 'T',
        'C': 'G',
        'T': 'A',
        'G': 'C'
    }
    purines = ["A", "G"]
    if trinucleotide_ref[1] in purines:
        return f"{complement_seq[trinucleotide_ref[2]]}" \
               f"{complement_seq[trinucleotide_ref[1]]}" \
               f"{complement_seq[trinucleotide_ref[0]]}"
    else:
        return trinucleotide_ref


def standardize_substitution(ref_allele: str,
                             mut_allele: str) -> str:
    """
    COSMIC signatures define mutations from a pyrimidine allele (C, T) to any
    other base (C>A, C>G, C>T, T>A, T>C, T>G). If a mutation in the MAF file
    is defined from a reference purine allele (A, G), then we infer the substituted
    base in the complementary sequence, which would be from a pyrimidine
    allele due to purines and pyrimidines complementing each other in a
    double-stranded DNA.

    :param ref_allele: base in the reference genome.
    :param mut_allele: base in the mutated genome
    :return: substitution string from pyrimidine to any other base.
    """
    complement_seq = {
        'A': 'T',
        'C': 'G',
        'T': 'A',
        'G': 'C'
    }
    purines = ["A", "G"]
    if ref_allele in purines:
        return f"{complement_seq[ref_allele]}>{complement_seq[mut_allele]}"
    else:
        return f"{ref_allele}>{mut_allele}"


def add_instance_to_mutational_spectra(maf_df: pd.DataFrame,
                                       mutational_spectra: OrderedDict[str, list[int]],
                                       ref_fasta: pyfaidx.Fasta,
                                       index: int) -> None:
    """
    Parses each row in a MAF dataframe generated from an ICGC SSM dataset and tabulates a
    mutational spectra count matrix in the form of an ordered dictionary.

    :param maf_df: MAF dataframe generated from an ICGC SSM dataset.
    :param mutational_spectra: an ordered dictionary to tabulate the mutational spectra matrix.
    :param ref_fasta: an indexed reference genome.
    :param index: position of this sample in each list of counts of the mutational spectra.
    """
    nucleotide_bases = ["A", "C", "G", "T"]
    pyrimidine = ["C", "T"]

    for _, row in maf_df.iterrows():
        if((row["chromosome_start"] != row["chromosome_end"]) or
                (row["reference_genome_allele"] not in nucleotide_bases) or
                (row["mutated_to_allele"] not in nucleotide_bases)):
            continue
        trinucleotide_ref = standardize_trinucleotide(
            get_trinucleotide_ref_from_fasta(row, ref_fasta))
        substitution = standardize_substitution(row["reference_genome_allele"],
                                                row["mutated_to_allele"])

        # sanity checks
        try:
            assert (trinucleotide_ref is not None)
            assert (trinucleotide_ref[1] == substitution[0])
            assert (trinucleotide_ref[1] in pyrimidine)
            assert (substitution[0] in pyrimidine)
        except AssertionError:
            print(f"MAF row: {row['chromosome']}, "
                  f"{row['chromosome_start']}, "
                  f"{row['chromosome_end']}, "
                  f"{row['reference_genome_allele']}, "
                  f"{row['mutated_to_allele']}")
            print(f"FASTA context: {get_trinucleotide_ref_from_fasta(row, ref_fasta)}")
            print(f"Pyrimidine-centric context: {trinucleotide_ref}")
            raise

        mutational_spectra[f"{trinucleotide_ref[0]}[{substitution}]{trinucleotide_ref[2]}"][index] += 1


def write_mutational_spectra(mutational_spectra: OrderedDict,
                             sample_names: list[str],
                             filepath: Path) -> None:
    """
    Writes the mutational spectra matrix data, stored in an ordered dictionary, to a CSV file.

    :param mutational_spectra: mutational spectra matrix data stored in an ordered dictionary.
    :param sample_names: a list of names of the samples.
    :param filepath: name of the CSV file to save the data.
    """
    data = np.stack([np.array(mutational_spectra[substitution]) for substitution in mutational_spectra.keys()])
    index = pd.Series(
        data=mutational_spectra.keys(),
        name="Mutation Types"
    )
    mutational_spectra_df = pd.DataFrame(
        data=data,
        index=index,
        columns=sample_names,
        dtype=int,
    )
    mutational_spectra_df.to_csv(filepath, sep=",", index=True)


def convert_mafs_to_sbs_mutational_spectra(dir_mafs: Path,
                                           ref_fasta_filepath: Path,
                                           filepath_output: Path) -> None:
    """
    Converts all MAF files (one file per sample) in a directory into a mutational spectra
    matrix and saves it as a CSV file.

    :param dir_mafs: a directory containing MAF files.
    :param ref_fasta_filepath: filepath to the reference genome FASTA file.
    :param filepath_output: file path to save the mutational spectra CSV file.
    """
    maf_filepaths = natsorted(list(dir_mafs.glob("*")))
    n_samples = len(maf_filepaths)
    mutational_spectra = init_sbs_mutational_spectra(n_samples)
    ref_fasta = index_reference_genome(ref_fasta_filepath)
    donors = list()

    donor_index = 0
    for maf_filepath in maf_filepaths:
        data_maf = read_sbs_maf_file(maf_filepath)
        add_instance_to_mutational_spectra(data_maf, mutational_spectra, ref_fasta, donor_index)
        donors.append(maf_filepath.name)
        donor_index += 1
    write_mutational_spectra(mutational_spectra, donors, filepath_output)


def convert_maf_dirs_to_sbs_mutational_spectra(dir_maf_dirs: Path,
                                               ref_fasta_filepath: Path,
                                               dir_output: Path) -> None:
    """
    For each directory within the specified directory, this method iterates through all
    MAF files and creates a mutational spectra matrix and saves them as a CSV file.

    :param dir_maf_dirs: a directory of directories, each containing a set of MAF files.
    :param ref_fasta_filepath: filepath to the reference genome FASTA file.
    :param dir_output: directory to save the mutational spectra CSV files.
    """
    for dir_mafs in dir_maf_dirs.iterdir():
        if dir_mafs.is_dir():
            filepath_output = dir_output / f"{dir_mafs.name}.csv"
            convert_mafs_to_sbs_mutational_spectra(dir_mafs, ref_fasta_filepath, filepath_output)


In [18]:
dir_spectra = dir_data / "mutational_spectra_wgs"
if not dir_spectra.exists():
    dir_spectra.mkdir()

convert_maf_dirs_to_sbs_mutational_spectra(dir_maf_dirs, fpath_grch37, dir_spectra)


The mutational spectra matrices in the `mutational_spectra_wgs/` directory are used in mutational signature analysis.
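As a quick check before downstream analysis, a saved matrix can be read back with the mutation types as the index and one column per donor. The sketch below uses an in-memory stand-in for a file such as `mutational_spectra_wgs/BRCA-UK.csv`; it is truncated to 3 of the 96 rows, and the donor IDs and counts are made up.

```python
import io

import pandas as pd

# In-memory stand-in for a saved spectra CSV; donor IDs and counts are made up.
toy_csv = io.StringIO(
    "Mutation Types,DO_A,DO_B\n"
    "A[C>A]A,12,3\n"
    "A[C>A]C,7,0\n"
    "A[C>A]G,1,5\n"
)

spectra = pd.read_csv(toy_csv, index_col="Mutation Types")

# Rows are mutation types and columns are donors, so each column total is the
# number of single base substitutions counted for that donor.
totals = spectra.sum(axis=0)
print(spectra.shape)
print(totals)
```

For a real file, the same call would take the CSV path in place of `toy_csv`, and the matrix would be expected to have 96 rows.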