# Processing ICGC Data for Mutational Signatures Analysis

In [5]:
from pathlib import Path
import tarfile

import pandas as pd

In [2]:
data_dir = Path.cwd().parent / "data"

In [3]:
brca_projects = ["BRCA-EU", "BRCA-FR", "BRCA-UK", "BRCA-US"]
datatype = "ssm"
analysis_type = "WGS"

In [15]:
temp = pd.read_csv(data_dir / "BRCA-US_ssm_WGS.tsv.gz", sep="\t")

In [16]:
temp.head()

| Unnamed: 0 | icgc_mutation_id | icgc_donor_id | project_code | icgc_specimen_id | icgc_sample_id | matched_icgc_sample_id | submitted_sample_id | submitted_matched_sample_id | chromosome | chromosome_start | ... | experimental_protocol | sequencing_strategy | base_calling_algorithm | alignment_algorithm | variation_calling_algorithm | other_analysis_algorithm | seq_coverage | raw_data_repository | raw_data_accession | initial_data_release_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | MU25172 | DO1954 | BRCA-US | SP4265 | SA46776 | SA46883 | TCGA-B6-A0RU-01A-11D-A099-09 | TCGA-B6-A0RU-10A-01D-A099-09 | 3 | 48697354 | ... |  | WXS |  |  | TCGA-MC3 https://gdc.cancer.gov/about-data/pub... |  |  | GDC | TCGA-B6-A0RU-01A-11D-A099-09 |  |
| 1 | MU25172 | DO1954 | BRCA-US | SP4265 | SA46776 | SA46883 | TCGA-B6-A0RU-01A-11D-A099-09 | TCGA-B6-A0RU-10A-01D-A099-09 | 3 | 48697354 | ... |  | WXS |  |  | TCGA-MC3 https://gdc.cancer.gov/about-data/pub... |  |  | GDC | TCGA-B6-A0RU-01A-11D-A099-09 |  |
| 2 | MU25172 | DO1954 | BRCA-US | SP4265 | SA46776 | SA46883 | TCGA-B6-A0RU-01A-11D-A099-09 | TCGA-B6-A0RU-10A-01D-A099-09 | 3 | 48697354 | ... |  | WXS |  |  | TCGA-MC3 https://gdc.cancer.gov/about-data/pub... |  |  | GDC | TCGA-B6-A0RU-01A-11D-A099-09 |  |
| 3 | MU25172 | DO1954 | BRCA-US | SP4265 | SA46776 | SA46883 | TCGA-B6-A0RU-01A-11D-A099-09 | TCGA-B6-A0RU-10A-01D-A099-09 | 3 | 48697354 | ... |  | WXS |  |  | TCGA-MC3 https://gdc.cancer.gov/about-data/pub... |  |  | GDC | TCGA-B6-A0RU-01A-11D-A099-09 |  |
| 4 | MU25172 | DO1954 | BRCA-US | SP4265 | SA46800 | SA46919 | TCGA-B6-A0RU-01A-11D-A12L-09 | TCGA-B6-A0RU-10B-01D-A12L-09 | 3 | 48697354 | ... |  | WGS |  |  | PCAWG Consensus SNV-MNV caller |  |  |  | FI1261:FI1260 |  |


## Multiple entries for the same mutation in a donor

A simple somatic mutation (SSM) donor dataset from the ICGC portal can contain multiple records for the same variant in a donor. These records differ in the annotation fields `consequence_type`, `aa_mutation`, `cds_mutation`, `gene_affected`, and `transcript_affected`. The duplication results from annotation with [SnpEff](http://pcingola.github.io/SnpEff/), a genome variant annotation and effect prediction tool: each predicted effect of a variant appears as its own record.
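One way to confirm which fields vary between records of the same variant is to group by donor and mutation and count distinct values per column. The sketch below uses a small made-up table (`toy`) rather than the real ICGC file; the gene, transcript, and ID values are placeholders, and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for an ICGC SSM table. All values are illustrative placeholders.
toy = pd.DataFrame(
    {
        "icgc_mutation_id": ["MU1", "MU1", "MU1", "MU2"],
        "icgc_donor_id": ["DO1", "DO1", "DO1", "DO1"],
        "chromosome": ["3", "3", "3", "7"],
        "consequence_type": [
            "missense_variant",
            "synonymous_variant",
            "upstream_gene_variant",
            "intron_variant",
        ],
        "gene_affected": ["G1", "G1", "G2", "G3"],
        "transcript_affected": ["T1", "T2", "T3", "T4"],
    }
)

key = ["icgc_donor_id", "icgc_mutation_id"]

# Number of records for each (donor, mutation) pair.
records_per_variant = toy.groupby(key).size()

# Distinct values per column within each pair (NaN is ignored by nunique).
distinct_counts = toy.groupby(key).nunique()

# A column "varies" if at least one pair has more than one distinct value in it.
varying_columns = [
    col for col in distinct_counts.columns if (distinct_counts[col] > 1).any()
]

print(records_per_variant)
print(varying_columns)
```

Running the same check on the real table should list the annotation fields named above, plus any sample-level columns that differ when a variant was called in more than one sample.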

A single variant can have multiple functional effects (`consequence_type`). One reason is the presence of [multiple gene isoforms](https://en.wikipedia.org/wiki/Gene_isoform). These isoforms, although they come from the same locus, can differ in transcription start site, coding DNA sequence, and/or untranslated regions. As a result, [these gene isoforms can have different functions](https://en.wikipedia.org/wiki/Protein_isoform). In some isoforms the variant is transcribed and introduces a synonymous or missense mutation into the transcript. In others the variant is absent from the transcript but can still influence splice site recognition. For these reasons, the same variant in a donor can have multiple `transcript_affected` values for the same `gene_affected`.

Additionally, a variant can lie some distance upstream or downstream of another gene and influence its transcription. As a result, `gene_affected` can also differ between records for the same variant in a donor.
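If only the variant itself matters downstream and not its annotation, the duplicated records can be collapsed to one row per donor and mutation. The following is a minimal sketch under that assumption, again on a made-up table; the choice of `icgc_donor_id` and `icgc_mutation_id` as the key, and of keeping the first record, are illustrative and not necessarily what this notebook does later.

```python
import pandas as pd

# Annotation fields that differ between records of the same variant.
annotation_cols = [
    "consequence_type",
    "aa_mutation",
    "cds_mutation",
    "gene_affected",
    "transcript_affected",
]

# Toy stand-in for an ICGC SSM table. All values are illustrative placeholders.
toy = pd.DataFrame(
    {
        "icgc_mutation_id": ["MU1", "MU1", "MU1", "MU2"],
        "icgc_donor_id": ["DO1", "DO1", "DO1", "DO2"],
        "chromosome": ["3", "3", "3", "7"],
        "chromosome_start": [100, 100, 100, 200],
        "consequence_type": [
            "missense_variant",
            "synonymous_variant",
            "upstream_gene_variant",
            "intron_variant",
        ],
        "gene_affected": ["G1", "G1", "G2", "G3"],
        "transcript_affected": ["T1", "T2", "T3", "T4"],
    }
)

# Drop only the annotation columns that are actually present in the table.
present = [col for col in annotation_cols if col in toy.columns]

variants = (
    toy.drop(columns=present)
    # Keep the first record per (donor, mutation); other columns come from that row.
    .drop_duplicates(subset=["icgc_donor_id", "icgc_mutation_id"])
    .reset_index(drop=True)
)

print(variants)
```

Note that `drop_duplicates` with a `subset` keeps the first matching row, so any remaining column that differs between records (for example, sample identifiers) takes its value from that first row.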