Note: One goal of ExploSig is to enable simultaneous analysis of mutational signatures and clinical/molecular datasets. The ExploSig Browser (http://explosig.lrgr.io) provides tools for initial exploration and simple statistics, with stratification and filtering of samples, but it does not currently support statistical tests. The NTHL1 case study we describe was performed with the ExploSig Browser. This notebook reproduces the steps of that visual exploration programmatically, using processed data obtained with the ExploSig Connect package, which enables further quantitative analysis.

In [1]:
import numpy as np
from scipy import stats

In [2]:
from explosig_connect import connect

projects = ['TCGA-BRCA_BRCA_mc3.v0.2.8.WXS']
signatures = ["COSMIC %d" % x for x in [1, 2, 3, 5, 6, 8, 13, 17, 18, 20, 26, 30]]
genes = ['NTHL1', 'BRCA1', 'BRCA2']

### Get processed data from ExploSig Server

In [None]:
conn = connect(empty=True, how=None)
exps_df = conn.get_exposures(projects, signatures, 'SBS')
gene_exp_df = conn.get_gene_expression_data(genes, projects)
gene_mut_df = conn.get_gene_mutation_data(genes, projects)
gene_cna_df = conn.get_copy_number_data(genes, projects)

### Filter and stratify samples

In [None]:
print(f"{exps_df.shape[0]} samples")

Split samples by BRCA1/2 mutation status, then restrict to samples with wildtype BRCA1 and wildtype BRCA2

In [None]:
BRCA_mut_samples = gene_mut_df.loc[(gene_mut_df["BRCA1"] != "None") | (gene_mut_df["BRCA2"] != "None")].index.values.tolist()
BRCA_wt_samples = gene_mut_df.loc[(gene_mut_df["BRCA1"] == "None") & (gene_mut_df["BRCA2"] == "None")].index.values.tolist()

print(f"{len(BRCA_mut_samples)} samples have mut BRCA1 and/or mut BRCA2")
print(f"{len(BRCA_wt_samples)} samples have wildtype BRCA1 and wildtype BRCA2")
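
The two filters above are logical complements, so the resulting sample lists should partition the cohort. A minimal sketch on a toy table, assuming (as the filter above does) that a wildtype gene is encoded as the string "None"; the sample IDs and mutation classes are made up for illustration:

```python
import pandas as pd

# Hypothetical toy mutation table; "None" marks a wildtype gene, as in the filter above.
toy_mut_df = pd.DataFrame(
    {
        "BRCA1": ["None", "Missense", "None", "None"],
        "BRCA2": ["None", "None", "Nonsense", "None"],
    },
    index=["S1", "S2", "S3", "S4"],
)

# A sample is wildtype only if both genes are "None".
is_wt = (toy_mut_df["BRCA1"] == "None") & (toy_mut_df["BRCA2"] == "None")

toy_wt_samples = toy_mut_df.index[is_wt].tolist()
toy_mut_samples = toy_mut_df.index[~is_wt].tolist()

# The two groups are disjoint and together cover every sample.
assert set(toy_wt_samples).isdisjoint(toy_mut_samples)
assert len(toy_wt_samples) + len(toy_mut_samples) == toy_mut_df.shape[0]
print(toy_wt_samples, toy_mut_samples)
```

Negating a single mask, rather than writing the complementary condition by hand, keeps the two groups consistent if the wildtype definition changes.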

In [None]:
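# Keep only samples that also appear in the exposures table; .loc raises a KeyError on missing labels.
BRCA_mut_exps_df = exps_df.loc[exps_df.index.intersection(BRCA_mut_samples), :]
BRCA_wt_exps_df = exps_df.loc[exps_df.index.intersection(BRCA_wt_samples), :]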

print(f"{BRCA_mut_exps_df.shape[0]} samples have mut BRCA1 and/or mut BRCA2, and also have exposures data")
print(f"{BRCA_wt_exps_df.shape[0]} samples have wildtype BRCA1 and wildtype BRCA2, and also have exposures data")
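
Samples with mutation calls do not necessarily have exposures data, and recent pandas versions raise a `KeyError` when `.loc` is given a list containing labels that are missing from the index. A minimal sketch of selecting only the shared samples, using made-up sample IDs and values:

```python
import pandas as pd

# Hypothetical toy exposures table; sample "S3" has no exposures data.
toy_exps_df = pd.DataFrame(
    {"COSMIC 3": [10.0, 250.0, 0.0]},
    index=["S1", "S2", "S4"],
)
toy_wt_samples = ["S1", "S3", "S4"]

# Keep only the requested samples that are present in the exposures index.
shared = toy_exps_df.index.intersection(toy_wt_samples)
toy_wt_exps_df = toy_exps_df.loc[shared, :]

print(f"{len(toy_wt_samples)} requested, {toy_wt_exps_df.shape[0]} with exposures data")
```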

Distribution of COSMIC 3 exposure values for BRCA1/2 mutant samples

In [None]:
BRCA_mut_cosmic_3_exposure_stats = BRCA_mut_exps_df.describe()["COSMIC 3"]
BRCA_mut_cosmic_3_exposure_stats.to_frame()

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype samples 

In [None]:
BRCA_wt_cosmic_3_exposure_stats = BRCA_wt_exps_df.describe()["COSMIC 3"]
BRCA_wt_cosmic_3_exposure_stats.to_frame()

### Stratification of BRCA1/2 wildtype samples by NTHL1 gene expression level

In [None]:
BRCA_wt_gene_exp_df = gene_exp_df.loc[BRCA_wt_samples, :]
BRCA_wt_NTHL1_gene_exp_sample_groups = BRCA_wt_gene_exp_df.groupby("NTHL1").groups
BRCA_wt_NTHL1_overexp_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_exp_sample_groups["Over"].values.tolist(), :]
BRCA_wt_NTHL1_nondiffexp_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_exp_sample_groups["Not differentially expressed"].values.tolist(), :]

print(f"{BRCA_wt_NTHL1_overexp_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with overexpression of NTHL1")
print(f"{BRCA_wt_NTHL1_nondiffexp_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with non-differential expression of NTHL1")
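
The stratification above relies on `groupby(...).groups`, which maps each category label to the index of the rows in that category. A minimal sketch on toy data, reusing the two expression labels from the cell above; the sample IDs and exposure values are hypothetical:

```python
import pandas as pd

# Hypothetical toy expression calls, using the same category labels as above.
toy_exp_df = pd.DataFrame(
    {"NTHL1": ["Over", "Not differentially expressed", "Over", "Not differentially expressed"]},
    index=["S1", "S2", "S3", "S4"],
)
toy_exps_df = pd.DataFrame(
    {"COSMIC 3": [120.0, 15.0, 90.0, 5.0]},
    index=["S1", "S2", "S3", "S4"],
)

# .groups maps each category label to the Index of samples in that category.
toy_groups = toy_exp_df.groupby("NTHL1").groups

# .get with an empty default avoids a KeyError when a category is absent.
over_samples = list(toy_groups.get("Over", []))
toy_over_df = toy_exps_df.loc[over_samples, :]
print(toy_over_df)
```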

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND overexpressed NTHL1 samples

In [None]:
BRCA_wt_NTHL1_overexp_cosmic_3_exposure_stats = BRCA_wt_NTHL1_overexp_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_overexp_cosmic_3_exposure_stats.to_frame()

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND non-differentially-expressed NTHL1 samples

In [None]:
BRCA_wt_NTHL1_nondiffexp_cosmic_3_exposure_stats = BRCA_wt_NTHL1_nondiffexp_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_nondiffexp_cosmic_3_exposure_stats.to_frame()

Welch's t-test (unequal variances) to compare the means of these two groups

In [None]:
NTHL1_gene_exp_ttest = stats.ttest_ind(
    BRCA_wt_NTHL1_overexp_cosmic_3_df["COSMIC 3"].values, 
    BRCA_wt_NTHL1_nondiffexp_cosmic_3_df["COSMIC 3"].values, 
    equal_var=False
)
print(f"p-value: {NTHL1_gene_exp_ttest.pvalue}")
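
Passing `equal_var=False` makes `stats.ttest_ind` perform Welch's t-test, which does not assume the two groups share a variance. The sketch below recomputes the statistic and p-value by hand on made-up values (not data from the ExploSig server) to show what the call does:

```python
import numpy as np
from scipy import stats

# Hypothetical exposure values for two groups (not real data).
group_a = np.array([120.0, 95.0, 150.0, 80.0, 110.0])
group_b = np.array([40.0, 60.0, 35.0, 55.0, 70.0, 45.0])

# Welch's statistic: difference in means over the unpooled standard error.
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, welch.statistic)
print(p_manual, welch.pvalue)
```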

ANOVA to compare the means of these two groups

In [None]:
NTHL1_gene_exp_anova = stats.f_oneway(
    BRCA_wt_NTHL1_overexp_cosmic_3_df["COSMIC 3"].values, 
    BRCA_wt_NTHL1_nondiffexp_cosmic_3_df["COSMIC 3"].values
)
print(f"p-value: {NTHL1_gene_exp_anova.pvalue}")
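
With only two groups, one-way ANOVA is equivalent to the pooled-variance t-test: the F statistic equals the squared t statistic and the p-values match. The Welch test used earlier can therefore give a different p-value. A short check on made-up values:

```python
import numpy as np
from scipy import stats

# Hypothetical exposure values for two groups (not real data).
group_a = np.array([120.0, 95.0, 150.0, 80.0, 110.0])
group_b = np.array([40.0, 60.0, 35.0, 55.0, 70.0, 45.0])

anova = stats.f_oneway(group_a, group_b)
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# With two groups, F equals the squared pooled t statistic, and the p-values agree.
print(anova.statistic, pooled.statistic ** 2)
print(anova.pvalue, pooled.pvalue, welch.pvalue)
```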

### Stratification of BRCA1/2 wildtype samples by NTHL1 copy number status

In [None]:
BRCA_wt_gene_cna_df = gene_cna_df.loc[BRCA_wt_samples, :]
BRCA_wt_NTHL1_gene_cna_sample_groups = BRCA_wt_gene_cna_df.groupby("NTHL1").groups
BRCA_wt_NTHL1_cna_n1_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_cna_sample_groups["-1"].values.tolist(), :]
BRCA_wt_NTHL1_cna_0_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_cna_sample_groups["0"].values.tolist(), :]
BRCA_wt_NTHL1_cna_p1_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_cna_sample_groups["1"].values.tolist(), :]
BRCA_wt_NTHL1_cna_p2_cosmic_3_df = BRCA_wt_exps_df.loc[BRCA_wt_NTHL1_gene_cna_sample_groups["2"].values.tolist(), :]

print(f"{BRCA_wt_NTHL1_cna_n1_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with NTHL1 copy number of -1 (hemizygous deletion)")
print(f"{BRCA_wt_NTHL1_cna_0_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with NTHL1 copy number of 0 (neutral)")
print(f"{BRCA_wt_NTHL1_cna_p1_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with NTHL1 copy number of 1 (gain)")
print(f"{BRCA_wt_NTHL1_cna_p2_cosmic_3_df.shape[0]} BRCA1/2 wildtype samples with NTHL1 copy number of 2 (high level amplification)")

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND NTHL1 copy number of -1 (hemizygous deletion) samples

In [None]:
BRCA_wt_NTHL1_cna_n1_cosmic_3_exposure_stats = BRCA_wt_NTHL1_cna_n1_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_cna_n1_cosmic_3_exposure_stats.to_frame()

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND NTHL1 copy number of 0 (neutral) samples

In [None]:
BRCA_wt_NTHL1_cna_0_cosmic_3_exposure_stats = BRCA_wt_NTHL1_cna_0_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_cna_0_cosmic_3_exposure_stats.to_frame()

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND NTHL1 copy number of 1 (gain) samples

In [None]:
BRCA_wt_NTHL1_cna_p1_cosmic_3_exposure_stats = BRCA_wt_NTHL1_cna_p1_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_cna_p1_cosmic_3_exposure_stats.to_frame()

Distribution of COSMIC 3 exposure values for BRCA1/2 wildtype AND NTHL1 copy number of 2 (high level amplification) samples

In [None]:
BRCA_wt_NTHL1_cna_p2_cosmic_3_exposure_stats = BRCA_wt_NTHL1_cna_p2_cosmic_3_df.describe()["COSMIC 3"]
BRCA_wt_NTHL1_cna_p2_cosmic_3_exposure_stats.to_frame()

ANOVA to compare the means of these four groups

In [None]:
NTHL1_gene_cna_anova = stats.f_oneway(
    BRCA_wt_NTHL1_cna_n1_cosmic_3_df["COSMIC 3"].values, 
    BRCA_wt_NTHL1_cna_0_cosmic_3_df["COSMIC 3"].values,
    BRCA_wt_NTHL1_cna_p1_cosmic_3_df["COSMIC 3"].values,
    BRCA_wt_NTHL1_cna_p2_cosmic_3_df["COSMIC 3"].values
)
print(f"p-value: {NTHL1_gene_cna_anova.pvalue}")
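
The same four-group comparison can be written more compactly by grouping a single table and unpacking the per-group arrays into `stats.f_oneway`. A minimal sketch on toy data; the column name `NTHL1_cna` and all values are hypothetical:

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: copy number call and COSMIC 3 exposure per sample.
toy_df = pd.DataFrame(
    {
        "NTHL1_cna": ["-1", "-1", "-1", "0", "0", "0", "1", "1", "1", "2", "2", "2"],
        "COSMIC 3": [150.0, 130.0, 170.0, 60.0, 80.0, 70.0,
                     50.0, 65.0, 55.0, 40.0, 45.0, 60.0],
    }
)

# One array of exposures per copy number group.
groups = [g["COSMIC 3"].values for _, g in toy_df.groupby("NTHL1_cna")]

# f_oneway accepts any number of groups as positional arguments.
compact = stats.f_oneway(*groups)
print(f"p-value: {compact.pvalue}")
```

This form avoids one variable per group and adapts automatically if a copy number category is missing from the data.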