# Enrichment analysis
### Over representation analysis
* Simplified illustrative example
* Followed by real case
### GSEA
* Practical using a Python package

# Supervised machine learning
### Linear regression
### Logistic regression
### SVM
### Overfitting and Bias vs Variance -> Cross validation
### Dimensionality of the input -> PCA

# Unsupervised machine learning
### K-means clustering
* Using TCGA data, HR+/-

# Lab 2
# Pathway analysis

Let's pick up from where we left off in the last lab. By the end of that lab you had found interesting genes that were **differentially expressed** between two **clinically relevant** conditions.

You've also learned that one way to make more sense of these results is by performing **pathway analysis**, so let's do that.

As always, we start by importing the relevant libraries and loading the data.
For pathway analysis, we will mostly be working with the [gseapy](https://github.com/ostrokach/gseapy) library, which is essentially a Python wrapper for GSEA and Enrichr.



In [None]:
import numpy as np
import scipy as sp
import scipy.stats  # make sure the sp.stats submodule is loaded
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gseapy as gp
from gseapy.plot import gseaplot
import qvalue

from ipywidgets import interact, interact_manual
from ipywidgets import IntSlider, FloatSlider, Dropdown, Text

interact_enrich=interact_manual.options(manual_name="Enrichment analysis")
interact_plot=interact_manual.options(manual_name="Plot")
interact_calc=interact_manual.options(manual_name="Calculate tests")


In [None]:
clinical_data = pd.read_csv('data/brca_clin.tsv.gz', sep ='\t', index_col=2)
clinical_data = clinical_data.iloc[4:,1:]
expression_data = pd.read_csv('data/brca.tsv.gz', sep ='\t', index_col=1)
expression_data = expression_data.iloc[:,2:].T

## Over Representation Analysis

We begin with Enrichr, which performs **Over Representation Analysis (ORA)**.
You have learned that ORA compares two sets of genes and calculates how likely their overlap would be to occur by chance.

We will call the first set of genes the 'query set'. It consists of the genes found to be **most significantly differentially expressed**, and you will get a chance to define what that means.
Here, you can separate the query genes from the rest by thresholding on either the *p-value* or the *q-value*, together with the absolute log2 fold change.

The second gene set in the test is the **pathway**, and you will be able to select one of many pathway databases available online. Each database groups genes into sets according to different criteria.
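
To make the idea of an overlap "occurring by chance" concrete, here is a minimal sketch of an ORA p-value using a one-sided hypergeometric test. The function name `ora_p_value` and the toy gene names are made up for illustration, and this is a common way of scoring ORA rather than necessarily the exact computation Enrichr performs.

```python
from scipy.stats import hypergeom

def ora_p_value(query, pathway, background):
    """One-sided hypergeometric p-value for the overlap of query and pathway."""
    background = set(background)
    query = set(query) & background      # only count genes that were measured
    pathway = set(pathway) & background
    overlap = len(query & pathway)
    # P(X >= overlap) when drawing len(query) genes without replacement from
    # a background that contains len(pathway) pathway genes
    return hypergeom.sf(overlap - 1, len(background), len(pathway), len(query))

# Toy example with hypothetical gene names
background = [f"g{i}" for i in range(100)]
pathway = background[:10]                        # g0 .. g9
query = background[:5] + background[50:55]       # 5 of the 10 query genes are in the pathway
no_overlap_query = background[50:60]             # none of these are in the pathway

p_enriched = ora_p_value(query, pathway, background)
p_none = ora_p_value(no_overlap_query, pathway, background)
print(p_enriched, p_none)
```

With a random query of 10 genes we would expect about one gene to fall in the pathway, so an overlap of five gives a small p-value, while a query with no overlap gives a p-value of 1.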

Enrichment analysis interactive fields:
* **Pathway_DB**: Your choice of pathway database.
* **Statistic**: Which statistic to use for thresholding, p or q-value.
* **Threshold**: The statistical threshold.
* **Log2FC_threshold**: The log2 fold change threshold.
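
As a rough sketch of how the statistic and the two thresholds combine into a query set, the snippet below filters a small table of test results. The table values, gene names and the helper `select_query_set` are hypothetical; only the column names `p` and `log2fc` follow the layout used in this lab.

```python
import numpy as np
import pandas as pd

# Toy results table (made-up values) with one row per gene
toy_tests = pd.DataFrame(
    {"p": [0.001, 0.20, 0.03, 0.04], "log2fc": [1.5, 2.0, -0.2, -1.1]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

def select_query_set(tests, stat="p", threshold=0.05, log2fc_threshold=1.0):
    """Keep genes that are significant AND change by more than the fold change threshold."""
    keep = (tests[stat] < threshold) & (np.abs(tests["log2fc"]) > log2fc_threshold)
    return list(tests.index[keep])

query_set = select_query_set(toy_tests)
print(query_set)
```

Here geneB is dropped because it is not significant and geneC because its fold change is too small, so only geneA and geneD remain in the query set.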

Use the interactive code below to answer the following questions:

### Questions





In [None]:
def differential_test(clinical_df, expression_df, separator, cond1, cond2):
    """t-test and log2 fold change per gene between two clinical groups."""
    results = pd.DataFrame(columns = ['p','log2fc'])
    if separator not in clinical_df.columns:
        raise KeyError('Clinical condition wrong: ' + str(separator))
    # Sample indices for each of the two clinical groups
    group1 = clinical_df[separator] == cond1
    index1 = clinical_df[group1].index
    group2 = clinical_df[separator] == cond2
    index2 = clinical_df[group2].index

    expression1 = expression_df.loc[index1]
    expression2 = expression_df.loc[index2]
    
    for gene in expression_df.columns:
        p_val = sp.stats.ttest_ind(expression1[gene], expression2[gene]).pvalue
        fc = np.log2(np.mean(expression1[gene])/np.mean(expression2[gene]))
        # Skip genes for which the test returned NaN
        if not np.isnan(p_val):
            results.loc[gene,'p'] = p_val
            results.loc[gene, 'log2fc'] = fc
            
    return results

def plot_hist(stats, bins):
    stats = np.array(stats)
    plt.hist(stats, bins = bins)
    plt.show()


def interact_multiple_gene_ttest(Criteria, Group_1, Group_2):
    global BRCA_tests
    BRCA_tests = differential_test(clinical_data, expression_data, Criteria, Group_1, Group_2)
    BRCA_tests = qvalue.qvalues(BRCA_tests)
    plot_hist(BRCA_tests['p'].values, 20)
    with pd.option_context('display.max_rows', None):
        display(BRCA_tests)
        
def ORA(tests, threshold, log2fc_threshold, pathway_db=['KEGG_2019_Human'], stat = 'p'):
    # All tested genes form the background
    background=set(tests.index)
    # Query set: genes passing both the significance and the fold change thresholds
    gene_list = list(tests.loc[(tests[stat]<threshold) & (np.abs(tests['log2fc']) > log2fc_threshold), stat].index)
    print('Query set size: ' + str(len(gene_list)))

    enr=gp.enrichr(
                    gene_list=gene_list,
                    gene_sets=pathway_db,
                    background=background,
                    outdir = None
                )
    results = enr.results[["P-value","Overlap","Term"]].rename(columns={"P-value": "p"})
    return qvalue.qvalues(results)

pathway_db_choice = gp.get_library_name()

        
def interact_ORA(Pathway_DB, Statistic, Threshold, Log2FC_threshold):
    threshold = float(Threshold)
    log2fc_threshold = float(Log2FC_threshold)
    results = ORA(BRCA_tests, threshold, log2fc_threshold, Pathway_DB, stat = Statistic)
    with pd.option_context('display.max_rows', None):
        display(results)

In [None]:
interact_calc(interact_multiple_gene_ttest, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))
interact_enrich(interact_ORA, Threshold = '5e-2' , Pathway_DB = pathway_db_choice, Statistic=['p','q'], Log2FC_threshold = Text('0'))

## Enrichment Analysis

We then move on to another form of pathway analysis, dubbed **Enrichment Analysis**.
As opposed to ORA, enrichment analysis takes a ranked list of genes and a gene set as input, and asks how likely it is that the genes from the set are randomly distributed along the list.

This makes our life easier: there is no need to define a query set, and we already have our ranked list, since we ranked the genes by the statistical significance of their differential expression.
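
The question of whether a gene set is "randomly distributed along the list" can be illustrated with a simplified running-sum score. This is only a sketch of the unweighted, Kolmogorov-Smirnov-like statistic; the GSEA implementation wrapped by gseapy additionally weights genes by their ranking metric and estimates significance by permutation. The function `enrichment_score` and the gene names are hypothetical.

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hits = in_set.sum()
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Step up at every gene in the set, step down at every other gene
    steps = np.where(in_set, 1.0 / n_hits, -1.0 / n_miss)
    running = np.cumsum(steps)
    # The score is the largest deviation from zero, keeping its sign
    return running[np.argmax(np.abs(running))]

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]   # hypothetical ranked list
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})       # set concentrated at the top
es_spread = enrichment_score(ranked, {"g1", "g4", "g8"})    # set spread along the list
print(es_top, es_spread)
```

A set concentrated at one end of the list drives the running sum far from zero, whereas a set spread evenly along the list keeps it close to zero.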



### Questions

In [None]:
def gsea(tests, pathway_db = 'KEGG_2019_Human' ):
    pre_res = gp.prerank(rnk=tests['p'], 
                    gene_sets=pathway_db,
                    processes=4,
                    #permutation_num=100, # reduce number to speed up testing
                    outdir=None, format='png')
    return pre_res

def interact_gsea(Pathway_DB):
    global Results_gsea
    Results_gsea = gsea(BRCA_tests, pathway_db=Pathway_DB)
    with pd.option_context('display.max_rows', None):
        display(Results_gsea.res2d[['pval', 'fdr']])

In [None]:
interact_calc(interact_multiple_gene_ttest, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))
interact_enrich(interact_gsea, Pathway_DB = pathway_db_choice)

# Machine learning, SVMs

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
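
As a small illustration of the evaluation measures described at the link above, the sketch below derives accuracy, sensitivity and specificity from a 2x2 confusion matrix laid out with actual classes as rows and predicted classes as columns. The labels are made up, and treating group 2 (label 1) as the "positive" class is an assumption made only for this example.

```python
import numpy as np

# Hypothetical true and predicted labels (0 = group 1, 1 = group 2)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Confusion matrix: rows are actual classes, columns are predicted classes
matrix = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    matrix[actual, predicted] += 1

tn, fp = matrix[0]   # actual group 1: correctly / wrongly classified
fn, tp = matrix[1]   # actual group 2: wrongly / correctly classified

accuracy = (tp + tn) / matrix.sum()   # fraction of all samples classified correctly
sensitivity = tp / (tp + fn)          # fraction of actual group 2 samples found
specificity = tn / (tn + fp)          # fraction of actual group 1 samples found
print(matrix)
print(accuracy, sensitivity, specificity)
```

Looking at sensitivity and specificity separately matters when the two groups differ in size, since accuracy alone can look good for a classifier that mostly predicts the larger group.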

In [None]:
import sklearn as skl
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

interact_gen=interact_manual.options(manual_name="Initialize data")
interact_SVM=interact_manual.options(manual_name="Train SVM")

In [None]:
def split_data(clinical_df, expression_df, separator, cond1, cond2):
    """Build the feature matrix X and labels y (0 = group 1, 1 = group 2)."""
    if separator not in clinical_df.columns:
        raise KeyError('Clinical condition wrong: ' + str(separator))
    # Sample indices for each of the two clinical groups
    group1 = clinical_df[separator] == cond1
    index1 = clinical_df[group1].index
    group2 = clinical_df[separator] == cond2
    index2 = clinical_df[group2].index
    # Drop samples with missing expression values
    expression1 = expression_df.loc[index1].dropna()
    expression2 = expression_df.loc[index2].dropna()
    expression = pd.concat([expression1, expression2])
    X = expression.values
    y = np.append(np.repeat(0, len(expression1)), np.repeat(1, len(expression2)))
    # Report the number of samples actually used, after dropping missing values
    display(pd.DataFrame([len(expression1),len(expression2)], columns = ['Number of points'], index = ['Group 1', 'Group 2']))
    return X, y

def train_SVM(X, y, C=1, scale = False, max_iter = 1000):
    if scale:
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    clf = LinearSVC(C=C, max_iter=max_iter)
    clf.fit(X,y)
    return clf

def print_accuracy(X_train, y_train, X_test, y_test, clf):
    y_train_pred = clf.predict(X_train)
    ac_matrix_train = confusion_matrix(y_train, y_train_pred)
    y_test_pred = clf.predict(X_test)
    ac_matrix_test = confusion_matrix(y_test, y_test_pred)
    display(pd.DataFrame(np.concatenate((ac_matrix_train,ac_matrix_test), axis =1), columns = ["predicted G1 (training)","predicted G2 (training)", "predicted G1 (test)","predicted G2 (test)"],index=["actual G1","actual G2"]))
    
def plot_pca_variance(X, scale=False, ncomp = 1):
    if scale:
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    pca = PCA()
    pca.fit(X)
    plt.rcParams["figure.figsize"] = (20,10)
    sns.set(style='darkgrid', context='talk')
    plt.plot(np.arange(1,len(pca.explained_variance_ratio_)+1),np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('Number of components')
    plt.ylabel('Cumulative explained variance')
    
    plt.vlines(ncomp, 0, plt.gca().get_ylim()[1], color='r', linestyles = 'dashed')
    h = np.cumsum(pca.explained_variance_ratio_)[ncomp -1]
    plt.hlines(h, 0, plt.gca().get_xlim()[1], color='r', linestyles = 'dashed')
    plt.title(str(ncomp) + ' components, ' + str(round(h, 3)) + ' variance explained')
    plt.show()
    
def reduce_data(X, n, scale=True):
    if scale:
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    pca = PCA(n_components=n)
    Xr = pca.fit_transform(X)
    return Xr

In [None]:
def interact_split_data(Criteria, Group_1, Group_2):
    global BRCA_X, BRCA_y
    BRCA_X, BRCA_y = split_data(clinical_data, expression_data, Criteria, Group_1, Group_2)

interact_gen(interact_split_data, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))

In [None]:
def interact_SVM_1(Rescale, Data_split, Max_iterations):
    max_iter = int(Max_iterations)
    X_train, X_test, y_train, y_test = train_test_split(BRCA_X, BRCA_y, test_size=Data_split)
    clf = train_SVM(X_train, y_train, C=1, scale = Rescale, max_iter=max_iter)
    print_accuracy(X_train, y_train, X_test, y_test, clf)
    
interact_gen(interact_split_data, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))
interact_SVM(interact_SVM_1, Rescale = False, Data_split = FloatSlider(min=0,max=1,value=0.1, step = 0.05), Max_iterations = Text('1000'))

In [None]:
def interact_SVM_2(Rescale, Data_split, Max_iterations, C_parameter):
    max_iter = int(Max_iterations)
    C = float(C_parameter)
    X_train, X_test, y_train, y_test = train_test_split(BRCA_X, BRCA_y, test_size=Data_split)
    clf = train_SVM(X_train, y_train, C=C, scale = Rescale, max_iter=max_iter)
    print_accuracy(X_train, y_train, X_test, y_test, clf)

interact_gen(interact_split_data, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))
interact_SVM(interact_SVM_2, Rescale = False, Data_split = FloatSlider(min=0,max=1,value=0.1, step = 0.05), Max_iterations = Text('1000'), C_parameter = Text('1'))

In [None]:
def interact_PCA_plot(PCA_scaling, N_components):
    n_comp = int(N_components)
    plot_pca_variance(BRCA_X, scale=PCA_scaling, ncomp = n_comp)
    
def interact_SVM_3(Rescale, Data_split, Max_iterations, C_parameter, Reduce, PCA_scaling, N_components):
    max_iter = int(Max_iterations)
    n_comp = int(N_components)
    C = float(C_parameter)
    if Reduce:
        X = reduce_data(BRCA_X, n = n_comp, scale=PCA_scaling)
    else:
        X = BRCA_X
    X_train, X_test, y_train, y_test = train_test_split(X, BRCA_y, test_size=Data_split)
    clf = train_SVM(X_train, y_train, C=C, scale = Rescale, max_iter=max_iter)
    print_accuracy(X_train, y_train, X_test, y_test, clf)
    
interact_gen(interact_split_data, Criteria=Text('Surgical procedure first'), Group_1 = Text('Simple Mastectomy'), Group_2=Text('Lumpectomy'))
interact_plot(interact_PCA_plot, PCA_scaling = False, N_components = Text('1'))
interact_SVM(interact_SVM_3, Rescale = False, Data_split = FloatSlider(min=0,max=1,value=0.1, step = 0.05), Max_iterations = Text('1000'), C_parameter = Text('1'), Reduce = False, PCA_scaling = False, N_components = Text('1'))

# References

Weinstein, John N., et al. "The Cancer Genome Atlas Pan-Cancer analysis project." Nature Genetics 45.10 (2013): 1113-1120.

Patrício, Miguel, et al. "Using Resistin, Glucose, Age and BMI to Predict the Presence of Breast Cancer." BMC Cancer 18.1 (2018): 29. doi:10.1186/s12885-017-3877-1.