Co-Essential GSEA Pipeline

Lara Brown

In [1]:
from datetime import datetime
from IPython.display import display, Markdown

todays_date = str(datetime.now().date())

display(Markdown(f'##### Last updated: {todays_date}'))

##### Last updated: 2023-06-01

## Overview

Run MAGeCK outputs through co-essential GSEA (gene set enrichment analysis).

## Requirements

You will need a conda environment with a working MAGeCK installation, access to JupyterLab, and the libraries imported below. You will also need the gene_summary.txt output of MAGeCK and/or a CSV file of corrected/adjusted LFCs derived from the MAGeCK output.
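
Before running the notebook, it can help to confirm that the environment provides what the requirements above list. The following is a minimal sketch, not part of the original pipeline: the helper `missing_packages` is a hypothetical name, and the check assumes the MAGeCK command-line tool is installed under the executable name `mageck`.

```python
import importlib.util
import shutil

# Packages imported later in this notebook (beyond the standard library).
REQUIRED_PACKAGES = ["pandas", "gseapy"]

def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")

# Assumption: the MAGeCK executable is on PATH as `mageck`.
if shutil.which("mageck") is None:
    print("The `mageck` executable was not found on PATH.")
```

A package reported as missing can be installed into the active conda environment before continuing.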

In [2]:
import sys
import glob
import pathlib
import re

import pandas as pd
import gseapy as gp

Convert MAGeCK gene summary to coessentiality GSEA input

In [3]:
import os
os.getcwd()

'/Users/lbb34/cassasherwood/cassa-sherwood-labs/mageck-gsea'

In [4]:
df = pd.read_csv("hepg2_corrected_lfc.csv")

In [5]:
df.columns

Index(['id', 'APOA1_LFC_residuals', 'APOE_LFC_residuals',
       'APOC3_LFC_residuals', 'AHSG_LFC_residuals', 'IGFBP1_LFC_residuals',
       'Mean_LFC_residuals', 'Median_LFC_residuals', 'Mean_APO_LFC_residuals'],
      dtype='object')
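
The cells above load the corrected-LFC CSV. For the other input named under Requirements, the raw gene_summary.txt, a conversion to the same `id` plus score layout might look like the sketch below. It assumes the tab-separated layout written by `mageck test`, with a gene column `id` and an LFC column `neg|lfc`; the function name `gene_summary_to_lfc` and the `label` argument are hypothetical, introduced only for illustration.

```python
import io

import pandas as pd

def gene_summary_to_lfc(source, label, lfc_col="neg|lfc"):
    """Read a MAGeCK gene summary and keep the gene id and one LFC column.

    `source` is a path or file-like object; `lfc_col` is assumed to exist
    in the file. The LFC column is renamed to "<label>_LFC".
    """
    summary = pd.read_csv(source, sep="\t")
    return summary[["id", lfc_col]].rename(columns={lfc_col: f"{label}_LFC"})

# Toy stand-in for a gene_summary.txt file (values are made up).
toy_summary = io.StringIO(
    "id\tnum\tneg|lfc\n"
    "GENE_A\t4\t-1.5\n"
    "GENE_B\t4\t0.25\n"
)
toy_df = gene_summary_to_lfc(toy_summary, label="TOY")
print(toy_df)
```

Note that the loop at the end of this notebook only picks up columns whose names end in `_residuals`, so a frame built this way would need its score column renamed accordingly, or the column filter adjusted.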

Parameters and output path of coessentiality GSEA

In [6]:
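d = 0.2 # module distance cut-off; must be a key of module_paths
PROCESSES = 4 # number of CPU threads to use
PERMUTATION_NUM = 100 # number of permutations for the GSEA null distribution
MIN_SIZE = 4 # minimum number of ranked genes a module needs to be tested
SEED = 0 # random seed for reproducible permutations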
module_paths = {
    0.1 : "./GSEA_module_files/GO_Annotated_Modules_d_0.1.csv",
    0.2 : "./GSEA_module_files/GO_Annotated_Modules_d_0.2.csv",
    0.3 : "./GSEA_module_files/GO_Annotated_Modules_d_0.3.csv",
    0.4 : "./GSEA_module_files/GO_Annotated_Modules_d_0.4.csv",
    0.5 : "./GSEA_module_files/GO_Annotated_Modules_d_0.5.csv",
    0.6 : "./GSEA_module_files/GO_Annotated_Modules_d_0.6.csv"
}
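
Since `d` selects one entry of `module_paths`, a quick check that the cut-off is known and that its file is present can catch configuration mistakes before the GSEA run starts. This is a sketch only; `resolve_module_path` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def resolve_module_path(d, module_paths):
    """Return the module file for cut-off `d`, or raise a descriptive error."""
    if d not in module_paths:
        known = ", ".join(str(k) for k in sorted(module_paths))
        raise KeyError(f"no module file configured for d={d}; known cut-offs: {known}")
    path = Path(module_paths[d])
    if not path.is_file():
        raise FileNotFoundError(f"module file for d={d} not found: {path}")
    return path

# An unknown cut-off is reported up front instead of failing later.
try:
    resolve_module_path(0.25, {0.2: "./GSEA_module_files/GO_Annotated_Modules_d_0.2.csv"})
except KeyError as err:
    print(err)
```

Because the dictionary keys are floats, `d` has to match a key exactly; a computed value such as `0.1 + 0.2` is not equal to `0.3` and would be rejected.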

Create gene set from all modules

In [7]:
def rnk_df(df, score_col, id_col="id"):
    rnk = df[[id_col, score_col]].rename(columns={id_col:0, score_col:1})
    rnk = rnk.dropna()
    rnk = rnk.sort_values(by = 1, ascending = False)
    return rnk

def gene_set(rnk, score_col, modules):
    
    gene_members = set(rnk[0].values) ## set of gene ids in column 0 of rnk
    gene_sets = {}
    ## for every module, keep only the member genes that appear in the ranking
    for index, row in modules.iterrows():
        members = row["Members"].split(" ")
        members = [gene for gene in members if gene in gene_members]
        gene_sets[f"module_{index}"] = members
    
    return gene_sets
    
def run_ce_gsea(rnk, gene_sets, GSEA_out_path):
    pre_res = gp.prerank(
        rnk=rnk,
        gene_sets = gene_sets,
        processes=PROCESSES, ## default is 1; higher values run permutations in parallel
        permutation_num=PERMUTATION_NUM, ## fewer permutations run faster; more (e.g. 1000) give finer p-value resolution
        outdir= GSEA_out_path,
        format='png',
        min_size = MIN_SIZE,
        seed=SEED
    )
    return pre_res

def process_gsea_results(modules, GSEA_out_path, final_out_path):
    report_path = pathlib.Path(GSEA_out_path) / "gseapy.gene_set.prerank.report.csv"
    out_df = pd.read_csv(report_path, index_col = "Term")
    
    ## index a copy of the modules by the same "module_<Cluster>" names used in gene_set(),
    ## leaving the caller's DataFrame unchanged
    annotated = modules.copy()
    annotated.index = pd.Index(
            ["module_" + str(x) if "module" not in str(x) else str(x) for x in modules.index],
            name = "Term"
        )
    GO_columns = [
        "Top GO Terms",
        "Top GO Term p-values",
        "Top GO FDRs",
        "Top GO Term Fold Enrichments"
    ]
    out_df = out_df.join(annotated[GO_columns])
    
    ## an existing results file is never overwritten; delete it first to regenerate
    if not pathlib.Path(final_out_path).is_file():
        out_df.to_csv(final_out_path)
        
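def modules_func(d):
    ## fail early with a clear message instead of an UnboundLocalError
    if d not in module_paths:
        raise ValueError(f"No module file configured for d = {d}; choose one of {sorted(module_paths)}")
    modules = pd.read_csv(module_paths[d], index_col = "Cluster")

    return modules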
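
For intuition about what `gp.prerank` scores inside `run_ce_gsea`, the running-sum enrichment statistic used by preranked GSEA can be sketched in plain Python. This is a simplified illustration with toy data, not gseapy's implementation: it omits the permutation test and the normalization to NES, and the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score for one gene set.

    `ranked` is a list of (gene, score) pairs sorted by score, descending.
    Walking down the list, each gene in the set raises the running sum in
    proportion to |score| ** weight, and each gene outside it lowers the
    sum by a constant step. The score is the largest deviation from zero.
    """
    members = set(gene_set)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in members)
    n_miss = sum(1 for g, _ in ranked if g not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must share some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for gene, score in ranked:
        if gene in members:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking (made-up scores), sorted from highest to lowest.
toy_ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
top_es = enrichment_score(toy_ranked, ["A", "B"])     # set concentrated at the top
bottom_es = enrichment_score(toy_ranked, ["E", "F"])  # set concentrated at the bottom
print(top_es, bottom_es)
```

A module whose members cluster at the top of the ranking gets a positive score, and one whose members cluster at the bottom gets a negative score. Filtering each module to genes present in the ranking, as `gene_set` does, keeps the hit and miss counts consistent with the ranked list.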


Run CE-GSEA for all columns/sets

In [8]:
%%time

for lfc_col in [c for c in df.columns.to_list() if c.endswith("_residuals")]:
    lfc_col_stripped = re.sub(r"_residuals$", "", lfc_col)
    print(lfc_col_stripped)
    
    GSEA_out_path = f"./CoessentialGSEA_output/{lfc_col_stripped}_output/" # path to GSEA plots and csv
    final_out_path = f"./CoessentialGSEA_output/{lfc_col_stripped}_results.csv" # path to final output
    pathlib.Path(GSEA_out_path).mkdir(parents=True, exist_ok=True)
    
    rnk = rnk_df(df, lfc_col)
    
    modules = modules_func(d)
    
    gene_sets = gene_set(rnk, lfc_col, modules)
    
    run_ce_gsea(rnk, gene_sets, GSEA_out_path)
    
    process_gsea_results(modules, GSEA_out_path, final_out_path)

APOA1_LFC
APOE_LFC
APOC3_LFC
AHSG_LFC

The order of those genes will be arbitrary, which may produce unexpected results.

IGFBP1_LFC
Mean_LFC

The order of those genes will be arbitrary, which may produce unexpected results.

Median_LFC
Mean_APO_LFC
CPU times: user 44.3 s, sys: 1.4 s, total: 45.8 s
Wall time: 38.9 s
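
Once the per-column results files exist, a natural next step is to pull out the modules that pass a significance cut-off. The sketch below is an assumption-laden example: the helper `significant_modules` is hypothetical, the column names `FDR q-val` and `NES` are assumed to match the report written by the installed gseapy version (check `results.columns` first), and the 0.25 threshold is only the conventional GSEA default.

```python
import pandas as pd

def significant_modules(results, fdr_col="FDR q-val", score_col="NES", threshold=0.25):
    """Keep rows at or below the FDR threshold, strongest |score| first."""
    missing = [c for c in (fdr_col, score_col) if c not in results.columns]
    if missing:
        raise KeyError(f"columns not found in results: {missing}")
    hits = results[results[fdr_col] <= threshold]
    return hits.sort_values(by=score_col, key=lambda s: s.abs(), ascending=False)

# Toy results table with made-up values, indexed like the pipeline output.
toy_results = pd.DataFrame(
    {"NES": [2.1, -2.5, 0.3], "FDR q-val": [0.01, 0.20, 0.90]},
    index=pd.Index(["module_1", "module_2", "module_3"], name="Term"),
)
toy_hits = significant_modules(toy_results)
print(toy_hits)
```

For real output, the same function could be applied to a frame loaded with `pd.read_csv(final_out_path, index_col="Term")`.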
