> We have already established that, using annotated HeLa m6A sites, we can observe changes in genes with m6A sites in HL-60 cells. To confirm these m6A sites, we performed MeRIP-seq in treated and untreated cells and observed a general increase in m6A levels upon treatment for a large number of annotated sites. Here, our goal is to independently analyze the MeRIP data without relying on HeLa annotations and use it to define a set of **treatment-induced hyper-methylation sites**. We will then assess the location and behaviour of these targets across the other datasets generated in this study.

## Test enrichment of treatment-induced hyper/hypo-methylation sites

### Goal
Here, I aim to define the hyper- and hypo-methylated genes as gene sets and test whether they are enriched across all genes. The input tables list genes with control vs. treated fold changes in RNA expression, RNA stability and translational efficiency.
### Steps 
1. Prepare inputs  
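    - Filtering genes with $\Delta$methylation (logFC) >= 1 and p-value < 0.01 as hyper-methylation sites
    - Filtering genes with $\Delta$methylation (logFC) <= -1 and p-value < 0.01 as hypo-methylation sites

Note: the fold-change threshold was lowered from 2 to 1, and a p-value cutoff is now included.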

In [13]:
import pandas as pd 
import numpy as np


def two_sided_mtyl(fcthr=1, pvthr=0.01):
    '''
    Split the HL-60 delta-methylation table into hyper- and hypo-methylated rows.
    fcthr: absolute logFC threshold; pvthr: p-value cutoff
    '''
    delta_mtyl = pd.read_csv('meRIP-seq/hl60_delta_mtyl_table.txt', sep='\t')
    significant = delta_mtyl.p_value < pvthr
    # hyper-methylation: logFC >= fcthr and p-value below the cutoff
    hyper = delta_mtyl[(delta_mtyl.logFC >= fcthr) & significant]
    # hypo-methylation: logFC <= -fcthr and p-value below the cutoff
    hypo = delta_mtyl[(delta_mtyl.logFC <= -fcthr) & significant]

    return hyper, hypo

    
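def write_gene_file(df, file_name):
    # drop the last three characters of each ID
    # (assumes a fixed-width suffix, presumably a two-digit version such as '.14')
    df = pd.DataFrame({'ensembl': [ens[:-3] for ens in df.ensembl.tolist()]}).drop_duplicates('ensembl')
    # one ID per line, without header or index column
    df.to_csv(file_name, sep='\t', index=False, header=False)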
    

hyper, hypo = two_sided_mtyl()
write_gene_file(hyper,'combined_analysis/hyper_mtyl.txt')
write_gene_file(hypo,'combined_analysis/hypo_mtyl.txt')
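
A side note on the `ens[:-3]` slice in `write_gene_file`: it only recovers the bare gene ID when the suffix is exactly three characters long. The sketch below assumes the `ensembl` column holds versioned Ensembl IDs (an assumption, and the version numbers here are purely illustrative) and compares slicing with splitting on the dot.

```python
import pandas as pd

# Illustrative versioned IDs; the version suffixes are made up for this example.
ids = pd.Series(['ENSG00000141510.17', 'ENSG00000012048.9', 'ENSG00000141510.17'])

# Fixed-width slicing: correct for '.17', but cuts into the ID for '.9'.
sliced = ids.str[:-3]

# Splitting on the dot keeps the full ID whatever the suffix length.
split = ids.str.split('.').str[0]

print(sliced.tolist())
print(split.drop_duplicates().tolist())
```

If every ID in the table carries a two-digit version, both approaches give the same result; splitting is only the safer choice when suffix lengths vary.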

In [17]:
%%bash
mkdir -p combined_analysis/

2. Use a [TEISER](https://github.com/goodarzilab/TEISER) script to run the enrichment test



In [None]:
%%bash

declare -a Genesets=('hyper_mtyl' 'hypo_mtyl')
declare -a Experiments=(
# Ribo-seq
'Ribo-seq/hl60_delta_te.txt'

## HL-60 RNA-seq 
# RNA expression
'RNA-seq/hl60-exp/6h_delta_exp.txt' 'RNA-seq/hl60-exp/72h_delta_exp.txt' 'RNA-seq/hl60-exp/120h_delta_exp.txt' 
# RNA stability  
'RNA-seq/hl60-stbl/120h_delta_stbl.txt'  'RNA-seq/hl60-stbl/6h_delta_stbl.txt'

## 5 other AML cell lines RNA-seq
# RNA expression
'RNA-seq/other-exp/kg1_delta_exp.txt' 'RNA-seq/other-exp/molm14_delta_exp.txt'
'RNA-seq/other-exp/ociaml2_delta_exp.txt' 'RNA-seq/other-exp/ociaml3_delta_exp.txt'
'RNA-seq/other-exp/thp1_delta_exp.txt'
# RNA stability  
'RNA-seq/other-stbl/kg1_delta_stbl.txt' 'RNA-seq/other-stbl/molm14_delta_stbl.txt' 
'RNA-seq/other-stbl/ociaml2_delta_stbl.txt' 'RNA-seq/other-stbl/ociaml3_delta_stbl.txt'
'RNA-seq/other-stbl/thp1_delta_stbl.txt'
)

for exp in "${Experiments[@]}"; do
    for geneset in "${Genesets[@]}"; do
    
#         echo $exp $geneset
        base=`basename $exp`
        base=${base/.txt/}
        
        # get intersect 
        awk 'NR==FNR{A[$1];next}$1 in A' $exp combined_analysis/${geneset}.txt > combined_analysis/${geneset}_${base}.txt
        
        perl $TEISERDIR/run_mi_gene_list.pl \
            --expfile=$exp \
            --genefile=combined_analysis/${geneset}_${base}.txt \
            --exptype=continuous \
            --ebins=7 \
            --species=human \
            --doremovedups=0 \
            --doremoveextra=0 &> combined_analysis/${geneset}_${base}.log
        # remove results from previous run 
        rm -fr combined_analysis/${geneset}_${base}_GENESET
        
        rm combined_analysis/${geneset}_${base}.txt
        mv ${exp}_GENESET combined_analysis/${geneset}_${base}_GENESET
        
#         echo 'done!'
        
    done 

done
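
For reference, the `awk` intersect step keeps only those gene set entries whose first column also appears in the first column of the experiment table. A rough Python equivalent could look like the sketch below; `intersect_gene_list` is a hypothetical helper and the toy files (including the header row) are made up, so this is an illustration rather than a replacement for the cell above.

```python
import os
import tempfile

def intersect_gene_list(exp_path, geneset_path, out_path):
    """Write the gene set lines whose first field occurs in column 1 of exp_path."""
    with open(exp_path) as exp:
        # First whitespace-separated field of every non-empty line, as in awk's $1.
        measured = {line.split()[0] for line in exp if line.strip()}
    with open(geneset_path) as geneset, open(out_path, 'w') as out:
        for line in geneset:
            fields = line.split()
            if fields and fields[0] in measured:
                out.write(line)

# Toy demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    exp_path = os.path.join(tmp, 'toy_delta_exp.txt')
    geneset_path = os.path.join(tmp, 'toy_geneset.txt')
    out_path = os.path.join(tmp, 'toy_intersect.txt')
    with open(exp_path, 'w') as f:
        f.write('gene\tlogFC\nG1\t0.5\nG2\t-1.2\n')
    with open(geneset_path, 'w') as f:
        f.write('G2\nG3\n')
    intersect_gene_list(exp_path, geneset_path, out_path)
    with open(out_path) as f:
        kept = f.read().split()

print(kept)
```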

3. Redraw heatmaps using `--min=-3 --max=3` thresholds for the plots that have a smaller signal range:

In [2]:
%%bash 
declare -a Genesets=('hyper_mtyl' 'hypo_mtyl')
declare -a Experiments=(
'6h_delta_stbl' '120h_delta_stbl' 
'kg1_delta_stbl' 'ociaml2_delta_stbl' 'molm14_delta_stbl' 
'ociaml3_delta_stbl' 'thp1_delta_stbl'
'hl60_delta_te'
)
for exp in "${Experiments[@]}"; do
    for geneset in "${Genesets[@]}"; do
#         echo $exp $geneset    
        perl $TEISERDIR/Scripts/teiser_draw_matrix.pl \
        --pvmatrixfile=combined_analysis/${geneset}_${exp}_GENESET/${exp}.txt.matrix \
        --summaryfile=combined_analysis/${geneset}_${exp}_GENESET/${exp}.txt.summary \
        --expfile=combined_analysis/${geneset}_${exp}_GENESET/${exp}.txt \
        --quantized=0 \
        --colmap=$TEISERDIR/Scripts/HEATMAPS/cmap_1.txt \
        --order=0 \
        --min=-3 --max=3 \
        --cluster=5 &>> combined_analysis/${geneset}_${exp}.log
        
#         echo 'done!'
        
    done 
done

4. Make `png` figures:

In [3]:
mkdir -p combined_analysis/plots/

In [2]:
%%bash 
for pdf in combined_analysis/*_GENESET/*.txt.summary.pdf; do 
    png=${pdf/.pdf/.png}
    di=`dirname $pdf`
    out=`basename $di`
    outpng=${out/_GENESET/.png}
    outpdf=${out/_GENESET/.pdf}
    
    bash /rumi/shams/abe/GitHub/Abe/my_scripts/pdf2png.sh $pdf
    cp $pdf combined_analysis/plots/$outpdf
    mv $png combined_analysis/plots/$outpng
done 

5. Write README.md draft
    - Write HTML code that links all plots into a `README.md`, to prepare a GitHub-friendly report
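
A minimal sketch of how this step could be scripted, assuming the PNG files sit in `combined_analysis/plots/` as produced in step 4. The helper `plots_to_readme`, the table layout and the image width are my own choices for illustration, not an existing script.

```python
import tempfile
from pathlib import Path

def plots_to_readme(root, plot_subdir='combined_analysis/plots', ncol=2, width=400):
    """Return an HTML table (valid inside README.md) linking every PNG under plot_subdir."""
    root = Path(root)
    pngs = sorted((root / plot_subdir).glob('*.png'))
    rows = []
    for i in range(0, len(pngs), ncol):
        cells = []
        for png in pngs[i:i + ncol]:
            # Paths relative to the repository root, so the links work on GitHub.
            rel = png.relative_to(root).as_posix()
            cells.append(
                f'<td align="center"><a href="{rel}">'
                f'<img src="{rel}" width="{width}"></a><br>{png.stem}</td>'
            )
        rows.append('<tr>' + ''.join(cells) + '</tr>')
    return '<table>\n' + '\n'.join(rows) + '\n</table>\n'

# Toy demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    plot_dir = Path(tmp) / 'combined_analysis' / 'plots'
    plot_dir.mkdir(parents=True)
    for name in ('hyper_mtyl_hl60_delta_te.png', 'hypo_mtyl_hl60_delta_te.png'):
        (plot_dir / name).touch()
    readme_text = plots_to_readme(tmp)

print(readme_text)
```

In the real project, the returned text would be written to `README.md` at the repository root.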

## Hyper-geometric test

### Goal
Here, I aim to take iPAGE results run on CRISPR screening scores and test them for hyper/hypo-methylation enrichment.

___
- Clean up iPAGE results with no signal

In [129]:
import pandas as pd
from glob import glob 

1. Read iPAGE results into python

In [None]:
# https://bioinformatics.stackexchange.com/questions/5400/how-to-convert-data-in-gmt-format-to-dataframe
# https://gseapy.readthedocs.io/en/latest/gseapy_tutorial.html

In [125]:
def read_gmt(PATH):
    with open(PATH) as gmt:
        lines = gmt.readlines()
        out = {}
        for line in lines:
            data = line.split('\t')
            name = data[0]
            url  = data[1]
            genes= data[2:]
            genes[-1] = genes[-1].split('\n')[0]

            out[name] = {}
            out[name]['url'] = url 
            out[name]['genes'] = genes
            
    return out


def read_page_index(PATH):
    with open(PATH) as raw:
        lines = raw.readlines()
        out = {}
        for line in lines:
            data = line.split('\t')
            gene = data[0]
            pathways  = data[1:]
            pathways[-1] = pathways[-1].split('\n')[0]

            out[gene] = {}
            out[gene]['pathways'] = pathways
    return out


def read_page_names(PATH):
    with open(PATH) as raw:
        lines = raw.readlines()
        out = {}
        for line in lines:
            data = line.split('\t')
            name0= data[0]
            name1= data[1]
            pw_type= data[2].split('\n')[0]
            out[name0] = [name1, pw_type]
    return out


def read_page_annotations(gs_name,ANNDIR='/flash/bin/iPAGEv1.0/PAGE_DATA/ANNOTATIONS/'):
    '''
    Read gene set annotations into python from PAGE_DATA format
    '''
    index = read_page_index(glob(ANNDIR+gs_name+'/*_index.txt')[0] )
    names = read_page_names(glob(ANNDIR+gs_name+'/*_names.txt')[0] )
    gmt = glob(ANNDIR+gs_name+'/*.gmt')
    
    annotations = {}
    annotations['index'] = index
    annotations['names'] = names
    if gmt:
        annotations['gmt'] = read_gmt(gmt[0])
    
    return annotations


def make_page_dict(PATH):
    '''
    PATH = a complete path to a pvmatrix.txt file, part of results from iPAGE run 
    Processes: 
    1) Read p-value matrix data into a data frame
    2) Include annotations for the gene set from the PAGE directory
    Output: Python dictionary containing the p-value matrix and the related gene set annotations
    '''
    ### 1 ### 
    # read pvmatrix.txt file 
    df = pd.read_csv(PATH, sep='\t',index_col=0)
    # collapse row names of the form 'NAME NAME' into 'NAME'
    if all([geneset.split(' ')[0] == geneset.split(' ')[1] for geneset in df.index.tolist()]):
        df.index = [geneset.split(' ')[0] for geneset in df.index.tolist() ]
        
    ### 2 ### 
    gs_name = PATH.split('/')[-2]
    ann = read_page_annotations(gs_name)

    out = {}
    out['gs_name'] = gs_name
    out['annotations'] = ann
    out['data'] = df
    
    return out

In [126]:
l_page = [make_page_dict(path) for path in glob('screen/CRISPRi_HL60_rho/*/pvmatrix.L.txt')]
r_page = [make_page_dict(path) for path in glob('screen/CRISPRi_HL60_rho/*/pvmatrix.R.txt')]
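
The nested dictionary returned by `read_gmt` is convenient for lookups, but a long table (one row per pathway/gene pair) is often easier to join against other data frames, which is what the first link above discusses. Below is a small sketch under the same format assumption as `read_gmt` (tab-separated name, url, then genes); `gmt_to_long_frame` and the toy content are hypothetical.

```python
import io
import pandas as pd

def gmt_to_long_frame(handle):
    """Parse GMT-style lines (name, url, gene1, gene2, ...) into a long DataFrame."""
    records = []
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue  # skip lines without any gene
        name, url, genes = fields[0], fields[1], fields[2:]
        records.extend({'pathway': name, 'url': url, 'gene': g} for g in genes if g)
    return pd.DataFrame(records, columns=['pathway', 'url', 'gene'])

# Toy GMT content, made up for this example.
toy_gmt = io.StringIO(
    'SET_A\thttp://example.org/a\tG1\tG2\n'
    'SET_B\thttp://example.org/b\tG2\tG3\tG4\n'
)
long_df = gmt_to_long_frame(toy_gmt)
print(long_df)
```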

2. Run the hypergeom test





Reference for the hypergeometric functions below: https://github.com/JohnDeJesus22/DataScienceMathFunctions/blob/master/hypergeometricfunctions.py#L38

In [160]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

def hypergeom_pmf(N, A, n, x):
    
    '''
    Probability Mass Function for Hypergeometric Distribution
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :param x: number of desired items in our draw of n items
    :returns: PMF computed at x
    '''
    Achoosex = comb(A,x)
    NAchoosenx = comb(N-A, n-x)
    Nchoosen = comb(N,n)
    
    return (Achoosex)*NAchoosenx/Nchoosen
    
    
def hypergeom_cdf(N, A, n, t, min_value=None):
    
    '''
    Cumulative Distribution Function for Hypergeometric Distribution
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :param t: number of desired items in our draw of n items up to t
    :returns: CDF computed up to t
    '''
    if min_value:
        return np.sum([hypergeom_pmf(N, A, n, x) for x in range(min_value, t+1)])
    
    return np.sum([hypergeom_pmf(N, A, n, x) for x in range(t+1)])


def hypergeom_plot(N, A, n):
    
    '''
    Visualization of Hypergeometric Distribution for given parameters
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :returns: Plot of Hypergeometric Distribution for given parameters
    '''
    
    x = np.arange(0, n+1)
    y = [hypergeom_pmf(N, A, n, x) for x in range(n+1)]
    plt.plot(x, y, 'bo')
    plt.vlines(x, 0, y, lw=2)
    plt.xlabel('# of desired items in our draw')
    plt.ylabel('Probabilities')
    plt.title('Hypergeometric Distribution Plot')
    plt.show()


In [None]:

hyper, hypo = [set(list(mtyl.name)) for mtyl in two_sided_mtyl()]

In [214]:
gmt = l_page[3]['annotations']['gmt']
pw = [gmt[pw]['genes'] for pw in gmt][0]

In [227]:
def intersection(lst1, lst2): 
    lst3 = [value for value in lst1 if value in lst2] 
    return lst3 

In [232]:
N = len(hyper)
A = len(intersection(hyper, pw))

In [247]:
help(hypergeom_cdf)

Help on function hypergeom_cdf in module __main__:

hypergeom_cdf(N, A, n, t, min_value=None)
    Cumulative Distribution Function for Hypergeometric Distribution
    :param N: population size
    :param A: total number of desired items in N
    :param n: number of draws made from N
    :param t: number of desired items in our draw of n items up to t
    :returns: CDF computed up to t



In [249]:
hypergeom_cdf(N,A,5,4, min_value=3)

6.206289814275411e-07
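
The call above is still exploratory. For the actual enrichment question, one common setup (an assumption on my side, not something fixed in this notebook yet) is: the population is the universe of tested genes, the "desired items" are the pathway genes, the draw is the hyper- or hypo-methylated gene set, and the p-value is the upper tail P(X >= overlap). The sketch below uses `scipy.stats.hypergeom` with toy gene names; `enrichment_pvalue` is a hypothetical helper.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(universe, geneset, pathway):
    """Return (overlap, P(X >= overlap)) for a gene set drawn from the universe."""
    universe = set(universe)
    geneset = set(geneset) & universe   # restrict both lists to the universe
    pathway = set(pathway) & universe
    overlap = len(geneset & pathway)
    # scipy argument order: sf(k, M, n, N) with M = population size,
    # n = successes in the population, N = number of draws.
    # sf(k - 1) gives P(X >= k).
    pvalue = hypergeom.sf(overlap - 1, len(universe), len(pathway), len(geneset))
    return overlap, pvalue

# Toy data: 20 genes in the universe, 5 in the gene set, 4 in the pathway.
universe = [f'G{i}' for i in range(1, 21)]
toy_hyper = ['G1', 'G2', 'G3', 'G4', 'G5']
toy_pathway = ['G1', 'G2', 'G3', 'G10']

overlap, pvalue = enrichment_pvalue(universe, toy_hyper, toy_pathway)
print(overlap, pvalue)
```

In the notation of `hypergeom_pmf`, this corresponds to `N = len(universe)`, `A = len(pathway)`, `n = len(geneset)` and summing the PMF from `x = overlap` up to `min(A, n)`.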

OK! I need to write something that can read index and names in iPAGE format, easy!