# Usage

## Dependencies
This project depends on working installations of:
* Biopython
* joblib
* MOODS
* numpy
* pandas
* pybedtools
* scikit-learn
* statsmodels
* tqdm

All dependencies are available via conda.

This notebook additionally depends on:
* wget
* zcat

### MOODS Motif scanning
Motif scanning functionality depends on the [MOODS Python module](https://github.com/jhkorhonen/MOODS/tree/master/python), which is available via [conda](https://anaconda.org/bioconda/moods). If you don't use conda, you may have to install it manually, since there is currently no PyPI package.


In [None]:
import pandas as pd
import numpy as np

import datetime
from timeit import default_timer as timer


## Generate scored and stratified fasta file from peak data
For each peak, output should appear as:
```
>sequence_name score stratum other_descriptive_text
SEQUENCESEQUENCESEQUENCE
```

In this example, `score` is set to `log2fc * (1 - pval)` using values from the peak file.  
The strata are derived from the GC content (%) of each sequence, rounded to the nearest 5%.  
If strata are unimportant, you may assign the same constant stratum to every sequence.  
If you already have such a file, skip to **Load data for motif enrichment**.  
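
The header format and scoring rule described above can be sketched in plain Python. This is a minimal illustration, not the code that `meirlop` runs: the helper names `gc_percent`, `gc_stratum`, and `scored_fasta_record` are hypothetical, and the example peak is made up.

```python
def gc_percent(sequence):
    """Return the GC content of a sequence as a percentage (0-100)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count('G') + sequence.count('C')
    return 100.0 * gc_count / len(sequence)


def gc_stratum(sequence, bin_width=5):
    """Round GC% to the nearest multiple of bin_width.

    Python's round() resolves exact ties to the even neighbour,
    so 62.5% lands in the 60 bin rather than the 65 bin.
    """
    return int(bin_width * round(gc_percent(sequence) / bin_width))


def scored_fasta_record(name, sequence, log2fc, pval):
    """Format one record as '>name score stratum' followed by the sequence."""
    score = log2fc * (1 - pval)
    stratum = gc_stratum(sequence)
    return f'>{name} {score} {stratum}\n{sequence}\n'


# Hypothetical peak: 6 of 10 bases are G or C, so GC% is 60
record = scored_fasta_record('peak_1', 'GGCCATATGC', log2fc=2.0, pval=0.25)
print(record)
```

To ignore strata, return a constant from `gc_stratum` so that every record carries the same value.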

### Download and locate reference sequence

In [None]:
%%bash
if [ ! -f genome.fa ]; then
    wget \
    ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz \
    -O genome.fa.gz
    zcat genome.fa.gz > genome.fa
fi


In [None]:
genome_fa_filename = 'genome.fa'

### Locate peak data

In [None]:
peaks_filename = 'differential_peaks.txt'

In [None]:
peaks_df = pd.read_table(peaks_filename)

### Annotate peaks with additional information

In [None]:
min_tag_limit = 10

peak_annotation_columns = ['chr', 'start', 'end', 'strand', 'log2fc', 'pval', 'min_tags']
peak_id_column = 'PeakID'

peak_annotation_df = (peaks_df[peaks_df['min_tags'] >= min_tag_limit]
                      .rename(columns = {peak_id_column: 'peak_id'})
                      [list(set(['peak_id'] + peak_annotation_columns))]
                      .rename(columns = {col: f'peak_{col}' 
                                         for col 
                                         in peak_annotation_columns})
                      .drop_duplicates())

peak_annotation_df = peak_annotation_df[['peak_id'] + [f'peak_{col}' for col in peak_annotation_columns]]


### Weight log2fc by 1-pval

In [None]:
peak_annotation_df['peak_weighted_log2fc'] = peak_annotation_df['peak_log2fc'] * (1 - peak_annotation_df['peak_pval'])

### Set score for each peak equal to the weighted log2fc

In [None]:
peak_annotation_df['peak_score'] = peak_annotation_df['peak_weighted_log2fc']
peak_annotation_bed_columns = ['peak_chr', 'peak_start', 'peak_end', 'peak_id', 'peak_score', 'peak_strand']

In [None]:
peak_annotation_df[peak_annotation_bed_columns].head()

### Extract peak sequence +/- 300 from peak center
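
Taking +/- 300 bases around the peak center yields a fixed 600-base window per peak. The sketch below shows one way to compute such a window. It assumes BED-style 0-based, half-open coordinates and clamps only at the chromosome start; `centered_window` is a hypothetical helper, and `get_centered_peak_sequences` may handle edge cases differently.

```python
def centered_window(start, end, sequence_length=600):
    """Return (window_start, window_end) of fixed length around the peak center."""
    center = (start + end) // 2
    half_length = sequence_length // 2
    # Clamp at the chromosome start so coordinates never go negative
    window_start = max(0, center - half_length)
    return window_start, window_start + sequence_length


print(centered_window(1000, 1200))  # (800, 1400)
print(centered_window(100, 200))    # (0, 600)
```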

In [None]:
sequence_length = 600

In [None]:
peak_annotation_df.head()

In [None]:
peak_annotation_bed_columns

### Write fasta file

In [None]:
from meirlop import get_centered_peak_sequences, get_gc_pct, get_gc_pct_bin, write_scored_fasta
peak_fasta_filename = 'peak_scores.fa'

# Extract the sequence centered on each peak; the genome file is closed automatically
with open(genome_fa_filename, 'r') as genome_fa_file:
    peak_sequence_dict, peak_sequence_bed_df = get_centered_peak_sequences(peak_annotation_df, 
                                                                           genome_fa_file = genome_fa_file, 
                                                                           sequence_length = sequence_length, 
                                                                           peak_bed_columns = peak_annotation_bed_columns)
peak_score_dict = peak_annotation_df.set_index('peak_id')['peak_score'].to_dict()

# GC% of each peak sequence, used later as a covariate
peak_gc_pct_dict = {peak_id: get_gc_pct(seq) 
                    for peak_id, seq 
                    in peak_sequence_dict.items()}

# Binned GC% of each peak sequence, written to the fasta as the stratum
peak_gc_pct_bin_dict = {peak_id: get_gc_pct_bin(seq) 
                        for peak_id, seq 
                        in peak_sequence_dict.items()}

# Write sequences, scores, and strata to the scored fasta file
with open(peak_fasta_filename, 'w') as peak_fasta_file:
    peak_fasta_string = write_scored_fasta(peak_sequence_dict, 
                                           peak_score_dict, 
                                           peak_fasta_file, 
                                           other_dicts = [peak_gc_pct_bin_dict])

In [None]:
! head {peak_fasta_filename}

## Load data for motif enrichment

### Load the scored fasta

In [None]:
from meirlop import read_scored_fasta, dict_to_df
with open(peak_fasta_filename, 'r') as scored_fasta_file:
    sequence_dict, score_dict, description_dict = read_scored_fasta(scored_fasta_file, description_delim = ' ')
strata_dict = {key: int(val[2]) for key, val in description_dict.items()}

score_df = dict_to_df(score_dict, 'peak_id', 'peak_score')
strata_df = dict_to_df(strata_dict, 'peak_id', 'peak_strata')

In [None]:
print(score_df.shape)
score_df.head()

In [None]:
print(strata_df.shape)
strata_df.head()

### Download and load motif matrices

In [None]:
%%bash
if [ ! -f JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt ]; then
    wget \
    --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" \
    http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt \
    -O JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt
fi

In [None]:
from meirlop import read_motif_matrices
known_motifs_filename = 'JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt'
with open(known_motifs_filename, 'r') as known_motifs_file:
    motif_matrix_dict, motif_consensus_dict = read_motif_matrices(known_motifs_file)

## Scan for motifs
Create a dictionary in which each key is a motif ID and each value is the list of peaks containing that motif.
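
The shape of such a dictionary can be illustrated with a toy example. This sketch assumes scan hits arrive as `(motif_id, peak_id)` pairs, which is a simplification for illustration. The motif and peak names are made up, and the real structure returned by `format_scan_results` may differ.

```python
from collections import defaultdict

# Hypothetical scan hits; a motif can hit the same peak more than once
scan_hits = [
    ('motif_A', 'peak_1'),
    ('motif_A', 'peak_2'),
    ('motif_B', 'peak_2'),
    ('motif_A', 'peak_1'),
]

# Collect peaks per motif in a set so repeated hits are counted once
motif_to_peaks = defaultdict(set)
for motif_id, peak_id in scan_hits:
    motif_to_peaks[motif_id].add(peak_id)

# Convert each set to a sorted list for stable, readable output
motif_peak_lists = {motif_id: sorted(peak_ids)
                    for motif_id, peak_ids in motif_to_peaks.items()}
print(motif_peak_lists)
```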

In [None]:
start = timer()
print(datetime.datetime.now())

from meirlop import format_scan_results, scan_motifs, get_background
# Scan the sequences loaded from the scored fasta, so this cell also works when fasta generation was skipped
scan_results_df, motif_peak_set_dict = format_scan_results(scan_motifs(motif_matrix_dict, 
                                                                       sequence_dict, 
                                                                       bg = get_background(''.join(sequence_dict.values())), 
                                                                       pval = 0.01, 
                                                                       pseudocount = 0.001, 
                                                                       window_size = 7))

end = timer()
runtime = end - start
print(f'{runtime} seconds')
print(datetime.datetime.now())

## Perform logistic regression analysis
Control for GC content (%) as a covariate.
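
To make "controlling for a covariate" concrete, here is a small pure-Python sketch. It assumes a model in which motif presence is the outcome and the peak score and GC% are predictors. That modeling choice, the toy data, and the helpers `standardize` and `fit_logistic` are assumptions for illustration, not a description of how `analyze_peaks_with_lr` works internally.

```python
import math


def standardize(values):
    """Center values on zero and scale them to unit standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def fit_logistic(features, labels, learning_rate=0.1, n_iter=5000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + list(row) for row in features]
    weights = [0.0] * len(rows[0])
    for _ in range(n_iter):
        gradient = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            linear = sum(w * x for w, x in zip(weights, row))
            predicted = 1.0 / (1.0 + math.exp(-linear))
            for j, x in enumerate(row):
                gradient[j] += (label - predicted) * x
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Toy data: motif presence tends to increase with score, GC% is a nuisance variable
scores = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
gc_pct = [40.0, 60.0, 45.0, 55.0, 50.0, 40.0, 60.0, 50.0]
has_motif = [0, 0, 0, 1, 0, 1, 1, 1]

features = list(zip(standardize(scores), standardize(gc_pct)))
intercept, score_coef, gc_coef = fit_logistic(features, has_motif)
print(f'score coefficient: {score_coef:.3f}, GC coefficient: {gc_coef:.3f}')
```

Because GC% enters the model as its own predictor, the score coefficient reflects the association between score and motif presence that remains after GC% is accounted for.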

In [None]:
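# Compute GC% from the loaded sequences, so this cell also works when fasta generation was skipped
from meirlop import get_gc_pct
covariates_df = dict_to_df({peak_id: get_gc_pct(seq) for peak_id, seq in sequence_dict.items()}, 'peak_id', 'peak_covariate')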

In [None]:
print(covariates_df.shape)
covariates_df.head()

In [None]:
start = timer()
print(datetime.datetime.now())

from meirlop import analyze_peaks_with_lr
# tqdm.tqdm_notebook is deprecated; tqdm.notebook.tqdm is its replacement
from tqdm.notebook import tqdm as tqdm_notebook

lr_results_df = analyze_peaks_with_lr(peak_score_df = score_df,
                                      peak_set_dict = motif_peak_set_dict,
                                      peak_covariates_df = covariates_df,
                                      padj_method = 'fdr_bh',
                                      min_set_size = 1,
                                      max_set_size = np.inf,
                                      n_jobs = 1, 
                                      progress_wrapper = tqdm_notebook)

end = timer()
runtime = end - start
print(f'{runtime} seconds')
print(datetime.datetime.now())

In [None]:
lr_results_df.head(20)

## Perform enrichment analysis with stratified permutations
We use an adaptation of GSEA prerank that uses GC content (rounded to the nearest 5%) as strata in permutation testing.

The arguments are:
* `nshuf`: how many times the algorithm shuffles peaks with equal scores, so that results are robust to the multiple valid orderings of peaks by score.
* `nperm`: how many permutations are made per shuffling of peaks with equal scores. The total number of null permutations is therefore `nshuf * nperm`.
* `n_jobs_perm`: how many processes are used to generate permutations.
* `n_jobs_ind`: how many processes are used to generate indicator variable matrices for enrichment calculations.
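
The idea of permuting within strata, and the resulting count of null permutations, can be sketched as follows. This is a simplified illustration that assumes scores are exchanged only among peaks sharing a stratum. It omits the shuffling of tied scores, and `stratified_permutation` is a hypothetical helper rather than part of `meirlop`.

```python
import random
from collections import defaultdict


def stratified_permutation(scores, strata, rng):
    """Shuffle scores only among peaks that share the same stratum."""
    peaks_by_stratum = defaultdict(list)
    for peak_id, stratum in strata.items():
        peaks_by_stratum[stratum].append(peak_id)
    permuted = {}
    for peak_ids in peaks_by_stratum.values():
        stratum_scores = [scores[peak_id] for peak_id in peak_ids]
        rng.shuffle(stratum_scores)
        permuted.update(zip(peak_ids, stratum_scores))
    return permuted


# Toy data: two GC strata with two peaks each
scores = {'peak_1': 1.5, 'peak_2': -0.5, 'peak_3': 2.0, 'peak_4': 0.1}
strata = {'peak_1': 40, 'peak_2': 40, 'peak_3': 60, 'peak_4': 60}

nshuf, nperm = 100, 10
rng = random.Random(1234)
null_permutations = [stratified_permutation(scores, strata, rng)
                     for _ in range(nshuf * nperm)]
print(len(null_permutations))  # 1000
```

With `nshuf = 100` and `nperm = 10`, the null distribution is built from 1000 permutations, and no score ever leaves its GC stratum.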

In [None]:
start = timer()
print(datetime.datetime.now())

from meirlop import analyze_peaks_with_prerank
# tqdm.tqdm_notebook is deprecated; tqdm.notebook.tqdm is its replacement
from tqdm.notebook import tqdm as tqdm_notebook

rs = np.random.RandomState(1234)

analysis_results = analyze_peaks_with_prerank(peak_score_df = score_df, 
                                              peak_set_dict = motif_peak_set_dict, 
                                              peak_strata_df = strata_df, 
                                              min_set_size = 1, 
                                              max_set_size = np.inf, 
                                              nperm = 10, 
                                              nshuf = 100, 
                                              rs = rs, 
                                              n_jobs_perm = 20, 
                                              n_jobs_ind = 1, 
                                              progress_wrapper = tqdm_notebook)

enrichment_score_results_df, shuffled_permuted_peak_data, peak_idx_to_peak_id = analysis_results

end = timer()
runtime = end - start
print(f'{runtime} seconds')
print(datetime.datetime.now())

In [None]:
enrichment_score_results_df.head(20)

In [None]:
enrichment_score_results_df[enrichment_score_results_df['fdr_sig'] == 1].shape