# Annotate TSS

Before building the base GRN, we need to annotate the coaccessible peaks and filter for active promoter/enhancer elements. First, we identify the peaks around transcription start sites (TSS). We then merge the Cicero data with the TSS peak information and remove any peaks with weak connections to the TSS peaks. The filtered peak data therefore include only TSS peaks and peaks with strong TSS connections. These are the active promoter/enhancer elements for our base GRN.


Most of this code is from the CellOracle tutorial: https://morris-lab.github.io/CellOracle.documentation/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.html

### 1. Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns


import os, sys, shutil, importlib, glob
from tqdm.notebook import tqdm

%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

In [None]:
import warnings
warnings.filterwarnings('ignore')
from celloracle import motif_analysis as ma
import celloracle as co
co.__version__

### 2. Load scATAC peak data and peak connection data made with Cicero

In this notebook, we annotate and filter the output from Cicero. The input files are the data output from `make-cicero.R`.

Here, I use either the pan-cardiac Cicero connections or the cluster-by-cluster connections.

Below, I load the peaks and Cicero CSV files. Repeat this for each timepoint and for both the WT and KO conditions.

In [None]:
timepoint = "E9"
wt_or_ko = "WT"
peaks = pd.read_csv(f"./data/base_grn_outputs/{timepoint}/{wt_or_ko}_peaks.csv", index_col=0)
peaks = peaks.x.values
peaks

In [None]:
cicero_connections = pd.read_csv(f"./data/base_grn_outputs/{timepoint}/{wt_or_ko}_cicero_connections.csv", index_col=0)
cicero_connections
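
Because the same steps are repeated for every timepoint and condition, it can help to generate the file paths programmatically instead of editing `timepoint` and `wt_or_ko` by hand. The following is a minimal sketch, not part of the original workflow: `build_paths` and the `TIMEPOINTS` list are hypothetical, and the timepoints are assumed from the `E9` and `E75` values that appear in this notebook.

```python
from itertools import product
from pathlib import Path

BASE_DIR = Path("./data/base_grn_outputs")
TIMEPOINTS = ["E75", "E9"]   # assumption: adjust to the timepoints you actually have
CONDITIONS = ["WT", "KO"]


def build_paths(timepoint, condition, base_dir=BASE_DIR):
    """Return the input and output paths used for one timepoint/condition pair."""
    folder = base_dir / timepoint
    return {
        "peaks": folder / f"{condition}_peaks.csv",
        "cicero": folder / f"{condition}_cicero_connections.csv",
        "output": folder / f"{condition}_processed_peak_file.csv",
    }


# One entry per (timepoint, condition) combination.
runs = {
    (tp, cond): build_paths(tp, cond)
    for tp, cond in product(TIMEPOINTS, CONDITIONS)
}

for (tp, cond), paths in runs.items():
    print(tp, cond, paths["peaks"])
```

Each entry in `runs` can then be passed to the loading, annotation, filtering, and saving steps in this notebook.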

### 3. Annotate transcription start sites

Use the mm10 reference genome by setting `ref_genome="mm10"`. The supported reference genomes are listed in `ma.SUPPORTED_REF_GENOME`.

In [None]:
ma.SUPPORTED_REF_GENOME

In [None]:
# Workaround for an export issue in the Cicero peaks: 'chr19_24999500_25000000'
# was saved as 'chr19_24999500_2.5e+07', which causes an error.
# The index below applies only to the E75 peak file.
if timepoint == "E75":
    peaks[132381] = 'chr19_24999500_25000000'

tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm10")
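
The hardcoded index above repairs only one known entry. A more general option, sketched here under the assumption that the only damage is a coordinate written in scientific notation, is to normalize every peak name before calling `ma.get_tss_info`. The helper `repair_peak_name` is hypothetical and not part of CellOracle.

```python
def repair_peak_name(peak):
    """Rewrite coordinates such as '2.5e+07' as plain integers.

    Peak names are assumed to follow the 'chrom_start_end' pattern.
    Splitting from the right keeps chromosome names that contain
    underscores intact.
    """
    chrom, start, end = peak.rsplit("_", 2)
    fixed = []
    for coord in (start, end):
        if not coord.isdigit():
            value = float(coord)          # parses '2.5e+07' as 25000000.0
            if not value.is_integer():
                raise ValueError(f"Non-integer coordinate in peak name: {peak}")
            coord = str(int(value))
        fixed.append(coord)
    return "_".join([chrom, *fixed])


example_peaks = ["chr19_24999500_2.5e+07", "chr1_3000_3500"]
repaired = [repair_peak_name(p) for p in example_peaks]
print(repaired)
```

On the real data this could be applied as `peaks = np.array([repair_peak_name(p) for p in peaks])`. If the Cicero connections file contains the same malformed names, its peak columns would need the same treatment.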


In [None]:
tss_annotated.tail()

### 4. Integrate TSS info and cicero connections
The output of the integration step has three columns: `peak_id`, `gene_short_name`, and `coaccess`.

`peak_id` is either a TSS peak or a peak that has a connection to a TSS peak.

`gene_short_name` is the name of the gene associated with the TSS.

`coaccess` is the coaccessibility score between the peak and a TSS peak. A score of 1 means that the peak is itself a TSS peak.

In [None]:
integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated,
                                               cicero_connections=cicero_connections)
print(integrated.shape)
integrated.head()
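
To make the three output columns concrete, here is a rough conceptual sketch on made-up data. It is not CellOracle's actual implementation of `integrate_tss_peak_with_cicero`, which may handle details differently (for example, connections where the TSS peak is in the second column). The toy peak names, the gene name, and the `Peak1`/`Peak2` column layout are assumptions for illustration.

```python
import pandas as pd

# Toy TSS annotation: peaks overlapping a TSS, with their gene.
tss = pd.DataFrame({
    "peak_id": ["chr1_100_600"],
    "gene_short_name": ["GeneA"],
})

# Toy Cicero output: pairwise coaccessibility between peaks.
connections = pd.DataFrame({
    "Peak1": ["chr1_100_600", "chr1_100_600", "chr1_900_1400"],
    "Peak2": ["chr1_900_1400", "chr1_2000_2500", "chr1_2000_2500"],
    "coaccess": [0.9, 0.1, 0.7],
})

# Keep connections that start at a TSS peak and carry the gene name across.
linked = connections.merge(tss, left_on="Peak1", right_on="peak_id")
linked = linked[["Peak2", "gene_short_name", "coaccess"]].rename(
    columns={"Peak2": "peak_id"}
)

# The TSS peaks themselves are given a score of 1.
tss_self = tss.assign(coaccess=1.0)

toy_integrated = pd.concat([tss_self, linked], ignore_index=True)
print(toy_integrated)
```

In this toy table, the connection between the two non-TSS peaks is dropped, and each remaining row links a peak to the gene whose TSS it touches or is coaccessible with.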

### 5. Filter peaks
Remove peaks with weak coaccessibility scores, keeping only those with a score of at least 0.8.

In [None]:
peak = integrated[integrated.coaccess >= 0.8]
peak = peak[["peak_id", "gene_short_name"]].reset_index(drop=True)

In [None]:
print(peak.shape)
peak.head()
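
The 0.8 cutoff determines how many distal peaks are retained, so it can be worth checking how the number of surviving peaks changes with the threshold. The following is a small sketch on made-up data; the toy table and the helper `count_peaks_by_threshold` are illustrative assumptions, not part of CellOracle.

```python
import pandas as pd

# Toy table with the same columns as the integrated output.
toy = pd.DataFrame({
    "peak_id": ["chr1_100_600", "chr1_900_1400", "chr2_50_550", "chr2_700_1200"],
    "gene_short_name": ["GeneA", "GeneA", "GeneB", "GeneB"],
    "coaccess": [1.0, 0.85, 1.0, 0.3],
})


def count_peaks_by_threshold(df, thresholds):
    """Count the rows whose coaccess score is at or above each threshold."""
    return {t: int((df["coaccess"] >= t).sum()) for t in thresholds}


counts = count_peaks_by_threshold(toy, [0.2, 0.5, 0.8])
print(counts)
```

On real data, the same helper could be called on `integrated` to see how sensitive the result is to the choice of cutoff.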

### 6. Save data
Save the promoter/enhancer peaks.

In [None]:
peak.to_csv(f"./data/base_grn_outputs/{timepoint}/{wt_or_ko}_processed_peak_file.csv")