# Overview


In this notebook, we annotate the Cicero co-accessible peak data with transcription start sites (TSSs) to produce the input data for base-GRN construction.
- First, we pick up the peaks around transcription start sites (TSSs).
- Second, we merge the Cicero data with the peaks around the TSSs.
- Then, we remove peaks that have only a weak connection to a TSS peak, so that the final product contains the TSS peaks and the peaks that are strongly connected to them. We use this information as active promoter/enhancer elements.
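
The three steps above can be sketched with plain pandas on toy data. This is only an illustration of the idea, not CellOracle's implementation: the peak names, gene names, the `peak_id` column, and the 0.8 threshold are assumptions made for this example, while `Peak1`, `Peak2`, and `coaccess` follow the column names of a Cicero connections table.

```python
import pandas as pd

# Step 1 (toy result): peaks that overlap a TSS, with the gene they belong to.
# All values here are hypothetical.
tss_peaks = pd.DataFrame({
    "peak_id": ["chr1_100_200", "chr1_900_1000"],
    "gene_short_name": ["GeneA", "GeneB"],
})

# Toy Cicero-style connections: a pair of peaks and a co-accessibility score.
cicero = pd.DataFrame({
    "Peak1": ["chr1_100_200", "chr1_100_200", "chr1_900_1000"],
    "Peak2": ["chr1_500_600", "chr1_700_800", "chr1_500_600"],
    "coaccess": [0.95, 0.10, 0.85],
})

# Step 2: attach a gene to every connection whose Peak1 is a TSS peak.
merged = cicero.merge(tss_peaks, left_on="Peak1", right_on="peak_id", how="inner")

# Step 3: keep only the strongly connected peaks (threshold is an assumption).
COACCESS_THRESHOLD = 0.8
strong = merged.loc[
    merged["coaccess"] >= COACCESS_THRESHOLD,
    ["Peak2", "gene_short_name", "coaccess"],
].rename(columns={"Peak2": "peak_id"})

# The TSS peaks themselves are always kept; give them a score of 1.
tss_self = tss_peaks.assign(coaccess=1.0)
result = pd.concat([tss_self, strong], ignore_index=True)
print(result)
```

In this toy example the weakly connected peak `chr1_700_800` is dropped, and `chr1_500_600` is kept once for each gene it is strongly connected to.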


#### Although CellOracle supports basic model organisms and reference genomes, you may want to use a reference genome that is not in the list of supported reference genomes.

#### Here, we show how to use a custom TSS database to annotate peaks for a reference genome that CellOracle does not support by default.
Please look at the other notebook for the detailed process of making a custom TSS database for your species of interest.

# !! Caution !! This is NOT part of the CellOracle tutorial.
- This notebook uses CellOracle in an unusual way.
- The analysis may require expertise in Python and DNA sequence analysis. This notebook does not aim to explain all of it; please use it at your own risk.


# 0. Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns


import os, sys, shutil, importlib, glob
from tqdm.notebook import tqdm

In [2]:
from celloracle import motif_analysis as ma
import celloracle as co
co.__version__

'0.8.4'

In [3]:
%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# 1. Load ATAC-seq peak data

In [44]:
# Load the custom TSS bed file; its rows are used below to build an example peak list.
peaks = pd.read_csv("Cavpor3.0_tss_info.bed", delimiter="\t", header=None)


In [45]:
# Keep 100 rows as a small example.
peaks = peaks.iloc[100:200,:]

In [46]:
# Format each row as a "chr_start_end" peak string and store it in a single column "x".
peaks = peaks[0] + "_" + peaks[1].astype("str") + "_" + peaks[2].astype("str")
peaks = pd.DataFrame(peaks)
peaks = peaks.reset_index(drop=True)
peaks.columns = ["x"]

In [47]:
peaks.to_csv("peaks_example.csv")
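
The cells above encode each peak as a `chr_start_end` string. A minimal sketch of building and parsing such strings is shown below. The helper names `format_peak` and `parse_peak` and the scaffold name `chrUn_scaffold7` are hypothetical; they are not part of CellOracle. Splitting from the right matters because contig names in non-model genomes can themselves contain underscores.

```python
import pandas as pd

def format_peak(chrom, start, end):
    """Build a 'chr_start_end' peak string."""
    return f"{chrom}_{start}_{end}"

def parse_peak(peak_id):
    """Split a peak string back into (chrom, start, end).

    rsplit from the right so that chromosome names containing '_' stay intact.
    """
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)

# Toy BED-like table with integer column labels, as produced by header=None.
bed = pd.DataFrame({
    0: ["DS562855.1", "chrUn_scaffold7"],  # second name is hypothetical
    1: [17325102, 10],
    2: [17326202, 1110],
})

peak_ids = [format_peak(c, s, e) for c, s, e in zip(bed[0], bed[1], bed[2])]
parsed = [parse_peak(p) for p in peak_ids]
print(peak_ids)
print(parsed)
```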

## 1.1. Load data

In [49]:
# Load scATAC-seq peak list.
peaks = pd.read_csv("peaks_example.csv", index_col=0)
peaks = peaks.x.values
peaks

array(['DS562855.1_17325102_17326202', 'DS562855.1_17604709_17605809',
       'DS562855.1_17717213_17718313', 'DS562855.1_17793186_17794286',
       'DS562855.1_17850610_17851710', 'DS562855.1_17927367_17928467',
       'DS562855.1_17963096_17964196', 'DS562855.1_17981058_17982158',
       'DS562855.1_17984493_17985593', 'DS562855.1_18107565_18108665',
       'DS562855.1_18165123_18166223', 'DS562855.1_18244816_18245916',
       'DS562855.1_19083904_19085004', 'DS562855.1_19139148_19140248',
       'DS562855.1_18763492_18764592', 'DS562855.1_19261363_19262463',
       'DS562855.1_19432676_19433776', 'DS562855.1_19447039_19448139',
       'DS562855.1_19469701_19470801', 'DS562855.1_19545326_19546426',
       'DS562855.1_19600418_19601518', 'DS562855.1_20259647_20260747',
       'DS562855.1_20358849_20359949', 'DS562855.1_20487020_20488120',
       'DS562855.1_20483135_20484235', 'DS562855.1_20607536_20608636',
       'DS562855.1_20687876_20688976', 'DS562855.1_20687876_20688976',
      

# 2. Make TSS annotation

In [55]:
tss_annotated = ma.get_tss_info(peak_str_list=peaks,
                                ref_genome="Cavpor3.0",
                                custom_tss_file_path="Cavpor3.0_tss_info.bed")  # !! Set the custom TSS bed file here

# Check results
tss_annotated.head()

que bed peaks: 100
tss peaks in que: 130


| | chr | start | end | gene_short_name | strand |
|---|---|---|---|---|---|
| 0 | DS562855.1 | 17325102 | 17326202 | 7SK | + |
| 1 | DS562855.1 | 17604709 | 17605809 | KCNS2 | - |
| 2 | DS562855.1 | 17717213 | 17718313 | NIPAL2 | + |
| 3 | DS562855.1 | 17793186 | 17794286 | U6 | + |
| 4 | DS562855.1 | 17850610 | 17851710 | POP1 | - |
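
The log above reports more annotated rows (130) than query peaks (100). One plausible reading, which is an assumption and not confirmed here, is that some peaks are annotated with more than one gene. A minimal sketch of checking this on a table with the same columns is given below. It uses a toy DataFrame, and the gene name `GeneX` is hypothetical; in the notebook you would apply the same `groupby` to `tss_annotated`.

```python
import pandas as pd

# Toy table with the same columns as the annotation result.
annotated = pd.DataFrame({
    "chr": ["DS562855.1", "DS562855.1", "DS562855.1"],
    "start": [17325102, 17604709, 17604709],
    "end": [17326202, 17605809, 17605809],
    "gene_short_name": ["7SK", "KCNS2", "GeneX"],  # "GeneX" is hypothetical
    "strand": ["+", "-", "-"],
})

# Count the distinct genes assigned to each peak.
genes_per_peak = annotated.groupby(["chr", "start", "end"])["gene_short_name"].nunique()

# Peaks annotated with more than one gene.
multi = genes_per_peak[genes_per_peak > 1]

print(f"annotation rows: {len(annotated)}")
print(f"unique peaks: {len(genes_per_peak)}")
print(multi)
```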
