# Overview


In this notebook, we annotate transcription start sites (TSS) in the Cicero co-accessible peak data to produce the input for base-GRN construction.
- First, we pick up peaks around the transcription start site (TSS).
- Second, we merge the Cicero data with the peaks around the TSS.
- Then we remove peaks that have a weak connection to a TSS peak, so that the final product includes the TSS peaks and the peaks strongly connected to them. We use this information as active promoter/enhancer elements.


#### Although CellOracle supports the reference genomes of common model organisms, you may want to use a reference genome that is not in the supported list.

#### Here, we show how to use a custom TSS database to annotate peaks for a reference genome that CellOracle does not support by default.
Please look at the other notebook for the detailed process of making a custom TSS database for your species of interest.

# !! Caution !! This is NOT part of the CellOracle tutorial.
- This notebook uses CellOracle in an unusual way.
- The analysis may require expertise in Python and DNA sequence analysis. This notebook does not aim to explain them all; please use it at your own risk.


# 0. Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns


import os, sys, shutil, importlib, glob
from tqdm.notebook import tqdm

In [3]:
from celloracle import motif_analysis as ma
import celloracle as co
co.__version__

'0.8.3'

In [4]:
%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

# 1. Load ATAC-seq peak data

## 1.1. Load data

In [5]:
# Load scATAC-seq peak list.
peaks = pd.read_csv("all_peaks.csv", index_col=0)
peaks = peaks.x.values
peaks

FileNotFoundError: [Errno 2] File all_peaks.csv does not exist: 'all_peaks.csv'
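The peak IDs in this workflow are plain strings such as `chr10_100006139_100006389`. The sketch below is a small sanity check for that format, assuming every ID follows the `chromosome_start_end` pattern. `parse_peak` is a hypothetical helper written for this illustration, not a CellOracle function.

```python
def parse_peak(peak_id):
    """Split a peak ID such as 'chr10_100006139_100006389' into (chrom, start, end)."""
    # Split from the right so chromosome names that contain underscores survive.
    chrom, start, end = peak_id.rsplit("_", 2)
    start, end = int(start), int(end)
    if start >= end:
        raise ValueError(f"start must be smaller than end: {peak_id}")
    return chrom, start, end


# Example IDs taken from the Cicero table in this notebook.
example_peaks = ["chr10_100006139_100006389", "chr10_99774288_99774570"]
parsed = [parse_peak(p) for p in example_peaks]
print(parsed)
```

Running a check like this over the whole peak list before annotation can catch malformed IDs early.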

In [6]:
# Load cicero coaccess score.
cicero_connections = pd.read_csv("cicero_connections.csv", index_col=0)
cicero_connections.head()

Unnamed: 0,Peak1,Peak2,coaccess
1,chr10_100006139_100006389,chr10_99774288_99774570,-0.003546
2,chr10_100006139_100006389,chr10_99825945_99826237,-0.027536
3,chr10_100006139_100006389,chr10_99830012_99830311,0.009588
4,chr10_100006139_100006389,chr10_99833211_99833540,-0.008067
5,chr10_100006139_100006389,chr10_99941805_99941955,0.0
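As the rows above show, co-accessibility scores can be negative or zero as well as positive. The following sketch rebuilds those five rows as a toy DataFrame and counts the positive scores. On real data the same two lines could be run directly on `cicero_connections`.

```python
import pandas as pd

# Toy rows copied from the table above; the real file is much larger.
toy_connections = pd.DataFrame({
    "Peak1": ["chr10_100006139_100006389"] * 5,
    "Peak2": [
        "chr10_99774288_99774570",
        "chr10_99825945_99826237",
        "chr10_99830012_99830311",
        "chr10_99833211_99833540",
        "chr10_99941805_99941955",
    ],
    "coaccess": [-0.003546, -0.027536, 0.009588, -0.008067, 0.0],
})

# Count pairs with a strictly positive score and report the strongest one.
n_positive = int((toy_connections["coaccess"] > 0).sum())
max_score = float(toy_connections["coaccess"].max())
print(n_positive, max_score)
```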


# 2. Make TSS annotation
## IMPORTANT: Please make sure that you set the correct reference genome.
If your scATAC-seq data was generated with the mm10 reference genome, please set `ref_genome="mm10"`.

You can check the supported reference genomes using `ma.SUPPORTED_REF_GENOME`.

If your reference genome is not in the list, please send a request through the GitHub issue page.

In [3]:
ma.SUPPORTED_REF_GENOME

Unnamed: 0,species,ref_genome,provider
0,Human,hg38,UCSC
1,Human,hg19,UCSC
2,Mouse,mm10,UCSC
3,Mouse,mm9,UCSC
4,S.cerevisiae,sacCer2,UCSC
5,S.cerevisiae,sacCer3,UCSC
6,Zebrafish,danRer7,UCSC
7,Zebrafish,danRer10,UCSC
8,Zebrafish,danRer11,UCSC
9,Xenopus,xenTro2,UCSC
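A quick programmatic check can tell you whether a genome name appears in that table. The sketch below uses a hand-built stand-in with the same columns as the output above. `is_supported` is a hypothetical helper, and passing `ma.SUPPORTED_REF_GENOME` as the table assumes it is a DataFrame with a `ref_genome` column, as the output suggests.

```python
import pandas as pd

# Stand-in for ma.SUPPORTED_REF_GENOME, using a few rows from the table above.
supported = pd.DataFrame({
    "species": ["Human", "Human", "Mouse", "Mouse"],
    "ref_genome": ["hg38", "hg19", "mm10", "mm9"],
    "provider": ["UCSC"] * 4,
})


def is_supported(ref_genome, table):
    """Return True if ref_genome is listed in the table's ref_genome column."""
    return ref_genome in set(table["ref_genome"])


print(is_supported("mm10", supported))
print(is_supported("Cavpor3.0", supported))
```

A genome missing from the list, like `Cavpor3.0` here, is the case where the custom TSS database is needed.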


In [4]:

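tss_annotated = ma.get_tss_info(peak_str_list=peaks,
                                # NOTE: an argument starting with "c" is truncated here in the source;
                                # restore it (for example, the custom TSS file setting) before running.
                                ref_genome="Cavpor3.0",
                                ) ##!! Set reference genome here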


# Check results
tss_annotated.tail()

NameError: name 'peaks' is not defined

# 3. Integrate TSS info and cicero connections

The output file after the integration process has three columns: `["peak_id", "gene_short_name", "coaccess"]`.

- "peak_id" is either the TSS peak or a peak that has a connection with the TSS peak.
- "gene_short_name" is the name of the gene associated with the TSS.
- "coaccess" is the co-accessibility score between a peak and the TSS peak. A score of 1 means that the peak is the TSS peak itself.
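To make the three columns concrete, here is a toy pandas sketch of how such a table could be assembled. It is not CellOracle's implementation of `ma.integrate_tss_peak_with_cicero`. The connection rows are made up for illustration, and keeping the maximum score per peak-gene pair is an assumption.

```python
import pandas as pd

# One TSS peak annotated with its gene (toy data).
tss = pd.DataFrame({
    "peak_id": ["chr10_100015291_100017830"],
    "gene_short_name": ["Kitl"],
})

# Toy co-accessibility pairs; the TSS peak may sit in either column.
connections = pd.DataFrame({
    "Peak1": ["chr10_100015291_100017830", "chr10_100050858_100051762"],
    "Peak2": ["chr10_100018677_100020384", "chr10_100015291_100017830"],
    "coaccess": [0.146517, 0.069751],
})

# Add the swapped orientation so the TSS peak can always be matched on Peak1.
both = pd.concat(
    [connections, connections.rename(columns={"Peak1": "Peak2", "Peak2": "Peak1"})],
    ignore_index=True,
)

# Peaks linked to a TSS peak inherit that TSS peak's gene name.
linked = both.merge(tss, left_on="Peak1", right_on="peak_id")
linked = linked[["Peak2", "gene_short_name", "coaccess"]].rename(
    columns={"Peak2": "peak_id"}
)

# The TSS peak itself gets a score of 1.
tss_self = tss.assign(coaccess=1.0)

toy_integrated = (
    pd.concat([linked, tss_self], ignore_index=True)
    .groupby(["peak_id", "gene_short_name"], as_index=False)["coaccess"]
    .max()
)
print(toy_integrated)
```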

In [9]:
integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated, 
                                               cicero_connections=cicero_connections)
print(integrated.shape)
integrated.head()

(44309, 3)


Unnamed: 0,peak_id,gene_short_name,coaccess
0,chr10_100006139_100006389,Tmtc3,0.017915
1,chr10_100015291_100017830,Kitl,1.0
2,chr10_100018677_100020384,Kitl,0.146517
3,chr10_100050858_100051762,Kitl,0.069751
4,chr10_100052829_100053395,Kitl,0.20267


# 4. Filter peaks
Remove peaks with a weak coaccess score. Here we keep only peaks whose score is 0.8 or higher.

In [10]:
peak = integrated[integrated.coaccess >= 0.8]
peak = peak[["peak_id", "gene_short_name"]].reset_index(drop=True)

In [11]:
print(peak.shape)
peak.head()

(15779, 2)


Unnamed: 0,peak_id,gene_short_name
0,chr10_100015291_100017830,Kitl
1,chr10_100486534_100488209,Tmtc3
2,chr10_100588641_100589556,4930430F08Rik
3,chr10_100741247_100742505,Gm35722
4,chr10_101681379_101682124,Mgat4c
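The 0.8 cutoff used above is a choice, and it can help to see how many peaks survive at other cutoffs. The sketch below does this on the five toy rows shown in the integration output. The candidate thresholds are arbitrary values picked for illustration.

```python
import pandas as pd

# Toy rows copied from the integrated table shown earlier.
toy = pd.DataFrame({
    "peak_id": [
        "chr10_100006139_100006389",
        "chr10_100015291_100017830",
        "chr10_100018677_100020384",
        "chr10_100050858_100051762",
        "chr10_100052829_100053395",
    ],
    "gene_short_name": ["Tmtc3", "Kitl", "Kitl", "Kitl", "Kitl"],
    "coaccess": [0.017915, 1.0, 0.146517, 0.069751, 0.20267],
})

# Count how many rows pass each candidate threshold.
thresholds = [0.1, 0.5, 0.8]
kept = {t: int((toy["coaccess"] >= t).sum()) for t in thresholds}
print(kept)
```

On real data the same dictionary comprehension could be applied to `integrated` to judge how sensitive the final peak count is to the cutoff.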


# 5. Save data
Save the promoter/enhancer peaks.

In [12]:
peak.to_csv("processed_peak_file.csv")
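Because `to_csv` writes the row index as the first column by default, the file should be read back with `index_col=0`. The sketch below shows the round trip on a toy table written to a temporary directory. The file name matches the cell above, but the data here are only two example rows.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the filtered table above.
toy_peak = pd.DataFrame({
    "peak_id": ["chr10_100015291_100017830", "chr10_100486534_100488209"],
    "gene_short_name": ["Kitl", "Tmtc3"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_peak_file.csv")
    toy_peak.to_csv(path)                       # the index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)   # so read it back as the index

print(reloaded)
```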

**Please go to the next step: Transcription factor motif scan**

https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html