# Overview


In this notebook, we make the TSS annotation data that is used for scATAC-seq peak annotation.


- First, we download a gene annotation gff3 file from the Xenbase database.
- Second, we convert the gff3 file into a bed file. During this process, the TSS information is extracted.
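
The conversion described above can be sketched on a single record as follows. This is a minimal illustration, not the notebook's implementation: the helper name `gff3_line_to_tss` and the two example records are hypothetical, and the sketch assumes a tab-separated, nine-column GFF3 line that carries a `gene` attribute. Coordinates are passed through unchanged.

```python
def gff3_line_to_tss(line):
    """Return (seqname, tss, gene, strand) for one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    seqname, _source, _feature, start, end, _score, strand, _phase, attribute = fields

    # Parse "key=value;key=value" attributes into a dictionary.
    attrs = dict(item.split("=", 1) for item in attribute.split(";") if "=" in item)

    # The TSS is the 5' end of the transcript: start on "+", end on "-".
    tss = int(end) if strand == "-" else int(start)
    return seqname, tss, attrs.get("gene"), strand


# Hypothetical records, for illustration only.
plus = "Chr1L\tExample\tmRNA\t1000\t5000\t.\t+\t.\tID=rna1;gene=geneA"
minus = "Chr1L\tExample\tmRNA\t1000\t5000\t.\t-\t.\tID=rna2;gene=geneB"

print(gff3_line_to_tss(plus))   # ('Chr1L', 1000, 'geneA', '+')
print(gff3_line_to_tss(minus))  # ('Chr1L', 5000, 'geneB', '-')
```

The same transcript span yields a different TSS depending on the strand, which is why the strand column has to be carried through to the bed file.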


# !! Caution !!

## 1) This is NOT part of the CellOracle tutorial.
- This notebook includes unusual usage of CellOracle.
- The analysis may require expertise in Python and DNA sequence analysis. This notebook does not aim to explain them all, so please use it at your own risk.

## 2) This notebook was tested with Ensembl Guinea Pig data, but we do not guarantee that the functions work with other species or other databases.

- Please let us know through a GitHub issue if you have a problem with this notebook.
- We can construct TSS annotation data and add it to the CellOracle package. Please let us know if you have a request for a new reference genome.

# 0. Import libraries

In [1]:
import pandas as pd
import numpy as np

import os, sys
from tqdm.notebook import tqdm

from pybedtools import BedTool
import genomepy



In [2]:
import celloracle as co
from celloracle import motif_analysis as ma

co.__version__

'0.10.13'

# 1. Define custom functions to process gene annotation data.

Extract TSS information from the gff3 file and get a bed file.

In [3]:
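def parse_ens(x):
    # Parse a GFF3 attribute string ("key=value;key=value") into a dictionary.
    dic = {}
    for i in str(x).split(";"):
        if "=" in i:
            # Split on the first "=" only, so values containing "=" stay intact.
            key, val = i.split("=", 1)
            dic[key] = val
    return dic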

def get_tss_and_promoter_candidate_locus(data, n_downstream=500, n_upstream=500, clip_negative=True):
    # The TSS is the transcript start on the "+" strand and the transcript end on the "-" strand.
    data["TSS"] = data["start"]
    
    mRNA_in_reversed_strand = data.index[data["strand"] == "-"]
    data.loc[mRNA_in_reversed_strand, "TSS"] = \
        data.loc[mRNA_in_reversed_strand, "end"]
    
    data["promTSS_left"] = data["TSS"] - n_upstream
    data["promTSS_right"] = data["TSS"] + n_downstream
    
    data.loc[mRNA_in_reversed_strand, "promTSS_left"] = \
        data.loc[mRNA_in_reversed_strand, "TSS"] - n_downstream
    data.loc[mRNA_in_reversed_strand, "promTSS_right"] = \
        data.loc[mRNA_in_reversed_strand, "TSS"] + n_upstream
    
    if clip_negative:
        data.loc[data.index[data.promTSS_left < 0], "promTSS_left"] = 0
    
    return data

def merge_overlapping_peaks(df_):
    
    gene_symbol = df_.gene_symbol.unique()
    assert(len(gene_symbol) == 1)
    
    strand = df_.strand.unique()
    assert(len(strand) == 1)

    df_bt = BedTool.from_dataframe(df_).sort()
    df_ = df_bt.merge(d=0).to_dataframe()
    df_["gene_symbol"] = gene_symbol[0]
    df_["score"] = "."
    df_["strand"] = strand[0]
    df_ = df_.rename(columns={"chrom": "seqname", "start": "promTSS_left", "end":"promTSS_right"})
    
    return df_

def load_and_process_ensembl_gff3_file(file, n_downstream=100, n_upstream=1000, clip_negative=True):
    # Load gff file. Comment rows and blank lines are skipped.
    lines = []
    with open(file, "r") as f:
        for l in f:
            if l.startswith("#") or not l.strip():
                continue
            lines.append(l.rstrip("\n").split("\t"))
    df = pd.DataFrame(lines)

    # Data format adjustment 1
    df.columns = ["seqname", "source", "feature", "start", "end", "score",
                  "strand", "frame", "attribute"]

    df["start"] = df["start"].astype("int")
    df["end"] = df["end"].astype("int")


    # Data format adjustment 2
    ## The attribute column includes detailed information. Extract it and store it as new columns.
    annot = pd.DataFrame([parse_ens(i) for i in tqdm(df["attribute"])])
    df = pd.concat([df, annot], axis=1)


    # Split data into gene entry and transcript entry.
    # .copy() avoids pandas' SettingWithCopyWarning when columns are added below.
    df_gene = df[df.feature == "gene"].copy()
    df_gene["gene_symbol"] = df_gene.gene

    df_transcript = df[df.feature == "mRNA"].copy()
    df_transcript["gene_symbol"] = df_transcript.gene


    # Add PromoterTSS location. 
    df_transcript = get_tss_and_promoter_candidate_locus(df_transcript, 
                                         n_downstream=n_downstream, n_upstream=n_upstream, clip_negative=clip_negative)

    # Wrap up necessary information.
    result = df_transcript[["seqname", "promTSS_left", "promTSS_right",
                        "gene_symbol", "score", "strand"]]
    
    # Merge overlapping peaks
    li = []
    for i in tqdm(result.gene_symbol.unique()):
        df_ = result[result.gene_symbol == i]
        if len(df_) == 1:
            li.append(df_)
        else:
            li.append(merge_overlapping_peaks(df_))
    result_merged = pd.concat(li, axis=0)

    return result_merged
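
The window arithmetic and the per-gene merging used by the functions above can be illustrated without pandas or bedtools. This is a minimal sketch under stated assumptions: `promoter_window` and `merge_intervals` are hypothetical helper names, the TSS positions are made up, and `merge_intervals` is a pure-Python stand-in that merges overlapping or touching intervals. It is meant to approximate what `BedTool.merge(d=0)` does in the notebook, not to reproduce the bedtools implementation.

```python
def promoter_window(tss, strand, n_upstream=1000, n_downstream=100):
    """Return (left, right) of a strand-aware window around a TSS."""
    if strand == "-":
        # On the reverse strand, upstream lies at larger coordinates.
        left, right = tss - n_downstream, tss + n_upstream
    else:
        left, right = tss - n_upstream, tss + n_downstream
    return max(left, 0), right  # clip negative coordinates to 0


def merge_intervals(intervals):
    """Merge overlapping or touching (left, right) intervals."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


# Hypothetical TSS positions of three transcripts of one "+" strand gene.
windows = [promoter_window(tss, "+") for tss in (1500, 1800, 9000)]
print(windows)                    # [(500, 1600), (800, 1900), (8000, 9100)]
print(merge_intervals(windows))   # [(500, 1900), (8000, 9100)]
print(promoter_window(600, "-"))  # (500, 1600)
print(promoter_window(300, "+"))  # (0, 400)
```

Merging is done per gene so that nearby alternative TSSs of the same gene become one region, while regions belonging to different genes stay separate even if they overlap.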

# 2. Install reference genome first.

We use genomepy to get the genomic DNA sequence.
The first step is to install the reference genome data.

We will use the genomepy function
`genomepy.install_genome()`.

We need (1) the reference genome name and (2) the provider.

Please see genomepy's documentation for more information: https://pypi.org/project/genomepy/


In [4]:
# Search for reference genome name and provider
!genomepy search "Xenopus laevis"

name                    provider    accession          species                                      tax_id    other_info
xenLae2                 UCSC        GCA_001663975.1    Xenopus laevis                               8355      Aug. 2016 (Xenopus_laevis_v2/xenLae2)
Xenopus_laevis_v2       NCBI        GCA_001663975.1    Xenopus laevis                               8355      International Xenopus Sequencing Consortium
ViralProj30173          NCBI        GCA_000875345.1    Xenopus laevis endogenous retrovirus Xen1    204873    NCBI
Xenopus_laevis_v10.1    NCBI        GCA_017654675.1    Xenopus laevis                               8355      International Xenopus Sequencing Consortium
 ^
 Use name for genomepy install

In [5]:
# Install the reference genome. You can skip this step if it is already installed.
ref_genome = "Xenopus_laevis_v10.1"
provider = "NCBI"
genomepy.install_genome(ref_genome, provider)

In [6]:
# Check reference genome installation status
genome_installation = ma.is_genome_installed(ref_genome=ref_genome)
genome_installation

True

# 3. Download the genome annotation file (gff3) from the Xenbase server
https://www.xenbase.org/


In [7]:
!wget https://ftp.xenbase.org/pub/Genomics/JGI/Xenla10.1/XENLA_10.1_GCF_XBmodels.gff3
#!wget https://ftp.xenbase.org/pub/Genomics/JGI/Xenla10.1/XENLA_10.1_GCF.gff3    

--2022-12-27 14:05:01--  https://ftp.xenbase.org/pub/Genomics/JGI/Xenla10.1/XENLA_10.1_GCF_XBmodels.gff3
Resolving ftp.xenbase.org (ftp.xenbase.org)... 136.159.155.151
Connecting to ftp.xenbase.org (ftp.xenbase.org)|136.159.155.151|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 614234189 (586M) [text/gff3]
Saving to: ‘XENLA_10.1_GCF_XBmodels.gff3’


2022-12-27 14:05:23 (28.0 MB/s) - ‘XENLA_10.1_GCF_XBmodels.gff3’ saved [614234189/614234189]
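
Before processing, it can help to confirm that the downloaded file contains the feature types the loader in section 1 relies on, namely `mRNA` rows. The sketch below counts the values in the third GFF3 column. The function name `count_feature_types` and the sample lines are hypothetical; for a real file you would pass the open file handle instead of the list.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 feature types (column 3), skipping comment and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Hypothetical sample. With a real file: with open(path) as f: count_feature_types(f)
sample = [
    "##gff-version 3\n",
    "Chr1L\tExample\tgene\t1000\t5000\t.\t+\t.\tID=gene1;gene=geneA\n",
    "Chr1L\tExample\tmRNA\t1000\t5000\t.\t+\t.\tID=rna1;gene=geneA\n",
    "Chr1L\tExample\texon\t1000\t1200\t.\t+\t.\tID=exon1;gene=geneA\n",
]
counts = count_feature_types(sample)
print(counts["mRNA"])  # 1
```

If the count for `mRNA` is zero, the transcript filter in `load_and_process_ensembl_gff3_file` would select no rows, and the feature name would need to be adapted to the annotation source.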



# 4. Process data to get TSS file.

In [8]:
# Load and process gff3 file.

file = "XENLA_10.1_GCF_XBmodels.gff3"
result = load_and_process_ensembl_gff3_file(file, n_downstream=100, 
                                            n_upstream=1000)


In [9]:
result

Unnamed: 0,seqname,promTSS_left,promTSS_right,gene_symbol,score,strand
0,Chr1L,45901,47001,LOC108704873,.,+
41,Chr1L,124971,126071,LOC108704861,.,+
0,Chr1L,165766,166866,dok1.L,.,-
1,Chr1L,171675,172775,dok1.L,.,-
78,Chr1L,182964,184064,mrps26.L,.,-
...,...,...,...,...,...,...
0,Chr9_10S,117182648,117184109,mgrn1.S,.,+
0,Chr9_10S,117202519,117203908,ccdc78.S,.,+
2145038,Chr9_10S,117230153,117231253,fbxl16.S,.,-
2145055,Sca23,10015,11115,LOC108698152,.,-


In [10]:
# Save as bed file
result.to_csv(f"{ref_genome}_tss_info.bed", sep='\t', header=False, index=False)
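
A quick sanity check of the saved file can catch formatting problems before it is used downstream. The following is a minimal sketch, assuming a six-column, tab-separated layout as written above. `check_bed6` is a hypothetical helper, and the third row of the in-memory example is deliberately broken.

```python
import csv
import io


def check_bed6(handle):
    """Return a list of problems found in a tab-separated BED6 stream."""
    problems = []
    for n, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) != 6:
            problems.append(f"line {n}: expected 6 columns, found {len(row)}")
            continue
        start, end, strand = int(row[1]), int(row[2]), row[5]
        if start < 0 or start >= end:
            problems.append(f"line {n}: invalid interval {start}-{end}")
        if strand not in ("+", "-", "."):
            problems.append(f"line {n}: invalid strand {strand!r}")
    return problems


# In-memory example: two rows taken from the table above, one deliberately broken.
bed_text = (
    "Chr1L\t45901\t47001\tLOC108704873\t.\t+\n"
    "Chr1L\t165766\t166866\tdok1.L\t.\t-\n"
    "Chr1L\t500\t400\tbroken_example\t.\t+\n"
)
print(check_bed6(io.StringIO(bed_text)))  # ['line 3: invalid interval 500-400']
```

For the real file, the same function could be called with `open(f"{ref_genome}_tss_info.bed")` as the handle.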

# Test
Try to load the DNA sequences of the TSS regions using genomepy.

In [11]:
# Load file
tss_file = BedTool(f"{ref_genome}_tss_info.bed").to_dataframe()
tss_file.head()

Unnamed: 0,chrom,start,end,name,score,strand
0,Chr1L,45901,47001,LOC108704873,.,+
1,Chr1L,124971,126071,LOC108704861,.,+
2,Chr1L,165766,166866,dok1.L,.,-
3,Chr1L,171675,172775,dok1.L,.,-
4,Chr1L,182964,184064,mrps26.L,.,-


In [12]:
# Get DNA sequence

peak_ids = tss_file["chrom"] + "_" + tss_file["start"].astype("str") + "_" + tss_file["end"].astype("str")
peak_ids = peak_ids.to_list()

fa = ma.peak2fasta(peak_ids, ref_genome=ref_genome)
fa

41925 sequences
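
The peak IDs above follow a `chrom_start_end` pattern. Chromosome names in this genome can themselves contain underscores (for example `Chr9_10S`), so splitting an ID back into its parts is safest from the right. The helpers `make_peak_id` and `parse_peak_id` below are hypothetical and only illustrate that point.

```python
def make_peak_id(chrom, start, end):
    """Build a peak ID in the chrom_start_end form used above."""
    return f"{chrom}_{start}_{end}"


def parse_peak_id(peak_id):
    """Split a peak ID from the right so underscores in chrom are kept."""
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)


peak_id = make_peak_id("Chr9_10S", 117230153, 117231253)
print(peak_id)                 # Chr9_10S_117230153_117231253
print(parse_peak_id(peak_id))  # ('Chr9_10S', 117230153, 117231253)
```

A plain `split("_")` would return four pieces for this ID, which is why `rsplit("_", 2)` is used.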

In [13]:
# Show 3 sequences
n = 3

for i, (k, v) in enumerate(fa.items()):
    print(k, "\n", v, "\n")
    
    if i >= n - 1:
        break

Chr1L_45902_47001 
 TTCCGGGGTGAATTATACAAGGGTGCCTTATTCAGGGAAGGGCTATAGTGAGTTATTCAGAGGTGGATTATACATGGGAGAATTTTTTCAGTGAGAATTATTGAGGTTTGAGTTATATAGAAAgggtaaatatatatgtaaaggcAAGTGATTGATGTTCCCCAACGCCACAGAGAAGATTGGCTTATAGGAAAGGAACCATGGGAAACCTTCTAAACCCCCTCCCTTATCAAAATTCTAACATAGATATAAGACAAAGACCACTTGACTGATGGTGGAGAAAggcatttattaattcatttgaTAATATTAGTAATCAACAGTTTTAGACAACTTAATTATAATCAGTCTAATGAAAATCAATAGAATTGTACATATTGTCTCTCTTACCCGCCCACATTTTTTGGTGGGAGTGTTAGCTCTAAACCAGTAAAACAATAGATCTGGGATACAGAACTGGCACTGGGATGATAGCGGCCAGCAGGGAAACCTTGTGTGTGACCCACAGGCCATTGGCTTGTGATACATAAGGAGCAATAGTTCTAGTAGAATATGTAAGTAGAGCACAGTTAGCCAGTAGGCCGCTCCATATCCCTGGGCAGTACCTCTGGGCTCAAGCGCTGATACAGTACATGACACAGGGGTGTGGCCTAAAACTGCTTGTCAGTATGGGATCAATTCCCACCCCCCGTGCCCTGTGATGTTCCTCTATTCATCAATAAAGGCAATATGTATCCCAGTAGAAAGTTTGTAGGGCCCCaatctataaaaaggaaaagCTCCTAGGTGTGACCATCATATCTCACTGGGGGTCATGACAGGGGGTTAAGGAAACTGGGATATATTTCTGGGATATTTTTCTTTAGTCTAGACTCTAGGGAAAGGCAAGGTCTGCCAATTCCACTCCCGTATCTCATTATGGGACACTTCTGGCTAGAGAATGGGGATGGGGCGGGGCtagtattggggggggggagttt

Looks good

In [14]:
# Remove gff3 file.
!rm ./*gff3

In [1]:
ls

1_make_tss_referenece_from_Xenbase_gff3_file.ipynb
Xenopus_laevis_v10.1_tss_info.bed
