## TSS annotation pipeline

## 0. Required initial annotation files

In [1]:
# Show the usage of the annotation script.
# This script is designed for GENCODE v45.

!python TSS_site_list_annotate.Gencode_v45.py -h

usage:  [-h] -i INPUT [-o OUTPUT] -f FASTA -t TSS_DB -e EXONS_DB -a ANNOT
        [-b BLAST_DB] [-r REPEATS] [--bedtools BEDTOOLS] [--blastn BLASTN]

optional arguments:
  -h, --help            show this help message and exit

Required:
  -i INPUT              Input TSS BED files
  -o OUTPUT, --output OUTPUT
                        Output
  -f FASTA, --fasta FASTA
  -t TSS_DB, --tss TSS_DB
  -e EXONS_DB, --exons EXONS_DB
  -a ANNOT, --annot ANNOT
  -b BLAST_DB, --blastdb BLAST_DB
  -r REPEATS, --repeats REPEATS
  --bedtools BEDTOOLS
  --blastn BLASTN




Parameters:

`-i The input BED file.`

`-o The prefix of the outputs.`

`-f FASTA file for the genomic sequence.`

`-t A BED file annotating the TSSs (see below).`

`-e A BED file annotating the first exons (see below).`

`-a A table similar to the output of the UCSC Table Browser (see below).`

`-b A BLAST database of snRNA and snoRNA sequences. Optional.`

`-r A BED file annotating repeat elements. Optional. It can be downloaded from the UCSC Table Browser. The 5th column is used.`

`--bedtools Path to the bedtools executable. Optional.`

`--blastn Path to the blastn executable. Optional.`

In this example, I use only one chromosome (chr1).

## 1. Making metadata

Please make sure that the chromosome names in the FASTA file match those in the GTF file. For example, `chr1` in one file and `1` in the other will not work.
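The naming check can be done before any annotation step. The sketch below is not part of the pipeline; the function names are hypothetical, and it assumes a plain-text (uncompressed) FASTA and GTF. It lists the chromosome names that occur in the GTF but have no matching FASTA header.

```python
def fasta_chromosomes(fasta_path):
    """Collect sequence names from FASTA headers (text up to the first whitespace)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_chromosomes(gtf_path):
    """Collect the first column of every non-comment, non-empty GTF line."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_path, gtf_path):
    """Return GTF chromosome names that have no FASTA record, sorted."""
    return sorted(gtf_chromosomes(gtf_path) - fasta_chromosomes(fasta_path))
```

For the files used here, `missing_from_fasta("./ANNOTATIONS/chr1.fa", "./ANNOTATIONS/chr1.gtf")` should return an empty list if the names agree.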

In [2]:
!python ./ANNOTATIONS/gtf2anno_plus_gencode.py3.py -i ./ANNOTATIONS/chr1.gtf > ./ANNOTATIONS/chr1.anno

In [3]:
!head -2 ./ANNOTATIONS/chr1.anno

lncRNA,lncRNA	ENST00000456328.2	1	+	11868	14409	.	.	3	11868,12612,13220,	12227,12721,14409,	.	ENSG00000290825.1	DDX11L2	.	.
transcribed_unprocessed_pseudogene,transcribed_unprocessed_pseudogene	ENST00000450305.2	1	+	12009	13670	.	.	6	12009,12178,12612,12974,13220,13452,	12057,12227,12697,13052,13374,13670,	.	ENSG00000223972.6	DDX11L1	.	.


In [4]:
!python ./ANNOTATIONS/anno_to_first_exon_end.py  ./ANNOTATIONS/chr1.anno | bedtools sort -i - >  ./ANNOTATIONS/chr1.first_exon.bed 

In [5]:
!head -2 ./ANNOTATIONS/chr1.first_exon.bed 

1	65432	65433	ENST00000641515.2	OR4F5	+
1	450739	450740	ENST00000426406.4	OR4F29	-
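To make the relation between the `.anno` table and this BED file concrete, here is a minimal sketch of how a first-exon end could be derived from one `.anno` row. It is not the code of `anno_to_first_exon_end.py`. It assumes a genePred-like layout (0-based exon starts in column 10, exon ends in column 11, both sorted by genomic coordinate, gene name in column 14), and the function name is hypothetical.

```python
def first_exon_end_bed(anno_line):
    """Return a 1-nt BED record marking the 3' end of a transcript's first exon."""
    fields = anno_line.rstrip("\n").split("\t")
    transcript_id, chrom, strand = fields[1], fields[2], fields[3]
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]
    gene_name = fields[13]
    if strand == "+":
        # First exon is the leftmost one; its 3' end is its genomic end.
        end = exon_ends[0]
        start = end - 1
    else:
        # First exon is the rightmost one; its 3' end is its genomic start.
        start = exon_starts[-1]
        end = start + 1
    return (chrom, start, end, transcript_id, gene_name, strand)


# The first row of chr1.anno shown above, rebuilt field by field.
row = "\t".join([
    "lncRNA,lncRNA", "ENST00000456328.2", "1", "+", "11868", "14409",
    ".", ".", "3", "11868,12612,13220,", "12227,12721,14409,",
    ".", "ENSG00000290825.1", "DDX11L2", ".", ".",
])
print(first_exon_end_bed(row))
# ('1', 12226, 12227, 'ENST00000456328.2', 'DDX11L2', '+')
```

Under these assumptions each record is a single-nucleotide interval in six-column BED order, the same shape as the two lines printed above.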


In [6]:
!python ./ANNOTATIONS/anno_to_tss.py  ./ANNOTATIONS/chr1.anno | bedtools sort -i - >  ./ANNOTATIONS/chr1.tss.bed 

In [7]:
!head -2 ./ANNOTATIONS/chr1.first_exon.bed 

1	65432	65433	ENST00000641515.2	OR4F5	+
1	450739	450740	ENST00000426406.4	OR4F29	-


## 2. Run annotations

I used the test mapping results (chr1).

Please include all the scripts in the same folder:

* genome_flanking_v3.py
* collaspe_bed_annotations_v3.py
* collaspe_bed_annotations_fix_other_exons_v2.py
* collaspe_bed_annotations_fix_upstream_TSS.py
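A quick way to confirm that the helper scripts are in place before starting a run is sketched below. The function name is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# Helper scripts named in the list above.
REQUIRED_SCRIPTS = [
    "genome_flanking_v3.py",
    "collaspe_bed_annotations_v3.py",
    "collaspe_bed_annotations_fix_other_exons_v2.py",
    "collaspe_bed_annotations_fix_upstream_TSS.py",
]


def missing_scripts(folder="."):
    """Return the required script names that are not regular files in `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_SCRIPTS if not (folder / name).is_file()]
```

Calling `missing_scripts()` from the working folder returns an empty list when all four scripts are present.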


In [9]:
# sort bed file first
!bedtools sort -i  test.chr1.bed >  test.chr1.sorted.bed

In [None]:
!python TSS_site_list_annotate.Gencode_v45.py -i test.chr1.sorted.bed -o test.chr1.annotated.csv -f ./ANNOTATIONS/chr1.fa -a ./ANNOTATIONS/chr1.anno -t ./ANNOTATIONS/chr1.tss.bed -e ./ANNOTATIONS/chr1.first_exon.bed -b BLASTDB/snRNA_snoRNA_blast -r ./ANNOTATIONS/hg38_repeats.nochr.sorted.bed  


[2025-03-28 14:31:46] Analysis begins.
[2025-03-28 14:31:47] Reference loaded.
[2025-03-28 14:31:47] Searching for snRNA/snoRNA like sequences...
[2025-03-28 14:31:48] Getting fasta...
python genome_flanking_v3.py -U 0 -D 50 -f ./ANNOTATIONS/chr1.fa -i test.chr1.annotated.csv.site > test.chr1.annotated.csv.fa
[2025-03-28 14:31:50] Proceed to gene annotation...
bedtools closest -s -D b -a test.chr1.sorted.bed -b ./ANNOTATIONS/chr1.tss.bed > test.chr1.annotated.csv.closest_tss.bed
bedtools closest -S -D b -a test.chr1.sorted.bed -b ./ANNOTATIONS/chr1.tss.bed > test.chr1.annotated.csv.closest_tss.upstream.bed
bedtools closest -s -D b -a test.chr1.sorted.bed -b ./ANNOTATIONS/chr1.first_exon.bed > test.chr1.annotated.csv.closest_exons.bed
[2025-03-28 14:31:50] Collasping annotations...
python collaspe_bed_annotations_v3.py -i test.chr1.annotated.csv.closest_tss.bed -a ./ANNOTATIONS/chr1.anno -o test.chr1.annotated.csv.collasped.1.bed
python collaspe_bed_annotations_fix_other_exons_v2.py -a 

In [12]:
!head test.chr1.annotated.csv

chr,pos_1,strand,pos_0,snRNA_snoRNA_like,Gene,Gene_biotype,Annotate_to_closest,Relative_distance,Repeat
1,199044,-,199043,False,UNKNOWN,UNKNOWN,UNKNOWN,NA,
1,629572,+,629571,False,MTND2P28,unprocessed_pseudogene,TSS,-68,
1,631259,+,631258,False,ENSG00000293331,lncRNA,UPSTREAM,55,
1,827692,+,827691,False,LINC01128,lncRNA,TSS,0,
1,925743,+,925742,False,SAMD11,protein_coding,TSS,12,
1,959256,-,959255,False,NOC2L,protein_coding,TSS,0,
1,959258,-,959257,False,NOC2L,protein_coding,TSS,-2,
1,959271,-,959270,False,NOC2L,protein_coding,TSS,-15,
1,1000071,-,1000070,False,HES4,protein_coding,TSS,25,


Columns:

chr: chromosome

pos_1: 1-based coordinate of the site

strand: strand

pos_0: 0-based coordinate of the site

snRNA_snoRNA_like: whether the site is similar to an snRNA/snoRNA

Gene: gene name

Gene_biotype: gene biotype

Annotate_to_closest: the kind of annotation assigned. TSS - a known TSS; EXON - far away (100 nt) from a known TSS but located in the first exon.

Repeat: whether the site is located in a repeat element
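As an illustration of how the output table can be read downstream, the sketch below parses the CSV with Python's standard `csv` module, checks that `pos_0` equals `pos_1 - 1` (as it does in every row shown above), and counts the sites per `Annotate_to_closest` value. The function name is hypothetical, and the sample consists of the first four rows printed above.

```python
import csv
import io
from collections import Counter

# Header and first four data rows copied from the output shown above.
SAMPLE = """chr,pos_1,strand,pos_0,snRNA_snoRNA_like,Gene,Gene_biotype,Annotate_to_closest,Relative_distance,Repeat
1,199044,-,199043,False,UNKNOWN,UNKNOWN,UNKNOWN,NA,
1,629572,+,629571,False,MTND2P28,unprocessed_pseudogene,TSS,-68,
1,631259,+,631258,False,ENSG00000293331,lncRNA,UPSTREAM,55,
1,827692,+,827691,False,LINC01128,lncRNA,TSS,0,
"""


def summarize_annotations(handle):
    """Count rows per Annotate_to_closest value, validating the two coordinates."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if int(row["pos_0"]) != int(row["pos_1"]) - 1:
            raise ValueError(f"inconsistent coordinates at {row['chr']}:{row['pos_1']}")
        counts[row["Annotate_to_closest"]] += 1
    return counts


counts = summarize_annotations(io.StringIO(SAMPLE))
print(dict(counts))
# {'UNKNOWN': 1, 'TSS': 2, 'UPSTREAM': 1}
```

To run it on the real output, pass an open file instead of the sample, for example `summarize_annotations(open("test.chr1.annotated.csv", newline=""))`.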