# Obtain Known Gene/Transcript Annotations
We will use annotations obtained from Ensembl (Homo_sapiens.GRCh38.86.gtf.gz) for chromosome 22 only. To save time, these have been prepared for you and made available on your AWS instance.

Course link: https://rnabio.org/module-01-inputs/0001/03/01/Annotations/

In [1]:
pwd # check your current directory
echo $RNA_REFS_DIR
cd $RNA_REFS_DIR # go to this path, note that RNA_REFS_DIR=/home/ubuntu/workspace/rnaseq/refs
ls # the current content inside $RNA_REFS_DIR, we'll add the annotation file to this directory

/home/ubuntu
/home/ubuntu/workspace/rnaseq/refs
chr22_only.fa  chr22_with_ERCC92.fa


In [2]:
# download the annotation file
wget http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf

--2025-04-26 18:13:01--  http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf
Resolving genomedata.org (genomedata.org)... 54.71.55.4
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf [following]
--2025-04-26 18:13:01--  https://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30712117 (29M)
Saving to: ‘chr22_with_ERCC92.gtf’


2025-04-26 18:13:02 (22.6 MB/s) - ‘chr22_with_ERCC92.gtf’ saved [30712117/30712117]



## View GTF file
* A GTF file has 9 tab-separated fields: chrom  source  feature  start  end  score  strand  frame  attribute
* More info about FASTA, FASTQ, and GTF format: https://github.com/griffithlab/rnabio.org/blob/master/assets/lectures/cshl/2024/mini/RNASeq_MiniLecture_01_01_FASTA_FASTQ_GTF.pdf
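The 9-field layout can be checked directly with awk. The sketch below is a minimal illustration: it uses the first record of chr22_with_ERCC92.gtf (as printed later in this notebook) as a toy input, and the variable names gtf_line, n_fields, and summary are placeholders chosen for this example.

```shell
# Toy input: the first record of chr22_with_ERCC92.gtf as shown in this notebook.
# printf turns each \t into a real tab character.
gtf_line=$(printf '22\tensembl\tgene\t10736171\t10736283\t.\t-\t.\tgene_id "ENSG00000277248"; gene_version "1"; gene_name "U2"; gene_source "ensembl"; gene_biotype "snRNA";')

# NF is the number of fields awk sees when splitting on tabs.
n_fields=$(printf '%s\n' "$gtf_line" | awk -F'\t' '{print NF}')
echo "fields: $n_fields"

# Print a few of the fields by name.
summary=$(printf '%s\n' "$gtf_line" | awk -F'\t' '{print "chrom=" $1, "feature=" $3, "start=" $4, "end=" $5, "strand=" $7}')
echo "$summary"
```

The same awk expression can be pointed at the real file, for example awk -F'\t' '{print NF}' "$RNA_REF_GTF" | sort -u, to see whether every line has the same number of fields.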

In [3]:
ls # the $RNA_REFS_DIR now also contains chr22_with_ERCC92.gtf 

# Show the path to the GTF file
echo $RNA_REF_GTF # Note that RNA_REF_GTF=/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.gtf

chr22_only.fa  chr22_with_ERCC92.fa  chr22_with_ERCC92.gtf
/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.gtf


### Course's method to view GTF file with "less"
Press q to exit the less display when you are done.

In [None]:
# View GTF file
less -p start_codon -S $RNA_REF_GTF 

# two helpful options with less
## -p start_codon jumps directly to the first line that contains start_codon
## -S stands for "chop long lines" (as in --chop-long-lines). It truncates the line visually, so each line stays on one screen line and you can scroll right/left to see more.
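When an interactive pager is not available, a rough non-interactive equivalent of the two less options can be built from grep and cut. This is only a sketch on a made-up two-record file; the temporary file, its contents, and the 40-character width are assumptions for illustration.

```shell
# Toy stand-in for the GTF file: two shortened, made-up records.
toy=$(mktemp)
printf '22\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n22\thavana\tstart_codon\t120\t122\t.\t+\t0\tgene_id "G1";\n' > "$toy"

# grep -m 1 stops at the first matching line (similar to less -p start_codon),
# -n prefixes the line number, and cut -c 1-40 keeps only the first
# 40 characters of the line (similar in spirit to less -S).
first_hit=$(grep -m 1 -n 'start_codon' "$toy" | cut -c 1-40)
echo "$first_hit"

rm -f "$toy"
```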

### View GTF file with "head"
Because I can't press "q" to exit "less" in a Jupyter notebook as I can in a WSL terminal, I view the GTF file with "head" instead.

In [5]:
echo -e "chrom\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tattribute" # print field description
head -n 1 chr22_with_ERCC92.gtf

chrom	source	feature	start	end	score	strand	frame	attribute
22	ensembl	gene	10736171	10736283	.	-	.	gene_id "ENSG00000277248"; gene_version "1"; gene_name "U2"; gene_source "ensembl"; gene_biotype "snRNA";


## Examine GTF file
### How many unique genes are in this GTF file?
#### Method 1: use the feature (3rd) field

In [11]:
# view the first 2 lines that contain the whole word "gene"
cat chr22_with_ERCC92.gtf | grep -w gene | head -n 2
# The -w flag tells grep to match whole words only, so it matches "gene" but not partial matches such as "gene_id" or "genes".
# grep searches the whole line, not only the feature (3rd) field, so this relies on the word "gene" not appearing on its own elsewhere in a line.
cat chr22_with_ERCC92.gtf | grep -w gene | wc -l # count the lines that contain the whole word "gene"

22	ensembl	gene	10736171	10736283	.	-	.	gene_id "ENSG00000277248"; gene_version "1"; gene_name "U2"; gene_source "ensembl"; gene_biotype "snRNA";
22	havana	gene	10939388	10961338	.	-	.	gene_id "ENSG00000283047"; gene_version "1"; gene_name "FRG1FP"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000191577"; havana_gene_version "1";
1318
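Because grep -w looks at the entire line, a stricter variant is to compare the 3rd field itself with awk. The sketch below contrasts the two on a made-up three-record file; the records, including the note attribute, are invented purely to show the difference.

```shell
# Toy GTF: one gene record, one transcript record, and one exon record whose
# attribute text happens to contain the standalone word "gene" (made up).
toy=$(mktemp)
{
  printf '22\tensembl\tgene\t100\t900\t.\t-\t.\tgene_id "G1";\n'
  printf '22\tensembl\ttranscript\t100\t900\t.\t-\t.\tgene_id "G1"; transcript_id "T1";\n'
  printf '22\tensembl\texon\t100\t300\t.\t-\t.\tgene_id "G1"; note "gene fragment";\n'
} > "$toy"

# Whole-word match anywhere on the line.
by_word=$(grep -c -w 'gene' "$toy")

# Exact match on the feature (3rd) field only.
by_column=$(awk -F'\t' '$3 == "gene" {n++} END {print n+0}' "$toy")

echo "grep -w: $by_word, awk on field 3: $by_column"
rm -f "$toy"
```

On the toy file the two counts differ because of the invented note attribute; on chr22_with_ERCC92.gtf they would only differ if some line contained the standalone word outside the feature field.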


#### Method 2: use the attribute (9th) field
Note: 
* "sort | uniq" is functionally equivalent to "sort -u"
* print "$1\n" prints the value of the first matched group (the part of the string matched by the part of the regular expression inside the parentheses ()). The \n adds a newline after printing, making sure each match appears on a new line.

In [18]:
# gene_id\s\"ENSG\w+\" matches any gene_id starting with "ENSG" followed by one or more word characters (\w+), which includes letters, numbers, and underscores.
perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | head -n 2
perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | wc -l

gene_id "ENSG00000008735"
gene_id "ENSG00000015475"
1318


In [13]:
perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' $RNA_REF_GTF | sort -u | wc -l

1318


#### Be aware that chr22_with_ERCC92.gtf has gene_ids with either an ENSG or an ERCC- prefix
* ENSG (Ensembl Gene) is used for Ensembl Gene IDs that represent real biological genes in an organism's genome.
* ERCC- (External RNA Controls Consortium) is used for spike-in RNA sequences from the ERCC, which are artificial genes designed for calibration and quality control in RNA-seq experiments.

To answer "How many unique gene IDs are in the .gtf file?" for real genes, **count only ENSG gene_ids**, which gives **1318** gene_ids.

In [23]:
# sed 's/[0-9]*$//' strips the trailing digits from each gene ID, leaving only its prefix
perl -ne 'if ($_ =~ /gene_id\s\"([^\"]+)/){print "$1\n"}' $RNA_REF_GTF | sed 's/[0-9]*$//' | sort | uniq

ENSG
ERCC-


In [24]:
# view a line containing ERCC gene_id
grep "ERCC-" $RNA_REF_GTF | head -n 1

ERCC-00002	ERCC	exon	1	1061	0.000000	+	.	gene_name "ERCC-00002"; gene_id "ERCC-00002"; transcript_id "DQ459430"; exon_number "1";
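The prefix-stripping idea can be extended to count unique gene IDs per prefix in one pipeline. This is a sketch on a made-up list of five IDs (one duplicated); toy_ids and per_prefix are names chosen for this example.

```shell
# Toy list of gene IDs (illustrative values); one ENSG ID appears twice.
toy_ids='ENSG00000008735
ENSG00000015475
ENSG00000008735
ERCC-00002
ERCC-00003'

# 1) sort -u keeps one copy of each ID
# 2) sed strips the trailing digits, leaving only the prefix
# 3) sort | uniq -c counts how many unique IDs share each prefix
# 4) awk reorders uniq's "count prefix" output into "prefix=count"
per_prefix=$(printf '%s\n' "$toy_ids" | sort -u | sed 's/[0-9]*$//' | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$per_prefix"
```

Feeding the gene IDs extracted from $RNA_REF_GTF through the same pipeline should give per-prefix counts that are consistent with the ENSG-only and all-prefix totals reported in this notebook.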


##### Counting **all unique gene_ids** (ENSG and ERCC- prefixes together) gives **1410** gene_ids

In [17]:
# (gene_id\s\"[^\"]+\") matches gene_id followed by a quoted value with any prefix, and captures the whole match, quotes included
perl -ne 'if ($_ =~ /(gene_id\s\"[^\"]+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | head -n 2
perl -ne 'if ($_ =~ /(gene_id\s\"[^\"]+\")/){print "$1\n"}' $RNA_REF_GTF | sort | uniq | wc -l

gene_id "ENSG00000008735"
gene_id "ENSG00000015475"
1410


In [15]:
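# extract column 9 (attributes) | take the 2nd space-separated token, i.e. the value of the first attribute | strip quotes and semicolons | sort and remove duplicates | view the first 2 gene_ids
# note: this assumes the first attribute is gene_id, or has the same value (as in the ERCC line shown above, where gene_name comes first)
cut -f9 chr22_with_ERCC92.gtf | cut -d ' ' -f2 | tr -d '";' | sort -u | head -n 2
# count how many unique gene_ids are in the GTF file
cut -f9 chr22_with_ERCC92.gtf | cut -d ' ' -f2 | tr -d '";' | sort -u | wc -l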

ENSG00000008735
ENSG00000015475
1410
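The cut-based pipeline picks the value of whichever attribute comes first. A variant that looks for the gene_id key itself does not depend on attribute order. The sketch below compares the two on made-up attribute strings (what cut -f9 would produce), where the second record lists a different gene_name before gene_id.

```shell
# Toy attribute column (made up). In the second record gene_name comes first
# and differs from gene_id.
toy_attrs='gene_id "ENSG0001"; gene_version "1"; gene_name "A";
gene_name "SPIKE_A"; gene_id "ERCC-0001"; transcript_id "T1";'

# Position-based: value of the first attribute, whatever its key is.
by_position=$(printf '%s\n' "$toy_attrs" | cut -d ' ' -f2 | tr -d '";' | sort -u)

# Key-based: grep -o prints only the gene_id "..." part of each line,
# then cut takes the text between the double quotes.
by_key=$(printf '%s\n' "$toy_attrs" | grep -o 'gene_id "[^"]*"' | cut -d '"' -f2 | sort -u)

echo "by position:"; printf '%s\n' "$by_position"
echo "by key:";      printf '%s\n' "$by_key"
```

On the toy input the position-based version returns SPIKE_A instead of ERCC-0001, while the key-based version returns both gene_ids.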


# Create a HISAT2 index
Create a HISAT2 index for chr22 and the ERCC spike-in sequences. HISAT2 can incorporate exons and splice sites into the index file for alignment. First create a splice site file, then an exon file. Finally make the aligner FM index.

Course link: https://rnabio.org/module-01-inputs/0001/04/01/Indexing/

## hisat2-build 
hisat2-build generates 8 .ht2 files from splicesites.tsv, exons.tsv, and chr22_with_ERCC92.fa. These files share the basename "chr22_with_ERCC92" (for example, chr22_with_ERCC92.1.ht2). They are written to $RNA_REFS_DIR, because $RNA_REF_INDEX is the path prefix of the index files rather than a directory.
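Once the index has been built, one quick sanity check is to count the files that share the index basename. The helper below, count_ht2, is a hypothetical name introduced for this sketch; it is demonstrated on empty placeholder files in a temporary directory rather than on a real index.

```shell
# Count files named <basename>.<something>.ht2 (hypothetical helper).
count_ht2() {
  n=0
  for f in "$1".*.ht2; do
    # If nothing matches, the pattern is left unexpanded, so test for existence.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Demonstration with 8 empty placeholder files (not a real HISAT2 index).
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  : > "$demo_dir/chr22_with_ERCC92.$i.ht2"
done
n_index=$(count_ht2 "$demo_dir/chr22_with_ERCC92")
echo "index files: $n_index"
rm -rf "$demo_dir"
```

After running hisat2-build in this notebook, count_ht2 "$RNA_REF_INDEX" would be expected to print 8.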

Note: 
* RNA_REF_FASTA=/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.fa
* RNA_REF_INDEX=/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92

* hisat2_extract_splice_sites.py script is part of the HISAT2 package, which is a tool used for aligning RNA-Seq reads to a reference genome. 
* The script is typically installed with HISAT2 when you install it, and it is used to extract splice site information from a GTF file.
* If you have HISAT2 installed, you should be able to run this script from the command line, even when this script is not in the $RNA_REFS_DIR directory.
* splicesites.tsv file contains the positions of exon-exon junctions
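Whether the HISAT2 scripts can be run from any directory depends on them being on PATH. A small check, sketched here with a hypothetical helper named check_tools, reports which of the programs used in this section can be found.

```shell
# Report whether each named program can be found on PATH (hypothetical helper).
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# The three programs used in this section.
check_tools hisat2-build hisat2_extract_splice_sites.py hisat2_extract_exons.py
```

Any line starting with "missing:" means that program is not on PATH in the current environment.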

In [26]:
cd $RNA_REFS_DIR
hisat2_extract_splice_sites.py $RNA_REF_GTF > $RNA_REFS_DIR/splicesites.tsv

In [27]:
ls

chr22_only.fa  chr22_with_ERCC92.fa  chr22_with_ERCC92.gtf  splicesites.tsv


In [28]:
hisat2_extract_exons.py $RNA_REF_GTF > $RNA_REFS_DIR/exons.tsv
hisat2-build -p 4 --ss $RNA_REFS_DIR/splicesites.tsv --exon $RNA_REFS_DIR/exons.tsv $RNA_REF_FASTA $RNA_REF_INDEX
ls

Settings:
  Output files: "/home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /home/ubuntu/workspace/rnaseq/refs/chr22_with_ERCC92.fa
Reading reference sizes
  Time reading reference sizes: 00:00:01
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
  Time to read SNPs and splice sites: 00:00:00
Generation 0 (39250236 -> 39250236 nodes, 0 ranks)
COUNTED NEW NODES: 1
COUNTED TEMP NODES: 0
RESIZED NODES: 