# 1-Genomeannotation
Jakke Neiro$^1$
1. Aboobaker lab, Department of Zoology, University of Oxford

## Contents of notebook
* 1. Introduction
* 2. Data quality
* 3. Alignment
* 4. Transcript assembly and merge

## Files
* Input: 183 RNA-seq samples
* Output: stringtie_merged.gtf

# 1. Introduction

This notebook describes the annotation of the sexual planarian genome (*Schmidtea mediterranea*) based on total RNA-seq data from this species. In short, we used the StringTie annotation pipeline with 183 planarian RNA-seq data sets as the raw RNA-seq material and the PlanMine sexual genome annotation (SMESG-high) as the reference annotation. More information on the StringTie annotation process can be found at http://ccb.jhu.edu/software/stringtie/index.shtml.

In [None]:
%%bash
cd /hydra/sexual_genome_annotation_files/ncrna_Neiro
grep -v -P '\t\.\t\.\t' stringtie_merged.gtf > stringtie.Bioconductor.gtf
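
The filter in the cell above drops every line that contains two consecutive `.` fields. The sketch below shows its effect on an invented three-record GTF. In StringTie output the score column is usually a number and the frame column is `.`, so the match presumably falls on the strand and frame columns, removing features with an undetermined strand. That reading is an assumption, and the file `toy.gtf`, its records, and the helper name `drop_double_dot` are hypothetical.

```shell
# Toy demonstration; the GTF records below are invented.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gtf"

# Three transcripts: plus strand, minus strand, and undetermined strand (".").
{
  printf 'chr1\tStringTie\ttranscript\t100\t500\t1000\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n'
  printf 'chr1\tStringTie\ttranscript\t700\t900\t1000\t-\t.\tgene_id "G2"; transcript_id "G2.1";\n'
  printf 'chr1\tStringTie\ttranscript\t950\t990\t1000\t.\t.\tgene_id "G3"; transcript_id "G3.1";\n'
} > "$toy"

# Portable equivalent of: grep -v -P '\t\.\t\.\t'
tab=$(printf '\t')
drop_double_dot() {
  grep -v "${tab}\.${tab}\.${tab}" "$1"
}

# Count the records that survive the filter.
kept=$(drop_double_dot "$toy" | wc -l | tr -d ' ')
echo "kept $kept of 3 records"

# List the strand values that remain after filtering.
drop_double_dot "$toy" | cut -f 7 | sort -u
```

Only the record with `.` in the strand column is removed; the stranded records pass through unchanged.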

# 2. Data quality

In total, 183 planarian RNA-seq data sets were used as the raw pool of RNA-seq reads for the expression-driven annotation. The sample IDs are listed below:

In [3]:
%%bash
cd /drives/raid/AboobakerLab/jakke/sexual_genome_annotation_files/sexual_genome_annotation/
wc -l sample_ids.txt
less sample_ids.txt

183 sample_ids.txt
SRR1012785
SRR1012838
SRR1012831
SRR1012832
SRR1012833
SRR1012834
SRR1012835
SRR1012836
ERR032074
ERR032075
ERR032076
ERR032073
SRR867380
SRR867381
SRR867382
SRR2009682
SRR2009683
SRR2009678
SRR2009679
SRR867383
SRR867384
SRR867385
SRR2009680
SRR2009681
SRR867386
SRR867387
SRR867388
SRR2009674
SRR2009675
SRR2009676
SRR2009677
SRR1302027
SRR1302028
SRR1302033
SRR1302034
SRR1208934
SRR1266970
SRR1266971
SRR1266972
SRR1266967
SRR1266968
SRR1266969
SRR1266973
SRR1266974
SRR1266975
SRR1302023
SRR1302024
SRR1302029
SRR1302030
SRR1302025
SRR1302026
SRR1302031
SRR1302032
SRR900846
SRR900847
SRR900848
SRR900849
SRR900850
SRR900851
SRR900852
SRR900853
SRR2183571
SRR2183572
SRR2183573
SRR2183574
SRR2183575
SRR2183576
SRR2183577
SRR2183578
SRR2183579
SRR2183580
SRR2183581
SRR2183582
SRR2183583
SRR2183584
SRR2183585
SRR2658130
SRR2658131
SRR2658132
SRR2658133
SRR2658126
SRR2658127
SRR2658128
SRR2658129
SRR2183586
SRR2183587
SRR2183588
SRR2183589
SRR2183590
SRR2183591
SRR2183592
S
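
A list like this can be sanity-checked before any long-running job is launched. The sketch below counts the IDs, reports duplicated entries, and flags lines that do not look like SRA/ENA run accessions. The helper `check_ids`, the accession pattern, and the toy file are assumptions for illustration; the real `sample_ids.txt` is not reproduced here.

```shell
# Hypothetical helper: summarise a file with one run accession per line.
check_ids() {
  file=$1
  total=$(grep -c . "$file")                              # non-empty lines
  dups=$(sort "$file" | uniq -d | grep -c .)              # IDs listed more than once
  bad=$(grep -v -E -c '^[SED]RR[0-9]+$' "$file")          # lines that are not run accessions
  echo "total=$total duplicated=$dups malformed=$bad"
}

# Toy input: one duplicate and one malformed line (invented for the example).
tmpdir=$(mktemp -d)
printf '%s\n' SRR1012785 ERR032074 SRR1012785 not_an_id > "$tmpdir/ids.txt"

check_ids "$tmpdir/ids.txt"
```

On the real list, the expected result would be 183 IDs with no duplicated or malformed entries.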

# 3. Alignment

The SMESG-high genome annotation and gene models were downloaded from PlanMine v3.0.

In [2]:
#%%bash
#cd ../sexual_genome_annotation_files/sexual_genome_annotation/
#wget http://planmine.mpi-cbg.de/planmine/model/bulkdata/smes_v2_hconf_SMESG.gff3.zip
#unzip smes_v2_hconf_SMESG.gff3.zip

The GFF3 file was converted into a GTF file with gffread:

In [None]:
%%bash
cd ../sexual_genome_annotation_files/sexual_genome_annotation/
gffread smes_v2_hconf_SMESG.gff3 -T -o smes_v2_hconf_SMESG.gtf

The splice sites were extracted for hisat2:

In [None]:
%%bash
cd ../sexual_genome_annotation_files/sexual_genome_annotation/
~/software/hisat2-2.1.0/hisat2_extract_splice_sites.py smes_v2_hconf_SMESG.gtf > smes_v2_hconf_hisat2_splice_sites
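
To illustrate what a splice-site file contains, the sketch below derives intron boundaries from the exon records of a toy GTF. It is a simplified re-implementation of the idea in awk, not the HISAT2 script itself (for instance, it does not merge overlapping exons). The output convention used here (chromosome, zero-based position of the last exon base before the intron, zero-based position of the first exon base after it, strand) is an assumption based on the HISAT2 manual and should be checked against real output. The function name `splice_sites` and the toy records are hypothetical.

```shell
tab=$(printf '\t')

# Hypothetical helper: print one line per intron of each transcript.
splice_sites() {
  # 1) keep chrom, strand, transcript_id, start, end for exon records
  awk -F'\t' '$3 == "exon" {
        match($9, /transcript_id "[^"]+"/)
        tid = substr($9, RSTART + 15, RLENGTH - 16)
        print $1 "\t" $7 "\t" tid "\t" $4 "\t" $5
      }' "$1" |
  # 2) order exons by transcript, then by start coordinate
  sort -t "$tab" -k3,3 -k4,4n |
  # 3) consecutive exons of the same transcript flank one intron
  awk -F'\t' -v OFS='\t' '
      $3 == prev_tid { print $1, prev_end - 1, $4 - 1, $2 }
      { prev_tid = $3; prev_end = $5 }' |
  sort -u
}

# Toy transcript with three exons, deliberately listed out of order.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gtf"
{
  printf 'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
  printf 'chr1\ttoy\texon\t500\t600\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
} > "$toy"

splice_sites "$toy"
```

With the three toy exons, two introns are reported: one between positions 199 and 299, and one between 399 and 499.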

The exons were extracted for hisat2:

In [None]:
%%bash
cd ../sexual_genome_annotation_files/sexual_genome_annotation/
~/software/hisat2-2.1.0/hisat2_extract_exons.py smes_v2_hconf_SMESG.gtf > smes_v2_hconf_hisat2_exons

The hisat2 index was built for alignment:

In [None]:
%%bash
cd ../sexual_genome_annotation_files/sexual_genome_annotation/
nohup ~/software/hisat2-2.1.0/hisat2-build final_dd_Smed_g4.fa final_dd_Smed_g4_hisat2 > nohup_hisat_index &

All RNA-seq libraries were aligned to the sexual planarian genome with HISAT2:

In [3]:
%%bash
cd /hydra/neiro_bam
less hisat2_align.sh

#!/bin/bash

for f in ../anish_trimmed_files/*.gz; do ~/software/hisat2-2.1.0/hisat2 -p 8 --dta --known-splicesite-infile smes_v2_hconf_hisat2_splice_sites -x ~/sexual_genome_annotation_files/sexual_genome_annotation/final_dd_Smed_g4_hisat2 -U $f -S ${f:23}.sam; done


In [None]:
#%%bash
#cd /hydra/neiro_bam
#nohup ./hisat2_align.sh &
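
The alignment script names each SAM file with `${f:23}`, a bash substring expansion that skips the first 23 characters of the path, which is exactly the length of `../anish_trimmed_files/`. The sketch below shows this on a hypothetical file name (the `.fastq.trimmed.fastq.gz` pattern is inferred from the later scripts, not confirmed) and compares it with `${f##*/}`, which strips the directory without a hard-coded length.

```shell
#!/usr/bin/env bash
# Hypothetical input path; the file name pattern is an assumption.
f="../anish_trimmed_files/SRR1012785.fastq.trimmed.fastq.gz"

prefix="../anish_trimmed_files/"
echo "prefix length: ${#prefix}"

by_offset="${f:23}.sam"    # what hisat2_align.sh does: skip 23 characters
by_strip="${f##*/}.sam"    # alternative: remove everything up to the last slash

echo "$by_offset"
echo "$by_strip"
```

Because the directory part is removed, the SAM files are written to the directory the script is run from. The offset form gives a wrong name if the input directory is ever renamed, whereas `${f##*/}` does not depend on its length.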

Finally, all the SAM files were sorted and converted into BAM files:

In [4]:
%%bash
cd /hydra/neiro_bam
less hisat_samtools.sh

#!/bin/bash

for f in *.sam; do samtools sort -@ 8 -o ${f%%.gz.sam}.bam $f; done


In [None]:
#%%bash
#nohup ./hisat_samtools.sh &

# 4. Transcript assembly and merge
The assembly process used StringTie. First, the RNA-seq reads were assembled into transcripts for each sample using the existing SMESG-high annotation.

In [5]:
%%bash
cd /hydra/neiro_bam
less stringtie.sh

#!/bin/bash

for f in *.bam; do ~/software/stringtie-2.1.1/stringtie -p 8 -G ~/sexual_genome_annotation_files/sexual_genome_annotation/smes_v2_hconf_SMESG.gtf -o ${f%%.fastq.trimmed.fastq.bam}.gtf -l ${f%%.fastq.trimmed.fastq.bam} $f; done


In [None]:
#%%bash
#cd /hydra/neiro_bam
#nohup ./stringtie.sh &
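
The scripts derive each output name by stripping a fixed suffix with `${f%%suffix}`. The sketch below traces one hypothetical sample name through the SAM, BAM, and GTF steps; the starting file name is an assumption inferred from the suffixes used in the scripts.

```shell
sam="SRR1012785.fastq.trimmed.fastq.gz.sam"   # hypothetical HISAT2 output name

bam="${sam%%.gz.sam}.bam"                      # as in hisat_samtools.sh
gtf="${bam%%.fastq.trimmed.fastq.bam}.gtf"     # as in stringtie.sh (-o)
label="${bam%%.fastq.trimmed.fastq.bam}"       # as in stringtie.sh (-l)

echo "$sam -> $bam -> $gtf (label: $label)"

# A name that does not end in the expected suffix is left unchanged,
# so the resulting GTF name keeps the .bam part.
other="sample.sorted.bam"
other_gtf="${other%%.fastq.trimmed.fastq.bam}.gtf"
echo "$other_gtf"
```

Since the suffix contains no wildcard, `%%` and `%` behave identically here. The second case shows why files with a different naming pattern would end up as `*.bam.gtf`.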

All the GTF files generated in the previous step were collected into a single list:

In [None]:
%%bash
cd /hydra/neiro_bam
for f in *.gtf; do echo $f >> mergelist.txt; done
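
The loop above appends with `>>`, so running the cell twice lists every file twice. Once `stringtie_merged.gtf` exists in the same directory, `*.gtf` matches it as well. The sketch below is one possible way to build the list so that it can be re-run safely. The helper name `make_mergelist` and the toy files are hypothetical.

```shell
# Hypothetical helper: write the per-sample GTF names found in a directory
# to mergelist.txt, overwriting any previous list and skipping the merged output.
make_mergelist() {
  dir=$1
  (
    cd "$dir" || exit 1
    for f in *.gtf; do
      [ -e "$f" ] || continue                          # directory has no GTF files
      [ "$f" = "stringtie_merged.gtf" ] && continue    # never list the merge result
      printf '%s\n' "$f"
    done > mergelist.txt
  )
}

# Toy directory with two per-sample files and an earlier merge result.
tmpdir=$(mktemp -d)
touch "$tmpdir/SRR1.gtf" "$tmpdir/SRR2.gtf" "$tmpdir/stringtie_merged.gtf"

make_mergelist "$tmpdir"
make_mergelist "$tmpdir"    # a second run does not duplicate entries
cat "$tmpdir/mergelist.txt"
```

Redirecting the whole loop with `>` rewrites the list in one go, which is what makes repeated runs harmless.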

Finally, all the sample-specific annotations were merged into one final annotation.

In [6]:
%%bash
cd /hydra/neiro_bam
less stringtie_merge.sh

#!/bin/bash

~/software/stringtie-2.1.1/stringtie --merge -p 8 -G ~/sexual_genome_annotation_files/sexual_genome_annotation/smes_v2_hconf_SMESG.gtf -o stringtie_merged.gtf mergelist.txt


In [None]:
%%bash
cd /hydra/neiro_bam
nohup ./stringtie_merge.sh &
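
Once the merge has finished, a quick summary of the result helps confirm that the output is plausible. The sketch below counts transcript records and distinct `gene_id` values in a GTF file, shown here on invented records in the style of StringTie output. The helper name `summarise_gtf` and the toy data are hypothetical; no numbers for the real `stringtie_merged.gtf` are implied.

```shell
# Hypothetical helper: count transcripts and distinct genes in a GTF file.
summarise_gtf() {
  awk -F'\t' '
    $3 == "transcript" {
      n_tx++
      if (match($9, /gene_id "[^"]+"/)) genes[substr($9, RSTART, RLENGTH)] = 1
    }
    END {
      n_genes = 0
      for (g in genes) n_genes++
      printf "transcripts=%d genes=%d\n", n_tx, n_genes
    }' "$1"
}

# Toy input: three transcripts belonging to two genes, plus one exon record.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy_merged.gtf"
{
  printf 'chr1\tStringTie\ttranscript\t100\t500\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n'
  printf 'chr1\tStringTie\texon\t100\t500\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n'
  printf 'chr1\tStringTie\ttranscript\t150\t480\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.2";\n'
  printf 'chr1\tStringTie\ttranscript\t900\t990\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";\n'
} > "$toy"

summarise_gtf "$toy"
```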

# FINISHED