# Generating genomic feature tracks

In [1]:
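# assign as a Python variable: an assignment made with `!` lives in a throwaway subshell,
# so $reference would be empty in later cells and grep would wait on stdin
reference = "/project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/CV_genomic.gff"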

In [5]:
pwd

'/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks'

## Genes

In [8]:
# find features with type == gene
!grep "Gnomon	gene" $reference > CV_gene.gff

In [None]:
# convert GFF to BED: GFF starts are 1-based, BED starts are 0-based, so subtract 1 from the start
!awk -F'\t' '{print $1"\t"($4-1)"\t"$5}' CV_gene.gff \
> CV_gene.bed
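
A note on coordinates when converting GFF to BED: GFF intervals are 1-based and end-inclusive, while BED intervals are 0-based and end-exclusive, so the start shifts down by one and the end stays the same. The sketch below shows this on two made-up records; the toy file names and coordinates are assumptions for illustration only.

```shell
# Toy GFF3 records (tab-separated); coordinates are made up for illustration.
tmp=$(mktemp -d)
printf 'chr1\tGnomon\tgene\t1\t100\t.\t+\t.\tID=gene0\n'   >  "$tmp/toy.gff"
printf 'chr1\tGnomon\tgene\t201\t300\t.\t-\t.\tID=gene1\n' >> "$tmp/toy.gff"

# Keep chromosome, start, end; shift only the start to 0-based.
awk -F'\t' '{print $1"\t"($4-1)"\t"$5}' "$tmp/toy.gff" > "$tmp/toy.bed"

cat "$tmp/toy.bed"
```

The first toy gene, which covers bases 1 to 100 in the GFF, becomes `chr1 0 100` in BED, so the interval length (end minus start) is still 100.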

## Exons

In [None]:
# find features with type == exon
!grep "Gnomon	exon" $reference > CV_exon.gff

In [None]:
# convert GFF to BED: GFF starts are 1-based, BED starts are 0-based, so subtract 1 from the start
!awk -F'\t' '{print $1"\t"($4-1)"\t"$5}' CV_exon.gff \
> CV_exon.bed

## CDS

In [None]:
# find features with type == CDS
!grep "Gnomon	CDS" $reference > CV_CDS.gff

In [None]:
# convert GFF to BED: GFF starts are 1-based, BED starts are 0-based, so subtract 1 from the start
!awk -F'\t' '{print $1"\t"($4-1)"\t"$5}' CV_CDS.gff \
> CV_CDS.bed

## mRNA

In [None]:
# find features with type == mRNA
!grep "Gnomon	mRNA" $reference > CV_mRNA.gff

In [None]:
# convert GFF to BED: GFF starts are 1-based, BED starts are 0-based, so subtract 1 from the start
!awk -F'\t' '{print $1"\t"($4-1)"\t"$5}' CV_mRNA.gff \
> CV_mRNA.bed

### introns

Introns are the regions between exons within a gene. One way to pull them out is to work gene by gene (LOC number) and take each gap that runs from the end of one exon to the start of the next.

Alternatively, a file of non-coding regions can be built from the original GFF file as the complement of the exons. Introns are then, by definition, the intersection of those non-coding regions with genes.

The cells here take the second approach, following the pipeline from [Venkataraman et al 2020](https://www.frontiersin.org/journals/marine-science/articles/10.3389/fmars.2020.00225/full#h7)
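
To make the first idea concrete (gaps between consecutive exons of the same gene), here is a minimal awk sketch on toy data. It is not the pipeline used in this notebook: the exon BED with a gene ID in column 4 is a hypothetical input, and rows are assumed to be sorted by gene and then by start.

```shell
tmp=$(mktemp -d)
# Hypothetical exon BED: chrom, start, end, gene ID (column 4 is an assumption).
printf 'chr1\t100\t200\tLOC1\n' >  "$tmp/exons.bed"
printf 'chr1\t300\t400\tLOC1\n' >> "$tmp/exons.bed"
printf 'chr1\t550\t600\tLOC1\n' >> "$tmp/exons.bed"
printf 'chr1\t900\t950\tLOC2\n' >> "$tmp/exons.bed"

awk -F'\t' -v OFS='\t' '
    # Same gene as the previous row and a gap after the furthest exon end: that gap is an intron.
    $4 == gene && $2 > prev_end { print $1, prev_end, $2, $4 }
    {
        # Reset the running end for a new gene, otherwise keep the furthest end seen so far.
        if ($4 != gene || $3 > prev_end) prev_end = $3
        gene = $4
    }
' "$tmp/exons.bed" > "$tmp/introns.bed"

cat "$tmp/introns.bed"
```

On the toy input this reports two introns for LOC1 (200 to 300 and 400 to 550) and none for the single-exon LOC2. The complement-and-intersect approach should give the same intervals for a gene that does not overlap other genes, but it is not keyed on gene ID, so results can differ where genes overlap.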

In [None]:
# run in command line
complementBed -i CV_sorted_exons.gff3 -g 2018-06-15-bedtools-Chromosome-Lengths.txt > CV_noncoding.gff3

In [None]:
# run in command line
!intersectBed \
-a CV_noncoding.gff3 \
-b CV_sorted_gene.gff3 -sorted \
> CV_sorted_intron.gff3

## Intergenic regions
Regions of the genome that do not fall within genes.

`complementBed` finds the regions that aren't genes; `subtractBed` can then remove any exons that remain to create the final track.
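
As an illustration of what the complement step computes, the awk sketch below walks along each chromosome and reports the gaps not covered by any gene. It is only a toy re-implementation for intuition, not a replacement for `bedtools complement`; the file names and coordinates are made up, and the gene BED is assumed to be sorted by chromosome and start.

```shell
tmp=$(mktemp -d)
# Toy genome file: chromosome name and length (tab-separated).
printf 'chr1\t1000\nchr2\t500\n' > "$tmp/genome.txt"
# Toy gene BED, sorted; the first two genes overlap.
printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t600\t700\n' > "$tmp/genes.bed"

awk -F'\t' -v OFS='\t' '
    # First file: remember chromosome lengths and their order.
    NR == FNR { len[$1] = $2; order[++n] = $1; next }
    {
        if (!($1 in cursor)) cursor[$1] = 0
        # A gene starting beyond the covered region leaves an uncovered gap before it.
        if ($2 > cursor[$1]) gaps[$1] = gaps[$1] $1 OFS cursor[$1] OFS $2 "\n"
        # Extend the covered region to the furthest gene end seen so far.
        if ($3 > cursor[$1]) cursor[$1] = $3
    }
    END {
        for (i = 1; i <= n; i++) {
            c = order[i]
            printf "%s", gaps[c]
            # Whatever remains up to the chromosome end is also uncovered.
            if (cursor[c] + 0 < len[c]) print c, cursor[c] + 0, len[c]
        }
    }
' "$tmp/genome.txt" "$tmp/genes.bed" > "$tmp/intergenic.bed"

cat "$tmp/intergenic.bed"
```

In this sketch a chromosome with no genes at all (chr2 in the toy data) is reported as one interval spanning its full length.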

In [None]:
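# index the reference FASTA; this writes <fasta>.fai next to it
samtools faidx /project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/GCF_002022765.2_C_virginica-3.0_genomic.fna
# build a bedtools genome file (tab-separated name and length): in a .fai, column 1 is the sequence name and column 2 is its length
awk -v OFS='\t' '{print $1, $2}' /project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/GCF_002022765.2_C_virginica-3.0_genomic.fna.fai > /project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/GCF_C_virginica-3.0_genomic.fai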
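
For reference, a `.fai` index has five tab-separated columns: sequence name, length, byte offset, bases per line, and bytes per line. A bedtools genome file needs only the first two. A minimal sketch on a made-up index (the toy file names and values are assumptions):

```shell
tmp=$(mktemp -d)
# Toy .fai: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH (values made up).
printf 'chr1\t1000\t6\t80\t81\n'   >  "$tmp/toy.fna.fai"
printf 'chr2\t500\t1025\t80\t81\n' >> "$tmp/toy.fna.fai"

# Keep name and length only; cut preserves the tab delimiter.
cut -f1,2 "$tmp/toy.fna.fai" > "$tmp/toy.genome.txt"

cat "$tmp/toy.genome.txt"
```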

In [None]:
genes="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_gene.bed"
genome="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/2018-06-15-bedtools-Chromosome-Lengths.txt"
output_dir="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/"

bedtools complement -i $genes -g $genome > ${output_dir}CV_intergenic.gff3

### putative promoters
Defined here as the 1 kb upstream of the transcription start site (TSS).

`bedtools flank` can find the regions 1000 bp upstream and downstream of each mRNA. For mRNAs on the + strand the left flank is upstream, so keeping only the odd rows of the output gives the putative promoters. This relies on every mRNA producing exactly two flank rows.
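
The sketch below shows the same idea in a strand-aware form on toy data: the upstream 1 kb is taken to the left of the start for + strand mRNAs and to the right of the end for - strand mRNAs, clamped to the chromosome bounds. It is a plain awk illustration with made-up coordinates and file names, not the method used in this notebook. `bedtools flank` also has `-l`, `-r`, and `-s` options, so something like `-l 1000 -r 0 -s` should give a comparable strand-aware result.

```shell
tmp=$(mktemp -d)
printf 'chr1\t5000\n' > "$tmp/genome.txt"
# Toy mRNA records in GFF3 (1-based, inclusive); coordinates are made up.
printf 'chr1\tGnomon\tmRNA\t2001\t3000\t.\t+\t.\tID=rna0\n' >  "$tmp/mrna.gff"
printf 'chr1\tGnomon\tmRNA\t501\t900\t.\t+\t.\tID=rna1\n'   >> "$tmp/mrna.gff"
printf 'chr1\tGnomon\tmRNA\t3501\t4500\t.\t-\t.\tID=rna2\n' >> "$tmp/mrna.gff"

awk -F'\t' -v OFS='\t' -v flank=1000 '
    NR == FNR { len[$1] = $2; next }          # first file: chromosome lengths
    $7 != "+" && $7 != "-" { next }           # skip unstranded records
    # + strand: TSS is the GFF start; upstream lies to the left (BED coordinates).
    $7 == "+" { s = $4 - 1 - flank; if (s < 0) s = 0; e = $4 - 1 }
    # - strand: TSS is the GFF end; upstream lies to the right.
    $7 == "-" { s = $5; e = $5 + flank; if (e > len[$1]) e = len[$1] }
    e > s { print $1, s, e, ".", ".", $7 }    # drop zero-length flanks
' "$tmp/genome.txt" "$tmp/mrna.gff" > "$tmp/promoters.bed"

cat "$tmp/promoters.bed"
```

The second toy mRNA starts 500 bp from the chromosome start, so its flank is clamped to 0 to 500, and the - strand flank is clamped at the chromosome end.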



In [None]:
mRNA="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_mRNA.gff3"

grep ".	+	." $mRNA > ${output_dir}CV_+strand_mRNA.bed

In [None]:
pos_mRNA="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_+strand_mRNA.bed"

flankBed -i $pos_mRNA -g $genome -b 1000 > mRNA_1000bp_flanks.bed

In [None]:
flanks="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/mRNA_1000bp_flanks.bed"

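# flankBed writes the left flank and then the right flank for each mRNA;
# on the + strand the left flank (odd rows) is upstream and the right flank (even rows) is downstream
awk '{ if (NR%2) print > "mRNA_upstream_flanks.bed"; \
else print > "mRNA_downstream_flanks.bed" }' \
$flanks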

### exon UTRs

In [None]:
exons="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_sorted_exon.bed"
CDS="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_CDS.bed"

In [None]:
bedtools sort -i $CDS > ${output_dir}CV_sorted_CDS.gff3

CDS_sorted="/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_sorted_CDS.gff3"

In [None]:
# exon sequence that is not covered by CDS is treated as UTR
subtractBed -a $exons -b $CDS_sorted -sorted -g $genome > ${output_dir}CV_exonUTR.gff3
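
To illustrate what the subtraction is expected to return, the awk sketch below removes CDS intervals from exon intervals on toy data, leaving the untranslated parts of the exons. It is a simplified illustration rather than a stand-in for `subtractBed`: the coordinates and file names are made up, and it assumes each exon overlaps at most one CDS interval.

```shell
tmp=$(mktemp -d)
# Toy exon and CDS BED files for a single transcript (made-up coordinates).
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\n' > "$tmp/exons.bed"
printf 'chr1\t150\t200\nchr1\t300\t400\nchr1\t500\t550\n' > "$tmp/cds.bed"

awk -F'\t' -v OFS='\t' '
    # First file: load CDS intervals.
    NR == FNR { cc[++n] = $1; cs[n] = $2 + 0; ce[n] = $3 + 0; next }
    {
        hit = 0
        for (i = 1; i <= n; i++) {
            # Skip CDS intervals that do not overlap this exon.
            if (cc[i] != $1 || ce[i] <= $2 || cs[i] >= $3) continue
            hit = 1
            if (cs[i] > $2) print $1, $2, cs[i]   # exon part before the CDS
            if (ce[i] < $3) print $1, ce[i], $3   # exon part after the CDS
        }
        if (!hit) print $1, $2, $3                # fully non-coding exon
    }
' "$tmp/cds.bed" "$tmp/exons.bed" > "$tmp/utr.bed"

cat "$tmp/utr.bed"
```

Here the first exon keeps its leading 50 bp (100 to 150) and the last exon keeps its trailing 50 bp (550 to 600); the middle exon is entirely coding and drops out.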

### transposable elements