# Generating Genome Feature Tracks

I will create genome feature tracks for use in downstream analyses. Pre-made genome feature tracks exist, but generating them myself clarifies exactly which elements each track contains.

1. Download *C. virginica* genome annotation file
2. Separate various tracks
3. Visualize tracks in IGV
4. Set variable paths
5. Characterize track overlap with CG motifs

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/yaamini-virginica/notebooks'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/yaamini-virginica/analyses


In [3]:
!mkdir 2019-05-13-Generating-Genome-Feature-Tracks

In [4]:
cd 2019-05-13-Generating-Genome-Feature-Tracks/

/Users/yaamini/Documents/yaamini-virginica/analyses/2019-05-13-Generating-Genome-Feature-Tracks


## 1. Download *C. virginica* genome annotation (GFF3) from NCBI

In [5]:
!curl ftp://ftp.ncbi.nlm.nih.gov/genomes/Crassostrea_virginica/GFF/ref_C_virginica-3.0_top_level.gff3.gz > ref_C_virginica-3.0_top_level.gff3.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.2M  100 16.2M    0     0  5833k      0  0:00:02  0:00:02 --:--:-- 5971k


In [8]:
!gunzip ref_C_virginica-3.0_top_level.gff3.gz

In [9]:
!head ref_C_virginica-3.0_top_level.gff3

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-date 14 September 2017
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1	RefSeq	region	1	65668440	.	+	.	ID=id0;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample


In [None]:
!grep -v "^#" ref_C_virginica-3.0_top_level.gff3 | cut -f3 | sort | uniq

The feature types in column 3 of this file include `region` (seen in the `head` output above), `gene`, `exon`, `mRNA`, `CDS`, and `lnc_RNA`. I can separate tracks based on these types.
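
For a per-type count rather than only the list of names, column 3 can also be tallied in Python. This is a minimal sketch, assuming a tab-delimited GFF3 stream. The function name `count_feature_types` and the rows in `demo` are made up for illustration and are not taken from the NCBI file.

```python
from collections import Counter
import io

def count_feature_types(handle):
    """Tally column 3 (feature type) of a GFF3 stream, skipping comment lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # pragma/comment lines and blank lines carry no feature
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

# Made-up GFF3 rows for illustration only (assumption, not real annotation)
demo = io.StringIO(
    "##gff-version 3\n"
    "chr1\tRefSeq\tregion\t1\t1000\t.\t+\t.\tID=id0\n"
    "chr1\tGnomon\tgene\t10\t500\t.\t+\t.\tID=gene0\n"
    "chr1\tGnomon\tmRNA\t10\t500\t.\t+\t.\tID=rna0;Parent=gene0\n"
    "chr1\tGnomon\texon\t10\t100\t.\t+\t.\tID=e1;Parent=rna0\n"
    "chr1\tGnomon\texon\t200\t500\t.\t+\t.\tID=e2;Parent=rna0\n"
)

feature_counts = count_feature_types(demo)
for feature, n in sorted(feature_counts.items()):
    print(feature, n)
```

To run this on the real annotation, pass `open("ref_C_virginica-3.0_top_level.gff3")` in place of `demo`.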

## 2. Separate tracks

In [None]:
!grep "Gnomon	gene" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_gene.gff3

In [None]:
!head C_virginica-3.0_Gnomon_gene.gff3

In [None]:
!wc -l C_virginica-3.0_Gnomon_gene.gff3

In [None]:
!grep "Gnomon	exon" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_exon_yrv.gff3

In [None]:
!head C_virginica-3.0_Gnomon_exon_yrv.gff3

In [None]:
!wc -l C_virginica-3.0_Gnomon_exon_yrv.gff3
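
The `grep` patterns above match the text `Gnomon<TAB>gene` or `Gnomon<TAB>exon` anywhere in a line, so any feature type that merely begins with `gene` or `exon` would also pass. A stricter alternative is to compare columns 2 and 3 exactly. The sketch below assumes nine tab-delimited GFF3 columns. The helper name `filter_gff3` and the rows in `demo` are hypothetical, with `gene_segment` included only to show the difference.

```python
import io

def filter_gff3(handle, source, feature_type):
    """Yield GFF3 rows whose source (column 2) and type (column 3) match exactly."""
    for line in handle:
        if line.startswith("#"):
            continue  # skip pragma and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[1] == source and fields[2] == feature_type:
            yield line

# Made-up rows for illustration only (assumption, not real annotation)
demo = io.StringIO(
    "##gff-version 3\n"
    "chr1\tGnomon\tgene\t10\t500\t.\t+\t.\tID=gene0\n"
    "chr1\tRefSeq\tgene\t600\t900\t.\t-\t.\tID=gene1\n"
    "chr1\tGnomon\tgene_segment\t950\t990\t.\t+\t.\tID=seg0\n"
    "chr1\tGnomon\texon\t10\t100\t.\t+\t.\tID=e1;Parent=rna0\n"
)

gene_rows = list(filter_gff3(demo, "Gnomon", "gene"))
print(len(gene_rows), "Gnomon gene rows")
```

Writing `gene_rows` to a file with `writelines` would give a track equivalent to the `grep` output, minus any partial matches.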

## 3. Visualize in IGV

## 4. Set variable paths

In [6]:
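fullGenome = "../../data/C_virginica-3.0_genomic.fa"
# bedtoolsDirectory (used in Section 5) is not defined in this notebook; set it to the bedtools bin path, with a trailing slash, before running those cells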

## 5. Characterize CG motif locations

In [7]:
#Count the number of CG dinucleotides in the full genome (case-insensitive, matched line by line)
!fgrep -o -i CG {fullGenome} | wc -l

 14277725
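
Because `fgrep` works line by line, a `C` at the end of one FASTA line followed by a `G` at the start of the next is not counted, and with `-i` any `cg` inside a header line would be. The sketch below is one way to check the count with those two cases handled. The function name `count_cg` and the sequences in `demo` are made up for illustration, and the result on the real genome may differ from the number above.

```python
import io

def count_cg(handle):
    """Count CG dinucleotides in a FASTA stream, case-insensitively,
    including pairs split across wrapped lines within one record."""
    total = 0
    prev_last = ""  # last base of the previous sequence line in this record
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            prev_last = ""  # header line: never join bases across records
            continue
        seq = line.upper()
        total += seq.count("CG")
        if prev_last == "C" and seq[0] == "G":
            total += 1  # CG split by a line wrap
        prev_last = seq[-1]
    return total

# Made-up two-record FASTA for illustration only (assumption)
demo = io.StringIO(">seq1 cg-rich example\nACGTC\nGTTcg\n>seq2\nGCGC\n")
cg_total = count_cg(demo)
print(cg_total)
```

Passing `open(fullGenome)` instead of `demo` would apply the same logic to the genome file.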


In [None]:
!{bedtoolsDirectory}intersectBed \
-u \
-a ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_CG-motif.bed \
-b C_virginica-3.0_Gnomon_gene.gff3 \
| wc -l
!echo "CG motifs overlap with genes"

In [None]:
!{bedtoolsDirectory}intersectBed \
-u \
-a ../2018-11-01-DML-and-DMR-Analysis/C_virginica-3.0_CG-motif.bed \
-b C_virginica-3.0_Gnomon_exon_yrv.gff3 \
| wc -l
!echo "CG motifs overlap with exons"