# Characterizing genomic regions
I want to know which genomic features the methylation falls in (e.g., exons, introns, promoters). This should be recoverable from the NCBI annotation file ([GFF file](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002022765.2/)).

Following the pipeline from [Venkataraman et al. 2022](https://github.com/epigeneticstoocean/paper-gonad-meth/tree/master).


## 0. library setup and file processing

In [1]:
library(tidyverse)
library(ape) # for read.gff function
library(rtracklayer) 

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘ape’


The following object is masked from ‘package:dplyr’:

    where


Loading required package: GenomicRanges

Loading required package: stats4



In [2]:
# Load the GFF file
gff_data <- import.gff('/project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/CV_genomic.gff')

In [3]:
# this reads in GFF as a data table
gff <- read.gff('/project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Cvirginica_genome/CV_genomic.gff')
head(gff)

,seqid,source,type,start,end,score,strand,phase,attributes
,<fct>,<fct>,<fct>,<int>,<int>,<dbl>,<fct>,<fct>,<chr>
1,NC_035780.1,RefSeq,region,1,65668440,,+,,ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
2,NC_035780.1,Gnomon,gene,13578,14594,,+,,ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
3,NC_035780.1,Gnomon,lnc_RNA,13578,14594,,+,,"ID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
4,NC_035780.1,Gnomon,exon,13578,13603,,+,,"ID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
5,NC_035780.1,Gnomon,exon,14237,14290,,+,,"ID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
6,NC_035780.1,Gnomon,exon,14557,14594,,+,,"ID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"


Create two new columns from the attributes column: the gene name (LOCxxxx) and the product description.

In [4]:
gff2 <- gff %>% mutate(gene = str_extract(attributes, "(?<=gene=)([^;]+)"),
    product = str_extract(attributes, "(?<=product=)([^;]+)")
  )

head(gff2)

,seqid,source,type,start,end,score,strand,phase,attributes,gene,product
,<fct>,<fct>,<fct>,<int>,<int>,<dbl>,<fct>,<fct>,<chr>,<chr>,<chr>
1,NC_035780.1,RefSeq,region,1,65668440,,+,,ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample,,
2,NC_035780.1,Gnomon,gene,13578,14594,,+,,ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA,LOC111116054,
3,NC_035780.1,Gnomon,lnc_RNA,13578,14594,,+,,"ID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1",LOC111116054,uncharacterized LOC111116054
4,NC_035780.1,Gnomon,exon,13578,13603,,+,,"ID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1",LOC111116054,uncharacterized LOC111116054
5,NC_035780.1,Gnomon,exon,14237,14290,,+,,"ID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1",LOC111116054,uncharacterized LOC111116054
6,NC_035780.1,Gnomon,exon,14557,14594,,+,,"ID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1",LOC111116054,uncharacterized LOC111116054
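
The lookbehind pattern returns whatever follows the first `gene=` or `product=` in each attributes string. As a rough illustration of the same idea in base R, the sketch below uses a hypothetical helper, `get_attr()` (my own name, not a function from ape, stringr, or rtracklayer), that only accepts the key at the start of the string or right after a `;`, and returns `NA` when the key is missing. It assumes the key contains no regex special characters.

```r
# Hypothetical helper (not from any package): pull one key=value field out of
# a GFF attributes string. The key must start the string or follow a ";".
get_attr <- function(attributes, key) {
  pattern <- paste0("(?:^|;)", key, "=([^;]*)")
  hits <- regmatches(attributes, regexec(pattern, attributes, perl = TRUE))
  # each hit is c(full match, captured value); no match gives character(0)
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

# Two attribute strings copied from the head(gff) output (second one shortened)
attrs <- c(
  "ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA",
  "ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1"
)

get_attr(attrs, "gene")          # "LOC111116054" NA
get_attr(attrs, "gene_biotype")  # "lncRNA"       NA
```

Anchoring the key this way keeps a field such as `pseudogene=` (hypothetical) from being read as `gene=`, which a bare lookbehind on `gene=` would allow.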


Clean up the data frame so it holds only the columns I want, in the order I want.

In [5]:
gff3 <- select(gff2, seqid, type, start, end, strand, gene, product, attributes)
head(gff3)

,seqid,type,start,end,strand,gene,product,attributes
,<fct>,<fct>,<int>,<int>,<fct>,<chr>,<chr>,<chr>
1,NC_035780.1,region,1,65668440,+,,,ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
2,NC_035780.1,gene,13578,14594,+,LOC111116054,,ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
3,NC_035780.1,lnc_RNA,13578,14594,+,LOC111116054,uncharacterized LOC111116054,"ID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
4,NC_035780.1,exon,13578,13603,+,LOC111116054,uncharacterized LOC111116054,"ID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
5,NC_035780.1,exon,14237,14290,+,LOC111116054,uncharacterized LOC111116054,"ID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"
6,NC_035780.1,exon,14557,14594,+,LOC111116054,uncharacterized LOC111116054,"ID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1"


In [6]:
# count the gene features: the first value of dim() is the number of rows
dim(filter(gff, type == "gene"))

Check which feature types are present in the GFF file:

In [7]:
unique(gff3$type)

## Generate feature tracks

[Venkataraman et al. 2020](https://www.frontiersin.org/journals/marine-science/articles/10.3389/fmars.2020.00225/full#h7) describe a pipeline for extracting additional features from the *C. virginica* GFF file; I follow and adapt it here.

They were able to get:
- putative promoters
- UTRs
- Exons
- Introns
- Transposable elements
- Intergenic
- Other
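
Putative promoters are not annotated in the GFF, so they have to be derived from gene coordinates. A minimal sketch of one way to do that, assuming a promoter is the window immediately upstream of each gene and that "upstream" depends on strand. The function name `putative_promoters()` and the 1000 bp default are assumptions for illustration; the window size should be checked against the paper before use.

```r
# Hypothetical helper: strand-aware upstream windows for a gene table with
# columns seqid, start, end, strand, gene (1-based inclusive, as in GFF).
putative_promoters <- function(genes, width = 1000) {
  plus <- genes$strand == "+"
  # "+" genes: window ends just before the gene start
  # "-" genes: window begins just after the gene end
  start <- ifelse(plus, genes$start - width, genes$end + 1)
  end   <- ifelse(plus, genes$start - 1,     genes$end + width)
  out <- data.frame(seqid = genes$seqid,
                    start = pmax(start, 1),  # do not run off the chromosome start
                    end = end,
                    strand = genes$strand,
                    gene = genes$gene)
  # drop windows that vanish because the gene begins at position 1
  out[out$end >= out$start, ]
}

# Two genes taken from the CV_gene output in this notebook
genes <- data.frame(seqid = "NC_035780.1",
                    start = c(13578, 43111),
                    end = c(14594, 66897),
                    strand = c("+", "-"),
                    gene = c("LOC111116054", "LOC111110729"))
prom <- putative_promoters(genes)
prom
# "+" gene: 12578-13577 ; "-" gene: 66898-67897
```

This sketch does not clip windows at the chromosome end for `-` strand genes; that would need the chromosome lengths.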


I want to export **gff3** files so the tracks keep the same format, including the attributes column at the end of each row.


### genes

In [9]:
head(gff_data)

GRanges object with 6 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1  1-65668440      + |   RefSeq  region         NA      <NA>
  [2] NC_035780.1 13578-14594      + |   Gnomon  gene           NA      <NA>
  [3] NC_035780.1 13578-14594      + |   Gnomon  lnc_RNA        NA      <NA>
  [4] NC_035780.1 13578-13603      + |   Gnomon  exon           NA      <NA>
  [5] NC_035780.1 14237-14290      + |   Gnomon  exon           NA      <NA>
  [6] NC_035780.1 14557-14594      + |   Gnomon  exon           NA      <NA>
                          ID                                  Dbxref
                 <character>                         <CharacterList>
  [1] NC_035780.1:1..65668..                              taxon:6565
  [2]      gene-LOC111116054                        GeneID:111116054
  [3]     rna-XR_002636969.1 GeneID:111116054,Genbank:

In [14]:
# Select rows where type == "gene"
gene <- gff_data[gff_data$type == "gene", ]
head(gene, 3)

# Save the selected rows to a new GFF file
export.gff(gene, "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_gene.gff3")

GRanges object with 3 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1 13578-14594      + |   Gnomon     gene        NA      <NA>
  [2] NC_035780.1 28961-33324      + |   Gnomon     gene        NA      <NA>
  [3] NC_035780.1 43111-66897      - |   Gnomon     gene        NA      <NA>
                     ID           Dbxref         Name  chromosome
            <character>  <CharacterList>  <character> <character>
  [1] gene-LOC111116054 GeneID:111116054 LOC111116054        <NA>
  [2] gene-LOC111126949 GeneID:111126949 LOC111126949        <NA>
  [3] gene-LOC111110729 GeneID:111110729 LOC111110729        <NA>
      collection-date     country       gbkey      genome     isolate
          <character> <character> <character> <character> <character>
  [1]            <NA>        <NA>        Gene        <NA>        <NA>
  [2]            <NA>

In [23]:
setwd('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks')
getwd()

In [29]:
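# bedtools runs in the shell; the IPython-style "!" prefix is not valid in an R
# kernel (it fails with "attempt to use zero-length variable name"), so call the
# shell through system(), which starts in the working directory set above
system("sortBed -i CV_gene.gff3 > CV_sorted_gene.gff3")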
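
Because this notebook runs an R kernel, shell tools such as bedtools have to be launched from R rather than with a `!` prefix. A minimal sketch of a wrapper around base R's `system2()` that writes the tool's standard output to a file and stops if the tool is missing or fails. The name `run_to_file()` is hypothetical, not part of bedtools or any R package.

```r
# Hypothetical wrapper: run a command line tool from R, send its standard
# output to a file, and stop with an error if anything goes wrong.
run_to_file <- function(command, args, outfile) {
  if (Sys.which(command) == "") {
    stop("command not found on PATH: ", command)
  }
  # with stdout set to a file name, system2() returns the exit status
  status <- system2(command, args = args, stdout = outfile)
  if (status != 0) {
    stop(command, " exited with status ", status)
  }
  invisible(outfile)
}

# Intended use in this notebook (requires bedtools to be installed):
# run_to_file("sortBed", c("-i", "CV_gene.gff3"), "CV_sorted_gene.gff3")
```

Checking the exit status matters here because a failed sort would otherwise leave an empty output file that later steps read without complaint.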


### exons

In [15]:
# Select rows where type == "exon"
exons <- gff_data[gff_data$type == "exon", ]
head(exons, 3)

# Save the selected rows to a new GFF file
export.gff(exons, "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_exons.gff3")

# save as bed file
exons$score[is.na(exons$score)] <- 0
export.bed(exons, "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_exons.bed")

GRanges object with 3 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1 13578-13603      + |   Gnomon     exon        NA      <NA>
  [2] NC_035780.1 14237-14290      + |   Gnomon     exon        NA      <NA>
  [3] NC_035780.1 14557-14594      + |   Gnomon     exon        NA      <NA>
                         ID                                  Dbxref        Name
                <character>                         <CharacterList> <character>
  [1] exon-XR_002636969.1-1 GeneID:111116054,Genbank:XR_002636969.1        <NA>
  [2] exon-XR_002636969.1-2 GeneID:111116054,Genbank:XR_002636969.1        <NA>
  [3] exon-XR_002636969.1-3 GeneID:111116054,Genbank:XR_002636969.1        <NA>
       chromosome collection-date     country       gbkey      genome
      <character>     <character> <character> <character> <character>
  [1]        <NA>    
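
GFF and BED count positions differently: GFF coordinates are 1-based and inclusive, while BED starts are 0-based and BED ends are exclusive. rtracklayer is expected to make this shift when it writes the BED file, which is worth confirming by comparing the first lines of `CV_exons.bed` with the GFF. The sketch below shows the conversion by hand on a plain data frame; `gff_to_bed()` is a hypothetical helper, not an rtracklayer function.

```r
# Hypothetical helper: convert a GFF-style table (columns seqid, start, end,
# strand, gene) to the six standard BED columns.
gff_to_bed <- function(gff) {
  data.frame(
    chrom  = gff$seqid,
    start  = gff$start - 1,  # GFF is 1-based; BED starts are 0-based
    end    = gff$end,        # BED ends are exclusive, so the number is unchanged
    name   = gff$gene,
    score  = 0,              # placeholder, mirroring the NA -> 0 step above
    strand = gff$strand
  )
}

# First exon from the output above
exon1 <- data.frame(seqid = "NC_035780.1", start = 13578, end = 13603,
                    strand = "+", gene = "LOC111116054")
bed1 <- gff_to_bed(exon1)
bed1$start              # 13577
bed1$end                # 13603
bed1$end - bed1$start   # 26, the same length as 13603 - 13578 + 1
```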

In [None]:
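# run in command line (here through system(), since "!" is IPython syntax and not valid R)
# the input is CV_exons.gff3, the file name written by export.gff() above
system("sortBed -i CV_exons.gff3 > CV_sorted_exons.gff3")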

### coding regions

In [16]:
# Select rows where type == "CDS"
CDS <- gff_data[gff_data$type == "CDS", ]
head(CDS, 3)

# Save the selected rows to a new GFF file
export.gff(CDS, "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_CDS.gff3")

GRanges object with 3 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1 30535-31557      + |   Gnomon      CDS        NA         0
  [2] NC_035780.1 31736-31887      + |   Gnomon      CDS        NA         0
  [3] NC_035780.1 31977-32565      + |   Gnomon      CDS        NA         1
                      ID                                  Dbxref           Name
             <character>                         <CharacterList>    <character>
  [1] cds-XP_022327646.1 GeneID:111126949,Genbank:XP_022327646.1 XP_022327646.1
  [2] cds-XP_022327646.1 GeneID:111126949,Genbank:XP_022327646.1 XP_022327646.1
  [3] cds-XP_022327646.1 GeneID:111126949,Genbank:XP_022327646.1 XP_022327646.1
       chromosome collection-date     country       gbkey      genome
      <character>     <character> <character> <character> <character>
  [1]        <NA>    

In [None]:
# run in command line (here through system(), since "!" is IPython syntax and not valid R)
system("sortBed -i CV_CDS.gff3 > CV_sorted_CDS.gff3")

### mRNA

In [17]:
# Select rows where type == "mRNA"
mRNA <- gff_data[gff_data$type == "mRNA", ]
head(mRNA, 3)

# Save the selected rows to a new GFF file
export.gff(mRNA, "/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_mRNA.gff3")

GRanges object with 3 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1 28961-33324      + |   Gnomon     mRNA        NA      <NA>
  [2] NC_035780.1 43111-66897      - |   Gnomon     mRNA        NA      <NA>
  [3] NC_035780.1 43111-46506      - |   Gnomon     mRNA        NA      <NA>
                      ID                                  Dbxref           Name
             <character>                         <CharacterList>    <character>
  [1] rna-XM_022471938.1 GeneID:111126949,Genbank:XM_022471938.1 XM_022471938.1
  [2] rna-XM_022447324.1 GeneID:111110729,Genbank:XM_022447324.1 XM_022447324.1
  [3] rna-XM_022447333.1 GeneID:111110729,Genbank:XM_022447333.1 XM_022447333.1
       chromosome collection-date     country       gbkey      genome
      <character>     <character> <character> <character> <character>
  [1]        <NA>    

In [None]:
# run in command line (here through system(), since "!" is IPython syntax and not valid R)
system("sortBed -i CV_mRNA.gff3 > CV_sorted_mRNA.gff3")

### introns

Introns are the stretches between exons within a gene. To pull them out, I look within each gene (LOC number): an intron runs from one base past the end of exon 1 to one base before the start of exon 2.


A GFF file of non-coding (here: non-exon) regions can be built from the original GFF file as the complement of the exons. Introns are then, by definition, the intersection of those non-exon regions with genes.


Following the pipeline from [Venkataraman et al. 2020](https://www.frontiersin.org/journals/marine-science/articles/10.3389/fmars.2020.00225/full#h7).
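
To make the intron definition concrete, here is a small base-R sketch that finds the gaps between exons of the same gene. It is an illustration of the idea on a plain data frame, not the bedtools route used in the cells that follow. `introns_from_exons()` is a hypothetical helper, the `toyB` gene is made up, and the sketch assumes gene IDs are unique across chromosomes.

```r
# Hypothetical helper: introns as the gaps between exons of the same gene.
# exons: data frame with columns gene, start, end (1-based inclusive, as in GFF).
introns_from_exons <- function(exons) {
  per_gene <- lapply(split(exons, exons$gene), function(g) {
    g <- g[order(g$start, g$end), ]
    n <- nrow(g)
    if (n < 2) return(NULL)           # a single exon has no intron
    # the running maximum of exon ends copes with overlapping exons
    # coming from alternative transcripts
    reach <- cummax(g$end)[-n]
    gap_start <- reach + 1            # one base past the previous exon
    gap_end <- g$start[-1] - 1        # one base before the next exon
    keep <- gap_start <= gap_end
    if (!any(keep)) return(NULL)
    data.frame(gene = g$gene[1], start = gap_start[keep], end = gap_end[keep])
  })
  per_gene <- Filter(Negate(is.null), unname(per_gene))
  if (length(per_gene) == 0) {
    return(data.frame(gene = character(), start = numeric(), end = numeric()))
  }
  out <- do.call(rbind, per_gene)
  rownames(out) <- NULL
  out
}

exons_toy <- data.frame(
  gene  = c(rep("LOC111116054", 3), rep("toyB", 3)),
  start = c(13578, 14237, 14557, 100, 150, 400),
  end   = c(13603, 14290, 14594, 200, 300, 500)
)
introns_toy <- introns_from_exons(exons_toy)
introns_toy
# LOC111116054: 13604-14236 and 14291-14556 ; toyB: 301-399
```

The first three exons are the ones shown for LOC111116054 in this notebook, so its first intron runs from 13603 + 1 to 14237 - 1.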

In [18]:
head(exons)

GRanges object with 6 ranges and 32 metadata columns:
         seqnames      ranges strand |   source     type     score     phase
            <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1] NC_035780.1 13578-13603      + |   Gnomon     exon         0      <NA>
  [2] NC_035780.1 14237-14290      + |   Gnomon     exon         0      <NA>
  [3] NC_035780.1 14557-14594      + |   Gnomon     exon         0      <NA>
  [4] NC_035780.1 28961-29073      + |   Gnomon     exon         0      <NA>
  [5] NC_035780.1 30524-31557      + |   Gnomon     exon         0      <NA>
  [6] NC_035780.1 31736-31887      + |   Gnomon     exon         0      <NA>
                         ID                                  Dbxref        Name
                <character>                         <CharacterList> <character>
  [1] exon-XR_002636969.1-1 GeneID:111116054,Genbank:XR_002636969.1        <NA>
  [2] exon-XR_002636969.1-2 GeneID:111116054,Genbank:XR_002636969.1        <NA>
  [3] exon

In [None]:
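# run in command line (here through system())
# complementBed needs a genome file (chromosome name <tab> length) and reports the
# intervals NOT covered by exons as 3-column BED (0-based starts), despite the
# .gff3 extension on the output
system("complementBed -i CV_sorted_exons.gff3 -g 2018-06-15-bedtools-Chromosome-Lengths.txt > CV_noncoding.gff3")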
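
The `-g` argument is a bedtools "genome file": one line per sequence, with the sequence name and its length separated by a tab. The file used here is not created in this notebook. A minimal sketch of how one could be built from the GFF itself, assuming every sequence has a `region` row spanning its full length (as row 1 of the `head(gff)` output suggests). `make_genome_file()` is a hypothetical helper and `chrToy` is made-up data.

```r
# Hypothetical helper: write a bedtools genome file (name <tab> length) from
# the "region" rows of a GFF table with columns seqid, type, end.
make_genome_file <- function(gff, path) {
  regions <- gff[gff$type == "region", ]
  # longest region per sequence, with names in sorted order
  lengths <- tapply(regions$end, as.character(regions$seqid), max)
  genome <- data.frame(
    chrom = names(lengths),
    # format() avoids scientific notation such as 1e+05 in the output
    length = format(as.vector(lengths), scientific = FALSE, trim = TRUE)
  )
  write.table(genome, path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(genome)
}

gff_toy <- data.frame(
  seqid = c("chrToy", "chrToy", "NC_035780.1"),
  type  = c("region", "gene", "region"),
  start = c(1, 50, 1),
  end   = c(100000, 900, 65668440)
)
genome_path <- tempfile(fileext = ".txt")
make_genome_file(gff_toy, genome_path)
readLines(genome_path)
```

The sequences should appear in the same order as in the sorted feature files, so it is worth checking the result against the output of `sortBed`.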

In [None]:
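# run in command line (here through system(), since "!" is IPython syntax and not valid R)
# keep the parts of the non-exon intervals that fall inside genes, i.e. introns;
# -sorted expects both inputs sorted by chromosome, then start
system("intersectBed -a CV_noncoding.gff3 -b CV_sorted_gene.gff3 -sorted > CV_sorted_intron.gff3")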

ERROR: Error in scan(file, w, sep = "\t", quote = "", quiet = TRUE, na.strings = na.strings, : scan() expected 'an integer', got 'NC_035780.1'


# bedtools multicov

In [None]:
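# run in command line (here through system()); multiBamCov expects position-sorted,
# indexed BAM files and appends one count column per BAM (output is tab-delimited)
# note: no cell above writes CV_sorted_exon.bed (only CV_exons.bed is exported),
# so that file has to be created first
system("multiBamCov -bams *.bam -bed /project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_sorted_exon.bed > exon_multicov.csv")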

In [None]:
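# run in command line (here through system())
# note: no cell above writes CV_sorted_intron.bed (the intron track is written as
# CV_sorted_intron.gff3), so that file has to be created first
system("multiBamCov -bams *.bam -bed /project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/genomic_feature_tracks/CV_sorted_intron.bed > intron_multicov.csv")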
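
multiBamCov writes each feature row back out with one read count per BAM file appended, in the order the BAM files were given, and without a header. A minimal sketch for reading that output into R and labelling the count columns; `read_multicov()` and the sample names are hypothetical, and the number of feature columns is inferred rather than assumed because it depends on how the BED file was written.

```r
# Hypothetical helper: read multiBamCov output and name the count columns.
# bam_names must be in the same order as the BAM files on the command line.
read_multicov <- function(path, bam_names) {
  cov <- read.delim(path, header = FALSE, stringsAsFactors = FALSE)
  n_feature_cols <- ncol(cov) - length(bam_names)
  stopifnot(n_feature_cols >= 3)  # BED needs at least chrom, start, end
  names(cov)[1:3] <- c("chrom", "start", "end")
  names(cov)[n_feature_cols + seq_along(bam_names)] <- bam_names
  cov
}

# Toy file standing in for exon_multicov.csv: six BED columns plus two samples
toy_path <- tempfile(fileext = ".tsv")
writeLines(c("NC_035780.1\t13577\t13603\texon1\t0\t+\t5\t0",
             "NC_035780.1\t14236\t14290\texon2\t0\t+\t2\t7"),
           toy_path)
counts <- read_multicov(toy_path, c("sampleA", "sampleB"))
counts$total <- rowSums(counts[c("sampleA", "sampleB")])
counts[c("chrom", "start", "end", "sampleA", "sampleB", "total")]
```

Despite the `.csv` extension used above, the file is tab-delimited, which is why `read.delim()` is used instead of `read.csv()`.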