# R Notebook for ncbi.datasets Package

The objective of this notebook is to use the **ncbi.datasets R** package to extract gene information and sequence for a list of gene symbols.

In this example, we will show how to get information for a list of Drosophila
melanogaster genes cited in [A single-cell survey of Drosophila blood](https://pubmed.ncbi.nlm.nih.gov/32396065/).

## Citation
Tattikota SG, Cho B, Liu Y, Hu Y, Barrera V, Steinbaugh MJ, Yoon SH, Comjean A, 
   Li F, Dervis F, Hung RJ, Nam JW, Ho Sui S, Shim J, Perrimon N. A single-cell
   survey of *Drosophila* blood. Elife. 2020 May 12;9:e54818. 
   doi: 10.7554/eLife.54818. PMID: [32396065](https://pubmed.ncbi.nlm.nih.gov/32396065/); PMCID: [PMC7237219](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237219/).
   
## Installation
To start, we install the helper packages from CRAN and the ncbi.datasets R package from the NCBI FTP site.

In [1]:
local({r <- getOption("repos")
       r["CRAN"] <- "http://cran.r-project.org"
       options(repos=r)
})
if (!require(httr)) {
  install.packages("httr")
  library(httr)
}
if (!require(caTools)) {
  install.packages("caTools")
  library(caTools)
}
if (!require(knitr)) {
  install.packages("knitr")
  library(knitr)
}
if (!require(ncbi.datasets)) {
  install.packages("https://ftp.ncbi.nlm.nih.gov/pub/datasets/r_client_lib/ncbi.datasets_LATEST.tar.gz", repos = NULL)
  library(ncbi.datasets)
}

Loading required package: httr

Loading required package: caTools

Loading required package: knitr

Loading required package: ncbi.datasets
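
The install-if-missing pattern above is repeated once per package. As a minimal sketch, the check can be factored into a small helper that only reports which packages are not yet installed. The name `missing_packages` is hypothetical and not part of any package used here; `requireNamespace()` is base R and tests availability without attaching the package.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
# requireNamespace() checks availability without attaching the package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# CRAN packages used in this notebook
cran_deps <- c("httr", "caTools", "knitr")
to_install <- missing_packages(cran_deps)
if (length(to_install) > 0) {
  message("Packages still to install: ", paste(to_install, collapse = ", "))
}
```

The resulting vector could then be passed to `install.packages()` in a single call instead of one `if` block per package.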



## Create a GeneApi instance that manages the HTTP service calls to NCBI

In [1]:
api.gene_instance <- ncbi.datasets::GeneApi$new()

## Retrieve metadata for a list of D. melanogaster gene symbols

First, we'll get metadata for a list of Drosophila melanogaster gene symbols by passing the gene symbol list together with the organism's scientific name (a common name or NCBI Taxonomy ID is also accepted).

In [1]:
gene_symbols = c(
    'Ac76E',
    'Ac78C',
    'Acbp1',
    'Acbp2',
    'Acbp3',
    'Acbp4',
    'Acbp5',
    'Acbp6',
    'ACC',
    'AcCoAS',
    'Ace',
    'Acer',
    'Acf',
    'achi',
    'acj6',
    'Ack'
)

result <- api.gene_instance$GeneMetadataByTaxAndSymbol(
  paste(gene_symbols, collapse = ','),
  'Drosophila melanogaster')
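
The call above receives the symbols as one comma-separated string built with `paste(..., collapse = ',')`. The sketch below shows a slightly more defensive way to build that string when the symbol list comes from a file or spreadsheet. The helper name `symbols_to_query` is hypothetical. Letter case is left untouched on purpose, since the symbols in this list are mixed-case (`ACC`, `Ace`, `achi`).

```r
# Hypothetical helper: tidy a symbol vector and join it into the
# comma-separated string used as the first argument of the call above.
symbols_to_query <- function(symbols) {
  symbols <- trimws(symbols)              # strip stray whitespace
  symbols <- symbols[nzchar(symbols)]     # drop empty entries
  paste(unique(symbols), collapse = ",")  # de-duplicate, keep original case
}

symbols_to_query(c("Ac76E", " Ac78C", "Ac76E", ""))
# [1] "Ac76E,Ac78C"
```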

## Organize the metadata as a table

Using the metadata we retrieved in the previous step (and stored in `result`), let's organize it into a table. The extensive metadata report includes information on gene nomenclature, gene type, transcript and protein lengths, and coordinates for each gene's transcripts and coding regions.

In this example we'll generate a table including

* gene symbol
* gene name
* gene ID
* FlyBase ID
* gene type
* chromosome location, including the gene range
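
Building such a table with `t(sapply(...))` only yields a rectangular matrix if every gene contributes the same number of fields. A field that is `NULL` is silently dropped by `c()`, which makes the rows ragged. The sketch below illustrates one way to guard against that on mock data. The nested-list structure is an assumption that mimics only the fields accessed in this notebook, and `or_dash` is a hypothetical helper.

```r
# Hypothetical helper: replace a missing (NULL or empty) field with "-"
or_dash <- function(x) if (is.null(x) || length(x) == 0) "-" else as.character(x)

# Mock data (assumed structure); IDs and names copied from this notebook's output
mock_genes <- list(
  list(gene = list(symbol = "Acbp1",
                   description = "Acyl-CoA binding protein 1",
                   gene_id = "34111")),
  # description deliberately left out of the mock to show the fallback
  list(gene = list(symbol = "ACC",
                   gene_id = "35761"))
)

# sapply() returns one column per gene; t() turns that into one row per gene
mock_tbl <- t(sapply(mock_genes, function(g) {
  c(or_dash(g$gene$symbol),
    or_dash(g$gene$description),
    or_dash(g$gene$gene_id))
}))
colnames(mock_tbl) <- c("Gene Symbol", "Gene Name", "Gene ID")
mock_tbl
```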

In [1]:
metadata_tbl <- t(sapply(result$genes ,
    function(g) { c(
        g$gene$symbol,
        g$gene$description,
        g$gene$gene_id,
        ifelse(g$gene$nomenclature_authority$authority == "FLYBASE",
            g$gene$nomenclature_authority$identifier,'-'),
        gsub("\"", "", g$gene$type$toJSON()),
        paste(g$gene$chromosome, ":",
              g$gene$genomic_ranges[[1]]$range[[1]]$begin, "..",
              g$gene$genomic_ranges[[1]]$range[[1]]$end,
              sep = '') 
    )}))
colnames(metadata_tbl) <- c('Gene Symbol',
                            'Gene Name',
                            'Gene ID',
                            'Fly Base id',
                            'Gene Type',
                            'Chromosome location')
metadata_tbl

Gene Symbol,Gene Name,Gene ID,Fly Base id,Gene Type,Chromosome location
Acbp1,Acyl-CoA binding protein 1,34111,FBgn0031992,PROTEIN_CODING,2L:8161796..8162452
Acer,Angiotensin-converting enzyme-related,34189,FBgn0016122,PROTEIN_CODING,2L:8521933..8525740
ACC,Acetyl-CoA carboxylase,35761,FBgn0033246,PROTEIN_CODING,2R:7974115..7991862
achi,achintya,36373,FBgn0033749,PROTEIN_CODING,2R:12511534..12515062
Ack,Activated Cdc42 kinase,38489,FBgn0028484,PROTEIN_CODING,3L:4108186..4114157
Acbp4,Acyl-CoA binding protein 4,38781,FBgn0035742,PROTEIN_CODING,3L:7127913..7128264
Acbp6,Acyl-CoA binding protein 6,38782,FBgn0035743,PROTEIN_CODING,3L:7128903..7129673
Acbp3,Acyl-CoA binding protein 3,38783,FBgn0250836,PROTEIN_CODING,3L:7130874..7131505
Acbp2,Acyl-CoA binding protein 2,38784,FBgn0010387,PROTEIN_CODING,3L:7131863..7133463
Acbp5,Acyl-CoA binding protein 5,39005,FBgn0035926,PROTEIN_CODING,3L:8788144..8788545
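
The chromosome location column packs three values into one string. As a sketch, it can be split back into chromosome, begin and end to compute the span of each gene. The two location strings are copied from the table above. The `+ 1` assumes the coordinates are 1-based and inclusive, which is not stated in this notebook.

```r
# Location strings copied from the table above
locations <- c(Acbp1 = "2L:8161796..8162452", ACC = "2R:7974115..7991862")

# Capture groups: chromosome, begin, end
parts <- regmatches(locations,
                    regexec("^([^:]+):([0-9]+)\\.\\.([0-9]+)$", locations))

loc_df <- data.frame(
  symbol     = names(locations),
  chromosome = vapply(parts, `[`, character(1), 2),
  begin      = as.integer(vapply(parts, `[`, character(1), 3)),
  end        = as.integer(vapply(parts, `[`, character(1), 4)),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# Assumption: 1-based inclusive coordinates, hence the + 1
loc_df$span <- loc_df$end - loc_df$begin + 1L
loc_df
```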


## Report on a representative transcript and protein for each protein-coding gene in the list

In this example, we'll generate a table with the following information for the longest transcript of each gene.

*    gene symbol
*    gene id
*    transcript accession
*    transcript length
*    transcript coordinates on the genome
*    protein accession
*    protein length
*    CDS start and stop on the transcript
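
A representative transcript can be chosen by more than one criterion, and the choices do not always agree. The sketch below compares selection by transcript length with selection by protein length. The data are mock values: the accessions `TX_A.1`, `TX_B.1` and `TX_C.1` and all lengths are invented for illustration, and the list structure is an assumption mirroring the fields used in this notebook.

```r
# Mock transcripts for one gene (all values hypothetical)
mock_transcripts <- list(
  list(accession_version = "TX_A.1", length = 3000, protein = list(length = 400)),
  list(accession_version = "TX_B.1", length = 2500, protein = list(length = 650)),
  list(accession_version = "TX_C.1", length = 1200, protein = NULL)  # non-coding
)

tx_len <- vapply(mock_transcripts, function(tr) tr$length, numeric(1))

# Transcripts without a protein get NA, which which.max() ignores
prot_len <- vapply(mock_transcripts, function(tr) {
  if (is.null(tr$protein$length)) NA_real_ else tr$protein$length
}, numeric(1))

longest_tx   <- mock_transcripts[[which.max(tx_len)]]$accession_version
longest_prot <- mock_transcripts[[which.max(prot_len)]]$accession_version
c(by_transcript_length = longest_tx, by_protein_length = longest_prot)
```

Here the two criteria pick different transcripts, so it is worth stating which one a report uses.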

In [1]:
range_format <- function(range) {
  paste0(range$accession_version, " (", range$range[[1]]$begin, "..", range$range[[1]]$end, ")")
}
transcript_tbl <- t(sapply(result$genes ,
    function(g) {
      # index of the longest transcript (by transcript length) for this gene
      max_idx <- which.max(sapply(g$gene$transcripts, function(tr) tr$length))
        c(
        g$gene$symbol,
        g$gene$gene_id,
        g$gene$transcripts[[max_idx]]$accession_version,
        g$gene$transcripts[[max_idx]]$length,
        range_format(g$gene$transcripts[[max_idx]]$genomic_range),
        g$gene$transcripts[[max_idx]]$protein$accession_version,
        g$gene$transcripts[[max_idx]]$protein$length,
        range_format(g$gene$transcripts[[max_idx]]$cds)
        
    )}))
colnames(transcript_tbl) <- c('gene symbol',
                            'gene id',
                            'transcript accession',
                            'transcript length',
                            'transcript coordinates on the genome',
                            'protein accession',
                            'protein length',
                            'CDS start and stop on the transcript')
transcript_tbl

gene symbol,gene id,transcript accession,transcript length,transcript coordinates on the genome,protein accession,protein length,CDS start and stop on the transcript
Acbp1,34111,NM_001298815.1,525,NT_033779.5 (8161796..8162452),NP_001285744.1,90,NM_001298815.1 (128..400)
Acer,34189,NM_001273329.1,3198,NT_033779.5 (8521933..8525740),NP_001260258.1,630,NM_001273329.1 (346..2238)
ACC,35761,NM_136498.3,8010,NT_033778.4 (7974115..7987237),NP_610342.1,2482,NM_136498.3 (190..7638)
achi,36373,NM_165912.2,2509,NT_033778.4 (12511534..12515062),NP_725183.1,555,NM_165912.2 (305..1972)
Ack,38489,NM_139602.3,4393,NT_037436.4 (4108186..4114157),NP_647859.1,1073,NM_139602.3 (802..4023)
Acbp4,38781,NM_139824.3,352,NT_037436.4 (7127913..7128264),NP_648081.1,84,NM_139824.3 (36..290)
Acbp6,38782,NM_001300034.1,697,NT_037436.4 (7128903..7129673),NP_001286963.1,82,NM_001300034.1 (50..298)
Acbp3,38783,NM_139826.3,429,NT_037436.4 (7130874..7131505),NP_648083.1,84,NM_139826.3 (59..313)
Acbp2,38784,NM_168192.3,420,NT_037436.4 (7131863..7133463),NP_729218.1,86,NM_168192.3 (64..324)
Acbp5,39005,NM_139998.4,337,NT_037436.4 (8788144..8788545),NP_648255.1,82,NM_139998.4 (32..280)
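
The CDS range and the protein length in this table can be cross-checked against each other. The sketch below uses three rows copied from the output above and assumes that the CDS coordinates are 1-based and inclusive and that the CDS includes the stop codon; neither assumption is stated in this notebook, but the numbers are consistent with it.

```r
# Values copied from the table above
cds_check <- data.frame(
  symbol         = c("Acbp1", "Acer", "ACC"),
  cds_begin      = c(128L, 346L, 190L),
  cds_end        = c(400L, 2238L, 7638L),
  protein_length = c(90L, 630L, 2482L)
)

# CDS length in nucleotides (assumes 1-based inclusive coordinates)
cds_check$cds_nt <- cds_check$cds_end - cds_check$cds_begin + 1L

# Codons minus one stop codon should equal the protein length
cds_check$expected_aa <- cds_check$cds_nt / 3 - 1
cds_check$consistent  <- cds_check$cds_nt %% 3 == 0 &
  cds_check$expected_aa == cds_check$protein_length
cds_check
```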


## Calculate data for regions of the transcripts

In this example, we'll retrieve the CDS range, transcript length, and exon coordinates for every transcript of each gene, reported in the following columns:

* gene symbol
* gene id
* transcript accession
* CDS Range
* Transcript Length
* Genomic Sequence Accession
* Exon Positions
* Exon Count
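
Exon positions in this table are written as comma-separated `begin-end` pairs. As a sketch, such a string can be parsed to recover the individual exon lengths, whose sum should match the transcript length. The example string is the Acbp1 `NM_001298815.1` row from the output in this section. The helper name `exon_lengths` is hypothetical, and 1-based inclusive coordinates are assumed.

```r
# Hypothetical helper: parse "b1-e1,b2-e2,..." into a vector of exon lengths.
# Assumes 1-based inclusive coordinates, hence the + 1.
exon_lengths <- function(exon_positions) {
  pairs <- strsplit(strsplit(exon_positions, ",", fixed = TRUE)[[1]],
                    "-", fixed = TRUE)
  vapply(pairs, function(p) as.integer(p[2]) - as.integer(p[1]) + 1L, integer(1))
}

# Exon positions of NM_001298815.1 (Acbp1), copied from the output
acbp1_exons <- "8162311-8162452,8162123-8162237,8161796-8162063"
lens <- exon_lengths(acbp1_exons)
lens       # 142 115 268
sum(lens)  # 525, the transcript length reported for NM_001298815.1
```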

In [1]:
# Build one row per transcript. The outer lapply() always returns a list of
# per-gene matrices, which cbind() joins before the result is transposed.
metadata_tbl <- t(do.call(cbind,
    lapply(result$genes,
        function(g) sapply(g$gene$transcripts,
            function(tr) {
                c(g$gene$symbol,
                  g$gene$gene_id,
                  tr$accession_version,
                  paste(tr$cds$range[[1]]$begin, "..",
                        tr$cds$range[[1]]$end,
                        sep = ''),
                  tr$length,
                  tr$exons$accession_version,
                  paste(sapply(tr$exons$range, function(r)
                      {paste(r$begin, r$end, sep='-')}), collapse=','),
                  length(tr$exons$range)
                  )}))))
colnames(metadata_tbl) <- c('Gene Symbol',
                            'Gene ID',
                            'Transcript Accession',
                            'CDS Range',
                            'Transcript Length',
                            'Genomic Sequence Accession',
                            'Exon Positions',
                            'Exon Count'
                            )
metadata_tbl

Gene Symbol,Gene ID,Transcript Accession,CDS Range,Transcript Length,Genomic Sequence Accession,Exon Positions,Exon Count
Acbp1,34111,NM_001298815.1,128..400,525,NT_033779.5,"8162311-8162452,8162123-8162237,8161796-8162063",3
Acbp1,34111,NM_135343.4,94..366,491,NT_033779.5,"8162311-8162418,8162123-8162237,8161796-8162063",3
Acer,34189,NM_001273329.1,346..2238,3198,NT_033779.5,"8521933-8522325,8522402-8522966,8523443-8524684,8524743-8525740",4
Acer,34189,NM_057847.4,346..2238,2342,NT_033779.5,"8521933-8522325,8522402-8522966,8523443-8524684,8524743-8524884",4
ACC,35761,NM_001299249.1,471..7442,7814,NT_033778.4,"7991710-7991862,7985066-7985451,7984881-7985002,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",13
ACC,35761,NM_001299248.1,517..7488,7860,NT_033778.4,"7989516-7989714,7985066-7985451,7984881-7985002,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",13
ACC,35761,NM_165581.3,490..7461,7833,NT_033778.4,"7987066-7987237,7985066-7985451,7984881-7985002,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",13
ACC,35761,NM_136498.3,190..7638,8010,NT_033778.4,"7987066-7987237,7985602-7985778,7985066-7985451,7984881-7985002,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",14
ACC,35761,NM_001103756.3,607..7578,7950,NT_033778.4,"7986949-7987237,7985066-7985451,7984881-7985002,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",13
ACC,35761,NM_001103757.3,323..7369,7741,NT_033778.4,"7983476-7984063,7982641-7983007,7981184-7981472,7977389-7981122,7977022-7977324,7976786-7976956,7976295-7976723,7975304-7976173,7974886-7975233,7974690-7974824,7974115-7974621",11
