## Sequence databases

### Searching for an accession number in the NCBI databases on the NCBI website

The TA will demonstrate the following on the NCBI website (https://www.ncbi.nlm.nih.gov).

1. Search for accession number NC_001477, which is the DEN-1 Dengue virus genome sequence, and explain the information on the results page.
2. More complex queries on the NCBI website with "search fields" or "search tags" (https://www.ncbi.nlm.nih.gov/books/NBK49540/)
    - NC_001477[ACCN]
    - Chlamydia trachomatis[ORGN]
    - Fungi[ORGN] AND biomol_mrna[PROP]
    - Nature[JOUR] AND 460[VOL] AND 352[PAGE]
    - Schistosoma mansoni[ORGN] AND biomol_mrna[PROP]
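
The tagged queries in the list above are plain strings, so they can also be assembled programmatically. The sketch below is only an illustration: `tag_term` and `all_of` are hypothetical helper names, not part of any package, and the result is a string that you would paste into the NCBI search box.

```r
# Hypothetical helper: attach an NCBI search field tag to a search term.
tag_term <- function(term, field) {
  paste0(term, "[", field, "]")
}

# Hypothetical helper: join several tagged terms with the boolean operator AND.
all_of <- function(...) {
  paste(c(...), collapse = " AND ")
}

# Rebuild the last query from the list above.
ncbi_query <- all_of(
  tag_term("Schistosoma mansoni", "ORGN"),
  tag_term("biomol_mrna", "PROP")
)
print(ncbi_query)
```

Building queries this way keeps the tags consistent when you need many similar searches, for example the same property filter applied to several organisms.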

### Querying the NCBI Database via R


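We have seen how to query the NCBI databases on the NCBI website. We can retrieve the same kind of data from within R. Can you think of reasons why we might want to do this in R?

First we load the *seqinr* package, which provides functions for querying sequence databases such as GenBank and RefSeq. Note that *seqinr* sends its queries to copies of these databases hosted on the ACNUC server at PBIL (pbil.univ-lyon1.fr), not to the NCBI website itself.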

In [186]:
library("seqinr")

Which databases are available? Calling *choosebank* without arguments lists them.

In [187]:
choosebank()

 [1] "genbankseqinr"   "genbank"         "embl"            "emblwgs"        
 [5] "swissprot"       "ensembl"         "hogenom7dna"     "hogenom7"       
 [9] "hogenom"         "hogenomdna"      "hovergendna"     "hovergen"       
[13] "hogenom5"        "hogenom5dna"     "hogenom4"        "hogenom4dna"    
[17] "homolens"        "homolensdna"     "hobacnucl"       "hobacprot"      
[21] "phever2"         "phever2dna"      "refseq"          "refseq16s"      
[25] "greviews"        "bacterial"       "archaeal"        "protozoan"      
[29] "ensprotists"     "ensfungi"        "ensmetazoa"      "ensplants"      
[33] "ensemblbacteria" "mito"            "polymorphix"     "emglib"         
[37] "refseqViruses"   "ribodb"          "taxodb"         

We first have to choose a database that we want to search. This time we will use the *refseq* database.

In [188]:
choosebank("refseq")

The call below creates a query named *RefSeqChlam* and saves the results to a list called *refseq_results*.

In [189]:
refseq_results <- query("RefSeqChlam", "SP=Chlamydia")

In [190]:
refseq_results

44 SQ for SP=Chlamydia

Remember to close the database connection.

In [191]:
closebank()
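
Forgetting `closebank()` is easy, especially when a query fails halfway through. One possible pattern, sketched below, wraps the open/work/close steps in a function and uses `on.exit` so that the bank is closed even if the work raises an error. `with_bank` is a hypothetical helper, not part of *seqinr*; the `open_bank` and `close_bank` arguments are an assumption added so that the pattern can be tried out without a network connection.

```r
# Hypothetical helper (not part of seqinr): open a bank, run some work,
# and always close the bank afterwards, even if the work fails.
with_bank <- function(bank, work,
                      open_bank = seqinr::choosebank,
                      close_bank = seqinr::closebank) {
  open_bank(bank)
  # Registered only after the bank opened successfully.
  on.exit(close_bank(), add = TRUE)
  work()
}

# Offline demonstration with stand-in open/close functions that just log calls.
bank_log <- character(0)
demo_result <- with_bank(
  "toy",
  work = function() "work done",
  open_bank = function(bank) bank_log <<- c(bank_log, paste("open", bank)),
  close_bank = function() bank_log <<- c(bank_log, "close")
)
print(bank_log)

# With a network connection, the defaults would be used like this:
# n_chlam <- with_bank("refseq", function() {
#   seqinr::query("RefSeqChlam", "SP=Chlamydia")[["nelem"]]
# })
```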

### Example: finding the sequence for the DEN-1 Dengue virus genome

Next we query another database, *refseqViruses*, for the Dengue virus genome sequence and inspect the elements of the returned list.

In [192]:
choosebank("refseqViruses")

In [193]:
den1_results <- query("Dengue1", "AC=NC_001477")

In [194]:
attributes(den1_results)

In [195]:
den1_results[['nelem']]

In [196]:
den1_results[['req']]

[[1]]
       name      length       frame      ncbicg 
"NC_001477"     "10735"         "0"         "1" 


In [197]:
dengueseq <- getSequence(den1_results$req[[1]])

In [198]:
dengueseq[1:50]
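
As the indexing above suggests, `getSequence` returns an ordinary character vector with one base per element, so the usual vector tools apply. The sketch below uses a short made-up vector, `toy_seq`, as a stand-in for `dengueseq` so that it runs without a database connection; the bases in it are invented for illustration.

```r
# Made-up stand-in for the vector returned by getSequence(); not real data.
toy_seq <- c("a", "g", "t", "t", "g", "c", "a", "a", "t", "g")

# Number of bases in the sequence.
seq_length <- length(toy_seq)

# Collapse the first five bases into a single string for compact display.
first_five <- paste(toy_seq[1:5], collapse = "")

# Count how often each base occurs.
base_counts <- table(toy_seq)

print(seq_length)
print(first_five)
print(base_counts)
```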

In addition to the sequence itself, we can retrieve information that results from analyses people have performed on the sequence; this information is called *annotations*. *seqinr* has a function, *getAnnot*, that retrieves the annotations.

In [199]:
annots <- getAnnot(den1_results$req[[1]])

In [200]:
annots

In [201]:
closebank()

### mRNA sequences in Schistosoma mansoni

Second example: we search for the mRNA sequences from the parasitic worm *Schistosoma mansoni*.

In [202]:
choosebank("genbank")
sch_results <- query("SchistosomamRNA", "SP=Schistosoma mansoni AND M=mrna")

“closing unused connection 6 (->pbil.univ-lyon1.fr:5558)”

What information does the *sch_results* list contain?

It is a list, so we can find the names that are associated with the elements in the list.

In [203]:
names(sch_results)

We can get the value associated with the *nelem* name in the list. What does this value contain?  Hint: check the help page for the query function.

In [204]:
sch_results[['nelem']]

From the help page we know that the *req* element contains a list of sequence names. Let's get the first one. Note the length reported for this sequence.

In [205]:
sch_results[['req']][[1]]

      name     length      frame     ncbicg 
"AA080774"      "372"        "0"        "1" 

The *getSequence* function accepts an element of the *req* list returned by the *query* function and retrieves the sequence associated with it.

In [206]:
sch_seq1 <- getSequence(sch_results[['req']][[1]])

Print this sequence out.

In [207]:
sch_seq1

What is the length of the sequence?  Is it what you expected?

In [208]:
length(sch_seq1)

Using a *for* loop we can get the first five sequences and print out the first 10 bases of each sequence.

In [209]:
for (i in 1:5) {
    seqi <- getSequence(sch_results[['req']][[i]])
    print(seqi[1:10])
}

 [1] "c" "t" "t" "t" "c" "g" "t" "a" "g" "t"
 [1] "g" "a" "a" "g" "c" "g" "g" "c" "g" "t"
 [1] "g" "g" "t" "g" "g" "t" "a" "a" "a" "g"
 [1] "g" "a" "g" "t" "a" "a" "a" "t" "t" "t"
 [1] "c" "a" "c" "g" "a" "g" "c" "t" "a" "a"
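
The loop above assumes that the query returned at least five sequences and that each one has at least 10 bases; indexing past the end of a shorter sequence with `[1:10]` pads the result with `NA`. A slightly more defensive variant is sketched below, with a made-up list `toy_seqs` standing in for the retrieved sequences so that it runs offline. It uses `seq_len` to cap the number of sequences and `head` to cap the number of bases.

```r
# Made-up stand-in for sequences fetched with getSequence(); not real data.
toy_seqs <- list(
  c("a", "c", "g", "t", "a", "c"),
  c("g", "g", "t"),
  c("t", "t", "a", "a", "c", "c", "g", "g")
)

# Show at most two sequences, or fewer if the list is shorter.
n_show <- min(2, length(toy_seqs))

# head() returns at most 4 bases and never pads with NA.
prefixes <- vapply(
  toy_seqs[seq_len(n_show)],
  function(s) paste(head(s, 4), collapse = ""),
  character(1)
)
print(prefixes)
```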


Remember to close the database connection!

In [210]:
closebank()

### Saving sequence data

We can save multiple sequences to one FASTA file using the *write.fasta* function. What does the help page tell you about how to call this function?  What does it expect as the first argument?  As the second argument?

In [211]:
?write.fasta

In [212]:
choosebank("genbank")

In [213]:
hs_trna <- query("humtRNAs", "SP=homo sapiens AND M=TRNA")

In [214]:
myseqs <- getSequence(hs_trna)

In [215]:
mynames <- getName(hs_trna)

In [216]:
write.fasta(sequences = myseqs, names = mynames, file.out = "humantRNAs.fasta")

In [217]:
closebank()
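
To see what `write.fasta` produces without querying a database, the sketch below writes two made-up sequences to a temporary file and reads them back with `read.fasta`, which is also part of *seqinr*. The sequences and the names `toy1` and `toy2` are invented for illustration.

```r
library("seqinr")

# Made-up sequences and names; not real data.
toy_seqs <- list(
  c("a", "c", "g", "t", "a", "c", "g", "t"),
  c("t", "t", "g", "g", "c", "c")
)
toy_names <- c("toy1", "toy2")

# Write both sequences to a single temporary FASTA file.
fasta_path <- tempfile(fileext = ".fasta")
write.fasta(sequences = toy_seqs, names = toy_names, file.out = fasta_path)

# Inspect the raw file: one ">" header line per sequence, then the bases.
fasta_lines <- readLines(fasta_path)
print(fasta_lines)

# Read the file back into R as a list with one element per sequence.
toy_back <- read.fasta(fasta_path)
print(names(toy_back))
```

The first argument is a list of sequences and the second a vector of names in the same order, which mirrors how `myseqs` and `mynames` are passed above.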