## Sequence databases

### Getting started

1. In the "my_notebooks/week01" folder, open the notebook from this URL: https://raw.githubusercontent.com/hlab1/teaching-fb2025/main/week01/w01p3_seqdb.ipynb.
2. Clear all outputs with "Kernel"->"Restart Kernel and Clear All Outputs".

If you want to stop working on the notebook at any time, remember to "Save Notebook" before "Close and Shutdown Notebook", and click "Stop My Server" when you are done.

### Searching for an accession number in the NCBI databases on the NCBI website

The TA will demonstrate the following on the NCBI website (https://www.ncbi.nlm.nih.gov).

1. Search for accession number NC_001477, which is the DEN-1 Dengue virus genome sequence, and explain the information on the results page.
2. More complex queries on the NCBI website with "search fields" or "search tags" (https://www.ncbi.nlm.nih.gov/books/NBK49540/)
    - NC_001477[ACCN]
    - Chlamydia trachomatis[ORGN]
    - Fungi[ORGN] AND biomol_mRNA[PROP]
    - Nature[JOUR] AND 460[VOL] AND 352[PAGE]
    - Schistosoma mansoni[ORGN] AND biomol_mrna[PROP]

### Querying the NCBI Database via R


We have seen how to query the NCBI databases on the NCBI website. We can get the same kind of data in R.  Can you think of reasons why we want to do this in R?

First we load the rentrez package, which has functions that allow us to query NCBI databases, along with the seqinr and XML packages that we will use to work with the results.

In [1]:
library("rentrez")
library("seqinr")
library("XML")

What kinds of databases are available? Can you match some of them to the list in the dropdown box at the top of the NCBI website?

In [2]:
entrez_dbs()

You can find a table of some of the databases here: https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly.

### Example: finding the sequence for the DEN-1 Dengue virus genome

If we know the database that contains the data we want, and the *accession number* for the data in the database, we can simply *fetch* the data. In this example, we know the DEN-1 Dengue virus genome is in the Nucleotide database ("nuccore") with accession number "NC_001477".

In [3]:
den1_xml <- entrez_fetch("nuccore","NC_001477",rettype="fasta",retmode="xml",parsed=TRUE)

*den1_xml* now contains the results in XML format. Because we set *parsed=TRUE*, it is a parsed XML document object rather than a plain string.

In [4]:
den1_xml

We can use the *xmlToList* function to convert it to a list object in R.

In [5]:
den1_list <- xmlToList(den1_xml)

*den1_list* is a list. The first element of the list has the name 'TSeq' and itself contains a list.

In [6]:
den1_list

How do we get the sequence inside the list with the name 'TSeq'?

In [7]:
dengueseq <- den1_list[['TSeq']][['TSeq_sequence']]
dengueseq
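Double brackets can be chained to reach an element inside a nested list. The sketch below uses only base R and a made-up toy list (the names mirror the 'TSeq' structure, but the sequence is not real Dengue data) to show that `[[ ]]` and `$` reach the same element.

```r
# Toy stand-in for the parsed result; the sequence here is made up.
toy_list <- list(TSeq = list(TSeq_defline = "toy record",
                             TSeq_sequence = "ATGCGT"))

# Two equivalent ways to reach the inner element by name.
seq_a <- toy_list[["TSeq"]][["TSeq_sequence"]]
seq_b <- toy_list$TSeq$TSeq_sequence

print(seq_a)
print(identical(seq_a, seq_b))
```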

This is a *string*. To convert it to a vector of single characters, we can use the *s2c* function in *seqinr*.

In [8]:
dengueseq_vec <- s2c(dengueseq)
dengueseq_vec
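Once a sequence is a vector of single characters, it is easy to count or index individual bases. As a rough illustration, the base R sketch below splits a made-up toy string with *strsplit* (which should give the same kind of vector as *s2c*) and tabulates the bases.

```r
# Made-up toy sequence, not real data.
toy_seq <- "ATGCGTAA"

# Base R way of splitting a string into single characters.
toy_vec <- strsplit(toy_seq, "")[[1]]

# Count how many times each base occurs.
base_counts <- table(toy_vec)
print(toy_vec)
print(base_counts)
```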

In addition to getting the sequences, we can get information resulting from analysis that people have performed for the sequence - this information is called *annotations*. *rentrez* has a function *entrez_summary* that allows you to get the annotations.

In [9]:
den1_summary <- entrez_summary("nuccore","NC_001477")
den1_summary

This is a list, so we can get the specific pieces of information using the double brackets. 

In [10]:
den1_summary[['biomol']]

### mRNA sequences in Schistosoma mansoni

Second example: we want to get all the mRNA sequences from the parasitic worm *Schistosoma mansoni*.  Since we don't know the accession numbers, we first have to *search* using *search terms*.

In [11]:
sch_results <- entrez_search(db="nuccore",
                             term="Schistosoma mansoni[ORGN] AND biomol_mrna[PROP]")
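The search term is an ordinary string, so it can also be assembled from its parts. The sketch below uses only base R; `make_term` is a hypothetical helper introduced for illustration and is not part of *rentrez*.

```r
# Hypothetical helper: attach a search field tag to a value,
# e.g. "Fungi" and "ORGN" become "Fungi[ORGN]".
make_term <- function(value, tag) {
    paste0(value, "[", tag, "]")
}

# Combine two tagged terms with the AND operator.
query <- paste(make_term("Schistosoma mansoni", "ORGN"),
               make_term("biomol_mrna", "PROP"),
               sep = " AND ")
print(query)
```

The resulting string could then be passed as the *term* argument in the search above.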

What is the information in the *sch_results* list?  

It is a list, so we can find the names that are associated with the elements in the list.

In [12]:
names(sch_results)

We can get the value associated with the *count* name in the list. What does this value contain?  Hint: check the help page for the *entrez_search* function.

In [13]:
sch_results[['count']]

From the help page we know that the *count* element contains the total number of hits satisfying the search terms, and the *ids* element contains the unique identifiers (UIDs) of the matching records, which we can use to fetch those sequences.

In [14]:
sch_results[['ids']]

Why are we only getting 20 IDs in 'ids', while the 'count' tells us there are 234948?

Hint: what does the help page for the *entrez_search* function say about "retmax"?

Let's just get the first sequence.

In [15]:
sch_seq1_xml <- entrez_fetch("nuccore","2845369404",rettype="fasta",retmode="xml",parsed=TRUE)

Print this sequence out.

In [16]:
sch_seq1_xml

As we did previously, we convert this XML result into an R list.

In [17]:
sch_seq1_list <- xmlToList(sch_seq1_xml)
sch_seq1_list

Using a *for loop* we can get the first five sequences and print out their definition lines, which are stored in the *TSeq_defline* element.

In [18]:
ids_first20 <- sch_results[['ids']]

for (i in 1:5) {
    seqi_xml <- entrez_fetch("nuccore",ids_first20[i],rettype="fasta",retmode="xml",parsed=TRUE)
    seqi_list <- xmlToList(seqi_xml)
    seqi_defline <- seqi_list[['TSeq']][['TSeq_defline']]
    print(seqi_defline)
}

[1] "Schistosoma mansoni strain PR venom allergen-like protein 29 (Smp_120670) mRNA, complete cds"
[1] "Schistosoma mansoni strain PR venom allergen-like protein 35 (Smp_347320) mRNA, complete cds"
[1] "Schistosoma mansoni strain PR venom allergen-like protein 34 (Smp_317520) mRNA, complete cds"
[1] "Schistosoma mansoni strain PR venom allergen-like protein 33 (Smp_313710) mRNA, complete cds"
[1] "Schistosoma mansoni strain PR venom allergen-like protein 31 (Smp_200460) mRNA, complete cds"
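A loop that prints values works well for inspection, but sometimes we want to keep the values for later use. As one possible alternative, the base R sketch below uses *vapply* to collect definition lines into a character vector. The records are made-up toy lists standing in for parsed results, so no NCBI query is needed to run it.

```r
# Toy stand-ins for parsed records; the deflines are made up.
toy_records <- list(
    list(TSeq = list(TSeq_defline = "toy record A")),
    list(TSeq = list(TSeq_defline = "toy record B")),
    list(TSeq = list(TSeq_defline = "toy record C"))
)

# Pull the defline out of each record and collect them in a character vector.
toy_deflines <- vapply(toy_records,
                       function(rec) rec[["TSeq"]][["TSeq_defline"]],
                       character(1))
print(toy_deflines)
```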


### Saving sequence data

Now we will get the tRNA sequences in the human genome.

In [19]:
hs_trna <- entrez_search("nuccore", term="homo sapiens[ORGN] AND biomol_trna[PROP]")

In [20]:
hs_trna

Entrez search result with 38 hits (object contains 20 IDs and no web_history object)
 Search term (as translated):  "Homo sapiens"[Organism] AND biomol_trna[PROP] 

In [21]:
names(hs_trna)
hs_trna[['ids']]
hs_trna[['count']]
hs_trna[['retmax']]

We expect 38 sequences but only got 20 IDs, so we need to set the *retmax* argument.

In [22]:
hs_trna <- entrez_search("nuccore", term="homo sapiens[ORGN] AND biomol_trna[PROP]",retmax=38)
hs_trna[['ids']]
hs_trna[['count']]
hs_trna[['retmax']]
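Setting *retmax* to the full count is fine for 38 hits, but for searches with many more hits (like the *Schistosoma mansoni* search earlier) it is common to request IDs in batches instead. The sketch below only computes the batch offsets in base R. Passing each offset to *entrez_search* as a `retstart` argument together with `retmax` is an assumption based on the NCBI ESearch parameters, so check the help page before relying on it.

```r
total_hits <- 38   # value reported in 'count' above
batch_size <- 20   # number of ids we received by default

# Zero-based offset of the first record in each batch.
batch_starts <- seq(0, total_hits - 1, by = batch_size)
print(batch_starts)

# Number of ids expected in each batch (the last one may be smaller).
batch_sizes <- pmin(batch_size, total_hits - batch_starts)
print(batch_sizes)
```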

Now we have 38 ids and are ready to fetch the sequences.

In [23]:
hs_trna_text <- entrez_fetch("nuccore",id=hs_trna[['ids']],rettype="fasta",retmode="text")

Note that we asked for the "text" return mode this time, so the result is one big string that is already in FASTA format.

In [24]:
hs_trna_text

We can write this string directly to a file to get a FASTA format file.

In [25]:
write(hs_trna_text,"humantRNAs.fasta")

Now you can go back to your "my_notebooks/week01" folder to take a look at the *humantRNAs.fasta* file.
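A quick sanity check on a saved FASTA file is to count its header lines, since each record starts with a line beginning with ">". The base R sketch below does this with a made-up two-record string written to a temporary file, so it does not touch *humantRNAs.fasta*; the same idea could be applied to that file by reading it with *readLines*.

```r
# Made-up two-record FASTA string standing in for hs_trna_text.
toy_fasta <- ">toy1 example record\nATGC\n>toy2 example record\nGGCC\n"

# Write to a temporary file so nothing in the working folder is overwritten.
toy_path <- tempfile(fileext = ".fasta")
write(toy_fasta, toy_path)

# Header lines in FASTA start with ">", so counting them counts the records.
toy_lines <- readLines(toy_path)
n_records <- sum(grepl("^>", toy_lines))
print(n_records)
```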

Remember to "Save Notebook" before "Close and Shutdown Notebook"!  When you finish using JupyterHub, remember to click "File"->"Hub Control Panel", then "Stop My Server".