# **Chapter 2: Next-Generation Sequencing**
</br>

**Accessing Databases**

First, we need to install the Biopython package.

In [None]:
pip install biopython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 8.7 MB/s
Installing collected packages: biopython
Successfully installed biopython-1.79


**How can we search and fetch data from NCBI databases?**

**Note:** Do not forget to replace the placeholder with your real email address. NCBI uses it to contact you if your requests cause problems.

In [2]:
from Bio import Entrez, SeqIO
Entrez.email = 'your@email.here'

Now let's try to find the **chloroquine resistance transporter (CRT) gene** in ***Plasmodium falciparum*** (the parasite that causes the deadliest form of malaria) in **the nucleotide database**.

**First Step:** Search the nucleotide database.
</br>
To do that, we use the `Entrez.esearch()` function.
</br>
`Entrez.esearch()` arguments:
1.   db: database name
2.   term: NCBI search query (for more detail, see: https://www.ncbi.nlm.nih.gov/nuccore/advanced)
3. retmax: maximum number of record references to return (default value: 20)


In [9]:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
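Because the search term is an ordinary string in NCBI's query syntax, it can also be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_term` and `search_ids` are hypothetical helper names (not part of Biopython), and `search_ids` needs network access and a configured `Entrez.email` when it is actually called.

```python
def build_term(gene, organism):
    """Compose an NCBI query restricted to a gene name and an organism."""
    return f'{gene}[Gene Name] AND "{organism}"[Organism]'


def search_ids(term, db='nucleotide', retmax=20):
    """Run esearch and return the list of matching record IDs (needs network).

    Hypothetical helper; assumes Entrez.email has already been set.
    """
    # Imported here so that build_term can be tried without Biopython installed.
    from Bio import Entrez
    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    try:
        return Entrez.read(handle)['IdList']
    finally:
        handle.close()


term = build_term('CRT', 'Plasmodium falciparum')
print(term)
```

The printed term is the same query string used in the cell above, so `search_ids(term)` would be expected to return the same kind of ID list.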

**Next Step:** Read the returned result.
</br>
To do that, we use the `Entrez.read()` function.
</br>
Output of the `Entrez.read()` function:
</br>
The XML returned by NCBI, parsed into a dictionary-like object that includes:


1.   Count: total number of records matching your search query
2.   RetMax: number of record references returned
3. RetStart: sequential index of the first UID in the retrieved set (default: 0)
4. IdList: IDs (UIDs) of the returned records



In [10]:
rec_list = Entrez.read(handle)
print(rec_list)

{'Count': '2022', 'RetMax': '20', 'RetStart': '0', 'IdList': ['2196471109', '2196471107', '2196471105', '2196471103', '2196471101', '2196471099', '2196471097', '2196471095', '2196471093', '2196471091', '2196471089', '2196471087', '2196471085', '2196471083', '2196471081', '2196471079', '2196471077', '2196471075', '2196471073', '2196471071'], 'TranslationSet': [{'From': '"Plasmodium falciparum"[Organism]', 'To': '"Plasmodium falciparum"[Organism]'}], 'TranslationStack': [{'Term': 'CRT[Gene Name]', 'Field': 'Gene Name', 'Count': '4800', 'Explode': 'N'}, {'Term': '"Plasmodium falciparum"[Organism]', 'Field': 'Organism', 'Count': '259299', 'Explode': 'Y'}, 'AND'], 'QueryTranslation': 'CRT[Gene Name] AND "Plasmodium falciparum"[Organism]'}
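As the quotes in the output show, `Count`, `RetMax`, and `RetStart` come back as strings, so they should be converted before any numeric comparison. The sketch below uses a plain dictionary that mirrors a trimmed copy of the output above (the `IdList` is shortened for brevity); it does not contact NCBI.

```python
# A trimmed copy of the parsed result printed above (IdList shortened).
rec_list = {
    'Count': '2022',
    'RetMax': '20',
    'RetStart': '0',
    'IdList': ['2196471109', '2196471107', '2196471105'],
}

# Comparing the raw strings is lexicographic, which can mislead:
print('100' < '20')  # True, although 100 > 20 numerically

# Convert to int before doing arithmetic or comparisons.
count = int(rec_list['Count'])
ret_max = int(rec_list['RetMax'])
missing = count - ret_max
print(missing)  # number of matching records not yet referenced
```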


**Question:** How can we read all records?
</br>
By default, the search limits the number of record references to 20, so if there are more matches, you may want to repeat the query with an increased maximum limit.

In [None]:
# The values are returned as strings, so convert them before comparing
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=int(rec_list['Count']))
    rec_list = Entrez.read(handle)
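An alternative to requesting every ID in one call is to page through the results by combining the `retstart` and `retmax` arguments of `Entrez.esearch()`. The sketch below only computes the page windows; `page_windows` is a hypothetical helper, and the page size of 500 is an arbitrary choice for illustration.

```python
def page_windows(total, page_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


windows = list(page_windows(2022, 500))
print(windows)

# Each pair could then be passed to esearch (needs network), for example:
# Entrez.esearch(db='nucleotide', term=term, retstart=start, retmax=size)
```

For the 2022 matches reported above, this gives four full pages and a final page of 22 records.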

Now we have the IDs of all of the records, but we still need to retrieve the records themselves.
</br>
**Retrieving all of these records:**
</br>
`Entrez.efetch()` function
</br>
Usage of this function:

> downloading all matching nucleotide sequences from the GenBank database

</br>
Arguments:


1.   db: database from which to retrieve records
2.   id: list of UIDs
3. rettype: retrieval type (here, `'gb'` for the GenBank format)







In [23]:
id_list = rec_list['IdList']
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')

**Be careful** with this technique, </br>
because you will retrieve a large number of complete records, and some of them will contain fairly large sequences. You risk downloading a lot of data (which would be a strain both on your side and on the NCBI servers).
</br>
**Better strategy:** </br>
One way is to make a more restrictive query and/or download just a few records at a time and stop when you have found the one that you need. </br> The precise strategy will depend on what you are trying to achieve.
</br>
**In any case, we will retrieve a list of records in the GenBank format (which includes sequences, plus a lot of interesting metadata).**
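The "download a few at a time and stop when found" strategy could look roughly like the sketch below. All three function names are hypothetical helpers, the batch size of 5 is arbitrary, and `fetch_batch` needs network access and a configured `Entrez.email`. The fetching step is passed in as a parameter so the batching logic can be tried offline.

```python
def batches(ids, size):
    """Split a list of IDs into consecutive chunks of at most `size`."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def fetch_batch(id_batch):
    """Download one batch as GenBank records (needs network and Biopython)."""
    # Imported here so the batching logic can be tried without Biopython.
    from Bio import Entrez, SeqIO
    handle = Entrez.efetch(db='nucleotide', id=id_batch,
                           rettype='gb', retmode='text')
    try:
        return list(SeqIO.parse(handle, 'gb'))
    finally:
        handle.close()


def find_record(ids, wanted_name, size=5, fetch=fetch_batch):
    """Fetch batch by batch and stop at the first record named `wanted_name`."""
    for id_batch in batches(ids, size):
        for rec in fetch(id_batch):
            if rec.name == wanted_name:
                return rec  # later batches are never downloaded
    return None


print(list(batches(['a', 'b', 'c', 'd', 'e'], 2)))
```

With real data, a call such as `find_record(rec_list['IdList'], 'OM418360')` would download at most as many batches as needed to reach the record with that name.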

**Next Step:** Read and parse the result.
</br>
To do that, we use the </br>
`SeqIO.parse()` function.
</br>

> usage: Sequence input/output as SeqRecord objects.

Each `SeqRecord` has the following attributes:

1. seq: the sequence itself
2. id
3. name
4. description: human-readable description
5. features: includes **source, gene, mRNA, CDS**
6. annotations
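To get a feel for these attributes without a network call, a `SeqRecord` can also be built by hand. The record below is entirely made up for illustration (the ID, name, description, and ten-base sequence are assumptions, not data from NCBI).

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A hand-made record; every value here is a placeholder for illustration.
rec_example = SeqRecord(
    Seq('GAACGACACC'),
    id='EXAMPLE01.1',
    name='EXAMPLE01',
    description='example record built by hand',
)
rec_example.annotations['organism'] = 'Plasmodium falciparum'

print(rec_example.id, rec_example.name, len(rec_example.seq))
print(rec_example.description)
print(rec_example.features)     # no features were added, so the list is empty
print(rec_example.annotations)
```

Records returned by `SeqIO.parse()` expose the same attributes, with `features` and `annotations` filled in from the GenBank file.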





In [24]:
# converting an iterator (the result of SeqIO.parse) to a list
recs = list(SeqIO.parse(hdl, 'gb'))
print(recs)

[SeqRecord(seq=Seq('GAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCC...TGC'), id='OM418379.1', name='OM418379', description='Plasmodium falciparum isolate 240134_081 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]), SeqRecord(seq=Seq('GAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCC...TGC'), id='OM418378.1', name='OM418378', description='Plasmodium falciparum isolate 240134_039 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]), SeqRecord(seq=Seq('GAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCC...TGC'), id='OM418377.1', name='OM418377', description='Plasmodium falciparum isolate 240125_051 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]), SeqRecord(seq=Seq('GAACGACACCGAAGCTTTAATTTACAATTTTTTGCTATATCCATGTTAGATGCC...TCT'), id='OM418376.1', name='OM418376', description='Plasmodium falciparum isolate 240125_049 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]), SeqRecord(seq=Seq('

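We will now concentrate on a single record. Search results change as new records are submitted to NCBI, so the record below will only be found if it is among the records you retrieved:

In [25]:
for rec in recs:
    if rec.name == 'OM418360':
        break
else:
    # No break happened, so rec simply refers to the last record in the list
    print('Record not found; using the last record instead')
print(rec.name)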
print(rec.description)

OM418360
Plasmodium falciparum isolate 240137_025 chloroquine resistance transporter (crt) gene, partial cds


Now, let's extract some sequence features, which contain information such as gene products and exon positions on the sequence:

Description of the code below:
</br>
If `feature.type` is `gene`, we print its name, which is stored in the `qualifiers` dictionary. We also print the locations of all exons. Exons, like all features, have a location in this sequence: a start, an end, and the strand from which they are read. Any other feature type is printed in full as "not processed". Start and end positions are often `ExactPosition` objects, but note that Biopython supports many other types of positions. One type is `BeforePosition`, which specifies that a location point lies before a certain sequence position. Another type is `BetweenPosition`, which gives the interval for a certain location start/end. There are quite a few more position types; these are just some examples.
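The position types can be explored directly, without downloading anything. The sketch below builds two locations by hand: one with exact boundaries (the coordinates 10 and 50 are arbitrary) and one with open-ended boundaries, using `BeforePosition` and `AfterPosition`, which Biopython prints with `<` and `>` markers, as in the `[<0:>153](+)` locations of the partial mRNA and CDS features in this section's output.

```python
from Bio.SeqFeature import (AfterPosition, BeforePosition, ExactPosition,
                            FeatureLocation)

# Exact boundaries on the forward strand (coordinates chosen arbitrarily).
exact = FeatureLocation(ExactPosition(10), ExactPosition(50), strand=1)

# Open-ended boundaries: starts before 0 and ends after 153.
fuzzy = FeatureLocation(BeforePosition(0), AfterPosition(153), strand=1)

print(exact)
print(fuzzy)

# Fuzzy positions can still be used as plain integers.
print(int(fuzzy.start), int(fuzzy.end), len(fuzzy))
print(type(fuzzy.start).__name__, type(fuzzy.end).__name__)
```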

In [29]:
for feature in rec.features:
    if feature.type == 'gene':
        print(feature.qualifiers['gene'])
    elif feature.type == 'exon':
        loc = feature.location
        print(loc.start, loc.end, loc.strand)
    else:
        print('not processed:\n%s' % feature)

not processed:
type: source
location: [0:153](+)
qualifiers:
    Key: db_xref, Value: ['taxon:5833']
    Key: isolate, Value: ['240137_025']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Plasmodium falciparum']

['crt']
not processed:
type: mRNA
location: [<0:>153](+)
qualifiers:
    Key: gene, Value: ['crt']
    Key: product, Value: ['chloroquine resistance transporter']

not processed:
type: CDS
location: [<0:>153](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: gene, Value: ['crt']
    Key: product, Value: ['chloroquine resistance transporter']
    Key: protein_id, Value: ['ULG09259.1']
    Key: translation, Value: ['ERHRSFNLQFFAISMLDACSVILAFIGLTRTTGNIQSFVLQLSIPINMFFC']



We will now look at the annotations on the record, which are mostly metadata that is not related to the sequence position:

In [30]:
for name, value in rec.annotations.items():
    print('%s=%s' % (name, value))

molecule_type=DNA
topology=linear
data_file_division=INV
date=18-FEB-2022
accessions=['OM418360']
sequence_version=1
keywords=['']
source=Plasmodium falciparum (malaria parasite P. falciparum)
organism=Plasmodium falciparum
taxonomy=['Eukaryota', 'Sar', 'Alveolata', 'Apicomplexa', 'Aconoidasida', 'Haemosporida', 'Plasmodiidae', 'Plasmodium', 'Plasmodium (Laverania)']
references=[Reference(title='Targeted amplicon deep sequencing for monitoring antimalarial resistance markers in Western Kenya', ...), Reference(title='Direct Submission', ...)]
structured_comment=OrderedDict([('Assembly-Data', OrderedDict([('Assembly Method', 'SeekDeep v. 3.0.1'), ('Sequencing Technology', 'Illumina')]))])


Last but not least, you can access the fundamental piece of information, the sequence:

In [31]:
print(len(rec.seq))

153


**There's more...** </br>
There are many more databases at NCBI. You will probably want to check the Sequence Read Archive (SRA) database (previously known as Short Read Archive) if you are working with NGS data. The SNP database contains information on single-nucleotide polymorphisms (SNPs), whereas the protein database has protein sequences, and so on. A full list of databases in Entrez is linked in the See also section of this recipe.
</br>
Another NCBI database that you probably already know about is PubMed, which includes a list of scientific and medical citations, abstracts, and even full texts. You can also access it via Biopython. Furthermore, GenBank records often contain links to PubMed. For example, we can follow these links for our previous record, as shown here:

In [32]:
from Bio import Medline
refs = rec.annotations['references']
for ref in refs:
    if ref.pubmed_id != '':
        print(ref.pubmed_id)
        handle = Entrez.efetch(db='pubmed', id=[ref.pubmed_id], rettype='medline', retmode='text')
        records = Medline.parse(handle)
        for med_rec in records:
            for k, v in med_rec.items():
                print('%s: %s' % (k, v))