# NCBIQuery class

## Description
To download records from NCBI I've been using `Bio.Entrez` (from Biopython). I made a simple class to download `fasta` and `gb` (GenBank) records.

### Useful links:
- [A General Introduction to the E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- [Table for `EFetch` `retmode` and `rettype`](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly)

`import geneviking as gv`

`gv.NCBIQuery(acc, start, end, thresh)`
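
The `start`, `end`, and `thresh` arguments describe a region of the record plus some padding on each side. As a rough sketch of that arithmetic, the window is widened by `thresh` on both ends and clamped to 1 at the lower end, since NCBI coordinates are 1-indexed. The `padded_window` helper is hypothetical and not part of `geneviking`.

```python
def padded_window(start, end, thresh):
    """Return (window_start, window_end), or (None, None) for a full record."""
    # a missing start or end means "no range": the full record is requested
    if start is None or end is None:
        return None, None

    # widen the region by thresh on each side, never going below position 1
    window_start = max(1, start - thresh)
    window_end = end + thresh
    return window_start, window_end


print(padded_window(4443838, 4444755, 0))
print(padded_window(5, 100, 10))
print(padded_window(None, None, 0))
```

With `thresh=0` the window is exactly the requested region, which is the case used in the example further down.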


In [155]:
import urllib.error
from Bio import Entrez
import time
import io
from Bio import SeqIO

In [156]:
__email__ = "nquinones@g.harvard.edu"
Entrez.email = __email__
Entrez.api_key = '693377c63a7574b9384fca67ee08a50d1308'

In [332]:
class NCBIQuery:
    
    def __init__(self, acc, start, end, thresh):
        
        # ncbi accession
        self.acc = acc
        
        # range
        self.start = start
        self.end = end
        
        # if start or end are None, treat ranges as none
        # this will download full record
        if (start is not None) and (end is not None):
            self.query_range = set(range(start, end))

            # neighborhood
            self.rangeend = end + thresh
            self.rangestart = start - thresh
            
            # Deal with range start of 0 as if it was 1.
            # NCBI records are 1 indexed
            if self.rangestart <= 0:
                self.rangestart = 1
        
        else:
            self.query_range = None
            self.rangeend = None
            self.rangestart = None
    
    
    def download(self, outdir='.', fmt='gb', full_download=False, DEBUG=True):
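        '''
        outdir: directory where the sequence will be saved; a falsy value skips saving
        fmt: record download format. Allowed: 'gb' (full GenBank) or 'fasta'
        full_download: if True, ignore the range and download the full record
        DEBUG: if True, print progress messages
        '''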
        
        # deal with format option
        if fmt == 'gb':
            rettype_fmt = 'gb'
            output_fmt = 'genbank'
        
        elif fmt == 'fasta':
            rettype_fmt = 'fasta'
            output_fmt = 'fasta'
        else:
            raise ValueError('Supported fmt are "gb" or "fasta".')
        
        # assign download start and end from object attributes
        download_start = self.rangestart
        download_end = self.rangeend
        
        # download the full record if requested, or if no range was given
        # (a single check, so the message is printed only once)
        if full_download or (download_start is None) or (download_end is None):
            print('Downloading full record.', end=" ")
            download_start = None
            download_end = None
            
        
        # download with Bio.Entrez
        while True:
            try:
                # debug print statement
                if DEBUG:
                    print('Downloading {}...'.format(self.acc), end =" ")
                
                # request to NCBI with efetch
                response = Entrez.efetch(db='nuccore',
                                         id=self.acc,
                                         rettype=rettype_fmt,
                                         retmode="text",
                                         seq_start=download_start,
                                         seq_stop=download_end)
                if DEBUG:
                    print('Downloaded.')
                
                break
                
            except urllib.error.HTTPError as e:
                
                # Handle nonexistent accession
                if e.code == 400:
                    print(f'Unable to download: {self.acc}. Check accession.')
                    raise
                
                # Handle too many requests error
                elif e.code == 429:
                    
                    if DEBUG:
                        print('Too many requests. Retrying {}'.format(self.acc))
                    
                    # wait 1 second and try again
                    time.sleep(1)
                    continue
                
                # raise if HTTPError is different
                else:
                    raise
        
        # read response into stringio
        response_io = io.StringIO(response.read())
        
        # save into specified directory
        if outdir:
            # a single-accession request should return exactly one record;
            # SeqIO.read raises ValueError otherwise
            record = SeqIO.read(response_io, output_fmt)
            
            # reformat id for save name
            seq_id = record.id.replace(':', '_')
            SeqIO.write(record, f'{outdir}/{seq_id}.{rettype_fmt}', output_fmt)
            
            # rewind so the buffer can be read again after parsing
            response_io.seek(0)
        
        self.downloaded_record = response_io
    
    def add_something(self, string):
        self.something = string
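
The `download` method retries indefinitely when NCBI answers with HTTP 429. A possible alternative, sketched below under stated assumptions, caps the number of attempts and doubles the wait between them. The `fetch_with_retry` helper and its `max_tries`, `base_delay`, and `sleep` parameters are hypothetical names introduced for illustration; they are not part of `Bio.Entrez` or of the class above.

```python
import time
import urllib.error


def fetch_with_retry(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on HTTP 429.

    fetch: zero-argument callable that performs the request.
    max_tries, base_delay, sleep: hypothetical knobs for this sketch.
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # anything other than "too many requests" is not retried
            if e.code != 429:
                raise
            # give up after the last allowed attempt
            if attempt == max_tries - 1:
                raise
            # wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** attempt)
```

The request itself would be passed in as a callable, for example `fetch_with_retry(lambda: Entrez.efetch(db='nuccore', id=acc, rettype='gb', retmode='text'))`. Taking `sleep` as a parameter makes the waiting behaviour easy to check without actually pausing.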

In [337]:
a = NCBIQuery('CP028994.1', 4443838, 4444755, 0)

In [338]:
a.download()

Downloading CP028994.1... Downloaded.
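
With the default `outdir='.'`, the record is also written to disk, named after the parsed record id with any `:` replaced by `_`. Here is a small sketch of that naming rule, using a hypothetical `output_path` helper. The ranged id in the second call is made up for illustration; the actual id depends on what NCBI returns for the request.

```python
from pathlib import Path


def output_path(outdir, record_id, ext):
    """Mirror the naming rule used when saving: '<outdir>/<id>.<ext>'."""
    # ':' can appear in ids of ranged records and is awkward in file names
    safe_id = record_id.replace(':', '_')
    return Path(outdir) / f'{safe_id}.{ext}'


print(output_path('.', 'CP028994.1', 'gb'))
print(output_path('out', 'CP028994.1:4443838-4444755', 'fasta'))
```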


In [341]:
a.downloaded_record.getvalue()

'LOCUS       CP028994                 918 bp    DNA     linear   BCT 28-APR-2018\nDEFINITION  Klebsiella pneumoniae strain AR_0079 chromosome, complete genome.\nACCESSION   CP028994 REGION: 4443838..4444755\nVERSION     CP028994.1\nDBLINK      BioProject: PRJNA292904\n            BioSample: SAMN04014920\nKEYWORDS    .\nSOURCE      Klebsiella pneumoniae\n  ORGANISM  Klebsiella pneumoniae\n            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;\n            Enterobacteriaceae; Klebsiella.\nREFERENCE   1  (bases 1 to 918)\n  AUTHORS   Conlan,S., Thomas,P.J., Mullikin,J., Frank,K.M. and Segre,J.A.\n  TITLE     Whole genome sequencing of Klebsiella pneumoniae AR_0079\n  JOURNAL   Unpublished\nREFERENCE   2  (bases 1 to 918)\n  AUTHORS   Segre,J.A. and Mullikin,J.\n  TITLE     Direct Submission\n  JOURNAL   Submitted (19-APR-2018) NIH Intramural Sequencing Center, 5625\n            Fishers Lane, Rockville, MD 20852, USA\nCOMMENT     Annotation was added by the NCBI Proka

In [320]:
a.add_something('bueno')

In [323]:
a.something

'bueno'