# SRA Query and XML Parsing

Until now, metadata was downloaded by hand through the SRA web interface, which generates a sample table. The goal of this notebook is to query the SRA database directly and generate this table without the web interface. The R package `SRAdb` allows easy querying of SRA, but it does not interact directly with the NCBI database; instead it uses a pre-built SQLite database provided by the package author. In my tests of this package, the counts did not match what I was getting from the SRA web interface.

In [9]:
# Load useful extensions
%reload_ext autoreload
%autoreload 2

%reload_ext ipycache

In [10]:
# Imports
import os
import sys
import re
from xml.etree import ElementTree as ET

import pandas as pd
from Bio import Entrez

from ipycache import CacheMagics
CacheMagics.cachedir = '../../output/cache'

# Import my libraries
sys.path.insert(0, '../../lib/python/')
import Sra

pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)
Entrez.email = 'justin.fear@nih.gov'

## Query SRA

Use the Biopython implementation of E-utilities to query SRA directly with the search term `"Drosophila melanogaster"[Orgn]`. Because there is a large number of entries, use the web history feature of E-utilities.

In [14]:
# Query SRA
handle = Entrez.esearch(db='sra', term='"Drosophila melanogaster"[Orgn]', retmax=99999, usehistory='y')
records = Entrez.read(handle)
print('There were', records['Count'], 'records returned')

# Save history from eSearch, this will be used in eFetch
webenv = records['WebEnv']
query_key = records['QueryKey']

There were 22434 records returned


## Download Full XML Results

Download the XML records using the above query history. This process takes a long time and taxes the SRA system, so only re-download if needed.

In [245]:
# Check if I have already dumped the sra records. If 
# you want to update, simply delete the file and re-run.
fname = '../../output/sra_dump.xml'
if not os.path.exists(fname):
    Sra.downloadSRA(count=records['Count'], webenv=webenv, 
                query_key=query_key, fname=fname)

tree = ET.parse(fname)
root = tree.getroot()
ep = list(root)  # Element.getchildren() was removed in Python 3.9
print('You have', len(ep), 'XML records. This should match the number of results returned from your',
      'query. If they do not match, delete the file `../../output/sra_dump.xml` and re-run.')

You have 22344 XML records. This should match the number of results returned from your query. If they do not match, delete the file `../../output/sra_dump.xml` and re-run.
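The implementation of `Sra.downloadSRA` is not shown in this notebook. As a rough sketch of how the saved history could be used, and not the author's code, records can be fetched in batches with `Entrez.efetch` and merged under one root element. The helper names `batch_starts`, `merge_batches`, and `fetch_all`, the batch size of 500, and the root tag `EXPERIMENT_PACKAGE_SET` are assumptions made for illustration.

```python
from xml.etree import ElementTree as ET


def batch_starts(count, batch_size=500):
    """Yield (retstart, retmax) pairs that together cover `count` records."""
    count = int(count)  # Entrez returns the count as a string
    for start in range(0, count, batch_size):
        yield start, min(batch_size, count - start)


def merge_batches(xml_chunks, root_tag='EXPERIMENT_PACKAGE_SET'):
    """Merge several XML documents into one tree under a single root."""
    combined = ET.Element(root_tag)  # root tag is an assumption
    for chunk in xml_chunks:
        # Each chunk has its own root; move its children under the shared root.
        combined.extend(list(ET.fromstring(chunk)))
    return ET.ElementTree(combined)


def fetch_all(count, webenv, query_key, batch_size=500):
    """Yield one XML document per batch, using the eSearch history."""
    from Bio import Entrez  # imported here so the helpers above need only the stdlib
    for start, size in batch_starts(count, batch_size):
        handle = Entrez.efetch(db='sra', retstart=start, retmax=size,
                               webenv=webenv, query_key=query_key)
        try:
            yield handle.read()
        finally:
            handle.close()
```

Under these assumptions, `merge_batches(fetch_all(records['Count'], webenv, query_key)).write(fname)` would produce a single file that `ET.parse` can read. Merging matters because each batch is a complete XML document, so simply concatenating the responses would not give well-formed XML.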


## Parse XML Records

E-utilities only provides an XML version of the results, so the XML needs to be parsed to generate a results table. A description of the SRA XML schema can be found here:

http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-6a/

There is a large number of fields in the SRA XML, and a lot of the data is repeated in multiple places. I need to decide which pieces of information to use. 
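Once the fields are chosen, each record could be flattened into one table row. The following is a minimal sketch of that idea, assuming each record is an element whose children include `EXPERIMENT`; the wrapper tag `EXPERIMENT_PACKAGE`, the column names, and the helper `package_to_row` are hypothetical. The example values are copied from the record printed later in this notebook.

```python
from xml.etree import ElementTree as ET

# Column name -> path relative to one record (a small subset of candidate fields).
FIELDS = {
    'experiment': 'EXPERIMENT/IDENTIFIERS/PRIMARY_ID',
    'study': 'EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID',
    'library_strategy': 'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY',
    'library_source': 'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE',
    'library_selection': 'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION',
}


def package_to_row(package, fields=FIELDS):
    """Return one table row; fields missing from the XML become None."""
    return {name: package.findtext(path) for name, path in fields.items()}


# Trimmed-down record; LIBRARY_SELECTION is left out to show the None case.
example = ET.fromstring("""
<EXPERIMENT_PACKAGE>
  <EXPERIMENT accession="SRX1842622">
    <IDENTIFIERS><PRIMARY_ID>SRX1842622</PRIMARY_ID></IDENTIFIERS>
    <STUDY_REF accession="SRP057728">
      <IDENTIFIERS><PRIMARY_ID>SRP057728</PRIMARY_ID></IDENTIFIERS>
    </STUDY_REF>
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
</EXPERIMENT_PACKAGE>
""")

row = package_to_row(example)
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`. Using `findtext` means a missing field yields `None` rather than an exception, which keeps one incomplete record from stopping the whole parse.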

Here is an example tree with current fields being considered marked with a '⟹':

In [None]:
ERR358180

In [321]:
# Print out an example Tree and mark fields used
for experiment in ep:
    # findtext returns None instead of raising when a record has no RUN
    if experiment.findtext('RUN_SET/RUN/IDENTIFIERS/PRIMARY_ID') == 'SRR3663861':
        break

keep = ['EXPERIMENT/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/EXTERNAL_ID',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT/SINGLE',
        'EXPERIMENT/PLATFORM/ILLUMINA/INSTRUMENT_MODEL',
        'EXPERIMENT/PLATFORM/ILLUMINA',
        'SUBMISSION/IDENTIFIERS/PRIMARY_ID',
        'SUBMISSION/IDENTIFIERS/SUBMITTER_ID',
        'Organization/Address/Institution',
        'SAMPLE/TITLE',
        'SAMPLE/SAMPLE_NAME/TAXON_ID',
        'SAMPLE/SAMPLE_NAME/SCIENTIFIC_NAME',
        'SAMPLE/SAMPLE_NAME/COMMON_NAME',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/VALUE',
        'RUN_SET/RUN',
        'RUN_SET/RUN/Statistics/Read'
       ]

def print_tags(s, space='', path=''):
    # Skip Quality and Base elements to keep the output short
    if s.tag not in ('Quality', 'Base'):
        marker = '⟹' if path in keep else ''
        print("{marker}{space}{tag} {attrib} {text}".format(
            marker=marker, space=space, tag=s.tag, attrib=s.attrib, text=s.text))

def walk(elem, path, depth=0, max_depth=4):
    """Print elem, then its descendants, stopping max_depth levels down."""
    print_tags(elem, '\t' * depth, path)
    if depth < max_depth:
        # Iterate the element directly; getchildren() was removed in Python 3.9
        for child in elem:
            walk(child, path + '/' + child.tag, depth + 1, max_depth)

for child in experiment:
    walk(child, child.tag)

EXPERIMENT {'alias': 'head 2', 'accession': 'SRX1842622'} None
	IDENTIFIERS {} None
⟹		PRIMARY_ID {} SRX1842622
		SUBMITTER_ID {'namespace': 'Institute of Zoology, CAS'} head 2
	TITLE {} Drosophila melanogaster 3 tissue Transcriptome Head 2
	STUDY_REF {'accession': 'SRP057728'} None
		IDENTIFIERS {} None
⟹			PRIMARY_ID {} SRP057728
⟹			EXTERNAL_ID {'namespace': 'BioProject'} PRJNA282433
	DESIGN {} None
		DESIGN_DESCRIPTION {} None
		SAMPLE_DESCRIPTOR {'accession': 'SRS1501599'} None
			IDENTIFIERS {} None
				PRIMARY_ID {} SRS1501599
				EXTERNAL_ID {'namespace': 'BioSample'} SAMN05231878
		LIBRARY_DESCRIPTOR {} None
			LIBRARY_NAME {} None
⟹			LIBRARY_STRATEGY {} RNA-Seq
⟹			LIBRARY_SOURCE {} TRANSCRIPTOMIC
⟹			LIBRARY_SELECTION {} PolyA
			LIBRARY_LAYOUT {} None
				PAIRED {'NOMINAL_SDEV': '0.0E0'} None
		SPOT_DESCRIPTOR {} None
			SPOT_DECODE_SPEC {} None
				SPOT_LENGTH {} 300
				READ_SPEC {} None
				READ_SPEC {} None
	PLATFORM {} None
⟹		ILLUMINA {} None
⟹			INSTRUMENT_MODEL {} 

[autoreload of Sra failed: Traceback (most recent call last):
  File "/Users/fearjm/opt/miniconda3/envs/ncbi_remap/lib/python3.5/site-packages/IPython/extensions/autoreload.py", line 247, in check
    superreload(m, reload, self.old_objects)
ValueError: get_Sex() requires a code object with 0 free vars, not 1
]


I am assuming the redundant information is populated by SRA and should be identical, but it would probably be safest to compare the data points and flag those that do not match. I could do this in a couple of ways: build a set of classes, one for each piece of information I am interested in, and parse each bit of XML to check it; or build a class for each bit of XML and check that. The first option is probably easier.
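As a minimal sketch of the comparison idea, the check below groups every location where the same value is expected and reports the fields whose copies disagree. It uses a plain dictionary of getter functions rather than classes, which is a simplification of the design described above. The names `attr`, `text`, `CHECKS`, and `find_mismatches`, the wrapper tag `EXPERIMENT_PACKAGE`, and the choice of which locations count as redundant are all assumptions.

```python
from xml.etree import ElementTree as ET


def attr(path, name):
    """Build a getter that reads attribute `name` from the element at `path`."""
    def getter(package):
        node = package.find(path)
        return None if node is None else node.get(name)
    return getter


def text(path):
    """Build a getter that reads the text of the element at `path`."""
    return lambda package: package.findtext(path)


# Field name -> every place the same value is expected to appear (assumed).
CHECKS = {
    'experiment': [attr('EXPERIMENT', 'accession'),
                   text('EXPERIMENT/IDENTIFIERS/PRIMARY_ID')],
    'study': [attr('EXPERIMENT/STUDY_REF', 'accession'),
              text('EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID')],
}


def find_mismatches(package, checks=CHECKS):
    """Return {field: sorted distinct values} for fields whose copies disagree."""
    mismatches = {}
    for field, getters in checks.items():
        values = {g(package) for g in getters} - {None}  # ignore absent copies
        if len(values) > 1:
            mismatches[field] = sorted(values)
    return mismatches


TEMPLATE = """
<EXPERIMENT_PACKAGE>
  <EXPERIMENT accession="SRX1842622">
    <IDENTIFIERS><PRIMARY_ID>SRX1842622</PRIMARY_ID></IDENTIFIERS>
    <STUDY_REF accession="{study}">
      <IDENTIFIERS><PRIMARY_ID>SRP057728</PRIMARY_ID></IDENTIFIERS>
    </STUDY_REF>
  </EXPERIMENT>
</EXPERIMENT_PACKAGE>
"""

consistent = ET.fromstring(TEMPLATE.format(study='SRP057728'))
# The study accession is altered on purpose here to trigger a mismatch.
altered = ET.fromstring(TEMPLATE.format(study='SRP000000'))

ok_report = find_mismatches(consistent)
bad_report = find_mismatches(altered)
```

Running `find_mismatches` over every element in `ep` and keeping the non-empty reports would give the list of records to inspect by hand.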

In [306]:
p = ep[10].find('EXPERIMENT/PLATFORM')

In [308]:
list(p)[0].tag  # Element.getchildren() was removed in Python 3.9

'ILLUMINA'
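The platform name is stored as the tag of the child of `PLATFORM` rather than as text, and the example tree above shows `LIBRARY_LAYOUT` following the same pattern (its child is `PAIRED`). A small helper, sketched here under the hypothetical name `first_child_tag`, could read both without hard-coding `ILLUMINA` or `SINGLE` in the path. The wrapper tag `EXPERIMENT_PACKAGE` is an assumption.

```python
from xml.etree import ElementTree as ET


def first_child_tag(package, path):
    """Return the tag of the first child of the element at `path`, or None."""
    node = package.find(path)
    if node is None or len(node) == 0:
        return None  # element missing, or present but without children
    return node[0].tag


example = ET.fromstring("""
<EXPERIMENT_PACKAGE>
  <EXPERIMENT>
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_LAYOUT><PAIRED NOMINAL_SDEV="0.0E0"/></LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
    <PLATFORM><ILLUMINA/></PLATFORM>
  </EXPERIMENT>
</EXPERIMENT_PACKAGE>
""")

platform = first_child_tag(example, 'EXPERIMENT/PLATFORM')
layout = first_child_tag(example, 'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT')
missing = first_child_tag(example, 'EXPERIMENT/NOT_THERE')
```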

In [276]:
b = set()
for e in ep:
    b.add(len(e.findall('RUN_SET/RUN/Statistics/Read')))
b

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 18,
 20,
 26,
 32,
 38,
 41}

In [262]:
for r in e.findall('RUN_SET/RUN'):
    pass

In [265]:
r.find('IDENTIFIERS/PRIMARY_ID').text

'SRR2443145'

## Available Sample Attributes

Sample attributes are entered as free text. Here is a list of attributes currently in the SRA. Some of these can be collapsed into single categories.

In [155]:
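# Count how often each sample attribute tag occurs (case-insensitive).
# './/' searches at any depth; a bare 'SAMPLE_ATTRIBUTE/TAG' only matches
# direct children of root.
counts = {}
for tag in root.findall('.//SAMPLE_ATTRIBUTE/TAG'):
    if tag.text is None:
        continue  # skip empty tags
    text = tag.text.lower()
    counts[text] = counts.get(text, 0) + 1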

In [156]:
pd.Series(counts).to_csv('../../output/tmp.text', sep='\t')
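Collapsing free-text tags into single categories could be done with a lookup table applied to the tag counts. The sketch below shows one way; the synonym table and the example counts are invented for illustration and are not taken from the SRA data, and the names `TAG_SYNONYMS`, `normalize_tag`, and `collapse_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: cleaned free-text tag -> category.
TAG_SYNONYMS = {
    'dev_stage': 'developmental stage',
    'dev stage': 'developmental stage',
    'development stage': 'developmental stage',
    'gender': 'sex',
}


def normalize_tag(tag, synonyms=TAG_SYNONYMS):
    """Lower-case, squeeze whitespace, then map known synonyms to a category."""
    cleaned = ' '.join(tag.lower().split())
    return synonyms.get(cleaned, cleaned)


def collapse_counts(counts, synonyms=TAG_SYNONYMS):
    """Sum the counts of all tags that normalize to the same category."""
    collapsed = Counter()
    for tag, n in counts.items():
        collapsed[normalize_tag(tag, synonyms)] += n
    return dict(collapsed)


# Made-up counts, only to exercise the function.
example_counts = {'sex': 3, 'Gender': 2, 'dev_stage': 4, 'tissue': 1}
collapsed = collapse_counts(example_counts)
```

Tags that are not in the table pass through unchanged, so the table can be grown incrementally while reviewing the exported list.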

In [205]:
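# NOTE: `bob` is not defined in the cells shown here; it is presumably one
# record element taken from `ep` in an earlier session.
s = bob.find('SAMPLE')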

In [206]:
for i in s.iter():
    print(i.tag, i.attrib, i.text)

SAMPLE {'accession': 'SRS1532953', 'alias': 'FR198N'} None
IDENTIFIERS {} None
PRIMARY_ID {} SRS1532953
EXTERNAL_ID {'namespace': 'BioSample'} SAMN05330489
SUBMITTER_ID {'namespace': 'pda|justin.lack@nih.gov', 'label': 'Sample name'} FR198N
TITLE {} FR198N
SAMPLE_NAME {} None
TAXON_ID {} 7227
SCIENTIFIC_NAME {} Drosophila melanogaster
COMMON_NAME {} fruit fly
SAMPLE_LINKS {} None
SAMPLE_LINK {} None
XREF_LINK {} None
DB {} bioproject
ID {} 327349
LABEL {} PRJNA327349
SAMPLE_ATTRIBUTES {} None
SAMPLE_ATTRIBUTE {} None
TAG {} strain
VALUE {} FR198N
SAMPLE_ATTRIBUTE {} None
TAG {} dev_stage
VALUE {} adult
SAMPLE_ATTRIBUTE {} None
TAG {} sex
VALUE {} female
SAMPLE_ATTRIBUTE {} None
TAG {} tissue
VALUE {} Whole organism
SAMPLE_ATTRIBUTE {} None
TAG {} BioSampleModel
VALUE {} Model organism or animal


In [211]:
for attrib in s.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
    print(attrib.find('TAG').text)

strain
dev_stage
sex
tissue
BioSampleModel
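The same loop can collect each tag together with its value. This is a minimal sketch, assuming one `TAG` and one `VALUE` per `SAMPLE_ATTRIBUTE` as in the sample printed above; the helper name `sample_attributes` is hypothetical, and lower-casing the tags is a design choice made here so that later lookups are case-insensitive.

```python
from xml.etree import ElementTree as ET


def sample_attributes(sample):
    """Return {lower-cased tag: value} for one SAMPLE element."""
    attrs = {}
    for attrib in sample.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
        tag = attrib.findtext('TAG')
        if tag is None:
            continue  # attribute without a tag cannot be keyed
        attrs[tag.strip().lower()] = attrib.findtext('VALUE')
    return attrs


# Attributes copied from the sample printed above.
example = ET.fromstring("""
<SAMPLE accession="SRS1532953" alias="FR198N">
  <SAMPLE_ATTRIBUTES>
    <SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>FR198N</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>dev_stage</TAG><VALUE>adult</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>tissue</TAG><VALUE>Whole organism</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>BioSampleModel</TAG><VALUE>Model organism or animal</VALUE></SAMPLE_ATTRIBUTE>
  </SAMPLE_ATTRIBUTES>
</SAMPLE>
""")

attrs = sample_attributes(example)
```

If a sample repeats a tag, the last value wins in this sketch; collecting values into a list instead would preserve duplicates.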
