This notebook looks at metadata from the SRA. I want to develop a programmatic approach to downloading metadata, but it needs to be consistent with the analysis by Zhenxia and the Miegs. I also want to figure out exactly how Zhenxia and the Miegs did this.

There are two programmatic approaches:
* SRAdb
* Biopython

In [9]:
try:
    %load_ext autoreload
except:
    pass
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
import os
import sys
import re
from xml.etree import ElementTree as ET

import pandas as pd
from Bio import Entrez

# Import my libraries
sys.path.insert(0, '../../lib/python/')
import Sra

pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)
Entrez.email = 'justin.fear@nih.gov'

# Biopython

Biopython's counts match what I expect, but the results are in raw XML and need a parser. 

## Get List of Records

In [11]:
handle = Entrez.esearch(db='sra', term='"Drosophila melanogaster"[Orgn]', retmax=99999, usehistory='y')
records = Entrez.read(handle)
print(records['Count'])

# Save history from eSearch, this will be used in eFetch
webenv = records['WebEnv']
query_key = records['QueryKey']

22350


## Get Full XML

In [12]:
# Check if I have already dumped the sra records. If 
# you want to update, simply delete the file and re-run.
fname = '../../output/sra_dump.xml'
if not os.path.exists(fname):
    Sra.downloadSRA(count=records['Count'], webenv=webenv, 
                query_key=query_key, fname=fname)

tree = ET.parse(fname)
root = tree.getroot()
ep = list(root)  # Element.getchildren() was removed in Python 3.9
len(ep)

22344
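
`Sra.downloadSRA` lives in my own library and is not shown here. As a rough sketch of how a batched download using the eSearch history could work (not necessarily what that function does), something like the following would do. The names `batch_starts`, `download_sra`, and the `fetch` parameter are hypothetical; `fetch` is meant to be `Entrez.efetch`, and I am assuming each batch comes back with its own root element wrapping `EXPERIMENT_PACKAGE` children.

```python
from xml.etree import ElementTree as ET


def batch_starts(count, batch_size=500):
    """Return the retstart offsets needed to cover `count` records."""
    # Entrez.read returns Count as a string, so coerce it to int.
    return list(range(0, int(count), batch_size))


def download_sra(count, webenv, query_key, fname, fetch, batch_size=500):
    """Fetch records in batches and write them to one merged XML file.

    `fetch` is any callable with the keyword interface of Bio.Entrez.efetch
    that returns a readable handle (hypothetical parameter, for testability).
    """
    # Assumed root tag for the merged document.
    merged = ET.Element('EXPERIMENT_PACKAGE_SET')
    for start in batch_starts(count, batch_size):
        handle = fetch(db='sra', retmode='xml', retstart=start,
                       retmax=batch_size, webenv=webenv, query_key=query_key)
        try:
            batch_root = ET.fromstring(handle.read())
        finally:
            handle.close()
        # Each batch has its own root; move its children into one tree.
        merged.extend(list(batch_root))
    ET.ElementTree(merged).write(fname, encoding='utf-8', xml_declaration=True)
    return len(merged)
```

In this notebook the call would look like `download_sra(records['Count'], webenv, query_key, fname, fetch=Entrez.efetch)`. Merging into a single root keeps the dump parseable with one `ET.parse` call.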

In [13]:
bob = ep[0]

In [14]:
list(bob)

[<Element 'EXPERIMENT' at 0x16d5644f8>,
 <Element 'SUBMISSION' at 0x16d5809a8>,
 <Element 'Organization' at 0x16d580b38>,
 <Element 'STUDY' at 0x16d9b92c8>,
 <Element 'SAMPLE' at 0x16d9b9728>,
 <Element 'Pool' at 0x16d9ba1d8>,
 <Element 'RUN_SET' at 0x16d9ba3b8>]

In [15]:
s = bob.find('RUN_SET')

In [16]:
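# Find the experiment package that contains run DRR001444.
for e in ep:
    run = e.find('.//RUN')
    # Some packages may have no RUN element, so check for None explicitly.
    if run is not None and run.get('accession') == 'DRR001444':
        break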

In [17]:
e.findall('.//EXTERNAL_ID')

[<Element 'EXTERNAL_ID' at 0x1a3847278>,
 <Element 'EXTERNAL_ID' at 0x1a38474a8>,
 <Element 'EXTERNAL_ID' at 0x1a384a2c8>,
 <Element 'EXTERNAL_ID' at 0x1a384a728>,
 <Element 'EXTERNAL_ID' at 0x1a384ac28>,
 <Element 'EXTERNAL_ID' at 0x1a384d0e8>]
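
The element reprs above do not say much on their own. A small sketch for pulling out the namespace and ID of each `EXTERNAL_ID` is below. The helper name `external_ids` is hypothetical, and the XML fragment is a hand-built stand-in modeled on this record's `STUDY_REF`, not the real dump.

```python
from xml.etree import ElementTree as ET


def external_ids(element):
    """Return (namespace, id) pairs for every EXTERNAL_ID under `element`."""
    return [(x.get('namespace'), x.text) for x in element.iter('EXTERNAL_ID')]


# Hand-built fragment modeled on the STUDY_REF of this record.
fragment = ET.fromstring("""
<STUDY_REF accession="DRP000460" refname="DRP000460" refcenter="KYOTO_SC">
  <IDENTIFIERS>
    <PRIMARY_ID>DRP000460</PRIMARY_ID>
    <EXTERNAL_ID namespace="BioProject" label="BioProject ID">PRJDA72881</EXTERNAL_ID>
  </IDENTIFIERS>
</STUDY_REF>
""")

print(external_ids(fragment))
# [('BioProject', 'PRJDA72881')]
```

On the real data this would be called as `external_ids(e)`.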

In [18]:
for i in e.iter():
    print("""
    {tag}
    \t{attrib}
    \t{text}
    """.format(tag=i.tag, attrib=i.attrib, text=i.text))


    EXPERIMENT_PACKAGE
    	{}
    	None
    

    EXPERIMENT
    	{'accession': 'DRX000998', 'center_name': 'KYOTO_SC', 'alias': 'DRX000998'}
    	None
    

    IDENTIFIERS
    	{}
    	None
    

    PRIMARY_ID
    	{}
    	DRX000998
    

    SUBMITTER_ID
    	{'namespace': 'KYOTO_SC'}
    	DRX000998
    

    TITLE
    	{}
    	Whole genome sequencing of Drosophila melanogaster strain DM, series DM01
    

    STUDY_REF
    	{'accession': 'DRP000460', 'refname': 'DRP000460', 'refcenter': 'KYOTO_SC'}
    	None
    

    IDENTIFIERS
    	{}
    	None
    

    PRIMARY_ID
    	{}
    	DRP000460
    

    EXTERNAL_ID
    	{'namespace': 'BioProject', 'label': 'BioProject ID'}
    	PRJDA72881
    

    SUBMITTER_ID
    	{'namespace': 'KYOTO_SC'}
    	DRP000460
    

    DESIGN
    	{}
    	None
    

    DESIGN_DESCRIPTION
    	{}
    	none provided
    

    SAMPLE_DESCRIPTOR
    	{'accession': 'DRS000998', 'refname': 'DRS000998', 'refcenter': 'KYOTO_SC'}
    	None
    

    IDENTIFIE

## Available Sample Attributes

Sample attributes are entered as free text. Here is a list of attributes currently in the SRA. Some of these can be collapsed into single categories.

In [19]:
# Count how often each attribute tag occurs, ignoring case.
counts = {}
for i in root.findall('.//SAMPLE_ATTRIBUTE/TAG'):
    if i.text is None:
        continue  # skip empty TAG elements
    text = i.text.lower()
    counts[text] = counts.get(text, 0) + 1

In [20]:
pd.Series(counts).to_csv('../../output/tmp.text', sep='\t')
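
Collapsing the free-text tags into single categories could be done with a lookup table, roughly as sketched below. The `SYNONYMS` table, the helper names, and the example counts are all made up for illustration; the real variants have to come from inspecting the tag list written out above.

```python
from collections import Counter

# Hypothetical synonym table mapping tag variants to one category.
SYNONYMS = {
    'developmental stage': 'dev_stage',
    'dev stage': 'dev_stage',
    'gender': 'sex',
}


def normalize_tag(tag):
    """Lowercase, squeeze whitespace, then map known variants to one name."""
    key = ' '.join(tag.lower().split())
    return SYNONYMS.get(key, key)


def collapse_counts(counts):
    """Sum the counts of all tags that normalize to the same category."""
    collapsed = Counter()
    for tag, n in counts.items():
        collapsed[normalize_tag(tag)] += n
    return dict(collapsed)


# Made-up counts, only to show the collapsing behavior.
example = {'dev_stage': 3, 'Developmental  Stage': 2, 'sex': 4, 'gender': 1}
print(collapse_counts(example))
# {'dev_stage': 5, 'sex': 5}
```

Tags that are not in the table pass through unchanged, so the table can be grown incrementally.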

In [205]:
s = bob.find('SAMPLE')

In [206]:
for i in s.iter():
    print(i.tag, i.attrib, i.text)

SAMPLE {'accession': 'SRS1532953', 'alias': 'FR198N'} None
IDENTIFIERS {} None
PRIMARY_ID {} SRS1532953
EXTERNAL_ID {'namespace': 'BioSample'} SAMN05330489
SUBMITTER_ID {'namespace': 'pda|justin.lack@nih.gov', 'label': 'Sample name'} FR198N
TITLE {} FR198N
SAMPLE_NAME {} None
TAXON_ID {} 7227
SCIENTIFIC_NAME {} Drosophila melanogaster
COMMON_NAME {} fruit fly
SAMPLE_LINKS {} None
SAMPLE_LINK {} None
XREF_LINK {} None
DB {} bioproject
ID {} 327349
LABEL {} PRJNA327349
SAMPLE_ATTRIBUTES {} None
SAMPLE_ATTRIBUTE {} None
TAG {} strain
VALUE {} FR198N
SAMPLE_ATTRIBUTE {} None
TAG {} dev_stage
VALUE {} adult
SAMPLE_ATTRIBUTE {} None
TAG {} sex
VALUE {} female
SAMPLE_ATTRIBUTE {} None
TAG {} tissue
VALUE {} Whole organism
SAMPLE_ATTRIBUTE {} None
TAG {} BioSampleModel
VALUE {} Model organism or animal


In [211]:
for attrib in s.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
    print(attrib.find('TAG').text)

strain
dev_stage
sex
tissue
BioSampleModel
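
Since the raw XML needs a parser, a first step could be flattening the TAG/VALUE pairs of one `SAMPLE` into a dict, which pandas can later turn into a row. This is only a sketch: `sample_attributes` is a hypothetical helper, and the XML below is a hand-built fragment reusing a few of the values printed above.

```python
from xml.etree import ElementTree as ET


def sample_attributes(sample):
    """Return {tag: value} for the SAMPLE_ATTRIBUTE children of a SAMPLE."""
    attrs = {}
    for attrib in sample.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
        tag = attrib.findtext('TAG')
        if tag is None:
            continue  # skip attributes without a TAG element
        # Lowercase the tag so that case variants land in the same column.
        attrs[tag.lower()] = attrib.findtext('VALUE')
    return attrs


# Hand-built fragment using values from the sample printed above.
sample = ET.fromstring("""
<SAMPLE accession="SRS1532953" alias="FR198N">
  <SAMPLE_ATTRIBUTES>
    <SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>FR198N</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>dev_stage</TAG><VALUE>adult</VALUE></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>
  </SAMPLE_ATTRIBUTES>
</SAMPLE>
""")

row = sample_attributes(sample)
row['accession'] = sample.get('accession')
print(row)
# {'strain': 'FR198N', 'dev_stage': 'adult', 'sex': 'female', 'accession': 'SRS1532953'}
```

A list of such dicts, one per sample, can be passed directly to `pd.DataFrame`.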
