# SRA Query and XML Parsing

Until now, metadata was downloaded by hand through the SRA web interface, which generates a sample table. The goal of this notebook is to query the SRA database directly and generate this table without using the web interface. The R package `SRAdb` allows easy querying of SRA metadata, but it does not interact with the NCBI database directly. Instead, `SRAdb` uses a pre-built SQLite database provided by the package author. In my tests of this package, the counts did not match what I was getting from the SRA web interface.

In [1]:
# Load useful extensions
%reload_ext autoreload
%autoreload 2

%reload_ext ipycache

In [2]:
# Imports
import os
import sys
import re
from xml.etree import ElementTree as ET

import pandas as pd
from Bio import Entrez

from ipycache import CacheMagics
CacheMagics.cachedir = '../../output/cache'

# Import my libraries
sys.path.insert(0, '../../lib/python/')
import Sra

pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)
Entrez.email = 'justin.fear@nih.gov'

## Query SRA

Use the Biopython implementation of E-utilities to query SRA directly with the search term `"Drosophila melanogaster"[Orgn]`. Because there is a large number of entries, use the web history feature of E-utilities.

In [3]:
# Query SRA
handle = Entrez.esearch(db='sra', term='"Drosophila melanogaster"[Orgn]', retmax=99999, usehistory='y')
records = Entrez.read(handle)
print('There were ', records['Count'], ' records returned')

# Save history from eSearch, this will be used in eFetch
webenv = records['WebEnv']
query_key = records['QueryKey']

There were  22509  records returned


## Download Full XML Results

Download the XML records using the above query history. This process takes a long time and taxes the SRA servers, so only re-download if needed.

In [4]:
# Check if I have already dumped the SRA records. If
# you want to update, simply delete the file and re-run.
fname = '../../output/sra_dump.xml'
if not os.path.exists(fname):
    Sra.downloadSRA(count=records['Count'], webenv=webenv,
                    query_key=query_key, fname=fname)

tree = ET.parse(fname)
root = tree.getroot()
# Element.getchildren() was removed in Python 3.9; list(root) is equivalent.
ep = list(root)
print('You have ', len(ep), ' XML records. This should match the number of results returned from your',
      'query. If they do not match then delete the file `../../output/sra_dump.xml`')

You have  22344  XML records. This should match the number of results returned from your query. If they do not match then delete the file `../../output/sra_dump.xml`
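
`Sra.downloadSRA` lives in a local library and its code is not shown here. As a rough sketch of how a history-based download could be split into batches, the helpers below (`batch_ranges`, `fetch_batches`, and the batch size of 500 are illustrative assumptions, not the actual implementation) page through the result set using the `webenv` and `query_key` saved from eSearch.

```python
def batch_ranges(count, batch_size=500):
    """Yield (retstart, retmax) pairs that together cover `count` records."""
    count = int(count)  # Entrez reports the count as a string-like value
    for start in range(0, count, batch_size):
        yield start, min(batch_size, count - start)


def fetch_batches(count, webenv, query_key, batch_size=500):
    """Yield the raw XML of each batch (hypothetical helper, needs network access)."""
    from Bio import Entrez  # imported here so batch_ranges works without Biopython
    for start, size in batch_ranges(count, batch_size):
        handle = Entrez.efetch(db='sra', retstart=start, retmax=size,
                               webenv=webenv, query_key=query_key)
        try:
            yield handle.read()
        finally:
            handle.close()


print(list(batch_ranges(1200, 500)))  # [(0, 500), (500, 500), (1000, 200)]
```

Each batch comes back as its own XML document, so the batches would still need to be merged under a single root element before `ET.parse` can read the combined dump.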


## Parse XML Records

E-utilities only provides an XML version of the results, so the XML needs to be parsed to generate a results table. A description of the SRA XML schema can be found here:

http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-6a/

There is a large number of fields in the SRA XML, and a lot of the data is repeated in multiple places. I need to decide which pieces of information to use.

In [5]:
# Print out an example Tree and mark fields used
for experiment in ep:
    try:
        if experiment.find('RUN_SET/RUN/IDENTIFIERS/PRIMARY_ID').text == 'ERR358180':
            break
    except AttributeError:
        # find() returned None: this record has no run with a primary ID
        pass
experiment = ep[802]
keep = ['EXPERIMENT/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/EXTERNAL_ID',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT/SINGLE',
        'EXPERIMENT/PLATFORM/ILLUMINA/INSTRUMENT_MODEL',
        'EXPERIMENT/PLATFORM/ILLUMINA',
        'SUBMISSION/IDENTIFIERS/PRIMARY_ID',
        'SUBMISSION/IDENTIFIERS/SUBMITTER_ID',
        'Organization/Address/Institution',
        'SAMPLE/TITLE',
        'SAMPLE/SAMPLE_NAME/TAXON_ID',
        'SAMPLE/SAMPLE_NAME/SCIENTIFIC_NAME',
        'SAMPLE/SAMPLE_NAME/COMMON_NAME',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/VALUE',
        'RUN_SET/RUN',
        'RUN_SET/RUN/Statistics/Read'
       ]

def print_tags(s, space='', path=''):
    # Skip the Quality and Base elements to keep the output readable
    if s.tag not in ('Quality', 'Base'):
        print("{space}{tag} {attrib} {text}".format(tag=s.tag, attrib=s.attrib, text=s.text, space=space))

# Walk five levels deep. Iterating over an Element yields its children
# (Element.getchildren() was removed in Python 3.9).
for i in experiment:
    path1 = i.tag
    print_tags(i, path=path1)
    for j in i:
        path2 = path1 + '/' + j.tag
        print_tags(j, '\t', path=path2)
        for k in j:
            path3 = path2 + '/' + k.tag
            print_tags(k, '\t\t', path=path3)
            for l in k:
                path4 = path3 + '/' + l.tag
                print_tags(l, '\t\t\t', path=path4)
                for m in l:
                    path5 = path4 + '/' + m.tag
                    print_tags(m, '\t\t\t\t', path=path5)

EXPERIMENT {'alias': 'lnc25', 'accession': 'SRX1542556'} None
	IDENTIFIERS {} None
		PRIMARY_ID {} SRX1542556
		SUBMITTER_ID {'namespace': 'Tsinghua University'} lnc25
	TITLE {} CR45542 knockout
	STUDY_REF {'accession': 'SRP068880'} None
		IDENTIFIERS {} None
			PRIMARY_ID {} SRP068880
	DESIGN {} None
		DESIGN_DESCRIPTION {} None
		SAMPLE_DESCRIPTOR {'accession': 'SRS1231938'} None
			IDENTIFIERS {} None
				PRIMARY_ID {} SRS1231938
		LIBRARY_DESCRIPTOR {} None
			LIBRARY_NAME {} None
			LIBRARY_STRATEGY {} RNA-Seq
			LIBRARY_SOURCE {} TRANSCRIPTOMIC
			LIBRARY_SELECTION {} PolyA
			LIBRARY_LAYOUT {} None
				SINGLE {} None
		SPOT_DESCRIPTOR {} None
			SPOT_DECODE_SPEC {} None
				SPOT_LENGTH {} 49
				READ_SPEC {} None
	PLATFORM {} None
		ILLUMINA {} None
			INSTRUMENT_MODEL {} Illumina HiSeq 2000
SUBMISSION {'alias': 'lnc1', 'lab_name': '', 'submission_comment': 'five fly mutants and one wild-type', 'center_name': 'Tsinghua University', 'accession': 'SRA325840'} None
	IDENTIFIERS {} 
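
The nested loops above stop at a fixed depth and never use the `keep` list. One possible alternative, sketched here with hypothetical helpers `walk` and `format_tree`, is a recursive walk that builds the same slash-separated paths and puts a `*` in front of any element whose path is listed in `keep`. The record in the example is a toy element, not real SRA data.

```python
from xml.etree import ElementTree as ET


def walk(elem, path='', depth=0, skip=('Quality', 'Base')):
    """Yield (depth, path, element) for elem and all descendants, depth first."""
    if elem.tag in skip:
        return
    path = elem.tag if not path else path + '/' + elem.tag
    yield depth, path, elem
    for child in elem:
        yield from walk(child, path, depth + 1, skip)


def format_tree(elem, keep=()):
    """Return one line per element; paths listed in `keep` get a leading '*'."""
    lines = []
    for depth, path, node in walk(elem):
        mark = '*' if path in keep else ' '
        text = (node.text or '').strip()
        lines.append('{}{}{} {} {}'.format(mark, '\t' * depth, node.tag, node.attrib, text))
    return lines


# Toy record for illustration only
demo = ET.fromstring(
    '<EXPERIMENT accession="X1"><IDENTIFIERS><PRIMARY_ID>X1</PRIMARY_ID>'
    '</IDENTIFIERS></EXPERIMENT>')
for line in format_tree(demo, keep=['EXPERIMENT/IDENTIFIERS/PRIMARY_ID']):
    print(line)
```

Because the `keep` paths start at `EXPERIMENT`, `SUBMISSION`, and so on, `format_tree` would be called on each child of the record element rather than on the record itself.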

I am assuming the redundant information is populated by SRA and should be identical, but it would probably be safest to compare data points and flag those that do not match. I could do this in a number of ways: I could build a set of classes, one for each piece of information I am interested in, and then parse each bit of XML to check; or I could build a class for each bit of XML and check. The first option is probably easier.

I decided to just try to replicate the `Sra RunInfo` table as closely as possible. I developed a class separately to do some error checking and parsing.
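
To make the idea of comparing redundant data points concrete, here is a minimal sketch. It is not the code in the `Sra` library; `collect_values`, `attr`, and `text` are hypothetical helpers. It gathers the study accession from the two places it appears in the example tree (the `accession` attribute of `STUDY_REF` and its `PRIMARY_ID`) and flags the record if they disagree.

```python
from xml.etree import ElementTree as ET


def attr(path, name):
    """Build a getter for attribute `name` of the element at `path`."""
    def getter(pkg):
        node = pkg.find(path)
        return node.get(name) if node is not None else None
    return getter


def text(path):
    """Build a getter for the text of the element at `path`."""
    def getter(pkg):
        return pkg.findtext(path)
    return getter


def collect_values(pkg, getters):
    """Return the set of distinct non-empty values the getters extract from pkg."""
    values = set()
    for getter in getters:
        value = getter(pkg)
        if value:
            values.add(value.strip())
    return values


STUDY_GETTERS = [attr('EXPERIMENT/STUDY_REF', 'accession'),
                 text('EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID')]

# Toy record built from the values printed above
demo = ET.fromstring(
    '<EXPERIMENT_PACKAGE><EXPERIMENT><STUDY_REF accession="SRP068880">'
    '<IDENTIFIERS><PRIMARY_ID>SRP068880</PRIMARY_ID></IDENTIFIERS>'
    '</STUDY_REF></EXPERIMENT></EXPERIMENT_PACKAGE>')
values = collect_values(demo, STUDY_GETTERS)
print('OK' if len(values) <= 1 else 'MISMATCH', values)
```

More than one distinct value means the redundant copies disagree and the record should be flagged for a closer look.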

In [6]:
# Parse SRA XML
res = Sra.SraResultsTable(fname)

# Generate DataFrame representation of XML.
exper = res.build_rows()

exper.shape

(29691, 48)

## Validate Parsed Table

Now that I have a parsed table I need to make sure it is correct. There will be some iteration if there is something wrong with the table.

The current table includes other types of samples besides RNA-seq data. I want to filter these out so that it matches the RunInfo table. Note that this does not account for samples that are annotated incorrectly; this is just for verification of my XML parsing.

In [7]:
# List possible values of LibrarySource
exper['LibrarySource'].unique()

array(['GENOMIC', 'TRANSCRIPTOMIC', 'OTHER', 'SYNTHETIC', 'METAGENOMIC',
       'METATRANSCRIPTOMIC'], dtype=object)

In [8]:
# Filter out only Transcriptomic data
ts = exper[(exper['LibrarySource'] == 'TRANSCRIPTOMIC') | (exper['LibrarySource'] == 'METATRANSCRIPTOMIC')]

print(ts.shape)
ts.head()

(14035, 48)


Unnamed: 0,Run,RunSecondary,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
315,SRR3476589,,2016-05-06,,39131961,1956598050,0,50,,,,SRX1743178,UNDEFINED,OTHER,other,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422417,SAMN04942830,,7227,Drosophila melanogaster,GSM2142680,,,,,,,no,,,,,,SRA423615,,public,,
316,SRR3476587,,2016-05-06,,15895465,794773250,0,50,,,,SRX1743176,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422418,SAMN04942828,,7227,Drosophila melanogaster,GSM2142678,,,,,,,no,,,,,,SRA423615,,public,,
317,SRR3476579,,2016-05-06,,20659277,3098891550,0,150,,,,SRX1743168,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422407,SAMN04942820,,7227,Drosophila melanogaster,GSM2142670,,,,,,,no,,,,,,SRA423615,,public,,
318,SRR3476578,,2016-05-06,,14956121,2243418150,0,150,,,,SRX1743167,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422411,SAMN04942819,,7227,Drosophila melanogaster,GSM2142669,,,,,,,no,,,,,,SRA423615,,public,,
319,SRR3476577,,2016-05-06,,8894958,1334243700,0,150,,,,SRX1743166,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422406,SAMN04942818,,7227,Drosophila melanogaster,GSM2142668,,,,,,,no,,,,,,SRA423615,,public,,


In [9]:
# Create list of Run IDs from my parsed table
myRuns = set(ts.Run.tolist())

In [10]:
# Import the web downloaded table and create a list of Run IDs
web = pd.read_csv('../../output/SraRunInfo_example.csv')
webRuns = set(web.Run.tolist())

### Compare Rows

There are <font color="#FF0000">{{ts.shape[0]}}</font> rows in the output, but the SRA RunInfo table has <font color="#FF0000">{{web.shape[0]}}</font> rows. I need to figure out the difference.

In [11]:
# Use sets to determine the differences between the two lists
myMissing = webRuns.difference(myRuns)
print(len(myMissing))
myMissing

36


{'SRR2194872',
 'SRR2194895',
 'SRR2194929',
 'SRR2194944',
 'SRR2194955',
 'SRR2194957',
 'SRR2195001',
 'SRR2195002',
 'SRR2195003',
 'SRR2195004',
 'SRR2195005',
 'SRR2195006',
 'SRR2422935',
 'SRR2422936',
 'SRR2422937',
 'SRR2422938',
 'SRR2422939',
 'SRR2422940',
 'SRR2660677',
 'SRR2660678',
 'SRR2660679',
 'SRR2660680',
 'SRR2660681',
 'SRR2660682',
 'SRR2660683',
 'SRR2660684',
 'SRR2660685',
 'SRR2660686',
 'SRR2660687',
 'SRR2660688',
 'SRR2660689',
 'SRR2660690',
 'SRR3575267',
 'SRR3575268',
 'SRR3575291',
 'SRR3575298'}

It looks like these runs no longer exist under these IDs because the submissions have been updated. Looking at the XML, these values are now secondary IDs. I went back and added the `RunSecondary` IDs to my table output to double check that this is the case.
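
The parsing code that fills `RunSecondary` is in the `Sra` library and is not shown. As a sketch of how such a column could be built, the hypothetical helper below joins all secondary IDs of a run with `;`. It assumes the secondary IDs sit in `SECONDARY_ID` elements next to `PRIMARY_ID` inside `IDENTIFIERS`; the accessions in the example are made up.

```python
from xml.etree import ElementTree as ET


def run_ids(run):
    """Return (primary_id, 'sec1;sec2;...') for a RUN element."""
    primary = run.findtext('IDENTIFIERS/PRIMARY_ID', default='')
    secondary = [node.text for node in run.findall('IDENTIFIERS/SECONDARY_ID')
                 if node.text]
    return primary, ';'.join(secondary)


# Made-up accessions, assumed element layout
demo = ET.fromstring(
    '<RUN><IDENTIFIERS><PRIMARY_ID>SRR0000003</PRIMARY_ID>'
    '<SECONDARY_ID>SRR0000001</SECONDARY_ID>'
    '<SECONDARY_ID>SRR0000002</SECONDARY_ID>'
    '</IDENTIFIERS></RUN>')
print(run_ids(demo))  # ('SRR0000003', 'SRR0000001;SRR0000002')
```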

In [12]:
# Create a list of secondary IDs
secondaryIDs = set(ts.RunSecondary.tolist())
secondaryIDs

{'',
 'SRR2194872;SRR2194895',
 'SRR2194929;SRR2194944',
 'SRR2194955;SRR2194957',
 'SRR2195001;SRR2195002',
 'SRR2195003;SRR2195004',
 'SRR2195005;SRR2195006',
 'SRR3136809'}

In [13]:
# Expand out concatenated secondary IDs for easy comparison
IDs = []
for ID in secondaryIDs:
    if ID != '':
        IDs.extend(ID.split(';'))

IDs = set(IDs)
IDs

{'SRR2194872',
 'SRR2194895',
 'SRR2194929',
 'SRR2194944',
 'SRR2194955',
 'SRR2194957',
 'SRR2195001',
 'SRR2195002',
 'SRR2195003',
 'SRR2195004',
 'SRR2195005',
 'SRR2195006',
 'SRR3136809'}

Now that I have a list of secondary IDs I can compare this list to the values missing from the web version of RunInfo.

In [18]:
# Compare secondary IDs with missing IDs and find the difference
IDs.difference(myMissing)

{'SRR3136809'}

In [19]:
ts[ts.RunSecondary == 'SRR3136809']

Unnamed: 0,Run,RunSecondary,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
1162,SRR3114090,SRR3136809,2016-02-10,,14025,2920418,0,0,,,,SRX1542109,MDR_RNAseq_map 91-C,RNA-Seq,RANDOM,TRANSCRIPTOMIC,UNDEFINED,0,0,ILLUMINA,Illumina HiSeq 2500,SRP068789,PRJNA309447,,,SRS1258611,SAMN04433043,,7227,Drosophila melanogaster,Dmel_MDR_RNA_seq,,,,,,,no,,,,,,SRA338296,,public,,


OK, it looks like all of the differences between my table and the SRA RunInfo table were due to secondary IDs caused by updated submissions. The majority of the updates were to fix the mistake of uploading each read of a paired-end experiment as a single-end run. SRR3136809 is not in the web-downloaded version of Sra RunInfo, so it is just an extra secondary ID in the database. **The parsed table has all of the correct rows.**
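
The set comparison in this section only looks at secondary IDs that are not in the missing set. The reverse direction (missing runs that no secondary ID accounts for) can be checked the same way. The sketch below bundles both directions into one report; `expand_ids` and `explain_missing` are hypothetical helpers and the IDs in the example are placeholders.

```python
def expand_ids(secondary_values, sep=';'):
    """Split concatenated secondary IDs into a flat set, dropping empty strings."""
    ids = set()
    for value in secondary_values:
        ids.update(part for part in value.split(sep) if part)
    return ids


def explain_missing(web_runs, my_runs, secondary_values):
    """Classify runs that are in the web table but not in the parsed table."""
    missing = set(web_runs) - set(my_runs)
    secondary = expand_ids(secondary_values)
    return {
        'explained': missing & secondary,        # missing, but known as a secondary ID
        'unexplained': missing - secondary,      # missing with no secondary ID match
        'extra_secondary': secondary - missing,  # secondary IDs the web table lacks
    }


# Placeholder IDs for illustration
report = explain_missing(
    web_runs={'A1', 'A2', 'A3', 'B1'},
    my_runs={'A3'},
    secondary_values={'', 'A1;A2', 'C9'})
print(report)
```

With the notebook's variables this would be called as `explain_missing(webRuns, myRuns, ts.RunSecondary)`, assuming `RunSecondary` holds strings (empty when there is no secondary ID). An empty `unexplained` set would confirm that every missing run is accounted for.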

### Compare Columns and Values

Now I want to look at the columns and values to make sure they are similar to the Sra RunInfo Table.

In [14]:
# Test out column comparison
my = ts[['Run','ReleaseDate']]
w = web[['Run','ReleaseDate']]

In [15]:
my.head()

Unnamed: 0,Run,ReleaseDate
315,SRR3476589,2016-05-06
316,SRR3476587,2016-05-06
317,SRR3476579,2016-05-06
318,SRR3476578,2016-05-06
319,SRR3476577,2016-05-06


In [16]:
w.head()

Unnamed: 0,Run,ReleaseDate
0,ERR358180,2016-06-16
1,ERR358181,2016-06-16
2,ERR358182,2016-06-16
3,ERR358183,2016-06-16
4,SRR3663861,2016-06-20


In [20]:
# Merge two tables on Run
merged = my.merge(w, how='left', on='Run')
merged.head()

Unnamed: 0,Run,ReleaseDate_x,ReleaseDate_y
0,SRR3476589,2016-05-06,2016-05-06
1,SRR3476587,2016-05-06,2016-05-06
2,SRR3476579,2016-05-06,2016-05-06
3,SRR3476578,2016-05-06,2016-05-06
4,SRR3476577,2016-05-06,2016-05-06


In [21]:
# Check if any of the rows have a mismatch
any(merged.ReleaseDate_x != merged.ReleaseDate_y)

False

In [22]:
def diff(ts, web, header):
    """ Compare columns between two DataFrames
    
    This function takes two data frames and compares the column (header). It returns 
    a Boolean Series where True indicates the columns were different.
    
    
    """
    my = ts[['Run', header]]
    w = web[['Run', header]]
    merged = my.merge(w, how='left', on='Run')
    return merged[header + '_x'] != merged[header + '_y']
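
One caveat with `diff`: in pandas, `NaN != NaN` evaluates to `True`, and an empty string is not equal to `NaN` either, so a column that is empty on both sides is still reported as different. A possible variant, sketched here under the assumption that `''` and `NaN` should both count as "missing", ignores rows where both sides are missing. `diff_nan_aware` is a hypothetical name and the two small frames are toy data.

```python
import pandas as pd


def diff_nan_aware(ts, web, header, key='Run'):
    """Like diff(), but rows that are missing on both sides count as equal."""
    merged = ts[[key, header]].merge(web[[key, header]], how='left', on=key)
    x = merged[header + '_x']
    y = merged[header + '_y']
    # Treat NaN/None and the empty string as "missing"
    x_missing = x.isna() | (x == '')
    y_missing = y.isna() | (y == '')
    return (x != y) & ~(x_missing & y_missing)


# Toy data for illustration
mine = pd.DataFrame({'Run': ['R1', 'R2', 'R3'], 'LibraryName': ['a', '', 'c']})
theirs = pd.DataFrame({'Run': ['R1', 'R2', 'R3'], 'LibraryName': ['a', None, 'x']})
print(diff_nan_aware(mine, theirs, 'LibraryName').tolist())  # [False, False, True]
```

This handles only missing values. Columns where one table stores numbers and the other stores strings would still need their types aligned before comparing.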

In [23]:
# Get a list of columns in the web version of SRA Run
web.columns

Index(['Run', 'ReleaseDate', 'LoadDate', 'spots', 'bases', 'spots_with_mates',
       'avgLength', 'size_MB', 'AssemblyName', 'download_path', 'Experiment',
       'LibraryName', 'LibraryStrategy', 'LibrarySelection', 'LibrarySource',
       'LibraryLayout', 'InsertSize', 'InsertDev', 'Platform', 'Model',
       'SRAStudy', 'BioProject', 'Study_Pubmed_id', 'ProjectID', 'Sample',
       'BioSample', 'SampleType', 'TaxID', 'ScientificName', 'SampleName',
       'g1k_pop_code', 'source', 'g1k_analysis_group', 'Subject_ID', 'Sex',
       'Disease', 'Tumor', 'Affection_Status', 'Analyte_Type',
       'Histological_Type', 'Body_Site', 'CenterName', 'Submission',
       'dbgap_study_accession', 'Consent', 'RunHash', 'ReadHash'],
      dtype='object')

The following columns have differences between the web SRA RunInfo and my parsed table. 

In [26]:
# Iterate over all columns and get differences NOTE: Skip the Run column.
bad = []
for c in web.columns.tolist()[1:]:
    d = any(diff(ts, web, c))
    if d:
        bad.append(c)
bad

['LoadDate',
 'spots',
 'bases',
 'spots_with_mates',
 'avgLength',
 'size_MB',
 'AssemblyName',
 'download_path',
 'LibraryName',
 'LibraryLayout',
 'InsertSize',
 'InsertDev',
 'BioProject',
 'Study_Pubmed_id',
 'ProjectID',
 'Sample',
 'BioSample',
 'SampleType',
 'TaxID',
 'ScientificName',
 'SampleName',
 'g1k_pop_code',
 'source',
 'g1k_analysis_group',
 'Subject_ID',
 'Sex',
 'Disease',
 'Affection_Status',
 'Analyte_Type',
 'Histological_Type',
 'Body_Site',
 'CenterName',
 'Submission',
 'dbgap_study_accession',
 'RunHash',
 'ReadHash']

Some of the information found in the RunInfo table was not found in the XML. For now I am just outputting these columns as `''`. I want to focus on those columns that I actually tried to populate in the output table.

In [25]:
# Columns that I did not populate
ignore = [
    'LoadDate',
    'size_MB',
    'AssemblyName',
    'download_path',
    'Pubmed_id',
    'ProjectID',
    'SampleType',
    'g1k_pop_code',
    'source',
    'g1k_analysis_group',
    'Subject_ID',
    'Disease',
    'Affection_Status',
    'Analyte_Type',
    'Histological_type',
    'dbgap_study_accession',
    'RunHash',
    'ReadHash'
]

In [29]:
# Get list of columns to focus on
set(bad).difference(set(ignore))

{'BioProject',
 'BioSample',
 'Body_Site',
 'CenterName',
 'Histological_Type',
 'InsertDev',
 'InsertSize',
 'LibraryLayout',
 'LibraryName',
 'Sample',
 'SampleName',
 'ScientificName',
 'Sex',
 'Study_Pubmed_id',
 'Submission',
 'TaxID',
 'avgLength',
 'bases',
 'spots',
 'spots_with_mates'}

In [32]:
merged = ts.merge(web, how='left', on='Run')
merged.set_index('Run', inplace=True)
merged.head()

Unnamed: 0_level_0,RunSecondary,ReleaseDate_x,LoadDate_x,spots_x,bases_x,spots_with_mates_x,avgLength_x,size_MB_x,AssemblyName_x,download_path_x,Experiment_x,LibraryName_x,LibraryStrategy_x,LibrarySelection_x,LibrarySource_x,LibraryLayout_x,InsertSize_x,InsertDev_x,Platform_x,Model_x,SRAStudy_x,BioProject_x,Study_Pubmed_id_x,ProjectID_x,Sample_x,BioSample_x,SampleType_x,TaxID_x,ScientificName_x,SampleName_x,g1k_pop_code_x,source_x,g1k_analysis_group_x,Subject_ID_x,Sex_x,Disease_x,Tumor_x,Affection_Status_x,Analyte_Type_x,Histological_Type_x,Body_Site_x,CenterName_x,Submission_x,dbgap_study_accession_x,Consent_x,RunHash_x,ReadHash_x,ReleaseDate_y,LoadDate_y,spots_y,bases_y,spots_with_mates_y,avgLength_y,size_MB_y,AssemblyName_y,download_path_y,Experiment_y,LibraryName_y,LibraryStrategy_y,LibrarySelection_y,LibrarySource_y,LibraryLayout_y,InsertSize_y,InsertDev_y,Platform_y,Model_y,SRAStudy_y,BioProject_y,Study_Pubmed_id_y,ProjectID_y,Sample_y,BioSample_y,SampleType_y,TaxID_y,ScientificName_y,SampleName_y,g1k_pop_code_y,source_y,g1k_analysis_group_y,Subject_ID_y,Sex_y,Disease_y,Tumor_y,Affection_Status_y,Analyte_Type_y,Histological_Type_y,Body_Site_y,CenterName_y,Submission_y,dbgap_study_accession_y,Consent_y,RunHash_y,ReadHash_y
Run,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1
SRR3476589,,2016-05-06,,39131961,1956598050,0,50,,,,SRX1743178,UNDEFINED,OTHER,other,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422417,SAMN04942830,,7227,Drosophila melanogaster,GSM2142680,,,,,,,no,,,,,,SRA423615,,public,,,2016-05-06,2016-06-22,39131961,1956598050,0,50,1659,,http://sra-download.ncbi.nlm.nih.gov/srapub/SR...,SRX1743178,,OTHER,other,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,2.0,320547,SRS1422417,SAMN04942830,simple,7227,Drosophila melanogaster,GSM2142680,,,,,,,no,,,,,GEO,SRA423615,,public,9B89DD52F297F6C85E21B0E8BD4F1797,ACA8DAE5C4D016485896BCC5E1B0AD3B
SRR3476587,,2016-05-06,,15895465,794773250,0,50,,,,SRX1743176,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422418,SAMN04942828,,7227,Drosophila melanogaster,GSM2142678,,,,,,,no,,,,,,SRA423615,,public,,,2016-05-06,2016-06-22,15895465,794773250,0,50,742,,http://sra-download.ncbi.nlm.nih.gov/srapub/SR...,SRX1743176,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,2.0,320547,SRS1422418,SAMN04942828,simple,7227,Drosophila melanogaster,GSM2142678,,,,,,,no,,,,,GEO,SRA423615,,public,9CD9DE707C9D7E686B6164CB55DA0FC9,3B01E7CCBBC1FFD2A1BAC53D2EF29018
SRR3476579,,2016-05-06,,20659277,3098891550,0,150,,,,SRX1743168,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422407,SAMN04942820,,7227,Drosophila melanogaster,GSM2142670,,,,,,,no,,,,,,SRA423615,,public,,,2016-05-06,2016-06-22,20659277,3098891550,0,150,1912,,http://sra-download.ncbi.nlm.nih.gov/srapub/SR...,SRX1743168,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,2.0,320547,SRS1422407,SAMN04942820,simple,7227,Drosophila melanogaster,GSM2142670,,,,,,,no,,,,,GEO,SRA423615,,public,C5654DF66CE2C36A8A3C9C09DAB257F3,7FBB1CFE8550969CA5BA637D4196885B
SRR3476578,,2016-05-06,,14956121,2243418150,0,150,,,,SRX1743167,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422411,SAMN04942819,,7227,Drosophila melanogaster,GSM2142669,,,,,,,no,,,,,,SRA423615,,public,,,2016-05-06,2016-06-22,14956121,2243418150,0,150,1376,,http://sra-download.ncbi.nlm.nih.gov/srapub/SR...,SRX1743167,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,2.0,320547,SRS1422411,SAMN04942819,simple,7227,Drosophila melanogaster,GSM2142669,,,,,,,no,,,,,GEO,SRA423615,,public,7E8F5A091954852B697A4EE0DA14B87E,F409573941C51A038714C5F2D61BA98F
SRR3476577,,2016-05-06,,8894958,1334243700,0,150,,,,SRX1743166,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422406,SAMN04942818,,7227,Drosophila melanogaster,GSM2142668,,,,,,,no,,,,,,SRA423615,,public,,,2016-05-06,2016-06-22,8894958,1334243700,0,150,820,,http://sra-download.ncbi.nlm.nih.gov/srapub/SR...,SRX1743166,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,2.0,320547,SRS1422406,SAMN04942818,simple,7227,Drosophila melanogaster,GSM2142668,,,,,,,no,,,,,GEO,SRA423615,,public,439C3072EF48AC6BDEC8174C10E9DDA7,092CC13ACF609BE7D540DFE21C0F4BED


### Check BioProject

In [38]:
merged.loc[merged['BioProject_x'] != merged['BioProject_y'],['BioProject_x', 'BioProject_y']].head(30)

Unnamed: 0_level_0,BioProject_x,BioProject_y
Run,Unnamed: 1_level_1,Unnamed: 2_level_1
SRR070285,UNDEFINED,PRJNA75285
SRR069030,UNDEFINED,PRJNA75285
SRR068619,UNDEFINED,PRJNA75285
SRR023779,UNDEFINED,PRJNA75285
SRR023780,UNDEFINED,PRJNA75285
SRR013492,UNDEFINED,PRJNA75285
SRR013491,UNDEFINED,PRJNA75285
SRR013489,UNDEFINED,PRJNA75285
SRR013490,UNDEFINED,PRJNA75285
SRR013488,UNDEFINED,PRJNA75285


The `UNDEFINED` values are what my parser outputs when the BioProject ID element is not present; these should correspond to NaN. `None` values occur when the BioProject ID element is present but its value is None. This would indicate that the field I am pulling the BioProject ID from is not always populated. I need to look at this in more detail.
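
One place to look is the sample record itself. The `SAMPLE` element printed in the sample attributes section of this notebook carries an `XREF_LINK` with `DB` set to `bioproject` and the accession in `LABEL`. The sketch below reads that link as a possible fallback; `bioproject_from_sample` is a hypothetical helper, and whether the older records listed above carry this link has not been checked here.

```python
from xml.etree import ElementTree as ET


def bioproject_from_sample(sample):
    """Return the BioProject label from a SAMPLE's XREF_LINKs, or None."""
    for xref in sample.findall('SAMPLE_LINKS/SAMPLE_LINK/XREF_LINK'):
        db = (xref.findtext('DB') or '').lower()
        if db == 'bioproject':
            return xref.findtext('LABEL') or None
    return None


# Toy record built from the SAMPLE values printed in this notebook
demo = ET.fromstring(
    '<SAMPLE><SAMPLE_LINKS><SAMPLE_LINK><XREF_LINK>'
    '<DB>bioproject</DB><ID>327349</ID><LABEL>PRJNA327349</LABEL>'
    '</XREF_LINK></SAMPLE_LINK></SAMPLE_LINKS></SAMPLE>')
print(bioproject_from_sample(demo))  # PRJNA327349
```

Returning `None` instead of a placeholder string keeps a missing ID distinguishable from a real value when the result is compared against the web table.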

### BioSample

In [41]:
merged.loc[merged['BioSample_x'] != merged['BioSample_y'],['BioSample_x', 'BioSample_y']]

Unnamed: 0_level_0,BioSample_x,BioSample_y
Run,Unnamed: 1_level_1,Unnamed: 2_level_1
SRR3147695,UNDEFINED,SAMN04433043
SRR3147698,UNDEFINED,SAMN04433043
SRR3147700,UNDEFINED,SAMN04433043
ERR562744,SAMEA2639703,
ERR562736,SAMEA2639694,
SRR1024049,UNDEFINED,SAMN02390696
SRR1119148,UNDEFINED,SAMN02390696
SRR1119250,UNDEFINED,SAMN02390696
SRR073287,SAMN00120617,SRS121518
SRR073286,SAMN00120617,SRS121518


There are only a handful of these. It is strange that the web table has a NaN where I have a SAM ID number.
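
For comparison, a BioSample accession can be read from the `EXTERNAL_ID` element whose `namespace` attribute is `BioSample`, as seen in the `SAMPLE` record printed later in this notebook. This is a sketch with a hypothetical helper `biosample_id`, not necessarily how the `Sra` library fills the column.

```python
from xml.etree import ElementTree as ET


def biosample_id(sample):
    """Return the BioSample accession of a SAMPLE element, or None if absent."""
    node = sample.find("IDENTIFIERS/EXTERNAL_ID[@namespace='BioSample']")
    if node is None or not node.text:
        return None
    return node.text.strip()


# Toy record built from the SAMPLE values printed in this notebook
demo = ET.fromstring(
    '<SAMPLE accession="SRS1532953"><IDENTIFIERS>'
    '<PRIMARY_ID>SRS1532953</PRIMARY_ID>'
    '<EXTERNAL_ID namespace="BioSample">SAMN05330489</EXTERNAL_ID>'
    '</IDENTIFIERS></SAMPLE>')
print(biosample_id(demo))  # SAMN05330489
```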

## Available Sample Attributes

Sample attributes are entered as free text. Here is a list of attributes currently in the SRA. Some of these can be collapsed into single categories.

In [155]:
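# Count how often each sample-attribute tag occurs (case-insensitive).
# The leading './/' searches the whole tree; without it findall() only
# matches direct children of root.
counts = {}
for i in root.findall('.//SAMPLE_ATTRIBUTE/TAG'):
    if i.text is None:
        continue
    text = i.text.lower()
    counts[text] = counts.get(text, 0) + 1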

In [156]:
pd.Series(counts).to_csv('../../output/tmp.text', sep='\t')
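
As a sketch of how free-text tags could be collapsed into single categories, the code below normalizes case, whitespace, and hyphens, then maps known synonyms to one name before counting. The `SYNONYMS` map is a made-up example; the real groupings would have to come from inspecting the counts written above.

```python
import re
from collections import Counter

# Hypothetical synonym map for illustration only
SYNONYMS = {'developmental_stage': 'dev_stage', 'gender': 'sex'}


def normalize_tag(tag):
    """Lowercase a tag, replace runs of spaces or hyphens with '_', apply synonyms."""
    key = re.sub(r'[\s\-]+', '_', tag.strip().lower())
    return SYNONYMS.get(key, key)


def count_tags(tags):
    """Count normalized tags, skipping empty or missing ones."""
    return Counter(normalize_tag(t) for t in tags if t)


demo_counts = count_tags(['Sex', 'gender', 'dev_stage', 'Developmental Stage', 'tissue'])
print(demo_counts)
```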

In [205]:
s = bob.find('SAMPLE')

In [206]:
for i in s.iter():
    print(i.tag, i.attrib, i.text)

SAMPLE {'accession': 'SRS1532953', 'alias': 'FR198N'} None
IDENTIFIERS {} None
PRIMARY_ID {} SRS1532953
EXTERNAL_ID {'namespace': 'BioSample'} SAMN05330489
SUBMITTER_ID {'namespace': 'pda|justin.lack@nih.gov', 'label': 'Sample name'} FR198N
TITLE {} FR198N
SAMPLE_NAME {} None
TAXON_ID {} 7227
SCIENTIFIC_NAME {} Drosophila melanogaster
COMMON_NAME {} fruit fly
SAMPLE_LINKS {} None
SAMPLE_LINK {} None
XREF_LINK {} None
DB {} bioproject
ID {} 327349
LABEL {} PRJNA327349
SAMPLE_ATTRIBUTES {} None
SAMPLE_ATTRIBUTE {} None
TAG {} strain
VALUE {} FR198N
SAMPLE_ATTRIBUTE {} None
TAG {} dev_stage
VALUE {} adult
SAMPLE_ATTRIBUTE {} None
TAG {} sex
VALUE {} female
SAMPLE_ATTRIBUTE {} None
TAG {} tissue
VALUE {} Whole organism
SAMPLE_ATTRIBUTE {} None
TAG {} BioSampleModel
VALUE {} Model organism or animal


In [211]:
for attrib in s.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
    print(attrib.find('TAG').text)

strain
dev_stage
sex
tissue
BioSampleModel
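
Building on the loop above, the tag and value pairs of one sample can be collected into a dictionary, which is a convenient shape for turning attributes into table columns. This is a minimal sketch; `sample_attributes` is a hypothetical helper, and if a tag is repeated within a sample the last value wins.

```python
from xml.etree import ElementTree as ET


def sample_attributes(sample):
    """Return {lowercased tag: value} for the SAMPLE_ATTRIBUTEs of one SAMPLE."""
    attrs = {}
    for attrib in sample.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
        tag = attrib.findtext('TAG')
        if tag:
            attrs[tag.strip().lower()] = (attrib.findtext('VALUE') or '').strip()
    return attrs


# Toy record built from the values printed above
demo = ET.fromstring(
    '<SAMPLE><SAMPLE_ATTRIBUTES>'
    '<SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>FR198N</VALUE></SAMPLE_ATTRIBUTE>'
    '<SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>'
    '</SAMPLE_ATTRIBUTES></SAMPLE>')
print(sample_attributes(demo))  # {'strain': 'FR198N', 'sex': 'female'}
```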
