# SRA Query and XML Parsing

Until now, metadata was downloaded by hand using the web interface, which generates a sample table. The goal of this notebook is to query the SRA database directly and generate this table without the web interface. There is an R package, `SRAdb`, that allows easy querying of SRA, but it does not interact with the NCBI database directly. Instead, `SRAdb` uses a pre-built SQLite database provided by the package author. In my tests of this package, the counts did not match what I was getting from the SRA web interface.

In [1]:
# Load useful extensions
%reload_ext autoreload
%autoreload 2

%reload_ext ipycache

In [2]:
# Imports
import os
import sys
import re
from xml.etree import ElementTree as ET

import pandas as pd
from Bio import Entrez

from ipycache import CacheMagics
CacheMagics.cachedir = '../../output/cache'

# Import my libraries
sys.path.insert(0, '../../lib/python/')
import Sra

pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)
Entrez.email = 'justin.fear@nih.gov'

## Query SRA

Use the Biopython implementation of E-utilities to query SRA directly with the search term `"Drosophila melanogaster"[Orgn]`. Because there is a large number of entries, use the web history feature of E-utilities.

In [14]:
# Query SRA
handle = Entrez.esearch(db='sra', term='"Drosophila melanogaster"[Orgn]', retmax=99999, usehistory='y')
records = Entrez.read(handle)
print('There were', records['Count'], 'records returned')

# Save history from eSearch, this will be used in eFetch
webenv = records['WebEnv']
query_key = records['QueryKey']

There were 22434 records returned


## Download Full XML Results

Download the XML records using the above query history. This process takes a long time and taxes the SRA system, so only re-download if needed.

In [4]:
# Check if I have already dumped the sra records. If 
# you want to update, simply delete the file and re-run.
fname = '../../output/sra_dump.xml'
if not os.path.exists(fname):
    Sra.downloadSRA(count=records['Count'], webenv=webenv, 
                query_key=query_key, fname=fname)

tree = ET.parse(fname)
root = tree.getroot()
# list(root) replaces the deprecated root.getchildren() (removed in Python 3.9)
ep = list(root)
print('You have ', len(ep), ' XML records. This should match the number of results returned from your',
      'query. If they do not match then delete the file `../../output/sra_dump.xml`')

You have  22344  XML records. This should match the number of results returned from your query. If they do not match then delete the file `../../output/sra_dump.xml`
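
`Sra.downloadSRA` lives in a separate library and is not shown here. A minimal sketch of how such a batched download could work with the saved web history is given below. The names `batch_starts` and `download_batches`, the `batch_size` default, and the injectable `fetch` argument are assumptions for illustration, not the author's implementation. `Entrez.efetch` is called with its standard `retstart`, `retmax`, `webenv`, and `query_key` parameters.

```python
from xml.etree import ElementTree as ET


def batch_starts(count, batch_size=500):
    """Return the retstart offsets needed to cover `count` records."""
    return list(range(0, int(count), batch_size))


def download_batches(count, webenv, query_key, fname, batch_size=500, fetch=None):
    """Fetch records in batches and merge them under one root element.

    Hypothetical helper; `fetch` defaults to Bio.Entrez.efetch.
    Returns the number of records written.
    """
    if fetch is None:
        from Bio import Entrez  # imported lazily so the helper can be tested offline
        fetch = Entrez.efetch

    combined = None
    for start in batch_starts(count, batch_size):
        handle = fetch(db='sra', retmode='xml', retstart=start, retmax=batch_size,
                       webenv=webenv, query_key=query_key)
        batch_root = ET.fromstring(handle.read())
        handle.close()
        if combined is None:
            # Reuse the root tag of the first batch for the merged document
            combined = ET.Element(batch_root.tag)
        combined.extend(list(batch_root))

    if combined is None:
        return 0
    ET.ElementTree(combined).write(fname, encoding='utf-8', xml_declaration=True)
    return len(combined)
```

Merging every batch under a single root keeps the dump parseable with one `ET.parse` call, and the returned record count can be compared against `records['Count']` right after the download.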


## Parse XML Records

E-utilities only provides an XML version of the results. The XML needs to be parsed to generate a results table. A description of the SRA XML schema can be found here:

http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA_1-6a/

There is a large number of fields in the SRA XML, and a lot of the data is repeated in multiple places. I need to decide which pieces of information to use.

In [5]:
# Print out an example Tree and mark fields used
for experiment in ep:
    try:
        if experiment.find('RUN_SET/RUN/IDENTIFIERS/PRIMARY_ID').text == 'ERR358180':
            break
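    except AttributeError:
        # This record has no run ID at that path
        pass
# Override the search result and use a fixed example record instead
experiment = ep[802]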
keep = ['EXPERIMENT/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/PRIMARY_ID',
        'EXPERIMENT/STUDY_REF/IDENTIFIERS/EXTERNAL_ID',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_STRATEGY',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SOURCE',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_SELECTION',
        'EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/LIBRARY_LAYOUT/SINGLE',
        'EXPERIMENT/PLATFORM/ILLUMINA/INSTRUMENT_MODEL',
        'EXPERIMENT/PLATFORM/ILLUMINA',
        'SUBMISSION/IDENTIFIERS/PRIMARY_ID',
        'SUBMISSION/IDENTIFIERS/SUBMITTER_ID',
        'Organization/Address/Institution',
        'SAMPLE/TITLE',
        'SAMPLE/SAMPLE_NAME/TAXON_ID',
        'SAMPLE/SAMPLE_NAME/SCIENTIFIC_NAME',
        'SAMPLE/SAMPLE_NAME/COMMON_NAME',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/TAG',
        'SAMPLE/SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE/VALUE',
        'RUN_SET/RUN',
        'RUN_SET/RUN/Statistics/Read'
       ]

def print_tags(s, space='', path=''):
    # Skip the bulky per-base Quality and Base elements
    if s.tag not in ('Quality', 'Base'):
        print("{space}{tag} {attrib} {text}".format(tag=s.tag, attrib=s.attrib, text=s.text, space=space))

def walk(element, depth=0, path='', max_depth=5):
    # Print the tree down to max_depth levels, tracking the path of each element.
    # Iterating over an element replaces the deprecated getchildren().
    for child in element:
        child_path = path + '/' + child.tag if path else child.tag
        print_tags(child, '\t' * depth, path=child_path)
        if depth + 1 < max_depth:
            walk(child, depth + 1, child_path, max_depth)

walk(experiment)

EXPERIMENT {'accession': 'SRX1542556', 'alias': 'lnc25'} None
	IDENTIFIERS {} None
		PRIMARY_ID {} SRX1542556
		SUBMITTER_ID {'namespace': 'Tsinghua University'} lnc25
	TITLE {} CR45542 knockout
	STUDY_REF {'accession': 'SRP068880'} None
		IDENTIFIERS {} None
			PRIMARY_ID {} SRP068880
	DESIGN {} None
		DESIGN_DESCRIPTION {} None
		SAMPLE_DESCRIPTOR {'accession': 'SRS1231938'} None
			IDENTIFIERS {} None
				PRIMARY_ID {} SRS1231938
		LIBRARY_DESCRIPTOR {} None
			LIBRARY_NAME {} None
			LIBRARY_STRATEGY {} RNA-Seq
			LIBRARY_SOURCE {} TRANSCRIPTOMIC
			LIBRARY_SELECTION {} PolyA
			LIBRARY_LAYOUT {} None
				SINGLE {} None
		SPOT_DESCRIPTOR {} None
			SPOT_DECODE_SPEC {} None
				SPOT_LENGTH {} 49
				READ_SPEC {} None
	PLATFORM {} None
		ILLUMINA {} None
			INSTRUMENT_MODEL {} Illumina HiSeq 2000
SUBMISSION {'submission_comment': 'five fly mutants and one wild-type', 'lab_name': '', 'accession': 'SRA325840', 'alias': 'lnc1', 'center_name': 'Tsinghua University'} None
	IDENTIFIERS {} 

I am assuming the redundant information is populated by SRA and should be identical, but it would be safest to compare data points and flag those that do not match. I could do this in a number of ways. First, I could build a set of classes, one for each piece of information I am interested in, and then parse each bit of XML to check it. Alternatively, I could build a class for each bit of XML and check. The first option is probably easier.

I decided to just replicate the `Sra RunInfo` table as closely as possible. I developed a class separately to do some error checking and parsing.
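
The separate class is not shown in this notebook. As a rough illustration of the kind of redundancy check described above, the sketch below compares an element's `accession` attribute with the text of its `IDENTIFIERS/PRIMARY_ID` child, two places where the example tree repeats the same value. The function name `check_accession` and the wrapper tag in the toy record are assumptions.

```python
from xml.etree import ElementTree as ET


def check_accession(record, element_path):
    """Compare an element's `accession` attribute with its PRIMARY_ID text.

    Returns (accession, primary_id, match).
    """
    node = record.find(element_path)
    accession = node.get('accession') if node is not None else None
    primary_id = record.findtext(element_path + '/IDENTIFIERS/PRIMARY_ID')
    return accession, primary_id, accession == primary_id


# Small record modeled on the example tree printed above
# (the EXPERIMENT_PACKAGE wrapper tag is an assumption)
example = ET.fromstring(
    '<EXPERIMENT_PACKAGE>'
    '<EXPERIMENT accession="SRX1542556">'
    '<IDENTIFIERS><PRIMARY_ID>SRX1542556</PRIMARY_ID></IDENTIFIERS>'
    '<STUDY_REF accession="SRP068880">'
    '<IDENTIFIERS><PRIMARY_ID>SRP068880</PRIMARY_ID></IDENTIFIERS>'
    '</STUDY_REF>'
    '</EXPERIMENT>'
    '</EXPERIMENT_PACKAGE>'
)

for path in ('EXPERIMENT', 'EXPERIMENT/STUDY_REF'):
    print(path, check_accession(example, path))
```

Records where the third element of the tuple is `False` could be collected and reviewed by hand. Note that two missing values also compare as equal here.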

In [30]:
# Parse SRA XML
res = Sra.SraResultsTable(fname)

# Generate DataFrame representation of XML.
exper = res.build_rows()

exper.shape

(29691, 48)

## Validate Parsed Table

Now that I have a parsed table I need to make sure it is correct. There will be some iteration if there is something wrong with the table.

The current table includes other types of samples besides RNA-seq data. I want to filter these out so that the table matches the RunInfo table. Note that this does not account for samples that are annotated incorrectly; it is just for verification of my XML parsing.

In [31]:
# List possible values of LibrarySource
exper['LibrarySource'].unique()

array(['GENOMIC', 'TRANSCRIPTOMIC', 'OTHER', 'SYNTHETIC', 'METAGENOMIC',
       'METATRANSCRIPTOMIC'], dtype=object)

In [103]:
# Filter out only Transcriptomic data
ts = exper[(exper['LibrarySource'] == 'TRANSCRIPTOMIC') | (exper['LibrarySource'] == 'METATRANSCRIPTOMIC')]

print(ts.shape)
ts.head()

(14035, 48)


Unnamed: 0,Run,RunSecondary,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
315,SRR3476589,,2016-05-06,,39131961,1956598050,0,50,,,,SRX1743178,UNDEFINED,OTHER,other,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422417,SAMN04942830,,7227,Drosophila melanogaster,GSM2142680,,,,,,,no,,,,,,SRA423615,,public,,
316,SRR3476587,,2016-05-06,,15895465,794773250,0,50,,,,SRX1743176,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422418,SAMN04942828,,7227,Drosophila melanogaster,GSM2142678,,,,,,,no,,,,,,SRA423615,,public,,
317,SRR3476579,,2016-05-06,,20659277,3098891550,0,150,,,,SRX1743168,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422407,SAMN04942820,,7227,Drosophila melanogaster,GSM2142670,,,,,,,no,,,,,,SRA423615,,public,,
318,SRR3476578,,2016-05-06,,14956121,2243418150,0,150,,,,SRX1743167,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422411,SAMN04942819,,7227,Drosophila melanogaster,GSM2142669,,,,,,,no,,,,,,SRA423615,,public,,
319,SRR3476577,,2016-05-06,,8894958,1334243700,0,150,,,,SRX1743166,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422406,SAMN04942818,,7227,Drosophila melanogaster,GSM2142668,,,,,,,no,,,,,,SRA423615,,public,,


There are <font color="#FF0000">14,035</font> rows in the output, but the SRA RunInfo table has <font color="#FF0000">14,072</font> rows. I need to figure out the difference.

In [10]:
# Create list of Run IDs from my parsed table
myRuns = set(ts.Run.tolist())

In [104]:
# Import the web downloaded table and create a list of Run IDs
web = pd.read_csv('../../output/SraRunInfo_example.csv')
webRuns = set(web.Run.tolist())

In [12]:
# Use sets to determine the differences between the two lists
myMissing = webRuns.difference(myRuns)
print(len(myMissing), myMissing)

36 {'SRR2660683', 'SRR2422937', 'SRR2660679', 'SRR2660688', 'SRR2195003', 'SRR2194895', 'SRR2195002', 'SRR2194955', 'SRR2660681', 'SRR3575267', 'SRR2660685', 'SRR2422940', 'SRR2422938', 'SRR2195005', 'SRR2660689', 'SRR2194872', 'SRR3575291', 'SRR2194957', 'SRR2422936', 'SRR2195004', 'SRR2660684', 'SRR2660678', 'SRR2660687', 'SRR2660677', 'SRR2660686', 'SRR2660690', 'SRR2660682', 'SRR2660680', 'SRR2422935', 'SRR2194944', 'SRR2194929', 'SRR2195001', 'SRR3575298', 'SRR2195006', 'SRR3575268', 'SRR2422939'}


It looks like these samples no longer exist because they have been updated. Looking at the XML, these values are now secondary IDs. I am going to go back and add them to my table output just to double check that this is the case.

I went back and added a `RunSecondary` column, which contains a list of secondary IDs.
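
The parsing class itself is not shown, so the following is only a sketch of how such a column could be filled from a `RUN` element. It assumes the secondary IDs sit at `IDENTIFIERS/SECONDARY_ID` under each `RUN`, and it joins them with `;` to match the values seen in the table. The function name `secondary_run_ids` is hypothetical.

```python
from xml.etree import ElementTree as ET


def secondary_run_ids(run, sep=';'):
    """Join all secondary IDs of a RUN element into one string.

    Assumes the IDs are stored at IDENTIFIERS/SECONDARY_ID.
    Returns '' when the run has no secondary IDs.
    """
    ids = [e.text for e in run.findall('IDENTIFIERS/SECONDARY_ID') if e.text]
    return sep.join(ids)


# Toy RUN element using the run/secondary pair that appears in this notebook
run = ET.fromstring(
    '<RUN accession="SRR3114090">'
    '<IDENTIFIERS>'
    '<PRIMARY_ID>SRR3114090</PRIMARY_ID>'
    '<SECONDARY_ID>SRR3136809</SECONDARY_ID>'
    '</IDENTIFIERS>'
    '</RUN>'
)
print(secondary_run_ids(run))
```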

In [43]:
# Create a list of secondary IDs
secondaryIDs = set(ts.RunSecondary.tolist())
secondaryIDs

{'',
 'SRR2194872;SRR2194895',
 'SRR2194929;SRR2194944',
 'SRR2194955;SRR2194957',
 'SRR2195001;SRR2195002',
 'SRR2195003;SRR2195004',
 'SRR2195005;SRR2195006',
 'SRR3136809'}

In [41]:
# Expand out concatenated secondary IDs for easy comparison
IDs = []
for ID in secondaryIDs:
    if ID != '':
        IDs.extend(ID.split(';'))

IDs = set(IDs)
IDs

{'SRR2194872',
 'SRR2194895',
 'SRR2194929',
 'SRR2194944',
 'SRR2194955',
 'SRR2194957',
 'SRR2195001',
 'SRR2195002',
 'SRR2195003',
 'SRR2195004',
 'SRR2195005',
 'SRR2195006',
 'SRR3136809'}

In [42]:
# Compare secondary IDs with missing IDs and find the difference
IDs.difference(myMissing)

{'SRR3136809'}

In [44]:
ts[ts.RunSecondary == 'SRR3136809']

Unnamed: 0,Run,RunSecondary,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
1162,SRR3114090,SRR3136809,2016-02-10,,14025,2920418,0,0,,,,SRX1542109,MDR_RNAseq_map 91-C,RNA-Seq,RANDOM,TRANSCRIPTOMIC,UNDEFINED,0,0,ILLUMINA,Illumina HiSeq 2500,SRP068789,PRJNA309447,,,SRS1258611,SAMN04433043,,7227,Drosophila melanogaster,Dmel_MDR_RNA_seq,,,,,,,no,,,,,,SRA338296,,public,,


OK, it looks like all of the differences between my table and the SRA RunInfo table were due to secondary IDs caused by updated submissions. The majority of the updates fixed the mistake of uploading each read of a paired-end experiment as a single-end run. SRR3136809 is not in the web-downloaded version of Sra RunInfo, so it is just an extra secondary ID in the database. The parsed table has all of the correct rows.

Now I want to examine the columns and values to make sure they are similar to the Sra RunInfo table.

In [105]:
ts.set_index('Run', inplace=True)
web.set_index('Run', inplace=True)

In [110]:
my = ts['ReleaseDate'].to_frame()

In [113]:
w = web['ReleaseDate'].to_frame()

In [114]:
my.head()

Run,ReleaseDate
SRR3476589,2016-05-06
SRR3476587,2016-05-06
SRR3476579,2016-05-06
SRR3476578,2016-05-06
SRR3476577,2016-05-06


In [115]:
w.head()

Run,ReleaseDate
ERR358180,2016-06-16
ERR358181,2016-06-16
ERR358182,2016-06-16
ERR358183,2016-06-16
SRR3663861,2016-06-20


In [123]:
merged = my.merge(w, how='left', left_index=True, right_index=True)

In [124]:
merged.head()

Run,ReleaseDate_x,ReleaseDate_y
SRR3476589,2016-05-06,2016-05-06
SRR3476587,2016-05-06,2016-05-06
SRR3476579,2016-05-06,2016-05-06
SRR3476578,2016-05-06,2016-05-06
SRR3476577,2016-05-06,2016-05-06


In [56]:
any(merged.ReleaseDate_x != merged.ReleaseDate_y)

False

In [129]:
def diff(ts, web, header):
    my = ts[header].to_frame()
    w = web[header].to_frame()
    merged = my.merge(w, how='left', left_index=True, right_index=True)
    return merged[header + '_x'] != merged[header + '_y']

In [61]:
web.columns

Index(['Run', 'ReleaseDate', 'LoadDate', 'spots', 'bases', 'spots_with_mates',
       'avgLength', 'size_MB', 'AssemblyName', 'download_path', 'Experiment',
       'LibraryName', 'LibraryStrategy', 'LibrarySelection', 'LibrarySource',
       'LibraryLayout', 'InsertSize', 'InsertDev', 'Platform', 'Model',
       'SRAStudy', 'BioProject', 'Study_Pubmed_id', 'ProjectID', 'Sample',
       'BioSample', 'SampleType', 'TaxID', 'ScientificName', 'SampleName',
       'g1k_pop_code', 'source', 'g1k_analysis_group', 'Subject_ID', 'Sex',
       'Disease', 'Tumor', 'Affection_Status', 'Analyte_Type',
       'Histological_Type', 'Body_Site', 'CenterName', 'Submission',
       'dbgap_study_accession', 'Consent', 'RunHash', 'ReadHash'],
      dtype='object')

The following columns have differences between the web SRA RunInfo and my parsed table. 

In [79]:
for c in web.columns.tolist()[1:]:
    d = any(diff(ts, web, c))
    if d:
        print(c)

LoadDate
spots
bases
spots_with_mates
avgLength
size_MB
AssemblyName
download_path
LibraryName
LibraryLayout
InsertSize
InsertDev
BioProject
Study_Pubmed_id
ProjectID
Sample
BioSample
SampleType
TaxID
ScientificName
SampleName
g1k_pop_code
source
g1k_analysis_group
Subject_ID
Sex
Disease
Affection_Status
Analyte_Type
Histological_Type
Body_Site
CenterName
Submission
dbgap_study_accession
RunHash
ReadHash
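
Part of the long list above may be an artifact of the comparison itself: in pandas, `NaN != NaN` evaluates to `True`, so a column that is empty in both tables is flagged in every row. A possible NaN-aware variant of `diff` is sketched below. The name `diff_nan_aware` is hypothetical, and differences in dtype (for example strings versus integers) would still need separate handling.

```python
import pandas as pd


def diff_nan_aware(left, right, header):
    """Flag rows where `header` differs, treating NaN on both sides as equal."""
    merged = left[[header]].merge(right[[header]], how='left',
                                  left_index=True, right_index=True,
                                  suffixes=('_x', '_y'))
    x = merged[header + '_x']
    y = merged[header + '_y']
    both_missing = x.isna() & y.isna()
    return (x != y) & ~both_missing


# Toy tables indexed by run ID (values are made up for illustration)
left = pd.DataFrame({'spots': [1.0, None, 3.0]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'spots': [1.0, None, 4.0]}, index=['a', 'b', 'c'])
print(diff_nan_aware(left, right, 'spots'))
```

A run that is missing from the right-hand table is still flagged when the left-hand value is present, which is usually the desired behavior for this kind of validation.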


In [138]:
d = diff(ts, web, 'spots')

In [139]:
ts[d].head()

Run,RunSecondary,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR3476589,,2016-05-06,,39131961,1956598050,0,50,,,,SRX1743178,UNDEFINED,OTHER,other,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422417,SAMN04942830,,7227,Drosophila melanogaster,GSM2142680,,,,,,,no,,,,,,SRA423615,,public,,
SRR3476587,,2016-05-06,,15895465,794773250,0,50,,,,SRX1743176,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422418,SAMN04942828,,7227,Drosophila melanogaster,GSM2142678,,,,,,,no,,,,,,SRA423615,,public,,
SRR3476579,,2016-05-06,,20659277,3098891550,0,150,,,,SRX1743168,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422407,SAMN04942820,,7227,Drosophila melanogaster,GSM2142670,,,,,,,no,,,,,,SRA423615,,public,,
SRR3476578,,2016-05-06,,14956121,2243418150,0,150,,,,SRX1743167,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422411,SAMN04942819,,7227,Drosophila melanogaster,GSM2142669,,,,,,,no,,,,,,SRA423615,,public,,
SRR3476577,,2016-05-06,,8894958,1334243700,0,150,,,,SRX1743166,UNDEFINED,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP074388,PRJNA320547,,,SRS1422406,SAMN04942818,,7227,Drosophila melanogaster,GSM2142668,,,,,,,no,,,,,,SRA423615,,public,,


In [140]:
web[d].head()

  if __name__ == '__main__':


IndexingError: Unalignable boolean Series key provided
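
The `IndexingError` arises because `d` is indexed by the runs in `ts` (the left side of the merge), while `web` has a different set of runs in its index, so the boolean mask cannot be aligned. One way around this, sketched here with a hypothetical helper `select_flagged`, is to turn the mask into a list of run IDs and select only those that are present in the target table.

```python
import pandas as pd


def select_flagged(frame, mask):
    """Return the rows of `frame` whose index label is flagged True in `mask`.

    Labels flagged in `mask` but absent from `frame` are skipped.
    """
    flagged = mask[mask].index
    return frame.loc[frame.index.intersection(flagged)]


# Toy example: the mask and the frame only partly share index labels
frame = pd.DataFrame({'spots': [10, 20, 30]}, index=['a', 'b', 'x'])
mask = pd.Series([True, False, True], index=['a', 'b', 'c'])
print(select_flagged(frame, mask))
```

With the notebook's variables, this would be called as `select_flagged(web, d)`.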

## Available Sample Attributes: 

Sample attributes are entered as free text. Here is a list of attributes currently in the SRA. Some of these can be collapsed into single categories.

In [155]:
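from collections import Counter

# Count how often each sample attribute tag occurs (case-insensitive).
# The './/' prefix searches the whole tree; without it, findall only
# looks at direct children of root and finds nothing.
counts = Counter()
for i in root.findall('.//SAMPLE_ATTRIBUTE/TAG'):
    if i.text is not None:
        counts[i.text.lower()] += 1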

In [156]:
pd.Series(counts).to_csv('../../output/tmp.text', sep='\t')
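
Collapsing free-text tags into single categories could look roughly like the sketch below. It normalizes case, whitespace, and hyphens, then applies a synonym table. The `SYNONYMS` entries and the function names are hypothetical; the real mapping would have to come from reviewing the exported tag counts.

```python
import re
from collections import Counter

# Hypothetical synonym table, for illustration only
SYNONYMS = {
    'dev_stage': 'developmental_stage',
    'development_stage': 'developmental_stage',
    'gender': 'sex',
}


def normalize_tag(tag):
    """Lower-case a tag, unify separators, and map known synonyms."""
    key = re.sub(r'[\s\-]+', '_', tag.strip().lower())
    return SYNONYMS.get(key, key)


def collapse_counts(counts):
    """Sum the counts of all tags that normalize to the same category."""
    collapsed = Counter()
    for tag, n in counts.items():
        collapsed[normalize_tag(tag)] += n
    return collapsed


# Made-up counts to show the effect
print(collapse_counts({'dev_stage': 2, 'Dev Stage': 1, 'sex': 3, 'gender': 1}))
```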

In [205]:
# NOTE: `bob` is not defined in the cells shown here; it is presumably a single record from `ep`
s = bob.find('SAMPLE')

In [206]:
for i in s.iter():
    print(i.tag, i.attrib, i.text)

SAMPLE {'accession': 'SRS1532953', 'alias': 'FR198N'} None
IDENTIFIERS {} None
PRIMARY_ID {} SRS1532953
EXTERNAL_ID {'namespace': 'BioSample'} SAMN05330489
SUBMITTER_ID {'namespace': 'pda|justin.lack@nih.gov', 'label': 'Sample name'} FR198N
TITLE {} FR198N
SAMPLE_NAME {} None
TAXON_ID {} 7227
SCIENTIFIC_NAME {} Drosophila melanogaster
COMMON_NAME {} fruit fly
SAMPLE_LINKS {} None
SAMPLE_LINK {} None
XREF_LINK {} None
DB {} bioproject
ID {} 327349
LABEL {} PRJNA327349
SAMPLE_ATTRIBUTES {} None
SAMPLE_ATTRIBUTE {} None
TAG {} strain
VALUE {} FR198N
SAMPLE_ATTRIBUTE {} None
TAG {} dev_stage
VALUE {} adult
SAMPLE_ATTRIBUTE {} None
TAG {} sex
VALUE {} female
SAMPLE_ATTRIBUTE {} None
TAG {} tissue
VALUE {} Whole organism
SAMPLE_ATTRIBUTE {} None
TAG {} BioSampleModel
VALUE {} Model organism or animal


In [211]:
for attrib in s.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
    print(attrib.find('TAG').text)

strain
dev_stage
sex
tissue
BioSampleModel
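
Building on the loop above, the tag/value pairs of one sample can be gathered into a dictionary, which is a convenient shape for turning attributes into table columns later. This is a minimal sketch; the function name `sample_attributes` is hypothetical, and the toy `SAMPLE` element only reproduces part of the record printed above.

```python
from xml.etree import ElementTree as ET


def sample_attributes(sample):
    """Return a {TAG: VALUE} dict for one SAMPLE element."""
    attrs = {}
    for attrib in sample.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
        tag = attrib.findtext('TAG')
        if tag is not None:
            attrs[tag] = attrib.findtext('VALUE')
    return attrs


# Toy SAMPLE element with two of the attributes shown above
sample = ET.fromstring(
    '<SAMPLE accession="SRS1532953">'
    '<SAMPLE_ATTRIBUTES>'
    '<SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>FR198N</VALUE></SAMPLE_ATTRIBUTE>'
    '<SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>female</VALUE></SAMPLE_ATTRIBUTE>'
    '</SAMPLE_ATTRIBUTES>'
    '</SAMPLE>'
)
print(sample_attributes(sample))
```

If a sample repeats a tag, the last value wins in this sketch; collecting values into lists would be an alternative.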
