# Query SRA metadata using BigQuery

This notebook uses pandas-gbq (https://pandas-gbq.readthedocs.io/) to query the SRA metadata table in BigQuery.

Before you begin, you must create a Google Cloud Platform project. Use the BigQuery sandbox to try the service for free.

If you do not provide any credentials, this module attempts to load credentials from the environment. If no credentials are found, pandas-gbq prompts you to open a web browser, where you can grant it permissions to access your cloud resources. These credentials are only used locally.

In [183]:
import pandas_gbq
import pandas as pd

In [184]:
# Construct query
# Here we are using the bioproject PRJNA523380 for the CCLE cell lines
# I only want to get the RNA-seq data 
query = """
SELECT * FROM `nih-sra-datastore.sra.metadata` 
WHERE bioproject = 'PRJNA523380' AND assay_type = 'RNA-Seq'
LIMIT 5000
"""
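
The bioproject and assay type are hard-coded in the query string above. If the same query needs to run for other projects, BigQuery named query parameters are one option. The following is a minimal sketch, not the approach used in this notebook: `named_string_params` and `parameterized_query` are hypothetical names, and the dictionary layout assumes the BigQuery job configuration format that `read_gbq` accepts through its `configuration` argument.

```python
def named_string_params(**params):
    """Build a job configuration dict carrying named STRING query parameters."""
    return {
        "query": {
            "parameterMode": "NAMED",
            "queryParameters": [
                {
                    "name": name,
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": value},
                }
                for name, value in params.items()
            ],
        }
    }


# Same query as above, with the literals replaced by @name placeholders
parameterized_query = """
SELECT * FROM `nih-sra-datastore.sra.metadata`
WHERE bioproject = @bioproject AND assay_type = @assay_type
LIMIT 5000
"""

configuration = named_string_params(bioproject="PRJNA523380", assay_type="RNA-Seq")
```

The pair could then be passed along as `pandas_gbq.read_gbq(parameterized_query, configuration=configuration)`.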

In [185]:
# Query the metadata table in BigQuery
# Might take some time
# Returns a dataframe
df = pandas_gbq.read_gbq(query)

Downloading: 100%|██████████|


In [186]:
# These are the columns in the metadata table and the type of data they contain
df_schema = pd.DataFrame(pandas_gbq.schema.generate_bq_schema(df)['fields'])
df_schema.set_index('name', inplace=True)

# Uncomment if you wish to see the schema
# df_schema 

# Goal: curate metadata for the results of the query
I specifically want a dataframe with the desired metadata, as well as a dictionary holding the same information.

I hope to use the dictionary to attach the metadata to the objects in the GCS bucket.
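
GCS stores custom object metadata as string key/value pairs, so values such as timestamps or lists would need converting before they can be attached. A minimal sketch of that conversion, assuming a flat per-run dictionary like the one this notebook aims to build; `to_gcs_metadata` is a hypothetical helper name, not part of any library:

```python
from datetime import datetime, timezone


def to_gcs_metadata(record):
    """Return a copy of `record` with every value rendered as a string.

    Hypothetical helper: assumes `record` is a flat dict of metadata for one run.
    """
    metadata = {}
    for key, value in record.items():
        if value is None:
            continue  # skip missing values rather than storing the text "None"
        if hasattr(value, "isoformat"):
            # datetime-like values (including pandas Timestamps) become ISO 8601 text
            metadata[key] = value.isoformat()
        elif isinstance(value, (list, tuple)):
            # repeated attributes become one comma-separated string
            metadata[key] = ",".join(str(v) for v in value)
        else:
            metadata[key] = str(value)
    return metadata


example = {
    "acc": "SRR8616020",
    "sample_name": "A375_SKIN",
    "releasedate": datetime(2019, 3, 27, tzinfo=timezone.utc),
}
print(to_gcs_metadata(example))
```

With the google-cloud-storage client, a dictionary like this could then be assigned to a blob's `metadata` attribute and saved with `blob.patch()`.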

In [189]:
# TODO: refactor these functions to be more generic.
# Also improve efficiency by avoiding iteration over the entire dataframe,
# perhaps by using a dictionary comprehension instead.


def convert_array_to_dict(arr):
    """Flatten a list of {'k': key, 'v': value} items into a dict, keeping only desired_keys."""
    result_dict = {}
    desired_keys = ['bases', 'bytes', 'run_file_create_date', 'disease_sam', 'disease_stage_sam_s_dpl172', 'tissue_sam']
    if isinstance(arr, list):
        for item in arr:
            if isinstance(item, dict):
                key = item.get('k')
                value = item.get('v')
                if key in desired_keys:
                    if key == 'bytes':
                        result_dict['size_in_bytes'] = int(value)
                        result_dict['size_in_GB'] = round(float(value) / 1000000000, 2)
                    else:
                        if key in result_dict:
                            if isinstance(result_dict[key], list):
                                result_dict[key].append(value)
                            else:
                                result_dict[key] = [result_dict[key], value]
                        else:
                            result_dict[key] = value
    return result_dict

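def convert_row_to_dict(row):
    """Convert one dataframe row to a dict, merging in whatever convert_array_to_dict
    extracts from the first element of the 'attributes' column."""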
    result_dict = {}
    for column in row.index:
        if column == 'attributes':
            result_dict.update(convert_array_to_dict(row[column][0]))
        else:
            result_dict[column] = row[column]
    return result_dict


def convert_dataframe_to_dict(df):
    """Map each run accession ('acc') to the dict built from its row."""
    result_dict = {}
    for _, row in df.iterrows():
        result_dict[row['acc']] = convert_row_to_dict(row)
    return result_dict
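
The TODO above could be approached along these lines. This is only a sketch under stated assumptions: each row is available as a plain dict (for example from `df.to_dict(orient='records')`), and `attributes` holds a sequence of `{'k': ..., 'v': ...}` items. `flatten_attributes`, `flatten_record` and `records_to_dict` are hypothetical names, the example record is made up, and unlike `convert_array_to_dict` the sketch keeps only the last value when a key repeats.

```python
DESIRED_KEYS = {"bases", "bytes", "run_file_create_date", "disease_sam",
                "disease_stage_sam_s_dpl172", "tissue_sam"}


def flatten_attributes(attributes):
    """Keep the desired k/v pairs from an iterable of {'k': ..., 'v': ...} items."""
    flat = {item.get("k"): item.get("v") for item in attributes
            if isinstance(item, dict) and item.get("k") in DESIRED_KEYS}
    if "bytes" in flat:
        # Report the file size both in bytes and in (decimal) gigabytes
        size = int(flat.pop("bytes"))
        flat["size_in_bytes"] = size
        flat["size_in_GB"] = round(size / 1e9, 2)
    return flat


def flatten_record(record):
    """Copy the plain columns and merge in the flattened attributes."""
    flat = {k: v for k, v in record.items() if k != "attributes"}
    attributes = record.get("attributes")
    if attributes is not None:
        flat.update(flatten_attributes(attributes))
    return flat


def records_to_dict(records, key="acc"):
    """Map each record's `key` value to its flattened metadata dict."""
    return {record[key]: flatten_record(record) for record in records}


# Made-up record for illustration only
records = [
    {"acc": "SRR0000001", "sample_name": "EXAMPLE_SKIN",
     "attributes": [{"k": "bytes", "v": "2500000000"},
                    {"k": "tissue_sam", "v": "skin"},
                    {"k": "other_key", "v": "ignored"}]},
]
print(records_to_dict(records))
```

Working on records rather than on `df.iterrows()` avoids building a pandas Series for every row, which is usually the slow part of row-wise iteration.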


In [188]:
# choose columns
columns = ['acc', 'sample_name', 'sample_acc', 'experiment',  'library_name', 'sra_study', 'center_name', 
'platform', 'assay_type', 'librarysource', 'organism', 'releasedate']

# For each column in columns, print the number of unique values and the first 5 unique values
for col in columns:
    print(f"For {col}, there are: {df[col].nunique()} unique values. \nExamples: {df[col].unique()[0:5]}\n")


For acc, there are: 1019 unique values. 
Examples: ['SRR8616020' 'SRR8615545' 'SRR8615484' 'SRR8615991' 'SRR8615876']

For sample_name, there are: 1019 unique values. 
Examples: ['A375_SKIN' 'LN443_CENTRAL_NERVOUS_SYSTEM' 'JHESOAD1_OESOPHAGUS'
 'COLO680N_OESOPHAGUS' 'OV56_OVARY']

For sample_acc, there are: 1019 unique values. 
Examples: ['SRS4395948' 'SRS4395810' 'SRS4395335' 'SRS4395972' 'SRS4396070']

For experiment, there are: 1019 unique values. 
Examples: ['SRX5415030' 'SRX5414855' 'SRX5414269' 'SRX5415059' 'SRX5415174']

For library_name, there are: 1019 unique values. 
Examples: ['RNASeq-A375_SKIN' 'RNASeq-LN443_CENTRAL_NERVOUS_SYSTEM'
 'RNASeq-JHESOAD1_OESOPHAGUS' 'RNASeq-COLO680N_OESOPHAGUS'
 'RNASeq-OV56_OVARY']

For sra_study, there are: 1 unique values. 
Examples: ['SRP186687']

For center_name, there are: 1 unique values. 
Examples: ['BROAD INSTITUTE']

For platform, there are: 1 unique values. 
Examples: ['ILLUMINA']

For assay_type, there are: 1 unique values. 
Examples

In [190]:
# Subset df to df_ using columns but also include the 'attributes' column
df_ = df[columns + ['attributes']]

dict_metadata = convert_dataframe_to_dict(df_)

In [191]:
# convert dict to pandas dataframe
df_metadata = pd.DataFrame.from_dict(dict_metadata, orient='index')

In [192]:
df_metadata

Unnamed: 0,acc,sample_name,sample_acc,experiment,library_name,sra_study,center_name,platform,assay_type,librarysource,organism,releasedate
SRR8616020,SRR8616020,A375_SKIN,SRS4395948,SRX5415030,RNASeq-A375_SKIN,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615545,SRR8615545,LN443_CENTRAL_NERVOUS_SYSTEM,SRS4395810,SRX5414855,RNASeq-LN443_CENTRAL_NERVOUS_SYSTEM,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615484,SRR8615484,JHESOAD1_OESOPHAGUS,SRS4395335,SRX5414269,RNASeq-JHESOAD1_OESOPHAGUS,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615991,SRR8615991,COLO680N_OESOPHAGUS,SRS4395972,SRX5415059,RNASeq-COLO680N_OESOPHAGUS,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615876,SRR8615876,OV56_OVARY,SRS4396070,SRX5415174,RNASeq-OV56_OVARY,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
SRR8616113,SRR8616113,NCIH2286_LUNG,SRS4395876,SRX5414937,RNASeq-NCIH2286_LUNG,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615935,SRR8615935,SNU175_LARGE_INTESTINE,SRS4395226,SRX5415115,RNASeq-SNU175_LARGE_INTESTINE,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615470,SRR8615470,TE4_OESOPHAGUS,SRS4395348,SRX5414283,RNASeq-TE4_OESOPHAGUS,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00
SRR8615458,SRR8615458,HCC38_BREAST,SRS4395357,SRX5414295,RNASeq-HCC38_BREAST,SRP186687,BROAD INSTITUTE,ILLUMINA,RNA-Seq,TRANSCRIPTOMIC,Homo sapiens,2019-03-27 00:00:00+00:00


In [193]:
dict_metadata

{'SRR8616020': {'acc': 'SRR8616020',
  'sample_name': 'A375_SKIN',
  'sample_acc': 'SRS4395948',
  'experiment': 'SRX5415030',
  'library_name': 'RNASeq-A375_SKIN',
  'sra_study': 'SRP186687',
  'center_name': 'BROAD INSTITUTE',
  'platform': 'ILLUMINA',
  'assay_type': 'RNA-Seq',
  'librarysource': 'TRANSCRIPTOMIC',
  'organism': 'Homo sapiens',
  'releasedate': Timestamp('2019-03-27 00:00:00+0000', tz='UTC')},
 'SRR8615545': {'acc': 'SRR8615545',
  'sample_name': 'LN443_CENTRAL_NERVOUS_SYSTEM',
  'sample_acc': 'SRS4395810',
  'experiment': 'SRX5414855',
  'library_name': 'RNASeq-LN443_CENTRAL_NERVOUS_SYSTEM',
  'sra_study': 'SRP186687',
  'center_name': 'BROAD INSTITUTE',
  'platform': 'ILLUMINA',
  'assay_type': 'RNA-Seq',
  'librarysource': 'TRANSCRIPTOMIC',
  'organism': 'Homo sapiens',
  'releasedate': Timestamp('2019-03-27 00:00:00+0000', tz='UTC')},
 'SRR8615484': {'acc': 'SRR8615484',
  'sample_name': 'JHESOAD1_OESOPHAGUS',
  'sample_acc': 'SRS4395335',
  'experiment': 'SRX541