# Getting Started with Arrays in BigQuery
Arrays are an important topic to understand, at least at a basic level, when working with data in BigQuery. They allow additional information to be stored as lists or key-value pairs within a table. Both can be searched and extracted to find exactly what you are looking for.

First you need to install and prepare the environment again.

In [None]:
# These commands were run when the notebook started but are shown in case you choose to work with Notebooks later.
!pip install google.cloud.bigquery
!pip install google.cloud.storage
!pip3 install --upgrade google-cloud-bigquery
!pip install google --user
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' --user

In [None]:
# Run this command block to load the BigQuery extension.
%load_ext google.cloud.bigquery

## What Does an Array look like in BigQuery?
Now you can look at a few columns from the earlier searches to see examples of arrays. You will select the accession column and the columns that are either simple arrays or arrays of structs. This is the original query for 5 human runs, restricted to just the columns that contain an array.

In [3]:
%%bigquery

SELECT acc, datastore_filetype, datastore_provider, datastore_region, attributes
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1410.56query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  7.33rows/s]


Unnamed: 0,acc,datastore_filetype,datastore_provider,datastore_region,attributes
0,DRR245014,[sra],"[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '1552245200'}, {'k': 'byt..."
1,ERR2535945,"[sra, fastq]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '970498698'}, {'k': 'byte..."
2,SRR15402535,"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'sex_calc', 'v': 'not applicable'}, {'k..."
3,SRR15611200,"[bam, sra]",[ncbi],[ncbi.public],"[{'k': 'sex_calc', 'v': 'female'}, {'k': 'base..."
4,SRR15584106,"[bam, sra]",[ncbi],[ncbi.public],"[{'k': 'sex_calc', 'v': 'female'}, {'k': 'base..."


## Arrays Compared to Structs
All of the columns other than **acc** are arrays \[ \] with a comma-separated list of items inside. The **attributes** column is an array of structs { } that have a key (k) and a value (v) for each item in the array. Arrays with just a list of strings or integers are usually easy to read. You can use the UNNEST function in BigQuery to extract the contents of an array into one item per row.

### Naming Unnested Data
When you unnest the data you can give the result of the unnest function a name (data_host in this query) and then use that name in the select statement.

In [7]:
%%bigquery

SELECT data_host
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(datastore_provider) as data_host
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 997.69query/s]                          
Downloading: 100%|██████████| 3/3 [00:00<00:00,  4.79rows/s]


Unnamed: 0,data_host
0,gs
1,ncbi
2,s3
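
To make the one-item-per-row behavior concrete, here is a minimal pure-Python sketch of what the comma join with UNNEST does to a single row. The `unnest` helper and the in-memory `rows` list are illustrative stand-ins, not part of BigQuery or its client library; the provider values are copied from the output above.

```python
def unnest(rows, array_column, alias):
    """Yield one output row per element of row[array_column].

    Mimics `FROM table, UNNEST(array_column) AS alias`: a row whose
    array is empty produces no output rows at all.
    """
    for row in rows:
        for item in row[array_column]:
            out = dict(row)      # keep every original column
            out[alias] = item    # attach the single array element
            yield out


# Hypothetical in-memory copy of one metadata row.
rows = [{"acc": "SRR2973262", "datastore_provider": ["gs", "ncbi", "s3"]}]

data_hosts = [r["data_host"] for r in unnest(rows, "datastore_provider", "data_host")]
print(data_hosts)  # ['gs', 'ncbi', 's3']
```

One input row becomes three output rows, which is why the query above returns three lines for a single accession.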


## Listing the Contents of Structs
If you unnest an array of structs, you get the list of the structs in the array. The structure of the column "attributes" for SRA is a list of key-value pairs. The contents of this column are some standard items like "bytes" and "bases" as well as items that were provided by the submitter as additional information. This means that not all runs in the database will have the same attributes. There can also be keys that are repeated so one key might have multiple values in a single run. 

In [2]:
%%bigquery

SELECT unnested_attributes
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as unnested_attributes 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 584.49query/s]                          
Downloading: 100%|██████████| 18/18 [00:00<00:00, 27.48rows/s]


Unnamed: 0,unnested_attributes
0,"{'k': 'sex_calc', 'v': 'female'}"
1,"{'k': 'bases', 'v': '12470223358'}"
2,"{'k': 'bytes', 'v': '6566142480'}"
3,"{'k': 'age_sam', 'v': 'NA'}"
4,"{'k': 'biomaterial_provider_sam', 'v': 'Robert..."
5,"{'k': 'cell_line_sam_ss_dpl110', 'v': 'HCA-7'}"
6,"{'k': 'isolate_sam', 'v': 'cell lines'}"
7,"{'k': 'tissue_sam', 'v': 'colon'}"
8,"{'k': 'primary_search', 'v': '304768'}"
9,"{'k': 'primary_search', 'v': '4309137'}"


## Listing the Keys and Values in the Attributes Structs as a Table
You can further break a struct such as {'k': 'sex_calc', 'v': 'female'} apart into (k) and (v) columns by using unnested_attributes.k and unnested_attributes.v in your query. At this point you will see all the keys and values in a table format for a single accession in the metadata table.

In [3]:
%%bigquery

SELECT unnested_attributes.k, unnested_attributes.v
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as unnested_attributes 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 791.68query/s]                          
Downloading: 100%|██████████| 18/18 [00:00<00:00, 26.09rows/s]


Unnamed: 0,k,v
0,sex_calc,female
1,bases,12470223358
2,bytes,6566142480
3,age_sam,NA
4,biomaterial_provider_sam,"Robert Coffey, Vanderbilt University"
5,cell_line_sam_ss_dpl110,HCA-7
6,isolate_sam,cell lines
7,tissue_sam,colon
8,primary_search,304768
9,primary_search,4309137
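
Because a key such as primary_search can repeat, loading these rows into an ordinary Python dictionary would silently keep only the last value. The sketch below shows one way to keep every value by grouping them into lists. The pairs are copied from a few rows of the output above, and the `group_attributes` helper is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# A few (k, v) pairs copied from the table above.
pairs = [
    ("sex_calc", "female"),
    ("tissue_sam", "colon"),
    ("primary_search", "304768"),
    ("primary_search", "4309137"),
]


def group_attributes(kv_pairs):
    """Collect the values for each key into a list, preserving order."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return dict(grouped)


attributes = group_attributes(pairs)
print(attributes["primary_search"])  # ['304768', '4309137']
```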


## Finding Runs with a Certain Attribute
Listing out all the attribute keys and values can be quite useful to help build searches for very specific things. Looking at the results for the above listing, one of the attributes is tissue_sam : colon. If you want to find all runs where that exists you can use the following query.

In [47]:
%%bigquery

SELECT acc
FROM `nih-sra-datastore.sra.metadata`   
WHERE ('tissue_sam', 'colon') in UNNEST(attributes)

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 946.58query/s]                          
Downloading: 100%|██████████| 15550/15550 [00:01<00:00, 13305.05rows/s]


Unnamed: 0,acc
0,SRR14684549
1,SRR15652167
2,SRR7762696
3,SRR7762697
4,SRR7762698
...,...
15545,SRR13345069
15546,SRR13872307
15547,SRR15347514
15548,SRR11631045
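
The IN UNNEST test asks whether the exact key-value pair appears anywhere in a run's attributes. A rough Python equivalent is sketched below; the `has_attribute` helper and the two sample runs are hypothetical. The comparison is exact and case-sensitive, so a value of 'Colon' or 'colon tissue' would not match.

```python
def has_attribute(attributes, key, value):
    """Return True if any struct in the array equals the (key, value) pair."""
    return any(a["k"] == key and a["v"] == value for a in attributes)


# Hypothetical runs, each with its own attributes array.
runs = {
    "RUN_A": [{"k": "tissue_sam", "v": "colon"}, {"k": "bases", "v": "100"}],
    "RUN_B": [{"k": "tissue_sam", "v": "Colon"}],
}

matching = [acc for acc, attrs in runs.items() if has_attribute(attrs, "tissue_sam", "colon")]
print(matching)  # ['RUN_A']
```

Because this form only filters rows and does not join against the array, each run appears at most once in the result.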


## Search for Runs to Output Data
Before moving on to using the SRA Toolkit you need to get a list of accessions to work with. One very useful part of Entrez on the NCBI website is natural language searching. You can type "breast cancer cell line" into the search and get a large number of results that probably contain what you are looking for. You can replicate some of that search by using the LIKE operator with the wildcard % in your attributes search. The query below will search for all paired RNA-Seq datasets using cDNA selection that are public and mention breast cancer cell line in the attributes.

In [4]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1194.28query/s]                         
Downloading: 100%|██████████| 1894/1894 [00:01<00:00, 1688.42rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject
0,SRR9134779,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
1,SRR9134779,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
2,SRR9134766,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
3,SRR9134766,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
4,SRR9134783,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
...,...,...,...,...,...,...
1889,SRR9077653,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA543423
1890,SRR9077653,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA543423
1891,SRR9134489,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
1892,SRR9134489,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037
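
To see what the pattern '%breast%cancer%cell%line%' will and will not match, the sketch below translates a LIKE pattern into a Python regular expression. It is a simplified model, not BigQuery's implementation: it handles only % (any run of characters) and _ (exactly one character) and ignores backslash escapes. The helper names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


pattern = "%breast%cancer%cell%line%"
print(sql_like("MCF7 breast cancer cell line", pattern))  # True
print(sql_like("Breast Cancer Cell Line", pattern))       # False
print(sql_like("cell line from breast cancer", pattern))  # False
```

The second and third calls show two limits compared with a natural language search: LIKE is case-sensitive, and the words must appear in the order given in the pattern.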


### Subselect
A subselect is a select statement embedded within another select statement. In this case you will use a subselect to get the value of attributes using the attribute key name. You will use 'source_name_sam' and 'bases' as an example because they are likely to be on most records.

You can also find the key for the attribute that contained the breast cancer cell line entry by using a subselect. Because there might be more than one attribute key that contains that phrase in a single record, you can use the LIMIT clause to return just one hit. You could do something like build an array from all the key hits, but that is more involved and not necessary in this example.

In [8]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'source_name_sam') AS source_name,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'bases') AS bases,
    (SELECT k FROM UNNEST(attributes) WHERE v LIKE '%breast%cancer%cell%line%' LIMIT 1) AS breast_cancer_attribute,
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1025.00query/s]                         
Downloading: 100%|██████████| 1894/1894 [00:01<00:00, 1484.61rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject,source_name,bases,breast_cancer_attribute
0,SRR9134779,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037,,262532850,biomaterial_provider_sam
1,SRR9134779,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037,,262532850,biomaterial_provider_sam
2,SRR9134766,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037,,5640579330,biomaterial_provider_sam
3,SRR9134766,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037,,5640579330,biomaterial_provider_sam
4,SRR9134783,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA504037,,6792583094,biomaterial_provider_sam
...,...,...,...,...,...,...,...,...,...
1889,SRR13815301,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA705671,SUM159_HIF-1a and HIF-2a double knockdown_1% O2,17699568284,cell_type_sam
1890,SRR13239745,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA684349,MCF7 breast cancer cell line,8275186258,source_name_sam
1891,SRR10916796,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA602306,breast cancer cell line,3965484990,source_name_sam
1892,SRR925704,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA210428,breast cancer cell line,4835439048,source_name_sam
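
The subselects above behave like small lookups against each row's attributes array. The Python sketch below models that behavior under stated assumptions: `scalar_lookup` and `first_key_matching` are hypothetical helpers, and the two attributes are reconstructed from the SRR13239745 row in the output above. A subselect used as a single value must return at most one row, which the sketch mirrors by raising an error when a key repeats.

```python
def scalar_lookup(attributes, key):
    """Model of (SELECT v FROM UNNEST(attributes) WHERE k = key).

    Returns None when the key is absent and raises when the key repeats,
    because a scalar subselect cannot return more than one row.
    """
    matches = [a["v"] for a in attributes if a["k"] == key]
    if len(matches) > 1:
        raise ValueError(f"key {key!r} has {len(matches)} values")
    return matches[0] if matches else None


def first_key_matching(attributes, predicate):
    """Model of (SELECT k ... WHERE <condition on v> LIMIT 1).

    This sketch takes the first hit in list order; without an ORDER BY,
    the query itself does not promise which hit is returned.
    """
    for a in attributes:
        if predicate(a["v"]):
            return a["k"]
    return None


# Two attributes reconstructed from the SRR13239745 row above.
attributes = [
    {"k": "source_name_sam", "v": "MCF7 breast cancer cell line"},
    {"k": "bases", "v": "8275186258"},
]

print(scalar_lookup(attributes, "bases"))       # 8275186258
print(scalar_lookup(attributes, "tissue_sam"))  # None
print(first_key_matching(attributes, lambda v: "breast cancer cell line" in v))  # source_name_sam
```

The None result corresponds to the empty source_name cells in the output above for runs that have no source_name_sam attribute.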


## Limiting the Search Further
There are still quite a few records in this list, so you might be able to find an even more specific attribute. Now you will add a search on the estrogen receptor (ER) status of the cells. Again, you will query the keys to see which attribute contained the information about the cells being ER-positive.

In [9]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'source_name_sam') AS source_name,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'bases') AS bases,
    (SELECT k FROM UNNEST(attributes) WHERE v LIKE '%ER-positive%' LIMIT 1) AS ER_column
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'
    AND extracted.v LIKE '%ER-positive%'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 750.32query/s]                          
Downloading: 100%|██████████| 6/6 [00:00<00:00,  9.90rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject,source_name,bases,ER_column
0,SRR8495053,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,7849955700,cell_type_sam
1,SRR8495049,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7359307500,cell_type_sam
2,SRR8495051,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7402524600,cell_type_sam
3,SRR8495054,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,6637409100,cell_type_sam
4,SRR8495050,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7551937500,cell_type_sam
5,SRR8495052,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,7745716800,cell_type_sam
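
Both LIKE conditions in this query are applied to the same unnested value, so a run matches only when a single attribute value contains both phrases. The sketch below contrasts that with a looser rule where each phrase may come from a different attribute. It uses plain substring checks instead of LIKE patterns to stay short, and the helper names, sample runs, and the er_status_sam key are all hypothetical.

```python
def same_value_match(attributes, phrases):
    """True if one attribute value contains every phrase (what the query does)."""
    return any(all(p in a["v"] for p in phrases) for a in attributes)


def any_value_match(attributes, phrases):
    """True if every phrase appears in some value, not necessarily the same one."""
    return all(any(p in a["v"] for a in attributes) for p in phrases)


phrases = ["breast cancer cell line", "ER-positive"]

# Hypothetical runs: the first has both phrases in one value, the second
# spreads them over two attributes.
run_a = [{"k": "cell_type_sam", "v": "ER-positive breast cancer cell line"}]
run_b = [
    {"k": "source_name_sam", "v": "breast cancer cell line"},
    {"k": "er_status_sam", "v": "ER-positive"},
]

print(same_value_match(run_a, phrases))  # True
print(same_value_match(run_b, phrases))  # False
print(any_value_match(run_b, phrases))   # True
```

A run shaped like run_b would be missed by the query above, which is one reason the result shrinks to so few rows.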


## Storing Accessions in a Variable
You can store the results of a query in a variable by passing a variable name to the %%bigquery cell magic. Here the name "results" after the magic tells it where to store the output. This is specific to the BigQuery Python module you're using in Jupyter and might not apply to how you normally run queries.

In [None]:
%%bigquery results

SELECT acc
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'
    AND extracted.v LIKE '%ER-positive%'
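
Outside Jupyter the %%bigquery magic is not available. One common alternative, sketched here under assumptions, is to call the google-cloud-bigquery client directly: `client.query(sql)` starts a query job and `.result()` waits for it and yields rows. The `fetch_accessions` helper is a hypothetical name, and the sketch assumes that credentials and a billing project are already configured in your environment.

```python
def fetch_accessions(client, sql):
    """Run `sql` and return the `acc` column as a list of strings.

    `client` is expected to behave like google.cloud.bigquery.Client:
    client.query(sql) returns a job whose .result() yields rows that
    support row["acc"] lookups.
    """
    rows = client.query(sql).result()
    return [row["acc"] for row in rows]


# Typical usage (needs credentials, so it is left commented out):
# from google.cloud import bigquery
# client = bigquery.Client()
# sql = "SELECT acc FROM `nih-sra-datastore.sra.metadata` LIMIT 5"
# accessions = fetch_accessions(client, sql)
```

Passing the client in as an argument keeps the helper easy to try with a stand-in object before running it against the real service.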


## Viewing the Contents of the Variable
You can output the contents of a variable by running a cell with the variable name in it.

In [52]:
results

Unnamed: 0,acc
0,SRR8495049
1,SRR8495052
2,SRR8495053
3,SRR8495050
4,SRR8495051
5,SRR8495054


## Storing the Variable Contents as an Accession List
The results variable contains a DataFrame with an index (0-5) and a header (acc) that you don't want in the accession list. You can remove those by calling the to_string method with the options index=False and header=False. The cell below then saves a file with just a list of accessions. Alternatively, if you want to save the results of a metadata query with the header information included, you can omit the header=False option.

In [51]:
# Write one accession per line, without the DataFrame index or header.
with open("accessions.txt", "w") as file:
    file.write(results.to_string(index=False, header=False))
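
Before passing the list to another tool, it can help to read the file back and confirm it holds one clean accession per line. The sketch below does that with the standard library only. The `read_accessions` helper is a hypothetical name, and the demonstration writes its own throwaway file with made-up padding, since to_string may pad values with spaces to align the column.

```python
import tempfile
from pathlib import Path


def read_accessions(path):
    """Return the non-empty lines of the file with whitespace stripped."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstration on a temporary file that imitates padded output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "accessions.txt"
    demo.write_text(" SRR8495049\n SRR8495052\n\n")
    accessions = read_accessions(demo)

print(accessions)  # ['SRR8495049', 'SRR8495052']
```

To check the real output, call `read_accessions("accessions.txt")` after running the cell above.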