# Getting Started with Arrays in BigQuery
Arrays are an important topic to understand, at least at a basic level, when working with data in BigQuery. They allow additional information to be stored as lists or as key-value pairs within a table. Both can be searched and extracted to find exactly what you are looking for.

First you need to install and prepare the environment again.

In [1]:
# These commands were run when the notebook started but are shown in case you choose to work with Notebooks later.
!pip install google.cloud.bigquery
!pip install google.cloud.storage
!pip3 install --upgrade google-cloud-bigquery
!pip install google --user
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' --user

Collecting google.cloud.bigquery
  Using cached https://files.pythonhosted.org/packages/31/b5/f9deb0ceb925fe04afdfb4d2258ef64959e22e45468432f5b97caf1b85ba/google_cloud_bigquery-2.26.0-py2.py3-none-any.whl
Installing collected packages: google.cloud.bigquery
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_bigquery-2.26.0-py3.9-nspkg.pth'
Consider using the `--user` option or check the permissions.
Collecting google.cloud.storage
  Using cached https://files.pythonhosted.org/packages/76/03/55d5255ce6226d8908d5354f3d2ab5749a0ec8b75efc9cdd7d8a45db17b0/google_cloud_storage-1.42.1-py2.py3-none-any.whl
Installing collected packages: google.cloud.storage
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_storage-1.42.1-py3.9-nspkg.pth'
Consider using the `--user` option or check the 

In [2]:
# Run this command block to install the BigQuery extension.
%load_ext google.cloud.bigquery

# Problem: I want to find RNA-Seq data from estrogen receptor-positive breast cancer cell lines.
A search like this requires looking into the attributes stored in the database for each record. If you are an experienced user of SRA you might already know how to do a search like this with Entrez and Run Selector. Here we will find the results with a single SQL query that relies primarily on the attributes column. The same methods can be applied to build very targeted searches of submitter-supplied information.

## What Does an Array look like in BigQuery?
Here you will look at a couple of columns from the database to show examples of arrays. You will select the accession column and the columns that are either simple arrays or are an array of data structures. 

In [3]:
%%bigquery

SELECT acc, datastore_filetype, datastore_provider, datastore_region, attributes
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1571.49query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  6.98rows/s]


Unnamed: 0,acc,datastore_filetype,datastore_provider,datastore_region,attributes
0,DRR209136,[sra],"[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '830289504'}, {'k': 'byte..."
1,DRR209149,[sra],"[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '954170736'}, {'k': 'byte..."
2,SRR364422,"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'geo_accession_exp', 'v': 'GSM833111'},..."
3,SRR006198,"[sra, srf]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bi_gssr_sample_id_exp', 'v': '24519.0'..."
4,SRR003485,"[sra, srf]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bi_gssr_sample_id_exp', 'v': '24516.0'..."


## Viewing what is in the Attributes Column of an SRA Record
To understand what is in the attributes column in the database it may help to view the contents as a table. In the output above, every column other than **acc** is an array \[ \] with a comma-separated list of items inside. The **attributes** column is an array of structs { }, each of which has a key (k) and a value (v). You will use the UNNEST function in BigQuery to expand the contents of the attributes column into one item per row.
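
To make the idea of unnesting concrete, here is a minimal Python sketch of what UNNEST does when it is joined to its parent table: each row is paired with every element of its array. The rows, the accession names `RUN_A` and `RUN_B`, and the helper `unnest` are made up for illustration and are not part of BigQuery or of the SRA data.

```python
# Toy stand-in for two rows of the metadata table (illustrative values only).
rows = [
    {"acc": "RUN_A", "attributes": [{"k": "bases", "v": "100"},
                                    {"k": "tissue_sam", "v": "colon"}]},
    {"acc": "RUN_B", "attributes": [{"k": "bases", "v": "250"}]},
]


def unnest(rows, array_column):
    """Yield one output row per array element, paired with its parent row."""
    for row in rows:
        for element in row[array_column]:
            yield {"acc": row["acc"], "k": element["k"], "v": element["v"]}


flattened = list(unnest(rows, "attributes"))
for item in flattened:
    print(item)
```

Two input rows become three output rows here, one per attribute. A row whose array is empty would contribute no output rows at all, which is also how a comma join with UNNEST behaves.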

## Listing the Contents of an Array
If you unnest the attributes column, you get the list of key-value pairs. The column holds some standard items like "bytes" and "bases" as well as items that the submitter provided as additional information. This means that runs in the database will have some attributes in common but also some attributes that other runs lack. Keys can also be repeated, so one key might have multiple values in a single run.

### Naming Unnested Data
When you unnest the data you can give the result of the UNNEST function a name (unnested_attributes in the query below) and then use that name in the SELECT statement. As with the table aliases you used before, this can make queries a little easier to type and read.

In [2]:
%%bigquery

SELECT unnested_attributes
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as unnested_attributes 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 584.49query/s]                          
Downloading: 100%|██████████| 18/18 [00:00<00:00, 27.48rows/s]


Unnamed: 0,unnested_attributes
0,"{'k': 'sex_calc', 'v': 'female'}"
1,"{'k': 'bases', 'v': '12470223358'}"
2,"{'k': 'bytes', 'v': '6566142480'}"
3,"{'k': 'age_sam', 'v': 'NA'}"
4,"{'k': 'biomaterial_provider_sam', 'v': 'Robert..."
5,"{'k': 'cell_line_sam_ss_dpl110', 'v': 'HCA-7'}"
6,"{'k': 'isolate_sam', 'v': 'cell lines'}"
7,"{'k': 'tissue_sam', 'v': 'colon'}"
8,"{'k': 'primary_search', 'v': '304768'}"
9,"{'k': 'primary_search', 'v': '4309137'}"


## Listing the Keys and Values in the Attributes as a Table
You can further break a struct like {'k': 'sex_calc', 'v': 'female'} apart into (k) and (v) columns by using unnested_attributes.k and unnested_attributes.v in the query. The result shows all the keys and values in a table format for a single accession in the metadata table.

In [3]:
%%bigquery

SELECT unnested_attributes.k, unnested_attributes.v
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as unnested_attributes 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 791.68query/s]                          
Downloading: 100%|██████████| 18/18 [00:00<00:00, 26.09rows/s]


Unnamed: 0,k,v
0,sex_calc,female
1,bases,12470223358
2,bytes,6566142480
3,age_sam,
4,biomaterial_provider_sam,"Robert Coffey, Vanderbilt University"
5,cell_line_sam_ss_dpl110,HCA-7
6,isolate_sam,cell lines
7,tissue_sam,colon
8,primary_search,304768
9,primary_search,4309137
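
Because a key such as primary_search can appear more than once, a plain dictionary keyed on k would silently drop values. The following is a minimal sketch, assuming you have already pulled the (k, v) pairs into Python, that groups repeated keys into lists. The helper name `group_by_key` is hypothetical; the pairs are copied from the output above.

```python
from collections import defaultdict

# A few (k, v) pairs from the output above; primary_search appears twice.
pairs = [
    ("sex_calc", "female"),
    ("tissue_sam", "colon"),
    ("primary_search", "304768"),
    ("primary_search", "4309137"),
]


def group_by_key(pairs):
    """Collect every value seen for each key, preserving the input order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


attributes_by_key = group_by_key(pairs)
print(attributes_by_key["primary_search"])  # ['304768', '4309137']
```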


## Finding Runs with a Certain Attribute
Listing out all the attribute keys and values can be quite useful to help build searches for very specific things. Looking at the results for the above query, one of the attributes is tissue_sam : colon. If you want to find all runs where that exists you can use the following query.

In [47]:
%%bigquery

SELECT acc
FROM `nih-sra-datastore.sra.metadata`   
WHERE ('tissue_sam', 'colon') in UNNEST(attributes)

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 946.58query/s]                          
Downloading: 100%|██████████| 15550/15550 [00:01<00:00, 13305.05rows/s]


Unnamed: 0,acc
0,SRR14684549
1,SRR15652167
2,SRR7762696
3,SRR7762697
4,SRR7762698
...,...
15545,SRR13345069
15546,SRR13872307
15547,SRR15347514
15548,SRR11631045
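
The `IN UNNEST(attributes)` test above is an exact match on the whole key-value pair. As a rough Python analogy (not how BigQuery evaluates it internally), it behaves like the hypothetical helper `has_attribute` below; the records and accession names are invented for illustration.

```python
def has_attribute(attributes, key, value):
    """True when the exact (key, value) pair is present in the array."""
    return any(a["k"] == key and a["v"] == value for a in attributes)


# Invented records: only RUN_A carries the exact pair ('tissue_sam', 'colon').
records = {
    "RUN_A": [{"k": "tissue_sam", "v": "colon"}, {"k": "bases", "v": "100"}],
    "RUN_B": [{"k": "tissue_sam", "v": "Colon"}],   # different capitalization
    "RUN_C": [{"k": "tissue", "v": "colon"}],       # different key
}

matches = [acc for acc, attrs in records.items()
           if has_attribute(attrs, "tissue_sam", "colon")]
print(matches)  # ['RUN_A']
```

Because the comparison is exact, submitter values such as "Colon" or "colon tissue" are not returned, which is one reason the next section switches to pattern matching.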


## Search for Breast Cancer Cell Line Records
Now you can use the UNNEST function to search for the data in the initial problem. The LIKE operator with the % wildcard lets you search for values that contain certain words or phrases with anything around them. The query below finds all records in which at least one attribute value contains "breast", "cancer", "cell", and "line", in that order. Note that LIKE is case-sensitive.

In [5]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE extracted.v LIKE '%breast%cancer%cell%line%'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1042.84query/s]                         
Downloading: 100%|██████████| 1894/1894 [00:01<00:00, 1683.96rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject
0,SRR15443831,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA754393
1,SRR8454880,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA515879
2,SRR2353162,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA295409
3,SRR3290939,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA316332
4,SRR3290972,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA316332
...,...,...,...,...,...,...
1889,SRR13036294,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA676130
1890,SRR12060822,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA640820
1891,SRR12060836,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA640820
1892,SRR6822820,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA437670
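
To see which values a pattern such as '%breast%cancer%cell%line%' accepts, the sketch below approximates LIKE in Python by translating the pattern into a regular expression. The helper `sql_like` is hypothetical and simplified: it handles only the % and _ wildcards and ignores escape sequences.

```python
import re


def sql_like(value, pattern):
    """Approximate SQL LIKE: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


pattern = "%breast%cancer%cell%line%"
print(sql_like("Human breast cancer cell line", pattern))       # True
print(sql_like("Pool of 5 breast cancer cell lines", pattern))  # True
print(sql_like("cell line from breast cancer", pattern))        # False: wrong order
print(sql_like("Breast Cancer Cell Line", pattern))             # False: case differs
```

The last two cases show what the pattern misses. If you want to catch differently capitalized values, one option is to compare against LOWER(extracted.v) in the query.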


### Subselect
A subselect is a select statement embedded within another select statement. In this case you will use a subselect to get the value of attributes using the attribute key name. This example uses 'source_name_sam' to see values that are on *some* records and 'bases' which will be on all records.

You can also use a subselect to find the key of the attribute that contained the breast cancer cell line entry. Because more than one attribute in a single record might contain that phrase, you can use the LIMIT clause to return just one hit. You could instead build an array from all the matching keys, but that is more involved and not necessary in this example.

In [13]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'source_name_sam') AS source_name,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'bases') AS bases,
    (SELECT k FROM UNNEST(attributes) WHERE v LIKE '%breast%cancer%cell%line%' LIMIT 1) AS breast_cancer_attribute,
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE extracted.v LIKE '%breast%cancer%cell%line%'

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1462.70query/s]                        
Downloading: 100%|██████████| 10571/10571 [00:01<00:00, 5628.02rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject,source_name,bases,breast_cancer_attribute
0,SRR015448,OTHER,public,GENOMIC,RT-PCR,,,112586635,primary_search
1,SRR5768015,OTHER,public,GENOMIC,other,PRJNA392337,MDA-MB-231,481000108,cell_type_sam
2,SRR5768014,OTHER,public,GENOMIC,other,PRJNA392337,MDA-MB-231,461470848,cell_type_sam
3,SRR12410836,OTHER,public,GENOMIC,other,PRJNA655843,Pool of 5 breast cancer cell lines,38765680,source_name_sam
4,SRR12409553,OTHER,public,GENOMIC,other,PRJNA655843,Pool of 5 breast cancer cell lines,50943462,source_name_sam
...,...,...,...,...,...,...,...,...,...
10566,SRR6464399,ChIP-Seq,public,GENOMIC,ChIP,PRJNA429642,ERa E2 ChIP-seq,2327280400,cell_type_sam
10567,SRR10852376,ChIP-Seq,public,GENOMIC,ChIP,PRJNA599998,DCIScom-NS,527684948,cell_type_sam
10568,SRR10961905,ChIP-Seq,public,GENOMIC,ChIP,PRJNA603080,BT-474 human breast cancer cell line treated w...,1047737700,source_name_sam
10569,SRR1021790,ChIP-Seq,public,GENOMIC,ChIP,PRJNA147213,Human breast cancer cell line,851894748,source_name_sam
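
The subselects above act as per-record lookups. The following Python sketch mirrors that behavior under two assumptions worth stating: a scalar subquery yields NULL when nothing matches and raises an error when more than one row comes back, and LIMIT 1 without an ORDER BY may return any one of the matches (the sketch simply takes the first). The helper names `scalar_lookup` and `first_key_where` are hypothetical.

```python
def scalar_lookup(attributes, key):
    """Value for `key`; None if absent; error if the key is repeated."""
    values = [a["v"] for a in attributes if a["k"] == key]
    if len(values) > 1:
        raise ValueError(f"more than one value for key {key!r}")
    return values[0] if values else None


def first_key_where(attributes, predicate):
    """Key of the first attribute whose value satisfies `predicate`."""
    return next((a["k"] for a in attributes if predicate(a["v"])), None)


# Attributes modeled on the last row of the output above.
attrs = [
    {"k": "bases", "v": "851894748"},
    {"k": "source_name_sam", "v": "Human breast cancer cell line"},
]

print(scalar_lookup(attrs, "source_name_sam"))  # Human breast cancer cell line
print(scalar_lookup(attrs, "cell_type_sam"))    # None
print(first_key_where(attrs, lambda v: "breast cancer cell line" in v))  # source_name_sam
```

The None result corresponds to the empty source_name cells in the table, which occur for records that have no source_name_sam attribute.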


## Limiting the Search Further
Now you will add a condition on the estrogen receptor (ER) status of the cells and limit the results to paired RNA-Seq datasets that use cDNA selection and are publicly accessible. Again you will query the keys to see which attribute contained the information about the cells being ER-positive.

In [6]:
%%bigquery

SELECT acc, assay_type, consent, librarysource, libraryselection, bioproject,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'source_name_sam') AS source_name,
    (SELECT v FROM UNNEST(attributes) WHERE k = 'bases') AS bases,
    (SELECT k FROM UNNEST(attributes) WHERE v LIKE '%ER-positive%' LIMIT 1) AS ER_column
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'
    AND extracted.v LIKE '%ER-positive%'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 639.47query/s]                          
Downloading: 100%|██████████| 6/6 [00:00<00:00,  7.65rows/s]


Unnamed: 0,acc,assay_type,consent,librarysource,libraryselection,bioproject,source_name,bases,ER_column
0,SRR8495050,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7551937500,cell_type_sam
1,SRR8495052,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,7745716800,cell_type_sam
2,SRR8495049,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7359307500,cell_type_sam
3,SRR8495053,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,7849955700,cell_type_sam
4,SRR8495054,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D-TR,6637409100,cell_type_sam
5,SRR8495051,RNA-Seq,public,TRANSCRIPTOMIC,cDNA,PRJNA517424,T47D,7402524600,cell_type_sam
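
Both LIKE conditions in the query above refer to the same extracted.v, so a record matches only when a single attribute value contains both phrases. The sketch below contrasts that with matching the phrases in any attribute of the record. It uses plain substring tests as a simplified stand-in for the LIKE patterns, and the helper names and sample values are invented for illustration.

```python
def same_attribute(values, *needles):
    """True when one single value contains every needle (the query above)."""
    return any(all(n in v for n in needles) for v in values)


def any_attribute(values, *needles):
    """True when each needle appears in some value, not necessarily the same one."""
    return all(any(n in v for v in values) for n in needles)


needles = ("breast cancer cell line", "ER-positive")
combined = ["ER-positive breast cancer cell line"]
split = ["breast cancer cell line", "ER-positive"]

print(same_attribute(combined, *needles))  # True
print(same_attribute(split, *needles))     # False
print(any_attribute(split, *needles))      # True
```

If you want the broader behavior in SQL, one possible approach is to replace the join with two separate `EXISTS (SELECT 1 FROM UNNEST(attributes) WHERE v LIKE ...)` conditions, one per phrase.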


## Storing Accessions in a Variable
In the last few examples you will save a list of accessions to a file from within Jupyter. This is a frequent need when working with data that is uniquely identified by accessions or UUIDs. You can store the results of a query in a variable by passing the variable name (here "results") as an argument to the %%bigquery cell magic. This is specific to the BigQuery Python module you are using in Jupyter and may not apply to how you normally run queries.

In [7]:
%%bigquery results

SELECT acc
FROM `nih-sra-datastore.sra.metadata` meta,
    UNNEST(attributes) as extracted
WHERE assay_type = 'RNA-Seq'
    AND consent = 'public'
    AND libraryselection = 'cDNA'
    AND librarylayout = 'PAIRED'
    AND extracted.v LIKE '%breast%cancer%cell%line%'
    AND extracted.v LIKE '%ER-positive%'


Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 801.66query/s]                          
Downloading: 100%|██████████| 6/6 [00:00<00:00,  8.05rows/s]


## Viewing the Contents of the Variable
You can output the contents of a variable by running a cell with the variable name in it.

In [8]:
results

Unnamed: 0,acc
0,SRR8495050
1,SRR8495052
2,SRR8495049
3,SRR8495054
4,SRR8495053
5,SRR8495051


## Storing the Variable Contents as an Accession List
The results variable contains a dataframe with an index (0-5) and a header (acc) that you don't want in the accession list. You can remove both by calling the dataframe's to_string method with index=False and header=False, and then write the resulting string to a file that contains just the list of accessions. If you instead want to save the results of a metadata query with its header information, omit the header=False option.

In [10]:
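# Write the accessions without the dataframe index or header.
# The with block closes the file automatically, even if the write fails.
with open("accessions.txt", "w") as file:
    file.write(results.to_string(index=False, header=False))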
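
Since to_string is designed for display, it may pad values with spaces, and a join on UNNEST can return the same accession more than once when several attributes match. As an alternative, here is a minimal sketch of a helper that writes one unique accession per line. The function name `write_accession_list` is hypothetical, and the demonstration uses a plain list and a temporary directory so it runs without a BigQuery connection.

```python
import os
import tempfile


def write_accession_list(accessions, path):
    """Write unique accessions, one per line, keeping first-seen order."""
    unique = list(dict.fromkeys(str(acc).strip() for acc in accessions))
    with open(path, "w") as handle:
        handle.write("\n".join(unique) + "\n")
    return unique


# Demonstration with a plain list that contains a duplicate.
demo_path = os.path.join(tempfile.mkdtemp(), "accessions.txt")
written = write_accession_list(
    ["SRR8495050", "SRR8495052", "SRR8495050"], demo_path
)
print(written)  # ['SRR8495050', 'SRR8495052']
```

In the notebook, you could call it as `write_accession_list(results["acc"], "accessions.txt")`, because iterating over a dataframe column yields its values.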