# Getting Started with Arrays in BigQuery
Arrays are worth understanding, at least at a basic level, when working with data in BigQuery. An array column stores a list of values in a single row, and in this table some arrays hold key-value pairs that carry additional information about each record. Those key-value pairs can be searched and extracted to find exactly what we are looking for.

First we need to install and prepare the environment again.

In [2]:
!pip install google.cloud.bigquery
!pip install google.cloud.storage
!pip3 install --upgrade google-cloud-bigquery

!pip install google --user
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' --user

%load_ext google.cloud.bigquery

Collecting google.cloud.bigquery
  Using cached https://files.pythonhosted.org/packages/fd/2a/d53b342d20e4b95ade480fa04977969166bb189bf0e909501e4fff86cb3f/google_cloud_bigquery-2.25.1-py2.py3-none-any.whl
Installing collected packages: google.cloud.bigquery
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_bigquery-2.25.1-py3.9-nspkg.pth'
Consider using the `--user` option or check the permissions.
Collecting google.cloud.storage
  Using cached https://files.pythonhosted.org/packages/0e/d6/5878d73105fd242dafb42bbea26629372d397f06cb402e90302a4824c2c2/google_cloud_storage-1.42.0-py2.py3-none-any.whl
Installing collected packages: google.cloud.storage
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_storage-1.42.0-py3.9-nspkg.pth'
Consider using the `--user` option or check the 

Now we can look at a couple of columns from our earlier searches to show examples of arrays.

In [3]:
%%bigquery
SELECT acc, datastore_filetype, datastore_provider, datastore_region, attributes
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1410.56query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  7.33rows/s]


Unnamed: 0,acc,datastore_filetype,datastore_provider,datastore_region,attributes
0,DRR245014,[sra],"[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '1552245200'}, {'k': 'byt..."
1,ERR2535945,"[sra, fastq]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '970498698'}, {'k': 'byte..."
2,SRR15402535,"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'sex_calc', 'v': 'not applicable'}, {'k..."
3,SRR15611200,"[bam, sra]",[ncbi],[ncbi.public],"[{'k': 'sex_calc', 'v': 'female'}, {'k': 'base..."
4,SRR15584106,"[bam, sra]",[ncbi],[ncbi.public],"[{'k': 'sex_calc', 'v': 'female'}, {'k': 'base..."


All of the columns other than **acc** are arrays, shown in square brackets \[\] with a comma-separated list of items inside. The **attributes** column is an array of structs, shown in braces {}, each of which has a key (k) and a value (v). We can use the UNNEST operator to expand the contents of an array into one row per item.

In [7]:
%%bigquery
SELECT data_host
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(datastore_provider) as data_host
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 997.69query/s]                          
Downloading: 100%|██████████| 3/3 [00:00<00:00,  4.79rows/s]


Unnamed: 0,data_host
0,gs
1,ncbi
2,s3
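
To make the row expansion concrete, here is a minimal pure-Python sketch of what the comma join with UNNEST does to a single record. The `unnest` helper and the `HYPOTHETICAL_RUN` record are illustrative assumptions, not part of BigQuery or the SRA table; only the `SRR2973262` values come from the output above.

```python
def unnest(rows, array_column, alias):
    """Yield one output row per array element, like `FROM t, UNNEST(col) AS alias`."""
    for row in rows:
        for item in row[array_column]:
            flattened = dict(row)      # copy the parent row's columns
            flattened[alias] = item    # add the single array element
            yield flattened

rows = [
    {"acc": "SRR2973262", "datastore_provider": ["gs", "ncbi", "s3"]},
    # Hypothetical record with an empty array: it contributes no output rows.
    {"acc": "HYPOTHETICAL_RUN", "datastore_provider": []},
]

flat = list(unnest(rows, "datastore_provider", "data_host"))
data_hosts = [r["data_host"] for r in flat]
print(data_hosts)
```

As with the comma join in the query, a record whose array is empty disappears from the result, which is worth keeping in mind when counting rows after an UNNEST.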


If we unnest an array of structs, each struct in the array becomes its own row.

In [23]:
%%bigquery
SELECT extracted
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as extracted 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1824.40query/s]
Downloading: 100%|██████████| 18/18 [00:01<00:00, 17.89rows/s]


Unnamed: 0,extracted
0,"{'k': 'sex_calc', 'v': 'female'}"
1,"{'k': 'bases', 'v': '12470223358'}"
2,"{'k': 'bytes', 'v': '6566142480'}"
3,"{'k': 'age_sam', 'v': 'NA'}"
4,"{'k': 'biomaterial_provider_sam', 'v': 'Robert..."
5,"{'k': 'cell_line_sam_ss_dpl110', 'v': 'HCA-7'}"
6,"{'k': 'isolate_sam', 'v': 'cell lines'}"
7,"{'k': 'tissue_sam', 'v': 'colon'}"
8,"{'k': 'primary_search', 'v': '304768'}"
9,"{'k': 'primary_search', 'v': '4309137'}"


We can further break a struct like {'k': 'sex_calc', 'v': 'female'} apart into k and v columns by using extracted.k and extracted.v in our query.

In [26]:
%%bigquery
SELECT extracted.k, extracted.v
FROM `nih-sra-datastore.sra.metadata`,
    UNNEST(attributes) as extracted 
WHERE acc = 'SRR2973262'

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1142.24query/s]                         
Downloading: 100%|██████████| 18/18 [00:00<00:00, 18.38rows/s]


Unnamed: 0,k,v
0,sex_calc,female
1,bases,12470223358
2,bytes,6566142480
3,age_sam,
4,biomaterial_provider_sam,"Robert Coffey, Vanderbilt University"
5,cell_line_sam_ss_dpl110,HCA-7
6,isolate_sam,cell lines
7,tissue_sam,colon
8,primary_search,304768
9,primary_search,4309137
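
Once the k and v columns have been pulled into Python, it can be handy to turn them into a lookup table. The sketch below is one possible approach, assuming the rows arrive as a list of dicts; the `group_attributes` helper is a hypothetical name. It collects values in lists because a key can repeat, as `primary_search` does in the output above.

```python
from collections import defaultdict

# A few of the key-value structs from the output above.
extracted = [
    {"k": "sex_calc", "v": "female"},
    {"k": "tissue_sam", "v": "colon"},
    {"k": "primary_search", "v": "304768"},
    {"k": "primary_search", "v": "4309137"},
]

def group_attributes(structs):
    """Map each key to the list of all values recorded under it."""
    grouped = defaultdict(list)
    for struct in structs:
        grouped[struct["k"]].append(struct["v"])
    return dict(grouped)

attrs = group_attributes(extracted)
print(attrs["tissue_sam"])
print(attrs["primary_search"])
```

A plain `dict` comprehension would silently keep only the last `primary_search` value, which is why the sketch groups values instead.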


Listing all the attribute keys and values is useful because it helps you build searches for very specific things. In the results above, we can see that tissue_sam: colon is one of the structs in the attributes. To find all runs where that struct exists, we can use the following query.

In [47]:
%%bigquery
SELECT acc
FROM `nih-sra-datastore.sra.metadata`   
WHERE ('tissue_sam', 'colon') in UNNEST(attributes)

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 946.58query/s]                          
Downloading: 100%|██████████| 15550/15550 [00:01<00:00, 13305.05rows/s]


Unnamed: 0,acc
0,SRR14684549
1,SRR15652167
2,SRR7762696
3,SRR7762697
4,SRR7762698
...,...
15545,SRR13345069
15546,SRR13872307
15547,SRR15347514
15548,SRR11631045
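
The filter in this query compares one (key, value) pair against every struct in the array and keeps the run if any of them matches. A rough Python equivalent is sketched below; `has_attribute` is a hypothetical helper and `HYPOTHETICAL_RUN` is made-up data included only to show a non-match.

```python
def has_attribute(attributes, key, value):
    """Return True if any struct in the array has exactly this key and value."""
    return any(a["k"] == key and a["v"] == value for a in attributes)

runs = {
    # Two of the attributes listed earlier for this run.
    "SRR2973262": [
        {"k": "isolate_sam", "v": "cell lines"},
        {"k": "tissue_sam", "v": "colon"},
    ],
    # Hypothetical run whose tissue does not match.
    "HYPOTHETICAL_RUN": [
        {"k": "tissue_sam", "v": "liver"},
    ],
}

matching = [
    acc for acc, attributes in runs.items()
    if has_attribute(attributes, "tissue_sam", "colon")
]
print(matching)
```

Note that the comparison is exact: a value such as 'Colon' or 'colon tissue' would not match, in the sketch or in the query.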


Now let's combine the table-joining ideas we used before with what we have learned about arrays to find some location data for coronavirus samples.

In [54]:
%%bigquery
SELECT m.bioproject, m.biosample, m.acc, m.collection_date_sam, m.geo_loc_name_sam,
    (SELECT v FROM UNNEST(m.attributes) WHERE k = 'collected_by_sam') AS collected_by,
    (SELECT v FROM UNNEST(m.attributes) WHERE k = 'host_sam') AS host
FROM `nih-sra-datastore.sra.metadata` AS m
JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` AS tax
    ON m.acc = tax.acc
WHERE tax.name = 'Coronaviridae'
LIMIT 15

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 2971.00query/s]                        
Downloading: 100%|██████████| 15/15 [00:00<00:00, 19.74rows/s]


Unnamed: 0,bioproject,biosample,acc,collection_date_sam,geo_loc_name_sam,collected_by,host
0,PRJNA669553,SAMN16967948,SRR13173206,,[Germany: Riems],"Friedrich-Loeffler-Institut for Animal Health,...",Homo sapiens
1,PRJNA669553,SAMN16968083,SRR13173228,,[Germany: Riems],"Friedrich-Loeffler-Institut for Animal Health,...",Homo sapiens
2,PRJNA669553,SAMN16967989,SRR13173180,,[Germany: Riems],"Friedrich-Loeffler-Institut for Animal Health,...",Homo sapiens
3,PRJNA707404,SAMN18204574,SRR14018669,,[Australia: Brisbane],,
4,PRJEB40277,SAMEA8654660,ERR5853664,,[],,
5,PRJEB40277,SAMEA8654673,ERR5853676,,[],,
6,PRJEB40277,SAMEA8654663,ERR5853667,,[],,
7,PRJEB40277,SAMEA8953121,ERR6178323,2021-05-27,[],,
8,PRJEB40277,SAMEA8924292,ERR6096606,2021-05-27,[],,
9,PRJEB40277,SAMEA8915890,ERR6055923,2021-05-25,[],,
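
The two parenthesized subqueries in this query each pull a single value out of the attributes array by key. The Python sketch below mirrors that behavior as I understand it: no match gives an empty result (NULL in BigQuery, `None` here), and more than one match is treated as an error, since a scalar subquery is expected to produce at most one row. The `attribute_value` helper is hypothetical, and the attribute list mixes values seen in different outputs above purely for illustration.

```python
def attribute_value(attributes, key):
    """Return the single value stored under `key`, or None if the key is absent."""
    matches = [a["v"] for a in attributes if a["k"] == key]
    if not matches:
        return None  # corresponds to the empty cells in the result table
    if len(matches) > 1:
        # A scalar subquery is expected to yield at most one row.
        raise ValueError(f"more than one value found for key {key!r}")
    return matches[0]

# Illustrative attributes, not the real array of any single run.
attributes = [
    {"k": "host_sam", "v": "Homo sapiens"},
    {"k": "primary_search", "v": "304768"},
    {"k": "primary_search", "v": "4309137"},
]

host = attribute_value(attributes, "host_sam")
collected_by = attribute_value(attributes, "collected_by_sam")
print(host, collected_by)
```

This is why the pattern suits keys such as `host_sam` that appear at most once per run; for a repeating key such as `primary_search`, unnesting into rows as in the earlier queries is the safer choice.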
