# Getting Started with BigQuery SQL Searches
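First we need to install some packages and load the BigQuery extension, which provides the %%bigquery cell magic used in the rest of this notebook. On a shared installation, pip may not be able to write to the system site-packages directory (see the Permission denied errors in the output below); the installs that pass the --user option avoid this.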

In [3]:
!pip install google.cloud.bigquery
!pip install google.cloud.storage
!pip3 install --upgrade google-cloud-bigquery

!pip install google --user
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' --user

%load_ext google.cloud.bigquery

Collecting google.cloud.bigquery
  Using cached https://files.pythonhosted.org/packages/fd/2a/d53b342d20e4b95ade480fa04977969166bb189bf0e909501e4fff86cb3f/google_cloud_bigquery-2.25.1-py2.py3-none-any.whl
Installing collected packages: google.cloud.bigquery
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_bigquery-2.25.1-py3.9-nspkg.pth'
Consider using the `--user` option or check the permissions.
Collecting google.cloud.storage
  Using cached https://files.pythonhosted.org/packages/0e/d6/5878d73105fd242dafb42bbea26629372d397f06cb402e90302a4824c2c2/google_cloud_storage-1.42.0-py2.py3-none-any.whl
Installing collected packages: google.cloud.storage
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_storage-1.42.0-py3.9-nspkg.pth'
Consider using the `--user` option or check the 

## Basic Query
Here is a simple query that returns 5 runs submitted for the organism 'Homo sapiens' from the nih-sra-datastore project. SELECT * returns every column of the metadata table from the sra database in the project. The WHERE clause keeps only the rows where the organism name is 'Homo sapiens', and the LIMIT clause caps the output at 5 rows. Because the query has no ORDER BY, which 5 matching rows come back is not guaranteed. One limitation of our view in Jupyter is that only 20 columns are displayed. The summary at the bottom of the results reads 5 rows × 36 columns, and the 16 hidden columns are replaced by an ellipsis (...) column in the middle of the output table.

In [4]:
%%bigquery
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1737.49query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  7.33rows/s]


Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,geo_loc_name_country_continent_calc,geo_loc_name_sam,ena_first_public_run,ena_last_update_run,sample_name_sam,datastore_filetype,datastore_provider,datastore_region,attributes,jattr
0,SRR15278393,RNA-Seq,EMORY UNIVERSITY,public,SRX11583084,2815176,Illumina HiSeq 1000,PAIRED,PolyA,TRANSCRIPTOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '9518138700'}, {'k': 'byt...","{""bases"": 9518138700, ""bytes"": 3239993193, ""ag..."
1,SRR15278347,RNA-Seq,EMORY UNIVERSITY,public,SRX11583130,919115,Illumina HiSeq 1000,PAIRED,PolyA,TRANSCRIPTOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '9340662300'}, {'k': 'byt...","{""bases"": 9340662300, ""bytes"": 2922928789, ""ag..."
2,SRR15065199,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375204,GSM5412973,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '188792856'}, {'k': 'byte...","{""bases"": 188792856, ""bytes"": 85176148, ""age_s..."
3,SRR15065057,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375346,GSM5412928,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '409589460'}, {'k': 'byte...","{""bases"": 409589460, ""bytes"": 186427640, ""age_..."
4,SRR15065033,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375370,GSM5412909,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '245428848'}, {'k': 'byte...","{""bases"": 245428848, ""bytes"": 103593882, ""age_..."


## Searching Other Tables
The second database in the nih-sra-datastore project, sra_tax_analysis_tool, contains results from the SRA Taxonomy Analysis Tool (STAT) as well as some data from the Taxonomy database. STAT results identify organisms by their Taxonomy Database Identifier (tax ID). You can find the tax_id for any entry in the taxonomy table using a search like the one below.

In [22]:
%%bigquery
SELECT * 
FROM `nih-sra-datastore.sra_tax_analysis_tool.taxonomy`
WHERE sci_name = 'Homo sapiens'


Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 651.90query/s]                          
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.51rows/s]


Unnamed: 0,tax_id,parent_id,rank,sci_name,names,ilevel,ileft,iright
0,9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231


## Finding How the Database is Organized and Listing Columns
We can use the query below to list the columns in the database schema. This is useful for seeing the names of all the columns in a table and the type of data each one holds. The data type matters when writing searches: string values must be quoted (single quotes throughout this notebook), while integers are written without quotes.

In [23]:
%%bigquery
SELECT * FROM `nih-sra-datastore.sra`.INFORMATION_SCHEMA.COLUMNS


Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 971.35query/s]                          
Downloading: 100%|██████████| 36/36 [00:00<00:00, 44.26rows/s]


Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type,is_generated,generation_expression,is_stored,is_hidden,is_updatable,is_system_defined,is_partitioning_column,clustering_ordinal_position
0,nih-sra-datastore,sra,metadata,acc,1,YES,STRING,NEVER,,,NO,,NO,NO,
1,nih-sra-datastore,sra,metadata,assay_type,2,YES,STRING,NEVER,,,NO,,NO,NO,
2,nih-sra-datastore,sra,metadata,center_name,3,YES,STRING,NEVER,,,NO,,NO,NO,
3,nih-sra-datastore,sra,metadata,consent,4,YES,STRING,NEVER,,,NO,,NO,NO,
4,nih-sra-datastore,sra,metadata,experiment,5,YES,STRING,NEVER,,,NO,,NO,NO,
5,nih-sra-datastore,sra,metadata,sample_name,6,YES,STRING,NEVER,,,NO,,NO,NO,
6,nih-sra-datastore,sra,metadata,instrument,7,YES,STRING,NEVER,,,NO,,NO,NO,
7,nih-sra-datastore,sra,metadata,librarylayout,8,YES,STRING,NEVER,,,NO,,NO,NO,
8,nih-sra-datastore,sra,metadata,libraryselection,9,YES,STRING,NEVER,,,NO,,NO,NO,
9,nih-sra-datastore,sra,metadata,librarysource,10,YES,STRING,NEVER,,,NO,,NO,NO,
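
The quoting rule can be tried offline. The sketch below is not BigQuery: it assumes Python's built-in sqlite3 module and a made-up two-row taxonomy table, with SQLite's PRAGMA table_info standing in for INFORMATION_SCHEMA.COLUMNS. It lists each column with its declared type, then filters once on a string column (quoted value) and once on an integer column (unquoted value).

```python
import sqlite3

# Toy in-memory table; NOT the real nih-sra-datastore taxonomy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (tax_id INTEGER, sci_name TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?)",
    [(9606, "Homo sapiens"), (1639, "Listeria monocytogenes")],
)

# SQLite stand-in for INFORMATION_SCHEMA.COLUMNS.
# Each row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(taxonomy)")]
print(columns)

# String column: the value is written in single quotes.
by_name = conn.execute(
    "SELECT tax_id FROM taxonomy WHERE sci_name = 'Homo sapiens'"
).fetchall()

# Integer column: the value is written without quotes.
by_id = conn.execute(
    "SELECT sci_name FROM taxonomy WHERE tax_id = 9606"
).fetchall()

print(by_name, by_id)
```

Checking the data type first tells you which of the two forms a filter on a given column needs.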


## Joining Two (or more) Tables
A JOIN combines two different tables so you can find information that is spread across them. This lets the database store data efficiently while still letting a user view the information in the way they find easiest to understand. Below we'll combine the two queries from before. When specifying multiple tables to search, we can also give each table an abbreviated name, or alias ('meta' and 'tax' in the query below), to make the rest of the query easier to type and read.
When joining two or more tables, we need to tell the database which columns in the two tables are expected to hold matching content. This linking field is what we are joining "on" in the statement. In this case we use 'organism' from the metadata table and 'sci_name' from the taxonomy table. The columns used to join two tables often have the same name, but not always, and this query is one example where they differ.

In [25]:
%%bigquery
SELECT *
FROM `nih-sra-datastore.sra.metadata` meta 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.taxonomy` tax
    ON meta.organism = tax.sci_name
WHERE tax.tax_id = 9606
    AND meta.organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 3118.44query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  6.81rows/s]


Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,attributes,jattr,tax_id,parent_id,rank,sci_name,names,ilevel,ileft,iright
0,ERR2511560,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530975,SAMEA1116178,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '171494517'}, {'k': 'byte...","{""bases"": 171494517, ""bytes"": 140015172, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231
1,ERR2511826,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2531241,SAMEA1116444,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '163581768'}, {'k': 'byte...","{""bases"": 163581768, ""bytes"": 132336304, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231
2,ERR2511209,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530624,SAMEA1115827,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '191747089'}, {'k': 'byte...","{""bases"": 191747089, ""bytes"": 160929514, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231
3,ERR2511604,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2531019,SAMEA1116222,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '206745187'}, {'k': 'byte...","{""bases"": 206745187, ""bytes"": 165491895, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231
4,ERR2511369,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530784,SAMEA1115987,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '167396357'}, {'k': 'byte...","{""bases"": 167396357, ""bytes"": 134853975, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212230,212231
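
To see the join mechanics on a small scale, here is a minimal sketch that assumes Python's built-in sqlite3 module instead of BigQuery. The two tables and their run accessions (RUN_A, RUN_B, RUN_C) are invented for illustration; only the column names, the aliases and the shape of the query follow the cell above.

```python
import sqlite3

# Toy in-memory tables; the accessions are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata (acc TEXT, organism TEXT);
CREATE TABLE taxonomy (tax_id INTEGER, sci_name TEXT);
INSERT INTO metadata VALUES
    ('RUN_A', 'Homo sapiens'),
    ('RUN_B', 'Listeria monocytogenes'),
    ('RUN_C', 'Homo sapiens');
INSERT INTO taxonomy VALUES
    (9606, 'Homo sapiens'),
    (1639, 'Listeria monocytogenes');
""")

# 'meta' and 'tax' are aliases; the ON clause links two columns
# that hold the same content under different names.
query = """
SELECT meta.acc, tax.tax_id
FROM metadata meta
    JOIN taxonomy tax
    ON meta.organism = tax.sci_name
WHERE tax.tax_id = 9606
ORDER BY meta.acc
"""
joined = conn.execute(query).fetchall()
print(joined)
```

In this toy data, the filter on tax.tax_id is enough on its own: the ON clause already ties each metadata row to the taxonomy row with the matching name, so only the 'Homo sapiens' runs remain.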


## Selecting Fewer Columns
Using * in the SELECT statement shows all the columns in the table. Often we don't need or want all of them, so we can list only the columns we are interested in in the SELECT statement. The query below also adds two more filters, keeping only public RNA-Seq runs.

In [26]:
%%bigquery
SELECT meta.acc, meta.assay_type, meta.instrument, meta.libraryselection, meta.librarysource, tax.sci_name
FROM `nih-sra-datastore.sra.metadata` meta 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.taxonomy` tax
    ON meta.organism = tax.sci_name
WHERE tax.tax_id = 9606
    AND meta.organism = 'Homo sapiens'
    AND meta.assay_type = 'RNA-Seq'
    AND meta.consent = 'public'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 3520.93query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  6.10rows/s]


Unnamed: 0,acc,assay_type,instrument,libraryselection,librarysource,sci_name
0,SRR12245574,RNA-Seq,HiSeq X Ten,PolyA,TRANSCRIPTOMIC,Homo sapiens
1,SRR15667241,RNA-Seq,Illumina HiSeq 2500,cDNA,TRANSCRIPTOMIC,Homo sapiens
2,SRR15667236,RNA-Seq,Illumina HiSeq 2500,cDNA,TRANSCRIPTOMIC,Homo sapiens
3,SRR14683905,RNA-Seq,Illumina NovaSeq 6000,cDNA,TRANSCRIPTOMIC,Homo sapiens
4,SRR14683915,RNA-Seq,Illumina NovaSeq 6000,cDNA,TRANSCRIPTOMIC,Homo sapiens


## Using Where Clauses to Find Data of Interest
Now let's look at an example for someone who wants to do an alignment. We will combine some ideas from before as well as add some math to the search. If you wanted to use the PGAP package from NCBI to generate an assembly and annotation, you would want data in which a high percentage of the spots (we'll use 90% as our minimum) comes from only one species. We can use the output from STAT to find such data.

We want to make an assembly and annotation for Listeria monocytogenes. 
1. We'll use the taxid (1639) to search the tax_id column in the tax_analysis table. 
2. To know what proportion of the spots have been identified to the taxid we are searching for, we also need to use the total analyzed_spot_count from the tax_analysis_info table for each run. 
3. We will need to join these two tables on the run accession (acc) to run this query.

Finally, we sort by the proportion of the spots that were identified as being from the taxid we are searching for, highest first. The query also keeps only runs with more than 100,000 analyzed spots.

In [29]:
%%bigquery
SELECT a.acc, a.name, info.analyzed_spot_count, a.self_count, a.self_count/info.analyzed_spot_count AS proportion
FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` a 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis_info` info 
    ON a.acc = info.acc
WHERE a.tax_id = 1639
    AND a.self_count/info.analyzed_spot_count > .9
    AND info.analyzed_spot_count > 100000
ORDER BY proportion DESC

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 3391.39query/s]                        
Downloading: 100%|██████████| 16153/16153 [00:01<00:00, 15628.64rows/s]


Unnamed: 0,acc,name,analyzed_spot_count,self_count,proportion
0,ERR1817002,Listeria monocytogenes,608593,594332,0.976567
1,ERR1816986,Listeria monocytogenes,692180,673720,0.973331
2,ERR1816981,Listeria monocytogenes,630954,613122,0.971738
3,ERR1816996,Listeria monocytogenes,593342,575904,0.970611
4,ERR1431120,Listeria monocytogenes,3249346,3153217,0.970416
...,...,...,...,...,...
16148,SRR10665812,Listeria monocytogenes,510003,459009,0.900012
16149,SRR2071980,Listeria monocytogenes,1336913,1203238,0.900012
16150,SRR6038661,Listeria monocytogenes,749339,674410,0.900007
16151,ERR1230393,Listeria monocytogenes,803265,722940,0.900002
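
The proportion filter can be checked by hand on a few rows. The sketch below assumes Python's built-in sqlite3 module and invented counts for five hypothetical runs; it is not the real STAT data. One difference from BigQuery matters here: BigQuery's / operator returns a floating-point value for integer inputs, whereas SQLite truncates integer division, so the sketch multiplies by 1.0 first.

```python
import sqlite3

# Toy in-memory tables; all accessions and counts are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tax_analysis (acc TEXT, tax_id INTEGER, self_count INTEGER);
CREATE TABLE tax_analysis_info (acc TEXT, analyzed_spot_count INTEGER);
INSERT INTO tax_analysis VALUES
    ('RUN_A', 1639, 950000),   -- 95% of spots: kept
    ('RUN_B', 1639, 500000),   -- 50% of spots: below the 0.9 cutoff
    ('RUN_C', 1639, 99000),    -- 99%, but too few analyzed spots
    ('RUN_D', 9606, 990000),   -- different tax_id
    ('RUN_E', 1639, 1840000);  -- 92% of spots: kept
INSERT INTO tax_analysis_info VALUES
    ('RUN_A', 1000000),
    ('RUN_B', 1000000),
    ('RUN_C', 100000),
    ('RUN_D', 1000000),
    ('RUN_E', 2000000);
""")

# SQLite-only caveat: integer / integer truncates toward zero.
truncated = conn.execute("SELECT 9 / 10").fetchone()[0]

query = """
SELECT a.acc, a.self_count * 1.0 / info.analyzed_spot_count AS proportion
FROM tax_analysis a
    JOIN tax_analysis_info info
    ON a.acc = info.acc
WHERE a.tax_id = 1639
    AND a.self_count * 1.0 / info.analyzed_spot_count > 0.9
    AND info.analyzed_spot_count > 100000
ORDER BY proportion DESC
"""
rows = conn.execute(query).fetchall()
for acc, proportion in rows:
    print(acc, round(proportion, 4))
```

Only RUN_A and RUN_E pass all three conditions, and they come back with the higher proportion first.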


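We can also count results per group. The query below counts the runs in which STAT reported Coronaviridae, grouped by country, with the largest count first. Here the two tables are joined by listing them with a comma and matching the run accession (acc) in the WHERE clause.

In [28]: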
%%bigquery
SELECT COUNT(meta.acc) AS count, meta.geo_loc_name_country_calc
FROM `nih-sra-datastore.sra.metadata` meta, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` tax
WHERE meta.acc = tax.acc
    AND tax.name = 'Coronaviridae'
    AND meta.geo_loc_name_country_calc IS NOT NULL
GROUP BY meta.geo_loc_name_country_calc
ORDER BY 1 DESC

Query complete after 0.00s: 100%|██████████| 5/5 [00:00<00:00, 3950.18query/s]                        
Downloading: 100%|██████████| 119/119 [00:00<00:00, 191.00rows/s]


Unnamed: 0,count,geo_loc_name_country_calc
0,366105,USA
1,196962,United Kingdom
2,17804,Ireland
3,15778,Spain
4,15653,Australia
...,...,...
114,1,American Samoa
115,1,Rwanda
116,1,Chad
117,1,Kyrgyzstan
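
The Coronaviridae query joins its tables with a comma and a WHERE condition rather than JOIN ... ON. As a rough illustration that the two spellings give the same grouped counts, the sketch below assumes Python's built-in sqlite3 module and a handful of invented rows. The accessions and country values are hypothetical, and the run_count alias and the second sort key (added so that ties come back in a fixed order) are choices made for this sketch.

```python
import sqlite3

# Toy in-memory tables; NOT the real SRA metadata or STAT results.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata (acc TEXT, geo_loc_name_country_calc TEXT);
CREATE TABLE tax_analysis (acc TEXT, name TEXT);
INSERT INTO metadata VALUES
    ('RUN_A', 'USA'),
    ('RUN_B', 'USA'),
    ('RUN_C', 'Ireland'),
    ('RUN_D', NULL),      -- dropped by the IS NOT NULL filter
    ('RUN_E', 'Spain');   -- dropped: not a Coronaviridae hit
INSERT INTO tax_analysis VALUES
    ('RUN_A', 'Coronaviridae'),
    ('RUN_B', 'Coronaviridae'),
    ('RUN_C', 'Coronaviridae'),
    ('RUN_D', 'Coronaviridae'),
    ('RUN_E', 'Listeria monocytogenes');
""")

# Comma join: the join condition lives in the WHERE clause.
implicit = conn.execute("""
SELECT COUNT(meta.acc) AS run_count, meta.geo_loc_name_country_calc
FROM metadata meta, tax_analysis tax
WHERE meta.acc = tax.acc
    AND tax.name = 'Coronaviridae'
    AND meta.geo_loc_name_country_calc IS NOT NULL
GROUP BY meta.geo_loc_name_country_calc
ORDER BY 1 DESC, 2
""").fetchall()

# Explicit join: the same condition moves into the ON clause.
explicit = conn.execute("""
SELECT COUNT(meta.acc) AS run_count, meta.geo_loc_name_country_calc
FROM metadata meta
    JOIN tax_analysis tax
    ON meta.acc = tax.acc
WHERE tax.name = 'Coronaviridae'
    AND meta.geo_loc_name_country_calc IS NOT NULL
GROUP BY meta.geo_loc_name_country_calc
ORDER BY 1 DESC, 2
""").fetchall()

print(implicit)
print(implicit == explicit)
```

ORDER BY 1 refers to the first column of the SELECT list, which is the count in both queries.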
