# Getting Started with BigQuery SQL Searches
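First, install the required packages and load the BigQuery extension. On a shared Jupyter hub the first installs may fail with a permission error, as in the output below; the later commands retry with the `--user` option that the error message suggests.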

In [1]:
!pip install google.cloud.bigquery
!pip install google.cloud.storage
!pip3 install --upgrade google-cloud-bigquery

!pip install google --user
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]' --user

%load_ext google.cloud.bigquery

Collecting google.cloud.bigquery
  Using cached https://files.pythonhosted.org/packages/fd/2a/d53b342d20e4b95ade480fa04977969166bb189bf0e909501e4fff86cb3f/google_cloud_bigquery-2.25.1-py2.py3-none-any.whl
Installing collected packages: google.cloud.bigquery
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_bigquery-2.25.1-py3.9-nspkg.pth'
Consider using the `--user` option or check the permissions.
Collecting google.cloud.storage
  Using cached https://files.pythonhosted.org/packages/0e/d6/5878d73105fd242dafb42bbea26629372d397f06cb402e90302a4824c2c2/google_cloud_storage-1.42.0-py2.py3-none-any.whl
Installing collected packages: google.cloud.storage
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/opt/tljh/user/lib/python3.7/site-packages/google_cloud_storage-1.42.0-py3.9-nspkg.pth'
Consider using the `--user` option or check the 
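Because some of the installs above can fail, it may help to confirm that a package is importable before going further. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of any Google library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located by the import system."""
    try:
        # find_spec imports parent packages of a dotted name, so a missing
        # parent (e.g. `google`) raises ModuleNotFoundError instead of
        # returning None.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        return False


# Report on the modules this notebook relies on.
for name in ("google.cloud.bigquery", "google.cloud.storage"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module is reported as missing after a `--user` install, restarting the notebook kernel is usually the first thing to try.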

## Basic Query
Here is a simple query to find 5 runs submitted for the organism 'Homo sapiens' in the nih-sra-datastore project. The query selects all columns from the metadata table in the sra database of the project, keeps only the rows where the organism name is 'Homo sapiens', and limits the output to 5 rows.

In [4]:
%%bigquery
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1737.49query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  7.33rows/s]


Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,geo_loc_name_country_continent_calc,geo_loc_name_sam,ena_first_public_run,ena_last_update_run,sample_name_sam,datastore_filetype,datastore_provider,datastore_region,attributes,jattr
0,SRR15278393,RNA-Seq,EMORY UNIVERSITY,public,SRX11583084,2815176,Illumina HiSeq 1000,PAIRED,PolyA,TRANSCRIPTOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '9518138700'}, {'k': 'byt...","{""bases"": 9518138700, ""bytes"": 3239993193, ""ag..."
1,SRR15278347,RNA-Seq,EMORY UNIVERSITY,public,SRX11583130,919115,Illumina HiSeq 1000,PAIRED,PolyA,TRANSCRIPTOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '9340662300'}, {'k': 'byt...","{""bases"": 9340662300, ""bytes"": 2922928789, ""ag..."
2,SRR15065199,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375204,GSM5412973,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '188792856'}, {'k': 'byte...","{""bases"": 188792856, ""bytes"": 85176148, ""age_s..."
3,SRR15065057,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375346,GSM5412928,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '409589460'}, {'k': 'byte...","{""bases"": 409589460, ""bytes"": 186427640, ""age_..."
4,SRR15065033,AMPLICON,THE OPEN UNIVERSITY OF ISRAEL,public,SRX11375370,GSM5412909,Illumina HiSeq 2500,PAIRED,PCR,METAGENOMIC,...,,[],[],[],[],"[fastq, sra]","[gs, ncbi, s3]","[gs.US, ncbi.public, s3.us-east-1]","[{'k': 'bases', 'v': '245428848'}, {'k': 'byte...","{""bases"": 245428848, ""bytes"": 103593882, ""age_..."
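The same query can also be run from plain Python instead of the `%%bigquery` magic. The sketch below is one possible way to do that, assuming the `google-cloud-bigquery` and `pandas` packages are installed and that default credentials and a billing project are configured. The function names `build_organism_query` and `run_organism_query` are hypothetical; the organism is passed as a query parameter (`@organism`) rather than pasted into the SQL text.

```python
def build_organism_query(limit: int = 5) -> str:
    """Build the SQL text; the organism is supplied later as @organism."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT *\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        "WHERE organism = @organism\n"
        f"LIMIT {limit}"
    )


def run_organism_query(organism: str, limit: int = 5):
    """Run the query and return a pandas DataFrame.

    Assumes credentials and a default project are already set up.
    """
    # Imported lazily so the query builder works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("organism", "STRING", organism)
        ]
    )
    job = client.query(build_organism_query(limit), job_config=job_config)
    return job.to_dataframe()


print(build_organism_query(5))
```

Calling `run_organism_query("Homo sapiens")` should then return the same kind of table as the cell above, although without an ORDER BY clause the specific rows are not guaranteed to match.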


## Searching Other Tables
The second database in the nih-sra-datastore project contains results from the SRA Taxonomy Analysis Tool (STAT) as well as the Taxonomy database. STAT results refer to organisms by their Taxonomy Database Identifier (tax_id). You can find the tax_id for any entry in the taxonomy table using a search like the one below.

In [5]:
%%bigquery
SELECT * 
FROM `nih-sra-datastore.sra_tax_analysis_tool.taxonomy`
WHERE sci_name = 'Homo sapiens'


Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 945.94query/s]                          
Downloading: 100%|██████████| 1/1 [00:00<00:00,  1.38rows/s]


Unnamed: 0,tax_id,parent_id,rank,sci_name,names,ilevel,ileft,iright
0,9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191


## Joining Two (or more) Tables
Using a JOIN command will allow you to combine two different tables to find the information you are looking for. This allows data to be stored most efficiently in the database but also allows a user to view the information in a way they find easiest to understand. Below we'll simply combine the two queries from before. When specifying multiple tables to search we can also give an abbreviated name to the table ('m' and 't' in the query below) to make the rest of the query easier to type and read.
When joining two or more tables we will need to tell the database which columns in the two tables are expected to align. We need a field that can be used to link each table. This value is what we are joining "on" in the statement.

In [8]:
%%bigquery
SELECT *
FROM `nih-sra-datastore.sra.metadata` m 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.taxonomy` t 
    ON m.organism = t.sci_name
WHERE t.tax_id = 9606
    AND m.organism = 'Homo sapiens'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 2786.91query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  5.96rows/s]


Unnamed: 0,acc,assay_type,center_name,consent,experiment,sample_name,instrument,librarylayout,libraryselection,librarysource,...,attributes,jattr,tax_id,parent_id,rank,sci_name,names,ilevel,ileft,iright
0,ERR2511434,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530849,SAMEA1116052,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '222222604'}, {'k': 'byte...","{""bases"": 222222604, ""bytes"": 188868157, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191
1,ERR2511621,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2531036,SAMEA1116239,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '171452718'}, {'k': 'byte...","{""bases"": 171452718, ""bytes"": 135479064, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191
2,ERR2511146,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530561,SAMEA1115764,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '245005622'}, {'k': 'byte...","{""bases"": 245005622, ""bytes"": 197800594, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191
3,ERR2511321,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530736,SAMEA1115939,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '150219600'}, {'k': 'byte...","{""bases"": 150219600, ""bytes"": 121420950, ""alia...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191
4,ERR2510985,AMPLICON,"BIOINFORMATICS LAB, INNOVATION CENTER",public,ERX2530400,SAMEA1103343,Ion S5,SINGLE,PCR,METAGENOMIC,...,"[{'k': 'bases', 'v': '90789342'}, {'k': 'bytes...","{""bases"": 90789342, ""bytes"": 73927119, ""alias_...",9606,9605,species,Homo sapiens,"[{'name': 'Home sapiens', 'name_class': 'missp...",30,212190,212191
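To see how the ON clause and the table aliases behave without querying BigQuery, the join can be tried on a small in-memory database. This is only an illustrative sketch: it uses Python's built-in `sqlite3` module, the two tables are heavily simplified stand-ins for the real ones, and the run accessions (`RUN0001` and so on) are made-up toy values.

```python
import sqlite3

# Toy stand-ins for the metadata and taxonomy tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE metadata (acc TEXT, organism TEXT, assay_type TEXT);
    CREATE TABLE taxonomy (tax_id INTEGER, sci_name TEXT);
    INSERT INTO metadata VALUES
        ('RUN0001', 'Homo sapiens', 'RNA-Seq'),
        ('RUN0002', 'Mus musculus', 'WGS'),
        ('RUN0003', 'Homo sapiens', 'AMPLICON');
    INSERT INTO taxonomy VALUES
        (9606, 'Homo sapiens'),
        (10090, 'Mus musculus');
    """
)

# 'm' and 't' are aliases; ON names the columns that link the two tables.
rows = conn.execute(
    """
    SELECT m.acc, m.assay_type, t.tax_id
    FROM metadata m
        JOIN taxonomy t
        ON m.organism = t.sci_name
    WHERE t.tax_id = 9606
    ORDER BY m.acc
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Only the two 'Homo sapiens' runs are returned, each carrying the `tax_id` taken from the taxonomy table.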


## Selecting Fewer Columns
Using * in the select statement will show all the columns in the table. Often we don't need or want all of the columns so we can show only the columns we are interested in. We do this by naming the columns in the select statement.

In [9]:
%%bigquery
SELECT m.acc, m.assay_type, instrument, libraryselection, librarysource, sci_name
FROM `nih-sra-datastore.sra.metadata` m 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.taxonomy` t 
    ON m.organism = t.sci_name
WHERE t.tax_id = 9606
    AND m.organism = 'Homo sapiens'
    AND m.assay_type = 'RNA-Seq'
    AND m.consent = 'public'
LIMIT 5

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 3543.23query/s]                        
Downloading: 100%|██████████| 5/5 [00:00<00:00,  6.69rows/s]


Unnamed: 0,acc,assay_type,instrument,libraryselection,librarysource,sci_name
0,SRR6178770,RNA-Seq,Ion S5,RT-PCR,TRANSCRIPTOMIC,Homo sapiens
1,SRR6788172,RNA-Seq,Ion S5,PolyA,TRANSCRIPTOMIC,Homo sapiens
2,SRR5943843,RNA-Seq,Ion S5,cDNA,TRANSCRIPTOMIC,Homo sapiens
3,DRR238304,RNA-Seq,MinION,cDNA,TRANSCRIPTOMIC,Homo sapiens
4,SRR12961085,RNA-Seq,MinION,Inverse rRNA,TRANSCRIPTOMIC,Homo sapiens


## Using Where Clauses to Find Data of Interest
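Now let's look at an example for someone who wants to do an alignment. If you wanted to use the PGAP package from NCBI, you would want data in which a high percentage of the reads come from a single species, so that you can generate an assembly and annotation for it. We can use the output from STAT to do this: comparing the self_count column to the analyzed spot count gives the proportion of analyzed spots counted for that taxon. The query below keeps runs where that proportion is above 0.9 for tax_id 1639 (Listeria monocytogenes in the output) and where more than 100,000 spots were analyzed.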

In [6]:
%%bigquery
SELECT a.acc, a.name, i.analyzed_spot_count, a.self_count, a.self_count/i.analyzed_spot_count as proportion
FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` a 
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis_info` i 
    ON a.acc = i.acc
WHERE a.tax_id = 1639
    AND a.self_count/i.analyzed_spot_count > .9
    AND i.analyzed_spot_count > 100000
ORDER BY proportion DESC

Query complete after 0.00s: 100%|██████████| 4/4 [00:00<00:00, 2939.25query/s]                        
Downloading: 100%|██████████| 16153/16153 [00:00<00:00, 16483.84rows/s]


Unnamed: 0,acc,name,analyzed_spot_count,self_count,proportion
0,ERR1817002,Listeria monocytogenes,608593,594332,0.976567
1,ERR1816986,Listeria monocytogenes,692180,673720,0.973331
2,ERR1816981,Listeria monocytogenes,630954,613122,0.971738
3,ERR1816996,Listeria monocytogenes,593342,575904,0.970611
4,ERR1431120,Listeria monocytogenes,3249346,3153217,0.970416
...,...,...,...,...,...
16148,SRR10665812,Listeria monocytogenes,510003,459009,0.900012
16149,SRR2071980,Listeria monocytogenes,1336913,1203238,0.900012
16150,SRR6038661,Listeria monocytogenes,749339,674410,0.900007
16151,ERR1230393,Listeria monocytogenes,803265,722940,0.900002
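The filter in the WHERE clause is simple arithmetic, so it can be checked by hand against the output. The sketch below repeats the calculation in Python for the first row of the table above; the function names `proportion` and `passes_filter` are hypothetical, and the thresholds mirror the ones in the query.

```python
def proportion(self_count: int, analyzed_spot_count: int) -> float:
    """Fraction of analyzed spots counted for the taxon of interest."""
    return self_count / analyzed_spot_count


def passes_filter(
    self_count: int,
    analyzed_spot_count: int,
    min_proportion: float = 0.9,
    min_spots: int = 100000,
) -> bool:
    """Mirror the two numeric conditions from the query's WHERE clause."""
    if analyzed_spot_count <= min_spots:
        return False
    return proportion(self_count, analyzed_spot_count) > min_proportion


# First row of the output above: ERR1817002.
p = proportion(594332, 608593)
print(f"{p:.6f}")
print(passes_filter(594332, 608593))
```

Checking the spot count first also avoids dividing by zero for a run with no analyzed spots.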


In [7]:
%%bigquery
SELECT COUNT(m.acc) AS count, m.geo_loc_name_country_calc
FROM `nih-sra-datastore.sra.metadata` m
    JOIN `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` tax
    ON m.acc = tax.acc
WHERE tax.name = 'Coronaviridae'
    AND m.geo_loc_name_country_calc IS NOT NULL
GROUP BY m.geo_loc_name_country_calc
ORDER BY 1 DESC

Query complete after 0.00s: 100%|██████████| 5/5 [00:00<00:00, 4783.65query/s]                        
Downloading: 100%|██████████| 119/119 [00:00<00:00, 149.82rows/s]


Unnamed: 0,count,geo_loc_name_country_calc
0,366077,USA
1,196962,United Kingdom
2,17804,Ireland
3,15778,Spain
4,15653,Australia
...,...,...
114,1,Laos
115,1,Burundi
116,1,Liberia
117,1,Bulgaria
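The query above counts runs per country, skips rows with no country, and sorts the largest counts first. As a rough illustration of that GROUP BY and ORDER BY logic, the sketch below does the same thing in Python on a handful of made-up rows. The accessions and the helper name `count_by_country` are hypothetical.

```python
from collections import Counter

# Toy (accession, country) pairs; None stands in for a NULL country.
runs = [
    ("RUN0001", "USA"),
    ("RUN0002", "USA"),
    ("RUN0003", None),
    ("RUN0004", "Ireland"),
    ("RUN0005", "USA"),
]


def count_by_country(rows):
    """Count rows per country, ignoring missing countries, largest first."""
    counts = Counter(country for _acc, country in rows if country is not None)
    return counts.most_common()


for country, count in count_by_country(runs):
    print(count, country)
```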
