# Tutorial #2: How to query data from BigQuery in a Notebook

### Welcome to part II of our tutorials for IronHacks participants. In this second notebook, we will show you how to access our training data stored in BigQuery using a key stored in your user profile.

### Before you get started: This tutorial will take no more than 10 minutes, and you should be able to work with our training data right after.

### Our goal: Help you get started with the google.cloud.bigquery client library and pandas (the Python counterparts of R's bigrquery, DBI, and dplyr) in order to access a first temporal dataset stored in BigQuery! So if you have never used these libraries before, this tutorial is key for you.

# BigQuery - What's that?

### BigQuery is Google's flagship data warehousing system. Its Python client library is documented at https://googleapis.dev/python/bigquery/latest/index.html
### It is queried with SQL

# Section I: Loading the libraries

In [26]:
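# Note: the bigquery module imported below is provided by the google-cloud-bigquery
# package; if the import fails, run: python3 -m pip install google-cloud-bigquery
!python3 -m pip install google.cloud
from google.cloud import bigquery              # BigQuery client
from google.oauth2 import service_account      # service-account credentials
from google.cloud.bigquery import magics       # notebook magics
import os                                      # used to set the credentials path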

Collecting google.cloud
  Using cached google_cloud-0.34.0-py2.py3-none-any.whl (1.8 kB)
Installing collected packages: google.cloud
Successfully installed google.cloud


# Section II: Authorizing your BigQuery Access

• Finding the key in user profile (screenshot to be added - see notebook part I) 

• Adding the key using the BigQuery client library

• Verifying that the token is valid

• Getting the names of the project(s) that the token is authorized for, in order to establish the database connection

• Listing the tables in the project
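
The steps in this list can be sketched in a few lines of Python. This is a minimal sketch, not the official IronHacks procedure: it assumes the key is a Google service-account JSON file, and the helper names `check_key_file` and `list_table_names` are hypothetical. The check only confirms that the file has the expected structure; whether Google accepts the key is only known once a query runs.

```python
import json
import os

# Fields a Google service-account key file is expected to contain
# (assumption: the IronHacks key is of this type).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path):
    """Return the project ID stored in the key file, or raise ValueError."""
    if not os.path.isfile(path):
        raise ValueError(f"Key file not found: {path}")
    with open(path) as handle:
        key = json.load(handle)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']}")
    return key["project_id"]


def list_table_names(client, dataset_id):
    """Return the sorted table names of a dataset via Client.list_tables()."""
    return sorted(table.table_id for table in client.list_tables(dataset_id))
```

In the notebook, this could be called as `check_key_file('/home/jovyan/keys.json')` before creating the client, and as `list_table_names(bigquery_client, 'ironhacks-covid19-data.ironhacks_covid19_training')` once the client exists.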

### Change the URL "hub.ironhacks.com/user/YOUR_USERNAME/tree?" to "hub.ironhacks.com/user/YOUR_USERNAME/lab", that is, replace the "tree?" at the end of the URL with "lab"

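In [27]:
!python3 -m pip install pandas
import pandas  # to_dataframe() below needs pandas to be installed

# Point the Google client libraries at the key file stored in your user profile.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jovyan/keys.json'

# Create the client once. With no project argument, the project is inferred
# from the environment (here, the key file).
bigquery_client = bigquery.Client()

# Backticks quote the full table name, which is the safe choice when the
# project ID contains hyphens.
QUERY = """
SELECT COVID_TEST, COVID_DEATHS, COVID_COUNT
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
"""

query_job = bigquery_client.query(QUERY)
data = query_job.to_dataframe()  # download the result as a pandas DataFrame
data.head()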



| | COVID_TEST | COVID_DEATHS | COVID_COUNT |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 2 | 0 | 0 |
| 4 | 4 | 0 | 0 |


# Section III: Exploring the table and loading the table

• What is the schema

• How many rows

• How big is the table

• Loading the table as a dataframe and describing the table
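
The first three questions in this list can also be answered from the table's metadata, without running a query. The sketch below assumes a client created as in Section II; the helper name `summarize_table` is hypothetical, while `Client.get_table()` and the `schema`, `num_rows`, and `num_bytes` attributes belong to the google-cloud-bigquery library.

```python
def summarize_table(client, table_id):
    """Collect schema, row count, and size of one table via Client.get_table()."""
    table = client.get_table(table_id)  # metadata lookup only, no query is run
    return {
        # One (column name, column type) pair per field in the schema.
        "schema": [(field.name, field.field_type) for field in table.schema],
        "rows": table.num_rows,
        "megabytes": table.num_bytes / 1e6,
    }


# Usage in the notebook (requires valid credentials):
# summarize_table(
#     bigquery_client,
#     "ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN",
# )
```

Once the table has been loaded with `to_dataframe()`, `data.describe()` gives the usual pandas summary statistics for the last item in the list.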


In [25]:
# Count the rows in the training table. The unnamed COUNT(*) column is
# returned by BigQuery under the default name f0_.
QUERY = """
SELECT COUNT(*)
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
"""

query_job = bigquery_client.query(QUERY)
data = query_job.to_dataframe()
data.head()



| | f0_ |
|---|---|
| 0 | 152 |


# Section IV: Querying the table


### Querying a subset of the data: COVID_TEST from 2020-04-01 onwards
### Calculating the mean number of tests for the period

In [None]:
# Select the daily test counts from 2020-04-01 onwards, oldest first.
QUERY = """
SELECT DATE, COVID_TEST
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
WHERE DATE >= '2020-04-01'
ORDER BY DATE
"""

query_job = bigquery_client.query(QUERY)
data = query_job.to_dataframe()
data.head()



| | DATE | COVID_TEST |
|---|---|---|
| 0 | 2020-04-01 | 2543 |
| 1 | 2020-04-02 | 2701 |
| 2 | 2020-04-03 | 2802 |
| 3 | 2020-04-04 | 1636 |
| 4 | 2020-04-05 | 1095 |
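
If the start date should be changed without editing the SQL text, a query parameter is one option. The sketch below is an illustration under assumptions: the helper names are hypothetical, and it assumes the DATE column has the BigQuery type DATE (if it is stored as STRING, the parameter type would need to be "STRING" instead). `QueryJobConfig` and `ScalarQueryParameter` are part of google-cloud-bigquery.

```python
def build_tests_query(table_id):
    """Return SQL selecting DATE and COVID_TEST from @start_date onwards."""
    return (
        "SELECT DATE, COVID_TEST\n"
        f"FROM `{table_id}`\n"
        "WHERE DATE >= @start_date\n"
        "ORDER BY DATE"
    )


def run_tests_query(client, table_id, start_date):
    """Run the query with start_date (a datetime.date) bound as a DATE parameter."""
    # Imported here so the query builder above works without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Assumption: the DATE column is of BigQuery type DATE.
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date)
        ]
    )
    job = client.query(build_tests_query(table_id), job_config=job_config)
    return job.to_dataframe()
```

In the notebook, `run_tests_query(bigquery_client, TABLE_ID, datetime.date(2020, 4, 1))` should return the same rows as the cell above, where `TABLE_ID` stands for the full table name used in that cell.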


In [37]:
import pandas as pd

# Keep only the COVID_TEST column and average it over the selected period.
df = pd.DataFrame(data, columns=['COVID_TEST'])
print("mean: ", df.mean())


mean:  COVID_TEST    5874.285714
dtype: float64
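
The same mean can also be computed inside BigQuery with `AVG`, so that only a single number is downloaded instead of every row. This is a minimal sketch rather than the tutorial's own method; the helper name `mean_tests` and the column alias are hypothetical.

```python
def mean_tests(client, table_id):
    """Return the mean of COVID_TEST from 2020-04-01 onwards, computed by BigQuery."""
    sql = f"""
        SELECT AVG(COVID_TEST) AS mean_tests
        FROM `{table_id}`
        WHERE DATE >= '2020-04-01'
    """
    rows = list(client.query(sql).result())  # wait for the job and fetch its rows
    return rows[0][0]  # the result has a single row with a single column


# Usage in the notebook (requires valid credentials):
# mean_tests(
#     bigquery_client,
#     "ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN",
# )
```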
