# Tutorial #2: How to query data from BigQuery in a Notebook

**Welcome to part II of our tutorials for IronHacks participants. In this second notebook, we will show you how to access our training data stored in BigQuery using a key stored in your user profile.**

**Before you get started**: This tutorial will take no more than 10 minutes, after which you should have the basic skills to work with the datasets.

**Our goal**: Help you get started with the Python BigQuery, pandas, and NumPy libraries in order to access a first temporal dataset stored in BigQuery! If you have never used these libraries before, this tutorial is critical for you.

# BigQuery - What's that?

BigQuery is Google's flagship data warehousing system: "Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility". It allows you to analyze large amounts of data using ANSI SQL at high speed, with no infrastructure to operate yourself. You can find out more about it at https://cloud.google.com/bigquery

**Why do we use BigQuery?** In the COVID-19 Data Science Challenges you will use big data from our data providers SafeGraph, the Management Performance Hub (MPH), and other partners (Department of Workforce Development). The first hack, Summer 2020, will use preprocessed data, so you will not need all the functionality of BigQuery: we have sampled down more than 50 datasets, totaling more than 1 TB and millions of rows, into a small set of cleaned tables with clear identifiers and no missing entries. However, BigQuery is still very helpful, for example for exploring data without having to load all of it into memory. It will also set you up for the future of data science, since BigQuery is increasingly used in place of other big data services (e.g. Spark).

**How do we give you access to BigQuery?** In BigQuery, data are stored in projects. A project contains one or more datasets, and each dataset can contain multiple tables. In this hack we give you access to a project called ironhacks-covid19-data. In this project there are two datasets: ironhacks-covid19-data:ironhacks_covid19_training and ironhacks-covid19-data:ironhacks_covid19_competition. During the training period you will only find data in the first dataset. In this tutorial we use only one, relatively simply structured table stored in this dataset. It is called covid19_tests_cases_deaths_IN.
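The three levels described above (project, dataset, table) combine into a single dotted table ID, which is the form the SQL queries later in this notebook use. A minimal sketch; the helper name `full_table_id` is made up for illustration and is not part of the BigQuery library:

```python
PROJECT = "ironhacks-covid19-data"
DATASET = "ironhacks_covid19_training"
TABLE = "covid19_tests_cases_deaths_IN"


def full_table_id(project, dataset, table):
    """Join the three levels of the BigQuery hierarchy with dots."""
    return f"{project}.{dataset}.{table}"


table_id = full_table_id(PROJECT, DATASET, TABLE)
print(table_id)
```

Keeping the three parts in separate variables makes it easy to switch to the competition dataset later by changing only `DATASET`.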

**Keep in mind**: In this tutorial you will learn how to get access to the ironhacks-covid19-data project and to the dataset ironhacks-covid19-data:ironhacks_covid19_training stored inside it.

**What's BigQuery**: The Google Cloud BigQuery package for Python allows you to query data stored in BigQuery. You can find the official documentation [here](https://googleapis.dev/python/bigquery/latest/index.html).

**What's Pandas**: pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

**What's Numpy**: NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.





# Section I: Loading the libraries

In [26]:
# The BigQuery client library is published on PyPI as google-cloud-bigquery;
# the google.cloud (google-cloud) package is a deprecated placeholder.
!python3 -m pip install google-cloud-bigquery
from google.cloud import bigquery
from google.oauth2 import service_account
from google.cloud.bigquery import magics
import os
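If an import fails, it can help to check which of the three libraries Python can actually locate. The following is a small sketch using only the standard library; the helper name `is_importable` is made up for illustration:

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate `module_name` in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `google`) is missing or malformed.
        return False


for name in ("google.cloud.bigquery", "pandas", "numpy"):
    print(f"{name}: {'ok' if is_importable(name) else 'missing'}")
```

Any library reported as missing can then be installed with `python3 -m pip install` before continuing.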


# Section II: Authorizing your BigQuery Access

• Finding the key in your user profile (screenshot to be added - see notebook part I)

• Adding the key using the BigQuery client library

• Verifying that the token is valid

• Getting the name(s) of the project(s) the token is authorized for and establishing the database connection

• Listing the tables in the project

**So the next step is to find the key**:

1) Go to your user profile

2) Click on "Download your hack dataset training key"

3) Upload it to your JupyterLab environment
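Once the key is uploaded, a quick sanity check of the file can save some debugging later. The sketch below assumes the hack key is a standard Google service-account JSON key (this tutorial does not state the key format explicitly); the helper name `check_key_file` is made up for illustration.

```python
import json
from pathlib import Path

# Fields that a Google service-account JSON key is expected to contain.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path):
    """Return (project_id, client_email) if the file looks like a service-account key."""
    key_path = Path(path)
    if not key_path.is_file():
        raise FileNotFoundError(f"No key file found at {key_path}")
    with key_path.open() as handle:
        key = json.load(handle)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']!r}")
    return key["project_id"], key["client_email"]


# Usage in the notebook (the path depends on where you uploaded the key):
# project_id, client_email = check_key_file("/home/jovyan/keys.json")
```

This only checks the structure of the file; whether the key is actually accepted is decided by BigQuery when the first request is sent.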


After this, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your key, as shown below.

In [49]:
# Point the Google client libraries at the uploaded key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jovyan/keys.json'
# Create one client, bound explicitly to the hack project
bigquery_client = bigquery.Client(project='ironhacks-covid19-data')

QUERY = """
SELECT *
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
"""

query_job = bigquery_client.query(QUERY)
!python3 -m pip install pandas
import pandas
# Download the query result into a pandas DataFrame and show the first rows
data = query_job.to_dataframe()
data.head()



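|   | DATE | COVID_TEST | DAILY_DELTA_TESTS | DAILY_BASE_TESTS | COVID_DEATHS | DAILY_DELTA_DEATHS | DAILY_BASE_DEATHS | COVID_COUNT | DAILY_DELTA_CASES | DAILY_BASE_CASES | COVID_COUNT_CUMSUM | COVID_DEATHS_CUMSUM | COVID_TEST_CUMSUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-26 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2020-02-27 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2020-02-29 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 3 | 2020-03-02 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 4 | 2020-03-03 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |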


In [51]:
print(data.columns)

Index(['DATE', 'COVID_TEST', 'DAILY_DELTA_TESTS', 'DAILY_BASE_TESTS',
       'COVID_DEATHS', 'DAILY_DELTA_DEATHS', 'DAILY_BASE_DEATHS',
       'COVID_COUNT', 'DAILY_DELTA_CASES', 'DAILY_BASE_CASES',
       'COVID_COUNT_CUMSUM', 'COVID_DEATHS_CUMSUM', 'COVID_TEST_CUMSUM'],
      dtype='object')
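The last item in the list at the start of this section, listing what the key gives access to, can be sketched with `Client.list_datasets()` and `Client.list_tables()` from the BigQuery client library. The helper name `list_visible_tables` is made up for illustration, and whether your hack key is permitted to list datasets is an assumption; a permission error here does not necessarily mean the key is invalid.

```python
def list_visible_tables(client):
    """Map each dataset ID the client can see to a sorted list of its table IDs."""
    tables_by_dataset = {}
    for dataset in client.list_datasets():
        # list_tables accepts the dataset item returned by list_datasets
        table_ids = [table.table_id for table in client.list_tables(dataset)]
        tables_by_dataset[dataset.dataset_id] = sorted(table_ids)
    return tables_by_dataset


# Usage in the notebook, with the client created above:
# for dataset_id, table_ids in list_visible_tables(bigquery_client).items():
#     print(dataset_id, table_ids)
```

The function takes the client as an argument, so it works with whichever client object you created in the cell above.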


# Section III: Exploring and loading the table

• What is the schema?

• How many rows does it have?

• How big is the table?

• Loading the table as a dataframe and describing it


Here, we will explore how many rows the table has.

In [25]:
QUERY = """
SELECT COUNT(*)
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
"""

query_job = bigquery_client.query(QUERY)
!python3 -m pip install pandas
import pandas
data = query_job.to_dataframe()
data.head()



|   | f0_ |
|---|---|
| 0 | 152 |
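The first three questions at the start of this section (schema, number of rows, size) can also be answered from the table's metadata, without running a query. A hedged sketch using `Client.get_table()` and the `schema`, `num_rows`, and `num_bytes` attributes of the returned table; the helper name `describe_table` is made up for illustration:

```python
def describe_table(client, table_id):
    """Collect schema, row count, and size of a table from its metadata."""
    table = client.get_table(table_id)  # metadata lookup, no query is run
    return {
        "columns": [(field.name, field.field_type) for field in table.schema],
        "num_rows": table.num_rows,
        "size_mb": table.num_bytes / 1_000_000,
    }


# Usage in the notebook, with the client created in Section II:
# info = describe_table(
#     bigquery_client,
#     "ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN",
# )
# print(info)
```

For the last question, once the table is loaded as a dataframe, `data.describe()` in pandas prints summary statistics for each numeric column.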


# Section IV: Querying the table


So here is what we will do:

• Subsetting the table to the DATE and COVID_TEST columns and filtering for the period 2020-04-01 to 2020-05-01

• Calculating the mean of COVID_TEST for the period with pandas (which builds on numpy)

In [52]:
QUERY = """
SELECT DATE, COVID_TEST
FROM `ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN`
WHERE DATE BETWEEN '2020-04-01' AND '2020-05-01'
ORDER BY DATE
"""

query_job = bigquery_client.query(QUERY)
!python3 -m pip install pandas
import pandas
data = query_job.to_dataframe()
data.head()



|   | DATE | COVID_TEST |
|---|---|---|
| 0 | 2020-04-01 | 2543 |
| 1 | 2020-04-02 | 2701 |
| 2 | 2020-04-03 | 2802 |
| 3 | 2020-04-04 | 1636 |
| 4 | 2020-04-05 | 1095 |
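If you want to try other periods, the dates can be passed as query parameters instead of being typed into the SQL text. The sketch below uses `QueryJobConfig` and `ScalarQueryParameter` from the BigQuery client library. It assumes that the DATE column has the BigQuery type DATE (if it is stored as a string, the parameter type would need to be "STRING" and the values ISO date strings); the helper names `period_query` and `run_period_query` are made up for illustration.

```python
import datetime


def period_query(table_id):
    """Build the SQL text; @start_date and @end_date are filled in by BigQuery."""
    return (
        "SELECT DATE, COVID_TEST\n"
        f"FROM `{table_id}`\n"
        "WHERE DATE BETWEEN @start_date AND @end_date\n"
        "ORDER BY DATE"
    )


def run_period_query(client, table_id, start, end):
    """Run the query for the period [start, end] and return a pandas DataFrame."""
    # Imported here so that period_query works even without the library installed.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Assumption: the DATE column has BigQuery type DATE.
            bigquery.ScalarQueryParameter("start_date", "DATE", start),
            bigquery.ScalarQueryParameter("end_date", "DATE", end),
        ]
    )
    return client.query(period_query(table_id), job_config=job_config).to_dataframe()


# Usage in the notebook, with the client created in Section II:
# data = run_period_query(
#     bigquery_client,
#     "ironhacks-covid19-data.ironhacks_covid19_training.covid19_tests_cases_deaths_IN",
#     datetime.date(2020, 4, 1),
#     datetime.date(2020, 5, 1),
# )
```

Passing values as parameters keeps the SQL text fixed, so only the two dates change between runs.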


In [37]:
import pandas as pd
# Keep only the COVID_TEST column and print its mean over the selected period
df = pd.DataFrame(data, columns=['COVID_TEST'])
!python3 -m pip install numpy
print("mean: ", df.mean())


mean:  COVID_TEST    5874.285714
dtype: float64
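The same calculation can be done with either pandas or NumPy. The sketch below is self-contained and uses only the five rows printed by `data.head()` earlier in this section, so its result is the mean of those five days, not of the whole period:

```python
import numpy as np
import pandas as pd

# Only the five rows shown by data.head() above, not the full April period.
sample = pd.DataFrame(
    {
        "DATE": ["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-04", "2020-04-05"],
        "COVID_TEST": [2543, 2701, 2802, 1636, 1095],
    }
)

pandas_mean = sample["COVID_TEST"].mean()  # pandas Series method
numpy_mean = np.mean(sample["COVID_TEST"].to_numpy())  # same value via NumPy

print("pandas:", pandas_mean)
print("numpy: ", numpy_mean)
```

Selecting a single column with `sample["COVID_TEST"]` returns a Series, so `.mean()` gives one number instead of the one-row Series printed above.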
