# Tutorial #2: How to query data from BigQuery in a Notebook

**Welcome to part II of our tutorials designed for participants in the IronHacks. In this second notebook (part II), we will show you how you can access our training data stored in BigQuery using a key stored in your user profile.**

**Before you get started**: This tutorial will not take more than 10 minutes and you should have the basic skill sets to work with the datasets thereafter. 

**Our goal**: Help you get started with the Python BigQuery, Pandas, and NumPy libraries so that you can access a first temporal dataset stored in BigQuery! If you have never used these libraries before, this tutorial is critical for you.

# BigQuery - What's that?

BigQuery is Google's flagship data warehousing system: "Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility". It allows you to analyze large amounts of data using ANSI SQL at high speed, with no infrastructure to manage. You can find out more about it at https://cloud.google.com/bigquery

**Why do we use BigQuery?** In the Unemployment Data Science Challenges you will use BIG DATA from our data provider, the Department of Workforce Development (DWD). Using BigQuery will be very helpful because you can explore the data without having to load all of it into memory. It will also set you up for the future of data science, since BigQuery is increasingly used alongside, and sometimes in place of, other BIG DATA services (e.g. Spark).

**How do we give you access to BigQuery?** In BigQuery, data are stored in projects. Inside a project there are multiple datasets. Each dataset can contain multiple tables. In this hack we give you access to a project called `ironhacks-data`. In this project there are two datasets: `ironhacks-data:ironhacks_training` and `ironhacks-data:ironhacks_competition`. During the training period you will only find data in the first dataset. In this tutorial we use only one table from this dataset, which has a relatively simple structure. It is called `covid19_cases`.

**Keep in mind**: In this tutorial you will learn how to access the `ironhacks-data` project and the dataset `ironhacks-data:ironhacks_training` stored inside it.
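To make the project, dataset, and table hierarchy concrete, here is a minimal sketch that assembles the fully qualified table name used in this tutorial's queries. The helper `qualified_table_id` is a hypothetical name introduced for illustration and is not part of the BigQuery library. Note that SQL queries join the three parts with dots rather than the colon shown above.

```python
def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted project.dataset.table reference for use in SQL."""
    return f"`{project}.{dataset}.{table}`"


# Names taken from this tutorial: project, training dataset, and table.
TABLE_REF = qualified_table_id("ironhacks-data", "ironhacks_training", "covid19_cases")

# LIMIT 5 is added here only to keep the illustration small.
query = f"SELECT * FROM {TABLE_REF} LIMIT 5"
print(query)
```

Building the reference in one place makes it easy to switch to the competition dataset later by changing a single argument.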

**What's the BigQuery package?** The Google Cloud BigQuery package for Python allows you to query data stored in BigQuery. You can find the official documentation [here](https://googleapis.dev/python/bigquery/latest/index.html).

**What's Pandas?** Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

**What's NumPy?** NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

---

# Section I: Authorizing your BigQuery Access

This section can be tricky so do make sure you follow each step!



Open a terminal tab in your Notebook Environment.
> You can open one from the tab bar at the top, where you see the Notebook name and a "+" sign. Click the "+" and a new Launcher window will open. You will see the Terminal box in the bottom left of that screen. Clicking it opens a new Terminal tab.

<img src="https://i.imgur.com/7E9ubfW.png" alt=" icon" style="margin-right: 10px;" />

Copy each command into your terminal and follow the instructions provided by Google
>  `gcloud auth login`

This command will create a Google link for you as marked in the second red box below.

<img src="https://i.imgur.com/YF20i54.png" alt=" icon" style="margin-right: 10px;" />

 Copy and paste that link into a new tab in your internet browser and hit "Enter".  This will pop up a Google Login page. 
 
<img src="https://i.imgur.com/4bc6m65.png" alt=" icon" style="margin-right: 10px;" />
 
 Log in with your Google account and click "Allow" when authorizing the Google CLI. 

<img src="https://i.imgur.com/x7yY932.png" alt=" icon" style="margin-right: 10px;" />
 
 After this, you will get an authorization code in the second box. 
 
<img src="https://i.imgur.com/cZEkLhM.png" alt=" icon" style="margin-right: 10px;" />

 Copy this code and go back to your terminal in your Workspace. Paste this code in the yellow box area like below and hit "Enter". The first command is now complete!

<img src="https://i.imgur.com/YF20i54.png" alt=" icon" style="margin-right: 10px;" />

> `gcloud auth application-default login`

This command will create a Google link for you. 

<img src="https://i.imgur.com/CXoHV6t.png" alt=" icon" style="margin-right: 10px;" />

As with the last command, copy and paste that link into a new tab in your internet browser and hit "Enter". This will pop up a Google Login page. Log in with your Google account and click "Allow" when authorizing the Google Cloud SDK. This ensures that you can access BigQuery. After this, you will get an authorization code in the second box. Copy this code and go back to your terminal in your Workspace. Paste the code in and hit "Enter". The second command is now complete!


> `gcloud auth application-default set-quota-project ironhacks-data`

This command will set the project. Run this command in the terminal and that's it! You are now set! You can come back to your Jupyter Notebook and begin using BigQuery.

<img src="https://i.imgur.com/zYzmwcX.png" alt=" icon" style="margin-right: 10px;" />


> Note: You only need to do this once, the first time you open your notebook and have not yet run these commands. After this you won't have to open the terminal again!
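If you are unsure whether you have already completed the terminal steps, the following sketch checks for the credentials file that `gcloud auth application-default login` writes. It assumes the usual default location on Linux and macOS (`~/.config/gcloud`), or the directory named by the `CLOUDSDK_CONFIG` environment variable if that is set. Your workspace may store it elsewhere, and the function names are hypothetical helpers introduced for illustration.

```python
import os
from pathlib import Path


def adc_file_path(config_dir=None) -> Path:
    """Return the expected location of the Application Default Credentials file.

    Assumption: the gcloud default of ~/.config/gcloud on Linux/macOS, unless
    the CLOUDSDK_CONFIG environment variable points somewhere else.
    """
    if config_dir is None:
        config_dir = os.environ.get("CLOUDSDK_CONFIG", Path.home() / ".config" / "gcloud")
    return Path(config_dir) / "application_default_credentials.json"


def has_application_default_credentials(config_dir=None) -> bool:
    """True if the credentials file written by the second command exists."""
    return adc_file_path(config_dir).is_file()


if has_application_default_credentials():
    print("Credentials found; the terminal steps should not need repeating.")
else:
    print("No credentials file found; run the gcloud commands above.")
```

This only confirms that the file exists. It does not verify that the credentials are still valid or that the quota project is set.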

# Section II: Loading the libraries

Now we are going to run our code as normal! You should be able to hit "Run all" and everything should be working now!

In [1]:
!pip install db-dtypes

Collecting db-dtypes
  Using cached db_dtypes-1.0.4-py2.py3-none-any.whl (14 kB)
Installing collected packages: db-dtypes
Successfully installed db-dtypes-1.0.4


In [2]:
import os
import pandas
from google.cloud import bigquery
from google.oauth2 import service_account
from google.cloud.bigquery import magics

In [3]:
# CONFIGURE THE BIGQUERY SETTINGS

BIGQUERY_PROJECT = 'ironhacks-data'
bigquery_client = bigquery.Client(project=BIGQUERY_PROJECT)

In [6]:
query = """
SELECT *
FROM `ironhacks-data.ironhacks_training.covid19_cases`
"""

# QUERY THE DATA ONCE
query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()
covid19_cases_data.head()

   week_number  start_date  county   fips  cases  deaths
0            9  2021-03-01  Marion  18097    664      23
1           12  2021-03-22  Marion  18097    623      11
2           19  2021-05-10  Marion  18097   1156       4
3           11  2021-03-15  Marion  18097    560      13
4            6  2021-02-08  Marion  18097   1542     219


In [5]:
# THEN WORK BELOW TO DO SOMETHING WITH THE RESULTS
print("Columns:")
print('\n'.join(covid19_cases_data.columns))
print("\nResults:")
print(covid19_cases_data.head())

Columns:
week_number
start_date
county
fips
cases
deaths

Results:
   week_number  start_date  county   fips  cases  deaths
0            9  2021-03-01  Marion  18097    664      23
1           12  2021-03-22  Marion  18097    623      11
2           19  2021-05-10  Marion  18097   1156       4
3           11  2021-03-15  Marion  18097    560      13
4            6  2021-02-08  Marion  18097   1542     219
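Because this is a temporal dataset and the rows above are not in date order, a common first step is to parse `start_date` and sort by it. The sketch below re-enters the five printed rows by hand so that it runs without a BigQuery connection. In the notebook you would apply the same two lines to `covid19_cases_data`, assuming its `start_date` column converts cleanly with `pandas.to_datetime`.

```python
import pandas as pd

# The five rows printed above, re-entered by hand so the sketch runs offline.
sample = pd.DataFrame(
    {
        "week_number": [9, 12, 19, 11, 6],
        "start_date": ["2021-03-01", "2021-03-22", "2021-05-10", "2021-03-15", "2021-02-08"],
        "county": ["Marion"] * 5,
        "fips": [18097] * 5,
        "cases": [664, 623, 1156, 560, 1542],
        "deaths": [23, 11, 4, 13, 219],
    }
)

# Make sure the date column is a real datetime, then order the weeks in time.
sample["start_date"] = pd.to_datetime(sample["start_date"])
by_date = sample.sort_values("start_date").reset_index(drop=True)

print(by_date[["week_number", "start_date", "cases"]])
```

Sorting in pandas is convenient for small results like this one. For large tables, an `ORDER BY` clause in the query itself lets BigQuery do the work.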


---

# Section III: Exploring and loading the table

- What is the schema
- How many rows
- How big is the table
- Loading the table as dataframe and describing the table

Here, we will find out how many rows the table has. 

In [7]:
query = """
SELECT COUNT(*)
FROM `ironhacks-data.ironhacks_training.covid19_cases`
"""

query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()

In [9]:
print(covid19_cases_data)

   f0_
0   46
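The schema and table size listed at the start of this section can be read from the table's metadata without running a query. The sketch below uses `Client.get_table` and the `schema`, `num_rows`, and `num_bytes` properties of the returned table. The wrapper `describe_table` is a hypothetical helper introduced for illustration, and the client is assumed to be the `bigquery_client` created in Section II.

```python
def describe_table(client, table_id: str) -> dict:
    """Collect schema, row count, and size from table metadata (no query is run)."""
    table = client.get_table(table_id)  # google.cloud.bigquery.Client.get_table
    return {
        "columns": [(field.name, field.field_type) for field in table.schema],
        "num_rows": table.num_rows,
        "size_kib": table.num_bytes / 1024,
    }


# Usage in the notebook, with the client created in Section II:
# info = describe_table(bigquery_client, "ironhacks-data.ironhacks_training.covid19_cases")
# print(info)
```

The usage lines are left commented out because they need the authorized client from the earlier cells.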


---

# Section IV: Querying the table

Here is what we will do:

- Subsetting the table for 'week_number' and 'cases' and filtering for week 1 to week 3
- Calculating the mean cases for the period with numpy and pandas

In [10]:
query = """
SELECT
  week_number,
  cases
FROM `ironhacks-data.ironhacks_training.covid19_cases`
WHERE week_number BETWEEN 1 AND 3
ORDER BY week_number
"""

query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()

In [11]:
print(covid19_cases_data.head())

   week_number  cases
0            1   4714
1            1    964
2            2   5289
3            2   1232
4            3   3460


In [12]:
# Keep only the 'cases' column and compute its mean over weeks 1 to 3
df = covid19_cases_data[['cases']]
print("mean: ", df.mean())

mean:  cases    2732.666667
dtype: float64
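The cell above uses pandas only. As a sketch of how NumPy fits in, the example below computes the same kind of mean with both libraries and adds a per-week breakdown. It uses only the five rows that `head()` printed earlier, typed in by hand, so its overall mean differs from the notebook's value, which is computed over the full query result.

```python
import numpy as np
import pandas as pd

# First five rows of the filtered result shown above, typed in by hand.
# head() shows only five rows, so these numbers differ from the notebook's mean.
weeks = pd.DataFrame(
    {"week_number": [1, 1, 2, 2, 3], "cases": [4714, 964, 5289, 1232, 3460]}
)

pandas_mean = weeks["cases"].mean()              # pandas: Series.mean()
numpy_mean = np.mean(weeks["cases"].to_numpy())  # NumPy: mean over the raw array
weekly_mean = weeks.groupby("week_number")["cases"].mean()  # mean per week

print("pandas mean:", pandas_mean)
print("numpy mean: ", numpy_mean)
print(weekly_mean)
```

Both libraries give the same overall result here. The `groupby` step shows how the mean varies from week to week, which a single overall mean hides.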
