# Tutorial #2: How to query data from BigQuery in a Notebook

**Welcome to part II of our tutorials designed for IronHacks participants. In this second notebook, we show you how to access our training data stored in BigQuery using a key stored in your user profile.**

**Before you get started**: This tutorial takes no more than 10 minutes. Afterwards you should have the basic skills needed to work with the datasets.

**Our goal**: Help you get started with the Python BigQuery, pandas, and NumPy libraries so that you can access a first temporal dataset stored in BigQuery. If you have never used these libraries before, this tutorial is critical for you.

# BigQuery - What's that?

BigQuery is Google's flagship data warehousing system: "Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility". It allows you to analyze large amounts of data using ANSI SQL at high speed, with no infrastructure to manage. You can find out more about it at https://cloud.google.com/bigquery

**Why do we use BigQuery?** In the Unemployment Data Science Challenges you will use big data from our data provider, the Department of Workforce Development (DWD). BigQuery is very helpful here because you can explore the data without having to load all of it into memory. It will also prepare you for future data science work, since BigQuery is increasingly used in place of other big data services (e.g. Spark).

**How do we give you access to BigQuery?** In BigQuery, data are stored in projects. Inside a project there are multiple datasets, and each dataset can contain multiple tables. In this hack we give you access to a project called `ironhacks-data`. This project contains two datasets: `ironhacks-data:ironhacks_training` and `ironhacks-data:ironhacks_competition`. During the training period you will only find data in the first dataset. In this tutorial we use a single, relatively simple table stored in this dataset. It is called `covid19_cases`.
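The project, dataset, and table names combine into a single table reference. As a minimal sketch (the helper name `table_reference` is hypothetical and not part of any library), the snippet below builds the backtick-quoted `project.dataset.table` form used by the standard SQL queries in this notebook. The colon form shown above (`project:dataset`) is the notation used by legacy SQL and the `bq` command-line tool; standard SQL queries separate all three parts with dots.

```python
def table_reference(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted table reference for a standard SQL query."""
    return f"`{project}.{dataset}.{table}`"


# Names taken from this tutorial.
covid19_cases_table = table_reference(
    "ironhacks-data", "ironhacks_training", "covid19_cases"
)
print(f"SELECT * FROM {covid19_cases_table}")
```

The backticks are needed because the project name contains a hyphen.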

**Keep in mind**: In this tutorial you will learn how to access the `ironhacks-data` project and the dataset `ironhacks-data:ironhacks_training` stored inside it.

**What's the BigQuery library?** The Google Cloud BigQuery package for Python allows you to query data stored in BigQuery. You can find the official documentation [here](https://googleapis.dev/python/bigquery/latest/index.html).

**What's pandas?** pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

**What's NumPy?** NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

---

# Section I: Loading the libraries

In [4]:
import os
import pandas
from google.cloud import bigquery
from google.oauth2 import service_account
from google.cloud.bigquery import magics

# Section II: Authorizing your BigQuery Access

- Open a terminal tab in your Notebook Environment
- Copy each command and follow the instructions provided by Google
- Set the name of the project you are authorized for to establish the database connection
- List the tables in the project

> Note: You will need to re-run the terminal commands each time you re-open the notebook, but only once per session.

In [None]:
# Run these terminal commands when your Notebook Session begins
!gcloud auth login
!gcloud auth application-default set-quota-project ironhacks-data

In [5]:
# CONFIGURE THE BIGQUERY SETTINGS

BIGQUERY_PROJECT = 'ironhacks-data'
bigquery_client = bigquery.Client(project=BIGQUERY_PROJECT)
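The steps above mention listing the tables in the project. One possible way to do that, sketched here under the assumption that `client` is a `google.cloud.bigquery.Client` like the `bigquery_client` created above, is to call its `list_tables` method with a `project.dataset` string. The helper name `list_table_names` is hypothetical.

```python
def list_table_names(client, dataset_id):
    """Return the sorted table names found in a dataset.

    `client` is expected to be a google.cloud.bigquery.Client and
    `dataset_id` a string such as "ironhacks-data.ironhacks_training".
    """
    return sorted(table.table_id for table in client.list_tables(dataset_id))


# Usage in this notebook (requires the authorization steps above):
# print(list_table_names(bigquery_client, "ironhacks-data.ironhacks_training"))
```

If the call fails with a permissions error, re-run the terminal commands from the start of this section.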

In [8]:
query = """
SELECT *
FROM `ironhacks-data.ironhacks_training.covid19_cases`
"""

# QUERY THE DATA ONCE
query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()

In [30]:
# THEN WORK BELOW TO DO SOMETHING WITH THE RESULTS
print("Columns:")
print('\n'.join(covid19_cases_data.columns))
print("\nResults:")
print(covid19_cases_data.head())

Columns:
week_number
cases

Results:
   week_number  cases
0            1   5289
1            2   3460
2            3   2794


---

# Section III: Exploring the table and loading the table

- What is the schema?
- How many rows does the table have?
- How big is the table?
- Loading the table as a dataframe and describing it

Here, we will check how many rows the table has.

In [11]:
query = """
SELECT COUNT(*)
FROM `ironhacks-data.ironhacks_training.covid19_cases`
"""

query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()

In [13]:
print(covid19_cases_data)

   f0_
0   14
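The schema and size questions listed at the start of this section can be answered from table metadata, without running a query. The following is a minimal sketch, assuming `client` is a `google.cloud.bigquery.Client`; it uses the client's `get_table` method and the `schema`, `num_rows`, and `num_bytes` attributes of the returned table object. The helper name `describe_table` is hypothetical.

```python
def describe_table(client, table_id):
    """Collect basic metadata for a table without scanning its rows."""
    table = client.get_table(table_id)  # fetches metadata only
    return {
        "columns": [(field.name, field.field_type) for field in table.schema],
        "num_rows": table.num_rows,
        "size_bytes": table.num_bytes,
    }


# Usage in this notebook:
# info = describe_table(
#     bigquery_client, "ironhacks-data.ironhacks_training.covid19_cases"
# )
# print(info)
```

The `num_rows` value should match the result of the `COUNT(*)` query above.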


---

# Section IV: Querying the table

Here is what we will do:

- Subset the table to 'week_number' and 'cases' and filter for weeks 1 to 3
- Calculate the mean number of cases for the period with pandas

In [26]:
query = """
SELECT
  week_number,
  cases
FROM `ironhacks-data.ironhacks_training.covid19_cases`
WHERE week_number BETWEEN 1 AND 3
ORDER BY week_number
"""

query_job = bigquery_client.query(query)
covid19_cases_data = query_job.to_dataframe()

In [28]:
print(covid19_cases_data.head())

   week_number  cases
0            1   5289
1            2   3460
2            3   2794
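If you want to change the week range without editing the SQL string, query parameters are one option. The sketch below assumes that `week_number` is stored as an integer (`INT64`), which the printed output suggests but the tutorial does not state. The names `WEEK_RANGE_QUERY` and `query_week_range` are hypothetical; `QueryJobConfig` and `ScalarQueryParameter` come from the `google-cloud-bigquery` package.

```python
WEEK_RANGE_QUERY = """
SELECT week_number, cases
FROM `ironhacks-data.ironhacks_training.covid19_cases`
WHERE week_number BETWEEN @first_week AND @last_week
ORDER BY week_number
"""


def query_week_range(client, first_week, last_week):
    """Run WEEK_RANGE_QUERY for the given weeks and return a DataFrame."""
    # Imported here so that defining the function does not need the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Assumption: week_number is an INT64 column.
            bigquery.ScalarQueryParameter("first_week", "INT64", first_week),
            bigquery.ScalarQueryParameter("last_week", "INT64", last_week),
        ]
    )
    return client.query(WEEK_RANGE_QUERY, job_config=job_config).to_dataframe()


# Usage in this notebook:
# covid19_cases_data = query_week_range(bigquery_client, 1, 3)
```

Passing values as parameters also avoids building SQL by string concatenation.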


In [29]:
# Keep only the 'cases' column and compute its mean
df = pandas.DataFrame(covid19_cases_data, columns=['cases'])
print("mean: ", df.mean())

mean:  cases    3847.666667
dtype: float64
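The same mean can also be computed with NumPy. The sketch below re-enters the three rows printed earlier by hand, so it runs without a BigQuery connection. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The three rows printed above, re-entered by hand for illustration.
weeks_1_to_3 = pd.DataFrame(
    {"week_number": [1, 2, 3], "cases": [5289, 3460, 2794]}
)

# pandas: mean of a single column (a Series)
mean_pandas = weeks_1_to_3["cases"].mean()

# NumPy: mean of the underlying array
mean_numpy = np.mean(weeks_1_to_3["cases"].to_numpy())

print(round(mean_pandas, 2), round(mean_numpy, 2))
```

Both values should agree with the mean printed above. In the notebook itself you can replace `weeks_1_to_3` with `covid19_cases_data`.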
