<a href="https://colab.research.google.com/github/lazlozerv/Kaggle_SQL_Summer_Camp/blob/master/Lesson2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Intro**

In this lesson we'll learn about SQL queries in __BigQuery__,
and more specifically about **SELECT ... FROM ... WHERE ...**

The syntax is simple: `SELECT` names the columns to return, `FROM` names the table to read, and `WHERE` keeps only the rows that satisfy a condition.
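
To see how the three clauses fit together without needing a BigQuery account, here is a minimal sketch that runs the same kind of query against a throwaway in-memory SQLite database. The `pets` table and its rows are made up for illustration and are not the BigQuery dataset. SQLite's SQL dialect also differs from BigQuery's in places, for example the table name here needs no backticks.

```python
import sqlite3

# Hypothetical stand-in table; the rows are invented for this sketch
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# SELECT picks the column, FROM picks the table, WHERE filters the rows
query = """
        SELECT Name
        FROM pets
        WHERE Animal = 'Cat'
        """

# Sort in Python so the result order does not depend on the database
cat_names = sorted(row[0] for row in conn.execute(query))
print(cat_names)

conn.close()
```

Only the rows whose `Animal` value is `'Cat'` survive the `WHERE` filter, and only their `Name` column is returned.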

In [0]:
# Example of a query
# query = """
#         SELECT Name
#         FROM `bigquery-public-data.pet_records.pets`
#         WHERE Animal = 'Cat'
#         """

In [0]:
from google.cloud import bigquery


# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "openaq" dataset
dataset_ref = client.dataset("openaq",project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

"""
# List all the tables in the "openaq" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset
for table in tables:
  print(table.table_id)
  
"""

In [0]:
# Construct a reference to the "global air quality" table
table_ref = dataset_ref.table("global_air_quality")


# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "global_air_quality" table
client.list_rows(table, max_results=5).to_dataframe()

In [0]:
# Printing the table's schema
table.schema

# Query to select all the items from the "city" column where the "country" column is 'US'
query = """
        SELECT city
        FROM `bigquery-public-data.openaq.global_air_quality`
        WHERE country = 'US'
        """

#### Submitting the query to the dataset

We begin by setting up the query with the client's `query()` method.

In [0]:
# Set up the query
query_job = client.query(query)

# API request - run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()

Now we've got a pandas DataFrame, which we can use like any other DataFrame.

In [0]:
# Which five cities have the most measurements?
us_cities.city.value_counts().head()
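
For readers unfamiliar with `value_counts()`: it tallies how often each value occurs and lists the most frequent first, and `head()` then keeps the top five. The sketch below reproduces that idea in plain Python with `collections.Counter`. The city list is a made-up sample, not the actual query result.

```python
from collections import Counter

# Hypothetical sample of the "city" column; one entry per measurement row
cities = [
    "Houston", "Phoenix", "Houston", "Chicago",
    "Houston", "Phoenix", "Boston",
]

# Count occurrences of each city, then keep the most frequent ones
city_counts = Counter(cities)
top_two = city_counts.most_common(2)

for city, count in top_two:
    print(city, count)
```

Each row returned by the query corresponds to one measurement, so a city's count is the number of measurements recorded for it.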

#### Working with big datasets

How can we estimate the size of a query before running it? To see how much data a query will scan, we create a `QueryJobConfig` object and set its `dry_run` parameter to `True`. A dry run does not execute the query; it only reports the number of bytes the query would process.

In [0]:
# The example query
query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job" 
        """

# Create a QueryJobConfig object to estimate the size of the query without running it.
# (A separate parameter, maximum_bytes_billed, can set an upper bound for a real run.)
dry_run_config = bigquery.QueryJobConfig(dry_run=True)

# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))
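
A raw byte count is hard to read at a glance. The sketch below shows one possible way to turn such a number into a readable size and compare it with a self-imposed limit. The helper names `format_bytes` and `within_budget`, the sample estimate, and the 1 GiB limit are all hypothetical and not part of the BigQuery client library.

```python
def format_bytes(num_bytes):
    """Render a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["bytes", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about
        if size < 1024 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.2f} {unit}"
        size /= 1024


def within_budget(num_bytes, max_bytes):
    """Return True if the estimated scan size does not exceed the limit."""
    return num_bytes <= max_bytes


# Hypothetical value standing in for dry_run_query_job.total_bytes_processed
estimated_bytes = 500_000_000
budget_bytes = 1024 ** 3  # assumed limit of 1 GiB

print("Estimated scan size:", format_bytes(estimated_bytes))
print("Within budget:", within_budget(estimated_bytes, budget_bytes))
```

In the notebook, `estimated_bytes` could be replaced with the value reported by the dry run above.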