# Kaggle Intro to SQL (and BigQuery)
- https://www.kaggle.com/learn/intro-to-sql

## 2. Exercise: Select, From & Where
- The foundational components for all SQL queries.

### Intro
- Write some SELECT statements of your own to explore a large dataset of air pollution measurements.

We know that the `global_air_quality`table is part of the `openaq` dataset. And `openaq` dataset is contained in the `bigquery-public-data`.     
We are going to fetch the `global_air_quality`table to mk some querys on its data.

In [6]:
### To fetch the dataset (in dataset var)
from google.cloud import bigquery

# Create a 'Client' object: the first step in the workflow to retrieve information
# from google-BigQuery datasets.
client = bigquery.Client('jmproject86385')

# Construct a reference to the 'openaq' dataset contained in
# bigquery-public-data project
dataset_ref = client.dataset('openaq', project='bigquery-public-data')

# API request - fetch the dataset (first fetch the dataset, all tables)
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the 'global_air_quality' table
table_ref = dataset_ref.table('global_air_quality')

# API request - fetch the table
table = client.get_table(table_ref)

# Preview first 5 lines of the 'global_air_quality' table
#client.list_rows(table).to_dataframe()
client.list_rows(table, max_results=5).to_dataframe()

  client.list_rows(table, max_results=5).to_dataframe()


Unnamed: 0,location,city,country,pollutant,value,timestamp,unit,source_name,latitude,longitude,averaged_over_in_hours,location_geom
0,"Borówiec, ul. Drapałka",Borówiec,PL,bc,0.85217,2022-04-28 07:00:00+00:00,µg/m³,GIOS,1.0,52.276794,17.074114,POINT(52.276794 1)
1,"Kraków, ul. Bulwarowa",Kraków,PL,bc,0.91284,2022-04-27 23:00:00+00:00,µg/m³,GIOS,1.0,50.069308,20.053492,POINT(50.069308 1)
2,"Płock, ul. Reja",Płock,PL,bc,1.41,2022-03-30 04:00:00+00:00,µg/m³,GIOS,1.0,52.550938,19.709791,POINT(52.550938 1)
3,"Elbląg, ul. Bażyńskiego",Elbląg,PL,bc,0.33607,2022-05-03 13:00:00+00:00,µg/m³,GIOS,1.0,54.167847,19.410942,POINT(54.167847 1)
4,"Piastów, ul. Pułaskiego",Piastów,PL,bc,0.51,2022-05-11 05:00:00+00:00,µg/m³,GIOS,1.0,52.191728,20.837489,POINT(52.191728 1)


In [2]:
# # JM_df
# jmdf = client.list_rows(table, max_results=1_000_000).to_dataframe()
# print(f'{jmdf.shape[0]:,}, {jmdf.shape[1]}')
# jmdf.iloc[[0, 9, -9, -1]]

### Ex. 1) Units of measurement
- Wich countries have reported pollution levels in units of 'ppm'? 

In [8]:
# Query to select countries with units of "ppm"
first_query = '''
    SELECT country
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE unit='ppm' ''' 

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
first_query_job = client.query(first_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
first_results = first_query_job.to_dataframe()

# View top few rows of results
print(first_results.head())

  country
0      IL
1      IL
2      AR
3      IL
4      AR


You got the right countries. Nice job! Some countries showed up many times in the results. To get each country only once you can run `SELECT DISTINCT country ...`. The DISTINCT keyword ensures each column shows up once, which you'll want in some cases.
##### Or to get each country just once, you could use
first_query = """
              SELECT DISTINCT count    ry
              FROM `bigquery-public-data.openaq.global_air_quali    ty`
              WHERE unit = "      """

### Ex. 2) High air quality
- Which pollution levels were reported to be exactly 0?

In [10]:
zero_pollution_query = '''
    SELECT *
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE value=0 '''
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
zpq_job = client.query(zero_pollution_query, job_config=safe_config)
zero_pollution_results = zpq_job.to_dataframe()    # this is my 'df'
zero_pollution_results.iloc[[0, 5, 9. -9, -5, -1]]

Unnamed: 0,location,city,country,pollutant,value,timestamp,unit,source_name,latitude,longitude,averaged_over_in_hours,location_geom
0,"Zielonka, Bory Tucholskie",Zielonka,PL,bc,0.0,2022-04-29 14:00:00+00:00,µg/m³,GIOS,1.0,53.662136,17.933986,POINT(53.662136 1)
5,"Kraków, ul. Bulwarowa",Kraków,PL,bc,0.0,2022-05-12 10:00:00+00:00,µg/m³,GIOS,1.0,50.069308,20.053492,POINT(50.069308 1)
0,"Zielonka, Bory Tucholskie",Zielonka,PL,bc,0.0,2022-04-29 14:00:00+00:00,µg/m³,GIOS,1.0,53.662136,17.933986,POINT(53.662136 1)
192702,City Hall - Durban-NAQI,eThekwini Metro,ZA,pm25,0.0,2022-05-14 19:00:00+00:00,µg/m³,South Africa,1.0,-29.858283,31.027286,POINT(-29.858283 1)
192706,Stellenboch,Cape Winelands,ZA,pm25,0.0,2022-05-10 14:00:00+00:00,µg/m³,South Africa,1.0,-33.927762,18.857242,POINT(-33.927762 1)


That query wasn't too complicated, and it got the data you want. But these SELECT queries don't organizing data in a way that answers the most interesting questions. For that, we'll need the GROUP BY command.

If you know how to use groupby() in pandas, this is similar. But BigQuery works quickly with far larger datasets.