# GCP Professional Data Engineer
### Serverless Data Analysis with Google BigQuery and Cloud Dataflow
#### Modules:
- Serverless Data Analysis with BigQuery
- Autoscaling Data Processing Pipelines with Dataflow

#### Learning Objectives:

- Build up a complex BigQuery query using clauses, inner selects (subqueries), built-in functions, and joins
- Load and export data to/from BigQuery
- Identify the need for nested and repeated fields and for user-defined functions

## Module 1: Serverless Data Analysis with BigQuery
#### Topics:
- Queries
- Functions
- Load & export data
- Nested and repeated fields
- Window functions
- User-defined functions

### BigQuery Overview
#### BigQuery Benefits:
- Interactive analysis of petabyte-scale datasets
- Familiar SQL:2011 query language
- Nested and repeated fields; user-defined functions in JavaScript
- Data storage is inexpensive

#### BigQuery Sample Architecture
##### Project (billing, top-level container)
- Limit access to datasets and jobs
- Manage billing

##### Dataset (organization, access control)
- Access Control Lists for Reader/Writer/Owner
- Applied to all tables/views in dataset

##### Table (data w/ schema)
- Columnar storage
- Views are virtual tables defined by a SQL query
- Tables can be external (Cloud Storage, etc.)
- Each column is stored in a separate, encrypted file
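The columnar layout noted above can be pictured with a toy model. This is only a sketch under stated assumptions: the sample rows and the helper names `to_columns` and `select_columns` are hypothetical, and plain Python lists stand in for BigQuery's actual (encrypted, per-column) files.

```python
# Hypothetical sample rows, laid out the way a row-oriented store would keep them.
ROWS = [
    {"airline": "AA", "date": "2008-05-13", "departure_delay": 12},
    {"airline": "DL", "date": "2008-05-13", "departure_delay": -2},
    {"airline": "AA", "date": "2008-05-14", "departure_delay": 0},
]


def to_columns(rows):
    """Pivot row dictionaries into one list of values per column name."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def select_columns(columns, names):
    """Return only the requested columns; the others are never read."""
    return {name: columns[name] for name in names}


columns = to_columns(ROWS)
projection = select_columns(columns, ["airline", "departure_delay"])
print(projection)
```

In this model a query that names two columns touches two lists and ignores the `date` list entirely, which is the intuition behind listing only the columns a query needs.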

##### Jobs (query, import, export, copy)
- Repeated or long-running actions
- Can be cancelled

### Lab: Building a BigQuery Query
#### Objectives:
- Create and run a query
- Modify the query to add clauses, subqueries, built-in functions and joins.

#### Task 1: Create and Run a Query
- In the Console, on the Products & services menu, click BigQuery. Click the Compose Query button at the top left, then click Show Options and ensure you are using Standard SQL. You are using Standard SQL if the Use Legacy SQL checkbox is unchecked.
- Click Hide Options.
- In the New Query window, type (or copy-and-paste) the following query:

```sql
# flights that departed late from LGA (first 100 rows)
SELECT
  airline,
  date,
  departure_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
LIMIT
  100
```

#### Task 2: Aggregate and Boolean Functions
- In the New Query window, type the following queries, one at a time:

```sql
# total number of flights departed from LGA on 2008-05-13, per airline
# (COUNT(column) counts only rows where the column is not NULL)
SELECT
  airline,
  COUNT(departure_delay)
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY
  airline;

# total number of late flights from LGA on 2008-05-13, per airline
SELECT
  airline,
  COUNT(departure_delay)
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY
  airline;

# total number of flights AND total delayed flights in a single query
SELECT
  f.airline,
  COUNT(f.departure_delay) AS total_flights,
  SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
WHERE
  f.departure_airport = 'LGA'
  AND f.date = '2008-05-13'
GROUP BY
  f.airline;
```
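The last query of Task 2 combines two counts in one pass: `COUNT(column)` for rows with a non-NULL value, and a sum of 0/1 flags for rows that satisfy a condition. The following is a minimal offline sketch of that pattern, not the lab's method: it uses Python's built-in `sqlite3` as a stand-in for BigQuery, `CASE WHEN` as a portable substitute for BigQuery's `IF()`, and a hypothetical five-row `flights` table.

```python
import sqlite3

# Hypothetical toy rows: (airline, departure_delay). None is stored as NULL.
ROWS = [
    ("AA", 12.0),
    ("AA", -3.0),
    ("AA", None),
    ("DL", 0.0),
    ("DL", 45.0),
]


def delay_summary(rows):
    """Return (airline, total_flights, num_delayed) tuples, one per airline."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (airline TEXT, departure_delay REAL)")
    conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT
          airline,
          COUNT(departure_delay) AS total_flights,
          SUM(CASE WHEN departure_delay > 0 THEN 1 ELSE 0 END) AS num_delayed
        FROM flights
        GROUP BY airline
        ORDER BY airline
        """
    ).fetchall()
    conn.close()
    return result


summary = delay_summary(ROWS)
print(summary)
```

The AA row with a NULL delay is excluded from `total_flights` and contributes 0 to `num_delayed`, so the two aggregates stay consistent with each other.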

#### Task 3: String Operations, Joins, & Subqueries

```sql
# rainy days at weather station 725030, formatted as YYYY-MM-DD
SELECT
  CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
  `bigquery-samples.weather_geo.gsod`
WHERE
  station_number = 725030
  AND total_precipitation > 0;

# join weather data and flight information
SELECT
  f.airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE
  f.arrival_airport = 'LGA'
GROUP BY
  f.airline;

# fraction of flights arriving late at LGA on rainy days, per airline
SELECT
  airline,
  num_delayed,
  total_flights,
  num_delayed / total_flights AS frac_delayed
FROM (
  SELECT
    f.airline AS airline,
    SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
    COUNT(f.arrival_delay) AS total_flights
  FROM
    `bigquery-samples.airline_ontime_data.flights` AS f
  JOIN (
    SELECT
      CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
    FROM
      `bigquery-samples.weather_geo.gsod`
    WHERE
      station_number = 725030
      AND total_precipitation > 0) AS w
  ON
    w.rainyday = f.date
  WHERE
    f.arrival_airport = 'LGA'
  GROUP BY
    f.airline
  )
ORDER BY
  frac_delayed ASC;
```
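The logic of the Task 3 queries can be traced in plain Python to see what each stage contributes: building a zero-padded date string, keeping only flights whose date matches a rainy day, then dividing delayed by total per airline. This is a sketch under stated assumptions, not the lab's implementation: the toy `WEATHER` and `FLIGHTS` rows and the function names are hypothetical, and it assumes at most one weather row per day (a set is used, so duplicate days would not multiply flights the way a SQL join would).

```python
from collections import defaultdict


def date_key(year, month, day):
    """Build 'YYYY-MM-DD', zero-padding month and day like LPAD(..., 2, '0')."""
    return f"{year}-{month:02d}-{day:02d}"


# Hypothetical toy data.
WEATHER = [  # (year, month, day, total_precipitation)
    (2008, 5, 13, 0.4),
    (2008, 5, 14, 0.0),
    (2008, 11, 2, 1.1),
]
FLIGHTS = [  # (airline, date, arrival_delay); None models a NULL delay
    ("AA", "2008-05-13", 20.0),
    ("AA", "2008-05-13", -5.0),
    ("AA", "2008-05-14", 30.0),  # dry day, dropped by the join
    ("DL", "2008-11-02", 10.0),
    ("DL", "2008-05-13", None),  # NULL delay, not counted
]


def delayed_fraction_on_rainy_days(weather, flights):
    """Return (airline, num_delayed, total_flights, frac_delayed), ascending by fraction."""
    rainy_days = {date_key(y, m, d) for y, m, d, rain in weather if rain > 0}
    total = defaultdict(int)
    delayed = defaultdict(int)
    for airline, date, delay in flights:
        if date not in rainy_days or delay is None:
            continue
        total[airline] += 1
        if delay > 0:
            delayed[airline] += 1
    # Simplification: airlines with no counted flights are omitted, avoiding division by zero.
    rows = [(a, delayed[a], total[a], delayed[a] / total[a]) for a in total]
    return sorted(rows, key=lambda row: row[3])


report = delayed_fraction_on_rainy_days(WEATHER, FLIGHTS)
print(report)
```

The zero padding matters because the join compares strings: without it, May 13 would be built as `2008-5-13` and would never equal the `2008-05-13` stored in the flights table.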

### End Lab
### Lab: Loading & Exporting Data
### Lab: Advanced SQL Queries
### Module 1 Review

## Module 2: Autoscaling Data Processing Pipelines with Dataflow
#### Topics:
- Pipeline concepts
- MapReduce
- Side inputs
- Streaming