<a href="https://colab.research.google.com/github/mwvgroup/Pitt-Google-Broker/blob/u%2Ftjr%2Ftutorials/PGB_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1) Data Products

Architecture figure highlighting the data products available to users

Links to webpages:
- [BQ table descriptions and schemas]
- [Cloud Storage buckets]
- [Pub/Sub streams]

---

# 2) Setup

- Create GCP Project
- Enable billing
- Enable APIs

In [None]:
#--- Colab Setup
from google.colab import auth
auth.authenticate_user()  # follow the instructions to authorize Google Cloud SDK 
# connect your file system (local or Drive)
# example of down/uploading a file (also see file browser on left)


#--- Cloud Shell Setup


#--- Local machine setup


In [None]:
from google.cloud import bigquery
from google.cloud import storage


In [None]:
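project_id = 'ardent-cycling-243415'  # replace with your own GCP project ID (queries run in, and are billed to, this project)
project_id_pgb = 'ardent-cycling-243415'  # Pitt-Google Broker project ID (used below to build the bucket name)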

---

# 3) Query a BigQuery Database


## 3a) Python

Links with more info:
- [Documentation](https://googleapis.dev/python/bigquery/latest/index.html)
- [Colab Snippets](https://colab.research.google.com/notebooks/snippets/bigquery.ipynb#scrollTo=jl97S3vfNHdz)

In [None]:
# create a connection to BigQuery
bq_client = bigquery.Client(project=project_id)

# write the SQL query as a string
sql = """
    SELECT name, SUM(number) as count
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY count DESC
    LIMIT 10
"""

# make an API request
query_job = bq_client.query(sql)

In [None]:
# Option 1: dump results to a pandas.DataFrame
df = query_job.to_dataframe()
df

In [None]:
# Option 2: parse results row by row
print("The query data:")
for row in query_job:
    # row values can be accessed by field name or index
    print(f"name={row[0]}, count={row['count']}")
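
Under on-demand pricing, BigQuery charges by the number of bytes a query scans, so it can help to check a query's size before running it. The following is a minimal sketch of a dry run, which validates the query and reports the bytes it would process without executing it. The helper names `human_readable_bytes` and `dry_run_bytes` are hypothetical, introduced here for illustration; `QueryJobConfig(dry_run=True)` and `total_bytes_processed` come from the `google-cloud-bigquery` client.

```python
def human_readable_bytes(num_bytes):
    """Format a byte count using binary units (hypothetical helper)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def dry_run_bytes(sql, project):
    """Return the number of bytes `sql` would scan, without running it."""
    # Imported here so the formatting helper works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    # dry_run=True asks BigQuery to plan the query but not execute it;
    # disabling the cache makes the estimate reflect a full scan
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(sql, job_config=job_config)
    return query_job.total_bytes_processed
```

In the notebook this could be called as `print(human_readable_bytes(dry_run_bytes(sql, project_id)))`, assuming `sql` and `project_id` are defined as in the cells above.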

## 3b) Command-line tool `bq`

Links with more info:
- [Running queries from the bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool#running_queries_from_the) (also see other sections on this page)



```bash
#--- retrieve query results from command line
# (this cell is not executable)

bq query --use_legacy_sql=false \
'SELECT
  word,
  SUM(word_count) AS count
FROM
  `bigquery-public-data`.samples.shakespeare
WHERE
  word LIKE "%raisin%"
GROUP BY
  word'
```



---

# 4) Download a file from Cloud Storage

- grab cutouts from Cloud Storage


In [None]:
#--- Download an alert from GCS bucket

import os

bucket_name = f'{project_id_pgb}_ztf_alert_avro_bucket'
fname_prefix = '1605062619785'  # name prefix of a recently ingested Avro file (look it up in the logs)
local_dir = './testdownload'
delimiter = '/'

# make sure the destination directory exists before downloading
os.makedirs(local_dir, exist_ok=True)

# authenticate with a service account key file (path is relative to the working directory)
storage_client = storage.Client.from_service_account_json('GCPauth_pitt-google-broker-prototype-0679b75dded0.json')
bucket = storage_client.get_bucket(bucket_name)

# list all objects whose names start with the prefix
blobs = bucket.list_blobs(prefix=fname_prefix, delimiter=delimiter)

# download each matching object (one API call per object)
for blob in blobs:
    destination_path = os.path.join(local_dir, blob.name)
    blob.download_to_filename(destination_path)

---

# 5) Listen to Pub/Sub Streams

- Python
  - unpack into dict
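
The two steps listed above (listen in Python, unpack into a dict) might look roughly like the sketch below. It uses a synchronous pull from the `google-cloud-pubsub` client. Several things here are assumptions rather than facts about the broker: the function names `unpack_message` and `pull_alerts` are hypothetical, the subscription must already exist in your project, and the payload is assumed to be UTF-8 encoded JSON. If the stream carries Avro-encoded alerts instead, the `json.loads` call would need to be swapped for an Avro reader.

```python
import json


def unpack_message(data, attributes=None):
    """Decode one Pub/Sub payload (bytes) into a dict.

    Assumption: the payload is UTF-8 encoded JSON.
    """
    record = json.loads(data.decode("utf-8"))
    return {"data": record, "attributes": dict(attributes or {})}


def pull_alerts(project_id, subscription_id, max_messages=5):
    """Synchronously pull a few messages and return them as dicts."""
    # Imported here so unpack_message() can be used without the client library.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # request up to max_messages from the subscription
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": max_messages}
    )

    alerts, ack_ids = [], []
    for received in response.received_messages:
        alerts.append(
            unpack_message(received.message.data, received.message.attributes)
        )
        ack_ids.append(received.ack_id)

    # acknowledge the messages so Pub/Sub does not redeliver them
    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
    return alerts
```

A call such as `pull_alerts(project_id, 'my-subscription')` would return a list of dicts, where `'my-subscription'` is a placeholder for a subscription you have created.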

---

# 6) Filter or Process the Data (using Apache Beam)

We use the [Apache Beam](https://beam.apache.org/) (Python) SDK to write the data pipeline (see also [Apache Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/)). This has the following advantages:

1. Convenience: Native I/O functions for Pub/Sub, BigQuery, and Cloud Storage.
2. Flexibility: The same pipeline can accept streaming and batch inputs, so one pipeline can both process the live stream and reprocess the database.
3. Portability: The same pipeline can be executed in multiple environments (Google Cloud, AWS, local machine) via an execution "runner" ([Apache Flink](https://beam.apache.org/documentation/runners/flink/), [Apache Spark](https://beam.apache.org/documentation/runners/spark/), [Google Dataflow](https://beam.apache.org/documentation/runners/dataflow/), or [direct/local](https://beam.apache.org/documentation/runners/direct/)).

Links with more info:
- [Colab Snippets: Apache Beam](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-py.ipynb)
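
To illustrate the portability point, here is a minimal sketch in which the pipeline code stays fixed and only the runner flags change. The function names `runner_flags` and `count_words` are hypothetical; the flags `--runner`, `--project`, `--region`, and `--temp_location` are standard Beam pipeline options. It assumes `apache-beam` is installed (with the `gcp` extra if you target Dataflow).

```python
def runner_flags(runner="DirectRunner", project=None, region=None, temp_location=None):
    """Build the command-line style flags that select where a pipeline runs."""
    flags = [f"--runner={runner}"]
    if project:
        flags.append(f"--project={project}")
    if region:
        flags.append(f"--region={region}")
    if temp_location:
        flags.append(f"--temp_location={temp_location}")
    return flags


def count_words(input_pattern, output_prefix, flags):
    """Run the same word-count pipeline on whichever runner `flags` selects."""
    # Imported here so runner_flags() works without Beam installed.
    import re

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(flags)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read lines" >> beam.io.ReadFromText(input_pattern)
            | "Find words" >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
            | "Pair words with 1" >> beam.Map(lambda word: (word, 1))
            | "Group and sum" >> beam.CombinePerKey(sum)
            | "Format results" >> beam.Map(str)
            | "Write results" >> beam.io.WriteToText(output_prefix)
        )
```

Locally, `count_words('data/*', 'outputs/part', runner_flags())` uses the direct runner. Passing `runner_flags('DataflowRunner', project=project_id, region=..., temp_location=...)` would submit the same pipeline to Dataflow, in which case the input and output paths should point to Cloud Storage rather than the local file system.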

In [None]:
#--- Install Apache Beam (and other setup)

# Run and print a shell command.
def run(cmd):
  print('>> {}'.format(cmd))
  !{cmd}
  print('')

# Install apache-beam.
run('pip install --quiet apache-beam')

# Copy the input file into the local file system.
run('mkdir -p data')
run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/')

In [None]:
#--- Minimal word count (example)

import apache_beam as beam
import re

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part'

# Running locally in the DirectRunner.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
      | 'Find words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
      | 'Pair words with 1' >> beam.Map(lambda word: (word, 1))
      | 'Group and sum' >> beam.CombinePerKey(sum)
      | 'Format results' >> beam.Map(lambda word_count: str(word_count))
      | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )

# Sample the first 20 results (output order is not guaranteed).
run('head -n 20 {}-00000-of-*'.format(outputs_prefix))