# Traffic Data Analytics Tutorial
In this walkthrough, you will use Pub/Sub, Dataflow, BigQuery, and App Engine to ingest data from an API provided by the City of Chicago and build a service that gives users a global view of Chicago's traffic situation.

# Setup Pub/Sub
Pub/Sub brings message-oriented middleware to the cloud, providing many-to-many, asynchronous messaging that decouples senders and receivers.

We will pull data from the Chicago Traffic Tracker API and send it to Pub/Sub. The two datasets are:
* https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Se/n4j6-wkkf
* https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Re/t2qc-9pjd
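
Both datasets are hosted on the City of Chicago's Socrata portal. As a rough sketch, assuming the portal's standard SODA endpoint pattern (`/resource/<dataset id>.json`) and its `$limit` query parameter, the request URL for each dataset could be built as follows. The helper `build_resource_url` and the `DATASET_IDS` mapping are illustrative names, not taken from the tutorial's repository.

```python
from urllib.parse import urlencode

# Dataset IDs taken from the two portal links above.
DATASET_IDS = {
    "segments": "n4j6-wkkf",
    "regions": "t2qc-9pjd",
}

# Portal host from the links above.
BASE_URL = "https://data.cityofchicago.org"


def build_resource_url(dataset, limit=1000):
    """Return the JSON endpoint URL for a dataset (assumed SODA URL pattern)."""
    dataset_id = DATASET_IDS[dataset]
    query = urlencode({"$limit": limit})  # "$" is percent-encoded by urlencode
    return f"{BASE_URL}/resource/{dataset_id}.json?{query}"


if __name__ == "__main__":
    print(build_resource_url("regions", limit=5))
```

The resulting URL can be passed to any HTTP client; the response is expected to be a JSON array of records.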

First, make sure you are logged in to Google Cloud and working in the right project:
```
gcloud auth application-default login
```

Create the two topics:
```
gcloud beta pubsub topics create chicagoregions
gcloud beta pubsub topics create chicagosegments
```

Make sure the topics were successfully created:
![topics](topics.png)

Make sure your Pub/Sub service is working. Create the subscription before publishing, because a subscription only receives messages published after it was created:
```
gcloud beta pubsub subscriptions create --topic chicagoregions mySub1
gcloud beta pubsub topics publish chicagoregions "hello"
gcloud beta pubsub subscriptions pull --auto-ack mySub1
```

If you get the message "hello", your subscription is working.
You can delete it now:
```
gcloud beta pubsub subscriptions delete mySub1
```

Clone the App Engine app that fetches the traffic data and publishes it to Pub/Sub:
```
git clone https://github.com/nrkfeller/traffic_chicago
```
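
The repository contains the actual implementation. The following is only a minimal sketch of the publishing step, assuming the Pub/Sub REST API's publish request format, in which each message's `data` field is a base64-encoded string. The function name, the `source` attribute, and the sample record fields are hypothetical.

```python
import base64
import json


def build_publish_body(records, source):
    """Wrap each record in a Pub/Sub message for a REST publish request.

    The REST API expects the payload in `data` as a base64 string;
    `attributes` is an optional string-to-string map.
    """
    messages = []
    for record in records:
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        messages.append({
            "data": base64.b64encode(payload).decode("ascii"),
            "attributes": {"source": source},  # hypothetical attribute
        })
    return {"messages": messages}


if __name__ == "__main__":
    # Hypothetical record; real field names come from the traffic API.
    sample = [{"region": "Example Region", "current_speed": "25.5"}]
    print(json.dumps(build_publish_body(sample, "chicagoregions"), indent=2))
```

Publishing one message per record keeps each Pub/Sub message a single JSON object, which is the shape the BigQuery export later in this tutorial expects.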

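Deploy the two services. The commands below assume your current directory is the one that contains ```traffic_regions``` and ```traffic_segments```: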
```
cd traffic_regions
gcloud app deploy app.yaml cron.yaml
cd ../traffic_segments
gcloud app deploy app.yaml
```

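The services should take a few seconds to boot up. Then check that messages are actually reaching Pub/Sub. Note that ```pull``` takes a subscription name, so if your project has no subscriptions named ```chicagoregions``` and ```chicagosegments```, first create them with ```gcloud beta pubsub subscriptions create --topic <topic> <subscription>```, as in the test above, and wait for the app to publish again.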
```
gcloud beta pubsub subscriptions pull --auto-ack chicagoregions
gcloud beta pubsub subscriptions pull --auto-ack chicagosegments
```

If you see a message with DATA, MESSAGE_ID and ATTRIBUTES, everything is working.
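
For reference, here is a small sketch of how those three fields could be unpacked programmatically. It assumes the shape of an entry in the Pub/Sub REST pull response (`receivedMessages`), where `data` is base64-encoded, and it assumes the payload is a JSON object. The function name and the sample values are hypothetical.

```python
import base64
import json


def decode_received_message(received):
    """Extract the ID, attributes, and JSON payload from one pulled message.

    `received` follows the shape of one entry in the REST pull response's
    `receivedMessages` list: {"ackId": ..., "message": {...}}.
    """
    message = received["message"]
    raw = base64.b64decode(message.get("data", ""))
    return {
        "message_id": message.get("messageId"),
        "attributes": message.get("attributes", {}),
        "data": json.loads(raw) if raw else None,
    }


if __name__ == "__main__":
    # Hypothetical pulled message for illustration only.
    example = {
        "ackId": "example-ack-id",
        "message": {
            "data": base64.b64encode(b'{"region": "Example"}').decode("ascii"),
            "messageId": "123",
            "attributes": {"source": "chicagoregions"},
        },
    }
    print(decode_received_message(example))
```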

# Setup Bigquery
Cheap, scalable data warehouse for ad hoc analysis.

What is BigQuery:
* Fully managed data warehouse
* Fast, petabyte-scale, SQL-like queries
* Provides streaming ingest into unbounded data sets
* Encrypted, durable, and highly available
* Virtually unlimited resources on a pay-for-what-you-use basis

Create a BigQuery dataset and tables

Go to https://bigquery.cloud.google.com

Make sure you are logged in to the right project. The project ID should appear in the URL.

Create a dataset named ```demos```, as shown in the image below.
![create table](https://cloud.google.com/solutions/images/etlscreenshot-001.png)

Create two tables in the dataset, using the little blue + sign next to your new dataset.
* Make sure one of them is ```<project id>:demos.regions``` and the other is ```<project id>:demos.segments```
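
The table names above follow the `<project>:<dataset>.<table>` form, which is reused later in the export form. A minimal sketch of splitting such a string into its parts is shown below; `parse_table_spec` is a hypothetical helper, and `my-project` is a placeholder project ID.

```python
def parse_table_spec(spec):
    """Split '<project>:<dataset>.<table>' into its three parts."""
    # Split on the last ':' so project IDs that contain ':' still work.
    project, _, rest = spec.rpartition(":")
    dataset, _, table = rest.partition(".")
    if not (project and dataset and table):
        raise ValueError(f"expected <project>:<dataset>.<table>, got {spec!r}")
    return project, dataset, table


if __name__ == "__main__":
    # "my-project" is a placeholder, not a real project ID.
    print(parse_table_spec("my-project:demos.regions"))
```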

Enter the field name and data type for each column you want in your tables. Make sure the names and data types are in line with the traffic APIs:
* https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Se/n4j6-wkkf
* https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Re/t2qc-9pjd
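
To reduce typing mistakes, a first draft of the column list can be derived from one sample record. The sketch below assumes BigQuery's JSON schema layout (a list of objects with `name`, `type`, and `mode`); the helper names and the sample record are hypothetical, and the draft should still be checked against the API documentation, since an API may return numbers as strings.

```python
import json


def bigquery_type(value):
    """Map a Python value to a BigQuery column type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def draft_schema(record):
    """Build a draft schema (one NULLABLE column per key) from a record."""
    return [
        {"name": name, "type": bigquery_type(value), "mode": "NULLABLE"}
        for name, value in record.items()
    ]


if __name__ == "__main__":
    # Hypothetical record; real field names come from the traffic API.
    sample = {"region": "Example Region", "region_id": 1, "current_speed": 25.5}
    print(json.dumps(draft_schema(sample), indent=2))
```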

All done!

# Setup Dataflow
Dataflow is a distributed processing backend. Dataflow uses Beam as a programming model to define and execute data pipelines. Beam pipelines can run on Apex, Flink, Spark, or Dataflow. Data can be processed either as a stream or as a batch (for example, a bulk DB migration). Beam pipelines can be defined in Java and Python.

What is Dataflow:
* Batch or streaming (hence the name Beam: Batch + strEAM)
* Unified batch and streaming processing
* Fully managed, no-ops data processing
* Open source programming model (Beam)
* Intelligently scales to millions of QPS

### Connect Pub/Sub to BigQuery using Dataflow

Navigate to the Pub/Sub page in the Google Cloud console:
https://console.cloud.google.com/cloudpubsub/

Click on the regions topic ```chicagoregions```

Find the 'export to bigquery' button
![export](export.png)

Fill in the export to BigQuery form. There are only two mandatory fields:
* BigQuery table location ```(<project>:<dataset>.<table_name>)``` to write the output to. The table's schema must match the input JSON objects.
* Temporary location: a path and filename prefix for writing temporary files, e.g. ```gs://<bucket name>/tmp```
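
Because the table's schema must match the incoming JSON objects, it can help to compare one message against the column list before starting the job. The following is a minimal sketch of such a check; the function name, the schema, and the sample message are hypothetical, and it only compares field names, not data types.

```python
import json


def schema_mismatches(message_json, schema):
    """Compare a JSON message's keys with the table's column names."""
    record = json.loads(message_json)
    columns = {field["name"] for field in schema}
    keys = set(record)
    return {
        "missing_from_message": sorted(columns - keys),
        "not_in_table": sorted(keys - columns),
    }


if __name__ == "__main__":
    # Hypothetical schema and message for illustration only.
    schema = [
        {"name": "region", "type": "STRING"},
        {"name": "current_speed", "type": "FLOAT"},
    ]
    message = '{"region": "Example Region", "speed": 25.5}'
    print(schema_mismatches(message, schema))
```

Two empty lists mean the field names line up; anything listed under `not_in_table` points at a column that still needs to be added or renamed.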

Navigate to the dataflow tab: https://console.cloud.google.com/dataflow

Find your job and click on it.
![jobs](jobs.png)

Make sure it is running. It takes a minute or so to boot up; check that there are no errors.
![running](running.png)

Then repeat the same steps for the ```chicagosegments``` topic.


# Setup Webapp
The last step is to deploy the web app that presents the traffic data.

Navigate to the webapp directory
```
cd traffic_chicago/webapp/
```

Deploy and that's it!
```
gcloud app deploy
```

Navigate to the app engine / services tab on your console
https://console.cloud.google.com/appengine/services

After the deployment is complete, click on the webapp service:
![webapp](webapp.png)

# Build Cool Analytics Visualizations
* Google Data Studio to visualize BigQuery data: https://codelabs.developers.google.com/codelabs/cpb104-bigquery-datastudio/#0
* Google Datalab to visualize BigQuery data: https://codelabs.developers.google.com/codelabs/cpb100-datalab/index.html?index=..%2F..%2Findex

# Don't Forget to Delete All Services
* Pub/Sub topics: https://console.cloud.google.com/cloudpubsub/topicList
* All App Engine services: https://console.cloud.google.com/appengine/services
* Dataflow jobs: https://console.cloud.google.com/dataflow
* BigQuery tables: https://bigquery.cloud.google.com/project/

# References
* Full pipeline (Pub/Sub -> Dataflow -> BigQuery): https://www.youtube.com/watch?v=kdmAiQeYGgE
* Pub/Sub: https://cloud.google.com/pubsub/docs/overview
* Dataflow: https://cloud.google.com/dataflow/
* BigQuery: https://cloud.google.com/bigquery/
* Data Studio: https://www.youtube.com/watch?v=FwpjBp-MgHk
* Big data with GCP: https://cloud.google.com/solutions/big-data/stream-analytics/
* Apache Beam: https://beam.apache.org/documentation/