### Challenges associated with streaming data

#### Ingesting variable volumes
- Handle massive volumes of streaming events, including spiky or bursty traffic, with high availability and durability
- Cloud Pub/Sub (Ingest)

#### Late data, unordered data
- How to deal with latency, late-arriving records, and speculative results
- Cloud Dataflow (Processing & Imperative Analysis)
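
To make the late and unordered data problem concrete, here is a minimal pure-Python sketch (not the Dataflow or Apache Beam API) that groups events into fixed windows by their event time rather than by arrival order. The 60-second window size, the function names, and the sample event times are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECS = 60  # assumed window size, for illustration only


def window_start(event_time_secs, window_secs=WINDOW_SECS):
    """Return the start of the fixed window that contains the event time."""
    return event_time_secs - (event_time_secs % window_secs)


def count_per_window(event_times, window_secs=WINDOW_SECS):
    """Count events per fixed window, keyed by window start time."""
    counts = defaultdict(int)
    for t in event_times:
        counts[window_start(t, window_secs)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical event times (seconds), listed in arrival order.
    # The events at 59 and 10 arrive late, after an event from a later window.
    arrivals = [5, 130, 59, 61, 10]
    print(count_per_window(arrivals))
```

Because each event is placed by its own timestamp, the per-window counts are the same whether the events arrive in order or not. What this sketch leaves out is deciding when a window is complete enough to emit, which is the harder part of handling late data.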

#### Real-time insights
- continuous query processing, visualization, analytics, etc.
- Google BigQuery (Durable storage and interactive analysis)

## Module 1 Review

1.) Dataflow offers the following features, which make it easy to create resilient streaming pipelines when working with unbounded data:
- Ability to flexibly reason about time
- Control messages to ensure correctness

2.) Match the GCP product with its role when designing streaming systems
- Pub/Sub: Global messaging queue
- Dataflow: Controls to handle late-arriving and out-of-order data
- Bigtable: Latency on the order of milliseconds when querying against overwhelming volume
- BigQuery: Query data as it arrives from streaming pipelines

## Lab: Publish Streaming Data into Pub/Sub
#### Objectives:
- Create a Pub/Sub topic and subscription
- Simulate your traffic sensor data into Pub/Sub

#### Task 1: Preparation
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under 'connect', click SSH to open a terminal window.
- In this lab you will enter CLI commands on the training_vm.
- The training_vm is installing software in the background. Verify that setup is complete by checking that the `/training` directory exists. If it does not exist, wait a few minutes and try again.
- A repository has been downloaded to the VM. Copy the repository to your home directory.

```
# verify that setup has finished (the directory must exist)
ls /training
# copy the repository to your home directory
cp -r /training/training-data-analyst/ .
```
- On the training_vm SSH terminal, set the DEVSHELL_PROJECT_ID environment variable and export it so it will be available to other shells.
```
export DEVSHELL_PROJECT_ID=<project-id>
```

#### Task 2: Create Pub/Sub topic and subscription
- On the training_vm SSH terminal, navigate to the directory for this lab, then create a topic, publish a message, and pull it through a subscription.
```
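cd ~/training-data-analyst/courses/streaming/publish
# create the topic and publish a first message
gcloud pubsub topics create sandiego
gcloud pubsub topics publish sandiego --message "hello"
# create a subscription and pull; "hello" was published before the
# subscription existed, so this pull is expected to return nothing
gcloud pubsub subscriptions create --topic sandiego mySub1
gcloud pubsub subscriptions pull --auto-ack mySub1
# try again: publish a new message, then pull it
gcloud pubsub topics publish sandiego --message "hello again"
gcloud pubsub subscriptions pull --auto-ack mySub1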
```

- Return to the Console tab. On the Navigation menu, click Pub/Sub > Topics.
- You should see a line with the Topic Name ending in sandiego and the number of Subscriptions set to 1.
- In the training_vm SSH terminal, delete your subscription.

```
gcloud pubsub subscriptions delete mySub1
```

#### Task 3: Simulate traffic sensor data into Pub/Sub
- Explore the Python script that simulates San Diego traffic sensor data. Do not make any changes to the code.
```
cd ~/training-data-analyst/courses/streaming/publish
nano send_sensor_data.py
```
- Download the traffic simulation dataset, install the Pub/Sub client library, and start the simulator.
```
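# download the recorded sensor data
./download_data.sh
# install pip and the Pub/Sub client library
sudo apt-get install -y python-pip
sudo pip install -U google-cloud-pubsub
# replay the recorded data into Pub/Sub at 60x speed
./send_sensor_data.py --speedFactor=60 --project $DEVSHELL_PROJECT_ID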
```

- This command simulates sensor data by replaying recorded sensor readings as Pub/Sub messages. The script extracts the original timestamp of each reading and pauses between messages to reproduce realistic timing. The speedFactor value scales the time between messages proportionally, so a speedFactor of 60 means '60 times faster' than the recorded timing: about an hour of data is sent every 60 seconds.
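
The timing arithmetic described above can be sketched in a few lines of Python. This is not the code from `send_sensor_data.py`. The function names, the timestamp format, and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout, illustration only


def compute_sleep_secs(first_event, event, speed_factor):
    """Seconds after simulation start at which `event` should be sent."""
    elapsed = (event - first_event).total_seconds()
    return elapsed / speed_factor


def schedule(timestamps, speed_factor):
    """Return (timestamp, send_offset_secs) pairs for timestamp strings."""
    events = [datetime.strptime(ts, TIME_FORMAT) for ts in timestamps]
    first = events[0]
    return [
        (ts, compute_sleep_secs(first, ev, speed_factor))
        for ts, ev in zip(timestamps, events)
    ]


if __name__ == "__main__":
    # Hypothetical recorded timestamps spanning one hour.
    recorded = [
        "2008-11-01 00:00:00",
        "2008-11-01 00:05:00",
        "2008-11-01 01:00:00",
    ]
    for ts, offset in schedule(recorded, speed_factor=60):
        print(f"{ts} -> send at +{offset:.0f}s")
```

With a speed factor of 60, a reading recorded one hour after the first is scheduled 60 seconds after the simulation starts, which matches the "an hour of data every 60 seconds" description.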

#### Task 4: Verify that messages are received
- In the Console, on the Navigation menu, click Compute Engine > VM instances.
- Locate the line with the instance called training_vm.
- On the far right, under 'connect', click SSH to open a second terminal window.
- Change into the directory you were working in, then create a second subscription and pull from it to confirm that messages are arriving:
```
cd ~/training-data-analyst/courses/streaming/publish
# create a second subscription and pull a message from it
gcloud pubsub subscriptions create --topic sandiego mySub2
gcloud pubsub subscriptions pull --auto-ack mySub2
# delete the subscription
gcloud pubsub subscriptions delete mySub2
exit
```

## End Lab

## Module 2 Review

1.) Which of the following about Cloud Pub/Sub is NOT true?
- Pub/Sub stores your messages indefinitely until you need them (not true: unacknowledged messages are retained only for a limited period, 7 days by default)

Pub/Sub does:
- Simplify systems by removing the need for every component to speak to every component
- Connect applications and devices through a messaging infrastructure

2.) Cloud Pub/Sub guarantees that messages are delivered in the order they were received
- False
(Pub/Sub does not guarantee delivery order by default; a downstream consumer such as Dataflow can use message timestamps to process events in the correct order)

3.) Which of the following about Cloud Pub/Sub topics and subscriptions are true?
- 1 or more publishers can write to the same topic
- 1 or more subscribers can request from the same subscription
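
The topic and subscription semantics can be illustrated with a toy in-memory model. This is a sketch for intuition only, not the Pub/Sub client library. The `Topic` class and its methods are hypothetical, and acknowledgements and redelivery are left out.

```python
from collections import deque


class Topic:
    """Toy model: each subscription keeps its own queue of messages."""

    def __init__(self):
        self.subscriptions = {}

    def create_subscription(self, name):
        # A new subscription starts empty; it does not see earlier messages.
        self.subscriptions[name] = deque()

    def publish(self, message):
        # Any publisher may call this; every existing subscription gets a copy.
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, name):
        # Subscribers sharing one subscription split its messages:
        # each message is handed out once per subscription.
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    topic = Topic()
    topic.publish("hello")            # published before any subscription exists
    topic.create_subscription("mySub1")
    print(topic.pull("mySub1"))       # nothing to pull yet
    topic.publish("hello again")
    print(topic.pull("mySub1"))       # the message published after creation
```

The model mirrors the behavior seen in the lab: a message published before a subscription is created is not delivered to it, while later messages are.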

4.) Which of the following delivery methods is ideal for subscribers needing close to real time performance?
- Push delivery 

