# Simple Dataflow Pipeline

## Overview

Duration is 1 min

In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.

### What you learn

In this lab, you learn how to:

* Write a simple pipeline in Python

* Execute the pipeline on the local machine

* Execute the pipeline on the cloud

## Introduction
Duration is 1 min

The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.

## Setup

## Open Dataflow project

Duration is 3 min

### Step 1
Start Cloud Shell and clone the source repository, which contains the starter scripts for this lab:
```
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
```
Then navigate to the code for this lab:
```
cd training-data-analyst/courses/data_analysis/lab2/python
```

### Step 2
Install the necessary dependencies for running Dataflow pipelines in Python:
```
sudo ./install_packages.sh
```
Verify that you have the right version of pip (it should be newer than 8.0):
```
pip -V
```
If the version is older, open a new Cloud Shell tab; the new session should pick up the updated pip.
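If you would rather script the version check than read it by eye, the following is a minimal sketch. It assumes that `pip -V` prints a line beginning with `pip <version>`; the sample strings and the function names are illustrative, not part of the lab's scripts.

```python
import re


def parse_pip_version(output):
    """Extract the version tuple from `pip -V` style output."""
    match = re.match(r"pip (\d+(?:\.\d+)*)", output.strip())
    if match is None:
        raise ValueError("unrecognized pip -V output: %r" % output)
    parts = [int(p) for p in match.group(1).split(".")]
    # Drop trailing zeros so that 8.0 and 8.0.0 compare as equal.
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def pip_is_new_enough(output, minimum=(8,)):
    """Return True if the reported version is strictly newer than `minimum`."""
    return parse_pip_version(output) > minimum


# Illustrative sample line; your own path and Python version will differ.
sample = "pip 9.0.1 from /usr/local/lib/python2.7/dist-packages (python 2.7)"
print(parse_pip_version(sample), pip_is_new_enough(sample))
```

In practice you would paste the line printed by `pip -V` in place of `sample`.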

## Pipeline filtering

Duration is 5 min

### Step 1
View the source code for the pipeline using the Cloud Shell file browser. In the file directory, navigate to /training-data-analyst/courses/data_analysis/lab2/python and find grep.py.

Alternatively, navigate to the directory directly and view the file using nano:
```
nano grep.py
```

### Step 2
What files are being read? ____*.java____

What is the search term? ____import____

Where does the output go? ____/tmp/output____

There are three transforms in the pipeline:

What does the first transform do? ________________

What does the second transform do? ________________

Where does its input come from? ____training-data-analyst/courses/data_analysis/lab2/javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/____

What does it do with this input? ____finds all lines that start with "import"____

What does it write to its output? ________________

Where does the output go to? ________________

What does the third transform do? ________________
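
To make the read, filter, write shape of such a pipeline concrete, here is a minimal sketch of a three-transform grep pipeline using the Apache Beam Python SDK. It is not a copy of grep.py: the function names, step labels, and parameters are assumptions chosen for illustration. The filter function is plain Python, so it can be tried on sample lines without running a pipeline.

```python
def my_grep(line, term):
    """Yield the line only if it starts with the search term."""
    if line.startswith(term):
        yield line


def run(input_pattern, output_prefix, term="import"):
    """Read -> filter -> write, expressed as three chained Beam transforms."""
    # Imported lazily so that my_grep can be tried out without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "GetJava" >> beam.io.ReadFromText(input_pattern)  # 1: read lines
         | "Grep" >> beam.FlatMap(my_grep, term)             # 2: keep matches
         | "Write" >> beam.io.WriteToText(output_prefix))    # 3: write shards


# FlatMap flattens whatever the generator yields: zero or one line here.
sample = ["import java.util.List;", "public class Grep {", "  import x;"]
matches = [out for line in sample for out in my_grep(line, "import")]
print(matches)
```

Calling `run("*.java", "/tmp/output")` would execute the pipeline with the default local runner, assuming Beam is installed. Note that the indented third sample line is not matched, because `startswith` compares from the first character.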



## Execute the pipeline locally
Duration is 2 min

### Step 1
Execute locally:
```
python grep.py
```
Note: If you see an error that says 'No handlers could be found for logger "oauth2client.contrib.multistore_file"', you may ignore it. The message only means that logging from the oauth2 library will go to stderr.

### Step 2
Examine the output file:
```
cat /tmp/output-*
```
Does the output seem logical? ______________________

```
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
```
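
One way to answer the "does the output seem logical" question mechanically is to confirm that every output line starts with the search term. The sketch below does this for any set of shard files named `<prefix>-*`. The helper name and the demo shards are made up for illustration; the demo writes its own temporary files rather than reading real lab output.

```python
import glob
import os
import tempfile


def check_output(prefix, term="import"):
    """Scan every shard named <prefix>-* and collect lines not starting with term."""
    total, bad = 0, []
    for path in sorted(glob.glob(prefix + "-*")):
        with open(path) as shard:
            for line in shard:
                total += 1
                if not line.startswith(term):
                    bad.append(line.rstrip("\n"))
    return total, bad


# Demo on two made-up shards in a temporary directory (not real lab output).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "output-00000-of-00002"), "w") as f:
    f.write("import java.util.List;\nimport java.util.Map;\n")
with open(os.path.join(demo_dir, "output-00001-of-00002"), "w") as f:
    f.write("import org.joda.time.Duration;\npublic class Oops {}\n")

total, bad = check_output(os.path.join(demo_dir, "output"))
print(total, bad)
```

For the lab itself, `check_output("/tmp/output")` should report an empty list of offending lines if the filter behaved as expected.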

## Execute the pipeline on the cloud
Duration is 10 min

### Step 1
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

### Step 2
Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
```
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
```

### Step 3
Edit the Dataflow pipeline in grepc.py, either in the Cloud Shell in-browser editor or from the command line with nano:
```
nano grepc.py
```
Change the PROJECT and BUCKET variables to match your project and the bucket you created.
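
As a rough picture of why these two variables matter, the sketch below shows how a project and bucket name are typically turned into Beam pipeline options and Cloud Storage paths for a Dataflow run. It is an assumption about the general pattern, not a copy of grepc.py: the function name, the region default, the job name, and the placeholder values are all hypothetical.

```python
def dataflow_args(project, bucket, region="us-central1", job_name="examplejob"):
    """Build command-line style pipeline options for a Dataflow run."""
    staging = "gs://{0}/staging/".format(bucket)
    return [
        "--project={0}".format(project),
        "--region={0}".format(region),
        "--job_name={0}".format(job_name),
        "--staging_location={0}".format(staging),
        "--temp_location={0}".format(staging),
        "--runner=DataflowRunner",  # run on the Dataflow service, not locally
    ]


PROJECT = "my-project-id"   # hypothetical placeholder
BUCKET = "my-bucket-name"   # hypothetical placeholder

args = dataflow_args(PROJECT, BUCKET)
# Input and output now live in Cloud Storage instead of the local disk.
input_pattern = "gs://{0}/javahelp/*.java".format(BUCKET)
output_prefix = "gs://{0}/javahelp/output".format(BUCKET)
print(args)
```

A list like `args` can be passed to `beam.Pipeline(argv=args)`; the only structural difference from a local run is the runner and the `gs://` locations.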

### Step 4
Submit the Dataflow job to the cloud:
```
python grepc.py
```
Because this is such a small job, running it on the cloud takes significantly longer than running it locally: on the order of 2-3 minutes.

### Step 5
In the Cloud Console, open the navigation menu (the three bars at the top left), go to the Dataflow section, and look at the Jobs list. Select your job and monitor its progress.

### Step 6
Wait for the job status to change to Succeeded. At this point, Cloud Shell will display a command-line prompt again. In Cloud Shell, examine the output:
```
gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*
```

## What you learned

Duration is 1 min

In this lab, you:

* Executed a Dataflow pipeline locally
* Executed a Dataflow pipeline on the cloud

## End your lab