##### Copyright 2020 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License").
<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->


# Example 2: Streaming Word Count

This example demonstrates how to set up a streaming pipeline that reads from a
[Google Cloud Pub/Sub](https://cloud.google.com/pubsub) topic. Each message in the Pub/Sub topic is a word from Shakespeare's *King Lear*.
The pipeline counts how often each word occurs within each window.

You'll be able to use this notebook to explore the data in each PCollection.

Before you start, please ensure that the Pub/Sub API is enabled [here](https://console.cloud.google.com/apis/library/pubsub.googleapis.com).

We start with the necessary imports:

In [None]:
import apache_beam as beam
from apache_beam.runners.interactive import interactive_runner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from datetime import timedelta
import google.auth

# The Google Cloud PubSub topic that we are reading from for this example.
topic = "projects/pubsub-public-data/topics/shakespeare-kinglear"

Next, we set up the options used to create the streaming pipeline:

In [None]:
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions()

# Sets the pipeline mode to streaming, so we can stream the data from PubSub.
options.view_as(pipeline_options.StandardOptions).streaming = True

# Sets the project to the default project in your current Google Cloud environment.
# The project will be used for creating a subscription to the Pub/Sub topic.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()

We are reading from Google Cloud Pub/Sub, which is an unbounded source. By default, *Apache Beam Notebooks* captures
data from unbounded sources so that it can be replayed.

The following sets the data capture duration to 60 seconds.

In [None]:
ib.options.capture_duration = timedelta(seconds=60)

The following creates a pipeline with the *Interactive Runner* as the runner with the options we just created.

In [None]:
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)

This creates a `PTransform` that creates a subscription to the given Pub/Sub topic and reads from that subscription.

In [None]:
words = p | "read" >> beam.io.ReadFromPubSub(topic=topic)

Because we are reading from an unbounded source, we need to create a windowing scheme so that we can
count the words by window. The following creates fixed windowing with each window being 10 seconds in duration.
For more information about windowing in Apache Beam, please visit the [Apache Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/#windowing-basics).
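
To make the idea of fixed windows concrete, here is a minimal plain-Python sketch (not Beam code) of how a timestamp maps to a 10-second window. The function name `fixed_window_start` and the sample timestamps are hypothetical and only for illustration. The sketch assumes timestamps in seconds and windows that are half-open intervals `[start, start + size)`.

```python
def fixed_window_start(timestamp_s, size_s=10):
    """Return the start of the fixed window that contains timestamp_s."""
    # Round the timestamp down to the nearest multiple of the window size.
    return timestamp_s - (timestamp_s % size_s)


# Hypothetical event timestamps, in seconds.
for ts in [0, 9, 10, 23]:
    start = fixed_window_start(ts)
    print(f"timestamp {ts} falls in window [{start}, {start + 10})")
```

Elements whose timestamps fall in the same interval end up in the same window, which is what allows a per-window count on an unbounded stream.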


In [None]:
windowed_words = (words 
                  | "window" >> beam.WindowInto(beam.window.FixedWindows(10)))

The following `PTransform` will count the words by window.

In [None]:
windowed_word_counts = (windowed_words
                        | "count" >> beam.combiners.Count.PerElement())
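
Conceptually, counting per element on a windowed `PCollection` means that the same word in two different windows is counted separately. The following plain-Python sketch (not Beam code) illustrates that grouping. The helper `count_per_window` and the sample data are hypothetical, and timestamps are assumed to be in seconds.

```python
from collections import Counter


def count_per_window(timestamped_words, size_s=10):
    """Count each word separately within each fixed window."""
    counts = Counter()
    for timestamp_s, word in timestamped_words:
        # Assign the element to the fixed window containing its timestamp.
        window_start = timestamp_s - (timestamp_s % size_s)
        counts[(window_start, word)] += 1
    return counts


# Hypothetical (timestamp, word) pairs; Pub/Sub payloads arrive as bytes.
sample = [(1, b"Lear"), (4, b"king"), (7, b"Lear"), (12, b"Lear")]
result = count_per_window(sample)
for (window_start, word), count in sorted(result.items()):
    print(window_start, word, count)
```

In this sample, `b"Lear"` is counted twice in the window starting at 0 and once in the window starting at 10.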

The `ib.show()` method takes a `PCollection` as a parameter, runs the pipeline that contributes to it, and
shows its content as data comes in. This method will return when all the data has been read.

The optional parameter `include_window_info=True` will include the window information for each element in the output.
You will see 3 additional columns: `event_time`, `windows`, and `pane_info`.
`event_time` is the timestamp associated with the value.
`windows` in this example tells you the start timestamp of the window and its duration.
`pane_info` describes the [triggering](https://beam.apache.org/documentation/programming-guide/#triggers) information for the pane that contained the value.

This example does not use custom triggering, so by default there is only one pane per window, labeled `Pane 0`.

Note that this also automatically captures a bounded segment of the unbounded source.

In [None]:
ib.show(windowed_word_counts, include_window_info=True)

Because we have captured a bounded segment of the unbounded source, the following will show the same data
as the previous `ib.show()` call. This is to ensure replayability so that you can iteratively augment
your pipeline and verify the output with the same input, which you will see in future cells in this notebook.
Note the parameter `visualize_data=True`. This optional parameter gives you a visualization of the data. 

In [None]:
ib.show(windowed_word_counts, include_window_info=True, visualize_data=True)

As mentioned, to ensure replayability for iterative prototyping of your pipeline,
`ib.show()` calls reuse the captured data by default. You can change this behavior so that
each call fetches new data instead:


In [None]:
# Uncomment and run this only if you would like to change the replay behavior:
# ib.options.enable_capture_replay = False

The following `PTransform` will count the words in lowercase by window.

In [None]:
windowed_lower_word_counts = (windowed_words
                              | beam.Map(lambda word: word.lower())
                              | "count" >> beam.combiners.Count.PerElement())

Assuming you have not changed `ib.options.enable_capture_replay`, the following returns the counts for the same words
as before, but converted to lowercase.
Because all words are converted to lowercase before being counted, words that previously appeared with different capitalization are merged, so some words will have a higher count than before.

In [None]:
ib.show(windowed_lower_word_counts, include_window_info=True)
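
The effect of lowercasing before counting can be sketched in plain Python (not Beam code). The sample words below are made up for illustration. They are `bytes` because Pub/Sub message payloads are read as bytes, and `bytes.lower()` lowercases ASCII letters.

```python
from collections import Counter

# Hypothetical sample of words with mixed capitalization.
sample_words = [b"Lear", b"lear", b"LEAR", b"king"]

# Counting the raw words treats each capitalization as a distinct element.
raw_counts = Counter(sample_words)

# Lowercasing first merges the variants into a single element.
lower_counts = Counter(word.lower() for word in sample_words)

print(raw_counts)
print(lower_counts)
```

Here the three spellings of "lear" each have a count of 1 before lowercasing and a combined count of 3 afterwards.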

The following gives you a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) that represents the `PCollection`.

In [None]:
ib.collect(windowed_lower_word_counts, include_window_info=True)

Just like the first example, this example is designed to run easily on a single machine. If the input stream had a very high volume, you'd want to add an output sink to your PCollection result by doing something like:
```
windowed_lower_word_counts | beam.io.<some output transform>
```
and let [Google Cloud Dataflow](https://cloud.google.com/dataflow) run your pipeline.

You can find the list of built-in input and output transforms [here](https://beam.apache.org/documentation/io/built-in/).
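
Many sinks expect elements in a specific form, so the `(word, count)` pairs usually need to be serialized first. As a rough sketch, assuming you chose `beam.io.WriteToPubSub` as the sink (which expects `bytes` payloads), a formatting step could look like the following. The function `format_count` and the output topic path are hypothetical and not part of this notebook.

```python
def format_count(element):
    """Serialize a (word, count) pair into a UTF-8 encoded payload."""
    word, count = element
    # Words read from Pub/Sub are bytes; decode them before formatting.
    if isinstance(word, bytes):
        word = word.decode("utf-8")
    return f"{word}: {count}".encode("utf-8")


print(format_count((b"lear", 3)))

# Hypothetical usage in the pipeline; the topic path is a placeholder:
# (windowed_lower_word_counts
#  | "format" >> beam.Map(format_count)
#  | "write" >> beam.io.WriteToPubSub(
#        topic="projects/<your-project>/topics/<your-topic>"))
```

Other sinks may need a different element format, so treat the formatting function as something to adapt to the transform you pick.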

Please refer to the user guide <TODO: URL> on how to run a Dataflow job using a pipeline assembled from your notebook. You can also refer to [this walkthrough](Dataflow_Word_Count.ipynb) which is based on the [first word count example notebook](01-Word_Count.ipynb).
