
First, check the Python version to make sure it meets the minimum version required by Beam: [Installing the Apache Beam SDK](https://cloud.google.com/dataflow/docs/guides/installing-beam-sdk)

In [0]:
!python --version

Python 3.6.9
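
The same check can be done from Python, which is handy when a later cell should stop early on an unsupported interpreter. This is a minimal sketch: `MIN_PYTHON` and `python_is_supported` are hypothetical names, and the `(3, 6)` value is an assumption based on the version shown above, not a figure taken from the Beam install guide.

```python
import sys

# Assumed minimum; replace it with the version listed in the Beam install guide.
MIN_PYTHON = (3, 6)


def python_is_supported(minimum=MIN_PYTHON, current=None):
    """Return True when `current` (default: the running interpreter) is at least `minimum`."""
    if current is None:
        current = sys.version_info[:2]
    # Tuples compare element by element, so (3, 6) >= (3, 6) and (2, 7) < (3, 6).
    return tuple(current) >= tuple(minimum)


if not python_is_supported():
    print("Python %d.%d is older than the assumed minimum %d.%d"
          % (sys.version_info[0], sys.version_info[1], MIN_PYTHON[0], MIN_PYTHON[1]))
```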


Note: the command below installs Beam, including the GCP extras, in Colab

In [7]:
!pip install 'apache-beam[gcp]'

Collecting apache-beam[gcp]
  Downloading https://files.pythonhosted.org/packages/29/60/9e71c59aa44366471f84149a707de88cd6c658a04bc03a99ba2b99bb4461/apache_beam-2.19.0-cp36-cp36m-manylinux1_x86_64.whl (3.4MB)
Collecting oauth2client<4,>=2.0.1
  Downloading https://files.pythonhosted.org/packages/c0/7b/bc893e35d6ca46a72faa4b9eaac25c687ce60e1fbe978993fe2de1b0ff0d/oauth2client-3.0.0.tar.gz (77kB)
Collecting avro-python3<2.0.0,>=1.8.1; python_version >= "3.0"
  Downloading https://files.pythonhosted.org/packages/76/b2/98a736a31213d3e281a62bcae5572cf297d2546bc429accf36f9ee1604bf/avro-python3-1.9.1.tar.gz
Collecting hdfs<3.0.0,>=2.1.0
  Downloading https://files.pythonhosted.org/packages/82/39/2c0879b1bcfd1f6ad078eb210d09dbce21072386a3997074ee91e60ddc5a/hdfs-2.5.8.tar.gz (41kB)
Collecting python-dat

In [0]:
# With Beam installed, import it under the short alias "beam"
import apache_beam as beam

In [0]:
# Upload a test file first
# Note: the files.upload() widget in Colab can be unreliable; it seems to work only when there is enough screen space to display it
from google.colab import files
uploaded = files.upload()

In [12]:
# Double-check that the file was uploaded successfully
!ls -la kinglear.txt

-rw-r--r-- 1 root root 176339 Feb  7 12:02 kinglear.txt
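
The `ls` check above can also be written in Python, so that a missing or empty upload raises an error instead of only showing up later as a pipeline failure. A minimal sketch follows; `check_uploaded` is a hypothetical helper name and is not part of Colab or Beam.

```python
import os


def check_uploaded(path):
    """Return the file size in bytes; raise if the file is missing or empty."""
    if not os.path.isfile(path):
        raise FileNotFoundError("expected an uploaded file at %r" % path)
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("%r was uploaded but is empty" % path)
    return size
```

Calling `check_uploaded('kinglear.txt')` right after the upload cell should return the same byte count that `ls -la` reports.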


In [0]:
# p1 is a simple Beam pipeline used as a prototype
p1 = beam.Pipeline()

In [0]:
# list/array = []
# tuple = ()  (an empty set is written set(), not ())
# dictionary = {}

lines = (
    p1
    | beam.Create(['Using create transform ',
                   'to generate in memory data ',
                   'This is 3rd line ',
                   'Thanks '])
    | beam.io.WriteToText('data/outCreate1')
)
# Wait for the run to finish so the output file exists before it is read
p1.run().wait_until_finish()

# visualize output
!head -n 20 data/outCreate1-00000-of-00001


In [0]:
# p2 is the Beam pipeline that parses the uploaded text file and produces word/count pairs
p2 = beam.Pipeline()

In [0]:
# re is used in the pipeline, so we need to import it first
import re

In [17]:
# The p2 pipeline reads the text, extracts each word, counts the occurrences of each word, and writes the results as text
# Note: the generated output file names are prefixed with "counts"
lines2 = (
    p2
    | beam.io.ReadFromText('kinglear.txt')
    | beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
    | beam.combiners.Count.PerElement()
    | beam.MapTuple(lambda word, count: '%s: %s' % (word, count))
    | beam.io.WriteToText('counts')
)
# Wait for the run to finish so the output file exists before it is listed
p2.run().wait_until_finish()

# visualize output
!ls -la


total 252
drwxr-xr-x 1 root root   4096 Feb  7 12:16 .
drwxr-xr-x 1 root root   4096 Feb  7 11:57 ..
drwxr-xr-x 2 root root   4096 Feb  7 12:15 beam-temp-counts-8b08e01c49a311eaae0d0242ac1c0002
drwxr-xr-x 1 root root   4096 Feb  5 18:37 .config
-rw-r--r-- 1 root root  50176 Feb  7 12:16 counts-00000-of-00001
drwxr-xr-x 2 root root   4096 Feb  7 12:11 data
-rw-r--r-- 1 root root 176339 Feb  7 12:02 kinglear.txt
drwxr-xr-x 1 root root   4096 Feb  5 18:37 sample_data
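
The listing shows that `WriteToText('counts')` did not create a file called `counts`. It created `counts-00000-of-00001`, that is, the prefix followed by a shard index and the shard count. A larger run may produce several shards, so reading the output by pattern is safer than hard-coding one file name. The sketch below uses only the standard library; `read_shards` is a hypothetical helper, and the `-*-of-*` pattern assumes the default shard naming seen in the listing.

```python
import glob


def read_shards(prefix):
    """Return all lines from files named like '<prefix>-00000-of-00001'."""
    lines = []
    # Sort the matches so shards are read in index order.
    for path in sorted(glob.glob(prefix + "-*-of-*")):
        with open(path, encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines
```

With the files above, `read_shards('counts')` would skip the `beam-temp-counts-...` directory because its name does not start with the prefix.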


In [21]:
!head -n 20 counts-00000-of-00001

THE: 7
TRAGEDY: 1
OF: 16
KING: 1
LEAR: 1
by: 69
William: 1
Shakespeare: 1
Dramatis: 1
Personae: 1
Lear: 228
King: 54
of: 456
Britain: 2
France: 32
Duke: 30
Burgundy: 15
Cornwall: 22
Albany: 14
Earl: 11
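
To see what the `FlatMap`, `Count.PerElement` and `MapTuple` steps compute, the same logic can be written in plain Python for a small in-memory input. This is only an illustration of the transforms' semantics, not how Beam executes them; `count_words` and `format_counts` are hypothetical names. The regular expression and the format string are the ones used in the p2 pipeline.

```python
import re
from collections import Counter

# Same pattern as the FlatMap step: runs of letters and apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def count_words(lines):
    """Split every line into words and count each distinct word."""
    counts = Counter()
    for line in lines:
        # findall plays the role of FlatMap: one line yields many words.
        counts.update(WORD_PATTERN.findall(line))
    return counts


def format_counts(counts):
    """Format each (word, count) pair the way the MapTuple step does."""
    return ['%s: %s' % (word, count) for word, count in counts.items()]
```

For example, `count_words(["King Lear", "the King"])` gives `King` a count of 2 and `Lear` and `the` a count of 1 each. The match is case-sensitive, which is why the output above lists `KING: 1` and `King: 54` as separate entries.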
