<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align = "center"> Spark Fundamentals I - Introduction to Spark </h1>
<h2 align = "center"> Python - Working with RDD operations </h2>
<br align = "left">

**Related free online courses:**  

Related courses can be found in the following learning paths:

- [Spark Fundamentals path](http://cocl.us/Spark_Fundamentals_Path)
- [Big Data Fundamentals path](http://cocl.us/Big_Data_Fundamentals_Path)

<img src = "http://spark.apache.org/images/spark-logo.png" height = 100 align = 'left'>

## Analyzing a log file

First, let's download the data that we will be working with in this lab.

In [1]:
# download the data from the IBM server
# this may take ~30 seconds depending on your internet speed
!wget --quiet https://ibm.box.com/shared/static/j8skrriqeqw66f51iyz911zyqai64j2g.zip
print("Data Downloaded!")

Data Downloaded!


In [2]:
# unzip the folder's content into "resources" directory
# this may take ~30 seconds depending on your internet speed
!unzip -q -o -d /resources/jupyter/labs/BD0211EN/ j8skrriqeqw66f51iyz911zyqai64j2g.zip
print("Data Extracted!")

Data Extracted!


In [3]:
# list the extracted files
!ls -1 /resources/jupyter/labs/BD0211EN/LabData/

followers.txt
notebook.log
nyctaxi100.csv
nyctaxi.csv
nyctaxisub.csv
nycweather.csv
pom.xml
README.md
taxistreams.py
users.txt


Now, let's create an RDD by loading the log file that we analyzed in the Scala version of this lab.

In [4]:
logFile = sc.textFile("/resources/jupyter/labs/BD0211EN/LabData/notebook.log")

### <span style="color: red">YOUR TURN:</span> 

#### In the cell below, keep only the lines that contain INFO

In [11]:
# WRITE YOUR CODE BELOW
info = logFile.filter(lambda line: "INFO" in line)

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info = logFile.filter(lambda line: "INFO" in line)
</textarea>
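As a rough illustration of what the predicate does, here is a plain-Python sketch that needs no Spark context. The sample lines are a small hand-picked list (two copied from the log output shown later in this lab, plus one made-up WARN line), and the built-in `filter` only stands in for the RDD method of the same name.

```python
# Plain-Python stand-in for logFile.filter(...); no SparkContext required.
sample_lines = [
    "15/10/14 14:29:23 INFO Utils: Successfully started service 'sparkDriver' on port 53333.",
    "15/10/14 14:29:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.17.0.22:53333]",
    "15/10/14 14:29:24 WARN Example: this line is made up for illustration",
]

# Keep only the lines whose text contains the substring "INFO".
info_lines = list(filter(lambda line: "INFO" in line, sample_lines))

print(len(info_lines))  # 2
```

On an RDD, `filter` is a transformation: it only records the step, and nothing is read from the file until an action such as `count()` runs.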

#### Count the lines:

In [13]:
# WRITE YOUR CODE BELOW
info.map(lambda x: 1).reduce(lambda a, b: a+b)  # manual count via map and reduce; same result as count()
info.count()

13438

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.count()
</textarea>
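The map-then-reduce line in the cell above and `count()` should give the same number. A minimal plain-Python sketch of that equivalence follows; `records` is a hypothetical stand-in for the lines of an RDD, and `functools.reduce` plays the role of the RDD `reduce` action.

```python
from functools import reduce

records = ["a", "b", "c", "d"]  # hypothetical stand-in for the lines of an RDD

# Map every record to 1, then fold the ones together with addition.
manual_count = reduce(lambda a, b: a + b, map(lambda _: 1, records))

print(manual_count, len(records))  # 4 4
```

One practical difference, as far as I know: reducing an empty RDD raises an error, while `count()` simply returns 0, so `count()` is the safer choice when the filter may match nothing.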

#### Count the lines that contain "spark" by combining a transformation and an action.

In [14]:
# WRITE YOUR CODE BELOW
info.filter(lambda line: "spark" in line).count()

156

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.filter(lambda line: "spark" in line).count()
</textarea>
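Note that the `in` test is case-sensitive, so lines that only contain "Spark" with a capital S are not counted. The sketch below shows the difference on two sample lines (the second one is made up for illustration); lowercasing the line first is one possible way to match either spelling.

```python
lines = [
    "INFO Utils: Successfully started service 'sparkDriver' on port 53333.",
    "INFO SparkContext: this line is made up for illustration",
]

# Case-sensitive match: only the first line contains lowercase "spark".
lower_hits = [line for line in lines if "spark" in line]

# Case-insensitive match: lowercase each line before testing.
any_case_hits = [line for line in lines if "spark" in line.lower()]

print(len(lower_hits), len(any_case_hits))  # 1 2
```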

#### Fetch those lines as a list of strings

In [15]:
# WRITE YOUR CODE BELOW
info.filter(lambda line: "spark" in line).collect()

['15/10/14 14:29:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.17.0.22:53333]',
 "15/10/14 14:29:23 INFO Utils: Successfully started service 'sparkDriver' on port 53333.",
 '15/10/14 14:29:23 INFO DiskBlockManager: Created local directory at /tmp/spark-fe150378-7bad-42b6-876b-d14e2c193eb6/blockmgr-c142f2f1-ebb6-4612-945b-0a67d156230a',
 '15/10/14 14:29:23 INFO HttpFileServer: HTTP File server directory is /tmp/spark-fe150378-7bad-42b6-876b-d14e2c193eb6/httpd-ed3f4ab0-7218-48bc-9d8a-3981b1cfe574',
 "15/10/14 14:29:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35726.",
 '15/10/15 15:33:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@172.17.0.22:47412]',
 "15/10/15 15:33:42 INFO Utils: Successfully started service 'sparkDriver' on port 47412.",
 '15/10/15 15:33:42 INFO DiskBlockManager: Created local directory at /tmp/spark-fc035223-3b43-43d1-8d7d-

Highlight text for answer:

<textarea rows="3" cols="80" style="color: white">
info.filter(lambda line: "spark" in line).collect()
</textarea>

View the lineage graph of an RDD using this command:

In [16]:
print(info.toDebugString())

b'(2) PythonRDD[10] at RDD at PythonRDD.scala:43 []\n |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []\n |  /resources/jupyter/labs/BD0211EN/LabData/notebook.log HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []'
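The output above is a bytes object, which is why the line breaks show up as literal `\n`. A small sketch of making it readable, assuming the bytes are UTF-8 encoded (the literal below is copied from the output above):

```python
# Bytes copied from the output above.
debug_bytes = (
    b'(2) PythonRDD[10] at RDD at PythonRDD.scala:43 []\n'
    b' |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []\n'
    b' |  /resources/jupyter/labs/BD0211EN/LabData/notebook.log HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []'
)

# Decoding to str lets print() render the newlines as real line breaks.
lineage = debug_bytes.decode("utf-8")
print(lineage)
```

Read from the bottom up, the three lines show the file being loaded (`HadoopRDD`), turned into lines of text (`MapPartitionsRDD`), and then filtered in Python (`PythonRDD`). In an environment where `toDebugString()` returns bytes, as it does here, `print(info.toDebugString().decode("utf-8"))` would presumably give the same readable form.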


## Joining RDDs

Next, you are going to create RDDs for the same README and POM files that we used in the Scala version.

In [17]:
readmeFile = sc.textFile("/resources/jupyter/labs/BD0211EN/LabData/README.md")
pomFile = sc.textFile("/resources/jupyter/labs/BD0211EN/LabData/pom.xml")

How many lines in each file contain the keyword "Spark"?

In [18]:
print(readmeFile.filter(lambda line: "Spark" in line).count())
print(pomFile.filter(lambda line: "Spark" in line).count())

18
2


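Now do a WordCount on each RDD so that the results are (K,V) pairs of (word,count). Note that the code below splits each line on a run of three spaces, so most of the resulting keys are whole lines rather than single words, as the output shows.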

In [19]:
readmeCount = (
    readmeFile
    .flatMap(lambda line: line.split("   "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

pomCount = (
    pomFile
    .flatMap(lambda line: line.split("   "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
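The choice of delimiter decides what counts as a "word". The plain-Python sketch below compares the three-space delimiter used in the cell above with a single space, using one line taken from the README output in this lab. Treating a single space as the intended delimiter is an assumption on my part, and `Counter` is only a local stand-in for the `map` plus `reduceByKey` steps.

```python
from collections import Counter

# One line taken from the README output; it contains only single spaces.
line = "Spark is a fast and general cluster computing system for Big Data. It provides"

three_space_tokens = line.split("   ")  # delimiter never occurs, so the line stays whole
single_space_tokens = line.split(" ")   # one token per word

print(len(three_space_tokens), len(single_space_tokens))  # 1 14

# Counting words in plain Python, analogous to map + reduceByKey.
word_counts = Counter(single_space_tokens)
print(word_counts["Spark"])  # 1
```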

To see the contents of either RDD, call the collect function on it.

In [20]:
print("Readme Count\n")
print(readmeCount.collect())

Readme Count

[('', 43), ('Testing first requires [building Spark](#building-spark). Once Spark is built, tests', 1), (' ./bin/run-example SparkPi', 1), ('To run one of them, use `./bin/run-example <class> [params]`. For example:', 1), ('You can set the MASTER environment variable when running examples to submit', 1), ('And run the following command, which should also return 1000:', 1), ('storage systems. Because the protocols have changed in different versions of', 1), ('[run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).', 1), ('examples to a cluster. This can be a mesos:// or spark:// URL,', 1), ('Spark is a fast and general cluster computing system for Big Data. It provides', 1), ('and Spark Streaming for stream processing.', 1), ('Spark also comes with several sample programs in the `examples` directory.', 1), ('Hadoop, you must build Spark against the same version that your cluster runs.', 1), ('## Building Spar

In [21]:
print("Pom Count\n")
print(pomCount.collect())

Pom Count

[('', 841), ('<exclusions>', 6), ('<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">', 1), (' <artifactId>spark-parent_2.10</artifactId>', 1), (' <dependency>', 24), ('  <resource>log4j.properties</resource>', 1), ('  <dependencies>', 1), ('  <artifact>*:*</artifact>', 1), ('<version>${project.version}</version>', 11), (' </filters>', 1), ('  <hadoop.deps.scope>provided</hadoop.deps.scope>', 1), ('  <packaging>jar</packaging>', 1), ('  <properties>', 1), ('<artifactId>scopt_${scala.binary.version}</artifactId>', 1), ('<artifactId>commons-math3</artifactId>', 1), ('  </configuration>', 3), (' <!-- SPARK-4455 -->', 4), (' <artifactId>commons-math</artifactId>', 1), (' <groupId>commons-logging</groupId>', 1), ('  <parent>', 1), (' <artifactId>hadoop-core</artifactId>', 1), (' <artifactId>hbase-annotations</artifactId>', 4), (

The join function combines two datasets of (K, V) and (K, W) pairs into (K, (V, W)) pairs for every key that appears in both. Let's join these two counts together.

In [22]:
joined = readmeCount.join(pomCount)

Print the value to the console

In [23]:
joined.collect()

[('', (43, 841))]
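Only the empty string survives the join because it is the only key the two count RDDs share. A minimal plain-Python sketch of the same inner-join idea, using a few (key, count) pairs copied from the two outputs above and dictionaries in place of RDDs:

```python
# A few (key, count) pairs copied from the two collect() outputs above.
readme_counts = {"": 43, " ./bin/run-example SparkPi": 1}
pom_counts = {"": 841, "<exclusions>": 6}

# Inner join: keep keys found in both, pairing the two values.
joined_pairs = [
    (key, (readme_counts[key], pom_counts[key]))
    for key in readme_counts
    if key in pom_counts
]

print(joined_pairs)  # [('', (43, 841))]
```

Dictionaries are enough here because `reduceByKey` leaves one entry per key. On RDDs with repeated keys, `join` emits every combination of matching values, which this sketch does not model.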

Let's combine the values together to get the total count

In [24]:
joinedSum = joined.map(lambda k: (k[0], (k[1][0]+k[1][1])))

To check that it is correct, print the first five elements of the joined and joinedSum RDDs

In [25]:
print("Joined Individual\n")
print(joined.take(5))

print("\n\nJoined Sum\n")
print(joinedSum.take(5))

Joined Individual

[('', (43, 841))]


Joined Sum

[('', 884)]
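The nested indexing in the `joinedSum` lambda can be hard to read. One alternative, sketched here in plain Python with the value copied from the output above, is a small named function that unpacks the pair; `add_counts` is a hypothetical helper name.

```python
joined_sample = [("", (43, 841))]  # value copied from the output above

def add_counts(pair):
    # Unpack (key, (left, right)) and add the two counts.
    key, (left, right) = pair
    return (key, left + right)

joined_sum_sample = [add_counts(pair) for pair in joined_sample]
print(joined_sum_sample)  # [('', 884)]
```

With a live RDD the same function could be passed as `joined.map(add_counts)`. Since only the values change, `joined.mapValues(lambda v: v[0] + v[1])` is another option.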


## Shared variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
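The "separate copies" behavior can be imitated locally. In the sketch below, a `pickle` round trip stands in for shipping a variable to a worker; this is only an analogy, since Spark's own serialization of functions and their variables is more involved. `counter` and `run_task_on_copy` are hypothetical names.

```python
import pickle

counter = {"seen": 0}  # hypothetical driver-side variable

def run_task_on_copy(state_bytes):
    # The "worker" rebuilds its own copy from bytes and mutates only that copy.
    local_state = pickle.loads(state_bytes)
    local_state["seen"] += 1
    return local_state["seen"]

worker_result = run_task_on_copy(pickle.dumps(counter))

# The worker saw 1, but the driver's original is untouched.
print(worker_result, counter["seen"])  # 1 0
```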

### Broadcast variables

Broadcast variables are useful when you have a large, read-only dataset that every worker node needs. The variable is cached once on each machine rather than shipped with every task. Spark actions are executed through a set of stages, separated by distributed “shuffle” operations, and Spark automatically broadcasts the common data needed by tasks within each stage.


Read more here: [http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables](http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables)

Create a broadcast variable. Type in:

In [26]:
broadcastVar = sc.broadcast([1,2,3])

To get the value, type in:

In [27]:
broadcastVar.value

[1, 2, 3]
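A broadcast variable is typically read through `.value` inside a function that is passed to a transformation. The sketch below imitates that pattern without a Spark context: `broadcast_stub` is a hypothetical stand-in that mimics only the `.value` attribute, and a list comprehension replaces the RDD `filter`.

```python
from types import SimpleNamespace

# Stand-in for sc.broadcast([1, 2, 3]): only the read-only .value attribute is mimicked.
broadcast_stub = SimpleNamespace(value=[1, 2, 3])

def is_allowed(x):
    # Tasks read the shared data through .value and never modify it.
    return x in broadcast_stub.value

kept = [x for x in [1, 2, 3, 4, 5] if is_allowed(x)]
print(kept)  # [1, 2, 3]
```

With a live context, the equivalent would be along the lines of `sc.parallelize([1, 2, 3, 4, 5]).filter(lambda x: x in broadcastVar.value)`.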

### Accumulators

Accumulators are variables that can only be added to through an associative operation. They are used to implement counters and sums efficiently in parallel. Spark natively supports numeric type accumulators and standard mutable collections, and programmers can extend these for new types. Only the driver can read the value of an accumulator; the workers can only add to it.

Create the accumulator variable. Type in:

In [28]:
accum = sc.accumulator(0)

Next, parallelize a list of four integers and define a function that adds each integer value to the accumulator variable. Type in:

In [29]:
rdd = sc.parallelize([1,2,3,4])
def f(x):
    global accum
    accum += x

Next, iterate through each element of the rdd and apply the function f on it:

In [30]:
rdd.foreach(f)

To get the current value of the accumulator variable, type in:

In [31]:
accum.value

10

You should get a value of 10.

Reading `accum.value` can only be done on the driver side. The worker nodes can only add to the accumulator.
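The restriction to an associative operation is what lets workers add independently and still produce a well-defined total. A plain-Python sketch of the idea follows; the two-partition split of the data is an assumption for illustration, not necessarily how Spark divided the RDD above.

```python
from functools import reduce

# Hypothetical split of [1, 2, 3, 4] into two partitions.
partitions = [[1, 2], [3, 4]]

# Each "worker" adds up its own partition independently.
partial_sums = [sum(part) for part in partitions]

# The "driver" merges the partial results; addition is associative and
# commutative, so the merge order does not change the total.
total = reduce(lambda a, b: a + b, partial_sums, 0)
total_reversed = reduce(lambda a, b: a + b, reversed(partial_sums), 0)

print(total, total_reversed)  # 10 10
```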


## Key-value pairs

You have already seen a bit about key-value pairs in the Joining RDDs section.

Create a key-value pair of two characters. Type in:

In [32]:
pair = ('a', 'b')

To access the first element of the pair, use index [0]; use [1] for the second.

In [33]:
print(pair[0])
print(pair[1])

a
b
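Besides indexing, a pair can be unpacked into named variables, which often reads more clearly in functions that handle (key, value) records. A short sketch, where `pairs` and `as_dict` are illustrative names:

```python
pair = ('a', 'b')

# Unpack both elements in one statement instead of indexing.
key, value = pair
print(key, value)  # a b

# The same unpacking works when looping over a list of pairs.
pairs = [('a', 1), ('b', 2)]
as_dict = {k: v for k, v in pairs}
print(as_dict)  # {'a': 1, 'b': 2}
```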


<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: Enjoyed using Jupyter notebooks with Spark? Get yourself a free 
    <a href="http://cocl.us/DSX_on_Cloud">IBM Cloud</a> account where you can use Data Science Experience notebooks
    and have *two* Spark executors for free!
</div>

### Summary
Having completed this exercise, you should now be able to describe Spark’s primary data abstraction, work with Resilient Distributed Dataset (RDD) operations, and utilize shared variables and key-value pairs.

This notebook is part of the free course on **Cognitive Class** called *Spark Fundamentals I*. If you accessed this notebook outside the course, you can take this free, self-paced course online by going to: http://cocl.us/Spark_Fundamentals_I

### About the Authors:  
Hi! It's [Alex Aklson](https://www.linkedin.com/in/aklson/), one of the authors of this notebook. I hope you found this lab educational! There is much more to learn about Spark but you are well on your way. Feel free to connect with me if you have any questions.
<hr>