# Spark Fundamentals I - Introduction to Spark

In this notebook you will:

- Perform basic RDD actions and transformations
- Use caching to speed up repeated operations

In [None]:
# download the data from the IBM server
# this may take ~30 seconds depending on your internet speed
!wget --quiet https://cocl.us/BD0211EN_Data
print("Data Downloaded!")

In [None]:
# unzip the data that we just downloaded into a directory (e.g. /resources/jupyter/labs/BD0211EN/)
# this may take ~30 seconds
!unzip -q -o -d /resources/jupyter/labs/BD0211EN/ BD0211EN_Data

# alternatively, download and extract the archive manually

In [6]:
# list the extracted files
!ls -1 /home/jane/Desktop/Github/IBM-Learning-Path-Spark/02-Spark-Fundamentals/LabData

followers.txt
notebook.log
nyctaxi100.csv
nyctaxi.csv
nyctaxisub.csv
nycweather.csv
pom.xml
README.md
taxistreams.py
users.txt


### Starting with Spark

In [7]:
# Check the version of Spark running:
sc.version

'3.0.0-preview2'

In [8]:
# Add in the path to the README.md file in LabData
readme = sc.textFile("/home/jane/Desktop/Github/IBM-Learning-Path-Spark/02-Spark-Fundamentals/LabData/README.md")

### - Some RDD ACTIONS on this text file:

In [11]:
type(readme)

pyspark.rdd.RDD

In [9]:
# Counting the object elements
readme.count()

98

In [12]:
readme.first()

'# Apache Spark'

In [13]:
# Print the collection
readme.collect()

['# Apache Spark',
 '',
 'Spark is a fast and general cluster computing system for Big Data. It provides',
 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that',
 'supports general computation graphs for data analysis. It also supports a',
 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,',
 'MLlib for machine learning, GraphX for graph processing,',
 'and Spark Streaming for stream processing.',
 '',
 '<http://spark.apache.org/>',
 '',
 '',
 '## Online Documentation',
 '',
 'You can find the latest Spark documentation, including a programming',
 'guide, on the [project web page](http://spark.apache.org/documentation.html)',
 'and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).',
 'This README file only contains basic setup instructions.',
 '',
 '## Building Spark',
 '',
 'Spark is built using [Apache Maven](http://maven.apache.org/).',
 'To build Spark and its example programs, run:',
 '',
 '    build/mvn -DskipTes

### - Some RDD TRANSFORMATIONS:

In [15]:
# The filter transformation to return a new RDD with a subset of the items in the file
readme2 = readme.filter(lambda line: "Spark" in line)

In [16]:
# Action
readme2.count()

18

In [21]:
# Filter on another phrase; the two steps can also be chained: readme.filter(...).count()
readme3 = readme.filter(lambda line: "Big Data" in line)

In [22]:
readme3.count()

1

In [23]:
# Show the line where the phrase appears
readme3.collect()

['Spark is a fast and general cluster computing system for Big Data. It provides']

# More on RDD Operations
RDDs can be used for more complex computations.

### .map and .reduce

- First, map converts each line to an integer value: the number of words in that line.
- Then reduce is called to find the largest of those values, i.e. the word count of the longest line.
- The arguments to map and reduce are Python anonymous functions (lambdas), but you can also pass any top-level Python function.


In [25]:
readme.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

14

In [26]:
# Define a two-argument max function to pass to reduce (note: this shadows Python's built-in max)
def max(a, b):
    if a > b:
        return a
    else:
        return b

In [28]:
# Run the following with the max() function
readme.map(lambda line: len(line.split())).reduce(max)

14
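The same map-then-reduce computation can be traced without a cluster. The following is a minimal plain-Python sketch, assuming a hypothetical three-line sample in place of the README; it uses `functools.reduce` as a stand-in for the RDD `reduce` action and is not how Spark executes the job internally.

```python
from functools import reduce

# Hypothetical sample lines standing in for the README contents.
lines = [
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system for Big Data.",
]

# "map" step: turn each line into its number of words.
counts_per_line = [len(line.split()) for line in lines]

# "reduce" step: fold the values pairwise, keeping the larger one each time.
longest = reduce(lambda a, b: a if a > b else b, counts_per_line)

print(counts_per_line)  # [3, 0, 12]
print(longest)          # 12
```

The result is a word count, not the line itself: the map step discards the text, so reduce only ever sees integers.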

- Spark supports the MapReduce data flow pattern. We can use it to do a word count on the README file.

- Below we combine the flatMap, map, and reduceByKey functions to count the occurrences of each word in the README file.

In [29]:
# .flatMap() .map() .reduceByKey()  
wordCounts = readme.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
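To make the three stages concrete, here is a rough plain-Python equivalent of the pipeline above. The two sample lines and the variable names are assumptions for illustration; the sketch only mirrors what each stage produces, not how Spark distributes the work.

```python
from collections import defaultdict

# Hypothetical sample input standing in for the README lines.
lines = ["Spark is fast", "Spark is general"]

# flatMap: split every line and flatten the results into one list of words.
words = [word for line in lines for word in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values that share a key using a + b.
totals = defaultdict(int)
for word, count in pairs:
    totals[word] += count

local_counts = dict(totals)
print(local_counts)  # {'Spark': 2, 'is': 2, 'fast': 1, 'general': 1}
```

Note that splitting on whitespace keeps punctuation attached to words, which is why the Spark output contains separate entries such as `'Scala,'` and `'learning,'`.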

NOTE: The collect() function brings all of the data into the driver node. For a small dataset this is acceptable, but for a large dataset it can cause an out-of-memory error. It is recommended to use collect() for testing only. A safer approach is to use the take() function, e.g. print(wordCounts.take(n)).

In [30]:
# Action
wordCounts.collect()

[('#', 1),
 ('Apache', 1),
 ('Spark', 14),
 ('is', 6),
 ('It', 2),
 ('provides', 1),
 ('high-level', 1),
 ('APIs', 1),
 ('in', 5),
 ('Scala,', 1),
 ('Java,', 1),
 ('an', 3),
 ('optimized', 1),
 ('engine', 1),
 ('supports', 2),
 ('computation', 1),
 ('analysis.', 1),
 ('set', 2),
 ('of', 5),
 ('tools', 1),
 ('SQL', 2),
 ('MLlib', 1),
 ('machine', 1),
 ('learning,', 1),
 ('GraphX', 1),
 ('graph', 1),
 ('processing,', 1),
 ('Documentation', 1),
 ('latest', 1),
 ('programming', 1),
 ('guide,', 1),
 ('[project', 2),
 ('README', 1),
 ('only', 1),
 ('basic', 1),
 ('instructions.', 1),
 ('Building', 1),
 ('using', 2),
 ('[Apache', 1),
 ('run:', 1),
 ('do', 2),
 ('this', 1),
 ('downloaded', 1),
 ('documentation', 3),
 ('project', 1),
 ('site,', 1),
 ('at', 2),
 ('Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 1),
 ('Interactive', 2),
 ('Shell', 2),
 ('The', 1),
 ('way', 1),
 ('start', 1),
 ('Try', 1),
 ('following', 2),
 ('1000:', 2),
 ('scala>', 1),
 ('1000).count()', 1),


In [32]:
wordCounts.take(5)

[('#', 1), ('Apache', 1), ('Spark', 14), ('is', 6), ('It', 2)]

# **** IMPORTANT 

### What is the most frequent word in the README, and how many times was it used?

In [34]:
wordCounts.reduce(lambda a, b: a if (a[1] > b[1]) else b)

('the', 21)
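The reduce above compares (word, count) pairs by their second element. The sketch below applies the same rule in plain Python to a small, hypothetical list of pairs that contains a deliberate tie, to show one subtlety: because the comparison is strict, a tie is resolved in favour of the later element.

```python
from functools import reduce

# Hypothetical (word, count) pairs; "the" and "to" tie on purpose.
pairs = [("Spark", 14), ("the", 21), ("is", 6), ("to", 21)]


def keep_larger(a, b):
    # Same rule as the lambda above: keep a only if its count is strictly larger.
    return a if a[1] > b[1] else b


top = reduce(keep_larger, pairs)
print(top)  # ('to', 21)
```

Spark expects the function passed to reduce to be commutative and associative. On ties this rule is not commutative, so if two words share the highest count, which one Spark returns may depend on how the data is partitioned.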

# *************** !

# Using Spark caching

- In this short section, we’ll see how Spark caching can be used to pull data sets into a cluster-wide in-memory cache. 
- This is very useful for data that is accessed repeatedly, such as when querying a small “hot” dataset or running an iterative algorithm. 
- Both Python and Scala use the same commands.
- As a simple example, let’s mark our readme2 dataset to be cached and then invoke the first count operation to tell Spark to cache it. 
- Remember that cache(), like a transformation, is lazy: nothing is processed until an action such as count() is called. 
- Once you run the second count() operation, you should notice a small increase in speed.

In [35]:
print(readme2.count())

18


In [36]:
from timeit import Timer

def count():
    return readme2.count()
t = Timer(lambda: count())

In [37]:
print(t.timeit(number=50))

4.618994120000025


In [39]:
readme2.cache()
print(t.timeit(number=50))

3.8210617369995816


Spark caching can be used to cache large datasets; subsequent operations on them will use the data in the cache rather than re-fetching it from the underlying storage, such as HDFS.

# End