# CS 297.2 Big Data Processing: Introduction to Spark with Simple Examples

Prepared by: Miguel Saavedra <msaavedra@ateneo.edu> and William Yu <wyu@ateneo.edu>


---

### Installing Spark on the machine

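Once Java is installed, setting up Spark follows a similar pattern: download the release archive, extract it, and point the environment variables at the extracted folder. On your own machine, you will likely want to put the Spark folder wherever you keep your user-installed applications.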

In [1]:
!rm -r spark*
!wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!ls
!tar xvf ./spark-3.5.1-bin-hadoop3.tgz > /dev/null 2>/dev/null
!ls
!pip install -q findspark

rm: cannot remove 'spark*': No such file or directory
--2024-04-11 15:25:08--  https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Resolving archive.apache.org (archive.apache.org)... 65.108.204.189, 2a01:4f9:1a:a084::2
Connecting to archive.apache.org (archive.apache.org)|65.108.204.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400446614 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.1-bin-hadoop3.tgz’


2024-04-11 15:26:12 (5.97 MB/s) - ‘spark-3.5.1-bin-hadoop3.tgz’ saved [400446614/400446614]

sample_data  spark-3.5.1-bin-hadoop3.tgz
sample_data  spark-3.5.1-bin-hadoop3  spark-3.5.1-bin-hadoop3.tgz


In [2]:
# Set environment variables
import os
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3/"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/jre/bin/java
!java -version

update-alternatives: error: alternative /usr/lib/jvm/java-11-openjdk-amd64/jre/bin/java for java not registered; not setting
openjdk version "11.0.22" 2024-01-16
OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
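
Before moving on, it can help to confirm that both environment variables point at directories that actually exist, since a wrong path usually only surfaces later as a confusing startup error. The following is a minimal sketch; `missing_env_dirs` is a hypothetical helper name and is not part of Spark or findspark.

```python
import os

def missing_env_dirs(names=("SPARK_HOME", "JAVA_HOME"), env=None):
    """Return the variables in `names` that are unset or do not point to a directory."""
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# An empty list means every listed variable points at an existing directory.
print(missing_env_dirs())
```

If the printed list is not empty, re-check the paths assigned in the cell above before importing `pyspark`.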


____
## Sample 1: Hello World in Spark

A simple hello-world example in Spark: count how often each character appears in the string "Hello World".

In [3]:
# If Spark is installed and SPARK_HOME is set, this will find the spark installation so spark libraries can be imported.
# findspark is necessary if you want to use Spark in the IDE of your choice.
import findspark
findspark.init()

# Imports the basic spark functions needed
from pyspark import SparkConf, SparkContext
from operator import add

# Sets the Spark configuration. The AppName is arbitrary, but setting the master to local
# specifies that the application is not running on a distributed system
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext.getOrCreate(conf = conf)

# what does this do?
data = sc.parallelize(list("Hello World"))
counts = (
    data.map(lambda x: (x, 1))
    .reduceByKey(add)
    .sortBy(lambda x: x[1], ascending=False)
    .collect()
)

# each key is a single character, not a word
for (char, count) in counts:
    print("{}: {}".format(char, count))

l: 3
o: 2
H: 1
e: 1
 : 1
W: 1
r: 1
d: 1


In [4]:
list("Hello World")

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

In [5]:
data = sc.parallelize(list("Hello World"))
data

ParallelCollectionRDD[6] at readRDDFromFile at PythonRDD.scala:289

In [6]:
data.map(lambda x: (x, 1)).collect()

[('H', 1),
 ('e', 1),
 ('l', 1),
 ('l', 1),
 ('o', 1),
 (' ', 1),
 ('W', 1),
 ('o', 1),
 ('r', 1),
 ('l', 1),
 ('d', 1)]

In [7]:
data.map(lambda x: (x, 1)).reduceByKey(add).collect()

[('H', 1),
 ('e', 1),
 ('l', 3),
 ('o', 2),
 (' ', 1),
 ('W', 1),
 ('r', 1),
 ('d', 1)]

In [8]:
data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()

[('l', 3),
 ('o', 2),
 ('H', 1),
 ('e', 1),
 (' ', 1),
 ('W', 1),
 ('r', 1),
 ('d', 1)]
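
The same character count can be traced without Spark. The plain-Python sketch below mirrors the three steps (`map`, `reduceByKey`, `sortBy`) on an ordinary list. It only illustrates the logic; Spark would run each step across partitions. The names `pairs`, `totals`, and `ranked` are illustrative and do not come from the notebook.

```python
from operator import add

chars = list("Hello World")

# map: pair every character with a count of 1
pairs = [(c, 1) for c in chars]

# reduceByKey(add): combine the counts of identical characters
totals = {}
for key, value in pairs:
    totals[key] = add(totals[key], value) if key in totals else value

# sortBy(..., ascending=False): highest count first
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for char, count in ranked:
    print("{}: {}".format(char, count))
```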

____
## Sample 2: RDD example in Spark

Basic RDD transformations (`filter`, `map`, and `union`) applied to a range of integers.

In [9]:
# If Spark is installed and SPARK_HOME is set, this will find the spark installation so spark libraries can be imported.
# findspark is necessary if you want to use Spark in the IDE of your choice.
import findspark
findspark.init()

# Imports the basic spark functions needed
from pyspark import SparkConf, SparkContext
from operator import add

# Sets the Spark configuration. The AppName is arbitrary, but setting the master to local
# specifies that the application is not running on a distributed system
conf = SparkConf().setMaster("local").setAppName("RDDExample")
sc = SparkContext.getOrCreate(conf = conf)

# generate data and put into a RDD
data = range(10000)
distData = sc.parallelize(data)

# show only values less than 10
distData.filter(lambda x: x < 10).collect()


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
data

range(0, 10000)

In [11]:
distData.filter(lambda x: x > 9990).collect()

[9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999]

In [12]:
# multiply all numbers less than 5 by 2
distData.filter(lambda x:x <5).map(lambda x: x*2).collect()

[0, 2, 4, 6, 8]

In [13]:
distData.filter(lambda x:x>9995).collect()

[9996, 9997, 9998, 9999]

In [14]:
distData.filter(lambda x:x<5).union(distData.filter(lambda x:x>9995)).collect()

[0, 1, 2, 3, 4, 9996, 9997, 9998, 9999]
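
For comparison, here is a plain-Python sketch of the same `filter`, `map`, and `union` steps on a local range. It also illustrates one detail worth knowing: like `RDD.union`, simple concatenation keeps duplicates when the two inputs overlap. The variable names are illustrative only.

```python
from itertools import chain

data = range(10000)

# filter + map: double every number below 5
doubled = [x * 2 for x in data if x < 5]

# union: concatenate two filtered subsets
low = [x for x in data if x < 5]
high = [x for x in data if x > 9995]
combined = list(chain(low, high))

# overlapping subsets keep their repeated elements
overlap = list(chain(low, [x for x in data if x < 3]))

print(doubled)
print(combined)
print(overlap)
```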

____
## Sample 3: Key Value RDD examples in Spark

Key-value (pair) RDD operations: `map`, `reduceByKey`, `groupByKey`, `filter`, and `join`.

In [15]:
# If Spark is installed and SPARK_HOME is set, this will find the spark installation so spark libraries can be imported.
# findspark is necessary if you want to use Spark in the IDE of your choice.
import findspark
findspark.init()

# Imports the basic spark functions needed
from pyspark import SparkConf, SparkContext
from operator import add

# Sets the Spark configuration. The AppName is arbitrary, but setting the master to local
# specifies that the application is not running on a distributed system
conf = SparkConf().setMaster("local").setAppName("KVRDDExample")
sc = SparkContext.getOrCreate(conf = conf)

# create KV data set
data = sc.parallelize([("Will", 12), ("Will", 7), ("Will", 5), ("Walt", 9), ("Walt", 7), ("Walt", 5), ("Wain", 7), ("Wain", 5)])

In [16]:
data.collect()

[('Will', 12),
 ('Will', 7),
 ('Will', 5),
 ('Walt', 9),
 ('Walt', 7),
 ('Walt', 5),
 ('Wain', 7),
 ('Wain', 5)]

In [17]:
# subtract 1 from each age
data.map(lambda x: (x[0], x[1]-1)).collect()

[('Will', 11),
 ('Will', 6),
 ('Will', 4),
 ('Walt', 8),
 ('Walt', 6),
 ('Walt', 4),
 ('Wain', 6),
 ('Wain', 4)]

In [18]:
# return the highest age per person
data.reduceByKey(max).collect()

[('Will', 12), ('Walt', 9), ('Wain', 7)]

In [19]:
# return the sum of ages per person
data.reduceByKey(lambda x, y: x+y).collect()

[('Will', 24), ('Walt', 21), ('Wain', 12)]

In [20]:
data.reduceByKey(add).collect()

[('Will', 24), ('Walt', 21), ('Wain', 12)]

In [21]:
# count number of elements per key
data.groupByKey().mapValues(len).collect()

[('Will', 3), ('Walt', 3), ('Wain', 2)]

In [22]:
# return only the records with an age greater than 10
data.filter(lambda x: x[1]>10).collect()

[('Will', 12)]

In [23]:
# sample join
ref = sc.parallelize([("Will", "Eldest"),("Walt","Middle"),("Wain","Youngest")])
data.join(ref).collect()

[('Will', (12, 'Eldest')),
 ('Will', (7, 'Eldest')),
 ('Will', (5, 'Eldest')),
 ('Walt', (9, 'Middle')),
 ('Walt', (7, 'Middle')),
 ('Walt', (5, 'Middle')),
 ('Wain', (7, 'Youngest')),
 ('Wain', (5, 'Youngest'))]

In [24]:
data.join(ref).map(lambda x: x[1][1] + " has a " + str(x[1][0]) + " year old").collect()

['Eldest has a 12 year old',
 'Eldest has a 7 year old',
 'Eldest has a 5 year old',
 'Middle has a 9 year old',
 'Middle has a 7 year old',
 'Middle has a 5 year old',
 'Youngest has a 7 year old',
 'Youngest has a 5 year old']
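
The join above is an inner join: every value on the left is paired with every value on the right that shares its key, and keys present on only one side are dropped. A minimal plain-Python sketch of that behavior follows. `inner_join` is a hypothetical helper, and the `"Wade"` record is invented here purely to show an unmatched key.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair every left value with every right value that shares its key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# "Wade" has no match on the right; "Wain" has no match on the left
ages = [("Will", 12), ("Will", 7), ("Walt", 9), ("Wade", 3)]
roles = [("Will", "Eldest"), ("Walt", "Middle"), ("Wain", "Youngest")]

joined = inner_join(ages, roles)
for row in joined:
    print(row)
```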

____
## Exercise: Create a query

Write a Spark query for each question below, using the `pop` and `area` RDDs defined in this section.

In [25]:
# If Spark is installed and SPARK_HOME is set, this will find the spark installation so spark libraries can be imported.
# findspark is necessary if you want to use Spark in the IDE of your choice.
import findspark
findspark.init()

# Imports the basic spark functions needed
from pyspark import SparkConf, SparkContext
from operator import add

# Sets the Spark configuration. The AppName is arbitrary, but setting the master to local
# specifies that the application is not running on a distributed system
conf = SparkConf().setMaster("local").setAppName("KVRDDExercise")
sc = SparkContext.getOrCreate(conf = conf)

In [26]:
# Metro Manila population and area data sets
pop = sc.parallelize([
("Caloocan",1583978),
("Las Pinas",588894),
("Makati",582602),
("Malabon",365525),
("Mandaluyong",386276),
("Manila",1780148),
("Marikina",450741),
("Muntinlupa",504509),
("Navotas",249463),
("Parañaque",664822),
("Pasay",416522),
("Pasig",755300),
("Pateros",63840),
("Quezon City",2936116),
("San Juan",122180),
("Taguig",804915),
("Valenzuela",620422)
])

area = sc.parallelize([
("Caloocan",53.33),
("Las Pinas",32.02),
("Makati",27.36),
("Malabon",15.96),
("Mandaluyong",11.06),
("Manila",42.88),
("Marikina",22.64),
("Muntinlupa",41.67),
("Navotas",11.51),
("Parañaque",47.28),
("Pasay",18.64),
("Pasig",31.46),
("Pateros",1.76),
("Quezon City",165.33),
("San Juan",5.87),
("Taguig",45.18),
("Valenzuela",45.75)
])

In [38]:
# Return the total population of Metro Manila
total_population = pop.map(lambda kv: kv[1]).reduce(add)

## SOLUTION 2 :
# total_population = pop.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

## SOLUTION 3 :
# total_population = pop.reduceByKey(lambda x, y: x + y).map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

print(f"Total Population : {total_population}")

Total Population : 12876253


In [35]:
# Return the 5 largest cities/towns in terms of population
largest_cities = pop.sortBy(lambda x: x[1], ascending=False).take(5)
print("Top 5 Largest Cities/Towns in terms of Population :")

for key,val in largest_cities:
  print(f" {key.ljust(11)} = {val}")

Top 5 Largest Cities/Towns in terms of Population :
 Quezon City = 2936116
 Manila      = 1780148
 Caloocan    = 1583978
 Taguig      = 804915
 Pasig       = 755300
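
Sorting the whole data set just to keep five rows does more work than strictly needed. The plain-Python sketch below shows the "top N by key" idea with `heapq.nlargest` on a subset of the population data. In PySpark, `RDD.top(num, key=...)` and `RDD.takeOrdered(num, key=...)` are the methods to look at for the same purpose, though this sketch does not exercise them.

```python
import heapq

# A subset of the population data from the cell above
pop_subset = [("Caloocan", 1583978), ("Manila", 1780148), ("Pasig", 755300),
              ("Pateros", 63840), ("Quezon City", 2936116), ("Taguig", 804915)]

# Keep only the 3 largest entries by population, without sorting the full list
largest = heapq.nlargest(3, pop_subset, key=lambda kv: kv[1])

for city, population in largest:
    print(f" {city.ljust(11)} = {population}")
```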


In [37]:
# Return the total land area of Metro Manila
total_land_area = area.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

## SOLUTION 2 :
# total_land_area  = area.reduceByKey(lambda x, y: x + y).map(lambda kv: kv[1]).reduce(lambda x, y: x + y)

print(f"Total Land Area of Metro Manila : {total_land_area}")

Total Land Area of Metro Manila : 619.6999999999999
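
The trailing digits in `619.6999999999999` come from binary floating-point rounding while adding, not from the data. A minimal sketch of two common remedies, shown in plain Python on the same area values: `math.fsum` tracks rounding error more carefully than repeated `+`, and a format specifier such as `:.2f` controls how many decimals are displayed.

```python
import math

# Area values copied from the `area` data set above
areas = [53.33, 32.02, 27.36, 15.96, 11.06, 42.88, 22.64, 41.67, 11.51,
         47.28, 18.64, 31.46, 1.76, 165.33, 5.87, 45.18, 45.75]

total = math.fsum(areas)

# Format to two decimals, matching the precision of the inputs
print(f"Total Land Area of Metro Manila : {total:.2f}")
```

With the RDD version, formatting the existing `total_land_area` with `:.2f` in the `print` call would have the same effect on the displayed value.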


In [39]:
# Return the 5 smallest cities/towns in terms of area
smallest_cities_by_area = area.sortBy(lambda x: x[1]).take(5)
print(f"Top 5 Smallest City/Town :")

for key,val in smallest_cities_by_area:
  print(f" {key.ljust(11)} \t= {val}")

Top 5 Smallest City/Town :
 Pateros     	= 1.76
 San Juan    	= 5.87
 Mandaluyong 	= 11.06
 Navotas     	= 11.51
 Malabon     	= 15.96


In [31]:
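# Return the population density per city/town in Metro Manila
# join pairs each city's (population, area); division is not commutative,
# so it should not be used as the reduceByKey function
population_density = pop.join(area).mapValues(lambda v: v[0] / v[1]).collect()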

print(f"Population Density :")
for key,val in population_density:
  print(f" {key.ljust(11)} \t= {val:.5f}")


Population Density :
 Caloocan    	= 29701.44384
 Mandaluyong 	= 34925.49729
 Manila      	= 41514.64552
 Marikina    	= 19909.05477
 Muntinlupa  	= 12107.24742
 Navotas     	= 21673.58818
 Parañaque   	= 14061.37902
 Pasay       	= 22345.60086
 Quezon City 	= 17759.12418
 Taguig      	= 17815.73705
 Valenzuela  	= 13561.13661
 Las Pinas   	= 18391.44285
 Makati      	= 21293.93275
 Malabon     	= 22902.56892
 Pasig       	= 24008.26446
 Pateros     	= 36272.72727
 San Juan    	= 20814.31005
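
Density is population divided by area, so the order of the two operands matters. Spark's documentation describes the function passed to `reduceByKey` as associative and commutative, and division is neither, which is why pairing the two values by key before dividing is the safer pattern. A minimal plain-Python sketch of that idea, using three cities from the data above (in PySpark, `join` followed by `mapValues` plays the same role):

```python
# Three cities copied from the pop and area data sets above
pop_by_city = {"Manila": 1780148, "Pateros": 63840, "Quezon City": 2936116}
area_by_city = {"Manila": 42.88, "Pateros": 1.76, "Quezon City": 165.33}

# Pair the two values per city explicitly, then divide population by area
density = {
    city: pop_by_city[city] / area_by_city[city]
    for city in pop_by_city
    if city in area_by_city
}

for city, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f" {city.ljust(11)} \t= {value:.5f}")
```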


In [32]:
# Return the top 5 cities/towns in terms of population density
# join yields (city, (population, area)), so the division order is explicit
top_population_density = pop.join(area).mapValues(lambda v: v[0] / v[1]).sortBy(lambda x: x[1], ascending=False).take(5)
print(f"Highest Ranking of City/Town in terms of population density:")
for key,val in top_population_density:
  print(f" {key.ljust(11)} \t= {val:.5f}")


Highest Ranking of City/Town in terms of population density:
 Manila      	= 41514.64552
 Pateros     	= 36272.72727
 Mandaluyong 	= 34925.49729
 Caloocan    	= 29701.44384
 Pasig       	= 24008.26446


In [33]:
# Return the bottom 5 cities/towns in terms of population density
# join yields (city, (population, area)), so the division order is explicit
bottom_population_density = pop.join(area).mapValues(lambda v: v[0] / v[1]).sortBy(lambda x: x[1]).take(5)
print(f"Lowest Ranking of City/Town in terms of population density:")
for key,val in bottom_population_density:
  print(f" {key.ljust(11)} \t= {val:.5f}")

Lowest Ranking of City/Town in terms of population density:
 Muntinlupa  	= 12107.24742
 Valenzuela  	= 13561.13661
 Parañaque   	= 14061.37902
 Quezon City 	= 17759.12418
 Taguig      	= 17815.73705


In [34]:
# Return the population density of the cities/towns whose names start with the letter 'P', sorted from largest to smallest population density
# join yields (city, (population, area)), so the division order is explicit
population_density_with_p_sorted = pop.join(area).filter(lambda x: x[0][0].lower() == 'p').mapValues(lambda v: v[0] / v[1]).sortBy(lambda x: x[1], ascending=False).collect()
print(f"Ranking of Population density of the cities/town which start with the letter 'P' :")
for key,val in population_density_with_p_sorted:
  print(f" {key.ljust(11)} \t= {val:.5f}")

Ranking of Population density of the cities/town which start with the letter 'P' :
 Pateros     	= 36272.72727
 Pasig       	= 24008.26446
 Pasay       	= 22345.60086
 Parañaque   	= 14061.37902
