# Spark Preparation
First, check whether the notebook is running in Google Colab. If it is, install all the necessary packages.

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.2.1 with Hadoop 3.2, Java 8, and findspark, which locates the Spark installation on the system. All of these can be installed from within the Colab notebook itself.
Learn more from [A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!](https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/)

credit: Natawut Nupairoj

In [1]:
try:
    import google.colab  # only importable inside Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

In [2]:
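if IN_COLAB:
    # Java 8 runtime required by Spark
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    # Spark 3.2.1 built for Hadoop 3.2; if this mirror no longer hosts the release,
    # older releases are kept under https://archive.apache.org/dist/spark/
    !wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
    !tar xf spark-3.2.1-bin-hadoop3.2.tgz
    !mv spark-3.2.1-bin-hadoop3.2 spark
    # findspark locates the Spark installation for Python
    !pip install -q findspark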

In [3]:
if IN_COLAB:
    import os
    # Point Spark at the Java runtime and the unpacked Spark directory
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark"

# Start a Local Cluster
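Use findspark.init() to locate the Spark installation and make pyspark importable. The local cluster itself starts when the SparkSession is created with cluster_url set to 'local'. If you plan to use a remote cluster, skip findspark.init() and change cluster_url accordingly.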

In [4]:
import findspark
findspark.init()

In [5]:
cluster_url = 'local'
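
The value of cluster_url is a Spark master URL. The sketch below lists the common forms and shows one possible way to switch between a local and a remote cluster without editing the notebook. The helper name pick_master_url and the environment variable SPARK_MASTER_URL are assumptions made for illustration; neither is part of Spark or of this notebook.

```python
import os

# Common Spark master URL forms:
#   'local'              one worker thread on this machine
#   'local[4]'           four worker threads on this machine
#   'local[*]'           one worker thread per logical core
#   'spark://host:7077'  a standalone cluster master (7077 is the default port)

def pick_master_url(default="local"):
    """Return the master URL from SPARK_MASTER_URL (hypothetical variable), else the default."""
    return os.environ.get("SPARK_MASTER_URL", default)

cluster_url = pick_master_url()
print(cluster_url)
```

With the variable unset, this falls back to 'local', which matches the value used in this notebook.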

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master(cluster_url)\
        .appName("Colab")\
        .config('spark.ui.port', '4040')\
        .getOrCreate()
sc = spark.sparkContext

# Basic Spark Commands

In [7]:
sc

## Simple RDD Operations
- *sc.parallelize(data)*: create an RDD from data
- *rdd.count()*: count the number of elements in an rdd
- *rdd.filter(func)*: create a new rdd from an existing rdd, keeping only the elements for which func returns true

In [8]:
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
n = rdd.count()
print('count = {0}'.format(n))
l = rdd.collect()
print(l)

count = 5
[1, 2, 3, 4, 5]


In [9]:
l = rdd.take(3)
print(l)

[1, 2, 3]


In [10]:
f_rdd = rdd.filter(lambda d: d > 2)
for d in f_rdd.collect():
    print(d)
print('filter count = {0}'.format(f_rdd.count()))

3
4
5
filter count = 3


## RDD Operations - map and reduce
- *rdd.map(func)*: create a new rdd by applying func to each element of an rdd
- *rdd.reduce(func)*: aggregate all elements of an rdd into a single value using func, which takes two arguments and returns one
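
Spark expects the function passed to reduce to be commutative and associative, because each partition is reduced on its own before the partial results are merged. The plain-Python sketch below imitates that two-step process without Spark. The helper reduce_by_partitions and the partition splits are assumptions for illustration only; real Spark chooses the partitioning and the merge order itself.

```python
from functools import reduce

def reduce_by_partitions(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [1, 2, 3, 4]
one_partition = [data]
two_partitions = [[1, 2], [3, 4]]

add = lambda a, b: a + b
sub = lambda a, b: a - b

# Addition is commutative and associative: the partitioning does not matter.
print(reduce_by_partitions(one_partition, add))   # 10
print(reduce_by_partitions(two_partitions, add))  # 10

# Subtraction is neither: the result depends on how the data is split.
print(reduce_by_partitions(one_partition, sub))   # ((1-2)-3)-4 = -8
print(reduce_by_partitions(two_partitions, sub))  # (1-2)-(3-4) = 0
```

Functions such as addition and multiplication, used in the cells of this section, are safe choices for reduce.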

In [11]:
data = ['line 1', '2', 'more lines', 'last line']

In [12]:
lines = sc.parallelize(data)

In [13]:
print(lines.collect())

['line 1', '2', 'more lines', 'last line']


In [14]:
lineLengths = lines.map(lambda line: len(line))
print(lineLengths.collect())

[6, 1, 10, 9]


In [15]:
totalLength = lineLengths.reduce(lambda a, b: a+b)
print(totalLength)

26


In [16]:
data = (1,2,3,4)
rdd = sc.parallelize(data)
rdd2 = rdd.map(lambda x: x*2)
print(rdd2.collect())
sum_val = rdd2.reduce(lambda a, b: a+b)
print('sum = {0}'.format(sum_val))
mul_val = rdd2.reduce(lambda a, b: a*b)
print('mul = {0}'.format(mul_val))

[2, 4, 6, 8]
sum = 20
mul = 384


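## RDD Operations - aggregate
- *rdd.aggregate(zeroValue, seqOp, combOp)*: fold the elements of each partition into an accumulator with seqOp, starting from zeroValue, then merge the per-partition accumulators with combOp. Unlike reduce, the result may have a different type from the elements.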

In [17]:
rdd.aggregate((0, 0),
              lambda acc, value: (acc[0]+value, acc[1]+1), 
              lambda acc1, acc2: (acc1[0]+acc2[0], acc1[1]+acc2[1]))

(10, 4)

In [18]:
lines.aggregate(("", 0),
                lambda a, s: (a[0]+s, a[1]+len(s)),
                lambda a, b: (a[0]+b[0], a[1]+b[1]))

('line 12more lineslast line', 26)
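
To make the roles of the three arguments concrete, here is a plain-Python sketch of what aggregate computes, written without Spark. The function aggregate below and the two-partition split are assumptions for illustration; Spark decides the actual partitioning. The (sum, count) pair it returns is the usual starting point for computing a mean.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

# Same functions as in the Spark cell: accumulate (sum, count).
seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

partitions = [[1, 2], [3, 4]]  # an assumed split of (1, 2, 3, 4)
total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
print((total, count))  # (10, 4)
print(total / count)   # 2.5
```

Because the zero value is applied once per partition and once more in the final merge, it should be neutral for both functions, as (0, 0) and ("", 0) are here.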

# Working with Text

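This example needs the data file 'star-wars.txt' in the content folder of this Colab, which is the notebook's working directory. The wget cell that follows downloads it there; alternatively, upload the file manually before running the remaining cells.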

In [19]:
!wget https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/code/week9_spark/star-wars.txt -O star-wars.txt

--2022-03-21 03:30:50--  https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/code/week9_spark/star-wars.txt
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kaopanboonyuen/2110446_DataScience_2021s2/main/code/week9_spark/star-wars.txt [following]
--2022-03-21 03:30:50--  https://raw.githubusercontent.com/kaopanboonyuen/2110446_DataScience_2021s2/main/code/week9_spark/star-wars.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 238956 (233K) [text/plain]
Saving to: ‘star-wars.txt’


2022-03-21 03:30:50 (8.56 MB/s) - ‘star-wars.txt’ saved [238956/238956]



In [20]:
sw = sc.textFile('star-wars.txt')
print('Total = {0} lines'.format(sw.count()))
for line in sw.take(10):
    print('{0}: [{1}]'.format(len(line), line))

Total = 7518 lines
0: []
35: [                          STAR WARS]
41: [                    !! PUBLIC  VERSION !!]
2: [  ]
49: [          �A long time ago, in a galaxy far, far ]
18: [          away...�]
0: []
55: [A vast sea of stars serves as the backdrop for the main]
55: [title.  War drums echo through the heavens as a rollup ]
28: [slowly crawls into infinity.]


In [21]:
nb_lines = sw.filter(lambda line: len(line) > 0)
print('Non blank line = {0} lines'.format(nb_lines.count()))
all_lowers = nb_lines.map(lambda line: line.lower())
for line in all_lowers.take(10):
    print('{0}: [{1}]'.format(len(line), line))

Non blank line = 4754 lines
35: [                          star wars]
41: [                    !! public  version !!]
2: [  ]
49: [          �a long time ago, in a galaxy far, far ]
18: [          away...�]
55: [a vast sea of stars serves as the backdrop for the main]
55: [title.  war drums echo through the heavens as a rollup ]
28: [slowly crawls into infinity.]
47: [          �it is a period of civil war.  rebel ]
45: [          spaceships, striking from a hidden ]


In [22]:
words = all_lowers.flatMap(lambda line: line.split())
for w in words.take(5):
    print(w)

star
wars
!!
public
version


In [23]:
mappers = words.map(lambda word: (word, 1))
counts = mappers.reduceByKey(lambda x, y: x+y)
for wc in counts.take(10):
    print(wc)

('star', 211)
('wars', 1)
('!!', 2)
('public', 1)
('version', 1)
('�a', 1)
('long', 31)
('time', 16)
('ago,', 1)
('in', 396)
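
The word count above chains three operations: flatMap, map, and reduceByKey. The plain-Python sketch below mirrors each step on a tiny made-up sample (not the real file) to show what each one contributes. It is an illustration of the semantics only and does not use Spark.

```python
from itertools import chain

lines = ["star wars", "a long time ago", "star"]  # made-up sample, not the real file

# flatMap: split every line, then flatten the per-line lists into one sequence.
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair each word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of equal keys with the function x + y.
counts = {}
for word, one in pairs:
    counts[word] = counts[word] + one if word in counts else one

print(words)
print(counts)
```

Since split() separates on whitespace only, punctuation stays attached to the words, which is why tokens such as 'ago,' and '!!' appear in the Spark output.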
