# Introduction



## Prerequisites

Installing Spark and Java in the VM


---



In [1]:
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.3.1
!wget -q https://apache.osuosl.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop2.tgz

In [2]:
ls -l # check the .tgz is there

total 267680
drwxr-xr-x 1 root root      4096 Dec  7 14:41 sample_data/
-rw-r--r-- 1 root root 274099817 Oct 15 10:53 spark-3.3.1-bin-hadoop2.tgz


In [3]:
# extract the archive
!tar xf spark-3.3.1-bin-hadoop2.tgz

In [4]:
!pip install -q findspark

Define the environment

In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop2"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start Spark Session

---


In [6]:
import findspark
findspark.init("spark-3.3.1-bin-hadoop2")  # SPARK_HOME, relative to /content

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("Introduction") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.3.1'

In [7]:
spark

In [8]:
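# Enable Arrow to speed up conversion between Spark and pandas DataFrames
# (the older key "spark.sql.execution.arrow.enabled" is deprecated since Spark 3.0)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")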

Download datasets

In [9]:
!mkdir -p /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/frankenstein.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/el_quijote.txt -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/characters.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/planets.csv -P /dataset
!ls /dataset

characters.csv	el_quijote.txt	frankenstein.txt  planets.csv


## RDDs

---



Example 1

In [10]:
textFile = spark.sparkContext.textFile("../dataset/frankenstein.txt")
textFile.first()

'FRANKENSTEIN'

Creating parallelized collections
While learning, a quick way to create an RDD from the shell is to parallelize an existing collection:

Example 2

In [11]:
distData = spark.sparkContext.parallelize([25, 20, 15, 10, 5])
distData.reduce(lambda x, y: x + y)

75
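
`reduce` combines elements pairwise, first inside each partition and then across the partial results, so the function passed to it should be associative and commutative. The plain-Python sketch below imitates that two-stage evaluation without Spark. The helper `reduce_in_partitions` and its chunking scheme are illustrative assumptions, not how Spark actually partitions or schedules work.

```python
from functools import reduce

def reduce_in_partitions(data, func, num_partitions=2):
    """Imitate RDD.reduce: reduce each chunk, then reduce the partial results.

    Hypothetical helper for illustration only; Spark chooses partitions itself.
    """
    # Split the data into contiguous chunks that stand in for partitions
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # Stage 1: reduce inside every non-empty partition
    partials = [reduce(func, part) for part in partitions if part]
    # Stage 2: combine the per-partition results
    return reduce(func, partials)

total = reduce_in_partitions([25, 20, 15, 10, 5], lambda x, y: x + y)
print(total)
```

With addition the result equals the 75 shown above however the data is chunked. A function that is not associative, such as subtraction, would give results that depend on the partitioning.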

What type of variable is distData?


In [12]:
type(distData)

pyspark.rdd.RDD

Exercise 1. Count the number of lines in the "el_quijote.txt" file



In [14]:
distDataQuijote = spark.sparkContext.textFile("../dataset/el_quijote.txt")
distDataQuijote.count()

2186

Exercise 2. Print the first line of the "el_quijote.txt" file



In [33]:
# first() returns only the first element, so the whole file
# does not have to be collected to the driver
print(distDataQuijote.first())

DON QUIJOTE DE LA MANCHA


## Transformations and Actions over RDDs

Example 3

In [35]:
textFile.count()  # number of elements (lines) in the RDD

7237

Example 4

In [36]:
# reduceByKey
# Count how many times each distinct line appears
lines = spark.sparkContext.textFile("../dataset/frankenstein.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b).cache()
counts.count()

6423

In [37]:
counts.take(5)  # take(5) limits the output without collecting the whole RDD

[('FRANKENSTEIN', 1),
 ('', 811),
 ('Letter 1', 1),
 ('commencement of an enterprise which you have regarded with such evil', 1),
 ('forebodings.  I arrived here yesterday, and my first task is to assure', 1)]
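
`reduceByKey` merges all values that share a key using the given function, which is why the empty line `''` ends up with a single count of 811. A minimal plain-Python sketch of that behaviour on one machine follows. `reduce_by_key` and `sample_lines` are hypothetical names for illustration; the real operation runs distributed across partitions.

```python
def reduce_by_key(pairs, func):
    """Merge the values of equal keys with func (hypothetical single-machine helper)."""
    merged = {}
    for key, value in pairs:
        # First occurrence stores the value; later ones are merged with func
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Made-up sample standing in for a few lines of the text file
sample_lines = ["Letter 1", "", "FRANKENSTEIN", "", ""]
line_counts = reduce_by_key(((s, 1) for s in sample_lines), lambda a, b: a + b)
print(line_counts)
```

The number of keys in the result corresponds to what `counts.count()` reports: the number of distinct lines, not the total number of lines.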

In [38]:
# sortBy (by frequency in this case, descending)
sortedByFreq = counts.sortBy(lambda x: x[1], ascending=False)
sortedByFreq.take(10)

[('', 811),
 ('To Mrs. Saville, England', 3),
 ('Your affectionate brother,', 2),
 ('me.', 2),
 ('FRANKENSTEIN', 1),
 ('Letter 1', 1),
 ('commencement of an enterprise which you have regarded with such evil', 1),
 ('forebodings.  I arrived here yesterday, and my first task is to assure', 1),
 ('my dear sister of my welfare and increasing confidence in the success', 1),
 ('of my undertaking.', 1)]

Example 5

In [39]:
# Filter

linesWithThe = textFile.filter(lambda line: "the" in line)
linesWithThe.count()

3712

In [40]:
linesWithThe.take(10)

['Or, The Modern Prometheus',
 'You will rejoice to hear that no disaster has accompanied the',
 'my dear sister of my welfare and increasing confidence in the success',
 'I am already far north of London, and as I walk in the streets of',
 'Petersburgh, I feel a cold northern breeze play upon my cheeks, which',
 'feeling?  This breeze, which has travelled from the regions towards',
 'and vivid.  I try in vain to be persuaded that the pole is the seat of',
 'frost and desolation; it ever presents itself to my imagination as the',
 'region of beauty and delight.  There, Margaret, the sun is forever',
 'visible, its broad disk just skirting the horizon and diffusing a']

In [41]:
linesWithSuccess = textFile.filter(lambda line: "success" in line)
linesWithSuccess.count()

17

In [42]:
linesWithSuccess.take(10)

['my dear sister of my welfare and increasing confidence in the success',
 'success, there will be none to participate my joy; if I am assailed by',
 'expect such success, yet I cannot bear to look on the reverse of the',
 'But success SHALL crown my endeavours.  Wherefore not?  Thus far I have',
 'success and into every minute detail of the measures I had taken to',
 'unexplored ocean of truth.  Those of his successors in each branch of',
 'unsuccessful, I attributed the failure rather to my own inexperience',
 'application equals your ability, I have no doubt of your success.',
 'imagination was too much exalted by my first success to permit me to',
 'attempts would at least lay the foundations of future success.  Nor']

Exercise 3. Get the frequency with which each word appears in the file "frankenstein.txt"

In [57]:
# Split each line into words, count each word, swap to (count, word),
# then sort by count in descending order
lines.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b) \
             .map(lambda x: (x[1], x[0])) \
             .sortByKey(False) \
             .take(10)

[(3897, 'the'),
 (3488, ''),
 (2904, 'and'),
 (2720, 'I'),
 (2634, 'of'),
 (2072, 'to'),
 (1629, 'my'),
 (1338, 'a'),
 (1072, 'in'),
 (992, 'was')]
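
The `''` entry in this result appears because `split(" ")` returns an empty string for blank lines and for every pair of consecutive spaces. Words that differ only in case or attached punctuation are also counted separately. One possible way to normalize the tokens is sketched below. `tokenize` and its regular expression are assumptions made for illustration, not part of the exercise, and using it would change the counts shown above.

```python
import re

# Assumed definition of a "word": a run of lowercase letters or apostrophes
WORD_RE = re.compile(r"[a-z']+")

def tokenize(line):
    """Lowercase the line and return its words, dropping spaces and punctuation."""
    return WORD_RE.findall(line.lower())

raw_tokens = "a  b".split(" ")        # consecutive spaces yield an empty token
clean_tokens = tokenize("The  success, the END.")
print(raw_tokens)
print(clean_tokens)
```

Because `flatMap` accepts any function that returns an iterable, a tokenizer like this could be passed in place of the `split(" ")` lambda.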

Exercise 4. Get the top 10 words with more than 4 characters



In [None]:
# It is hard, isn't it? So, let's use DataFrames instead :)

In [67]:
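# Keep every word, but give words of 4 characters or fewer a count of 0
# so that they sink to the bottom after sorting
lines.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1) if len(word) > 4 else (word, 0)) \
             .reduceByKey(lambda a, b: a + b) \
             .map(lambda x: (x[1], x[0])) \
             .sortByKey(False) \
             .take(10)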

[(540, 'which'),
 (187, 'could'),
 (177, 'would'),
 (174, 'their'),
 (152, 'should'),
 (130, 'these'),
 (122, 'before'),
 (107, 'might'),
 (105, 'myself'),
 (103, 'every')]

In [65]:
# Drop words of 4 characters or fewer before counting
lines.flatMap(lambda line: line.split(" ")) \
             .filter(lambda s: len(s) > 4) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b) \
             .map(lambda x: (x[1], x[0])) \
             .sortByKey(False) \
             .take(10)

[(540, 'which'),
 (187, 'could'),
 (177, 'would'),
 (174, 'their'),
 (152, 'should'),
 (130, 'these'),
 (122, 'before'),
 (107, 'might'),
 (105, 'myself'),
 (103, 'every')]
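
A result like this can be cross-checked on a small sample without Spark. The sketch below applies the same rule (split on single spaces, keep words longer than 4 characters, count, take the most frequent) in plain Python. `top_long_words` and `sample` are hypothetical names, and this is not the method used in the exercise.

```python
from collections import Counter

def top_long_words(lines, min_chars=4, n=10):
    """Return the n most frequent words that have more than min_chars characters."""
    words = (
        word
        for line in lines
        for word in line.split(" ")
        if len(word) > min_chars
    )
    return Counter(words).most_common(n)

# Made-up sample lines, not taken from the dataset
sample = ["which which could", "which could would the"]
top_two = top_long_words(sample, n=2)
print(top_two)
```

In Spark itself, `RDD.takeOrdered` with a key function is another way to obtain the top N pairs without sorting the whole RDD.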