# Setting up PySpark in Colab
Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java (here, the headless OpenJDK 8 package).


In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we will download Apache Spark 3.5.3, prebuilt for Hadoop 3, from the Apache download server.

In [3]:
!wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

--2024-10-13 13:36:55--  https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400864419 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.3-bin-hadoop3.tgz’


2024-10-13 13:36:57 (183 MB/s) - ‘spark-3.5.3-bin-hadoop3.tgz’ saved [400864419/400864419]



Now, we just need to extract that archive.

In [4]:
!tar xf spark-3.5.3-bin-hadoop3.tgz

There is one last thing that we need to install: the findspark library. It locates Spark on the system and makes PySpark importable like a regular library.

In [5]:
!pip install -q findspark

Now that we have installed all the necessary dependencies in Colab, it is time to set the JAVA_HOME and SPARK_HOME environment variables. This will enable us to run PySpark in the Colab environment.

In [6]:
!ls /content/spark-3.5.3-bin-hadoop3

bin   data	jars	    LICENSE   NOTICE  R		 RELEASE  yarn
conf  examples	kubernetes  licenses  python  README.md  sbin


In [7]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
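Before moving on, it can help to confirm that both variables point at real directories. The following is a minimal sketch; the helper name `missing_home_dirs` is hypothetical and not part of findspark or PySpark.

```python
import os

def missing_home_dirs(names=("JAVA_HOME", "SPARK_HOME"), env=os.environ):
    """Return the variables that are unset or do not point to a directory."""
    # os.path.isdir("") is False, so an unset variable is reported as well.
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_home_dirs()
print("Missing or invalid:", problems if problems else "none")
```

An empty list means both paths exist; any name in the list is worth fixing before calling findspark.init().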

Time for the real test!

We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method.



In [8]:
import findspark
findspark.init()

Bonus – If you want to know the location where Spark is installed, use findspark.find()


In [9]:
findspark.find()

'/content/spark-3.5.3-bin-hadoop3'

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

You can give a name to the session using appName() and add some configurations with config() if you wish.



In [10]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

Finally, print the SparkSession variable.


In [11]:
spark

If you want to view the Spark UI, you need a few more lines of code to create a public URL for the UI page. Here ngrok tunnels port 4050 (the spark.ui.port set above), while ngrok's own local API is queried on port 4040.



In [12]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels

--2024-10-13 13:37:23--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 54.237.133.81, 54.161.241.46, 18.205.222.128, ...
Connecting to bin.equinox.io (bin.equinox.io)|54.237.133.81|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13921656 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2024-10-13 13:37:23 (71.1 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13921656/13921656]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


In [13]:
!curl -s http://localhost:4040/api/tunnels

{"tunnels":[],"uri":"/api/tunnels"}
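The response above contains an empty tunnel list, which means no tunnel was active when the API was queried. As a minimal sketch, the JSON can be parsed to pull out the public URL once a tunnel exists. The helper name `public_urls` and the sample payload with `https://example.invalid` are hypothetical and only illustrate the shape of the response.

```python
import json

def public_urls(api_response: str) -> list:
    """Extract the public_url of every tunnel listed in an /api/tunnels response."""
    payload = json.loads(api_response)
    return [t["public_url"] for t in payload.get("tunnels", []) if "public_url" in t]

# The exact response shown above: no tunnels yet.
empty = '{"tunnels":[],"uri":"/api/tunnels"}'
print(public_urls(empty))  # → []

# Hypothetical response with one active tunnel (illustrative values only).
sample = '{"tunnels":[{"public_url":"https://example.invalid","proto":"https"}],"uri":"/api/tunnels"}'
print(public_urls(sample))  # → ['https://example.invalid']
```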


# I. Apache Spark examples

## I.1. Spark DF example

In [15]:
# Create a Spark DF
df = spark.createDataFrame(
    [
        ("sue", 32),
        ("li", 3),
        ("bob", 75),
        ("heo", 13),
    ],
    ["first_name", "age"],
)

df.show()

+----------+---+
|first_name|age|
+----------+---+
|       sue| 32|
|        li|  3|
|       bob| 75|
|       heo| 13|
+----------+---+



In [16]:
# Add a column to a Spark DF
from pyspark.sql.functions import col, when

df1 = df.withColumn(
    "life_stage",
    when(col("age") < 13, "child")
    .when(col("age").between(13, 19), "teenager")
    .otherwise("adult"),
)

df1.show()

+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
|       sue| 32|     adult|
|        li|  3|     child|
|       bob| 75|     adult|
|       heo| 13|  teenager|
+----------+---+----------+



In [18]:
# df is unchanged
df.show()

+----------+---+
|first_name|age|
+----------+---+
|       sue| 32|
|        li|  3|
|       bob| 75|
|       heo| 13|
+----------+---+



In [19]:
# Filter a Spark DF
df1.where(col("life_stage").isin(["teenager", "adult"])).show()

+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
|       sue| 32|     adult|
|       bob| 75|     adult|
|       heo| 13|  teenager|
+----------+---+----------+



In [20]:
# Aggregation over the whole Spark DF (no grouping yet)
from pyspark.sql.functions import avg

df1.select(avg("age")).show()

+--------+
|avg(age)|
+--------+
|   30.75|
+--------+



In [21]:
# Group by aggregation on Spark DF: average age per life_stage

df1.groupBy("life_stage").agg(avg("age")).show()

+----------+--------+
|life_stage|avg(age)|
+----------+--------+
|     adult|    53.5|
|  teenager|    13.0|
|     child|     3.0|
+----------+--------+



In [22]:
# Query the DataFrame with SQL
spark.sql("select avg (age) from {df1}", df1=df1).show()


+--------+
|avg(age)|
+--------+
|   30.75|
+--------+



In [23]:
spark.sql("select life_stage, avg (age) from {df1} group by life_stage", df1=df1).show()

+----------+--------+
|life_stage|avg(age)|
+----------+--------+
|     adult|    53.5|
|  teenager|    13.0|
|     child|     3.0|
+----------+--------+



## I.2. Spark SQL example

In [24]:
df1.write.saveAsTable("some_people")

In [25]:
spark.sql('select * from some_people').show()

+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
|       sue| 32|     adult|
|        li|  3|     child|
|       bob| 75|     adult|
|       heo| 13|  teenager|
+----------+---+----------+



In [26]:
spark.sql("INSERT INTO some_people VALUES ('frank', 4, 'child')")
spark.sql('select * from some_people').show()

+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
|       sue| 32|     adult|
|        li|  3|     child|
|       bob| 75|     adult|
|       heo| 13|  teenager|
|     frank|  4|     child|
+----------+---+----------+



In [27]:
spark.sql('select * from some_people where life_stage="teenager"').show()

+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
|       heo| 13|  teenager|
+----------+---+----------+



## I.3. Spark structured streaming example

In [33]:
# TODO

## I.4. Spark RDD example

In [30]:
text_file = spark.sparkContext.textFile('data/some_text.txt')

In [31]:
counts = (text_file.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))

In [32]:
counts.collect()

[('these', 2),
 ('are', 2),
 ('words', 3),
 ('more', 1),
 ('in', 1),
 ('english', 1)]
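To make the three steps of the word count concrete, here is a plain-Python sketch of the same pipeline. The contents of `data/some_text.txt` are not shown in this notebook, so the `sample_lines` below are a hypothetical stand-in chosen to give the same counts as the output above.

```python
from collections import defaultdict

# Hypothetical stand-in for the lines of data/some_text.txt.
sample_lines = ["these are words", "these are more words", "words in english"]

# flatMap: split every line and flatten the pieces into one list of words.
words = [word for line in sample_lines for word in line.split(" ")]

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same key.
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n

word_counts = list(totals.items())
print(word_counts)
```

Unlike this sketch, Spark runs each step in parallel across partitions, and the order of the collected pairs is not guaranteed.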

# II. Practice with Spark RDD

## II.1. Create SparkContext object

In [35]:
# Spark Context object
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

## II.2. Create a RDD

In [36]:
iterator = [1, 2, 3, 4, 5]
lines = sc.parallelize(iterator)
lines

ParallelCollectionRDD[64] at readRDDFromFile at PythonRDD.scala:289

In [37]:
lines = sc.textFile("data/some_text.txt")
lines

some_text.txt MapPartitionsRDD[66] at textFile at NativeMethodAccessorImpl.java:0

## II.3. RDD Transformations

In [39]:
# map(f)
rdd = sc.parallelize([2, 3, 4])
rdd.map(lambda x: list(range(1, x))).collect()

[[1], [1, 2], [1, 2, 3]]

In [40]:
# flatMap(f)
rdd = sc.parallelize([2, 3, 4, 5])
rdd.flatMap(lambda x: list(range(1, x))).collect()

[1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
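The difference between the two transformations is that map keeps one output element per input element, while flatMap flattens the returned lists into a single sequence. A plain-Python sketch of that relationship follows; the helper name `expand` is hypothetical.

```python
from itertools import chain

def expand(x):
    """Same function as the lambda above: the integers from 1 up to x - 1."""
    return list(range(1, x))

data = [2, 3, 4]

# map: one result (here, a list) per input element.
mapped = [expand(x) for x in data]
print(mapped)  # → [[1], [1, 2], [1, 2, 3]]

# flatMap: the same results, concatenated into one flat list.
flat_mapped = list(chain.from_iterable(expand(x) for x in data))
print(flat_mapped)  # → [1, 1, 2, 1, 2, 3]
```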

In [41]:
# filter(f)
rdd = sc.parallelize(range(10))
rdd.filter(lambda x: x % 2 == 0).collect()

[0, 2, 4, 6, 8]

In [42]:
# distinct()
rdd = sc.parallelize([1, 1, 4, 2, 1, 3, 3])
rdd.distinct().collect()

[1, 4, 2, 3]

In [47]:
# sample()
rdd = sc.parallelize(range(100), 4)
rdd.sample(False, 0.1, 81).collect()

[4, 26, 39, 41, 42, 52, 63, 76, 80, 86, 97]

In [48]:
# union(otherRDD)
rdd1 = sc.parallelize(range(5))
rdd2 = sc.parallelize(range(3, 9))
rdd3 = rdd1.union(rdd2)
rdd3.collect()


[0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8]

In [49]:
# intersection(otherRDD)
rdd1 = sc.parallelize(range(5))
rdd2 = sc.parallelize(range(3, 9))
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()

[4, 3]

In [50]:
# subtract(otherRDD)
rdd1 = sc.parallelize(range(5))
rdd2 = sc.parallelize(range(3, 9))
rdd3 = rdd1.subtract(rdd2)
rdd3.collect()

[0, 2, 1]

In [51]:
# cartesian(otherRDD)
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize(["a", "b"])
rdd1.cartesian(rdd2).collect()


[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

## II.4. RDD Actions

In [52]:
# Get all values
rdd = sc.parallelize([1, 2, 3, 3])
rdd.collect()

[1, 2, 3, 3]

In [53]:
# Number of elements in the RDD
rdd = sc.parallelize([1, 3, 1, 2, 2, 2])
rdd.count()

6

In [54]:
# The count of each unique value in the RDD as a dictionary of {value: count} pairs.
rdd.countByValue()


defaultdict(int, {1: 2, 3: 1, 2: 3})

In [55]:
# Get the 2 smallest elements of an RDD, in ascending order
rdd = sc.parallelize([(3, 'a'), (1, 'b'), (2, 'd')])
rdd.takeOrdered(2)

[(1, 'b'), (2, 'd')]

In [56]:
# The reduce action
rdd = sc.parallelize([1, 2, 3])
rdd.reduce(lambda a, b: a + b)

6

In [57]:
# fold works like reduce but takes an initial "zero" value; with 0 the result matches reduce
rdd.fold(0, lambda a, b: a + b)

6

In [61]:
rdd = sc.parallelize([1, 2, 4], 2) # RDD with 2 partitions
"""
The RDD has 2 partitions: say [1, 2] and [4]
The zero value 2.5 is added once in each partition and once more when merging.
Sum within the partitions: 2.5 + (1 + 2) = 5.5 and 2.5 + (4) = 6.5
Sum over the partitions: 2.5 + (5.5 + 6.5) = 14.5
"""
rdd.fold(2.5, lambda a, b: a + b)

14.5

In [64]:
rdd = sc.parallelize([1, 2, 3], 5) # RDD with 5 partitions
"""
If there are more partitions than elements, some partitions are empty,
e.g. [1] [2] [3] [] []
The zero value 2 is still added once per partition, plus once for the final merge:
2 + (2 + 1) + (2 + 2) + (2 + 3) + (2 + 0) + (2 + 0) = 18
"""

rdd.fold(2, lambda a, b: a + b)


18
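The two fold results above can be reproduced without Spark. The sketch below is a plain-Python simulation, not Spark's implementation: `simulate_fold` is a hypothetical helper, and the partition layouts passed to it are assumed for illustration.

```python
from functools import reduce
from operator import add

def simulate_fold(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results, again from zero."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)

# Two partitions, zero value 2.5: 2.5 + (5.5 + 6.5)
print(simulate_fold([[1, 2], [4]], 2.5, add))  # → 14.5

# Five partitions (two of them empty), zero value 2
print(simulate_fold([[1], [2], [3], [], []], 2, add))  # → 18
```

For addition, the result is the plain sum plus the zero value times (number of partitions + 1), which is why a non-neutral zero value makes the answer depend on the partitioning.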

In [65]:
seqOp = lambda acc, x: (acc[0] + x, acc[1] + 1)
combOp = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])
sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)

(10, 4)

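Walkthrough, assuming the data is split into two partitions, [1, 2] and [3, 4]. Each accumulator is a pair (running sum, running count) and starts from the zero value (0, 0).

Partition 1 = [1, 2], applying seqOp:

(0, 0) with 1 --> (0 + 1, 0 + 1) = (1, 1)

(1, 1) with 2 --> (1 + 2, 1 + 1) = (3, 2)

Partition 2 = [3, 4], applying seqOp:

(0, 0) with 3 --> (0 + 3, 0 + 1) = (3, 1)

(3, 1) with 4 --> (3 + 4, 1 + 1) = (7, 2)

Merging the two partition results with combOp:

(3, 2) with (7, 2) --> (3 + 7, 2 + 2) = (10, 4)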
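The same walkthrough can be checked with a plain-Python simulation. This is a sketch rather than Spark's implementation: `simulate_aggregate`, `seq_op` and `comb_op` are hypothetical names, and the split into [1, 2] and [3, 4] is assumed.

```python
from functools import reduce

def simulate_aggregate(partitions, zero, seq_op, comb_op):
    """Apply seq_op inside each partition, then merge the partial results with comb_op."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)

# Accumulator layout: (running sum, running count)
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = simulate_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
print((total, count))  # → (10, 4)
print(total / count)   # → 2.5
```

Carrying the sum and the count together is what lets the mean be computed in a single pass over the data.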

In [66]:
# TODO

# III. Practice with Spark DF and SQL

## III.1. Spark DF

In [67]:
from pyspark.sql import Row
row1 = Row(name="John", age=21)
row2 = Row(name="James", age=32)
row3 = Row(name="Jane", age=18)
row1['name']


'John'

In [68]:
df = spark.createDataFrame([row1, row2, row3])
df

DataFrame[name: string, age: bigint]

In [69]:
df.show()

+-----+---+
| name|age|
+-----+---+
| John| 21|
|James| 32|
| Jane| 18|
+-----+---+



In [70]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [71]:
df.rdd.getNumPartitions()

1

In [72]:
rows = [
    Row(name="John", age=21, gender="male"),
    Row(name="James", age=25, gender="female"),
    Row(name="Albert", age=46, gender="male"),
]
df = spark.createDataFrame(rows)
df.show()


+------+---+------+
|  name|age|gender|
+------+---+------+
|  John| 21|  male|
| James| 25|female|
|Albert| 46|  male|
+------+---+------+



In [73]:
column_names = ["name", "age", "gender"]
rows = [
    ["John", 21, "male"],
    ["James", 25, "female"],
    ["Albert", 46, "male"],
]
df = spark.createDataFrame(rows, column_names)
df.show()


+------+---+------+
|  name|age|gender|
+------+---+------+
|  John| 21|  male|
| James| 25|female|
|Albert| 46|  male|
+------+---+------+



In [74]:
column_names = ["name", "age", "gender"]
rdd = sc.parallelize([
    ("John", 21, "male"),
    ("James", 25, "female"),
    ("Albert", 46, "male"),
])
df = spark.createDataFrame(rdd, column_names)
df.show()


+------+---+------+
|  name|age|gender|
+------+---+------+
|  John| 21|  male|
| James| 25|female|
|Albert| 46|  male|
+------+---+------+



In [75]:
df.schema

StructType([StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('gender', StringType(), True)])

In [76]:
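# Import only the types that are used instead of a wildcard import
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Explicit schema: age is stored as a 32-bit integer instead of the inferred long
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
])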
rows = [("John", 21, "male")]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()


root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)

+----+---+------+
|name|age|gender|
+----+---+------+
|John| 21|  male|
+----+---+------+

