**Install Apache Spark**

In [2]:
# Install Java 8 (Spark runs on the JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download Spark 3.0.1 prebuilt for Hadoop 3.2
!wget -q http://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
# Extract the archive
!tar xf spark-3.0.1-bin-hadoop3.2.tgz
# Install findspark, which locates the Spark installation for Python
!pip install -q findspark

**Set Environment Variables**

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
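
If either path is wrong, `findspark.init()` fails later with a less obvious error, so it can help to check the variables first. The sketch below is one way to do that; `missing_paths` is a hypothetical helper, not part of Spark or findspark, and it only assumes that both variables should point at existing directories.

```python
import os

def missing_paths(env, names):
    """Return the names whose value is unset or is not an existing directory.

    Hypothetical helper for illustration; `env` is any mapping, such as os.environ.
    """
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# In the notebook, check the two variables set in the cell above.
problems = missing_paths(os.environ, ["JAVA_HOME", "SPARK_HOME"])
print("Missing or invalid:", problems if problems else "none")
```

An empty result means both directories exist; any name that is listed should be corrected before calling `findspark.init()`.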

**Test the Spark installation and version**
- The installed Spark version should be 3.0.1.

In [4]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Smoke test: build a 1,000-row DataFrame with a single "hello" column and show three rows
df = spark.createDataFrame([("world",) for _ in range(1000)], ["hello"])
df.show(3, truncate=False)



+-----+
|hello|
+-----+
|world|
|world|
|world|
+-----+
only showing top 3 rows



We can also check the PySpark version:

In [5]:
import pyspark
print(pyspark.__version__)

3.0.1


The environment is now set up, so let's get into the basics. `pyspark` was already imported in the previous step, so we can proceed directly.

## Spark Basics

### Creating RDDs
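With `pyspark` imported, we can get a SparkContext. Because a SparkSession is already running, `getOrCreate()` returns the context that session uses rather than starting a new one.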

In [6]:
sc = pyspark.SparkContext.getOrCreate()

**Example 1: creating an RDD from a list of numbers**

In [8]:
data = list(range(1, 10))
print(data)

[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [9]:
myRDD = sc.parallelize(data)

In [10]:
print(myRDD.collect())

[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [11]:
print(myRDD.count())

9


**Example 2: creating an RDD from key-value pairs (tuples)**

In [12]:
kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
print(kv)

[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]


In [13]:
rdd2 = sc.parallelize(kv)
print(rdd2.collect())

[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]


In [14]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)
print(rdd3.collect())

[('b', 6), ('c', 10), ('a', 9)]
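
To see what `reduceByKey` computed here, the following plain-Python sketch folds the values of each key with the same lambda. `reduce_by_key` is a hypothetical stand-in written for illustration, not a Spark API. It runs in a single process, whereas Spark combines values within each partition first and then merges the partial results, which is why the function passed to `reduceByKey` should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Plain-Python sketch of reduceByKey: fold the values of each key with func."""
    result = {}
    for key, value in pairs:
        # The first value seen for a key starts the fold; later values are combined into it.
        result[key] = func(result[key], value) if key in result else value
    return result

kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
totals = reduce_by_key(kv, lambda x, y: x + y)
print(sorted(totals.items()))  # [('a', 9), ('b', 6), ('c', 10)]
```

The totals match the Spark output above. The key order in `collect()` is not sorted, so sort the result when a stable order matters.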


In [15]:
rdd4 = rdd2.groupByKey()
print(rdd4.collect())

[('b', <pyspark.resultiterable.ResultIterable object at 0x7f37f79d2828>), ('c', <pyspark.resultiterable.ResultIterable object at 0x7f37f79d2588>), ('a', <pyspark.resultiterable.ResultIterable object at 0x7f37f79d24e0>)]


In [16]:
rdd4.map(lambda x: (x[0], list(x[1]))).collect()

[('b', [2, 4]), ('c', [1, 2, 3, 4]), ('a', [7, 2])]
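
Unlike `reduceByKey`, `groupByKey` does not combine values; it only collects them per key, which is why each `ResultIterable` had to be converted to a list before it printed usefully. A plain-Python sketch of the same grouping, using a hypothetical `group_by_key` helper rather than Spark, might look like this:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Plain-Python sketch of groupByKey: collect every value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

kv = [('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
grouped = group_by_key(kv)
print(sorted(grouped.items()))  # [('a', [7, 2]), ('b', [2, 4]), ('c', [1, 2, 3, 4])]
```

In PySpark, `rdd4.mapValues(list)` should give the same result as the `map` call above while leaving the keys untouched.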