Setting up PySpark can be challenging because of its dependencies, so this hands-on session uses a Docker container with a pre-built single-node PySpark setup. To start the container, run:

$ docker run -p 8888:8888 jupyter/pyspark-notebook

In [1]:
import pyspark

In [2]:
# Connect to Spark by creating a SparkContext, the entry point for building RDDs
# local[*] runs Spark in local mode on a single machine, not on a real cluster
# * tells Spark to use as many worker threads as there are logical cores on the machine
sc = pyspark.SparkContext(master='local[*]')

In [4]:
# Create an RDD from a local text file (one element per line)
txt = sc.textFile('file:///usr/share/doc/python3/copyright')
print(txt.count()) # count() is an action: it runs the job and returns only the number of lines

319


In [5]:
# Keep only the lines that mention "python" (case-insensitive)
python_lines = txt.filter(lambda line: 'python' in line.lower())
print(python_lines.count())

52
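
In Spark, filter() is a transformation and is evaluated lazily, while count() is an action that forces the work to happen. The sketch below models that split in plain Python with a generator; it is an analogy rather than Spark code, and the sample lines and the helper mentions_python are made up for illustration.

```python
# Plain-Python model of lazy evaluation; this is NOT Spark code.
lines = ["Python is fun", "Spark is fast", "python everywhere"]  # made-up sample data
calls = []  # records every time the predicate actually runs

def mentions_python(line):
    # Hypothetical helper standing in for the lambda passed to filter()
    calls.append(line)
    return "python" in line.lower()

# Like a transformation: building the generator runs no predicate yet.
lazy_matches = (line for line in lines if mentions_python(line))
calls_before = len(calls)

# Like an action: consuming the generator forces the predicate to run.
match_count = sum(1 for _ in lazy_matches)
calls_after = len(calls)

print(calls_before, calls_after, match_count)
```

The predicate has not run at all before the counting step, which is roughly why chaining several transformations on an RDD costs nothing until an action such as count() or take() is called.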


In [7]:
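# Create an RDD from an in-memory Python collection
big_list = range(10000)

# parallelize() turns a Python data structure into an RDD
rdd = sc.parallelize(big_list, 2) # distribute the data across 2 partitions

# Keep only the odd numbers
odds = rdd.filter(lambda x: x % 2 != 0)

# take(n) returns just the first n elements to the driver,
# which makes it handy for debugging without fetching the whole RDD
print(odds.take(5))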

[1, 3, 5, 7, 9]
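
To get a feel for what "2 partitions" means, the following plain-Python sketch splits the same range into two slices and filters each one independently. It assumes the input is cut into contiguous, near-equal slices; slice_bounds is a hypothetical helper, not part of the PySpark API, and the real layout may differ.

```python
# Plain-Python model of partitioning; this is NOT Spark code.
def slice_bounds(size, num_slices):
    """Return (start, end) index pairs for contiguous, near-equal slices (assumed layout)."""
    return [
        (i * size // num_slices, (i + 1) * size // num_slices)
        for i in range(num_slices)
    ]

big_list = range(10000)
bounds = slice_bounds(len(big_list), 2)

# Each slice plays the role of one partition.
partitions = [big_list[start:end] for start, end in bounds]

# The filter is applied to every partition on its own.
odds_per_partition = [sum(1 for x in part if x % 2 != 0) for part in partitions]

print(bounds)
print(odds_per_partition)
```

With a live SparkContext, rdd.getNumPartitions() and rdd.glom().collect() can be used to check how the data was actually distributed.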
