First, let's check whether `sc`, an instance of `SparkContext`, already exists. If it does, we stop it so that we start from a clean state; either way, we then create a new one.

In [1]:
from pyspark import SparkConf, SparkContext

APP_NAME = "pyspark"
SPARK_URL = "local[*]"
MAX_VALUE = 100

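# Skip the cleanup if `sc` was never set or was left as None by an earlier failed run
if 'sc' in dir() and sc is not None: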
    print("`sc` variable already exists with application ID:", sc.applicationId)
    print("Stopping...")
    sc.stop()
    print("Stopped.")
    sc = None

conf = SparkConf() \
    .setMaster(SPARK_URL) \
    .setAppName(APP_NAME)
print("Creating `sc` connection to %s" % (SPARK_URL))
sc = SparkContext(conf=conf)
print("`sc` variable created with application ID:", sc.applicationId)

Creating `sc` connection to local[*]
`sc` variable created with application ID: local-1493753058015


Let's create a function that tests whether a number is prime.

In [2]:
def is_prime(n):
    # Ensure that `n` is a non-negative integer
    n = abs(int(n))
    # 0 and 1 are not prime
    if n < 2:
        return False
    # 2 is the only even prime
    if n == 2:
        return True
    # All other even numbers are not prime
    if not n & 1:
        return False
    # Test all odd numbers between 3 and the square root of n
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True
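
The upper bound `int(n**0.5)` above goes through floating point, which can in principle be off by one for very large `n`. As a minimal sketch, assuming Python 3.8 or later, the same trial-division test can use `math.isqrt`, which returns the exact integer square root. The names `is_prime_isqrt` and `has_no_divisor` are hypothetical and not part of this notebook.

```python
import math

def is_prime_isqrt(n):
    """Hypothetical variant of is_prime that uses an exact integer square root."""
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    # math.isqrt(n) is the exact floor of the square root, with no float rounding
    for x in range(3, math.isqrt(n) + 1, 2):
        if n % x == 0:
            return False
    return True

def has_no_divisor(n):
    """Naive reference check: try every divisor from 2 to n - 1."""
    return n >= 2 and all(n % d for d in range(2, n))

# The two functions should agree on every small input
assert all(is_prime_isqrt(n) == has_no_divisor(n) for n in range(200))
print([n for n in range(20) if is_prime_isqrt(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
```

For the values used in this notebook (below 100), both versions behave identically, so this only matters if `MAX_VALUE` is raised by many orders of magnitude.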

In [3]:
# Create an RDD holding the integers 0 through MAX_VALUE - 1
numbers = sc.parallelize(range(MAX_VALUE))
print("There are %d numbers." % (numbers.count()))

There are 100 numbers.


In [4]:
prime_numbers = numbers.filter(is_prime)
print("There are %d prime numbers." % (prime_numbers.count()))

There are 25 prime numbers.


If you left `MAX_VALUE` set to 100, then you should have 25 prime numbers less than 100.

In [5]:
print(prime_numbers.collect())

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
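
To cross-check the list above without Spark, here is a minimal Sieve of Eratosthenes sketch in plain Python. It is a different technique from the trial division used in `is_prime`, so agreement between the two is a reasonable sanity check. The helper name `primes_below` is hypothetical and not part of this notebook.

```python
def primes_below(limit):
    """Return all primes strictly less than `limit` (hypothetical helper)."""
    if limit < 3:
        return []
    is_candidate = [True] * limit
    is_candidate[0] = is_candidate[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_candidate[p]:
            # Cross out the multiples of p, starting at p * p
            for multiple in range(p * p, limit, p):
                is_candidate[multiple] = False
    return [n for n, flag in enumerate(is_candidate) if flag]

expected = primes_below(100)
print(len(expected))   # 25
print(expected[-1])    # 97
```

If a Spark session is running, comparing `expected` with `prime_numbers.collect()` should show the same 25 values.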
