In [1]:
# Uncomment the next line to install PySpark if it is not already installed
#!pip install pyspark



To start using PySpark, we need some boilerplate code. Don't worry about what it means for the moment. It is the same for every Spark program we write, except for the argument to setAppName, which is a name you give your program. Run it only once per Jupyter notebook session, because creating a second SparkContext while one is already running raises an error.

In [2]:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("abc")
sc = SparkContext(conf=conf)

The central object of PySpark is the Resilient Distributed Dataset, or RDD. You can think of an RDD as roughly like a Python list whose elements may be spread across several machines. Each element in an RDD is conventionally called a "line". For example, we can create a four-line RDD by typing

In [3]:
rdd = sc.parallelize([1,2,3,5])

All operations on an RDD fall into two main types: transformations and actions. A transformation only describes a new RDD; PySpark performs no calculation until an action is called, which avoids unnecessary work. For example, the <code>collect</code> function is an action that returns the contents of the RDD as a Python list:

In [4]:
rdd.collect()

[1, 2, 3, 5]
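
The "nothing runs until an action" behaviour can be imitated in plain Python, because the built-in `map` is also lazy. The sketch below is only an analogy and does not use Spark; the names `double`, `evaluated` and `before` are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation (this does not use Spark).
evaluated = []  # records which elements have actually been processed

def double(x):
    evaluated.append(x)
    return 2 * x

# Like a transformation: builds a lazy iterator, nothing is computed yet.
lazy = map(double, [1, 2, 3, 5])
before = list(evaluated)
print(before)      # []

# Like an action: forces the computation and returns concrete results.
result = list(lazy)
print(result)      # [2, 4, 6, 10]
print(evaluated)   # [1, 2, 3, 5]
```

In the same way, an RDD transformation such as `map` only records what should be done, and an action such as `collect` triggers the work.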

In this course, I will be using <code>collect</code> a lot to explain what a step does. However, keep in mind that in a real big data project, actions like <code>collect</code> should be used sparingly, because <code>collect</code> brings the entire dataset back to the driver program, which can be slow or exhaust memory when the data is large.

In a real big data project, the data usually comes in a separate file, often a CSV file, so we will not use the parallelize function much. In the following code we load data from a separate file into an RDD, then use standard Python slicing to show the first 5 lines:

In [5]:
lines = sc.textFile("fakefriends.csv")
lines.collect()[0:5] 

['0,Will,33,385',
 '1,Jean-Luc,26,2',
 '2,Hugh,55,221',
 '3,Deanna,40,465',
 '4,Quark,68,21']
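
For a quick look at large data, the RDD method `take(n)` is a lighter alternative to collecting everything and slicing, since it returns only the first `n` elements to the driver. A minimal sketch, where `preview` is a hypothetical helper name and not part of PySpark:

```python
def preview(rdd, n=5):
    """Return the first n elements of an RDD without collecting all of it.

    `preview` is a made-up helper; it relies only on the RDD method take(n).
    """
    return rdd.take(n)
```

With the RDD from the cell above, `preview(lines)` should give the same five lines as `lines.collect()[0:5]`.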

<p> Now we want to extract some information from this RDD. This CSV file is a dataset about how many friends each of somebody's friends has. The first column is the index, the second the friend's name, the third the friend's age, and the fourth the number of friends that friend has. For example, the first friend, Will, is 33 years old and has 385 friends himself. </p>
<p> Let's say we want to know the average number of friends a friend has at each given age. The names and indices are irrelevant to that question, so we first discard them. To transform an RDD, we first write a Python function that transforms a single line of the RDD. Each line is a string containing fields separated by commas, which suggests using the split function to turn the string into a list of separate items: </p>

In [6]:
def parseLine(line):
    fields = line.split(",")
    age = int(fields[2])
    num = int(fields[3])
    return (age,num)

parseLine(lines.collect()[0]) #Trying out our defined function on the first line

(33, 385)
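
The function above assumes every line has four well-formed fields. If a file might contain malformed rows (an assumption, since none appear in the sample shown), a more defensive variant could look like the sketch below. The name `parseLineSafe` is hypothetical.

```python
def parseLineSafe(line):
    """Return (age, num) for a well-formed line, or None for a malformed one."""
    fields = line.split(",")
    if len(fields) < 4:
        return None  # too few columns
    try:
        return (int(fields[2]), int(fields[3]))
    except ValueError:
        return None  # age or friend count is not an integer

print(parseLineSafe("0,Will,33,385"))  # (33, 385)
print(parseLineSafe("bad,row"))        # None
```

In Spark, the `None` results could then be dropped with something like `lines.map(parseLineSafe).filter(lambda p: p is not None)`.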

The code to apply that function to all lines in the RDD, and create a new RDD, is:

In [7]:
rdd = lines.map(parseLine)

Now each line in the RDD has the format (age, number of friends). This very common format is called a <b>key-value pair</b>: the first element of the pair is the key and the second is the value. We want to add up all the values with the same key, then divide that sum by the number of times the key appears. One neat trick is to first transform the RDD into a new RDD of the format (age, (num, 1)), so that the count can be summed alongside the number of friends. Since this is a much simpler operation, we use a Python lambda function instead of a named function. The code is as follows:

In [8]:
rdd = rdd.mapValues(lambda x: (x,1))
rdd.collect()[0]  # show the first line

(33, (385, 1))

Remember that transformation functions in PySpark do not change an RDD in place, but create a new one. When we do not change the key of a key-value pair, as in this case, we should use the <code>mapValues</code> function, because it lets Spark keep the existing partitioning of the data, which can be more efficient. Now we can add up all the values with the same key in the RDD. When we want to combine all key-value pairs that share a key, we usually use the <code>reduceByKey</code> function. The following code might be a little harder to understand, but in the end it produces a new RDD of key-value pairs, with the first element of the value being the total number of friends and the second the number of occurrences:

In [9]:
totalsByAge = rdd.reduceByKey(lambda x,y: (x[0]+y[0],x[1]+y[1]))
totalsByAge.collect()[0]

(33, (3904, 12))
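
To see what the lambda passed to `reduceByKey` does, the same combining step can be imitated in plain Python. This is a sketch of the semantics only, not of how Spark executes it, and the input pairs are made up for illustration rather than taken from fakefriends.csv.

```python
from functools import reduce

# Made-up (age, (num, 1)) pairs for illustration only.
pairs = [(33, (385, 1)), (26, (2, 1)), (33, (100, 1)), (33, (15, 1))]

def combine(x, y):
    # Same logic as the lambda above: add the friend counts and the occurrence counts.
    return (x[0] + y[0], x[1] + y[1])

# Group the values by key, then fold each group pairwise with combine.
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

totals = {key: reduce(combine, values) for key, values in grouped.items()}
print(totals)  # {33: (500, 3), 26: (2, 1)}
```

Spark does not guarantee the order in which values are combined, so the function given to `reduceByKey` should be associative and commutative, as addition is here.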

For example, there are 12 friends aged 33, and together they have 3904 friends. Finding the average number of friends for each age is therefore simply a matter of dividing the first number in the value by the second:

In [10]:
averageByAge = totalsByAge.mapValues(lambda x: x[0]/x[1])
averageByAge.collect()

[(33, 325.3333333333333),
 (26, 242.05882352941177),
 (55, 295.53846153846155),
 (40, 250.8235294117647),
 (68, 269.6),
 (59, 220.0),
 (37, 249.33333333333334),
 (54, 278.0769230769231),
 (38, 193.53333333333333),
 (27, 228.125),
 (53, 222.85714285714286),
 (57, 258.8333333333333),
 (56, 306.6666666666667),
 (43, 230.57142857142858),
 (36, 246.6),
 (22, 206.42857142857142),
 (35, 211.625),
 (45, 309.53846153846155),
 (60, 202.71428571428572),
 (67, 214.625),
 (19, 213.27272727272728),
 (30, 235.8181818181818),
 (51, 302.14285714285717),
 (25, 197.45454545454547),
 (21, 350.875),
 (42, 303.5),
 (49, 184.66666666666666),
 (48, 281.4),
 (50, 254.6),
 (39, 169.28571428571428),
 (32, 207.9090909090909),
 (58, 116.54545454545455),
 (64, 281.3333333333333),
 (31, 267.25),
 (52, 340.6363636363636),
 (24, 233.8),
 (20, 165.0),
 (62, 220.76923076923077),
 (41, 268.55555555555554),
 (44, 282.1666666666667),
 (69, 235.2),
 (65, 298.2),
 (61, 256.22222222222223),
 (28, 209.1),
 (66, 276.44444444444
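
The list returned by `collect` is an ordinary Python list, and its order is not sorted by age. A small sketch of tidying it up for reading, using only the first three pairs copied from the output above; the names `results` and `formatted` are illustrative:

```python
# First three (age, average) pairs copied from the output above.
results = [
    (33, 325.3333333333333),
    (26, 242.05882352941177),
    (55, 295.53846153846155),
]

# Sort by age (the first element of each pair) and round for display.
formatted = [f"{age}: {avg:.2f}" for age, avg in sorted(results)]
for row in formatted:
    print(row)
# 26: 242.06
# 33: 325.33
# 55: 295.54
```

For data too large to collect, the sorting could instead be done in Spark before collecting, for example with the RDD method `sortByKey`.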

This is the final result we are looking for. It does not mean much, since the dataset is fake and randomly generated, but I hope it gives a good introduction to the general ideas of RDDs and a few commonly used functions. Putting all the steps together in a complete program, minus the collect calls in the intermediate steps, may look like the following:

In [11]:
def parseLine(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

lines = sc.textFile("fakefriends.csv")
rdd = lines.map(parseLine)
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
results = averagesByAge.collect()

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=abc, master=local) created by __init__ at <ipython-input-2-f696d91b152a>:3 
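
The ValueError above is raised when a new SparkContext is created while one is already running in the same session, for example when the setup cell is run a second time. One way to make the setup safe to re-run, sketched here with the same master and app name as earlier, is to ask for the existing context instead of always constructing a new one:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("abc")

# Reuses the running SparkContext if there is one; otherwise creates it from conf.
sc = SparkContext.getOrCreate(conf=conf)
```

If a context is reused, its existing settings stay in effect. To start over with a different configuration, calling `sc.stop()` first shuts the running context down.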