<img style="float: left"  src="images/python.png">
<img style="float: right" src="images/surfsara.png">

<hr style="clear: both" />

# Python introduction

We present a very brief introduction to Python here before we start working with Spark.

Python is relatively easy to learn. Of course, you can do complicated things with it, but we focus on the basics here and assume that you have seen similar constructs in other programming languages.

We focus on the aspects of the language that are important for working with Spark during the rest of the day. As such, we don't discuss control structures and more complex data types.

_You can edit the cells below and execute the code by selecting a cell and pressing Shift-Enter. Code completion is available through the Tab key._

## Variables and dynamic typing
In the cell below we assign an integer value to the variable `x`, and perform some calculations:

In [None]:
# Assign some value to x and print x

x = 123456
print(x)

# Do some calculations, assign the result to y, and print
y = 5 * x / 3.5
print(y)

Each variable in Python is an object, and each object has a type. We can get the types of `x` and `y` and print them:

In [None]:
print(type(x))
print(type(y))

As you can see, you do not explicitly have to specify the type of the variable. Python will automatically infer that the value assigned to `x` is an integer, and that the result of the expression assigned to `y` will be a floating point value.

Because Python is _dynamically typed_ we can also assign to the same variable a value of a different type:

In [None]:
y = 78
print(type(y))
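As a small additional sketch of dynamic typing, the cell below binds one name to values of three different types in turn. The names `value`, `first_type`, `second_type` and `third_type` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
# 'value' is a hypothetical name used only in this sketch
value = 10
first_type = type(value)     # int

value = 10 / 4               # the / operator produces a float here
second_type = type(value)    # float

value = 'ten'
third_type = type(value)     # str

print(first_type, second_type, third_type)
```

The type belongs to the object that the name currently refers to, not to the name itself.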

## Strings

Next we assign a string to the variable `s`. Strings can be thought of as sequences of characters. We can access individual characters of the string by their index, a technique called _indexing_:

In [None]:
# Assign s to a string, using single quotes (or double quotes)
s = 'abcdABCD'

# Print the first letter of s - remember, Python starts counting at 0, not 1!
print(s[0])

# Print the third letter of s
print(s[2])

So far, we have only accessed single characters. We can extract substrings of `s` by specifying a range of indices, a technique called _slicing_:

In [None]:
# Slice of s from offsets 1 through 2 (3 not included)
print(s[1:3])

# Slice of the first four characters. Note that we don't need to specify the start index in the range
print(s[:4])

# Slice of the fifth through eighth characters. Note that we don't need to specify the end index in the range
print(s[4:])

We can use negative indices to slice from the end of the string (or any sequence, as we'll see later):

In [None]:
# Last letter of s
print(s[-1])

# Last three letters of s
print(s[-3:])

## Exercise 1
In this exercise, please print all characters of `s`, except the last letter. Part of the code has already been filled in, and you will need to replace `<FILL IN>` with the appropriate statement:

In [None]:
# Print s, except the last letter
print(<FILL IN>)

We can find the length of a string with Python's `len` function:

In [None]:
length = len(s)
print(length)
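With `len` we can also see how the negative indices from earlier relate to ordinary ones. The following is a minimal sketch; the string `letters` and the other names are made up for this illustration and are separate from `s`.

```python
# 'letters' is a hypothetical example string, separate from s above
letters = 'abcdef'

# A negative index counts from the end: -1 is the last position
last = letters[-1]
same_as_last = letters[len(letters) - 1]

# The same holds for the start of a slice
tail = letters[-2:]
same_as_tail = letters[len(letters) - 2:]

print(last, same_as_last)   # f f
print(tail, same_as_tail)   # ef ef
```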

Strings (and lists, as we'll see) can be concatenated with the `+` operator:

In [None]:
a = 'abcd'
b = 'ABCD'
c = a + b
print(c)

If we want to append something to a string that isn't a string itself (for example, an integer), we'll need to convert it to a string first with Python's builtin `str` function. If we don't do this, Python will complain:

In [None]:
# This will fail. You'll see an exception appear in the notebook.
print('abcd' + 8)

In [None]:
# Using str, instead:
print('abcd' + str(8))

## Exercise 2
Concatenate the variables `a`, `b` and `c` (in that order) and calculate the resulting length. Print the concatenated result and its length in a single print statement (the output of the `print` function should be a single line).

In [None]:
a = 'abcd'
b = 8
c = 1.4

<FILL IN>

Everything in Python is an object and every object has a type. We used the builtin `type` function to find out an object's type at the beginning of the notebook. Each type can have zero or more _methods_, which are functions that are members of this type. To _call_ a method on an object, we use _dot notation_:

In [None]:
s = 'this is a sentence'
print(s.split())

In [None]:
print(s.capitalize())

We can use Jupyter's code completion by pressing Tab after the dot. In the cell below, put the cursor after the dot and press Tab. Then select the `upper` method. What do you think this method does? Make sure you add `()` after it and execute the cell. All method calls end with parentheses, possibly with arguments between the parentheses.

In [None]:
# Put the cursor behind . then press TAB, select the upper method and add ()

s.upper()

We used the `split` method on a string as `s.split()`. Suppose we want to split a string on commas instead of white space. We can provide the character on which to split as an argument to `split`. The following cell illustrates this.

In [None]:
s = "this,is,a,text"
print(s.split(","))

Methods are very similar to functions. The most important difference is that they are associated with a class or object (the thing to the left of the dot).
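To make the link between a method and its type a bit more concrete, here is a small sketch. It calls the same method once through the object and once through the type `str`. The names `word`, `via_object` and `via_type` are hypothetical and only used here.

```python
# 'word' is a hypothetical example string
word = 'spark'

# Calling the method on the object with dot notation ...
via_object = word.upper()

# ... gives the same result as looking the method up on the type
# and passing the object as the first argument
via_type = str.upper(word)

print(via_object, via_type)   # SPARK SPARK
```

In everyday code the first form is the one you will see and use.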

One potentially confusing function that is not a method is the `len()` function. You might expect to get the number of elements in a string by using `s.len()`, but this will not work in Python:

In [None]:
s.len()

To get an overview of all methods available on an object, use Python's builtin `help` function with a type:

In [None]:
help(str)

## Lists
Lists are ordered collections that can contain elements of any type. In the example below, we store an integer, a string and a floating point number in a single list. We can use `len` again to get the length of the list:

In [None]:
l = [123, 'some string', 1.666]

print('the length of l is: ' +  str(len(l)))

We create two lists and concatenate them using the `+` operator:

In [None]:
l1 = ['a', 'b', 'c']
l2 = ['D', 'E', 'F']
l3 = l1 + l2

print(l3)

Lists come with a number of methods, e.g. `append` and `reverse`. Again, since these are methods we will be using `.` (dot) notation to invoke them:

In [None]:
print('l3 before reverse: ' + str(l3))

l3.reverse()

print('l3 after reverse: ' + str(l3))

Similarly we can append an element to the end of a list by calling the `append` method:

In [None]:
print('l3 before append: ' + str(l3))

l3.append('G')

print('l3 after append: ' + str(l3))
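One detail that often surprises newcomers: methods such as `reverse` and `append` change the list itself and return `None`. The sketch below uses a made-up list called `letters_list` so that `l3` is left alone.

```python
# 'letters_list' is a hypothetical list used only in this sketch
letters_list = ['a', 'b', 'c']

# reverse changes the list in place and returns None
returned = letters_list.reverse()

print(returned)       # None
print(letters_list)   # ['c', 'b', 'a']
```

So writing something like `l3 = l3.reverse()` would replace the list by `None`, which is rarely what you want.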

Slicing lists is identical to slicing strings:

In [None]:
mylist = ['a', 2.5, 6, 'a word', [6,7,8]]

first = mylist[0]
print('the first element is: ' + first)

# The second element of mylist
print('the second element is: ' + str(mylist[1]))

# Slice of mylist from offsets 1 through 2 (3 not included)
print('the second and third element are: ' + str(mylist[1:3]))

# Print the last element of mylist
print('the last element is: ' + str(mylist[-1]))

The last element of `mylist` above is itself a list. To select the first element of this list we can do the following:

In [None]:
mylist[4][0]

## Tuples
Tuples are very much like _lists_, except that they are _immutable_: they cannot be changed. Tuples are often denoted by the parentheses `(` and `)`. This can sometimes be confusing to people new to Python. Depending on the context the parentheses can be omitted.

Tuples behave very much like lists, as is shown below.

In [None]:
my_tuple = (1, 2, 3, 4)

print('the length of my_tuple: ' + str(len(my_tuple)))
print('the first element of my_tuple: ' + str(my_tuple[0]))
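The remark that the parentheses can sometimes be omitted is easier to see in code. A minimal sketch, with names (`point`, `same_point`, `single`, `not_a_tuple`) invented for this illustration:

```python
# Parentheses are optional here: the commas are what make a tuple
point = 1, 2
same_point = (1, 2)

# A tuple with a single element needs a trailing comma
single = (5,)
not_a_tuple = (5)   # just the integer 5 in parentheses

print(point == same_point)                # True
print(type(single), type(not_a_tuple))
```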

Tuples are immutable, so we cannot replace existing elements:

In [None]:
my_tuple[2] = 9

Tuples also lack the `append` method to add elements:

In [None]:
my_tuple.append(5)
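If you do need a changed version of a tuple, the usual way out is to build a new one. The sketch below shows two options under the assumption that a copy is acceptable. The tuple `my_numbers` and the other names are hypothetical and separate from `my_tuple`.

```python
my_numbers = (1, 2, 3, 4)      # hypothetical tuple, separate from my_tuple

# Concatenation builds a new tuple; the original is left untouched
extended = my_numbers + (5,)

# Converting to a list gives a mutable copy that can be changed and converted back
as_list = list(my_numbers)
as_list[2] = 9
replaced = tuple(as_list)

print(my_numbers)   # (1, 2, 3, 4)
print(extended)     # (1, 2, 3, 4, 5)
print(replaced)     # (1, 2, 9, 4)
```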

We can build lists whose elements are tuples and select those tuples by their index. Note that both lists and tuples can contain elements of various types.

In [None]:
# A list of tuples
tuple_list = [('a', 'b'), (3, 4), ('Z', 42)]

# Select the first tuple in the list
print('the first tuple is: ' + str(tuple_list[0]))

In [None]:
# Select the first element of the first tuple of the list
print('the first element of the first tuple is: ' + tuple_list[0][0] )

## Exercise 3
Given `tuple_list` as defined in the previous cell, print the second element of the third tuple in the list.
The answer should be 42.

In [None]:
# Print the second element of the third tuple
print(<FILL IN>)

## Functions
You can define functions in Python, similar to other languages. We assume you are familiar with the concept of a function in programming languages. You define a function by using the `def` keyword and a name for the function, followed by arguments in parentheses and `:`. The keyword `return` is used to return the value.

In the cell below we define a function that we call `times`. It has two arguments and returns the product of these arguments.

In [None]:
# We define a function called times, which takes two integers and returns their product
def times(x, y):
    return x * y

p = times(3, 2)

print('the product of 3 and 2 is: ' + str(p))
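Because Python is dynamically typed, a function like `times` does not check the types of its arguments. It works for any pair of values for which `*` is defined. The sketch below uses a separately named copy, `multiply`, which is a hypothetical name chosen so the cell is self-contained.

```python
# 'multiply' is a hypothetical stand-in with the same body as times
def multiply(x, y):
    return x * y

print(multiply(3, 2))      # 6
print(multiply(1.5, 2))    # 3.0
print(multiply('ab', 3))   # ababab (a string times an integer repeats the string)
```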

Below we show another example of a function. We call it `plural` and assume that the input argument is a string.
In the body (the code after `:`) we add an `s` after the input argument `x`. The result is stored in the variable `y` and returned by the function.

In [None]:
def plural(x):
    y = x + 's'
    return y

plural('cat')

### Built-in functions
You have seen how you can define and call your own functions. Python comes with a number of predefined functions, called builtin functions. We have already seen three of them, namely the `print()`, `len()` and `type()` functions.

## Exercise 4
Write a function that takes a tuple and returns the first and last element of the tuple, together as a new tuple:

In [None]:
def first_and_last(t):
    <FILL IN>
    
first_and_last((1, 2, 3))  # Should return (1, 3)

## Exercise 5
Write a function that takes a list and returns the number of `1`s in the list. **Hint**: use Python's `help` function to see the list of methods the list type has. You'll need to use one of them.

In [None]:
def count_ones(l):
    <FILL IN>
    
count_ones([1, 1, 2, 3, 4, 1])  # Should return 3

## Lambda functions
There is another way of defining functions, called _lambda_ or _anonymous_ functions. The term lambda comes from the field of lambda calculus, which is a branch of mathematics specialised in the logic of computation. Lambda functions play a large role in functional programming languages, and we will use them frequently when working with Spark.

Both MapReduce and Spark have taken their inspiration from functional programming and hence understanding lambda functions will help you to understand MapReduce and Spark.

Lambda functions are anonymous functions, that is, functions without a name. The keyword `lambda` simply denotes that a function is defined. What follows is the list of arguments, a `:`, and a single expression whose value is returned by the function. There is no `return` statement.

Finally, both lambda functions and functions specified by `def` can be assigned to a variable, which then is used as the name of the function.

Let's look at an example:

In [None]:
# This lambda function has two arguments x and y which are multiplied
#  Note that there is no return statement and that the function is assigned to the variable l_times
#  The : separates the arguments from the body of the function
l_times = lambda x, y: x * y

# Next we call the function by using the variable as a function
result = l_times(2, 3)

print('the product of 2 and 3 is: ' + str(result))
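To compare the two ways of defining a function side by side, here is a small sketch. The names `times_def`, `times_lambda` and `direct` are made up for this cell.

```python
# A function defined with def ...
def times_def(x, y):
    return x * y

# ... and the equivalent lambda expression
times_lambda = lambda x, y: x * y

# A lambda can also be called directly, without giving it a name
direct = (lambda x, y: x * y)(4, 5)

print(times_def(4, 5), times_lambda(4, 5), direct)   # 20 20 20
```

Passing a lambda without naming it is the form we will mostly use when handing functions to other functions.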

## Exercise 6
Write a lambda function that adds two numbers. Then execute the function on the integers 7 and 9 and print the result.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# A lambda function to add two numbers
my_add = lambda <FILL IN>

# Add two numbers using the function just defined
result = <FILL IN>

print(result)

## Exercise 7
Write a lambda function that is equivalent to the `plural` function above.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# A lambda function to add `s` to a string
plural = lambda <FILL IN>
result = plural('cat')

# Should print 'cats'

print(result)

Suppose we want to write a lambda function with a single argument that, for a tuple (x,y), returns (y,x).
This can be done by writing the output of the function as a tuple, where the first element is the second element of the input and the second element is the first element of the input.

Note that the parentheses `(` and `)` indicate a tuple.

In [None]:
# the parentheses ( and ) indicate that the output is a tuple 
# x[1] is the second element of the input, x[0] the first

f = lambda x: (x[1], x[0])

# let's create a simple tuple
ttup = ('a','b')

# and call the function
f(ttup)

## Exercise 8
Write a lambda function taking the following tuple as its argument:

```
('Jan', 'Jansen', 1234, 'Rozengracht', 'Amsterdam')
```

and reformats this tuple as follows:

```
(1234, ('Jan', 'Jansen', 'Rozengracht', 'Amsterdam'))
```

In [None]:
t = ('Jan', 'Jansen', 1234, 'Rozengracht', 'Amsterdam')

f = lambda x: <FILL IN>

f(t)               

# Output should be
# (1234, ('Jan', 'Jansen', 'Rozengracht', 'Amsterdam'))

## Map and Reduce

Functions are very important to both MapReduce and Spark. To see why let us look at a function called `map`, which is part of Python.

`map` is a function which takes two arguments, a function and a list. `map` applies the function (its first argument) to every element of the list (its second argument).

In the next cell we show how this works.
We define a list called `celsius` which contains a number of temperature measurements in degrees Celsius. We are going to convert this list to degrees Fahrenheit.

For this we write a function that will convert a single degree Celsius into Fahrenheit. We then call `map` to apply this function to all elements of the list `celsius`.

*Note*: instead of returning a list, `map` returns an _iterator_ object. To convert an iterator into a list, we use the builtin `list` function. 

In [None]:
# A list of temperature measures in degrees Celsius
celsius = [39.2, 36.5, 37.3, 37.8]

# A lambda function defining the conversion from a degree in Celsius to one in Fahrenheit
convert = lambda x: (9 / 5) * x + 32

# By using map we apply the function to every element of the list celsius
fahrenheit = list(map(convert, celsius))

# We can do exactly the same by using the lambda expression inside the map call directly
# fahrenheit = list(map(lambda x: (9 / 5) * x + 32, celsius))
print(fahrenheit)

We do not have to use lambda functions here. We can define the convert function using `def` and use the name of the function as the first argument for map. Let's see how this works:

In [None]:
# We define the same function as in the previous cell, now using def
def convert_def(x):
    fahr = (9 / 5) * x + 32
    return fahr

# Let's use it in map on the celsius list
fahrenheit = list(map(convert_def, celsius))
print(fahrenheit)

This is conceptually very similar to Hadoop's Map as in MapReduce. Unlike in Hadoop, however, the Python version is not executed in parallel.
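One practical consequence of `map` returning an iterator is that its results can be consumed only once. A minimal sketch, using hypothetical example data rather than the `celsius` list:

```python
# Hypothetical example data, separate from the celsius list above
numbers = [1, 2, 3]
doubled = map(lambda x: 2 * x, numbers)

first_pass = list(doubled)    # consumes the iterator
second_pass = list(doubled)   # nothing left to produce

print(first_pass)    # [2, 4, 6]
print(second_pass)   # []
```

This is why the cells above wrap the call in `list(...)` right away and keep the resulting list.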

Next, let us look at Reduce. Python also has a function called `reduce`, which in Python 3 lives in the `functools` module. It takes a function as its first argument and a list as its second argument. The function has to have two arguments.

`reduce` will then apply the function to the first two elements of the list and use this result together with the next item in the list to compute the next step. This procedure is repeated until the entire list is traversed.

For example, suppose we want to add up all elements of the list `[47, 11, 42, 13]` then we write a function which will add up two integers, and we will call `reduce`. `reduce` will then proceed as depicted in the picture below:

![python reduce](images/reduce.png)

Note that this `reduce` function resembles Hadoop's Reduce. In Hadoop you have to supply Mapper and Reducer classes as part of a program; in Python we can do this with functions.

Spark works similarly to Python in this regard. You write functions that you feed into other functions, which then process the data for you.
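The stepwise procedure described above can be traced by hand. The sketch below first writes out the intermediate sums for the list from the text, without calling `reduce`. It then shows the same left-to-right pattern with `reduce`, deliberately using multiplication on a different, made-up list. All variable names are hypothetical.

```python
from functools import reduce

# Manual trace of the steps for the list from the text
values = [47, 11, 42, 13]
step1 = values[0] + values[1]   # 47 + 11
step2 = step1 + values[2]       # 58 + 42
step3 = step2 + values[3]       # 100 + 13
print(step1, step2, step3)      # 58 100 113

# The same pattern with reduce, here with multiplication instead of addition
factors = [1, 2, 3, 4]
product = reduce(lambda a, b: a * b, factors)   # ((1 * 2) * 3) * 4
print(product)                  # 24
```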

## Exercise 9
As a final exercise, before we move to Spark, compute the mean of the Fahrenheit list using `reduce`.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from functools import reduce

# First write a lambda function which adds up two elements
sum_up = <FILL IN>

# Use reduce to sum up the elements in fahrenheit
total = reduce(<FILL IN>)

# Divide total by the length of the list fahrenheit
# Use the division operator /
mean = total / len(fahrenheit)
print('the mean temperature is: ' + str(mean))

## Spark teaser

To show you how similar the operations in Spark are to the functional Python version, we give you a teaser on how to do the Celsius to Fahrenheit conversion in Spark. For such a small example, of course, the overhead outweighs the benefit of parallel processing.

Instead of working on Python lists we will do our processing on Spark's data structure for collections called an _RDD_. RDDs will be explained later today. Comparing the following code to the earlier version we note:

- The workflow is exactly the same.
- We can even re-use the lambda functions.
- But `map` and `reduce` are _methods_ of the RDD instead of Python _functions_.
- Instead of the Python `len` function we use the `count` method.

In [None]:
# Initialize Spark
from pyspark import SparkContext, SparkConf

if 'sc' not in globals(): # This 'trick' makes sure the SparkContext sc is initialized exactly once
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

In [None]:
# Distribute the celsius list
celsius_rdd = sc.parallelize(celsius)

# This part runs in parallel
fahrenheit_rdd = celsius_rdd.map(convert)

print('degrees in Fahrenheit: ' + str(fahrenheit_rdd.collect()))

count = fahrenheit_rdd.count()
mean = fahrenheit_rdd.reduce(sum_up) / count

print('the mean temperature is: ' + str(mean))