# Functional Programming: Mapping

Many general-purpose programming languages are typically used in a **procedural** style: they use for-loops and the like to process data. However, Spark is written in [**Scala**](https://en.wikipedia.org/wiki/Scala_(programming_language)), which is both object-oriented and **functional**; when using PySpark, the Python API, we need to employ the **functional methods** if we want our code to be fast. Under the hood, the Python code uses [py4j](https://www.py4j.org/) to make calls to the Java Virtual Machine (JVM) where the Scala library is running.

Functional programming uses methods like `map()`, `apply()`, `filter()`, etc. We pass a function to the method, which then applies it to the entire dataset, without the need for explicit for-loops.
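To make the contrast concrete, here is a minimal sketch in plain Python (no Spark involved) that performs the same transformation once with a for-loop and once with `map()`, and then applies `filter()`. The list `songs` and the length threshold are made up for illustration.

```python
# Plain-Python illustration (no Spark needed); `songs` is a made-up sample list
songs = ["Despacito", "Havana", "All the stars"]

# Procedural style: an explicit loop builds the result step by step
lowercase_loop = []
for song in songs:
    lowercase_loop.append(song.lower())

# Functional style: pass a function to map(), which applies it to every element
lowercase_map = list(map(lambda song: song.lower(), songs))

# filter() keeps only the elements for which the function returns True
short_titles = list(filter(lambda song: len(song) < 10, songs))

print(lowercase_loop == lowercase_map)  # True
print(short_titles)                     # ['Despacito', 'Havana']
```

In the functional version we only describe what should happen to each element; how the elements are iterated over is left to `map()` and `filter()`.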

This **functional programming** style is very well suited for distributed systems and it is related to how MapReduce and Hadoop work.

## Pure Functions and Directed Acyclic Graphs (DAGs)

For a function passed to a method such as `map()`, `apply()` or `filter()` to work properly:

- it should have no side effects on variables outside its scope,
- it should not alter the data being processed.

These functions are called **pure functions**.
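The difference can be sketched in plain Python. The function names `clean_in_place` and `clean_pure` are hypothetical and only serve this illustration: the first one alters its input, the second one returns a new list and leaves the input untouched.

```python
def clean_in_place(songs):
    # Not pure: overwrites the elements of the list it receives
    for i, song in enumerate(songs):
        songs[i] = song.lower()
    return songs

def clean_pure(songs):
    # Pure: builds and returns a new list, the input is left as it was
    return [song.lower() for song in songs]

original = ["Despacito", "Havana"]
cleaned = clean_pure(original)
print(original)  # ['Despacito', 'Havana']
print(cleaned)   # ['despacito', 'havana']

scratch = ["Despacito", "Havana"]
mutated = clean_in_place(scratch)
print(scratch)   # ['despacito', 'havana'] (the input itself was changed)
```

With the pure version, the same input always yields the same output and the call can be repeated safely, which is what a distributed system needs if a step has to be re-run on another node.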

In Spark, the functions we apply work on a copy of the data rather than modifying it in place, so the input data is *immutable*. Additionally, the pure functions we apply are usually very simple; we chain them one after the other to define more complex processing. A complex function is thus composed of multiple sub-functions, all of which need to be pure.

The data is not copied for each of the sub-functions; instead, we perform **lazy evaluation**: all sub-functions are chained in **Directed Acyclic Graphs (DAGs)** and they are not run on the data until a result is actually needed. The combinations of sub-functions or chained steps before touching any data are called **stages**.
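Lazy evaluation can be illustrated with a rough analogy in plain Python, whose built-in `map()` also returns a lazy iterator. This is only a sketch of the idea, not of how Spark builds or schedules its DAG. The `calls` list is a bookkeeping device for this demo (it deliberately makes `to_lowercase` impure so that we can observe when it runs).

```python
calls = []  # demo-only bookkeeping: records which inputs were actually processed

def to_lowercase(song):
    calls.append(song)  # side effect used only to observe when the function runs
    return song.lower()

songs = ["Despacito", "Havana"]

# Chain two steps. Nothing is computed yet: map() only describes the work.
lowered = map(to_lowercase, songs)
lengths = map(len, lowered)
calls_before = len(calls)
print(calls_before)  # 0

# Consuming the iterator plays the role of an action such as collect()
result = list(lengths)
calls_after = len(calls)
print(result)       # [9, 6]
print(calls_after)  # 2
```

Defining the chain is cheap; the work happens only when the final result is requested.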

This is similar to baking bread: we collect all necessary stuff (ingredients, tools, etc.) and prepare them properly before even starting to make the dough.

## Maps

In Spark, maps take data as input and then transform that data with whatever function you put in the map. They are like directions for the data telling how each input should get to the output.

The first code cell creates a SparkContext object. With the SparkContext, you can input a dataset and parallelize the data across a cluster (since you are currently using Spark in local mode on a single machine, technically the dataset isn't distributed yet).

Run the code cell below to instantiate a SparkContext object and then read in the log_of_songs list into Spark. 

In [1]:
### 
# You might have noticed this code in the screencast.
#
# import findspark
# findspark.init('spark-2.3.2-bin-hadoop2.7')
#
# The findspark Python module locates a local Spark
# installation and makes it importable from Python. This is convenient
# for practicing Spark syntax locally.
# However, the workspaces already have Spark installed and you do not
# need to use the findspark module
#
###

# Find Spark
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="maps_and_lazy_evaluation_example")

# Dataset: list of song names
log_of_songs = [
        "Despacito",
        "Nice for what",
        "No tears left to cry",
        "Despacito",
        "Havana",
        "In my feelings",
        "Nice for what",
        "despacito",
        "All the stars"
]

# Parallelize the log_of_songs to use with Spark
# sc.parallelize() takes a list and creates an
# RDD = Resilient Distributed Dataset, i.e.,
# a dataset distributed across the Spark nodes.
# This RDD is represented by distributed_song_log
distributed_song_log = sc.parallelize(log_of_songs)

25/04/15 18:00:52 WARN Utils: Your hostname, kasiopeia.local resolves to a loopback address: 127.0.0.1; using 192.168.68.116 instead (on interface en0)
25/04/15 18:00:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/15 18:00:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


This next code cell defines a function that converts a song title to lowercase. Then there is an example converting the word "Havana" to "havana".

In [4]:
def convert_song_to_lowercase(song):
    return song.lower()

convert_song_to_lowercase("Havana")

'havana'

The following code cells demonstrate how to apply this function using a map step. The map step will go through each song in the list and apply the convert_song_to_lowercase() function. 

In [5]:
# We map() our function to the RDD
# BUT it is not executed, due to the lazy evaluation principle.
# To trigger execution we need to run an action, e.g., collect(),
# which gathers the results from all nodes of the cluster
# into a single list on the driver (master) node
distributed_song_log.map(convert_song_to_lowercase)

PythonRDD[1] at RDD at PythonRDD.scala:53

You'll notice that this code cell ran quite quickly. This is because of lazy evaluation. Spark does not actually execute the map step unless it needs to.

"RDD" in the output refers to resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data. 

To get Spark to actually run the map step, you need to use an "action". One available action is the collect method. The `collect()` method takes the results from all nodes of the cluster and "collects" them into a single list on the driver (master) node.

In [7]:
# With collect() the results from all nodes of the cluster
# are gathered into a single list on the driver (master) node
distributed_song_log.map(convert_song_to_lowercase).collect()

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']

Note as well that Spark is not changing the original data set: Spark is merely making a copy. You can see this by running collect() on the original dataset.

In [8]:
# If we run collect() without map(), we get
# the original, unmodified data
distributed_song_log.collect()

['Despacito',
 'Nice for what',
 'No tears left to cry',
 'Despacito',
 'Havana',
 'In my feelings',
 'Nice for what',
 'despacito',
 'All the stars']

You do not always have to write a custom function for the map step. You can also use anonymous (lambda) functions as well as built-in Python methods like `str.lower()`.

Anonymous functions are a Python feature for writing functional-style programs.

In [9]:
# Usually, the functions passed to map() are defined as lambdas,
# i.e., anonymous functions.
# Note that we are using Python's built-in lower() string method
# inside Spark!
distributed_song_log.map(lambda song: song.lower()).collect()

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']

In [10]:
distributed_song_log.map(lambda x: x.lower()).collect()

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']
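Since `map()` accepts any callable, a lambda that only forwards to a built-in method can be replaced by the method itself. The sketch below shows this with Python's built-in `map()` on a made-up list; presumably `distributed_song_log.map(str.lower).collect()` would behave the same way in PySpark, but that call was not run here and should be treated as an assumption.

```python
# Plain-Python illustration; `songs` is a made-up sample list
songs = ["Despacito", "Havana"]

# A lambda that simply forwards to the built-in string method ...
via_lambda = list(map(lambda song: song.lower(), songs))

# ... can be replaced by the method itself, since str.lower is a callable
via_method = list(map(str.lower, songs))

print(via_lambda == via_method)  # True
print(via_method)                # ['despacito', 'havana']
```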