<a href="https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Programming_with_RDDs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda Functions

In [1]:
x = ['Python', 'programming', 'is', 'awesome!']
print(sorted(x))

['Python', 'awesome!', 'is', 'programming']


The function passed as the key parameter of sorted() is called once for each item in the iterable, and the items are ordered by the values it returns. Converting every string to lowercase before comparison makes the sort case-insensitive:

In [2]:
print(sorted(x, key=lambda arg: arg.lower()))

['awesome!', 'is', 'programming', 'Python']


This is a common use case for lambda functions: small anonymous functions that maintain no external state.

Other common functional programming functions exist in Python as well, such as filter(), map(), and reduce(). All these functions can make use of lambda functions or standard functions defined with def in a similar manner.
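
As a small sketch of that last point, the snippet below sorts the same list three ways: with a function defined with def, with a lambda, and with the existing str.lower method. The helper name `lowercase` is made up for this illustration.

```python
x = ['Python', 'programming', 'is', 'awesome!']

def lowercase(word):
    """Named equivalent of `lambda arg: arg.lower()` (hypothetical helper)."""
    return word.lower()

# A def function, a lambda, and an existing method are interchangeable as `key`.
by_def = sorted(x, key=lowercase)
by_lambda = sorted(x, key=lambda arg: arg.lower())
by_method = sorted(x, key=str.lower)

print(by_def)
print(by_def == by_lambda == by_method)
```

A lambda is convenient for one-off logic; a named function is usually easier to read and reuse once the logic grows beyond a single expression.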

# filter(), map(), and reduce()

filter() and map() are built-in functions, and reduce() lives in the functools module. All three are common in functional programming. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.

It’s important to understand these functions in a core Python context. Then, you’ll be able to translate that knowledge into PySpark programs and the Spark API.

filter() filters items out of an iterable based on a condition, typically expressed as a lambda function:

In [3]:
print(list(filter(lambda arg: len(arg) < 8, x)))

['Python', 'is']


filter() takes an iterable, calls the lambda function on each item, and returns the items where the lambda returned True.

Another, less obvious benefit of filter() is that it returns a lazy iterator rather than a list. This means filter() doesn’t require that your computer have enough memory to hold all the items at once, which matters more and more with Big Data sets that can quickly grow to several gigabytes in size.
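
The on-demand behaviour described above can be observed directly. This is a minimal sketch: the `words` generator and the `produced` log are made up for the illustration and simply record which items the source has handed out so far.

```python
produced = []  # log of items the source has actually handed out

def words():
    """Hypothetical data source that records each item as it is produced."""
    for word in ['Python', 'programming', 'is', 'awesome!']:
        produced.append(word)
        yield word

short_words = filter(lambda arg: len(arg) < 8, words())
produced_before = list(produced)       # nothing has been pulled from the source yet

first_short = next(short_words)        # pulls items only until the first match
produced_after_first = list(produced)

remaining = list(short_words)          # drains the rest of the source

print(produced_before, first_short, produced_after_first, remaining)
```

Only one item is read to produce the first result; the remaining items are read when the rest of the iterator is consumed.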

**map()** is similar to filter() in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The new iterable that map() returns will always have the same number of elements as the original iterable, which was not the case with filter():

In [4]:
print(list(map(lambda arg: arg.upper(), x)))

['PYTHON', 'PROGRAMMING', 'IS', 'AWESOME!']


map() automatically calls the lambda function on all the items.
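
Because filter() and map() both accept and return iterables, they can be chained. The sketch below first checks the 1-to-1 property of map() and then upper-cases only the short words. It is an illustration in plain Python, not PySpark code.

```python
x = ['Python', 'programming', 'is', 'awesome!']

# map() yields exactly one output per input, so the lengths always match.
lengths = list(map(len, x))
same_size = len(lengths) == len(x)

# Chain the two: keep words shorter than 8 characters, then upper-case them.
short_upper = list(map(str.upper, filter(lambda arg: len(arg) < 8, x)))

print(lengths, same_size, short_upper)
```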

Finally, the last of the functional trio in the Python standard library is **reduce()**. As with filter() and map(), reduce() applies a function to elements in an iterable.

Again, the function being applied can be a standard Python function created with the def keyword or a lambda function.

However, **reduce()** doesn’t return a new iterable. Instead, **reduce()** applies the function repeatedly, combining the items until the iterable is reduced to a single value:

In [5]:
from functools import reduce
print(reduce(lambda val1, val2: val1 + val2, x))

Pythonprogrammingisawesome!


This code combines all the items in the iterable, from left to right, into a single item. There is no call to list() here because **reduce()** already returns a single item.
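
To make the left-to-right combination concrete, the sketch below writes the same reduction as an explicit loop and then uses the optional third argument of reduce(), an initial value, to sum the word lengths instead of joining the words.

```python
from functools import reduce

x = ['Python', 'programming', 'is', 'awesome!']

# Explicit loop that does what reduce() does with a two-argument function.
accumulated = x[0]
for item in x[1:]:
    accumulated = accumulated + item

joined = reduce(lambda val1, val2: val1 + val2, x)

# With an initial value, the accumulator can have a different type than the items.
total_length = reduce(lambda total, word: total + len(word), x, 0)

print(accumulated == joined, total_length)
```

Supplying an initial value also makes reduce() safe on an empty iterable, where it simply returns that value.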

Note: Python 3.x moved the built-in reduce() function into the functools module.

**lambda, map(), filter(), and reduce()** are concepts that exist in many languages and can be used in regular Python programs. 

# Stop and go back to Slide 10

# Spark Section:

NOTE: The following instructions are only needed to get Spark running on Colab. They should not appear in your own code.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
import findspark
findspark.init()

You should start from HERE:

**Creating an RDD with textFile() in Python**

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Learning_Spark") \
    .getOrCreate()

sc = spark.sparkContext

book = sc.textFile("spark-2.4.4-bin-hadoop2.7/README.md")

Calling the filter() **transformation**

In [0]:
javaLines = book.filter(lambda line: "Java" in line)

Calling the first() **action**

In [40]:
javaLines.first()

'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'
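
In Spark, a transformation such as filter() only describes the work to be done, while an action such as first() triggers it. The sketch below is a rough plain-Python analogy built on generators, not Spark's actual implementation. `read_lines` and its sample lines are made up for the illustration.

```python
lines_read = []  # records which source lines were actually touched

def read_lines():
    """Hypothetical stand-in for sc.textFile(): yields lines one at a time."""
    sample = [
        '# Some project title',
        'APIs in Scala, Java, Python, and R',
        'A line that mentions neither language',
    ]
    for line in sample:
        lines_read.append(line)
        yield line

# "Transformation": builds a lazy pipeline and reads nothing.
java_lines = (line for line in read_lines() if 'Java' in line)
read_before_action = len(lines_read)

# "Action": like first(), pulls just enough input to produce one result.
first_match = next(java_lines)
read_after_action = len(lines_read)

print(read_before_action, first_match, read_after_action)
```

No line is read when the pipeline is defined; reading starts only when a result is requested.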

# Persisting an RDD in Memory

In [46]:
book = sc.textFile("spark-2.4.4-bin-hadoop2.7/README.md")
scalaLines = book.filter(lambda line: "Scala" in line)
scalaLines.persist()  # the parentheses are required; without them the method is never called
scalaLines.first()

'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

Counting how many lines we found

In [47]:
scalaLines.count()

3
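
Persisting matters because two actions, first() and count(), run against the same filtered RDD. Without persist(), Spark may recompute the filter from the source for each action. The sketch below is a plain-Python analogy of that difference, not Spark code. The `scans` counter, `scala_lines`, and the sample lines are made up for the illustration.

```python
scans = {'count': 0}  # how many times the "file" has been scanned

def scala_lines():
    """Hypothetical pipeline that rescans its source every time it is called."""
    scans['count'] += 1
    sample = [
        'APIs in Scala, Java, Python, and R',
        'A line without the keyword',
        'Try the Scala shell',
    ]
    return [line for line in sample if 'Scala' in line]

# Without caching, each "action" recomputes the filter from the source.
first_uncached = scala_lines()[0]
count_uncached = len(scala_lines())
scans_without_cache = scans['count']

# With caching, the result is computed once and reused by both "actions".
cached = scala_lines()
first_cached = cached[0]
count_cached = len(cached)
scans_with_cache = scans['count'] - scans_without_cache

print(scans_without_cache, scans_with_cache)
```

In PySpark itself, persist() is also lazy: it marks the RDD for caching, and the data is stored when an action first computes it.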

# Using the World Cup 2014 Players Dataset

Download the TSV file found in the Datasets section of iCollege.


Note: The following lines are needed for Colab only

In [48]:
from google.colab import files
files.upload()

Saving worldcupplayerinfo_20140701.tsv to worldcupplayerinfo_20140701.tsv


{'worldcupplayerinfo_20140701.tsv': b"Group\tCountry\tRank\tJersey\tPosition\tAge\tSelections\tClub\tPlayer\tCaptain\t\t\nA\tBrazil\t3\t1\tGoalie\t31\t9\tBotafogo   \tJefferson\t0\t\t\nA\tBrazil\t3\t12\tGoalie\t34\t80\tToronto FC  \tJulio Cesar\t0\t\t\nA\tBrazil\t3\t22\tGoalie\t31\t6\tAtletico Mineiro  \tVictor\t0\t\t\nA\tBrazil\t3\t2\tDefender\t31\t75\tBarcelona   \tDani Alves\t0\t\t\nA\tBrazil\t3\t13\tDefender\t30\t12\tBayern Munich  \tDante\t0\t\t\nA\tBrazil\t3\t4\tDefender\t27\t36\tChelsea   \tDavid Luiz\t0\t\t\nA\tBrazil\t3\t15\tDefender\t27\t5\tNapoli   \tHenrique\t0\t\t\nA\tBrazil\t3\t23\tDefender\t32\t72\tRoma   \tMaicon\t0\t\t\nA\tBrazil\t3\t6\tDefender\t26\t31\tReal Madrid  \tMarcelo\t0\t\t\nA\tBrazil\t3\t14\tDefender\t32\t9\tParis Saint-Germain  \tMaxwell\t0\t\t\nA\tBrazil\t3\t3\tDefender\t29\t46\tParis Saint-Germain  \tThiago Silva \t1\t\t\nA\tBrazil\t3\t20\tMidfielder\t21\t11\tShakhtar Donetsk  \tBernard\t0\t\t\nA\tBrazil\t3\t5\tMidfielder\t29\t7\tManchester City  \tFernan

Let's try:

In [0]:
inputRDD = sc.textFile("worldcupplayerinfo_20140701.tsv")
barcelonaRDD = inputRDD.filter(lambda x: "Barcelona" in x)

And:

In [0]:
madridRDD = inputRDD.filter(lambda x: "Real Madrid" in x)
teamLinesRDD = barcelonaRDD.union(madridRDD)
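
Note that union() on RDDs keeps duplicates, so a line that matches both filters would appear twice in the result; calling distinct() on the RDD removes repeats. The sketch below is a plain-Python analogy using made-up rows, with list concatenation standing in for union().

```python
# Made-up rows for illustration; the third one matches both filters.
rows = [
    'player_a\tBarcelona',
    'player_b\tReal Madrid',
    'player_c\tBarcelona, formerly Real Madrid',
    'player_d\tChelsea',
]

barcelona = [row for row in rows if 'Barcelona' in row]
madrid = [row for row in rows if 'Real Madrid' in row]

# Analogous to barcelonaRDD.union(madridRDD): a concatenation, not a set union.
team_lines = barcelona + madrid

# Analogous to .distinct(); order is preserved here, which Spark does not promise.
unique_team_lines = list(dict.fromkeys(team_lines))

print(len(team_lines), len(unique_team_lines))
```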

Then we want to actually see our transformations:

In [61]:
print("Input had " + str(teamLinesRDD.count()) + " Players from Barcelona and Real Madrid")
print("Here are the players from either club:")
for line in teamLinesRDD.collect():
    print(line)

Input had 28 Players from Barcelona and Real Madrid
Here are the players from either club:
A	Brazil	3	2	Defender	31	75	Barcelona   	Dani Alves	0		
A	Brazil	3	10	Forward	22	49	Barcelona   	Neymar	0		
A	Cameroon	56	6	Midfielder	26	47	Barcelona   	Alex Song	0		
B	Chile	14	7	Forward	25	67	Barcelona   	Alexis Sanchez	0		
B	Spain	1	3	Defender	27	60	Barcelona   	Gerard Pique	0		
B	Spain	1	18	Defender	25	26	Barcelona   	Jordi Alba	0		
B	Spain	1	6	Midfielder	30	97	Barcelona   	Andres Iniesta	0		
B	Spain	1	10	Midfielder	27	89	Barcelona   	Cesc Fabregas	0		
B	Spain	1	16	Midfielder	25	65	Barcelona   	Sergio Busquets	0		
B	Spain	1	8	Midfielder	34	132	Barcelona   	Xavi	0		
B	Spain	1	11	Forward	26	40	Barcelona   	Pedro	0		
E	Ecuador	26	1	Goalie	28	24	Barcelona   	Maximo Banguera	0		
E	Ecuador	26	4	Defender	26	37	Barcelona   	Juan Carlos Paredes	0		
E	Ecuador	26	19	Midfielder	30	48	Barcelona   	Luis Saritama	0		
F	Argentina	5	14	Midfielder	29	96	Barcelona   	Javier Mascherano	0		
F	Argentina	5	10	Forward	26	
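
The lines above are raw tab-separated strings. A minimal sketch of turning one line into named fields follows; the column names come from the header row of the uploaded file, and `parse_player` is a made-up helper name.

```python
COLUMNS = ['Group', 'Country', 'Rank', 'Jersey', 'Position',
           'Age', 'Selections', 'Club', 'Player', 'Captain']

def parse_player(line):
    """Split one TSV line into a dict keyed by column name (hypothetical helper)."""
    fields = [field.strip() for field in line.split('\t')]
    # zip() stops at the shorter input, which drops the empty trailing columns.
    return dict(zip(COLUMNS, fields))

sample = 'A\tBrazil\t3\t2\tDefender\t31\t75\tBarcelona   \tDani Alves\t0\t\t'
player = parse_player(sample)

print(player['Player'], '-', player['Club'])
```

The same function could be passed to inputRDD.map() to parse every line; the header row would be parsed too unless it is filtered out first.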

# Go Back to Slide 21

Sources: 

https://realpython.com/pyspark-intro/