# Creating our first Context

## Basics of the basics

In [2]:
import sys
!conda install --yes --prefix {sys.prefix} -c conda-forge findspark

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - findspark


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    findspark-1.3.0            |             py_1           6 KB  conda-forge
    ------------------------------------------------------------
                                           Total:           6 KB

The following NEW packages will be INSTALLED:

  findspark          conda-forge/noarch::findspark-1.3.0-py_1



Downloading and Extracting Packages
findspark-1.3.0      | 6 KB      | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done


In [3]:
import os

import findspark
import pyspark
import time
import operator

In [4]:
from pyspark import SparkConf
from pyspark import SparkContext


conf = SparkConf()
conf.setMaster("local")
conf.setAppName("spark-basic")
sc = SparkContext(conf = conf)

What does this function do?

In [5]:
def surprise(x):
    for i in range(2,x):
        if (x % i) == 0:
            return(x,"No")
    return(x,"Yes")

In [6]:
rdd = sc.parallelize(range(2,10000)).map(surprise).take(20)

In [5]:
print(rdd)

[(2, 'Yes'), (3, 'Yes'), (4, 'No'), (5, 'Yes'), (6, 'No'), (7, 'Yes'), (8, 'No'), (9, 'No'), (10, 'No'), (11, 'Yes'), (12, 'No'), (13, 'Yes'), (14, 'No'), (15, 'No'), (16, 'No'), (17, 'Yes'), (18, 'No'), (19, 'Yes'), (20, 'No'), (21, 'No')]


#### Warm-up exercise: a smarter way?

**Option 1**

In [6]:
start_time = time.time()

rdd = sc.parallelize(range(2,int(5e4))).map(surprise).take(int(5e4))

print("--- %s seconds ---" % (time.time() - start_time))

--- 21.17146611213684 seconds ---


**Option 2**

In [7]:
def surprise2(x):
    # Your code here
    return (x,"Yes")

In [8]:
start_time = time.time()

rdd = sc.parallelize(range(2,int(5e4))).map(surprise2).take(int(5e4))

print("--- %s seconds ---" % (time.time() - start_time))

--- 1.682525634765625 seconds ---


- What happens if we reduce the number passed to take()?
- What happens if we don't call take() at all?
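
One possible way to fill in `surprise2`, assuming the goal is to return the same answers as `surprise` for x >= 2 with less work: a number with a divisor other than 1 and itself always has one no larger than its square root, so the loop can stop there. This is only a sketch, not necessarily the solution the exercise has in mind.

```python
import math

def surprise2(x):
    """Same answers as surprise() for x >= 2, trying divisors only up to sqrt(x)."""
    if x < 2:
        return (x, "No")  # assumption: values below 2 are reported as "No"
    for i in range(2, math.isqrt(x) + 1):
        if x % i == 0:
            return (x, "No")
    return (x, "Yes")

print([surprise2(n) for n in range(2, 12)])
```

It can be passed to `map` exactly like `surprise` in the timing cells above.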

## A bit more...

### Persistence

- RDDs are lazily evaluated, so if we reuse the same RDD several times, it is recomputed every time
- To avoid this, we can ask Spark to persist the data

Let's see the persisting approach:

In [9]:
rdd = sc.parallelize(range(2,int(1e9)))
rdd.cache()

start_time = time.time()

rdd.map(surprise).take(10000)

print("--- %s seconds ---" % (time.time() - start_time))

--- 16.40270161628723 seconds ---


Compared with the version without caching:

In [10]:
rdd = sc.parallelize(range(2,int(1e9)))

start_time = time.time()

rdd.map(surprise).take(10000)

print("--- %s seconds ---" % (time.time() - start_time))

--- 2.460320234298706 seconds ---


And the same happens with the function persist()... What? Why?

### And what about text?

In [11]:
file = open("Quijote.txt","r", encoding='utf-8').read()

In [12]:
rdd = sc.textFile(file)  # Careful: textFile() expects a path such as "Quijote.txt", not the file's contents

In [13]:
rdd

PRIMERA PARTE
CAPÍTULO 1: Que trata de la condición y ejercicio del famoso hidalgo D. Quijote de la Mancha
En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor. Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lentejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda. El resto della concluían sayo de velarte, calzas de velludo para las fiestas con sus pantuflos de lo mismo, los días de entre semana se honraba con su vellori de lo más fino. Tenía en su casa una ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte, y un mozo de campo y plaza, que así ensillaba el rocín como tomaba la podadera. Frisaba la edad de nuestro hidalgo con los cincuenta años, era de complexión recia, seco de carnes, enjuto de rostro; gran madrugador y amigo 

# More theory... again

## Transformations vs Actions

In a nutshell:

**Transformations** create new RDDs\
**Actions** give us values

## Transformations

- Are lazy, really lazy!
- Create dependencies, chains of transformations
- Are only triggered when an action is called

![transformations.PNG](attachment:transformations.PNG)

#### Map

In [14]:
x = sc.parallelize(["A","B","C","D","D"])
y = x.map(lambda x:(x,1))
y.collect()

[('A', 1), ('B', 1), ('C', 1), ('D', 1), ('D', 1)]

By the way... collect is not a transformation, but an action!

#### Flatmap

In [15]:
x = sc.parallelize([1,3,4,5])
sorted(x.flatMap(lambda x: range(1,x)).collect())

[1, 1, 1, 2, 2, 2, 3, 3, 4]

Try to explain what flatMap does. How does it differ from map? Can you explain what is going on here?

#### Filter

In [16]:
x = sc.parallelize(range(1,20,3))
x.filter(lambda x: x%2 == 0).collect()

[4, 10, 16]

Define a function that, given a list of strings, keeps only the e-mail addresses.

In [17]:
### Your code here
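
One possible starting point is a predicate that can be handed to `filter`. The regular expression below is a deliberately simple assumption (something@domain.tld) and does not cover every valid address; `is_email` and `keep_emails` are illustrative names.

```python
import re

# Simplified pattern: local part, "@", and a domain with at least one dot.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def is_email(text):
    """Return True if the whole string looks like an e-mail address."""
    return EMAIL_PATTERN.fullmatch(text) is not None

def keep_emails(strings):
    """Plain-Python version of the filter step."""
    return [s for s in strings if is_email(s)]

candidates = ["ana@example.com", "not an email", "bob@mail", "x.y@sub.domain.org"]
print(keep_emails(candidates))
```

With an RDD, the same predicate would be used as `sc.parallelize(candidates).filter(is_email).collect()`.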

#### Sample

Look up how to use the transformation sample(). Which parameters does it need?

Code a function that, using pyspark, returns the sum of k dice with m sides each. So dice(k = 3, m = 6) would return values between 3 and 18.

In [18]:
### Your code here

Code a function that generates lottery numbers between 00000 and 99999.

##### Union and intersection

In [19]:
x = sc.parallelize(range(0,10))
y = sc.parallelize(range(6,12))
print(x.union(y).collect())
print(x.intersection(y).collect())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 6, 7, 8, 9, 10, 11]
[6, 8, 7, 9]


Are they symmetric?
Think about a real world application (on social media, for example....) of these two transformations

Code a function that takes 2 lists and returns the values that appear only in the first list (www.google.com)

In [20]:
#### Your code here

##### Distinct

Explore the transformation distinct, and find a meaningful example where it can be useful

In [21]:
#### Your code here

#### Sortby

In [22]:
sc.parallelize([1,3,4,6,2,3,4,5,1]).sortBy(lambda x: x, True ).collect()

[1, 1, 2, 3, 3, 4, 4, 5, 6]

In [23]:
sc.parallelize([("A",1),("B",2),("C",3),("D",4)]).sortBy(lambda x: x, False ).collect()

[('D', 4), ('C', 3), ('B', 2), ('A', 1)]

Wait wait... what is it really doing here? Explore a bit!

#### MapPartitions

In [24]:
x = sc.parallelize([1,2,3,4,5],2)
def func(x): yield sum(x)

x.mapPartitions(func).collect()

[3, 12]

What is going on? What are these numbers? Experiment a bit!

Once you have it clear, check the differences between mapPartitions and mapPartitionsWithIndex

#### Groupby

In [25]:
x = sc.parallelize([1,1,2,3,4,5,6,8,9])
groups = x.groupBy(lambda x: x % 3).collect()
sorted([(x,sorted(y))] for (x,y) in groups)

[[(0, [3, 6, 9])], [(1, [1, 1, 4])], [(2, [2, 5, 8])]]

Create a function that, given a list of different texts, counts how many times one or more given words appear in each text.

For example, to detect if an article is talking about FCB, check if the words "Messi", "Setien" or "Barça" appear.

In [26]:
#### Your code here

#### Zip

In [27]:
sc.parallelize(range(0,5)).zip(sc.parallelize(range(20,25))).collect()

[(0, 20), (1, 21), (2, 22), (3, 23), (4, 24)]

Does it have any restrictions? What happens if we give different lengths? What would R do?

#### Repartition and coalesce

In [8]:
x = sc.parallelize(range(0,20),6)
x.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [7]:
x = sc.parallelize(range(0,20),6)
x.glom().collect()

[[0, 1, 2],
 [3, 4, 5],
 [6, 7, 8, 9],
 [10, 11, 12],
 [13, 14, 15],
 [16, 17, 18, 19]]

In [15]:
print(x.repartition(2).glom().collect())

print(x.coalesce(numPartitions = 2, shuffle= True).glom().collect())



[[0, 1, 2, 6, 7, 8, 9, 10, 11, 12], [3, 4, 5, 13, 14, 15, 16, 17, 18, 19]]
[[0, 1, 2, 6, 7, 8, 9, 10, 11, 12], [3, 4, 5, 13, 14, 15, 16, 17, 18, 19]]


What is the criterion? Experiment a little.

#### Reduce

In [20]:
from operator import add

a = sc.parallelize(range(0,4))
print(a)

PythonRDD[81] at RDD at PythonRDD.scala:53


In [18]:
from operator import add

sc.parallelize(range(0,4)).reduce(add)

6

Why add and not sum?

## Actions

- They return values to the driver program
- They make transformations start moving!

![actions.PNG](attachment:actions.PNG)

#### Reduce

In [31]:
from operator import add

sc.parallelize(range(0,100)).reduce(add)

4950

Why add and not sum?

Hint:

In [32]:
def prod(a,b):
    return(a*b)

sc.parallelize(range(1,10)).reduce(prod)

362880
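
To expand on the hint with a plain-Python analogy (this uses `functools.reduce`, not Spark): a reduce combines two values at a time, so it needs a function of two arguments such as `add` or `prod`. The built-in `sum` instead expects a single iterable, so it cannot play that role.

```python
from functools import reduce
from operator import add

# add(a, b) combines two values, which is what a reduce needs
total = reduce(add, range(0, 100))
print(total)  # 4950

# sum() takes one iterable, so calling it with two numbers fails
try:
    sum(1, 2)
    sum_accepts_two_numbers = True
except TypeError as error:
    sum_accepts_two_numbers = False
    print("sum(1, 2) failed:", error)
```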

#### First and TakeOrdered

In [33]:
x = sc.parallelize([11,1,3,5,4,2,7,8,1,6,5,11])

print(x.first())

print(x.takeOrdered(6))

11
[1, 1, 2, 3, 4, 5]


How would you select the last 3? And the last 3 without repetitions?

In [28]:
x = sc.parallelize([11,1,3,5,4,2,7,8,1,6,5,11])
print(x.distinct().takeOrdered(6, key=lambda x: -x))

[11, 8, 7, 6, 5, 4]


#### Max, min, sum, mean, variance, stdev, count....

Create a function that computes the coefficient of variation of a numeric list. If you don't know what it is... what better time than now to learn it?!

Create a function that converts a numeric list into a percentage list

In [31]:
dept = [("BDI",10), 
        ("RTA",20), 
        ("BDS",30), 
        ("BDF",40) 
      ]

dept

[('BDI', 10), ('RTA', 20), ('BDS', 30), ('BDF', 40)]

In [66]:
num_list = [1,2,3,4]
rdd = sc.parallelize(num_list)
rdd_tot = rdd.sum()

rdd_perc = rdd.map(lambda x:(x / rdd_tot * 100))
rdd_perc.collect()

[10.0, 20.0, 30.0, 40.0]

In [34]:
#### Your code here

#### Countbyvalue

In [67]:
sc.parallelize([1,2,5,4,6,8,7,2,3,4,3,1,6,7,3,2,1]).countByValue()

defaultdict(int, {1: 3, 2: 3, 5: 1, 4: 2, 6: 2, 8: 1, 7: 2, 3: 3})

## Some functions and more transformations (with paired RDDs)

#### Reducebykey and groupByKey

In [69]:
a = {(1,2),(2,1),(1,3),(3,3),(4,1)}
type(a)

set

In [68]:
x = sc.parallelize({(1,2),(2,1),(1,3),(3,3),(4,1)})
print(x.reduceByKey(add).collect())
x.groupByKey().collect()

[(1, 5), (2, 1), (4, 1), (3, 3)]


[(1, <pyspark.resultiterable.ResultIterable at 0x7fca08b7f610>),
 (2, <pyspark.resultiterable.ResultIterable at 0x7fca08b994c0>),
 (4, <pyspark.resultiterable.ResultIterable at 0x7fca08b99d90>),
 (3, <pyspark.resultiterable.ResultIterable at 0x7fca08b99280>)]

In [74]:
x.groupByKey().mapValues(list).collect()

[(1, [2, 3]), (2, [1]), (4, [1]), (3, [3])]

Describe and code a real world application with these functions. Try to incorporate other actions/transformations to make it more meaningful

In [37]:
#### Your code here

#### mapValues and flatMapValues

In [81]:
teams = [('FC Barcelona', 'Spain'),
 ('Manchester City', 'UK'),
 ('Juventus', 'Italy'),
 ('Borussia Dortmund', 'Germany')]

rdd_teams = sc.parallelize(teams)

In [82]:
upper_case = rdd_teams.mapValues(lambda value: value.upper())
upper_case.take(3)

[('FC Barcelona', 'SPAIN'), ('Manchester City', 'UK'), ('Juventus', 'ITALY')]

In [84]:
rdd_teams.flatMapValues(lambda value: value[0]).collect()

[('FC Barcelona', 'S'),
 ('Manchester City', 'U'),
 ('Juventus', 'I'),
 ('Borussia Dortmund', 'G')]

Explore how these two transformations work, and find a working example for them

In [38]:
#### Your code here

#### Keys, sortByKey and subtractByKey

In [85]:
x = sc.parallelize({(1,2),(2,1),(1,3),(3,3),(4,1)})
x.keys().collect()

[1, 2, 4, 3, 1]

In [86]:
x.sortByKey().collect()

[(1, 2), (1, 3), (2, 1), (3, 3), (4, 1)]

In [91]:
y = sc.parallelize({(1,6),(5,3)})
x.subtractByKey(y).collect()

[(2, 1), (4, 1), (3, 3)]

#### Joins!!!

Explore the transformations .join(), .rightOuterJoin(), .leftOuterJoin() and .cogroup()

Work with the following data:

In [93]:
x = sc.parallelize({("A",2),("B",1),("A",3),("C",3),("D",1)})
y = sc.parallelize({("D",1),("A",7),("B",3),("A",9),("E",3),("F",4)})

In [95]:
sorted(x.join(y).collect())

[('A', (2, 7)),
 ('A', (2, 9)),
 ('A', (3, 7)),
 ('A', (3, 9)),
 ('B', (1, 3)),
 ('D', (1, 1))]

In [96]:
sorted(x.rightOuterJoin(y).collect())

[('A', (2, 7)),
 ('A', (2, 9)),
 ('A', (3, 7)),
 ('A', (3, 9)),
 ('B', (1, 3)),
 ('D', (1, 1)),
 ('E', (None, 3)),
 ('F', (None, 4))]

In [97]:
sorted(x.leftOuterJoin(y).collect())

[('A', (2, 7)),
 ('A', (2, 9)),
 ('A', (3, 7)),
 ('A', (3, 9)),
 ('B', (1, 3)),
 ('C', (3, None)),
 ('D', (1, 1))]

In [107]:
sorted(x.cogroup(y).mapValues(list).collect())

[('A',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a87af0>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a92f40>]),
 ('B',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08b94a00>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a87280>]),
 ('C',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a91280>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08bed730>]),
 ('D',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08b94f70>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08b94400>]),
 ('E',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a92d30>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a927f0>]),
 ('F',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a92cd0>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a92c70>])]

In [108]:
x.cogroup(y).map(lambda x : (x[0], list(x[1]))).collect()

[('C',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a95400>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a99e50>]),
 ('D',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a99eb0>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a99f10>]),
 ('B',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a99fa0>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a9b040>]),
 ('A',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a9b0d0>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a9b160>]),
 ('E',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a9b250>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a9b280>]),
 ('F',
  [<pyspark.resultiterable.ResultIterable at 0x7fca08a9b310>,
   <pyspark.resultiterable.ResultIterable at 0x7fca08a9b370>])]

In [43]:
#### Your code here

### Exercise

Code a word count program that gives you the top 100 words of "El Quijote". The order in which you perform the operations is really important here!

In [126]:
fileQuijote = open("Quijote.txt","r", encoding='utf-8').read()
# It returns one long string
#fileQuijote

In [161]:
rddQuijote=sc.textFile("Quijote.txt")
# This returns a pyspark.rdd.RDD
type(rddQuijote)

pyspark.rdd.RDD

In [135]:
#rddQuijote.collect()

In [142]:
#wordcount = rddQuijote.map(lambda x: (x,1))
words = rddQuijote.flatMap(lambda line: line.split(" "))
#words.collect()

In [160]:
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a+b)
type(wordCounts)
wordCounts = wordCounts.map(lambda x: (x[1],x[0]) )
#wordCounts.collect()
wordCounts.sortByKey(False).take(100)

[(10351, 'que'),
 (8947, 'de'),
 (8042, 'y'),
 (4941, 'la'),
 (4726, 'a'),
 (3883, 'en'),
 (3726, 'el'),
 (2786, 'no'),
 (2382, 'se'),
 (2122, 'los'),
 (2024, 'con'),
 (1881, 'por'),
 (1856, 'su'),
 (1801, 'le'),
 (1778, 'lo'),
 (1473, 'las'),
 (1154, 'me'),
 (1129, 'como'),
 (1113, 'del'),
 (960, 'es'),
 (899, 'si'),
 (894, 'un'),
 (882, 'más'),
 (851, 'mi'),
 (814, 'yo'),
 (800, 'al'),
 (750, 'tan'),
 (714, 'don'),
 (693, 'para'),
 (689, 'porque'),
 (653, 'había'),
 (627, 'él'),
 (617, 'ni'),
 (616, 'sin'),
 (594, 'una'),
 (512, 'o'),
 (509, 'todo'),
 (487, 'sus'),
 (466, 'ser'),
 (460, 'ha'),
 (452, 'era'),
 (451, 'bien'),
 (445, 'vuestra'),
 (407, 'Y'),
 (376, 'ya'),
 (372, 'todos'),
 (354, 'cuando'),
 (348, 'dijo'),
 (345, 'Don'),
 (343, 'fue'),
 (342, 'donde'),
 (340, 'te'),
 (326, 'este'),
 (326, 'cual'),
 (321, 'así'),
 (313, 'sino'),
 (312, 'esto'),
 (312, 'Sancho'),
 (311, 'Quijote'),
 (310, 'que,'),
 (305, 'quien'),
 (300, 'muy'),
 (294, 'pero'),
 (293, 'aquel'),
 (292, 'est

In [152]:
#wordCounts.sortByKey(False).take(100)

In [44]:
#### Your code here

List of hints:
- First try to split the text into words before transforming them into an RDD
- You need to assign each of the words a value 1, we have seen how to do it
- After that, you will need to use the word (the key) to sum the 1s
- Almost done, but we would like to see the top... is the dictionary in the right order?

#### One possible solution... don't look!
Explain line by line what is going on!

In [122]:
import re

file = open("Quijote.txt","r", encoding='utf-8').read()
words = file.split(" ")
words = sc.parallelize(words)
#words = words.map(lambda x: re.sub(r'[^\w\s]','',x)) # Try it with and without this line

wordcount = words.map(lambda x: (x,1))
wordcount = wordcount.reduceByKey(add) # Alternatively wordcount.reduceByKey(lambda x,y: x+y)
wordcount = wordcount.map(lambda x: (x[1],x[0]))
wordcount.sortByKey(False).take(50)

[(10310, 'que'),
 (8926, 'de'),
 (8012, 'y'),
 (4930, 'la'),
 (4716, 'a'),
 (3867, 'en'),
 (3713, 'el'),
 (2776, 'no'),
 (2380, 'se'),
 (2116, 'los'),
 (2015, 'con'),
 (1874, 'por'),
 (1853, 'su'),
 (1799, 'le'),
 (1773, 'lo'),
 (1470, 'las'),
 (1152, 'me'),
 (1129, 'como'),
 (1109, 'del'),
 (957, 'es'),
 (895, 'si'),
 (889, 'un'),
 (878, 'más'),
 (851, 'mi'),
 (812, 'yo'),
 (797, 'al'),
 (749, 'tan'),
 (713, 'don'),
 (693, 'para'),
 (685, 'porque'),
 (649, 'había'),
 (625, 'él'),
 (613, 'sin'),
 (611, 'ni'),
 (594, 'una'),
 (511, 'o'),
 (508, 'todo'),
 (486, 'sus'),
 (466, 'ser'),
 (459, 'ha'),
 (450, 'era'),
 (450, 'bien'),
 (445, 'vuestra'),
 (376, 'ya'),
 (372, 'todos'),
 (350, 'cuando'),
 (346, 'dijo'),
 (343, 'fue'),
 (342, 'donde'),
 (339, 'te')]

### Another Exercise!

Let's code a shopping list!
We will have a list of elements like this:
    
x = sc.parallelize([["Apple",3,0.2],["Pear",5,0.35],["Milk",2,1.1],["Apple",3,0.2]])

Where the first element of each list is the product, the second is the number of units we bought, and the third is the unit price.

We want the list of how much we have spent on each product (ordered), and the total amount of money we have spent.

(Optional) If we buy more than 10 units of the same product, we get a 10% discount on the final price

In [46]:
x = sc.parallelize([["Apple",3,0.2],["Pear",5,0.35],["Milk",2,1.1],["Apple",3,0.2]])

### Last one...

Replicate the last exercise, but with a different data structure. We have one object with products and prices, and a separate list of the following form:

x = sc.parallelize([["Maria","Apple",1],["Maria","Pear",2],["Pau","Milk",4],["Laura","Apple",3]])

We want to know how much each of them has spent in total.

In [47]:
#### Your code here

In [48]:
sc._jsc.sc().uiWebUrl().get()

'http://MAD-SURF002.netmind.local:4040'