# Functional Programming 
## for Big Data Analytics

SELECT AVG(price)
FROM goods
WHERE quantity > 2


```
avg(filter(select(goods)))
```

```
goods.select(price).filter(quantity_test).show()
```
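As a rough illustration, the same query can be written in plain Python with the built-in `filter` and `map`. The rows in `goods` are made up for this sketch, and `quantity_test` and `price` are small helper functions assumed here; the filter runs first because it needs the `quantity` field.

```python
from statistics import mean

# Hypothetical rows standing in for the `goods` table
goods = [
    {"item": "pen", "price": 2.0, "quantity": 5},
    {"item": "book", "price": 10.0, "quantity": 1},
    {"item": "bag", "price": 30.0, "quantity": 3},
]

def quantity_test(row):
    # WHERE quantity > 2
    return row["quantity"] > 2

def price(row):
    # SELECT price
    return row["price"]

# AVG(price) over the rows that pass the filter
avg_price = mean(list(map(price, filter(quantity_test, goods))))
print(avg_price)
```

Each step takes a function as an argument, which is the style the rest of these notes build towards.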

## Part 1 -- Data Processing with Comprehensions

### Table-Like Queries

* choose a problem with a relational structure (tabular)
     * define a list of dictionaries to represent the data
     * query for interesting characteristics of your dataset
         * subset of fields
         * a price conversion (1.33 * )
         * other operations that seem useful (e.g., len(), .upper(), r['field'][0])
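One possible shape for this exercise is sketched below. The `products` dataset and its field names are invented for illustration, and the 1.33 multiplier is treated as a currency rate, which is an assumption.

```python
# Hypothetical dataset: a list of dictionaries, one per row
products = [
    {"name": "kettle", "price_gbp": 20.0, "stock": 3},
    {"name": "toaster", "price_gbp": 35.0, "stock": 0},
    {"name": "mug", "price_gbp": 4.0, "stock": 12},
]

# Subset of fields: SELECT name, stock FROM products
name_stock = [(r["name"], r["stock"]) for r in products]

# Price conversion using the 1.33 multiplier from the exercise
converted = [round(1.33 * r["price_gbp"], 2) for r in products]

# Other operations: length, upper-casing, first character of a field
lengths = [len(r["name"]) for r in products]
shouted = [r["name"].upper() for r in products]
initials = [r["name"][0] for r in products]

print(name_stock, converted, lengths, shouted, initials)
```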

In [7]:
table = [
    {"name": "Reece", "role": "Presales", "points": 100},
    {"name": "Toby", "role": "Network Analyst", "points": 0},
    {"name": "Toby", "role": "Graph Analyst", "points": 10}
]

In [2]:
names = []

for row in table:
    names.append(row['name'])
    
names

['Reece', 'Toby', 'Toby']

`[ SELECT   name        FROM  source ]`

In [8]:
[ r['name'] for r in table ]

['Reece', 'Toby', 'Toby']

In [9]:
{ r['name'] for r in table }

{'Reece', 'Toby'}

In [12]:
[ (r["name"], r["points"]/100) for r in table]

[('Reece', 1.0), ('Toby', 0.0), ('Toby', 0.1)]

### Connections

In [13]:
loves = [("Alice", "Bob"), ("Bob", "Bob"), ("Alice", "Eve"), ("Eve", "Bob")]

In [18]:
[ pair for pair in loves ]

[('Alice', 'Bob'), ('Bob', 'Bob'), ('Alice', 'Eve'), ('Eve', 'Bob')]

In [19]:
[ "Bob" in pair for pair in loves ]

[True, True, False, True]

In [20]:
sum([ "Bob" in pair for pair in loves ]) 

3

In [17]:
sum([ "Bob" in pair for pair in loves ]) / len(loves)

0.75

### Connections with Info

* Starting with the following structure edges: (node_u, node_v, weight)
    * find all the first nodes (u)
    * find all the last nodes (v)
    * find the first nodes and their weights (u, w)
    * find the total of all the weights
        * HINT: `sum(...)`

In [21]:
hates = [
    ("Bob", "Alice", 7), 
    ("Eve", "Alice", 6),
    ("Bob", "Michael", 5), 
    ("Eve", "Michael", 3)
]

In [28]:
[ (u,w) for u, v, w in hates ]

[('Bob', 7), ('Eve', 6), ('Bob', 5), ('Eve', 3)]

In [29]:
from collections import Counter

In [31]:
Counter([ u for u, v, w in hates ])

Counter({'Bob': 2, 'Eve': 2})
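The remaining parts of the exercise (first nodes, last nodes, total weight) could be answered along these lines. The sketch repeats the `hates` list so it runs on its own; the variable names are only suggestions.

```python
# Same edge list as above: (node_u, node_v, weight)
hates = [
    ("Bob", "Alice", 7),
    ("Eve", "Alice", 6),
    ("Bob", "Michael", 5),
    ("Eve", "Michael", 3),
]

# Tuple unpacking in the `for` clause names each position
first_nodes = [u for u, v, w in hates]
last_nodes = [v for u, v, w in hates]

# A generator expression avoids building a temporary list for sum()
total_weight = sum(w for u, v, w in hates)

print(first_nodes, last_nodes, total_weight)
```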

### Filtering

#### Exercise
* add a filter to one of your queries above

`SELECT r['name'] FROM users AS r`


In [67]:
[ h for h in hates]

[('Bob', 'Alice', 7),
 ('Eve', 'Alice', 6),
 ('Bob', 'Michael', 5),
 ('Eve', 'Michael', 3)]

In [71]:
[w for u,v,w in hates if u == "Bob"]

[7, 5]

In [72]:
sum([w for u,v,w in hates if u == "Bob"])

12
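The same `if` clause works on the list-of-dictionaries table from earlier, roughly like `SELECT name FROM table WHERE points > 0`. The sketch below repeats the table so it is self-contained; the two conditions are just examples.

```python
table = [
    {"name": "Reece", "role": "Presales", "points": 100},
    {"name": "Toby", "role": "Network Analyst", "points": 0},
    {"name": "Toby", "role": "Graph Analyst", "points": 10},
]

# WHERE points > 0
with_points = [r["name"] for r in table if r["points"] > 0]

# WHERE role LIKE '%Analyst%'
analysts = [r["name"] for r in table if "Analyst" in r["role"]]

print(with_points, analysts)
```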

### Aside: Dictionary Comprehension

In [74]:
{ u: []  for u, v, w in hates  }

{'Bob': [], 'Eve': []}

In [78]:
{ first: [ w for u, v, w in hates] for first, second, weight in hates }

{'Bob': [7, 6, 5, 3], 'Eve': [7, 6, 5, 3]}

```
SELECT first AS key, 
    (SELECT weight FROM users AS A WHERE A.first = B.first ) AS value
FROM users as B
```

In [80]:
{ A_u: [ B_w for B_u, B_v, B_w in hates if B_u == A_u] for A_u, A_v, A_w in hates }

{'Bob': [7, 5], 'Eve': [6, 3]}

In [66]:
{ u: sum([ c for a,b,c in hates if u == a  ]) for u, v, w in hates }

{'Bob': 12, 'Eve': 9}
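The comprehensions above recompute each key's value once per row, so `'Bob'` is calculated twice. Two possible alternatives are sketched here: taking the distinct keys from a set first, or grouping in a single pass with `collections.defaultdict`. Neither comes from the original notes.

```python
from collections import defaultdict

hates = [
    ("Bob", "Alice", 7),
    ("Eve", "Alice", 6),
    ("Bob", "Michael", 5),
    ("Eve", "Michael", 3),
]

# Distinct keys first, so each total is computed once
firsts = {u for u, v, w in hates}
totals = {a: sum(w for u, v, w in hates if u == a) for a in firsts}

# Single pass over the data: append each weight to its key's list
grouped = defaultdict(list)
for u, v, w in hates:
    grouped[u].append(w)

print(totals, dict(grouped))
```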

### Working with Python

In [33]:
names = ["Alice John X", "Bob Clarrisa Y", "Eve Michael Z"]

[ n.split()[-1] for n in names ]

['X', 'Y', 'Z']

### Nested Comprehensions

In [39]:
three = range(3)

In [40]:
list(three)

[0, 1, 2]

In [56]:
x = 5
f"x = {x : .2f}"

'x =  5.00'

In [55]:
"x = {x}"

'x = {x}'

In [48]:
[ name.split()  for name in   [f"Michael {i}" for i in range(3)]]

[['Michael', '0'], ['Michael', '1'], ['Michael', '2']]

In [53]:
[''.join([p[0] for p in n.split()]) for n in names]

['AJX', 'BCY', 'EMZ']

In [38]:
['.'.join( [ piece[0] for piece in n.upper().split()]  ) for n in names]

['A.J.X', 'B.C.Y', 'E.M.Z']

### Nested Comprehensions: Multiple `for` Clauses

In [87]:
[(i, j) for i in range(3) for j in range(3)]

[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

In [83]:
for i in range(3):
    for j in range(3):
        print(i, j)

0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
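The `for` clauses in a comprehension read in the same order as the nested loops above, and an `if` clause can follow them. A small sketch, with a made-up `grid` for the flattening case:

```python
# Keep only the pairs where i is smaller than j
pairs = [(i, j) for i in range(3) for j in range(3) if i < j]

# Flatten a list of lists: outer loop first, inner loop second
grid = [[1, 2], [3, 4], [5, 6]]  # hypothetical data
flat = [x for row in grid for x in row]

print(pairs, flat)
```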


## Part 2 -- Data Transformation with Functions

* Why?
    * The Spark API is written using a functional programming style
* Why?
    * ...

In [2]:
sob = [253, 189, 100, 150]

[300 - p for p in sob]

[47, 111, 200, 150]

* Fixed, imperative:
    * specific to one dataset (`old`)
    * specific to one operation (`2 *`)

In [3]:
old = [10, 20, 30, 40]
new = []

for e in old:
    new.append( 2 * e )
    
new

[20, 40, 60, 80]

* Generalized:
    * not specific to one dataset: any `old`
    * not specific to one operation: any `f` we want

SELECT UPPER(letter) FROM words

In [4]:
def _map(old, f):
    new = []

    for e in old:
        new.append( f(e) )

    return new

In [5]:
def double(x):
    return x * 2

_map(sob, double)

[506, 378, 200, 300]

In [6]:
_map("Hello", str.upper)

['H', 'E', 'L', 'L', 'O']

In [7]:
_map(["Sherlock Holmes", "John Watson"], str.upper)

['SHERLOCK HOLMES', 'JOHN WATSON']

In [8]:
_map([18, 9, 20], double)

[36, 18, 40]

In [9]:
_map([  ["h", "e", "b"],  [1, 2, 3],  [0, 23, 1] ], len)

[3, 3, 3]

In [10]:
_map( [1, 0, None, "asd"], bool)

[True, False, False, True]

## Exercise
* using the definition of `_map` above:
    * define a list of names and uppercase them
    * define a list of ages and double them
    * define a list of lists, where each inner list is a shopping cart
        * produce a list of the len() (i.e., number of items) in each cart
    * HINT: `str.upper`, `double`, `len`

### EXTRA
* define a list which contains missing data (i.e., use `None` in the list)
    * run `bool`, `str`, `int` across this list
    * what does `bool` tell you?
* define `def reduce(data, f): ...` 
    * such that `reduce(sob, int.__add__)` returns the sum of `sob`
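A possible walk-through of the missing-data part is sketched below; the `ages` list is invented. Note that `bool` maps both `None` and `0` to `False`, so it cannot tell a missing value from a genuine zero, and `int(None)` raises a `TypeError`.

```python
def _map(old, f):
    new = []
    for e in old:
        new.append(f(e))
    return new

ages = [31, None, 0, 45]  # hypothetical data with one missing value

present = _map(ages, bool)  # None and 0 both become False
as_text = _map(ages, str)   # None becomes the string 'None'

# int cannot convert None, so the whole map fails
try:
    _map(ages, int)
    int_worked = True
except TypeError:
    int_worked = False

# An explicit test distinguishes missing values from zeros
def is_known(e):
    return e is not None

known = _map(ages, is_known)

print(present, as_text, int_worked, known)
```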


## Higher-Order Functions
* functions that take other functions as arguments 
    * they are always run with some other user-defined operation

* using these you can have an API for working with data that:
     * works on streaming data
     * can be used without knowing how the underlying algorithm works
     
     
* for this we need:
* map
    * list -> list
* flatMap
    * list -> list of lists -> list
* fold
    * reduce
    * aggregating
    * list -> value
    
* filter 
    * long list -> shorter list
    
* exists, forall
    * helpers
    * list -> bool
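For orientation, Python's standard library already offers rough counterparts to each operation in this list. The sketch below uses sample values and a hypothetical `is_cheap` helper; note that `map` and `filter` return lazy iterators, hence the `list(...)` calls.

```python
from functools import reduce
from itertools import chain
import operator

names = ["Gurpreet D", "Chris E"]  # sample values
prices = [10, 20, 30]

def is_cheap(p):
    return p < 25

# map: list -> list
upper = list(map(str.upper, names))

# flatMap: list -> list of lists -> list
pieces = list(chain.from_iterable(map(str.split, names)))

# fold / reduce: list -> value
total = reduce(operator.add, prices, 0)

# filter: long list -> shorter list
cheap = list(filter(is_cheap, prices))

# exists / forall: list -> bool
some_cheap = any(is_cheap(p) for p in prices)
all_cheap = all(is_cheap(p) for p in prices)

print(upper, pieces, total, cheap, some_cheap, all_cheap)
```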

In [15]:
names = ["Gurpreet D", "Chris E", "Deborah F", "Michael B"]

_map(names, str.split)

[['Gurpreet', 'D'], ['Chris', 'E'], ['Deborah', 'F'], ['Michael', 'B']]

In [16]:
def _flatMap(old, f):
    new = []
    
    for e in old:
        for p in f(e):
            new.append(p)
            
    return new

_flatMap(names, str.split)

['Gurpreet', 'D', 'Chris', 'E', 'Deborah', 'F', 'Michael', 'B']

In [21]:
def _filter(old, f):
    new = []
    
    for e in old:
        if f(e):
            new.append(e)
            
    return new

def has(name):
    return 'E' in name.upper()


_filter(names, has)

['Gurpreet D', 'Chris E', 'Deborah F', 'Michael B']

In [23]:
prices = [10, 20, 30]

total = 0

for p in prices:
    total = total + p
    
total

60

In [32]:
prices = [10, 20, 30]

def _fold(old, f, start):
    total = start

    for p in old:
        total = f(total, p)

    return total

In [33]:
_fold(prices, int.__add__, 0)

60

In [34]:
_fold(names, str.__add__, "")

'Gurpreet DChris EDeborah FMichael B'

In [48]:
prizes = [0.1, 0.2, 0.8]


_fold(prizes, float.__mul__, 1.0)

0.016000000000000004

In [49]:
f" {_fold(prizes, float.__mul__, 1.0) * 100 : .2f} %"

'  1.60 %'

In [51]:
_fold(prices, lambda t, e: t + e ** 2, 0)

1400

In [54]:
sum([p ** 2 for p in prices])

1400

## Exercise

* define a list of probabilities (numbers between 0 and 1)
    * each is how probable it is to win a prize
    
* what is the probability of winning all the prizes?
    * HINT: use `_fold` with `float.__mul__` 

### EXTRA
* calculate the sum of the squares of the prices above

## Lambda Syntax

* Short-hand way of defining a function
* anonymous functions -- you don't need to define a name


In [35]:
_map(names, lambda n: 'g' in n.lower())

[True, False, False, False]

In [36]:
f = lambda c: len(c) == 3

In [38]:
def f(c):
    return len(c) == 3

In [37]:
f("ABC")

True
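To make the comparison concrete, the sketch below defines the same test once with `lambda` and once with `def`, then passes a lambda directly as the `key` argument of the built-in `sorted`. The function names are illustrative only.

```python
# Two ways of writing the same function
is_three_lambda = lambda c: len(c) == 3

def is_three_def(c):
    return len(c) == 3

same = is_three_lambda("ABC") == is_three_def("ABC")

# Lambdas are most useful inline, where naming the function adds little
names = ["Gurpreet D", "Chris E", "Deborah F", "Michael B"]
by_initial = sorted(names, key=lambda n: n.split()[-1])

print(same, by_initial)
```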

## Other Tricks

In [55]:
_map([1, 2, 3], (1).__add__)

[2, 3, 4]
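`(1).__add__` is a bound method: `int.__add__` with its first argument already fixed to `1`. A similar effect can be had with `functools.partial` and `operator.add`, sketched below as an alternative rather than anything these notes prescribe. One difference worth knowing is that the bound method only handles integers.

```python
from functools import partial
import operator

add_one_bound = (1).__add__                 # int.__add__ with self = 1
add_one_partial = partial(operator.add, 1)  # operator.add with first argument = 1

bound_result = list(map(add_one_bound, [1, 2, 3]))
partial_result = list(map(add_one_partial, [1, 2, 3]))

# int.__add__ does not know how to add a float and returns NotImplemented,
# while operator.add goes through the full `+` machinery
bound_float = add_one_bound(2.5)
partial_float = add_one_partial(2.5)

print(bound_result, partial_result, bound_float, partial_float)
```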