In [1]:
sc

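**flatMap** vs. **map**: `map` returns exactly one output element per input element, while `flatMap` flattens the iterable returned for each input element into the output RDD.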

## Fold, Aggregate, countByValue, takeSample, and foreach(n)


### Fold - similar to a reduce function
**fold** is similar to reduce, with a default (zero) value. 
- Takes a function with the same design as needed for reduce()
- Takes a default value to be used for the initial call on each partition (and once more when the partition results are combined)
- Returns a new value of the same type

---

If there are 3 partitions

`.fold(0,lambda x,y : x+y)`

Step 1 : **0** 1 3 5 | **0** 7 9 | **0**

Step 2 : **0** 9 16 0

Step 3 : 25

---

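`.fold(1,lambda x,y : x+y)`

If there are 3 partitions

Step 1 : **1** 1 3 5 | **1** 7 9 | **1**

Step 2 : **1** 10 17 1

Step 3 : 29

The default value 1 is added once in each of the 3 partitions and once more when the partition results are combined, so the result is 25 + 4 = 29.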

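Fold is often used to avoid error handling. With `reduce()` you have to handle an empty RDD on your own, because it raises an error. With `fold()`, the 0 value handles that case automatically: an empty RDD simply returns 0.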
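The steps above can be replayed in plain Python. This is only a sketch of the fold semantics, not Spark itself: `simulate_fold` is a hypothetical helper, and the three-partition split is the illustrative one from the steps, not necessarily what `sc.parallelize` would produce.

```python
from functools import reduce

def simulate_fold(partitions, zero_value, op):
    """Mimic RDD.fold on a list of partitions (plain lists, no Spark needed)."""
    # Fold each partition on its own, starting from the zero value.
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    # Combine the partition results, starting from the zero value once more.
    return reduce(op, per_partition, zero_value)

add = lambda x, y: x + y
partitions = [[1, 3, 5], [7, 9], []]  # hypothetical 3-partition split

fold_zero = simulate_fold(partitions, 0, add)  # zero value 0 leaves the sum unchanged
fold_one = simulate_fold(partitions, 1, add)   # zero value 1 is added 3 + 1 times
fold_empty = simulate_fold([[], []], 0, add)   # an empty dataset returns the zero value
```

Because the zero value is applied once per partition and once in the final combine, it should be the identity of the operation (0 for addition) unless the extra contribution is intended.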



#### Let's look at odd numbers; we want to calculate the mean (sum / count)

In [2]:
odd_nums = sc.parallelize([1,3,5,7,9],2)

In [3]:
odd_nums.map(lambda x : x+1 if x == 3 else x).collect()

[1, 4, 5, 7, 9]

In [4]:
odd_nums.map(lambda x : x+1 if x == 3 else x+2 if x==5 else 0).collect()

[0, 4, 7, 0, 0]

In [5]:
odd_nums.collect()

[1, 3, 5, 7, 9]

We will make a 2-value tuple `(0,0)` which will be considered as `x`.

- `x[0]` will be our sum
- `x[1]` will be our count

```python
.aggregate((0,0),  # default (zero) value: (sum, count)
    lambda x,y : (x[0]+y,x[1]+1),  # within a partition: add y to the sum, 1 to the count
    lambda x,y: (x[0]+y[0],x[1]+y[1])) # condensing partitions
```

#### Let's look at the parts

##### Default Value
The accumulator starts at (0,0) in each partition, and values are then added to it one at a time. 

##### Within partition 

`lambda x,y : (x[0]+y,x[1]+1)`

Let's work through this. 

1. We start with `(0,0) = x`
2. We encounter our first value 1 so `y=1`
3. We add `y` to the sum and 1 to the count, so (0,0) becomes (1,1)


##### Gathering partition results

`lambda x,y: (x[0]+y[0],x[1]+y[1])`

Let's work through this step by step. Let's say the partition break is `[1,3] [5,7,9]`

Then after applying the function within the partition we should have `(4,2) (21,3)`

So to reiterate:

1. `(4,2) = x`
2. `(21,3) = y`
3. so the lambda function should add these pair-wise together
4. `x[0]+y[0] = 25`
5. `x[1]+y[1] = 5`
6. Returns `(25,5)`
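The walkthrough above can be checked without Spark. The sketch below is a plain-Python imitation of the aggregate semantics, with `x[0]` as the running sum and `x[1]` as the count; `simulate_aggregate`, `seq_op`, and `comb_op` are hypothetical names chosen for illustration.

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Mimic RDD.aggregate on a list of partitions (plain lists, no Spark needed)."""
    # Within each partition: fold the elements into a (sum, count) accumulator.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # Across partitions: merge the accumulators pair-wise.
    return reduce(comb_op, per_partition, zero_value)

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)   # add value to sum, 1 to count
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])          # add sums and counts

partitions = [[1, 3], [5, 7, 9]]  # the partition break used in the walkthrough
total, count = simulate_aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / float(count)  # sum / count
```

Keeping the sum and the count in one tuple lets a single pass over the data produce both numbers needed for the mean.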







#### Example 4-1 for the numbers between 1 and 9 calculate sum of odd numbers

In [6]:
odd_nums.reduce(lambda x,y : x + y)

25

#### Example 4-2 for the numbers between 1 and 9 calculate the sum of the odd numbers using fold

In [7]:
odd_nums.fold(0,lambda x,y : x + y)

25

#### Example 4-3 using aggregate() return (sum, number of elements) of odd numbers

In [8]:
odd_nums.aggregate((0,0),lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))

(25, 5)

In [9]:
odd_nums.glom().collect()

[[1, 3], [5, 7, 9]]

### What's the difference between take() and first()?

`take(n)` returns a list of the first n elements; `first()` returns the first element itself (not wrapped in a list)

In [10]:
odd_nums = sc.parallelize([1,3,5,7,9],2)

In [11]:
odd_nums.take(1)

[1]

In [12]:
odd_nums.first()

1

### What's the difference between sample() and takeSample()?


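### sample
`sample(withReplacement=True, fraction=0.5, seed=1)` : **transformation**
- **CREATES A NEW RDD** with random elements from the calling RDD
- withReplacement - whether an element can be sampled more than once (repeats)
- fraction - expected size of the sample as a fraction of the RDD's size: without replacement, the probability that each element is chosen; with replacement, the expected number of times each element is chosen
- seed (for reproducible randomization) - otherwise the default is based on the millisecond time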

### takeSample
`takeSample(withReplacement, num, seed)` : **action**
- returns a fixed-size sample subset of an RDD as an **ARRAY** (a Python list)
- withReplacement - allows an element to be sampled multiple times
- num - number of sampled elements to return
- seed
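The practical difference is that `sample` gives a result whose size varies around `fraction * n`, while `takeSample` gives exactly `num` elements. The sketch below imitates only the without-replacement case in plain Python; `simulate_sample` and `simulate_take_sample` are hypothetical helpers, and Spark's actual samplers are implemented differently.

```python
import random

def simulate_sample(data, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

def simulate_take_sample(data, num, seed):
    """Return exactly `num` distinct positions from the data."""
    rng = random.Random(seed)
    return rng.sample(data, num)

data = [3, 4, 1, 2, 2, 3, 4, 5]

variable_size = simulate_sample(data, 0.5, seed=1)    # length varies with the seed
fixed_size = simulate_take_sample(data, 3, seed=1)    # length is always 3
repeat_run = simulate_take_sample(data, 3, seed=1)    # same seed, same sample
```

Fixing the seed makes both helpers reproducible, which is the same reason to pass `seed` in the Spark calls.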


### Example 5-1

Try the collect(), count(), countByValue(), top(n), take(n), first(), and takeSample() operations on `z`

In [13]:
x = sc.parallelize([3,4,1,2])
y = sc.parallelize(range(2,6))
z = x.union(y)

In [14]:
z.collect()

[3, 4, 1, 2, 2, 3, 4, 5]

In [15]:
z.count()

8

In [16]:
z.countByValue()

defaultdict(int, {1: 1, 2: 2, 3: 2, 4: 2, 5: 1})

In [17]:
z.top(3)

[5, 4, 4]

In [18]:
z.take(3)

[3, 4, 1]

In [19]:
z.first()

3

In [20]:
z.takeSample(withReplacement=False,num=3,seed=None)

[2, 1, 3]

In [21]:
z.takeSample(withReplacement=True,num=20,seed=None)

[3, 2, 3, 2, 4, 3, 5, 2, 4, 1, 2, 4, 2, 4, 2, 3, 2, 2, 3, 4]

In [22]:
z.glom().collect()

[[], [3], [], [4], [], [1], [], [2], [], [2], [], [3], [], [4], [], [5]]

In [23]:
res = z.sample(True,0.5)
res.glom().collect()

[[], [], [], [], [], [], [], [], [], [2], [], [], [], [], [], [5]]

In [24]:
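# foreach() is an action run only for its side effects; it returns None,
# so b is None and the x+1 results are discarded.
b = z.foreach(lambda x : x+1)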

## Week 3 - Pair RDDs

Pair RDDs are RDDs of **key-value pairs**. 

|Key|value|
|---|------------------------------|
|001| 'some data that i have added'|

This is a very common structure for schemaless data or NoSQL stores. In NoSQL, a key is unique. In Spark, a key can be duplicated, so the same key can show up in multiple places.

#### Keys 
- Could be simple, or complex objects (tuples)
- Could be simple, or a complex json

#### Exercise 1 - get words from readme file

In [6]:
path = '/Users/owner/USF/spark/README.md'
readme = sc.textFile(path)

In [7]:
words = readme.flatMap(lambda x: x.split(' '))
word_count = words.map(lambda x: (x,1))
word_count.collect()[:10]

[(u'#', 1),
 (u'Apache', 1),
 (u'Spark', 1),
 (u'', 1),
 (u'Spark', 1),
 (u'is', 1),
 (u'a', 1),
 (u'fast', 1),
 (u'and', 1),
 (u'general', 1)]

## Transformations designed for key value pairs

|function | description |
|-----|------------------|
|keys()| gets the keys|
|values()| gets the values|

#### Do the same, but use the length of the word as the key and sort by key (descending)

In [11]:
words = readme.flatMap(lambda x: x.split(' '))
word_len = words.map(lambda x: (len(x),x))
sorted_list = word_len.sortByKey(ascending=False)
sorted_list.collect()[:20]

[(115,
  u'[IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).'),
 (112,
  u'[Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)'),
 (96,
  u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)'),
 (82,
  u'3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).'),
 (81,
  u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).'),
 (65, u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).'),
 (62, u'Guide](http://spark.apache.org/docs/latest/configuration.html)'),
 (57, u'wiki](https://cwiki.apache.org/confluence/display/SPARK).'),
 (49, u'page](http://spark.apache.org/documentation.html)'),
 (35, u'sc.parallelize(range(1000)).count()'),
 (33, u'Maven](http://maven.apache.org/).'),
 (26, u'<http://spark.apache.org/>'),
 (24, u'MASTER=spark://host:707

**groupByKey()**

- groups the values by key
- returns an RDD of (key, ResultIterable) pairs
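As a rough picture of what grouping by key produces, here is a plain-Python sketch that collects `(length, word)` pairs into one list per key. `simulate_group_by_key` and the sample words are hypothetical; Spark does this across partitions with a shuffle and returns iterables rather than lists.

```python
from collections import defaultdict

def simulate_group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]} (plain Python, no Spark)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Hypothetical sample of words, keyed by their length.
words = ["Apache", "Spark", "is", "a", "fast", "and", "general", "cluster"]
pairs = [(len(w), w) for w in words]

by_length = simulate_group_by_key(pairs)
```

In PySpark the grouped values arrive as a `ResultIterable`, so they need `list(...)` (or another consumer such as `sum`) before they print usefully.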

#### Example 3 

Create a pair RDD with (length of a word, list of words) from README.md

In [34]:
words = readme.flatMap(lambda x: x.split())
words = words.map(lambda x: (len(x),x))
res = words.groupByKey().map(lambda x: (x[0],list(x[1])))
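# Note: sortByKey() is a transformation that returns a new RDD. Its result is
# not assigned here, so `res` stays unsorted, as the output below shows.
res.sortByKey()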


for x in res.collect()[:5]:
    print x[:2]
    print '----------------'


(96, [u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)'])
----------------
(112, [u'[Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)'])
----------------
(2, [u'is', u'It', u'in', u'R,', u'an', u'It', u'of', u'##', u'on', u'##', u'is', u'To', u'do', u'to', u'do', u'if', u'by', u'-T', u'in', u'is', u'at', u'an', u'##', u'to', u'is', u'to', u'##', u'if', u'##', u'in', u'To', u'of', u'Pi', u'to', u'to', u'be', u'or', u'to', u'on', u'to', u'or', u'to', u'an', u'if', u'is', u'in', u'of', u'if', u'no', u'##', u'is', u'be', u'on', u'to', u'or', u'##', u'to', u'to', u'in', u'of', u'to', u'at', u'on', u'of', u'##', u'to', u'in', u'an', u'on', u'to'])
----------------
(4, [u'fast', u'APIs', u'that', u'data', u'also', u'rich', u'find', u'This', u'file', u'only', u'run:', u'(You', u'need', u'this', u'more', u'than', u'with', u'More', u'from', u'IDE,', u'also', u'also', u'with', u'will',

### mapValues

Pass each value in the key-value pair RDD through a map function without changing the keys

#### Example 4

In [35]:
words = readme.flatMap(lambda x: x.split())
words = words.map(lambda x: (x,1))
res = words.groupByKey()
res.collect()[:3]

[(u'when', <pyspark.resultiterable.ResultIterable at 0x1039c2910>),
 (u'R,', <pyspark.resultiterable.ResultIterable at 0x10391ed10>),
 (u'including', <pyspark.resultiterable.ResultIterable at 0x10391e690>)]

In [36]:
res2 = res.mapValues(sum)
res2.sortBy(lambda x: x[1], ascending = False).collect()[:10]

[(u'the', 22),
 (u'Spark', 15),
 (u'to', 14),
 (u'for', 11),
 (u'and', 11),
 (u'##', 8),
 (u'a', 8),
 (u'run', 7),
 (u'can', 7),
 (u'is', 6)]

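### flatMapValues

Pass each value in the key-value pair RDD through a function that returns an iterable, and emit one `(key, element)` pair per element, without changing the keys

#### Example 5 - similar to before, but using flatMapValues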


In [48]:
words = readme.flatMap(lambda x: x.split())
words = words.map(lambda x: (len(x),x))
words.mapValues(lambda x : list([x])).collect()[:10]

[(1, [u'#']),
 (6, [u'Apache']),
 (5, [u'Spark']),
 (5, [u'Spark']),
 (2, [u'is']),
 (1, [u'a']),
 (4, [u'fast']),
 (3, [u'and']),
 (7, [u'general']),
 (7, [u'cluster'])]
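The cell above calls `mapValues`, so each value stays wrapped in its list. For comparison, the plain-Python sketch below shows how the two operations differ; `simulate_map_values` and `simulate_flat_map_values` are hypothetical helpers that imitate the semantics on a list of pairs, not Spark's implementation.

```python
def simulate_map_values(pairs, f):
    """Apply f to each value; exactly one output pair per input pair."""
    return [(key, f(value)) for key, value in pairs]

def simulate_flat_map_values(pairs, f):
    """Apply f to each value and emit one (key, item) pair per returned item."""
    return [(key, item) for key, value in pairs for item in f(value)]

pairs = [(6, "Apache"), (5, "Spark")]

wrapped = simulate_map_values(pairs, lambda v: [v])          # values become one-element lists
flattened = simulate_flat_map_values(pairs, lambda v: [v])   # the lists are flattened away
chars = simulate_flat_map_values([(2, "is")], list)          # one pair per character
```

With a function that returns a one-element list, `flatMapValues` gives back the original pairs, which is why it only becomes interesting when the function returns several items per value.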

### reduceByKey

- similar to reduce(), but applied separately to the values of each key
- runs the reduce operations for the keys in parallel

#### Example 6 - do the word counts - generate word / occurrence again

In [49]:
words = readme.flatMap(lambda x: x.split())
words = words.map(lambda x: (x,1))
res = words.reduceByKey(lambda x,y: x+y).sortBy(lambda x: x[1], ascending=False)
res.collect()[:10]

[(u'the', 22),
 (u'Spark', 15),
 (u'to', 14),
 (u'for', 11),
 (u'and', 11),
 (u'##', 8),
 (u'a', 8),
 (u'run', 7),
 (u'can', 7),
 (u'is', 6)]

### Which requires less shuffling?

`.groupByKey().mapValues(lambda x : sum(x))`
- Sends all the values across the network first, then condenses them with the mapValues

`.reduceByKey(lambda x,y: x+y)`
- Accumulates within each partition first, then sends one pair per key to be combined
- LESS DATA SHUFFLED
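To make the comparison concrete, the plain-Python sketch below counts how many records would have to leave their partition under each approach. The two partitions of words are hypothetical, and the counting is a simplification of what Spark's shuffle actually does.

```python
from collections import Counter

# Hypothetical words, already split into two partitions.
partitions = [
    ["the", "Spark", "the", "to"],
    ["the", "to", "Spark", "the"],
]

# groupByKey: every (word, 1) pair is sent across the network.
group_by_key_records = sum(len(part) for part in partitions)

# reduceByKey: each partition first sums its own pairs per key, so at most
# one record per distinct key per partition is sent.
pre_combined = [Counter(part) for part in partitions]
reduce_by_key_records = sum(len(counts) for counts in pre_combined)

# Both approaches end with the same word counts.
final_counts = sum(pre_combined, Counter())
```

The saving grows with the number of repeated keys per partition, which is why word counts are a typical case where `reduceByKey` is preferred.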