## Fundamental Spark Tools

The easiest way to make an RDD is to use the `parallelize` method.

In [1]:
rdd = sc.parallelize([1,2,3,4])

In [2]:
rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:489

So here we see that we have a special type of RDD - known as the Parallel Collection RDD!

RDDs are very useful for dealing with failures when you are spreading jobs out across many nodes in a cluster. The R in RDD means resilient - this means that lost data can be reconstructed on the fly should a node fail.

#### Reading a text file.

Spark supports reading text files from any Hadoop-supported file system, such as HDFS, or from a local file system.

Be sure that the file is accessible over the network or is in the same place for all of your worker nodes to access.

In [3]:
import os
text = sc.textFile(os.getcwd()+"/text_files/harvard_sentences.txt")

Displaying `text` would show that we now have a Map Partitions RDD!

In [4]:
text.collect()

['Oak is strong and also gives shade.',
 'Cats and dogs each hate the other.',
 'The pipe began to rust while new.',
 "Open the crate but don't break the glass.",
 'Add the sum to the product of these three.',
 'Thieves who rob friends deserve jail.',
 'The ripe taste of cheese improves with age.',
 'Act on these orders with great speed.',
 'The hog crawled under the high fence.',
 'Move the vat over the hot fire.']

Fun fact - the above sentences are known as [Harvard Sentences](https://en.wikipedia.org/wiki/Harvard_sentences)

`sc.textFile` has an optional argument `minPartitions=None`

This argument tells spark how many partitions our data should have.

You should note that `minPartitions` sets a floor on the amount of parallelism that you can employ.

That is, if your data ends up in just 1 partition (e.g. `minPartitions = 1` on a small file), only 1 task will be processing your data. This doesn't exactly make sense when dealing with a cluster - you want to partition your data out amongst several nodes and then let each executor run your code in parallel on the data.

Furthermore, you are specifying the MINIMUM number of partitions; as such, this will serve as a lower bound on your parallelism. Spark may decide to give you more, e.g. 5.

Now, if you had 15 nodes - you may be tempted to set `minPartitions = 15` - indeed, this is a sensible starting point, as it gives every node a chance to get some data to work with.

However, in our incredibly simplistic example above, we only have 10 sentences, thus setting `minPartitions = 15` doesn't make much sense, as there is no way to split this data 15 ways without leaving some partitions empty. Spark is NOT going to split up a sentence.

Another thing to note is that when loading data from HDFS, Spark will by default create one partition per block. The default block size is 128 MB in Hadoop 2 and later (it was 64 MB in older versions).

Spark docs recommend 2-4 partitions per CPU on your machine.

The other optional argument is `use_unicode=True`. Setting it to `False` keeps the strings as UTF-8 encoded `str` objects instead of decoding them to unicode, which is faster and smaller. If you do not need unicode handling, this is a performance improvement, so be sure to use it where applicable.

#### Spark 'Actions'

You should think of 'actions' as output-producing functions - they force Spark to do a computation and spit out a result.

In terms of inputs and outputs, if a function takes in an RDD and spits out something that isn't an RDD - it is an action!

In [5]:
rdd = sc.parallelize(range(16))

In [6]:
rdd.collect() 

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

`.collect()` pulls the result into the driver. By driver, I mean the application in which the result is being computed and displayed. In this case, that is a jupyter notebook. In other cases, it could be a spark shell.

In [7]:
rdd.sum()

120

In [8]:
rdd.take(4) # takes the first x elements

[0, 1, 2, 3]

In [9]:
rdd.count() # equivalent of len(list(range(16)))

16

In [10]:
# rdd.saveAsTextFile("sample_numbers.txt") - saves to disk as text (a folder of part files)

#### Spark 'Transformations'

Transformations allow us to define computations on RDDs.

In terms of inputs and outputs, a function that takes an RDD and outputs an RDD is a transformation.

It is important to note that the output of one transformation can serve as the input for another transformation.

This way, we are building up a tree or **graph** of transformations that need to be applied to our initial RDD. No computation occurs whatsoever when specifying transformations - Spark is **lazy**.

In fact, when you perform an action, Spark walks back through the chain of transformations recursively until it reaches an RDD that originates from an input source. This stops the recursion and allows all of the transformations to be applied.
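To make the laziness concrete, here is a minimal pure-Python sketch (no Spark involved) that uses generator expressions as a stand-in for transformations. The names `traced_square` and `log` are hypothetical and exist only for this illustration. The point is that nothing runs until something asks for the result, which mirrors how an action triggers the chain of transformations.

```python
log = []  # records every time the function actually runs


def traced_square(x):
    log.append(x)
    return x * x


numbers = range(5)

# "Transformations": generator expressions describe the work without doing it.
squared = (traced_square(n) for n in numbers)
evens = (s for s in squared if s % 2 == 0)

calls_before_action = len(log)  # nothing has been computed yet

# "Action": consuming the pipeline forces every step to run.
result = list(evens)

print(calls_before_action)  # 0
print(result)               # [0, 4, 16]
print(len(log))             # 5
```

This is only an analogy: Spark builds a graph of RDDs rather than Python generators, but the "describe first, compute on demand" behaviour is the same idea.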

In [11]:
rdd = sc.parallelize(range(16))

In [12]:
small_rdd = sc.parallelize(range(10))

In [13]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

In [14]:
small_rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The 'heritage' of an RDD can be found using the `toDebugString()` method.

In [15]:
small_rdd.toDebugString()

b'(8) PythonRDD[12] at collect at <ipython-input-14-e0ebabccc164>:1 []\n |  ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:489 []'

In [16]:
def show_heritage(rdd):
    for s in rdd.toDebugString().split(b'\n'): #it's a bytes object.
        print(s.strip())

In [17]:
combo = rdd.union(small_rdd)

In [18]:
show_heritage(combo)

b'(16) UnionRDD[13] at union at NativeMethodAccessorImpl.java:0 []'
b'|   PythonRDD[11] at collect at <ipython-input-13-20868699513c>:1 []'
b'|   ParallelCollectionRDD[9] at parallelize at PythonRDD.scala:489 []'
b'|   PythonRDD[12] at collect at <ipython-input-14-e0ebabccc164>:1 []'
b'|   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:489 []'


`combo` is an RDD - however, when looking at its 'heritage' we see the information of `rdd` and `small_rdd`.

#### Data Persistence

The general idea here is that when we do 'expensive' transformations on RDDs, we may want to keep that RDD around - i.e. in 'memory'.

By default, Spark recomputes an RDD each time you run an action on it. Persisting an RDD allows you to either fully store it in memory, fully store it on disk or use a combination of the two. Furthermore, we know that when we save to disk, the data is serialized - however, you can also serialize the data **in memory**. You can specify the level of storage for persistence.

Furthermore, you can set a _replication level_ for the persistence. The higher the replication, the more 'insurance' you have against losing your data if you lose a node/executor. In addition, higher replication means that more processes can be run on the data if the data in question is being used as a source.

As such, by persisting an RDD, any subsequent action on that particular RDD **within the same session** will **not** have to recurse all the way back to an input RDD.

NOTE:

In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, and DISK_ONLY_2 - see more [here](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence)

Also - once you have set a level of storage for an RDD, you cannot simply change the level. You must first call `unpersist()` (or restart the notebook) and then set the new level!

In [19]:
import pyspark as pyk

In [20]:
big_numbers = sc.parallelize(range(1000000))

In [21]:
trans = big_numbers.map(lambda x : x ** 3)

In [22]:
# trans.saveAsTextFile("initial-trans.txt")

In [23]:
# trans.map(lambda x: x - 22000).saveAsTextFile("second-trans.txt")

The above is a classic example of why we would prefer to persist an RDD. Without persisting, the second save would require us to redo all the transformations. However, if we persisted the RDD, the second save would occur much faster since it no longer needs to recurse to the input RDD.

In [24]:
# trans.persist(pyk.StorageLevel.MEMORY_AND_DISK) # this should occur before our first saveAsTextFile.

In [25]:
# trans.is_cached # will return True if RDD has been persisted

In [26]:
# trans.unpersist() # will unpersist after you are done with the RDD.
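The saving from persisting can be sketched in plain Python, without Spark. This is only a model of the idea under simple assumptions: `expensive_cube` and `call_count` are hypothetical names, and the global counter is there purely to count how often the "expensive" step runs.

```python
call_count = 0  # how many times the "expensive" function has run


def expensive_cube(x):
    global call_count
    call_count += 1
    return x ** 3


source = range(1000)

# Without persisting: each "action" replays the transformation from the source.
total_a = sum(expensive_cube(x) for x in source)
total_b = sum(expensive_cube(x) - 22000 for x in source)
calls_without_persist = call_count

# With persisting: compute once, keep the result around, reuse it.
call_count = 0
persisted = [expensive_cube(x) for x in source]
total_c = sum(persisted)
total_d = sum(value - 22000 for value in persisted)
calls_with_persist = call_count

print(calls_without_persist)  # 2000
print(calls_with_persist)     # 1000
```

Both routes give the same totals; the persisted route simply does the expensive step half as often. The more actions you run on the same RDD, the bigger the gap becomes.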

## Transformation Section

###### Map

Works much like the built-in Python version of `map`.

Start with an RDD, map a function over that RDD!

When using map, you should expect one input (RDD) and expect one output (RDD). Map creates an output for EACH input - where the inputs are the elements present in the RDD.

In [27]:
rdd = sc.parallelize(range(21))

In [28]:
rdd.map(lambda x: x * 2).collect()

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]

Lambda syntax is quite common in PySpark, however, you can still define regular functions just like you would in Python.

In [29]:
def mult_2(val):
    return 2 * val

In [30]:
rdd.map(mult_2).collect()

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]

When writing traditional functions like this, it is important that the function is a **pure** function - that is, it doesn't alter/store/reference any states/variables etc. outside of its scope.

This is incredibly important as we are doing distributed work across multiple executors. As such, we must be sure that given an input, our function will always give the same output. This is an identical concept to [**referential transparency**](https://wiki.haskell.org/Referential_transparency) employed by functional programming languages, e.g. Haskell.

Map has an optional argument `preservesPartitioning=False`. In essence, setting it to `True` tells Spark that your function does not change the keys, so the way your data was partitioned can be kept. This is a more advanced Spark feature, however, it is incredibly useful when doing joins! In general, if you wish to preserve partition structure, check out the `mapValues` function instead of `map`.

`mapValues` requires a Key-Value pair, it preserves the keys!

In [31]:
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])

def f(val):
    return len(val)

x.mapValues(f).collect()

[('a', 3), ('b', 1)]

###### Filter

You can think of filter as being equivalent to a `WHERE` clause in SQL. It only keeps the values that are relevant in your RDD!

Again, each element in the RDD is passed through the filter. If the value passing through the function evaluates to True, the value is kept!

In [32]:
rdd = sc.parallelize(range(25))

In [33]:
def even(val):
    return val % 2 == 0

In [34]:
rdd.filter(even).collect()

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]

Now - given a large dataset, say that we filter out a significant chunk of the data. It is wise at this point to use the `coalesce` function.

This function reduces the number of partitions on the resulting RDD in addition to **minimizing** network traffic.

You should think of network traffic as **computation overhead**. That is, suppose you had a really large dataset that required 1000 nodes, but after filtering, all you need is 10 nodes. It doesn't make sense to start all 1000 nodes every time you manipulate an RDD that only requires 10 nodes.

`coalesce` does this reduction for you! We will get to an example later!

###### FlatMap

Most beginners confuse this with `map` - don't do this!

At a high level, `map` is a **one-to-one** transformation.

`flatMap` is a **one-to-many** transformation!

Think of flatmap as taking one input at a time, and for each input, producing many outputs!

In [35]:
rdd = sc.textFile(os.getcwd()+"/text_files/harvard_sentences.txt")

In [36]:
rdd.flatMap(lambda x: x.split(" ")).collect()

['Oak',
 'is',
 'strong',
 'and',
 'also',
 'gives',
 'shade.',
 'Cats',
 'and',
 'dogs',
 'each',
 'hate',
 'the',
 'other.',
 'The',
 'pipe',
 'began',
 'to',
 'rust',
 'while',
 'new.',
 'Open',
 'the',
 'crate',
 'but',
 "don't",
 'break',
 'the',
 'glass.',
 'Add',
 'the',
 'sum',
 'to',
 'the',
 'product',
 'of',
 'these',
 'three.',
 'Thieves',
 'who',
 'rob',
 'friends',
 'deserve',
 'jail.',
 'The',
 'ripe',
 'taste',
 'of',
 'cheese',
 'improves',
 'with',
 'age.',
 'Act',
 'on',
 'these',
 'orders',
 'with',
 'great',
 'speed.',
 'The',
 'hog',
 'crawled',
 'under',
 'the',
 'high',
 'fence.',
 'Move',
 'the',
 'vat',
 'over',
 'the',
 'hot',
 'fire.']

In [37]:
rdd.map(lambda x: x.split(" ")).collect()

[['Oak', 'is', 'strong', 'and', 'also', 'gives', 'shade.'],
 ['Cats', 'and', 'dogs', 'each', 'hate', 'the', 'other.'],
 ['The', 'pipe', 'began', 'to', 'rust', 'while', 'new.'],
 ['Open', 'the', 'crate', 'but', "don't", 'break', 'the', 'glass.'],
 ['Add', 'the', 'sum', 'to', 'the', 'product', 'of', 'these', 'three.'],
 ['Thieves', 'who', 'rob', 'friends', 'deserve', 'jail.'],
 ['The', 'ripe', 'taste', 'of', 'cheese', 'improves', 'with', 'age.'],
 ['Act', 'on', 'these', 'orders', 'with', 'great', 'speed.'],
 ['The', 'hog', 'crawled', 'under', 'the', 'high', 'fence.'],
 ['Move', 'the', 'vat', 'over', 'the', 'hot', 'fire.']]

I hope you see the difference! `flatMap` gives you access to all the words in all sentences directly! In essence, `flatMap` removes one level of grouping!
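If it helps, the same contrast can be reproduced in plain Python. This is a sketch of the idea rather than how Spark implements it; the two-sentence `sentences` list is a made-up, shortened input.

```python
from itertools import chain

sentences = ["Oak is strong", "Cats and dogs"]

# Like rdd.map(lambda x: x.split(" ")): one list per sentence.
mapped = [s.split(" ") for s in sentences]

# Like rdd.flatMap(lambda x: x.split(" ")): the per-sentence lists are chained into one.
flat_mapped = list(chain.from_iterable(s.split(" ") for s in sentences))

print(mapped)       # [['Oak', 'is', 'strong'], ['Cats', 'and', 'dogs']]
print(flat_mapped)  # ['Oak', 'is', 'strong', 'Cats', 'and', 'dogs']
```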

###### MapPartitions

Similar to `map`, however, your function is called once per partition (receiving an iterator over that partition's elements) rather than once per element. The per-partition results are then combined into the output RDD!

Below I will show you an example of counting all the word occurrences in our Harvard sentences.

In [38]:
rdd = sc.textFile(os.getcwd()+"/text_files/harvard_sentences.txt", minPartitions=7)
all_words = rdd.flatMap(lambda x: x.split(" "))

Below, the `iterator_obj` is an iterator over all of the values in a specific partition in the RDD.

`mapPartitions` expects our function to hand back an iterable of results, which is why we use the `yield` keyword and not `return`.

This is essential: `yield` turns the function into a generator that produces one dictionary of counts per partition. Had we used `return counts`, Spark would iterate over the dictionary itself and we would get back only its keys.

An excellent explanation of generators, `yield` etc. can be found [here](https://pythontips.com/2013/09/29/the-python-yield-keyword-explained/).

In [39]:
def generate_count(iterator_obj):
    counts = {}
    
    for word in iterator_obj:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] +=1
    
    yield counts # we yield and NOT return here!

In [40]:
counts = all_words.mapPartitions(generate_count)

In [41]:
counts.collect()

[{'Cats': 1,
  'Oak': 1,
  'also': 1,
  'and': 2,
  'dogs': 1,
  'each': 1,
  'gives': 1,
  'hate': 1,
  'is': 1,
  'other.': 1,
  'shade.': 1,
  'strong': 1,
  'the': 1},
 {'Open': 1,
  'The': 1,
  'began': 1,
  'break': 1,
  'but': 1,
  'crate': 1,
  "don't": 1,
  'glass.': 1,
  'new.': 1,
  'pipe': 1,
  'rust': 1,
  'the': 2,
  'to': 1,
  'while': 1},
 {'Add': 1,
  'of': 1,
  'product': 1,
  'sum': 1,
  'the': 2,
  'these': 1,
  'three.': 1,
  'to': 1},
 {'Thieves': 1, 'deserve': 1, 'friends': 1, 'jail.': 1, 'rob': 1, 'who': 1},
 {'The': 1,
  'age.': 1,
  'cheese': 1,
  'improves': 1,
  'of': 1,
  'ripe': 1,
  'taste': 1,
  'with': 1},
 {'Act': 1,
  'The': 1,
  'crawled': 1,
  'fence.': 1,
  'great': 1,
  'high': 1,
  'hog': 1,
  'on': 1,
  'orders': 1,
  'speed.': 1,
  'the': 1,
  'these': 1,
  'under': 1,
  'with': 1},
 {'Move': 1, 'fire.': 1, 'hot': 1, 'over': 1, 'the': 2, 'vat': 1}]

In [43]:
len(counts.collect()) # output partitioned!

7

If you wanted the index of the partition to be returned in addition to the output, check out the `mapPartitionsWithIndex` function!
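Note that the result above is one dictionary per partition, not a single word count. One way to get a global count on the driver is to merge the collected dictionaries. The sketch below does this in plain Python; `per_partition` is a hypothetical, shortened stand-in for what `counts.collect()` returns.

```python
from collections import Counter

# Hypothetical stand-in for counts.collect(): one dictionary per partition.
per_partition = [
    {"the": 1, "and": 2, "Oak": 1},
    {"the": 2, "The": 1},
    {"the": 2, "of": 1},
]

total = Counter()
for partial in per_partition:
    total.update(partial)  # adds the counts key by key

print(total["the"])  # 5
print(total["and"])  # 2
```

Merging on the driver is fine for small results like this one. For large vocabularies you would more likely keep the work distributed, for example with `reduceByKey` on `(word, 1)` pairs.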

###### Sample

Returns a sample of your data.

This sample can then be used for statistical analysis with regards to the population etc.

In [44]:
rdd = sc.parallelize(range(99999))

In [45]:
rdd.count()

99999

In [46]:
rdd.sample(False, 0.2).count() # roughly 20% of data, without replacement.

20020

###### Union

Allows you to combine RDDs - there is no removal of duplicates, sorting etc. It's simply a merging of the RDDs.

In [47]:
first = sc.parallelize(range(20))
second = sc.parallelize(range(30))
first.union(second).collect()

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29]

###### Intersection

Find elements that exist in both RDD's.

Note: Intersection can be slow on very large datasets as internally, Spark is running a `reduce` job across multiple nodes. This data shuffling between nodes and the reduction can cause quite the overhead. As such, if your job is ever very slow, check to see if you have used an intersection - can you optimize any further before using `intersection`?

In [48]:
first = sc.parallelize([1,2,3,4,4,5])
second = sc.parallelize([5,17,20,4,1])
first.intersection(second).collect()

[1, 4, 5]

###### Distinct

Removes duplicate elements from an RDD.

I will employ the `cartesian` method - this gives the [cartesian product](https://en.wikipedia.org/wiki/Cartesian_product)

Recall that the size of a cartesian product is the product of the sizes of its inputs - you should never be doing a cartesian product between 2 large data sets!

Rather, you should find a way to take the product between smaller subsets. You can then broadcast your operations to the large dataset using `map`.

If you must take the product of two large sets - look into join, full outer join etc.

In [49]:
rdd = sc.parallelize(["Ibrahim", "Juan"]).cartesian(sc.parallelize(range(25)))

In [50]:
rdd.collect()

[('Ibrahim', 0),
 ('Ibrahim', 1),
 ('Ibrahim', 2),
 ('Ibrahim', 3),
 ('Ibrahim', 4),
 ('Ibrahim', 5),
 ('Ibrahim', 6),
 ('Ibrahim', 7),
 ('Ibrahim', 8),
 ('Ibrahim', 9),
 ('Ibrahim', 10),
 ('Ibrahim', 11),
 ('Ibrahim', 12),
 ('Ibrahim', 13),
 ('Ibrahim', 14),
 ('Ibrahim', 15),
 ('Ibrahim', 16),
 ('Ibrahim', 17),
 ('Ibrahim', 18),
 ('Ibrahim', 19),
 ('Ibrahim', 20),
 ('Ibrahim', 21),
 ('Ibrahim', 22),
 ('Ibrahim', 23),
 ('Ibrahim', 24),
 ('Juan', 0),
 ('Juan', 1),
 ('Juan', 2),
 ('Juan', 3),
 ('Juan', 4),
 ('Juan', 5),
 ('Juan', 6),
 ('Juan', 7),
 ('Juan', 8),
 ('Juan', 9),
 ('Juan', 10),
 ('Juan', 11),
 ('Juan', 12),
 ('Juan', 13),
 ('Juan', 14),
 ('Juan', 15),
 ('Juan', 16),
 ('Juan', 17),
 ('Juan', 18),
 ('Juan', 19),
 ('Juan', 20),
 ('Juan', 21),
 ('Juan', 22),
 ('Juan', 23),
 ('Juan', 24)]

Now, suppose we needed to see how many unique names were in this list!

In [51]:
rdd.map(lambda x: x[0]).distinct().collect() # Awesome

['Juan', 'Ibrahim']

`.distinct()` takes an optional argument `numPartitions=None` - the higher you set this, the more parallelized your code will run. Again, `distinct` uses `reduce` behind the scenes, as such, performance can be slow with large amounts of data. Optimize as much as you can before using it.

###### Pipe

This function takes each partition of data within an RDD and **pipes** it to a command line tool of your choice!

This is useful if you have already developed a command line tool in **ANY** language and now wish to parallelize it.

All data is fed in as strings and output as strings.

In [52]:
rdd = sc.parallelize(range(1,50))
rdd.pipe('grep 4').collect()

['4',
 '14',
 '24',
 '34',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49']

###### Coalesce

This is an incredibly useful function - however, in order to really reap the benefits we must first dig a little deeper into how spark works under the hood.

By now, you should know that Spark stores your data in a distributed manner. This means that data is split into chunks where each chunk is known as a **partition**. Partitions exist throughout your entire cluster.

In essence, the `coalesce` function allows you to **reduce** the number of partitions you have of your data in a **significantly more efficient** manner than doing a full repartition.

How? Coalesce combines partitions that are **already on the same executors**. This minimizes network traffic **between** executors!

This is all well and good, however, when do I need to change the number of partitions? What is the appropriate number of partitions?

I like to think that choosing the optimal number of partitions is a competition between 2 important considerations:

- The number of partitions is the **upper limit** for parallelism.
    - This means it's **impossible** to have 8 processors working on 3 partitions. Spark docs [recommend 2-4 tasks](https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism) (aka partitions) per CPU in your cluster.
    - Too many partitions will also cause excessive overhead due to the creation of many small tasks.

- Number of partitions determines number of output files for an action.

I would suggest using the Spark docs suggestion of 2-4 partitions/CPU as a rule of thumb.

However, I cannot stress enough the importance of experimentation when it comes to tuning your settings.

After running a job, cut the number of partitions in half and see what happens. Increase the number of partitions slightly - did it get faster?

As we see in Machine Learning, most (Hyper)parameter tuning starts with an educated guess and then embarks on a series of iterations until we find that sweet spot.

In [1]:
rdd = sc.parallelize(range(99999), numSlices=1000) #original with 1000 partitions

In [2]:
reducedRDD = rdd.coalesce(30) # same data, but with 30 partitions

###### Repartition.

A very useful function for when you want to increase the number of partitions that your RDD started with. This can cause some overhead as new partitions are created and sent to nodes.

If reducing the number of partitions - it's better to use `coalesce` as this will minimize network overhead by merging together partitions that are already on the same node.

Partitioning data is central to Spark - this is something that just takes practice and some experimenting!

Recall, 2-4 partitions per CPU is a good rule of thumb to start at!

In [53]:
rdd = sc.parallelize(range(999), numSlices=1)

In [54]:
rdd.repartition(30) # simple as that

MapPartitionsRDD[63] at coalesce at NativeMethodAccessorImpl.java:0

###### RepartitionAndSortWithinPartitions

Same as the above function, but will also allow you to sort each partition there and then.

This function works on key-value pairs and partitions the data according to key. This function is more efficient than just repartitioning, and then later on sorting the data.

Learn about `.glom()` [here](https://spark.apache.org/docs/0.7.2/api/pyspark/pyspark.rdd.RDD-class.html#glom).

In [55]:
rdd = sc.parallelize([[3,41], [3,7], [100,100], [0.5, 17]])

In [56]:
rdd.repartitionAndSortWithinPartitions(2).glom().collect()

[[(0.5, 17), (100, 100)], [(3, 41), (3, 7)]]

The outermost list containing everything is due to glom.

Then we have two inner lists - this is because we repartitioned into 2. Each of those sublists is also sorted!

Notice how the two 3s appear together - this means our sorting happens within our partitioning!

You can also specify a custom `partitionFunc` in the optional arguments.

There are also the optional `ascending` and `keyFunc` arguments.

In [57]:
rdd.repartitionAndSortWithinPartitions(2, partitionFunc=lambda x: x == 100).glom().collect()

[[(0.5, 17), (3, 41), (3, 7)], [(100, 100)]]

Above, I specified that I want one partition to contain all keys that equal 100.

## Actions Section

###### Reduce

Calculates aggregates **over many** inputs!

Requires an **associative** and **commutative** function that is applied to pairs of inputs. The function is then applied between pairs and so on until we have a single output.


In [58]:
rdd = sc.parallelize(range(25), numSlices=3)
rdd.glom().collect()

[[0, 1, 2, 3, 4, 5, 6, 7],
 [8, 9, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23, 24]]

In [59]:
rdd.reduce(max)

24

Spark is running the reduce command in parallel over every partition - think of this as asking for the max of the first sub list and so on.

Then taking those results and asking for the max in that secondary comparison. Eventually, the answer **reduces** to 24.
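The two-stage process described above also shows why the function must be associative and commutative. Below is a minimal pure-Python sketch of the idea; `partitioned_reduce` is a hypothetical helper written for this illustration, not a Spark function, and it assumes every partition is reduced independently before the partial results are combined.

```python
from functools import reduce


def partitioned_reduce(func, partitions):
    """Reduce each non-empty partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


data = list(range(25))
three_parts = [data[0:8], data[8:16], data[16:25]]
two_parts = [data[0:12], data[12:25]]

# max is associative and commutative: how the data is split does not matter.
max_three = partitioned_reduce(max, three_parts)
max_two = partitioned_reduce(max, two_parts)


# Subtraction is neither, so the answer depends on how the data was split.
def subtract(x, y):
    return x - y


sub_three = partitioned_reduce(subtract, three_parts)
sub_two = partitioned_reduce(subtract, two_parts)

print(max_three, max_two)  # 24 24
print(sub_three, sub_two)  # 196 144
```

Since you usually do not control exactly how Spark splits and combines the data, a function like `subtract` would give results you cannot rely on.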

`reduceByKey` allows you to calculate aggregates based on subsets of data - like a pandas groupby!

In [60]:
rdd.reduce(lambda x,y : x + y) # equivalent to sum(list(range(25)))

300

###### Count

This function returns the number of elements in the RDD.

More interesting are 2 other functions that are closely related:

- `countApprox(timeout=500, confidence=0.7)`

The above function returns an approximate count within the given timeout (in milliseconds), even if not all tasks have finished. This comes in very handy when you have very large data and don't want to wait for an accurate count.

If you wanted to count the distinct elements you could use:

- `countApproxDistinct`

This one uses the [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) algorithm to approximate the number of distinct elements. Refer to the documentation for more info on optional arguments!

###### First

Pulls the first element from your RDD. Nothing special here!

In [61]:
rdd = sc.parallelize(range(1,6,1))

In [62]:
rdd.first()

1

###### Take

Takes the first n elements from an RDD and returns them as a list.

In [63]:
rdd = sc.parallelize(range(1,6,1))
rdd.take(3)

[1, 2, 3]

###### TakeSample

Pulls a random sample of elements of a given size from the RDD.

In [64]:
rdd = sc.parallelize(range(50))
rdd.takeSample(withReplacement=False, num=8)

[37, 19, 4, 8, 17, 46, 0, 9]

###### TakeOrdered

Sorts the RDD, and then takes the first N elements.

Really fast when N is small - if you want to sort the entire RDD, just sort, no point in doing `takeOrdered`.
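One way to see why a small N is cheap: you never need a full sort, only the N smallest candidates from each partition. Here is a minimal pure-Python sketch of that idea. `take_ordered` is a hypothetical helper for illustration and is not claimed to be Spark's exact implementation.

```python
import heapq


def take_ordered(partitions, n):
    """Keep only the n smallest items per partition, then merge the candidates."""
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nsmallest(n, part))
    return heapq.nsmallest(n, candidates)


partitions = [[42, 7, 19], [3, 88, 61], [15, 1, 27]]
smallest_three = take_ordered(partitions, 3)

print(smallest_three)  # [1, 3, 7]
```

Each partition contributes at most N items, so the final merge stays small no matter how large the partitions are.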

###### saveAsTextFile

The number of partitions will equal the number of output files.

You should repartition (or `coalesce`) before running `saveAsTextFile` in order to control the number of output files!

You can also use compression codecs as optional entries! Excellent for saving space on disk.

In Python - you can use `saveAsPickleFile`

In [65]:
rdd = sc.parallelize(range(9999), numSlices=7)

In [66]:
# rdd.saveAsTextFile("put_output_folder_name_here")

###### CountByKey

Counts the occurrences of keys in an RDD.

There is also a `countByValue` function.

In [67]:
rdd = sc.parallelize([('Andrew', 45), ('Juan', 99), ('Mauricio', 12), ('Mauricio', 1)])

In [68]:
rdd.countByKey()

defaultdict(int, {'Andrew': 1, 'Juan': 1, 'Mauricio': 2})

###### ForEach

Performs an action for each element of an RDD!

Be sure to use reduce or an accumulator if you wish to see the results reflected in the driver.

By default, Spark's nodes do not push back variable changes to the driver.

It's very important to distinguish between the local and cluster environments.

Read this very important part of the documentation [here](https://spark.apache.org/docs/2.1.1/programming-guide.html#understanding-closures-a-nameclosureslinka).

## Conclusion

That's all I have for today. Plenty of information for you to play with and get your hands dirty!!

The next notebook will cover more on key-value pairs, IO actions in addition to performance boosters! Stay tuned.

As always, feel free to reach out: igabr@uchicago.edu or [@Gabr\_Ibrahim](https://twitter.com/Gabr_Ibrahim).