---
<center><u><h1>Pyspark. Part 1</h1></u></center>
---

---

# What is Big Data

According to [IDC](https://en.wikipedia.org/wiki/International_Data_Corporation), in 2011 we created 1.8 zettabytes (or 1.8 trillion GB) of information, which is enough data to fill about 57.5 billion 32 GB Apple iPads. That's enough iPads to build a Great iPad Wall of China twice as high as the original one. In 2012 the figure reached 2.8 zettabytes, and IDC now forecasts that we will generate 40 zettabytes (ZB) by 2020. Every day we create data that comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, and cell phone GPS signals, to name a few. On average, every minute of every day we create:

* more than 204 million email messages;
* over 2 million Google search queries;
* 48 hours of new YouTube videos; 
* 684,000 bits of content shared on Facebook; 
* more than 100,000 tweets; 
* $272,000 spent on e-commerce, etc.

The amount of digital data has grown tremendously in the past few years. It is predicted that this year we will reach around 8 zettabytes worldwide. The amount of data is expected to keep growing exponentially because more and more devices are connected to the internet (the Internet of Things). A second factor stimulating the growth of data is the use of social media. It is expected that the total amount of data with an IP address will reach a whopping 44 zettabytes by 2020.

<img src="images/bid_data_growth.png" width=60%>

**Big data** is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy.

The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

## Quick overview of BigData history

* In *2005* Roger Mougalas from O’Reilly Media coined the term Big Data for the first time, only a year after they created the term Web 2.0. It refers to a large set of data that is almost impossible to manage and process using traditional business intelligence tools.

* *2005* is also the year that Hadoop was created by Yahoo!, building on the ideas of Google’s MapReduce. Its goal was to index the entire World Wide Web, and nowadays the open-source Hadoop is used by a lot of organizations to crunch through huge amounts of data.

* As more and more social networks start appearing and Web 2.0 takes flight, more and more data is created on a daily basis. Innovative startups slowly start to dig into this massive amount of data, and governments also start working on Big Data projects. In *2009* the Indian government decides to take an iris scan, fingerprint and photograph of all of its 1.2 billion inhabitants. All this data is stored in the largest biometric database in the world.

* In *2010* Eric Schmidt speaks at the Techonomy conference in Lake Tahoe in California and he states that "there were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days."

* In *2011* the McKinsey report *Big data: The next frontier for innovation, competition, and productivity* states that by 2018 the USA alone will face a shortage of 140,000 – 190,000 data scientists as well as 1.5 million data managers.

In the past few years, there has been a massive increase in Big Data startups, all trying to deal with Big Data and to help organizations understand it, and more and more companies are slowly adopting Big Data. However, while it may look as if Big Data has been around for a long time already, it is in fact about as far along as the internet was in 1993. The large Big Data revolution is still ahead of us, so a lot will change in the coming years. Let the Big Data era begin!


# Applications and fields related to Big Data

* Understanding and targeting customers. Here, big data is used to better understand customers and their behaviors and preferences. Companies are keen to expand their traditional data sets with social media data, browser logs as well as text analytics and sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models.

* Optimizing business processes. Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts. One particular business process that is seeing a lot of big data analytics is supply chain or delivery route optimization. Here the delivery of goods is optimized by analyzing live traffic data.

* Healthcare. The computing power of big data tools allows us to analyze, understand and predict diseases. Real-time analytics of social media and medical records allows us to monitor outbreaks of diseases.

* Science. The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second. All this information needs to be analyzed.


# Technologies
The main technologies are the widely used [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop), [Spark](https://en.wikipedia.org/wiki/Apache_Spark) and the [MapReduce model](https://en.wikipedia.org/wiki/MapReduce), together with various [Data Mining](https://en.wikipedia.org/wiki/Data_mining) and [Machine learning](https://en.wikipedia.org/wiki/Machine_learning) techniques.  
[NoSQL databases](https://en.wikipedia.org/wiki/NoSQL) such as graph, document, column-based and key-value databases are also widely used.



# MapReduce 

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.  

MapReduce is composed of a `Map()` procedure that performs filtering and sorting and a `Reduce()` method that folds the data using some reduce function.

The `Map` stage is a preprocessing stage in which every element of the input data is passed through a function that transforms the element. For example, students can be sorted by first name into queues, one queue for each name.
At the `Reduce` stage we then aggregate each of these queues, obtaining the frequencies of the students' names.  

As another example, ingredients are first mapped into pieces and then reduced to sandwiches.
![alt text](images/1.jpg)

MapReduce allows for distributed pre-processing and folding of data. Operations can be done independently and in parallel, limited only by processing power. This allows faster performance on large data sets and provides fault tolerance.
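
The student-name example above can be sketched in plain Python. This is only a single-machine illustration of the map, group and reduce steps, not how Hadoop or Spark implement them; the `students` list and all variable names are made up for the illustration, and Python 3 is assumed.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input data for the illustration.
students = ["Anna", "Boris", "Anna", "Clara", "Boris", "Anna"]

# Map stage: turn every element into a (key, value) pair.
mapped = [(name, 1) for name in students]

# Grouping step: put the values into one "queue" per name.
queues = defaultdict(list)
for name, one in mapped:
    queues[name].append(one)

# Reduce stage: fold every queue into a single value (the name frequency).
name_counts = {
    name: reduce(lambda a, b: a + b, ones)
    for name, ones in queues.items()
}

print(name_counts)
```

Because each queue is reduced independently of the others, the reduce calls could run on different machines, which is the property MapReduce exploits.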

## [Apache Hadoop](http://hadoop.apache.org/)

Hadoop is an open-source software framework used for distributed processing and storage of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. 

All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

## [Apache Spark](http://spark.apache.org/)

Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Spark does not have its own system for organizing files in a distributed way (a distributed file system). For this reason, programmers often install Spark on top of Hadoop so that Spark’s advanced analytics applications can make use of the data stored in the Hadoop Distributed File System (HDFS). Hadoop has a file system that is much like the one on your desktop computer, but it allows us to distribute files across many machines. HDFS organizes information into a consistent set of file blocks and storage blocks for each node.

Hadoop uses MapReduce to process and analyze the data stored in HDFS. MapReduce writes the data back to disk after each operation. This is done because data held in RAM is more volatile than data stored on disk.

In contrast, Spark copies most of the data from disk into RAM; this is called “in-memory” operation. It reduces the time spent reading from and writing to disk and makes Spark faster than Hadoop’s MapReduce. Spark uses a mechanism called Resilient Distributed Datasets (RDDs) to recover data when there is a failure.

## Spark vs Hadoop

Here's a brief look at what they do and how they compare.

1. They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. Hadoop is essentially a distributed data infrastructure: It distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.

2. You can use one without the other. Hadoop includes not just a storage component, known as the Hadoop Distributed File System, but also a processing component called MapReduce, so you don't need Spark to get your processing done. Conversely, you can also use Spark without Hadoop. Spark does not come with its own file management system, though, so it needs to be integrated with one - if not HDFS, then another cloud-based data platform. Spark was designed for Hadoop, however, so many agree they're better together.

3. Spark is speedier. Spark is generally a lot faster than MapReduce because of the way it processes data. While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop. "The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc.," explained Kirk Borne, principal data scientist at Booz Allen Hamilton. Spark, on the other hand, completes the full data analytics operations in-memory and in near real-time: "Read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done," Borne said. Spark can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics, he said.

4. You may not need Spark's speed. MapReduce's processing style can be just fine if your data operations and reporting requirements are mostly static and you can wait for batch-mode processing. But if you need to do analytics on streaming data, like from sensors on a factory floor, or have applications that require multiple operations, you probably want to go with Spark. Most machine-learning algorithms, for example, require multiple operations. Common applications for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics and machine log monitoring.

5. Failure recovery: different, but still good. Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation, but Spark has similar built-in resiliency by virtue of the fact that its data objects are stored in something called resilient distributed datasets distributed across the data cluster. "These data objects can be stored in memory or on disks, and RDD provides full recovery from faults or failures," Borne pointed out.

The following diagram demonstrates the performance advantage of Spark over Hadoop when running logistic regression with both tools.

<img src="images/perf.jpg">

## Resilient Distributed Datasets (RDD)


Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.


Iterative operations store all intermediate results in RAM instead of saving them to disk, which allows for great processing speed.   
First, we read our input data from disk, load it into Spark, perform some Map-Reduce operations, and store the results in RAM. Every subsequent Map-Reduce iteration takes the data calculated by the previous operations from RAM, saving the time otherwise spent reading data from disk. Only when our final computations are done do we write the results to disk.
![alt text](images/2.jpg)

For interactive operations the data is loaded from disk only once, and stored in RAM, so each subsequent operation queries data from RAM, increasing operating speed.

![alt text](images/3.jpg)


## Basic RDD Transformations and Actions

RDDs support two types of operations: `transformations`, which create a new dataset from an existing one, and `actions`, which return a value to the driver program after running a computation on the dataset. 

For example, `map` is a transformation that passes each element of the dataset through a function and returns a new RDD representing the results.
`count`, on the other hand, is an action: it counts all elements in the RDD and returns a single number.

All transformations in Spark are **lazy**, which means that they do not compute their results right away. They just remember the order of the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
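
Laziness can be illustrated without Spark. The sketch below relies on the fact that Python 3's built-in `map` is itself lazy; it is an analogy for the behaviour described above, not Spark code, and the `log` list and `add_one` function are names invented for the illustration.

```python
log = []  # records every element the function is actually applied to

def add_one(x):
    log.append(x)
    return x + 1

data = [1, 2, 3]

# Like a Spark transformation, map() only remembers what has to be done.
pending = map(add_one, data)
calls_before = len(log)   # nothing has been computed yet

# Consuming the iterator plays the role of an action such as collect().
result = list(pending)
calls_after = len(log)    # now the function has run for every element

print(calls_before, calls_after, result)
```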



By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.


## Transformations

* ### First, here are single-rdd transformations.
We will use a simple RDD as an example:  
`rdd = [1, 2, 3, 4, 4]   `

|Name|Description|Example|Result|
|:---|------------|:------|:----:|
|`map(func)`|Passes each element of RDD through `func` and returns new RDD|`rdd.map(lambda x: x + 1)`|`[2,3,4,5,5]`|
|`filter(func)`|Returns a new RDD containing only the elements for which `func` returns true|`rdd.filter(lambda x: x > 2)`| `[3,4,4]`|
|`flatMap(func)`|Similar to map, but each input can be mapped to 0 or more elements, func should return a sequence, rather than a single element.|`rdd.flatMap(lambda x: range(0,x))`|`[0,0,1,0,1,2,0,1,2,3,0,1,2,3]`|
|`sample(withReplacement, fraction, seed)`|Samples a fraction `fraction` of the data, with or without replacement, using a given random number generator seed|`rdd.sample(False, 0.5)`|e.g. `[2, 3, 4]` (random)|
|`distinct()`|Returns a new dataset without any duplicates|`rdd.distinct()`|`[1,2,3,4]`|
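
To make the semantics of the deterministic rows in the table above concrete, here is a plain-Python sketch that produces the same results on an ordinary list. It assumes Python 3 and does not use Spark; `rdd_like` is just a list standing in for the RDD, and the final sort in `distinct_values` is only there to get a stable order (Spark does not guarantee one).

```python
from itertools import chain

rdd_like = [1, 2, 3, 4, 4]

# map: exactly one output element per input element
mapped = [x + 1 for x in rdd_like]

# filter: keep only the elements for which the predicate is true
filtered = [x for x in rdd_like if x > 2]

# flatMap: each input yields a sequence, and the sequences are concatenated
flat_mapped = list(chain.from_iterable(range(0, x) for x in rdd_like))

# distinct: drop duplicates
distinct_values = sorted(set(rdd_like))

print(mapped, filtered, flat_mapped, distinct_values)
```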


* ### Now, let's take a look at transformations that require two RDDs.  
`rdd = [1, 2, 3, 4]`  
`other = [3, 4, 5, 6]`

|Name|Description|Example|Result|
|:---|------------|:------|:----:|
|`union(other)`|Returns an RDD with the elements of both datasets, i.e. their [union](https://en.wikipedia.org/wiki/Union_%28set_theory%29); duplicates are kept|`rdd.union(other)`|`[1, 2, 3, 4, 3, 4, 5, 6]`|
|`intersection(other)`|Returns an RDD with the elements present in both datasets at the same time, i.e. their [intersection](https://en.wikipedia.org/wiki/Intersection_%28set_theory%29) |`rdd.intersection(other)`|`[3, 4]`|
|`cartesian(other)`|When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements - [cartesian product](https://en.wikipedia.org/wiki/Cartesian_product))|`rdd.cartesian(other)`|`[(1, 3), (1, 4), (1, 5),(1, 6), (2, 3), (2, 4), (2, 5), (2, 6), (3, 3), (3, 4), (3, 5), (3, 6), (4, 3), (4, 4), (4, 5), (4, 6)]`|
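
The three two-dataset operations in the table above can be mimicked on plain Python lists, which makes one detail easy to see: `union` keeps duplicates, while `intersection` does not. This is a sketch of the semantics only (Python 3, no Spark), and the sort is added only to get a stable order.

```python
from itertools import product

rdd_like = [1, 2, 3, 4]
other = [3, 4, 5, 6]

# union: simple concatenation, so 3 and 4 appear twice
union = rdd_like + other

# intersection: elements present in both, without duplicates
intersection = sorted(set(rdd_like) & set(other))

# cartesian: every pair (a, b) with a from rdd_like and b from other
cartesian = list(product(rdd_like, other))

print(union, intersection, len(cartesian))
```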



* ### Here are transformations on RDDs of key-value pairs.  
`rdd = [('hello', 1), ('hello', 1), ('world', 1)]`

|Name|Description|Example|Result|
|:---|------------|:------|:----:|
|`groupByKey()`|When called on a dataset of `(Key, Value)` pairs, returns a dataset of `(Key, Iterable[Values])` pairs.|`rdd.groupByKey()`|`[('hello',[1,1]),('world',[1])]`|
|`reduceByKey(func)`|When called on a dataset of `(Key, Value)` pairs, returns a dataset of `(Key, Value)` pairs, where values are aggregated using given reduce function `func`, which must be of type: `(Value, Value) -> Value`|`rdd.reduceByKey(lambda x,y: x + y)`|`[('hello',2), ('world',1)]`|
|`sortByKey(ascending)`|When called on a dataset of `(Key, Value)` pairs, returns a dataset of `(Key, Value)` pairs sorted by keys in ascending or descending order|`rdd.sortByKey()`|`[('hello', 1), ('hello', 1), ('world', 1)]`|
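
The difference between `groupByKey` and `reduceByKey` may be easier to see in a plain-Python sketch. The helper functions `group_by_key` and `reduce_by_key` below are hypothetical single-machine stand-ins written for this illustration; they only imitate the results shown in the table and say nothing about how Spark distributes the work.

```python
from collections import defaultdict

pairs = [('hello', 1), ('hello', 1), ('world', 1)]

def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_by_key(func, pairs):
    """Fold the values of each key into a single value using func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return sorted(acc.items())

grouped = group_by_key(pairs)
reduced = reduce_by_key(lambda x, y: x + y, pairs)

print(grouped)
print(reduced)
```

Note that `reduce_by_key` never builds the intermediate lists: it keeps one running value per key, which is why reducing is usually preferable to grouping followed by a sum.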



* ### Here are transformations on two RDDs of key-value pairs.
`rdd = [('hello','world'),('program','bugs')]`  
`other = [('hello','python'),('program','code')]`

|Name|Description|Example|Result|
|:---|------------|:------|:----:|
|`join(other)`|When called on datasets of type `(K, V)` and `(K, W)`, returns a dataset of `(K, (V, W))` pairs with all pairs of elements for each key. Outer joins are supported through `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`|`rdd.join(other)`|`[('hello', ('world', 'python')), ('program', ('bugs', 'code'))]`|
|`cogroup(other)`|When called on datasets of type `(K, V)` and `(K, W)`, returns a dataset of `(K, (Iterable[V], Iterable[W]))` tuples. This operation is also called `groupWith`|`rdd.cogroup(other)`|`[('hello',[['world'], ['python']]),('program',[['bugs'], ['code']])]`|
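
As a rough illustration of what `join` and `cogroup` return, the sketch below reimplements both for small Python lists. The functions `join_pairs` and `cogroup_pairs` are hypothetical helpers for this example only; they use nested loops, which is fine for a handful of elements but is not how Spark performs joins.

```python
rdd_like = [('hello', 'world'), ('program', 'bugs')]
other = [('hello', 'python'), ('program', 'code')]

def join_pairs(left, right):
    """Inner join: one output pair for every matching (left, right) combination."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def cogroup_pairs(left, right):
    """For every key, collect the values from the left and from the right."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    return [
        (key,
         ([v for k, v in left if k == key],
          [w for k, w in right if k == key]))
        for key in keys
    ]

joined = join_pairs(rdd_like, other)
cogrouped = cogroup_pairs(rdd_like, other)

print(joined)
print(cogrouped)
```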

### There are a few more transformations.
* `coalesce(numPartitions)` decreases the number of partitions in the RDD to `numPartitions`.

* `repartition(numPartitions)` reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them.

* `repartitionAndSortWithinPartitions(partitioner)` repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery.

* `pipe(command)` pipes each partition of the RDD through a shell command, such as a Perl or bash script. RDD elements are written to the process's stdin, and the lines it writes to its stdout are returned as an RDD of strings.
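
The difference between `coalesce` and `repartition` can be pictured with a toy model in which an RDD is a list of partitions and each partition is a list. Everything here is an assumption made for illustration: the function names are hypothetical, and the round-robin redistribution only stands in for Spark's shuffle, which is implemented differently.

```python
def toy_coalesce(partitions, num_partitions):
    """Merge whole partitions into fewer ones; no partition is split up."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

def toy_repartition(partitions, num_partitions):
    """Redistribute every single element (a full shuffle), balancing the sizes."""
    result = [[] for _ in range(num_partitions)]
    elements = [x for part in partitions for x in part]
    for i, x in enumerate(elements):
        result[i % num_partitions].append(x)
    return result

# An unbalanced dataset spread over four partitions.
partitions = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = toy_coalesce(partitions, 2)
rebalanced = toy_repartition(partitions, 2)

print([len(p) for p in coalesced])
print([len(p) for p in rebalanced])
```

In this model the merged partitions stay unbalanced because whole partitions are moved, while the full redistribution evens them out at the cost of touching every element.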

## Transformations - Examples

Let's now take a look at the transformations of RDDs in Spark. There are quite a few of them. To demonstrate them we'll use a simple dataset of numbers.

In [None]:
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

To create an RDD from this list we use the `parallelize` method of the `SparkContext` (available here as `sc`).

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`).

In [None]:
rdd = sc.parallelize(dataset)

To take a look at the contents of the RDD we must `collect()` it.

In [None]:
rdd.collect()

Let's increment each element by 1.

In [None]:
rdd_after_map = rdd.map(lambda x: x + 1)

To take a look at the results, we have to run an action, `collect()`. This is necessary because transformations are lazy.

In [None]:
rdd_after_map.collect()

Let's take the elements that are less than 5.

In [None]:
rdd.filter(lambda x: x < 5).collect()

It is easier to demonstrate how `flatMap` works with a list of sentences than with a list of numbers. Here we will split each sentence into words.

In [None]:
data = ["hello Spark world!", "i like python"]
sc.parallelize(data).flatMap(lambda x: x.split()).collect()

Let's randomly select roughly 20% of the elements in `rdd`. The fraction is an expected value, so the number of returned elements can vary from run to run.

In [None]:
rdd.sample(withReplacement=False, fraction = 0.2).collect()

`distinct()` returns a new dataset with no duplicates.

In [None]:
sc.parallelize([1,2,1,1,1,1,1,1,1,2,3,4,5,6,7,8,9]).distinct().collect()

We'll go with two tiny datasets to show how they can be combined

In [None]:
first = sc.parallelize([1,2,3,4,9])
second = sc.parallelize([1,3,5,6,7])

In [None]:
# Union of these two datasets
first.union(second).collect()

In [None]:
# Intersection of these datasets
first.intersection(second).collect()

In [None]:
first = sc.parallelize([1,2])
second = sc.parallelize([3,4])
first.cartesian(second).collect()

Next operations require RDDs of key-value pairs.

In this example, we group all values for the keys "hello" and "programming".

In [None]:
data = [("hello", "world"), ("hello", "spark"),
        ("hello", "python"), ("programming","fun"),
        ("programming","interesting")]
# Create RDD from list
kv_rdd = sc.parallelize(data)
# Group RDD by key and collect.
# The values come back as ResultIterable objects,
# so the output is not very readable yet.
kv_rdd.groupByKey().collect()

In [None]:
# Here we convert pyspark.resultiterable.ResultIterable to List
kv_rdd.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

A few other methods:

In [None]:
kv_rdd.reduceByKey(lambda x, y: x +", "+ y).collect()

In [None]:
kv_rdd.sortByKey().collect()

In [None]:
one = sc.parallelize([(1,2),(2,3),(3,4)])
two = sc.parallelize([(1,5),(2,10),(3,15)])
one.join(two).collect()

In [None]:
# using map to convert iterables to lists
one.cogroup(two).map(lambda x: (x[0], [list(a) for a in x[1]])).collect()

## Actions

We will use a simple RDD as an example:  
`rdd = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`  

|Name|Description|Example|Result|
|:---|------------|:------|:----:|
|`collect()`|Returns all elements of dataset as an array at the driver program|`rdd.collect()`|`[1,2,3,4,5,6,7,8,9,10]`|
|`reduce(func)`|Aggregates the elements of the dataset using `func` of type `(Value, Value) => Value`. `func` must be commutative and associative so that it can be computed in parallel|`rdd.reduce(lambda x, y : x + y)`|`55`|
|`count()`| Returns number of elements in dataset|`rdd.count()`|`10`|
|`take(n)`|Returns array with first n elements from dataset|`rdd.take(5)`|`[1, 2, 3, 4, 5]`|
|`first()`| Returns first element from dataset|`rdd.first()`|`1`|
|`takeSample(withReplacement, num, [seed])`| Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.|`rdd.takeSample(False, 5, 0)`| e.g. `[8, 9, 2, 6, 4]`|
|`takeOrdered(n)`| Returns first n elements in natural order, or using some custom comparator|`rdd.takeOrdered(3)`|`[1,2,3]`|
|`saveAsTextFile(path)`|Saves dataset as text file into given directory|`rdd.saveAsTextFile('dir')`| |
|`countByKey()`|Is applicable to key-value RDDs only. Returns a dictionary with the keys and their counts in the RDD|`sc.parallelize([('hello','world'),('hello','python')]).countByKey()`|`{'hello': 2}`|
|`foreach(func)`|Applies the function `func` to each element of the RDD|`rdd.foreach(print)`|`1 2 3 4 5 6 7 8 9 10` (printed on the workers, in no guaranteed order)|

`collect()` returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

In [None]:
rdd.collect()

Let's go through a few examples of the functions described above:

In [None]:
# Reduce elements summing them together
rdd.reduce(lambda x,y: x + y)
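
The requirement that the function passed to `reduce` be commutative and associative can be illustrated without Spark. In the sketch below, `partitioned_reduce` is a hypothetical helper that reduces each partition on its own and then combines the partial results, which is roughly the shape of a parallel reduce; the two ways of splitting the data are arbitrary choices for the demonstration.

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Reduce every partition separately, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = list(range(1, 11))                      # 1..10
split_a = [data[:5], data[5:]]                 # two partitions
split_b = [data[:2], data[2:7], data[7:]]      # three partitions

add = lambda x, y: x + y
sub = lambda x, y: x - y

# Addition is commutative and associative: the split does not matter.
sum_a = partitioned_reduce(add, split_a)
sum_b = partitioned_reduce(add, split_b)

# Subtraction is neither: the result depends on how the data was split.
diff_a = partitioned_reduce(sub, split_a)
diff_b = partitioned_reduce(sub, split_b)

print(sum_a, sum_b, diff_a, diff_b)
```

Since the partitioning of an RDD is normally not under your direct control, a function like subtraction would give results that change with the cluster layout.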

In [None]:
# Return the number of elements in dataset
rdd.count()

In [None]:
# Return array with first n elements from dataset
rdd.take(3)

In [None]:
# Return first element of the dataset. It is similar to take(1)
rdd.first()

Besides `take`, the related actions `takeSample` and `takeOrdered` can be used.

In [None]:
rdd.takeSample(False, 5, 0)

In [None]:
rdd.takeOrdered(5)
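
For intuition, the results of `takeOrdered` and `takeSample` resemble what the standard library gives for an ordinary list. The sketch below is a local analogy only (Python 3, no Spark); the seed value and variable names are arbitrary, and a seeded sample in Spark will not contain the same elements as the one here.

```python
import heapq
import random

data = list(range(10, 0, -1))   # 10, 9, ..., 1 (deliberately unsorted)

# Like takeOrdered(3): the three smallest elements in natural order
smallest = heapq.nsmallest(3, data)

# Like takeOrdered(3, key=lambda x: -x): a custom key reverses the order
largest = heapq.nsmallest(3, data, key=lambda x: -x)

# Like takeSample(False, 5, seed): five distinct elements, reproducible per seed
sample = random.Random(0).sample(data, 5)
sample_again = random.Random(0).sample(data, 5)

print(smallest, largest, sample)
```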

`saveAsTextFile(path)` writes the elements of the dataset as text files into the given directory on the local file system, HDFS, or another Hadoop-supported file system. Each element of the RDD is converted to its string representation (`toString()` in the Scala API, `str()` in Python). The number of output files equals the number of partitions of the RDD.

In [None]:
# Write rdd to folder_rdd
# This raises an exception if the folder already exists
rdd.saveAsTextFile("folder_rdd")

`countByKey()` works only on RDDs of key-value pairs and returns a dictionary with the keys and the count of each key.

In [None]:
kv_rdd.countByKey()

`foreach(f)` runs a function f on each element from dataset.

In [None]:
# You would get no output in ipython notebook
# because output is printed on the worker. 
rdd.foreach(print)

## Using predefined functions in Spark
You've probably noticed that a lot of transformations take a function as an argument. For simple transformations lambda functions are generally enough, but when you want to perform something more complex, custom predefined functions come in handy. Say you want to calculate the min, max, and average of each list in an RDD of lists.

In [None]:
# Function to calculate min,max,average
def min_max_avg(array):
    mn = min(array)
    mx = max(array)
    avg = sum(array)/len(array)
    return (mn,mx,avg)
# List of lists
data = [[1,2,3,5],[4,5,6],[7,8,9]]
# Make RDD
listrdd = sc.parallelize(data)
# Apply our function to RDD
listrdd.map(min_max_avg).collect()

## RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
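
The effect of caching can be shown with a small toy class. `LazyDataset` below is a hypothetical stand-in for an RDD, written only for this illustration: it recomputes its elements on every `collect()` unless `cache()` was called, and it counts how often the function is applied. It models the idea only and has nothing to do with Spark's actual storage machinery.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless cached."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.use_cache = False
        self.cached = None
        self.evaluations = 0   # how many times func has been applied

    def cache(self):
        self.use_cache = True
        return self

    def collect(self):
        if self.cached is not None:
            return self.cached            # reuse the stored result
        result = []
        for x in self.source:
            self.evaluations += 1
            result.append(self.func(x))
        if self.use_cache:
            self.cached = result          # keep it for later actions
        return result

uncached = LazyDataset([1, 2, 3], lambda x: x * x)
uncached.collect()
uncached.collect()

cached = LazyDataset([1, 2, 3], lambda x: x * x).cache()
cached.collect()
cached.collect()

print(uncached.evaluations, cached.evaluations)
```

Two actions on the uncached dataset apply the function six times, while the cached one applies it only three times; with an expensive function or a long chain of transformations this is where the speed-up comes from.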

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
In addition, each persisted RDD can be stored using a different storage level.
Take a look at available storage levels in pyspark docs
[here](https://spark.apache.org/docs/preview/api/python/pyspark.html#pyspark.StorageLevel).

In [None]:
from pyspark import StorageLevel
rdd.persist(StorageLevel(False, True, False, False, 1))  # useDisk, useMemory, useOffHeap, deserialized, replication

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

In [None]:
rdd.unpersist()

* `cache()` persists this RDD with the default storage level (`MEMORY_ONLY`).

In [None]:
rdd.cache()

For further information refer to pyspark documentation [here](https://spark.apache.org/docs/preview/api/python/pyspark.html).

## Working with files
We'll read a text file containing Alice in Wonderland by Lewis Carroll. The book can be downloaded freely from [Project Gutenberg](https://www.gutenberg.org/).  
[Link](https://www.gutenberg.org/files/11/11-0.txt) for the Alice text file.

In [None]:
alice = sc.textFile("alice.txt")

Take a look at the first 5 lines of the document.

In [None]:
alice.take(5)

Let's do some data analysis on this text. Say we want to find the most common words in the text; this word count example is a sort of Hello World for Spark. To do so we first need to apply some transformations to our data.

In [None]:
# First we have to split each sentence into words
data = alice.flatMap(lambda x: x.split(" "))
# Then we have to make everything lowercase,
# so "the" and "The" are not counted twice
data = data.map(lambda x: x.lower())
# Now we map all our words to tuples (word, 1)
data = data.map(lambda x: (x,1))
# Now we reduceByKey, summing those ones for every key
data = data.reduceByKey(lambda x,y : x + y)
# Invert tuples, to make sorting by value possible
data = data.map(lambda x: (x[1],x[0]))
# Sort data by values descending
data = data.sortByKey(False)
# Invert back
data = data.map(lambda x: (x[1],x[0]))
# Print results
data.take(15)

As you can see, the most common word is "", which is not a word at all, followed by "the", which is a so-called stopword. [Stopwords](https://en.wikipedia.org/wiki/Stop_words) are the most common words in a language, which usually carry no important information. There is also punctuation, which is unwanted as well. So we need to do some text preprocessing before counting words.

In [None]:
# import Regular Expressions
import re
# this is a list of stopwords from NLTK.corpus
sw = [
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
    'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
    'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
    'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
    'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
    'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
    'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
    'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
    'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
    'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
    'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
    'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]
data = alice
# Let's remove punctuation
data = data.map(lambda x: re.sub('[^a-zA-Z0-9]',' ', x))
# First we have to split each sentence into words
data = data.flatMap(lambda x: x.split())
# Then we have to make everything lowercase,
# so "Spark" and "spark" are not counted twice as different words
data = data.map(lambda x: x.lower())
# filter out all words which are in stopwords
data = data.filter(lambda x: x not in sw)
# filter out all words that consist of a single letter
data = data.filter(lambda x: len(x) > 1)
# Now we map all our words to tuples (word, 1)
data = data.map(lambda x: (x,1))
# Now we reduceByKey, summing those ones for every key
data = data.reduceByKey(lambda x,y : x + y)
# Invert tuples, to make sorting by value possible
data = data.map(lambda x: (x[1],x[0]))
# Sort data by values descending
data = data.sortByKey(False)
# Invert back
data = data.map(lambda x: (x[1],x[0]))
# Print results
data.take(15)
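
When building a pipeline like the one above, it can help to check the preprocessing steps on a tiny input whose answer is obvious. The sketch below repeats the same steps (strip punctuation, split, lowercase, drop stopwords and one-letter words, count) in plain Python with `collections.Counter`. The two sample lines and the shortened stopword set are made up for the illustration and are not taken from the notebook's data.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and a shortened stopword set for the check.
text_lines = [
    "Alice was beginning to get very tired,",
    "and Alice thought: what is the use of a book?",
]
stopwords = {"was", "to", "very", "and", "what", "is", "the", "of", "a"}

words = []
for line in text_lines:
    # Replace everything that is not a letter or digit with a space
    cleaned = re.sub('[^a-zA-Z0-9]', ' ', line)
    for word in cleaned.split():
        word = word.lower()
        # Drop stopwords and one-letter words
        if word not in stopwords and len(word) > 1:
            words.append(word)

word_counts = Counter(words)
top = word_counts.most_common(1)

print(top)
```

In PySpark itself, the invert, sort, invert sequence used above can also be written in one step with `sortBy(lambda x: x[1], ascending=False)`.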

> # Exercise 1
> Find the five most common words starting with the letter "a" in the Alice in Wonderland text.

In [None]:
# type your code here 

> # Exercise 2
> Find the 10 least used words in the text.

In [None]:
# type your code here 

> # Exercise 3
> Find the frequencies of the starting letters of the words in the text.

In [None]:
# type your code here