<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>


# RDD Manipulations 

In this section, we present some basic manipulations on RDDs (Resilient Distributed Datasets), the basic abstraction in Spark.

In summary, "it represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit.

Internally, each RDD is characterized by five main properties:

- A list of partitions 
- A function for computing each split 
- A list of dependencies on other RDDs 
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Please refer to the Spark paper for more details on RDD internals."

(Source: https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html).
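The five properties listed above can be inspected on a live RDD. The following is a minimal sketch, assuming a Spark 2.x or later environment; the session setup, the sample pairs and the variable names are illustrative and not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("rdd-properties").getOrCreate()

// Hypothetical sample data, explicitly split into 2 partitions
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)
val reduced = pairs.reduceByKey(_ + _)

println(pairs.partitions.length)                        // number of partitions
println(reduced.dependencies)                           // dependencies on the parent RDD
println(pairs.partitioner)                              // no partitioner on a freshly parallelized RDD
println(reduced.partitioner)                            // reduceByKey attaches a partitioner
println(pairs.preferredLocations(pairs.partitions(0)))  // preferred hosts for the first partition
```

The function for computing each split is the `compute` method, which Spark calls internally when an action runs.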

RDD operations fall into two groups: transformations and actions. A transformation (e.g. map, flatMap) only defines a new RDD from an existing one, without executing anything or computing any result; this is why transformations return almost immediately. An action (e.g. count, collect, take) actually computes the results.
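The difference can be sketched on a tiny RDD. This example assumes a Spark 2.x or later environment; the numbers and variable names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 10)  // hypothetical sample data
val doubled = numbers.map(_ * 2)                       // transformation: only recorded, nothing runs
val multiplesOfFour = doubled.filter(_ % 4 == 0)       // transformation: still nothing runs
val howMany = multiplesOfFour.count()                  // action: the whole chain is executed now
println(howMany)
```

Only the last line launches a Spark job; the two transformations before it simply extend the lineage of the RDD.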

To illustrate some basic actions and transformations on RDDs, let's take a word count example!

### We read data from HDFS. Using sc, we get an RDD as output

In [ ]:
val data = sc.textFile("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/WordCountDataset.txt")

/*
Here we used sc.textFile to read a text file from HDFS. To read a JSON file instead, use a SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example") 
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val df = spark.read.json("examples/src/main/resources/example.json") // returns a DataFrame, i.e. a Dataset[Row]
*/

data: org.apache.spark.rdd.RDD[String] = hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/WordCountDataset.txt MapPartitionsRDD[1] at textFile at <console>:67


### We count the number of lines in data

In [ ]:
data.count()

res2: Long = 4


### We can view all of the data, but be careful: if the data is too big, this action can crash the notebook!

In [ ]:
data.collect()

res4: Array[String] = Array(A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles., Internally, each RDD is characterized by five main properties:, - A list of partitions - A function for computing each split - A list of dependencies on other RDDs - Optionally, a Partitioner for key-val...

### We can also do it this way...

In [ ]:
data.collect().foreach(println)

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles.
Internally, each RDD is characterized by five main properties:
- A list of partitions - A function for computing each split - A list of dependencies on other RDDs - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
All of the scheduling and execution in Spark is done based on these m

### But it is usually better to take the first n lines of data instead of collecting all the data

In [ ]:
data.take(2)

res4: Array[String] = Array(A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles., Internally, each RDD is characterized by five main properties:)


### Next, we remove all the special characters (punctuation) from data by using a regex.

To learn more about regex, see the Pattern documentation (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html). In this transformation, we use map() to apply the function replaceAll to each element of the RDD.

In [ ]:
val data_without_special_characters = data.map(l => l.replaceAll("""[\p{Punct}]""", ""))

data_without_special_characters: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:69


In [ ]:
data_without_special_characters.take(10)

res14: Array[String] = Array(A Resilient Distributed Dataset RDD the basic abstraction in Spark Represents an immutable partitioned collection of elements that can be operated on in parallel This class contains the basic operations available on all RDDs such as map filter and persist In addition PairRDDFunctions contains operations available only on RDDs of keyvalue pairs such as groupByKey and join DoubleRDDFunctions contains operations available only on RDDs of Doubles and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles, Internally each RDD is characterized by five main properties, " A list of partitions  A function for computing each split  A list of dependencies on other RDDs  Optionally a Partitioner for keyvalue RDDs eg to say that...
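To see exactly what the `\p{Punct}` pattern removes, the same replacement can be tried on a plain Scala string, without Spark. The sample string is one line of the dataset; the variable names are illustrative.

```scala
// One line of the dataset, used here as a plain String
val line = "Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)"

// Same regex as in the notebook: every ASCII punctuation character is deleted
val cleaned = line.replaceAll("""[\p{Punct}]""", "")
println(cleaned)
// prints: Optionally a Partitioner for keyvalue RDDs eg to say that the RDD is hashpartitioned
```

Because punctuation is deleted rather than replaced by a space, `key-value` becomes `keyvalue` and `e.g.` becomes `eg`. Replacing with `" "` instead of `""` would keep such words apart, at the cost of extra spaces.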

### Then, we split data into words and convert them all to lower case. 

Here we use flatMap, which is equivalent to map() followed by "flatten". The function inside flatMap turns each line into an Array of String (its lower-cased words). A plain map() would therefore give an RDD of Array of String, whereas flatMap flattens these arrays into a single RDD of String.

In [ ]:
val words = data_without_special_characters.flatMap(l => l.split(" ").map(l => l.toLowerCase))

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:71


In [ ]:
words.collect().size

res10: Int = 198
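The relation between map, flatten and flatMap can be checked on ordinary Scala collections, which follow the same semantics as RDDs for these operations. This is a sketch on two hypothetical lines, not the notebook's dataset.

```scala
// Two hypothetical lines; the first has a leading space and a double space,
// like the dataset lines once their "-" bullets have been removed
val lines = Seq(" A list of partitions  A function", "Internally each RDD")

// map keeps one collection of words per line
val nested = lines.map(l => l.split(" ").map(_.toLowerCase).toSeq)

// flatMap gives a single flat collection of words
val flat = lines.flatMap(l => l.split(" ").map(_.toLowerCase))
println(flat == nested.flatten)  // flatMap is map followed by flatten

// split(" ") yields empty strings around leading or repeated spaces;
// splitting on whitespace runs and dropping empty tokens avoids them
val cleaned = lines.flatMap(_.split("\\s+")).map(_.toLowerCase).filter(_.nonEmpty)
println(cleaned.mkString(","))
```

The empty tokens produced by `split(" ")` are the reason an empty-string key `""` shows up in the word counts computed from `words`.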


### We can now count by key to compute the number of occurrences of each word

There are two ways to do this:

#### 1/ First option:

We map each element of the RDD to a (word, 1) pair, then sum the values per key (word) by using reduceByKey. The function "_ + _" means that whenever two values share the same key (word), they are added together.

In [ ]:
val wordCount = words.map(l => (l, 1)).reduceByKey(_ + _)
wordCount.collect()

wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:75
res17: Array[(String, Int)] = Array((reading,1), (this,1), (paper,1), (doublerddfunctions,1), (collection,1), (internally,1), (is,3), (its,1), (sequencefiles,1), (persist,1), (only,2), (data,1), (abstraction,1), (internals,1), (basic,2), (map,1), (sequencefilerddfunctions,1), (class,1), (methods,1), (scheduling,1), (execution,1), (new,1), (computing,2), (locations,2), (join,1), (other,1), (file,1), (from,1), (details,1), (dataset,1), (immutable,1), (refer,1), (can,3), (doubles,1), (split,2), (filter,1), (operated,1), (keyvalue,2), (eg,3), (as,3), ("",5), (operations,4), (done,1), (list,3), (please,1), (users,1), (five,1), (own,1), (rdd,5), (function,1), (addition,1), (itself,1), (character...
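The same pattern is easier to follow on a handful of words. The sketch below assumes a Spark 2.x or later environment and uses a made-up word list; `(a, b) => a + b` is the long form of `_ + _`.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("reduce-by-key").getOrCreate()

// Hypothetical sample words
val sample = spark.sparkContext.parallelize(Seq("is", "rdd", "is", "spark", "is", "rdd"))

val counts = sample
  .map(w => (w, 1))                 // ("is",1), ("rdd",1), ("is",1), ...
  .reduceByKey((a, b) => a + b)     // add the 1s of identical words together
  .collect()
  .toMap

println(counts)  // is -> 3, rdd -> 2, spark -> 1 (order may vary)
```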

#### 2/ Second option:

We start the same way: each element of the RDD is mapped to a pair with value 1. Then we call groupByKey and compute the sum of the list of values for each key.

In [ ]:
val wordCount = words.map(l => (l, 1)).groupByKey().map(l => (l._1, l._2.toList.sum))


wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[12] at map at <console>:73
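Both options give the same counts. The usual difference is that reduceByKey can combine values inside each partition before the shuffle, whereas groupByKey ships every (word, 1) pair across the network, so reduceByKey is generally preferred for aggregations. A minimal comparison, assuming a Spark 2.x or later environment and a made-up word list:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("group-vs-reduce").getOrCreate()

// Hypothetical sample pairs
val pairs = spark.sparkContext
  .parallelize(Seq("is", "rdd", "is", "spark", "is", "rdd"))
  .map(w => (w, 1))

val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap  // group first, then sum each group
val viaReduce = pairs.reduceByKey(_ + _).collect().toMap            // sum while grouping

println(viaGroup == viaReduce)
```

`mapValues(_.sum)` is a shorter equivalent of `map(l => (l._1, l._2.toList.sum))`.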


### For example, if we want to find the count of "is", we can use filter

In [ ]:
wordCount.filter(l => l._1 == "is").take(10)

res8: Array[(String, Int)] = Array((is,3))
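For key-value RDDs, `lookup` is an alternative to filtering on the key: it returns the values stored under a given key. A sketch on made-up counts, assuming a Spark 2.x or later environment:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("lookup-demo").getOrCreate()

// Hypothetical word counts
val counts = spark.sparkContext.parallelize(Seq(("is", 3), ("rdd", 5), ("spark", 1)))

val viaFilter = counts.filter(l => l._1 == "is").collect()  // Array of (word, count) pairs
val viaLookup = counts.lookup("is")                         // Seq of counts only

println(viaFilter.mkString(","))
println(viaLookup.mkString(","))
```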


### So without any filter, the output will be 

In [ ]:
wordCount.take(10)

res23: Array[(String, Int)] = Array((reading,1), (this,1), (paper,1), (doublerddfunctions,1), (collection,1), (internally,1), (its,1), (is,3), (sequencefiles,1), (persist,1))
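Note that take(10) returns the first ten elements in whatever order the RDD holds them, not the ten most frequent words. To rank words by count, one option is to sort before taking. The sketch below assumes a Spark 2.x or later environment and uses made-up counts.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("top-words").getOrCreate()

// Hypothetical word counts
val counts = spark.sparkContext.parallelize(Seq(("spark", 1), ("is", 3), ("rdd", 2)))

// Sort by count, largest first, then keep the first two pairs
val top2 = counts.sortBy(_._2, ascending = false).take(2)
println(top2.mkString(","))
```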


### From RDD to DataFrame and Dataset

We can convert an RDD to a Dataset or a DataFrame. In Spark, these three APIs (RDD, DataFrame and Dataset) are the officially supported data abstractions, and each can be converted to the others.

For a thorough discussion of when to use each of them, see https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

#### From RDD to DataFrame

In [ ]:
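import org.apache.spark.sql.SQLContext

// The SQLContext constructor is deprecated since Spark 2.0 (hence the warning in the output)
val sqlContext = new SQLContext(sc)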
import sqlContext.implicits._

val df = wordCount.toDF("word", "count")

       val sqlContext = new SQLContext(sc)
                        ^
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@194699f1
import sqlContext.implicits._
df: org.apache.spark.sql.DataFrame = [word: string, count: int]


In [ ]:
df.show()

+--------------------+-----+
|                word|count|
+--------------------+-----+
|             reading|    1|
|                this|    1|
|               paper|    1|
|  doublerddfunctions|    1|
|          collection|    1|
|          internally|    1|
|                 its|    1|
|                  is|    3|
|       sequencefiles|    1|
|             persist|    1|
|                only|    2|
|                data|    1|
|         abstraction|    1|
|           internals|    1|
|               basic|    2|
|                 map|    1|
|sequencefilerddfu...|    1|
|               class|    1|
|             methods|    1|
|          scheduling|    1|
+--------------------+-----+
only showing top 20 rows
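Since Spark 2.0, the same conversion can be done through a SparkSession, without creating a SQLContext. This is a minimal sketch with made-up counts; the session setup and names are assumptions, not the notebook's own code.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
import spark.implicits._  // brings toDF and the $"column" syntax into scope

// Hypothetical word counts
val counts = spark.sparkContext.parallelize(Seq(("is", 3), ("rdd", 5), ("spark", 1)))

val countsDf = counts.toDF("word", "count")  // name the two tuple fields
countsDf.printSchema()
countsDf.filter($"count" > 3).show()         // DataFrame operations use column expressions
```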



#### From RDD to Dataset

In [ ]:
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}


In [ ]:
val sparkSession = SparkSession.builder
      .master("local")
      .appName("example")
      .getOrCreate()

val sparkContext = sparkSession.sparkContext
import sparkSession.implicits._

sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4925c3d7
sparkContext: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2405fc0d
import sparkSession.implicits._


In [ ]:
val dataset = wordCount.toDS()

dataset: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]


In [ ]:
dataset.show()

+--------------------+---+
|                  _1| _2|
+--------------------+---+
|             reading|  1|
|                this|  1|
|               paper|  1|
|  doublerddfunctions|  1|
|          collection|  1|
|          internally|  1|
|                 its|  1|
|                  is|  3|
|       sequencefiles|  1|
|             persist|  1|
|                only|  2|
|                data|  1|
|         abstraction|  1|
|           internals|  1|
|               basic|  2|
|                 map|  1|
|sequencefilerddfu...|  1|
|               class|  1|
|             methods|  1|
|          scheduling|  1|
+--------------------+---+
only showing top 20 rows
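A Dataset built from tuples gets the generic column names `_1` and `_2`. Mapping the tuples to a case class first gives named, typed fields. The sketch below assumes a Spark 2.x or later environment; the `WordCount` case class and the sample counts are hypothetical, and in some notebooks the case class may need to be defined in its own cell.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type: one word with its number of occurrences
case class WordCount(word: String, count: Int)

// Reuses the notebook's session if one exists, otherwise starts a local one (assumption)
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-ds").getOrCreate()
import spark.implicits._  // provides the encoders needed by toDS

// Hypothetical word counts
val counts = spark.sparkContext.parallelize(Seq(("is", 3), ("rdd", 5), ("spark", 1)))

val typed = counts.map { case (w, c) => WordCount(w, c) }.toDS()
typed.show()                                                          // columns are now word and count
val frequent = typed.filter(_.count >= 3).map(_.word).collect().sorted // typed lambdas instead of column strings
println(frequent.mkString(","))
```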

