In this section, we reuse the Word Count example from Formation 1, but instead of working with an RDD, we work with a Dataset.

The official definition of a Dataset is: "a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each dataset also has an untyped view called a DataFrame, which is a Dataset of Row." (Source: https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html)

In short, a Dataset combines the typed, functional API of an RDD with the structured schema and optimized execution of a DataFrame. In practice, this means we can apply RDD-style transformations (map, filter, flatMap, ...) to a Dataset that has a schema like a DataFrame.
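As a minimal sketch of this idea (meant for a notebook or spark-shell), the snippet below applies one RDD-style lambda and one DataFrame-style column operation to the same Dataset. The Person case class, its sample values, and the application name are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder
  .master("local[*]")
  .appName("DatasetSketch")
  .getOrCreate()
import spark.implicits._

// A strongly typed Dataset built from in-memory sample data.
val people: Dataset[Person] = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()

// RDD-style: a lambda over typed objects, checked at compile time.
val adults: Dataset[Person] = people.filter(p => p.age >= 18)

// DataFrame-style: a column-based operation, which returns an untyped DataFrame.
val names: DataFrame = people.select("name")
```

The filter keeps the Person type, while select("name") falls back to the untyped view (a Dataset of Row).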

Let's do the word count with a Dataset!

For a more detailed explanation, see: http://blog.madhukaraphatak.com/introduction-to-spark-two-part-2/

### First, to work with a Dataset, we need to import SparkSession (the entry point for the Dataset API, just as sc is the entry point for RDDs)

In [ ]:
import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.SparkSession


### Then we create the session, setting its parameters and changing the configuration if necessary

In [ ]:
val sparkSession = SparkSession.builder
      .master("local")
      .appName("WordCountExample")
      .getOrCreate()

sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@620f8a03
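Changing a configuration value can be sketched as follows. The setting spark.sql.shuffle.partitions is a real Spark SQL option, but the value "4" and the master "local[2]" are arbitrary choices for this illustration. Note that getOrCreate() returns the existing session if one is already running, in which case the master is not changed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[2]")                           // assumed: 2 local threads
  .appName("WordCountExample")
  .config("spark.sql.shuffle.partitions", "4")  // assumed value, for illustration
  .getOrCreate()

// Read the setting back from the session's runtime configuration.
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
println(shufflePartitions)
```

A small number of shuffle partitions is usually enough for a tiny input like the one used here; the default of 200 is intended for larger data.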


### Read data from HDFS 

In [ ]:
import sparkSession.implicits._
val data = sparkSession.read.text("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/WordCountDataset.txt").as[String]

import sparkSession.implicits._
data: org.apache.spark.sql.Dataset[String] = [value: string]
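To try the same read without access to HDFS, one option is a small local file. In the sketch below, the temporary file and its two lines are hypothetical stand-ins for the real dataset. It also shows read.textFile, which returns a Dataset[String] directly and should be equivalent to read.text(...).as[String] for this purpose.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder
  .master("local[*]")
  .appName("ReadTextSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical local file standing in for the HDFS path.
val tmp = Files.createTempFile("wordcount", ".txt")
Files.write(tmp, "first line\nsecond line\n".getBytes("UTF-8"))
val path = tmp.toUri.toString // file:// URI, so the local file system is used

// read.text gives a DataFrame with one "value" column; as[String] makes it typed.
val viaText: Dataset[String] = spark.read.text(path).as[String]

// read.textFile returns the typed Dataset[String] in one step.
val viaTextFile: Dataset[String] = spark.read.textFile(path)
```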


### On a Dataset, we can apply transformations and actions just as on an RDD

In [ ]:
data.filter(l => l.contains("the")).take(1)

res10: Array[String] = Array(A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles.)


In [ ]:
data.count()

res12: Long = 4


In [ ]:
data.take(2)

res14: Array[String] = Array(A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles., Internally, each RDD is characterized by five main properties:)


### We can also apply the functions that we use on a DataFrame

In [ ]:
data.printSchema()

root
 |-- value: string (nullable = true)



In [ ]:
data.show()

+--------------------+
|               value|
+--------------------+
|A Resilient Distr...|
|Internally, each ...|
|- A list of parti...|
|All of the schedu...|
+--------------------+



In [ ]:
data.select("value").take(2)

res20: Array[org.apache.spark.sql.Row] = Array([A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles.], [Internally, each RDD is characterized by five main properties:])
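The result above is an array of Row because select with a column name returns the untyped DataFrame view. A minimal sketch of how to stay typed, or to get back to a typed Dataset, is shown below; the two in-memory lines are hypothetical sample data.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder
  .master("local[*]")
  .appName("TypedSelectSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical in-memory lines; a Dataset[String] has a single column named "value".
val lines: Dataset[String] = Seq("first line", "second line").toDS()

// Selecting by column name loses the element type: the result is a Dataset of Row.
val untyped: DataFrame = lines.select("value")

// Selecting a typed column keeps the element type.
val typed: Dataset[String] = lines.select($"value".as[String])

// An untyped DataFrame can be converted back with as[...].
val backToTyped: Dataset[String] = untyped.as[String]
```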


### Back to the Word Count example: starting from the Dataset data, we apply flatMap to split each line and obtain a Dataset of words

In [ ]:
val words = data.flatMap(value => value.split("\\s+")) // splits on runs of whitespace; split(" ") would split on each single space instead

words: org.apache.spark.sql.Dataset[String] = [value: string]


In [ ]:
words.take(2)

res23: Array[String] = Array(A, Resilient)
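The two ways of splitting are not quite interchangeable. The plain Scala sketch below (no Spark needed) uses a made-up line with a leading space, a double space, and a tab to show the difference, and why filtering out empty tokens can be useful.

```scala
// Made-up input: leading space, double space, and a tab.
val line = " a  b\tc"

// Splitting on a single space leaves empty tokens and does not split on the tab.
val onSingleSpace = line.split(" ")     // Array("", "a", "", "b\tc")

// Splitting on runs of whitespace handles double spaces and tabs,
// but a leading space still produces one empty token.
val onWhitespace = line.split("\\s+")   // Array("", "a", "b", "c")

// Dropping empty tokens gives the words we actually want to count.
val cleaned = onWhitespace.filter(_.nonEmpty) // Array("a", "b", "c")
```

In the Dataset version, the same cleanup could be applied with a filter(_.nonEmpty) after the flatMap.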


### Next, we group the words with groupByKey, using the lower-cased word as the key

In [ ]:
val groupedWords = words.groupByKey(_.toLowerCase)

groupedWords: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = org.apache.spark.sql.KeyValueGroupedDataset@5d8d2924


### Finally, we call count() to get the number of occurrences of each word

In [ ]:
val counts = groupedWords.count()

counts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
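Since a Dataset supports both functional and relational operations, the same counts can also be obtained in DataFrame style. The sketch below compares the two approaches on a handful of hypothetical words; the column name "word" is our own choice.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.lower

val spark = SparkSession.builder
  .master("local[*]")
  .appName("GroupingSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample words.
val words: Dataset[String] = Seq("The", "the", "Spark", "spark", "RDD").toDS()

// Functional style: the key is computed by a Scala lambda.
val typedCounts: Dataset[(String, Long)] = words.groupByKey(_.toLowerCase).count()

// Relational style: the key is a column expression, and the result is a DataFrame.
val relationalCounts: DataFrame = words.groupBy(lower($"value").as("word")).count()
```

Both versions should give the same counts. The relational one expresses the key as a column expression rather than an opaque lambda, which generally gives Spark's optimizer more to work with.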


In [ ]:
counts.show()

+-----------------+--------+
|            value|count(1)|
+-----------------+--------+
|          indeed,|       1|
|            (e.g.|       3|
|          reading|       1|
|hash-partitioned)|       1|
|         methods,|       1|
|               by|       2|
|            based|       1|
|              new|       1|
|              own|       1|
|             more|       1|
|       collection|       1|
|            saved|       1|
|          itself.|       1|
|              can|       3|
|         allowing|       1|
|              for|       5|
|             main|       1|
|           please|       1|
|         operated|       1|
|               in|       4|
+-----------------+--------+
only showing top 20 rows
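The rows above come out in no particular order and under generated column names. As a sketch of one possible final step, the whole pipeline is repeated below on a few hypothetical in-memory lines, with the columns renamed and the result sorted by descending count; the names "word" and "count" are our own choice.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder
  .master("local[*]")
  .appName("WordCountSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical in-memory lines standing in for the HDFS file.
val lines = Seq("the quick brown fox", "The lazy dog", "the end").toDS()

val wordCounts = lines
  .flatMap(_.split("\\s+"))    // one element per token
  .filter(_.nonEmpty)          // drop empty tokens caused by leading whitespace
  .groupByKey(_.toLowerCase)   // group on the lower-cased word
  .count()                     // Dataset[(String, Long)]
  .toDF("word", "count")       // readable column names
  .orderBy(desc("count"))      // most frequent words first

wordCounts.show()
```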

