### 2.1 DataFrame
#### 1. Creating DataFrames
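1. `Dataset`:  
 A `Dataset` is a distributed collection of data that keeps the advantages of RDDs (strong typing, support for lambda expressions).  
 A `Dataset` supports the usual functional operators, such as `map`, `flatMap`, and `filter`.  
2. `DataFrame`:  
  A `DataFrame` is conceptually the same as a DataFrame in Python (pandas): a `Dataset` organized into named columns. In the Scala API, `DataFrame` is a type alias for `Dataset[Row]`.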
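The relationship between the two types can be checked directly in code. The following is a minimal sketch, assuming a local session (named `session` here, an illustrative name) and a small in-memory sample instead of `people.json`:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Assumption: a local session for illustration; in a notebook, getOrCreate()
// returns the session that already exists.
val session = SparkSession.builder().master("local[*]").appName("df-vs-ds").getOrCreate()
import session.implicits._

// A DataFrame is a Dataset of untyped Row objects with named columns.
val sample: DataFrame = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

// This assignment compiles because DataFrame is an alias for Dataset[Row].
val sameThing: Dataset[Row] = sample

// A typed Dataset keeps the element type, so the lambda is checked at compile time.
// For tuple types, columns are mapped by position.
val typed: Dataset[(String, Long)] = sample.as[(String, Long)]
val agesNextYear: Array[Long] = typed.map { case (_, age) => age + 1 }.collect()
```

The untyped `sample` would only report a wrong column name at run time, whereas a mistake inside the typed lambda is rejected by the compiler.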

In [1]:
// org.apache.spark.sql.Dataset
val df = spark.read.json("people.json")
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]


[age: bigint, name: string]

In [2]:
print(df.getClass)

class org.apache.spark.sql.Dataset

In [3]:
df.printSchema

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [4]:
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [5]:
df.select($"name",$"age"+1).show()

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [6]:
df.filter($"age">21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [7]:
df.groupBy($"age").count.show()

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+



In [8]:
// collect_list and collect_set are defined in org.apache.spark.sql.functions.
// Without this import the cell fails with "not found: value collect_list".
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.agg(collect_list($"name"), collect_set($"name")).show()

#### 2. Running SQL Queries on DataFrames
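1. Register the DataFrame as a global temporary view. Global temporary views live in the system database `global_temp`, so queries must use the qualified name.  
2. Run the query with `spark.sql`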

In [13]:
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

In [14]:
val res = spark.sql("select * from global_temp.people")
println(res.getClass)
res.show()

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()

class org.apache.spark.sql.Dataset
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



res = [age: bigint, name: string]


[age: bigint, name: string]
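To make the "cross-session" property concrete, the sketch below compares a global temporary view with an ordinary (session-scoped) temporary view. The view names `people_local` and `people_global` and the in-memory sample are assumptions made for illustration:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session for illustration.
val session = SparkSession.builder().master("local[*]").appName("temp-views").getOrCreate()
import session.implicits._

val people = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

// Hypothetical view names, registered in the current session.
people.createOrReplaceTempView("people_local")          // visible in this session only
people.createOrReplaceGlobalTempView("people_global")   // shared through global_temp

// A new session shares the SparkContext but has its own temporary views.
val other = session.newSession()

// spark.sql analyzes the query eagerly, so a missing view fails immediately.
val localVisible = Try(other.sql("SELECT * FROM people_local")).isSuccess
val globalVisible = Try(other.sql("SELECT * FROM global_temp.people_global")).isSuccess

println(s"session-scoped view visible in new session: $localVisible")
println(s"global view visible in new session: $globalVisible")
```

Using the `createOrReplace...` variants also avoids the error that `createGlobalTempView` raises when a cell is run a second time and the view already exists.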

### 2.2 Datasets
#### 1. Creating Datasets
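1. Datasets are similar to RDDs, but they do not use Java serialization or `Kryo`. Instead, they use a specialized `Encoder` to serialize objects into bytes for processing or for transmission over the network.  
 Encoders are generated dynamically as code and use a format that lets Spark run operations such as `filter` and `sort` directly on the serialized bytes, without deserializing them back into objects.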
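One way to see the difference between an `Encoder` and generic serialization is to compare the schemas they expose. This is a small sketch; the case class `City` is a hypothetical example type, not part of the notebook:

```scala
import org.apache.spark.sql.Encoders

// Hypothetical case class used only for this illustration.
case class City(name: String, population: Long)

// The product encoder exposes every field as its own column,
// which is what allows Spark to filter or sort on a field directly.
val columnar = Encoders.product[City].schema

// The Kryo encoder stores the whole object as a single opaque binary column.
val opaque = Encoders.kryo[City].schema

println(columnar.fieldNames.mkString(", "))
println(opaque.fieldNames.mkString(", "))
```

With the Kryo encoder, Spark has to deserialize the whole object before it can look at any field, which is why the built-in encoders are preferred whenever they are available.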

In [10]:
case class Person(name:Option[String],age:Option[Long])

defined class Person




In [16]:
val myspark = spark
import myspark.implicits._   // A wildcard import needs a stable identifier (a val); the built-in spark object in this environment is a var

// Encoders are created for case classes
val caseClassDS = Seq(Person(Some("zhangsan"),Some(25))).toDS
caseClassDS.show()

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect().foreach(println)

+--------+---+
|    name|age|
+--------+---+
|zhangsan| 25|
+--------+---+

2
3
4


myspark = org.apache.spark.sql.SparkSession@795e04fc
caseClassDS = [name: string, age: bigint]
primitiveDS = [value: int]


[value: int]

2. A `Dataset` can be obtained from a `DataFrame` by specifying the target `class`.  
 (The field names of the class must match the column names of the DataFrame.)

In [17]:
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val ds = spark.read.json("people.json").as[Person]
ds.collect().foreach(println)

Person(Some(Michael),None)
Person(Some(Andy),Some(30))
Person(Some(Justin),Some(19))


ds = [age: bigint, name: string]


[age: bigint, name: string]
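Once the rows are typed as `Person`, ordinary Scala lambdas can be used on the fields, including the `Option` wrappers that represent nullable columns. A minimal sketch, assuming an in-memory stand-in for `people.json` and a local session named `session` (`Person` is redeclared only to keep the block self-contained):

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the Person class defined earlier in the notebook.
case class Person(name: Option[String], age: Option[Long])

// Assumption: a local session for illustration.
val session = SparkSession.builder().master("local[*]").appName("typed-ops").getOrCreate()
import session.implicits._

// In-memory stand-in for people.json.
val people = Seq(
  Person(Some("Michael"), None),
  Person(Some("Andy"), Some(30L)),
  Person(Some("Justin"), Some(19L))
).toDS()

// A missing age (None) simply fails the predicate; there is no null check to forget.
val olderThan21 = people
  .filter(_.age.exists(_ > 21))
  .map(_.name.getOrElse("unknown"))
  .collect()

olderThan21.foreach(println)
```

Declaring the fields as `Option` is what lets the `null` age of Michael be read as `None` instead of causing an error during conversion.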

#### 2. Using Case Class Reflection to Convert DataFrames to Datasets

In [20]:
import myspark.implicits._
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

df = [age: bigint, name: string]


[age: bigint, name: string]

In [58]:
val teenagerDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 12 AND 31")
teenagerDF.show
// The columns of a row in the result can be accessed by field index
teenagerDF.map(x=>"name: "+x(0)).collect.foreach(println)

+------+---+
|  name|age|
+------+---+
|  Andy| 30|
|Justin| 19|
+------+---+

name: Andy
name: Justin


teenagerDF = [name: string, age: bigint]


[name: string, age: bigint]

In [59]:
// or by field name
teenagerDF.map(x=>x.getAs[String]("name")).collect.foreach(println)

Andy
Justin


In [63]:
// Encoders that are not provided by spark.implicits._ must be declared explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String,Any]]  // Kryo encoder for maps with String keys and values of any type
// val ds = teenagerDF.map(row => row.getValuesMap[Any](List("age","name")))
// ds.collect.foreach(println)

teenagerDF.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)


Map(name -> Andy, age -> 30)
Map(name -> Justin, age -> 19)
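The encoder does not have to be an implicit value. As a sketch of an alternative, it can be passed explicitly in the second parameter list of `map`, which keeps it out of the implicit scope of the rest of the notebook. The session name `session` and the in-memory sample are assumptions for illustration:

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}

// Assumption: a local session for illustration.
val session = SparkSession.builder().master("local[*]").appName("explicit-encoder").getOrCreate()
import session.implicits._

val people = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

// Not implicit on purpose: it is used only where it is passed by hand.
val mapEncoder: Encoder[Map[String, Any]] = Encoders.kryo[Map[String, Any]]

// Dataset.map takes the encoder as a second (normally implicit) parameter list.
val maps = people
  .map((row: Row) => row.getValuesMap[Any](List("name", "age")))(mapEncoder)
  .collect()

maps.foreach(println)
```

Because the Kryo encoder stores each map as one binary value, the resulting Dataset is convenient for collecting results but cannot be queried column by column.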
