### 2.1 DataFrame
#### 1. Creating DataFrames
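1. `Dataset`:  
 A `Dataset` is a distributed collection of data that keeps the benefits of RDDs (strong typing, support for lambda expressions). A Dataset can be constructed from JVM objects and then manipulated with transformation operators.
2. `DataFrame`:  
  In the Scala API, `DataFrame` is simply a type alias for `Dataset[Row]`.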

In [1]:
// org.apache.spark.sql.Dataset
val df = spark.read.json("people.json")
println(df.getClass)
df.show()

class org.apache.spark.sql.Dataset
+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|null|Michael|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]


[age: bigint, name: string]

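#### 2. Untyped Dataset Operations (DataFrame)
1. A Dataset is a collection of strongly typed JVM objects, following the same design idea as RDDs.  
2. Conceptually, a DataFrame is an untyped Dataset, that is, a Dataset made up of `Row` objects: `DataFrame = Dataset[Row]`.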
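
The relationship between the two types can be checked directly in code. The following is a minimal sketch, assuming a local Spark 2.x session; the names `session`, `people`, `asDataset`, and `typed` are chosen here for illustration and are not part of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Reuses the notebook's session if one already exists (assumption: local mode otherwise)
val session = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
import session.implicits._

// A small in-memory DataFrame, so no input file is needed
val people: DataFrame = Seq(("Andy", 30L)).toDF("name", "age")

// Compiles without any conversion because DataFrame is an alias for Dataset[Row]
val asDataset: Dataset[Row] = people

// Moving to a typed Dataset requires an explicit as[...]; tuples are mapped by column position
val typed: Dataset[(String, Long)] = people.as[(String, Long)]
println(typed.first())
```

The assignment to `Dataset[Row]` needs no cast, while the typed view is only obtained through `as[...]`, which is where an `Encoder` comes into play.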

In [None]:
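// The $"col" column syntax used below needs the session implicits
// (spark-shell imports them automatically; a notebook kernel may not)
val myspark = spark
import myspark.implicits._

// Print the schema in a tree format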
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

#### 3. Running SQL Queries on DataFrames

1. The `sql` function on a `SparkSession` lets an application run SQL queries programmatically and returns the result as a DataFrame.

In [2]:
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|null|Michael|
|  19| Justin|
+----+-------+



sqlDF = [age: bigint, name: string]


[age: bigint, name: string]

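2. Global temporary views  
Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates. To share a temporary view across all sessions and keep it alive until the Spark application exits, create a global temporary view. Global temporary views live in the system database `global_temp`, so they must be referenced with the qualified name, for example `SELECT * FROM global_temp.view1`.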

In [None]:
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
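
The difference in scope between the two kinds of view can be observed side by side. This is a minimal sketch under the assumption of a local Spark 2.2+ session; the view names `people_local` and `people_global` and the values `session`, `demo`, and `other` are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Reuses the notebook's session if one already exists (assumption: local mode otherwise)
val session = SparkSession.builder().master("local[*]").appName("view-scope-sketch").getOrCreate()
import session.implicits._

val demo = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

// Session-scoped view: visible only in the session that created it
demo.createOrReplaceTempView("people_local")
// Application-scoped view: stored in the global_temp database
// (the "OrReplace" variant avoids an error when the cell is run twice)
demo.createOrReplaceGlobalTempView("people_global")

val other = session.newSession()

// Looking up the session-scoped view from another session is expected to fail
val localVisible = Try(other.sql("SELECT * FROM people_local").count()).isSuccess
// The global view resolves from any session when qualified with global_temp
val globalVisible = Try(other.sql("SELECT * FROM global_temp.people_global").count()).isSuccess

println(s"local view visible in new session:  $localVisible")
println(s"global view visible in new session: $globalVisible")
```

Note that `createGlobalTempView` (as used in the cell above) throws if a view with the same name already exists, which is why the sketch uses the replacing variant.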

### 2.2 Datasets
#### 1. Creating Datasets
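1. Datasets are similar to RDDs, but instead of Java serialization or `Kryo` they use a specialized `Encoder` to serialize objects into bytes for processing or for transfer across the cluster network.  
 Encoders are code that is generated dynamically, and they use a format that lets Spark run operations such as filter and sort directly on the serialized data, without deserializing it back into objects.  
2. A Dataset can be created from a `Seq`, or converted from a DataFrame.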

In [None]:
val myspark = spark
import myspark.implicits._

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
// case class Person(name: Option[String], age: Option[Long])
case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+


// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1,2,3).toDS
primitiveDS.map(x=>x+1).collect // Returns: Array(2, 3, 4)


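// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.
// people.json contains a record without an age, and a null cannot be decoded into a primitive Long,
// so this conversion uses an Option-based class (PersonOpt is a name chosen here for illustration).
case class PersonOpt(name: Option[String], age: Option[Long])
val path = "people.json"
val peopleDS = spark.read.json(path).as[PersonOpt]
peopleDS.collect
// res4: Array[PersonOpt] = Array(PersonOpt(Some(Michael),None),
//                                PersonOpt(Some(Andy),Some(30)),
//                                PersonOpt(Some(Justin),Some(19))
//                               )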
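
To see what an encoder derives from a case class, its schema can be inspected without creating any data. The sketch below is only illustrative: `EncodedPerson` and `enc` are hypothetical names, and the example uses `Encoders.product`, which builds an encoder for a case class explicitly instead of relying on the implicits.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical case class for this sketch; Option marks a field that may be missing
case class EncodedPerson(name: String, age: Option[Long])

// Build the encoder explicitly (importing spark.implicits._ would derive the same one implicitly)
val enc: Encoder[EncodedPerson] = Encoders.product[EncodedPerson]

// The encoder carries the schema Spark uses for the Dataset's columns
enc.schema.printTreeString()
```

The field names of the case class become the column names, which is what makes the name-based mapping in `as[...]` possible.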


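### 2.3 Interoperating with RDDs  
Spark SQL supports two different methods for converting existing RDDs into Datasets:  
1. The first method uses reflection to infer the schema of an RDD that contains specific types of objects.   
2. The second method is a programmatic interface that lets you construct a schema and then apply it to an existing RDD.    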

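#### 1. Inferring the Schema Using Reflection  
1. The Scala interface for Spark SQL supports automatically converting an RDD that contains case classes into a DataFrame.  
2. The case class defines the schema of the table. The names of the case class arguments are read using reflection and become the column names.  
3. Case classes can also be nested or contain complex types such as `Seq` or `Array`. The RDD can be implicitly converted to a DataFrame and then registered as a table, which can be used in subsequent SQL statements.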

In [None]:
val myspark = spark
import myspark.implicits._

case class Person(name: String, age: Long)

val rdd1 = spark.sparkContext.textFile("people.txt")
val rdd2 = rdd1.map(x=>x.split(",")).map(x=>Person(x(0),x(1).trim.toLong)) // RDD[Person]
val peopleDF = rdd2.toDF
// Print each Row; println on the array itself would only show an object reference
peopleDF.collect.foreach(println)
// [Michael,29]
// [Andy,30]
// [Justin,19]


// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
// (map on a DataFrame returns a Dataset, not an RDD)
val namesByIndex = teenagersDF.map(row => "Name:" + row(0))
namesByIndex.collect.foreach(println)
// or by field name
val namesByField = teenagersDF.map(row => "Name:" + row.getAs[String]("name"))
namesByField.collect.foreach(println)

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String,Any]]
val valuesByName = teenagersDF.map(row => row.getValuesMap[Any](List("name","age")))
valuesByName.collect.foreach(println)

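#### 2. Building a StructType Object
1. Create an `RDD[Row]` from the original RDD.  
2. Create a `StructType` schema that matches the structure of the `Row`s in that RDD.  
3. Apply the schema to the RDD of `Row`s via the `createDataFrame` method provided by `SparkSession`.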

In [1]:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val path = "/Users/lj/devkits/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/people.txt"
// StructType
val schema = StructType(Array(StructField("name",StringType,nullable=true),
                             StructField("age",StringType,nullable=true)))
// Original RDD
val rdd1 = spark.sparkContext.textFile(path)
val rdd2 = rdd1.map(x=>x.split(","))
val rdd3 = rdd2.map(x=>Row(x(0),x(1).trim))  // RDD[Row]

//DataFrame
val peopleDF = spark.createDataFrame(rdd3,schema)  // RDD to Dataframe
peopleDF.createOrReplaceTempView("people")

val res = spark.sql("select collect_list(age) from people")
res.collect

Array([WrappedArray(29, 30, 19)])
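
The cell above declares `age` as `StringType`, so the values stay strings. As a variation, the schema can carry a numeric type as long as the `Row` values are converted to match. This is a minimal sketch, assuming a local session and in-memory input lines in place of `people.txt`; `session`, `lines`, `typedSchema`, `typedDF`, and the view name `people_typed` are hypothetical names used only here.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reuses the notebook's session if one already exists (assumption: local mode otherwise)
val session = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// In-memory stand-in for the text file, using the same "name, age" layout
val lines = session.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

// Step 2: a schema whose age column is an integer instead of a string
val typedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Step 1: the Row values must match the declared types, hence toInt
val rows = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

// Step 3: apply the schema to the RDD[Row]
val typedDF = session.createDataFrame(rows, typedSchema)
typedDF.createOrReplaceTempView("people_typed")

// Numeric aggregates now work on the column directly
val avgAge = session.sql("SELECT avg(age) FROM people_typed").first().getDouble(0)
println(avgAge)
```

If the declared type and the actual `Row` value disagree (for example `IntegerType` with a string value), the error only surfaces at runtime when the data is evaluated, so keeping the conversion next to the schema definition helps.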

#### 3. pivot