- Author: Ben Du
- Date: 2020-06-17
- Title: The select Function in Spark DataFrame
- Slug: spark-dataframe-select
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, select
- Modified: 2020-06-17


https://spark.apache.org/docs/latest/sql-programming-guide.html

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

In [1]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder().
    appName("Spark SQL basic example").
    getOrCreate()
// Needed for the $"..." column syntax used in the cells below
import spark.implicits._
spark

spark = org.apache.spark.sql.SparkSession@749b1121


In [2]:
val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



## Select Using String Name

In [12]:
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



## Select Using $ Sign (Syntax Sugar)

In [13]:
df.select($"name", $"age" + 1).show()

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [16]:
df.select($"age" > 21).show()

+----------+
|(age > 21)|
+----------+
|      null|
|      true|
|     false|
+----------+
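
The `$"..."` syntax is shorthand for building a `Column`. As a minimal sketch, the snippet below writes the same projection three ways. The local `SparkSession` settings and the in-memory `people` DataFrame are assumptions standing in for `people.json`; they are not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in a notebook `spark` usually exists already.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._ // brings the $"..." interpolator and toDF into scope

// Hypothetical stand-in for people.json
val people = Seq[(Option[Long], String)](
  (None, "Michael"), (Some(30L), "Andy"), (Some(19L), "Justin")
).toDF("age", "name")

// Three spellings of the same projection
val viaDollar = people.select($"name", $"age" + 1)                // string interpolator
val viaCol    = people.select(col("name"), col("age") + 1)         // functions.col
val viaApply  = people.select(people("name"), people("age") + 1)   // DataFrame.apply

viaCol.show()
```

`col` is handy when column names are held in variables, while `people("name")` ties the column to a specific DataFrame, which can help disambiguate columns after a join.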



## Renaming Column When Selecting

In [6]:
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [8]:
df.select($"age".alias("year")).show

+----+
|year|
+----+
|null|
|  30|
|  19|
+----+



## Star in Select

Notice that `*` can be used to select all columns from a table.

In [2]:
val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]



In [3]:
df.select("*").show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



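However, `df.*` does not work here: `df` is only the name of the Scala variable, not a table alias known to Spark SQL, so the qualified star cannot be resolved.
There are several alternatives.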

In [4]:
df.select("df.*").show

Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve 'df.*' give input columns '';
StackTrace:   at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:278)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:878)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:876)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLi

In [5]:
df.select($"*").show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
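
Two further alternatives to the failing `df.*` call are sketched below: give the DataFrame an explicit alias so that a qualified star has something to resolve against, or expand the column list yourself. The alias name `p`, the local session settings, and the in-memory `people` DataFrame are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("select-star-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for people.json
val people = Seq[(Option[Long], String)](
  (None, "Michael"), (Some(30L), "Andy"), (Some(19L), "Justin")
).toDF("age", "name")

// Alternative 1: register an alias, then use the qualified star
val viaAlias = people.alias("p").select("p.*")

// Alternative 2: expand the column names into Column objects explicitly
val viaColumns = people.select(people.columns.map(c => col(c)): _*)

viaAlias.show()
```

The aliased form is mostly useful after a join, where it selects all columns that came from one side.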



## Column Alias in Select CANNOT be Referenced in Select

If you define a column alias in `select`,
you cannot reference it in the same `select` call:
every expression is resolved against the input columns only (`age` and `name` here).
This is inconvenient.

In [2]:
val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]



In [4]:
import org.apache.spark.sql.functions._
df.select(
    $"age",
    lower($"name").alias("name_lower"),
    upper($"name_lower").alias("name_upper")
).show

Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve '`name_lower`' given input columns: [age, name];;
'Project [age#8L, lower(name#9) AS name_lower#19, upper('name_lower) AS name_upper#20]
+- Relation[age#8L,name#9] json

StackTrace: 'Project [age#8L, lower(name#9) AS name_lower#19, upper('name_lower) AS name_upper#20]
+- Relation[age#8L,name#9] json
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withO
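
A common way around this is to split the work into two steps, so that the alias already exists as a real column when the second expression is resolved. The sketch below shows two variants. The local session settings and the in-memory `people` DataFrame are assumptions standing in for `people.json`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lower, upper}

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for people.json
val people = Seq[(Option[Long], String)](
  (None, "Michael"), (Some(30L), "Andy"), (Some(19L), "Justin")
).toDF("age", "name")

// Variant 1: define the alias in select, then derive from it with withColumn
val twoSteps = people
  .select($"age", lower($"name").alias("name_lower"))
  .withColumn("name_upper", upper($"name_lower"))

// Variant 2: chain a second select that keeps everything and adds the new column
val nested = people
  .select($"age", lower($"name").alias("name_lower"))
  .select($"*", upper($"name_lower").alias("name_upper"))

twoSteps.show()
```

Both variants produce the columns `age`, `name_lower`, and `name_upper`.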

## Reorder Columns

In [22]:
import org.apache.spark.sql.functions._

val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]



In [5]:
df.columns.sorted(Ordering[String].reverse)

[name, age]

## Comment

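A generated collection of strings cannot be passed to `select` as `names: _*` (see the compile error at the end of this section).
Map the names to `Column` objects (e.g., with `col`) instead.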

In [26]:
df.select(
    df.columns.sorted(Ordering[String].reverse).map(c => col(c)): _*
).show

+-------+----+
|   name| age|
+-------+----+
|Michael|null|
|   Andy|  30|
| Justin|  19|
+-------+----+



In [23]:
df.select(df.columns.sorted(Ordering[String].reverse): _*)

Name: Compile Error
Message: <console>:33: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.select(df.columns.sorted(Ordering[String].reverse): _*)
                                                            ^

StackTrace: 
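
The compile error above comes from the signature of the string overload, `select(col: String, cols: String*)`: it requires one leading name before the varargs, so a whole sequence cannot be splatted into it. As a sketch, passing the head and the tail separately keeps everything as strings. The local session settings and the in-memory `people` DataFrame are assumptions standing in for `people.json`.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("reorder-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for people.json
val people = Seq[(Option[Long], String)](
  (None, "Michael"), (Some(30L), "Andy"), (Some(19L), "Justin")
).toDF("age", "name")

// Column names in reverse alphabetical order
val reversed = people.columns.sorted(Ordering[String].reverse)

// First name goes to `col`, the remaining names to the varargs `cols`
val viaHeadTail = people.select(reversed.head, reversed.tail: _*)

viaHeadTail.show()
```

This assumes the DataFrame has at least one column; `reversed.head` throws on an empty array.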