## Special Functions - col and lit

Let us understand special functions such as col and lit.

### Starting Spark Context

Let us start the Spark session for this notebook so that we can run the code provided.

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Processing Column Data").
    master("yarn").
    getOrCreate

In [None]:
spark

* First, let us create a Data Frame for demo purposes.

In [None]:
val employees = List((1, "Scott", "Tiger", 1000.0, 
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, 
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, 
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                    )

In [None]:
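// toDF needs the session implicits in scope
// (some notebook kernels import them automatically)
import spark.implicits._

val employeesDF = employees.
    toDF("employee_id", "first_name",
         "last_name", "salary",
         "nationality", "phone_number",
         "ssn"
        )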

* For Data Frame APIs such as `select`, `groupBy`, and `orderBy`, we can pass column names as strings.


In [None]:
// to use operators such as $ in place of functions like col


In [None]:
// Alternative using col function
// $ is shorthand operator for col from implicits


In [None]:
// Alternative by passing column names as strings.


In [None]:
// We have to pass all the columns either as strings or as Column type (using col or $)
// Mixing strings and Column type in the same call will not work


* If no column in a call is transformed, we can pass all the column names as strings.
* Otherwise, we need to pass all the columns as **Column** type, using the `col` function or its shorthand operator `$`.
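
As a minimal sketch of the three ways of passing columns to `select`, the example below uses a hypothetical two-row `sampleDF` rather than the full `employeesDF`. The `local[1]` master is an assumption so that the snippet runs on its own; inside this notebook `getOrCreate` returns the session that is already running.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: local[1] is used only so that the sketch is runnable on its own
val spark = SparkSession.
    builder.
    master("local[1]").
    appName("select sketch").
    getOrCreate
import spark.implicits._ // brings $ and toDF into scope

// Hypothetical sample mirroring the first two employees
val sampleDF = List((1, "Scott", "Tiger"), (2, "Henry", "Ford")).
    toDF("employee_id", "first_name", "last_name")

// All column names passed as strings
val byName = sampleDF.select("first_name", "last_name")

// All columns passed as Column type using col
val byCol = sampleDF.select(col("first_name"), col("last_name"))

// All columns passed as Column type using the $ shorthand
val byDollar = sampleDF.select($"first_name", $"last_name")

// Mixing strings and Column type in one call does not compile
// sampleDF.select("first_name", col("last_name"))

byDollar.show
```

All three calls return the same two columns; only the way the columns are referenced differs.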

In [None]:
// Passing columns as part of groupBy

In [None]:
// Passing columns as part of orderBy or sort
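
The two cells above can be illustrated with a short sketch that passes plain column names to `groupBy`, `orderBy`, and `sort`. The three-row `sampleDF` and its values are hypothetical (a nationality is repeated so that the count is visible), and `local[1]` is an assumption so that the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local[1] is used only so that the sketch is runnable on its own
val spark = SparkSession.
    builder.
    master("local[1]").
    appName("groupBy and orderBy sketch").
    getOrCreate
import spark.implicits._

// Hypothetical sample with a repeated nationality
val sampleDF = List(
    (1, "Scott", 1000.0, "India"),
    (2, "Henry", 1250.0, "India"),
    (3, "Nick", 750.0, "Australia")
).toDF("employee_id", "first_name", "salary", "nationality")

// Column name passed as a string to groupBy
val countByNationality = sampleDF.groupBy("nationality").count()

// Column names passed as strings to orderBy and sort (ascending by default)
val sortedBySalary = sampleDF.orderBy("salary")
val sortedByTwo = sampleDF.sort("nationality", "salary")

countByNationality.show
sortedBySalary.show
```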

* However, if we want to apply a transformation using functions such as `upper`, passing column names as strings to some of the functions will not suffice. We have to pass them as **Column** type.

In [None]:
import org.apache.spark.sql.functions.upper

In [None]:
// This code fails to compile: upper expects a Column, not a String
employeesDF.
    select(upper("first_name")).
    show

* `col` is the function that converts a column name from string type to **Column** type. We can also refer to columns as **Column** type using the Data Frame name, as in `employeesDF("first_name")`.


In [None]:
// Using col and upper


In [None]:
// Alternate using $ and upper


In [None]:
// Using as part of groupBy


In [None]:
// Using as part of orderBy


In [None]:
// Alternative - we can also refer column names using Data Frame like this
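
The cells above can be sketched as follows, applying `upper` to columns referenced with `col`, with `$`, and through the Data Frame itself. The three-row `sampleDF`, its mixed-case nationalities, and the aliases are hypothetical; `local[1]` is an assumption so that the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

// Assumption: local[1] is used only so that the sketch is runnable on its own
val spark = SparkSession.
    builder.
    master("local[1]").
    appName("col and upper sketch").
    getOrCreate
import spark.implicits._

// Hypothetical sample with the same nationality in different cases
val sampleDF = List(
    (1, "Scott", "india"),
    (2, "Henry", "INDIA"),
    (3, "Nick", "Australia")
).toDF("employee_id", "first_name", "nationality")

// Using col and upper
val upperNames = sampleDF.
    select(upper(col("first_name")).alias("first_name"))

// Alternative using $ and upper
val upperNamesDollar = sampleDF.
    select(upper($"first_name").alias("first_name"))

// Using as part of groupBy: case variants fall into one group
val countByNationality = sampleDF.
    groupBy(upper(col("nationality")).alias("nationality")).
    count()

// Using as part of orderBy
val sortedByNationality = sampleDF.orderBy(upper(col("nationality")))

// Referring to the column through the Data Frame
val upperNamesViaDF = sampleDF.
    select(upper(sampleDF("first_name")).alias("first_name"))

countByNationality.show
```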


* Sometimes, we want to add a literal to the column values. For example, we might want to concatenate first_name and last_name, separated by a comma and a space.

In [None]:
// The approaches below fail to compile: concat accepts only Column arguments,
// so the String literal ", " is rejected.

In [None]:
import org.apache.spark.sql.functions.concat

In [None]:
employeesDF.
    select(concat($"first_name", ", ", $"last_name")).
    show()

In [None]:
// Same as above
employeesDF.
    select(concat(col("first_name"), ", ", col("last_name"))).
    show

In [None]:
// Referring columns using Data Frame
employeesDF.
    select(concat(employeesDF("first_name"), ", ", employeesDF("last_name"))).
    show

* If we pass literals directly as string or numeric values, the call will fail. We have to convert literals to **Column** type by using the `lit` function.


In [None]:
// Using lit to use literals to derive new expressions
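
A minimal sketch of `lit` follows, wrapping the separator so that `concat` receives only **Column** arguments. The two-row `sampleDF` and the names `full_name` and `bonus_pct` are hypothetical; `local[1]` is an assumption so that the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

// Assumption: local[1] is used only so that the sketch is runnable on its own
val spark = SparkSession.
    builder.
    master("local[1]").
    appName("lit sketch").
    getOrCreate
import spark.implicits._

// Hypothetical sample mirroring the first two employees
val sampleDF = List((1, "Scott", "Tiger"), (2, "Henry", "Ford")).
    toDF("employee_id", "first_name", "last_name")

// lit turns the String literal into a Column, so concat compiles
val fullNames = sampleDF.
    select(concat($"first_name", lit(", "), $"last_name").alias("full_name"))

// Same idea using col instead of $
val fullNamesCol = sampleDF.
    select(concat(col("first_name"), lit(", "), col("last_name")).alias("full_name"))

// lit also works for numeric literals, here as a constant column
val withBonus = sampleDF.withColumn("bonus_pct", lit(10))

fullNames.show
```

Every row of `full_name` has the form `Scott, Tiger`, and `bonus_pct` holds the same value in every row.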