## expr()
PySpark `expr()` is a SQL function that evaluates a SQL-like expression string and returns it as a Column. It also lets you pass an existing DataFrame column value as an argument to PySpark built-in functions. Most commonly used SQL functions are available either on the PySpark Column class or in the built-in `pyspark.sql.functions` API. PySpark supports many other SQL functions beyond these, and to use them you have to go through `expr()`.

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("expr").getOrCreate()



### Concatenate Columns using || (similar to SQL)
In SQL, `||` concatenates the values of two string columns. Passing the same expression to `expr()` does exactly that in PySpark.

In [10]:
data=[("James","Bond"),("Scott","Varsa")] 
df=spark.createDataFrame(data).toDF("col1","col2") 
df.withColumn("Name",expr(" col1 ||','|| col2")).show()

+-----+-----+-----------+
| col1| col2|       Name|
+-----+-----+-----------+
|James| Bond| James,Bond|
|Scott|Varsa|Scott,Varsa|
+-----+-----+-----------+



### Using SQL CASE WHEN with expr()
This example passes a `CASE WHEN` expression to `expr()` inside `withColumn()`. It updates the existing column `gender` with derived values: Male for M, Female for F, and unknown for anything else.

In [11]:
data = [("James","M"),("Michael","F"),("Jen","")]
columns = ["name","gender"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()

#Using CASE WHEN similar to SQL.
df2 = df.withColumn("gender", expr("CASE WHEN gender = 'M' THEN 'Male' WHEN gender = 'F' THEN 'Female' ELSE 'unknown' END"))
df2.show()

+-------+------+
|   name|gender|
+-------+------+
|  James|     M|
|Michael|     F|
|    Jen|      |
+-------+------+

+-------+-------+
|   name| gender|
+-------+-------+
|  James|   Male|
|Michael| Female|
|    Jen|unknown|
+-------+-------+



### Using an Existing Column Value for Expression
This example adds a number of months taken from an existing column instead of a Python constant.

In [12]:
data=[("2019-01-23",1),("2019-06-24",2),("2019-09-20",3)] 
df=spark.createDataFrame(data).toDF("date","increment") 
df.show()

#Add Month value from another column
df.select(df.date,df.increment, expr("add_months(date,increment)").alias("inc_date")).show()

+----------+---------+
|      date|increment|
+----------+---------+
|2019-01-23|        1|
|2019-06-24|        2|
|2019-09-20|        3|
+----------+---------+

+----------+---------+----------+
|      date|increment|  inc_date|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+



You can also use SQL-like syntax to give the column expression an alias.

In [13]:
df.select(df.date,df.increment, expr("""add_months(date,increment) as inc_date""")).show()

+----------+---------+----------+
|      date|increment|  inc_date|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+



### Cast Function with expr()
This example converts the `increment` column from long type to string type.

In [14]:
df.printSchema()
df.select("date",expr("cast(increment as string) as str_increment")).printSchema()


root
 |-- date: string (nullable = true)
 |-- increment: long (nullable = true)

root
 |-- date: string (nullable = true)
 |-- str_increment: string (nullable = true)



### Arithmetic operations
`expr()` can also perform arithmetic operations. The example below adds 5 to `increment` and creates a new column `new_increment`.

In [15]:
df.show()
df.select(df.date,df.increment, expr("increment + 5 as new_increment")).show()

+----------+---------+
|      date|increment|
+----------+---------+
|2019-01-23|        1|
|2019-06-24|        2|
|2019-09-20|        3|
+----------+---------+

+----------+---------+-------------+
|      date|increment|new_increment|
+----------+---------+-------------+
|2019-01-23|        1|            6|
|2019-06-24|        2|            7|
|2019-09-20|        3|            8|
+----------+---------+-------------+



### Using Filter with expr()
DataFrame rows can also be filtered using an `expr()` expression.

In [16]:
data=[(100,2),(200,3000),(500,500)] 
df=spark.createDataFrame(data).toDF("col1","col2") 
df.show()
df.filter(expr("col1 == col2")).show()

+----+----+
|col1|col2|
+----+----+
| 100|   2|
| 200|3000|
| 500| 500|
+----+----+

+----+----+
|col1|col2|
+----+----+
| 500| 500|
+----+----+

