- Title: Column Functions and Operators in Spark
- Slug: pyspark-func-operators
- Date: 2021-04-26 10:38:08
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, column, functions, operators, func, fun
- Author: Ben Du
- Modified: 2021-12-08 20:43:28


In [1]:
from typing import List, Tuple
import pandas as pd

In [2]:
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
#findspark.init("/opt/spark-2.3.0-bin-hadoop2.7")

from pyspark.sql import SparkSession, DataFrame
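# Import only what the Python cells use; note that this `hash` shadows Python's built-in hash.
from pyspark.sql.functions import col, lit, hash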
from pyspark.sql.types import IntegerType, StringType, StructType, StructField, ArrayType

spark = SparkSession.builder.appName("PySpark_Str_Func") \
    .enableHiveSupport().getOrCreate()

In [3]:
df = spark.createDataFrame(
    pd.DataFrame(
        data=[([1, 2], "how", 1), ([2, 3], "are", 2), ([3, 4], "you", 3)],
        columns=["col1", "col2", "col3"]
    )
)
df.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+




## [Boolean Operators and Functions](http://www.legendu.net/misc/blog/boolean-column-operators-and-functions-in-spark)

Please refer to
[Boolean Operators and Functions](http://www.legendu.net/misc/blog/boolean-column-operators-and-functions-in-spark)
for details.

## [Rounding Functions](http://www.legendu.net/misc/blog/pyspark-dataframe-func-rounding)

Please refer to 
[Rounding Functions in Spark](http://www.legendu.net/misc/blog/pyspark-dataframe-func-rounding)
for details.

## [String Functions](http://www.legendu.net/misc/blog/pyspark-dataframe-func-string)

Please refer to 
[String Functions in Spark](http://www.legendu.net/misc/blog/pyspark-dataframe-func-string)
for details.

## [Statistical Functions](http://www.legendu.net/misc/blog/pyspark-stat-functions)

Please refer to
[Statistical Functions in Spark](http://www.legendu.net/misc/blog/pyspark-stat-functions)
for details.

## [Date Functions in Spark](http://www.legendu.net/misc/blog/pyspark-dataframe-func-date)

Please refer to 
[Date Functions in Spark](http://www.legendu.net/misc/blog/pyspark-dataframe-func-date)
for details.

## [Window Functions in Spark](http://www.legendu.net/misc/blog/window-functions-in-spark)

Please refer to 
[Window Functions in Spark](http://www.legendu.net/misc/blog/window-functions-in-spark)
for details.

## [Collection Functions](http://www.legendu.net/misc/blog/pyspark-dataframe-func-collection)

Please refer to
[Collection Functions](http://www.legendu.net/misc/blog/pyspark-dataframe-func-collection)
for details.


## between

In [7]:
df.filter(col("col2").between("hoa", "hox")).show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
+------+----+----+



In [8]:
df.filter(col("col3").between(2, 3)).show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



## cast

In [12]:
df2 = df.select(
    col("col1"),
    col("col2"),
    col("col3").astype(StringType())
)
df2.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



In [13]:
df2.schema

StructType(List(StructField(col1,ArrayType(LongType,true),true),StructField(col2,StringType,true),StructField(col3,StringType,true)))

In [15]:
df3 = df2.select(
    col("col1"),
    col("col2"),
    col("col3").cast(IntegerType())
)
df3.show()

+------+----+----+
|  col1|col2|col3|
+------+----+----+
|[1, 2]| how|   1|
|[2, 3]| are|   2|
|[3, 4]| you|   3|
+------+----+----+



In [16]:
df3.schema

StructType(List(StructField(col1,ArrayType(LongType,true),true),StructField(col2,StringType,true),StructField(col3,IntegerType,true)))

## lit

In [4]:
x = lit(1)

In [5]:
type(x)

pyspark.sql.column.Column

## hash

In [7]:
df.withColumn("hash_code", hash("col2")).show()

+------+----+----+-----------+
|  col1|col2|col3|  hash_code|
+------+----+----+-----------+
|[1, 2]| how|   1|-1205091763|
|[2, 3]| are|   2| -422146862|
|[3, 4]| you|   3| -315368575|
+------+----+----+-----------+



## when

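A `null` condition in `when` is treated as `false`: the row falls through to the next `when` branch, or to `otherwise` if there is one. The cells in this section are written in Scala.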

In [1]:
import org.apache.spark.sql.functions._

val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



df = [age: bigint, name: string]



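Michael's `age` is `null`, so `age > 20` and `age <= 20` both evaluate to `null`, which `when` treats as `false`. He therefore gets the `otherwise` value in both cells below.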

In [3]:
df.select(when($"age" > 20, 1).otherwise(0).alias("gt20")).show

+----+
|gt20|
+----+
|   0|
|   1|
|   0|
+----+



In [5]:
df.select(when($"age" <= 20, 1).otherwise(0).alias("le20")).show

+----+
|le20|
+----+
|   0|
|   0|
|   1|
+----+



In [6]:
df.select(when($"age".isNull, 0).when($"age" > 20 , 100).otherwise(10).alias("age")).show

+---+
|age|
+---+
|  0|
|100|
| 10|
+---+



Without `otherwise`, rows that match no `when` branch get `null`.

In [7]:
df.select(when($"age".isNull, 0).alias("age")).show

+----+
| age|
+----+
|   0|
|null|
|null|
+----+



## References

[Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)

[Spark Scala Functions](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html)

[Spark Java API: Dataset](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html)

[Spark Java API: functions](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html)

[Spark Java API: Row](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html)