- Author: Ben Du
- Date: 2020-09-05 14:56:47
- Title: String Functions in Spark
- Slug: pyspark-func-string
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, string, round, Spark SQL, functions
- Modified: 2021-10-07 09:48:12


# Tips and Traps

1. You can use the `split` function to split a delimited string into an array.
    It is suggested that you remove trailing separators before applying the `split` function.
    Please refer to the `split` section below for a more detailed discussion.

1. Some string functions (e.g., `right`) are available in Spark SQL
    but not as Spark DataFrame API functions.

2. Notice that the functions `trim`/`rtrim`/`ltrim` behave a little counter-intuitively.
    First,
    they trim only spaces by default rather than all whitespace characters.
    Second,
    when the characters to trim are passed explicitly,
    the 1st parameter is the set of characters to trim
    and the 2nd parameter is the string to trim them from.

2. `instr` and `locate` behave similarly
    except that their parameters are reversed.
    
2. Notice that the DataFrame method `replace` (`DataFrame.replace` / `df.na.replace`)
    replaces whole values in a column, NOT substrings inside each string element.
    To replace a substring with another one inside a string,
    you have to use either `regexp_replace` or `translate`.
    
6. The operator `+` does not work as concatenation for string columns.
    You have to use the function `concat` instead.

In [59]:
import re

In [65]:
re.search("\\s", "nima ")

<re.Match object; span=(4, 5), match=' '>

In [66]:
s = "\s"

In [70]:
"\s\\s"

'\\s\\s'

In [69]:
"\s" == "\\s"

True

In [71]:
"\n" == "\\n"

False

In [73]:
"\\n"

'\\n'

In [72]:
"\n"

'\n'
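
The cells above probe how Python itself treats backslashes before a pattern ever reaches `re` or Spark. As a minimal sketch in plain Python (no Spark session needed): `\s` is not a recognized string escape, so Python keeps the backslash (newer Python versions warn about it), while `\n` is recognized and becomes a single newline character. Raw strings avoid the ambiguity.

```python
import re

# "\\s" and r"\s" are the same two-character string: a backslash followed by "s".
escaped = "\\s"
raw = r"\s"
same_pattern = escaped == raw

# "\n" is a recognized escape (one newline character);
# "\\n" is two characters: a backslash followed by "n".
newline = "\n"
backslash_n = "\\n"
print(len(newline), len(backslash_n))

# The whitespace pattern matches the trailing space in "nima ".
match = re.search(raw, "nima ")
print(same_pattern, match.span())
```

Writing regular expressions as raw strings (`r"..."`) keeps the pattern text identical to what the regex engine receives.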

In [1]:
import pandas as pd

In [2]:
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Str_Func") \
    .enableHiveSupport().getOrCreate()



In [20]:
df = spark.createDataFrame(
    pd.DataFrame(
        data=[
            ("2017/01/01", 1), 
            ("2017/02/01", 2),
            ("2018/02/05", 3),
            (None, 4),
            ("how \t", 5),
        ], columns=["col1", "col2"]
    )
)
df.show()

+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
|      null|   4|
|     how 	|   5|
+----------+----+



## [ascii](https://spark.apache.org/docs/latest/api/sql/index.html#ascii)

## [base64](https://spark.apache.org/docs/latest/api/sql/index.html#base64)

## [bin](https://spark.apache.org/docs/latest/api/sql/index.html#bin)

## [bit_length](https://spark.apache.org/docs/latest/api/sql/index.html#bit_length)

## [char](https://spark.apache.org/docs/latest/api/sql/index.html#char)

## [char_length](https://spark.apache.org/docs/latest/api/sql/index.html#char_length)

## [character_length](https://spark.apache.org/docs/latest/api/sql/index.html#character_length)

## [chr](https://spark.apache.org/docs/latest/api/sql/index.html#chr)

## [coalesce](https://spark.apache.org/docs/latest/api/sql/index.html#coalesce)

## [concat](https://spark.apache.org/docs/latest/api/sql/index.html#concat)

The `+` operator does not work as concatenation for 2 string columns.

In [5]:
df.withColumn("col", col("date") + col("month")).show()

+----------+-----+----+
|      date|month| col|
+----------+-----+----+
|2017/01/01|    1|null|
|2017/02/01|    2|null|
+----------+-----+----+



The function `concat` concatenates two or more columns.

In [6]:
df.withColumn("col", concat(col("date"), col("month"))).show()

+----------+-----+-----------+
|      date|month|        col|
+----------+-----+-----------+
|2017/01/01|    1|2017/01/011|
|2017/02/01|    2|2017/02/012|
+----------+-----+-----------+



In [7]:
df.withColumn("col", concat(col("date"), lit("_"), col("month"))).show()

+----------+-----+------------+
|      date|month|         col|
+----------+-----+------------+
|2017/01/01|    1|2017/01/01_1|
|2017/02/01|    2|2017/02/01_2|
+----------+-----+------------+



## [concat_ws](https://spark.apache.org/docs/latest/api/sql/index.html#concat_ws)

## [decode](https://spark.apache.org/docs/latest/api/sql/index.html#decode)

## [encode](https://spark.apache.org/docs/latest/api/sql/index.html#encode)

## [format_string](https://spark.apache.org/docs/latest/api/sql/index.html#format_string)

## [hash](https://spark.apache.org/docs/latest/api/sql/index.html#hash)

## [hex](https://spark.apache.org/docs/latest/api/sql/index.html#hex)

## [initcap](https://spark.apache.org/docs/latest/api/sql/index.html#initcap)

## [input_file_name](https://spark.apache.org/docs/latest/api/sql/index.html#input_file_name)

## [instr](https://spark.apache.org/docs/latest/api/sql/index.html#instr)

`instr` behaves similarly to `locate` except that their parameters are reversed: `instr(str, substr)` versus `locate(substr, str[, pos])`.

In [8]:
spark.sql("""
    select instr("abcd", "ab") as index
    """).show()

+-----+
|index|
+-----+
|    1|
+-----+



In [9]:
spark.sql("""
    select instr("abcd", "AB") as index
    """).show()

+-----+
|index|
+-----+
|    0|
+-----+



## [lcase](https://spark.apache.org/docs/latest/api/sql/index.html#lcase)

## [left](https://spark.apache.org/docs/latest/api/sql/index.html#left)

In [6]:
spark.sql("""
    select 
        left("how are you doing?", 7) as phrase
    """).show()

+-------+
| phrase|
+-------+
|how are|
+-------+



## [length](https://spark.apache.org/docs/latest/api/sql/index.html#length)

In [18]:
val df = Seq(
    ("2017", 1),
    ("2017/02", 2),
    ("2018/02/05", 3),
    (null, 4)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|      2017|    1|
|   2017/02|    2|
|2018/02/05|    3|
|      null|    4|
+----------+-----+




In [19]:
import org.apache.spark.sql.functions.length

df.select($"date", length($"date")).show

+----------+------------+
|      date|length(date)|
+----------+------------+
|      2017|           4|
|   2017/02|           7|
|2018/02/05|          10|
|      null|        null|
+----------+------------+




## [like](https://spark.apache.org/docs/latest/api/sql/index.html#like)

## [lpad](https://spark.apache.org/docs/latest/api/sql/index.html#lpad)

## [ltrim](https://spark.apache.org/docs/latest/api/sql/index.html#ltrim)

Notice that the functions `trim`/`rtrim`/`ltrim` behave a little counter-intuitively.
First,
they trim only spaces by default rather than all whitespace characters.
Second,
when the characters to trim are passed explicitly,
the 1st parameter is the set of characters to trim
and the 2nd parameter is the string to trim them from.
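
Python's `str.lstrip`/`str.rstrip` follow the same "set of characters" rule as the two-argument form described above, so they can serve as a rough, Spark-free analogy. This is a plain-Python sketch, not Spark code: the helper names `spark_like_ltrim` and `spark_like_rtrim` are hypothetical, and because Python's no-argument `strip()` removes all whitespace, the spaces-only default is mimicked by passing `" "` explicitly.

```python
from typing import Optional


def spark_like_ltrim(trim_chars: str, s: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the argument order of ltrim(trimStr, str)."""
    # Each character in trim_chars is trimmed individually, not as a substring.
    return None if s is None else s.lstrip(trim_chars)


def spark_like_rtrim(trim_chars: str, s: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the argument order of rtrim(trimStr, str)."""
    return None if s is None else s.rstrip(trim_chars)


# Default behavior modeled as "spaces only": the tab survives.
spaces_only = spark_like_rtrim(" ", "abcd\t ")
# Passing both space and tab removes the tab as well.
spaces_and_tabs = spark_like_rtrim(" \t", "abcd\t ")
# "a " is a set of characters, so leading "a"s and spaces are all stripped.
leading = spark_like_ltrim("a ", "a a abcd")

print(repr(spaces_only), repr(spaces_and_tabs), repr(leading))
```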

In [7]:
spark.sql("""
    select ltrim("a ", "a a abcd") as after_ltrim
""").show()

+-----------+
|after_ltrim|
+-----------+
|        bcd|
+-----------+



## [locate](https://spark.apache.org/docs/latest/api/sql/index.html#locate)

`locate` behaves similarly to `instr` except that their parameters are reversed: `locate(substr, str[, pos])` versus `instr(str, substr)`.
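
To make the reversed argument order concrete, here is a small plain-Python model of the two functions (not Spark code). The names `instr_model` and `locate_model` are hypothetical; the sketch assumes non-null inputs and `pos >= 1`, and relies on both functions returning a 1-based position, or 0 when the substring is absent.

```python
def instr_model(s: str, substr: str) -> int:
    """Hypothetical model of instr(str, substr): 1-based position, 0 if absent."""
    return s.find(substr) + 1


def locate_model(substr: str, s: str, pos: int = 1) -> int:
    """Hypothetical model of locate(substr, str[, pos]); pos is a 1-based start."""
    return s.find(substr, pos - 1) + 1


found = instr_model("abcd", "ab")
not_found = instr_model("abcd", "AB")  # The search is case-sensitive.
same_via_locate = locate_model("ab", "abcd")
from_position = locate_model("bar", "foobarbar", 5)

print(found, not_found, same_via_locate, from_position)
```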

In [8]:
// Scala cell: this demonstrates `translate` (each "/" becomes "-"), not `locate`.
df.withColumn("date", translate($"date", "/", "-")).show

+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+




## [md5](https://spark.apache.org/docs/latest/api/sql/index.html#md5)

## [octet_length](https://spark.apache.org/docs/latest/api/sql/index.html#octet_length)

## [parse_url](https://spark.apache.org/docs/latest/api/sql/index.html#parse_url)

## [position](https://spark.apache.org/docs/latest/api/sql/index.html#position)

## [printf](https://spark.apache.org/docs/latest/api/sql/index.html#printf)

## [regexp_extract](https://spark.apache.org/docs/latest/api/sql/index.html#regexp_extract)

```
public static Column regexp_extract(Column e, String exp, int groupIdx)
```

## [regexp_extract_all](https://spark.apache.org/docs/latest/api/sql/index.html#regexp_extract_all)

## [regexp_replace](https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace)

In [9]:
df.withColumn("date", regexp_replace(col("date"), "/", "-")).show()

+----------+-----+
|      date|month|
+----------+-----+
|2017-01-01|    1|
|2017-02-01|    2|
+----------+-----+



## [repeat](https://spark.apache.org/docs/latest/api/sql/index.html#repeat)

## [replace](https://spark.apache.org/docs/latest/api/sql/index.html#replace)

## [reverse](https://spark.apache.org/docs/latest/api/sql/index.html#reverse)

## [right](https://spark.apache.org/docs/latest/api/sql/index.html#right)

In [14]:
spark.sql("""
    select right("abcdefg", 3) 
""").show()

+-------------------+
|right('abcdefg', 3)|
+-------------------+
|                efg|
+-------------------+



## [rlike](https://spark.apache.org/docs/latest/api/sql/index.html#rlike)

In [21]:
df.show()

+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
|      null|   4|
|     how 	|   5|
+----------+----+



In [23]:
df.filter(col("col1").rlike("\\d{4}/02/\\d{2}")).show()

+----------+----+
|      col1|col2|
+----------+----+
|2017/02/01|   2|
|2018/02/05|   3|
+----------+----+



In [51]:
df.filter(col("col1").rlike(r"\s")).show()

+-----+----+
| col1|col2|
+-----+----+
|how 	|   5|
+-----+----+



In [37]:
df.createOrReplaceTempView("t1")

In [52]:
spark.sql(r"""
    select 
        *
    from 
        t1 
    where
        col1 rlike '\\d'
    """).show()

+----------+----+
|      col1|col2|
+----------+----+
|2017/01/01|   1|
|2017/02/01|   2|
|2018/02/05|   3|
+----------+----+



## [rpad](https://spark.apache.org/docs/latest/api/sql/index.html#rpad)

## [rtrim](https://spark.apache.org/docs/latest/api/sql/index.html#rtrim)

Notice that the functions `trim`/`rtrim`/`ltrim` behave a little counter-intuitively.
First,
they trim only spaces by default rather than all whitespace characters.
Second,
when the characters to trim are passed explicitly,
the 1st parameter is the set of characters to trim
and the 2nd parameter is the string to trim them from.

In [7]:
spark.sql("""
    select rtrim("abcd\t ") as after_trim
""").show()

+----------+
|after_trim|
+----------+
|     abcd	|
+----------+



In [6]:
spark.sql("""
    select rtrim(" \t", "abcd\t ") as after_trim
""").show()

+----------+
|after_trim|
+----------+
|      abcd|
+----------+



21/10/04 20:32:27 WARN Analyzer$ResolveFunctions: Two-parameter TRIM/LTRIM/RTRIM function signatures are deprecated. Use SQL syntax `TRIM((BOTH | LEADING | TRAILING)? trimStr FROM str)` instead.


In [8]:
spark.sql("""
    select rtrim("a ", "a a abcda a a") as after_ltrim
""").show()

+-----------+
|after_ltrim|
+-----------+
|   a a abcd|
+-----------+



## [sentences](https://spark.apache.org/docs/latest/api/sql/index.html#sentences)

## [sha](https://spark.apache.org/docs/latest/api/sql/index.html#sha)

## [sha1](https://spark.apache.org/docs/latest/api/sql/index.html#sha1)

## [sha2](https://spark.apache.org/docs/latest/api/sql/index.html#sha2)

## [split](https://spark.apache.org/docs/latest/api/sql/index.html#split)

If there is a trailing separator,
then an empty string is generated at the end of the array.
It is suggested that you get rid of the trailing separator
before applying `split`
to avoid generating unnecessary empty strings.
The benefit of doing this is 2-fold.

1. It avoids generating unneeded data (empty strings).
2. Too many empty strings can cause serious data skew issues
    if the corresponding column is used for joining with another table.
    By not generating those empty strings in the first place,
    we avoid potential Spark issues from the beginning.
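
The effect of a trailing separator can be reproduced without Spark. The sketch below uses Python's regex-based `re.split` as a stand-in for `split`; it is only an analogy. The constant `SEP` and the variable names are made up for this example, and in Spark itself the cleanup step would be done with a column expression such as `regexp_replace` before calling `split`.

```python
import re

SEP = ";"  # Hypothetical separator used only in this sketch.
raw_value = "ab;cd;ef;"

# Splitting as-is leaves an empty string after the trailing separator.
with_trailing = re.split(SEP, raw_value)

# Removing trailing separators first avoids the empty element.
cleaned = re.split(SEP, raw_value.rstrip(SEP))

print(with_trailing)
print(cleaned)
```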

In [26]:
spark.sql("""
    select split("ab;cd;ef", ";") as elements
""").show()

+------------+
|    elements|
+------------+
|[ab, cd, ef]|
+------------+



In [27]:
spark.sql("""
    select split("ab;cd;ef;", ";") as elements
""").show()

+--------------+
|      elements|
+--------------+
|[ab, cd, ef, ]|
+--------------+



## [string](https://spark.apache.org/docs/latest/api/sql/index.html#string)

## [substr](https://spark.apache.org/docs/latest/api/sql/index.html#substr)

## [substring](https://spark.apache.org/docs/latest/api/sql/index.html#substring)

1. Uses 1-based index.

2. `substring` on `null` returns `null`.
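
The two notes above can be modeled in a few lines of plain Python (not Spark code). The helper `substring_model` is hypothetical and only covers the simple case of `pos >= 1` and a non-negative length; `None` stands in for `null`.

```python
from typing import Optional


def substring_model(s: Optional[str], pos: int, length: int) -> Optional[str]:
    """Hypothetical model of substring(str, pos, len) for pos >= 1."""
    if s is None:
        # A null input yields a null result.
        return None
    start = pos - 1  # Convert the 1-based position to Python's 0-based index.
    return s[start:start + length]


year = substring_model("2017/01/01", 1, 4)
month = substring_model("2017/01/01", 6, 2)
day = substring_model("2017/01/01", 9, 2)
missing = substring_model(None, 1, 4)

print(year, month, day, missing)
```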

In [9]:
import org.apache.spark.sql.functions._

val df = Seq(
    ("2017/01/01", 1),
    ("2017/02/01", 2),
    (null, 3)
).toDF("date", "month")
df.show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|    1|
|2017/02/01|    2|
|      null|    3|
+----------+-----+




In [10]:
df.withColumn("year", substring($"date", 1, 4)).show

+----------+-----+----+
|      date|month|year|
+----------+-----+----+
|2017/01/01|    1|2017|
|2017/02/01|    2|2017|
|      null|    3|null|
+----------+-----+----+




In [11]:
df.withColumn("month", substring($"date", 6, 2)).show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   02|
|      null| null|
+----------+-----+




In [12]:
df.withColumn("month", substring($"date", 9, 2)).show

+----------+-----+
|      date|month|
+----------+-----+
|2017/01/01|   01|
|2017/02/01|   01|
|      null| null|
+----------+-----+




## [substring_index](https://spark.apache.org/docs/latest/api/sql/index.html#substring_index)

## [translate](https://spark.apache.org/docs/latest/api/sql/index.html#translate)

Notice that `translate` is different from the usual replacement: it maps individual characters one to one instead of replacing substrings.
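
Python's `str.translate` works on the same per-character principle, so it gives a quick, Spark-free way to see how character mapping differs from pattern replacement. This is a plain-Python analogy rather than Spark code: `re.sub` plays the role of `regexp_replace`, and the sample strings are made up for illustration.

```python
import re

# Per-character mapping: every "a" becomes "b" and every "b" becomes "a".
swap = str.maketrans("ab", "ba")
translated = "abba".translate(swap)

# Pattern replacement: each occurrence of the substring "ab" becomes "ba".
replaced = re.sub("ab", "ba", "abba")

# With a single-character mapping the two approaches give the same result.
date_translated = "2017/01/01".translate(str.maketrans("/", "-"))
date_replaced = re.sub("/", "-", "2017/01/01")

print(translated, replaced, date_translated, date_replaced)
```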

## [trim](https://spark.apache.org/docs/latest/api/sql/index.html#trim)

Notice that the functions `trim`/`rtrim`/`ltrim` behave a little counter-intuitively.
First,
they trim only spaces by default rather than all whitespace characters.
Second,
when the characters to trim are passed explicitly,
the 1st parameter is the set of characters to trim
and the 2nd parameter is the string to trim them from.

In [23]:
spark.sql("""
    select trim("abcd\t  ") as after_trim
""").show()

+----------+
|after_trim|
+----------+
|     abcd	|
+----------+



In [21]:
spark.sql("""
    select trim(" \t", "abcd\t ") as after_trim
""").show()

+----------+
|after_trim|
+----------+
|      abcd|
+----------+



## [trunc](https://spark.apache.org/docs/latest/api/sql/index.html#trunc)

## [ucase](https://spark.apache.org/docs/latest/api/sql/index.html#ucase)

## [unbase64](https://spark.apache.org/docs/latest/api/sql/index.html#unbase64)

## [unhex](https://spark.apache.org/docs/latest/api/sql/index.html#unhex)

## [upper](https://spark.apache.org/docs/latest/api/sql/index.html#upper)

## [uuid](https://spark.apache.org/docs/latest/api/sql/index.html#uuid)

## [xxhash64](https://spark.apache.org/docs/latest/api/sql/index.html#xxhash64)

## References 

[Spark Scala Functions](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html)

[Spark SQL Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)

https://obstkel.com/spark-sql-functions

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html