## regexp_replace() Replace Column Values in DataFrame
This section shows how to replace part of a string with another string, apply a replacement to all columns, change values conditionally, replace values using a Python dictionary, and replace a column value with a value from another DataFrame column.

Let's create a PySpark DataFrame with some addresses and use it to explain how to replace column values.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("regexp_replace").getOrCreate()

address = [(1,"14851 Jeffrey Rd","DE"),
           (2,"43421 Margarita St","NY"),
           (3,"13111 Siemon Ave","CA")]
           
df =spark.createDataFrame(address,["id","address","state"])
df.show()

+---+------------------+-----+
| id|           address|state|
+---+------------------+-----+
|  1|  14851 Jeffrey Rd|   DE|
|  2|43421 Margarita St|   NY|
|  3|  13111 Siemon Ave|   CA|
+---+------------------+-----+



### Replace String Column Values
With the PySpark SQL function `regexp_replace()` you can replace a substring of a column value with another string. `regexp_replace()` uses Java regex for matching; if the regex does not match, the value is returned unchanged. The example below replaces the street suffix `Rd` with `Road` in the `address` column.

In [2]:
#Replace part of string with another string
from pyspark.sql.functions import regexp_replace

df.withColumn('address', regexp_replace('address', 'Rd', 'Road')).show(truncate=False)

+---+------------------+-----+
|id |address           |state|
+---+------------------+-----+
|1  |14851 Jeffrey Road|DE   |
|2  |43421 Margarita St|NY   |
|3  |13111 Siemon Ave  |CA   |
+---+------------------+-----+



### Replace Column Values Conditionally
We just replaced `Rd` with `Road`, but did not replace `St` and `Ave`. Let's see how to replace column values conditionally in a PySpark DataFrame by using the `when().otherwise()` SQL condition function.

In [3]:
from pyspark.sql.functions import when
df.withColumn('address', 
    when(df.address.endswith('Rd'),regexp_replace(df.address,'Rd','Road')) \
   .when(df.address.endswith('St'),regexp_replace(df.address,'St','Street')) \
   .when(df.address.endswith('Ave'),regexp_replace(df.address,'Ave','Avenue')) \
   .otherwise(df.address)) \
   .show(truncate=False)

+---+----------------------+-----+
|id |address               |state|
+---+----------------------+-----+
|1  |14851 Jeffrey Road    |DE   |
|2  |43421 Margarita Street|NY   |
|3  |13111 Siemon Avenue   |CA   |
+---+----------------------+-----+



### Replace Column Value with Dictionary (map)
We can replace column values using a Python dictionary (map). Below, we replace the abbreviated value of the `state` column with the full state name from a dictionary key-value pair. To do so, I use the RDD `map()` transformation to loop through each row of the DataFrame.

In [4]:
stateDic={'CA':'California','NY':'New York','DE':'Delaware'}

df2=df.rdd.map(lambda x: (x.id,x.address,stateDic[x.state]) ).toDF(["id","address","state"])
df2.show()

+---+------------------+----------+
| id|           address|     state|
+---+------------------+----------+
|  1|  14851 Jeffrey Rd|  Delaware|
|  2|43421 Margarita St|  New York|
|  3|  13111 Siemon Ave|California|
+---+------------------+----------+



### Replace Column Value Character by Character
By using the `translate()` string function you can *replace a DataFrame column value character by character*. In the example below, every `1` is replaced with `A`, every `2` with `B`, and every `3` with `C` in the `address` column.

In [5]:
#Using translate to replace character by character
from pyspark.sql.functions import translate

df.withColumn('address', translate('address', '123', 'ABC')).show(truncate=False)

+---+------------------+-----+
|id |address           |state|
+---+------------------+-----+
|1  |A485A Jeffrey Rd  |DE   |
|2  |4C4BA Margarita St|NY   |
|3  |ACAAA Siemon Ave  |CA   |
+---+------------------+-----+



### Replace Column with Another Column Value
By using `expr()` and `regexp_replace()` you can replace a column value with a value from another DataFrame column. In the example below, we match the value of `col2` in `col1` and replace it with the value of `col3` to create `new_column`. `expr()` accepts SQL-like expressions, which lets the pattern and the replacement refer to other columns.

In [6]:
#Replace column with another column
from pyspark.sql.functions import expr
from termcolor import cprint

cprint("--- df", 'blue')
df = spark.createDataFrame([("ABCDE_XYZ", "XYZ","FGH")], ("col1", "col2","col3"))
df.show()
cprint('--- df.withColumn("new_column", expr("regexp_replace(col1, col2, col3)")', 'blue')
df.withColumn("new_column", expr("regexp_replace(col1, col2, col3)")).show()


--- df
+---------+----+----+
|     col1|col2|col3|
+---------+----+----+
|ABCDE_XYZ| XYZ| FGH|
+---------+----+----+

--- df.withColumn("new_column", expr("regexp_replace(col1, col2, col3)")
+---------+----+----+----------+
|     col1|col2|col3|new_column|
+---------+----+----+----------+
|ABCDE_XYZ| XYZ| FGH| ABCDE_FGH|
+---------+----+----+----------+



### Replace Empty Value With None/null on DataFrame
In a PySpark DataFrame, use the `when().otherwise()` SQL functions to check whether a column has an empty value, and the `withColumn()` transformation to replace the value of an existing column.

Note: In a PySpark DataFrame, `None` values are shown as `null` (printed as `NULL` in the outputs below).

First, create a DataFrame with empty values in some rows:

In [7]:
from pyspark.sql import SparkSession

data = [("","CA"), ("Julia",""),("Robert",""),("","NJ")]
df =spark.createDataFrame(data,["name","state"])
df.show()

+------+-----+
|  name|state|
+------+-----+
|      |   CA|
| Julia|     |
|Robert|     |
|      |   NJ|
+------+-----+



#### Replace Empty Value with None
To replace an empty value with `None`/`null` in a single DataFrame column, you can use `withColumn()` together with `when().otherwise()`.

In [8]:
# Replace empty string with None value
from pyspark.sql.functions import col,when

df.withColumn("name", when(col("name")=="", None).otherwise(col("name"))).show()

+------+-----+
|  name|state|
+------+-----+
|  NULL|   CA|
| Julia|     |
|Robert|     |
|  NULL|   NJ|
+------+-----+



#### Replace Empty Value with None on All DataFrame Columns
To replace empty values with `None`/`null` in all DataFrame columns, use `df.columns` to get the column names and loop through them, applying the condition to each column.

In [9]:
# Replace empty string with None for all columns
from pyspark.sql.functions import col,when

df2 = df.select([when(col(c)=="",None).otherwise(col(c)).alias(c) for c in df.columns])
df2.show()

+------+-----+
|  name|state|
+------+-----+
|  NULL|   CA|
| Julia| NULL|
|Robert| NULL|
|  NULL|   NJ|
+------+-----+



#### Replace Empty Value with None on Selected Columns
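You can also restrict the replacement to a selected list of columns: put the columns you want to change in a list and use it in the same expression as above. Note that `select()` returns only the listed columns, so any column left out of the list is dropped from the result.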

In [10]:
# Replace empty string with None on selected columns
from pyspark.sql.functions import col,when

replaceCols = ["name","state"]

df2 = df.select([when(col(c)=="",None).otherwise(col(c)).alias(c) for c in replaceCols])
df2.show()

+------+-----+
|  name|state|
+------+-----+
|  NULL|   CA|
| Julia| NULL|
|Robert| NULL|
|  NULL|   NJ|
+------+-----+

