## Concatenate DataFrame columns
Using the `concat()` or `concat_ws()` Spark SQL functions, we can concatenate two or more DataFrame columns into a single column.

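## Test Data and DataFrame
Note that `spark` is an instance of SparkSession. In PySpark we build the test DataFrame by passing a list of tuples and the column names to `spark.createDataFrame()`; importing implicits in order to call toDF() on a Seq collection is only needed in the Scala API.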

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('concat').getOrCreate()

data = [("Jason","B","Williams","2017","M",4000),
        ("Maria","Rose","Jones","2009","M",4500),
        ("Rob","J","Smith","2009","M",5000),
        ("Gloria","Mery","Jones","2005","F",5500),
        ("Sean","K","Brown","2006","",-1)
       ]

columns = ["fname","mname","lname","dob_year","gender","salary"]

df = spark.createDataFrame(data=data, schema = columns)

df.show(truncate=False)

+------+-----+--------+--------+------+------+
|fname |mname|lname   |dob_year|gender|salary|
+------+-----+--------+--------+------+------+
|Jason |B    |Williams|2017    |M     |4000  |
|Maria |Rose |Jones   |2009    |M     |4500  |
|Rob   |J    |Smith   |2009    |M     |5000  |
|Gloria|Mery |Jones   |2005    |F     |5500  |
|Sean  |K    |Brown   |2006    |      |-1    |
+------+-----+--------+--------+------+------+



### Using concat() Function to Concatenate DataFrame Columns
Spark SQL functions provide `concat()` to concatenate two or more DataFrame columns into a single Column.

Syntax

    concat(*cols)

`concat()` can also take columns of different data types. For example, it supports string, integer, and boolean columns (non-string values are cast to strings), and it can also concatenate array columns.

The statement below creates a `FullName` column by concatenating the `fname`, `mname`, and `lname` columns, separated by a comma and a space. To add the delimiter, we use the `lit()` function. The output contains only the concatenated column.

In [20]:
from pyspark.sql.functions import concat, lit

df.select(concat(col("fname"),lit(', '),col("mname"),lit(', '),col("lname")).alias("FullName")).show(truncate=False)

+-------------------+
|FullName           |
+-------------------+
|Jason, B, Williams |
|Maria, Rose, Jones |
|Rob, J, Smith      |
|Gloria, Mery, Jones|
|Sean, K, Brown     |
+-------------------+



### concat() Function on withColumn()
Here we add a new column `FullName` by concatenating the name columns.

In [21]:
df.withColumn("FullName",concat(col("fname"),lit(','), col("mname"),lit(','),col("lname"))).show(truncate=False)

+------+-----+--------+--------+------+------+-----------------+
|fname |mname|lname   |dob_year|gender|salary|FullName         |
+------+-----+--------+--------+------+------+-----------------+
|Jason |B    |Williams|2017    |M     |4000  |Jason,B,Williams |
|Maria |Rose |Jones   |2009    |M     |4500  |Maria,Rose,Jones |
|Rob   |J    |Smith   |2009    |M     |5000  |Rob,J,Smith      |
|Gloria|Mery |Jones   |2005    |F     |5500  |Gloria,Mery,Jones|
|Sean  |K    |Brown   |2006    |      |-1    |Sean,K,Brown     |
+------+-----+--------+--------+------+------+-----------------+



The above snippet also keeps the individual name columns. If you do not need them, you can `drop()` them using the statement below.

In [22]:
df.withColumn("FullName",concat(col("fname"),lit(','), col("mname"),lit(','),col("lname"))) \
    .drop("fname", "mname", "lname") \
    .show(truncate=False)

+--------+------+------+-----------------+
|dob_year|gender|salary|FullName         |
+--------+------+------+-----------------+
|2017    |M     |4000  |Jason,B,Williams |
|2009    |M     |4500  |Maria,Rose,Jones |
|2009    |M     |5000  |Rob,J,Smith      |
|2005    |F     |5500  |Gloria,Mery,Jones|
|2006    |      |-1    |Sean,K,Brown     |
+--------+------+------+-----------------+



### Using concat_ws() Function to Concatenate with Delimiter
Adding a delimiter while concatenating DataFrame columns can be easily done using another function `concat_ws()`.

Syntax

    concat_ws(sep, *cols)

The `concat_ws()` function takes the delimiter as its first argument, followed by the columns to concatenate.

In [23]:
from pyspark.sql.functions import concat_ws

df.withColumn("FullName",concat_ws(" ,",col("fname"),col("mname"),col("lname"))) \
    .drop("fname", "mname", "lname") \
    .show(truncate=False)

+--------+------+------+-------------------+
|dob_year|gender|salary|FullName           |
+--------+------+------+-------------------+
|2017    |M     |4000  |Jason ,B ,Williams |
|2009    |M     |4500  |Maria ,Rose ,Jones |
|2009    |M     |5000  |Rob ,J ,Smith      |
|2005    |F     |5500  |Gloria ,Mery ,Jones|
|2006    |      |-1    |Sean ,K ,Brown     |
+--------+------+------+-------------------+



### Using Raw SQL
Spark SQL also provides a way to concatenate using raw SQL syntax. To use it, you first need to create a temporary view using `df.createOrReplaceTempView("TMP")`. This creates a temporary view named "TMP".

We can then use the `concat()` function in raw SQL statements.

In [24]:
df.createOrReplaceTempView("TMP")

spark.sql("select CONCAT(fname,' ',lname,' ',mname) as FullName from TMP").show(truncate=False)

+-----------------+
|FullName         |
+-----------------+
|Jason Williams B |
|Maria Jones Rose |
|Rob Smith J      |
|Gloria Jones Mery|
|Sean Brown K     |
+-----------------+

