# 5.2 Union of data frames

We have seen how to join two data frames that share common columns. A join merges the columns of the data frames. Now, if we have
two or more data frames with the same (or almost the same) schema, how do we merge their rows? We call the operation that merges data frames by rows a **union of data frames**.

In Spark, we have two functions:
- union(other): called on a data frame, it takes another data frame as its argument and returns a new data frame that is the union of the two.
- unionByName(other, allowMissingColumns): works like union(), but matches columns by name. Since Spark 3.1 it accepts a boolean argument allowMissingColumns, which allows us to merge data frames that have different numbers of columns.

The difference between the two transformations is that:
- union() resolves columns by position.
- unionByName() resolves columns by name.
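
To make the two resolution strategies concrete, here is a toy model in plain Python. It is only a sketch and does not use Spark: a "frame" is assumed to be a `(columns, rows)` pair, and the helper names `union_by_position` and `union_by_name` are hypothetical, chosen for illustration.

```python
# Toy model in plain Python (no Spark needed): a "frame" is a (columns, rows) pair.
def union_by_position(left, right):
    """Append right's rows to left's rows, matching columns by position only."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("column counts differ")
    # The names on the right side are ignored; the result keeps the left names.
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Append right's rows to left's rows, reordering right's values by column name."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("column names differ")
    # For each left column, find its position on the right side.
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


# Same columns on both sides, but the last two are swapped.
a = (["name", "msg", "mail"], [("James", "hello", None)])
b = (["name", "mail", "msg"], [("Jen", "world", None)])

by_position = union_by_position(a, b)
by_name = union_by_name(a, b)
print(by_position[1][1])  # ('Jen', 'world', None): "world" lands under msg
print(by_name[1][1])      # ('Jen', None, 'world'): "world" stays under mail
```

When the column order differs, the positional version silently puts values under the wrong header, while the by-name version keeps each value under its own column.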

In exp1, 
In exp3, we tested data frames whose column types differ, and union() still works. How the output data frame chooses its column type is not obvious.
Note that there is another transformation called unionAll(), which was deprecated in Spark 2.0.0. Since Spark 3.0 it is no longer deprecated and is simply an alias for union().

In [1]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import lit
import os

In [2]:
local=True

if local:
    spark=SparkSession.builder.master("local[4]").appName("UnionDataFrame").getOrCreate()
else:
    spark=SparkSession.builder \
              .master("k8s://https://kubernetes.default.svc:443") \
              .appName("UnionDataFrame") \
              .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
              .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
              .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
              .config("spark.executor.instances", "4") \
              .config("spark.executor.memory","8g") \
              .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1') \
              .getOrCreate()

We create two data frames below (df1 and df2). They have almost the same schema; only one column name is different (name vs employee_name). We will try to union df1 and df2 with different methods.

In [3]:
data1 = [("James", "Sales", "NY", 90000, 34, 10000),
             ("Michael", "Sales", "NY", 86000, 56, 20000),
             ("Robert", "Sales", "CA", 81000, 30, 23000),
             ("Maria", "Finance", "CA", 90000, 24, 23000)
             ]

columns1 = ["name", "department", "state", "salary", "age", "bonus"]

df1 = spark.createDataFrame(data=data1, schema=columns1)
print("Source data 1: row number {}".format(df1.count()))
df1.printSchema()
df1.show(truncate=False)

Source data 1: row number 4
root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------+----------+-----+------+---+-----+
|name   |department|state|salary|age|bonus|
+-------+----------+-----+------+---+-----+
|James  |Sales     |NY   |90000 |34 |10000|
|Michael|Sales     |NY   |86000 |56 |20000|
|Robert |Sales     |CA   |81000 |30 |23000|
|Maria  |Finance   |CA   |90000 |24 |23000|
+-------+----------+-----+------+---+-----+



In [4]:
data2 = [("James", "Sales", "NY", 90000, 34, 10000),
             ("Maria", "Finance", "CA", 90000, 24, 23000),
             ("Jen", "Finance", "NY", 79000, 53, 15000),
             ("Jeff", "Marketing", "CA", 80000, 25, 18000),
             ("Kumar", "Marketing", "NY", 91000, 50, 21000)
             ]
columns2 = ["employee_name", "department", "state", "salary", "age", "bonus"]
df2 = spark.createDataFrame(data=data2, schema=columns2)
print("Source data 2: row number {}".format(df2.count()))
df2.printSchema()
df2.show(truncate=False)

Source data 2: row number 5
root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+



## 5.2.1 Use union() function 

In the example below, we use the union() function to union two data frames that have the same number of columns. Because it matches columns by position, the difference in column names does not cause an error.

Inner workings of Spark union(): **Spark compares the columns position by position. If the column count matches and the types are compatible, the differences between column names are ignored and the result keeps the column names of the first data frame.**

Another important note: the union() function just appends the rows of the two data frames without removing duplicates.

So if you want unique rows, you need to call distinct() or dropDuplicates() on the result.


In [5]:
# note we use df1 as base table, and union df2
# The order of the rows in df1, and df2 does not change.
df_union=df1.union(df2)
df_union.show()

+-------+----------+-----+------+---+-----+
|   name|department|state|salary|age|bonus|
+-------+----------+-----+------+---+-----+
|  James|     Sales|   NY| 90000| 34|10000|
|Michael|     Sales|   NY| 86000| 56|20000|
| Robert|     Sales|   CA| 81000| 30|23000|
|  Maria|   Finance|   CA| 90000| 24|23000|
|  James|     Sales|   NY| 90000| 34|10000|
|  Maria|   Finance|   CA| 90000| 24|23000|
|    Jen|   Finance|   NY| 79000| 53|15000|
|   Jeff| Marketing|   CA| 80000| 25|18000|
|  Kumar| Marketing|   NY| 91000| 50|21000|
+-------+----------+-----+------+---+-----+



In [6]:
# After dropDuplicates(), the duplicated rows (James and Maria) appear only once
df_union.dropDuplicates().show()

+-------+----------+-----+------+---+-----+
|   name|department|state|salary|age|bonus|
+-------+----------+-----+------+---+-----+
|  Kumar| Marketing|   NY| 91000| 50|21000|
|  Maria|   Finance|   CA| 90000| 24|23000|
| Robert|     Sales|   CA| 81000| 30|23000|
|  James|     Sales|   NY| 90000| 34|10000|
|Michael|     Sales|   NY| 86000| 56|20000|
|    Jen|   Finance|   NY| 79000| 53|15000|
|   Jeff| Marketing|   CA| 80000| 25|18000|
+-------+----------+-----+------+---+-----+



## 5.2.2 Some bad examples

We have seen a successful union. What happens if we try to union two data frames that have different numbers of columns?



In [5]:
df3=df1.withColumn("msg",lit("Hello_world"))
df3.show()

+-------+----------+-----+------+---+-----+-----------+
|   name|department|state|salary|age|bonus|        msg|
+-------+----------+-----+------+---+-----+-----------+
|  James|     Sales|   NY| 90000| 34|10000|Hello_world|
|Michael|     Sales|   NY| 86000| 56|20000|Hello_world|
| Robert|     Sales|   CA| 81000| 30|23000|Hello_world|
|  Maria|   Finance|   CA| 90000| 24|23000|Hello_world|
+-------+----------+-----+------+---+-----+-----------+



In [8]:
# Of course it fails, and the error message is very clear: the two data frames have different schemas (here, a different number of columns)

df3.union(df1).show()

AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;
'Union false, false
:- Project [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L, Hello_world AS msg#158]
:  +- LogicalRDD [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L], false
+- Project [name#0 AS name#195, department#1 AS department#196, state#2 AS state#197, salary#3L AS salary#198L, age#4L AS age#199L, bonus#5L AS bonus#200L]
   +- LogicalRDD [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L], false


This time the number of columns is the same, but one column type is different. Will the union be successful?


In [9]:
# we cast column age from long type to string type
df4=df1.withColumn("age",df1.age.cast("string"))
df4.printSchema()

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: string (nullable = true)
 |-- bonus: long (nullable = true)



In [11]:
df_bad_union1=df4.union(df2)
df_bad_union1.show()
df_bad_union1.printSchema()


+-------+----------+-----+------+---+-----+
|   name|department|state|salary|age|bonus|
+-------+----------+-----+------+---+-----+
|  James|     Sales|   NY| 90000| 34|10000|
|Michael|     Sales|   NY| 86000| 56|20000|
| Robert|     Sales|   CA| 81000| 30|23000|
|  Maria|   Finance|   CA| 90000| 24|23000|
|  James|     Sales|   NY| 90000| 34|10000|
|  Maria|   Finance|   CA| 90000| 24|23000|
|    Jen|   Finance|   NY| 79000| 53|15000|
|   Jeff| Marketing|   CA| 80000| 25|18000|
|  Kumar| Marketing|   NY| 91000| 50|21000|
+-------+----------+-----+------+---+-----+

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: string (nullable = true)
 |-- bonus: long (nullable = true)



**The above union succeeded.**
We can also notice that **the type of the age column in the result is string**. Is this caused by Spark's inner workings, or is it just because df4 is the base table of the union?

To find out, we swap the base table and try again. The type of age in the result is still string. So we can say it is Spark's inner workings that cast long to string automatically when the types differ between the two data frames.

In [13]:
df_bad_union2=df2.union(df4)
df_bad_union2.show()
df_bad_union2.printSchema()

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
+-------------+----------+-----+------+---+-----+

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: string (nullable = true)
 |-- bonus: long (nullable = true)
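
Since union() does not warn about this implicit cast, it can help to compare the types position by position before the union. The helper below is a minimal sketch in plain Python; the name `positional_type_mismatches` is hypothetical. It assumes its inputs have the shape of PySpark's `DataFrame.dtypes`, a list of `(column_name, type_string)` pairs, where a long column is reported as `bigint`. The two lists are hand-written stand-ins for `df4.dtypes` and `df2.dtypes`.

```python
def positional_type_mismatches(left_dtypes, right_dtypes):
    """Return (position, left_name, left_type, right_name, right_type)
    for every position where the two type strings differ."""
    if len(left_dtypes) != len(right_dtypes):
        raise ValueError("column counts differ, union() would fail")
    return [
        (i, left_name, left_type, right_name, right_type)
        for i, ((left_name, left_type), (right_name, right_type))
        in enumerate(zip(left_dtypes, right_dtypes))
        if left_type != right_type
    ]


# Hand-written stand-ins for df4.dtypes and df2.dtypes (assumption: not read from Spark here).
df4_dtypes = [("name", "string"), ("department", "string"), ("state", "string"),
              ("salary", "bigint"), ("age", "string"), ("bonus", "bigint")]
df2_dtypes = [("employee_name", "string"), ("department", "string"), ("state", "string"),
              ("salary", "bigint"), ("age", "bigint"), ("bonus", "bigint")]

mismatches = positional_type_mismatches(df4_dtypes, df2_dtypes)
print(mismatches)  # [(4, 'age', 'string', 'age', 'bigint')]
```

With a live session, one would pass `df4.dtypes` and `df2.dtypes` directly and, if the list is not empty, cast the columns explicitly before calling union() rather than relying on the automatic cast.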



## 5.2.3 Use unionByName to union dataframe

We have seen how to union two data frames by column position. We can also union them by column name. In the example below, we use unionByName() to union two data frames that have different numbers of columns.



In [11]:
df_renamed=df2.withColumnRenamed("employee_name","name")
df_renamed.show()
# Note that the union below fails, because df_renamed has 6 columns and df3 has 7 columns.
df_renamed.unionByName(df3).show()

+-----+----------+-----+------+---+-----+
| name|department|state|salary|age|bonus|
+-----+----------+-----+------+---+-----+
|James|     Sales|   NY| 90000| 34|10000|
|Maria|   Finance|   CA| 90000| 24|23000|
|  Jen|   Finance|   NY| 79000| 53|15000|
| Jeff| Marketing|   CA| 80000| 25|18000|
|Kumar| Marketing|   NY| 91000| 50|21000|
+-----+----------+-----+------+---+-----+



AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 7 columns;
'Union false, false
:- Project [employee_name#48 AS name#308, department#49, state#50, salary#51L, age#52L, bonus#53L]
:  +- LogicalRDD [employee_name#48, department#49, state#50, salary#51L, age#52L, bonus#53L], false
+- Project [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L, msg#96]
   +- Project [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L, Hello_world AS msg#96]
      +- LogicalRDD [name#0, department#1, state#2, salary#3L, age#4L, bonus#5L], false


To make the union work, we need to use the option **allowMissingColumns=True**. The absent columns are filled with null after the union.



In [12]:
# Notice that the rows coming from df_renamed have a new column msg filled with null.
# If you are using a Spark version prior to 3.1, allowMissingColumns is not available. You need
# to create the absent column and fill it with null yourself.
df_renamed.unionByName(df3,allowMissingColumns=True).show()

+-------+----------+-----+------+---+-----+-----------+
|   name|department|state|salary|age|bonus|        msg|
+-------+----------+-----+------+---+-----+-----------+
|  James|     Sales|   NY| 90000| 34|10000|       null|
|  Maria|   Finance|   CA| 90000| 24|23000|       null|
|    Jen|   Finance|   NY| 79000| 53|15000|       null|
|   Jeff| Marketing|   CA| 80000| 25|18000|       null|
|  Kumar| Marketing|   NY| 91000| 50|21000|       null|
|  James|     Sales|   NY| 90000| 34|10000|Hello_world|
|Michael|     Sales|   NY| 86000| 56|20000|Hello_world|
| Robert|     Sales|   CA| 81000| 30|23000|Hello_world|
|  Maria|   Finance|   CA| 90000| 24|23000|Hello_world|
+-------+----------+-----+------+---+-----+-----------+



In [14]:
# This time let's try the same number of columns but different column names.
# Notice that the error message says it cannot find the column "name" in df2.
df1.unionByName(df2).show()

AnalysisException: Cannot resolve column name "name" among (employee_name, department, state, salary, age, bonus)
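
Before calling unionByName(), it can be useful to list which column names exist on only one side. The sketch below is plain Python; the helper name `column_name_diff` is hypothetical. It works on lists of names such as the ones returned by `df.columns`, and here it reuses the column lists of df1 and df2 defined earlier. Note that this comparison is case-sensitive, which is a simplifying assumption of the sketch.

```python
def column_name_diff(left_columns, right_columns):
    """Return the names found only on the left and only on the right, in order."""
    only_left = [c for c in left_columns if c not in right_columns]
    only_right = [c for c in right_columns if c not in left_columns]
    return only_left, only_right


# Same lists as columns1 and columns2 used to build df1 and df2.
columns1 = ["name", "department", "state", "salary", "age", "bonus"]
columns2 = ["employee_name", "department", "state", "salary", "age", "bonus"]

only_in_df1, only_in_df2 = column_name_diff(columns1, columns2)
print(only_in_df1)  # ['name']
print(only_in_df2)  # ['employee_name']
```

If both lists are empty, unionByName() can match every column. Otherwise, the names need to be aligned first (for example with withColumnRenamed) or the missing columns need to be filled in.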

## 5.2.4 Last check on the difference between union and unionByName

This time we will use two data frames that have the same number and types of columns, but with the last two columns in a different order.
Then we use union and unionByName to union the two data frames and check the difference.

In [15]:
# This function finds all columns of df2 that do not exist in df1, then creates them in df1 and fills them with None
def create_missing_column(df1, df2):
    for column in [column for column in df2.columns if column not in df1.columns]:
        df1 = df1.withColumn(column, lit(None))
    return df1

In [17]:
# df_msg has all the common columns and one extra column "msg"
df_msg = df1.withColumn("msg", lit("hello"))
    
# df_mail has all the common columns and one extra column "mail"
df_mail = df2.withColumn("mail", lit("world")).withColumnRenamed("employee_name", "name")

# we need to create column "mail" in df_msg, and column "msg" in df_mail
df_msg_after_fill = create_missing_column(df_msg, df_mail)
df_msg_after_fill.show()

df_mail_after_fill = create_missing_column(df_mail, df_msg)
df_mail_after_fill.show()



+-------+----------+-----+------+---+-----+-----+----+
|   name|department|state|salary|age|bonus|  msg|mail|
+-------+----------+-----+------+---+-----+-----+----+
|  James|     Sales|   NY| 90000| 34|10000|hello|null|
|Michael|     Sales|   NY| 86000| 56|20000|hello|null|
| Robert|     Sales|   CA| 81000| 30|23000|hello|null|
|  Maria|   Finance|   CA| 90000| 24|23000|hello|null|
+-------+----------+-----+------+---+-----+-----+----+

+-----+----------+-----+------+---+-----+-----+----+
| name|department|state|salary|age|bonus| mail| msg|
+-----+----------+-----+------+---+-----+-----+----+
|James|     Sales|   NY| 90000| 34|10000|world|null|
|Maria|   Finance|   CA| 90000| 24|23000|world|null|
|  Jen|   Finance|   NY| 79000| 53|15000|world|null|
| Jeff| Marketing|   CA| 80000| 25|18000|world|null|
|Kumar| Marketing|   NY| 91000| 50|21000|world|null|
+-----+----------+-----+------+---+-----+-----+----+



In [18]:
# after the fill, we can do the unionByName or union

# Compare the results of union and unionByName: this confirms that union resolves columns by position,
# while unionByName resolves columns by name
union1_df = df_msg_after_fill.union(df_mail_after_fill)
union1_df.show()

union2_df = df_msg_after_fill.unionByName(df_mail_after_fill)
union2_df.show()

+-------+----------+-----+------+---+-----+-----+----+
|   name|department|state|salary|age|bonus|  msg|mail|
+-------+----------+-----+------+---+-----+-----+----+
|  James|     Sales|   NY| 90000| 34|10000|hello|null|
|Michael|     Sales|   NY| 86000| 56|20000|hello|null|
| Robert|     Sales|   CA| 81000| 30|23000|hello|null|
|  Maria|   Finance|   CA| 90000| 24|23000|hello|null|
|  James|     Sales|   NY| 90000| 34|10000|world|null|
|  Maria|   Finance|   CA| 90000| 24|23000|world|null|
|    Jen|   Finance|   NY| 79000| 53|15000|world|null|
|   Jeff| Marketing|   CA| 80000| 25|18000|world|null|
|  Kumar| Marketing|   NY| 91000| 50|21000|world|null|
+-------+----------+-----+------+---+-----+-----+----+

+-------+----------+-----+------+---+-----+-----+-----+
|   name|department|state|salary|age|bonus|  msg| mail|
+-------+----------+-----+------+---+-----+-----+-----+
|  James|     Sales|   NY| 90000| 34|10000|hello| null|
|Michael|     Sales|   NY| 86000| 56|20000|hello| null|
| Ro