## PySpark Union and UnionAll

The PySpark `union()` and `unionAll()` transformations merge two or more DataFrames that share the same schema. `unionAll()` is an alias for `union()`: both keep duplicate rows and match columns by position, not by name.

In [0]:
# Removes Python state, but some libraries might not work without calling this command.
dbutils.library.restartPython()

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
# Note: sum, max and min imported here shadow Python's built-in functions of the same name.
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark Union and UnionAll').getOrCreate()

In [0]:
data = [
  ('Sam', 'Software Engineer', 'US', 5000, 30, 500),
  ('Adam', 'Data Scientist', 'US', 6000, 58, 550),
  ('Jonas', 'Sales Person', 'Wales', 5000, 41, 500),
  ('Nick', 'Data Engineer', 'Ireland', 5000, 41, 600),
  ('Peter', 'CTO', 'Ireland', 10000, 35, 1500),
  ('Ann', 'Data Analyst', 'Australia', 6000, 24, 500),
  ('Wade', 'Data Engineer', 'Scotland', 5500, 25, 600)
]

columns = ['name', 'job', 'country', 'salary', 'age', 'bonus']

df = spark.createDataFrame(data = data, schema = columns)

df.printSchema()
df.show()

In [0]:
data2 = [
  ('Peter', 'CTO', 'Ireland', 10000, 35, 1500),
  ('Ann', 'Data Analyst', 'Australia', 6000, 24, 500),
  ('Ralph', 'CEO', 'Germany', 15000, 50, 2500),
  ('Jonas', 'Sales Person', 'Wales', 5000, 41, 500),
  ('Nick', 'Data Engineer', 'Ireland', 5000, 41, 600),
  ('Lekhana', 'Advertising', 'England', 4500, 27, 560),
  ('Tomas', 'Marketing', 'Hungary', 4500, 30, 570)
]

columns2 = ['name', 'job', 'country', 'salary', 'age', 'bonus']

df2 = spark.createDataFrame(data = data2, schema = columns2)

df2.printSchema()
df2.show()

#### Merge two or more DataFrames

In [0]:
# union() appends the rows of df2 to df and keeps duplicate rows (UNION ALL semantics).
unionDF = df.union(df2)
unionDF.show()

#### Merge without duplicates

In [0]:
# union() on its own is equivalent to UNION ALL in SQL and keeps duplicate rows.
# For a SQL-style set union (which removes duplicate rows), follow it with distinct().
deDupDF = df.union(df2).distinct()
deDupDF.show()

#### The end of the notebook