# VTL Union operator

The original specification can be found at **line 4977 of VTL-2.0-Reference-Manual**.

## Syntax

```text
union ( dsList )
   - dsList ::= ds { , ds }*
```
**Additional constraints**:

All the Data Sets in dsList have the same Identifier, Measure and Attribute Components.
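
This constraint can be checked before the operator is applied. The following is a minimal sketch, assuming each Data Set structure is described by a list of `(name, type)` pairs, such as the one PySpark returns from `DataFrame.dtypes`. The helper name `check_same_components` is hypothetical and is not part of VTL or PySpark; component roles (Identifier, Measure, Attribute) are not represented, so only names and types are compared.

```python
from typing import List, Sequence, Tuple

Component = Tuple[str, str]  # (component name, data type), e.g. ("Id_1", "int")


def check_same_components(structures: Sequence[List[Component]]) -> bool:
    """Return True when every structure declares the same named, typed components.

    Column order is ignored. Roles (Identifier, Measure, Attribute) are not
    modelled here and are therefore not checked.
    """
    if not structures:
        return True
    reference = dict(structures[0])
    return all(dict(s) == reference for s in structures[1:])


ds1 = [("Id_1", "int"), ("Id_2", "string"), ("Me_1", "int")]
ds2 = [("Id_2", "string"), ("Id_1", "int"), ("Me_1", "int")]  # same components, different order
ds3 = [("Id_1", "int"), ("Id_2", "string"), ("Me_2", "int")]  # different measure name

print(check_same_components([ds1, ds2]))  # True
print(check_same_components([ds1, ds3]))  # False
```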

**Behaviour**

The `union` operator implements the union of functions (i.e., Data Sets). The resulting Data Set has the same Identifier, Measure and Attribute Components as the operand Data Sets specified in `dsList`, and contains the Data Points belonging to any of the operand Data Sets.

The operand Data Sets can contain Data Points with the same Identifier values. To avoid duplicate Data Points in the resulting Data Set, such Data Points are filtered by choosing the one that belongs to the leftmost operand Data Set. For instance, assume that in `union ( ds1, ds2 )` the operand `ds1` contains a Data Point `dp1` and the operand `ds2` contains a Data Point `dp2` such that `dp1` has the same Identifier values as `dp2`; the resulting Data Set then contains `dp1` only.
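
The leftmost-wins rule can be illustrated without Spark. The following is a minimal sketch, assuming each Data Point is a plain Python dict and each Data Set is a list of such dicts; `vtl_union` is a hypothetical helper name used only for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

DataPoint = Dict[str, Hashable]


def vtl_union(datasets: Sequence[List[DataPoint]], id_cols: Sequence[str]) -> List[DataPoint]:
    """Union Data Sets; on identifier clashes keep the Data Point of the leftmost operand."""
    seen: Dict[Tuple, DataPoint] = {}
    for dataset in datasets:  # operands are visited left to right
        for dp in dataset:
            key = tuple(dp[c] for c in id_cols)
            # setdefault stores dp only if the key is new, so a later
            # operand never overrides an earlier one
            seen.setdefault(key, dp)
    return list(seen.values())


ds1 = [{"Id_1": 2012, "Id_2": "B", "Me_1": 5}]
ds2 = [
    {"Id_1": 2012, "Id_2": "B", "Me_1": 23},  # same identifiers as the ds1 Data Point
    {"Id_1": 2012, "Id_2": "S", "Me_1": 5},
]

result = vtl_union([ds1, ds2], ["Id_1", "Id_2"])
print(result)
# [{'Id_1': 2012, 'Id_2': 'B', 'Me_1': 5}, {'Id_1': 2012, 'Id_2': 'S', 'Me_1': 5}]
```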

The operator has the typical behaviour of the “Behaviour of the Set operators” (see the section “Typical behaviours of the ML Operators”).

The automatic Attribute propagation is not applied.

## Example

Given the operand Data Sets DS_1 and DS_2:

DS_1:
```text
Id_1,Id_2,Id_3,Id_4,Me_1
2012,B,Total,Total,5
2012,G,Total,Total,2
2012,F,Total,Total,3
```
DS_2:
```text
Id_1,Id_2,Id_3,Id_4,Me_1
2012,N,Total,Total,23
2012,S,Total,Total,5
```

The result of `DS_r := union(DS_1, DS_2)` is:

```text
Id_1,Id_2,Id_3,Id_4,Me_1
2012,B,Total,Total,5
2012,G,Total,Total,2
2012,F,Total,Total,3
2012,N,Total,Total,23
2012,S,Total,Total,5
```

In [1]:
import os

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, lit, when

In [2]:
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLUnion")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLUnion")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()



In [16]:
root_path="../data"
ds1_path=f"{root_path}/union_ds1.csv"
ds2_path=f"{root_path}/union_ds2.csv"
ds3_path=f"{root_path}/union_ds3.csv"
ds4_path=f"{root_path}/union_ds4.csv"

In [6]:
df1=spark.read.csv(ds1_path, header=True,inferSchema=True)
df1.show()

                                                                                

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   B|Total|Total|   5|
|2012|   G|Total|Total|   2|
|2012|   F|Total|Total|   3|
+----+----+-----+-----+----+



In [7]:
df2=spark.read.csv(ds2_path, header=True,inferSchema=True)
df2.show()

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   N|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [10]:
df3=spark.read.csv(ds3_path, header=True,inferSchema=True)
df3.show()

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   B|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [17]:
df4=spark.read.csv(ds4_path, header=True,inferSchema=True)
df4.show()

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   L|Total|Total|   5|
|2012|   M|Total|Total|   2|
|2012|   X|Total|Total|   3|
+----+----+-----+-----+----+



In [11]:
df_r1=df1.union(df2)
df_r1.show(10)

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   B|Total|Total|   5|
|2012|   G|Total|Total|   2|
|2012|   F|Total|Total|   3|
|2012|   N|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [13]:
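id_col=["Id_1","Id_2","Id_3","Id_4"]
# Keep one row per identifier combination.
# Note: Spark does not guarantee which of the duplicate rows dropDuplicates keeps.
df_r2=df1.union(df3).dropDuplicates(id_col)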
df_r2.show(10)

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   F|Total|Total|   3|
|2012|   G|Total|Total|   2|
|2012|   B|Total|Total|   5|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [37]:
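def df_union(df_list, id_col):
    """Union the DataFrames in df_list, keeping one row per identifier combination.

    Caveat: dropDuplicates does not guarantee which of the duplicate rows is kept,
    so the VTL "leftmost operand wins" rule is not strictly enforced here.
    """
    size = len(df_list)
    print(f"total size: {size}")
    if size == 1:
        return df_list[0]
    elif size == 2:
        return df_list[0].union(df_list[1]).dropDuplicates(id_col)
    else:
        df_resu = df_list[0]
        # Append every remaining DataFrame, then deduplicate once at the end
        for i in range(1, size):
            print(f"index: {i}")
            df_list[i].show()
            df_resu = df_resu.union(df_list[i])
        return df_resu.dropDuplicates(id_col)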

In [38]:
df_list1=[df1,df2]
df_list2=[df1,df3]
df_list3=[df1,df3,df4]
df_list4=[df1,df2,df3,df4]

In [39]:
df_r3=df_union(df_list1,id_col)
df_r3.show()

total size: 2
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   F|Total|Total|   3|
|2012|   G|Total|Total|   2|
|2012|   B|Total|Total|   5|
|2012|   N|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [40]:
df_r4=df_union(df_list2,id_col)
df_r4.show()

total size: 2
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   F|Total|Total|   3|
|2012|   G|Total|Total|   2|
|2012|   B|Total|Total|   5|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+



In [41]:
df_r5=df_union(df_list3,id_col)
df_r5.show()

total size: 3
index: 1
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   B|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+

index: 2
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   L|Total|Total|   5|
|2012|   M|Total|Total|   2|
|2012|   X|Total|Total|   3|
+----+----+-----+-----+----+

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   F|Total|Total|   3|
|2012|   G|Total|Total|   2|
|2012|   B|Total|Total|   5|
|2012|   S|Total|Total|   5|
|2012|   M|Total|Total|   2|
|2012|   L|Total|Total|   5|
|2012|   X|Total|Total|   3|
+----+----+-----+-----+----+



In [42]:
df_r6=df_union(df_list4,id_col)
df_r6.show()

total size: 4
index: 1
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   N|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+

index: 2
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   B|Total|Total|  23|
|2012|   S|Total|Total|   5|
+----+----+-----+-----+----+

index: 3
+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   L|Total|Total|   5|
|2012|   M|Total|Total|   2|
|2012|   X|Total|Total|   3|
+----+----+-----+-----+----+

+----+----+-----+-----+----+
|Id_1|Id_2| Id_3| Id_4|Me_1|
+----+----+-----+-----+----+
|2012|   F|Total|Total|   3|
|2012|   G|Total|Total|   2|
|2012|   B|Total|Total|   5|
|2012|   N|Total|Total|  23|
|2012|   S|Total|Total|   5|
|2012|   M|Total|Total|   2|
|2012|   L|Total|Total|   5|
|2012|   X|Total|Total|   3|
+----+----+-----+-----+----+

