# Dealing with column ambiguity

When two DataFrames share a column name, a reference to that column can be ambiguous. For example, the join on `id` below fails:

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.Builder().config("spark.ui.showConsoleProgress", "false").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [2]:
from typedspark import Column, Schema, create_partially_filled_dataset
from pyspark.errors import AnalysisException
from pyspark.sql.types import IntegerType, StringType

class Person(Schema):
    id: Column[IntegerType]
    name: Column[StringType]
    age: Column[IntegerType]

class Job(Schema):
    id: Column[IntegerType]
    salary: Column[IntegerType]

df_a = create_partially_filled_dataset(spark, Person, {Person.id: [1, 2, 3]})
df_b = create_partially_filled_dataset(spark, Job, {Job.id: [1, 2, 3]})

try:
    df_a.join(df_b, Person.id == Job.id)
except AnalysisException as e:
    print(e)

[AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`].


The above raises an `AnalysisException`: `Person.id` and `Job.id` both refer to a column named `id`, so Spark can't tell whether it belongs to `df_a` or `df_b`. To resolve this, register each `Schema` to its `DataSet` and use the registered schemas in the join condition.

In [3]:
from typedspark import register_schema_to_dataset

person = register_schema_to_dataset(df_a, Person)
job = register_schema_to_dataset(df_b, Job)
(
    df_a
    .join(df_b, person.id == job.id)
    .show()
)

+---+----+----+---+------+
| id|name| age| id|salary|
+---+----+----+---+------+
|  1|null|null|  1|  null|
|  2|null|null|  2|  null|
|  3|null|null|  3|  null|
+---+----+----+---+------+



The joined result still contains two `id` columns. It is often a good idea to drop one of them, for example:

In [4]:
from typedspark import transform_to_schema

class PersonWithJob(Person, Job):
    pass

(
    transform_to_schema(
        df_a
        .join(df_b, person.id == job.id)
        .drop(job.id),
        PersonWithJob
    )
    .show()
)

+---+------+----+----+
| id|salary|name| age|
+---+------+----+----+
|  1|  null|null|null|
|  2|  null|null|null|
|  3|  null|null|null|
+---+------+----+----+

