# Left, inner, outer and cross joins
By the end of this lecture you will be able to do:
- left joins between `DataFrames`
- inner joins between `DataFrames`
- outer and cross joins between `DataFrames`


In [1]:
import polars as pl

In [2]:
df1 = pl.DataFrame(
    {
        "id": [1,2,3],
        "name": ["Alice", "Bob", "Charlie"]
    }
)
df1

id,name
i64,str
1,"""Alice"""
2,"""Bob"""
3,"""Charlie"""


In [3]:
df2 = pl.DataFrame({
    "id": [2, 3, 4],
    "age": [25, 30, 35]
})
df2

id,age
i64,i64
2,25
3,30
4,35


### Inner Join:

Combines rows from both `DataFrames` where the `id` column matches in both tables.
Only rows with matching `id` values (2 and 3) are included in the result.

In [4]:
result = df1.join(df2, on="id", how="inner")

In [5]:
result

id,name,age
i64,str,i64
2,"""Bob""",25
3,"""Charlie""",30


## Left join
In a left join we return all the rows from the left `DataFrame` and the matched rows from the right `DataFrame`. The matching happens on the basis of the join column(s). If the join column name is the same in the left and right `DataFrames` we can pass it to the `on` argument.

In [7]:
(
    df1
    .join(
        df2, 
        on="id",
        how="left",
        coalesce=False
    )
)

id,name,id_right,age
i64,str,i64,i64
1,"""Alice""",,
2,"""Bob""",2,25
3,"""Charlie""",3,30


Note that:
- The order of `df1` is maintained in this left join
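- `null` values in the join column are not matched to each other by default (there are no `null` keys in this example, so this does not affect the output)
- If the join column name(s) are not the same in both `DataFrames` then we specify `left_on` and `right_on` instead of `on`
- As we pass `coalesce=False` we get both the `id` and `id_right` join columns in the output (when `coalesce` is not set, the behaviour depends on the join type and on your Polars version, so I set it explicitly here)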

If we instead pass `coalesce=True` Polars coalesces the join columns `id` and `id_right` into a single `id` column. This was previously the default behaviour and, personally, it is what I normally want.

In [8]:
(
    df1
    .join(
        df2,
        on="id",
        how="left",
        coalesce=True
    )
)

id,name,age
i64,str,i64
1,"""Alice""",
2,"""Bob""",25
3,"""Charlie""",30


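If we set `join_nulls=True` then Polars also matches `null` values in the join column with each other. There are no `null` keys in this example, so the output is unchanged.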

In [9]:
(
    df1
    .join(
        df2,
        on="id",
        how="left",
        coalesce=False,
        join_nulls=True
    )
)

id,name,id_right,age
i64,str,i64,i64
1,"""Alice""",,
2,"""Bob""",2,25
3,"""Charlie""",3,30


When a column name appears in both `DataFrames` Polars adds the suffix `_right` to the column from the right `DataFrame` by default. We can set an alternative suffix with the `suffix` argument; here we pass the default value explicitly.

In [10]:
(
    df1
    .join(
        df2,
        on="id",
        how="left",
        coalesce=False,
        suffix="_right"
    )
)

id,name,id_right,age
i64,str,i64,i64
1,"""Alice""",,
2,"""Bob""",2,25
3,"""Charlie""",3,30


## Inner joins
In an inner join we only retain the rows where the join key is present in both `DataFrames`.

In [11]:
(
    df1.join(
        df2,
        on="id"
    )
)

id,name,age
i64,str,i64
2,"""Bob""",25
3,"""Charlie""",30


As for left joins, rows with a `null` join key are not matched unless we set `join_nulls=True`. Our data has no `null` keys, so the output is the same.

In [12]:
(
    df1
    .join(
        df2,
        on="id",
        join_nulls=True
    )
)

id,name,age
i64,str,i64
2,"""Bob""",25
3,"""Charlie""",30


## Cross join
With a cross join we get the Cartesian product of both tables - so we end up with each row of the left `DataFrame` matched with each row of the right `DataFrame` and there is no join key.

In [13]:
(
    df1
    .join(
        df2,
        how="cross"
    )
)

id,name,id_right,age
i64,str,i64,i64
1,"""Alice""",2,25
1,"""Alice""",3,30
1,"""Alice""",4,35
2,"""Bob""",2,25
2,"""Bob""",3,30
2,"""Bob""",4,35
3,"""Charlie""",2,25
3,"""Charlie""",3,30
3,"""Charlie""",4,35


## Full outer join
In a full outer join we return all rows from both `DataFrames`, whether or not there is a match in the other `DataFrame`.

In [15]:
(
    df1
    .join(
        df2,
        on="id",
        how="full"
    )
)

id,name,id_right,age
i64,str,i64,i64
2,"""Bob""",2,25
3,"""Charlie""",3,30
,,4,35
1,"""Alice""",,


In the output we see that:
- we get an `id_right` column as we have the same name for the join column in both `DataFrames`
- the first two rows have the matching keys
- we then get the row from the right `DataFrame` with no match in the left `DataFrame` (`id_right` of 4), where `id` and `name` are `null`
- we then get the row from the left `DataFrame` with no match in the right `DataFrame` (`id` of 1, Alice), where `id_right` and `age` are `null`

We can of course choose to match on the `null` values with `join_nulls=True`. As our data has no `null` keys the output is the same.

In [16]:
(
    df1
    .join(
        df2,
        on="id",
        how="full",
        join_nulls=True
    )
)

id,name,id_right,age
i64,str,i64,i64
2,"""Bob""",2,25
3,"""Charlie""",3,30
,,4,35
1,"""Alice""",,


## Full outer join with coalesced join keys
An outer-coalesce join is like:
- a `full` outer join followed by
- a `coalesce` of the `id` and `id_right` columns into a single `id` column with the first non-`null` value

To do this we pass `coalesce=True`

In [18]:
(
    df1
    .join(
        df2,
        on="id",
        how="full",
        coalesce=True
    )
)

id,name,age
i64,str,i64
2,"""Bob""",25
3,"""Charlie""",30
4,,35
1,"""Alice""",
