## Concatenation
By the end of this lecture you will be able to:
- vertically concatenate `DataFrames`
- handle inconsistent dtypes in a vertical concat
- horizontally concatenate `DataFrames`
- diagonally concatenate `DataFrames`


In [25]:
import polars as pl

We create a first `DataFrame` with fake trade records from 2023

In [26]:
df1 = pl.DataFrame(
    [
        {"year":2023, "exporter": "India", "importer": "Russia", "quantity":0},
        {"year":2023, "exporter": "India", "importer": "Russia", "quantity":1}
    ]
) 
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1


We now create a second `DataFrame` with trade records from 2024

In [27]:
df2 = pl.DataFrame(
    [
        {"year":2024, "exporter":"India", "importer":"Russia", "quantity":2},
        {"year":2024, "exporter":"India", "importer": "Russia", "quantity":2}
    ]
)
df2

year,exporter,importer,quantity
i64,str,str,i64
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


## Combining `DataFrames` vertically
If we have data in two different `DataFrames` that we combine as a new `DataFrame` we can manage the data in memory in three different ways:
- keeping the data in the original two locations in memory and referencing the new `DataFrame` to these original locations
- copying the data to a single location in memory and referencing the new `DataFrame` to this single location
- appending the data from the second `DataFrame` to the location of the first `DataFrame`

We cover three methods for vertically combining `DataFrames`: `df.vstack`, `df.extend` and `pl.concat`. The output of each method is the same from a user perspective but differs in where the data sits in memory under the hood.

### `vstack`

We combine the df1 and df2 `DataFrames` into a single `DataFrame` with the `vstack` method

In [28]:
(
    df1
    .vstack(
        df2
    )
    
)

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


A `vstack`:
- keeps the data from both `DataFrames` in their original locations in memory and points the new `DataFrame` to those locations

### Rechunk
A `vstack` is computationally very cheap (as no data is copied). However, subsequent operations (e.g. `group_by`) are slower than if the data has been *rechunked* (i.e. copied from the original two chunks to a new single location in memory).


We can manually cause two `DataFrames` linked by `vstack` to be copied to a single location in memory with `rechunk`

In [29]:
(
    df1
    .vstack(
        df2
    )
    .rechunk()
)

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


We see below that the `pl.concat` function is a way of applying `vstack` and `rechunk` to a list of `DataFrames`.

### Extend
We can append one `DataFrame` to another with `extend`

In [30]:
(
    df1
    .extend(
        df2
    )
)

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


An `extend`:
- copies the data from the second `DataFrame` (`df2`) and appends it to the data of the first `DataFrame` (`df1`)
- modifies the first `DataFrame` (`df1`) 

We can see that `df1` has been modified in-place as it now has both years of data

In [31]:
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


Before continuing we re-assign `df1` back to its original value to reduce confusion if cells are executed out-of-order!

In [32]:
df1 = pl.DataFrame(
    [
        {"year":2023, "exporter": "India", "importer": "Russia", "quantity":0},
        {"year":2023, "exporter": "India", "importer": "Russia", "quantity":1}
    ]
) 
df1

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1


### Use case of `vstack`, `rechunk` and `extend`
- If you are combining `DataFrames` to do more transformations/groupbys/joins etc it is normally best to use `vstack` and `rechunk` so that all the data is together in memory. In practice it is simpler to use `pl.concat` to do this as we see below
- If you want to combine two `DataFrames` but do not want to do more operations on them (e.g. you just want to check their length or perhaps write to a file) you should use `vstack`
- If you want to add a small `DataFrame` to a large `DataFrame` use `extend` as it only copies the data in the small `DataFrame`

### Vertically concatenating `DataFrames`

Above we saw how to vertically combine two `DataFrames`. More generally, we can combine a `list` of `DataFrames` with `pl.concat`. For clarity, we set the `how="vertical"` argument explicitly this time although it is the default argument

In [33]:
(
    pl.concat(
        [df1,df2],
        how= 'vertical'
    )
)

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


When we do `pl.concat` Polars:
- does a series of `vstacks` to combine the list of `DataFrames`
- then does a `rechunk` to gather all the data together in memory

We can stop Polars from doing the `rechunk` by passing the `rechunk=False` argument

In [34]:
df_vertical = (
    pl.concat(
        [
            df1,
            df2
        ],
        rechunk=False
    )
)
df_vertical

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


### Handling different dtypes in vertical concatenation

Polars expects the column names and dtypes to match when doing vertical concatenation.

To illustrate some approaches for handling differences in types we create an alternative `df2` where the `quantity` column is 64-bit float instead of 64-bit integer

In [35]:
df2_float = (
    df2
    .with_columns(
        pl.col("quantity").cast(pl.Float64)
    )
)
df2_float

year,exporter,importer,quantity
i64,str,str,f64
2024,"""India""","""Russia""",2.0
2024,"""India""","""Russia""",2.0


When the dtypes do not match we may have to manage this by doing an explicit `cast` of the column types.

In this example we cast the `quantity` column back to `pl.Int64`

In [36]:
(
    pl.concat(
        [
            df1,
            df2_float.with_columns(
                pl.col("quantity").cast(pl.Int64)
            )
        ]
    )
)

year,exporter,importer,quantity
i64,str,str,i64
2023,"""India""","""Russia""",0
2023,"""India""","""Russia""",1
2024,"""India""","""Russia""",2
2024,"""India""","""Russia""",2


However, Polars also has a way of managing certain differences by casting to a "supertype". For example, the supertype of `pl.Float64` and `pl.Int64` is `pl.Float64`.

We can do a vertical concatenation using supertypes where necessary by setting the `how` argument to `vertical_relaxed` instead of `vertical`

In [38]:
(
    pl.concat(
        [
            df1,
            df2_float
        ],
        how = "vertical_relaxed"
    )
)

year,exporter,importer,quantity
i64,str,str,f64
2023,"""India""","""Russia""",0.0
2023,"""India""","""Russia""",1.0
2024,"""India""","""Russia""",2.0
2024,"""India""","""Russia""",2.0



## Horizontal concatenation
We can horizontally concatenate `DataFrames` that have:
- the same number of rows and
- different column names

For horizontal concatenation we create another `DataFrame` that has more details about each of the trades in 2023

In [39]:
df1_details = pl.DataFrame(
    [
        {"item": "Clothes", "value":10},
        {"item": "Machinery", "value": 100}
    ]
)
df1_details

item,value
str,i64
"""Clothes""",10
"""Machinery""",100


### `hstack`

We can combine two `DataFrames` horizontally with `hstack`

In [40]:
(
    df1
    .hstack(
        df1_details
    )
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100


This operation is *not* in-place unless we pass `in_place=True`.

We can also pass a `list` of `Series` inside `hstack`

In [41]:
(
    df1
    .hstack(
        [
        df1_details["item"],
        df1_details["value"]
        ]
    )
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100



### Horizontal concatenation with `pl.concat`
We can also use `pl.concat` for horizontal concatenation

In [42]:
(
    pl.concat(
        [
            df1,
            df1_details
        ],
        how = "horizontal"
    )
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""Russia""",0,"""Clothes""",10
2023,"""India""","""Russia""",1,"""Machinery""",100


If we have common columns and some overlap in the values in those common columns we can use an alternative horizontal concatenation method called `align` where Polars identifies the common columns and aligns the rows appropriately.

In this modified example we have `item` as a column in both `DataFrames` but in the second `DataFrame` we only have one `item`.

In [45]:
(
    pl.concat(
        [
            pl.DataFrame(
                [
                    {"year":2023, "exporter": "India", "item":"Clothes"},
                    {"year":2023, "exporter": "India", "item":"Machinery"}
                ]
            ),
            pl.DataFrame(
                [
                    {"item":"Machinery", "value":100}
                ]
            )
        ], how= "align"
    )
)

year,exporter,item,value
i64,str,str,i64
2023,"""India""","""Clothes""",
2023,"""India""","""Machinery""",100


When we do an `align` concatenation Polars sees that we can horizontally concatenate the second `DataFrame` but that only the second row has a value to be concatenated.

## Diagonal concatenation

We are now looking at new trade records for 2023 and 2024 between India and the USA.

In 2023 the schema of the trade records is the same as we saw above with:
- `year`
- `exporter`
- `importer` and
- `quantity`

However, in 2024 the schema also includes:
- `item` and 
- `value`

In [46]:
df1 = pl.DataFrame(
    [
        {"year":2023, "exporter": "India", "importer":"USA", "quantity":0},
        {"year":2023, "exporter": "India", "importer":"USA", "quantity":1},
    ]
)

df2 = pl.DataFrame(
    [
        {"year":2024, "exporter": "India", "importer":"USA", "quantity":2, "item": "Clothes", "value":10},
        {"year":2024, "exporter": "India", "importer":"USA", "quantity":3, "item": "Machinery", "value":100}
    ]
)

We want to combine these records into a single `DataFrame`. As the column names are not the same we cannot do a vertical concatenation.

Instead we can do a diagonal concatenation

In [47]:
(
    pl.concat(
        [
            df1,
            df2
        ], how = "diagonal"
    )
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2023,"""India""","""USA""",0,,
2023,"""India""","""USA""",1,,
2024,"""India""","""USA""",2,"""Clothes""",10
2024,"""India""","""USA""",3,"""Machinery""",100
