# Merging data

```{admonition} Summary
:class: hint

This section explains the following ways to merge geospatial data:

- Merging similar datasets using the `concat()` function
- Merging datasets by common attributes using the `merge()` function
- Merging datasets based on spatial relationships using the `sjoin()` function
...

```

## Merging similar datasets using `concat()`

If the attributes of the input datasets are identical, they can be merged vertically using the `concat()` function in `pandas`.

Import and load required libraries:

In [1]:
from pathlib import Path
import geopandas as gp
import pandas as pd

Define input and output paths:

In [2]:
INPUT = Path.cwd().parents[0] / "00_data"
OUTPUT = Path.cwd().parents[0] / "out"

Load input datasets:

In [4]:
layers_path = Path(INPUT / "layers")
input_layer = gp.read_file(layers_path / "border.shp")
second_layer = gp.read_file(layers_path / "border.shp")

The `concat()` function is used to concatenate the two layers. The `ignore_index` parameter determines whether a new index is created or if the original indexes are preserved.

- `ignore_index=True` → Creates a new index for the features in the merged dataset.
- `ignore_index=False` → Preserves original indexes (which may lead to duplicate indexes).

In [5]:
merged_layer = pd.concat(
    [input_layer, second_layer], 
    ignore_index=True)

Check the results: 

In [6]:
merged_layer

|    | Id | geometry |
|:---|:---|:---------|
| 0  | 0  | POLYGON ((405728.211 5659356.174, 405708.836 5... |
| 1  | 0  | POLYGON ((405728.211 5659356.174, 405708.836 5... |
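
The effect of `ignore_index` can be illustrated with a minimal sketch. It uses two small, hypothetical `pandas` tables instead of the shapefile layers (the names `first` and `second` are assumptions made for illustration); the index of a GeoDataFrame is handled the same way.

```python
import pandas as pd

# Two hypothetical attribute tables with identical columns.
first = pd.DataFrame({"Id": [0, 1], "name": ["a", "b"]})
second = pd.DataFrame({"Id": [2, 3], "name": ["c", "d"]})

# ignore_index=False (the default) keeps the original row labels.
kept = pd.concat([first, second], ignore_index=False)

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
print(kept.index.is_unique)    # False
```

Duplicate row labels are not an error in themselves, but label-based selection such as `kept.loc[0]` then returns several rows.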


## Merging based on common attributes using `merge()`

If two datasets share a common attribute, they can be merged horizontally into one dataset. By default (`how='inner'`), the output includes only rows with matching values in the common attribute.

For example, if both datasets contain an `ID` column, only the rows with the **same ID** will be merged horizontally, combining all attributes for that ID. 

````{admonition} How the **how** parameter works in the **merge() function**
:class: tip
The `how` parameter controls how datasets are merged:
- `inner`: The result includes only matching rows, that is, rows whose common attribute has the same value in both datasets.
- `outer`: The result includes all rows from both datasets. Missing values are filled with `NaN`.
- `left`: The result includes all rows from the first dataset and matching data from the second. Non-matching rows get `NaN`.
- `right`: The result includes all rows from the second dataset and matching data from the first. Non-matching rows get `NaN`.

For more information, check the [GeoPandas documentation](https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#attribute-joins).
````

```{dropdown} Examples of how different **how** options affect merging tables
The following examples show how two datasets are merged, defined by the `how` parameter. 

**Dataset 1**:

|ID  |Type  |
|:---| ----:|
|1   |A     |
|3   |C     |
|6   |D     |

**Dataset 2**:

|    ID  |   Time    |
| :------| --------: |
|    1   |  15 min   |
|    3   |  16 min   |
|    4   |  17 min   |

- `inner`: Keeps only the rows with matching `ID` values in both datasets.

|    ID  |  Type   |   Time   |
| :------| ------- |--------: |
|    1   |   A     |  15 min  |
|    3   |   C     |  16 min  |

- `outer`: Includes all rows from both datasets; missing values are filled with `NaN`.

|    ID  |  Type   |   Time   |
| :------| ------- |--------: |
|    1   |   A     |  15 min  |
|    3   |   C     |  16 min  |
|    6   |   D     | **NaN**  |
|    4   | **NaN** |  17 min  |

- `left`: Keeps all rows from **Dataset 1** and adds matching values from **Dataset 2**; unmatched values get `NaN`.

|    ID  |  Type   |   Time   |
| :------| ------- |--------: |
|    1   |   A     |  15 min  |
|    3   |   C     |  16 min  |
|    6   |   D     | **NaN**  |

- `right`: Keeps all rows from **Dataset 2** and adds matching values from **Dataset 1**; unmatched values get `NaN`.

|    ID  |  Type   |   Time   |
| :------| ------- |--------: |
|    1   |   A     |  15 min  |
|    3   |   C     |  16 min  |
|    4   | **NaN** |  17 min  |
```
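
The tables above can be reproduced with a short sketch in plain `pandas`. The variable names `dataset_1`, `dataset_2` and `results` are chosen here for illustration only. The row order of the `outer` result may differ from the table above, since `pandas` may sort the keys.

```python
import pandas as pd

# The two example tables from the dropdown above.
dataset_1 = pd.DataFrame({"ID": [1, 3, 6], "Type": ["A", "C", "D"]})
dataset_2 = pd.DataFrame(
    {"ID": [1, 3, 4], "Time": ["15 min", "16 min", "17 min"]}
)

# Run the same merge once per `how` option and keep the results by name.
results = {
    how: dataset_1.merge(dataset_2, on="ID", how=how)
    for how in ("inner", "outer", "left", "right")
}

for how, table in results.items():
    print(f"how={how!r}")
    print(table, end="\n\n")
```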

**Code example: Using the `merge()` function in Python**

Load the datasets:

In [7]:
input_layer = gp.read_file(layers_path / "border.shp")
second_layer = gp.read_file(layers_path / "border.shp")

Merge using the `merge()` function:

In [8]:
merged_layer = input_layer.merge(
    second_layer, 
    on='Id',     # on: the common attribute (join key) present in both datasets
    how='inner') # how: the merge type (inner, outer, left or right)

Then check the result:

In [9]:
merged_layer

|    | Id | geometry_x | geometry_y |
|:---|:---|:-----------|:-----------|
| 0  | 0  | POLYGON ((405728.211 5659356.174, 405708.836 5... | POLYGON ((405728.211 5659356.174, 405708.836 5... |


```{admonition} Handling duplicate column names
:class: note
If both datasets have columns with the same name (other than the common attribute), GeoPandas adds suffixes `_x` and `_y` by default.

**Example**: If both datasets contain a column named `area`, the merged dataset will have `area_x` (from the first dataset) and `area_y` (from the second dataset).
```
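
A minimal sketch of the default suffixes, using two hypothetical tables that both contain an `area` column (the tables and their values are invented for illustration):

```python
import pandas as pd

# Both tables share the key "Id" and a second column named "area".
left = pd.DataFrame({"Id": [0, 1], "area": [10.0, 20.0]})
right = pd.DataFrame({"Id": [0, 1], "area": [11.0, 21.0]})

# The key column is kept once; the clashing column gets _x and _y.
merged = left.merge(right, on="Id", how="inner")

print(list(merged.columns))  # ['Id', 'area_x', 'area_y']
```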

### Customizing suffixes

To rename these suffixes within the `merge()` function, use the `suffixes` parameter:

In [10]:
merged_layer = input_layer.merge(
    second_layer,
    on='Id',
    how='inner',
    suffixes=('_inputlayer', '_2ndlayer')) # rename suffixes

Check the result (transposed with `.T` for readability):

In [13]:
merged_layer.T

|                     | 0 |
|:--------------------|:--|
| Id                  | 0 |
| geometry_inputlayer | POLYGON ((405728.2110448944 5659356.17364894, ... |
| geometry_2ndlayer   | POLYGON ((405728.2110448944 5659356.17364894, ... |


```{admonition} **Merging only selected attributes**
:class: tip, dropdown
By default, merging includes all attributes from both datasets. To merge only some of them, select the required attributes from each dataset before calling `merge()`.

List the attributes to keep inside double brackets `[[ ]]`. The common attribute (the `on` parameter) **must** be included in the selection from both layers.

Note:
- Double brackets `[[ ]]` return a table (a GeoDataFrame if the `geometry` column is selected, otherwise a plain `pandas` DataFrame), which provides the `merge()` function.
- Single brackets `[ ]` return a single column as a Series, which has no `merge()` function, so the call raises an error.
```

Example: Merging only the attributes `Id` and `geometry`

In [23]:
merged_layer = input_layer[['Id']].merge(
    second_layer[['Id','geometry']],
    on='Id',
    how='inner')
merged_layer # Check the result

|    | Id | geometry |
|:---|:---|:---------|
| 0  | 0  | POLYGON ((405728.211 5659356.174, 405708.836 5... |
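
The difference between single and double brackets can be checked with a small sketch. The table below is hypothetical and uses plain `pandas`; column selection on a GeoDataFrame follows the same bracket rules.

```python
import pandas as pd

# Hypothetical attribute table.
table = pd.DataFrame({"Id": [0, 1], "name": ["a", "b"]})

single = table["Id"]    # single brackets: one column as a Series
double = table[["Id"]]  # double brackets: a one-column DataFrame

print(type(single).__name__)     # Series
print(type(double).__name__)     # DataFrame
print(hasattr(single, "merge"))  # False
print(hasattr(double, "merge"))  # True
```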


## Merging datasets based on spatial relationships using `sjoin()`

Another way to merge datasets is by using spatial relationships instead of common attributes. The `sjoin()` function can be used to join two datasets based on their spatial relationships. 

Key parameters:
- `predicate` - Defines the type of spatial relationship.
- `how` - Specifies how the layers are combined.

````{admonition} Available operations with the **predicate** parameter
:class: tip

The following list explains available operations:
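- `contains`: Object A encloses object B: no point of B lies outside A, and the interiors of A and B share at least one point.
- `covers`: No point of object B lies outside object A. Unlike `contains`, this also holds when B lies entirely on the boundary of A.
- `within`: Object A is enclosed by object B (the inverse of `contains`).
- `covered_by`: No point of object A lies outside object B (the inverse of `covers`).
- `touches`: Objects A and B meet **only** at their boundaries; their interiors do not intersect.
- `overlaps`: Objects A and B have the same dimension and share some, but not all, of their interior points.
- `crosses`: The interiors of objects A and B intersect, but neither object contains the other, for example a line that passes through a polygon.
- `intersects`: Objects A and B share at least one point (they touch, cross, overlap or contain one another).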

```{figure} ../resources/13.png
:height: 200px
:name: figure-example

Available operations with the **predicate** parameter
```

Check the [ArcGIS documentation](https://desktop.arcgis.com/en/arcmap/latest/extensions/data-reviewer/types-of-spatial-relationships-that-can-be-validated.htm#GUID-B8BCA279-A7D9-422D-90B6-414B11350D1A) to learn more about spatial relationships.
````
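
The predicates can be tried out directly on `shapely` geometries, which is the geometry type stored in a GeoDataFrame. The following sketch uses a few hypothetical shapes (all names and coordinates are invented for illustration) and evaluates each predicate as `A.predicate(B)`.

```python
from shapely.geometry import LineString, Point, box

a = box(0, 0, 2, 2)                    # square from (0, 0) to (2, 2)
inner_box = box(0.5, 0.5, 1, 1)        # lies strictly inside a
edge_point = Point(0, 1)               # lies on the boundary of a
neighbor = box(2, 0, 3, 2)             # shares only the edge x = 2 with a
shifted = box(1, 1, 3, 3)              # shares part of its area with a
line = LineString([(-1, 1), (3, 1)])   # passes through a

checks = {
    "a contains inner_box": a.contains(inner_box),
    "inner_box within a": inner_box.within(a),
    "a contains edge_point": a.contains(edge_point),
    "a covers edge_point": a.covers(edge_point),
    "a touches neighbor": a.touches(neighbor),
    "a overlaps neighbor": a.overlaps(neighbor),
    "a overlaps shifted": a.overlaps(shifted),
    "line crosses a": line.crosses(a),
    "a intersects neighbor": a.intersects(neighbor),
}

for description, result in checks.items():
    print(f"{description}: {result}")
```

The point on the boundary shows the difference between `contains` and `covers`: it is covered by the square but not contained in it.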

````{admonition} Available operations with the **how** parameter
:class: tip

The `how` parameter determines how the two datasets are merged:

- `left`: Keeps all records from the left dataset (first dataset) and adds matching records from the right dataset. The index comes from the left dataset.
- `right`: Keeps all records from the right dataset (second dataset) and adds matching records from the left dataset. The index comes from the right dataset.
- `inner`: Keeps only the records from both datasets for which the spatial condition is met. The index comes from the left dataset.

Check the [GeoPandas documentation](https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html) to learn more about the `sjoin()` function.

````
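
As a rough sketch of how the `how` options differ, the following example joins two tiny, hypothetical layers: two points, of which only one falls inside a polygon, and two polygons, of which only one contains a point. The layer names, attribute names and coordinates are assumptions made for illustration.

```python
import geopandas as gp
from shapely.geometry import Point, box

# Hypothetical left layer: one point inside a polygon, one outside.
points = gp.GeoDataFrame(
    {"name": ["inside", "outside"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
)

# Hypothetical right layer: only the first polygon contains a point.
polygons = gp.GeoDataFrame(
    {"zone": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(10, 10, 11, 11)],
)

# Run the same spatial join once per `how` option.
joins = {
    how: points.sjoin(polygons, predicate="intersects", how=how)
    for how in ("left", "right", "inner")
}

for how, result in joins.items():
    print(how, len(result), list(result.columns))
```

With `left` and `inner`, the index of the matched polygon is stored in the `index_right` column; with `right`, the index of the matched point is stored in `index_left`.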


```{admonition} Handling duplicate column names
:class: note, dropdown

If both datasets have columns with the same name, GeoPandas automatically assigns suffixes (`_left` and `_right`) to distinguish them.
To customize these suffixes, use:

- `lsuffix`: Defines a suffix for the left dataset.
- `rsuffix`: Defines a suffix for the right dataset.

```
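
A minimal sketch of the suffix handling, assuming two hypothetical layers that both have a `name` column. The suffix values `pts` and `poly` are arbitrary choices for illustration. They are passed without a leading underscore, since GeoPandas appears to add the separator itself.

```python
import geopandas as gp
from shapely.geometry import Point, box

# Both hypothetical layers contain a column called "name".
points = gp.GeoDataFrame({"name": ["p0"]}, geometry=[Point(0.5, 0.5)])
polygons = gp.GeoDataFrame({"name": ["zone A"]}, geometry=[box(0, 0, 1, 1)])

# Default suffixes.
default = points.sjoin(polygons, predicate="intersects", how="inner")

# Custom suffixes for both sides.
custom = points.sjoin(
    polygons,
    predicate="intersects",
    how="inner",
    lsuffix="pts",
    rsuffix="poly",
)

print(list(default.columns))
print(list(custom.columns))
```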

**Code example: Using the `sjoin()` function in Python**

Import and load required libraries:

In [None]:
from pathlib import Path
import geopandas as gp
import pandas as pd

Load the datasets:

In [25]:
input_layer = gp.read_file(OUTPUT / "clipped.shp")
join_layer = gp.read_file(INPUT / "layers" / "border.shp")

In [26]:
intersected = input_layer.sjoin(
    join_layer, 
    predicate='intersects', 
    how='left')

Check the column names of the result:

In [28]:
list(intersected.columns)

['KS_IS',
 'CLC_st1',
 'CLC18',
 'CLC',
 'Biotpkt201',
 'Shape_Leng',
 'Shape_Area',
 'geometry',
 'index_right',
 'Id']

In [30]:
intersected.head(3)

|    | KS_IS | CLC_st1 | CLC18 | CLC   | Biotpkt201 | Shape_Leng    | Shape_Area | geometry | index_right | Id |
|:---|:------|:--------|:------|:------|:-----------|:--------------|:-----------|:---------|:------------|:---|
| 0  | SV    | 122     |       |       | 5.271487   | 487783.286284 | 2869516.0  | POLYGON ((401808.569 5661532.707, 401859.892 5... | 0 | 0 |
| 1  |       | 231     | 231.0 | 231.0 | 10.981298  | 1385.65415    | 12351.55   | POLYGON ((403104.213 5657996.304, 403105.003 5... | 0 | 0 |
| 2  |       | 231     | 231.0 | 231.0 | 10.981298  | 2978.763179   | 42370.39   | POLYGON ((403076.477 5658033.218, 403036.719 5... | 0 | 0 |


```{admonition} **Suffixes in Spatial Join**
:class: tip, dropdown
The `index_left` and `index_right` columns in the join output are named with the default suffixes, which are customizable. Use the `lsuffix` parameter for the input layer (left GeoDataFrame) and the `rsuffix` parameter for the second layer (right GeoDataFrame). Only the suffix of interest needs to be defined.
```

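In the following example, the join type is `inner`, so the rows keep the index of the first (left) layer, and the index of the second (right) layer is stored in an additional column. The `rsuffix` parameter customizes the name of that column. GeoPandas adds the separating underscore itself, so `rsuffix='_border'` produces the column name `index__border`: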

In [32]:
intersects_result = input_layer.sjoin(
    join_layer, 
    predicate='intersects', 
    how='inner', 
    rsuffix='_border')
list(intersects_result.columns)

['KS_IS',
 'CLC_st1',
 'CLC18',
 'CLC',
 'Biotpkt201',
 'Shape_Leng',
 'Shape_Area',
 'geometry',
 'index__border',
 'Id']

It is also possible to restrict the spatial join output to selected columns from both datasets. List the columns of interest inside double brackets `[[ ]]`.

```{admonition} **Geometry Importance in Spatial Join**
:class: warning, dropdown
Since the spatial join is based on the geometry of the features, the `geometry` attribute **must** be included among the selected columns in both layers.
```

In [33]:
intersects_result = input_layer[['KS_IS', 'CLC','geometry']] \
    .sjoin(
        join_layer[['Id','geometry']],
        predicate='intersects', how='right')

list(intersects_result.columns)

['index_left', 'KS_IS', 'CLC', 'Id', 'geometry']
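
Why the `geometry` column has to be part of the selection can be seen in a small sketch with a hypothetical one-row layer (its name and attributes are invented for illustration): without the geometry column, the selection is a plain `pandas` DataFrame, which has no `sjoin()` method.

```python
import geopandas as gp
from shapely.geometry import Point

# Hypothetical layer with two attributes and a geometry column.
layer = gp.GeoDataFrame({"Id": [0], "name": ["a"]}, geometry=[Point(0, 0)])

with_geometry = layer[["Id", "geometry"]]  # still a GeoDataFrame
without_geometry = layer[["Id", "name"]]   # plain pandas DataFrame

print(isinstance(with_geometry, gp.GeoDataFrame))     # True
print(isinstance(without_geometry, gp.GeoDataFrame))  # False
print(hasattr(without_geometry, "sjoin"))             # False
```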

The second way to perform a spatial join is the `gp.sjoin()` function from `geopandas`. The parameters are the same as for the `sjoin()` method shown above; the only difference is that both datasets are passed to the function as arguments.

In [34]:
intersects_result = gp.sjoin(
    input_layer, join_layer, 
    how='right',
    predicate='intersects', 
    lsuffix='_left')
list(intersects_result.columns)

['index__left',
 'KS_IS',
 'CLC_st1',
 'CLC18',
 'CLC',
 'Biotpkt201',
 'Shape_Leng',
 'Shape_Area',
 'Id',
 'geometry']

Here, too, specific columns can be selected with double brackets `[[ ]]`. Because this is a spatial join, the `geometry` attribute **must** be included in the selection from both layers.

In [35]:
intersects_result = gp.sjoin(
    input_layer[['KS_IS', 'CLC','geometry']],
    join_layer[['Id','geometry']],
    how='right', 
    predicate='intersects', 
    lsuffix='_left')
list(intersects_result.columns)

['index__left', 'KS_IS', 'CLC', 'Id', 'geometry']
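
That the method and the function give the same result can be checked with a short sketch on two hypothetical layers (names and coordinates are invented for illustration). The comparison below covers the columns and the attribute values, not the geometries.

```python
import geopandas as gp
from shapely.geometry import Point, box

# Hypothetical layers: two points and one polygon.
points = gp.GeoDataFrame(
    {"name": ["p0", "p1"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
)
polygons = gp.GeoDataFrame({"zone": ["A"]}, geometry=[box(0, 0, 1, 1)])

# Method form: called on the left layer.
method_result = points.sjoin(polygons, predicate="intersects", how="inner")

# Function form: both layers are passed as arguments.
function_result = gp.sjoin(
    points, polygons, predicate="intersects", how="inner"
)

same_columns = list(method_result.columns) == list(function_result.columns)
same_attributes = method_result.drop(columns="geometry").equals(
    function_result.drop(columns="geometry")
)

print(same_columns, same_attributes)
```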