# NETWORKS TRANSFORMATION

This notebook contains the functions needed to transform the networks of the tracing use case and to prepare the data for harmonization.

Datasets Needed:

- Hydro-network dataset
- Sewer network dataset / Discharge Points dataset

## Expected Outputs

- Connection nodes
- Discharge points
- Start and end nodes for water
- Water dataset with start and end ID
- Split water dataset with new start and end ids
- Fully connected water dataset

In [1]:
import os
import sys
path = os.path.dirname(os.path.abspath(''))
os.chdir(path)
print(path)
sys.path.insert(0, path)

C:\workdir\develop\repository\go-peg


In [2]:
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, LineString, MultiPoint, MultiLineString
from shapely import wkt
from shapely.ops import nearest_points
import shapely.wkt
import numpy as np

import random


import warnings
from shapely.errors import ShapelyDeprecationWarning
warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)
pd.options.mode.chained_assignment = None  # default='warn'


from src.config import config




C:\workdir\develop\repository\go-peg


# 1. Prepare Water Network
The water network is received as an edge-only network with no nodes. Here, we generate hydro-nodes from the beginning and end points of each linestring geometry and assign them unique ids, which are then added to the water segments as begin and end nodes.

The following steps are performed:

## 1.1. CRS
Assign the coordinate reference system as a global variable to use throughout the application.

In [3]:
PROJ_CRS = "EPSG:31370"

## 1.2. Load Water Data
Load the data into a dataframe. GeoPandas can read various formats into a dataframe. Here, both shapefiles and GML data are used.

Set the CRS to the project CRS.

In [4]:
def load_data(path, PROJ_CRS):
    """
    Loads the data from the given path,
    and prints the shape and crs of the data.
    """
    data = gpd.read_file(path)
    print(data.shape)
    #print("Original crs:", data.crs)
    data = data.to_crs(PROJ_CRS)
    print("Project crs:", data.crs)
    data = data.drop_duplicates(subset=["geometry"]).reset_index(drop=True)
    return data


path = config.data_src / "flanders_hydro_network/Wlas.shp"

water_data = load_data(path, PROJ_CRS)

(63767, 19)
Project crs: EPSG:31370


## 1.3. Turn multiline water network into single line water network

Water segment geometries are sometimes stored as a MultiLineString, that is, as a collection of LineString parts. The nested structure makes them harder to manipulate programmatically, so each MultiLineString is converted into a single LineString by 'flattening' it: the coordinates of its parts are concatenated, in order, into one coordinate sequence.

This ensures we can extract begin and end points of a water segment.
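
As a rough illustration of the flattening idea, the sketch below uses plain coordinate tuples instead of Shapely geometries. The helper name `flatten_parts` and the sample coordinates are assumptions made for this example only; they are not part of the notebook.

```python
from itertools import chain


def flatten_parts(parts):
    """Concatenate the coordinate lists of all parts into one list.

    Hypothetical helper: `parts` stands in for the parts of a MultiLineString.
    """
    return list(chain.from_iterable(parts))


# a "multilinestring" written as a nested list: two parts, two vertices each
multi = [[(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (2.0, 1.0)]]
single = flatten_parts(multi)

# once flattened, begin and end points are simply the first and last vertex
begin, end = single[0], single[-1]
print(single)
print(begin, end)
```

Note that a vertex shared by two consecutive parts appears twice in the flattened list; this sketch does not remove such repeats.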

In [5]:
# Check for multiline strings in a dataset
def check_multiline(df):
    """This function checks for multiline strings
    from the geometry column in a given dataset"""
    lst = df["geometry"].to_list()
    multiline_count = 0
    for item in lst:
        if isinstance(item, MultiLineString):
            multiline_count += 1
    print("MultiLinesStrings:", multiline_count)


# filter out multilinestring dataset
def multiline_to_linestring(df):
    # filter out multilinestring dataset
    multiline_df = df[df["geometry"].apply(lambda x: isinstance(x, MultiLineString))]
    linestrings_df = df[df.geom_type == "LineString"]
    if len(linestrings_df) == len(df):
        print("No multiline strings found")
        return df

    else:
        print("Checking for multiline strings...")
        check_multiline(df)
        # turn multilinestrings into linestrings
        linestrings = []
        for idx, row in multiline_df.iterrows():
            inlines = row.geometry
            # iterate over the parts via .geoms; iterating a multi-part
            # geometry directly was removed in Shapely 2.0
            outcoords = [list(item.coords) for item in inlines.geoms]
            outline = shapely.geometry.LineString(
                [i for sublist in outcoords for i in sublist]
            )
            # outline_geom = shapely.wkt.dumps(outline)
            linestrings.append(outline)

        # add  linestrings to dataframe and drop original geom column
        multiline_df["exploded"] = linestrings
        multiline_df = (
            multiline_df.drop(["geometry"], axis=1)
            .rename(columns={"exploded": "geometry"})
            .reset_index(drop=True)
        )
        multiline_gdf = gpd.GeoDataFrame(
            multiline_df, geometry="geometry", crs=PROJ_CRS
        )

        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        gdf = pd.concat([linestrings_df, multiline_gdf]).reset_index(drop=True)
        print("Checking for multiline strings after...")
        check_multiline(gdf)

    return gdf

In [6]:
water_data = multiline_to_linestring(water_data)

Checking for multiline strings...
MultiLinesStrings: 3


Checking for multiline strings after...
MultiLinesStrings: 0


## 1.4. Generate begin and end nodes

Get begin and end point geometries by extracting the first and the last point geometries of a linestring.

In [7]:
def add_beginpoints(df):
    startnodes_gdf = df.copy()
    lst = startnodes_gdf["geometry"].to_list()
    beginpoints = []
    for item in lst:
        first = Point(item.coords[0])
        first_precise = shapely.wkt.dumps(first)
        beginpoints.append(first_precise)

    startnodes_gdf["start_point"] = [wkt.loads(g) for g in beginpoints]
    startnodes_gdf = startnodes_gdf.drop(["geometry"], axis=1).rename(
        columns={"start_point": "geometry"}
    )

    startnodes_gdf = gpd.GeoDataFrame(
        startnodes_gdf, geometry=startnodes_gdf["geometry"], crs=PROJ_CRS
    )  # .drop(columns=[col])
    return startnodes_gdf


def add_endpoints(df):
    endnodes_gdf = df.copy()
    lst = endnodes_gdf["geometry"].to_list()
    endpoints = []
    for item in lst:
        last = Point(item.coords[-1])
        last_precise = shapely.wkt.dumps(last)
        endpoints.append(last_precise)

    endnodes_gdf["end_point"] = [wkt.loads(g) for g in endpoints]
    endnodes_gdf = endnodes_gdf.drop(["geometry"], axis=1).rename(
        columns={"end_point": "geometry"}
    )

    endnodes_gdf = gpd.GeoDataFrame(
        endnodes_gdf, geometry=endnodes_gdf["geometry"], crs=PROJ_CRS
    )  # .drop(columns=[col])
    return endnodes_gdf

In [8]:
startnodes_gdf = add_beginpoints(water_data)
endnodes_gdf = add_endpoints(water_data)

#### Note 1:
Assert statements are used to test whether the generated results match the expected results.

In [9]:
assert startnodes_gdf.shape == endnodes_gdf.shape

## 1.5. Document the nodes

After the nodes have been created, merge the start-node and end-node dataframes on their geometry (an outer join) and drop duplicate geometries, so that each distinct point becomes exactly one node.

These nodes have a sequentially generated id, with a chosen prefix to make it a unique node identifier.
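
The id scheme can be sketched without GeoPandas. In the example below, `build_node_ids` is a hypothetical helper and the coordinates are invented; it numbers endpoints in the order they are first seen, which may differ from the order produced by the merge in the notebook.

```python
def build_node_ids(lines, prefix):
    """Map every distinct endpoint to a prefixed, sequential id.

    Hypothetical helper: `lines` is a list of coordinate lists.
    """
    node_ids = {}
    for coords in lines:
        for endpoint in (coords[0], coords[-1]):
            # an endpoint shared by several lines gets a single id
            if endpoint not in node_ids:
                node_ids[endpoint] = f"{prefix}{len(node_ids) + 1}"
    return node_ids


lines = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(2.0, 0.0), (3.0, 1.0)],
]
node_ids = build_node_ids(lines, "VL")
print(node_ids)
```

The point `(2.0, 0.0)` is the end of the first line and the start of the second, so it receives one id only.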


In [10]:
def get_nodes(id_col, region):
    nodes_geom = pd.merge(
        startnodes_gdf[[id_col, "geometry"]],
        endnodes_gdf[[id_col, "geometry"]],
        on="geometry",
        how="outer",
    ).reset_index(drop=True)
    unique_id_df = (
        nodes_geom[["geometry"]].drop_duplicates().reset_index().drop(columns=["index"])
    )
    assert len(unique_id_df) == nodes_geom.geometry.nunique()

    unique_id_df["New_ID"] = range(1, len(unique_id_df) + 1)
    unique_id_df["node_id"] = region + unique_id_df["New_ID"].astype(str)
    gdf = gpd.GeoDataFrame(
        unique_id_df, geometry=unique_id_df["geometry"], crs=PROJ_CRS
    ).drop(columns=["New_ID"])
    return gdf

In [11]:
water_nodes_df = get_nodes("VHAS", "VL")
assert len(water_nodes_df) == water_nodes_df.geometry.nunique()

In [12]:
water_nodes_df.sample(5)

Unnamed: 0,geometry,node_id
48632,POINT (108527.365 203553.217),VL48633
59525,POINT (33078.766 194499.750),VL59526
36408,POINT (244447.252 181532.100),VL36409
35680,POINT (184744.374 233515.233),VL35681
32811,POINT (186473.114 187338.428),VL32812


In [13]:
# water_nodes_df.to_file(r"data_transform\\vl_water_nodes.shp")

## 1.6. Add the nodes to water segments, and create start and end id columns

Using the sjoin method, map the nodes onto the linestrings to identify water segment start nodes and end nodes. Label nodes as either start_id or end_id in the water dataframe.
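
Conceptually, this step is a lookup from each segment's first and last vertex to a node id. The sketch below swaps the spatial join for an exact coordinate match in a plain dictionary, which is an assumption made to keep the example dependency-free; the segment ids and coordinates are invented.

```python
# node lookup: coordinate -> node id (invented values)
node_ids = {(0.0, 0.0): "VL1", (2.0, 0.0): "VL2", (3.0, 1.0): "VL3"}

# water segments keyed by a made-up VHAS code
segments = {
    101: [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    102: [(2.0, 0.0), (3.0, 1.0)],
}

# label the node at the first vertex as start_ID and at the last as end_ID
edges = [
    {"VHAS": vhas, "start_ID": node_ids[coords[0]], "end_ID": node_ids[coords[-1]]}
    for vhas, coords in segments.items()
]
for edge in edges:
    print(edge)
```

Segment 101 ends where segment 102 starts, so both refer to the same node id, which is what makes the edge list traversable as a network.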

In [14]:
def add_ids_to_edges():
    # Label nodes as either start_id or end_id
    startnodes_merged = (
        gpd.sjoin(startnodes_gdf, water_nodes_df, how="left")
        .rename(columns={"node_id": "start_ID"})
        .drop("index_right", axis=1)
    )
    endnodes_merged = (
        gpd.sjoin(endnodes_gdf, water_nodes_df, how="left")
        .rename(columns={"node_id": "end_ID"})
        .drop("index_right", axis=1)
    )

    nodes_geom = pd.merge(startnodes_merged, endnodes_merged, on="VHAS")

    nodes = nodes_geom[["VHAS", "start_ID", "end_ID"]]

    water_edges_nodes = pd.merge(
        water_data, nodes, left_on="VHAS", right_on="VHAS"
    )  # .drop('id', axis=1)
    return water_edges_nodes


water_final = add_ids_to_edges()
assert water_final.VHAS.nunique() == water_final.geometry.nunique()

In [15]:
# water_final.to_file(r"data_transform\\vl_water_edges.shp")
#![2.water_nodes.PNG](attachment:2.water_nodes.PNG)

# 2. Prepare Sewer Network

Load the sewer network edges and nodes files.
If there is no sewer network, then load discharge points.

In this example dataset, the sewer network dataset consists of both nodes and edges.

Some networks contain only nodes (discharge points). When only the discharge points are available, they are used to find the connection points on the water network, as demonstrated below.


## 2.1. Load sewer data

In [16]:
# path = data_src / "flanders_sewernetwork/Streng.shp"
sewer_edges = load_data(config.data_src / "flanders_sewernetwork/Streng.shp", PROJ_CRS)

# path = data_src / "flanders_sewernetwork/Hydpnt.shp"
hpoint_data = load_data(config.data_src / "flanders_sewernetwork/Hydpnt.shp", PROJ_CRS)

# path = data_src / "flanders_sewernetwork/Koppnt.shp"
koppnt_data = load_data(config.data_src / "flanders_sewernetwork/Koppnt.shp", PROJ_CRS)

(327212, 28)
Project crs: EPSG:31370
(36190, 22)
Project crs: EPSG:31370
(336225, 5)
Project crs: EPSG:31370


### 2.1.1.  Merge the nodes datasets

This dataset comes with two node datasets. To work with properties from both, merge the koppnt_data and hpoint_data dataframes.

In [17]:
merged_sewer_nodes = pd.merge(
    koppnt_data, hpoint_data, left_on="NRKPNT", right_on="CODEKOPPNT", how="left"
)
# koppnt_data.to_file(r"data_transform\\VL_koppnt.shp")

### 2.1.2. Save sewer network nodes and edges to file.

In [18]:
# sewer_edges.to_file(r"data_transform\\VL_sewer_edges.shp")
# data_hpoint_gml.to_file(r"data_transform\\VL_hydpnt.shp")
# koppnt_data.to_file(r"data_transform\\VL_koppnt.shp")

## 2.2. Expose external nodes

Extract the external nodes from the sewer network. External nodes are nodes that are the end point of a sewer segment but not the begin point of any other segment, which indicates that they empty into the river network. They are found by comparing the attributes 'BEGINKPNT' and 'EINDKPNT', which hold the node codes of each segment's begin and end nodes.
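
The set difference behind this step can be illustrated on a toy edge list. The node codes below are made up, and plain tuples stand in for the BEGINKPNT and EINDKPNT columns.

```python
# each sewer edge as (begin_node, end_node); codes are invented for illustration
edges = [("K1", "K2"), ("K2", "K3"), ("K4", "K3"), ("K5", "K6")]

begin_nodes = {begin for begin, _ in edges}
end_nodes = {end for _, end in edges}

# nodes where a segment ends but no segment begins
external = sorted(end_nodes - begin_nodes)
print(external)  # → ['K3', 'K6']
```

`K2` is excluded because another segment starts there, whereas nothing continues from `K3` or `K6`.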

In [19]:
def find_external_nodes(df, begin_col, end_col):

    """This function extracts the endpoints of a sewer segment
    that are not beginpoints of another sewer segment"""

    beginpoints = df[begin_col].to_list()
    endpoints = df[end_col].to_list()
    beginpoints_set = set(beginpoints)
    endpoints_set = set(endpoints)
    external_nodes = list(endpoints_set - beginpoints_set)
    return external_nodes

In [20]:
external_nodes = find_external_nodes(sewer_edges, "BEGINKPNT", "EINDKPNT")

ext_nodes_df = (
    hpoint_data.query("CODEKOPPNT in @external_nodes")
    .query("VHAS != 0")
    .drop_duplicates(subset="CODEKOPPNT")
    .drop(columns=["geometry"])
    .merge(koppnt_data[["NRKPNT", "geometry"]], left_on="CODEKOPPNT", right_on="NRKPNT")
    .drop(columns=["NRKPNT"])
    .drop_duplicates(subset="geometry")
)

In [21]:
assert ext_nodes_df.geometry.nunique() == ext_nodes_df.CODEKOPPNT.nunique()

ext_nodes_df.shape

(14440, 22)

In [22]:
#![3.water_sewer.PNG](attachment:3.water_sewer.PNG)

## 2.3. Find Connection Nodes

Connection nodes are the points on the water network to which the sewer network 'connects'; in the real world, these are the places where a sewer empties into the water network.

Use a custom function to identify a connection point by projecting to the nearest point on a river from an external node.

For sewer networks made up of just discharge points, these are used in place of external nodes.

The expected output is a dataframe with sewer nodes projected onto water segments.
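
A minimal sketch of the projection idea, assuming (as `get_nearest_point` below does) that the connection point is snapped to an existing vertex of the water segment rather than interpolated along it. The helper `nearest_vertex` and the coordinates are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_vertex(point, line_coords):
    """Return the vertex of the line that is closest to the point.

    Hypothetical helper working on plain (x, y) tuples.
    """
    return min(line_coords, key=lambda vertex: math.dist(point, vertex))


river = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]  # vertices of one water segment
outlet = (6.0, 2.0)                            # an external sewer node

connection = nearest_vertex(outlet, river)
print(connection)  # → (5.0, 0.0)
```

Snapping to a vertex means the accuracy of a connection node depends on how densely the water segment is digitized.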

In [23]:
def get_nearest_point(df, line_col, point_col):
    """
    For each row, find the vertex of the line in line_col
    that is nearest to the point in point_col.
    """
    geoms = []
    for idx, row in df.iterrows():
        # candidate connection points are the vertices of the water segment
        destinations = MultiPoint(np.array(row[line_col].coords))
        nearest_geoms = nearest_points(row[point_col], destinations)
        try:
            for coord in destinations.geoms:
                if coord == nearest_geoms[1]:
                    geoms.append(coord)
                    # keep one result per row, even if a vertex is repeated
                    break
        except ValueError:
            print("No nearest point found for {}".format(row.CODEKOPPNT))
    return geoms

In [24]:
assert len(water_final) == water_final.VHAS.nunique()

In [25]:
water_final_cols = ["VHAS", "geometry"]
ext_nodes_cols = ["NRHPNT", "CODEKOPPNT", "VHAS", "geometry"]
sewer_water_df = (
    ext_nodes_df[ext_nodes_cols]
    .merge(water_final[water_final_cols], on="VHAS", how="left")
    .drop_duplicates(subset="geometry_x", keep="first")
    .query("geometry_y.notnull()")
    .assign(new_points=lambda x: get_nearest_point(x, "geometry_y", "geometry_x"))
)
print(sewer_water_df.shape)

sewer_water_df = gpd.GeoDataFrame(
    sewer_water_df, geometry="new_points", crs=PROJ_CRS
).drop_duplicates(subset="new_points")

conn_node_cols = ["NRHPNT", "CODEKOPPNT", "VHAS", "new_points"]
water_cols = ["VHAS", "CODEKOPPNT", "geometry_y"]
connection_nodes_df = (
    sewer_water_df[conn_node_cols]
    .rename(columns={"new_points": "geometry"})
    .reset_index(drop=True)
)

connection_nodes_gdf = gpd.GeoDataFrame(
    connection_nodes_df, geometry="geometry", crs=PROJ_CRS
)
print("Connection_nodes_df: ", connection_nodes_gdf.shape)

water_df = (
    sewer_water_df[water_cols]
    .rename(columns={"geometry_y": "geometry"})
    .reset_index(drop=True)
)
water_gdf = gpd.GeoDataFrame(water_df, geometry="geometry", crs=PROJ_CRS)
print("Water_df: ", water_gdf.shape)

(13904, 6)
Connection_nodes_df:  (11699, 4)
Water_df:  (11699, 3)


In [26]:
print(sewer_water_df.shape)
sewer_water_df.head(2)

(11699, 6)


Unnamed: 0,NRHPNT,CODEKOPPNT,VHAS,geometry_x,geometry_y,new_points
0,6025960.0,24195613725_1,6033745,POINT (49632.228 215175.512),"LINESTRING (49842.161 215096.866, 49720.629 21...",POINT (49720.629 215107.628)
1,6028829.0,7165620_1,6002399,POINT (63142.598 172228.907),"LINESTRING (64485.050 172900.574, 64483.883 17...",POINT (63142.133 172229.301)


In [27]:
connection_nodes_gdf.head(2)

Unnamed: 0,NRHPNT,CODEKOPPNT,VHAS,geometry
0,6025960.0,24195613725_1,6033745,POINT (49720.629 215107.628)
1,6028829.0,7165620_1,6002399,POINT (63142.133 172229.301)


In [28]:
# nodes_gdf_upload = nodes_gdf.drop(columns=['coords'])
# connection_nodes_gdf.to_file(r"data_transform\\vl_connection_nodes_PROCESSED.shp")

In [29]:
#![4.connection_nodes.PNG](attachment:4.connection_nodes.PNG)

## 2.4. Join the sewer network to the water network

This is done by converting the linestring and point geometries into coordinate tuples. The split points on a water segment are then the vertices of its linestring whose coordinates match those of a connection node.

In [30]:
def get_point_coords(gdf):

    """Returns each point geometry as an (x, y) tuple"""

    return gdf.geometry.apply(lambda geom: (geom.x, geom.y))


def get_line_coords(line):

    """Returns a list of tuples of coordinates"""

    coords_list = []
    multi_points = MultiPoint(np.array(line.coords))

    # iterate via .geoms; direct iteration was removed in Shapely 2.0
    multi_points_list = [shapely.wkt.dumps(g) for g in multi_points.geoms]
    multi_points_geoms = [shapely.wkt.loads(i) for i in multi_points_list]
    for i in multi_points_geoms:
        long, lat = i.x, i.y
        coords_list.append((long, lat))

    return coords_list

In [31]:
# print(connection_nodes_gdf.shape)
# print(connection_nodes_gdf.geometry.nunique())
nodes_gdf = connection_nodes_gdf.copy()
nodes_gdf["coords"] = get_point_coords(nodes_gdf)

water_gdf["coords"] = water_gdf.apply(lambda row: get_line_coords(row.geometry), axis=1)
print(water_gdf.shape)

(11699, 4)


## 2.5. Split function

The split function splits the water segments where the sewer empties into the river. One segment can have several splits. All the split segments are added to a split-segment dataframe, each carrying the unique identifier of its original segment.
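
The core idea of splitting at matching vertices can be sketched as follows. This is a simplified variant written for illustration, not the `get_line_segments` function used in this notebook, whose handling of split points at the first or last vertex differs; the name `split_at_vertices` and the coordinates are assumptions.

```python
def split_at_vertices(coords, split_points):
    """Split a vertex list at every interior vertex found in split_points.

    Hypothetical helper: split points at the first or last vertex
    leave the line intact.
    """
    split_set = set(split_points)
    cut_idx = [i for i, c in enumerate(coords[1:-1], start=1) if c in split_set]

    pieces, start = [], 0
    for i in cut_idx:
        pieces.append(coords[start:i + 1])  # piece ends at the split vertex
        start = i                           # next piece begins at the same vertex
    pieces.append(coords[start:])
    return pieces


line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
pieces = split_at_vertices(line, [(1, 0), (3, 0)])
for piece in pieces:
    print(piece)
```

Each split vertex is shared by two consecutive pieces, so the pieces stay connected end to end.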

In [32]:
def get_line_segments(l, points_list):

    idx_list = [
        i for i, item in enumerate(l) if item in points_list
    ]  # indexes of the line vertices that occur in the points list

    p = [l[i] for i in idx_list]  # get correct order of points list on the line

    super_list = []

    start_idx = 0

    # print("Index list: ", idx_list)
    if len(idx_list) == 1 and (
        p[0] == l[0] or p[0] == l[-1]
    ):  #      (i == 0 or i == len(l)-1) and len(idx_list) == 1:
        # print("One split point, at first or last index")
        line_segment = LineString(l)
        super_list.append(line_segment)

    elif len(idx_list) == 2 and (p[0] == l[0] or p[1] == l[-1]):
        # print("Two split points, at first and last index")
        line_segment = LineString(l)
        super_list.append(line_segment)

    else:
        # import pdb; pdb.set_trace()
        for i in idx_list:
            # In the case of the first coordinates of a line being a split point but there are other split points
            if i == 0 and len(idx_list) > 1:
                index_list = len(idx_list)
                # print(f"First index is a split point, with {index_list} split points")
                continue

            else:
                # print("Many split points")
                stop_idx = i + 1  # grab list elements until index i
                # print(f"stop index is {stop_idx}")
                line_list = l[start_idx:stop_idx]
                line_segment = LineString(line_list)
                super_list.append(line_segment)
                start_idx = (
                    i  # the next segment starts at the current split vertex
                )

                # super list still has one more segment to add
                if len(super_list) == len(idx_list):
                    last_segment = l[stop_idx - 1 : len(l)]
                    if stop_idx == len(l):
                        # print("Split point at end of list") # stop index goes beyond the line list
                        break
                    # n = len(l) - len(super_list)
                    # last_segment = l[stop_idx-1:len(l)] # Grab the last segments of the list from the prevous stop_idx-1, to the end of the lin len(l)
                    else:
                        # print("Split point at end of list")
                        last_segment_geom = LineString(last_segment)
                        super_list.append(last_segment_geom)
    return super_list


# pass a dataframe to the function
def split_lines(water_gdf, nodes_gdf, unique_id):

    water_no_duplicates = water_gdf.drop_duplicates(subset=unique_id)

    groups = nodes_gdf.groupby(unique_id)

    codes_list = nodes_gdf[unique_id].to_list()
    
    unique_code_list = list(set(codes_list))

    all_segments = []
    ids = []
    # counter = 0

    for num, i in enumerate(unique_code_list):
        # coordinates of all connection nodes on this water segment
        points_list = groups.get_group(i).coords.to_list()
        # coordinate list of the (single) water segment with this id
        line = water_no_duplicates[water_no_duplicates[unique_id] == i]["coords"][:1]

        line_segments = get_line_segments(*line, points_list)
        
        num_segments = len(line_segments)

        all_segments.extend(line_segments)
        # ids.append(indx)
        num_unique_ids = [i] * num_segments
        # assert len(flat_list) == len(water_no_duplicates)
        ids.extend(num_unique_ids)

    gdf_segments = gpd.GeoDataFrame(
        list(range(len(all_segments))), geometry=all_segments, crs=PROJ_CRS
    )
    gdf_segments.columns = ["index", "geometry"]
    gdf_segments[unique_id] = ids
    gdf_segments = gdf_segments.set_index("index")
    return gdf_segments

In [33]:
import time

initialTime = time.time()
splitlines_df = split_lines(water_gdf, nodes_gdf, "VHAS")
finishTime = time.time()
# print(ids)
print(splitlines_df.shape)
print(splitlines_df.crs)
print(f"Time taken: {finishTime - initialTime}")
print("********************************************************")

(15067, 2)
EPSG:31370
Time taken: 21.446630001068115
********************************************************


In [34]:
splitlines_df.head()

Unnamed: 0_level_0,geometry,VHAS
index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"LINESTRING (138979.165 172375.119, 138977.345 ...",32768
1,"LINESTRING (138977.309 172382.827, 138977.282 ...",32768
2,"LINESTRING (139426.094 170006.797, 139438.876 ...",32776
3,"LINESTRING (139710.263 170419.798, 139711.638 ...",32776
4,"LINESTRING (138841.797 171003.798, 138847.552 ...",32784


In [35]:
#![5.split_segments2.PNG](attachment:5.split_segments2.PNG)

In [36]:
# splitlines_df_upload.to_file(r"data_transform\vl_water_segments_PROCESSED.shp")

# 3. Creating new network

## 3.1. Water Nodes

### 3.1.1. Final water nodes

Merge the original water nodes with the new water nodes, which are the split points used in the previous operation. These nodes will be added back to the final water edges.

In [66]:
water_nodes_df["source"] = "water_node"

connection_nodes = (
    connection_nodes_gdf[["CODEKOPPNT", "geometry"]]
    .rename(columns={"CODEKOPPNT": "node_id"})
    .assign(source="connection_node")
)

final_nodes_combined = (
    pd.concat([water_nodes_df, connection_nodes])
    .drop_duplicates(subset="geometry", keep="first")
    .reset_index(drop=True)
)

In [67]:
final_nodes_combined

Unnamed: 0,geometry,node_id,source
0,POINT (177317.033 187108.927),VL1,water_node
1,POINT (175948.922 187590.860),VL2,water_node
2,POINT (168312.751 188947.734),VL3,water_node
3,POINT (190287.875 162834.403),VL4,water_node
4,POINT (177620.500 182754.219),VL5,water_node
...,...,...,...
69755,POINT (221539.327 180803.419),6002935_1,connection_node
69756,POINT (218308.622 189506.152),7184778_1,connection_node
69757,POINT (224393.703 184071.719),7177561_1,connection_node
69758,POINT (217762.589 183421.889),6003589_1,connection_node


### 3.1.2. Final node ids

The original water nodes have sequentially generated ids. After they are merged with the connection nodes, new ids for the connection nodes are generated by continuing the same sequence. A column called sewernode_id is retained to indicate which water nodes have a corresponding sewer node in the dataset.
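
The numbering scheme can be sketched with plain lists. The ids below are small invented samples, and the dictionary layout is an assumption chosen to mirror the node_id and sewernode_id columns.

```python
prefix = "VL"
water_ids = ["VL1", "VL2", "VL3"]        # existing water node ids (sample)
sewer_ids = ["7165620_1", "6002935_1"]   # sewer node codes of connection nodes (sample)

# continue numbering after the highest existing water node id
last = max(int(node_id[len(prefix):]) for node_id in water_ids)

connection_nodes = [
    {"node_id": f"{prefix}{last + n}", "sewernode_id": sewer_id}
    for n, sewer_id in enumerate(sewer_ids, start=1)
]
for node in connection_nodes:
    print(node)
```

Keeping the sewer code in a separate field preserves the link back to the sewer dataset after the node receives its new water-network id.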

In [40]:
def add_sewernode_id(row):
    if row["source"] == "connection_node":
        return row["node_id"]
    else:
        return None


def get_water_nodes(df, prefix):
    df["sewernode_id"] = df.apply(add_sewernode_id, axis=1)

    conn_df = df[df["source"] == "connection_node"].copy()
    water_nodes = df.loc[df["source"] == "water_node"]

    # continue numbering after the highest existing water node id
    nodes_list = water_nodes["node_id"].to_list()
    start_num = max(int(i[len(prefix):]) for i in nodes_list)

    # number of connection nodes that need a new id
    diff = len(conn_df)

    conn_df["node_id"] = range((start_num + 1), (start_num + diff + 1))
    conn_df["node_id"] = prefix + conn_df["node_id"].astype(str)

    nodes_all = pd.concat([water_nodes, conn_df])
    nodes_all_gdf = gpd.GeoDataFrame(nodes_all, geometry="geometry", crs=PROJ_CRS)

    return nodes_all_gdf

In [41]:
waternodes = get_water_nodes(final_nodes_combined, "VL")

In [42]:
waternodes

Unnamed: 0,geometry,node_id,source,sewernode_id
0,POINT (177317.033 187108.927),VL1,water_node,
1,POINT (175948.922 187590.860),VL2,water_node,
2,POINT (168312.751 188947.734),VL3,water_node,
3,POINT (190287.875 162834.403),VL4,water_node,
4,POINT (177620.500 182754.219),VL5,water_node,
...,...,...,...,...
69755,POINT (221539.327 180803.419),VL69756,connection_node,6002935_1
69756,POINT (218308.622 189506.152),VL69757,connection_node,7184778_1
69757,POINT (224393.703 184071.719),VL69758,connection_node,7177561_1
69758,POINT (217762.589 183421.889),VL69759,connection_node,6003589_1


### 3.1.3. Added properties

To get the final water nodes for a tracing water network, add various sewer node properties to the water nodes.

In [68]:
waternodes_final = (
    waternodes.merge(
        merged_sewer_nodes[["STATUS", "LBLTYPE", "NRKPNT"]],
        left_on="sewernode_id",
        right_on="NRKPNT",
        how="left",
    )
    .drop_duplicates(subset="geometry", keep="first")
    .reset_index(drop=True)
)

In [69]:
waternodes_final

Unnamed: 0,geometry,node_id,source,sewernode_id,STATUS,LBLTYPE,NRKPNT
0,POINT (177317.033 187108.927),VL1,water_node,,,,
1,POINT (175948.922 187590.860),VL2,water_node,,,,
2,POINT (168312.751 188947.734),VL3,water_node,,,,
3,POINT (190287.875 162834.403),VL4,water_node,,,,
4,POINT (177620.500 182754.219),VL5,water_node,,,,
...,...,...,...,...,...,...,...
69755,POINT (221539.327 180803.419),VL69756,connection_node,6002935_1,Actief,Uitlaat,6002935_1
69756,POINT (218308.622 189506.152),VL69757,connection_node,7184778_1,Gepland,Uitlaat,7184778_1
69757,POINT (224393.703 184071.719),VL69758,connection_node,7177561_1,Actief,Uitlaat,7177561_1
69758,POINT (217762.589 183421.889),VL69759,connection_node,6003589_1,Actief,Uitlaat,6003589_1


In [45]:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#    print(merged_nodes)

# merged_nodes.to_file(r"data_transform\vl_nodes_combined_V02.shp")
#merged_nodes.to_file(config.data_dest / "vl_nodes_combined_V02.shp")

## 3.2. Water Edges

### 3.2.1. Add nodes to water segments



In [46]:
def line_segments_start_end_ids(splitlines_df, all_nodes_gdf, node_id, PROJ_CRS):
    """Returns the split line segments with the ids of their start and end nodes"""
    splitlines_df["coords"] = splitlines_df.apply(
        lambda row: get_line_coords(row.geometry), axis=1
    )  
    all_nodes_gdf["coords"] = get_point_coords(all_nodes_gdf)
    # join linestrings to the nearest node, in this case the node attached to the line
    joined_lines_nodes = gpd.sjoin_nearest(
        splitlines_df, all_nodes_gdf, how="left"
    ).reset_index()
    # identify the nodes that correspond to the line start and end points
    idx_start = []
    start_id = []
    idx_end = []
    end_id = []

    for idx, row in joined_lines_nodes.iterrows():
        if row.coords_right == row.coords_left[0]:
            idx_start.append(row["index"])
            start_id.append(row[node_id])
        elif row.coords_right == row.coords_left[-1]:
            idx_end.append(row["index"])
            end_id.append(row[node_id])

    start_id_df = (
        pd.DataFrame({"line_index": idx_start, f"start_{node_id}": start_id})
        .merge(
            joined_lines_nodes[["index", node_id, "VHAS", "geometry"]],
            left_on="line_index",
            right_on="index",
            how="left",
        )
        .drop_duplicates("geometry")
    )

    end_id_df = (
        pd.DataFrame({"line_index": idx_end, f"end_{node_id}": end_id})
        .merge(
            joined_lines_nodes[["index", node_id, "VHAS", "geometry"]],
            left_on="line_index",
            right_on="index",
            how="left",
        )
        .drop_duplicates("geometry")
    )

    merged_start_end_df = (
        pd.merge(start_id_df, end_id_df, on="geometry", how="outer")
        .drop(
            [
                "line_index_x",
                "index_x",
                "node_id_x",
                "index_y",
                "line_index_y",
                "node_id_y",
                "VHAS_y",
            ],
            axis=1,
        )
        .rename(columns={"start_node_id": "start_ID", "end_node_id": "end_ID"})
    )

    return gpd.GeoDataFrame(merged_start_end_df, geometry="geometry", crs=PROJ_CRS)

In [48]:
node_id = "node_id"
splitlines_with_ids = line_segments_start_end_ids(
    splitlines_df, waternodes[["geometry", "node_id"]], node_id, PROJ_CRS
)

In [49]:
splitlines_with_ids.head()

Unnamed: 0,start_ID,VHAS_x,geometry,end_ID
0,VL55954,32768,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL66013
1,VL66013,32768,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL42787
2,VL579,32776,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL67674
3,VL67674,32776,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL580
4,VL22835,32784,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL65514


### 3.2.2. Get water segments unique ids

The new water segments still carry the original unique ids, so pieces of the same original segment share an id. New ids are assigned systematically by appending a suffix to the original id that numbers the pieces of each original segment (_1, _2, ...). Because part of the original id is retained, one can quickly tell which original segment a piece came from.
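
The suffix scheme amounts to a running count per original id. The sketch below reproduces it with a dictionary counter; `suffix_ids` is a hypothetical helper, and the input reuses a few VHAS values from the sample output above.

```python
from collections import defaultdict


def suffix_ids(original_ids):
    """Append a running per-id counter: 32768 -> 32768_1, 32768_2, ...

    Hypothetical helper mirroring a grouped cumulative count.
    """
    counts = defaultdict(int)
    new_ids = []
    for original in original_ids:
        counts[original] += 1
        new_ids.append(f"{original}_{counts[original]}")
    return new_ids


new_ids = suffix_ids([32768, 32768, 32776, 32776, 32784])
print(new_ids)
```

Because the counter restarts for every original id, the combination of original id and suffix is unique across all pieces.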

In [50]:
# get unique line segments ids
def get_unique_ID(df, col):
    """Get unique ID for each new split segment in a dataframe
    Assert that the number of unique IDs is equal to the number of split segments
    """
    # the new split lines need a new unique uniqueID value
    df["num_id"] = df.groupby(col).cumcount() + 1
    df["new_string_id"] = df[col].astype(str) + "_" + df["num_id"].astype(str)

    return df

In [51]:
splitlines_vhas = get_unique_ID(splitlines_with_ids, "VHAS_x")
assert len(splitlines_vhas) == splitlines_vhas.new_string_id.nunique()

In [52]:
splitlines_vhas

Unnamed: 0,start_ID,VHAS_x,geometry,end_ID,num_id,new_string_id
0,VL55954,32768,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL66013,1,32768_1
1,VL66013,32768,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL42787,2,32768_2
2,VL579,32776,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL67674,1,32776_1
3,VL67674,32776,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL580,2,32776_2
4,VL22835,32784,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL65514,1,32784_1
...,...,...,...,...,...,...
15062,VL64257,7012201,"LINESTRING (70890.328 219736.109, 70895.602 21...",VL56242,4,7012201_4
15063,VL34678,32672,"LINESTRING (138704.147 176812.610, 138688.060 ...",VL29687,1,32672_1
15064,VL25141,6029276,"LINESTRING (161243.920 186896.994, 161230.637 ...",VL66511,1,6029276_1
15065,VL66511,6029276,"LINESTRING (159618.409 188060.371, 159396.076 ...",VL66513,2,6029276_2


### 3.2.3. Getting water properties to water segments

Merge the split segments with the original water dataframe so that the split segments inherit the original water properties before they are joined back into the final water dataset.

In [53]:
splitlines_final = (
    splitlines_vhas.merge(water_final, left_on="VHAS_x", right_on="VHAS", how="left")
    .drop(["VHAS_x", "VHAS", "num_id", "start_ID_y", "end_ID_y", "geometry_y"], axis=1)
    .rename(
        columns={
            "new_string_id": "VHAS",
            "start_ID_x": "start_ID",
            "end_ID_x": "end_ID",
            "geometry_x": "geometry",
        }
    )
)

In [72]:
splitlines_final.head(2)

Unnamed: 0,start_ID,geometry,end_ID,VHAS,OIDN,UIDN,VHAG,NAAM,REGCODE,REGCODE1,...,CATC,LBLCATC,BEKNR,BEKNAAM,STRMGEB,GEO,LBLGEO,VHAZONENR,WTRLICHC,LENGTE
0,VL55954,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL66013,32768_1,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12
1,VL66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL42787,32768_2,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12


In [54]:
# splitlines_final = splitlines_merged.drop(['VHAS_x', 'VHAS', 'num_id', 'start_ID_y', 'end_ID_y', 'geometry_y'], axis=1)\
#                                     .rename(columns={'new_string_id': 'VHAS', 'start_ID_x':'start_ID', 'end_ID_x':'end_ID', 'geometry_x': 'geometry'})
# splitlines_final.sample(3)

### 3.2.4. Gather all linestrings into one dataset

With the split segments now joined to the water dataset to get the necessary attributes, all the linestrings can be combined into one dataset. This is done by dropping every water segment that was processed by the split function and concatenating the remainder with the splitlines_final dataset.
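
The bookkeeping of this step can be checked on a toy example. Dictionaries stand in for the dataframes here, and all ids and values are invented; the final assertion mirrors the row-count check in the function below.

```python
# original water segments keyed by id (invented values)
water = {"10": "original 10", "11": "original 11", "12": "original 12"}

# pieces produced by splitting segment "11"
split_pieces = {"11_1": "piece 1 of 11", "11_2": "piece 2 of 11"}
split_parent_ids = {"11"}

# drop the originals that were split, then add their pieces
kept = {k: v for k, v in water.items() if k not in split_parent_ids}
merged = {**split_pieces, **kept}

# originals minus split parents plus pieces must equal the merged total
assert len(water) - len(split_parent_ids) + len(split_pieces) == len(merged)
print(sorted(merged))
```

The count check catches both leftover originals and lost pieces, which is why the same comparison is asserted on the real data.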

In [55]:
def merge_segments_to_water(split_segments, split_segments_final, water_df, col):
    
    """This function merges segments to water linestrings"""
    
    # drop the linestrings to be split and merge the df with split lines
    split_segments = split_segments.astype({col: str}, errors="raise")
    split_segments_final = split_segments_final.astype({col: str}, errors="raise")
    water_df = water_df.astype({col: str}, errors="raise")
    print("water_df: ", len(water_df))

    assert (
        split_segments_final[col].nunique()
        == split_segments_final["geometry"].nunique()
    )
    assert water_df[col].nunique() == water_df["geometry"].nunique()

    linestrings_to_drop = list(set(split_segments[col].to_list()))
    print("linestrings_to_drop: ", len(linestrings_to_drop))
    print("split_segments: ", len(split_segments))

    water_df_trimmed = water_df.query(
        col + " not in @linestrings_to_drop"
    )  # .reset_index(drop=True)
    water_df_drop = water_df.query(col + " in @linestrings_to_drop")

    # merge the split lines with the original water lines
    merged_df = gpd.GeoDataFrame(
        pd.concat([split_segments_final, water_df_trimmed], ignore_index=True),
        geometry="geometry",
        crs=PROJ_CRS,
    )
    print("merged df: ", len(merged_df))
    
    assert merged_df["geometry"].nunique() == merged_df[col].nunique()
    assert (len(water_df) - len(linestrings_to_drop)) + len(split_segments) == len(merged_df)

    return merged_df

In [56]:
segments_to_water = merge_segments_to_water(
    splitlines_df, splitlines_final, water_final, "VHAS"
)

water_df:  63762
linestrings_to_drop:  6846
split_segments:  15067
merged df:  71983


### 3.2.5. Recalculate the length of the linestrings

The split segments inherit the LENGTE value of their original segment, so the length must be recalculated from the new geometries to be correct after the split.
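
For reference, the planar length of a linestring is the sum of the distances between consecutive vertices, expressed in the units of the CRS. The helper `polyline_length` and the coordinates below are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


def polyline_length(coords):
    """Sum the straight-line distances between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


piece = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
length = polyline_length(piece)
print(length)  # → 11.0
```

The two legs measure 5 and 6 units, so the lengths of the pieces of a split segment add up to the length of the original.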

In [60]:
segments_to_water["new_length"] = segments_to_water.geometry.length

In [61]:
#segments_to_water.to_file(config.data_dest / "vl_water_PROCESSED_V2.shp")

# 4. Intermodal network connection object

In [73]:
final_nodes_combined.sample(3)

Unnamed: 0,geometry,node_id,source
25812,POINT (150940.264 237750.425),VL25813,water_node
32165,POINT (162449.952 231812.328),VL32166,water_node
8982,POINT (183724.750 170346.985),VL8983,water_node


In [63]:
network_connection_object = (
    waternodes_final[["node_id", "sewernode_id"]]
    .query("sewernode_id.notnull()")
    .rename(columns={"node_id": "hydronode_id"})
    .reset_index(drop=True)
)

In [64]:
network_connection_object

Unnamed: 0,hydronode_id,sewernode_id
0,VL60772,24195613725_1
1,VL60773,7165620_1
2,VL60774,7166026_1
3,VL60775,7165606_1
4,VL60776,7165608_1
...,...,...
8984,VL69756,6002935_1
8985,VL69757,7184778_1
8986,VL69758,7177561_1
8987,VL69759,6003589_1


In [65]:
# network_connection_object.to_csv(r"data_transform\vl_network_connection_object.csv", index=False)