# NETWORKS TRANSFORMATION

This notebook contains all the functions needed to apply various transformations to the networks of the tracing use case and to prepare the data for harmonization.

Datasets Needed:

- Hydro-network dataset
- Sewer network dataset / Discharge Points dataset

## Expected Outputs

- Connection nodes
- Discharge points
- Start and end nodes for water
- Water dataset with start and end ID
- Split water dataset with new start and end ids
- Fully connected water dataset

In [1]:
import os
import sys
path = os.path.dirname(os.path.abspath(''))
os.chdir(path)
print(path)
sys.path.insert(0, path)

c:\Workdir\Develop\repository\go-peg


In [2]:
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, LineString, MultiPoint, MultiLineString
from shapely import wkt
from shapely.ops import nearest_points
import shapely.wkt
import numpy as np

import random


import warnings
from shapely.errors import ShapelyDeprecationWarning
warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)
pd.options.mode.chained_assignment = None  # default='warn'


from config import config


In [None]:
# from src.config import funcs

## 1. Prepare Water Network
A water network is received as an edge-only network with no nodes. Here, we generate hydro-nodes from the beginning and end points of each linestring geometry and assign them unique IDs, which can then be added to the water segments as start and end IDs.

The following steps are performed:

### 1.1. CRS
Assign the coordinate reference system as a global variable to use throughout the application.

In [3]:
PROJ_CRS = "EPSG:31370"

### 1.2. Load Water Data
Load the data into a dataframe. Various formats can be loaded onto a dataframe in Geopandas. Here, both shapefiles and GML data are used.

Reproject the data to the project CRS.

In [4]:
def load_data(path, PROJ_CRS):
    """
    Loads the data from the given path,
    and prints the shape and crs of the data.
    """
    data = gpd.read_file(path)
    print(data.shape)
    #print("Original crs:", data.crs)
    data = data.to_crs(PROJ_CRS)
    print("Project crs:", data.crs)
    data = data.drop_duplicates(subset=["geometry"]).reset_index(drop=True)
    return data


path = config.data_src / "flanders_hydro_network/Wlas.shp"

water_data = load_data(path, PROJ_CRS)

(63767, 19)
Project crs: EPSG:31370


### 1.3. Turn multiline water network into single line water network

The water segment geometries are sometimes stored as a MultiLineString, that is, a collection of several linestring parts. This nesting makes them harder to manipulate programmatically. Multiline geometries are therefore converted into single linestrings by 'flattening' the structure: the coordinates of all parts are concatenated, in order, into one linestring.

This ensures we can extract begin and end points of a water segment.
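The flattening idea can be sketched without shapely. This is a minimal, dependency-free illustration, not the notebook's implementation: a multilinestring is modelled as a list of parts, each part a list of `(x, y)` tuples, and the helper names `flatten_parts` and `endpoints` are hypothetical.

```python
# Plain-Python sketch: a "multilinestring" is a list of parts,
# each part being a list of (x, y) tuples. No shapely objects are used.

def flatten_parts(parts):
    """Concatenate the coordinates of all parts, in order, into one list."""
    return [pt for part in parts for pt in part]


def endpoints(coords):
    """Return the first and last coordinate of a flattened line."""
    return coords[0], coords[-1]


multiline = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(1.0, 0.0), (2.0, 1.0)],
]
flat = flatten_parts(multiline)
start, end = endpoints(flat)
print(flat)
print(start, end)
```

Note that the vertex shared by two consecutive parts appears twice in the flattened list. The sketch keeps it, because a plain concatenation does not deduplicate.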

In [2]:
# Check for multiline strings in a dataset
def check_multiline(df):
    """This function checks for multiline strings
    from the geometry column in a given dataset"""
    lst = df["geometry"].to_list()
    multiline_count = 0
    for item in lst:
        if isinstance(item, MultiLineString):
            multiline_count += 1
    print("MultiLineStrings:", multiline_count)


def multiline_to_linestring(df, geom_col):
    """Flatten every MultiLineString in `geom_col` into a single LineString
    by concatenating the coordinates of its parts, in order."""
    linestrings = []
    for idx, row in df.iterrows():
        geom = row[geom_col]
        if isinstance(geom, LineString):
            linestrings.append(geom)
        elif isinstance(geom, MultiLineString):
            # Collect the coordinates of each part, then flatten them into one list
            outcoords = [list(part.coords) for part in geom.geoms]
            outline = LineString([pt for sublist in outcoords for pt in sublist])
            linestrings.append(outline)
    # Fail early if a row was neither a LineString nor a MultiLineString
    assert len(linestrings) == len(df)
    df[geom_col] = linestrings
    print("Checking for multiline strings after...")
    check_multiline(df)
    return df

In [3]:
water_data = multiline_to_linestring(water_data, 'geometry')


### 1.4. Generate begin and end nodes

Get begin and end point geometries by extracting the first and the last point geometries of a linestring.

In [7]:
def add_beginpoints(df):
    startnodes_gdf = df.copy()
    lst = startnodes_gdf["geometry"].to_list()
    beginpoints = []
    for item in lst:
        first = Point(item.coords[0])
        first_precise = shapely.wkt.dumps(first)
        beginpoints.append(first_precise)

    startnodes_gdf["start_point"] = [wkt.loads(g) for g in beginpoints]
    startnodes_gdf = startnodes_gdf.drop(["geometry"], axis=1).rename(
        columns={"start_point": "geometry"}
    )

    startnodes_gdf = gpd.GeoDataFrame(
        startnodes_gdf, geometry=startnodes_gdf["geometry"], crs=PROJ_CRS
    )  # .drop(columns=[col])
    return startnodes_gdf


def add_endpoints(df):
    endnodes_gdf = df.copy()
    lst = endnodes_gdf["geometry"].to_list()
    endpoints = []
    for item in lst:
        last = Point(item.coords[-1])
        last_precise = shapely.wkt.dumps(last)
        endpoints.append(last_precise)

    endnodes_gdf["end_point"] = [wkt.loads(g) for g in endpoints]
    endnodes_gdf = endnodes_gdf.drop(["geometry"], axis=1).rename(
        columns={"end_point": "geometry"}
    )

    endnodes_gdf = gpd.GeoDataFrame(
        endnodes_gdf, geometry=endnodes_gdf["geometry"], crs=PROJ_CRS
    )  # .drop(columns=[col])
    return endnodes_gdf

In [8]:
startnodes_gdf = add_beginpoints(water_data)
endnodes_gdf = add_endpoints(water_data)

#### Note 1:
Assert statements are used to check that the generated results match the expected results.

In [9]:
assert startnodes_gdf.shape == endnodes_gdf.shape

### 1.5. Document the nodes

After the nodes have been created, merge the start-node and end-node dataframes on their geometry (an outer join) to obtain a single set of unique node geometries, then assign each one a node ID.
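As a rough illustration of this step, the sketch below assigns one ID to every distinct endpoint coordinate. It uses plain coordinate tuples rather than GeoDataFrames, and `build_node_ids` is a hypothetical helper. The numbering order (start and end of each line in turn) is an assumption and will differ from the order produced by the merge in the notebook.

```python
def build_node_ids(lines, region):
    """Map each distinct endpoint coordinate to an ID such as 'VL_HN1'."""
    node_ids = {}
    for coords in lines:
        # Only the first and last vertex of a line become nodes
        for pt in (coords[0], coords[-1]):
            if pt not in node_ids:
                node_ids[pt] = f"{region}_HN{len(node_ids) + 1}"
    return node_ids


lines = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(2.0, 2.0), (3.0, 2.0)],
]
node_ids = build_node_ids(lines, "VL")
print(node_ids)
```

The endpoint `(2.0, 2.0)` is shared by both lines and receives a single ID, which is what makes the two segments connected in the resulting network.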


In [10]:
def get_nodes(id_col, region):
    nodes_geom = pd.merge(
        startnodes_gdf[[id_col, "geometry"]],
        endnodes_gdf[[id_col, "geometry"]],
        on="geometry",
        how="outer",
    ).reset_index(drop=True)
    unique_id_df = (
        nodes_geom[["geometry"]].drop_duplicates().reset_index().drop(columns=["index"])
    )
    assert len(unique_id_df) == nodes_geom.geometry.nunique()

    unique_id_df["New_ID"] = range(1, len(unique_id_df) + 1)
    unique_id_df["node_id"] = (region + '_HN') + unique_id_df["New_ID"].astype(str)
    gdf = gpd.GeoDataFrame(
        unique_id_df, geometry=unique_id_df["geometry"], crs=PROJ_CRS
    ).drop(columns=["New_ID"])
    return gdf

In [11]:
water_nodes_df = get_nodes("VHAS", "VL")
assert len(water_nodes_df) == water_nodes_df.geometry.nunique()

In [12]:
water_nodes_df.sample(5)

|       | geometry                      | node_id    |
|-------|-------------------------------|------------|
| 4366  | POINT (119985.935 203401.895) | VL_HN4367  |
| 46568 | POINT (233723.400 207993.233) | VL_HN46569 |
| 23283 | POINT (187961.047 214322.640) | VL_HN23284 |
| 54511 | POINT (86579.294 210403.255)  | VL_HN54512 |
| 47946 | POINT (101552.117 191962.212) | VL_HN47947 |


### 1.6. Add the nodes to water segments, and create start and end id columns

Using the sjoin method, map the nodes onto the linestrings to identify water segment start nodes and end nodes. Label nodes as either start_id or end_id in the water dataframe.
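The labelling step amounts to a lookup from endpoint to node ID. The sketch below shows that idea with a dictionary keyed on exact coordinates. This is a simplification: the notebook uses a spatial join, and `label_edges` and the sample IDs are hypothetical.

```python
def label_edges(edges, node_ids):
    """Return one record per edge with start_ID and end_ID looked up by coordinate."""
    labelled = []
    for edge_id, coords in edges.items():
        labelled.append({
            "VHAS": edge_id,
            "start_ID": node_ids[coords[0]],   # node at the first vertex
            "end_ID": node_ids[coords[-1]],    # node at the last vertex
        })
    return labelled


node_ids = {(0.0, 0.0): "VL_HN1", (2.0, 2.0): "VL_HN2", (3.0, 2.0): "VL_HN3"}
edges = {
    101: [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    102: [(2.0, 2.0), (3.0, 2.0)],
}
labelled = label_edges(edges, node_ids)
for record in labelled:
    print(record)
```

Edge 101 ends at the node where edge 102 starts, so the two records share the ID `VL_HN2`.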

In [14]:
def add_ids_to_edges():
    # Label nodes as either start_id or end_id
    startnodes_merged = (
        gpd.sjoin(startnodes_gdf, water_nodes_df, how="left")
        .rename(columns={"node_id": "start_ID"})
        .drop("index_right", axis=1)
    )
    endnodes_merged = (
        gpd.sjoin(endnodes_gdf, water_nodes_df, how="left")
        .rename(columns={"node_id": "end_ID"})
        .drop("index_right", axis=1)
    )

    nodes_geom = pd.merge(startnodes_merged, endnodes_merged, on="VHAS")

    nodes = nodes_geom[["VHAS", "start_ID", "end_ID"]]

    water_edges_nodes = pd.merge(
        water_data, nodes, left_on="VHAS", right_on="VHAS"
    )  # .drop('id', axis=1)
    return water_edges_nodes


water_final = add_ids_to_edges()
assert water_final.VHAS.nunique() == water_final.geometry.nunique()

In [15]:
# water_final.to_file(r"data_transform\\vl_water_edges.shp")
#![2.water_nodes.PNG](attachment:2.water_nodes.PNG)

## 2. Prepare Sewer Network

Load the sewer network edges and nodes files.
If there is no sewer network, then load discharge points.

In this example dataset, the sewer network dataset consists of both nodes and edges.

Some networks contain only nodes (discharge points). That scenario is also supported, as explained a few steps further on.

STRENG
- Definition: A strand (streng) models a part of a sewer with fixed properties. Strands are linked together by coupling points, creating a network of strands.


HPOINTS
- Definition: A hydraulic point (or H-point) represents an important hydraulic point in the sewer network.
- This covers points of the following types: inlet, transfer point, overflow, pump, reservoir, outlet and purification station. An H-point is identified by the identifier assigned by the VMM, the data manager.

KOPPNT
- Definition: A coupling point (koppnt) is a point where two or more strands are connected. A coupling point is identified by the identifier assigned by the VMM, the data manager.

In [16]:
# path = data_src / "flanders_sewernetwork/Streng.shp"
sewer_edges = load_data(config.data_src / "flanders_sewernetwork/Streng.shp", PROJ_CRS)

# path = data_src / "flanders_sewernetwork/Hydpnt.shp"
data_hpoint_gml = load_data(config.data_src / "flanders_sewernetwork/Hydpnt.shp", PROJ_CRS)

# path = data_src / "flanders_sewernetwork/Koppnt.shp"
koppnt_data = load_data(config.data_src / "flanders_sewernetwork/Koppnt.shp", PROJ_CRS)

(327212, 28)
Project crs: EPSG:31370
(36190, 22)
Project crs: EPSG:31370
(336225, 5)
Project crs: EPSG:31370


Join the Koppnt and Hydpnt datasets to attach the properties of the hydraulic points to the nodes.

In [123]:
# merged_sewer_nodes = pd.merge(
#     koppnt_data, data_hpoint_gml, left_on="NRKPNT", right_on="CODEKOPPNT", how="left"
# )
joined_sewer_nodes = gpd.overlay(koppnt_data, data_hpoint_gml, how='union', keep_geom_type=False, make_valid=False)

In [125]:
drop_cols = ['OIDN_1', 'UIDN_1', 'OIDN_2', 'UIDN_2']
all_sewer_nodes = joined_sewer_nodes.drop(drop_cols, axis=1)

#Copy values from similar columns in the joined datasets
all_sewer_nodes.loc[all_sewer_nodes['NRKPNT'].isnull(), 'NRKPNT'] = all_sewer_nodes['CODEKOPPNT']
all_sewer_nodes.loc[all_sewer_nodes['RWZI_1'].isnull(), 'RWZI_1'] = all_sewer_nodes['RWZI_2']

In [134]:
all_sewer_nodes = all_sewer_nodes.convert_dtypes()

In [223]:
nodes_drop_cols = ['RWZI_1', 'RWZI_2', 'TYPE', 'UITLWAT', 'GUPPROJ']
all_sewer_nodes_final = all_sewer_nodes.drop(columns=(nodes_drop_cols))

In [225]:
all_sewer_nodes_final.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 366664 entries, 0 to 366663
Data columns (total 17 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   NRKPNT      366664 non-null  string  
 1   NRHPNT      36098 non-null   Int64   
 2   LBLTYPE     36098 non-null   string  
 3   STATUS      36098 non-null   string  
 4   CODEUITL    22191 non-null   string  
 5   LBLUITLWAT  36098 non-null   string  
 6   VHAS        36098 non-null   Int64   
 7   NAAMWTL     18540 non-null   string  
 8   CODEKOPPNT  36098 non-null   string  
 9   VRSTLLNG    36098 non-null   Int64   
 10  LBLVRSTLNG  36098 non-null   string  
 11  STARTDATUM  36098 non-null   string  
 12  STOPDATUM   36098 non-null   string  
 13  RENDATUM    36098 non-null   string  
 14  NISCODE     36098 non-null   string  
 15  GEMEENTE    36098 non-null   string  
 16  geometry    366664 non-null  geometry
dtypes: Int64(3), geometry(1), string(13)
memory usage: 48.6 MB


Save sewer network nodes and edges to file.

In [226]:
# all_sewer_nodes_final.to_file(config.data_dest / "VL_sewernodesPROCESSED.shp")

### Create connection lines between the sewer points and hydraulic points
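The idea behind the connection lines is to draw a short two-point line from each coupling point to the hydraulic point that carries the same code. A dependency-free sketch of that pairing is shown here. The function name, the sample codes and the one-hydraulic-point-per-code simplification are all assumptions made for illustration.

```python
def make_connection_lines(coupling_points, hydraulic_points):
    """Build one two-point line per coupling point that has a matching hydraulic point.

    Both arguments are dicts mapping a code to an (x, y) tuple.
    Returns a dict mapping a generated 'CONN_<n>' ID to [coupling_xy, hydraulic_xy].
    """
    lines = {}
    for code, coupling_xy in coupling_points.items():
        hydraulic_xy = hydraulic_points.get(code)
        if hydraulic_xy is None:
            continue  # no hydraulic point for this code: no connection line
        lines[f"CONN_{len(lines) + 1}"] = [coupling_xy, hydraulic_xy]
    return lines


coupling_points = {"K1": (0.0, 0.0), "K2": (5.0, 5.0), "K3": (9.0, 1.0)}
hydraulic_points = {"K1": (0.5, 0.2), "K3": (9.0, 1.5)}
conn = make_connection_lines(coupling_points, hydraulic_points)
print(conn)
```

Only `K1` and `K3` have a hydraulic point, so two lines are produced and `K2` is skipped.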

In [135]:
def make_intersection_lines(df, from_point, to_point):
    lines = []
    for index, row in df.iterrows():
        p_1 = Point(row[from_point])
        p_2 = Point(row[to_point])
        intersect = LineString([p_1, p_2])
        # linestring = loads(intersect)
        lines.append(intersect)
    return lines

In [136]:
merged_sewer_nodes = pd.merge(
    koppnt_data, data_hpoint_gml, left_on="NRKPNT", right_on="CODEKOPPNT", how="left"
)

connection_df = merged_sewer_nodes[merged_sewer_nodes['geometry_y'].notna()].copy()  # copy so the new column is not written to a slice

connection_df['connection_lines'] = make_intersection_lines(connection_df, 'geometry_x', 'geometry_y')


connection_lines_df = gpd.GeoDataFrame(
                (connection_df[['CODEKOPPNT', 'LBLTYPE', 'NAAMWTL', 'LBLUITLWAT', 'LBLVRSTLNG', 'RWZI_y', 'connection_lines']].rename(
                        columns={"connection_lines": "geometry"}
                    )
                ),
                geometry="geometry",
                crs=PROJ_CRS,
            ).reset_index(drop=True)

connection_lines_df["New_ID"] = range(1, len(connection_lines_df) + 1)
connection_lines_df["newID"] = 'CONN_' + connection_lines_df["New_ID"].astype(str)

connection_lines_df = connection_lines_df.drop('New_ID', axis=1)

connection_lines_df['fictitious'] = 'true'

In [193]:
# BEGINKPNT	EINDKPNT
connection_lines_df = connection_lines_df.rename(columns={'CODEKOPPNT': 'BEGINKPNT', 'newID':'NRSTRENG', 'RWZI_y':'RWZI'})
connection_lines_df['EINDKPNT'] = connection_lines_df.loc[:, 'BEGINKPNT']

In [194]:
connection_lines_df.columns

Index(['BEGINKPNT', 'LBLTYPE', 'NAAMWTL', 'LBLUITLWAT', 'LBLVRSTLNG', 'RWZI',
       'geometry', 'NRSTRENG', 'fictitious', 'EINDKPNT'],
      dtype='object')

In [204]:
sewer_edges_merged = pd.concat([sewer_edges, connection_lines_df], join='outer')
assert connection_lines_df.shape[0] + sewer_edges.shape[0] == sewer_edges_merged.shape[0]

In [209]:
drop_cols = ['OIDN', 'UIDN', 'LEIDING', 'VRSTLLNG', 'RWZI', 'LENGTE', 'LBLTYPE', 'NAAMWTL', 'LBLUITLWAT', 'GUPPROJ', 'STRAATNMID', 'STRAATNM', 'INW', 'FUNCTIE', 'WATER']
sewer_edges_final = sewer_edges_merged.drop(columns=(drop_cols))

In [1]:
sewer_edges_final.sample(5)


In [None]:
{'s_line_id':'NRSTRENG',
'status':'STATUS',
'start_ID':'BEGINKPNT',
'end_ID':'EINDKPNT',
's_node_id':'snode_id',
'start_date':'STARTDATUM',
'end_date':'STOPDATUM',
'ren_date':'RENDATUM',
'water_type':'LBLWATER',
'pipe_type':'LBLLEIDING',
'function':'LBLFUNCTIE'}

In [215]:
sewer_edges_final.columns

Index(['NRSTRENG', 'STATUS', 'BEGINKPNT', 'EINDKPNT', 'LBLFUNCTIE',
       'LBLLEIDING', 'LBLWATER', 'BRON', 'LBLBRON', 'STARTDATUM', 'STOPDATUM',
       'RENDATUM', 'NRHPNT', 'LBLVRSTLNG', 'PERCUITL', 'geometry',
       'fictitious'],
      dtype='object')

In [216]:
# connection_lines_df.to_file('data/test_data/sewer_connection_lines.shp')
# sewer_edges_final.to_file(config.data_dest / "VL_seweredgesPROCESSED.shp")

### 2. Expose external nodes

Expose external nodes in the sewer dataset. External nodes are end points of a strand that are not beginpoints of another strand. Here, using the attributes 'BEGINKPNT' and 'EINDKPNT' which are the node codes, we can find the external nodes.
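The definition reduces to a set difference between end codes and begin codes. A minimal sketch with made-up node codes follows; the edge tuples stand in for the `BEGINKPNT` and `EINDKPNT` columns.

```python
def external_nodes(edges):
    """Return the end codes that never occur as a begin code.

    `edges` is an iterable of (begin_code, end_code) pairs.
    """
    begins = {begin for begin, _ in edges}
    ends = {end for _, end in edges}
    return ends - begins


# A -> B -> C and D -> C: only C is an end point that starts no other strand
edges = [("A", "B"), ("B", "C"), ("D", "C")]
result = external_nodes(edges)
print(result)  # {'C'}
```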

In [19]:
def find_external_nodes(df, begin_col, end_col):

    """This function extracts the endpoints of a sewer segment
    that are not beginpoints of another sewer segment"""

    beginpoints = df[begin_col].to_list()
    endpoints = df[end_col].to_list()
    beginpoints_set = set(beginpoints)
    endpoints_set = set(endpoints)
    external_nodes = list(endpoints_set - beginpoints_set)
    return external_nodes

In [20]:
external_nodes = find_external_nodes(sewer_edges, "BEGINKPNT", "EINDKPNT")

ext_nodes_df = (
    data_hpoint_gml.query("CODEKOPPNT in @external_nodes")
    .query("VHAS != 0")
    .drop_duplicates(subset="CODEKOPPNT")
    .drop(columns=["geometry"])
    .merge(koppnt_data[["NRKPNT", "geometry"]], left_on="CODEKOPPNT", right_on="NRKPNT")
    .drop(columns=["NRKPNT"])
    .drop_duplicates(subset="geometry")
)

In [21]:
assert ext_nodes_df.geometry.nunique() == ext_nodes_df.CODEKOPPNT.nunique()

In [22]:
ext_nodes_df.shape

(14440, 22)

## FIND CONNECTION NODES

Use a custom function to identify a connection point for each external node by snapping the node to the nearest vertex of its associated water segment (matched on the VHAS code).

The expected output is a df with sewer nodes projected onto water segments.
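Snapping to the nearest vertex can be sketched with the standard library alone. The helper `nearest_vertex` is hypothetical and works on coordinate tuples. It assumes Python 3.8 or later for `math.dist`.

```python
import math


def nearest_vertex(point, line_coords):
    """Return the vertex of `line_coords` closest to `point` (Euclidean distance)."""
    return min(line_coords, key=lambda vertex: math.dist(point, vertex))


line = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
node = (6.0, 2.0)
snapped = nearest_vertex(node, line)
print(snapped)  # (5.0, 0.0)
```

The geometrically nearest location on this line would be `(6.0, 0.0)`, but that is not a vertex. Restricting the result to existing vertices means the connection point is guaranteed to appear in the line's coordinate list, which the later split step relies on.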

In [24]:
def get_nearest_point(df, line_col, point_col):
    """
    For each row, return the vertex of the line in `line_col`
    that lies nearest to the point in `point_col`.
    """
    geoms = []
    for idx, row in df.iterrows():
        # Candidate destinations are the vertices of the line
        destinations = MultiPoint(np.array(row[line_col].coords))
        # nearest_points returns a tuple: (point on first geometry, point on second geometry)
        nearest_geoms = nearest_points(row[point_col], destinations)
        try:
            for coord in destinations.geoms:
                if coord == nearest_geoms[1]:
                    geoms.append(coord)
                    break  # keep one vertex per row, even if the line repeats that vertex
        except ValueError:
            print("No nearest point found for {}".format(row.CODEKOPPNT))
    return geoms

In [25]:
assert len(water_final) == water_final.VHAS.nunique()

In [26]:
water_final_cols = ["VHAS", "geometry"]
ext_nodes_cols = ["NRHPNT", "CODEKOPPNT", "VHAS", "geometry"]
sewer_water_df = (
    ext_nodes_df[ext_nodes_cols]
    .merge(water_final[water_final_cols], on="VHAS", how="left")
    .drop_duplicates(subset="geometry_x", keep="first")
    .query("geometry_y.notnull()")
    .assign(new_points=lambda x: get_nearest_point(x, "geometry_y", "geometry_x"))
)
print(sewer_water_df.shape)

sewer_water_df = gpd.GeoDataFrame(
    sewer_water_df, geometry="new_points", crs=PROJ_CRS
).drop_duplicates(subset="new_points")

conn_node_cols = ["NRHPNT", "CODEKOPPNT", "VHAS", "new_points"]
water_cols = ["VHAS", "CODEKOPPNT", "geometry_y"]
connection_nodes_df = (
    sewer_water_df[conn_node_cols]
    .rename(columns={"new_points": "geometry"})
    .reset_index(drop=True)
)

connection_nodes_gdf = gpd.GeoDataFrame(
    connection_nodes_df, geometry="geometry", crs=PROJ_CRS
)
print("Connection_nodes_df: ", connection_nodes_gdf.shape)

water_df = (
    sewer_water_df[water_cols]
    .rename(columns={"geometry_y": "geometry"})
    .reset_index(drop=True)
)
water_gdf = gpd.GeoDataFrame(water_df, geometry="geometry", crs=PROJ_CRS)
print("Water_df: ", water_gdf.shape)

(13904, 6)
Connection_nodes_df:  (11699, 4)
Water_df:  (11699, 3)


In [27]:
print(sewer_water_df.shape)
sewer_water_df.head(2)

(11699, 6)


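|   | NRHPNT    | CODEKOPPNT    | VHAS    | geometry_x                   | geometry_y                                        | new_points                   |
|---|-----------|---------------|---------|------------------------------|---------------------------------------------------|------------------------------|
| 0 | 6025960.0 | 24195613725_1 | 6033745 | POINT (49632.228 215175.512) | LINESTRING (49842.161 215096.866, 49720.629 21... | POINT (49720.629 215107.628) |
| 1 | 6028829.0 | 7165620_1     | 6002399 | POINT (63142.598 172228.907) | LINESTRING (64485.050 172900.574, 64483.883 17... | POINT (63142.133 172229.301) |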


In [28]:
connection_nodes_gdf.head(2)

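|   | NRHPNT    | CODEKOPPNT    | VHAS    | geometry                     |
|---|-----------|---------------|---------|------------------------------|
| 0 | 6025960.0 | 24195613725_1 | 6033745 | POINT (49720.629 215107.628) |
| 1 | 6028829.0 | 7165620_1     | 6002399 | POINT (63142.133 172229.301) |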


In [29]:
# connection_nodes_gdf.to_file('../data/test_data2/connection_nodes.shp')
# nodes_gdf_upload = nodes_gdf.drop(columns=['coords'])
# connection_nodes_gdf.to_file(r"data_transform\\vl_connection_nodes_PROCESSED.shp")

In [30]:
#![4.connection_nodes.PNG](attachment:4.connection_nodes.PNG)

## Join the sewer network to the water network

This is done by transforming the linestring and point geometries into coordinates, and using these coordinates to identify split points on a water segment.

In [31]:
def get_point_coords(gdf):

    """Returns coordinates as tuples of coordinates"""

    return gdf.geometry.apply(lambda geom: (geom.x, geom.y))


def get_line_coords(line):

    """Returns a list of tuples of coordinates"""

    coords_list = []
    multi_points = np.array(line.coords)
    for i in multi_points:
        geom = Point(i)
        long, lat = geom.x, geom.y
        coords_list.append((long, lat))

    return coords_list

In [32]:
# print(connection_nodes_gdf.shape)
# print(connection_nodes_gdf.geometry.nunique())
nodes_gdf = connection_nodes_gdf.copy()
nodes_gdf["coords"] = get_point_coords(nodes_gdf)

water_gdf["coords"] = water_gdf.apply(lambda row: get_line_coords(row.geometry), axis=1)
print(water_gdf.shape)

(11699, 4)


In [89]:
nodes_gdf.shape

(11699, 5)

In [33]:
#connection_nodes_gdf.to_file()

## Split function

The split function cuts each water segment at the points where a sewer discharges into it. One segment can have several split points. All resulting segments are collected in a split-segment dataframe, each keeping the unique identifier of its original segment.
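The core of the splitting idea can be shown with a much simpler routine than the notebook's `get_line_segments`. The sketch below is a simplified alternative, not the author's function. It works on lists of coordinate tuples, splits only at interior vertices, and ignores split points that coincide with the first or last vertex. `split_at_vertices` is a hypothetical name.

```python
def split_at_vertices(coords, split_points):
    """Split a list of (x, y) vertices at every interior vertex found in split_points."""
    split_set = set(split_points)
    segments = []
    start = 0
    for i in range(1, len(coords) - 1):  # interior vertices only
        if coords[i] in split_set:
            segments.append(coords[start:i + 1])  # the split vertex closes this segment
            start = i                              # and opens the next one
    segments.append(coords[start:])  # remainder of the line
    return segments


line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
segments = split_at_vertices(line, [(0, 0), (2, 0), (3, 0)])
for seg in segments:
    print(seg)
```

Each split vertex ends one segment and starts the next, so consecutive segments stay connected. The split point at `(0, 0)` is the first vertex of the line and therefore produces no extra segment.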

In [34]:
def get_line_segments(l, points_list):

    idx_list = [
        i for i, item in enumerate(l) if item in points_list
    ]  # indexes of the line vertices that are also split points

    p = [l[i] for i in idx_list]  # get correct order of points list on the line

    super_list = []

    start_idx = 0

    # print("Index list: ", idx_list)
    if len(idx_list) == 1 and (
        p[0] == l[0] or p[0] == l[-1]
    ):  #      (i == 0 or i == len(l)-1) and len(idx_list) == 1:
        print("One split point, at first or last index")
        line_segment = LineString(l)
        super_list.append(line_segment)

    elif len(idx_list) == 2 and (p[0] == l[0] and p[1] == l[-1]):
        print("Two split points, at first and last index")
        line_segment = LineString(l)
        super_list.append(line_segment)

    else:
        # import pdb; pdb.set_trace()
        for i in idx_list:
            # In the case of the first coordinates of a line being a split point but there are other split points
            if i == 0 and len(idx_list) > 1:
                index_list = len(idx_list)
                print(f"First index is a split point, with {index_list} split points")
                continue
            elif i != 0 and len(idx_list) == 1:
                stop_idx = i + 1
                print(f"One split point at index {i}")
                print('stop_idx:', stop_idx)
                print('length of list:', len(l))
                if len(l) - stop_idx == 1:
                    last_segment = l #[stop_idx-1: len(l)+1]
                    last_segment_geom = LineString(last_segment)
                    super_list.append(last_segment_geom)
                else:
                    print(f"One split point at index {i}")
                    # stop_idx = i + 1  # grab list elements until index i
                    print(f"stop index is {stop_idx}")
                    line_list = l[start_idx:stop_idx]
                    line_segment = LineString(line_list)
                    super_list.append(line_segment)
                    print('index_list:' ,len(idx_list))
                    print('stop_idx:', stop_idx)
                    print('length of list:', len(l))
                    if len(l) > stop_idx:
                        last_segment = l[i: len(l)+1]
                        last_segment_geom = LineString(last_segment)
                        super_list.append(last_segment_geom)

            else:
                print("Many split points")
                stop_idx = i + 1  # grab list elements until index i
                print(f"stop index is {stop_idx}")

                line_list = l[start_idx:stop_idx]
                line_segment = LineString(line_list)
                super_list.append(line_segment)
                start_idx = (
                    i  # the next segment starts at this split vertex, so both segments share it
                )

                # super list still has one more segment to add
                print('Super_list:', len(super_list))
                print('index_list:' ,len(idx_list))
                print('stop_idx:', stop_idx)
                print('length of list:', len(l))
                if len(idx_list) - len(super_list)  == 1:
                    print('super list still has one more segment to add')
                    # last_segment = l[stop_idx - 1 : len(l)]
                    last_segment = l[i: len(l)+1]
                    # print('stop_idx:', stop_idx)
                    # print('length of list:', len(l))
                    if stop_idx == len(l):
                        print("Split point at end of list") # stop index goes beyond the line list
                        break
                    # n = len(l) - len(super_list)
                    # last_segment = l[stop_idx-1:len(l)] # Grab the last segments of the list from the prevous stop_idx-1, to the end of the lin len(l)
                    elif len(l) - stop_idx == 1:
                        del super_list[-1]
                        last_segment = l[i: len(l)+1]
                        last_segment_geom = LineString(last_segment)
                        super_list.append(last_segment_geom)

                    else:
                        last_segment_geom = LineString(last_segment)
                        super_list.append(last_segment_geom)
                        print("Last segment added")
                # elif len(super_list) < len(idx_list):
                #     last_segment = l[stop_idx - 1 : len(l)]

    return super_list


# pass a dataframe to the function
def split_lines(water_gdf, nodes_gdf, unique_id):

    water_no_duplicates = water_gdf.drop_duplicates(subset=unique_id)

    groups = nodes_gdf.groupby(unique_id)

    codes_list = nodes_gdf[unique_id].to_list()
    unique_code_list = list(set(codes_list))

    all_segments = []
    ids = []
    # counter = 0

    for num, i in enumerate(unique_code_list):
        points_list = groups.get_group(i).coords.to_list()
        # print("Points list: ", points_list)

        # print(points_list)
        line = water_no_duplicates[water_no_duplicates[unique_id] == i]["coords"][:1]
        # indx = water_no_duplicates[water_no_duplicates[unique_id] == i].index [0]


        line_segments = get_line_segments(*line, points_list)
        num_segments = len(line_segments)

        all_segments.extend(line_segments)
        # ids.append(indx)
        num_unique_ids = [i] * num_segments
        # get group for each unique code
        # group_ids = groups.get_group(i)[unique_id].to_list()

        # assert len(flat_list) == len(water_no_duplicates)
        ids.extend(num_unique_ids)

        print(f"Line: {i}")
        print(f'{len(line_segments)} line_segments added')
        # print(line_segments)
        # all_segments = all_segments.extend(line_segment)

    ## Create dataframe with all segments
    print(len(all_segments))
    print(len(ids))
    # ids_list = list(range(len(ids)))
    gdf_segments = gpd.GeoDataFrame(
        list(range(len(all_segments))), geometry=all_segments, crs=PROJ_CRS
    )
    # gdf_segments = gpd.GeoDataFrame(gdf_segments, geometry='geometry', crs=PROJ_CRS)
    gdf_segments.columns = ["index", "geometry"]
    gdf_segments[unique_id] = ids
    gdf_segments = gdf_segments.set_index("index")
    return gdf_segments

**Perform test**

In [35]:
# test_water = water_gdf[water_gdf['VHAS']==7021447]
# test_nodes = nodes_gdf[nodes_gdf['VHAS']==7021447]
test_water = water_gdf[water_gdf['VHAS']==6034005]
test_nodes = nodes_gdf[nodes_gdf['VHAS']==6034005]

In [36]:
test_water

|      | VHAS    | CODEKOPPNT | geometry                                          | coords                                             |
|------|---------|------------|---------------------------------------------------|----------------------------------------------------|
| 3476 | 6034005 | 7150580_1  | LINESTRING (28837.991 200433.859, 28776.999 20... | [(28837.990739999746, 200433.85949997976), (28...  |
| 6621 | 6034005 | 7180940_1  | LINESTRING (28837.991 200433.859, 28776.999 20... | [(28837.990739999746, 200433.85949997976), (28...  |
| 6622 | 6034005 | 7180965_1  | LINESTRING (28837.991 200433.859, 28776.999 20... | [(28837.990739999746, 200433.85949997976), (28...  |


In [37]:
test_nodes

|      | NRHPNT    | CODEKOPPNT | VHAS    | geometry                     | coords                                   |
|------|-----------|------------|---------|------------------------------|------------------------------------------|
| 3476 | 6025773.0 | 7150580_1  | 6034005 | POINT (28837.991 200433.859) | (28837.990739999746, 200433.85949997976) |
| 6621 | 6032039.0 | 7180940_1  | 6034005 | POINT (28387.987 200156.375) | (28387.986739999047, 200156.3754999796)  |
| 6622 | 6032047.0 | 7180965_1  | 6034005 | POINT (28421.054 200194.617) | (28421.053740000716, 200194.61719997786) |


In [38]:
test_split = split_lines(test_water, test_nodes, "VHAS")

First index is a split point, with 3 split points
Many split points
stop index is 31
Super_list: 1
index_list: 3
stop_idx: 31
length of list: 39
Many split points
stop index is 37
Super_list: 2
index_list: 3
stop_idx: 37
length of list: 39
super list still has one more segment to add
Last segment added
Line: 6034005
len([<LINESTRING (28837.991 200433.859, 28776.999 200424.313, 28759.188 200412.07...>, <LINESTRING (28421.054 200194.617, 28420.47 200191.701, 28422.998 200182.369...>, <LINESTRING (28387.987 200156.375, 28402.803 200102.672, 28423.682 200017.959)>]) line_segments added
3
3


**Apply on whole dataframe**

In [39]:
import time

initialTime = time.time()
splitlines_df = split_lines(water_gdf, nodes_gdf, "VHAS")
finishTime = time.time()
# print(ids)
print(splitlines_df.shape)
print(splitlines_df.crs)
print(f"Time taken: {finishTime - initialTime}")
print("********************************************************")

One split point at index 3
stop_idx: 4
length of list: 276
One split point at index 3
stop index is 4
index_list: 1
stop_idx: 4
length of list: 276
Line: 32768
len([<LINESTRING (138979.165 172375.119, 138977.345 172382.688, 138977.33 172382....>, <LINESTRING (138977.309 172382.827, 138977.282 172382.922, 138977.26 172382....>]) line_segments added
One split point at index 53
stop_idx: 54
length of list: 117
One split point at index 53
stop index is 54
index_list: 1
stop_idx: 54
length of list: 117
Line: 32776
len([<LINESTRING (139426.094 170006.797, 139438.876 170030.61, 139457.922 170057....>, <LINESTRING (139710.263 170419.798, 139711.638 170421.174, 139714.152 170423...>]) line_segments added
One split point at index 130
stop_idx: 131
length of list: 144
One split point at index 130
stop index is 131
index_list: 1
stop_idx: 131
length of list: 144
Line: 32784
len([<LINESTRING (138841.797 171003.798, 138847.552 171017.45, 138849.179 171024....>, <LINESTRING (139310.071 171426.947, 13

In [40]:
splitlines_df2 = splitlines_df.drop_duplicates(subset='geometry')
splitlines_df2.shape

(15153, 2)

In [41]:
# splitlines_df2.to_file('../data/test_data/test_split_full_4.shp')

In [42]:
splitlines_df2.VHAS.isna().sum()

0

In [43]:
splitlines_df2

| index | geometry                                          | VHAS    |
|-------|---------------------------------------------------|---------|
| 0     | LINESTRING (138979.165 172375.119, 138977.345 ... | 32768   |
| 1     | LINESTRING (138977.309 172382.827, 138977.282 ... | 32768   |
| 2     | LINESTRING (139426.094 170006.797, 139438.876 ... | 32776   |
| 3     | LINESTRING (139710.263 170419.798, 139711.638 ... | 32776   |
| 4     | LINESTRING (138841.797 171003.798, 138847.552 ... | 32784   |
| ...   | ...                                               | ...     |
| 15210 | LINESTRING (70886.860 219736.109, 70890.328 21... | 7012201 |
| 15211 | LINESTRING (138704.147 176812.610, 138688.060 ... | 32672   |
| 15212 | LINESTRING (161243.920 186896.994, 161230.637 ... | 6029276 |
| 15213 | LINESTRING (159618.409 188060.371, 159396.076 ... | 6029276 |


**Assign new water line ids to the new split segments**
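The numbering scheme is a running count per original segment ID, appended to that ID. A minimal standard-library sketch of the same idea follows. `number_segments` is a hypothetical helper; the notebook itself uses a pandas `groupby(...).cumcount()`.

```python
from collections import defaultdict


def number_segments(parent_ids):
    """Return '<parent>_<n>' IDs, where n counts occurrences of each parent in order."""
    counts = defaultdict(int)
    new_ids = []
    for pid in parent_ids:
        counts[pid] += 1
        new_ids.append(f"{pid}_{counts[pid]}")
    return new_ids


new_ids = number_segments([32768, 32768, 32776, 32776, 32784])
print(new_ids)  # ['32768_1', '32768_2', '32776_1', '32776_2', '32784_1']
```

Because the counter restarts for every parent ID, the combined string is unique as long as the parent IDs themselves are distinct segments.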

In [115]:
# get unique line segments ids
def get_unique_ID(df, col):
    """Get unique ID for each new split segment in a dataframe
    Assert that the number of unique IDs is equal to the number of split segments
    """
    # the new split lines need a new unique uniqueID value
    df["num_id"] = df.groupby(col).cumcount() + 1
    df["new_string_id"] = df[col].astype(str) + "_" + df["num_id"].astype(str)

    # df = df.drop(columns=['num_id', col]).rename(columns={'new_string_id': col})
    # assert df[col].nunique() == len(df)

    return df

splitlines_vhas = get_unique_ID(splitlines_df2, "VHAS")
assert len(splitlines_vhas) == splitlines_vhas.new_string_id.nunique()

In [116]:
splitlines_vhas

Unnamed: 0_level_0,geometry,VHAS,num_id,new_string_id,coords
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,"LINESTRING (138979.165 172375.119, 138977.345 ...",32768,1,32768_1,"[(138979.16464000044, 172375.1191999782), (138..."
1,"LINESTRING (138977.309 172382.827, 138977.282 ...",32768,2,32768_2,"[(138977.30854000003, 172382.8270999808), (138..."
2,"LINESTRING (139426.094 170006.797, 139438.876 ...",32776,1,32776_1,"[(139426.09394000322, 170006.7971999785), (139..."
3,"LINESTRING (139710.263 170419.798, 139711.638 ...",32776,2,32776_2,"[(139710.26303999725, 170419.79849998187), (13..."
4,"LINESTRING (138841.797 171003.798, 138847.552 ...",32784,1,32784_1,"[(138841.79733999673, 171003.7978999773), (138..."
...,...,...,...,...,...
15210,"LINESTRING (70886.860 219736.109, 70890.328 21...",7012201,4,7012201_4,"[(70886.85953999906, 219736.1094999779), (7089..."
15211,"LINESTRING (138704.147 176812.610, 138688.060 ...",32672,1,32672_1,"[(138704.14674000297, 176812.61049997993), (13..."
15212,"LINESTRING (161243.920 186896.994, 161230.637 ...",6029276,1,6029276_1,"[(161243.91973999736, 186896.99359998107), (16..."
15213,"LINESTRING (159618.409 188060.371, 159396.076 ...",6029276,2,6029276_2,"[(159618.40934000016, 188060.37079997826), (15..."


In [46]:
#![5.split_segments2.PNG](attachment:5.split_segments2.PNG)

In [47]:
# splitlines_vhas.to_file('../data/test_data2/splitlines_vhas3.shp')
# splitlines_df_upload.to_file(r"data_transform\vl_water_segments_PROCESSED.shp")

## Combine Nodes

Combine the original water nodes with the new water nodes, which are the split points used in the previous operation.
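
The combination amounts to stacking the two node tables and keeping the first row found at each location. A minimal sketch of that pattern, assuming WKT strings as a stand-in for the geometry column and made-up node ids:

```python
import pandas as pd

# Hypothetical toy tables; WKT strings stand in for real geometries.
water = pd.DataFrame({
    "geometry": ["POINT (0 0)", "POINT (1 1)"],
    "node_id": ["VL_HN1", "VL_HN2"],
    "source": "water_node",
})
connection = pd.DataFrame({
    "geometry": ["POINT (1 1)", "POINT (2 2)"],
    "node_id": ["A_1", "B_1"],
    "source": "connection_node",
})

# Water nodes are listed first, so keep="first" prefers them at shared locations.
combined = (
    pd.concat([water, connection])
    .drop_duplicates(subset="geometry", keep="first")
    .reset_index(drop=True)
)
print(combined)
```

With this ordering, a connection node that coincides with an existing water node is dropped in favour of the water node.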

In [100]:
print(water_nodes_df.head())
# assert len(water_nodes_df) == water_nodes_df.geometry.nunique()

                        geometry node_id      source
0  POINT (177317.033 187108.927)  VL_HN1  water_node
1  POINT (175948.922 187590.860)  VL_HN2  water_node
2  POINT (168312.751 188947.734)  VL_HN3  water_node
3  POINT (190287.875 162834.403)  VL_HN4  water_node
4  POINT (177620.500 182754.219)  VL_HN5  water_node


In [101]:
water_nodes_df["source"] = "water_node"

connection_nodes = (
    connection_nodes_gdf[["CODEKOPPNT", "geometry"]]
    .rename(columns={"CODEKOPPNT": "node_id"})
    .assign(source="connection_node")
)

nodes_combined = (
    pd.concat([water_nodes_df, connection_nodes])
    .drop_duplicates(subset="geometry", keep="first")
    .reset_index(drop=True)
)

print(connection_nodes.shape)
connection_nodes.head()

(11699, 3)


Unnamed: 0,node_id,geometry,source
0,24195613725_1,POINT (49720.629 215107.628),connection_node
1,7165620_1,POINT (63142.133 172229.301),connection_node
2,7166026_1,POINT (61256.557 172127.427),connection_node
3,7165606_1,POINT (65981.594 174958.909),connection_node
4,7170009_1,POINT (222673.313 204101.625),connection_node


In [102]:
nodes_combined

Unnamed: 0,geometry,node_id,source
0,POINT (177317.033 187108.927),VL_HN1,water_node
1,POINT (175948.922 187590.860),VL_HN2,water_node
2,POINT (168312.751 188947.734),VL_HN3,water_node
3,POINT (190287.875 162834.403),VL_HN4,water_node
4,POINT (177620.500 182754.219),VL_HN5,water_node
...,...,...,...
69755,POINT (221539.327 180803.419),6002935_1,connection_node
69756,POINT (218308.622 189506.152),7184778_1,connection_node
69757,POINT (224393.703 184071.719),7177561_1,connection_node
69758,POINT (217762.589 183421.889),6003589_1,connection_node


In [103]:
nodes_combined2 = nodes_combined.drop_duplicates(subset='geometry')
nodes_combined2.shape

(69760, 3)

In [104]:
def add_sewernode_id(row):
    """Return the original id for connection nodes, None for water nodes."""
    if row["source"] == "connection_node":
        return row["node_id"]
    return None


def get_water_nodes(df, prefix):
    """Renumber connection nodes as hydro nodes, continuing after the last water node id."""
    # keep the original connection node id before node_id is overwritten
    df["sewernode_id"] = df.apply(add_sewernode_id, axis=1)

    # copy so that assigning new ids does not write to a view of df
    conn_df = df[df["source"] == "connection_node"].copy()
    water_nodes = df.loc[df["source"] == "water_node"]

    # highest number used in the existing "<prefix>_HN<number>" ids
    nodes_list = water_nodes["node_id"].to_list()
    start_num = max(int(i[len(prefix) + 3:]) for i in nodes_list)

    # one new id per connection node
    diff = len(conn_df.index)

    conn_df["node_id"] = range((start_num + 1), (start_num + diff + 1))
    conn_df["node_id"] = (prefix + "_HN") + conn_df["node_id"].astype(str)

    nodes_all = pd.concat([water_nodes, conn_df])
    nodes_all_gdf = gpd.GeoDataFrame(nodes_all, geometry="geometry", crs=PROJ_CRS)

    return nodes_all_gdf

waternodes = get_water_nodes(nodes_combined, "VL")
waternodes

Unnamed: 0,geometry,node_id,source,sewernode_id
0,POINT (177317.033 187108.927),VL_HN1,water_node,
1,POINT (175948.922 187590.860),VL_HN2,water_node,
2,POINT (168312.751 188947.734),VL_HN3,water_node,
3,POINT (190287.875 162834.403),VL_HN4,water_node,
4,POINT (177620.500 182754.219),VL_HN5,water_node,
...,...,...,...,...
69755,POINT (221539.327 180803.419),VL_HN69756,connection_node,6002935_1
69756,POINT (218308.622 189506.152),VL_HN69757,connection_node,7184778_1
69757,POINT (224393.703 184071.719),VL_HN69758,connection_node,7177561_1
69758,POINT (217762.589 183421.889),VL_HN69759,connection_node,6003589_1


In [105]:
final_water_nodes = (
    waternodes.merge(
        merged_sewer_nodes[["STATUS", "LBLTYPE", "NRKPNT"]],
        left_on="sewernode_id",
        right_on="NRKPNT",
        how="left",
    )
    .drop('NRKPNT', axis=1)
    .drop_duplicates(subset="geometry", keep="first")
    .reset_index(drop=True)
)

In [106]:
final_water_nodes

Unnamed: 0,geometry,node_id,source,sewernode_id,STATUS,LBLTYPE
0,POINT (177317.033 187108.927),VL_HN1,water_node,,,
1,POINT (175948.922 187590.860),VL_HN2,water_node,,,
2,POINT (168312.751 188947.734),VL_HN3,water_node,,,
3,POINT (190287.875 162834.403),VL_HN4,water_node,,,
4,POINT (177620.500 182754.219),VL_HN5,water_node,,,
...,...,...,...,...,...,...
69755,POINT (221539.327 180803.419),VL_HN69756,connection_node,6002935_1,Actief,Uitlaat
69756,POINT (218308.622 189506.152),VL_HN69757,connection_node,7184778_1,Gepland,Uitlaat
69757,POINT (224393.703 184071.719),VL_HN69758,connection_node,7177561_1,Actief,Uitlaat
69758,POINT (217762.589 183421.889),VL_HN69759,connection_node,6003589_1,Actief,Uitlaat


In [191]:
# final_water_nodes.to_file(config.data_dest / "vl_nodes_PROCESSED.shp")



In [108]:
final_water_nodes.crs

<Derived Projected CRS: EPSG:31370>
Name: BD72 / Belgian Lambert 72
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: Belgium - onshore.
- bounds: (2.5, 49.5, 6.4, 51.51)
Coordinate Operation:
- name: Belgian Lambert 72
- method: Lambert Conic Conformal (2SP)
Datum: Reseau National Belge 1972
- Ellipsoid: International 1924
- Prime Meridian: Greenwich

In [109]:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#    print(merged_nodes)

# merged_nodes.to_file(r"data_transform\vl_nodes_combined_V02.shp")
# merged_nodes.to_file(data_dest / "vl_nodes_combined_V02.shp")

## Get start and end IDs for the line segments, to include new connection nodes

In [156]:
def line_segments_start_end_ids(splitlines_df, all_nodes_gdf, node_id, PROJ_CRS):
    """Return the line segments with the ids of their start and end nodes.

    `node_id` is the name of the id column in `all_nodes_gdf`.
    """
    splitlines_df["coords"] = splitlines_df.apply(
        lambda row: get_line_coords(row.geometry), axis=1
    )
    all_nodes_gdf["coords"] = get_point_coords(all_nodes_gdf)
    # join each linestring to its nearest nodes, in this case the nodes attached to the line
    joined_lines_nodes = gpd.sjoin_nearest(
        splitlines_df, all_nodes_gdf, how="left"
    ).reset_index()
    print(joined_lines_nodes.shape)
    print(joined_lines_nodes['source'].value_counts())
    # identify the nodes that correspond to the line start and end points
    idx_start = []
    start_id = []
    idx_end = []
    end_id = []

    for idx, row in joined_lines_nodes.iterrows():
        if row.coords_right == row.coords_left[0]:
            idx_start.append(row["index"])
            start_id.append(row[node_id])
        elif row.coords_right == row.coords_left[-1]:
            idx_end.append(row["index"])
            end_id.append(row[node_id])

    start_id_df = (
        pd.DataFrame({"line_index": idx_start, f"start_{node_id}": start_id})
        .merge(
            joined_lines_nodes[["index", node_id, "VHAS", "geometry"]],
            left_on="line_index",
            right_on="index",
            how="left",
        )
        .drop_duplicates("geometry")
    )
    print(start_id_df.shape)
    end_id_df = (
        pd.DataFrame({"line_index": idx_end, f"end_{node_id}": end_id})
        .merge(
            joined_lines_nodes[["index", node_id, "VHAS", "geometry"]],
            left_on="line_index",
            right_on="index",
            how="left",
        )
        .drop_duplicates("geometry")
    )
    print(end_id_df.shape)
    merged_start_end_df = (
        pd.merge(start_id_df, end_id_df, on="geometry", how="outer")
        .drop(
            [
                "line_index_x",
                "index_x",
                f"{node_id}_x",
                "index_y",
                "line_index_y",
                f"{node_id}_y",
                "VHAS_x",
            ],
            axis=1,
        )
        .rename(columns={f"start_{node_id}": "start_ID", f"end_{node_id}": "end_ID"})
    )

    return gpd.GeoDataFrame(merged_start_end_df, geometry="geometry", crs=PROJ_CRS)
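
The function above pairs each line with its nearest nodes and then compares coordinates. For small inputs, the same start/end assignment can be sketched as a direct coordinate lookup. This is a simplified alternative for illustration only, not the method used above: it assumes node coordinates match the line endpoints exactly, and the helper name and all values are hypothetical.

```python
# Hypothetical toy inputs: node coordinates mapped to node ids,
# and lines given as lists of (x, y) vertices.
nodes = {(0.0, 0.0): "VL_HN1", (5.0, 0.0): "VL_HN2", (5.0, 5.0): "VL_HN3"}
lines = {
    "32768_1": [(0.0, 0.0), (2.5, 0.0), (5.0, 0.0)],
    "32768_2": [(5.0, 0.0), (5.0, 5.0)],
}


def start_end_ids(lines, nodes):
    """Return {line_id: (start_node_id, end_node_id)}; None where no node matches."""
    return {
        line_id: (nodes.get(coords[0]), nodes.get(coords[-1]))
        for line_id, coords in lines.items()
    }


ids = start_end_ids(lines, nodes)
print(ids)
```

The shared node `VL_HN2` appears as the end of the first segment and the start of the second, which is the property the split segments need in order to stay connected.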

In [120]:
splitlines_vhas.shape

(15153, 5)

In [113]:
waternodes.columns

Index(['geometry', 'node_id', 'source', 'sewernode_id'], dtype='object')

In [157]:
node_id = "node_id"
splitlines_with_ids = line_segments_start_end_ids(
    splitlines_vhas, waternodes[["geometry", "source", "node_id"]], node_id, PROJ_CRS
)

(32523, 10)
connection_node    18817
water_node         13706
Name: source, dtype: int64
(15153, 6)
(15153, 6)


In [158]:
splitlines_with_ids

Unnamed: 0,start_ID,geometry,end_ID,VHAS_y
0,VL_HN55957,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL_HN66013,32768
1,VL_HN66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL_HN42790,32768
2,VL_HN579,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL_HN67674,32776
3,VL_HN67674,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL_HN580,32776
4,VL_HN22837,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL_HN65514,32784
...,...,...,...,...
15148,VL_HN64258,"LINESTRING (70886.860 219736.109, 70890.328 21...",VL_HN64257,7012201
15149,VL_HN34681,"LINESTRING (138704.147 176812.610, 138688.060 ...",VL_HN29689,32672
15150,VL_HN25143,"LINESTRING (161243.920 186896.994, 161230.637 ...",VL_HN66511,6029276
15151,VL_HN66511,"LINESTRING (159618.409 188060.371, 159396.076 ...",VL_HN58124,6029276


**Get unique water line ids**

In [159]:
splitlines_vhas = get_unique_ID(splitlines_with_ids, "VHAS_y")
assert len(splitlines_vhas) == splitlines_vhas.new_string_id.nunique()

In [160]:
splitlines_with_ids

Unnamed: 0,start_ID,geometry,end_ID,VHAS_y,num_id,new_string_id
0,VL_HN55957,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL_HN66013,32768,1,32768_1
1,VL_HN66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL_HN42790,32768,2,32768_2
2,VL_HN579,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL_HN67674,32776,1,32776_1
3,VL_HN67674,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL_HN580,32776,2,32776_2
4,VL_HN22837,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL_HN65514,32784,1,32784_1
...,...,...,...,...,...,...
15148,VL_HN64258,"LINESTRING (70886.860 219736.109, 70890.328 21...",VL_HN64257,7012201,4,7012201_4
15149,VL_HN34681,"LINESTRING (138704.147 176812.610, 138688.060 ...",VL_HN29689,32672,1,32672_1
15150,VL_HN25143,"LINESTRING (161243.920 186896.994, 161230.637 ...",VL_HN66511,6029276,1,6029276_1
15151,VL_HN66511,"LINESTRING (159618.409 188060.371, 159396.076 ...",VL_HN58124,6029276,2,6029276_2


In [163]:
splitlines_final = (
    splitlines_vhas.merge(water_final, left_on="VHAS_y", right_on="VHAS", how="left")
    .drop(["VHAS_y", "VHAS", "num_id", "start_ID_y", "end_ID_y", "geometry_y"], axis=1)
    .rename(
        columns={
            "new_string_id": "VHAS",
            "start_ID_x": "start_ID",
            "end_ID_x": "end_ID",
            "geometry_x": "geometry",
        }
    )
)

# assert splitlines_merged.VHAS_x.all() == splitlines_merged.VHAS.all()

In [164]:
splitlines_final.head(3)

Unnamed: 0,start_ID,geometry,end_ID,VHAS,OIDN,UIDN,VHAG,NAAM,REGCODE,REGCODE1,...,CATC,LBLCATC,BEKNR,BEKNAAM,STRMGEB,GEO,LBLGEO,VHAZONENR,WTRLICHC,LENGTE
0,VL_HN55957,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL_HN66013,32768_1,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12
1,VL_HN66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL_HN42790,32768_2,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12
2,VL_HN579,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL_HN67674,32776_1,44660,664399,6243,Peverstraatbeek,B5114,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,1,2.5 tot 0.25 m,422,1033,952.75


In [165]:
splitlines_final.shape

(15153, 21)

## Gather all linestrings into one dataset

The split segments have been joined to the water dataset to pick up the necessary attributes, so all the linestrings can now be combined into one dataset.
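
A minimal sketch of the idea, using plain pandas and made-up ids: the original lines that were split are dropped and replaced by their split segments, and a count check confirms nothing was lost or duplicated.

```python
import pandas as pd

# Hypothetical toy data: three original lines, of which line "2" was split in two.
water = pd.DataFrame({"VHAS": ["1", "2", "3"]})
split_original_ids = ["2", "2"]                       # original id of each split segment
split_final = pd.DataFrame({"VHAS": ["2_1", "2_2"]})  # split segments with their new ids

# Drop the originals that were split, then append the split segments.
to_drop = set(split_original_ids)
trimmed = water[~water["VHAS"].isin(to_drop)]
merged = pd.concat([split_final, trimmed], ignore_index=True)

# originals, minus the lines that were split, plus the split segments
assert (len(water) - len(to_drop)) + len(split_final) == len(merged)
print(merged)
```

For the run recorded in this notebook the same check reads (63762 - 6846) + 15153 = 72069.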

In [166]:
def merge_segments_to_water(split_segments, split_segments_final, water_df, col):
    """Replace the water lines that were split with their split segments."""
    # drop the linestrings that were split, then append the split segments

    split_segments = split_segments.astype({col: str}, errors="raise")
    split_segments_final = split_segments_final.astype({col: str}, errors="raise")
    water_df = water_df.astype({col: str}, errors="raise")
    print("water_df: ", len(water_df))

    assert split_segments_final[col].nunique() == split_segments_final["geometry"].nunique()
    assert water_df[col].nunique() == water_df["geometry"].nunique()

    linestrings_to_drop = list(set(split_segments[col].to_list()))
    print("linestrings_to_drop: ", len(linestrings_to_drop))
    print("split_segments: ", len(split_segments))

    water_df_trimmed = water_df.query(
        col + " not in @linestrings_to_drop"
    )  # .reset_index(drop=True)
    water_df_drop = water_df.query(col + " in @linestrings_to_drop")
    # assert water_df_trimmed[col].nunique() == water_df_trimmed['geometry'].nunique()
    # print("water df trimmed: ", len(water_df_trimmed))
    # print("water df drop: ", len(water_df_drop))

    # merge the split lines with the original water lines
    # water_df_trimmed_merged = water_df_trimmed.merge(split_segments_line_ids, on=col, how='outer')
    merged_df = gpd.GeoDataFrame(
        pd.concat([split_segments_final, water_df_trimmed], ignore_index=True),
        geometry="geometry",
        crs=PROJ_CRS,
    )
    print("merged df: ", len(merged_df))
    
    assert merged_df["geometry"].nunique() == merged_df[col].nunique()
    assert (len(water_df) - len(linestrings_to_drop)) + len(split_segments) == len(merged_df)

    # assert merged_df['geometry'].nunique() == merged_df[col].nunique()
    return merged_df

In [167]:
splitlines_vhas

Unnamed: 0,start_ID,geometry,end_ID,VHAS_y,num_id,new_string_id
0,VL_HN55957,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL_HN66013,32768,1,32768_1
1,VL_HN66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL_HN42790,32768,2,32768_2
2,VL_HN579,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL_HN67674,32776,1,32776_1
3,VL_HN67674,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL_HN580,32776,2,32776_2
4,VL_HN22837,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL_HN65514,32784,1,32784_1
...,...,...,...,...,...,...
15148,VL_HN64258,"LINESTRING (70886.860 219736.109, 70890.328 21...",VL_HN64257,7012201,4,7012201_4
15149,VL_HN34681,"LINESTRING (138704.147 176812.610, 138688.060 ...",VL_HN29689,32672,1,32672_1
15150,VL_HN25143,"LINESTRING (161243.920 186896.994, 161230.637 ...",VL_HN66511,6029276,1,6029276_1
15151,VL_HN66511,"LINESTRING (159618.409 188060.371, 159396.076 ...",VL_HN58124,6029276,2,6029276_2


In [168]:
splitlines_df = splitlines_vhas[['VHAS_y', 'geometry']].rename(columns={'VHAS_y':'VHAS'})

In [169]:
segments_to_water = merge_segments_to_water(
    splitlines_df, splitlines_final, water_final, "VHAS"
)

water_df:  63762
linestrings_to_drop:  6846
split_segments:  15153
merged df:  72069


In [170]:
segments_to_water.head()

Unnamed: 0,start_ID,geometry,end_ID,VHAS,OIDN,UIDN,VHAG,NAAM,REGCODE,REGCODE1,...,CATC,LBLCATC,BEKNR,BEKNAAM,STRMGEB,GEO,LBLGEO,VHAZONENR,WTRLICHC,LENGTE
0,VL_HN55957,"LINESTRING (138979.165 172375.119, 138977.345 ...",VL_HN66013,32768_1,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12
1,VL_HN66013,"LINESTRING (138977.309 172382.827, 138977.282 ...",VL_HN42790,32768_2,44659,723201,6200,Zierbeek,B5111,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,2,< 0.25 m,422,213,866.12
2,VL_HN579,"LINESTRING (139426.094 170006.797, 139438.876 ...",VL_HN67674,32776_1,44660,664399,6243,Peverstraatbeek,B5114,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,1,2.5 tot 0.25 m,422,1033,952.75
3,VL_HN67674,"LINESTRING (139710.263 170419.798, 139711.638 ...",VL_HN580,32776_2,44660,664399,6243,Peverstraatbeek,B5114,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,1,2.5 tot 0.25 m,422,1033,952.75
4,VL_HN22837,"LINESTRING (138841.797 171003.798, 138847.552 ...",VL_HN65514,32784_1,44567,687269,6258,Zibbeek,B5116,,...,2,"Geklasseerd, tweede categorie",7,Denderbekken,Schelde,1,2.5 tot 0.25 m,422,1033,781.83


In [171]:
segments_to_water.geometry.nunique()

72069

**Recalculate the length of the new linestrings**

In [172]:
segments_to_water["LENGTE"] = segments_to_water.geometry.length
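
For reference, the planar length of a linestring is the sum of the straight-line distances between consecutive vertices, expressed in the units of the CRS (metres for EPSG:31370). A minimal pure-Python sketch with made-up coordinates; the helper name is hypothetical:

```python
import math


def polyline_length(coords):
    """Sum of Euclidean distances between consecutive (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


# Two legs of length 5 and 6.
length = polyline_length([(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)])
print(length)  # 11.0
```

This is why the attribute has to be recalculated after splitting: each segment inherits the `LENGTE` of its parent line until its own geometry is measured.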

In [173]:
segments_to_water.columns

Index(['start_ID', 'geometry', 'end_ID', 'VHAS', 'OIDN', 'UIDN', 'VHAG',
       'NAAM', 'REGCODE', 'REGCODE1', 'BEHEER', 'CATC', 'LBLCATC', 'BEKNR',
       'BEKNAAM', 'STRMGEB', 'GEO', 'LBLGEO', 'VHAZONENR', 'WTRLICHC',
       'LENGTE'],
      dtype='object')

In [227]:
segments_to_water2 = segments_to_water.rename(
    columns={
        "LBLCATC": "category",
        "VHAS": "line_id",
        "LENGTE": "length",
        "STRMGEB": "basin",
        "NAAM": "line_name",
    }
)

In [228]:
# segments_to_water2.to_file(config.data_dest / "vl_water_PROCESSED.shp")

### Intermodal network connection object

An intermodal network connection object is a link object between the water network and the sewer network.
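
As a rough sketch of what one such link carries, the example below builds a single record with the field names used later in this section (`idElement1`, `idElement2`, `fictitious`, `UUID`). The helper `make_link`, the ids and the coordinates are hypothetical, and the geometry is reduced to a pair of coordinate tuples rather than a real linestring.

```python
import uuid


def make_link(hydronode_id, hydro_xy, sewernode_id, sewer_xy):
    """Build one link record between a hydro node and a sewer node."""
    return {
        "idElement1": hydronode_id,
        "idElement2": sewernode_id,
        "geometry": [hydro_xy, sewer_xy],  # two-vertex straight line
        "fictitious": "true",              # the link is not a physical pipe or watercourse
        "UUID": uuid.uuid4().hex,
    }


link = make_link("VL_HN1", (0.0, 0.0), "S_1", (1.0, 1.0))
print(link)
```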

In [176]:
network_connection_object = (
    final_water_nodes[["node_id", "sewernode_id", "geometry"]]
    .query("sewernode_id.notnull()")
    .rename(columns={"node_id": "hydronode_id"})
    .reset_index(drop=True)
)

In [177]:
network_connection_object.head(2)

Unnamed: 0,hydronode_id,sewernode_id,geometry
0,VL_HN60772,24195613725_1,POINT (49720.629 215107.628)
1,VL_HN60773,7165620_1,POINT (63142.133 172229.301)


In [179]:
# attach the matching point from ext_nodes_df to each connection node
network_conn2 = network_connection_object.merge(
    ext_nodes_df[["CODEKOPPNT", "geometry"]],
    left_on="sewernode_id",
    right_on="CODEKOPPNT",
    how="left",
)


def make_connection_lines(df, from_point, to_point):
    """Return one straight LineString per row, from `from_point` to `to_point`."""
    lines = []
    for _, row in df.iterrows():
        p_1 = Point(row[from_point])
        p_2 = Point(row[to_point])
        lines.append(LineString([p_1, p_2]))
    return lines


network_conn2["connection_lines"] = make_connection_lines(
    network_conn2, "geometry_x", "geometry_y"
)

connection_links = gpd.GeoDataFrame(
    network_conn2[["hydronode_id", "sewernode_id", "connection_lines"]].rename(
        columns={
            "hydronode_id": "idElement1",
            "sewernode_id": "idElement2",
            "connection_lines": "geometry",
        }
    )
)


In [187]:
connection_links['fictitious'] = 'true'

In [180]:
network_conn2.head(2)

Unnamed: 0,hydronode_id,sewernode_id,geometry_x,CODEKOPPNT,geometry_y,connection_lines
0,VL_HN60772,24195613725_1,POINT (49720.629 215107.628),24195613725_1,POINT (49632.228 215175.512),LINESTRING (49720.628540000325 215107.62849998...
1,VL_HN60773,7165620_1,POINT (63142.133 172229.301),7165620_1,POINT (63142.598 172228.907),LINESTRING (63142.132540000996 172229.30059998...


In [188]:
connection_links

Unnamed: 0,idElement1,idElement2,geometry,UUID,watercourse_namespace,fictitious
0,VL_HN60772,24195613725_1,"LINESTRING (3828319.268 3148978.328, 3828236.4...",21fb95a073a34a7f908a86cd98f427b0,gopeg.eu/tracing,true
1,VL_HN60773,7165620_1,"LINESTRING (3838398.511 3105159.856, 3838398.9...",a793a461705e452b860a1bbfd621486a,gopeg.eu/tracing,true
2,VL_HN60774,7166026_1,"LINESTRING (3836511.918 3105203.709, 3836511.9...",ddb70f0455c34637bb5bfe0c10e52e23,gopeg.eu/tracing,true
3,VL_HN60775,7165606_1,"LINESTRING (3841436.874 3107664.237, 3841433.5...",1495042cda824d39bbb9a1a7093571ab,gopeg.eu/tracing,true
4,VL_HN60776,7165608_1,"LINESTRING (3841582.291 3107649.537, 3841584.0...",e445c17c3a2d415a81c94251fc5b53aa,gopeg.eu/tracing,true
...,...,...,...,...,...,...
8984,VL_HN69756,6002935_1,"LINESTRING (3996915.723 3101525.462, 3996912.2...",af1e590ce89f4140be7e619ab8f8aeb1,gopeg.eu/tracing,true
8985,VL_HN69757,7184778_1,"LINESTRING (3994360.101 3110453.353, 3994357.9...",c6505f3dabf541c9b5af5e33048eeaf8,gopeg.eu/tracing,true
8986,VL_HN69758,7177561_1,"LINESTRING (4000010.552 3104566.229, 3999720.3...",01ab346468244f8ca8e0ca00e6394bcf,gopeg.eu/tracing,true
8987,VL_HN69759,6003589_1,"LINESTRING (3993351.027 3104426.881, 3993350.6...",1d2143bd618e4a69987891f5a999f2bf,gopeg.eu/tracing,true


In [182]:
import uuid
connection_links['UUID'] = [uuid.uuid4().hex for _ in range(len(connection_links.index))]
connection_links['watercourse_namespace'] = "gopeg.eu/tracing"

FINAL_CRS = 'EPSG:3035'
connection_links = connection_links.set_crs(PROJ_CRS)
# connection_links = connection_links.to_crs(FINAL_CRS)

In [230]:
connection_links = connection_links.to_crs(PROJ_CRS)

In [232]:
# connection_links.to_file(config.data_dest / "vl_HS_connection.shp")

