In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
from matplotlib import pyplot as plt
import pydeck as pdk
import zipfile
import io
import requests
from lxml import etree
from haversine import haversine
import uuid

# Parsing NETEX as master-records for odmkraken.busspeeds

## Reading XML with pandas

### Downloading a zip file with multiple XML files

So let's just try opening a netex file in pandas:

In [2]:
URL = 'https://download.data.public.lu/resources/horaires-et-arrets-des-transport-publics-netex/20221202-114015/netex-20221201-20221231.zip'

In [3]:
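r = requests.get(URL)
r.raise_for_status()  # fail early if the download did not succeed
z = zipfile.ZipFile(io.BytesIO(r.content))
# pick the file describing RGTR line 622
file = next(f for f in z.filelist if 'RGTR-622' in f.filename)
pd.read_xml(z.open(file))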

Unnamed: 0,PublicationTimestamp,ParticipantRef,RequestTimestamp,topics,CompositeFrame
0,2022-12-02T10:28:55,,,,
1,,LU,,,
2,,,2022-12-02T10:28:55,,
3,,,,,
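
The member lookup used above can be tried without downloading anything. The following is a minimal sketch on an in-memory archive; the member names and their contents are made up for illustration and are not taken from the real bundle.

```python
import io
import zipfile

# Build a small in-memory archive (file names are hypothetical).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("LU-Line-RGTR-621.xml", "<a/>")
    archive.writestr("LU-Line-RGTR-622.xml", "<b/>")

z_demo = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))

# Same pattern as above: first member whose name contains the line code.
# Passing a default to next() avoids a StopIteration if nothing matches.
member = next((f for f in z_demo.infolist() if "RGTR-622" in f.filename), None)
content = z_demo.read(member).decode() if member is not None else None
```

Using a default in `next()` is a design choice of this sketch: a missing file then shows up as `None` rather than as an exception.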


### Using `read_xml` with namespaces to filter for specific tags

So with the right tag-name and namespaces specified correctly, pandas works:

In [4]:
pd.read_xml(z.open(file), xpath='//nx:Line', namespaces={'nx': 'http://www.netex.org.uk/netex'})

Unnamed: 0,id,version,Name,ShortName,TransportMode,PublicCode,PrivateCode,AuthorityRef,additionalOperators,allowedDirections
0,LU::Line:2367::,1669973335,622,622,bus,622,622,,,


However, tags that have children with pertinent information, such as the latitude and longitude tags of `Location`, are lost:

In [5]:
pd.read_xml(z.open(file), xpath='//nx:ScheduledStopPoint', namespaces={'nx': 'http://www.netex.org.uk/netex'}).head()

Unnamed: 0,id,version,Location,ShortName,PublicCode,StopType
0,LU::ScheduledStopPoint:16160119_RGTR_::,1669973335,,"LUX Gare, Routière CFL quai 104",LUGARC04,busStation
1,LU::ScheduledStopPoint:17410303_RGTR_::,1669973335,,"LUX Hollerich, Fonderie",LUFOND03,busStation
2,LU::ScheduledStopPoint:25600101_RGTR_::,1669973335,,"LUX Hollerich, Jean-Baptiste Merkels",LUJBME01,busStation
3,LU::ScheduledStopPoint:14710403_RGTR_::,1669973335,,"LUX Hollerich, Assurances Sociales 3",LUASSO03,busStation
4,LU::ScheduledStopPoint:12140101_RGTR_::,1669973335,,"LUX Cessange, Barrès",LUCBAR01,busStation
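
To see what is being dropped, the nested coordinates can be read with explicit namespace-aware paths. This is only a sketch using the standard library's `xml.etree.ElementTree` on a hand-written fragment that mimics the structure observed in the file; the fragment itself is an assumption, not a copy of the real data.

```python
import xml.etree.ElementTree as ET

NS = {"nx": "http://www.netex.org.uk/netex"}

# Hand-written fragment (hypothetical layout, one stop point only).
FRAGMENT = """
<scheduledStopPoints xmlns="http://www.netex.org.uk/netex">
  <ScheduledStopPoint id="LU::ScheduledStopPoint:12140101_RGTR_::" version="1669973335">
    <Location>
      <Longitude>6.11549</Longitude>
      <Latitude>49.595895</Latitude>
    </Location>
    <ShortName>LUX Cessange, Barrès</ShortName>
  </ScheduledStopPoint>
</scheduledStopPoints>
"""

root = ET.fromstring(FRAGMENT)
rows = []
for sp in root.findall("nx:ScheduledStopPoint", NS):
    rows.append({
        "id": sp.get("id"),
        "ShortName": sp.findtext("nx:ShortName", namespaces=NS),
        # Grandchildren need an explicit path through Location.
        "Latitude": float(sp.findtext("nx:Location/nx:Latitude", namespaces=NS)),
        "Longitude": float(sp.findtext("nx:Location/nx:Longitude", namespaces=NS)),
    })
```

The point of the sketch is that the values are present in the document; they are only lost because a flat table has no place for grandchildren unless a path to them is spelled out.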


### Using `read_xml` with stylesheets to read nested tags as a table

According to the docs, the intended answer to that problem is a stylesheet:

In [6]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    <xsl:output method="xml"/>    
    <xsl:template match = "nx:PublicationDelivery">   
        <xsl:apply-templates select="/nx:PublicationDelivery/nx:dataObjects/nx:CompositeFrame/nx:frames/nx:ServiceFrame/nx:scheduledStopPoints/nx:ScheduledStopPoint" />
    </xsl:template>
    <xsl:template match = "nx:ScheduledStopPoint">   
        <xsl:copy>
            <id><xsl:value-of select="@id" /></id>
            <version><xsl:value-of select="@version" /></version>
            <xsl:copy-of select="nx:ShortName" />
            <xsl:copy-of select="nx:PublicCode" />
            <xsl:copy-of select="nx:StopType" />
            <xsl:copy-of select="nx:Location/nx:Latitude" />
            <xsl:copy-of select="nx:Location/nx:Longitude" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
'''

In [7]:
stop_points = pd.read_xml(z.open(file), stylesheet=style, xpath='/*')
stop_points['id'] = stop_points['id'].str.extract(r'LU::ScheduledStopPoint:(?P<id>[0-9]+)_[A-Z]+_::')
stop_points.head()

Unnamed: 0,id,version,ShortName,PublicCode,StopType,Latitude,Longitude
0,16160119,1669973335,"LUX Gare, Routière CFL quai 104",LUGARC04,busStation,49.598712,6.132902
1,17410303,1669973335,"LUX Hollerich, Fonderie",LUFOND03,busStation,49.598529,6.127708
2,25600101,1669973335,"LUX Hollerich, Jean-Baptiste Merkels",LUJBME01,busStation,49.599458,6.12238
3,14710403,1669973335,"LUX Hollerich, Assurances Sociales 3",LUASSO03,busStation,49.597378,6.117603
4,12140101,1669973335,"LUX Cessange, Barrès",LUCBAR01,busStation,49.595895,6.11549


Success! This table would be usable: the `HALT` codes sent in the latest ICTS data dumps seem to match the `id` of `ScheduledStopPoint`. However, it only provides the name given by the respective operator. The "master stop" that would simplify spatial aggregation is hidden in other objects:

## NETEX data model

Having explored a few files manually, the data model seems centered around the idea of linking the infrastructure of the bus-stop itself, to the fact that a vehicle is scheduled to stop there. In particular:

1. `PassengerStopAssignment` (in the `ServiceFrame` under `stopAssignments`) stores a reference to a `StopPlace` (or `Quay`) and a `ScheduledStopPoint`.
2. A `StopPlace` (in the `SiteFrame` under `stopPlaces`) has a name, a location (coordinates of its centroid), a `TopographicPlace` (which seems to be a commune) and a `Level` (referring to the level of a building a stop is at). It may also indicate one or several `Quay` objects, which have the same attributes and can be linked to directly from the `StopPlace`.
3. A `ScheduledStopPoint` (in the `ServiceFrame` under `scheduledStopPoints`) similarly has a name and a location. It is referred to by a `StopPointInJourneyPattern`, which links the stop point to a `ServiceJourneyPattern`.
4. A `ServiceJourneyPattern` (in the `ServiceFrame`) stores a sequence of `StopPointInJourneyPattern` indicating the sequence of `ScheduledStopPoint` visited during the journey, as well as (for each stop) whether passengers may board or disembark and whether there is a "change of destination display". It also includes metadata, such as the formal direction of travel and `NoticeAssignment` objects informing e.g. about school transport services or guaranteed connections. While conceptually the `ServiceJourneyPattern` corresponds to a `run` (= course), the IDs do not seem to match.
5. `TimetabledPassingTime` objects add the departure and arrival times for every `StopPointInJourneyPattern`.
    



So what we would need as a `stop` table are the links from `ScheduledStopPoint` to `StopPlace` via `PassengerStopAssignment`.

There is also the option to gather information about `runs` and scheduling from the `StopPointInJourneyPattern`.
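
The link described above can be sketched with plain data classes. The class and field names below are simplified stand-ins chosen for illustration (they are not the NETEX schema), and the records are toy values with a hypothetical `StopPlace` ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScheduledStopPoint:
    id: str
    short_name: str


@dataclass(frozen=True)
class StopPlace:
    id: str
    name: str


@dataclass(frozen=True)
class PassengerStopAssignment:
    stop_point_ref: str
    stop_place_ref: Optional[str] = None
    quay_ref: Optional[str] = None


def build_stop_table(points, places, assignments):
    """Join stop points to stop places through the assignments."""
    points_by_id = {p.id: p for p in points}
    places_by_id = {p.id: p for p in places}
    table = []
    for a in assignments:
        point = points_by_id.get(a.stop_point_ref)
        place = places_by_id.get(a.stop_place_ref)
        if point is None:
            continue  # assignment refers to a point not defined here
        table.append({
            "stop_point_id": point.id,
            "stop_point_name": point.short_name,
            "stop_place_id": place.id if place else None,
            "stop_place_name": place.name if place else None,
        })
    return table


points = [ScheduledStopPoint("LU::ScheduledStopPoint:12140101_RGTR_::", "LUX Cessange, Barrès")]
places = [StopPlace("LU::StopPlace:demo::", "Cessange, Barrès")]  # hypothetical ID
assignments = [PassengerStopAssignment(points[0].id, stop_place_ref=places[0].id)]
stop_table = build_stop_table(points, places, assignments)
```

Note that a dictionary keyed by ID keeps only the last record when IDs repeat, so this kind of lookup is only safe if the referenced IDs are unique.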

### Definition of lines

In [8]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    
    <xsl:output method="xml"/>
    <xsl:key name="authorities" match="//nx:organisations/nx:Authority" use="@id" />
    <xsl:key name="operators" match="//nx:organisations/nx:Operator" use="@id" />
    
    <xsl:template match="//nx:CompositeFrame">
        <Line>
            <xsl:apply-templates select="//nx:Line" />
            <xsl:apply-templates select="//nx:ValidBetween" />
        </Line>
    </xsl:template>

    <!-- indices to merge references -->
    <xsl:template match="nx:Line">
            <id><xsl:value-of select="@id" /></id>
            <version><xsl:value-of select="@version" /></version>
            <name><xsl:value-of select="nx:Name" /></name>
            <shortName><xsl:value-of select="nx:ShortName" /></shortName>
            <mode><xsl:value-of select="nx:TransportMode" /></mode>
            <publicCode><xsl:value-of select="nx:PublicCode" /></publicCode>
            <privateCode><xsl:value-of select="nx:PrivateCode" /></privateCode>
            <authority><xsl:value-of select="key('authorities', nx:AuthorityRef/@ref)/nx:Name" /></authority>
            <xsl:for-each select="nx:additionalOperators">
                <operator><xsl:value-of select="key('operators', nx:OperatorRef/@ref)/nx:Name" /></operator>
            </xsl:for-each>
            <!--<direction>
                <xsl:for-each select="nx:allowedDirections/nx:AllowedLineDirection">
                    <xsl:value-of select="@id" />
                </xsl:for-each>
            </direction>-->
        
    </xsl:template>
        
    <xsl:template match="nx:ValidBetween">
        <fromDate><xsl:value-of select="nx:FromDate" /></fromDate>
        <toDate><xsl:value-of select="nx:ToDate" /></toDate>
    </xsl:template>
    
</xsl:stylesheet>
'''

lines = []
transform = etree.XSLT(etree.XML(style))
for file in z.filelist:
    with z.open(file) as h:
        lines.append(pd.read_xml(h, xpath='/Line', stylesheet=style))
lines = pd.concat(lines)

lines['fromDate'] = pd.to_datetime(lines['fromDate'])
lines['toDate'] = pd.to_datetime(lines['toDate'])
lines['ictsLineCode'] = lines['authority'].replace('CFL_Bus', 'CFL').str.cat(lines['shortName'].astype('str').str.pad(2, fillchar='0'))

In [9]:
lines.head()

Unnamed: 0,id,version,name,shortName,mode,publicCode,privateCode,authority,operator,fromDate,toDate,ictsLineCode
0,LU::Line:257::,1669973335,10,10,bus,10,10,AVL,Ville de Luxembourg - Service Autobus,2022-12-01,2022-12-31,AVL10
0,LU::Line:258::,1669973335,11,11,bus,11,11,AVL,Ville de Luxembourg - Service Autobus,2022-12-01,2022-12-31,AVL11
0,LU::Line:259::,1669973335,12,12,bus,12,12,AVL,Ville de Luxembourg - Service Autobus,2022-12-01,2022-12-31,AVL12
0,LU::Line:260::,1669973335,13,13,bus,13,13,AVL,Ville de Luxembourg - Service Autobus,2022-12-01,2022-12-31,AVL13
0,LU::Line:261::,1669973335,14,14,bus,14,14,AVL,Ville de Luxembourg - Service Autobus,2022-12-01,2022-12-31,AVL14
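
The derivation of `ictsLineCode` can be illustrated in plain Python. This is a sketch that mirrors the pandas expression above for a single record; the function name is made up here, and whether the resulting codes match the ICTS line codes in every case is not verified.

```python
def icts_line_code(authority: str, short_name) -> str:
    """Authority prefix plus the short name, left-padded with zeros to two characters."""
    # Only the exact value 'CFL_Bus' is rewritten, as with Series.replace.
    prefix = "CFL" if authority == "CFL_Bus" else authority
    return prefix + str(short_name).rjust(2, "0")


codes = [icts_line_code("AVL", 10), icts_line_code("CFL_Bus", 5)]
```

Short names that are already two or more characters long are left unchanged by the padding.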


### Journey patterns

The interesting info about journeys is hidden in the `Notice` and `NoticeAssignment` objects. This is trickier to extract, as there may be one or several messages that act either like a tag, i.e. attributing a property to a line or not, or like a specifier, attributing a line to a certain category. In the end, I solved this using XSLT:

In [12]:
style = '''
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    
    <xsl:output method="xml"/>
    <xsl:key name="authorities" match="//nx:organisations/nx:Authority" use="@id" />
    <xsl:key name="operators" match="//nx:organisations/nx:Operator" use="@id" />
    <xsl:key name="notices" match="//nx:Notice" use="@id" />
    
    <xsl:template match="//nx:CompositeFrame">
         <journey>
             <id>(forgetme)</id>
             <bicycle>yes</bicycle>  <!-- necessary as there currently seems to be none... -->
             <barrierefrei>yes</barrierefrei>
         </journey>
         <xsl:apply-templates select="//nx:ServiceJourneyPattern" />
    </xsl:template>

    <xsl:template match="nx:ServiceJourneyPattern">
        <journey>
            <id><xsl:value-of select="@id" /><xsl:value-of select="key('notices', nx:NoticeRef/@ref)/nx:Text" /></id>
            <line><xsl:value-of select="nx:RouteView/nx:LineRef/@ref" /></line>
            <direction><xsl:value-of select="nx:DirectionRef/@ref" /></direction>
            <xsl:for-each select="nx:noticeAssignments/nx:NoticeAssignment">
                <xsl:apply-templates select="nx:Notice" />
                <xsl:apply-templates select="key('notices', nx:NoticeRef/@ref)" />
            </xsl:for-each>
        </journey>
    </xsl:template>
    
    <xsl:template match="nx:Notice">
        <xsl:choose>
            <xsl:when test="contains(nx:Text, 'Fremdunternehmer')">
                <operator><xsl:value-of select="substring(nx:Text, 19)" /></operator>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Fahrtart')">
                <journeyType><xsl:value-of select="substring(nx:Text, 11)" /></journeyType>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'correspondance')">
                <connection><xsl:value-of select="nx:Text" /></connection>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Barrierefrei')">
                <barrierefrei>yes</barrierefrei>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Rollstuhlgerecht')">
                <rollstuhlgerecht>yes</rollstuhlgerecht>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Fahradmitnahme')">
                <bicycle>yes</bicycle>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Einstiegshilfe')">
                <einstiegshilfe>yes</einstiegshilfe>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Klimaanlage')">
                <airconditioned>yes</airconditioned>
            </xsl:when>
            <xsl:when test="contains(nx:Text, 'Für HAFAS gesperrt')">
                <nohafas>yes</nohafas>
            </xsl:when>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>
'''

journeys = []
for file in z.filelist:
    with z.open(file) as h:
        journeys.append(pd.read_xml(h, xpath='/journey', stylesheet=style).drop(0))
journeys = pd.concat(journeys)

for field in ('barrierefrei', 'rollstuhlgerecht', 'bicycle', 'einstiegshilfe', 'airconditioned', 'nohafas'):
    # flags only exist as columns if at least one journey carries them
    if field in journeys.columns:
        journeys[field] = journeys[field].replace('yes', 1).astype(float)

for field in ('journeyType', 'operator', 'line', 'direction'):
    # not every notice type occurs in the data, so skip missing columns
    if field in journeys.columns:
        journeys[field] = journeys[field].astype('category')

if 'connection' in journeys.columns:
    journeys['connection'] = journeys['connection'].astype('str')
journeys = journeys.drop_duplicates()
journeys.head()

In [None]:
journeys

In [None]:
journeys.dtypes
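
The distinction between notices that act as tags and notices that act as specifiers can be sketched in plain Python. The function below mirrors the `xsl:choose` block of the stylesheet; the function name is made up, and the sample texts in the usage note assume a `Keyword: value` layout, which is only inferred from the `substring` offsets and not confirmed against the data.

```python
# Keyword -> output field for notices that act as plain tags.
FLAG_KEYWORDS = {
    "Barrierefrei": "barrierefrei",
    "Rollstuhlgerecht": "rollstuhlgerecht",
    "Fahradmitnahme": "bicycle",  # spelling as in the stylesheet
    "Einstiegshilfe": "einstiegshilfe",
    "Klimaanlage": "airconditioned",
    "Für HAFAS gesperrt": "nohafas",
}


def classify_notice(text: str) -> dict:
    """Return the field a single notice text contributes to a journey."""
    # Specifiers: the value follows the keyword. XSLT's substring() is
    # 1-based, so substring(Text, 19) corresponds to text[18:] in Python.
    if "Fremdunternehmer" in text:
        return {"operator": text[18:]}
    if "Fahrtart" in text:
        return {"journeyType": text[10:]}
    if "correspondance" in text:
        return {"connection": text}
    # Tags: presence of the keyword sets a flag.
    for keyword, field in FLAG_KEYWORDS.items():
        if keyword in text:
            return {field: "yes"}
    return {}


operator = classify_notice("Fremdunternehmer: Demo Bus")  # hypothetical text
flag = classify_notice("Klimaanlage")
```

As in `xsl:choose`, only the first matching branch applies, so a text containing two keywords contributes a single field.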

## Getting information about stops

The object referenced by ICTS data is the `ScheduledStopPoint`. The path to full information (including the stop place) leads through the `PassengerStopAssignment`. The big issue here is the weird indexing of `StopPlace` objects, many of which share the identifier `LU::StopPlace:0_CdT::`. It turns out that XSLT's `key` simply uses the last occurrence of the `StopPlace` tag it finds in the file, even though the relationship between the `StopPoint` and the `StopPlace` often seems clear (intuitively) by name and spatially.
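
The spatial intuition can be made concrete with a nearest-neighbour lookup. The sketch below implements the haversine formula with the standard library to stay self-contained (the `haversine` package imported at the top would serve the same purpose); the stop point coordinates come from the table above, while the candidate places are invented for illustration.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two WGS84 points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


def nearest_place(point, places):
    """Return (place, distance_km) for the candidate closest to `point`."""
    return min(
        ((p, haversine_km(point["lat"], point["lon"], p["lat"], p["lon"])) for p in places),
        key=lambda pair: pair[1],
    )


point = {"name": "LUX Cessange, Barrès", "lat": 49.595895, "lon": 6.11549}
candidates = [  # hypothetical stop places
    {"name": "Candidate A", "lat": 49.5960, "lon": 6.1155},
    {"name": "Candidate B", "lat": 49.6000, "lon": 6.1300},
]
place, distance_km = nearest_place(point, candidates)
```

A distance threshold would still be needed in practice to reject matches that are nearest but implausibly far away; no such threshold is proposed here.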

**Note**: there is a practical problem when applying an XSLT like the one below to all XML files within the NETEX bundle: some XML files do not define any `StopPoint`, `Quay` etc. objects, in which case `etree.XSLT` returns an empty object, which makes `read_xml` fail miserably. To work around that, I simply add a dummy record at the beginning of every file and delete it again manually afterwards.

In [None]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    
    <xsl:output method="xml"/>    
    
    <!-- indices to merge references -->
    <xsl:key name="stoppoints" match="//nx:ServiceFrame/nx:scheduledStopPoints/nx:ScheduledStopPoint" use="@id" />
    <xsl:key name="stopplaces" match="//nx:SiteFrame/nx:stopPlaces/nx:StopPlace" use="@id" />
    <xsl:key name="quays" match="//nx:SiteFrame/nx:stopPlaces/nx:StopPlace/nx:quays/nx:Quay" use="@id" />
    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    <xsl:key name="authorities" match="//nx:organisations/nx:Authority" use="@id" />
    
    <!-- main template: triggers template for every `PassengerStopAssignment`-->
    <xsl:template match="nx:PublicationDelivery">   
        <Stop>
            <StopPointId>(killme)</StopPointId>
        </Stop>
        <xsl:apply-templates select="//nx:ServiceFrame/nx:stopAssignments/nx:PassengerStopAssignment"/>
    </xsl:template>
    
    <!-- reads the stop assigment and triggers templates for the 3 objects that may be referenced in turn -->
    <xsl:template match="//nx:PassengerStopAssignment">   
        <Stop>
            <xsl:apply-templates select="key('stoppoints', nx:ScheduledStopPointRef/@ref)" />
            <xsl:apply-templates select="key('stopplaces', nx:StopPlaceRef/@ref)" />
            <xsl:apply-templates select="key('quays', nx:QuayRef/@ref)" />
            <passengerStopAssignmentID><xsl:value-of select="@id" /></passengerStopAssignmentID>
            <passengerStopAssignmentOrder><xsl:value-of select="@order" /></passengerStopAssignmentOrder>
            <authority><xsl:value-of select="key('authorities', nx:AuthorityRef/@ref)/nx:Name" /></authority>
            <xsl:apply-templates select="../../../../nx:ValidBetween" />
        </Stop>
    </xsl:template>
    
    <xsl:template match="nx:ValidBetween">
        <fromDate><xsl:value-of select="nx:FromDate" /></fromDate>
        <toDate><xsl:value-of select="nx:ToDate" /></toDate>
    </xsl:template>
    
    <xsl:template match="//nx:ScheduledStopPoint">   
        <StopPointId><xsl:value-of select="@id" /></StopPointId>
        <StopPointShortName><xsl:value-of select="nx:ShortName" /></StopPointShortName>
        <StopPointPublicCode><xsl:value-of select="nx:PublicCode" /></StopPointPublicCode>
        <StopPointStopType><xsl:value-of select="nx:StopType" /></StopPointStopType>
        <StopPointLatitude><xsl:value-of select="nx:Location/nx:Latitude" /></StopPointLatitude>
        <StopPointLongitude><xsl:value-of select="nx:Location/nx:Longitude" /></StopPointLongitude>
    </xsl:template>
    <xsl:template match="//nx:Quay">   
        <QuayID><xsl:value-of select="@id" /></QuayID>
        <QuayName><xsl:value-of select="nx:Name" /></QuayName>
        <QuayLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></QuayLatitude>
        <QuayLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></QuayLongitude>
        <QuayStopPlaceID><xsl:value-of select="../../@id" /></QuayStopPlaceID>
    </xsl:template>
    <xsl:template match="//nx:StopPlace">   
        <StopPlaceID><xsl:value-of select="@id" /></StopPlaceID>
        <StopPlaceName><xsl:value-of select="nx:Name" /></StopPlaceName>
        <StopPlaceLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></StopPlaceLatitude>
        <StopPlaceLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></StopPlaceLongitude>
        <xsl:apply-templates select="key('places', nx:TopographicPlaceRef/@ref)" />
    </xsl:template>
    
    <!-- a stop place may reference a topographic place -->
    <xsl:template match="//nx:TopographicPlace">   
        <TopographicPlaceName><xsl:value-of select="nx:Name" /></TopographicPlaceName>
        <TopographicPlaceISOCode><xsl:value-of select="nx:IsoCode" /></TopographicPlaceISOCode>
    </xsl:template>    

</xsl:stylesheet>
'''
stops = []
for file in z.filelist:
    with z.open(file) as h:
        res = pd.read_xml(h, stylesheet=style, xpath='/*')
        stops.append(res.drop(0))
stops = pd.concat(stops)
stops['fromDate'] = pd.to_datetime(stops['fromDate'])
stops['toDate'] = pd.to_datetime(stops['toDate'])

So just to verify, yes there are indeed quite a few `StopPlace` with ID 0:

In [None]:
stops.groupby(['StopPlaceID']).agg({'StopPointId': 'count'}).sort_values('StopPointId', ascending=False).head(10)

The core question here is how "unique" the assignment between stop places and stop points is:

In [None]:
d = stops.groupby(['StopPlaceID', 'StopPointId']).agg({'StopPointId': 'count'}).groupby(level=0).count()
d.sort_values('StopPointId', ascending=False).head(10)

Turns out that, except for this special stop place 0, which is referenced by 1468 different `StopPoint` ids, more than $99\%$ of stop places have 1-4 stop points:

In [None]:
d.value_counts().sort_index()

Conversely, not one `StopPoint` refers to more than one `StopPlace`.

In [None]:
(stops.groupby(['StopPointId', 'StopPlaceID']).agg({'StopPlaceID': 'count'}).groupby(level=0).count() != 1).sum()
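
The two group-by checks above boil down to counting distinct partners on each side of the assignment. A minimal sketch of that logic on toy data, using only the standard library (all IDs are placeholders):

```python
from collections import Counter, defaultdict

# Toy (StopPlaceID, StopPointId) pairs.
pairs = [
    ("place:0", "point:1"), ("place:0", "point:2"), ("place:0", "point:3"),
    ("place:A", "point:4"), ("place:A", "point:5"),
    ("place:B", "point:6"),
    ("place:B", "point:6"),  # same assignment seen in a second file
]

points_per_place = defaultdict(set)
places_per_point = defaultdict(set)
for place_id, point_id in pairs:
    points_per_place[place_id].add(point_id)
    places_per_point[point_id].add(place_id)

# Number of distinct stop points per stop place.
n_points = {place: len(points) for place, points in points_per_place.items()}
# How many places have 1, 2, 3, ... distinct stop points.
distribution = Counter(n_points.values())
# Stop points that refer to more than one stop place.
ambiguous_points = [p for p, places in places_per_point.items() if len(places) > 1]
```

Using sets means that an assignment repeated across files is counted once, which corresponds to grouping by the ID pair first. In this toy data, as in the real data, every stop point maps to exactly one stop place ID.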

Quite obviously, the same doesn't hold true by `StopPlace` name, because XSLT will just use the last occurrence it found in each file.

Just checking: if there is a `Quay`, then its parent should always have the same StopPlaceID as the `StopPlace` referenced by the `PassengerStopAssignment`. Let's see if there are any examples where this is not the case:

In [None]:
stops.loc[(~stops['QuayID'].isnull()) & (stops['StopPlaceID'] != stops['QuayStopPlaceID'])]

Thank goodness! The next problem is ambiguous assignments. In particular:

In [None]:
stops['StopPlaceID'].str.match('LU::StopPlace:0_CdT::').sum()

But more generally, there seem to be quite a few `StopPoint` that share the same name. In fact, there are almost 5000 ids, but only a bit more than 3100 unique names.

In [None]:
len(stops['StopPointShortName'].unique())

In [None]:
len(stops['StopPointId'].unique())

Roughly 1500 `StopPoint` use one name for one id; about 1600 have 2 ids for the same name. And then there are the following extreme cases:

In [None]:
d = stops.groupby(['StopPointShortName', 'StopPointId']).count().groupby(level=0).count()['StopPointPublicCode']
d.value_counts().sort_index()

In [None]:
d.loc[d > 4]

Turns out those special cases are exclusively rail stations:

In [None]:
stops.groupby(['StopPointShortName', 'StopPointId', 'StopPointStopType']).count().groupby(level=[0, 2]).count()[['StopPointPublicCode']].query('StopPointPublicCode > 4')

Let's see if those stations then have quays. And no, they all link to the same mystery quay:

In [None]:
stops.query('StopPointShortName=="Luxembourg"').groupby('QuayID').count()

However, the `StopPoint` ids clearly show that the quays are there. They are just encoded within the `StopPoint`, not the `StopPlace`/`Quay` pair.

In [None]:
stops.query('StopPointShortName=="Luxembourg"')[['StopPointId']]

While this particular stop isn't an issue for ICTS data (buses don't go to railway quays), it still raises the question whether there is hidden information in the id of other stop points. As it turns out, this only happens on rail, so let's forget about this:

In [None]:
stops.loc[stops['StopPointId'].str.match(r'LU::ScheduledStopPoint:\d+_\w+_[0-9a-zA-Z]+::')].StopPointStopType.unique()
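
The structure of the ids can be made explicit by splitting them into parts. The pattern below is a sketch based on the ids seen so far; the group names and the character classes (letters only for the operator part) are assumptions, and the rail-style id in the example is invented.

```python
import re

STOP_POINT_ID = re.compile(
    r"LU::ScheduledStopPoint:(?P<code>\d+)_(?P<operator>[A-Za-z]+)_(?P<suffix>[0-9A-Za-z]*)::"
)


def parse_stop_point_id(stop_point_id: str):
    """Split an id into numeric code, operator and optional suffix (or None)."""
    m = STOP_POINT_ID.fullmatch(stop_point_id)
    return m.groupdict() if m else None


bus = parse_stop_point_id("LU::ScheduledStopPoint:12140101_RGTR_::")
rail = parse_stop_point_id("LU::ScheduledStopPoint:200405060_CFL_3::")  # hypothetical id
```

A non-empty `suffix` is what the `str.match` call above selects: it marks ids that carry extra information after the operator part.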

So codes are fine, except for `StopPlace` 0. Let's see if we can do this merge better.

### Definition of StopPlaces

Applying a stylesheet through `read_xml` seems to have one big problem: an `AssertionError` is raised every time no data are found, which is hard to distinguish from other exceptions. So I am using a little hack here: I add a dummy record to the XSLT output, which has to be removed again in the loop. That way, even if there are no data, we still get a dataframe.

In [None]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    
    <xsl:output method="xml"/>
    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    
    <xsl:template match="/nx:PublicationDelivery">   
        <StopPlace>
            <StopPlaceID>(killme)</StopPlaceID>
        </StopPlace>
        <xsl:apply-templates select="//nx:StopPlace" />
    </xsl:template>

    <xsl:template match="nx:StopPlace">   
        <StopPlace>
            <StopPlaceID><xsl:value-of select="@id" /></StopPlaceID>
            <StopPlaceName><xsl:value-of select="nx:Name" /></StopPlaceName>
            <StopPlaceLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></StopPlaceLatitude>
            <StopPlaceLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></StopPlaceLongitude>
            <xsl:apply-templates select="key('places', nx:TopographicPlaceRef/@ref)" />
        </StopPlace>
    </xsl:template>
    
    <xsl:template match="//nx:TopographicPlace">   
        <TopographicPlaceName><xsl:value-of select="nx:Name" /></TopographicPlaceName>
        <TopographicPlaceISOCode><xsl:value-of select="nx:IsoCode" /></TopographicPlaceISOCode>
    </xsl:template>

</xsl:stylesheet>
'''
stop_places = []
for file in z.filelist:
    with z.open(file) as h:
        res = pd.read_xml(h, stylesheet=style, xpath='/*')
        stop_places.append(res.drop(0))
stop_places = pd.concat(stop_places)
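
Dropping the dummy record by position (`drop(0)`) relies on it always being the first row. A slightly more defensive variant, sketched here on plain records rather than a DataFrame, removes it by value; the helper name is made up for this example.

```python
SENTINEL = "(killme)"


def drop_sentinel(records, key, sentinel=SENTINEL):
    """Remove dummy records by value instead of by position."""
    return [r for r in records if r.get(key) != sentinel]


# A file with real data and a file that only yields the dummy record.
with_data = [{"StopPlaceID": SENTINEL}, {"StopPlaceID": "LU::StopPlace:demo::"}]
without_data = [{"StopPlaceID": SENTINEL}]

kept = drop_sentinel(with_data, "StopPlaceID")
empty = drop_sentinel(without_data, "StopPlaceID")
```

With a DataFrame, the same idea would be a boolean filter on the ID column.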

There is not a single place that has more than one ID for the same name, i.e. every name maps to exactly one ID (over all files!):

In [None]:
stop_places.groupby(['StopPlaceName', 'StopPlaceID']).count().reset_index().groupby(['StopPlaceName']).count().query('StopPlaceID != 1')

The IDs, however, are not unique: 524 places are known under the same ID `0` (which is about 1/4 of all stop places!):

In [None]:
r = stop_places.groupby(['StopPlaceName', 'StopPlaceID']).count().reset_index().groupby(['StopPlaceID']).count()['StopPlaceName']
print('IDs used for more than one name:\t\t', ', '.join(r.loc[r > 1].index))
print('Number of places with ambiguous ID:\t\t', r.loc[r > 1].sum())
print('Total number of places (unique by name):\t', r.sum())
print('Share of places with ambiguous ID:\t\t', r.loc['LU::StopPlace:0_CdT::'] / r.sum())

Let's see how `StopPlace` 0 occurs by topographic place. As expected, VdL has a very high ratio. But so do Vianden and Redange, for no apparent reason.

In [None]:
d = stop_places.groupby(['StopPlaceID', 'StopPlaceName', 'TopographicPlaceName']).count()
d = pd.concat({'CdT0': d.loc['LU::StopPlace:0_CdT::'].groupby(level=1).sum()['StopPlaceLatitude'],
               'all': d.groupby(level=2).sum()['StopPlaceLatitude']}, axis=1)
d = d.fillna(0)
d['ratio'] = d.eval('CdT0 / all')
d = d.sort_values('ratio', ascending=False)
d

There generally seems to be quite a dissimilar distribution between communes:

In [None]:
d['ratio'].plot()

Checking the same for quays:

In [None]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    
    <xsl:output method="xml"/>
    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    
    <xsl:template match="/nx:PublicationDelivery">   
        <Quay>
            <QuayID>(forgetme)</QuayID>
        </Quay>
        <xsl:apply-templates select="//nx:Quay" />
    </xsl:template>

    <xsl:template match="nx:Quay">   
        <Quay>
            <QuayID><xsl:value-of select="@id" /></QuayID>
            <QuayName><xsl:value-of select="nx:Name" /></QuayName>
            <QuayLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></QuayLatitude>
            <QuayLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></QuayLongitude>
            <xsl:apply-templates select="../.." />
        </Quay>
    </xsl:template>
    
    <xsl:template match="nx:StopPlace">   
        <StopPlaceID><xsl:value-of select="@id" /></StopPlaceID>
        <StopPlaceName><xsl:value-of select="nx:Name" /></StopPlaceName>
        <StopPlaceLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></StopPlaceLatitude>
        <StopPlaceLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></StopPlaceLongitude>
        <xsl:apply-templates select="key('places', nx:TopographicPlaceRef/@ref)" />
    </xsl:template>
    
    <xsl:template match="//nx:TopographicPlace">   
        <TopographicPlaceName><xsl:value-of select="nx:Name" /></TopographicPlaceName>
        <TopographicPlaceISOCode><xsl:value-of select="nx:IsoCode" /></TopographicPlaceISOCode>
    </xsl:template>

</xsl:stylesheet>
'''
quays = []
for file in z.filelist:
    with z.open(file) as h:
        res = pd.read_xml(h, stylesheet=style, xpath='/*')
        quays.append(res.drop(0))
quays = pd.concat(quays)

In [None]:
(quays['QuayName'] == quays['StopPlaceName']).sum() / len(quays)

In [None]:
len(quays.loc[quays['QuayName'] != quays['StopPlaceName']].QuayName.unique())

In [None]:
len(quays['StopPlaceName'].unique())

In [None]:
22/1847

stops

### Trying this manually

In [None]:
file = next(f for f in z.filelist if '623' in f.filename)

style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">

    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    
    <xsl:template match="/nx:PublicationDelivery">   
        <StopPlace>
            <StopPlaceID>(forgetme)</StopPlaceID>
        </StopPlace>
        <xsl:apply-templates select="//nx:StopPlace" />
    </xsl:template>



    <xsl:template match="nx:StopPlace">
        <StopPlace>
            <StopPlaceID><xsl:value-of select="@id" /></StopPlaceID>
            <StopPlaceName><xsl:value-of select="nx:Name" /></StopPlaceName>
            <StopPlaceLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></StopPlaceLatitude>
            <StopPlaceLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></StopPlaceLongitude>
            <xsl:apply-templates select="key('places', nx:TopographicPlaceRef/@ref)" />
        </StopPlace>
    </xsl:template>
    
    <xsl:template match="nx:TopographicPlace">   
        <TopographicPlaceName><xsl:value-of select="nx:Name" /></TopographicPlaceName>
        <TopographicPlaceISOCode><xsl:value-of select="nx:IsoCode" /></TopographicPlaceISOCode>
    </xsl:template>

</xsl:stylesheet>
'''
stop_places = []
stop_places.append(pd.read_xml(z.open(file), xpath='/*', stylesheet=style).drop(0))
stop_places = pd.concat(stop_places)


style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">

    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    
    <xsl:template match="/nx:PublicationDelivery">   
        <StopPoint>
            <StopPointID>(forgetme)</StopPointID>
        </StopPoint>
        <xsl:apply-templates select="//nx:ScheduledStopPoint" />
    </xsl:template>


    <xsl:template match="nx:ScheduledStopPoint">   
        <StopPoint>
            <StopPointID><xsl:value-of select="@id" /></StopPointID>
            <StopPointShortName><xsl:value-of select="nx:ShortName" /></StopPointShortName>
            <StopPointPublicCode><xsl:value-of select="nx:PublicCode" /></StopPointPublicCode>
            <StopPointStopType><xsl:value-of select="nx:StopType" /></StopPointStopType>
            <StopPointLatitude><xsl:value-of select="nx:Location/nx:Latitude" /></StopPointLatitude>
            <StopPointLongitude><xsl:value-of select="nx:Location/nx:Longitude" /></StopPointLongitude>
        </StopPoint>
    </xsl:template>

</xsl:stylesheet>
'''
stop_points = []
stop_points.append(pd.read_xml(z.open(file), xpath='/*', stylesheet=style).drop(0))
stop_points = pd.concat(stop_points)
    
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">

    <xsl:template match="/nx:PublicationDelivery">   
        <Quay>
            <QuayID>(forgetme)</QuayID>
        </Quay>
        <xsl:apply-templates select="//nx:Quay" />
    </xsl:template>

     <xsl:template match="nx:Quay">   
     <Quay>
        <QuayID><xsl:value-of select="@id" /></QuayID>
        <QuayName><xsl:value-of select="nx:Name" /></QuayName>
        <QuayLatitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Latitude" /></QuayLatitude>
        <QuayLongitude><xsl:value-of select="nx:Centroid/nx:Location/nx:Longitude" /></QuayLongitude>
        <QuayStopPlaceID><xsl:value-of select="../../@id" /></QuayStopPlaceID>
        <QuayStopPlaceName><xsl:value-of select="../../nx:Name" /></QuayStopPlaceName>
    </Quay>
    </xsl:template>

</xsl:stylesheet>
'''
with z.open(file) as h:
    quays = pd.read_xml(h, xpath='/*', stylesheet=style).drop(0)

    
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">
    <xsl:key name="stoppoints" match="//nx:ServiceFrame/nx:scheduledStopPoints/nx:ScheduledStopPoint" use="@id" />
    <xsl:key name="stopplaces" match="//nx:SiteFrame/nx:stopPlaces/nx:StopPlace" use="@id" />
    <xsl:key name="quays" match="//nx:SiteFrame/nx:stopPlaces/nx:StopPlace/nx:quays/nx:Quay" use="@id" />
    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    <xsl:key name="authorities" match="//nx:organisations/nx:Authority" use="@id" />

    <xsl:template match="/nx:PublicationDelivery">   
        <Stop>
            <ID>(forgetme)</ID>
        </Stop>
        <xsl:apply-templates select="//nx:PassengerStopAssignment" />
    </xsl:template>

    <xsl:template match="nx:PassengerStopAssignment">   
        <Stop>
            <ID><xsl:value-of select="@id" /></ID>
            <Order><xsl:value-of select="@order" /></Order>
            <stopPoint><xsl:value-of select="nx:ScheduledStopPointRef/@ref" /></stopPoint>
            <stopPlace><xsl:value-of select="nx:StopPlaceRef/@ref" /></stopPlace>
            <quay><xsl:value-of select="nx:QuayRef/@ref" /></quay>
            <authority><xsl:value-of select="key('authorities', nx:AuthorityRef/@ref)/nx:Name" /></authority>
            <xsl:apply-templates select="../../../../nx:ValidBetween" />
        </Stop>
    </xsl:template>
    
    <xsl:template match="nx:ValidBetween">
        <fromDate><xsl:value-of select="nx:FromDate" /></fromDate>
        <toDate><xsl:value-of select="nx:ToDate" /></toDate>
    </xsl:template>
</xsl:stylesheet>
'''
with z.open(file) as h:   
    stop_assignments = pd.read_xml(h, xpath='/*', stylesheet=style).drop(0, axis=0)

Let's check for uniqueness:

In [None]:
(quays.groupby(['QuayID']).count()['QuayName'] == 1).sum() / len(quays)

In [None]:
(stop_points.groupby(['StopPointID']).count()['StopPointShortName'] == 1).sum() / len(stop_points)

In [None]:
(stop_places.groupby(['StopPlaceID']).count()['StopPlaceName'] == 1).sum() / len(stop_places)
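
The same share can be computed without a `groupby`. The following is a minimal sketch on made-up toy data; `unique_share` is a hypothetical helper name, not something defined elsewhere in this notebook. It relies on `Series.duplicated(keep=False)`, which flags every row whose key occurs more than once, so the mean of the negation is the share of rows with a unique key.

```python
import pandas as pd


def unique_share(frame: pd.DataFrame, key: str) -> float:
    """Share of rows whose `key` value occurs exactly once in `frame`."""
    # keep=False marks all members of a duplicate group, not only the repeats
    return float((~frame[key].duplicated(keep=False)).mean())


# Toy data (made up): "Q1" appears twice, "Q2" once
toy = pd.DataFrame({"QuayID": ["Q1", "Q1", "Q2"], "QuayName": ["a", "a", "b"]})
share = unique_share(toy, "QuayID")

# The groupby formulation used in the cells above, for comparison
share_groupby = (toy.groupby(["QuayID"]).count()["QuayName"] == 1).sum() / len(toy)
```

One difference to keep in mind: `count()` only counts non-null names, so a row with a unique ID but a missing name is not counted as unique by the `groupby` version, whereas the sketch above counts it.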

So the assignment procedure will be:

1. There is a unique `stop_place` and/or `quay`, and we are done.
2. `stop_place` is zero, but there is a unique `quay`. This raises the question of whether `quay` 1, 2, 3 etc. are unique.
3. There is no `quay`, and `stop_place` is zero.
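
The three cases above can be sketched as a small classifier. This is only an illustration under stated assumptions: `assignment_case` and `_is_set` are hypothetical helpers, a missing stop place is treated like the zero placeholder, and the sketch checks presence only, not uniqueness. The stop place ID `LU::StopPlace:123_RGTR::` is made up.

```python
ZERO_STOP_PLACE = 'LU::StopPlace:0_CdT::'  # placeholder stop place ID seen in this data set


def _is_set(ref):
    # Missing values may arrive as NaN (a float) rather than None,
    # so only non-empty strings count as a reference
    return isinstance(ref, str) and ref != ''


def assignment_case(stop_place, quay):
    """Return 1, 2 or 3 following the numbered cases of the assignment procedure."""
    if _is_set(stop_place) and stop_place != ZERO_STOP_PLACE:
        return 1  # a real stop place is referenced
    if _is_set(quay):
        return 2  # zero (or, by assumption, missing) stop place, but a quay is referenced
    return 3      # neither is usable, another strategy is needed


cases = [
    assignment_case('LU::StopPlace:123_RGTR::', None),             # made-up ID
    assignment_case(ZERO_STOP_PLACE, 'LU::Quay:300034007_CdT::'),
    assignment_case(ZERO_STOP_PLACE, float('nan')),
]
```

Applied to the real table, this could look like `stop_assignments.apply(lambda r: assignment_case(r.stopPlace, r.quay), axis=1)`, followed by `value_counts()` to see how many assignments fall into each case.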

In [None]:
stop_assignments.iloc[:2]

In [None]:
quays.query('QuayID=="LU::Quay:300034007_CdT::"')

In [None]:
stop_assignments.iloc[:3]

In [None]:
stop_points.query('StopPointID=="LU::ScheduledStopPoint:12140101_RGTR_::"')

In [None]:
stop_places.loc[stop_places['StopPlaceName'].str.match('Cessange, Barrès')]

In [None]:
# Look up the stop point once instead of repeating the query
point = stop_points.query('StopPointID=="LU::ScheduledStopPoint:12140101_RGTR_::"').iloc[0]
i = ((stop_places['StopPlaceLatitude'] - point['StopPointLatitude'])**2
     + (stop_places['StopPlaceLongitude'] - point['StopPointLongitude'])**2).idxmin()
stop_places.loc[i]

So this seems to generally work. Let's see if finding the closest point works for all stops / points:

In [None]:
def distance(place):
    dist = stop_points.apply(
        lambda p: ((p.StopPointLatitude - place.StopPlaceLatitude)**2
                   + (p.StopPointLongitude - place.StopPlaceLongitude)**2)
        , axis=1)
    return stop_points.loc[dist.idxmin()].StopPointID

filt = stop_places['StopPlaceID']=='LU::StopPlace:0_CdT::'
stop_places.loc[filt, 'point'] = stop_places.loc[filt].apply(distance, axis=1)

stop_places.merge(stop_points, left_on=['point'], right_on=['StopPointID'])[['StopPlaceName', 'StopPointShortName']]

Looking good
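
The cells above rank candidates by squared differences in degrees. At Luxembourg's latitude a degree of longitude is considerably shorter than a degree of latitude, so this metric can occasionally pick a different neighbour than a metric distance would. The notebook already imports `haversine`; the sketch below implements the haversine formula with the standard library only, so that it stands alone. The coordinates and the helper names `haversine_km`, `squared_degrees` and `nearest_id` are made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def squared_degrees(lat1, lon1, lat2, lon2):
    # The metric used in the cells above
    return (lat1 - lat2) ** 2 + (lon1 - lon2) ** 2


def nearest_id(lat, lon, candidates, metric):
    """Return the id of the (id, lat, lon) candidate that minimises `metric`."""
    return min(candidates, key=lambda c: metric(lat, lon, c[1], c[2]))[0]


# Made-up coordinates near Luxembourg City:
# A is offset in longitude only, B in latitude only
candidates = [('A', 49.60, 6.143), ('B', 49.61, 6.13)]
by_degrees = nearest_id(49.60, 6.13, candidates, squared_degrees)
by_km = nearest_id(49.60, 6.13, candidates, haversine_km)
```

In this toy case the degree-based metric picks `B`, while the haversine distance picks `A`. For stops that are only a few metres apart from their match, both metrics will usually agree.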

## Playing with actual init data

In [None]:
dta = pd.read_csv(r'P:\PM\Proj\18005 - Vitesses Bus RGTR\Doc\data-init-bus-hors-service\20221108180002_LUXGPS.csv.zip', sep=';')

### Lines

Lines do match, except for a bunch of CFL lines and ' TICE20000'

In [None]:
lines['codeInit'] = lines['authority'].replace('CFL_Bus', 'CFL').str.cat(lines['shortName'].astype('str').str.pad(2, fillchar='0'))
d = dta.groupby('LINIE').count().reset_index().merge(lines, left_on='LINIE', right_on='codeInit', how='left').query('name.isnull()')
d[['LINIE']]

Those CFL lines actually seem to be RGTR lines that have been rewritten. Manual analysis of a few of the corresponding XML files shows that they contain no timetable information, or stops for that matter. This seems to be a data handling problem between CFL and ATP.

In [None]:
lines.query('codeInit=="RGTR201"')

### Halts

Get the `StopPoint` records of all files:

In [None]:
style = '''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:nx="http://www.netex.org.uk/netex">

    <xsl:key name="places" match="//nx:SiteFrame/nx:topographicPlaces/nx:TopographicPlace" use="@id" />
    
    <xsl:template match="/nx:PublicationDelivery">   
        <StopPoint>
            <StopPointID>(forgetme)</StopPointID>
        </StopPoint>
        <xsl:apply-templates select="//nx:ScheduledStopPoint" />
    </xsl:template>


    <xsl:template match="nx:ScheduledStopPoint">   
        <StopPoint>
            <StopPointID><xsl:value-of select="@id" /></StopPointID>
            <StopPointShortName><xsl:value-of select="nx:ShortName" /></StopPointShortName>
            <StopPointPublicCode><xsl:value-of select="nx:PublicCode" /></StopPointPublicCode>
            <StopPointStopType><xsl:value-of select="nx:StopType" /></StopPointStopType>
            <StopPointLatitude><xsl:value-of select="nx:Location/nx:Latitude" /></StopPointLatitude>
            <StopPointLongitude><xsl:value-of select="nx:Location/nx:Longitude" /></StopPointLongitude>
        </StopPoint>
    </xsl:template>

</xsl:stylesheet>
'''
stop_points = []
for file in z.filelist:
    # Close each archive member after parsing, as in the cells above
    with z.open(file) as h:
        stop_points.append(pd.read_xml(h, xpath='/*', stylesheet=style).drop(0))
stop_points = pd.concat(stop_points)
stop_points.drop_duplicates(inplace=True)

Let's try merging with `dta`. `dta.HALT` has no operator code. We know that `StopPointID` values are not unique by their numeric code alone, i.e. without the "operator suffix".

In [None]:
# Raw string, so that \d is passed to the regex engine and not treated as a string escape
d = stop_points['StopPointID'].str.extract(r'LU::ScheduledStopPoint:(?P<StopID>\d+)_(?P<Operator>[A-Z]+)_.*::')
stop_points['HALT'] = d['StopID'].astype('int')
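
To see what the pattern does on a single ID, here is a minimal standard-library sketch. `split_stop_point_id` is a hypothetical helper, and the second ID is made up to show that a suffix containing lowercase letters (as in the `_CdT` quay IDs seen earlier) would not match `[A-Z]+`. In the pandas version, such a row would yield NaN and make `astype('int')` fail.

```python
import re

STOP_POINT_ID = re.compile(
    r'LU::ScheduledStopPoint:(?P<StopID>\d+)_(?P<Operator>[A-Z]+)_.*::'
)


def split_stop_point_id(stop_point_id):
    """Return (numeric code, operator suffix), or None if the ID does not match."""
    m = STOP_POINT_ID.match(stop_point_id)
    if m is None:
        return None
    return int(m['StopID']), m['Operator']


# ID taken from the stop point table shown earlier in the notebook
parsed = split_stop_point_id('LU::ScheduledStopPoint:16160119_RGTR_::')
# Made-up ID with a mixed-case suffix: the pattern does not match
unparsed = split_stop_point_id('LU::ScheduledStopPoint:123_CdT_::')
```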

There are 86 stop codes that are shared by several stop points, which differ by operator suffix:

In [None]:
(stop_points.groupby('HALT').count()['StopPointID'] > 1).sum()

All that means is that matching `dta` onto `stop_points` is going to be many-to-many, and we will not be able to retrieve the correct `StopPoint` object, i.e. the one pointing to the right operator `StopPlace` with the authority set correctly. Still, on the code alone, most halts can be matched successfully. Most mismatches are CFL, particularly those lines operated by CFL as RGTR services, which is almost certainly related to the aforementioned data exchange problem. Apart from those, there are only two RGTR lines and the suspicious `TICE0`.
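
The effect of such a many-to-many match can be illustrated on toy data. This is a sketch with made-up frames (`obs` stands in for `dta`, `stops` for `stop_points`, and the IDs are invented); it uses the `indicator=True` option of `DataFrame.merge` to tell unmatched observations apart from matched ones.

```python
import pandas as pd

# Toy stand-ins (made up): observations carry only the numeric code
obs = pd.DataFrame({"HALT": [1, 1, 2, 3], "ZEIT": ["t1", "t2", "t3", "t4"]})
stops = pd.DataFrame({
    "HALT": [1, 2, 2],
    "StopPointID": [
        "LU::ScheduledStopPoint:1_RGTR_::",
        "LU::ScheduledStopPoint:2_RGTR_::",
        "LU::ScheduledStopPoint:2_CFL_::",
    ],
})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join
merged = obs.merge(stops, on="HALT", how="left", indicator=True)

# Codes that several stop points share
ambiguous = stops.loc[stops["HALT"].duplicated(keep=False), "HALT"].unique()
# Extra rows created because an observation matched more than one stop point
inflation = len(merged) - len(obs)
# Observed codes with no known stop point at all
unmatched = merged.loc[merged["_merge"] == "left_only", "HALT"].tolist()
```

Note that every column of the observation is repeated for each matching stop point, so any count taken after such a merge (for instance a count of `ZEIT`) is inflated for ambiguous codes.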

In [None]:
d = (dta.loc[~dta.HALT.isnull()]
     .merge(stop_points, left_on=['HALT'], right_on=['HALT'], how='left')
     .groupby(['LINIE', 'HALT'])
     .agg({'StopPointID': 'count', 'ZEIT': 'count'})
     .rename(columns={'StopPointID': 'known_stops', 'ZEIT': 'observed_stops'})
    )
d.query('known_stops==0')[['observed_stops']]

Here is a more condensed representation, comparing the number of stops overall to the number of unknown stops observed:

In [None]:
d.groupby(level=0).agg({'known_stops': lambda r: (r == 0).sum(), 'observed_stops': 'count'}).rename(columns={'known_stops': 'unknown_stops'}).query('unknown_stops > 0').T

Since we are matching purely by the numeric part of the ID, I am pretty confident that those codes do indeed not exist, not under any operator. So yeah, data governance!

In [None]:
stop_places