# Creating the data model

In [1]:
# Loading necessary libraries, some must first be installed using pip

# System and File Management
import sys
import os

# Data Handling 
import pandas as pd

# Geospatial Analysis
import geopandas as gpd
from shapely.geometry import Point
import haversine 

# Network Analysis
import networkx as nx

# Data Loading and Connection
import json

We now describe in more detail how the collected data was transformed into a data model representing network snapshots of Berlin's public transportation system.

### Data Cleaning and Reconciliation 

As mentioned previously, the information gathered from the Fahrplanbücher was transcribed by hand into a series of spreadsheets, one for each year. The raw spreadsheets were then processed using Python's Pandas library, which allows for organising and manipulating the data in table-like structures called dataframes. This had to be done in separate steps. After extracting the station names, OpenRefine was used to reconcile them with existing Wikidata objects. This reconciliation aimed to standardise identification and enable potential future connections to the semantic web. We reconciled specifically with Wikidata objects of the types Berlin S-Bahn station (Q110977521), Berlin U-Bahn station (Q110977120) and tram stop (Q2175765). We then followed the geolocalisation procedure described earlier in section 2.1.1.

Once all stations mentioned in the data for a year had been localised, we divided the data into three linked tables: (1) the list of stations and their attributes (coordinates, type of station, Wikidata identifiers, and the lines each station belongs to); (2) the list of lines and their attributes (year, type, length in minutes and km where possible, start and end stations); and (3) the stop-order table, which references the other two tables by their unique ids and includes a column called "stop-order". This column holds a numeric value that starts at 1 for each line and increases by 1 for each station on that line; when a new line id is referenced, the value reverts to 1 before increasing again. These tables are completed for each snapshot. Once the data for all snapshots had been completed, we merged the tables; the results can be found in the final-tables folder.
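The relationship between the three tables can be illustrated with a minimal sketch. The ids, names and values below are invented, and the column names are assumptions modelled on those used later in this notebook; the real tables hold more attributes.

```python
# Hypothetical miniature versions of the three linked tables.
# All ids, names and values are invented for illustration.
stations = {
    "s1": {"stop_name": "Station A", "coordinate_location": "52.5200,13.4050"},
    "s2": {"stop_name": "Station B", "coordinate_location": "52.5250,13.4100"},
    "s3": {"stop_name": "Station C", "coordinate_location": "52.5300,13.4200"},
}
lines = {
    "l1": {"line_name": "A", "year": 1946, "type": "u-bahn"},
    "l2": {"line_name": "46", "year": 1946, "type": "strassenbahn"},
}
# One row per (line, station) pair; stop_order restarts at 1 for every line.
stop_order = [
    {"line_id": "l1", "stop_id": "s1", "stop_order": 1},
    {"line_id": "l1", "stop_id": "s2", "stop_order": 2},
    {"line_id": "l1", "stop_id": "s3", "stop_order": 3},
    {"line_id": "l2", "stop_id": "s2", "stop_order": 1},
    {"line_id": "l2", "stop_id": "s3", "stop_order": 2},
]


def join_tables(stations, lines, stop_order):
    """Resolve both foreign keys of each stop-order row into one flat record."""
    rows = []
    for entry in stop_order:
        row = dict(entry)
        row.update(stations[entry["stop_id"]])
        row.update(lines[entry["line_id"]])
        rows.append(row)
    return rows


flat = join_tables(stations, lines, stop_order)
for row in flat:
    print(row["line_name"], row["stop_order"], row["stop_name"])
```

Each flat record carries everything needed to place one stop on one line, which is the shape of dataframe that the graph construction further down iterates over.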

### Database Integration

To manage and analyse our data efficiently, we are integrating it into a custom-designed database. This database allows for powerful analysis and makes the data easier to share with other researchers in the future. The CSV files could have been used directly to create our modelled network; however, we decided that the project would benefit from uploading them as three separate tables into a SQL database. The main benefit is that the structure of the three linked tables is ideal for a relational database, in this case MySQL. We can easily write queries that traverse the connections, such as "Find stations that were added between 1960 and 1965 and belonged to bus lines". This would be cumbersome to do with multiple CSV files. Query efficiency is also improved, as SQL databases are optimised to handle large datasets and complex queries. In addition, we intend to make the data publicly available in the future to allow for reuse, collaboration and improvement. With standardised access mechanisms, a SQL database will be easier to work with in this regard than CSV files.

In this geographically referenced database approach, the project also brings in a GIS component. By modelling the geographic data in a common frame of reference, namely decimal coordinates, we hope to add data to the field of Historical GIS. We are here following the call of Jordi Marti-Henneberg to produce locational-temporal databases as a justifiable, self-contained project with its own merit.[^47] The goal is to allow multiple disciplines and interests to be pursued using the data we have prepared in this project.

Currently, we operate a local MySQL instance. The finalised CSV files are imported as individual tables within the database. We establish key references between the tables based on the shared node and line IDs. This structure enables us to query the database efficiently and extract the data needed for network generation. The query result is a comprehensive dataframe containing all relevant information.
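As a rough illustration of the kind of query this key referencing makes possible, the sketch below rebuilds a tiny version of the three tables and asks for stations first served by a bus line between 1960 and 1965. It uses Python's built-in `sqlite3` module in place of MySQL so that it runs without a server, and the table names, column names and rows are all assumptions made for the example rather than the project's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the project's MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stations (stop_id INTEGER PRIMARY KEY, stop_name TEXT);
CREATE TABLE lines (line_id INTEGER PRIMARY KEY, line_name TEXT, type TEXT, year INTEGER);
CREATE TABLE stop_order (
    line_id INTEGER REFERENCES lines(line_id),
    stop_id INTEGER REFERENCES stations(stop_id),
    stop_order INTEGER
);
INSERT INTO stations VALUES (1, 'Stop A'), (2, 'Stop B'), (3, 'Stop C');
INSERT INTO lines VALUES (10, 'A1', 'bus', 1956), (11, 'A2', 'bus', 1961),
                         (12, '3', 'strassenbahn', 1961);
INSERT INTO stop_order VALUES (10, 1, 1), (10, 2, 2), (11, 2, 1), (11, 3, 2),
                              (12, 1, 1), (12, 3, 2);
""")

# Stations whose earliest appearance on any bus line falls in 1960-1965.
query = """
SELECT s.stop_name, MIN(l.year) AS first_year
FROM stations AS s
JOIN stop_order AS o ON o.stop_id = s.stop_id
JOIN lines AS l ON l.line_id = o.line_id
WHERE l.type = 'bus'
GROUP BY s.stop_id, s.stop_name
HAVING MIN(l.year) BETWEEN 1960 AND 1965
ORDER BY s.stop_name
"""
added_bus_stops = conn.execute(query).fetchall()
print(added_bus_stops)
conn.close()
```

The two joins follow the stop-order table from stations to lines, which is the traversal that would otherwise require merging several CSV files by hand.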

[^47]: Jordi Marti-Henneberg, Geographical Information Systems and the Study of History, in: The Journal of Interdisciplinary History 42, 2011, pg. 11.

In [2]:
# code to open file with queried data
# for use when no access to database possible

df = pd.read_csv('queried_data.csv')

### Data Enrichment

The georeferenced nature of our network allows for the integration of a variety of external geographic datasets. For example, we have already integrated Berlin district boundaries to assign each node its respective district. Additionally, we could incorporate historical demographic data, land-use maps, or infrastructure information. These integrations will enable us to analyse the transportation network in relation to urban development patterns, socioeconomic factors, and other relevant variables. In section 3.2, we will explore how this enriched dataset, along with demographic data, can be used to address specific research questions about the evolution of Berlin's transport system.

In [3]:
# Convert a "lat,lon" string into a Shapely Point, which expects (x, y) = (lon, lat)
def extract_coords(coord_str):
    lat, lon = map(float, coord_str.split(','))
    return Point(lon, lat)

# Load station data and apply the `extract_coords` function to create geometries
# The coordinates are decimal degrees, so we declare EPSG:4326 to match the district data
stations_gdf = gpd.GeoDataFrame(df, geometry=df['coordinate_location'].apply(extract_coords), crs="EPSG:4326")

# Load district GeoJSON data
districts_gdf = gpd.read_file("data-external/lor_ortsteile.geojson")

# Read the GeoJSON file and extract ortsteil values
with open('data-external/lor_ortsteile.geojson') as f:
    data = json.load(f)

ortsteil_values = []
for feature in data['features']:
    ortsteil_values.append(feature['properties']['OTEIL'])

# Get unique values
unique_oteil_values = list(set(ortsteil_values))

# Create a dictionary mapping bezirk (district) to a list of its ortsteil (subdistricts)
bezirk_to_ortsteil = {}
for feature in data['features']:
    oteil = feature['properties']['OTEIL']
    bezirk = feature['properties']['BEZIRK']
    bezirk_to_ortsteil.setdefault(bezirk, []).append(oteil)

# Perform a spatial join between stations and districts
# how="left" keeps every station; stations outside all districts get NaN district values
result_gdf = gpd.sjoin(stations_gdf, districts_gdf, how="left", predicate='within')

result_gdf = result_gdf.drop(["index_right"], axis=1)

# dropping unnecessary columns
df = result_gdf.drop(["gml_id", "spatial_name", "spatial_alias", "spatial_type", "FLAECHE_HA"], axis=1)


In [4]:
# Collect all unique values in the column 'OTEIL'
ortsteile = list(df['OTEIL'].unique())

In [5]:
# Load districts in West Berlin
with open("data-external/West-Berlin-Ortsteile.json", "r") as infile:
    West_Berlin_Ortsteile = json.load(infile)
    West_Berlin_Ortsteile = West_Berlin_Ortsteile["West_Berlin"]

In [6]:
ortsteile_ost = [ort for ort in ortsteile if ort not in West_Berlin_Ortsteile] 
# print leftover Ortsteile to validate
print(ortsteile_ost)  

['Mitte', 'Friedrichshain', 'Lichtenberg', 'Rummelsburg', 'Rosenthal', 'Niederschönhausen', 'Pankow', 'Prenzlauer Berg', 'Schöneberg', 'Neukölln', 'Alt-Treptow', 'Weißensee', 'Alt-Hohenschönhausen', 'Fennpfuhl', 'Friedrichsfelde', 'Karlshorst', 'Biesdorf', 'Kaulsdorf', 'Mahlsdorf', nan, 'Blankenfelde', 'Müggelheim', 'Köpenick', 'Konradshöhe', 'Oberschöneweide', 'Heinersdorf', 'Grünau', 'Schmöckwitz', 'Plänterwald', 'Baumschulenweg', 'Friedrichshagen', 'Niederschöneweide', 'Johannisthal', 'Adlershof', 'Buch', 'Karow', 'Blankenburg', 'Marzahn', 'Französisch Buchholz', 'Altglienicke', 'Lübars', 'Neu-Hohenschönhausen', 'Falkenberg', 'Wartenberg', 'Wilhelmsruh', 'Märkisches Viertel', 'Malchow', 'Stadtrandsiedlung Malchow', 'Hellersdorf']


In [7]:
# Create a new column 'east-west' and assign values based on OTEIL
df['east-west'] = df['OTEIL'].apply(lambda x: 'west' if x in West_Berlin_Ortsteile else 'east')

### Network Model

We will employ an L-space graph model to represent Berlin's public transportation network. This model is particularly well suited because it accurately captures real-world service patterns: stations are connected by an edge if they are consecutive stops on the same line.[^48] Other models are used for public transportation systems; however, these add layers of abstraction that are not inherently beneficial for our investigation.

Our model will also be a multi-layer L'-space graph, as we will incorporate different transportation types as distinct node and edge categories. One can think of each modelled transportation system as another layer in our network. This type of model is much less common for public transportation systems than L-space graphs, which have only one link type.[^49] This multi-layer approach provides a holistic view of the entire network.
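The construction rule can be made concrete with a small sketch that does not depend on any graph library. The line names and stop labels are hypothetical; the point is only that each pair of consecutive stops yields one edge, and that the edge keeps its transportation type so that layers remain distinguishable.

```python
def l_space_edges(lines):
    """Return one edge per pair of consecutive stops on each line.

    `lines` maps a line name to (transport type, ordered list of stops).
    """
    edges = []
    for line_name, (mode, stops) in lines.items():
        for source, target in zip(stops, stops[1:]):
            edges.append((source, target, {"label": line_name, "type": mode}))
    return edges


# Two hypothetical lines that share the section B-C.
example_lines = {
    "U-example": ("u-bahn", ["A", "B", "C"]),
    "T-example": ("strassenbahn", ["B", "C", "D"]),
}
example_edges = l_space_edges(example_lines)
for source, target, attrs in example_edges:
    print(source, target, attrs["type"])
```

The section B-C appears twice, once per transportation type, which is why the notebook uses a multigraph. Stops A and C are not linked directly, because they are not consecutive on any line.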

When constructing our L'-space graph, we will also add node and edge attributes from the data available in the dataframe. These are the node type, labels and years, as well as the edge types, labels, years and capacity. We will also add edge distance, calculated as a straight line in kilometres. This metric is inherently flawed, particularly for tram and bus routes, where the restriction to the road network means that the actual distance could be considerably longer.
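For readers unfamiliar with how a straight-line distance is obtained from decimal coordinates, the following sketch implements the haversine formula using only the standard library. The notebook itself relies on the `haversine` package; the function name `haversine_km` and the mean Earth radius used as a default are assumptions of this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# One degree of latitude along a meridian, as a sanity check.
one_degree = haversine_km(52.0, 13.4, 53.0, 13.4)
print(round(one_degree, 2))
```

Because the result ignores the street or track layout, it should be read as a lower bound on the distance actually travelled between two stops.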

[^48]: Nam Huynh, Johan Barthélemy, A comparative study of topological analysis and temporal network analysis of a public transport system, in: International Journal of Transportation Science and Technology, 2022, pg. 393.

[^49]: C. von Ferber et al., Network harness: Metropolis public transport, in: Physica A: Statistical Mechanics and Its Applications 380, 2007, pg. 586.

In [8]:
# Construction of initial L' space graph

# distinction between Klein- and Großprofil lines for the U-Bahn network
Kleinprofil = {'1', '2', '3', '4', 'A', 'A I', 'A II', 'A III', 'A1', 'A2', 'B', 'B I', 'B II', 'B III', 'B1', 'B2'}
Großprofil = {'5', '6', '7', '8', '9', 'C', 'C I', 'C II', 'D', 'E', 'G'}

def create_network_graph(df):
    # G used to save graph
    G = nx.MultiGraph()

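    # Add nodes for each stop_id with coordinates as attributes
    for index, row in df.iterrows():
        stop_id = row["stop_id"]
        coordinates = row["coordinate_location"].split(",")
        year = row["year"]
        node_type = row["type"]  # renamed to avoid shadowing the built-in `type`
        east_west = row["east-west"]
        neighbourhood = row["OTEIL"]
        district = row["BEZIRK"]

        # coordinates are stored as "lat,lon"; fail early on malformed values
        # instead of silently reusing the previous row's coordinates
        if len(coordinates) != 2:
            raise ValueError(f"stop {stop_id}: cannot parse coordinates {row['coordinate_location']!r}")
        coordinate_y = float(coordinates[0])  # latitude
        coordinate_x = float(coordinates[1])  # longitude

        # adding nodes with attributes
        G.add_node(stop_id, x=coordinate_x, y=coordinate_y, year=year, type=node_type, east_west=east_west, neighbourhood=neighbourhood, district=district)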

    # Add edges
    for line in set(df["line_id"]):
        df_line = df[df["line_id"] == line].sort_values("stop_order")
        line_name = df_line.iloc[0]["line_name"]  # get line name associated with line_id
        
        for i in range(len(df_line) - 1):
            source = df_line.iloc[i]["stop_id"]
            target = df_line.iloc[i + 1]["stop_id"]
            edge_type = df_line.iloc[i]["type"]
            line_id = df_line.iloc[i]["line_id"]
            year = df_line.iloc[i]["year"]
            frequency = df_line.iloc[i]["frequency"]
            east_west = df_line.iloc[i]["east_west"]
            
            # Create a dictionary with edge attributes, including "type"
            edge_attributes = {
                "key": line_id,
                "label": line_name,
                "type": edge_type,
                "year": year,
                "frequency": frequency,
                "east_west": east_west
            }

            # Determine capacity based on "type" and "line_name"
            if edge_type == "u-bahn" and line_name in Kleinprofil: # different for Großprofil and Kleinprofil lines
                edge_attributes["capacity"] = 750
            elif edge_type == "u-bahn" and line_name in Großprofil: # different for Großprofil and Kleinprofil lines
                edge_attributes["capacity"] = 1000
            elif edge_type == "s-bahn":
                edge_attributes["capacity"] = 1100
            elif edge_type == "strassenbahn":
                edge_attributes["capacity"] = 195
            elif edge_type == "bus" or edge_type == "bus (Umlandlinie)":
                edge_attributes["capacity"] = 100
            elif edge_type in ("Fähre", "FÃ¤hre"):  # ferry; also matches the mis-encoded spelling
                edge_attributes["capacity"] = 100
            
            G.add_edge(source, target, **edge_attributes)

    return G


In [9]:
# function to add distances between nodes connected by edge 
# useful for calculation of total line distance and sanity check
def add_distance_attribute(graph):
    for u, v, data in graph.edges(data=True):
        u_coord = (graph.nodes[u]["y"], graph.nodes[u]["x"])
        v_coord = (graph.nodes[v]["y"], graph.nodes[v]["x"])
        distance = haversine.haversine(u_coord, v_coord, unit="km")
        data["distance"] = distance
        data["edge_type"] = data["type"]


    return graph

In [10]:
# Update the frequency where type is "s-bahn" and frequency is 0
# handling missing information regarding s-bahn lines
df.loc[(df['type'] == 's-bahn') & (df['frequency'] == 0), 'frequency'] = 20

In [11]:
# create network graph
G = create_network_graph(df)

In [12]:
def add_graph_attributes(G, df):
    #set node labels to stop names
    stop_names = {row["stop_id"]: row["stop_name"] for index, row in df.iterrows()}
    nx.set_node_attributes(G, stop_names, "node_label")

    return G

# add node labels
G = add_graph_attributes(G, df)

In [13]:
G = add_distance_attribute(G)

Our network snapshots have now been created. However, there is some uncertainty about whether stations appearing in multiple snapshots truly represent the same physical location. Minor variations in station names, localisation inaccuracies, or slight real-world shifts in station position over time can create this issue.

To ensure that we can accurately analyse how a station's role in the network has changed, we can merge nodes that represent the same station across different years into a single node. This approach is in line with TVG modelling of networks. It is common practice to have an "underlying graph", which flattens the time dimension and indicates all nodes that have relations at some time in T.[^50] We approach this by first rounding the decimal coordinates of the nodes to three decimal places and then grouping across snapshots using only the location and the type of node. Merging nodes across snapshots also lets us examine the persistence of nodes, which will provide further insight into the stability of the system we are investigating. We do not merge edges across time periods because the frequency values are specific to a snapshot.
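The grouping step can be sketched on a handful of invented nodes. The ids and coordinates below are hypothetical, and `merge_key` is a name introduced only for this example. For orientation, 0.001 degrees of latitude corresponds to roughly 111 m, so stations closer together than that may end up with the same key.

```python
from collections import defaultdict

# Hypothetical snapshot nodes; ids and coordinates are invented.
snapshot_nodes = [
    {"stop_id": "1946-001", "x": 13.41321, "y": 52.52191, "type": "u-bahn", "year": 1946},
    {"stop_id": "1951-001", "x": 13.41302, "y": 52.52214, "type": "u-bahn", "year": 1951},
    {"stop_id": "1951-002", "x": 13.41302, "y": 52.52214, "type": "strassenbahn", "year": 1951},
    {"stop_id": "1951-003", "x": 13.42961, "y": 52.51083, "type": "u-bahn", "year": 1951},
]


def merge_key(node, digits=3):
    """Key shared by nodes of one type whose rounded coordinates coincide."""
    return (round(node["x"], digits), round(node["y"], digits), node["type"])


merged = defaultdict(lambda: {"stop_ids": set(), "years": set()})
for node in snapshot_nodes:
    entry = merged[merge_key(node)]
    entry["stop_ids"].add(node["stop_id"])
    entry["years"].add(node["year"])

for key, entry in merged.items():
    # The number of distinct years is a simple measure of persistence.
    print(key, sorted(entry["years"]), len(entry["years"]))
```

The two U-Bahn records from 1946 and 1951 collapse into one node present in both years, while the tram stop at the same location stays separate because the node type is part of the key.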

[^50]: A. Casteigts et al, pg. 350.

In [14]:
# Combine nodes based on x, y, type and save to dictionary, add year and node labels as attributes
combined_nodes = {}
for node in G.nodes(data=True):
    key = (round(node[1]['x'], 3), round(node[1]['y'], 3), node[1]['type']) 
    if key not in combined_nodes:
        combined_nodes[key] = {'node_labels': set(), 'year': set(), "east_west": set(), "postcode": set(), "neighbourhood": set(), "district": set()}
    combined_nodes[key]['node_labels'].add(node[1]['node_label'])
    combined_nodes[key]['year'].add(node[1]['year'])
    combined_nodes[key]['east_west'].add(node[1]['east_west'])
    combined_nodes[key]['neighbourhood'].add(node[1]['neighbourhood'])
    combined_nodes[key]['district'].add(node[1]['district'])

# Create a new graph with combined nodes
H = nx.MultiGraph()
# using MultiGraph to have parallel edges for each year & line because they all have their own frequencies
for key, data in combined_nodes.items():
    H.add_node(key, x=key[0], y=key[1], type=key[2], node_labels=list(data['node_labels']), years=list(data['year']), east_west=data["east_west"], neighbourhood=data["neighbourhood"], district=data["district"])

# Add edges to the new graph 
for edge in G.edges(data=True):
    node1, node2, edge_attrs = edge
    key1 = (round(G.nodes[node1]['x'], 3), round(G.nodes[node1]['y'], 3), G.nodes[node1]['type'])
    key2 = (round(G.nodes[node2]['x'], 3), round(G.nodes[node2]['y'], 3), G.nodes[node2]['type'])

    if key1 in combined_nodes and key2 in combined_nodes:
        H.add_edge(key1, key2, key=str(edge_attrs), weight=1, **edge_attrs) 

In [15]:
H = add_distance_attribute(H)

In [16]:
for u, v, d in H.edges(data=True):
    if H.nodes[u]['east_west'] == H.nodes[v]['east_west']:
        value = H.nodes[u]['east_west']  # a set of values, e.g. {'east'} or {'west'}
        d["traverses"] = str(value)
    else:
        d["traverses"] = str(d["east_west"])

### Addressing Data Uncertainties

In theory, we should be left with the same number of edges but fewer nodes. We see, however, that there are now fewer edges in our network: 26,763 in H against 26,867 in G, a loss of 104. The most probable reason is that multiple nodes in G that share the same rounded attributes are merged into a single node in H. An edge that connected two such nodes in G then joins a node to itself in H, and parallel edges that become identical after merging are stored only once. The following code shows that H contains 313 self-loops where G had none, roughly three times the number of edges we lost, which supports merging as the cause of the discrepancy.

In [17]:
print(G.number_of_nodes())
print(G.number_of_edges())
print(H.number_of_nodes())
print(H.number_of_edges())

19514
26867
4031
26763


In [18]:
# check for self-loops that are created by removing specificity in coordinates
self_loops_G = list(nx.selfloop_edges(G))
self_loops_H = list(nx.selfloop_edges(H))
print(len(self_loops_G))
print(len(self_loops_H))

0
313


The impact of this is negligible when considering the total number of edges in our network. The fault lies originally in the incorrect localisation of the stations. Although few edges were lost, accurate analysis requires us to track down these self-loops, as they indicate potential errors that could skew results about a station's connectivity over time.
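One possible way of tracking these cases down is to list the edges of the original graph whose two distinct endpoints receive the same merged key. The sketch below works on plain dictionaries rather than the notebook's graph objects; the stop ids, coordinates and function names are hypothetical.

```python
def merge_key(node, digits=3):
    """Rounded coordinates plus node type, mirroring the merge rule."""
    return (round(node["x"], digits), round(node["y"], digits), node["type"])


def collapsed_edges(nodes, edges, digits=3):
    """List edges whose two distinct endpoints share one merged key."""
    found = []
    for source, target in edges:
        same_key = merge_key(nodes[source], digits) == merge_key(nodes[target], digits)
        if source != target and same_key:
            found.append((source, target))
    return found


# Hypothetical tram stops; the first two lie only a few dozen metres apart.
tram_nodes = {
    "stop-1": {"x": 13.41312, "y": 52.52204, "type": "strassenbahn"},
    "stop-2": {"x": 13.41341, "y": 52.52218, "type": "strassenbahn"},
    "stop-3": {"x": 13.41900, "y": 52.52500, "type": "strassenbahn"},
}
tram_edges = [("stop-1", "stop-2"), ("stop-2", "stop-3")]

suspects = collapsed_edges(tram_nodes, tram_edges)
print(suspects)
```

Each pair returned is a candidate for manual review: either the two stops are genuinely very close together, or one of them was localised incorrectly. Raising `digits` to 4 separates the first two stops in this example, at the cost of merging fewer stations across snapshots.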

### Weight Calculation

We now turn to the issue of assigning a unitary weight to our edges. In transportation networks, edge weights help us understand the relative importance of routes. Higher weights might indicate higher passenger traffic or greater strategic value to the network. Other public transportation networks have assigned edge weights based on the capacity of the vehicle, the distance between nodes or the number of overlapping services.[^51] Our project has two values that can be used to measure weight: capacity and frequency. The literature on public transportation network modelling offers no set way to calculate an edge weight, so we will test two different figures. The first is passengers-per-hour, the number of passengers that could be transported in an hour. To calculate it, we derive the number of services per hour from the frequency, which the code treats as the interval in minutes between services, and multiply this number by the capacity of the transportation type. The assumption here is that the frequency holds for a full hour. Based on the available data, however, this is not the case, as frequency changes occur at irregular times. The figure is therefore very imprecise, given that the frequency might have changed a minute before or after our 7:30 mark.
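A short worked example may help to make the calculation tangible. It assumes, as the notebook's code does, that the frequency value is the interval in minutes between services, and it reuses the capacity figures assigned during graph construction. The function name is introduced here for illustration only.

```python
def passengers_per_hour(interval_minutes, capacity):
    """Services per hour (60 / interval in minutes) times vehicle capacity."""
    if interval_minutes == 0:
        # No recorded interval is treated as no service.
        return 0
    return 60 / interval_minutes * capacity


# Großprofil U-Bahn every 5 minutes, tram every 10, S-Bahn every 20.
ubahn = passengers_per_hour(5, 1000)
tram = passengers_per_hour(10, 195)
sbahn = passengers_per_hour(20, 1100)
print(ubahn, tram, sbahn)  # 12000.0 1170.0 3300.0
```

The example shows how strongly the figure depends on the interval: halving the interval doubles the result, which is why an interval that only holds around 7:30 can distort the hourly value.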

Because of the imprecision involved in calculating raw passengers-per-hour figures, we will use normalisation to aid our analysis. Normalising our capacity and frequency values puts them on a consistent scale (0 to 1). This lets us easily compare the relative importance of different edges within our network. For example, if one route has a normalised capacity of 0.8 and another has 0.2, the first route has four times the capacity of the second. Normalisation will help us identify potential bottlenecks and key routes within the network, regardless of potentially imprecise absolute capacity figures. To maintain flexibility, we keep the original capacity and frequency values and assign two further attributes to our edges: normalised capacity and normalised frequency. From these two values we calculate a weight that is relative in nature. We do this by adding the two normalised values, as neither should be given more importance than the other. A main benefit of normalising edge weights is that, because we normalise against the maximum value across all snapshots together (with the minimum fixed at 0), we can compare weights across snapshots. For example, we can track the average edge weight of each snapshot to see whether it changed during the period under observation.
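The following sketch works through the normalisation on three invented edges. It follows the approach taken in this notebook of fixing the minimum at 0, so each value is divided by the maximum; the edge data and variable names are assumptions of the example. If the frequency value is the interval in minutes between services, as the passengers-per-hour calculation suggests, a larger normalised frequency corresponds to a less frequent service, which is worth bearing in mind when reading the combined weight.

```python
def normalise(values):
    """Scale values to the range 0 to 1 by dividing by the maximum (minimum fixed at 0)."""
    max_value = max(values)
    return [value / max_value for value in values]


# Hypothetical edges: (capacity, frequency as interval in minutes).
example_edges = [(1000, 5), (195, 10), (1100, 20)]

norm_capacity = normalise([capacity for capacity, _ in example_edges])
norm_frequency = normalise([interval for _, interval in example_edges])
weights = [c + f for c, f in zip(norm_capacity, norm_frequency)]

for edge, weight in zip(example_edges, weights):
    print(edge, round(weight, 3))
```

Because both inputs share the 0 to 1 scale, the sum lies between 0 and 2, and neither attribute dominates simply because of its unit.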

[^51]: Tanuja Shanmukhappa, Ivan Wang-Hei Ho, Chi K. Tse, Recent development in public transport network analysis from the complex network perspective, in: IEEE Circuits and Systems Magazine 19, no. 4, 2019, pg. 55.

In [19]:
# Creating weight for edges based on frequency and capacity
# passengers-per-hour = services per hour (60 / interval in minutes) * capacity of a vehicle
for u, v, key in H.edges(keys=True):  # Using keys will distinguish parallel edges
    frequency = int(H[u][v][key]['frequency'])
    capacity = int(H[u][v][key]['capacity'])

    if frequency != 0:
        services_per_hour = 60 / frequency
    else:
        services_per_hour = 0
    weight = services_per_hour * capacity
    H[u][v][key]['passengers-per-hour'] = weight

In [20]:
# calculating improved weight score using normalised frequency and capacity measures
# weight = norm_capacity + norm_frequency
def normalize(values):
    # the minimum is fixed at 0, so each value is simply divided by the maximum
    max_value = max(values)
    min_value = 0
    return [(value - min_value) / (max_value - min_value) for value in values]

capacities = [edge[2]['capacity'] for edge in H.edges(data=True)]
frequencies = [int(edge[2]['frequency']) for edge in H.edges(data=True)]

normalized_capacities = normalize(capacities)
normalized_frequencies = normalize(frequencies)  

for i, edge in enumerate(H.edges(data=True)):
    edge[2]['normalised_capacity'] = normalized_capacities[i]
    edge[2]['normalised_frequency'] = normalized_frequencies[i]

for x, y, d in H.edges(data=True):
    weight = d['normalised_capacity'] + d['normalised_frequency']
    d['weight'] = weight

## Data model criticism

Although we have chosen a model that reflects the real-world structure and behaviour of the public transportation system, we are still dealing with an abstraction in which elements of the real-world transportation system are lost. It is important that we highlight these and are certain of and clear about their consequences.

### Distance in time

The travel time between one station and the next is a common edge attribute in models of public transportation data. The information is available in the Fahrplanbücher, but it was not included in our model. The main reason is that capturing it was not considered feasible, given that our network has over 25,000 edges. The figure was also not considered vital to exploring how the public transportation system evolved over such a large time scale. Previous work that includes this information does not investigate evolving public transportation networks but rather the temporality of a transport network in a single state, and for that type of analysis the figure would have been vital.

The more distanced perspective in our project did lend itself better to capturing the overall travel time of the line, rather than the travel time between stops. Capturing this information from the sources was much less time intensive. However, the network model we have selected does not have a suitable place to store this information as the lines do not exist as entities themselves, but only as attributes of the edges. A different network model based on this captured data would be able to potentially analyse the total travel times of all lines in different snapshots and thereby gain another perspective into how the system changed.

### Directionality

Directionality presents a limitation in our undirected network model. The model does not inherently capture the direction of travel on lines, limiting analysis of directional passenger flows and how those might have shifted over time due to evolving patterns of urban development. At this stage, the benefits that directionality would have brought to the analysis of the potential of dynamic NA for historical research would have been negligible compared with the complexity of adding it to our network. Still, we need to be aware that we have an incomplete picture of the network as currently modelled. There are two important aspects that this undirected model does not capture. Firstly, there are two frequency values for the service at 7:30; these are mostly the same but sometimes differ depending on the direction the service sets out from. Secondly, in our 1989 Fahrplanbuch for West Berlin, bus stations differed depending on direction, meaning that some stops were only serviced in a specific direction. In our model we have only captured the stations in one direction. To allow for future incorporation of the missing information, we applied a standard rule for how the data was captured. We always capture the information from the first table in our Fahrplanbuch (which shows one line direction), and in our database each line has a start and end station, which allows us to infer directionality even though at this stage we treat the network as undirected. This makes it possible to differentiate between lines based on direction in future, and thereby to include the second frequency value and the stations serviced in the other direction.

## Conclusion on datafication/abstraction

The process of transforming historical records of Berlin's public transportation system into a network model involves careful datafication and necessary abstractions. With this network model we attempt to limit the abstraction so as not to remove the object of analysis unnecessarily far from its context, which would make it harder for our analysis to inform our understanding of the system. Still, we need to critically reflect on what was gained and lost in this process to understand the analytical possibilities and limitations.

By transforming the available information into an L'-space graph model we keep an understandable structure that reflects the real-world structure of the transportation system, with stations as nodes and lines as edges. This makes the network visually and mathematically comprehensible. Capturing different transportation types within one multilayer model allows us to analyse the overall network behaviour, including interactions between trams, buses, U-Bahn, and S-Bahn lines. We provided substantial node and edge attributes (like type, label, year, capacity, frequency and weight), enabling us to track changes over time and identify patterns based on specific characteristics. Additionally, the geographically referenced database we have created and are querying aligns with the goals of historical GIS, providing a foundation for locational-temporal analysis that could be of interest to multiple disciplines.

There is substantial information that is not captured in our network model. One such aspect is directionality: the model does not inherently capture the direction of travel on lines, limiting analysis of directional flows and the potential to introduce temporality within network snapshots. Another limitation is our use of a single frequency snapshot (7:30 AM), which risks misrepresenting the true importance of lines with varying service levels throughout the day. We know from inspecting the sources that frequency changed rapidly throughout the day for some lines. The changes often occurred across four periods: the morning peak, midday, the afternoon peak and the evening. An extension of the project could easily incorporate further frequency values for other times of the day, bringing an element of temporality into our analysis. The current focus on Berlin's transport network excludes connections to the surrounding regional train system, potentially understating the network's overall reach. This has only a very limited impact on the network in the West, which would not have been connected to the regional network for most of the period under observation. If we were to extend our analysis into the post-1989 era, we might have to include the regional train system in our network in order to maintain our closed-world assumption. Critically, incomplete records, particularly for tram and bus stops, result in a simplified network that might not account for all transfer points or local routes. Because this issue arises from the information in the sources available to us, the records could only be completed if other sources held this information, which so far has not been the case in the associated research.

There are multiple analytical possibilities with the network model we have created. Primarily, we will explore network evolution by tracking changes in network structure over time, including the addition/removal of stations and lines, and the development of the multi-modal system. As part of this, we will calculate various centrality measures to identify critical stations or lines based on different criteria (e.g., connectivity, betweenness). We will also be investigating the impact of historical disruptions on network robustness, potentially highlighting vulnerable areas. While limited by imprecision, the normalised capacity and frequency attributes allow us to explore relative capacity differences across the network, paying specific attention to how these change during the period under observation for both East and West Berlin.

There are analytical limitations that we will have to contend with for the time being. Specifically, the incomplete capacity data makes it difficult to perform precise analyses focused on absolute passenger capacities. Our single-frequency snapshots also limit our understanding of how service levels fluctuate throughout the day or week; however, given our focus on evolution, this will not affect our ability to investigate change over longer periods. The network model also does not capture individual passenger journeys, making it unsuitable for analysing route choices or granular travel patterns.

In conclusion, the datafication of historical records into a network model offers the potential for valuable insights into the evolution and structure of Berlin's public transportation system. Acknowledging the inherent abstractions and limitations, this model provides a robust foundation for diverse analyses within the constraints of available data. The next section will begin this analysis by describing the changes in the network using simple NA metrics. We also hold the promise that future work could focus on expanding data sources to address limitations and enrich the model's capabilities.

In [21]:
# Saving network:
# set and list attribute values must be converted to comma-separated strings before saving
for node, data in H.nodes(data=True):
    for key, value in data.items():
        if isinstance(value, set) or isinstance(value, list):  # Include lists
            data[key] = ",".join(str(item) for item in value)  # Convert items to strings

for u, v, data in H.edges(data=True):
    for key, value in data.items():
        if isinstance(value, set) or isinstance(value, list): 
            data[key] = ",".join(str(item) for item in value) 

nx.write_graphml(H, "base-graph.graphml")
nx.write_gexf(H, "base-graph.gexf") 