# Welcome to the Notebook for Monha's and Bemi's Bachelor Project

## Content

In this notebook we will:

1. Aggregate our data into usable travel sequences containing only the relevant data
2. Analyse the appropriate data
3. Create an embedding space using Word2Vec

The notebook is structured as follows:
1. A Markdown cell describes the intention of the code that follows and, where relevant, explains its results
2. A code cell contains the code

# Initial Setup

Install the required libraries before running the code below.

In [1]:
%pip install pandas # Pandas for data handling
%pip install numpy  # Maths stuff

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


# Data import

The data used in this notebook is extracted from the Journeys table in the database.

The data contains ~43 million rows, covering all journeys travelled over a span of ~4 years. For this project we filter the data so that we only work with journeys within Copenhagen.

In [3]:
data = pd.read_csv('../Data/All_Journeys_small.csv')
data

Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
0,13581986-9f2d-455b-b5a1-00000010eaeb,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade)",,2024-03-14 10.55.56.6562914,2024-03-14 10.55.56.6562914,,,,
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
2,cbd5ad3b-0bf0-4314-bd74-000001a41c82,,,0,100110021003,,3,0,,Femøren St. (Metro),Bispebjerg Hospital (Tagensvej),2023-08-04 08.33.59.0415651,2023-08-04 08.33.59.0415651,,,,
3,1f4ed562-1e81-40be-ac2b-000001a4e840,,1029,1029,"1001,1002,1008,1029,1032,1043,1054,1066,1076,1...",,9,1001,,,,2020-08-21 19.34.33.6834876,2020-08-21 19.34.33.6834876,,,,
4,afbd1023-61e5-4b47-9f9d-000001c9d32b,Zone,1062,0,"1062,1071,1061,1052,1053,1063,1073,1072,1051,1...",,4,0,,Farum St. (62),Dyssegård St. (31),2023-11-09 17.28.41.0671191,2023-11-09 17.28.41.0671191,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,89aca741-aaad-4e03-ad4e-05dc7d2d388c,Zone,1010,0,"1010,1016,1036,1037,1011,1012,1017,1018,1039,1...",,6,0,,Gilleleje St. (10),Helsingør St. (05),2023-07-26 10.33.40.6757535,2023-07-26 10.33.40.6757535,,,,
999996,64ddaac3-0201-4841-b5d2-05dc7dc42b6b,,,0,100110031004,,3,0,,,,2020-08-25 16.47.39.2628986,2020-08-25 16.47.39.2628986,,,,
999997,139c9211-8f1d-4264-8673-05dc7e038a52,Zone,1002,0,1002100110031030103110321033,,2,0,,My Location (02),København/City (01),2021-03-08 12.02.16.9444135,2021-03-08 12.02.16.9444135,,,,
999998,9b14479b-d29a-4b52-a2bd-05dc7e790206,,,0,"1001,1002,1003,1004,1005,1006,1007,1008,1009,1...",,0,0,,,,2021-02-03 08.48.51.5517535,2021-02-03 08.48.51.5517535,,0.0,,


## Filtering data

To filter the data, a number of checks are needed to make sure that a journey lies within Copenhagen and contains the information relevant for our purpose.

For a journey to be within Copenhagen, it must only make use of zones 1 through 4:
1. Check that *internalStartZones* only contains zones within Copenhagen
2. Check that *internalValidZones* only contains zones within Copenhagen

For a journey to be relevant for the project, the fields *StartStop*, *EndStop*, *SearchStart* and *SearchEnd* must be either fully or partly filled out: if *StartStop* and *EndStop* are null, then *SearchStart* and *SearchEnd* must be filled, and vice versa. In addition, the start and end values must not match; a journey's start and end should not be the same.
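
As a rough illustration of the zone rule above, the sketch below checks a few made-up zone strings with Python's `re` module. It assumes that zones 1 to 4 are stored as the internal codes `1001` to `1004`, separated by commas when a field holds several zones. The helper name `is_within_cph` and the sample strings are hypothetical and only serve the example.

```python
import re

# Assumption: Copenhagen zones 1-4 are stored as the internal codes 1001-1004,
# separated by commas when a field holds several zones.
CPH_ZONES = re.compile(r'(1001|1002|1003|1004)(,(1001|1002|1003|1004))*')

def is_within_cph(zones: str) -> bool:
    """Return True if every zone in the comma-separated string is a Copenhagen zone."""
    return CPH_ZONES.fullmatch(zones) is not None

# Made-up sample values for illustration
samples = ['1001', '1001,1002,1003', '1001,1029', '1062,1071']
results = {zones: is_within_cph(zones) for zones in samples}
print(results)
```

A string is rejected as soon as a single zone outside 1001 to 1004 appears, which is the behaviour the two checks rely on.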


In [4]:
#Copenhagen filtering
condition_1_cph = (
    (data['internalValidZones'].str.match(r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$')
    | # or
    pd.isna(data['internalValidZones']))
    )

condition_2_cph = (
    (data['internalStartZones'].str.match(r'^(1001|1002|1003|1004)$'))
    | # or
    pd.isna(data['internalStartZones'])
    )

In [5]:
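cph_data_1 = data[condition_1_cph]
# Align the second mask with the already filtered rows to avoid a reindexing warning
cph_data_2 = cph_data_1[condition_2_cph.loc[cph_data_1.index]]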

cph_data_3 = cph_data_2[ ~ (cph_data_2['SearchStart'].str.contains("okation", na=False)
                                             | #Or
                                             cph_data_2['SearchStart'].str.contains("zoner", na=False))]
cph_data_4 = cph_data_3[( ~ (cph_data_3['SearchEnd'].str.contains("zoner", na=False) 
                                            | #Or
                                            cph_data_3['SearchEnd'].str.contains("okation", na=False)))]

# The next two filters are the English equivalents of the Danish filters above
cph_data_5 = cph_data_4[( ~ (cph_data_4['SearchEnd'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data_4['SearchEnd'].str.contains("ocation", na=False)))]

cph_data_6 = cph_data_5[( ~ (cph_data_5['SearchStart'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data_5['SearchStart'].str.contains("ocation", na=False)))]

# Keep only entries where both Search fields or both Stop fields are filled
cph_data_7 = cph_data_6[(
                                        ( ~ (pd.isna(cph_data_6['SearchStart'])) & ~ (pd.isna(cph_data_6['SearchEnd'])))
                                        | # Or
                                        ( ~ (pd.isna(cph_data_6['StartStop'])) & ~ (pd.isna(cph_data_6['EndStop'])))
                                        )]

# Next filter removes all entries where SearchStart and SearchEnd contain the same value
cph_data = cph_data_7[(
                        ~(cph_data_7['SearchStart'] == cph_data_7['SearchEnd'])
                        )]

cph_data



Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
22,27ddd4c7-35d5-4e95-84a5-0000092e14a0,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Hulgårds Plads (Frederikssundsvej) (02),2023-01-01 13.16.11.4343765,2023-01-01 13.16.11.4343765,,,,
61,986798c5-e47a-4a6f-ba89-0000177df4cf,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Islands Brygge St. (Metro) (01),2022-08-25 09.08.46.4521964,2022-08-25 09.08.46.4521964,,,,
69,ad7d6db6-ab4e-4782-8976-000019f0ecd6,Zone,1001,0,100110021003,,2,0,,København H (Metro) (01),Frederiksberg Allé St. (Metro) (01),2023-07-24 05.41.23.6936628,2023-07-24 05.41.23.6936628,,,,
92,fec8331d-e54e-48b2-88fe-00002217de59,Zone,1001,0,100110021003,,2,0,,Nørreport St. (01),Sluseholmen (Sjællandsbroen) (02),2022-10-05 12.44.54.5227363,2022-10-05 12.44.54.5227363,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999881,2cf0443e-cf66-464a-a6ba-05dc4e88c9d3,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),København H (01),2023-06-29 14.41.43.3292495,2023-06-29 14.41.43.3292495,,,,
999908,387804ed-cbc1-456e-a24a-05dc5a59af34,Zone,1004,0,1004100310011002,,3,0,,Kastrup St. (Metro) (04),Frederiksberg St. (Metro) (02),2023-08-28 16.06.29.8130118,2023-08-28 16.06.29.8130118,,,,
999930,4d4f9944-9669-4219-820d-05dc65834645,Zone,1003,0,1003100110021004,,2,0,,Bella Center St. (Center Østvej) (03),Frederiksberg St. (Metro) (02),2023-11-22 07.16.16.0657646,2023-11-22 07.16.16.0657646,,,,
999942,ec457061-bae5-4040-95cb-05dc69e1bfa8,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),Nørreport St. (01),2022-08-22 06.27.09.2054089,2022-08-22 06.27.09.2054089,,,,


In [6]:
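cph_data = data[condition_1_cph]
# Align the second mask with the already filtered rows to avoid a reindexing warning
cph_data = cph_data[condition_2_cph.loc[cph_data.index]]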

cph_data = cph_data[ ~ (cph_data['SearchStart'].str.contains("okation", na=False)
                                             | #Or
                                             cph_data['SearchStart'].str.contains("zoner", na=False))]
cph_data = cph_data[( ~ (cph_data['SearchEnd'].str.contains("zoner", na=False) 
                                            | #Or
                                            cph_data['SearchEnd'].str.contains("okation", na=False)))]

# The next two filters are the English equivalents of the Danish filters above
cph_data = cph_data[( ~ (cph_data['SearchEnd'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data['SearchEnd'].str.contains("ocation", na=False)))]

cph_data = cph_data[( ~ (cph_data['SearchStart'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data['SearchStart'].str.contains("ocation", na=False)))]

# Keep only entries where both Search fields or both Stop fields are filled
cph_data = cph_data[(
                                        ( ~ (pd.isna(cph_data['SearchStart'])) & ~ (pd.isna(cph_data['SearchEnd'])))
                                        | # Or
                                        ( ~ (pd.isna(cph_data['StartStop'])) & ~ (pd.isna(cph_data['EndStop'])))
                                        )]

# Next filter removes all entries where SearchStart and SearchEnd contain the same value
cph_data = cph_data[(
                        ~(cph_data['SearchStart'] == cph_data['SearchEnd'])
                        )]

cph_data



Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
22,27ddd4c7-35d5-4e95-84a5-0000092e14a0,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Hulgårds Plads (Frederikssundsvej) (02),2023-01-01 13.16.11.4343765,2023-01-01 13.16.11.4343765,,,,
61,986798c5-e47a-4a6f-ba89-0000177df4cf,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Islands Brygge St. (Metro) (01),2022-08-25 09.08.46.4521964,2022-08-25 09.08.46.4521964,,,,
69,ad7d6db6-ab4e-4782-8976-000019f0ecd6,Zone,1001,0,100110021003,,2,0,,København H (Metro) (01),Frederiksberg Allé St. (Metro) (01),2023-07-24 05.41.23.6936628,2023-07-24 05.41.23.6936628,,,,
92,fec8331d-e54e-48b2-88fe-00002217de59,Zone,1001,0,100110021003,,2,0,,Nørreport St. (01),Sluseholmen (Sjællandsbroen) (02),2022-10-05 12.44.54.5227363,2022-10-05 12.44.54.5227363,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999881,2cf0443e-cf66-464a-a6ba-05dc4e88c9d3,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),København H (01),2023-06-29 14.41.43.3292495,2023-06-29 14.41.43.3292495,,,,
999908,387804ed-cbc1-456e-a24a-05dc5a59af34,Zone,1004,0,1004100310011002,,3,0,,Kastrup St. (Metro) (04),Frederiksberg St. (Metro) (02),2023-08-28 16.06.29.8130118,2023-08-28 16.06.29.8130118,,,,
999930,4d4f9944-9669-4219-820d-05dc65834645,Zone,1003,0,1003100110021004,,2,0,,Bella Center St. (Center Østvej) (03),Frederiksberg St. (Metro) (02),2023-11-22 07.16.16.0657646,2023-11-22 07.16.16.0657646,,,,
999942,ec457061-bae5-4040-95cb-05dc69e1bfa8,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),Nørreport St. (01),2022-08-22 06.27.09.2054089,2022-08-22 06.27.09.2054089,,,,


## Testing to see whether our filtering worked

Since we are handling a very large amount of data, it is difficult to skim through it to see whether it is as intended. The tests below detect whether rows that should have been filtered out are still present in our data.

In [8]:
# Test 1: does SearchEnd contain 'lokation' or 'location'?
lokation_count = cph_data[cph_data['SearchEnd'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entries in 'SearchEnd' : {lokation_count['SearchEnd']}")

location_count = cph_data[cph_data['SearchEnd'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entries in 'SearchEnd' : {location_count['SearchEnd']}")

# Test 2: does SearchStart contain 'lokation' or 'location'?
lokation_count_s = cph_data[cph_data['SearchStart'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entries in 'SearchStart' : {lokation_count_s['SearchStart']}")

location_count_s = cph_data[cph_data['SearchStart'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entries in 'SearchStart' : {location_count_s['SearchStart']}")

# Test 3: does SearchEnd contain 'zones' or 'zoner'?
zones_count = cph_data[cph_data['SearchEnd'].str.contains("zones", na=False)].count()
print(f"Amount of 'zones' entries in 'SearchEnd' : {zones_count['SearchEnd']}")

zones_count_r = cph_data[cph_data['SearchEnd'].str.contains("zoner", na=False)].count()
print(f"Amount of 'zoner' entries in 'SearchEnd' : {zones_count_r['SearchEnd']}")

# Test 4: is any row null in 3 or more of the fields StartStop, EndStop, SearchStart and SearchEnd?
num_nulls = cph_data[['StartStop', 'EndStop', 'SearchStart', 'SearchEnd']].isna().sum(axis=1)
b = (num_nulls >= 3).any()
print(f"Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd is null: {b}")

# Test 5: do any rows have identical values in StartStop and EndStop?
duplicates_in_stop = cph_data[(cph_data['StartStop'] == cph_data['EndStop'])].count()
print(f"Amount of matching values in StartStop and EndStop : {duplicates_in_stop['StartStop']}")


# Test 6: do any rows have identical values in SearchStart and SearchEnd?
duplicates_in_stop = cph_data[(cph_data['SearchStart'] == cph_data['SearchEnd'])].count()
print(f"Amount of matching values in SearchStart and SearchEnd : {duplicates_in_stop['SearchStart']}")

# Test 7: does any row have exactly three of the four fields filled?
# notna() counts the filled fields directly; negating the summed null count would not
num_filled = cph_data[['StartStop', 'EndStop', 'SearchStart', 'SearchEnd']].notna().sum(axis=1)
b = (num_filled == 3).any()
print(f"Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd are filled: {b}")


Amount of 'Lokation' entries in 'SearchEnd' : 0
Amount of 'Location' entries in 'SearchEnd' : 0
Amount of 'Lokation' entries in 'SearchStart' : 0
Amount of 'Location' entries in 'SearchStart' : 0
Amount of 'zones' entries in 'SearchEnd' : 0
Amount of 'zoner' entries in 'SearchEnd' : 0
Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd is null: False
Amount of matching values in StartStop and EndStop : 0
Amount of matching values in SearchStart and SearchEnd : 0
Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd are filled: False


# From ID to Station

We wish to get rid of the IDs used in StartStop and EndStop, as they do not tell us directly which station is used in a journey. We therefore use the table SJWaypoints to match a given station Id with a 'Name'. We then replace all the entries in cph_data so that these fields hold a stop name rather than an integer ID.

In [9]:
id_to_name_data = pd.read_csv('../Data/SJ_results.csv')
grouped_id_name = id_to_name_data[['Id', 'Name']].groupby('Id')

In [10]:
grouped_id_name.value_counts()
id_name_list = grouped_id_name.agg(list)

id_to_name_dict = {}

# Each group key is unique, so every Id is assigned the first Name listed for it
for station_id, frame in grouped_id_name:
    id_to_name_dict[station_id] = frame['Name'].iloc[0]


In [11]:
id_to_name_dict

{2.0: 'Engelsborgvej (Buddingevej)',
 3.0: "Christian X's Allé (Buddingevej)",
 6.0: 'Gammelmosevej (Buddingevej)',
 8.0: 'Snogegårdsvej (Buddingevej)',
 9.0: 'Buddinge St. (Buddingevej)',
 11.0: 'Buddinge Torv (Gladsaxe Ringvej)',
 12.0: 'Gladsaxevej (Gladsaxe Ringvej)',
 15.0: 'Dynamovej (Gladsaxe Ringvej)',
 17.0: 'Herlev Hospital (Herlev Ringvej)',
 19.0: 'Herlev Bymidte (Herlev Ringvej)',
 20.0: 'Hyrdindestien (Herlev Hovedgade)',
 21.0: 'Elverhøjen (Stationsalleen)',
 22.0: 'Herlev St.',
 23.0: 'Elverhøjen (Herlev Hovedgade)',
 24.0: 'Herlev Bymidte (Herlev Ringvej)',
 25.0: 'Mileparken (Herlev Ringvej)',
 26.0: 'HF Islegård (Nordre Ringvej)',
 27.0: 'Marielundvej (Nordre Ringvej)',
 28.0: 'Hanevadsbro (Nordre Ringvej)',
 29.0: 'Slotsherrensvej (Nordre Ringvej)',
 30.0: 'Ejby Industrivej (Nordre Ringvej)',
 31.0: 'Ejby Smedevej (Nordre Ringvej)',
 32.0: 'Fabriksparken (Nordre Ringvej)',
 33.0: 'Mellemtoftevej (Nordre Ringvej)',
 34.0: 'Psykiatrisk Center Glostrup (Nordre Ringvej)

## Change of cph_data

We now wish to replace all Ids in cph_data from StartStop and EndStop with the associated Name from the dict. 
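
One possible way to do this replacement without a row-wise function is `Series.map`, sketched below on a tiny made-up frame. The names `id_to_name` and `journeys` are hypothetical stand-ins for `id_to_name_dict` and `cph_data`, and the ID `999.0` is invented to show what happens to an ID that is missing from the dict.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of id_to_name_dict and cph_data
id_to_name = {9.0: 'Buddinge St. (Buddingevej)', 22.0: 'Herlev St.'}
journeys = pd.DataFrame({
    'StartStop': [9.0, np.nan, 22.0],
    'EndStop': [22.0, np.nan, 999.0],  # 999.0 is deliberately not in the dict
})

# Series.map looks every value up in the dict; values without a match
# (including NaN) come out as NaN instead of raising a KeyError.
named = journeys.assign(
    StartStop=journeys['StartStop'].map(id_to_name),
    EndStop=journeys['EndStop'].map(id_to_name),
)
print(named)
```

Note the trade-off: unknown IDs silently become NaN here, so it may be worth counting the NaN values before and after the mapping to see whether any IDs were lost.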

In [12]:
test_df = cph_data

In [13]:
def id_to_station(row):
    # Translate each stop ID separately, so a missing EndStop does not raise a KeyError
    for column in ('StartStop', 'EndStop'):
        if pd.notna(row[column]):
            row[column] = id_to_name_dict[row[column]]
    return row

test_df = test_df.apply(id_to_station, axis=1)
        

# Sequences

We now wish to make sequences from the journeys. Each sequence should be a value pair of either SearchStart and SearchEnd or StartStop and EndStop. To do this we simply collect the pairs from the dataframe in which the StartStop and EndStop IDs have been "translated" to station names.

When making the sequences, certain questions arise about the data. For instance, of the 3.4 million data points, only 64 contain a value *only* in StartStop and EndStop (`test_df[~(pd.isna(test_df['StartStop'])) & (pd.isna(test_df['SearchStart']))]`).

Another important decision is how to extract stations from SearchStart and SearchEnd, since many of the entries do not directly match a station name, e.g. 'Hovedbanegården' being the SearchStart for the station 'København H'. We therefore need to map these inconsistent strings to a consistent naming convention.
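
A minimal sketch of what such a mapping could look like is given below. It is not the method used in this notebook: the `ALIASES` table, the function name `normalise` and the stripping rules are assumptions chosen for illustration, and the raw strings are taken from the data shown earlier.

```python
import re

# Assumption: these aliases are examples only; a real table would have to be
# built by inspecting the data.
ALIASES = {'Hovedbanegården': 'København H'}

ZONE_SUFFIX = re.compile(r' \(\d\d\)$')      # e.g. " (01)"
DETAIL_SUFFIX = re.compile(r' \([^()]*\)$')  # e.g. " (Metro)" or " (Istedgade)"

def normalise(name: str) -> str:
    """Map a raw search string to a canonical station name (illustrative rules)."""
    name = name.split(',')[0].strip()  # keep the part before the first comma
    name = ZONE_SUFFIX.sub('', name)   # drop a trailing zone number
    name = DETAIL_SUFFIX.sub('', name) # drop a trailing street or mode hint
    return ALIASES.get(name, name)

raw = [
    'Hovedbanegården, Tivoli (Bernstorffsgade) (01)',
    'København H (togbus) (01)',
    'København H (Metro) (01)',
    'Nørreport St. (01)',
]
canonical = [normalise(name) for name in raw]
print(canonical)
```

Dropping the parenthesised hint merges, for example, the metro and train entries of the same station. Whether that is desirable depends on how fine-grained the embedding vocabulary should be.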



In [16]:
import re

# Matches a two-digit zone marker such as ' (01)'
pattern = r' [(]\d\d[)]'

sequences = []

def get_sequence(row) -> None:
    if (pd.isna(row['StartStop'])):
        start   = re.sub(pattern, "", row['SearchStart'])
        end     = re.sub(pattern, "", row['SearchEnd'])
        sequences.append([start, end])
    else:
        start   = re.sub(pattern, "", row['StartStop'])
        end     = re.sub(pattern, "", row['EndStop'])
        sequences.append([start, end])

test_df.apply(get_sequence, axis=1)

1         None
22        None
61        None
69        None
92        None
          ... 
999881    None
999908    None
999930    None
999942    None
999977    None
Length: 80226, dtype: object

In [51]:
len(sequences)

80226

## Handling the same place in Copenhagen being searched for under different names (and languages)

In [23]:
%pip install geopy


Collecting geopy
  Downloading geopy-2.4.1-py3-none-any.whl.metadata (6.8 kB)
Collecting geographiclib<3,>=1.52 (from geopy)
  Downloading geographiclib-2.0-py3-none-any.whl.metadata (1.4 kB)
Downloading geopy-2.4.1-py3-none-any.whl (125 kB)
Downloading geographiclib-2.0-py3-none-any.whl (40 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-2.0 geopy-2.4.1
Note: you may need to restart the kernel to use updated packages.


In [44]:
from geopy.geocoders import Nominatim
from geopy.distance import geodesic

- geopy is a Python client for several popular geocoding web services.

- geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

- geopy includes geocoder classes for the OpenStreetMap Nominatim, Google Geocoding API (V3), and many other geocoding services. The full list is available on the Geocoders doc section. Geocoder classes are located in geopy.geocoders.

A small test that geocodes a few stations in Copenhagen under different names:

In [52]:

geolocator = Nominatim(user_agent="my_geocoder")

place1 = 'Hovedbanegården'
place2 = 'København H, Copenhagen'
place3 = 'Lufthavnen St.'
place4 = 'CPH Lufthavn'
place5 = 'Hovedbanegården (Istedgade)'
place6 = 'København H, Copenhagen'

location_place1 = geolocator.geocode(place1)
location_place2 = geolocator.geocode(place2)
location_place3 = geolocator.geocode(place3)
location_place4 = geolocator.geocode(place4)
location_place5 = geolocator.geocode(place5)
location_place6 = geolocator.geocode(place6)

print("Location 1:", location_place1.latitude, location_place1.longitude)
print("Location 2:", location_place2.latitude, location_place2.longitude)
print('------------------------------')
print("Location 3:", location_place3.latitude, location_place3.longitude)
print("Location 4:", location_place4.latitude, location_place4.longitude)
print('------------------------------')
print("Location 5:", location_place5.latitude, location_place5.longitude)
print("Location 6:", location_place6.latitude, location_place6.longitude)

Location 1: 55.6727587 12.564678938785772
Location 2: 55.6727587 12.564678938785772
------------------------------
Location 3: 55.595098 12.6179894
Location 4: 55.6091282 12.650982248393536
------------------------------
Location 5: 55.672253 12.5629295
Location 6: 55.6727587 12.564678938785772
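
To judge whether two names refer to the same place, the coordinates above can be compared numerically. The sketch below does this with the haversine formula from the standard library, as an approximation of the `geodesic` distance that geopy offers. The 1 km threshold and the variable names are assumptions for illustration, not tuned values.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Coordinates copied from the output above
hovedbanegaarden = (55.6727587, 12.564678938785772)   # Location 1
hovedbanegaarden_istedgade = (55.672253, 12.5629295)  # Location 5
lufthavnen_st = (55.595098, 12.6179894)               # Location 3
cph_lufthavn = (55.6091282, 12.650982248393536)       # Location 4

SAME_PLACE_KM = 1.0  # assumed threshold, not a tuned value

d_station = haversine_km(hovedbanegaarden, hovedbanegaarden_istedgade)
d_airport = haversine_km(lufthavnen_st, cph_lufthavn)
print(f"{d_station:.2f} km, same place: {d_station < SAME_PLACE_KM}")
print(f"{d_airport:.2f} km, same place: {d_airport < SAME_PLACE_KM}")
```

Under these assumptions the two Hovedbanegården results lie roughly 0.1 km apart, while 'Lufthavnen St.' and 'CPH Lufthavn' are roughly 2.6 km apart, so a 1 km threshold alone would not merge the latter pair.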


We use another regex to extract everything before the first "," in each sequence element. This is needed because geopy does not recognize the more specific names, so a first test requires this simplification.


In [56]:
# Keep only the part of each name before the first comma
before_comma = re.compile(r'^([^,]+)')

def strip_after_comma(name):
    match = before_comma.match(name)
    return match.group(1) if match else name

sequences_modified = [[strip_after_comma(name) for name in sequence] for sequence in sequences]

# Print the modified sequences
for sequence in sequences_modified:
    print(sequence)

['Hovedbanegården', 'Borrebyvej 29']
['København H (togbus)', 'Hulgårds Plads (Frederikssundsvej)']
['København H (togbus)', 'Islands Brygge St. (Metro)']
['København H (Metro)', 'Frederiksberg Allé St. (Metro)']
['Nørreport St.', 'Sluseholmen (Sjællandsbroen)']
['Lufthavnen St. (Metro)', 'Aksel Møllers Have St. (Metro)']
['CPH Lufthavn', 'Istedgade 6']
['Ryumgårdsvej (Kongelundsvej)', 'Dybbølsbro St.']
['Teglgårdstræde (Nørre Voldgade)', 'Kapelvej (Nørrebrogade)']
['Nørreport St. (Frederiksborggade)', 'Forum St. (Metro)']
['Drechselsgade (Artillerivej)', 'Hovedbanegården (Reventlowsgade)']
['Nyhavn (Københavns Havn)', 'Refshaleøen (Refshalevej)']
['Nørre Campus (Tagensvej)', 'Dronningens Tværgade 37']
['Elmegade (Nørrebrogade)', 'København H']
['Værnedamsvej (Frederiksberg Allé)', 'Skellet (Roskildevej)']
['Elmegade (Nørrebrogade)', 'Vestamager St. (Metro)']
['Fisketorvet', 'Sjælør St.']
['Dybbølsbro St. (togbus)', 'Østerport St.']
['Orientkaj St. (Sundkrogsgade)', 'Østerport St.']
['

A first test to find all places in the sequences whose coordinates lie close to those of 'Hovedbanegården':
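
Since many place names repeat across sequences, one option is to look each distinct name up only once. The sketch below illustrates that idea with a stand-in geocoder, so it runs without network access. `places_near`, `fake_geocode` and `FAKE_COORDINATES` are hypothetical names, and in the notebook the `geocode` argument would instead wrap `geolocator.geocode`. Unlike the cell that follows, it returns each nearby name only once.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Approximate distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def places_near(sequences, target, geocode, radius_km=1.0):
    """Return the distinct place names that geocode to within radius_km of target.

    `geocode` is any callable mapping a name to a (lat, lon) tuple or None.
    """
    cache = {}
    nearby = []
    for sequence in sequences:
        for place in sequence:
            if place in cache:
                continue  # already looked up, skip the repeated request
            coords = cache[place] = geocode(place)
            if coords is not None and haversine_km(target, coords) < radius_km:
                nearby.append(place)
    return nearby

# Stand-in geocoder using coordinates from the small test above
FAKE_COORDINATES = {
    'Hovedbanegården': (55.6727587, 12.564678938785772),
    'CPH Lufthavn': (55.6091282, 12.650982248393536),
}
lookups = []

def fake_geocode(place):
    lookups.append(place)  # record every request to show the effect of the cache
    return FAKE_COORDINATES.get(place)

pairs = [
    ['Hovedbanegården', 'CPH Lufthavn'],
    ['CPH Lufthavn', 'Hovedbanegården'],
    ['Hovedbanegården', 'Borrebyvej 29'],
]
nearby = places_near(pairs, (55.6727587, 12.564678938785772), fake_geocode)
print(nearby, len(lookups))
```

Six names are processed here, but only three requests reach the geocoder, which matters when the geocoding service limits the request rate.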

In [59]:
# Define the target coordinates
target_coordinates = (55.6727587, 12.564678938785772) #København H's coordinates

# List to store places close to the target coordinates
places_close_to_target = []

test = sequences_modified[0:100]
for sequence in test:
    for place in sequence:
        try:
            location = geolocator.geocode(place)
            if location is not None:
                place_coordinates = (location.latitude, location.longitude)
                distance_to_target = geodesic(target_coordinates, place_coordinates).kilometers
                if distance_to_target < 1:  # Adjust this threshold as needed
                    places_close_to_target.append(place)
        except Exception as e:
            print(f"Error geocoding {place}: {e}")

# Print the list of places close to the target coordinates
print("Places close to the target coordinates:")
for place in places_close_to_target:
    print(place)

test

Places close to the target coordinates:
Hovedbanegården
København H (togbus)
København H (togbus)
Istedgade 6
Teglgårdstræde (Nørre Voldgade)
Hovedbanegården (Reventlowsgade)
København H
Værnedamsvej (Frederiksberg Allé)
Vesterport St. (Vester Farimagsgade)
København H
København H
Polititorvet
Amagertorv 33
København H (togbus)
København H
Hovedbanegården
København H
Rådhuspladsen
Nyropsgade 46
Hovedbanegården (Istedgade)
København H
København H
København H
København H
København H
København H (togbus)
København H


[['Hovedbanegården', 'Borrebyvej 29'],
 ['København H (togbus)', 'Hulgårds Plads (Frederikssundsvej)'],
 ['København H (togbus)', 'Islands Brygge St. (Metro)'],
 ['København H (Metro)', 'Frederiksberg Allé St. (Metro)'],
 ['Nørreport St.', 'Sluseholmen (Sjællandsbroen)'],
 ['Lufthavnen St. (Metro)', 'Aksel Møllers Have St. (Metro)'],
 ['CPH Lufthavn', 'Istedgade 6'],
 ['Ryumgårdsvej (Kongelundsvej)', 'Dybbølsbro St.'],
 ['Teglgårdstræde (Nørre Voldgade)', 'Kapelvej (Nørrebrogade)'],
 ['Nørreport St. (Frederiksborggade)', 'Forum St. (Metro)'],
 ['Drechselsgade (Artillerivej)', 'Hovedbanegården (Reventlowsgade)'],
 ['Nyhavn (Københavns Havn)', 'Refshaleøen (Refshalevej)'],
 ['Nørre Campus (Tagensvej)', 'Dronningens Tværgade 37'],
 ['Elmegade (Nørrebrogade)', 'København H'],
 ['Værnedamsvej (Frederiksberg Allé)', 'Skellet (Roskildevej)'],
 ['Elmegade (Nørrebrogade)', 'Vestamager St. (Metro)'],
 ['Fisketorvet', 'Sjælør St.'],
 ['Dybbølsbro St. (togbus)', 'Østerport St.'],
 ['Orientkaj St. 