# Welcome to the Notebook for Monha's and Bemi's Bachelor Project

## Content

In this notebook we will:

1. Aggregate our data into usable travel sequences containing only the relevant data
2. Analyse the appropriate data
3. Create an embedding space using Word2Vec

We will use the following format for the structure of the file:
1. A Markdown cell that describes the intention of the following code, followed by an explanation of the results from the code, if any
2. A code block containing the code

# Initial Setup

Please pip install the correct libraries for the following code to work.

In [1]:
%pip install pandas # Pandas for data handling
%pip install numpy  # Maths stuff

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import re




# Data import

The data used in this notebook is extracted from the Journeys table in the database.

The full table contains ~43 million rows, covering all journeys travelled over a span of ~4 years. For the purpose of this project we filter the data so that we only work with journeys within Copenhagen.
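With this many rows, the full export may not fit comfortably in memory. One option is to read the CSV in chunks and keep only the rows of interest from each chunk. The following is a minimal sketch of that idea, not the approach used in this notebook; the in-memory sample, the chunk size and the helper name `read_filtered` are assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real export; the column names follow the Journeys table.
sample_csv = io.StringIO(
    "Id,internalStartZones,SearchStart,SearchEnd\n"
    "a,1001,Nørreport St. (01),København H (01)\n"
    "b,1062,Farum St. (62),Dyssegård St. (31)\n"
    "c,1004,CPH Lufthavn (04),Nørreport St. (01)\n"
)

def read_filtered(source, chunksize):
    """Read a CSV in chunks, keeping rows whose start zone is 1001-1004."""
    kept = []
    reader = pd.read_csv(source, chunksize=chunksize, dtype={"internalStartZones": str})
    for chunk in reader:
        mask = chunk["internalStartZones"].isin(["1001", "1002", "1003", "1004"])
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

filtered = read_filtered(sample_csv, chunksize=2)
print(filtered["Id"].tolist())
```

Only the filtered rows of each chunk are held in memory at once, at the cost of a slightly longer loop.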

In [3]:
data = pd.read_csv('../Data/All_Journeys_small.csv')
data

Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
0,13581986-9f2d-455b-b5a1-00000010eaeb,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade)",,2024-03-14 10.55.56.6562914,2024-03-14 10.55.56.6562914,,,,
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
2,cbd5ad3b-0bf0-4314-bd74-000001a41c82,,,0,100110021003,,3,0,,Femøren St. (Metro),Bispebjerg Hospital (Tagensvej),2023-08-04 08.33.59.0415651,2023-08-04 08.33.59.0415651,,,,
3,1f4ed562-1e81-40be-ac2b-000001a4e840,,1029,1029,"1001,1002,1008,1029,1032,1043,1054,1066,1076,1...",,9,1001,,,,2020-08-21 19.34.33.6834876,2020-08-21 19.34.33.6834876,,,,
4,afbd1023-61e5-4b47-9f9d-000001c9d32b,Zone,1062,0,"1062,1071,1061,1052,1053,1063,1073,1072,1051,1...",,4,0,,Farum St. (62),Dyssegård St. (31),2023-11-09 17.28.41.0671191,2023-11-09 17.28.41.0671191,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,89aca741-aaad-4e03-ad4e-05dc7d2d388c,Zone,1010,0,"1010,1016,1036,1037,1011,1012,1017,1018,1039,1...",,6,0,,Gilleleje St. (10),Helsingør St. (05),2023-07-26 10.33.40.6757535,2023-07-26 10.33.40.6757535,,,,
999996,64ddaac3-0201-4841-b5d2-05dc7dc42b6b,,,0,100110031004,,3,0,,,,2020-08-25 16.47.39.2628986,2020-08-25 16.47.39.2628986,,,,
999997,139c9211-8f1d-4264-8673-05dc7e038a52,Zone,1002,0,1002100110031030103110321033,,2,0,,My Location (02),København/City (01),2021-03-08 12.02.16.9444135,2021-03-08 12.02.16.9444135,,,,
999998,9b14479b-d29a-4b52-a2bd-05dc7e790206,,,0,"1001,1002,1003,1004,1005,1006,1007,1008,1009,1...",,0,0,,,,2021-02-03 08.48.51.5517535,2021-02-03 08.48.51.5517535,,0.0,,


## Filtering data

In order to filter our data, a number of checks need to be made to be certain that a journey is within Copenhagen (cph) and that it contains the information relevant for our purpose.

For a journey to be within cph, it must only make use of zones 1 through 4 (zone codes 1001 to 1004):
1. Check that *internalStartZones* only contains zones within cph
2. Check that *internalValidZones* only contains zones within cph

For a journey to be relevant for the project, we need the fields *StartStop*, *EndStop*, *SearchStart* and *SearchEnd* to be either fully filled out or partly filled out. That is, if StartStop and EndStop are null, then SearchStart and SearchEnd need to be filled. Likewise, the fields must not match in their values; a journey's start and end should not be the same.
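The zone rule can be tried out on a few hand-written values before it is applied to the full table. This is a small sketch that assumes the zone lists are stored as comma-separated strings; the sample values are made up for illustration.

```python
import pandas as pd

# Hand-written examples: inside cph, partly outside cph, single zone, missing.
zones = pd.Series(["1001,1002,1003", "1001,1002,1030", "1004", None])

cph_zone_list = r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$'

# na=False turns missing values into False, so they can be allowed explicitly.
only_cph = zones.str.match(cph_zone_list, na=False) | zones.isna()
print(only_cph.tolist())
```

A list containing any zone outside 1001 to 1004 is rejected, while a missing value is let through and left for the later checks.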


In [4]:
#Copenhagen filtering
condition_1_cph = (
    (data['internalValidZones'].str.match(r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$')
    | # or
    pd.isna(data['internalValidZones']))
    )

condition_2_cph = (
    (data['internalStartZones'].str.match(r'^(1001|1002|1003|1004)$'))
    | # or
    pd.isna(data['internalStartZones'])
    )

In [5]:
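# Combine both masks before indexing; applying condition_2_cph to the already
# filtered frame forces pandas to reindex the mask and emits a UserWarning
cph_data = data[condition_1_cph & condition_2_cph]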

cph_data = cph_data[ ~ (cph_data['SearchStart'].str.contains("okation", na=False)
                                             | #Or
                                             cph_data['SearchStart'].str.contains("zoner", na=False))]
cph_data = cph_data[( ~ (cph_data['SearchEnd'].str.contains("zoner", na=False) 
                                            | #Or
                                            cph_data['SearchEnd'].str.contains("okation", na=False)))]

# next two filters are English filters of the first
cph_data = cph_data[( ~ (cph_data['SearchEnd'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data['SearchEnd'].str.contains("ocation", na=False)))]

cph_data = cph_data[( ~ (cph_data['SearchStart'].str.contains("zones", na=False) 
                                            | #Or
                                            cph_data['SearchStart'].str.contains("ocation", na=False)))]

# Next filter is to remove entries where one of the matching search-x or x-stop are Null
cph_data = cph_data[(
                                        ( ~ (pd.isna(cph_data['SearchStart'])) & ~ (pd.isna(cph_data['SearchEnd'])))
                                        | # Or
                                        ( ~ (pd.isna(cph_data['StartStop'])) & ~ (pd.isna(cph_data['EndStop'])))
                                        )]

# Next filter removes all entries where SearchStart and SearchEnd contain the same value
cph_data = cph_data[(
                        ~(cph_data['SearchStart'] == cph_data['SearchEnd'])
                        )]

cph_data



Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
22,27ddd4c7-35d5-4e95-84a5-0000092e14a0,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Hulgårds Plads (Frederikssundsvej) (02),2023-01-01 13.16.11.4343765,2023-01-01 13.16.11.4343765,,,,
61,986798c5-e47a-4a6f-ba89-0000177df4cf,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Islands Brygge St. (Metro) (01),2022-08-25 09.08.46.4521964,2022-08-25 09.08.46.4521964,,,,
69,ad7d6db6-ab4e-4782-8976-000019f0ecd6,Zone,1001,0,100110021003,,2,0,,København H (Metro) (01),Frederiksberg Allé St. (Metro) (01),2023-07-24 05.41.23.6936628,2023-07-24 05.41.23.6936628,,,,
92,fec8331d-e54e-48b2-88fe-00002217de59,Zone,1001,0,100110021003,,2,0,,Nørreport St. (01),Sluseholmen (Sjællandsbroen) (02),2022-10-05 12.44.54.5227363,2022-10-05 12.44.54.5227363,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999881,2cf0443e-cf66-464a-a6ba-05dc4e88c9d3,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),København H (01),2023-06-29 14.41.43.3292495,2023-06-29 14.41.43.3292495,,,,
999908,387804ed-cbc1-456e-a24a-05dc5a59af34,Zone,1004,0,1004100310011002,,3,0,,Kastrup St. (Metro) (04),Frederiksberg St. (Metro) (02),2023-08-28 16.06.29.8130118,2023-08-28 16.06.29.8130118,,,,
999930,4d4f9944-9669-4219-820d-05dc65834645,Zone,1003,0,1003100110021004,,2,0,,Bella Center St. (Center Østvej) (03),Frederiksberg St. (Metro) (02),2023-11-22 07.16.16.0657646,2023-11-22 07.16.16.0657646,,,,
999942,ec457061-bae5-4040-95cb-05dc69e1bfa8,Zone,1004,0,1004100310011002,,3,0,,CPH Lufthavn (04),Nørreport St. (01),2022-08-22 06.27.09.2054089,2022-08-22 06.27.09.2054089,,,,


## Testing to see whether our filtering worked

Since we are handling a very large amount of data, it is difficult to skim through the data to see whether it is as intended. These tests detect whether rows that are not supposed to be in our data are still present.
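Most of these checks follow the same pattern: count the rows where a column contains a given substring. A minimal sketch of a reusable helper is shown here; the name `count_containing` and the toy values are hypothetical and only serve as illustration.

```python
import pandas as pd

def count_containing(df, column, needle):
    """Number of rows whose `column` contains `needle` (missing values ignored)."""
    return int(df[column].str.contains(needle, na=False, regex=False).sum())

# Made-up rows; the real check would run on cph_data.
toy = pd.DataFrame({
    "SearchStart": ["Nørreport St. (01)", "Min lokation", None],
    "SearchEnd": ["København H (01)", "Alle zoner", "My Location (02)"],
})

for column in ["SearchStart", "SearchEnd"]:
    for needle in ["okation", "ocation", "zoner", "zones"]:
        print(f"{column} / {needle}: {count_containing(toy, column, needle)}")
```

On correctly filtered data every count should be zero, so the loop can also be turned into assertions.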

In [6]:
# Test 1 for whether our data contain SearchEnd values containing 'lokation' or 'location'
lokation_count = cph_data[cph_data['SearchEnd'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entires in 'SearchEnd' : {lokation_count['SearchEnd']}")

location_count = cph_data[cph_data['SearchEnd'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entires in 'SearchEnd' : {location_count['SearchEnd']}")

# Test 2 for whether our data contain SearchStart values containing 'lokation' or 'location'
lokation_count_s = cph_data[cph_data['SearchStart'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entires in 'SearchStart' : {lokation_count_s['SearchStart']}")

location_count_s = cph_data[cph_data['SearchStart'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entires in 'SearchStart' : {location_count_s['SearchStart']}")

# Test 3 for whether our data contain SearchEnd values with 'zones' or 'zoner'
zones_count = cph_data[cph_data['SearchEnd'].str.contains("zones", na=False)].count()
print(f"Amount of 'zones' entires in 'SearchEnd' : {zones_count['SearchEnd']}")

zones_count_r = cph_data[cph_data['SearchEnd'].str.contains("zoner", na=False)].count()
print(f"Amount of 'zoner' entires in 'SearchEnd' : {zones_count_r['SearchEnd']}")

# Test 4 for whether our data contain None in 3 or more fields (startStop, EndStop, SearchStart and SearchEnd)
num_nulls = cph_data[['StartStop', 'EndStop', 'SearchStart', 'SearchEnd']].isna().sum(axis=1)
b = (num_nulls >= 3).any()
print(f"Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd is null: {b}")

# Test 5 for whether our data contain duplicates in matching fields, i.e. StartStop == EndStop
duplicates_in_stop = cph_data[(cph_data['StartStop'] == cph_data['EndStop'])].count()
print(f"Amount of matching values in StartStop and EndStop : {duplicates_in_stop['StartStop']}")


# Test 6 for whether our data contain duplicates in matching fields, i.e. SearchStart == SearchEnd
duplicates_in_stop = cph_data[(cph_data['SearchStart'] == cph_data['SearchEnd'])].count()
print(f"Amount of matching values in SearchStart and SearchEnd : {duplicates_in_stop['SearchStart']}")

# Test 7 for whether our data contain rows with exactly three of the fields filled.
# notna() is summed directly: ~(...isna()).sum(axis=1) would negate the counts, not the mask
num_filled = cph_data[['StartStop', 'EndStop', 'SearchStart', 'SearchEnd']].notna().sum(axis=1)
b = (num_filled == 3).any()
print(f"Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd are filled: {b}")


Amount of 'Lokation' entires in 'SearchEnd' : 0
Amount of 'Location' entires in 'SearchEnd' : 0
Amount of 'Lokation' entires in 'SearchStart' : 0
Amount of 'Location' entires in 'SearchStart' : 0
Amount of 'zones' entires in 'SearchEnd' : 0
Amount of 'zoner' entires in 'SearchEnd' : 0
Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd is null: False
Amount of matching values in StartStop and EndStop : 0
Amount of matching values in SearchStart and SearchEnd : 0
Does the data contain a row which 3 of StartStop, EndStop, SearchStart or SearchEnd are filled: False


# Sequences

We now wish to make sequences from the journeys. A sequence should be either a value pair of SearchStart and SearchEnd or a pair of StartStop and EndStop. To do this we simply collect the pairs from the dataframe, where StartStop and EndStop IDs are "translated" to station names.

When making the sequences, certain questions arise about the data. For instance, of the 3.4 million datapoints, only 64 contain a value *only* in StartStop and EndStop. (`test_df[~(pd.isna(test_df['StartStop'])) & (pd.isna(test_df['SearchStart']))]`)

Another important decision is how to extract stations from SearchStart and SearchEnd, since many of the entries do not consist of a directly matching station name, e.g. 'Hovedbanegården' being the SearchStart for the station 'København H'. Thus we need to map these inconsistent strings to a consistent naming convention.

The first step in the creation of sequences is to parse our strings so that they fit the same format. Many stations have '(01)' or another number in parentheses, probably indicating either which zone the user is searching from or where they are going. We are not interested in this number. The following regex handles this parsing:

```regex 
    r'[(]\d\d[)]'
```

Likewise we are not interested in detailed searches such as a full address, e.g. 'Klokkerhøjen 6 st, 2400 København NV, Denmark'. Thus we remove everything from the first ',' onwards. We do this with the following regex:

```regex
    r'(,.*$)'
```

Lastly we wish to remove all '(togbus)' parts of a string. Note, however, that the strings might contain '(Metro)', which we are interested in keeping. Thus we need to remove everything inside a pair of parentheses unless the string contains 'Metro'. This regex also covers the first one, since '(01)' is itself a parenthesised group, so the first regex is only needed for strings containing 'Metro'. The removal is done partly through code and partly with the regex:

```regex
    r'\s*\([^)]*\)'
```
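The three patterns above can be checked on a few strings taken from the table output earlier in the notebook. This is a minimal sketch; the helper name `normalise_name` is hypothetical and only used for illustration.

```python
import re

pattern_for_comma = r'(,.*$)'
pattern_for_parenthesis = r'\s*\([^)]*\)'
pattern_for_parenthesis_number = r'[(]\d\d[)]'

def normalise_name(raw):
    """Apply the comma pattern first, then one of the parenthesis patterns."""
    name = re.sub(pattern_for_comma, "", raw)
    if "Metro" in name:
        # Keep '(Metro)' and drop only the two-digit number in parentheses
        name = re.sub(pattern_for_parenthesis_number, "", name)
    else:
        name = re.sub(pattern_for_parenthesis, "", name)
    return name.strip()

examples = [
    "Hovedbanegården, Tivoli (Bernstorffsgade) (01)",
    "København H (togbus) (01)",
    "Islands Brygge St. (Metro) (01)",
]
for raw in examples:
    print(f"{raw!r} -> {normalise_name(raw)!r}")
```

The comma pattern runs first, so anything in parentheses after the first comma is already gone before the parenthesis patterns are applied.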



In [None]:
cph_data = pd.read_csv('../Data/cph_file.csv')

In [7]:

pattern_for_comma = r'(,.*$)'
pattern_for_parenthesis = r'\s*\([^)]*\)'
pattern_for_parenthesis_number = r'[(]\d\d[)]'

sequences = []

station_counter = {}


def clean_station(name: str) -> str:
    """Strip address details and parenthesised suffixes, keeping '(Metro)'."""
    name = re.sub(pattern_for_comma, "", name)
    if "Metro" not in name:
        name = re.sub(pattern_for_parenthesis, "", name)
    else:
        name = re.sub(pattern_for_parenthesis_number, "", name)
    return name.strip()


def get_sequence(row) -> None:
    initial_start   = row['SearchStart']
    initial_end     = row['SearchEnd']
    # Skip rows where either search field is missing; re.sub cannot handle NaN
    if pd.isna(initial_start) or pd.isna(initial_end):
        return

    start   = clean_station(initial_start)
    end     = clean_station(initial_end)

    if start != "":
        sequences.append([start, end])

        # Count every appearance of a station, as start or as end
        station_counter[start] = station_counter.get(start, 0) + 1
        station_counter[end] = station_counter.get(end, 0) + 1

cph_data.apply(get_sequence, axis=1)

1         None
22        None
61        None
69        None
92        None
          ... 
999881    None
999908    None
999930    None
999942    None
999977    None
Length: 80226, dtype: object

In [None]:
res = dict(sorted(station_counter.items(), key = lambda x: x[1], reverse = True)[:10])
res

# Top used stations are:
# 'København H': 402727,
# 'Nørreport St.': 342289,
# 'Kongens Nytorv St. (Metro)': 326808,
# 'CPH Lufthavn': 235769,
# 'Refshaleøen': 189228,
# 'Hovedbanegården': 156454,
# 'Ørestad St.': 139841,
# 'København H (Metro)': 115016,
# 'Christianshavn St. (Metro)': 106272,
# 'Amagerbro St. (Metro)': 101808
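The counting and the top-10 lookup can also be expressed with `collections.Counter` from the standard library. The following is a small sketch with made-up pairs, not the notebook's actual data.

```python
from collections import Counter

# Made-up (start, end) pairs standing in for the real sequences.
toy_sequences = [
    ["København H", "Nørreport St."],
    ["Nørreport St.", "CPH Lufthavn"],
    ["København H", "CPH Lufthavn"],
    ["Nørreport St.", "København H"],
]

# Count every appearance of a station, regardless of its position in the pair.
counter = Counter(station for pair in toy_sequences for station in pair)
print(counter.most_common(2))
```

`most_common(n)` returns the `n` highest counts in descending order, which replaces the manual sort and slice.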


In [8]:

for station, count in station_counter.items():
    if "Kongens Nytorv" in station:
        print(f"{station} : {count}")

Kongens Nytorv St. (Metro) : 7455
Kongens Nytorv : 663
Kongens Nytorv St. : 517
Kongens Nytorv 21E : 1
Kongens Nytorv 11 : 23
Kongens Nytorv 13 : 44
Kongens Nytorv 1050 København K : 60
Kongens Nytorv 26 : 2
Kongens Nytorv 19 : 2
Kongens Nytorv 17 : 3
Kongens Nytorv 21F : 6
Kongens Nytorv 4 : 1
Kongens Nytorv 24 : 2
Kongens Nytorv 21A : 1
Kongens Nytorv 9 : 1
Kongens Nytorv 6 : 1
Kongens Nytorv 16F : 2
Kongens Nytorv 1 : 2
Kongens Nytorv 34 : 2
Kongens Nytorv 5 : 1
Kongens Nytorv 23 : 1


## Dealing with the problem of the same places in Copenhagen being searched with different naming conventions (and languages)

In [9]:
%pip install geopy


Note: you may need to restart the kernel to use updated packages.


In [10]:
from geopy.geocoders import Nominatim
from geopy.distance import geodesic
geolocator = Nominatim(user_agent="my_geocoder")

- geopy is a Python client for several popular geocoding web services.

- geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

- geopy includes geocoder classes for the OpenStreetMap Nominatim, Google Geocoding API (V3), and many other geocoding services. The full list is available on the Geocoders doc section. Geocoder classes are located in geopy.geocoders.

A small test that geocodes some of the most used stations in Copenhagen:

In [None]:



# Top used stations are:
# 'København H': 402727,
# 'Nørreport St.': 342289,
# 'Kongens Nytorv St. (Metro)': 326808,
# 'CPH Lufthavn': 235769,
# 'Refshaleøen': 189228,
# 'Hovedbanegården': 156454,
# 'Ørestad St.': 139841,
# 'København H (Metro)': 115016,
# 'Christianshavn St. (Metro)': 106272,
# 'Amagerbro St. (Metro)': 101808


place1 = 'København H'  # 55.6727587 12.564678938785772
place2 = 'Nørreport St.' #55.6840689 12.5725383
# place3 = 'Kongens Nytorv St. (Metro)'
place4 = 'CPH Lufthavn' #55.6091282 12.650982248393536
place5 = 'Refshaleøen' #55.693321499999996 12.61966813806588
place6 = 'Hovedbanegården' #55.6727587 12.564678938785772
place7 = 'Ørestad St.'
# place8 = 'København H (Metro)'
# place9 = 'Christianshavn St. (Metro)'
# place10 = 'Amagerbro St. (Metro)'

location_place1 = geolocator.geocode(place1)
location_place2 = geolocator.geocode(place2)
# location_place3 = geolocator.geocode(place3)
location_place4 = geolocator.geocode(place4)
location_place5 = geolocator.geocode(place5)
location_place6 = geolocator.geocode(place6)
location_place7 = geolocator.geocode(place7)
# location_place8 = geolocator.geocode(place8)
# location_place9 = geolocator.geocode(place9)
# location_place10 = geolocator.geocode(place10)

print("Location 1:", location_place1.latitude, location_place1.longitude)
print("Location 2:", location_place2.latitude, location_place2.longitude)
# print("Location 3:", location_place3.latitude, location_place3.longitude)
print("Location 4:", location_place4.latitude, location_place4.longitude)
print("Location 5:", location_place5.latitude, location_place5.longitude)
print("Location 6:", location_place6.latitude, location_place6.longitude)
print("Location 7:", location_place7.latitude, location_place7.longitude)
# print("Location 8:", location_place8.latitude, location_place8.longitude)
# print("Location 9:", location_place9.latitude, location_place9.longitude)
# print("Location 10:", location_place10.latitude, location_place10.longitude)

#### A first test to find all stations in the sequences that lie close to the coordinates of 'Hovedbanegården'
The targets known to us are specifically 'København H' / 'Hovedbanegården' and 'Cph lufthavn' / 'Kbh lufthavnen' / 'Lufthavnen st' / 'Kastrup st'.

We create a list of unique station names to use with the geolocator in order to save time.
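Another way to avoid repeated lookups is to cache the geocoder's answers, so that each distinct name is sent to the service only once. This is a minimal sketch using `functools.lru_cache`; the wrapper name `make_cached_geocoder` and the stand-in `fake_geocode` are hypothetical, and no network request is made here.

```python
from functools import lru_cache

def make_cached_geocoder(geocode):
    """Wrap a geocode callable so each distinct name is looked up only once."""
    @lru_cache(maxsize=None)
    def cached(name):
        return geocode(name)
    return cached

# Stand-in for a real geocoder: records calls and returns fixed coordinates.
calls = []

def fake_geocode(name):
    calls.append(name)
    return (55.6727587, 12.564678938785772)

lookup = make_cached_geocoder(fake_geocode)
for name in ["København H", "Hovedbanegården", "København H"]:
    lookup(name)

print(len(calls))  # the repeated name is served from the cache
```

In the notebook the wrapper could presumably be given `geolocator.geocode` instead of the stand-in.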

In [11]:
distinct_stations_set = set()
for seq in sequences:
    for place in seq:
        distinct_stations_set.add(place)

# Convert the set back to a list
unique_stations_list = list(distinct_stations_set)
print(unique_stations_list)
len(unique_stations_list)


['Amerika Plads 19', 'Nørrebrogade 14', 'Amagertorv 1', 'Hillerødgade 23', 'Hvidkildevej 64', 'Bredgade 78', 'Østerbrogade 74', 'Njalsgade 35', 'Sallingvej 25', 'Lindgreens Allé 9', 'Borgergade 9', 'Ramsingsvej 28B', 'Lynetten', 'Poppelrækken 3', 'Øster Voldgade 20', 'Boldhusgade 6', 'Vesterbrogade 5', 'Margretheholmsvej 38', 'Dag Hammarskjölds Allé 24', 'Kompagnistræde 9', 'Sjællandsbroen 2450 København SV', 'Pilestræde 16', 'Skrivergangen 8', 'Engmarken 6', 'Toldbodgade 4', 'Stærevej 50', 'Grundtvigsvej 8A', 'Refshalevej 161K', 'Øster Farimagsgade 12', 'Tingvej 2300 København S', 'Amagertorv 1F', 'Fredensgade 11C', 'Nørrebrogade 68', 'Tinggården 7', 'Ny Carlsberg Vej 97', 'Marmorkirken St. (Metro)', 'Hollændervænget', 'Ved Vesterport 2', 'Offenbachsvej 31', 'Kongelundsvej 365', 'NH Collection Copenhagen', 'Nordborggade 2100 København Ø', 'Hannemanns Allé 2300 København S', 'Bellahøjvej 154', 'Kongovej 14P', 'Restrupvej 27', 'Langebrogade 5', 'Bystævneparken 21', 'Korsvejens Skole', '

6078

In [12]:
from geopy.distance import geodesic
from geopy.exc import GeocoderTimedOut, GeocoderUnavailable, GeocoderQueryError
import time


# Returns a list of all locations that should be mapped to the station at the
# target coordinates, plus the station list with those locations removed.
def get_locations_close_to_target(targetCoords, unique_stations):
    iteration_counter = 0
    places_close_to_target = []
    # Iterate over a copy, since matched places are removed from unique_stations
    for place in list(unique_stations):

        if "Metro" in place:
            continue

        iteration_counter += 1
        # Skip this place if it has already been matched
        if place in places_close_to_target:
            continue

        retries = 3
        while retries > 0:
            try:
                location = geolocator.geocode(place, timeout=None)
                if location is not None:
                    place_coordinates = (location.latitude, location.longitude)
                    distance_to_target = geodesic(targetCoords, place_coordinates).kilometers
                    if distance_to_target < 0.1:  # 100 m; adjust this threshold as needed
                        places_close_to_target.append(place)  # Append place name only
                        unique_stations.remove(place)
                        print(f"location number in list: {iteration_counter}")
                break  # Exit the retry loop if successful
            except GeocoderTimedOut:
                retries -= 1
                if retries == 0:
                    print(f"Max retries exceeded for {place}. Skipping...")
                time.sleep(1)  # Delay between retries to avoid overwhelming the server
            except GeocoderUnavailable as e:
                retries -= 1  # Count this attempt so the loop cannot run forever
                print(f"Geocoder unavailable: {e}")
                time.sleep(5)  # Wait for 5 seconds before retrying
            except GeocoderQueryError as e:
                print(f"Geocoder query error: {e}")
                break  # Exit the retry loop if there's a query error

    return places_close_to_target, unique_stations
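To get a feel for the 0.1 km threshold used in the function, the distance between two of the target coordinates can be computed by hand. The sketch below uses the haversine formula from the standard library as a stand-in for `geodesic`; haversine assumes a spherical Earth, so its result differs slightly from geopy's ellipsoidal distance. The function name `haversine_km` is an assumption made for this example.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(h))  # mean Earth radius in km

# Coordinates as returned by the geocoder earlier in the notebook.
kbh_h = (55.6727587, 12.564678938785772)
norreport = (55.6840689, 12.5725383)

distance = haversine_km(kbh_h, norreport)
print(f"{distance:.2f} km, within 100 m: {distance < 0.1}")
```

The two stations are more than a kilometre apart, so a 100 m radius keeps them clearly separate.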


In [None]:

København_H  =    (55.6727587, 12.564678938785772)
Nørreport_St =    (55.6840689, 12.5725383)
CPH_Lufthavn =    (55.6091282, 12.650982248393536)
Refshaleøen  =    (55.693321499999996, 12.61966813806588)
# Hovedbanegården = (55.6727587, 12.564678938785772)

all_stations_to_change_kbh_h, updated_list           = get_locations_close_to_target(København_H, unique_stations_list)
all_stations_to_change_nørreport, updated_list       = get_locations_close_to_target(Nørreport_St, updated_list)
all_stations_to_change_cph_lufthavn, updated_list    = get_locations_close_to_target(CPH_Lufthavn, updated_list)
all_stations_to_change_refshaleøen , updated_list    = get_locations_close_to_target(Refshaleøen, updated_list)

# lufthavnen_coords = (55.595098, 12.6179894)
# hovedbanen_coords = (55.595098, 12.6179894)
# lufthavnen = get_locations_close_to_target(lufthavnen_coords,unique_stations_list)
# hovedbanen = get_locations_close_to_target(hovedbanen_coords,unique_stations_list)

For each sequence, check whether a name is in the list of locations that should be interpreted as 'København H' and, if so, replace it with 'København H'.

In [None]:
# Names that geocode close to København H; a set gives fast membership tests
kbh_h_aliases = set(all_stations_to_change_kbh_h)

# Iterate through the sequences and replace matching names in either position
for sequence in sequences:
    for i, station in enumerate(sequence):
        if station in kbh_h_aliases:
            sequence[i] = 'København H'

# Print the updated sequences
print("Updated sequences:")
for sequence in sequences:
    print(sequence)
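When several target stations are involved, the replacement could also be driven by a single dictionary that maps every alias to its canonical name. This is a minimal sketch; the alias table and the function name `apply_aliases` are hypothetical and only illustrate the idea.

```python
def apply_aliases(sequences, aliases):
    """Return new sequences with every aliased name replaced by its canonical name."""
    return [[aliases.get(station, station) for station in pair] for pair in sequences]

# Hypothetical alias table: each key is a searched name, each value the station it maps to.
aliases = {
    "Hovedbanegården": "København H",
    "Kbh lufthavnen": "CPH Lufthavn",
}

toy_sequences = [["Hovedbanegården", "Nørreport St."], ["Kbh lufthavnen", "København H"]]
print(apply_aliases(toy_sequences, aliases))
```

Names without an entry in the table are left unchanged, so the table can be extended one target at a time.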


In [13]:
# Map every non-Metro name containing 'Hovedbane' or 'Lufthavn' to its canonical station
for seq in sequences:
    for i, station in enumerate(seq):
        if "Metro" not in station:
            if "Hovedbane" in station:
                seq[i] = "København H"
            elif "Lufthavn" in station:
                seq[i] = "CPH Lufthavn"

In [14]:
import csv

with open('sequences.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(sequences)
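Reading the file back is a quick way to confirm that the pairs survive the round trip, including the Danish characters. The sketch below writes made-up pairs to a temporary directory; passing `encoding="utf-8"` explicitly is an assumption added for the example, not something the cell above does.

```python
import csv
import os
import tempfile

# Made-up pairs standing in for the real sequences.
toy_sequences = [["København H", "Nørreport St."], ["Ørestad St.", "CPH Lufthavn"]]

path = os.path.join(tempfile.mkdtemp(), "sequences.csv")

# An explicit encoding keeps the Danish characters independent of the platform default.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(toy_sequences)

with open(path, newline="", encoding="utf-8") as f:
    restored = [row for row in csv.reader(f)]

print(restored == toy_sequences)
```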

In [1]:
%pip install gensim
#from gensim.test.utils import common_texts
import scipy
scipy.__version__


Note: you may need to restart the kernel to use updated packages.


'1.10.0'

In [3]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
