# Welcome to the Notebook for Monha's and Bemi's Bachelor Project

## Content

In this notebook we will:

1. Aggregate our data into usable travel sequences with only the relevant data
2. Analyse the appropriate data
3. Create an embedding space using Word2Vec

We will use the following format for the structure of the file:
1. A Markdown cell describing the intention of the code that follows, plus an explanation of the code's results, if any
2. A code cell containing the code

# Initial Setup

Please pip install the correct libraries for the following code to work.

In [None]:
%pip install pandas # Pandas for data handling
%pip install numpy  # Maths stuff

In [1]:
import pandas as pd
import numpy as np

# Data import

The data used in this notebook is extracted from the Journeys table in the database.

The data set contains roughly 43 million rows, covering all journeys travelled over a period of about 4 years. For this project we filter the data so that we only work with journeys within Copenhagen.

In [2]:
data = pd.read_csv('../Data/All_Journeys.csv')
data

Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
0,13581986-9f2d-455b-b5a1-00000010eaeb,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade)",,2024-03-14 10.55.56.6562914,2024-03-14 10.55.56.6562914,,,,
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
2,cbd5ad3b-0bf0-4314-bd74-000001a41c82,,,0,100110021003,,3,0,,Femøren St. (Metro),Bispebjerg Hospital (Tagensvej),2023-08-04 08.33.59.0415651,2023-08-04 08.33.59.0415651,,,,
3,1f4ed562-1e81-40be-ac2b-000001a4e840,,1029,1029,"1001,1002,1008,1029,1032,1043,1054,1066,1076,1...",,9,1001,,,,2020-08-21 19.34.33.6834876,2020-08-21 19.34.33.6834876,,,,
4,afbd1023-61e5-4b47-9f9d-000001c9d32b,Zone,1062,0,"1062,1071,1061,1052,1053,1063,1073,1072,1051,1...",,4,0,,Farum St. (62),Dyssegård St. (31),2023-11-09 17.28.41.0671191,2023-11-09 17.28.41.0671191,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43345941,74795e73-8f89-4b5f-b392-fffffe36c3d3,Zone,1002,0,"1002,1001,1003,1030,1031,1032,1033,1004,1040,1...",,4,0,,Min Lokation (02),Måløv St. (53),2021-01-05 13.41.15.8483902,2021-01-05 13.41.15.8483902,,,,
43345942,4afd6d02-6599-4345-8fc5-fffffe936d49,Zone,1002,0,"1002,1001,1003,1030,1031,1032,1033,1004,1040,1...",,3,0,,Hans Knudsens Plads (Lyngbyvej) (02),Lyngby St. (41/51),2023-06-08 08.31.27.5187233,2023-06-08 08.31.27.5187233,,,,
43345943,2c273886-33bf-4234-b58c-fffffeb987cf,Zone,1002,0,1002100110031030103110321033,,2,0,,Ryparken St.,Nordhavn St.,2022-03-14 20.46.13.2072936,2022-03-14 20.46.13.2072936,,,,
43345944,d8e9f98b-7354-46e4-bb14-fffffed79bee,Zone,1052,0,"1052,1051,1041,1042,1053,1063,1062,1061,1060,1...",,4,0,,Min Lokation (52),"Struenseegade 45, 2200 København N, Københavns...",2022-01-24 04.57.47.3975380,2022-01-24 04.57.47.3975380,,,,
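
With roughly 43 million rows, loading every column of the file at once can use a lot of memory. The following is a minimal sketch of one possible alternative, not the approach used in this notebook: read the CSV in chunks, keep only the columns the later filtering needs, and optionally filter each chunk before combining. The helper `read_journeys_in_chunks`, the column selection, the `keep_rows` callback and the chunk size are all illustrative assumptions, and a small in-memory sample with made-up rows stands in for `All_Journeys.csv`.

```python
import io
import pandas as pd

# Hypothetical column selection: only the fields the later filtering steps use.
RELEVANT_COLUMNS = [
    "Id", "internalStartZones", "internalValidZones",
    "StartStop", "EndStop", "SearchStart", "SearchEnd",
]
CPH_ZONES = ["1001", "1002", "1003", "1004"]

def read_journeys_in_chunks(source, keep_rows=None, chunksize=1_000_000):
    """Read a journeys CSV chunk by chunk and return one combined DataFrame."""
    reader = pd.read_csv(
        source,
        usecols=RELEVANT_COLUMNS,  # skip columns that are never used
        dtype=str,                 # keep zone lists such as "1001,1002" as text
        chunksize=chunksize,       # yields DataFrames of at most this many rows
    )
    kept = []
    for chunk in reader:
        if keep_rows is not None:
            chunk = chunk[keep_rows(chunk)]  # filter early to save memory
        kept.append(chunk)
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for the real file (made-up rows).
sample_csv = io.StringIO(
    "Id,Type,internalStartZones,internalValidZones,StartStop,EndStop,SearchStart,SearchEnd\n"
    'a,Zone,1001,"1001,1002,1003",,,Ryparken St.,Nordhavn St.\n'
    'b,Zone,1062,"1062,1071",,,Farum St. (62),Dyssegård St. (31)\n'
    'c,Zone,1002,"1002,1001",,,Nordhavn St.,Ryparken St.\n'
)
journeys = read_journeys_in_chunks(
    sample_csv,
    keep_rows=lambda df: df["internalStartZones"].isin(CPH_ZONES),
    chunksize=2,
)
print(journeys[["Id", "internalStartZones", "internalValidZones"]])
```

The per-chunk filter here only looks at the start zone; the full set of conditions described in the next section would go into `keep_rows` in the same way.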


## Filtering data

To filter our data, several checks are needed to make sure that a journey lies within Copenhagen (cph) and contains the information relevant to our purpose.

For a journey to be within Copenhagen, it must only make use of zones 1 through 4 (zone codes 1001 to 1004):
1. Check that *internalStartZones* contains only zones within Copenhagen
2. Check that *internalValidZones* contains only zones within Copenhagen
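
The two zone checks can be illustrated on a few hand-written values. This is a minimal sketch of a set-based variant of the check, assuming zone lists are stored as comma-separated strings such as `1001,1002,1003` and that missing values are allowed to pass; the helper `only_cph_zones` and the sample values are hypothetical.

```python
import pandas as pd

# Zone codes assumed to correspond to zones 1 through 4.
CPH_ZONES = {"1001", "1002", "1003", "1004"}

def only_cph_zones(zone_list):
    """True if every zone in a comma-separated list is a Copenhagen zone.

    Missing values pass the check, so rows without zone data are not dropped.
    """
    if pd.isna(zone_list):
        return True
    return set(str(zone_list).split(",")) <= CPH_ZONES

valid_zones = pd.Series(["1001,1002,1003", "1002,1001,1030", "1004", None])
is_cph = valid_zones.apply(only_cph_zones)
print(is_cph.tolist())
```

A subset test like this reads closer to the rule in the text, while a regular expression on the whole column is typically faster on tens of millions of rows.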

For a journey to be relevant for the project, the fields *StartStop*, *EndStop*, *SearchStart* and *SearchEnd* must be either fully or partly filled out: if *StartStop* and *EndStop* are null, then *SearchStart* and *SearchEnd* must be filled. Likewise, the start and end values must not match; a journey's start and end should not be the same.
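
The relevance rule can be sketched as a boolean mask. This is a minimal illustration on made-up rows, assuming that "start and end should not be the same" means a direct string comparison of the paired fields; the helper `relevant_journeys` is hypothetical.

```python
import pandas as pd

def relevant_journeys(df):
    """Boolean mask: a usable start/end pair exists and start differs from end."""
    has_search = df["SearchStart"].notna() & df["SearchEnd"].notna()
    has_stops = df["StartStop"].notna() & df["EndStop"].notna()
    # Only compare values when both fields of the pair are present.
    same_search = has_search & (df["SearchStart"] == df["SearchEnd"])
    same_stops = has_stops & (df["StartStop"] == df["EndStop"])
    return (has_search | has_stops) & ~same_search & ~same_stops

sample = pd.DataFrame({
    "StartStop":   [None, None, "A", None],
    "EndStop":     [None, None, "B", None],
    "SearchStart": ["Ryparken St.", "Ryparken St.", None, "Nordhavn St."],
    "SearchEnd":   ["Nordhavn St.", "Ryparken St.", None, None],
})
mask = relevant_journeys(sample)
print(mask.tolist())
```

The second row is rejected because start and end are identical, and the last row because neither pair of fields is complete.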


In [3]:
#Copenhagen filtering
condition_1_cph = (
    (data['internalValidZones'].str.match(r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$')
    | # or
    pd.isna(data['internalValidZones']))
    )

condition_2_cph = (
    (data['internalStartZones'].str.match(r'^(1001|1002|1003|1004)$'))
    | # or
    pd.isna(data['internalStartZones'])
    )

In [4]:
cph_data_1 = data[condition_1_cph]
# Align the second mask to the already filtered rows; indexing with the
# full-length mask makes pandas warn that the boolean key will be reindexed
cph_data_2 = cph_data_1[condition_2_cph.loc[cph_data_1.index]]

# Remove placeholder searches in Danish ("... Lokation", "... zoner")
cph_data_3 = cph_data_2[~(cph_data_2['SearchStart'].str.contains("okation", na=False)
                          | cph_data_2['SearchStart'].str.contains("zoner", na=False))]
cph_data_4 = cph_data_3[~(cph_data_3['SearchEnd'].str.contains("zoner", na=False)
                          | cph_data_3['SearchEnd'].str.contains("okation", na=False))]
# The next two filters are the English equivalents ("... Location", "... zones")
cph_data_5 = cph_data_4[~(cph_data_4['SearchEnd'].str.contains("zones", na=False)
                          | cph_data_4['SearchEnd'].str.contains("ocation", na=False))]
cph_data_6 = cph_data_5[~(cph_data_5['SearchStart'].str.contains("zones", na=False)
                          | cph_data_5['SearchStart'].str.contains("ocation", na=False))]
# Keep only rows where both search fields or both stop fields are filled
cph_data = cph_data_6[
    (cph_data_6['SearchStart'].notna() & cph_data_6['SearchEnd'].notna())
    | # or
    (cph_data_6['StartStop'].notna() & cph_data_6['EndStop'].notna())
]

cph_data



Unnamed: 0,Id,Type,internalStartZones,StartZone,internalValidZones,StartStop,AmountOfZones,EndZone,EndStop,SearchStart,SearchEnd,ModifiedOn,CreatedOn,JourneyClasses_Id,TravelType,ExtraFrom,ExtraTo
1,715ec968-7783-4b6b-be27-0000014f64b3,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Tivoli (Bernstorffsgade) (01)","Borrebyvej 29, 2700 Brønshøj, Københavns Kommune",2023-08-18 22.19.25.4586286,2023-08-18 22.19.25.4586286,,,,
22,27ddd4c7-35d5-4e95-84a5-0000092e14a0,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Hulgårds Plads (Frederikssundsvej) (02),2023-01-01 13.16.11.4343765,2023-01-01 13.16.11.4343765,,,,
61,986798c5-e47a-4a6f-ba89-0000177df4cf,Zone,1001,0,100110021003,,2,0,,København H (togbus) (01),Islands Brygge St. (Metro) (01),2022-08-25 09.08.46.4521964,2022-08-25 09.08.46.4521964,,,,
69,ad7d6db6-ab4e-4782-8976-000019f0ecd6,Zone,1001,0,100110021003,,2,0,,København H (Metro) (01),Frederiksberg Allé St. (Metro) (01),2023-07-24 05.41.23.6936628,2023-07-24 05.41.23.6936628,,,,
92,fec8331d-e54e-48b2-88fe-00002217de59,Zone,1001,0,100110021003,,2,0,,Nørreport St. (01),Sluseholmen (Sjællandsbroen) (02),2022-10-05 12.44.54.5227363,2022-10-05 12.44.54.5227363,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43345884,7b4d55b7-a4f5-453c-887f-ffffe57690b7,Zone,1001,0,100110021003,,2,0,,Amagerbro St. (Metro) (01),Kongens Nytorv St. (Metro) (01),2024-01-31 06.42.46.8616806,2024-01-31 06.42.46.8616806,,,,
43345901,490baace-c75d-4fda-b608-ffffee254091,Zone,1001,0,100110021003,,2,0,,Islands Brygge St. (Ørestads Boulevard),Forum St. (Metro),2023-08-23 12.08.49.4881799,2023-08-23 12.08.49.4881799,,,,
43345902,1bb14585-68a2-404d-bc36-ffffeef1aa01,Zone,1001,0,100110021003,,2,0,,Rådhuspladsen St. (Vesterbrogade) (01),Skt. Annæ Gade (Prinsessegade) (01),2024-01-13 18.51.35.6627821,2024-01-13 18.51.35.6627821,,,,
43345916,fb57446e-412e-4365-8656-fffff4b08b3d,Zone,1001,0,100110021003,,2,0,,"Hovedbanegården, Frihedsstøtten (Vesterbrogade...","Roskildevej 96, 2000 Frederiksberg, Frederiksb...",2024-02-14 15.59.19.9272062,2024-02-14 15.59.19.9272062,,,,
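
The four placeholder filters could also be expressed with a single pattern. The following is a sketch of that idea on made-up rows, assuming the same four substrings are the only placeholders to remove; the names `PLACEHOLDER_PATTERN` and `drop_placeholder_searches` are hypothetical.

```python
import pandas as pd

# Substrings that mark placeholder searches (Danish and English variants).
PLACEHOLDER_PATTERN = r"okation|ocation|zoner|zones"

def drop_placeholder_searches(df):
    """Remove rows whose SearchStart or SearchEnd contains a placeholder substring."""
    start_hit = df["SearchStart"].str.contains(PLACEHOLDER_PATTERN, na=False)
    end_hit = df["SearchEnd"].str.contains(PLACEHOLDER_PATTERN, na=False)
    return df[~(start_hit | end_hit)]

# Made-up sample values for illustration.
sample = pd.DataFrame({
    "SearchStart": ["Min Lokation (02)", "Ryparken St.", "My Location", None],
    "SearchEnd":   ["Måløv St. (53)", "Nordhavn St.", "Lyngby St.", "2 zoner"],
})
cleaned = drop_placeholder_searches(sample)
print(cleaned)
```

Building both masks on the same frame avoids the chain of intermediate DataFrames, which can matter for memory when each intermediate holds millions of rows.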


## Testing to see whether our filtering worked

Since we are handling a very large amount of data, it is difficult to skim through it to see whether the result is as intended. The tests below check whether rows that should have been filtered out are still present in our data.

In [16]:
# Test 1: does SearchEnd still contain 'lokation' or 'location'?
lokation_count = cph_data[cph_data['SearchEnd'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entries in 'SearchEnd' : {lokation_count['SearchEnd']}")

location_count = cph_data[cph_data['SearchEnd'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entries in 'SearchEnd' : {location_count['SearchEnd']}")

# Test 2: does SearchStart still contain 'lokation' or 'location'?
lokation_count_s = cph_data[cph_data['SearchStart'].str.contains("okation", na=False)].count()
print(f"Amount of 'Lokation' entries in 'SearchStart' : {lokation_count_s['SearchStart']}")

location_count_s = cph_data[cph_data['SearchStart'].str.contains("ocation", na=False)].count()
print(f"Amount of 'Location' entries in 'SearchStart' : {location_count_s['SearchStart']}")


# Test 3: does SearchEnd still contain 'zones' or 'zoner'?
zones_count = cph_data[cph_data['SearchEnd'].str.contains("zones", na=False)].count()
print(f"Amount of 'zones' entries in 'SearchEnd' : {zones_count['SearchEnd']}")

zones_count_r = cph_data[cph_data['SearchEnd'].str.contains("zoner", na=False)].count()
print(f"Amount of 'zoner' entries in 'SearchEnd' : {zones_count_r['SearchEnd']}")

# Test 4: is any row null in 3 or more of StartStop, EndStop, SearchStart and SearchEnd?
num_nulls = cph_data[['StartStop', 'EndStop', 'SearchStart', 'SearchEnd']].isna().sum(axis=1)
b = (num_nulls >= 3).any()
print(f"Does the data contain a row in which 3 or more of StartStop, EndStop, SearchStart or SearchEnd are null: {b}")

Amount of 'Lokation' entries in 'SearchEnd' : 0
Amount of 'Location' entries in 'SearchEnd' : 0
Amount of 'Lokation' entries in 'SearchStart' : 0
Amount of 'Location' entries in 'SearchStart' : 0
Amount of 'zones' entries in 'SearchEnd' : 0
Amount of 'zoner' entries in 'SearchEnd' : 0
Does the data contain a row in which 3 or more of StartStop, EndStop, SearchStart or SearchEnd are null: False


1           2
22          2
61          2
69          2
92          2
           ..
43345884    2
43345901    2
43345902    2
43345916    2
43345940    2
Length: 3523992, dtype: int64
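
Printed counts have to be read by eye. As a possible complement, the same checks can be written as assertions that fail loudly. This is a minimal sketch on made-up rows; the helper `check_filtered` is hypothetical and simply loops over the same substrings and columns as the filters.

```python
import pandas as pd

def check_filtered(df):
    """Raise AssertionError if the filtered journeys still contain unwanted rows."""
    for column in ("SearchStart", "SearchEnd"):
        for fragment in ("okation", "ocation", "zoner", "zones"):
            # Plain substring search, treating missing values as "no match".
            hits = df[column].str.contains(fragment, na=False, regex=False).sum()
            assert hits == 0, f"{hits} rows with '{fragment}' in {column}"
    nulls = df[["StartStop", "EndStop", "SearchStart", "SearchEnd"]].isna().sum(axis=1)
    assert (nulls < 3).all(), "row with three or more empty start/end fields"

# Made-up rows: the first frame is clean, the second still has a placeholder.
good = pd.DataFrame({
    "StartStop": [None], "EndStop": [None],
    "SearchStart": ["Ryparken St."], "SearchEnd": ["Nordhavn St."],
})
bad = good.assign(SearchStart=["Min Lokation (02)"])

check_filtered(good)  # passes silently
try:
    check_filtered(bad)
except AssertionError as error:
    print(f"Check failed: {error}")
```

Unlike the printed tests, the loop also covers the zone fragments in `SearchStart`.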