# Welcome to the Notebook for Monha's and Bemi's Bachelor Project

## Content

In this notebook we will:

1. Aggregate our data into usable travel sequences containing only the relevant data
2. Analyse the appropriate data
3. Create an embedding space using Word2Vec

The notebook is structured as alternating cells:
1. A Markdown cell that describes the intention of the code that follows and, where relevant, explains its results
2. A code cell containing the code

# Initial Setup

Install the required libraries with pip before running the code below.

In [1]:
%pip install pandas # Pandas for data handling
%pip install numpy  # Maths stuff

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/a5/78/1d859bfb619c067e3353ed079248ae9532c105c4e018fa9a776d04b34572/pandas-2.2.1-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading pandas-2.2.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting numpy<2,>=1.23.2 (from pandas)
  Obtaining dependency information for numpy<2,>=1.23.2 from https://files.pythonhosted.org/packages/1a/2e/151484f49fd03944c4a3ad9c418ed193cfd02724e138ac8a9505d056c582/numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.8/114.8 kB 274.2 kB/s eta 0:00:00
Collecting pytz>=2020.1 (from pandas)
  Obtaining dependency information for pytz>=2020.1 from https://files.pythonhosted.org/packages/9c/3d/a121f284241f08268b21359bd425f7d48

In [1]:
import pandas as pd
import numpy as np

# Data import

TODO:
- Explain the data we use
    - introduction to the DB
    - SJ / RP
    - Journey - Tickets - Orders
- SQL explained and reasoned
- Explain the modification of Lat and Long
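
For the latitude/longitude modification, a minimal sketch: the sample rows in this notebook store coordinates as large integers (for example `55570658` and `11977767`), which we assume here to be degrees multiplied by 1,000,000. The helper name `to_degrees` is hypothetical and only illustrates the scaling.

```python
import math

MICRO = 1_000_000  # assumed scale factor: integer value = degrees * 1,000,000

def to_degrees(raw: int) -> float:
    """Convert an integer coordinate to decimal degrees (hypothetical helper)."""
    return raw / MICRO

lat = to_degrees(55570658)
lon = to_degrees(11977767)
print(lat, lon)
```

In pandas the same scaling would be a column-wise division, as in the commented-out lines of the import cell.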

---
```SQL
SELECT J.Id as JId, J.CreatedOn, J.SearchStart, J.SearchEnd, J.StartStop, J.EndStop,
    J.StartZone, J.Endzone, J.internalStartZones, J.internalValidZones, 
    SJWaypoints._id as SJId, SJWaypoints.Id,  SJWaypoints.Name, SJWaypoints.Latitude, 
    SJWaypoints.Longitude, SJWaypoints.[Type], SJWaypoints.SJSearchJourney_Id
FROM Journeys J
    JOIN Tickets ON J.Id = Tickets.Journey_Id
    JOIN Orders ON Orders.Id = Tickets.OrderId
    JOIN SJSearchJourneys SJ ON SJ.Id = Orders.JourneyClasses_Id
    JOIN SJWaypoints ON SJWaypoints.SJSearchJourney_Id = SJ.Id
WHERE J.CreatedOn BETWEEN '2022-12-01 00:00:00' and '2023-01-01 00:00:00'
```




In [4]:
data_måned_SJ = pd.read_csv('../Data/JTOSJW_dec_jan_2022.csv')

# data = pd.read_csv('../Data/11mil_large.csv', nrows=8874470)
# data = pd.read_csv('../Data/JTOSJW_dec_jan_2022.csv')
temp1 = pd.read_csv('../Data/Journeys_SearchStartEnd_Dec.csv')
temp2 = pd.read_csv('../Data/Journeys_StartEndStop_Dec.csv')
data_måned_Journeys = pd.concat([temp1,temp2])

data_full_Journeys = pd.read_csv('../Data/ALLJourneysWithValues.csv')

data_måned_SJ = data_måned_SJ.rename(columns={'Endzone':'EndZone'})

# data['Latitude'] = data['Latitude'] / 1000000
# data['Longitude'] = data['Longitude'] /1000000

# data

# Analysis of the data

Since our data consists only of a single ticket type (the DOT ticket), we wish to analyse the entirety of our data. The following questions serve as our starting point:
1. Who actually uses this ticket system?
2. How representative is our data in the context of CPH?
3. ...

# Prepare data for Word2Vec

We now wish to transform our data into journey sequences, which we can in turn use to train a Word2Vec model.

For this to work as intended, we wish to transform our data such that:
1. Only journeys WITHIN CPH are present
2. Each data entry consists of the sequence of stops for a given journey
    - The stops should consist of Start, End and Transitional stops
    - We wish to make use of a dictionary 
        - REASONING FOR DICT
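
As a rough illustration of the target shape, the sketch below groups waypoint rows into one stop sequence per journey using a plain dictionary. It uses plain Python rather than pandas so it can be run on its own. The rows are made up, the column names `JId` and `Name` follow the query above, and the helper `build_sequences` is hypothetical. It also assumes the rows already arrive in travel order, which the source data may not guarantee.

```python
from collections import defaultdict

# Made-up rows for illustration; real rows would come from the SJWaypoints join.
rows = [
    {"JId": "j1", "Name": "Stop A"},
    {"JId": "j1", "Name": "Stop B"},
    {"JId": "j2", "Name": "Stop C"},
    {"JId": "j1", "Name": "Stop D"},
    {"JId": "j2", "Name": "Stop A"},
]

def build_sequences(rows):
    """Map each journey id to its list of stop names, in row order."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row["JId"]].append(row["Name"])
    return dict(sequences)

sequences = build_sequences(rows)
# Word2Vec-style training takes a list of token lists ("sentences").
sentences = list(sequences.values())
print(sentences)
```

Each inner list then plays the role of a sentence, and each stop name the role of a word.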


## Copenhagen filter

To filter our data, three checks are needed to be certain that a journey lies within CPH. A journey is within CPH if it only makes use of zones 1 through 4 (coded as 1001 to 1004 in the data):
1. Check that *StartZone* and *EndZone* are between 1 and 4
2. Check that *internalStartZones* only contains zones within CPH
3. Check that *internalValidZones* only contains zones within CPH
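
The zone-list checks can be tried in isolation with the standard `re` module. This is a sketch under the assumption that a zone list is a comma-separated string such as `'1001,1003'`; the helper name `only_cph_zones` is hypothetical.

```python
import re

# One or more of the zones 1001-1004, separated by commas (assumed format).
CPH_ZONE_LIST = re.compile(r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$')

def only_cph_zones(zones: str) -> bool:
    """Return True if every zone in the comma-separated string is 1001-1004."""
    return CPH_ZONE_LIST.match(zones) is not None

for value in ['1001', '1001,1003', '1001,1005', '1001 1002', '1001100210031004']:
    print(repr(value), only_cph_zones(value))
```

A value without separators, such as `1001100210031004`, does not match this pattern, so it is worth checking how the column is actually formatted before relying on the check.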

In [13]:
def get_conditions(data):

    condition_1 = (
        (data['StartZone'].between(1001, 1004) | pd.isna(data['StartZone']))
        & 
        (data['EndZone'].between(1001, 1004) | pd.isna(data['EndZone']))
        )

    # THIS CONDITION MIGHT NOT BE RELEVANT DUE TO THE TICKETING SYSTEM OF DOT-BILLET
    # SINCE WHEN YOU BUY A TICKET IT IS USUALLY VALID IN ~2+ ZONES FROM WHERE YOU START. THIS MEANS
    # IF YOU START IN ZONE 2, THEN ZONE 3x, 1 AND 3 IS ALL A 'VALID ZONE' - THAT IS, THE START ZONE'S ADJACENT ZONES
    # WE ARE NOT REALLY INTERESTED IN WHAT ZONES A TICKET IS VALID IN, AS LONG AS WE CAN BE SURE, THAT THE JOURNEY
    # ONLY TOOK PLACE INSIDE OF COPENHAGEN -> ZONE 1 TO 4.
    condition_2 = (
        (data['internalValidZones'].str.match(r'^(1001|1002|1003|1004)(,(1001|1002|1003|1004))*$')
        | # or
        pd.isna(data['internalValidZones']))
        )

    condition_3 = (
        (data['internalStartZones'].str.match(r'^(1001|1002|1003|1004)$'))
        | # or
        pd.isna(data['internalStartZones'])
        )
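    # condition_2 is deliberately left out (see the comment above it):
    # only StartZone/EndZone and internalStartZones are checked.
    return condition_1 & condition_3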

cph_data_SJ = data_måned_SJ[get_conditions(data_måned_SJ)]
cph_data_Journeys = data_måned_Journeys[get_conditions(data_måned_Journeys)]

# grp = cph_data.groupby('SJSearchJourney_Id').agg(list)

# grp.count()

# grp.count()

cph_data_SJ

Unnamed: 0,JId,CreatedOn,SearchStart,SearchEnd,StartStop,EndStop,StartZone,EndZone,internalStartZones,internalValidZones,SJId,Id,Name,Latitude,Longitude,Type,SJSearchJourney_Id
7517,43b9c96c-d4c5-4a23-9f06-5e686f84214f,2022-12-12 06.47.12.1350283,Øm (Hovedvejen) (96),København H (01),,,1001,1001,1001,1001100210031004,23188909-b50a-48e1-a4ae-6406e30d75da,4307.0,Lavringemose (Hovedvejen),55570658,11977767,waypoint,a5dcd93f-d9d4-4ddb-bdf2-06153e6dc089
7518,43b9c96c-d4c5-4a23-9f06-5e686f84214f,2022-12-12 06.47.12.1350283,Øm (Hovedvejen) (96),København H (01),,,1001,1001,1001,1001100210031004,a3a7cbbb-480e-43c3-b4cf-60887ef0eedf,,Viby Sjælland St.,55549605,12024493,waypoint,a5dcd93f-d9d4-4ddb-bdf2-06153e6dc089
7519,43b9c96c-d4c5-4a23-9f06-5e686f84214f,2022-12-12 06.47.12.1350283,Øm (Hovedvejen) (96),København H (01),,,1001,1001,1001,1001100210031004,5aa1f710-8ce8-4818-b4e3-4fa9a0d5c21f,9711.0,Engvej (Assendløsevejen),55557498,12012555,waypoint,a5dcd93f-d9d4-4ddb-bdf2-06153e6dc089
7520,43b9c96c-d4c5-4a23-9f06-5e686f84214f,2022-12-12 06.47.12.1350283,Øm (Hovedvejen) (96),København H (01),,,1001,1001,1001,1001100210031004,a9fcb955-eb08-4768-9c7d-5d945c2402f3,9712.0,"Dåstrup, Bueager (Assendløsevejen)",55555116,12014533,waypoint,a5dcd93f-d9d4-4ddb-bdf2-06153e6dc089
7521,43b9c96c-d4c5-4a23-9f06-5e686f84214f,2022-12-12 06.47.12.1350283,Øm (Hovedvejen) (96),København H (01),,,1001,1001,1001,1001100210031004,7f612499-a8ba-4eb1-bb03-5762fd95c4cf,9709.0,Osted Friskole (Assendløsevejen),55564590,11971267,waypoint,a5dcd93f-d9d4-4ddb-bdf2-06153e6dc089
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
491125,840b24d5-eedc-4ad5-b780-daf728745aae,2022-12-02 08.45.52.0086459,Lejre St. (togbus) (96),København H (01),,,1001,1001,1001,1001100210031004,5ef274de-4351-4bea-97e2-5d7ac39c60df,8600622.0,Glostrup St.,55663148,12397824,waypoint,33ab3e8a-0a7f-4930-bbe7-adb070f4b8e6
491126,840b24d5-eedc-4ad5-b780-daf728745aae,2022-12-02 08.45.52.0086459,Lejre St. (togbus) (96),København H (01),,,1001,1001,1001,1001100210031004,f8026f86-8925-4a38-a94c-50653c3508cc,,til fods,55604709,11971492,WALK,33ab3e8a-0a7f-4930-bbe7-adb070f4b8e6
491127,840b24d5-eedc-4ad5-b780-daf728745aae,2022-12-02 08.45.52.0086459,Lejre St. (togbus) (96),København H (01),,,1001,1001,1001,1001100210031004,5c6cc241-b94e-46d2-9c48-77216f7d923d,,Lejre St.,55604808,11971699,waypoint,33ab3e8a-0a7f-4930-bbe7-adb070f4b8e6
491128,840b24d5-eedc-4ad5-b780-daf728745aae,2022-12-02 08.45.52.0086459,Lejre St. (togbus) (96),København H (01),,,1001,1001,1001,1001100210031004,90bf7908-4376-434e-a16f-7b8a3dd1be92,8600619.0,Hedehusene St.,55648990,12197310,waypoint,33ab3e8a-0a7f-4930-bbe7-adb070f4b8e6


In [15]:
grp_2 = cph_data_SJ.groupby('JId').agg(list)
grp_2.count()

data_måned_SJ.groupby('JId').agg(list).count()

cph_data_Journeys.groupby('Id').agg(list).count()
# cph_data.count()

# data.count()





Type                  152
internalStartZones    152
StartZone             152
internalValidZones    152
StartStop             152
AmountOfZones         152
EndZone               152
EndStop               152
SearchStart           152
SearchEnd             152
ModifiedOn            152
CreatedOn             152
JourneyClasses_Id     152
TravelType            152
ExtraFrom             152
ExtraTo               152
dtype: int64

In [48]:
data_måned_Journeys[(data_måned_Journeys['SearchEnd'] == '2 zoner')] # 310
data_måned_Journeys[data_måned_Journeys['SearchStart'].str.contains("okation", na=False)] # 419
data_filter_1 = data_måned_Journeys[~data_måned_Journeys['SearchStart'].str.contains("okation", na=False) | ~data_måned_Journeys['SearchEnd'].str.contains("zone", na=False)] # 419
data_filter_2 = data_filter_1[(data_filter_1['SearchStart'] != data_filter_1['SearchEnd'])  | (data_filter_1['StartStop'] != data_filter_1['EndStop'])] # 419
data_filter_2






# data_måned_SJ[data_måned_SJ['SearchStart'].str.contains("okation", na=False) | data_måned_SJ['SearchEnd'].str.contains("zone", na=False)] # 419

Id                    465641
Type                  435952
internalStartZones    436646
StartZone             465641
internalValidZones    464641
StartStop              25823
AmountOfZones         465641
EndZone               465641
EndStop                25823
SearchStart           459273
SearchEnd             398442
ModifiedOn            465641
CreatedOn             465641
JourneyClasses_Id          0
TravelType                 0
ExtraFrom                  0
ExtraTo                    0
dtype: int64
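
When reading filters of the form `~a | ~b`, it may help to recall De Morgan's law: `~a | ~b` is the same as `~(a & b)`, so it drops only rows where both conditions hold, while `~(a | b)` drops rows where either holds. The sketch below shows the difference on made-up rows in plain Python; which of the two is intended for the journey data is not something the sketch decides.

```python
# Made-up (SearchStart, SearchEnd) pairs for illustration only.
rows = [
    ("Min lokation", "2 zoner"),      # both patterns match
    ("Min lokation", "København H"),  # only the start matches
    ("Nørreport", "2 zoner"),         # only the end matches
    ("Nørreport", "København H"),     # neither matches
]

def a(row):
    """Start contains 'okation'."""
    return "okation" in row[0]

def b(row):
    """End contains 'zone'."""
    return "zone" in row[1]

drop_if_both = [r for r in rows if (not a(r)) or (not b(r))]  # ~a | ~b
drop_if_either = [r for r in rows if not (a(r) or b(r))]      # ~(a | b)

print(len(drop_if_both), len(drop_if_either))
```

On these four rows the first form keeps three rows and the second keeps one, so the choice between them changes the size of the filtered data noticeably.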

# Code used for testing purposes

DISCLAIMER:

THE NOTEBOOK IS MEMORY BASED, SO CERTAIN ELEMENTS OF THE CODE BELOW MIGHT NOT WORK IF RUN AGAIN. LOOK AT THE COMMENTS NEXT TO THE PRINT STATEMENTS FOR THE RELEVANT OUTPUTS AT THE TIME.

In [20]:
# grp = data.groupby('SJSearchJourney_Id').agg(list)

def not_null(lst):
    for value in lst:
        if value is None or pd.isna(value):
            return False
    return True

# We wish to check whether the number of StartZones matches the number of EndZones:
print(grp['StartZone'].apply(not_null).sum()) # 31965
print(grp['EndZone'].apply(not_null).sum()) # 31965

# print(grp[(grp['StartZone'].apply(not_null) & grp['EndZone'].apply(not_null))]) # 31965 rows


def digit_above_1000(lst):
    # True only if every zone number in the list is at least 1000
    for elem in lst:
        if elem < 1000:
            return False
    return True

print(grp['StartZone'].apply(digit_above_1000).sum()) # 31965


# def not_contain_comma(lst: list[str]):
#     for elem in lst:
#         if ',' in elem:
#             return False
#     return True
# print(grp['internalStartZones'].apply(not_contain_comma).sum()) # 31965 - indicating that all internalStartZones only consists of a single zone


def not_contain_space(lst: list[str]):
    # True only if no entry in the list contains a space
    for elem in lst:
        if ' ' in elem:
            return False
    return True
print(grp['internalValidZones'].apply(not_contain_space).sum()) # 31965 - a full count means no internalValidZones entry contains a space

508320
508320
508320
508320
