# Hackacity 2024
## Mobility Index for Bus Stops Using Wi-Fi Data with POI Features
#### Team: Hackacitizens
#### Authors:
- *João Peixoto*
- *José Cunha*
- *Nuno Oliveira*
- *Pedro Miranda*
- *Sofia Malpique*

# 1 - Introduction


In this project, we aim to predict mobility patterns across Porto using open city data and machine learning. We are working with datasets including:

- **WiFi Accesses Data**: A proxy for human movement and activity in the city.
- **WiFi Hotspots**: Locations of WiFi access points.
- **STCP Bus Stops**: Locations and schedules of bus stops.
- **Google Places API**: Points of Interest (POIs) like schools, hospitals, and tourist spots.

The goal is to predict a "mobility index" for each bus stop at different time slices (e.g., 14:00 - 14:59) using geospatial features such as proximity to POIs. We use WiFi access data to estimate local mobility, which informs the predicted demand for bus services at different times.

## 1.1 - Approach

1. **Data Processing and Risk Mitigation**: We began by conducting a **risk assessment** on the provided data to address privacy concerns. Sensitive information was **anonymized** to ensure compliance with data protection standards.
  
2. **Data Ingestion and Integration**: Raw data from WiFi accesses and external sources like Porto Digital and Google Places API were ingested. These sources enriched our model with geospatial data, enabling it to predict the mobility index for each bus stop.

3. **Feature Engineering**: We computed key geospatial features, such as the distance to the nearest hospital or the number of schools within a 1 km radius, to assess local mobility conditions. These features power our **AI model** that predicts the mobility index for each bus stop.

4. **Outlier Detection**: We compare the predicted mobility index with actual bus availability to identify:
   - **Underestimated Mobility**: High predicted mobility with insufficient buses.
   - **Overestimated Mobility**: Low predicted mobility with excessive buses.

By cross-referencing predicted mobility with actual bus schedules, we aim to identify inefficiencies and help optimize bus allocation.

## 1.2 - Expected Outcome

This analysis will uncover patterns in mobility and help improve bus service allocation, ensuring better public transport efficiency in Porto.

#### Dependencies

In [33]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Prevent line wrapping

# 2 - Possible threat analysis before anonymization

## 2.1 - PIA Assessment

### 2.1.1 - WiFi Hotspots

| *Threat*               | *Description*                                                                                                                                                                            | *Privacy Risks*                                                                                                                                       | *Mitigation Measures*                                                                                         | *Risk Severity* | *Regulatory Impact*                             |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------|--------------------------------------------------|
| *AP MAC Address Exposure* | Exposure of access point MAC addresses allows attackers to identify and target specific devices or mimic infrastructure.                                                                  | Spoofing Attacks, exploitation of known vulnerabilities from the manufacturer, Network Mapping, Credential-Based Attacks.                                                                             | - Hash or mask MAC addresses.<br>- Regularly update firmware to patch vulnerabilities.<br>- Encrypt MAC data.  | Medium            | Potential non-compliance with GDPR for exposing device identifiers. |
| *Physical Access*       | Detailed device locations expose devices to physical tampering or theft.                                                                                                                | Compromise of physical security, unauthorized network access.                                                                                             | - Use tamper-proof casings.<br>- Restrict physical access to authorized personnel.<br>- Monitor access with surveillance or audits.                                | Low               | Limited regulatory implications, but operational risks. |

### 2.1.2 - WiFi Access

| *Threat*               | *Description*                                                                                                                                                                            | *Privacy Risks*                                                                                                                                       | *Mitigation Measures*                                                                                         | *Risk Severity* | *Regulatory Impact*                             |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------|--------------------------------------------------|
| *User Tracking*         | The callingstationid (the user's MAC address) enables long-term tracking when combined with time and location.                                                                            | Profiling, identification, loss of privacy, and spear-phishing attacks.                                                                                   | - Rotate callingstationid hashes periodically.<br>- Minimize retention of detailed logs.<br>- Apply differential privacy to anonymize patterns.                       | High              | Violation of GDPR Article 5 (data minimization). |
| *Upload Size Analysis*  | Attackers correlate upload size and social media activity to infer individual users' activities.                                                                                        | Breach of anonymity, targeted profiling, and social engineering attacks.                                                                                  | - Use k-anonymization to generalize the upload value.<br>- Aggregate upload/download metrics to obscure individual users.<br>- Limit granularity of session logs.         | Medium            | Breach of anonymity could trigger GDPR penalties. |
| *Session Time Correlation* | Attackers cross-reference session times with external activity logs, inferring users’ identities.                                                                                      | User identification and profiling, leading to targeted attacks.                                                                                           | - Use k-anonymization to generalize the session time value.<br>- Aggregate session data.                          | Medium            | Violation of GDPR Article 32 (security of processing). |
| *Session Start Time* | Attackers cross-reference session start times with what they observe in the access point zones and begin to find patterns in a user's behaviour.                                                                                      | User identification and profiling, leading to targeted attacks.                                                                                           | - Use k-anonymization to generalize the session start time value.<br>- Aggregate session start time data.                          | Medium            | Violation of GDPR Article 32 (security of processing). |


**Note:** *In future work, we would collect data to build a fair model for risk security. Such a model aims to provide a balanced and equitable approach to managing security risks, ensuring that the necessary measures are taken to protect sensitive information while considering both the potential impact of each risk and the feasibility of its mitigation. The key objective is a security framework that is fair to all stakeholders (users, organizations, and regulatory bodies), addressing their concerns in an efficient, transparent, and legally compliant way. At the moment this cannot be implemented because of the lack of data; with more information, we would implement it and quantify the risk accordingly.*

# 3 - Anonymization process

To ensure the privacy of user data while maintaining its utility, our team employed a multi-step anonymization process. The first step involved applying a cryptographic hash function to both user and access point MAC addresses, ensuring that sensitive identifiers were securely transformed into non-reversible values. This process helps mitigate the risk of exposing personal information.

Next, we applied K-anonymization using the Mondrian algorithm, which is a well-known technique for ensuring data privacy. The Mondrian algorithm partitions the dataset into groups such that each group contains at least K records with indistinguishable attributes, thereby protecting individual privacy. We experimented with various values of K to evaluate the trade-off between data utility and privacy. After conducting several iterations, we determined that K=60 provided the best balance between minimizing information loss and maximizing the utility of the anonymized data.

In addition to K-anonymization, we applied differential privacy to further enhance privacy protection. Specifically, we set the epsilon value to 0.5, which provides a strong privacy guarantee by adding noise to the data in a way that makes it difficult to identify individual records while still allowing meaningful aggregate analysis. The specifics of our differential privacy implementation are detailed in a section below.

This approach ensured that the final dataset retained valuable insights while adhering to privacy standards.

![mondrian](mondrian.jpeg)

First step - Hashing

In [47]:
df = pd.read_csv('raw_hashed.csv', low_memory=False)
df.drop(columns=['Unnamed: 0 '], inplace=True)
df.head()

Unnamed: 0,acctsessionid,acctstarttime,upload,download,acctsessiontime,calledstation_ssid,calledstationid_hashed,callingstationid_hashed
0,fbfe95a31b975faa4b79968f95d834f2204ff050c4cb17...,2024-09-18 14:00:00,0,0,0.0,eduroam,"$argon2id$v=19$m=102400,t=3,p=2$t3SLWvkKnrmXsi...","$argon2id$v=19$m=102400,t=3,p=2$CKDJTgHHSrJ4AN..."
1,7330b1f18b414f72c4895700952fb9f2f5aa4cd3c15369...,2024-09-18 13:00:00,0,0,0.0,eduroam,"$argon2id$v=19$m=102400,t=3,p=2$EaNncYpRo9ogFR...","$argon2id$v=19$m=102400,t=3,p=2$QZfpbCX8xkPFmQ..."
2,8efbb33c1a4b95f434e75cde462caea6dfdf964631744f...,2024-09-12 07:00:00,0,0,6.0,Porto. Free Wi-Fi,"$argon2id$v=19$m=102400,t=3,p=2$IYftQ9pLLk/f+0...","$argon2id$v=19$m=102400,t=3,p=2$GEo1IbHKwnhv1W..."
3,429785416485c487d271060e329442b10e1b965621dc17...,2024-09-12 01:00:00,2,30,32.0,eduroam,"$argon2id$v=19$m=102400,t=3,p=2$UIIYZ1S/FeYS1u...","$argon2id$v=19$m=102400,t=3,p=2$NzcwzqXadwR2xC..."
4,147932564d4a91bdf901570a3f379158e0efe674ac55d4...,2024-09-12 02:00:00,0,1,6.0,eduroam,"$argon2id$v=19$m=102400,t=3,p=2$ZJkNwULpOekGia...","$argon2id$v=19$m=102400,t=3,p=2$i6dkiD6EmnkFHP..."
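
The sample output above shows Argon2id-encoded values. As an illustrative alternative only (not the team's implementation), the sketch below swaps in a keyed hash, HMAC-SHA-256 from the Python standard library, which yields a deterministic pseudonym per device as long as the key stays secret. The helper name `pseudonymize_mac`, the key handling, and the MAC normalization are assumptions made for this example.

```python
import hashlib
import hmac


def pseudonymize_mac(mac: str, secret_key: bytes) -> str:
    """Return a deterministic keyed pseudonym for a MAC address (hypothetical helper)."""
    # Normalize so 'A0-23-9F-CA-87-C0' and 'a0:23:9f:ca:87:c0' map to the same token.
    normalized = mac.strip().lower().replace("-", "").replace(":", "")
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the key is generated once and stored outside the published dataset.
key = b"replace-with-a-secret-key"
token_a = pseudonymize_mac("A0-23-9F-CA-87-C0", key)
token_b = pseudonymize_mac("a0:23:9f:ca:87:c0", key)
token_other = pseudonymize_mac("a4-bd-c4-d0-61-00", key)
```

Without the key, an attacker cannot simply hash every possible MAC address and compare the results, which is the main weakness of an unkeyed hash over a small identifier space.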


Second step - K-anonymization with Mondrian

In [53]:
df = pd.read_csv('raw_hashed_and_k_anonymized.csv', low_memory=False, sep=';').sample(200)
df.drop(columns=['Unnamed: 0 '], inplace=True)
df.head()

Unnamed: 0,acctsessionid,acctstarttime,upload,download,acctsessiontime,calledstation_ssid,calledstationid_hashed,callingstationid_hashed
252,c6951e5ec411be7b316385831dee69fb84ec3d070807a3...,2024-09-18 10:00:00,0~1,0,0.0,eduroam,"$argon2id$v=19$m=102400,t=3,p=2$+oSnVHr+KGMFRR...","$argon2id$v=19$m=102400,t=3,p=2$kZJnQ7TMIrlWyU..."
1343,392d2eb84cc096cb30dcbd79bbc6229afa00716d0995a2...,2024-09-12 08:00:00,0~1922,0,122722.0 ~ 33.0 ~ 2...,Porto. Free Wi-Fi,"$argon2id$v=19$m=102400,t=3,p=2$MECkDfT58yeRRo...","$argon2id$v=19$m=102400,t=3,p=2$xuHeliI208XjiE..."
165,1e75c375c34971258fbdcfc01a5f77dad95ac1f7008c72...,2024-09-12 08:00:00,0~1,0,0.0,Porto. Free Wi-Fi,"$argon2id$v=19$m=102400,t=3,p=2$/kvjuqLa3Cxfoy...","$argon2id$v=19$m=102400,t=3,p=2$1p5inLXulTvqH5..."
151,fa0bd8277286e6545cdfa53da5a433c0387f9ed03bfa8b...,2024-09-18 10:00:00,0~1,0,0.0,Porto. Free Wi-Fi,"$argon2id$v=19$m=102400,t=3,p=2$0nQ4FwSFJ2L8eI...","$argon2id$v=19$m=102400,t=3,p=2$ArqugrvpVt7aEG..."
560,6195ac5bb4169d81b7dc92578147f1449c0b7d56226d24...,2024-09-12 06:00:00,0~24,0,1.0 ~ 122690.0,Porto. Free Wi-Fi,"$argon2id$v=19$m=102400,t=3,p=2$e0DhAqAeCZaRGz...","$argon2id$v=19$m=102400,t=3,p=2$D5wchBMhrW4WEy..."
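
The partitioning idea behind Mondrian can be sketched in a few lines. This is a simplified illustration, not the code used to produce the table above: it works on numeric tuples only, picks the dimension with the widest raw range (the original paper normalizes ranges), and splits at the median as long as both halves keep at least `k` records. The `lo~hi` output format mirrors the generalized values shown above; the function names and sample rows are assumptions.

```python
from typing import List, Tuple

Record = Tuple[float, ...]


def mondrian(records: List[Record], k: int) -> List[List[Record]]:
    """Recursively split records into partitions of at least k records each."""
    if len(records) < 2 * k:
        return [records]  # too small to split into two valid groups
    n_dims = len(records[0])

    def span(d: int) -> float:
        return max(r[d] for r in records) - min(r[d] for r in records)

    # Try the widest dimension first, then fall back to narrower ones.
    for d in sorted(range(n_dims), key=span, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut on any dimension


def generalize(partition: List[Record]) -> Tuple[str, ...]:
    """Replace each attribute by its 'lo~hi' range within the partition."""
    out = []
    for d in range(len(partition[0])):
        lo = min(r[d] for r in partition)
        hi = max(r[d] for r in partition)
        out.append(str(lo) if lo == hi else f"{lo}~{hi}")
    return tuple(out)


# Hypothetical (upload, acctsessiontime) pairs.
rows = [(0, 0), (0, 6), (1, 6), (2, 32), (24, 1), (1922, 33), (0, 0), (1, 0)]
parts = mondrian(rows, k=2)
generalized = [generalize(p) for p in parts]
```

Larger values of `k` stop the recursion earlier, which produces wider ranges; that is the utility versus privacy trade-off discussed in Section 3.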


# 4 - Possible threat analysis after anonymization

## 4.1 - Access Point
AP MAC Address Exposure: Exposure of access point MAC addresses allows attackers to identify and target specific devices or mimic infrastructure. After hashing the MAC addresses with a strong algorithm, the potential of this threat is reduced to almost none. However, a residual risk remains because the same hashing scheme is applied to every MAC address: if it is broken for one record, every other record could theoretically be compromised as well, though this is highly unlikely.

Physical Access: Detailed device locations expose devices to physical tampering or theft. There is no change in risk before and after the solution, meaning that the threat remains constant.

## 4.2 - User Privacy Concerns
User Tracking: Callingstationid and the user’s MAC address enable long-term tracking when combined with time and location. The risk here is re-identification, where even if the callingstationid is hashed or anonymized periodically, there is still a possibility for a determined adversary to correlate data across time. This could happen if logs are retained too long or if data is merged with other datasets. Such exposure increases the chance that an attacker could uncover the identity of users. To mitigate this risk, more aggressive anonymization techniques, like k-anonymity or l-diversity, should be implemented to prevent re-identification from log data.

Upload Size Analysis: Attackers can correlate upload size with social media activity to infer individual user activities. A way to combat this is through Differential Privacy, which adds controlled "noise" (random variation) to data to obscure the identity of individual entries. This technique ensures that the removal or addition of a single data point does not significantly affect the overall analysis, making it harder to isolate specific users. For example, instead of reporting precise upload sizes, the data could be altered with noise, making it more difficult for an attacker to deduce user behavior based solely on the upload size.

Session Time Correlation: Attackers can cross-reference session times with external activity logs to infer the identities of users. This could lead to user identification and profiling, increasing the risk of targeted attacks. A mitigation strategy is to use k-anonymization, which generalizes session time values, or to aggregate session data to obscure individual details.

Session Start Time: Similar to session time correlation, attackers can cross-reference session start times with what they see in the access point zones, identifying patterns in user behavior. This could again lead to user identification and profiling, potentially resulting in targeted attacks. The recommended mitigation involves using k-anonymization to generalize session start time data or aggregating these values to protect user privacy.

# 5 - Data potential and analysis

## 5.1 - Data Transformation

### 5.1.1 - Wifi Access (Aggregated)

We used the raw data provided by Porto Digital to calculate the number of Wi-Fi accesses per hotspot on an hourly basis.


In [34]:
df = pd.read_csv('../datamesh/c_features/datasets/porto_wifi_access_per_hotspot_hourly.csv', low_memory=False).sort_values(by='number_of_sessions_per_hour', ascending=False)
df.head()

Unnamed: 0,calledstationid,date,hour_slice,number_of_sessions_per_hour
157628,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,15:00–15:59,597
149322,4cad16c07e8c4d222c35d87e0e91ca7f748373f1733a74...,2024-09-30,15:00–15:59,584
157627,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,14:00–14:59,578
226749,6f2ed2e71e167341cbbce4447d8a9065f8370294352f18...,2024-07-05,20:00–20:59,570
157626,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,13:00–13:59,567
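
As a rough sketch of the aggregation described above, the snippet below counts sessions per hotspot, date, and hour slice using only the standard library. The station identifiers are made up, and the timestamp format is assumed to match the `acctstarttime` column shown earlier.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def hour_slice(ts: datetime) -> str:
    """Format the hour of a timestamp as e.g. '15:00–15:59'."""
    return f"{ts.hour:02d}:00–{ts.hour:02d}:59"


def sessions_per_hour(sessions: Iterable[Tuple[str, str]]) -> Counter:
    """Count sessions per (calledstationid, date, hour_slice)."""
    counts: Counter = Counter()
    for station_id, start_time in sessions:
        ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
        counts[(station_id, ts.date().isoformat(), hour_slice(ts))] += 1
    return counts


# Hypothetical sessions: (calledstationid, acctstarttime)
sample = [
    ("ap_1", "2024-09-30 15:05:00"),
    ("ap_1", "2024-09-30 15:40:00"),
    ("ap_1", "2024-09-30 14:59:59"),
    ("ap_2", "2024-09-30 15:10:00"),
]
hourly = sessions_per_hour(sample)
```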


To keep this analysis **privacy-preserving**, we applied differential privacy: noise is added to the data so that it remains useful for the analysis while individual records become harder to identify.

Based on the following categories, we picked an epsilon value of 0.5:

- Strict Privacy (e.g., 0.1 ≤ ϵ < 1)
- Moderate Privacy (e.g., 1 ≤ ϵ ≤ 3)
- Weaker Privacy (e.g., ϵ > 3)

In [35]:
df = pd.read_csv('../datamesh/c_features/datasets/porto_wifi_access_per_hotspot_hourly_dp.csv', low_memory=False).sort_values(by='number_of_sessions_per_hour', ascending=False)
df["number_of_sessions_per_hour"] = df["number_of_sessions_per_hour"].round()
df["number_of_sessions_per_hour"] = df["number_of_sessions_per_hour"].astype(int)
df.head()

Unnamed: 0,calledstationid,date,hour_slice,number_of_sessions_per_hour
157628,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,15:00–15:59,603
149322,4cad16c07e8c4d222c35d87e0e91ca7f748373f1733a74...,2024-09-30,15:00–15:59,596
157626,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,13:00–13:59,570
226749,6f2ed2e71e167341cbbce4447d8a9065f8370294352f18...,2024-07-05,20:00–20:59,568
157627,4f6dc1d796ba1733b53641bcee957694bc26bf1426f03d...,2024-09-30,14:00–14:59,562
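
One common way to obtain noisy counts like the ones above is the Laplace mechanism, sketched below. This is an illustration under stated assumptions rather than the exact implementation used here: it assumes a sensitivity of 1 (adding or removing one session changes an hourly count by at most 1), so with ϵ = 0.5 the noise scale is 1 / 0.5 = 2. The function names are hypothetical.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_count(count: float, epsilon: float, sensitivity: float = 1.0,
                    rng: Optional[random.Random] = None) -> float:
    """Add Laplace noise with scale sensitivity / epsilon to a count."""
    rng = rng or random.Random()
    return count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the example repeatable
noisy = [privatize_count(597, epsilon=0.5, rng=rng) for _ in range(20000)]
average = sum(noisy) / len(noisy)
mean_abs_error = sum(abs(v - 597) for v in noisy) / len(noisy)
```

With a scale of 2, the expected absolute error per count is about 2 sessions, which is small relative to counts in the hundreds but noticeable for hotspots with only a handful of sessions per hour.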


### 5.1.2 - Wifi Hotspots (Enriched with POI)

For each wifi hotspot, we enrich the data with the POI features from Google Maps API such as the distance to the nearest hospital or the number of schools within a 1 km radius.

In [36]:
df = pd.read_json('../datamesh/c_features/datasets/porto_wifi_hotspots_nearby_poi.json')
print(f"features: {df.columns[7:]}")
df.head()

features: Index(['number_of_nearby_schools', 'closest_school_km',
       'number_of_nearby_hospitals', 'closest_hospital_km',
       'number_of_nearby_universitys', 'closest_university_km',
       'number_of_nearby_public_offices', 'closest_public_office_km',
       'number_of_nearby_tourist_attractions', 'closest_tourist_attraction_km',
       'number_of_nearby_train_stations', 'closest_train_station_km',
       'number_of_nearby_subway_stations', 'closest_subway_station_km'],
      dtype='object')


Unnamed: 0,MAC_ADDRESS,lat,lon,address,Hotspot,Zone,Parish,number_of_nearby_schools,closest_school_km,number_of_nearby_hospitals,closest_hospital_km,number_of_nearby_universitys,closest_university_km,number_of_nearby_public_offices,closest_public_office_km,number_of_nearby_tourist_attractions,closest_tourist_attraction_km,number_of_nearby_train_stations,closest_train_station_km,number_of_nearby_subway_stations,closest_subway_station_km
0,48-8b-0a-7a-9a-60,41.146929,-8.632913,"Museu C. Elétrico, 4050 Porto, Portugal",Alameda Basílio Teles - Rua de Dom Pedro V,Massarelos,Lordelo do Ouro e Massarelos,5,0.59265,23,0.658964,20,0.456864,0,1.151058,7,0.46516,0,1.588172,0,1.046631
1,a0-23-9f-ca-87-c0,41.149482,-8.610083,"Av.aliados, 4000-125 Porto, Portugal",Aliados - Gabinete do Municipe,Trindade,"Cedofeita, Santo Ildefonso, Sé, Miragaia, São ...",41,0.234685,28,0.238527,28,0.40748,55,0.020486,81,0.057663,67,0.123567,59,0.116824
2,a4-bd-c4-d0-61-00,41.147993,-8.611216,"Av. dos Aliados 107, 4000-067 Porto, Portugal",Aliados - Rua Elísio de Melo,Santo Ildefonso,Cedofeita,36,0.251828,32,0.215924,27,0.295997,46,0.170341,101,0.04866,59,0.069103,49,0.069103
3,a4-bd-c4-d0-60-80,41.14908,-8.6106,"Av. dos Aliados 236, 4000-114 Porto, Portugal",Aliados (Fonte dos Aliados),Santo Ildefonso,Cedofeita,38,0.207945,34,0.288769,28,0.353824,48,0.042509,86,0.034239,63,0.062654,54,0.061666
4,a0-23-9f-94-a2-c0,41.177855,-8.59509,"Pav. Desportivo Luís Falcão, 4200-465 Porto, P...",Asprela - FEUP,Paranhos,Lamas,0,1.359096,8,0.327911,27,0.084433,2,0.059936,0,1.557766,0,1.956223,5,0.817976
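
The two kinds of POI features above (a count within a radius and the distance to the closest POI) can be computed with the haversine formula. The sketch below is a minimal version under assumptions: the POI coordinates are invented, the 1 km radius follows the text, and the closest distance is taken over whatever POI list is passed in.

```python
from math import asin, cos, radians, sin, sqrt
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def poi_features(lat: float, lon: float, pois: List[Tuple[float, float]],
                 radius_km: float = 1.0) -> Tuple[int, Optional[float]]:
    """Return (number of POIs within radius_km, distance to the closest POI)."""
    distances = [haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in pois]
    nearby = sum(1 for d in distances if d <= radius_km)
    closest = min(distances) if distances else None
    return nearby, closest


# Hotspot coordinates taken from the table above; POI coordinates are hypothetical.
schools = [(41.1500, -8.6100), (41.1600, -8.6100), (41.1450, -8.6110)]
number_of_nearby_schools, closest_school_km = poi_features(41.149482, -8.610083, schools)
```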


### 5.1.3 - Bus Stops (Aggregated)

We aggregate the data from the STCP bus stops to get the number of trips per hour. The goal, then, is to predict the expected mobility index for each time slice and compare it with the actual bus availability.

In [37]:
df = pd.read_csv('../datamesh/c_features/datasets/sctp_bus_stops_trips_hourly.csv').sort_values(by='distinct_trip_count', ascending=False)
df.head()

Unnamed: 0,stop_id,service_id,hourly_slice,stop_lat,stop_lon,stop_name,distinct_trip_count
30435,CMO,UTEIS,08:00-08:59,41.147228,-8.616959,CARMO,50
30436,CMO,UTEIS,09:00-09:59,41.147228,-8.616959,CARMO,49
30434,CMO,UTEIS,07:00-07:59,41.147228,-8.616959,CARMO,45
30444,CMO,UTEIS,17:00-17:59,41.147228,-8.616959,CARMO,45
30445,CMO,UTEIS,18:00-18:59,41.147228,-8.616959,CARMO,45



### 5.1.4 - Bus Stops (Enriched with POI)

For each bus stop, we enrich the data with the POI features from Google Maps API such as the distance to the nearest hospital or the number of schools within a 1 km radius (same features as the wifi hotspots).

In [38]:
df = pd.read_json('../datamesh/c_features/datasets/stcp_stops_nearby_poi.json')
# The first 5 columns are identifiers and coordinates; POI features start at index 5
print(f"features: {df.columns[5:]}")
df.head()

features: Index(['number_of_nearby_schools', 'closest_school_km',
       'number_of_nearby_hospitals', 'closest_hospital_km',
       'number_of_nearby_universitys', 'closest_university_km',
       'number_of_nearby_public_offices', 'closest_public_office_km',
       'number_of_nearby_tourist_attractions', 'closest_tourist_attraction_km',
       'number_of_nearby_train_stations', 'closest_train_station_km',
       'number_of_nearby_subway_stations', 'closest_subway_station_km'],
      dtype='object')


Unnamed: 0,stop_id,stop_name,lat,lon,neighborhood_name,number_of_nearby_schools,closest_school_km,number_of_nearby_hospitals,closest_hospital_km,number_of_nearby_universitys,closest_university_km,number_of_nearby_public_offices,closest_public_office_km,number_of_nearby_tourist_attractions,closest_tourist_attraction_km,number_of_nearby_train_stations,closest_train_station_km,number_of_nearby_subway_stations,closest_subway_station_km
0,AANT5,ALAMEDA DAS ANTAS,41.1632,-8.583404,Campanhã,11,1.112788,6,0.313641,0,1.548288,3,0.502326,0,1.983624,10,0.99217,8,0.297165
1,EDRG1,ESTÁDIO DO DRAGÃO,41.160278,-8.583806,Campanhã,12,0.963362,7,0.117084,0,1.574076,6,0.659874,0,1.800641,19,0.906405,10,0.129556
2,CRJ1,CORUJEIRA,41.159285,-8.580922,Campanhã,8,1.129038,7,0.382527,0,1.83725,6,0.919701,0,1.986427,17,0.811824,6,0.192766
3,PCRJ1,PR. DA CORUJEIRA,41.156919,-8.579865,Campanhã,8,1.022547,6,0.605787,0,1.906887,6,1.096368,0,2.007576,19,0.590985,8,0.463733
4,JFC,JUNTA FREG. CAMPANHÃ,41.158167,-8.578587,Campanhã,4,1.190231,5,0.612591,0,2.061983,5,1.142365,0,2.140804,17,0.763487,6,0.418287


This is a visualization of the bus stop heatmap and closest POIs.

![bus_stop_heatmap](bus_stop_heatmap.png)

## 5.2 - Modeling

### 5.2.1 - Training dataset

For model training, we merged the enriched Wi-Fi hotspot data (the geospatial feature set) with the aggregated Wi-Fi access data (the target: the Wi-Fi access count, used as a proxy for the mobility index). We used a LightGBM regression model.

Because the WiFi access dataset was pre-anonymized, we were advised to link the two datasets arbitrarily. However, we applied the **assumption that the WiFi hotspots with the highest number of POIs in their surroundings are the most likely to have the highest WiFi activity**.


In [39]:
df = pd.read_csv('../datamesh/c_features/datasets/mobility_regression_training.csv')
df.head()

Unnamed: 0,hour_slice,number_of_sessions_per_hour,number_of_nearby_schools,closest_school_km,number_of_nearby_hospitals,closest_hospital_km,number_of_nearby_universitys,closest_university_km,number_of_nearby_public_offices,closest_public_office_km,number_of_nearby_tourist_attractions,closest_tourist_attraction_km,number_of_nearby_train_stations,closest_train_station_km,number_of_nearby_subway_stations,closest_subway_station_km,total_pois,day_of_week,day_of_month
0,05:00–05:59,1.165159,24.0,0.483929,31.0,0.398109,26.0,0.33736,36.0,0.464208,108.0,0.137258,59.0,0.001894,46.0,0.001894,330.0,0,1
1,09:00–09:59,1.610552,24.0,0.483929,31.0,0.398109,26.0,0.33736,36.0,0.464208,108.0,0.137258,59.0,0.001894,46.0,0.001894,330.0,0,1
2,10:00–10:59,1.523478,24.0,0.483929,31.0,0.398109,26.0,0.33736,36.0,0.464208,108.0,0.137258,59.0,0.001894,46.0,0.001894,330.0,0,1
3,12:00–12:59,0.866448,24.0,0.483929,31.0,0.398109,26.0,0.33736,36.0,0.464208,108.0,0.137258,59.0,0.001894,46.0,0.001894,330.0,0,1
4,13:00–13:59,2.557742,24.0,0.483929,31.0,0.398109,26.0,0.33736,36.0,0.464208,108.0,0.137258,59.0,0.001894,46.0,0.001894,330.0,0,1
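
The linking assumption described above amounts to a rank-based match: the busiest anonymized access point is paired with the hotspot that has the most surrounding POIs, the second busiest with the second richest, and so on. The sketch below illustrates that idea; the function name, the identifiers, and the numbers are hypothetical, and the actual linking code may differ.

```python
from typing import Dict


def link_by_rank(sessions_per_hash: Dict[str, float],
                 pois_per_hotspot: Dict[str, float]) -> Dict[str, str]:
    """Pair anonymized access points with hotspots by descending rank."""
    busiest = sorted(sessions_per_hash, key=sessions_per_hash.get, reverse=True)
    richest = sorted(pois_per_hotspot, key=pois_per_hotspot.get, reverse=True)
    # zip stops at the shorter list, so unmatched entries are simply dropped.
    return dict(zip(busiest, richest))


# Hypothetical totals: sessions per hashed access point, POIs per known hotspot.
total_sessions = {"hash_a": 1200, "hash_b": 300, "hash_c": 4500}
total_pois = {"Aliados": 330, "Asprela": 42, "Massarelos": 55}
links = link_by_rank(total_sessions, total_pois)
```

This pairing is a heuristic: it makes the target correlate with `total_pois` by construction, which is worth keeping in mind when interpreting the model's feature importances.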


### 5.2.2 - Inference dataset

We predict the expected mobility index for each time slice of each bus stop. In this final table we can then compare the predicted mobility index with the actual bus availability to identify inefficiencies and help optimize bus allocation.

In [40]:
df = pd.read_csv('../datamesh/c_features/datasets/mobility_regression_inference.csv')
df.head()

Unnamed: 0,stop_id,neighborhood_name,date,day_of_week_category,hourly_slice,mobility_score,distinct_trip_count
0,1ADA1,Paranhos,2024-11-01,UTEIS,06:00-06:59,9,2
1,1ADA1,Paranhos,2024-11-01,UTEIS,07:00-07:59,10,4
2,1ADA1,Paranhos,2024-11-01,UTEIS,08:00-08:59,13,3
3,1ADA1,Paranhos,2024-11-01,UTEIS,09:00-09:59,16,3
4,1ADA1,Paranhos,2024-11-01,UTEIS,10:00-10:59,20,3


This is a visualization of the mobility index prediction for each bus stop on an hourly basis.

![mobility_index](mobility_index_regression.png)

This is a visualization of the mobility index prediction vs. the actual bus trips. It allows us to identify outliers: stops that, for a given time slice, have a high mobility index but few bus trips, and vice versa.

![mobility_index_vs_bus_trips](mobility_index_vs_bus_trips.png)
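
One simple way to turn this comparison into flags is to standardize both columns and look at the gap between them. The sketch below is an assumption about how such a rule could look, not the method behind the plot: the z-score approach, the threshold of 1.5, and the sample rows are all hypothetical, while the column names follow the inference table above.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def z_scores(values: List[float]) -> List[float]:
    """Standardize values; return zeros when there is no variation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def flag_outliers(rows: List[dict], threshold: float = 1.5) -> Dict[Tuple[str, str], str]:
    """Label each (stop_id, hourly_slice) by the gap between demand and supply."""
    mobility_z = z_scores([r["mobility_score"] for r in rows])
    trips_z = z_scores([r["distinct_trip_count"] for r in rows])
    flags = {}
    for row, m, t in zip(rows, mobility_z, trips_z):
        gap = m - t
        if gap > threshold:
            label = "underestimated"  # high predicted mobility, few trips
        elif gap < -threshold:
            label = "overestimated"   # low predicted mobility, many trips
        else:
            label = "ok"
        flags[(row["stop_id"], row["hourly_slice"])] = label
    return flags


# Hypothetical rows for a single time slice.
rows = [
    {"stop_id": "S1", "hourly_slice": "08:00-08:59", "mobility_score": 20, "distinct_trip_count": 3},
    {"stop_id": "S2", "hourly_slice": "08:00-08:59", "mobility_score": 10, "distinct_trip_count": 4},
    {"stop_id": "S3", "hourly_slice": "08:00-08:59", "mobility_score": 12, "distinct_trip_count": 5},
    {"stop_id": "S4", "hourly_slice": "08:00-08:59", "mobility_score": 40, "distinct_trip_count": 2},
    {"stop_id": "S5", "hourly_slice": "08:00-08:59", "mobility_score": 8, "distinct_trip_count": 20},
    {"stop_id": "S6", "hourly_slice": "08:00-08:59", "mobility_score": 15, "distinct_trip_count": 6},
]
flags = flag_outliers(rows)
```

The labels reuse the terms from Section 1.1: "underestimated" marks stops with high predicted mobility and comparatively few trips, and "overestimated" marks the opposite case.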

# 6 - Conclusions and Future Work


This project focused on predicting a "mobility index" for bus stops in Porto, leveraging Wi-Fi data along with geospatial features such as proximity to Points of Interest (POIs). The key conclusion is that Wi-Fi access data, which serves as a proxy for human movement, is valuable for estimating mobility and optimizing bus schedules. However, to improve the reliability of the mobility index, it is recommended to integrate additional data sources, particularly payments data (e.g., mobile payment usage at bus stations) and real-time traffic data. These would provide a more comprehensive picture of bus stop demand, reflecting both passenger activity and urban traffic conditions.

# 7 - References

- (**Mondrian K-Anonymity**) LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), 8-7695-2570-9/06, 1-12. IEEE.

- (**Differential Privacy**) Dwork, C. (2006). Differential privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006), 406–419. Springer. https://doi.org/10.1007/11787006_31

- (**LGBM for Urban Mobility**) Z. Li, J. Tang, T. Feng, B. Liu, J. Cao, T. Yu, and Y. Ji, "Investigating urban mobility through multi-source public transportation data: A multiplex network perspective," Applied Geography, vol. 169, 103337, 2024. [Online]. Available: https://doi.org/10.1016/j.apgeog.2024.103337 

- (**Mobility Index from WiFi Access Data**) S. H. Marakkalage et al., "WiFi Fingerprint Clustering for Urban Mobility Analysis," in IEEE Access, vol. 9, pp. 69527-69538, 2021, doi: 10.1109/ACCESS.2021.3077583.

### Instructions:

Link to external data sources:
- STCP bus stops data: https://opendata.porto.digital/dataset/?q=Sociedade
- Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview

# 8 - Annex

This notebook serves only as a user interface for the storytelling of our proposed solution. Implementation details are available in the application/datamesh folder and follow a standard data engineering architecture (raw layer, staging layer, feature layer). Each ELT Python file generates a single output table in its corresponding layer.