# Hackacity 2024
## [SOLUTION NAME]
#### Team: Hackacitizens
#### Authors:
- *João Peixoto*
- *José Cunha*
- *Nuno Oliveira*
- *Pedro Miranda*
- *Sofia Malpique*

# Instructions
- The purpose of this notebook is to provide a detailed description so that the technical and business jury can have an overview of the technical and non-technical aspects and how they interconnect.
- To use the template, you must make a copy and fill in the team name and the challenge title.
- The cells containing the instructions should be deleted, but all headers must be kept.
- All technical information should be included in this report (e.g., code, queries, graphs, tools used, parameters, etc.). If necessary, additional files can be added to the OneDrive folder, with references to those files included in this report.

# Tip
- The technical and business jury will need to evaluate several notebooks, so it’s important to maintain a clear line of thought within the notebook without overloading it with too many similar graphs or visuals that do not add value to the work. On the other hand, it’s important to show that the work done is thorough.
- Therefore, we recommend using the annex section to include code that was produced along with its description but might be unnecessary noise for the jury (e.g., tested models that didn’t work, EDA with very similar results between variables, etc.).


In [2]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Prevent line wrapping

In [None]:
# Serve the repository over HTTP so the HTML contents can be displayed (http://localhost:8000).
# Note: this cell keeps running while the server is up.
!python -m http.server

# 1 - Introduction


### Instructions:
Contextualization and explanation of the problem and the proposed solution. Should only contain text or text and images.

# 2 - Possible threat analysis before anonymization

### Instructions:
This section should cover a threat analysis of the dataset from the privacy perspective and its quantification.

Important notes:
- Only this section is evaluated for the data privacy award;
- No code is needed for this section;
- A good quantification of the threat is valued by the jury;
- Include any references that are important.

## 2.1 - Access point

| *Threat*               | *Description*                                                                                                                                                                            | *Privacy Risks*                                                                                                                                       | *Mitigation Measures*                                                                                         | *Risk Severity* | *Regulatory Impact*                             |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------|--------------------------------------------------|
| *AP MAC Address Exposure* | Exposure of access point MAC addresses allows attackers to identify and target specific devices or mimic infrastructure.                                                                  | Spoofing Attacks, exploitation of known vulnerabilities from the manufacturer, Network Mapping, Credential-Based Attacks.                                                                             | - Hash or mask MAC addresses.<br>- Regularly update firmware to patch vulnerabilities.<br>- Encrypt MAC data.  | Medium            | Potential non-compliance with GDPR for exposing device identifiers. |
| *Physical Access*       | Detailed device locations expose devices to physical tampering or theft.                                                                                                                | Compromise of physical security, unauthorized network access.                                                                                             | - Use tamper-proof casings.<br>- Restrict physical access to authorized personnel.<br>- Monitor access with surveillance or audits.                                | Low               | Limited regulatory implications, but operational risks. |


## 2.2 - User Privacy Concerns

| *Threat*               | *Description*                                                                                                                                                                            | *Privacy Risks*                                                                                                                                       | *Mitigation Measures*                                                                                         | *Risk Severity* | *Regulatory Impact*                             |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------|--------------------------------------------------|
| *User Tracking*         | The `callingstationid` field (the user's MAC address) enables long-term tracking when combined with time and location. | Profiling, identification, loss of privacy, and spear-phishing attacks. | - Rotate callingstationid hashes periodically.<br>- Minimize retention of detailed logs.<br>- Apply differential privacy to anonymize patterns. | High | Violation of GDPR Article 5 (data minimization). |
| *Upload Size Analysis*  | Attackers correlate upload size and social media activity to infer individual users' activities.                                                                                        | Breach of anonymity, targeted profiling, and social engineering attacks.                                                                                  | - Use k-anonymization to generalize the upload value.<br>- Aggregate upload/download metrics to obscure individual users.<br>- Limit granularity of session logs.         | Medium            | Breach of anonymity could trigger GDPR penalties. |
| *Session Time Correlation* | Attackers cross-reference session times with external activity logs, inferring users’ identities.                                                                                      | User identification and profiling, leading to targeted attacks.                                                                                           | - Use k-anonymization to generalize the session time value.<br>- Aggregate session data.                          | Medium            | Violation of GDPR Article 32 (security of processing). |
| *Session Start Time* | Attackers cross-reference session start times with what they observe in the access point zones and start finding patterns in the user's behaviour. | User identification and profiling, leading to targeted attacks. | - Use k-anonymization to generalize the session start time value.<br>- Aggregate session start time data. | Medium | Violation of GDPR Article 32 (security of processing). |


**Note:** *In future work we would collect data to build a fair model for security risk. Such a model aims to manage security risks in a balanced and equitable way: it ensures that the necessary measures are taken to protect sensitive information while weighing both the potential impact of each risk and the feasibility of its mitigation. The key objective is a security framework that is fair to all stakeholders (users, organizations, and regulatory bodies) and that addresses their concerns efficiently, transparently, and in compliance with the law. We cannot implement this yet because of the lack of data; with more information we would build the model and quantify the risk accordingly.*

# 3 - Anonymization process

### Instructions

Based on the previous threat assessment, propose any anonymization techniques you might find helpful.

Include all the code and proof needed that the threat has been mitigated.
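
As a minimal sketch of the generalization measures listed in Section 2.2, the snippet below coarsens session start times to the hour and upload sizes to ranges, then reports the smallest group size `k` over the resulting quasi-identifiers. The column names (`session_start`, `upload_bytes`), the bin edges, and the sample rows are assumptions made for illustration; they are not taken from the Porto Digital dataset.

```python
import pandas as pd

# Hypothetical session log; column names and values are illustrative only.
sessions = pd.DataFrame({
    "session_start": pd.to_datetime([
        "2024-10-22 13:05:11", "2024-10-22 13:40:02", "2024-10-22 13:59:59",
        "2024-10-22 22:01:30", "2024-10-22 22:17:45", "2024-10-22 22:48:09",
    ]),
    "upload_bytes": [120_000, 450_000, 900_000, 2_500_000, 7_000_000, 3_100_000],
})


def generalize(df: pd.DataFrame) -> pd.DataFrame:
    """Replace exact values with coarser buckets (quasi-identifier generalization)."""
    out = pd.DataFrame(index=df.index)
    # Keep only the date and the hour of the session start.
    out["start_hour"] = df["session_start"].dt.strftime("%Y-%m-%d %H:00")
    # Bucket the upload size into ranges (the edges are an assumption).
    edges = [0, 1_000_000, 10_000_000, float("inf")]
    labels = ["<1MB", "1-10MB", ">=10MB"]
    out["upload_range"] = pd.cut(df["upload_bytes"], bins=edges, labels=labels, right=False)
    return out


def smallest_group(df: pd.DataFrame, columns: list) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifiers."""
    return int(df.groupby(columns, observed=True).size().min())


generalized = generalize(sessions)
k = smallest_group(generalized, ["start_hour", "upload_range"])
print(k)
```

A low `k` would suggest that the buckets are still too fine for the amount of data available, so wider time windows or upload ranges may be needed.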

# 4 - Possible threat analysis after anonymization

## 4.1 - Access point

| **Threat**               | **Description**                                                                                                                                                                            | **Risk analysis after solution** |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *AP MAC Address Exposure* | Exposure of access point MAC addresses allows attackers to identify and target specific devices or mimic infrastructure. | Hashing the MAC addresses with a strong algorithm reduces the threat to almost none. The remaining risk is that the same hash function is used for every MAC address, so if the hashing is broken for a single user's data, every other user is compromised as well. We consider this very unlikely. |
| *Physical Access*        | Detailed device locations expose devices to physical tampering or theft.                                                                                                                | There is no change in risk before and after solution.                                                                                                                                                            |
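
One way to reduce the "same hash function for every MAC address" risk noted above is a keyed hash whose output also depends on a rotation period. The sketch below uses HMAC-SHA256 from the Python standard library; the function name `pseudonymize_mac`, the period label format, and the key are hypothetical and do not describe the pipeline actually used by the team.

```python
import hashlib
import hmac


def pseudonymize_mac(mac: str, secret_key: bytes, period: str) -> str:
    """Return a keyed pseudonym for a MAC address, valid for one rotation period."""
    # Normalize so "A0-23-9F-CA-87-C0" and "a0:23:9f:ca:87:c0" map to the same value.
    normalized = mac.lower().replace("-", "").replace(":", "")
    # Mixing the period into the message changes every pseudonym when the period changes.
    message = f"{period}|{normalized}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Illustrative key only; a real key would be random and stored outside the dataset.
key = b"example-secret-key-not-for-production"

october = pseudonymize_mac("a0-23-9f-ca-87-c0", key, "2024-10")
october_again = pseudonymize_mac("A0:23:9F:CA:87:C0", key, "2024-10")
november = pseudonymize_mac("a0-23-9f-ca-87-c0", key, "2024-11")

print(october == october_again)  # True: same device, same period
print(october == november)       # False: same device, different period
```

Without the key, an attacker cannot rebuild the mapping by hashing candidate MAC addresses, and rotating the period limits how long a single pseudonym can be followed.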

## 4.2 - User Privacy Concerns

| **Threat**               | **Description**                                                                                                                                                                            | **Risk analysis after solution** |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *User Tracking*          | The `callingstationid` field (the user's MAC address) enables long-term tracking when combined with time and location. | Re-identification Risk: Even if the callingstationid is hashed or anonymized periodically, there is still a possibility that a determined adversary can correlate data across time (e.g., if logs are retained too long or if data is combined with other datasets). This is a risk because data leaks, breaches, or even re-identification techniques might expose the original identity of users. Mitigation: Implement more aggressive anonymization techniques, such as k-anonymity or l-diversity, to prevent the re-identification of individuals from log data. |
| *Upload Size Analysis*   | Attackers correlate upload size and social media activity to infer individual users' activities.                                                                                        | Differential Privacy: <br> **What It Is:** Differential privacy adds controlled "noise" (random variation) to data to ensure that individual entries are not identifiable, even if an attacker has access to a large dataset. <br> **Why It Helps:** It guarantees that the removal or addition of any single data point doesn't significantly affect the overall results of data analysis, making it harder to isolate specific users. <br> **Example:** Instead of reporting precise upload sizes (even in ranges), you could add noise to these values, making it more difficult to draw conclusions about individual user behavior based solely on upload size. |
| *Session Time Correlation* | Attackers cross-reference session times with external activity logs, inferring users’ identities.                                                                                      | User identification and profiling, leading to targeted attacks. <br> **Mitigation:** Use k-anonymization to generalize the session time value.<br> Aggregate session data. |
| *Session Start Time*     | Attackers cross-reference session start times with what they see in the access point zones. They start finding patterns of this user behavior.                                                                                      | User identification and profiling, leading to targeted attacks. <br> **Mitigation:** Use k-anonymization to generalize the session start time value.<br> Aggregate session start time data. |
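
The differential privacy idea described for upload sizes can be sketched with the Laplace mechanism applied to an aggregate. The example assumes each user contributes at most one upload value, clips values to a hypothetical bound `clip_max`, and releases a noisy total; the function name, the bound, the `epsilon` value, and the sample data are illustrative assumptions.

```python
import numpy as np


def noisy_total_upload(upload_bytes, clip_max: float, epsilon: float, rng) -> float:
    """Laplace mechanism for a sum query over clipped upload sizes."""
    clipped = np.clip(np.asarray(upload_bytes, dtype=float), 0.0, clip_max)
    # Assuming each user contributes at most one value in [0, clip_max],
    # adding or removing one user changes the sum by at most clip_max.
    sensitivity = clip_max
    # Laplace noise with scale = sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.sum() + noise)


rng = np.random.default_rng(seed=42)
uploads = [120_000, 450_000, 900_000, 2_500_000, 7_000_000]  # made-up values
released = noisy_total_upload(uploads, clip_max=5_000_000, epsilon=1.0, rng=rng)
print(round(released))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy, so the value would have to be chosen against the analyses in Section 5.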


# 5 - Data potential and analysis

### Instructions

It is typical that during an anonymization process, there is a loss of value in terms of insights. In this section, explore the other side of the scale - What can still be done with the data that is processed and why is the risk above justifiable taking into account the potential return?

Specifically, try to imagine yourself as the city's analytics team, which uses this data to draw conclusions about the city's state and flow, and try to develop and demonstrate what results and analyses can be done. You can use the given data plus optionally any other data sources. If you complement with additional datasets, be it from the city Open Data Portal (https://opendata.porto.digital) or other datasets, make sure you cite and reference them explicitly.



## 5.1 - Points of Interest

Using the raw data provided by Porto Digital, we focused on 5 points of interest:
* STCP routes: File with information about the public transportation routes
* STCP trips: File with information about the trips made by public transport
* STCP stops: File with geolocation information about stops (bus stops, for example)
* STCP stop times: File with information about stop times (arrival time, departure time, etc.)
* STCP shapes: File with the geographical layout of each transit route


This allows us to create a dedicated file that combines this information, which we then use for inference and analysis in our processes. Example:

In [6]:
import os
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this
os.chdir('C:/Users/CTW02967/Documents/GitHub/hackacitizens_2024')  # Change to root


In [7]:
df = pd.read_csv('./datamesh/b_staging/datasets/stcp_routes_spatio_temporal.csv', low_memory=False)
df.head()

Unnamed: 0,route_id,direction_id,service_id,trip_id,trip_headsign,wheelchair_accessible,block_id,shape_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,stop_code,stop_name,stop_lat,stop_lon,zone_id,stop_url
0,107,0,UTEIS,107_0_U_29,Areias,1,,107_0_1_shp,20:20:00,20:20:00,AANT5,1,,AANT5,ALAMEDA DAS ANTAS,41.1632,-8.583404,PRT1,http://www.stcp.pt/pt/viajar/paragens/?t=detal...
1,107,0,UTEIS,107_0_U_29,Areias,1,,107_0_1_shp,20:22:10,20:22:10,EDRG1,2,,EDRG1,ESTÁDIO DO DRAGÃO,41.160278,-8.583806,PRT1,http://www.stcp.pt/pt/viajar/paragens/?t=detal...
2,107,0,UTEIS,107_0_U_29,Areias,1,,107_0_1_shp,20:23:40,20:23:40,CRJ1,3,,CRJ1,CORUJEIRA,41.159285,-8.580922,PRT3,http://www.stcp.pt/pt/viajar/paragens/?t=detal...
3,107,0,UTEIS,107_0_U_29,Areias,1,,107_0_1_shp,20:24:42,20:24:42,PCRJ1,4,,PCRJ1,PR. DA CORUJEIRA,41.156919,-8.579865,PRT3,http://www.stcp.pt/pt/viajar/paragens/?t=detal...
4,107,0,UTEIS,107_0_U_29,Areias,1,,107_0_1_shp,20:25:57,20:25:57,JFC,5,,JFC,JUNTA FREG. CAMPANHÃ,41.158167,-8.578587,PRT3,http://www.stcp.pt/pt/viajar/paragens/?t=detal...


## 5.2 - Nearby Points of Interest

Furthermore, we use 2 main files from the *Google Maps API* as the basis to **calculate the Wi-Fi hotspots that are near the points of interest**:
* porto_city_main_institutions: File with data about the institutions (name, address, lat, long, ...)
* porto_digital_wifi_hotspots: File with Wi-Fi hotspot data from Porto Digital (address, hotspot, zone, ...)

Here's an example of the generated data:

In [26]:
df = pd.read_json('./datamesh/c_features/datasets/porto_wifi_hotspots_nearby_poi.json')
df.head()

Unnamed: 0,MAC_ADDRESS,lat,lon,address,Hotspot,Zone,Parish,number_of_nearby_schools,closest_school_km,number_of_nearby_hospitals,closest_hospital_km,number_of_nearby_universitys,closest_university_km,number_of_nearby_public_offices,closest_public_office_km,number_of_nearby_tourist_attractions,closest_tourist_attraction_km,number_of_nearby_train_stations,closest_train_station_km,number_of_nearby_subway_stations,closest_subway_station_km
0,48-8b-0a-7a-9a-60,41.146929,-8.632913,"Museu C. Elétrico, 4050 Porto, Portugal",Alameda Basílio Teles - Rua de Dom Pedro V,Massarelos,Lordelo do Ouro e Massarelos,5,0.59265,23,0.658964,20,0.456864,0,1.151058,7,0.46516,0,1.588172,0,1.046631
1,a0-23-9f-ca-87-c0,41.149482,-8.610083,"Av.aliados, 4000-125 Porto, Portugal",Aliados - Gabinete do Municipe,Trindade,"Cedofeita, Santo Ildefonso, Sé, Miragaia, São ...",41,0.234685,28,0.238527,28,0.40748,55,0.020486,81,0.057663,67,0.123567,59,0.116824
2,a4-bd-c4-d0-61-00,41.147993,-8.611216,"Av. dos Aliados 107, 4000-067 Porto, Portugal",Aliados - Rua Elísio de Melo,Santo Ildefonso,Cedofeita,36,0.251828,32,0.215924,27,0.295997,46,0.170341,101,0.04866,59,0.069103,49,0.069103
3,a4-bd-c4-d0-60-80,41.14908,-8.6106,"Av. dos Aliados 236, 4000-114 Porto, Portugal",Aliados (Fonte dos Aliados),Santo Ildefonso,Cedofeita,38,0.207945,34,0.288769,28,0.353824,48,0.042509,86,0.034239,63,0.062654,54,0.061666
4,a0-23-9f-94-a2-c0,41.177855,-8.59509,"Pav. Desportivo Luís Falcão, 4200-465 Porto, P...",Asprela - FEUP,Paranhos,Lamas,0,1.359096,8,0.327911,27,0.084433,2,0.059936,0,1.557766,0,1.956223,5,0.817976
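
Columns such as `number_of_nearby_*` and `closest_*_km` could be derived with a great-circle (haversine) distance between each hotspot and each institution. The sketch below is one possible way to do it, not necessarily the method used to build the file; the `radius_km` threshold, the function names, and the sample POI coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the square root slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearby_summary(point, pois, radius_km: float = 1.0) -> dict:
    """Count the POIs within radius_km of a point and report the closest distance."""
    distances = [haversine_km(point[0], point[1], lat, lon) for lat, lon in pois]
    return {
        "number_of_nearby": sum(d <= radius_km for d in distances),
        "closest_km": min(distances) if distances else None,
    }


hotspot = (41.149482, -8.610083)  # a hotspot coordinate from the table above
example_pois = [(41.1496, -8.6110), (41.1779, -8.5951)]  # made-up POI coordinates
summary = nearby_summary(hotspot, example_pois)
print(summary)
```

The same function could be applied per POI category (schools, hospitals, and so on) to fill one pair of columns at a time.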


We also use the file generated in section 5.1, combined with *porto_city_main_institutions*, to **calculate the STCP stops that are near the points of interest**.

Here's an example of the generated data:

In [28]:
df = pd.read_json('./datamesh/c_features/datasets/stcp_stops_nearby_poi.json')
df.head()

Unnamed: 0,stop_id,stop_name,lat,lon,neighborhood_name,number_of_nearby_schools,closest_school_km,number_of_nearby_hospitals,closest_hospital_km,number_of_nearby_universitys,closest_university_km,number_of_nearby_public_offices,closest_public_office_km,number_of_nearby_tourist_attractions,closest_tourist_attraction_km,number_of_nearby_train_stations,closest_train_station_km,number_of_nearby_subway_stations,closest_subway_station_km
0,AANT5,ALAMEDA DAS ANTAS,41.1632,-8.583404,Campanhã,11,1.112788,6,0.313641,0,1.548288,3,0.502326,0,1.983624,10,0.99217,8,0.297165
1,EDRG1,ESTÁDIO DO DRAGÃO,41.160278,-8.583806,Campanhã,12,0.963362,7,0.117084,0,1.574076,6,0.659874,0,1.800641,19,0.906405,10,0.129556
2,CRJ1,CORUJEIRA,41.159285,-8.580922,Campanhã,8,1.129038,7,0.382527,0,1.83725,6,0.919701,0,1.986427,17,0.811824,6,0.192766
3,PCRJ1,PR. DA CORUJEIRA,41.156919,-8.579865,Campanhã,8,1.022547,6,0.605787,0,1.906887,6,1.096368,0,2.007576,19,0.590985,8,0.463733
4,JFC,JUNTA FREG. CAMPANHÃ,41.158167,-8.578587,Campanhã,4,1.190231,5,0.612591,0,2.061983,5,1.142365,0,2.140804,17,0.763487,6,0.418287


## 5.3 - Hourly metric

One of the metrics we focused on is hourly activity: Wi-Fi accesses per hotspot and trips per bus stop. The goal is to compute a mobility index, an indicator used to assess the efficiency and sustainability of transportation, and to infer a mobility score (0-100) from these two inputs.
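
The scoring formula is not specified here, so the following is only one possible sketch: both hourly counts are min-max normalized and combined with a weight into a 0-100 value. It assumes the Wi-Fi and bus counts have already been joined per location and hour; the weight `wifi_weight`, the helper names, and the sample rows are hypothetical.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def mobility_score(hourly: pd.DataFrame, wifi_weight: float = 0.5) -> pd.Series:
    """Weighted combination of normalized Wi-Fi sessions and bus trips, scaled to 0-100."""
    wifi = min_max(hourly["number_of_sessions_per_hour"])
    bus = min_max(hourly["distinct_trip_count"])
    return (100 * (wifi_weight * wifi + (1 - wifi_weight) * bus)).round(1)


# Made-up joined data for a single location.
hourly = pd.DataFrame({
    "hour_slice": ["06:00-06:59", "13:00-13:59", "22:00-22:59"],
    "number_of_sessions_per_hour": [20, 97, 142],
    "distinct_trip_count": [1, 4, 2],
})
hourly["mobility_score"] = mobility_score(hourly)
print(hourly)
```

Min-max scaling keeps the score bounded but is sensitive to outliers, so a percentile-based scaling may be worth comparing.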

**Here's the generated data for the hourly Wi-Fi access per hotspot**

In [8]:
df = pd.read_csv('./datamesh/c_features/datasets/porto_wifi_access_per_hotspot_hourly.csv')
df.head()

Unnamed: 0,MAC_ADDRESS,lat,lon,address,date,hour_slice,number_of_sessions_per_hour
0,,41.155923,-8.68099,"Av. do Brasil 432, 4150-153 Porto, Portugal",2024-10-22,13:00–13:59,97
1,,,,,2024-01-25,22:00–22:59,89
2,,,,,2024-07-23,03:00–03:59,104
3,,,,,2024-11-06,22:00–22:59,95
4,,41.155923,-8.68099,"Av. do Brasil 432, 4150-153 Porto, Portugal",2024-02-21,22:00–22:59,142
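
A table with this shape could be produced by grouping raw session records by hotspot, date, and hour. The sketch below assumes a hypothetical raw log with a `session_start` timestamp column and placeholder hotspot identifiers; only the output column names follow the table above.

```python
import pandas as pd

# Hypothetical raw session log; identifiers and timestamps are made up.
raw = pd.DataFrame({
    "MAC_ADDRESS": ["ap-1", "ap-1", "ap-1", "ap-2"],
    "session_start": pd.to_datetime([
        "2024-10-22 13:05:00", "2024-10-22 13:40:00",
        "2024-10-22 14:10:00", "2024-10-22 13:15:00",
    ]),
})


def sessions_per_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Count sessions per hotspot, date, and one-hour slice."""
    out = df.copy()
    out["date"] = out["session_start"].dt.strftime("%Y-%m-%d")
    # Build labels such as "13:00-13:59" from the hour of the session start.
    out["hour_slice"] = out["session_start"].dt.hour.map(lambda h: f"{h:02d}:00-{h:02d}:59")
    return (
        out.groupby(["MAC_ADDRESS", "date", "hour_slice"])
        .size()
        .reset_index(name="number_of_sessions_per_hour")
    )


hourly_sessions = sessions_per_hour(raw)
print(hourly_sessions)
```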


**Here's the generated data for the hourly trips per bus stop**

In [10]:
df = pd.read_csv('./datamesh/c_features/datasets/sctp_bus_stops_trips_hourly.csv')
df.head()

Unnamed: 0,stop_id,service_id,hourly_slice,stop_lat,stop_lon,stop_name,distinct_trip_count
0,.,DOM,06:00-06:59,41.254116,-8.653726,.,1
1,.,DOM,07:00-07:59,41.254116,-8.653726,.,1
2,.,DOM,08:00-08:59,41.254116,-8.653726,.,1
3,.,DOM,09:00-09:59,41.254116,-8.653726,.,2
4,.,DOM,10:00-10:59,41.254116,-8.653726,.,1


## 5.4 - Visualization analysis

### 5.4.1 - Visualization of different POI based on a heatmap of Wi-Fi access

This visualization aims to show the critical points of interest based on Wi-Fi access data. (ideas on what we could achieve here)

In [12]:
from IPython.display import IFrame, display  # explicit import so the cell also works outside a notebook kernel

# URL to the HTML file served by the local server
url = "http://localhost:8000/datamesh/e_presentation/dashboards/factory/html/map_porto_wifi_heatmap_poi.html"

# Embed the HTML file
display(IFrame(url, width=1200, height=1200))



# 6 - Conclusions and Future Work


### Instructions:

List the main conclusions focusing on the feasibility, innovation, and applicability of the solution.

In addition, describe the future work still necessary if this solution were to continue: What would the next steps be? What limitations do the data have for implementing the solution? How can the data be improved? How could this solution be improved? What other ideas could be included/analyzed? What other types of data could be used? What other methodologies could be experimented with?

# 7 - References

### Instructions:

Provide the relevant references for materials and/or sources used (reports, articles, external data sources, etc.).


# 8 - Annex