## Strava + Spotify Analysis

The goal of this project is to build a Spotify playlist that helps improve a user's running performance on their most common routes.

In [136]:
pip install gpxpy



In [137]:
pip install fitparse

^C
ERROR: Operation cancelled by user
Note: you may need to restart the kernel to use updated packages.


In [1]:
import gpxpy
import pandas as pd
from pathlib import Path
from xml.etree import ElementTree as ET
from fitparse import FitFile
import requests
import time
import os



## Get Strava and Spotify data

The Strava export consists of several files of different types: an *activities.csv* file and many GPX/FIT files. Each row of *activities.csv* summarizes one activity, from the moment the user starts recording until they finish. The GPX/FIT files are linked to a row in the activities file through the "Activity ID" variable. They hold geodata that tracks the user's position (coordinates) and altitude every few seconds. In short, the activities file is a summary of each activity, and the GPX/FIT files are the granular data behind each one.

In [None]:
#Combine all gpx and fit files into one df
folder_path = Path("/Users/joanafcasanova/Documents/Documents Joana Casanova 2025/Projects/Portfolio/Strava + Spotify/Strava/activities")
all_data = []

def process_gpx(gpx_file):
    with gpx_file.open("r") as f:
        gpx = gpxpy.parse(f)

        for track in gpx.tracks:
            for segment in track.segments:
                for point in segment.points:
                    row = {
                        'time': point.time,
                        'latitude': point.latitude,
                        'longitude': point.longitude,
                        'elevation': point.elevation,
                        'activity_id': gpx_file.stem,
                        'source': 'gpx'
                    }

                    # Extract extensions like heart rate
                    if point.extensions:
                        for ext in point.extensions:
                            for child in ext:
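                                tag = child.tag.lower().split('}')[-1]
                                # Garmin-style GPX stores heart rate as <hr>; store it under 'heart_rate' like the FIT field
                                row['heart_rate' if tag == 'hr' else tag] = child.text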

                    all_data.append(row)

def process_fit(fit_file_path):
    with fit_file_path.open("rb") as f:
        fitfile = FitFile(f)

        for record in fitfile.get_messages('record'):
            row = {
                'activity_id': fit_file_path.stem,
                'source': 'fit',
            }

            for field in record:
                name = field.name.lower()
                row[name] = field.value

            # Only process if location data exists
            if row.get('position_lat') is not None and row.get('position_long') is not None:
                row['latitude'] = row['position_lat'] * (180 / 2**31)
                row['longitude'] = row['position_long'] * (180 / 2**31)
                row['time'] = row.get('timestamp')
                row['elevation'] = row.get('altitude')

                all_data.append(row)

#Loop through all activity files
for file in folder_path.glob("*"):
    try:
        if file.suffix == '.gpx':
            process_gpx(file)
        elif file.suffix == '.fit':
            process_fit(file)
    except Exception as e:
        print(f"Failed to parse {file.name}: {e}")

geo_files = pd.DataFrame(all_data)

columns = ['time', 'latitude', 'longitude', 'elevation', 'heart_rate', 'activity_id', 'source']
for col in columns:
    if col not in geo_files.columns:
        geo_files[col] = None

geo_files = geo_files[columns]

print(geo_files.shape)
print(geo_files.head())

KeyboardInterrupt: 

In [15]:
#Get the geo file
file_path = os.path.join('Strava', 'raw_geo.csv')
geo_data = pd.read_csv(file_path)
geo_data



Unnamed: 0,time,latitude,longitude,elevation,heart_rate,activity_id,source
0,2023-08-04 17:13:20,41.506688,2.343538,,,10276081516,fit
1,2023-08-04 17:13:21,41.506298,2.343978,,119.0,10276081516,fit
2,2023-08-04 17:13:22,41.506070,2.344225,,119.0,10276081516,fit
3,2023-08-04 17:13:29,41.505395,2.345025,163.4,117.0,10276081516,fit
4,2023-08-04 17:13:31,41.505376,2.345025,163.4,115.0,10276081516,fit
...,...,...,...,...,...,...,...
2191732,2018-10-28 11:23:06,41.747120,2.563116,154.0,167.0,2114795252,fit
2191733,2018-10-28 11:23:07,41.747118,2.563091,154.0,167.0,2114795252,fit
2191734,2018-10-28 11:23:09,41.747110,2.563062,153.0,164.0,2114795252,fit
2191735,2018-10-28 11:23:12,41.747100,2.563041,153.0,162.0,2114795252,fit


In [16]:
len(geo_data)

2191737

In [17]:
#Get the activities file
file_path = os.path.join('Strava', 'raw_activities.csv')
activities = pd.read_csv(file_path)
activities

Unnamed: 0,Activity ID,Activity Date,Activity Name,Activity Type,Activity Description,Elapsed Time,Distance,Max Heart Rate,Relative Effort,Commute,...,Activity Count,Total Steps,Carbon Saved,Pool Length,Training Load,Intensity,Average Grade Adjusted Pace,Timer Time,Total Cycles,Media
0,117289933,"Mar 2, 2014, 5:19:42 PM",Rutita. Meitat sense gps,Run,,4510,13.43,,,,...,,,,,,,,,,
1,180884072,"Aug 16, 2014, 5:56:42 AM",Camino etapa 1: irun/pasaia,Run,,14283,13.41,,,False,...,,,,,,,,,,
2,181438050,"Aug 17, 2014, 5:36:13 AM",Etapa 2 pasaia/orio (sense cobertura),Run,,28466,13.44,,,False,...,,,,,,,,,,
3,1020912369,"Jun 3, 2017, 5:40:49 AM",BCN-Montserrat,Ride,,35111,82.52,,,False,...,,,,,,,,,,
4,1020931170,"May 31, 2017, 5:13:12 PM",Tibidabo tontorrón,Ride,,7023,21.39,,,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472,13963806133,"Mar 23, 2025, 7:37:01 AM",Trail del Senglar,Run,Cames encara tocades de la Marató però anem fe...,8588,21.35,195.0,573.0,False,...,,23308.0,,,428.0,134.0,3.321873,,,
473,13995026321,"Mar 26, 2025, 4:35:58 PM",Reconeixement Imperdibles,Run,Tot en ordre per divendres🤘🏼,7270,16.05,187.0,203.0,False,...,,18242.0,,,286.0,119.0,3.056847,,,
474,14014657128,"Mar 28, 2025, 5:37:58 PM",Imperdibles vol2,Run,,7525,12.25,182.0,89.0,False,...,,15434.0,,,230.0,106.0,2.213734,,,media/504C2981-C951-49C8-95C7-AFF508703AD6.jpg...
475,14023306661,"Mar 29, 2025, 4:46:36 PM",Afternoon Trail Run,Run,,6676,11.52,164.0,59.0,False,...,,15556.0,,,176.0,98.0,2.409797,,,


In [18]:
activities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 477 entries, 0 to 476
Data columns (total 94 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Activity ID                   477 non-null    int64  
 1   Activity Date                 477 non-null    object 
 2   Activity Name                 477 non-null    object 
 3   Activity Type                 477 non-null    object 
 4   Activity Description          15 non-null     object 
 5   Elapsed Time                  477 non-null    int64  
 6   Distance                      477 non-null    object 
 7   Max Heart Rate                376 non-null    float64
 8   Relative Effort               376 non-null    float64
 9   Commute                       476 non-null    object 
 10  Activity Private Note         0 non-null      float64
 11  Activity Gear                 402 non-null    object 
 12  Filename                      477 non-null    object 
 13  Athle

In [19]:
len(activities)

477

In [20]:
activities["year"] = pd.to_datetime(activities["Activity Date"]).dt.year
activities_per_year = activities["year"].value_counts().sort_index()

print(activities_per_year)


year
2014      3
2015      2
2017     37
2018     38
2019     17
2020     33
2021     30
2022     65
2023    110
2024    116
2025     26
Name: count, dtype: int64




## Data Wrangling

To keep the data consistent and reduce its size, the dataset granularity is reduced to one point every 5 seconds.
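The cells that follow do not show the downsampling step itself, so here is a minimal sketch of one way it could be done, assuming a parsed `time` column and an `activity_id` column as in `geo`. The helper name `downsample_5s` and the `time_bucket` column are hypothetical, and the toy rows only stand in for the real data.

```python
import pandas as pd

def downsample_5s(df, time_col="time", id_col="activity_id", freq="5s"):
    """Keep the earliest GPS point in every `freq` window of each activity."""
    out = df.dropna(subset=[time_col]).copy()
    # Floor each timestamp to the start of its window (e.g. 17:13:22 -> 17:13:20)
    out["time_bucket"] = out[time_col].dt.floor(freq)
    out = out.sort_values([id_col, time_col])
    # One row per (activity, window): the first point recorded in that window
    return out.drop_duplicates(subset=[id_col, "time_bucket"], keep="first").reset_index(drop=True)

# Toy rows standing in for `geo` (hypothetical activity id "a")
sample = pd.DataFrame({
    "activity_id": ["a"] * 5,
    "time": pd.to_datetime([
        "2023-08-04 17:13:20", "2023-08-04 17:13:21", "2023-08-04 17:13:22",
        "2023-08-04 17:13:29", "2023-08-04 17:13:31",
    ]),
    "latitude": [41.506688, 41.506298, 41.506070, 41.505395, 41.505376],
})
reduced = downsample_5s(sample)
print(reduced[["time", "time_bucket", "latitude"]])
```

Keeping the first point per window is a design choice; averaging the coordinates within each window would be an equally reasonable alternative.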

In [21]:
geo=geo_data.copy()

In [22]:
len(geo)

2191737

In [23]:
geo

Unnamed: 0,time,latitude,longitude,elevation,heart_rate,activity_id,source
0,2023-08-04 17:13:20,41.506688,2.343538,,,10276081516,fit
1,2023-08-04 17:13:21,41.506298,2.343978,,119.0,10276081516,fit
2,2023-08-04 17:13:22,41.506070,2.344225,,119.0,10276081516,fit
3,2023-08-04 17:13:29,41.505395,2.345025,163.4,117.0,10276081516,fit
4,2023-08-04 17:13:31,41.505376,2.345025,163.4,115.0,10276081516,fit
...,...,...,...,...,...,...,...
2191732,2018-10-28 11:23:06,41.747120,2.563116,154.0,167.0,2114795252,fit
2191733,2018-10-28 11:23:07,41.747118,2.563091,154.0,167.0,2114795252,fit
2191734,2018-10-28 11:23:09,41.747110,2.563062,153.0,164.0,2114795252,fit
2191735,2018-10-28 11:23:12,41.747100,2.563041,153.0,162.0,2114795252,fit


In [27]:
#Convert timestamp to usable format
geo["time"] = pd.to_datetime(geo["time"], errors="coerce")

#Get in same format for activities df
activities["Activity Date"] = pd.to_datetime(
    activities["Activity Date"], 
    format="%b %d, %Y, %I:%M:%S %p"
)

- Coordinates

GPS data is noisy by nature: even if you run the exact same route, slight variations in signal, your arm swing, or your phone's mood will make the coordinates slightly different every time. This is a problem when trying to identify common routes, so one solution is to reduce the precision of the coordinates to fewer decimal places.

💡Interesting fact: To find the best coordinate precision for this challenge in the city of Barcelona, I dove deep into its architecture. I found that the famous blocks that make up the center of Barcelona, following the "Plan Cerdá", are 113.3 meters long. Since our runner's starting point is located within these blocks, choosing the coordinate precision that best suits the area gives a clearer view of the routes followed. The answer was 3 decimal places: coordinates rounded to 3 decimal places group together points within a cell about 111 meters across (measured north-south). Therefore, this was the best approximation I found for the challenge.
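As a rough check of that reasoning, the sketch below estimates the size of the grid cell produced by rounding to a given number of decimal places. It assumes a spherical-Earth approximation of about 111,320 meters per degree of latitude and a latitude of 41.39° for Barcelona; the function name `cell_size_m` is hypothetical.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximation; the true value varies slightly with latitude

def cell_size_m(decimals, latitude_deg):
    """Approximate (north-south, east-west) size in meters of a rounding cell."""
    step = 10 ** -decimals  # e.g. 0.001 degrees for 3 decimals
    north_south = step * METERS_PER_DEGREE_LAT
    # Meridians converge towards the poles, so a degree of longitude shrinks by cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

for d in (2, 3, 4):
    ns, ew = cell_size_m(d, 41.39)
    print(f"{d} decimals: {ns:.0f} m north-south x {ew:.0f} m east-west")
```

Note that the east-west side of the cell is shorter than the north-south side at this latitude, so the cells are not square.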

In [28]:
# Round to 2 and 3 decimal places
geo["latitude_round(2)"] = geo["latitude"].round(2)
geo["longitude_round(2)"] = geo["longitude"].round(2)
geo["latitude_round(3)"] = geo["latitude"].round(3)
geo["longitude_round(3)"] = geo["longitude"].round(3)

# Combine into coordinates
geo["coordinates_round(2)"] = geo["latitude_round(2)"].astype(str) + ", " + geo["longitude_round(2)"].astype(str)
geo["coordinates_round(3)"] = geo["latitude_round(3)"].astype(str) + ", " + geo["longitude_round(3)"].astype(str)

In [29]:
geo_nulls = geo.isnull().sum()
print(geo_nulls)

time                     81743
latitude                     0
longitude                    0
elevation               250172
heart_rate              668721
activity_id                  0
source                       0
latitude_round(2)            0
longitude_round(2)           0
latitude_round(3)            0
longitude_round(3)           0
coordinates_round(2)         0
coordinates_round(3)         0
dtype: int64


In [31]:
#Dropping nulls from timestamp
geo = geo.dropna(subset=['time'])

In [None]:
geo_zeros = (geo == 0).sum()
print(geo_zeros)

latitude                  0
longitude                 0
elevation               249
heart_rate                0
activity_id               0
source                    0
timestamp                 0
date                      0
time_only                 0
latitude_round(2)         0
longitude_round(2)        0
latitude_round(3)         0
longitude_round(3)        0
coordinates_round(2)      0
coordinates_round(3)      0
dtype: int64


In [None]:
activities_nulls = activities.isnull().sum()
print(activities_nulls)

Activity ID                      0
Activity Date                    0
Activity Name                    0
Activity Type                    0
Activity Description           462
                              ... 
Average Grade Adjusted Pace    367
Timer Time                     477
Total Cycles                   476
Media                          420
year                             0
Length: 95, dtype: int64


In [None]:
#Count zeros in the activities dataframe (not in geo)
activities_zeros = (activities == 0).sum()
print(activities_zeros)


## Exploratory analysis: activities

In [None]:
print(geo_data["activity_id"].unique()[:10])

['10276081516' '9437241590' '11356095144' '2114795061' '10584685854'
 '9870674310' '9545986858' '8779105142' '7848269905' '8702631154']


In [None]:
common_ids = set(geo_data["activity_id"]) & set(activities["Activity ID"])
print(f"Common IDs count: {len(common_ids)}")

Common IDs count: 0


In [None]:
# Clean 'activity_id' in geo_data
geo_data["activity_id"] = geo_data["activity_id"].astype(str).str.strip()

# Clean 'Activity ID' in activities and rename to 'activity_id' for consistency
activities["Activity ID"] = activities["Activity ID"].astype(str).str.strip()
activities = activities.rename(columns={"Activity ID": "activity_id"})

# Now compare or merge using 'activity_id'
common_ids = set(geo_data["activity_id"]) & set(activities["activity_id"])
print(f"Common IDs count: {len(common_ids)}")


Common IDs count: 4
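The jump from 0 to 4 common IDs is consistent with a type mismatch: `geo_data["activity_id"]` prints as strings, while `Activity ID` is `int64`. A minimal sketch with made-up IDs shows why a set intersection across the two types comes back empty until both sides are normalized:

```python
# Made-up IDs for illustration only
geo_ids = {"111", "222", "333"}   # strings, as geo_data["activity_id"] appears above
activity_ids = {111, 222, 444}    # ints, as activities["Activity ID"] is typed

# "111" != 111 in Python, so nothing matches
naive = geo_ids & activity_ids

# Casting both sides to stripped strings makes equal IDs comparable
normalized = geo_ids & {str(i).strip() for i in activity_ids}

print(len(naive), len(normalized))
```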


In [None]:
# 1. Extract start times from activities (replace 'Start Time' with your actual column name)
activities["start_time"] = pd.to_datetime(activities["Start Time"])

# 2. Earliest timestamp per activity in geo_data
geo_data["time"] = pd.to_datetime(geo_data["time"])
geo_start_times = geo_data.groupby("activity_id")["time"].min().reset_index().rename(columns={"time": "geo_start_time"})

# 3. Merge on closest timestamps within tolerance (say 5 minutes)
# Merge on cross join and filter by time difference — this can be expensive for large data, so be careful

# A simpler approach: merge on activity_id where possible, then check time diff for unmatched

merged = pd.merge(activities, geo_start_times, on="activity_id", how="left")

# Calculate time difference in seconds
merged["time_diff"] = (merged["start_time"] - merged["geo_start_time"]).abs().dt.total_seconds()

# Filter matches within 5 minutes (300 seconds)
matched = merged[merged["time_diff"] <= 300]

print(f"Matched activities count: {matched.shape[0]}")
print(matched[["activity_id", "start_time", "geo_start_time", "time_diff"]].head())


ValueError: unconverted data remains when parsing with format "%Y-%m-%d %H:%M:%S": "+00:00", at position 279695. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
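The error says that some `time` strings carry a `+00:00` offset while others do not. A minimal sketch of one way to parse such a column, assuming pandas 2.0 or later and assuming the offset-free timestamps are also in UTC (an assumption that should be checked against the export):

```python
import pandas as pd

# Two sample strings in the two formats seen in this notebook
raw = pd.Series(["2023-08-04 17:13:20", "2014-03-02 17:19:42+00:00"])

# format="mixed" infers the format per element; utc=True returns one
# timezone-aware column and treats offset-free values as UTC (assumption)
parsed = pd.to_datetime(raw, format="mixed", utc=True)
print(parsed)
```

If the geo timestamps are parsed this way, the activity start times need to be timezone-aware as well before subtracting, since pandas does not subtract naive from aware datetimes.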

In [None]:
activities_ids = set(activities["activity_id"])
geo_ids = set(geo_data["activity_id"])

print("IDs in activities but not in geo_data:", list(activities_ids - geo_ids)[:10])
print("IDs in geo_data but not in activities:", list(geo_ids - activities_ids)[:10])


IDs in activities but not in geo_data: ['1020912369', '1858088924', '8625916824', '5002023360', '1157064576', '8181572250', '8561540607', '2204100744', '7181130773', '4963192585']
IDs in geo_data but not in activities: ['11590146511', '9478440379', '1515963251', '6334148192', '9910072862', '11524197274', '1238071610', '12636723576', '8444517253', '1215635123']
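One possible explanation worth checking (an assumption, not verified against this export) is that the file stems used as `activity_id` in `geo_data` are not always the Activity ID itself. The activities file has a `Filename` column; if it holds paths such as `activities/<stem>.fit.gz`, it could serve as the bridge between the two tables. The sketch below uses made-up rows, and `file_stem` is a hypothetical helper.

```python
from pathlib import PurePosixPath

def file_stem(filename):
    """Strip the directory and every extension: 'activities/999.fit.gz' -> '999'."""
    return PurePosixPath(filename).name.split(".")[0]

# Made-up (Activity ID, Filename) pairs standing in for rows of activities.csv
rows = [
    (111, "activities/111.gpx"),      # stem equals the activity id
    (222, "activities/999.fit.gz"),   # stem differs from the activity id
]

# Map each file stem to the activity id it belongs to
stem_to_activity = {file_stem(fn): str(act_id) for act_id, fn in rows}
print(stem_to_activity)
```

With the real data, the same mapping could be built from the `Activity ID` and `Filename` columns and applied with `geo_data["activity_id"].map(stem_to_activity)`.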


In [None]:
geo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 218389 entries, 0 to 219369
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype              
---  ------                --------------   -----              
 0   latitude              218389 non-null  float64            
 1   longitude             218389 non-null  float64            
 2   elevation             193250 non-null  float64            
 3   heart_rate            152355 non-null  float64            
 4   activity_id           218389 non-null  object             
 5   source                218389 non-null  object             
 6   timestamp             218389 non-null  datetime64[ns, UTC]
 7   date                  218389 non-null  object             
 8   time_only             218389 non-null  object             
 9   latitude_round(2)     218389 non-null  float64            
 10  longitude_round(2)    218389 non-null  float64            
 11  latitude_round(3)     218389 non-null  float64           

In [None]:
activities["Distance"] = pd.to_numeric(activities["Distance"], errors="coerce")
activities["distance(0)"] = activities["Distance"].round(0)

In [None]:
activities["distance(0)"].value_counts(ascending=False)

distance(0)
10.0     30
12.0     27
13.0     24
6.0      23
16.0     21
         ..
913.0     1
947.0     1
998.0     1
799.0     1
39.0      1
Name: count, Length: 85, dtype: int64

In [None]:
#Despite the variable name, this selects the activities whose distance rounds to 12 km
activities_10km = activities.loc[activities["distance(0)"] == 12, "Activity ID"]
activities_10km

8       1021033275
26      1104565723
194     7861093513
198     8009062777
212     8166326948
225     8322285279
227     8379957634
231     8411123501
236     8454348908
240     8492453712
246     8533943308
260     8641195394
270     8809411498
322     9842990864
352    11057807405
361    11343871511
378    11583221058
398    11904957809
411    12180610299
429    12612752464
431    12636253996
434    12672351401
435    12678353381
451    13350901391
465    13720801654
474    14014657128
475    14023306661
Name: Activity ID, dtype: int64

In [None]:
geo

Unnamed: 0,latitude,longitude,elevation,heart_rate,activity_id,source,timestamp,date,time_only,latitude_round(2),longitude_round(2),latitude_round(3),longitude_round(3),coordinates_round(2),coordinates_round(3)
0,41.412958,2.188182,35.6,,117289933,gpx,2014-03-02 17:19:42+00:00,2014-03-02,17:19:42,41.41,2.19,41.413,2.188,"41.41, 2.19","41.413, 2.188"
1,41.412927,2.187903,35.0,,117289933,gpx,2014-03-02 17:19:59+00:00,2014-03-02,17:19:59,41.41,2.19,41.413,2.188,"41.41, 2.19","41.413, 2.188"
2,41.412728,2.187520,35.5,,117289933,gpx,2014-03-02 17:20:13+00:00,2014-03-02,17:20:13,41.41,2.19,41.413,2.188,"41.41, 2.19","41.413, 2.188"
3,41.412330,2.187218,37.5,,117289933,gpx,2014-03-02 17:20:25+00:00,2014-03-02,17:20:25,41.41,2.19,41.412,2.187,"41.41, 2.19","41.412, 2.187"
4,41.411742,2.187180,38.2,,117289933,gpx,2014-03-02 17:20:35+00:00,2014-03-02,17:20:35,41.41,2.19,41.412,2.187,"41.41, 2.19","41.412, 2.187"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219365,41.918458,3.207615,5.6,167.0,8084935022,fit,2022-08-04 09:02:33+00:00,2022-08-04,09:02:33,41.92,3.21,41.918,3.208,"41.92, 3.21","41.918, 3.208"
219366,41.918755,3.207360,6.4,167.0,8084935022,fit,2022-08-04 09:02:43+00:00,2022-08-04,09:02:43,41.92,3.21,41.919,3.207,"41.92, 3.21","41.919, 3.207"
219367,41.918970,3.207022,6.6,169.0,8084935022,fit,2022-08-04 09:02:53+00:00,2022-08-04,09:02:53,41.92,3.21,41.919,3.207,"41.92, 3.21","41.919, 3.207"
219368,41.919158,3.206738,7.2,175.0,8084935022,fit,2022-08-04 09:03:03+00:00,2022-08-04,09:03:03,41.92,3.21,41.919,3.207,"41.92, 3.21","41.919, 3.207"


In [None]:
geo[geo["activity_id"]==12672351401]

Unnamed: 0,latitude,longitude,elevation,heart_rate,activity_id,source,timestamp,date,time_only,latitude_round(2),longitude_round(2),latitude_round(3),longitude_round(3),coordinates_round(2),coordinates_round(3)


In [None]:
geo_10km = geo[geo["activity_id"].isin(activities_10km)]
geo_10km

Unnamed: 0,latitude,longitude,elevation,heart_rate,activity_id,source,timestamp,date,time_only,latitude_round(2),longitude_round(2),latitude_round(3),longitude_round(3),coordinates_round(2),coordinates_round(3)


## Exploratory analysis: geo

## Routes Analysis

These points were plotted in Tableau (my preferred data viz tool for this challenge), highlighting each coordinate point by the number of activity IDs that contain it. Two main routes emerged: one within Barcelona city and one from the city up to Tibidabo mountain.

With the two main routes identified geographically, I used QGIS to label the coordinate points that belong to each of them, which helps identify the activities that represent each specific route.
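The count behind that Tableau view can also be reproduced in pandas. This is a minimal sketch on toy rows, not the author's Tableau workflow; it assumes the `coordinates_round(3)` and `activity_id` columns built earlier, and the name `cell_popularity` is hypothetical.

```python
import pandas as pd

# Toy points standing in for `geo`: three activities passing through two cells
points = pd.DataFrame({
    "activity_id": ["a", "a", "b", "b", "c"],
    "coordinates_round(3)": [
        "41.405, 2.139", "41.405, 2.139", "41.405, 2.139",
        "41.404, 2.141", "41.405, 2.139",
    ],
})

# Number of distinct activities that pass through each rounded coordinate
cell_popularity = (
    points.groupby("coordinates_round(3)")["activity_id"]
    .nunique()
    .sort_values(ascending=False)
    .rename("n_activities")
)
print(cell_popularity)
```

Counting distinct activities rather than raw points keeps a single slow activity from inflating a cell.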

In [None]:
main_routes=pd.read_csv("main_routes_geo.csv")
main_routes



Unnamed: 0,latitude,longitude,elevation,heart_rate,activity_id,timestamp,date,time_only,latitude_round(3),longitude_round(3),coordinates_round(3),latitude_round(2),longitude_round(2),coordinates_round(2),route_type
0,41.405210,2.139351,143.724637,,1123775595,2017/05/31 19:00:05+00,2017/05/31,19:00:05,41.41,2.14,"41.405, 2.139",41.41,2.14,"41.41, 2.14",route_3
1,41.404799,2.139954,129.612401,,1123775595,2017/05/31 19:00:15+00,2017/05/31,19:00:15,41.40,2.14,"41.405, 2.14",41.40,2.14,"41.4, 2.14",route_3
2,41.404625,2.140207,126.000000,,1123775595,2017/05/31 19:00:54+00,2017/05/31,19:00:54,41.40,2.14,"41.405, 2.14",41.40,2.14,"41.4, 2.14",route_3
3,41.404295,2.140949,122.500000,,1123775595,2017/05/31 19:01:04+00,2017/05/31,19:01:04,41.40,2.14,"41.404, 2.141",41.40,2.14,"41.4, 2.14",route_3
4,41.403924,2.141883,113.500000,,1123775595,2017/05/31 19:01:14+00,2017/05/31,19:01:14,41.40,2.14,"41.404, 2.142",41.40,2.14,"41.4, 2.14",route_3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219422,41.918458,3.207615,5.600000,167.0,8084935022,2022/08/04 09:02:33+00,2022/08/04,09:02:33,41.92,3.21,"41.918, 3.208",41.92,3.21,"41.92, 3.21",
219423,41.918755,3.207360,6.400000,167.0,8084935022,2022/08/04 09:02:43+00,2022/08/04,09:02:43,41.92,3.21,"41.919, 3.207",41.92,3.21,"41.92, 3.21",
219424,41.918970,3.207022,6.600000,169.0,8084935022,2022/08/04 09:02:53+00,2022/08/04,09:02:53,41.92,3.21,"41.919, 3.207",41.92,3.21,"41.92, 3.21",
219425,41.919158,3.206738,7.200000,175.0,8084935022,2022/08/04 09:03:03+00,2022/08/04,09:03:03,41.92,3.21,"41.919, 3.207",41.92,3.21,"41.92, 3.21",


In [None]:
#Route-type labels defined
main_routes["route_type"].unique()
#route_1 - From home heading to Diagonal avenue
#route_2 - Along Diagonal avenue
#route_3 - From Diagonal avenue up to Tibidabo Mountain
#route_4 - Inside Tibidabo Mountain

array(['route_3', 'route_1', 'route_4', 'route_2', nan], dtype=object)

In [None]:
route1= main_routes.loc[main_routes["route_type"] == "route_1", ["activity_id"]].drop_duplicates()
route1

Unnamed: 0,activity_id
9560,2348263869
9561,7804207200
9596,7807853712
9638,7810622142
9670,7857505251
...,...
13346,14682543047
13362,14794199463
13397,14834067547
13403,14938821614


In [None]:
route2= main_routes.loc[main_routes["route_type"] == "route_2", ["activity_id"]].drop_duplicates()
route2

Unnamed: 0,activity_id
34671,1124538066
34727,1133519041
34774,1171891970
34775,1175059913
34786,1180108789
...,...
44036,14663662891
44039,14682543047
44043,14794199463
44220,14938821614


In [None]:
route3= main_routes.loc[main_routes["route_type"] == "route_3", ["activity_id"]].drop_duplicates()
route3

Unnamed: 0,activity_id
0,1123775595
13,1123799470
24,1124538066
26,1125814009
29,1133519041
...,...
9169,14706569004
9175,14898672637
9208,14938821614
9359,14959900611


In [None]:
route4= main_routes.loc[main_routes["route_type"] == "route_4", ["activity_id"]].drop_duplicates()
route4

Unnamed: 0,activity_id
13445,1123756289
13546,1123775595
13717,1123792359
13878,1123799470
14040,1124538066
...,...
33672,14938821614
33852,14959900611
34105,14969212097
34284,14974922280
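The four per-route cells above can also be condensed into one table that shows which segments each activity touches. A minimal sketch on toy rows, assuming the `activity_id` and `route_type` columns of `main_routes`; the names `labelled` and `coverage` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for `main_routes`; None mimics the unlabelled points
labelled = pd.DataFrame({
    "activity_id": [1, 1, 2, 2, 3, 3],
    "route_type": ["route_1", "route_2", "route_1", None, "route_3", "route_4"],
})

# Drop unlabelled points, then build an activity x route table of booleans
known = labelled.dropna(subset=["route_type"])
coverage = pd.crosstab(known["activity_id"], known["route_type"]) > 0
print(coverage)

# Example query: activities that include both Tibidabo segments
tibidabo = coverage[coverage["route_3"] & coverage["route_4"]].index.tolist()
print(tibidabo)
```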


In [None]:
activities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 477 entries, 0 to 476
Data columns (total 94 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Activity ID                   477 non-null    int64  
 1   Activity Date                 477 non-null    object 
 2   Activity Name                 477 non-null    object 
 3   Activity Type                 477 non-null    object 
 4   Activity Description          15 non-null     object 
 5   Elapsed Time                  477 non-null    int64  
 6   Distance                      477 non-null    object 
 7   Max Heart Rate                376 non-null    float64
 8   Relative Effort               376 non-null    float64
 9   Commute                       476 non-null    object 
 10  Activity Private Note         0 non-null      float64
 11  Activity Gear                 402 non-null    object 
 12  Filename                      477 non-null    object 
 13  Athle

In [None]:
activities_run=activities[activities["Activity Type"]=="Run"]
activities_run

Unnamed: 0,Activity ID,Activity Date,Activity Name,Activity Type,Activity Description,Elapsed Time,Distance,Max Heart Rate,Relative Effort,Commute,...,Activity Count,Total Steps,Carbon Saved,Pool Length,Training Load,Intensity,Average Grade Adjusted Pace,Timer Time,Total Cycles,Media
0,117289933,"Mar 2, 2014, 5:19:42 PM",Rutita. Meitat sense gps,Run,,4510,13.43,,,,...,,,,,,,,,,
1,180884072,"Aug 16, 2014, 5:56:42 AM",Camino etapa 1: irun/pasaia,Run,,14283,13.41,,,False,...,,,,,,,,,,
2,181438050,"Aug 17, 2014, 5:36:13 AM",Etapa 2 pasaia/orio (sense cobertura),Run,,28466,13.44,,,False,...,,,,,,,,,,
8,1021033275,"May 4, 2015, 6:51:06 PM",W,Run,,3412,11.59,202.0,179.0,False,...,,,,,,,,,,
15,1067723964,"Jul 4, 2017, 4:30:14 PM",Evening Run,Run,,653,2.83,,,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472,13963806133,"Mar 23, 2025, 7:37:01 AM",Trail del Senglar,Run,Cames encara tocades de la Marató però anem fe...,8588,21.35,195.0,573.0,False,...,,23308.0,,,428.0,134.0,3.321873,,,
473,13995026321,"Mar 26, 2025, 4:35:58 PM",Reconeixement Imperdibles,Run,Tot en ordre per divendres🤘🏼,7270,16.05,187.0,203.0,False,...,,18242.0,,,286.0,119.0,3.056847,,,
474,14014657128,"Mar 28, 2025, 5:37:58 PM",Imperdibles vol2,Run,,7525,12.25,182.0,89.0,False,...,,15434.0,,,230.0,106.0,2.213734,,,media/504C2981-C951-49C8-95C7-AFF508703AD6.jpg...
475,14023306661,"Mar 29, 2025, 4:46:36 PM",Afternoon Trail Run,Run,,6676,11.52,164.0,59.0,False,...,,15556.0,,,176.0,98.0,2.409797,,,


In [None]:
#Work on an explicit copy so the new column is not assigned to a slice of activities
activities_run = activities_run.copy()
activities_run['Distance(0)'] = activities_run['Distance'].astype(float).round(0)


In [None]:
activities_run["Distance(0)"].value_counts()

Distance(0)
10.0     30
12.0     27
13.0     21
6.0      21
16.0     20
14.0     19
7.0      17
8.0      15
15.0     14
18.0     14
11.0     14
9.0      12
22.0     10
5.0      10
19.0      9
17.0      9
21.0      8
20.0      8
4.0       6
23.0      5
25.0      5
2.0       5
3.0       5
28.0      4
24.0      3
38.0      3
27.0      3
26.0      3
31.0      2
55.0      2
29.0      2
43.0      2
44.0      1
33.0      1
125.0     1
175.0     1
1.0       1
103.0     1
35.0      1
0.0       1
74.0      1
62.0      1
32.0      1
42.0      1
36.0      1
53.0      1
39.0      1
Name: count, dtype: int64

In [None]:
import folium

# Center map on Barcelona
m = folium.Map(location=[41.3851, 2.1734], zoom_start=13)

# Example point
folium.Marker([41.387, 2.17], popup="Plaça Catalunya").add_to(m)

m.save('barcelona_map.html')  # Open this in a browser
