Step-by-step guide to analyze your own Strava data using Python
In this repo, I'll share the steps I took to perform an Exploratory Data Analysis (EDA) of my Strava data. In the first part, I'll show you how I got my Strava data, and in the second part, I'll show you my results. To keep this README short, please refer to this Jupyter notebook 🗒️🗒️🗒️ for the step-by-step code that I used. I've been coding for quite some time and realized that I haven't made any decent projects, hence this first small analysis.
To do any analysis, you first need to get your data. I got mine as a CSV file from the Strava website. To get it, log in to your account, click the down arrow beside your profile picture in Strava, then select My Account on the right.
Afterwards, scroll down the resulting page and click the 'Get Started' button in the 'Download or Delete Your Account' section. Don't worry, this won't delete your account.
On the next page, go to the 'Download Request (optional)' section and click the 'Request your archive' button. You'll see 'Request received' afterwards, and the zip file will be sent to your registered email address.
The next step is to check your email and get the zip file. It may be in your Spam/Junk folder depending on your settings (at least that's where I found mine). Download and extract it, and it should look like this:
The file you're interested in is 'activities.csv'. This is the one we'll use in our analysis.
Then you can create a Jupyter notebook in the IDE of your choice. I did this analysis in VS Code. For the step-by-step code, please refer to this Jupyter notebook.
After cleaning the data, I started with a summary statistics table so I could easily inspect the usual statistical values I'm interested in. Afterwards, I created a pairs plot chart, which illustrates the distribution of each variable and its relationship with the other variables. Then I made a box plot to show the dispersion of my ride data and whether there are signs of skewness. Next, I illustrated the distance I've covered per month with a bar plot. Lastly, I made another box plot to see in which quarter I ride my bike outdoors the most.
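As a rough sketch of the loading step (the notebook has the actual code), the CSV could be read with pandas along these lines. The inline sample stands in for 'activities.csv'. Its column names are assumed from the tables in this README, the third row is made up, `load_rides` is a hypothetical helper name, and keeping only 'Ride' activities is my assumption about the cleaning step.

```python
import io

import pandas as pd

# Tiny stand-in for activities.csv; the real export has many more columns.
# Column names are assumptions, and the "Run" row is invented for illustration.
SAMPLE_CSV = """Activity ID,Activity Date,Activity Type,Elapsed Time,Distance
3748239651,"Jul 11, 2020, 9:51:06 PM",Ride,3789,14.95
3979407827,"Aug 28, 2020, 9:45:19 PM",Ride,2417,8.76
4000000000,"Sep 1, 2020, 6:00:00 AM",Run,1800,5.00
"""


def load_rides(source):
    """Read a Strava-style CSV and keep only rides, with lowercase column names."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.lower()
    # Parse dates such as "Jul 11, 2020, 9:51:06 PM" into proper timestamps.
    df["activity_date"] = pd.to_datetime(
        df["activity date"], format="%b %d, %Y, %I:%M:%S %p"
    )
    return df[df["activity type"] == "Ride"].reset_index(drop=True)


rides = load_rides(io.StringIO(SAMPLE_CSV))
print(rides[["activity id", "activity_date", "distance"]])
```

With the real export, the path to the extracted file would be passed instead of the in-memory sample, for example `load_rides("activities.csv")`.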
statistic | activity id | elapsed time | moving time | distance | max heart rate | elevation gain | max speed | calories | activity_date | year | dayofyear | elapsed hour | km per hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.380000e+02 | 138.0 | 138.0 | 138.0 | 34.0 | 138.0 | 138.0 | 86.0 | 138 | 138.0 | 138.0 | 138.0 | 138.0 |
mean | 6.892066e+09 | 38999.0 | 6876.0 | 34.0 | 176.0 | 155.0 | 12.0 | 1078.0 | 2022-03-18 23:33:04.362318848 | 2022.0 | 228.0 | 11.0 | 12.0 |
min | 3.748056e+09 | 171.0 | 171.0 | 1.0 | 118.0 | 0.0 | 5.0 | 20.0 | 2020-07-04 12:57:37 | 2020.0 | 14.0 | 0.0 | 0.0 |
25% | 5.807027e+09 | 2619.0 | 1843.0 | 9.0 | 171.0 | 32.0 | 10.0 | 277.0 | 2021-08-17 06:39:24.750000128 | 2021.0 | 168.0 | 1.0 | 8.0 |
50% | 7.058098e+09 | 7865.0 | 4951.0 | 25.0 | 178.0 | 104.0 | 12.0 | 579.0 | 2022-04-29 05:08:36 | 2022.0 | 232.0 | 2.0 | 11.0 |
75% | 8.008873e+09 | 20309.0 | 9597.0 | 45.0 | 181.0 | 192.0 | 14.0 | 1239.0 | 2022-10-23 22:42:44 | 2022.0 | 299.0 | 6.0 | 14.0 |
max | 1.056717e+10 | 3055629.0 | 38296.0 | 210.0 | 241.0 | 982.0 | 23.0 | 7699.0 | 2023-12-29 05:53:09 | 2023.0 | 365.0 | 849.0 | 52.0 |
std | 1.535680e+09 | 264373.0 | 6911.0 | 35.0 | 18.0 | 178.0 | 3.0 | 1311.0 | NaN | 1.0 | 89.0 | 73.0 | 6.0 |
The good thing about showing the summary statistics first is that the minimum and maximum values immediately reveal any extreme or funky values. In my data, my maximum elapsed time was about 849 hrs (3,055,629 seconds ÷ 60 ÷ 60), and I can't think of any ride where I was on the bike for that long. My max moving time, however, was about 10.64 hrs, which was when I did my first Laguna Loop.
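The seconds-to-hours conversion above can be checked with a few lines of Python. This is only a worked example using the two maximum values from the summary table; `seconds_to_hours` is a helper name I made up for it.

```python
SECONDS_PER_HOUR = 60 * 60


def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


max_elapsed_hours = seconds_to_hours(3_055_629)  # max elapsed time from the table
max_moving_hours = seconds_to_hours(38_296)      # max moving time from the table

print(f"{max_elapsed_hours:.2f}")  # 848.79
print(f"{max_moving_hours:.2f}")   # 10.64
```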
Another funky value was my max heart rate, which peaked at 241bpm. I don't have the best heart condition, but I'm pretty sure I won't reach that high of a heart rate.
Those two might have skewed my data, so I think we should exclude them before proceeding with the other charts and analysis. As far as I recall, I haven't done an activity that lasted more than 20 hours, so I'll filter out anything longer than that. As for my max heart rate, there were times when my heart rate monitor acted strangely, which is why some of the readings were off. The highest reasonable heart rate I can recall was 210bpm, so I'll set another filter to exclude rides with a heart rate greater than 210bpm.
index | activity id | activity date | activity type | elapsed time | moving time | distance | max heart rate | elevation gain | max speed | calories | activity_date | start_time | start_date_local | month | year | dayofyear | elapsed hour | km per hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3748239651 | Jul 11, 2020, 9:51:06 PM | Ride | 3789 | 3115.0 | 14.95 | NaN | 64.584747 | 17.700001 | NaN | 2020-07-11 21:51:06 | 21:51:06 | 2020-07-11 | July | 2020 | 193 | 1.052500 | 14.204276 |
2 | 3979407827 | Aug 28, 2020, 9:45:19 PM | Ride | 2417 | 1564.0 | 8.76 | NaN | 12.994913 | 11.300000 | NaN | 2020-08-28 21:45:19 | 21:45:19 | 2020-08-28 | August | 2020 | 241 | 0.671389 | 13.047580 |
3 | 4051298220 | Sep 12, 2020, 10:17:13 PM | Ride | 12127 | 8909.0 | 42.74 | NaN | 148.551956 | 11.400000 | NaN | 2020-09-12 22:17:13 | 22:17:13 | 2020-09-12 | September | 2020 | 256 | 3.368611 | 12.687722 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
194 | 10386367323 | Dec 15, 2023, 11:27:39 PM | Ride | 2055 | 1802.0 | 8.82 | 168.0 | 59.000000 | 11.894043 | 286.0 | 2023-12-15 23:27:39 | 23:27:39 | 2023-12-15 | December | 2023 | 349 | 0.570833 | 15.451095 |
196 | 10567168689 | Dec 29, 2023, 5:53:09 AM | Ride | 4504 | 446.0 | 1.65 | NaN | 15.000000 | 7.702002 | 61.0 | 2023-12-29 05:53:09 | 05:53:09 | 2023-12-29 | December | 2023 | 363 | 1.251111 | 1.318828 |
After excluding those activities, I'm down to 135 records from the original 138.
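A minimal sketch of those two filters is shown below; the notebook may differ in detail. The rows are made up for illustration, the column names follow the tables above, and the constant names are my own. Note that rides without a heart rate reading (NaN) are kept, since only readings above the cap should be dropped.

```python
import numpy as np
import pandas as pd

MAX_ELAPSED_HOURS = 20  # assumed cap: no activity lasted longer than this
MAX_HEART_RATE = 210    # assumed cap: highest plausible reading in bpm

# Made-up rows for illustration; column names follow the tables above.
rides = pd.DataFrame({
    "elapsed time": [3789, 3055629, 12127, 2055],     # seconds
    "max heart rate": [np.nan, 150.0, 241.0, 168.0],  # bpm, NaN = no reading
})
rides["elapsed hour"] = rides["elapsed time"] / 3600

plausible_duration = rides["elapsed hour"] <= MAX_ELAPSED_HOURS
# Keep rides with no heart rate reading; drop only readings above the cap.
plausible_heart_rate = (
    rides["max heart rate"].isna() | (rides["max heart rate"] <= MAX_HEART_RATE)
)

cleaned = rides[plausible_duration & plausible_heart_rate]
print(len(rides), "->", len(cleaned))  # 4 -> 2
```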
A pairs plot is very useful because it gives you an immediate visual of how the variables you want to analyze relate to one another.
As expected, there is a positive relationship between distance and calories: the longer the distance, the higher the number of calories burned. For most of my rides under 100km, I've kept my speed under 30km/hr. And for those rides where I was able to record my heart rate, I was over 160bpm for most of them. This is definitely something I can improve, as I want to keep my efforts in Zone 2. And perhaps I should replace my heart rate monitor.
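For anyone who wants to try a pairs plot, here is one possible sketch using `scatter_matrix` from pandas, which is not necessarily the plotting function used in the notebook. The values are made up for illustration and the choice of three columns is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up values for illustration only; column names follow the tables above.
rides = pd.DataFrame({
    "distance": [14.95, 8.76, 42.74, 8.82, 25.0],      # km
    "km per hour": [14.2, 13.0, 12.7, 15.5, 11.0],
    "calories": [310.0, 190.0, 900.0, 286.0, 579.0],
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(rides, figsize=(8, 8), diagonal="hist")
print(axes.shape)  # (3, 3)
```

In a notebook the figure is displayed inline; in a script, `axes[0, 0].figure.savefig(...)` would write it to a file.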
Next stop is making a box-plot chart for the distance I've covered.
The first observation that immediately struck me is the lack of 2024 data. Then I realized that, as of writing, I haven't been able to record a bike ride outside this year because I suffered a gout attack at the start of the year. I've mostly had indoor rides so far this year.
The extremes were highest in 2023, when I did my Audax Populaire event in January (140km) and a Laguna Loop ride (>200km). In previous years, however, my rides mostly ranged up to 100km, with the majority under 50km.
I then proceeded into illustrating the total distance that I've covered on a monthly basis.
Here, the total distance I covered per month is shown. There is no data for the first half of 2020 because I only started recording my rides when I got my fixed gear in July.
My biggest month was December 2022, when I completed the Rapha Festive 500 challenge.
I guess the best insight I can take from this is that I should ride more outside.
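The monthly totals behind a bar plot like this could be computed roughly as follows. This is a sketch with made-up rides, not the notebook's code; the column names follow the tables above.

```python
import pandas as pd

# Made-up rides for illustration; column names follow the tables above.
rides = pd.DataFrame({
    "activity_date": pd.to_datetime([
        "2022-12-24 06:00:00", "2022-12-26 06:30:00",
        "2023-01-15 05:00:00", "2023-01-29 07:00:00",
    ]),
    "distance": [80.0, 120.0, 140.0, 30.0],  # km
})

# Group rides by calendar month (year + month) and sum the distance.
monthly_km = (
    rides.groupby(rides["activity_date"].dt.to_period("M"))["distance"].sum()
)
print(monthly_km)
# monthly_km.plot.bar() would draw the bar chart (requires matplotlib).
```

Grouping by a monthly period rather than by month name keeps December 2022 and December 2023 in separate bars.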
And lastly, I created a box plot for the quarterly analysis. The chart shows that I ride the most during the first quarter of the year.
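A quarterly grouping like the one above can be sketched by deriving the quarter from the activity date. The rides are made up, and using distance as the plotted variable is my assumption for this example.

```python
import pandas as pd

# Made-up rides for illustration; column names follow the tables above.
rides = pd.DataFrame({
    "activity_date": pd.to_datetime(
        ["2023-01-15", "2023-02-11", "2023-05-20", "2023-11-04"]
    ),
    "distance": [140.0, 45.0, 30.0, 25.0],  # km
})
rides["quarter"] = rides["activity_date"].dt.quarter  # 1 to 4

# Per-quarter numbers that a box plot would summarize visually.
per_quarter = rides.groupby("quarter")["distance"].agg(["count", "median", "max"])
print(per_quarter)
# rides.boxplot(column="distance", by="quarter") would draw it (requires matplotlib).
```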
Again, I should really ride more outside.