# Introduction and Background
Strava is a popular app used by athletes to track workouts. In our project we give users insight into their current performance and use the activity data of users who are currently outperforming them to recommend ways to improve.

Strava's segments feature takes stretches from a user's activity and compares their effort with that of every other user who completes the same stretch in one of their logged activities. Any segment on which a user ranks among the ten best is stored in their profile under top tens.

This project identifies everyone who outperformed the athlete in question on those segments and looks at their activity data to recommend ways to improve.

# General Approach

This project uses a web-scraping approach built on Scrapy and Selenium. While Strava does have a public API, it is severely limited and requires authentication by each user before any of their activity data can be retrieved. Since our project relies so heavily on analyzing competitors' activities, the API was not of much use to us.

With our web-scraping approach we followed this general flow:
- Log in to Strava with Scrapy
- Go to the athlete's Strava top-ten pages
- Get the links to all top-ten segments
- Go to the athlete page of every athlete who beat the user on that segment
    - Specifically, we go to the view of the athlete page that displays all of the activities from the month before that athlete placed in the top ten of the segment, ahead of the user.
- Log in through Selenium
- Gather and store all activity data from the links for the given time period of each athlete.

# Our Results

The result of our web scraping was a large amount of activity data. We first cleaned it up, then computed each competitor's totals for the month leading up to their top-ten achievement, and finally averaged those totals across competitors.

In [13]:
import pandas as pd
import numpy as np
df = pd.read_csv('individual_activities.csv')
df

Unnamed: 0,Name,Distance,Elevation,Time
0,Jae Kim,30.57 mi,"2,257 ft",1h 56m
1,Jae Kim,48.77 mi,"2,172 ft",3h 40m
2,Jae Kim,33.18 mi,"1,306 ft",2h 11m
3,Jae Kim,31.78 mi,"1,150 ft",2h 21m
4,Jae Kim,27.99 mi,"1,093 ft",1h 58m
...,...,...,...,...
4229,Sam Boardman,37.55 mi,"1,634 ft",2h 7m
4230,Sam Boardman,36.55 mi,272 ft,2h 13m
4231,Sam Boardman,13.81 mi,571 ft,48m 50s
4232,Sam Boardman,40.32 mi,"1,056 ft",2h 32m


Raw data

---

In [14]:
# Getting rid of duplicate activities and non-bike rides
df = df.drop_duplicates()
# Keep only durations that end in minutes (e.g. '1h 56m'); entries such as '48m 50s' are dropped.
# Note: the second positional argument of str.endswith is `na`, not a second suffix.
df = df[df['Time'].str.endswith('m', na=False)]
df = df[df['Distance'].str.endswith('mi', na=False)]
df = df[df['Elevation'].str.endswith('ft', na=False)]

# Stripping the units and casting the statistics to numeric types
df['Distance'] = df['Distance'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Distance'] = df['Distance'].astype(float)

df['Elevation'] = df['Elevation'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Elevation'] = df['Elevation'].str.replace(',', '').astype(int)

# Transforming time into minutes, to perform arithmetic
hrs = []
mins = []

for i in df['Time']:
    if 'h' in i:
        loc = i.find('h')
        hrs.append(int(i[:loc]))
        loc_min = i.find('m')
        mins.append(int(i[loc + 2:loc_min]))
    else:
        loc = i.find('m')
        hrs.append(0)  # no hours component; keeps both lists the same length
        mins.append(int(i[:loc]))

hours = np.array(hrs)
minutes = np.array(mins)
time_minutes = (hours * 60) + minutes

df = df.drop('Time', axis=1)
df['Time (min)'] = time_minutes

# Renaming the columns
df = df.rename(columns={'Distance': 'Distance (mi)', 'Elevation': 'Elevation (ft)', 'Name' : 'Athlete Name' })

df

Unnamed: 0,Athlete Name,Distance (mi),Elevation (ft),Time (min)
0,Jae Kim,30.57,2257,116
1,Jae Kim,48.77,2172,220
2,Jae Kim,33.18,1306,131
3,Jae Kim,31.78,1150,141
4,Jae Kim,27.99,1093,118
...,...,...,...,...
4227,Sam Boardman,33.05,2175,103
4229,Sam Boardman,37.55,1634,127
4230,Sam Boardman,36.55,272,133
4232,Sam Boardman,40.32,1056,152


Clean data: statistics converted to numbers, non-bike-ride activities removed

---

In [15]:
# Calculating the total monthly distances

df['Distance (mi)'] = df.groupby('Athlete Name')['Distance (mi)'].transform('sum')
df['Elevation (ft)'] = df.groupby('Athlete Name')['Elevation (ft)'].transform('sum')
df['Time (min)'] = df.groupby('Athlete Name')['Time (min)'].transform('sum')
df = df.rename(columns={'Distance (mi)': 'Monthly Distance (mi)', 
                        'Elevation (ft)': 'Monthly Elevation (ft)', 
                        'Time (min)': 'Monthly Time (min)'})

df = df.drop_duplicates()

# Putting time into hours
df['Monthly Time (min)'] = df['Monthly Time (min)'].apply(lambda x: x/60)
df = df.rename(columns={'Monthly Time (min)': 'Monthly Time (hours)'})

df



Unnamed: 0,Athlete Name,Monthly Distance (mi),Monthly Elevation (ft),Monthly Time (hours)
0,Jae Kim,397.37,19768,28.300000
15,Jeff Johnson,2198.54,188130,141.116667
34,Brook Sutt🍪n,360.16,33217,28.566667
51,Kerry Werner,203.87,10637,9.766667
60,Spencer Paxson,408.94,48279,42.650000
...,...,...,...,...
3994,Carl Anderson,305.68,41690,61.600000
4031,Nathan Lloyd,474.34,48040,32.466667
4043,Tyler Schwartz,1588.53,107353,90.550000
4072,Safa Brian,733.28,75964,46.666667


Monthly totals found for each athlete, duplicates removed, time converted back into hours

---

In [25]:
cleaned_data = df
average_distance = cleaned_data['Monthly Distance (mi)'].mean().round(2)
average_elevation = cleaned_data['Monthly Elevation (ft)'].mean().round(2)
average_time = cleaned_data['Monthly Time (hours)'].mean().round(2)

average_weekly_distance = (average_distance/4).round(2)
average_weekly_time = (average_time/4).round(2)

print(f'On average, people that did better than you on your top 10 segments rode {average_distance} miles \na month, climbed {average_elevation} feet per month, and spent {average_time} hours training. \nThis type of exertion amounts to roughly {average_weekly_distance} miles and {average_weekly_time} hours of training per week. \nGo get em champ!')


On average, people that did better than you on your top 10 segments rode 937.25 miles 
a month, climbed 69084.92 feet per month, and spent 58.04 hours training. 
This type of exertion amounts to roughly 234.31 miles and 14.51 hours of training per week. 
Go get em champ!


### Arch nemesis

In [26]:
links = pd.read_csv("Interval_pages.csv")
links['Athlete ID'] = links['Athlete page'].str.extract(r'/athletes/(\d+)')
links = links.dropna()
from collections import Counter
link_counts = Counter(links["Athlete ID"])
occurrences_df = pd.DataFrame(list(link_counts.items()), columns=["Athlete ID", "Occurrences"])
occurrences_df = occurrences_df.sort_values(by="Occurrences", ascending=False)
occurrences_df = occurrences_df.reset_index(drop=True)

print("After looking through your top ten achievements we found all users who beat you in your \ntop ten segments. Consider these people your \"arch nemeses.\" We recommend \nchecking out their profiles to see how you can improve:  ")

print(" ")
print(" ")

print(f'Your greatest arch nemesis beat you {occurrences_df.loc[0, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[0,"Athlete ID"]}')

print(" ")

print(f'Your second greatest arch nemesis beat you {occurrences_df.loc[1, "Occurrences"]} times in your current top ten segments.\nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[1,"Athlete ID"]}')

print(" ")

print(f'Your third greatest arch nemesis beat you {occurrences_df.loc[2, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+ occurrences_df.loc[2,"Athlete ID"]}')

After looking through your top ten achievements we found all users who beat you in your 
top ten segments. Consider these people your "arch nemeses." We recommend 
checking out their profiles to see how you can improve:  
 
 
Your greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/14197582
 
Your second greatest arch nemesis beat you 7 times in your current top ten segments.
Find their profile here: https://www.strava.com/athletes/5875016
 
Your third greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/497379


![Joe Emmerling](joe_strava.jpeg "Joe") [Joe Emmerling](https://www.strava.com/athletes/14197582)

![Danny Finneran](danny_strava.png) [Danny Finneran](https://www.strava.com/athletes/5875016)
 
![Tony Manzella](tony_strava2.jpeg) [Tony Manzella](https://www.strava.com/athletes/497379)


Arch nemesis analysis

---

# The Process

## The Scraper

- First we built our scraper to go to the Strava website, work through the user's top-ten segments, gather all of the athletes ranked ahead of the user, and find each one's activity page for the month before their achievement.
- Then we logged into Strava with Selenium, looped through all of the athlete activity pages, and stored their activity data for the month.
- We also stored the name on each activity so that we could later throw out other users' group activities that also appear on the page.
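
The "month before the achievement" step can be sketched on its own, separate from the spider. This is a minimal illustration, not the scraper's actual code: the helper name `previous_month_interval` is hypothetical, and the date format (three-letter month first, four-digit year last, e.g. `Mar 14, 2022`) is an assumption based on how the scraper slices the leaderboard dates.

```python
import re

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def previous_month_interval(date_text):
    """Return the month before `date_text` as a 'YYYYMM' string."""
    # Assumed format: three-letter month first, four-digit year last
    match = re.match(r'([A-Z][a-z]{2}).*?(\d{4})$', date_text.strip())
    if match is None:
        raise ValueError(f'unrecognised date: {date_text!r}')
    month = MONTHS.index(match.group(1)) + 1  # 1..12
    year = int(match.group(2))
    if month == 1:
        # January rolls back to December of the previous year
        return f'{year - 1}12'
    return f'{year}{month - 1:02d}'


print(previous_month_interval('Mar 14, 2022'))
print(previous_month_interval('Jan 5, 2023'))
```

The only special case is January, where the year also has to step back by one.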




In [None]:
# The scraper class
import scrapy
from scrapy.http import FormRequest
from scrapy_selenium import SeleniumRequest


class StravaScraper(scrapy.Spider):
    name = "test"
    athlete = '16735685'
    segments =[]
    
    allowed_domains = ['strava.com']
    start_urls = ['https://www.strava.com/login']
        
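    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding our attributes
        super().__init__(*args, **kwargs)
        self.date_dict = {'Jan':'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08',
                          'Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
        # Maps an achievement month to the month before it (Jan -> Dec of the previous year)
        self.month_before = {'Jan':'12','Feb':'01','Mar':'02','Apr':'03','May':'04','Jun':'05','Jul':'06','Aug':'07',
                          'Sep':'08','Oct':'09','Nov':'10','Dec':'11'}

        self.url_list = []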
    
    def parse(self, response):

        token = response.xpath('//*[@name="csrf-token"]/@content').get()
        return FormRequest.from_response(response,
                                        formdata={
                                            'authenticity_token': token,
                                            # Credentials redacted: supply your own Strava login
                                            'email': 'YOUR_EMAIL',
                                            'password': 'YOUR_PASSWORD',
                                        },
                                        #dont_filter=True,
                                        #meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                                        callback=self.parse_after_login)
    

    def parse_after_login(self, response):
        #Try except statement for if user has no top tens
        top_ten_page = f'https://www.strava.com/athletes/{self.athlete}/segments/leader?top_tens=true'
    
        yield scrapy.Request(url=top_ten_page, callback=self.parse_top_tens)

    def parse_top_tens(self, response):
        top_tens = response.css('table.my-segments tbody tr td a::attr(href)').getall()
        self.segments.extend(top_tens)

        # Follow pagination until there is no "next" link
        next_page = response.xpath('//li[@class="next_page"]/a[@rel="next"]/@href').get()
        if next_page:
            yield scrapy.Request(url='https://www.strava.com' + next_page, callback=self.parse_top_tens)

        # Request the leaderboard of each segment found on this page only;
        # earlier pages were already handled by their own callback
        for top_ten in top_tens:
            if '/segments/' in top_ten:
                yield scrapy.Request(url='https://www.strava.com' + top_ten, callback=self.parse_leaderboard)


    def parse_leaderboard(self, response):
        athlete_pages = response.css('td.athlete.track-click a::attr(href)').getall()
        dates = response.css('a[href^="/segment_efforts/"]::text').getall()

        # Convert each achievement date to the month before it, as "YYYYMM".
        # month_before maps Jan -> 12, so a January date also rolls the year back by one.
        edited_dates = []
        for d in dates:
            month = self.month_before[d[:3]]
            year = int(d[-4:])
            if month == "12":
                year -= 1
            edited_dates.append(f'{year}{month}')

        # Collect every athlete ranked ahead of the user, keyed by profile URL
        athletes_data = {}
        for athlete_url, date in zip(athlete_pages, edited_dates):
            if athlete_url.endswith(f'/athletes/{self.athlete}'):
                break
            athletes_data['https://www.strava.com' + athlete_url] = date

        for athlete_url, interval in athletes_data.items():
            # year_offset is computed per URL; 2023 is hard-coded as the reference year
            year_offset = 2023 - int(interval[:4])
            athlete_url = f'{athlete_url}#interval?interval={interval}&interval_type=month&chart_type=miles&year_offset={year_offset}'
            self.url_list.append(athlete_url)

        # Hand off to Selenium by loading the login page in the browser.
        # Scrapy's duplicate filter drops repeats of this request, so the page is loaded once.
        yield SeleniumRequest(url='https://www.strava.com/login',
                              callback=self.selenium_login,
                              wait_time=3)

    def selenium_login(self, response):
        # Named selenium_login so that it no longer overrides the Scrapy login in parse()
        # scrapy-selenium exposes the browser that loaded this page on the request meta
        driver = response.request.meta['driver']
        # Field names are assumed to match the form data used in parse(); adjust if the page differs
        username = driver.find_element("name", "email")
        password = driver.find_element("name", "password")
        # Credentials redacted: supply your own Strava login
        username.send_keys("YOUR_EMAIL")
        password.send_keys("YOUR_PASSWORD")
        driver.find_element("xpath", "//input[@name='login-button']").click()



    def parse_activities(self, response):
        '''
        limited_data = response.css('.limited')
        if limited_data:
            return
        
        stats = response.css('ul.list-stats.preview-stats.bottomless li div.stat')

        for stat in stats:
            distance_unit = stat.css('span.stat-subtext.caption::text').get()
            distance = stat.css('b.stat-text.value::text').get()
            print(distance,distance_unit)

        yield {
            'Distance Unit': distance_unit.strip() if distance_unit else None,
            'Distance': distance.strip() if distance else None
        }
        
        '''
        
        activity_values = response.css('div.------packages-ui-Stat-Stat-module__statValue--phtGK').getall()
        activity_units = response.css('div.------packages-ui-Stat-Stat-module__statValue--phtGK abbr.unit::attr(title)').getall()
    
        for value, unit in zip(activity_values, activity_units):
            yield {
                "Activity Value": value.strip(),  # Remove leading/trailing spaces
                "Activity Unit": unit if unit else "No unit specified"  # Provide default message if unit is not found
            }

## Data Processing and Analysis
### Training Recommendations

This gave us our activity data. The raw data contained 4234 activities.

In [17]:
import pandas as pd
import numpy as np
df = pd.read_csv('individual_activities.csv')
df

Unnamed: 0,Name,Distance,Elevation,Time
0,Jae Kim,30.57 mi,"2,257 ft",1h 56m
1,Jae Kim,48.77 mi,"2,172 ft",3h 40m
2,Jae Kim,33.18 mi,"1,306 ft",2h 11m
3,Jae Kim,31.78 mi,"1,150 ft",2h 21m
4,Jae Kim,27.99 mi,"1,093 ft",1h 58m
...,...,...,...,...
4229,Sam Boardman,37.55 mi,"1,634 ft",2h 7m
4230,Sam Boardman,36.55 mi,272 ft,2h 13m
4231,Sam Boardman,13.81 mi,571 ft,48m 50s
4232,Sam Boardman,40.32 mi,"1,056 ft",2h 32m


To clean up the data we converted the statistics to numeric types, including converting the time to minutes to help with arithmetic later. We also removed any activities that were not bike rides, as well as the group activities that were automatically scraped but did not belong to the user in question. This resulted in 2213 rows of data.

In [18]:
# Getting rid of duplicate activities and non-bike rides
df = df.drop_duplicates()
# Keep only durations that end in minutes (e.g. '1h 56m'); entries such as '48m 50s' are dropped.
# Note: the second positional argument of str.endswith is `na`, not a second suffix.
df = df[df['Time'].str.endswith('m', na=False)]
df = df[df['Distance'].str.endswith('mi', na=False)]
df = df[df['Elevation'].str.endswith('ft', na=False)]

# Stripping the units and casting the statistics to numeric types
df['Distance'] = df['Distance'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Distance'] = df['Distance'].astype(float)

df['Elevation'] = df['Elevation'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Elevation'] = df['Elevation'].str.replace(',', '').astype(int)

# Transforming time into minutes, to perform arithmetic
hrs = []
mins = []

for i in df['Time']:
    if 'h' in i:
        loc = i.find('h')
        hrs.append(int(i[:loc]))
        loc_min = i.find('m')
        mins.append(int(i[loc + 2:loc_min]))
    else:
        loc = i.find('m')
        hrs.append(0)  # no hours component; keeps both lists the same length
        mins.append(int(i[:loc]))

hours = np.array(hrs)
minutes = np.array(mins)
time_minutes = (hours * 60) + minutes

df = df.drop('Time', axis=1)
df['Time (min)'] = time_minutes

# Renaming the columns
df = df.rename(columns={'Distance': 'Distance (mi)', 'Elevation': 'Elevation (ft)', 'Name' : 'Athlete Name' })

df

Unnamed: 0,Athlete Name,Distance (mi),Elevation (ft),Time (min)
0,Jae Kim,30.57,2257,116
1,Jae Kim,48.77,2172,220
2,Jae Kim,33.18,1306,131
3,Jae Kim,31.78,1150,141
4,Jae Kim,27.99,1093,118
...,...,...,...,...
4227,Sam Boardman,33.05,2175,103
4229,Sam Boardman,37.55,1634,127
4230,Sam Boardman,36.55,272,133
4232,Sam Boardman,40.32,1056,152
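

The unit stripping and time conversion in the cell above can also be written as two small helpers. This is only a sketch of the same idea using regular expressions; the function names `duration_to_minutes` and `strip_unit` are hypothetical and are not part of the notebook.

```python
import re


def duration_to_minutes(text):
    """Convert a string such as '1h 56m' or '48m 50s' to whole minutes."""
    hours = re.search(r'(\d+)h', text)
    minutes = re.search(r'(\d+)m', text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total  # any seconds component is discarded


def strip_unit(text, unit):
    """Turn '2,257 ft' into 2257.0 by dropping the unit and thousands separator."""
    if not text.endswith(unit):
        raise ValueError(f'{text!r} does not end with {unit!r}')
    return float(text[:-len(unit)].replace(',', ''))


print(duration_to_minutes('1h 56m'))
print(strip_unit('2,257 ft', 'ft'))
print(strip_unit('30.57 mi', 'mi'))
```

Handling the hours and minutes independently means a duration without an hours part needs no special branch.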


We then condensed the dataframe to find monthly totals for each rider. Now there is one row for each athlete with their monthly bike-ride totals.

In [19]:
# Calculating the total monthly distances

df['Distance (mi)'] = df.groupby('Athlete Name')['Distance (mi)'].transform('sum')
df['Elevation (ft)'] = df.groupby('Athlete Name')['Elevation (ft)'].transform('sum')
df['Time (min)'] = df.groupby('Athlete Name')['Time (min)'].transform('sum')
df = df.rename(columns={'Distance (mi)': 'Monthly Distance (mi)', 
                        'Elevation (ft)': 'Monthly Elevation (ft)', 
                        'Time (min)': 'Monthly Time (min)'})

df = df.drop_duplicates()

# Putting time into hours
df['Monthly Time (min)'] = df['Monthly Time (min)'].apply(lambda x: x/60)
df = df.rename(columns={'Monthly Time (min)': 'Monthly Time (hours)'})

df

Unnamed: 0,Athlete Name,Monthly Distance (mi),Monthly Elevation (ft),Monthly Time (hours)
0,Jae Kim,397.37,19768,28.300000
15,Jeff Johnson,2198.54,188130,141.116667
34,Brook Sutt🍪n,360.16,33217,28.566667
51,Kerry Werner,203.87,10637,9.766667
60,Spencer Paxson,408.94,48279,42.650000
...,...,...,...,...
3994,Carl Anderson,305.68,41690,61.600000
4031,Nathan Lloyd,474.34,48040,32.466667
4043,Tyler Schwartz,1588.53,107353,90.550000
4072,Safa Brian,733.28,75964,46.666667


Finally, we averaged these statistics across all athletes, which gave us our training recommendation, both per month and per week:

In [20]:
cleaned_data = df
average_distance = cleaned_data['Monthly Distance (mi)'].mean().round(2)
average_elevation = cleaned_data['Monthly Elevation (ft)'].mean().round(2)
average_time = cleaned_data['Monthly Time (hours)'].mean().round(2)

average_weekly_distance = (average_distance/4).round(2)
average_weekly_time = (average_time/4).round(2)

print(f'On average, people that did better than you on your top 10 segments rode {average_distance} miles a month, climbed {average_elevation} feet per month, \nand spent {average_time} hours training. This type of exertion amounts to roughly {average_weekly_distance} miles and {average_weekly_time} hours of training per week. \nGo get em champ!')


On average, people that did better than you on your top 10 segments rode 937.25 miles a month, climbed 69084.92 feet per month, 
and spent 58.04 hours training. This type of exertion amounts to roughly 234.31 miles and 14.51 hours of training per week. 
Go get em champ!
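

The weekly figures come from dividing the monthly averages by 4, which is why the message says "roughly". As a rough illustration of how much that approximation matters, the sketch below compares it with dividing by 52/12 weeks per month. The `weekly_estimate` helper and the `WEEKS_PER_MONTH` constant are assumptions introduced here, not part of the notebook.

```python
# Assumption: an average month is 52 / 12 weeks (about 4.33); the notebook divides by 4
WEEKS_PER_MONTH = 52 / 12


def weekly_estimate(monthly_total, weeks_per_month=4):
    """Scale a monthly total down to a weekly figure, rounded to 2 decimals."""
    return round(monthly_total / weeks_per_month, 2)


monthly_distance = 937.25  # miles, from the output above
monthly_time = 58.04       # hours, from the output above

print(weekly_estimate(monthly_distance), weekly_estimate(monthly_time))
print(weekly_estimate(monthly_distance, WEEKS_PER_MONTH),
      weekly_estimate(monthly_time, WEEKS_PER_MONTH))
```

Dividing by 4 slightly overstates the weekly load, so the recommendation errs on the demanding side.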


### Arch nemesis

For the arch nemesis section we took the list of links we had passed to Selenium and used it to find the athletes who beat the user most often. Since each link is specific to a time frame, we isolated the athlete ID before finding the three most frequently occurring athletes:

In [22]:
links = pd.read_csv("Interval_pages.csv")
links['Athlete ID'] = links['Athlete page'].str.extract(r'/athletes/(\d+)')
links = links.dropna()
from collections import Counter
link_counts = Counter(links["Athlete ID"])
occurrences_df = pd.DataFrame(list(link_counts.items()), columns=["Athlete ID", "Occurrences"])
occurrences_df = occurrences_df.sort_values(by="Occurrences", ascending=False)
occurrences_df = occurrences_df.reset_index(drop=True)

occurrences_df

Unnamed: 0,Athlete ID,Occurrences
0,14197582,7
1,5875016,7
2,497379,7
3,5350307,6
4,571689,5
...,...,...
180,34799460,1
181,45025631,1
182,497009,1
183,65336496,1
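

The same count can be sketched without pandas, which may make the logic easier to follow. The `top_rivals` helper and the sample URLs (with made-up athlete IDs 111, 222 and 333) are hypothetical and only illustrate the extract-then-count idea.

```python
import re
from collections import Counter


def top_rivals(urls, n=3):
    """Return the n most frequent athlete IDs as (id, count) pairs."""
    ids = []
    for url in urls:
        # Same pattern as the cell above: the digits after /athletes/
        match = re.search(r'/athletes/(\d+)', url)
        if match:
            ids.append(match.group(1))
    return Counter(ids).most_common(n)


# Made-up interval links; the query string differs but the athlete ID repeats
sample_urls = [
    'https://www.strava.com/athletes/111#interval?interval=202202&interval_type=month',
    'https://www.strava.com/athletes/222#interval?interval=202105&interval_type=month',
    'https://www.strava.com/athletes/111#interval?interval=202212&interval_type=month',
    'https://www.strava.com/athletes/333#interval?interval=202001&interval_type=month',
    'https://www.strava.com/athletes/222#interval?interval=202110&interval_type=month',
    'https://www.strava.com/athletes/111#interval?interval=202303&interval_type=month',
    'https://www.strava.com/login',  # no athlete ID, skipped
]

print(top_rivals(sample_urls, n=2))
```

Extracting the ID first matters because two links to the same athlete usually differ in their interval, so counting whole URLs would treat them as different people.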


In [21]:
print("After looking through your top ten achievements we found all users who beat you in your top ten segments. \nConsider these people your \"arch nemeses.\" We recommend checking out their profiles to see how you can improve:  ")

print(" ")
print(" ")

print(f'Your greatest arch nemesis beat you {occurrences_df.loc[0, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[0,"Athlete ID"]}')

print(" ")

print(f'Your second greatest arch nemesis beat you {occurrences_df.loc[1, "Occurrences"]} times in your current top ten segments.\nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[1,"Athlete ID"]}')

print(" ")

print(f'Your third greatest arch nemesis beat you {occurrences_df.loc[2, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+ occurrences_df.loc[2,"Athlete ID"]}')

After looking through your top ten achievements we found all users who beat you in your top ten segments. 
Consider these people your "arch nemeses." We recommend checking out their profiles to see how you can improve:  
 
 
Your greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/14197582
 
Your second greatest arch nemesis beat you 7 times in your current top ten segments.
Find their profile here: https://www.strava.com/athletes/5875016
 
Your third greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/497379


## Conclusion