# GitHub

https://github.com/max-eisenberg/Project

# Project Description
Strava is a popular app that athletes use to track workouts. Our project gives users insight into their current performance and uses the activity data of athletes who outperform them to recommend how to improve. Strava's segments feature takes stretches of a logged activity and compares the user's effort with those of other users who completed the same stretch in one of their own logged activities. Any segment on which a user ranks among the ten best is stored in their profile under "top tens."

![](flowchart.png)

# General Approach

This project takes a web-scraping approach built on Scrapy and Selenium. Strava does offer a public API, but it is severely limited and requires authentication by each user before any activity data can be retrieved. Because our project relies so heavily on analyzing competitors' activities, the API was of little use for that part of the work. Instead, we look at everyone who outperformed the user on their top-ten segments and use those athletes' activity data to help the user improve. Our scraper first logs in to Strava, goes to the top tens page, visits every top-ten segment, collects the athletes who beat the user on each one, and then goes to each athlete's page to collect all of their activity for the month leading up to their achievement.


# The Process

## Flow of work

With our web-scraping approach we followed this general flow:
- Log in to Strava with Scrapy
- Go to the athlete's top tens page
- Get links to all top-ten segments
- Go to the athlete page of every athlete who beat the user on that segment
    - Specifically, we go to the athlete page that displays all activities from the month before that athlete placed in the top ten on the segment, ahead of the user.
- Log in through Selenium
- Gather and store all activity data from the links for the given time period of each athlete.

## The Scraper
The scraper first goes to the Strava website and logs in, then goes to the top-ten segments, gathers all of the athletes ahead of the user, and finds each athlete's activity page for the month before their achievement, returning a list of links. This satisfies the project requirement of "web scraping." For our user, the scraper returned a list of 275 athlete pages, each of which was then analyzed for statistics.

Each returned link points to a specific page of an athlete's profile that lists their activities for the month prior to their top-ten achievement.
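
As a minimal sketch of how such a link's date parameters can be derived, the helper below turns an achievement date into the previous month's `interval` value and a `year_offset`. The function name `previous_month_interval`, the `MONTHS` list, and the sample date strings are hypothetical; it assumes, as the spider does, that the date text starts with a three-letter month and ends with a four-digit year, and that offsets are counted back from 2023.

```python
# Month abbreviations in calendar order (assumed to match the leaderboard text)
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def previous_month_interval(date_text, reference_year=2023):
    """Return (interval, year_offset) for the month before the achievement.

    interval is a 'YYYYMM' string; year_offset is reference_year minus the
    interval's year.
    """
    month = MONTHS.index(date_text[:3]) + 1
    year = int(date_text[-4:])
    if month == 1:
        # The month before January is December of the previous year
        month, year = 12, year - 1
    else:
        month -= 1
    return f"{year}{month:02d}", reference_year - year


# Hypothetical achievement dates
print(previous_month_interval("Jan 5, 2021"))
print(previous_month_interval("Apr 12, 2018"))
```

Deriving the offset from the interval's own year keeps the two URL parameters consistent for every athlete, including achievements in January.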

In [2]:
import scrapy
import selenium
from selenium import webdriver
from scrapy.http import FormRequest
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time



class StravaScraper(scrapy.Spider):
    name = "test"
    athlete = '16735685'
    segments =[]
    
    allowed_domains = ['strava.com']
    start_urls = ['https://www.strava.com/login']
    
    # Month lookups used to build the athlete interval URL
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Counts leaderboard pages parsed (incremented in parse_leaderboard)
        self.counter = 0
        self.date_dict = {'Jan':'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08',
                          'Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
        self.month_before = {'Jan':'12','Feb':'01','Mar':'02','Apr':'03','May':'04','Jun':'05','Jul':'06','Aug':'07',
                          'Sep':'08','Oct':'09','Nov':'10','Dec':'11'}

    #Login to strava
    def parse(self, response):
        token = response.xpath('//*[@name="csrf-token"]/@content').get()
        return FormRequest.from_response(response,
                                        formdata={
                                            'authenticity_token': token,
                                            
                                            'email': 'sashaprs@gmail.com',
                                            'password': 'PIC16BProject',
                                        },
                                        #dont_filter=True,
                                        #eta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                                        callback=self.parse_after_login)
    
    #Direct to athlete top ten page
    def parse_after_login(self, response):
        top_ten_page = f'https://www.strava.com/athletes/{self.athlete}/segments/leader?top_tens=true'
        yield scrapy.Request(url=top_ten_page, callback=self.parse_top_tens)

    # Yield all top ten segments on page, direct to next page and repeat
    def parse_top_tens(self, response):
        top_tens = response.css('table.my-segments tbody tr td a::attr(href)').getall()
        self.segments.extend(top_tens)
        next_page = response.xpath('//li[@class="next_page"]/a[@rel="next"]/@href').get()
        if(next_page):
            next_page_url = 'https://www.strava.com' + next_page
        
            yield scrapy.Request(url=next_page_url, callback = self.parse_top_tens)
            
        # Request only the segments found on this page; earlier pages were already requested
        for top_ten in top_tens:
            #Make sure link is a segment link 
            if '/segments/' in top_ten:
                top_ten = 'https://www.strava.com' + top_ten
                yield scrapy.Request(url=top_ten, callback = self.parse_leaderboard)
    # get athlete links from segments page
    def parse_leaderboard(self, response):
        self.counter +=1
        self.url_list = [] 
        athlete_pages = response.css('td.athlete.track-click a::attr(href)').getall()
        dates = response.css('a[href^="/segment_efforts/"]::text').getall()
        # month before achievement
        edited_dates = []
        # Record date of achievement in numerical form
        for d in dates:
            month = self.month_before[d[:3]]
            if month == "12":
                year = int(d[-4:])
                year = str(year-1)
            else:
                year = d[-4:]
            edited_dates.append(year+month)

        athlete = '16735685'  # Ensure this is a string

        athletes_data = {}  
        #Stop recording athlete links at user (Only get links for people who beat user)
        #Store athlete url and date of achievement together
        for i, athlete_url in enumerate(athlete_pages):
            if athlete_url.endswith(f'/athletes/{athlete}'):
                break
            athlete_url = 'https://www.strava.com' + athlete_url
            athletes_data[athlete_url] = edited_dates[i]
        #Format athlete interval link
        for athlete_url, date in athletes_data.items():
            interval = str(date)
            # Use this athlete's own interval year (first four digits), not the
            # year left over from the last iteration of the dates loop above
            year_offset = str(2023 - int(interval[:4]))
            athlete_url = f'{athlete_url}#interval?interval={interval}&interval_type=month&chart_type=miles&year_offset={year_offset}'
            yield {'Athlete': athlete_url}
                
            


### Visual

In [1]:
import pandas as pd
link = pd.read_csv("interval_pages.csv")
link.head(5)

Unnamed: 0,Athlete page
0,https://www.strava.com/athletes/29127886#inter...
1,https://www.strava.com/athletes/67093810#inter...
2,https://www.strava.com/athletes/282426#interva...
3,https://www.strava.com/athletes/283349#interva...
4,https://www.strava.com/athletes/43161349#inter...


### Selenium

With Selenium we first logged in to Strava, then looped through the list of links produced by the Scrapy spider and gathered all the activity data from each page. This required setting up a Chrome driver to open each page and find the elements we needed. We needed Selenium because the page source is rendered dynamically, so plain Scrapy requests did not return the activity data.

With Selenium, we located the statistics on each activities page with XPath and class-name selectors and, as with the scraper, saved everything to a CSV file for data processing.
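
The stat values on a page come back as one flat list that repeats distance, elevation, and time for each activity. As a small sketch of how that list can be split into columns (the helper name `split_stats` and the length check are illustrative additions, not part of the project code):

```python
def split_stats(values):
    """Split a flat [distance, elevation, time, ...] list into three lists."""
    if len(values) % 3 != 0:
        # A partial triple means the page did not render every stat
        raise ValueError("expected a multiple of three stat values")
    return values[0::3], values[1::3], values[2::3]


# Sample values in the same format as the scraped activity table
flat = ["30.57 mi", "2,257 ft", "1h 56m", "48.77 mi", "2,172 ft", "3h 40m"]
distance, elevation, duration = split_stats(flat)
print(distance)
print(elevation)
print(duration)
```

Checking the length first makes a half-rendered page fail loudly instead of silently shifting values into the wrong column.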

In [3]:
import scrapy
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

activities = pd.DataFrame()
#Testing links
test = ['https://www.strava.com/athletes/29127886#interval_type?chart_type=miles&interval_type=month&interval=202012&year_offset=3','https://www.strava.com/athletes/5838343#interval_type?chart_type=miles&interval_type=month&interval=201803&year_offset=5','https://www.strava.com/athletes/28626466#interval_type?chart_type=miles&interval_type=month&interval=202205&year_offset=1']

#Actual data: set up Chrome driver for Selenium
urls = pd.read_csv('Interval_pages.csv')
urls_list = urls['Athlete page']
executable_path="/Users/sasha/Documents/PIC16B/strava_scraper/strava_scraper/spiders/chromedriver"
driver = webdriver.Chrome(executable_path=executable_path)
        
#Login to strava
driver.get('https://www.strava.com/login')
email_field = driver.find_element(By.ID, 'email')
email_field.send_keys('sashaprs@gmail.com')
password_field = driver.find_element(By.ID, 'password')
password_field.send_keys('PIC16BProj')
login_button = driver.find_element(By.ID, 'login-button')
login_button.click()

#Collect activity data
for url in urls_list:
    temp=[]
    names = []
    driver.get(url)
    driver.implicitly_wait(10)
    #Only get data from public profiles
    #Store name, distance, elevation, time
    try:
        athlete = driver.find_element(By.CSS_SELECTOR, "h1.text-title1.athlete-name").text
        # By.XPATH is a locator string, so it is passed as the first argument
        name = driver.find_elements(By.XPATH, "//a[@data-testid='owners-name']")
        for i in name:
            names.append(i.text)
        activity_values = driver.find_elements(By.CLASS_NAME, "------packages-ui-Stat-Stat-module__statValue--phtGK")
        for i in activity_values:
            temp.append(i.text)
        # Stats arrive as repeating (distance, elevation, time) triples
        distance = temp[::3]
        elevation = temp[1::3]
        # Named durations so the imported time module is not shadowed
        durations = temp[2::3]
        df = pd.DataFrame({'Athlete':athlete,'Name':names,'Distance': distance, 'Elevation': elevation, 'Time': durations})
        activities = pd.concat([activities, df], axis=0)
    except Exception:
        # Private or incomplete profiles are skipped
        continue
activities.to_csv('individual_activities.csv', index=False)


WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home


In [5]:
activities = pd.read_csv("individual_activities.csv")
activities.head(5)

Unnamed: 0,Name,Distance,Elevation,Time
0,Jae Kim,30.57 mi,"2,257 ft",1h 56m
1,Jae Kim,48.77 mi,"2,172 ft",3h 40m
2,Jae Kim,33.18 mi,"1,306 ft",2h 11m
3,Jae Kim,31.78 mi,"1,150 ft",2h 21m
4,Jae Kim,27.99 mi,"1,093 ft",1h 58m


## Data Processing and Analysis
### Training Recommendations
For the training recommendations we used the activity data obtained through Scrapy and Selenium to find the average statistics of the athletes who beat the user. We started with about 2,300 rows of activity data, which we first cleaned and transformed so that we could compute the averages. This satisfies the technical component of working with messy or large data.
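
One of the cleaning steps is converting duration strings such as `1h 56m` into minutes. The following is a minimal regex-based sketch of that conversion; the function name `duration_to_minutes` is hypothetical, and it assumes durations contain only hour and minute parts, as in the scraped table.

```python
import re


def duration_to_minutes(text):
    """Convert a duration such as '1h 56m' or '45m' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)    # missing hour part counts as 0
    minutes = int(match.group(2) or 0)  # missing minute part counts as 0
    return hours * 60 + minutes


print(duration_to_minutes("1h 56m"))
print(duration_to_minutes("45m"))
```

Returning a value for every row, including rows without an hour part, keeps the result the same length as the input column.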

We also used the Strava API to get the user's activity data, which required authentication from the user. After obtaining the data we cleaned it and put it in a dataframe. Using the date of each activity, we organized the activities by week and displayed the user's yearly mileage by week to compare with the competitor data.
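
As an illustration of the weekly grouping, the sketch below sums distance per ISO week with pandas. The function name `weekly_mileage` and the sample rows are hypothetical; the column names `Date` and `Distance (mi)` follow the dataframe described above. Note that this version filters on the ISO year, which can differ from the calendar year for dates around New Year.

```python
import pandas as pd


def weekly_mileage(activities, year):
    """Sum 'Distance (mi)' per ISO week for activities in the given ISO year."""
    dates = pd.to_datetime(activities["Date"])
    iso = dates.dt.isocalendar()  # columns: year, week, day
    in_year = iso["year"] == year
    weeks = iso.loc[in_year, "week"].astype(int)
    return activities.loc[in_year, "Distance (mi)"].groupby(weeks).sum()


# Hypothetical activities; 2021-12-30 falls in ISO year 2021 and is excluded
sample = pd.DataFrame({
    "Date": ["2022-01-03", "2022-01-05", "2022-01-10", "2021-12-30"],
    "Distance (mi)": [10.0, 5.5, 20.0, 7.0],
})
print(weekly_mileage(sample, 2022))
```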

We also used SQL to store the user's activity data and wrote a query function that extracts all of the activities from a given day of the week. This can help the user reach their mileage goals by showing the days on which they tend to ride more, so that they can adapt their new training plan accordingly. This satisfies the technical component of working with SQL databases.

In [4]:
import pandas as pd
import numpy as np
df = pd.read_csv('individual_activities.csv')

# Getting rid of duplicate activities and non-bike rides
df = df.drop_duplicates()
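# str.endswith takes one pattern; its second argument is `na`, not another suffix.
# Keep rows whose time ends in minutes, which is the format parsed below.
df = df[df['Time'].str.endswith('m', na=False)]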
df = df[df['Distance'].str.endswith('mi')]
df = df[df['Elevation'].str.endswith('ft')]

# Casting Data into integers
df['Distance'] = df['Distance'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Distance'] = df['Distance'].astype(float)

df['Elevation'] = df['Elevation'].apply(lambda x: x[:-3] if len(x) > 2 else x)
df['Elevation'] = df['Elevation'].str.replace(',', '').astype(int)

# Transforming time into minutes, to perform arithmetic
hrs = []
mins = []

for i in df['Time']:
    if 'h' in i:
        loc = i.find('h')
        hrs.append(int(i[:loc]))
        loc_min = i.find('m')
        mins.append(int(i[loc + 2:loc_min]))
    else:
        # No hour component: record 0 so hrs and mins stay the same length
        hrs.append(0)
        loc = i.find('m')
        mins.append(int(i[:loc]))

hours = np.array(hrs)
minutes = np.array(mins)
time_minutes = (hours * 60) + minutes

df = df.drop('Time', axis=1)
df['Time (min)'] = time_minutes

# Renaming the columns
df = df.rename(columns={'Distance': 'Distance (mi)', 'Elevation': 'Elevation (ft)', 'Name' : 'Athlete Name' })

# Calculating the total monthly distances

df['Distance (mi)'] = df.groupby('Athlete Name')['Distance (mi)'].transform('sum')
df['Elevation (ft)'] = df.groupby('Athlete Name')['Elevation (ft)'].transform('sum')
df['Time (min)'] = df.groupby('Athlete Name')['Time (min)'].transform('sum')
df = df.rename(columns={'Distance (mi)': 'Monthly Distance (mi)', 
                        'Elevation (ft)': 'Monthly Elevation (ft)', 
                        'Time (min)': 'Monthly Time (min)'})

df = df.drop_duplicates()

# Putting time into hours
df['Monthly Time (min)'] = df['Monthly Time (min)'].apply(lambda x: x/60)
df = df.rename(columns={'Monthly Time (min)': 'Monthly Time (hours)'})

cleaned_data = df
average_distance = cleaned_data['Monthly Distance (mi)'].mean().round(2)
average_elevation = cleaned_data['Monthly Elevation (ft)'].mean().round(2)
average_time = cleaned_data['Monthly Time (hours)'].mean().round(2)

average_weekly_distance = (average_distance/4).round(2)
average_weekly_time = (average_time/4).round(2)

print(f'On Average, people that did better than you on your top 10 segments rode {average_distance} miles \na month, climbed {average_elevation} feet per month, and spent {average_time} hours training. This type \nof exertion amounts to roughly {average_weekly_distance} miles and {average_weekly_time} hours of training per week. \nGo get em champ!')


On Average, people that did better than you on your top 10 segments rode 937.25 miles 
a month, climbed 69084.92 feet per month, and spent 58.04 hours training. This type 
of exertion amounts to roughly 234.31 miles and 14.51 hours of training per week. 
Go get em champ!


### Arch nemesis

For the arch nemesis section we used the list of links that we fed to Selenium to find the athletes who beat the user most often. Since each link is specific to a time frame, we isolated the athlete ID before finding the three most frequently occurring athletes:

In [5]:
links = pd.read_csv("Interval_pages.csv")
#Extract athlete ID
links['Athlete ID'] = links['Athlete page'].str.extract(r'/athletes/(\d+)')
links = links.dropna()
#Find number of occurrences of each athlete ID
from collections import Counter
link_counts = Counter(links["Athlete ID"])
occurrences_df = pd.DataFrame(list(link_counts.items()), columns=["Athlete ID", "Occurrences"])
#Order by most occurring
occurrences_df = occurrences_df.sort_values(by="Occurrences", ascending=False)
occurrences_df = occurrences_df.reset_index(drop=True)

#Print top three most occurring athletes with their formatted link
print("After looking through your top ten achievements we found all users who beat you in your \ntop ten segments. Consider these people your \"Arch Nemesis.\" We recommend \nchecking out their profile to see how you can improve:  ")

print(" ")
print(" ")

print(f'Your greatest arch nemesis beat you {occurrences_df.loc[0, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[0,"Athlete ID"]}')

print(" ")

print(f'Your second greatest arch nemesis beat you {occurrences_df.loc[1, "Occurrences"]} times in your current top ten segments.\nFind their profile here: {"https://www.strava.com/athletes/"+occurrences_df.loc[1,"Athlete ID"]}')

print(" ")

print(f'Your third greatest arch nemesis beat you {occurrences_df.loc[2, "Occurrences"]} times in your current top ten segments. \nFind their profile here: {"https://www.strava.com/athletes/"+ occurrences_df.loc[2,"Athlete ID"]}')

After looking through your top ten achievements we found all users who beat you in your 
top ten segments. Consider these people your "Arch Nemesis." We recommend 
checking out their profile to see how you can improve:  
 
 
Your greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/14197582
 
Your second greatest arch nemesis beat you 7 times in your current top ten segments.
Find their profile here: https://www.strava.com/athletes/5875016
 
Your third greatest arch nemesis beat you 7 times in your current top ten segments. 
Find their profile here: https://www.strava.com/athletes/497379
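
The counting step can also be sketched with the standard library alone. The helper name `top_rivals` and the sample links are hypothetical; the regular expression is the same one used to extract the athlete ID above.

```python
import re
from collections import Counter


def top_rivals(links, n=3):
    """Return the n most frequent athlete IDs as (athlete_id, count) pairs."""
    ids = []
    for link in links:
        match = re.search(r"/athletes/(\d+)", link)
        if match:  # skip links without an athlete ID
            ids.append(match.group(1))
    return Counter(ids).most_common(n)


# Hypothetical interval links
links = [
    "https://www.strava.com/athletes/111#interval?interval=202012",
    "https://www.strava.com/athletes/222#interval?interval=202101",
    "https://www.strava.com/athletes/111#interval?interval=202103",
    "https://www.strava.com/athletes/111#interval?interval=202205",
    "https://www.strava.com/athletes/222#interval?interval=202207",
    "https://www.strava.com/athletes/333#interval?interval=202207",
]
for athlete_id, count in top_rivals(links, n=2):
    print(f"https://www.strava.com/athletes/{athlete_id} beat you {count} times")
```

When several athletes share the same count, as in the output above, the order among them carries no meaning and should not be read as a ranking.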


### API Data

The API requires authentication from the user, so we were able to do this with Max's profile:

In [6]:
!curl -X POST https://www.strava.com/oauth/token \
        -F client_id=116130 \
        -F client_secret=800ca990d9a63dd2f931139defe2a740fafbbb82 \
        -F code=617a9a395452bcdf79f3e1d53c9e5ff3b29095a6 \
        -F grant_type=authorization_code

import pandas as pd
import requests

activities_url = 'https://www.strava.com/api/v3/athlete/activities'

#The authorization below comes from generating an authorization code and then running a 
#command in the terminal to exchange that for an access token.

headers = {'Authorization': 'Bearer 2abb7e7b39cad9c2df18773317b6866029945a96'}

activities = []

page = 1
per_page = 100

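while True:
    params = {'page': page, 'per_page': per_page}
    response = requests.get(activities_url, headers=headers, params=params)
    # Stop on an HTTP error (e.g. an expired token) instead of looping forever
    # on the error payload
    response.raise_for_status()
    page_activities = response.json()

    if not page_activities:
        break
    activities.extend(page_activities)
    page += 1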
#Format activity data into a readable dataframe
df = pd.DataFrame(activities)

keep = ['name', 'distance', 'moving_time', 'total_elevation_gain', 
                'average_heartrate', 'weighted_average_watts', 'start_date']
df = df[keep]
df['distance'] = round(df['distance'] * 0.000621371192,2)
df['total_elevation_gain'] = round(df['total_elevation_gain'] * 3.28084,2)
df['moving_time'] = round(df['moving_time']/3600,2)
df['start_date'] = df['start_date'].str.slice(0, 10)

df = df.rename(columns={'name' : 'Name', 'distance' : 'Distance (mi)', 'moving_time' : 'Moving Time (hr)', 'total_elevation_gain' : 'Elevation Gain (ft)', 
                'average_heartrate' : 'Average Heartrate (bpm)', 'weighted_average_watts' : 'Average Power (w)', 'start_date': 'Date'})

df

df['Date'] = pd.to_datetime(df['Date'])

# Add a new column 'Day of the Week' with day of the week as a number (0=Monday, 6=Sunday)
df['Day of the Week'] = df['Date'].dt.dayofweek

{"message":"Bad Request","errors":[{"resource":"AuthorizationCode","field":"code","code":"invalid"}]}

KeyboardInterrupt: 
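
The recorded output ends in a `KeyboardInterrupt`, which is what a pagination loop produces if it never sees an empty page, for example when every response is an error object rather than a list. A minimal sketch of a guarded loop follows, with the HTTP call abstracted behind a `fetch_page` callable so it can be run offline. The names `collect_pages`, `fetch_page`, `fake_fetch`, and the `max_pages` cap are illustrative assumptions and not part of the Strava API.

```python
def collect_pages(fetch_page, per_page=100, max_pages=1000):
    """Collect items page by page until an empty page is returned."""
    activities = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not isinstance(batch, list):
            # Error payloads are objects, not lists, so fail instead of looping
            raise RuntimeError(f"unexpected API response: {batch!r}")
        if not batch:
            break
        activities.extend(batch)
    return activities


# Stand-in for the real request: two pages of fake activities, then empty pages
def fake_fetch(page, per_page):
    pages = {1: [{"name": "Ride A"}, {"name": "Ride B"}], 2: [{"name": "Ride C"}]}
    return pages.get(page, [])


print(len(collect_pages(fake_fetch)))
```

In real use, `fetch_page` would wrap the `requests.get` call and return `response.json()`.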

2022 Weekly Mileage

In [7]:
df = pd.read_csv('max_data.csv')
df['Year']= df["Date"].str[:4]
df['Month']= df["Date"].str[5:7]

import datetime
week = [] 
#Find the ISO week number for each activity date (dates are formatted YYYY-MM-DD)
for i in df['Date']:
    year = int(i[:4])
    month = int(i[5:7])
    day = int(i[8:])
    week.append(datetime.date(year, month, day).strftime("%V"))
df['Week']=week

#Weekly mileage
df_2022 = df[df['Year'] == '2022'].sort_values('Week')
import plotly.express as px
import plotly.io as pio
pio.renderers.default="iframe"

fig = px.histogram(df_2022, x="Week", y= "Distance (mi)")
fig.show()

2022 Monthly Mileage

In [8]:
#Monthly mileage
import plotly.express as px
# Here we use a column with categorical data
fig = px.histogram(df_2022, x="Month", y= "Distance (mi)")
fig.show()

In [None]:
df = pd.read_csv('max_data.csv')

# Make a copy of the original DataFrame 'df'
max_df = df.copy()

# Define a dictionary to map the old column names to the new shorter names
column_name_mapping = {
    'Name': 'Name',
    'Distance (mi)': 'Distance',
    'Moving Time (hr)': 'MovingTime',
    'Elevation Gain (ft)': 'ElevationGain',
    'Average Heartrate (bpm)': 'AvgHeartrate',
    'Average Power (w)': 'AvgPower',
    'Date': 'Date',
    'Day of the Week':'DayOfWk'
}

import sqlite3

# Rename the columns in 'max_df' using the mapping
max_df = max_df.rename(columns=column_name_mapping)

# Now, 'max_df' is a new DataFrame with the desired column names, and 'df' remains unchanged.
def query_day_of_week_activities(day):
    conn = sqlite3.connect("max_data.db")

    cmd = \
    """
    SELECT 
        m.Name,
        m.Distance,
        m.MovingTime, 
        m.ElevationGain,
        m.AvgHeartrate, 
        m.AvgPower,
        m.Date,
        m.DayOfWk
    FROM max_data_condensed m
    WHERE m.DayOfWk = ?
    """
    df = pd.read_sql_query(cmd, conn, params=(day,))
    conn.close()
    return df
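
The query function above reads from a table named `max_data_condensed`, so the renamed dataframe has to be written to the database first. The cell does not show that step; a minimal end-to-end sketch, assuming `DataFrame.to_sql` is used for it and using an in-memory database with made-up rows, could look like this. The helper names `store_activities` and `activities_on_day` are hypothetical.

```python
import sqlite3

import pandas as pd


def store_activities(df, conn, table="max_data_condensed"):
    """Write the activity dataframe to a SQLite table, replacing old contents."""
    df.to_sql(table, conn, if_exists="replace", index=False)


def activities_on_day(conn, day, table="max_data_condensed"):
    """Return all activities whose DayOfWk equals day (0=Monday, 6=Sunday)."""
    query = f"SELECT * FROM {table} WHERE DayOfWk = ?"
    return pd.read_sql_query(query, conn, params=(day,))


# Made-up rows for illustration only
sample = pd.DataFrame({
    "Name": ["Morning Ride", "Lunch Ride", "Evening Ride"],
    "Distance": [20.1, 12.4, 30.0],
    "DayOfWk": [0, 2, 0],
})
conn = sqlite3.connect(":memory:")
store_activities(sample, conn)
mondays = activities_on_day(conn, 0)
conn.close()
print(mondays)
```

Passing the day through `params` keeps the user-supplied value out of the SQL string, as the original query function also does.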

# Ethical Consideration

Our project uses public data scraped from the Strava website. We only collect data that is publicly shared, and we only look at statistics such as distance ridden and time spent working out. For data to appear on the Strava website in the first place, the user has to agree to the terms and conditions, including accepting certain losses of privacy such as sharing location information. It is also important to note that the final result of our model excludes data associated with specific athlete names and is a suggestion based purely on quantitative analysis. For this reason, there should not be any ethical issues with regard to the individuals from whom the data is collected. One potential ethical concern with our final result is that the recommendation is not entirely accurate: it is merely a prediction of a suitable fitness plan. A given recommendation might not account for an individual's health issues and could consequently result in injury. It would be useful to include a statement asking anyone who plans to structure their training around our recommendation system to consider this.