# Downloading the GPX files of all routes

We can't download the GPX files for every route directly without triggering reCAPTCHA, so we'll do it through the app, which has no download limits.

We chose the **Bluestacks** Android emulator because it lets us easily create a macro that downloads all the routes in the *favorites* section of our profile.

We can write a simple **Selenium** function to add the necessary routes to *favorites* in batches of 1000, since that's the maximum number that can be stored at any given time. We'll then download them via **Bluestacks**. Rinse and repeat.

## Creating a function to mark our routes as *favs*

In [2]:
#Let's begin by importing our libraries.

import time
import pandas as pd
from selenium import webdriver
from os import path
import re

In [2]:
#Starting our webdriver, in this case I'm using Chrome.

driver = webdriver.Chrome()
driver.get('https://es.wikiloc.com/wikiloc/start.do') #Loading the login page.

Since we can't automate the login procedure, we'll have to enter our username and password manually. After logging in, we can proceed.

In [74]:
#Defining our main function. Remember that it will only work if you're logged in.

from selenium.webdriver.common.by import By #Locator helper; find_element_by_xpath was removed in recent Selenium 4 releases.

t = 0.3 #Wait (in seconds) between actions. With a good connection you can keep it between 0.3 and 0.5.
        #Some trial and error might be necessary.

def fav(url_list):
    start = time.time() #Starting a counter to time our code.
    for i in url_list: #url_list is the list of route urls (max 1000 and no duplicates, VERY important).
        try:
            driver.get(i) #Accessing the url.
            time.sleep(t)
            driver.find_element(By.XPATH, '//*[@id="container"]/a').click() #Clicking on the 'Add to favorites' item.
            time.sleep(t)
            driver.find_element(By.XPATH, '//*[@id="container"]/div/div/div/div[3]/div[1]').click() #Marking the route as fav.
            time.sleep(t)
        except Exception: #Skipping any route that fails; it isn't retried.
            time.sleep(1)
    stop = time.time() #Stopping our timer.
    duration = (stop - start) / 60 #Calculating the elapsed minutes.
    print(len(url_list), 'favs added in', duration, 'minutes.') #Note: this counts every url in the list, including skipped ones.
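
Because `fav` swallows every exception, a route that fails is skipped silently and still counted in the final message. Below is a minimal sketch of one way to keep track of the failures so they can be retried later. The name `fav_with_log` and the `mark_favorite` callable are assumptions made for illustration (they are not part of the notebook); `mark_favorite` stands in for the browser steps performed on a single url.

```python
def fav_with_log(url_list, mark_favorite):
    """Try to favorite every url and return the ones that failed.

    mark_favorite is a hypothetical callable (an assumption of this sketch)
    that performs the browser steps for one url and raises on failure.
    """
    failed = []
    for url in url_list:
        try:
            mark_favorite(url)
        except Exception as exc:
            # Keeping the url and the reason so the batch can be retried later.
            failed.append((url, repr(exc)))
    print(len(url_list) - len(failed), 'favs added,', len(failed), 'failed.')
    return failed
```

The returned list can be fed back into the same function once the current batch has been downloaded.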

In our case we'll be using the url list in our dataframe of routes.

In [3]:
#Importing the dataframe with all routes.

df = pd.read_csv('df_routes_final.csv')

In [5]:
df.head()

| Unnamed: 0 | ubicacion | nombre | trailrank | distancia | desnivel | dificultad | url | photo1 | photo2 | photo3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Día 1/2 - Sierra Nevada - Granada - Pico Veleta | 64 | 108,78 km | 2.914 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... |
| 1 | Pico Veleta | Pinos Genil. Güejar Sierra. Hazas Llanas. Prad... | 54 | 83,61 km | 2.563 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... |
| 2 | Pico Veleta | granada - pico veleta | 41 | 85,89 km | 2.785 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... |
| 3 | Pico Veleta | Subida al pico veleta y al radiotelescospio de... | 34 | 32,42 km | 1.199 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... |
| 4 | Pico Veleta | Pico Veleta por el Monachil-el Purche-Sierra N... | 32 | 101,77 km | 3.234 m | Difícil | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... |


In [None]:
#Making a list of all our urls.

url = df['url'].tolist()

In [None]:
#Getting rid of duplicates, we don't want to download routes twice.

url_list = list(dict.fromkeys(url))
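
A short illustration of why `dict.fromkeys` is used here rather than `set`: dictionaries keep insertion order (guaranteed since Python 3.7), so duplicates are dropped while the first occurrence of each url stays in place. That keeps slices such as `url_list[:1000]` stable between runs. The values below are toy placeholders, not real routes.

```python
# Toy values standing in for route urls (hypothetical data).
urls = ['a', 'b', 'a', 'c', 'b']

# Each key is kept once, in the order it was first seen.
unique_in_order = list(dict.fromkeys(urls))
print(unique_in_order)  # ['a', 'b', 'c']
```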

In [7]:
#Testing the function.

fav(url_list[:1000])

1000 favs added in 27.659632444381714 minutes.


# Using our function and Bluestacks to download all gpx files

Now that we have a function to mark our routes as favorites, we will set up **Bluestacks** (an Android emulator) to download them. These are the steps:

1. Download and install the latest version of Bluestacks.
2. Start up the emulator, go to the Play Store and download Wikiloc and Google Drive.
3. Sign into both apps.
4. Set the display resolution to 1600 by 900.
5. If needed, move the Wikiloc app icon to the position shown in the screenshot.
6. Go to *Macros* and import the macro *route_downloader.json*.
7. Restart the emulator and play the macro until all routes have been downloaded.



What does the macro do?

It opens **Wikiloc**, goes to favorites, clicks on the first route, downloads the GPX file to your **Google Drive** and removes the route from your favorites. It then downloads 4 more files before closing the app and opening it again. Restarting the app clears the cache; otherwise the app slows down until further downloads are impossible (and this happens quite soon).
Every 4 minutes **Bluestacks** restarts and plays the macro automatically. This also keeps the macro running in the event of a crash, which tends to happen quite often.


Since we can only add (and download) 1000 files at a time, ideally you will want to set up a timer between calls to the *fav* function, so that **Bluestacks** has enough time to download each batch.
In my case 1000 routes took about 19500 seconds (roughly 5.4 hours) to download, so I waited 20000 seconds between batches:

In [None]:
#Setting up the function to download multiple batches of routes. First of all we run Bluestacks with the aforementioned macro.

fav(url_list[:1000]) #Marking the first 1000 routes.
time.sleep(20000) #Waiting for the routes to download.
fav(url_list[1000:2000]) #Marking the next batch.
time.sleep(20000) 
fav(url_list[2000:3000])

#Add as many blocks as needed.
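
The repeated blocks above can also be written as a loop. The following is a minimal sketch under a few assumptions: the helper names `batches` and `run_in_batches` are made up for illustration, the function that marks a batch is passed in as `mark_batch` (in the notebook that would be `fav`), and the `sleep` parameter exists only so the wait can be replaced when trying the code out.

```python
import time

def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def run_in_batches(url_list, mark_batch, wait_seconds=20000, size=1000, sleep=time.sleep):
    """Call mark_batch on each slice, waiting between slices but not after the last one."""
    chunks = list(batches(url_list, size))
    for n, chunk in enumerate(chunks, start=1):
        mark_batch(chunk)          # e.g. fav(chunk) in the notebook
        if n < len(chunks):
            sleep(wait_seconds)    # giving the emulator time to download the batch
```

With the notebook's objects this would be called as `run_in_batches(url_list, fav)`, with Bluestacks and the macro already running.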

# Checking for missing routes

**Bluestacks** and the **Wikiloc** app are far from perfect (and so is our code), so we need a way to check for routes we've failed to download.

This isn't as straightforward as it might seem, since the names of the GPX tracks won't match the route names perfectly. They are often truncated, and special characters or spaces are frequently deleted or changed (for example, a dot for a comma).

To be able to match our gpx files with our dataframe we will need to clean up and shorten their names so that they end up being exactly the same. Let's do it.

In [4]:
#Defining a function that will strip a string of special characters and spaces, leaving only the first 20 alphanumeric chars.

def stripper(name):
    return re.sub(r'\W+', '', name)[:20]
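
To make the behaviour of `stripper` concrete, here it is applied to two route names from the dataframe and to one made-up name (the third input is hypothetical). Note that `\W` in Python 3 is Unicode-aware, so accented letters are kept, and that underscores count as word characters and are kept as well.

```python
import re

def stripper(name):
    # Removing every run of non-word characters, then keeping the first 20 characters.
    return re.sub(r'\W+', '', name)[:20]

print(stripper('Día 1/2 - Sierra Nevada - Granada - Pico Veleta'))  # Día12SierraNevadaGra
print(stripper('granada - pico veleta'))                            # granadapicoveleta
print(stripper('ruta_1, tramo 2.5'))                                # ruta_1tramo25
```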

In [5]:
#Creating a new column with every route name stripped with our function.

df['alpha_name'] = df['nombre'].apply(stripper)

In [6]:
df.to_csv('df.csv', index=False)

In [33]:
df.head()

| Unnamed: 0 | ubicacion | nombre | trailrank | distancia | desnivel | dificultad | url | photo1 | photo2 | photo3 | alpha_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Día 1/2 - Sierra Nevada - Granada - Pico Veleta | 64 | 108,78 km | 2.914 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | https://es.wikiloc.com/rutas-ciclismo/dia-1-2-... | Día12SierraNevadaGra |
| 1 | Pico Veleta | Pinos Genil. Güejar Sierra. Hazas Llanas. Prad... | 54 | 83,61 km | 2.563 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | https://es.wikiloc.com/rutas-ciclismo/pinos-ge... | PinosGenilGüejarSier |
| 2 | Pico Veleta | granada - pico veleta | 41 | 85,89 km | 2.785 m | Moderado | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | https://es.wikiloc.com/rutas-ciclismo/granada-... | granadapicoveleta |
| 3 | Pico Veleta | Subida al pico veleta y al radiotelescospio de... | 34 | 32,42 km | 1.199 m | Muy difícil | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | https://es.wikiloc.com/rutas-ciclismo/subida-a... | Subidaalpicoveletaya |
| 4 | Pico Veleta | Pico Veleta por el Monachil-el Purche-Sierra N... | 32 | 101,77 km | 3.234 m | Difícil | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | https://es.wikiloc.com/rutas-ciclismo/pico-vel... | PicoVeletaporelMonac |


In [49]:
#Creating a list of our shortened alphanumeric names.

alpha_list = df['alpha_name'].tolist()
len(alpha_list)

15954

In [51]:
#Deleting duplicates.

alpha_list = list(dict.fromkeys(alpha_list))

In [52]:
len(alpha_list)

9138
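
Dropping duplicates here removes routes that appear more than once in the dataframe, but it can also merge distinct routes whose names share the same first 20 characters after stripping. The outputs above don't tell us how many of each there are, so the following is only a sketch of how such collisions could be listed. The helper name `find_collisions` and the sample rows are hypothetical; with the real data the input would be `zip(df['alpha_name'], df['url'])`.

```python
from collections import defaultdict

def find_collisions(pairs):
    """Map each shortened name to its distinct urls, keeping only names with 2 or more urls."""
    urls_by_name = defaultdict(set)
    for alpha_name, url in pairs:
        urls_by_name[alpha_name].add(url)
    return {name: sorted(urls) for name, urls in urls_by_name.items() if len(urls) > 1}

# Hypothetical rows used only to show the behaviour.
rows = [
    ('granadapicoveleta', 'https://example.com/route-1'),
    ('granadapicoveleta', 'https://example.com/route-1'),     # same route twice: not a collision
    ('PicoVeletaporelMonac', 'https://example.com/route-2'),
    ('PicoVeletaporelMonac', 'https://example.com/route-3'),  # two routes sharing one name
]
print(find_collisions(rows))
```

Any name returned by this check corresponds to several routes but only one file name, so a single GPX file on disk would make all of them look downloaded.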

In [70]:
#Renaming every gpx file with the new shortened and stripped name, matching our new df column.

import pathlib

for gpx_file in pathlib.Path("gpx").iterdir(): #Using iterdir to iterate through every file in our gpx folder.
    try:
        if gpx_file.is_file():
            old_name = gpx_file.stem
            old_extension = gpx_file.suffix
            directory = gpx_file.parent
            strip = re.sub(r'\W+', '', old_name) #Stripping the name using the same regex as before.
            new_name = strip[:20] + old_extension #Only keeping the first 20 characters.
            gpx_file.rename(pathlib.Path(directory, new_name)) #Renaming the file.
    except OSError: #Skipping files that can't be renamed.
        pass

In [71]:
#Now that our dataframe and gpx files match we can easily check for missing entries.

import os

missing_routes = [] #This list will store our missing filenames.
ok_routes = [] #Successful downloads.

for i in alpha_list:
    gpx_path = os.path.join('gpx', i + '.gpx')
    if os.path.exists(gpx_path): #Checking if the file with the given filename exists.
        ok_routes.append(i)
    else:
        missing_routes.append(i)

In [72]:
len(missing_routes)

47

In [55]:
len(ok_routes)

7072
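
The same check can be done by listing the folder once and comparing sets of names, which avoids one filesystem call per route. This is a minimal sketch, assuming the files have already been renamed and carry a lowercase `.gpx` extension; the function name `split_by_download` is made up for illustration.

```python
import pathlib

def split_by_download(alpha_names, gpx_dir='gpx'):
    """Return (ok, missing) lists, preserving the order of alpha_names."""
    # File names without extension for every .gpx file in the folder.
    downloaded = {p.stem for p in pathlib.Path(gpx_dir).glob('*.gpx')}
    ok = [name for name in alpha_names if name in downloaded]
    missing = [name for name in alpha_names if name not in downloaded]
    return ok, missing
```

With the notebook's objects this would be called as `ok_routes, missing_routes = split_by_download(alpha_list)`.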

## Downloading the missing routes

In [None]:
#We need to create a list with the url of the missing routes.

missing_list = []
missing_set = set(missing_routes) #A set makes the membership test fast.

for name, route_url in zip(df['alpha_name'], df['url']):
    if name in missing_set:
        missing_list.append(route_url) #Appending the url to the list if the route is in missing_routes.

In [61]:
#Creating a dataframe with our missing routes. This step isn't really necessary.

df_missing = df.loc[df['alpha_name'].isin(missing_routes)]

In [63]:
df_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87 entries, 41 to 15884
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ubicacion   87 non-null     object
 1   nombre      87 non-null     object
 2   trailrank   87 non-null     int64 
 3   distancia   87 non-null     object
 4   desnivel    87 non-null     object
 5   dificultad  87 non-null     object
 6   url         87 non-null     object
 7   photo1      87 non-null     object
 8   photo2      87 non-null     object
 9   photo3      87 non-null     object
 10  alpha_name  87 non-null     object
dtypes: int64(1), object(10)
memory usage: 8.2+ KB


In [64]:
#Populating our list with the missing routes urls.

missing_url = df_missing['url'].tolist()

In [69]:
#Selecting the missing routes as faves for download.

fav(missing_url)

57 favs added in 1.1880242228507996 minutes.


Now it's simply a matter of downloading our missing routes and renaming them as we did with the others. Once we have all the GPX files we can begin parsing them.

**<div align="right">Ironhack DA PT 2021</div>**
    
**<div align="right">Xavier Esteban</div>**