# Scraping all user-submitted altimetries

We already have the *official* altimetries for Spain's ports from the site **Altimetrias.net** (from our second project), but there are plenty of user-submitted altimetries that we'd like to add to our database. They can be found here:

https://www.altimetrias.net/usuarios/provinciasusu.asp

We'd love to scrape them using the same BeautifulSoup function developed in my **PR03**, but the page structure is quite different, so it's less work to do it via Selenium. Let's begin.

## Initial testing

To develop our function we'll need to know how to access every element of a user-submitted altimetry given its url.

In [1]:
#Importing necessary libraries. 

import pandas as pd
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import re

In [2]:
#Starting webdriver, we're using Chrome as usual.

driver = webdriver.Chrome()

In [3]:
#Url of a random port to test.

url = 'https://www.altimetrias.net/aspbk/verPerfilusu.asp?id=1841'

In [4]:
#Accessing the url.

driver.get(url)

In [5]:
#Using xpath to locate the first elements. Here we have the name of the port:

driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[3]/b/font[1]')[0].text

'ZUMARRAUNDI'

In [6]:
#Province.

driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[2]/font')[0].text

' ÁLAVA'

In [7]:
#Starting point (usually a town or city):

driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[3]/b/font[2]')[0].text

'Araia'
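The `find_elements_by_xpath` helper used in these cells was removed in Selenium 4.3; newer releases expose the same lookup through `find_elements(by, value)`. Below is a minimal sketch of a wrapper, assuming a Selenium 4 driver. `first_text` is a hypothetical helper name, and the sketch relies on `By.XPATH` being the plain string `"xpath"`, so it needs no Selenium import of its own. It also strips the leading space seen in the province value.

```python
XPATH = "xpath"  # Same value as selenium.webdriver.common.by.By.XPATH.


def first_text(driver, xpath):
    """Return the stripped text of the first element matching xpath, or None."""
    elements = driver.find_elements(XPATH, xpath)
    if not elements:
        return None  # Nothing matched: the page is probably not a valid port.
    return elements[0].text.strip()
```

With the driver from the cells above, `first_text(driver, '/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[2]/font')` should give the province without its leading space.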

In [5]:
#Scraping the altitude we run into our first problem: there are letters and special characters mixed with our value.

driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[1]')[0].text

'Altitud: 942 m'

In [19]:
#Using regex and typecasting to int we're able to obtain the desired output.

int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[1]')[0].text)))

942

In [20]:
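#We'll use a similar strategy for the remaining numerical values.
#Decimal values are sliced out of the text and their decimal comma is replaced with a dot before typecasting to float.
#Distance: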

float(driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[2]')[0].text[10:14].lstrip().replace(',','.'))

5.6

In [7]:
#Elevation gain:

int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[3]')[0].text)))

341

In [26]:
#Average gradient:

float(driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[4]')[0].text[16:20].lstrip().replace(',','.'))

6.0

In [9]:
#Suffer score:

int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[5]')[0].text)))

89
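Fixed slices such as `[10:14]` depend on the label length and on how many digits the value has, so a longer number could be cut short. A regex that pulls out the first number and accepts a decimal comma is one possible alternative. This is only a sketch: `parse_number` is a hypothetical helper, and apart from `'Altitud: 942 m'` the sample strings in the usage lines are assumptions about what the page shows. It would also misread a dot used as a thousands separator.

```python
import re


def parse_number(text):
    """Return the first number in text as a float (',' or '.' as decimal mark)."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group().replace(",", "."))


print(int(parse_number('Altitud: 942 m')))   # Label text taken from the cell above.
print(parse_number('Distancia: 5,6 km'))     # Assumed label text.
```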

## Testing our scraper

Now that we have the location of every element of interest in a given url we can proceed to develop a scraping function.
Every url is a base url followed by a number (from 0000 to 2400), so we have to make sure that every number fed to the url maker (base url + number) has 4 digits. This is accomplished with a small conditional block developed in **PR03**.
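The same padding can also be sketched with `str.zfill`, which left-pads a string with zeros up to a given width. `build_url` and `BASE_URL` are hypothetical names used only for this illustration; the base url is the one used throughout this notebook.

```python
BASE_URL = 'https://www.altimetrias.net/aspbk/verPerfilusu.asp?id='


def build_url(port_id):
    """Build the port url, padding the id with zeros to 4 digits."""
    return f'{BASE_URL}{str(port_id).zfill(4)}'


print(build_url(99))  # https://www.altimetrias.net/aspbk/verPerfilusu.asp?id=0099
```

Ids that already have 4 digits are left untouched, so no branching on the input length is needed.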

In [27]:
def scraper_user_alt(x):
    if len(str(x)) == 1:
        x = '000' + str(x) #If the number passed to the function has a single digit, 3 zeros will be added to its left.
    elif len(str(x)) == 2:
        x = '00' + str(x) #The same concept is used for every possible input length.
    elif len(str(x)) == 3:
        x = '0' + str(x)
    else:
        pass #If the number has 4 digits we'll do nothing.
    driver = webdriver.Chrome() #Starting the browser outside the try block so it always exists when we clean up.
    try: #Not every url will work or display a port. When one fails, the error reaches the caller, which can skip it.
        base_url_puerto = 'https://www.altimetrias.net/aspbk/verPerfilusu.asp?id='
        url = f'{base_url_puerto}{x}' #Creating the final url combining base url and the given number.
        driver.get(url)
        time.sleep(1) #Sleeping for 1 second so the page has time to load.
        puerto = [{'puerto': driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[3]/b/font[1]')[0].text,
              'provincia': driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[2]/font')[0].text,
              'pueblo': driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[1]/tbody/tr/td[3]/b/font[2]')[0].text,
              'altitud': int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[1]')[0].text))),
              'desnivel': int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[3]')[0].text))),
              'distancia': float(driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[2]')[0].text[10:14].lstrip().replace(',','.')),
              'pendiente': float(driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[4]')[0].text[16:20].lstrip().replace(',','.')),
              'coeficiente': int(re.sub(r"\D", "", (driver.find_elements_by_xpath('/html/body/div/table/tbody/tr/td/table[2]/tbody/tr/td[5]')[0].text))),
              'url': driver.current_url}] #Creating a dictionary with all the previously mentioned elements.
    finally:
        driver.quit() #Closing the browser whether the scrape worked or not.
    return puerto #Returning the dictionary we just created (wrapped in a list).

In [30]:
#Let's test our function with port number 99:

scraper_user_alt(99)

[{'puerto': 'OIZ MENDIA',
  'provincia': ' BIZKAIA',
  'pueblo': 'Mendata-Totorika-Gortaguren',
  'altitud': 1020,
  'desnivel': 1000,
  'distancia': 27.0,
  'pendiente': 3.6,
  'coeficiente': 405,
  'url': 'https://www.altimetrias.net/aspbk/verPerfilusu.asp?id=0099'}]

## Testing our final function

Now that we can scrape a single url, let's create a function that will scrape every url in a given range and return a Pandas dataframe. We'll also add a little timer to benchmark our function. 

In [31]:
def scraper(port_range):
    start = time.time() #Starting our timer.
    lista = [] #We'll extend this list with every generated dictionary.
    for i in range(port_range):
        try:
            puerto = scraper_user_alt(i) #Using the previous function to scrape every url.
            lista.extend(puerto) #Adding the generated dict to our list.
        except Exception:
            pass #If we can't run the scraper, try the next url.
    df = pd.DataFrame(lista) #Once we've scraped every url we'll generate a DF from the whole list (df.append was removed in pandas 2.0).
    stop = time.time() #Stopping our timer.
    duration = (stop - start) / 60 #Calculating the elapsed minutes.
    print('You just lost', duration, 'minutes of your life.') #Printing the elapsed minutes.
    return df #Returning the generated DF.

In [32]:
#Testing our function.

df = scraper(5)

You just lost 0.39635935227076213 minutes of your life.


In [33]:
#Checking our generated DF. Everything's in order.

df.head()

| | puerto | provincia | pueblo | altitud | desnivel | distancia | pendiente | coeficiente | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PRESA DEL ATAZAR | MADRID | Patones | 1050 | 275 | 4.3 | 6.4 | 52 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 1 | LA CUESTA | MURCIA | Tallante | 349 | 147 | 2.2 | 6.6 | 29 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 2 | MONTE ARRAIZ | BIZKAIA | Bilbao | 314 | 275 | 2.4 | 11.0 | 110 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 3 | PUERTO DE ANSÓ | HUESCA | Hecho | 1078 | 266 | 4.6 | 5.7 | 46 | https://www.altimetrias.net/aspbk/verPerfilusu... |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       4 non-null      object 
 1   provincia    4 non-null      object 
 2   pueblo       4 non-null      object 
 3   altitud      4 non-null      int64  
 4   desnivel     4 non-null      int64  
 5   distancia    4 non-null      float64
 6   pendiente    4 non-null      float64
 7   coeficiente  4 non-null      int64  
 8   url          4 non-null      object 
dtypes: float64(2), int64(3), object(4)
memory usage: 416.0+ bytes


## Scraping all ports

Since we know that the user-submitted ports have ids ranging from 0000 to ~2400, we'll give our function a range of 2400 (ids 0000 to 2399) to make sure that we aren't missing any. Some of the urls won't be valid, so the actual number of rows might be considerably lower.
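Before launching the full run, it helps to have a rough idea of how long it will take. The sketch below extrapolates from the 5-url benchmark printed earlier, assuming the cost per url stays constant; that is only an assumption, since invalid urls and network conditions will shift it. `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(sample_minutes, sample_urls, total_urls):
    """Linearly extrapolate a run time from a small benchmark."""
    return sample_minutes / sample_urls * total_urls


#Benchmark taken from the scraper(5) test: about 0.396 minutes for 5 urls.
estimate = estimate_minutes(0.39635935227076213, 5, 2400)
print(f'Estimated run time: {estimate:.0f} minutes (~{estimate / 60:.1f} hours)')
```

Under that assumption the full scrape takes a little over three hours, so it is worth leaving it running unattended.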

In [None]:
df_user_alts = scraper(2400)

In [None]:
#Let's see how many rows we have in our dataframe.

df_user_alts.info()

In [None]:
#Now that our work is done, let's save the dataframe as a csv file.

df_user_alts.to_csv('df_user_alts.csv')

**<div align="right">Ironhack DA PT 2021</div>**
    
**<div align="right">Xavier Esteban</div>**