# Obtaining the coordinates for each mountain pass in Spain

Having the exact coordinates of every mountain pass in Spain (*puerto*, called "port" throughout this notebook) would let us determine exactly which routes go through each one, greatly enhancing our dataset. But where can we find those coordinates?

The website **Altimetrias.net** has a handy map with the coordinates of 757 ports, 745 of them in continental Spain. In the first part of this notebook I will extract those coordinates using web scraping. Since our database has ~1700 ports, though, a substantial share of them will still lack coordinates after this stage.

## Extracting the coordinates of the 757 *Altimetrias.net* ports

As mentioned, I'm going to use web scraping at this stage. Let's begin by importing our libraries.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
#The map url.

url = 'https://www.altimetrias.net/mapas/default.asp'

In [3]:
#Parsing the content to make it accessible.

html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

## Accessing the data

The port names, coordinates and URLs are stored in *script* elements, one per port. We can isolate those elements and iterate through them to extract the data we need.

In [2]:
import re

In [5]:
#The first port is at index 3 and the last one at index 759 (757 ports in total).

puerto = soup.find_all('script')[3]

In [6]:
#Visualizing the data structure within the element.

puerto.string

'\r\n\tL.marker([\'42.166509\', \'-8.352331\'], {icon: Pontevedra}).bindPopup(\'<a target="_blank" href="https://www.altimetrias.net/aspbk/verPuerto.asp?id=664"><b>A PARADANTA</b> x Uma</a>\').addTo(puertos);\r\n'

In [7]:
#Accessing the latitude.

re.search("\'(.*?)\', \'", puerto.string)[1]

'42.166509'

In [8]:
#Accessing the longitude.

re.search(", \'(.*?)\'],", puerto.string)[1]

'-8.352331'

In [9]:
#Accessing the url.

re.search('href="(.*?)"><b', puerto.string)[1]

'https://www.altimetrias.net/aspbk/verPuerto.asp?id=664'

## Using a loop to extract the parsed data

Now that we know the exact location and format of the data, it's simply a matter of looping over every *script* object that holds a port and building one dictionary per port. We will later use those dictionaries to create a dataframe.

In [10]:
dict_list = []

#Collecting the script elements once instead of on every iteration.
scripts = soup.find_all('script')

for i in range(3,760):
    puerto = scripts[i]
    dict_puerto = {'url': re.search('href="(.*?)"><b', puerto.string)[1],
                    'lat': re.search("\'(.*?)\', \'", puerto.string)[1],
                    'long': re.search(", \'(.*?)\'],", puerto.string)[1]}
    dict_list.append(dict_puerto)

In [11]:
#Let's check how many ports we obtained.

len(dict_list)

757

In [12]:
#Testing the dictionary list.

dict_list[0]

{'url': 'https://www.altimetrias.net/aspbk/verPuerto.asp?id=664',
 'lat': '42.166509',
 'long': '-8.352331'}
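
The loop above depends on the markers sitting at script indices 3 to 759, so it would break if the page gained or lost a script tag. Below is a minimal sketch of an alternative, assuming every marker keeps the `L.marker([...]).bindPopup(...)` shape shown earlier: match the pattern itself and skip anything that does not fit. The names `MARKER_RE` and `parse_markers` are hypothetical, and the sample strings stand in for the `.string` of each script tag.

```python
import re

# One pattern with named groups for the three fields of a marker script.
MARKER_RE = re.compile(
    r"L\.marker\(\['(?P<lat>[-\d.]+)',\s*'(?P<long>[-\d.]+)'\]"
    r'.*?href="(?P<url>[^"]+)"'
)

def parse_markers(script_texts):
    """Return one dict per script text that contains a map marker."""
    ports = []
    for text in script_texts:
        match = MARKER_RE.search(text or "")  # a tag's .string can be None
        if match:
            ports.append(match.groupdict())
    return ports

# Sample input: an unrelated script plus the marker shown earlier.
sample = [
    "var map = L.map('map');",
    "\r\n\tL.marker(['42.166509', '-8.352331'], {icon: Pontevedra})"
    ".bindPopup('<a target=\"_blank\" href=\"https://www.altimetrias.net/"
    "aspbk/verPuerto.asp?id=664\"><b>A PARADANTA</b> x Uma</a>')"
    ".addTo(puertos);\r\n",
]
ports = parse_markers(sample)
print(ports)
```

With the real page, the input could be `[s.string for s in soup.find_all('script')]`, and the resulting list plays the same role as `dict_list`.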

## Creating a dataframe

Now that we have obtained all the useful data we will package it into a dataframe.

In [3]:
import pandas as pd

In [14]:
df = pd.DataFrame(dict_list)

In [15]:
df.head()

| | url | lat | long |
|---|---|---|---|
| 0 | https://www.altimetrias.net/aspbk/verPuerto.as... | 42.166509 | -8.352331 |
| 1 | https://www.altimetrias.net/aspbk/verPuerto.as... | 40.621878 | -4.179733 |
| 2 | https://www.altimetrias.net/aspbk/verPuerto.as... | 40.621878 | -4.179833 |
| 3 | https://www.altimetrias.net/aspbk/verPuerto.as... | 38.16339 | -2.87367 |
| 4 | https://www.altimetrias.net/aspbk/verPuerto.as... | 42.551714 | 0.912602 |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 757 entries, 0 to 756
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     757 non-null    object
 1   lat     757 non-null    object
 2   long    757 non-null    object
dtypes: object(3)
memory usage: 17.9+ KB
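
`df.info()` reports `lat` and `long` as `object`, which means the scraped coordinates are still strings. Here is a short sketch of converting them to floats with `pd.to_numeric`, shown on a small stand-in frame (`sample` is hypothetical; its first row copies the values above). Passing `errors='coerce'` turns anything that cannot be parsed into `NaN` instead of raising.

```python
import pandas as pd

# Stand-in for the scraped frame: coordinates stored as strings.
sample = pd.DataFrame({
    'lat': ['42.166509', 'not a number'],
    'long': ['-8.352331', '-4.179733'],
})

# Convert each coordinate column to float; unparsable values become NaN.
for column in ['lat', 'long']:
    sample[column] = pd.to_numeric(sample[column], errors='coerce')

print(sample.dtypes)
```

Converting at this point would keep later frames from mixing scraped strings with the float coordinates returned by a geocoder.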


In [17]:
#Saving our dataframe.

df.to_csv('df_coords_puertos.csv', index=False)

# Obtaining the coordinates for the remaining ports

Let's load our ports dataframe to find out which ports are missing coordinates.

In [18]:
df1 = pd.read_csv('master_df_puertos.csv')

In [19]:
df1.head()

| | puerto | provincia | pueblo | altitud | desnivel | distancia | pendiente | coeficiente | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Granada | El Purche-Sabinas-Pradollano | 3384 | 2662 | 46.0 | 5.7 | 588 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 1 | Iram | Granada | Haza Llana | 2845 | 2080 | 35.0 | 5.8 | 546 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 2 | Angliru | Asturias | Santa Eulalia | 1570 | 1423 | 18.0 | 7.0 | 528 | https://www.altimetrias.net/aspbk/verPuerto.as... |
| 3 | Ancares Desde Navia De Suarna | Lugo | Pan do Zarco | 1670 | 1875 | 35.0 | 3.8 | 512 | https://www.altimetrias.net/aspbk/verPerfilusu... |
| 4 | Gamoniteiro | Asturias | Pola-Cobertoria | 1772 | 1465 | 15.0 | 9.0 | 492 | https://www.altimetrias.net/aspbk/verPuerto.as... |


In [20]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1723 entries, 0 to 1722
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1723 non-null   object 
 1   provincia    1723 non-null   object 
 2   pueblo       1723 non-null   object 
 3   altitud      1723 non-null   int64  
 4   desnivel     1723 non-null   int64  
 5   distancia    1723 non-null   float64
 6   pendiente    1723 non-null   float64
 7   coeficiente  1723 non-null   int64  
 8   url          1723 non-null   object 
dtypes: float64(2), int64(3), object(4)
memory usage: 121.3+ KB


In [21]:
#Ports with an elevation gain (desnivel) below 200 are too small to be of any value, so we drop them and create a new df.

df2 = df1[df1['desnivel'] >= 200]

In [22]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1467 entries, 0 to 1663
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1467 non-null   object 
 1   provincia    1467 non-null   object 
 2   pueblo       1467 non-null   object 
 3   altitud      1467 non-null   int64  
 4   desnivel     1467 non-null   int64  
 5   distancia    1467 non-null   float64
 6   pendiente    1467 non-null   float64
 7   coeficiente  1467 non-null   int64  
 8   url          1467 non-null   object 
dtypes: float64(2), int64(3), object(4)
memory usage: 114.6+ KB


## Obtaining the missing ports

First of all we will merge the two dataframes to visualize which ports are lacking coordinates.

In [23]:
df3 = pd.merge(left=df2, right=df, how='left', left_on='url', right_on='url')

In [24]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1468 entries, 0 to 1467
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1468 non-null   object 
 1   provincia    1468 non-null   object 
 2   pueblo       1468 non-null   object 
 3   altitud      1468 non-null   int64  
 4   desnivel     1468 non-null   int64  
 5   distancia    1468 non-null   float64
 6   pendiente    1468 non-null   float64
 7   coeficiente  1468 non-null   int64  
 8   url          1468 non-null   object 
 9   lat          402 non-null    object 
 10  long         402 non-null    object 
dtypes: float64(2), int64(3), object(6)
memory usage: 137.6+ KB
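
The merged frame has 1468 rows, one more than `df2`. A left merge only gains rows when a key value appears more than once in the right-hand frame, so at least one `url` seems to be duplicated in the scraped data (rows 1 and 2 of the scraped frame, which differ only in the fourth decimal of the longitude, may be such a case). Below is a minimal sketch, on hypothetical toy frames, of how to surface and remove such duplicates before merging. Passing `validate='many_to_one'` makes pandas raise an error if the right-hand keys are not unique.

```python
import pandas as pd

# Toy stand-ins for df2 (left) and the scraped coordinates (right).
left = pd.DataFrame({'puerto': ['A', 'B'], 'url': ['u1', 'u2']})
right = pd.DataFrame({
    'url': ['u1', 'u1'],  # the same key appears twice
    'lat': ['40.621878', '40.621878'],
    'long': ['-4.179733', '-4.179833'],
})

# Rows of the right-hand frame that share a url with another row.
duplicated = right[right.duplicated(subset='url', keep=False)]

# Keep the first occurrence of each url, then merge with a uniqueness check.
right_unique = right.drop_duplicates(subset='url', keep='first')
merged = left.merge(right_unique, how='left', on='url', validate='many_to_one')

print(len(duplicated), len(merged))
```

Which of the duplicated rows to keep is a judgment call; `keep='first'` is only one option.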


In [25]:
df3.head()

| | puerto | provincia | pueblo | altitud | desnivel | distancia | pendiente | coeficiente | url | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Granada | El Purche-Sabinas-Pradollano | 3384 | 2662 | 46.0 | 5.7 | 588 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 1 | Iram | Granada | Haza Llana | 2845 | 2080 | 35.0 | 5.8 | 546 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 2 | Angliru | Asturias | Santa Eulalia | 1570 | 1423 | 18.0 | 7.0 | 528 | https://www.altimetrias.net/aspbk/verPuerto.as... | 43.221596 | -5.94178 |
| 3 | Ancares Desde Navia De Suarna | Lugo | Pan do Zarco | 1670 | 1875 | 35.0 | 3.8 | 512 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 4 | Gamoniteiro | Asturias | Pola-Cobertoria | 1772 | 1465 | 15.0 | 9.0 | 492 | https://www.altimetrias.net/aspbk/verPuerto.as... | 43.18786 | -5.923458 |


In [26]:
#Creating a dataframe with all ports that are missing their coordinates.

df4 = df3[df3['lat'].isnull()]

In [27]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1066 entries, 0 to 1466
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1066 non-null   object 
 1   provincia    1066 non-null   object 
 2   pueblo       1066 non-null   object 
 3   altitud      1066 non-null   int64  
 4   desnivel     1066 non-null   int64  
 5   distancia    1066 non-null   float64
 6   pendiente    1066 non-null   float64
 7   coeficiente  1066 non-null   int64  
 8   url          1066 non-null   object 
 9   lat          0 non-null      object 
 10  long         0 non-null      object 
dtypes: float64(2), int64(3), object(6)
memory usage: 99.9+ KB


In [28]:
df4.head()

| | puerto | provincia | pueblo | altitud | desnivel | distancia | pendiente | coeficiente | url | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pico Veleta | Granada | El Purche-Sabinas-Pradollano | 3384 | 2662 | 46.0 | 5.7 | 588 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 1 | Iram | Granada | Haza Llana | 2845 | 2080 | 35.0 | 5.8 | 546 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 3 | Ancares Desde Navia De Suarna | Lugo | Pan do Zarco | 1670 | 1875 | 35.0 | 3.8 | 512 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 5 | Sierra De Lújar | Granada | Rubite | 1865 | 1856 | 31.0 | 5.8 | 482 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |
| 8 | Oiz Mendia | Bizkaia | Mendata-Totorika-Gortaguren | 1020 | 1000 | 27.0 | 3.6 | 405 | https://www.altimetrias.net/aspbk/verPerfilusu... | | |


In [29]:
#Exporting and importing to tweak some port names manually.

df4.to_csv('df4.csv', index=False)

In [30]:
df4 = pd.read_csv('df4.csv')

## Requesting the missing coordinates

For this purpose I will use **Geopy**, a geocoding library, through its Nominatim (OpenStreetMap) geocoder.

In [17]:
from geopy.geocoders import Nominatim

In [32]:
#Testing the function.

geolocator = Nominatim(user_agent="Firefox")
location = geolocator.geocode("Pico Veleta, Granada, España")
location

Location(Pico del Veleta, Capileira, Comarca de la Alpujarra Granadina, Granada, Andalucía, 18196, España, (37.0560211, -3.3656885, 0.0))

In [33]:
#Obtaining the latitude:

location[1][0]

37.0560211

In [34]:
#Longitude:

location[1][1]

-3.3656885

In [35]:
#Let's create a loop that iterates through our dataframe and fills in the coordinates whenever possible.

geolocator = Nominatim(user_agent="Firefox")

for i in df4.index:
    buscador = f"{df4.at[i, 'puerto']}, {df4.at[i, 'provincia']}, España"
    try:
        location = geolocator.geocode(buscador)
    except Exception:
        continue  #The request failed: leave the coordinates empty.
    if location is not None:
        #.at writes into df4 itself, avoiding chained assignment.
        df4.at[i, 'lat'] = location.latitude
        df4.at[i, 'long'] = location.longitude
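
Nominatim's public service asks clients to identify themselves with their own user agent and to stay at or below one request per second, while the loop above sends its requests back to back. The sketch below separates the lookup from the loop so that it can be throttled and tested offline. `geocode_ports`, `FakeLocation` and `fake_geocode` are hypothetical names, and the real geocoder would be passed in as shown in the comment.

```python
def geocode_ports(queries, geocode):
    """Map each query string to (lat, long), or None when nothing is found."""
    results = {}
    for query in queries:
        try:
            location = geocode(query)
        except Exception:  # network or service failure: record as missing
            location = None
        if location is None:
            results[query] = None
        else:
            results[query] = (location.latitude, location.longitude)
    return results

# With geopy, a throttled geocoder could be built like this (not run here):
#   from geopy.geocoders import Nominatim
#   from geopy.extra.rate_limiter import RateLimiter
#   geolocator = Nominatim(user_agent="my-ports-notebook")  # hypothetical name
#   geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Offline stand-in that mimics the two attributes used from a geopy Location.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(query):
    known = {"Pico Veleta, Granada, España": FakeLocation(37.0560211, -3.3656885)}
    return known.get(query)

coords = geocode_ports(
    ["Pico Veleta, Granada, España", "Unknown, Nowhere, España"], fake_geocode
)
print(coords)
```

Keeping the results in a dictionary keyed by query also makes it easy to cache them and avoid repeating requests on a second run.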


In [36]:
#We obtained the coordinates for 707 additional ports. The next step is to create two dataframes: one with the ports that
#have coordinates and one with those that don't.

df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1066 entries, 0 to 1065
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1066 non-null   object 
 1   provincia    1066 non-null   object 
 2   pueblo       1066 non-null   object 
 3   altitud      1066 non-null   int64  
 4   desnivel     1066 non-null   int64  
 5   distancia    1066 non-null   float64
 6   pendiente    1066 non-null   float64
 7   coeficiente  1066 non-null   int64  
 8   url          1066 non-null   object 
 9   lat          707 non-null    float64
 10  long         707 non-null    float64
dtypes: float64(4), int64(3), object(4)
memory usage: 91.7+ KB


## Creating the final dataframes

In [37]:
#First, the dataframe with missing coordinates.

df5 = df4[df4['lat'].isnull()]

In [38]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 359 entries, 2 to 1063
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       359 non-null    object 
 1   provincia    359 non-null    object 
 2   pueblo       359 non-null    object 
 3   altitud      359 non-null    int64  
 4   desnivel     359 non-null    int64  
 5   distancia    359 non-null    float64
 6   pendiente    359 non-null    float64
 7   coeficiente  359 non-null    int64  
 8   url          359 non-null    object 
 9   lat          0 non-null      float64
 10  long         0 non-null      float64
dtypes: float64(4), int64(3), object(4)
memory usage: 33.7+ KB


In [39]:
#Saving our dataframe of missing coordinates.

df5.to_csv('df_missing_routes.csv', index=False)

In [40]:
#And second, the dataframe with coords.

df6 = df4.dropna(how='any')

In [41]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 707 entries, 0 to 1065
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       707 non-null    object 
 1   provincia    707 non-null    object 
 2   pueblo       707 non-null    object 
 3   altitud      707 non-null    int64  
 4   desnivel     707 non-null    int64  
 5   distancia    707 non-null    float64
 6   pendiente    707 non-null    float64
 7   coeficiente  707 non-null    int64  
 8   url          707 non-null    object 
 9   lat          707 non-null    float64
 10  long         707 non-null    float64
dtypes: float64(4), int64(3), object(4)
memory usage: 66.3+ KB


In [42]:
#Selecting the ports whose coordinates came from the Altimetrias.net map.

df7 = df3.dropna(how='any')

In [43]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402 entries, 2 to 1467
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       402 non-null    object 
 1   provincia    402 non-null    object 
 2   pueblo       402 non-null    object 
 3   altitud      402 non-null    int64  
 4   desnivel     402 non-null    int64  
 5   distancia    402 non-null    float64
 6   pendiente    402 non-null    float64
 7   coeficiente  402 non-null    int64  
 8   url          402 non-null    object 
 9   lat          402 non-null    object 
 10  long         402 non-null    object 
dtypes: float64(2), int64(3), object(6)
memory usage: 37.7+ KB


In [44]:
#Appending the geocoded ports to the scraped ones (DataFrame.append was removed in pandas 2.0, so we use pd.concat).

df8 = pd.concat([df7, df6], ignore_index=True)

In [45]:
df8.head()

| | puerto | provincia | pueblo | altitud | desnivel | distancia | pendiente | coeficiente | url | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angliru | Asturias | Santa Eulalia | 1570 | 1423 | 18.0 | 7.0 | 528 | https://www.altimetrias.net/aspbk/verPuerto.as... | 43.221596 | -5.94178 |
| 1 | Gamoniteiro | Asturias | Pola-Cobertoria | 1772 | 1465 | 15.0 | 9.0 | 492 | https://www.altimetrias.net/aspbk/verPuerto.as... | 43.18786 | -5.923458 |
| 2 | Peña Escrita | Granada | Torrecuevas | 1200 | 1150 | 13.0 | 8.0 | 462 | https://www.altimetrias.net/aspbk/verPuerto.as... | 36.818155 | -3.771034 |
| 3 | Ancares | Lugo | Sª Morela-Balouta | 1670 | 1355 | 36.0 | 3.0 | 427 | https://www.altimetrias.net/aspbk/verPuerto.as... | 42.868532 | -6.818333 |
| 4 | Pajares-Cuitu Negru | Asturias | Campomanes | 1843 | 1466 | 25.0 | 5.0 | 394 | https://www.altimetrias.net/aspbk/verPuerto.as... | 42.96824 | -5.788388 |


In [46]:
df8.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1109 entries, 0 to 1108
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   puerto       1109 non-null   object 
 1   provincia    1109 non-null   object 
 2   pueblo       1109 non-null   object 
 3   altitud      1109 non-null   int64  
 4   desnivel     1109 non-null   int64  
 5   distancia    1109 non-null   float64
 6   pendiente    1109 non-null   float64
 7   coeficiente  1109 non-null   int64  
 8   url          1109 non-null   object 
 9   lat          1109 non-null   object 
 10  long         1109 non-null   object 
dtypes: float64(2), int64(3), object(6)
memory usage: 95.4+ KB


In [53]:
#Exporting our dataframe:

df8.to_csv('master_df_puertos_coords.csv', index=False)

# Conclusions

While we managed to obtain the coordinates for 1109 ports, 359 are still missing them. Most of those are quite obscure, but if we are aiming for completeness their coordinates will have to be collected manually.

Thankfully this won't really hurt our results, since we have the geographical data for the most important ports and 1109 is plenty of data to work with.
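
As a quick check of these totals, the counts reported by the `info()` calls above can be reconciled with a few lines of arithmetic (all three input figures are copied from this notebook's outputs):

```python
scraped = 402    # merged rows that got coordinates from the Altimetrias.net map
missing = 1066   # merged rows left without coordinates after the merge
geocoded = 707   # of those, rows resolved through Geopy

with_coords = scraped + geocoded       # ports in the final dataframe
still_missing = missing - geocoded     # ports saved as missing
total = with_coords + still_missing    # rows of the merged dataframe

print(with_coords, still_missing, total)
print(round(100 * with_coords / total, 1))  # share of ports with coordinates, in %
```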

**<div align="right">Ironhack DA PT 2021</div>**
    
**<div align="right">Xavier Esteban</div>**