# Creating a pipeline for new ports

Our routes database is already extensive, but we will probably add new ports (mountain passes) in the future. Routes that climb those new ports are very likely already in our database, so it is in our best interest to combine all the necessary data manipulation and parsing steps in a single notebook (or function) that can be re-run on demand.

## Cleaning our original routes dataframe for testing

Before we begin creating the pipeline we will clean and manipulate our original routes dataframe so that it meets our requirements.

In [1]:
import pandas as pd
import haversine as hs
import time

In [2]:
routes = pd.read_csv('master_1407.csv')

In [3]:
routes.head()

Unnamed: 0,ID,nombre,ccaa,provincia,coords,alt,start,midpoint,distancia,desnivel,min_alt,max_alt,municipios,puertos,trailrank,url
0,0,01-Madrid - Motilla del Palancar,,,"[(40.39467, -3.67912), (40.39546, -3.67998), (...","[592.065, 597.068, 596.014, 597.008, 598.067, ...","(40.39467, -3.67912)","(40.09315, -2.891046)",229,1884,544,976,,,27,https://es.wikiloc.com/rutas-ciclismo/01-madri...
1,1,01-MAY-16 ALMÁCERA-BÉTERA-OLOCAU-GÁTOVA-ALTO D...,,,"[(39.510125, -0.355943), (39.510517, -0.35574)...","[-79.616, -79.676, -79.613, -79.208, -79.662, ...","(39.510125, -0.355943)","(39.809736, -0.515215)",117,1292,0,729,,,21,https://es.wikiloc.com/rutas-ciclismo/01-may-1...
2,2,"02-AGO-15 Coll de La Gallina, Port de Beixalís...",,,"[(42.511074, 1.549479), (42.511086, 1.549457),...","[1054.713, 1059.043, 1064.307, 1064.808, 1069....","(42.511074, 1.549479)","(42.532589, 1.561706)",93,2850,912,2082,,,62,https://es.wikiloc.com/rutas-ciclismo/02-ago-1...
3,3,02-Motilla del Palancar - Valencia,,,"[(39.561199, -1.906015), (39.561199, -1.906015...","[665.256, 665.259, 665.214, 665.208, 665.036, ...","(39.561199, -1.906015)","(39.374283, -1.012429)",167,1001,0,734,,,38,https://es.wikiloc.com/rutas-ciclismo/02-motil...
4,4,05-ABR-15 Les Tres Cales,,,"[(40.913227, 0.804593), (40.913242, 0.804572),...","[63.634, 63.155, 59.71, 59.307, 56.462, 54.985...","(40.913227, 0.804593)","(40.905964, 0.740497)",27,416,25,191,,,27,https://es.wikiloc.com/rutas-ciclismo/05-abr-1...


In [4]:
#Renaming the columns.

routes.rename(columns = {'nombre': 'name', 'provincia': 'province', 'distancia': 'distance', 'desnivel': 'gradient', 'municipios': 'municipalities_ids', 'puertos': 'mountain_passes_ids'}, inplace = True)

In [5]:
routes.head(1)

Unnamed: 0,ID,name,ccaa,province,coords,alt,start,midpoint,distance,gradient,min_alt,max_alt,municipalities_ids,mountain_passes_ids,trailrank,url
0,0,01-Madrid - Motilla del Palancar,,,"[(40.39467, -3.67912), (40.39546, -3.67998), (...","[592.065, 597.068, 596.014, 597.008, 598.067, ...","(40.39467, -3.67912)","(40.09315, -2.891046)",229,1884,544,976,,,27,https://es.wikiloc.com/rutas-ciclismo/01-madri...


In [6]:
#Creating a new column for the gpx file url.

routes['gpx_link'] = None

In [7]:
#Re-ordering the columns.

routes = routes[['ID', 'name', 'ccaa', 'province', 'start', 'midpoint', 'trailrank', 'distance', 'gradient', 'min_alt', 'max_alt', 'mountain_passes_ids', 'municipalities_ids', 'coords', 'alt','gpx_link']]

In [8]:
#Keeping only routes longer than 30 km, shorter than 230 km and with a 'gradient' (desnivel) below 4700.

routes = routes[routes['distance'] < 230]
routes = routes[routes['distance'] > 30]
routes = routes[routes['gradient'] < 4700]

In [9]:
#Resetting the index.

routes = routes.reset_index(drop=True)

In [10]:
routes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9497 entries, 0 to 9496
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   9497 non-null   int64  
 1   name                 9497 non-null   object 
 2   ccaa                 0 non-null      float64
 3   province             0 non-null      float64
 4   start                9497 non-null   object 
 5   midpoint             9497 non-null   object 
 6   trailrank            9497 non-null   int64  
 7   distance             9497 non-null   int64  
 8   gradient             9497 non-null   int64  
 9   min_alt              9497 non-null   int64  
 10  max_alt              9497 non-null   int64  
 11  mountain_passes_ids  0 non-null      float64
 12  municipalities_ids   0 non-null      float64
 13  coords               9497 non-null   object 
 14  alt                  9497 non-null   object 
 15  gpx_link             0 non-null      object 
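
The cleaning steps in this section could be bundled into one function so they can be re-run whenever new ports are added. The following is a minimal sketch, assuming the raw CSV always has the columns shown in the `routes.head()` output. `clean_routes`, `COLUMN_NAMES`, `COLUMN_ORDER` and the keyword parameters are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

# Spanish-to-English column names, copied from the renaming cell.
COLUMN_NAMES = {
    'nombre': 'name', 'provincia': 'province', 'distancia': 'distance',
    'desnivel': 'gradient', 'municipios': 'municipalities_ids',
    'puertos': 'mountain_passes_ids',
}

# Final column order, copied from the re-ordering cell.
COLUMN_ORDER = ['ID', 'name', 'ccaa', 'province', 'start', 'midpoint',
                'trailrank', 'distance', 'gradient', 'min_alt', 'max_alt',
                'mountain_passes_ids', 'municipalities_ids', 'coords',
                'alt', 'gpx_link']


def clean_routes(raw, min_km=30, max_km=230, max_gradient=4700):
    """Hypothetical helper: rename, re-order and filter a raw routes dataframe."""
    df = raw.rename(columns=COLUMN_NAMES)  # returns a new dataframe
    df['gpx_link'] = None                  # placeholder for the GPX file URL
    df = df[COLUMN_ORDER]
    # Same strict thresholds as the filtering cell.
    keep = ((df['distance'] > min_km)
            & (df['distance'] < max_km)
            & (df['gradient'] < max_gradient))
    return df[keep].reset_index(drop=True)
```

With this in place, `routes = clean_routes(pd.read_csv('master_1407.csv'))` would replace cells 2 to 9, and the thresholds stay visible as arguments instead of being scattered across cells.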

## Deleting non-circular routes

We only want circular routes, so we will create a new column with the last coordinate of each route and calculate its distance from the start point. Routes where that distance exceeds 2 km will be deleted.
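
The rule can be sketched without any dataframe. This is an illustration under assumptions, not the notebook's implementation: `haversine_km` is a pure-Python stand-in for `hs.haversine`, `is_circular` is a hypothetical helper, and the route's first coordinate is assumed to equal its `start` value. It also parses the stored string with `ast.literal_eval`, which only accepts Python literals and is therefore a safer choice than `eval`.

```python
import ast
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_circular(coords_text, max_gap_km=2):
    """True when the first and last points are at most max_gap_km apart."""
    coords = ast.literal_eval(coords_text)  # parses "[(lat, lon), ...]"
    return haversine_km(coords[0], coords[-1]) <= max_gap_km
```

For example, `is_circular("[(40.39467, -3.67912), (40.5, -3.7), (40.39546, -3.67998)]")` is true because the route ends about a hundred metres from where it started.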

In [11]:
#Creating column to hold finish coordinates.

routes['finish'] = None

In [12]:
#Extracting the finish coordinates as the last tuple in the 'coords' list.
#Assigning the whole column at once avoids chained indexing.

routes['finish'] = [eval(coords)[-1] for coords in routes['coords']]


In [13]:
#Creating a dummy column.

routes['is_circular'] = None

In [14]:
#Populating it with 'yes' if the start and finish are at most 2 km apart. Otherwise it's a 'no'.

start = time.time()

#Building the whole column at once avoids chained indexing.
routes['is_circular'] = [
    'yes' if hs.haversine(eval(s), f) <= 2 else 'no'
    for s, f in zip(routes['start'], routes['finish'])
]

stop = time.time()
duration = (stop - start) / 60
print('Minutes:', duration)


Minutes: 0.07371410131454467


In [15]:
#Deleting non-circular routes and the helper columns.

routes = routes[routes['is_circular'] == 'yes'].drop(columns=['is_circular', 'finish'])

In [16]:
#Reindexing.

routes = routes.reset_index(drop=True)

In [17]:
#We're down to 8651 routes.

routes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8651 entries, 0 to 8650
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   8651 non-null   int64  
 1   name                 8651 non-null   object 
 2   ccaa                 0 non-null      float64
 3   province             0 non-null      float64
 4   start                8651 non-null   object 
 5   midpoint             8651 non-null   object 
 6   trailrank            8651 non-null   int64  
 7   distance             8651 non-null   int64  
 8   gradient             8651 non-null   int64  
 9   min_alt              8651 non-null   int64  
 10  max_alt              8651 non-null   int64  
 11  mountain_passes_ids  0 non-null      float64
 12  municipalities_ids   0 non-null      float64
 13  coords               8651 non-null   object 
 14  alt                  8651 non-null   object 
 15  gpx_link             0 non-null      object 

# Finding which ports each route climbs

Now that we've cleaned our routes dataframe, it's time to find which ports each route climbs.

In [18]:
#Loading our ports dataset.

ports = pd.read_csv('puertos.csv')

In [27]:
#This function checks whether the midpoint of route a is less than 80 km from the peak of port b.
#Both a and b are row positions in routes and ports.

def isnear(a, b):
    if hs.haversine(eval(routes['midpoint'].iloc[a]), eval(ports['peak_coords'].iloc[b])) < 80:
        return 'Yes'
    else:
        return 'No'

In [None]:
#This loop builds a dataframe with each route ID and the IDs of the ports it climbs.
#A port counts as climbed when a sampled route point (every 30th) is within 0.3 km of its peak.

start = time.time()

dict_list = []

for i in range(len(routes)):
    lista_puertos = []
    new_c = None  #Route coordinates, parsed once and only if a port is near.
    for p in range(len(ports)):
        if isnear(i, p) == 'Yes':
            if new_c is None:
                new_c = eval(routes['coords'].iloc[i])
            peak = eval(ports['peak_coords'].iloc[p])
            for n in new_c[0::30]:
                if hs.haversine(n, peak) < 0.3:
                    if ports['ID'].iloc[p] not in lista_puertos:
                        lista_puertos.append(ports['ID'].iloc[p])
                    break  #One hit is enough for this port.
    new = {'ruta': routes['ID'].iloc[i], 'puertos': lista_puertos}
    dict_list.append(new)
    
test = pd.DataFrame(dict_list)

stop = time.time() 
duration = (stop - start) / 60
print('Minutes:', duration)
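
The two-stage search above (a coarse 80 km check against the route midpoint, then a fine 0.3 km check against every 30th route point) can be isolated as a function of plain Python values, which makes it easier to test on a single route. This is a sketch under assumptions: `ports_on_route` and its parameters are hypothetical names, `haversine_km` is a pure-Python stand-in for `hs.haversine`, and coordinates are assumed to be already parsed into `(lat, lon)` tuples.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def ports_on_route(coords, midpoint, ports, near_km=80, hit_km=0.3, step=30):
    """Return the IDs of ports whose peak is within hit_km of a sampled point.

    ports is an iterable of (port_id, peak) pairs, peak being (lat, lon).
    """
    sampled = coords[::step]  # every step-th point, starting with the first
    found = []
    for port_id, peak in ports:
        # Coarse filter: skip ports far away from the route midpoint.
        if haversine_km(midpoint, peak) >= near_km:
            continue
        # Fine check: is any sampled point close enough to the peak?
        if any(haversine_km(point, peak) < hit_km for point in sampled):
            if port_id not in found:
                found.append(port_id)
    return found
```

Note that sampling trades accuracy for speed: if route points are sparse, a peak can fall between two sampled points and be missed, so `step` and `hit_km` should be tuned together.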

In [None]:
test.head()

In [None]:
routes.info()

# Matching routes with towns