# Route Scraping

The scraper kept timing out, so I had to scrape in batches. The final data is stored in "all_routes_and_desc.csv".

## Get list of URLs

Mountain Project can export a list of routes as a CSV, but only 1,000 routes at a time, so I ran the export about 150 times and combined the files. Each CSV includes the route URL, so I can iterate over that column to scrape each page and later merge the scraped data back on it.

In [147]:
import os
import glob
import time

import pandas as pd

In [4]:
# Combine the CSVs exported directly from Mountain Project.
# Skip the combined output so re-running this cell does not read it back in.
output_file = "all_routes.csv"
all_filenames = [f for f in glob.glob("*.csv") if f != output_file]

# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# Export to CSV, then reload and drop routes that appear in more than one export
combined_csv.to_csv(output_file, index=False, encoding="utf-8-sig")
routes = pd.read_csv(output_file)
routes.drop_duplicates(subset="URL", inplace=True)

# Remove routes with no rating (stored as -1 in "Avg Stars")
routes = routes[routes["Avg Stars"] != -1]

In [22]:
len(routes)

116767

In [23]:
routes.head()

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude
0,Access Denied,El Mirador > El Potrero Chico > Nuevo Leon > N...,https://www.mountainproject.com/route/11014983...,2.9,-1,Sport,5.10b/c,4,350.0,25.95044,-100.47755
1,Agave Nectar,Sugar Shack > Cougar Canyon (Creek) - CONSTRUC...,https://www.mountainproject.com/route/11091386...,2.0,-1,Sport,5.10b/c,1,,51.09642,-115.31767
3,Ant & Bee do Yoga,The Hen House > Kamloops > British Columbia > ...,https://www.mountainproject.com/route/11240652...,2.7,-1,Trad,5.10b/c,1,,50.57212,-120.13874
4,Besame Fuerte,Pilon De Lolita > Loreto Area > Baja Californi...,https://www.mountainproject.com/route/11608640...,2.0,-1,Sport,5.10b/c,1,80.0,26.01097,-111.34166
5,Big Momma's Rock,The Courtyard > Mamquam FSR > Squamish > Briti...,https://www.mountainproject.com/route/11445772...,3.0,-1,Sport,5.10b/c,1,60.0,49.71393,-123.09943


# Scrape each route page for its description

In [24]:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

In [182]:
def description_scrape(url_to_scrape, write=True):
    """Get the description sections and star-vote count for a route URL.

    If write is True, append one row to the already-open global file `f`;
    otherwise return the route dictionary.
    """
    # Fetch and parse the page
    with uReq(url_to_scrape) as uClient:  # request the URL; closes the connection on exit
        page_html = uClient.read()  # read the HTML
    route_soup = soup(page_html, "html.parser")

    # Section headings (e.g. "Description", "Protection")
    heading_container = route_soup.findAll("h2", {"class": "mt-2"})
    headers = [h.text.strip() for h in heading_container]

    # Text of each section
    desc_container = route_soup.findAll("div", {"class": "fr-view"})
    words = [d.text for d in desc_container]

    # Combine into dictionary
    route_dict = dict(zip(headers, words))

    # Add URL to dictionary
    route_dict["URL"] = url_to_scrape

    # Get number of votes on star rating and add to dictionary
    star_container = route_soup.find("span", id="route-star-avg")
    num_votes = int(star_container.span.span.text.strip().split("from")[1].split("\n")[0].replace(",", ""))
    route_dict["star_votes"] = num_votes

    if write:
        # Write one comma-separated row; commas inside the text become "~"
        f.write(route_dict["URL"] + "," +
                route_dict.setdefault("Description", "none listed").replace(",", "~") + "," +
                route_dict.setdefault("Protection", "none listed").replace(",", "~") + "," +
                str(route_dict["star_votes"]) + "\n")
    else:
        return route_dict
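
The vote count above is pulled out with a chain of `split` calls, which raises an error if the text is shaped differently. A minimal sketch of the same idea with a regular expression follows. The helper name `parse_vote_count` and the sample strings are hypothetical, and the assumed text shape ("... from 22" followed by a line break) is inferred from the parsing code rather than copied from a real page.

```python
import re


def parse_vote_count(star_text):
    """Return the integer that follows 'from' in the star summary, or None.

    Assumes text shaped like 'Avg: 2.9 from 1,234\\n votes' (hypothetical sample).
    """
    match = re.search(r"from\s+(\d[\d,]*)", star_text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group(1).replace(",", ""))


votes = parse_vote_count("Avg: 2.9 from 1,234\n votes")
missing = parse_vote_count("no star summary here")
```

Returning `None` instead of raising lets the caller decide whether a page without a vote count should be skipped or recorded with a placeholder.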

In [131]:
### Testing ###

# Open a new file; the with-block closes it even if the scrape fails
filename = "flake.csv"
with open(filename, "w") as f:
    # No spaces after the commas, so pandas reads clean column names
    headers = "URL,desc,protection,num_votes\n"
    f.write(headers)

    # Test it out
    url = "https://www.mountainproject.com/route/115964496/wind-walker-traverse"
    description_scrape(url)
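
Replacing commas with "~" keeps the columns aligned, but descriptions that contain line breaks or quote characters can still split a row. As a hedged alternative (not what this notebook ran), the standard `csv` module quotes such fields automatically. In the sketch below the helper `write_route_row`, the example URL, and the in-memory buffer are illustrative assumptions; the column names match the header used above.

```python
import csv
import io

FIELDNAMES = ["URL", "desc", "protection", "num_votes"]


def write_route_row(writer, route_dict):
    """Write one scraped route; csv.DictWriter quotes commas and newlines."""
    writer.writerow({
        "URL": route_dict["URL"],
        "desc": route_dict.get("Description", "none listed"),
        "protection": route_dict.get("Protection", "none listed"),
        "num_votes": route_dict["star_votes"],
    })


# In-memory buffer standing in for the output file (illustration only)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
write_route_row(writer, {
    "URL": "https://example.com/route/1",  # made-up URL
    "Description": "Thin crack, then a roof.\nWalk off.",
    "star_votes": 3,
})

# Read the row back to confirm the text survived unchanged
buffer.seek(0)
rows = list(csv.DictReader(buffer))
```

With a real file, open it with `newline=""` as the `csv` documentation recommends; `pd.read_csv` reads the quoted fields back without extra options.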

## Get URLs and scrape

In [158]:
os.chdir("/Users/patriciadegner/Documents/MIDS/DL/final_project")
os.getcwd()

'/Users/patriciadegner/Documents/MIDS/DL/final_project'

In [231]:
%%capture
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
tqdm.pandas()

In [197]:
all_route_urls = list(routes["URL"])

## An attempt to fix this timeout issue:

https://stackoverflow.com/questions/46315135/connection-error-timeout-when-i-am-web-scrapping-with-python-on-debian

In [210]:
from functools import wraps
import time
from requests.exceptions import RequestException
from socket import timeout
from urllib.error import URLError

class Retry(object):
    """Decorator that retries a function call a number of times, optionally
    with particular exceptions triggering a retry, whereas unlisted exceptions
    are raised.
    :param times: Number of times to call the function before giving up
    :param exceptions: Exception class, or tuple of classes, that triggers a
                       retry
    :param pause: Number of seconds to pause before retrying
    :param retreat: Factor by which to extend pause time each retry
    :param max_pause: Maximum time to pause before retry. Overrides pause times
                      calculated by retreat.
    :param cleanup: Function to run if all retries fail. Takes the same
                    arguments as the decorated function.
    """
    def __init__(self, times, exceptions=(IndexError,), pause=1, retreat=1,
                 max_pause=None, cleanup=None):
        """Initialise all input params"""
        self.times = times
        self.exceptions = exceptions
        self.pause = pause
        self.retreat = retreat
        self.max_pause = max_pause or (pause * retreat ** times)
        self.cleanup = cleanup

    def __call__(self, f):
        """
        A decorator function to retry a function (ie API call, web query) a
        number of times, with optional exceptions under which to retry.

        Returns results of a cleanup function if all retries fail.
        :return: decorator function.
        """
        @wraps(f)
        def wrapped_f(*args, **kwargs):
            for i in range(self.times):
                # Exponential backoff if required and limit to a max pause time
                pause = min(self.pause * self.retreat ** i, self.max_pause)
                try:
                    return f(*args, **kwargs)
                except self.exceptions:
                    if self.pause is not None:
                        time.sleep(pause)
                    else:
                        pass
            if self.cleanup is not None:
                return self.cleanup(*args, **kwargs)
        return wrapped_f


In [211]:
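# urlopen raises URLError (HTTPError is a subclass) or socket.timeout,
# not requests' RequestException, so retry on those.
retry = Retry(times=5, pause=1, retreat=2, exceptions=(URLError, timeout))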
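
To make the backoff concrete, the sketch below recomputes the pause before each retry using the same formula as `Retry.__call__`. The helper `backoff_schedule` is a hypothetical name added for illustration and is not part of the class.

```python
def backoff_schedule(times, pause=1, retreat=1, max_pause=None):
    """Pause in seconds after each failed attempt, using the Retry formula."""
    # Same default cap as Retry.__init__: pause * retreat ** times
    cap = max_pause or (pause * retreat ** times)
    return [min(pause * retreat ** i, cap) for i in range(times)]


# Settings used for the scraper: 5 attempts, doubling from 1 second
schedule = backoff_schedule(times=5, pause=1, retreat=2)

# The same settings with the pause capped at 5 seconds
capped = backoff_schedule(times=5, pause=1, retreat=2, max_pause=5)
```

With `times=5, pause=1, retreat=2` the pauses are 1, 2, 4, 8, and 16 seconds, so a URL that fails every attempt costs about 31 seconds of sleeping before the loop moves on.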

In [214]:
@retry
def description_scrape(url_to_scrape, write=True):
    """Get the description sections and star-vote count for a route URL.

    If write is True, append one row to the already-open global file `f`;
    otherwise return the route dictionary.
    """
    # Fetch and parse the page
    with uReq(url_to_scrape) as uClient:  # request the URL; closes the connection on exit
        page_html = uClient.read()  # read the HTML
    route_soup = soup(page_html, "html.parser")

    # Section headings (e.g. "Description", "Protection")
    heading_container = route_soup.findAll("h2", {"class": "mt-2"})
    headers = [h.text.strip() for h in heading_container]

    # Text of each section
    desc_container = route_soup.findAll("div", {"class": "fr-view"})
    words = [d.text for d in desc_container]

    # Combine into dictionary
    route_dict = dict(zip(headers, words))

    # Add URL to dictionary
    route_dict["URL"] = url_to_scrape

    # Get number of votes on star rating and add to dictionary
    star_container = route_soup.find("span", id="route-star-avg")
    num_votes = int(star_container.span.span.text.strip().split("from")[1].split("\n")[0].replace(",", ""))
    route_dict["star_votes"] = num_votes

    if write:
        # Write one comma-separated row; commas inside the text become "~"
        f.write(route_dict["URL"] + "," +
                route_dict.setdefault("Description", "none listed").replace(",", "~") + "," +
                route_dict.setdefault("Protection", "none listed").replace(",", "~") + "," +
                str(route_dict["star_votes"]) + "\n")
    else:
        return route_dict

In [None]:
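import requests

# Check that the site answers a browser-like request before starting the full run
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'})
print(response.status_code)

t0 = time.time()

# Open a new file; the with-block closes it even if the loop is interrupted
filename = "route_desc.csv"
with open(filename, "w") as f:
    # No spaces after the commas, so pandas reads clean column names
    headers = "URL,desc,protection,num_votes\n"
    f.write(headers)

    for route_url in tqdm(all_route_urls):
        description_scrape(route_url)
        time.sleep(.05)  # brief pause between requests

t1 = time.time()
print(t1 - t0)  # elapsed time in seconds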

In [160]:
f.close()
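
Because the scrape had to run in batches, it helps to skip URLs that already have a row in the output file. The sketch below shows one way to do that; the helpers `urls_already_scraped` and `remaining_urls`, the demo file, and the example URLs are all hypothetical, and the only assumption about the output is that it has a `URL` column as in the header above.

```python
import csv
import os
import tempfile


def urls_already_scraped(path):
    """Return the set of URLs found in an existing output CSV (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["URL"] for row in csv.DictReader(handle) if row.get("URL")}


def remaining_urls(all_urls, path):
    """Keep only the URLs that do not yet have a row in the output CSV."""
    done = urls_already_scraped(path)
    return [u for u in all_urls if u not in done]


# Small demonstration with a throwaway file and made-up URLs
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "route_desc_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("URL,desc,protection,num_votes\n")
        handle.write("https://example.com/route/1,first,bolts,2\n")
    todo = remaining_urls(
        ["https://example.com/route/1", "https://example.com/route/2"], demo_path
    )
    todo_without_file = remaining_urls(
        ["https://example.com/route/1"], os.path.join(tmp, "missing.csv")
    )
```

For a later batch, the loop would iterate over `remaining_urls(all_route_urls, "route_desc.csv")` and open the output in append mode (`"a"`) without rewriting the header, since `"w"` truncates the earlier rows.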

# Combine the route csv and description csv

In [232]:
routes.head()

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude
0,Access Denied,El Mirador > El Potrero Chico > Nuevo Leon > N...,https://www.mountainproject.com/route/11014983...,2.9,-1,Sport,5.10b/c,4,350.0,25.95044,-100.47755
1,Agave Nectar,Sugar Shack > Cougar Canyon (Creek) - CONSTRUC...,https://www.mountainproject.com/route/11091386...,2.0,-1,Sport,5.10b/c,1,,51.09642,-115.31767
3,Ant & Bee do Yoga,The Hen House > Kamloops > British Columbia > ...,https://www.mountainproject.com/route/11240652...,2.7,-1,Trad,5.10b/c,1,,50.57212,-120.13874
4,Besame Fuerte,Pilon De Lolita > Loreto Area > Baja Californi...,https://www.mountainproject.com/route/11608640...,2.0,-1,Sport,5.10b/c,1,80.0,26.01097,-111.34166
5,Big Momma's Rock,The Courtyard > Mamquam FSR > Squamish > Briti...,https://www.mountainproject.com/route/11445772...,3.0,-1,Sport,5.10b/c,1,60.0,49.71393,-123.09943


In [234]:
route_desc = pd.read_csv("route_desc.csv")
route_desc.head()

Unnamed: 0,URL,desc,protection,num_votes
0,https://www.mountainproject.com/route/11408243...,The crux lies as the crack really starts to op...,1 pad,2
1,https://www.mountainproject.com/route/11179794...,Start 10 feet to the right of Smooth Criminal....,Gear to 1 inch~ 3 bolts,5
2,https://www.mountainproject.com/route/10623511...,The route follows a very large and very beauti...,regular trad rack,1
3,https://www.mountainproject.com/route/10572417...,Right side of the lower Walls. Climb through ...,Single set of cams orange TCU and up.,5
4,https://www.mountainproject.com/route/11457624...,Starts in an alcove~ hand and finger crack. Fi...,small to 2.5 inches,1


In [235]:
merged = routes.merge(route_desc, on='URL')
merged.to_csv("all_routes_and_desc.csv", index=False)
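
`merge` defaults to an inner join, so any route whose URL has no row in `route_desc.csv` is dropped from `merged`. A minimal sketch for listing those routes uses a left join with `indicator=True`; the two small frames below are made-up stand-ins for `routes` and `route_desc`.

```python
import pandas as pd

# Made-up stand-ins for `routes` and `route_desc`
routes_demo = pd.DataFrame({"URL": ["u1", "u2", "u3"], "Route": ["A", "B", "C"]})
desc_demo = pd.DataFrame({"URL": ["u1", "u3"], "desc": ["d1", "d3"]})

# A left join keeps every route; `_merge` records where each row came from
checked = routes_demo.merge(desc_demo, on="URL", how="left", indicator=True)

# Routes that were never scraped successfully
unmatched = checked.loc[checked["_merge"] == "left_only", "URL"].tolist()
```

Running the same check on the real frames gives the URLs to feed into another scraping batch.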

In [241]:
df = pd.read_csv("all_routes_and_desc.csv")
df.drop(["Your Stars"], axis = 1, inplace=True)
df.head()

Unnamed: 0,Route,Location,URL,Avg Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,desc,protection,num_votes
0,Access Denied,El Mirador > El Potrero Chico > Nuevo Leon > N...,https://www.mountainproject.com/route/11014983...,2.9,Sport,5.10b/c,4,350.0,25.95044,-100.47755,This is a really great route~ with awesome exp...,12 draws + 60m Rope Take 22 draws if you wan...,22
1,Agave Nectar,Sugar Shack > Cougar Canyon (Creek) - CONSTRUC...,https://www.mountainproject.com/route/11091386...,2.0,Sport,5.10b/c,1,,51.09642,-115.31767,from tabvar: Cool fins to roof~ thin holds...,4 bolts to anchor,1
2,Ant & Bee do Yoga,The Hen House > Kamloops > British Columbia > ...,https://www.mountainproject.com/route/11240652...,2.7,Trad,5.10b/c,1,,50.57212,-120.13874,A safe mixed route with a bit of run out up to...,"mixed~ gear to 4""",3
3,Besame Fuerte,Pilon De Lolita > Loreto Area > Baja Californi...,https://www.mountainproject.com/route/11608640...,2.0,Sport,5.10b/c,1,80.0,26.01097,-111.34166,Start on a slab under a left leaning arched ro...,bolts,1
4,Big Momma's Rock,The Courtyard > Mamquam FSR > Squamish > Briti...,https://www.mountainproject.com/route/11445772...,3.0,Sport,5.10b/c,1,60.0,49.71393,-123.09943,Fun technical climbing. Tricky right off the bat.,bolts,3


In [242]:
len(df)

116700