## Getting Mountain Descriptions from Summit Post

This notebook scrapes the full mountain descriptions from the top N listing pages of https://summitpost.org, a crowd-sourced resource for mountaineering and hiking information. The listing is sorted by number of hits, in descending order.

In [1]:
import requests
import time
import random
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from bs4 import BeautifulSoup

First, I collect the URL of each mountain's page. Each listing page shows 24 mountains.

In [2]:
mtn_urls = []  # module-level list; mtns_top_hits() appends to it

def mtns_top_hits(num_pages):
    """Collect mountain page URLs from the first num_pages listing pages, sorted by hits."""
    for i in range(1, num_pages + 1):
        top_url = f'https://www.summitpost.org/mountain/rock/?object_type=1&search_select_1=name_only&contributor_id=&order_type_1=DESC&object_name_1=&sort_select_1=hits&page={i}'
        response = requests.get(top_url, timeout=30)  # timeout so a stalled request cannot hang the loop
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            data = soup.find_all('div', attrs={'class': 'item-data'})
            for div in data:
                links = div.find_all('a')
                # keep every other link; the alternating links point to each mountain's "parent"
                for a in links[::2]:
                    mtn_urls.append("http://www.summitpost.org" + a['href'])
        else:
            print(f'Response code error on page {i}: {response.status_code}')
    return mtn_urls
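
The long query string in `top_url` can be easier to read and adjust when it is assembled from a dictionary. The sketch below uses only the standard library. The helper names `listing_url` and `pages_needed` are hypothetical (they are not part of the notebook), and the parameter values are copied from the URL above. It also shows the page arithmetic implied by 24 mountains per listing page.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.summitpost.org/mountain/rock/"
MOUNTAINS_PER_PAGE = 24  # from the notebook: each listing page shows 24 mountains


def listing_url(page):
    """Build the URL for one listing page, sorted by hits in descending order."""
    params = {
        "object_type": 1,
        "search_select_1": "name_only",
        "contributor_id": "",   # empty values are kept as "key="
        "order_type_1": "DESC",
        "object_name_1": "",
        "sort_select_1": "hits",
        "page": page,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def pages_needed(num_mountains):
    """Number of listing pages required to cover at least num_mountains."""
    return math.ceil(num_mountains / MOUNTAINS_PER_PAGE)


print(listing_url(1))
print(pages_needed(480))
```

Passing `listing_url(i)` to `requests.get` would be equivalent to the f-string version, since the parameters are emitted in the same order.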

In [3]:
# collect 480 mountain URLs: the top 20 listing pages, 24 mountains each
mtns_top_hits(20)

['http://www.summitpost.org/mount-whitney/150227',
 'http://www.summitpost.org/mount-rainier/150291',
 'http://www.summitpost.org/mount-shasta/150188',
 'http://www.summitpost.org/mount-hood/150189',
 'http://www.summitpost.org/denali/150199',
 'http://www.summitpost.org/mount-elbert/150325',
 'http://www.summitpost.org/katahdin/150219',
 'http://www.summitpost.org/aconcagua/150197',
 'http://www.summitpost.org/mount-adams/150198',
 'http://www.summitpost.org/grand-teton/150312',
 'http://www.summitpost.org/longs-peak/150310',
 'http://www.summitpost.org/matterhorn-monte-cervino/150235',
 'http://www.summitpost.org/mont-blanc/150245',
 'http://www.summitpost.org/eiger/150228',
 'http://www.summitpost.org/mount-mansfield/150938',
 'http://www.summitpost.org/humphreys-peak/150241',
 'http://www.summitpost.org/hatu-peak/154227',
 'http://www.summitpost.org/wheeler-peak-nm/150429',
 'http://www.summitpost.org/mt-timpanogos-ut/151365',
 'http://www.summitpost.org/mount-baker/150195',
 'http
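
Because `mtn_urls` is a module-level list, running the cell a second time appends the same URLs again. A minimal sketch of an order-preserving de-duplication step follows; the helper name `unique_in_order` is hypothetical, and the sample list simply reuses two URLs from the output above.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


sample = [
    "http://www.summitpost.org/mount-whitney/150227",
    "http://www.summitpost.org/mount-rainier/150291",
    "http://www.summitpost.org/mount-whitney/150227",  # repeated entry
]
print(unique_in_order(sample))
```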

Now that I have all the URLs, I visit each one and scrape the mountain's name and full description. The number of sections in the full description varies from mountain to mountain.

In [4]:
response = requests.get('http://www.summitpost.org/mount-tom/151260')
soup = BeautifulSoup(response.text, "lxml")
        
details = soup.find('div',attrs={'class':'full-content'})

details.text

'\nOverviewMt. Tom is the immense, good-looking peak just west-northwest of Bishop. It is situated right on the eastern edge of the Sierra crest, allowing for a very large vertical relief. The Owens Valley, at the base of Mt. Tom, is a little over 4,000 feet, and the summit of Mt. Tom is nearly 14,000, for almost 10,000 feet of relief. Most routes allow some elevation gain before arrival at the trailhead/departure point. None of the recognized routes are technically difficult, most are class 2-3, but they are all strenuous and long. The views from the summit are memorable, and you are unlikely to see many tourists along the way. This peak does not get very many ascents, the summit register is sparsely signed.\n\nThis is one the few Sierra peaks that Norman Clyde is not credited with making the first ascent.\nGetting ThereMt. Tom is about 10 miles outside Bishop, California, and you should consult the individual route descriptions to decide how to access the peak. For the east to north 
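
In the sample above, section headings run straight into the body text ("OverviewMt. Tom", "Getting ThereMt. Tom"). One possible way to keep them apart is to split the description into a heading-to-text mapping. The sketch below uses the standard library's `html.parser` so it runs without a network connection or extra packages. It assumes section titles are `<h2>` elements, which has not been verified against SummitPost's markup, and the names `SectionParser` and `split_sections` are hypothetical.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect the text that follows each heading into {heading: [text, ...]}."""

    def __init__(self, heading_tag="h2"):  # assumption: section titles are <h2>
        super().__init__()
        self.heading_tag = heading_tag
        self.sections = {}
        self._in_heading = False
        self._current = None  # title of the section being read

    def handle_starttag(self, tag, attrs):
        if tag == self.heading_tag:
            self._in_heading = True
            self._current = ""

    def handle_endtag(self, tag):
        if tag == self.heading_tag:
            self._in_heading = False
            self._current = self._current.strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_heading:
            self._current += data
        elif self._current is not None and data.strip():
            self.sections[self._current].append(data.strip())


def split_sections(html):
    """Return {section title: section text} for an HTML fragment."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {title: " ".join(parts) for title, parts in parser.sections.items()}


sample_html = (
    '<div class="full-content">'
    "<h2>Overview</h2><p>Mt. Tom is the immense, good-looking peak "
    "just west-northwest of Bishop.</p>"
    "<h2>Getting There</h2><p>Mt. Tom is about 10 miles outside Bishop, California.</p>"
    "</div>"
)
print(split_sections(sample_html))
```

Text that appears before the first heading is ignored in this sketch. With BeautifulSoup, calling `get_text(separator=" ", strip=True)` on the `full-content` element is a simpler way to stop headings from fusing with the body, though it does not separate the sections.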

In [None]:
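## TODO: the scrape still needs work (see the sample above): text from certain tags should be excluded

mountains = []
descriptions = []

def get_mtn_desc(urls):
    """Scrape the name and full description text for each mountain URL."""
    for url in urls:
        # random pause of 0.5 to 2.5 seconds before each request
        time.sleep(.5 + 2 * random.random())

        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            print(f'Response code error for {url}: {response.status_code}')
            continue
        soup = BeautifulSoup(response.text, "lxml")

        # full description and name of the mountain
        description = soup.find('div', attrs={'class': 'full-content'})
        mountain = soup.find('div', attrs={'class': 'content-title'})
        if description is None or mountain is None:
            print(f'Missing content for {url}')
            continue

        # store plain text rather than Tag objects, and append both together
        # so the two lists always stay the same length
        descriptions.append(description.get_text(separator=' ', strip=True))
        mountains.append(mountain.get_text(strip=True))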
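
The random pause in `get_mtn_desc` spaces requests 0.5 to 2.5 seconds apart. The same idea can be combined with a simple retry, sketched below under a few assumptions: `polite_delay` and `fetch_with_retry` are hypothetical helpers, and `fetch` is any callable that returns page content or raises an exception. It is passed in as an argument so the sketch can be run without touching the network.

```python
import random
import time


def polite_delay(base=0.5, jitter=2.0):
    """Return a random pause in seconds, uniform on [base, base + jitter)."""
    return base + jitter * random.random()


def fetch_with_retry(fetch, url, attempts=3, sleep=time.sleep):
    """Call fetch(url); on an exception, pause and try again up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as err:  # broad on purpose, since this is only a sketch
            last_error = err
            if attempt < attempts - 1:
                sleep(polite_delay())
    raise last_error


# Stand-in for a real request: fails once, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 2:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retry(
    flaky_fetch,
    "http://www.summitpost.org/mount-tom/151260",
    sleep=lambda seconds: None,  # skip the real pause in this demo
)
print(result, len(calls))
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that raises on a non-200 status; that wiring is left out here.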

In [None]:
get_mtn_desc(mtn_urls)
print(descriptions[:2])

Put it all together in a pandas DataFrame.

In [None]:
df_mtndesc = pd.DataFrame({'mountain': mountains, 'description': descriptions})
print(df_mtndesc.info())
df_mtndesc.sample(20)

In [None]:
# exporting cleaned data to CSV

df_mtndesc.to_csv('./mtn_descriptions.csv', index=False)