## Getting Climber's Logs from Summit Post

This notebook scrapes climber's logs from https://summitpost.org, a crowdsourced resource for mountaineering and hiking information. It covers the mountains on the first N listing pages, sorted by descending number of hits.

In [2]:
import requests
import time
import random
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from bs4 import BeautifulSoup

First, I get the unique URL of each mountain. Each listing page shows 24 mountains.

In [3]:
mtn_urls = []

def mtns_top_hits(num_pages):
    for i in range(1,num_pages+1):
        top_url = f'https://www.summitpost.org/mountain/rock/?object_type=1&search_select_1=name_only&contributor_id=&order_type_1=DESC&object_name_1=&sort_select_1=hits&page={i}'
        response = requests.get(top_url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            data = soup.find_all('div',attrs={'class':'item-data'})
            for div in data:
                links = div.find_all('a')
                for a in links[::2]: # take every other link; the links in between point to a "parent"
                    mtn_urls.append("http://www.summitpost.org" + a['href'])
        else:
            print(f'Response code error: {response.status_code}')
    return mtn_urls
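
One thing to watch: `mtns_top_hits` appends to the module-level `mtn_urls` list, so running it a second time in the same session adds every URL again. A minimal sketch of an order-preserving de-duplication step, assuming the URLs are plain strings (the helper name `unique_in_order` is hypothetical and not part of the notebook):

```python
def unique_in_order(urls):
    """Return the URLs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Simulate the list after the scraping function has been run twice.
scraped = [
    "http://www.summitpost.org/mount-whitney/150227",
    "http://www.summitpost.org/mount-rainier/150291",
] * 2

deduped = unique_in_order(scraped)
print(len(scraped), len(deduped))  # 4 2
```

Using `dict.fromkeys` rather than `set` keeps the hit-count ordering of the listing pages intact.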

In [4]:
# get the URLs of 960 mountains (the top 40 pages, 24 mountains per page)
mtns_top_hits(40)

['http://www.summitpost.org/mount-whitney/150227',
 'http://www.summitpost.org/mount-rainier/150291',
 'http://www.summitpost.org/mount-shasta/150188',
 'http://www.summitpost.org/mount-hood/150189',
 'http://www.summitpost.org/denali/150199',
 'http://www.summitpost.org/mount-elbert/150325',
 'http://www.summitpost.org/katahdin/150219',
 'http://www.summitpost.org/aconcagua/150197',
 'http://www.summitpost.org/mount-adams/150198',
 'http://www.summitpost.org/grand-teton/150312',
 'http://www.summitpost.org/longs-peak/150310',
 'http://www.summitpost.org/matterhorn-monte-cervino/150235',
 'http://www.summitpost.org/mont-blanc/150245',
 'http://www.summitpost.org/eiger/150228',
 'http://www.summitpost.org/mount-mansfield/150938',
 'http://www.summitpost.org/humphreys-peak/150241',
 'http://www.summitpost.org/hatu-peak/154227',
 'http://www.summitpost.org/wheeler-peak-nm/150429',
 'http://www.summitpost.org/mt-timpanogos-ut/151365',
 'http://www.summitpost.org/mount-baker/150195',
 'http

Next, I get the climber's log page URL from each main mountain URL.
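
The rewrite amounts to inserting a `climbers-log` path segment just before the trailing numeric ID. A minimal standalone sketch of that step, assuming every mountain URL ends in `/<name>/<id>` (the function name `to_climbers_log_url` is hypothetical and not part of the notebook):

```python
def to_climbers_log_url(mountain_url):
    """Insert 'climbers-log' before the trailing numeric ID of a mountain URL."""
    # Split once from the right: everything before the ID, and the ID itself.
    prefix, object_id = mountain_url.rstrip("/").rsplit("/", 1)
    return f"{prefix}/climbers-log/{object_id}"


print(to_climbers_log_url("http://www.summitpost.org/mount-whitney/150227"))
# http://www.summitpost.org/mount-whitney/climbers-log/150227
```

Wrapping the step in a function makes it easy to check on a single URL before applying it to the whole list.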

In [5]:
climber_log_urls = []

for url in mtn_urls:
    groups = url.split('/')
    groups.insert(-1,'climbers-log')
    climber_log_urls.append("/".join(groups))

climber_log_urls

['http://www.summitpost.org/mount-whitney/climbers-log/150227',
 'http://www.summitpost.org/mount-rainier/climbers-log/150291',
 'http://www.summitpost.org/mount-shasta/climbers-log/150188',
 'http://www.summitpost.org/mount-hood/climbers-log/150189',
 'http://www.summitpost.org/denali/climbers-log/150199',
 'http://www.summitpost.org/mount-elbert/climbers-log/150325',
 'http://www.summitpost.org/katahdin/climbers-log/150219',
 'http://www.summitpost.org/aconcagua/climbers-log/150197',
 'http://www.summitpost.org/mount-adams/climbers-log/150198',
 'http://www.summitpost.org/grand-teton/climbers-log/150312',
 'http://www.summitpost.org/longs-peak/climbers-log/150310',
 'http://www.summitpost.org/matterhorn-monte-cervino/climbers-log/150235',
 'http://www.summitpost.org/mont-blanc/climbers-log/150245',
 'http://www.summitpost.org/eiger/climbers-log/150228',
 'http://www.summitpost.org/mount-mansfield/climbers-log/150938',
 'http://www.summitpost.org/humphreys-peak/climbers-log/150241',
 

Now that I have all the URLs, I visit each one and collect the climber's log comments, the date each comment was posted, the date of the climb (if available), and the name of the mountain. Each mountain has a different number of log pages, so I keep requesting pages until the site reports that there are no more entries.

In [6]:
mountains = []
dates = []
comments = []

def get_climber_logs(urls):
    for url in urls:
        for i in range(1,100): # pages 1-99; assumes no mountain has more than 99 pages of logs
            climber_log_url = f'{url}/p{i}'
            response = requests.get(climber_log_url)
            soup = BeautifulSoup(response.text, "lxml")
            if "No climber's log entries yet." in str(soup) or "No comments posted yet." in str(soup):
                break
            else:
                details = soup.find_all('div',attrs={'class':'details'})

                # get each date from between span tags
                for date in details:
                    dates.append(date.find('span').text)

                # get each comment from between p tags
                for comment in details:
                    comments.append(comment.find('p').text)
                
                # get name of mountain
                mountain = soup.find('div',attrs={'class':'custom-page-title'}).find('h2').find('a').text
                mountains.extend([mountain]*len(details))
                
        time.sleep(.5+2*random.random())
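
The stop condition can be tried out without touching the network. The sketch below is a hedged variation on the function above, not a replacement for it: `iter_log_pages`, the `fetch` callable, and the `example.test` URLs are all hypothetical. It also pauses after every page request, whereas the function above pauses once per mountain; that change is my assumption about request pacing, not something the site documents.

```python
import random
import time
from typing import Callable, Iterator

# Marker strings copied from the scraping function above.
EMPTY_MARKERS = ("No climber's log entries yet.", "No comments posted yet.")


def iter_log_pages(
    base_url: str,
    fetch: Callable[[str], str],
    max_pages: int = 99,
    delay: Callable[[], float] = lambda: 0.5 + 2 * random.random(),
) -> Iterator[str]:
    """Yield the HTML of each log page until an empty-page marker appears."""
    for page in range(1, max_pages + 1):
        html = fetch(f"{base_url}/p{page}")
        if any(marker in html for marker in EMPTY_MARKERS):
            return  # past the last page of entries
        yield html
        time.sleep(delay())  # pause after every page request


# Usage with a fake fetcher (hypothetical URLs), so no request is sent.
fake_site = {
    "http://example.test/log/p1": "<p>first page</p>",
    "http://example.test/log/p2": "<p>second page</p>",
}


def fake_fetch(url: str) -> str:
    return fake_site.get(url, "No climber's log entries yet.")


pages = list(iter_log_pages("http://example.test/log", fake_fetch, delay=lambda: 0.0))
print(len(pages))  # 2
```

Passing the fetcher in as an argument keeps the paging logic separate from `requests`, which makes it easy to test.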

In [7]:
get_climber_logs(climber_log_urls)

Finally, to put it all together, I make a pandas dataframe out of the lists of scraped information.

In [8]:
# make a pandas dataframe

df_logs = pd.DataFrame({'mountain': mountains, 'date': dates, 'comment': comments})
print(df_logs.info())
df_logs.sample(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67633 entries, 0 to 67632
Data columns (total 3 columns):
mountain    67633 non-null object
date        67633 non-null object
comment     67633 non-null object
dtypes: object(3)
memory usage: 1.5+ MB
None


,mountain,date,comment
64292,Sundial Peak,"Jul 10, 2008 12:54 pm Date Climbed: Jun 28, 2008",A great scramble at the end... and a superb pe...
9967,Gannett Peak,"Feb 24, 2005 2:07 am",I've done this route toghether with Serge Ray....
35026,Mount Diablo,"Oct 30, 2007 11:48 pm",Great mtn to bike and run and even a very few ...
1569,Mount Rainier,"Jul 12, 2007 3:05 pm Date Climbed: Jul 5, 2007","Climbed Rainier via the Emmons Glacier Route, ..."
22368,Eldorado Peak,"Jun 4, 2007 1:58 pm Date Climbed: Jun 1, 2007",Climbed on a 6-day glacier mountaineering cour...
52363,Electric Peak,"Oct 21, 2014 5:12 pm Date Climbed: Jan 13, 2014",got caught in the tracks of a mama grizzly and...
28882,Mount Sherman,"Oct 22, 2002 11:10 am",A nice afternoon jaunt; took a friend on his s...
49811,Moldoveanu,"Oct 14, 2007 11:04 am Date Climbed: Jul 22, 2007","15 hours, all day long!"
25204,Wheeler Peak,"Oct 29, 2017 10:25 pm Date Climbed: Oct 28, 2015",Cold and windy at the top. Started from the lo...
52884,Mount Wrightson,"Jul 20, 2008 2:26 am",Climbed successfully.


In [9]:
# splitting the date information into comment date and climb date (if available)

df_logs[['comment_date','climb_date']] = df_logs['date'].str.split(' Date Climbed: ',expand=True)
df_logs.drop(['date'], axis = 1, inplace=True)
df_logs.head(10)

,mountain,comment,comment_date,climb_date
0,Mount Whitney,Did this in a single day... very difficult for...,"Nov 19, 2018 8:51 am","Sep 5, 2015"
1,Mount Whitney,Mountaineers Route - first 14er,"Nov 1, 2018 6:34 am",
2,Mount Whitney,Standard route from Whitney Portal via Trail C...,"Oct 1, 2018 12:44 pm","Oct 27, 2018"
3,Mount Whitney,Worth hiking the 220-something miles along the...,"Sep 10, 2018 10:23 am","Jul 23, 2016"
4,Mount Whitney,"My brother John, his son and I backpacked in t...","Aug 15, 2018 2:28 am","Aug 10, 2018"
5,Mount Whitney,Words will never describe the breathtaking vie...,"Aug 2, 2018 11:50 am","Jul 1, 2018"
6,Mount Whitney,Portal out and back,"Jul 31, 2018 7:51 am","Jul 29, 2018"
7,Mount Whitney,Great 2am ascent from Trail Camp.\nhttps://the...,"Jul 28, 2018 8:09 pm","Jul 28, 2018"
8,Mount Whitney,Hiked from guitar lake in the middle of the ni...,"Jul 24, 2018 6:33 am","Aug 8, 2014"
9,Mount Whitney,Summited Whitney via the Whitney Trail with tw...,"Jul 2, 2018 7:31 am","Jun 24, 2018"
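
To see what the split does to a single value, here is a minimal plain-Python sketch of the same rule, using two raw strings from the sample output above (the helper name `split_log_date` is hypothetical; the notebook itself uses `str.split` on the whole column):

```python
SEPARATOR = " Date Climbed: "


def split_log_date(raw):
    """Split a raw date string into (comment_date, climb_date or None)."""
    comment_date, found, climb_date = raw.partition(SEPARATOR)
    # `found` is empty when the separator is absent, i.e. no climb date was given.
    return comment_date, (climb_date if found else None)


print(split_log_date("Jul 10, 2008 12:54 pm Date Climbed: Jun 28, 2008"))
# ('Jul 10, 2008 12:54 pm', 'Jun 28, 2008')
print(split_log_date("Feb 24, 2005 2:07 am"))
# ('Feb 24, 2005 2:07 am', None)
```

Entries without a climb date end up with a missing value in `climb_date`, which is why that column has fewer non-null rows than `comment_date`.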


In [10]:
# changing the date columns from object type into datetime type

df_logs['comment_date'] = pd.to_datetime(df_logs['comment_date'])
df_logs['climb_date'] = pd.to_datetime(df_logs['climb_date'])
df_logs.head(10)

,mountain,comment,comment_date,climb_date
0,Mount Whitney,Did this in a single day... very difficult for...,2018-11-19 08:51:00,2015-09-05
1,Mount Whitney,Mountaineers Route - first 14er,2018-11-01 06:34:00,NaT
2,Mount Whitney,Standard route from Whitney Portal via Trail C...,2018-10-01 12:44:00,2018-10-27
3,Mount Whitney,Worth hiking the 220-something miles along the...,2018-09-10 10:23:00,2016-07-23
4,Mount Whitney,"My brother John, his son and I backpacked in t...",2018-08-15 02:28:00,2018-08-10
5,Mount Whitney,Words will never describe the breathtaking vie...,2018-08-02 11:50:00,2018-07-01
6,Mount Whitney,Portal out and back,2018-07-31 07:51:00,2018-07-29
7,Mount Whitney,Great 2am ascent from Trail Camp.\nhttps://the...,2018-07-28 20:09:00,2018-07-28
8,Mount Whitney,Hiked from guitar lake in the middle of the ni...,2018-07-24 06:33:00,2014-08-08
9,Mount Whitney,Summited Whitney via the Whitney Trail with tw...,2018-07-02 07:31:00,2018-06-24
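
`pd.to_datetime` infers the layout of the strings here. To spot-check individual values by hand, the two layouts can also be written out explicitly with the standard library. This is a minimal sketch, assuming an English locale for month abbreviations and am/pm markers; the format strings and function names are my own and do not appear in the notebook:

```python
from datetime import datetime

COMMENT_FORMAT = "%b %d, %Y %I:%M %p"  # e.g. "Jul 28, 2018 8:09 pm"
CLIMB_FORMAT = "%b %d, %Y"             # e.g. "Jul 28, 2018"


def parse_comment_date(text):
    """Parse a comment timestamp in the 12-hour layout shown in the logs."""
    return datetime.strptime(text, COMMENT_FORMAT)


def parse_climb_date(text):
    """Parse a climb date, returning None when no date was recorded."""
    return datetime.strptime(text, CLIMB_FORMAT).date() if text else None


print(parse_comment_date("Jul 28, 2018 8:09 pm"))  # 2018-07-28 20:09:00
print(parse_climb_date("Sep 5, 2015"))             # 2015-09-05
print(parse_climb_date(None))                      # None
```

The first result matches row 7 of the converted table, where `8:09 pm` becomes `20:09:00`.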


In [11]:
df_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67633 entries, 0 to 67632
Data columns (total 4 columns):
mountain        67633 non-null object
comment         67633 non-null object
comment_date    67633 non-null datetime64[ns]
climb_date      44534 non-null datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 2.1+ MB


In [12]:
# exporting cleaned data to CSV

df_logs.to_csv('./climber_logs.csv', index=False)