### Try out Web Scraping to get basic trail info for each national park

Web scraping tools:

- requests
- Beautiful Soup



In [3]:
# package imports
from bs4 import BeautifulSoup
import requests



import pandas as pd

In [2]:
natparks = pd.read_csv('national_park_list.csv')

In [48]:
# quick test: fill in the full state name for the first 'CA' park only
for it, row in natparks.iterrows():
    if row['states'] == 'CA':
        natparks.at[it, 'state'] = 'California'
        break

In [50]:
natparks.iloc[10]

fullName    Channel Islands National Park
states                                 CA
state                          California
Name: 10, dtype: object

In [59]:
state = natparks.iloc[10]['state'].lower()
name = '-'.join(natparks.iloc[10]['fullName'].lower().split(' '))


In [63]:


url = "https://www.alltrails.com/parks/us/%s/%s" % (state, name)
url

'https://www.alltrails.com/parks/us/california/channel-islands-national-park'

## Idea:

For each National Park, enter into AllTrails and grab all results

AllTrails urls look like: https://www.alltrails.com/parks/us/utah/zion-national-park

- this is a pain because AllTrails uses the full state name, not the two-letter code (XX) found in the NPS data
- ideally, I'd be able to just loop through the national park URLs and scrape all the data within, but that might not work?
- note that there's also the https://www.alltrails.com/us/national-parks page, if I can figure out how to make the scraper click things
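
One way to get around the state-name mismatch is a small lookup table from two-letter codes to full names. The sketch below is not part of the original notebook: `STATE_NAMES` and `alltrails_park_url` are hypothetical names, the table lists only a few states, and it assumes that multi-state parks appear as comma-separated codes and that multi-word state names are hyphenated in the URL. Not every NPS name is guaranteed to match an AllTrails slug.

```python
# Hypothetical helper, not from the original notebook.
# Only a few states are listed here; extend the table as needed.
STATE_NAMES = {
    "CA": "California",
    "UT": "Utah",
    "WY": "Wyoming",
    "NM": "New Mexico",
}


def alltrails_park_url(full_name, state_abbr):
    """Build an AllTrails park URL from an NPS-style park name and state code."""
    # Assumption: multi-state parks look like "CA,NV"; use the first code.
    first_state = state_abbr.split(",")[0].strip()
    # Assumption: multi-word states are hyphenated, e.g. "new-mexico".
    state_slug = "-".join(STATE_NAMES[first_state].lower().split())
    park_slug = "-".join(full_name.lower().split())
    return "https://www.alltrails.com/parks/us/%s/%s" % (state_slug, park_slug)


print(alltrails_park_url("Zion National Park", "UT"))
# https://www.alltrails.com/parks/us/utah/zion-national-park
```

A state code missing from the table raises a `KeyError`, which makes gaps in the mapping easy to spot.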

#### Proof of Concept for just Zion:

In [4]:
url = 'http://www.alltrails.com/parks/us/utah/zion-national-park'

# added based on: https://stackoverflow.com/questions/38489386/python-requests-403-forbidden
# in the Chrome developer console, run navigator.userAgent to get this string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36'}

response = requests.get(url,headers=headers)

page = response.text

In [5]:
soup = BeautifulSoup(page, 'html.parser')

In [6]:
# write to file so I can look at this in an IDE that hopefully formats HTML. can also inspect on the webpage
with open("output1.html", "w") as file:
    file.write(str(soup))

In [7]:
# after looking at way too much HTML, it seems like the boxes on the page are called "trailCard"
top10_cards = soup.select('div[class*="trailCard"]')

In [8]:
for card in top10_cards:
    print(card.find('a').get_text())

Angels Landing Trail
The Zion Narrows Riverside Walk
The Watchman Trail
Zion Canyon Overlook Trail
Zion Narrows Bottom Up to Big Springs
Emerald Pools Trail
The Subway Trail
Scout Lookout Trail
Lower Emerald Pool Trail
Observation Point via East Mesa Trail


In [9]:
# above are the top 10 trails according to AllTrails reviews!
## how can I get more info on these trails?
## how can I get more than 10 trails? need to figure out how to "click" the "Show More Trails" button

In [10]:
# looking further at the HTML, trying to find the names of the different pieces. this part sucked, would not recommend.
# here are my results

In [11]:
# hike description (card is still the last card from the loop above: Observation Point)
card.select('div[class*="xlate-none styles-module__description"]')[0].get_text()

'NOTES: All wheel drive is highly recommended to make it to the trailhead in your vehicle. Alternatively, you can park on the road before the trail gets rocky and hike in to the trailhead but it will increase the length of the route.\n\nThis trail is really well maintained, well shaded as well, so you can take a break when needed. It ends at a beautiful observation point.Show more'

In [12]:
# hike difficulty 
card.select('span[class*="styles-module__diff"]')[0].get_text()

'moderate'

In [13]:
# more things
for val in card.select('span[class*="xlate-none"]'):
    print(val.get_text())


Length: 6.7 mi
Est. 3 h 10 m
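
The length and time come back as strings such as `Length: 6.7 mi` and `Est. 3 h 10 m`. If numeric values are wanted later (for sorting or filtering), they could be parsed roughly as follows. This is a sketch using only the standard library; `parse_miles` and `parse_minutes` are hypothetical helper names, and the patterns assume the text always follows the formats seen in the output above.

```python
import re


def parse_miles(text):
    """Return the distance in miles as a float, or None if none is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*mi\b", text)
    return float(match.group(1)) if match else None


def parse_minutes(text):
    """Return the estimated time in whole minutes, or None if none is found."""
    hours = re.search(r"(\d+)\s*h\b", text)
    minutes = re.search(r"(\d+)\s*m\b", text)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


print(parse_miles("Length: 6.7 mi"))   # 6.7
print(parse_minutes("Est. 3 h 10 m"))  # 190
```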


#### try to apply above to all top 10 hikes:

In [14]:
rows = []
for card in top10_cards:
    trailname = card.find('a').get_text()
    park = card.select('a[class*="xlate-none styles-module__location"]')[0].get_text()
    difficulty = card.select('span[class*="styles-module__diff"]')[0].get_text()
    description = card.select('div[class*="xlate-none styles-module__description"]')[0].get_text()
    other = distance = time = None
    for val in card.select('span[class="xlate-none"]'):
        entry = val.get_text()
        if 'Length' in entry:
            distance = entry.split(': ')[-1]
        elif 'Est' in entry:
            time = entry.split('. ')[-1]
        else:
            other = entry

    rows.append([trailname,park,difficulty,distance,time,description,other])

In [15]:
# worked! aggregating
zion_hikes = pd.DataFrame(rows, columns=['name','park','difficulty','distance','time','description','other'])
zion_hikes

| | name | park | difficulty | distance | time | description | other |
|---|---|---|---|---|---|---|---|
| 0 | Angels Landing Trail | Zion National Park | hard | 5.0 mi | 3 h 7 m | The parking lot here fills up quickly so be su... | |
| 1 | The Zion Narrows Riverside Walk | Zion National Park | easy | 1.9 mi | 45 m | The Narrows may close during extreme weather c... | |
| 2 | The Watchman Trail | Zion National Park | easy | 3.1 mi | 1 h 41 m | The Watchman Trail is a great easy trail that ... | |
| 3 | Zion Canyon Overlook Trail | Zion National Park | moderate | 1.0 mi | 42 m | This trail offers some of the most breathtakin... | |
| 4 | Zion Narrows Bottom Up to Big Springs | Zion National Park | hard | 8.6 mi | 5 h 28 m | Reserve your $ 1 shuttle bus pass on Recreatio... | |
| 5 | Emerald Pools Trail | Zion National Park | moderate | 3.0 mi | 1 h 12 m | A paved trail to Lower Emerald Pool and from t... | |
| 6 | The Subway Trail | Zion National Park | hard | 9.1 mi | 4 h 33 m | Please note: An NPS permit is required to acce... | |
| 7 | Scout Lookout Trail | Zion National Park | hard | 3.6 mi | 1 h 27 m | The road to this trail closes periodically to ... | |
| 8 | Lower Emerald Pool Trail | Zion National Park | easy | 1.4 mi | 34 m | Easy trail in Zion National Park. Minor drop-o... | |
| 9 | Observation Point via East Mesa Trail | Zion National Park | moderate | 6.7 mi | 3 h 10 m | NOTES: All wheel drive is highly recommended t... | |


### Extending POC to other Nat Parks

- Try to use the same exact code on Yosemite 
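
Since the same extraction loop is about to be reused for a second park, it could be pulled into a function. The sketch below is one possible refactor, not the notebook's own code: `first_text` and `parse_trail_card` are hypothetical names, the selectors are the ones found for Zion, and `card` is assumed to be a BeautifulSoup `Tag` for one trail card. Unlike the original loop, it returns `None` for a missing field instead of raising an `IndexError`.

```python
def first_text(card, selector):
    """Text of the first element matching a CSS selector, or None if absent."""
    node = card.select_one(selector)
    return node.get_text() if node is not None else None


def parse_trail_card(card):
    """Turn one trail card element into a dict of fields."""
    link = card.find("a")
    record = {
        "name": link.get_text() if link is not None else None,
        "park": first_text(card, 'a[class*="xlate-none styles-module__location"]'),
        "difficulty": first_text(card, 'span[class*="styles-module__diff"]'),
        "distance": None,
        "time": None,
        "description": first_text(card, 'div[class*="xlate-none styles-module__description"]'),
        "other": None,
    }
    # Length and estimated time share the same span class, so tell them
    # apart by their text, as in the Zion proof of concept.
    for val in card.select('span[class="xlate-none"]'):
        entry = val.get_text()
        if "Length" in entry:
            record["distance"] = entry.split(": ")[-1]
        elif "Est" in entry:
            record["time"] = entry.split(". ")[-1]
        else:
            record["other"] = entry
    return record
```

With this in place, a park's table could be built with `pd.DataFrame([parse_trail_card(c) for c in soup.select('div[class*="trailCard"]')])`; the dict keys become the column names in the same order as before.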

In [16]:
state = 'california'
mypark = 'yosemite-national-park'
url = 'http://www.alltrails.com/parks/us/%s/%s'%(state,mypark)

# added based on: https://stackoverflow.com/questions/38489386/python-requests-403-forbidden
# in the Chrome developer console, run navigator.userAgent to get this string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36'}

response = requests.get(url,headers=headers)

page = response.text
soup = BeautifulSoup(page, 'html.parser')

In [17]:
rows = []
for card in soup.select('div[class*="trailCard"]'):
    trailname = card.find('a').get_text()
    park = card.select('a[class*="xlate-none styles-module__location"]')[0].get_text()
    difficulty = card.select('span[class*="styles-module__diff"]')[0].get_text()
    description = card.select('div[class*="xlate-none styles-module__description"]')[0].get_text()
    other = distance = time = None
    for val in card.select('span[class="xlate-none"]'):
        entry = val.get_text()
        if 'Length' in entry:
            distance = entry.split(': ')[-1]
        elif 'Est' in entry:
            time = entry.split('. ')[-1]
        else:
            other = entry

    rows.append([trailname,park,difficulty,distance,time,description,other])
yosemite_hikes = pd.DataFrame(rows, columns=['name','park','difficulty','distance','time','description','other'])


In [18]:
yosemite_hikes


| | name | park | difficulty | distance | time | description | other |
|---|---|---|---|---|---|---|---|
| 0 | Vernal and Nevada Falls via the Mist Trail | Yosemite National Park | hard | 8.8 mi | 5 h 1 m | Note: As of September 2020, The park has decid... | |
| 1 | Upper Yosemite Falls Trail | Yosemite National Park | hard | 7.6 mi | 4 h 30 m | Enjoy the thrilling views of looking down from... | |
| 2 | Vernal Falls | Yosemite National Park | moderate | 4.0 mi | 2 h 16 m | The Mist Trail from its junction with the John... | |
| 3 | Half Dome Trail | Yosemite National Park | hard | 15.0 mi | 9 h 1 m | Half Dome is a serious endurance hike taking v... | |
| 4 | Four Mile Trail | Yosemite National Park | hard | 9.2 mi | 6 h 9 m | Note: This trail and road may close seasonally... | |
| 5 | Lower Yosemite Falls Trail | Yosemite National Park | easy | 1.2 mi | 28 m | A quick stroll to see Yosemite Falls, the tall... | |
| 6 | Clouds Rest Trail via Tenaya Lake | Yosemite National Park | hard | 13.0 mi | 7 h 3 m | The best place to enjoy the view of Half Dome ... | |
| 7 | Glacier Point Trail | Yosemite National Park | easy | 0.6 mi | 14 m | Please be aware that this trail and its access... | |
| 8 | Sentinel Dome Trail | Yosemite National Park | easy | 2.1 mi | 1 h 9 m | Sentinel Dome starts from Glacier Point Road a... | |
| 9 | Vernal Falls and Clark Point via Mist and John... | Yosemite National Park | hard | 4.2 mi | 2 h 44 m | Note: As of 11/13/2020, the final section of t... | |


In [21]:
pd.concat([yosemite_hikes, zion_hikes]).to_csv('top_hikes_temp.csv', index=False)

## TO DO:

- make this into a py script and run on all parks. 
    - is there a way to do that without having to write out url for each park? 
- look into getting more than 10 hikes per park?
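
For the first item, the script could be little more than a loop over park URLs with a pause between requests. The sketch below is only an outline under stated assumptions: `scrape_parks` is a hypothetical name, and `fetch` and `parse` are caller-supplied functions (for example, `fetch` could wrap `requests.get(url, headers=headers).text` and `parse` could wrap the `BeautifulSoup` card loop from above). The two-second default delay is an arbitrary choice, not a documented AllTrails limit.

```python
import time


def scrape_parks(urls, fetch, parse, delay_seconds=2.0):
    """Fetch and parse each URL in turn, pausing between requests.

    fetch(url) should return the page HTML as a string.
    parse(html) should return a list of rows for that page.
    Returns (rows, failed), where failed holds (url, error message) pairs.
    """
    rows, failed = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause so the site isn't hammered
        try:
            rows.extend(parse(fetch(url)))
        except Exception as err:
            # Record the failure and keep going with the remaining parks.
            failed.append((url, str(err)))
    return rows, failed
```

Collecting failures instead of stopping at the first one means a single missing park page doesn't throw away the results already gathered.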
