In [45]:
from bs4 import BeautifulSoup
import unittest
import requests


## Problem 0
There are 10 images on the page http://newmantaylor.com/gallery.html. Some of them have "alt text", which is the text that is displayed or spoken because of browser limitations, or because someone is using a screen reader, for example. Scrape this page and print out the alt text for each image. If there is no alt text, print "No alternative text provided!" The code you write should be general enough to work for any similar page with 10 images like this (not just this one), just by changing the URL to a different one.

In [46]:
page = requests.get("http://newmantaylor.com/gallery.html")
soup = BeautifulSoup(page.text, 'html.parser')
img_tags = soup.find_all("img")
for img in img_tags:
    print(img.get("alt", "No alternative text provided!"))


Waving Kitty 1
No alternative text provided!
Waving Kitty 3
Waving Kitty 4
Waving Kitty 5
Waving Kitty 6
No alternative text provided!
Waving Kitty 8
Waving Kitty 9
Waving Kitty 10
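
One edge case worth noting: an image can carry an `alt` attribute that is present but empty, in which case `img.get("alt", default)` returns the empty string rather than the default. The sketch below is one possible way to treat missing and blank alt text the same way. The `SAMPLE_HTML` markup and the `alt_texts` helper are hypothetical and stand in for a downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded gallery page.
SAMPLE_HTML = """
<html><body>
  <img src="a.jpg" alt="Waving Kitty 1">
  <img src="b.jpg">
  <img src="c.jpg" alt="">
</body></html>
"""


def alt_texts(html, placeholder="No alternative text provided!"):
    """Return one string per <img>, using the placeholder when alt is missing or blank."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for img in soup.find_all("img"):
        # img.get("alt") is None when the attribute is absent and "" when it is empty.
        alt = (img.get("alt") or "").strip()
        results.append(alt if alt else placeholder)
    return results


for text in alt_texts(SAMPLE_HTML):
    print(text)
```

To run this against a live page, the HTML string would come from `requests.get(url).text` instead of the sample constant.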


## Problem 1
Access and cache data, starting from https://www.nps.gov/index.htm. You will ultimately need the HTML data from all the parks from Arkansas, California, and Michigan. So, you should save on your computer data from the following pages, in files with the following names:

Main page data, https://www.nps.gov/index.htm, in a file nps_gov_data.html

Arkansas, https://www.nps.gov/state/ar/index.htm, in a file arkansas_data.html

California, https://www.nps.gov/state/ca/index.htm, in a file california_data.html

Michigan, https://www.nps.gov/state/mi/index.htm, in a file michigan_data.html

You should commit and push each of these .html files to your final Git repository.

Note that this is a much less complex 'caching' system to save data from the internet on your computer than you may be accustomed to from accessing REST API data. A system like the one discussed in our textbook can also certainly be used for HTML data like this, but in this case, we're using a shortcut. Later in the course you'll see more options for structuring your code in an easily-reusable way, and you may also come up with some yourself.
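
As a rough illustration of this simple file-based caching idea, the sketch below reads a file if it already exists and only calls a fetch function on a cache miss. The names `get_cached_text` and `fake_fetch` are hypothetical, and the fake fetcher replaces a real network request so the example runs offline.

```python
import os
import tempfile


def get_cached_text(filename, fetch):
    """Return the contents of filename; call fetch() and save its result only on a cache miss."""
    if os.path.exists(filename):
        with open(filename, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Demonstration with a fake fetcher (an assumption for this sketch, not a real request).
calls = []


def fake_fetch():
    calls.append(1)
    return "<html>stub page</html>"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nps_gov_data.html")
    first = get_cached_text(path, fake_fetch)   # cache miss: calls fake_fetch
    second = get_cached_text(path, fake_fetch)  # cache hit: reads the file

print(len(calls))
```

In real use, the `fetch` argument could be something like `lambda: requests.get(url).text`, so the network is only touched the first time a given file is requested.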

You should not hardcode the above URLs to get the html. Instead, you should write code that begins scraping https://www.nps.gov/index.htm. This code should be written such that you can pretty easily decide to add a new state, such as NY, to the states you want data from, and it would work.

You can access, and thus cache, data from those three listed pages for AR, CA, and MI parks by starting with a BeautifulSoup object of the HTML on the page https://www.nps.gov/index.htm. To get full points on this question, you should do that, rather than simply calling e.g. resp_text = requests.get("https://www.nps.gov/state/mi/index.htm").text for each state.

If you access and cache the data without starting with a BeautifulSoup instance from the https://www.nps.gov/index.htm data, you will lose at least 50 points from this problem.
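
The requirement above (find each state's URL from the main page rather than hardcoding it) might be sketched as follows. The `SAMPLE_MAIN` fragment is hypothetical, and the real page's markup may differ; `state_urls` is likewise an illustrative helper name, not part of the assignment.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.nps.gov/index.htm"

# Hypothetical fragment; the real main page's markup may differ.
SAMPLE_MAIN = """
<ul>
  <li><a href="/state/ar/index.htm">Arkansas</a></li>
  <li><a href="/state/ca/index.htm">California</a></li>
  <li><a href="/state/mi/index.htm">Michigan</a></li>
  <li><a href="/state/ny/index.htm">New York</a></li>
</ul>
"""


def state_urls(soup, state_names, base_url=BASE_URL):
    """Map each requested state name to the absolute URL of its link in the soup."""
    wanted = set(state_names)
    found = {}
    for link in soup.find_all("a", href=True):
        label = link.get_text(strip=True)
        if label in wanted and label not in found:
            # urljoin resolves a relative href against the page it came from.
            found[label] = urljoin(base_url, link["href"])
    return found


main = BeautifulSoup(SAMPLE_MAIN, "html.parser")
urls = state_urls(main, ["Arkansas", "New York"])
print(urls["New York"])
```

Because the lookup is driven by the list of names, adding a state such as New York only requires adding its name to the list.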

We have provided comments as structure to proceed through Problem 1. We suggest that you follow them. However, if you choose not to, as long as you end up with the result that is required (the files), and you are not accessing the internet to get HTML data from those 4 pages every time you run the program, you will get credit for Problem 1.

In [47]:
# I found this the easiest way to get as many states as you want: just add a state's
# name to the 'states' list below. The pages can then be accumulated in a list comprehension.
def get_and_cache_page(relative_url, filename):
    base_url = "https://www.nps.gov"
    try:
        # Cache hit: read the previously saved HTML from disk.
        with open(filename, 'r', encoding='utf-8') as f:
            page = f.read()
    except FileNotFoundError:
        # Cache miss: fetch the page once and save it for later runs.
        page = requests.get(base_url + relative_url).text
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(page)
    return page

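def get_state_url(state_names):
    # Accept a single state name as well as a list of names.
    if isinstance(state_names, str):
        state_names = [state_names]
    # Look up each state's link by its visible text in the main page soup
    # (main_soup is the module-level BeautifulSoup object created in the next cell).
    state_urls = [main_soup.find('a', href=True, text=item)['href'] for item in state_names]
    return state_urls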



In [48]:
nps_main = get_and_cache_page("/index.htm", "nps_gov_data.html")
main_soup = BeautifulSoup(nps_main, 'html.parser')


states = ['Arkansas', 'California', 'Michigan']
urls = get_state_url(states)

nps_ar = get_and_cache_page(urls[0], "arkansas_data.html")
nps_ca = get_and_cache_page(urls[1], "california_data.html")
nps_mi = get_and_cache_page(urls[2], "michigan_data.html")

ar_soup = BeautifulSoup(nps_ar, 'html.parser')
ca_soup = BeautifulSoup(nps_ca, 'html.parser')
mi_soup = BeautifulSoup(nps_mi, 'html.parser')

## Problem 2
Define a class NationalSite that accepts a BeautifulSoup object as input to its constructor, representing 1 National Park / National Lakeshore / etc (e.g. what you see here or what you see here)

A NationalSite instance should have the following instance variables:

- location (state, or a city, or states ... whatever location description is provided)
- name (e.g. "Alcatraz Island", "Channel Islands"...)
- type (e.g. "National Lakeshore", "National Monument"... if there is no specified type, this value should be the special value None)
- description (e.g. "Established in 1911 by presidential proclamation, Devils Postpile National Monument protects and preserves the Devils Postpile formation, the 101-foot high Rainbow Falls, and pristine mountain scenery. The formation is a rare sight in the geologic world and ranks as one of the world's finest examples of columnar basalt. Its columns tower 60 feet high and display an unusual symmetry." -- if there is no description, this instance variable should have the value of the empty string, "")

A NationalSite instance should also have the following methods:

A string method __str__ that returns a string of the format National Park/Site/Monument Name | Location

A get_mailing_address method that returns a string representing the mailing address of the park/site/etc. Because a multi-line string will make a CSV more difficult, you should separate the lines in the address with a forward slash, like this: /. However you decide to get this information and relatively-sensibly put it together is fine. In fact, some addresses may have information included in them twice, e.g. "Yosemite National Park, CA 95389 / Yosemite National Park / CA / 95389", while some will not -- that is also OK! There is enough information to send mail if possible there, which is all that matters for our purposes: is there some address info that will be returned in a single-line string from this function? If so, that is success.

HINT: This address info can be found by clicking the Basic Information link that each park/site/monument specification has; even parks that have many locations have a specific mailing address. It looks like this for Old Spanish in California.

NOTE: If a park has no mailing address, the return value of this function should be the empty string ("").
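
One possible way to turn a multi-line address into the single-line form described above is to split on line breaks, drop blank lines, and join what remains with a slash. The helper name `single_line_address` and the sample address are hypothetical and used only for illustration.

```python
def single_line_address(raw_text, separator=" / "):
    """Collapse a multi-line address into one line, skipping blank lines."""
    parts = [line.strip() for line in raw_text.splitlines()]
    return separator.join(part for part in parts if part)


# Hypothetical address text, shaped like what .text might return for an address block.
raw = """
  123 Example Road


  Example Park, CA 00000
"""

print(single_line_address(raw))  # prints: 123 Example Road / Example Park, CA 00000
print(repr(single_line_address("")))  # prints: ''
```

Skipping blank lines up front means no cleanup of repeated slashes is needed afterwards, and an empty input naturally produces the empty string.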

A __contains__ method that checks whether the additional input to the method is included in the string of the park's name. If the input is inside the name of the park, this method should return True; otherwise, it should return False.
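
To illustrate how `__str__` and `__contains__` hook into Python's built-in syntax, here is a minimal sketch using a hypothetical `SiteStub` class that stores only a name and a location (it is not the full NationalSite class, and the sample values are made up).

```python
class SiteStub:
    """Hypothetical stand-in for NationalSite that stores only a name and a location."""

    def __init__(self, name, location):
        self.name = name
        self.location = location

    def __str__(self):
        # Called by str(site) and print(site).
        return "{0} | {1}".format(self.name, self.location)

    def __contains__(self, text):
        # Called by the expression: text in site
        return text in self.name


site = SiteStub("Alcatraz Island", "San Francisco, CA")
print(site)                 # prints: Alcatraz Island | San Francisco, CA
print("Island" in site)     # prints: True
print("Lakeshore" in site)  # prints: False
```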

Note that you may make additional design decisions when you define your class NationalSite to help you write these methods successfully -- e.g. you could add other instance variables or other methods if you wanted to/found them useful.

After you complete this, you should try creating an instance of your NationalSite class with the following code, to test and see if your class definition worked properly:

In [49]:
# example of what we're working w/
alca = ca_soup.find('li', class_='clearfix')
# print(type(alca), alca.prettify())
binfo = [link['href'] for link in alca.find_all('a', href=True) if "Basic Information" in link.text][0]
info_soup = BeautifulSoup(requests.get(binfo).text, 'html.parser')
# info_soup.find("div", class_="mailing-address")
street = info_soup.select('.street-address')[0].text
# street.strip().replace('\n', '/')
# street.replace("\n", "/")
# ' '.join([info_soup.find('span', {"itemprop": "addressLocality"}).text, info_soup.find('span', {"itemprop": "addressRegion"}).text, info_soup.find('span', {"itemprop": "postalCode"}).text])



In [54]:
class NationalSite(object):
    def __init__(self, site_soup):
        self.location = site_soup.find('h4').text
        self.name = site_soup.find('h3').text
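        # find() returns None when a tag is missing, so check before reading .text.
        desc_tag = site_soup.find('p')
        self.description = desc_tag.text if desc_tag is not None else ''
        type_tag = site_soup.find('h2')
        type_text = type_tag.text.strip() if type_tag is not None else ''
        # An absent or empty type is stored as None, as the problem requires.
        self.type = type_text or None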
        self.soup = site_soup
        
    def __str__(self):
        return "{0} | {1}".format(self.name, self.location)
    
    def get_mailing_address(self):
        # Match on a substring because the link text may carry extra whitespace.
        info_links = [link['href'] for link in self.soup.find_all('a', href=True) if "Basic Information" in link.text]
        if not info_links:
            return ""
        soup_info = BeautifulSoup(requests.get(info_links[0]).text, 'html.parser')
        address_tag = soup_info.find("p", class_="adr")
        if address_tag is None:
            return ""
        # Join the non-empty address lines with '/', which also avoids runs like '///'.
        lines = [line.strip() for line in address_tag.text.splitlines()]
        return '/'.join(line for line in lines if line)
    
    def __contains__(self, input):
        return input in self.name

## Problem 3
Create a list of NationalSite objects from each one of these 3 states: Arkansas, California, and Michigan. They should be saved in the following variables, respectively:

- arkansas_natl_sites
- california_natl_sites
- michigan_natl_sites

(You may accumulate these lists in any way you prefer.)

Write 3 CSV files, arkansas.csv, california.csv, michigan.csv -- one for each state's national parks/sites/etc, each of which has 5 columns:

- Name
- Location
- Type
- Address
- Description

Remember to handle e.g. commas and multi-line strings so that data for 1 field all ends up inside 1 spreadsheet cell when you open the CSV!

For any park/site/monument/etc where a value is None, you should put the string "None" in the CSV file.
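
The `csv` module already quotes fields that contain commas or line breaks, so one way to satisfy these requirements is to let `csv.writer` do the escaping and only convert None to the string "None" by hand. The sketch below writes to an in-memory buffer so it runs without creating files; `rows_to_csv_text` and the sample row are hypothetical.

```python
import csv
import io

HEADER = ["Name", "Location", "Type", "Address", "Description"]


def rows_to_csv_text(rows):
    """Write a header plus rows to CSV text, replacing None values with the string "None"."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow(["None" if value is None else value for value in row])
    return buffer.getvalue()


# Hypothetical row with a comma, a missing type, and a multi-line description.
sample = [
    ("Example Site", "City, ST", None, "1 Road / City / ST", "Line one.\nLine two."),
]

text = rows_to_csv_text(sample)

# Reading the text back shows that each value stays inside a single field.
parsed = list(csv.reader(io.StringIO(text, newline="")))
print(parsed[1])
```

When writing to a real file instead of a buffer, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated a second time.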

In [58]:
arkansas_natl_sites = [NationalSite(x) for x in ar_soup.find_all('li', class_='clearfix', id=True)]
california_natl_sites = [NationalSite(x) for x in ca_soup.find_all('li', class_='clearfix', id=True)]
michigan_natl_sites = [NationalSite(x) for x in mi_soup.find_all('li', class_='clearfix', id=True)]


NoneType

In [59]:
import csv

def write_to_csv(filename, site_list):
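    # newline='' lets the csv module control line endings (avoids blank rows on Windows).
    with open(filename, 'w', newline='', encoding='utf-8') as outfile: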
        outwriter = csv.writer(outfile, delimiter=',')
        header = ["Name", "Location", "Type", "Address", "Description"]
        outwriter.writerow(header)

        for site in site_list:
            if site.type is None:
                typ = "None"
            else:
                typ = site.type
            row = [site.name, site.location, typ, site.get_mailing_address(), site.description]
            outwriter.writerow(row)


write_to_csv("arkansas.csv", arkansas_natl_sites)
write_to_csv("california.csv", california_natl_sites)
write_to_csv("michigan.csv", michigan_natl_sites)