# Web Scraping and Parsing



## Step 1 - Download and Parse Data
The **`retrieveRecords`** method is a public method of the **`SIScraper`** class. It works as a generator that sequentially returns every record on the results pages (each a **`div`** of class **`record`**) as a `BeautifulSoup` [`Tag`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag). When it reaches the end of the current page it moves on to the next page, and it stops only when there are no more results.
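
The page-by-page generator behaviour described above can be illustrated without any network access. The sketch below is not the `SIScraper` implementation: it assumes a hypothetical in-memory `FAKE_PAGES` mapping (page key to a list of records plus the key of the next page) and a hypothetical `iter_records` function, purely to show how a generator can hide pagination from its caller.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a paginated site:
# page key -> (records on that page, key of the next page or None)
FAKE_PAGES: Dict[str, Tuple[List[str], Optional[str]]] = {
    "page1": (["record A", "record B"], "page2"),
    "page2": (["record C"], None),
}


def iter_records(start: str,
                 pages: Dict[str, Tuple[List[str], Optional[str]]] = FAKE_PAGES
                 ) -> Iterator[str]:
    """Yield records one at a time, moving to the next page when one is exhausted."""
    key: Optional[str] = start
    while key is not None:
        records, next_key = pages[key]
        # Hand out the records of the current page lazily.
        yield from records
        # Follow the "next" link; None ends the loop.
        key = next_key


for i, record in enumerate(iter_records("page1"), 1):
    print(i, record)
```

Because nothing is fetched until the caller asks for the next record, a consumer that stops early never touches the later pages.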


In [2]:
import requests
from bs4 import BeautifulSoup


class SIScraper:

    def __init__(self):
        self.session = requests.Session()
        self.prefix = ''
        self.url = None

    def retrieveRecords(self, url: str):
        """Yield every record Tag, following pagination until no page is left."""
        self.url = url
        # nextUrl appends each link's href to this prefix (the URL without its query string).
        self.prefix = url.split('?', 1)[0]
        for soup in self.getPages():
            # Keep only the div.span10 blocks that contain an <h2> title.
            for div in soup.find_all('div', {'class': 'span10'}):
                if div.find('h2'):
                    yield div

    def getPages(self):
        """Yield the parsed soup of each results page in turn."""
        while self.url:
            page = self.session.get(self.url)
            soup = BeautifulSoup(page.content)
            yield soup
            self.url = self.nextUrl(soup)

    def nextUrl(self, soup):
        """Return the URL behind the 'next' pagination link, or None on the last page."""
        for a in soup.select('div.pagination a'):
            if a.text.strip() == 'next':
                return f'{self.prefix}{a["href"]}'
        return None
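
`nextUrl` builds the next address by concatenating the part of the URL before `?` with the link's `href`, which works when the `href` is a query-only reference. A slightly more general option, sketched here as an alternative and not as the class's own method, is `urllib.parse.urljoin` from the standard library, which also resolves path-absolute links. The `resolve_next` helper, the `example.org` URLs, and the `href` values below are all hypothetical.

```python
from urllib.parse import urljoin


def resolve_next(current_url: str, href: str) -> str:
    """Resolve a pagination href against the URL of the page it was found on."""
    return urljoin(current_url, href)


current = "https://example.org/search/results.htm?q=a"

# Query-only href: the path is kept and the query string is replaced.
print(resolve_next(current, "?q=a&start=20"))
# → https://example.org/search/results.htm?q=a&start=20

# Path-absolute href: scheme and host are kept, the path and query are replaced.
print(resolve_next(current, "/search/results.htm?start=40"))
# → https://example.org/search/results.htm?start=40
```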


### Testing the class


In [3]:

URL = 'https://collections.si.edu/search/results.htm?date.slider=&q=&dsort=&fq=object_type%3A%22Outdoor+sculpture%22&fq=data_source%3A%22Art+Inventories+Catalog%2C+Smithsonian+American+Art+Museum%22&fq=date:%221400s%22'
scraper = SIScraper()
records = scraper.retrieveRecords(URL)
for i,record in enumerate(records, 1):
    print(i, record.find('h2').text)

1 Old Testament Children's Doors, (sculpture)
2 Marble Well Head, (sculpture)
3 The Well of Samaria, (sculpture)
4 Joan of Arc, (sculpture)
5 Normanno Wedge #1, (sculpture)
6 Marble Well Head, (sculpture)
7 Well Head, (sculpture)
8 Font, (sculpture)
9 Department of Justice Building: Viking Ships Relief, (sculpture)
10 The Apotheosis of St. Louis, (sculpture)
11 Jupiter, (sculpture)
12 Italia, (sculpture)
13 Joan of Arc, Maiden of Orleans, (sculpture)
14 Recumbent Stone Camels, (sculpture)
15 Well Head, (sculpture)
16 Spheres (2), (sculpture)
17 Aphrodite Fountain, (sculpture)
18 St. Joan of Arc, (sculpture)
19 Jeanne d'Arc, (sculpture)
20 Marble Column with Associated Well Head, (sculpture)
21 Venus, (sculpture)
22 Diana, (sculpture)
23 Joan of Arc, (sculpture)
24 Stone Well Head, (sculpture)
25 Apollo, (sculpture)
26 (Two Medieval Knights), (sculpture)
27 Stone Font, (sculpture)
28 Joan of Arc-Equestrian, (sculpture)
29 Young Meher, (sculpture)
30 The Crusader: Victor Lawson Monument, (sculpture)

## Step 2 - Extract Record into JSON

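The **`toJson`** method transforms each record `Tag` returned by **`retrieveRecords`** in Step 1 into a JSON-style object (a Python `dict` in the implementation below). The object has a `Label` key holding the record's title, plus one key for each field label in the record other than `Title:` (for example `Sculptor:`, `Medium:`, or `Topic:`), whose value is the list of that field's entries.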



In [4]:
class SIScraperJson(SIScraper):

    ### DO NOT CHANGE OR EDIT THIS METHOD
    def retrieveRecordsAsJson(self, url):
        yield from map(self.toJson, self.retrieveRecords(url))

    ### You must complete the following method that takes
    ### record of type bs4.Tag (from Task 1), and return a string
    def toJson(self, record):
        # Blank out <span> elements so their text does not leak into the values.
        for span_tag in record.find_all('span'):
            span_tag.replace_with('')
        d = {'Label': record.find('h2').text}
        # Each <dl> holds one field: a <dt> label followed by one or more <dd> values.
        for dl in record.find_all('dl'):
            key = dl.find('dt').text.strip()
            if key != 'Title:':
                values = [dd.text.strip() for dd in dl.find_all('dd')]
                if values:
                    d[key] = values
        return d
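
The `toJson` method above returns a Python dictionary. If a JSON string in the strict sense is needed, one possible approach is to serialize that dictionary with the standard library's `json.dumps`. The sketch below assumes a hypothetical `to_json_string` helper and a made-up sample record shaped like the ones this step produces.

```python
import json

# Hypothetical record, shaped like the dictionaries built in this step.
sample_record = {
    "Label": "Example Fountain, (sculpture)",
    "Medium:": ["Bronze"],
    "Type:": ["Sculptures-Outdoor Sculpture", "Sculptures"],
}


def to_json_string(record_dict: dict) -> str:
    """Serialize a record dictionary to a JSON string, keeping non-ASCII text readable."""
    return json.dumps(record_dict, ensure_ascii=False)


text = to_json_string(sample_record)
print(text)

# Parsing the string back gives the original dictionary.
print(json.loads(text) == sample_record)
```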


### Testing the class

In [5]:
### MUST RUN AS-IS FROM BELOW WITHOUT ANY EDITS
URL = 'https://collections.si.edu/search/results.htm?date.slider=&q=&dsort=&fq=object_type%3A%22Outdoor+sculpture%22&fq=data_source%3A%22Art+Inventories+Catalog%2C+Smithsonian+American+Art+Museum%22&fq=date:%221400s%22'
scraper = SIScraperJson()
records = scraper.retrieveRecordsAsJson(URL)


print('\n>> The FIRST record')
display(next(records))

print('\n>> The LAST record')
display(max(enumerate(records))[1])


>> The FIRST record


{'Label': "Old Testament Children's Doors, (sculpture)",
 'Sculptor:': ['Moore, Bruce 1905-1980'],
 'Architect:': ['Fox, William B.'],
 'Founder:': ['Modern Art Foundry', 'Associated Ironworkers'],
 'Medium:': ['Bronze'],
 'Culture:': ['French'],
 'Type:': ['Sculptures-Outdoor Sculpture', 'Sculptures-Door', 'Sculptures'],
 'Owner/Location:': ['Administered by Episcopal Diocese of California 1051 Taylor Street San Francisco California 94108',
  'Located Grace Cathedral Taylor & California Streets Entrance to south tower San Francisco California'],
 'Date:': ['1964'],
 'Topic:': ['Religion--Old Testament--Joseph',
  'Religion--Old Testament--Moses',
  'Religion--Old Testament--Samuel',
  'Religion--Old Testament--David',
  'Religion--Old Testament--Goliath',
  'Religion--Old Testament--Eli',
  'Allegory--Arts & Sciences--Industry',
  'Allegory--Quality--Fortitude',
  'Religion--Saint--St. Joan of Arc',
  'Occupation--Military--Commander',
  'Ethnic',
  'History--Medieval--France'],
 'Con


>> The LAST record


{'Label': 'The Crusader: Victor Lawson Monument, (sculpture)',
 'Sculptor:': ['Taft, Lorado Zadoc 1860-1936'],
 'Medium:': ['Sculpture: granite; Base: granite'],
 'Type:': ['Sculptures-Gravestone',
  'Sculptures-Outdoor Sculpture',
  'Sculptures'],
 'Owner/Location:': ['Graceland Cemetery 4001 North Clark Street Chicago Illinois 60613'],
 'Date:': ['1931'],
 'Topic:': ['Figure male',
  'Occupation--Military--Knight',
  'History--Medieval',
  'Object--Weapon--Sword',
  'Dress--Accessory--Shield',
  'Dress--Historic--Medieval Dress',
  'Homage--Lawson, Victor Fremont'],
 'Control number:': ['IAS 87580154'],
 'Data Source:': ['Art Inventories Catalog, Smithsonian American Art Museums'],
 'EDAN-URL:': ['edanmdm:siris_ari_296284']}