<a href="https://colab.research.google.com/github/moira-du-monde/webscraping_r_python/blob/main/scraping_materials/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PLEASE NOTE: This case study assumes you've already installed Python and Jupyter Notebook (see [these docs](https://docs.jupyter.org/en/latest/install/notebook-classic.html) to learn how to install on your own computer) and/or are able to run Jupyter Notebook on [Google Colab](https://www.geeksforgeeks.org/how-to-use-google-colab/).

# Scraping the Cleveland Museum of Art's Online Collection
A case study in webscraping, designed to complement the February 2023 workshop "INTRO TO WEB SCRAPING USING R OR PYTHON" presentation [slides](link). 

## History and objective

🎨 We are art historians interested in gathering data about ancient artworks of controversial or unknown provenance in the U.S.

The statue ["Apollo the Python Slayer"](https://www.cleveland.com/arts/2013/09/the_cleveland_museum_of_art_wa.html) is one such artwork.  We know it is part of the
[Cleveland Museum of Art's](https://www.clevelandart.org/) permanent collection, and we would like to add it to our master database.  

Instead of manually copying this information from the website, which can take time and lead to errors, we will write a Python program that (1) scrapes *Apollo's* title, artist, geographical origin, medium, year, ownership history, and a description of how the museum came into possession of the piece, and (2) inserts the new record into our existing dataframe.


## Process

**1.  Installing libraries**

Begin by installing (if you haven't already) the two libraries we'll be using.

The first, [Requests](https://requests.readthedocs.io/en/latest/), allows you to make HTTP/1.1 requests to a server (in effect, this establishes the connection) and to read in the target webpage's HTML code.

Once we've made our request, we can use [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to parse this HTML and extract data from it.


NOTE: Our Statistical Consulting office heeds the first commandment of programming: "thou shalt not use unnecessary packages and libraries." In that spirit, the packages we will use in this workshop have been carefully and sparingly selected.

In [2]:
%%capture
## install requests
%pip install requests

In [3]:
%%capture
## install beautifulsoup4
%pip install beautifulsoup4


After installing the libraries, import them along with [pandas](https://pandas.pydata.org/docs/), the data-analysis library we will use to organize and store our data in a table-like dataframe.  pandas comes pre-installed on Google Colab and in the Anaconda distribution; elsewhere you may need to install it yourself.

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

**2.  Making the GET Request**

If you are targeting a single webpage with a single URL, you'll just need a few lines of code to get started.  We already know what we are looking for so we will go right to the source.

In [None]:
apollo_url = 'https://www.clevelandart.org/art/2004.30'

Once you've identified a target page, use the Requests package to access its HTML.

In [None]:
## get request
req = requests.get(apollo_url)

## src is the conventional variable name for the server's response text
src = req.text
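
A request can fail (a mistyped URL, a server error, a slow connection), so it may help to check the response before parsing it. The sketch below is one possible way to do that; the helper name `fetch_html` and its `get` and `timeout` parameters are our own illustration, not part of the workshop code.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page's HTML as a string, raising an error if the request fails."""
    response = get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()           # raises requests.HTTPError on a 4xx/5xx status
    return response.text                  # the decoded body of the response


# Usage (needs network access):
# src = fetch_html('https://www.clevelandart.org/art/2004.30')
```

The `get` parameter defaults to `requests.get`; it is only there so the function can be tried out without a network connection.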


**3. Parsing the HTML/XML**

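When we are ready to parse the text, we'll pass it to the BeautifulSoup constructor to create an HTML tree.

In [None]:
## Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')

## lxml is a fast third-party HTML/XML parser that Beautiful Soup can use (pip install lxml if it is missing)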

If you want to check your work and/or have the HTML handy in this window, run the next cell to print.

In [None]:
## Print to view the HTML using the "prettify" method, which preserves the indentation
html_lines = soup.prettify().splitlines()

## since we know our information is near the bottom of the page, we'll start printing at line 1000
print("\n".join(html_lines[1000:]))

**4. Extracting the HTML elements of interest**

As a reminder, we have set out to collect the artwork's title, date, artist, artist's country of origin, medium, previous owners, and a description of how the museum came into possession of the piece.  

By examining the above HTML, we notice **title** information is held in the "h1" header tag and is of the class "field field-name-field-primary-title field-type-text field-label-hidden".

We will pass those arguments to the "soup" object.  Calling it like a function is shorthand for its "find_all" method, which returns a list of every matching element in the HTML (Beautiful Soup calls this list a "ResultSet").  

In [None]:
title = soup("h1", "field field-name-field-primary-title field-type-text field-label-hidden") # soup function args : tag, class
## print title and notice the output is a list
print(title)

print('Class: ', type(title)) 

To retrieve a single item from the list, select by the item's index to create an object of type "bs4.element.Tag."

In [None]:
## select the first record and call the variable "title_only"
title_only = title[0]
print('Class: ', type(title_only)) # notice the data type for this record is a Beautiful Soup tag

To get the element's text only, read the **text** attribute of this Tag object.

In [None]:
title_for_db = title_only.text
print(title_for_db) # check this only outputs the relevant data
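
The three steps above (search, select, read the text) can be tried on a small piece of HTML without touching the museum's site. In this sketch the HTML string is made up and only mimics the page's class names; `demo_soup` and the other variable names are ours. It uses Python's built-in "html.parser" so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the museum page; only the structure matters.
html = """
<h1 class="field field-name-field-primary-title">Example Title</h1>
<p class="field field-name-field-date-text">c. 1900</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all(tag, class_=...).
matches = demo_soup("h1", "field field-name-field-primary-title")
same_matches = demo_soup.find_all("h1", class_="field field-name-field-primary-title")

first_tag = matches[0]       # a single bs4.element.Tag
title_text = first_tag.text  # the text between the opening and closing tags

print(type(matches).__name__)    # ResultSet
print(type(first_tag).__name__)  # Tag
print(title_text)                # Example Title
```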

We have now collected our title data:

1.   Title: "h1" header tag of class "field field-name-field-primary-title field-type-text field-label-hidden"

Let's use the same workflow to get the rest of our data, which involves the following tags and class combinations:

2.   Date: "p" paragraph of class "field field-name-field-date-text field-type-text field-label-hidden"

3.   Artist: a combination of "span" container of class "field field-name-field-artist-qualifier" and "span" container of class "field field-name-field-artist-name"

4.   Artist's country of origin: "p" paragraph of class "field field-name-field-artist-origin"

5.   Medium: "p" paragraph of class "field field-name-art-object-medium field-type-ds field-label-hidden"

6.   Ownership history: multiple (3) combinations of "div" container of class "field field-name-field-provenance-description" and "field field-name-field-provenance-date"

7.   Museum possession: "span" of class "field field-name-field-credit-line"


In [None]:
# Date

date = soup("p", "field field-name-field-date-text field-type-text field-label-hidden")
date_only = date[0]
date_for_db = date_only.text

## print(date_for_db)

In [None]:
# Artist

qualifier = soup("span", "field field-name-field-artist-qualifier")
qual_for_db = qualifier[0].text

# print(qual_for_db) # uncomment if you'd like to check your work

artist = soup("span", "field field-name-field-artist-name")
artist_for_db = artist[0].text

# print(artist_for_db) # uncomment if you'd like to check your work

qual_artist_db = qual_for_db + " " + artist_for_db ## concatenate qualifier and artist with space in the middle

## print out qual_artist_db to check work
print(qual_artist_db)

In [None]:
# Artist origin

origin = soup("p", "field field-name-field-artist-origin")
origin_for_db = origin[0].text

print(origin_for_db)

In [None]:
# Medium

medium = soup("p", "field field-name-art-object-medium field-type-ds field-label-hidden")
medium_for_db = medium[0].text

print(medium_for_db)

In [None]:
# Ownership history

ownership = soup("div", "field field-name-field-provenance-description")
owner_1 = ownership[0].text
owner_2 = ownership[1].text
owner_3 = ownership[2].text

date_change = soup("div","field field-name-field-provenance-date")
change_1 = date_change[0].text
change_2 = date_change[1].text

first = owner_1 + " " + change_1
second = owner_2 + " " + change_2
third = owner_3

print(first)
print(second)
print(third)

In [None]:
# Funds used for purchase

funds = soup("span", "field field-name-field-credit-line")
funds_for_db = funds[0].text

print(funds_for_db)
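
The cells above index `[0]` directly, which raises an `IndexError` if a page has no element with that tag and class (for example, an artwork with no listed artist). One possible guard is a small helper like the one below. The name `first_text` and its `default` parameter are hypothetical, and the HTML is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag, css_class, default=None):
    """Return the stripped text of the first match, or `default` if there is none."""
    matches = soup(tag, css_class)
    if not matches:
        return default
    return matches[0].get_text(strip=True)


# Made-up HTML that contains an origin field but no credit line.
demo_soup = BeautifulSoup(
    '<p class="field field-name-field-artist-origin"> Example origin </p>',
    "html.parser",
)

print(first_text(demo_soup, "p", "field field-name-field-artist-origin"))
print(first_text(demo_soup, "span", "field field-name-field-credit-line", "unknown"))
```

The first call prints the field's text with surrounding whitespace removed; the second finds nothing and falls back to "unknown" instead of stopping the program.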

## Simple dataframe

Run the code below to quickly reproduce the dataframe containing our other records and add the new data.

In [7]:
# dataframe
data = [[0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0]]
df = pd.DataFrame(data, columns=['Title', 'Date', 'Artist', 'Artist_origin', 'Medium', 'Ownership_I','Ownership_II', 'Ownership_III','Funds_used_for_purchase'])

In [None]:
# add our scraped data
new_data = [[title_for_db, date_for_db, qual_artist_db,origin_for_db,medium_for_db,first, second, third, funds_for_db]]
new_data = pd.DataFrame(new_data, columns=['Title', 'Date', 'Artist', 'Artist_origin', 'Medium', 'Ownership_I','Ownership_II', 'Ownership_III','Funds_used_for_purchase'])
df_whole = pd.concat([df, new_data], ignore_index=True) # DataFrame.append was removed in pandas 2.0
df_whole.head()
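
To see what adding a row does in isolation, here is a minimal sketch with two columns and placeholder values ("Work A", "Work B", "Work C" are invented, not real records). It assumes `pd.concat`, which stacks dataframes that share the same columns.

```python
import pandas as pd

columns = ["Title", "Date"]

# Placeholder rows standing in for an existing dataframe.
existing = pd.DataFrame([["Work A", "1900"], ["Work B", "1950"]], columns=columns)
new_row = pd.DataFrame([["Work C", "2000"]], columns=columns)

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index.
combined = pd.concat([existing, new_row], ignore_index=True)
print(combined)
```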

## Scraping using procedural programming

This simple program does all of the above in just a few steps.

In [None]:
# call the url
apollo_url = 'https://www.clevelandart.org/art/2004.30'

# get request
req = requests.get(apollo_url)

# src usually the object label for the server's response
src = req.text

# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml') # lxml is a fast third-party HTML/XML parser

results = []
out = []

def apollo(tags_list, classes_list):
  # search every tag/class combination; combinations with no match return an empty list
  for i in tags_list:
    for j in classes_list:
      results.append(soup(i, j))

def output(results_list):
  # collect the text of every matched element, in search order
  for r in results_list:
    for n in r:
      out.append(n.text)
  return out


tags = ["h1", "p", "span", "span", "p", "p", "div", "div", "span"]
classes = ["field field-name-field-primary-title field-type-text field-label-hidden", 
           "field field-name-field-date-text field-type-text field-label-hidden", 
           "field field-name-field-artist-qualifier",
           "field field-name-field-artist-name",
           "field field-name-field-artist-origin",
           "field field-name-art-object-medium field-type-ds field-label-hidden",
           "field field-name-field-provenance-description",
           "field field-name-field-provenance-date",
           "field field-name-field-credit-line"
           ]

apollo(tags, classes)

output(results)

# add our scraped data to dataframe
index_list= [0,1,5,2,3,16,17,18,28]
proc_data = [out[i] for i in index_list]
df.loc[2]=proc_data
df.head()
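
Because the program above tries every tag with every class, the position of each value in `out` has to be looked up by hand (the `index_list`). A different approach, sketched here as an alternative rather than as the workshop's method, pairs each field name with exactly one tag and class so results can be looked up by name. The HTML is made up, and `fields` and `scrape_fields` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up HTML that reuses the museum page's class names.
html = """
<h1 class="field field-name-field-primary-title">Example Title</h1>
<div class="field field-name-field-provenance-description">Owner one</div>
<div class="field field-name-field-provenance-description">Owner two</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Each field name maps to exactly one (tag, class) pair.
fields = {
    "Title": ("h1", "field field-name-field-primary-title"),
    "Ownership": ("div", "field field-name-field-provenance-description"),
    "Funds": ("span", "field field-name-field-credit-line"),
}


def scrape_fields(soup, fields):
    """Return {field name: list of text values}, running one search per field."""
    return {
        name: [match.text for match in soup(tag, css_class)]
        for name, (tag, css_class) in fields.items()
    }


record = scrape_fields(demo_soup, fields)
print(record)
```

A field that is missing from the page simply comes back as an empty list, so it is easy to spot.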

## Note on the CMA's API

As we mentioned in the presentation, it is best practice to check for an API before you scrape.

Rather incredibly, the CMA does have its own [API](https://openaccess-api.clevelandart.org/).  The institution was founded "for the benefit of all the people forever" and its online strategy deliberately upholds this mission.  Its web data, which includes descriptions, dates, provenances, and images, is unrestricted for both commercial and non-commercial use.

To facilitate the use of the API, the museum even supplies tabulated code books and sample code.  The following example Python script uses the [Requests](https://requests.readthedocs.io/en/latest/user/quickstart/#make-a-request) library to send an HTTP request to the API.  Because the API responds with JSON rather than HTML, there is nothing to parse with Beautiful Soup: the response's `json()` method converts the data directly into Python dictionaries and lists.

In some ways, this API is a little unconventional.  It doesn't require you to register for a Key to access the content, and it still requires some general programming skills (some API services streamline the workflow for you).

However, like the majority of APIs, this one delineates the data for you in an easy-to-read and import layout.

In [None]:
# code source: https://openaccess-api.clevelandart.org/
# additional commentary by Moira O.

import json


def print_openaccess_results(keyword, skip=0, limit=100):
    url = "https://openaccess-api.clevelandart.org/api/artworks"
    params = {
            'q': keyword,
            'skip': skip,
            'limit': limit,
            'has_image': 1
        }

    r = requests.get(url, params=params)

    data = r.json()

    for artwork in data['data']:
        tombstone = artwork['tombstone']
        image = artwork['images']['web']['url']

        print(f"{tombstone}\n{image}\n---")

if __name__ == '__main__':
    print_openaccess_results("monet", 0, 10)
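
To see what request the function above actually sends, this sketch builds the same query URL by hand with the standard library. Requests performs an equivalent encoding step when it is given `params=...`; the variable names `base_url` and `full_url` are ours.

```python
from urllib.parse import urlencode

base_url = "https://openaccess-api.clevelandart.org/api/artworks"
params = {"q": "monet", "skip": 0, "limit": 10, "has_image": 1}

# urlencode turns the dictionary into "key=value" pairs joined by "&".
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before writing any parsing code.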


## Test your knowledge

Use your new webscraping skills to answer these three questions!

**Question 1**: What year did Vincent Van Gogh paint "The Large Plane Trees (Road Menders at Saint-Rémy)"?

In [None]:
# use the URL below to answer this question:
van_gogh = 'https://www.clevelandart.org/art/1947.209'

# get request:


# src usually the object label for the server's response:


# Parse the response into an HTML tree:


# Use the appropriate tag and class arguments to get the artwork's date:


# print out the results:


**Question 2:** Who is the artist behind the CMA's photograph "Camera Work: Steeplechase Day, Paris: After the Races"?

In [None]:
# use the URL below to answer this question:
steeplechase = 'https://www.clevelandart.org/art/1995.199.42.k'

# get request:


# src usually the object label for the server's response:


# Parse the response into an HTML tree:


# Use the appropriate tag and class arguments to get the artist's name:


# print out the results:


**Question 3:** What medium(s) did Louise Bourgeois use to create "Untitled c.1950"?

In [None]:
# use the URL below to answer this question
bourgeois = 'https://www.clevelandart.org/art/1998.112'

# get request:


# src usually the object label for the server's response:


# Parse the response into an HTML tree:


# Use the appropriate tag and class arguments to get the artwork's medium:


# print out the results:
