Watch the 3-minute YouTube video from the Library Publishing Forum, May 13, 2021.
At the University of Michigan Library we manage ACLS Humanities E-Book, a collection of over 5,000 foundational books in the humanities mostly published by North American university presses, with whom we partner to license the collection. We are enriching the metadata as part of our current Mellon grant. Some of the data we want, like book descriptions, is missing, but we know it's public elsewhere on the internet, including on many university press websites.
We could copy and paste it manually, asking one of our student employees to visit 5,000 individual web pages and copy the text over into a spreadsheet or a document.
But there is a more elegant solution: web scraping. You can think of web scraping as automated copy-pasting. You use a programming language like Python or a data cleaning application like OpenRefine to go onto the internet without opening your browser, lift text from web pages you name, and save it all in one file on your computer.
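As a rough illustration of the idea (not this project's actual script), a few lines of standard-library Python can "copy" the same piece of text from many pages. The sample pages below are hypothetical inline HTML standing in for pages you would normally download, so the sketch runs offline:

```python
from html.parser import HTMLParser

# Hypothetical samples standing in for publisher pages fetched over HTTP;
# in a real run you would download each page (e.g., with urllib or requests).
PAGES = {
    "book-1": "<html><head><title>Book One</title></head><body></body></html>",
    "book-2": "<html><head><title>Book Two</title></head><body></body></html>",
}

class TitleGrabber(HTMLParser):
    """Collects the text inside the <title> element of one page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

results = {}
for name, html in PAGES.items():
    parser = TitleGrabber()
    parser.feed(html)
    results[name] = parser.title  # one "copy-paste" per page, no browser needed

print(results)  # {'book-1': 'Book One', 'book-2': 'Book Two'}
```

The same loop scales from two pages to 5,000 without any extra human effort, which is the whole appeal.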
How did we decide this method was best for our project? We worked through these questions:
- Is web scraping technically feasible? You need to have someone on your project with some programming or data curation expertise. If you don't have this yourself, try teaming up with a metadata librarian, a digital scholarship librarian, or someone in information technology. Also, the website you want to collect data from matters--some websites aren't structured in the right way, and some websites have ways of blocking web scraping scripts.
- Is web scraping ethical in this case? If the data is sensitive or the creator community is vulnerable, you might need to get research ethics approval, and if the data is copyrighted, you'd need to contact the copyright holder.
- Does web scraping provide efficiency at scale? If you only have a few records, your time is better spent doing the work manually than getting the automated process up and running.
In our case, we found web scraping was just the right sauce for getting book descriptions. We have the right combination of skills on our team, the websites are often structured the right way, the data is not sensitive, the university presses and authors who wrote the descriptions are our partners in this collection, and with over 5,000 books, we have the scale to make automation worthwhile.
Like a lot of people, I'm skeptical of technical methods proposed as magical solutions with no negative side effects. I want to avoid framing web scraping as "technoheroic", to use a term from Catherine D'Ignazio. It sure felt magical to see the book descriptions stacking up quickly in our spreadsheet, but really, we just avoided repeating the hidden labor that all those editorial assistants and marketing assistants put into publishing those descriptions on their websites in the first place. Scraping did not solve everything; it just freed us up from this one tedious task so that we could focus on some of the more artful to-dos on our list.
- Make sure you have Python 3 installed. This script was written with Python 3.7.8.
- Clone this repository or download the files, and navigate into the directory with Terminal or Git Bash.
- Install the libraries used by the script using `pip`. They are listed as imports at the top of the file; you can also create a virtual environment with something like virtualenvwrapper and run `pip install -r requirements.txt` to install them all at once.
- Modify `metadata.xlsx` to list the elements that you will use to build the URLs. In this demo, ISBNs are all we need, since the webpages are found at URLs built with 13-digit ISBNs like https://www.ucpress.edu/book/9780520270183. The script currently expects ISBNs in any of three input columns (`ebook`, `hardcover`, and `paper`), but this can be changed by editing the opening line of the `for` loop in the `manage_scraping` function. Depending on the URL structure of the sites you want to scrape, you may need to prepare multiple elements. (You can also add other columns to `metadata.xlsx` or rearrange them without affecting the script, but you will also have to specify the desired output columns in the `new_headers` list if you want to keep them.)
- Edit the `manage_scraping` function so that it 1) correctly forms the URLs based on your input data and 2) pinpoints the data point(s) you want from each webpage. You may want to read about `requests`, and you will definitely need to read up on `BeautifulSoup`.
- Edit the output parameters in the `manage_apply` function as desired. If you output to Excel, you will need to add your scraped data points to the `new_headers` list; see the `pandas DataFrame` reference. If you want to output to Word, edit the rows that build the `new_document` object using syntax from `python-docx`. Note that output to Word is generally only useful if you are sure of the automated workflow to be used to re-ingest that content into your target system without losing the data structure; we have used Word styles interpreted by `python-docx` for this.
- Run the script with `python scraping_demo.py` or the appropriate command for your installation or environment.
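To make the URL-building and data-pinpointing steps concrete, here is a minimal sketch of the pattern the steps above describe. It is not the repo's actual `manage_scraping` code: the URL template matches the UC Press example, but the CSS selector (`div.book-description`) is hypothetical (inspect the real pages to find the right one), and the network fetch is replaced with a canned response so the sketch runs offline:

```python
import pandas as pd
from bs4 import BeautifulSoup

# URL template from the UC Press example above; other presses will differ.
URL_TEMPLATE = "https://www.ucpress.edu/book/{isbn}"

def build_url(isbn):
    """Form the page URL from a 13-digit ISBN."""
    return URL_TEMPLATE.format(isbn=isbn)

def extract_description(html):
    """Pinpoint the description in one page's HTML.

    The selector is a placeholder -- use your browser's inspector to find
    the element that actually holds the description on your target site.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("div.book-description")
    return node.get_text(strip=True) if node else None

# In the real script each page would be fetched with requests.get(url).text;
# a canned response keeps this sketch runnable without network access.
sample_html = (
    '<html><body>'
    '<div class="book-description">A foundational study.</div>'
    '</body></html>'
)

rows = [{
    "isbn": "9780520270183",
    "url": build_url("9780520270183"),
    "description": extract_description(sample_html),
}]

# Mirror of the Excel output step: new_headers controls the output columns.
new_headers = ["isbn", "url", "description"]
df = pd.DataFrame(rows, columns=new_headers)
# df.to_excel("scraped.xlsx", index=False)  # final write step; needs openpyxl
```

The real work is in `extract_description`: each press website marks up its descriptions differently, so expect to write one selector (or one small variation of this function) per site you scrape.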