Collections as Data
This repository hosts Python code for downloading and analyzing scanned OCR text from a collection of southwestern U.S. borderlands newspapers. It also includes Jupyter Notebooks introducing text data mining with Python on the newspaper collection. The work is part of the project Using Newspapers as Data for Collaborative Pedagogy: A Multidisciplinary Interrogation of the Borderlands in Undergraduate Classrooms, funded in part by the Mellon Foundation through the Collections as Data program. More information about the project is available found at https://libguides.library.arizona.edu/newspapers-as-data.
If you are looking for an introduction explaining the concept of text data mining, check out the StoryMap at https://storymaps.arcgis.com/stories/cd7e273c42cd4ab6b6ce3fa89c13132c.
The work focuses on the following titles:
- Arizona Post, a Tucson newspaper by and for the Jewish community
- Arizona Sun, an African American newspaper published in Phoenix
- Apache Sentinel, published by African American soldiers stationed at Fort Huachuca
- Bisbee Daily Review, a newspaper published in Bisbee, a mining town at that time
- Border Vidette, a newspaper published in Nogales, Arizona, on the border with Nogales, Mexico
- Phoenix Tribune, the first African American newspaper published in Arizona
- El Sol, a Spanish-language, Mexican American newspaper published in Phoenix
- El Tucsonense, a Spanish-language, Mexican American newspaper published in Tucson
The text for these newspapers is available at Chronicling America. Downloads of the texts used the API, documented at https://chroniclingamerica.loc.gov/about/api/.
Text data mining lessons
Lessons for using these data in text data mining are available in Jupyter Notebooks. All lessons are licensed under a CC-BY-4.0 license 2020 by Jeffrey C. Oliver.
Data preparation scripts
- download-pages.py: Download files via the Chronicling America API. Pages are in the "pages" folder within an individual title's folder. For example, pages for El Tucsonense are downloaded to 'el-tucsonense/pages'. File names reflect the date, in YYYYMMDD, and the page number; e.g. page 2 of El Tucsonense's paper from January 3, 1925 is stored as 19250103-2.txt.
- assemble-volumes.py: Assemble text file for a single day's newspaper by concatenating all individual pages for a particular day/title combination. The text for each day's paper is stored in the 'volumes' folder for each title. For example, the text for the January 3, 1925 issue of El Tucsonense is located in data/el-tucsonense/volumes/19250103.txt.
- retrieve-complete.py and retrieve-complete.sh: Python and bash scripts for downloading full data set from the University of Arizona Research Data Repository. These scripts download the archive file from the data repository and extract the contents to data/complete. Both scripts accomplish the same thing, they differ only in language used.