This repository contains the source code for the Dura Papyri project. This work was completed as part of CPSC 276/376: Introduction to Applications of Computer and Data Science for the Digital Humanities.
- Clone this repository.

  ```shell
  $ git clone https://github.com/kevin-rono/dura-papyri
  ```
- Create a Python virtual environment and install the package requirements.

  ```shell
  $ cd src
  $ python -m venv venv
  $ source venv/bin/activate
  (venv) $ pip install -U pip wheel
  (venv) $ pip install -r requirements.txt
  ```
- Create the SQL database. If you do not have MySQL installed locally, refer to the official documentation.

  ```shell
  (venv) $ mysql -h localhost -u root -p < init.sql
  ```
- Update the path to `raw/` in `app.py` and change the password in `secretkeys.py` to your own MySQL password. Both files are found in `src/`.

  ```python
  UPLOAD_FOLDER = '/../dura-papyri/raw'
  DB_PASSWORD = ''  # CHANGE TO YOUR PASSWORD
  ```
- Execute the main script to run the web application. You should now be able to open the page at `localhost:5000`.

  ```shell
  (venv) $ python app.py
  ```
An important component of this project involved collecting papyri metadata from disparate online sources and cleaning it into a presentable format. This section details the data preparation and wrangling process.
All raw data files used to power the web application are located under `raw/`.

- `xmls.json` is a JSON file containing all raw XMLs scraped from papyri.info.
- `data.csv` is a CSV file containing metadata for each papyrus item, obtained by parsing and extensively cleaning `xmls.json`.
- `embeddings.json` is a JSON file containing two-dimensional vector representations of each papyrus item, obtained by running a pretrained language model and applying t-SNE dimensionality reduction to the resulting context vectors.
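The dimensionality-reduction step behind `embeddings.json` can be sketched as follows. This is an illustrative sketch, not the project's actual code: the random 768-dimensional vectors stand in for context vectors produced by a pretrained language model, and the perplexity value is an assumed setting.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for context vectors from a pretrained language model
# (assume one 768-dimensional vector per papyrus record).
rng = np.random.default_rng(0)
context_vectors = rng.normal(size=(50, 768))

# Reduce each vector to two dimensions with t-SNE, as described above.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(context_vectors)

print(points_2d.shape)  # (50, 2)
```

Each row of `points_2d` is then a plottable (x, y) point for one papyrus item.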
All executable scripts are located under `bin/`. To execute these scripts locally, `pip install` the dependencies listed in `bin/requirements.txt`.

- `xml_scraper.ipynb` is a Jupyter notebook used to scrape XML files from papyri.info, using Trismegistos as the gateway portal, to produce `xmls.json`.
- `make_csv.py` is a Python script used to parse `xmls.json` into `data.csv`. Note that some manual parsing of the dates was performed to further clean that field.
- `to_english.py` is a Python script used to further clean `data.csv` by removing any occurrences of non-English text.
- `embed.py` is a Python script that processes each entry of `data.csv` to produce a two-dimensional vector representation of each metadata record, available in `embeddings.json`.
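As an illustration of the kind of filtering `to_english.py` performs, the sketch below drops rows whose text is mostly non-ASCII. The heuristic and the column names are assumptions for the example, not the script's actual logic.

```python
def is_english(text: str) -> bool:
    """Crude heuristic: treat a field as English if it is mostly ASCII.

    Illustrative stand-in only; a language-detection library would be
    a more robust choice.
    """
    if not text:
        return True
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= 0.9

# Hypothetical rows in the general shape of data.csv.
rows = [
    {"id": "1", "description": "Letter concerning grain shipments"},
    {"id": "2", "description": "ἐπιστολὴ περὶ σίτου"},  # Greek text, dropped
]
english_rows = [r for r in rows if is_english(r["description"])]
print([r["id"] for r in english_rows])  # ['1']
```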
All scripts can be invoked from the root of the repository as follows.

```shell
$ python bin/FILENAME.py
```
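For a concrete sense of the parsing step performed by `make_csv.py`, the sketch below extracts fields from a toy XML fragment with Python's `ElementTree`. The element names here are invented for illustration; the real EpiDoc XML served by papyri.info is far richer.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one scraped record; not the project's actual schema.
xml_fragment = """
<record>
  <title>Letter from Dura-Europos</title>
  <date>ca. 200-250 CE</date>
  <material>papyrus</material>
</record>
"""

root = ET.fromstring(xml_fragment)
# Flatten each child element into one column of a CSV-ready row.
row = {child.tag: (child.text or "").strip() for child in root}
print(row)
# {'title': 'Letter from Dura-Europos', 'date': 'ca. 200-250 CE', 'material': 'papyrus'}
```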
Released under the MIT License.