This repository contains the source code for the Dura Papyri project. This work was completed as part of CPSC 276/376: Introduction to Applications of Computer and Data Science for the Digital Humanities.
- Clone this repository.

  ```shell
  $ git clone https://github.com/kevin-rono/dura-papyri
  ```
- Create a Python virtual environment and install the package requirements.

  ```shell
  $ cd src
  $ python -m venv venv
  $ source venv/bin/activate
  (venv) $ pip install -U pip wheel
  (venv) $ pip install -r requirements.txt
  ```
- Create the SQL database. If you do not have MySQL installed locally, refer to the official documentation.

  ```shell
  (venv) $ mysql -h localhost -u root -p < init.sql
  ```
- Update the path to `raw/` in `app.py` and change the password in `secretkeys.py` to your own MySQL password. Both files are found in `src/`.

  ```python
  UPLOAD_FOLDER = '/../dura-papyri/raw'
  DB_PASSWORD = ''  # CHANGE TO YOUR PASSWORD
  ```
- Execute the main script to run the web application. You should now be able to open the page at `localhost:5000`.

  ```shell
  (venv) $ python app.py
  ```
An important component of this project involved collecting papyri metadata from disparate online sources and cleaning it into a presentable format. This section details the data preparation and wrangling process.
All raw data files used to power the web application are located under `raw/`.

- `xmls.json` is a JSON file containing all raw XMLs scraped from papyri.info.
- `data.csv` is a CSV file containing metadata for each papyrus item, obtained by parsing and extensively cleaning `xmls.json`.
- `embeddings.json` is a JSON file containing two-dimensional vector representations of each papyrus item, obtained by running a pretrained language model and applying t-SNE dimensionality reduction to the resulting context vectors.
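The dimensionality-reduction step behind `embeddings.json` can be sketched as follows. This is an illustrative sketch, not the project's actual code: the random 768-dimensional vectors stand in for context vectors produced by a pretrained language model, and the perplexity value is an assumed setting.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for context vectors from a pretrained language model
# (assume one 768-dimensional vector per papyrus record).
rng = np.random.default_rng(0)
context_vectors = rng.normal(size=(50, 768))

# Reduce each vector to two dimensions with t-SNE, as described above.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(context_vectors)

print(points_2d.shape)  # (50, 2)
```

Each row of `points_2d` is then a plottable (x, y) point for one papyrus item.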
All executable scripts are located under `bin/`. To execute these scripts locally, `pip install` the dependencies listed in `bin/requirements.txt`.

- `xml_scraper.ipynb` is a Jupyter notebook used to scrape XML files from papyri.info, using Trismegistos as the gateway portal, to produce `xmls.json`.
- `make_csv.py` is a Python script used to parse `xmls.json` into `data.csv`. Note that some manual parsing of the dates was performed to further clean that field.
- `to_english.py` is a Python script used to further clean `data.csv` by removing any occurrences of non-English text.
- `embed.py` is a Python script that processes each entry of `data.csv` to produce a two-dimensional vector representation of each metadata record, available in `embeddings.json`.
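As an illustration of the kind of filtering `to_english.py` performs, the sketch below drops rows whose text is mostly non-ASCII. The heuristic and the column names are assumptions for the example, not the script's actual logic.

```python
def is_english(text: str) -> bool:
    """Crude heuristic: treat a field as English if it is mostly ASCII.

    Illustrative stand-in only; a language-detection library would be
    a more robust choice.
    """
    if not text:
        return True
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= 0.9

# Hypothetical rows in the general shape of data.csv.
rows = [
    {"id": "1", "description": "Letter concerning grain shipments"},
    {"id": "2", "description": "ἐπιστολὴ περὶ σίτου"},  # Greek text, dropped
]
english_rows = [r for r in rows if is_english(r["description"])]
print([r["id"] for r in english_rows])  # ['1']
```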
All scripts can be invoked from the root of the repository as follows.

```shell
$ python bin/FILENAME.py
```
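For a concrete sense of the parsing step performed by `make_csv.py`, the sketch below extracts fields from a toy XML fragment with Python's `ElementTree`. The element names here are invented for illustration; the real EpiDoc XML served by papyri.info is far richer.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one scraped record; not the project's actual schema.
xml_fragment = """
<record>
  <title>Letter from Dura-Europos</title>
  <date>ca. 200-250 CE</date>
  <material>papyrus</material>
</record>
"""

root = ET.fromstring(xml_fragment)
# Flatten each child element into one column of a CSV-ready row.
row = {child.tag: (child.text or "").strip() for child in root}
print(row)
# {'title': 'Letter from Dura-Europos', 'date': 'ca. 200-250 CE', 'material': 'papyrus'}
```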
Released under the MIT License.