Introduction

Prosody

The goal of this project is to process literary text files for prosodic analysis. I used Pincelate by Allison Parrish alongside the phonemic pronunciations available in cmudict to transform each text into its prosodic profile: a string of 0s (no stress), 1s (primary stress), and 2s (secondary stress), one digit per syllable.
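
To make the encoding concrete, here is a minimal sketch of how a stress string can be read off cmudict-style ARPAbet phonemes, in which each vowel carries a trailing stress digit. The function name `stress_pattern` and the hard-coded pronunciations are illustrative assumptions, not code from this repository.

```python
def stress_pattern(phonemes):
    """Return the stress digits (0, 1, 2) carried by the vowel phonemes."""
    # In cmudict's ARPAbet notation only vowels end in a digit, so keeping
    # the trailing digits yields one stress mark per syllable.
    return "".join(p[-1] for p in phonemes if p[-1].isdigit())


# Pronunciations written in cmudict's format (hard-coded for illustration).
prosody = ["P", "R", "AA1", "S", "AH0", "D", "IY0"]
understand = ["AH2", "N", "D", "ER0", "S", "T", "AE1", "N", "D"]

print(stress_pattern(prosody))     # 100
print(stress_pattern(understand))  # 201
```

Concatenating these per-word strings in reading order gives the kind of full prosodic sequence described above.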

Architecture

The user uploads text files via API Gateway, which forwards the request to a Lambda handler.
Lambda stores exact copies in the S3 landing bucket and extracted copies in the bronze bucket.
Lambda triggers the Airflow DAG via API.
Airflow creates an EMR cluster and submits a Spark job.
The Scala Spark engine pipes data to and from Pincelate, a Python machine learning model for guessing the pronunciation of a word.
The Spark engine writes intermediate results to a silver copy, which contains the clean text and the full prosodic sequence. It then outputs a gold copy, which contains the prosodic profile of each text.
An S3 trigger starts a Glue crawler, which makes the data in the silver and gold buckets available in Athena.
A dashboarding tool can access the tables in Athena for analysis. An example Jupyter notebook is included in this repository.
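
The hand-off between the Spark engine and the Python model in the steps above could look roughly like the following line-oriented filter, which reads one word per line and writes the word with its stress string. This is a sketch under assumptions: the tab-separated protocol, the `run_filter` helper, and the stand-in `fake_guess` function are all hypothetical, and a real script would call Pincelate where the stand-in is used.

```python
import io
import sys


def run_filter(lines, guess_phonemes, out=sys.stdout):
    """Read one word per line and write 'word<TAB>stress digits' per line."""
    for line in lines:
        word = line.strip()
        if not word:
            continue  # skip blank lines
        phonemes = guess_phonemes(word)
        # Keep only the trailing stress digits of the vowel phonemes.
        stress = "".join(p[-1] for p in phonemes if p[-1].isdigit())
        out.write(f"{word}\t{stress}\n")


def fake_guess(word):
    """Hypothetical stand-in for a pronunciation model; ignores its input."""
    return ["AH0", "B", "AH1"]


# Simulate the piped input with an in-memory buffer instead of sys.stdin.
buffer = io.StringIO()
run_filter(["sonnet\n", "\n", "ballad\n"], fake_guess, out=buffer)
print(buffer.getvalue(), end="")
```

In a real run the script would pass `sys.stdin` as `lines`. A plain stdin/stdout protocol like this suits any caller that streams records to an external process; Spark's `RDD.pipe` is one mechanism that works this way, though whether this project uses it is not stated here.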

Files

The Scala project structure contains the files necessary for compiling the Spark jar.
run_spark_jobs.py is the Airflow DAG.
stressDict.parquet contains the reference table of words and their corresponding stress patterns.
soundout.py is the Python script for using Pincelate.
pincelate.sh is the script for bootstrapping EMR worker nodes.
lambda.py contains the Lambda function that API Gateway invokes.
Prosody Example.ipynb is the example Jupyter notebook.
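
The reference table and the Pincelate script listed above complement each other: known words are looked up, and unknown words are guessed once and then remembered. A minimal sketch of that lookup-with-fallback idea follows. It assumes a plain Python dict in place of the parquet-backed table, case-insensitive keys, and a hypothetical `fake_guess` function in place of the Pincelate-based guesser.

```python
def lookup_stress(word, stress_dict, guess_stress):
    """Return the stress pattern for word, guessing and caching on a miss."""
    key = word.lower()  # assumption: the dictionary is keyed by lowercase words
    if key not in stress_dict:
        # Unknown word: ask the slower model once, then remember the answer.
        stress_dict[key] = guess_stress(key)
    return stress_dict[key]


# A plain dict stands in for the reference table of words and stress patterns.
stress_dict = {"prosody": "100"}
calls = []


def fake_guess(word):
    """Hypothetical stand-in for the model; records each word it is asked about."""
    calls.append(word)
    return "10"


print(lookup_stress("Prosody", stress_dict, fake_guess))  # 100 (dictionary hit)
print(lookup_stress("zorble", stress_dict, fake_guess))   # 10 (guessed, then stored)
print(lookup_stress("zorble", stress_dict, fake_guess))   # 10 (now a dictionary hit)
print(calls)  # ['zorble']
```

The guesser is called only once for the unknown word, which is the benefit of letting the dictionary grow as new texts arrive.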

CI/CD

GitHub Actions compiles the Spark engine jar file and copies it to the appropriate S3 path.

Notes

The Spark engine can be extended to handle other input formats, e.g., HTML (literature published online).
The silver copy can be used for ad hoc analysis of the full prosodic string.
The gold copy contains percentages of iambs, dactyls, anapests, trochees, and spondees, normalized by the length of the text.
The example Jupyter notebook contains multidimensional scaling of a few texts based on their prosodic profiles.
The stress dictionary (stressDict.parquet) is expanded each time a text is uploaded. The words that are not found in the stress dictionary are piped to Pincelate, then added to the dictionary, so that they're readily available in the future.
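
The foot percentages stored in the gold copy can be illustrated with a short sketch. It is not the repository's Spark implementation and rests on several assumptions: secondary stress (2) is treated as stressed, feet are counted with an overlapping sliding window, and "length of the text" is taken to mean the number of syllables in the stress string. The names `FEET`, `count_overlapping`, and `foot_percentages` are hypothetical.

```python
# Metrical feet as stress patterns (0 = unstressed, 1 = stressed).
FEET = {
    "iamb": "01",
    "trochee": "10",
    "spondee": "11",
    "dactyl": "100",
    "anapest": "001",
}


def count_overlapping(sequence, pattern):
    """Count every position where pattern occurs, allowing overlaps."""
    return sum(
        sequence.startswith(pattern, i)
        for i in range(len(sequence) - len(pattern) + 1)
    )


def foot_percentages(stress):
    """Return each foot's count as a percentage of the stress string length."""
    # Assumption: secondary stress (2) is counted as stressed.
    binary = stress.replace("2", "1")
    if not binary:
        return {name: 0.0 for name in FEET}
    return {
        name: 100.0 * count_overlapping(binary, pattern) / len(binary)
        for name, pattern in FEET.items()
    }


profile = foot_percentages("0101010101")
print(profile["iamb"])     # 50.0
print(profile["trochee"])  # 40.0
```

Because the windows overlap, the five values need not sum to 100. They are best read as a five-dimensional signature for comparing texts, which is the kind of input multidimensional scaling works with.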
