The goal of this project is to process literary text files for prosodic analysis. I used Pincelate by Allison Parrish alongside pronunciation phonemes available in cmudict to transform each text into its prosodic profile, a string of 0s (no stress), 1s (primary stress), and 2s (secondary stress).
- User uploads text files via API Gateway, which redirects the request to a lambda handler.
- Lambda stores exact copies in the S3 landing bucket and extracted copies in the bronze bucket.
- Lambda triggers Airflow dag via API.
- Airflow creates EMR and submits a spark job.
- Scala Spark engine pipes data to and from Pincelate, a Python machine learning model for guessing the pronunciation of a word.
- Spark engine writes intermediary results to a silver copy, which contains clean text and full prosodic sequence. Spark engine outputs a gold copy, which contains the prosodic profiles of each text.
- S3 Trigger triggers a Glue crawler, which makes data in the silver and gold buckets available in Athena.
- A dashboarding tool can access the tables in Athena for analysis. An example Jupyter notebook is included in this repository.
- Scala project structure contains the files necessary for compiling a spark jar.
- run_spark_job.py is the Airflow dag.
- stressDict.parquet contains the reference table of words and their corresponding stress patterns.
- soundout.py is the Python script for using Pincelate.
- pincelate.sh is the script for bootstrapping EMR worker nodes.
- lambda.py contains the lambda function that the API Gateway invokes.
- Prosody Example.ipynb is the example Jupyter notebook.
The spark engine jar file is compiled and copied to the appropriate s3 path through GitHub Actions.
- The spark engine can be easily extended to handle other input formats, e.g., html (literature published online).
- The silver copy can be used to do ad-hoc analysis on the full prosodic string.
- The gold copy contains percentages of iambs, dactyls, anapests, trochees, and spondees, normalized by the length of the text.
- The example Jupyter notebook contains multidimensional scaling of a few texts based on their prosodic profiles.
- The stress dictionary (stressDict.parquet) is expanded each time a text is uploaded. The words that are not found in the stress dictionary are piped to Pincelate, then added to the dictionary, so that they're readily available in the future.
