Armenian MFA

In-progress project on forced alignment of Armenian using the Montreal Forced Aligner.

We trained an acoustic model on the Armenian data from the FLEURS dataset. The data consists of around 14 hours of Eastern Armenian speech (n=4380 sound files). We normalized the transcripts for the following purposes:

to remove word-internal punctuation
to remove word-external punctuation
to convert digits into number lemmas
to find errors in the transcripts
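
The punctuation and digit steps above can be sketched in Python. This is only an illustration, not the script we used: the function names are hypothetical, punctuation is taken to mean any character in a Unicode punctuation category, and digits are only flagged here so that they can be spelled out by hand.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Remove punctuation (word-internal and word-external) and tidy spaces."""
    # Assumption: "punctuation" means any Unicode category starting with "P",
    # which covers Armenian marks such as U+055E and U+0589.
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Collapse the runs of whitespace left behind by removed marks.
    return " ".join(kept.split())


def tokens_with_digits(text: str) -> list[str]:
    """List tokens that still contain digits and need a number lemma."""
    return [tok for tok in text.split() if any(ch.isdigit() for ch in tok)]
```

Running `tokens_with_digits` over the normalized transcripts gives a worklist of tokens to convert into number lemmas manually.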

We manually created a pronunciation dictionary by examining the tokens in FLEURS against the Armenian Wiktionary entries on Wikipron.
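
As a rough sketch of how the lookup against Wikipron could be assisted by a script, the code below assumes Wikipron's tab-separated export (word, then space-separated phones) and writes MFA-style dictionary lines (word, whitespace, phones). The function names are hypothetical, and tokens with no entry are returned for manual handling rather than guessed.

```python
def parse_wikipron(lines):
    """Map each word to its list of pronunciations (space-separated phones)."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word, phones = line.split("\t", 1)
        entries.setdefault(word, []).append(phones.strip())
    return entries


def build_dictionary(tokens, entries):
    """Return (dictionary lines, tokens that still need a manual entry)."""
    lines, missing = [], []
    for token in sorted(set(tokens)):
        if token in entries:
            # Keep every listed pronunciation as a separate dictionary line.
            lines.extend(f"{token}\t{phones}" for phones in entries[token])
        else:
            missing.append(token)
    return lines, missing
```

The `missing` list is the part that needs manual work: each token in it has to be checked and transcribed by hand.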

We first trained the model with a beam of 100. This run generated TextGrids with word and phone alignments for 4324 of the 4380 sound files. We then re-ran the model on the data with a beam of 1000, which produced TextGrids for 4379 sound files. The one remaining file seems to be broken.
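
To see which sound files did not receive a TextGrid after a run, the file names in the two directories can be compared. The sketch below assumes that the audio files are `.wav` files and that each TextGrid shares its base name with its sound file. The function name and directory arguments are hypothetical.

```python
from pathlib import Path


def files_without_textgrid(audio_dir, textgrid_dir, audio_ext=".wav"):
    """Return base names of audio files that have no matching TextGrid."""
    audio = {p.stem for p in Path(audio_dir).rglob(f"*{audio_ext}")}
    grids = {p.stem for p in Path(textgrid_dir).rglob("*.TextGrid")}
    return sorted(audio - grids)
```

Comparing the output before and after raising the beam shows which files were recovered by the wider beam and which file still fails.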

Each TextGrid has the following structure:

words tier, generated by MFA.
phones tier, generated by MFA.
sentenceOriginal tier, manually generated. Lists the original transcript from FLEURS.
sentenceNormalized tier, manually generated. Lists the transcript that we created by normalizing the sentenceOriginal tier. The model was run over this tier.
notes tier, manually generated.
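
A quick way to check that a TextGrid has all five tiers is to read the tier names from the file. The sketch below assumes Praat's long text format, where each tier has a line of the form `name = "words"`, and it makes no assumption about tier order. The function names are hypothetical.

```python
import re

EXPECTED_TIERS = {
    "words",
    "phones",
    "sentenceOriginal",
    "sentenceNormalized",
    "notes",
}

# Matches tier header lines such as:    name = "words"
TIER_NAME = re.compile(r'^\s*name\s*=\s*"([^"]*)"\s*$', re.MULTILINE)


def tier_names(textgrid_text: str) -> list[str]:
    """Return the tier names in the order they appear in the file."""
    return TIER_NAME.findall(textgrid_text)


def missing_tiers(textgrid_text: str) -> list[str]:
    """Return the expected tiers that the TextGrid does not contain."""
    return sorted(EXPECTED_TIERS - set(tier_names(textgrid_text)))
```

The text of a file can be passed in with `Path(path).read_text(encoding="utf-8")`, assuming the TextGrids are saved as UTF-8.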


jhdeov/armenianMFA
