word2ipa

Word-to-IPA transcriptions extracted from Wiktionary.

Download

Sample:

en	ablest	eɪ.bləst
en	abligation	ɑb.ləˈɡeɪ.ʃən
en	abligurition	əˌblɪɡjʊˈɹɪʃən
en	ablings	ˈeɪ.blɪnz
en	ablood	əˈblʌd
en	abloom	əˈbluːm
en	ablow	əˈbloʊ
en	ablude	əˈbluːd
en	abluent	ˈæb.lu.ənt
en	ablur	əˈblɜː
en	ablur	əˈblɜːɹ
en	ablush	əˈblʌʃ
en	ablute	əˈbluːt
en	ablute	əˈblut
en	ablutionary	əˈblu.ʃəˌnɛ.ɹi

Each row has three tab-separated fields: a Wiktionary language code, a word, and its IPA transcription. A word with more than one pronunciation appears on several rows.
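As a minimal sketch of reading this format, the Python snippet below groups transcriptions by language code and word. The function name `load_transcriptions` and the inline sample string are illustrative assumptions, not part of this repository; a real run would pass an open TSV file instead.

```python
import csv
import io
from collections import defaultdict


def load_transcriptions(lines):
    """Group IPA transcriptions by (language code, word).

    `lines` is any iterable of text lines, e.g. an open TSV file.
    (Hypothetical helper, not part of the word2ipa scripts.)
    """
    table = defaultdict(list)
    # QUOTE_NONE: treat quote characters in the data as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        lang, word, ipa = row
        table[(lang, word)].append(ipa)
    return table


# Three rows taken from the sample above.
sample = "en\tablur\təˈblɜː\nen\tablur\təˈblɜːɹ\nen\tablush\təˈblʌʃ\n"
table = load_transcriptions(io.StringIO(sample))
print(table[("en", "ablur")])  # ['əˈblɜː', 'əˈblɜːɹ']
```

Keeping a list per key preserves every listed pronunciation instead of silently keeping only the last one.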

Extracting IPA transcriptions

The easiest way to get the transcriptions is to download a pre-packaged release. To extract them yourself, follow these steps.

# Clone the repo.
git clone https://github.com/lggruspe/word2ipa
cd word2ipa

# Install some requirements.
python -m venv env
. env/bin/activate
pip install orjson

# Download machine-readable Wiktionary data from kaikki.
wget https://kaikki.org/dictionary/All%20languages%20combined/kaikki.org-dictionary-all.json
# Or https://kaikki.org/dictionary/<Language name>/kaikki.org-dictionary-<Language name>.json

# Extract transcriptions.
python -m word2ipa <.json dictionary from kaikki>
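For readers curious what the extraction step involves, here is a rough Python sketch. It is not the code in this repository. It assumes the kaikki dump holds one JSON object per line, with `word`, `lang_code`, and a `sounds` list whose items may carry an `ipa` key, and it assumes the surrounding `/…/` or `[…]` delimiters should be stripped. The name `iter_transcriptions` is hypothetical, and the standard `json` module stands in for `orjson` to keep the example self-contained.

```python
import json


def iter_transcriptions(lines):
    """Yield (lang_code, word, ipa) tuples from JSON-lines dictionary entries.

    Assumed entry layout (an assumption, not confirmed by this README):
    {"word": ..., "lang_code": ..., "sounds": [{"ipa": ...}, ...]}
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        lang = entry.get("lang_code")
        if not word or not lang:
            continue
        for sound in entry.get("sounds", []):
            ipa = sound.get("ipa")
            if ipa:
                # Drop the /.../ or [...] delimiters around the transcription.
                yield lang, word, ipa.strip("/[]")


# A made-up entry in the assumed layout, for illustration only.
example_line = json.dumps({
    "word": "abloom",
    "lang_code": "en",
    "sounds": [{"ipa": "/əˈbluːm/"}, {"audio": "example.ogg"}],
})
rows = list(iter_transcriptions([example_line]))
for lang, word, ipa in rows:
    print(f"{lang}\t{word}\t{ipa}")
```

Reading the dump line by line keeps memory use low, which matters because the all-languages file is large.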

Licenses

Copyright 2023 Levi Gruspe

The scripts in this repository are licensed under GPLv3 or later.

The published TSV files are released under a Creative Commons Attribution-ShareAlike 3.0 Unported License. These are extracted from Wiktionary with the help of wiktextract and kaikki.org.

Wiktionary is licensed under CC BY-SA 3.0.