Sem.io is an ongoing effort to mine and organize dictionary data into an efficient database that intrinsically links words to their etymologies with ActiveRecord Associations.
Words have references to their word origins, derivations, and etymological relationships.
Words have spellings and ISO language codes that have been indexed with btree for faster lookup.
See the rake db:import task in /lib for the code for importing the ISO language code and word data while automatically assigning their relationships. With about 470,000 entries in tsv format, this took about 2 hours on my machine, or O(n2) time in the abstract.
# Next Steps
I will soon be deploying to a server so this can be consumed as API. Watch github.com/lady3bean/semio_frontend for updates on the frontend. I am using d3.js in Angular to build an interactive data visual for users to explore and see the relationships between words.
Data is taken from Gerad de Melo’s Etymological Wordnet: www1.icsi.berkeley.edu/~demelo/etymwn/. So far, I have imported only the content with English as a language, and any words that those words reference in other languages. I am planning to source word definitions from Wordnik’s API: developer.wordnik.com and word lemmas and grammar data from LinguaSys’ Linguistic API: nlp.linguasys.com/docs/services/