Wikidata lexeme adapter for the "Great Finnish-Swedish Word List" by the Institute for the Languages of Finland. More on the project at https://sv.wikipedia.org/wiki/Wikipedia:Projekt_Fredrika/Suru.
The process is divided into seven scripts that go from downloading the Suru XML files to creating lexemes in batches on Wikidata.
The steps:
- Download Suru XML files
- Create overview of XML files
- Convert XML to xlsx
- Add word category to xlsx
- Filter to subset
- Fetch existing Wikidata lexeme data
- Create Wikidata lexemes
Download suru.zip from https://kotus.fi/kotus/kieliaineistot/aineistot-verkossa/ and unzip it into the directory suru. Terminal commands on macOS:
curl -O https://kaino.kotus.fi/lataa/suru.zip
unzip suru.zip -d suru
Run 02_overview_pretty.py to iterate over all XML files and
- count the number of DictionaryEntry tags,
- save the XML files in prettified format for readability and easier debugging,
- compile an overview of the XML structure inside DictionaryEntry tags into 02_xml_structure.xml
Prettified xml files will be saved in: suru_pretty/
SuRu-000-aah.xml, DictionaryEntry tags: 987
SuRu-001-ajatella.xml, DictionaryEntry tags: 1002
SuRu-002-alppikiipeily.xml, DictionaryEntry tags: 1014
...
SuRu-110-ymparistomerkki.xml, DictionaryEntry tags: 996
SuRu-111-oljylammitteinen.xml, DictionaryEntry tags: 111
Total DictionaryEntry items: 110811 in 112 files
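As a rough illustration of the counting and prettifying described above (not the actual contents of 02_overview_pretty.py), the step could be sketched with the standard library only. The function names are hypothetical, and the sketch assumes the Suru XML does not use XML namespaces.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
from pathlib import Path


def count_entries(xml_bytes: bytes) -> int:
    """Count DictionaryEntry elements at any depth (assumes no namespaces)."""
    root = ET.fromstring(xml_bytes)
    return sum(1 for _ in root.iter("DictionaryEntry"))


def prettify(xml_bytes: bytes) -> str:
    """Return an indented copy of the XML for easier reading."""
    return xml.dom.minidom.parseString(xml_bytes).toprettyxml(indent="  ")


def process_directory(src: Path, dst: Path) -> int:
    """Count entries in every XML file in src and write pretty copies to dst."""
    dst.mkdir(exist_ok=True)
    total = 0
    for path in sorted(src.glob("*.xml")):
        xml_bytes = path.read_bytes()
        n = count_entries(xml_bytes)
        total += n
        print(f"{path.name}, DictionaryEntry tags: {n}")
        (dst / path.name).write_text(prettify(xml_bytes), encoding="utf-8")
    return total
```

Calling process_directory(Path("suru"), Path("suru_pretty")) would print one line per file and return the total count.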
Run 03_suru_xlsx.py to produce 03_suru.xlsx with the following columns, filled with the text of XML tags that were chosen manually with the help of 02_xml_structure.xml from the previous step.
- For each .//DictionaryEntry
  - suru_id: get id value, e.g. "SURU_a57ab4b712842b937486ecf07adf5df0"
  - within .//HeadwordCtn
    - headword: Headword
    - subcategorisation: SubCategorisation
    - seealso: SeeAlso ("KS" or none)
  - within .//TranslationBlock
    - translations: TranslationCtn/Translation (possibly several)
  - for each .//SenseGrp (possibly several)
    - sense_groups: TranslationCtn (possibly several)
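The column mapping above can be sketched with xml.etree.ElementTree. This is only an illustration and not the code of 03_suru_xlsx.py: the helper names are hypothetical, the id is assumed to be an XML attribute named id, and the sample entry is made up to mirror the tag names listed above.

```python
import xml.etree.ElementTree as ET


def full_text(node):
    """Concatenate all text inside a node; empty string if the node is missing."""
    return "".join(node.itertext()).strip() if node is not None else ""


def entry_to_row(entry):
    """Map one DictionaryEntry element to a dict of column values."""
    head = entry.find(".//HeadwordCtn")
    block = entry.find(".//TranslationBlock")
    translations = []
    if block is not None:
        translations = [full_text(t) for t in block.findall("TranslationCtn/Translation")]
    # One inner list per SenseGrp, one item per TranslationCtn in that group.
    sense_groups = [
        [full_text(ctn) for ctn in grp.findall("TranslationCtn")]
        for grp in entry.findall(".//SenseGrp")
    ]
    return {
        "suru_id": entry.get("id", ""),  # assumption: id is an attribute
        "headword": full_text(head.find("Headword")) if head is not None else "",
        "subcategorisation": full_text(head.find("SubCategorisation")) if head is not None else "",
        "seealso": full_text(head.find("SeeAlso")) if head is not None else "",
        "translations": translations,
        "sense_groups": sense_groups,
    }


# Made-up sample entry, for illustration only.
SAMPLE = """<Dictionary><DictionaryEntry id="SURU_example">
<HeadwordCtn><Headword>aamu</Headword><SubCategorisation>s</SubCategorisation></HeadwordCtn>
<TranslationBlock><TranslationCtn><Translation>morgon</Translation></TranslationCtn></TranslationBlock>
<SenseGrp><TranslationCtn><Translation>morgon</Translation></TranslationCtn></SenseGrp>
</DictionaryEntry></Dictionary>"""

rows = [entry_to_row(e) for e in ET.fromstring(SAMPLE).iter("DictionaryEntry")]
```

The list-valued columns would need to be joined into strings before being written to xlsx cells; the separator is a design choice the source does not specify.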
Add the Finnish word category (verb, noun, etc.), which is needed to identify the correct Wikidata lexeme:
- Download the word list:
curl -O https://kaino.kotus.fi/lataa/nykysuomensanalista2024.txt
- Run 04_cat.py
Total rows: 111765
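A minimal sketch of the category lookup follows. It is not the code of 04_cat.py, and it rests on an unverified assumption about the word list: that it is tab-separated with a header row containing the columns Hakusana (headword) and Sanaluokka (word class). The function names and the sample lines in the usage note are hypothetical; adjust the column names after inspecting the downloaded file.

```python
import csv
from collections import defaultdict


def load_categories(lines, word_col="Hakusana", cat_col="Sanaluokka"):
    """Build a headword -> set of word classes mapping from tab-separated lines.

    The column names are assumptions about the word list format.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    categories = defaultdict(set)
    for row in reader:
        word = (row.get(word_col) or "").strip()
        category = (row.get(cat_col) or "").strip()
        if word and category:
            categories[word].add(category)
    return dict(categories)


def categories_for(headword, categories):
    """Return the word classes for a headword as one string; empty if unknown."""
    return ", ".join(sorted(categories.get(headword, ())))
```

With an open file handle, load_categories(open("nykysuomensanalista2024.txt", encoding="utf-8")) would give the mapping, and categories_for(headword, mapping) the value for a new column. A headword can map to more than one word class, which is why a set is kept per headword.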
Run 05_filter.py to filter down to smaller subsets:
- Most searched-for words in Suru (05_vanligaste.xlsx; requires "kotus Vanligaste sökningarna jan-mars 2025.xlsx")
- Finland-specific words in Suru (05_suom_lista.xlsx; requires "kotus uppsl-med-suom-lista.txt")
An alternative subset could be all existing Finnish Wikidata lexemes, https://w.wiki/DfFQ
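The filtering idea can be sketched as a plain set-membership test on the headword column. This is an assumption about the approach, not the code of 05_filter.py; the function name is hypothetical and the case-insensitive comparison is a choice made for this sketch.

```python
def filter_rows(rows, wanted_headwords):
    """Keep only rows whose headword appears in wanted_headwords.

    rows: iterable of dicts with a "headword" key.
    wanted_headwords: iterable of strings, e.g. read from a search-log export.
    Comparison ignores case and surrounding whitespace (a choice for this sketch).
    """
    wanted = {w.strip().casefold() for w in wanted_headwords if w.strip()}
    return [row for row in rows if row["headword"].strip().casefold() in wanted]
```

For example, filter_rows([{"headword": "aamu"}, {"headword": "ajatella"}], ["Aamu"]) keeps only the first row.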
Run 06_match_lexeme.py to match Finnish headwords and Swedish translations to Wikidata lexemes. The script also fetches the Wikidata object for each lexeme sense.
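One way to look up existing lexemes by lemma is a SPARQL query against the Wikidata Query Service. The sketch below is an assumption about how such matching could work, not the code of 06_match_lexeme.py; the function names and the User-Agent string are hypothetical. Q1412 is the Wikidata item for Finnish (Q9027 for Swedish).

```python
import json
import urllib.parse
import urllib.request

FINNISH = "Q1412"
ENDPOINT = "https://query.wikidata.org/sparql"


def build_lexeme_query(lemmas, language_qid=FINNISH, lang_code="fi"):
    """Build a SPARQL query returning lexemes whose lemma is in lemmas."""
    # json.dumps gives a double-quoted, backslash-escaped string literal.
    values = " ".join(
        f"{json.dumps(lemma, ensure_ascii=False)}@{lang_code}" for lemma in lemmas
    )
    return f"""
SELECT ?lexeme ?lemma ?category WHERE {{
  VALUES ?lemma {{ {values} }}
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory ?category .
}}"""


def run_query(query):
    """Send the query to the endpoint and return the result bindings."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "suru-lexeme-sketch/0.1 (hypothetical)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Returning the lexical category alongside the lemma lets the caller compare it with the word category added in the earlier step, so that homographs with different word classes are not confused. Batching the lemmas into groups keeps each query small.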
Use 07_create_lex.py to create new lexemes with suru_id, sense and object.
To update and create lexemes while browsing https://kaino.kotus.fi/finsk-svensk/: load the folder suru-wikidata-extension in a Chrome-compatible browser at chrome://extensions.
To use the extension widget's "create lexem with flask" link, run python 07_create_lex_flask.py. This requires LexData and a .env file with WIKI_USERNAME, WIKI_PASSWORD and WIKI_EMAIL for authentication.
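Before starting the Flask script it can help to check that the .env file defines all three variables. The sketch below is a standalone check with hypothetical function names; it assumes simple KEY=VALUE lines and is not how 07_create_lex_flask.py itself loads the file (which may well use a library such as python-dotenv).

```python
REQUIRED = ("WIKI_USERNAME", "WIKI_PASSWORD", "WIKI_EMAIL")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]
```

Reading the file with parse_env(open(".env", encoding="utf-8").read()) and passing the result to missing_credentials gives a list that should be empty before the script is run.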