This repository contains datasets for experimentation on the paradigm cell-filling task (PCFP) where the task is to fill in the missing forms in a set of morphological inflection tables. For example,
try INF try INF PRES+3+SG => tries PRES+3+SG PRES+PCPLE trying PRES+PCPLE tried PAST tried PAST
These datasets were used in the experiments for the paper
Miikka Silfverberg and Mans Hulden. An Encoder-Decoder Approach to the Paradigm Cell Filling Problem. EMNLP 2018.
Training and test data
data contains inflection tables for eight languages: Finnish, French, Georgian, German, Latin,
Latvian, Spanish and Turkish. These were created based on the Unimorph data sets for each language.
Originally, the inflection tables have been crawled from Wiktionary.
LAN.um.POS.txt contains 1,000 inflection tables randomly sampled from the Unimorph data set for language
LAN is one of
POS is either
N for nouns or
V for verbs). The files
LAN.um.POS.n contain the same inflection tables but these tables only show
n=1,2 or 3 randomly
sampled forms. Your task is to use each file
LAN.um.POS.n.txt to learn a model which can fill in the missing forms in that file.
LAN.um.POS.top.txt contains a varying number of inflection tables. These have been chosen to include Unimorph
tables which include at least one word form from the list of 10,000 most frequent word forms for language
LAN based on
word frequencies in Wikipedia for lan
LAN. The file
LAN.um.POS.top.10000.txt contains the same inflection tables but only
the most frequent word forms are shown. The task is to fill in the missing forms.
outputs contains the outputs of the models presented in the EMNLP paper. Each file
contains outputs for input file
LAN.um.POS.n.txt. The files
LAN.um.POS.top.res.txt correspond to the test file
LAN.um.POS.top.10000.txt. The files
LAN.um.POS.top.baseline.res.txt are analogous files for the
src directory contains code and documentation for running the experiments in Silfverberg and Hulden (2018).