Machine-readable lists of lemma-token pairs in 24 languages.
Latest commit 9c91fe4 May 11, 2018

Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages. I have collected them (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, etc.
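
The query expansion described above could look roughly like the following sketch. The function names `build_index` and `expand_query` and the sample pairs are made up for illustration; they are not part of this repository, and a real search engine would plug the expanded terms into its own query syntax.

```python
from collections import defaultdict


def build_index(pairs):
    """Group tokens by lemma. `pairs` is any iterable of (lemma, token) tuples."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index


def expand_query(term, index):
    """Return the search term followed by its known tokens, sorted alphabetically."""
    tokens = index.get(term, set()) - {term}
    return [term] + sorted(tokens)


# Hypothetical sample data; the real lists are far larger.
sample_pairs = [("walk", "walking"), ("walk", "walked"), ("walk", "walks")]
sample_index = build_index(sample_pairs)

print(expand_query("walk", sample_index))   # ['walk', 'walked', 'walking', 'walks']
print(expand_query("xyzzy", sample_index))  # ['xyzzy'] (unknown terms pass through)
```

An unknown term is returned unchanged, so the search still runs when a word is missing from the list.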

These are plain text files (zipped). Each line contains one lemma/token pair separated by a tab character, in this order: lemma, tab, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
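
As a minimal sketch of reading this format, assuming a file has already been unzipped: the helper names `parse_pairs` and `read_pairs` are hypothetical, and the `utf-8-sig` codec is used only as a precaution so that a byte-order mark, if one happens to be present, is ignored.

```python
def parse_pairs(lines):
    """Yield (lemma, token) tuples from an iterable of text lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the Windows-style line break
        if not line:
            continue  # skip blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        yield lemma, token


def read_pairs(path):
    """Read an unzipped list file; `path` is supplied by the caller."""
    # utf-8-sig also decodes plain UTF-8 and drops a leading BOM if present.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        yield from parse_pairs(handle)


# Hypothetical sample in the documented format: lemma, tab, token, CRLF.
sample_text = "walk\twalking\r\nwalk\twalked\r\n"
sample_result = list(parse_pairs(sample_text.splitlines(keepends=True)))
print(sample_result)  # [('walk', 'walking'), ('walk', 'walked')]
```

Both functions are generators, so even the largest lists can be streamed line by line instead of being loaded into memory at once.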

  • Asturian (ast) (108,792 pairs)
  • Bulgarian (bg) (30,323 pairs)
  • Catalan (ca) (591,534 pairs)
  • Czech (cs) (36,400 pairs)
  • English (en) (41,760 pairs)
  • Estonian (et) (80,536 pairs)
  • French (fr) (224,002 pairs)
  • Galician (gl) (392,856 pairs)
  • German (de) (358,473 pairs)
  • Hungarian (hu) (39,898 pairs)
  • Irish (ga) (415,502 pairs)
  • Manx Gaelic (gv) (67,177 pairs)
  • Italian (it) (341,074 pairs)
  • Persian/Farsi (fa) (6,273 pairs)
  • Polish (pl) (3,296,232 pairs)
  • Portuguese (pt) (850,264 pairs)
  • Romanian (ro) (314,810 pairs)
  • Scottish Gaelic (gd) (51,624 pairs)
  • Slovak (sk) (858,414 pairs)
  • Slovene (sl) (99,063 pairs)
  • Spanish (es) (497,560 pairs)
  • Swedish (sv) (675,137 pairs)
  • Ukrainian (uk) (193,703 pairs)
  • Welsh (cy) (359,224 pairs)

Licence

Sources