Machine-readable lists of lemma-token pairs in 23 languages.
Latest commit 9c91fe4 May 12, 2018
LICENCE Create LICENCE May 11, 2018
README.md
lemmatization-ast.txt
lemmatization-bg.txt Add files via upload May 11, 2018
lemmatization-ca.txt Add files via upload May 11, 2018
lemmatization-cs.txt
lemmatization-cy.txt Add files via upload May 11, 2018
lemmatization-de.txt
lemmatization-en.txt Add files via upload May 11, 2018
lemmatization-es.txt
lemmatization-et.txt
lemmatization-fa.txt
lemmatization-fr.txt
lemmatization-ga.txt
lemmatization-gd.txt
lemmatization-gl.txt
lemmatization-gv.txt
lemmatization-hu.txt
lemmatization-it.txt Add files via upload May 11, 2018
lemmatization-pt.txt
lemmatization-ro.txt Add files via upload May 11, 2018
lemmatization-sk.txt
lemmatization-sl.txt
lemmatization-sv.txt
lemmatization-uk.txt


Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, etc.
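As a rough illustration of the query expansion described above, here is a minimal Python sketch. The function name `expand_query` and the tiny in-memory `SAMPLE` mapping are assumptions made for this example only; they are not part of this repository, and a real search backend would combine the expanded forms in its own query syntax.

```python
from typing import Dict, List

# Hypothetical sample data: lemma -> inflected tokens.
# In practice this mapping would be built from one of the lemmatization files.
SAMPLE: Dict[str, List[str]] = {
    "walk": ["walking", "walked", "walks"],
}


def expand_query(term: str, lemma_to_tokens: Dict[str, List[str]]) -> List[str]:
    """Return the search term followed by its known tokens, without duplicates."""
    forms = [term]
    for token in lemma_to_tokens.get(term, []):
        if token not in forms:
            forms.append(token)
    return forms


if __name__ == "__main__":
    # A term with no entry is returned unchanged as a one-element list.
    print(expand_query("walk", SAMPLE))
    print(expand_query("stroll", SAMPLE))
```

The expanded list can then be joined with whatever OR operator the search engine in use supports.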

These are plain text files (zipped). Each line contains one lemma/token pair, in this order: lemma, tab character, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
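The file format described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the names `parse_pairs` and `load_pairs` are made up for this example, the file is assumed to be already unzipped, and the `utf-8-sig` codec is used as a precaution in case a file starts with a byte order mark (the README does not say whether any do).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_pairs(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group tokens by lemma from lines of the form 'lemma<TAB>token'."""
    lemma_to_tokens: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\r\n")  # strip Windows-style (or Unix) line endings
        if not line:
            continue  # skip blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        lemma_to_tokens[lemma].append(token)
    return dict(lemma_to_tokens)


def load_pairs(path: str) -> Dict[str, List[str]]:
    """Read one unzipped lemmatization file, e.g. 'lemmatization-en.txt'."""
    # newline="" keeps the original line endings; parse_pairs strips them.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return parse_pairs(handle)
```

Calling `load_pairs("lemmatization-en.txt")` would return a dictionary mapping each lemma to the list of tokens found for it, in file order.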

  • Asturian (ast) (108,792 pairs)
  • Bulgarian (bg) (30,323 pairs)
  • Catalan (ca) (591,534 pairs)
  • Czech (cs) (36,400 pairs)
  • English (en) (41,760 pairs)
  • Estonian (et) (80,536 pairs)
  • French (fr) (224,002 pairs)
  • Galician (gl) (392,856 pairs)
  • German (de) (358,473 pairs)
  • Hungarian (hu) (39,898 pairs)
  • Irish (ga) (415,502 pairs)
  • Manx Gaelic (gv) (67,177 pairs)
  • Italian (it) (341,074 pairs)
  • Persian/Farsi (fa) (6,273 pairs)
  • Polish (pl) (3,296,232 pairs)
  • Portuguese (pt) (850,264 pairs)
  • Romanian (ro) (314,810 pairs)
  • Scottish Gaelic (gd) (51,624 pairs)
  • Slovak (sk) (858,414 pairs)
  • Slovene (sl) (99,063 pairs)
  • Spanish (es) (497,560 pairs)
  • Swedish (sv) (675,137 pairs)
  • Ukrainian (uk) (193,703 pairs)
  • Welsh (cy) (359,224 pairs)

Licence

Sources
