Morpheme Segmentation with LSTM and Transformers

Morpheme segmentation is the process of separating words into their fundamental units of meaning. For example:

  • foundationalism → found+ation+al+ism

This project is a reproduction of the 2nd- and 1st-place systems from the 2022 SIGMORPHON shared task on word segmentation. These systems, from Ben Peters and Andre Martins, are DeepSPIN-2, a recurrent neural network (LSTM) model with character-level tokenization, and DeepSPIN-3, a transformer-based model that uses an entmax loss function and ULM (unigram language model) subword tokenization.
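
To illustrate the entmax piece, here is a minimal sketch (not the authors' code) using the entmax PyTorch package: unlike softmax, entmax-1.5 can assign exactly zero probability to unlikely output symbols, which is the property DeepSPIN-3 exploits.

# Minimal sketch (not the DeepSPIN code): softmax vs. entmax-1.5 outputs.
import torch
from entmax import entmax15  # pip install entmax

logits = torch.tensor([[2.0, 1.0, 0.1, -3.0]])
dense = torch.softmax(logits, dim=-1)  # every class receives nonzero probability
sparse = entmax15(logits, dim=-1)      # low-scoring classes can receive exactly zero

print(dense)   # all four entries are positive
print(sparse)  # trailing entries can be exactly 0.0, giving a sparse distribution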

Organization

This repository is organized as follows:

baseline/
baseline/bert       # simple BertTokenizer generator and evaluator
baseline/morfessor  # simple Morfessor 2.0 trainer, generator, and evaluator
deepspin            # flexible implementation of DeepSPIN-2 and DeepSPIN-3 with fairseq and an LSTM architecture, as outlined in the paper above
yoyodyne            # a basic implementation using yoyodyne - https://github.com/CUNY-CL/yoyodyne
lstm                # an LSTM work-in-progress architecture built with basic PyTorch (for academic purposes)
The baseline directory contains two scripts for generating baseline segmentations: one uses a pretrained BertTokenizer (baseline/bert) and the other uses Morfessor 2.0 (baseline/morfessor), an unsupervised tool that is not pretrained.
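
As a rough illustration of what the BertTokenizer baseline does (the model name below is an assumption; the actual scripts live in baseline/bert), a pretrained WordPiece tokenizer can be used to produce a naive segmentation in the shared task's @@ format:

# Illustrative sketch only -- see baseline/bert for the real generator and evaluator.
from transformers import BertTokenizer

# The model choice here is an assumption, not necessarily what the baseline uses.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def segment(word: str) -> str:
    # WordPiece marks word-internal pieces with "##"; strip the marker and
    # join the pieces with the shared task's " @@" separator.
    pieces = tokenizer.tokenize(word)
    return " @@".join(piece.lstrip("#") for piece in pieces)

print(segment("foundationalism"))  # output pieces depend on the vocabulary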

In the case of DeepSPIN-2 and DeepSPIN-3, the original implementations were written by Ben Peters, but the scripts in this repository streamline their usage and decouple tokenization from each architecture. This makes it possible to explore a transformer architecture with character-level tokenization or an LSTM architecture with subword tokenization, which helps determine whether subword tokenization is a crucial ingredient in the high performance of DeepSPIN-3. Spoiler alert: it accounts for only 0.2% of the F-score.
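
As a sketch of the two tokenization schemes being swapped (file names and parameters below are placeholders; the repository's preprocessing scripts may differ):

# Character-level tokenization (DeepSPIN-2 style): one token per character.
def char_encode(word: str) -> str:
    return " ".join(word)

print(char_encode("tanításokért"))  # t a n í t á s o k é r t

# ULM (unigram LM) subword tokenization (DeepSPIN-3 style): train a SentencePiece
# unigram model on a word list, then apply it to each word.
# Assumes train_words.txt (one word per line) exists; names/vocab size are placeholders.
import sentencepiece as spm  # pip install sentencepiece

spm.SentencePieceTrainer.train(
    input="train_words.txt", model_prefix="ulm", vocab_size=1000, model_type="unigram"
)
sp = spm.SentencePieceProcessor(model_file="ulm.model")
print(" ".join(sp.encode("tanításokért", out_type=str)))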

The Data

Here is a sample of the training data for Hungarian:

tanításokért	tanít @@ás @@ok @@ért	110
Algériánál	Algéria @@nál	100
metélőhagymába	metélő @@hagyma @@ba	101
fülésztől	fül @@ész @@től	110

After training, the model is expected to receive just the first column (the unsegmented word) and separate it into morphemes, using @@ as the morpheme separator. The final column, which can also be used for training, contains three bits that indicate the types of morphology present (inflection, derivation, and compounding); it is currently unused.
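
For clarity, here is a small sketch of reading this TSV format (the parsing below is an assumption based on the sample above, not the repository's own loader):

# Each training line is: word <TAB> segmentation with " @@" separators <TAB> 3-bit class.
def parse_line(line: str):
    word, segmentation, morph_bits = line.rstrip("\n").split("\t")
    morphemes = segmentation.split(" @@")  # splitting on " @@" recovers the morpheme list
    return word, morphemes, morph_bits

word, morphemes, bits = parse_line("tanításokért\ttanít @@ás @@ok @@ért\t110")
print(word)       # tanításokért
print(morphemes)  # ['tanít', 'ás', 'ok', 'ért']
print(bits)       # '110': inflection, derivation, compounding flags (per the description above)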

Setup

  • Make sure to clone this repository with --recurse-submodules to ensure you get the data from the competition.
  • After creating a virtual environment with python -m venv <name> or conda create ..., install the necessary Python libraries with pip install -r requirements.txt from the root directory of this repository.

Training

Please refer to the READMEs within each subdirectory for training and evaluation instructions for each architecture.

About

This is a survey of morpheme segmentation techniques, including two baselines (BertTokenizer, Morfessor 2.0) and two supervised models (LSTM, Transformer).
