
Fine-tune LLM for early Middle English lemmatization

The manx toolkit for early Middle English lemmatization is based on data from the LAEME corpus.

manx lets you fine-tune a ByT5 large language model for the downstream task of lemmatizing historical, early Middle English texts. The example fine-tuned google/byt5-small model published on Huggingface achieves a lemma accuracy of 92.5% on the validation part of the data split from the LAEME corpus. Manx was developed for research and educational purposes only. It shows how corpus data from historical languages can be used to fine-tune large language models to support researchers in their daily work.
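
If you just want to try the published checkpoint without installing manx, a minimal sketch along the lines below should work with the transformers library. The "Lemmatize:" prefix mirrors the --t5prefix option used to prepare the training data, and the input form is purely illustrative, so check the model card for the exact input format the checkpoint expects.

# Minimal sketch: query the published checkpoint directly with transformers.
# The input string is a made-up example, not a guaranteed format.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("mdm-code/me-lemmatize-byt5-small")
model = T5ForConditionalGeneration.from_pretrained("mdm-code/me-lemmatize-byt5-small")

inputs = tokenizer("Lemmatize: louerd", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))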

The project does not interfere with the copyright statement for LAEME given here. The LAEME data is not distributed with the toolkit and does not form any part of this project. The toolkit uses the LAEME data only to allow users to fine-tune and use a language model. The data is not persisted in any form in the project's online repositories. The copyright statement for LAEME still applies to the data pulled from the LAEME website and persisted in order to fine-tune the model.

The project is distributed under the GPL-3 license, meaning that all derivatives of whatever kind are to be distributed under the same GPL-3 license with all their parts and source code disclosed in full. Whenever you use the project, make sure to explicitly reference this repository and the original LAEME corpus. The license for the toolkit does not apply to the LAEME data, but it does apply to any software built on top of the toolkit and to the form of the data output by the Manx parser.

Installation

To use manx on your machine, you first have to install it with Python. You can install it straight from this repository with the following command:

python3 -m pip install manx@git+https://github.com/mdm-code/manx.git

I am not a big fan of cluttering the Python Package Index with all sorts of code that folks come up with, so I decided to stick with a simple repository.

As for the version of Python, use Python >=3.10 as declared in the pyproject.toml file.

Once installed, you should be able to invoke manx -h from your terminal.

Usage

You can use manx to fiddle with the data from LAEME, fine-tune a T5 model yourself, and serve it behind an API. You can key in manx -h to see all the available options. There are three commands that manx supports:

  • download: It lets you download corpus files and store them on disk.
  • parse: It allows you to parse the corpus for model fine-tuning.
  • api: It lets you serve the fine-tuned model behind a REST API.

The download command is straightforward: you give it the -r root, and files are pulled from the website and stored on the drive. The parse command lets you parse the corpus from the files you pulled with download, or parse them directly from the web using the --from-web flag, meaning the files are kept in memory only. You can specify the length of the ngrams extracted from the corpus or the size of the document chunks later used to shuffle the corpus parts. The two options are useful when --format is set to t5. The default command to get data from LAEME for model fine-tuning would look like this:

manx parse \
	--verbose \
	--from-web \
	--format t5 \
	--ngram-size 11 \
	--chunk-size 200 \
	--t5prefix "Lemmatize:" \
	--output t5-laeme-data.csv

You can run head on t5-laeme-data.csv to get an idea of what the resulting CSV file looks like.
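
If you prefer to stay in Python, the short sketch below does the same job with pandas without assuming any particular column names, so it works whatever header the parser emits.

# Peek at the parsed output without leaving Python; no column names are assumed.
import pandas as pd

df = pd.read_csv("t5-laeme-data.csv")
print(df.columns.tolist())
print(df.head())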

As for the api command, it lets you specify the host and the port to serve the API on. Other environment variables that can be specified in the .env file or exported in the local environment are given below, so feel free to tweak them to your liking.

MANX_API_HOST=localhost
MANX_API_PORT=8000
MANX_API_LOG_LEVEL=INFO
MANX_API_TEXT_PLACEHOLDER=YOUR PLACEHOLDER TEXT
MANX_MODEL_TYPE=byt5
MANX_MODEL_DIR=mdm-code/me-lemmatize-byt5-small
MANX_USE_GPU=False

You can serve the API locally with default parameters like so: manx api. The default model hosted on Huggingface and used under the hood will be pulled the moment the /v1/lemmatize API endpoint is called for the first time. You can change the path through environment variables to point to your own models stored locally or hosted on Huggingface.
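
Once the server is up, a call to the endpoint could look roughly like the sketch below. The shape of the JSON body is an assumption on my part, so check the Swagger docs mentioned in the next paragraph for the actual request schema.

# Hedged sketch of calling the lemmatization endpoint; the payload field name
# ("text") and the example string are assumptions -- consult /docs for the schema.
import requests

response = requests.post(
    "http://localhost:8000/v1/lemmatize",
    json={"text": "louerd crist"},
)
response.raise_for_status()
print(response.json())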

With fastapi, you get a Swagger browser GUI for free. Once the server is running, it can be accessed by default at http://localhost:8000/docs.

Running a container

You can serve the Manx API from inside a container with an engine of your choice. I'm using Podman, but Docker works just fine. To do that, build the image with this command invoked from the project root directory:

podman build -t manx:latest .

Then run it with -d to detach it so that it runs in the background.

podman run -p 8000:8000 -d manx:latest

Model training

To train the model, have a look at the Google Colab Jupyter notebook byT5-simpleT5-eME-lemmatization-train.ipynb. It lets you fine-tune the base model checkpoint right off the bat, but keep in mind that you'll need some compute units available for a better GPU option: the free T4 does not have enough memory to accommodate the model.

Since the notebook uses SimpleT5, the name of the fine-tuned model is generated from the number of epochs and the loss values on the training and test sets. Make sure you load it under the right name from the Colab local storage to evaluate its precision in terms of how many lemmas are predicted correctly.
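
For reference, loading such a checkpoint back and computing lemma accuracy could look roughly like the sketch below. The checkpoint directory name and the CSV column names are assumptions, so adjust them to whatever the notebook actually saved and to the split you held out.

# Hedged sketch: load a SimpleT5 checkpoint and measure lemma accuracy on a
# held-out split. The directory and column names below are placeholders.
import pandas as pd
from simplet5 import SimpleT5

model = SimpleT5()
model.load_model(
    model_type="byt5",
    model_dir="outputs/simplet5-epoch-2-train-loss-0.05-val-loss-0.07",  # placeholder name
    use_gpu=True,
)

val_df = pd.read_csv("validation.csv")  # placeholder held-out split
correct = sum(
    model.predict(source)[0] == target
    for source, target in zip(val_df["source_text"], val_df["target_text"])
)
print(f"Lemma accuracy: {correct / len(val_df):.3f}")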

Development

Pull the package the usual way with git and then install it for development purposes with python3 -m pip install -e .. To run tests, linters and type checkers, use make test. Have a look at the Makefile and .github/workflows to see what is already available and what is expected.

License

Copyright (c) 2023 Michał Adamczyk.

This project is licensed under the GPL-3 license. See LICENSE for more details.