Fine-tune an LLM with the help of nlScript

This repository demonstrates how to use nlScript to fine-tune an LLM for a custom language created via nlScript.

In nlScript, a language is defined by grammar rules. These rules can not only be used to parse user input, but also to generate sample user input scripts. Because languages defined in nlScript typically have a natural English syntax, these automatically generated scripts can be fed into one of today's language models to be rephrased as free-form text. This results in pairs of free-form text samples and corresponding syntactically correct user inputs, which are used to fine-tune an LLM.

In summary, the general idea of this repository is the following:

  • Automatically generate a number of syntactically correct input scripts.

  • Re-phrase these scripts, using one of today's LLMs, to obtain corresponding free-form text.

  • Use the resulting dataset to fine-tune an LLM that can be deployed with the software featuring the natural language interface.

Below, each of these steps will be described in detail, referring to nlScript's microscope language example here: https://github.com/nlScript/nlScript-microscope-language-java

Python environment

For the Python scripts, the following conda environment is used:

conda create -c defaults -n nlscript ipykernel python=3.10
conda activate nlscript

conda install nvidia/label/cuda-12.4.0::cuda
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"
pip install ipywidgets

1. Generate syntactically correct input scripts

Use LanguageControl.java to create random sample scripts. In particular, use LanguageControl.generateSamplesForThePaper() to create 400 sample scripts, each repeated 5 times (in the next step, these samples will be re-phrased, so we obtain 5 different variations of each generated script).

Note: sample generation can be adjusted either via GeneratorHints,

parser.setGeneratorHints(...);

to fine-tune generation, or via

rule.setGenerator(...);

to set a custom Generator.

Here is an example:

rule = parser.defineType("z-distance", "{z-distance:float} microns", ...);
parser.setGeneratorHints(
    rule,
    "z-distance",
    GeneratorHints.from(Key.MIN_VALUE, 0f, Key.MAX_VALUE, 50f, Key.DECIMAL_PLACES, 1));

The first line defines a custom type z-distance, which consists of a floating-point number and the literal 'microns'. The setGeneratorHints call states that in generated samples, the number should lie in the interval [0, 50] and should be formatted to have one decimal place.

There are plenty of examples in https://github.com/nlScript/nlScript-microscope-language-java/blob/main/src/main/java/nlScript/mic/LanguageControl.java.

The output file of the example is in JSON format (autogenerated-sentences-with-context.json).
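The exact JSON schema is defined by LanguageControl.java. As a purely hypothetical sketch (the field names `sentence` and `context` are assumptions, not taken from the repository), reading such a file in Python could look like:

```python
import json

# Hypothetical record; the real field names are defined by LanguageControl.java
# and may differ. In practice you would use:
#   samples = json.load(open("autogenerated-sentences-with-context.json"))
raw = '[{"sentence": "Define channel \'DAPI\'.", "context": "timelapse setup"}]'
samples = json.loads(raw)

for s in samples:
    print(s["sentence"])
```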

2. Re-phrase these sentences, using an existing LLM

Because languages defined in nlScript have a natural English syntax, the sample sentences created in step 1 are plain English sentences and can be translated into free-form text by re-phrasing them with an existing LLM. We tested several available models; of these, mistral-7B-instruct-v0.3 (https://huggingface.co/unsloth/mistral-7b-instruct-v0.3) offered the best compromise between speed and quality of answers. We provide the create-training-sentences.ipynb notebook for this task: it reads autogenerated-sentences-with-context.json, creates a rephrased version of each sample, and splits the resulting dataset into a training dataset (dataset-for-finetuning-sentences-train.json, 1600 samples) and a test dataset (dataset-for-finetuning-sentences-test.json, 400 samples).
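The 1600/400 division corresponds to a standard 80/20 split of the 2000 rephrased samples (400 scripts × 5 variations). A minimal sketch of such a split, with hypothetical field names (the notebook's actual record layout may differ):

```python
import random

# 2000 hypothetical (free-form, script) pairs; in the notebook these come
# from re-phrasing autogenerated-sentences-with-context.json with an LLM.
pairs = [{"input": f"free-form text {i}", "output": f"script {i}"} for i in range(2000)]

random.seed(0)  # reproducible split
random.shuffle(pairs)
train, test = pairs[:1600], pairs[1600:]

# The notebook then writes these to JSON files, e.g.:
#   json.dump(train, open("dataset-for-finetuning-sentences-train.json", "w"))
```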

3. Fine-tune an LLM to translate free-form text into syntactically correct input text

Use the training dataset obtained in step 2 to fine-tune an LLM, so that it learns the correct syntax from the free-form text samples. Of the models we tested, mistral-7b (the pre-trained base version, not an instruction-tuned one) worked best (https://huggingface.co/unsloth/mistral-7b-v0.3). The notebook fine-tune-sentences.ipynb performs the training, using the unsloth library (https://unsloth.ai). For the example at hand, we trained on the 1600 samples in the training dataset for a single epoch, which took ca. 1-2 hours. The notebook saves the fine-tuned adapters under results/Lora_Adapters.zip, which can then, e.g., be used in Ollama (https://ollama.com).
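The stop tokens and prompt shown in step 4 suggest an Alpaca-style template. A hedged sketch of how each training sample might be assembled (the authoritative template lives in fine-tune-sentences.ipynb; this version is an assumption based on the prompt used in step 4):

```python
# Hypothetical Alpaca-style training template; the exact wording is defined
# in fine-tune-sentences.ipynb and may differ.
TEMPLATE = """Below is the description of a microscope timelapse experiment. \
Transfer this description into valid English microscope-nlScript code.

### Instruction:
{instruction}

### Input:

### Response:
{response}"""

def to_training_text(free_form: str, script: str) -> str:
    """Combine a free-form description and its nlScript translation
    into one training string."""
    return TEMPLATE.format(instruction=free_form, response=script)

example = to_training_text("Add a channel 'DAPI'.", "Define channel 'DAPI':")
```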

4. Use the fine-tuned model in Ollama

The fine-tuned LoRA adapters obtained in step 3 can, e.g., be used in Ollama. To do this manually, first install Ollama. Then unzip Lora_Adapters.zip (see step 3) into a dedicated directory. In the same directory, create a file Modelfile with the following contents:

FROM mistral
ADAPTER ./sentence-adapters
TEMPLATE """
{{ .Prompt }}
"""
PARAMETER stop "</s>"
PARAMETER stop "<s>"
PARAMETER stop "<unk>"
PARAMETER stop "### Response:"
PARAMETER stop "### Instruction:"
PARAMETER stop "### Input:"

PARAMETER temperature 0.1
PARAMETER top_k 50
PARAMETER top_p 1.0
PARAMETER repeat_penalty 1.0
PARAMETER num_predict 4096

Then create the model in Ollama (from the command line) by

cd /path/to/folder-with-lora-adapters
ollama create mistral-mic -f Modelfile

(where mistral-mic is the name of the model within Ollama)

Now you can start querying ollama:

ollama run mistral-mic
>>>"""
... Below is the description of a microscope timelapse experiment. Transfer this description into valid English microscope-nlScript code.
...
... ### Instruction:
... Add a channel 'DAPI', set the excitation to 385nm at 30% with 10ms exposure time.
...
... ### Input:
...
... ### Response:
...
... """

Note the format of the prompt, which needs to resemble the prompt used for learning (see fine-tune-sentences.ipynb).

In my case, the model replied:

// The following sentence describes the configuration of a channel, i.e. the settings needed
// for exciting a specific fluorophore or for illuminating a brightfield image.
// The name of the channel is 'DAPI', the wavelength of the light source (led or laser) is 385nm,
// the intensity of the light source is 30%. The camera exposure time (illumination time) is 10ms.
Define channel 'DAPI':
  excite with 30% at 385nm
  use an exposure time of 10ms.

5. Use and deploy the fine-tuned model in custom software

There are different ways to use and deploy the fine-tuned model in one's own software. I'd recommend installing Ollama on the target machine and creating the model within Ollama, see step 4. Then query Ollama via its REST API (the docs are a little hidden; find them here: https://github.com/ollama/ollama/blob/main/docs/api.md).
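A minimal sketch of querying Ollama's documented /api/generate endpoint from Python, using only the standard library (the model name matches the one created in step 4, the prompt is a placeholder, and a running Ollama server on localhost:11434 is assumed):

```python
import json
import urllib.request

def query_ollama(prompt: str, model: str = "mistral-mic",
                 host: str = "http://localhost:11434") -> str:
    """Send a non-streaming generate request to a local Ollama server
    and return the generated text."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Building the request payload does not require a running server:
payload = {"model": "mistral-mic", "prompt": "...", "stream": False}
```

Remember that the prompt string passed to query_ollama must follow the format shown in step 4.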

For the example at hand, we ship the LoRA adapters (as a zip file, just as created in step 3) with the software, and install them automatically in Ollama on the client machine. So the only thing that needs to be done manually on the client computer is the installation of Ollama itself. The classes dealing with installing the model and communicating with Ollama are here:
