CLM Model Tuning

Note: This script was adapted from HuggingFace's Transformers/language-modeling code.

Language model tuning

This script fine-tunes Transformers library models such as GPT and GPT-2 for language modeling on a text dataset. Causal language models like these are trained or fine-tuned using a causal language modeling (CLM) loss.
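To make the CLM loss concrete, here is a minimal pure-Python sketch of next-token cross-entropy: the scores produced at position t are compared against the token at position t + 1. The function name `causal_lm_loss` and the toy inputs are illustrative assumptions; the script itself relies on the Transformers library for this computation.

```python
import math

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy (illustrative helper, not part of the script).

    logits[t] holds the unnormalised vocabulary scores produced after reading
    token_ids[:t + 1]. It is scored against token_ids[t + 1], so the final
    position has no target and is dropped.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a (context, target) pair")
    total = 0.0
    count = 0
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability assigned to the true next token.
        total += log_norm - scores[target]
        count += 1
    return total / count

# Toy example: a vocabulary of 3 tokens and a sequence of 3 tokens.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
loss = causal_lm_loss(uniform_logits, token_ids)
print(loss)  # about 1.0986, i.e. ln(3), the loss of a uniform guess over 3 tokens
```

Fine-tuning lowers this value on the tuning data by moving probability mass toward the tokens that actually follow each context.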

In theory, serving a tuned model can increase incentive and earnings on the bittensor network. However, this depends on many factors: the choice of model, the data used for tuning, and (to a lesser extent) the hyperparameters used for tuning. Tuning is not a silver bullet that guarantees higher earnings, but differences will be more pronounced once the Synapse update is released (time of writing: July 25, 2022).

The following examples run on Bittensor's IPFS mountain dataset, on datasets from HuggingFace's dataset hub, or on your own text files.

For a full list of models that will work with this script, refer to this link.

Installation & Requirements

This code assumes you already have bittensor installed on your machine, and is meant to be run entirely separately from it. Some basic Linux command-line knowledge is assumed, but this guide should provide a good starting point for navigating files and directories.

To start, clone this repository:

git clone https://github.com/opentensor/clm_model_tuning

All following commands assume you are working from this folder, i.e. you must cd into the directory created by the previous step.

Run pip install -r requirements.txt to install the additional packages required by this script.

On bittensor

By default, the script fine-tunes GPT2 on bittensor's mountain dataset. Running:

python finetune_using_clm.py

will tune gpt2 with bittensor's dataset and save the output to tuned-model.

To change the model you are tuning, e.g. to distilgpt2, run:

python finetune_using_clm.py model.name=distilgpt2

Some sample models to try are available under the server customization section of bittensor's documentation. A full list of models that can be trained by this script is available on HuggingFace.

On huggingface datasets

Any text dataset on HuggingFace should work by overriding the dataset.name and dataset.config_name parameters:

python finetune_using_clm.py dataset.name=wikitext dataset.config_name=wikitext-103-v1

On your own data

If you have a .txt file saved locally, you can override dataset.name as above:

python finetune_using_clm.py dataset.name=./path/to/your/data.txt

Note: your own data may consist of many short sentences, in which case each training example fills only a small part of the block size and performance may suffer. It is recommended that you pass the flag dataset.concatenate_raw=true to give the model more context when training. This will reduce the number of batches.
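The effect of concatenation can be sketched as follows: short tokenized lines are joined into one stream and cut into fixed-size blocks, so each training example carries more context and there are fewer examples overall. This is one common approach, shown under assumptions; the helper `group_into_blocks` is hypothetical, and the script may implement dataset.concatenate_raw differently (for example by joining the raw text before tokenizing).

```python
def group_into_blocks(tokenized_lines, block_size):
    """Concatenate tokenized lines and split them into fixed-size blocks.

    Hypothetical helper for illustration; not taken from finetune_using_clm.py.
    """
    # Flatten every line's token ids into one long stream.
    stream = [tok for line in tokenized_lines for tok in line]
    # Drop the tail that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Four short "sentences" (as token ids) become two full blocks of four tokens.
lines = [[1, 2], [3], [4, 5, 6], [7, 8]]
blocks = group_into_blocks(lines, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Without concatenation, the same data would yield four examples, each much shorter than the block size.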

Configuring training parameters

All configurable parameters are visible and documented in conf/config.yaml. The defaults are chosen for quick training and not tuned; you will need to experiment and adjust these.

Note: The parameters in conf/config.yaml are the only ones you can override with this script. That is, you may not pass flags you would normally use when running btcli (e.g. --neuron.device will not work). If there is a flag you wish to modify, feel free to submit a feature request.

To view the changeable parameters, open conf/config.yaml in whatever text editor you prefer, or use cat conf/config.yaml to view them.

You do not need to edit this file to change the parameters; they may be overridden when you call this script. For example, if you wish to change the model to distilgpt2 and the output directory to distilgpt-tuned, you would run:

python3 finetune_using_clm.py model.name=distilgpt2 output_dir=distilgpt-tuned

Note the nested structure in the config: since name is nested under model in conf/config.yaml, you must override model.name when invoking the command.
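The mapping from a dotted override to the nested config can be pictured with a short sketch. The dictionary below is a hypothetical two-key subset of conf/config.yaml (using the defaults mentioned earlier), and `apply_override` is an illustrative helper, not the mechanism the script actually uses to parse its arguments.

```python
import copy

def apply_override(config, override):
    """Return a copy of config with one 'dotted.key=value' override applied.

    Illustrative only: values are kept as strings, whereas a real config
    tool would typically also parse numbers and booleans.
    """
    dotted_key, sep, value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node[part]  # raises KeyError if the parent section is missing
    if leaf not in node:
        raise KeyError(f"unknown parameter: {dotted_key}")
    node[leaf] = value
    return result

# Hypothetical subset of conf/config.yaml, represented as a nested dict.
config = {"model": {"name": "gpt2"}, "output_dir": "tuned-model"}
for item in ["model.name=distilgpt2", "output_dir=distilgpt-tuned"]:
    config = apply_override(config, item)
print(config)  # {'model': {'name': 'distilgpt2'}, 'output_dir': 'distilgpt-tuned'}
```

A bare name=distilgpt2 would fail here because name only exists inside the model section, which is why the full dotted path is required.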

Serving custom models on bittensor

To serve your tuned model on bittensor, just override neuron.model_name with the path to your tuned model:

btcli run ..... --neuron.model_name=/home/{YOUR_USERNAME}/clm_model_tuning/tuned-model
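Before pointing neuron.model_name at a directory, it can help to check that the directory looks like a saved checkpoint. The sketch below assumes the script writes its output in the usual Transformers layout (a config.json plus a single weights file); that layout, the file names in `WEIGHT_FILES`, and the helper `missing_checkpoint_files` are assumptions, not something this README confirms. Sharded checkpoints, which split weights across several files, are not handled.

```python
from pathlib import Path

# Assumption: weights are stored under one of these conventional file names.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")

def missing_checkpoint_files(model_dir):
    """List what model_dir lacks to look like a Transformers-style checkpoint."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing

# An empty list means the directory passed this basic check.
print(missing_checkpoint_files("tuned-model"))
```

This only checks that files exist; it does not verify that the weights load or that the model is compatible with bittensor.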

Limitations & Warnings

Early stopping is not yet supported. Many features are implemented but not thoroughly tested. If you encounter an issue, reach out on Discord or (preferably) create an issue on this GitHub page.
