ftutils

Minimalist toolkit for managing conversational datasets & fine-tuning LLMs.

Installation

pip install git+https://github.com/louislva/ftutils.git

Why use ftutils?

Creating language datasets is time-consuming! Organizing them is hard, especially if you try to build bespoke tooling for it.

So why not just use VSCode? With VSCode, you get to borrow:

  • File system: the file system is a really easy way to organize things into hierarchies
  • Text editor: edit files, have multiple tabs open, etc.
  • Version control: keep track of your dataset over time & don't lose your changes; make branches if you want to test different ways of labeling
  • GitHub Copilot: speed up your writing - it even works well for natural language files!

ftutils provides utilities for converting datasets to and from an easy-to-edit .txt format.

Documentation

Reading and writing Conversations

The core of ftutils is the Conversation class. It's basically a list of messages, but you can save it to a file, or send it to the OpenAI API.

Here's how to make one with code:

from ftutils.conversation import Conversation, Message, Dataset # the core of the library
from ftutils.utils import str_to_filename, random_hex # utils for naming files
from ftutils.openai import estimate_tokens, openai_start_finetune, openai_create_dataset # helpers related to fine-tuning with the OpenAI API

conv = Conversation.from_json([
    {"role": "system", "content": "You are ChatGPT, a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?"}
])

If you then want to save it, you can:

conv.to_file("cool-convo.txt")

Then you can open cool-convo.txt and see:



system: You are ChatGPT, a helpful assistant.

user: Hello, how are you?

assistant: I'm doing well, thank you! How can I assist you today?

Let's say we're building a dataset for fine-tuning ChatGPT. We might want to make some changes to our conversation in cool-convo.txt:



system: You are ChatGPT, a helpful assistant. You speak in uppercase.

user: Hello, how are you?

assistant: I'M DOING WELL, THANK YOU! HOW CAN I ASSIST YOU TODAY?

And when we load it, in another Python session, we can see:

from ftutils.conversation import Conversation # needed again, since this is a fresh session

conv = Conversation.from_file("cool-convo.txt")
print(conv.to_json())
# [
#     {'role': 'system', 'content': 'You are ChatGPT, a helpful assistant. You speak in uppercase.'},
#     {'role': 'user', 'content': 'Hello, how are you?'},
#     {'role': 'assistant', 'content': "I'M DOING WELL, THANK YOU! HOW CAN I ASSIST YOU TODAY?"}
# ]

print(conv.to_text())
#
#
# system: You are ChatGPT, a helpful assistant. You speak in uppercase.
#
# user: Hello, how are you?
#
# assistant: I'M DOING WELL, THANK YOU! HOW CAN I ASSIST YOU TODAY?
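To make the .txt layout above concrete, here is a minimal standalone sketch of how such a file could be parsed and rendered. It is not ftutils' actual implementation: the function names `parse_conversation_text` and `render_conversation_text` are hypothetical, and the rules it assumes (a message starts on a line beginning with `system:`, `user:` or `assistant:`, the file opens with two blank lines, and messages are separated by one blank line) are inferred only from the examples shown here.

```python
import re

# Assumption: a message starts on a line of the form "<role>: <text>", where
# role is system, user or assistant. Any other line continues the previous message.
ROLE_LINE = re.compile(r"^(system|user|assistant): ?(.*)$")


def parse_conversation_text(text):
    """Parse the .txt layout into a list of {"role", "content"} dicts."""
    messages = []
    for line in text.splitlines():
        match = ROLE_LINE.match(line)
        if match:
            messages.append({"role": match.group(1), "content": match.group(2)})
        elif messages:
            # Continuation of a multi-line message.
            messages[-1]["content"] += "\n" + line
        # Lines before the first role marker (the leading blank lines) are skipped.
    for message in messages:
        # Drop the blank separator lines that were collected as continuations.
        message["content"] = message["content"].strip()
    return messages


def render_conversation_text(messages):
    """Render messages back into the .txt layout (two leading blank lines)."""
    return "\n\n" + "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)


sample = "\n\nsystem: You speak in uppercase.\n\nuser: Hello, how are you?\n\nassistant: I'M DOING WELL!"
print(parse_conversation_text(sample))
```

One limitation of this sketch: a content line that itself begins with a role prefix such as `user: ` would be read as a new message, so a real parser needs some rule for that case.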

Building a dataset for fine-tuning

Okay, let's say we've made a folder (datasets/case-sensitive-assistant) full of .txt files in the ftutils format. We now want to bundle them into a .jsonl dataset, which we can then send to OpenAI via the web interface.

We can do this with the Dataset class:

dataset = Dataset.from_dir("datasets/case-sensitive-assistant")

And then we can save it to a .jsonl file:

dataset.to_file("case-sensitive-assistant.jsonl")
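For reference, OpenAI's chat fine-tuning format expects one JSON object per line, each with a `messages` list. The sketch below writes that shape using only the standard library. It illustrates the target format and is not a description of what `Dataset.to_file` does internally; `write_chat_jsonl` is a hypothetical helper name.

```python
import json
import os
import tempfile


def write_chat_jsonl(conversations, path):
    """Write one {"messages": [...]} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for messages in conversations:
            f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")


# Each conversation is a list of role/content dicts, as returned by conv.to_json().
conversations = [
    [
        {"role": "system", "content": "You are ChatGPT, a helpful assistant. You speak in uppercase."},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'M DOING WELL, THANK YOU! HOW CAN I ASSIST YOU TODAY?"},
    ],
]

# Written to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    write_chat_jsonl(conversations, path)
    with open(path, encoding="utf-8") as f:
        print(f.read())
```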

Now you can just drag and drop this file into the web interface!
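Before uploading, it can be worth sanity-checking the file. The following is a minimal sketch of such a check, assuming the one-object-per-line `{"messages": [...]}` layout and only the three roles used in this README. The helper `check_chat_jsonl` and its rules are illustrative assumptions, not part of ftutils and not OpenAI's own validation.

```python
import json

# Assumption: only the roles that appear in this README are expected.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_record(record):
    """Return a description of the first problem in one parsed line, or None."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return "missing or empty 'messages' list"
    for message in messages:
        if not isinstance(message, dict):
            return "message is not an object"
        if message.get("role") not in ALLOWED_ROLES:
            return f"unexpected role {message.get('role')!r}"
        if not isinstance(message.get("content"), str):
            return "content is not a string"
    if not any(m["role"] == "assistant" for m in messages):
        return "no assistant message to learn from"
    return None


def check_chat_jsonl(path):
    """Return a list of human-readable problems found in a chat .jsonl file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            try:
                problem = check_record(json.loads(line))
            except json.JSONDecodeError as error:
                problem = f"invalid JSON ({error.msg})"
            if problem:
                problems.append(f"line {number}: {problem}")
    return problems
```

Calling `check_chat_jsonl("case-sensitive-assistant.jsonl")` returns an empty list when nothing looks off under these assumed rules.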

Using the OpenAI API with Conversation

The Conversation class works with both the .txt format and the JSON format (a list of role/content objects). The JSON format is the same messages structure you pass to the OpenAI Python API.

This means you can generate a new message with OpenAI and add it to the conversation, for example:

from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=conv.to_json()
)
conv.messages.append(completion.choices[0].message)
print(conv.to_text())
#
#
# system: You are ChatGPT, a helpful assistant. You speak in uppercase.
#
# user: Hello, how are you?
#
# assistant: I'M DOING WELL, THANK YOU! HOW CAN I ASSIST YOU TODAY?
# 
# assistant: WHAT CAN I DO FOR YOU?

Notice the extra message from the assistant at the end, generated by GPT-3.5.

We could save it again, if we like it:

conv.to_file("cool-convo-with-extra-message.txt")
