# Curate fine-tuning data with Lilac

Lilac is an open-source product that helps you analyze, structure, and clean unstructured data with AI. You can use it to enrich datasets of LangChain runs to create better fine-tuning datasets.

In this walkthrough, we will use Lilac on a dataset of LangSmith runs to check for PII and remove approximate duplicates.

The basic workflow is as follows:

- Create a LangSmith dataset of run data.
- Load the LangSmith dataset into Lilac.
- Filter and curate the dataset using signals and concepts.
- Export the dataset for fine-tuning.

We will explain each of these steps in more detail below, but first, install some prerequisite packages.

## Setup

In addition to Lilac and LangSmith, this walkthrough requires a couple of additional packages.

In [None]:
%pip install -U "lilac[pii]" langdetect openai langchain --quiet

In [1]:
import uuid
import os

# os.environ["LANGCHAIN_API_KEY"] = "<YOUR-API-KEY>"
unique_id = uuid.uuid4().hex[:8]

## 1. Create LangSmith dataset

We've included an example dataset in this repository that you can use to complete this walkthrough.

This dataset was made by querying prompt and LLM runs from an example deployment of [chat langchain](https://github.com/langchain-ai/chat-langchain). 

For more information on how to query runs in LangSmith, check out the [docs](https://docs.smith.langchain.com/tracing/use-cases/export-runs/local) or explore some of the other recipes in this cookbook.

In [2]:
from langsmith import Client

client = Client()
dataset_name = f"langsmith-prompt-runs-{unique_id}"
ds = client.create_dataset(dataset_name)

In [3]:
import json
from concurrent.futures import ThreadPoolExecutor

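def create_example(line: str):
    # Each line of the JSONL file holds one run with "inputs" and "outputs"
    d = json.loads(line)
    client.create_example(inputs=d['inputs'], outputs=d['outputs'], dataset_id=ds.id)

with open('rag.jsonl', 'r', encoding='utf-8') as f:
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Consume the iterator so that any upload error is raised here
        list(executor.map(create_example, f))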

The LangSmith dataset now contains the example runs. Lilac works best on flat dataset structures, so we will flatten (and stringify) some of the attributes.
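
To make "flatten and stringify" concrete, here is a minimal sketch of the idea. The `flatten_example` helper and the sample record are hypothetical and exist only for illustration; this is not how Lilac or LangSmith implement the step.

```python
import json


def flatten_example(inputs: dict, outputs: dict) -> dict:
    """Merge inputs and outputs into one flat row of JSON strings."""
    row = {}
    for key, value in {**inputs, **outputs}.items():
        # Serialize every value so nested lists/dicts become plain strings.
        row[key] = json.dumps(value)
    return row


# Hypothetical record, shaped like the rows shown later in this walkthrough.
example_row = flatten_example(
    inputs={"question": "What is python", "chat_history": [{"content": "alo"}]},
    outputs={"output": "Python is a high-level programming language."},
)
print(example_row["question"])  # a JSON string, including the quotes
print(json.loads(example_row["chat_history"]))  # parsed back into a list of dicts
```

Because every field ends up as a string, nested fields such as `chat_history` need a `json.loads` call before they can be used as lists again, which is what the fine-tuning step later in this walkthrough does.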

## 2. Import into Lilac

Next, we can import the LangSmith dataset into Lilac. Use the dataset name you created above
and run the code below. Once you've run the code, you can view the results in Lilac's UI.

In [4]:
from IPython.display import display
import lilac as ll

In [None]:
ll.set_project_dir('./langsmith-finetune')

data_source = ll.sources.langsmith.LangSmithSource(
    dataset_name=dataset_name,
)

config = ll.DatasetConfig(
  namespace='local',
  name=dataset_name,
  source=data_source,
)

dataset = ll.create_dataset(config)
ll.start_server()
# await ll.stop_server()

## 3. Enrich Dataset

Now that we have our dataset in Lilac, we can run Lilac's signals, concepts, and labels to help organize and filter the dataset. Our goal is to select distinct examples demonstrating good language model generations for a variety of input types. You can explore and annotate the dataset in the app by navigating to the URL printed out by the local server above. We'd encourage you to try out Lilac's off-the-shelf "concepts" or to train your own.

For the sake of this walkthrough, we will focus on using the Python API. You can follow along with the code below.

### Applying signals

Signals in Lilac refer to any function that is applied over a field. We will use a couple off-the-shelf "signals" to perform the following:

- PII detection: we don't want to leak private data
- Near duplicate detection: we want each training example to be informative

These are useful for filtering bad examples from our dataset before fine-tuning a model.
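
To give some intuition for near-duplicate detection, the sketch below estimates the similarity of two texts with MinHash over word n-grams. It is a simplified illustration and not Lilac's implementation: the function names, the choice of 3-word shingles, and the 64 seeded SHA-1 hashes are all assumptions, and the LSH banding step that makes the method scale is left out.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in `text` (case-insensitive)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """Keep the minimum hash value of the shingles for each seeded hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams
        ))
    return signature


def estimated_similarity(a: str, b: str) -> float:
    """The fraction of matching signature slots approximates Jaccard similarity."""
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


same = estimated_similarity("how do I load a pdf file", "How do I load a PDF file")
different = estimated_similarity("how do I load a pdf file", "what is the answer to the universe")
print(same, different)
```

Texts whose estimated similarity exceeds a chosen threshold would be grouped into the same cluster, and only one example per cluster needs to be kept.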

In [6]:
dataset.compute_signal(ll.PIISignal(), 'question')
dataset.compute_signal(ll.PIISignal(), 'output')

# Apply min-hash LSH (https://en.wikipedia.org/wiki/MinHash) to detect approximate n-gram duplicates
dataset.compute_signal(ll.NearDuplicateSignal(), 'question')
dataset.compute_signal(ll.NearDuplicateSignal(), 'output')

Computing signal "pii" on local/langsmith-prompt-runs-e8ae0676:('question',) took 0.467s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/question/pii

Computing signal "pii" on local/langsmith-prompt-runs-e8ae0676:('output',) took 0.883s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/output/pii

Computing signal "near_dup" on local/langsmith-prompt-runs-e8ae0676:('question',) took 0.066s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/question/near_dup

Computing signal "near_dup" on local/langsmith-prompt-runs-e8ae0676:('output',) took 0.132s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/output/near_dup

### Adding labels

Labeling is best done in-app, but you can also programmatically [label rows using the Python SDK](https://lilacml.com/datasets/dataset_labels.html). Below is an example that detects the language of each row and then labels all rows not tagged as English as `not_english`.

In [7]:
dataset.compute_signal(ll.LangDetectionSignal(), 'question')
dataset.compute_signal(ll.LangDetectionSignal(), 'output')

Computing signal "lang_detection" on local/langsmith-prompt-runs-e8ae0676:('question',) took 0.460s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/question/lang_detection

Computing signal "lang_detection" on local/langsmith-prompt-runs-e8ae0676:('output',) took 0.741s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-e8ae0676/output/lang_detection

In [9]:
# You can check the current schema by running the following. Select the fields you want to export.
# dataset.manifest()

In [12]:
dataset.add_labels(
  'not_english',
  filters=[
    (('question', 'lang_detection'), 'not_equal', 'en')
  ]
)

Lilac has many more powerful capabilities, such as custom concepts and signals, that you can apply. Check out the Lilac [docs](https://lilacml.com/blog/introducing-lilac.html) for more info, and see our [exploratory data analysis](../../exploratory-data-analysis/lilac/lilac.ipynb) notebook for an introduction to using them with LangSmith datasets.

## 4. Prepare the enriched dataset

Now let's prepare the dataset for fine-tuning. We will fetch the deduplicated rows and filter out any rows that may contain PII.

In [19]:
df = dataset.to_pandas([
    'question', 
    'chat_history',
    'context',
    'output', 
    'question.pii',
    'question.near_dup',
    'user_score',
    'not_english'])

print(f"Original length: {len(df)}")

# Flatten the dataframe
df['cluster_id'] = df['question.near_dup'].apply(lambda x: x['cluster_id'])
# A row contains PII if any PII category has at least one detected span
df['contains_pii'] = df['question.pii'].apply(lambda x: bool([v for spans in x.values() for v in spans]))
df['not_english'] = df['not_english'].apply(lambda x: x is not None and x.get('label') == 'true')
# Drop original dotted columns
df.drop(columns=['question.near_dup', 'question.pii'], inplace=True)
# Keep only rows without PII, whose user_score is not 0.0, and that have an output
df = df[(~df['contains_pii']) & (df['user_score'] != '0.0') & (~df['output'].isna())]
# Keep only the first row from each near-duplicate cluster
df = df.drop_duplicates(subset='cluster_id', keep='first')
print(f"Filtered length: {len(df)}")

Original length: 369
Filtered length: 312
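
The filtering logic above can be hard to read in its pandas form, so here is a rough pure-Python sketch of the same two steps on a few hand-written rows. The rows and the PII category names (`emails`, `ip_addresses`) are made up for illustration; the real PII signal output may use different keys.

```python
def contains_pii(pii_result: dict) -> bool:
    """True if any PII category holds at least one detected span."""
    return any(len(spans) > 0 for spans in pii_result.values())


def keep_first_per_cluster(rows: list) -> list:
    """Keep only the first row seen for each near-duplicate cluster."""
    seen = set()
    kept = []
    for row in rows:
        if row["cluster_id"] not in seen:
            seen.add(row["cluster_id"])
            kept.append(row)
    return kept


# Hypothetical rows; the PII category names are assumptions.
rows = [
    {"question": "What is python", "cluster_id": 0, "pii": {"emails": [], "ip_addresses": []}},
    {"question": "what is python?", "cluster_id": 0, "pii": {"emails": [], "ip_addresses": []}},
    {"question": "Mail me at a@b.co", "cluster_id": 1, "pii": {"emails": ["a@b.co"], "ip_addresses": []}},
]
clean = [r for r in rows if not contains_pii(r["pii"])]
curated = keep_first_per_cluster(clean)
print([r["question"] for r in curated])
```

As in the cell above, rows with PII are removed first, and deduplication then keeps one representative per cluster from what remains.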


In [20]:
df

Unnamed: 0,question,chat_history,context,output,user_score,not_english,cluster_id,contains_pii
0,"""index = create_index(PERSIST=False) # def c...","[{""content"": ""what StuffDocumentsChain does ? ...","""<doc id='0'># Tic-Tac-Toe\n \n...","""To fix the code and introduce the suggested c...",,False,0,False
1,"""ValidationError: 1 validation error for AIMes...","[{""content"": ""can you add something to this co...","""<doc id='0'>custom_message_converter=CustomMe...","""AIMessage | \ud83e\udd9c\ufe0f\ud83d\udd17 La...",,False,1,False
2,"""What is python""","[{""content"": ""alo"", ""example"": false, ""additio...","""<doc id='0'>Python | \ud83e\udd9c\ufe0f\ud83d...","""Python is a high-level programming language t...",,True,2,False
4,""". Use 'basic_auth' or 'bearer_auth' parameter...","[{""content"": ""I am new to python programming. ...","""<doc id='0'>with search_distanceConversationa...","""I apologize for the confusion. It seems there...",,False,4,False
5,"""test""",[],"[{""metadata"": {""source"": ""https://smith.langch...","{""content"": ""Hello! How can I assist you with ...",,True,5,False
...,...,...,...,...,...,...,...,...
363,"""so in order to work with pdf and chatgpt that...","[{""content"": ""i need to build a pdf database w...","""<doc id='0'>With the EmbeddingsRedundantFilte...","""Embeddings | \ud83e\udd9c\ufe0f\ud83d\udd17 L...",,False,363,False
364,"""what is the answer to the universe?""","[{""content"": ""hello"", ""example"": false, ""addit...","[{""metadata"": {""source"": ""https://python.langc...","""The answer to the universe is often humorousl...",,True,364,False
365,"""what if sqlalchemy is being used in SQLDataba...","[{""content"": ""File \""/usr/local/lib/python3.11...","""<doc id='0'>Querying a SQL DB | \ud83e\udd9c\...","""Hmm, I'm not sure about the specific implemen...",,False,365,False
366,"""can tools work with LLM Chain""",[],"""<doc id='0'>hardware and scaling independentl...","""Yes, tools can work with LLM (Large Language ...",,True,366,False


## 5. Fine-tune

With the dataset filtered, we can now convert it to a format compatible with fine-tuning.
We will use OpenAI's fine-tuning endpoint for this, but you could also apply similar logic to fine-tune a Llama, T5, or other model.

In [21]:
def create_messages(row):
    chat_history = json.loads(row.chat_history) or []
    # The history is assumed to alternate between user and assistant turns,
    # starting with the user (as in the rows shown above)
    roles = ("user", "assistant")
    messages = [{"role": "system", "content": "Helpfully answer the questions about LangChain."}]
    for i, msg in enumerate(chat_history):
        messages.append(
            {"role": roles[i % 2], "content": str(msg["content"])}
        )
    messages.append({"role": "user", "content": row.question})
    messages.append({"role": "assistant", "content": row.output})
    return messages

messages = df.apply(create_messages, axis=1).tolist()

Now you can fine-tune the model! This will take a while (20+ minutes), so we'd encourage you to further explore your local Lilac dataset
while you wait.
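
Before uploading, it can be worth sanity-checking each conversation. The sketch below is one possible check; `validate_conversation` is a hypothetical helper, and the rules it applies (three allowed roles, string content, assistant speaks last) are assumptions chosen for illustration rather than OpenAI's official validation.

```python
import json


def validate_conversation(messages: list) -> list:
    """Return a list of problems found in one chat-format training example."""
    problems = []
    allowed_roles = {"system", "user", "assistant"}
    for i, message in enumerate(messages):
        if message.get("role") not in allowed_roles:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    if not messages or messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


# Hypothetical example in the same shape that create_messages returns.
sample = [
    {"role": "system", "content": "Helpfully answer the questions about LangChain."},
    {"role": "user", "content": "What is python"},
    {"role": "assistant", "content": "Python is a high-level programming language."},
]
print(validate_conversation(sample))
# Each training example becomes one JSON line with a "messages" key.
print(json.dumps({"messages": sample}))
```

Running a check like this over the `messages` list and skipping any example that reports problems avoids waiting for a file upload only to have it rejected.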

In [None]:
import json
from io import BytesIO
import time

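# Note: this cell uses the pre-1.0 `openai` SDK interface
# (openai.File and openai.FineTuningJob).
import openai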

# We will write the jsonl file in memory
my_file = BytesIO()
for m in messages:
    my_file.write((json.dumps({"messages": m}) + "\n").encode('utf-8'))

my_file.seek(0)
training_file = openai.File.create(
  file=my_file,
  purpose='fine-tune'
)

# OpenAI audits each training file for compliance reasons.
# This may take a few minutes
status = openai.File.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.File.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")

job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

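job = openai.FineTuningJob.retrieve(job.id)
status = job.status
start_time = time.time()
while status != "succeeded":
    # Stop instead of polling forever if the job ends unsuccessfully
    if status in ("failed", "cancelled"):
        raise RuntimeError(f"Fine-tuning job {job.id} ended with status {status}")
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.FineTuningJob.retrieve(job.id)
    status = job.status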

Status=[uploaded]... 0.00s
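
The two polling loops above follow the same pattern, which could be factored into a small helper with a timeout. This is a sketch only: `poll_until` and its parameters are hypothetical names, and the status strings in the demo are stand-ins fed from a fixed list rather than real API responses.

```python
import time
from typing import Callable, Iterable


def poll_until(
    get_status: Callable[[], str],
    done: Iterable[str],
    failed: Iterable[str] = (),
    interval: float = 5.0,
    timeout: float = 3600.0,
) -> str:
    """Call get_status until it returns a terminal state or the timeout passes."""
    done, failed = set(done), set(failed)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"Stopped with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status lookup: returns a fixed sequence of statuses.
fake_statuses = iter(["validating_files", "running", "succeeded"])
result = poll_until(
    lambda: next(fake_statuses),
    done={"succeeded"},
    failed={"failed", "cancelled"},
    interval=0.0,
)
print(result)
```

In the cell above, `get_status` could be something like `lambda: openai.FineTuningJob.retrieve(job.id).status`, with the same helper reused for the file-processing wait.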

### Use the fine-tuned model

With the model fine-tuning complete, you can load the fine-tuned model directly in LangChain!

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Helpfully answer the questions about LangChain."),
        ("user", "{input}")
    ]
)
chain = prompt | model
chain.invoke({"input": "What's LangChain Expression Language?"})

## Conclusion

LangSmith makes it easy to collect the unstructured data seen by your production LLM application. Lilac makes it easier to filter and analyze that data with signals such as PII detection and near-duplicate detection.

In this tutorial, you created a dataset of run traces, removed near-duplicates and rows that may contain PII, and then used the filtered dataset to fine-tune a new model.