# Curate fine-tuning data with Lilac

Lilac is an open-source product that helps you analyze, structure, and clean unstructured data with AI. You can use it to enrich datasets of LangChain runs to create better fine-tuning datasets.

In this walkthrough, we will use Lilac on a dataset of LangSmith runs to check for PII and remove approximate duplicates.

The basic workflow is as follows:

- Create a LangSmith dataset of runs data.
- Load LangSmith dataset into Lilac.
- Filter and curate dataset using signals and concepts.
- Export the dataset for fine-tuning.

We will explain each of these steps in more detail below, but first, install some prerequisite packages.

## Setup

In addition to Lilac and LangSmith, this walkthrough requires a couple of additional packages.

In [None]:
%pip install -U "lilac[pii]" langdetect openai langchain --quiet

In [None]:
import uuid
import os

os.environ["LANGCHAIN_API_KEY"] = "<YOUR-API-KEY>"
unique_id = uuid.uuid4().hex[:8]

## 1. Create LangSmith dataset

We've included an example dataset in this repository that you can use to complete this walkthrough.

This dataset was made by querying prompt and LLM runs from an example deployment of [chat langchain](https://github.com/langchain-ai/chat-langchain). 

For more information on how to query runs in LangSmith, check out the [docs](https://docs.smith.langchain.com/tracing/use-cases/export-runs/local) or explore some of the other recipes in this cookbook.

In [2]:
from langsmith import Client

client = Client()
dataset_name = f"langsmith-prompt-runs-{unique_id}"
ds = client.create_dataset(dataset_name)

In [3]:
import json
from concurrent.futures import ThreadPoolExecutor

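def create_example(line: str):
    # Each line of the JSONL file holds one run with 'inputs' and 'outputs'
    d = json.loads(line)
    client.create_example(inputs=d['inputs'], outputs=d['outputs'], dataset_id=ds.id)

with open('rag.jsonl', 'r', encoding='utf-8') as f:
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Consume the results so that any exception raised in a worker is surfaced
        list(executor.map(create_example, f))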
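One detail of the upload pattern above is worth illustrating: `executor.map` returns its results lazily, so an exception raised inside a worker only appears when the results are iterated. The sketch below shows this with a hypothetical `parse_line` helper and two made-up lines. It does not call LangSmith.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_line(line: str) -> dict:
    """Hypothetical stand-in for create_example: parse one JSONL line."""
    record = json.loads(line)
    return {"inputs": record["inputs"], "outputs": record["outputs"]}


# Made-up sample lines; the second one is deliberately malformed.
lines = [
    '{"inputs": {"question": "hi"}, "outputs": {"output": "hello"}}',
    "not json",
]

with ThreadPoolExecutor(max_workers=2) as executor:
    # No exception is raised here, even though one worker failed.
    lazy_results = executor.map(parse_line, lines)

parsed, errors = [], []
try:
    # The failure only surfaces once the results are consumed.
    for item in lazy_results:
        parsed.append(item)
except json.JSONDecodeError as err:
    errors.append(err)

print(f"parsed={len(parsed)} errors={len(errors)}")
```

If the results are never iterated, a malformed line would be skipped silently, which is why consuming them is a reasonable habit for bulk uploads.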

The LangSmith dataset is now populated with the example runs. Lilac works best on flat dataset structures, so we will flatten (and stringify) some of the attributes.
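To make "flatten (and stringify)" concrete, here is a minimal sketch of the idea. It is not Lilac's LangSmith loader: the `flatten_record` helper, the dotted key naming, and the sample record are all assumptions for illustration, and the column names Lilac actually produces may differ.

```python
import json
from typing import Any


def flatten_record(record: dict[str, Any], sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys and stringify lists as JSON."""
    flat: dict[str, Any] = {}

    def _walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                _walk(f"{prefix}{sep}{key}" if prefix else key, child)
        elif isinstance(value, list):
            # Lists (for example a chat history) become one JSON string column.
            flat[prefix] = json.dumps(value, ensure_ascii=False)
        else:
            flat[prefix] = value

    _walk("", record)
    return flat


# Hypothetical nested example, shaped like the inputs/outputs of a run.
example = {
    "inputs": {"question": "What is LCEL?", "chat_history": [{"content": "hi"}]},
    "outputs": {"output": "A declarative way to compose chains."},
}
flat_example = flatten_record(example)
print(flat_example)
```

A stringified list can be recovered later with `json.loads`, which is how the chat history is read back in the fine-tuning step of this walkthrough.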

## 2. Import into Lilac

Next, we can import the LangSmith dataset into Lilac. Select the dataset name you created above
and run the code below. Once you've run the code, you can view the results in Lilac's UI.

In [4]:
from IPython.display import display
import lilac as ll

In [6]:
ll.set_project_dir('./langsmith-finetune')

data_source = ll.sources.langsmith.LangSmithSource(
    dataset_name=dataset_name,
)

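config = ll.DatasetConfig(
    namespace='local',
    name=dataset_name + uuid.uuid4().hex[:4],
    source=data_source,
)

try:
    dataset = ll.create_dataset(config)
except Exception:
    # Fall back to loading the dataset by namespace and name if creation fails
    dataset = ll.get_dataset(config.namespace, config.name)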

ll.start_server()
# await ll.stop_server()

Reading from source langsmith...: 100%|█████████████████████████████████████| 369/369 [00:00<00:00, 87209.00it/s]

Dataset "langsmith-prompt-runs-c8725493ede6" written to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-c8725493ede6





## 3. Enrich dataset

You can explore and annotate the dataset in the app by navigating to the URL printed out by the local server above. I'd encourage you to try out their off-the-shelf "concepts" or try training your own.

For the sake of this walkthrough, we will focus on using the Python API. You can follow along with the code below.

#### Applying 'signals'

Signals in Lilac refer to any function that is applied over a field. We will use a couple off-the-shelf "signals" to perform the following:

- PII detection: we don't want to leak private data
- Near duplicate detection: we want each training example to be informative

These are useful for filtering bad examples from our dataset before fine-tuning a model.
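To give some intuition for near-duplicate detection, the sketch below estimates the Jaccard similarity of word n-grams with MinHash signatures. It is a toy illustration, not the implementation behind Lilac's `NearDuplicateSignal`: the function names, the 3-word shingles, and the 64 hash functions are assumptions, and the LSH banding step used to scale to large datasets is omitted.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value under each of `num_perm` salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature(shingles("how do I load a pdf file in langchain"))
sig_b = minhash_signature(shingles("how do I load a pdf file in langchain please"))
sig_c = minhash_signature(shingles("completely unrelated text about cooking pasta tonight"))

print("near duplicate:", estimated_jaccard(sig_a, sig_b))
print("unrelated:", estimated_jaccard(sig_a, sig_c))
```

Questions whose estimated similarity exceeds a chosen threshold can be grouped into one cluster, and keeping a single row per cluster removes the redundant examples.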

In [7]:
dataset.compute_signal(ll.PIISignal(), 'question')
dataset.compute_signal(ll.PIISignal(), 'output')

# Apply min-hash LSH (https://en.wikipedia.org/wiki/MinHash) to detect approximate n-gram duplicates
dataset.compute_signal(ll.NearDuplicateSignal(), 'question')
dataset.compute_signal(ll.NearDuplicateSignal(), 'output')

Computing pii on local/langsmith-prompt-runs-c8725493ede6:('question',): 100%|█| 369/369 [00:00<00:00, 821.90it/s


Computing signal "pii" on local/langsmith-prompt-runs-c8725493ede6:('question',) took 0.451s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-c8725493ede6/question/pii


Computing pii on local/langsmith-prompt-runs-c8725493ede6:('output',): 100%|██| 369/369 [00:00<00:00, 387.43it/s]


Computing signal "pii" on local/langsmith-prompt-runs-c8725493ede6:('output',) took 0.954s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-c8725493ede6/output/pii


Computing near_dup on local/langsmith-prompt-runs-c8725493ede6:('question',):   0%|      | 0/369 [00:00<?, ?it/s]
Fingerprinting...: 369it [00:00, 15314.34it/s]

Computing hash collisions...: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 892.79it/s]

Clustering...: 100%|██████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26533.31it/s]
Computing near_dup on local/langsmith-prompt-runs-c8725493ede6:('question',): 100%|█| 369/369 [00:00<00:00, 5611.


Computing signal "near_dup" on local/langsmith-prompt-runs-c8725493ede6:('question',) took 0.067s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-c8725493ede6/question/near_dup


Computing near_dup on local/langsmith-prompt-runs-c8725493ede6:('output',):   0%|        | 0/369 [00:00<?, ?it/s]
Fingerprinting...: 361it [00:00, 4263.73it/s]

Computing hash collisions...: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 940.64it/s]

Clustering...: 100%|██████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 40240.55it/s]
Computing near_dup on local/langsmith-prompt-runs-c8725493ede6:('output',): 100%|█| 369/369 [00:00<00:00, 2926.11

Computing signal "near_dup" on local/langsmith-prompt-runs-c8725493ede6:('output',) took 0.127s.
Wrote signal output to ./langsmith-finetune/datasets/local/langsmith-prompt-runs-c8725493ede6/output/near_dup





Lilac has many more powerful capabilities, such as custom concepts and signals, that you can apply. Check out their [docs](https://lilacml.com/blog/introducing-lilac.html) for more info, and see our [exploratory data analysis](../../exploratory-data-analysis/lilac/lilac.ipynb) notebook for an introduction to using them with LangSmith datasets.

In [8]:
# Assumes a 'prompt-injection' concept already exists in the 'local' namespace.
# The dataset has no 'input' field, so we search over 'question' instead.
query = ll.ConceptSearch(
    concept_namespace='local',
    concept_name='prompt-injection',
    embedding='sbert',
    path='question',
)
r = dataset.select_rows(['question'], searches=[query], limit=30)
df = r.df()
df['score'] = df['question.local/prompt-injection/sbert'].apply(lambda x: x[0]['score'])
display(df.sort_values('score', ascending=False).head(10)[['question', 'score']])

You may notice a number of these values being given high scores, even if they aren't prompt injection. 
You can further refine the concepts in the app or using the code below.

In [None]:
updated_examples = [
    ll.concepts.ExampleIn(label=False, text="what is the final result of `import hashlib;"),
    ll.concepts.ExampleIn(label=False, text="생존자 는 몇 명인가요?"),
    ll.concepts.ExampleIn(label=False, text="생존자 중에 여성은 몇 명인가요"),
]
# Open the local concept database (assumes Lilac's `DiskConceptDB`) and add the negative examples
db = ll.DiskConceptDB()
concept = db.edit('local', 'prompt-injection', ll.concepts.ConceptUpdate(insert=updated_examples))

In [None]:
# Re-run the search over the 'question' field (the dataset has no 'input' field)
r = dataset.select_rows(['question'], searches=[query], limit=30)
df = r.df()
df['score'] = df['question.local/prompt-injection/sbert'].apply(lambda x: x[0]['score'])
display(df.sort_values('score', ascending=False).head(10)[['question', 'score']])

## 4. Prepare the enriched dataset

Now let's prepare the dataset for fine-tuning. We will fetch the deduplicated rows and filter out any rows that may contain PII.

In [9]:
# You can check the current schema by running the following. Select the fields you want to export.
# dataset.manifest()

In [10]:
df = dataset.to_pandas([
    'question', 
    'chat_history',
    'context',
    'output', 
    'question.pii',
    'question.near_dup',
    'user_score'])

print(f"Original length: {len(df)}")

# Flatten the dataframe
df['cluster_id'] = df['question.near_dup'].apply(lambda x: x['cluster_id'])
df['contains_pii'] = df['question.pii'].apply(lambda x: bool([span for spans in x.values() for span in spans]))
# Drop original dotted columns
df.drop(columns=['question.pii', 'question.near_dup'], inplace=True)
# Keep rows with no detected PII, a user_score other than '0.0', and a non-null output
df = df[(~df['contains_pii']) & (df['user_score'] != '0.0') & (~df['output'].isna())]
# Keep only the first row of each near-duplicate cluster
df = df.drop_duplicates(subset='cluster_id', keep='first')
print(f"Filtered length: {len(df)}")

Original length: 369
Filtered length: 312
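The same filtering logic can be traced on a few toy rows without pandas. This is only an illustration: the rows, the `emails` key, and the span offsets are made up, and only the overall shape (a dict of span lists for PII, a `cluster_id` for near-duplicates) follows the columns used above.

```python
# Hypothetical rows mimicking the signal outputs used in this walkthrough.
rows = [
    {"question": "How do I use LCEL?", "pii": {"emails": []}, "near_dup": {"cluster_id": 0}},
    {"question": "How do I use LCEL ?", "pii": {"emails": []}, "near_dup": {"cluster_id": 0}},
    {
        "question": "Mail me at a@b.co",
        "pii": {"emails": [{"start": 11, "end": 17}]},
        "near_dup": {"cluster_id": 1},
    },
    {"question": "What is a retriever?", "pii": {"emails": []}, "near_dup": {"cluster_id": 2}},
]


def contains_pii(pii: dict) -> bool:
    """True if any PII category holds at least one detected span."""
    return any(spans for spans in pii.values())


kept = []
seen_clusters = set()
for row in rows:
    if contains_pii(row["pii"]):
        continue  # drop rows with detected PII first
    cluster_id = row["near_dup"]["cluster_id"]
    if cluster_id in seen_clusters:
        continue  # keep only the first row of each near-duplicate cluster
    seen_clusters.add(cluster_id)
    kept.append(row["question"])

print(kept)
```

Filtering PII before deduplicating matters: a row containing PII should not be the one that "claims" a cluster and causes a clean near-duplicate to be discarded.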


## 5. Fine-tune

With the dataset filtered, we can now convert it to a format compatible with fine-tuning.
We will use OpenAI's fine-tuning endpoint for this, but you could apply similar logic to fine-tune a Llama, T5, or other model.

In [11]:
def create_messages(row):
    chat_history = json.loads(row.chat_history) or []
    roles = ("assistant", "user")
    messages = [{"role": "system", "content": "Helpfully answer the questions about LangChain."}]
    # Alternate roles across the prior chat turns, starting with "assistant"
    for i, msg in enumerate(chat_history):
        messages.append({"role": roles[i % 2], "content": str(msg["content"])})
    messages.append({"role": "user", "content": row.question})
    messages.append({"role": "assistant", "content": row.output})
    return messages

messages = df.apply(create_messages, axis=1).tolist()
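Before uploading, it can help to sanity-check each training record locally. The sketch below validates one JSONL line in the `{"messages": [...]}` shape built above. The `validate_record` helper and its checks are assumptions chosen for illustration, not OpenAI's own validation rules.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")

    # Heuristic (an assumption): the example should end with the reply to learn.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


good_line = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "Helpfully answer the questions about LangChain."},
            {"role": "user", "content": "What is a retriever?"},
            {"role": "assistant", "content": "A component that returns documents for a query."},
        ]
    }
)
bad_line = json.dumps({"messages": [{"role": "bot", "content": None}]})

print(validate_record(good_line))
print(validate_record(bad_line))
```

Running a check like this over every record is cheap, and it catches issues such as a missing `output` before you wait on a file upload.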

Now you can fine-tune the model! This will take a while (20+ minutes), so we'd encourage you to further explore your local Lilac dataset
while you wait.

In [15]:
import json
from io import BytesIO
import time

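# The calls below use the pre-1.0 interface of the openai package
# (openai.File and openai.FineTuningJob).
import openai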

# We will write the jsonl file in memory
my_file = BytesIO()
for m in messages:
    my_file.write((json.dumps({"messages": m}) + "\n").encode('utf-8'))

my_file.seek(0)
training_file = openai.File.create(
  file=my_file,
  purpose='fine-tune'
)

# OpenAI audits each training file for compliance reasons.
# This may take a few minutes
status = openai.File.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.File.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")

job = openai.FineTuningJob.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

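job = openai.FineTuningJob.retrieve(job.id)
status = job.status
start_time = time.time()
while status != "succeeded":
    # Stop polling if the job ends in a terminal failure state
    if status in ("failed", "cancelled"):
        raise RuntimeError(f"Fine-tuning job {job.id} ended with status {status}")
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.FineTuningJob.retrieve(job.id)
    status = job.status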

File file-UrGGUOin9fManPfygLuDlELi ready after 96.76 seconds.
Status=[running]... 1654.28s
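Both waits above follow the same pattern: poll a status until it reaches a target value. A small generic helper makes that pattern reusable and adds a timeout and failure states. This is a sketch under assumptions: `wait_for_status` is a hypothetical helper, the failure state names are assumed, and the status sequence is simulated rather than fetched from the API.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "succeeded",
    failures: Iterable[str] = ("failed", "cancelled"),
    interval: float = 5.0,
    timeout: float = 3600.0,
) -> str:
    """Poll `get_status` until it returns `target`, hits a failure state, or times out."""
    failure_states = set(failures)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failure_states:
            raise RuntimeError(f"Stopped polling: status={status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout:.0f}s")
        time.sleep(interval)


# Simulated status sequence standing in for repeated API calls.
fake_statuses = iter(["validating_files", "running", "succeeded"])
final_status = wait_for_status(lambda: next(fake_statuses), interval=0.0)
print(final_status)
```

In the notebook, `get_status` could be a lambda that retrieves the job and returns its `status`, so that a failed or cancelled job raises instead of looping indefinitely.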

#### Use fine-tuned model

With the model fine-tuning complete, you can load the fine-tuned model directly in LangChain!

In [22]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Helpfully answer the questions about LangChain."),
        ("user", "{input}")
    ]
)
chain = prompt | model
chain.invoke({"input": "What's LangChain Expression Language?"})

AIMessage(content='"The LangChain Expression Language (LCEL) is a language used within LangChain to define expressions and transformations. It provides a set of functions and operators that can be used to perform various operations on data within a LangChain pipeline. \\n\\nLCEL allows you to manipulate and transform data by applying functions and operators to input values. It supports a wide range of operations, including string manipulation, mathematical calculations, conditional branching, and more. This flexibility makes it a powerful tool for building complex data processing pipelines.\\n\\nLCEL expressions can be used in various contexts within LangChain, such as when defining data sources, data sinks, or in intermediate steps of a data manipulation process. You can also use LCEL expressions to define custom functions and operators, allowing you to extend the capabilities of LangChain to suit your specific needs.\\n\\nOverall, LCEL provides a way to dynamically transform and mani

## Conclusion

LangSmith makes it easy to collect unstructured data seen by your production LLM application. Lilac can make it easier to filter and analyze that data with sophisticated methods.

In this tutorial, you created a dataset of run traces, filtered out near-duplicates and rows that may contain PII, and then used the filtered dataset to fine-tune a new model.