### Analyze LangSmith Datasets with Lilac

Lilac is an open-source product that helps you analyze, structure, and clean unstructured data with AI. Let's use it to explore production traces stored in a LangSmith dataset.

In [32]:
# %pip install -U "lilac[pii]" langdetect sentence-transformers langsmith --quiet

### From LangSmith


In [3]:
from IPython.display import display
import lilac as ll

In [10]:
dataset_name = 'Chat LangChain Questions'
data_source = ll.sources.langsmith.LangSmithSource(dataset_name=dataset_name)

config = ll.DatasetConfig(
  namespace='local',
  name='langsmith',
  source=data_source,
)
dataset = ll.create_dataset(config)

Reading from source langsmith...: 100%|███████████████████████████████████████| 46/46 [00:00<00:00, 11020.62it/s]

Dataset "langsmith" written to ./datasets/local/langsmith





## Visualize the data

Now that we have imported the dataset, let's visualize it to see what it looks like.


In [17]:
# ll.start_server()
# await ll.stop_server()

In [22]:
"http://127.0.0.1:5432/datasets#local/langsmith"

'http://127.0.0.1:5432/datasets#local/langsmith'

## View Dataset Schema

The Lil

In [31]:
# Show the dataset schema
dataset.manifest()

DatasetManifest(namespace='local', dataset_name='langsmith', data_schema={
  "fields": {
    "question": {
      "dtype": "string"
    },
    "answer": {
      "dtype": "string"
    }
  }
}, source=LangSmithSource(dataset_name='Chat LangChain Questions'), num_items=46)

## Enriching an unstructured field with metadata

Lilac exposes a number of built-in methods to add structured metadata to your dataset.
Called "signals", these methods compute a function on each row and add the results as new fields
nested under the field on which they were applied.

In this example, we will run a "signal" over the `question` field.

In [33]:
dataset.compute_signal(ll.LangDetectionSignal(), 'question')

Computing lang_detection on local/langsmith:('question',): 100%|█████████████████| 46/46 [00:00<00:00, 46.36it/s]

Computing signal "lang_detection" on local/langsmith:('question',) took 1.062s.
Wrote signal output to ./datasets/local/langsmith/question/lang_detection
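
Conceptually, a signal like the one above is a function mapped over one field, with its output stored alongside the input. The sketch below illustrates that idea in plain Python. It does not use Lilac, and `word_count_signal` and `apply_signal` are hypothetical names chosen for illustration only.

```python
from typing import Any, Callable

Row = dict[str, Any]


def word_count_signal(text: str) -> dict[str, int]:
  """Hypothetical signal: returns simple statistics for one text value."""
  return {'num_words': len(text.split()), 'num_chars': len(text)}


def apply_signal(rows: list[Row], field: str, name: str,
                 signal: Callable[[str], Any]) -> list[Row]:
  """Returns copies of `rows` with the signal output stored next to `field`.

  The output is keyed as '<field>.<name>' so it stays associated with the
  field it was computed from. The input rows are not modified.
  """
  return [{**row, f'{field}.{name}': signal(row[field])} for row in rows]


rows = [{'question': 'what is RAG'}, {'question': 'Help me debug this error'}]
enriched = apply_signal(rows, 'question', 'word_count', word_count_signal)
for row in enriched:
  print(row)
```

The real library persists signal output to disk, as the log above shows. This sketch only mirrors the per-row mapping.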





In [36]:
# Apply min-hash LSH (https://en.wikipedia.org/wiki/MinHash) to detect approximate n-gram duplicates
dataset.compute_signal(ll.NearDuplicateSignal(), 'question')

Computing near_dup on local/langsmith:('question',):   0%|                                | 0/46 [00:00<?, ?it/s]
Fingerprinting...: 46it [00:00, 6597.97it/s]

Computing hash collisions...: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 1426.63it/s]

Clustering...: 100%|█████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 149386.17it/s]
Computing near_dup on local/langsmith:('question',): 100%|██████████████████████| 46/46 [00:00<00:00, 984.21it/s]

Computing signal "near_dup" on local/langsmith:('question',) took 0.051s.
Wrote signal output to ./datasets/local/langsmith/question/near_dup
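
To make the MinHash idea behind near-duplicate detection more concrete, here is a minimal standard-library sketch. It is not how Lilac implements `NearDuplicateSignal`: it omits the LSH bucketing step and compares signatures directly, and the function names, the word 3-gram shingling, and the choice of 64 hash functions are all assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
  """Returns the set of word n-grams in `text` (the whole text if shorter)."""
  words = text.lower().split()
  if len(words) < n:
    return {' '.join(words)}
  return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(items: set[str], num_perm: int = 64) -> list[int]:
  """Returns a MinHash signature: the minimum hash value per salted hash."""
  signature = []
  for seed in range(num_perm):
    salt = seed.to_bytes(8, 'little')  # A different salt acts as a different hash function.
    signature.append(min(
      int.from_bytes(
        hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), 'little')
      for item in items))
  return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
  """Fraction of matching signature positions, an estimate of Jaccard similarity."""
  return sum(x == y for x, y in zip(a, b)) / len(a)


q1 = 'how do I load a text file into a vector store'
q2 = 'how do I load a text file into a vector store?'
q3 = 'what is retrieval augmented generation'
sig1, sig2, sig3 = (minhash(shingles(q)) for q in (q1, q2, q3))
print('near duplicate:', estimated_jaccard(sig1, sig2))
print('unrelated:', estimated_jaccard(sig1, sig3))
```

Two texts would then be flagged as near duplicates when the estimate exceeds a chosen threshold. LSH adds a banding step so that not every pair of signatures has to be compared.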





## Query the Dataset

Now that we've enriched the dataset, we can query it to explore it.

In [42]:
r = dataset.select_rows(['question', 'answer'], limit=5)
r.df()

| | question | answer |
| --- | --- | --- |
| 0 | How can I create a simple chat model using my ... | Certainly! To create a simple chat model using... |
| 1 | Can you show a me an example of how to create ... | Certainly! To create a vector store with Azure... |
| 2 | Help me debug: TypeError: Pinecone.similarity_... | Your error is likely due to an incorrect param... |
| 3 | whats the code to load text file into a vector... | To load a text file into a vector store using ... |
| 4 | what is RAG | RAG stands for Retrieval Augmented Generation.... |


## Searching


### Compute embedding to enable advanced search

Let's compute the `SBERT` embedding on device for the `question` field.


In [43]:
dataset.compute_embedding('sbert', 'question')

Computing sbert on local/langsmith:('question',): 100%|██████████████████████████| 46/46 [00:05<00:00,  7.72it/s]

Computing signal "sbert" on local/langsmith:('question',) took 5.961s.
hnswlib index creation took 0.004s.
Wrote embedding index to ./datasets/local/langsmith/question/sbert
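
Once each question has an embedding vector, semantic search amounts to finding the stored vectors closest to a query vector. The sketch below shows that ranking step with brute-force cosine similarity in plain Python. The toy 3-dimensional vectors stand in for real SBERT embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers, not Lilac functions. The log above indicates Lilac builds an hnswlib index, which performs approximate nearest-neighbor lookup rather than the exhaustive scan used here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
  """Returns the cosine of the angle between two vectors (0.0 if either is zero)."""
  dot = sum(x * y for x, y in zip(a, b))
  norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
  return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
  """Returns the `k` texts whose vectors are most similar to `query`."""
  scored = [(text, cosine_similarity(query, vec)) for text, vec in index.items()]
  return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real sentence embeddings (assumed values).
index = {
  'what is RAG': [0.9, 0.1, 0.0],
  'how do I load a text file': [0.1, 0.9, 0.1],
  'explain retrieval augmented generation': [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # Pretend embedding of a query about RAG.
for text, score in top_k(query_vector, index):
  print(f'{score:.3f}  {text}')
```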





### Keyword search


In [44]:
# Assumption: this Lilac version exposes `ll.KeywordSearch` in place of the older
# `ll.KeywordQuery` / `ll.Search` pair, which raised an AttributeError here.
search = ll.KeywordSearch(path='question', query='runnable')
r = dataset.select_rows(['question'], searches=[search], limit=5)
display(r.df())


### Semantic search


In [14]:
# Assumption: `ll.SemanticSearch` replaces the older `ll.SemanticQuery` / `ll.Search` pair.
# The dataset has no `overview` field, so search the `question` field embedded above.
search = ll.SemanticSearch(path='question', query='runnable', embedding='sbert')
r = dataset.select_rows(['question'], searches=[search], limit=5)
display(r.df())

### Conceptual search


In [1]:
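# Assumption: `ll.ConceptSearch` replaces the older `ll.ConceptQuery` / `ll.Search` pair.
# Run the import cell first: this cell raised a NameError because `ll` was not imported.
search = ll.ConceptSearch(
  path='question', concept_namespace='lilac', concept_name='profanity', embedding='sbert')
r = dataset.select_rows(['question'], searches=[search], limit=5)
display(r.df())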

## Downloading the enriched dataset


In [None]:
dataset.to_csv('langsmith_dataset.csv')

In [None]:
dataset.to_pandas()[:5]

## Using concepts


### Use the positive-sentiment concept


In [None]:
signal = ll.signals.ConceptSignal(
  namespace='lilac', concept_name='positive-sentiment', embedding='gte-small')

result = list(signal.compute(['This product is amazing, thank you!']))

print(result)


### Create a positive product reviews concept


In [None]:
db = ll.DiskConceptDB()

concepts = db.list()
# Don't create the concept twice.
if not list(
    filter(lambda c: c.namespace == 'local' and c.name == 'positive-product-reviews', concepts)):
  db.create('local', 'positive-product-reviews')

#### Add a few training examples


In [None]:
train_data = [
  ll.ExampleIn(label=False, text='The quick brown fox jumps over the lazy dog.'),
  ll.ExampleIn(label=False, text='This is a random sentence.'),
  ll.ExampleIn(label=True, text='This product is amazing!'),
  ll.ExampleIn(label=True, text='Thank you for your awesome work on this UI.')
]
db.edit('local', 'positive-product-reviews', ll.ConceptUpdate(insert=train_data))

#### Show the examples in the concept


In [None]:
concept = db.get('local', 'positive-product-reviews')

if concept:
  print(concept.data)

#### Remove examples


In [None]:
db.edit('local', 'positive-product-reviews',
        ll.ConceptUpdate(remove=['d86e4cb53c70443b8d8782a6847f4752']))

#### Use the new concept


In [None]:
signal = ll.signals.ConceptSignal(
  namespace='local', concept_name='positive-product-reviews', embedding='gte-small')

result = list(signal.compute(['This product is amazing, thank you!']))

print(result)

#### Concept metrics

To compute metrics for a concept, we first have to instantiate a concept model.


In [None]:
model_db = ll.DiskConceptModelDB(ll.DiskConceptDB())

model = model_db.get('local', 'positive-product-reviews', embedding_name='gte-small')

if model:
  print(model.get_metrics())
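
The notebook does not show what `get_metrics()` returns. As a rough illustration of how a binary concept classifier is commonly scored, the sketch below computes precision, recall, and F1 from labeled examples in plain Python. The function name and the predictions are hypothetical and are not taken from Lilac. The labels mirror the four training examples added earlier.

```python
def precision_recall_f1(labels: list[bool],
                        predictions: list[bool]) -> tuple[float, float, float]:
  """Returns (precision, recall, F1) for binary labels and predictions."""
  pairs = list(zip(labels, predictions))
  true_pos = sum(1 for label, pred in pairs if label and pred)
  false_pos = sum(1 for label, pred in pairs if not label and pred)
  false_neg = sum(1 for label, pred in pairs if label and not pred)
  precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
  recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
  denom = precision + recall
  f1 = 2 * precision * recall / denom if denom else 0.0
  return precision, recall, f1


# Labels of the four training examples: two negatives, then two positives.
labels = [False, False, True, True]
# Hypothetical model output with one false positive.
predictions = [False, True, True, True]

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```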


#### Remove the concept


In [None]:
db.remove('local', 'positive-product-reviews')