In [11]:
# %pip install sentence-transformers umap-learn embetter

## Intro to bulk

To get to a demo quickly, assume that the following code has already been run:

```python
import pandas as pd
from umap import UMAP
from sklearn.pipeline import make_pipeline 

# pip install "embetter[text]"
from embetter.text import SentenceEncoder

# Build a sentence encoder pipeline with UMAP at the end.
enc = SentenceEncoder('all-MiniLM-L6-v2')
umap = UMAP()

text_emb_pipeline = make_pipeline(
  enc, umap
)

# Load sentences
sentences = list(pd.read_csv("tests/data/text.csv")['text'])

# Calculate the 2D UMAP projection of the sentence embeddings
X_tfm = text_emb_pipeline.fit_transform(sentences)

# Build the dataframe. Note! Text column must be named "text"
df = pd.DataFrame({"text": sentences})
df['x'] = X_tfm[:, 0]
df['y'] = X_tfm[:, 1]

# Keep the original (pre-UMAP) sentence embeddings around as well
X = enc.transform(sentences)
```

This gives us a dataframe `df` that contains the sentences together with 2D UMAP representations of their embeddings. We also have a sentence encoder `enc` loaded, and we have access to the original embeddings `X`. Computing these can take a while on a CPU, so we store them on disk.

```python
import numpy as np

np.save("utils/X", X)
np.save("utils/X_tfm", X_tfm)
df.to_csv("utils/df.csv", index=False)
```
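
One detail that is easy to trip over: `np.save` appends a `.npy` suffix when the filename does not already have one, which is why the files get loaded later as `utils/X.npy` and `utils/X_tfm.npy`. The sketch below shows this with a small stand-in array in a temporary folder; the array and the paths are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in array; in the demo this would be the embedding matrix X.
arr = np.arange(6, dtype=np.float32).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "X")

    # No extension given, so NumPy writes "<tmp>/X.npy".
    np.save(stem, arr)
    saved_files = sorted(os.listdir(tmp))

    # Loading needs the full filename, including the suffix.
    restored = np.load(stem + ".npy")

print(saved_files)
print(np.array_equal(arr, restored))
```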

Given that these are stored on disk, we can bootstrap the demo a whole lot quicker!

In [5]:
%load_ext autoreload
%autoreload 2

In [26]:
import numpy as np
import pandas as pd

X = np.load("utils/X.npy")
X_tfm = np.load("utils/X_tfm.npy")
df = pd.read_csv("utils/df.csv")

Let's use these variables to conjure up a basic text explorer. This will allow us to quickly explore the clusters that appear in our data.

In [27]:
from bulk.widgets import BaseTextExplorer

widget = BaseTextExplorer(df)
widget.show()

Being able to explore these clusters is neat, but exploring gets easier with a few more tools at our disposal. In particular, we want to have an encoder around so that we can run text queries in the embedded space.

The UI below will allow us to explore much more interactively.
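
To make "queries in our embedded space" a bit more concrete, here is a minimal sketch of one way such a query could work: embed the query string, then rank the rows of `X` by cosine similarity to it. This is an assumption for illustration and not necessarily what the widget does internally. The toy vectors stand in for real embeddings, and `cosine_similarity_to_query` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity_to_query(X: np.ndarray, query_vec: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between each row of X and query_vec."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    return X_norm @ q_norm


# Toy 2D "embeddings". In the notebook these would be the rows of X, and the
# query vector would come from something like enc.transform(["some text"])[0].
X_toy = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
query = np.array([1.0, 0.1])

scores = cosine_similarity_to_query(X_toy, query)

# Row indices ordered from most to least similar to the query.
ranking = np.argsort(-scores)
print(ranking)
```

Because the ranking consists of row indices into `X`, the same indices can be used to look up rows in `df`, provided the two stay aligned.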

In [None]:
from embetter.text import SentenceEncoder

enc = SentenceEncoder('all-MiniLM-L6-v2')

In [29]:
# Pay attention here! The rows in df need to align with the rows in X!
widget = BaseTextExplorer(df, X=X, encoder=enc)
widget.show()

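
Since `df` and `X` are loaded from separate files, a quick sanity check before building the widget can catch misalignment early. The helper below is a hypothetical sketch, not part of bulk. It only checks what can be checked cheaply (a `text` column and matching row counts) and cannot detect rows that were shuffled.

```python
import numpy as np
import pandas as pd


def check_alignment(df: pd.DataFrame, X: np.ndarray) -> None:
    """Raise ValueError if df and X clearly cannot be row-aligned."""
    if "text" not in df.columns:
        raise ValueError('df needs a column named "text"')
    if X.ndim != 2:
        raise ValueError(f"expected a 2D embedding matrix, got shape {X.shape}")
    if len(df) != X.shape[0]:
        raise ValueError(f"df has {len(df)} rows but X has {X.shape[0]} rows")


# Toy stand-ins for the dataframe and embeddings loaded from disk.
df_demo = pd.DataFrame({"text": ["a", "b", "c"]})
X_demo = np.zeros((3, 4))

check_alignment(df_demo, X_demo)  # passes silently

try:
    check_alignment(df_demo, X_demo[:2])  # one embedding row is missing
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```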

Thanks to tools like ipywidgets and anywidget, we can really start building tools that make the notebook hard to beat as the go-to place for your data needs. My primary interest is to work on tools for data quality, and being able to select datapoints in bulk feels like a great place to start. Maybe you can find an interesting subset to annotate first. Maybe you get surprised when you see two distinct clusters that should be one. All that good stuff can happen in the notebook.

More UI will follow, but this `BaseTextExplorer` feels like a nice place to start!