# Vincent's Initial Text-Labelling Hack!

This notebook shows a convenient pattern for clustering and labelling new text data. The end goal is to discover intents that could be used in a virtual assistant. This is especially useful at an early stage of a project and fits the "iterate on your data" mindset.

The notebook walks through the process step by step. The two main tools used here are [whatlies](https://rasahq.github.io/whatlies/) and [human-learn](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html).

In [1]:
import pathlib
import numpy as np
from whatlies.language import CountVectorLanguage, UniversalSentenceLanguage, BytePairLanguage
from whatlies.transformers import Pca, Umap

In [2]:
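# Read the raw NLU file.
txt = pathlib.Path("nlu.md").read_text()
# Skip empty lines and markdown headers, strip the " - " list marker, and deduplicate.
texts = list(set(t.replace(" - ", "") for t in txt.split("\n") if len(t) > 0 and t[0] != "#"))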

lang_cv = CountVectorLanguage(10)
lang_use = UniversalSentenceLanguage()
lang_bp = BytePairLanguage("en", dim=300, vs=200_000)
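
To make the parsing step concrete, here is a minimal sketch that runs the same filtering on an in-memory string. The sample text and the `parse_examples` helper are hypothetical, and the sketch assumes that `nlu.md` uses `#` header lines and ` - ` list markers, which is what the filtering logic suggests.

```python
# Hypothetical sample standing in for the contents of nlu.md.
sample = """## intent:greet
 - hello there
 - hi

## intent:restaurant
 - help me find restaurant
 - hi
"""


def parse_examples(raw):
    """Drop empty lines and '#' headers, strip the ' - ' marker, deduplicate."""
    lines = [t for t in raw.split("\n") if len(t) > 0 and t[0] != "#"]
    # sorted() is only here to make the output deterministic;
    # the notebook keeps the arbitrary order of the set.
    return sorted(set(t.replace(" - ", "") for t in lines))


examples = parse_examples(sample)
print(examples)  # ['hello there', 'help me find restaurant', 'hi']
```

Note that `replace(" - ", "")` removes the marker wherever it occurs, so an example that contains ` - ` in the middle of a sentence would be altered as well.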

In [3]:
def make_plot(lang):
    return (lang[texts]
             .transform(Umap(2))
             .plot_interactive(annot=False)
             .properties(width=200, height=200, title=type(lang).__name__))

make_plot(lang_cv) | make_plot(lang_bp) | make_plot(lang_use)

Next we prepare a dataframe to hold the labels. It contains each text together with the two UMAP coordinates of its Universal Sentence Encoder embedding.

In [32]:
df = lang_use[texts].transform(Umap(2)).to_dataframe().reset_index()
df.columns = ['text', 'd1', 'd2']
df['label'] = ''
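
The reshaping above can be tried without any embeddings. This is a minimal sketch that assumes the projected embeddings arrive as a dataframe indexed by text with one column per dimension, which is what the `reset_index` call suggests. The `coords` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 2D projection of the embeddings:
# the texts sit in the index, the coordinates in the columns.
coords = pd.DataFrame(
    [[13.0, 9.6], [2.1, -4.3]],
    index=["help me find restaurant", "hello there"],
)

frame = coords.reset_index()          # move the texts into a regular column
frame.columns = ["text", "d1", "d2"]  # rename all three columns at once
frame["label"] = ""                   # empty label, to be filled in later

print(frame.columns.tolist())  # ['text', 'd1', 'd2', 'label']
```

After this step the frame has a default integer index, and `label` is the fourth column (position 3).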

# Fancy interactive drawing! 

We'll be using Vincent's infamous [human-learn library](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html) for this. First we'll need to instantiate some charts.

In [39]:
from hulearn.experimental.interactive import InteractiveCharts

charts = InteractiveCharts(df, labels=['group_to_retreive'])

Next we get to draw! Drawing can be a bit tricky though, so pay attention.

1. Double-click to start drawing.
2. Click to add points; together they form a polygon.
3. Double-click again to stop drawing.

This allows you to draw polygons that can be used in the code below to fetch the examples that you're interested in.
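
To give some intuition for how a drawn polygon can select rows, here is a minimal sketch of a point-in-polygon test based on ray casting. This is a generic illustration and not necessarily how human-learn implements it. The `point_in_polygon` helper and the `square` coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon [(x0, y0), (x1, y1), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point; an odd count means inside.
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical polygon drawn around a cluster in (d1, d2) space.
square = [(12.5, 9.0), (13.6, 9.0), (13.6, 9.7), (12.5, 9.7)]

print(point_in_polygon(13.01, 9.60, square))  # True
print(point_in_polygon(2.0, 3.0, square))     # False
```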

In [40]:
charts.add_chart(x='d1', y='d2')

We can now use this selection to retrieve a subset of rows.

In [43]:
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())

df.pipe(tfm.pandas_pipe).loc[lambda d: d['group_to_retreive'] != 0].head(10)

| | text | d1 | d2 | label | group_to_retreive |
|---|---|---|---|---|---|
| 47 | Suggest me a good restaurant around | 13.013712 | 9.6021 | | 1 |
| 57 | Do you seek me a restaurant? | 13.543409 | 9.084156 | | 1 |
| 80 | help me find restaurant | 12.889969 | 9.201532 | | 1 |
| 87 | can i be shown a gluten free restaurant | 13.43908 | 9.145265 | | 1 |
| 121 | I am hungry, find me some place to go | 13.126829 | 9.497138 | | 1 |
| 131 | Could you find me a restaurant to eat at? | 13.379006 | 9.225938 | | 1 |
| 154 | Is there any restaurant? | 13.220592 | 9.177123 | | 1 |
| 165 | Find a restaurant for me where I can eat. | 13.040291 | 9.337744 | | 1 |
| 175 | Hey help me find a restaurant | 12.977052 | 9.263217 | | 1 |
| 253 | Hi, can you give me the nearest fast food place? | 13.293454 | 9.309745 | | 1 |


If you're confident that the selected rows belong together, set `label_name` below and assign it to them.

In [44]:
label_name = ''

In [52]:
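# Index labels of the rows that fall inside the drawn polygon.
idx = df.pipe(tfm.pandas_pipe).loc[lambda d: d['group_to_retreive'] != 0].index

# Assign by index label and column name rather than by position.
df.loc[idx, 'label'] = label_name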
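
The assignment step can be checked on a toy frame. This is a minimal sketch with made-up rows and a hypothetical label name. It assumes the group column holds `1` for rows inside the polygon and `0` otherwise, as the output above suggests.

```python
import pandas as pd

# Hypothetical rows; group_to_retreive mimics the preprocessor output.
demo = pd.DataFrame({
    "text": ["help me find restaurant", "hello there", "Is there any restaurant?"],
    "label": ["", "", ""],
    "group_to_retreive": [1, 0, 1],
})

# Index labels of the selected rows.
selected_idx = demo.loc[lambda d: d["group_to_retreive"] != 0].index

# Write the label only to the selected rows.
demo.loc[selected_idx, "label"] = "restaurant_search"  # hypothetical label name

print(demo["label"].tolist())  # ['restaurant_search', '', 'restaurant_search']
```

Because the frame has a default integer index, index labels and row positions coincide here. With any other index, position-based and label-based assignment would address different rows.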

You can now scroll back up, draw a new selection, and label the next group. Once you're confident that this works, you can export the result.

In [51]:
df.to_csv("first_order_labelled.csv")
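
When reading the export back, keep in mind that `to_csv` writes the index as an unnamed first column by default. The sketch below uses an in-memory string instead of a file, with made-up rows and label names, to show the round trip.

```python
import io

import pandas as pd

# Hypothetical labelled rows.
labelled = pd.DataFrame({
    "text": ["hi", "help me find restaurant"],
    "d1": [1.0, 13.0],
    "d2": [2.0, 9.2],
    "label": ["greet", "restaurant_search"],
})

# to_csv() without a path returns the CSV content as a string.
csv_text = labelled.to_csv()

# Read naively: the index comes back as a column called "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns[0])  # Unnamed: 0

# Read with index_col=0: the first column becomes the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())  # ['text', 'd1', 'd2', 'label']
```

Passing `index=False` to `to_csv` is an alternative if the index does not need to be kept.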