# Vincent's Initial Text-Labelling Hack!

This notebook shows a convenient pattern for clustering and labelling new text data. The end goal is to discover intents that might be used in a virtual assistant. This is especially useful at an early stage of a project and fits the "iterate on your data" mindset.

This notebook will guide you through the process. The two main tools that will be used here are [whatlies](https://rasahq.github.io/whatlies/) and [human-learn](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html).

In [64]:
import pathlib 
import numpy as np
from whatlies.language import CountVectorLanguage, UniversalSentenceLanguage, BytePairLanguage
from whatlies.transformers import Pca, Umap

In [65]:
txt = pathlib.Path("nlu.md").read_text()
texts = list(set([t.replace(" - ", "") for t in txt.split("\n") if len(t) > 0 and t[0] != "#"]))

lang_cv = CountVectorLanguage(10)
lang_use = UniversalSentenceLanguage()
lang_bp = BytePairLanguage("en", dim=300, vs=200_000)
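
The parsing step above can be tried out without the real `nlu.md` file. The following is a minimal sketch, assuming the file follows the Rasa Markdown layout (header lines starting with `#` and one `- ` bullet per utterance). The inline `sample` text and the helper name `extract_utterances` are hypothetical and only serve as illustration. Unlike the one-liner above, this sketch strips only a leading list marker, so a dash inside an utterance is kept.

```python
def extract_utterances(markdown_text):
    """Return the unique example utterances from Rasa-style Markdown NLU data."""
    utterances = set()
    for line in markdown_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and header lines such as "## intent:greet".
        if not stripped or stripped.startswith("#"):
            continue
        # Drop the leading list marker, if there is one.
        if stripped.startswith("- "):
            stripped = stripped[2:]
        utterances.add(stripped)
    # Sort so that the result is deterministic.
    return sorted(utterances)


# Hypothetical sample data, not taken from the real nlu.md file.
sample = """## intent:greet
- hello
- hi there
- hello

## intent:bye
- see you - later
"""

print(extract_utterances(sample))
# ['hello', 'hi there', 'see you - later']
```

The duplicate `hello` collapses into one entry, which mirrors the `set(...)` call in the cell above.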

In [66]:
def make_plot(lang):
    return (lang[texts]
             .transform(Umap(2))
             .plot_interactive(annot=False)
             .properties(width=200, height=200, title=type(lang).__name__))

make_plot(lang_cv) | make_plot(lang_bp) | make_plot(lang_use)

We'll now prepare a dataframe that we'll assign labels to.

In [67]:
df = lang_use[texts].transform(Umap(2)).to_dataframe().reset_index()
df.columns = ['text', 'd1', 'd2']
df['label'] = ''

# Fancy interactive drawing! 

We'll be using Vincent's infamous [human-learn library](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html) for this. First we'll need to instantiate some charts.

Next we get to draw! Drawing can be a bit tricky though, so pay attention. 

1. Double-click to start drawing.
2. Click to add points; together they form a polygon.
3. Double-click again to stop drawing.

This allows you to draw polygons that can be used in the code below to fetch the examples that you're interested in.
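
To build some intuition for how a drawn polygon can select examples, here is a minimal sketch of a point-in-polygon test using ray casting. This is a generic textbook technique and an assumption on our part; human-learn may implement the check differently. The rectangle and the helper name `point_in_polygon` are hypothetical, and the two coordinates are rounded from the output tables in this notebook.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in drawing order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                # Every crossing flips between outside and inside.
                inside = not inside
    return inside


# A hypothetical hand-drawn rectangle in (d1, d2) space.
polygon = [(0.0, -6.0), (3.0, -6.0), (3.0, -3.0), (0.0, -3.0)]

# Two utterances with their rounded 2D coordinates.
points = {
    "who r u": (1.86, -4.90),
    "What does Rasa do?": (1.22, 4.74),
}

selected = [text for text, (x, y) in points.items()
            if point_in_polygon(x, y, polygon)]
print(selected)
# ['who r u']
```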

In [163]:
from hulearn.experimental.interactive import InteractiveCharts

charts = InteractiveCharts(df.loc[lambda d: d['label'] == ''], labels=['group_to_retreive'])

charts.add_chart(x='d1', y='d2')

We can now use this selection to retrieve a subset of rows.

In [164]:
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())

df.pipe(tfm.pandas_pipe).loc[lambda d: d['group_to_retreive'] != 0].head(10)

| | text | d1 | d2 | label | group_to_retreive |
|---|---|---|---|---|---|
| 0 | How can I determine who I am? | 0.264203 | -5.731505 | | 1 |
| 9 | can you tell me who I am? | 0.513859 | -5.672482 | | 1 |
| 14 | So who are you ? | 1.88204 | -4.868009 | | 1 |
| 16 | tell me more about your founders | 2.342074 | -4.0978 | | 1 |
| 38 | can you tell me what I am? | 0.444628 | -5.774251 | | 1 |
| 48 | Tell me who am I? | 0.689649 | -5.3903 | | 1 |
| 50 | could please tell me about yourself | 1.684289 | -4.192813 | | 1 |
| 74 | who is your boss? | 2.616716 | -3.483566 | | 1 |
| 108 | who is your employer? | 2.626744 | -3.494134 | | 1 |
| 142 | who r u | 1.855409 | -4.898279 | | 1 |


If you're confident that you'd like to assign a label, you can do so below. 

In [165]:
label_name = 'who_are_you'

In [166]:
idx = df.pipe(tfm.pandas_pipe).loc[lambda d: d['group_to_retreive'] != 0].index

# `idx` holds index labels, so select by label and by column name.
df.loc[idx, 'label'] = label_name

In [167]:
df

| | text | d1 | d2 | label |
|---|---|---|---|---|
| 0 | How can I determine who I am? | 0.264203 | -5.731505 | who_are_you |
| 1 | so how were you made? | 3.603826 | -0.352539 | how_made |
| 2 | how aold are you | -6.533790 | 8.196808 | how_are_you |
| 3 | Can you give me the time? | -1.351469 | 1.744661 | what_time |
| 4 | What does Rasa do? | 1.215502 | 4.737106 | what_is_rasa |
| ... | ... | ... | ... | ... |
| 1171 | who created you | 3.595397 | -3.698710 | who_made_you |
| 1172 | how works rasa | 0.961739 | 4.559081 | what_is_rasa |
| 1173 | Where did you come from? | -2.960351 | -5.495086 | birthday |
| 1174 | you are a human | 5.613154 | 7.210114 | are_you_real |


You can now scroll up and start relabelling. Once you're confident that this works, you can export.
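
The relabelling loop works because the chart is always built from the rows whose label is still empty, so every round shrinks the pool of candidates. Below is a minimal sketch of that bookkeeping in plain Python, with dictionaries standing in for dataframe rows. The helper names `unlabelled` and `assign` are hypothetical, and the selected positions are hard-coded here instead of coming from a drawn polygon.

```python
from collections import Counter

# Hypothetical miniature of the dataframe: one dict per utterance.
rows = [
    {"text": "who r u", "label": ""},
    {"text": "who is your boss?", "label": ""},
    {"text": "Can you give me the time?", "label": ""},
    {"text": "are you a bot?", "label": ""},
]


def unlabelled(rows):
    """Positions of the rows that still have an empty label."""
    return [i for i, row in enumerate(rows) if row["label"] == ""]


def assign(rows, positions, label_name):
    """Write label_name onto the rows at the given positions."""
    for i in positions:
        rows[i]["label"] = label_name


# Round 1: pretend the drawn polygon captured rows 0 and 1.
assign(rows, [0, 1], "who_are_you")
print(unlabelled(rows))
# [2, 3]

# Round 2: only the remaining rows are candidates for the next drawing.
assign(rows, [3], "are_you_real")
print(unlabelled(rows))
# [2]

# A quick progress overview per label.
print(Counter(row["label"] or "<unlabelled>" for row in rows))
```

When the list of unlabelled positions stops shrinking, that is a reasonable moment to inspect the leftovers by hand before exporting.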

In [168]:
df.to_csv("first_order_labelled.csv")

In [177]:
for lab, grp in df.groupby('label'):
    print(f"# intent:{lab}")
    for t in grp['text']:
        print(f"- {t}")

# intent:are_you_real
- are you a real person
- Ar you a bot ?
- hey are you human
- are you a real bot?
- Are you human ?
- are you artificial
- what are you, a bot?
- are you a Skynet ?
- you idiot bot
- shit bot
- are you a real person?
- are you bot
- sara, are you a robot or human?
- are you a bot ?
- are u a real person?
- you robo
- are you a robot
- oh are you chatbot?
- are you ai
- you are ai
- are you artificial intelligence
- fuck you machine learning bot
- r u a human
- you sound like a real human
- real bot then?
- are you rasa bot?
- Rara, are you a human?
- Are you a human being?
- are you bot?
- Great, is there anything else you can do, bot?
- are you real human?
- are you really a bbot?
- are u human
- are u dump?
- what can I do with this bot
- are you sure that you're a bot?
- are you robot
- Are you a chat bot?
- Are you a bot
- are you a bot?
- r u real?
- you are useless bot
- are you really a bot
- Are you a real person?
- are you a bot
- are you a BOT
- Are you