In [None]:
! pip install "nucliadb-sdk<=2.42.1"
! pip install nucliadb-dataset
! pip install sentence-transformers
! pip install datasets

In [1]:
import requests
from nucliadb_sdk import KnowledgeBox, Label, create_knowledge_box, get_or_create
from sentence_transformers import SentenceTransformer





## Setup NucliaDB

- Run **NucliaDB** image:
```bash
docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest
```
- Or install with pip and run:

```bash
pip install nucliadb
nucliadb
```

## Check everything's up and running

In [2]:
import requests
response = requests.get("http://0.0.0.0:8080")

assert response.status_code == 200, "Oops, it seems something is not properly installed"
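
If NucliaDB was started a moment ago, the single request above can fail simply because the server is still booting. Below is a minimal sketch of a polling check that uses only the standard library. The helper names `probe` and `wait_until_ready`, the number of attempts and the delay are illustrative assumptions, not part of NucliaDB or its SDK.

```python
import time
import urllib.request


def probe(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses.
        return False


def wait_until_ready(url, attempts=10, delay=1.0, check=probe, sleep=time.sleep):
    """Poll `url` until `check` succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False


# Usage (assumes NucliaDB listens on port 8080, as in the docker command):
# assert wait_until_ready("http://localhost:8080"), "NucliaDB did not come up"
```

The `check` and `sleep` parameters are injectable only so that the helper can be tried without a running server.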

## Setup - creating a KB

In NucliaDB, our data containers are called knowledge boxes (KBs).

To start working, we need to create one:

*We create it with the function `get_or_create`, so it is not created again if it already exists.*

In [32]:
my_kb = get_or_create("my_reddit_data_kb")

## Setup - preparing data & model

We download our dataset and the sentence embedding model we are going to use.

I set the size of the sample to 5K, but you can set it to a smaller size if you want to run the notebook faster (it takes around 15 min to load 5K).

In [46]:
from datasets import load_dataset
dataset = load_dataset("go_emotions", "raw")

SAMPLE_SIZE = 5000
sample = dataset["train"].shuffle(seed=19).select(range(SAMPLE_SIZE))



In [30]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

## Uploading data to our KB

We use the `upload` function to index the text, labels and computed vectors for each sentence of our dataset.

Tips:
- We can have more than one set of vectors in our data: just add another entry to the vectors dict, e.g. `vectors={"roberta-vectors": vectors_roberta, "bert-vectors": vectors_bert}`
- If you want to avoid uploading the same data twice by mistake, add a `key` to your upload, e.g. `key="my_reddit_sample"`. It is a unique identifier: uploading a resource again with the same key updates it instead of duplicating it.
- This can take a while! If you are in a hurry, you can always select a smaller size when creating the sample.
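
The first two tips can be combined in one small helper. This is a sketch under assumptions: `stable_key` and `upload_row` are hypothetical names, and deriving the key from a hash of the text is just one possible choice of unique identifier. The `kb` argument is expected to behave like the `my_kb` object above, i.e. to offer `upload(key=..., text=..., labels=..., vectors=...)`.

```python
import hashlib


def stable_key(text):
    """Derive a deterministic key from the text (an assumption: any unique id works)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def upload_row(kb, row, encoders):
    """Upload one row with a stable key and one vector per encoder.

    `encoders` maps a vectorset name to a function text -> list of floats.
    """
    text = row["text"]
    kb.upload(
        key=stable_key(text),  # same text -> same key -> update, not duplicate
        text=text,
        labels=[f"reddit/{row['subreddit']}"],
        vectors={name: encode(text) for name, encode in encoders.items()},
    )


# Usage with the objects defined in this notebook:
# upload_row(my_kb, row, {"all-MiniLM-L6-v2": lambda t: encoder.encode([t])[0].tolist()})
```

A side effect of hashing the text is that rows with identical text map to the same resource, so repeated sentences in the sample are stored only once.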

In [33]:
for row in sample:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0].tolist()},
    )


Vectorset is not created, we will create it for you
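
The loop above calls the model once per sentence. Part of the loading time can probably be saved by encoding several sentences per model call, since `encoder.encode` accepts a list of texts. The following is a rough sketch of that idea, not a measured optimization: `batched` and `upload_in_batches` are hypothetical helpers, and the batch size of 64 is an arbitrary assumption.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(kb, rows, encode_batch, vectorset, batch_size=64):
    """Encode `batch_size` texts per model call, then upload row by row.

    `encode_batch` takes a list of texts and returns one list of floats per text.
    Returns the number of uploaded rows.
    """
    uploaded = 0
    for chunk in batched(rows, batch_size):
        vectors = encode_batch([row["text"] for row in chunk])
        for row, vector in zip(chunk, vectors):
            kb.upload(
                text=row["text"],
                labels=[f"reddit/{row['subreddit']}"],
                vectors={vectorset: vector},
            )
            uploaded += 1
    return uploaded


# Usage with the objects defined in this notebook:
# upload_in_batches(
#     my_kb,
#     list(sample),
#     lambda texts: encoder.encode(texts).tolist(),
#     "all-MiniLM-L6-v2",
# )
```

The uploads themselves still happen one at a time, so only the encoding step benefits from batching.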


## Checks

Let's explore how many entries we uploaded for each label and which vectorsets exist in our KB.

In [34]:
my_labels = my_kb.get_uploaded_labels()
print("Labelsets info : ")
print(my_labels)
print("Labelset: ", ", ".join(my_labels.keys()))
print("Labels:",", ".join(my_labels["reddit"].labels.keys()))
print("Tagged resources:",my_labels["reddit"].count)
my_vectorsets = my_kb.list_vectorset()
print("-----------------")
print("Vectorsets info : ")
print(my_vectorsets)
print("Vectorset: ", ", ".join(my_vectorsets.vectorsets.keys()))
print("Dimension:", my_vectorsets.vectorsets["all-MiniLM-L6-v2"].dimension)

Labelsets info : 
{'reddit': LabelSet(count=5000, labels={'rpdrcringe': 30, 'loveafterlockup': 27, '90DayFiance': 26, 'yesyesyesyesno': 26, 'DoesAnybodyElse': 25, 'ENLIGHTENEDCENTRISM': 24, 'NYYankees': 24, 'The_Mueller': 24, 'entitledparents': 24, 'teenagers': 24, 'EdmontonOilers': 23, 'Gunners': 23, 'exmormon': 23, 'gaybros': 22, 'nonononoyes': 22, 'CFB': 21, 'detroitlions': 21, 'steelers': 21, '90dayfianceuncensored': 20, 'Documentaries': 20, 'MensRights': 20, 'OkCupid': 20, 'TopMindsOfReddit': 20, 'TrollXChromosomes': 20, 'VoteBlue': 20, 'forwardsfromgrandma': 20, 'nba': 20, 'torontoraptors': 20, '4PanelCringe': 19, 'JordanPeterson': 19, 'LifeProTips': 19, 'Marriage': 19, 'SelfAwarewolves': 19, 'TheSimpsons': 19, 'breakingmom': 19, 'chicago': 19, 'confessions': 19, 'fatlogic': 19, 'minnesotavikings': 19, 'raimimemes': 19, 'texas': 19, '2meirl4meirl': 18, 'Anarchism': 18, 'AnimalsBeingJerks': 18, 'IncelTears': 18, 'Jokes': 18, 'Mavericks': 18, 'Overwatch': 18, 'SeattleWA': 18, 'Teen
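
With this many subreddits the raw dict is hard to read. Here is a small sketch that keeps only the most frequent labels. It assumes, as the output above suggests, that `my_labels["reddit"].labels` is a mapping from label to count. `top_labels` is a hypothetical helper, not an SDK function.

```python
def top_labels(label_counts, n=10):
    """Return the n (label, count) pairs with the highest counts.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(label_counts.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:n]


# Usage with the objects defined in this notebook:
# for label, count in top_labels(my_labels["reddit"].labels, n=5):
#     print(f"{label}: {count}")
```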

## Filter by label

Let's explore results from one of the subreddits. 
For that we filter by label, in this case `socialanxiety`

In [35]:
results = my_kb.search(
    filter=[Label(labelset="reddit", label="socialanxiety")]
)
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

Text: I'd like to add spontaneously experiencing unending self-loathing because you suddenly remembered that embarrassing thing you did 3 years ago.
Labels: ['socialanxiety']
Text: Lol well if you see an awkward looking girl in a white car following at a safe distance, just know that’s little ole anxious me!
Labels: ['socialanxiety']
Text: I always worry about my facial expressions. You're not alone there.
Labels: ['socialanxiety']
Text: I know how you feel :( I hope you start getting more good days soon. <3
Labels: ['socialanxiety']
Text: Take a break then get back out there!
Labels: ['socialanxiety']
Text: I don't drink at all specifically because the next day is sheer terror.
Labels: ['socialanxiety']
Text: Reading this made me pretty damn happy. Congrats hope it works out for you two.
Labels: ['socialanxiety']
Text: I did the same thing in school. Thank [NAME] for the library. Are you anxious about how your brother and sister would react to that?
Labels: ['socialanxiety']
Text: You

## Text search

Now let's try the full text search or keyword search.

This search returns entries that contain the word or sets of words we input.

First we'll look for `developer` and output the following fields for each result:
- Text: text of the matched result
- Labels: labels associated with the result, in this case the subreddit it belongs to
- Score: score of the result
- Score Type: kind of score (BM25 for keyword search, cosine similarity for semantic search)
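
To give an intuition for what a BM25 score measures, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only: the whitespace tokenizer and the parameter values `k1=1.2` and `b=0.75` are assumptions, and the scores NucliaDB returns come from its own index and will differ.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # a term absent from the document adds nothing
            # Rare terms weigh more than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["he is not a developer", "tech will be improved", "the pirate code"]
print(bm25_scores("developer", docs))
```

Only documents that contain the exact query word get a non-zero score, which is why keyword search is sensitive to the wording of the query.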


In [36]:
results = my_kb.search(text="developer")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Score Type: {result.score_type}")
    print("------")

Text: It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
Labels: ['CryptoCurrency']
Score: 6.0321736335754395
Score Type: BM25
------


Since we did not get many matches, we'll look for some more words related to technology:
- Tech
- Technology
- Code


In [37]:
print("** tech")
results = my_kb.search(text="tech")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    
print("\n** technology")
results = my_kb.search(text="technology")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

print("\n** code")
results = my_kb.search(text="code")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    

** tech
Text: Yay! (Tech will be improved. Eeeek.)
Labels: ['exchristian']
Text: How many attempts did that insane tech take to pull off?
Labels: ['CompetitiveForHonor']
Text: Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
Labels: ['rickandmorty']
Text: Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
Labels: ['rickandmorty']

** technology

** code
Text: And those offences are not Criminal Code offences. Criminal Code offences require it be established "beyond a reasonable doubt" that an accused actually committed the offence.
Labels: ['ontario']
Text: Because a lot of people on this thread are ignoring the pirate code and I’m trying to explain that rules are there for a reason.
Labels: ['Seaofthieves']


Still no interesting results, so we'll move on to semantic search.

## Vector search

To get results that are related to the meaning of a word or sentence, we use semantic search.

This search returns the entries in our KB with the highest cosine similarity to a given vector.

That is, the sentences that the model we used to create our vectors encodes as most similar to the one we are searching for.

To perform this search, we convert our query to a vector with the same model we used for the upload and pass it to the search function.

We need to use the field `vector`, along with the `vectorset` to search in, and we can add `min_score` if we want to set a minimum cosine similarity for our results.
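
As an illustration of what cosine similarity and a `min_score` threshold mean, here is a brute-force sketch in plain Python. The helper names are hypothetical, and comparing the query against every stored vector is only meant to show the idea. It says nothing about how NucliaDB indexes or searches vectors internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, candidates, min_score=0.0):
    """Return (key, score) pairs with score >= min_score, best first.

    `candidates` maps a key to its stored vector.
    """
    scored = [
        (key, cosine_similarity(query_vector, vector))
        for key, vector in candidates.items()
    ]
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
print(rank_by_similarity([1.0, 0.0], toy_vectors, min_score=0.2))
```

In this toy example the vector `c` is orthogonal to the query, so its similarity is 0 and the `min_score` filter drops it.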

In [39]:
query_vectors = encoder.encode(["Tech, devs, programming and coding"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.2)

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: not sure if this is a joke, but if you've ever worked at a dev studio, this adds credibility if anything
Labels: ['StarWarsBattlefront']
Score: 0.4388854205608368
Key: a65d48b72e0949089af23cd9d7d39fe5
Score Type: COSINE
------
Text: Yay! (Tech will be improved. Eeeek.)
Labels: ['exchristian']
Score: 0.364839106798172
Key: e5f585afd5fc4ac0864db1867b2d3586
Score Type: COSINE
------
Text: As a computer science student but also a lifelong [NAME], the issues surrounding Amazon in Queens have left me substantially torn. 
Labels: ['nyc']
Score: 0.3186536133289337
Key: 09b42b88f1944573925c926612da403a
Score Type: COSINE
------
Text: It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
Labels: ['CryptoCurrency']
Score: 0.3089994788169861
Key: c2382b9403ab46e5bd216f007db7ad36
Score Type: COSINE
------
Text: I can't wait for all these now unregulated projects to demonstrate why we had 

In [41]:
query_vectors = encoder.encode(["What is happiness"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.4)

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: I’m a crybaby now, but I’m so happy to see happiness
Labels: ['wholesomememes']
Score: 0.5495570302009583
Key: 1d188463e32f4fd5b4b04aaaa6487a17
Score Type: COSINE
------
Text: Happy just came out
Labels: ['netflix']
Score: 0.5028761029243469
Key: 99eeddb655de42a19c720d462ffa5422
Score Type: COSINE
------
Text: This. Making my animals happy is one of the only things that makes me happy.
Labels: ['antinatalism']
Score: 0.4914007782936096
Key: 5b57a682a4f141b0a78445c52998f19b
Score Type: COSINE
------
Text: more like consistent*winner*, i hope you find a lasting happiness
Labels: ['confessions']
Score: 0.4771484136581421
Key: a9129aa730a64a07816326413bf8cc46
Score Type: COSINE
------
Text: Thank you. I’ve been feeling a bit better since they make people happier.
Labels: ['SuicideWatch']
Score: 0.4405825734138489
Key: 100bfd15d9534d52844a8f355efff1e4
Score Type: COSINE
------
Text: One of the 7 dwarfs is called happy
Labels: ['woooosh']
Score: 0.4348929226398468
Key: 8417800d4274475e

In [43]:
query_vectors = encoder.encode(["The meaning of life"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.33)

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: iS thIs a MeTapHor FoR LifE?
Labels: ['woooosh']
Score: 0.6362201571464539
Key: d4e36eea8fa44a6893ff140aba7aeadb
Score Type: COSINE
------
Text: So the key to living is to do nothing?
Labels: ['philosophy']
Score: 0.4603619873523712
Key: b553a02516bd4c2d8bb817565e6d67c3
Score Type: COSINE
------
Text: Why do you feel like you have to figure out what life is alone?
Labels: ['confessions']
Score: 0.42643851041793823
Key: d7456dbfb34a4a05af0f56f94432c294
Score Type: COSINE
------
Text: Story of my life
Labels: ['MakingaMurderer']
Score: 0.3753337562084198
Key: 4f62500d80574619a55ef25359b9c938
Score Type: COSINE
------
Text: Life is Strange is terrible. Absolutely zero gameplay...
Labels: ['pcgaming']
Score: 0.3414113521575928
Key: 6517cdecbdd5484d872afaed9d558b21
Score Type: COSINE
------
Text: A concept that doesn't map to reality other than as a metaphor.
Labels: ['DebateAnAtheist']
Score: 0.3300152122974396
Key: 7037e671f8d2496e9c121b93ab26a31d
Score Type: COSINE
------


In [45]:
query_vectors = encoder.encode(["What is love?"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.3)

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: Pure unconsitional love can also be dangerous: if the other person turns out to be an abuser you should propably stop loving them
Labels: ['AskMen']
Score: 0.44075626134872437
Key: d42dec051c5e45068e2c0a8ffa1b4c6d
Score Type: COSINE
------
Text: There is obviously love here. Don’t give up. Talk to each other.
Labels: ['DeadBedrooms']
Score: 0.39207711815834045
Key: 4b5050731ad64445acedcadefd1ebaaa
Score Type: COSINE
------
Text: Gotta love [NAME].
Labels: ['Music']
Score: 0.3871612548828125
Key: d520e6c9fc334dc69b5b96b1b488e2ea
Score Type: COSINE
------
Text: Peace and love, my brother!
Labels: ['UpliftingNews']
Score: 0.37950992584228516
Key: 91f6c31f63e045beaad982e26d09c10a
Score Type: COSINE
------
Text: No one in particular....i mean to people who believe that [NAME] and [NAME] are actually in love.
Labels: ['freefolk']
Score: 0.35776492953300476
Key: 04c3534c1b2940279f392770c5495070
Score Type: COSINE
------
Text: Some people on here are infatuated with his infatuation.
Labe