In [None]:
! pip install nucliadb-sdk==1.2.5
! pip install nucliadb-dataset==1.2.3
! pip install nucliadb-models==2.0.4
! pip install sentence-transformers
! pip install datasets
import requests
from nucliadb_sdk.knowledgebox import KnowledgeBox
from nucliadb_sdk.labels import Label
from nucliadb_sdk.utils import create_knowledge_box, get_or_create
from sentence_transformers import SentenceTransformer


## Setup

First, make sure the **NucliaDB container** is running:

``` 
docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest
```
Then, we'll check the connection:

In [2]:
response = requests.get("http://localhost:8080")
assert response.ok
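
The container can take a few seconds to start answering, so a single request may fail even though nothing is wrong. Below is a minimal sketch of a polling helper, using only the standard library. The names `is_up` and `wait_until_up` and the retry defaults are assumptions made for illustration; they are not part of the NucliaDB SDK.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if a GET on `url` answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def wait_until_up(url, attempts=10, delay=1.0, probe=is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the last attempt
    return False
```

For this notebook, the call would be `wait_until_up("http://localhost:8080")`. `probe` and `sleep` are parameters only so the loop can be exercised without a running server.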

## Setup - creating a KB

In NucliaDB, data containers are called knowledge boxes (KBs).

To start working, we need to create one:

*We create it with the function `get_or_create`, so that it is reused rather than created again if it already exists.*

In [3]:
my_kb = get_or_create("my_reddit_data_kb")

## Setup - preparing data & model

We load our dataset (a 10,000-row sample of `go_emotions`) and the sentence embedding model we are going to use.

In [6]:
from datasets import load_dataset
dataset = load_dataset("go_emotions", "raw")

sample = dataset["train"].shuffle(seed=19).select(range(10000))

Found cached dataset go_emotions (/Users/ciniesta/.cache/huggingface/datasets/go_emotions/raw/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d)
100%|██████████| 1/1 [00:00<00:00, 179.47it/s]
Loading cached shuffled indices for dataset at /Users/ciniesta/.cache/huggingface/datasets/go_emotions/raw/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d/cache-66031443094dc2fe.arrow


In [5]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

## Uploading data to our KB

We use the `upload` function to index the text, labels and precomputed vectors for each sentence of our dataset.

Tips:
- We can store more than one set of vectors in our data: just add another entry to the vectors dict, e.g. `vectors={"roberta-vectors": vectors_roberta, "bert-vectors": vectors_bert}`
- To avoid uploading the same data twice by mistake, add a `key` to your upload. It is a unique identifier, so uploading a resource again with the same key updates it instead of duplicating it, e.g. `key="my_reddit_sample"`
- This can take a while! If you are in a hurry, you can always select a smaller size when creating the sample
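
For the `key` tip to work across a whole dataset, every row needs its own stable key. One possible way to get one is to hash the comment text, as sketched below. `make_key` is a hypothetical helper, not part of the SDK, and the `prefix_digest` format is an assumption chosen to resemble the `my_reddit_sample` example above.

```python
import hashlib


def make_key(text, prefix="reddit"):
    """Build a stable identifier from the comment text.

    The same text always produces the same key, so uploading it again
    would target the same resource instead of creating a new one.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:16]}"
```

It would be passed to the upload as `key=make_key(row["text"])`. One consequence of this design: two rows with identical text share a key, so they end up as a single resource.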

In [7]:
for row in sample:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0].tolist()},
    )


Vectorset is not created, we will create it for you


## Checks

We uploaded every resource with a single label: the subreddit it comes from.
We could have added more, for instance labels from other labelsets describing other features of each comment.

Let's check if the numbers agree!

In [8]:
my_labels = my_kb.get_uploaded_labels()
print("Labelsets info : ")
print(my_labels)
print("Labelset: ", ", ".join(my_labels.keys()))
print("Labels:",", ".join(my_labels["reddit"].labels.keys()))
print("Tagged resources:",my_labels["reddit"].count)
my_vectorsets = my_kb.list_vectorset()
print("-----------------")
print("Vectorsets info : ")
print(my_vectorsets)
print("Vectorset: ", ", ".join(my_vectorsets.vectorsets.keys()))
print("Dimension:", my_vectorsets.vectorsets["all-MiniLM-L6-v2"].dimension)

Labelsets info : 
{'reddit': LabelSet(count=10000, labels={'90DayFiance': 48, 'rpdrcringe': 47, 'TrollXChromosomes': 46, 'loveafterlockup': 46, 'nonononoyes': 44, 'ENLIGHTENEDCENTRISM': 43, 'teenagers': 43, 'NYYankees': 42, 'TeenMomOGandTeenMom2': 42, 'yesyesyesyesno': 42, 'atheism': 41, 'tifu': 41, 'SubredditSimulator': 40, 'Tinder': 40, 'entitledparents': 40, 'minnesotavikings': 40, 'Overwatch': 39, 'The_Mueller': 39, 'datingoverthirty': 39, 'sadcringe': 39, '2meirl4meirl': 38, 'Advice': 38, 'Blackops4': 38, 'Jokes': 38, 'SuicideWatch': 38, 'barstoolsports': 38, 'detroitlions': 38, 'exmormon': 38, 'unpopularopinion': 38, '90dayfianceuncensored': 37, 'Gunners': 37, 'OkCupid': 37, 'antiMLM': 37, 'confessions': 37, 'gaybros': 37, 'torontoraptors': 37, 'depression': 36, 'fatlogic': 36, 'heroesofthestorm': 36, 'AnimalsBeingBros': 35, 'CFB': 35, 'DoesAnybodyElse': 35, 'EdmontonOilers': 35, 'forwardsfromgrandma': 35, 'timberwolves': 35, 'Anarchism': 34, 'AnimalsBeingJerks': 34, 'Documentari
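
To check that the numbers agree without reading the output by eye, the per-subreddit counts can be recomputed from the sample and compared. This is only a sketch: it assumes `my_labels["reddit"].labels` behaves like a mapping from label to count, as the printed output suggests, and both helper names are made up for illustration.

```python
from collections import Counter


def expected_label_counts(rows, field="subreddit"):
    """Count how many rows carry each value of `field`."""
    return Counter(row[field] for row in rows)


def count_mismatches(expected, uploaded):
    """Return {label: (expected, uploaded)} for every label whose counts differ."""
    labels = set(expected) | set(uploaded)
    return {
        label: (expected.get(label, 0), uploaded.get(label, 0))
        for label in labels
        if expected.get(label, 0) != uploaded.get(label, 0)
    }
```

Here the call would be `count_mismatches(expected_label_counts(sample), my_labels["reddit"].labels)`; an empty dict means every count matches.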

## Filter by label

Let's explore the results from a single subreddit.
To do that, we filter by label, in this case `socialanxiety`.

In [12]:
results = my_kb.search(
        filter=[Label(labelset="reddit", label="socialanxiety")]
    )
for result in results:
    print(result.text)
    print(result.labels)

Reading this made me pretty damn happy. Congrats hope it works out for you two.
['socialanxiety']
I don't drink at all specifically because the next day is sheer terror.
['socialanxiety']
Take a break then get back out there!
['socialanxiety']
I know how you feel :( I hope you start getting more good days soon. <3
['socialanxiety']
I always worry about my facial expressions. You're not alone there.
['socialanxiety']
Lol well if you see an awkward looking girl in a white car following at a safe distance, just know that’s little ole anxious me!
['socialanxiety']
I'd like to add spontaneously experiencing unending self-loathing because you suddenly remembered that embarrassing thing you did 3 years ago.
['socialanxiety']
Quitting porn only helped with my performance in bed, other than that I’m still anxious as hell in social spots 
['socialanxiety']
also anxious that people will be angry or surprised or upset at me for me never telling people about this before
['socialanxiety']
In that co

## Text search

Now let's try full-text (keyword) search.

This search returns entries that contain the word or set of words we input.

First we'll look for `developer` and output the following fields for each result:
- Text: text of the matched result
- Labels: labels associated with the result, in this case the subreddit it belongs to
- Score: score of the result
- Score type: the kind of score (BM25 for keyword search, cosine similarity for semantic search)
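
To give an intuition for what a BM25 score measures, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not NucliaDB's implementation: the tokenizer, the `k1` and `b` defaults and the function name are assumptions, so the numbers will not match the scores returned by `search`.

```python
import math
import re
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (a non-empty list) against `query`."""
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in re.findall(r"\w+", query.lower()):
            if tf[term] == 0:
                continue  # a term absent from the document adds nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In this sketch a document scores above zero only if it contains a query term as an exact token; rarer terms and shorter documents push the score up.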


In [15]:
results = my_kb.search(text="developer")
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.score_type)
    print("------")

It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
['CryptoCurrency']
6.545155048370361
ScoreType.BM25
------


Since we did not get many matches, we'll look for some more words related to technology:
- Tech
- Technology
- Code


In [16]:
for query in ["tech", "technology", "code"]:
    results = my_kb.search(text=query)
    for result in results:
        print(result.text)
        print(result.labels)


Damn. Low tech bait bike
['KidsAreFuckingStupid']
Yay! (Tech will be improved. Eeeek.)
['exchristian']
How many attempts did that insane tech take to pull off?
['CompetitiveForHonor']
Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
['rickandmorty']
Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
['rickandmorty']
I think there’s something to the concept that overindulgence in technology contributes to a lack of physical and emotional bonding these days.
['DeadBedrooms']
And those offences are not Criminal Code offences. Criminal Code offences require it be established "beyond a reasonable doubt" that an accused actually committed the offence.
['ontario']
Don't give away my zip code! Craft beer and fabuloso be dammed!!
['Denver']
Because a lot of people on this thread are ignoring the pirate code and I’m trying to explain that rules are there for

Still not many interesting results, and some are completely unrelated, so we'll move on to semantic search.

## Vector search

To get results related to the meaning of a word or sentence, we use semantic search.

This search returns the entries in our KB with the highest cosine similarity to a given vector.

That is, the sentences whose embeddings, computed with the model we used to create our vectors, are most similar to the embedding of the query we are searching for.

To perform this search, we encode our query with the same model we used for the upload and pass the resulting vector to the search function.

We pass it in the `vector` parameter, name the vectorset to search in `vectorset`, and can add `min_score` to set a minimum cosine similarity for our results.
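
As an illustration of the idea, the sketch below ranks a handful of vectors by cosine similarity to a query and drops those under `min_score`. It is a brute-force version in plain Python under assumed names (`cosine_similarity`, `vector_search`); it does not describe how NucliaDB indexes or searches vectors internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query, entries, min_score=0.0):
    """Rank (text, vector) pairs by similarity to `query`, best first.

    Entries scoring below `min_score` are dropped from the result.
    """
    scored = [(cosine_similarity(query, vec), text) for text, vec in entries]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Scores range from -1 to 1, with 1 meaning the vectors point in the same direction, so raising `min_score` trades recall for precision.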

In [21]:
query_vectors = encoder.encode(["Techn, devs, programming and coding"])[0].tolist()
results = my_kb.search(vector = query_vectors, vectorset="all-MiniLM-L6-v2")
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")


A lot of software guys are. Source: in software
['90dayfianceuncensored']
0.48638221621513367
0d8aae4e7bf94effacf006b81d7f9275
ScoreType.COSINE
------
not sure if this is a joke, but if you've ever worked at a dev studio, this adds credibility if anything
['StarWarsBattlefront']
0.42719635367393494
1e15e5614aad46c6960c0e9e97fb6686
ScoreType.COSINE
------
Yay! (Tech will be improved. Eeeek.)
['exchristian']
0.3701585829257965
4200d7d44e3f43f5b0d9175a96a9b196
ScoreType.COSINE
------
It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
['CryptoCurrency']
0.3314987123012543
e2c06136b01a4ca88b6992450b2c64ee
ScoreType.COSINE
------
I have no idea what you’re talking about, so do you have carpentry experience and/or background in engineering?
['SubredditSimulator']
0.32369694113731384
af7a78bca43d43f8afbcb72973d86892
ScoreType.COSINE
------
>outsourced independent contractor Dude there i

In [22]:
query_vectors = encoder.encode(["What is happiness"])[0].tolist()
results = my_kb.search(vector = query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

I’m a crybaby now, but I’m so happy to see happiness
['wholesomememes']
0.5495570302009583
cb7eaea631734487b0a813d24ffff752
ScoreType.COSINE
------
Ive never been happier and wouldnt want to live any other way
['AskMenOver30']
0.5448427200317383
933e36b8d17c4a54b8e7a38e1c9dbf55
ScoreType.COSINE
------
Happy just came out
['netflix']
0.5028761029243469
733617d708aa4c9188e4ce79984aa022
ScoreType.COSINE
------
You gotta make yourself happy when no one else does
['sadcringe']
0.4969874322414398
86f7c237ca4b4e998a37dbe5c7c00f7d
ScoreType.COSINE
------
I am happy for you. No, seriously, I am.
['drunk']
0.4953000843524933
2faa48aacb8e4442becad784d1dca470
ScoreType.COSINE
------
This. Making my animals happy is one of the only things that makes me happy.
['antinatalism']
0.4914007782936096
5b79c66fa7f64a4c88fd98e198c3bf07
ScoreType.COSINE
------
more like consistent*winner*, i hope you find a lasting happiness
['confessions']
0.4771484136581421
244eebf437034166a9739c7b20f02c44
ScoreType.COSINE

In [23]:
query_vectors = encoder.encode(["The meaning of life"])[0].tolist()
results = my_kb.search(vector = query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

iS thIs a MeTapHor FoR LifE?
['woooosh']
0.6362201571464539
52d8cb1cc91540d2848329bf1e3c5e8f
ScoreType.COSINE
------
So the key to living is to do nothing?
['philosophy']
0.4603619873523712
16edd3f6857d41f980fa71deed13ecfb
ScoreType.COSINE
------
Life is just an endless series of disappointments and no one gets out of here alive.
['INTP']
0.447122722864151
e6e608bf4cf1459ca7e402241e180c4a
ScoreType.COSINE
------
Living his best life.
['cringe']
0.4318602383136749
bc6d7b289a5d4fc18364a403533851b9
ScoreType.COSINE
------
Living his best life.
['cringe']
0.4318602383136749
3c6018c8b90c4b57b8c319f86686d4c9
ScoreType.COSINE
------
Why do you feel like you have to figure out what life is alone?
['confessions']
0.42643851041793823
8c40cecd1da3460ba24044ae3fecfbb7
ScoreType.COSINE
------
Such a sad way to think of life all because these people believe there's something greater than this.
['exmuslim']
0.41576701402664185
f7a1423876334407b85509cd9569ec62
ScoreType.COSINE
------
It's when you lea

In [24]:
query_vectors = encoder.encode(["What is love?"])[0].tolist()
results = my_kb.search(vector = query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(result.text)
    print(result.labels)
    print(result.score)
    print(result.key)
    print(result.score_type)
    print("------")

Love has no number
['teenagers']
0.4977535605430603
4ae2e34381de481bbc0b093f203f91df
ScoreType.COSINE
------
“Love” under threat of execution. Wow.
['IncelTears']
0.47561562061309814
60d384052371444a85bdafab9d1338dc
ScoreType.COSINE
------
“Love” under threat of execution. Wow.
['IncelTears']
0.47561562061309814
79ee980caa654bbf94e9a49b7456bb60
ScoreType.COSINE
------
I love you. I had no idea before now, but I love you.
['TIHI']
0.453585147857666
64d63582a6f242f5bdf6791ca9af6660
ScoreType.COSINE
------
Pure unconsitional love can also be dangerous: if the other person turns out to be an abuser you should propably stop loving them
['AskMen']
0.44075626134872437
8318066cddfc4822961ea77073783f5b
ScoreType.COSINE
------
Or a lovegod That supports you
['Tinder']
0.4096187949180603
f446dce6a8514595bddc12c38451a791
ScoreType.COSINE
------
There is obviously love here. Don’t give up. Talk to each other.
['DeadBedrooms']
0.39207711815834045
27b2d7f86c054597921c8300ec4c8dbd
ScoreType.COSINE
---