In [1]:
! pip install nucliadb-sdk==1.2.5
! pip install nucliadb-dataset==1.2.3
! pip install nucliadb-models==2.0.4
! pip install sentence-transformers
import requests
from nucliadb_sdk.knowledgebox import KnowledgeBox
from nucliadb_sdk.labels import Label
from nucliadb_sdk.utils import create_knowledge_box, get_or_create
from sentence_transformers import SentenceTransformer



## Setup

Make sure we've started **NucliaDB's container**

``` 
docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest
```
Then, we'll check the connection:

In [2]:
response = requests.get("http://localhost:8080")
assert response.ok
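
The container can take a few seconds to start answering, so a single request may fail even when everything is fine. A minimal sketch of a readiness check with retries follows. It uses only the standard library rather than `requests`, and the helper names `http_ok` and `wait_until_ready` are hypothetical, not part of any NucliaDB package.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(url, attempts=10, delay=1.0, probe=http_ok):
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the container time to come up
    return False
```

Calling `wait_until_ready("http://localhost:8080")` before the rest of the notebook would then return `True` as soon as the service responds, or `False` after the last attempt.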

## Setup - creating a KB

In NucliaDB, data containers are called knowledge boxes (KBs).

To start working, we need to create one:

*We create it with the function `get_or_create`, so it won't be created again if it already exists.*

In [5]:
my_kb = get_or_create("my_reddit_data_kb2")

## Setup - preparing data & model

We download the dataset and load the sentence-embedding model we are going to use.

In [4]:
from datasets import load_dataset
dataset = load_dataset("go_emotions", "raw")

sample = dataset["train"].shuffle(seed=19).select(range(10000))



In [6]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

## Uploading data to our KB

We use the `upload` function to index the text, label and computed vector of each sentence in our dataset.

Tips:
- We can store more than one set of vectors in our data: just add another entry to the `vectors` dict, e.g. `vectors={"roberta-vectors": vectors_roberta, "bert-vectors": vectors_bert}`
- To avoid uploading the same data twice by mistake, add a `key` to your upload, e.g. `key="my_reddit_sample"`. It is a unique identifier: uploading a resource again with the same key updates it instead of duplicating it.
- This can take a while! If you are in a hurry, you can always select a smaller size when creating the sample.
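
The first two tips can be combined in one small helper. This is a sketch under stated assumptions: `build_upload_kwargs` is a hypothetical name, and deriving the key from a hash of the text is a choice made for this example only (any stable, unique string would do). Note that with this choice, two rows with identical text share a key and would end up as a single resource.

```python
import hashlib


def build_upload_kwargs(row, encoders, labelset="reddit"):
    """Build keyword arguments for one upload call.

    `encoders` maps a vectorset name to a callable that turns a text
    into a list of floats, so several vectorsets can be filled at once.
    """
    text = row["text"]
    return {
        # Assumption for this sketch: a SHA-1 of the text is the key.
        "key": hashlib.sha1(text.encode("utf-8")).hexdigest(),
        "text": text,
        "labels": [f"{labelset}/{row['subreddit']}"],
        "vectors": {name: encode(text) for name, encode in encoders.items()},
    }
```

With the model from this notebook, `encoders` could be `{"all-MiniLM-L6-v2": lambda t: encoder.encode([t])[0].tolist()}`, and each row would be uploaded with `my_kb.upload(**build_upload_kwargs(row, encoders))`.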

In [7]:
for row in sample:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0].tolist()},
    )


Vectorset is not created, we will create it for you


## Checks

We uploaded each resource with a single label: its subreddit.
We could have added more, for example to tag other features of the comments.

Let's check if the numbers agree!

In [8]:
my_labels = my_kb.get_uploaded_labels()
print("Labelsets info : ")
print(my_labels)
print("Labelset: ", ", ".join(my_labels.keys()))
print("Labels:",", ".join(my_labels["reddit"].labels.keys()))
print("Tagged resources:",my_labels["reddit"].count)
my_vectorsets = my_kb.list_vectorset()
print("-----------------")
print("Vectorsets info : ")
print(my_vectorsets)
print("Vectorset: ", ", ".join(my_vectorsets.vectorsets.keys()))
print("Dimension:", my_vectorsets.vectorsets["all-MiniLM-L6-v2"].dimension)

Labelsets info : 
{'reddit': LabelSet(count=10000, labels={'90DayFiance': 48, 'rpdrcringe': 47, 'TrollXChromosomes': 46, 'loveafterlockup': 46, 'nonononoyes': 44, 'ENLIGHTENEDCENTRISM': 43, 'teenagers': 43, 'NYYankees': 42, 'TeenMomOGandTeenMom2': 42, 'yesyesyesyesno': 42, 'atheism': 41, 'tifu': 41, 'SubredditSimulator': 40, 'Tinder': 40, 'entitledparents': 40, 'minnesotavikings': 40, 'Overwatch': 39, 'The_Mueller': 39, 'datingoverthirty': 39, 'sadcringe': 39, '2meirl4meirl': 38, 'Advice': 38, 'Blackops4': 38, 'Jokes': 38, 'SuicideWatch': 38, 'barstoolsports': 38, 'detroitlions': 38, 'exmormon': 38, 'unpopularopinion': 38, '90dayfianceuncensored': 37, 'Gunners': 37, 'OkCupid': 37, 'antiMLM': 37, 'confessions': 37, 'gaybros': 37, 'torontoraptors': 37, 'depression': 36, 'fatlogic': 36, 'heroesofthestorm': 36, 'AnimalsBeingBros': 35, 'CFB': 35, 'DoesAnybodyElse': 35, 'EdmontonOilers': 35, 'forwardsfromgrandma': 35, 'timberwolves': 35, 'Anarchism': 34, 'AnimalsBeingJerks': 34, 'Documentari
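
To check that the numbers agree label by label rather than by eye, we can count the subreddits in the local sample and compare them with what the KB reports. The helpers below are a hypothetical sketch, not part of the SDK. They assume the reported counts are available as a plain mapping from label to count, as the `labels` field in the output above suggests.

```python
from collections import Counter


def count_labels(rows, field="subreddit"):
    """Count how many rows carry each value of `field`."""
    return Counter(row[field] for row in rows)


def compare_counts(expected, reported):
    """Return {label: (expected, reported)} for every label whose counts differ."""
    labels = set(expected) | set(reported)
    return {
        label: (expected.get(label, 0), reported.get(label, 0))
        for label in labels
        if expected.get(label, 0) != reported.get(label, 0)
    }
```

An empty result from `compare_counts(count_labels(sample), my_labels["reddit"].labels)` would mean every label was indexed as many times as it appears in the sample.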

## Filter by label

Let's explore the results from one of the subreddits.
To do that, we filter by label, in this case `socialanxiety`:

In [12]:
results = my_kb.search(
    filter=[Label(labelset="reddit", label="socialanxiety")]
)
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

Text: Reading this made me pretty damn happy. Congrats hope it works out for you two.
Labels: ['socialanxiety']
Text: I don't drink at all specifically because the next day is sheer terror.
Labels: ['socialanxiety']
Text: Take a break then get back out there!
Labels: ['socialanxiety']
Text: I know how you feel :( I hope you start getting more good days soon. <3
Labels: ['socialanxiety']
Text: I always worry about my facial expressions. You're not alone there.
Labels: ['socialanxiety']
Text: Lol well if you see an awkward looking girl in a white car following at a safe distance, just know that’s little ole anxious me!
Labels: ['socialanxiety']
Text: I'd like to add spontaneously experiencing unending self-loathing because you suddenly remembered that embarrassing thing you did 3 years ago.
Labels: ['socialanxiety']
Text: Quitting porn only helped with my performance in bed, other than that I’m still anxious as hell in social spots 
Labels: ['socialanxiety']
Text: also anxious that peopl

## Text search

Now let's try full-text (keyword) search.

This search returns entries that contain the word or words we input.

First we'll look for `developer` and print the following fields for each result:
- Text: text of the matched result
- Labels: labels associated with the result, in this case the subreddit it belongs to
- Score: score of the result
- Score Type: kind of score (BM25 for keyword search, cosine similarity for semantic search)
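
To give an intuition for what a BM25 score measures, here is a minimal pure-Python sketch of the classic formula. It is not NucliaDB's implementation: the tokenizer, the parameters `k1` and `b`, and the function names are assumptions made for illustration. The point to notice is that only exact token matches contribute to the score.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,!?()[]\"'") for word in text.lower().split())
    return [token for token in tokens if token]


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against `query` with the classic BM25 formula."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "He is not a developer...",
    "Low tech bait bike",
    "Software development is hard",
]
print(bm25_scores("developer", docs))
```

Only the first document gets a non-zero score: `development` is a different token from `developer`, so a purely lexical score treats it as unrelated.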


In [14]:
results = my_kb.search(text="developer")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Score Type: {result.score_type}")
    print("------")

Text: It's just bizarre you would criticize [NAME] on his development capabilities when he's not a developer... His specialties are centered around crypto philosophy
Labels: ['CryptoCurrency']
Score: 6.545155048370361
Score Type: BM25
------


Since we did not get many matches, we'll look for some more words related to technology:
- Tech
- Technology
- Code


In [16]:
print("** tech")
results = my_kb.search(text="tech")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    
print("\n** technology")
results = my_kb.search(text="technology")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

print("\n** code")
results = my_kb.search(text="code")
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    


** tech
Text: Damn. Low tech bait bike
Labels: ['KidsAreFuckingStupid']
Text: Yay! (Tech will be improved. Eeeek.)
Labels: ['exchristian']
Text: How many attempts did that insane tech take to pull off?
Labels: ['CompetitiveForHonor']
Text: Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
Labels: ['rickandmorty']
Text: Have nobody thought of the giant tech corporation selling overpriced, inferior products to sheep customers? That's odd.
Labels: ['rickandmorty']

** technology
Text: I think there’s something to the concept that overindulgence in technology contributes to a lack of physical and emotional bonding these days.
Labels: ['DeadBedrooms']

** code
Text: And those offences are not Criminal Code offences. Criminal Code offences require it be established "beyond a reasonable doubt" that an accused actually committed the offence.
Labels: ['ontario']
Text: Don't give away my zip code! Craft beer and fabuloso be da

These results are still sparse, and some are completely unrelated to technology, so we'll move on to semantic search.

## Vector search

To get results that are related to the meaning of a word or sentence, we use semantic search.

This search returns the entries in our KB with the highest cosine similarity to a given vector.

That is, the sentences that the model we used to create our vectors encodes as most similar to the one we are searching for.

To perform this search, we convert our query to a vector with the same model we used for the upload and pass it to the search function.

We need to use the `vector` field, together with `vectorset` to name the set of vectors to search, and we can add `min_score` to set a minimum cosine similarity for our results.
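
Conceptually, this kind of search ranks stored vectors by their cosine similarity to the query vector and drops anything below the threshold. The sketch below shows that idea on toy 2-dimensional vectors with a brute-force scan. It is an illustration only, with hypothetical function names, and says nothing about how NucliaDB indexes or searches vectors internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def vector_search(query, entries, min_score=0.0):
    """Rank (text, vector) pairs by similarity to `query`, best first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in entries]
    kept = [(score, text) for score, text in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


entries = [
    ("same direction", [1.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("in between", [1.0, 1.0]),
]
for score, text in vector_search([2.0, 0.0], entries, min_score=0.5):
    print(f"{score:.4f} {text}")
```

The orthogonal entry has similarity 0 and is filtered out by `min_score`, while the length of the query vector does not affect the ranking because cosine similarity only depends on direction.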

In [18]:
query_vectors = encoder.encode(["Tech, devs, programming and coding"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2")
   
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: A lot of software guys are. Source: in software
Labels: ['90dayfianceuncensored']
Score: 0.48674964904785156
Key: 0c0905cccf7f4591b4b57047a6efb858
Score Type: COSINE
------
Text: I have no idea what you’re talking about, so do you have carpentry experience and/or background in engineering?
Labels: ['SubredditSimulator']
Score: 0.3329728841781616
Key: 5782d4cef8d24bfeba7cac074881c3fe
Score Type: COSINE
------
Text: What do you do for work?
Labels: ['self']
Score: 0.3191085159778595
Key: a039b4cbfda0402b9e0db832ff56659b
Score Type: COSINE
------
Text: I want to be a rapper. I also like to write poetry and stuff. So to be successful with those things also I want to design clothing.
Labels: ['SuicideWatch']
Score: 0.30587318539619446
Key: 565cfc551b814018b578742cfb454f65
Score Type: COSINE
------
Text: What is the job you are going to college for? Now days some jobs are more accepting with tattoos
Labels: ['breakingmom']
Score: 0.28778696060180664
Key: 6f8ce8d396ec4b6cb4576fe09a02469

In [19]:
query_vectors = encoder.encode(["What is happiness"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: I’m a crybaby now, but I’m so happy to see happiness
Labels: ['wholesomememes']
Score: 0.5495570302009583
Key: b5e69c7ecdf143ada0f914ad005ccf2f
Score Type: COSINE
------
Text: Ive never been happier and wouldnt want to live any other way
Labels: ['AskMenOver30']
Score: 0.5448427200317383
Key: d292d3df7df14f65a924602392b809c9
Score Type: COSINE
------
Text: Happy just came out
Labels: ['netflix']
Score: 0.5028761029243469
Key: f22e857832ca4fb0839772d35e1a9815
Score Type: COSINE
------
Text: You gotta make yourself happy when no one else does
Labels: ['sadcringe']
Score: 0.4969874322414398
Key: 326099f6f2b947048cf2e1d7c888eca7
Score Type: COSINE
------
Text: I am happy for you. No, seriously, I am.
Labels: ['drunk']
Score: 0.4953000843524933
Key: d8f2dd5ca5cb4ce3a5a1f055acd3b38c
Score Type: COSINE
------
Text: This. Making my animals happy is one of the only things that makes me happy.
Labels: ['antinatalism']
Score: 0.4914007782936096
Key: 79a12b1b1e9e4d2ea3421f8862bf5a97
Score Ty

In [20]:
query_vectors = encoder.encode(["The meaning of life"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: iS thIs a MeTapHor FoR LifE?
Labels: ['woooosh']
Score: 0.6362201571464539
Key: caa135337da744b2815edef6a23f1369
Score Type: COSINE
------
Text: So the key to living is to do nothing?
Labels: ['philosophy']
Score: 0.4603619873523712
Key: 9b3423bc957e470b89b2f49fce498744
Score Type: COSINE
------
Text: Life is just an endless series of disappointments and no one gets out of here alive.
Labels: ['INTP']
Score: 0.447122722864151
Key: 86999ab5865748cf8bfc3fcf08b53589
Score Type: COSINE
------
Text: Living his best life.
Labels: ['cringe']
Score: 0.4318602383136749
Key: 315c9ef952634f6c922835de2662615c
Score Type: COSINE
------
Text: Living his best life.
Labels: ['cringe']
Score: 0.4318602383136749
Key: cd010fb66c5742fcb93e39f2adf9c95e
Score Type: COSINE
------
Text: Why do you feel like you have to figure out what life is alone?
Labels: ['confessions']
Score: 0.42643851041793823
Key: a002f3b28334454eb25562dd5d792009
Score Type: COSINE
------
Text: Such a sad way to think of life all

In [21]:
query_vectors = encoder.encode(["What is love?"])[0].tolist()
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.15)
   
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")


Text: Love has no number
Labels: ['teenagers']
Score: 0.4977535605430603
Key: f858239f65ac4556b6a19228874b6dbc
Score Type: COSINE
------
Text: “Love” under threat of execution. Wow.
Labels: ['IncelTears']
Score: 0.47561562061309814
Key: 8c72856d170a4e52857511a76440ad0d
Score Type: COSINE
------
Text: “Love” under threat of execution. Wow.
Labels: ['IncelTears']
Score: 0.47561562061309814
Key: 9017abc48df74adabe3f1215bdc0b2db
Score Type: COSINE
------
Text: I love you. I had no idea before now, but I love you.
Labels: ['TIHI']
Score: 0.453585147857666
Key: dec4561f48e44f188453ed51b701406e
Score Type: COSINE
------
Text: Pure unconsitional love can also be dangerous: if the other person turns out to be an abuser you should propably stop loving them
Labels: ['AskMen']
Score: 0.44075626134872437
Key: 9b7764040515468281a38bc2db23e835
Score Type: COSINE
------
Text: Or a lovegod That supports you
Labels: ['Tinder']
Score: 0.4096187949180603
Key: 79395bc8cff14f06abe4af1ee7d5f825
Score Type: C