## Install dependencies

In [None]:
! pip install -U nucliadb-sdk
! pip install -U sentence-transformers
! pip install datasets
! pip install InstructorEmbedding lxml bs4 gradio

## Setup NucliaDB

- Run **NucliaDB** image:
```bash
docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest
```
- Or install with pip and run:

```bash
pip install nucliadb
nucliadb
```

## Check everything's up and running

In [3]:
import requests

response = requests.get("http://0.0.0.0:8080")

assert response.status_code == 200, "Oops, it seems something is not properly installed"
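
If the server is still starting when this cell runs, a single request fails even though the install is fine. A minimal sketch of a retry loop using only the standard library; `is_up` and `wait_until_ready` are hypothetical helper names, not part of NucliaDB or its SDK:

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, attempts=10, delay=1.0):
    """Call `check()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Usage (assumes the server from the previous step listens on port 8080):
# assert wait_until_ready(lambda: is_up("http://0.0.0.0:8080")), "NucliaDB is not responding"
```

Passing the check as a callable keeps the retry logic independent of how the request is made, so the same loop works with `requests` as well.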

## Load our data

Load and explore the coffee dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("HuggingFaceGECLM/StackExchange_Mar2023", split="coffee")

In [181]:
dataset

Dataset({
    features: ['question_id', 'text', 'metadata', 'date', 'original_text'],
    num_rows: 1277
})

In [183]:
dataset[0]['original_text'][1]

'3483: <p><strong>In a cool, dry place, away from elements and preferably sealed / vacuum sealed</strong></p>\n\n<p><strong>No paper bags or canvas, something clean, and preferably with one of those moisture wicking buttons</strong></p>\n\n<p>My favorite thing is a "<a href="https://www.mysticmonkcoffee.com/collections/equipment/products/monk-coffee-vault-storage-black" rel="nofollow noreferrer">Coffee Vault</a>" from Mystic Monk.  It has two seals and and has a \'lock\' to the second seal.  I use to work in a place where you could smell things easily, my coworkers would love it every morning hearing the seal unlock and the vacuum seal taken off as they would smell the coffee as if it was a fresh roasted bag.  Every morning without fail.  I think they are made by AirScape or <a href="https://planetarydesign.com/shop/airscape-kitchen-canisters/" rel="nofollow noreferrer">Planetary Design</a> or something like that.</p>\n\n<p>For christmas this year my wife got me a <a href="http://www.z

## Clean our data

Filter and clean our data

In [187]:
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(dataset[0]['original_text'][1], "lxml").text
cleantext

'3483: In a cool, dry place, away from elements and preferably sealed / vacuum sealed\nNo paper bags or canvas, something clean, and preferably with one of those moisture wicking buttons\nMy favorite thing is a "Coffee Vault" from Mystic Monk.  It has two seals and and has a \'lock\' to the second seal.  I use to work in a place where you could smell things easily, my coworkers would love it every morning hearing the seal unlock and the vacuum seal taken off as they would smell the coffee as if it was a fresh roasted bag.  Every morning without fail.  I think they are made by AirScape or Planetary Design or something like that.\nFor christmas this year my wife got me a Zurich Coffee Vault , which is supposed to be the highest rated and best coffee vault.  Its really good, the coffee scoop I think is a big odd, and I like it, but to me its still nothing compared to the AirScape / Planetary Design Coffee Vault.\nI saw no paper bags or canvases, as those are easily affected by the element

In [188]:
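# Drop the leading "<id>:" prefix; partition splits on the first colon only, so colons inside a post are kept.
my_data = [BeautifulSoup(i['original_text'][1], "lxml").text.partition(":")[2].strip() for i in dataset]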
my_data[0]

'In a cool, dry place, away from elements and preferably sealed / vacuum sealed\nNo paper bags or canvas, something clean, and preferably with one of those moisture wicking buttons\nMy favorite thing is a "Coffee Vault" from Mystic Monk.  It has two seals and and has a \'lock\' to the second seal.  I use to work in a place where you could smell things easily, my coworkers would love it every morning hearing the seal unlock and the vacuum seal taken off as they would smell the coffee as if it was a fresh roasted bag.  Every morning without fail.  I think they are made by AirScape or Planetary Design or something like that.\nFor christmas this year my wife got me a Zurich Coffee Vault , which is supposed to be the highest rated and best coffee vault.  Its really good, the coffee scoop I think is a big odd, and I like it, but to me its still nothing compared to the AirScape / Planetary Design Coffee Vault.\nI saw no paper bags or canvases, as those are easily affected by the elements and 
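
Each answer in `original_text` starts with its numeric id followed by a colon (`3483: ...`). Splitting on every colon and re-joining the pieces would also delete colons inside the post itself (times, ratios, URLs), so it is safer to cut at the first colon only. A small standard-library sketch of that idea; `strip_answer_id` is a hypothetical helper name and the sample string is made up:

```python
def strip_answer_id(text):
    """Drop a leading '<id>:' prefix, keeping any later colons in the text."""
    prefix, sep, rest = text.partition(":")
    # Only strip when the part before the first colon is a numeric id.
    if sep and prefix.strip().isdigit():
        return rest.strip()
    return text.strip()


sample = "3483: Brew at 9:30, ratio 1:16"
print(strip_answer_id(sample))  # Brew at 9:30, ratio 1:16
```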

## Load the models to generate embeddings

In this case we are using Instructor.

Instructor is an instruction-tuned text embedding model: each input is paired with a short instruction describing the task, so the same model can produce embeddings tailored to queries or to posts.

In [189]:
from InstructorEmbedding import INSTRUCTOR
model_instructor = INSTRUCTOR('hkunlp/instructor-base')
instruction_query = "Represent the question for retrieving relevant coffee related posts:"
instruction_posts = "Represent the coffee related post for retrieval:"

load INSTRUCTOR_Transformer
max_seq_length  512


## Upload our data to NucliaDB



First we create a Knowledge Box (KB) to store our data

In [201]:
from nucliadb_sdk import *
sdk = NucliaSDK(url="http://0.0.0.0:8080/api", region=Region.ON_PREM)
kb = sdk.create_knowledge_box(slug="my_coffee_kb")

A couple of helper functions to upload our data to the KB

In [202]:
def User_vector(vector_name, vectors):
    # Build the user-vector payload for the "text" field; `vectors` is a NumPy array.
    return {"field": {"field": "text", "field_type": "text"},
            "vectors": {vector_name: {"vectors": {"vector": vectors.tolist()}}}}

In [203]:
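def batch_upload(data, vector_name, kb, sdk):
    # Encode the whole batch in one call, pairing each post with the post instruction.
    vectors = model_instructor.encode([[instruction_posts, row] for row in data])
    # Create one resource per post, attaching its text and its embedding.
    for i, row in enumerate(data):
        sdk.create_resource(
            kbid=kb.uuid,
            texts={"text": {"body": row}},
            uservectors=[User_vector(vector_name, vectors[i])],
        )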

## Uploading!

Here you can change the batch size

In [204]:
batch_size = 100
data_size = len(my_data)
vector_name = "instructor"
for i in range(0, data_size, batch_size):
    print(f"Uploading batch {(i // batch_size) + 1}")
    batch_upload(my_data[i:i + batch_size], vector_name, kb, sdk)

Uploading batch 1
Uploading batch 2
Uploading batch 3
Uploading batch 4
Uploading batch 5
Uploading batch 6
Uploading batch 7
Uploading batch 8
Uploading batch 9
Uploading batch 10
Uploading batch 11
Uploading batch 12
Uploading batch 13
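
The loop walks through `my_data` in slices of `batch_size`, so the 1277 posts go up in 13 batches, the last one holding the remaining 77. The same slicing isolated as a generator; `batched` is a hypothetical helper name:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python; the last batch is just shorter.
        yield items[start:start + batch_size]


rows = list(range(1277))  # stand-in for the 1277 cleaned posts
batches = list(batched(rows, 100))
print(len(batches), len(batches[-1]))  # 13 77
```

A larger batch means fewer calls to `encode` but more memory per call, so the right value depends on the machine running the model.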


## And now what?

Now let's do some searches to see what we get!


In [219]:
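from nucliadb_models.search import ResourceProperties, SearchOptions

def search(query, kb):
    # Embed the query with the query instruction and run a vector-only search.
    results = sdk.search(
        kbid=kb.uuid,
        vector=model_instructor.encode([[instruction_query, query]])[0].tolist(),
        vectorset="instructor",
        min_score=0.4,
        features=[SearchOptions.VECTOR],
        show=[ResourceProperties.BASIC, ResourceProperties.VALUES]
    )
    # Pair each hit's score with the full text of the resource it belongs to.
    return [(i.score, results.resources[i.rid].data.texts["text"].value.body) for i in results.sentences.results]
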

In [220]:
def print_results(results):
    for i, result in enumerate(results):
        print(f"----- RESULT {i+1} -----")
        print("Similarity score:",'%.2f' %result[0])
        print("Result:",'%.800s' %result[1],"...\n")
    

In [221]:
query = "What is a macchiato?"
print_results(search(query,kb))

----- RESULT 1 -----
Similarity score: 0.92
Result: The problem with ordering a "Macchiato" is that it's only half of the name of a traditional coffee drink.
"Macchiato" means "stained" or "flecked" and not more.
In Italy, you have either 

Latte macchiato *Milk "stained" with coffee (more precisely caffè or espresso) or
Espresso macchiato Espresso (or caffè) "stained" with a spoonful of milk foam.

If you order a "Macciato" without qualifier, you will either get what the chain defines as such (for Starbucks, a lot of milk and foam with one or multiple shots of espresso) or what your barista thinks of first - and that would often be a (latte) macchiato, i.e. a glass of milk and foam with an espresso layer inbetween. 
So my recommendation is
If you want a serving of espresso with a bit of milk, order an espresso macchiato. I just double-chec ...

----- RESULT 2 -----
Similarity score: 0.88
Result: From what I have found the only rule to make latte macchiato is to use at least 200ml of m

## What about the generative part?

We are going to use Nuclia's endpoint for generative AI, but feel free to integrate any other!


In [206]:
import requests
import json

def get_answer(question, context):
    # API endpoint URL
    url = "https://europe-1.nuclia.cloud/api/v1/predict/chat"
    # Fill these in with your own NUA key and user id before running
    nua_key = ""
    user_id = ""
    # Request parameters
    params = {"model": "chatgpt"}

    query = f'Answer the following question based on the provided context. IMPORTANT, if you cannot answer the question with the context, do not generate an answer, say "Not enough data to answer this". Question: {question}'

    # Request body: the retrieved posts are passed as context, so retrieval is disabled
    data = {
        "question": query,
        "retrieval": False,
        "user_id": user_id,
        "system": "You are a helpful assistant",
        "context": [{"author": "USER", "text": context}]
    }
    headers = {"x-stf-nuakey": f"Bearer {nua_key}"}
    json_data = json.dumps(data)

    response = requests.post(url, params=params, data=json_data, headers=headers)
    return response
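
`nua_key` and `user_id` are empty in the cell above and need real values before the request can succeed. One way to avoid pasting a key into the notebook is to read it from the environment. A minimal sketch; the variable name `NUA_KEY` and the helper `nua_headers` are assumptions for illustration, while the header name is the one used in `get_answer`:

```python
import os


def nua_headers(env=None):
    """Build the auth header for the predict endpoint from an environment mapping."""
    env = os.environ if env is None else env
    key = env.get("NUA_KEY", "")  # hypothetical variable name
    if not key:
        raise RuntimeError("Set NUA_KEY before calling the generative endpoint")
    return {"x-stf-nuakey": f"Bearer {key}"}


# Example with an explicit mapping instead of the real environment:
print(nua_headers({"NUA_KEY": "my-key"}))  # {'x-stf-nuakey': 'Bearer my-key'}
```

Failing early with a clear message is easier to debug than an authentication error coming back from the endpoint.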


In [214]:
context="""The problem with ordering a "Macchiato" is that it's only half of the name of a traditional coffee drink.
"Macchiato" means "stained" or "flecked" and not more.
In Italy, you have either 

Latte macchiato *Milk "stained" with coffee (more precisely caffè or espresso) or
Espresso macchiato Espresso (or caffè) "stained" with a spoonful of milk foam.

If you order a "Macciato" without qualifier, you will either get what the chain defines as such (for Starbucks, a lot of milk and foam with one or multiple shots of espresso) or what your barista thinks of first - and that would often be a (latte) macchiato, i.e. a glass of milk and foam with an espresso layer inbetween. 
So my recommendation is If you want a serving of espresso with a bit of milk, order an espresso macchiato. I just double-chec ...
From what I have found the only rule to make latte macchiato is to use at least 200ml of milk for a 30ml espresso shot. 
Clearly this is mostly defined by the volume of the glass the coffee is served in and that will definitely vary from shop to shop. 
I would expect the coffee shops to use standart espresso shots but no one can be sure about this either"""

response=get_answer("What is a Macchiato",context)
response.content.decode()

'Macchiato is a traditional coffee drink that means "stained" or "flecked" in Italian. In Italy, there are two types of macchiato: Latte macchiato and Espresso macchiato. Latte macchiato is milk "stained" with coffee, while Espresso macchiato is espresso "stained" with a spoonful of milk foam.'

## Now let's make this pretty!!

Let's integrate everything and make it pretty with gradio :)

In [142]:
from nucliadb_models.search import ResourceProperties, SearchOptions

def get_context(question):
    results = sdk.search(
        kbid=kb.uuid,
        vector=model_instructor.encode([[instruction_query,question]])[0].tolist(),
        vectorset="instructor",
        min_score=0.4,
        features=[SearchOptions.VECTOR],
        show=[ResourceProperties.BASIC, ResourceProperties.VALUES]
    )
    context_text="\n".join([results.resources[i.rid].data.texts["text"].value.body for i in results.sentences.results])[0:3000]
    return context_text
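
`get_context` joins every retrieved post and cuts the result at 3000 characters, which can end the context in the middle of a sentence. A sketch of one alternative that keeps whole passages until the budget is used up; `build_context` and `max_chars` are hypothetical names, not part of the notebook or the SDK:

```python
def build_context(passages, max_chars=3000, sep="\n"):
    """Join passages in ranked order, stopping before the character budget is exceeded."""
    chosen = []
    used = 0
    for passage in passages:
        extra = len(passage) + (len(sep) if chosen else 0)
        if used + extra > max_chars:
            break
        chosen.append(passage)
        used += extra
    if not chosen and passages:
        # Nothing fits whole: fall back to truncating the best-ranked passage.
        return passages[0][:max_chars]
    return sep.join(chosen)


print(repr(build_context(["aaaa", "bbbb", "cccc"], max_chars=9)))  # 'aaaa\nbbbb'
```

Because the passages arrive ordered by score, stopping at the first one that does not fit keeps the most relevant material.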


In [178]:
import gradio as gr

def greet(question):
    context=get_context(question)
    answer=get_answer(question,context)
    return answer.content.decode()

demo = gr.Interface(fn=greet, inputs="text", outputs="text")

demo.launch()   

Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.




## What else now?

**Choose your own adventure (CYOA)!**

Possible improvements:
* Make your dataset bigger 
* Tune your prompt
* Try another dataset
* Improve the context
* Try another retrieval model

**NucliaDB** also provides full-text search, so combining the results of both searches with a heuristic could be interesting.
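
The semantic scores seen earlier stay below 1, while full-text scores are not bounded the same way, so adding the two directly would let one search dominate. One common heuristic that uses only the rank of each hit is reciprocal rank fusion (RRF). A sketch under those assumptions, not something this notebook or the SDK calls; `rrf` and the post ids are illustrative, and `k=60` is the value conventionally used with RRF:

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["post_a", "post_b", "post_c"]  # ids ranked by vector search
fulltext = ["post_b", "post_d", "post_a"]  # ids ranked by full-text search
print(rrf([semantic, fulltext]))  # ['post_b', 'post_a', 'post_d', 'post_c']
```

Posts found by both searches rise to the top, and posts found by only one are still kept further down the list.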



## Tip
Example of how to get full-text search results:

In [224]:
def fulltext_search(query, kb):
    results = sdk.search(
        kbid=kb.uuid,
        query=query,
        show=[ResourceProperties.BASIC, ResourceProperties.VALUES]
    )
    return [(i.score, results.resources[i.rid].data.texts["text"].value.body) for i in results.fulltext.results]

query = "What is a macchiato?"
print_results(fulltext_search(query, kb))

----- RESULT 1 -----
Similarity score: 13.75
Result: The problem with ordering a "Macchiato" is that it's only half of the name of a traditional coffee drink.
"Macchiato" means "stained" or "flecked" and not more.
In Italy, you have either 

Latte macchiato *Milk "stained" with coffee (more precisely caffè or espresso) or
Espresso macchiato Espresso (or caffè) "stained" with a spoonful of milk foam.

If you order a "Macciato" without qualifier, you will either get what the chain defines as such (for Starbucks, a lot of milk and foam with one or multiple shots of espresso) or what your barista thinks of first - and that would often be a (latte) macchiato, i.e. a glass of milk and foam with an espresso layer inbetween. 
So my recommendation is
If you want a serving of espresso with a bit of milk, order an espresso macchiato. I just double-chec ...

----- RESULT 2 -----
Similarity score: 11.04
Result: From what I have found the only rule to make latte macchiato is to use at least 200ml of