# Showcase

#### This is a simple tutorial that walks through vexpresso's capabilities

Imports

In [1]:
import vexpresso
import numpy as np
from vexpresso.retrievers import Retriever, FaissRetriever

## Collection Creation

#### First we'll create some sample data. Here we're using just strings, but because `vexpresso` uses `daft`, you can use any datatype!

In [2]:
data = {
    "status": ["read", "unread", "read", "unread", "read", "unread", "read", "unread"],
    "documents": ["A document that discusses domestic policy", "A document that discusses international affairs", "A document that discusses kittens", "A document that discusses dogs", "A document that discusses chocolate", "A document that is sixth that discusses government", "A document that discusses international affairs", "A document that discusses global affairs"],
    "ids": ["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"],
    "numbers": list(range(1,9))
}

#### To create the collection, use the `create` method. By default this uses lazy execution, meaning that no data is actually loaded until `execute` or `show` is called (or unless `lazy=False` is passed)

In [3]:
collection = vexpresso.create(data=data)
collection

2023-06-20 11:27:31.621 | INFO     | daft.context:runner:80 - Using PyRunner


+----------+----------------------+--------+-----------+
| status   | documents            | ids    |   numbers |
| Utf8     | Utf8                 | Utf8   |     Int64 |
| read     | A document that      | id1    |         1 |
|          | discusses domestic   |        |           |
|          | policy               |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that      | id2    |         2 |
|          | discusses            |        |           |
|          | international        |        |           |
|          | affairs              |        |           |
+----------+----------------------+--------+-----------+
| read     | A document that      | id3    |         3 |
|          | discusses kittens    |        |           |
+----------+----------------------+--------+-----------+
| unread   | A document that      | id4    |         4 |
|          | discusses dogs       |        |           |
+----------+----------------------+--------+-----------+

### If you want to operate directly

### Vexpresso also works on clusters with Ray!

```python
collection = vexpresso.create(data=data, backend="ray", cluster_address=..., cluster_kwargs=...)
```

#### Let's see what's in the collection now!

In [4]:
collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64
read,A document that discusses domestic policy,id1,1
unread,A document that discusses international affairs,id2,2
read,A document that discusses kittens,id3,3
unread,A document that discusses dogs,id4,4
read,A document that discusses chocolate,id5,5


#### vexpresso's `Collection` methods return `Collection` objects, allowing for complex chaining of calls

## Embed Data

#### Let's add a list of dummy vectors to represent embeddings!

In [5]:
embeddings = [
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
]

In [6]:
collection = collection.add_column("embeddings_documents", embeddings)

#### By default vexpresso is "lazy", meaning that nothing is executed until `.execute` is called
Note: this can be bypassed by passing `lazy=False`

```python
collection = collection.add_column("embeddings_documents", embeddings, lazy=False)
```
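
To make the lazy behaviour concrete, here is a minimal pure-Python sketch of the general idea: operations are only recorded, and nothing runs until `execute` is called. `LazyTable` and its methods are hypothetical names made up for this illustration; they are not part of vexpresso or daft.

```python
class LazyTable:
    """Toy stand-in for a lazy collection: records steps, runs them on execute()."""

    def __init__(self, columns, steps=()):
        self.columns = columns      # dict of column name -> list of values
        self.steps = tuple(steps)   # pending operations, not yet applied

    def add_column(self, name, values):
        # Return a new object with one more pending step; nothing is computed yet.
        step = lambda cols: {**cols, name: list(values)}
        return LazyTable(self.columns, self.steps + (step,))

    def execute(self):
        # Apply every recorded step in order and return a materialized table.
        cols = dict(self.columns)
        for step in self.steps:
            cols = step(cols)
        return LazyTable(cols)


table = LazyTable({"ids": ["id1", "id2"]})
lazy = table.add_column("embeddings", [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]])
print(len(lazy.steps), sorted(lazy.columns))   # pending step count, columns so far
done = lazy.execute()
print(len(done.steps), sorted(done.columns))   # after execution
```

Before `execute`, the new column exists only as a pending step; afterwards it is part of the table's data.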

In [8]:
collection

+----------+-------------+--------+-----------+------------------------+
| status   | documents   | ids    | numbers   | embeddings_documents   |
| Utf8     | Utf8        | Utf8   | Int64     | List[item:Float64]     |
+----------+-------------+--------+-----------+------------------------+
(No data to display: Dataframe not materialized)

#### Let's execute it (`.collect` also works) to get the embeddings

In [9]:
collection = collection.execute()

In [10]:
collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]"
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]"
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]"
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]"
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]"


### Let's take a look at the embeddings

We can easily grab the raw data in the form of a dictionary or a list

In [11]:
collection.to_dict()["embeddings_documents"][:3]

[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]

In [12]:
collection["embeddings_documents"].to_list()[:3]

[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]]

## Query

Normally, we would query with the same embedding function we used to embed the content. But since we entered the embeddings manually, let's create a simple embedding function that just returns an array of zeros
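
A minimal sketch of such a function, assuming an embedding function receives a list of strings and returns one vector per string. The three-dimensional size is chosen to match the dummy vectors above, and the exact signature vexpresso expects is an assumption here.

```python
def embed_fn(texts):
    """Dummy embedding function: one 3-dimensional zero vector per input string."""
    # Assumption: a list of strings comes in and a list of vectors goes out.
    return [[0.0, 0.0, 0.0] for _ in texts]


print(embed_fn(["test_1", "test_2"]))
```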

#### As you can see, we now have an `embeddings_documents` column. Let's query it and return the top 5 results!

In [14]:
queried = collection.query("embeddings_documents", query="test", embedding_fn=embed_fn, k=5, return_scores=True).execute()

#### We can see the actual similarity scores in the `embeddings_documents_score` column

In [15]:
queried.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64],embeddings_documents_score Float64
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]",0.976796
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]",0.976796
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]",0.976796
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]",0.976796
read,A document that discusses international affairs,id7,7,"[1.1, 2.3, 3.2]",0.931368
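
The tutorial does not say which similarity metric produces these scores. As a rough illustration only, assuming cosine similarity and a hypothetical query vector `[1.0, 1.0, 1.0]` (not the vector actually used above), a top-k ranking can be sketched in plain Python:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(vectors)]
    # sorted() is stable, so tied scores keep their original row order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


vectors = [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]] * 4   # same pattern as the dummy embeddings
hypothetical_query = [1.0, 1.0, 1.0]               # illustrative only
for index, score in top_k(hypothetical_query, vectors, k=5):
    print(index, round(score, 4))
```

With this hypothetical query, the `[4.5, 6.9, 4.4]` rows rank above the `[1.1, 2.3, 3.2]` rows, which is the same ordering pattern as in the output above.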


You can also query with embeddings directly! You can just call the embedding function yourself, but we recommend using the collection object's `embed_query` method for embedding functions that may require resources (like GPUs) or if you want to run them on a Ray cluster

In [18]:
# you can just call the embedding function
embed_query = embed_fn(["test_1"])

In [19]:
embed_query = collection.embed_query("test1", embedding_fn=collection.embedding_functions["embeddings_documents"])

# you can also just pass in the string column name
embed_query = collection.embed_query("test1", embedding_fn="embeddings_documents")

queried = collection.query("embeddings_documents", query_embedding=embed_query, k=5).execute()
queried.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]"
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]"
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]"
read,A document that discusses international affairs,id7,7,"[1.1, 2.3, 3.2]"


We can also get a list of embeddings with `embed_queries`

In [20]:
embed_query = collection.embed_queries(["test1", "test2"], embedding_fn=collection.embedding_functions["embeddings_documents"])
print(embed_query)

[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]


#### Sometimes you will want to batch queries together into a single call. vexpresso has a convenient `batch_query` function. This will return a list of Collections

In [21]:
queries = ["test_1", "test_5", "test_10"]

In [22]:
batch_queried = collection.batch_query("embeddings_documents", queries=queries, k=2)

#### We now have collections for each query

In [23]:
batch_queried[0].show(2)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]"


In [24]:
batch_queried[1].show(2)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]"


In [25]:
batch_queried[2].show(2)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]"


## Filtering

#### With `vexpresso`, filtering is super easy. The syntax is similar to `chromadb`

#### Filter dictionary must have the following structure:

```python
{
    <field>: {
        <filter_method>: <value>
    },
    <field>: {
        <filter_method>: <value>
    },
}

```
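
To illustrate how a filter dictionary of this shape can be interpreted, here is a small pure-Python sketch. `apply_filter` and the `OPERATORS` table are hypothetical helpers written for this example, not vexpresso internals; the method names `gt`, `lte`, `eq`, and `isin` are the ones used in this tutorial.

```python
import operator

# Map filter-method names to two-argument predicates: (value, target) -> bool.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lte": operator.le,
    "isin": lambda value, options: value in options,
}


def apply_filter(rows, conditions):
    """Keep the rows (dicts) that satisfy every <field>: {<method>: <value>} condition."""
    def matches(row):
        return all(
            OPERATORS[method](row[field], target)
            for field, methods in conditions.items()
            for method, target in methods.items()
        )
    return [row for row in rows if matches(row)]


# Eight toy rows mirroring the sample data: odd numbers are "read", even are "unread".
rows = [
    {"ids": f"id{n}", "numbers": n, "status": "read" if n % 2 else "unread"}
    for n in range(1, 9)
]
kept = apply_filter(rows, {"numbers": {"lte": 4}, "status": {"eq": "read"}})
print([row["ids"] for row in kept])
```

In this sketch, conditions on different fields are combined with a logical AND, which is how the multi-condition example further down behaves.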

Let's filter the original collection to only include rows with `numbers` > 2

In [39]:
filtered_collection = collection.filter(
    {
        "numbers":{
            "gt":2
        }
    }
).execute()

In [40]:
filtered_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]"
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]"
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]"
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
read,A document that discusses international affairs,id7,7,"[1.1, 2.3, 3.2]"


#### We can use multiple filter conditions as well
Let's filter the collection to only return rows with numbers <= 4 and status == "read"

In [61]:
filtered_collection = collection.filter(
    {
        "numbers":{
            "lte":4
        },
        "status":{
            "eq":"read"
        }
        
    }
).execute()

In [62]:
filtered_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]"
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]"


#### Sometimes you need a custom filtering function. With vexpresso it's easy to do that with the `custom` filter keyword!
Let's filter the collection to only return rows with even `numbers` and `ids` in `["id1", "id2", "id4"]`

In [63]:
def custom_filter(number, mod_val) -> bool:
    return number % mod_val == 0

In [64]:
filtered_collection = collection.filter(
    {
        "numbers":{
            "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
        },
        "ids":{
            "isin":["id1", "id2", "id4"]
        }
    }
).execute()

In [65]:
filtered_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]"
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]"


#### You can also combine filters + queries in the same call

Let's query the collection with "test" and keep only even numbers

In [67]:
even_filter = {
    "numbers":{
        "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
    }
}

In [68]:
query_filtered_collection = collection.query("embeddings_documents", "test", k=10, filter_conditions=even_filter).execute()

In [69]:
query_filtered_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]"
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]"
unread,A document that is sixth that discusses government,id6,6,"[4.5, 6.9, 4.4]"
unread,A document that discusses global affairs,id8,8,"[4.5, 6.9, 4.4]"


## Chaining Functions

#### We can easily chain functions lazily

For instance, let's query and filter multiple times

In [70]:
even_filter = {
    "numbers":{
        "custom":{"function":custom_filter, "function_kwargs":{"mod_val":2}}
    }
}

In [71]:
chained_collection = collection.query("embeddings_documents", "test1", k=5) \
                               .filter(even_filter) \
                               .query("embeddings_documents", "test2", k=2) \
                               .filter({"numbers":{"lte":3}})

In [72]:
chained_collection.daft_df

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]


Here we queried for the closest 5 elements to "test1", filtered for only even numbers, queried top 2 of "test2", then filtered for numbers <= 3
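
Because each step operates on the output of the previous one, the order of the chain matters. A small pure-Python sketch with made-up scores shows the effect; the `rows`, `top_k`, and `keep_even` names are hypothetical and only illustrate the idea.

```python
rows = [
    {"numbers": 1, "score": 0.9},
    {"numbers": 2, "score": 0.5},
    {"numbers": 3, "score": 0.8},
    {"numbers": 4, "score": 0.4},
]


def top_k(items, k):
    """Keep the k highest-scoring rows."""
    return sorted(items, key=lambda row: row["score"], reverse=True)[:k]


def keep_even(items):
    """Keep rows whose `numbers` value is even."""
    return [row for row in items if row["numbers"] % 2 == 0]


query_then_filter = keep_even(top_k(rows, 2))   # the top 2 rows are odd, so none survive
filter_then_query = top_k(keep_even(rows), 2)   # even rows are selected first, then ranked

print([row["numbers"] for row in query_then_filter])
print([row["numbers"] for row in filter_then_query])
```

Querying first and filtering afterwards can leave fewer than `k` rows, while filtering first ranks only the rows that already passed the filter.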

In [75]:
chained_collection = chained_collection.execute()

In [76]:
chained_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64]
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]"


## Transforms

#### Sometimes you want to transform your data. Because of `daft`, you can use `vexpresso` to do this easily! 

#### For example, let's add a new column named "replaced", where we change "document" to "replaced_document" in the `documents` column. Let's specify that this output is also a string type

For a full list of datatypes, visit daft documentation: https://www.getdaft.io/projects/docs/en/latest/api_docs/datatype.html

In [82]:
def simple_apply_fn(strings):
    return [
        s.replace("document", "replaced_document") for s in strings
    ]

In [83]:
transformed_collection = collection.apply(simple_apply_fn, collection["documents"], to="replaced", datatype=vexpresso.DataType.string()).execute()

In [84]:
transformed_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64],replaced Utf8
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]",A replaced_document that discusses domestic policy
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]",A replaced_document that discusses international affairs
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]",A replaced_document that discusses kittens
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]",A replaced_document that discusses dogs
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]",A replaced_document that discusses chocolate


#### We can also pass in args, kwargs, and multiple columns into the apply function

For instance, let's append the number in the `numbers` column to each document in `documents`

In [86]:
def multi_column_apply_fn(string_columns, numbers):
    out = []
    for string, num in zip(string_columns, numbers):
        replaced = f"{string}_{num}"
        out.append(replaced)
    return out

In [87]:
transformed_collection = collection.apply(
    multi_column_apply_fn,
    collection["documents"],
    numbers=collection["numbers"],
    to="modified",
    datatype=vexpresso.DataType.string()
).execute()

In [88]:
transformed_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64],modified Utf8
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]",A document that discusses domestic policy_1
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]",A document that discusses international affairs_2
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]",A document that discusses kittens_3
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]",A document that discusses dogs_4
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]",A document that discusses chocolate_5


## Adding data

## Saving + Loading

#### Once you've done a bunch of processing on a collection, you probably want to save it somewhere. Vexpresso supports saving to local files and to Hugging Face datasets

Let's save the `transformed_collection` above to the directory `saved_transformed_collection`

In [89]:
transformed_collection.save("./saved_collection/saved_transformed_collection")

saving to ./saved_collection/saved_transformed_collection


We can then load the collection with the same `create` function. Make sure to also include the embedding functions that were used on the original collection!

In [90]:
loaded_collection = vexpresso.create(
    directory_or_repo_id = "./saved_collection/saved_transformed_collection",
    embedding_functions = {"embeddings_documents":embed_fn}
)

In [91]:
loaded_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64],modified Utf8
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]",A document that discusses domestic policy_1
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]",A document that discusses international affairs_2
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]",A document that discusses kittens_3
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]",A document that discusses dogs_4
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]",A document that discusses chocolate_5


#### Now let's upload to Hugging Face!

For this you'll need to install `huggingface-hub`

In [51]:
# !pip install huggingface-hub

The token is read automatically from the `HUGGINGFACEHUB_API_TOKEN` environment variable, or you can pass it in directly via `collection.save(token=...)`
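
A minimal sketch of the token lookup described above, assuming an explicitly passed token takes precedence over the environment variable. Both the precedence and the `resolve_token` helper are assumptions made for illustration; this is not vexpresso code.

```python
import os


def resolve_token(token=None, environ=None):
    """Return the explicit token if given, else HUGGINGFACEHUB_API_TOKEN, else None."""
    environ = os.environ if environ is None else environ
    if token is not None:
        return token
    return environ.get("HUGGINGFACEHUB_API_TOKEN")


# A fake environment keeps the example independent of the real one.
fake_env = {"HUGGINGFACEHUB_API_TOKEN": "token-from-env"}
print(resolve_token(environ=fake_env))
print(resolve_token("token-passed-directly", environ=fake_env))
```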

In [92]:
username = "shyamsn97"
repo_name = "vexpresso_test_showcase"
# username = "REPLACE"
# repo_name = "REPLACE"

In [93]:
loaded_collection.save(hf_username=username, repo_name=repo_name, to_hub=True)

Uploading collection to None



content.parquet: 100%|█████████████████████████████████████████████| 3.23k/3.23k [00:00<00:00, 6.79kB/s]

Upload 1 LFS files: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.51it/s]


Upload to shyamsn97/vexpresso_test_showcase complete!


'shyamsn97/vexpresso_test_showcase'

The uploaded repo is private by default, but this can be changed with the `private` flag

In [54]:
# loaded_collection.save(hf_username = username, repo_name = repo_name, to_hub=True, private=False)

You can see an example of the above data: https://huggingface.co/datasets/shyamsn97/vexpresso_test_showcase

#### Now let's load it!

In [94]:
loaded_collection = vexpresso.create(
    hf_username = username,
    repo_name = repo_name,
    embedding_functions = {"embeddings_documents":embed_fn}
)

Retrieving from hf repo: shyamsn97/vexpresso_test_showcase


Fetching 2 files:  50%|█████████████████████████▌                         | 1/2 [00:00<00:00,  9.84it/s]
Downloading content.parquet: 100%|██████████████████████████████████| 3.23k/3.23k [00:00<00:00, 153kB/s]
Fetching 2 files: 100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31it/s]


In [95]:
loaded_collection.show(5)

status Utf8,documents Utf8,ids Utf8,numbers Int64,embeddings_documents List[item:Float64],modified Utf8
read,A document that discusses domestic policy,id1,1,"[1.1, 2.3, 3.2]",A document that discusses domestic policy_1
unread,A document that discusses international affairs,id2,2,"[4.5, 6.9, 4.4]",A document that discusses international affairs_2
read,A document that discusses kittens,id3,3,"[1.1, 2.3, 3.2]",A document that discusses kittens_3
unread,A document that discusses dogs,id4,4,"[4.5, 6.9, 4.4]",A document that discusses dogs_4
read,A document that discusses chocolate,id5,5,"[1.1, 2.3, 3.2]",A document that discusses chocolate_5
