# Use dot-product similarity on vectors for weighted ranking

## Goal

Each entry has a fixed list of numeric attributes $a_i$, $i = 1, \ldots, N_{attr}$, each lying on a known 0-to-max scale.

We want to query and sort by a "score" $S(\{a_i\})$ that is a weighted combination of these attributes. The score definition (that is, the set of weights) can be different each time the query is run.

$$
S \equiv \sum_{i=1}^{N_{attr}} w_i \cdot a_i
$$

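(A note on normalization: we want the maximum theoretical score, i.e. the score of a hypothetical row with all attributes at their maximum value, to equal the sum of the weights. For this to hold, each $a_i$ in the sum is taken relative to its maximum, i.e. divided by the top of its scale.)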
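As a minimal sketch of this normalization (plain Python, no database involved; the function name `weighted_score` and the sample numbers are illustrative assumptions), each attribute is divided by its maximum before being weighted, so that a row with every attribute at its maximum scores the sum of the weights:

```python
def weighted_score(attributes, weights, max_values):
    """Sum of w_i * (a_i / max_i) over all attributes."""
    return sum(
        w * (a / a_max)
        for a, w, a_max in zip(attributes, weights, max_values)
    )


max_values = [10, 10, 10]
weights = [0.5, 0.25, 0.25]

# a hypothetical row with every attribute at its maximum
best_possible = weighted_score(max_values, weights, max_values)
# an arbitrary row
some_row = weighted_score([4, 10, 2], weights, max_values)

print(best_possible, sum(weights))
print(some_row)
```

Equivalently, the division by the maximum can be moved onto the weights, which is convenient when the attributes are stored unscaled.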

### Sample data

(We take all ability scores to lie in the 0-10 range here.)

```
Strength,     measuring physical power
Dexterity,    measuring agility
Constitution, measuring endurance
Intelligence, measuring reasoning and memory
Wisdom,       measuring perception and insight
Charisma,     measuring force of personality
```

In [1]:
characters0 = [
    {
        "name": "Gondolf",
        "description": "A wizard with long nose and a sardonic smile",
        "abilities_map": {
            "CHA": 8,
            "CON": 3,
            "DEX": 5,
            "INT": 10,
            "STR": 4,
            "WIS": 9,
        },
    },
    {
        "name": "Bargul",
        "description": "A mighty brute, able to smash rocks with his forehead.",
        "abilities_map": {
            "CHA": 3,
            "CON": 9,
            "DEX": 4,
            "INT": 2,
            "STR": 10,
            "WIS": 3,
        },
    },
    {
        "name": "Zittur",
        "description": "A lightweight fairy, capable of slipping in unseen and making herself unnoticed.",
        "abilities_map": {
            "CHA": 9,
            "CON": 4,
            "DEX": 9,
            "INT": 7,
            "STR": 3,
            "WIS": 5,
        },
    },
]

## Setup & Secrets

In [2]:
!pip install --quiet "cassio>=0.1.3"

In [3]:
import os
from getpass import getpass

import cassio

In [4]:
if 'ASTRA_DB_ID' not in os.environ:
    os.environ["ASTRA_DB_ID"] = input("ASTRA_DB_ID = ")

if 'ASTRA_DB_APPLICATION_TOKEN' not in os.environ:
    os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")

if 'ASTRA_DB_KEYSPACE' not in os.environ:
    ks = input("(Optional) ASTRA_DB_KEYSPACE = ")
    if ks:
        os.environ["ASTRA_DB_KEYSPACE"] = ks

## Create the store

In [5]:
cassio.init(
    database_id=os.environ["ASTRA_DB_ID"],
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    keyspace=os.environ.get("ASTRA_DB_KEYSPACE"),
)

In [6]:
session, keyspace = cassio.config.resolve_session(), cassio.config.resolve_keyspace()

### The vector table

Six abilities means a vector of six components.

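The index on this column uses the dot-product similarity (`dot_product`), so that ranking rows by similarity to a weight vector amounts to ranking them by the weighted score $S$.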
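As a sanity check on using the dot product, here is a brute-force sketch in plain Python (no database involved; the helper names `dot` and `rank_by_dot` are hypothetical, and the component order CHA, CON, DEX, INT, STR, WIS is simply one fixed choice). The dot product between a stored vector and the rescaled weights is the weighted score, and sorting by it gives the exact ranking that an approximate-nearest-neighbor search would approximate:

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def rank_by_dot(rows, query_vector, n=1):
    """Exact top-n by dot product, highest first."""
    scored = [(name, dot(vec, query_vector)) for name, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# sample vectors, components ordered as CHA, CON, DEX, INT, STR, WIS
rows = {
    "Gondolf": [8, 3, 5, 10, 4, 9],
    "Bargul": [3, 9, 4, 2, 10, 3],
    "Zittur": [9, 4, 9, 7, 3, 5],
}

# weights favouring STR, each divided by the max ability value (10)
weights = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
query_vector = [w / 10 for w in weights]

top = rank_by_dot(rows, query_vector, n=1)
print(top)
```

With only three rows the exact scan is trivial; the vector index is what makes the same ordering available on large tables.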

In [7]:
CREATE_TABLE_STATEMENT = f"""
CREATE TABLE IF NOT EXISTS {keyspace}.fantasy_simple (
    name TEXT PRIMARY KEY,
    description TEXT,
    abilities VECTOR<FLOAT, 6>,
);
"""

session.execute(CREATE_TABLE_STATEMENT)

<cassandra.cluster.ResultSet at 0x7fc37c258f40>

In [8]:
CREATE_INDEX_STATEMENT = f"""
CREATE CUSTOM INDEX IF NOT EXISTS idx_abilities
    ON {keyspace}.fantasy_simple (abilities)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""
# Note: the doubled '{{' and '}}' are just the f-string escape sequences for '{' and '}'

session.execute(CREATE_INDEX_STATEMENT)

<cassandra.cluster.ResultSet at 0x7fc37c16ec40>

## Write data

In [9]:
# We need a global, unambiguous ordering scheme
ABILITY_LIST = ["CHA", "CON", "DEX", "INT", "STR", "WIS"]

def abilities_to_vector(a_dict):
    return [a_dict[ability] for ability in ABILITY_LIST]

def abilities_to_map(a_vector):
    return {
        ability_key: ability_value
        for ability_key, ability_value in zip(ABILITY_LIST, a_vector)
    }

print(f"=> {characters0[1]['name']}: {abilities_to_vector(characters0[1]['abilities_map'])}")
print(f"=> {abilities_to_map([1,2,3,4,5,6])}")

=> Bargul: [3, 9, 4, 2, 10, 3]
=> {'CHA': 1, 'CON': 2, 'DEX': 3, 'INT': 4, 'STR': 5, 'WIS': 6}
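Since `abilities_to_vector` looks up every key of `ABILITY_LIST`, a map that omits an ability raises a `KeyError`. If partial weight maps are desirable, one possible approach (a sketch only; `weights_with_defaults` is a hypothetical helper, not part of the code above) is to fill in the missing abilities with zero first:

```python
ABILITY_LIST = ["CHA", "CON", "DEX", "INT", "STR", "WIS"]


def weights_with_defaults(partial_weights, default=0.0):
    """Return a full weight map, using `default` for abilities not given."""
    unknown = set(partial_weights) - set(ABILITY_LIST)
    if unknown:
        raise ValueError(f"Unknown abilities: {sorted(unknown)}")
    return {ability: partial_weights.get(ability, default) for ability in ABILITY_LIST}


full_weights = weights_with_defaults({"STR": 0.7, "CON": 0.3})
print(full_weights)
```

Rejecting unknown keys up front is a design choice here: a misspelled ability would otherwise be silently ignored.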


In [10]:
INSERT_STATEMENT = session.prepare(f"""
    INSERT INTO {keyspace}.fantasy_simple
        (name, description, abilities)
    VALUES
        (?,?,?);
    """
)

for character in characters0:
    session.execute(
        INSERT_STATEMENT,
        (
            character["name"],
            character["description"],
            abilities_to_vector(character["abilities_map"]),
        ),
    )

## Query

In [11]:
MAX_ABILITY_VALUES = [10, 10, 10, 10, 10, 10]

QUERY_STATEMENT = session.prepare(f"""
    SELECT name, description, abilities, similarity_dot_product(abilities, ?) as score
        FROM {keyspace}.fantasy_simple
        ORDER BY abilities ANN OF ?
        LIMIT ?;
    """
)

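def find_best_matches(ability_weight_map, n=1):
    """
    Return the `n` characters ranking highest for the given ability weights.

    The weighted sum S has a theoretical maximum equal to the sum of the
    weights. The returned "score" is the similarity computed by the database:
    judging from the sample outputs, it equals (1 + S) / 2 rather than S itself.
    """
    # turn the requested weights into a vector (same ordering as the stored data)
    query_vector0 = abilities_to_vector(ability_weight_map)
    # divide each weight by the max value of its ability, so that a stored
    # value at the top of its scale contributes exactly its weight to S
    query_vector = [x0 / max_val for x0, max_val in zip(query_vector0, MAX_ABILITY_VALUES)]
    # run the ANN search, sorted by dot-product similarity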
    results = session.execute(
        QUERY_STATEMENT,
        (
            query_vector,
            # the same vector again, for the second '?' (the ANN OF clause):
            query_vector,
            n,
        ),
    )
    return [
        {
            "name": row.name,
            "description": row.description,
            "abilities_map": abilities_to_map(row.abilities),
            "score": row.score,
        }
        for row in results
    ]

In [12]:
find_best_matches({'CHA': 0.1, 'CON': 0.1, 'DEX': 0.1, 'INT': 0.1, 'STR': 0.5, 'WIS': 0.1})

[{'name': 'Bargul',
  'description': 'A mighty brute, able to smash rocks with his forehead.',
  'abilities_map': {'CHA': 3.0,
   'CON': 9.0,
   'DEX': 4.0,
   'INT': 2.0,
   'STR': 10.0,
   'WIS': 3.0},
  'score': 0.8550000190734863}]

In [13]:
find_best_matches({'CHA': 0.1, 'CON': 0.1, 'DEX': 0.5, 'INT': 0.1, 'STR': 0.1, 'WIS': 0.1}, n=2)

[{'name': 'Zittur',
  'description': 'A lightweight fairy, capable of slipping in unseen and making herself unnoticed.',
  'abilities_map': {'CHA': 9.0,
   'CON': 4.0,
   'DEX': 9.0,
   'INT': 7.0,
   'STR': 3.0,
   'WIS': 5.0},
  'score': 0.8650000095367432},
 {'name': 'Gondolf',
  'description': 'A wizard with long nose and a sardonic smile',
  'abilities_map': {'CHA': 8.0,
   'CON': 3.0,
   'DEX': 5.0,
   'INT': 10.0,
   'STR': 4.0,
   'WIS': 9.0},
  'score': 0.7949999570846558}]

In [14]:
find_best_matches({'CHA': 0.1, 'CON': 0.1, 'DEX': 0.1, 'INT': 0.3, 'STR': 0.3, 'WIS': 0.1}, n=1)

[{'name': 'Gondolf',
  'description': 'A wizard with long nose and a sardonic smile',
  'abilities_map': {'CHA': 8.0,
   'CON': 3.0,
   'DEX': 5.0,
   'INT': 10.0,
   'STR': 4.0,
   'WIS': 9.0},
  'score': 0.8349999785423279}]
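A note on reading the `score` values: the numbers above are not the raw weighted sum $S$. They are consistent with the similarity being reported as $(1 + S)/2$, which appears to be how `similarity_dot_product` rescales the dot product (treat this as an assumption to verify against your database version). The following plain-Python check reproduces the first result (Bargul, about 0.855) under that assumption; `expected_score` is a hypothetical helper:

```python
ABILITY_LIST = ["CHA", "CON", "DEX", "INT", "STR", "WIS"]
MAX_ABILITY_VALUES = [10, 10, 10, 10, 10, 10]


def expected_score(abilities_map, ability_weight_map):
    """Assumed reported score: (1 + S) / 2, with S the weighted sum."""
    s = sum(
        ability_weight_map[k] / max_val * abilities_map[k]
        for k, max_val in zip(ABILITY_LIST, MAX_ABILITY_VALUES)
    )
    return (1 + s) / 2


bargul = {"CHA": 3, "CON": 9, "DEX": 4, "INT": 2, "STR": 10, "WIS": 3}
strong = {"CHA": 0.1, "CON": 0.1, "DEX": 0.1, "INT": 0.1, "STR": 0.5, "WIS": 0.1}

score = expected_score(bargul, strong)
raw_sum = 2 * score - 1  # back to the weighted sum S
print(score, raw_sum)
```

The small differences in the trailing digits of the database output are what one would expect from vectors stored as single-precision floats. Since the rescaling is monotonic, the ranking is the same whether one sorts by the reported score or by $S$.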

## TODO

- a single-pass query for "weighted semantic + abilities" search
- add more explanations
- a note on what you _cannot_ do without multiple passes
- performance/recall considerations