# Use dot and vectors for weighted ranking

Table of contents:
- Weighted ranking, "attributes" only
- Weighted ranking, "attributes+semantic" in a single pass
- Note 1: _what you can and can't do_
- Note 2: _an aside on performance_

## Goal

Each entry comes with a list of numeric attributes $a_i, i=1 \ldots N$, which we take to be normalized to the $[0:1]$ interval.

We want to be able to query and sort by a "score function" $S(\{a_i\}) : \mathrm{row} \to \mathbb{R}$ that is a certain weighted combination of these attributes:
$$
S(\mathrm{row}) \equiv \sum_{i=1}^N w_i a_i
$$

We want to lay out a data schema such that different queries, each with arbitrary sets of weights, can all be run on the one table we'll have.

Note that the number of "attributes" must be known at table-creation time, and their names must be known before inserting any data into it.

### The principle

Let us assume for a moment that the (lowercase) attributes $\{a_i\}$ range between zero and one.

The formula above for $S(\mathrm{row})$ is nothing else than the DOT similarity primitive available in Cassandra (modulo a simple transformation that we'll take into account in a second).

**If we store a _vector_ next to each entry, with the attributes collated in a known sequence, we can then ask Cassandra to compute `S` for us and sort results by its descending value**. In other words, at storing time we insert

$$
v_\mathrm{row} = [a_1, a_2 ... a_N]
$$

so that the query with weights $\{w_i\}$ can be realized by constructing the query vector

$$
q = [w_1, w_2, ... w_N]
$$

Indeed the clause `... ORDER BY vector_column ANN OF q ...` in the `SELECT` will do just that, provided we created the table with the DOT similarity function in the options.

**Note**: while it is customarily said that _you should use DOT only when your vectors are guaranteed to be (L2) unit-length_, this whole application is an example of "bending the rules" and exploiting the DOT similarity measure for exactly what it is, even outside of the unit sphere.
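
To make the principle concrete before involving the database, here is a minimal pure-Python sketch of the ranking that the ANN query is expected to reproduce. The row names, attribute values and the helper `weighted_score` are made up for illustration and are not part of the notebook's code.

```python
def weighted_score(weights, attributes):
    """Plain dot product: S = sum_i w_i * a_i."""
    if len(weights) != len(attributes):
        raise ValueError("weights and attributes must have the same length")
    return sum(w * a for w, a in zip(weights, attributes))


# Hypothetical rows, attributes already normalized to [0, 1]
rows = {
    "row_a": [0.8, 0.3, 0.5],
    "row_b": [0.3, 0.9, 0.4],
    "row_c": [0.9, 0.4, 0.9],
}
# Hypothetical query weights, in the same attribute order as the stored vectors
query_weights = [0.2, 0.2, 0.6]

# Sort row names by descending score, as `ORDER BY ... ANN OF q` would
ranked = sorted(
    rows,
    key=lambda name: weighted_score(query_weights, rows[name]),
    reverse=True,
)
print(ranked)
```

The only thing the database adds on top of this is doing the sort through the vector index rather than by scanning every row.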

### Normalization, practicalities

If the attributes are already in the $[0:1]$ interval, everything is fine as is. If this is not the case, we distinguish two possibilities:

1. The attributes are given as a "raw" set of numbers $\{A_i\}$, each in its range $[0: A_i^\mathrm{max}]$ (note the zero!). This is the happy path, as the rescaling $a_i = A_i / A_i^\mathrm{max}$ can be shifted to the query vector: we can store the $\{A_i\}$ as they are on the database and then construct a modified query vector $\tilde{q} = \{w_i / A_i^\mathrm{max}\}$. **This is the approach demonstrated in the code.**

2. The attributes are on a scale that does not start at zero, or are even mapped non-linearly into the unit interval. Examples may be an "alignment" value for the fantasy characters (-1 for full evil, +1 for full good), or values $A'_i$ in $[-\infty: +\infty]$ that require mappings such as $a_i = 0.5 + \arctan(k A'_i)/\pi$ to be tamed into the unit interval. _In those cases it is unavoidable to do some insert-time data preprocessing_, something not shown here.
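
The two cases can be sketched in a few lines of plain Python. The first part checks that folding $1/A_i^\mathrm{max}$ into the query vector gives the same score as normalizing the stored values; the second part shows what the insert-time mappings of case 2 might look like. The helper names (`dot`, `squash_to_unit_interval`, `alignment_to_unit_interval`) and the choice $k=1$ are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


raw_attributes = [3, 9, 4, 2, 10, 3]    # the {A_i}, stored as-is
max_values = [10, 10, 10, 10, 10, 10]   # the {A_i^max}
weights = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

# Option A: normalize the stored values, query with the plain weights
normalized = [a / m for a, m in zip(raw_attributes, max_values)]
score_a = dot(weights, normalized)

# Option B: store raw values, fold 1/A_i^max into the query vector
rescaled_query = [w / m for w, m in zip(weights, max_values)]
score_b = dot(rescaled_query, raw_attributes)

scores_agree = math.isclose(score_a, score_b)
print(score_a, score_b, scores_agree)


# Case 2: mappings that have to be applied before inserting (assumed forms)
def squash_to_unit_interval(x, k=1.0):
    """Map any real x into (0, 1) via 0.5 + arctan(k * x) / pi."""
    return 0.5 + math.atan(k * x) / math.pi


def alignment_to_unit_interval(alignment):
    """Linear map of an alignment in [-1, +1] onto [0, 1]."""
    return (alignment + 1) / 2
```

Note that the shift in `alignment_to_unit_interval` cannot be moved to the query vector the way a pure rescaling can, which is why it has to happen at insert time.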

### Cassandra similarity and the actual score

Using the $\{q_i\}$ with the $\{a_i\}$ (or equivalently the $\{\tilde{q}_i\}$ with the $\{A_i\}$), and provided the weights are non-negative and sum to one, we have the guarantee that the "theoretical maximum score" (attained when all attributes assume their maximum value) is one, while the minimum is trivially zero: a fact that may come in handy. CQL also makes it easy to get the similarity itself back from an ANN query, by adding something like `... similarity_dot_product(vector_column, q) as score ...`.

**But beware!** The DOT similarity in Cassandra was built with the "shortcut-for-cosine" use case, and a nice $[0:1]$ normalization, in mind.

As a consequence, the DOT similarity in Cassandra does not return exactly $S = \vec{q} \cdot \vec{a}$, rather

$$
\mathrm{sim}_\mathrm{Cass} = \frac{1 + \vec{q} \cdot \vec{a}}{2}
$$

This means that, if we are interested in the _value_ of the returned similarities, we should counter the above rescaling by applying $S = 2 \; \mathrm{sim}_\mathrm{Cass} - 1$ after running the CQL query.
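
As a small sketch, the rescaling formula above and its inverse can be wrapped in two helpers. The function names are made up here; the notebook code applies the inverse inline as `2 * row.score - 1`.

```python
def cassandra_dot_similarity(dot_value):
    """The rescaling described above: sim = (1 + q.a) / 2."""
    return (1 + dot_value) / 2


def score_from_similarity(similarity):
    """Invert the rescaling to recover the plain dot product S."""
    return 2 * similarity - 1


# Round trip on an arbitrary dot-product value
recovered = score_from_similarity(cassandra_dot_similarity(0.71))
print(recovered)
```

Since the rescaling is monotonically increasing, it never changes the ordering of the results: the inversion only matters when the score values themselves are of interest.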

_Related:_ in this notebook, unless specified otherwise, "cosine similarity" denotes the quantity scaled to lie in $[-1: +1]$ (and not the Cassandra rescaling).

### Sample data

Here we take all the ability scores to always lie in 0-10:

```
Strength      measuring physical power
Dexterity     measuring agility
Constitution  measuring endurance
Intelligence  measuring reasoning and memory
Wisdom        measuring perception and insight
Charisma      measuring force of personality
```

In [42]:
MAX_ABILITY_VALUES = [10, 10, 10, 10, 10, 10]

In [1]:
characters0 = [
    {
        "name": "Gondolf",
        "description": "A wizard with long nose and a sardonic smile",
        "abilities_map": {
            "CHA": 8,
            "CON": 3,
            "DEX": 5,
            "INT": 10,
            "STR": 4,
            "WIS": 9,
        },
    },
    {
        "name": "Bargul",
        "description": "A mighty brute, able to smash rocks with his forehead.",
        "abilities_map": {
            "CHA": 3,
            "CON": 9,
            "DEX": 4,
            "INT": 2,
            "STR": 10,
            "WIS": 3,
        },
    },
    {
        "name": "Zittur",
        "description": "A lightweight fairy, capable of slipping in unseen and making herself unnoticed.",
        "abilities_map": {
            "CHA": 9,
            "CON": 4,
            "DEX": 9,
            "INT": 7,
            "STR": 3,
            "WIS": 5,
        },
    },
]

### Setup & Secrets

In [2]:
!pip install --quiet "cassio>=0.1.3"

In [3]:
import os
from getpass import getpass

import cassio

In [4]:
if 'ASTRA_DB_ID' not in os.environ:
    os.environ["ASTRA_DB_ID"] = input("ASTRA_DB_ID = ")

if 'ASTRA_DB_APPLICATION_TOKEN' not in os.environ:
    os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")

if 'ASTRA_DB_KEYSPACE' not in os.environ:
    ks = input("(Optional) ASTRA_DB_KEYSPACE = ")
    if ks:
        os.environ["ASTRA_DB_KEYSPACE"] = ks

In [5]:
cassio.init(
    database_id=os.environ["ASTRA_DB_ID"],
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    keyspace=os.environ.get("ASTRA_DB_KEYSPACE"),
)

In [6]:
session, keyspace = cassio.config.resolve_session(), cassio.config.resolve_keyspace()

## Weighted ranking, attributes-only case

### Create vector table

This is where we need to know the number of abilities (six in our example):

In [7]:
CREATE_TABLE_STATEMENT = f"""
CREATE TABLE IF NOT EXISTS {keyspace}.fantasy_simple (
    name TEXT PRIMARY KEY,
    description TEXT,
    abilities VECTOR<FLOAT, 6>,
);
"""

session.execute(CREATE_TABLE_STATEMENT)

<cassandra.cluster.ResultSet at 0x7f6bf37554c0>

In [8]:
CREATE_INDEX_STATEMENT = f"""
CREATE CUSTOM INDEX IF NOT EXISTS idx_abilities
    ON {keyspace}.fantasy_simple (abilities)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""
# Note: the double '{{' and '}}' are just the F-string escape sequence for '{' and '}'

session.execute(CREATE_INDEX_STATEMENT)

<cassandra.cluster.ResultSet at 0x7f6c30482580>

### Write data

This is where we need to nail down a fixed ordering of the attributes in a list to match stored and query vectors.

In [9]:
# We need a global, unambiguous ordering scheme
ABILITY_LIST = ["CHA", "CON", "DEX", "INT", "STR", "WIS"]

Useful conversion utilities:

In [10]:
def abilities_to_vector(a_dict):
    return [a_dict[ability] for ability in ABILITY_LIST]

def abilities_to_map(a_vector):
    return {
        ability_key: ability_value
        for ability_key, ability_value in zip(ABILITY_LIST, a_vector)
    }

print(f"=> {characters0[1]['name']}: {abilities_to_vector(characters0[1]['abilities_map'])}")
print(f"=> {abilities_to_map([1, 2, 3, 4, 5, 6])}")

=> Bargul: [3, 9, 4, 2, 10, 3]
=> {'CHA': 1, 'CON': 2, 'DEX': 3, 'INT': 4, 'STR': 5, 'WIS': 6}


In [11]:
INSERT_STATEMENT = session.prepare(f"""
    INSERT INTO {keyspace}.fantasy_simple
        (name, description, abilities)
    VALUES
        (?,?,?);
    """
)

for character in characters0:
    session.execute(
        INSERT_STATEMENT,
        (
            character["name"],
            character["description"],
            abilities_to_vector(character["abilities_map"]),
        ),
    )

### Query

In [12]:
QUERY_STATEMENT = session.prepare(f"""
    SELECT name, description, abilities, similarity_dot_product(abilities, ?) as score
        FROM {keyspace}.fantasy_simple
        ORDER BY abilities ANN OF ?
        LIMIT ?;
    """
)

def find_best_matches(ability_weight_map, n = 1):
    """
    the theoretical max score from this query is the sum of the weights.
    """
    # make requested weights into vector
    query_vector0 = abilities_to_vector(ability_weight_map)
    # ensure each of its components encodes the inverse rescaling for the DB values
    query_vector = [x0 / max_val for x0, max_val in zip(query_vector0, MAX_ABILITY_VALUES)]
    # run the proper search
    results = session.execute(
        QUERY_STATEMENT,
        (
            query_vector,
            # again, to match the '?' in the statement:
            query_vector,
            n,
        ),
    )
    return [
        {
            "name": row.name,
            "description": row.description,
            "abilities_map": abilities_to_map(row.abilities),
            "score": 2 * row.score - 1,  # remember Cassandra rescales the dot in its similarities
        }
        for row in results
    ]

In [13]:
find_best_matches({"CHA": 0.1, "CON": 0.1, "DEX": 0.1, "INT": 0.1, "STR": 0.5, "WIS": 0.1})

[{'name': 'Bargul',
  'description': 'A mighty brute, able to smash rocks with his forehead.',
  'abilities_map': {'CHA': 3.0,
   'CON': 9.0,
   'DEX': 4.0,
   'INT': 2.0,
   'STR': 10.0,
   'WIS': 3.0},
  'score': 0.7100000381469727}]

In [14]:
find_best_matches({"CHA": 0.1, "CON": 0.1, "DEX": 0.5, "INT": 0.1, "STR": 0.1, "WIS": 0.1}, n=2)

[{'name': 'Zittur',
  'description': 'A lightweight fairy, capable of slipping in unseen and making herself unnoticed.',
  'abilities_map': {'CHA': 9.0,
   'CON': 4.0,
   'DEX': 9.0,
   'INT': 7.0,
   'STR': 3.0,
   'WIS': 5.0},
  'score': 0.7300000190734863},
 {'name': 'Gondolf',
  'description': 'A wizard with long nose and a sardonic smile',
  'abilities_map': {'CHA': 8.0,
   'CON': 3.0,
   'DEX': 5.0,
   'INT': 10.0,
   'STR': 4.0,
   'WIS': 9.0},
  'score': 0.5899999141693115}]

In [15]:
find_best_matches({"CHA": 0.1, "CON": 0.1, "DEX": 0.1, "INT": 0.3, "STR": 0.3, "WIS": 0.1}, n=1)

[{'name': 'Gondolf',
  'description': 'A wizard with long nose and a sardonic smile',
  'abilities_map': {'CHA': 8.0,
   'CON': 3.0,
   'DEX': 5.0,
   'INT': 10.0,
   'STR': 4.0,
   'WIS': 9.0},
  'score': 0.6699999570846558}]

#### Sort by a single attribute

Note that the above machinery trivially contains a sub-use-case that is useful in its own right: sorting results by _exactly one attribute_. It is sufficient to pass a weight set that is effectively a "Kronecker delta":

In [16]:
find_best_matches({"CHA": 0.0, "CON": 0.0, "DEX": 1.0, "INT": 0.0, "STR": 0.0, "WIS": 0.0}, n=3)

[{'name': 'Zittur',
  'description': 'A lightweight fairy, capable of slipping in unseen and making herself unnoticed.',
  'abilities_map': {'CHA': 9.0,
   'CON': 4.0,
   'DEX': 9.0,
   'INT': 7.0,
   'STR': 3.0,
   'WIS': 5.0},
  'score': 0.9000000953674316},
 {'name': 'Gondolf',
  'description': 'A wizard with long nose and a sardonic smile',
  'abilities_map': {'CHA': 8.0,
   'CON': 3.0,
   'DEX': 5.0,
   'INT': 10.0,
   'STR': 4.0,
   'WIS': 9.0},
  'score': 0.5},
 {'name': 'Bargul',
  'description': 'A mighty brute, able to smash rocks with his forehead.',
  'abilities_map': {'CHA': 3.0,
   'CON': 9.0,
   'DEX': 4.0,
   'INT': 2.0,
   'STR': 10.0,
   'WIS': 3.0},
  'score': 0.3999999761581421}]

#### Normalizations check

Just as a sanity check for the normalizations, we'll insert two more characters in the story:

In [17]:
session.execute(
    INSERT_STATEMENT,
    ("Cthulhu", "An almighty ancient god", [10] * 6)
)
session.execute(
    INSERT_STATEMENT,
    ("Puny Terry", "Terry no strong, Terry no smart", [0] * 6)
)

<cassandra.cluster.ResultSet at 0x7f6bf37819d0>

Let's try a couple of queries (ensuring the input weights sum to one):

In [18]:
query1 = abilities_to_map([0.05, 0.1, 0.15, 0.2, 0.3, 0.2])
for result in find_best_matches(query1, n=100):
    print(result["score"])

1.0
0.6449999809265137
0.565000057220459
0.5499999523162842
0.0


In [20]:
query1 = abilities_to_map([0.0, 0.4, 0.2, 0.1, 0.3, 0.0])
for result in find_best_matches(query1, n=100):
    print(result["score"])

0.9999998807907104
0.7599999904632568
0.5
0.440000057220459
0.0


If you don't care about the _value_ of the resulting scores, you can even use negative weights (and forget about sum-to-one weights) to make some attributes into a **penalty**:

In [23]:
# We don't like smartasses around here ...
for result in find_best_matches({'CHA': 1, 'CON': 1, 'DEX': 0.5, 'INT': -1.0, 'STR': 1.0, 'WIS': -1.0}, n=3):
    print(f"{result['name']}: {result['score']}")

Bargul: 1.9000000953674316
Cthulhu: 1.5
Zittur: 0.8500001430511475


## Weighted ranking, attributes+semantic in a single pass

We now turn to the next problem: we want to mix _semantic search_ with the above criterion. And we need it to work in a _single pass_, i.e. with just one database query.

More precisely, with $S$ the attribute-based score as seen in the previous section, we need to sort rows based on the composite "attributes+semantic" score:

$$
S^\star = \lambda T + (1-\lambda) S
$$

where $T$ is the "semantic similarity" between a query text and the text on the stored row, calculated as the cosine similarity between the embedding vectors (of dimension $M$), and $\lambda$ is a parameter in $[0:1]$ tuned in the application and/or supplied with each query along with the weights. The "query" $Q$ then is defined by three things:
- the weights $\{w_i\}$ for $S$;
- a text $t_q$ for the semantic-similarity part of the query;
- a real value $\lambda$ specifying how to mix these two (one for full-semantic, zero for full-attribute-based).

**Note**: in the following, we require that the _embedding vectors have unit L2 norm_ (if the embedding service does not guarantee this, it is up to the implementation to normalize them). This allows replacing the cosine similarity with the dot product as far as the semantic part of the score is concerned: calling $\mathrm{emb}(t)$ the embedding vector for the text $t$, we have:

$$
S^\star(Q, \mathrm{row}) = \lambda T(t_q, t_\mathrm{row}) + (1-\lambda) S(\vec{w}, \vec{a}) = \ldots
$$

i.e.

$$
\ldots = \lambda \mathrm{emb}(t_q) \cdot \mathrm{emb}(t_\mathrm{row}) + (1-\lambda)\vec{w}\cdot\vec{a} = \ldots
$$

meaning

$$
\ldots = [\lambda \mathrm{emb}(t_q) ] \cdot \mathrm{emb}(t_\mathrm{row}) + [(1-\lambda)\vec{w}] \cdot\vec{a}
$$

Now it is clear that we can construct a "composite row vector" by _collating the text embedding and the attributes_ and store it in the database table:

$$
\vec{v}^\star(\mathrm{row}) \equiv [\mathrm{emb}(t_\mathrm{row}); a_i]
$$

and construct a "composite query vector" at query time, likewise of dimension $M+N$, like this:

$$
\vec{q}^\star = [\lambda \mathrm{emb}(t_q) ; (1-\lambda) w_i ]
$$

In this way, queries can rank and return results effectively sorted by descending value of

$$
\vec{q}^\star \cdot \vec{v}^\star = S^\star(Q, \mathrm{row})
$$

The same remarks made for the "attributes-only" case apply here, about rescaling the "similarity" back from the Cassandra-specific transformation and replacing $w_i$ with $w_i / A_i^\mathrm{max}$ if needed.

_Corollary_ of the above derivation: the "theoretical" maximum score one can obtain, assuming the text embeddings have unit norm and the weights are non-negative, is $\lambda + (1-\lambda)\sum w_i$, and the minimum is $-\lambda$. _(The latter will never be attained in practice, given that text embeddings, even for totally unrelated or gibberish input texts, seldom have negative similarities.)_
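
Before touching the database, the identity $\vec{q}^\star \cdot \vec{v}^\star = \lambda T + (1-\lambda) S$ can be checked numerically. The sketch below uses toy three-dimensional "embeddings" that already have unit norm, and made-up attributes and weights; none of these values come from the notebook's data.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Toy unit-norm "embeddings" (M = 3): 0.6^2 + 0.8^2 = 1
emb_row = [0.6, 0.8, 0.0]
emb_query = [0.0, 0.6, 0.8]

# Toy normalized attributes and weights (N = 3)
attributes = [0.8, 0.3, 0.5]
weights = [0.2, 0.3, 0.5]
lam = 0.4

# Composite vectors: embedding part first, attribute part last
row_vector = emb_row + attributes
query_vector = [lam * x for x in emb_query] + [(1 - lam) * w for w in weights]

semantic = dot(emb_query, emb_row)            # T
attribute_score = dot(weights, attributes)    # S
expected = lam * semantic + (1 - lam) * attribute_score
composite = dot(query_vector, row_vector)

print(composite, expected, math.isclose(composite, expected))
```

The ordering of the two parts inside the composite vector is arbitrary, as long as it is the same for stored and query vectors.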

**Let's see this plan in action.**

### Text embeddings

We pick OpenAI's embeddings here, but any text embedding would do: just replace the definition of `embed` to your taste.

In [29]:
!pip install --quiet "openai>=1.0.0"

In [30]:
if 'OPENAI_API_KEY' not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

In [32]:
import openai

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def embed(text):
    return client.embeddings.create(input=[text], model="text-embedding-ada-002").data[0].embedding

In [35]:
M = len(embed("Just to check the length"))
print(f"Dimension M = {M}")

Dimension M = 1536
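
Since the derivation above relies on unit-norm embeddings, it may be worth wrapping whatever embedding function is used so that its output is normalized regardless of what the service returns. The following is a minimal sketch: `l2_normalize` and `embed_unit` are hypothetical helpers, not part of the notebook, and the stand-in embedding function is a fake used only to keep the example self-contained.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def l2_normalize(v):
    """Return a copy of v rescaled to unit L2 norm."""
    norm = l2_norm(v)
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def embed_unit(text, embed_fn):
    """Wrap any embedding function so that its output has unit L2 norm."""
    return l2_normalize(embed_fn(text))


# Stand-in for a real embedding call (assumption: returns a list of floats)
def fake_embed(text):
    return [3.0, 4.0]


unit_vector = embed_unit("any text", fake_embed)
print(unit_vector, l2_norm(unit_vector))
```

In the notebook one would pass the real `embed` in place of `fake_embed`; for vectors that are already normalized the wrapper leaves them unchanged up to floating-point rounding.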


#### Create table and index

Note that we add an `abilities_map` map field. This is not strictly necessary, but it spares us from reconstructing the ability map by isolating the last $N$ components of the composite vectors.

In [44]:
CREATE_TABLE_STATEMENT_C = f"""
CREATE TABLE IF NOT EXISTS {keyspace}.fantasy_composite (
    name TEXT PRIMARY KEY,
    description TEXT,
    abilities_map MAP<TEXT, FLOAT>,
    composite VECTOR<FLOAT, {6 + M}>,
);
"""

session.execute(CREATE_TABLE_STATEMENT_C)

<cassandra.cluster.ResultSet at 0x7f6c304846a0>

In [45]:
CREATE_INDEX_STATEMENT_C = f"""
CREATE CUSTOM INDEX IF NOT EXISTS idx_abilities_composite
    ON {keyspace}.fantasy_composite (composite)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {{'similarity_function' : 'dot_product'}};
"""

session.execute(CREATE_INDEX_STATEMENT_C)

<cassandra.cluster.ResultSet at 0x7f6c09747760>

### Insert data

Let's introduce two new characters, differing from the wizard only in their name and description (the abilities are identical):

In [38]:
characters1 = characters0 + [
    {
        "name": "Argold",
        "description": "A powerful, quiet mage with a penchant for gardening and cooking.",
        "abilities_map": {
            "CHA": 8,
            "CON": 3,
            "DEX": 5,
            "INT": 10,
            "STR": 4,
            "WIS": 9,
        },
    },
    {
        "name": "Sabiria",
        "description": "This ruthless witch cooks you alive first, and asks questions later.",
        "abilities_map": {
            "CHA": 8,
            "CON": 3,
            "DEX": 5,
            "INT": 10,
            "STR": 4,
            "WIS": 9,
        },
    },
]

In [48]:
# Use a distinct name so that the statement for `fantasy_simple` is not overwritten
INSERT_STATEMENT_C = session.prepare(f"""
    INSERT INTO {keyspace}.fantasy_composite
        (name, description, abilities_map, composite)
    VALUES
        (?, ?, ?, ?);
    """
)

for character in characters1:
    session.execute(
        INSERT_STATEMENT_C,
        (
            character["name"],
            character["description"],
            character["abilities_map"],
            # here the composite vector is constructed (embedding first, then abilities):
            embed(character["description"]) + abilities_to_vector(character["abilities_map"]),
        ),
    )

### Query

In [52]:
QUERY_STATEMENT_C = session.prepare(f"""
    SELECT name, description, abilities_map, similarity_dot_product(composite, ?) as score
        FROM {keyspace}.fantasy_composite
        ORDER BY composite ANN OF ?
        LIMIT ?;
    """
)

def find_best_matches_composite(query_text, ability_weight_map, param_lambda=0.5, n = 1):
    """
    the theoretical max score from this query is:
        lambda + (1-lambda) * (the sum of the weights)
    the minimum is
        -lambda
    """
    # the "ability" part:
    ability_q_vector0 = abilities_to_vector(ability_weight_map)
    ability_q_vector = [x0 / max_val for x0, max_val in zip(ability_q_vector0, MAX_ABILITY_VALUES)]
    # the "text" part:
    text_q_vector = embed(query_text)
    # combining the two
    query_vector = [
        param_lambda * emb_x
        for emb_x in text_q_vector
    ] + [
        (1 - param_lambda) * abi_x
        for abi_x in ability_q_vector
    ]
    
    # run the proper search
    results = session.execute(
        QUERY_STATEMENT_C,
        (
            query_vector,
            # again, to match the '?' in the statement:
            query_vector,
            n,
        ),
    )
    return [
        {
            "name": row.name,
            "description": row.description,
            "abilities_map": row.abilities_map,
            "score": 2 * row.score - 1,  # remember Cassandra rescales the dot in its similarities
        }
        for row in results
    ]

A printing tool to make output shorter:

In [50]:
def search_and_show(*pargs, **kwargs):
    for res in find_best_matches_composite(*pargs, **kwargs):
        print(f"* {res['name']:<16s} ({res['score']})")

#### Run some queries

The two cells below will pick "magic characters" by attributes, and the semantic part will do the fine part of the ranking:

In [56]:
search_and_show(
    "Looking for someone definitely evil.",
    {"CHA": 0.1, "CON": 0.1, "DEX": 0.1, "INT": 0.3, "STR": 0.3, "WIS": 0.1},
    param_lambda=0.8,
    n=3
)

* Sabiria          (0.7360055446624756)
* Gondolf          (0.7342482805252075)
* Argold           (0.7205580472946167)


In [57]:
search_and_show(
    "Looking for someone definitely evil.",
    {"CHA": 0.1, "CON": 0.1, "DEX": 0.1, "INT": 0.3, "STR": 0.3, "WIS": 0.1},
    param_lambda=0.2,
    n=3
)

* Sabiria          (0.6964023113250732)
* Gondolf          (0.6956992149353027)
* Argold           (0.690223217010498)


The difference is $\lambda$, which has a _slight_ effect on the final scores. Why such a small effect?

Well, even though the cosine similarity between two vectors can in principle span the whole $[-1,+1]$ range, in practice text embeddings tend to fall within a thin "cone" in vector space, such that their cosine similarity is generally between 0.5 and 1, if not in a narrower window. This important practical fact, perhaps paired with an investigation into the exact nature of the data at hand and the embeddings used, may warrant adaptations of the above flow to take this into account. **This is a very important point not to underestimate for the quality of the final outcomes.**
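
One way to investigate the "narrow window" for a given dataset is to measure the range of pairwise cosine similarities over a sample of stored embeddings. Below is a small sketch with toy two-dimensional vectors; `similarity_window` is a hypothetical helper, and on real data one would feed it a sample of actual embedding vectors.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    dot_uv = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot_uv / (norm_u * norm_v)


def similarity_window(vectors):
    """Return (min, max) of the cosine similarity over all distinct pairs."""
    sims = [cosine_similarity(u, v) for u, v in combinations(vectors, 2)]
    return min(sims), max(sims)


# Toy sample, standing in for a sample of real text embeddings
sample = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
low, high = similarity_window(sample)
print(low, high)
```

The number of pairs grows quadratically with the sample size, so a few hundred vectors is a reasonable upper bound for this kind of quick check.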

#### Pure similarity

In [69]:
search_and_show(
    "This one loves a quiet life",
    {"CHA": 0, "CON": 0, "DEX": 0, "INT": 0, "STR": 0, "WIS": 0},
    param_lambda=1,
    n=3
)

* Argold           (0.8322060108184814)
* Zittur           (0.8034632205963135)
* Gondolf          (0.7896300554275513)


#### Pure attribute-based

In [71]:
search_and_show(
    "(this sentence will have no effect since lambda=0)",
    {"CHA": 0, "CON": 0.2, "DEX": 0.3, "INT": 0.0, "STR": 0.5, "WIS": 0.0},
    param_lambda=0,
    n=3
)

* Bargul           (0.7999999523162842)
* Zittur           (0.5)
* Argold           (0.4099999666213989)


## Limitations of this approach, workarounds

Applying a _threshold_ on similarity and then, within the passing entries, sorting by the composite score is probably impossible in a single pass.

Anything involving nonlinear transformations of the original scores/dot-based similarity is likely to be painful and/or impossible.
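
If a second, client-side pass is acceptable, the threshold case could be handled by over-fetching candidates and then filtering and re-sorting them in application code. The sketch below works on plain dictionaries and assumes each candidate carries its unit-norm text embedding and its normalized attributes; with the tables in this notebook that would mean also retrieving the stored vector and splitting it, which is not shown. The function name and the data are illustrative only.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def threshold_then_rank(candidates, query_embedding, query_weights, lam, min_semantic):
    """
    Keep only candidates whose semantic similarity is >= min_semantic,
    then sort the survivors by the composite score, best first.
    Each candidate is a dict with 'name', 'embedding' and 'attributes'.
    """
    passing = []
    for cand in candidates:
        semantic = dot(query_embedding, cand["embedding"])
        if semantic < min_semantic:
            continue
        score = lam * semantic + (1 - lam) * dot(query_weights, cand["attributes"])
        passing.append((score, cand["name"]))
    passing.sort(reverse=True)
    return [name for _score, name in passing]


# Toy candidates, e.g. the result of an over-fetching first query
candidates = [
    {"name": "a", "embedding": [1.0, 0.0], "attributes": [1.0]},
    {"name": "b", "embedding": [0.6, 0.8], "attributes": [0.2]},
    {"name": "c", "embedding": [0.0, 1.0], "attributes": [1.0]},
]
selected = threshold_then_rank(
    candidates,
    query_embedding=[1.0, 0.0],
    query_weights=[1.0],
    lam=0.5,
    min_semantic=0.5,
)
print(selected)
```

The obvious catch is that the first pass must fetch enough rows for the filter to have something to work with, so this is a workaround rather than a single-pass solution.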

### Semantic adjustments, an example

On the other hand, let's see what a "fix" for the problem of the "semantic narrow window" could look like.

Suppose we want to adopt a "cut and stretched" form of the semantic similarity, $\tau = 5(T-0.8)$, after having found that $T \in [0.8:1]$ in _all_ cases: then we're after a modified

$$
S'^\star = \lambda \tau + (1-\lambda) S = 5\lambda T + (1-\lambda) S - 4\lambda
$$

The last term is a constant for a given query and we can drop it as far as the results ranking is concerned.
We can also divide the score by any positive constant (we are only interested in getting the results in a certain order at the moment).

So we are left with something that has the same shape as the "regular" case, under a renaming of the $\lambda$ parameter: dividing by $4\lambda+1$ and setting $\lambda' = 5\lambda / (4\lambda+1)$, it is:

$$
S'^\star(\lambda) \propto \left(\frac{5\lambda}{4\lambda+1}\right)T + \left(\frac{1-\lambda}{4\lambda+1}\right)S = \lambda' T + (1-\lambda') S = S^\star(\lambda')
$$
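
The remapping can be checked numerically: ranking by the "cut and stretched" score with $\lambda$ should give the same order as ranking by the regular score with $\lambda'$. In the sketch below, `stretched_lambda` generalizes the factor 5 to an arbitrary stretch $k$, giving $\lambda' = k\lambda / ((k-1)\lambda + 1)$ (the cut-off value only contributes the dropped constant); the $(T, S)$ pairs are made-up values.

```python
def stretched_lambda(lam, stretch=5.0):
    """lambda' = k*lam / ((k - 1)*lam + 1); for k = 5 this is 5*lam / (4*lam + 1)."""
    return stretch * lam / ((stretch - 1.0) * lam + 1.0)


def rank(rows, score_fn):
    """Indices of rows sorted by descending score."""
    return sorted(range(len(rows)), key=lambda i: score_fn(*rows[i]), reverse=True)


# Hypothetical (T, S) pairs: semantic similarity and attribute score per row
rows = [(0.95, 0.2), (0.85, 0.9), (0.90, 0.5)]
lam = 0.5
lam_prime = stretched_lambda(lam)

# Ranking by the stretched score, tau = 5 * (T - 0.8)
order_stretched = rank(rows, lambda T, S: lam * 5 * (T - 0.8) + (1 - lam) * S)
# Ranking by the regular composite score with the remapped lambda
order_regular = rank(rows, lambda T, S: lam_prime * T + (1 - lam_prime) * S)

print(lam_prime, order_stretched, order_regular)
```

In practice this means the stretch needs no change to the stored data: it is enough to pass `stretched_lambda(lam)` where the query function expects `param_lambda`.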

## Performance considerations

Departing from the comfort zone of using the DOT measure on the unit sphere, just as a faster form of the cosine similarity, comes with a performance issue that should be kept in mind.

The DOT measure is a "strange beast". It does not behave like a reasonable measure of "how close" two "things" are. The following (informal) expectation, while reasonable and in fact satisfied by both Euclidean and cosine, **fails** for DOT:
> Given that "A is close to B" and "B is far from C", you should be able to expect that "A is far from C"

Here's a DOT counterexample, where one takes "close/far" to mean "DOT is ~1 / ~0":

![dot](https://user-images.githubusercontent.com/14221764/284030947-30738034-f1a9-4760-9b4d-bb2a1c3009f4.png)

In other words, DOT is tricky because it looks at "shadows on a wall": things may seem very close if a certain light is turned on, but switch to the other light, look at the shadow on the other wall and suddenly the objects turn out to be not close at all.
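
A minimal numeric instance of this kind of counterexample (not necessarily the one drawn in the figure) takes three two-dimensional vectors: A is "close" to B, B is "far" from C, and yet A is "close" to C.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


a = [1.0, 1.0]
b = [1.0, 0.0]
c = [0.0, 1.0]

a_dot_b = dot(a, b)  # "A is close to B"
b_dot_c = dot(b, c)  # "B is far from C"
a_dot_c = dot(a, c)  # ... and yet "A is close to C"

print(a_dot_b, b_dot_c, a_dot_c)
```

Note that this relies on A not being unit-length: with all three vectors on the unit sphere, DOT coincides with cosine and the expectation holds again.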

Since "switching light source" stands for "changing the query vector", in DOT-land there is no way to build a single HNSW index ready to work efficiently with all query vectors.

The performance of DOT with non-unit-norm vectors has not been the target of any specific optimization, and by all expectations the results will be less stellar than on the happy path, especially when the data size starts to be massive (likely affecting response times, and possibly degrading recall).

_It could be advisable to explore this in a systematic way._
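
A possible starting point for such an exploration is to compare the rows returned by the ANN query with the exact top-k computed by brute force, and to track the resulting recall as the table grows. The sketch below shows only the client-side part with toy data; `exact_top_k` and `recall_at_k` are hypothetical helpers, and `ann_ids` stands in for the names that an actual `ORDER BY ... ANN OF` query would return.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def exact_top_k(query_vector, rows, k):
    """Brute-force top-k ids by descending dot product; rows maps id -> vector."""
    ranked = sorted(rows, key=lambda rid: dot(query_vector, rows[rid]), reverse=True)
    return ranked[:k]


def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-k that also appears in the ANN result."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)


# Toy data: three stored vectors and one query vector
rows = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [1.0, 1.0]}
query_vector = [0.6, 0.4]

exact_ids = exact_top_k(query_vector, rows, k=2)
ann_ids = ["r3", "r2"]  # placeholder for what an ANN query might return
recall = recall_at_k(exact_ids, ann_ids)
print(exact_ids, recall)
```

Repeating this over many random weight vectors and several table sizes, while also timing the queries, would give a first picture of how recall and latency behave off the unit sphere.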