<a href="https://colab.research.google.com/github/pathwaycom/pathway-examples/blob/main/tutorials/indexes.ipynb" target="_parent"><img src="https://pathway.com/assets/colab-badge.svg" alt="Run In Colab" class="inline"/></a>

# Installing Pathway with Python 3.10+

In the cell below, we install Pathway into a Python 3.10+ Linux runtime.

> **If you are running in Google Colab, please run the colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.
> 
> **The installation and loading time is less than 1 minute**.


In [None]:
%%capture --no-display
!pip install pathway

# Indexes in Pathway
In this article, you'll learn about reactive indexes in Pathway and how they differ from conventional indexes used in databases. You'll also see how to use them to respond to a stream of queries in real time.

Indexes are data structures that speed up queries and are widely used in databases. They help when you want to retrieve records with a specific value in a given column, in which case you need an index built on that column. One example is answering a stream of queries using the contents of a database.
Indexes can also speed up joins: an existing index can be used if it is built on the appropriate columns, and an index can also be built ad hoc during query execution.
Pathway offers indexes too, but because it operates on streams, they differ from database indexes in a few ways. The rest of this article explains these differences.
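
As a point of reference, here is a minimal plain-Python sketch of a conventional column index. It is not Pathway code: the rows and the helper names `scan` and `build_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows, used only for illustration.
rows = [
    {"id": 1, "country": "France"},
    {"id": 2, "country": "Poland"},
    {"id": 3, "country": "France"},
]


def scan(rows, column, value):
    """Answer a lookup without an index: inspect every row."""
    return [row for row in rows if row[column] == value]


def build_index(rows, column):
    """Map each value of `column` to the rows that contain it."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index


country_index = build_index(rows, "country")

# Both approaches return the same rows; the index avoids the full scan.
france_rows = country_index.get("France", [])
print(france_rows)
```

The index pays off when the same column is queried many times, because each lookup no longer touches every row.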

## Joins
Pathway operates on streams. Unless told otherwise, it assumes that new data can arrive on any stream. Thus, when joining two streams, Pathway has to keep both of them in memory. It builds LSM indexes on both sides of a join. Thanks to that, new records arriving on either stream can be joined quickly: it is enough to look them up in the index of the other table, and no costly scans are needed.
In contrast, conventional databases use an index on only one side of a join, because the join results are not updated once the query has been processed.
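
The idea of indexing both sides can be sketched in plain Python. This is a conceptual toy, not Pathway's implementation: it uses in-memory hash maps rather than LSM indexes, handles insertions only, and the class name `SymmetricJoin` and the row labels are hypothetical.

```python
from collections import defaultdict


class SymmetricJoin:
    """Toy equi-join that indexes both inputs by the join key."""

    def __init__(self):
        self.left_index = defaultdict(list)
        self.right_index = defaultdict(list)

    def add_left(self, key, value):
        # Remember the row so that future right rows can find it.
        self.left_index[key].append(value)
        # Probe the other side's index instead of scanning it.
        return [(value, right) for right in self.right_index[key]]

    def add_right(self, key, value):
        self.right_index[key].append(value)
        return [(left, value) for left in self.left_index[key]]


join = SymmetricJoin()
matches = []
matches += join.add_left(1, "a1")   # no right rows yet: no output
matches += join.add_right(1, "b1")  # joins with the earlier left row
matches += join.add_left(1, "a2")   # joins with the earlier right row
matches += join.add_right(2, "b2")  # different key: no output
print(matches)
```

Because each new row is both stored and used as a probe, a row arriving on either side finds its partners with a single lookup.
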
Let's consider a simple example in which you join two tables in Pathway. Here, each table is built from a simulated stream of changes to its rows. The value in the `__time__` column represents the time at which the record arrives at the engine. To run the example with a real streaming source, it is enough to replace `pw.debug.table_from_markdown` with an appropriate [connector](/developers/user-guide/input-and-output-streams/connectors/) (such as the Redpanda or Kafka connector).
The tables are joined on the `instance` column.

In [1]:
import pathway as pw

table_a = pw.debug.table_from_markdown(
    """
    value | instance | __time__
      1   |    1     |     2
      2   |    1     |     6
      3   |    2     |     8
      4   |    2     |    12
    """
)
table_b = pw.debug.table_from_markdown(
    """
    value | instance | __time__
      11  |    1     |     4
      12  |    2     |     6
      13  |    1     |     8
    """
)

result = table_a.join(table_b, pw.left.instance == pw.right.instance).select(
    left_value=pw.left.value, right_value=pw.right.value, instance=pw.this.instance
)

pw.debug.compute_and_print(result)

[2023-11-17T12:16:50]:INFO:Preparing Pathway computation


            | left_value | right_value | instance
^MKZX8ZW... | 1          | 11          | 1
^Z3GFZNM... | 1          | 13          | 1
^G5A47FZ... | 2          | 11          | 1
^PSJ822X... | 2          | 13          | 1
^CRYASKY... | 3          | 12          | 2
^1J8DF7F... | 4          | 12          | 2


As you can see, records from both sides are joined with records that arrive later. This is expected, as Pathway incrementally updates all results to match changes in the input data. However, if `table_a` held `queries` on a `table_b` holding the `data` you want to query, you might be surprised to see the answers to your queries updated later, whenever `data` changes. Say you want to query the number of your website visits by location:

In [2]:
import pathway as pw

queries = pw.debug.table_from_markdown(
    """
    query_id |  country  | __time__
        1    |   France  |     4
        2    |   Poland  |     6
        3    |  Germany  |     8
        4    |      USA  |    14
    """
)
visits = pw.debug.table_from_markdown(
    """
     country | __time__
      Poland |    2
      France |    2
      Spain  |    2
      Poland |    2
      France |    4
         USA |    4
         USA |    4
     Germany |    6
         USA |    6
         USA |    8
      Poland |    8
      France |    8
      France |   12
     Germany |   14
    """
)
total_visits_by_country = visits.groupby(pw.this.country).reduce(
    pw.this.country, visits=pw.reducers.count()
)

answers = queries.join(
    total_visits_by_country, pw.left.country == pw.right.country
).select(pw.left.query_id, pw.this.country, pw.right.visits)

pw.debug.compute_and_print(answers)

[2023-11-17T12:16:50]:INFO:Preparing Pathway computation


            | query_id | country | visits
^EJPQJPQ... | 1        | France  | 4
^ECX80QV... | 2        | Poland  | 3
^ZVRZ99C... | 3        | Germany | 2
^B55DT4V... | 4        | USA     | 4


Note that the table shows the final answers. For example, the answer to the query with `query_id=3` (Germany, asked at time 8) includes the visit that arrived at time 14. This may be surprising if you're new to Pathway: the `join` keeps track of updates and revises earlier answers as the data changes. This has many uses, for instance a real-time alerting system. However, if that is not what you want and you'd like to get an answer to your query once, at its processing time, Pathway supports that as well!
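
How a regular join revises earlier answers can be traced with a small plain-Python replay of the two streams above. This is only a sketch of the semantics, not Pathway code. It assumes that rows sharing a timestamp are processed together, with visits applied before queries, and the helper name `join_update_history` is hypothetical.

```python
from collections import Counter

# Same events as in the example above: (country, time) and (query_id, country, time).
visits = [
    ("Poland", 2), ("France", 2), ("Spain", 2), ("Poland", 2),
    ("France", 4), ("USA", 4), ("USA", 4), ("Germany", 6), ("USA", 6),
    ("USA", 8), ("Poland", 8), ("France", 8), ("France", 12), ("Germany", 14),
]
queries = [(1, "France", 4), (2, "Poland", 6), (3, "Germany", 8), (4, "USA", 14)]


def join_update_history(queries, visits):
    """Replay both streams and record every change of a query's answer."""
    counts = Counter()   # visits per country seen so far
    active = {}          # query_id -> country, kept forever by a regular join
    last_answer = {}     # query_id -> last emitted count
    history = []
    times = sorted({t for _, t in visits} | {t for _, _, t in queries})
    for now in times:
        counts.update(country for country, t in visits if t == now)
        active.update({qid: country for qid, country, t in queries if t == now})
        for qid, country in sorted(active.items()):
            answer = counts[country]
            # Inner join: emit only when there is a match and the answer changed.
            if answer and last_answer.get(qid) != answer:
                last_answer[qid] = answer
                history.append((now, qid, country, answer))
    return history


history = join_update_history(queries, visits)
for update in history:
    print(update)
```

Each tuple is `(time, query_id, country, visits)`. The last entry per query matches the table printed above, and the earlier entries are the intermediate answers that the join later replaced.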

## Asof now join
Monitoring changes to the answers to your queries might not be what you want, especially if you have **a lot of** queries. If you want to get an answer for a query once and then forget it, you can use `asof_now_join`. Its left side is the queries table, and its right side is the data you want to query. Note that the right side is still a table that dynamically streams row changes. You can update it, but the updates only affect future queries; no old answers are updated.
Let's see what `asof_now_join` would return in the example above:

In [3]:
import pathway as pw

queries = pw.debug.table_from_markdown(
    """
    query_id |  country  | __time__
        1    |   France  |     4
        2    |   Poland  |     6
        3    |  Germany  |     8
        4    |      USA  |    14
    """
)
visits = pw.debug.table_from_markdown(
    """
     country | __time__
      Poland |    2
      France |    2
      Spain  |    2
      Poland |    2
      France |    4
         USA |    4
         USA |    4
     Germany |    6
         USA |    6
         USA |    8
      Poland |    8
      France |    8
      France |   12
     Germany |   14
    """
)
total_visits_by_country = visits.groupby(pw.this.country).reduce(
    pw.this.country, visits=pw.reducers.count()
)

answers = queries.asof_now_join(
    total_visits_by_country, pw.left.country == pw.right.country
).select(pw.left.query_id, pw.this.country, pw.right.visits)

pw.debug.compute_and_print(answers)

[2023-11-17T12:16:50]:INFO:Preparing Pathway computation


            | query_id | country | visits
^EJPQJPQ... | 1        | France  | 2
^ECX80QV... | 2        | Poland  | 2
^ZVRZ99C... | 3        | Germany | 1
^B55DT4V... | 4        | USA     | 4


In contrast to an ordinary `join`, `asof_now_join` is not symmetric. A new row on the left side of the join produces a result only if it can be joined with at least one row from the right side. If you want to produce at least one result for every query, you can use `asof_now_join_left`: when a query has no match, all columns from the right side of its output row are set to `None`. On the other hand, new rows on the right side of the join don't immediately produce any new rows in the output. They update the index, and if they are later matched with new records from the left side, they appear in the output.
Please note that for correct operation, the left table of the `asof_now_join` (`queries`) can only be extended with new queries. Pathway verifies this for you. You can't delete or update queries. This is quite reasonable: instead of updating a query, you can just send a new one, because the previous query has already been forgotten anyway.
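
The "answer once, then forget" semantics can be sketched in plain Python by replaying the streams from the example. This is not Pathway code. It assumes that visits sharing a query's timestamp are visible to that query, the helper name `asof_now_answers` is hypothetical, and query 5 is an extra made-up query added to show the unmatched case.

```python
# Same visits as in the example above: (country, time).
visits = [
    ("Poland", 2), ("France", 2), ("Spain", 2), ("Poland", 2),
    ("France", 4), ("USA", 4), ("USA", 4), ("Germany", 6), ("USA", 6),
    ("USA", 8), ("Poland", 8), ("France", 8), ("France", 12), ("Germany", 14),
]
# (query_id, country, time); query 5 is hypothetical and has no matching visits.
queries = [
    (1, "France", 4), (2, "Poland", 6), (3, "Germany", 8), (4, "USA", 14),
    (5, "Italy", 16),
]


def asof_now_answers(queries, visits, keep_unmatched=False):
    """Answer each query once, using only the visits seen up to its arrival."""
    answers = []
    for qid, country, now in sorted(queries, key=lambda q: q[2]):
        count = sum(1 for c, t in visits if c == country and t <= now)
        if count:
            answers.append((qid, country, count))
        elif keep_unmatched:
            # Mimics the left variant: right-side columns become None.
            answers.append((qid, country, None))
    return answers


inner = asof_now_answers(queries, visits)
left = asof_now_answers(queries, visits, keep_unmatched=True)
print(inner)
print(left)
```

The first list reproduces the `asof_now_join` table above and drops the unmatched query. The second keeps it with `None` in place of the visit count, which is the behavior described for the left variant.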

## KNN Index
An approximate [K Nearest Neighbors (KNN) Index](/developers/showcases/lsh/lsh_chapter1) behaves similarly to a join. The default method `get_nearest_items` maintains always up-to-date answers to all queries when the set of indexed documents changes. In fact, it uses a join under the hood.
If you don't want answers to your queries to be updated, you can use `get_nearest_items_asof_now` (experimental). It'll return the closest points once and will forget the query. However, it'll monitor the stream containing index data and update the index if new data arrives (but won't update old queries). As a result, if you ask the same query again and the index has changed in the meantime, you can get a different answer. This behavior is used in our [llm-app](/developers/showcases/llm-app-pathway/) to answer queries using an always up-to-date index of documents.
In the examples below, you can see the differences between these methods.

In [4]:
import pathway as pw
from pathway.stdlib.ml.index import KNNIndex

queries = pw.debug.table_from_markdown(
    """
    query_id |  x |  y | __time__
        1    |  0 |  0 |    4
        2    |  2 | -2 |    6
        3    | -1 |  1 |    8
        4    | -2 | -3 |    10
    """
).select(pw.this.query_id, coords=pw.make_tuple(pw.this.x, pw.this.y))

data = pw.debug.table_from_markdown(
    """
     x |  y | __time__
     2 |  2 |    2
     3 | -2 |    2
    -1 |  0 |    6
     1 |  2 |    8
    -3 |  1 |   10
     1 | -4 |   12
    """
).select(coords=pw.make_tuple(pw.this.x, pw.this.y))

index = KNNIndex(data.coords, data, n_dimensions=2, n_and=5)
result = queries + index.get_nearest_items(queries.coords, k=2).select(
    nns=pw.this.coords
)
pw.debug.compute_and_print(result)

[2023-11-17T12:16:55]:INFO:Preparing Pathway computation


            | query_id | coords   | nns
^X1MXHYY... | 1        | (0, 0)   | ((-1, 0), (1, 2))
^YYY4HAB... | 2        | (2, -2)  | ((1, -4), (3, -2))
^Z3QWT29... | 3        | (-1, 1)  | ((-3, 1), (-1, 0))
^3CZ78B4... | 4        | (-2, -3) | ((-1, 0), (1, -4))


In [5]:
index = KNNIndex(data.coords, data, n_dimensions=2, n_and=5)
result = queries + index.get_nearest_items_asof_now(queries.coords, k=2).select(
    nns=pw.this.coords
)
pw.debug.compute_and_print(result)

[2023-11-17T12:17:00]:INFO:Preparing Pathway computation


            | query_id | coords   | nns
^X1MXHYY... | 1        | (0, 0)   | ((2, 2), (3, -2))
^YYY4HAB... | 2        | (2, -2)  | ((-1, 0), (3, -2))
^Z3QWT29... | 3        | (-1, 1)  | ((-1, 0), (1, 2))
^3CZ78B4... | 4        | (-2, -3) | ((-3, 1), (-1, 0))


In the example above, 2-dimensional vectors were used to make the analysis simpler. The **llm-app** uses n-dimensional vectors but the general principle doesn't change.
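
The two result tables can be cross-checked with an exact brute-force search in plain Python. This sketch is not how `KNNIndex` works internally (that index is approximate). It assumes squared Euclidean distance and that data sharing a query's timestamp is visible to that query, and the helper name `nearest` is hypothetical.

```python
# Same points as in the example above: (x, y, arrival time).
data = [(2, 2, 2), (3, -2, 2), (-1, 0, 6), (1, 2, 8), (-3, 1, 10), (1, -4, 12)]
# (query_id, coords, arrival time)
queries = [(1, (0, 0), 4), (2, (2, -2), 6), (3, (-1, 1), 8), (4, (-2, -3), 10)]


def nearest(point, candidates, k=2):
    """Exact k nearest neighbors by squared Euclidean distance."""
    def dist2(c):
        return (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2
    # Ties are broken by the coordinates to keep the result deterministic.
    return sorted(candidates, key=lambda c: (dist2(c), c))[:k]


all_points = [(x, y) for x, y, _ in data]

# Always up-to-date answers: every query ends up seeing the final data set.
updated = {qid: nearest(p, all_points) for qid, p, _ in queries}

# "As of now" answers: each query sees only the data that arrived by its time.
asof_now = {
    qid: nearest(p, [(x, y) for x, y, t in data if t <= now])
    for qid, p, now in queries
}

print(updated)
print(asof_now)
```

Up to the order of neighbors within each row, `updated` agrees with the `get_nearest_items` output and `asof_now` with the `get_nearest_items_asof_now` output shown above.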

## Applications of `asof_now` indexes to data read using HTTP REST connector
If you want a more practical example, you can set up a webserver that answers queries. You can read our [llm-app](/developers/showcases/llm-app-pathway/) article to see how it can be done.

## Summary
In this article, you learned how indexing in Pathway differs from indexing in databases. Both approaches, keeping queries so that their answers are updated in the future and forgetting queries immediately after answering them, are useful. Which one to use depends on your objective, and Pathway provides methods for both.