In this cookbook, we will learn how to use Opsmate to manage knowledge.

Note: the knowledge management feature is at an early stage of development, so its features and UX are subject to change.
At the moment, two types of data source can be ingested as knowledge:

1. Any text-based files from your local file system or network-attached storage.
2. Any text-based files from GitHub repositories.

We use [lancedb](https://lancedb.github.io/lancedb/) as the underlying vector database to store the knowledge. We chose lancedb mainly for its serverless nature: it can use cloud storage as the backend, which reduces the cost of ownership.

Knowledge retrieval is done via the `KnowledgeRetrieval` tool, which is built into Opsmate.

## Environment variable-based configuration options

### FS_EMBEDDINGS_CONFIG

This is a JSON key-value pair where the key is the path to the directory to be ingested and the value is the glob pattern to match the files to be ingested.

Example usage:

```
FS_EMBEDDINGS_CONFIG='{"./docs/cookbooks": "*.md"}'
```

This will ingest all the markdown files in the `./docs/cookbooks` directory.
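To make the directory-to-glob mapping concrete, here is a minimal sketch of how such a JSON value could be expanded into a list of files. It is an illustration only, not Opsmate's actual implementation: `resolve_fs_config` is a hypothetical helper, and it assumes the pattern is applied relative to each directory with `pathlib` semantics.

```python
import json
from pathlib import Path


def resolve_fs_config(raw: str) -> dict[str, list[Path]]:
    """Expand a {directory: glob} JSON mapping into the matching files per directory.

    Hypothetical helper for illustration; not part of the Opsmate API.
    """
    mapping = json.loads(raw)
    resolved: dict[str, list[Path]] = {}
    for directory, pattern in mapping.items():
        # Path.glob matches relative to the directory. With pathlib, "*.md"
        # only matches the top level, while "**/*.md" also matches subdirectories.
        matches = Path(directory).glob(pattern)
        resolved[directory] = sorted(p for p in matches if p.is_file())
    return resolved
```

Calling `resolve_fs_config('{"./docs/cookbooks": "*.md"}')` returns the markdown files found directly under `./docs/cookbooks`. Whether Opsmate itself treats `*.md` as recursive is not stated here, so check the matched files if your documents live in nested directories.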

### GITHUB_EMBEDDINGS_CONFIG

This is a JSON key-value pair where the key is the `owner/repo:optional[branch]` and the value is the glob pattern to match the files to be ingested.

Example usage:

```
GITHUB_EMBEDDINGS_CONFIG='{"opsmate/opsmate": "*.md", "kubernetes/kubernetes:test-branch": "*.txt"}'
```

In the example above, the first entry will ingest all the markdown files in the `opsmate/opsmate` repository. The second entry will ingest all the text files in the `kubernetes/kubernetes` repository on the `test-branch` branch.

If the branch is not specified, it will default to `main`.
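The key format and the `main` default can be sketched as a small parser. This is a hedged illustration rather than Opsmate's own code: `RepoRef` and `parse_repo_key` are hypothetical names introduced here.

```python
from typing import NamedTuple


class RepoRef(NamedTuple):
    owner: str
    repo: str
    branch: str


def parse_repo_key(key: str, default_branch: str = "main") -> RepoRef:
    """Split an `owner/repo[:branch]` key into its parts (hypothetical helper)."""
    # Split off the optional branch first, so branch names containing "/" survive.
    repo_path, _, branch = key.partition(":")
    owner, sep, repo = repo_path.partition("/")
    if not (owner and sep and repo):
        raise ValueError(f"expected 'owner/repo[:branch]', got {key!r}")
    # Fall back to the default branch when none is given.
    return RepoRef(owner, repo, branch or default_branch)
```

With this sketch, `parse_repo_key("opsmate/opsmate")` resolves to the `main` branch, while `parse_repo_key("kubernetes/kubernetes:test-branch")` resolves to `test-branch`.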

**IMPORTANT**: A GitHub token must be set in the `GITHUB_TOKEN` environment variable.


### EMBEDDING_REGISTRY_NAME and EMBEDDING_MODEL_NAME

`EMBEDDING_REGISTRY_NAME` is the name of the embedding registry to use. It defaults to `openai`.

`EMBEDDING_MODEL_NAME` is the name of the embedding model to use. It defaults to `text-embedding-ada-002`.

LanceDB supports a wide range of embedding models. Refer to the [lancedb embedding documentation](https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/#text-embedding-functions) for more details.

### EMBEDDINGS_DB_PATH

`EMBEDDINGS_DB_PATH` is the path to the lancedb database. It defaults to `~/.data/opsmate-embeddings`.

The default points to the local file system, but lancedb supports a wide range of storage options. Refer to the [lancedb storage documentation](https://lancedb.github.io/lancedb/concepts/storage/) for more details; it includes a comprehensive diagram that walks through how to choose the right storage backend.
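As a quick reference, the embedding-related variables and their documented defaults can be summarised in a short sketch. `EmbeddingSettings` and `load_embedding_settings` are hypothetical names used for illustration, and the handling of `~` and of object-store URIs is an assumption, not a description of how Opsmate reads its configuration.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EmbeddingSettings:
    registry_name: str
    model_name: str
    db_path: str


def load_embedding_settings(env=None) -> EmbeddingSettings:
    """Read the embedding settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    db_path = env.get("EMBEDDINGS_DB_PATH", "~/.data/opsmate-embeddings")
    # Assumption: expand "~" only for local paths; URIs that contain "://"
    # (for example an object-store location) are passed through unchanged.
    if "://" not in db_path:
        db_path = str(Path(db_path).expanduser())
    return EmbeddingSettings(
        registry_name=env.get("EMBEDDING_REGISTRY_NAME", "openai"),
        model_name=env.get("EMBEDDING_MODEL_NAME", "text-embedding-ada-002"),
        db_path=db_path,
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check which values a given environment would produce without exporting anything.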

**WARNING**: Currently the ingestion chunk size is set to 1000 with an overlap of 0, and a recursive text splitter is the default chunking strategy. These values are hardcoded when ingestion is configured through environment variables, but we will support more flexible configuration in the future.
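For readers unfamiliar with recursive text splitting, the following is a simplified sketch of the idea with a chunk size of 1000 and no overlap. It is not the splitter Opsmate uses: `recursive_split` and its separator list are assumptions chosen for illustration, and the chunk size is measured in characters here, which may differ from the unit Opsmate applies.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, with no overlap.

    Tries the coarsest separator first and only falls back to finer ones for
    pieces that are still too long. The separator list must end with "".
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present; try the next, finer one.
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too long; split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap is 0, consecutive chunks share no text, so a sentence that straddles a chunk boundary is only partially visible in each chunk at retrieval time.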


## SDK-based data ingestion

You can also ingest knowledge via the SDK, which provides greater flexibility in terms of configuration.

In the example below, we ingest all the markdown files in the `docs/book/src` directory of the `kubernetes-sigs/kubebuilder` repository to learn about kubebuilder.

Note that this takes a while to complete and emits a lot of logs, so we are not going to run it here.

```python
from opsmate.ingestions import ingest_from_config
from opsmate.libs.config import Config

await ingest_from_config(Config(
  github_embeddings_config={
    "kubernetes-sigs/kubebuilder:master": "./docs/book/src/**/*.md"
  },
  categorise=False, # By default we categorise the knowledge into categories for better segmentation, but we disable it here for the sake of speed.
))
```

Once the kubebuilder knowledge is ingested, we can use the `KnowledgeRetrieval` tool to provide retrieval-augmented generation (RAG):

```python
from opsmate.tools import KnowledgeRetrieval

result = await KnowledgeRetrieval(query="how to do env test against a real cluster in kubebuilder using environment variables?").run()

print(result)
```


2025-01-16 22:33:09 [info     ] running knowledge retrieval tool query=how to do env test against a real cluster in kubebuilder using environment variables?
To run integration tests using `envtest` against an existing cluster in Kubebuilder, you can follow these steps utilizing environment variables:

1. **Set Environment Variables**: You need to specify several environment variables before executing your tests. Key variables include:
   - `USE_EXISTING_CLUSTER`: Set this to `true` to connect to an existing cluster instead of setting up a local one.
   - `KUBEBUILDER_ASSETS`: Point this to a directory containing the necessary binaries (`kube-apiserver`, `etcd`, `kubectl`). This should be set if you want the tests to use specific binaries rather than default ones.
   - Alternatively, you can set specific paths for the binaries:
     - `TEST_ASSET_KUBE_APISERVER`: Path to the `kube-apiserver` binary.
     - `TEST_ASSET_ETCD`: Path to the `et

## Future capabilities

* Right now the async-based knowledge ingestion is fairly naive and is not designed to run in a distributed, fault-tolerant manner. We need to design a more robust system to support this, potentially bringing in a heavyweight such as [Celery](https://docs.celeryq.dev/en/stable/), but ideally something easy to maintain and scale.
* We need to support more data source types, such as databases or other API-based data sources.
* Currently only text-based files are supported. We need to support more file types, such as images, videos, and other binary data.
