# 2b - Tuning and Debugging Queries

## Prerequisites

Complete <a href="../../../../nbclassic/notebooks/graphrag-toolkit/2-Querying.ipynb"><b>Exercise 2 - Querying</b></a> before beginning these additional exercises.

## Overview

The code cells below allow you to debug and tune the query engine.

These examples use the `LexicalGraphQueryEngine.retrieve()` method, rather than `query()`. This method fetches the search results, but doesn't pass them to an LLM to generate a natural-language response (i.e. it performs steps 1-3 described in the introduction to the **Exercise 2 - Querying** notebook).

With these examples, you can vary some of the query engine configuration options, and then view the search results without generating a response. If you do want to generate a response, there's a helper function, `to_response()`, which you can use to invoke the LLM.

### 🎯 2b.1 Debugging

Comment or uncomment the logger names passed to `set_logging_config()` in the cell below. These lines let you view debug output for:

  - openCypher graph queries
  - Entity network generation

In [None]:
%reload_ext dotenv
%dotenv

import os

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine, set_logging_config
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.storage.graph import NonRedactedGraphQueryLogFormatting

set_logging_config('DEBUG', [
    #'graphrag_toolkit.lexical_graph.storage.graph',
    #'graphrag_toolkit.lexical_graph.retrieval.query'
])

query = "What are the sales prospects for Example Corp in the UK?"

with (
    GraphStoreFactory.for_graph_store(
        os.environ['GRAPH_STORE'],
        # --------- openCypher query logging ---------
        log_formatting=NonRedactedGraphQueryLogFormatting()
    ) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):

    query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
        graph_store, 
        vector_store,
        tenant_id='ecorp',
        # --------- Search results config ---------
        max_search_results=5, # default 5
        max_statements_per_topic=10, # default 10
        max_statements=200, # default 200
        statement_pruning_factor=0.05, # default 0.05
        # --------- Entity network config ---------
        ec_max_contexts=3, # default 3
        ec_max_depth=3, # default 3
        ec_max_score_factor=10, # default 10
        ec_min_score_factor=0.1, # default 0.1
    ) 

    search_results = query_engine.retrieve(query)
    
print('Ready')

In [None]:
# results to be passed to LLM

for n in search_results:
    print(n.text)

In [None]:
# underlying, more detailed results

import json
for n in search_results:
    print(json.dumps(n.metadata, indent=2))

In [None]:
%run './misc/helpers.py'

to_response(search_results, query).print_response_stream()

### Search results configuration
When configuring search functionality, you can use the following parameters to control the number and quality of returned results:

#####  `max_search_results`
Defines the maximum number of search results to return. Each search result contains one or more statements that belong to the same topic (and source). If you set this to `None`, all matching search results will be returned. The default value is `5`.

#####  `max_statements_per_topic`
Controls how many statements can be included with a single topic, effectively limiting the size of each search result. If set to `None`, all statements belonging to the topic that match the search will be included in the result. The default value is `10`.

#####  `max_statements`
Limits the total number of statements across the entire result set. If you set this to `None`, all statements from all results will be returned. The default value is `100`.

#####  `statement_pruning_factor`
This parameter helps filter out lower-quality statements based on a percentage of the highest statement score in the entire set of results. Statements are scored by a reranker based on a combination of the original question, keywords, and entity network transcriptions. Any statement with a score less than `<maximum_statement_score> * statement_pruning_factor` will be removed from the results. The default value is `0.05` (5% of the maximum score).
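
The pruning rule can be illustrated with a small, self-contained sketch. It is not the toolkit's implementation: `prune_statements` is a hypothetical helper, and the scores are made up. It only applies the threshold formula stated above, so you can see how raising the factor removes more low-scoring statements.

```python
# Illustrative only: hypothetical helper and made-up scores, not toolkit code.

def prune_statements(scored_statements, statement_pruning_factor=0.05):
    """Keep statements whose score is at least max_score * statement_pruning_factor."""
    if not scored_statements:
        return []
    max_score = max(score for _, score in scored_statements)
    threshold = max_score * statement_pruning_factor
    # Statements scoring below the threshold are removed.
    return [(text, score) for text, score in scored_statements if score >= threshold]


scored = [
    ("statement A", 0.80),
    ("statement B", 0.30),
    ("statement C", 0.10),
    ("statement D", 0.03),
]

# Compare the default factor with the two larger values used in the exercises.
for factor in (0.05, 0.1, 0.2):
    kept = [text for text, _ in prune_statements(scored, factor)]
    print(factor, kept)
```

With these made-up scores, the default factor drops only the weakest statement, while a factor of `0.2` also drops `statement C`.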

#### When to modify search results configuration
The `max_search_results`, `max_statements_per_topic` and `max_statements` parameters allow you to control the overall size of the results.

Each search result comprises one or more statements belonging to one or more topics from a single source. Increasing `max_search_results` increases the variety of sources in your results. Increasing `max_statements_per_topic` adds more detail to each individual search result.

When increasing the number of statements (either overall or per topic), consider increasing `statement_pruning_factor` as well. This helps ensure that larger result sets still contain highly relevant statements rather than being padded with less relevant ones.

### 🎯 2b.2 Adjust the search results configuration parameters

Experiment with changing the search results configuration parameters, and review the impact on the results to be passed to the LLM (you don't need to generate a response, but you can if you want to):

  - Lower the `max_search_results` parameter. At what point does the response omit details of the blockage in the Turquoise Canal?
  - Increase `max_search_results` so that the results _include_ details of the blockage in the Turquoise Canal, and then set `max_statements_per_topic` to `5`, and then to `2`. Notice how the response remains relevant, but begins to lack detail.
  - Reset the search results configuration parameters to their defaults, and then adjust `statement_pruning_factor`, setting it first to `0.1` and then `0.2`. Notice how increasing this parameter helps eliminate irrelevant search results at the end of the list.
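
When working through these experiments, it can help to check programmatically whether a phrase still appears in the results. The following is a minimal sketch: `mentions` is a hypothetical helper (not part of the toolkit) that relies only on the `text` attribute already used in the cells above. The stand-in results are made up so that the sketch runs on its own.

```python
from types import SimpleNamespace


def mentions(results, phrase):
    """Return True if any result's text contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return any(phrase in (r.text or "").lower() for r in results)


# Stand-in objects for demonstration; in the notebook, pass `search_results` instead.
fake_results = [
    SimpleNamespace(text="UK retail demand is forecast to grow."),
    SimpleNamespace(text="A blockage in the Turquoise Canal delayed shipments."),
]

print(mentions(fake_results, "Turquoise Canal"))
print(mentions(fake_results[:1], "Turquoise Canal"))
```

In the notebook, calling `mentions(search_results, "Turquoise Canal")` after each run of the retrieval cell gives a quick yes/no answer as you vary the parameters.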

### Entity network context selection

You can configure entity network generation using the following parameters:

##### `ec_max_depth`
Determines the maximum number of entities in each entity network path. The default value is `3`.

##### `ec_max_contexts`
Limits the number of entity contexts returned by providers. Note: Multiple entity contexts may originate from the same root entity. The default value is `3`.

##### `ec_max_score_factor`
Filters out entities whose degree centrality exceeds a threshold based on a percentage of the degree centrality of the top entity. The default value is `10` (1000% of the top entity's degree centrality score).

##### `ec_min_score_factor`
Filters out entities whose degree centrality falls below a threshold based on a percentage of the degree centrality of the top entity. The default value is `0.1` (10% of the top entity's degree centrality score).
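
To make the two thresholds concrete, here is a minimal sketch of the filtering arithmetic described above. It is not the toolkit's implementation: `filter_by_score_factor`, the entity names, and the degree centrality values are all hypothetical, and the sketch assumes both thresholds are inclusive.

```python
# Illustrative only: hypothetical helper and made-up degree centrality values.

def filter_by_score_factor(top_score, candidates,
                           ec_max_score_factor=10, ec_min_score_factor=0.1):
    """Keep candidates whose score lies within the band defined by the top entity."""
    upper = top_score * ec_max_score_factor  # above this: treated as a 'whale'
    lower = top_score * ec_min_score_factor  # below this: treated as a 'minnow'
    return [name for name, score in candidates if lower <= score <= upper]


top_entity_score = 20
candidates = [("whale", 500), ("hub", 150), ("peer", 5), ("minnow", 1)]

# Defaults give a band of 2 to 200, so only 'hub' and 'peer' survive.
print(filter_by_score_factor(top_entity_score, candidates))

# Widening the band in both directions admits the whale and the minnow as well.
print(filter_by_score_factor(top_entity_score, candidates,
                             ec_max_score_factor=30, ec_min_score_factor=0.01))
```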

#### When to adjust entity network generation
The entity network context settings control how extensively the system searches for related content and how it filters results based on entity relationships. Increase the search scope to find structurally relevant but dissimilar content. Reduce the search scope to focus on content similar to the query.

  - A broad but shallow search (e.g. `ec_max_depth=1` and `ec_max_contexts=5`) helps explore diverse contexts focused on direct matches to the query
  - A deep but narrow search (e.g. `ec_max_depth=3` and `ec_max_contexts=2`) helps explore distantly related content through key entities
  - `ec_max_contexts=0` turns off entity network generation entirely
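
The options in the list above can be captured as keyword-argument presets. This is a sketch only: the preset names and the `entity_network_kwargs` helper are hypothetical, while the parameter names and values are those given in the list. Each preset is meant to be unpacked into the `LexicalGraphQueryEngine.for_traversal_based_search()` call shown earlier in this notebook.

```python
# Hypothetical preset names; parameter names and values come from the list above.
ENTITY_NETWORK_PRESETS = {
    "broad_shallow": {"ec_max_depth": 1, "ec_max_contexts": 5},
    "deep_narrow": {"ec_max_depth": 3, "ec_max_contexts": 2},
    "disabled": {"ec_max_contexts": 0},
}


def entity_network_kwargs(preset):
    """Return a copy of the chosen preset so callers can modify it safely."""
    return dict(ENTITY_NETWORK_PRESETS[preset])


# Usage inside the `with` block from the first code cell (not executed here):
# query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
#     graph_store,
#     vector_store,
#     tenant_id='ecorp',
#     **entity_network_kwargs("broad_shallow"),
# )
print(entity_network_kwargs("deep_narrow"))
```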

The `ec_max_score_factor` and `ec_min_score_factor` parameters allow you to filter out 'whales' and 'minnows' in proportion to the significance of the top entity.

`ec_max_score_factor` controls how prominently high-scoring distant entities appear in the search results. Higher values will include well-connected entities even if they're distantly related. Increase `ec_max_score_factor` when you want to see important entities that aren't directly connected.

`ec_min_score_factor` controls the inclusion of less significant distant entities. Lower values will result in the inclusion of rarely mentioned entities even if they're distantly related. Decrease `ec_min_score_factor` to find niche or uncommon connections.

### 🎯 2b.3 Adjust the entity network configuration parameters

Set the logging so that the debug messages from `graphrag_toolkit.lexical_graph.retrieval.query` are sent to the output. Then experiment with changing the entity network configuration parameters, and review the entity networks generated, and the impact on the results to be passed to the LLM (you don't need to generate a response, but you can if you want to):

  - Reduce the `ec_max_contexts` parameter in 1-step increments. At what point do the results stop showing details of the blockage in the Turquoise Canal?
  - Reset `ec_max_contexts` to its default value (`3`), and then experiment with reducing the `ec_max_depth` parameter in 1-step increments. Again, at what point do the results stop showing details of the blockage in the Turquoise Canal?
  
You can experiment with adjusting the `ec_max_score_factor` and `ec_min_score_factor`, but the dataset is too small for there to be much impact.