# Summarizing RFM Explanations with Snowflake Cortex
This short example shows how to use Snowflake Cortex Complete to summarize a KumoRFM explanation.

‼️ This notebook is expected to run in a Snowflake environment ‼️

In [None]:
!pip install -q kumoai==2.12.0

In [None]:
# kumo imports
import kumoai as kumo
import kumoai.experimental.rfm as rfm

# snowflake imports
from snowflake.snowpark import Session
from snowflake.cortex import complete

import pandas as pd

## Preparing the KumoRFM instance
Before we can make predictions with the Kumo relational foundation model, we first need to initialize it.

We will first connect to KumoRFM, which is running as a native application in Snowpark Container Services (SPCS). The steps below might differ depending on your specific setup/app.

‼️ Some code in the cells below is incomplete and needs to be adjusted to your setup. ‼️

In [None]:
# start by obtaining the active snowflake session
from snowflake.snowpark.context import get_active_session
session = get_active_session()

Set the app name:

In [None]:
app_name = "<KUMO APP NAME>"

Get the `dns_name`:

In [None]:
dns_name = "<YOUR DNS NAME>"

print(dns_name)

Finally, initialize KumoRFM with our `dns_name`:

In [None]:
import kumoai.experimental.rfm as rfm

rfm.init(url=f"http://{dns_name}:8000/api", api_key="<YOUR API KEY>")

## Making a prediction (with explanation)

We will make a simple prediction with explainability enabled. To learn more about how to structure data, create the graph, and write predictive queries, please refer to our other notebooks and documentation (https://kumo.ai/docs/rfm/RFM-quickstart/).

In [None]:
path = 's3://kumo-sdk-public/rfm-datasets/online-shopping'
df_dict = {
    'users': pd.read_parquet(f'{path}/users.parquet'),
    'items': pd.read_parquet(f'{path}/items.parquet'),
    'orders': pd.read_parquet(f'{path}/orders.parquet')
}

graph = rfm.LocalGraph.from_data(df_dict)
model = rfm.KumoRFM(graph, verbose=False)

We will make a simple _churn_ prediction by predicting whether a user will place any orders in the next 30 days. Setting `explain=True` in `model.predict()` additionally returns raw explainability scores from the model.

In [None]:
query = "PREDICT COUNT(orders.*, 0, 30, days) > 0 FOR users.user_id=32"
result = model.predict(query=query, verbose=True, explain=True)
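The query string above follows a fixed pattern, so it can be parameterized when explanations are needed for several users or horizons. The helper below is a hypothetical sketch (`churn_query` is not part of the Kumo SDK); it only reproduces the string pattern used in this notebook and makes no claim about other predictive query forms.

```python
def churn_query(user_id, horizon_days=30):
    """Build the predictive query string used in this notebook.

    Hypothetical helper: it only fills the user ID and the horizon
    into the exact query pattern shown above.
    """
    if horizon_days <= 0:
        raise ValueError("horizon_days must be positive")
    return (
        f"PREDICT COUNT(orders.*, 0, {horizon_days}, days) > 0 "
        f"FOR users.user_id={user_id}"
    )


# Reproduces the query string from the cell above.
print(churn_query(32))
```

The returned string can be passed as `query=` to `model.predict()` in the same way as the hard-coded query.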

We can inspect the prediction first:

In [None]:
result.prediction

We can additionally access the explanation details (raw scores) with `result.details`.

In [None]:
result.details

## Parsing Raw Explanations into Natural Language

Conveniently, we can use Snowflake Cortex to summarize the explanations in natural language. We will make use of Cortex `complete` (see https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex).

Let us first extract the relevant information from the RFM output:

In [None]:
# recall the query used for prediction
query = "PREDICT COUNT(orders.*, 0, 30, days) > 0 FOR users.user_id=32"

# get the predictions
prediction = result.prediction

# get cohorts
cohorts = result.details.cohorts

# get subgraph
subgraph = result.details.subgraphs[0]

In [None]:
# Construct the system prompt:
SYSTEM_PROMPT = """
Your goal is to extract meaningful insight from structured explanations of a prediction. You will be given

1. a predictive query which defines a predictive problem
2. the prediction for that particular query
3. column analysis
4. subgraph explanation
5. documentation on how to understand explanations (column analysis and subgraph explanation)

**Provide insight to the user, make reference to specific quantitative details which led the model to its prediction.**

# Explainability

KumoRFM explanations provide two complementary views of model predictions:

1. **Global View (Cohorts):** Column-level patterns across in-context examples that reveal what data characteristics drive predictions
1. **Local View (Subgraph):** Cell-level attribution scores showing which specific values in this entity's subgraph influenced the prediction

Together, these views answer: "What patterns does the model see globally?" and "Which specific data points matter for this prediction?"

## Understanding the Global View: Cohorts

Cohorts reveal how different value ranges or categories in columns correlate with prediction outcomes across all in-context examples.

- `table_name`: Which table this analysis covers
- `column_name`: Which column or statistic (e.g., `COUNT(*)`) this analysis covers
- `hop`: Distance from the entity table (0 = entity attributes, 1 = direct neighbors, 2 = second-degree neighbors, ...)
- `stype`: Semantic type (numerical, categorical, timestamp, etc.)
- `cohorts`: List of value ranges/categories (e.g., `["[0-5]", "(5-10]", "(10-20+]"]`)
- `populations`: Proportion of in-context examples in each cohort
- `targets`: Average prediction score within each cohort

High-impact columns usually have large variance in `targets` across different cohorts.

**Example for a churn predictive query:**

```
table_name: "orders"
column_name: "COUNT(*)"
hop: 1
cohorts: ["[0-0]", "(0-1]", "(1-2]", "(2-4]", "(4-6+]"]
populations: [0.20, 0.08, 0.07, 0.11, 0.54]
targets: [0.0, 0.78, 0.74, 0.64, 0.35]
```

**What this means:**

- Users with 0 orders have 0% churn risk (they already churned)
- Users with 1-2 orders have ~75% churn risk (early stage, not sticky)
- Users with 6+ orders have 35% churn risk (established, but not immune)
- Key insight: Order count is strongly predictive; more orders = lower churn

## Understanding the Local View: Subgraph

Subgraphs show the actual data neighborhood around the specific entity being predicted, with attribution scores indicating importance.
Node indices are different from primary keys and are mapped to a contiguous range from 0 to N.
The entity being predicted is guaranteed to have ID 0.
Some cells may have a `null` value with non-zero scores, indicating missingness itself is informative.

Each node represents a row from a table, containing:

- `cells`: Dictionary of column values with attribution scores
  - `value`: Actual data value
  - `score`: Gradient-based importance between 0 and 1 (higher = more influential)
- `links`: Connections to other nodes via foreign keys

Scores reflect how much changing this value would change the prediction.
High scores on specific cells explain "why this prediction, not another".

**Score Magnitude Interpretation:**

- 0.00 - 0.05: Negligible influence
- 0.05 - 0.15: Moderate influence
- 0.15 - 0.30: Strong influence
- 0.30+: Critical influence

**Example:**
```
cells: {
  "club_member_status": {value: "ACTIVE", score: 1.0},
  "age": {value: 49, score: 0.089},
  "fashion_news_frequency": {value: "Regularly", score: 0.411}
}
links: {
  "user_id->orders": [1,2,3,...,32]
}
```

**What this means:**

- Club membership status is the most important attribute (score=1.0).
- Fashion news subscription is also highly influential (score=0.411).
- Age has a moderate influence (score=0.089).
- User has 32 linked orders (indicates high activity).

You can follow paths in the subgraph to understand data connectivity and how tables/cells far away may contribute to the prediction.

## Connecting Global and Local Views

Often, you can understand high subgraph attribution scores by relating their cell values to the average prediction of the cohort.

1. **Find influential cells for the prediction in the local view:**
   Which cells have scores > 0.15?
1. **Locate entity in global context:**
   Find which cohorts the specific entity falls into and compare entity's values to high/low risk cohorts.
   Focus on highest-scoring cells and most divergent cohorts.
1. **Relate attribution score and cohort prediction:**
   Check if entity exhibits typical or atypical patterns.
1. **Find general global trends** in the data that might explain the prediction.
   Additionally, look for missing expected signals (why ISN'T something important?)

Tell a coherent story connecting global patterns to local evidence.
Use concrete numbers from the subgraph.
Avoid jargon; explain in business terms.

## Common Interpretation Pitfalls

- **Don't assume correlation = causation:**
  High scores show model importance, not real-world causality.
  For example, "black clothing" might correlate with churn, but color isn't the cause.
- **Consider data distribution:**
  Rare cohorts may show extreme `targets` with small `populations`.
  Focus on cohorts with both significant population AND divergent targets.
- **Missing cohort analysis:**
  Not all columns have a cohort analysis since some semantic types are unsupported.
  For example, text and ID columns typically only appear in local view.

--- 


"""

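The prompt above states that high-impact columns usually show a large variance in `targets` across cohorts. As a rough manual check, the sketch below ranks column analyses by the spread (max minus min) of their `targets`, which is a simpler proxy than variance. It operates on plain dictionaries that mirror the field names documented in the prompt; the actual structure of `result.details.cohorts` is an assumption here and may need adapting. `rank_by_target_spread` and the `users.age` entry are made up for illustration.

```python
def rank_by_target_spread(columns):
    """Rank column analyses by max(targets) - min(targets), largest first."""
    ranked = []
    for col in columns:
        spread = max(col["targets"]) - min(col["targets"])
        ranked.append((col["table_name"], col["column_name"], round(spread, 4)))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


example_columns = [
    {   # hypothetical low-signal column, invented for this illustration
        "table_name": "users",
        "column_name": "age",
        "targets": [0.50, 0.52, 0.49],
    },
    {   # the churn example from the prompt above
        "table_name": "orders",
        "column_name": "COUNT(*)",
        "targets": [0.0, 0.78, 0.74, 0.64, 0.35],
    },
]

# Columns with the largest spread are listed first.
for table, column, spread in rank_by_target_spread(example_columns):
    print(f"{table}.{column}: spread={spread}")
```

Note that this ranking ignores `populations`, so a rare cohort with an extreme target can dominate it; the pitfalls listed in the prompt apply to this check as well.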

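The "locate entity in global context" step described in the prompt can also be done by hand: given a value from the subgraph, find the cohort it falls into and read off that cohort's average target. The sketch below assumes cohort labels look like the ones in the prompt example (`"[0-0]"`, `"(2-4]"`, `"(4-6+]"`), with non-negative numbers and a trailing `+` meaning an open upper end. That label grammar is inferred from the example only, and `find_cohort` is a hypothetical helper.

```python
import re

# Matches labels such as "[0-0]", "(0-1]" or "(4-6+]" (non-negative numbers only).
_COHORT_RE = re.compile(r"^([\[(])(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(\+?)([\])])$")


def find_cohort(value, labels, targets):
    """Return (label, target) of the first cohort containing `value`, else None."""
    for label, target in zip(labels, targets):
        match = _COHORT_RE.match(label)
        if match is None:
            continue  # skip categorical or otherwise unrecognized labels
        opening, low, high, open_ended, closing = match.groups()
        low, high = float(low), float(high)
        # "[" / "]" are inclusive bounds, "(" / ")" are exclusive bounds.
        above_low = value >= low if opening == "[" else value > low
        if open_ended:
            below_high = True  # assumed: "+" marks an unbounded upper end
        else:
            below_high = value <= high if closing == "]" else value < high
        if above_low and below_high:
            return label, target
    return None


example_labels = ["[0-0]", "(0-1]", "(1-2]", "(2-4]", "(4-6+]"]
example_targets = [0.0, 0.78, 0.74, 0.64, 0.35]

# A user with 32 linked orders falls into the open-ended top cohort.
print(find_cohort(32, example_labels, example_targets))
```

Comparing the returned cohort target with the model's prediction for the entity indicates whether the entity behaves like a typical member of its cohort.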
We can now append the `query`, `prediction`, `cohorts`, and `subgraph` to the system prompt to get the full input that we will pass to the LLM for summarization.

In [None]:
prompt_text = SYSTEM_PROMPT
prompt_text += f"USER QUERY: {query}\n\n"
prompt_text += f"MODEL PREDICTION: {prediction}\n\n"
prompt_text += f"COLUMN ANALYSIS: {cohorts}\n\n"
prompt_text += f"SUBGRAPH EXPLANATION: {subgraph}"

In [None]:
print(prompt_text)

We can now summarize the explanation using the model of our choice!

In [None]:
cortex_summary = complete(
    model='mistral-large',  # or 'snowflake-arctic', 'llama3-70b', etc.
    prompt=prompt_text,
    session=session
)

In [None]:
print(cortex_summary)
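Cortex Complete is also available as the SQL function `SNOWFLAKE.CORTEX.COMPLETE`, documented on the page linked earlier. The sketch below shows one possible way to call it through the Snowpark session with bind parameters, which avoids manually escaping the long prompt text. It assumes a Snowpark version whose `Session.sql` accepts `params`; `summarize_with_sql` is a hypothetical helper name.

```python
def summarize_with_sql(session, prompt_text, model="mistral-large"):
    """Call SNOWFLAKE.CORTEX.COMPLETE through SQL using bind parameters.

    `session` is expected to be a Snowpark session (assumption: its
    `sql` method supports the `params` argument).
    """
    rows = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS summary",
        params=[model, prompt_text],
    ).collect()
    # The query returns a single row with a single column: the generated text.
    return rows[0][0]
```

Calling it with the active session and the assembled prompt string should return a summary comparable to the one produced by the Python `complete` call above.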