# KumoRFM quickstart

KumoRFM is a foundation model for machine learning on enterprise data. With just your data and a few lines of code, you can generate accurate predictions in real time, with no model training or pipelines required (see the [blog](https://kumo.ai/company/news/kumo-relational-foundation-model/) and the [paper](https://kumo.ai/research/kumo_relational_foundation_model.pdf)).


This notebook shows you how to use KumoRFM.

## Introduction

KumoRFM is grounded in three key worldviews:

### Worldview 1: Enterprise data is a graph

Enterprise data is a graph where tables are connected by keys.   
Below is an example database where the `ITEMS` and `ORDERS` tables are linked by `item_id`.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/relational-database.png"
       alt="relational database as a graph"
       width="500">
</div>

Once we structure enterprise data as a graph, we can apply pretrained graph transformers to extract insights and patterns.
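
As a rough illustration of the idea (not KumoRFM's internal representation), the sketch below builds two toy pandas DataFrames and follows the `item_id` key from each order to its item. The table contents and the `name` column are invented for this example.

```python
import pandas as pd

# Toy tables; all values are invented for illustration.
items = pd.DataFrame({"item_id": [1, 2], "name": ["mug", "lamp"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "item_id": [1, 1, 2]})

# Each order row points at exactly one item row through item_id,
# so every (order_id, item_id) pair is an edge between the two tables.
edges = list(zip(orders["order_id"].tolist(), orders["item_id"].tolist()))

# Following the edges is a many-to-one join on the shared key.
joined = orders.merge(items, on="item_id", how="left", validate="many_to_one")
print(edges)
print(joined)
```

Passing `validate="many_to_one"` makes pandas raise an error if `item_id` is not unique in `items`, which is the property a key on that side needs.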

### Worldview 2: With timestamps, we can place events on a timeline

By placing events on a timeline, we unlock the ability to model how things evolve over time. This makes it possible to select any point in time and predict what is likely to happen next, based on the sequence and patterns in historical data.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/timeline.png"
       alt="timeline"
       width="300">
</div>
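
A minimal sketch of what "selecting a point in time" means for a table of events, using plain pandas on invented data. Whether the window boundaries are inclusive is an assumption of this sketch, not a statement about KumoRFM.

```python
import pandas as pd

# Toy event log; dates and amounts are invented for illustration.
events = pd.DataFrame({
    "date": pd.to_datetime(["2024-08-01", "2024-09-10", "2024-09-25", "2024-11-05"]),
    "amount": [5.0, 7.5, 3.0, 9.0],
})

anchor = pd.Timestamp("2024-09-20")  # the chosen "now"
horizon = pd.Timedelta(days=30)      # how far ahead we want to look

# Events up to the anchor are the history a model may learn from.
history = events[events["date"] <= anchor]

# Events in (anchor, anchor + horizon] are what a prediction is about.
future = events[(events["date"] > anchor) & (events["date"] <= anchor + horizon)]

print(len(history), len(future))
```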

### Worldview 3: Machine learning tasks can be described by Predictive Query (pQuery)

All major machine learning tasks—regression, classification, recommendation—can be defined using a Predictive Query (pQuery).

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/predictive-query-multiple.png"
       alt="predictive query"
       width="600">
</div>

If you know SQL, pQuery will feel familiar right away.

Learn more about pQuery [here](https://kumo.ai/docs/pquery-structure).

### Let's get started!

## Step 1. Install an SDK

KumoRFM provides a Python [SDK](https://kumo-ai.github.io/kumo-sdk/docs/get_started/rfm/index.html).

Note: The Kumo SDK is available for Python 3.9 to Python 3.13.
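
If you are unsure which interpreter your notebook is running, a quick check along these lines can help. `is_supported` is a hypothetical helper written for this example; the version bounds are taken from the note above.

```python
import sys

# Supported range taken from the note above: Python 3.9 through 3.13.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 13)

def is_supported(version=sys.version_info):
    """Return True if the (major, minor) pair falls in the supported range."""
    return MIN_VERSION <= (version[0], version[1]) <= MAX_VERSION

print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {is_supported()}")
```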

In [None]:
!pip install kumoai

In [None]:
import os
import kumoai as kumo
import kumoai.experimental.rfm as rfm

## Step 2. Get an API key

You'll need an API key to make calls to KumoRFM.

Use the widget below to generate a key: click "Generate API Key". If you don't have a KumoRFM account, the widget will prompt you to sign up.

You'll see the following when your key has been created.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/api-key-created.png"
       alt="API key created"
       width="400">
</div>


In [None]:
if not os.environ.get("KUMO_API_KEY"):
    rfm.authenticate()

## Step 3. Initialize a client

If you completed Step 2 via the widget, you don't need to change anything: `KUMO_API_KEY` is already set as an environment variable.

If you obtained the API key from the website, set `KUMO_API_KEY` manually below.


In [None]:
# Initialize a Kumo client with your API key

KUMO_API_KEY = os.environ.get("KUMO_API_KEY")

rfm.init(
    api_key=KUMO_API_KEY,
)
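
If you want the notebook to fail early with a clear message when no key is available, one option is a small wrapper like the sketch below. `resolve_api_key` is a hypothetical helper written for this example and is not part of the Kumo SDK.

```python
import os

def resolve_api_key(explicit_key=None, env_var="KUMO_API_KEY"):
    """Return the explicit key if given, else the environment variable's value.

    Hypothetical helper for illustration; not part of the Kumo SDK.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found; set {env_var} or pass one explicitly.")
    return key
```

The returned value could then be passed as `api_key` to `rfm.init(...)`.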

## Step 4. Import your data

Currently, only pandas DataFrames are supported as input.

In [None]:
import pandas as pd

# pd.read_csv works the same way for CSV files.
# Reading s3:// paths with pandas requires the s3fs package.

users_df = pd.read_parquet('s3://kumo-sdk-public/rfm-datasets/online-shopping/users.parquet')
items_df = pd.read_parquet('s3://kumo-sdk-public/rfm-datasets/online-shopping/items.parquet')
orders_df = pd.read_parquet('s3://kumo-sdk-public/rfm-datasets/online-shopping/orders.parquet')

In [None]:
# Example: Inspect a pandas dataframe
users_df.head(3)

In [None]:
# Example: Inspect a pandas dataframe's dtype
users_df.dtypes

In [None]:
# Example: Change a column to int dtype
# Here user_id should already be inferred correctly (no change needed).
# This pattern comes in handy when you need to change other columns.
users_df['user_id'] = users_df['user_id'].astype(int)
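
The same idea applies to date columns that arrive as strings. The sketch below uses a toy DataFrame (all values invented) and plain pandas to convert a string column to a datetime dtype; it makes no claim about which dtypes KumoRFM itself requires.

```python
import pandas as pd

# Toy frame; the string-typed `date` column is invented for illustration.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "date": ["2024-09-01", "2024-09-15", "not a date"],
})

# errors="coerce" turns unparseable values into NaT instead of raising,
# which makes bad rows easy to find and inspect.
toy_orders["date"] = pd.to_datetime(toy_orders["date"], errors="coerce")

print(toy_orders.dtypes)
print(toy_orders[toy_orders["date"].isna()])
```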

## Step 5. Create local tables

How it works: A `LocalTable` is a lightweight wrapper around a pandas DataFrame that carries the metadata KumoRFM needs.

A local table defines three critical things about the underlying data:

(1) **Semantic type**: how the data will be encoded. For instance, an integer column can be encoded as an ID, a numerical value, or a category, depending on what the data actually means (see the quick reference at the end of this section).

(2) **Primary key**: the column that other tables use to link to this table (e.g. `user_id` is the primary key of the `users` table).

(3) **Time column**: the column that records when an event happens (e.g. a timestamp).

In [None]:
# KumoRFM infers most metadata correctly,
# but it is still worth inspecting the inferred results.

users_table = rfm.LocalTable(df=users_df, name="users").infer_metadata()
items_table = rfm.LocalTable(df=items_df, name="items").infer_metadata()
orders_table = rfm.LocalTable(df=orders_df, name="orders").infer_metadata()

In [None]:
# Tip: For more explicit control, assign metadata manually when creating
# the table instead of relying on automatic inference.
orders_table = rfm.LocalTable(
    df=orders_df,
    name="orders",
    primary_key="order_id",
    time_column="date"
)


In [None]:
# Example: Inspect a local table
users_table.print_metadata()

In [None]:
# Example: Update local table metadata

# Set stype
users_table['user_id'].stype = "ID"
users_table['age'].stype = "numerical"

# Set primary key
users_table.primary_key = "user_id"

# Set time column
orders_table.time_column = "date"

Quick Reference:

(1) `stype` (semantic type)
- `stype` determines how the column is encoded.
- `stype` determines the column's eligibility for special roles such as `is_time_column` and `is_primary_key`.
- Setting each column's `stype` correctly is critical for model performance. For instance, the column you predict should be `numerical` for a regression task, `categorical` for a classification task, and `ID` for a link-prediction (e.g. item recommendation) task.

(2) `is_primary_key`
- Indicates the column used as the primary key to link with other tables. For instance, `user_id` should be the primary key for `users_table`.
- In a relational database, primary keys must be unique. If there are duplicate primary keys, the system keeps only one.
- `is_primary_key` can only be assigned to a column with the `ID` stype.
- Each table can have at most one primary key column.

(3) `is_time_column`
- Indicates the timestamp column that records when an event happens.
- `is_time_column` can only be assigned to a column with the `timestamp` stype.
- Each table can have at most one time column.
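
Because only one row is kept per duplicated primary key, it can be worth checking uniqueness yourself before creating a table, so that you decide which row survives. The sketch below is a plain pandas check on an invented toy table.

```python
import pandas as pd

# Toy users table with a deliberately duplicated user_id.
toy_users = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [34, 28, 29, 41],
})

# True only if every value in the candidate key column is distinct.
print(toy_users["user_id"].is_unique)

# keep=False marks every row that shares its key with another row.
duplicates = toy_users[toy_users["user_id"].duplicated(keep=False)]
print(duplicates)

# One way to resolve duplicates explicitly: keep the first row per key.
deduped = toy_users.drop_duplicates(subset="user_id", keep="first")
```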


Quick reference on `stype`:

| Type             | Explanation                                                                           | Example                                       |
|------------------|---------------------------------------------------------------------------------------|-----------------------------------------------|
| numerical        | Numerical values (e.g. price, age)                                                    | 25, 3.14, -10                                 |
| categorical      | Discrete categories with limited cardinality; one field holds exactly one category   | Color: red, blue, green                       |
| multicategorical | Multiple categories in a single field; one field can hold several categories at once | "Action, Drama" or "Comedy, Action, Thriller" |
| ID               | Unique identifiers                                                                    | user_id: 123, product_id: PRD-8729453         |
| text             | Natural language text                                                                 | descriptions, sentences                       |
| timestamp        | Specific point in time                                                                | 2025-07-11 09:47:58                           |
| sequence         | Embedding                                                                             | [0.25, -0.75, 0.50, ...] (text embedding)     |


## Step 6. Create a graph in two simple steps

How do you get started with a graph? What tables should you include?

A good guiding principle is to start simple: begin with just the minimal set of tables needed to support the prediction task you care about. Focus on the core entities and relationships essential to prediction.

For example, suppose your goal is to predict a user's future orders. At a minimum, your graph only needs two tables:

- `users` – representing each user
- `orders` – representing the orders placed by those users

This minimal setup forms a usable graph for prediction. From there, you can gradually add complexity. For instance, you might later introduce an `items` table, so that RFM can take into account item information.


### (1) Select the tables

In [None]:
graph = rfm.LocalGraph(tables=[users_table, orders_table, items_table])

### (2) Link the tables

In [None]:
graph.link(src_table="orders", fkey="user_id", dst_table="users")

# Interpretation:
# The `orders` table (src_table) has a column named `user_id` (the foreign key)
# that links to the primary key of the `users` table (dst_table).
# Notes:
# (1) You don't need to specify the primary key column here because it is
#     already known from the `users` table metadata.
# (2) You cannot swap the two tables in the graph.link(...) call, because
#     src_table is where the foreign key lives and dst_table is where the
#     referenced primary key resides.

In [None]:
graph.link(src_table="orders", fkey="item_id", dst_table="items")
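
Before linking, it can be useful to confirm that the foreign key values in the source table actually appear as primary keys in the destination table. The sketch below is a plain pandas check on invented toy tables; how KumoRFM treats unmatched keys is not covered here.

```python
import pandas as pd

# Toy tables; values are invented. Order 103 references a missing user.
toy_users = pd.DataFrame({"user_id": [1, 2, 3]})
toy_orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "user_id": [1, 1, 3, 9],
})

# A foreign key value is "dangling" if it has no matching primary key.
dangling = toy_orders[~toy_orders["user_id"].isin(toy_users["user_id"])]
print(dangling)
```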

In [None]:
# Inspect the graph

# Method 1: Visualize the graph
graph.visualize()

In [None]:
# Method 2: print the graph, it'll show a simplified version, suitable for users who are already familiar with graph
graph.print_metadata()
graph.print_links()

### (3) Update links as needed

In [None]:
# Remove link
graph.unlink(src_table="orders", fkey="user_id", dst_table="users")

# Add link
graph.link(src_table="orders", fkey="user_id", dst_table="users")

## Step 7. Write a predictive query

Note: The data is synthetic, and the queries and results are for demonstration only. We encourage you to benchmark the model on your own data.

In [None]:
# create the model for your graph

model = rfm.KumoRFM(graph)

### Example 1A. Forecast 30-day product demand

In [None]:
query = "PREDICT SUM(orders.price, 0, 30, days) FOR items.item_id = 42"

prediction_result = model.predict(query)
print(prediction_result)

# How to interpret the result:
# 1. ENTITY: the item (item_id = 42)
# 2. ANCHOR_TIMESTAMP: the time the prediction is made from (here 2024-09-19)
# - the prediction covers the window (2024-09-19, 2024-10-18]
# - by default, the anchor time is the max(timestamp) in the temporal graph
# 3. TARGET_PRED: the predicted sum of orders.price for item_id = 42 over the next 30 days
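
To make the target concrete, the sketch below computes, on invented toy data, the realized quantity that a query like the one above asks the model to predict: the sum of `price` for one item within 30 days after an anchor. The exact boundary handling, `(anchor, anchor + 30 days]`, is an assumption of this sketch.

```python
import pandas as pd

# Toy orders; all values are invented for illustration.
toy_orders = pd.DataFrame({
    "item_id": [42, 42, 42, 7],
    "price": [10.0, 20.0, 5.0, 99.0],
    "date": pd.to_datetime(["2024-09-10", "2024-09-25", "2024-10-30", "2024-09-26"]),
})

anchor = pd.Timestamp("2024-09-19")
window_end = anchor + pd.Timedelta(days=30)

# Rows for item 42 whose date falls in (anchor, anchor + 30 days].
in_window = (
    (toy_orders["item_id"] == 42)
    & (toy_orders["date"] > anchor)
    & (toy_orders["date"] <= window_end)
)
actual_total = toy_orders.loc[in_window, "price"].sum()
print(actual_total)
```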

#### You can use the result for sales forecasting.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/sales-forecasting.png"
       alt="sales forecast"
       width="500">
</div>


### Example 1B. Forecast 30-day product demand, with an anchor_time

By default, predictions are anchored at the maximum timestamp in the temporal graph.

However, you can explicitly set a historical `anchor_time` to simulate what a prediction would have looked like at that point in time.

For instance, if `anchor_time` is '2024-09-20', the model predicts the product demand over the next 30 days as if today were '2024-09-20'.
KumoRFM only uses information from before the anchor time, which avoids data leakage.

This feature can be useful when you're evaluating model performance on time-based splits.

In [None]:
prediction_result_with_anchor_time = model.predict(query, anchor_time=pd.Timestamp('2024-09-20'))
print(prediction_result_with_anchor_time)
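
A time-based evaluation could be organized roughly as in the sketch below: choose several historical anchors, compute the realized 30-day total after each one, and compare it with a prediction made at that anchor. Everything here is invented toy data, and the "previous 30 days" baseline is only a stand-in for the value you would get from `model.predict(query, anchor_time=anchor)`.

```python
import pandas as pd

# Toy daily orders: price 1.0 per day until the end of August, 2.0 afterwards.
dates = pd.date_range("2024-06-01", "2024-10-31", freq="D")
prices = [1.0 if d < pd.Timestamp("2024-09-01") else 2.0 for d in dates]
toy_orders = pd.DataFrame({"date": dates, "price": prices})

def window_sum(df, start, end):
    """Sum `price` over rows with start < date <= end."""
    mask = (df["date"] > start) & (df["date"] <= end)
    return float(df.loc[mask, "price"].sum())

horizon = pd.Timedelta(days=30)
anchors = pd.to_datetime(["2024-07-31", "2024-08-31", "2024-09-30"])

rows = []
for anchor in anchors:
    actual = window_sum(toy_orders, anchor, anchor + horizon)
    # Stand-in for a model prediction: the previous 30 days' total.
    baseline = window_sum(toy_orders, anchor - horizon, anchor)
    rows.append({"anchor": anchor, "actual": actual, "baseline": baseline})

report = pd.DataFrame(rows)
report["abs_error"] = (report["actual"] - report["baseline"]).abs()
print(report)
print("MAE:", report["abs_error"].mean())
```

The baseline only looks at data before each anchor, mirroring the rule that nothing after the anchor may inform the prediction.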

### Example 2. Predict customer churn

In [None]:
# Predict the likelihood that users (user_id = 42, 123) will place 0 orders
# in the next 90 days.
query = "PREDICT COUNT(orders.*, 0, 90, days) = 0 FOR users.user_id IN (42, 123)"

prediction_result = model.predict(query)
print(prediction_result)

# How to interpret the result:
# 1. ENTITY: the user (user_id = 42 or 123)
# 2. ANCHOR_TIMESTAMP: the time the prediction is made from; the query asks what happens in the following 90 days
# 3. TARGET_PRED: whether the event (COUNT(orders.*, 0, 90, days) = 0) will happen (True: it will; False: it will not)
# 4. False_PROB: the probability that the event will not happen
# 5. True_PROB: the probability that the event will happen
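
One way to act on such probabilities is to apply a threshold and rank users by risk. The sketch below uses an invented toy DataFrame that mimics the `ENTITY`, `True_PROB` and `False_PROB` columns described above; the real result may contain additional columns, and the 0.6 threshold is a hypothetical business rule.

```python
import pandas as pd

# Toy stand-in for a churn prediction result; the probabilities are invented.
toy_result = pd.DataFrame({
    "ENTITY": [42, 123, 7],
    "True_PROB": [0.82, 0.35, 0.61],
})
toy_result["False_PROB"] = 1.0 - toy_result["True_PROB"]

# Hypothetical business rule: contact users whose churn probability
# is at least 0.6, highest risk first.
CHURN_THRESHOLD = 0.6
at_risk = (
    toy_result[toy_result["True_PROB"] >= CHURN_THRESHOLD]
    .sort_values("True_PROB", ascending=False)
)
print(at_risk["ENTITY"].tolist())
```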

#### You can use the result to prevent customer churn (e.g. sending a personalized coupon).

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/churn.png"
       alt="customer churn"
       width="500">
</div>


### Example 3. Product recommendation

In [None]:
# Predict the top 10 items that user (user_id = 123) is likely to buy
# in the next 30 days.
query = "PREDICT LIST_DISTINCT(orders.item_id, 0, 30, days) FOR users.user_id=123"

prediction_result = model.predict(query)
print(prediction_result)

# How to interpret the result:
# 1. ENTITY: the user (user_id = 123)
# 2. ANCHOR_TIMESTAMP: assuming we are predicting at this moment in time, what's happening in the next 30 days?
# 3. CLASS: the items (item_id)
# 4. SCORE: Higher score indicates higher likelihood.
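
Since a higher score means a higher likelihood, a shortlist can be produced by sorting on `SCORE`. The sketch below uses an invented toy DataFrame with the `ENTITY`, `CLASS` and `SCORE` columns described above; `top_k` is a hypothetical helper, not an SDK function.

```python
import pandas as pd

# Toy stand-in for a recommendation result; scores are invented.
toy_result = pd.DataFrame({
    "ENTITY": [123, 123, 123, 123],
    "CLASS": [501, 502, 503, 504],
    "SCORE": [0.10, 0.75, 0.40, 0.90],
})

def top_k(result, k):
    """Return the k highest-scoring rows per entity, best first."""
    ranked = result.sort_values(["ENTITY", "SCORE"], ascending=[True, False])
    return ranked.groupby("ENTITY").head(k)

print(top_k(toy_result, k=2))
```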

#### You can use the result to power product recommendation.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/product-recommendation.png"
       alt="product recommendation"
       width="500">
</div>


### Example 4. Infer entity attributes

In [None]:
# Predict the age of the user with user_id = 8 (the original age field is NA for this user)
query = "PREDICT users.age FOR users.user_id = 8"

prediction_result = model.predict(query)
print(prediction_result)

# How to interpret the result:
# 1. ENTITY: the user (user_id = 8)
# 2. ANCHOR_TIMESTAMP: the time the prediction is made from
# 3. TARGET_PRED: users.age
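
Predicted attributes can be merged back into the original table to fill gaps. The sketch below does this on invented toy data, keeping a flag so that imputed values stay distinguishable from recorded ones; the `ENTITY` and `TARGET_PRED` column names follow the description above.

```python
import pandas as pd

# Toy users table; user 8 has no recorded age.
toy_users = pd.DataFrame({"user_id": [7, 8, 9], "age": [31.0, None, 45.0]})

# Toy stand-in for a prediction result; the predicted value is invented.
toy_result = pd.DataFrame({"ENTITY": [8], "TARGET_PRED": [38.2]})

# Look up each user's prediction by id, then use it only where age is missing.
predicted = toy_users["user_id"].map(toy_result.set_index("ENTITY")["TARGET_PRED"])
toy_users["age_filled"] = toy_users["age"].fillna(predicted)
toy_users["age_is_imputed"] = toy_users["age"].isna()
print(toy_users)
```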

#### You can use the result for customer segmentation.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/customer-segmantation.png"
       alt="customer segmentation"
       width="500">
</div>


## We'd love to hear from you!

1. **Found a bug or have a feature request?**  

Submit issues directly on [GitHub](https://github.com/kumo-ai/kumo-rfm). Your feedback helps us improve RFM for everyone.

2. **Built something cool with RFM? We'd love to see it!**  

Share your project on LinkedIn and tag @kumo.  

We regularly spotlight community projects on our official channels, and yours could be next!



<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/kumo_ai_logo.jpeg"
       alt="kumo"
       width="30">
</div>

