# `KumoRFM` Hands-on

[KumoRFM](https://kumorfm.ai) provides an [SDK](https://kumo-ai.github.io/kumo-sdk/docs/get_started/rfm/index.html) in Python.
The Kumo SDK is available for Python 3.9 to Python 3.13.

In [None]:
!pip install kumoai --pre --upgrade

In [None]:
import kumoai.experimental.rfm as rfm

**Note:** The API of `kumoai.experimental.rfm` may change in the near future.

You will need an API key to make calls to KumoRFM.
Use the widget below to generate one for free by clicking **"Generate API Key"**.
If you don't have a KumoRFM account, the widget will prompt you to sign up.

In [None]:
import os

if not os.environ.get("KUMO_API_KEY"):
    rfm.authenticate()

We are now ready to initialize the Kumo SDK using our newly created API key:

In [None]:
rfm.init()

## Dataset Creation

In this hands-on session, we are using the **Super Mario Maker** dataset from [Kaggle](https://www.kaggle.com/datasets/leomauro/smmnet) to playfully introduce you to the world of KumoRFM.

<img src="https://www.historyassociates.com/wp-content/uploads/2021/04/Mario-Maker.jpg" width="700" />

The Super Mario Maker dataset contains over 115K game maps created in Super Mario Maker, over 880K players, and over 7M interactions between players and maps.
A player can interact with a map in five ways: **(1)** create a game map; **(2)** play a map created by another player; **(3)** "clear" a map by completing its challenge; **(4)** be the first player to clear a map; and **(5)** "like" a game map at any time. The dataset also records temporal changes for each game map.

KumoRFM interacts with a set of `pandas.DataFrame` objects, so let's download and import the dataset into `pandas`:

In [None]:
import pandas as pd

root = 's3://kumo-sdk-public/rfm-datasets/super_mario_maker'

# Player data:
players = pd.read_parquet(f'{root}/players.parquet')
# Game map data:
maps = pd.read_parquet(f'{root}/maps.parquet')
# Plays over time:
plays = pd.read_parquet(f'{root}/plays.parquet')
# Clears over time:
clears = pd.read_parquet(f'{root}/clears.parquet')
# Likes over time:
likes = pd.read_parquet(f'{root}/likes.parquet')
# Records over time:
records = pd.read_parquet(f'{root}/records.parquet')
# Temporal changes on game maps:
maps_meta = pd.read_parquet(f'{root}/maps_meta.parquet')

Let's study and analyze the dataset to get familiar with it:

#### Player Data

Players are defined by their name (`name`) and nationality (`flag`):

In [None]:
display(players.head(n=3))

The United States and Japan make up the largest share of the player base:

In [None]:
players['flag'].value_counts().head(n=10).plot(kind='bar', figsize=(4, 2));

#### Game Map Data

<img src="https://mario.wiki.gallery/images/thumb/f/f0/Facebook_Nintendo_2015-10-05_SMM_course_1.jpg/250px-Facebook_Nintendo_2015-10-05_SMM_course_1.jpg" width="250" />
<img src="https://mario.wiki.gallery/images/thumb/9/9f/Facebook_Nintendo_2015-10-05_SMM_course_4.jpg/250px-Facebook_Nintendo_2015-10-05_SMM_course_4.jpg" />
<img src="https://mario.wiki.gallery/images/thumb/e/e0/Facebook_Nintendo_2015-10-05_SMM_course_5.jpg/250px-Facebook_Nintendo_2015-10-05_SMM_course_5.jpg" />

Game maps are defined by their difficulty (`difficulty`), game style (`game_style`) and title (`title`).
In addition, each map references its maker (`maker_player_id`) and creation time (`creation_time`):

In [None]:
display(maps.head(n=3))

A game map can have four different levels of difficulty:

In [None]:
maps['difficulty'].value_counts().plot(kind='bar', figsize=(2, 2));

A game map can have four different styles:

In [None]:
maps['game_style'].value_counts().plot(kind='bar', figsize=(2, 2));

#### Interaction Tables

In addition to players and maps, we have rich interaction data connecting the two entity tables:

* `plays`: Plays over time
* `clears`: Clears over time
* `likes`: Likes over time
* `records`: Records over time

Each of these interactions holds a timestamp (`time`) and the pair of player (`player_id`) and map (`map_id`) it belongs to:

In [None]:
from IPython.display import Markdown

display(Markdown("### Plays over Time"))
display(plays.head(n=3))
display(Markdown("### Clears over Time"))
display(clears.head(n=3))
display(Markdown("### Likes over Time"))
display(likes.head(n=3))
display(Markdown("### Records over Time"))
display(records.head(n=3))

Lastly, we have access to temporal changes on game maps, *e.g.*, the number of players, clears, attempts, stars or tweets:

In [None]:
display(Markdown("### Temporal Changes on Game Maps"))
display(maps_meta.head(n=3))

Each game map is also assigned one of 15 different tags:

In [None]:
maps_meta['tag'].value_counts().plot(kind='bar', figsize=(8, 2));

We are now ready to convert this dataset into a graph for `KumoRFM` to predict on.

We use the handy short-cut `rfm.LocalGraph.from_data(...)` to instantiate tables and inter-connect them by automatically inferring semantic types, primary keys, time columns and foreign keys:

In [None]:
df_dict = {
    'players': players,
    'maps': maps,
    'plays': plays,
    'clears': clears,
    'likes': likes,
    'records': records,
    'maps_meta': maps_meta,
}

graph = rfm.LocalGraph.from_data(df_dict)

We can see that `KumoRFM` correctly inferred `player_id` and `map_id` as the primary keys in the `players` and `maps` tables, respectively.
In addition, it correctly identified the time columns of all interaction tables.
Lastly, it inferred the foreign key to primary key links across all interaction tables.
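
If you want to double-check inferred keys by hand, the following is a minimal sketch in plain `pandas`. It does not use or reproduce KumoRFM's inference logic; the toy tables and the helper names `is_valid_primary_key` and `dangling_foreign_keys` are hypothetical and only illustrate what a valid primary key and a valid foreign key link look like:

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
toy_players = pd.DataFrame({
    'player_id': [0, 1, 2],
    'flag': ['US', 'JP', 'BR'],
})
toy_plays = pd.DataFrame({
    'player_id': [0, 1, 1, 2],
    'map_id': [0, 0, 1, 1],
    'time': pd.to_datetime(
        ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04']),
})


def is_valid_primary_key(df: pd.DataFrame, column: str) -> bool:
    """A primary key must be unique and free of missing values."""
    return bool(df[column].is_unique and not df[column].isna().any())


def dangling_foreign_keys(child: pd.DataFrame, fkey: str,
                          parent: pd.DataFrame, pkey: str) -> int:
    """Count rows in `child` whose foreign key has no match in `parent`."""
    return int((~child[fkey].isin(parent[pkey])).sum())


pk_ok = is_valid_primary_key(toy_players, 'player_id')
dangling = dangling_foreign_keys(
    toy_plays, 'player_id', toy_players, 'player_id')
print(f'primary key valid: {pk_ok}, dangling foreign keys: {dangling}')
```

The same two checks can be run against the real `players`, `maps` and interaction tables before building the graph.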

**Note:** If metadata is incomplete or wrong, everything can be conveniently changed at this point in time.

We can visualize the graph in order to confirm that links have been set up correctly:

In [None]:
graph.visualize(show_columns=False);

It is also good practice to double-check the semantic types of columns, which will be used as features within the model downstream.
As such, correctly setting each column's semantic type (`stype`) is critical for model performance.

The following semantic types are supported:

| Type | Explanation | Example |
|------|-------------|---------|
| `"numerical"` | Numerical values (*e.g.*, `price`, `age`) | `25`, `3.14`, `-10` |
| `"categorical"` | Discrete categories with limited cardinality | Color: `"red"`, `"blue"`, `"green"` (one cell may only have one category) |
| `"multicategorical"` | Multiple categories in a single cell | `"Action\|Drama\|Comedy"`, `"Action\|Thriller"` |
| `"ID"` | An identifier, *e.g.*, primary keys or foreign keys | `user_id: 123`, `product_id: PRD-8729453` |
| `"text"` | Natural language text | Descriptions |
| `"timestamp"` | Specific point in time | `"2025-07-11"`, `"2023-02-12 09:47:58"` |
| `"sequence"` | Custom embeddings or sequential data  | `[0.25, -0.75, 0.50, ...]` |

We can inspect the semantic types of each table via `table.print_metadata()`:

In [None]:
for table in graph.tables.values():
    table.print_metadata()

Almost everything looks good! However, the `tweets` column in `maps_meta` is inferred to hold `"categorical"` data since its total cardinality is low:

In [None]:
maps_meta['tweets'].value_counts().plot(kind='bar', figsize=(2, 2));
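
To spot similar columns yourself, it can help to tabulate dtype and cardinality per column. The sketch below uses a made-up toy table; how KumoRFM actually decides between `"numerical"` and `"categorical"` is not described here, so treat the summary as a manual inspection aid rather than a reproduction of its rule:

```python
import pandas as pd

# Toy stand-in for `maps_meta` (values are made up for illustration).
toy_meta = pd.DataFrame({
    'tweets': [0, 0, 1, 0, 2, 1, 0, 3],
    'attempts': [12, 40, 7, 95, 3, 18, 66, 21],
})

# Per-column dtype, number of distinct values, and distinct-value ratio.
summary = pd.DataFrame({
    'dtype': toy_meta.dtypes.astype(str),
    'num_unique': toy_meta.nunique(),
    'unique_ratio': toy_meta.nunique() / len(toy_meta),
})
print(summary)
```

An integer column with few distinct values, like `tweets` here, is a natural candidate for a manual `stype` review.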

### 🎯 Your Turn!

Since this column holds the **number of tweets** a map received over time, let's change its semantic type (`stype`) to `"numerical"`. You can reference a column via `graph[table_name][column_name]`:

In [None]:
# TODO: Make your adjustments to the "tweets" column in the "maps_meta" table here:
...

graph['maps_meta'].print_metadata()

In [None]:
#@title 👀 Solution { display-mode: "form" }

graph['maps_meta']['tweets'].stype = 'numerical'

Feel free to play around with setting up your graph, *e.g.*, by removing columns, tables, edges or modifying their underlying semantic types.

## Predicting with `KumoRFM`

We are now ready to plug our graph into `KumoRFM` to make predictions, without any training required!

The great thing about the graph is that it's a one-time setup.
Once it's in place, you can generate a variety of predictions from it and power many different use-cases.

In [None]:
model = rfm.KumoRFM(graph)

We can query the `model` in two different ways:

* [**`model.predict(query)`**](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.experimental.rfm.KumoRFM.html#kumoai.experimental.rfm.KumoRFM.predict): Returns predictions for a predictive query.
* [**`model.evaluate(query)`**](https://kumo-ai.github.io/kumo-sdk/docs/generated/kumoai.experimental.rfm.KumoRFM.html#kumoai.experimental.rfm.KumoRFM.evaluate): Evaluates the model performance of a predictive query.

The interface of `KumoRFM` is based on the [**Predictive Query Language (PQL)**](https://kumo-ai.github.io/kumo-sdk/docs/get_started/rfm/querying_rfm.html), which allows you to define predictive problems by specifying:

1. A **target** which declares the value or aggregate the model should predict
1. The **entities** (single ID or list of IDs) to predict for
1. *Optional* **filters** that can be used to refine the context

The general structure of a predictive query is:

```
PREDICT <target> FOR <entity> WHERE <optional_filter>
```

<details>
<summary><b>💡 Click here for a short introduction to Predictive Query!</b></summary>

**Entities** to query can be specified in one of two ways:

* `FOR <entity_table>.<entity_pkey>=1` for single entities
* `FOR <entity_table>.<entity_pkey> IN (1, 2, 3)` for a list of entities

**Targets** can be specified in one of two ways:

* `PREDICT <target_table>.<target_column>`: Imputes missing values of a column
* `PREDICT <aggr>(<target_table>.<target_column>, <start_offset>, <end_offset>, <time_unit>)`: Predicts an aggregate in the future, where
  * `<aggr>` can be `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `LIST_DISTINCT`;
  * `<start_offset>` is an integer that defines the relative start offset of the prediction window (`0` would mean "from now");
  * `<end_offset>` is an integer that defines the relative end offset of the prediction window;
  * `<time_unit>` defines the unit of the prediction window, *e.g.*, `hours`, `days` or `months`.

  For example, `PREDICT COUNT(plays.*, 0, 7, days)` predicts the number of plays in the next seven days.

Targets can be further refined via **conditions.** For example, `PREDICT COUNT(plays.*, 0, 7, days)=0` will predict the probability of zero plays in the next seven days.

Lastly, filters can be used to refine the in-context examples:

* `FOR maps.map_id=1 WHERE maps.difficulty='normal'`: Only account for maps with normal difficulty
* `FOR maps.map_id=1 WHERE COUNT(plays.*, -1, 0, months)>0`: Only account for maps that have been played in the last month (backward looking filter)
* `FOR maps.map_id=1 ASSUMING COUNT(plays.*, 0, 1, months)>0`: For making predictions in hypothetical scenarios, *e.g.*, in case a map gets played in the next month (forward looking filter)
</details>
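
Since a predictive query is passed around as a plain string, it can be convenient to assemble it from its parts. The helper below is a hypothetical sketch (`build_query` is not part of the Kumo SDK) that only follows the `PREDICT <target> FOR <entity> WHERE <optional_filter>` structure shown above:

```python
from typing import Optional, Sequence, Union


def build_query(target: str, entity_key: str,
                entity_ids: Union[int, Sequence[int]],
                where: Optional[str] = None) -> str:
    """Assemble `PREDICT <target> FOR <entity> [WHERE <filter>]`."""
    if isinstance(entity_ids, int):
        # Single entity: `<table>.<pkey>=<id>`
        entity = f"{entity_key}={entity_ids}"
    else:
        # List of entities: `<table>.<pkey> IN (<id>, <id>, ...)`
        ids = ", ".join(str(i) for i in entity_ids)
        entity = f"{entity_key} IN ({ids})"
    query = f"PREDICT {target} FOR {entity}"
    if where:
        query += f" WHERE {where}"
    return query


single = build_query("COUNT(plays.*, 0, 7, days)", "maps.map_id", 1)
multi = build_query("maps.difficulty", "maps.map_id", [1, 2, 3])
filtered = build_query("COUNT(plays.*, 0, 7, days)", "maps.map_id", 1,
                       where="maps.difficulty='normal'")
print(single)
print(multi)
print(filtered)
```

The helper performs no validation of table or column names; it only keeps the three parts of the query in the right order.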

Let's write our first Predictive Query!

### 🎯 Predict Nationality of a Player

We want to predict/impute the nationality of the player with `player_id=5`:

In [None]:
# TODO: Write a predictive query to impute the nationality/flag of player 5.
query = "PREDICT ..."

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = "PREDICT players.flag FOR players.player_id=5"

Let's run the query and see if it is working.  For this, `KumoRFM` will treat the `flag` column as blank, and will try to infer it from the given graph.

In [None]:
model.predict(query)

This predictive query gets inferred as **multi-class classification**. By default, `1,000` in-context examples were used to arrive at a prediction.

Let's compare it to the actual value in our dataset (`KumoRFM` treats any value to predict as missing, independent of whether it actually exists or not):

In [None]:
# Look the player up by ID rather than by row position:
display(players[players['player_id'] == 5])

This is working quite well. Let's see how our model performs over ground-truth data by calling `model.evaluate()`. We can report both accuracy (`'acc'`) and the mean reciprocal rank (`'mrr'`):

In [None]:
model.evaluate(query, metrics=['acc', 'mrr'])
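
For intuition about the two metrics, here is a minimal sketch of their standard textbook definitions on made-up ranked predictions. It is not the SDK's implementation, and the function names and toy data are hypothetical:

```python
from typing import Sequence


def accuracy(ranked_preds: Sequence[Sequence[str]],
             labels: Sequence[str]) -> float:
    """Fraction of examples whose top-ranked class equals the label."""
    hits = sum(ranked[0] == label
               for ranked, label in zip(ranked_preds, labels))
    return hits / len(labels)


def mean_reciprocal_rank(ranked_preds: Sequence[Sequence[str]],
                         labels: Sequence[str]) -> float:
    """Average of 1/rank of the true label (0 if it is not ranked)."""
    total = 0.0
    for ranked, label in zip(ranked_preds, labels):
        if label in ranked:
            total += 1.0 / (list(ranked).index(label) + 1)
    return total / len(labels)


# Each inner list holds classes sorted from most to least likely.
ranked = [['US', 'JP', 'BR'], ['JP', 'US', 'BR'], ['BR', 'JP', 'US']]
truth = ['US', 'US', 'US']

acc = accuracy(ranked, truth)              # 1 of 3 top-1 hits
mrr = mean_reciprocal_rank(ranked, truth)  # (1 + 1/2 + 1/3) / 3
print(f'acc={acc:.4f}, mrr={mrr:.4f}')
```

Accuracy only rewards an exact top-1 match, while the mean reciprocal rank also gives partial credit when the true class is ranked second or third.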

### 🎯 Predict the Difficulty of a Map

Next, let's predict/impute the difficulty of the maps with `map_id=1`, `map_id=2`, and `map_id=3`:

In [None]:
# TODO: Write a predictive query to impute the difficulty of maps 1, 2, and 3.
query = "PREDICT ..."

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = "PREDICT maps.difficulty FOR maps.map_id IN (1, 2, 3)"

Let's run the query and see if it is working. For this, `KumoRFM` will treat the `difficulty` column as blank, and will try to infer it from the given graph (*e.g.*, from the number of attempts, clear rate, time records, *etc.*):

In [None]:
model.predict(query)

Let's compare the predictions to the actual difficulty:

In [None]:
# Look the maps up by ID rather than by row position:
maps[maps['map_id'].isin([1, 2, 3])]

Again, we can observe more generally how our model compares to the ground-truth data:

In [None]:
model.evaluate(query, metrics=['acc', 'mrr'])

Let's see if we can improve the performance further. One option to tune the model quality is the `run_mode="fast"|"normal"|"best"` argument, which provides a trade-off between runtime and model performance. A heavier `run_mode` increases the context size and uses a deeper model downstream:

In [None]:
model.evaluate(query, run_mode='best', metrics=['acc', 'mrr'])

Using a heavier `run_mode` usually improves performance (as in this case), although the model will take longer to run.

### 🎯 Predict the Tag of a Map's Metadata

Lastly, let's predict/impute the tag of a map's metadata.

In order to reference `maps_meta` as part of our predictive query, we need to assign it a primary key (*e.g.*, `meta_id`). We can do this by adding a contiguous index to the underlying `DataFrame`, refreshing the `LocalGraph`, assigning it the newly added primary key, and re-initializing our `KumoRFM` model with the updated data:

In [None]:
# TODO Update `df_dict['maps_meta']` to hold a unique identifier.
...

# TODO Re-initialize the graph with the updated `df_dict`.
...

# TODO Update the `primary_key` of `graph['maps_meta']`.
...

# TODO Re-initialize `KumoRFM` with the updated graph.
...

In [None]:
#@title 👀 Solution { display-mode: "form" }

df_dict['maps_meta']['meta_id'] = range(len(df_dict['maps_meta']))

graph = rfm.LocalGraph.from_data(df_dict, verbose=False)

graph['maps_meta'].primary_key = 'meta_id'
graph.print_metadata()

model = rfm.KumoRFM(graph)

Once our newly added primary key is set up, we are able to predict the tag of a map's meta information (*e.g.*, `maps_meta.meta_id=1`):

In [None]:
# TODO Write a predictive query to impute the tag of a map's meta information with ID 1.
query = "PREDICT ..."

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = "PREDICT maps_meta.tag FOR maps_meta.meta_id=1"

In [None]:
model.predict(query)

Again, let's evaluate our model:

In [None]:
model.evaluate(query, metrics=['acc', 'mrr'])

In [None]:
model.evaluate(query, run_mode='best', metrics=['acc', 'mrr'])

In addition, we can verify the importance of graph structure to solve this task. For example, if we do the same prediction without the multi-table reasoning (`num_neighbors=[]`), we observe a dramatic decrease in performance:

In [None]:
model.evaluate(query, num_neighbors=[], metrics=['acc', 'mrr'])

### 🎯 Predict the Average Clear Rate of a Map in the Next 14 Days

Let's move beyond missing value imputation and towards **temporal forecasting**.
We want to predict the average `clear_rate` (as given in `maps_meta`) in the next 14 days for `map_id=1`. We can use the `AVG` aggregation to define a prediction window over the next 14 days as:
```
PREDICT <aggr>(<target_table>.<target_column>, 0, 14, days)
```
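
To make the prediction window concrete, the sketch below computes such a 14-day average by hand on a made-up toy table. The anchor time, the toy values, and the choice of a half-open window `(anchor + start, anchor + end]` are assumptions for illustration; the exact boundary handling inside KumoRFM is not specified here:

```python
import pandas as pd

# Toy stand-in for `maps_meta` (values are made up for illustration).
toy_meta = pd.DataFrame({
    'map_id': [1, 1, 1, 1, 2],
    'time': pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-12',
                            '2018-01-20', '2018-01-03']),
    'clear_rate': [10.0, 20.0, 30.0, 90.0, 50.0],
})


def window_avg(df: pd.DataFrame, map_id: int, anchor: pd.Timestamp,
               start: int, end: int) -> float:
    """Average `clear_rate` of one map within (anchor+start, anchor+end]."""
    lo = anchor + pd.Timedelta(days=start)
    hi = anchor + pd.Timedelta(days=end)
    mask = ((df['map_id'] == map_id)
            & (df['time'] > lo)
            & (df['time'] <= hi))
    return float(df.loc[mask, 'clear_rate'].mean())


anchor = pd.Timestamp('2017-12-31')  # "now" from the query's point of view
label = window_avg(toy_meta, map_id=1, anchor=anchor, start=0, end=14)
print(label)  # mean of the three rows of map 1 inside the window
```

Rows outside the window (here the entry on 2018-01-20) and rows of other maps do not contribute to the target value.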

In [None]:
# TODO Write a temporal predictive query to predict the average clear rate in
# the next 14 days of the map with map_id=1:
query = "PREDICT ..."

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = "PREDICT AVG(maps_meta.clear_rate, 0, 14, days) FOR maps.map_id=1"

Let's evaluate our query:

In [None]:
model.evaluate(query, run_mode='best')

Predictive Query can be very expressive. For example, we can easily add some filters to our predictive query to refine the task.

Let's predict the average clear rate of a map with `difficulty='expert'` *(entity filter)* for which more than 10 attempts were registered in the map's meta information *(target filter)*:

In [None]:
# TODO Add the necessary filters to the predictive query:
query = ("PREDICT AVG(maps_meta.clear_rate WHERE ..., 0, 14, days) "
         "FOR maps.map_id=1 WHERE ...")

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = ("PREDICT AVG(maps_meta.clear_rate WHERE maps_meta.attempts > 10, 0, 14, days) "
         "FOR maps.map_id=1 WHERE maps.difficulty='expert'")

Let us again evaluate our model:

In [None]:
model.evaluate(query)
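
The effect of a target filter can be illustrated by hand: rows that fail the filter are dropped before the aggregation is computed. The following sketch uses a made-up toy table and assumed window boundaries, so it shows the idea rather than KumoRFM's exact semantics:

```python
import pandas as pd

# Toy stand-in for the `maps_meta` rows of one map (made-up values).
toy_meta = pd.DataFrame({
    'map_id': [1, 1, 1],
    'time': pd.to_datetime(['2018-01-02', '2018-01-06', '2018-01-10']),
    'attempts': [4, 25, 60],
    'clear_rate': [50.0, 10.0, 20.0],
})

# Assumed 14-day prediction window starting at the anchor time.
anchor = pd.Timestamp('2018-01-01')
end = anchor + pd.Timedelta(days=14)
in_window = toy_meta[(toy_meta['map_id'] == 1)
                     & (toy_meta['time'] > anchor)
                     & (toy_meta['time'] <= end)]

# Without a target filter, all rows in the window are averaged.
unfiltered_avg = float(in_window['clear_rate'].mean())

# With `attempts > 10`, only the matching rows enter the average.
kept = in_window[in_window['attempts'] > 10]
filtered_avg = float(kept['clear_rate'].mean())

print(unfiltered_avg, filtered_avg)
```

The entity filter (`maps.difficulty='expert'`) works on a different level: it restricts which maps are considered at all, not which rows enter the aggregation.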

### 🎯 Predict whether a Map Will be Played in the Next Month

Finally, let's predict whether the map with `map_id=1` will be played in the next month. For this, we can make use of **conditions** w.r.t. the output of our aggregation (*e.g.*, `= 0` or `> 0`). In order to check whether an event exists, we can leverage the `COUNT(<target_table>.*, ...)` aggregation:

In [None]:
# TODO Predict whether the map with map_id=1 will be played in the next month:
query = "PREDICT ..."

In [None]:
#@title 👀 Solution { display-mode: "form" }

query = "PREDICT COUNT(plays.*, 0, 1, months)>0 FOR maps.map_id=1"

Let's again evaluate:

In [None]:
model.evaluate(query)
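
Conceptually, a condition such as `>0` turns the aggregate into a binary target. The sketch below derives such labels by hand from a made-up toy `plays` table; the anchor time and the window boundaries are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for `plays` (values are made up for illustration).
toy_plays = pd.DataFrame({
    'map_id': [1, 1, 2, 3],
    'time': pd.to_datetime(
        ['2018-01-10', '2018-01-25', '2018-03-01', '2018-01-05']),
})

anchor = pd.Timestamp('2018-01-01')
end = anchor + pd.DateOffset(months=1)  # one calendar month later

# Count plays per map inside the window; maps without plays get 0.
in_window = toy_plays[(toy_plays['time'] > anchor)
                      & (toy_plays['time'] <= end)]
counts = (in_window.groupby('map_id').size()
          .reindex([1, 2, 3], fill_value=0))

# The condition `COUNT(...) > 0` becomes a boolean label per map.
labels = counts > 0
print(labels)
```

Map 2 is only played after the window ends, so its count is zero and its label is `False`.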

And that's it. Congratulations on making it this far. Feel free to explore more predictive queries, or read in your own dataset now.

### We'd love to hear from you!


1. **Leave Feedback on `KumoRFM` [here](https://tinyurl.com/rfmbeta)**.

1. **Found a bug or have a feature request?** Submit issues directly on [GitHub](https://github.com/kumo-ai/kumo-rfm). Your feedback helps us improve KumoRFM for everyone.

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/kumo_ai_logo.jpeg" width="30" />
</div>