In [1]:
import os
import kumoai.experimental.rfm as rfm
from pathlib import Path
import pandas as pd

In [2]:
home_api_key_file = Path.home() / "kumoai_key.txt"
with open(home_api_key_file, "r") as file:
    api_key = file.read().strip()
os.environ["KUMO_API_KEY"] = api_key

rfm.init()

[2025-08-08 17:07:09 - kumoai:196 - INFO] Successfully initialized the Kumo SDK against deployment https://kumorfm.ai/api, with log level INFO.


In [3]:
# root = 's3://kumo-sdk-public/rfm-datasets/online-shopping'
# users_df = pd.read_parquet(f'{root}/users.parquet')    # (1000, 3) 
# items_df = pd.read_parquet(f'{root}/items.parquet')    # (1000, 5)
# orders_df = pd.read_parquet(f'{root}/orders.parquet')  # (267774, 6) 

## Creation of synthetic data: 4 tables

We create a synthetic dataset to test how KumoRFM handles a temporal dependency that is not visible in the transactional history itself, but in a user type that can change over time.
Essentially, there are 2 types of users (premium/free), and premium users can perform actions that free users cannot.
The exercise tests whether KumoRFM detects this dependency and includes it in its predictions.

### Dataset description
The synthetic dataset represents users of 2 tiers (free or premium) who upload files.
Premium users can upload files of any size, while free users can only upload files below 10GB.
For simplicity, there are only 2 file sizes: 5GB and 50GB.
Users can change their tier, but only after at least 24 hours in their current tier.
Within an hour of a user switching to premium, the likelihood of that user uploading a 50GB file is higher.
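
The upload rule above can be sketched as a small sampling function. This is only an illustration: the function name, the `rng` argument, and both probabilities are assumptions, since the description gives no concrete values beyond "higher" shortly after a switch to premium.

```python
import random
from datetime import timedelta

# Hypothetical probabilities, chosen only for illustration.
P_LARGE_RECENT_PREMIUM = 0.8   # user switched to premium within the last hour
P_LARGE_STEADY_PREMIUM = 0.3   # user has been premium for longer than that
BOOST_WINDOW = timedelta(hours=1)


def sample_upload_size_gb(tier, time_since_switch, rng):
    """Return the size (5 or 50) of one upload for a user in the given tier.

    time_since_switch is the time elapsed since the user last changed tier,
    or None if the user never changed tier.
    """
    if tier == "free":
        # Free users are limited to files below 10GB, so only 5GB is possible.
        return 5
    recently_switched = (
        time_since_switch is not None and time_since_switch <= BOOST_WINDOW
    )
    p_large = P_LARGE_RECENT_PREMIUM if recently_switched else P_LARGE_STEADY_PREMIUM
    return 50 if rng.random() < p_large else 5
```

Passing a seeded `random.Random` instance as `rng` keeps the generated data reproducible.
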
There are 100 users, covering 5 cohorts:
- Always premium
- Always free
- Premium --> Free (one change)
- Free --> Premium (one change)
- Free --> Premium --> Free (two changes, >= 24 hours apart)

The transaction history spans 10 days.
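
One possible way to draw tier schedules for the 5 cohorts is sketched below. The start date, the cohort keys, and the helper `tier_intervals` are hypothetical names introduced for this example. The sketch also assumes that the first change happens at least 24 hours after the history starts and that the last change happens at least 24 hours before it ends.

```python
import random
from datetime import datetime, timedelta

START = datetime(2025, 1, 1)      # hypothetical start of the history
HISTORY = timedelta(days=10)      # length of the transaction history
MIN_GAP = timedelta(hours=24)     # minimum time between tier changes

# Tier sequence per cohort (cohort keys are illustrative names).
COHORTS = {
    "always_premium": ["premium"],
    "always_free": ["free"],
    "premium_to_free": ["premium", "free"],
    "free_to_premium": ["free", "premium"],
    "free_premium_free": ["free", "premium", "free"],
}


def tier_intervals(cohort, rng):
    """Return a list of (from_datetime, until_datetime, tier) tuples.

    The last interval has until_datetime=None, meaning "still in effect".
    """
    sequence = COHORTS[cohort]
    n_changes = len(sequence) - 1
    # Reserve MIN_GAP before the first change, between changes, and after the
    # last one; the remaining slack is distributed with sorted random offsets.
    slack_hours = HISTORY.total_seconds() / 3600 - (n_changes + 1) * 24
    offsets = sorted(rng.uniform(0, slack_hours) for _ in range(n_changes))
    change_times = [
        START + MIN_GAP * (k + 1) + timedelta(hours=offset)
        for k, offset in enumerate(offsets)
    ]
    bounds = [START] + change_times + [None]
    return [(bounds[i], bounds[i + 1], tier) for i, tier in enumerate(sequence)]
```

Because the offsets are sorted before the fixed 24-hour steps are added, consecutive change times are always at least `MIN_GAP` apart.
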
The prediction tasks will be run for different users, at different points in time, and ask how likely each user is to upload a 50GB file within the next hour.
The expectations are: for users who just became free, this likelihood should be 0; for users who just became premium, it should be high; for users who have been premium for a while, it should be well above 0. An

#### users
- user_id (PK)
- name

#### tiers (user tiers: free/premium, temporal, non-overlapping for each user)
- user_id (FK -> users.user_id)
- from_datetime
- until_datetime (NULL means "still in effect")
- tier in {'free', 'premium'}
- Composite PK: (user_id, from_datetime)
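
A minimal pandas sketch of how this table could be queried and validated follows. The sample rows and the helpers `tier_at` and `has_overlaps` are hypothetical. The sketch also assumes half-open intervals `[from_datetime, until_datetime)`, which the schema does not state explicitly.

```python
import pandas as pd

# Tiny illustrative table; the values are made up for this example.
tiers_example = pd.DataFrame({
    "user_id": [1, 1, 2],
    "from_datetime": pd.to_datetime(
        ["2025-01-01 00:00", "2025-01-03 00:00", "2025-01-01 00:00"]),
    "until_datetime": pd.to_datetime(["2025-01-03 00:00", None, None]),
    "tier": ["free", "premium", "premium"],
})


def tier_at(tiers, user_id, ts):
    """Return the tier of user_id at time ts, or None if no interval covers ts."""
    ts = pd.Timestamp(ts)
    rows = tiers[
        (tiers["user_id"] == user_id)
        & (tiers["from_datetime"] <= ts)
        # A NULL (NaT) until_datetime means the interval is still in effect.
        & (tiers["until_datetime"].isna() | (tiers["until_datetime"] > ts))
    ]
    return rows["tier"].iloc[0] if len(rows) == 1 else None


def has_overlaps(tiers):
    """Return True if any user has two tier intervals that overlap."""
    ordered = tiers.sort_values(["user_id", "from_datetime"])
    next_start = ordered.groupby("user_id")["from_datetime"].shift(-1)
    # An interval overlaps its successor if it is open-ended or ends after
    # the successor begins.
    overlap = next_start.notna() & (
        ordered["until_datetime"].isna() | (ordered["until_datetime"] > next_start)
    )
    return bool(overlap.any())
```

For example, `tier_at(tiers_example, 1, "2025-01-02")` looks up the interval of user 1 that covers January 2.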

#### items
- item_id (PK)
- size_gb in {5, 50}

#### uploads
- txn_id (PK)
- user_id (FK -> users.user_id)
- item_id (FK -> items.item_id)
- datetime
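
To make the prediction target concrete, the sketch below builds tiny versions of `users`, `items`, and `uploads` with pandas, checks the keys, and computes whether a user uploads a 50GB file within the hour after a given timestamp. All row values and both helper functions are illustrative assumptions, not the notebook's actual generation code.

```python
import pandas as pd

# Tiny illustrative tables; the values are made up for this example.
users = pd.DataFrame({"user_id": [1, 2], "name": ["user_1", "user_2"]})
items = pd.DataFrame({"item_id": [1, 2], "size_gb": [5, 50]})
uploads = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "user_id": [1, 1, 2],
    "item_id": [1, 2, 1],
    "datetime": pd.to_datetime(
        ["2025-01-01 10:00", "2025-01-03 00:30", "2025-01-02 09:00"]),
})


def keys_are_valid(users, items, uploads):
    """Check primary-key uniqueness and foreign-key references."""
    return bool(
        users["user_id"].is_unique
        and items["item_id"].is_unique
        and uploads["txn_id"].is_unique
        and uploads["user_id"].isin(users["user_id"]).all()
        and uploads["item_id"].isin(items["item_id"]).all()
    )


def uploads_large_within_hour(uploads, items, user_id, ts):
    """Return True if user_id uploads a 50GB item in the window (ts, ts + 1h]."""
    ts = pd.Timestamp(ts)
    joined = uploads.merge(items, on="item_id", how="left")
    in_window = (
        (joined["user_id"] == user_id)
        & (joined["datetime"] > ts)
        & (joined["datetime"] <= ts + pd.Timedelta(hours=1))
        & (joined["size_gb"] == 50)
    )
    return bool(in_window.any())
```

The window is taken as open at `ts` and closed one hour later; that boundary choice is an assumption of this sketch.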



#### Notes:
- items has only 2 rows. We do not distinguish between items: just make it 