# `KumoRFM` on single tabular data

While `KumoRFM` is at its best on multi-table relational data, it can also be applied to a single table.
For this example, we use the [`breast_cancer`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) binary classification dataset from scikit-learn.

In [None]:
!pip install kumoai --pre --upgrade

In [None]:
from kumoai.experimental import rfm

In [None]:
import os

if not os.environ.get("KUMO_API_KEY"):
    rfm.authenticate()

In [None]:
rfm.init()

Let's load the dataset directly from `sklearn`:

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

print('X_train:', X_train.shape)
print(' X_test:', X_test.shape)

In `KumoRFM`, we can operate on a single table by creating a graph that holds just that table and has no links.

To query the model, we treat the task as **missing value imputation**: we combine `X_train` and `X_test` into a single table, add the training labels, and **keep the test labels blank** so that `KumoRFM` can infer them.

In [None]:
import numpy as np
import pandas as pd

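# Test rows come first, so they receive the IDs 0 .. len(y_test) - 1.
df = pd.DataFrame({
    'id': range(len(X)),
    'emb': list(np.concatenate([X_test, X_train], axis=0)),  # One vector per row
    'target': np.concatenate([y_test, y_train], axis=0),
})
# Use a nullable integer dtype so that labels can be missing.
target = df['target'].astype('Int64')
target.iloc[:len(y_test)] = None  # Mask out test labels!
df['target'] = target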

display(df)

Note that we have added the feature matrix as a single embedding to our `pandas.DataFrame` since `KumoRFM` natively knows how to operate on such custom embeddings (*e.g.*, we could also add LLM or image encodings as feature columns).
Alternatively, we could have added each feature to a separate column.
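
As a rough sketch of that alternative layout, the helper below builds a table with one column per feature and leaves the test labels missing. The function name `to_wide_frame` and the default `feat_<j>` column names are hypothetical and only serve as illustration; the tiny arrays are stand-in values, not the real dataset. The rest of this notebook continues with the embedding-based `df`.

```python
import numpy as np
import pandas as pd


def to_wide_frame(X_test, X_train, y_train, feature_names=None):
    """Build one table with a column per feature; test labels stay missing.

    Hypothetical helper for illustration, not part of any library.
    """
    X_all = np.concatenate([X_test, X_train], axis=0)
    if feature_names is None:
        # Placeholder column names; real feature names can be passed instead.
        feature_names = [f"feat_{j}" for j in range(X_all.shape[1])]
    wide = pd.DataFrame(X_all, columns=list(feature_names))
    wide.insert(0, "id", np.arange(len(wide)))
    # Test rows come first and get a missing label; training rows keep theirs.
    labels = [pd.NA] * len(X_test) + np.asarray(y_train).tolist()
    wide["target"] = pd.array(labels, dtype="Int64")
    return wide


# Stand-in values for illustration only.
X_test_demo = np.array([[1.0, 2.0]])
X_train_demo = np.array([[3.0, 4.0], [5.0, 6.0]])
y_train_demo = np.array([1, 0])

wide_demo = to_wide_frame(X_test_demo, X_train_demo, y_train_demo)
print(wide_demo)
```

In this notebook, the call would be `to_wide_frame(X_test, X_train, y_train)`, optionally passing the `feature_names` attribute of the object returned by `load_breast_cancer()` as column names.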

Let's read this data frame into `KumoRFM`:

In [None]:
graph = rfm.LocalGraph.from_data({'table': df})
model = rfm.KumoRFM(graph)

We can see that `KumoRFM` has correctly detected the feature embedding column (`dtype='floatlist'`, `stype='sequence'`):

In [None]:
graph['table'].print_metadata()

To query the model, we ask `KumoRFM` to predict whether `target` equals 1 for every ID in our test set. Because the test rows were placed first in the table, these are the IDs `0` to `len(y_test) - 1`:

In [None]:
query = (f"PREDICT table.target=1 "
         f"FOR table.id IN ({', '.join(str(i) for i in range(len(y_test)))})")

result = model.predict(query)
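
When the same kind of query is needed for other tables or ID sets, the string construction can be wrapped in a small function. The sketch below only reproduces the query format used above; `build_binary_query` and its parameters are hypothetical names introduced for illustration and are not part of the `kumoai` API.

```python
def build_binary_query(table, target, ids, positive_value=1, id_column="id"):
    """Return a query string in the same format as the query above.

    Hypothetical helper; it only formats text and does not call the model.
    """
    ids = list(ids)
    if not ids:
        raise ValueError("ids must not be empty")
    id_list = ", ".join(str(int(i)) for i in ids)
    return (f"PREDICT {table}.{target}={positive_value} "
            f"FOR {table}.{id_column} IN ({id_list})")


demo_query = build_binary_query("table", "target", range(3))
print(demo_query)
```

For this notebook, `build_binary_query("table", "target", range(len(y_test)))` would yield the same string as `query`.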

Finally, we can evaluate the predicted probabilities against the held-out test labels:

In [None]:
from sklearn.metrics import roc_auc_score

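# Probability of `target == 1`; this assumes that the rows of `result`
# follow the order of the queried IDs, and thus the order of `y_test`.
y_pred = result['True_PROB'].to_numpy()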
print(f"AUROC: {roc_auc_score(y_test, y_pred):.4f}")
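
AUROC is threshold-free. If hard class predictions are also of interest, the probabilities can be thresholded and compared against the labels. The following is a minimal sketch, assuming a cut-off of 0.5 (an arbitrary choice that may need tuning); `accuracy_at_threshold` is a hypothetical helper, and the demo arrays are stand-in values, not model output.

```python
import numpy as np


def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    # Probabilities at or above the threshold count as the positive class.
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    return float((y_hat == y_true).mean())


# Stand-in values for illustration only (not real predictions).
y_true_demo = np.array([1, 0, 1, 1, 0])
y_prob_demo = np.array([0.9, 0.2, 0.4, 0.7, 0.6])
print(f"Accuracy: {accuracy_at_threshold(y_true_demo, y_prob_demo):.2f}")
```

In this notebook, the call would be `accuracy_at_threshold(y_test, y_pred)`.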