# Predibase backend for Ludwig

This notebook shows how to create a Predibase backend for Ludwig.

To keep things simple, it uses the [getting started](https://ludwig.ai/latest/getting_started/) example. First install Ludwig from the `enh-predibase-backend` branch:

```bash
git clone -b enh-predibase-backend https://github.com/brightsparc/ludwig.git
cd ludwig
pip install -e .
```

The goal is to extend Ludwig by adding:
1. Metrics
1. A Predibase backend

## Data Prep

Download the dataset:

In [None]:
!wget -q https://ludwig.ai/latest/data/rotten_tomatoes.csv

Load the dataset into a pandas DataFrame and preview the first rows:

In [None]:
import pandas as pd

df = pd.read_csv("rotten_tomatoes.csv")
df.head()

## Train a Model Locally

Write a config that uses the `local` backend:

In [None]:
%%writefile rotten_tomatoes.yaml
input_features:
    - name: genres
      type: set
      preprocessing:
          tokenizer: comma
    - name: content_rating
      type: category
    - name: top_critic
      type: binary
    - name: runtime
      type: number
    - name: review_content
      type: text
      encoder: 
          type: embed
output_features:
    - name: recommended
      type: binary
backend:
    type: local

Load the config from the YAML file, then define the Ludwig model object that drives training:

In [None]:
import yaml

with open("rotten_tomatoes.yaml", 'r') as file:
    config = yaml.safe_load(file)

config
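
Before training, it can help to confirm that the loaded config has the expected shape. The following is a minimal sketch, not part of Ludwig: `check_features` is a hypothetical helper, and it only assumes that the config is a dict whose `input_features` and `output_features` entries are lists of features with a `name` and a `type`, as in the YAML above.

```python
def check_features(config):
    """Return a list of problems found in a Ludwig-style config dict."""
    if not isinstance(config, dict):
        return [f"config should be a dict, got {type(config).__name__}"]
    problems = []
    for section in ("input_features", "output_features"):
        features = config.get(section)
        if not features:
            problems.append(f"{section} is missing or empty")
            continue
        for i, feature in enumerate(features):
            for key in ("name", "type"):
                if key not in feature:
                    problems.append(f"{section}[{i}] has no '{key}'")
    return problems


# Trimmed-down stand-in for the config loaded from rotten_tomatoes.yaml.
example_config = {
    "input_features": [
        {"name": "genres", "type": "set"},
        {"name": "review_content", "type": "text"},
    ],
    "output_features": [{"name": "recommended", "type": "binary"}],
}

print(check_features(example_config))
```

An empty list means no problems were found. Passing something other than a dict, such as a file name, is reported as a problem instead of raising an error.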

In [None]:
import logging
from ludwig.api import LudwigModel

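logging_level = logging.INFO
# Reuse the `config` dict loaded from rotten_tomatoes.yaml in the previous cell.
# (Calling yaml.safe_load on the file name would parse the name itself and return a string.)
model = LudwigModel(config=config, logging_level=logging_level)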

Define experiment and model name:

In [None]:
experiment_name="simple_experiment"
model_name="simple_model"

In [None]:
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=df, experiment_name=experiment_name, model_name=model_name, skip_save_processed_input=True
)

## Predibase API

Let's go ahead and create a dataset and train a model in Predibase using the SDK.

In [None]:
!pip install -q predibase

In [None]:
from predibase import PredibaseClient
from predibase.pql import get_session
import os

# Get the API token from the environment (never hard-code credentials in a notebook),
# and set the serving endpoint for staging
token = os.environ["PREDIBASE_API_TOKEN"]
session = get_session(token=token, gateway="https://api.staging.predibase.com/v1")

# Print the current engine to confirm that the session works
client = PredibaseClient(session)
print(f"Current engine: {client.get_current_engine().name}")
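
Credentials such as the API token are best supplied through the environment and not written into the notebook. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of the Predibase SDK, and it fails early with a readable message when the variable is unset or blank.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value
```

In this notebook it would be called as `token = require_env("PREDIBASE_API_TOKEN")` before the session is created.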

### Create Dataset

Reuse the dataset if it already exists; otherwise upload it from the DataFrame (or from the CSV file).

In [None]:
try:
    ds = client.get_dataset(experiment_name, "file_uploads")
    print("Got dataset", ds.name)
except Exception:
    print("Creating dataset from dataframe")
    # ds = client.upload_dataset("rotten_tomatoes.csv", name=experiment_name)
    ds = client.create_dataset_from_df(df, name=experiment_name)

### Train a Model

Select an engine, then train a model on the dataset, reusing an existing model if one is found:

In [None]:
engine = client.get_engine("train_engine")
print("Got engine", engine.name)

In [None]:
try:
    model = client.get_model(model_name)
    print("Got latest model", model.name)
except Exception:
    print("Creating model")
    model = client.create_model(repository_name=model_name, dataset=ds, config=config, engine=engine)
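
The dataset and model cells above share the same "get it, or create it if the lookup fails" pattern. A minimal sketch of that pattern as a reusable function is shown below. `get_or_create` is a hypothetical helper, and the in-memory `registry` is only a stand-in so the example runs without a Predibase session.

```python
def get_or_create(get, create):
    """Call get(); if it raises, fall back to create().

    Returns a (resource, created) tuple.
    """
    try:
        return get(), False
    except Exception:
        return create(), True


# Stand-in registry so the sketch runs without a Predibase session.
registry = {}


def fetch():
    return registry["simple_experiment"]  # raises KeyError when absent


def build():
    registry["simple_experiment"] = "dataset-object"
    return registry["simple_experiment"]


first = get_or_create(fetch, build)   # nothing stored yet, so build() runs
second = get_or_create(fetch, build)  # now found, so fetch() succeeds
print(first, second)
```

Catching every `Exception` also hides authentication and network failures, so in real use it is safer to catch only the error the client raises for a missing resource, once that type is known.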

## Clean up

Delete the downloaded dataset file:

In [None]:
!rm rotten_tomatoes.csv*
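
The `!rm` command relies on a Unix-like shell. A portable alternative, sketched here with only the standard library, removes the same files from Python; `remove_matching` is a hypothetical helper name.

```python
from pathlib import Path


def remove_matching(directory, pattern):
    """Delete files in `directory` whose names match `pattern`; return the removed names."""
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


# Same glob as the shell command above, applied to the current directory.
print(remove_matching(".", "rotten_tomatoes.csv*"))
```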