<a target="_blank" href="https://colab.research.google.com/github/predibase/examples/blob/main/model-augmented-generation/Personalized_Email_Campaigns.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
%%capture
! pip install predibase ludwig

In [None]:
import numpy as np
import pandas as pd
import warnings
import yaml

from predibase import PredibaseClient

from ludwig.data.split import get_splitter
from ludwig.data.negative_sampling import negative_sample
from ludwig.backend.base import LocalBackend

from google.colab import files

%load_ext autoreload
%autoreload 2
warnings.filterwarnings('ignore')

<br/>

# Personalized Email Campaigns 📨

__Recommender System Model --> LLM Generated Personalized Email Campaigns__

In this tutorial, we show how to use the Predibase SDK to train a recommender system (recsys) model, then take the results from the recsys model and generate personalized outreach emails via generations from open-source LLMs.

<br/>

## Authentication 🔐

The first step is to sign into Predibase.

- If you do not have a Predibase account set up yet, you may sign up for a free account [here](https://predibase.com/free-trial)
- If you already have an account, navigate to Settings -> My Profile and generate a new API token.
- Finally, plug in the generated API token in the code below to authenticate.

In [None]:
pc = PredibaseClient(
    token="API TOKEN HERE"
)

<br/>

## Dataset Preparation 📄

Next, we'll prepare the dataset needed to train the recsys model. For this demonstration, we will use the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data) dataset from Kaggle. This tutorial does not need the images, which are the largest part of the download, so we recommend downloading only `articles.csv`, `customers.csv`, and `transactions_train.csv` to minimize download time. This dataset contains the purchase history of customers across time, along with supporting metadata.

In [None]:
#@title <h3>Authenticate Kaggle</h3> Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of kaggle.json, a file containing your API credentials. Then run this cell to upload kaggle.json to your Colab runtime.
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv "{fn}" ~/.kaggle/kaggle.json && chmod 600 ~/.kaggle/kaggle.json

### Load data
Agree to the [Kaggle competition rules](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data) before continuing with the following cells.

In [None]:
! kaggle competitions download --quiet -c h-and-m-personalized-fashion-recommendations -f 'articles.csv'
! kaggle competitions download --quiet -c h-and-m-personalized-fashion-recommendations -f 'customers.csv'
! kaggle competitions download --quiet -c h-and-m-personalized-fashion-recommendations -f 'transactions_train.csv'
! unzip -q -o articles.csv.zip
! unzip -q -o customers.csv.zip
! unzip -q -o transactions_train.csv.zip

In [None]:
articles_df = pd.read_csv("articles.csv")
customers_df = pd.read_csv("customers.csv")
transactions_df = pd.read_csv("transactions_train.csv")

### Sampling
First, we sample the dataset because of its scale. The `transactions_df` has ~31.8 million rows and ~1.3 million unique customers. That much data is great for training a very strong model, but processing it takes considerably longer. For the purposes of this tutorial, we sample in the following ways:
- Sample from transactions that occurred after August 21st, 2020
- Sample from customers who purchased 10 or more items after August 21st, 2020
- Select 500 customers

In [None]:
# Convert transaction date column to datetime object
transactions_df["t_dat"] = pd.to_datetime(transactions_df.t_dat)

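# Slice dataset to keep everything after "2020-08-21". Copy the slice so that
# adding columns below modifies an independent dataframe rather than a view.
sampled_transactions_df = transactions_df[transactions_df.t_dat > "2020-08-21"].copy()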
del transactions_df

# Strip out year and month for splitting/sampling purposes
sampled_transactions_df["year_month"] = sampled_transactions_df.t_dat.dt.to_period("M").dt.strftime("%Y-%m")

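# Sample 500 customers with 10 or more transactions, since ~1.3 million total customers takes quite a while to process.
# Draw from the unique eligible customer IDs so that the 500 sampled IDs are distinct.
purchase_counts = sampled_transactions_df.customer_id.value_counts()
eligible_customer_ids = purchase_counts[purchase_counts >= 10].index
customer_ids = np.random.choice(eligible_customer_ids, 500, replace=False)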
sampled_transactions_df = sampled_transactions_df[sampled_transactions_df.customer_id.isin(customer_ids)]

Next, we're going to merge the three separate dataframes together to form a single transaction dataframe. This dataframe will contain all transactions made by the customers in the dataset. Because every row is a transaction that took place, we need to add in some examples of transactions that didn't take place - we call these negative samples. We will handle that a little later.

In [None]:
# Merge the transactions and articles dataframes
sampled_transactions_df = pd.merge(
    sampled_transactions_df,
    articles_df,
    how="left",
    left_on="article_id",
    right_on="article_id",
)

# Merge the transactions and customers dataframes
sampled_transactions_df = pd.merge(
    sampled_transactions_df,
    customers_df,
    how="left",
    left_on="customer_id",
    right_on="customer_id",
)

# Set label to 1 for all known transactions, since the customer bought the article
sampled_transactions_df["label"] = 1

Now that we have a single dataframe with all transaction, article, and customer data present in every row, we are going to both split and negative sample the dataset.



### Split
We need to split our data into training, validation, and test sets. In addition, we need to split in a way that ensures each customer's transactions are present in every set. This way, what the model learns from the training set can be fairly evaluated on the validation and test sets.
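
To make the idea concrete, here is a minimal sketch of a per-customer chronological split in plain pandas. It is a stand-in for illustration, not Ludwig's datetime splitter: the helper `chronological_split` and the toy data are hypothetical, and the rounding rule used to turn the 70/20/10 proportions into row counts is an assumption.

```python
import pandas as pd

def chronological_split(group: pd.DataFrame, probabilities=(0.7, 0.2, 0.1)) -> pd.DataFrame:
    """Label one customer's rows 0 (train), 1 (validation), or 2 (test) in date order."""
    group = group.sort_values("t_dat").copy()
    n = len(group)
    # round() avoids off-by-one errors from floating-point products such as 0.9 * 10
    train_end = round(n * probabilities[0])
    val_end = round(n * (probabilities[0] + probabilities[1]))
    group["split"] = [0 if i < train_end else 1 if i < val_end else 2 for i in range(n)]
    return group

# Toy data (made up): two customers with ten daily transactions each
dates = pd.date_range("2020-08-22", periods=10, freq="D")
toy = pd.DataFrame({
    "customer_id": ["a"] * 10 + ["b"] * 10,
    "t_dat": list(dates) * 2,
})

# Split each customer separately so every customer appears in all three sets
parts = [chronological_split(g) for _, g in toy.groupby("customer_id")]
full = pd.concat(parts, ignore_index=True)
print(full.groupby(["customer_id", "split"]).size())
```

Because each customer's rows are ordered by date before labelling, the validation and test rows for that customer always come later than the training rows, which is what prevents leakage by date.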

In [None]:
# Get ludwig datetime splitter to ensure no target leakage by date. Split 70%, 20%, 10% for train, validation, and test sets
splitter = get_splitter("datetime", column="year_month", probabilities=(0.7, 0.2, 0.1))

# Split per customer_id to ensure that interactions for a customer are across all splits
train_dfs, val_dfs, test_dfs = [], [], []
for customer_id in sampled_transactions_df["customer_id"].unique():
    train_df, val_df, test_df = splitter.split(sampled_transactions_df[sampled_transactions_df["customer_id"] == customer_id], backend=LocalBackend())

    train_dfs.append(train_df)
    val_dfs.append(val_df)
    test_dfs.append(test_df)

# Concatenate all customer id specific splits into their respective datasets
train_set = pd.concat(train_dfs)
val_set = pd.concat(val_dfs)
test_set = pd.concat(test_dfs)

# Set the split value for each set
train_set["split"] = 0
val_set["split"] = 1
test_set["split"] = 2

# Combine train, val, and test set into final dataset
full_df = pd.concat([train_set, val_set, test_set])

### Negative sample
As mentioned before, we need to add negative samples so that the model can learn to distinguish between something a user would buy, and something a user would not. To that end, we will create some synthetic data points per user by selecting products that we know the user didn't purchase.
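
As a rough illustration of the idea, the sketch below builds negative samples in plain Python. It is not Ludwig's `negative_sample` utility: the helper `sample_negatives`, its parameters, and the toy data are hypothetical, and uniform sampling over unseen items is an assumption.

```python
import random

def sample_negatives(positives, all_items, neg_pos_ratio=2, seed=0):
    """Return (user, item, label) rows: every positive, plus unseen items labelled 0."""
    rng = random.Random(seed)

    # Collect the set of items each user actually bought
    bought = {}
    for user, item in positives:
        bought.setdefault(user, set()).add(item)

    rows = []
    for user, items in bought.items():
        rows.extend((user, item, 1) for item in sorted(items))
        # Candidates for negatives are items this user never bought
        unseen = sorted(set(all_items) - items)
        k = min(len(unseen), neg_pos_ratio * len(items))
        rows.extend((user, item, 0) for item in rng.sample(unseen, k))
    return rows

# Toy data (made up)
positives = [("u1", "shirt"), ("u1", "socks"), ("u2", "hat")]
all_items = ["shirt", "socks", "hat", "scarf", "jeans", "coat", "belt"]

rows = sample_negatives(positives, all_items, neg_pos_ratio=2)
for row in rows:
    print(row)
```

The ratio controls how many label-0 rows are created per purchase; the tutorial uses a ratio of 10, while the sketch uses 2 to keep the output short.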

In [None]:
# Negative sample each split separately
train_df = negative_sample(full_df[full_df.split == 0], neg_pos_ratio=10, neg_val=0)
val_df = negative_sample(full_df[full_df.split == 1], neg_pos_ratio=10, neg_val=0)
test_df = negative_sample(full_df[full_df.split == 2], neg_pos_ratio=10, neg_val=0)

# The negative_sample utility from Ludwig only returns a user_id, item_id, and label column. So
# we need to define the columns we want to add back into the dataset.
article_cols = [
    "prod_name",
    "product_type_name",
    "product_group_name",
    "graphical_appearance_name",
    "colour_group_name",
    "perceived_colour_value_name",
    "perceived_colour_master_name",
    "department_name",
    "index_name",
    "index_group_name",
    "section_name",
    "garment_group_name",
    "detail_desc",
]
customer_cols = [
    "customer_id",
    "FN",
    "Active",
    "club_member_status",
    "fashion_news_frequency",
    "age",
    "postal_code",
]

# Add back customer and article features defined above
articles = full_df[["article_id"] + article_cols].drop_duplicates(["article_id"])
customers = full_df[customer_cols].drop_duplicates(["customer_id"])

train_df = pd.merge(train_df, articles, how="left", left_on="article_id", right_on="article_id")
train_df = pd.merge(train_df, customers, how="left", left_on="customer_id", right_on="customer_id")
train_df["split"] = 0

val_df = pd.merge(val_df, articles, how="left", left_on="article_id", right_on="article_id")
val_df = pd.merge(val_df, customers, how="left", left_on="customer_id", right_on="customer_id")
val_df["split"] = 1

test_df = pd.merge(test_df, articles, how="left", left_on="article_id", right_on="article_id")
test_df = pd.merge(test_df, customers, how="left", left_on="customer_id", right_on="customer_id")
test_df["split"] = 2

# Combine train, val, and test sets together to get the final dataset we will train the model with.
final_df = pd.concat([train_df, val_df, test_df])

<br/>

__NOTE:__ Because we decided to sample the dataset to a fraction of its size, the file size of our dataset is under the 1GB limit that Predibase has on file uploads. So for the purposes of this example, we will go straight from a dataframe to a Predibase dataset. However, for production use cases, it is recommended that you use an object store such as AWS S3 to hold your dataset artifacts. This way you can connect and train on datasets much larger than 1GB in Predibase.

<br/>

## Connect Data ♾️

Here we use the [`create_dataset_from_df`](https://docs.staging.predibase.com/sdk-guide/datasets/dataset_from_df) method, which converts a pandas dataframe to a Predibase File Upload Dataset. If you are using another connection option, use the corresponding method of creating a Predibase Dataset instead. All of the [connection](https://docs.staging.predibase.com/sdk-guide/connections/) and [dataset](https://docs.staging.predibase.com/sdk-guide/datasets/) options are available to reference in the [SDK docs](https://docs.staging.predibase.com/sdk-guide/getting-started).

In [None]:
# Use the dataframe-to-dataset SDK method to create a Predibase dataset from our final df above.
HM_dataset = pc.create_dataset_from_df(final_df, "H&M_Recsys_Dataset")
HM_dataset

<br/>

## Engine 🚂

At Predibase, engines are our solution to common compute and infrastructure pain points that everyone runs into while training models. These are problems like:
- Encountering Out of Memory errors due to insufficient compute
- Challenges distributing a model training job over multiple compute resources
- Losing progress when transient issues interrupt the training process

Predibase training engines mitigate these issues by:
- Analyzing the training job details to assign the right amount of compute
- Distributing the training job over the assigned compute resources
- Retrying when things go wrong

With this in mind, we will select the engine we want to use for training.

In [None]:
train_engine = pc.get_engine("train_engine")
train_engine

<br/>

## Config 📝

Next, we're going to define the config with the specs for our recommender systems model. Here is a readable yaml representation of the config - below I will explain the key parameters we're setting and why.
```
model_type: ecd
input_features:
  - name: prod_name
    type: text
  - name: product_type_name
    type: category
  - name: product_group_name
    type: category
  - name: graphical_appearance_name
    type: category
  - name: colour_group_name
    type: category
  - name: perceived_colour_value_name
    type: category
  - name: perceived_colour_master_name
    type: category
  - name: department_name
    type: category
  - name: index_name
    type: category
  - name: index_group_name
    type: category
  - name: section_name
    type: category
  - name: garment_group_name
    type: category
  - name: detail_desc
    type: text
  - name: customer_id
    type: category
  - name: FN
    type: category
  - name: Active
    type: category
  - name: club_member_status
    type: category
  - name: fashion_news_frequency
    type: category
  - name: age
    type: number
  - name: postal_code
    type: category
output_features:
  - name: label
    type: binary
    calibration: true
preprocessing:
  split:
    type: fixed
combiner:
  type: comparator
  entity_1:
    - prod_name
    - product_type_name
    - product_group_name
    - graphical_appearance_name
    - colour_group_name
    - perceived_colour_value_name
    - perceived_colour_master_name
    - department_name
    - index_name
    - index_group_name
    - section_name
    - garment_group_name
    - detail_desc
  entity_2:
    - customer_id
    - FN
    - Active
    - club_member_status
    - fashion_news_frequency
    - age
    - postal_code
trainer:
  batch_size: 1024
```
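A few of the key parameters set are outlined below:
- `output_features.calibration`: Setting this to `true` asks Ludwig to calibrate the predicted probabilities (via temperature scaling) so that they better reflect the likelihood of a purchase. This matters because we rank products by probability later.
- `preprocessing.split.type`: `fixed` tells Ludwig to use the `split` column we created above (0 = train, 1 = validation, 2 = test) instead of re-splitting the data.
- `combiner.type`: The `comparator` combiner encodes two entities separately and compares their representations, which suits matching a product against a customer.
- `combiner.entity_1`: The input features that describe the first entity, here the product (article).
- `combiner.entity_2`: The input features that describe the second entity, here the customer.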

For more configuration details, check out the [Ludwig LLM Docs](https://ludwig.ai/0.8/configuration/large_language_model/)!
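
Since the config lists each feature name twice (once under `input_features` and once under a combiner entity), a small consistency check can catch typos before training. The sketch below is a hypothetical helper, not part of Ludwig or the Predibase SDK, and it checks only the layout used in this tutorial: every input feature appears in exactly one of `entity_1` or `entity_2`. The toy config is a shortened stand-in for `recsys_config`.

```python
def check_entities(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the layout is consistent."""
    features = [f["name"] for f in config["input_features"]]
    entity_1 = config["combiner"]["entity_1"]
    entity_2 = config["combiner"]["entity_2"]

    problems = []
    # Each input feature should be listed in exactly one entity
    for name in features:
        hits = (name in entity_1) + (name in entity_2)
        if hits != 1:
            problems.append(f"{name}: listed in {hits} entities, expected 1")
    # Each entity entry should refer to a declared input feature
    for name in entity_1 + entity_2:
        if name not in features:
            problems.append(f"{name}: not an input feature")
    return problems

# Shortened toy config with the same shape as the tutorial's config
toy_config = {
    "input_features": [
        {"name": "prod_name", "type": "text"},
        {"name": "customer_id", "type": "category"},
        {"name": "age", "type": "number"},
    ],
    "combiner": {
        "type": "comparator",
        "entity_1": ["prod_name"],
        "entity_2": ["customer_id", "age"],
    },
}

problems = check_entities(toy_config)
print(problems or "entity layout is consistent")
```

The same function can be called on the full config dictionary once it is defined.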

In [None]:
recsys_config = {
    'model_type': 'ecd',
    'input_features': [
        {'name': 'prod_name', 'type': 'text'},
        {'name': 'product_type_name', 'type': 'category'},
        {'name': 'product_group_name', 'type': 'category'},
        {'name': 'graphical_appearance_name', 'type': 'category'},
        {'name': 'colour_group_name', 'type': 'category'},
        {'name': 'perceived_colour_value_name', 'type': 'category'},
        {'name': 'perceived_colour_master_name', 'type': 'category'},
        {'name': 'department_name', 'type': 'category'},
        {'name': 'index_name', 'type': 'category'},
        {'name': 'index_group_name', 'type': 'category'},
        {'name': 'section_name', 'type': 'category'},
        {'name': 'garment_group_name', 'type': 'category'},
        {'name': 'detail_desc', 'type': 'text'},
        {'name': 'customer_id', 'type': 'category'},
        {'name': 'FN', 'type': 'category'},
        {'name': 'Active', 'type': 'category'},
        {'name': 'club_member_status', 'type': 'category'},
        {'name': 'fashion_news_frequency', 'type': 'category'},
        {'name': 'age', 'type': 'number'},
        {'name': 'postal_code', 'type': 'category'}
    ],
    'output_features': [
        {'name': 'label', 'type': 'binary', 'calibration': True}
    ],
    'preprocessing': {
        'split': {
            'type': 'fixed'
        }
    },
    'combiner': {
        'type': 'comparator',
        'entity_1': [
            'prod_name',
            'product_type_name',
            'product_group_name',
            'graphical_appearance_name',
            'colour_group_name',
            'perceived_colour_value_name',
            'perceived_colour_master_name',
            'department_name',
            'index_name',
            'index_group_name',
            'section_name',
            'garment_group_name',
            'detail_desc'
        ],
        'entity_2': [
            'customer_id',
            'FN',
            'Active',
            'club_member_status',
            'fashion_news_frequency',
            'age',
            'postal_code'
        ]
    },
    'trainer': {
        'batch_size': 1024,
    }
}

<br/>

## Model Training 🏁

Finally, we can kick off our model training job! With this call, we will create both a [Model Repository](https://docs.predibase.com/user-guide/models/model-repos) and train our first recsys model in that repo. As you can see, all of the pieces above have been plugged into this function call. Once you run this cell, you can click on the link to track the training progress in the UI.

In [None]:
HM_recsys_model = pc.create_model(
    repository_name="H&M Recommender System Model",
    dataset=HM_dataset,
    config=recsys_config,
    engine=train_engine,
    repo_description="Recommend fashion products to customers",
    model_description="Baseline Model"
)

## Deployment and Prediction 🎯

Now that our model has finished training, we can deploy and start generating predictions. There are a few ways that you can generate predictions from a Predibase model:

1. __REST API__ - The [deployment object](https://docs.predibase.com/user-guide/supervised-ml/deployments/) has an attribute called *deployment_url*, to which you can make an HTTP request in order to get predictions.

2. __deployment.predict()__ - You can call `predict()` on the deployment object itself, passing in a dataframe. Just make sure to install the predibase predictor with `pip install "predibase[predictor]"`.

3. __PQL__ - You can predict directly on data within Predibase using Predictive Query Language (PQL), our extension to SQL that allows you to run predictions with models trained in Predibase on data selected with SQL.

Our deployment API is recommended for real-time inference whereas PQL is tailored to batch prediction. For this example though, we will use PQL which is available in the free trial.

<br/>

For these prediction examples, we're going to grab a random customer from the test set and generate recommendations from the products that they've interacted with in this dataset.

Generally speaking, candidate products to generate recommendations with are generated with another service (collaborative filtering, content-based filtering, etc.). The output of this service becomes the input to the recsys model that we've built, which will rank the provided candidates, letting us know which ones to recommend.
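
The ranking step described above can be sketched on toy data as follows. The item names and probabilities are made up, the `propensity` column name is hypothetical, and the sketch assumes each probability vector is ordered as `[P(no purchase), P(purchase)]`.

```python
import pandas as pd

# Toy candidate set for one customer (made-up values)
candidates = pd.DataFrame({
    "item": ["item_a", "item_b", "item_c", "item_d"],
    "label_probabilities": [[0.2, 0.8], [0.35, 0.65], [0.4, 0.6], [0.9, 0.1]],
})

# Keep only the probability of the positive class (index 1 of each vector)
candidates["propensity"] = candidates["label_probabilities"].apply(lambda p: p[1])

# Rank candidates by propensity to purchase and keep the top three
top_k = candidates.sort_values("propensity", ascending=False).head(3)
print(top_k[["item", "propensity"]])
```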

In [None]:
# Select random customer to generate predictions for
random_customer_id = test_df.customer_id.sample(1).iloc[0]

# Grab the set of products that this customer has interacted with - the input to our model
prediction_data = test_df[test_df.customer_id == random_customer_id]

# Run inference using PQL. This might take a while the first time.
# In a production setting, you'd use the deployment API for real-time predictions.
preds = HM_recsys_model.predict('label', source=prediction_data, probabilities=True)

# Convert output vector into propensity to purchase value
preds.label_probabilities = preds.label_probabilities.apply(lambda x: x[1])

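# Add propensity to purchase to product data. Reset both indexes so rows are matched
# by position (this assumes preds is returned in the same row order as the input).
prediction_data = prediction_data.reset_index(drop=True).join(
    preds[["label_probabilities"]].reset_index(drop=True)
)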

<br/>

## Generate Targeted Emails 🦙

At last, we can use the recommended products to generate targeted emails for the customer based on the customer + product info. We will be using Predibase LLM capabilities to prompt Llama-2-13B-chat and generate these custom-tailored emails.


In [None]:
def generate_custom_email(product: pd.Series):
    """
    Function that takes in a row of a pandas dataframe containing product and user information
    and returns a custom tailored email advertising that product.

    :param product: A pandas Series containing product details
    :return: Custom generated email
    """
    # Extract and format product information
    product_info = ", ".join(f"{key}: {val}" for key, val in product[article_cols].items())

    # Construct prompt containing product info to feed into LLM
    prompt = f"""
    Generate a personalized email for an outreach campaign advertising a product.
    Product information: '{product_info}'.
    Email:
    """

    return pc.prompt(prompt, "llama-2-13b-chat", options={"max_new_tokens": 512}).response[0]

In [None]:
# Grab the top three recommendations to generate custom emails for
top_recommendations = prediction_data.sort_values("label_probabilities", ascending=False).head(3)

<br/>

### Email to be sent out on campaign round 1:

In [None]:
email_1 = generate_custom_email(top_recommendations.iloc[0])
print(email_1)


    Subject: 🌸 Introducing the Angie Blouse - Elevate Your Wardrobe with Timeless Elegance 🌸
    
    Dear [Recipient's Name],
    
    We hope this email finds you well! 😊
    
    We are thrilled to introduce our latest addition to our Ladieswear collection - the stunning Angie Blouse! 💃
    
    This beautiful blouse is crafted from a high-quality textured weave with a subtle sheen, giving it a sophisticated and timeless look. The concealed buttons at the top and small stand-up collar with a frill add a touch of elegance, while the double-layered yoke and gathered seam at the back create a flattering fit. The long, voluminous sleeves with wide, buttoned cuffs complete the look, making it perfect for any formal or semi-formal occasion. 👩‍💼👩‍🎤
    
    The Angie Blouse is part of our Garment Upper body category, and we have carefully selected the Off White colour with a Dusty Light perceived colour value to ensure it complements a variety of skin tones and outfits. 🌸👠
    
    Don't 

<br/>

### Email to be sent out on campaign round 2:

In [None]:
email_2 = generate_custom_email(top_recommendations.iloc[1])
print(email_2)


    Subject: 💃🏻 Shimmering Socks for Your Nighttime Adventures! 💃🏻
    
    Hey [Name],
    
    Are you tired of boring socks ruining your nighttime adventures? 😴 Well, we've got just the thing for you! 😍
    
    Introducing our new line of 1p Lurex Socks - the perfect accessory for any night out or cozy night in! 💃🏻 These socks are made of a soft, rib-knit cotton blend containing glittery threads that will add a touch of sparkle to your outfit. 💎
    
    Not only do they look amazing, but they're also super comfortable and perfect for any occasion. Whether you're going out with friends, having a movie night, or just lounging around the house, these socks have got you covered! 😊
    
    So why wait? 🤔 Treat yourself to a pair (or two, or three... 😉) today and add a little bit of glamour to your nighttime routine. Plus, with our affordable prices, you can stock up and never sacrifice style for comfort again! 💰
    
    Click the link below to shop now and get ready to shimmer and s

<br/>

### Email to be sent out on campaign round 3:

In [None]:
email_3 = generate_custom_email(top_recommendations.iloc[2])
print(email_3)


    Subject: 🔥 Get Ready to Elevate Your Style with Our Latest Addition - Siri Basic Cardigan 🔥
    
    Dear [Recipient Name],
    
    We hope this email finds you well! 😊
    
    We are thrilled to introduce our latest addition to our Divided Collection - the stunning Siri Basic Cardigan! 😍 This cardigan is a must-have for any fashion-forward individual looking to elevate their style this season.
    
    Here are some key features of the Siri Basic Cardigan:
    
    • Soft knit with dropped shoulders, long sleeves, front pockets and no buttons
    • Perfect for both casual and dressy occasions
    • Available in Dark Pink, a versatile and timeless colour that will complement any outfit
    • Part of our Tops Knitwear department, ensuring a comfortable and stylish fit
    
    We believe that this cardigan is a game-changer for anyone looking to add a touch of sophistication and elegance to their wardrobe. With its dropped shoulders, long sleeves, and front pockets, it's perfect 

<br/>

## Conclusion

And there you have it: an end-to-end tutorial on how to set up a recommender system model in Predibase and chain the outputs of this model with Predibase's LLM capabilities to generate custom-tailored emails for the recommended products.

When it comes to the generation step, you have a lot of control over the generated email via the prompt you pass in. As you saw in the `generate_custom_email` function, we used a fairly generic prompt where we just pass in the product info and ask the LLM to generate an advertising email. However, you can try all sorts of things with the prompt, such as adjusting tone or length, or even passing in user information to put into the email if you have it available.
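
As one possible starting point, the sketch below builds a prompt with adjustable tone, length, and recipient name. The helper `build_prompt` and its parameters are hypothetical and not part of the Predibase SDK, and the exact wording of the instructions is only an assumption about what might steer the model.

```python
def build_prompt(product_info: str, tone: str = "friendly",
                 max_words: int = 150, customer_name: str = "") -> str:
    """Assemble an email-generation prompt with optional style and recipient controls."""
    # Fall back to a generic greeting when no customer name is available
    if customer_name:
        greeting = f"Address the email to {customer_name}."
    else:
        greeting = "Use a generic greeting."
    return (
        "Generate a personalized email for an outreach campaign advertising a product.\n"
        f"Tone: {tone}. Keep the email under {max_words} words. {greeting}\n"
        f"Product information: '{product_info}'.\n"
        "Email:\n"
    )

# Example with made-up values
prompt = build_prompt(
    "prod_name: Angie, product_type_name: Blouse",
    tone="playful",
    max_words=100,
    customer_name="Sam",
)
print(prompt)
```

The resulting string could be passed to the LLM in place of the generic prompt inside `generate_custom_email`.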

The purpose of this tutorial was to provide a basic guideline for building solutions of this nature. In addition to this tutorial notebook, we also have a [sample application](https://github.com/predibase/examples/tree/main/model-augmented-generation) that we've generated to showcase what a simple application built with this type of tooling could look like. We hope you enjoyed this tutorial!