# Value-based bidding

Value-based bidding is a strategy where advertisers set bids based on the estimated value each ad click brings to their business, aiming to optimize return on investment.

### Best-in-class solution

We define value as customer lifetime value (CLV), the present value of a customer's future revenue flows:

$$\text{CLV}_i = \frac{P_i \times V_i \times r_i}{1 + \text{WACC} - r_i}$$

- $P_i$ is price per seat
- $V_i$ is seat volume
- $r_i$ is retention rate
- $\text{WACC}$ is the weighted-average cost of capital
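
As a quick worked example of the formula above, the sketch below computes CLV for a single hypothetical customer. The function name `clv` and the input figures are illustrative assumptions, not values from the dataset, and all inputs are assumed to be expressed over the same period.

```python
def clv(price_per_seat, seats, retention_rate, wacc):
    """CLV = (P * V * r) / (1 + WACC - r), with all inputs on the same period."""
    denominator = 1 + wacc - retention_rate
    if denominator <= 0:
        raise ValueError("1 + WACC - r must be positive")
    return price_per_seat * seats * retention_rate / denominator


# Hypothetical customer: 10 seats at 100 per seat, 80% retention, 15% WACC
example_clv = clv(price_per_seat=100, seats=10, retention_rate=0.8, wacc=0.15)
print(round(example_clv, 2))
```

Note how sensitive the result is to retention: as $r_i$ approaches $1 + \text{WACC}$ the denominator shrinks and CLV grows very quickly.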

We need to predict CLV for each customer within the first 24 hours of signup so that we can send that information to advertisers.

**TL;DR:**
1. Calculate CLV using known information for current signups.
2. Train a regression model to predict CLV for unseen signups.

**Calculating CLV using known information**
- Revenue ($P_i \cdot V_i$): current MRR
- $r_i$: a function of package, with each package's retention rate estimated using an exponential survival model
- $\text{WACC}$: 15%
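
The inputs above can be sketched as follows. This is a minimal illustration, not the notebook's actual estimation code: the function names, the toy durations, and the choice to annualise MRR and retention (treating the 15% WACC as an annual rate) are all assumptions. Under an exponential survival model the survival function is $S(t) = e^{-\lambda t}$, and with right-censored data the maximum-likelihood hazard $\lambda$ is the number of churn events divided by the total observed time.

```python
import math


def exponential_retention(durations_months, churned, months_per_period=12):
    """Per-period retention under an exponential survival model.

    The maximum-likelihood hazard with right-censoring is
    (number of churn events) / (total observed months); the survival
    function is S(t) = exp(-hazard * t).
    """
    events = sum(churned)
    exposure = sum(durations_months)
    hazard = events / exposure  # churn events per customer-month
    return math.exp(-hazard * months_per_period)


def clv_from_mrr(mrr, retention_rate, wacc=0.15, months_per_period=12):
    """Apply the CLV formula with revenue P * V taken as annualised MRR (assumption)."""
    revenue = mrr * months_per_period
    return revenue * retention_rate / (1 + wacc - retention_rate)


# Hypothetical package: five customers observed for this many months;
# 1 marks a customer who churned, 0 one who is still active (censored).
durations = [6, 12, 24, 18, 40]
churned = [1, 0, 1, 0, 0]

r = exponential_retention(durations, churned)
print(f"Annual retention: {r:.3f}")
print(f"CLV at 500 MRR: {clv_from_mrr(500, r):.2f}")
```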

### Good-enough solution

We still define value as CLV, but we focus on predicting which product the person will convert to (if any).


**TL;DR:**
1. Calculate CLV using known information for current signups.
2. Train a classification model to predict the land package for unseen signups.
3. Assign each predicted package the mean CLV of current customers on that package.
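
The lookup in step 3 can be sketched on toy data as below. The package names `Starter` and `Business` and all CLV figures are hypothetical; only the `product` and `clv` column names and the `No conversion` label come from the dataset.

```python
import pandas as pd

# Hypothetical current customers with known package and CLV
customers = pd.DataFrame({
    "product": ["Starter", "Starter", "Business", "No conversion"],
    "clv": [1000.0, 3000.0, 12000.0, 0.0],
})

# Step 3: mean CLV of current customers on each package
mean_clv_by_package = customers.groupby("product")["clv"].mean()

# Hypothetical classifier output for unseen signups (step 2)
predicted_package = pd.Series(["Business", "No conversion", "Starter"])

# Each signup inherits the mean CLV of its predicted package
predicted_value = predicted_package.map(mean_clv_by_package)
print(predicted_value.tolist())
```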


# Data

**$y$ variable**
- $CLV_i$
- $Product_j$

**$x$ variables**
- Segment [str]
- Education email flag [str]
- Company email flag [str]
- Industry [str]
- Revenue [str]
- Employees [str]
- City [str]
- State [str]
- Country [str]
- Data residency [str]
- GA signup flag [str]
- FB signup flag [str]
- Feature usage counts in the first day after signup [int]

# Import

### Import local XLSX file

In [3]:
def import_data():
    import pandas as pd
    file_path = '/Users/patricksweeney/growth/01_Acquisition/03_Value-based bidding/VBB Train 2.xlsx'
    data = pd.read_excel(file_path)
    return data

data = import_data()
data.head()

  warn("Workbook contains no default style, apply openpyxl's default")


Unnamed: 0,workspace_id,product,clv,segment,education_flag,company_email_flag,industry,revenue,employees_range,city,...,transcription_count,highlight_count,tag_count,insight_count,reel_created_count,invite_count,shared_object_note_count,shared_object_insight_count,note_viewed_user_count,tag_viewed_user_count
0,00095959-e65b-4256-aeba-06464ae106ac,No conversion,0.0,OTHER,0,0,Diversified Consumer Services,$50M-$100M,251-1K,Utrecht,...,4,0,0,0,0,0,0,0,0,0
1,00786b99-40f5-4703-a772-3026df9827ff,No conversion,0.0,FREE_EMAIL,0,1,Software & Services,$10B+,100K+,Mountain View,...,0,0,0,0,0,0,0,0,0,0
2,00ddc9a5-85c6-44d9-9968-c37cdad31fcc,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,0,0
3,0160b311-e4f8-4bbd-a06f-e2b4c80d40a1,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,1,0
4,0172345b-7159-4c99-88d9-3e4fa521f14d,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,0,0


### Check data

In [4]:
def find_missing_values(data):
    missing_values = data.isnull().sum()
    print("Features with missing values are...")
    print(missing_values)

find_missing_values(data)

Features with missing values are...
workspace_id                   0
product                        0
clv                            0
segment                        0
education_flag                 0
company_email_flag             0
industry                       0
revenue                        0
employees_range                0
city                           0
state                          0
country_code                   0
ga_signup_flag                 0
fb_signup_flag                 0
residency_region               0
project_count                  0
transcription_count            0
highlight_count                0
tag_count                      0
insight_count                  0
reel_created_count             0
invite_count                   0
shared_object_note_count       0
shared_object_insight_count    0
note_viewed_user_count         0
tag_viewed_user_count          0
dtype: int64


# Feature engineering

### One hot encode


One-hot encoding converts each categorical variable into a set of binary columns, one per category, so that models requiring numeric input can use it. Unlike integer encoding, it does not imply a misleading ordinal relationship between categories.
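
A toy illustration of the idea, using `pandas.get_dummies` rather than the scikit-learn encoder used in the cell that follows. The three-row frame is made up; the `segment` values mirror those in the data preview.

```python
import pandas as pd

# Toy frame; the `segment` values mirror those in the data preview
toy = pd.DataFrame({"segment": ["OTHER", "FREE_EMAIL", "FREE_EMAIL"]})

# One binary column per category
encoded = pd.get_dummies(toy, columns=["segment"], dtype=int)
print(encoded)
```

High-cardinality columns such as `city` produce one column per distinct value, so the encoded frame can become very wide.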

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(data, exclude):
    # Ensure that 'data' is a pandas DataFrame
    if not isinstance(data, pd.DataFrame):
        raise TypeError("Input data must be a pandas DataFrame.")

    # Validate 'exclude' as a list
    if not isinstance(exclude, list):
        raise TypeError("'exclude' must be a list of columns.")

    # Select string and categorical columns to encode, excluding the specified columns
    columns_to_encode = data.select_dtypes(include=['object', 'category']).columns
    columns_to_encode = [col for col in columns_to_encode if col not in exclude]

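    # Apply OneHotEncoder with dense output
    # (`sparse_output` replaced the `sparse` argument in scikit-learn 1.2)
    encoder = OneHotEncoder(sparse_output=False, drop='if_binary')
    # Reuse the original index so the concat below aligns row by row
    encoded_data = pd.DataFrame(encoder.fit_transform(data[columns_to_encode]), index=data.index)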

    # Fix column names after encoding
    encoded_data.columns = encoder.get_feature_names_out(columns_to_encode)

    # Drop original columns and concatenate encoded data
    data = data.drop(columns_to_encode, axis=1)
    data = pd.concat([data, encoded_data], axis=1)

    return data


data = one_hot_encode(data, ['workspace_id', 'product'])
data.head()

KeyboardInterrupt: 

### Best-case performance

In [None]:
def best_case(data, y_variable, exclude, continuous_y=True):
    # Importing necessary libraries
    import pandas as pd
    import numpy as np
    from scipy.stats import entropy, differential_entropy
    from sklearn.feature_selection import mutual_info_regression, mutual_info_classif

    # Exclude specified variables and separate X and Y
    X = data.drop(columns=[y_variable] + exclude)
    Y = data[y_variable]

    # Calculate the entropy of Y-variable
    if continuous_y:
        # Use differential_entropy for continuous Y
        y_entropy = differential_entropy(Y)
        # Calculate mutual information for continuous Y
        mi = mutual_info_regression(X, Y)
    else:
        # Use entropy for discrete Y
        value_counts = Y.value_counts()
        y_entropy = entropy(value_counts)
        # Calculate mutual information for discrete Y
        mi = mutual_info_classif(X, Y)

    total_mi = np.sum(mi)

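    # Proportion of uncertainty reduced. Summing per-feature mutual information ignores
    # overlap between features, so treat this as a rough indicator: it can exceed 1.
    proportion_reduced = total_mi / y_entropy if y_entropy > 0 else 0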

    # Print results
    print(f"Entropy of Y-variable: {y_entropy}")
    print(f"Total Mutual Information (excluding interaction): {total_mi}")
    print(f"Proportion of Uncertainty Reduced: {proportion_reduced}")

# Example usage
best_case(data, 'product', ['clv', 'workspace_id'], continuous_y=False)


### Remove redundant features

In [None]:
import pandas as pd
from sklearn.feature_selection import SelectPercentile, mutual_info_regression, mutual_info_classif

def scikit_prune_features(data, y_variable, exclude, percentile):
    # Ensure that 'exclude' is a list
    if not isinstance(exclude, list):
        raise TypeError("'exclude' must be a list of columns.")

    # Separate the features and the target variable
    X = data.drop(columns=[y_variable] + exclude)
    y = data[y_variable]

    # Determine the score function based on the target variable type
    # (is_float_dtype also catches float32, unlike comparing against 'float')
    if pd.api.types.is_float_dtype(y):
        score_func = mutual_info_regression
    else:
        score_func = mutual_info_classif

    # Apply SelectPercentile
    selector = SelectPercentile(score_func=score_func, percentile=percentile)
    X_new = selector.fit_transform(X, y)

    # Get the selected feature names
    selected_features = X.columns[selector.get_support()]

    # Combine selected features with excluded columns and target variable
    final_data = pd.concat([data[exclude], data[selected_features], data[[y_variable]]], axis=1)

    return final_data

# Example usage:
data = scikit_prune_features(data, 'product', ['workspace_id', 'clv'], 33)
data.head()


# Predicting $CLV_i$ in first 24 hours (not being used)

Performance here is terrible: the model achieves an $R^2$ of about 0%, so it explains essentially none of the variance in CLV.

### Regression with gradient boosting


Gradient boosting is a machine learning technique that builds an ensemble sequentially: each new model is fit to correct the errors of the models before it, minimising a chosen loss function. Combining many weak learners, typically shallow decision trees, yields a much stronger predictor.
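
To make the sequential error correction concrete, here is a minimal hand-rolled boosting loop for squared-error loss on synthetic data. It is a sketch of the idea only, not what `GradientBoostingRegressor` does internally in full; the data, tree depth, learning rate and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared-error loss, the residuals are the negative gradient
    # of the loss (up to a constant factor)
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)  # the weak learner fits the current errors
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error falls with every round, which is why held-out evaluation (as in the cross-validation below) is needed to judge real performance.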

In [None]:
def gradient_boosting_regression(data, y_variable, random_state=42):
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split, cross_val_score, KFold
    from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
    import matplotlib.pyplot as plt
    import numpy as np

    # Separate the features and target variable
    X = data.drop(columns=[y_variable]).select_dtypes(include=np.number)
    y = data[y_variable]

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

    # 1. Train model
    model = GradientBoostingRegressor(random_state=random_state)
    model.fit(X_train, y_train)

    # 2. Test model
    cv = KFold(n_splits=10, shuffle=True, random_state=random_state)
    cross_val_scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    
    # Rounding the scores to two decimal places
    rounded_scores = [round(score, 2) for score in cross_val_scores]
    print("Cross-validation scores (R2):", rounded_scores)

    # Predictions
    y_pred = model.predict(X_test)

    # 3. Performance Metrics
    mse = round(mean_squared_error(y_test, y_pred), 2)
    r2 = round(r2_score(y_test, y_pred), 2)
    mae = round(mean_absolute_error(y_test, y_pred), 2)
    print(f"Mean Squared Error: {mse}")
    print(f"R2 Score: {r2}")
    print(f"Mean Absolute Error: {mae}")

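    # 4. Plot Predictions vs Actual with adjusted log10 scale
    offset = 1e-6  # Small constant so that zero values have a finite log
    adjusted_y_test = np.log10(np.clip(y_test, 0, None) + offset)
    # Clip at zero first: the regressor can predict negative CLV, and log10 of a negative number is NaN
    adjusted_y_pred = np.log10(np.clip(y_pred, 0, None) + offset)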

    plt.scatter(adjusted_y_test, adjusted_y_pred)
    plt.xlabel("Actual Values (log10 scale)")
    plt.ylabel("Predicted Values (log10 scale)")
    plt.title("Predicted vs Actual Values (log10 scale)")
    
    # Line for perfect predictions
    min_val = min(adjusted_y_test.min(), adjusted_y_pred.min())
    max_val = max(adjusted_y_test.max(), adjusted_y_pred.max())
    plt.plot([min_val, max_val], [min_val, max_val], 'k--')

    plt.show()
    
    
    # 5. Feature Importance - Updated to show only top 10
    feature_importance = model.feature_importances_
    sorted_idx = np.argsort(feature_importance)[-10:]  # Get the indices of the top 10 features
    
    plt.barh(X.columns[sorted_idx], feature_importance[sorted_idx])
    plt.xlabel("Gradient Boosting Feature Importance")
    plt.title("Top 10 Features")
    plt.tight_layout()  # Must run before show(); calling it afterwards only opens an empty figure
    plt.show()

    return model

model = gradient_boosting_regression(data, 'clv')

# Predicting $Package_i$ in first 24 hours (being used)

We want to minimise the false negative rate (converters the model misses), even if that makes the classifier overly sensitive and produces more false positives.
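
One way to act on this is to lower the decision threshold until recall on converters reaches a target, accepting the extra false positives. The sketch below is an assumption about how that could be done, not part of the original pipeline; the helper name `threshold_for_recall`, the toy labels and the probabilities are all hypothetical.

```python
import numpy as np


def threshold_for_recall(y_true, p_conversion, target_recall=0.95):
    """Highest probability threshold whose recall on converters meets the target."""
    y_true = np.asarray(y_true, dtype=bool)
    p_conversion = np.asarray(p_conversion, dtype=float)
    positive_scores = np.sort(p_conversion[y_true])[::-1]  # converters, highest first
    # Number of converters that must be flagged to reach the target recall
    needed = max(int(np.ceil(target_recall * len(positive_scores))), 1)
    return positive_scores[needed - 1]


# Hypothetical labels (1 = converted) and predicted conversion probabilities
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
p_conv = [0.9, 0.4, 0.35, 0.1, 0.6, 0.7, 0.2, 0.15]

threshold = threshold_for_recall(y_true, p_conv, target_recall=0.75)
flagged = np.asarray(p_conv) >= threshold
recall = (flagged & np.asarray(y_true, dtype=bool)).sum() / sum(y_true)
print(threshold, recall)
```

In practice the threshold should be chosen on held-out data rather than on the rows used for training.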

### Optional: Make product a binary conversion event

In [6]:
import pandas as pd

def make_binary_conversion(data, y_variable):
    # Check if y_variable exists in the dataframe
    if y_variable not in data.columns:
        raise ValueError(f"{y_variable} is not a column in the provided dataframe.")

    # Convert the y_variable to binary
    data[y_variable] = data[y_variable].apply(lambda x: 'No conversion' if x == 'No conversion' else 'Conversion')

    return data

data = make_binary_conversion(data, 'product')
data.head()


Unnamed: 0,workspace_id,product,clv,segment,education_flag,company_email_flag,industry,revenue,employees_range,city,...,transcription_count,highlight_count,tag_count,insight_count,reel_created_count,invite_count,shared_object_note_count,shared_object_insight_count,note_viewed_user_count,tag_viewed_user_count
0,00095959-e65b-4256-aeba-06464ae106ac,No conversion,0.0,OTHER,0,0,Diversified Consumer Services,$50M-$100M,251-1K,Utrecht,...,4,0,0,0,0,0,0,0,0,0
1,00786b99-40f5-4703-a772-3026df9827ff,No conversion,0.0,FREE_EMAIL,0,1,Software & Services,$10B+,100K+,Mountain View,...,0,0,0,0,0,0,0,0,0,0
2,00ddc9a5-85c6-44d9-9968-c37cdad31fcc,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,0,0
3,0160b311-e4f8-4bbd-a06f-e2b4c80d40a1,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,1,0
4,0172345b-7159-4c99-88d9-3e4fa521f14d,No conversion,0.0,FREE_EMAIL,0,1,Unknown,Unknown,Unknown,Unknown,...,0,0,0,0,0,0,0,0,0,0


### Classification with gradient boosting

In [7]:
def gradient_boosting(data, y_variable, exclude):
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
    from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve, cohen_kappa_score
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    random_state = 2
    
    # Ensure 'exclude' is a list
    if not isinstance(exclude, list):
        raise TypeError("'exclude' must be a list of columns.")

    # Separate features and target variable
    X = data.drop(columns=[y_variable] + exclude).select_dtypes(include=np.number)
    y = data[y_variable]

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

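    # Train model (seeded so that results are reproducible)
    model = GradientBoostingClassifier(random_state=random_state)
    model.fit(X_train, y_train)

    # Test model
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)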
    cross_val_scores = cross_val_score(model, X, y, cv=cv, scoring='f1_macro')
    print("Cross-validation scores:", cross_val_scores)

    # Predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    # 4. Classification report
    print(classification_report(y_test, y_pred))

    # Confusion matrix
    print("Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))

    # Compute ROC, AUC, Precision-Recall for each class
    classes = np.unique(y)
    for i, cls in enumerate(classes):
        fpr, tpr, _ = roc_curve((y_test == cls).astype(int), y_proba[:, i])
        roc_auc = auc(fpr, tpr)
        precision, recall, _ = precision_recall_curve((y_test == cls).astype(int), y_proba[:, i])
        
        # ROC Curve
        plt.figure()
        plt.plot(fpr, tpr, label='Class %s AUC = %0.2f' % (cls, roc_auc))
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC for class %s' % cls)
        plt.legend(loc="lower right")
        plt.show()

        # Precision-Recall Curve
        plt.figure()
        plt.plot(recall, precision)
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-Recall Curve for class %s' % cls)
        plt.show()
    

    # # Compute and plot Lift Chart
    # df_lift = pd.DataFrame({'y_test': y_test, 'y_proba': y_proba})
    # df_lift = df_lift.sort_values(by='y_proba', ascending=False)
    # df_lift['decile'] = pd.qcut(df_lift['y_proba'], 10, labels=False)
    # df_lift['num_positive'] = df_lift['y_test'].cumsum()
    # df_lift['total'] = df_lift.index + 1
    # df_lift['lift'] = df_lift['num_positive'] / df_lift['total']

    
    # Cross-validation score
    print("Average cross-validation score:", np.mean(cross_val_scores))

    # Cohen's Kappa
    kappa = cohen_kappa_score(y_test, y_pred)
    print("Cohen's Kappa:", kappa)

    
    return model

In [None]:
model = gradient_boosting(data, 'product', ['clv'])

# Get value-weightings

In [None]:
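def get_predictions(data, model, y_variable, exclude):
    import numpy as np

    # Score every row in the dataset (no filtering is applied)
    target_data = data

    # Exclude specified variables and keep only the numeric features the model was trained on
    X_target = target_data.drop(columns=[y_variable] + exclude).select_dtypes(include=np.number)

    # Make predictions
    predictions = model.predict(X_target)
    all_probabilities = model.predict_proba(X_target)

    # predict_proba columns follow model.classes_, which is sorted, so column 1 is
    # 'No conversion' for the binary target. Look up the 'Conversion' column by name;
    # for a multi-class target, fall back to the probability of the predicted class.
    classes = list(model.classes_)
    if 'Conversion' in classes:
        probabilities = all_probabilities[:, classes.index('Conversion')]
    else:
        probabilities = all_probabilities.max(axis=1)

    # Append predictions and probabilities to the original data
    data.loc[target_data.index, 'Prediction'] = predictions
    data.loc[target_data.index, 'Probability'] = probabilities

    return data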

# Example usage
predictions = get_predictions(data, model, 'product', ['clv'])


In [None]:
def get_weightings(data, prediction_column, clv_column):
    import pandas as pd

    # Calculate the average CLV for each prediction category
    average_clvs = data.groupby(prediction_column)[clv_column].mean()

    # Create a new column 'weighting' with the average CLV for each prediction
    data['weighting'] = data[prediction_column].map(average_clvs)

    # Select only the necessary columns
    data = data[['workspace_id', prediction_column, 'weighting']]

    return data

weightings = get_weightings(data, 'Prediction', 'clv')
weightings.head()


# Save predictions

In [None]:
def save_predictions(predictions, filename):
    import pandas as pd

    # Ensure the filename ends with '.xlsx'
    if not filename.endswith('.xlsx'):
        filename += '.xlsx'

    # Save to Excel
    predictions.to_excel(filename, index=False)

save_predictions(weightings, 'predictions_vbb.xlsx')

In [None]:
# Save model
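
The cell above is empty in the original. One common option for persisting a scikit-learn estimator is `joblib`; the sketch below is an assumption about how this step could look, with a tiny stand-in model and a hypothetical file name rather than the notebook's trained classifier.

```python
import joblib
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the trained classifier returned by gradient_boosting(...)
toy_model = GradientBoostingClassifier(n_estimators=5, random_state=0)
toy_model.fit(
    [[0], [1], [2], [3]],
    ["No conversion", "No conversion", "Conversion", "Conversion"],
)

# Hypothetical file name
model_path = "vbb_model.joblib"
joblib.dump(toy_model, model_path)

# Reload later to score new signups without retraining
restored = joblib.load(model_path)
print(restored.predict([[3]]))
```

A saved model should be loaded with the same scikit-learn version that produced it, since pickled estimators are not guaranteed to be compatible across versions.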