# Predict CTR and Evaluate ROI

Click-through rate (CTR) is the number of clicks an ad receives divided by the number of impressions (views). Accurately predicting CTR is critical to media-buying decisions because it helps ensure ads are targeted to the right users. This template predicts click-through rates based on categorical and numeric features. It also calculates the cost, return, and return on investment (ROI) based on the predicted clicks. This template is designed for those interested in optimizing ads with machine learning.

To use this template, you will need a dataset containing click log data that meets the following conditions:
- There are at least two feature columns and a column with a binary target variable indicating whether the user clicked or not (0 or 1).
- Any additional features you want to include have already been created. You can review this [course](https://app.datacamp.com/learn/courses/feature-engineering-for-machine-learning-in-python) for more information on feature engineering.
- There are no NaN/NA values. You can use [this template to impute missing values](https://app.datacamp.com/workspace/templates/recipe-python-impute-missing-data) if needed.
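
The conditions above can be checked programmatically before running the rest of the template. The following is a minimal sketch, not part of the original template: the helper name `check_click_log` and the toy DataFrame are hypothetical, and the target column is assumed to be called `click`.

```python
import pandas as pd


def check_click_log(data: pd.DataFrame, target: str = "click") -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if target not in data.columns:
        problems.append(f"missing target column '{target}'")
        return problems
    # Every column other than the target is treated as a feature
    if data.shape[1] - 1 < 2:
        problems.append("fewer than two feature columns")
    if not set(data[target].dropna().unique()) <= {0, 1}:
        problems.append("target contains values other than 0 and 1")
    if data.isna().any().any():
        problems.append("missing values present")
    return problems


# Tiny made-up click log for illustration only
toy = pd.DataFrame(
    {"banner_pos": [0, 1, 0, 1], "device_type": [1, 1, 4, 1], "click": [0, 1, 0, 0]}
)
toy_problems = check_click_log(toy)
print(toy_problems)  # []
```

An empty list suggests the data meets the listed conditions; any message points to the condition that needs attention.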

The placeholder dataset in this template consists of web browser data, including the search engine type, position of the ad, and whether or not the user clicked.

## 1. Loading packages and data

In [None]:
# Load packages
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Replace with your CSV file path and load the data
df = pd.read_csv("data/ctr_data.csv")

# Preview the data
df

The `pandas` method [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) prints a summary of the data. This includes the data types and the number of non-missing values. The summary is helpful to understand what types of pre-processing may be necessary.

In [None]:
df.info()

## 2. Pre-processing the Data

The code below uses sklearn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to process the numeric and categorical data in preparation for a machine learning model.

- Standardizing numeric columns is important for some machine learning models. The code below reduces skew in the numeric columns using [`PowerTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) with the Box-Cox method, which requires strictly positive values, and then scales them using [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
- The categorical data is one-hot encoded using [`OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=one%20hot#sklearn.preprocessing.OneHotEncoder), which creates a new binary column for each category in a column.

_Note: Three numeric columns, "search_engine_type_count", "product_type_count", and "advertiser_type_count", were created for the purposes of this template. Although feature engineering is outside the scope of this template, you can review how they were created [here](https://campus.datacamp.com/courses/predicting-ctr-with-machine-learning-in-python/exploratory-ctr-data-analysis?ex=5)._

In [None]:
# Specify the numeric columns you wish to process
numeric_features = [
    "search_engine_type_count",
    "product_type_count",
    "advertiser_type_count",
]

# Specify the categorical columns you wish to process
categorical_features = [
    "banner_pos",
    "device_type",
    "device_conn_type",
    "product_type",
    "advertiser_type",
]

# Transform the numeric and categorical columns
numeric_transformer = Pipeline(
    steps=[
        ("boxcox", PowerTransformer(method="box-cox", standardize=False)),
        ("scaler", StandardScaler()),
    ]
)

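# `sparse_output` replaces the `sparse` argument in scikit-learn 1.2+; use `sparse=False` on older versions
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)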

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

processed_features = pd.DataFrame(
    preprocessor.fit_transform(df), columns=preprocessor.get_feature_names_out()
)

# Preview the processed DataFrame
processed_features

Finally, the data is split into the target and feature variables, and then further split into training and testing datasets using [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
# Select the processed columns
X = processed_features

# Select the column you wish to use as a target
y = df["click"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
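
One design consideration: the preprocessing cell above fits the transformers on the full dataset before the split, so the scaling statistics are computed with the test rows included. A common alternative is to split first and fit the transformers on the training rows only. The sketch below shows that ordering on a tiny made-up DataFrame; the column names and values are hypothetical, and this is not how the template itself is written.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only
toy = pd.DataFrame(
    {
        "ad_count": [3, 10, 7, 1, 4, 8, 2, 6, 5, 9],
        "device": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
        "click": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    }
)

# Split the raw rows first
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=0)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["ad_count"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["device"]),
    ]
)

# Fit on the training rows only, then reuse the fitted transformers on the test rows
X_train_toy = toy_preprocessor.fit_transform(train_df)
X_test_toy = toy_preprocessor.transform(test_df)

print(X_train_toy.shape, X_test_toy.shape)
```

With this ordering, the test set plays the role of unseen data during both preprocessing and model fitting.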

## 3. Setting a Baseline and Evaluation Metrics
The following evaluation assumes a bidding process where you bid on impressions that you predict will result in a click and do not bid on impressions that you predict will not result in a click. The code below defines two functions to evaluate your model. The first function calculates the CTR metrics and takes as arguments:
- The cost you expect to pay for each impression you bid on (`c`). Impressions are typically priced per 1,000, so divide that price by 1,000 to get a per-impression cost.
- The downstream return per click (`r`), e.g., based on the chance that a customer purchases a product.
- The true values (`true_values`).
- The predicted values (`predictions`) based on your model.

The function then calculates the total return, total cost, and return on investment (ROI) of the model. ROI is computed here as total return divided by total cost.
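
As a worked example of this arithmetic, the sketch below uses made-up counts (the variable names are illustrative, not the template's). Because ROI is the ratio of return to cost, a value above 1 means the bids returned more than they cost.

```python
# Hypothetical counts: 100 bids, of which 30 resulted in a click
true_positives = 30   # bids that received a click
false_positives = 70  # bids that did not receive a click

cost_per_bid = 0.05     # plays the role of c
return_per_click = 0.2  # plays the role of r

# Only clicked bids earn a return, but every bid incurs a cost
total_return = true_positives * return_per_click
total_cost = (true_positives + false_positives) * cost_per_bid
roi = total_return / total_cost if total_cost else 0.0

print(round(total_return, 2), round(total_cost, 2), round(roi, 2))  # 6.0 5.0 1.2
```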

The second function accepts two of the same arguments (`true_values` and `predictions`), and generates two classification metrics to help you evaluate the model:
- Precision: the proportion of impressions you predicted to result in a click that actually did (i.e., true positives divided by true positives plus false positives). Higher precision means less spend wasted on bids that receive no click, which increases your ROI on ad spend.
- Recall: the proportion of all clicks that occurred that you correctly predicted (i.e., true positives divided by true positives plus the clicks you missed). Higher recall means fewer missed opportunities to reach users who would have clicked.
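
The two definitions can be checked on a handful of made-up labels. This sketch uses scikit-learn's default binary averaging, which scores the click class (label 1) only, matching the definitions above; the toy arrays are hypothetical.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = click, 0 = no click
toy_true = [1, 0, 0, 1, 0, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Here there are 2 true positives, 2 false positives, and 1 missed click
toy_precision = precision_score(toy_true, toy_pred, zero_division=0)  # 2 / (2 + 2)
toy_recall = recall_score(toy_true, toy_pred, zero_division=0)        # 2 / (2 + 1)

print(toy_precision, round(toy_recall, 3))  # 0.5 0.667
```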

The code also sets a baseline by evaluating a model that places no bids (i.e., every event is predicted to be a non-click). If you want to learn more about evaluating your CTR prediction model, be sure to check out these two videos on [CTR metrics](https://campus.datacamp.com/courses/predicting-ctr-with-machine-learning-in-python/model-applications-and-improvements?ex=1) and [model evaluation](https://campus.datacamp.com/courses/predicting-ctr-with-machine-learning-in-python/model-applications-and-improvements?ex=5).

In [None]:
# Set the costs and returns you expect for clicks and impressions here!
c = 0.05
r = 0.2

# Define a function to calculate CTR metrics
def calculate_ctr_metrics(true_values, predictions, r, c):

    # Generate a confusion matrix (fixing the labels keeps it 2x2 even if a class is absent)
    conf_matrix = confusion_matrix(true_values, predictions, labels=[0, 1])
    tn, fp, fn, tp = conf_matrix.ravel()

    # Calculate the total return, cost, and roi
    total_return = tp * r
    total_cost = (tp + fp) * c
    if total_cost != 0:
        roi = total_return / total_cost
    else:
        roi = 0.0

    # Return the metrics as a dictionary
    ctr_metrics = {
        "total_return": round(total_return, 2),
        "total_cost": round(total_cost, 2),
        "roi": round(roi, 2),
    }
    return ctr_metrics


def calculate_model_metrics(true_values, predictions):

    # Calculate the precision and the recall for the click class (label 1),
    # using the default binary averaging so the scores match the definitions above
    prec = precision_score(true_values, predictions, zero_division=0)
    recall = recall_score(true_values, predictions, zero_division=0)

    # Return the metrics as a dictionary
    model_metrics = {"precision": round(prec, 5), "recall": round(recall, 5)}
    return model_metrics


# Set up baseline data containing no bids
baseline_pred = np.zeros(len(X_test), dtype=int)

# Evaluate the performance with no bids
baseline_metrics = calculate_model_metrics(true_values=y_test, predictions=baseline_pred)

print("Precision:", baseline_metrics["precision"])
print("Recall:", baseline_metrics["recall"])

## 4. Fitting and Tuning a Model
With a baseline established, it is now possible to build a classification model and evaluate its performance. The following code uses a [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and [`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to initialize a random forest classifier and perform a randomized search on hyperparameters.

The classifier returns predictions based on a `threshold` that you can customize. An event is classified as a click when its predicted click probability is at or above this threshold. For example, you may want to increase the threshold when costs are high so that you only spend when you are confident in a click.

By default the threshold is set at 0.25. Try adjusting the threshold to see how it affects the CTR and model metrics!
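
To see how the threshold changes the number of bids, consider this small sketch with made-up probabilities (the values and variable names are hypothetical, not model output).

```python
import numpy as np

# Made-up predicted click probabilities for eight impressions
toy_proba = np.array([0.05, 0.10, 0.22, 0.27, 0.31, 0.48, 0.65, 0.90])

bids_per_threshold = {}
for toy_threshold in (0.25, 0.5, 0.75):
    # Bid (1) when the probability is at or above the threshold, otherwise do not bid (0)
    toy_bids = (toy_proba >= toy_threshold).astype(int)
    bids_per_threshold[toy_threshold] = int(toy_bids.sum())

print(bids_per_threshold)  # {0.25: 5, 0.5: 2, 0.75: 1}
```

Raising the threshold places fewer bids, which tends to lower total cost and recall while concentrating spend on the impressions the model is most confident about.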

_Note: The randomized search may take some time to run._

In [None]:
# Set the threshold
threshold = 0.25

# Set up the parameters to sample from
# ("auto" was removed from max_features in scikit-learn 1.3; for classifiers it was equivalent to "sqrt")
param_grid = {"max_depth": list(range(5, 20)), "max_features": ["sqrt", "log2"]}

# Instantiate a RandomForestClassifier and RandomizedSearchCV
rf = RandomForestClassifier()

random_rf_class = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=3,
    scoring="roc_auc",
    n_jobs=4,
    cv=3,
)

# Fit it to the data
random_rf_class.fit(X_train, y_train)

# Create predictions: bid (1) when the predicted click probability meets the threshold
random_rf_pred = (
    random_rf_class.best_estimator_.predict_proba(X_test)[:, 1] >= threshold
).astype(int)

# Evaluate the performance of the model
ctr_metrics = calculate_ctr_metrics(
    true_values=y_test,
    predictions=random_rf_pred,
    r=r,
    c=c,
)

model_metrics = calculate_model_metrics(true_values=y_test, predictions=random_rf_pred)

print("Total Return:", ctr_metrics["total_return"])
print("Total Cost:", ctr_metrics["total_cost"])
print("ROI:", ctr_metrics["roi"])
print("Precision:", model_metrics["precision"])
print("Recall:", model_metrics["recall"])

## 5. Plot the ROI for Different Costs and Returns
Finally, it is possible to visualize the effect that different returns (`r`) and costs (`c`) have on the ROI. The following code generates an interactive plot that shows the ROI for different levels of returns. You can further visualize the additional effect of cost levels by using the slider at the bottom.

As you might expect, the lowest costs and the highest returns generate the greatest ROI. Feel free to experiment with the different returns and costs by editing the lists below!
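
The reason is visible in the formula: with the predictions held fixed, ROI equals `tp * r / ((tp + fp) * c)`, so it grows linearly with `r` and shrinks as `c` grows. The sketch below tabulates this for made-up counts (all values are hypothetical).

```python
# Hypothetical confusion-matrix counts, held fixed while r and c vary
toy_tp, toy_fp = 30, 70

toy_returns = [0.2, 0.3, 0.4]
toy_costs = [0.05, 0.1, 0.2]

# ROI for every (cost, return) combination
roi_grid = {
    (cost, ret): round((toy_tp * ret) / ((toy_tp + toy_fp) * cost), 2)
    for cost in toy_costs
    for ret in toy_returns
}

# The best combination pairs the lowest cost with the highest return
best_cost, best_return = max(roi_grid, key=roi_grid.get)
print(best_cost, best_return, roi_grid[(best_cost, best_return)])
```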

In [None]:
# Create figure
fig = go.Figure()

# Set costs and returns
returns = [0.2, 0.25, 0.3, 0.35, 0.4, 0.45]
costs = [0.05, 0.1, 0.15, 0.2, 0.25, 0.30]

# Add traces, one for each cost
for cost in costs:
    roi = [
        calculate_ctr_metrics(
            true_values=y_test, predictions=random_rf_pred, r=ret, c=cost
        )["roi"]
        for ret in returns
    ]
    fig.add_trace(
        go.Scatter(visible=False, x=returns, y=roi, hoverinfo="text", hovertext=roi)
    )

# Make the second trace (cost = 0.1) visible to match the slider's initial position
fig.data[1].visible = True

# Create and add slider
steps = []
for i in range(len(costs)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)}],
        label=str(costs[i]),  # Slider labels are strings
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [
    dict(active=1, currentvalue={"prefix": "Cost: "}, pad={"t": 50}, steps=steps)
]

# Update the layout and show the figure
fig.update_layout(
    sliders=sliders,
    title="ROI for Different Levels of Returns and Costs",
    xaxis_title="Returns",
    yaxis_title="ROI",
)
fig.update_yaxes(autorange=False)

fig.show()

## 6. Next Steps
You can further adjust this template by experimenting with different models and adjusting the cost and return parameters to calculate predicted ROI! If you want to learn more about using machine learning to predict CTR in Python, be sure to check out [this course](https://app.datacamp.com/learn/courses/predicting-ctr-with-machine-learning-in-python).

Note that the model used above serves as an example and may not be the best model for your data. You may want to check out the following courses to learn more about classification and tuning models:
- [Supervised Learning with scikit-learn](https://app.datacamp.com/learn/courses/supervised-learning-with-scikit-learn)
- [Machine Learning with Tree-Based Models in Python](https://app.datacamp.com/learn/courses/machine-learning-with-tree-based-models-in-python)
- [Hyperparameter Tuning in Python](https://app.datacamp.com/learn/courses/hyperparameter-tuning-in-python)
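
As a starting point for experimenting with a different model, the sketch below swaps in a logistic regression and applies the same thresholding idea. It is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the template's dataset), and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for processed click data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Any classifier exposing predict_proba can be thresholded the same way
alt_model = LogisticRegression(max_iter=1000)
alt_model.fit(X_tr, y_tr)

alt_threshold = 0.25
alt_pred = (alt_model.predict_proba(X_te)[:, 1] >= alt_threshold).astype(int)
print("Bids placed:", int(alt_pred.sum()), "of", len(alt_pred))
```

Predictions produced this way can be passed to the same evaluation functions used for the random forest, which makes it straightforward to compare models on ROI, precision, and recall.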