Due to the cost of delivering email SMARTMARKET decides to send email to only 25% of it's
subscriber base for week 27.

Given the data provided:

(1) Which subscribers would you send email to?

(2) Which campaign(s) would you deliver to them?

(3) What do you expect the response rate to be?

In [2]:
import pandas as pd
import numpy as np

In [3]:
file_path = 'hist.xlsx'

In [4]:
df = pd.read_excel(file_path)

In [5]:
print(df.head(5))

   week_id  customer_id attribute1  state_id Sex  campaign_id  response
0        1            1          B         2   M            1         1
1        1            2          A        38   F            2         0
2        1            3          C        46   M            3         0
3        1            4          B        35   M            4         0
4        1            5          B        22   M            5         1


In [6]:
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

Number of rows: 260000
Number of columns: 7


In [7]:
filtered_data = df[['state_id', 'Sex', 'attribute1']]
filtered_data.head()

Unnamed: 0,state_id,Sex,attribute1
0,2,M,B
1,38,F,A
2,46,M,C
3,35,M,B
4,22,M,B


### Calculate the response rate grouped by 'state_id', 'Sex', and 'attribute1'

In [8]:
grouped_response_rates = df.groupby(['state_id', 'Sex', 'attribute1'])['response'].mean().reset_index()
grouped_response_rates.columns = ['state_id', 'Sex', 'attribute1', 'response_rate']

### Merge the response rates back into the original dataset

In [9]:
data_with_rates = pd.merge(df, grouped_response_rates, on=['state_id', 'Sex', 'attribute1'])

### Preview the data with response rates

In [10]:
data_with_rates.head()

Unnamed: 0,week_id,customer_id,attribute1,state_id,Sex,campaign_id,response,response_rate
0,1,1,B,2,M,1,1,0.137821
1,1,2,A,38,F,2,0,0.432692
2,1,3,C,46,M,3,0,0.587278
3,1,4,B,35,M,4,0,0.124126
4,1,5,B,22,M,5,1,0.123932


### Determine the threshold for top 25% response rates for groups

In [11]:
top_25_percent_threshold = grouped_response_rates['response_rate'].quantile(0.75)

### Select groups with response rates in the top 25%

In [12]:
top_groups = grouped_response_rates[grouped_response_rates['response_rate'] >= top_25_percent_threshold]

### Preview the top 25% groups

In [13]:
top_groups.head()

Unnamed: 0,state_id,Sex,attribute1,response_rate
2,1,F,C,0.716346
4,1,M,A,0.708502
10,2,F,C,0.730769
12,2,M,A,0.741758
18,3,F,C,0.68109


### Check the last campaign delivered to each group (state, gender, attribute1)
### Assume you're delivering campaigns in rotation, so find out the last campaign for each group at week 26

In [14]:
last_campaign_per_group = data_with_rates[data_with_rates['week_id'] == 26][['state_id', 'Sex', 'attribute1', 'campaign_id']]

### Preview the last campaign data


In [15]:
last_campaign_per_group.head()

Unnamed: 0,state_id,Sex,attribute1,campaign_id
250000,2,M,B,6
250001,38,F,A,7
250002,46,M,C,8
250003,35,M,B,9
250004,22,M,B,10


### For week 27, based on the rotation logic, assign the next campaign in sequence

In [16]:
def next_campaign(campaign_id):
    return 1 if campaign_id == 10 else campaign_id + 1

### Apply the function to determine the next campaign for each top group

In [17]:
top_groups_with_campaign = pd.merge(top_groups, last_campaign_per_group, on=['state_id', 'Sex', 'attribute1'])
top_groups_with_campaign['next_campaign'] = top_groups_with_campaign['campaign_id'].apply(next_campaign)

### Preview the next campaigns to be sent to each group

In [18]:
top_groups_with_campaign.head()

Unnamed: 0,state_id,Sex,attribute1,response_rate,campaign_id,next_campaign
0,1,F,C,0.716346,2,3
1,1,F,C,0.716346,4,5
2,1,F,C,0.716346,10,1
3,1,F,C,0.716346,8,9
4,1,F,C,0.716346,8,9


### Based on historical response rates, estimate the response rate for week 27

In [19]:
predicted_response_rate = top_groups_with_campaign['response_rate'].mean()

### Display the predicted response rate

In [20]:
print(f"Predicted response rate for week 27: {predicted_response_rate:.2%}")

Predicted response rate for week 27: 71.06%


### Response rate for campaign_Id for each one and then which campaign ID has the highest response rate, so which campaignID to send next.

In [21]:
segment_response_rates = top_groups_with_campaign.groupby(['state_id', 'Sex', 'attribute1', 'campaign_id'])['response_rate'].mean().reset_index()

Here we have group by state_id, user_class, gender, and campaign_id to calculate response rates

In [22]:
best_campaigns_by_segment = segment_response_rates.loc[segment_response_rates.groupby(['state_id','Sex', 'attribute1'])['response_rate'].idxmax()]

Here we have the best campaign for each segment based on the highest response rate

In [23]:
print(best_campaigns_by_segment[['state_id', 'Sex', 'attribute1','campaign_id', 'response_rate']])

     state_id Sex attribute1  campaign_id  response_rate
0           1   F          C            1       0.716346
10          1   M          A            1       0.708502
19          2   F          C            1       0.730769
29          2   M          A            1       0.741758
36          3   F          C            1       0.681090
..        ...  ..        ...          ...            ...
838        48   M          A            1       0.662088
847        49   F          C            1       0.725962
856        49   M          A            1       0.717949
865        50   F          C            1       0.724615
873        50   M          A            1       0.719780

[100 rows x 5 columns]


We can display the recommended campaign for each (state_id, user_class, gender) segment


In [29]:
# Frequency Encoding
best_campaigns_by_segment['state_frequency'] = best_campaigns_by_segment['state_id'].map(best_campaigns_by_segment['state_id'].value_counts())

# State-Level Response Rate
state_response = best_campaigns_by_segment.groupby('state_id')['response_rate'].mean().reset_index()
state_response.columns = ['state_id', 'state_avg_response']
best_campaigns_by_segment = best_campaigns_by_segment.merge(state_response, on='state_id', how='left')

In [30]:
# Binary Encoding
best_campaigns_by_segment['Sex_binary'] = best_campaigns_by_segment['Sex'].map({'Male': 1, 'Female': 0})

# Gender-Based Response Rate
gender_response = best_campaigns_by_segment.groupby('Sex')['response_rate'].mean().reset_index()
gender_response.columns = ['Sex', 'gender_avg_response']
best_campaigns_by_segment = best_campaigns_by_segment.merge(gender_response, on='Sex', how='left')

In [32]:
# Mean Response Rate by Attribute
attribute_response = best_campaigns_by_segment.groupby('attribute1')['response_rate'].mean().reset_index()
attribute_response.columns = ['attribute1', 'attribute_avg_response']
best_campaigns_by_segment = best_campaigns_by_segment.merge(attribute_response, on='attribute1', how='left')

In [33]:
# Count Encoding
best_campaigns_by_segment['campaign_frequency'] = best_campaigns_by_segment['campaign_id'].map(best_campaigns_by_segment['campaign_id'].value_counts())

# Campaign-Level Response Rate
campaign_response = best_campaigns_by_segment.groupby('campaign_id')['response_rate'].mean().reset_index()
campaign_response.columns = ['campaign_id', 'campaign_avg_response']
best_campaigns_by_segment = best_campaigns_by_segment.merge(campaign_response, on='campaign_id', how='left')

In [53]:
# Log Transform
best_campaigns_by_segment['log_response_rate'] = np.log1p(best_campaigns_by_segment['response_rate'])

# Deviation from State Mean
#best_campaigns_by_segment['state_response_deviation'] = best_campaigns_by_segment['response_rate'] - best_campaigns_by_segment['state_avg_response']


In [36]:
best_campaigns_by_segment['state_sex_interaction'] = best_campaigns_by_segment['state_id'].astype(str) + '_' + best_campaigns_by_segment['Sex']
best_campaigns_by_segment['attribute_campaign_interaction'] = best_campaigns_by_segment['attribute1'].astype(str) + '_' + best_campaigns_by_segment['campaign_id'].astype(str)


In [37]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer

In [38]:
def frequency_encode(df, col):
    return df[col].map(df[col].value_counts())

def group_mean_encode(df, group_col, target_col):
    mean_response = df.groupby(group_col)[target_col].mean().to_dict()
    return df[group_col].map(mean_response)

In [40]:
# Preprocessing for numeric columns
numeric_features = ['attribute1', 'response_rate']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),       # Handle missing values
    ('scaler', StandardScaler())                      # Normalize features
])                   # Normalize features

In [41]:
# Preprocessing for categorical columns with frequency encoding
def frequency_encode_transform(X):
    for col in ['state_id', 'campaign_id']:
        X[col + '_freq'] = frequency_encode(X, col)
    return X

In [42]:
frequency_transformer = FunctionTransformer(frequency_encode_transform, validate=False)

In [43]:
# Preprocessing for categorical columns with one-hot encoding
categorical_features = ['Sex']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

In [44]:
# ColumnTransformer to combine all preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('freq', frequency_transformer, ['state_id', 'campaign_id']),
        ('cat', categorical_transformer, categorical_features)
    ])

In [46]:
# Define the final pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),])

In [49]:
df = best_campaigns_by_segment.copy()

In [50]:
# Apply feature engineering (grouped mean encoding)
df['state_avg_response'] = group_mean_encode(df, 'state_id', 'response_rate')
df['campaign_avg_response'] = group_mean_encode(df, 'campaign_id', 'response_rate')

In [55]:
# Fit and transform the pipeline
#X = df.drop('response_rate', axis=1)  # Features
#y = df['response_rate']                # Target
#pipeline.fit(X, y)
#X_transformed = pipeline.transform(X)

In [77]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

In [78]:
# Define feature engineering functions
def frequency_encode(df, col):
    """Frequency encode a column."""
    return df[col].map(df[col].value_counts())

In [79]:
def group_mean_encode(df, group_col, target_col):
    """Mean encode a column based on target."""
    mean_response = df.groupby(group_col)[target_col].mean().to_dict()
    return df[group_col].map(mean_response)

In [80]:
# Apply mean encoding to the dataset
df['state_avg_response'] = group_mean_encode(df, 'state_id', 'response_rate')
df['campaign_avg_response'] = group_mean_encode(df, 'campaign_id', 'response_rate')


In [81]:
# Define feature columns and target
X = df.drop(columns=['response_rate'])  # Features
y = df['response_rate']                # Target


In [82]:
# Define train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [83]:
# Custom transformer for frequency encoding
def frequency_encode_transform(X):
    for col in ['state_id', 'campaign_id']:
        X[col + '_freq'] = frequency_encode(X, col)
    return X

frequency_transformer = FunctionTransformer(frequency_encode_transform, validate=False)


In [63]:
# Preprocessing pipelines
numeric_features = ['attribute1', 'state_avg_response', 'campaign_avg_response']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [64]:
categorical_features = ['Sex']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

In [65]:
# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('freq', frequency_transformer, ['state_id', 'campaign_id']),
        ('cat', categorical_transformer, categorical_features)
    ])

In [66]:
# Model integration (Random Forest as example)
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [67]:
# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

In [84]:
# Define numeric and categorical features
numeric_features = ['attribute1', 'state_avg_response', 'campaign_avg_response']  # Ensure these are numeric
categorical_features = ['state_id', 'campaign_id', 'Sex']  # Ensure these are categorical

In [85]:
# Preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with mean
    ('scaler', StandardScaler())                 # Scale numeric features
])

In [86]:
# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),  # Numeric preprocessing
        ('cat', categorical_transformer, categorical_features)  # Categorical preprocessing
    ])

In [87]:
# Check the data types of features
print(X.dtypes)

# Convert columns if necessary
X['attribute1'] = pd.to_numeric(X['attribute1'], errors='coerce')  # Coerce non-numeric to NaN


state_id                            int64
Sex                                object
attribute1                         object
campaign_id                         int64
state_frequency                     int64
state_avg_response_x              float64
state_avg_response_y              float64
Sex_binary                        float64
gender_avg_response               float64
attribute_avg_response            float64
campaign_frequency                  int64
campaign_avg_response             float64
log_response_rate                 float64
state_sex_interaction              object
attribute_campaign_interaction     object
state_avg_response                float64
dtype: object
