# Курс "Машинное обучение в бизнесе"

## Практическое задание урока 4. Uplift-моделирование

### Задания 1 - 7

1. скачать набор данных маркетинговых кампаний отсюда https://www.kaggle.com/davinwijaya/customer-retention
2. там поле conversion - это целевая переменная, а offer - коммуникация. Переименовать поля (conversion -> target, offer -> treatment) и привести поле treatment к бинарному виду (1 или 0, т.е было какое-то предложение или нет) - значение No Offer означает отсутствие коммуникации, а все остальные - наличие.
3. сделать разбиение набора данных не тренировочную и тестовую выборки
4. сделать feature engineering на ваше усмотрение (допускается свобода выбора методов)
5. провести uplift-моделирование 3 способами: одна модель с признаком коммуникации (S learner), модель с трансформацией таргета (трансформация классов п. 2. 1) и вариант с двумя независимыми моделями
6. в конце вывести единую таблицу сравнения метрик uplift@10%, uplift@20% этих 3 моделей
7. построить модель UpliftTreeClassifier и попытаться описать словами полученное дерево

### Решение

**Описание датасета**

* **recency** - months since last purchase

* **history** - $value of the historical purchases

* **used_discount** - indicates if the customer used a discount before

* **used_bogo** - indicates if the customer used a buy one get one (BOGO) before

* **zip_code** - class of the zip code as Suburban/Urban/Rural

* **is_referral** - indicates if the customer was acquired from referral channel

* **channel** - channels that the customer using, Phone/Web/Multichannel

* **offer** - the offers sent to the customers, Discount/But One Get One/No Offer

* **conversion** - customer conversion(buy or not)

#### Подключение библиотек и скриптов

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd; pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split

from sklift.metrics import uplift_at_k
from sklift.viz import plot_uplift_preds
from sklift.models import SoloModel

from catboost import CatBoostClassifier

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin

#### Загрузка данных

In [2]:
df = pd.read_csv('HW_data.csv')
df.head(3)

Unnamed: 0,recency,history,used_discount,used_bogo,zip_code,is_referral,channel,offer,conversion
0,10,142.44,1,0,Surburban,0,Phone,Buy One Get One,0
1,6,329.08,1,1,Rural,1,Web,No Offer,0
2,7,180.65,0,1,Surburban,1,Web,Buy One Get One,0


#### Обработка и анализ данных

[Задание 2. там поле conversion - это целевая переменная, а offer - коммуникация. Переименовать поля (conversion -> target, offer -> treatment) и привести поле treatment к бинарному виду (1 или 0, т.е было какое-то предложение или нет) - значение No Offer означает отсутствие коммуникации, а все остальные - наличие.]

In [3]:
# Переименуем поля (conversion -> target, offer -> treatment)
df = df.rename({"conversion":"target","offer":"treatment"}, axis='columns')

In [4]:
df['treatment'].value_counts()

Buy One Get One    21387
Discount           21307
No Offer           21306
Name: treatment, dtype: int64

In [5]:
# Приведем поле treatment к бинарному виду (1 или 0, т.е было какое-то предложение или нет)
# значение No Offer означает отсутствие коммуникации, а все остальные - наличие.
df.loc[df['treatment'] == "No Offer", 'treatment'] = 0
df.loc[df['treatment'] != 0, 'treatment'] = 1
df['treatment'].value_counts()

1    42694
0    21306
Name: treatment, dtype: int64

In [6]:
df.head(3)

Unnamed: 0,recency,history,used_discount,used_bogo,zip_code,is_referral,channel,treatment,target
0,10,142.44,1,0,Surburban,0,Phone,1,0
1,6,329.08,1,1,Rural,1,Web,0,0
2,7,180.65,0,1,Surburban,1,Web,1,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64000 entries, 0 to 63999
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   recency        64000 non-null  int64  
 1   history        64000 non-null  float64
 2   used_discount  64000 non-null  int64  
 3   used_bogo      64000 non-null  int64  
 4   zip_code       64000 non-null  object 
 5   is_referral    64000 non-null  int64  
 6   channel        64000 non-null  object 
 7   treatment      64000 non-null  object 
 8   target         64000 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 4.4+ MB


In [8]:
for value in ['recency','history','used_discount','used_bogo','zip_code','is_referral','channel','treatment','target']:
    print(df[value].value_counts(), '\n--------')

1     8952
10    7565
2     7537
9     6441
3     5904
4     5077
6     4605
5     4510
7     4078
11    3504
8     3495
12    2332
Name: recency, dtype: int64 
--------
29.99     7947
81.20        9
53.79        9
142.94       8
35.40        8
          ... 
701.66       1
246.45       1
798.83       1
125.19       1
104.00       1
Name: history, Length: 34833, dtype: int64 
--------
1    35266
0    28734
Name: used_discount, dtype: int64 
--------
1    35182
0    28818
Name: used_bogo, dtype: int64 
--------
Surburban    28776
Urban        25661
Rural         9563
Name: zip_code, dtype: int64 
--------
1    32144
0    31856
Name: is_referral, dtype: int64 
--------
Web             28217
Phone           28021
Multichannel     7762
Name: channel, dtype: int64 
--------
1    42694
0    21306
Name: treatment, dtype: int64 
--------
0    54606
1     9394
Name: target, dtype: int64 
--------


In [9]:
df['recency'].value_counts()

1     8952
10    7565
2     7537
9     6441
3     5904
4     5077
6     4605
5     4510
7     4078
11    3504
8     3495
12    2332
Name: recency, dtype: int64

In [10]:
df_tree = df.copy()

#### Feature Engineering

[Задание 4. сделать feature engineering на ваше усмотрение (допускается свобода выбора методов)]

К полям:
- recency, zip_code, channel применим OHE-кодирование
- history - standardScaler
- used_discount, used_bogo, is_referral, treatment, target - оставим пока как есть

In [11]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
class OHEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        self.columns = [col for col in pd.get_dummies(X, prefix=self.key).columns]
        return self

    def transform(self, X):
        X = pd.get_dummies(X, prefix=self.key)
        test_columns = [col for col in X.columns]
        for col_ in test_columns:
            if col_ not in self.columns:
                X[col_] = 0
        return X[self.columns]

In [12]:
from sklearn.preprocessing import StandardScaler


cat_cols = ['recency', 'zip_code', 'channel']
continuos_cols = ['history']
base_cols = ['used_discount', 'used_bogo', 'is_referral', 'treatment']

continuos_transformers = []
cat_transformers = []
base_transformers = []

for cont_col in continuos_cols:
    transfomer =  Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                ('standard', StandardScaler())
            ])
    continuos_transformers.append((cont_col, transfomer))
    
for cat_col in cat_cols:
    cat_transformer = Pipeline([
                ('selector', ColumnSelector(key=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    cat_transformers.append((cat_col, cat_transformer))
    
for base_col in base_cols:
    base_transformer = Pipeline([
                ('selector', NumberSelector(key=base_col))
            ])
    base_transformers.append((base_col, base_transformer))

Теперь объединим все наши трансформеры с помощью FeatureUnion

In [13]:
from sklearn.pipeline import FeatureUnion

feats = FeatureUnion(continuos_transformers+cat_transformers+base_transformers)
feature_processing = Pipeline([('feats', feats)])

#### Разбиение набора данных на тренировочную и тестовую выборки

[Задание 3. сделать разбиение набора данных не тренировочную и тестовую выборки]

In [14]:
indices_train = df.index
indices_learn, indices_valid = train_test_split(df.index, test_size=0.3, random_state=123)

Заведем переменные

In [15]:
X_train = df.loc[indices_learn, :]
y_train = df.loc[indices_learn, 'target']
treat_train = df.loc[indices_learn, 'treatment']

X_val = df.loc[indices_valid, :]
y_val = df.loc[indices_valid, 'target']
treat_val =  df.loc[indices_valid, 'treatment']

X_train_full = df.loc[indices_train, :]
y_train_full = df.loc[:, 'target']
treat_train_full = df.loc[:, 'treatment']

models_results = {
    'approach': [],
    'uplift@20%': [],
    'uplift@10%': []
}

#### Применение feature engineering к данным

In [16]:
X_train = feature_processing.fit_transform(X_train)
X_val = feature_processing.fit_transform(X_val)
X_train_full = feature_processing.fit_transform(X_train_full)

#### Uplift-моделирование 3 способами

[Задание 5. провести uplift-моделирование 3 способами: одна модель с признаком коммуникации (S learner), модель с трансформацией таргета (трансформация классов п. 2. 1) и вариант с двумя независимыми моделями]

##### 1 способ. Одна модель с признаком коммуникации

In [17]:
from sklift.metrics import uplift_at_k
from sklift.viz import plot_uplift_preds
from sklift.models import SoloModel

# Воспользуемся CatBoost
sm = SoloModel(CatBoostClassifier(iterations=20, thread_count=2, random_state=42, silent=True))
sm = sm.fit(X_train, y_train, treat_train)

uplift_sm = sm.predict(X_val)

sm_score_at_20 = uplift_at_k(y_true=y_val, uplift=uplift_sm, treatment=treat_val, strategy='by_group', k=0.2)
sm_score_at_10 = uplift_at_k(y_true=y_val, uplift=uplift_sm, treatment=treat_val, strategy='by_group', k=0.1)


models_results['approach'].append('SoloModel')
models_results['uplift@20%'].append(sm_score_at_20)
models_results['uplift@10%'].append(sm_score_at_10)

# Получим условные вероятности выполнения целевого действия при взаимодействии для каждого объекта
sm_trmnt_preds = sm.trmnt_preds_
# И условные вероятности выполнения целевого действия без взаимодействия для каждого объекта
sm_ctrl_preds = sm.ctrl_preds_

models_results

{'approach': ['SoloModel'],
 'uplift@20%': [0.07547381789761626],
 'uplift@10%': [0.06345411454261143]}

##### 2 способ. Трансформация классов

In [18]:
from sklift.models import ClassTransformation

ct = ClassTransformation(CatBoostClassifier(iterations=20, thread_count=2, random_state=42, silent=True))
ct = ct.fit(X_train, y_train, treat_train)

uplift_ct = ct.predict(X_val)

sm_score_at_20 = uplift_at_k(y_true=y_val, uplift=uplift_ct, treatment=treat_val, strategy='by_group', k=0.2)
sm_score_at_10 = uplift_at_k(y_true=y_val, uplift=uplift_ct, treatment=treat_val, strategy='by_group', k=0.1)

models_results['approach'].append('ClassTransformation')
models_results['uplift@20%'].append(sm_score_at_20)
models_results['uplift@10%'].append(sm_score_at_10)
models_results

  after removing the cwd from sys.path.


{'approach': ['SoloModel', 'ClassTransformation'],
 'uplift@20%': [0.07547381789761626, 0.20791111029699103],
 'uplift@10%': [0.06345411454261143, 0.24474320758405005]}

##### 3 способ. Две независимые модели

In [19]:
from sklift.models import TwoModels

tm = TwoModels(
    estimator_trmnt=CatBoostClassifier(iterations=20, thread_count=2, random_state=42, silent=True), 
    estimator_ctrl=CatBoostClassifier(iterations=20, thread_count=2, random_state=42, silent=True), 
    method='vanilla'
)
tm = tm.fit(
    X_train, y_train, treat_train
)

uplift_tm = tm.predict(X_val)

tm_score_at_20 = uplift_at_k(y_true=y_val, uplift=uplift_tm, treatment=treat_val, strategy='by_group', k=0.2)
tm_score_at_10 = uplift_at_k(y_true=y_val, uplift=uplift_tm, treatment=treat_val, strategy='by_group', k=0.1)

models_results['approach'].append('TwoModels')
models_results['uplift@20%'].append(tm_score_at_20)
models_results['uplift@10%'].append(tm_score_at_10)
models_results

{'approach': ['SoloModel', 'ClassTransformation', 'TwoModels'],
 'uplift@20%': [0.07547381789761626, 0.20791111029699103, 0.0453854655138726],
 'uplift@10%': [0.06345411454261143, 0.24474320758405005, 0.02905834636434715]}

#### Таблица сравнения метрик трех моделей

[Задание 6. в конце вывести единую таблицу сравнения метрик uplift@10%, uplift@20% этих 3 моделей]

In [20]:
models_results = pd.DataFrame(models_results)
pd.pivot_table(models_results, columns = 'approach').reset_index()

approach,index,ClassTransformation,SoloModel,TwoModels
0,uplift@10%,0.244743,0.063454,0.029058
1,uplift@20%,0.207911,0.075474,0.045385


#### Построение модели UpliftTreeClassifier

[Задание 7. построить модель UpliftTreeClassifier и попытаться описать словами полученное дерево]

In [21]:
df_tree.head(3)

Unnamed: 0,recency,history,used_discount,used_bogo,zip_code,is_referral,channel,treatment,target
0,10,142.44,1,0,Surburban,0,Phone,1,0
1,6,329.08,1,1,Rural,1,Web,0,0
2,7,180.65,0,1,Surburban,1,Web,1,0


In [22]:
# Переводим все категориальные признаки в dummies
df_tree = pd.concat([df_tree.drop(['recency', 'zip_code', 'channel'], axis=1),
                        pd.get_dummies(df['recency'], prefix='recency'),
                        pd.get_dummies(df['zip_code'], prefix='zip_code'),
                        pd.get_dummies(df['channel'], prefix='channel')
                        ], 1)

df_tree.head(3)

Unnamed: 0,history,used_discount,used_bogo,is_referral,treatment,target,recency_1,recency_2,recency_3,recency_4,recency_5,recency_6,recency_7,recency_8,recency_9,recency_10,recency_11,recency_12,zip_code_Rural,zip_code_Surburban,zip_code_Urban,channel_Multichannel,channel_Phone,channel_Web
0,142.44,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0
1,329.08,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1
2,180.65,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1


In [23]:
indices_train_tree = df_tree.index
indices_learn_tree, indices_valid_tree = train_test_split(df_tree.index, test_size=0.3, random_state=123)

In [24]:
X_train_tree = df_tree.loc[indices_learn_tree, :]
y_train_tree = df_tree.loc[indices_learn_tree, 'target']
treat_train_tree = df_tree.loc[indices_learn_tree, 'treatment']

X_val_tree = df_tree.loc[indices_valid_tree, :]
y_val_tree = df_tree.loc[indices_valid_tree, 'target']
treat_val_tree =  df_tree.loc[indices_valid_tree, 'treatment']

In [25]:
features = [col for col in X_train_tree]
X_train_tree.head()

Unnamed: 0,history,used_discount,used_bogo,is_referral,treatment,target,recency_1,recency_2,recency_3,recency_4,recency_5,recency_6,recency_7,recency_8,recency_9,recency_10,recency_11,recency_12,zip_code_Rural,zip_code_Surburban,zip_code_Urban,channel_Multichannel,channel_Phone,channel_Web
53181,121.56,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
42635,617.62,0,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0
6296,185.62,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1
41722,359.03,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
32660,139.68,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1


In [26]:
%%time
from IPython.display import Image
from causalml.inference.tree import UpliftTreeClassifier, UpliftRandomForestClassifier
from causalml.inference.tree import uplift_tree_string, uplift_tree_plot

uplift_model = UpliftTreeClassifier(max_depth=8, min_samples_leaf=200, min_samples_treatment=50,
                                    n_reg=100, evaluationFunction='KL', control_name='control')

uplift_model.fit(X_train_tree.values,
                 treatment=treat_train_tree.map({1: 'treatment1', 0: 'control'}).values,
                 y=y_train_tree)

graph = uplift_tree_plot(uplift_model.fitted_uplift_tree, features)
Image(graph.create_png())

ModuleNotFoundError: No module named 'causalml'

Не установился causalml, поэтому код задания 7 не вполнился.