# Итоговый проект

<b> Цель работы </b> - научиться предсказывать совершение целевого действия (ориентировочное значение ROC-AUC ~ 0.65) — факт совершения пользователем целевого действия.

<b> Целевое действие </b> — события типа «Оставить заявку» и «Заказать звонок» (ga_hits.event_action in ['sub_car_claim_click', 'sub_car_claim_submit_click', 'sub_open_dialog_click', 'sub_custom_question_submit_click', 'sub_call_number_click', 'sub_callback_submit_click', 'sub_submit_success', 'sub_car_request_submit_click']).

### Описание данных
<b> GA Sessions (ga_sessions.pkl) </b>

Одна строка = один визит на сайт.
Описание атрибутов:
* session_id — ID визита;
* client_id — ID посетителя;
* visit_date — дата визита;
* visit_time — время визита;
* visit_number — порядковый номер визита клиента;
* utm_source — канал привлечения;
* utm_medium — тип привлечения;
* utm_campaign — рекламная кампания;
* utm_keyword — ключевое слово;
* device_category — тип устройства;
* device_os — ОС устройства;
* device_brand — марка устройства;
* device_model — модель устройства;
* device_screen_resolution — разрешение экрана;
* device_brand — марка устройства;
* device_model — модель устройства;
* device_screen_resolution — разрешение экрана;
* device_browser — браузер;
* geo_country — страна;
* geo_city — город.

<b> GA Hits (ga_hits.pkl) </b>

Одна строка = одно событие в рамках одного визита на сайт.
Описание атрибутов:
* session_id — ID визита;
* hit_date — дата события;
* hit_time — время события;
* hit_number — порядковый номер события в рамках сессии;
* hit_type — тип события;
* hit_referer — источник события;
* hit_page_path — страница события;
* event_category — тип действия;
* event_action — действие;
* event_label — тег действия;
* event_value — значение результата действия.

### Содержание
* [Импорт библиотек](#import)
* [Загрузка датасетов](#load_data)
* [Подготовка данных](#transform)
* [ЭКСПЕРИМЕНТЫ с моделями](#model)
* [Финальный результат](#rezult)
* [Сохранение модели](#savemodel)

### Импорт библиотек<a class="anchor" id="import"></a>

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

from dask import dataframe as dd

import pickle

from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier

import tensorflow as tf

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

from dask_ml.preprocessing import OneHotEncoder, MinMaxScaler
from dask_ml.model_selection import train_test_split

In [2]:
import warnings
warnings.filterwarnings('ignore')

В значении true, для моделинга будет использован dask вместо pandas и обучаться батчами, а не целым датафреймом
для экономии памяти

In [3]:
NOT_ENOUGH_RAM = False

In [4]:
# Пути к файлам
SESSIONS_DATASET_PATH = 'data/ga_sessions.pkl'
HITS_DATASET_PATH = 'data/ga_hits.pkl'

In [60]:
# Целевая переменная
TARGET = 'target'
# Целевые действия
TARGET_ACTIONS = {'sub_car_claim_click', 'sub_car_claim_submit_click',
                   'sub_open_dialog_click', 'sub_custom_question_submit_click',
                   'sub_call_number_click', 'sub_callback_submit_click', 'sub_submit_success',
                   'sub_car_request_submit_click'}

In [61]:
ADVERTISING_TAGS = {'QxAxdyPLuQMEcrdZWdWb', 'MvfHsxITijuriZxsqZqt', 'ISrKoXQCxqqYvAZICvjs',
                    'IZEXUFLARCUMynmHNBGo', 'PlbkrSYoHuZBWfYjYnfw',
                    'gVRrcxiDQubJiljoTbGm'}

MOSCOW_REGION = {'Aprelevka', 'Balashikha', 'Beloozyorskiy', 'Chekhov', 'Chernogolovka', 'Dedovsk', 'Dmitrov',
                 'Dolgoprudny', 'Domodedovo', 'Dubna', 'Dzerzhinsky','Elektrogorsk', 'Elektrostal', 'Elektrougli',
                 'Fryazino', 'Golitsyno', 'Istra', 'Ivanteyevka', 'Izhevsk', 'Kashira', 'Khimki', 'Khotkovo', 'Klin',
                 'Kolomna', 'Korolyov', 'Kotelniki', 'Krasnoarmeysk', 'Krasnogorsk', 'Krasnoznamensk', 'Kubinka',
                 'Kurovskoye', 'Likino-Dulyovo', 'Lobnya', 'Losino-Petrovsky', 'Lukhovitsy', 'Lytkarino', 'Lyubertsy',
                 'Moscow', 'Mozhaysk', 'Mytishchi', 'Naro-Fominsk', 'Noginsk', 'Odintsovo', 'Orekhovo-Zuyevo',
                 'Pavlovsky Posad', 'Podolsk', 'Protvino', 'Pushchino', 'Pushkino', 'Ramenskoye', 'Reutov', 'Ruza',
                 'Sergiyev Posad', 'Serpukhov', 'Solnechnogorsk', 'Staraya Kupavna', 'Stupino', 'Shchyolkovo',
                 'Shatura', 'Vidnoye', 'Volokolamsk', 'Voskresensk', 'Yakhroma', 'Zvenigorod'}

BIG_CITIES = {'Moscow', 'Saint Petersburg', 'Novosibirsk', 'Yekaterinburg',
              'Kazan', 'Nizhny Novgorod', 'Chelyabinsk', 'Samara',
              'Ufa', 'Rostov-on-Don', 'Omsk', 'Volgograd'}

### функции

In [62]:
# Корреляция целевой переменной с атрибутами
def target_corr_category_heat_map(df, category):
    target_corr = pd.get_dummies(df[category]).\
        corrwith(df[TARGET]).\
        sort_values(key=lambda x: abs(x), ascending=False)
    category_corr_sum = target_corr.apply(lambda x: abs(x)).sum()
    print(f'Общая корреляция {category} с {TARGET} - {category_corr_sum}')
    fig, ax = plt.subplots(figsize=(10,8))
    sns.heatmap(pd.DataFrame(target_corr.head(15)), annot=True, ax=ax)

def target_corr_heat_map(df):
    fig, ax = plt.subplots(figsize=(10,8))
    target_corr = df.corr()[[TARGET]].sort_values(
      by=TARGET, ascending=False)
    sns.heatmap(target_corr, annot=True, ax=ax)

def categorty_density(df, category):
    types = df.dropna(subset=[TARGET])
    types = types[category].value_counts()
    one_percent_len = df.shape[0]/100
    types = list(types[types.values > one_percent_len].index)

    fig, ax = plt.subplots(figsize=(10,8))
    # Plot each building
    for b_type in types:
        # Select the building type
        subset = df[df[category] == b_type]

        # Density plot of Energy Star Scores
        sns.kdeplot(subset[TARGET].dropna(),
                   label = b_type, shade = False, alpha = 0.8);

    # label the plot
    plt.xlabel('Target', size = 20); plt.ylabel('Density', size = 20);
    plt.title(f'Density Plot of target by {category}', size = 28);

def target_percent_by_category(df, category):
    pivot = df.pivot_table(index=[TARGET],
                           columns=[category],
                           aggfunc='size',
                           fill_value=0)
    target_percent = pivot.iloc[1] / (pivot.iloc[0] + pivot.iloc[1])
    fig = px.histogram(x=target_percent.index,
                       y=target_percent,
                       labels={
                         "x": category,
                         "y": f"percent of True {TARGET}"
                       })
    fig.show()

def calculate_outliers(data):
    q25 = data.quantile(0.25)
    q75 = data.quantile(0.75)
    iqr = q75 - q25
    boundaries = (q25 - 1.5 * iqr, q75 + 1.5 * iqr)

    return boundaries

### Загрузка датасетов<a class="anchor" id="load_data"></a>

In [8]:
df = pd.read_pickle(SESSIONS_DATASET_PATH)
print(df.shape)
df.head()

(1860042, 18)


Unnamed: 0,session_id,client_id,visit_date,visit_time,visit_number,utm_source,utm_medium,utm_campaign,utm_adcontent,utm_keyword,device_category,device_os,device_brand,device_model,device_screen_resolution,device_browser,geo_country,geo_city
0,9055434745589932991.1637753792.1637753792,2108382700.1637757,2021-11-24,14:36:32,1,ZpYIoDJMcFzVoPFsHGJL,banner,LEoPHuyFvzoNfnzGgfcd,vCIpmpaGBnIQhyYNkXqp,puhZPIYqKXeFPaUviSjo,mobile,Android,Huawei,,360x720,Chrome,Russia,Zlatoust
1,905544597018549464.1636867290.1636867290,210838531.16368672,2021-11-14,08:21:30,1,MvfHsxITijuriZxsqZqt,cpm,FTjNLDyTrXaWYgZymFkV,xhoenQgDQsgfEPYNPwKO,IGUCNvHlhfHpROGclCit,mobile,Android,Samsung,,385x854,Samsung Internet,Russia,Moscow
2,9055446045651783499.1640648526.1640648526,2108385331.164065,2021-12-28,02:42:06,1,ZpYIoDJMcFzVoPFsHGJL,banner,LEoPHuyFvzoNfnzGgfcd,vCIpmpaGBnIQhyYNkXqp,puhZPIYqKXeFPaUviSjo,mobile,Android,Huawei,,360x720,Chrome,Russia,Krasnoyarsk
3,9055447046360770272.1622255328.1622255328,2108385564.1622252,2021-05-29,05:00:00,1,kjsLglQLzykiRbcDiGcD,cpc,,NOBKLgtuvqYWkXQHeYWM,,mobile,,Xiaomi,,393x786,Chrome,Russia,Moscow
4,9055447046360770272.1622255345.1622255345,2108385564.1622252,2021-05-29,05:00:00,2,kjsLglQLzykiRbcDiGcD,cpc,,,,mobile,,Xiaomi,,393x786,Chrome,Russia,Moscow


In [9]:
df_hits = pd.read_pickle(HITS_DATASET_PATH)
print(df_hits.shape)
df_hits.head()

(15726470, 11)


Unnamed: 0,session_id,hit_date,hit_time,hit_number,hit_type,hit_referer,hit_page_path,event_category,event_action,event_label,event_value
0,5639623078712724064.1640254056.1640254056,2021-12-23,597864.0,30,event,,sberauto.com/cars?utm_source_initial=google&ut...,quiz,quiz_show,,
1,7750352294969115059.1640271109.1640271109,2021-12-23,597331.0,41,event,,sberauto.com/cars/fiat?city=1&city=18&rental_c...,quiz,quiz_show,,
2,885342191847998240.1640235807.1640235807,2021-12-23,796252.0,49,event,,sberauto.com/cars/all/volkswagen/polo/e994838f...,quiz,quiz_show,,
3,142526202120934167.1640211014.1640211014,2021-12-23,934292.0,46,event,,sberauto.com/cars?utm_source_initial=yandex&ut...,quiz,quiz_show,,
4,3450086108837475701.1640265078.1640265078,2021-12-23,768741.0,79,event,,sberauto.com/cars/all/mercedes-benz/cla-klasse...,quiz,quiz_show,,


Оставим только целевые нажатия

In [10]:
df_hits = df_hits[df_hits['event_action'].isin(TARGET_ACTIONS)]
print(df_hits.shape)
df_hits.head()

(104908, 11)


Unnamed: 0,session_id,hit_date,hit_time,hit_number,hit_type,hit_referer,hit_page_path,event_category,event_action,event_label,event_value
4016,2744563715298057088.1640258436.1640258436,2021-12-23,843092.0,81,event,,sberauto.com/cars/all/kia/rio/fee33fe6?utm_sou...,sub_submit,sub_submit_success,nsPPIRqjxBefONGPpnsF,
4045,3087297479839089634.1640268774.1640268774,2021-12-23,194144.0,22,event,,sberauto.com/cars/all/skoda/rapid/bf24b977?utm...,sub_submit,sub_submit_success,nsPPIRqjxBefONGPpnsF,
4046,3156966333326004302.1640206419.1640206800,2021-12-23,327223.0,63,event,,sberauto.com/cars/all/skoda/rapid/bf24b977?utm...,sub_submit,sub_submit_success,nsPPIRqjxBefONGPpnsF,
4047,3750243879753098158.1640272208.1640272208,2021-12-23,156992.0,20,event,,sberauto.com/cars/all/nissan/x-trail/0744675f?...,sub_submit,sub_submit_success,nsPPIRqjxBefONGPpnsF,
4048,7518333712042258254.1640258901.1640258901,2021-12-23,170616.0,16,event,,sberauto.com/cars/all/mercedes-benz/gla-klasse...,sub_submit,sub_submit_success,KuMiABMMbspIDDhiCNVS,


In [11]:
df['target'] = df['session_id'].isin(df_hits['session_id'].unique()).apply(lambda x: 1 if x else 0)
df.head()
del df_hits

Сохраняем датафрейм с целевой переменной

In [12]:
df.to_pickle('data/temp/data_with_target.pickle')

In [13]:
df = pd.read_pickle('data/temp/data_with_target.pickle')

In [14]:
df.target.value_counts()

0    1809728
1      50314
Name: target, dtype: int64

# Data preparation

### Заполнение пустых строк

Изучим процент незаполненных данных от всей выборки

In [15]:
df.isna().mean() * 100

session_id                   0.000000
client_id                    0.000000
visit_date                   0.000000
visit_time                   0.000000
visit_number                 0.000000
utm_source                   0.005215
utm_medium                   0.000000
utm_campaign                11.806346
utm_adcontent               18.043410
utm_keyword                 58.174009
device_category              0.000000
device_os                   57.533002
device_brand                 6.380394
device_model                99.121633
device_screen_resolution     0.000000
device_browser               0.000000
geo_country                  0.000000
geo_city                     0.000000
target                       0.000000
dtype: float64

In [16]:
# target_corr_category_heat_map(df, 'utm_keyword')

device_model и utm_keyword можем смело удалять

In [17]:
df = df.drop(['device_model', 'utm_keyword'], axis=1)

ОС и бренд могут сильно повлиять на предсказание, заполнил пустые значения флагом other

In [31]:
df.device_os.value_counts(dropna=False)

other            1070138
Android           464054
iOS               207104
Windows            88307
Macintosh          24824
Linux               5120
(not set)            364
Chrome OS             83
BlackBerry            27
Tizen                  7
Samsung                4
Windows Phone          4
Firefox OS             3
Nokia                  3
Name: device_os, dtype: int64

In [32]:
df['device_os'] = df['device_os'].fillna('other')

In [33]:
df['device_brand'].value_counts(dropna=False)

Apple       551088
Samsung     332194
Xiaomi      288367
            248500
Huawei      185853
             ...  
Vodafone         1
Wexler           1
KingSing         1
Star             1
Opera            1
Name: device_brand, Length: 207, dtype: int64

In [34]:
df['device_brand'] = df['device_brand'].fillna('other')

3 атрибута ниже заполним модой и скопируем значения для заполнения в пайплайне

In [35]:
df['utm_campaign'].mode()

0    LTuZkdKfxRGVceoWkVyg
Name: utm_campaign, dtype: object

In [36]:
df['utm_source'].mode()

0    ZpYIoDJMcFzVoPFsHGJL
Name: utm_source, dtype: object

In [37]:
df['utm_adcontent'].mode()

0    JNHcPlZPxEMWDnRiyoBf
Name: utm_adcontent, dtype: object

In [38]:
df['utm_campaign'] = df['utm_campaign'].fillna(df['utm_campaign'].mode())
df['utm_source'] = df['utm_source'].fillna(df['utm_source'].mode())
df['utm_adcontent'] = df['utm_adcontent'].fillna(df['utm_adcontent'].mode())

### Приведение типов данных

In [39]:
df.dtypes

session_id                          object
client_id                           object
visit_date                  datetime64[ns]
visit_time                          object
visit_number                         int64
utm_source                          object
utm_medium                          object
utm_campaign                        object
utm_adcontent                       object
device_category                     object
device_os                           object
device_brand                        object
device_screen_resolution            object
device_browser                      object
geo_country                         object
geo_city                            object
target                               int64
dtype: object

In [40]:
df.head()

Unnamed: 0,session_id,client_id,visit_date,visit_time,visit_number,utm_source,utm_medium,utm_campaign,utm_adcontent,device_category,device_os,device_brand,device_screen_resolution,device_browser,geo_country,geo_city,target
0,9055434745589932991.1637753792.1637753792,2108382700.1637757,2021-11-24,14:36:32,1,ZpYIoDJMcFzVoPFsHGJL,banner,LEoPHuyFvzoNfnzGgfcd,vCIpmpaGBnIQhyYNkXqp,mobile,Android,Huawei,360x720,Chrome,Russia,Zlatoust,0
1,905544597018549464.1636867290.1636867290,210838531.16368672,2021-11-14,08:21:30,1,MvfHsxITijuriZxsqZqt,cpm,FTjNLDyTrXaWYgZymFkV,xhoenQgDQsgfEPYNPwKO,mobile,Android,Samsung,385x854,Samsung Internet,Russia,Moscow,0
2,9055446045651783499.1640648526.1640648526,2108385331.164065,2021-12-28,02:42:06,1,ZpYIoDJMcFzVoPFsHGJL,banner,LEoPHuyFvzoNfnzGgfcd,vCIpmpaGBnIQhyYNkXqp,mobile,Android,Huawei,360x720,Chrome,Russia,Krasnoyarsk,0
3,9055447046360770272.1622255328.1622255328,2108385564.1622252,2021-05-29,05:00:00,1,kjsLglQLzykiRbcDiGcD,cpc,,NOBKLgtuvqYWkXQHeYWM,mobile,other,Xiaomi,393x786,Chrome,Russia,Moscow,0
4,9055447046360770272.1622255345.1622255345,2108385564.1622252,2021-05-29,05:00:00,2,kjsLglQLzykiRbcDiGcD,cpc,,,mobile,other,Xiaomi,393x786,Chrome,Russia,Moscow,0


Преобразуем атрибут даты посещения в формат даты

In [41]:
df['visit_date'] = pd.to_datetime(df['visit_date'])

### Обработка выбросов

Здесь видим сильные выбросы в количестве посещений сайта

In [42]:
df.visit_number.describe()

count    1.860042e+06
mean     2.712804e+00
std      1.182907e+01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      5.640000e+02
Name: visit_number, dtype: float64

Скопируем значение верхней границы для пайплайна

In [43]:
boundaries = calculate_outliers(df['visit_number'])
b_max = boundaries[1]
round(b_max)

4

И прировняем выбросы к макс границе

In [44]:
df.loc[df['visit_number'] > b_max, 'visit_number'] = round(b_max)

### Объединение редких значений

объеденим редкие города под флаг other, дабы не раздувать датафрейм

In [68]:
df.geo_city.describe()

count     1860042
unique       2548
top        Moscow
freq       805329
Name: geo_city, dtype: object

In [53]:
top50_cities = df.geo_city.value_counts().head(50).index
top50_cities

Index(['Moscow', 'Saint Petersburg', '(not set)', 'Yekaterinburg', 'Krasnodar',
       'Kazan', 'Samara', 'Nizhny Novgorod', 'Ufa', 'Novosibirsk',
       'Krasnoyarsk', 'Chelyabinsk', 'Tula', 'Voronezh', 'Rostov-on-Don',
       'Irkutsk', 'Grozny', 'Balashikha', 'Vladivostok', 'Yaroslavl', 'Sochi',
       'Tyumen', 'Khimki', 'Saratov', 'Perm', 'Vidnoye', 'Odintsovo',
       'Mytishchi', 'Izhevsk', 'Zheleznodorozhny', 'Lipetsk', 'Stavropol',
       'Omsk', 'Korolyov', 'Domodedovo', 'Dublin', 'Khabarovsk', 'Volgograd',
       'Kaliningrad', 'Pyatigorsk', 'Tver', 'Podolsk', 'Kaluga', 'Ryazan',
       'Krasnogorsk', 'Surgut', 'Prineville', 'Barnaul', 'Dolgoprudny',
       'Makhachkala'],
      dtype='object')

In [63]:
target_percent_by_category(df[df.geo_city.isin(top50_cities)], 'geo_city')

In [67]:
target_percent_by_category(df[df.geo_city.isin(BIG_CITIES | MOSCOW_REGION)], 'geo_city')

In [70]:
unpopular_city = [k for k, v in geo_city_vc.items() if v <= geo_city_vc.mean()]

Сохраним список в файл, для использования в пайплайне

In [71]:
pd.DataFrame(unpopular_city).to_csv('data/pipeline/unpopular_city.csv', index=False)

In [72]:
del geo_city_vc

In [68]:
df.loc[df['geo_city'].isin(unpopular_city), 'geo_city'] = 'other'

NameError: name 'unpopular_city' is not defined

In [None]:
del unpopular_city

# Feature engineering

In [69]:
df.describe(include='all', datetime_is_numeric=True)

Unnamed: 0,session_id,client_id,visit_date,visit_time,visit_number,utm_source,utm_medium,utm_campaign,utm_adcontent,device_category,device_os,device_brand,device_screen_resolution,device_browser,geo_country,geo_city,target
count,1860042,1860042.0,1860042,1860042,1860042.0,1859945,1860042,1640439,1524427,1860042,1860042,1860042,1860042,1860042,1860042,1860042,1860042.0
unique,1860042,1391719.0,,85318,,293,56,412,286,3,14,207,5039,57,166,2548,
top,9055434745589932991.1637753792.1637753792,1750498477.162945,,12:00:00,,ZpYIoDJMcFzVoPFsHGJL,banner,LTuZkdKfxRGVceoWkVyg,JNHcPlZPxEMWDnRiyoBf,mobile,other,Apple,414x896,Chrome,Russia,Moscow,
freq,1,462.0,,61067,,578290,552272,463481,1006599,1474871,1070138,551088,169090,1013436,1800565,805329,
mean,,,2021-09-26 11:45:55.389394176,,1.483161,,,,,,,,,,,,0.02704993
min,,,2021-05-19 00:00:00,,1.0,,,,,,,,,,,,0.0
25%,,,2021-08-02 00:00:00,,1.0,,,,,,,,,,,,0.0
50%,,,2021-10-06 00:00:00,,1.0,,,,,,,,,,,,0.0
75%,,,2021-11-23 00:00:00,,2.0,,,,,,,,,,,,0.0
max,,,2021-12-31 00:00:00,,4.0,,,,,,,,,,,,1.0


### Органический трафик

In [70]:
df.utm_medium.value_counts(dropna=False)

banner               552272
cpc                  434794
(none)               300575
cpm                  242083
referral             152050
organic               63034
email                 29240
push                  28035
stories               10582
cpv                    8022
blogger_channel        8015
smartbanner            6794
blogger_stories        4312
cpa                    4279
tg                     4011
app                    2836
post                   2326
smm                    1985
outlook                1332
clicks                  934
blogger_header          771
(not set)               480
info_text               343
sms                     239
landing                 134
partner                  97
fb_smm                   66
vk_smm                   65
link                     57
cbaafe                   47
CPM                      40
yandex_cpc               36
ok_smm                   28
static                   22
google_cpc               20
article             

Выставим флаг органического трафика согласно списку из методички

In [71]:
df['is_organic'] = df.utm_medium.isin(['organic', 'referral', '(none)']).apply(lambda x: 1 if x else 0)

In [72]:
df['is_organic'].value_counts(dropna=False)

0    1344383
1     515659
Name: is_organic, dtype: int64

### Реклама в социальных сетях

Выставим флаг рекламы в соц сетях согласно списку из методички

In [73]:
df['is_advertising'] = df['utm_source'].isin(ADVERTISING_TAGS).apply(lambda x: 1 if x else 0)

In [74]:
df['is_advertising'].value_counts()

0    1585815
1     274227
Name: is_advertising, dtype: int64

### День недели

Лояльность клиентов должна разниться в зависимости от дня недели и времени суток

In [75]:
df['day'] = df['visit_date'].dt.day.astype('str').apply(lambda x: f'0{x}' if len(x) == 1 else x)

In [76]:
target_percent_by_category(df, 'day')

In [77]:
df['dayofweek'] = df['visit_date'].dt.dayofweek.astype('str')

Гипотеза подтвердилась, пик лояльности в пн

In [78]:
target_percent_by_category(df, 'dayofweek')

### Час посещения 

In [79]:
df['visit_time_hour'] = df['visit_time'].apply(lambda x: str(x.hour))

In [80]:
df['visit_time_hour'].value_counts()

14    111487
16    105241
15    103854
18    102701
13    102477
12    102071
11    101266
17    101086
10    100849
19     99184
20     98731
21     97612
22     93473
9      91860
23     84643
8      72015
0      66202
7      54894
1      40080
6      36440
2      27328
5      25292
3      21036
4      20220
Name: visit_time_hour, dtype: int64

In [81]:
pivot = df.pivot_table(index=['target'],
                       columns=['visit_time_hour'],
                       aggfunc='size',
                       fill_value=0)
pivot

visit_time_hour,0,1,10,11,12,13,14,15,16,17,...,21,22,23,3,4,5,6,7,8,9
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,64316,38959,97992,98183,99032,99247,108202,100774,102053,98127,...,95224,91101,82598,20552,19752,24723,35719,53700,70316,89543
1,1886,1121,2857,3083,3039,3230,3285,3080,3188,2959,...,2388,2372,2045,484,468,569,721,1194,1699,2317


In [82]:
target_percent = pivot.iloc[0] / pivot.iloc[1]
target_percent

visit_time_hour
0     34.101803
1     34.753791
10    34.298915
11    31.846578
12    32.587035
13    30.726625
14    32.938204
15    32.718831
16    32.011606
17    33.162217
18    35.600499
19    38.945228
2     41.434783
20    40.206594
21    39.876047
22    38.406830
23    40.390220
3     42.462810
4     42.205128
5     43.449912
6     49.540915
7     44.974874
8     41.386698
9     38.646094
dtype: float64

аналогично

In [83]:
fig = px.histogram(x=target_percent.index,
                   y=target_percent)
fig.show()

### Разрешение экрана по осям

Есть подозрения, что из-за разрешения экрана пользователь может увидеть меньше приятного контента и не оставит заявку

In [197]:
df['device_screen_resolution'].value_counts()

414x896      169090
1920x1080    125768
375x812      117944
393x851      115454
375x667       93341
              ...  
2262x1553         1
1097x617          1
421x847           1
1791x1007         1
464x1123          1
Name: device_screen_resolution, Length: 5039, dtype: int64

In [198]:
screen_mode = df['device_screen_resolution'].mode()[0]
screen_mode

'414x896'

In [199]:
df.loc[df['device_screen_resolution'] == '(not set)', 'device_screen_resolution'] = screen_mode
df['device_screen_resolution_x'] = df['device_screen_resolution'].apply(lambda x: int(x.split('x')[0]))
df['device_screen_resolution_y'] = df['device_screen_resolution'].apply(lambda x: int(x.split('x')[1]))

### Преобразование категориальных и числовых атриутов

In [200]:
categorical = ['utm_source', 'utm_medium', 'utm_campaign', 'utm_adcontent',
              'device_os', 'device_brand', 'device_browser',
              'geo_country', 'geo_city',
              'dayofweek', 'visit_time_hour']
numerical = ['visit_number', 'device_screen_resolution_x', 'device_screen_resolution_y']

In [201]:
ddf = dd.from_pandas(df[numerical], chunksize=20000)

### MinMax

In [202]:
%%time
mm_scaler = MinMaxScaler()
mm_scaler.fit(ddf[numerical])
pickle.dump(mm_scaler, open("data/pipeline/mm_scaler.pickle", "wb"))

CPU times: user 79.8 ms, sys: 13.2 ms, total: 92.9 ms
Wall time: 78.9 ms


In [203]:
ddf[numerical] = mm_scaler.transform(ddf[numerical])

### OHE

In [204]:
ddf_categorical = dd.from_pandas(df[categorical], chunksize=20000).categorize()

In [205]:
ohe = OneHotEncoder(sparse=False)
ohe.fit(ddf_categorical)
pickle.dump(ohe, open("data/pipeline/ohe.pickle", "wb"))
ohe_transform = ohe.transform(ddf_categorical)
del ddf_categorical

In [206]:
ddf = dd.merge(ddf, ohe_transform.astype('uint8'))

In [207]:
del ohe_transform

### Перенос целевой переменной

In [208]:
ddf['target'] = df['target']

In [209]:
# del df

In [210]:
df = ddf.compute()

In [211]:
df.shape

(1860042, 1671)

In [212]:
df.to_pickle('data/temp/df_full_columns.pkl')

### Кореляция целевой переменной с другими атрибутами

In [3]:
df = pd.read_pickle('data/temp/df_full_columns.pkl')

In [4]:
corrs = df.corrwith(df["target"]).sort_values(key=lambda x: abs(x), ascending=False)

In [5]:
corrs

target                                1.000000e+00
utm_campaign_LTuZkdKfxRGVceoWkVyg     4.956263e-02
utm_medium_referral                   4.920653e-02
utm_adcontent_JNHcPlZPxEMWDnRiyoBf    4.478846e-02
utm_campaign_FTjNLDyTrXaWYgZymFkV    -4.142048e-02
                                          ...     
geo_country_Poland                   -7.622137e-06
geo_city_Belgorod                    -2.812427e-06
utm_source_ZHCJROlbqnkXTqIuVxnm      -1.991068e-06
utm_adcontent_NacUSAyeXYJDflPqmJGg   -1.090477e-06
device_brand_Fly                     -6.295746e-07
Length: 1671, dtype: float64

Отрицательно коррелирующие атрибуты плохо сказывались на модели, поэтому abs не используем

In [6]:
important_columns = corrs[abs(corrs) >= 0.001].index
# important_columns = corrs[corrs >= 0.001].index
important_columns

Index(['target', 'utm_campaign_LTuZkdKfxRGVceoWkVyg', 'utm_medium_referral',
       'utm_adcontent_JNHcPlZPxEMWDnRiyoBf',
       'utm_campaign_FTjNLDyTrXaWYgZymFkV', 'utm_source_bByPQxmDaMXgpHeypKSM',
       'visit_number', 'utm_medium_cpm', 'device_os_other',
       'utm_source_MvfHsxITijuriZxsqZqt',
       ...
       'utm_campaign_lDZWtjMawBaqetnVFboy', 'geo_city_Stupino',
       'utm_campaign_tVtbIKrPSOvrXLCznVVe',
       'utm_campaign_YHobSrmCVImJLFtqxaTd', 'geo_city_Arkhangelsk',
       'utm_campaign_FybWmKxPurtzLVenltZy',
       'utm_campaign_klTrhUaShgnjIbaPmqjc', 'geo_city_Kotelniki',
       'geo_city_Zhukovskiy', 'utm_adcontent_cwJiQIYIVtjAceFuhkSu'],
      dtype='object', length=550)

сохраним для использования в пайплайне

In [7]:
pd.DataFrame(important_columns).to_csv('data/pipeline/important_columns.csv')

### Удаление ненужных колонок

In [8]:
df = df[important_columns]

### Сохранение датафрейма

In [9]:
df.to_pickle('data/temp/df_prep.pkl')

# Modeling

### Загрузка датафрейма

В зависимости от объема памяти загружаем датафрейм через пандас либо даск

In [4]:
if NOT_ENOUGH_RAM:
    chunksize = 100000
    ddf = dd.from_pandas(pd.read_pickle('data/temp/f_prep.pkl'), chunksize=chunksize)
else:
    df = pd.read_pickle('data/temp/df_prep.pkl')

### train test split

Собираем первые значения чанка для трейна, последние для теста

In [5]:
def part_train_test(ddf, target, train_size, test_size):
    part = ddf.partitions[0]
    train = part.head(train_size)
    test = part.tail(test_size)
    for p in range(1, ddf.npartitions):
        part = ddf.partitions[p]
        train = pd.concat([train, part.head(train_size)])
        test = pd.concat([test, part.tail(test_size)])

    return tf.convert_to_tensor(train.drop(target, axis=1)), \
           tf.convert_to_tensor(test.drop(target, axis=1)), \
           tf.convert_to_tensor(train[target]), \
           tf.convert_to_tensor(test[target])

In [6]:
def train_iterator(ddf, target, size):
    while True:
        for p in range(1, ddf.npartitions):
            X_train = ddf.partitions[p].drop(target, axis=1).head(size)
            y_train = ddf.partitions[p][target].head(size)
            yield tf.convert_to_tensor(X_train), tf.convert_to_tensor(y_train)

In [7]:
true_target_count = df.target.value_counts()[1]
df = pd.concat([df[df.target == 1],
                df[df.target == 0].sample(n=true_target_count, random_state=42)])

In [8]:
%%time
if NOT_ENOUGH_RAM:
    X_train, X_test, y_train, y_test = part_train_test(ddf, 'target', 0, int(chunksize*0.2))
else:
    X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1),
                                                        df['target'],
                                                        test_size=0.3,
                                                        shuffle=True,
                                                        random_state=42)
    del df

CPU times: user 123 ms, sys: 21.4 ms, total: 144 ms
Wall time: 143 ms


### Неудачные модели

Модели показавшие плохие результаты (включая перебор параметров)

In [9]:
# model = LogisticRegression(max_iter=10,
#                            multi_class='multinomial',
#                            solver='newton',
#                            penalty='none',
#                            solver_kwargs={"normalize":False},
#                            random_state=42)
#
# model = xgb.XGBClassifier(random_state=42,
#                           n_estimators=100,
#                           learning_rate=0.1,
#                           max_depth=5,
#                           eval_metric='error')

## Keras

In [64]:
tf.random.set_seed(42)

Поиск оптимальной функции активации

In [None]:
def buildmodel(activation):
    model = Sequential()
    model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(256, input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(128, input_dim=256, activation=activation))
    model.add(Dense(64, input_dim=128, activation=activation))
    model.add(Dense(1, input_dim=64, activation='linear'))
    sgd = SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='mean_squared_error', optimizer=sgd)
    return model

activations = [
    # 'softmax',
    'elu',
    # 'selu',
    # 'softplus',
    'softsign',
    'swish',
    'relu',
    'gelu',
    'tanh',
    'sigmoid',
    # 'exponential',
    'hard_sigmoid',
    'linear',
    'sigmoid']

results = {}

for p in activations:
    model = buildmodel(p)
    model.fit(X_train,
              y_train,
              epochs=1,
              verbose=2,
              shuffle=False)

    results[p] = roc_auc_score(y_test, model.predict(X_test))

In [12]:
results

{'softmax': 0.5,
 'elu': 0.6800105098541558,
 'softplus': 0.5014950918872244,
 'softsign': 0.6815212443313388,
 'swish': 0.6713468982736743,
 'relu': 0.6874558581824184,
 'gelu': 0.6731615396571169,
 'tanh': 0.6779195423824388,
 'sigmoid': 0.6327520949527332,
 'hard_sigmoid': 0.6232556997972449,
 'linear': 0.6732294756359815}

In [63]:
def buildmodel(activation):
    model = Sequential()
    model.add(Dense(X_train.shape[1], input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(256, input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(128, input_dim=256, activation=activation))
    model.add(Dense(64, input_dim=128, activation=activation))
    model.add(Dense(1, input_dim=64, activation='linear'))
    return model

optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)

model = KerasClassifier(model=buildmodel, loss="binary_crossentropy", epochs=1, batch_size=10, verbose=0)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=4)
grid_result = grid.fit(X_train, y_train)

  model = KerasClassifier(model=buildmodel, loss="binary_crossentropy", epochs=1, batch_size=10, verbose=0)


AttributeError: 'KerasClassifier' object has no attribute '__call__'

In [38]:
lr_range =
results = {}

for lr in lr_range:
    model = buildmodel('relu', lr, 0.9)
    model.fit(X_train,
              y_train,
              epochs=1,
              verbose=2,
              shuffle=False)

    results[a] = roc_auc_score(y_test, model.predict(X_test))

2022-08-20 00:16:44.831358: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 10s - loss: 0.2309 - 10s/epoch - 4ms/step
 83/944 [=>............................] - ETA: 1s

2022-08-20 00:16:54.383945: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:16:56.447088: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2310 - 9s/epoch - 4ms/step
107/944 [==>...........................] - ETA: 1s

2022-08-20 00:17:05.563576: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:17:07.237083: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2315 - 9s/epoch - 4ms/step
 99/944 [==>...........................] - ETA: 1s

2022-08-20 00:17:16.198556: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:17:17.900867: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2321 - 9s/epoch - 4ms/step
 97/944 [==>...........................] - ETA: 1s

2022-08-20 00:17:27.021762: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:17:28.799456: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2327 - 9s/epoch - 4ms/step
 94/944 [=>............................] - ETA: 1s

2022-08-20 00:17:37.913429: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:17:39.665630: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2331 - 9s/epoch - 4ms/step
105/944 [==>...........................] - ETA: 1s

2022-08-20 00:17:48.701139: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:17:50.567084: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2336 - 9s/epoch - 4ms/step
104/944 [==>...........................] - ETA: 1s

2022-08-20 00:17:59.527872: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:18:01.304562: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 10s - loss: 0.2342 - 10s/epoch - 4ms/step
 96/944 [==>...........................] - ETA: 1s

2022-08-20 00:18:10.821591: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-08-20 00:18:12.502413: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


2202/2202 - 9s - loss: 0.2347 - 9s/epoch - 4ms/step
105/944 [==>...........................] - ETA: 1s

2022-08-20 00:18:21.425317: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




In [41]:
results

{'sigmoid': 0.6859262200478418}

Обучение модели с оптимальными параметрами

In [49]:
if True:
    model = buildmodel('relu', 0.01, 0.9)
else:
    model = load_model('data/pipeline/keras_seq.h5')

In [50]:
if NOT_ENOUGH_RAM:
    gen = train_iterator(ddf, 'target', int(chunksize*0.8))
    while True:
        # with tf.device('/gpu:0'):
        X_train, y_train = next(gen)
        model.fit(X_train,
                  y_train,
                  epochs=1,
                  verbose=1)
else:
    results = {}
    for e in range(1, 2):
        model.fit(X_train,
                  y_train,
                  epochs=1,
                  shuffle=False)

        results[e] = roc_auc_score(y_test, model.predict(X_test))

results

ValueError: in user code:

    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_function  *
        return step_function(self, iterator)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/training.py", line 1040, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/training.py", line 1030, in run_step  **
        outputs = model.train_step(data)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/training.py", line 894, in train_step
        return self.compute_metrics(x, y, y_pred, sample_weight)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/training.py", line 987, in compute_metrics
        self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 480, in update_state
        self.build(y_pred, y_true)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 393, in build
        self._metrics = tf.__internal__.nest.map_structure_up_to(
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 526, in _get_metric_objects
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 526, in <listcomp>
        return [self._get_metric_object(m, y_t, y_p) for m in metrics]
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 545, in _get_metric_object
        metric_obj = metrics_mod.get(metric)
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/metrics/__init__.py", line 182, in get
        return deserialize(str(identifier))
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/metrics/__init__.py", line 138, in deserialize
        return deserialize_keras_object(
    File "/opt/miniconda3/envs/tensorflow/lib/python3.9/site-packages/keras/utils/generic_utils.py", line 709, in deserialize_keras_object
        raise ValueError(

    ValueError: Unknown metric function: roc_auc_score. Please ensure this object is passed to the `custom_objects` argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.


In [30]:
# model.save('data/pipeline/keras_seq.h5')

In [31]:
predicted_train = model.predict(X_train)



In [32]:
roc_auc_score(y_train, predicted_train)

0.6849175548146202

In [33]:
predicted_test = model.predict(X_test)



In [34]:
roc_auc_score(y_test, predicted_test)

0.6854417935277493

Определим оптимальный порог для положительного прогноза

In [35]:
predicted_test

array([[0.6529238 ],
       [0.11967808],
       [0.6998433 ],
       ...,
       [0.50787836],
       [0.43028384],
       [0.3540578 ]], dtype=float32)

In [36]:
b_trsh = 0
max_score = 0
for trsh in np.arange(0.01, 1.2, 0.1):
    r_score = roc_auc_score(y_test, [1 if i >= trsh else 0 for i in predicted_test])
    if r_score >= max_score:
        b_trsh, max_score = trsh, r_score

print(f'{b_trsh} - {max_score}')

0.51 - 0.6298778546344123
