# Feature preparation and model selection for the final prediction

**Competition task**

Predict the gender and age of an HTTP cookie owner from the user's internet activity history, based on synthetic data.

I have a database with activity records for 400k+ users at the level of date + part of day.

Because the classification task is to determine the user's gender and age, I aggregate the data to the user level to build the working dataset: user_id + [features derived from the data].

In this notebook I build a table of features for each user (main region, main device model, and so on) and add data on visits to site clusters (I did that task earlier, see mts_ml_cup_sibrikova_url_claster.ipynb).

Next I split the resulting data into training and test sets and search for the best model for the task, testing in parallel several variants of site clustering and two approaches to age prediction: classification, or regression followed by bucketing into the required age groups.

Models used: Linear Regression, Logistic Regression, Random Forest (classifier and regressor), CatBoost (classifier and regressor), and LightGBM (classifier and regressor). Some model types performed very poorly or took a long time to train, so they were used only selectively in the tests on different site clusterings.

The final stage of the project is in MTS_ML_CUP_sibrikova_public_submition: for the best-performing models I tune hyperparameters and then predict on the competition test set.

## Importing the required libraries

In [2]:
import sys
import os
import warnings
os.environ['OPENBLAS_NUM_THREADS'] = '1'

In [3]:
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
import pyarrow as pa
import scipy
import implicit
import bisect
import sklearn.metrics as m
from catboost import CatBoostClassifier, CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [5]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder,OrdinalEncoder

In [6]:
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
import lightgbm as lgb

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

## Loading external files

In [7]:
# path to the folder with the data files
LOCAL_DATA_PATH = './context_data'

# parameter: seed for splits
SPLIT_SEED = 42

# Main dataset
DATA_FILE = 'competition_data_final_pqt'

# targets
TARGET_FILE = 'public_train.pqt'

# IDs of the clients we make the submission for
SUBMISSION_FILE = 'submission.csv'

In [8]:
%%time
# IDs of the clients for the submission
id_to_submit = pd.read_csv(f'{LOCAL_DATA_PATH}/{SUBMISSION_FILE}')

# Targets
targets = pq.read_table(f'{LOCAL_DATA_PATH}/{TARGET_FILE}')

# Main dataset
data = pq.read_table(f'{LOCAL_DATA_PATH}/{DATA_FILE}')

CPU times: user 1min 30s, sys: 34.2 s, total: 2min 5s
Wall time: 12.2 s


In [9]:
# look at the structure of the source file
pd.DataFrame([(z.name, z.type) for z in data.schema], columns=['field', 'type'])

Unnamed: 0,field,type
0,region_name,string
1,city_name,string
2,cpe_manufacturer_name,string
3,cpe_model_name,string
4,url_host,string
5,cpe_type_cd,string
6,cpe_model_os_type,string
7,price,double
8,date,date32[day]
9,part_of_day,string


## Preparing user_id-level features

### reg_main - reg_count

In [10]:
# aggregate the required fields from the source dataset
reg_user = data.select(['user_id', 'region_name', 'request_cnt']).\
    group_by(['user_id', 'region_name']).aggregate([('request_cnt', "sum")])

In [11]:
# convert to a pandas DataFrame for convenience
reg_user=reg_user.to_pandas()

In [12]:
reg_user= reg_user.sort_values(by=['user_id','request_cnt_sum'],ascending=False,ignore_index=True)

In [13]:
# build a table with the number of regions per user and join it with the table of each user's most popular region
info_user=reg_user.groupby(['user_id',]).agg({'request_cnt_sum':'count'}).join(\
    reg_user.groupby(['user_id']).agg({'region_name':'first'}))

In [14]:
info_user=pd.DataFrame(info_user)

In [15]:
info_user.columns=['reg_count','reg_main']
info_user.head()

user_id,reg_count,reg_main
0,1,Москва
1,3,Москва
2,1,Республика Коми
3,1,Воронежская область
4,5,Краснодарский край


In [16]:
del reg_user

For the remaining categorical variables we apply the same transformations and join the tables on user_id.
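
The three repeated steps (sum requests per user and category, sort, keep the top row) could also be wrapped in one helper. This is a minimal sketch on a toy pandas frame, not the notebook's code: the function name `dominant_category` and the sample data are illustrative assumptions.

```python
import pandas as pd

def dominant_category(df: pd.DataFrame, col: str) -> pd.Series:
    """Return, for each user_id, the value of `col` with the largest total request_cnt."""
    totals = (
        df.groupby(['user_id', col], as_index=False)['request_cnt']
          .sum()
          .sort_values(['user_id', 'request_cnt'], ascending=[True, False])
    )
    # After sorting, the first row per user holds the most requested category
    return totals.drop_duplicates('user_id').set_index('user_id')[col]

# Hypothetical toy data
toy = pd.DataFrame({
    'user_id':     [0, 0, 0, 1, 1],
    'region_name': ['Moscow', 'Tula', 'Moscow', 'Tula', 'Moscow'],
    'request_cnt': [2, 5, 4, 7, 3],
})
main_region = dominant_category(toy, 'region_name')
print(main_region.to_dict())
```

The same call with `'cpe_type_cd'` or `'cpe_model_os_type'` would cover the sections that follow, assuming the data has first been converted to pandas.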

### cpe_manufacturer_name

In [17]:
reg_user = data.select(['user_id', 'cpe_manufacturer_name', 'request_cnt']).\
    group_by(['user_id', 'cpe_manufacturer_name']).aggregate([('request_cnt', "sum")])

In [18]:
reg_user=reg_user.to_pandas()

reg_user= reg_user.sort_values(by=['user_id','request_cnt_sum'],ascending=False,ignore_index=True)

In [19]:
reg_user = reg_user.groupby(['user_id',]).agg({'request_cnt_sum':'count'}).join(\
    reg_user.groupby(['user_id']).agg({'cpe_manufacturer_name':'first'}))

In [20]:
reg_user = reg_user.drop('request_cnt_sum',axis=1)

In [21]:
info_user=info_user.join(reg_user)

In [22]:
info_user.head()

user_id,reg_count,reg_main,cpe_manufacturer_name
0,1,Москва,Samsung
1,3,Москва,Xiaomi
2,1,Республика Коми,Huawei
3,1,Воронежская область,Huawei Device Company Limited
4,5,Краснодарский край,Huawei


In [23]:
del reg_user

### cpe_type_cd

In [24]:
reg_user = data.select(['user_id', 'cpe_type_cd', 'request_cnt']).\
    group_by(['user_id', 'cpe_type_cd']).aggregate([('request_cnt', "sum")]).to_pandas()


In [25]:
reg_user = reg_user.sort_values(by=['user_id','request_cnt_sum'],ascending=False,ignore_index=True)

In [26]:
reg_user = reg_user.groupby(['user_id',]).agg({'request_cnt_sum':'count'}).join(\
    reg_user.groupby(['user_id']).agg({'cpe_type_cd':'first'}))

In [27]:
reg_user=reg_user.drop('request_cnt_sum',axis=1)

In [28]:
info_user=info_user.join(reg_user)

In [29]:
info_user.head()

user_id,reg_count,reg_main,cpe_manufacturer_name,cpe_type_cd
0,1,Москва,Samsung,smartphone
1,3,Москва,Xiaomi,smartphone
2,1,Республика Коми,Huawei,smartphone
3,1,Воронежская область,Huawei Device Company Limited,smartphone
4,5,Краснодарский край,Huawei,smartphone


In [30]:
del reg_user

### cpe_model_os_type

In [31]:
reg_user = data.select(['user_id', 'cpe_model_os_type', 'request_cnt']).\
    group_by(['user_id', 'cpe_model_os_type']).aggregate([('request_cnt', "sum")]).to_pandas()

In [32]:
reg_user = reg_user.sort_values(by=['user_id','request_cnt_sum'],ascending=False,ignore_index=True)

In [33]:
reg_user = reg_user.groupby(['user_id',]).agg({'request_cnt_sum':'count'}).join(\
    reg_user.groupby(['user_id']).agg({'cpe_model_os_type':'first'}))

In [34]:
reg_user=reg_user.drop('request_cnt_sum',axis=1)

In [35]:
info_user=info_user.join(reg_user)

In [36]:
info_user.head()

user_id,reg_count,reg_main,cpe_manufacturer_name,cpe_type_cd,cpe_model_os_type
0,1,Москва,Samsung,smartphone,Android
1,3,Москва,Xiaomi,smartphone,Android
2,1,Республика Коми,Huawei,smartphone,Android
3,1,Воронежская область,Huawei Device Company Limited,smartphone,Android
4,5,Краснодарский край,Huawei,smartphone,Android


In [37]:
del reg_user

### cpe_model_name + price

In [38]:
reg_user = data.select(['user_id', 'cpe_manufacturer_name','cpe_model_name', 'price','request_cnt']).\
    group_by(['user_id', 'cpe_manufacturer_name','cpe_model_name','price']).aggregate([('request_cnt', "sum")]).to_pandas()


In [39]:
price_per_model = reg_user.groupby('cpe_model_name')['price'].mean()
price_per_manuf = reg_user.groupby('cpe_manufacturer_name')['price'].mean()

In [40]:
#reg_user.isna().sum()

In [41]:
for model in price_per_model.index:
    reg_user.loc[reg_user['cpe_model_name']==model]=\
        reg_user.loc[reg_user['cpe_model_name']==model].\
            fillna(price_per_model.loc[model])

In [42]:
#reg_user.isna().sum()

In [43]:
for manuf in price_per_manuf.index:
    reg_user.loc[reg_user['cpe_manufacturer_name']==manuf]=\
        reg_user.loc[reg_user['cpe_manufacturer_name']==manuf].\
            fillna(price_per_manuf.loc[manuf])

In [44]:
#reg_user.isna().sum()

In [45]:
# Global fallback: the median of the per-model mean prices (kept under the name mean_price)
mean_price = reg_user.groupby('cpe_model_name')['price'].mean()
mean_price=mean_price.median()
reg_user=reg_user.fillna(mean_price)

In [46]:
#reg_user.isna().sum()

In [47]:
reg_user=reg_user.groupby(['user_id','cpe_model_name'])['price'].mean()

In [48]:
reg_user=reg_user.reset_index()

In [49]:
reg_user=reg_user.set_index('user_id')

In [50]:
info_user=info_user.join(reg_user)

In [51]:
info_user.head()

user_id,reg_count,reg_main,cpe_manufacturer_name,cpe_type_cd,cpe_model_os_type,cpe_model_name,price
0,1,Москва,Samsung,smartphone,Android,Galaxy J1 2016 LTE Dual,2990.0
1,3,Москва,Xiaomi,smartphone,Android,Mi 9,33401.428571
2,1,Республика Коми,Huawei,smartphone,Android,Honor 9 Lite,5915.0
3,1,Воронежская область,Huawei Device Company Limited,smartphone,Android,P Smart 2021,13990.0
4,5,Краснодарский край,Huawei,smartphone,Android,Nova 3,12990.0


In [52]:
del reg_user

In [53]:
del price_per_model,price_per_manuf
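
The price gaps above are filled in three levels: the model mean, then the manufacturer mean, then a global value. The sketch below shows one way to express the same idea without Python loops, using `groupby(...).transform`. It is a variant under stated assumptions, not the notebook's exact procedure: the toy data and the `price_filled` column are hypothetical, only the `price` column is filled, and the global fallback is computed before any filling rather than after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing prices
toy = pd.DataFrame({
    'cpe_manufacturer_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'cpe_model_name':        ['a1', 'a1', 'a2', 'b1', 'b1', 'c1'],
    'price':                 [100.0, np.nan, np.nan, 300.0, 500.0, np.nan],
})

# Level 1: mean price of the same model (NaN if the model has no known price)
by_model = toy.groupby('cpe_model_name')['price'].transform('mean')
# Level 2: mean price of the same manufacturer
by_manuf = toy.groupby('cpe_manufacturer_name')['price'].transform('mean')
# Level 3: median of the per-model means, used as a global fallback
global_fallback = toy.groupby('cpe_model_name')['price'].mean().median()

toy['price_filled'] = (
    toy['price'].fillna(by_model).fillna(by_manuf).fillna(global_fallback)
)
print(toy['price_filled'].tolist())
```

Each `fillna` only touches rows that are still missing, so known prices are never overwritten.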

### url_host - clustered

In [54]:
# open the file with the precomputed URL clustering
url_cluster=pd.read_csv('url_clusters.csv')



In [55]:
url_cluster.columns

Index(['url', 'is_low_usage', 'kmeans_50', 'kmeans_150', 'kmeans_300',
       'kmeans_500'],
      dtype='object')

In [56]:
%%time
df = data.select(['user_id', 'url_host', 'request_cnt']).\
    group_by(['user_id', 'url_host',]).aggregate([('request_cnt', "sum")]).to_pandas()

CPU times: user 35.7 s, sys: 4.35 s, total: 40.1 s
Wall time: 43.7 s


In [57]:
df = df.sort_values(by=['user_id','url_host'])

In [58]:
df.head()

Unnamed: 0,request_cnt_sum,user_id,url_host
17842783,1,0,ad.adriver.ru
17842749,6,0,ad.mail.ru
17842737,5,0,ads.adfox.ru
17842748,2,0,ads.betweendigital.com
17842734,9,0,avatars.mds.yandex.net


In [59]:
df.columns=['request_cnt_sum', 'user_id', 'url']

In [60]:
# add the cluster information based on url
df2=df.merge(url_cluster[['url', 'is_low_usage', 'kmeans_50', 'kmeans_150', 'kmeans_300',
       'kmeans_500']], on = 'url', how = 'left')

In [61]:
del df

In [62]:
del url_cluster

### Processing visit features for low-traffic sites

Because not all sites could be parsed, I put the unparsed ones into a separate group. These sites usually have low popularity among users, so to reduce the feature space I build a sparse user-by-site matrix for them and use implicit.approximate_als.FaissAlternatingLeastSquares to derive 150 factors from the site-visit counts (aggregated here per user and site).
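
To make the sparse-matrix step concrete, here is a minimal sketch on toy data. It swaps the set-based dictionaries used in the cells below for `pd.factorize`, which is an assumption of this sketch rather than the notebook's approach, and it stops before the ALS factorization. The sample frame is hypothetical.

```python
import pandas as pd
import scipy.sparse

# Hypothetical toy visit counts
visits = pd.DataFrame({
    'user_id': [10, 10, 20, 30],
    'url': ['a.ru', 'b.ru', 'a.ru', 'c.ru'],
    'request_cnt_sum': [1.0, 2.0, 5.0, 3.0],
})

# factorize assigns consecutive integer codes in order of first appearance
row_codes, user_index = pd.factorize(visits['user_id'])
col_codes, url_index = pd.factorize(visits['url'])

mat = scipy.sparse.coo_matrix(
    (visits['request_cnt_sum'].to_numpy(), (row_codes, col_codes)),
    shape=(len(user_index), len(url_index)),
).tocsr()

print(mat.shape)
print(mat.toarray())
```

Here `user_index[i]` is the user_id behind row `i`, so the row labels can be restored after factorization without relying on any iteration order.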

In [63]:
df2_low =df2.loc[df2['is_low_usage']==True,['user_id','url','request_cnt_sum']]

In [64]:
df2_low.head()

Unnamed: 0,user_id,url,request_cnt_sum
9,0,employmentcenter.ru,1
14,0,gorodrabot.ru,2
15,0,gotovim-doma.ru,1
19,0,jobfilter.ru,1
20,0,jobinmoscow.com.ru,1


In [65]:
# get the unique ids of all available users
unique_users=pd.DataFrame(info_user.index)

In [66]:
# add them to the resulting table, then fill the gaps: 0 for the visit count and any already present site for the url
df2_low = df2_low.merge(unique_users['user_id'],on='user_id',how='outer')
df2_low['request_cnt_sum']=df2_low['request_cnt_sum'].fillna(0)
df2_low['url']=df2_low['url'].fillna('employmentcenter.ru')

In [67]:
df2_low = pa.table(df2_low)
pd.DataFrame([(z.name, z.type) for z in df2_low.schema], columns=['field', 'type'])

Unnamed: 0,field,type
0,user_id,int64
1,url,string
2,request_cnt_sum,double
3,__index_level_0__,int64


In [68]:
# build index dictionaries for the sites and the users

url_set = set(df2_low.select(['url']).to_pandas()['url'])
print(f'{len(url_set)} urls')
url_dict = {url: idurl for url, idurl in zip(url_set, range(len(url_set)))}
usr_set = set(df2_low.select(['user_id']).to_pandas()['user_id'])
print(f'{len(usr_set)} users')
usr_dict = {usr: user_id for usr, user_id in zip(usr_set, range(len(usr_set)))}

74061 urls
415317 users


In [69]:
%%time

values = np.array(df2_low.select(['request_cnt_sum']).to_pandas()['request_cnt_sum'])
rows = np.array(df2_low.select(['user_id']).to_pandas()['user_id'].map(usr_dict))
cols = np.array(df2_low.select(['url']).to_pandas()['url'].map(url_dict))
mat = scipy.sparse.coo_matrix((values, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
# collect the values, row indices and column indices, then build the sparse matrix

CPU times: user 2.01 s, sys: 190 ms, total: 2.2 s
Wall time: 2.4 s


In [70]:
# reduce the feature space to 150 factors
als = implicit.approximate_als.FaissAlternatingLeastSquares(factors = 150, iterations = 30, use_gpu = False, \
       calculate_training_loss = False, regularization = 0.1)
als.fit(mat)

u_factors = als.model.user_factors 
d_factors = als.model.item_factors

inv_usr_map = {v: k for k, v in usr_dict.items()}
usr_emb = pd.DataFrame(u_factors)
usr_emb['user_id'] = usr_emb.index.map(inv_usr_map)




In [71]:
# get the matrix of user factors for visits to low-traffic sites
usr_low=usr_emb.set_index('user_id')

In [72]:
# rename the columns
dic={}
for i in usr_low.columns:
    dic[i]=str(i)+"low"

In [73]:
usr_low=usr_low.rename(columns=dic)

In [74]:
# add the resulting features to the common feature table
info_user=info_user.join(usr_low)

In [75]:
info_user.head()

user_id,reg_count,reg_main,cpe_manufacturer_name,cpe_type_cd,cpe_model_os_type,cpe_model_name,price,0low,1low,2low,...,140low,141low,142low,143low,144low,145low,146low,147low,148low,149low
0,1,Москва,Samsung,smartphone,Android,Galaxy J1 2016 LTE Dual,2990.0,-0.046409,-0.178324,0.071651,...,-0.011594,-0.171913,-0.077253,-0.067693,-0.009521,-0.049413,0.153522,0.056872,0.102366,0.009024
1,3,Москва,Xiaomi,smartphone,Android,Mi 9,33401.428571,-0.21928,0.113869,0.234526,...,0.185535,0.040139,0.242046,-0.157605,0.108859,-0.083552,-0.158932,-0.075543,-0.093775,0.183241
2,1,Республика Коми,Huawei,smartphone,Android,Honor 9 Lite,5915.0,-0.121608,0.018752,0.011818,...,0.055805,-0.007707,0.018519,-0.030663,-0.053377,0.015479,-0.017513,0.014654,-0.047422,-0.020332
3,1,Воронежская область,Huawei Device Company Limited,smartphone,Android,P Smart 2021,13990.0,9.5e-05,-0.000183,0.000436,...,-0.000746,7.4e-05,-1e-06,0.000792,-0.000135,-9.5e-05,-0.000227,0.000715,-2.6e-05,-0.00029
4,5,Краснодарский край,Huawei,smartphone,Android,Nova 3,12990.0,0.074125,0.046849,-0.016433,...,-0.049166,-0.027304,-0.032236,0.081305,-0.025955,-0.199342,-0.006012,0.082552,0.053362,-0.032763


In [76]:
info_user.to_parquet('info_user.parquet')

In [77]:
del usr_low,df2_low

### Processing visit features for clustered sites and high-traffic sites

In [78]:
df2_high =df2.loc[df2['is_low_usage']==False,['user_id','url','request_cnt_sum','kmeans_50', 'kmeans_150', 'kmeans_300',
       'kmeans_500']]

In [79]:
del df2

#### Adding data for 50 clusters

In [80]:
# build a table of user visits to the clustered sites
df2_high_50=df2_high.groupby(['user_id','kmeans_50'])['request_cnt_sum'].sum()
df2_high_50=pd.DataFrame(df2_high_50)
df2_high_50=df2_high_50.reset_index()
df2_high_50['kmeans_50']=df2_high_50['kmeans_50'].astype('str')
df2_high_50.columns=['user_id', 'url', 'request_cnt_sum']
df2_high_50.to_parquet('df2_high_50.parquet')

In [81]:


df2_high_50 =pa.table(df2_high_50)

In [82]:
# build the index dictionaries
url_set = set(df2_high_50.select(['url']).to_pandas()['url'])
print(f'{len(url_set)} urls')
url_dict = {url: idurl for url, idurl in zip(url_set, range(len(url_set)))}
usr_set = set(df2_high_50.select(['user_id']).to_pandas()['user_id'])
print(f'{len(usr_set)} users')
usr_dict = {usr: user_id for usr, user_id in zip(usr_set, range(len(usr_set)))}

550 urls
415274 users


In [83]:
%%time
values = np.array(df2_high_50.select(['request_cnt_sum']).to_pandas()['request_cnt_sum'])
rows = np.array(df2_high_50.select(['user_id']).to_pandas()['user_id'].map(usr_dict))
cols = np.array(df2_high_50.select(['url']).to_pandas()['url'].map(url_dict))
mat = scipy.sparse.coo_matrix((values, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
mat = pd.DataFrame.sparse.from_spmatrix(mat)
# build the sparse matrix and convert it to a DataFrame

CPU times: user 3.96 s, sys: 572 ms, total: 4.54 s
Wall time: 4.79 s


In [84]:
# rename the columns as <column index> + 'high' (the index is the position assigned in url_dict)
dic={}
for i in mat.columns:
    dic[i]=str(i)+'high'

mat=mat.rename(columns=dic)

In [85]:
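# Restore the user_id labels via the inverse of usr_dict, so the row labels
# do not depend on set iteration order matching the order of unique()
mat.index = mat.index.map({v: k for k, v in usr_dict.items()})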

In [86]:
info_user.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 157 entries, reg_count to 149low
dtypes: float32(150), float64(1), int64(1), object(5)
memory usage: 279.1+ MB


In [87]:
# add the per-user information to the main table
info_user50=info_user.join(mat)
info_user50=info_user50.fillna(0)

In [88]:
info_user50.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 707 entries, reg_count to 549high
dtypes: Sparse[float64, 0](550), float32(150), float64(1), int64(1), object(5)
memory usage: 564.5+ MB


Let's do the same for the variants with 150, 300 and 500 clusters.
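
Since the four variants differ only in the cluster column, the repeated cells could in principle be driven by one function. The sketch below is an illustration on toy data, with a hypothetical helper name `cluster_counts`. Unlike the notebook it returns a dense frame, which would be memory-heavy on the full data, and it names columns after the actual cluster label rather than the position in `url_dict`.

```python
import pandas as pd

def cluster_counts(visits: pd.DataFrame, cluster_col: str) -> pd.DataFrame:
    """Sum request_cnt_sum per (user_id, cluster) and spread clusters into columns."""
    wide = (
        visits.groupby(['user_id', cluster_col])['request_cnt_sum']
              .sum()
              .unstack(fill_value=0)
    )
    # Name the columns after the cluster label, e.g. '7high'
    wide.columns = [f'{c}high' for c in wide.columns]
    return wide

# Hypothetical toy data with two clustering variants
toy = pd.DataFrame({
    'user_id':         [0, 0, 1, 1, 1],
    'kmeans_50':       [7, 7, 3, 7, 3],
    'kmeans_150':      [12, 40, 5, 40, 5],
    'request_cnt_sum': [2, 1, 4, 6, 1],
})
features = {k: cluster_counts(toy, f'kmeans_{k}') for k in (50, 150)}
print(features[50])
```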

#### Adding data for 150 clusters

In [89]:
# build a table of user visits to the clustered sites
df2_high_150=df2_high.groupby(['user_id','kmeans_150'])['request_cnt_sum'].sum()
df2_high_150=pd.DataFrame(df2_high_150)
df2_high_150=df2_high_150.reset_index()
df2_high_150['kmeans_150']=df2_high_150['kmeans_150'].astype('str')
df2_high_150.columns=['user_id', 'url', 'request_cnt_sum']
df2_high_150.to_parquet('df2_high_150.parquet')

In [90]:


df2_high_150 =pa.table(df2_high_150)

In [91]:
# build the index dictionaries
url_set = set(df2_high_150.select(['url']).to_pandas()['url'])
print(f'{len(url_set)} urls')
url_dict = {url: idurl for url, idurl in zip(url_set, range(len(url_set)))}
usr_set = set(df2_high_150.select(['user_id']).to_pandas()['user_id'])
print(f'{len(usr_set)} users')
usr_dict = {usr: user_id for usr, user_id in zip(usr_set, range(len(usr_set)))}

649 urls
415274 users


In [92]:
%%time
values = np.array(df2_high_150.select(['request_cnt_sum']).to_pandas()['request_cnt_sum'])
rows = np.array(df2_high_150.select(['user_id']).to_pandas()['user_id'].map(usr_dict))
cols = np.array(df2_high_150.select(['url']).to_pandas()['url'].map(url_dict))
mat = scipy.sparse.coo_matrix((values, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
mat = pd.DataFrame.sparse.from_spmatrix(mat)
# build the sparse matrix and convert it to a DataFrame

CPU times: user 3.96 s, sys: 669 ms, total: 4.62 s
Wall time: 5.02 s


In [93]:
# rename the columns as <column index> + 'high' (the index is the position assigned in url_dict)
dic={}
for i in mat.columns:
    dic[i]=str(i)+'high'

mat=mat.rename(columns=dic)

In [94]:
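# Restore the user_id labels via the inverse of usr_dict, so the row labels
# do not depend on set iteration order matching the order of unique()
mat.index = mat.index.map({v: k for k, v in usr_dict.items()})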

In [95]:
info_user.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 157 entries, reg_count to 149low
dtypes: float32(150), float64(1), int64(1), object(5)
memory usage: 279.1+ MB


In [96]:
# add the per-user information to the main table
info_user150=info_user.join(mat)
info_user150=info_user150.fillna(0)

In [97]:
info_user150.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 806 entries, reg_count to 648high
dtypes: Sparse[float64, 0](649), float32(150), float64(1), int64(1), object(5)
memory usage: 564.8+ MB


#### Adding data for 300 clusters

In [98]:
# build a table of user visits to the clustered sites
df2_high_300=df2_high.groupby(['user_id','kmeans_300'])['request_cnt_sum'].sum()
df2_high_300=pd.DataFrame(df2_high_300)
df2_high_300=df2_high_300.reset_index()
df2_high_300['kmeans_300']=df2_high_300['kmeans_300'].astype('str')
df2_high_300.columns=['user_id', 'url', 'request_cnt_sum']
df2_high_300.to_parquet('df2_high_300.parquet')

In [99]:


df2_high_300 =pa.table(df2_high_300)

In [100]:
# build the index dictionaries
url_set = set(df2_high_300.select(['url']).to_pandas()['url'])
print(f'{len(url_set)} urls')
url_dict = {url: idurl for url, idurl in zip(url_set, range(len(url_set)))}
usr_set = set(df2_high_300.select(['user_id']).to_pandas()['user_id'])
print(f'{len(usr_set)} users')
usr_dict = {usr: user_id for usr, user_id in zip(usr_set, range(len(usr_set)))}

799 urls
415274 users


In [101]:
%%time
values = np.array(df2_high_300.select(['request_cnt_sum']).to_pandas()['request_cnt_sum'])
rows = np.array(df2_high_300.select(['user_id']).to_pandas()['user_id'].map(usr_dict))
cols = np.array(df2_high_300.select(['url']).to_pandas()['url'].map(url_dict))
mat = scipy.sparse.coo_matrix((values, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
mat = pd.DataFrame.sparse.from_spmatrix(mat)
# build the sparse matrix and convert it to a DataFrame

CPU times: user 3.88 s, sys: 599 ms, total: 4.48 s
Wall time: 4.74 s


In [102]:
# rename the columns as <column index> + 'high' (the index is the position assigned in url_dict)
dic={}
for i in mat.columns:
    dic[i]=str(i)+'high'

mat=mat.rename(columns=dic)

In [103]:
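# Restore the user_id labels via the inverse of usr_dict, so the row labels
# do not depend on set iteration order matching the order of unique()
mat.index = mat.index.map({v: k for k, v in usr_dict.items()})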

In [104]:
info_user.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 157 entries, reg_count to 149low
dtypes: float32(150), float64(1), int64(1), object(5)
memory usage: 279.1+ MB


In [105]:
# add the per-user information to the main table
info_user300=info_user.join(mat)
info_user300=info_user300.fillna(0)

In [106]:
info_user300.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 956 entries, reg_count to 798high
dtypes: Sparse[float64, 0](799), float32(150), float64(1), int64(1), object(5)
memory usage: 565.1+ MB


#### Adding data for 500 clusters

In [107]:
# build a table of user visits to the clustered sites
df2_high_500=df2_high.groupby(['user_id','kmeans_500'])['request_cnt_sum'].sum()
df2_high_500=pd.DataFrame(df2_high_500)
df2_high_500=df2_high_500.reset_index()
df2_high_500['kmeans_500']=df2_high_500['kmeans_500'].astype('str')
df2_high_500.columns=['user_id', 'url', 'request_cnt_sum']
df2_high_500.to_parquet('df2_high_500.parquet')

In [108]:


df2_high_500 =pa.table(df2_high_500)

In [109]:
# build the index dictionaries
url_set = set(df2_high_500.select(['url']).to_pandas()['url'])
print(f'{len(url_set)} urls')
url_dict = {url: idurl for url, idurl in zip(url_set, range(len(url_set)))}
usr_set = set(df2_high_500.select(['user_id']).to_pandas()['user_id'])
print(f'{len(usr_set)} users')
usr_dict = {usr: user_id for usr, user_id in zip(usr_set, range(len(usr_set)))}

987 urls
415274 users


In [110]:
%%time
values = np.array(df2_high_500.select(['request_cnt_sum']).to_pandas()['request_cnt_sum'])
rows = np.array(df2_high_500.select(['user_id']).to_pandas()['user_id'].map(usr_dict))
cols = np.array(df2_high_500.select(['url']).to_pandas()['url'].map(url_dict))
mat = scipy.sparse.coo_matrix((values, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
mat = pd.DataFrame.sparse.from_spmatrix(mat)
# build the sparse matrix and convert it to a DataFrame

CPU times: user 3.96 s, sys: 620 ms, total: 4.58 s
Wall time: 4.93 s


In [111]:
# rename the columns as <column index> + 'high' (the index is the position assigned in url_dict)
dic={}
for i in mat.columns:
    dic[i]=str(i)+'high'

mat=mat.rename(columns=dic)

In [112]:
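# Restore the user_id labels via the inverse of usr_dict, so the row labels
# do not depend on set iteration order matching the order of unique()
mat.index = mat.index.map({v: k for k, v in usr_dict.items()})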

In [113]:
info_user.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 157 entries, reg_count to 149low
dtypes: float32(150), float64(1), int64(1), object(5)
memory usage: 279.1+ MB


In [114]:
# add the per-user information to the main table
info_user500=info_user.join(mat)
info_user500=info_user500.fillna(0)

In [115]:
info_user500.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 415317 entries, 0 to 415316
Columns: 1144 entries, reg_count to 986high
dtypes: Sparse[float64, 0](987), float32(150), float64(1), int64(1), object(5)
memory usage: 565.4+ MB


In [116]:
del df2_high_500,df2_high

In [117]:
del df2_high_50,df2_high_150,df2_high_300

In [118]:
del mat,dic,values,rows,cols

In [119]:
del info_user

## Testing the 50-cluster variant

### *Data preparation*

In [120]:
usr_targets = targets.to_pandas()
info_user50 = info_user50.reset_index()

In [121]:
df = usr_targets.merge(info_user50, how = 'inner', on = ['user_id'])
df_train=df[df['is_male'] != 'NA']
df_train=df_train.dropna()
df_train['is_male'] = df_train['is_male'].astype('int')

In [122]:
categorical = ['reg_main','cpe_manufacturer_name','cpe_type_cd','cpe_model_os_type','cpe_model_name']

In [123]:
numerical = []
for i in df_train.columns:
    if i not in [*categorical,'age','is_male','user_id']:
        numerical.append(i)

## Preparing the samples for model selection

In [124]:

df_train=df_train.drop(['user_id'],axis=1)

In [125]:
train, test = train_test_split(df_train,test_size = 0.25,random_state=12345)

In [126]:
train_traget_age = train['age']
train_traget_is_male = train['is_male']

test_traget_age = test['age']
test_traget_is_male = test['is_male']

train_features= train.drop(['age','is_male',],axis=1)
test_features= test.drop(['age','is_male'],axis=1)

In [127]:
def age_bucket(x):
    return bisect.bisect_left([18,25,35,45,55,65], x)
train_traget_age_buck = train_traget_age.map(age_bucket)
test_traget_age_buck = test_traget_age.map(age_bucket)
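
It helps to see how `bisect.bisect_left` treats ages that fall exactly on an edge. The sketch below repeats the same function on a few boundary ages; the constant name `AGE_EDGES` is introduced here only for illustration.

```python
import bisect

AGE_EDGES = [18, 25, 35, 45, 55, 65]

def age_bucket(x):
    # bisect_left returns the number of edges strictly smaller than x
    return bisect.bisect_left(AGE_EDGES, x)

for age in (17, 18, 19, 25, 26, 65, 66):
    print(age, age_bucket(age))
```

An age equal to an edge lands in the lower bucket, so bucket 1 covers ages 19 to 25 and bucket 6 starts at 66. Class labels such as '18-25' or '65+' are therefore only approximate descriptions of these buckets.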

In [128]:
# Data for training the final model
df_train_target_age=df_train['age']
df_train_target_is_male = df_train['is_male']
df_train_target_age_buck=df_train_target_age.map(age_bucket)

df_train_features=df_train.drop(['age','is_male',],axis=1)

In [129]:
#train_features.head()

In [130]:
#test_features.head()

***Standard scaler***

In [131]:
scaler = StandardScaler()
scaler.fit(train_features[numerical])
train_features[numerical]=scaler.transform(train_features[numerical])
test_features[numerical]=scaler.transform(test_features[numerical])





In [132]:
#train_features.head()

In [133]:
#test_features.head()

***OHE***

In [134]:
train_features_ohe=train_features.copy()
test_features_ohe=test_features.copy()
df_train_features_ohe = df_train_features.copy()


In [135]:


ohe=OneHotEncoder(drop='first',handle_unknown='ignore')
ohe.fit(train_features[categorical])

train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(train_features_ohe[categorical]).toarray()
train_features_ohe=train_features_ohe.drop(categorical,axis=1)


test_features_ohe[ohe.get_feature_names_out()]=ohe.transform(test_features_ohe[categorical]).toarray()
test_features_ohe=test_features_ohe.drop(categorical,axis=1)



In [136]:
#train_features_ohe.head()

In [137]:
#test_features_ohe.head()

***OE***

In [138]:
train_features_oe=train_features.copy()
test_features_oe=test_features.copy()
df_train_features_oe = df_train_features.copy()


In [139]:
#oe=OrdinalEncoder()
#oe.fit(df_train_features_oe[categorical])
#df_train_features_oe[categorical]=oe.transform(df_train_features_oe[categorical])




In [140]:
oe=OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_features_oe[categorical])
train_features_oe[categorical]=oe.transform(train_features_oe[categorical])
test_features_oe[categorical]=oe.transform(test_features_oe[categorical])
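
The `handle_unknown='use_encoded_value'` setting matters because the test split can contain categories, such as rare device models, that never appear in the training split. A minimal sketch with hypothetical category values:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column data: the test part contains an unseen category
train_col = np.array([['Android'], ['iOS'], ['Android']], dtype=object)
test_col = np.array([['iOS'], ['Windows Phone']], dtype=object)

oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_col)

# Known categories get their sorted position, unseen ones get -1
encoded = oe.transform(test_col)
print(encoded.ravel().tolist())
```

Without this setting the encoder's default is to raise an error on an unseen category at transform time.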

In [141]:
#train_features_oe.head()

In [142]:
#test_features_oe.head()

# Selecting a model for predicting user gender

***Logistic regression***

In [143]:
%%time
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_is_male)
predict_proba=model.predict_proba(test_features_ohe)[:,1]
# The variable stores 2 * AUC; Gini = 2 * AUC - 1 is computed when printing
gini_logs_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
logs_params_gender=""
print(f'Gender GINI {gini_logs_gender - 1:2.3f}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Gender GINI 0.674
CPU times: user 2min 55s, sys: 12.9 s, total: 3min 8s
Wall time: 19.7 s
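
The gender metric here is the Gini coefficient, computed as 2 * AUC - 1. The sketch below spells out that relation with a deliberately simple pairwise definition of AUC in pure Python. It is quadratic in the sample size and meant only to illustrate the formula on toy data, not to replace `sklearn.metrics.roc_auc_score`; the function name `roc_auc` and the sample scores are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
gini = 2 * auc - 1
print(auc, gini)
```

A random ranking gives AUC 0.5 and Gini 0, while a perfect ranking gives AUC 1 and Gini 1, which makes Gini easier to compare against a zero baseline.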


***RandomForestClassifier***

In [144]:
%%time
model = RandomForestClassifier(random_state=12345)
model.fit(train_features_oe,train_traget_is_male)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_RF_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
RF_params_gender=""
print(f'Gender GINI {gini_RF_gender - 1:2.3f}')

Gender GINI 0.639
CPU times: user 4min 53s, sys: 4.09 s, total: 4min 57s
Wall time: 5min 24s


***CatBoost***

In [145]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [146]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [147]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_is_male, verbose = False,cat_features=categorical)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_CB_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
CB_params_gender=""
print(f'Gender GINI {gini_CB_gender - 1:2.3f}')

Gender GINI 0.738
CPU times: user 23min 53s, sys: 1min 12s, total: 25min 5s
Wall time: 2min 6s


***LightGBM***

In [148]:
train_features_oe[categorical]=train_features_oe[categorical].astype('category')

In [149]:
test_features_oe[categorical]=test_features_oe[categorical].astype('category')

In [150]:
%%time
model = lgb.LGBMClassifier()
model.fit(train_features_oe,train_traget_is_male, verbose = False)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_LGB_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
LGB_params_gender=""
print(f'Gender GINI {gini_LGB_gender - 1:2.3f}')



Gender GINI 0.695
CPU times: user 2min 28s, sys: 8.16 s, total: 2min 36s
Wall time: 24.4 s


# Selecting a model for predicting user age (bucketed)

***Logistic regression***

In [151]:
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_age_buck)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=12345)

In [152]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_ohe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.07      0.01      0.02       210
       18-25       0.56      0.25      0.34      7966
       25-34       0.46      0.63      0.53     21340
       35-44       0.38      0.55      0.45     19093
       45-54       0.38      0.14      0.20     10183
       55-65       0.44      0.13      0.20      5934
         65+       0.28      0.03      0.05      1356

    accuracy                           0.42     66082
   macro avg       0.37      0.25      0.26     66082
weighted avg       0.43      0.42      0.39     66082
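
The gap between the `macro avg` and `weighted avg` rows comes from class imbalance: the macro average treats all seven buckets equally, while the weighted average is dominated by the large 25-34 and 35-44 buckets. As a quick check, the sketch below recomputes both F1 averages from the per-class values in the report above (the numbers are copied from that table, so the result is only accurate to the two printed decimals).

```python
# Per-class F1 and support, copied from the logistic regression report above.
f1 = [0.02, 0.34, 0.53, 0.45, 0.20, 0.20, 0.05]
support = [210, 7966, 21340, 19093, 10183, 5934, 1356]

# Macro average: plain mean over classes, every bucket has the same weight.
macro_f1 = sum(f1) / len(f1)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro F1    {macro_f1:.2f}")
print(f"weighted F1 {weighted_f1:.2f}")
```

Because the `<18` and `65+` buckets are both small and poorly predicted, they pull the macro average down much more than the weighted one.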



***RandomForestClassifier***

In [153]:
%%time
model = RandomForestClassifier(random_state=12345)
model.fit(train_features_oe,train_traget_age_buck)


CPU times: user 5min 21s, sys: 8.92 s, total: 5min 30s
Wall time: 5min 59s


RandomForestClassifier(random_state=12345)

In [154]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.54      0.20      0.29      7966
       25-34       0.45      0.67      0.54     21340
       35-44       0.37      0.47      0.42     19093
       45-54       0.32      0.15      0.20     10183
       55-65       0.38      0.12      0.18      5934
         65+       0.19      0.00      0.01      1356

    accuracy                           0.41     66082
   macro avg       0.32      0.23      0.23     66082
weighted avg       0.40      0.41      0.38     66082





***CatBoost***

In [155]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [156]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [157]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_age_buck, verbose = False,cat_features=categorical)


CPU times: user 2h 55min 1s, sys: 7min 3s, total: 3h 2min 4s
Wall time: 24min 30s


<catboost.core.CatBoostClassifier at 0x7f61f35befd0>

In [158]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.57      0.35      0.43      7966
       25-34       0.51      0.65      0.57     21340
       35-44       0.42      0.54      0.47     19093
       45-54       0.40      0.23      0.29     10183
       55-65       0.46      0.23      0.31      5934
         65+       0.45      0.03      0.05      1356

    accuracy                           0.47     66082
   macro avg       0.40      0.29      0.30     66082
weighted avg       0.47      0.47      0.45     66082



***LightGBM***

In [159]:
train_features_oe[categorical]=train_features_oe[categorical].astype('category')

In [160]:
test_features_oe[categorical]=test_features_oe[categorical].astype('category')

In [161]:
model = lgb.LGBMClassifier()
model.fit(train_features_oe,train_traget_age_buck, verbose = False)




LGBMClassifier()

In [162]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.02      0.01      0.02       210
       18-25       0.55      0.34      0.42      7966
       25-34       0.48      0.65      0.55     21340
       35-44       0.40      0.49      0.44     19093
       45-54       0.36      0.21      0.26     10183
       55-65       0.40      0.21      0.27      5934
         65+       0.29      0.04      0.08      1356

    accuracy                           0.44     66082
   macro avg       0.36      0.28      0.29     66082
weighted avg       0.44      0.44      0.42     66082



# Selecting a model for predicting user age (not bucketed)

***Linear Model***

In [163]:
model = LinearRegression()
model.fit(train_features_ohe,train_traget_age)
predict = pd.DataFrame(model.predict(test_features_ohe))


In [164]:
print(m.classification_report(test_traget_age_buck, pd.DataFrame(model.predict(test_features_ohe))[0].map(age_bucket), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))


              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.46      0.06      0.11      7966
       25-34       0.48      0.31      0.37     21340
       35-44       0.34      0.79      0.48     19093
       45-54       0.33      0.20      0.25     10183
       55-65       0.45      0.05      0.09      5934
         65+       0.23      0.03      0.05      1356

    accuracy                           0.37     66082
   macro avg       0.33      0.21      0.19     66082
weighted avg       0.40      0.37      0.32     66082



***CatBoost***

In [165]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [166]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [167]:
%%time
model = CatBoostRegressor()
model.fit(train_features_oe,train_traget_age, verbose = False,cat_features=categorical)


CPU times: user 20min 51s, sys: 1min 32s, total: 22min 24s
Wall time: 3min 7s


<catboost.core.CatBoostRegressor at 0x7f61f3565a00>

In [168]:
print(m.classification_report(test_traget_age_buck, pd.DataFrame(model.predict(test_features_oe))[0].map(age_bucket), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.65      0.07      0.12      7966
       25-34       0.50      0.50      0.50     21340
       35-44       0.41      0.65      0.50     19093
       45-54       0.35      0.40      0.37     10183
       55-65       0.50      0.15      0.22      5934
         65+       0.49      0.04      0.07      1356

    accuracy                           0.43     66082
   macro avg       0.41      0.26      0.25     66082
weighted avg       0.47      0.43      0.40     66082



***LightGBM***

In [169]:
train_features_oe[categorical]=train_features_oe[categorical].astype('category')

In [170]:
test_features_oe[categorical]=test_features_oe[categorical].astype('category')

In [171]:
model = lgb.LGBMRegressor()
model.fit(train_features_oe,train_traget_age, verbose = False)




LGBMRegressor()

In [172]:
print(m.classification_report(test_traget_age_buck, pd.DataFrame(model.predict(test_features_oe))[0].map(age_bucket), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.69      0.01      0.03      7966
       25-34       0.48      0.50      0.49     21340
       35-44       0.39      0.63      0.48     19093
       45-54       0.33      0.38      0.35     10183
       55-65       0.51      0.11      0.18      5934
         65+       0.57      0.03      0.06      1356

    accuracy                           0.41     66082
   macro avg       0.42      0.24      0.23     66082
weighted avg       0.46      0.41      0.37     66082





Since the regression variant (predicting age and then bucketing it) consistently scores worse than direct classification, I drop it from here on. For the remaining clustering variants I also plan to test only logistic regression and CatBoost: the former is the fastest, and the latter consistently gives the highest scores.
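
To make that comparison explicit, the sketch below pairs each classifier with the regressor of the same family and prints the accuracy gap. The accuracy values are copied from the classification reports above; the grouping into families (with `LogisticRegression` and `LinearRegression` under a hypothetical `"linear"` key) is my own for illustration.

```python
# Accuracy on the same test split, copied from the reports above.
accuracy = {
    "linear": {"classifier": 0.42, "regressor": 0.37},    # LogisticRegression vs LinearRegression
    "catboost": {"classifier": 0.47, "regressor": 0.43},  # CatBoostClassifier vs CatBoostRegressor
    "lightgbm": {"classifier": 0.44, "regressor": 0.41},  # LGBMClassifier vs LGBMRegressor
}

# Positive gap means direct classification beats regress-then-bucket.
gaps = {
    family: scores["classifier"] - scores["regressor"]
    for family, scores in accuracy.items()
}

for family, gap in gaps.items():
    print(f"{family:<9} classifier minus regressor accuracy: {gap:+.2f}")

best_family = max(accuracy, key=lambda fam: accuracy[fam]["classifier"])
print("best classifier family:", best_family)
```

In every family the classifier comes out ahead by 0.03 to 0.05 accuracy, and CatBoost has the highest classifier score.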

## 150-cluster variant

### *Data preparation*

In [173]:
info_user150 = info_user150.reset_index()

In [174]:
df = usr_targets.merge(info_user150, how = 'inner', on = ['user_id'])
df_train=df[df['is_male'] != 'NA']
df_train=df_train.dropna()
df_train['is_male'] = df_train['is_male'].astype('int')

In [175]:
categorical = ['reg_main','cpe_manufacturer_name','cpe_type_cd','cpe_model_os_type','cpe_model_name']

In [176]:
numerical = []
for i in df_train.columns:
    if i not in [*categorical,'age','is_male','user_id']:
        numerical.append(i)

## Preparing the samples for model selection

In [177]:

df_train=df_train.drop(['user_id'],axis=1)

In [178]:
train, test = train_test_split(df_train,test_size = 0.25,random_state=12345)

In [179]:
train_traget_age = train['age']
train_traget_is_male = train['is_male']

test_traget_age = test['age']
test_traget_is_male = test['is_male']

train_features= train.drop(['age','is_male',],axis=1)
test_features= test.drop(['age','is_male'],axis=1)

In [180]:
def age_bucket(x):
    return bisect.bisect_left([18,25,35,45,55,65], x)
train_traget_age_buck = train_traget_age.map(age_bucket)
test_traget_age_buck = test_traget_age.map(age_bucket)
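
It is worth being explicit about how `bisect.bisect_left` treats ages that fall exactly on an edge, since that decides which bucket a boundary age lands in. The sketch below reuses the same edges; the sample ages are made up for illustration.

```python
import bisect

AGE_EDGES = [18, 25, 35, 45, 55, 65]


def age_bucket(x):
    """Return the bucket index 0..6 for an age; an age equal to an edge stays in the lower bucket."""
    return bisect.bisect_left(AGE_EDGES, x)


# Illustrative ages around the edges (not taken from the dataset).
for age in (17, 18, 19, 25, 26, 65, 66):
    print(age, "->", age_bucket(age))

# Regression outputs are floats and are bucketed the same way.
print(25.3, "->", age_bucket(25.3))
```

With `bisect_left` each upper edge is inclusive: 18 goes to bucket 0 and 25 to bucket 1, so the display labels passed to `target_names` (for example `25-34`) are only approximate at the boundaries.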

In [181]:
# Data for training the final model
df_train_target_age=df_train['age']
df_train_target_is_male = df_train['is_male']
df_train_target_age_buck=df_train_target_age.map(age_bucket)

df_train_features=df_train.drop(['age','is_male',],axis=1)

***Standard scaler***

In [182]:
scaler = StandardScaler()
scaler.fit(train_features[numerical])
train_features[numerical]=scaler.transform(train_features[numerical])
test_features[numerical]=scaler.transform(test_features[numerical])
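
The scaler is fitted on the training part only and then applied to both parts, so no statistics from the test split leak into preprocessing. A minimal plain-Python sketch of the same idea for a single column follows; the column values are made up, and `standardize` is a hypothetical helper rather than part of scikit-learn.

```python
from statistics import fmean, pstdev

# Toy values of one numerical feature (illustrative only).
train_col = [1.0, 2.0, 3.0, 4.0]
test_col = [2.5, 10.0]

# Statistics come from the training column only.
mu = fmean(train_col)
# Population standard deviation; fall back to 1.0 for a constant column.
sigma = pstdev(train_col) or 1.0


def standardize(values):
    """Shift and scale values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_col)
test_scaled = standardize(test_col)

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The scaled training column has zero mean and unit variance by construction; the test column generally does not, which is expected.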





***OHE***

In [183]:
train_features_ohe=train_features.copy()
test_features_ohe=test_features.copy()
df_train_features_ohe = df_train_features.copy()


In [184]:
#ohe=OneHotEncoder(drop='first')
#ohe.fit(df_train_features[categorical])

#df_train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(df_train_features_ohe[categorical]).toarray()
#df_train_features_ohe=df_train_features_ohe.drop(categorical,axis=1)

ohe=OneHotEncoder(drop='first',handle_unknown='ignore')
ohe.fit(train_features[categorical])

train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(train_features_ohe[categorical]).toarray()
train_features_ohe=train_features_ohe.drop(categorical,axis=1)


test_features_ohe[ohe.get_feature_names_out()]=ohe.transform(test_features_ohe[categorical]).toarray()
test_features_ohe=test_features_ohe.drop(categorical,axis=1)
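
One side effect of combining `drop='first'` with `handle_unknown='ignore'` is worth keeping in mind: as far as I understand scikit-learn's behaviour, a category that was not seen during `fit` is encoded as all zeros, which is the same row the dropped first category gets. The sketch below mimics this in plain Python; `fit_one_hot` and `transform_one_hot` are hypothetical helpers, not scikit-learn functions, and the category values are made up.

```python
def fit_one_hot(values):
    """Return the categories that keep a column after dropping the first (sorted) one."""
    categories = sorted(set(values))
    return categories[1:]


def transform_one_hot(values, kept):
    """Encode values as 0/1 rows; dropped and unseen categories both become all zeros."""
    return [[1 if v == c else 0 for c in kept] for v in values]


# Toy training column (illustrative only).
train = ["android", "ios", "android", "other"]
kept = fit_one_hot(train)  # "android" is dropped as the first category

# "windows" never appeared in the training column.
encoded = transform_one_hot(["android", "ios", "windows"], kept)
for value, row in zip(["android", "ios", "windows"], encoded):
    print(f"{value:<8} {row}")
```

So on the test split a rare unseen device model is indistinguishable from the baseline category, which may matter for high-cardinality columns such as `cpe_model_name`.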



***OE***

In [185]:
train_features_oe=train_features.copy()
test_features_oe=test_features.copy()
df_train_features_oe = df_train_features.copy()


In [186]:
#oe=OrdinalEncoder()
#oe.fit(df_train_features_oe[categorical])
#df_train_features_oe[categorical]=oe.transform(df_train_features_oe[categorical])




In [187]:
oe=OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_features_oe[categorical])
train_features_oe[categorical]=oe.transform(train_features_oe[categorical])
test_features_oe[categorical]=oe.transform(test_features_oe[categorical])
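
With `handle_unknown='use_encoded_value'` and `unknown_value=-1`, categories that appear only in the test split get the code -1 instead of raising an error. The sketch below reproduces that mapping in plain Python; `fit_ordinal` and `transform_ordinal` are hypothetical helpers, and the manufacturer names are made up for illustration.

```python
def fit_ordinal(values):
    """Map each distinct category to a float code in sorted order."""
    return {cat: float(i) for i, cat in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown_value=-1.0):
    """Replace categories with their codes; unseen ones get unknown_value."""
    return [mapping.get(v, unknown_value) for v in values]


# Toy training column (illustrative only).
mapping = fit_ordinal(["Samsung", "Apple", "Xiaomi", "Apple"])

# "Nokia" was not present in the training column.
codes = transform_ordinal(["Apple", "Nokia", "Xiaomi"], mapping)
print(codes)

# Casting to str, as done before CatBoost, keeps the float formatting.
as_str = [str(c) for c in codes]
print(as_str)
```

After the later `astype(str)` step the categorical values reach CatBoost as strings such as `'0.0'` or `'-1.0'`, so all unseen categories share one label.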

# Selecting a model for predicting user gender

***Logistic regression***

In [188]:
%%time
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_is_male)
predict_proba=model.predict_proba(test_features_ohe)[:,1]
gini_logs_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
logs_params_gender=""
print(f'Gender Gini {gini_logs_gender - 1:2.3f}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Gender Gini 0.673
CPU times: user 3min 37s, sys: 14.3 s, total: 3min 51s
Wall time: 35.3 s


***CatBoost***

In [189]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [190]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [191]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_is_male, verbose = False,cat_features=categorical)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_CB_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
CB_params_gender=""
print(f'Gender Gini {gini_CB_gender - 1:2.3f}')

Gender Gini 0.738
CPU times: user 23min 55s, sys: 1min 40s, total: 25min 36s
Wall time: 3min 6s


# Selecting a model for predicting user age (bucketed)

***Logistic regression***

In [192]:
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_age_buck)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=12345)

In [193]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_ohe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.03      0.00      0.01       210
       18-25       0.56      0.24      0.33      7966
       25-34       0.46      0.63      0.53     21340
       35-44       0.38      0.54      0.44     19093
       45-54       0.38      0.14      0.20     10183
       55-65       0.44      0.13      0.20      5934
         65+       0.26      0.03      0.06      1356

    accuracy                           0.42     66082
   macro avg       0.36      0.25      0.25     66082
weighted avg       0.43      0.42      0.39     66082



***CatBoost***

In [194]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [195]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [196]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_age_buck, verbose = False,cat_features=categorical)


CPU times: user 2h 56min 32s, sys: 5min 22s, total: 3h 1min 55s
Wall time: 14min 56s


<catboost.core.CatBoostClassifier at 0x7f62005231c0>

In [197]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.57      0.35      0.43      7966
       25-34       0.51      0.66      0.57     21340
       35-44       0.42      0.55      0.48     19093
       45-54       0.39      0.23      0.29     10183
       55-65       0.45      0.23      0.30      5934
         65+       0.46      0.02      0.05      1356

    accuracy                           0.47     66082
   macro avg       0.40      0.29      0.30     66082
weighted avg       0.47      0.47      0.45     66082



## 300-cluster variant

### *Data preparation*

In [198]:
#usr_targets = targets.to_pandas()

In [199]:
info_user300 = info_user300.reset_index()
df = usr_targets.merge(info_user300, how = 'inner', on = ['user_id'])
df_train=df[df['is_male'] != 'NA']
df_train=df_train.dropna()
df_train['is_male'] = df_train['is_male'].astype('int')

In [200]:
categorical = ['reg_main','cpe_manufacturer_name','cpe_type_cd','cpe_model_os_type','cpe_model_name']

In [201]:
numerical = []
for i in df_train.columns:
    if i not in [*categorical,'age','is_male','user_id']:
        numerical.append(i)

## Preparing the samples for model selection

In [202]:

df_train=df_train.drop(['user_id'],axis=1)

In [203]:
train, test = train_test_split(df_train,test_size = 0.25,random_state=12345)

In [204]:
train_traget_age = train['age']
train_traget_is_male = train['is_male']

test_traget_age = test['age']
test_traget_is_male = test['is_male']

train_features= train.drop(['age','is_male',],axis=1)
test_features= test.drop(['age','is_male'],axis=1)

In [205]:
def age_bucket(x):
    return bisect.bisect_left([18,25,35,45,55,65], x)
train_traget_age_buck = train_traget_age.map(age_bucket)
test_traget_age_buck = test_traget_age.map(age_bucket)

In [206]:
# Data for training the final model
df_train_target_age=df_train['age']
df_train_target_is_male = df_train['is_male']
df_train_target_age_buck=df_train_target_age.map(age_bucket)

df_train_features=df_train.drop(['age','is_male',],axis=1)

***Standard scaler***

In [207]:
scaler = StandardScaler()
scaler.fit(train_features[numerical])
train_features[numerical]=scaler.transform(train_features[numerical])
test_features[numerical]=scaler.transform(test_features[numerical])





***OHE***

In [208]:
train_features_ohe=train_features.copy()
test_features_ohe=test_features.copy()
df_train_features_ohe = df_train_features.copy()


In [209]:
#ohe=OneHotEncoder(drop='first')
#ohe.fit(df_train_features[categorical])

#df_train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(df_train_features_ohe[categorical]).toarray()
#df_train_features_ohe=df_train_features_ohe.drop(categorical,axis=1)

ohe=OneHotEncoder(drop='first',handle_unknown='ignore')
ohe.fit(train_features[categorical])

train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(train_features_ohe[categorical]).toarray()
train_features_ohe=train_features_ohe.drop(categorical,axis=1)


test_features_ohe[ohe.get_feature_names_out()]=ohe.transform(test_features_ohe[categorical]).toarray()
test_features_ohe=test_features_ohe.drop(categorical,axis=1)



***OE***

In [210]:
train_features_oe=train_features.copy()
test_features_oe=test_features.copy()
df_train_features_oe = df_train_features.copy()


In [211]:
#oe=OrdinalEncoder()
#oe.fit(df_train_features_oe[categorical])
#df_train_features_oe[categorical]=oe.transform(df_train_features_oe[categorical])




In [212]:
oe=OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_features_oe[categorical])
train_features_oe[categorical]=oe.transform(train_features_oe[categorical])
test_features_oe[categorical]=oe.transform(test_features_oe[categorical])

# Selecting a model for predicting user gender

***Logistic regression***

In [213]:
%%time
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_is_male)
predict_proba=model.predict_proba(test_features_ohe)[:,1]
gini_logs_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
logs_params_gender=""
print(f'Gender Gini {gini_logs_gender - 1:2.3f}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Gender Gini 0.673
CPU times: user 3min 13s, sys: 13.6 s, total: 3min 26s
Wall time: 21.8 s


***CatBoost***

In [214]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [215]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [216]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_is_male, verbose = False,cat_features=categorical)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_CB_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
CB_params_gender=""
print(f'Gender Gini {gini_CB_gender - 1:2.3f}')

Gender Gini 0.738
CPU times: user 25min 15s, sys: 1min 16s, total: 26min 31s
Wall time: 2min 4s


# Selecting a model for predicting user age (bucketed)

***Logistic regression***

In [217]:
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_age_buck)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=12345)

In [218]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_ohe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.02      0.00      0.01       210
       18-25       0.55      0.24      0.34      7966
       25-34       0.46      0.63      0.53     21340
       35-44       0.38      0.54      0.44     19093
       45-54       0.39      0.14      0.21     10183
       55-65       0.45      0.13      0.20      5934
         65+       0.24      0.03      0.05      1356

    accuracy                           0.42     66082
   macro avg       0.35      0.25      0.25     66082
weighted avg       0.43      0.42      0.39     66082



***CatBoost***

In [219]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [220]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [221]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_age_buck, verbose = False,cat_features=categorical)


CPU times: user 3h 49s, sys: 5min 15s, total: 3h 6min 4s
Wall time: 15min 2s


<catboost.core.CatBoostClassifier at 0x7f62010dd1c0>

In [222]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.57      0.35      0.43      7966
       25-34       0.51      0.65      0.57     21340
       35-44       0.42      0.54      0.47     19093
       45-54       0.39      0.23      0.29     10183
       55-65       0.46      0.23      0.30      5934
         65+       0.51      0.03      0.06      1356

    accuracy                           0.47     66082
   macro avg       0.41      0.29      0.30     66082
weighted avg       0.47      0.47      0.45     66082



## 500-cluster variant

### *Data preparation*

In [223]:
#usr_targets = targets.to_pandas()

In [224]:
info_user500 = info_user500.reset_index()
df = usr_targets.merge(info_user500, how = 'inner', on = ['user_id'])
df_train=df[df['is_male'] != 'NA']
df_train=df_train.dropna()
df_train['is_male'] = df_train['is_male'].astype('int')

In [225]:
categorical = ['reg_main','cpe_manufacturer_name','cpe_type_cd','cpe_model_os_type','cpe_model_name']

In [226]:
numerical = []
for i in df_train.columns:
    if i not in [*categorical,'age','is_male','user_id']:
        numerical.append(i)

## Preparing the samples for model selection

In [227]:

df_train=df_train.drop(['user_id'],axis=1)

In [228]:
train, test = train_test_split(df_train,test_size = 0.25,random_state=12345)

In [229]:
train_traget_age = train['age']
train_traget_is_male = train['is_male']

test_traget_age = test['age']
test_traget_is_male = test['is_male']

train_features= train.drop(['age','is_male',],axis=1)
test_features= test.drop(['age','is_male'],axis=1)

In [230]:
def age_bucket(x):
    return bisect.bisect_left([18,25,35,45,55,65], x)
train_traget_age_buck = train_traget_age.map(age_bucket)
test_traget_age_buck = test_traget_age.map(age_bucket)

In [231]:
# Data for training the final model
df_train_target_age=df_train['age']
df_train_target_is_male = df_train['is_male']
df_train_target_age_buck=df_train_target_age.map(age_bucket)

df_train_features=df_train.drop(['age','is_male',],axis=1)

***Standard scaler***

In [232]:
scaler = StandardScaler()
scaler.fit(train_features[numerical])
train_features[numerical]=scaler.transform(train_features[numerical])
test_features[numerical]=scaler.transform(test_features[numerical])





***OHE***

In [233]:
train_features_ohe=train_features.copy()
test_features_ohe=test_features.copy()
df_train_features_ohe = df_train_features.copy()


In [234]:


ohe=OneHotEncoder(drop='first',handle_unknown='ignore')
ohe.fit(train_features[categorical])

train_features_ohe[ohe.get_feature_names_out()]=ohe.transform(train_features_ohe[categorical]).toarray()
train_features_ohe=train_features_ohe.drop(categorical,axis=1)


test_features_ohe[ohe.get_feature_names_out()]=ohe.transform(test_features_ohe[categorical]).toarray()
test_features_ohe=test_features_ohe.drop(categorical,axis=1)



***OE***

In [235]:
train_features_oe=train_features.copy()
test_features_oe=test_features.copy()
df_train_features_oe = df_train_features.copy()


In [236]:
oe=OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train_features_oe[categorical])
train_features_oe[categorical]=oe.transform(train_features_oe[categorical])
test_features_oe[categorical]=oe.transform(test_features_oe[categorical])

# Selecting a model for predicting user gender

***Logistic regression***

In [237]:
%%time
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_is_male)
predict_proba=model.predict_proba(test_features_ohe)[:,1]
gini_logs_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
logs_params_gender=""
print(f'Gender Gini {gini_logs_gender - 1:2.3f}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Gender Gini 0.673
CPU times: user 5min 32s, sys: 19.7 s, total: 5min 51s
Wall time: 39.4 s


***CatBoost***

In [238]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [239]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [240]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_is_male, verbose = False,cat_features=categorical)
predict_proba=model.predict_proba(test_features_oe)[:,1]
gini_CB_gender=2 * m.roc_auc_score(test_traget_is_male, predict_proba)
CB_params_gender=""
print(f'Gender Gini {gini_CB_gender - 1:2.3f}')

Gender Gini 0.738
CPU times: user 26min 5s, sys: 1min 13s, total: 27min 19s
Wall time: 2min 15s


# Selecting a model for predicting user age (bucketed)

***Logistic regression***

In [241]:
model = LogisticRegression(random_state=12345)
model.fit(train_features_ohe,train_traget_age_buck)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=12345)

In [242]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_ohe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.06      0.01      0.02       210
       18-25       0.55      0.24      0.34      7966
       25-34       0.46      0.63      0.53     21340
       35-44       0.37      0.55      0.44     19093
       45-54       0.38      0.14      0.21     10183
       55-65       0.43      0.12      0.19      5934
         65+       0.24      0.03      0.06      1356

    accuracy                           0.42     66082
   macro avg       0.36      0.25      0.26     66082
weighted avg       0.43      0.42      0.39     66082



***CatBoost***

In [243]:
train_features_oe[categorical]=train_features_oe[categorical].astype(str)

In [244]:
test_features_oe[categorical]=test_features_oe[categorical].astype(str)

In [245]:
%%time
model = CatBoostClassifier()
model.fit(train_features_oe,train_traget_age_buck, verbose = False,cat_features=categorical)


CPU times: user 3h 6min 8s, sys: 5min 25s, total: 3h 11min 34s
Wall time: 16min 30s


<catboost.core.CatBoostClassifier at 0x7f79d163ae20>

In [246]:
print(m.classification_report(test_traget_age_buck, model.predict(test_features_oe), \
                            target_names = ['<18', '18-25','25-34', '35-44', '45-54', '55-65', '65+']))

              precision    recall  f1-score   support

         <18       0.00      0.00      0.00       210
       18-25       0.57      0.35      0.43      7966
       25-34       0.51      0.66      0.57     21340
       35-44       0.42      0.54      0.47     19093
       45-54       0.40      0.23      0.29     10183
       55-65       0.46      0.23      0.31      5934
         65+       0.53      0.03      0.06      1356

    accuracy                           0.47     66082
   macro avg       0.41      0.29      0.31     66082
weighted avg       0.47      0.47      0.45     66082



# Conclusions

The best model for predicting a user's gender and age from cookies turned out to be CatBoostClassifier, for both targets. I use the 50-cluster site clustering, since quality stayed unchanged as the number of clusters grew. Next, I will try to improve these models by tuning their hyperparameters; that work is in MTS_ML_CUP_sibrikova_public_submition.
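
As a sanity check on the claim about cluster count, the sketch below collects the headline metrics for the 150-, 300- and 500-cluster variants and measures how much each one varies. The values are copied from the outputs above at their printed precision; the dictionary keys are hypothetical names chosen for this illustration.

```python
# Headline metrics per clustering variant, copied from the outputs above.
results = {
    150: {"gini_logreg": 0.673, "gini_catboost": 0.738, "acc_logreg": 0.42, "acc_catboost": 0.47},
    300: {"gini_logreg": 0.673, "gini_catboost": 0.738, "acc_logreg": 0.42, "acc_catboost": 0.47},
    500: {"gini_logreg": 0.673, "gini_catboost": 0.738, "acc_logreg": 0.42, "acc_catboost": 0.47},
}

metrics = ["gini_logreg", "gini_catboost", "acc_logreg", "acc_catboost"]

# Spread = max - min of a metric across the clustering variants.
spread = {
    metric: max(r[metric] for r in results.values()) - min(r[metric] for r in results.values())
    for metric in metrics
}

for metric, value in spread.items():
    print(f"{metric:<14} spread across variants: {value:.3f}")
```

At the printed precision none of the four metrics moves between 150 and 500 clusters, which is why the smaller clustering is the cheaper choice.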