We will use the data from [**Black Friday Sales**](https://www.kaggle.com/c/gb-black-friday-sales).

### Data fields

* `User_ID` - customer ID
* `Product_ID` - product ID
* `Gender` - customer's gender
* `Age` - customer's age (given as a range, e.g. `26-35`)
* `Occupation` - customer's occupation
* `City_Category` - city of residence (category)
* `Stay_In_Current_City_Years` - how long the customer has lived in the current city
* `Marital_Status` - customer's marital status
* `Product_Category_1` - product category 1
* `Product_Category_2` - product category 2
* `Product_Category_3` - product category 3

In [95]:
import pandas as pd
import numpy as np
import warnings
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import f1_score, roc_auc_score, precision_score, classification_report, precision_recall_curve, confusion_matrix
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

warnings.filterwarnings('ignore')

In [96]:
df = pd.read_csv('black-friday-train.csv')
df

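| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category_1 | Product_Category_2 | Product_Category_3 | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1004085 | P00075742 | F | 26-35 | 6 | A | 1 | 0 | 8 | 14.0 | NaN | 7803 |
| 1 | 1005491 | P00234842 | M | 18-25 | 7 | A | 1 | 0 | 5 | 6.0 | 16.0 | 6903 |
| 2 | 1003499 | P00220142 | M | 26-35 | 3 | A | 2 | 0 | 1 | 15.0 | NaN | 15773 |
| 3 | 1000097 | P00211242 | F | 36-45 | 3 | C | 3 | 0 | 8 | 12.0 | NaN | 8116 |
| 4 | 1005802 | P00327142 | F | 26-35 | 0 | A | 4+ | 0 | 8 | 15.0 | NaN | 6144 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79995 | 1000919 | P00217942 | F | 36-45 | 1 | C | 1 | 0 | 5 | NaN | NaN | 5231 |
| 79996 | 1001733 | P00255742 | M | 18-25 | 14 | B | 0 | 1 | 3 | 4.0 | NaN | 10904 |
| 79997 | 1002674 | P00209842 | M | 26-35 | 4 | A | 1 | 0 | 5 | 8.0 | NaN | 6953 |
| 79998 | 1005599 | P00171842 | M | 36-45 | 7 | A | 1 | 0 | 8 | 14.0 | NaN | 5888 |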


In [97]:
df.Product_ID.value_counts()

P00265242    272
P00025442    225
P00112142    224
P00058042    221
P00110742    220
            ... 
P00068542      1
P00312442      1
P00361042      1
P00056542      1
P00270442      1
Name: Product_ID, Length: 3256, dtype: int64

Let's set up a classification task. Take product P00265242 and label the users who bought it as class 1 (Positive); everyone else gets class 0 (Unlabeled).

In [98]:
users_who_bought = df.loc[df['Product_ID'] == 'P00265242', 'User_ID'].unique()

In [99]:
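# 1 for every row of a user who bought the product, 0 otherwise
# (a single assignment avoids chained indexing via df[...].loc[...] = ...)
df['bought_product'] = df['User_ID'].isin(users_who_bought).astype(int)
df['bought_product'].value_counts()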

0    74859
1     5141
Name: bought_product, dtype: int64
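The labeling rule is per user, not per row: every row of a buyer becomes positive, including rows for other products. A minimal sketch on a toy purchase log (the values and the product name `"A"` are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy purchase log with hypothetical values
toy = pd.DataFrame({
    "User_ID":    [1, 1, 2, 3, 3, 4],
    "Product_ID": ["A", "B", "B", "A", "C", "C"],
})

target_product = "A"  # stands in for P00265242
buyers = toy.loc[toy["Product_ID"] == target_product, "User_ID"].unique()

# Every row of a buyer is labeled 1, whatever product that row describes
toy["bought_product"] = toy["User_ID"].isin(buyers).astype(int)
print(toy)
```

Users 1 and 3 bought product `"A"`, so all four of their rows are positive, while the rows of users 2 and 4 stay at 0.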

In [100]:
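# map '4+' to 4 and store the integer column back into df
# (astype returns a new Series, so the result has to be assigned)
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].replace('4+', '4').astype(int)
df['Stay_In_Current_City_Years']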

0        1
1        1
2        2
3        3
4        4
        ..
79995    1
79996    0
79997    1
79998    1
79999    0
Name: Stay_In_Current_City_Years, Length: 80000, dtype: int64
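`astype` returns a new Series and leaves the DataFrame untouched, so the converted column only reaches the model if it is assigned back. A small sketch on toy values (hypothetical data; only the column name comes from the dataset):

```python
import pandas as pd

col = "Stay_In_Current_City_Years"
toy = pd.DataFrame({col: ["1", "4+", "0", "3"]})

# The conversion produces a new Series ...
converted = toy[col].replace("4+", "4").astype(int)

# ... while the column inside the frame is still non-integer
was_integer_before = pd.api.types.is_integer_dtype(toy[col])

# Assigning the result back is what actually changes the frame
toy[col] = converted
is_integer_after = pd.api.types.is_integer_dtype(toy[col])

print(was_integer_before, is_integer_after)
```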

In [101]:
X_train, X_test, y_train, y_test = train_test_split(df, df['bought_product'],
                                                  shuffle=True,  
                                                  stratify=df['bought_product'], 
                                                  random_state=42)

In [102]:
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
       'Product_Category_2', 'Product_Category_3', 'Purchase',
       'bought_product'],
      dtype='object')

In [103]:
disbalance = df['bought_product'].value_counts()[0] / df['bought_product'].value_counts()[1]
categories = ['Product_ID', 'Gender', 'Age', 'City_Category']
features = ['Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
               'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
               'Product_Category_2', 'Product_Category_3', 'Purchase']

clf = CatBoostClassifier(silent=True, 
                         random_state=42, 
                         cat_features=categories,
                         class_weights=[1, disbalance])

clf.fit(X_train[features], y_train)

y_train_prob = clf.predict_proba(X_train[features])
y_test_prob = clf.predict_proba(X_test[features])
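The weights `[1, disbalance]` scale the positive class by the negative-to-positive ratio, so both classes contribute the same total weight to the loss. A quick check using the class counts printed earlier (the variable names here are illustrative):

```python
import pandas as pd

# Class counts of `bought_product` as reported above
counts = pd.Series({0: 74859, 1: 5141})

ratio = counts.loc[0] / counts.loc[1]  # same quantity as `disbalance`
class_weights = pd.Series({0: 1.0, 1: ratio})

# Total weight per class = number of rows * weight of one row
total_weight = counts * class_weights
print(round(ratio, 3))
print(total_weight)
```

With these weights each positive row counts roughly as much as 14.6 negative rows.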

In [104]:
precision, recall, thresholds = precision_recall_curve(y_test, y_test_prob[:,1])
fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))

Best Threshold=0.730848, F-Score=0.612, Precision=0.604, Recall=0.619
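The threshold search above can be wrapped in a helper that also guards two edge cases: `precision_recall_curve` returns one more precision/recall point than thresholds, and the F-score is undefined when precision and recall are both zero. A sketch on toy scores (the helper name `best_f1_threshold` and the data are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return (threshold, f1, precision, recall) at the F1-optimal threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last (precision=1, recall=0) point has no threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Define F1 as 0 where precision + recall == 0 instead of producing NaN
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    ix = int(np.argmax(fscore))
    return thresholds[ix], fscore[ix], precision[ix], recall[ix]

# Toy labels and scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

best_threshold, best_f, best_p, best_r = best_f1_threshold(y_true, scores)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f'
      % (best_threshold, best_f, best_p, best_r))
```

On the toy data the best threshold is 0.35: three objects are predicted positive, two of them correctly, and both true positives are found.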


Now relabel the data so that some of the buyers are hidden: a random half of the rows with `bought_product == 1` keep class 1 (Positive), and everything else becomes class 0 (Unlabeled).

In [105]:
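# rows of users who bought the product
inds = df.loc[df.bought_product == 1].index
# sample half of them (no seed is set, so the sampled half differs between runs)
index = np.random.choice(inds, len(inds)//2, replace=False)

df['y'] = 0
# label-based assignment instead of chained df['y'].iloc[...] = ...
df.loc[index, 'y'] = 1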
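The relabeling step can be made reproducible by drawing the labeled half from a seeded generator. A sketch on toy labels (the seed value and the toy frame are assumptions; the notebook itself samples without a seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

# Toy ground truth: four positives, four negatives
toy = pd.DataFrame({"bought_product": [1, 0, 1, 1, 0, 1, 0, 0]})

# Index labels of the true positives
pos_idx = toy.index[toy["bought_product"] == 1]

# Keep the positive label for a random half, hide the rest among the unlabeled
keep = rng.choice(pos_idx, size=len(pos_idx) // 2, replace=False)

toy["y"] = 0
toy.loc[keep, "y"] = 1
print(toy)
```

Every row with `y == 1` is a true positive, while `y == 0` mixes true negatives with the hidden positives.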


In [106]:
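# split the data into train/test
# NOTE: only `bought_product` is dropped, so X still contains the target column `y` (and `User_ID`)
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['bought_product']), df['y'], random_state=42)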

In [107]:
clf = CatBoostClassifier(silent=True, 
                         random_state=42, 
                         cat_features=categories,
                         class_weights=[1, disbalance])

clf.fit(X_train, y_train)

y_train_prob = clf.predict_proba(X_train)
y_df_prob = clf.predict_proba(df.drop(columns=['bought_product']))

In [108]:
y_df_prob[:, 1]

array([3.10480884e-05, 3.43089770e-05, 3.10467374e-05, ...,
       3.15949633e-05, 2.95160340e-05, 4.25791751e-05])
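Because `df.drop(columns=['bought_product'])` keeps every other column, the frame passed to `fit` and `predict_proba` in this experiment also holds `y` itself. One way to guard against that is to select features by name and refuse any target column. The helper below is a hypothetical sketch, not part of the notebook, and assumes an explicit feature list such as `features` from the first experiment is the intended model input:

```python
import pandas as pd

def select_features(frame, feature_names, target_columns):
    """Return only the listed feature columns, refusing any target column."""
    leaked = sorted(set(feature_names) & set(target_columns))
    if leaked:
        raise ValueError(f"target columns used as features: {leaked}")
    return frame[list(feature_names)]

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Gender": ["F", "M", "M"],
    "Purchase": [7803, 6903, 15773],
    "bought_product": [1, 0, 1],
    "y": [1, 0, 0],
})

X = select_features(toy, ["Gender", "Purchase"],
                    target_columns=["bought_product", "y"])
print(list(X.columns))
```

In the notebook's terms this would correspond to fitting on `X_train[features]` rather than on the whole frame.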

In [109]:
precision, recall, thresholds = precision_recall_curve(df['bought_product'], y_df_prob[:,1])
fscore = (2  * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))


Best Threshold=0.999863, F-Score=0.667, Precision=1.000, Recall=0.500


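With all positives labeled, the classification metrics (on the test split) were:  
F-Score=0.612, Precision=0.604, Recall=0.619

With half of the labels (on the full dataset, scored against `bought_product`):  
Best Threshold=0.999863, F-Score=0.667, Precision=1.000, Recall=0.500

By F-score the model did not get worse. The two results are not directly comparable, though: the first is measured on the held-out test split, the second on the whole dataset, training rows included, and the frame passed to `fit` in the second experiment still contains the `y` column. Precision=1.000 with Recall=0.500 is what a model that simply reproduces `y` would score, since `y` marks half of the true positives.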
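The second set of numbers has a simple arithmetic reading. If predictions coincide with the partial labels `y`, every predicted positive is a true positive and exactly half of the true positives are found. A worked check on toy arrays (made-up data, sized only to keep the fractions exact):

```python
import numpy as np

# Toy ground truth: 4 positives, 6 negatives
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# Partial labels: only half of the positives are marked
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# A model that reproduces `y` exactly
pred = y.copy()

tp = int(((pred == 1) & (true_labels == 1)).sum())
precision = tp / pred.sum()        # no false positives
recall = tp / true_labels.sum()    # half of the positives recovered
fscore = 2 * precision * recall / (precision + recall)

print('F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (fscore, precision, recall))
```

The result matches the reported F-Score=0.667, Precision=1.000, Recall=0.500.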