# Finding the most similar products

**Task**: develop an algorithm that, for every product in validation.csv, suggests several of the most similar products from base.

**Data**:  
- *base.csv* - an anonymized set of products. Each product is represented by a unique id (0-base, 1-base, 2-base, …) and a 72-dimensional feature vector.
- *train.csv* - the training dataset. Each row is one product with a known unique id (0-query, 1-query, …), a feature vector, and the id of the product from *base.csv* that is most similar to it (according to experts).
- *validation.csv* - a dataset of products (unique id and feature vector) for which the closest products from *base.csv* have to be found.
- *validation_answer.csv* - the correct answers for the previous file.

## Contents
1. [Loading and exploring the data](#step1).
2. [Data preparation](#step2).  
3. [Applying approximate nearest-neighbour search](#step3).
4. [Applying a classification algorithm](#step4).



In [None]:
#!pip install faiss-cpu --no-cache

In [None]:
#!pip install catboost

In [None]:
import numpy as np
import pandas as pd
import faiss
#from numpy.core.multiarray import ascontiguousarray
from sklearn.preprocessing import StandardScaler
#from google.colab import drive

In [None]:
from catboost import CatBoostClassifier
from catboost import cv, Pool
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings("ignore")

<a id='step1'></a>
## 1. Loading and exploring the data

In [None]:
#drive.mount('/content/gdrive')

In [None]:
try:
    # Google Drive paths (used when the notebook runs in Colab)
    df_base = pd.read_csv("/content/gdrive/MyDrive/data/base.csv", index_col=0)
    df_train = pd.read_csv("/content/gdrive/MyDrive/data/train.csv", index_col=0)
except FileNotFoundError:
    # Fall back to local copies of the data
    df_base = pd.read_csv("data/base.csv", index_col=0)
    df_train = pd.read_csv("data/train.csv", index_col=0)


In [None]:
df_base.describe()

In [None]:
df_base.head()

In [None]:
df_base.duplicated().sum()

Almost three million base products are available for training. All features are numeric, and there are no missing values or duplicates.

In [None]:
df_train.describe()

In [None]:
df_train.sample(5)

In [None]:
df_train.duplicated().sum()

The training set contains 100000 products; it has no missing values or duplicates either.

Let us look at the feature distributions.

In [None]:
fig, axes = plt.subplots(24,3,figsize=(16,72))

for i in range(72):
    sns.histplot(df_base[str(i)].sample(1000), ax=axes[i//3,i%3], alpha=0.3)
    sns.histplot(df_train[str(i)].sample(1000), ax=axes[i//3,i%3], alpha=0.3)
plt.subplots_adjust(hspace=0.5)

The feature distributions in the base and training sets are roughly the same. Most features are normally distributed. The exceptions are features 6, 21, 25, 33, 44, 59, 65 and 70.

<a id='step2'></a>
## 2. Data preparation

Since nearest-neighbour search relies on distances, which are sensitive to feature scale, we standardize the data.

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(df_base);

In [None]:
df_base_scaled = pd.DataFrame(scaler.transform(df_base), index=df_base.index)
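
As a minimal sketch of what the standardization step does, the example below reproduces it with plain NumPy on a toy matrix. The arrays `base` and `query` are hypothetical stand-ins for the real data; the point is that the mean and standard deviation are estimated on the base set only and then reused for the queries.

```python
import numpy as np

# Hypothetical toy "base" matrix: 4 items, 2 features on very different scales.
base = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0],
                 [4.0, 700.0]])
query = np.array([[2.5, 400.0]])

# Statistics come from the base set only (population std, ddof=0,
# which is what StandardScaler uses).
mean = base.mean(axis=0)
std = base.std(axis=0)

base_scaled = (base - mean) / std
query_scaled = (query - mean) / std  # queries reuse the base statistics

print(base_scaled.mean(axis=0), base_scaled.std(axis=0))
print(query_scaled)
```

After scaling, every feature contributes to the L2 distance on a comparable scale, so no single feature dominates the neighbour search.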

<a id="step3"></a>
## 3. Applying approximate nearest-neighbour search

In [None]:
dims = df_base.shape[1]

In [None]:
n_cells = 100
quantizer = faiss.IndexFlatL2(dims)

In [None]:
idx_l2 = faiss.IndexIVFFlat(quantizer, dims, n_cells)

In [None]:
idx_l2.train(np.ascontiguousarray(df_base_scaled.values).astype('float32'))
idx_l2.add(np.ascontiguousarray(df_base_scaled.values).astype('float32'))
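
An `IndexIVFFlat` index splits the base vectors into `n_cells` cells and, at query time, scans only the `nprobe` cells whose centroids are closest to the query. The sketch below illustrates that idea in plain NumPy on random data. It is a simplified illustration under stated assumptions, not how FAISS is implemented: the centroids are sampled at random instead of being trained with k-means, and all names (`ivf_search`, `assign`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(2000, 8)).astype("float32")   # toy base vectors
query = rng.normal(size=8).astype("float32")          # one toy query

n_cells, nprobe, k = 20, 5, 10

# Assumption: random base vectors serve as cell centroids (FAISS uses k-means).
centroids = base[rng.choice(len(base), size=n_cells, replace=False)]


def sq_dists(points, q):
    """Squared L2 distance from every row of `points` to the vector `q`."""
    return ((points - q) ** 2).sum(axis=1)


# Inverted lists: each base vector is assigned to its nearest centroid.
assign = np.argmin(
    ((base[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1
)


def ivf_search(q, nprobe, k):
    """Scan only the `nprobe` cells closest to `q` and return the top-k ids."""
    probe_cells = np.argsort(sq_dists(centroids, q))[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, probe_cells))
    d = sq_dists(base[candidates], q)
    order = np.argsort(d)[:k]
    return candidates[order], d[order]


approx_ids, _ = ivf_search(query, nprobe, k)
exact_ids = np.argsort(sq_dists(base, query))[:k]      # brute-force reference
recall = len(set(approx_ids) & set(exact_ids)) / k
print(f"recall@{k} with nprobe={nprobe}: {recall:.2f}")
```

Raising `nprobe` trades speed for recall; with `nprobe == n_cells` the search degenerates into an exact scan of the whole base.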

In [None]:
base_index = {k:v for k,v in enumerate(df_base.index.to_list())}

In [None]:
targets = df_train['Target']
df_train.drop('Target', axis=1, inplace=True)

In [22]:
df_train_scaled = pd.DataFrame(scaler.transform(df_train), index=df_train.index)

In [None]:
idx_l2.nprobe = 40

In [None]:
# search returns (distances, indices) of the 10 nearest base vectors per query
distances, idx = idx_l2.search(np.ascontiguousarray(df_train_scaled.values).astype('float32'), 10)

In [None]:
acc = 0
for target, el in zip(targets.values.tolist(), idx.tolist()):
  acc += int(target in [base_index[r] for r in el])

print(100 * acc/ len(idx))
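
The loop above computes accuracy@10: the share of queries whose expert-chosen product appears among the retrieved candidates. A small self-contained sketch of the same metric on hypothetical toy data (the function name `accuracy_at_k` and the ids are made up for illustration):

```python
def accuracy_at_k(targets, candidates, k):
    """Share of queries whose target id is among the first k candidates."""
    hits = sum(t in cands[:k] for t, cands in zip(targets, candidates))
    return hits / len(targets)


# Hypothetical toy data: three queries with three ranked candidates each.
targets = ["7-base", "3-base", "9-base"]
candidates = [
    ["7-base", "1-base", "2-base"],  # target found at rank 1
    ["5-base", "3-base", "8-base"],  # target found at rank 2
    ["4-base", "6-base", "0-base"],  # target not retrieved
]

print(accuracy_at_k(targets, candidates, k=1))  # one of three queries hits
print(accuracy_at_k(targets, candidates, k=3))  # two of three queries hit
```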

In [None]:
idx_df = pd.DataFrame(idx)

In [None]:
idx_df.head()

In [None]:
#idx_df.to_csv('/content/gdrive/MyDrive/data/idx_df.csv')

Approximate nearest-neighbour search gave an accuracy@10 of 69.664%.  
This value was reached by splitting the base set into 100 clusters and searching for similar products in the 40 nearest ones (`nprobe = 40`).  
Next, a classification algorithm is applied to the retrieved candidates in order to pick the 5 closest vectors.

<a id='step4'></a>
## 4. Applying a classification algorithm

In [None]:
#try:
#    idx_df = pd.read_csv('/content/gdrive/MyDrive/data/idx_df.csv', index_col=0)
#except:
#    idx_df = pd.read_csv('idx_df.csv', index_col=0)

Let us build the training data for the classifier. We join the data from base and train; the target is whether the product from base is the right match for the product from train according to the experts.

In [None]:
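# One row per (query, candidate) pair: each query id is repeated once per retrieved neighbour.
# Built in one step instead of concatenating inside a loop, which is very slow for 100000 queries.
df_clf = pd.DataFrame({
    "train_idx": np.repeat(targets.index.to_numpy(), idx_df.shape[1]),
    "base_vecs": idx_df.to_numpy().ravel(),
})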

In [None]:
#df_clf.to_csv('/content/gdrive/MyDrive/data/train_base_df.csv')

In [19]:
try:
    df_clf = pd.read_csv('/content/gdrive/MyDrive/data/train_base_df.csv', index_col=0)
except FileNotFoundError:
    # Fall back to a local copy
    df_clf = pd.read_csv('train_base_df.csv', index_col=0)

In [20]:
df_clf = pd.merge(df_clf, df_base_scaled.reset_index(), left_on='base_vecs', right_index=True, how='left')

In [23]:
df_clf = pd.merge(df_clf, df_train_scaled, left_on='train_idx', right_index=True)

In [24]:
df_clf = pd.merge(df_clf, targets, left_on='train_idx', right_index=True)

In [25]:
# 1 if the candidate is the expert-chosen match for the query, else 0
df_clf['clf_target'] = (df_clf['Id'] == df_clf['Target']).astype(int)
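
The labelling logic can be illustrated on a toy example. The sketch below uses hypothetical ids and only the id columns (no feature vectors); it shows how each (query, candidate) pair gets a binary label, and that a query whose expert match was not retrieved contributes negative rows only.

```python
import pandas as pd

# Hypothetical toy data: two queries with three retrieved candidates each.
candidates = pd.DataFrame({
    "train_idx": ["0-query"] * 3 + ["1-query"] * 3,
    "Id": ["5-base", "7-base", "2-base", "1-base", "4-base", "9-base"],
})
# Expert-chosen match per query; "3-base" was not retrieved for "1-query".
targets = pd.Series({"0-query": "7-base", "1-query": "3-base"}, name="Target")

pairs = candidates.merge(targets, left_on="train_idx", right_index=True)
pairs["clf_target"] = (pairs["Id"] == pairs["Target"]).astype(int)

# 1 if the query has a positive pair among its candidates, 0 otherwise.
has_positive = pairs.groupby("train_idx")["clf_target"].max()
print(pairs)
print(has_positive)
```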

In [None]:
#df_clf.to_csv('/content/gdrive/MyDrive/data/train_base_df.csv')

In [26]:
df_clf.head()

| | train_idx | base_vecs | Id | 0_x | 1_x | 2_x | 3_x | 4_x | 5_x | 6_x | ... | 64_y | 65_y | 66_y | 67_y | 68_y | 69_y | 70_y | 71_y | Target | clf_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0-query | 598613 | 675816-base | 0.898824 | 1.833764 | 0.318692 | -1.164895 | 1.889056 | -0.185822 | 0.165148 | ... | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 | 675816-base | 1 |
| 1 | 0-query | 755584 | 877519-base | 1.165669 | 2.229054 | -0.704335 | -1.734815 | 1.900754 | -0.144193 | -0.672585 | ... | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 | 675816-base | 0 |
| 2 | 0-query | 336969 | 361564-base | 1.044863 | 1.884666 | -0.007851 | -2.048885 | 1.42315 | -0.281895 | -1.352459 | ... | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 | 675816-base | 0 |
| 3 | 0-query | 1934845 | 2725256-base | 1.172481 | 1.973162 | -0.269434 | -1.615983 | 1.082829 | -0.546268 | 0.107806 | ... | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 | 675816-base | 0 |
| 4 | 0-query | 13374 | 13406-base | 1.165306 | 2.40123 | -0.227657 | -1.995106 | 1.589433 | 0.04218 | -1.237574 | ... | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 | 675816-base | 0 |


As a result, we get the training data:

In [27]:
X = df_clf.drop(['train_idx', 'base_vecs', 'Id', 'Target', 'clf_target'], axis=1)
X.head()

| | 0_x | 1_x | 2_x | 3_x | 4_x | 5_x | 6_x | 7_x | 8_x | 9_x | ... | 62_y | 63_y | 64_y | 65_y | 66_y | 67_y | 68_y | 69_y | 70_y | 71_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.898824 | 1.833764 | 0.318692 | -1.164895 | 1.889056 | -0.185822 | 0.165148 | -0.110005 | -2.43123 | -1.251412 | ... | -0.866975 | 1.274319 | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 |
| 1 | 1.165669 | 2.229054 | -0.704335 | -1.734815 | 1.900754 | -0.144193 | -0.672585 | 0.136535 | -2.022735 | -0.149625 | ... | -0.866975 | 1.274319 | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 |
| 2 | 1.044863 | 1.884666 | -0.007851 | -2.048885 | 1.42315 | -0.281895 | -1.352459 | 0.464914 | -1.751272 | -0.707837 | ... | -0.866975 | 1.274319 | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 |
| 3 | 1.172481 | 1.973162 | -0.269434 | -1.615983 | 1.082829 | -0.546268 | 0.107806 | -0.081237 | -1.115532 | -0.590654 | ... | -0.866975 | 1.274319 | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 |
| 4 | 1.165306 | 2.40123 | -0.227657 | -1.995106 | 1.589433 | 0.04218 | -1.237574 | 0.536234 | -1.574161 | -0.095937 | ... | -0.866975 | 1.274319 | -0.02441 | -1.173481 | -1.035388 | 0.197184 | -0.200786 | 0.906575 | 0.995267 | 0.522963 |


Target variable:

In [28]:
y = df_clf['clf_target']

In [29]:
y.value_counts(normalize=True)

0    0.93047
1    0.06953
Name: clf_target, dtype: float64

The resulting classes are heavily imbalanced, which has to be taken into account when training the model.  
CatBoostClassifier will be used as the classification algorithm.
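
One common way to account for such an imbalance is to weight the positive class by the ratio of negatives to positives. The sketch below only computes that weight from the class shares printed above. Passing it to CatBoost as `scale_pos_weight` is an assumption added here for illustration; the notebook's own `params` do not set it, and `weighted_params` is a hypothetical name.

```python
# Class shares reported by y.value_counts(normalize=True) above.
neg_share, pos_share = 0.93047, 0.06953

# Ratio of negatives to positives, a common choice for the positive-class weight.
pos_weight = neg_share / pos_share
print(round(pos_weight, 2))  # 13.38

# Assumption: the weight would be passed to CatBoostClassifier via
# `scale_pos_weight`; this dict is illustrative and is not used by the notebook.
weighted_params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "scale_pos_weight": pos_weight,
}
```

CatBoost also offers `auto_class_weights` as an alternative to setting the weight by hand; whether either option improves the final ranking here would have to be checked experimentally.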

In [30]:
train_data = Pool(data=X,
                  label=y)

In [31]:
params = {'eval_metric': 'AUC',
          'loss_function':'Logloss',
          'learning_rate':0.15,
          'random_seed':42,
          'verbose':20}

In [32]:
cv_data = cv(params = params,
           pool = train_data,
           fold_count = 5,
           shuffle = True,
           partition_random_seed = 0,
           stratified = True,
           verbose = False,
           early_stopping_rounds = 200)

In [33]:
cv_data

| | iterations | test-AUC-mean | test-AUC-std | test-Logloss-mean | test-Logloss-std | train-Logloss-mean | train-Logloss-std |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.528090 | 0.003494 | 0.544136 | 0.000017 | 0.544123 | 0.000015 |
| 1 | 1 | 0.634865 | 0.000696 | 0.440109 | 0.000117 | 0.440094 | 0.000142 |
| 2 | 2 | 0.680110 | 0.012101 | 0.373018 | 0.000990 | 0.372986 | 0.001010 |
| 3 | 3 | 0.701079 | 0.024421 | 0.328688 | 0.002449 | 0.328625 | 0.002501 |
| 4 | 4 | 0.752254 | 0.007398 | 0.296259 | 0.001322 | 0.296176 | 0.001376 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | 0.977244 | 0.000446 | 0.078658 | 0.000950 | 0.066871 | 0.001576 |
| 996 | 996 | 0.977247 | 0.000446 | 0.078655 | 0.000952 | 0.066851 | 0.001580 |
| 997 | 997 | 0.977252 | 0.000443 | 0.078647 | 0.000949 | 0.066833 | 0.001577 |
| 998 | 998 | 0.977266 | 0.000424 | 0.078622 | 0.000906 | 0.066803 | 0.001549 |
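
The cross-validation table reports ROC AUC, which can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative one. The sketch below computes it that way on hypothetical toy labels and scores; the function `roc_auc` is written for illustration and is not part of CatBoost.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 2 positives, 3 negatives.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.5, 0.1]
print(roc_auc(labels, scores))  # 5 of the 6 pairs are ordered correctly
```

Because it depends only on how positives are ranked relative to negatives, this metric does not change with the class ratio, which makes it a reasonable choice for imbalanced data like this.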


In [34]:
clf_model = CatBoostClassifier(**params)

In [35]:
clf_model.fit(X,y)

0:	total: 117ms	remaining: 1m 56s
20:	total: 2.44s	remaining: 1m 53s
40:	total: 4.83s	remaining: 1m 52s
60:	total: 7.11s	remaining: 1m 49s
80:	total: 9.43s	remaining: 1m 47s
100:	total: 12.2s	remaining: 1m 48s
120:	total: 14.4s	remaining: 1m 44s
140:	total: 16.7s	remaining: 1m 41s
160:	total: 19s	remaining: 1m 38s
180:	total: 21.3s	remaining: 1m 36s
200:	total: 23.6s	remaining: 1m 33s
220:	total: 25.8s	remaining: 1m 30s
240:	total: 28.1s	remaining: 1m 28s
260:	total: 30.3s	remaining: 1m 25s
280:	total: 32.5s	remaining: 1m 23s
300:	total: 34.7s	remaining: 1m 20s
320:	total: 37.1s	remaining: 1m 18s
340:	total: 39.3s	remaining: 1m 15s
360:	total: 41.5s	remaining: 1m 13s
380:	total: 43.7s	remaining: 1m 11s
400:	total: 45.9s	remaining: 1m 8s
420:	total: 48s	remaining: 1m 6s
440:	total: 50.2s	remaining: 1m 3s
460:	total: 52.5s	remaining: 1m 1s
480:	total: 54.7s	remaining: 59s
500:	total: 56.9s	remaining: 56.7s
520:	total: 59.1s	remaining: 54.4s
540:	total: 1m 1s	remaining: 52s
560:	total: 1m

<catboost.core.CatBoostClassifier at 0x7fef48420700>

In [36]:
y_pred = clf_model.predict_proba(X)

In [37]:
# Copy the slice so that adding columns does not trigger SettingWithCopyWarning
result = df_clf[['train_idx', 'Target']].copy()
result['Base'] = df_clf['Id']
result['Clf_predictions'] = y_pred[:,1]

In [38]:
result = result.sort_values(['train_idx', 'Clf_predictions'], ascending=False)

In [39]:
result.head()

| | train_idx | Target | Base | Clf_predictions |
|---|---|---|---|---|
| 999990 | 99999-query | 2769109-base | 2769109-base | 0.976532 |
| 999992 | 99999-query | 2769109-base | 2539368-base | 0.008407 |
| 999995 | 99999-query | 2769109-base | 1412044-base | 0.002486 |
| 999998 | 99999-query | 2769109-base | 49440-base | 0.000549 |
| 999993 | 99999-query | 2769109-base | 1804388-base | 0.000507 |


In [44]:
result_array = result.Base.values.reshape(100000,10)
result_array

array([['2769109-base', '2539368-base', '1412044-base', ...,
        '1137584-base', '870586-base', '473313-base'],
       ['1079397-base', '1861685-base', '2123459-base', ...,
        '105365-base', '4624606-base', '315187-base'],
       ['2366140-base', '4264966-base', '330402-base', ...,
        '4625222-base', '2477422-base', '34436-base'],
       ...,
       ['1790410-base', '418450-base', '3006692-base', ...,
        '2428735-base', '258545-base', '474720-base'],
       ['577617-base', '15226-base', '854272-base', ..., '3121612-base',
        '231406-base', '234491-base'],
       ['675816-base', '361564-base', '656625-base', ..., '2725256-base',
        '3543241-base', '13406-base']], dtype=object)

The final set of vectors looks like this:

In [60]:
final_result = pd.DataFrame(data=result_array[:,:5], index=result['train_idx'].unique())
final_result


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 99999-query | 2769109-base | 2539368-base | 1412044-base | 49440-base | 1804388-base |
| 99998-query | 1079397-base | 1861685-base | 2123459-base | 491760-base | 3539125-base |
| 99997-query | 2366140-base | 4264966-base | 330402-base | 1095097-base | 954342-base |
| 99996-query | 339932-base | 3521422-base | 236881-base | 290344-base | 41154-base |
| 99995-query | 1604453-base | 252198-base | 450650-base | 642757-base | 4502196-base |
| ... | ... | ... | ... | ... | ... |
| 1000-query | 16751-base | 798711-base | 1119473-base | 327432-base | 2367864-base |
| 100-query | 862477-base | 4248796-base | 1544002-base | 3934158-base | 3005591-base |
| 10-query | 1790410-base | 418450-base | 3006692-base | 1409906-base | 266030-base |
| 1-query | 577617-base | 15226-base | 854272-base | 1075687-base | 511045-base |
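
The re-ranking step, which keeps the candidates with the highest classifier scores for each query, can also be written with a group-wise selection. The sketch below is an alternative formulation on hypothetical toy scores, not the code used above; the names `scored`, `top_n` and `ranked` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy scores: two queries with three candidates each.
scored = pd.DataFrame({
    "train_idx": ["0-query"] * 3 + ["1-query"] * 3,
    "Base": ["5-base", "7-base", "2-base", "1-base", "4-base", "9-base"],
    "Clf_predictions": [0.10, 0.85, 0.30, 0.02, 0.40, 0.05],
})
top_n = 2  # the notebook keeps 5 of 10 candidates

# Sort by score, then keep the first `top_n` rows of every query.
top = (scored.sort_values("Clf_predictions", ascending=False)
             .groupby("train_idx", sort=False)
             .head(top_n))

# One list of candidate ids per query, best candidate first.
ranked = top.groupby("train_idx")["Base"].agg(list)
print(ranked)
```

Unlike reshaping the sorted column into a fixed `(n_queries, 10)` array, this version does not assume that every query has exactly the same number of candidates.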
