# Machine Learning, HSE Faculty of Computer Science

# Practical Assignment 12. Nearest Neighbor Search

## General information

Release date: 08.05.2024

**Soft deadline: 26.05.2024 23:59 MSK**

**Hard deadline: 30.05.2024 23:59 MSK**

## Grading and penalties

Each task has a certain "cost" (given in parentheses next to the task). The maximum possible grade for the assignment is 7 points.

Submissions are not accepted after the hard deadline. If an assignment receives less than the full score because of errors, the grader may, at their discretion, allow the work to be corrected under the conditions stated in the reply email.

The assignment must be completed independently. "Similar" solutions are considered plagiarism, and all students involved (including those who were copied from) receive no more than 0 points for it (see the course page for details on plagiarism). If you found a solution to any task (or part of one) in an open source, you must cite that source in a separate block at the end of your work (you are most likely not the only one who found it, so a link to the source is needed to rule out suspicion of plagiarism).

An inefficient implementation may negatively affect the grade.

## Submission format

Assignments are submitted through the anytask system. The submission must contain:

* The notebook homework-practice-12-knn-Username.ipynb

Username is your last name and first name in Latin letters, in exactly that order.

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import os
import random

from tqdm.notebook import tqdm

Let us take a [dataset](https://www.kaggle.com/delayedkarma/impressionist-classifier-data) of paintings by famous impressionists. We will work not with the images themselves but with image embeddings obtained from a convolutional classifier.

![](https://storage.googleapis.com/kagglesdsdata/datasets/568245/1031162/training/training/Gauguin/190448.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20210405%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210405T125358Z&X-Goog-Expires=172799&X-Goog-SignedHeaders=host&X-Goog-Signature=a271b474bf9ec20ba159b951e0ae680fc2b0c694666031f7ea6fc39598172cc55e10f75c12b678b21da9e6bdc20e46886133c219625648b407d2f600eebfdda909b29e0f7f13276d8fea2f8d0480d6298bd98e7f118eb78e8b632fc3d141365356b0e3a2fdd4f09119f99f0907a31da62e8dae7e625e32d831238ecc227b1f5ad2e96a8bfb43d93ef6fe88d7e663e51d387d3550dcad2a7eefc5c941028ba0d7751d18690cf2e26fcdfaa4dacd3dcbb3a4cbb355e62c08b158007b5e764e468cecd3292dae4cfc408e848ecf3e0e5dbe5faa76fcdd77d5370c868583c06e4e3d40c73a7435bd8c32a9803fe6b536e1c6f0791219aadd06120291e937e57c214a)

GIT="https://github.com/esokolov/ml-course-hse/raw/master/2022-spring/homeworks-practice/homework-practice-11-metric-learning/embeddings"
!wget -P ./embeddings $GIT/embeds_train.npy
!wget -P ./embeddings $GIT/embeds_test.npy
!wget -P ./embeddings $GIT/labels_train.npy
!wget -P ./embeddings $GIT/labels_test.npy

In [4]:
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [5]:
X_train = np.load('embeddings/embeds_train.npy')
y_train = np.load('embeddings/labels_train.npy')
X_test = np.load('embeddings/embeds_test.npy')
y_test = np.load('embeddings/labels_test.npy')

We will look at ordinary accuracy and at top-3 accuracy (the share of objects whose true class is among the three highest-scoring classes).

In [6]:
def top_3_accuracy_score(y_true, probas):
    preds = np.argsort(probas, axis=1)[:, -3:]
    matches = np.zeros_like(y_true)
    for i in range(3):
        matches += (preds[:, i] == y_true)
    return matches.sum() / matches.size

def scorer(estimator, X, y):
    return accuracy_score(y, estimator.predict(X))
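
To make the top-3 metric concrete, here is a small worked example on made-up scores. It repeats the helper defined above so that the snippet is self-contained, and compares it with scikit-learn's `top_k_accuracy_score`. The toy `probas` and `y_true` values are assumptions for illustration only; the helper also assumes that labels are the integers 0..K-1 in the same order as the score columns.

```python
import numpy as np
from sklearn.metrics import accuracy_score, top_k_accuracy_score


def top_3_accuracy_score(y_true, probas):
    # Same logic as the notebook's helper: column indices of the 3 largest scores.
    preds = np.argsort(probas, axis=1)[:, -3:]
    matches = np.zeros_like(y_true)
    for i in range(3):
        matches += (preds[:, i] == y_true)
    return matches.sum() / matches.size


# Toy scores for 4 objects and 4 classes (made-up numbers).
probas = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.05, 0.30, 0.40],
    [0.70, 0.10, 0.15, 0.05],
])
y_true = np.array([1, 0, 1, 2])

# Ordinary accuracy only counts the single highest-scoring class.
top1 = accuracy_score(y_true, probas.argmax(axis=1))
# Top-3 accuracy counts a hit when the true class is among the 3 best scores.
top3 = top_3_accuracy_score(y_true, probas)
top3_builtin = top_k_accuracy_score(y_true, probas, k=3, labels=np.arange(4))

print(top1, top3, top3_builtin)
```

Only the third object is a top-3 miss here (its true class has the lowest score), while three of the four top-1 predictions are wrong, which is why the two metrics can differ a lot.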

**Task 1. (1 point)**

Train a k-nearest-neighbors classifier (from sklearn) on the data, selecting the best hyperparameters. Measure the quality on the training and test sets.

In [13]:
#  (*・ω・)ﾉ



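knn = KNeighborsClassifier()
# Note: 'p' only matters for metric='minkowski' (p=1 is manhattan, p=2 is euclidean),
# so this grid evaluates several equivalent configurations.
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'p': [1, 2]
}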

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train, y_train)


best_knn = grid_search.best_estimator_


train_accuracy = accuracy_score(y_train, best_knn.predict(X_train))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train))


test_accuracy = accuracy_score(y_test, best_knn.predict(X_test))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(metric='manhattan', n_neighbors=13, p=1,
                     weights='distance')
Accuracy train: 1.0000
Top-3 train: 1.0000
Accuracy test: 0.5576
Top-3 test: 0.8152


In [14]:
model = KNeighborsClassifier(metric='manhattan', n_neighbors=13, p=1, weights='distance')

model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
train_top_3_accuracy = top_3_accuracy_score(y_train, model.predict_proba(X_train))


test_accuracy = accuracy_score(y_test, model.predict(X_test))
test_top_3_accuracy = top_3_accuracy_score(y_test, model.predict_proba(X_test))


print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Accuracy train: 1.0000
Top-3 train: 1.0000
Accuracy test: 0.5576
Top-3 test: 0.8152


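However, this looks like overfitting: accuracy is 1.0000 on the training set but only 0.5576 on the test set. Part of the gap is expected, because with `weights='distance'` every training object is its own zero-distance neighbor and is therefore always classified correctly on the training set.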
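
The effect of distance weighting on training accuracy can be checked on synthetic data. This is a minimal sketch with random features and pure-noise labels (not the embeddings from this notebook): distance-weighted kNN still scores 1.0 on its own training set, while cross-validation shows that nothing was actually learned.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # synthetic features, not the embeddings
y = rng.integers(0, 4, size=200)   # labels are pure noise

knn_dist = KNeighborsClassifier(n_neighbors=13, weights="distance").fit(X, y)
knn_unif = KNeighborsClassifier(n_neighbors=13, weights="uniform").fit(X, y)

# Each training point is its own neighbor at distance 0, so with distance
# weighting its own label decides the vote.
train_acc_dist = knn_dist.score(X, y)
train_acc_unif = knn_unif.score(X, y)

# Cross-validation never scores a point against a model that has seen it.
cv_acc_dist = cross_val_score(knn_dist, X, y, cv=5).mean()

print(train_acc_dist, train_acc_unif, cv_acc_dist)
```

For this reason the cross-validated score from `GridSearchCV` (or the test score) is a more informative quantity than the training accuracy when `weights='distance'` is selected.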

**Task 2. (2 points)**

Now we will use the Mahalanobis metric. Learn it with one of the methods [from here](http://contrib.scikit-learn.org/metric-learn/supervised.html). Recall that computing the Mahalanobis metric is equivalent to computing the Euclidean distance between objects to which some linear transformation has been applied (recall the seminars). Transform the data and train kNN on it, searching over hyperparameters, and measure the quality.

Note that the metric-learn library offers several ways to learn the transformation matrix. Choose the best one and justify your choice.

Note: Some methods take a very long time to train with default parameters, so be careful. We recommend setting the parameter `tolerance=1e-3`.
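
The equivalence mentioned in the task statement can be verified numerically. The sketch below uses an arbitrary random map `L` (a hypothetical example, not a learned metric) and checks that the Mahalanobis distance with matrix `M = L.T @ L` equals the Euclidean distance between the mapped points.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
L = rng.normal(size=(d, d))   # arbitrary linear map (hypothetical example)
M = L.T @ L                   # induced positive semi-definite matrix

x, y = rng.normal(size=d), rng.normal(size=d)
diff = x - y

# Mahalanobis distance: sqrt((x - y)^T M (x - y)).
d_mahalanobis = np.sqrt(diff @ M @ diff)
# Euclidean distance after applying the map to both points.
d_euclid_after_map = np.linalg.norm(L @ x - L @ y)

# For a data matrix with objects in rows, the same map is X @ L.T.
X = rng.normal(size=(10, d))
X_mapped = X @ L.T

print(d_mahalanobis, d_euclid_after_map)
```

This is why it is enough to call `transform` on the data once and then run an ordinary Euclidean kNN on the result.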


In [22]:
from metric_learn import ITML_Supervised, LMNN, NCA, RCA_Supervised, MLKR


### NCA

In [20]:
nca = NCA(n_components=X_train.shape[1], max_iter=100, tol=1e-3)
nca.fit(X_train, y_train)

In [22]:
X_train_scaled = nca.transform(X_train)
X_test_scaled = nca.transform(X_test)


param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train_scaled, y_train)


best_knn = grid_search.best_estimator_


train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_scaled))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_scaled))


test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_scaled))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_scaled))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(n_neighbors=11)
Accuracy train: 0.6733
Top-3 train: 0.9288
Accuracy test: 0.5566
Top-3 test: 0.8162


### LMNN

In [28]:
lmnn = LMNN(n_components=X_train.shape[1], max_iter=50)
lmnn.fit(X_train, y_train)

In [29]:
X_train_scaled = lmnn.transform(X_train)
X_test_scaled = lmnn.transform(X_test)


param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train_scaled, y_train)


best_knn = grid_search.best_estimator_


train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_scaled))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_scaled))


test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_scaled))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_scaled))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(n_neighbors=13)
Accuracy train: 0.6497
Top-3 train: 0.9075
Accuracy test: 0.5354
Top-3 test: 0.8051


### ITML

In [33]:
itml = ITML_Supervised(max_iter=50, tol=1e-3)
itml.fit(X_train, y_train)

In [34]:
X_train_scaled = itml.transform(X_train)
X_test_scaled = itml.transform(X_test)


param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train_scaled, y_train)


best_knn = grid_search.best_estimator_


train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_scaled))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_scaled))


test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_scaled))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_scaled))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(n_neighbors=13)
Accuracy train: 0.6269
Top-3 train: 0.9107
Accuracy test: 0.5364
Top-3 test: 0.8111


### MLKR

In [39]:
# Note: MLKR is designed for regression, so the class labels are treated as numeric targets here.
mlkr = MLKR(n_components=X_train.shape[1], max_iter=50, tol=1e-3)
mlkr.fit(X_train, y_train)

In [40]:
X_train_scaled = mlkr.transform(X_train)
X_test_scaled = mlkr.transform(X_test)


param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer)
grid_search.fit(X_train_scaled, y_train)


best_knn = grid_search.best_estimator_


train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_scaled))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_scaled))


test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_scaled))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_scaled))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(n_neighbors=7)
Accuracy train: 0.7131
Top-3 train: 0.9511
Accuracy test: 0.5323
Top-3 test: 0.7808



----

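All four methods give roughly the same test quality (accuracy between 0.5323 and 0.5566). NCA is perhaps slightly better than the rest (accuracy 0.5566, top-3 0.8162), though none of them improves accuracy over the plain kNN from Task 1 (0.5576).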
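
One caveat about the comparison: in the cells above the metric is learned on the whole training set before `GridSearchCV` runs, so the validation folds have already influenced the transformation. A possible way to avoid this is to put the metric learner and kNN into one pipeline. The sketch below is an illustration under assumptions: it uses scikit-learn's own `NeighborhoodComponentsAnalysis` instead of metric-learn's `NCA`, and synthetic data from `make_classification` instead of the embeddings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the embeddings (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# The metric is re-learned inside every CV fold, so validation objects
# never take part in fitting the transformation.
pipe = Pipeline([
    ("nca", NeighborhoodComponentsAnalysis(max_iter=50, tol=1e-3, random_state=0)),
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9, 11]}, cv=3)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)
print(best_k, test_acc)
```

The price is that the metric is fitted once per fold and once more for the final refit, which can be slow for the heavier learners.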

**Task 3. (1 point)**

What happens if a random matrix is used as the matrix in the Mahalanobis distance? The covariance matrix?

In [None]:
# (•)(•)ԅ(≖‿≖ԅ)

### Random

In [9]:
from sklearn.datasets import make_spd_matrix
import optuna

In [7]:
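# Random linear map; the induced Mahalanobis matrix is A_rand.T @ A_rand.
A_rand = np.random.normal(size=(X_train.shape[1], X_train.shape[1]))
X_train_trans, X_test_trans = X_train @ A_rand.T, X_test @ A_rand.T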

In [12]:

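# Caveat: every trial is scored on the test set, so the best value found is an
# optimistic estimate. leaf_size affects only the speed and memory of tree-based search.
def objective(trial: optuna.Trial):
    knn = KNeighborsClassifier(n_neighbors=trial.suggest_int('n_neighbors', 2, 30),
                               leaf_size=trial.suggest_int('leaf_size', 10, 60)).fit(X_train_trans, y_train)
    return scorer(knn, X_test_trans, y_test)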

def logging_callback(study, trial):
    if trial.number % 50 == 0:
        print(f"[I {trial.datetime_start}] Trial {trial.number} finished with value: {trial.value} and parameters: {trial.params}. Best is trial {study.best_trial.number} with value: {study.best_value}.")

optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=400, callbacks=[logging_callback])


[I 2024-05-26 14:11:07.318974] Trial 0 finished with value: 0.5303030303030303 and parameters: {'n_neighbors': 15, 'leaf_size': 33}. Best is trial 0 with value: 0.5303030303030303.
[I 2024-05-26 14:11:08.707918] Trial 50 finished with value: 0.5141414141414141 and parameters: {'n_neighbors': 4, 'leaf_size': 49}. Best is trial 5 with value: 0.5444444444444444.
[I 2024-05-26 14:11:10.169506] Trial 100 finished with value: 0.5393939393939394 and parameters: {'n_neighbors': 17, 'leaf_size': 49}. Best is trial 5 with value: 0.5444444444444444.
[I 2024-05-26 14:11:11.666823] Trial 150 finished with value: 0.5444444444444444 and parameters: {'n_neighbors': 12, 'leaf_size': 51}. Best is trial 5 with value: 0.5444444444444444.
[I 2024-05-26 14:11:13.306814] Trial 200 finished with value: 0.5444444444444444 and parameters: {'n_neighbors': 16, 'leaf_size': 42}. Best is trial 5 with value: 0.5444444444444444.
[I 2024-05-26 14:11:15.032108] Trial 250 finished with value: 0.5373737373737374 and para

In [13]:
best_params = study.best_params
best_knn = KNeighborsClassifier(n_neighbors=best_params['n_neighbors'], leaf_size=best_params['leaf_size'])
best_knn.fit(X_train_trans, y_train)

train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_trans))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_trans))

test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_trans))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_trans))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(leaf_size=47, n_neighbors=12)
Accuracy train: 0.6505
Top-3 train: 0.9100
Accuracy test: 0.5444
Top-3 test: 0.8101
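
The search above picks `n_neighbors` by its score on the test set, which makes the reported test accuracy optimistic. A minimal alternative sketch is shown below: the number of neighbors is chosen by cross-validation on the training part only, and the test part is used once at the end. The data here is synthetic (an assumption for illustration), and a plain loop replaces Optuna to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the embeddings (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Random linear map, as in the cell above.
rng = np.random.default_rng(0)
A = rng.normal(size=(X.shape[1], X.shape[1]))
X_tr_rand, X_te_rand = X_tr @ A.T, X_te @ A.T

# Choose n_neighbors using the training data only.
cv_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X_tr_rand, y_tr, cv=5).mean()
    for k in range(2, 31)
}
best_k = max(cv_scores, key=cv_scores.get)

# Touch the test set exactly once, with the selected model.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr_rand, y_tr)
test_acc = final.score(X_te_rand, y_te)
print(best_k, test_acc)
```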


### COV

In [14]:
# Note: used directly as the linear map, this induces M = L.T @ L = Σ^{-2}, not Σ^{-1}.
L = np.linalg.inv(np.cov(X_train, rowvar=False))

In [16]:
X_train_trans, X_test_trans = X_train @ L.T, X_test @ L.T

In [17]:
def objective(trial: optuna.Trial):
    knn = KNeighborsClassifier(n_neighbors=trial.suggest_int('n_neighbors', 2, 30),
                               leaf_size=trial.suggest_int('leaf_size', 10, 60)).fit(X_train_trans, y_train)
    return scorer(knn, X_test_trans, y_test)

def logging_callback(study, trial):
    if trial.number % 50 == 0:
        print(f"[I {trial.datetime_start}] Trial {trial.number} finished with value: {trial.value} and parameters: {trial.params}. Best is trial {study.best_trial.number} with value: {study.best_value}.")

optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=400, callbacks=[logging_callback])


[I 2024-05-26 14:17:56.323477] Trial 0 finished with value: 0.17777777777777778 and parameters: {'n_neighbors': 12, 'leaf_size': 55}. Best is trial 0 with value: 0.17777777777777778.
[I 2024-05-26 14:17:57.703796] Trial 50 finished with value: 0.18787878787878787 and parameters: {'n_neighbors': 11, 'leaf_size': 52}. Best is trial 4 with value: 0.1919191919191919.
[I 2024-05-26 14:17:59.153735] Trial 100 finished with value: 0.1919191919191919 and parameters: {'n_neighbors': 9, 'leaf_size': 25}. Best is trial 4 with value: 0.1919191919191919.
[I 2024-05-26 14:18:00.847031] Trial 150 finished with value: 0.18484848484848485 and parameters: {'n_neighbors': 7, 'leaf_size': 50}. Best is trial 4 with value: 0.1919191919191919.
[I 2024-05-26 14:18:02.707658] Trial 200 finished with value: 0.18383838383838383 and parameters: {'n_neighbors': 8, 'leaf_size': 21}. Best is trial 4 with value: 0.1919191919191919.
[I 2024-05-26 14:18:04.604032] Trial 250 finished with value: 0.1919191919191919 and p

In [18]:
best_params = study.best_params
best_knn = KNeighborsClassifier(n_neighbors=best_params['n_neighbors'], leaf_size=best_params['leaf_size'])
best_knn.fit(X_train_trans, y_train)

train_accuracy = accuracy_score(y_train, best_knn.predict(X_train_trans))
train_top_3_accuracy = top_3_accuracy_score(y_train, best_knn.predict_proba(X_train_trans))

test_accuracy = accuracy_score(y_test, best_knn.predict(X_test_trans))
test_top_3_accuracy = top_3_accuracy_score(y_test, best_knn.predict_proba(X_test_trans))

print(f"Best model: {best_knn}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: KNeighborsClassifier(n_neighbors=9)
Accuracy train: 0.2971
Top-3 train: 0.6800
Accuracy test: 0.1919
Top-3 test: 0.4071


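With a random matrix we get about the same quality as before (test accuracy 0.5444), but with the covariance matrix the quality is much worse (test accuracy 0.1919). Note that the code above uses the inverse covariance matrix itself as the linear map, so the resulting metric corresponds to Σ^{-2} rather than to the classical Mahalanobis matrix Σ^{-1}.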
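
For reference, the classical Mahalanobis distance uses the inverse covariance matrix as `M`, and the matching linear map is a factor `L` with `L.T @ L = Σ^{-1}`, not `Σ^{-1}` itself. The sketch below builds such a factor with a Cholesky decomposition on synthetic correlated data (an assumption for illustration, the notebook's numbers were not recomputed with it) and checks that the mapped data has identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic data (stand-in for the embeddings).
mixing = np.eye(5) + 0.3 * rng.normal(size=(5, 5))
X = rng.normal(size=(500, 5)) @ mixing

sigma = np.cov(X, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

# Using sigma_inv itself as the map induces M = sigma_inv.T @ sigma_inv,
# which is the squared inverse covariance.
M_direct = sigma_inv.T @ sigma_inv

# A factor with L.T @ L == sigma_inv: take L = C.T where sigma_inv = C @ C.T.
C = np.linalg.cholesky(sigma_inv)
L = C.T
X_white = X @ L.T   # rows are mapped objects; equivalent to X @ C

factor_ok = np.allclose(L.T @ L, sigma_inv)
whitened_ok = np.allclose(np.cov(X_white, rowvar=False), np.eye(5), atol=1e-6)
print(factor_ok, whitened_ok)
```

With this map the transformation is ordinary whitening: every direction of the data is rescaled to unit variance before Euclidean kNN is applied.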

**Task 4. (1 point)** Train some gradient boosting model on the original and transformed datasets, measure the quality, and think about whether other methods are worthwhile.

In [20]:
from sklearn.ensemble import GradientBoostingClassifier


gbc = GradientBoostingClassifier().fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, gbc.predict(X_train))
train_top_3_accuracy = top_3_accuracy_score(y_train, gbc.predict_proba(X_train))

test_accuracy = accuracy_score(y_test, gbc.predict(X_test))
test_top_3_accuracy = top_3_accuracy_score(y_test, gbc.predict_proba(X_test))

print(f"Best model: {gbc}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")


Best model: GradientBoostingClassifier()
Accuracy train: 0.9488
Top-3 train: 0.9945
Accuracy test: 0.5869
Top-3 test: 0.8576


In [23]:
nca = NCA(n_components=X_train.shape[1], max_iter=100, tol=1e-3)
nca.fit(X_train, y_train)

X_train_scaled = nca.transform(X_train)
X_test_scaled = nca.transform(X_test)

In [25]:
gbc = GradientBoostingClassifier().fit(X_train_scaled, y_train)

train_accuracy = accuracy_score(y_train, gbc.predict(X_train_scaled))
train_top_3_accuracy = top_3_accuracy_score(y_train, gbc.predict_proba(X_train_scaled))

test_accuracy = accuracy_score(y_test, gbc.predict(X_test_scaled))
test_top_3_accuracy = top_3_accuracy_score(y_test, gbc.predict_proba(X_test_scaled))

print(f"Best model: {gbc}")
print(f"Accuracy train: {train_accuracy:.4f}")
print(f"Top-3 train: {train_top_3_accuracy:.4f}")
print(f"Accuracy test: {test_accuracy:.4f}")
print(f"Top-3 test: {test_top_3_accuracy:.4f}")

Best model: GradientBoostingClassifier()
Accuracy train: 0.9391
Top-3 train: 0.9895
Accuracy test: 0.6192
Top-3 test: 0.8616


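Yes, it got somewhat better: test accuracy rose from 0.5869 on the original features to 0.6192 on the NCA-transformed ones.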

**Bonus. (1 point)**

Reach a test accuracy of 0.7 without using neural networks.

If top-3 accuracy counts, the target is reached (top-3 test accuracy up to 0.8616); plain test accuracy peaks at 0.6192, which is below 0.7.

In [None]:
# ( ・・)つ―{}@{}@{}-

**Shashlik bonus. (up to 0.5 points)**

The warm weather is here and the May holidays have arrived. [Everyone is heading out for shashlik.](https://www.youtube.com/watch?v=AgVZ6LoAm8g) Are you? Add photo proof and attach a short report on how it went. You may team up with groupmates or coursemates and also invite the assistants and instructors; they will be glad to join the shashlik outing too.


----

I will attach the photo in anytask.