# Часть 1 Бустинг (5 баллов)

В этой части будем предсказывать зарплату data scientist-ов в зависимости  от ряда факторов с помощью градиентного бустинга.

В датасете есть следующие признаки:



* work_year: The number of years of work experience in the field of data science.

* experience_level: The level of experience, such as Junior, Senior, or Lead.

* employment_type: The type of employment, such as Full-time or Contract.

* job_title: The specific job title or role, such as Data Analyst or Data Scientist.

* salary: The salary amount for the given job.

* salary_currency: The currency in which the salary is denoted.

* salary_in_usd: The equivalent salary amount converted to US dollars (USD) for comparison purposes.

* employee_residence: The country or region where the employee resides.

* remote_ratio: The percentage of remote work offered in the job.

* company_location: The location of the company or organization.

* company_size: The company's size is categorized as Small, Medium, or Large.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("ds_salaries.csv")
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


## Задание 1 (0.5 балла) Подготовка



*   Разделите выборку на train, val, test (80%, 10%, 10%)
*   Выдерите salary_in_usd в качестве таргета
*   Найдите и удалите признак, из-за которого возможен лик в данных


In [None]:
X = df.drop(['salary_in_usd'], axis = 1)
y = df['salary_in_usd']
X = X.drop(['salary'], axis = 1)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 1)
X_test, X_val, y_test, y_val = train_test_split(X_train, y_train, train_size=0.5, random_state = 1)

## Задание 2 (0.5 балла) Линейная модель


*   Закодируйте категориальные  признаки с помощью OneHotEncoder
*   Обучите модель линейной регрессии
*   Оцените  качество через MAPE и RMSE


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

#Решил дополнительно стандартизировать данные

cat_features = list(X.columns[(X.dtypes == "object").values])
num_features = list(X.columns[(X.dtypes != "object").values])
num_col_transformer = Pipeline([('imputer_mean', SimpleImputer(strategy='mean')), ('scale', StandardScaler())])
cat_col_transformer = Pipeline([('imputer_empty', SimpleImputer(strategy='constant', fill_value='')), ('dummy', OneHotEncoder(handle_unknown='ignore'))])
col_transformer = ColumnTransformer([('imputer_mean', num_col_transformer, num_features), ('imputer_empty', cat_col_transformer, cat_features)])

train_fin = col_transformer.fit_transform(X_train)
test_fin = col_transformer.transform(X_test)

regr = LinearRegression()
regr.fit(train_fin, y_train)

print('MAPE: ', mean_absolute_percentage_error(y_test, regr.predict(test_fin)))
print('RMSE: ', mean_squared_error(y_test, regr.predict(test_fin), squared = False))

MAPE:  0.29830980522464584
RMSE:  45109.96265107421


## Задание 3 (0.5 балла) XGboost

Начнем с библиотеки xgboost.

Обучите модель `XGBRegressor` на тех же данных, что линейную модель, подобрав оптимальные гиперпараметры (`max_depth, learning_rate, n_estimators, gamma`, etc.) по валидационной выборке. Оцените качество итоговой модели (MAPE, RMSE), скорость обучения и скорость предсказания.

In [None]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl (2.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Installing collected packages: xgboost
Successfully installed xgboost-2.0.3


In [None]:
val_fin = col_transformer.transform(X_val)

In [None]:
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

params = {
    'max_depth' : np.arange(3, 20, 1),
    'learning_rate' : np.arange(0.05, 0.25, 0.01),
    'n_estimators' : np.arange(1, 100, 2),
    'gamma' : np.arange(0, 1, 0.1)

}

xgbr = XGBRegressor()
cv = [(slice(None), slice(None))]
grid1 = RandomizedSearchCV(xgbr, params, scoring = 'neg_root_mean_squared_error', n_iter = 25, cv = cv)
grid1.fit(val_fin, y_val).best_params_

{'n_estimators': 87,
 'max_depth': 16,
 'learning_rate': 0.15000000000000002,
 'gamma': 0.5}

In [None]:
xgbr = XGBRegressor(n_estimators = 87, max_depth = 16, learning_rate = 0.15, gamma = 0.5)
xgbr.fit(train_fin, y_train)
print('MAPE: ', mean_absolute_percentage_error(y_test, xgbr.predict(test_fin)))
print('RMSE: ', mean_squared_error(y_test, xgbr.predict(test_fin), squared = False))

MAPE:  0.22280276067306004
RMSE:  39835.395771677635


## Задание 4 (1 балл) CatBoost

Теперь библиотека CatBoost.

Обучите модель `CatBoostRegressor`, подобрав оптимальные гиперпараметры (`depth, learning_rate, iterations`, etc.) по валидационной выборке. Оцените качество итоговой модели (MAPE, RMSE), скорость обучения и скорость предсказания.

**У меня сразу же возникли проблемы с установкой catboost. Пришлось устанавливать через колесо (и более раннюю версию).**

In [None]:
!pip install --only-binary :all: catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.2/98.2 MB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [None]:
from catboost import CatBoostRegressor

params = {
    'depth' : np.arange(3, 17, 1),
    'learning_rate' : np.arange(0.05, 0.25, 0.01),
    'iterations' : np.arange(100, 1000, 50),
}

cb = CatBoostRegressor(verbose = False)
cv = [(slice(None), slice(None))]
grid2 = RandomizedSearchCV(cb, params, scoring = 'neg_root_mean_squared_error', n_iter = 25, cv = cv)
grid2.fit(val_fin, y_val).best_params_

KeyboardInterrupt: 

**Код сработал (просто очень долго исполнялся), и я решил еще раз его не запускать**

In [None]:
cb = CatBoostRegressor(learning_rate = 0.2, iterations = 950, depth = 16, verbose = False)
cb.fit(train_fin, y_train)
print('MAPE: ', mean_absolute_percentage_error(y_test, cb.predict(test_fin)))
print('RMSE: ', mean_squared_error(y_test, cb.predict(test_fin), squared = False))

MAPE:  0.20480118365408337
RMSE:  39648.367544867426


Для применения catboost моделей не обязательно сначала кодировать категориальные признаки, модель может кодировать их сама. Обучите catboost с подбором оптимальных гиперпараметров снова, используя pool для передачи данных в модель с указанием какие признаки категориальные, а какие нет с помощью параметра cat_features. Оцените качество и время. Стало ли лучше?

In [None]:
from catboost import Pool

cb = CatBoostRegressor(verbose = False)
pool = Pool(X_train, y_train, cat_features = cat_features)
grid2.fit(X_val, y_val, cat_features = cat_features).best_params_

6 fits failed out of a total of 25.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/ivanovcharov/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/ivanovcharov/opt/anaconda3/lib/python3.9/site-packages/catboost/core.py", line 5730, in fit
    return self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline,
  File "/Users/ivanovcharov/opt/anaconda3/lib/python3.9/site-packages/catboost/core.py", line 2339, in _fit
    train_params = self._prepare_train_params(
  File

{'learning_rate': 0.08000000000000002, 'iterations': 600, 'depth': 4}

In [None]:
cb = CatBoostRegressor(learning_rate = 0.08000000000000002, iterations = 600, depth = 4, verbose = False)
cb.fit(X_train, y_train, cat_features = cat_features)
y_pred = cb.predict(X_test)
print('MAPE: ', mean_absolute_percentage_error(y_test, y_pred))
print('RMSE: ', mean_squared_error(y_test, y_pred, squared = False))

MAPE:  0.3246730848814097
RMSE:  45985.79419717819


**Ответ:** Времени стало затрачиваться больше (хотя куда уж больше), качество ухудшилось. Процессор перегрелся.

## Задание 5 (0.5 балла) LightGBM

И наконец библиотека LightGBM - используйте `LGBMRegressor`, снова подберите гиперпараметры, оцените качество и скорость.


**Тут вообще пришлось менять ноутбук, мой компьютер так и не смог проглотить данную библиотеку, работал с этими строчками в Google Colab**

In [None]:
from lightgbm import LGBMRegressor

params = {
    'max_depth' : [25, 50, 75, 100, 125, 150],
    'learning_rate' : [0.01, 0.025, 0.05, 0.075, 0.1],
    'n_estimators' : [200, 300, 400, 500],
    "num_leaves": [300, 400, 500, 600, 700, 800]
}

lgbm = LGBMRegressor(verbose = -1)
cv = [(slice(None), slice(None))]
grid3 = RandomizedSearchCV(lgbm, params, scoring = 'neg_root_mean_squared_error', n_iter = 15, cv = cv)
grid3.fit(val_fin, y_val).best_params_

{'num_leaves': 600,
 'n_estimators': 500,
 'max_depth': 75,
 'learning_rate': 0.075}

In [26]:
lgbm = LGBMRegressor(n_estimators = 500, max_depth = 75, learning_rate = 0.075, num_leaves = 600, verbose = -1)
lgbm.fit(train_fin, y_train)
print('MAPE: ', mean_absolute_percentage_error(y_test, lgbm.predict(test_fin)))
print('RMSE: ', mean_squared_error(y_test, lgbm.predict(test_fin), squared = False))

MAPE:  0.27285966992414684
RMSE:  42711.79842196823


**Если бы я делал через np.arange, программа никогда бы не выполнилась.**

## Задание 6 (2 балла) Сравнение и выводы

Сравните модели бустинга и сделайте про них выводы, какая из моделей показала лучший/худший результат по качеству, скорости обучения и скорости предсказания? Как отличаются гиперпараметры для разных моделей?

**Ответ:** Лучший результат показал CatBoost (однако он очень долгий в исполнении и затрачивает много ресурсов). На втором месте XGBoost, который показал результат на уровне и довольно быстро. На третьем месте LightGBM, который и медленный и менее точный. У LightGBM самая высокая глубина, у CatBoost – самая низкая. У XGBoost самый высокий шаг, у LightGBM – низкий.

# Часть 2 Кластеризация (5 баллов)

Будем работать с данными о том, каких исполнителей слушают пользователи музыкального сервиса.

Каждая строка таблицы - информация об одном пользователе. Каждый столбец - это исполнитель (The Beatles, Radiohead, etc.)

Для каждой пары (пользователь, исполнитель) в таблице стоит число - доля прослушивания этого исполнителя этим пользователем.


In [8]:
import pandas as pd
ratings = pd.read_excel("https://github.com/evgpat/edu_stepik_rec_sys/blob/main/datasets/sample_matrix.xlsx?raw=true", engine='openpyxl')
ratings.head()

Unnamed: 0,user,the beatles,radiohead,deathcab for cutie,coldplay,modest mouse,sufjan stevens,dylan. bob,red hot clili peppers,pink fluid,...,municipal waste,townes van zandt,curtis mayfield,jewel,lamb,michal w. smith,群星,agalloch,meshuggah,yellowcard
0,0,,0.020417,,,,,,0.030496,,...,,,,,,,,,,
1,1,,0.184962,0.024561,,,0.136341,,,,...,,,,,,,,,,
2,2,,,0.028635,,,,0.024559,,,...,,,,,,,,,,
3,3,,,,,,,,,,...,,,,,,,,,,
4,4,0.043529,0.086281,0.03459,0.016712,0.015935,,,,,...,,,,,,,,,,


Будем строить кластеризацию исполнителей: если двух исполнителей слушало много людей примерно одинаковую долю своего времени (то есть векторы близки в пространстве), то, возможно исполнители похожи. Эта информация может быть полезна при построении рекомендательных систем.

## Задание 1 (0.5 балла) Подготовка

Транспонируем матрицу ratings, чтобы по строкам стояли исполнители.

In [9]:
Tratings = ratings.transpose()
Tratings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
user,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,4990.0,4991.0,4992.0,4993.0,4994.0,4995.0,4996.0,4997.0,4998.0,4999.0
the beatles,,,,,0.043529,,,,0.093398,0.017621,...,,,0.121169,0.038168,0.007939,0.017884,,0.076923,,
radiohead,0.020417,0.184962,,,0.086281,0.006322,,,,0.019156,...,0.017735,,,,0.011187,,,,,
deathcab for cutie,,0.024561,0.028635,,0.03459,,,,,0.013349,...,0.121344,,,,,,,,,0.027893
coldplay,,,,,0.016712,,,,,,...,0.217175,,,,,,,,,


Выкиньте строку под названием `user`.

In [33]:
Tratings = Tratings.iloc[1:]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
the beatles,,,,,0.043529,,,,0.093398,0.017621,...,,,0.121169,0.038168,0.007939,0.017884,,0.076923,,
radiohead,0.020417,0.184962,,,0.086281,0.006322,,,,0.019156,...,0.017735,,,,0.011187,,,,,
deathcab for cutie,,0.024561,0.028635,,0.034590,,,,,0.013349,...,0.121344,,,,,,,,,0.027893
coldplay,,,,,0.016712,,,,,,...,0.217175,,,,,,,,,
modest mouse,,,,,0.015935,,,,,0.030437,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
michal w. smith,,,,,,,,,,,...,,,,,,,,,,
群星,,,,,,,,,,,...,,,,,,,,,,
agalloch,,,,,,,,,,,...,,,,,,,,,,
meshuggah,,,,,,,,,,,...,,,,,,,,,,


В таблице много пропусков, так как пользователи слушают не всех-всех исполнителей, чья музыка представлена в сервисе, а некоторое подмножество (обычно около 30 исполнителей)


Доля исполнителя в музыке, прослушанной  пользователем, равна 0, если пользователь никогда не слушал музыку данного музыканта, поэтому заполните пропуски нулями.



In [34]:
ratings = Tratings.fillna(0)
ratings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
the beatles,0.000000,0.000000,0.000000,0.0,0.043529,0.000000,0.0,0.0,0.093398,0.017621,...,0.000000,0.0,0.121169,0.038168,0.007939,0.017884,0.0,0.076923,0.0,0.000000
radiohead,0.020417,0.184962,0.000000,0.0,0.086281,0.006322,0.0,0.0,0.000000,0.019156,...,0.017735,0.0,0.000000,0.000000,0.011187,0.000000,0.0,0.000000,0.0,0.000000
deathcab for cutie,0.000000,0.024561,0.028635,0.0,0.034590,0.000000,0.0,0.0,0.000000,0.013349,...,0.121344,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.027893
coldplay,0.000000,0.000000,0.000000,0.0,0.016712,0.000000,0.0,0.0,0.000000,0.000000,...,0.217175,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
modest mouse,0.000000,0.000000,0.000000,0.0,0.015935,0.000000,0.0,0.0,0.000000,0.030437,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
michal w. smith,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
群星,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
agalloch,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
meshuggah,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000


## Задание 2 (0.5 балла) Первая кластеризация

Примените KMeans с 5ю кластерами, сохраните полученные лейблы

In [37]:
ratings_1 = ratings

In [38]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 5, random_state = 1).fit(ratings)
clusters_kmeans = kmeans.labels_
ratings_1["cluster_kmeans"] = clusters_kmeans
ratings_1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4991,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans
the beatles,0.0,0.0,0.0,0.0,0.043529,0.0,0.0,0.0,0.093398,0.017621,...,0.0,0.121169,0.038168,0.007939,0.017884,0.0,0.076923,0.0,0.0,3
radiohead,0.020417,0.184962,0.0,0.0,0.086281,0.006322,0.0,0.0,0.0,0.019156,...,0.0,0.0,0.0,0.011187,0.0,0.0,0.0,0.0,0.0,1
deathcab for cutie,0.0,0.024561,0.028635,0.0,0.03459,0.0,0.0,0.0,0.0,0.013349,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027893,1
coldplay,0.0,0.0,0.0,0.0,0.016712,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
modest mouse,0.0,0.0,0.0,0.0,0.015935,0.0,0.0,0.0,0.0,0.030437,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


Выведите размеры кластеров. Полезной ли получилась кластеризация? Почему KMeans может выдать такой результат?

In [39]:
ratings_1["cluster_kmeans"].value_counts()

cluster_kmeans
1    996
3      1
4      1
0      1
2      1
Name: count, dtype: int64

**Ответ:** Кластеризация получилось бесполезной, не соблюден баланс классов (есть один большой класс с 996 наблюдениями) и 4 класса по одному наблюдению. Возможно это связано с тем, что мы не центрировали данные. 

## Задание 3 (0.5 балла) Объяснение результатов

При кластеризации получилось $\geq 1$ кластера размера 1. Выведите исполнителей, которые составляют такие кластеры. Среди них должна быть группа The Beatles.

In [43]:
ratings_1[ratings_1['cluster_kmeans'] == 0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4991,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans
niИ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012281,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [44]:
ratings_1[ratings_1['cluster_kmeans'] == 2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4991,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans
bone: thugs~n~harmony,0.0,0.0,0.0,0.0,0.0,0.0,0.014277,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2


In [45]:
ratings_1[ratings_1['cluster_kmeans'] == 3]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4991,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans
the beatles,0.0,0.0,0.0,0.0,0.043529,0.0,0.0,0.0,0.093398,0.017621,...,0.0,0.121169,0.038168,0.007939,0.017884,0.0,0.076923,0.0,0.0,3


In [46]:
ratings_1[ratings_1['cluster_kmeans'] == 4]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4991,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans
pink fluid,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02078,0.0,0.0,0.0,0.0,0.0,0.0,4


Изучите данные, почему именно The Beatles выделяется?

Подсказка: посмотрите на долю пользователей, которые слушают каждого исполнителя, среднюю долю прослушивания.

In [49]:
ratings['average'] = ratings.mean(axis = 1)
ratings

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4992,4993,4994,4995,4996,4997,4998,4999,cluster_kmeans,average
the beatles,0.000000,0.000000,0.000000,0.0,0.043529,0.000000,0.0,0.0,0.093398,0.017621,...,0.121169,0.038168,0.007939,0.017884,0.0,0.076923,0.0,0.000000,3,0.019348
radiohead,0.020417,0.184962,0.000000,0.0,0.086281,0.006322,0.0,0.0,0.000000,0.019156,...,0.000000,0.000000,0.011187,0.000000,0.0,0.000000,0.0,0.000000,1,0.012292
deathcab for cutie,0.000000,0.024561,0.028635,0.0,0.034590,0.000000,0.0,0.0,0.000000,0.013349,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.027893,1,0.006878
coldplay,0.000000,0.000000,0.000000,0.0,0.016712,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.006355
modest mouse,0.000000,0.000000,0.000000,0.0,0.015935,0.000000,0.0,0.0,0.000000,0.030437,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.006198
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
michal w. smith,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.001117
群星,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.000733
agalloch,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.001221
meshuggah,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,1,0.000644


**Ответ:** У Жуков аномально высокое среднее слушателей+доли (то есть либо их слушает очень много людей понемногу, либо их фанаты НУ ОЧЕНЬ долго их слушают).

## Задание 4 (0.5 балла) Улучшение кластеризации

Попытаемся избавиться от этой проблемы: нормализуйте данные при помощи `normalize`.

In [51]:
from sklearn.preprocessing import normalize

ratings_normalized = normalize(ratings, norm = 'l2')
ratings_2 = pd.DataFrame(ratings_normalized, index = ratings.index, columns = ratings.columns)

Примените KMeans с 5ю кластерами на преобразованной матрице, посмотрите на их размеры. Стало ли лучше? Может ли кластеризация быть полезной теперь?

In [59]:
kmeans = KMeans(n_clusters = 5, random_state = 63).fit(ratings_2)
clusters_kmeans = kmeans.labels_
ratings_2["cluster_kmeans"] = clusters_kmeans
ratings_2["cluster_kmeans"].value_counts()



cluster_kmeans
0    731
3     95
2     84
4     65
1     25
Name: count, dtype: int64

**Ответ** Классы получились чуть более сбалансированными.

## Задание 5 (1 балл) Центроиды

Выведите для каждого кластера названия топ-10 исполнителей, ближайших к центроиду по косинусной мере. Проинтерпретируйте результат. Что можно сказать о смысле кластеров?

In [61]:
from scipy.spatial.distance import cosine

for x in range(5):
    ind = ratings_2[ratings_2['cluster_kmeans'] == x].index
    data = ratings_2.loc[ind]
    centroid = kmeans.cluster_centers_[x]
    dist = data.apply(lambda i: cosine(i, centroid), axis = 1)
    top10 = dist.nsmallest(10).index   
    print(f'Кластер {x}: {top10.tolist()}')

Кластер 0: ['broken social scene', 'jimmy eat world', 'the postal service', 'mgmt', 'tv on the radio', 'dashboard confesssional', 'interpol', 'portishead', 'maroon5', 'snow potrol']
Кластер 1: ['perfect circle', 'marilyn manson', 'megadeth', 'foo fighters', 'alice in chains', 'system of a down', 'bad religion', '￼beastie boys', 'incubus', 'tool']
Кластер 2: ['sufjan stevens', 'deathcab for cutie', 'elliotte smith', 'belle and sebastian', 'animal collective', 'the arcade fire', 'bright eyes', 'of montreal', 'sigur rós', 'the decemberists']
Кластер 3: ['lupe the gorilla', 'david crowder*band', 'josh groban', 'anberlin', 'fleetwood mac', 'newsboys', 'rise against', 'atb', '植松伸夫', 'tupak shakur']
Кластер 4: ['white stripes', 'r.e.m.', 'cake', 'who', 'pearl jam', 'simon and garfunkel', 'the killers', 'the strokes', 'young, neil', 'berenaked ladies']


**Ответ:** Кластер 3: хип-хоп, электроника, аниме-опенинги. Кластер 0: поп, альтернативный рок. Кластер 1: хард-рок (и почему-то Bestie Boys). Кластер 2:  инди-поп, инди-рок. Кластер 4: смесь кантри-рока, фанка, гаражного рока.

## Задание 6 (1 балл) Визуализация

Хотелось бы как-то визуализировать полученную кластеризацию. Постройте точечные графики `plt.scatter` для нескольких пар признаков исполнителей, покрасив точки в цвета кластеров. Почему визуализации получились такими? Хорошо ли они отражают разделение на кластеры? Почему?

In [None]:
import matplotlib.pyplot as plt

# -- YOUR CODE HERE --

**Ответ:** # -- YOUR ANSWER HERE --

Для визуализации данных высокой размерности существует метод t-SNE (стохастическое вложение соседей с t-распределением). Данный метод является нелинейным методом снижения размерности: каждый объект высокой размерности будет моделироваться объектов более низкой (например, 2) размерности таким образом, чтобы похожие объекты моделировались близкими, непохожие - далекими с большой вероятностью.

Примените `TSNE` из библиотеки `sklearn` и визуализируйте полученные объекты, покрасив их в цвета их кластеров

In [None]:
from sklearn.manifold import TSNE

# -- YOUR CODE HERE --

## Задание 7 (1 балл) Подбор гиперпараметров

Подберите оптимальное количество кластеров (максимум 100 кластеров) с использованием индекса Силуэта. Зафиксируйте `random_state=42`

In [None]:
from sklearn.metrics import silhouette_score

# -- YOUR CODE HERE --

Выведите исполнителей, ближайших с центроидам (аналогично заданию 5). Как соотносятся результаты? Остался ли смысл кластеров прежним? Расскажите про смысл 1-2 интересных кластеров, если он изменился и кластеров слишком много, чтобы рассказать про все.

In [None]:
# -- YOUR CODE HERE --

**Ответ:** # -- YOUR ANSWER HERE --

Сделайте t-SNE визуализацию полученной кластеризации.

In [None]:
# -- YOUR CODE HERE --

Если кластеров получилось слишком много и визуально цвета плохо отличаются, покрасьте только какой-нибудь интересный кластер из задания выше (`c = (labels == i)`). Хорошо ли этот кластер отражается в визуализации?

In [None]:
# -- YOUR CODE HERE --

**Ответ:** # -- YOUR ANSWER HERE --