# Clustering artists by genre

In this assignment you will try to cluster artists by genre based on listening data.

In the `sample_matrix` matrix, rows correspond to users and columns to artists.

For each (user, artist) pair, the table holds a number: the share (percentage) of that user's listening that went to the artist.

## Importing libraries, loading data

In [47]:
import pandas as pd
import numpy as np

In [48]:
ratings = pd.read_excel("https://github.com/evgpat/edu_stepik_rec_sys/blob/main/datasets/sample_matrix.xlsx?raw=true", engine='openpyxl')

In [49]:
ratings.head()

Unnamed: 0,user,the beatles,radiohead,deathcab for cutie,coldplay,modest mouse,sufjan stevens,dylan. bob,red hot clili peppers,pink fluid,...,municipal waste,townes van zandt,curtis mayfield,jewel,lamb,michal w. smith,群星,agalloch,meshuggah,yellowcard
0,0,,0.020417,,,,,,0.030496,,...,,,,,,,,,,
1,1,,0.184962,0.024561,,,0.136341,,,,...,,,,,,,,,,
2,2,,,0.028635,,,,0.024559,,,...,,,,,,,,,,
3,3,,,,,,,,,,...,,,,,,,,,,
4,4,0.043529,0.086281,0.03459,0.016712,0.015935,,,,,...,,,,,,,,,,


## Task

Transpose the `ratings` matrix so that artists are in rows.

In [50]:
ratings = ratings.T

Drop the row named `user`.

In [51]:
ratings.drop(['user'], axis=0, inplace=True)
ratings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
the beatles,,,,,0.043529,,,,0.093398,0.017621,...,,,0.121169,0.038168,0.007939,0.017884,,0.076923,,
radiohead,0.020417,0.184962,,,0.086281,0.006322,,,,0.019156,...,0.017735,,,,0.011187,,,,,
deathcab for cutie,,0.024561,0.028635,,0.03459,,,,,0.013349,...,0.121344,,,,,,,,,0.027893
coldplay,,,,,0.016712,,,,,,...,0.217175,,,,,,,,,
modest mouse,,,,,0.015935,,,,,0.030437,...,,,,,,,,,,


Fill the missing values with zeros.

In [81]:
ratings_without_null = ratings.fillna(0)
ratings_without_null.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
the beatles,0.0,0.0,0.0,0.0,0.043529,0.0,0.0,0.0,0.093398,0.017621,...,0.0,0.0,0.121169,0.038168,0.007939,0.017884,0.0,0.076923,0.0,0.0
radiohead,0.020417,0.184962,0.0,0.0,0.086281,0.006322,0.0,0.0,0.0,0.019156,...,0.017735,0.0,0.0,0.0,0.011187,0.0,0.0,0.0,0.0,0.0
deathcab for cutie,0.0,0.024561,0.028635,0.0,0.03459,0.0,0.0,0.0,0.0,0.013349,...,0.121344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027893
coldplay,0.0,0.0,0.0,0.0,0.016712,0.0,0.0,0.0,0.0,0.0,...,0.217175,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
modest mouse,0.0,0.0,0.0,0.0,0.015935,0.0,0.0,0.0,0.0,0.030437,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
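Filling with zeros treats "this user never listened to this artist" as a share of zero. A minimal sketch on a made-up toy frame (the artist names and values are hypothetical, not taken from `sample_matrix`) shows the fill and one way to check how sparse the matrix was beforehand:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the artist-by-user matrix; all values are made up.
toy = pd.DataFrame(
    {0: [np.nan, 0.02, np.nan], 1: [0.04, np.nan, np.nan], 2: [np.nan, 0.18, 0.03]},
    index=["artist_a", "artist_b", "artist_c"],
)

# Share of missing (artist, user) pairs before filling.
missing_share = toy.isna().to_numpy().mean()

# Replace NaN with 0: "never listened" becomes a zero share.
toy_filled = toy.fillna(0)

print(f"missing share: {missing_share:.2f}")
print(toy_filled)
```

On real listening data the missing share is typically high, so most entries of the filled matrix end up being zeros.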


Normalize the data with `normalize`.

In [83]:
from sklearn import preprocessing

# Scale each artist row to unit L2 norm, then restore the row and column labels
normalized_rating = preprocessing.normalize(ratings_without_null, norm='l2')
normalized_ratings = pd.DataFrame(
    normalized_rating,
    index=ratings_without_null.index,
    columns=ratings_without_null.columns,
)

normalized_ratings.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
the beatles,0.0,0.0,0.0,0.0,0.012054,0.0,0.0,0.0,0.025864,0.00488,...,0.0,0.0,0.033554,0.010569,0.002199,0.004952,0.0,0.021302,0.0,0.0
radiohead,0.009348,0.084688,0.0,0.0,0.039505,0.002894,0.0,0.0,0.0,0.008771,...,0.00812,0.0,0.0,0.0,0.005122,0.0,0.0,0.0,0.0,0.0
deathcab for cutie,0.0,0.017278,0.020144,0.0,0.024333,0.0,0.0,0.0,0.0,0.009391,...,0.085361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019622
coldplay,0.0,0.0,0.0,0.0,0.011129,0.0,0.0,0.0,0.0,0.0,...,0.144628,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
modest mouse,0.0,0.0,0.0,0.0,0.01026,0.0,0.0,0.0,0.0,0.019597,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
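To illustrate what L2 normalization does, here is a small sketch on made-up numbers (the array `X` is hypothetical). It checks that every row ends up with unit Euclidean length, and that for unit vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`. That identity is one common motivation for normalizing before KMeans, though the notebook itself does not state its reason.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up listening shares for two artists across three users.
X = np.array([[0.0, 3.0, 4.0],
              [0.0, 4.0, 3.0]])

X_unit = normalize(X, norm="l2")            # scales each row to unit Euclidean length
row_norms = np.linalg.norm(X_unit, axis=1)  # should all be 1

# For unit vectors: ||a - b||^2 == 2 - 2 * (a . b)
cosine = X_unit[0] @ X_unit[1]
sq_dist = np.sum((X_unit[0] - X_unit[1]) ** 2)

print(X_unit)
print(row_norms)
print(np.isclose(sq_dist, 2 - 2 * cosine))
```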


Apply KMeans with 5 clusters to the transformed matrix (call `fit`, then compute the clusters with `predict`).

In [84]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)  # create the KMeans estimator
kmeans.fit(normalized_ratings)
clusters = kmeans.predict(normalized_ratings)  # cluster label for each artist

print(clusters[:5])



[1 0 0 2 0]
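KMeans starts from random initial centroids, so cluster numbering (and sometimes the clusters themselves) can change between runs. The sketch below uses synthetic blobs rather than the notebook's data; `random_state=42` and `n_init=10` are illustrative choices, not settings from the original. It shows one way to make a run repeatable and to inspect cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fixing random_state makes the initialization, and hence the labels, repeatable.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)  # documented as equivalent to fit(X) followed by predict(X)

# Cluster sizes: how many points received each label.
sizes = np.bincount(labels, minlength=km.n_clusters)
print(sizes)
```

Checking the sizes is a cheap sanity test: one very large cluster next to several tiny ones often suggests that the chosen number of clusters or the preprocessing deserves a second look.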


Print the cluster centers (centroids).

In [66]:
centroids = kmeans.cluster_centers_
print(centroids)

[[ 6.55715426e-04  7.29059016e-04  1.29797221e-03 ...  6.77472759e-03
  -2.16840434e-19  5.30363031e-04]
 [ 6.21562902e-04  2.58066414e-03  2.92557374e-03 ...  7.83166137e-04
   3.01553453e-03  5.77018598e-03]
 [ 3.03120370e-03  1.73472348e-18  8.33820884e-04 ...  1.22102503e-03
   1.02614380e-04  8.71926701e-04]
 [ 4.20641938e-03  1.02980746e-03  8.87040521e-05 ... -1.08420217e-18
   6.02355037e-04 -1.08420217e-18]
 [ 7.43435366e-04  1.17588043e-03  1.11141087e-03 ...  9.06679240e-05
   2.50148160e-03  4.89431990e-04]]
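Each centroid is the coordinate-wise mean of the rows assigned to its cluster. Since the input here is non-negative, values such as `-2.16840434e-19` in the output are most likely floating-point round-off around zero rather than meaningful negatives. The sketch below uses synthetic data (not the notebook's matrix) to recompute the centroids by hand and compare them with `cluster_centers_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated groups of 15 points in 4 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(15, 4)),
    rng.normal(loc=5.0, scale=0.3, size=(15, 4)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute each centroid as the mean of the rows assigned to that cluster.
manual = np.vstack([X[km.labels_ == k].mean(axis=0) for k in range(km.n_clusters)])

# Compare with a tolerance, since the two computations can differ by round-off.
print(np.allclose(manual, km.cluster_centers_))
```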


For each cluster, print the top 10 artists closest to the centroid of that cluster.

In [116]:
# Distances from every artist to every centroid. Use transform (not fit_transform)
# so the model is not refitted and the labels in `clusters` stay valid.
alldistances = kmeans.transform(normalized_ratings)

top = {}
for cluster in range(kmeans.n_clusters):
    cluster_indices = np.where(clusters == cluster)[0]  # row positions of the artists in this cluster
    distances = alldistances[cluster_indices, cluster]  # their distances to this cluster's centroid

    top_10_indices = np.argsort(distances)[:10]  # positions of the 10 smallest distances
    top_10_artists = normalized_ratings.iloc[cluster_indices[top_10_indices]].index.values
    top[cluster] = list(top_10_artists)
    print(f'Top 10 artists for cluster {cluster}: {", ".join(top_10_artists)}')




Top 10 artists for cluster 0: boards of canada, radiohead, beck, girl talk, four tet, tv on the radio, ratatat, modest mouse, animal collective, stereolab
Top 10 artists for cluster 1: acdc, pink fluid, led zeppelin., the beatles, van hallen, timi hendrix, aerosmith, rush, queen, johnny clash
Top 10 artists for cluster 2: white stripes, the arctic monkeys, crystal castles, the strokes, talking heads, the clash, kings of leon, franz ferdinand, air, the knife
Top 10 artists for cluster 3: gym class heroes, kanye west, justin timberlake, chris brown, eminem, t.i., lupe the gorilla, マイケル・ジャクソン, flo rida, usher
Top 10 artists for cluster 4: kelly clarkson, rihanna & jay-z, maroon5, the pussycat dolls, john mayer, lady gaga, fergie, natasia beddingfield, brritney spears, sara bareilles
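The ranking above relies on the distance from each point to each centroid. As a hedged illustration on synthetic data (the array `X` is a random stand-in, not the notebook's matrix), the sketch below checks that `KMeans.transform` returns plain Euclidean distances to the fitted centroids, and that the nearest centroid agrees with `predict`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.random((40, 6))  # synthetic stand-in for the normalized matrix

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() uses the already fitted centroids; it does not refit the model.
dist = km.transform(X)  # shape (n_samples, n_clusters)

# The same distances computed by hand with broadcasting.
manual = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

labels = km.predict(X)
print(np.allclose(dist, manual))
print(np.array_equal(dist.argmin(axis=1), labels))
```

By contrast, `fit_transform` refits the model before computing distances. Without a fixed `random_state`, the refit can renumber the clusters, so labels computed earlier may no longer match the distance columns.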


Interpret the result. What can be said about the meaning of the clusters?

In [131]:
pd.set_option('display.width', 1000)

# Column names are a manual interpretation of each cluster's top-10 list
df = pd.DataFrame({
    "Electronic Rock": top[0],
    "Rock": top[1],
    "Indie": top[2],
    "Hip-Hop": top[3],
    "Pop": top[4],
})
print(df)

     Electronic Rock           Rock               Indie            Hip-Hop                   Pop
0   boards of canada           acdc       white stripes   gym class heroes        kelly clarkson
1          radiohead     pink fluid  the arctic monkeys         kanye west       rihanna & jay-z
2               beck  led zeppelin.     crystal castles  justin timberlake               maroon5
3          girl talk    the beatles         the strokes        chris brown    the pussycat dolls
4           four tet     van hallen       talking heads             eminem            john mayer
5    tv on the radio   timi hendrix           the clash               t.i.             lady gaga
6            ratatat      aerosmith       kings of leon   lupe the gorilla                fergie
7       modest mouse           rush     franz ferdinand         マイケル・ジャクソン  natasia beddingfield
8  animal collective          queen                 air           flo rida       brritney spears
9          stereolab   johnny 