In this notebook, I will try clustering contextual word representations from DeepPavlov's conversational RuBERT for the task of Word Sense Induction.

In [1]:
# !pip install transformers umap-learn

In [2]:
import warnings
import itertools
from tqdm import tqdm
from copy import deepcopy
from functools import reduce
from typing import List, Tuple, Dict, Callable

import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer

import torch
from transformers import BertTokenizer, BertModel

from umap import UMAP

from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch

warnings.filterwarnings('ignore')

In [3]:
re_tokenizer = RegexpTokenizer(r'[А-Яа-яЁё]+|[A-Za-z]+|\w+|[«»\'",.:;!?\(\)\-–—]|[^\w\s]+')
word_tokenize = re_tokenizer.tokenize

In [4]:
data = pd.read_csv('https://raw.githubusercontent.com/nsykhr/russe-wsi-kit/master/data/main/active-dict/train.csv', sep='\t')
data.dropna(subset=['positions'], inplace=True) # let's drop the rows where the relevant token is not present in the context
data.positions = data.positions.apply(lambda x: [tuple(map(int, pos.split('-'))) for pos in x.split(',')])
data

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,1,дар,1,,"[(18, 22)]",Отвергнуть щедрый дар
1,2,дар,1,,"[(21, 28)]",покупать преданность дарами и наградами
2,3,дар,1,,"[(19, 23)]",Вот яд – последний дар моей Изоры
3,4,дар,1,,"[(81, 87)]",Основная функция корильных песен – повеселить ...
4,5,дар,1,,"[(151, 157)]",Но недели две спустя (Алевтина его когда-то об...
...,...,...,...,...,...,...
2068,2069,зонт,1,,"[(85, 91)]","Такая погода легко переживается весной, а вот ..."
2069,2070,зонт,2,,"[(8, 13)]",Пляжный зонт
2070,2071,зонт,2,,"[(18, 25)]",сидеть в кафе под зонтом
2071,2072,зонт,2,,"[(21, 29)]","Cтолики под широкими зонтами, несколько привин..."


In [5]:
data_by_words = {key: [] for key in data.word.unique()}
for _, row in data.iterrows():
    data_by_words[row.word].append((row.context, row.positions, row.gold_sense_id))

# ConvBERT + Clustering

Let's try clustering contextual word embeddings. Through self-attention, BERT encodes each token as a weighted combination of the representations of all the tokens in the sequence, so the embedding of an ambiguous word reflects its context. Therefore, clustering BERT embeddings is similar to the embedding clustering approaches described in, for example, https://arxiv.org/abs/1805.02258.

We do not know the number of clusters in advance, but we can tune the clustering hyperparameters on the training data. Besides the clustering algorithm and its hyperparameters, it is a good idea to try dimensionality reduction, which may help alleviate the curse of dimensionality.

To choose the number of clusters for Agglomerative Clustering, I will be using the idea introduced in https://www.aclweb.org/anthology/S19-2004.pdf (iterating over all sensible values and choosing the one that yields the best silhouette score).
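
The silhouette-based choice of the number of clusters can be illustrated with a small dependency-free sketch. It is not the code used in this notebook: `complete_linkage`, `silhouette` and `pick_n_clusters` are hypothetical helpers, the points are toy data, and the candidate range (2 to min(max_clusters, n - 1), inclusive) is an assumption of the sketch.

```python
from itertools import combinations
from math import dist


def complete_linkage(points, n_clusters):
    """Naive agglomerative clustering with complete linkage; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters whose farthest members are closest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: max(dist(points[i], points[j])
                               for i in clusters[ab[0]] for j in clusters[ab[1]]),
        )
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue
        a = sum(same) / len(same)  # mean distance to the point's own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def pick_n_clusters(points, max_clusters):
    """Try every candidate number of clusters and keep the one with the best silhouette."""
    candidates = range(2, min(max_clusters, len(points) - 1) + 1)
    return max(candidates, key=lambda k: silhouette(points, complete_linkage(points, k)))


# Toy data: three well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_k = pick_n_clusters(toy_points, max_clusters=6)
```

On real embeddings the silhouette curve is usually much flatter than on this toy data, so the selected value can be sensitive to the distance metric and the linkage.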

I chose to use the conversational version of DeepPavlov's RuBERT because it yields significantly better results, according to my preliminary experiments.

In [6]:
RUBERT_PATH = '../../RuBERT/ru_conversational_cased_L-12_H-768_A-12_pt'
USE_GPU = False

tokenizer = BertTokenizer.from_pretrained(RUBERT_PATH, do_lower_case=False)
model = BertModel.from_pretrained(RUBERT_PATH)
if USE_GPU:
    model = model.cuda()
model.eval()
device = torch.device('cuda:0' if USE_GPU else 'cpu')

In [7]:
CLS_ID = tokenizer.vocab[tokenizer.cls_token]
SEP_ID = tokenizer.vocab[tokenizer.sep_token]

In [8]:
def get_contextual_embeddings(sentence: str, char_positions: List[Tuple[int, int]], return_mean: bool = False) -> np.ndarray:
    """
    We need to carefully find the tokens corresponding to the provided positions
    and return either the mean of their first wordpieces' contextual embeddings
    or the mean of all of their wordpieces' contextual embeddings.
    
    1. Tokenize the text with conventional means and write down the relevant tokens' indices.
    2. Tokenize the tokens one by one using BERT wordpiece tokenizer and write down the relevant wordpieces' indices.
    3. Calculate the wordpieces' contextual embeddings by running BERT.
    4. Return the mean of the relevant wordpieces' embeddings.
    """
    
    # We set acc_len to 1 in order to account for the [CLS] token we will prepend to the sequence later.
    input_indices, wordpiece_positions, acc_len = [], [], 1
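    # In this dataset, a span's end index lies one character past the usual exclusive end, hence the -1.
    relevant_tokens = {sentence[pos[0]:pos[1]-1] for pos in char_positions}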
    
    for i, token in enumerate(word_tokenize(sentence)):
        indices = tokenizer.encode(token, add_special_tokens=False)
        input_indices.extend(indices)
        
        if token in relevant_tokens:
            wordpiece_positions.extend(list(range(acc_len, acc_len + len(indices)))
                                       if return_mean else [acc_len])
        
        acc_len += len(indices)
    
    with torch.no_grad():
        model_output = model(torch.tensor([[CLS_ID] + input_indices + [SEP_ID]]).to(device))[0][0]
    
    return np.mean(model_output[wordpiece_positions].detach().cpu().numpy(), axis=0)
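
The index bookkeeping in steps 1 and 2 of the docstring can be checked in isolation with a toy example. This is only a sketch: `toy_wordpieces` is a hypothetical stand-in for the real WordPiece tokenizer (it simply cuts tokens into fixed-size chunks), and `target_piece_positions` is an illustrative helper, not part of the notebook.

```python
def toy_wordpieces(token, piece_len=3):
    """Hypothetical stand-in for WordPiece: fixed-size chunks, '##' marks continuations."""
    pieces = [token[i:i + piece_len] for i in range(0, len(token), piece_len)]
    return pieces[:1] + ['##' + piece for piece in pieces[1:]]


def target_piece_positions(tokens, target, use_all_pieces=False):
    """Return the wordpiece sequence and the target's indices in it (index 0 is [CLS])."""
    sequence, positions = ['[CLS]'], []
    for token in tokens:
        pieces = toy_wordpieces(token)
        if token == target:
            start = len(sequence)  # index of the token's first wordpiece
            positions.extend(range(start, start + len(pieces)) if use_all_pieces else [start])
        sequence.extend(pieces)
    sequence.append('[SEP]')
    return sequence, positions


tokens = ['под', 'широкими', 'зонтами']
sequence, first_only = target_piece_positions(tokens, 'зонтами')
_, all_pieces = target_piece_positions(tokens, 'зонтами', use_all_pieces=True)
# sequence: ['[CLS]', 'под', 'шир', '##оки', '##ми', 'зон', '##там', '##и', '[SEP]']
```

Like the function above, the sketch matches the target by its surface form, so every occurrence of the same word form in the sentence is selected, not only the one at the annotated position.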

In [9]:
X_first = [np.vstack([get_contextual_embeddings(context, pos) for context, pos, _ in word_data])
           for word_data in data_by_words.values()]
X_all = [np.vstack([get_contextual_embeddings(context, pos, return_mean=True) for context, pos, _ in word_data])
         for word_data in data_by_words.values()]

In [10]:
y = [np.array([label for _, _, label in word_data])
     for word_data in data_by_words.values()]

## Hyperparameter tuning

In [11]:
MAX_CLUSTERS = data.groupby('word').aggregate(lambda x: len(set(x)))['gold_sense_id'].max()
MAX_CLUSTERS

17

In [12]:
class ClusteringGridSearch:
    def __init__(self, estimator, param_grid: Dict[str, list],
                 scoring: Callable, needs_n_clusters: bool = False,
                 dimensionality_reducer = None) -> None:
        self.estimator = estimator
        self.param_grid = param_grid
        self.scoring = scoring
        self.results_ = None
        self.needs_n_clusters = needs_n_clusters
        self.dimensionality_reducer = dimensionality_reducer
        
    def fit(self, X, y) -> None:
        self.clear_results()
        
        if self.dimensionality_reducer is not None:
            X = deepcopy(X)
            X = [self.dimensionality_reducer.fit_transform(word_data) for word_data in X]
        
        max_comb = reduce(lambda x, y: x*y, [len(value) for value in self.param_grid.values()])
        
        for i, param_combination in tqdm(enumerate(
            itertools.product(*[list(range(len(value))) for value in self.param_grid.values()])
        )):
            kwargs = {}
            for (key, value), idx in zip(self.param_grid.items(), param_combination):
                kwargs[key] = value[idx]
                self.results_[key][i] = value[idx]
            
            score = 0
            for word_data, word_labels in zip(X, y):
                if self.needs_n_clusters:
                    self.find_optimal_n_clusters(word_data, kwargs)
                estimator = self.estimator(**kwargs)
                
                preds = estimator.fit_predict(word_data)
                score += self.scoring(word_labels, preds) * len(word_data)
            
            score /= sum(len(word_data) for word_data in X)
            self.results_['score'][i] = score
            if i != max_comb - 1:
                self.results_.loc[len(self.results_), :] = [None for _ in range(len(self.param_grid) + 1)]
            
            del estimator
        
        self.results_.score = self.results_.score.astype('float')
    
    def find_optimal_n_clusters(self, X, kwargs) -> None:
        best_score, best_n_clusters = -1, 0
        
        for n_clusters in range(2, min(MAX_CLUSTERS, len(X) - 1)):
            kwargs['n_clusters'] = n_clusters
            estimator = self.estimator(**kwargs)
            preds = estimator.fit_predict(X)
            score = silhouette_score(X, preds)
            
            if score > best_score:
                best_score = score
                best_n_clusters = n_clusters
        
        kwargs['n_clusters'] = best_n_clusters
    
    def clear_results(self) -> None:
        self.results_ = pd.DataFrame([[None for _ in range(len(self.param_grid) + 1)]],
                                     columns = list(self.param_grid.keys()) + ['score'])
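
The score that `fit` stores for each parameter combination is the per-word ARI averaged with weights equal to the number of contexts of each word. Below is a minimal dependency-free sketch of that aggregation; `adjusted_rand_index` and `weighted_ari` are hypothetical helpers written for illustration (the notebook itself uses `sklearn.metrics.adjusted_rand_score`), and each word is assumed to have at least two contexts.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    sum_cells = sum(comb(n, 2) for n in Counter(zip(gold, pred)).values())
    sum_gold = sum(comb(n, 2) for n in Counter(gold).values())
    sum_pred = sum(comb(n, 2) for n in Counter(pred).values())
    expected = sum_gold * sum_pred / comb(len(gold), 2)
    max_index = (sum_gold + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings form a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def weighted_ari(per_word):
    """Average the per-word ARI, weighting each word by its number of contexts."""
    total = sum(len(gold) for gold, _ in per_word)
    return sum(adjusted_rand_index(gold, pred) * len(gold) for gold, pred in per_word) / total


# Two toy "words": a perfect clustering of 4 contexts and an over-split one of 6 contexts.
toy_words = [
    ([1, 1, 2, 2], [0, 0, 1, 1]),
    ([1, 1, 1, 2, 2, 2], [0, 0, 1, 1, 2, 2]),
]
toy_score = weighted_ari(toy_words)
```

With this weighting, words that have many contexts dominate the final score, so a configuration can look good overall while still performing poorly on rare words.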

### Affinity Propagation

In [13]:
grid = ClusteringGridSearch(AffinityPropagation,
                            {
                                'damping': [0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.99],
                                'max_iter': [50, 100, 200]
                            },
                            adjusted_rand_score)

In [14]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

21it [00:52,  2.48s/it]


damping          0.6
max_iter          50
score       0.227508
Name: 6, dtype: object

In [15]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

21it [00:48,  2.32s/it]


damping          0.5
max_iter         200
score       0.230609
Name: 2, dtype: object

### Affinity Propagation + PCA

In [16]:
N_COMP = min(x.shape[0] for x in X_first)
N_COMP

7

In [17]:
grid = ClusteringGridSearch(AffinityPropagation,
                            {
                                'damping': [0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.99],
                                'max_iter': [50, 100, 200]
                            },
                            adjusted_rand_score, dimensionality_reducer=PCA(n_components=N_COMP))

In [18]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

21it [00:37,  1.78s/it]


damping          0.8
max_iter          50
score       0.225494
Name: 12, dtype: object

In [19]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

21it [00:43,  2.06s/it]


damping          0.5
max_iter          50
score       0.223428
Name: 0, dtype: object

### Affinity Propagation + UMAP

In [20]:
grid = ClusteringGridSearch(AffinityPropagation,
                            {
                                'damping': [0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.99],
                                'preference': [None] + list(np.arange(-5.0, -1.0, 0.25)),
                                'max_iter': [50, 100, 200]
                            },
                            adjusted_rand_score, dimensionality_reducer=UMAP(n_components=N_COMP-2, metric='cosine', n_neighbors=10))

In [21]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

357it [10:55,  1.84s/it]


damping            0.5
preference       -4.25
max_iter            50
score         0.319283
Name: 12, dtype: object

In [22]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

357it [09:09,  1.54s/it]


damping            0.8
preference          -4
max_iter            50
score         0.307646
Name: 219, dtype: object

### Agglomerative Clustering

For this clustering method, the number of clusters must be passed explicitly. We are going to infer it from the data, individually for each word, using the silhouette score.

In [23]:
grid = ClusteringGridSearch(AgglomerativeClustering,
                            {
                                'affinity': ['manhattan', 'cosine'],
                                'linkage': ['single', 'average', 'complete']
                            },
                            adjusted_rand_score, needs_n_clusters=True)

In [24]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:49,  8.28s/it]


affinity    manhattan
linkage      complete
score        0.202546
Name: 2, dtype: object

In [25]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:35,  5.91s/it]


affinity      cosine
linkage     complete
score       0.195608
Name: 5, dtype: object

### Agglomerative Clustering + PCA

In [26]:
grid = ClusteringGridSearch(AgglomerativeClustering,
                            {
                                'affinity': ['manhattan', 'cosine'],
                                'linkage': ['single', 'average', 'complete']
                            },
                            adjusted_rand_score, needs_n_clusters=True,
                            dimensionality_reducer=PCA(n_components=N_COMP))

In [27]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:21,  3.57s/it]


affinity      cosine
linkage     complete
score       0.215005
Name: 5, dtype: object

In [28]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:18,  3.16s/it]


affinity      cosine
linkage      average
score       0.214023
Name: 4, dtype: object

### Agglomerative Clustering + UMAP

In [29]:
grid = ClusteringGridSearch(AgglomerativeClustering,
                            {
                                'affinity': ['manhattan', 'cosine'],
                                'linkage': ['single', 'average', 'complete']
                            },
                            adjusted_rand_score, needs_n_clusters=True,
                            dimensionality_reducer=UMAP(n_components=N_COMP-2, metric='cosine', n_neighbors=10))

In [30]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:12,  2.06s/it]


affinity    manhattan
linkage       average
score        0.260198
Name: 1, dtype: object

In [31]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

6it [00:12,  2.05s/it]


affinity    manhattan
linkage       average
score        0.261971
Name: 1, dtype: object

### Birch

In [32]:
grid = ClusteringGridSearch(Birch,
                            {
                                'threshold': [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 5.0, 10.0],
                                'branching_factor': [2, 3, 4, 5, 10, 25, 50, 100]
                            },
                            adjusted_rand_score)

In [33]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [01:47,  1.50s/it]


threshold                  5
branching_factor           2
score               0.210423
Name: 56, dtype: object

In [34]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [01:48,  1.51s/it]


threshold               0.01
branching_factor           2
score               0.213639
Name: 0, dtype: object

### Birch + PCA

In [35]:
grid = ClusteringGridSearch(Birch,
                            {
                                'threshold': [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 5.0, 10.0],
                                'branching_factor': [2, 3, 4, 5, 10, 25, 50, 100]
                            },
                            adjusted_rand_score, dimensionality_reducer=PCA(n_components=N_COMP))

In [36]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [00:37,  1.93it/s]


threshold                  1
branching_factor           4
score               0.220705
Name: 50, dtype: object

In [37]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [00:38,  1.88it/s]


threshold                  1
branching_factor           4
score               0.216445
Name: 50, dtype: object

### Birch + UMAP

In [38]:
grid = ClusteringGridSearch(Birch,
                            {
                                'threshold': [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 5.0, 10.0],
                                'branching_factor': [2, 3, 4, 5, 10, 25, 50, 100]
                            },
                            adjusted_rand_score,
                            dimensionality_reducer=UMAP(n_components=N_COMP-2, metric='cosine', n_neighbors=10))

In [39]:
grid.fit(X_first, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [00:29,  2.40it/s]


threshold               0.75
branching_factor          10
score               0.267725
Name: 44, dtype: object

In [40]:
grid.fit(X_all, y)
grid.results_.loc[grid.results_.score.idxmax()]

72it [00:28,  2.48it/s]


threshold               0.75
branching_factor          10
score               0.255086
Name: 44, dtype: object

# Prediction

The best clustering algorithm turned out to be Affinity Propagation over UMAP-reduced contextual embeddings with the following hyperparameters:
- UMAP: n_components = 5 (or fewer, see the code below), metric = cosine, n_neighbors = 10, default values for the remaining parameters
- AffinityPropagation: damping = 0.5, preference = -4.25, max_iter = 50

Furthermore, it is slightly beneficial to use just the first wordpiece of a token as its contextual representation. One could argue that suffixes do not carry a lot of information relevant for WSI.

Let's run this pipeline for the test data and write down the result to be evaluated with the provided Python script.

P.S. The additional dataset, oddly, uses a different indexing convention for the spans, so we need to correct for that by adding 2 to the end of every span.
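
The span handling can be summarized in a short sketch. `parse_positions` and `target_strings` are hypothetical helpers that mirror the inline lambdas used in the loading cells; the `end_offset` parameter is an assumption introduced here to express the correction for the additional dataset.

```python
def parse_positions(raw, end_offset=0):
    """Parse a 'start-end[,start-end...]' string into a list of (start, end) tuples."""
    spans = []
    for chunk in raw.split(','):
        start, end = map(int, chunk.split('-'))
        spans.append((start, end + end_offset))
    return spans


def target_strings(sentence, spans):
    """Recover the target word forms; the slice ends at `end - 1`, as in get_contextual_embeddings."""
    return [sentence[start:end - 1] for start, end in spans]


main_spans = parse_positions('18-22')                       # main dataset: no correction
main_targets = target_strings('Отвергнуть щедрый дар', main_spans)
additional_spans = parse_positions('88-94', end_offset=2)   # additional dataset: end + 2
```

Keeping the offset in one place makes it harder to forget the correction when switching between the two datasets.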

## Main

In [41]:
data = pd.read_csv('https://raw.githubusercontent.com/nsykhr/russe-wsi-kit/master/data/main/active-dict/test-solution.csv', sep='\t')
data.dropna(subset=['positions'], inplace=True) # let's drop the rows where the relevant token is not present in the context
data.positions = data.positions.apply(lambda x: [tuple(map(int, pos.split('-'))) for pos in x.split(',')])
data

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,2074,давление,1,,"[(0, 9)]",Давление пара создается движением поршня в цил...
1,2075,давление,2.2,,"[(13, 22)]","«У тебя что, давление поднялось?» Я сказал, чт..."
2,2076,давление,2.2,,"[(56, 65)]",Я жалуюсь Никоновичу наконец на головокружение...
3,2077,давление,2.1,,"[(0, 9)]",Давление в котле не менялось
4,2078,давление,2.2,,"[(25, 34)]",Он каждые два часа мерил давление и сахар в крови
...,...,...,...,...,...,...
3724,5798,зуд,2,,"[(43, 47)]",Многих американцев одолевает романтический зуд...
3725,5799,зуд,2,,"[(23, 27)]",Если на нее не находил зуд рассказывания истор...
3726,5800,зуд,2,,"[(27, 33)]","С раздражающей завистью, с зудом неудовлетворе..."
3727,5801,зуд,2,,"[(12, 16)]",Нестерпимый зуд любопытства


In [42]:
data_by_words = {key: [] for key in data.word.unique()}
for i, row in data.iterrows():
    data_by_words[row.word].append((row.context, row.positions, row.gold_sense_id, i))

In [43]:
X = [np.vstack([get_contextual_embeddings(context, pos) for context, pos, _, _ in word_data])
     for word_data in data_by_words.values()]

In [44]:
aff = AffinityPropagation(damping=0.5, preference=-4.25, max_iter=50)

for word, word_data in tqdm(zip(data_by_words, X)):
    n_components = min(5, len(data_by_words[word]) - 2)
    umap = UMAP(n_components=n_components, metric='cosine', n_neighbors=10)
    
    preds = aff.fit_predict(umap.fit_transform(word_data))
    for pred, row in zip(preds, data_by_words[word]):
        data.loc[row[-1], 'predict_sense_id'] = pred  # a single .loc call avoids chained assignment

168it [07:35,  2.71s/it]


In [45]:
data

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,2074,давление,1,2.0,"[(0, 9)]",Давление пара создается движением поршня в цил...
1,2075,давление,2.2,0.0,"[(13, 22)]","«У тебя что, давление поднялось?» Я сказал, чт..."
2,2076,давление,2.2,0.0,"[(56, 65)]",Я жалуюсь Никоновичу наконец на головокружение...
3,2077,давление,2.1,0.0,"[(0, 9)]",Давление в котле не менялось
4,2078,давление,2.2,0.0,"[(25, 34)]",Он каждые два часа мерил давление и сахар в крови
...,...,...,...,...,...,...
3724,5798,зуд,2,1.0,"[(43, 47)]",Многих американцев одолевает романтический зуд...
3725,5799,зуд,2,1.0,"[(23, 27)]",Если на нее не находил зуд рассказывания истор...
3726,5800,зуд,2,0.0,"[(27, 33)]","С раздражающей завистью, с зудом неудовлетворе..."
3727,5801,зуд,2,1.0,"[(12, 16)]",Нестерпимый зуд любопытства


In [46]:
data.to_csv('../data/main/active-dict/result_conv_bert_aff.csv', sep='\t', index=None)

After that, I ran the baseline algorithm (Adagram) on the test data to get its predictions and ran the evaluation script three times to compare the metrics. On the test data, Adagram achieves __0.161189__ ARI, while my algorithm achieves __0.273346__ ARI. On the training data, Adagram achieves only __0.159930__ ARI, while my algorithm achieves __0.319283__ ARI (this figure is optimistic, since the hyperparameters were tuned on the same data). Thus, the approach improves substantially over the baseline and gives a competitive result.

If you choose to re-run the prediction code, keep in mind that UMAP may yield different results depending on its random initialization. Voting over several random seeds could both improve and stabilize the result.
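
One possible way to vote over several random seeds is a co-association consensus: two contexts end up in the same final cluster if they share a cluster in most runs. The sketch below is an assumption about how such voting could look, not something evaluated in this notebook; `consensus_labels` is a hypothetical helper, and `runs` stands for the label lists produced by repeating the UMAP + Affinity Propagation pipeline with different seeds.

```python
from itertools import combinations


def consensus_labels(runs, threshold=0.5):
    """Merge items that share a cluster in more than `threshold` of the runs.

    The final clusters are the connected components of the resulting agreement graph.
    """
    n = len(runs[0])
    parent = list(range(n))  # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        agreement = sum(run[i] == run[j] for run in runs) / len(runs)
        if agreement > threshold:
            parent[find(i)] = find(j)

    # Relabel the components with consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three toy runs over five contexts; cluster ids are arbitrary within each run.
toy_runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
toy_consensus = consensus_labels(toy_runs)
```

Because connected components chain pairwise agreements, two contexts can be merged through a third one even if they rarely co-occur themselves; a stricter threshold, or a proper clustering of the agreement matrix, would limit that effect.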

## Additional

In [47]:
data = pd.read_csv('https://raw.githubusercontent.com/nsykhr/russe-wsi-kit/master/data/additional/active-rutenten/train.csv', sep='\t')
data.dropna(subset=['positions'], inplace=True) # let's drop the rows where the relevant token is not present in the context
data.positions = data.positions.apply(lambda x: [tuple(map(int, pos.split('-'))) for pos in x.split(',')])
data.positions = data.positions.apply(lambda x: [(a, b+2) for a, b in x])
data

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,1,альбом,2,,"[(88, 96)]",достаточно лишь колесиком мышки крутить вниз. ...
1,2,альбом,3,,"[(85, 93)]","выступал в составе команды с таким названием, ..."
2,3,альбом,2,,"[(81, 89)]",". Работает так себе, поскольку функция заточен..."
3,4,альбом,3,,"[(84, 91)]",одержала победу в двух из пяти номинаций: 'Луч...
4,5,альбом,3,,"[(83, 90)]",встречи с Божественным. Вы испытаете ни с чем ...
...,...,...,...,...,...,...
3666,3667,группа,4,,"[(102, 109)]","напротив, цветет пуще прежнего. География расш..."
3667,3668,группа,4,,"[(93, 100)]","синтетической работе, терпение и упорство, жел..."
3668,3669,группа,4,,"[(20, 27)]",Маркетинг процедуры.Группа компаний Кивеннапа ...
3669,3670,группа,4,,"[(100, 107)]",International» признались миллионам слушателей...


There are several sentences in the dataset that cannot be tokenized correctly. I have to correct them manually (by adding a whitespace or two) because I have neither the time nor the resources to use or train a neural tokenizer or to write complex tokenization rules for such sentences.

In [48]:
data.context.loc[494] = '15, 000 на всех членов семьи.Вот тогда и законы будут человечными Анатомия и физиология человека'
data.positions.loc[494] = [(66, 75)]

data.context.loc[1202] = 'выше чем 1 метр или до 12 лет-8 евро, дети ниже чем 1 метр –вход бесплатный Билеты: взрослые-16 евро, дети выше чем 1 метр или старше 10 лет-12 евро, дети до 10 лет'
data.positions.loc[1202] = [(76, 83)]

data.context.loc[1381] = 'крепление Крепление на стену по стандарту VESA 100мм Блок питания внешний'
data.positions.loc[1381] = [(53, 58)]

data.context.loc[1398] = 'Пористые заполнители Блоки оконные'
data.positions.loc[1398] = [(21, 27)]

data.context.loc[1432] = 'СКАТ БЛОКИ ПИТАНИЯ БПУ-24-0,5; БПУ-24-0,7; БПУ-12 -1,5; БПС -12 -0,7; БП-TV1; БП-TV3 БЛОК ПИТАНИЯ СЕТЕВОЙ БПС СИСТЕМА ДИСТАНЦИОННОГО... БПС -12 . Блок питания с симисторами'
data.positions.loc[1432] = [(85, 90)]

data.context.loc[1514] = 'Библиографические ресурсы и каталоги Блок библиографических ресурсов глобальных сетей обширен и разнообразен. Его главной'
data.positions.loc[1514] = [(37, 42)]

data.context.loc[2019] = 'Выпускаемая продукция Вешалка детская'
data.positions.loc[2019] = [(22, 30)]

data.context.loc[2566] = 'Электропроводка для подключения светодиодных знаков в задней части прицепа вилка/розетка) для подключения электрооборудования прицепа к электросети автомобиля'
data.positions.loc[2566] = [(75, 81)]

data.context.loc[2811] = 'Волги только левым расположением запасного колеса. Оно так же прикручено винтом. горизонтальным торсионам и удерживалась ими в открытом положении. Причем оригинальной'
data.positions.loc[2811] = [(73, 80)]

In [49]:
data_by_words = {key: [] for key in data.word.unique()}
for i, row in data.iterrows():
    data_by_words[row.word].append((row.context, row.positions, row.gold_sense_id, i))

In [50]:
X = [np.vstack([get_contextual_embeddings(context, pos) for context, pos, _, _ in word_data])
     for word_data in data_by_words.values()]

In [51]:
aff = AffinityPropagation(damping=0.5, preference=-4.25, max_iter=50)

for word, word_data in tqdm(zip(data_by_words, X)):
    n_components = min(5, len(data_by_words[word]) - 2)
    umap = UMAP(n_components=n_components, metric='cosine', n_neighbors=10)
    
    preds = aff.fit_predict(umap.fit_transform(word_data))
    for pred, row in zip(preds, data_by_words[word]):
        data.loc[row[-1], 'predict_sense_id'] = pred  # a single .loc call avoids chained assignment

20it [01:04,  3.23s/it]


In [52]:
data

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,1,альбом,2,3.0,"[(88, 96)]",достаточно лишь колесиком мышки крутить вниз. ...
1,2,альбом,3,16.0,"[(85, 93)]","выступал в составе команды с таким названием, ..."
2,3,альбом,2,3.0,"[(81, 89)]",". Работает так себе, поскольку функция заточен..."
3,4,альбом,3,2.0,"[(84, 91)]",одержала победу в двух из пяти номинаций: 'Луч...
4,5,альбом,3,10.0,"[(83, 90)]",встречи с Божественным. Вы испытаете ни с чем ...
...,...,...,...,...,...,...
3666,3667,группа,4,1.0,"[(102, 109)]","напротив, цветет пуще прежнего. География расш..."
3667,3668,группа,4,8.0,"[(93, 100)]","синтетической работе, терпение и упорство, жел..."
3668,3669,группа,4,8.0,"[(20, 27)]",Маркетинг процедуры.Группа компаний Кивеннапа ...
3669,3670,группа,4,1.0,"[(100, 107)]",International» признались миллионам слушателей...


In [53]:
data.to_csv('../data/additional/active-rutenten/result_conv_bert_aff.csv', sep='\t', index=None)

On this dataset, my algorithm reaches only __0.057292__ ARI. Clearly, the hyperparameters need to be tuned separately for this data: the current setup predicts far more fine-grained senses than the labels contain (see the table below, which compares the number of gold and predicted senses per word). Since this is impossible, I will try another approach to tackle this data (see the other notebook).

In [54]:
data[['word', 'gold_sense_id', 'predict_sense_id']].groupby(by=['word']).\
    aggregate(lambda x: len(set(x))).applymap(int)

word,gold_sense_id,predict_sense_id
альбом,3,24
анатомия,3,9
базар,3,8
балет,4,10
беда,2,11
бездна,4,7
билет,4,26
блок,8,14
блоха,2,8
брак,2,10
