# Word category formation using BERT predictions

## The idea is to generate word embeddings as follows:
- Get a list of sentences.
- Mask a word in each sentence (repeat a sentence in the list if you want to mask different positions).
- For each sentence, obtain the logit vector for the masked word from BERT's prediction (the output of the masked-language-model head, with one score per vocabulary token).
- Cluster the sentences' logit vectors. The clusters should reflect words that fit together both syntactically and semantically.
- Build each word category from the highest-valued words in the vectors belonging to a cluster (for example the most common top words, or all words above some threshold).
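The note about repeating a sentence to mask different positions can be sketched as below. This is a minimal illustration, not part of the notebook's pipeline: the helper `mask_each_position` and the whitespace-based word splitting are assumptions made for the example.

```python
MASK = "[MASK]"  # BERT's mask token, as used later in this notebook

def mask_each_position(sentence, positions):
    """Return one copy of `sentence` per index in `positions`, with the
    whitespace-separated word at that index replaced by MASK."""
    words = sentence.split()
    masked = []
    for pos in positions:
        copy = list(words)   # fresh copy so each variant masks only one word
        copy[pos] = MASK
        masked.append(" ".join(copy))
    return masked

variants = mask_each_position("My sister called me last night", [1, 4])
for v in variants:
    print(v)
```

Each variant can then be appended to the sentence list and processed like any other sentence.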

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
import numpy as np
import torch
import re

In [2]:
# Load the pretrained tokenizer and masked-language model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()  # Inference mode: disables dropout

### Choose some simple sentences with masked adjectives and nouns

In [104]:
text_sentences = """The _ cat ate the mouse.
She was wearing a lovely _ dress last night.
He was receiving quite a _ salary.
He also bought a _ sofa for his new apartment.
I was born and grew up in _.
The _ metropolitan area added more than a million people in the past decade.
Bike races are held around the _ and farmlands.
My _ called me last night.
Mozart's _ came from a remote country.
A device is considered to be available if it is not being used by another _."""

### Process sentences with BERT

In [105]:
# Replace each standalone run of underscores with BERT's [MASK] token
MASK = '[MASK]'
sentences = re.sub(r'\b_+\b', MASK, text_sentences).split('\n')
sentences

['The [MASK] cat ate the mouse.',
 'She was wearing a lovely [MASK] dress last night.',
 'He was receiving quite a [MASK] salary.',
 'He also bought a [MASK] sofa for his new apartment.',
 'I was born and grew up in [MASK].',
 'The [MASK] metropolitan area added more than a million people in the past decade.',
 'Bike races are held around the [MASK] and farmlands.',
 'My [MASK] called me last night.',
 "Mozart's [MASK] came from a remote country.",
 'A device is considered to be available if it is not being used by another [MASK].']
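The pattern `\b_+\b` relies on the fact that Python's `re` module treats `_` as a word character. The short check below, using made-up strings, shows that a standalone blank is replaced while an underscore inside an identifier is left alone.

```python
import re

pattern = r'\b_+\b'  # same pattern as in the cell above

# A standalone blank (one or more underscores) is replaced...
a = re.sub(pattern, '[MASK]', 'I grew up in ___.')
# ...but an underscore between letters has no word boundary next to it.
b = re.sub(pattern, '[MASK]', 'the mask_positions list')

print(a)
print(b)
```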

In [106]:
# Tokenize input (adds the [CLS] and [SEP] special tokens)
input_ids = [tokenizer.encode(s, add_special_tokens=True) for s in sentences]

# Find the position of the [MASK] token in each sentence
tok_MASK = tokenizer.convert_tokens_to_ids(MASK)
mask_positions = [s.index(tok_MASK) for s in input_ids]

# Make all sentence arrays equal length by padding with the tokenizer's pad id
max_len = max(len(i) for i in input_ids)
pad_id = tokenizer.pad_token_id
padded_input = np.array([i + [pad_id] * (max_len - len(i)) for i in input_ids])

attention_mask = np.where(padded_input != pad_id, 1, 0)  # Create mask to ignore padding

input = torch.tensor(padded_input)
attention_mask = torch.tensor(attention_mask)
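To make the padding step concrete, here is a small sketch on made-up token ids. The ids in `toy_ids` are hypothetical, and `PAD_ID = 0` is assumed to match the `[PAD]` id of `bert-base-uncased`.

```python
import numpy as np

PAD_ID = 0  # assumed pad id; bert-base-uncased maps [PAD] to 0

# Hypothetical token-id lists of unequal length
toy_ids = [[101, 7, 8, 102], [101, 9, 102]]

max_len = max(len(ids) for ids in toy_ids)
# Right-pad every list to max_len
padded = np.array([ids + [PAD_ID] * (max_len - len(ids)) for ids in toy_ids])
# 1 for real tokens, 0 for padding
mask = (padded != PAD_ID).astype(int)

print(padded)
print(mask)
```

The attention mask tells the model to ignore the padded positions, so the shorter sentence is not affected by its padding.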

In [107]:
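# Run the model. Despite the variable name, the first output holds the
# prediction logits, with shape (batch, seq_len, vocab_size)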
with torch.no_grad():
    last_hidden_states = model(input, attention_mask=attention_mask)

In [108]:
# Use the vocabulary-sized logit vector at each [MASK] position as the embedding of that sentence
embeddings = [lh[m].numpy() for lh, m in zip(last_hidden_states[0], mask_positions)]
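The selection above picks one row per sentence out of a three-dimensional array. A toy version with arbitrary numbers may make the shapes easier to follow; `toy_logits` and `toy_mask_positions` are invented for the example, and NumPy advanced indexing is used in place of the list comprehension.

```python
import numpy as np

# Toy logits with shape (batch=2, seq_len=3, vocab=4); values are arbitrary
toy_logits = np.arange(24, dtype=float).reshape(2, 3, 4)
toy_mask_positions = [1, 2]  # hypothetical [MASK] index in each sentence

# Advanced indexing picks row toy_mask_positions[b] from sentence b
rows = toy_logits[np.arange(len(toy_mask_positions)), toy_mask_positions]

print(rows.shape)
print(rows)
```

Each resulting row has one entry per vocabulary token, which is why the clustering later operates in a vocabulary-sized space.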

In [109]:
def get_top_predictions(probs, k=5, thres=0.01):
    """
    Print the top-k predictions for a probability vector.
    Return (top-k tokens, all tokens with probability above thres).
    """
    # Get top-k tokens
    probs = probs.detach().numpy()
    top_indexes = np.argpartition(probs, -k)[-k:]
    sorted_indexes = top_indexes[np.argsort(-probs[top_indexes])]
    top_tokens = tokenizer.convert_ids_to_tokens(sorted_indexes)
    print(f"Ordered top predicted tokens: {top_tokens}")
    print(f"Ordered top predicted values: {probs[sorted_indexes]}\n")
    
    # Get all tokens above threshold
    high_indexes = np.where(probs > thres)
    high_tokens = tokenizer.convert_ids_to_tokens(high_indexes[0])
    return top_tokens, high_tokens
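The top-k selection in `get_top_predictions` combines `np.argpartition` with `np.argsort`. The sketch below replays that logic on a tiny made-up vocabulary (`toy_vocab` and `toy_probs` are assumptions for illustration), so the two steps can be inspected without loading the model.

```python
import numpy as np

toy_vocab = ["black", "big", "fat", "the", "of", "little"]  # hypothetical
toy_probs = np.array([0.40, 0.25, 0.10, 0.02, 0.03, 0.20])
k = 3

top = np.argpartition(toy_probs, -k)[-k:]    # indexes of the k largest, in arbitrary order
ordered = top[np.argsort(-toy_probs[top])]   # sort those k by descending probability
top_words = [toy_vocab[i] for i in ordered]

# All tokens above a threshold, in vocabulary order
above = [toy_vocab[i] for i in np.where(toy_probs > 0.05)[0]]

print(top_words)
print(above)
```

`argpartition` avoids sorting the whole vocabulary, which matters when the vector has tens of thousands of entries.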

### Convert last-layer logit predictions to probabilities
We can see the highest-probability predictions for the blank in each sentence, along with their probabilities.
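For reference, softmax maps a logit vector z to probabilities exp(z_i) / sum_j exp(z_j). The NumPy sketch below is a stand-in for `torch.nn.Softmax` on made-up logits; subtracting the maximum first is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)  # shifting does not change the output
    exp = np.exp(shifted)
    return exp / exp.sum()

toy_logits = np.array([2.0, 1.0, 0.0, -1.0])
p = softmax(toy_logits)

print(p)
print(p.sum())
```

Because softmax is monotonic, the ranking of tokens is the same for logits and probabilities; only the threshold in `get_top_predictions` depends on the conversion.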

In [110]:
# Convert the logits at each [MASK] position to probabilities and find the top tokens
sm = torch.nn.Softmax(dim=0)
all_high_tokens = []
for sentence, lh, m in zip(sentences, last_hidden_states[0], mask_positions):
    print("Sentence:")
    print(sentence)
    probs = sm(lh[m])
    _, high_tokens = get_top_predictions(probs)
    all_high_tokens.append(high_tokens)

Sentence:
The [MASK] cat ate the mouse.
Ordered top predicted tokens: ['black', 'cheshire', 'big', 'little', 'fat']
Ordered top predicted values: [0.13267049 0.08640933 0.06516975 0.03538685 0.03100599]

Sentence:
She was wearing a lovely [MASK] dress last night.
Ordered top predicted tokens: ['white', 'black', 'red', 'pink', 'blue']
Ordered top predicted values: [0.20945124 0.16496556 0.13129269 0.08869011 0.05542691]

Sentence:
He was receiving quite a [MASK] salary.
Ordered top predicted tokens: ['good', 'handsome', 'high', 'generous', 'decent']
Ordered top predicted values: [0.18829058 0.09613485 0.09576207 0.0917473  0.0544567 ]

Sentence:
He also bought a [MASK] sofa for his new apartment.
Ordered top predicted tokens: ['new', 'comfortable', 'luxurious', 'large', 'luxury']
Ordered top predicted values: [0.6247967  0.05839209 0.02877485 0.0248212  0.01501671]

Sentence:
I was born and grew up in [MASK].
Ordered top predicted tokens: ['chicago', 'california', 'texas', 'london', 'en

In [114]:
# Cluster embeddings with KMeans
from sklearn.cluster import KMeans
k = 5
estimator = KMeans(init="k-means++", n_clusters=k)
estimator.fit(embeddings)
estimator.labels_

array([1, 1, 0, 1, 2, 2, 2, 3, 3, 4], dtype=int32)
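To illustrate what the clustering step does, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) on toy 2-D points. It is not scikit-learn's implementation: the function `lloyd_kmeans` is hypothetical, and it takes fixed initial centers instead of using k-means++ initialization.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, n_iter=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(len(centers)):
            members = points[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels, centers

toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = lloyd_kmeans(toy_points, init_centers=toy_points[[0, 2]])

print(labels)
print(centers)
```

In the notebook the points are the vocabulary-sized logit vectors, so sentences whose masked slots favor similar tokens end up in the same cluster.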

### Form word categories
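For each cluster, collect every token whose probability exceeds the threshold (`thres=0.01` in `get_top_predictions`) in any of the cluster's sentences. The union of these tokens forms the word category.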

In [115]:
word_categories = {}
for cl in range(k):
    # Indexes of the sentences assigned to this cluster
    cluster_members = np.where(estimator.labels_ == cl)[0]
    # Union of the above-threshold tokens of all member sentences
    word_categories[cl] = set().union(*(all_high_tokens[i] for i in cluster_members))
    print(f"Category {cl}:")
    print(", ".join(word_categories[cl]) + "\n")

Category 0:
fine, high, low, steady, comfortable, respectable, good, small, nice, modest, decent, generous, considerable, large, handsome, substantial

Category 1:
luxurious, mother, old, giant, silver, comfortable, big, dead, green, fat, wild, purple, cheshire, yellow, red, evening, wedding, silk, little, white, new, small, great, leather, large, brown, luxury, blue, black, gray, pink

Category 2:
dallas, washington, london, woods, seattle, canada, philadelphia, france, gardens, lakes, portland, fields, atlanta, villages, austin, forest, pittsburgh, minneapolis, california, england, florida, mexico, parks, forests, toronto, park, farms, village, countryside, lake, hills, mountains, louisville, brooklyn, cleveland, indianapolis, town, texas, germany, denver, city, towns, detroit, chicago, houston

Category 3:
friend, parents, family, mother, dad, aunt, name, mom, music, wife, violin, brother, sister, voice, grandfather, orchestra, father

Category 4:
device, company, provider, customer
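The introduction also mentions building categories from the most common top words instead of a probability threshold. A possible sketch of that variant follows; the function `categories_by_frequency`, the `min_count` parameter, and the toy inputs are all assumptions for illustration and were not run against the model output above.

```python
from collections import Counter

# Hypothetical top-k token lists, one per sentence, and their cluster labels
toy_top_tokens = [
    ["black", "big", "little"],
    ["white", "black", "red"],
    ["chicago", "texas", "london"],
    ["chicago", "london", "paris"],
]
toy_labels = [0, 0, 1, 1]

def categories_by_frequency(top_tokens, labels, min_count=2):
    """Keep, per cluster, the tokens that appear among the top
    predictions of at least `min_count` member sentences."""
    counts = {}
    for tokens, label in zip(top_tokens, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: sorted(tok for tok, n in counter.items() if n >= min_count)
        for label, counter in counts.items()
    }

result = categories_by_frequency(toy_top_tokens, toy_labels)
print(result)
```

Requiring a token to recur across sentences would likely filter out one-off predictions such as "cheshire", at the cost of smaller categories when clusters contain few sentences.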