# Word category formation using BERT predictions

## The idea is to generate word embeddings as follows:
- Get a list of sentences.
- Mask one word in each sentence (repeat a sentence in the list to mask different positions).
- For each sentence, take BERT's prediction for the masked word: the logit vector over the vocabulary produced by the masked-language-model head.
- Cluster the sentences' logit vectors. The clusters should group blanks whose fillers fit together both syntactically and semantically.
- Build each word category from the highest-valued words in the vectors belonging to a cluster (for example, the most common top words, or all words above some threshold).
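
The last step leaves the selection rule open. As a minimal sketch of the "most common top words" option, assuming each sentence in a cluster contributes a list of its top predicted tokens, one could count how many sentences each token appears in. The names `most_common_top_words` and `min_count`, and the toy token lists, are hypothetical and not part of this notebook:

```python
from collections import Counter

def most_common_top_words(top_tokens_per_sentence, min_count=2):
    """Keep tokens that appear in the top-k list of at least min_count sentences."""
    counts = Counter()
    for tokens in top_tokens_per_sentence:
        counts.update(set(tokens))  # count each token at most once per sentence
    return [tok for tok, n in counts.most_common() if n >= min_count]

# Made-up top-5 lists for two sentences assumed to share a cluster
cluster_top_tokens = [
    ["black", "big", "little", "fat", "white"],
    ["white", "black", "red", "pink", "blue"],
]
category = most_common_top_words(cluster_top_tokens)
print(category)
```

A count-based rule like this favors words shared across the cluster's sentences, whereas a probability threshold keeps every word that is likely in at least one of them.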

In [1]:
from transformers import BertTokenizer, BertForMaskedLM
import numpy as np
import torch
import re

In [2]:
with torch.no_grad():
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')
    model.eval()

### Choose some simple sentences with masked adjectives and nouns

In [3]:
text_sentences = """_ fat cat ate the mouse.
The _ cat ate the mouse.
The fat _ ate the mouse.
The fat cat _ the mouse.
The fat cat ate _ mouse.
The fat cat ate the _.
_ was wearing a lovely satin dress last night.
She _ wearing a lovely satin dress last night.
She was _ a lovely satin dress last night.
She was wearing _ lovely satin dress last night.
She was wearing a _ satin dress last night.
She was wearing a lovely _ dress last night.
She was wearing a lovely satin _ last night.
She was wearing a lovely satin dress _ night.
She was wearing a lovely satin dress last _.
_ was receiving quite a hefty salary.
He _ receiving quite a hefty salary.
He was _ quite a hefty salary.
He was receiving _ a hefty salary.
He was receiving quite _ hefty salary.
He was receiving quite a _ salary.
He was receiving quite a hefty _.
_ also bought a used sofa for his new apartment.
He _ bought a used sofa for his new apartment.
He also _ a used sofa for his new apartment.
He also bought _ used sofa for his new apartment.
He also bought a _ sofa for his new apartment.
He also bought a used _ for his new apartment.
He also bought a used sofa _ his new apartment.
He also bought a used sofa for _ new apartment.
He also bought a used sofa for his _ apartment.
He also bought a used sofa for his new _.
_ was born and grew up in Havana.
I _ born and grew up in Havana.
I was _ and grew up in Havana.
I was born _ grew up in Havana.
I was born and _ up in Havana.
I was born and grew _ in Havana.
I was born and grew up _ Havana.
I was born and grew up in _.
_ Beijing metropolitan area added more than a million people in the past decade.
The _ metropolitan area added more than a million people in the past decade.
The Beijing _ area added more than a million people in the past decade.
The Beijing metropolitan _ added more than a million people in the past decade.
The Beijing metropolitan area _ more than a million people in the past decade.
The Beijing metropolitan area added _ than a million people in the past decade.
The Beijing metropolitan area added more _ a million people in the past decade.
The Beijing metropolitan area added more than _ million people in the past decade.
The Beijing metropolitan area added more than a _ people in the past decade.
The Beijing metropolitan area added more than a million _ in the past decade.
The Beijing metropolitan area added more than a million people _ the past decade.
The Beijing metropolitan area added more than a million people in _ past decade.
The Beijing metropolitan area added more than a million people in the _ decade.
The Beijing metropolitan area added more than a million people in the past _.
_ races are held around the lake and farmlands.
Bike _ are held around the lake and farmlands.
Bike races _ held around the lake and farmlands.
Bike races are _ around the lake and farmlands.
Bike races are held _ the lake and farmlands.
Bike races are held around _ lake and farmlands.
Bike races are held around the _ and farmlands.
Bike races are held around the lake _ farmlands.
Bike races are held around the lake and _.
_ racist cousin called me last night.
My _ cousin called me last night.
My racist _ called me last night.
My racist cousin _ me last night.
My racist cousin called _ last night.
My racist cousin called me _ night.
My racist cousin called me last _.
_ device is considered to be available if it is not being used by another adult.
A _ is considered to be available if it is not being used by another adult.
A device _ considered to be available if it is not being used by another adult.
A device is _ to be available if it is not being used by another adult.
A device is considered _ be available if it is not being used by another adult.
A device is considered to _ available if it is not being used by another adult.
A device is considered to be _ if it is not being used by another adult.
A device is considered to be available _ it is not being used by another adult.
A device is considered to be available if _ is not being used by another adult.
A device is considered to be available if it _ not being used by another adult.
A device is considered to be available if it is _ being used by another adult.
A device is considered to be available if it is not _ used by another adult.
A device is considered to be available if it is not being _ by another adult.
A device is considered to be available if it is not being used _ another adult.
A device is considered to be available if it is not being used by _ adult.
A device is considered to be available if it is not being used by another _."""
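
The list above blanks out every word position of each sentence by hand. A small helper could generate the same variants automatically. This is only a sketch: `mask_each_word` is a hypothetical name, and it assumes words are separated by whitespace and that punctuation only trails a word.

```python
import re

def mask_each_word(sentence, blank="_"):
    """Return one copy of the sentence per word, with that word replaced by blank."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        # Keep trailing punctuation (e.g. the final period) outside the blank
        match = re.fullmatch(r"(\w+)(\W*)", word)
        if match is None:
            continue  # skip tokens that are not a plain word plus trailing punctuation
        masked = words[:i] + [blank + match.group(2)] + words[i + 1:]
        variants.append(" ".join(masked))
    return variants

for variant in mask_each_word("The fat cat ate the mouse."):
    print(variant)
```

Joining the variants of several sentences with `"\n"` would give a string in the same format as `text_sentences`.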

In [4]:
# Override the full list above with a shorter one: a single blank per sentence
text_sentences = """The _ cat ate the mouse.
She was wearing a lovely _ dress last night.
He was receiving quite a _ salary.
He also bought a _ sofa for his new apartment.
I was born and grew up in _.
The _ metropolitan area added more than a million people in the past decade.
Bike races are held around the _ and farmlands.
My racist _ called me last night.
A device is considered to be available if it is not being used by another _."""

### Process sentences with BERT

In [5]:
# Replace each blank ("_") with BERT's [MASK] token
MASK = '[MASK]'
sentences = re.sub(r'\b_+\b', MASK, text_sentences).split('\n')

In [6]:
# tokenize input
input_ids = [tokenizer.encode(s, add_special_tokens=True) for s in sentences]

# Find the position of the [MASK] token in each sentence
tok_MASK = tokenizer.convert_tokens_to_ids(MASK)
mask_positions = [s.index(tok_MASK) for s in input_ids]

# Pad all sentences to the same length with the tokenizer's padding id (0 for BERT)
pad_id = tokenizer.pad_token_id
max_len = max(len(i) for i in input_ids)
padded_input = np.array([i + [pad_id] * (max_len - len(i)) for i in input_ids])

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded_input != pad_id, 1, 0)

input = torch.tensor(padded_input)
attention_mask = torch.tensor(attention_mask)

In [7]:
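# Run BERT. With BertForMaskedLM, the first output holds the vocabulary logits
# for every position (batch x sequence length x vocabulary size), despite the variable name.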
with torch.no_grad():
    last_hidden_states = model(input, attention_mask=attention_mask)

In [8]:
# Use the logit vector at the masked position of each sentence as its embedding
embeddings = [lh[m].numpy() for lh, m in zip(last_hidden_states[0], mask_positions)]

In [9]:
def get_top_predictions(probs, k=5, thres=0.01):
    """
    Print the top-k predicted tokens for a probability vector, with their values.
    Return the top-k tokens (highest first) and all tokens with probability above thres.
    """
    # Get top-k tokens
    probs = probs.detach().numpy()
    top_indexes = np.argpartition(probs, -k)[-k:]
    sorted_indexes = top_indexes[np.argsort(-probs[top_indexes])]
    top_tokens = tokenizer.convert_ids_to_tokens(sorted_indexes)
    print(f"Ordered top predicted tokens: {top_tokens}")
    print(f"Ordered top predicted values: {probs[sorted_indexes]}\n")
    
    # Get all tokens above threshold
    high_indexes = np.where(probs > thres)
    high_tokens = tokenizer.convert_ids_to_tokens(high_indexes[0])
    return top_tokens, high_tokens

### Convert the logit predictions to probabilities
We can now see the top predictions for the blank in each sentence, along with their probabilities.
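
As a worked illustration of the conversion, softmax maps a logit vector z to probabilities p_i = exp(z_i) / sum_j exp(z_j). Below is a minimal NumPy sketch, with a made-up three-entry logit vector standing in for BERT's vocabulary-sized one. The names `softmax` and `toy_logits` are illustrative and not part of this notebook:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

toy_logits = np.array([2.0, 1.0, 0.1])  # made-up values
toy_probs = softmax(toy_logits)
print(toy_probs, toy_probs.sum())
```

Softmax preserves the ranking of the logits, so the top-k tokens are the same before and after the conversion. Only the threshold step depends on working with probabilities.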

In [10]:
# Convert the logits at each masked position to probabilities and look up the tokens
sm = torch.nn.Softmax(dim=0)
all_high_tokens = []
for sentence, lh, m in zip(sentences, last_hidden_states[0], mask_positions):
    print("Sentence:")
    print(sentence)
    probs = sm(lh[m])
    _, high_tokens = get_top_predictions(probs)
    all_high_tokens.append(high_tokens)

Sentence:
The [MASK] cat ate the mouse.
Ordered top predicted tokens: ['black', 'cheshire', 'big', 'little', 'fat']
Ordered top predicted values: [0.13267049 0.08640933 0.06516975 0.03538685 0.03100599]

Sentence:
She was wearing a lovely [MASK] dress last night.
Ordered top predicted tokens: ['white', 'black', 'red', 'pink', 'blue']
Ordered top predicted values: [0.20945124 0.16496556 0.13129269 0.08869011 0.05542691]

Sentence:
He was receiving quite a [MASK] salary.
Ordered top predicted tokens: ['good', 'handsome', 'high', 'generous', 'decent']
Ordered top predicted values: [0.18829058 0.09613485 0.09576207 0.0917473  0.0544567 ]

Sentence:
He also bought a [MASK] sofa for his new apartment.
Ordered top predicted tokens: ['new', 'comfortable', 'luxurious', 'large', 'luxury']
Ordered top predicted values: [0.6247967  0.05839209 0.02877485 0.0248212  0.01501671]

Sentence:
I was born and grew up in [MASK].
Ordered top predicted tokens: ['chicago', 'california', 'texas', 'london', 'en

In [20]:
# Cluster embeddings with KMeans
from sklearn.cluster import KMeans
k = 6
# n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
estimator = KMeans(init="k-means++", n_clusters=k)
estimator.fit(embeddings)
estimator.labels_

array([1, 1, 0, 1, 3, 3, 4, 2, 5], dtype=int32)

### Form word categories
For each cluster, collect every word whose probability exceeds the threshold in any of the cluster's vectors; these words form the category.

In [21]:
word_categories = {}
for cl in range(k):
    # Indexes of the sentences assigned to cluster cl
    cluster_members = np.where(estimator.labels_ == cl)[0]
    # Union of the above-threshold tokens of every sentence in the cluster
    word_categories[cl] = set().union(*(all_high_tokens[i] for i in cluster_members))
    print(f"Category {cl}:")
    print(", ".join(word_categories[cl]) + "\n")

Category 0:
nice, modest, handsome, decent, high, comfortable, steady, substantial, generous, respectable, low, small, fine, large, considerable, good

Category 1:
little, evening, new, black, luxury, great, silver, blue, white, gray, yellow, old, green, brown, comfortable, small, giant, leather, silk, pink, red, dead, mother, big, wild, luxurious, purple, wedding, large, fat, cheshire

Category 2:
father, cousin, friends, boss, neighbors, husband, friend, dad, uncle, neighbor, wife, boyfriend, partner, mother, girlfriend, brother, roommate

Category 3:
indianapolis, washington, toronto, london, atlanta, california, austin, cleveland, mexico, brooklyn, philadelphia, denver, chicago, minneapolis, florida, seattle, portland, dallas, germany, detroit, france, louisville, texas, england, pittsburgh, houston, canada

Category 4:
village, forest, hills, forests, gardens, countryside, towns, parks, lake, park, lakes, town, mountains, city, farms, villages, fields, woods

Category 5:
user, par