# Use the ngram embeddings
In this program we use word embeddings precomputed with BERT. Besides the direct n-gram to n-gram comparison (gensim's `most_similar` function), we can also compare contextually: we extract the contextual embedding of part of an input sentence and compare it with our word-embedding vocabulary.
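
Both kinds of comparison come down to ranking the vocabulary by cosine similarity against a query vector; what differs is where the query vector comes from. The sketch below illustrates this with a made-up three-dimensional vocabulary. The words, the vectors, and the helper `rank_by_cosine` are invented for illustration and are not part of gensim or of the embedding file used here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, vocab, topn=3):
    """Return the topn (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(query, vec)) for word, vec in vocab.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vocabulary with invented vectors (assumption, not real embeddings)
toy_vocab = {
    "unusual": [0.9, 0.1, 0.0],
    "strange": [0.8, 0.2, 0.1],
    "dataset": [0.0, 0.1, 0.9],
}

# Direct comparison: the query is a vocabulary entry itself
static_query = toy_vocab["unusual"]
# Contextual comparison: the query would come from BERT; here it is made up
contextual_query = [0.7, 0.3, 0.1]

static_ranking = rank_by_cosine(static_query, toy_vocab)
contextual_ranking = rank_by_cosine(contextual_query, toy_vocab)
print(static_ranking)
print(contextual_ranking)
```

In this toy version the query word is ranked along with everything else, so the direct comparison returns the word itself first.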

In [1]:
# numpy, useful for efficient vector operations
import numpy as np

# pytorch 
import torch

# transformer models from huggingface
from transformers import BertTokenizerFast, BertModel

# gensim library
from gensim.models import KeyedVectors

# natural language toolkit
import nltk
from nltk.tokenize import word_tokenize

import re

In [2]:
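# n-gram size of the embedding file to load
NGRAMS = 1
# which BERT hidden state to read embeddings from
# (0 is the embedding-layer output, 12 the last layer of bert-base)
LAYER = 11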

## Load the pre-trained word-embeddings

In [3]:
# load in ngram embeddings
word_model = KeyedVectors.load("academic_ngrams_"+str(NGRAMS)+".kv")

## Load Contextual Model
The contextual model is not needed to find similar words for an n-gram that is already in the vocabulary. We do need it to find similar words for a word in context, or for a word that is not in the word-model dictionary.

In [4]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = "cpu"

In [5]:
# define model name
model = 'bert-base-uncased' # for Norwegian you can use: 'NbAiLab/nb-bert-base' or 'ltgoslo/norbert'

# the tokenizer splits the input text into tokens, which in this case are called wordpieces
tokenizer = BertTokenizerFast.from_pretrained(model)

# load the pre-trained model (downloaded on first use)
bert_model = BertModel.from_pretrained(model)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# move the model to the selected device
bert_model.to(device)
# the empty print keeps the notebook from echoing the full model summary
print()




In [7]:
def tokens_to_embeddings(word_tokens): 
    """Obtain token embeddings and the token-to-word mapping
    from the BERT model

    Parameters
    ----------
    word_tokens : list
        Input list of word tokens

    Returns
    -------
    array
        word index for each wordpiece token (None for special tokens)
    array
        token embeddings from all layers, shape (tokens, layers, hidden size)
    """
    with torch.no_grad(): 
        inputs = tokenizer(word_tokens, return_tensors = "pt", truncation = True, max_length = 512, is_split_into_words=True)
        word_ids = inputs.word_ids(batch_index=0)
        outputs = bert_model(**inputs.to(device), output_hidden_states=True)
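        # one tensor per hidden state: the embedding output plus every encoder layer
        hidden_states = outputs.hidden_states

        token_embeddings = torch.stack(hidden_states, dim=0) # stack all hidden states into the same tensor
        token_embeddings = token_embeddings.squeeze(dim=1) # remove the batch dimension (size 1)
        token_embeddings = token_embeddings.permute(1,0,2) # reorder to (tokens, layers, hidden size)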
        
    # convert to numpy arrays
    word_ids = np.array(word_ids)
    token_embeddings = token_embeddings.to("cpu").detach().numpy()
    
    return word_ids, token_embeddings

def get_substitute_indecies(token_mapping, word_array, substitute_array):
    """Find the token indices of the part of the sentence we would
        like to find substitutes for

    Parameters
    ----------
    token_mapping : array
        word index for each wordpiece token
    word_array : list
        word tokens of the input sentence
    substitute_array : list
        word tokens of the phrase to find substitute alternatives for;
        casing must match word_array (lowercase both if we use
        bert-base-uncased)

    Returns
    -------
    list or bool
        token indices of the substitution phrase, or False if the
        phrase does not occur in the sentence

    """
    word_indecies = False
    # find the word indices where the phrase first occurs
    for i in range(len(word_array)): 
        sub_array = word_array[i:i+len(substitute_array)]
        if (sub_array == substitute_array): 
            word_indecies = [i+j for j in range(len(substitute_array))]
            break
    if (not word_indecies): 
        return False
    
    sub_ids = [token_index for word_index in word_indecies for token_index in np.where(token_mapping == word_index)[0]]

    return sub_ids

def get_substitute_embedding(token_embeddings, substitute_indecies): 
    # average the wordpiece embeddings of the phrase into a single vector
    return np.mean(token_embeddings[substitute_indecies], axis=0)


def find_similar_ngrams(sent, substitute_phrase, top = 10): 
    """Find similar n-grams to the substitute phrase in the
        input sentence

    Parameters
    ----------
    sent : str
        input sentence
    substitute_phrase: str
        subpart of input sentence to find substitutions for
    top: int
        how many substitution suggestions the program should
        give

    Returns
    -------
    list 
        (n-gram, similarity) pairs for the n-grams most similar to the substitution phrase

    """
    # split into word array, both for sent and for sub-phrase
    word_array = word_tokenize(sent)
    sub_array = word_tokenize(substitute_phrase)

    # find embeddings for sentence
    token_mapping, token_embeddings = tokens_to_embeddings(word_array)
    token_embeddings = token_embeddings[:, LAYER, :]

    # extract relevant phrases for ngram
    substitute_indecies = get_substitute_indecies(token_mapping, word_array, sub_array)
    if substitute_indecies is False:
        raise ValueError("substitute_phrase was not found in sent")
    substitute_embedding = get_substitute_embedding(token_embeddings, substitute_indecies)
    
    return word_model.similar_by_vector(substitute_embedding, topn=top)
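
The helpers above rely on two ideas: the tokenizer's word-id mapping tells us which wordpieces belong to which word, and the phrase embedding is the mean of those wordpiece vectors. The sketch below illustrates both in plain Python with an invented mapping and invented two-dimensional vectors; `token_indices_for_words` and `mean_pool` are hypothetical stand-ins, not the functions defined in this notebook.

```python
def token_indices_for_words(word_ids, word_positions):
    """Collect the wordpiece positions that belong to the given word positions."""
    wanted = set(word_positions)
    return [i for i, w in enumerate(word_ids) if w in wanted]

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Invented mapping for a two-word input whose second word splits into
# two wordpieces; None marks special tokens such as [CLS] and [SEP]
word_ids = [None, 0, 1, 1, None]

# Invented 2-d vectors, one per wordpiece (real BERT vectors are much larger)
token_vectors = [[0.0, 0.0], [1.0, 3.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]

# Wordpieces of the word at position 1, then their average
piece_positions = token_indices_for_words(word_ids, [1])
phrase_vector = mean_pool([token_vectors[i] for i in piece_positions])
print(piece_positions, phrase_vector)
```

Averaging gives one fixed-size vector regardless of how many wordpieces the phrase splits into, which is what lets it be compared against the vocabulary.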

In [22]:
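def recommend_changes_fast(sent): 
    """Annotate each word in sent with its closest vocabulary n-gram,
    running BERT once for the whole sentence"""
    word_tokens = nltk.word_tokenize(sent.lower())
    # find embeddings of the sentence words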
    token_mapping, token_embeddings = tokens_to_embeddings(word_tokens)
    token_embeddings = token_embeddings[:, LAYER, :]
    
    for i, token in enumerate(word_tokens): 
        # extract relevant phrases for ngram
        substitute_indecies = get_substitute_indecies(token_mapping, word_tokens, [token])
        substitute_embedding = get_substitute_embedding(token_embeddings, substitute_indecies)
        sub = word_model.similar_by_vector(substitute_embedding, topn=1)
        
        # only suggest correction if it does not match the word itself
        if (sub[0][0] != token and token not in [".", ",", "?", "-"]): 
            word_tokens[i] = word_tokens[i]+"/("+sub[0][0]+")"
    return " ".join(word_tokens)

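def recommend_changes(sent): 
    """Like recommend_changes_fast, but re-runs BERT for every word
    (slower) and marks suggestions as word/suggestion"""
    sent = sent.lower()
    word_tokens = nltk.word_tokenize(sent)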
    for i, token in enumerate(word_tokens): 
        sub = find_similar_ngrams(sent, token, top = 1)
        if (sub[0][0] != token and token not in [".", ",", "?", "-"]): 
            word_tokens[i] = word_tokens[i]+"/"+sub[0][0]
    return " ".join(word_tokens)
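
Both functions look up each token by value through `get_substitute_indecies`, which stops at the first match, so a word that occurs more than once can be paired with the embedding of an earlier occurrence. The plain-Python sketch below shows the difference between value-based and position-based lookup on an invented word list; `first_match_position` is a hypothetical helper used only for this illustration.

```python
def first_match_position(words, target):
    """Value-based search: position of the first word equal to target, or None."""
    for i, word in enumerate(words):
        if word == target:
            return i
    return None

# Invented sentence in which "the" and "works" both occur twice
words = ["the", "code", "works", "because", "the", "test", "works"]

# Value-based lookup: repeated words resolve to their first occurrence
by_value = [first_match_position(words, w) for w in words]

# Position-based lookup: every word keeps its own index
by_position = list(range(len(words)))

for word, v, p in zip(words, by_value, by_position):
    print(f"{word:8s} value-based={v} position-based={p}")
```

In this notebook's terms, a position-based variant could select `np.where(token_mapping == i)[0]` for the i-th word instead of searching for the token. That is a suggestion, not what the cells above do.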

## Test embeddings

In [23]:
word_model.most_similar("code", topn=10)

[('codes', 0.7800700664520264),
 ('script', 0.7725754976272583),
 ('implementation', 0.7720517516136169),
 ('coded', 0.7638325691223145),
 ('suite', 0.7323173880577087),
 ('architecture', 0.7209630012512207),
 ('toolkit', 0.7193244695663452),
 ('api', 0.7193145155906677),
 ('program', 0.7187706232070923),
 ('framework', 0.71236652135849)]

In [25]:
test_sent = "There were many weird data points in the dataset."
substitute_phrase = "weird"
find_similar_ngrams(test_sent, substitute_phrase, top = 10)

[('unusual', 0.7220069766044617),
 ('novel', 0.6847960352897644),
 ('unique', 0.6456605792045593),
 ('unwanted', 0.6455644369125366),
 ('amazing', 0.6397463083267212),
 ('interesting', 0.6375235915184021),
 ('complex', 0.6298054456710815),
 ('phenomena', 0.6267231106758118),
 ('artifacts', 0.6227136850357056),
 ('complicated', 0.6226660013198853)]

In [26]:
sent = "There were many weird data points in the dataset."
recommend_changes_fast(sent)

'there were many weird/(unusual) data points/(point) in the dataset .'
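
The string returned by `recommend_changes_fast` marks each suggestion as `word/(suggestion)`. A minimal sketch, assuming exactly that format, for pulling the pairs back out with `re` (`extract_suggestions` is a hypothetical helper name, not part of the notebook):

```python
import re

def extract_suggestions(annotated):
    """Return (original word, suggested word) pairs from an annotated sentence."""
    # each annotated token looks like word/(suggestion)
    return re.findall(r"(\S+)/\((\S+)\)", annotated)

annotated = "there were many weird/(unusual) data points/(point) in the dataset ."
pairs = extract_suggestions(annotated)
for original, suggestion in pairs:
    print(f"{original} -> {suggestion}")
```

Words without a suggestion are skipped, so the result lists only the proposed changes.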

In [15]:
sudoku_par = """In this chapter we will try and explain how the code implementation works with its SudokuSolver class and functions. We have included some figures to examplify the processes and give an insight to how the program works. To debug the solution we have often printed temporary states to find out where the mistakes occurred. The input format for the sudoku is assumed to be a string where each row comes after one another without spaces and the cells that are not filled are set to 0.
"""

In [30]:
print(recommend_changes_fast(sudoku_par))

in this/(here) chapter/(section) we will try/(examine) and/(to) explain how the code implementation works/(performs) with its sudokusolver/(enchmark) class and functions . we have included some figures/(statistics) to examplify/(simplify) the processes/(process) and give an/(detailed) insight/(understand) to how the program works/(operate) . to debug/(reﬂect) the solution we have often printed temporary/(additional) states to find/(discover) out where/(locations) the mistakes/(errors) occurred/(occurs) . the input format for the sudoku/(heuristic) is assumed to be a string where each row comes/(occurs) after one another without spaces and the cells/(cell) that are not filled/(spaces) are set/(assign) to 0/(zero) .
