## Cosine similarity

Redaction of PII in these steps:

- Select $n$ most probable subtitutions
- from these, extract $k$ most similar in terms of cosine similarity (hyper-sphere)
- select randomly

For multisubtoken words, nothing is said(?).
=> Trying by finding the most similar to the entire word. E.g.

- original = contex1 + "Amanda" + context2
- tokenized = "Aman" "da"
- redaction result = yes
- subtitution by token:
1. for context1 + [MASK] [MASK] + context2 the most probable predictions for the **first masked** are $S_1$
2. Find $S^r_1$ closest to "Amanda" where "Amanda" is max pooled form "Aman" and "da"
3. select $s_1$ in $S^r_1$ randomly
4. $S_2$ == most probable for context + s_1 + [MASK] + context2
5. Similarly find one closest to "Amanda" => s_2
6. Assume $s_1$+$s_2$ is a coherent word :)

Notes:

- if using the same model to redact and find substitutions: less computation
- if using different models, you may optimize both



In [1]:
from transformers import pipeline, pipelines, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
MODEL_NAME="xlm-roberta-base"

In [3]:
# taking these from past course work:

def get_embed_for_full_dataset(dataset, lang):
  model_name = MODEL_NAME
  p=pipeline(task="feature-extraction",model=model_name,return_tensors=True,device=0)
  embedded=p(pipelines.pt_utils.KeyDataset(dataset[lang], "text"), batch_size=64, truncation="only_first")

  # to cpu and take the mean over words
  embedded_pooled=[torch.mean(elem,axis=1).cpu() for elem in embedded]
  # to single matrix
  results=torch.vstack(embedded_pooled).numpy()
  return results

def get_embed_for_one_instance(x):
    p=pipeline(task="feature-extraction",model=MODEL_NAME,return_tensors=True,device=0)
    return p(x)

def cosine_sim(x,y):
  M=cosine_similarity(x,y)
  aligned=np.argsort(-M,axis=-1)

  sims=[]
  for i in range(M.shape[0]): #M.shape[0] is the number of rows / input documents
    j=aligned[i,0] # index 1 for 2nd best match => first one if different languages
    score=M[i,j]
    sims.append((i,j,score))
  # sort in descending order  element -> score => sort by score
  sims.sort(key=lambda element:element[2],reverse=True)

  return sims

    

In [4]:
text = "My name is Leena and I like playing the piano"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [5]:
tok = tokenizer(text, return_tensors='pt')
print(tok)
detok = tokenizer.convert_ids_to_tokens(tok["input_ids"][0])
print(detok)

{'input_ids': tensor([[    0,  2646,  9351,    83, 19824,    76,   136,    87,  1884, 75169,
            70, 16569,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['<s>', '▁My', '▁name', '▁is', '▁Lee', 'na', '▁and', '▁I', '▁like', '▁playing', '▁the', '▁piano', '</s>']


In [12]:
# to redact: indices 4,5.
redact_ids = np.array([4,5])
to_redict = tok.input_ids[0][**redact_ids]
to_substitute = tokenizer.decode(to_redact)
good_predictions_1 = ["Mai", "An", "Emi", "Luci"]
choose_index = 1 # An
print(to_substitute)



Lee


In [13]:
# get embed
orig = get_embed_for_one_instance(to_substitute)
guesses = []

In [14]:
print(orig)

tensor([[[ 0.0807,  0.1047,  0.0497,  ..., -0.1252,  0.0364,  0.0175],
         [-0.0448, -0.0033, -0.0233,  ..., -0.0942, -0.0756, -0.0872],
         [ 0.0683,  0.0952, -0.0115,  ..., -0.2160, -0.0362,  0.0587]]])
