Here, the corrupted tokens are produced by the generator as fake data. I can understand why we should deal with duplicated positions and only apply the replacement once. However, I am confused about the implementation in electra/pretrain/pretrain_helpers.py (lines 101 to 103 in 19175a1), which takes the average value of the corrupted token ids at duplicated positions. What is the intuition behind it?
scatter_nd sums up values that share the same index, but we want to pick just a single value per index. If the values for an index are the same (e.g., we are just scattering a mask tensor of 1s), then dividing the summed values by the number of occurrences at that index fixes the issue. That's what is implemented in the code you linked to.
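Here is a minimal standalone sketch of that divide-by-count trick (illustrative shapes and values only, not the actual code at the linked lines):

```python
import tensorflow as tf

# Scatter a value of 1 into each sampled position; tf.scatter_nd SUMS
# updates that land on the same index, so a duplicated position overflows.
seq_len = 8
positions = tf.constant([[2], [5], [5]])            # position 5 sampled twice
updates = tf.ones([3], dtype=tf.int32)

summed = tf.scatter_nd(positions, updates, [seq_len])
# summed = [0, 0, 1, 0, 0, 2, 0, 0]  -> position 5 became 2 instead of 1

# Count how many updates hit each position, then divide to undo the summing.
counts = tf.scatter_nd(positions, tf.ones([3], dtype=tf.int32), [seq_len])
fixed = summed // tf.maximum(counts, 1)
# fixed = [0, 0, 1, 0, 0, 1, 0, 0]  -> a clean 0/1 mask again
```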
However, this does NOT fully fix having duplicate masked positions. It does prevent errors from the summed ids overflowing past the vocab size, but if (1) the same position is sampled twice and (2) the generator samples different tokens for that position, then the replaced token will be a "random" token obtained by averaging the two sampled token ids. I didn't bother fixing this issue when developing ELECTRA because it occurs for a pretty small fraction of masked positions. But there is actually an easy fix: replace the sampling step (masked_lm_positions = tf.random.categorical... in pretrain_helpers.mask) with
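As an aside, here is a hypothetical numeric illustration of the averaging artifact described above (made-up token ids, not the replacement sampling code referred to in the previous sentence):

```python
import tensorflow as tf

# The generator samples two DIFFERENT tokens (2003 and 2024) for the SAME position 3.
seq_len = 6
positions = tf.constant([[3], [3]])
sampled_ids = tf.constant([2003, 2024], dtype=tf.int32)

summed = tf.scatter_nd(positions, sampled_ids, [seq_len])        # position 3 -> 4027
counts = tf.scatter_nd(positions, tf.ones([2], dtype=tf.int32), [seq_len])
averaged = summed // tf.maximum(counts, 1)
# averaged[3] == 2013, which is neither sampled token: effectively a "random" replacement.
```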