bracket in listops task? #6

Closed
sihyun-yu opened this issue Nov 28, 2020 · 8 comments
@sihyun-yu

Hello, thank you for sharing a great benchmark!
I'm working on the 'listops' benchmark with the provided code and hyperparameters.

The paper says the maximum sequence length in this task is 2K, but the code seems to exclude all of the brackets from the sequence.

If the brackets ('(', ')', '[', ']') are counted, the maximum length becomes about 6K, which is much longer than the length stated in the paper.

Shouldn't these brackets be part of the input for this task?

Thank you.

@vanzytay
Collaborator

Thanks for the find. Yes, we need the brackets.

If you switch the tokenizer to tensorflow_text.WhitespaceTokenizer() for now, it should do the trick.

We will push a fix soon. Thanks :)

Based on this vocab set:

[b'7', b'(', b'0', b']', b'[MAX', b'8', b')', b'[MED', b'9', b'6', b'[SM', b'1', b'[MIN', b'5', b'3', b'2', b'4']

the max seq length shouldn't be more than 2K.
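As a rough illustration of what whitespace tokenization keeps, the sketch below splits a short ListOps fragment (taken from the validation example posted further down in this thread) and checks every token against the vocab set listed above. Plain `str.split()` stands in for `tensorflow_text.WhitespaceTokenizer()`, which is not imported here; `VOCAB`, `tokenize`, and `sample` are illustrative names, not part of the repository.

```python
# Sketch only: plain str.split() stands in for a whitespace tokenizer.
# VOCAB mirrors the 17-token vocab set listed above (decoded from bytes).
VOCAB = {"(", ")", "]", "[MAX", "[MIN", "[MED", "[SM"} | set("0123456789")

def tokenize(text):
    """Split on whitespace so brackets and operators stay separate tokens."""
    return text.strip().split()

sample = "( ( ( [SM 3 ) 4 ) ] )"
tokens = tokenize(sample)

# Any token outside the vocab would show up here.
unknown = [t for t in tokens if t not in VOCAB]

print(tokens)
print(len(tokens), unknown)
```

With this kind of split, '(' and ')' are counted as tokens, so they contribute to the measured sequence length.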

@sihyun-yu
Author

sihyun-yu commented Nov 28, 2020

@vanzytay Thank you for the clarification. However, I am still confused about the length of sequences with brackets.

With the above setting, the sequence length exceeds 2K. For instance, the first sequence in the validation set has length 5231. Is there something I am missing?

Thank you.

The input I mentioned is below.

@sihyun-yu
Author

( ( ( ( ( ( ( ( [MAX 1 ) ( ( ( ( ( [MIN 3 ) ( ( ( [MED 6 ) 1 ) ] ) ) 4 ) 5 ) ] ) ) 0 ) ( ( ( ( ( ( [SM ( ( ( ( ( ( ( ( ( [MED 1 ) 6 ) 2 ) 1 ) 0 ) 8 ) 2 ) 3 ) ] ) ) 9 ) ( ( ( ( [MED 0 ) 3 ) 7 ) ] ) ) 2 ) ( ( ( ( ( ( ( ( ( ( ( [SM ( ( ( ( [MIN 8 ) 9 ) 1 ) ] ) ) 3 ) 0 ) 2 ) 5 ) ( ( ( [MED 7 ) ( ( ( ( [SM 4 ) ( ( ( ( ( ( ( [MAX 4 ) 1 ) ( ( ( ( [MAX 9 ) 4 ) 3 ) ] ) ) 1 ) 9 ) 2 ) ] ) ) 3 ) ] ) ) ] ) ) ( ( ( ( ( ( ( ( ( [MED 1 ) ( ( ( ( [MIN 0 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 7 ) 3 ) 9 ) 2 ) 6 ) 3 ) 2 ) 9 ) ( ( ( ( ( ( ( ( ( [SM 9 ) 2 ) ( ( ( ( ( ( ( ( ( ( [MIN 4 ) 3 ) 5 ) 2 ) 6 ) 2 ) ( ( ( ( ( ( ( [MIN 7 ) 2 ) 9 ) 3 ) 6 ) 6 ) ] ) ) ( ( ( [MED 3 ) 5 ) ] ) ) 2 ) ] ) ) ( ( ( ( ( ( ( ( [MED 6 ) 6 ) 5 ) 9 ) 2 ) ( ( ( ( ( ( ( [SM 0 ) 2 ) 9 ) 8 ) 0 ) 7 ) ] ) ) 6 ) ] ) ) 1 ) 9 ) 3 ) 2 ) ] ) ) 9 ) ] ) ) 7 ) ] ) ) 0 ) ( ( ( ( ( ( ( ( ( [MED 1 ) 4 ) 2 ) 0 ) 1 ) ( ( ( ( ( ( [MAX ( ( ( ( ( [MED 8 ) 2 ) ( ( ( ( ( ( ( [MIN ( ( ( ( ( ( ( ( ( ( ( [MIN 1 ) 5 ) 6 ) 9 ) 9 ) 2 ) 2 ) 5 ) 4 ) 0 ) ] ) ) 1 ) ( ( ( ( [MED 2 ) 6 ) 4 ) ] ) ) 7 ) 2 ) ( ( ( [SM 3 ) 4 ) ] ) ) ] ) ) 7 ) ] ) ) ( ( ( ( ( [MIN 1 ) 8 ) 3 ) 6 ) ] ) ) 4 ) 6 ) 6 ) ] ) ) 8 ) 1 ) ] ) ) 6 ) ( ( ( ( ( ( ( ( ( ( [SM 1 ) 7 ) 5 ) 5 ) 5 ) 6 ) 9 ) 8 ) 4 ) ] ) ) 1 ) 5 ) ] ) ) 8 ) 1 ) 1 ) ] ) ) ] ) ) ( ( ( ( ( [MAX 0 ) ( ( ( [MAX 7 ) ( ( ( ( ( [MED 9 ) ( ( ( ( [MIN 2 ) ( ( ( [SM 9 ) ( ( ( [MIN ( ( ( ( ( ( [MAX 2 ) 6 ) 8 ) ( ( ( ( ( ( ( ( [MED 6 ) 7 ) 6 ) 7 ) 7 ) 6 ) 3 ) ] ) ) 2 ) ] ) ) 6 ) ] ) ) ] ) ) ( ( ( ( ( [MIN 0 ) 4 ) 2 ) 1 ) ] ) ) ] ) ) 4 ) 1 ) ] ) ) ] ) ) ( ( ( ( ( ( [MIN 8 ) 2 ) 5 ) 6 ) 7 ) ] ) ) 9 ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [MED 2 ) 8 ) 5 ) 0 ) ( ( ( ( [SM ( ( ( ( ( ( ( ( ( [MIN 3 ) 8 ) 6 ) ( ( ( ( ( ( ( ( ( [SM 7 ) 7 ) ( ( ( ( ( [MAX 7 ) ( ( ( ( [SM 3 ) 5 ) 0 ) ] ) ) 3 ) 9 ) ] ) ) 6 ) ( ( ( ( ( ( [MAX ( ( ( ( [MED ( ( ( ( ( ( ( ( ( ( ( [MED 3 ) ( ( ( ( ( ( ( ( ( ( ( [MED 1 ) 3 ) 1 ) 4 ) 9 ) 2 ) 0 ) 5 ) 3 ) 0 ) ] ) ) 1 ) 5 ) ( ( ( ( ( ( ( ( [SM 3 ) 7 ) 5 ) 2 ) 5 ) 8 ) 3 ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [MAX 2 ) 2 ) 9 ) 5 ) 7 ) 8 ) 
3 ) 3 ) 3 ) 0 ) ] ) ) 7 ) 9 ) 8 ) 5 ) ] ) ) 8 ) 2 ) ] ) ) 8 ) 8 ) 1 ) 9 ) ] ) ) 9 ) 4 ) 5 ) ] ) ) ( ( ( ( ( ( ( ( ( [MAX 0 ) 6 ) 6 ) 7 ) 8 ) 3 ) 5 ) 4 ) ] ) ) 5 ) 9 ) 7 ) ] ) ) 9 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 3 ) ( ( ( ( [MAX ( ( ( ( ( ( ( ( [SM 7 ) 0 ) ( ( ( [MIN ( ( ( ( ( ( [MAX 1 ) ( ( ( ( ( ( ( [SM 2 ) 5 ) 1 ) 4 ) 6 ) 1 ) ] ) ) 6 ) 5 ) 5 ) ] ) ) ( ( ( ( ( ( ( ( ( ( [SM 3 ) 3 ) 6 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 1 ) 9 ) 9 ) 3 ) 2 ) 7 ) 5 ) 0 ) 0 ) 0 ) ] ) ) 8 ) 1 ) 4 ) 9 ) 6 ) ] ) ) ] ) ) 3 ) 6 ) 0 ) ( ( ( ( ( ( ( ( [MED 4 ) 2 ) ( ( ( ( ( ( ( ( [MAX 3 ) 8 ) 4 ) 3 ) 1 ) 3 ) 8 ) ] ) ) 7 ) ( ( ( ( ( ( ( ( ( ( ( [MED 2 ) ( ( ( ( ( [MIN 2 ) 4 ) 9 ) 2 ) ] ) ) 5 ) ( ( ( ( [MED 7 ) 6 ) 7 ) ] ) ) 2 ) ( ( ( [MIN 4 ) 9 ) ] ) ) 9 ) 1 ) ( ( ( [MIN 0 ) 3 ) ] ) ) 8 ) ] ) ) 9 ) ( ( ( ( ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( ( ( [MIN 1 ) 1 ) 7 ) 3 ) 1 ) 8 ) 4 ) 0 ) 2 ) ] ) ) 2 ) 6 ) 5 ) 3 ) 4 ) 3 ) ( ( ( [MAX 8 ) 9 ) ] ) ) 5 ) 6 ) ] ) ) ] ) ) ] ) ) 0 ) ( ( ( [MAX 6 ) 3 ) ] ) ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [SM 2 ) ( ( ( ( ( ( [MIN 2 ) 8 ) 7 ) ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( [MIN ( ( ( ( ( [SM 6 ) 8 ) 2 ) 6 ) ] ) ) ( ( ( ( ( [MIN 3 ) 1 ) 1 ) 4 ) ] ) ) 9 ) 9 ) 4 ) ] ) ) 6 ) 0 ) ( ( ( ( ( ( ( ( ( [SM 1 ) 7 ) ( ( ( ( [MAX 3 ) 5 ) 2 ) ] ) ) 5 ) 3 ) 6 ) 8 ) 8 ) ] ) ) ( ( ( ( ( ( [MAX 6 ) 6 ) ( ( ( ( ( ( ( ( [MIN 5 ) 0 ) 5 ) 1 ) 7 ) 2 ) 0 ) ] ) ) 8 ) 5 ) ] ) ) 9 ) ] ) ) 9 ) ] ) ) ( ( ( ( ( [MAX 9 ) ( ( ( ( ( ( [MED 5 ) 6 ) 2 ) ( ( ( ( ( [SM 4 ) 4 ) 1 ) 1 ) ] ) ) ( ( ( ( ( ( ( ( ( [MAX 2 ) ( ( ( ( ( ( ( ( [MAX 0 ) 9 ) 6 ) 9 ) 7 ) 5 ) 7 ) ] ) ) 3 ) 7 ) 9 ) 2 ) 0 ) 3 ) ] ) ) ] ) ) 6 ) ( ( ( [SM 6 ) ( ( ( ( [SM 8 ) 6 ) 7 ) ] ) ) ] ) ) ] ) ) ( ( ( ( ( ( ( ( ( ( [MAX 7 ) 9 ) ( ( ( ( ( ( ( [MED 5 ) 4 ) 9 ) 2 ) 4 ) 4 ) ] ) ) ( ( ( ( ( [SM ( ( ( [MAX 6 ) 8 ) ] ) ) 5 ) 8 ) 3 ) ] ) ) 9 ) 4 ) 7 ) ( ( ( ( ( ( ( ( ( ( [SM 5 ) 2 ) ( ( ( ( ( ( ( ( [MIN 6 ) ( ( ( ( ( ( ( ( [SM 7 ) 4 ) 3 ) 2 ) 5 ) 6 ) 2 ) ] ) ) 2 ) 1 ) 9 ) 4 ) 7 ) ] ) ) 9 ) 2 ) 1 ) 0 ) 9 ) 6 ) ] ) ) 5 ) ] ) ) 9 ) ( ( ( [SM ( ( ( ( ( ( ( [SM 0 ) 2 ) 
3 ) 6 ) ( ( ( ( ( ( ( ( ( ( [MED 2 ) 5 ) 1 ) ( ( ( ( ( ( ( ( ( ( ( [MED 0 ) 2 ) 6 ) 6 ) 5 ) 5 ) 3 ) 0 ) 8 ) 4 ) ] ) ) 4 ) 1 ) ( ( ( ( ( ( ( [MED 5 ) 1 ) 3 ) 8 ) 1 ) 3 ) ] ) ) ( ( ( ( ( [MAX 0 ) 3 ) 9 ) 4 ) ] ) ) ( ( ( ( ( ( ( ( ( [SM 7 ) 8 ) 4 ) 4 ) 0 ) 3 ) 8 ) 7 ) ] ) ) ] ) ) 3 ) ] ) ) 2 ) ] ) ) 8 ) 2 ) 6 ) ( ( ( ( ( ( [SM 2 ) ( ( ( ( [MAX 2 ) ( ( ( [MED 5 ) 1 ) ] ) ) 6 ) ] ) ) 3 ) 4 ) 0 ) ] ) ) ] ) ) ( ( ( ( [MIN 1 ) 3 ) 3 ) ] ) ) ( ( ( ( ( ( ( ( ( ( [SM 9 ) ( ( ( ( ( ( ( ( ( ( [SM ( ( ( ( ( ( ( [MED ( ( ( ( ( ( ( ( ( ( [MIN 2 ) ( ( ( ( ( ( ( ( [MIN 7 ) 7 ) 6 ) 2 ) 7 ) 1 ) 0 ) ] ) ) 4 ) ( ( ( ( ( ( ( ( ( ( [SM 8 ) 5 ) 9 ) 1 ) 5 ) 9 ) 2 ) 1 ) 5 ) ] ) ) 6 ) 5 ) 8 ) 4 ) 2 ) ] ) ) 2 ) 9 ) 2 ) ( ( ( ( ( [MIN ( ( ( ( ( ( ( ( ( [MED 3 ) 0 ) 5 ) 3 ) 9 ) 5 ) 2 ) 7 ) ] ) ) 0 ) 6 ) 9 ) ] ) ) 6 ) ] ) ) ( ( ( ( ( [MAX 5 ) 6 ) 9 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 9 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 9 ) 1 ) 8 ) 1 ) 6 ) 5 ) 2 ) 8 ) 4 ) 6 ) ] ) ) ( ( ( [SM 1 ) 2 ) ] ) ) 9 ) 5 ) 5 ) 8 ) 6 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 1 ) 8 ) 1 ) 0 ) 9 ) 0 ) 2 ) 5 ) 5 ) 4 ) ] ) ) ( ( ( ( ( ( ( ( ( [MED 9 ) 3 ) 0 ) 3 ) 0 ) 6 ) 2 ) 4 ) ] ) ) ] ) ) ] ) ) 3 ) ( ( ( [MAX ( ( ( ( ( ( ( ( [MED 2 ) 5 ) ( ( ( ( ( ( ( ( ( [MED 8 ) 1 ) 0 ) 7 ) 0 ) 3 ) 6 ) 6 ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [MIN 3 ) 6 ) 4 ) 6 ) 7 ) 3 ) 2 ) 1 ) 0 ) 8 ) ] ) ) 4 ) ( ( ( ( ( ( ( ( ( ( ( [SM 5 ) 2 ) 5 ) 3 ) 2 ) 7 ) 9 ) 1 ) 6 ) 2 ) ] ) ) 7 ) ] ) ) 8 ) ] ) ) 7 ) 9 ) 5 ) 7 ) 8 ) ] ) ) 4 ) 2 ) 1 ) 3 ) 0 ) ( ( ( ( ( ( ( ( ( ( ( [SM ( ( ( ( ( ( ( ( ( ( ( [MAX 8 ) 4 ) 0 ) ( ( ( ( ( ( ( [MED 3 ) 1 ) 8 ) 8 ) ( ( ( ( ( ( ( ( [MAX 3 ) 0 ) 0 ) 2 ) 8 ) 8 ) 0 ) ] ) ) ( ( ( ( ( ( [SM 6 ) 4 ) 0 ) 6 ) 3 ) ] ) ) ] ) ) 7 ) ( ( ( ( ( ( ( ( ( ( ( [MED ( ( ( ( [SM 3 ) 2 ) 7 ) ] ) ) 2 ) 2 ) ( ( ( ( ( ( ( [MAX 2 ) 2 ) 5 ) 7 ) 1 ) 5 ) ] ) ) 7 ) 6 ) 8 ) 9 ) 1 ) 4 ) ] ) ) 2 ) 7 ) 5 ) 8 ) ] ) ) 8 ) 5 ) 7 ) ( ( ( ( ( ( ( ( ( ( [MED 4 ) 8 ) 4 ) 9 ) 6 ) 4 ) ( ( ( ( ( [MIN ( ( ( ( ( ( ( [SM 1 ) 9 ) 5 ) 0 ) 7 ) 7 ) ] ) ) 2 ) 4 ) ( ( ( ( ( ( [MED 6 ) 9 ) 5 ) 0 ) 8 ) ] ) ) ] ) ) 
( ( ( ( ( ( ( ( ( ( [MED 6 ) 0 ) 3 ) ( ( ( ( ( [SM 2 ) 7 ) 5 ) 5 ) ] ) ) 1 ) 3 ) 0 ) ( ( ( ( ( ( ( [MAX 6 ) 8 ) 7 ) 7 ) 9 ) 8 ) ] ) ) 2 ) ] ) ) 0 ) ] ) ) 3 ) 3 ) 6 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 1 ) 0 ) 9 ) 3 ) 2 ) 2 ) 6 ) 6 ) 5 ) 8 ) ] ) ) ( ( ( ( ( ( ( ( ( [MIN 6 ) 3 ) 5 ) ( ( ( ( [SM 5 ) 6 ) 4 ) ] ) ) 2 ) 0 ) 8 ) ( ( ( ( ( [MAX 0 ) ( ( ( ( [MED 0 ) 5 ) 0 ) ] ) ) 0 ) ( ( ( ( ( [MIN 2 ) 2 ) 1 ) 0 ) ] ) ) ] ) ) ] ) ) ] ) ) 8 ) ] ) ) 7 ) 6 ) 8 ) ( ( ( ( ( ( ( ( ( [MIN 2 ) ( ( ( ( ( ( ( ( ( [MIN ( ( ( ( ( ( ( ( ( [MIN 5 ) ( ( ( ( ( [MIN 4 ) 7 ) 2 ) 2 ) ] ) ) 4 ) ( ( ( ( ( ( ( ( ( [MAX 4 ) 3 ) 4 ) 9 ) 6 ) 4 ) 3 ) 8 ) ] ) ) 2 ) 4 ) 5 ) 3 ) ] ) ) ( ( ( ( ( ( ( ( [MED 8 ) 5 ) 5 ) 4 ) ( ( ( ( ( ( ( ( [MED 8 ) 5 ) 2 ) ( ( ( ( ( ( ( ( ( [MAX 4 ) 5 ) 5 ) 6 ) 5 ) 4 ) 6 ) 8 ) ] ) ) 2 ) 7 ) 5 ) ] ) ) 6 ) 3 ) ] ) ) ( ( ( ( ( ( ( [MIN 0 ) 4 ) 2 ) 1 ) ( ( ( ( [MIN 4 ) 9 ) 2 ) ] ) ) 2 ) ] ) ) 0 ) 9 ) 4 ) 3 ) ( ( ( ( ( ( ( ( ( ( [SM 6 ) 6 ) 4 ) 7 ) 4 ) 2 ) 2 ) 9 ) 7 ) ] ) ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( ( ( [MED 7 ) ( ( ( ( ( ( ( ( [SM 8 ) 1 ) ( ( ( ( ( ( ( ( [MED 2 ) 4 ) 7 ) 0 ) 3 ) 1 ) 4 ) ] ) ) ( ( ( ( [MIN 1 ) 9 ) 4 ) ] ) ) 8 ) 0 ) 6 ) ] ) ) ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( [SM 1 ) 4 ) 9 ) 1 ) 1 ) ] ) ) 0 ) 0 ) ( ( ( ( [MAX 7 ) 9 ) 3 ) ] ) ) 3 ) ( ( ( ( ( ( ( ( [SM 7 ) 3 ) 4 ) 8 ) 3 ) 1 ) 4 ) ] ) ) ] ) ) 1 ) 4 ) ( ( ( [MAX 1 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 3 ) 3 ) 0 ) 2 ) 6 ) 4 ) 7 ) 7 ) 0 ) 5 ) ] ) ) ] ) ) 4 ) ( ( ( ( ( ( ( [MIN 9 ) 3 ) 5 ) 1 ) 6 ) 7 ) ] ) ) 3 ) ] ) ) ( ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( [MAX 4 ) 0 ) 0 ) 4 ) 1 ) 1 ) 9 ) ] ) ) 1 ) 1 ) 7 ) 9 ) 2 ) 1 ) ] ) ) 9 ) 4 ) 1 ) 7 ) 5 ) ( ( ( ( ( ( [MAX 6 ) 4 ) 9 ) 5 ) 1 ) ] ) ) 0 ) ( ( ( ( ( ( ( ( [MAX 4 ) ( ( ( [MED 2 ) ( ( ( ( ( [MIN 7 ) 7 ) 2 ) 2 ) ] ) ) ] ) ) ( ( ( ( [SM 8 ) 8 ) ( ( ( ( ( [MED 5 ) 5 ) 6 ) 4 ) ] ) ) ] ) ) 4 ) 1 ) 1 ) 7 ) ] ) ) ] ) ) ( ( ( [MIN 4 ) 6 ) ] ) ) 0 ) ( ( ( ( ( [SM 1 ) ( ( ( ( ( ( ( ( ( ( [MED 2 ) ( ( ( ( [MED ( ( ( [SM 1 ) 2 ) ] ) ) 1 ) ( ( ( ( ( ( ( ( ( [MIN 2 ) 4 ) 6 ) 5 ) 6 ) 
0 ) 9 ) 9 ) ] ) ) ] ) ) 3 ) 5 ) 4 ) 9 ) ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( [MAX 5 ) 3 ) 9 ) 2 ) 9 ) 1 ) 4 ) ] ) ) 2 ) 1 ) 1 ) ] ) ) 5 ) ( ( ( ( ( ( ( [SM 2 ) ( ( ( ( ( ( ( ( ( ( [MIN 9 ) 6 ) 7 ) 5 ) 3 ) 4 ) 9 ) 6 ) 9 ) ] ) ) ( ( ( ( ( [MED 1 ) 3 ) 3 ) 4 ) ] ) ) 0 ) 5 ) ( ( ( ( [MAX 6 ) 6 ) 2 ) ] ) ) ] ) ) ] ) ) 8 ) 8 ) ] ) ) 8 ) 1 ) ] ) ) ( ( ( ( ( ( ( [MIN 4 ) 5 ) 1 ) 7 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 2 ) ( ( ( [SM 3 ) 9 ) ] ) ) 8 ) 7 ) 8 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 6 ) 0 ) 4 ) 8 ) ( ( ( ( ( ( [MIN 1 ) 5 ) 8 ) ( ( ( ( ( ( ( ( ( ( [MAX 5 ) 8 ) 7 ) 6 ) 4 ) 5 ) 4 ) 4 ) 9 ) ] ) ) 0 ) ] ) ) 2 ) 4 ) 0 ) 2 ) ( ( ( ( ( ( ( ( [MAX 8 ) 2 ) 1 ) 4 ) 7 ) 7 ) 6 ) ] ) ) ] ) ) 5 ) ( ( ( ( ( ( ( ( [MIN 6 ) 6 ) 7 ) 8 ) 4 ) 3 ) 7 ) ] ) ) ( ( ( ( ( [MAX 5 ) 1 ) ( ( ( ( ( ( [MED ( ( ( ( [MAX 1 ) 4 ) 8 ) ] ) ) ( ( ( ( ( ( ( ( ( ( [MAX 9 ) 0 ) 6 ) 6 ) 8 ) 0 ) 0 ) 0 ) 1 ) ] ) ) 6 ) 3 ) 4 ) ] ) ) 6 ) ] ) ) 4 ) ] ) ) ( ( ( ( ( ( ( ( ( ( [MAX 5 ) 8 ) 6 ) 0 ) ( ( ( ( ( ( ( ( ( [SM 2 ) 0 ) 0 ) 4 ) 2 ) 0 ) 8 ) 7 ) ] ) ) 6 ) 4 ) ( ( ( ( ( ( [MIN 2 ) 5 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 1 ) 7 ) 9 ) 8 ) 0 ) 8 ) 2 ) 3 ) ( ( ( ( ( ( ( ( [MAX 6 ) 2 ) 6 ) 6 ) 1 ) 0 ) 8 ) ] ) ) ( ( ( ( [SM 2 ) 5 ) 4 ) ] ) ) ] ) ) 7 ) ( ( ( ( ( ( ( ( ( ( [MED 8 ) 1 ) 1 ) ( ( ( ( ( ( ( ( ( [MED 7 ) 8 ) 6 ) 9 ) 7 ) 4 ) 9 ) 3 ) ] ) ) ( ( ( ( ( ( ( ( ( [MAX 7 ) 2 ) 1 ) 7 ) 6 ) 3 ) 5 ) 0 ) ] ) ) 5 ) 4 ) 5 ) 4 ) ] ) ) ] ) ) ( ( ( ( ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( ( ( ( [MIN 2 ) 7 ) 3 ) 1 ) 2 ) 1 ) 1 ) 6 ) ( ( ( ( ( ( ( [MAX 8 ) 7 ) 1 ) 4 ) 6 ) 7 ) ] ) ) 6 ) ] ) ) 2 ) 7 ) 3 ) ( ( ( ( ( [MED ( ( ( ( ( ( ( ( ( ( [SM 1 ) 5 ) 4 ) 3 ) 8 ) 0 ) 7 ) 9 ) 6 ) ] ) ) 5 ) ( ( ( ( ( ( ( ( ( ( ( [MIN 4 ) 2 ) 2 ) 8 ) 3 ) 5 ) 5 ) 4 ) 6 ) 8 ) ] ) ) 1 ) ] ) ) 8 ) 9 ) 8 ) ( ( ( ( ( ( ( ( ( ( [MED 3 ) 0 ) 7 ) 0 ) ( ( ( [MIN 8 ) 1 ) ] ) ) ( ( ( ( ( [MAX 2 ) 4 ) 0 ) 5 ) ] ) ) 3 ) 7 ) 2 ) ] ) ) 4 ) ] ) ) ] ) ) ] ) ) ] ) ) ] ) ) ( ( ( ( ( ( [SM 8 ) 2 ) 7 ) 8 ) 4 ) ] ) ) 5 ) ( ( ( ( ( ( ( ( ( [MAX 0 ) 2 ) 1 ) ( ( ( ( ( ( ( ( ( ( ( [SM 0 ) 9 ) ( ( ( ( ( [MED 9 
) 9 ) 1 ) 5 ) ] ) ) ( ( ( ( ( ( ( [MIN ( ( ( ( ( ( ( ( ( ( [SM 6 ) 0 ) ( ( ( ( ( ( ( ( ( ( ( [SM ( ( ( [MED 6 ) 8 ) ] ) ) 0 ) 2 ) ( ( ( ( ( ( ( [MIN 6 ) ( ( ( ( ( ( ( [MAX 2 ) 7 ) 6 ) 7 ) 0 ) 9 ) ] ) ) ( ( ( ( [SM 3 ) 5 ) 1 ) ] ) ) 5 ) 0 ) 4 ) ] ) ) 2 ) 4 ) 7 ) 0 ) 0 ) 7 ) ] ) ) 6 ) 2 ) 1 ) 1 ) 6 ) 5 ) ] ) ) ( ( ( ( ( [MIN 0 ) 3 ) 8 ) 2 ) ] ) ) ( ( ( ( ( [MAX 8 ) ( ( ( ( ( ( [SM ( ( ( ( ( ( ( ( ( [MAX ( ( ( ( ( ( ( ( [MAX 4 ) 8 ) 3 ) 2 ) 3 ) 9 ) 8 ) ] ) ) 5 ) 6 ) ( ( ( ( ( ( [SM 6 ) 9 ) 1 ) 7 ) 0 ) ] ) ) ( ( ( ( [MAX 2 ) 5 ) 3 ) ] ) ) 8 ) 4 ) 8 ) ] ) ) 3 ) 9 ) 6 ) 1 ) ] ) ) ( ( ( ( ( [SM ( ( ( ( ( ( [MIN ( ( ( ( ( ( [MED 6 ) 4 ) 0 ) 5 ) 9 ) ] ) ) 3 ) 3 ) 7 ) 5 ) ] ) ) 3 ) 9 ) 2 ) ] ) ) 5 ) ] ) ) 7 ) 4 ) 1 ) ] ) ) 3 ) ( ( ( ( ( ( ( ( ( ( [SM 6 ) 2 ) 8 ) 9 ) 3 ) 8 ) 8 ) 5 ) 2 ) ] ) ) 3 ) ( ( ( ( ( ( ( [SM 9 ) ( ( ( [MED 0 ) 5 ) ] ) ) 4 ) ( ( ( ( ( ( ( ( [MAX ( ( ( [MED ( ( ( ( ( ( ( ( ( ( [MAX 2 ) 2 ) 1 ) ( ( ( ( [SM 7 ) 7 ) 9 ) ] ) ) 8 ) 6 ) 2 ) 7 ) 0 ) ] ) ) 5 ) ] ) ) 2 ) 3 ) 3 ) 7 ) 8 ) 2 ) ] ) ) 8 ) 5 ) ] ) ) 2 ) ( ( ( ( ( ( ( [MIN ( ( ( [MIN 5 ) 0 ) ] ) ) 7 ) 8 ) 5 ) 8 ) 8 ) ] ) ) ] ) ) 8 ) 1 ) 9 ) 4 ) ] ) ) 2 ) 4 ) ] ) ) 0 ) ] )

@adamsolomou

> Thanks for the find. Yes, we need the brackets.
>
> If you switch the tokenizer to tensorflow_text.WhitespaceTokenizer() for now it should do the trick.
>
> We will push a fix soon. thanks :)
>
> Based on this vocab set:
>
> [b'7', b'(', b'0', b']', b'[MAX', b'8', b')', b'[MED', b'9', b'6', b'[SM', b'1', b'[MIN', b'5', b'3', b'2', b'4']
>
> the max seq length shouldn't be more than 2K.

Hi @vanzytay, I also have a question following up on your earlier comment. tensorflow_text.WhitespaceTokenizer() is essentially equivalent to applying str.split() to each data point. Doing so, the maximum sequence length in the training set comes out to 5995. I compute it as follows:

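import pandas as pd

# Load the ListOps training split (tab-separated, with a 'Source' column).
train_file = pd.read_csv('lra_release_listops-1000_basic_train.tsv', sep='\t')

# Track the longest whitespace-tokenized sequence.
max_len = 0
for _, l in train_file['Source'].items():  # iterate over (index, value) pairs
    t = l.strip().split()
    if len(t) > max_len:
        max_len = len(t)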

The tokenization scheme based on str.split() gives the following unique tokens:

array(['(', ')', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[MAX', '[MED', '[MIN', '[SM', ']'], dtype='<U4')

which agrees with what you reported earlier (the encoding differs, but the tokens are essentially the same). So I am not sure how a maximum sequence length below 2K can be achieved when the parentheses '(' and ')' are included.

I am also not sure why the parentheses '(' and ')' are needed. The nesting is already encoded by the square brackets '[' and ']'.

Many thanks in advance.
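To make the point about nesting concrete, here is a minimal sketch of a stack-based evaluator that discards '(' and ')' and still recovers the structure from the operator tokens and ']'. The operator semantics (MAX, MIN, MED as median, SM as sum modulo 10) follow the usual ListOps description and are an assumption here, as is truncating the median to an integer; `OPS` and `evaluate` are illustrative names, not code from the repository.

```python
import statistics

# Assumed operator semantics (usual ListOps description, not from this repo):
# MAX, MIN, MED (median, truncated to int), SM (sum modulo 10).
OPS = {
    "[MAX": max,
    "[MIN": min,
    "[MED": lambda xs: int(statistics.median(xs)),
    "[SM": lambda xs: sum(xs) % 10,
}

def evaluate(source):
    """Evaluate a ListOps string while ignoring '(' and ')' entirely."""
    tokens = [t for t in source.split() if t not in ("(", ")")]
    stack = []      # each entry: (operator function, operands collected so far)
    result = None
    for tok in tokens:
        if tok in OPS:
            # An operator token such as '[MAX' opens a new nested list.
            stack.append((OPS[tok], []))
            continue
        if tok == "]":
            # ']' closes the innermost open list and reduces it to one digit.
            op, args = stack.pop()
            value = op(args)
        else:
            value = int(tok)
        if stack:
            stack[-1][1].append(value)
        else:
            result = value
    return result

print(evaluate("( ( ( [SM 3 ) 4 ) ] )"))
print(evaluate("[SM [MAX 5 6 ] [MIN 8 9 1 ] ]"))
```

The evaluator gives the same answer whether or not the parentheses are present, since it never looks at them.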

@sihyun-yu
Author

sihyun-yu commented Nov 28, 2020

Hi @adamsolomou, I have a question about your comment. Have you trained the base Transformer model with '(' and ')' removed during tokenization? I wonder whether the accuracy in that case is similar to the value reported in this repository.

Thank you in advance.

@adamsolomou

Hi @sihyun-yu, no, I have not trained a model yet. I would like to resolve the ambiguity about the maximum sequence length first.

@sihyun-yu
Author

@adamsolomou Thank you for your response.

@vanzytay
Collaborator

vanzytay commented Dec 1, 2020

Hi!

The ( and ) tokens are not necessary, but we do need the closing bracket ] to match operator tokens such as [MAX. I've removed ( and ) from the tokens, and the sequences are now under 2K (could you double-check this?). The ( and ) tokens are only needed for recursive models, and none of our Transformer models require them. You may simply ignore them or filter ( and ) out.

Thanks!
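The filtering described above can be sketched roughly as follows. The helper `listops_length` and the two `samples` (short fragments copied from the validation example earlier in this thread) are illustrative only; the repository's actual preprocessing may differ.

```python
def listops_length(source, drop_parens=True):
    """Count whitespace tokens, optionally filtering out '(' and ')'."""
    tokens = source.strip().split()
    if drop_parens:
        tokens = [t for t in tokens if t not in ("(", ")")]
    return len(tokens)

# Two short fragments from the validation example posted in this thread.
samples = [
    "( ( ( [SM 3 ) 4 ) ] )",
    "( ( ( ( [MIN 8 ) 9 ) 1 ) ] )",
]

max_with_parens = max(listops_length(s, drop_parens=False) for s in samples)
max_without_parens = max(listops_length(s) for s in samples)

print(max_with_parens, max_without_parens)
```

In these fragments the parentheses make up more than half of the tokens, which is in line with the gap between the roughly 6K and 2K lengths discussed in this thread. Running the same helper over the 'Source' column of the training file would be one way to double-check the under-2K figure.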
