Skip to content

Conversation

aflah02
Copy link
Collaborator

@aflah02 aflah02 commented Apr 15, 2022

Fixes #113
Opening this PR so that we can discuss the name of the utility
I'll write tests as discussed in the issue and fix some problems currently in the meanwhile

@aflah02
Copy link
Collaborator Author

aflah02 commented Apr 15, 2022

@mattdangerw Since I also need to test the Unicode tokenizer all it's commits will reflect here too. Should I close the other PR then?

@aflah02
Copy link
Collaborator Author

aflah02 commented Apr 15, 2022

@mattdangerw The PR is ready for review.
There is an issue with the tests with ragged values but when I run it simply on colab nothing happens and it works as intended as can be seen here however the test fails and I suspect it is due to the warning which is raised due to some other components which I haven't tinkered with :
image

@mattdangerw
Copy link
Member

@aflah02 sorry missed this comment. We should land this as a follow up to the other PR. Can you rebase this so we can review just this change?

Also I would add your tests to the base tokenizer as a that's where the functionality is. You may need to make a new simple subclass of the Tokenizer base class in the unit test.

Re that error, I am not sure what is going on there, it may be a incompatibility between tensorflow and numpy when you call ragged.to_list(). If so we probably don't need to fix that here.

@aflah02
Copy link
Collaborator Author

aflah02 commented Apr 16, 2022

@mattdangerw
No Worries!
Sure I'll do that, also I might refork and make the changes there as I messed up my branch locally and need to fix that as well. I'll open a new PR then and reference this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a utility to detokenize as lists of python strings
2 participants