Adding Utility to Detokenize as list of Strings to Tokenizer Base Class #119

aflah02 · 2022-04-15T06:01:59Z

Fixes #113
Opening this PR so that we can discuss the name of the utility.
In the meantime, I'll write tests as discussed in the issue and fix some outstanding problems.

aflah02 · 2022-04-15T06:03:32Z

@mattdangerw Since I also need to test the Unicode tokenizer, all its commits will show up here too. Should I close the other PR, then?

aflah02 · 2022-04-15T11:43:24Z

@mattdangerw The PR is ready for review.
There is an issue with the test for ragged values. When I run the same code on Colab, it works as intended, but the test fails. I suspect this is due to a warning raised by other components that I haven't touched.

mattdangerw · 2022-04-16T05:38:28Z

@aflah02 sorry, I missed this comment. We should land this as a follow-up to the other PR. Can you rebase this so we can review just this change?

Also, I would add your tests to the base tokenizer, as that's where the functionality is. You may need to make a new simple subclass of the Tokenizer base class in the unit test.

Re that error, I am not sure what is going on there. It may be an incompatibility between TensorFlow and NumPy when you call ragged.to_list(). If so, we probably don't need to fix that here.
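
For illustration only, the suggestion above (test the base-class utility through a simple subclass) might look roughly like the sketch below. It is pure Python and does not use TensorFlow or keras-nlp: `StandInTokenizer`, `SimpleByteTokenizer`, and `decode_nested` are hypothetical names, and nested Python lists stand in for what a call such as ragged.to_list() would return. The method name `detokenize_to_strings` is taken from the commit list in this PR, and the name was still under discussion.

```python
def decode_nested(value, encoding="utf-8"):
    """Recursively decode bytes found inside arbitrarily nested lists."""
    if isinstance(value, bytes):
        # Scalar case: a single byte string becomes a single Python str.
        return value.decode(encoding, errors="replace")
    # List case: decode each element, preserving the (possibly ragged) nesting.
    return [decode_nested(item, encoding) for item in value]


class StandInTokenizer:
    """Hypothetical stand-in for a tokenizer base class (not keras-nlp code)."""

    def detokenize(self, inputs):
        raise NotImplementedError

    def detokenize_to_strings(self, inputs):
        # The utility lives on the base class, so any subclass inherits it.
        return decode_nested(self.detokenize(inputs))


class SimpleByteTokenizer(StandInTokenizer):
    """Minimal subclass for tests: each row of ints holds one string's bytes."""

    def detokenize(self, inputs):
        if all(isinstance(i, int) for i in inputs):
            return bytes(inputs)  # one row of ids (also covers an empty row)
        return [self.detokenize(row) for row in inputs]


tokenizer = SimpleByteTokenizer()
single = tokenizer.detokenize_to_strings([104, 105])
ragged = tokenizer.detokenize_to_strings([[104, 105], [111, 107, 33]])
```

Keeping the decoding step in the base class means the test only needs the smallest possible `detokenize` implementation, and rows of different lengths exercise the ragged case without depending on any particular tensor library.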

aflah02 · 2022-04-16T06:58:31Z

@mattdangerw
No worries!
Sure, I'll do that. I might also re-fork and make the changes there, since I messed up my branch locally and need to fix that as well. I'll then open a new PR and reference this one.

aflah02 added 30 commits April 8, 2022 10:51

Debugging

f04325c

Debugging

20db313

Fixed Sequence Length Issue

e0b6c44

Sequence Length Changes

f949f5a

Removed _ From Class Attributes

ac2bb89

Fixed Null Bytes in Detokenization

1d1a1a2

Testing regex_replace

ef1b5b6

Testing

0de3153

Helper Function and Debug Statements

161e316

Testing Regex Replace New Ordering

8054855

Added Checks for Errors and Normalization Form

d638260

Doc String Completed

5fad8ad

Ran lint/format

a6b095f

New Tests and Decoding Changes

c45de16

Changes

78f4da7

Minor Tweak

927fdc6

Tweaking Detokenizer

68830d4

Added Tests and Updated Docstrings

7137c39

Ran format.sh and lint.sh

8cd02d2

Refactoring and Removing Unused Lines

11e5eed

Fixed Some Broken Tests

09f5f30

Fixed All Tests

91c06af

Testing Decode

24fb3ac

Testing

2ded9a7

Debug

82ee48c

Fixes + Replaced Regex with BooleanMask

43c33c8

Added Debug Lines

0731294

Added Debug Line for .numpy()

4da8739

Testing Byte Tokenizer Approach

996fd25

Testing With Unicode_transcode

44b01f7

aflah02 added 11 commits April 15, 2022 03:33

Added debug lines

dc60dfb

More Debug Statements

e46e4b4

Fixed Error

f9a3055

Minor Fix for Scalar

52ee1c7

Added detokenize_to_strings to UnicodeTokenizer

ce4cb09

Added Decode for Scalar

5d8575d

Refactored Method to Base Class

b98837e

Fixed Docstring and Improved Examples

ba76dcc

Merge branch 'keras-team:master' into master

ac55c10

Ran format and lint

96cf050

Merge branch 'keras-team:master' into DetokenizingToString

77eb4b8

aflah02 added 13 commits April 15, 2022 11:41

Changed to .shape.rank from .ndim added some tests

854f788

Merge branch 'DetokenizingToString' of https://github.com/aflah02/ker…

706c2cf

…as-nlp into DetokenizingToString

Added Recursive Decoding

3e572ef

Added New Tests, To Fix Ragged Test

9fba803

Fixed Indentation

7f5c481

Decode When Scalar

a95404a

Fixed Map Issue

c5b6dbc

Merge pull request #1 from aflah02/master

86e0478

Changes Pulled

Changes

52254de

Merge branch 'DetokenizingToString' of https://github.com/aflah02/ker…

13148d3

…as-nlp into DetokenizingToString

Working for Scalar

6963ca5

Ran lint and format

00c3f06

Added Tests to Unicode and Byte Tokenizer

b2c6fde

aflah02 closed this Apr 16, 2022

aflah02 mentioned this pull request Apr 16, 2022

Adding Utility to Detokenize as list of Strings to Tokenizer Base Class #124

Merged
