Implement modular encoder/decoder class #364

Status: Open, 1 commit to merge into base: master

bkarab03

This PR implements a set of encoder/decoder classes that provide a consistent interface for encoding and decoding text using different schemes like character or BPE encoding.

The main changes:

  • Add EncoderDecoder base class with encode()/decode() methods
  • Implement CharEncoderDecoder and BPEEncoderDecoder subclasses
  • CharEncoderDecoder loads mappings from metadata file
  • BPEEncoderDecoder uses tiktoken for tokenization
  • EncoderDecoder picks correct subclass based on metadata
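
The class layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes the metadata file is a pickle holding `stoi` and `itos` dictionaries, that a missing metadata file means BPE should be used, and that the BPE path uses tiktoken's `gpt2` encoding. The dispatch through `__new__` is one possible way to let `EncoderDecoder(meta_path)` return the right subclass.

```python
import os
import pickle


class EncoderDecoder:
    """Common interface; instantiating the base class picks a subclass."""

    def __new__(cls, meta_path=None):
        target = cls
        if cls is EncoderDecoder:
            # Assumption: an existing metadata file means character-level encoding.
            has_meta = meta_path is not None and os.path.exists(meta_path)
            target = CharEncoderDecoder if has_meta else BPEEncoderDecoder
        return object.__new__(target)

    def encode(self, text):
        raise NotImplementedError

    def decode(self, tokens):
        raise NotImplementedError


class CharEncoderDecoder(EncoderDecoder):
    def __init__(self, meta_path):
        # Assumption: the metadata pickle contains "stoi" and "itos" mappings.
        with open(meta_path, "rb") as f:
            meta = pickle.load(f)
        self.stoi = meta["stoi"]
        self.itos = meta["itos"]

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, tokens):
        return "".join(self.itos[t] for t in tokens)


class BPEEncoderDecoder(EncoderDecoder):
    def __init__(self, meta_path=None):
        # Imported lazily so the character path works without tiktoken installed.
        import tiktoken

        # Assumption: the GPT-2 vocabulary is the desired BPE encoding.
        self.enc = tiktoken.get_encoding("gpt2")

    def encode(self, text):
        return self.enc.encode(text, allowed_special={"<|endoftext|>"})

    def decode(self, tokens):
        return self.enc.decode(tokens)
```

A plain factory function would work equally well for the dispatch; `__new__` is used here only so that the call site can stay `EncoderDecoder(meta_path)`.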

This allows cleanly switching between different encoding implementations by just changing the imported EncoderDecoder class. New schemes can easily be added by implementing a new subclass.
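
As an illustration of adding a scheme, here is a hypothetical byte-level subclass. `ByteEncoderDecoder` is not part of this PR, and the base class below is only a stand-in for the interface described above, so treat this as a sketch of the extension pattern rather than actual project code.

```python
class EncoderDecoder:
    """Stand-in for the PR's base class; only the interface is sketched."""

    def encode(self, text):
        raise NotImplementedError

    def decode(self, tokens):
        raise NotImplementedError


class ByteEncoderDecoder(EncoderDecoder):
    """Hypothetical scheme: each UTF-8 byte is one token (vocabulary of 256)."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        # Invalid byte sequences are replaced rather than raising an error.
        return bytes(tokens).decode("utf-8", errors="replace")
```

Code that only calls `encode()` and `decode()` would not need to change to use such a subclass.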

Some benefits:

  • Consistent usage of encdec.encode() and encdec.decode()
  • Encoder implementation details are hidden
  • Easy to add new encoding types
  • Model code does not change, just the EncoderDecoder class

Let me know if any changes are needed! Please take a look at the EncoderDecoder class design and see if it meets the goals of modular and extensible encodings.

Usage example:

```python
encdec = EncoderDecoder(meta_path)
tokens = encdec.encode(text)
text = encdec.decode(tokens)
```

- Refactored the encoding and decoding process by introducing EncoderDecoder class.
- Separated character and BPE encoding into CharEncoderDecoder and BPEEncoderDecoder classes.
- Updated the sample.py script to utilize the new EncoderDecoder abstraction.
- Improved flexibility for handling different encoding methods.

These changes aim to make the code more modular and extendable, allowing for easier integration of different encoding schemes in the future.