You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I wonder if byte-pair encoding would be an interesting algorithm to implement in this chapter. I suspect it's probably right-sized for implementing in a book chapter. While it's not a state-of-the-art compressor today, it is SOTA for NLP tokenization used in LLMs like the GPTs. That offers an opportunity to talk about some relevant topics in software engineering ethics using the implemented compressor as a demonstration.
For example, pretraining the compression dictionary on the English version of SDXJS probably handles the English SDXPY pretty reasonably. It probably does less well, but okay on Shakespeare, and probably terribly on Atukagawa Ryūnosuke. As we engineer our tools to be more data-driven, availability biases in how we obtain the data to build those tools have consequences that we need to think about.
No description provided.
The text was updated successfully, but these errors were encountered: