file compressor #144

Open
gvwilson opened this issue May 8, 2023 · 1 comment · May be fixed by #280
Labels
new-topic Ideas for new chapters

Comments

@gvwilson
Owner

gvwilson commented May 8, 2023

No description provided.

@gvwilson gvwilson added this to the v1 milestone May 8, 2023
@gvwilson gvwilson self-assigned this May 8, 2023
@gvwilson gvwilson transferred this issue from another repository Jul 6, 2023
@gvwilson gvwilson added the "to-add" (Add something new) and "in-content" (Lesson content) labels Jul 6, 2023
@gvwilson gvwilson removed their assignment Jul 6, 2023
@gvwilson gvwilson modified the milestones: v1, v2 Jul 6, 2023
@gvwilson gvwilson added the "new-topic" (Ideas for new chapters) label and removed the "to-add" (Add something new) and "in-content" (Lesson content) labels Jul 6, 2023
@gvwilson gvwilson changed the title from "add a chapter on compression algorithms" to "file compressor" Jul 6, 2023
@rkern

rkern commented Jul 13, 2023

I wonder if byte-pair encoding would be an interesting algorithm to implement in this chapter. I suspect it's right-sized for a book chapter. While it's not a state-of-the-art compressor today, it is SOTA for the NLP tokenization used in LLMs like the GPTs. That offers an opportunity to talk about some relevant topics in software engineering ethics using the implemented compressor as a demonstration.

For example, a compression dictionary pretrained on the English version of SDXJS probably handles the English SDXPY reasonably well. It probably does less well, but still okay, on Shakespeare, and terribly on Akutagawa Ryūnosuke. As we engineer our tools to be more data-driven, availability biases in how we obtain the data to build those tools have consequences that we need to think about.
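
To make that concrete, here is a rough sketch of byte-level byte-pair encoding in Python. It is only an illustration under stated assumptions, not a proposal for the chapter's actual code: the function names (`learn_merges`, `replace_pair`, `encode`, `decode`, `compression_ratio`) are made up for this comment, the training and test strings are short placeholders rather than the real SDXJS or SDXPY text, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def replace_pair(symbols, pair, new_symbol):
    """Replace each non-overlapping occurrence of pair with new_symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(data, num_merges):
    """Learn up to num_merges merges from bytes; new symbols start at 256."""
    symbols = list(data)
    merges = []
    while len(merges) < num_merges:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging no longer helps
        symbols = replace_pair(symbols, pair, 256 + len(merges))
        merges.append(pair)
    return merges


def encode(data, merges):
    """Apply the learned merges, in the order learned, to new bytes."""
    symbols = list(data)
    for offset, pair in enumerate(merges):
        symbols = replace_pair(symbols, pair, 256 + offset)
    return symbols


def decode(symbols, merges):
    """Expand merged symbols back into the original bytes."""
    table = {i: bytes([i]) for i in range(256)}
    for offset, (a, b) in enumerate(merges):
        table[256 + offset] = table[a] + table[b]
    return b"".join(table[s] for s in symbols)


def compression_ratio(text, merges):
    """Encoded symbol count divided by UTF-8 byte count (lower is better)."""
    raw = text.encode("utf-8")
    return len(encode(raw, merges)) / len(raw)


# Placeholder corpora (assumptions, not the real book text).
training = "the quick brown fox jumps over the lazy dog " * 20
english = "the dog jumps over the quick fox"
japanese = "羅生門"  # title of a story by Akutagawa

merges = learn_merges(training.encode("utf-8"), 50)
print("English ratio: ", compression_ratio(english, merges))
print("Japanese ratio:", compression_ratio(japanese, merges))
```

Because the merges are learned only from ASCII English bytes, none of them can match the multi-byte UTF-8 sequences in the Japanese string, so it gets no compression at all, while the English test string shrinks. That is the availability bias in miniature: the dictionary is only as good as the data we happened to have on hand.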

@gvwilson gvwilson self-assigned this May 26, 2024
@gvwilson gvwilson linked a pull request Jun 9, 2024 that will close this issue