
📓 Text Generation docs rework #24575

Closed
6 of 8 tasks
gante opened this issue Jun 29, 2023 · 8 comments
Comments

@gante
Member

gante commented Jun 29, 2023

What is this?

This is an issue to discuss and track the rework of the docs for text generation. Comments and feedback are appreciated, as always 🤗

Current issues

  1. Our main reference for text generation is not in the docs and is quite outdated
  2. The docs regarding text generation are scattered, and it is not simple to navigate between them -- the reader has to know where to look for them
  3. We lack examples beyond the simplest forms of text generation
  4. We have undocumented advanced use cases, such as setting custom stopping criteria
  5. We are not clear about what the user can't do

Proposed plan

EDIT:

I'd like to split the plan into three parts:

  1. Designing a simpler entry point to text generation, from which all related documentation is discoverable
  2. Upgrading the developer guides to cover the full potential of text generation
  3. Making our code more self-documenting, plus other code changes

1. Designing a simpler entry point for text generation docs

Tackles issues 1 and 2.

This part is further divided into two actions:

  • The [blog post](https://huggingface.co/blog/how-to-generate) is still a solid reference for the background of text generation, but it holds old examples (TensorFlow!) and focuses a bit too much on top_p/top_k. Let's retouch it.
  • Create a short tutorial to serve as an entry point to the multiple forms of text generation. Like the other tutorials, it contains references to related docs throughout the text (let's see if it is enough to handle discoverability -- we can create a stand-alone related docs section in the future if needed). It would also cover a few basics like "use left-padding when doing batched generation with decoder-only models" and "double-check your generate kwargs".
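
As a taste of what such a tutorial could open with, here is a minimal sketch of the two most common decoding modes, using gpt2 purely as a placeholder model (the actual tutorial content and model choice are still to be decided):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The future of AI is", return_tensors="pt")

# Greedy decoding (the default): deterministic, always picks the most likely token
greedy = model.generate(**inputs, max_new_tokens=20)

# Sampling: note that top_k/top_p only take effect when do_sample=True,
# which is exactly the kind of "double-check your generate kwargs" gotcha to cover
sampled = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.95, max_new_tokens=20)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```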

Related docs:

  1. Tasks
  2. Related developer guides
  3. API reference
  4. Outside transformers (e.g. optimum, text-generation-inference, LLM leaderboard, non-HF libs like autogptq?)

2. Upgrading the developer guides

Tackles issues 3 and 4.

We currently have one developer guide, which writes about the API and a few basic ways to manipulate text generation. I propose we improve the existing one and add 2 new guides, preferably with examples that cover more modalities and use cases:

  • 1. Improve the existing guide -- Add a section about the impact of logits processors, and another on how stopping conditions operate.
  • 2. "Prompting" -- Some basic dos and don'ts regarding prompting, how different types of models respond differently to it (encoder-decoder vs decoder-only, instruction-tuned vs base), and the importance of prompting in chat applications
  • 3. Using LLMs, with a focus on the 1st L (large) -- write about variable types, quantization, device mapping, advanced architectures (ALiBi, RoPE, MQA/GQA), flash attention
  • 4. Advanced examples (name?) -- Concrete use cases that make use of many features at once, to serve as inspiration: how to control between extractive and abstractive summarization, retrieval-augmented generation, and other modality-specific examples
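
For instance, the section on how stopping conditions operate could include something like this sketch of a custom stopping criterion (gpt2 and the class name are placeholders for illustration; the exact `StoppingCriteria` interface has evolved across transformers versions, so treat this as an assumption to verify against the current API):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnSubstring(StoppingCriteria):
    """Stop generation as soon as `stop_string` appears in the decoded text."""

    def __init__(self, tokenizer, stop_string):
        self.tokenizer = tokenizer
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        # Decode the current sequence and stop once the target string shows up
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return self.stop_string in text

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("1, 2, 3, 4,", return_tensors="pt")

criteria = StoppingCriteriaList([StopOnSubstring(tokenizer, "7")])
out = model.generate(**inputs, max_new_tokens=40, stopping_criteria=criteria)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```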

3. Self-documenting code and other code changes

Tackles issues 3 and 5.

  • Let's be honest -- the best user experience is when no docs are needed at all. We can up our game here by performing parameter validation. Currently, our validation step is very superficial, and users are allowed to do things like passing temperature with do_sample=False, ultimately resulting in GH issues. I'd suggest performing a hard validation and throwing informative exceptions, pointing to the redesigned docs 🤗

  • In parallel, our logits processors and stopping condition classes are missing docstring examples on how to use them. This should make our API reference much more robust.
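
The hard-validation idea above could look something like the following sketch. The function name and the exact set of flags checked are hypothetical, not the actual transformers implementation:

```python
def validate_generation_kwargs(do_sample=False, temperature=None, top_p=None, top_k=None):
    """Reject flag combinations that would silently do nothing.

    Hypothetical sketch: with do_sample=False, sampling-only flags like
    temperature/top_p/top_k have no effect, so raise instead of ignoring them.
    """
    sampling_only = {"temperature": temperature, "top_p": top_p, "top_k": top_k}
    ignored = [name for name, value in sampling_only.items() if value is not None]
    if not do_sample and ignored:
        raise ValueError(
            f"{ignored} only have an effect when do_sample=True. "
            "See the text generation docs for valid flag combinations."
        )

validate_generation_kwargs(do_sample=True, temperature=0.7)  # OK
try:
    validate_generation_kwargs(do_sample=False, temperature=0.7)
except ValueError as e:
    print(e)  # informative error instead of a silently ignored flag
```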

@SoyGema
Contributor

SoyGema commented Jul 2, 2023

Hello there @gante! 👋
First and foremost, thanks for opening a discussion about this.

I'd like to share a couple of thoughts about point 1, as points 2 and 3 seem consistent and structured enough.

1. Designing a "home page" for text generation docs

As a user, the underlying pattern I'm seeing is that the Tutorials section covers how to use the library from an engineering perspective, in abstract terms (preprocess data, share your model, fine-tune, etc.). The tutorials read like a library-level 'HOW-TO'. In fact, the Tutorials section already displays several examples about vision, audio, and language.
I would think about putting a Text Generation section directly in Task guides, inside Natural Language Processing, at the top, as it relates to a challenge to solve (like Text classification and Token classification). This doesn't preclude including one of the main text-generation 'HOW-TOs' as a section inside Tutorials. From what I take from the guide, there is an insightful section on Search and Sampling that could be added to Tutorials, with a more detailed clarification added in Tasks and the Developer guides.

At first sight, following this schema (abstracting the main challenge from the guide into Tutorials, and adding a robust example or some "home-page" references in Tasks with links to the developer guides) seems more coherent with your current structure.

On a tangential note, why not add an LLM leaderboard link somewhere (maybe in point 2), so users can stay mindful of the state of the art in model performance for text generation tasks?

Hope I explained myself clearly enough 🧭
Thanks again for the open discussion! And for making the library! 👐

@patrickvonplaten
Contributor

patrickvonplaten commented Jul 3, 2023

Big +1 on the 3.) point. I think self-explanatory code with better docstring examples and arg/kwarg checking would take us a long way!

RE: 1.) Yes this makes sense, but I think a single concise page is probably better than a homepage that links to everything we already have as this might be too difficult to keep up to date and might also be too complex. A single concise page under "Tutorials" which is a strong iteration on the blog post "how to generate" (which is still one of our most read blog posts) could be a good starting point. The blog post does a very good job at explaining how LLMs fundamentally work. It is however not up to date anymore and also puts too much focus on things like top_k and top_p. So a strong incremental improvement (more tailored to new LLMs) could serve as the main point of introduction to text-generation and be put under "Tutorials".

RE: 2.) Yes I think these could all go into Developer guides and be a nice iterative improvement to "Customize Text generation strategy"

@Vaibhavs10
Member

Hi @gante - I love the plan.

Here are a couple of quick suggestions:

  1. Big +1 on validating the parameters passed to generate, or even just changing the error message to point to the text generation strategies post
  2. I agree with @patrickvonplaten -- instead of a bouquet of docs, a single simple and concise doc page would work wonders.
  3. I think a good way to structure it would be to start from the basics -- explaining the default behaviour (greedy search) -- and work our way up to the other strategies. It would also be helpful to provide suggested parameter values along with each strategy.
  4. Each of the above strategies can be paired with two toggle-based snippets, how to generate with pipeline and how to generate with processor + generate -> this will help cater to our user base.
  5. We can end the blog post with all the cool tricks that are not part of generate yet, linking to a GH repo or gist. These are examples like generating with ggml, the gptq integration, and so on.
  6. [Long term] Once we have this page in place, we can work our way through updating the model cards of text-gen models to add a link to it. I reckon it'll just be a batch PR.
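
The "two toggle snippets" idea from point 4 could look roughly like this: the same generation done once with the high-level pipeline API and once with tokenizer + model.generate (gpt2 here is only a placeholder model for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

prompt = "Hello, my name is"

# Toggle 1 -- high-level API: pipeline handles tokenization and decoding for you
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])

# Toggle 2 -- lower-level API: explicit tokenizer + model.generate
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```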

Cheers!

@NielsRogge
Contributor

Adding here that it would be nice to include a section on batched generation (I made a PR for GPT-2 here). This is not that intuitive for people as you need to pad from the left, set the padding token appropriately in the tokenizer, etc.
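
For reference, the batched-generation gotchas mentioned above (left padding, explicit pad token) could be illustrated along these lines, with gpt2 as a stand-in for any decoder-only model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"            # decoder-only models must pad on the left
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The capital of France is", "Hi"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# The attention mask from the tokenizer tells generate to ignore the padding
outputs = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.pad_token_id)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

With right padding instead, the model would continue generating from pad tokens and the shorter prompts in the batch would produce garbage, which is exactly why this deserves its own section.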

@gante
Member Author

gante commented Jul 3, 2023

Thank you for the feedback folks 🙌

I've incorporated your feedback in the plan, which got edited. Main differences:

  • Part 1 now consists of updating the existing blog post and creating a short usage tutorial (with references to the blog post and advanced docs throughout its contents, like the other tutorials, as opposed to a stand-alone section with links)
  • Part 2 got condensed to reduce the long-term maintenance burden

@stevhliu
Member

stevhliu commented Jul 6, 2023

Thanks for taking the time to outline this detailed rework! Big +1 for the additional developer guides and tutorial. ❤️

I would consider leaving the blog post as is also to reduce long-term maintenance and instead keep all the relevant text generation content in the docs. In general, I feel like a blog post is more appropriate for "timely" content like announcements/news or explaining why something was designed the way it was. Content that needs to be maintained and updated is better in the docs I think. As you mentioned, the how-to-generate blog post still contains super useful background info about text generation so I think we should definitely find a way to preserve that info. My suggestions would be to:

  • link from the blog post to the docs for the latest changes (could be a simpler banner at the top like this)
  • create a doc in the Conceptual Guide section to hold the background info from the how-to-generate blog post

@MKhalusova
Contributor

As discussed with @gante, I'll start working on an LLM prompting guide (part 2.2 ("Prompting" )).


github-actions bot commented Jan 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 9, 2024