Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab #63

Merged
4 commits merged on Apr 8, 2024

Conversation

aravindputrevu
Contributor

What does this PR do?

This PR adds a notebook that demonstrates how to annotate text data for Transformer models efficiently using active learning, specifically with the open-source Cleanlab package. It covers:

  • Introduction to active learning and its importance in efficiently utilizing labeling efforts under budget constraints.

  • Implementation of the ActiveLab algorithm, which assists in prioritizing data for annotation based on the potential impact on model performance. This is particularly beneficial when dealing with noisy annotators, as it helps in deciding whether to seek additional annotations for previously labeled data or new data.

  • A detailed walkthrough on iteratively improving a text classification model by selecting the most impactful data points for annotation, retraining the model, and evaluating its performance.
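
As a rough illustration of the prioritization described in the second bullet, the sketch below merges scores for already-labeled and still-unlabeled examples and picks the lowest-scoring ones for the next annotation round. The function name `select_batch`, the toy scores, and the convention that a lower score means another label is more valuable are assumptions made for this example. It is not the ActiveLab implementation, which is what computes such scores in the first place.

```python
from typing import List, Sequence, Tuple


def select_batch(
    scores_labeled: Sequence[float],
    scores_unlabeled: Sequence[float],
    batch_size: int,
) -> List[Tuple[str, int]]:
    """Pick the `batch_size` lowest-scoring examples across both pools.

    Assumption: a lower score means one more annotation is more valuable.
    Each result is (pool name, index within that pool).
    """
    candidates = [("labeled", i, s) for i, s in enumerate(scores_labeled)]
    candidates += [("unlabeled", i, s) for i, s in enumerate(scores_unlabeled)]
    # Sort by score only; list.sort() is stable, so ties keep pool order.
    candidates.sort(key=lambda c: c[2])
    return [(pool, idx) for pool, idx, _ in candidates[:batch_size]]


# Toy scores, invented for illustration only.
picked = select_batch([0.9, 0.2, 0.7], [0.4, 0.1], batch_size=3)
print(picked)
```

Ranking both pools together is what lets a single budget be split between re-labeling existing examples and labeling new ones.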

Who can review?

@MKhalusova appreciate your review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@MKhalusova MKhalusova self-requested a review March 18, 2024 13:08
@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

Please remove this, as this button is automatically generated for the notebook.


@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

"It consistently produces much better models with approximately 50% less error rate, regardless of the total labeling budget." Is there a study that this number comes from?


@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

Can you create a new dataset on the Hugging Face Hub with these files, and load from there? Some reasons for this:

  • You can preview CSV datasets on the Hub
  • If those files are removed from their current location, they would still be on the Hub, so this would future-proof the notebook.
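
A minimal sketch of what loading from the Hub could look like, assuming the CSV files have been uploaded to a dataset repository. The repository id, the file names, and the helper `load_from_hub` are hypothetical placeholders; `load_dataset` with a `data_files` mapping is the `datasets` call this relies on.

```python
from typing import Dict

# Hypothetical repository id and file names, used only for illustration.
REPO_ID = "your-username/your-dataset"
DATA_FILES: Dict[str, str] = {"train": "train.csv", "test": "test.csv"}


def load_from_hub(repo_id: str = REPO_ID, data_files: Dict[str, str] = DATA_FILES):
    """Load CSV splits from a dataset repository on the Hugging Face Hub."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    # Needs network access; returns a DatasetDict keyed by split name.
    return load_dataset(repo_id, data_files=data_files)
```

Calling `load_from_hub()["train"].to_pandas()` would then give a DataFrame for the rest of the notebook, provided the repository exists and is accessible.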


@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

Currently, there are some issues displaying pandas tables on our end. Let's instead print out the examples (just the contents of the text column)
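
One way to do this, sketched under the assumption that the data sits in a DataFrame with a `text` column: print the first few values as plain strings instead of rendering the table. The helper name `format_examples` and the sample sentences are made up for this example.

```python
from itertools import islice
from typing import Iterable, List


def format_examples(texts: Iterable[str], n: int = 5) -> List[str]:
    """Return the first `n` texts as numbered plain-text lines."""
    return [f"Example {i}: {text}" for i, text in enumerate(islice(texts, n))]


# Any iterable of strings works; the sentences here are invented.
sample_texts = ["first sentence", "second sentence", "third sentence"]
for line in format_examples(sample_texts, n=2):
    print(line)
```

With pandas, `format_examples(df["text"])` works the same way, since iterating over a Series yields its values.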


Contributor Author

I've addressed this. Please let me know if it works for you.

@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

It would be super helpful and much easier to read if you could split this large code cell into a set of smaller cells and add explanations for the methods (not in the comments, but as markdown).


@@ -0,0 +1,2194 @@
{
Contributor

@MKhalusova MKhalusova Mar 18, 2024

It would probably help here to explain briefly in a sentence or two, how ActiveLab consensus labels are calculated (if it's not majority vote, then what is it?), and what active learning scores are.

A bit of intuition on this is given later, but I feel that at this point it may not be clear to the reader.
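
For reference, the majority-vote baseline mentioned in this comment can be sketched as follows. This is only the simple baseline, not how ActiveLab derives its consensus labels; the function name and the use of `None` for "annotator did not label this example" are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(annotations: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most frequent label, ignoring missing (None) annotations.

    Ties resolve to the label that appears first in `annotations`.
    Returns None when no annotator labeled the example.
    """
    votes = Counter(a for a in annotations if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# One example labeled by three of four hypothetical annotators.
print(majority_vote([1, 0, 1, None]))
```

Majority vote treats every annotator as equally reliable, which is the limitation that motivates asking how the notebook's consensus labels differ.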


Contributor Author

Can I say that this is explained further in the notebook?

Contributor

Sure

@MKhalusova
Contributor

Really interesting topic! Thank you for contributing! I have left some suggestions.

Contributor Author

I have addressed all the comments. Please take a look and let me know if you have any suggestions.

Contributor

@MKhalusova MKhalusova left a comment

Please update the title in the toc and index for consistency

@@ -32,6 +32,7 @@
title: Create a legal preference dataset
- local: semantic_cache_chroma_vector_database
title: Implementing semantic cache to improve a RAG system.
- local: annotate_text_data_transformers_via_active_learning
title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab
Contributor

Suggested change
title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab
title: Annotate text data using Active Learning with Cleanlab

@@ -23,6 +23,7 @@ Check out the recently added notebooks:
- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
- [Detecting Issues in a Text Dataset with Cleanlab](issues_in_text_dataset)
- [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)
Contributor

Suggested change
- [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)
- [Annotate text data using Active Learning with Cleanlab](annotate_text_data_transformers_via_active_learning)

@aravindputrevu
Contributor Author

@MKhalusova Updated title across the toc and index.md

@davanstrien davanstrien self-requested a review April 2, 2024 17:38
@davanstrien
Member

@aravindputrevu I'll give this a final review tomorrow :)

@@ -0,0 +1,2296 @@
{
Member

@davanstrien davanstrien Apr 3, 2024

Line #2.    pd.set_option('max_colwidth', None)

nit: would you be able to move this to the bottom of the imports cell?


Contributor Author

Done!

@@ -0,0 +1,2296 @@
{
Member

@davanstrien davanstrien Apr 3, 2024

It might be worth briefly explaining here how you are doing the additional annotations.


Contributor Author

@davanstrien I have explained it via a flow chart above and also added specific pointers above the flow chart.

@davanstrien
Member

davanstrien commented Apr 3, 2024

@aravindputrevu this is a great lesson. There are still a few comments from Maria to respond to:

Contributor Author

@davanstrien - I have addressed the comments, please let me know.

@davanstrien davanstrien self-requested a review April 6, 2024 11:01
Member

@davanstrien davanstrien left a comment

Thanks for working on this! Really nice tutorial and topic. I'll merge this on Monday. I'm unfamiliar with the cookbook build system, so I need to check with others how that's set up.

@aravindputrevu
Contributor Author

Thank you @davanstrien

@davanstrien davanstrien merged commit 7a512d6 into huggingface:main Apr 8, 2024
1 check passed
4 participants