Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab #63
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
"It consistently produces much better models with approximately 50% less error rate, regardless of the total labeling budget." Is there a study from which this number comes from?
Can you create a new dataset on the Hugging Face Hub with these files, and load from there? Some reasons for this:
- You can preview CSV datasets on the Hub
- If those files are removed from their current location, they would still be on the Hub, so this would future-proof the notebook (a minimal loading sketch follows below).
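For illustration, a minimal sketch of what loading such a dataset from the Hub could look like (the repo id below is a placeholder, not an existing dataset):

```python
from datasets import load_dataset

# Placeholder repo id; replace with the dataset actually uploaded to the Hub.
dataset = load_dataset("your-username/active-learning-text-annotations")

# Inspect a few rows as a pandas DataFrame
df = dataset["train"].to_pandas()
print(df.head())
```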
Currently, there are some issues displaying pandas tables on our end. Let's instead print out the examples (just the contents of the text column).
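For instance, something along these lines (the DataFrame and column name here are illustrative):

```python
import pandas as pd

# Placeholder data; in the notebook this would be the annotated examples
df = pd.DataFrame({"text": ["What is the best way to do this?", "Could you please clarify?"]})

# Print only the contents of the text column instead of rendering the whole table
for text in df["text"]:
    print(text)
```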
I've addressed this. Please let me know if it works for you.
It would be super helpful and much easier to read if you could split this large code cell into a set of smaller cells and add explanations of the methods (not in code comments, but as markdown).
It would probably help to explain briefly, in a sentence or two, how ActiveLab consensus labels are calculated (if it's not a majority vote, then what is it?) and what active learning scores are.
A bit of intuition on this is given later, but I feel that at this point it may not be clear to the reader.
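For reference, my understanding is that ActiveLab's consensus labels are not a plain majority vote: annotator labels and the current model's predicted probabilities are combined, with each source weighted by its estimated reliability, and the active learning score estimates how much an additional annotation for each example would improve the model. A hedged sketch using cleanlab's multiannotator API (the toy data is illustrative, and the exact signatures should be checked against the cleanlab version used in the notebook):

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_active_learning_scores

# Toy setup: 6 labeled examples, 3 annotators (NaN = not annotated), 2 classes
labels_multiannotator = pd.DataFrame({
    "annotator_1": [0, 1, np.nan, 1, 0, np.nan],
    "annotator_2": [0, np.nan, 1, 0, np.nan, 1],
    "annotator_3": [np.nan, 1, 1, np.nan, 0, 1],
})
# Model-predicted class probabilities for the labeled and unlabeled pools
pred_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6],
                       [0.6, 0.4], [0.8, 0.2], [0.2, 0.8]])
pred_probs_unlabeled = np.array([[0.55, 0.45], [0.95, 0.05]])

scores_labeled, scores_unlabeled = get_active_learning_scores(
    labels_multiannotator, pred_probs, pred_probs_unlabeled
)
# Lower scores mark examples whose (re-)annotation is expected to help the model most,
# so the next batch to label is drawn from the lowest-scoring examples in either pool.
```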
Can I say "which is explained further in the notebook"?
Sure
Really interesting topic! Thank you for contributing! I have left some suggestions.
I have addressed all the comments. Please take a look and let me know if you have any suggestions.
Please update the title in the toc and index for consistency
notebooks/en/_toctree.yml
@@ -32,6 +32,7 @@
  title: Create a legal preference dataset
- local: semantic_cache_chroma_vector_database
  title: Implementing semantic cache to improve a RAG system.
- local: annotate_text_data_transformers_via_active_learning
  title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab
Suggested change:
- title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab
+ title: Annotate text data using Active Learning with Cleanlab
notebooks/en/index.md
@@ -23,6 +23,7 @@ Check out the recently added notebooks:
- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
- [Detecting Issues in a Text Dataset with Cleanlab](issues_in_text_dataset)
- [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)
Suggested change:
- - [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)
+ - [Annotate text data using Active Learning with Cleanlab](annotate_text_data_transformers_via_active_learning)
@MKhalusova Updated the title across the TOC and index.
@aravindputrevu I'll give this a final review tomorrow :)
Line #2. pd.set_option('max_colwidth', None)
Nit: would you be able to move this to the bottom of the imports cell?
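i.e., roughly like this (the specific imports shown are just an illustration of the cell layout):

```python
import numpy as np
import pandas as pd
from datasets import load_dataset

# Display settings kept at the bottom of the imports cell
pd.set_option("max_colwidth", None)
```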
Done!
It might be worth briefly explaining how you are doing the additional annotations here.
@davanstrien I have explained it via a flow chart above and also added specific pointers above the flow chart.
@aravindputrevu this is a great lesson. There are still a few comments from Maria to respond to.
@davanstrien - I have addressed the comments, please let me know.
Thanks for working on this! Really nice tutorial and topic. I'll merge this on Monday. I'm unfamiliar with the cookbook build system, so I need to check with others how that's set up.
Thank you @davanstrien
What does this PR do?
This PR demonstrates how to effectively annotate text data for Transformer models using active learning, leveraging the Cleanlab open-source package. It covers:
- An introduction to active learning and why it matters for making efficient use of a limited labeling budget.
- An implementation of the ActiveLab algorithm, which prioritizes data for annotation based on its expected impact on model performance. This is particularly helpful with noisy annotators, since it helps decide whether to collect additional annotations for previously labeled data or to label new data.
- A detailed walkthrough of iteratively improving a text classification model by selecting the most impactful data points for annotation, retraining the model, and evaluating its performance (a rough sketch of this loop follows below).
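A rough, hedged sketch of that loop (this is not the notebook's code: the training/prediction helpers are placeholders, and simple uncertainty sampling stands in for ActiveLab's scoring):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_model(texts, labels):
    """Placeholder for fine-tuning a Transformer classifier on the current labels."""
    return None

def predict_probs(model, texts):
    """Placeholder for model predictions; returns random class probabilities."""
    p = rng.random((len(texts), 2))
    return p / p.sum(axis=1, keepdims=True)

labeled_texts = [f"labeled example {i}" for i in range(20)]
labels = rng.integers(0, 2, size=20)
unlabeled_texts = [f"unlabeled example {i}" for i in range(80)]

for _ in range(3):  # a few annotation rounds
    model = train_model(labeled_texts, labels)
    probs_unlabeled = predict_probs(model, unlabeled_texts)
    # Select the examples the model is least certain about
    # (ActiveLab would rank these with its active learning scores instead)
    uncertainty = 1 - probs_unlabeled.max(axis=1)
    picked = np.argsort(-uncertainty)[:10]
    # "Annotate" the picked examples (placeholder labels) and move them to the labeled pool
    labeled_texts += [unlabeled_texts[i] for i in picked]
    labels = np.concatenate([labels, rng.integers(0, 2, size=len(picked))])
    unlabeled_texts = [t for i, t in enumerate(unlabeled_texts) if i not in set(picked)]
```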
Who can review?
@MKhalusova appreciate your review.