Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab #63

aravindputrevu · 2024-03-15T21:15:48Z

What does this PR do?

This PR adds a notebook demonstrating how to effectively annotate text data for Transformer models using active learning, specifically with the open-source Cleanlab package. It covers:

- An introduction to active learning and why it matters for spending labeling effort efficiently under a budget constraint.
- An implementation of the ActiveLab algorithm, which prioritizes data for annotation by its potential impact on model performance. This is particularly useful with noisy annotators, since it helps decide whether to collect another annotation for already-labeled data or to label new data.
- A detailed walkthrough of iteratively improving a text classification model: select the most impactful data points for annotation, retrain the model, and evaluate its performance.
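
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes each example already has a score where a lower value means another label is more valuable (which is how ActiveLab scores are meant to be read), and that scores for labeled and unlabeled examples are comparable. The `select_batch` helper and the example scores are hypothetical.

```python
from typing import List, Tuple


def select_batch(
    scores_labeled: List[float],
    scores_unlabeled: List[float],
    batch_size: int,
) -> Tuple[List[int], List[int]]:
    """Pick the batch_size lowest-scoring examples across both pools.

    Returns (indices of labeled examples to re-annotate,
             indices of unlabeled examples to annotate for the first time).
    """
    # Tag every score with its pool so both pools can be ranked together.
    candidates = [(s, "labeled", i) for i, s in enumerate(scores_labeled)]
    candidates += [(s, "unlabeled", i) for i, s in enumerate(scores_unlabeled)]
    candidates.sort(key=lambda c: c[0])  # lowest score first

    chosen = candidates[:batch_size]
    relabel = sorted(i for _, pool, i in chosen if pool == "labeled")
    label_new = sorted(i for _, pool, i in chosen if pool == "unlabeled")
    return relabel, label_new


# Made-up scores: three already-labeled examples and three unlabeled ones.
relabel, label_new = select_batch([0.9, 0.2, 0.7], [0.4, 0.1, 0.8], batch_size=3)
print("re-annotate:", relabel)
print("annotate new:", label_new)
```

In each round, the chosen examples would be sent to annotators, the model retrained on the updated labels, and the scores recomputed before the next selection.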

Who can review?

@MKhalusova appreciate your review.

review-notebook-app · 2024-03-15T21:15:53Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

HuggingFaceDocBuilderDev · 2024-03-15T21:19:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


Please remove this, as this button is automatically generated for the notebook.

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


"It consistently produces much better models with approximately 50% less error rate, regardless of the total labeling budget." Is there a study that this number comes from?

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


Can you create a new dataset on the Hugging Face Hub with these files, and load from there? Some reasons for this:
- You can preview CSV datasets on the Hub.
- If those files are removed from their current location, they would still be on the Hub, so this would future-proof the notebook.

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


Currently, there are some issues displaying pandas tables on our end. Let's print out the examples instead (just the contents of the text column).
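
For example, something along these lines would avoid rendering a table. This is only a sketch: the DataFrame contents and the `preview_text` helper are hypothetical stand-ins, and the column name `text` is assumed from the comment above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataset; only the "text" column matters here.
df = pd.DataFrame(
    {
        "text": ["i need a new card", "why was my transfer declined"],
        "label": [0, 1],
    }
)


def preview_text(frame: pd.DataFrame, n: int = 5) -> list:
    """Return the first n entries of the text column as plain strings."""
    return frame["text"].head(n).tolist()


# Print each example on its own line instead of displaying df.head().
for example in preview_text(df):
    print(example)
```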

aravindputrevu · 2024-04-03

I've addressed this. Please let me know if it works for you.

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


It would be super helpful and much easier to read if you could split this large code cell into a set of smaller cells and add explanations for the methods (as markdown, not in code comments).

MKhalusova · 2024-03-18T17:52:26Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2194 @@
+{


It would probably help here to explain briefly in a sentence or two, how ActiveLab consensus labels are calculated (if it's not majority vote, then what is it?), and what active learning scores are.
A bit of intuition on this is given later, but I feel that at this point it may not be clear to the reader.
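
To make the distinction concrete, here is a simplified sketch contrasting a plain majority vote with a consensus that weights each annotator's vote and also takes the model's predicted probabilities into account. It is not Cleanlab's implementation: the `weighted_consensus` helper, the weights, and the numbers are all hypothetical, and the library estimates annotator and model trustworthiness from the data with its own formula.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_vote(labels: List[Optional[int]]) -> int:
    """Most common label among the annotators who labeled this example."""
    given = [label for label in labels if label is not None]
    return Counter(given).most_common(1)[0][0]


def weighted_consensus(
    labels: List[Optional[int]],
    pred_probs: List[float],
    annotator_weights: List[float],
    model_weight: float,
) -> Tuple[int, float]:
    """Combine model probabilities with weighted annotator votes.

    Returns (consensus label, normalized confidence in that label).
    """
    # Start from the model's class probabilities, scaled by how much we trust it.
    totals = [model_weight * p for p in pred_probs]
    # Each annotator adds their (assumed) trust weight to the class they chose.
    for label, weight in zip(labels, annotator_weights):
        if label is not None:
            totals[label] += weight
    norm = sum(totals)
    probs = [t / norm for t in totals]
    consensus = max(range(len(probs)), key=lambda k: probs[k])
    return consensus, probs[consensus]


# Two less reliable annotators say class 0; one reliable annotator and the model say class 1.
labels = [0, 0, 1]
vote = majority_vote(labels)
consensus, confidence = weighted_consensus(
    labels,
    pred_probs=[0.2, 0.8],
    annotator_weights=[0.3, 0.3, 0.9],
    model_weight=1.0,
)
print("majority vote:", vote)
print("weighted consensus:", consensus, "confidence:", round(confidence, 2))
```

In this toy setup the confidence in the consensus label plays the role of a rough score: the lower it is, the more an additional annotation for that example is likely to help.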

aravindputrevu · 2024-03-21

Can I say that this is explained further in the notebook?

MKhalusova · 2024-03-18T17:53:52Z

Really interesting topic! Thank you for contributing! I have left some suggestions.

aravindputrevu · 2024-03-29T01:17:53Z

I have addressed all the comments. Please take a look and let me know if you have any suggestions.

MKhalusova

Please update the title in the toc and index for consistency

MKhalusova · 2024-03-29T13:11:21Z

notebooks/en/_toctree.yml

@@ -32,6 +32,7 @@
    title: Create a legal preference dataset
  - local: semantic_cache_chroma_vector_database
    title: Implementing semantic cache to improve a RAG system.
+  - local: annotate_text_data_transformers_via_active_learning
+    title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab


Suggested change

title: Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab

title: Annotate text data using Active Learning with Cleanlab

MKhalusova · 2024-03-29T13:11:31Z

notebooks/en/index.md

@@ -23,6 +23,7 @@ Check out the recently added notebooks:
 - [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
 - [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
 - [Detecting Issues in a Text Dataset with Cleanlab](issues_in_text_dataset)
+- [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)


Suggested change

- [Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab](annotate_text_data_transformers_via_active_learning)

- [Annotate text data using Active Learning with Cleanlab](annotate_text_data_transformers_via_active_learning)

aravindputrevu · 2024-04-02T06:54:31Z

@MKhalusova Updated title across the toc and index.md

davanstrien · 2024-04-02T17:39:18Z

@aravindputrevu I'll give this a final review tomorrow :)

davanstrien · 2024-04-03T09:30:37Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2296 @@
+{


Line #2. pd.set_option('max_colwidth', None)
Nit: would you be able to move this to the bottom of the imports cell?

davanstrien · 2024-04-03T09:30:38Z

notebooks/en/annotate_text_data_transformers_via_active_learning.ipynb

@@ -0,0 +1,2296 @@
+{


It might be worth briefly explaining here how you are doing the additional annotations.

aravindputrevu · 2024-04-03

@davanstrien I have explained it with a flow chart above, and I also added specific pointers above the flow chart.

davanstrien · 2024-04-03T09:31:15Z

@aravindputrevu this is a great lesson. There are still a few comments from Maria to respond to:

Effectively Annotate Text Data for Transformers via Active Learning using Cleanlab #63 (comment) (replace Pandas DataFrame head
I left a few other minor comments in the notebook
Let me know if anything is unclear. I think we're almost ready to merge this :)

aravindputrevu · 2024-04-03T14:43:59Z

@davanstrien - I have addressed the comments, please let me know.

davanstrien

Thanks for working on this! Really nice tutorial and topic. I'll merge this on Monday. I'm unfamiliar with the cookbook build system, so I need to check with others how that's set up.

aravindputrevu · 2024-04-07T14:22:10Z

Thank you @davanstrien

Activelab Changes

4da589c

MKhalusova self-requested a review March 18, 2024 13:08

MKhalusova reviewed Mar 18, 2024

Update addressing all the comments

49b83d9

Merge branch 'main' into main

50597e7

MKhalusova reviewed Mar 29, 2024

Updating title across toc and index files

4f545df

davanstrien self-requested a review April 2, 2024 17:38

davanstrien reviewed Apr 3, 2024

davanstrien self-requested a review April 6, 2024 11:01

davanstrien approved these changes Apr 6, 2024

davanstrien merged commit 7a512d6 into huggingface:main Apr 8, 2024
1 check passed
