Detecting Issues in a Text Dataset with Datalab #30
Done!
Make sure to install the version corresponding to this tutorial
What are these versions? Perhaps we can recommend installing the latest (add the -U flag to install the newest versions). We can also probably remove the comment.
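For instance, the install cell could be updated along these lines (the exact package list is an assumption based on the tutorial's topic; `cleanlab[datalab]` is the extra that pulls in Datalab's optional dependencies):

```shell
# Install the latest versions instead of pinning to the tutorial's snapshot
pip install -U "cleanlab[datalab]" datasets
```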
Done!
Yes, removing this block.
Line #1. # Package installation (hidden on docs.cleanlab.ai).
This comment can be removed.
Line #2. # If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
You can add this to the introduction.
It would be cool to quickly show how you would update the dataset based on the report (e.g. remove all of the "bad" examples, or add a column indicating which ones are good-to-use, and which ones are not). I imagine one would want to run such cleanup on schedule, and somehow integrate the results.
Well, Cleanlab is the package that helps identify these issues. One can simply delete the near_duplicates or outliers from the dataframe and export the CSV. Cleanlab calls this a Cleanset, as in a cleaned dataset.
I'd want the user to take a look and delete data points according to their own judgment.
As the goal of the project is to showcase the problems within a dataset, it would be a bit difficult to integrate the package as a workflow, in my opinion (though it could be done using a GitHub Action or similar!).
Hence also the last paragraph: Cleanlab Studio provides the necessary UI and a long-term solution for maintaining the datasets throughout.
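The cleanup step described above could be sketched like this with pandas (the flag column names here are hypothetical stand-ins for the per-issue columns Datalab reports):

```python
import pandas as pd

# Toy stand-in for a dataset with issue flags merged in from Datalab's
# issue report (the boolean flag columns below are hypothetical).
df = pd.DataFrame({
    "text": ["good example", "dup A", "dup A copy", "weird outlier"],
    "is_near_duplicate_issue": [False, True, True, False],
    "is_outlier_issue": [False, False, False, True],
})

# Keep only rows that were not flagged -- this is the "Cleanset".
cleanset = df[~(df["is_near_duplicate_issue"] | df["is_outlier_issue"])]
cleanset.to_csv("cleanset.csv", index=False)
```

In practice one would review the flagged rows before dropping them, rather than deleting them blindly.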
notebooks/en/_toctree.yml
Outdated
@@ -12,3 +12,5 @@
  title: Advanced RAG on HuggingFace documentation using LangChain
- local: rag_evaluation
  title: RAG Evaluation
- local: issues_in_text_dataset
Feel free to move this to the top, right after the index page
Also it looks like there's a space missing here that breaks the CI/CD check. Make sure it's aligned with other entries
@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu)
- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
- [Detecting Issues in a Text Dataset with Datalab](issues_in_text_dataset)
feel free to add it to the top of the list
Awesome tutorial, @aravindputrevu !
Also, please add yourself as an author, right after the main title, like this: Authored by: Your Name. Feel free to use either your Hugging Face profile or your GitHub profile; it's up to you which one to link.
@MKhalusova Thanks for the review, I will be working on the comments.
@MKhalusova I have fixed the review comments and responded to the other questions. Please let me know.
Please use the actual name and not the account handle for the author, for consistency with other notebooks, i.e.
[FirstName LastName](link_to_HF_profile)
The doc-builder that we use to publish notebooks seems to have issues with the <div> tags in this markdown. Please reformat to remove them, and leave only the markdown formatting.
Let's remove these outputs as they take a lot of space and are not super informative.
There is no need to duplicate the output in the markdown cell, it will be shown in the rendered notebook. Please remove the markdown copy.
At the moment there are some issues displaying pandas dataframe outputs, so you can actually leave this markdown version of the output
Please remove the output duplicated in markdown, only leave the actual cell output
Same here. While for the rest of the outputs I encourage you to remove the duplication, you can leave this for pandas dataframes.
This reads like a sales pitch, which is not aligned with the goals of the Open Source AI cookbook. Please remove.
A few finishing touches, and the notebook will be good to merge!
@MKhalusova Made the requested changes and corrected a few other items. Please review.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I fixed some missing columns, and we can merge now. Will share the new recipe tomorrow! |
What does this PR do?
This notebook is about detecting issues in a text dataset with data-centric AI, using the open-source package Cleanlab. It uses the Datalab object from the Cleanlab package.
Outcomes from the notebook:
Compute out-of-sample predicted probabilities for a sample dataset using cross-validation.
Use Datalab to identify issues such as noisy labels, outliers, (near) duplicates, and other types of problems.
View the issue summaries and other information about our sample dataset.
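The first outcome above can be sketched with scikit-learn (the tiny dataset and model choice here are illustrative, not the notebook's actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Tiny illustrative text dataset (the notebook uses a real one)
texts = ["great movie", "awful film", "loved it", "hated it",
         "fantastic plot", "terrible acting", "superb cast", "boring story"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Out-of-sample predicted probabilities: each row is predicted by a model
# that never saw it during training (cv=2 only because the toy set is tiny)
pred_probs = cross_val_predict(model, texts, labels, cv=2,
                               method="predict_proba")

# These pred_probs are what Datalab consumes to find label issues, e.g.:
#   lab = Datalab(data=dataset, label_name="label")
#   lab.find_issues(pred_probs=pred_probs)
```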
@MKhalusova appreciate your review.