Hands-on Notebooks for the Tutorial on Navigating Data Errors in ML Pipelines

Links: [🌐 Tutorial Website] [👀 Review Slides] [📜 Tutorial Paper]

The goal of this tutorial is to showcase different notions of data importance, based on the DataScope library. The notebooks revolve around a toy problem of classifying the sentiment of recommendation letters.

Accessing the Tutorial Notebooks

We provide three notebooks that walk you through different data debugging scenarios.

Part 1: Data Errors

We begin by detailing how to use data importance to identify impactful label errors in the data.

https://colab.research.google.com/github/navigating-data-errors/tutorial/blob/main/navigating_data_errors_tutorial_part1_data_errors.ipynb
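As a rough illustration of what "data importance" can mean, the sketch below computes leave-one-out importance for a tiny 1-nearest-neighbor classifier: each training point is scored by how much validation accuracy drops when it is removed. This is not the DataScope API and not necessarily the notion used in the notebook. The toy data, the single numeric feature, and all function names are assumptions made for this example.

```python
# Minimal leave-one-out importance sketch (standard library only).
# All names and data here are hypothetical; this is NOT the DataScope API.

def nearest_label(train, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, valid):
    """Fraction of validation points whose nearest training neighbor has the right label."""
    correct = sum(nearest_label(train, x) == y for x, y in valid)
    return correct / len(valid)

def leave_one_out_importance(train, valid):
    """Score each training point by the accuracy lost when it is removed.

    A negative score means the model gets better without the point,
    which makes the point a candidate label error.
    """
    base = accuracy(train, valid)
    return [base - accuracy(train[:i] + train[i + 1:], valid)
            for i in range(len(train))]

# Toy data: (feature, label). The last training point carries a flipped label.
train = [(0.1, "neg"), (0.2, "neg"), (0.3, "neg"),
         (0.7, "pos"), (0.8, "pos"), (0.9, "pos"),
         (0.25, "pos")]  # deliberately mislabeled
valid = [(0.15, "neg"), (0.24, "neg"), (0.26, "neg"),
         (0.75, "pos"), (0.85, "pos")]

scores = leave_one_out_importance(train, valid)
suspect = min(range(len(scores)), key=lambda i: scores[i])
print("most harmful training point:", train[suspect])
```

Leave-one-out needs one retraining per training point, so it is only practical for small examples like this one.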

Part 2a: Tracing Data Errors through Pipelines

We extend the previous notebook with a complex feature encoding pipeline and show that data errors can be traced through such pipelines as well.

https://colab.research.google.com/github/navigating-data-errors/tutorial/blob/main/navigating_data_errors_tutorial_part2a_with_pipeline.ipynb

Part 2b: Tracing Data Errors through Pipelines with Dataframe Operations

We extend our example use case with dataframe operations, and show that we can trace data errors through these relational operations as well.

https://colab.research.google.com/github/navigating-data-errors/tutorial/blob/main/navigating_data_errors_tutorial_part2b_with_dataframes.ipynb
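One way to picture tracing through relational operations is to let every output row carry the set of source rows it was derived from. The sketch below does this for a join followed by a filter, then pushes per-row importance scores back to the source tables by summing them. It uses plain Python dictionaries instead of dataframes to stay dependency-free. The table names, column names, scores, and the additive attribution rule are all assumptions for illustration, not the method used in the notebook.

```python
# Provenance-tracking sketch for a join and a filter (standard library only).
# Tables, columns, and scores are hypothetical toy values.
from collections import defaultdict

def join_with_provenance(left, right, key, left_name, right_name):
    """Inner-join two lists of dicts on `key`, recording the source rows of each output row."""
    index = defaultdict(list)
    for j, r in enumerate(right):
        index[r[key]].append((j, r))
    out = []
    for i, l in enumerate(left):
        for j, r in index.get(l[key], []):
            out.append(({**r, **l}, {(left_name, i), (right_name, j)}))
    return out

def filter_with_provenance(rows, predicate):
    """Keep rows that satisfy the predicate; provenance passes through unchanged."""
    return [(row, prov) for row, prov in rows if predicate(row)]

def attribute_to_sources(rows, scores):
    """Sum each output row's score onto every source row it came from (an assumed rule)."""
    totals = defaultdict(float)
    for (_, prov), score in zip(rows, scores):
        for source in prov:
            totals[source] += score
    return dict(totals)

letters = [
    {"applicant_id": 1, "label": "positive"},
    {"applicant_id": 1, "label": "negative"},
    {"applicant_id": 2, "label": "positive"},
    {"applicant_id": 3, "label": "negative"},
]
applicants = [
    {"applicant_id": 1, "department": "cs"},
    {"applicant_id": 2, "department": "math"},
    {"applicant_id": 3, "department": "cs"},
]

joined = join_with_provenance(letters, applicants, "applicant_id", "letters", "applicants")
traced = filter_with_provenance(joined, lambda row: row["department"] == "cs")

# Made-up importance scores, one per row that reaches the model.
row_scores = [0.5, -1.0, 0.25]
totals = attribute_to_sources(traced, row_scores)
print(totals)
```

Source rows that the filter drops, such as the "math" applicant here, receive no score at all because no output row depends on them.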

Frequently Asked Questions

Can I run the notebooks locally?

Yes. After cloning the repository and entering its directory, run the following commands:

make shell
make setup
make jupyter

Note: The make setup command installs all the Python dependencies and needs to be run only the first time you set up the repository.
