Hands-on Notebooks for the Tutorial on Navigating Data Errors in ML Pipelines

Links: [🌐 Tutorial Website] [👀 Review Slides] [📜 Tutorial Paper]

The goal of this tutorial is to showcase different notions of data importance, based on the DataScope library. The notebooks revolve around a toy problem of classifying the sentiment of recommendation letters.

Accessing the Tutorial Notebooks

The notebooks below walk you through several data debugging scenarios.

Part 1: Data Errors

We begin by detailing how to leverage data importance to identify impactful label errors in the data.
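
To make the idea concrete, here is a minimal sketch of one simple notion of data importance, leave-one-out, on a made-up one-dimensional dataset. It does not use the DataScope API and is not necessarily the importance notion used in the notebook; the 1-nearest-neighbor classifier, the data, and all function names are assumptions chosen for illustration. A training point whose removal raises validation accuracy receives a negative score, which makes it a candidate label error.

```python
def predict_1nn(train_x, train_y, x):
    """Predict the label of x from its single nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, val_x, val_y):
    """Fraction of validation points the 1-NN classifier labels correctly."""
    correct = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(val_x, val_y))
    return correct / len(val_x)


def leave_one_out_importance(train_x, train_y, val_x, val_y):
    """Score each training point by the accuracy lost when it is removed.

    Negative scores mean accuracy improves without the point.
    """
    base = accuracy(train_x, train_y, val_x, val_y)
    scores = []
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        scores.append(base - accuracy(rest_x, rest_y, val_x, val_y))
    return scores


# Hypothetical toy data: class 0 lives near 0-3, class 1 near 7-10.
train_x = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]  # index 2 is deliberately mislabeled
val_x = [0.4, 1.6, 2.4, 2.9, 7.4, 8.4, 9.4]
val_y = [0, 0, 0, 0, 1, 1, 1]

scores = leave_one_out_importance(train_x, train_y, val_x, val_y)

# Rank training points from most harmful to most helpful.
ranked = sorted(range(len(scores)), key=lambda i: scores[i])
print("most suspicious training index:", ranked[0])
```

In this toy setup the mislabeled point at index 2 ranks first, because removing it fixes two validation predictions. Leave-one-out requires one retraining per training point, so it is only practical for small datasets or cheap models.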

Part 2a: Tracing Data Errors through Pipelines

We extend the previous notebook with a complex feature encoding pipeline and show that data errors can be traced through such pipelines as well.

Part 2b: Tracing Data Errors through Pipelines with Dataframe Operations

We extend our example use case with dataframe operations, and show that we can trace data errors through these relational operations as well.
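
The sketch below illustrates what tracing through a relational operation can look like, using a hand-written inner join that records, for every output row, which source rows produced it. It is an assumption-laden illustration in plain Python, not the mechanism DataScope or the notebook uses; the tables, column names, and the helper functions `join_with_provenance` and `trace_back` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source tables for the recommendation-letter example.
letters = [
    {"letter_id": 0, "author_id": "a", "label": "positive"},
    {"letter_id": 1, "author_id": "b", "label": "negative"},
    {"letter_id": 2, "author_id": "a", "label": "positive"},
    {"letter_id": 3, "author_id": "c", "label": "negative"},
]
authors = [
    {"author_id": "a", "seniority": "senior"},
    {"author_id": "b", "seniority": "junior"},
]


def join_with_provenance(left, left_name, right, right_name, key):
    """Inner join that tags each output row with the source rows it came from."""
    right_by_key = defaultdict(list)
    for j, row in enumerate(right):
        right_by_key[row[key]].append((j, row))

    joined = []
    for i, left_row in enumerate(left):
        for j, right_row in right_by_key.get(left_row[key], []):
            out = {**left_row, **right_row}
            # Provenance: (table name, row index) pairs behind this output row.
            out["_provenance"] = ((left_name, i), (right_name, j))
            joined.append(out)
    return joined


def trace_back(joined, flagged_output_rows):
    """Map flagged output rows back to the source rows they depend on."""
    sources = defaultdict(list)
    for out_idx in flagged_output_rows:
        for source in joined[out_idx]["_provenance"]:
            sources[source].append(out_idx)
    return dict(sources)


train = join_with_provenance(letters, "letters", authors, "authors", "author_id")

# Suppose an importance method flagged output rows 0 and 2 as harmful.
suspects = trace_back(train, [0, 2])
for source, outputs in suspects.items():
    print(source, "contributes to flagged output rows", outputs)
```

Here the letter by author `c` is dropped by the join, and both flagged output rows depend on the same row of `authors`. A source row shared by several flagged outputs is a natural place to start looking for the underlying error.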

Frequently Asked Questions

Can I run the notebooks locally?

Yes, after cloning the repo and entering the repo directory, run the following commands:

make shell
make setup
make jupyter

Note: The make setup command installs all the Python dependencies and needs to be run only the first time you set up the repository.
