# MLfix – using AI and UI to explore and fix datasets

[![Gitter](https://badges.gitter.im/MLfix/community.svg)](https://gitter.im/MLfix/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jpc/mlfix-mapillary-traffic-signs/HEAD?labpath=index.ipynb)

![A futuristic robot cleaning streets of New York that are overflowing with papers.](banner.jpg)
Is your dataset overflowing with low-quality samples? Our highly skilled robots can help you! (generated by [Centipede Diffusion](https://github.com/Zalring/Centipede_Diffusion/))

### Why?

Even carefully curated AI datasets have errors. In a highly curated dataset like the [Mapillary Traffic Sign Dataset](https://www.mapillary.com/dataset/trafficsign), even the easy classes, like speed limits, [have 3% mislabeled samples](./2.%20Mapillary%20speed-limit%20cleanup.ipynb). [After training, this results in a 2% loss of model accuracy on these classes.](./3.%20Did%20it%20help%3F.ipynb#The-results)

### What?

This repository contains tools which can help you find mistakes in your labels directly from your Jupyter notebook. You just need a folder full of images or an object detection dataset in one of the supported formats. Here is an example of MLfix running inside a Jupyter session:

![MLfix v2 usage example](https://user-images.githubusercontent.com/107984/184148474-839b8049-fe68-47f0-b4b5-cc83f27dfea8.mp4)

You can also [try it yourself now on Binder](https://mybinder.org/v2/gh/jpc/mlfix-mapillary-traffic-signs/HEAD?labpath=index.ipynb).

### How?

The tools work by sorting and grouping the images and then showing them in a streamlined user interface. The interface lets you mark the photos as you go through your QA process. The images can be grouped and sorted by class or other metadata, as well as by AI sorting methods like visual similarity or validation loss.
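To make the "validation loss" ordering concrete, here is a minimal sketch in plain Python. It is not MLfix's internal code: the file names, the probabilities and the `cross_entropy` helper are made up for illustration. The idea is that samples whose assigned label gets a low predicted probability have a high loss, so sorting by loss in descending order puts the most suspicious labels first.

```python
import math

def cross_entropy(probs, label):
    """Loss of one sample: negative log of the probability given to its label."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    return -math.log(max(probs[label], 1e-12))

# Hypothetical model outputs: file name, assigned label, predicted class probabilities.
samples = [
    ("a.jpg", "Speed-Limit-30", {"Speed-Limit-30": 0.95, "Speed-Limit-90": 0.05}),
    ("b.jpg", "Speed-Limit-30", {"Speed-Limit-30": 0.10, "Speed-Limit-90": 0.90}),
    ("c.jpg", "Speed-Limit-90", {"Speed-Limit-30": 0.40, "Speed-Limit-90": 0.60}),
]

# Highest loss first: these are the samples most worth a manual look.
ranked = sorted(samples, key=lambda s: cross_entropy(s[2], s[1]), reverse=True)

for fname, label, probs in ranked:
    print(f"{fname}  label={label}  loss={cross_entropy(probs, label):.2f}")
```

In this toy data `b.jpg` comes out on top, because the model is fairly sure it shows a different sign than its label says.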

For more background, see the talk we gave at Open Source Summit NA 2022 ([video](https://www.youtube.com/watch?v=IS0k8rPVcmY), [slides](OSS%20NA%202022%20presentation.pdf)).

## Dependencies

The tool depends on [Pandas](https://pandas.pydata.org), [jupyter-server-proxy](https://jupyter-server-proxy.readthedocs.io/en/latest/) and [CherryPy](https://cherrypy.dev),
and should work with any JupyterLab or Jupyter Notebook setup. The demos use [fastai](https://docs.fast.ai)
for data manipulation and model training.

It does not work on Google Colab because I could not find a way to expose an HTTP server (needed to serve the images) from the running notebook. It could be supported in the future by moving the images to an object storage service (like Amazon S3).

### Quick start

It's easiest to start with any dataset in the ImageNet format (one folder per class) or with just a folder of unsorted pictures. If you have a custom dataset, you can look into [the notebook we wrote for the Mapillary dataset](1.%20Generate%20bbox%20crops%20from%20ground%20truth.ipynb). Support for the YOLO format is coming soon.

You can also test out the tool on the included traffic sign samples:

In [1]:
#| eval: false
#| output: false
from fastai.vision.all import *
from MLfix import MLfix

# get all the photos
fnames = get_image_files('./yolo-bbox-crops-aspects-traffic-signs/')

# put them into a DataFrame
data = pd.DataFrame(dict(fname = fnames), index = [str(x) for x in fnames])
# the class label is the parent folder name
data['label'] = data.fname.map(lambda x: x.parent.name)
# show a sample of rows
data.head()

| | fname | label |
|---|---|---|
| yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/002578-0.jpg | yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/002578-0.jpg | Speed-Limit-30-US |
| yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001971-0.jpg | yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001971-0.jpg | Speed-Limit-30-US |
| yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/002348-0.jpg | yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/002348-0.jpg | Speed-Limit-30-US |
| yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001564-0.jpg | yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001564-0.jpg | Speed-Limit-30-US |
| yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001842-1.jpg | yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001842-1.jpg | Speed-Limit-30-US |


In [2]:
#| eval: false
#| output: false 
# run MLfix and save the results into a new variable
new_labels = MLfix(data, label='label')

yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002058-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002725-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002585-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001661-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002724-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001385-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002726-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002586-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002055-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002057-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001475-0.jpg label = invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001476-0.jpg label = 

After working through the images you can check the results in the returned Pandas Series.

In [3]:
#| output: false
new_labels[new_labels == 'invalid']

yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001842-1.jpg    invalid
yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/002161-0.jpg    invalid
yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001841-0.jpg    invalid
yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001498-0.jpg    invalid
yolo-bbox-crops-aspects-traffic-signs/Speed-Limit-30-US/001844-0.jpg    invalid
                                                                         ...   
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001474-0.jpg     invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002584-0.jpg     invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/001384-0.jpg     invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002056-0.jpg     invalid
yolo-bbox-crops-aspects-traffic-signs/Lane-Reduce-Left/002054-0.jpg     invalid
Name: label, Length: 187, dtype: object

In [4]:
invalid = new_labels[new_labels == 'invalid']

For more usage examples and tricks, see the example notebooks in this repository. More examples are coming soon.

In [5]:
invalid.to_csv('invalid-traffic-signs.csv')
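Once the review is done, the returned Series can be used to clean the original DataFrame. The sketch below assumes, as in the quick start above, that `new_labels` is indexed by the same file-path strings as `data`. The `apply_review` helper and the toy paths are hypothetical and not part of MLfix. Only reviewed rows are updated, so it works whether or not the Series covers every image.

```python
import pandas as pd

# Toy stand-ins for `data` and `new_labels` from the quick start (paths are made up).
data = pd.DataFrame(
    {"label": ["Speed-Limit-30", "Speed-Limit-30", "Speed-Limit-90"]},
    index=["a/Speed-Limit-30/1.jpg", "a/Speed-Limit-30/2.jpg", "a/Speed-Limit-90/3.jpg"],
)
new_labels = pd.Series(
    ["invalid", "Speed-Limit-90"],
    index=["a/Speed-Limit-30/2.jpg", "a/Speed-Limit-90/3.jpg"],
    name="label",
)

def apply_review(data, new_labels, invalid_marker="invalid"):
    """Hypothetical helper: copy `data`, apply reviewed labels, drop invalid rows."""
    cleaned = data.copy()
    # Only touch rows that were actually reviewed and still exist in `data`.
    reviewed = new_labels[new_labels.index.isin(cleaned.index)]
    cleaned.loc[reviewed.index, "label"] = reviewed
    # Remove every sample that was marked as invalid during the review.
    return cleaned[cleaned["label"] != invalid_marker]

cleaned = apply_review(data, new_labels)
print(cleaned)
```

The original `data` is left untouched, so the cleaned and uncleaned versions can be compared, for example by training a model on each.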

