# MLfix – using AI and UI to explore and fix datasets

[![Gitter](https://badges.gitter.im/MLfix/community.svg)](https://gitter.im/MLfix/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)

![A futuristic robot cleaning streets of New York that are overflowing with papers.](banner.jpg)
Is your dataset overflowing with low-quality samples? Our highly skilled robots can help you! (Banner generated by [Centipede Diffusion](https://github.com/Zalring/Centipede_Diffusion/).)

### Why?

Even carefully curated AI datasets contain errors. In a highly curated dataset like the [Mapillary Traffic Sign Dataset](https://www.mapillary.com/dataset/trafficsign), even easy classes such as speed limits [have 3% mislabeled samples](./2.%20Mapillary%20speed-limit%20cleanup.ipynb). [After training, this results in a 2% loss of model accuracy on these classes.](./3.%20Did%20it%20help%3F.ipynb#The-results)

### What?

This repository contains tools which can help you find mistakes in your labels directly from your Jupyter notebook. You just need a folder full of images or an object detection dataset in one of the supported formats. Here is an example of MLfix running inside a Jupyter session:

![MLfix v2 usage example](https://user-images.githubusercontent.com/107984/184148474-839b8049-fe68-47f0-b4b5-cc83f27dfea8.mp4)

You can also [try it yourself now on Google Colab](https://colab.research.google.com/github/jpc/mlfix-mapillary-traffic-signs/blob/main/index.ipynb?authuser=1).

### How?

The tools work by sorting and grouping the images and then showing them in a streamlined user interface. The interface lets you mark photos as you go through your QA process. The images can be grouped and sorted by class or other metadata, as well as by AI-based methods such as visual similarity or validation loss.
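To make the idea of grouping by class and sorting by validation loss concrete, here is a minimal sketch in plain Python. It is not MLfix's actual implementation: the sample data and the `cross_entropy` and `review_order` helpers are hypothetical and only illustrate why high-loss samples are good candidates to review first.

```python
import math
from collections import defaultdict

# Hypothetical model outputs: (file name, assigned label, predicted class probabilities).
samples = [
    ("a.jpg", "speed-limit-30", {"speed-limit-30": 0.95, "speed-limit-50": 0.05}),
    ("b.jpg", "speed-limit-30", {"speed-limit-30": 0.10, "speed-limit-50": 0.90}),
    ("c.jpg", "speed-limit-50", {"speed-limit-30": 0.40, "speed-limit-50": 0.60}),
    ("d.jpg", "speed-limit-50", {"speed-limit-30": 0.02, "speed-limit-50": 0.98}),
]

def cross_entropy(label, probs, eps=1e-12):
    """Loss of one sample: -log of the probability the model gives to its assigned label."""
    return -math.log(max(probs.get(label, 0.0), eps))

def review_order(samples):
    """Group samples by assigned label and sort each group by descending loss."""
    groups = defaultdict(list)
    for fname, label, probs in samples:
        groups[label].append((cross_entropy(label, probs), fname))
    return {
        label: [fname for _, fname in sorted(items, reverse=True)]
        for label, items in groups.items()
    }

order = review_order(samples)
print(order)
```

In this toy data, `b.jpg` is labeled as a 30 limit but the model strongly predicts 50, so it ends up at the front of its group, which is where a likely labeling mistake should be shown to the reviewer.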

For more background, see the talk we gave at Open Source Summit NA 2022 ([video](https://www.youtube.com/watch?v=IS0k8rPVcmY), [slides](OSS%20NA%202022%20presentation.pdf)).

### Quick start

It's easiest to start with a dataset in the ImageNet format (one folder per class) or with just a folder of unsorted pictures. If you have a custom dataset, you can look at [the notebook we wrote for the Mapillary dataset](1.%20Generate%20bbox%20crops%20from%20ground%20truth.ipynb). Support for the YOLO format is coming soon.

You can also test out the tool on the included traffic sign samples:

```python
from fastai.vision.all import *
from MLfix import MLfix

# get all the photos
fnames = get_image_files('./mapillary-samples/')

# put them into a DataFrame indexed by file path
data = pd.DataFrame(dict(fname = fnames), index = [str(x) for x in fnames])
# the class label is the parent folder name
data['label'] = data.fname.map(lambda x: x.parent.name)
# show a sample of rows
data.head()
```

| | fname | label |
|---|---|---|
| mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/zRcBUlEsTXn9qTkAih2PNw-0.jpg | mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/zRcBUlEsTXn9qTkAih2PNw-0.jpg | regulatory--end-of-maximum-speed-limit-30--g2 |
| mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/0h4xymedlyjkvJFJFeJJQA-2.jpg | mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/0h4xymedlyjkvJFJFeJJQA-2.jpg | regulatory--end-of-maximum-speed-limit-30--g2 |
| mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/j0EqcGd-CWF6Z6lFvD-V5Q-4.jpg | mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/j0EqcGd-CWF6Z6lFvD-V5Q-4.jpg | regulatory--end-of-maximum-speed-limit-30--g2 |
| mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/sReQ1kF3d3bnoONxj-5vBw-0.jpg | mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/sReQ1kF3d3bnoONxj-5vBw-0.jpg | regulatory--end-of-maximum-speed-limit-30--g2 |
| mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/Rp8tVRtBprDttjBV9QZRhQ-2.jpg | mapillary-samples/regulatory--end-of-maximum-speed-limit-30--g2/Rp8tVRtBprDttjBV9QZRhQ-2.jpg | regulatory--end-of-maximum-speed-limit-30--g2 |


```python
# run MLfix and save the results into a new variable
new_labels = MLfix(data, label='label')
```

After working through the images you can check the results in the returned Pandas Series.

```python
new_labels[new_labels == 'invalid']
```

```
mapillary-samples/complementary--maximum-speed-limit-50--g1/VtooHtty7IxmAu4BzcYBPQ-2.jpg    invalid
mapillary-samples/complementary--maximum-speed-limit-50--g1/b_NqcMxp1fTPoCEn2iA_mw-0.jpg    invalid
Name: label, dtype: object
```
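Once you have the reviewed labels, a typical next step is to drop the rejected images before training. The sketch below shows one way to do that with plain pandas. It uses small stand-in objects with made-up file names and assumes that the returned Series shares its index with the original DataFrame, as the output above suggests; check this against your own results.

```python
import pandas as pd

# Stand-in for the DataFrame passed to MLfix (hypothetical file names).
data = pd.DataFrame(
    {"label": ["speed-limit-50", "speed-limit-50", "speed-limit-30"]},
    index=["a.jpg", "b.jpg", "c.jpg"],
)

# Stand-in for the Series returned by MLfix.
# Assumption: same index as `data`, with 'invalid' for rejected images.
new_labels = pd.Series(
    ["speed-limit-50", "invalid", "speed-limit-30"],
    index=data.index,
    name="label",
)

# keep only the rows that were not marked invalid
clean = data[new_labels != "invalid"]

# rows whose label differs from the original one after the review
changed = data[new_labels != data["label"]]

print(clean)
print(changed)
```

Here `clean` keeps `a.jpg` and `c.jpg`, while `changed` lists `b.jpg`, the only image whose label was modified during the review.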

For more usage examples and tricks, see the example notebooks in this repository. More examples are coming soon.