kennethleungty/Data-Centric-AI-Competition

Data-Centric AI Competition 2021 - Tips and Tricks of a Top 5% Finish

Sharing the techniques that worked (and did not work) in the competition organized by Andrew Ng & DeepLearning.AI

Link to Medium writeup: https://towardsdatascience.com/data-centric-ai-competition-tips-and-tricks-of-a-top-5-finish-9cacc254626e


Introduction

Data is food for AI, and there is vast potential for model performance improvement by shifting from a model-centric to a data-centric approach. That is the motivation behind the recent Data-Centric AI Competition organized by Andrew Ng and DeepLearning.AI.

In this repo, I share the methods (and code) behind my Top 5% submission (~84% accuracy, ranked 24th), including the techniques that worked and those that did not. Do check out the Medium article for a more in-depth look at my thought process and the methods behind the submission.


About the Competition

  • Link to competition page: https://https-deeplearning-ai.github.io/data-centric-comp/
  • A collaboration between DeepLearning.AI and Landing AI, the Data-Centric AI Competition aims to elevate data-centric approaches to improving the performance of machine learning models.
  • In most machine learning competitions, you are asked to build a high-performance model given a fixed dataset.
  • However, machine learning has matured to the point that high-performance model architectures are widely available, while approaches to engineering datasets have lagged.
  • The Data-Centric AI Competition inverts the traditional format and instead asks you to improve a dataset given a fixed model. We will provide you with a dataset to improve by applying data-centric techniques such as fixing incorrect labels, adding examples that represent edge cases, applying data augmentation, etc.
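
To make one of the techniques named above concrete, here is a minimal sketch of label-preserving data augmentation using small pixel shifts. It is an illustration only, not the code from the notebook in this repo: the list-of-lists image representation and the names `shift_image` and `augment_with_shifts` are assumptions made for the example. Small translations should leave a Roman numeral's label unchanged, whereas a transform such as a horizontal flip could make an "iv" resemble a "vi", so each augmentation needs a check that the label still holds.

```python
from typing import List, Tuple

# Hypothetical representation: a grayscale image as rows of pixel values (0 = ink, 255 = background).
Image = List[List[int]]


def shift_image(img: Image, dx: int, dy: int, fill: int = 255) -> Image:
    """Return a copy of img moved dx columns right and dy rows down.

    Pixels shifted outside the frame are dropped; vacated pixels get `fill`.
    """
    height, width = len(img), len(img[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                out[new_y][new_x] = img[y][x]
    return out


def augment_with_shifts(img: Image, label: str, max_shift: int = 1) -> List[Tuple[Image, str]]:
    """Build (image, label) pairs for every non-zero shift within max_shift pixels."""
    pairs = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            if dx == 0 and dy == 0:
                continue  # skip the unshifted original
            pairs.append((shift_image(img, dx, dy), label))
    return pairs


# Tiny 3x3 example with a single ink pixel in the centre.
sample = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
augmented = augment_with_shifts(sample, label="i")
```

With `max_shift=1`, each source image yields eight shifted copies that all carry the original label. In practice an image library would handle this more efficiently; the plain-Python version is only meant to show the idea.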

Contents

  • Full_Notebook_Best_Submission.ipynb (Complete code walkthrough of my best submission to the competition)
  • experiment_tracker.csv (Spreadsheet tracker I used to monitor my various experiments)
  • /data (Public Roman MNIST dataset released by the competition)