LAION-AI/balanced-laion5b

balanced-laion5b

This repository aims to help find a well-balanced data distribution for very large datasets such as LAION-5B, so that models can be trained more efficiently.

Project outline

  • Literature research
  • Evaluate possible balancing strategies
  • Find a good sample / test dataset, e.g. a subset of LAION-5B
  • Create adjective + noun dictionary
  • Run model/algorithm on LAION-400M
  • Run model/algorithm on LAION-5B
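
As an illustration of the "adjective + noun dictionary" step in the outline above, here is a minimal sketch that counts adjectives and nouns in captions. It assumes a tiny hand-written lexicon (`TOY_POS`, hypothetical) in place of a real part-of-speech tagger, and `build_dictionary` is an illustrative name, not part of this repository.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to coarse part-of-speech tags.
# A real pipeline would use a proper POS tagger instead of this lookup.
TOY_POS = {
    "red": "ADJ", "small": "ADJ", "old": "ADJ",
    "car": "NOUN", "dog": "NOUN", "house": "NOUN",
}


def build_dictionary(captions):
    """Count how often each known adjective and noun occurs in the captions."""
    counts = {"ADJ": Counter(), "NOUN": Counter()}
    for caption in captions:
        # Lowercase and keep alphabetic tokens only.
        for token in re.findall(r"[a-z]+", caption.lower()):
            pos = TOY_POS.get(token)
            if pos is not None:
                counts[pos][token] += 1
    return counts


captions = [
    "A small red car",
    "Old house with a red door",
    "small dog, small car",
]
dictionary = build_dictionary(captions)
print(dictionary["ADJ"].most_common())
print(dictionary["NOUN"].most_common())
```

Words missing from the lexicon (such as "door" here) are simply skipped, so the quality of the dictionary depends entirely on the tagger that replaces the toy lookup.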

Literature

Methods

Idea 1

  1. Cluster text/image embeddings (e.g., KMM/UMAP/PaCMAP/...) and thin out dense regions
  2. Compare the distribution of nouns/adjectives in the whole dataset with that in the reduced subset
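
The two steps of Idea 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: plain k-means is used as a stand-in clustering method, "reducing dense regions" is interpreted as capping the number of samples kept per cluster, and the two distributions are compared with total variation distance. The toy 2-D embeddings, the noun labels, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=10):
    """Very small k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def assign():
        return [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign()


def cap_clusters(labels, max_per_cluster, seed=0):
    """Keep at most max_per_cluster sample indices from every cluster."""
    rng = random.Random(seed)
    kept = []
    for cluster in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        kept.extend(members)
    return sorted(kept)


def total_variation(counts_a, counts_b):
    """Total variation distance between two normalised count distributions."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a[key] / total_a - counts_b[key] / total_b) for key in keys)


# Toy data (assumption): a dense region of 8 "cat" samples, a sparse one of 2 "tractor" samples.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.2),
              (10.0, 10.0), (10.1, 9.9)]
nouns = ["cat"] * 8 + ["tractor"] * 2

labels = kmeans(embeddings, k=2)
kept = cap_clusters(labels, max_per_cluster=2)

full_counts = Counter(nouns)
subset_counts = Counter(nouns[i] for i in kept)
tv = total_variation(full_counts, subset_counts)
print(full_counts, subset_counts, tv)
```

A larger distance means the subset deviates more from the original noun distribution, which is the intended effect when over-represented concepts are thinned out. Whether total variation is the right measure for this project is an open choice.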

Idea 2

  1. ...
  2. ...
