This repository aims to help find a well-balanced distribution for huge datasets such as LAION-5B, so that training on them becomes more efficient.
- Literature research
- Evaluate possible balancing strategies
- Find a good sample / test dataset, e.g. a subset of LAION-5B
- Create an adjective + noun dictionary
- Run model/algorithm on LAION-400M
- Run model/algorithm on LAION-5B
- https://arxiv.org/abs/2106.09994
- https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.603
- https://www.frontiersin.org/articles/10.3389/fpubh.2020.00274/full
- ...
- Cluster text/image embeddings (e.g., KMM/UMAP/PaCMAP/...) and reduce dense regions
- Compare the distribution of nouns/adjectives in the whole dataset with that in the subset
- ...
- ...
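
The two balancing ideas listed above (thinning dense regions of the embedding space and comparing noun/adjective distributions between the whole dataset and a subset) could be prototyped roughly as follows. This is only a minimal sketch, not the method this repository settles on. It assumes that cluster IDs have already been computed by some clustering step and that a noun/adjective vocabulary exists. The helper names (`cap_clusters`, `word_distribution`, `total_variation`), the per-cluster cap, and the toy data are all hypothetical, and total variation distance is just one possible way to compare two distributions.

```python
import random
from collections import Counter, defaultdict


def cap_clusters(cluster_ids, max_per_cluster, seed=0):
    """Keep at most `max_per_cluster` randomly chosen samples per cluster.

    `cluster_ids[i]` is the (precomputed, assumed) cluster of sample i.
    Returns the sorted indices of the samples that are kept.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    kept = []
    for indices in members.values():
        # Only dense clusters are thinned; small clusters are kept whole.
        if len(indices) > max_per_cluster:
            indices = rng.sample(indices, max_per_cluster)
        kept.extend(indices)
    return sorted(kept)


def word_distribution(captions, vocabulary):
    """Relative frequency of each vocabulary word across the captions."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word in vocabulary
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two word distributions (0 = identical)."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


# Toy data (hypothetical): one dense cluster of near-duplicate captions.
captions = ["red car", "red car", "red car", "red car", "blue bird", "green tree"]
cluster_ids = [0, 0, 0, 0, 1, 2]
vocabulary = {"red", "blue", "green", "car", "bird", "tree"}

subset_indices = cap_clusters(cluster_ids, max_per_cluster=1)
subset = [captions[i] for i in subset_indices]

full_dist = word_distribution(captions, vocabulary)
subset_dist = word_distribution(subset, vocabulary)
shift = total_variation(full_dist, subset_dist)
print(f"kept {len(subset)} of {len(captions)} samples, distribution shift {shift:.3f}")
```

In this toy case the capped subset keeps one caption per cluster, so every vocabulary word becomes equally frequent. On real data the cap and the clustering granularity would need tuning, and whitespace tokenization would likely be replaced by proper part-of-speech tagging.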