koalazf99/Awesome-DataCentric-LLM

Awesome-DataCentric-LLM

Trending projects & awesome papers about data-centric LLM studies, including large-scale data curation, data quality assessment, evaluation, toolkits, etc.

Papers

  1. Scaling Data-Constrained Language Models

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel [pdf] [code] [May 2023]

  2. A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

    Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito [pdf] [May 2023]

  3. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay [pdf] [Jun 2023]

  4. Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li [pdf] [Jun 2023]

  5. Textbooks Are All You Need II: phi-1.5 technical report

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee [pdf] [Sep 2023]

  6. What's In My Big Data?

    Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge [pdf] [code] [Oct 2023]

  7. SlimPajama-DC: Understanding Data Combinations for LLM Training

    Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing [pdf] [Sep 2023]

  8. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu [pdf] [code] [Nov 2023]

  9. Rephrasing the Web: A Recipe for Compute & Data-Efficient Language Modeling

    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly [pdf] [Jan 2024]

  10. QuRating: Selecting High-Quality Data for Training Language Models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen [pdf] [code] [Feb 2024]

  11. WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

    Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, Conghui He [pdf] [Feb 2024]

  12. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo (AI2) [pdf] [code] [Feb 2024]

  13. Instruction-tuned Language Models are Better Knowledge Learners

    Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer [pdf] [Feb 2024]

  14. Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

    Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei [pdf] [Feb 2024]

  15. How to Train Data-Efficient LLMs

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng [pdf] [Feb 2024]

  16. Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul [pdf] [May 2024]

  17. MAP-NEO: A fully open-sourced Large Language Model

    Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen [pdf] [code] [May 2024]

  18. MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

    Zichun Yu, Spandan Das, Chenyan Xiong [pdf] [code] [Jun 2024]

  19. Does your data spark joy? Performance gains from domain upsampling at the end of training

    Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle [pdf] [Jun 2024]

  20. DataComp-LM: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar [pdf] [code] [Jun 2024]

  21. Instruction Pre-Training: Language Models are Supervised Multitask Learners

    Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei [pdf] [Jun 2024]

  22. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu [pdf] [Jun 2024]

  23. Resolving Discrepancies in Compute-Optimal Scaling of Language Models

    Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon [pdf] [Jun 2024]

  24. RegMix: Data Mixture as Regression for Language Model Pre-training

    Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin [pdf] [code] [Jul 2024]

Projects & Blogs

  1. Language Model Evaluation Harness

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, Andy Zou [code] [report] [2023]

  2. Cosmopedia

    Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, Leandro von Werra (HuggingFaceTB) [code] [datasets] [Feb 2024]

  3. DataTrove: large scale data processing

    Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, Thomas Wolf [code] [Feb 2024]

  4. 🍷 FineWeb: decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Colin Raffel, Leandro von Werra, Thomas Wolf (HuggingFaceFW) [datasets] [report] [pdf] [May 2024]

  5. Hugging Face Ethics and Society Newsletter 6: Building Better AI: The Importance of Data Quality

    Avijit Ghosh and Lucie-Aimée Kaffee (Hugging Face) [blog] [Jun 2024]

Tutorials

  1. CSE599J: Data-centric Machine Learning

    Pang Wei Koh [website] [2023]