This repository is part of a "Personalized/Customized LLMs" project.
This README describes the DataScraping folder.
The first step is to collect a substantial amount of suitable data: user profiles (UPs) and the natural language text associated with them. We mainly consider pairs of the form (UP, text) or (UP, Q/A). We then process these data into datasets that can be used either to train personalized/customized LLMs or as benchmark corpora for a more standardized, comprehensive, and systematic evaluation of personalized LLMs by researchers, companies, and anyone else.
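
To make the (UP, text) and (UP, Q/A) pairs concrete, here is a minimal sketch of how one record could be represented and serialized. It is only an illustration: the names `UserProfile`, `Record`, `kind`, and `to_jsonl`, the profile fields, and the JSON Lines output format are all assumptions, not the schema this repository actually uses.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class UserProfile:
    # Hypothetical fields; a real profile schema depends on the scraped source.
    user_id: str
    attributes: dict


@dataclass
class Record:
    # One scraped sample: a user profile paired with either free text
    # or a question/answer pair.
    profile: UserProfile
    text: Optional[str] = None      # set for (UP, text) pairs
    question: Optional[str] = None  # set for (UP, Q/A) pairs
    answer: Optional[str] = None

    def kind(self) -> str:
        # Report which of the two pair types this record holds.
        if self.text is not None:
            return "text"
        if self.question is not None and self.answer is not None:
            return "qa"
        raise ValueError("record has neither text nor a complete Q/A pair")


def to_jsonl(records) -> str:
    # Serialize records as JSON Lines: one JSON object per line.
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)
```

A record could then be built with `Record(profile=UserProfile("u1", {"interests": ["chess"]}), text="...")` and a list of such records written out with `to_jsonl`. One object per line keeps large scraped corpora easy to stream and filter, though any tabular or JSON format would serve the same purpose.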