# 1.5 Leveraging Large Datasets

Large training datasets for popular GPT and BERT-like models represent diverse and comprehensive text corpora containing billions of words, covering a wide range of topics and both natural and computer languages. To provide a concrete example, Table 1.1 summarizes the datasets used to pre-train GPT-3, which served as the base model for the first version of ChatGPT.

**Table 1.1 Pre-training datasets for the GPT-3 large language model**

| Dataset name | Dataset description | Number of tokens | Proportion in training data |
|----------------|----------------------|------------------|-----------------------------|
| CommonCrawl (filtered) | Web crawl data | 410 billion | 60% |
| WebText2 | Web crawl data | 19 billion | 22% |
| Books1 | Internet-based book corpus | 12 billion | 8% |
| Books2 | Internet-based book corpus | 55 billion | 8% |
| Wikipedia | High-quality text | 3 billion | 3% |

Table 1.1 reports the number of tokens, which are the units of text read by the model. The number of tokens in the dataset is roughly equivalent to the number of words and punctuation marks in the text. We will describe tokenization, the process of converting text into tokens, in more detail in the next chapter.
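As a rough illustration of the "words plus punctuation marks" approximation, the following minimal sketch counts each word and each punctuation mark as one unit. The function name `rough_token_count` and the regular expression are assumptions made for this example only; the tokenizer actually used for GPT-3 works differently, so this count is only a ballpark figure.

```python
import re


def rough_token_count(text: str) -> int:
    """Count words and punctuation marks as a rough stand-in for tokens.

    This is an illustrative approximation, not the GPT-3 tokenizer.
    """
    # \w+ matches a run of word characters (one word);
    # [^\w\s] matches a single non-space, non-word character (one punctuation mark)
    return len(re.findall(r"\w+|[^\w\s]", text))


sample = "Hello, world! Is this a test?"
print(rough_token_count(sample))  # 6 words + 3 punctuation marks = 9
```

A real tokenizer may split rare words into several tokens or merge characters differently, which is why the table's token counts only roughly match word counts.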

The main takeaway is that the size and diversity of the training datasets enable these models to perform well on a wide variety of tasks involving language syntax, semantics, and context, including some that require general knowledge.

**GPT-3 dataset details**

Note that only a subset of each dataset in Table 1.1 was used for training: in total, 300 billion tokens were sampled across all five datasets. As a result, some datasets were not seen in full, while others were repeated multiple times to reach the 300-billion-token total. The proportion column gives each dataset's share of this sampled data; the shares add up to 100% apart from rounding error.
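The sampling arithmetic can be sketched with the numbers from Table 1.1. The snippet below is a back-of-the-envelope calculation, not a reproduction of the actual training pipeline: the helper name `sampled_tokens_and_passes` is hypothetical, and because the table's percentages are rounded, the results are approximate.

```python
TOTAL_TRAINING_TOKENS = 300e9  # total tokens sampled for training (from the text)

# (dataset size in tokens, proportion of the training mix), taken from Table 1.1
datasets = {
    "CommonCrawl (filtered)": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}


def sampled_tokens_and_passes(size_tokens, proportion, total=TOTAL_TRAINING_TOKENS):
    """Return (tokens drawn from a dataset, approximate passes over it)."""
    sampled = proportion * total
    return sampled, sampled / size_tokens


for name, (size, proportion) in datasets.items():
    sampled, passes = sampled_tokens_and_passes(size, proportion)
    print(f"{name:<24} {sampled / 1e9:6.1f}B tokens sampled, ~{passes:.2f} passes")

# The rounded percentages in the table sum to slightly more than 100%
total_proportion = sum(proportion for _, proportion in datasets.values())
print(f"Sum of proportions: {total_proportion:.2f}")
```

Under these rounded figures, less than half of CommonCrawl and Books2 is covered (fewer than one pass), whereas the smaller WebText2, Books1, and Wikipedia datasets are repeated more than once.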

For context, the filtered CommonCrawl dataset alone contains 410 billion tokens and requires about 570 GB of storage. By comparison, later models similar to GPT-3, such as Meta’s LLaMA, have broadened their training data to include additional sources such as Arxiv research papers (92 GB) and code-related Q&A from StackExchange (78 GB).

The Wikipedia corpus consists of the English Wikipedia. Although the authors of the GPT-3 paper did not specify further details, Books1 is likely a sample from Project Gutenberg (https://www.gutenberg.org/), and Books2 is likely from Libgen (https://en.wikipedia.org/wiki/Library_Genesis). CommonCrawl is a filtered subset of the CommonCrawl database (https://commoncrawl.org/), and WebText2 is the text of the web pages linked from Reddit posts with at least 3 upvotes.

The authors of the GPT-3 paper did not share the training dataset, but a publicly available similar dataset is The Pile (https://pile.eleuther.ai/). However, this collection may contain copyrighted works, and the exact terms of use may depend on the intended use case and country. For more information, see the discussion on HackerNews at (https://news.ycombinator.com/item?id=25607809).

Because these models are pre-trained, they are highly versatile when fine-tuned further for downstream tasks, which is why they are also known as base or foundation models. Pre-training large language models (LLMs) requires substantial resources and is very expensive. For example, the pre-training cost of GPT-3 is estimated to be $4.6 million in cloud computing credits.[2]

The good news is that many pre-trained LLMs are available as open source models and can be used as general-purpose tools to write, extract, and edit text that was not part of the training data. In addition, LLMs can be fine-tuned for specific tasks on relatively small datasets, which reduces the required computing resources and improves performance on those tasks.

In this book, we will implement the code for pretraining and use it to pretrain an LLM for educational purposes. All computations can be performed on consumer hardware. After implementing the pretraining code, we will learn how to reuse publicly available model weights and load them into the architecture we implement, which allows us to skip the expensive pretraining stage when we fine-tune LLMs later in the book.