## Summary

Since our data is all text, we have the luxury and the burden of storing raw text data. While we can put this data directly into a `Dataset` object, there are a few drawbacks to doing so:

- Choosing an object from one library (for example, the `datasets` package) may make it difficult to migrate to a different library in the future.
- If the `Dataset` object holds preprocessed or chunked text, we may not be able to reconstruct the full context. This matters because we may want to train a different model with a different context length in the future!
- If there is toxicity in our data, it is harder to go context by context in the Dataset object than to delete or edit the source file and rebuild.
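
To make the context-length point concrete, here is a minimal sketch of re-chunking the same raw text for two different context lengths. The helper `chunk_tokens` is hypothetical, and it splits on whitespace as a stand-in for a real tokenizer, which is an assumption made only to keep the example self-contained.

```python
def chunk_tokens(text: str, context_length: int) -> list[str]:
    """Split text into chunks of at most `context_length` whitespace tokens."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + context_length])
        for i in range(0, len(tokens), context_length)
    ]


raw_text = "the quick brown fox jumps over the lazy dog"

# Because we kept the raw text, we can rebuild chunks for any context length.
chunks_4 = chunk_tokens(raw_text, context_length=4)
chunks_6 = chunk_tokens(raw_text, context_length=6)

print(chunks_4)  # three chunks of up to 4 tokens each
print(chunks_6)  # two chunks of up to 6 tokens each
```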

First, let's talk about how to save a `Dataset`. Then, we can discuss the raw data.

### Review: Saving a Dataset
If we're going to use HuggingFace `transformers` for our pre-trained model, then we'll want to have a `Dataset` object from the `datasets` library. How we save it depends on whether we want to save to the local disk or to the [HuggingFace hub](https://huggingface.co/datasets).

To create a `Dataset` from a Python dictionary, save it to disk, and load it back from disk, we can run:

    from datasets import Dataset, load_from_disk

    # Create a dictionary with one list per column
    data_dict = {
        "courses": ["Deep Learning", "Datasets for LLMs"],
        "type": ["Nanodegree", "Standalone"],
    }

    # Create a Dataset object from the dictionary
    ds = Dataset.from_dict(data_dict)

    # Save the Dataset object to local disk
    ds.save_to_disk("my_dataset.hf")
    print("Dataset saved!")

    # Load the Dataset object from local disk
    ds = load_from_disk("my_dataset.hf")
    print("Dataset loaded!")

If you want to upload to the hub instead, check out the instructions on how to [share a dataset to the Hub](https://huggingface.co/docs/datasets/upload_dataset). If you're using a dataset from the hub, you'll use `load_dataset` instead of `load_from_disk`.

### Structuring our Raw Data
In terms of structure, there are two common options, each with its pros and cons:

- Option 1: Storing each context in its own `.txt` file, and, if applicable, storing question/answer pairs or instruction/output pairs in a separate `.csv` file.
- Option 2: Storing everything within one `.csv` file (potentially one each for train/validation/test sets).
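
As a rough sketch of what reading the Option 1 layout could look like, the snippet below joins rows from a question/answer CSV with the `.txt` context each row points to. The directory name `contexts`, the file name `qa_pairs.csv`, the column names, and the helper `load_records` are all assumptions chosen for illustration, not a required layout.

```python
import csv
import tempfile
from pathlib import Path


def load_records(contexts_dir: Path, qa_csv: Path) -> list[dict]:
    """Join each question/answer row with the text of the context it refers to."""
    records = []
    with qa_csv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: a "context_file" column names the matching .txt file.
            context_path = contexts_dir / row["context_file"]
            records.append({
                "context": context_path.read_text(encoding="utf-8"),
                "question": row["question"],
                "answer": row["answer"],
            })
    return records


# Build a tiny throwaway dataset so the sketch runs end to end.
root = Path(tempfile.mkdtemp())
contexts_dir = root / "contexts"
contexts_dir.mkdir()
(contexts_dir / "doc_0001.txt").write_text(
    "Paris is the capital of France.", encoding="utf-8"
)

qa_csv = root / "qa_pairs.csv"
with qa_csv.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["context_file", "question", "answer"])
    writer.writerow(["doc_0001.txt", "What is the capital of France?", "Paris"])

records = load_records(contexts_dir, qa_csv)
print(records[0]["question"], "->", records[0]["answer"])
```

A list of dictionaries like `records` could then be handed to `Dataset.from_list` if we want a `Dataset` object, while the raw files remain the source of truth.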
  
#### Option 1
Storing contexts separately in their own .txt files offers several advantages at the cost of additional complexity.

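First, if we wish to [stream](https://huggingface.co/docs/datasets/stream) our data, we'll need it in a format that lends itself to streaming. Many small files can be read incrementally and split across workers, whereas a single monolithic CSV is much harder to shard, so it isn't going to do that for us!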

Next, from a version control perspective, it's much easier to generate a list of files that have been added, deleted, and modified than to compare the differences between two files.

Finally, if we are storing our data in the cloud but training offline, we can reduce our egress charges by transferring only missing or updated files rather than retransmitting an entire large CSV.
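
Both the change list and the incremental transfer described above can rest on the same idea: a manifest that maps each file to a hash of its contents. The sketch below is one possible way to do this with the standard library; the helpers `build_manifest` and `diff_manifests` are hypothetical names, and a real project might rely on its version control or sync tool instead.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each .txt file's relative path to the SHA-256 hash of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.txt"))
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List the files added, deleted, and modified between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Demonstrate on a throwaway directory.
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("first context", encoding="utf-8")
(root / "b.txt").write_text("second context", encoding="utf-8")
before = build_manifest(root)

(root / "a.txt").write_text("first context, edited", encoding="utf-8")
(root / "b.txt").unlink()
(root / "c.txt").write_text("third context", encoding="utf-8")
after = build_manifest(root)

changes = diff_manifests(before, after)
print(changes)
# {'added': ['c.txt'], 'deleted': ['b.txt'], 'modified': ['a.txt']}
```

Only the files listed under `added` and `modified` would need to be transferred; everything else is already up to date.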

Of course, there are drawbacks!

Chief among the drawbacks is that if we are storing many, many files, we need to have a naming convention that guarantees each file receives a unique name. Otherwise, we risk overwriting our hard-earned data.
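
One possible naming convention, shown here only as a sketch, is to name each file after a hash of its content. The helper `save_context` is hypothetical. Under this scheme, two different texts get different names (barring a hash collision), and identical texts map to the same file, so a duplicate never overwrites anything new.

```python
import hashlib
import tempfile
from pathlib import Path


def save_context(text: str, directory: Path) -> Path:
    """Write `text` to a file named after the SHA-256 hash of its content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = directory / f"{digest}.txt"
    # Identical content maps to the same name, so an existing file is left alone.
    if not path.exists():
        path.write_text(text, encoding="utf-8")
    return path


directory = Path(tempfile.mkdtemp())
first = save_context("Deep Learning is a Nanodegree.", directory)
second = save_context("Datasets for LLMs is a standalone course.", directory)
duplicate = save_context("Deep Learning is a Nanodegree.", directory)

print(first.name != second.name)  # True: different content, different names
print(first == duplicate)         # True: same content, same file
```

The trade-off is that hashed names are not human-readable, so a separate index (for example, a column in the accompanying `.csv`) is needed to find a given context.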

Secondly, the directory structure matters: we may want separate directories for train/validation/test data, but this means we have to manage files and track versions across those directories.

Lastly, since there may be many, many files in our dataset, we can run into filesystem issues like exhausting the available [inodes](https://en.wikipedia.org/wiki/Inode) on older filesystems.

#### Option 2
The advantage of the CSV option is, of course, that it is easy to keep track of! Having one file greatly reduces the complexity of managing and moving datasets.

Furthermore, using a single `train.csv` file to populate a Pandas `DataFrame` or HuggingFace `Dataset` is very convenient.

On the other hand, for large datasets, these files can become **extremely** large and, therefore, challenging to move and share.

Additionally, since all the contexts are stored in a single (large!) file, versioning can be challenging, as we'll need to `diff` two very large files.

The smaller your dataset is, the more viable the CSV option is. As the dataset grows, a filesystem-based approach becomes increasingly necessary. Ultimately, which option you choose should depend on your comfort managing the raw data files and on whether you expect your dataset to keep growing.


