1. Why would you want to use the Data API?

The Data API provides simplified abstractions for consuming data when the dataset is too large to be loaded and held in memory.

It abstracts over the complexities of reading large datasets efficiently from disk, shuffling, parsing, batching, and prefetching to main memory or even GPU memory. This avoids the need for complex and potentially buggy custom implementations. It also provides these abstractions as TensorFlow functions, so that they can be used with AutoGraph.
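As a rough illustration of the steps listed above, here is a minimal sketch of a `tf.data` pipeline. The small in-memory tensor is only a stand-in for a real on-disk dataset, and the doubling `map` function is a hypothetical placeholder for real parsing or preprocessing.

```python
import tensorflow as tf

# Hypothetical stand-in for a dataset that would normally be read from disk.
features = tf.range(10, dtype=tf.float32)

dataset = (
    tf.data.Dataset.from_tensor_slices(features)
    .shuffle(buffer_size=10, seed=42)  # shuffle within a buffer of 10 elements
    .map(lambda x: x * 2.0, num_parallel_calls=tf.data.AUTOTUNE)  # placeholder preprocessing
    .batch(4)  # group elements into batches of up to 4
    .prefetch(tf.data.AUTOTUNE)  # prepare upcoming batches while the current one is consumed
)

batches = [batch.numpy() for batch in dataset]
print(len(batches))
```

Each step is one method call, and the same chain works unchanged if the source is swapped for files on disk.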

2. What are the benefits of splitting a large dataset into multiple files?

One benefit is the ability to interleave reads across files, which provides a mechanism to shuffle a large dataset without having to shuffle it before reading.
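To make the interleaving idea concrete, here is a pure-Python sketch that takes one record from each shard in turn. The `interleave` helper and the shard contents are hypothetical and only illustrate the concept; in `tf.data` the equivalent behavior comes from `Dataset.interleave`.

```python
def interleave(sources):
    """Yield one record from each source in turn until all are exhausted."""
    iterators = [iter(source) for source in sources]
    while iterators:
        still_open = []
        for iterator in iterators:
            try:
                record = next(iterator)
            except StopIteration:
                continue  # this shard is exhausted, drop it
            still_open.append(iterator)
            yield record
        iterators = still_open


# Hypothetical shards standing in for separate files on disk.
shards = [["a0", "a1", "a2"], ["b0", "b1"], ["c0"]]
mixed = list(interleave(shards))
print(mixed)
```

Records from different files end up next to each other, so a modest shuffle buffer downstream is enough to mix the data reasonably well.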

Another benefit is that it provides natural partitions of the dataset, where subsets of files can be downloaded to different machines for distributed training.

3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?

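TensorBoard provides profiling. You can use it to determine whether preprocessing is the bottleneck: if the GPU is not fully utilized, the input pipeline is probably the limiting factor. If needed, optimize the pipeline (for example by prefetching and by parallelizing the preprocessing) so that the GPU stays fully utilized.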

4. Can you save any binary data to a TFRecord file, or only serialized protocol buffers?

The wording of this question is confusing, but you can write any arbitrary binary data (text, images, audio, video, etc.) in a `BytesList`. These bytes will be part of a `Feature`, and all of it will be within a serialized protocol buffer. Binary data and protocol buffers are not mutually exclusive: the protocol buffer can contain byte fields if needed. The TFRecord file itself is just a sequence of binary records, so a record can technically hold any bytes, although in practice most records are serialized protocol buffers.
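Here is a small sketch of wrapping arbitrary bytes in a `BytesList` inside an `Example`, then reading them back. The payload and the feature name `"image"` are made up for illustration.

```python
import tensorflow as tf

# Hypothetical binary payload; any bytes would do (image, audio, and so on).
raw_bytes = b"\x89PNG arbitrary binary payload"

example = tf.train.Example(
    features=tf.train.Features(
        feature={
            # "image" is an illustrative feature name, not a required one.
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[raw_bytes])
            ),
        }
    )
)

# This serialized string is what would be written as one TFRecord record.
serialized = example.SerializeToString()

# Parse it back and pull the original bytes out of the BytesList.
parsed = tf.train.Example.FromString(serialized)
recovered = parsed.features.feature["image"].bytes_list.value[0]
print(recovered == raw_bytes)
```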

5. Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?

Using the provided `Example` protobuf format avoids the need to distribute `.proto` files and to run `protoc` to compile them for the target language. Defining your own protobuf is likely more of a hassle.

6. When using TFRecords, when would you want to activate compression? Why not do it systematically?

The author doesn't say why, but I imagine the tradeoff is between the CPU cost of decompression and the latency of transferring the data over the network. If the data does not need to be transmitted over the network and disk bandwidth is high enough, preprocessing will be faster without decompression. There is some nuance: if the disk is slow, it might be faster to load less data from disk and decompress it. This is all speculation, though, and depends entirely on the system setup. The best thing to do is to profile and evaluate performance.
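As a rough way to see this tradeoff on a given machine, the sketch below uses Python's standard `gzip` module to compare the size saved against the time spent compressing and decompressing. The payload is synthetic and highly repetitive, so it compresses far better than real data would. It is an assumption-laden stand-in, not a TFRecord benchmark.

```python
import gzip
import time

# Synthetic, highly repetitive payload (hypothetical; real records compress less well).
payload = b"label=1,feature=0.5;" * 50_000

start = time.perf_counter()
compressed = gzip.compress(payload)
compress_seconds = time.perf_counter() - start

start = time.perf_counter()
restored = gzip.decompress(compressed)
decompress_seconds = time.perf_counter() - start

ratio = len(compressed) / len(payload)
print(f"size ratio: {ratio:.4f}")
print(f"compress: {compress_seconds:.4f}s, decompress: {decompress_seconds:.4f}s")
```

The numbers only mean something relative to the disk or network throughput of the actual setup, which is why profiling the real pipeline remains the better test.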

7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

When writing files:
- Can pre-shuffle data when writing to file
- Doing any other transformation is possibly not a good idea: the stored training data should be as close as possible to the raw data seen at inference time

Within the tf.data pipeline:
- Good for parsing, shuffling, batching, and so on.

Preprocessing layers within the model:
- Good to ensure that the model performs consistently across different environments, but can be potentially more costly at inference time.

TF Transform:
- Good when preprocessing is expensive, where computing it once before training improves performance
- Good when there are multiple deployment targets like TensorFlow.js or TensorFlow Lite, where there could otherwise be training/serving skew

8. Name a few common techniques you can use to encode categorical features. What about text?

Categorical encoding and one-hot encoding.
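As a small illustration of one-hot encoding, here is a pure-Python sketch. The vocabulary and the `one_hot` helper are hypothetical; in practice a library layer or function would normally do this.

```python
# Hypothetical vocabulary of categories.
vocabulary = ["red", "green", "blue"]

# Map each category to an integer index.
index_of = {category: index for index, category in enumerate(vocabulary)}


def one_hot(category):
    """Return a vector with a single 1 at the category's index."""
    vector = [0] * len(vocabulary)
    vector[index_of[category]] = 1
    return vector


print(one_hot("green"))
```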

For text, term frequency-inverse document frequency (TF-IDF) is one approach. This transforms a document into a vector in which each dimension holds the count of one word. Each count is then scaled by the word's inverse document frequency, to ensure that very common terms (which carry little information) are not overrepresented.
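The following is a minimal pure-Python sketch of that computation, assuming whitespace tokenization and the plain `log(N / df)` form of inverse document frequency. Libraries typically use smoothed variants, so their numbers will differ. The three documents and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the bird flew"]


def tfidf(documents):
    """Return (vocabulary, vectors) using raw counts scaled by log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] * idf[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tfidf(docs)
print(vocabulary)
print(vectors[0])
```

In this toy corpus "the" appears in every document, so its weight is zero, while a word unique to one document such as "cat" gets the largest weight.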

Another approach is word embeddings, where a vector representation is learned in a much lower-dimensional space than a bag-of-words representation. Similar words are encouraged to have similar representations as part of the training process.