# 13. Loading and Preprocessing Data with TensorFlow

### 1. Why would you want to use the `tf.data` API?

The `tf.data` API in TensorFlow offers several benefits when handling data:

1. It can stream datasets that are larger than the available memory, reading records as they are needed instead of loading everything at once.
2. It lets us build an entire input pipeline (reading, shuffling, transforming, batching) that is optimized to reduce training time.
3. It can read and preprocess data on multiple CPU threads and prefetch batches, so the CPU prepares the next batch while the GPU trains on the current one and the hardware is used more efficiently.
4. It can read from many different data sources (CSV files, SQL databases, binary files, etc.) in an optimized manner.
5. It supports the TFRecord format, a binary format that usually contains serialized protocol buffers (protobufs) and is well suited to large datasets. TFRecord files can be read efficiently and in parallel, which helps reduce training time.
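As a rough illustration of the points above, here is a minimal sketch of a `tf.data` pipeline. The toy tensors and the `preprocess` function are hypothetical placeholders, not part of the exercise.

```python
import tensorflow as tf

# Hypothetical toy data: 10 scalar features with binary labels.
features = tf.range(10, dtype=tf.float32)
labels = tf.cast(features % 2, tf.int32)

def preprocess(x, y):
    # Placeholder transformation; a real pipeline would parse or augment here.
    return x / 10.0, y

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=10, seed=42)                      # shuffle the records
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # preprocess in parallel
    .batch(4)                                              # group into batches
    .prefetch(tf.data.AUTOTUNE)                            # prepare batches ahead of time
)

# Materialize the batches as NumPy arrays to inspect them.
batches = list(dataset.as_numpy_iterator())
```

With 10 records and a batch size of 4, the pipeline yields two full batches and one smaller final batch.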

### 2. What are the benefits of splitting a large dataset into multiple files?

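Splitting a large dataset into multiple files lets us shuffle at a coarse level (the order of the files) before shuffling at a finer level with a shuffling buffer, and it lets us interleave the lines of several files as we read them. This helps ensure the data is well shuffled and avoids correlation between examples that are close together in the dataset. Multiple files can also be read in parallel, and they make it possible to handle a dataset that would not fit on a single machine.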
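The following sketch shows one way to interleave lines from several files. The three small CSV shards are hypothetical and are written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical shards: 3 files with 4 lines each, written for illustration only.
tmp_dir = tempfile.mkdtemp()
for shard in range(3):
    path = os.path.join(tmp_dir, f"shard_{shard}.csv")
    with open(path, "w") as f:
        for row in range(4):
            f.write(f"{shard},{row}\n")

# Coarse shuffling: list the files in a shuffled order.
file_ds = tf.data.Dataset.list_files(
    os.path.join(tmp_dir, "shard_*.csv"), shuffle=True, seed=42
)

# Read 3 files at a time, pulling lines from each in turn,
# then shuffle at a finer level with a small buffer.
lines_ds = file_ds.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE,
).shuffle(buffer_size=8, seed=42)

lines = [line.decode() for line in lines_ds.as_numpy_iterator()]
```

Every line from every shard appears exactly once in `lines`, but lines from different files are mixed together.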

### 3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?

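We can detect this by measuring how long it takes to load the data and feed it to the model. TensorBoard's profiler can visualize the time spent in the input pipeline compared with the time spent training the model; if the GPU is not fully utilized, the input pipeline is likely to be the bottleneck. We can fix this by reading and preprocessing the data in parallel on multiple threads, prefetching a few batches, and caching the dataset if it fits in memory. If that is not enough, we can make the preprocessing code more efficient or preprocess the data once when writing the files.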
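A rough complement to the profiler is to time the pipeline on its own, without any model. The sketch below assumes a hypothetical helper `time_one_epoch` and a stand-in `preprocess` function; the measured times depend on the machine, so it shows the technique rather than a benchmark result.

```python
import time

import tensorflow as tf

def time_one_epoch(dataset):
    """Iterate over the dataset without a model; return (batch count, seconds)."""
    start = time.perf_counter()
    n_batches = 0
    for _ in dataset:
        n_batches += 1
    return n_batches, time.perf_counter() - start

def preprocess(x):
    # Stand-in for real preprocessing work (an assumption for this sketch).
    return tf.math.sqrt(tf.cast(x, tf.float32))

raw = tf.data.Dataset.range(1000)

# Sequential pipeline with no parallelism or prefetching.
naive = raw.map(preprocess).batch(32)

# Same pipeline with parallel preprocessing, caching, and prefetching.
tuned = (
    raw.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

naive_batches, naive_seconds = time_one_epoch(naive)
tuned_batches, tuned_seconds = time_one_epoch(tuned)
```

If one pass over the pipeline alone takes about as long as a training epoch, the input pipeline is probably what limits training speed.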

### 4. Can you save any binary data to a TFRecord file, or only serialized protocol buffers?

We can save any binary data to a TFRecord file, since a TFRecord file is simply a sequence of binary records of varying sizes. In practice, however, most TFRecord files contain serialized protocol buffers (protobufs), because protobufs are extensible, portable, and efficient. Additionally, TensorFlow includes protobuf support, which makes them convenient for saving complex data (images, text).
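A minimal sketch of writing arbitrary byte strings to a TFRecord file and reading them back; the file name and payloads are made up for the example.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical output path and payloads.
path = os.path.join(tempfile.mkdtemp(), "raw_bytes.tfrecord")
records = [b"plain text", b"\x00\x01\x02 arbitrary bytes"]

# Each call to write() stores one binary record; no protobuf is required.
with tf.io.TFRecordWriter(path) as writer:
    for record in records:
        writer.write(record)

# Reading returns the records as byte strings, in the order they were written.
read_back = list(tf.data.TFRecordDataset(path).as_numpy_iterator())
```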

### 5. Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?

The `Example` protobuf provided by TensorFlow has the advantage that TensorFlow includes operations to parse it, such as `tf.io.parse_single_example()`, so it can be used directly in a `tf.data` pipeline. With our own protobuf definition, we would either have to wrap the parsing code in `tf.py_function`, which may slow down training and make the pipeline less portable, or use `tf.io.decode_proto()`. The latter should be seen as an option only when the `Example` protobuf does not meet our requirements.
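The sketch below builds an `Example`, serializes it, and parses it back with TensorFlow operations. The feature names (`name`, `id`, `scores`) and their values are hypothetical.

```python
import tensorflow as tf

# Build an Example with three hypothetical features.
example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alice"])),
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    "scores": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5])),
}))
serialized = example.SerializeToString()

# Describe the expected features so TensorFlow can parse the serialized bytes.
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "scores": tf.io.VarLenFeature(tf.float32),  # variable-length list
}
parsed = tf.io.parse_single_example(serialized, feature_description)

# Variable-length features are parsed as sparse tensors.
scores = tf.sparse.to_dense(parsed["scores"])
```

Because parsing uses only TensorFlow operations, the same `feature_description` can be applied inside `dataset.map()`.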

### 6. When using TFRecords, when would you want to activate compression? Why not do it systematically?

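We would want to activate compression when the TFRecord files have to be downloaded by the training script and the bandwidth of our network may be a limitation, since smaller files load faster. We should not do it systematically: when the files are stored on the same machine as the training script, decompressing them costs CPU time for little benefit. The compression itself (for example, GZIP) is lossless, so no information is lost.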
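As a sketch of how compression can be enabled, the example below writes the same records with and without GZIP compression and reads the compressed file back. The file names and the repetitive payload are hypothetical, chosen so that compression has a visible effect.

```python
import os
import tempfile

import tensorflow as tf

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.tfrecord")
gzip_path = os.path.join(tmp_dir, "compressed.tfrecord")

# Hypothetical, highly repetitive records (they compress well).
records = [b"highly repetitive payload " * 100 for _ in range(50)]

# Uncompressed file.
with tf.io.TFRecordWriter(plain_path) as writer:
    for record in records:
        writer.write(record)

# GZIP-compressed file.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(gzip_path, options=options) as writer:
    for record in records:
        writer.write(record)

# The reader must be told which compression was used.
restored = list(
    tf.data.TFRecordDataset(gzip_path, compression_type="GZIP").as_numpy_iterator()
)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
```

The restored records are identical to the originals, and the compressed file is smaller; the price is the CPU time spent decompressing at read time.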

### 7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model. Can you list a few pros and cons of each option?

- If we preprocess the data when writing the files, training is efficient because the data is preprocessed only once rather than at every epoch. However, it is harder to experiment with different preprocessing logic, and we need to make sure that the same preprocessing functions and the libraries they depend on are available on the production server.
- If we include the preprocessing in the `tf.data` pipeline, it is easy to adjust the preprocessing logic, and the pipeline can run it in parallel. However, the data is preprocessed on the fly at every epoch (unless we cache it), and we still need to make sure the same preprocessing is applied in production, which may affect the portability of the model.
- If we include the preprocessing in the model by using the preprocessing layers provided by Keras, the preprocessing is saved with the model, so portability is not affected. However, the preprocessing runs on every batch at every epoch, so the training speed is affected.

A good alternative is to apply the preprocessing layers in the `tf.data` pipeline, outside of the model, during training. Once the model is trained, we can create a final model that chains the preprocessing layers and the trained model, and save that. This way the preprocessing is saved with the model, and neither portability nor training speed is affected.
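The alternative described above can be sketched as follows, assuming a `Normalization` layer as the only preprocessing step. The random data, the tiny network, and the training settings are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical raw data: 64 instances with 3 unscaled features.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(64, 3)).astype("float32")
y = (X.sum(axis=1, keepdims=True) > 15.0).astype("float32")

# Fit the preprocessing layer once, outside of the model.
norm = tf.keras.layers.Normalization()
norm.adapt(X)

# During training, apply the layer inside the tf.data pipeline.
train_ds = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .batch(16)
    .map(lambda xb, yb: (norm(xb), yb), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

# The model itself only ever sees preprocessed inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="sgd")
model.fit(train_ds, epochs=1, verbose=0)

# For deployment, chain the preprocessing layer and the trained model.
final_model = tf.keras.Sequential([norm, model])
raw_predictions = final_model.predict(X[:5], verbose=0)
```

`final_model` accepts raw, unscaled inputs, so it can be saved and deployed without a separate preprocessing step.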

### 8. Name a few common ways you can encode categorical integer features. What about text?

We can encode a categorical integer feature using *one-hot encoding* (a vector with a one in the column of the instance's category and zeros in all the other columns), *multi-hot encoding* (like *one-hot* encoding, but for instances that have several categories at once, with a one in the column of each category that is present), or *count encoding* (a vector containing the number of times each category appears). When there are many categories, a trainable *embedding* is usually a better choice.
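As an illustration, the sketch below applies the three encodings with the Keras `CategoryEncoding` layer. The category IDs and the number of categories (4) are made up for the example.

```python
import numpy as np
import tensorflow as tf

# One layer per encoding, assuming 4 possible categories (IDs 0 to 3).
one_hot = tf.keras.layers.CategoryEncoding(num_tokens=4, output_mode="one_hot")
multi_hot = tf.keras.layers.CategoryEncoding(num_tokens=4, output_mode="multi_hot")
count = tf.keras.layers.CategoryEncoding(num_tokens=4, output_mode="count")

# One category per instance.
single = np.array([3, 2, 0, 1])
# Two categories per instance (possibly repeated).
multiple = np.array([[0, 1], [0, 0], [1, 2], [3, 1]])

one_hot_out = np.asarray(one_hot(single))
multi_hot_out = np.asarray(multi_hot(multiple))
count_out = np.asarray(count(multiple))
```

For the instance `[0, 0]`, multi-hot encoding gives `[1, 0, 0, 0]` while count encoding gives `[2, 0, 0, 0]`, which is the only difference between the two outputs here.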

For text, we have *one-hot encoding* (similar to the integer case, but each column corresponds to a word of the vocabulary instead of a category), *hashing encoding* (each word is mapped to a bin using a hash of the word), *embeddings* (each word is represented by a dense vector that captures its meaning and context), and *TF-IDF* (each word is weighted by how often it appears in the document, scaled down for words that appear in many documents).
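A minimal sketch of some of these text encodings using Keras layers. The two-sentence corpus and the number of hash bins are hypothetical, and embeddings are left out because they must be trained.

```python
import numpy as np
import tensorflow as tf

# Hypothetical corpus of two short documents.
corpus = tf.constant(["the cat sat on the mat", "the dog ate the bone"])

# Word counts per document, one column per vocabulary word.
count_vectorizer = tf.keras.layers.TextVectorization(output_mode="count")
count_vectorizer.adapt(corpus)

# TF-IDF weights per document, with the same column layout.
tfidf_vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
tfidf_vectorizer.adapt(corpus)

vocabulary = count_vectorizer.get_vocabulary()
counts = np.asarray(count_vectorizer(corpus))
tfidf = np.asarray(tfidf_vectorizer(corpus))

# Hashing: map each word to one of 8 bins without building a vocabulary.
hasher = tf.keras.layers.Hashing(num_bins=8)
hashed = np.asarray(hasher(tf.constant([["cat"], ["dog"]])))
```

The vocabulary is sorted by frequency after the unknown-word token, so the column for "the" comes first among the real words.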