# TensorFlow Dataset API

**Learning Objectives**
1. Learn how to use tf.data to read data from memory
1. Learn how to use tf.data in a training loop
1. Learn how to use tf.data to read data from disk
1. Learn how to write production input pipelines with feature engineering (batching, shuffling, etc.)


In this notebook, we will start by refactoring the linear regression we implemented in the previous lab so that it takes data from a `tf.data.Dataset`, and we will learn how to implement **stochastic gradient descent** with it. In this case, the original dataset will be synthetic and read by the `tf.data` API directly from memory.

In the second part, we will learn how to load a dataset with the `tf.data` API when the dataset resides on disk.

Each learning objective will correspond to a __#TODO__ in the [student lab notebook](../labs/2_dataset_api.ipynb) -- try to complete that notebook first before reviewing this solution notebook.


In [5]:
# !python3 -m pip install 'tensorflow[and-cuda]'

In [2]:
# The json module converts between Python objects and JSON strings
import json
# The math module provides mathematical functions
import math
# The os module provides functions for interacting with the operating system
import os
# The pprint module pretty-prints arbitrary Python data structures
from pprint import pprint

# Data processing libraries: NumPy and TensorFlow
import numpy as np
import tensorflow as tf
# Show the currently installed version of TensorFlow
print(tf.version.VERSION)

# Reduce the verbosity of TensorFlow's C++ logging
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

2025-01-16 17:28:23.444122: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1737048503.469219    3517 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1737048503.476613    3517 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-16 17:28:23.502255: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


2.18.0


## Loading data from memory

### Creating the dataset

Let's consider the synthetic dataset of the previous section:

In [3]:
N_POINTS = 10
# tf.constant() creates a constant tensor from a tensor-like object.
X = tf.constant(range(N_POINTS), dtype=tf.float32)
Y = 2 * X + 10

2025-01-16 17:28:27.994449: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


In [7]:
## X,Y as tensors
print(X)
print(Y)

## X,Y as NumPy
print(X.numpy())
print(Y.numpy())

tf.Tensor([0. 1. 2. 3. 4. 5. 6. 7. 8. 9.], shape=(10,), dtype=float32)
tf.Tensor([10. 12. 14. 16. 18. 20. 22. 24. 26. 28.], shape=(10,), dtype=float32)
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[10. 12. 14. 16. 18. 20. 22. 24. 26. 28.]


We begin with implementing a function that takes as input


- our $X$ and $Y$ vectors of synthetic data generated by the linear function $y= 2x + 10$
- the number of passes over the dataset we want to train on (`epochs`)
- the size of the batches the dataset is split into (`batch_size`)

and returns a `tf.data.Dataset`: 

**Remark:** The last batch may contain fewer elements than `batch_size` if the dataset is exhausted before the batch is full.

If we want every batch to contain exactly the same number of elements, we have to discard the last batch by
setting:

```python
dataset = dataset.batch(batch_size, drop_remainder=True)
```

We will do that here.
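The effect of `drop_remainder` can be checked on a toy dataset. The following is a minimal sketch, assuming `tf.data.Dataset.range(10)` as a stand-in for our ten data points; the variable names are illustrative and not part of the lab.

```python
import tensorflow as tf

# Ten elements (0..9) grouped into batches of three.
toy = tf.data.Dataset.range(10)

# Without drop_remainder, the last batch holds the single leftover element.
sizes_keep = [len(batch) for batch in toy.batch(3)]

# With drop_remainder=True, the incomplete last batch is discarded.
sizes_drop = [len(batch) for batch in toy.batch(3, drop_remainder=True)]

print(sizes_keep)  # [3, 3, 3, 1]
print(sizes_drop)  # [3, 3, 3]
```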

In [8]:
# Let's define the create_dataset() function
# TODO 1
def create_dataset(X, Y, epochs, batch_size):
    # from_tensor_slices() slices the tensors along their first dimension,
    # so each dataset element is one (x, y) pair
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    # Repeat the data `epochs` times, then group it into fixed-size batches
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
    return dataset

Let's test our function by iterating twice over our dataset in batches of 3 datapoints:

In [4]:
BATCH_SIZE = 3
EPOCH = 2

dataset = create_dataset(X, Y, epochs=EPOCH, batch_size=BATCH_SIZE)

for i, (x, y) in enumerate(dataset):
    # A native TF tensor can be converted to a NumPy array with .numpy()
    # Let's output the values of `x` and `y`
    print("x:", x.numpy(), "y:", y.numpy())
    assert len(x) == BATCH_SIZE
    assert len(y) == BATCH_SIZE

x: [0. 1. 2.] y: [10. 12. 14.]
x: [3. 4. 5.] y: [16. 18. 20.]
x: [6. 7. 8.] y: [22. 24. 26.]
x: [9. 0. 1.] y: [28. 10. 12.]
x: [2. 3. 4.] y: [14. 16. 18.]
x: [5. 6. 7.] y: [20. 22. 24.]


In [None]:
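## Each batch contains 3 values (3 X's and 3 Y's) because batch_size = 3.
## We iterate over the data twice, i.e. 20 points in total. The last two points
## of the second pass (8. and 9.) are missing because they cannot fill a batch
## of 3, and drop_remainder=True discards the incomplete batch.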
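The batch `[9. 0. 1.]` in the output above mixes the end of the first pass with the start of the second because `repeat()` is applied before `batch()`. The sketch below contrasts the two orderings on a toy range dataset; it is only an illustration with made-up variable names, not part of the lab solution.

```python
import tensorflow as tf

points = tf.data.Dataset.range(10)

# repeat() first: the two passes are concatenated before batching, so one
# batch mixes the end of pass 1 with the start of pass 2.
repeat_then_batch = [
    b.numpy().tolist()
    for b in points.repeat(2).batch(3, drop_remainder=True)
]

# batch() first: each pass is batched on its own, so element 9 is dropped
# in every pass and no batch crosses a pass boundary.
batch_then_repeat = [
    b.numpy().tolist()
    for b in points.batch(3, drop_remainder=True).repeat(2)
]

print(repeat_then_batch[3])  # [9, 0, 1]
print(batch_then_repeat[3])  # [0, 1, 2]
```

Which ordering is preferable depends on whether clean epoch boundaries or using every data point matters more for the task at hand.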

### Loss function and gradients

The loss function and the function that computes the gradients are the same as before:

In [9]:
# loss_mse() returns the mean squared error between the predictions w0 * X + w1 and the targets Y.
def loss_mse(X, Y, w0, w1):
    Y_hat = w0 * X + w1
    errors = (Y_hat - Y)**2
    return tf.reduce_mean(errors)


# compute_gradients() records the loss computation on a tf.GradientTape and returns the gradients of the loss with respect to w0 and w1.
def compute_gradients(X, Y, w0, w1):
    with tf.GradientTape() as tape:
        loss = loss_mse(X, Y, w0, w1)
    return tape.gradient(loss, [w0, w1])

### Training loop

The main difference is that, in the training loop, we now iterate directly over the `tf.data.Dataset` generated by our `create_dataset` function.



In [10]:
# Here we will configure the dataset so that it iterates 250 times over our synthetic dataset in batches of 2.
# TODO 2
EPOCHS = 250
BATCH_SIZE = 2
LEARNING_RATE = .02

MSG = "STEP {step} - loss: {loss}, w0: {w0}, w1: {w1}\n"

w0 = tf.Variable(0.0)
w1 = tf.Variable(0.0)

dataset = create_dataset(X, Y, epochs=EPOCHS, batch_size=BATCH_SIZE)

for step, (X_batch, Y_batch) in enumerate(dataset):

    dw0, dw1 = compute_gradients(X_batch, Y_batch, w0, w1)
    w0.assign_sub(dw0 * LEARNING_RATE)
    w1.assign_sub(dw1 * LEARNING_RATE)

    if step % 100 == 0:
        loss = loss_mse(X_batch, Y_batch, w0, w1)
        print(MSG.format(step=step, loss=loss, w0=w0.numpy(), w1=w1.numpy()))
        
assert loss < 0.0001
assert abs(w0 - 2) < 0.001
assert abs(w1 - 10) < 0.001

STEP 0 - loss: 109.76800537109375, w0: 0.23999999463558197, w1: 0.4399999976158142

STEP 100 - loss: 9.363959312438965, w0: 2.55655837059021, w1: 6.674341678619385

STEP 200 - loss: 1.393267273902893, w0: 2.2146825790405273, w1: 8.717182159423828

STEP 300 - loss: 0.20730558037757874, w0: 2.082810878753662, w1: 9.505172729492188

STEP 400 - loss: 0.03084510937333107, w0: 2.03194260597229, w1: 9.809128761291504

STEP 500 - loss: 0.004589457996189594, w0: 2.012321710586548, w1: 9.926374435424805

STEP 600 - loss: 0.0006827632314525545, w0: 2.0047526359558105, w1: 9.971602439880371

STEP 700 - loss: 0.00010164897685172036, w0: 2.0018346309661865, w1: 9.989042282104492

STEP 800 - loss: 1.5142451957217418e-05, w0: 2.000706911087036, w1: 9.995771408081055

STEP 900 - loss: 2.256260358990403e-06, w0: 2.0002737045288086, w1: 9.998367309570312

STEP 1000 - loss: 3.3405058275093324e-07, w0: 2.000105381011963, w1: 9.999371528625488

STEP 1100 - loss: 4.977664502803236e-08, w0: 2.000040054321289,

2025-01-16 17:32:34.575222: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
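The `STEP 0` line in the output above can be reproduced by hand. For the mean squared error $L = \frac{1}{n}\sum_i (w_0 x_i + w_1 - y_i)^2$, the gradients are $\frac{\partial L}{\partial w_0} = \frac{2}{n}\sum_i (w_0 x_i + w_1 - y_i)\,x_i$ and $\frac{\partial L}{\partial w_1} = \frac{2}{n}\sum_i (w_0 x_i + w_1 - y_i)$. The plain-Python sketch below applies one update to the first batch (`x = [0, 1]`, `y = [10, 12]`); it is a sanity check written for illustration, not the code the loop actually runs.

```python
# First batch produced with batch_size=2, and the initial weights.
x_batch = [0.0, 1.0]
y_batch = [10.0, 12.0]
learning_rate = 0.02
w0, w1 = 0.0, 0.0
n = len(x_batch)

# Residuals of the current model y_hat = w0 * x + w1.
residuals = [w0 * x + w1 - y for x, y in zip(x_batch, y_batch)]

# Analytic gradients of the mean squared error.
dw0 = 2.0 / n * sum(r * x for r, x in zip(residuals, x_batch))
dw1 = 2.0 / n * sum(residuals)

# One gradient-descent update.
w0 -= learning_rate * dw0
w1 -= learning_rate * dw1

# Loss on the same batch after the update, as printed by the training loop.
loss = sum((w0 * x + w1 - y) ** 2 for x, y in zip(x_batch, y_batch)) / n

# Matches the STEP 0 line up to float32 rounding:
# w0 is about 0.24, w1 about 0.44, loss about 109.768
print(w0, w1, loss)
```

With 250 epochs over 10 points in batches of 2, the loop performs 250 * 10 / 2 = 1250 such updates.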


## Loading data from disk

### Locating the CSV files

We will start with the **taxifare dataset** CSV files that we wrote out in a previous lab. 

The taxifare dataset files have been saved into `../toy_data`.

Check that this is the case in the cell below and, if not, regenerate the taxifare dataset files.


In [13]:
# ls lists the contents of a directory.
# The -l option also shows details such as file permissions, owner, and size.
!ls -l ../toy_data/taxi*.csv

-rw-r--r-- 1 jupyter jupyter  61473 Jan 16 17:24 ../toy_data/taxi-test.csv
-rw-r--r-- 1 jupyter jupyter 288831 Jan 16 17:24 ../toy_data/taxi-train.csv
-rw-r--r-- 1 jupyter jupyter  61082 Jan 16 17:24 ../toy_data/taxi-valid.csv


### Use tf.data to read the CSV files

The `tf.data` API can easily read CSV files using the helper function `tf.data.experimental.make_csv_dataset`.

If you have TFRecords (which is recommended), you may use `tf.data.experimental.make_batched_features_dataset`.

The first step is to define 

- the feature names into a list `CSV_COLUMNS`
- their default values into a list `DEFAULTS`

In [30]:
# Defining the feature names into a list `CSV_COLUMNS`
CSV_COLUMNS = [
    'fare_amount',
    'pickup_datetime',
    'pickup_longitude',
    'pickup_latitude',
    'dropoff_longitude',
    'dropoff_latitude',
    'passenger_count',
    'key'
]
LABEL_COLUMN = 'fare_amount'
# Defining the default values into a list `DEFAULTS`
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
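Each entry in `DEFAULTS` plays two roles: it fixes the dtype of its column and supplies the value used when a field is empty. The sketch below illustrates this on a single line with `tf.io.decode_csv`, used here as a stand-in rather than `make_csv_dataset` itself. The record is made up for the example: it follows the taxifare column order but leaves `passenger_count` empty.

```python
import tensorflow as tf

DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]

# Made-up record in the taxifare column order; passenger_count is left empty.
line = "6.0,2013-03-27 03:35:00 UTC,-73.977672,40.784052,-73.965332,40.801025,,0"

# One tensor per column, typed according to the matching default.
fields = tf.io.decode_csv(line, record_defaults=DEFAULTS)

dtypes = [field.dtype for field in fields]
passenger_count = fields[6].numpy()  # the empty field falls back to 0.0
key = fields[7].numpy()              # parsed as a string because of 'na'

print(dtypes[0], dtypes[1])  # float32 for fare_amount, string for pickup_datetime
print(passenger_count, key)
```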

Let's now wrap the call to `make_csv_dataset` into its own function that takes only the file pattern (i.e. glob) matching the dataset files:

In [31]:
# TODO 3
def create_dataset(pattern):
    # make_csv_dataset() reads CSV files into a dataset of batched columns.
    # shuffle=False keeps the batches in the same order as the rows in the CSV.
    return tf.data.experimental.make_csv_dataset(
        pattern, 1, CSV_COLUMNS, DEFAULTS, shuffle=False)


# The glob matches every file whose name starts with "taxi-test"
# (e.g. taxi-test2, taxi-test3, etc.)
tempds = create_dataset('../toy_data/taxi-test*')
# Let's output the value of `tempds`
print(tempds)

<_PrefetchDataset element_spec=OrderedDict([('fare_amount', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('pickup_datetime', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('pickup_longitude', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('pickup_latitude', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('dropoff_longitude', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('dropoff_latitude', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('passenger_count', TensorSpec(shape=(1,), dtype=tf.float32, name=None)), ('key', TensorSpec(shape=(1,), dtype=tf.string, name=None))])>


Note that this is a prefetched dataset, where each element is an `OrderedDict` whose keys are the feature names and whose values are tensors of shape `(1,)` (i.e. vectors).
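A few defaults of `make_csv_dataset` are worth keeping in mind: unless told otherwise, it treats the first line of each file as a header (`header=True`), shuffles the records (`shuffle=True`), and repeats the data indefinitely (`num_epochs=None`). The sketch below shows the effect of `header` on a small file without a header line. The file name, columns, and helper `read_x` are hypothetical and exist only for this illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical two-column file with no header line, written to a temp dir.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.csv")
with open(path, "w") as f:
    f.write("1.0,a\n2.0,b\n3.0,c\n")


def read_x(header):
    """Read the `x` column once, in file order, one row per batch."""
    ds = tf.data.experimental.make_csv_dataset(
        path,
        batch_size=1,
        column_names=["x", "name"],
        column_defaults=[[0.0], ["na"]],
        header=header,   # whether the first line is a header to skip
        num_epochs=1,    # read the file once instead of repeating forever
        shuffle=False,   # keep the file order
    )
    return [float(batch["x"].numpy()[0]) for batch in ds]


with_header_flag = read_x(header=True)      # first line skipped as a header
without_header_flag = read_x(header=False)  # every line treated as data

print(with_header_flag)     # [2.0, 3.0]
print(without_header_flag)  # [1.0, 2.0, 3.0]
```

For files that have no header row, passing `header=False` avoids silently losing the first record of each file.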



##### Comment: before going further, let's compare this with the raw CSV file

In [32]:
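## This is what the raw CSV data looks like.
## Note: pd.read_csv() treats the first line as a header by default, which is
## why the first record shows up as the column names in the output below.
import pandas as pd
pd.read_csv('/home/jupyter/training-data-analyst/courses/machine_learning/deepdive2/introduction_to_tensorflow/toy_data/taxi-test.csv')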

Unnamed: 0,6.0,2013-03-27 03:35:00 UTC,-73.977672,40.784052,-73.965332,40.801025,2,0
0,19.3,2012-05-10 18:43:16 UTC,-73.954366,40.778924,-74.004094,40.723104,1,1
1,7.5,2014-05-20 23:09:00 UTC,-73.999165,40.738377,-74.003473,40.723862,2,2
2,12.5,2015-02-23 19:51:31 UTC,-73.965210,40.769482,-73.989494,40.739742,1,3
3,10.9,2011-03-19 03:32:00 UTC,-73.992590,40.742957,-73.989908,40.711053,1,4
4,7.0,2012-09-18 12:51:11 UTC,-73.971195,40.751566,-73.975922,40.756361,1,5
...,...,...,...,...,...,...,...,...
779,12.5,2009-05-27 20:37:00 UTC,-73.973645,40.764280,-74.005823,40.740177,1,780
780,7.3,2010-02-25 20:14:00 UTC,-73.980868,40.783878,-73.965785,40.804497,1,781
781,8.9,2012-07-02 20:25:45 UTC,-73.972365,40.761404,-73.947328,40.801115,1,782
782,6.5,2014-05-02 20:18:31 UTC,-73.976343,40.764976,-73.981483,40.760463,1,783


In [33]:
## create_dataset() converts the CSV columns to tensors and batches them.
## With a batch size of 1, each batch contains a single row of the data
## (thus N batches = N rows).

# Each element is an OrderedDict whose keys are the feature names and whose
# values are tensors of shape (1,) (i.e. vectors).
## NOTE: the tensors would have shape (2,) if the batch size were 2, etc.

# Iterate over the dataset and print the batched data
for batch_number, data in enumerate(tempds.take(2)):  # Adjust the number of batches you want to see
    print(f"Batch {batch_number + 1}:")
    print(data)
    print("\n")

Batch 1:
OrderedDict([('fare_amount', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([19.3], dtype=float32)>), ('pickup_datetime', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'2012-05-10 18:43:16 UTC'], dtype=object)>), ('pickup_longitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-73.95437], dtype=float32)>), ('pickup_latitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.778923], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-74.0041], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.723103], dtype=float32)>), ('passenger_count', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>), ('key', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'1'], dtype=object)>)])


Batch 2:
OrderedDict([('fare_amount', <tf.Tensor: shape=(1,), dtype=float32, numpy=array([7.5], dtype=float32)>), ('pickup_datetime', <tf.Tensor: shape=(1,), dtype=string, nump

In [34]:
## Since the above output is hard to read, we convert the tensors to NumPy arrays and pretty-print the dictionary
# Print batched data
for batch_number, data in enumerate(tempds.take(2)):  # Take first 2 batches
    print(f"Batch {batch_number + 1}:")
    pprint({k: v.numpy() for k, v in data.items()})
    print("\n")


Batch 1:
{'dropoff_latitude': array([40.723103], dtype=float32),
 'dropoff_longitude': array([-74.0041], dtype=float32),
 'fare_amount': array([19.3], dtype=float32),
 'key': array([b'1'], dtype=object),
 'passenger_count': array([1.], dtype=float32),
 'pickup_datetime': array([b'2012-05-10 18:43:16 UTC'], dtype=object),
 'pickup_latitude': array([40.778923], dtype=float32),
 'pickup_longitude': array([-73.95437], dtype=float32)}


Batch 2:
{'dropoff_latitude': array([40.72386], dtype=float32),
 'dropoff_longitude': array([-74.00347], dtype=float32),
 'fare_amount': array([7.5], dtype=float32),
 'key': array([b'2'], dtype=object),
 'passenger_count': array([2.], dtype=float32),
 'pickup_datetime': array([b'2014-05-20 23:09:00 UTC'], dtype=object),
 'pickup_latitude': array([40.738377], dtype=float32),
 'pickup_longitude': array([-73.99917], dtype=float32)}




Note: the rows in the batched data appear in the same order as in the CSV file, so the two can be compared directly. This is because we created the dataset with `shuffle=False`.

In [36]:
# tf.data.experimental.cardinality(tempds).numpy()

##### End of the comment

Let's iterate over the first two elements of this dataset using `dataset.take(2)`, and convert them to ordinary Python dictionaries with NumPy arrays as values for better readability:

In [19]:
for data in tempds.take(2):
    pprint({k: v.numpy() for k, v in data.items()})
    print("\n")

{'dropoff_latitude': array([40.723103], dtype=float32),
 'dropoff_longitude': array([-74.0041], dtype=float32),
 'fare_amount': array([19.3], dtype=float32),
 'key': array([b'1'], dtype=object),
 'passenger_count': array([1.], dtype=float32),
 'pickup_datetime': array([b'2012-05-10 18:43:16 UTC'], dtype=object),
 'pickup_latitude': array([40.778923], dtype=float32),
 'pickup_longitude': array([-73.95437], dtype=float32)}


{'dropoff_latitude': array([40.72386], dtype=float32),
 'dropoff_longitude': array([-74.00347], dtype=float32),
 'fare_amount': array([7.5], dtype=float32),
 'key': array([b'2'], dtype=object),
 'passenger_count': array([2.], dtype=float32),
 'pickup_datetime': array([b'2014-05-20 23:09:00 UTC'], dtype=object),
 'pickup_latitude': array([40.738377], dtype=float32),
 'pickup_longitude': array([-73.99917], dtype=float32)}




### Transforming the features

What we really need is a dictionary of features + a label. So, we have to do two things to the above dictionary:

1. Remove the unwanted column "key"
1. Keep the label separate from the features

Let's first implement a function that takes as input a row (represented as an `OrderedDict` in our `tf.data.Dataset` as above) and then returns a tuple with two elements:

* The first element being the same `OrderedDict` with the label dropped
* The second element being the label itself (`fare_amount`)

Note that we also need to remove the `key` and `pickup_datetime` columns, which we won't use.

In [39]:
UNWANTED_COLS = ['pickup_datetime', 'key']

# Let's define the features_and_labels() function
# TODO 4a
def features_and_labels(row_data):
    # dict.pop() removes the key from the dictionary and returns its value,
    # so the label is extracted and dropped from row_data in one step
    label = row_data.pop(LABEL_COLUMN)
    features = row_data

    # pop() takes a single key, so we loop over UNWANTED_COLS and remove
    # each unwanted column from the features dictionary in turn
    for unwanted_col in UNWANTED_COLS:
        features.pop(unwanted_col)

    return features, label

Let's iterate over 2 examples from our `tempds` dataset and apply our `features_and_labels`
function to each of the examples to make sure it's working:

In [40]:
for row_data in tempds.take(2):
    features, label = features_and_labels(row_data)
    pprint(features)
    print(label, "\n")
    assert UNWANTED_COLS[0] not in features.keys()
    assert UNWANTED_COLS[1] not in features.keys()
    assert label.shape == [1]

OrderedDict([('pickup_longitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-73.95437], dtype=float32)>),
             ('pickup_latitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.778923], dtype=float32)>),
             ('dropoff_longitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-74.0041], dtype=float32)>),
             ('dropoff_latitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.723103], dtype=float32)>),
             ('passenger_count',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>)])
tf.Tensor([19.3], shape=(1,), dtype=float32) 

OrderedDict([('pickup_longitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-73.99917], dtype=float32)>),
             ('pickup_latitude',
              <tf.Tensor: shape=(1,), dtype=float32, numpy=array([40.738377], dtype=float32)>),
             ('dropoff_longitude',
              <tf
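Because `pop()` removes keys in place, `features_and_labels` modifies the dictionary it receives. The plain-Python sketch below shows a non-mutating variant on an ordinary `OrderedDict`, which can be easier to reason about when the same row is reused. The name `split_row` and the shortened sample row are hypothetical, chosen only for this illustration.

```python
from collections import OrderedDict

LABEL_COLUMN = 'fare_amount'
UNWANTED_COLS = ['pickup_datetime', 'key']


def split_row(row_data):
    """Return (features, label) without modifying row_data."""
    features = OrderedDict(
        (name, value)
        for name, value in row_data.items()
        if name != LABEL_COLUMN and name not in UNWANTED_COLS
    )
    return features, row_data[LABEL_COLUMN]


# Shortened sample row with plain Python values instead of tensors.
row = OrderedDict([
    ('fare_amount', 19.3),
    ('pickup_datetime', '2012-05-10 18:43:16 UTC'),
    ('pickup_longitude', -73.954366),
    ('passenger_count', 1.0),
    ('key', '1'),
])

features, label = split_row(row)
print(list(features))  # ['pickup_longitude', 'passenger_count']
print(label)           # 19.3
print(len(row))        # 5, the input row is unchanged
```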

### Batching

Let's now refactor our `create_dataset` function so that it takes an additional argument `batch_size` and batches the data accordingly. We will also use the `features_and_labels` function we implemented to produce tuples of features and labels.

In [13]:
# Let's define the create_dataset() function
# TODO 4b
def create_dataset(pattern, batch_size):
    # make_csv_dataset() reads CSV files into a dataset of batches
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size, CSV_COLUMNS, DEFAULTS)
    # map() transforms the dataset by applying features_and_labels to each
    # element (here, each batch) of the dataset
    return dataset.map(features_and_labels)

Let's test that our batches are of the right size:

In [14]:
BATCH_SIZE = 2

tempds = create_dataset('../toy_data/taxi-train*', batch_size=BATCH_SIZE)

for X_batch, Y_batch in tempds.take(2):
    pprint({k: v.numpy() for k, v in X_batch.items()})
    print(Y_batch.numpy(), "\n")
    assert len(Y_batch) == BATCH_SIZE

{'dropoff_latitude': array([40.762814, 40.74584 ], dtype=float32),
 'dropoff_longitude': array([-73.9767  , -74.001915], dtype=float32),
 'passenger_count': array([1., 1.], dtype=float32),
 'pickup_latitude': array([40.766552, 40.75046 ], dtype=float32),
 'pickup_longitude': array([-73.969025, -73.979   ], dtype=float32)}
[ 6. 13.] 

{'dropoff_latitude': array([40.741245, 40.772743], dtype=float32),
 'dropoff_longitude': array([-73.98776 , -73.967834], dtype=float32),
 'passenger_count': array([1., 1.], dtype=float32),
 'pickup_latitude': array([40.74568, 40.82851], dtype=float32),
 'pickup_longitude': array([-74.00553 , -73.948814], dtype=float32)}
[ 5.3 25. ] 



### Shuffling

When training a deep learning model in batches over multiple workers, it is helpful if we shuffle the data. That way, different workers will be working on different parts of the input file at the same time, and so averaging gradients across workers will help. Also, during training, we will need to read the data indefinitely.

Let's refactor our `create_dataset` function so that it shuffles the data, when the dataset is used for training.

We will introduce an additional argument `mode` to our function to allow the function body to distinguish the case
when it needs to shuffle the data (`mode == "train"`) from when it shouldn't (`mode == "eval"`).

Also, before returning, we will prefetch one batch ahead of time (`dataset.prefetch(1)`) to speed up training:

In [41]:
# TODO 4c
def create_dataset(pattern, batch_size=1, mode='eval'):
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size, CSV_COLUMNS, DEFAULTS)

    # cache() stores the elements of the dataset in memory (or optionally on
    # disk) during the first full iteration. Subsequent iterations read from
    # the cache instead of re-reading and re-parsing the source files.
    # With a single epoch, caching does not help.
    dataset = dataset.map(features_and_labels).cache()

    if mode == 'train':
        # shuffle(1000) keeps a buffer of 1000 elements and draws from it at
        # random. A larger buffer shuffles more thoroughly but needs more memory.
        # repeat() makes the dataset iterate indefinitely.
        dataset = dataset.shuffle(1000).repeat()

    # prefetch(1) prepares one batch in the background while the current
    # batch is being processed (tf.data.AUTOTUNE would instead let
    # TensorFlow choose the buffer size)
    dataset = dataset.prefetch(1)

    return dataset
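To build intuition for why the buffer size matters, here is a simplified pure-Python model of a buffer-based shuffle. It is a sketch of the general idea (fill a buffer, emit a random element, refill from the input), not TensorFlow's actual implementation, and the name `buffered_shuffle` is made up for this example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and replace it with the new item.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


small = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(20), buffer_size=1))

print(small)       # a permutation of 0..19 in which elements move only locally
print(unshuffled)  # buffer_size=1 leaves the order unchanged
```

In this model an element at input position `i` can never be emitted before output position `i - buffer_size + 1`, which is why a buffer much smaller than the dataset only shuffles locally.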

Let's check that our function works well in both modes:

In [42]:
tempds = create_dataset('../toy_data/taxi-train*', 2, "train")
print(list(tempds.take(1)))
 

[(OrderedDict([('pickup_longitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-73.9986 , -73.98047], dtype=float32)>), ('pickup_latitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([40.760715, 40.78295 ], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-74.01606 , -73.977295], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([40.71748 , 40.789543], dtype=float32)>), ('passenger_count', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([1., 1.], dtype=float32)>)]), <tf.Tensor: shape=(2,), dtype=float32, numpy=array([10.5,  4. ], dtype=float32)>)]


2025-01-16 19:06:50.770952: W tensorflow/core/kernels/data/cache_dataset_ops.cc:914] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [43]:
tempds = create_dataset('../toy_data/taxi-valid*', 2, "eval")
print(list(tempds.take(1)))

[(OrderedDict([('pickup_longitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-73.97565 , -73.968735], dtype=float32)>), ('pickup_latitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([40.755913, 40.761494], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-73.97869, -73.98792], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([40.683994, 40.73823 ], dtype=float32)>), ('passenger_count', <tf.Tensor: shape=(2,), dtype=float32, numpy=array([1., 1.], dtype=float32)>)]), <tf.Tensor: shape=(2,), dtype=float32, numpy=array([31.47,  7.7 ], dtype=float32)>)]


2025-01-16 19:06:51.978939: W tensorflow/core/kernels/data/cache_dataset_ops.cc:914] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


Copyright 2021 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.