## Pipeline
- An input pipeline is a series of processing steps that prepare and load data into a machine learning model for training or inference.


**Here are the key components and steps involved in building a TensorFlow input pipeline:**

1. **Data source** 
- Define the source of your data, which can be in the form of files, databases or any other data storage. 

2. **Data Loading**
- Load data from the source into memory. 
- This step typically involves reading and decoding data from files and converting it into TensorFlow tensors. 

3. **Data Preprocessing**
- Preprocess the data to prepare it for model input. Common preprocessing steps include resizing images, normalizing values, tokenizing text, and applying data augmentation techniques. 

4. **Data Augmentation**
- If working with image data, data augmentation can be applied to increase the diversity of the training data. This helps the model generalize better. Common augmentations include rotations, flips, and random crops.

5. **Shuffling (Optional):**
- Randomly shuffle the individual examples before they are grouped into batches, so that the model does not learn the order of the data. Shuffling is especially important during training, where each batch should be a representative mix of the data so that the model generalizes well.

6. **Batching:**  
- Divide the data into batches, where each batch contains a fixed number of examples (batch size). Batching improves training efficiency by allowing the model to process multiple examples in parallel.

7. **Prefetching:**
- Use the tf.data.Dataset.prefetch method to overlap data loading and model execution. Prefetching loads and preprocesses data for the next batch while the current batch is being processed by the model, reducing idle time and improving overall throughput.
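The steps above can be sketched end to end on a small in-memory dataset. This is a minimal illustration rather than a recommended recipe: the feature values, the labels, the normalization constant, and the buffer and batch sizes are made-up assumptions, augmentation is left out, and `tf.data.AUTOTUNE` simply lets TensorFlow choose the parallelism and prefetch buffer size.

```python
import tensorflow as tf

# 1-2. Data source and loading: toy in-memory features and labels (assumed values)
features = tf.constant([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0]])
labels = tf.constant([0, 1, 0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# 3. Preprocessing: scale features into [0, 1] (60.0 is the assumed maximum)
def normalize(x, y):
    return x / 60.0, y

pipeline = (
    dataset
    .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=6, seed=0)   # 5. shuffle individual examples before batching
    .batch(2)                         # 6. group examples into batches of 2
    .prefetch(tf.data.AUTOTUNE)       # 7. prepare the next batch while the current one is in use
)

for batch_x, batch_y in pipeline:
    print(batch_x.numpy().ravel(), batch_y.numpy())
```

Placing `shuffle` before `batch` mixes individual examples; placing it after `batch` would only reorder whole batches.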



In [1]:
import tensorflow as tf 

2023-08-23 08:49:01.353652: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-23 08:49:01.766686: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-23 08:49:01.772251: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Create a tf dataset from a list

In [2]:
daily_sales_numbers = [21, 22, -108, 31, -1, 32, 34, 31]

tf_dataset = tf.data.Dataset.from_tensor_slices(daily_sales_numbers)
tf_dataset


<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [3]:
# iterate through the dataset 
for sale in tf_dataset:
    print(sale.numpy())

21
22
-108
31
-1
32
34
31


In [4]:
# iterate through elements as numpy elements
for sales in tf_dataset.as_numpy_iterator():
    print(sales)

21
22
-108
31
-1
32
34
31


In [5]:
# keep only positive sales numbers (drops values <= 0)
tf_dataset = tf_dataset.filter(lambda x: x > 0)

for sales in tf_dataset:
    print(sales.numpy())

21
22
31
32
34
31


**Convert sales numbers from US dollars (USD) to Indian rupees (INR), assuming a conversion rate of 1 USD = 72 INR**

In [6]:
tf_dataset = tf_dataset.map(lambda x: x * 72)
for sales in tf_dataset:
    print(sales.numpy())

1512
1584
2232
2304
2448
2232


In [9]:
# shuffle using a buffer of 2 elements
tf_dataset = tf_dataset.shuffle(2)

for sales in tf_dataset:
    print(sales.numpy())

1584
1512
2232
2304
2232
2448
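
The output above stays close to the original order because `shuffle(2)` only mixes elements inside a sliding buffer of two. The pure-Python sketch below imitates that buffer mechanism as the `tf.data` documentation describes it (fill a buffer, emit a random element from it, replace it with the next input element). `buffered_shuffle` is a hypothetical helper written for illustration, not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # still filling the buffer, nothing is emitted yet
            buffer.append(item)
            continue
        # buffer is full: emit a random element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # input exhausted: drain the remaining elements in random order
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

sales = [1512, 1584, 2232, 2304, 2448, 2232]
print(list(buffered_shuffle(sales, buffer_size=2, seed=0)))           # local mixing only
print(list(buffered_shuffle(sales, buffer_size=len(sales), seed=0)))  # full shuffle
```

With a buffer of size `k`, an element can move at most `k - 1` positions toward the front, so a buffer at least as large as the dataset is needed for a uniform shuffle.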


In [10]:
# batching 
for sales in tf_dataset.batch(2):
    print(sales.numpy())

[2304 1584]
[2448 2232]
[2232 1512]
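
Six elements split evenly into batches of two. When the batch size does not divide the dataset size, the last batch is smaller unless `drop_remainder=True` is passed to `batch`. A small sketch using the converted sales values from above, typed in by hand here so that the block runs on its own:

```python
import tensorflow as tf

sales_inr = tf.data.Dataset.from_tensor_slices([1512, 1584, 2232, 2304, 2448, 2232])

# default: the final batch holds the leftover elements
batches_keep = [b.tolist() for b in sales_inr.batch(4).as_numpy_iterator()]
print(batches_keep)

# drop_remainder=True: every batch has exactly 4 elements, leftovers are discarded
batches_drop = [
    b.tolist()
    for b in sales_inr.batch(4, drop_remainder=True).as_numpy_iterator()
]
print(batches_drop)
```

Dropping the remainder gives every batch the same static shape, at the cost of discarding a few examples.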


In [11]:
# perform all of the above operations in one shot 
tf_dataset = tf.data.Dataset.from_tensor_slices(daily_sales_numbers)

tf_dataset = tf_dataset.filter(lambda x: x > 0).map(lambda y: y * 72).shuffle(2).batch(2)

for sales in tf_dataset:
    print(sales.numpy())

[1584 1512]
[2304 2448]
[2232 2232]


## Images

In [12]:
images_ds = tf.data.Dataset.list_files('./images/*/*', shuffle=False)


In [13]:
image_count = len(images_ds)
image_count

130

In [14]:
type(images_ds)

tensorflow.python.data.ops.from_tensor_slices_op._TensorSliceDataset

In [15]:
for file in images_ds.take(3):
    print(file.numpy())

b'./images/cat/20 Reasons Why Cats Make the Best Pets....jpg'
b'./images/cat/7 Foods Your Cat Can_t Eat.jpg'
b'./images/cat/A cat appears to have caught the....jpg'


In [16]:
class_names = ['cat', 'dog']

In [17]:
train_size = int(image_count*0.8)
train_ds = images_ds.take(train_size)
test_ds = images_ds.skip(train_size)

In [18]:

len(train_ds)

104

In [19]:
len(test_ds)

26
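
Because `list_files` was called with `shuffle=False`, the paths come back in a deterministic order, and the first ones shown above are all from the `cat` folder. A plain `take`/`skip` split on that order may therefore leave the test set with images from a single class. One common remedy is to shuffle once before splitting. The sketch below uses made-up file names and assumes `reshuffle_each_iteration=False` with a fixed seed, so that the order stays the same on every pass and the two splits cannot overlap:

```python
import tensorflow as tf

# hypothetical file names standing in for the real image paths
paths = [f"./images/cat/{i}.jpg" for i in range(5)] + \
        [f"./images/dog/{i}.jpg" for i in range(5)]
files_ds = tf.data.Dataset.from_tensor_slices(paths)

# shuffle once with a buffer covering the whole dataset; keep the order fixed
# across iterations so that take() and skip() see the same sequence
files_ds = files_ds.shuffle(len(paths), seed=42, reshuffle_each_iteration=False)

train_size = int(len(paths) * 0.8)
train_files = files_ds.take(train_size)
test_files = files_ds.skip(train_size)

train_set = set(train_files.as_numpy_iterator())
test_set = set(test_files.as_numpy_iterator())
print(len(train_set), len(test_set), train_set.isdisjoint(test_set))
```

If the dataset reshuffled on every iteration, `take` and `skip` would each see a different order and the same file could land in both splits.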

In [20]:
import os

def get_label(file_path):
    # split the path on the OS separator; the parent folder is the class name
    parts = tf.strings.split(file_path, os.path.sep)
    return parts[-2]
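
Models usually need numeric labels rather than folder names. A minimal sketch, assuming the `class_names` order `['cat', 'dog']` used in this notebook and a hypothetical helper name `get_label_index`, that turns the parent folder into an integer class index:

```python
import os
import tensorflow as tf

class_names = ['cat', 'dog']  # index 0 = cat, index 1 = dog

def get_label_index(file_path):
    # split the path on the OS separator; the parent folder is the class name
    parts = tf.strings.split(file_path, os.path.sep)
    # boolean vector such as [False, True] marking the matching class
    one_hot = parts[-2] == class_names
    # position of the matching entry
    return tf.argmax(tf.cast(one_hot, tf.int32))

example_path = os.path.join('.', 'images', 'dog', 'example.jpg')  # made-up file name
print(get_label_index(example_path).numpy())
```

Note that in this sketch a folder name missing from `class_names` would silently map to index 0, so a real pipeline may want an explicit check.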

In [22]:
# use single separators: doubled slashes produce empty path parts, so parts[-2] would be b''
get_label('./images/dog/20 Reasons Why Cats Make the Best Pets....jpg')

<tf.Tensor: shape=(), dtype=string, numpy=b'dog'>

In [23]:
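def process_image(file_path):
    label = get_label(file_path)            # class name from the parent folder
    img = tf.io.read_file(file_path)        # raw JPEG bytes
    img = tf.image.decode_jpeg(img)         # uint8 tensor: (height, width, channels)
    img = tf.image.resize(img, [128, 128])  # float32 tensor: (128, 128, channels)

    return img, label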

In [25]:
img, label = process_image('./images/cat/7 Foods Your Cat Can_t Eat.jpg')
img.numpy()[:2]

array([[[1.75414062e+02, 1.76414062e+02, 1.71414062e+02],
        [1.79000000e+02, 1.80000000e+02, 1.75000000e+02],
        [1.79488464e+02, 1.80488464e+02, 1.75488464e+02],
        [1.82414062e+02, 1.83414062e+02, 1.78414062e+02],
        [1.82000000e+02, 1.81000000e+02, 1.77000000e+02],
        [1.81114441e+02, 1.80114441e+02, 1.76114441e+02],
        [1.83585938e+02, 1.82585938e+02, 1.78585938e+02],
        [1.82585938e+02, 1.81585938e+02, 1.77585938e+02],
        [1.81000000e+02, 1.80000000e+02, 1.76000000e+02],
        [1.84068665e+02, 1.83068665e+02, 1.79068665e+02],
        [1.82712524e+02, 1.81712524e+02, 1.77712524e+02],
        [1.87546814e+02, 1.84546814e+02, 1.79546814e+02],
        [1.87161072e+02, 1.84161072e+02, 1.79161072e+02],
        [1.87000000e+02, 1.84000000e+02, 1.79000000e+02],
        [1.86313782e+02, 1.83313782e+02, 1.78313782e+02],
        [1.78621521e+02, 1.77621521e+02, 1.72621521e+02],
        [1.75828125e+02, 1.74828125e+02, 1.69828125e+02],
        [1.696

In [26]:
def scale(image, label):
    return image/255, label

In [28]:
# train_ds still holds file paths, so map them to (image, label) pairs before scaling;
# calling map(scale) directly on the paths fails because scale() expects two arguments
train_ds = train_ds.map(process_image).map(scale)
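
`Dataset.map` passes one argument per component of each dataset element, so `scale(image, label)` fits only a dataset whose elements are `(image, label)` pairs, while a dataset of file paths supplies a single argument. The sketch below maps paths to pairs first and then scales. It writes two tiny synthetic JPEGs to a temporary folder (made-up file names) so that it runs without the real `images` directory, and it passes `channels=3` to `decode_jpeg` on the assumption that the images are RGB.

```python
import os
import tempfile
import tensorflow as tf

# create two small synthetic JPEG files: <tmp>/cat/a.jpg and <tmp>/dog/b.jpg
root = tempfile.mkdtemp()
for folder, name, value in [('cat', 'a.jpg', 50), ('dog', 'b.jpg', 200)]:
    os.makedirs(os.path.join(root, folder), exist_ok=True)
    pixels = tf.fill([32, 32, 3], tf.constant(value, dtype=tf.uint8))
    tf.io.write_file(os.path.join(root, folder, name), tf.io.encode_jpeg(pixels))

def get_label(file_path):
    parts = tf.strings.split(file_path, os.path.sep)
    return parts[-2]

def process_image(file_path):
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)  # assumption: RGB images
    img = tf.image.resize(img, [128, 128])
    return img, get_label(file_path)

def scale(image, label):
    return image / 255, label

paths_ds = tf.data.Dataset.list_files(os.path.join(root, '*', '*'), shuffle=False)

# paths -> (image, label) pairs -> pixel values scaled to [0, 1]
scaled_ds = paths_ds.map(process_image).map(scale)

for image, label in scaled_ds:
    print(label.numpy(), image.shape, float(tf.reduce_max(image)))
```

The same two `map` calls would apply to `test_ds`, so that training and test images go through identical preprocessing.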
