# End-to-End TensorFlow Pipeline
This notebook demonstrates a complete machine learning pipeline built with TensorFlow and Keras (pandas is used only to display results). It includes:
- CSV data loading
- Numeric and categorical preprocessing (including one-hot encoding)
- Train/test split
- Model definition and training
- Saving and loading the model for inference

## Configuration

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

In [None]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

### Basic preprocessing

TensorFlow provides a suite of preprocessing layers under `tf.keras.layers` that allow you to transform input data directly within your model. These layers are:

- Fully compatible with `tf.data` pipelines
- Exportable with the model for deployment
- Efficient and GPU/TPU-friendly

We'll explore key preprocessing layers for numeric, categorical, and text data.

#### Numeric Feature Normalization

Use `Normalization` to scale numeric inputs to zero mean and unit variance. Features on comparable scales generally make neural-network training more stable.

In [None]:
# Sample numeric data
data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Create and adapt normalization layer
normalizer = tf.keras.layers.Normalization()
# Use the `Normalization.adapt` method to adapt the normalization layer to your data
normalizer.adapt(data)

# Apply normalization
normalized = normalizer(data)
print("Normalized output:\n", normalized.numpy())
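
To see what `adapt` learned, the same result can be reproduced by hand. The sketch below assumes the layer standardizes each feature with its mean and population variance (NumPy's default `ddof=0`); the name `manual` is only illustrative.

```python
import numpy as np

# Same sample data as in the cell above
data = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mean = data.mean(axis=0)   # per-feature mean
var = data.var(axis=0)     # per-feature population variance (ddof=0)

# Standardize: subtract the mean, divide by the standard deviation
manual = (data - mean) / np.sqrt(var)
print("Manual normalization:\n", manual)
```

If the assumption holds, the printed values should match the layer's output up to floating-point precision.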

##### Discretization

Use `Discretization` to convert continuous numeric values into discrete bins. This is useful for bucketing features like age or income.

In [None]:
# Sample numeric data
values = tf.constant([[5.0], [15.0], [25.0], [35.0], [10.0], [20.0]])

# Define bin boundaries
discretizer = tf.keras.layers.Discretization(bin_boundaries=[10.0, 20.0, 30.0])

# Apply discretization
binned = discretizer(values)
print("Discretized output:\n", binned.numpy())
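
As a cross-check of the bucketing rule, `np.digitize` gives a comparable mapping. This is a sketch that assumes each boundary belongs to the bin on its right (so `10.0` falls in bin 1, not bin 0), which is how the layer is documented to behave.

```python
import numpy as np

values = np.array([5.0, 15.0, 25.0, 35.0, 10.0, 20.0])
boundaries = [10.0, 20.0, 30.0]

# Bin i covers boundaries[i-1] <= x < boundaries[i];
# bin 0 is everything below the first boundary.
bins = np.digitize(values, boundaries)
print(bins)
```

Three boundaries produce four bins, indexed 0 through 3.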

#### Categorical Encoding

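Use `StringLookup` (optionally followed by `CategoryEncoding`) to convert string categories into one-hot, multi-hot, or count vectors. With the default settings, strings not seen during `adapt` map to a reserved out-of-vocabulary slot at index 0.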

In [None]:
# Sample categorical data
train_categories = tf.constant(['red', 'green', 'blue', 'green', 'red'])

# String lookup
# Also try output_mode='int' to get integer indices instead of one-hot vectors
lookup = tf.keras.layers.StringLookup(output_mode='one_hot')
lookup.adapt(train_categories)

# Encoding values
test_categories = tf.constant(['green', 'blue', 'red', 'black', ''])
encoded_values = lookup(test_categories)

# Convert to NumPy
input_strings = test_categories.numpy().astype(str)
encoded_array = encoded_values.numpy()

# Build DataFrame
df = pd.DataFrame(encoded_array)
df.insert(0, 'Input Category', input_strings)

# Display result
df.head()
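
The lookup logic can be sketched in plain Python to make the out-of-vocabulary handling explicit. This is an illustration, not the layer's implementation: it assumes one reserved `[UNK]` slot at index 0 and no mask token, and the order of equally frequent tokens may differ from what `StringLookup` produces.

```python
from collections import Counter
import numpy as np

train = ['red', 'green', 'blue', 'green', 'red']

# Index 0 is reserved for unknown tokens; the rest are ordered by frequency
vocab = ['[UNK]'] + [tok for tok, _ in Counter(train).most_common()]
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    """Return a one-hot vector; unseen tokens light up index 0."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[index.get(token, 0)] = 1.0
    return vec

for tok in ['green', 'blue', 'red', 'black', '']:
    print(repr(tok), one_hot(tok))
```

Both `'black'` and the empty string were never seen during fitting, so they share the `[UNK]` column.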

In [None]:
# Sample categorical data
train_categories = tf.constant([['red', 'green'], ['blue', 'green'], ['red', 'yellow']])

# String lookup
# Also try output_mode='multi_hot' to record presence instead of counts
lookup = tf.keras.layers.StringLookup(output_mode='count')
lookup.adapt(train_categories)

# Encoding values
test_categories = tf.constant([['red', ''], ['blue', 'green'], ['black', 'yellow']])
encoded_values = lookup(test_categories)

# Convert to NumPy
input_strings = test_categories.numpy().astype(str)
encoded_array = encoded_values.numpy()

# Build DataFrame
df = pd.DataFrame(encoded_array)
for i in range(input_strings.shape[1]):
    df.insert(loc=i, column=f'Input_{i+1}', value=input_strings[:, i])

# Display result
df.head()

##### Hashing

Use `Hashing` to convert strings into integer indices using a hash function. Useful for high-cardinality categorical features.

In [None]:
# Sample string data
words = tf.constant(['green', 'red', 'blue', 'black', 'yellow', 'white', 'magenta'])

# Hashing layer
hasher = tf.keras.layers.Hashing(num_bins=4)
hashed = hasher(words)
print("Hashed output:\n", hashed.numpy())
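
The idea behind hashing can be sketched without TensorFlow. Note the swap: this example uses SHA-256 from the standard library, whereas the `Hashing` layer uses its own hash function, so the bin numbers will not match the cell above. `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(token, num_bins):
    """Map a string to a stable bin index in [0, num_bins)."""
    digest = hashlib.sha256(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'little') % num_bins

words = ['green', 'red', 'blue', 'black', 'yellow', 'white', 'magenta']
buckets = {w: hash_bucket(w, 4) for w in words}
print(buckets)
```

With seven words and only four bins, at least two words must share a bin. Such collisions are the price of not storing a vocabulary.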

#### Text Vectorization

Use `TextVectorization` to tokenize and vectorize raw text into integer sequences or n-grams.

In [None]:
# Sample text
texts = tf.constant(["TensorFlow is great", "Preprocessing is powerful"])

# Text vectorization layer
vectorizer = tf.keras.layers.TextVectorization(output_mode='int', max_tokens=10)
vectorizer.adapt(texts)

# Vectorized output
test_texts = tf.constant(["TensorFlow is fun, great and powerful", ""])
vectorized = vectorizer(test_texts)
print("Vectorized text:\n", vectorized.numpy())
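
A plain-Python sketch of the same steps may help when reading the output. It assumes the layer's default behavior: lowercase and strip punctuation, split on whitespace, reserve index 0 for padding and index 1 for out-of-vocabulary tokens, and pad each batch to its longest sequence. The exact ids of equally frequent tokens may differ from the layer's.

```python
import string
from collections import Counter

def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

train_texts = ["TensorFlow is great", "Preprocessing is powerful"]
counts = Counter(tok for t in train_texts for tok in tokenize(t))

# max_tokens=10 includes the two reserved entries
vocab = ['', '[UNK]'] + [tok for tok, _ in counts.most_common(10 - 2)]
index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(texts):
    rows = [[index.get(tok, 1) for tok in tokenize(t)] for t in texts]
    width = max(len(r) for r in rows)
    # Right-pad shorter rows with 0
    return [r + [0] * (width - len(r)) for r in rows]

result = vectorize(["TensorFlow is fun, great and powerful", ""])
print(result)
```

The unseen words "fun" and "and" map to 1, and the empty string becomes a row of padding zeros.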

### Building and Training Models in TensorFlow

This section demonstrates how to build, compile, train, evaluate, and save models using TensorFlow's high-level Keras API. We explore:

- Sequential and Functional APIs
- Model compilation and training
- Saving and loading models

#### Sequential API

The Sequential API is ideal for simple stack-like models where each layer has one input and one output.


In [None]:
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),  # declare the input shape explicitly
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])


#### Functional API

Use the Functional API for models with multiple inputs/outputs or non-linear topology.


In [None]:
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(32, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
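
The model above is still a linear stack, so here is a sketch of a case the Sequential API cannot express: two inputs merged into one output. The input names, sizes, and the variable `two_input_model` are hypothetical and chosen only for illustration.

```python
import tensorflow as tf

# Two hypothetical inputs with different widths
input_a = tf.keras.Input(shape=(10,), name='input_a')
input_b = tf.keras.Input(shape=(4,), name='input_b')

# Each input gets its own branch
branch_a = tf.keras.layers.Dense(16, activation='relu')(input_a)
branch_b = tf.keras.layers.Dense(8, activation='relu')(input_b)

# Merge the branches and produce a single probability
merged = tf.keras.layers.Concatenate()([branch_a, branch_b])
output = tf.keras.layers.Dense(1, activation='sigmoid')(merged)

two_input_model = tf.keras.Model(inputs=[input_a, input_b], outputs=output)
two_input_model.summary()
```

It uses a separate variable name so the single-input `model` used in the following cells is left untouched.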


#### Compile the Model

Specify the optimizer, loss function, and metrics for training.


In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
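
For intuition about what the loss and metric measure, they can be computed by hand on a few made-up predictions. This is a NumPy sketch; the clipping constant `eps` is an assumption meant to mirror the usual guard against `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_prob = np.clip(np.asarray(y_prob, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.1, 0.8, 0.3]   # made-up model outputs

bce = binary_crossentropy(y_true, y_prob)
# Accuracy thresholds the probabilities at 0.5
accuracy = float(np.mean((np.asarray(y_prob) > 0.5) == np.asarray(y_true)))
print(f"loss={bce:.4f}, accuracy={accuracy:.2f}")
```

The last example (true label 1, predicted 0.3) dominates the loss, which shows how cross-entropy penalizes confident mistakes more than accuracy does.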

#### Train the Model

Use `model.fit()` to train the model on your dataset.


In [None]:
import numpy as np

# Dummy data
X_train = np.random.rand(1000, 10)
y_train = np.random.randint(0, 2, size=(1000,))

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

#### Evaluate and Predict

Use `model.evaluate()` and `model.predict()` for testing and inference.


In [None]:
X_test = np.random.rand(200, 10)
y_test = np.random.randint(0, 2, size=(200,))

loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc} -- Test loss: {loss}")

predictions = model.predict(X_test[:5])
print("Predictions:", predictions)

#### Callbacks and Early Stopping

Use callbacks like `EarlyStopping` and `ModelCheckpoint` to control training.


In [None]:
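callbacks = [
    # Stop when val_loss has not improved for 3 epochs and restore the best weights
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    # Keep only the best model seen so far, in the native Keras format
    tf.keras.callbacks.ModelCheckpoint('best_model.keras', save_best_only=True)
]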

model.fit(X_train, y_train, epochs=20, validation_split=0.2, callbacks=callbacks)
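
The patience rule is easy to misread, so here is a plain-Python sketch of it. It assumes a monitored quantity where lower is better and a minimum improvement of zero; `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.70, 0.65, 0.66, 0.64, 0.645, 0.65, 0.66, 0.60]
stopped_at, best_at = early_stopping_epoch(val_losses)
print(f"stopped at epoch index {stopped_at}, best epoch index {best_at}")
```

Training stops after three epochs without improvement, so the lower loss at the final position is never reached. Raising `patience` trades longer training for a lower chance of stopping too early.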

#### Save and Load Models

Use `model.save()` and `tf.keras.models.load_model()` to persist models.


In [None]:
model.save('my_model.keras')
loaded_model = tf.keras.models.load_model('my_model.keras')

#### Custom Training Loop

For full control, use `GradientTape` to write your own training loop.


In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.BinaryCrossentropy()

for epoch in range(5):
    # Iterate over the training data in mini-batches of 32
    for i in range(0, len(X_train), 32):
        x_batch = X_train[i:i+32]
        y_batch = y_train[i:i+32]

        with tf.GradientTape() as tape:
            # The model ends in a sigmoid, so it outputs probabilities, not logits
            probs = model(x_batch, training=True)
            loss = loss_fn(y_batch, probs)

        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

    # Loss of the last mini-batch in this epoch
    print(f"Epoch {epoch+1}: Loss = {loss.numpy():.4f}")
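
To show what the tape and the optimizer do on each step, the same loop can be written in NumPy for a single-layer logistic model. Two deliberate substitutions: the gradient is derived analytically instead of by automatic differentiation, and the update is plain gradient descent rather than Adam. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(np.float64)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, b):
    p = np.clip(sigmoid(X @ w + b), 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

w, b, lr = np.zeros(3), 0.0, 0.5
initial_loss = bce(w, b)

for epoch in range(20):
    for i in range(0, len(X), 32):
        xb, yb = X[i:i+32], y[i:i+32]
        p = sigmoid(xb @ w + b)                 # forward pass
        grad_w = xb.T @ (p - yb) / len(xb)      # gradient of mean BCE w.r.t. w
        grad_b = float(np.mean(p - yb))         # gradient of mean BCE w.r.t. b
        w -= lr * grad_w                        # stand-in for apply_gradients
        b -= lr * grad_b

final_loss = bce(w, b)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```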


#### Integrating Preprocessing into a Model

Preprocessing layers can be part of the model itself, making it portable and deployment-ready.

In [None]:
# Define preprocessing layers
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(np.array([[1.0], [2.0], [3.0]]))

# Build model
model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile and summarize
model.compile(optimizer='adam', loss='mse')
model.summary()

### My first TF pipeline

#### Step 1: Load CSV Data

We use `tf.data.experimental.make_csv_dataset` to load structured data from a CSV file.
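
The cells below expect a file named `data.csv` with the columns listed in `CSV_COLUMNS`. If no such file is at hand, a synthetic one can be generated first. Everything in this sketch is made up for illustration: the value ranges, the three category names, and the rule that produces the label.

```python
import csv
import random

random.seed(0)
CSV_COLUMNS = ['feature1', 'feature2', 'feature3', 'category', 'label']

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(CSV_COLUMNS)  # header row
    for _ in range(500):
        f1, f2, f3 = (round(random.uniform(0.0, 5.0), 3) for _ in range(3))
        category = random.choice(['red', 'green', 'blue'])
        label = int(f1 + f2 > 5.0)  # arbitrary rule so there is something to learn
        writer.writerow([f1, f2, f3, category, label])
```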


In [None]:
CSV_COLUMNS = ['feature1', 'feature2', 'feature3', 'category', 'label']
DEFAULTS = [0.0, 0.0, 0.0, '', 0]

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern='data.csv',
    batch_size=32,
    column_names=CSV_COLUMNS,
    column_defaults=DEFAULTS,
    label_name='label',
    num_epochs=1,
    shuffle=True
)

In [None]:
print('Elements of dataset:')
# Try to display elements of the dataset

#### Step 2: Define Preprocessing Layers

We normalize numeric features and one-hot encode the categorical feature using Keras preprocessing layers.


In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

NUMERIC_FEATURES = ['feature1', 'feature2', 'feature3']
CATEGORICAL_FEATURE = ['category']

# Extract raw data to adapt layers
def extract_features(ds):
    numeric_data = []
    categorical_data = []
    for batch in ds: 
        features, _ = batch
        numeric_data.append(tf.stack([features[feature] for feature in NUMERIC_FEATURES], axis=-1))
        categorical_data.append(tf.stack([features[feature] for feature in CATEGORICAL_FEATURE], axis=-1))
    return tf.concat(numeric_data, axis=0), tf.concat(categorical_data, axis=0)

# Adaptation Functions

def get_normalization_layer(numeric_tensor):
    # Create Normalization layer for numeric features
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(numeric_tensor)
    return normalizer

def get_category_encoding_layer(category_tensor, output_mode='one_hot'):
    # Create StringLookup + One-hot encoding for categorical feature
    lookup = tf.keras.layers.StringLookup(output_mode='int')
    lookup.adapt(category_tensor)
    category_encoding_layer = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(), output_mode=output_mode)
    
    return lookup, category_encoding_layer

# Adapt the Layers
numeric_tensor, category_tensor = extract_features(dataset)

normalizer = get_normalization_layer(numeric_tensor)

lookup, encoder = get_category_encoding_layer(category_tensor)

print("Preprocessing layers adapted successfully.")

#### Step 3: Build Preprocessing Submodel

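We define one Keras input per raw feature and pass the inputs through the adapted preprocessing layers, which turn raw dictionary inputs into numeric tensors. The same cell then stacks dense layers on top to form the end-to-end model.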


In [None]:
# Define Raw Feature Inputs
# Inputs must match the structure (name and dtype) of the data yielded by your CSV dataset.
numeric_inputs_list = []
for name in NUMERIC_FEATURES:
    # Numeric inputs are batched, so shape is (None, 1) or just (1,) if unbatched
    numeric_inputs_list.append(tf.keras.Input(shape=(1,), name=name, dtype=tf.float32))

cat_inputs_list = []
for name in CATEGORICAL_FEATURE:
    # Categorical inputs are batched strings, so the per-example shape is also (1,)
    cat_inputs_list.append(tf.keras.Input(shape=(1,), name=name, dtype=tf.string))

# Preprocessing
numeric_inputs = tf.keras.layers.Concatenate()(numeric_inputs_list)
cat_inputs = tf.keras.layers.Concatenate()(cat_inputs_list)
normalized = normalizer(numeric_inputs)
encoded_category = encoder(lookup(cat_inputs))

all_features = tf.keras.layers.Concatenate()([normalized, encoded_category])

# Add Dense Layers (Model Architecture)
x = tf.keras.layers.Dense(64, activation='relu')(all_features)
x = tf.keras.layers.Dense(32, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x) # Binary classification output

# Create the final end-to-end model
end_to_end_model = tf.keras.Model(inputs=numeric_inputs_list+cat_inputs_list, outputs=outputs)

#### Step 4: Build and Compile Full Model

Because the preprocessing layers are already wired into `end_to_end_model`, they become part of the saved model. Here we compile it with an optimizer, a loss, and metrics.


In [None]:
# Compile the Model
end_to_end_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("\nEnd-to-End Model Summary:")
end_to_end_model.summary()

#### Step 5: Prepare Dataset for Training

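The CSV dataset already yields feature dictionaries, but each feature has shape `(batch,)`. We add a trailing dimension so every feature matches the `(1,)` per-example shape declared by the model's inputs.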


In [None]:
def format_batch(features, label):
    return (
        {
            'feature1': tf.expand_dims(features['feature1'], -1),
            'feature2': tf.expand_dims(features['feature2'], -1),
            'feature3': tf.expand_dims(features['feature3'], -1),
            'category': tf.expand_dims(features['category'], -1),
        },
        label
    )

train_dataset = dataset.map(format_batch)
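
The effect of `tf.expand_dims(..., -1)` on shapes can be checked on a small array. The sketch uses NumPy's equivalent `np.expand_dims` with made-up values.

```python
import numpy as np

batch = np.array([1.2, 2.3, 3.1], dtype=np.float32)  # shape (3,): one value per example
expanded = np.expand_dims(batch, -1)                 # shape (3, 1): one column per example
print(batch.shape, '->', expanded.shape)
```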

#### Step 6: Train the Model

We train the model using the standard `fit()` method.


In [None]:
# Train the model
EPOCHS = 5

print(f"\nStarting training for {EPOCHS} epochs...")

# Train on 'train_dataset', the reshaped version of the raw CSV dataset
history = end_to_end_model.fit(
    train_dataset,
    epochs=EPOCHS
)

print("\nTraining complete.")

#### Step 7: Save and Reload the Model

We save the entire model including preprocessing layers and reload it for inference.


In [None]:
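import os

# Make sure the target directory exists before saving
os.makedirs('model', exist_ok=True)

# Save the model
end_to_end_model.save('model/full_pipeline.keras')  # Saves preprocessing layers too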

# Reload
loaded_model = tf.keras.models.load_model('model/full_pipeline.keras')

#### Step 8: Inference with Raw Dictionary Input

We can now pass raw feature dictionaries directly to the reloaded model.


In [None]:
sample = {
    'feature1': tf.constant([[1.2]]),
    'feature2': tf.constant([[2.3]]),
    'feature3': tf.constant([[3.1]]),
    'category': tf.constant([['green']])
}

prediction = loaded_model(sample)
print("Prediction:", prediction.numpy())


## My first TF Lab: Building a TensorFlow Prediction Pipeline

- Build a self-contained Titanic survival prediction pipeline utilizing the provided dataset and integrating all necessary data preparation and modeling logic.

- Bonus: Export and serve the trained model using a simple FastAPI (or Flask) API