# Keras Cheat Sheet

<!--- Start of badges -->
<!-- Badges: python,keras,machinelearning,deeplearning -->

<p align="left">
<img alt="Deeplearning" src="https://img.shields.io/badge/-Deep_Learning-333333.svg?logo=&style=flat-square" />
 <img alt="Keras" src="https://img.shields.io/badge/-Keras-D00000?logo=keras&logoColor=white&style=flat-square" />
 <img alt="Machinelearning" src="https://img.shields.io/badge/-Machine_Learning-333333.svg?logo=&style=flat-square" />
 <img alt="Python" src="https://img.shields.io/badge/-Python-3776AB?logo=python&logoColor=white&style=flat-square" />
</p>
<!--- End of badges -->

<!--- Blurb
This notebook is a hands-on toolkit for Keras, covering the complete workflow for building neural networks, from data preparation to model evaluation. It provides clear examples for using both the Sequential and Functional APIs to build a range of architectures, from basic neural networks to advanced CNNs for image classification and RNNs for time series forecasting.
-->

<!--- Start of Thumbnail-->
<!--- src="Images/keras_thumbnail.png" --->
<!--- End of Thumbnail-->

This notebook provides a comprehensive Keras cheat sheet, based on material from the ['Deep Learning with Keras and TensorFlow'](https://www.coursera.org/learn/building-deep-learning-models-with-tensorflow/home/welcome) course by IBM on Coursera. It covers essential concepts and code examples for building and training neural networks, including layers, models, pre-processing, training/testing, and other Keras features.




In [2]:
import generate_notebook_toc 
from IPython.display import display, Markdown
current_notebook_filename = "CS_Keras.ipynb"
display(Markdown(generate_notebook_toc.get_html_toc(current_notebook_filename)))

<div style="background-color: whitesmoke; padding: 10px; padding-left: 30px;">
  <h2>Table of Contents</h2>
  <hr>
  <div style="font-weight: bold; font-size: 1.1em;"><a href="#Layers">1. Layers</a></div>
  <div style="padding-left: 25px;"><a href="#Custom-Layers-(Subclassing)">Custom Layers (Subclassing)</a></div>
  <div style="padding-left: 25px;"><a href="#Input">Input</a></div>
  <div style="padding-left: 25px;"><a href="#Dense">Dense</a></div>
  <div style="padding-left: 25px;"><a href="#Convolution">Convolution</a></div>
  <div style="padding-left: 25px;"><a href="#Transpose-Convolution">Transpose Convolution</a></div>
  <div style="padding-left: 25px;"><a href="#Dropout">Dropout</a></div>
  <div style="padding-left: 25px;"><a href="#Batch-Normalization">Batch Normalization</a></div>
  <div style="font-weight: bold; font-size: 1.1em;"><a href="#Models">2. Models</a></div>
  <div style="padding-left: 25px;"><a href="#Custom-Models-(Subclassing)">Custom Models (Subclassing)</a></div>
  <div style="padding-left: 25px;"><a href="#Conventional-Fully-Connected-Neural-Network">Conventional Fully Connected Neural Network</a></div>
  <div style="padding-left: 50px;"><a href="#Regression">Regression</a></div>
  <div style="padding-left: 50px;"><a href="#Classification">Classification</a></div>
  <div style="padding-left: 25px;"><a href="#Convolutional-Neural-Network-(CNN)">Convolutional Neural Network (CNN)</a></div>
  <div style="padding-left: 50px;"><a href="#Simple-CNN">Simple CNN</a></div>
  <div style="padding-left: 50px;"><a href="#VGG">VGG</a></div>
  <div style="padding-left: 50px;"><a href="#ResNet">ResNet</a></div>
  <div style="padding-left: 50px;"><a href="#Pre-trained:-VGG16">Pre-trained: VGG16</a></div>
  <div style="padding-left: 25px;"><a href="#Recurrent-Neural-Network-(RNN)">Recurrent Neural Network (RNN)</a></div>
  <div style="padding-left: 50px;"><a href="#Simple-RNN">Simple RNN</a></div>
  <div style="padding-left: 50px;"><a href="#Long-Short-Term-Memory-(LSTM)">Long Short-Term Memory (LSTM)</a></div>
  <div style="padding-left: 25px;"><a href="#Transformers">Transformers</a></div>
  <div style="padding-left: 50px;"><a href="#Prediction:-sequential-data-tasks">Prediction: sequential data tasks</a></div>
  <div style="padding-left: 50px;"><a href="#Text-Generation">Text Generation</a></div>
  <div style="padding-left: 25px;"><a href="#Q-learning-(Reinforcement-Learning)">Q-learning (Reinforcement Learning)</a></div>
  <div style="padding-left: 25px;"><a href="#Autoencoders">Autoencoders</a></div>
  <div style="padding-left: 25px;"><a href="#Diffusion-models">Diffusion models</a></div>
  <div style="padding-left: 25px;"><a href="#Generative-Adversarial-Networks-(GANs)">Generative Adversarial Networks (GANs)</a></div>
  <div style="font-weight: bold; font-size: 1.1em;"><a href="#Pre-processing">3. Pre-processing</a></div>
  <div style="padding-left: 25px;"><a href="#Data-Augmentation">Data Augmentation</a></div>
  <div style="font-weight: bold; font-size: 1.1em;"><a href="#Train-and-Test">4. Train and Test</a></div>
  <div style="padding-left: 25px;"><a href="#Build-model">Build model</a></div>
  <div style="padding-left: 25px;"><a href="#Compile">Compile</a></div>
  <div style="padding-left: 25px;"><a href="#Fit">Fit</a></div>
  <div style="padding-left: 25px;"><a href="#Evaluate">Evaluate</a></div>
  <div style="padding-left: 25px;"><a href="#Hyperparameter-Tuning">Hyperparameter Tuning</a></div>
  <div style="padding-left: 25px;"><a href="#Learning-rate-scheduling">Learning rate scheduling</a></div>
  <div style="padding-left: 25px;"><a href="#Custom-training-loops">Custom training loops</a></div>
  <div style="padding-left: 25px;"><a href="#Custom-callbacks">Custom callbacks</a></div>
  <div style="font-weight: bold; font-size: 1.1em;"><a href="#Other-Features">5. Other Features</a></div>
  <div style="padding-left: 25px;"><a href="#Save-and-load-Keras-models">Save and load Keras models</a></div>
  <div style="padding-left: 25px;"><a href="#Mixed-precision-training">Mixed precision training</a></div>
  <div style="padding-left: 25px;"><a href="#Model-pruning">Model pruning</a></div>
  <div style="padding-left: 25px;"><a href="#Quantization">Quantization</a></div>
  <hr>
</div>

## Layers

### Custom Layers (Subclassing)
- Allows you to define your own operations in a neural network
- Allows implementation of custom functionality not provided by existing Keras layers

In [None]:
import tensorflow as tf  # needed for tf.nn.relu and tf.matmul in call()
from keras.layers import Layer
from keras.models import Sequential

class CustomLayer(Layer):
    def __init__(self, units=32):
        super(CustomLayer, self).__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer='random_normal',
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer='zeros',
                                 trainable=True)
    def call(self, inputs):
        return tf.nn.relu(tf.matmul(inputs, self.w) + self.b)

### Input

In [None]:
from keras.layers import Input
input_layer = Input(shape=(28, 28, 1))

### Dense
- Fully-connected layer

In [None]:
from keras.layers import Dense
from keras.initializers import HeNormal

dense_layer = Dense(units=100, activation='relu')

# To implement He weight initialization (set the initial weights to avoid issues like vanishing or exploding gradients)
# He initialization is suitable for layers with relu activation, helping maintain a stable gradient flow during training.
dense_layer = Dense(units=100, activation='relu', kernel_initializer=HeNormal())
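
As a rough illustration of what a dense layer with He initialization computes, here is a minimal NumPy sketch. The helper names (`he_normal`, `dense_relu`) and the input size of 784 are assumptions made for this example; Keras' `HeNormal` draws from a truncated normal, whereas this sketch uses a plain normal with the same standard deviation, `sqrt(2 / fan_in)`.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He initialization: zero-mean Gaussian with std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def dense_relu(x, w, b):
    # Dense layer forward pass: relu(x @ W + b)
    return np.maximum(x @ w + b, 0.0)

x = rng.normal(size=(4, 784))   # batch of 4 hypothetical inputs
w = he_normal(784, 100)         # kernel for a 100-unit layer
b = np.zeros(100)               # biases start at zero
y = dense_relu(x, w, b)
print(y.shape)  # (4, 100)
```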

### Convolution
- A filter/kernel is moved across the input image to produce a feature map. 
- With the default `padding='valid'` (or a stride greater than 1), the feature map is spatially smaller than the input, which is useful for feature extraction.

In [None]:
from keras.layers import Conv2D
conv_layer = Conv2D(filters=16, kernel_size=(5, 5), strides=(1, 1), activation='relu')
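
The effect on the spatial dimensions can be checked with a small helper. This is a sketch in plain Python; `conv_output_size` is a hypothetical name, and the 28-pixel input is an assumed example size.

```python
def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial axis, using Keras-style padding names."""
    if padding == "same":
        return -(-n // stride)          # ceil(n / stride)
    return (n - kernel) // stride + 1   # 'valid': no padding

print(conv_output_size(28, kernel=5, stride=1))                  # 24
print(conv_output_size(28, kernel=5, stride=1, padding="same"))  # 28
```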

### Transpose Convolution
- When the stride is greater than 1, zeroes are inserted between the elements of the input feature map before the convolution is applied
- Reverses the spatial effect of a convolution (it is not a true mathematical inverse), up-sampling the input to a larger, higher-resolution size

**Applications:** image generation (in generative adversarial networks, i.e. GANs), super-resolution, semantic segmentation

In [None]:
from keras.layers import Conv2DTranspose
transpose_conv_layer = Conv2DTranspose(filters=1, kernel_size=(3, 3), activation='sigmoid', padding='same') 
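
The zero-insertion view can be sketched in one dimension with NumPy. `transpose_conv1d` is a hypothetical helper written for this illustration (no padding or cropping, i.e. the `'valid'` case); it is not how Keras implements the layer internally.

```python
import numpy as np

def transpose_conv1d(x, kernel, stride):
    # Step 1: insert (stride - 1) zeros between neighbouring input elements
    upsampled = np.zeros((len(x) - 1) * stride + 1)
    upsampled[::stride] = x
    # Step 2: run an ordinary "full" convolution over the zero-stuffed signal
    return np.convolve(upsampled, kernel, mode="full")

x = np.array([1.0, 2.0, 3.0])
y = transpose_conv1d(x, kernel=np.array([1.0, 1.0, 1.0]), stride=2)
print(len(x), "->", len(y))  # 3 -> 7
```

The output length is `(n - 1) * stride + kernel`, so a stride of 2 roughly doubles the resolution.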

### Dropout
- A regularisation technique that helps prevent overfitting in neural networks. 
- Some of the input units are randomly set to zero at each update cycle. This prevents the model from becoming overly reliant on any specific neurons, which encourages the network to learn more robust features that generalize better to unseen data.
- Dropout is only applied during training, not during inference.
- The dropout rate is a hyperparameter that determines the fraction of neurons to drop.

In [None]:
from keras.layers import Dropout
dropout_layer = Dropout(rate=0.5)
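
The difference between training and inference can be sketched in NumPy. The `dropout` function below is a hypothetical stand-in, assuming the usual "inverted dropout" scheme in which kept units are scaled by `1 / (1 - rate)` so that the expected activation is unchanged.

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training:
        return x                               # inference: layer is the identity
    keep_mask = rng.random(x.shape) >= rate    # keep each unit with probability 1 - rate
    return x * keep_mask / (1.0 - rate)        # rescale the surviving units

rng = np.random.default_rng(0)
x = np.ones((2, 8))
train_out = dropout(x, rate=0.5, training=True, rng=rng)   # entries are 0.0 or 2.0
infer_out = dropout(x, rate=0.5, training=False, rng=rng)  # unchanged
```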

### Batch Normalization

- A normalisation technique (with a mild regularising side effect) used to improve the training stability and speed of neural networks. 
- The output of a previous layer is normalised by re-centering (mean of zero) and re-scaling (variance of one) the data, which helps in stabilising the learning process. By reducing the internal covariate shift (the changes in the distribution of layer inputs), batch normalisation allows the model to use higher learning rates, which often speeds up convergence.
- It is applied during both training and inference, although its behaviour varies slightly between the two phases.
- Introduces two learnable parameters that allow the model to scale and shift the normalised output, which helps in restoring the model's representational power.

In [None]:
from keras.layers import BatchNormalization
batch_norm_layer = BatchNormalization()
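
A minimal NumPy sketch of the two modes follows. The function name `batch_norm` and the toy data are assumptions; the defaults `momentum=0.99` and `eps=1e-3` mirror the Keras layer defaults, and `gamma` / `beta` play the role of the two learnable parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, moving_mean, moving_var, training,
               momentum=0.99, eps=1e-3):
    if training:
        # Training: normalise with the statistics of the current batch
        mean, var = x.mean(axis=0), x.var(axis=0)
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference: normalise with the running averages collected in training
        mean, var = moving_mean, moving_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, moving_mean, moving_var

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
y_train, mm, mv = batch_norm(x, gamma, beta, np.zeros(4), np.ones(4), training=True)
y_infer, _, _ = batch_norm(x, gamma, beta, mm, mv, training=False)
```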

## Models

### Custom Models (Subclassing)
- Allows you to define custom and dynamic models
- Particularly useful when the forward pass cannot be defined statically
- Widely used for custom training loops and non-standard architectures

In [None]:
from keras.models import Model
from keras.layers import Dense

class CustomModel(Model):
    def __init__(self):
        super(CustomModel, self).__init__()
        #Define layers
        self.dense1 = Dense(64, activation='relu')
        self.dense2 = Dense(10, activation='softmax')
    def call(self, inputs):
        # Forward pass
        x = self.dense1(inputs)
        return self.dense2(x)
    
model = CustomModel()

### Conventional Fully Connected Neural Network 

#### Regression

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Input

def regression_NN(input_shape, output_shape):

    model = Sequential()
    model.add(Input(shape=(input_shape,)))
    model.add(Dense(units=50, activation='relu'))
    model.add(Dense(units=50, activation='relu'))
    model.add(Dense(units=output_shape))
    
    return model

#### Classification

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Input

def classification_NN(input_shape, output_shape):

    model = Sequential()
    model.add(Input(shape=(input_shape,)))
    model.add(Dense(units=50, activation='relu'))
    model.add(Dense(units=50, activation='relu'))
    model.add(Dense(units=output_shape, activation='softmax'))
    
    return model
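
The only difference from the regression network is the softmax on the output layer, which turns raw scores into class probabilities. A small NumPy sketch (the `softmax` helper and the example logits are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
predicted = probs.argmax(axis=-1)   # most likely class for each row
```

Each row of `probs` sums to 1, so it can be read as a probability distribution over the classes.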

### Convolutional Neural Network (CNN)

**Applications:** image recognition, object detection, computer vision

**Features:**
- Convolutional layers: extract features from the input image
- Pooling layers: downsample the feature maps to reduce dimensionality
- Fully connected layers: perform final classification

**Advanced CNN architectures:** VGG, ResNet, inception networks, deeper networks

#### Simple CNN

In [None]:
from keras.models import Sequential
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten

def simple_CNN(input_shape,output_shape):

    model = Sequential()
    model.add(Input(shape=input_shape))
    
    model.add(Conv2D(filters=16, kernel_size=(5, 5), strides=(1, 1), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    
    # Optional: repeat convolutional and pooling layers
    model.add(Conv2D(filters=8, kernel_size=(2, 2), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    
    model.add(Flatten())
    model.add(Dense(units=100, activation='relu'))
    model.add(Dense(units=output_shape, activation='softmax'))
    
    return model
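
To see how the layers in `simple_CNN` shrink the feature maps, the shapes and parameter counts can be traced by hand. This sketch assumes a 28x28x1 input and 10 output classes (neither is fixed by the function above); the helper names are hypothetical.

```python
def conv_out(n, kernel, stride=1):
    """Output length along one axis for 'valid' padding."""
    return (n - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    return kernel * kernel * in_channels * filters + filters  # weights + biases

def dense_params(inputs, units):
    return inputs * units + units                             # weights + biases

size = 28        # assumed 28x28x1 input
n_classes = 10   # assumed number of classes

size = conv_out(size, 5)            # Conv2D(16, 5x5)   -> 24
size = conv_out(size, 2, stride=2)  # MaxPooling2D(2x2) -> 12
size = conv_out(size, 2)            # Conv2D(8, 2x2)    -> 11
size = conv_out(size, 2, stride=2)  # MaxPooling2D(2x2) -> 5
flat = size * size * 8              # Flatten           -> 200

total = (conv_params(5, 1, 16) + conv_params(2, 16, 8)
         + dense_params(flat, 100) + dense_params(100, n_classes))
print(flat, total)  # 200 22046
```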

#### VGG

In [None]:
from keras.models import Sequential
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten

def VGG_CNN(input_shape,output_shape):

    model = Sequential()
    model.add(Input(shape=input_shape))
    
    model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    model.add(Conv2D(filters=256, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(filters=256, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    model.add(Flatten())
    model.add(Dense(units=512, activation='relu'))
    model.add(Dense(units=512, activation='relu'))
    model.add(Dense(units=output_shape, activation='softmax'))
    
    return model

#### ResNet

- Introduces residual connections, which help train deep networks by addressing the vanishing gradient problem. 
- Residual connections allow the network to learn identity mappings, making it easier to train deeper networks.

In [None]:
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten, BatchNormalization, Add, Activation

def ResNet_CNN(input_shape,output_shape):
    
    def residual_block(x, filters, kernel_size=3, stride=1):
        shortcut = x
        x = Conv2D(filters, kernel_size, strides=stride, padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        x = Conv2D(filters, kernel_size, strides=1, padding='same')(x)
        x = BatchNormalization()(x)
        x = Add()([x, shortcut])
        x = Activation('relu')(x)
        return x

    input_layer = Input(shape=input_shape)
    x = Conv2D(filters=64, kernel_size=(7,7), strides=2, padding='same')(input_layer)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = residual_block(x, 64)
    x = residual_block(x, 64)
    x = Flatten()(x)
    output_layer = Dense(output_shape, activation='softmax')(x)
    
    model = Model(inputs=input_layer, outputs=output_layer)
    
    return model
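
Why a residual connection makes identity mappings easy to learn can be shown with a toy NumPy block. Here two plain matrix multiplications stand in for the convolution and batch-normalisation layers (an assumption made to keep the sketch short): when the weights are zero, the block simply passes its input through.

```python
import numpy as np

def residual_block(x, w1, w2):
    # F(x): two linear maps with a ReLU in between (stand-in for Conv/BN layers)
    f = np.maximum(x @ w1, 0.0) @ w2
    # Add the shortcut, then apply the final ReLU
    return np.maximum(f + x, 0.0)

x = np.abs(np.random.default_rng(0).normal(size=(2, 4)))  # non-negative inputs
zeros = np.zeros((4, 4))
out = residual_block(x, zeros, zeros)
print(np.allclose(out, x))  # True
```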

#### Pre-trained: VGG16

In [None]:
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

def VGG16_CNN(input_shape, output_shape):
    
    # Load the VGG16 model pre-trained on ImageNet
    base_model = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)

    # Freeze the base model layers
    for layer in base_model.layers:
        layer.trainable = False
    
    model = Sequential()
    model.add(base_model)
    
    model.add(Flatten())
    model.add(Dense(units=256, activation='relu'))
    model.add(Dense(units=output_shape, activation='sigmoid'))
    
    #Optional: fine-tuning 
    for layer in base_model.layers[-4:]: #Unfreeze the last four layers of the VGG16 model
        layer.trainable = True 
    
    return model

### Recurrent Neural Network (RNN)
- Feed the output (hidden state) from previous time steps back in as input alongside the current data point
- Very good at modeling patterns and sequences of data

**Applications:** image generation, handwriting generation, auto-captioning, genomes, stock markets

#### Simple RNN

In [None]:
from keras.models import Sequential
from keras.layers import Input, Dense, SimpleRNN

def simple_RNN(input_shape,output_shape):

    model = Sequential()
    model.add(Input(shape=input_shape))
    
    model.add(SimpleRNN(units=50, activation='relu'))
    model.add(Dense(units=output_shape))
    
    return model

#### Long Short-Term Memory (LSTM)

In [None]:
from keras.models import Sequential
from keras.layers import Input, Dense, LSTM

def LSTM_RNN(input_shape,output_shape):

    model = Sequential()
    model.add(Input(shape=input_shape))
    
    model.add(LSTM(units=50, activation='relu'))
    model.add(Dense(units=output_shape))
    
    return model

### Transformers
- Leverage self-attention mechanisms to process input data in parallel
- Consist of two main parts, the encoder and the decoder, each built from:
  - Self-attention layers: capture dependencies between positions that are far apart in the input sequence
  - Feed-forward neural network layers: transform the input data
- Like RNNs and LSTMs, good for sequential data (e.g. natural language text and time series data), with the additional advantage of being better at handling longer sequences and at parallelisation
- The key components of the transformer model include an embedding layer, multiple transformer blocks, and a final dense layer for output prediction

**Examples:** BERT, GPT, image transformers

**Applications:** Natural Language Processing (NLP) (e.g. machine translation, question answering, text summarisation, text-to-image), time-series forecasting, computer vision, speech recognition, reinforcement learning

In [None]:
from keras.layers import Layer, Dense, LayerNormalization, Dropout, Input, Flatten, MultiHeadAttention, Embedding
from keras.models import Sequential, Model
import tensorflow as tf

#Define the Multi-Head Self-Attention mechanism
class MultiHeadSelfAttention(Layer):
#implements the multi-head self-attention mechanism, which allows the model to focus on different parts of the input sequence simultaneously

    def __init__(self, embed_dim, num_heads=8): 
        super(MultiHeadSelfAttention, self).__init__() 
        self.embed_dim = embed_dim 
        self.num_heads = num_heads 
        self.projection_dim = embed_dim // num_heads 
        self.query_dense = Dense(embed_dim) 
        self.key_dense = Dense(embed_dim) 
        self.value_dense = Dense(embed_dim) 
        self.combine_heads = Dense(embed_dim) 


    def attention(self, query, key, value): #computes the attention scores and weighted sum of the values 
        score = tf.matmul(query, key, transpose_b=True) 
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32) 
        scaled_score = score / tf.math.sqrt(dim_key) 
        weights = tf.nn.softmax(scaled_score, axis=-1) 
        output = tf.matmul(weights, value) 
        return output, weights 

    def split_heads(self, x, batch_size): #splits the input into multiple heads for parallel attention computation
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim)) 
        return tf.transpose(x, perm=[0, 2, 1, 3]) 

    def call(self, inputs): #applies the self-attention mechanism and combines the heads
        batch_size = tf.shape(inputs)[0] 
        query = self.query_dense(inputs) 
        key = self.key_dense(inputs) 
        value = self.value_dense(inputs) 
        query = self.split_heads(query, batch_size) 
        key = self.split_heads(key, batch_size) 
        value = self.split_heads(value, batch_size) 
        attention, _ = self.attention(query, key, value) 
        attention = tf.transpose(attention, perm=[0, 2, 1, 3]) 
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim)) 
        output = self.combine_heads(concat_attention) 
        return output 

#Define the Transformer block
class TransformerBlock(Layer): 
#combines multi-head self-attention with a feed-forward neural network and normalization layers

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1): 
        super(TransformerBlock, self).__init__() 
        
        #EITHER:
        # self.att = MultiHeadSelfAttention(embed_dim, num_heads) 
        #OR:
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        
        self.ffn = Sequential([ 
            Dense(ff_dim, activation="relu"), 
            Dense(embed_dim), 
        ]) 

        self.layernorm1 = LayerNormalization(epsilon=1e-6) 
        self.layernorm2 = LayerNormalization(epsilon=1e-6) 
        self.dropout1 = Dropout(rate) #dropout used to prevent overfitting
        self.dropout2 = Dropout(rate) 


    def call(self, inputs, training=False): #applies the self-attention, followed by the feedforward network with residual connections and layer normalization
        # EITHER:
        # attn_output = self.att(inputs) 
        # OR:
        attn_output = self.att(inputs, inputs)    
        attn_output = self.dropout1(attn_output, training=training) 
        out1 = self.layernorm1(inputs + attn_output) 
        ffn_output = self.ffn(out1) 
        ffn_output = self.dropout2(ffn_output, training=training) 
        return self.layernorm2(out1 + ffn_output) 
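
The core computation inside the `attention` method above, softmax(QKᵀ / √d_k) V, can be sketched in NumPy for a single head. The tensor sizes are arbitrary example values, and using the same array for queries, keys and values is what makes it *self*-attention.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)    # (batch, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))   # (batch, seq_len, embed_dim), example sizes
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)  # (2, 5, 16) (2, 5, 5)
```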

#### Prediction: sequential data tasks

In [None]:
#Define the Transformer Encoder
class TransformerEncoder(Layer):
#composed of multiple TransformerBlock layers, implementing the encoding part of the Transformer architecture.

    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, rate=0.1): 
        super(TransformerEncoder, self).__init__() 
        self.num_layers = num_layers 
        self.embed_dim = embed_dim 
        self.enc_layers = [TransformerBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)] 
        self.dropout = Dropout(rate) 

    def call(self, inputs, training=False): 
        x = inputs 
        for i in range(self.num_layers): 
            x = self.enc_layers[i](x, training=training) 
        return x  

def Predictions_Transformer(input_shape, output_shape):
#defines the necessary parameters, flattens the output, and ends with a dense layer to produce the final output

    # Hyperparameters
    embed_dim = 128 
    num_heads = 8 
    ff_dim = 512 
    num_layers = 4 

    # Define the Transformer Encoder 
    transformer_encoder = TransformerEncoder(num_layers, embed_dim, num_heads, ff_dim) 

    input_layer = Input(shape=input_shape) 

    # Project the inputs to the embed_dim 
    x = Dense(units=embed_dim)(input_layer) #embedding layer
    encoder_outputs = transformer_encoder(x) 
    flatten = Flatten()(encoder_outputs) 
    output_layer = Dense(units=output_shape)(flatten) 
    model = Model(input_layer, output_layer) 
    
    return model

#### Text Generation

In [None]:
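import numpy as np
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense, Embedding

# TransformerBlock is the layer defined in the Transformers cell above
class TextGen_TransformerModel(Model):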
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, seq_length):
        super(TextGen_TransformerModel, self).__init__()
        self.embedding = Embedding(vocab_size, embed_dim)
        self.pos_encoding = self.positional_encoding(seq_length, embed_dim)
        self.transformer_blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        self.dense = Dense(vocab_size)

    def positional_encoding(self, seq_length, embed_dim):
        angle_rads = self.get_angles(np.arange(seq_length)[:, np.newaxis], np.arange(embed_dim)[np.newaxis, :], embed_dim)
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, embed_dim):
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
        return pos * angle_rates

    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        x = self.embedding(inputs)
        x += self.pos_encoding[:, :seq_len, :]
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, training=training)
        output = self.dense(x)
        return output
    
def TextGen_Transformer(vocab_size, seq_length):  # vocabulary size and maximum sequence length of the tokenised text

    # Hyperparameters 
    embed_dim = 256 
    num_heads = 4 
    ff_dim = 512 
    num_layers = 4 

    # Build the Transformer model 
    model = TextGen_TransformerModel(vocab_size, embed_dim, num_heads, ff_dim, num_layers, seq_length)
    
    return model

### Q-learning (Reinforcement Learning)
- Reinforcement learning algorithm
  - **Agents** choose from a set of **Actions**
  - **Actions** impact the **Environment**, which impacts **Agents** via **Rewards**
  - **Rewards** are unknown and must be estimated by the **Agent**
  - The process repeats dynamically, so that **Agents** learn how to estimate **Rewards**
- The essence of Q-learning lies in the Q-value function, Q(s, a), which estimates the expected cumulative reward of taking action a in state s. The Q-values are updated iteratively using the Bellman equation, which incorporates both the immediate reward and the estimated future rewards.
- The steps to implement Q-learning with Keras include initializing the environment, building the Q-network, training the Q-network, and evaluating the agent.

**Applications**: recommendation engines, marketing, automated bidding

In [None]:
import gym # OpenAI Gym: a toolkit for developing and comparing reinforcement learning algorithms (the 5-value env.step API used below needs gym >= 0.26 or gymnasium)
from keras.models import Sequential
from keras.layers import Dense, Input 
from keras.optimizers import Adam
import random
import numpy as np
from collections import deque
import tensorflow as tf

# Create the environment 
# CartPole-v1 is an environment where a pole is balanced on a cart, and the goal is to prevent the pole from falling over 
# (a common benchmark for reinforcement learning algorithms)
env = gym.make('CartPole-v1') 

# Global settings
episodes = 10  # Number of episodes
batch_size = 32  # Size of the mini-batch for training
epsilon = 1.0  # Starting with a high exploration rate
epsilon_min = 0.01  # Minimum exploration rate
epsilon_decay = 0.99  # Multiplicative decay applied to epsilon after each training step
memory_size=2000

memory = deque(maxlen=memory_size)  # Memory buffer to store experiences

# Define state size and action size based on the environment
state_size = env.observation_space.shape[0]  # State space size from the environment
action_size = env.action_space.n  # Number of possible actions from the environment

# Define the model building function (takes the state as input and outputs Q-values for each action)
def build_model(state_size, action_size): 
    model = Sequential() 
    model.add(Input(shape=(state_size,)))  # Use Input layer to specify the input shape 
    model.add(Dense(24, activation='relu')) 
    model.add(Dense(24, activation='relu')) 
    model.add(Dense(action_size, activation='linear')) #linear activation function for the output layer, as we are predicting continuous Q-values
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001)) 
    return model 

model = build_model(state_size, action_size)

# Replay memory helpers (the `memory` deque is created in the global settings above)

def remember(state, action, reward, next_state, done):
    """Store experience in memory."""
    memory.append((state, action, reward, next_state, done))

def replay(batch_size):
    """Train the model using a random sample of experiences from memory."""
    if len(memory) < batch_size:
        return  # Skip replay if there's not enough experience

    minibatch = random.sample(memory, batch_size)  # Sample a random batch from memory
    
    # Extract information for batch processing
    states = np.vstack([x[0] for x in minibatch])
    actions = np.array([x[1] for x in minibatch])
    rewards = np.array([x[2] for x in minibatch])
    next_states = np.vstack([x[3] for x in minibatch])
    dones = np.array([x[4] for x in minibatch])
    
    # Predict Q-values for the next states in batch
    q_next = model.predict(next_states)
    # Predict Q-values for the current states in batch
    q_target = model.predict(states)
    
    # Update the target Q-value of each sampled transition (Bellman equation)
    for i in range(batch_size):
        target = rewards[i]
        if not dones[i]:
            target += 0.95 * np.amax(q_next[i])  # Update Q value with the discounted future reward
        q_target[i][actions[i]] = target  # Update only the taken action's Q value
    
    # Train the model with the updated targets in batch
    model.fit(states, q_target, epochs=1, verbose=0)  # Train in batch mode

    # Reduce exploration rate (epsilon) after each training step
    global epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

def act(state):
    """Choose an action based on the current state and exploration rate."""
    if np.random.rand() <= epsilon:
        return random.randrange(action_size)  # Explore: choose a random action
    act_values = model.predict(state)  # Exploit: predict action based on the state
    return np.argmax(act_values[0])  # Return the action with the highest Q-value

# `episodes` is set in the global settings above
train_frequency = 5  # Train the model every 5 steps

for e in range(episodes):
    state, _ = env.reset()  # Unpack the tuple returned by env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(200):  # Limit to 200 time steps per episode
        action = act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        remember(state, action, reward, next_state, done)  # Store experience
        state = next_state
        
        if done:
            print(f"episode: {e+1}/{episodes}, score: {time}, e: {epsilon:.2}")
            break
        
        # Train the model every 'train_frequency' steps
        if time % train_frequency == 0:
            replay(batch_size)  # Train on a random mini-batch sampled from memory

# The environment stays open here because the evaluation loop below reuses it

# evaluate the performance of the trained Q-Learning agent
for e in range(10):  

    state, _ = env.reset()  # Unpack the state from the tuple 
    state = np.reshape(state, [1, state_size])  # Reshape the state correctly 
    for time in range(500):  
        env.render()  
        action = np.argmax(model.predict(state)[0])  
        next_state, reward, terminated, truncated, _ = env.step(action)  # Unpack the five return values 
        done = terminated or truncated  # Check if the episode is done 
        next_state = np.reshape(next_state, [1, state_size])  
        state = next_state  
        if done:  
            print(f"episode: {e+1}/10, score: {time}")  
            break  

env.close() 
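
The per-sample loop in `replay` applies the Bellman target `r + gamma * max_a' Q(s', a')` to the action that was taken. The same update can be written in vectorised NumPy, as in the sketch below; `q_targets` is a hypothetical helper and the arrays are made-up example values, with `gamma=0.95` matching the discount factor used above.

```python
import numpy as np

def q_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    targets = q_current.copy()
    # Discounted best future value; zero for terminal transitions
    future = gamma * q_next.max(axis=1) * (1.0 - dones)
    # Overwrite only the Q-value of the action actually taken
    targets[np.arange(len(actions)), actions] = rewards + future
    return targets

q_current = np.array([[1.0, 2.0], [0.5, 0.1]])   # Q(s, .) for two transitions
q_next = np.array([[3.0, 4.0], [9.0, 9.0]])      # Q(s', .)
actions = np.array([0, 1])
rewards = np.array([1.0, -10.0])
dones = np.array([0.0, 1.0])                     # second transition is terminal

targets = q_targets(q_current, q_next, actions, rewards, dones)
# Row 0: action 0 becomes 1 + 0.95 * 4 = 4.8; row 1: action 1 becomes -10 (terminal)
```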

### Autoencoders
- Unsupervised learner
- A type of neural network used to learn efficient representations of data for the purpose of dimensionality reduction or feature learning
- Consist of three main parts:
  - Encoder: this part compresses the input into a smaller latent space representation
  - Bottleneck: compressed representation
  - Decoder: this part reconstructs the input from the latent space representation. 
- The key idea is that the autoencoder is trained to minimize the difference between the input and the reconstructed output, forcing the network to learn meaningful representations of the data.

**Applications:** data denoising, dimensionality reduction, and feature learning

**Advanced Autoencoder architectures:** 
- Variational Autoencoders (VAEs): have probabilistic elements and are used for generating new data samples
- Convolutional Autoencoders: use convolutional layers and are effective for image data

In [None]:
from tensorflow.keras.models import Model 
from tensorflow.keras.layers import Input, Dense 

def Autoencoder(input_shape,output_shape):
    
    input_layer = Input(shape=input_shape)
    
    # Encoder 
    encoded = Dense(64, activation='relu')(input_layer) 

    # Bottleneck 
    bottleneck = Dense(32, activation='relu')(encoded) 

    # Decoder 
    decoded = Dense(64, activation='relu')(bottleneck) 
    output_layer = Dense(output_shape, activation='sigmoid')(decoded)  # output_shape: number of input features, e.g. 784 for flattened 28x28 images

    # Autoencoder model 
    model = Model(input_layer, output_layer) 

    return model 

### Diffusion models

- Unsupervised learner
- Probabilistic models that generate data by iteratively refining a noisy initial sample
- They start from random noise and gradually apply a series of transformations to produce a coherent data sample
- Simulate the physical process of diffusion, where particles spread out from regions of high concentration to regions of low concentration
- Diffusion models work by defining a forward process, which gradually adds noise to the data, and a reverse process, which learns to remove that noise step by step
- The simplified denoising network in the example below consists of three main parts:
  - Encoder: this part compresses the input into a smaller latent space representation
  - Bottleneck: compressed representation
  - Decoder: this part reconstructs the input from the latent space representation

**Applications**: image generation, image enhancement/denoising, data augmentation
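
The forward process mentioned above can be sketched without any framework. This is a minimal sketch under assumptions that do not come from this notebook: it uses the closed-form noising step and the linear noise schedule commonly associated with DDPM-style diffusion models, and the function names and schedule values are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_sample(x0, t, alpha_bars, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

x0 = [1.0, -1.0, 0.5]                               # toy "image" with 3 pixels
x_early = forward_sample(x0, 10, alpha_bars, rng)   # mostly signal
x_late = forward_sample(x0, 999, alpha_bars, rng)   # almost pure noise
print(x_early, x_late)
```

The reverse process is what a network is trained for: given a noisy sample, it learns to estimate and remove the noise.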

In [None]:
from tensorflow.keras.models import Model 
from tensorflow.keras.layers import Input, Dense, Conv2D, Conv2DTranspose, Flatten, Reshape

def DiffusionModel(input_shape, output_shape):
    # Simplified denoising network, trained on (noisy image, clean image) pairs.
    # input_shape: (height, width, channels), e.g. (28, 28, 1).
    # output_shape is unused because the output has the same shape as the input;
    # it is kept so that all model builders share the same signature.
    height, width, channels = input_shape
    input_layer = Input(shape=input_shape)
    
    # Encoder
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_layer)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    
    # Bottleneck
    x = Flatten()(x)
    x = Dense(64, activation='relu')(x)
    
    # Decoder
    x = Dense(height * width * 32, activation='relu')(x)
    x = Reshape((height, width, 32))(x)
    x = Conv2DTranspose(32, (3, 3), activation='relu', padding='same')(x)
    x = Conv2DTranspose(16, (3, 3), activation='relu', padding='same')(x)
    output_layer = Conv2D(channels, (3, 3), activation='sigmoid', padding='same')(x)
    model = Model(input_layer, output_layer)

    return model

### Generative Adversarial Networks (GANs)
- Unsupervised learner
- Consist of two networks:
  - Generator network: generates new data instances that resemble the training data.
  - Discriminator network: evaluates the authenticity of the generated data.
- The two networks are trained simultaneously through adversarial training: the generator tries to fool the discriminator while the discriminator tries to distinguish between real and fake data. This adversarial process leads to the generator producing increasingly realistic data. 

**Applications:** image generation, text-to-image synthesis, image-to-image translation, data augmentation
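
The adversarial objective described above can be made concrete with a small framework-free sketch. The discriminator outputs below are hypothetical numbers, and binary cross-entropy is assumed as the loss for both networks.

```python
import math

def binary_crossentropy(label, prob, eps=1e-7):
    """Binary cross-entropy for one prediction; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Hypothetical discriminator outputs (probability that an image is real)
d_on_real = 0.9   # prediction for a real training image
d_on_fake = 0.2   # prediction for a generated image

# Discriminator: real images should be labelled 1, generated images 0
d_loss = 0.5 * (binary_crossentropy(1, d_on_real) + binary_crossentropy(0, d_on_fake))

# Generator: wants the discriminator to label its images as real (1)
g_loss = binary_crossentropy(1, d_on_fake)

print(f"D loss: {d_loss:.4f}, G loss: {g_loss:.4f}")
```

In this situation the discriminator is doing well, so its loss is low while the generator loss is high; as the generator improves, `d_on_fake` moves toward 1 and the balance shifts.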

In [None]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, LeakyReLU, BatchNormalization, Reshape, Flatten

# Define the generator model
# The generator takes a random noise vector as an input and generates a synthetic image
def build_generator(): 
    model = Sequential() 
    model.add(Dense(256, input_dim=100)) 
    model.add(LeakyReLU(alpha=0.2)) 
    model.add(BatchNormalization(momentum=0.8)) 
    model.add(Dense(512)) 
    model.add(LeakyReLU(alpha=0.2)) 
    model.add(BatchNormalization(momentum=0.8)) 
    model.add(Dense(1024)) 
    model.add(LeakyReLU(alpha=0.2)) 
    model.add(BatchNormalization(momentum=0.8)) 
    model.add(Dense(28 * 28 * 1, activation='tanh')) 
    model.add(Reshape((28, 28, 1))) 
    return model 

# Build the generator 
generator = build_generator() 
generator.summary()

# Define the discriminator model
# The discriminator takes an image as an input and outputs a probability indicating whether the image is real or fake
def build_discriminator(): 
    model = Sequential() 
    model.add(Flatten(input_shape=(28, 28, 1))) 
    model.add(Dense(512)) 
    model.add(LeakyReLU(alpha=0.2)) 
    model.add(Dense(256)) 
    model.add(LeakyReLU(alpha=0.2)) 
    model.add(Dense(1, activation='sigmoid')) 
    return model 

# Build and compile the discriminator 
discriminator = build_discriminator() 
discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 
discriminator.summary()

# Create the GAN by stacking the generator and the discriminator
# The GAN takes a noise vector as an input, generates a synthetic image using the generator, and classifies the image using the discriminator
def build_gan(generator, discriminator): 
    # Freeze the discriminator inside the combined model so that only the generator is updated during adversarial training
    discriminator.trainable = False
    gan_input = Input(shape=(100,))  # Input layer for the noise vector
    generated_image = generator(gan_input)  # Pass the noise vector through the generator to produce a synthetic image
    gan_output = discriminator(generated_image)  # Pass the synthetic image through the discriminator to get the classification
    model = Model(gan_input, gan_output)
    # Compile the GAN using binary cross-entropy loss and the Adam optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model 

# Build the GAN 
gan = build_gan(generator, discriminator) 
gan.summary()

In [None]:
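import numpy as np

# Assumes x_train holds the training images with shape (n, 28, 28, 1),
# scaled to [-1, 1] to match the generator's tanh output

# Training parameters 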
batch_size = 64 
epochs = 50
sample_interval = 10

# Adversarial ground truths 
real = np.ones((batch_size, 1)) 
fake = np.zeros((batch_size, 1)) 

# Training loop (each "epoch" here performs a single batch update)
for epoch in range(epochs): 
    # Train the discriminator 
    idx = np.random.randint(0, x_train.shape[0], batch_size) 
    real_images = x_train[idx] 
    noise = np.random.normal(0, 1, (batch_size, 100)) 
    generated_images = generator.predict(noise, verbose=0) 
    d_loss_real = discriminator.train_on_batch(real_images, real) 
    d_loss_fake = discriminator.train_on_batch(generated_images, fake) 
    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake) 

    # Train the generator 
    noise = np.random.normal(0, 1, (batch_size, 100)) 
    g_loss = gan.train_on_batch(noise, real) 

    # Print the progress 
    if epoch % sample_interval == 0: 
        print(f"{epoch} [D loss: {d_loss[0]}] [D accuracy: {100 * d_loss[1]}%] [G loss: {g_loss}]")


## Pre-processing

### Data Augmentation

In [None]:
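import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Custom augmentation function: receives one image (rank-3 NumPy array) and must return an array of the same shape
def add_random_noise(image):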
    noise = np.random.normal(0, 0.1, image.shape)
    return image + noise

# Create an instance of ImageDataGenerator with basic augmentations
datagen = ImageDataGenerator(
    
    #basic augmentations
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=20,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    
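    #normalizations (all options shown for reference; typically either the featurewise or the samplewise pair is enabled)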
    featurewise_center=True,
    featurewise_std_normalization=True,
    samplewise_center=True,
    samplewise_std_normalization=True,
    
    #custom augmentations
    preprocessing_function=add_random_noise
)

#Only required if featurewise_center or featurewise_std_normalization or zca_whitening are set to True
# datagen.fit(train_data) #Computes the internal data stats related to the data-dependent transformations

# Visualizing multiple augmented versions of the same image
# plt.figure(figsize=(10, 10))
# for i, batch in enumerate(datagen.flow(x, batch_size=1)):
#     if i >= 4:  # Show only 4 versions
#         break
#     plt.subplot(2, 2, i+1)
#     plt.imshow(batch[0].astype('uint8'))
# plt.show()
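
The difference between the samplewise and featurewise options above can be sketched in plain Python. This is a simplified illustration on a made-up single-channel dataset, not the library's implementation: featurewise statistics come from the whole dataset (which is why `datagen.fit` is needed), while samplewise statistics come from each image on its own.

```python
from statistics import mean

# Toy dataset: 3 "images", each flattened to 2 pixel values (single channel)
dataset = [[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]

def samplewise_center(images):
    """Subtract each image's own mean; no dataset statistics are needed."""
    return [[p - mean(img) for p in img] for img in images]

def featurewise_center(images):
    """Subtract the mean computed over the whole dataset."""
    dataset_mean = mean(p for img in images for p in img)
    return [[p - dataset_mean for p in img] for img in images]

print(samplewise_center(dataset))   # every image becomes [-1.0, 1.0]
print(featurewise_center(dataset))  # images keep their relative brightness
```

Samplewise centering removes brightness differences between images, whereas featurewise centering preserves them.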

## Train and Test

### Build model

In [None]:
from keras.utils import plot_model

# input_shape = X_train.shape[1:]
# output_shape = y_train.shape[1] # number of categories
# model = simple_CNN(input_shape, output_shape)

model = TextGen_Transformer()

model.summary()
# plot_model(model, show_shapes=True, show_layer_names=True)

### Compile

In [None]:
#Optimizer options
optimizer='adam'

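#Loss options (alternatives: keep only the one that matches the task; as written, the last assignment wins)
loss='mean_squared_error' # Regression
loss='categorical_crossentropy' # Multi-class classification (one-hot labels)
loss='binary_crossentropy' # Binary classification

#Metrics options
metrics=['accuracy']

# Compile the model (keyword arguments avoid relying on the positional order of compile's parameters)
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)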
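
For the multi-class case, this cheat sheet uses both `categorical_crossentropy` (here) and `sparse_categorical_crossentropy` (in the tuning section). The sketch below, written in plain Python with made-up probabilities, illustrates that the two compute the same quantity and differ only in the label format they expect.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy with a one-hot encoded label, e.g. [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy with an integer class label, e.g. 1."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]   # hypothetical softmax output for one sample

loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)
```

Choosing the variant that matches the label encoding avoids an extra one-hot conversion step.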

### Fit

In [None]:
# Alternative calls: use the one that matches how the data is prepared
model.fit(X_train, y_train, validation_split=0.3, epochs=n_epochs, verbose=2) # Keras holds out the last 30% of the training data for validation
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=n_epochs, verbose=2) # training data has already been split into training and validation sets
model.fit(train_generator, validation_data=valid_generator, epochs=n_epochs, verbose=2) # when using data generators
model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size) # without a validation set
model.fit(X_train_noisy, X_train, epochs=n_epochs, batch_size=batch_size, shuffle=True, validation_data=(X_test, X_test)) # autoencoders, diffusion model: noisy input, clean target

### Evaluate

In [None]:
#Evaluate the model (use one of the two calls)
scores = model.evaluate(X_test, y_test, verbose=0) # with NumPy arrays
scores = model.evaluate(test_generator, verbose=0) # with a data generator
print("Loss: {} \n Accuracy: {} \n Error: {}".format(scores[0], scores[1], 100-scores[1]*100))

#Visualise training results (plot the loss and accuracy curves)
import matplotlib.pyplot as plt

train_history = model.history.history  # After training

plt.title("Loss Curves")
plt.ylabel("Loss")
plt.xlabel('Epoch')
plt.plot(train_history['loss'], label='Training Loss')
plt.plot(train_history['val_loss'], label='Validation Loss')
plt.legend(loc="upper left")
plt.show()

plt.title("Accuracy Curves")
plt.ylabel("Accuracy")
plt.xlabel('Epoch')
plt.plot(train_history['accuracy'], label='Training Accuracy')
plt.plot(train_history['val_accuracy'], label='Validation Accuracy')
plt.legend(loc="upper right")
plt.show()

### Hyperparameter Tuning

In [None]:
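import keras_tuner as kt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import Adam

# Define a model-building function 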
def build_model(hp):
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(units=hp.Int('units', min_value=32, max_value=512, step=32), activation='relu'),
        Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer=Adam(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Create a RandomSearch Tuner 
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=10,
    executions_per_trial=2,
    directory='my_dir',
    project_name='intro_to_kt'
)

# Display a summary of the search space 
tuner.search_space_summary()

# Run the hyperparameter search 
tuner.search(X_train, y_train, epochs=5, validation_data=(X_val, y_val)) 

# Display a summary of the results 
tuner.results_summary() 

# Retrieve the best hyperparameters 
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0] 
print(f"""
The optimal number of units in the first dense layer is {best_hps.get('units')}.
The optimal learning rate for the optimizer is {best_hps.get('learning_rate')}.
""") 

#Build and Train the Model with Best Hyperparameters 
model = tuner.hypermodel.build(best_hps) 
model.fit(X_train, y_train, epochs=10, validation_split=0.2) 

# Evaluate the model on the validation set (use a separate held-out test set, if available, for the final estimate)
val_loss, val_acc = model.evaluate(X_val, y_val) 
print(f'Validation accuracy: {val_acc}') 
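
The search space defined in `build_model` can be pictured with a framework-free sketch. This only illustrates what a random search draws from; the helper names are hypothetical, and Keras Tuner's internal sampling may differ in detail.

```python
import math
import random

# hp.Int('units', min_value=32, max_value=512, step=32): a grid of 16 values
unit_choices = list(range(32, 512 + 1, 32))

def sample_units(rng):
    """Pick one of the allowed layer sizes uniformly at random."""
    return rng.choice(unit_choices)

def sample_log_uniform(rng, min_value=1e-4, max_value=1e-2):
    """Sample uniformly in log space, so each order of magnitude is equally likely."""
    log_value = rng.uniform(math.log(min_value), math.log(max_value))
    return math.exp(log_value)

rng = random.Random(42)
trials = [(sample_units(rng), sample_log_uniform(rng)) for _ in range(10)]
for units, learning_rate in trials:
    print(f"units={units:3d}  learning_rate={learning_rate:.6f}")
```

Log sampling matters for learning rates: with linear sampling between 1e-4 and 1e-2, about 90% of the draws would fall above 1e-3.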

### Learning rate scheduling

In [None]:
import math
from tensorflow.keras.callbacks import LearningRateScheduler

def scheduler(epoch, lr):
    # Keep the initial learning rate for the first 10 epochs, then decay it exponentially
    if epoch < 10:
        return lr
    else:
        return lr * math.exp(-0.1)
    
lr_scheduler = LearningRateScheduler(scheduler)

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, callbacks=[lr_scheduler])
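
To see what this schedule produces, the same function can be run on its own, outside Keras. The starting learning rate of 0.001 is an assumption (it matches Adam's default); the loop mimics the callback calling `scheduler` once at the start of each epoch.

```python
import math

def scheduler(epoch, lr):
    """Constant for the first 10 epochs, then multiplied by exp(-0.1) each epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

lr = 0.001   # assumed starting learning rate
history = []
for epoch in range(20):
    lr = scheduler(epoch, lr)   # the returned value is used for this epoch
    history.append(lr)

for epoch, value in enumerate(history):
    print(f"epoch {epoch + 1:2d}: lr = {value:.6f}")
```

After 20 epochs the decay has been applied 10 times, so the final rate is the initial rate times exp(-1), roughly 37% of the starting value.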

### Custom training loops

A custom training loop can replace the standard Keras `fit` method when the training process needs to be tailored.

**Advantages**:
- Support for custom loss functions and optimization strategies
- Advanced logging and monitoring
- Flexibility for research
- Integration with custom operations and layers

In [None]:
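import tensorflow as tf
from tensorflow import keras

# from_logits=True assumes the model's last layer outputs raw logits (no softmax activation)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) 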
optimizer = keras.optimizers.Adam()
accuracy_metric = keras.metrics.SparseCategoricalAccuracy()

epochs = 5  # Number of epochs for training

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)

for epoch in range(epochs):
    print(f'Start of epoch {epoch + 1}')
    
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            # Forward pass: Compute predictions
            logits = model(x_batch_train, training=True)
            # Compute loss
            loss_value = loss_fn(y_batch_train, logits)
        
        # Compute gradients
        grads = tape.gradient(loss_value, model.trainable_weights)
        # Apply gradients to update model weights
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        
        # Update the accuracy metric
        accuracy_metric.update_state(y_batch_train, logits)

        # Log the loss and accuracy every 200 steps
        if step % 200 == 0:
            print(f'Epoch {epoch + 1} Step {step}: Loss = {loss_value.numpy()} Accuracy = {accuracy_metric.result().numpy()}')
    
    # Reset the metric at the end of each epoch
    accuracy_metric.reset_state()
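
The four steps of the loop above (forward pass, loss, gradients, weight update) can also be seen in a framework-free sketch. This is an illustration on a made-up one-parameter regression problem with a hand-derived gradient and plain SGD, standing in for what `tf.GradientTape` and the optimizer do.

```python
# Fit y = w * x with a hand-written training loop
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy samples, true w = 2
w = 0.0                                       # the single trainable weight
learning_rate = 0.05

for epoch in range(100):
    for x, y in data:                         # "batches" of size 1
        pred = w * x                          # forward pass
        loss = (pred - y) ** 2                # squared-error loss
        grad = 2.0 * (pred - y) * x           # d(loss)/dw, derived by hand
        w -= learning_rate * grad             # plain SGD update

print(f"learned w = {w:.6f}, last loss = {loss:.2e}")
```

Because every step is explicit, any of them can be swapped out, which is the flexibility the advantages list refers to.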

### Custom callbacks

In [None]:
from tensorflow.keras.callbacks import Callback 

# Assumes `model` has already been compiled with metrics=['accuracy'],
# so that "loss" and "accuracy" are available in `logs`

class CustomCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print(f'End of epoch {epoch + 1}, loss: {logs.get("loss")}, accuracy: {logs.get("accuracy")}')
        
model.fit(X_train, y_train, epochs=10, callbacks=[CustomCallback()]) # on_epoch_end runs after every epoch

## Other Features 

### Save and load Keras models

In [None]:
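from tensorflow import keras

model.save('filename.h5')  # Legacy HDF5 format; recent Keras versions recommend the native .keras format
pretrained_model = keras.models.load_model('filename.h5')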

### Mixed precision training
Mixed precision training involves using both 16-bit and 32-bit floating-point types to speed up training on modern GPUs, leading to faster computation and reduced memory usage. 

In [None]:
from keras import mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
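
The trade-off behind mixed precision can be illustrated with the standard library alone. The sketch below rounds Python floats to 16-bit half precision using `struct`; the sample values are arbitrary and only meant to show the limited precision and range of float16.

```python
import struct

def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', value))[0]

rounded = to_float16(0.1)          # float16 keeps only about 3 significant decimal digits
tiny_gradient = to_float16(1e-8)   # too small for float16: underflows to 0.0

print(f"0.1  -> {rounded}")
print(f"1e-8 -> {tiny_gradient}")
```

This is why the policy is "mixed": computations run in float16 for speed, while variables stay in float32 so that small updates are not lost.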

### Model pruning
Model pruning reduces the number of parameters in a model by removing less significant connections or neurons, making it more efficient without a substantial loss in accuracy. 

In [None]:
# !pip -q install tensorflow-model-optimization
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Apply pruning to model
pruning_params = {'pruning_schedule':
                  tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.0,
                                                       final_sparsity=0.5,
                                                       begin_step=0,
                                                       end_step=2000)}

model_pruned = prune_low_magnitude(model, **pruning_params)
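
What the schedule and the pruning step do can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the decay exponent of 3 is an assumption, and real pruning operates on weight tensors layer by layer rather than on a flat list.

```python
def polynomial_decay_sparsity(step, initial_sparsity=0.0, final_sparsity=0.5,
                              begin_step=0, end_step=2000, power=3):
    """Target sparsity at a training step: rises quickly at first, then levels off."""
    step = min(max(step, begin_step), end_step)
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power

def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights, smallest magnitudes first."""
    num_to_prune = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:num_to_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

weights = [0.1, -0.5, 0.05, 2.0]              # toy weight vector
target = polynomial_decay_sparsity(2000)      # sparsity at the end of the schedule
pruned = prune_by_magnitude(weights, target)
print(target, pruned)
```

Ramping the sparsity up gradually gives the remaining weights time to compensate for the ones that were removed.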

### Quantization
Quantization reduces the precision of the numbers used to represent the model's weights, which helps when deploying models on edge devices by reducing memory usage and inference time.

In [None]:
import tensorflow as tf  
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()
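
The idea behind weight quantization can be sketched without TensorFlow. The example below uses a simple symmetric per-tensor int8 scheme on made-up weights; it is an illustration of the principle, and the scheme the TFLite converter actually applies may differ.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]        # toy float32 weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}")
```

Each weight now needs 1 byte instead of 4, at the cost of a rounding error of at most half the scale factor per weight.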