In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers

"""
Feedforward Neural Network (FNN) Example with Dropout Regularization using TensorFlow:

Input Data:
- X_train: Input features (2-dimensional array)
- y_train: Target labels (binary classification, 1-dimensional array)

Model Architecture:
- Input Layer: 2 neurons corresponding to the input features
- Hidden Layer 1: Dense layer with 64 neurons and ReLU activation function
- Dropout Layer 1: Regularization layer with dropout rate of 0.5
- Hidden Layer 2: Dense layer with 32 neurons and ReLU activation function
- Dropout Layer 2: Regularization layer with dropout rate of 0.5
- Output Layer: Dense layer with 1 neuron and sigmoid activation function (binary classification)

Training Parameters:
- Loss function: Binary crossentropy
- Optimizer: Adam optimizer
- Regularization: L2 regularization with a rate of 0.001 applied to the kernel weights of the two hidden Dense layers (the output layer is not regularized)
- Batch size: 4
- Number of epochs: 100

Benefits:
- Dropout helps prevent overfitting by randomly setting a fraction of each layer's input units to 0 at every training step; it is disabled at inference time.
- L2 regularization penalizes large weights in the model, promoting simpler models and reducing overfitting.
- The Adam optimizer adapts the step size for each parameter and usually trains well with its default settings.
- ReLU activation function is commonly used in hidden layers of neural networks and helps alleviate the vanishing gradient problem.
- Sigmoid activation function in the output layer is suitable for binary classification tasks.

Alternatives:
- Regularization: Besides L2 regularization, alternatives like L1 regularization or a combination of both can be used.
- Activation functions: Other activation functions like tanh or Leaky ReLU can be experimented with.
- Optimizers: Besides Adam optimizer, alternatives like SGD (Stochastic Gradient Descent) or RMSprop can be used.
- Model architecture: Experimenting with different numbers of layers and neurons can help find the optimal architecture for the given task.

"""








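# Create sample data: the four XOR input/label pairs.
# Tensors are used instead of nested Python lists, which some Keras versions do not accept in fit().
X_train = tf.constant([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=tf.float32)
y_train = tf.constant([[0], [1], [1], [0]], dtype=tf.float32)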








# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(2,), kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.5),
    Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

"""
Activation Functions for Neural Networks:

Activation functions introduce non-linearity into neural network models, enabling them to learn complex patterns and relationships in the data. Different activation functions have unique properties that impact the model's training dynamics, convergence behavior, and performance.

1. Sigmoid:
   - Description: The sigmoid function squashes input values into the range (0, 1).
   - Use Cases: Sigmoid activation is commonly used in the output layer of binary classification tasks.
   - Alternatives: Other activation functions like Tanh or ReLU are preferred in hidden layers due to the vanishing gradient problem.
   - Pros: Sigmoid activation produces probabilities that are interpretable as class probabilities in binary classification tasks.
   - Cons: Sigmoid activation suffers from the vanishing gradient problem, leading to slow convergence and potential saturation of gradients.

2. Tanh (Hyperbolic Tangent):
   - Description: The tanh function squashes input values into the range (-1, 1).
   - Use Cases: Tanh activation is often used in hidden layers of neural networks.
   - Alternatives: Tanh can be replaced by other activation functions like ReLU or Leaky ReLU to mitigate the vanishing gradient problem.
   - Pros: Tanh output is zero-centered and its gradient near zero is larger than sigmoid's, which often makes optimization easier than with sigmoid hidden units.
   - Cons: Tanh activation also suffers from the vanishing gradient problem, especially in deep networks with many layers.

3. ReLU (Rectified Linear Unit):
   - Description: ReLU function returns 0 for negative input values and the input value itself for positive values.
   - Use Cases: ReLU activation is widely used in hidden layers of deep neural networks.
   - Alternatives: Leaky ReLU, Parametric ReLU (PReLU), and Exponential Linear Unit (ELU) are variations of ReLU that address some of its limitations.
   - Pros: ReLU activation accelerates the convergence of training due to its non-saturating nature and sparsity of activation.
   - Cons: ReLU can suffer from the dying ReLU problem, where neurons become inactive and stop updating their weights during training.

4. Leaky ReLU:
   - Description: Leaky ReLU allows a small, non-zero gradient for negative input values to prevent neurons from becoming completely inactive.
   - Use Cases: Leaky ReLU is used as an alternative to ReLU to mitigate the dying ReLU problem.
   - Alternatives: Parametric ReLU (PReLU) and Exponential Linear Unit (ELU) are other alternatives that address similar issues.
   - Pros: Leaky ReLU keeps a non-zero gradient for negative inputs, so units can recover instead of dying, which can make training more stable.
   - Cons: Leaky ReLU introduces an additional hyperparameter (the negative slope) that may need tuning.

5. Softmax:
   - Description: Softmax function computes the probability distribution over multiple classes, ensuring that the sum of probabilities equals 1.
   - Use Cases: Softmax activation is used in the output layer of multi-class classification tasks.
   - Alternatives: Sigmoid activation can be used in binary classification tasks, while softmax is preferred for multi-class classification.
   - Pros: Softmax activation provides a probability distribution over multiple classes, making it suitable for multi-class classification.
   - Cons: Softmax activation can suffer from numerical instability when dealing with large or very small input values, leading to potential overflow or underflow issues.

6. Linear:
   - Description: Linear activation function returns the input value itself without any transformation.
   - Use Cases: Linear activation is commonly used in the output layer for regression tasks.
   - Alternatives: Sigmoid or softmax activations are used for classification tasks.
   - Pros: Linear activation produces unbounded output values, making it suitable for regression tasks where the target values are continuous.
   - Cons: Linear activation adds no non-linearity: a stack of layers with only linear activations collapses to a single linear map, so it cannot model non-linear decision boundaries on its own.
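
As a rough illustration of the definitions above, the following NumPy sketch implements sigmoid, ReLU, Leaky ReLU, and a softmax that subtracts the largest logit before exponentiating, a common way to avoid the overflow mentioned under Softmax. The function names and the 0.01 negative slope are illustrative assumptions, not part of the model in this notebook.

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but negative inputs keep a small slope instead of becoming 0.
    return np.where(x > 0, x, negative_slope * x)

def softmax(logits):
    # Subtracting the max leaves the result unchanged but keeps exp() from overflowing.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))
print(relu(x))
print(leaky_relu(x))

# Large logits would overflow a naive exp(); the shifted version stays finite.
large_logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(large_logits))
```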

Kernel Regularization Techniques for Neural Networks:

Regularization methods are used in neural networks to prevent overfitting, either by penalizing large weights (kernel regularizers such as L1 and L2) or by injecting noise during training (dropout). These techniques help improve the generalization performance of the model by reducing its effective complexity and discouraging it from memorizing the training data.

1. L1 Regularization (Lasso):
   - Description: L1 regularization adds the absolute value of the weights as a penalty term to the loss function.
   - Use Cases: L1 regularization is effective when the goal is to induce sparsity in the model, i.e., to encourage some weights to be exactly zero.
   - Alternatives: L2 regularization, Elastic Net regularization.
   - Pros: L1 regularization can lead to sparse solutions, making the model more interpretable and memory-efficient.
   - Cons: L1 regularization tends to select only a subset of features, which may discard potentially useful information and reduce model accuracy.

2. L2 Regularization (Ridge):
   - Description: L2 regularization adds the squared magnitude of the weights as a penalty term to the loss function.
   - Use Cases: L2 regularization is widely used to prevent overfitting in neural networks and is effective in improving model generalization.
   - Alternatives: L1 regularization, Elastic Net regularization.
   - Pros: The L2 penalty is differentiable everywhere and shrinks all weights in proportion to their size, which tends to give smoother and more stable training than L1 regularization.
   - Cons: L2 regularization may not induce sparsity as effectively as L1 regularization, and the resulting models may be less interpretable.

3. Elastic Net Regularization:
   - Description: Elastic Net regularization combines L1 and L2 regularization by adding both the absolute and squared magnitudes of the weights to the loss function.
   - Use Cases: Elastic Net regularization is useful when there are many correlated features, as it can select groups of correlated features together.
   - Alternatives: L1 regularization, L2 regularization.
   - Pros: Elastic Net regularization combines the benefits of L1 and L2 regularization, providing a balance between sparsity and smoothness in the learned weights.
   - Cons: Elastic Net regularization introduces an additional hyperparameter (mixing ratio) that needs to be tuned, increasing the complexity of the model.

4. Dropout:
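   - Description: Dropout randomly sets a fraction of a layer's input units to zero at each training step to prevent co-adaptation of neurons and encourage robustness. In Keras, the remaining units are scaled by 1 / (1 - rate) during training, and the layer does nothing at inference time.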
   - Use Cases: Dropout is commonly used in deep neural networks, especially when dealing with large datasets and complex architectures.
   - Alternatives: Weight decay, Batch normalization.
   - Pros: Dropout acts as a form of ensemble learning by training multiple subnetworks, leading to improved generalization and reduced overfitting.
   - Cons: Dropout adds noise to the gradient updates, so networks typically need more epochs to converge, and the dropout rate requires tuning.
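
To make the L1, L2, and Elastic Net penalty terms concrete, here is a minimal NumPy sketch that computes each penalty for a small weight matrix. The helper names and the example weights are assumptions made for illustration; the formulas follow the form rate * sum(|w|) and rate * sum(w^2), which is, to my understanding, how the Keras regularizers l1, l2, and l1_l2 define their penalties.

```python
import numpy as np

def l1_penalty(weights, rate):
    # Sum of absolute weight values, scaled by the regularization rate.
    return rate * np.sum(np.abs(weights))

def l2_penalty(weights, rate):
    # Sum of squared weight values, scaled by the regularization rate.
    return rate * np.sum(np.square(weights))

def elastic_net_penalty(weights, l1_rate, l2_rate):
    # Elastic Net simply adds the two penalties together.
    return l1_penalty(weights, l1_rate) + l2_penalty(weights, l2_rate)

# Hypothetical 2x2 kernel used only for this example.
W = np.array([[0.5, -1.0], [2.0, 0.0]])

print(l1_penalty(W, 0.001))
print(l2_penalty(W, 0.001))
print(elastic_net_penalty(W, 0.001, 0.001))
```

The penalty is added to the data loss, so a larger rate pulls the weights more strongly toward zero.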
"""

"""
Layers in Feedforward Neural Networks (FNN):

1. Dense Layer:
   - Description: Dense layer, also known as a fully connected layer, connects each neuron in the current layer to every neuron in the next layer.
   - Pros:
     - Versatile: Can model complex relationships between input and output.
     - Suitable for various tasks: Can be used for classification, regression, and other tasks.
     - Regularization: Can help prevent overfitting with dropout, L1/L2 regularization, etc.
   - Cons:
     - Parameter-heavy: Requires a large number of parameters, leading to longer training times and potential overfitting.
     - Lack of interpretability: Due to the dense connectivity, understanding the role of individual neurons can be challenging.
   - Use Cases:
     - Image Classification: Dense layers commonly form the final classification head, typically after convolutional feature extractors.
     - Natural Language Processing (NLP): Used in tasks such as sentiment analysis, text classification, etc.
     - Regression Tasks: Suitable for predicting continuous variables, such as housing prices, stock prices, etc.

2. Dropout Layer:
   - Description: Dropout layer randomly sets a fraction of input units to zero during training to prevent overfitting.
   - Pros:
     - Regularization: Helps prevent overfitting by reducing interdependence between neurons.
     - Improved Generalization: Forces the network to learn more robust features.
     - Easy to Implement: Simply insert Dropout layers into the model architecture.
   - Cons:
     - Increased Training Time: Dropout increases training time since the network needs more epochs to converge.
     - Decreased Network Capacity: Dropout can reduce the effective capacity of the network.
   - Use Cases:
     - Deep Neural Networks: Used in conjunction with dense layers in deep neural networks to prevent overfitting.
     - Large Networks: Effective for large networks with many parameters.
     - Classification Tasks: Commonly applied in classification tasks to improve generalization.

3. Activation Layers (e.g., ReLU, Sigmoid, Tanh):
   - Description: Activation layers introduce non-linearity to the network, allowing it to model complex relationships in the data.
   - Pros:
     - Introduce Non-Linearity: Enable the network to learn complex mappings between inputs and outputs.
     - Computationally Efficient: Many activation functions are computationally efficient to compute.
     - Differentiability: Activation functions are differentiable almost everywhere (ReLU, for example, is not differentiable at 0), which allows gradient-based optimization.
   - Cons:
     - Vanishing/Exploding Gradient Problem: Some activation functions can suffer from vanishing or exploding gradients.
     - Saturation: Activation functions may saturate in certain regions, leading to slow learning.
   - Use Cases:
     - ReLU: Widely used in deep learning architectures due to its simplicity and effectiveness.
     - Sigmoid: Used in binary classification tasks where the output needs to be between 0 and 1.
     - Tanh: Commonly used in recurrent neural networks (RNNs) and LSTMs.
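
The behaviour of a dropout layer can be sketched in a few lines of NumPy. This is a simplified illustration of "inverted" dropout (kept units are scaled up during training so that no rescaling is needed at inference), not the Keras implementation itself; the function name and arguments are assumptions.

```python
import numpy as np

def dropout(x, rate, training, rng):
    # At inference time (or with rate 0) the layer passes inputs through unchanged.
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(x.shape) < keep_prob
    # Kept units are scaled by 1 / keep_prob so the expected activation is unchanged.
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((2, 8))

train_out = dropout(x, rate=0.5, training=True, rng=rng)
infer_out = dropout(x, rate=0.5, training=False, rng=rng)

print(train_out)  # entries are either 0.0 or 2.0
print(infer_out)  # identical to x
```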

"""












# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

"""
Optimizers for Neural Networks:

Optimizers are algorithms used to update the parameters (weights and biases) of neural network models during the training process. They play a crucial role in determining how quickly and effectively the model learns from the training data.

1. Adam (Adaptive Moment Estimation):
   - Description: Adam is an adaptive learning rate optimization algorithm that computes adaptive learning rates for each parameter. It combines ideas from both RMSprop and Momentum methods.
   - Use Cases: Adam is widely used in training deep neural networks for various tasks such as image classification, natural language processing, and reinforcement learning.
   - Alternatives: Other popular optimizers include Stochastic Gradient Descent (SGD), RMSprop, and AdaGrad.
   - Pros: Adam usually converges faster and is less sensitive to the choice of hyperparameters compared to other optimization algorithms. It also performs well on a wide range of problems with minimal hyperparameter tuning.
   - Cons: Adam requires more memory than plain SGD because it stores two moving-average estimates (first and second moments) for every parameter. Its adaptive steps can still overshoot a minimum if the base learning rate is too high.

2. Stochastic Gradient Descent (SGD):
   - Description: SGD is a classic optimization algorithm used to minimize the loss function by updating the model parameters in the direction of the negative gradient of the loss with respect to the parameters.
   - Use Cases: SGD is commonly used in training neural networks, especially in settings where computational resources are limited or memory constraints exist.
   - Alternatives: Variants of SGD such as Mini-batch SGD, Momentum SGD, and Nesterov Accelerated Gradient (NAG) can be used to improve convergence and stability.
   - Pros: SGD is simple, and each update is cheap in both compute and memory, making it suitable for large-scale machine learning tasks. Without momentum, it keeps no per-parameter state beyond the weights themselves.
   - Cons: SGD can converge slowly, is sensitive to the learning rate, and can stall in local minima, saddle points, or flat regions, especially in high-dimensional parameter spaces.

3. RMSprop (Root Mean Square Propagation):
   - Description: RMSprop is an adaptive learning rate optimization algorithm that divides the learning rate for a weight by the square root of a running average of recent squared gradients for that weight.
   - Use Cases: RMSprop is effective for training deep neural networks, particularly in scenarios where the gradients vary widely across different dimensions of the parameter space.
   - Alternatives: Adam, SGD, and AdaGrad are alternative optimization algorithms that can be used in place of RMSprop.
   - Pros: RMSprop adapts the learning rate independently for each parameter, making it well-suited for non-stationary and noisy objective functions.
   - Cons: RMSprop may require more tuning of hyperparameters compared to Adam. It can also converge prematurely in some cases, leading to suboptimal solutions.

4. AdaGrad (Adaptive Gradient Algorithm):
   - Description: AdaGrad is an adaptive learning rate optimization algorithm that adapts the learning rates of model parameters based on the historical gradients accumulated for each parameter.
   - Use Cases: AdaGrad is suitable for training models with sparse data or when dealing with features that have different frequencies in the input data.
   - Alternatives: Adam, RMSprop, and SGD are alternative optimization algorithms that can be used instead of AdaGrad.
   - Pros: AdaGrad automatically scales the learning rates of model parameters based on their historical gradients, which can improve convergence on convex optimization problems.
   - Cons: AdaGrad may become too conservative in later iterations, resulting in slow convergence. It also accumulates squared gradients over time, potentially leading to vanishing learning rates for frequently occurring features.
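
To show how the update rules differ, the sketch below applies one plain SGD step and one Adam step to the toy objective f(x) = x^2. It follows the standard published Adam update with bias correction; the function names, the learning rate of 0.1, and the starting point are assumptions chosen for readability.

```python
import numpy as np

def sgd_step(param, grad, lr=0.1):
    # Move against the gradient by a fixed step size.
    return param - lr * grad

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update the running averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# f(x) = x^2 has gradient 2x.
x0 = 5.0
grad = 2.0 * x0

x_sgd = sgd_step(x0, grad)
x_adam, m, v = adam_step(x0, grad, m=0.0, v=0.0, t=1)

print(x_sgd)   # step size scales with the gradient magnitude
print(x_adam)  # first Adam step has magnitude close to lr, regardless of gradient scale
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, which is why Adam moves by roughly the learning rate while SGD moves by lr times the gradient.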

Loss Functions for Neural Networks:

Loss functions, also known as objective functions or cost functions, are used to quantify the difference between the predicted output of a neural network model and the actual target values during the training process. They play a crucial role in guiding the optimization algorithm to update the model parameters to minimize the error.

1. Mean Squared Error (MSE):
   - Description: MSE computes the average of the squared differences between the predicted and actual target values.
   - Use Cases: MSE is commonly used in regression tasks, where the model predicts continuous numeric values.
   - Alternatives: Other regression loss functions such as Mean Absolute Error (MAE) and Huber loss can be used as alternatives to MSE.
   - Pros: MSE is smooth and differentiable everywhere, and it penalizes large errors more heavily than small ones, which suits tasks where large deviations are especially undesirable.
   - Cons: MSE is sensitive to outliers, because a few large errors can dominate the loss, and its value depends on the scale of the target variable.

2. Binary Cross-Entropy Loss (Binary CE):
   - Description: Binary CE computes the cross-entropy loss between the predicted probabilities and the binary (0/1) target labels.
   - Use Cases: Binary CE is commonly used in binary classification tasks, where the model predicts the probability of a binary outcome.
   - Alternatives: Other binary classification loss functions such as Hinge loss and Squared Hinge loss can be used for linear classifiers.
   - Pros: Binary CE encourages the model to produce high probabilities for the correct class, making it effective for training binary classifiers.
   - Cons: Binary CE grows without bound for confident but wrong predictions, so a few mislabeled or hard examples can dominate the loss. Implementations also need to clip the predicted probabilities to avoid computing log(0).

3. Categorical Cross-Entropy Loss (Categorical CE):
   - Description: Categorical CE computes the cross-entropy loss between the predicted probabilities and the one-hot encoded target labels.
   - Use Cases: Categorical CE is commonly used in multi-class classification tasks, where the model predicts the probability distribution over multiple classes.
   - Alternatives: Other multi-class classification loss functions such as Sparse Categorical Cross-Entropy and Kullback-Leibler divergence can be used for similar tasks.
   - Pros: Categorical CE encourages the model to produce sharp and accurate probability distributions over multiple classes, making it suitable for multi-class classification.
   - Cons: Categorical CE requires one-hot encoded target labels, which increases memory use when there are many classes; Sparse Categorical Cross-Entropy accepts integer labels instead.

4. Huber Loss:
   - Description: Huber loss is a combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE), where the loss is quadratic for small errors and linear for large errors.
   - Use Cases: Huber loss is commonly used in regression tasks, particularly when the dataset contains outliers or noisy data points.
   - Alternatives: Other robust regression loss functions such as Quantile loss and Tukey's biweight loss can be used for similar purposes.
   - Pros: Huber loss is less sensitive to outliers compared to MSE, making it more robust in the presence of noisy data.
   - Cons: Huber loss requires tuning of a hyperparameter (delta) to balance between the quadratic and linear loss components, which can affect model performance.
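
The loss definitions above can be written out directly. This NumPy sketch is meant as an illustration of the formulas rather than a replacement for the Keras losses; the function names, the clipping constant, and the sample values are assumptions.

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences.
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip probabilities so that log(0) is never evaluated.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for errors up to delta, linear beyond it.
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

y_true = np.array([0.0, 0.0])
y_pred = np.array([0.5, 3.0])

print(mse(y_true, y_pred))    # dominated by the large error
print(huber(y_true, y_pred))  # the large error contributes only linearly

labels = np.array([1.0, 0.0])
probs = np.array([0.5, 0.5])
print(binary_cross_entropy(labels, probs))  # equals ln(2) for a 50/50 guess
```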

Evaluation Metrics for Neural Networks:

Evaluation metrics are used to assess the performance of a neural network model on a specific task, such as classification or regression. These metrics provide valuable insights into the model's predictive accuracy, generalization capability, and overall effectiveness.

1. Accuracy:
   - Description: Accuracy measures the proportion of correctly classified samples out of the total number of samples.
   - Use Cases: Accuracy is commonly used in classification tasks to evaluate the overall correctness of the model's predictions.
   - Alternatives: Other classification metrics such as Precision, Recall, F1 Score, and Area Under the ROC Curve (AUC-ROC) can provide additional insights into the model's performance.
   - Pros: Accuracy provides a simple and intuitive measure of the model's classification performance, making it easy to interpret.
   - Cons: Accuracy may not be suitable for imbalanced datasets, where the class distribution is skewed, as it can be biased towards the majority class.

2. Precision:
   - Description: Precision measures the proportion of true positive predictions among all positive predictions made by the model.
   - Use Cases: Precision is useful in scenarios where the cost of false positive predictions is high, such as spam filtering, where flagging a legitimate message is costly.
   - Alternatives: Precision is often used in conjunction with Recall to compute the F1 Score, which provides a balanced measure of both precision and recall.
   - Pros: Precision focuses on the accuracy of positive predictions, making it valuable in scenarios where false positives are costly.
   - Cons: Precision alone may not provide a complete picture of the model's performance, especially in cases where false negatives are also important.

3. Recall (Sensitivity):
   - Description: Recall measures the proportion of true positive predictions among all actual positive instances in the dataset.
   - Use Cases: Recall is important in scenarios where the cost of false negative predictions is high, such as disease detection or anomaly detection.
   - Alternatives: Recall is often used in conjunction with Precision to compute the F1 Score, which provides a balanced measure of both precision and recall.
   - Pros: Recall focuses on the ability of the model to correctly identify positive instances, making it valuable in scenarios where false negatives are critical.
   - Cons: Recall alone may not provide a complete picture of the model's performance, especially in cases where false positives are also important.

4. Mean Squared Error (MSE):
   - Description: MSE measures the average of the squared differences between the predicted and actual target values.
   - Use Cases: MSE is commonly used in regression tasks to quantify the model's predictive accuracy and goodness of fit.
   - Alternatives: Other regression evaluation metrics such as Mean Absolute Error (MAE), R-squared (Coefficient of Determination), and Root Mean Squared Error (RMSE) can provide alternative measures of regression performance.
   - Pros: MSE penalizes large prediction errors more heavily than small ones, which is useful when large errors are especially costly.
   - Cons: MSE is influenced by the scale of the target variable, which can make it difficult to compare across different datasets or models.
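
As a small worked example of the classification metrics above, the sketch below derives accuracy, precision, and recall from the confusion-matrix counts. The function name and the sample labels are made up for illustration.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion-matrix counts for the positive class (label 1).
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when there are no positive predictions or labels.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 6/8 while precision and recall are both 2/3, which shows how accuracy alone can look better than the positive-class performance when negatives are the majority.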
"""





# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=4, verbose=1)

"""
Model Fit Parameters:

1. x: Input data.
   - Description: The input data used for training the model. It could be a NumPy array, a TensorFlow tensor, or a tf.data.Dataset.
   - Example: x_train (NumPy array), or a tf.data.Dataset object yielding (inputs, targets) batches.

2. y: Target data (labels).
   - Description: The target data (labels) corresponding to the input data. It could be a NumPy array or a TensorFlow tensor.
   - Example: y_train (NumPy array). When x is a tf.data.Dataset or generator, y must be omitted because the targets come from x.

3. batch_size: Number of samples per gradient update.
   - Description: Defines the number of samples that will be propagated through the network before a gradient update is performed.
   - Example: 32, 64, 128.

4. epochs: Number of epochs to train the model.
   - Description: An epoch is one complete pass through the entire training dataset.
   - Example: 10, 20, 50.

5. verbose: Verbosity mode.
   - Description: Controls the amount of information printed during training.
   - Options: 0 (silent), 1 (progress bar), 2 (one line per epoch).
   - Example: 1.

6. callbacks: List of callbacks to apply during training.
   - Description: Callbacks are objects that perform actions at various stages of training.
   - Example: [ModelCheckpoint(), EarlyStopping(), TensorBoard()].

7. validation_data: Data on which to evaluate the model at the end of each epoch.
   - Description: It can be a tuple (x_val, y_val) or a data generator.
   - Example: (x_val, y_val), validation_data_generator.

8. validation_split: Fraction of the training data to be used as validation data.
   - Description: The model sets apart this fraction of the training data, taken from the end of the provided arrays before shuffling, and evaluates on it at the end of each epoch.
   - Example: 0.1 (10% of training data used for validation).

9. shuffle: Whether to shuffle the training data before each epoch.
   - Description: If True, the training data will be shuffled before each epoch.
   - Example: True, False.

10. initial_epoch: Epoch at which to start training.
    - Description: Useful for resuming a previous training run.
    - Example: 0, 5 (start training from the 5th epoch).

11. steps_per_epoch: Number of steps (batches) to yield from the generator before declaring one epoch finished.
    - Description: It can be used when training with a generator.
    - Example: 100, 200.

12. validation_steps: Number of steps (batches) to yield from the validation data generator at the end of each epoch.
    - Description: It can be used when validating with a generator.
    - Example: 50, 100.

Callbacks:

Callbacks are objects that can perform actions at various stages during training, such as at the start or end of each epoch, before or after a batch is processed, etc. They provide a way to customize the behavior of the training process without modifying the model itself.

Options:
- ModelCheckpoint: Saves the model after every epoch or only when an improvement is observed on the validation set.
- EarlyStopping: Stops training when a monitored metric has stopped improving on the validation data.
- TensorBoard: Logs training metrics and visualizes them using TensorBoard.
- LearningRateScheduler: Dynamically adjusts the learning rate during training.
- ReduceLROnPlateau: Reduces the learning rate when a metric has stopped improving.
- CSVLogger: Streams epoch results to a CSV file.
- RemoteMonitor: Streams epoch results to a server for remote monitoring.
- LambdaCallback: Allows you to define custom callback functions on-the-fly.
- Custom Callbacks: You can create your own custom callback by subclassing the keras.callbacks.Callback class.

Use Cases:
- ModelCheckpoint: Useful for saving the model's weights during training to ensure no progress is lost.
- EarlyStopping: Prevents overfitting by stopping training when the model's performance on the validation data starts to degrade.
- TensorBoard: Provides visualization tools to monitor training and helps in debugging model performance.
- LearningRateScheduler: Allows for dynamic adjustment of learning rate based on training progress.
- ReduceLROnPlateau: Helps in fine-tuning the learning rate to accelerate training convergence.
- CSVLogger: Keeps track of training history for later analysis.
- RemoteMonitor: Useful for monitoring training progress remotely.
- LambdaCallback: Provides flexibility to define custom actions at different stages of training.
- Custom Callbacks: Allows for implementing complex behaviors during training, such as custom learning rate schedules, data augmentation, etc.

Pros:
- Enhances flexibility by allowing customized behavior during training.
- Helps in implementing advanced training strategies like learning rate scheduling, model checkpointing, etc.
- Facilitates monitoring and visualization of training metrics.
- Enables early stopping to prevent overfitting and save computational resources.

Cons:
- May increase complexity in code due to the need to define custom callback functions.
- Poorly implemented callbacks may lead to unexpected behavior or errors in training.
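
A minimal sketch of passing a callback to fit() is shown below, assuming TensorFlow 2.x is installed. It rebuilds a small stand-in model on the four XOR samples rather than reusing the notebook's model; the names demo_model and early_stop, the layer sizes, and the patience value are illustrative assumptions. Because this toy example has no validation data, EarlyStopping monitors the training loss instead of val_loss.

```python
import numpy as np
import tensorflow as tf

# XOR toy data, matching the four samples used earlier in this notebook.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([[0], [1], [1], [0]], dtype="float32")

# Small stand-in model (hypothetical sizes), kept separate from the notebook's model.
demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
demo_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop if the training loss has not improved for 5 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)

history = demo_model.fit(
    X, y,
    epochs=20,
    batch_size=4,
    verbose=0,
    callbacks=[early_stop],
)

# history.history holds one value per completed epoch for each tracked quantity.
print(len(history.history["loss"]), "epochs run")
```

With real data, monitoring val_loss together with validation_data or validation_split is the more usual setup.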

"""





# Evaluate the model (on the training data, since this toy example has no separate test set)
loss, accuracy = model.evaluate(X_train, y_train)
print(f'Loss: {loss}, Accuracy: {accuracy}')


# Make predictions
predictions = model.predict(X_train)
print("Predictions:")
print(predictions)

Epoch 1/100
...
Epoch 78

In [None]:
"""
Modified Feedforward Neural Network (FNN) with Hyperparameter Tuning:

1. Hyperparameter Tuning:
   - Description: Hyperparameter tuning involves optimizing the model's hyperparameters to improve performance.
   - Use Cases:
     - Enhancing Model Performance: Improves model accuracy, generalization, and convergence speed.
     - Preventing Overfitting: Helps prevent overfitting by optimizing regularization parameters.
   - Alternatives:
     - Grid Search: Exhaustively searches a predefined hyperparameter space.
     - Random Search: Randomly samples hyperparameters from a distribution.
     - Bayesian Optimization: Uses probabilistic models to search for optimal hyperparameters.

2. Number of Layers:
   - Description: Specifies the number of hidden layers in the neural network architecture.
   - Use Cases:
     - Model Complexity: Determines the complexity and capacity of the model.
     - Feature Representation: Controls the level of abstraction and representation power of the network.
   - Alternatives:
     - Fewer Layers: Simpler models with fewer layers may generalize better on small datasets.
     - More Layers: Deeper networks with more layers can learn complex hierarchical representations.

3. Width of Layers:
   - Description: Defines the number of neurons (units) in each hidden layer of the neural network.
   - Use Cases:
     - Model Capacity: Influences the model's capacity to capture complex patterns in the data.
     - Computational Efficiency: Larger width may increase computational requirements during training.
   - Alternatives:
     - Narrow Layers: Smaller width may reduce overfitting but may also limit the model's expressive power.
     - Wide Layers: Larger width enables the model to learn intricate feature representations but may lead to overfitting.

4. Activation Functions:
   - Description: Activation functions introduce non-linearity to the network, allowing it to learn complex mappings between inputs and outputs.
   - Use Cases:
     - Non-linearity: Facilitates the modeling of non-linear relationships in the data.
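     - Gradient Propagation: The choice of activation affects how well gradients flow during backpropagation; saturating functions such as sigmoid and tanh can cause vanishing gradients.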
   - Alternatives:
     - ReLU (Rectified Linear Unit): Commonly used due to its simplicity and effectiveness in mitigating vanishing gradients.
     - Sigmoid: Suitable for the output layer in binary classification tasks; squashes values into the range (0, 1).
     - Tanh (Hyperbolic Tangent): Similar to sigmoid but zero-centered; squashes values into the range (-1, 1).

5. Dropout Layers:
   - Description: Dropout layers randomly deactivate a fraction of neurons during training only (all neurons are active at inference), reducing overfitting by promoting model generalization.
   - Use Cases:
     - Regularization: Prevents co-adaptation of neurons and encourages robustness in the network.
     - Ensemble Learning: Mimics the effect of training multiple networks with different subsets of neurons.
   - Alternatives:
     - L2 Regularization: Penalizes large weights in the network, discouraging overfitting by reducing model complexity.
     - Batch Normalization: Normalizes activations between layers, improving stability and accelerating convergence.
     - Early Stopping: Halts training when validation performance stops improving, preventing overfitting.
"""
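
# Illustrative sketch (not part of the original notebook): the three activation
# functions named in the notes above, written with only the standard library so
# their output ranges can be checked by hand. The helper names are hypothetical
# and are not used by the model code that follows.
import math

def relu(x):
    """ReLU: passes positive values through unchanged and clips negatives to 0."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Tanh: maps any real input into the open interval (-1, 1), centered at 0."""
    return math.tanh(x)

for value in (-2.0, 0.0, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.3f}  "
          f"sigmoid={sigmoid(value):.3f}  tanh={tanh(value):.3f}")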


import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers
from sklearn.model_selection import GridSearchCV
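# Note: tensorflow.keras.wrappers.scikit_learn was removed in TensorFlow 2.12.
# On newer versions, the SciKeras package (scikeras.wrappers.KerasClassifier) is the
# maintained replacement; its argument names differ (e.g. model= instead of build_fn=).
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier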
from tensorflow.keras.optimizers import Adam

def create_model(layers=2, layer_width=64, activation='relu', dropout_rate=0.5, learning_rate=0.001):
    """
    Function to create a Keras Sequential model with configurable architecture.

    Parameters:
    - layers: Number of hidden layers in the model (default: 2)
    - layer_width: Width of each hidden layer (default: 64)
    - activation: Activation function for hidden layers (default: 'relu')
    - dropout_rate: Dropout rate for dropout layers (default: 0.5)
    - learning_rate: Learning rate for the optimizer (default: 0.001)

    Returns:
    - Keras Sequential model
    """
    model = Sequential()
    # First hidden layer (input_shape declares the 2 input features)
    model.add(Dense(layer_width, activation=activation, input_shape=(2,), kernel_regularizer=regularizers.l2(0.001)))
    model.add(Dropout(dropout_rate))
    # Remaining hidden layers, each followed by dropout
    for _ in range(layers - 1):
        model.add(Dense(layer_width, activation=activation, kernel_regularizer=regularizers.l2(0.001)))
        model.add(Dropout(dropout_rate))
    # Output layer
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model
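
# Illustrative sketch (an assumption-labeled example, not the original author's code):
# what the Dropout layers in create_model do to activations during training, in
# plain Python. Keras' Dropout zeroes inputs at the given rate and scales the
# survivors by 1 / (1 - rate); `inverted_dropout` is a hypothetical helper that
# mimics this on a list of numbers.
import random

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]

demo_rng = random.Random(0)
demo_activations = [1.0] * 10
dropped = inverted_dropout(demo_activations, rate=0.5, rng=demo_rng)
print(dropped)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)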

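# Create sample data: the four XOR input/label pairs. This is a toy dataset;
# with only 4 samples, the cross-validation scores below are not statistically meaningful.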
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [[0], [1], [1], [0]]

# Define hyperparameters for tuning
param_grid = {
    'layers': [1, 2, 3],
    'layer_width': [32, 64, 128],
    'activation': ['relu', 'tanh'],
    'dropout_rate': [0.2, 0.5, 0.8],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [4, 8, 16]
}
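
# Illustrative size check (a sketch; `count_grid_fits` is a hypothetical helper, not
# part of the original notebook). Grid search fits one model per combination per
# CV fold, so the cost grows multiplicatively with every hyperparameter added.
from math import prod

def count_grid_fits(grid, cv_folds):
    """Return (number of combinations, number of model fits) for an exhaustive grid search."""
    n_combinations = prod(len(values) for values in grid.values())
    return n_combinations, n_combinations * cv_folds

# Same value counts as the grid above, evaluated with 3 folds.
demo_grid = {
    'layers': [1, 2, 3],
    'layer_width': [32, 64, 128],
    'activation': ['relu', 'tanh'],
    'dropout_rate': [0.2, 0.5, 0.8],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [4, 8, 16],
}
n_combinations, n_fits = count_grid_fits(demo_grid, cv_folds=3)
print(f"{n_combinations} combinations -> {n_fits} model fits")  # 486 combinations -> 1458 model fits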

# Create KerasClassifier
keras_clf = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)

# Perform grid search cross-validation
grid = GridSearchCV(estimator=keras_clf, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)

# Print best results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import regularizers
from sklearn.model_selection import GridSearchCV
# Requires TensorFlow < 2.12 (this wrapper module was removed in 2.12)
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

def create_model(optimizer=Adam, layers=2, layer_width=64, activation='relu', dropout_rate=0.5, learning_rate=0.001):
    """
    Function to create a Keras Sequential model with configurable architecture and optimizer.

    Parameters:
    - optimizer: Optimizer class, instantiated as optimizer(learning_rate=...) (default: Adam)
    - layers: Number of hidden layers in the model (default: 2)
    - layer_width: Width of each hidden layer (default: 64)
    - activation: Activation function for hidden layers (default: 'relu')
    - dropout_rate: Dropout rate for dropout layers (default: 0.5)
    - learning_rate: Learning rate for the optimizer (default: 0.001)

    Returns:
    - Keras Sequential model
    """
    model = Sequential()
    # First hidden layer (input_shape declares the 2 input features)
    model.add(Dense(layer_width, activation=activation, input_shape=(2,), kernel_regularizer=regularizers.l2(0.001)))
    model.add(Dropout(dropout_rate))
    # Remaining hidden layers, each followed by dropout
    for _ in range(layers - 1):
        model.add(Dense(layer_width, activation=activation, kernel_regularizer=regularizers.l2(0.001)))
        model.add(Dropout(dropout_rate))
    # Output layer
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model
    model.compile(optimizer=optimizer(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Create sample data (the four XOR input/label pairs; a toy dataset for demonstrating the API)
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = [[0], [1], [1], [0]]

# Define hyperparameters for tuning
param_grid = {
    'optimizer': [Adam, RMSprop, SGD],
    'layers': [1, 2, 3],
    'layer_width': [32, 64, 128],
    'activation': ['relu', 'tanh'],
    'dropout_rate': [0.2, 0.5, 0.8],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [8, 16, 32],  # Add batch size options
    'epochs': [100],  # Number of epochs for training
    'validation_split': [0.2]  # Validation split ratio
}

# Create KerasClassifier
keras_clf = KerasClassifier(build_fn=create_model, verbose=0)

# Perform grid search cross-validation
grid = GridSearchCV(estimator=keras_clf, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)

# Print best results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
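
# Illustrative sketch of the random search alternative listed in the notes above
# (hand-rolled with the standard library; `sample_configurations` and
# `demo_search_space` are hypothetical names, not part of the original notebook).
# Instead of trying every combination, it draws a fixed number of configurations
# at random, so the budget is set by n_samples rather than by the grid size.
# Sampling is with replacement, so duplicates are possible.
# scikit-learn's RandomizedSearchCV offers the same idea combined with cross-validation.
import random

def sample_configurations(grid, n_samples, seed=0):
    """Draw `n_samples` configurations, choosing one value per hyperparameter uniformly at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_samples)
    ]

demo_search_space = {
    'layers': [1, 2, 3],
    'layer_width': [32, 64, 128],
    'dropout_rate': [0.2, 0.5, 0.8],
    'learning_rate': [0.001, 0.01, 0.1],
}
for config in sample_configurations(demo_search_space, n_samples=5):
    print(config)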
