# Study Guide

1. **What is the significance of exploring different topologies in an ANN and how do the variations in layers and neurons affect model performance?**

    Exploring different topologies allows for understanding how the complexity of a model influences its ability to capture relationships in the data. Adding more layers or neurons can increase the model's capacity to learn more complex functions. However, it also increases the risk of overfitting, as the model may start to learn the noise in the training data instead of the underlying distribution. The right balance needs to be struck to achieve a model that generalizes well to unseen data.

2. **Why might the hyperparameters learning rate and the number of epochs need to be optimized, and what could be the impact of choosing sub-optimal values for these hyperparameters?**

    The learning rate controls the step size of each weight update as training moves toward a minimum of the loss function. A learning rate that is too high can make the updates overshoot, so the loss oscillates or diverges, or the model settles quickly on a suboptimal solution. A learning rate that is too low makes convergence very slow, which wastes computation, and the model is more likely to stall in a poor local minimum or on a plateau. The number of epochs controls how many times the learning algorithm works through the entire training dataset. Too few epochs can lead to underfitting, while too many can lead to overfitting.

3. **Given the project does not allow the use of learning optimizers, what is the purpose of implementing a CustomOptimizer in the code?**

    The CustomOptimizer is a basic implementation of a gradient descent optimizer without any advanced features such as momentum or adaptive learning rates, which are commonly found in optimizers like Adam or RMSprop. By creating a CustomOptimizer, the project complies with the requirement of not using any pre-built learning optimizer, ensuring a controlled learning process that relies solely on the hyperparameters specified by the experiment's setup.

4. **How does the loss function 'binary_crossentropy' align with the objectives of the ANN models in this classification task?**

    Binary crossentropy is a loss function suited to binary classification tasks. It measures the performance of a classification model whose output is a probability between 0 and 1, and the loss increases as the predicted probability diverges from the actual label. Because it is logarithmic, the penalty grows without bound as the predicted probability approaches the wrong label, so confident but wrong predictions are penalized far more heavily than uncertain ones, which is desirable in a classification task.

5. **Why is the AUC metric used as a benchmark for model performance, and what does an AUC of 0.80 indicate about a model's performance?**

    The Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is a performance measurement for classification problems across all threshold settings. The ROC curve plots the true positive rate against the false positive rate as the threshold varies, and AUC summarizes how well the model separates the two classes. An AUC of 0.80 means that, for a randomly chosen positive and a randomly chosen negative observation, there is an 80% chance that the model assigns the higher score to the positive one. It indicates a good level of discrimination for the model.

6. **What is the purpose of displaying the mean of the loss function versus epochs, and how can this plot indicate overfitting?**

    Plotting the mean loss versus epochs allows us to visualize the learning process of the model. If the loss on the training set keeps decreasing, but the loss on a validation set starts to increase, this is a sign of overfitting - meaning the model is learning the training data too well, including its noise and outliers, at the expense of losing generalization performance.

# Theory

**Topological Variance Impact:**

- Variance with topology: The variance of an ANN's output increases with the network's capacity, which is governed by the number of layers and neurons. With more layers and neurons, the network can capture more complex patterns but also becomes more susceptible to fitting noise in the data (overfitting). The bias-variance trade-off implies that as variance increases, bias decreases, and vice versa; hence, an optimal network architecture seeks to balance these two.
- Universal approximation theorem: This theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n to arbitrary accuracy, given a suitable (non-polynomial) activation function and appropriate weights. The theorem guarantees that such weights exist, not that training will find them or that the required number of neurons is practical. Different topologies, meaning different arrangements and numbers of neurons and layers, therefore affect which functions the ANN can approximate efficiently. The complexity and depth of the network influence its representational power.

**Activation Functions:**

- Gradient flow: Activation functions like sigmoid or tanh can lead to vanishing gradients in deep networks because their derivatives are small (at most 0.25 for sigmoid) and approach zero when a unit saturates. During backpropagation these small factors are multiplied layer by layer, so the gradients reaching the early layers can become insignificant, stalling the training. ReLU, on the other hand, does not saturate in the positive domain, which can help alleviate the vanishing gradient problem and thus support deeper network training.
- Learning dynamics: The choice of activation function influences how an ANN learns. For example, ReLU is zero for all negative inputs, which creates sparse activations (an aspect of regularization) and can speed up training. Sigmoid and tanh functions provide smooth gradients but can saturate: when a neuron's pre-activation input becomes large in magnitude, the derivative is close to zero and the neuron effectively stops learning.
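
A rough numerical illustration of the gradient-flow point above. The sketch assumes every layer sees the same pre-activation value and that all weights equal 1, so only the activation's contribution to the backpropagated gradient is shown; the helper name `gradient_factor` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def gradient_factor(derivative, z, depth):
    """Product of `depth` identical activation derivatives evaluated at z.

    Simplifying assumption: same pre-activation z in every layer and all
    weights equal to 1, so this is only the activation's share of the gradient.
    """
    return derivative(z) ** depth

# z = 0 is the best case for sigmoid (derivative 0.25); it still shrinks fast.
sigmoid_factor = gradient_factor(sigmoid_prime, 0.0, 10)
# Any positive input gives ReLU a derivative of 1, so nothing is lost.
relu_factor = gradient_factor(relu_prime, 1.0, 10)

print(sigmoid_factor, relu_factor)
```

With ten layers the sigmoid factor is already 0.25 to the tenth power, roughly one millionth, while the ReLU factor stays at 1.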

**Learning Rate Sensitivity:**

- Fine-tuning necessity: A learning rate that is too high can cause the model to converge too quickly to a suboptimal solution or oscillate around a minimum, while a rate that is too low can slow down convergence, perhaps to a point where training is impractical. Fine-tuning is essential to find the balance between efficient training and converging to good minima.
- Interaction with initialization: The learning rate and weight initialization scheme must be compatible to ensure stable convergence. Poor initialization can lead to vanishing/exploding gradients, while the right initialization (e.g., Xavier initialization) can preserve the variance of activations across layers, which works in tandem with an appropriate learning rate.

**Epochs and Generalization:**

- Role in generalization: The number of epochs determines how long the ANN will learn from the data. Too few epochs might underfit, while too many can lead to overfitting, especially in deep networks which have a high capacity for learning patterns. A good balance needs to be struck, with techniques like early stopping acting as a form of regularization.
- High epochs and overfitting: Using many epochs without regularization can lead to a model that performs exceedingly well on training data but poorly on unseen data. This is because the model starts to memorize the noise and outliers in the training data rather than learning the underlying distribution.
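
Early stopping, mentioned above as a form of regularization, can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the `patience` parameter, and the validation losses are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch with the best validation loss seen
    before `patience` consecutive epochs fail to improve on it."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Made-up validation losses: improvement until epoch 3, then a steady rise.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.47, 0.50, 0.54, 0.60]
best = early_stopping_epoch(val_losses, patience=3)
print(best)
```

In practice one would keep the weights from the returned epoch rather than those from the final epoch.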

**Optimization Without Advanced Techniques:**

- Avoiding suboptimal local minima: Without advanced techniques, one might use a combination of experience and trial-and-error to adjust the learning rate or employ regularization methods to help the model generalize better. Additionally, stochastic gradient descent, by its nature, introduces noise into the optimization process, which can help escape shallow local minima.
- Optimization landscape effects: Momentum-based techniques navigate the landscape more effectively by accumulating past gradients, and adaptive techniques adjust the learning rate based on past gradients. Adjusting only a single fixed learning rate, without either of these, can mean slower convergence and a higher chance of getting stuck in local minima or on plateaus.

**Loss Function Investigation:**

- Binary crossentropy implications: Binary crossentropy is particularly suited to binary classification because it measures the dissimilarity between two probability distributions, in this case the predicted probabilities and the actual labels. Combined with a sigmoid output, its gradient with respect to the pre-activation simplifies to (y_hat - y), which facilitates efficient learning. In comparison, hinge loss is more suited to "maximum-margin" classification, as in support vector machines.
- Influence on decision boundary: The loss function influences the optimization process and the shape of the decision boundary. Binary crossentropy encourages a model to output probabilities, so the decision boundary is obtained by thresholding a probability (commonly at 0.5). In contrast, hinge loss only penalizes points that are misclassified or fall inside the margin, which pushes the boundary away from the training points without producing probabilities.

**Metrics and Their Interpretations:**

- Relation to errors: Precision is the proportion of true positives among all positive predictions, thus relating to Type I error (false positives). Recall, or sensitivity, measures the proportion of actual positives correctly identified, relating to Type II error (false negatives). These metrics are crucial in imbalanced datasets where the cost of false positives and false negatives is not the same.
- Significance of AUC: AUC (Area Under the ROC Curve) provides a measure of the model's ability to distinguish between the classes at all thresholds, not just a single cutoff like accuracy. This is important in imbalanced datasets or when different trade-offs between true positives and false positives are desired.

**Traceability and Reproducibility:**

- Fixing random state: Fixing the random state ensures that each run of the model or cross-validation process starts with the same initial conditions, such as weight initialization and data shuffling. This is essential for reproducibility, as it allows others to replicate the exact conditions of an experiment, which is foundational for scientific validation.
- Normalization tracking: Different normalization techniques (min-max, z-score, etc.) scale the features in different ways, which can affect the ANN's ability to learn patterns in the data. Tracking the normalization process ensures the reproducibility of results and allows for a better understanding of how the input data's distribution affects the ANN's performance.
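
A small sketch of both points, using only the standard library. The seed value, the toy feature column, and the helper names are assumptions for illustration; a real pipeline would also need to seed whichever framework initializes the network's weights.

```python
import random
import statistics

def fit_zscore(column):
    """Return the (mean, std) that must be recorded to reproduce the scaling."""
    return statistics.fmean(column), statistics.pstdev(column)

def apply_zscore(column, mean, std):
    return [(x - mean) / std for x in column]

random.seed(42)                  # fixed seed: same shuffle on every run
indices = list(range(6))
random.shuffle(indices)

random.seed(42)                  # re-seeding reproduces the same order
indices_again = list(range(6))
random.shuffle(indices_again)

# Toy training feature; the statistics come from training data only and
# would be reused unchanged on validation and test data.
train_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_zscore(train_feature)
scaled = apply_zscore(train_feature, mean, std)

print(indices == indices_again, mean, std)
```

Recording `mean` and `std` alongside the seed is what makes the preprocessing step traceable.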

# Math (ugh)

#### 1. Weight Update Equation in Gradient Descent
The weight update rule in gradient descent is expressed as:

w_new = w_old - learning_rate * gradient_of_J(w_old)

This rule is based on the principle that the gradient of a function gives the direction of the steepest ascent. Therefore, to minimize the loss function J(w), one must move in the opposite direction, i.e., the direction of steepest descent. The 'learning_rate' controls the size of the steps we take towards the minimum. If 'learning_rate' is too large, we may overshoot the minimum; if it's too small, the convergence might be very slow. An inappropriate 'learning_rate' can lead to divergence of the loss function instead of convergence.
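
The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule to J(w) = (w - 3)^2, whose minimum is at w = 3; the loss function and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(grad, w_init, learning_rate, steps):
    """Apply w_new = w_old - learning_rate * grad(w_old) for `steps` iterations."""
    w = w_init
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad_J(w):
    # Gradient of the toy loss J(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w_good = gradient_descent(grad_J, 0.0, learning_rate=0.1, steps=50)
w_slow = gradient_descent(grad_J, 0.0, learning_rate=0.001, steps=50)
w_diverged = gradient_descent(grad_J, 0.0, learning_rate=1.1, steps=50)

print(w_good, w_slow, w_diverged)
```

With a rate of 0.1 the weight ends very close to 3, with 0.001 it has barely moved after 50 steps, and with 1.1 each step overshoots by more than the previous error, so the iterates move away from the minimum.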

#### 2. Backpropagation Chain Rule
The backpropagation algorithm relies on the chain rule of calculus to compute the gradient of the loss function with respect to each weight. The chain rule states that the derivative of a composed function is the product of the derivatives. For a neuron's weight, this can be expressed as:

dJ/dw = (dJ/da) * (da/dz) * (dz/dw)

where J is the loss, a is the neuron's activation, and z is the neuron's input sum. This rule is crucial because it allows the gradient to be propagated backward through layers, allowing for the efficient training of deep networks.
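
As a sanity check, the three chain-rule factors can be multiplied for a single sigmoid neuron with one input and compared against a finite-difference estimate. The choice of a sigmoid activation with binary cross-entropy loss, and the example values of w, x and y, are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    a = sigmoid(w * x)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def analytic_gradient(w, x, y):
    z = w * x                              # neuron's input sum
    a = sigmoid(z)                         # neuron's activation
    dJ_da = -(y / a) + (1 - y) / (1 - a)   # derivative of the loss w.r.t. a
    da_dz = a * (1 - a)                    # sigmoid derivative
    dz_dw = x                              # derivative of z = w * x w.r.t. w
    return dJ_da * da_dz * dz_dw

def numeric_gradient(w, x, y, eps=1e-6):
    # Central finite difference, used only to check the analytic result.
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

w, x, y = 0.5, 2.0, 1.0
g_chain = analytic_gradient(w, x, y)
g_numeric = numeric_gradient(w, x, y)
print(g_chain, g_numeric)
```

For this particular pairing of loss and activation the product collapses to (a - y) * x, which is why the two values agree.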

#### 3. Binary Cross-Entropy Loss Function
The binary cross-entropy loss for a single prediction is given by:

J(w) = -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]

where y is the true label (0 or 1) and y_hat is the predicted probability. The log function intensifies the penalty as the predicted probability diverges from the actual label. This loss is usually preferred over mean squared error (MSE) in classification: MSE corresponds to assuming Gaussian-distributed errors, and when combined with a sigmoid output it gives a loss that is non-convex in the weights and whose gradient vanishes when the output saturates, even for badly wrong predictions.
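
A direct transcription of the formula, as a minimal sketch. The clipping constant `eps` is an addition not present in the formula above; it is there only to avoid taking log(0) when a prediction is exactly 0 or 1.

```python
import math

def binary_crossentropy(y_true, y_hat, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    # Clip the probability so that log() never receives 0 (assumed safeguard).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y_true * math.log(y_hat) + (1 - y_true) * math.log(1.0 - y_hat))

# True label is 1 in all three cases; only the predicted probability changes.
loss_confident_right = binary_crossentropy(1, 0.99)
loss_unsure = binary_crossentropy(1, 0.5)
loss_confident_wrong = binary_crossentropy(1, 0.01)

print(loss_confident_right, loss_unsure, loss_confident_wrong)
```

The loss for the confident wrong prediction is several times larger than for the uncertain one, which is the logarithmic penalty at work.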

#### 4. Activation Function Derivatives
The derivatives for common activation functions are as follows:

- Sigmoid: sigmoid_prime(z) = sigmoid(z) * (1 - sigmoid(z))
- Tanh: tanh_prime(z) = 1 - tanh(z)^2
- ReLU: ReLU_prime(z) = 1 if z > 0 else 0

The computational efficiency of these derivatives is crucial for the speed of training neural networks, as they need to be computed at every step of backpropagation.
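
The derivative formulas above translate almost line for line into code. The finite-difference comparison is included only as a check and is not part of the formulas; the test point z = 0.7 is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu_prime(z):
    # The derivative at exactly z = 0 is undefined; 0 is used by convention here.
    return 1.0 if z > 0 else 0.0

def finite_difference(f, z, eps=1e-6):
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.7
sigmoid_error = abs(sigmoid_prime(z) - finite_difference(sigmoid, z))
tanh_error = abs(tanh_prime(z) - finite_difference(math.tanh, z))
print(sigmoid_error, tanh_error)
```

Note that the sigmoid and tanh derivatives reuse the forward-pass value, so during backpropagation they cost only a multiplication and a subtraction.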

#### 5. Regularization Terms
L2 regularization is added to the loss function to penalize large weights and is defined as:

Omega(w) = lambda * sum(w^2)

This term affects the gradient of the loss function with respect to the weights, leading to an updated gradient:

gradient_of_J(w) = gradient_of_J0(w) + 2 * lambda * w

where J0(w) is the original loss function without regularization. This modified gradient shrinks weights slightly at each update, encouraging simpler models that can generalize better.
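
To isolate the shrinking effect, the sketch below applies one update with the data gradient set to zero, so that only the penalty term acts. The weights, lambda, and learning rate are made-up values.

```python
def l2_penalty(weights, lam):
    """Omega(w) = lambda * sum(w^2)."""
    return lam * sum(w * w for w in weights)

def regularized_gradient(grad_J0, weights, lam):
    """Gradient of J0 plus the gradient of the penalty, 2 * lambda * w."""
    return [g + 2.0 * lam * w for g, w in zip(grad_J0, weights)]

weights = [1.0, -2.0, 0.5]
zero_data_gradient = [0.0, 0.0, 0.0]   # pretend the data term is flat
lam, learning_rate = 0.1, 0.5

grad = regularized_gradient(zero_data_gradient, weights, lam)
shrunk = [w - learning_rate * g for w, g in zip(weights, grad)]
print(l2_penalty(weights, lam), shrunk)
```

Each weight is multiplied by (1 - 2 * lambda * learning_rate), here 0.9, which is why L2 regularization is often described as weight decay.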

#### 6. Learning Rate Decay
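Learning rate decay is a technique to reduce the learning rate over time. A simple step decay function can be mathematically expressed as:

learning_rate_t = learning_rate_initial * (decay_rate ^ floor(t / decay_step))

where learning_rate_t is the learning rate at time t (for example, the epoch index), learning_rate_initial is the initial learning rate, decay_rate is a hyperparameter between 0 and 1, and decay_step is the number of steps between successive reductions. The floor keeps the rate constant within each interval; without it, the same expression describes a smooth exponential decay. This approach helps in fine-tuning the convergence by taking smaller steps as we approach the minimum of the loss function.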
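
A minimal sketch of the schedule. It assumes t is an integer epoch index and uses integer division so that the rate stays constant within each block of `decay_step` epochs, which is what makes the decay a step function; with plain division the same expression would decay smoothly instead. The numeric values are illustrative.

```python
def step_decay(initial_rate, decay_rate, decay_step, t):
    """Rate drops by a factor of decay_rate every decay_step epochs."""
    return initial_rate * decay_rate ** (t // decay_step)

def exponential_decay(initial_rate, decay_rate, decay_step, t):
    """Same expression without flooring: the rate shrinks a little every epoch."""
    return initial_rate * decay_rate ** (t / decay_step)

epochs = (0, 9, 10, 25)
step_rates = [step_decay(0.1, 0.5, 10, t) for t in epochs]
smooth_rates = [exponential_decay(0.1, 0.5, 10, t) for t in epochs]
print(step_rates)
print(smooth_rates)
```

The two schedules agree whenever t is a multiple of `decay_step` and differ in between.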

#### 7. ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The Area Under the Curve (AUC) measures the entire two-dimensional area underneath the entire ROC curve. An AUC of 0.5 signifies that the model has no discriminative ability between positive and negative classes.
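
AUC can also be computed without drawing the curve, using its equivalent ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that pairwise definition on a tiny made-up dataset; it is quadratic in the number of samples and meant only for illustration.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_ranking(labels, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, giving an AUC of 0.75; a model that assigns the same score to everything gets 0.5.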

#### 8. Stratified K-Fold Cross-Validation
Stratified K-fold cross-validation ensures that each fold of the dataset contains approximately the same percentage of samples of each target class as the complete set. If n_i is the total number of samples of class i and k is the number of folds, each fold should contain about n_i / k samples of class i. Because every fold mirrors the overall class distribution, the per-fold scores vary less and are less biased than with unstratified splits, particularly on imbalanced datasets.
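
One simple way to obtain such folds is to deal the indices of each class out to the folds in turn, like cards. This is a simplified sketch with a made-up function name; it does not shuffle within classes, which a real implementation normally would (with a fixed seed).

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

labels = [0] * 8 + [1] * 4          # imbalanced toy labels: 8 vs 4
folds = stratified_folds(labels, k=4)
fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_class_counts)
```

With n_0 = 8, n_1 = 4 and k = 4, every fold receives 8 / 4 = 2 samples of class 0 and 4 / 4 = 1 sample of class 1.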

#### 9. Neural Network Initialization
The Xavier (Glorot) initialization sets a layer's weights to values taken from a distribution with zero mean and a variance that keeps the variance of the activations constant. It is formulated as:

W = random(n_in, n_out) * sqrt(2 / (n_in + n_out))

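where n_in and n_out are the number of input and output units of the weight matrix W, and random(n_in, n_out) denotes an n_in-by-n_out matrix of independent standard normal samples, so that each weight has variance 2 / (n_in + n_out). This helps prevent the vanishing/exploding gradient problem during training.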
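
A sketch of the formula under the assumption that random(n_in, n_out) means independent standard normal draws, which makes this the normal variant of Xavier initialization. The layer sizes and the seed are arbitrary.

```python
import math
import random

def xavier_normal(n_in, n_out, rng):
    """n_in x n_out weight matrix with zero mean and variance 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)                      # fixed seed for reproducibility
W = xavier_normal(200, 100, rng)

# Empirical check that the sampled weights have roughly the intended spread.
flat = [w for row in W for w in row]
sample_mean = sum(flat) / len(flat)
sample_variance = sum((w - sample_mean) ** 2 for w in flat) / len(flat)
target_variance = 2.0 / (200 + 100)
print(sample_mean, sample_variance, target_variance)
```

The sample variance should land close to the target, and the sample mean close to zero, up to sampling noise.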

#### 10. Precision-Recall Trade-off
Precision and recall are defined as follows:

- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)

where TP is true positives, FP is false positives, and FN is false negatives. There is often a trade-off between precision and recall, as improving one can lead to a reduction in the other. This trade-off can be represented in a Precision-Recall curve, plotting precision (y-axis) vs. recall (x-axis) for different threshold values.
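
The trade-off shows up as soon as the decision threshold moves. The labels and scores below are made up, and returning 0.0 when a denominator is zero is a convention assumed for this sketch.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    # Assumed convention: report 0.0 when there is nothing to divide by.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

low = precision_recall(labels, scores, threshold=0.3)
high = precision_recall(labels, scores, threshold=0.7)
print(low, high)
```

At the low threshold every positive is found (recall 1.0) at the cost of two false positives (precision 4/6); at the high threshold there are no false positives (precision 1.0) but half of the positives are missed (recall 0.5).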
