### Baseline
Simple neural network with self-defined logistic regression, linear regression, regularized logistic regression, regularized linear regression, sigmoid function, ReLU function, and softmax function.

### Dataset Analysis:
The dataset includes cats, dogs, and other species. Additionally, an image may contain two species.
* We should first do face detection: from line segments, to grouping segments into a small region or part of the face, to forming an entire face by grouping those regions. In the following layer, the model should be able to identify whether the face belongs to a cat, a dog, or a wild animal. This part needs at least 3 layers: at least one for line segments, at least one for grouping segments into regions, and at least one for forming faces from regions.
* At a certain step we should identify whether the picture contains a cat, a dog, or, e.g., a rabbit. This is a binary classification for each unit (neuron). For example, with only two classes, cat and dog, the output is y = [bool_has_cat, bool_has_dog]. Hence this layer should use, for instance, sigmoid as its activation function.
* Then, in the following layer, emotion should be identified by learning y = [cat happy?, dog happy?, cat sad?, dog sad?, cat angry?, dog angry?, cat relaxed?, dog relaxed?], where each emotional class is answered yes or no. If the image has no cat, all cat-related emotional classes should be zero. If the image contains a cat and a dog, the emotional classes should contain at least two ones: at least one emotion for the cat and at least one for the dog. An animal may be happy and sad at the same time, so there can be more than two ones. Because the classes are not mutually exclusive, this layer should use, for instance, a sigmoid activation function.
* Given that the probability of each emotion category has been approximated, the following layer should learn the level of emotion for each emotional class. For instance, if the image contains a cat and a dog, and both happy and relaxed are detected for the cat, then one neuron computes the cat's level of happiness and another computes its level of relaxation. The same logic applies to the dog. In this case the activation function should be ReLU, e.g. y = [level of cat is happy, level of dog is happy, level of cat is sad, level of dog is sad, level of cat is angry, level of dog is angry, level of cat is relaxed, level of dog is relaxed]. If the image contains only a cat, all dog-related emotion levels should be zero.
* Finally, this layer should decide based on the level of each emotional class. Say the input to this layer is [level of cat is happy, level of dog is happy, level of cat is sad, level of dog is sad, level of cat is angry, level of dog is angry, level of cat is relaxed, level of dog is relaxed] and the image contains both a cat and a dog. If the cat is detected as happy and relaxed and the dog as happy and sad, the input is [0.5, 0.4, 0.0, 0.5, 0.0, 0.0, 0.7, 0.0]. The output of this layer should then contain only two ones, [0, 0, 0, 1, 0, 0, 1, 0], meaning the cat is relaxed and the dog is sad. This layer should therefore use either softmax or, for each unit, for instance, a sigmoid.
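
The last step can be illustrated without any learning. The following is only a sketch: the helper name `predominant_emotions` is hypothetical, it assumes the interleaved [cat, dog] layout used in the example above, and it replaces the learned layer with a plain per-species argmax.

```python
import numpy as np

def predominant_emotions(levels):
    """Pick the strongest emotion per species from interleaved intensity levels.

    levels is laid out as [cat e0, dog e0, cat e1, dog e1, ...] (assumed layout).
    """
    levels = np.asarray(levels, dtype=float)
    per_species = levels.reshape(-1, 2)       # rows: emotions, columns: cat, dog
    out = np.zeros(per_species.shape, dtype=int)
    for s in range(per_species.shape[1]):
        column = per_species[:, s]
        if column.max() > 0:                  # species absent -> leave all zeros
            out[column.argmax(), s] = 1
    return out.reshape(-1)

levels = [0.5, 0.4, 0.0, 0.5, 0.0, 0.0, 0.7, 0.0]
result = predominant_emotions(levels)
print(result)  # [0 0 0 1 0 0 1 0]
```

The cat's strongest level is "relaxed" (0.7) and the dog's is "sad" (0.5), which reproduces the target vector from the example.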

In [None]:
# libraries
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split
import tensorflow as tf
import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)  # setLevel must be called to take effect
tf.autograph.set_verbosity(0)

In [None]:
# Neural Network Class
class SimpleNeuralNetwork:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_derivative = None
        self.final_layer = None
        self.loss_history = []

    def add_layer(self, layer):
        self.layers.append(layer)

    def set_loss(self, loss, loss_derivative):
        self.loss = loss
        self.loss_derivative = loss_derivative

    def forward(self, X):
        output = X
        for layer in self.layers:
            output = layer.forward(output)
        return output

    def backward(self, output_error, learning_rate):
        for layer in reversed(self.layers):
            output_error = layer.backward(output_error, learning_rate)

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.forward(X)
            self.loss_history.append(self.loss(y, output))
            output_error = self.loss_derivative(y, output)
            self.backward(output_error, learning_rate)
            
    def add_final_layer(self, layer):
        self.final_layer = layer
            


In [1]:
# ----------------------  Activation Functions and Derivatives ----------------------
def sigmoid(z):
    """
    Compute the sigmoid of z

    Args:
        z (ndarray): A scalar, numpy array of any size.

    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
         
    """

    g = 1/(1+np.exp(-z))
   
    return g
  
def sigmoid_derivative(output):
    return output * (1 - output)

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def regularized_logistic_regression(X, y, weights, lambda_reg):
    m = len(y)
    predictions = sigmoid(np.dot(X, weights))
    error = (-y * np.log(predictions)) - ((1 - y) * np.log(1 - predictions))
    cost = np.sum(error) / m
    regularization = (lambda_reg / (2 * m)) * np.sum(weights**2)
    return cost + regularization

# ---------------------------------  Normalization Functions ---------------------------------
def normalize_data(X):
    # The small epsilon avoids division by zero for constant features (e.g. uniform pixels)
    return (X - np.mean(X, axis=0)) / (np.std(X, axis=0) + 1e-8)

# ---------------------------------  Loss Functions ---------------------------------
def compute_cost(y_true, y_pred):
    m = y_true.shape[0]
    cost = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / m
    return cost

def compute_cost_derivative(y_true, y_pred):
    return y_pred - y_true
  
# compute_cost for singular layer logistic regression
def compute_cost_logistic(X, y, w, b, *argv):
    """
    Computes the cost over all examples
    Args:
      X : (ndarray Shape (m,n)) data, m examples by n features
      y : (ndarray Shape (m,))  target value 
      w : (ndarray Shape (n,))  values of parameters of the model      
      b : (scalar)              value of bias parameter of the model
      *argv : unused, for compatibility with regularized version below
    Returns:
      total_cost : (scalar) cost 
    """

    m, n = X.shape
    
    ### START CODE HERE ###
    
    # first calculate Z = w[0]*X[i][0]+...+w[n-1]*X[i][n-1]+b
    Z = np.dot(X,w) + b
    F_wb = sigmoid(Z)
    
    total_cost = (np.dot(-y,np.log(F_wb)) - np.dot(1-y,np.log(1-F_wb)))/m
    
    ### END CODE HERE ### 

    return total_cost


# compute_gradient for singular layer logistic regression
def compute_gradient_logistic(X, y, w, b, *argv): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X : (ndarray Shape (m,n)) data, m examples by n features
      y : (ndarray Shape (m,))  target value 
      w : (ndarray Shape (n,))  values of parameters of the model      
      b : (scalar)              value of bias parameter of the model
      *argv : unused, for compatibility with regularized version below
    Returns
      dj_dw : (ndarray Shape (n,)) The gradient of the cost w.r.t. the parameters w. 
      dj_db : (scalar)             The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape
    dj_dw = np.zeros(w.shape)
    dj_db = 0.

    ### START CODE HERE ### 
    # calculate z for the sigmoid function
    Z = np.dot(X,w) + b
    F_wb = sigmoid(Z)
    dj_dw = np.dot((F_wb - y),X)/m 
    dj_db = np.sum(F_wb - y)/m
    ### END CODE HERE ###

        
    return dj_db, dj_dw

# compute_cost for regularized logistic regression
def compute_cost_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X : (ndarray Shape (m,n)) data, m examples by n features
      y : (ndarray Shape (m,))  target value 
      w : (ndarray Shape (n,))  values of parameters of the model      
      b : (scalar)              value of bias parameter of the model
      lambda_ : (scalar, float) Controls amount of regularization
    Returns:
      total_cost : (scalar)     cost 
    """

    m, n = X.shape
    
    # Calls the compute_cost function that you implemented above
    cost_without_reg = compute_cost_logistic(X, y, w, b) 
    
    # You need to calculate this value
    reg_cost = 0.
    
    ### START CODE HERE ###
    reg_cost = (np.dot(w,w))* lambda_ / (2*m)
        
    
    ### END CODE HERE ### 
    
    # Add the regularization cost to get the total cost
    total_cost = cost_without_reg + reg_cost

    return total_cost

# compute the gradient for regularized logistic regression
def compute_gradient_reg(X, y, w, b, lambda_ = 1): 
    """
    Computes the gradient for logistic regression with regularization
 
    Args:
      X : (ndarray Shape (m,n)) data, m examples by n features
      y : (ndarray Shape (m,))  target value 
      w : (ndarray Shape (n,))  values of parameters of the model      
      b : (scalar)              value of bias parameter of the model
      lambda_ : (scalar,float)  regularization constant
    Returns
      dj_db : (scalar)             The gradient of the cost w.r.t. the parameter b. 
      dj_dw : (ndarray Shape (n,)) The gradient of the cost w.r.t. the parameters w. 

    """
    m, n = X.shape
    
    # Start from the unregularized gradient (calling compute_gradient_reg here would recurse forever)
    dj_db, dj_dw = compute_gradient_logistic(X, y, w, b)

    ### START CODE HERE ###     
    dj_dw += (lambda_ / m) * w
    ### END CODE HERE ###         
        
    return dj_db, dj_dw
  
# ---------------------------------  Gradient Descent Functions ---------------------------------
# gradient descent to learn w and b and update them with each iteration
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters, lambda_): 
    """
    Performs batch gradient descent to learn theta. Updates theta by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X :    (ndarray Shape (m, n) data, m examples by n features
      y :    (ndarray Shape (m,))  target value 
      w_in : (ndarray Shape (n,))  Initial values of parameters of the model
      b_in : (scalar)              Initial value of parameter of the model
      cost_function :              function to compute cost
      gradient_function :          function to compute gradient
      alpha : (float)              Learning rate
      num_iters : (int)            number of iterations to run gradient descent
      lambda_ : (scalar, float)    regularization constant
      
    Returns:
      w : (ndarray Shape (n,)) Updated values of parameters of the model after
          running gradient descent
      b : (scalar)                Updated value of parameter of the model after
          running gradient descent
    """
    
    # number of training examples
    m = len(X)
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    w_history = []
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w_in, b_in, lambda_)   

        # Update Parameters using w, b, alpha and gradient
        w_in = w_in - alpha * dj_dw               
        b_in = b_in - alpha * dj_db              
       
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            cost =  cost_function(X, y, w_in, b_in, lambda_)
            J_history.append(cost)

        # Print cost every at intervals 10 times or as many iterations if < 10
        if i% math.ceil(num_iters/10) == 0 or i == (num_iters-1):
            w_history.append(w_in)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")
        
    return w_in, b_in, J_history, w_history #return w and J,w history for graphing


# ---------------------------------  Predict Functions ---------------------------------
# predict for singular layer logistic regression
def predict_logistic(X, w, b): 
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w
    
    Args:
      X : (ndarray Shape (m,n)) data, m examples by n features
      w : (ndarray Shape (n,))  values of parameters of the model      
      b : (scalar)              value of bias parameter of the model

    Returns:
      p : (ndarray (m,)) The predictions for X using a threshold at 0.5
    """
    # number of training examples
    m, n = X.shape   
    p = np.zeros(m)
   
    ### START CODE HERE ### 
    # Compute the model output for all examples at once (vectorized)
    Z = np.dot(X,w) + b
    F_wb = sigmoid(Z)

    # Apply the threshold; astype returns a new array, so the result must be assigned
    p = (F_wb >= 0.5).astype(int)
        
    ### END CODE HERE ### 
    return p


# ---------------------------------  Softmax Function ---------------------------------
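def softmax(z):
    """
    Compute the softmax of z along the last axis

    Args:
        z (ndarray): A numpy array whose last axis indexes the classes.

    Returns:
        g (ndarray): softmax(z), with the same shape as z
         
    """
    # Subtract the max along the class axis for numerical stability,
    # then normalize each row separately so batches are handled correctly
    ez = np.exp(z - np.max(z, axis=-1, keepdims=True))
    a = ez / np.sum(ez, axis=-1, keepdims=True)
    return a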

In [None]:
# ----------------------  Dense layer class ----------------------
class Dense:
    def __init__(self, input_size, output_size, activation, activation_derivative):
        self.weights = np.random.randn(input_size, output_size) * 0.01
        self.bias = np.zeros((1, output_size))
        self.activation = activation
        self.activation_derivative = activation_derivative
        self.input = None
        self.output = None
        self.input_error = None
        self.weights_error = None

    def forward(self, input_data):
        self.input = input_data
        self.output = np.dot(input_data, self.weights) + self.bias
        if self.activation is not None:
            self.output = self.activation(self.output)
        return self.output

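    def backward(self, output_error, learning_rate):
        # Chain rule through the activation: sigmoid_derivative and relu_derivative
        # can both be evaluated on the layer output. With activation_derivative=None
        # (e.g. an output layer whose loss derivative is already y_pred - y_true),
        # output_error is taken as the gradient w.r.t. the pre-activation.
        if self.activation_derivative is not None:
            output_error = output_error * self.activation_derivative(self.output)

        self.input_error = np.dot(output_error, self.weights.T)
        self.weights_error = np.dot(self.input.T, output_error)

        # Update parameters
        self.weights -= learning_rate * self.weights_error
        self.bias -= learning_rate * np.sum(output_error, axis=0, keepdims=True)
        return self.input_error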


### 1. Data pre-processing
* Since the baseline is meant to use a conventional model, we'll stick with the simple neural network from the draft for educational purposes, although it's important to note that this approach will possibly yield very poor results for such a complex image-processing task.
* First, you need to preprocess your images. This involves loading the images, resizing them to a uniform size, converting them to grayscale, and flattening them into vectors.

#### Benefits of Using Grayscale Images
* **Reduced Complexity**: Grayscale images are less complex than color images, making them easier to process with simpler algorithms.
* **Reduced Computational Load**: Grayscale images require less computational power and memory, as they have only one channel compared to three in color images.
* **Focus on Texture and Shape**: Converting to grayscale can help the model focus on the texture and shape information, which might be more relevant for certain tasks like emotion detection in animals.
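
To make the saving concrete, here is a small sketch using a random array as a stand-in for a real photo (an assumption made only for illustration). It applies the standard luminance weights (0.299 R, 0.587 G, 0.114 B) by hand and compares the flattened feature counts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a 32x32 color image in BGR channel order (the order OpenCV loads)
bgr = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)

# Luminance weights in B, G, R order
weights_bgr = np.array([0.114, 0.587, 0.299])
gray = bgr @ weights_bgr                 # contracts the channel axis -> (32, 32)

color_features = bgr.reshape(-1).size    # 32 * 32 * 3
gray_features = gray.reshape(-1).size    # 32 * 32
print(color_features, gray_features)     # 3072 1024
```

Each flattened grayscale image therefore has one third as many inputs as its color counterpart, which shrinks the first dense layer by the same factor.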

In [None]:
def load_images_from_folder(folder, label):
    images = []
    labels = []
    file_paths = []  # List to store file paths
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)  # Get the full file path
        img = cv2.imread(file_path)
        if img is not None:
            gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Convert to grayscale
            img_resized = cv2.resize(gray_img, (32, 32))  # Resize images
            images.append(img_resized.flatten())
            labels.append(label)
            file_paths.append(file_path)  # Store the file path for prediction visualization
    return images, labels, file_paths


### 2. Labeling the Data
If the training dataset comes without labels, labels must first be assigned to the data. Since we have a separate folder for each emotion, the folder already provides the label for each image, so we can skip this part.
#### Define the path to the sub dataset folders

In [None]:
# Define the path to the dataset folders
happy_folder = "/kaggle/input/pets-facial-expression-dataset/happy"
sad_folder = "/kaggle/input/pets-facial-expression-dataset/Sad"
angry_folder = "/kaggle/input/pets-facial-expression-dataset/Angry"

#### Load data and Combine data

In [None]:
# Load data
happy_images, happy_labels, happy_file_paths = load_images_from_folder(happy_folder, 0)  # Label 0 for happy
sad_images, sad_labels, sad_file_paths = load_images_from_folder(sad_folder, 1)  # Label 1 for sad
angry_images, angry_labels, angry_file_paths = load_images_from_folder(angry_folder, 2)  # Label 2 for angry

# Combine data
X = np.array(happy_images + sad_images + angry_images)
y = np.array(happy_labels + sad_labels + angry_labels)

# Show the shape of the data
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

### 3. Splitting the Data
Split dataset into training and testing sets. This is essential for evaluating the performance of the model.
A common split is 80% for training and 20% for testing.

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4. Normalize the Data

In [None]:
# Normalize the data
X_train_normalized = X_train / 255.0
X_test_normalized = X_test / 255.0

### 5. Model Functions
Use sigmoid function, ReLU, and softmax to build the simple neural network. 

### 5.1 Refresher on logistic regression and decision boundary

* Recall that for logistic regression, the model is represented as 

  $$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) \tag{1}$$

  where $g(z)$ is known as the sigmoid function and it maps all input values to values between 0 and 1:

  $$g(z) = \frac{1}{1+e^{-z}}\tag{2}$$
  and $\mathbf{w} \cdot \mathbf{x}$ is the vector dot product:
  
  $$\mathbf{w} \cdot \mathbf{x} = w_0 x_0 + w_1 x_1 + \dots + w_{n-1} x_{n-1}$$
  
  
* We interpret the output of the model ($f_{\mathbf{w},b}(x)$) as the probability that $y=1$ given $\mathbf{x}$ and parameterized by $\mathbf{w}$ and $b$.
* Therefore, to get a final prediction ($y=0$ or $y=1$) from the logistic regression model, we can use the following heuristic -

  if $f_{\mathbf{w},b}(x) >= 0.5$, predict $y=1$
  
  if $f_{\mathbf{w},b}(x) < 0.5$, predict $y=0$
  
  
* Let's plot the sigmoid function to see where $g(z) >= 0.5$
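
Evaluating the sigmoid at a few points shows the same thing a plot would. This is a minimal sketch with hand-picked values of $z$ (the sample points are arbitrary choices, not part of the dataset).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
g = sigmoid(z)
predict_one = g >= 0.5

for zi, gi, pi in zip(z, g, predict_one):
    print(f"z = {zi:5.1f}   g(z) = {gi:.3f}   predict y=1: {pi}")
```

Since $g(0) = 0.5$ and $g$ is increasing, $g(z) \geq 0.5$ exactly when $z \geq 0$, so the decision boundary is the set of points where $\mathbf{w} \cdot \mathbf{x} + b = 0$.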

### 5.2 Cost function

In a previous lab, you developed the *logistic loss* function. Recall, loss is defined to apply to one example. Here you combine the losses to form the **cost**, which includes all the examples.


Recall that for logistic regression, the cost function is of the form 

$$ J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \right] \tag{1}$$

where
* $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the cost for a single data point, which is:

    $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}$$
    
*  where m is the number of training examples in the data set and:
$$
\begin{align}
  f_{\mathbf{w},b}(\mathbf{x^{(i)}}) &= g(z^{(i)})\tag{3} \\
  z^{(i)} &= \mathbf{w} \cdot \mathbf{x}^{(i)}+ b\tag{4} \\
  g(z^{(i)}) &= \frac{1}{1+e^{-z^{(i)}}}\tag{5} 
\end{align}
$$
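
A quick sanity check of the cost formula, sketched on a made-up four-example dataset (`X_toy` and `y_toy` are illustrative values only): with $\mathbf{w} = 0$ and $b = 0$ every prediction is 0.5, so the cost must equal $\log 2 \approx 0.693$ regardless of the labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b, eps=1e-12):
    f = sigmoid(X @ w + b)
    f = np.clip(f, eps, 1.0 - eps)   # guard against log(0)
    loss = -y * np.log(f) - (1.0 - y) * np.log(1.0 - f)
    return loss.mean()               # average of the per-example losses

X_toy = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

cost_at_zero = logistic_cost(X_toy, y_toy, np.zeros(2), 0.0)
print(round(cost_at_zero, 4))  # 0.6931
```

The clipping step is an addition for numerical safety; it is not part of the formula itself.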
 

### 5.3 Gradient for logistic regression

In this section, you will implement the gradient for logistic regression.

Recall that the gradient descent algorithm is:

$$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & b := b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \newline       \; & w_j := w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j := 0..n-1}\newline & \rbrace\end{align*}$$

where parameters $b$, $w_j$ are all updated simultaneously.

The `compute_gradient_logistic` function computes $\frac{\partial J(\mathbf{w},b)}{\partial w}$ and $\frac{\partial J(\mathbf{w},b)}{\partial b}$ from equations (2) and (3) below.

$$
\frac{\partial J(\mathbf{w},b)}{\partial b}  = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)}) \tag{2}
$$
$$
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)})x_{j}^{(i)} \tag{3}
$$
* m is the number of training examples in the dataset

    
*  $f_{\mathbf{w},b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the actual label


- **Note**: While this gradient looks identical to the linear regression gradient, the formula is actually different because linear and logistic regression have different definitions of $f_{\mathbf{w},b}(x)$.
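
One way to gain confidence in equations (2) and (3) is a finite-difference check. The sketch below uses random toy data and local helper names (`cost`, `gradient`, `numerical_gradient`), all of which are assumptions for illustration rather than the notebook's own functions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))

def gradient(X, y, w, b):
    err = sigmoid(X @ w + b) - y          # f(x) - y for every example
    return err.mean(), X.T @ err / len(y)  # equations (2) and (3)

def numerical_gradient(X, y, w, b, h=1e-6):
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        dj_dw[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * h)
    dj_db = (cost(X, y, w, b + h) - cost(X, y, w, b - h)) / (2 * h)
    return dj_db, dj_dw

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(20, 3))
y_toy = (rng.random(20) > 0.5).astype(float)
w_toy, b_toy = rng.normal(size=3), 0.3

dj_db, dj_dw = gradient(X_toy, y_toy, w_toy, b_toy)
num_db, num_dw = numerical_gradient(X_toy, y_toy, w_toy, b_toy)
print(np.max(np.abs(dj_dw - num_dw)), abs(dj_db - num_db))
```

Both differences should be tiny; a large gap usually points to a sign error or a missing $\frac{1}{m}$ factor.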

As before, you can use the sigmoid function that you implemented above and if you get stuck, you can check out the hints presented after the cell below to help you with the implementation.

### 5.4 Learning parameters using gradient descent 

Similar to the previous assignment, you will now find the optimal parameters of a logistic regression model by using gradient descent. 
- You don't need to implement anything for this part. Simply run the cells below. 

- A good way to verify that gradient descent is working correctly is to look
at the value of $J(\mathbf{w},b)$ and check that it is decreasing with each step. 

- Assuming you have implemented the gradient and computed the cost correctly, your value of $J(\mathbf{w},b)$ should never increase, and should converge to a steady value by the end of the algorithm.

Now let's run the gradient descent algorithm above to learn the parameters for our dataset.

**Note**
The code block below takes a couple of minutes to run, especially with a non-vectorized version. You can reduce the `iterations` to test your implementation and iterate faster. If you have time later, try running 100,000 iterations for better results.
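
The claim that $J(\mathbf{w},b)$ should decrease at every step can be checked on a tiny problem. This is a sketch on a made-up four-example dataset with an arbitrarily chosen learning rate of 0.1, not the notebook's image data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))

def gradient(X, y, w, b):
    err = sigmoid(X @ w + b) - y
    return err.mean(), X.T @ err / len(y)

X_toy = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

w, b = np.zeros(2), 0.0
alpha = 0.1
cost_history = []
for _ in range(500):
    dj_db, dj_dw = gradient(X_toy, y_toy, w, b)
    w = w - alpha * dj_dw      # simultaneous update of w and b
    b = b - alpha * dj_db
    cost_history.append(cost(X_toy, y_toy, w, b))

print(f"first cost {cost_history[0]:.4f}, last cost {cost_history[-1]:.4f}")
```

If the recorded cost ever rises, the usual suspects are a learning rate that is too large or a bug in the gradient.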

### 5.5 Predict

Please complete the `predict_logistic` function to produce `1` or `0` predictions given a dataset and learned parameters $w$ and $b$.
- First you need to compute the prediction from the model $f(x^{(i)}) = g(w \cdot x^{(i)} + b)$ for every example 
    - You've implemented this before in the parts above
- We interpret the output of the model ($f(x^{(i)})$) as the probability that $y^{(i)}=1$ given $x^{(i)}$ and parameterized by $w$.
- Therefore, to get a final prediction ($y^{(i)}=0$ or $y^{(i)}=1$) from the logistic regression model, you can use the following heuristic -

  if $f(x^{(i)}) >= 0.5$, predict $y^{(i)}=1$
  
  if $f(x^{(i)}) < 0.5$, predict $y^{(i)}=0$
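
A minimal vectorized sketch of this rule, using hand-picked parameters (`w_demo`, `b_demo`, and `X_demo` are illustrative values) and the 0.5 threshold that `predict_logistic` applies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    # astype returns a new array, so the converted result is what gets returned
    return (sigmoid(X @ w + b) >= threshold).astype(int)

X_demo = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
w_demo = np.array([1.0, -1.0])
b_demo = 0.0

preds = predict(X_demo, w_demo, b_demo)
print(preds)  # [1 0 1]
```

The third example has $z = 0$, so $f = 0.5$ and the `>=` comparison assigns it to class 1.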
    
If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.

### 5.6 Regularized Logistic Regression

In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant passes quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. 

### 5.6.1 Problem Statement

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. 
- From these two tests, you would like to determine whether the microchips should be accepted or rejected. 
- To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.


### 5.6.2 Loading and visualizing the data

Similar to previous parts of this exercise, let's start by loading the dataset for this task and visualizing it. 

- The `load_dataset()` function shown below loads the data into variables `X_train` and `y_train`
  - `X_train` contains the test results for the microchips from two tests
  - `y_train` contains the results of the QA  
      - `y_train = 1` if the microchip was accepted 
      - `y_train = 0` if the microchip was rejected 
  - Both `X_train` and `y_train` are numpy arrays.

While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, you will implement regularized logistic regression to fit the data and also see for yourself how regularization can help combat the overfitting problem.

### 5.7 Cost function for regularized logistic regression

In this part, you will implement the cost function for regularized logistic regression.

Recall that for regularized logistic regression, the cost function is of the form
$$J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$$

Compare this to the cost function without regularization (which you implemented above), which is of the form 

$$ J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right]$$

The difference is the regularization term, which is $$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$$ 
Note that the $b$ parameter is not regularized.
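
As a worked example with made-up numbers: for $\mathbf{w} = (1, -2)$, $m = 4$, and $\lambda = 1$, the penalty is $\frac{1}{2 \cdot 4}(1^2 + 2^2) = 0.625$. The sketch below (helper names `cost` and `cost_reg` are local to this example) confirms that the regularized cost exceeds the plain cost by exactly that amount.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))

def cost_reg(X, y, w, b, lambda_=1.0):
    m = len(y)
    penalty = lambda_ / (2 * m) * np.sum(w ** 2)   # b is not regularized
    return cost(X, y, w, b) + penalty

X_toy = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
w_toy, b_toy = np.array([1.0, -2.0]), 0.5

plain = cost(X_toy, y_toy, w_toy, b_toy)
regularized = cost_reg(X_toy, y_toy, w_toy, b_toy, lambda_=1.0)
print(round(regularized - plain, 6))  # 0.625
```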


### 5.8 Compute Regularized Logistic Regression

Please complete the `compute_cost_reg` function below to calculate the following term for each element in $w$ 
$$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$$

The starter code then adds this to the cost without regularization (which you computed above in `compute_cost_logistic`) to calculate the cost with regularization.

If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.

### 5.9 Gradient for regularized logistic regression

In this section, you will implement the gradient for regularized logistic regression.


The gradient of the regularized cost function has two components. The first, $$\frac{\partial J(\mathbf{w},b)}{\partial b}$$ is a scalar, the other is a vector with the same shape as the parameters $\mathbf{w}$, where the $j^\mathrm{th}$ element is defined as follows:

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}  \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$$

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \left( \frac{1}{m}  \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} w_j  \quad\, \text{for $j=0...(n-1)$}$$

Compare this to the gradient of the cost function without regularization (which you implemented above), which is of the form 
$$
\frac{\partial J(\mathbf{w},b)}{\partial b}  = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)}) \tag{2}
$$
$$
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - \mathbf{y}^{(i)})x_{j}^{(i)} \tag{3}
$$


As you can see, $\frac{\partial J(\mathbf{w},b)}{\partial b}$ is the same; the difference is the following term in $\frac{\partial J(\mathbf{w},b)}{\partial w}$, which is $$\frac{\lambda}{m} w_j  \quad\, \text{for $j=0...(n-1)$}$$ 
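
The same comparison can be done numerically. In this sketch (toy data and local helper names are assumptions for illustration) the regularized gradient is built from the unregularized one, and the two differ only by $\frac{\lambda}{m}\mathbf{w}$ in the $\mathbf{w}$ component.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, w, b):
    err = sigmoid(X @ w + b) - y
    return err.mean(), X.T @ err / len(y)

def gradient_reg(X, y, w, b, lambda_=1.0):
    m = len(y)
    dj_db, dj_dw = gradient(X, y, w, b)      # unregularized part
    return dj_db, dj_dw + (lambda_ / m) * w  # only w receives the extra term

X_toy = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
w_toy, b_toy = np.array([1.0, -2.0]), 0.5

dj_db, dj_dw = gradient(X_toy, y_toy, w_toy, b_toy)
dj_db_reg, dj_dw_reg = gradient_reg(X_toy, y_toy, w_toy, b_toy, lambda_=1.0)
extra = dj_dw_reg - dj_dw
print(extra)  # approximately [ 0.25 -0.5 ]
```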





### 6. Define the Network Architecture
1. Face Detection and Segmentation
* **Objective**: Detect and segment animal faces from images.
* **Approach**: This typically requires convolutional neural networks (CNNs) that can identify patterns (like edges, textures) and group them into larger structures (like faces).
* **Layers**: Start with convolutional layers for feature extraction, followed by pooling layers to reduce dimensionality, and fully connected layers for classification.
* **Activation Function**: ReLU is commonly used in CNNs for its efficiency.
2. Species Identification
* **Objective**: Identify whether the image contains a cat, dog, or other species.
* **Approach**: This is a multi-label classification problem (since an image can have more than one label).
* **Layers**: Fully connected layers following the feature extraction layers.
* **Activation Function**: Sigmoid activation function for each neuron (since it's a binary classification for each species).
3. Emotion Detection
* **Objective**: Identify emotions of each species present in the image.
* **Approach**: This is again a multi-label classification problem.
* **Layers**: Additional fully connected layers.
* **Activation Function**: Sigmoid, as each emotion (happy, sad, etc.) is treated as a separate binary classification problem.
4. Emotion Intensity Level
* **Objective**: Determine the intensity level of each detected emotion.
* **Approach**: This can be seen as a regression problem for each emotion.
* **Layers**: Fully connected layers.
* **Activation Function**: ReLU or a similar function, as the output is a non-negative intensity level.
5. Final Layer for Emotion Classification
* **Objective**: Classify the predominant emotion for each species.
* **Approach**: This is a classification problem, but with a twist. You're interested in the predominant emotion, which is a bit different from standard classification.
* **Layers**: Fully connected layer.
* **Activation Function**: Softmax if you're classifying one predominant emotion per species, or sigmoid for binary classification of each emotion.
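
Stages 2 to 4 can be sketched as separate heads on top of one shared feature layer. This is only an illustration with random, untrained weights: the layer sizes, the `dense` helper, and the head names are assumptions, and the CNN-based face detection of stage 1 is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, weights, bias, activation):
    return activation(x @ weights + bias)

def init(n_in, n_out):
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros((1, n_out))

n_features, n_shared = 1024, 48               # flattened 32x32 image, shared width
batch = rng.random((5, n_features))           # stand-in for 5 normalized images

W_shared, b_shared = init(n_features, n_shared)
W_species, b_species = init(n_shared, 2)      # [has_cat, has_dog]
W_emotion, b_emotion = init(n_shared, 8)      # 4 emotions x 2 species
W_level, b_level = init(n_shared, 8)          # intensity per emotion and species

shared = dense(batch, W_shared, b_shared, relu)
species = dense(shared, W_species, b_species, sigmoid)  # multi-label probabilities
emotion = dense(shared, W_emotion, b_emotion, sigmoid)  # multi-label probabilities
level = dense(shared, W_level, b_level, relu)           # non-negative intensities

print(species.shape, emotion.shape, level.shape)  # (5, 2) (5, 8) (5, 8)
```

Sharing the feature layer keeps the heads cheap, while each head keeps the activation that matches its task: sigmoid for independent yes/no outputs and ReLU for non-negative levels.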

### Additional Considerations:
* **Data Preprocessing**: Ensure images are properly preprocessed (normalized, resized, etc.).
* **Model Complexity**: This is a complex model. Start with a simpler version and iteratively add complexity.
* **Training Data**: You'll need a large and well-labeled dataset for this task, especially for the emotion detection and intensity levels.
* **Evaluation Metrics**: Choose appropriate metrics for each stage (accuracy, F1 score, mean squared error for intensity levels, etc.).
* **Computational Resources**: This model might require significant computational resources, especially for training.

### 7. Configure the NN
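The design above targets four emotion classes (happy, sad, angry, relaxed), for which the last output layer should have 4 neurons with a softmax activation function for multi-class classification. Note that the loading code above reads only three folders (happy, sad, angry); with that data the output layer needs 3 neurons instead.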
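
A softmax output layer is trained against one-hot targets, while the labels loaded earlier are plain integers. The following sketch shows both pieces on made-up logits; `softmax_rows` and `one_hot` are hypothetical helper names, and the number of classes is a parameter so it can match whichever label set is used.

```python
import numpy as np

def softmax_rows(z):
    # Normalize each row separately; subtracting the row max avoids overflow
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def one_hot(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

logits = np.array([[2.0, 1.0, 0.1, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
probs = softmax_rows(logits)
targets = one_hot(np.array([0, 3]), num_classes=4)

# Categorical cross-entropy averaged over the batch
cross_entropy = -np.sum(targets * np.log(probs)) / len(targets)
print(probs.sum(axis=1), cross_entropy)
```

Each row of `probs` sums to 1, and the all-zero logits row yields a uniform 0.25 per class.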

In [None]:
# Assuming each image is flattened to a vector of size 1024 (32x32)
input_size = 1024
network = SimpleNeuralNetwork()

# Add layers
# Network Structure Proposal:
# Input Layer: Size depends on the preprocessed image dimensions.
# Hidden Layers: A series of dense layers with ReLU activation for feature extraction.
# Output Layer for Species Classification: ReLU/Sigmoid activation for multi-class classification.
# Output Layer for Emotion Detection: ReLU/Sigmoid activation for multi-label classification.
# Output Layer for Emotion Intensity Level: ReLU activation for regression of intensity levels.
# Assuming input size is determined by preprocessed image size

network.add_layer(Dense(input_size, 512, relu, relu_derivative))
network.add_layer(Dense(512, 256, relu, relu_derivative))
network.add_layer(Dense(256, 128, relu, relu_derivative))
network.add_layer(Dense(128, 64, relu, relu_derivative))
network.add_layer(Dense(64, 48, relu, relu_derivative))

'''Task-Specific Branches: 
    After the common layers, 
    create separate branches for each task. 
    Each branch will have its own layers that are specifically designed for that task.
'''
# # Branch for Species Classification
# species_network = SimpleNeuralNetwork() 
# species_network.add_layer(Dense(48, 3, softmax, None))  # Softmax/ReLU/sigmoid for species classification (3 species: cat, dog, others)

# # Branch for Emotion Detection
# emotion_detection_network = SimpleNeuralNetwork()
# emotion_detection_network.add_layer(Dense(48, 4, sigmoid, sigmoid_derivative))  # Sigmoid for emotion detection

# # Branch for Emotion Intensity Level
# emotion_intensity_network = SimpleNeuralNetwork()
# emotion_intensity_network.add_layer(Dense(48, 4, relu, relu_derivative))  # ReLU for emotion intensity level

# # Set loss function - cross-entropy for classification tasks
# network.set_loss(compute_cost, compute_cost_derivative)

# '''Connecting the Branches:
#     ensure that the output of the common feature extraction base is fed into each of the task-specific branches. 
#     This can be done during the forward pass of the network.
# '''
# # Assuming 'X_train_normalized' is your input
# common_features = network.forward(X_train_normalized)

# # Using common features in each branch
# species_output = species_network.forward(common_features)
# emotion_detection_output = emotion_detection_network.forward(common_features)
# emotion_intensity_output = emotion_intensity_network.forward(common_features)

# # Combine outputs along the feature axis (one row per sample) and pass through final layer
# combined_output = np.concatenate((species_output, emotion_detection_output, emotion_intensity_output), axis=1)
# final_layer = Dense(combined_output.shape[1], 8, softmax, None)
# # add the final layer to the network
# network.add_final_layer(final_layer)
# # final_output = final_layer.forward(combined_output)
# final_output = network.final_layer.forward(combined_output)

# # combine the loss functions
# def combined_loss(y_true, y_pred):
#     species_loss = compute_cost(y_true[:, 0:3], y_pred[:, 0:3])
#     emotion_detection_loss = compute_cost(y_true[:, 3:7], y_pred[:, 3:7])
#     emotion_intensity_loss = compute_cost(y_true[:, 7:8], y_pred[:, 7:8])
#     return species_loss + emotion_detection_loss + emotion_intensity_loss

# # combine the loss derivatives
# def combined_loss_derivative(y_true, y_pred):
#     species_loss_derivative = compute_cost_derivative(y_true[:, 0:3], y_pred[:, 0:3])
#     emotion_detection_loss_derivative = compute_cost_derivative(y_true[:, 3:7], y_pred[:, 3:7])
#     emotion_intensity_loss_derivative = compute_cost_derivative(y_true[:, 7:8], y_pred[:, 7:8])
#     return np.concatenate((species_loss_derivative, emotion_detection_loss_derivative, emotion_intensity_loss_derivative), axis=1)

# # set the loss function and loss derivative
# network.set_loss(combined_loss, combined_loss_derivative)


'''Without branching, the network has a single output layer.
    The output layer has 12 nodes,
    with each set of four nodes representing the classified emotion of the corresponding species.
    output = [cat_happy, cat_sad, cat_angry, cat_relaxed, dog_happy, dog_sad, dog_angry, dog_relaxed, others_happy, others_sad, others_angry, others_relaxed]
'''
network.add_layer(Dense(48, 3, sigmoid, sigmoid_derivative)) # Sigmoid for species classification (3 species: cat, dog, others)
network.add_layer(Dense(3, 4, sigmoid, sigmoid_derivative)) # Sigmoid for emotion detection
network.add_layer(Dense(4, 4, relu, relu_derivative)) # ReLU for emotion intensity level
network.add_layer(Dense(4, 12, softmax, None)) # Softmax to get a probability for each (species, emotion) pair

# Set loss function - cross-entropy for classification tasks
network.set_loss(compute_cost, compute_cost_derivative)
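
The branching idea from the commented-out code can also be sketched on its own, without the notebook's `SimpleNeuralNetwork` and `Dense` classes. The following is only an illustration in plain NumPy under stated assumptions: random, untrained weights, one linear layer per branch, and hypothetical names (`forward_heads`, `W_species`, `W_detect`, `W_level`).

```python
import numpy as np

rng = np.random.default_rng(0)

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _relu(z):
    return np.maximum(0.0, z)

# Assumed sizes: 48 shared features, 3 species, 4 emotions
n_features, n_species, n_emotions = 48, 3, 4

# One weight matrix per branch; every branch reads the same shared features
W_species = rng.normal(scale=0.1, size=(n_features, n_species))
W_detect = rng.normal(scale=0.1, size=(n_features, n_emotions))
W_level = rng.normal(scale=0.1, size=(n_features, n_emotions))

def forward_heads(shared):
    """Apply the three task branches to a batch of shared features."""
    species = _sigmoid(shared @ W_species)   # yes/no per species
    detected = _sigmoid(shared @ W_detect)   # yes/no per emotion
    level = _relu(shared @ W_level)          # non-negative intensity per emotion
    # Concatenate along the feature axis so each row stays one sample
    return np.concatenate((species, detected, level), axis=1)

shared_features = rng.normal(size=(2, n_features))  # stand-in for the trunk output
outputs = forward_heads(shared_features)
print(outputs.shape)  # (2, 11)
```

Concatenating with `axis=1` keeps one row per image, so the columns can later be sliced per task when computing a combined loss.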


### 8. Train the NN

In [None]:
# Train the network
epochs = 1000
learning_rate = 0.01
network.train(X_train_normalized, y_train, epochs, learning_rate)

# Plot the loss history
plt.plot(network.loss_history)
plt.title("Loss History")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()

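# Calculate training accuracy.
# This assumes y_train holds integer class indices; if it is one-hot encoded,
# compare against np.argmax(y_train, axis=1) instead.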
y_pred = network.forward(X_train_normalized)
y_pred = np.argmax(y_pred, axis=1)
accuracy = np.mean(y_pred == y_train)
print(f"Training Accuracy: {accuracy * 100:.2f}%")
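
The accuracy computation above compares arg-max indices with `y_train` directly, which only works if the labels are integer class indices. A minimal sketch that accepts either integer or one-hot labels is shown here; the helper name `class_accuracy` and the toy arrays are hypothetical.

```python
import numpy as np

def class_accuracy(outputs, labels):
    """Accuracy of arg-max predictions against integer or one-hot labels."""
    pred_idx = np.argmax(outputs, axis=1)
    labels = np.asarray(labels)
    # A 2-D label array is treated as one-hot and reduced to class indices
    true_idx = np.argmax(labels, axis=1) if labels.ndim == 2 else labels
    return float(np.mean(pred_idx == true_idx))

toy_outputs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6],
                        [0.2, 0.5, 0.3]])
int_labels = np.array([0, 2, 0])
one_hot_labels = np.eye(3)[int_labels]

acc_int = class_accuracy(toy_outputs, int_labels)
acc_one_hot = class_accuracy(toy_outputs, one_hot_labels)
print(acc_int, acc_one_hot)
```

Both calls give the same value, because the one-hot labels are converted to indices before the comparison.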

### 9. Predict and evaluate

In [None]:
# Predict on test data
predictions = network.forward(X_test_normalized)
predictions = np.argmax(predictions, axis=1)

# Calculate accuracy
accuracy = np.mean(predictions == y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")


# Evaluation compares the predictions with the true labels.
# The appropriate metric depends on the task:
# - Classification (species): accuracy, precision, recall, F1-score.
# - Multi-label classification (emotion detection): subset accuracy, Hamming loss.
# - Regression (emotion intensity level): mean squared error, mean absolute error.

import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Plot the confusion matrix
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
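
For the multi-label and regression outputs, a confusion matrix does not apply directly. As a rough sketch, Hamming loss and mean squared error can be computed with NumPy alone. The helper names `hamming_loss_np` and `mse_np`, the 0.5 threshold, and the toy arrays are assumptions for illustration.

```python
import numpy as np

def hamming_loss_np(y_true, y_prob, threshold=0.5):
    """Fraction of individual labels predicted incorrectly after thresholding."""
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_hat != np.asarray(y_true)))

def mse_np(y_true, y_pred):
    """Mean squared error over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

# Toy multi-label data: columns are [happy, sad, angry, relaxed]
y_true_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0]])
y_prob_labels = np.array([[0.9, 0.2, 0.1, 0.4], [0.3, 0.8, 0.6, 0.1]])

# Toy intensity levels for two emotions
y_true_levels = np.array([[0.5, 0.0], [1.0, 2.0]])
y_pred_levels = np.array([[0.5, 0.5], [1.0, 1.0]])

print(hamming_loss_np(y_true_labels, y_prob_labels))  # 0.25
print(mse_np(y_true_levels, y_pred_levels))           # 0.3125
```

In the first toy example two of the eight labels are wrong after thresholding, which gives a Hamming loss of 0.25.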


#### Visualizing the prediction results

In [None]:
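import math
import matplotlib.pyplot as plt
import cv2

def visualize_predictions(file_paths, predictions, true_labels=None, num_images=10):
    """Show the first `num_images` images with their predicted (and true) labels."""
    num_images = min(num_images, len(file_paths))
    cols = math.ceil(num_images / 2)  # two rows; round up so odd counts still fit
    plt.figure(figsize=(15, 5))
    for i in range(num_images):
        img = cv2.imread(file_paths[i], cv2.IMREAD_GRAYSCALE)
        if img is None:  # cv2.imread returns None when the file cannot be read
            continue
        img_resized = cv2.resize(img, (32, 32))
        plt.subplot(2, cols, i + 1)
        plt.imshow(img_resized, cmap='gray')
        title = f"Pred: {predictions[i]}"
        if true_labels is not None:
            title += f"\nTrue: {true_labels[i]}"
        plt.title(title)
        plt.axis('off')
    plt.show()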


#### Implement Prediction and Evaluation

In [None]:
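# Assuming 'file_paths' lists the test-set image paths in the same order as y_test
file_paths = happy_file_paths + sad_file_paths + angry_file_paths
# Use the test-set predictions here, not y_pred (which holds the training-set predictions)
visualize_predictions(file_paths, predictions, y_test, num_images=3)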


## Limitations
* Feature Extraction: Flattening each image into a pixel vector is very basic feature extraction. It discards the spatial relationships between pixels and may not capture the details needed for accurate emotion classification.
* Model Complexity: Logistic regression and small fully connected networks are basic models for image classification tasks.
* Data Quality: The quality and size of the dataset significantly affect the performance of the model.