# CSCI 5922 Final Exam

## Part 1: Practice Interview Questions

1)	What is an activation function (in neural networks) and why is it used?  

An activation function is a mathematical operation applied to the output of a neuron in a neural network to produce an “activated” output. Because a neuron computes a linear weighted sum of its inputs, a network built only from such neurons collapses into a single linear map, no matter how many layers it has, and cannot represent non-linear functions such as the infamous XOR problem. Activation functions are typically non-linear, such as sigmoids or hyperbolic tangents, which gives the network the ability to approximate non-linear functions. 
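
As a small illustration of the XOR point, the sketch below hard-codes a two-unit hidden layer with ReLU. The weights are hand-picked for illustration (not learned), and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def xor_net(x1, x2):
    """Tiny network with hand-picked (not learned) weights that computes XOR."""
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2      # linear output layer


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

Without the `relu` calls the whole network would reduce to one linear expression in `x1` and `x2`, which cannot produce the XOR pattern.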

2)	What is an exploding gradient and give an example of what could cause it.

An exploding gradient occurs when the derivative of the loss function with respect to the model parameters grows very large during training. The gradient, in tandem with the learning rate, determines the direction and size of each parameter update computed through backpropagation. If the gradient grows too large, the updates overshoot and the model will not converge to a minimum. This can occur easily in a recurrent neural network that has a long series of inputs, because the gradient for some parameters is a product of terms from every prior time step. If the values in this chain are > 1, their product, and thus the gradient, can get very large very quickly. 
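
The effect of multiplying many chain-rule terms can be sketched with plain arithmetic. The per-step factors below are made-up values chosen only to show the trend over 50 time steps.

```python
def chained_gradient(factor, steps):
    """Multiply the same per-step derivative `steps` times, as the chain rule does."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad


# Hypothetical per-step derivatives: below 1, exactly 1, and above 1.
for factor in (0.9, 1.0, 1.5):
    print(factor, chained_gradient(factor, 50))
```

A factor only slightly above 1 already produces an enormous product, while a factor below 1 shrinks toward zero (the vanishing-gradient counterpart).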

3)	When using an activation function in a CNN that predicts images, why might you choose the ReLU?

ReLU’s power is in the simplicity of its calculation and of its derivative: for values below 0, the derivative is 0, and for values above 0, the derivative is 1. Furthermore, any negative values are simply reduced to 0. This is important in a CNN designed to work with images because the input typically has very high dimensionality and thus the model has many layers with many parameters. Because the derivative is exactly 1 for positive inputs, ReLU does not shrink the gradient as it passes back through many layers the way saturating functions such as the sigmoid do, which helps keep the gradient within reasonable values. 
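
A minimal sketch of ReLU and its derivative follows, with the sigmoid derivative for comparison. The choice of 10 layers and the input values are illustrative assumptions.

```python
import math


def relu(z):
    """ReLU: negative values become 0, positive values pass through."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def sigmoid_grad(z):
    """Derivative of the sigmoid, which never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


layers = 10  # hypothetical network depth
print("ReLU chain:   ", relu_grad(2.0) ** layers)
print("sigmoid chain:", sigmoid_grad(0.0) ** layers)
```

Even at its best case (z = 0), the sigmoid derivative shrinks the chained product at every layer, while ReLU leaves it unchanged for active units.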

4)	A transformer (such as for language translation) has an encoder and a decoder. Suppose you have a word that is one-hot encoded. What 4 things will a common encoder include when embedding that word? Hint: The first one is some kind of embedding. What are the other three?

A transformer encoder commonly includes first the embedding of the word into a vector space that captures the meaning of the word. Next, positional encoding is added to capture the location of the word in the input. Next, a self-attention mechanism captures the relative importance of the word in relation to the other words in the input. Finally, normalization and feed-forward layers transform the result, which is typically passed on to further encoder blocks.
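
The first two steps can be sketched in a few lines. This assumes the sinusoidal positional encoding from the original Transformer design (learned positional embeddings are also common); the 3-word vocabulary and the embedding values are made up.

```python
import math


def embed(one_hot, table):
    """Multiplying a one-hot vector by the embedding table selects one row."""
    d_model = len(table[0])
    return [sum(one_hot[v] * table[v][j] for v in range(len(table))) for j in range(d_model)]


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for j in range(d_model):
        angle = pos / (10000 ** ((j - j % 2) / d_model))
        pe.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return pe


# Hypothetical embedding table: 3 words, 4 dimensions each.
table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
word = [0, 1, 0]  # one-hot vector for the second word
position = 2      # the word is the third token in the sentence

encoder_input = [e + p for e, p in zip(embed(word, table), positional_encoding(position, 4))]
print(encoder_input)
```

The summed vector is what the self-attention and feed-forward layers would then operate on.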

5)	Define BERT (Bidirectional Encoder Representations from Transformers).

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing model developed by Google. It utilizes a transformer architecture and is trained on vast amounts of text data to learn contextualized representations of words. BERT is bidirectional, meaning it considers both left and right context in a sentence, enabling it to capture richer semantic meanings. It has achieved state-of-the-art results in various NLP tasks by fine-tuning its pre-trained weights on specific downstream tasks.

6)	What is cross attention in a transformer?

Cross-attention is a mechanism by which a transformer model pays varying degrees of attention to different parts of the input sequence when generating the output sequence. Unlike self-attention, which considers relationships within a single sequence, cross-attention relates two sequences: in an encoder-decoder transformer, the queries come from the decoder (the output generated so far), while the keys and values come from the encoder’s representation of the input.
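
A bare-bones sketch of the idea, using scaled dot-product attention with queries from one sequence and keys/values from another. It deliberately omits the learned projection matrices and multiple heads found in real transformers, and the state vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical states: 3 encoder (input) positions, 2 decoder (output) positions.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.5], [0.2, 0.9]]

outputs, weights = cross_attention(decoder_states, encoder_states, encoder_states)
for row in weights:
    print(row)  # one attention distribution over the input per output position
```

Passing the same sequence as queries, keys, and values would turn this into self-attention.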

7)	What is transfer learning?

Transfer learning is a machine learning model training technique that involves pre-training the model on a different but similar dataset to the target task before fine-tuning on the specific task at hand. It is a powerful technique that allows a model to learn on a vast trove of labeled or unlabeled data before attempting a task that might have much less data. 

8)	Define GAN (Generative adversarial network)?

A GAN is a type of neural network model that learns from the competing aims of a generator and a discriminator. The discriminator is attempting to learn to distinguish between a real input and a fake one, while the generator simultaneously learns to create fake inputs that are indistinguishable from real ones. GANs have achieved impressive results in the field of generative AI. 

9)	Define GPT.

GPT stands for Generative Pre-Trained Transformer. It refers to the family of language models based on transformers that are pre-trained on massive amounts of diverse text data in order to learn contextualized representations of words. The generative part refers to the fact that after these models are prompted with input, they generate a bunch of coherent and contextually relevant text. 

10)	Define ChatGPT.

ChatGPT is a software product created by OpenAI in late 2022 that brought a lot of joy to students and a lot of headaches for teachers accustomed to assigning boring, rote assignments. Why? Well, based on the GPT defined above, OpenAI made it possible for anyone with an internet connection to prompt their large language model GPT-3.5 and get a coherent response. ChatGPT has since passed the bar exam and med school exams, and has likely written millions of high school English essays across the world. If ChatGPT can answer this question, does that make it self-aware?

11)	Common activation functions include ReLU, sigmoid, tanh, and softmax (among others). Give an example when you would use the sigmoid as the last activation function in a NN. Give an example when you would use softmax as the last activation function in a NN. 

Sigmoid is a very useful activation function that is excellent for binary classification, for example predicting whether an email is spam or not. Because of its shape, it squashes any input value to between 0 and 1, and for inputs of large magnitude the value is very close to 1 or 0, so the output can be read as the probability of the positive class. The softmax function is similarly useful for classification, but for multiclass problems, for example classifying an image as a dog, cat, or mouse. It takes a vector of raw scores (logits), one per class, and normalizes them such that they are all between 0 and 1 and add up to one, making the outputs interpretable as probabilities that the input belongs to a given class.
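
Both functions are short enough to write out directly. The sketch below uses made-up scores.

```python
import math


def sigmoid(z):
    """Squash a single score into (0, 1); suited to one binary output."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))  # binary: P(positive class)
print(softmax([2.0, 1.0, 0.1]))                   # multiclass: one probability per class
```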

12)	Suppose you have labeled input data where the labels are one-hot encoded. Suppose also that your labels can be one of three categories (like dog, cat, mouse for example). Next, suppose the last activation function of your NN is the softmax. Which Loss function would you choose to use in this case and why?	

For this case, I would most certainly use categorical cross entropy because this is a classification problem with more than 2 classes. CCE rewards a model not just for guessing the correct category, but it also rewards guessing correctly with high confidence. It therefore performs excellently at training the model progressively, helping the model get slowly more and more confident as it trains. 
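
The confidence effect can be seen by computing the loss by hand. In this sketch the label is a one-hot vector for the second class and the three prediction vectors are invented for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred):
    """CCE for one example: -sum(y_k * log(y_hat_k)); only the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


y_true = [0, 1, 0]  # one-hot label, e.g. "cat" out of (dog, cat, mouse)

hesitant = [0.3, 0.4, 0.3]     # correct class wins, but barely
confident = [0.05, 0.9, 0.05]  # correct class with high confidence
wrong = [0.9, 0.05, 0.05]      # confidently wrong

for name, pred in [("hesitant", hesitant), ("confident", confident), ("wrong", wrong)]:
    print(name, categorical_cross_entropy(y_true, pred))
```

The hesitant and confident predictions both pick the right class, yet the confident one receives a much lower loss.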

13)	Why use max pooling in CNNs – what does max pooling do?

Max pooling is a type of layer whose primary purpose is to reduce the dimensionality of the features in a model. It does this by defining a pool, or patch, of a given size (say 3 by 3) and pulling out the maximum value from within that pool. This reduces the patch (in our case 9 values) down to a single value that still captures a lot of the important signal. It’s commonly used with images, which have notoriously high dimensionality, to get us from millions of pixels down to a feature space of just a few thousand values.
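
A minimal pure-Python sketch of the operation, assuming non-overlapping windows and no padding. The 4 by 4 input is a made-up example and uses a 2 by 2 pool to keep it small.

```python
def max_pool_2d(image, pool=2, stride=None):
    """Max pooling over a 2-D list of numbers, with no padding."""
    stride = stride or pool  # default: non-overlapping windows
    rows = (len(image) - pool) // stride + 1
    cols = (len(image[0]) - pool) // stride + 1
    return [
        [
            max(image[r * stride + i][c * stride + j] for i in range(pool) for j in range(pool))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [7, 2, 9, 8],
    [0, 5, 3, 4],
]
print(max_pool_2d(image))  # 16 values reduced to 4
```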


## Part 2: ANN & CNN Architectures

### Question 1A: ANN Diagram

![ANN Diagram](ann_diagram.png)

### Question 1B: ANN Derivatives  
  
#### Equations for ANN:  
Z1 = X * W1 + B  
H1 = σ(Z1)  
Z2 = H1 * W2 + C  
H2 = ReLU(Z2)  
ŷ = softmax(H2)  
L = CCE(y, ŷ)  
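L = -Σ_k y_k * log(ŷ_k)  
  
#### Derivative ∂L/∂W2_11:  
  
W2_11 only affects Z2_1, so the chain rule gives:  
∂L/∂W2_11 = ∂L/∂H2_1 * ∂H2_1/∂Z2_1 * ∂Z2_1/∂W2_11  
  
Softmax couples every output ŷ_k to H2_1, so ∂L/∂H2_1 sums over all k (δ_k1 = 1 if k = 1 else 0):  
∂L/∂ŷ_k = -y_k/ŷ_k  
∂ŷ_k/∂H2_1 = ŷ_k * (δ_k1 - ŷ_1)  
∂L/∂H2_1 = Σ_k (-y_k/ŷ_k) * ŷ_k * (δ_k1 - ŷ_1) = ŷ_1 - y_1   (because Σ_k y_k = 1)  
∂H2_1/∂Z2_1 = 1 if Z2_1 > 0 else 0  
∂Z2_1/∂W2_11 = H1_1  
  
if Z2_1 > 0:  
∂L/∂W2_11 = (ŷ_1 - y_1) * H1_1  
  
if Z2_1 <= 0:  
∂L/∂W2_11 = 0  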
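
One way to sanity-check a hand derivation like this is a finite-difference test. The sketch below runs the forward equations on a tiny made-up network (2 inputs, 2 hidden units, 3 outputs, label on class 1) and compares a numerical estimate of ∂L/∂W2_11 with the standard softmax plus cross-entropy result (ŷ_1 - y_1) * H1_1. All weights and inputs are hypothetical.

```python
import math


def forward(x, W1, b, W2, c, y):
    """Z1 = X*W1 + B, H1 = sigmoid(Z1), Z2 = H1*W2 + C, H2 = ReLU(Z2), y_hat = softmax(H2)."""
    z1 = [sum(x[i] * W1[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    h1 = [1.0 / (1.0 + math.exp(-z)) for z in z1]
    z2 = [sum(h1[i] * W2[i][j] for i in range(len(h1))) + c[j] for j in range(len(c))]
    h2 = [max(0.0, z) for z in z2]
    m = max(h2)
    exps = [math.exp(h - m) for h in h2]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    loss = -sum(t * math.log(p) for t, p in zip(y, y_hat))
    return loss, h1, z2, y_hat


# Hypothetical values, chosen so that Z2_1 > 0 (the ReLU is active).
x = [1.0, 2.0]
W1 = [[0.1, -0.2], [0.3, 0.4]]
b = [0.0, 0.1]
W2 = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.1]]
c = [0.1, 0.0, 0.0]
y = [1, 0, 0]  # one-hot label on class 1

loss, h1, z2, y_hat = forward(x, W1, b, W2, c, y)
analytic = (y_hat[0] - y[0]) * h1[0] if z2[0] > 0 else 0.0

# Central finite difference on W2_11 (index [0][0] in zero-based Python).
eps = 1e-6
W2_plus = [row[:] for row in W2]
W2_minus = [row[:] for row in W2]
W2_plus[0][0] += eps
W2_minus[0][0] -= eps
numeric = (forward(x, W1, b, W2_plus, c, y)[0] - forward(x, W1, b, W2_minus, c, y)[0]) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two numbers agree to several decimal places, the closed-form expression is very likely right for this case.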




```python
# Create this ANN in Keras

# import tensorflow
import tensorflow as tf

# create the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(4, activation='sigmoid', input_shape=(4,)),
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# print the model summary
model.summary()
```


```
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 4)                 20        
                                                                 
 dense_7 (Dense)             (None, 3)                 15        
                                                                 
 dense_8 (Dense)             (None, 3)                 12        
                                                                 
Total params: 47
Trainable params: 47
Non-trainable params: 0
_________________________________________________________________
```
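
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper name below is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the number of units in each Dense layer of the model.
layer_sizes = [4, 4, 3, 3]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```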


### Question 2A: CNN Diagram

![CNN Diagram](cnn_diagram.png)

```python
# Create the CNN in Keras

# import tensorflow
import tensorflow as tf

# create the model
cnn = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters=2, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(30, 30, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(filters=4, kernel_size=(3, 3), activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dense(units=3, activation='softmax')
])

# compile the model
cnn.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# print the model summary
cnn.summary()
```

```
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 30, 30, 2)         20        
                                                                 
 max_pooling2d (MaxPooling2D  (None, 15, 15, 2)        0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 15, 15, 4)         76        
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 7, 7, 4)          0         
 2D)                                                             
                                                                 
 flatten (Flatten)           (None, 196)               0         
                                                                 
 dense_9 (Dense)             (None, 64)               
```
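
The convolution parameter counts and the shrinking feature-map sizes in the summary can be checked with a little arithmetic. This sketch assumes `padding='same'` convolutions (which preserve width and height) and 2 by 2 pooling without padding; the helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def pooled_size(size, pool=2):
    """Width/height after non-overlapping pooling with no padding (floor division)."""
    return size // pool


print(conv2d_params(3, 3, 1, 2))  # first Conv2D: 1 input channel, 2 filters
print(conv2d_params(3, 3, 2, 4))  # second Conv2D: 2 input channels, 4 filters

side = 30
side = pooled_size(side)  # after the first max pooling layer
side = pooled_size(side)  # after the second max pooling layer
flattened = side * side * 4
print(side, flattened)
```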

## Part 3: Applying Neural Nets (ANN, CNN, LSTM) on Real Labeled Data