## Lab Part 2: CNN Architecture Design Decisions

### Instructions

Discuss the following key architectural decisions you would make and why:

1. What kernel sizes would you choose for different layers and why?
2. How would you incorporate pooling layers and what benefits would they provide in this specific application?
3. What considerations would guide your decisions about network depth (number of layers) and width (number of filters per layer)?
4. How might you leverage transfer learning in this scenario, and what would be the advantages?

Support your answers with reasoning that demonstrates your understanding of CNN principles and the practical constraints of mobile deployment.

### Mark's Answer

My approach to the architecture of this CNN model is based on the following assumptions:

1. I am assuming that the dataset of labeled images is quite high-resolution but will be scaled to 256 x 256 for the training step
2. I will err on the side of a large number of kernels because:
    - the model will need to identify a wide range of defect types on a variety of different materials/parts
    - the dataset is relatively large, somewhat reducing the risk of overfitting
3. Finalizing the model architecture will be an iterative process: what I describe below may not be the optimal configuration, but it is a good first attempt

I'm also loosely basing the architecture of this CNN model on a combination of the model in the module 2 example and the VGG-16 model developed by a team at the University of Oxford. 

#### Outline of the Architecture

1. Input layer: 256 x 256 x 3
---
2. Convolution layer: 32 3x3 kernels
3. Activation layer: ReLU
4. Max pooling layer: 2x2, stride 2
---
5. Convolution layer: 64 3x3 kernels
6. Activation layer: ReLU
7. Max pooling layer: 2x2, stride 2
---
8. Convolution layer: 128 3x3 kernels
9. Activation layer: ReLU
10. Max pooling layer: 2x2, stride 2
---
11. Convolution layer: 256 3x3 kernels
12. Activation layer: ReLU
13. Max pooling layer: 2x2, stride 2
---
14. Flattening layer
---
15. Fully connected layer: 512 neurons, ReLU activation
16. Dropout regularization: 0.5 dropout rate
17. Fully connected layer: 128 neurons, ReLU activation
18. Dropout regularization: 0.5 dropout rate
---
19. Final output layer: 1 neuron, sigmoid activation
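
To make the outline concrete, here is a rough sketch (not part of the original design) that walks the layers and tallies output shapes and trainable parameters. The outline does not specify convolution padding, so `padding="same"` is an assumption, and the helper names `conv_out` and `summarize` are hypothetical.

```python
def conv_out(size, kernel=3, padding="same"):
    """Spatial size after a stride-1 convolution."""
    # "same" keeps the size; "valid" shrinks it by kernel - 1.
    return size if padding == "same" else size - kernel + 1


def summarize(input_hw=256, in_channels=3, filters=(32, 64, 128, 256),
              kernel=3, dense_units=(512, 128, 1), padding="same"):
    """Return (layer name, output shape, parameter count) for each layer."""
    rows = []
    hw, ch = input_hw, in_channels
    for f in filters:
        hw = conv_out(hw, kernel, padding)
        # Weights (kernel * kernel * in_channels * filters) plus one bias per filter.
        rows.append((f"conv {f}", (hw, hw, f), kernel * kernel * ch * f + f))
        ch = f
        hw //= 2  # 2x2 max pooling with stride 2 halves each spatial dimension
        rows.append(("maxpool 2x2", (hw, hw, ch), 0))
    features = hw * hw * ch
    rows.append(("flatten", (features,), 0))
    for units in dense_units:
        # Dense weights plus biases; dropout and activations add no parameters.
        rows.append((f"dense {units}", (units,), features * units + units))
        features = units
    return rows


if __name__ == "__main__":
    rows = summarize()
    for name, shape, params in rows:
        print(f"{name:<12} {str(shape):<16} {params:>12,}")
    print("total parameters:", f"{sum(p for _, _, p in rows):,}")
```

Under the same-padding assumption the final pooling layer outputs 16 x 16 x 256, and the first fully connected layer accounts for about 33.6 million of the roughly 34 million parameters. That is worth keeping in mind given the mobile deployment constraint.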

#### Notes and Commentary

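##### Kernel sizes and counts by layer

All 4 convolution layers use 3x3 kernels, with the following kernel counts: 32, 64, 128, 256.

- I kept every kernel at 3x3, following VGG-16: stacking small kernels covers a large receptive field with fewer parameters than a single large kernel
- The kernel count increases to allow the model to detect increasingly complex features in the deeper layers (i.e. hierarchical feature learning)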
- Each convolution layer is followed by an activation layer to introduce nonlinearity and a max pooling layer to downscale feature dimensions
- I selected 4 convolution layers to give the model enough capacity to capture reasonably complex features
    - This is intentionally between the module 2 example and the VGG model, as I believe the part images are more complex than the x-rays but don't need the computational cost of VGG
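
As a rough illustration of hierarchical feature learning, the sketch below computes the receptive field (how many input pixels one output unit can see) for a stack of convolution and pooling layers. It uses the standard recurrence in which each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of the strides so far. The function name `receptive_field` and the `CONV`/`POOL` constants are hypothetical names chosen for this example.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride
    return rf


CONV = (3, 1)  # 3x3 convolution, stride 1
POOL = (2, 2)  # 2x2 max pooling, stride 2

# The four conv + pool blocks from the outline above.
blocks = [CONV, POOL] * 4

if __name__ == "__main__":
    print("two stacked 3x3 convs:", receptive_field([CONV, CONV]))
    print("after four conv/pool blocks:", receptive_field(blocks))
```

Two stacked 3x3 convolutions cover the same 5 x 5 region as a single 5x5 kernel, and with the four conv/pool blocks each unit after the last pooling layer sees a 46 x 46 patch of the 256 x 256 input.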

##### Testing, Adjusting, and Tuning the Model

After training and evaluating this model, I'd look for signs of over- and/or underfitting.

- If the model were overfitting, I would first remove a convolution layer and re-evaluate, then reduce the number of kernels per layer if overfitting were still occurring
    - Both of these steps would reduce the capacity of the model, lessening the chance of it memorizing the training data
- If the model were underfitting, I would first add a convolution layer (with accompanying activation and pooling layers), then increase the number of kernels per layer if underfitting continued
    - These actions would increase the capacity of the model, allowing it to detect more complex features
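
One way to make "signs of over- and/or underfitting" operational is to compare training and validation accuracy at the end of training. The sketch below is a minimal heuristic, not a definitive test: the function name `diagnose_fit` and the thresholds `gap_tol` and `target_acc` are hypothetical values chosen for illustration and would need tuning for the real dataset.

```python
def diagnose_fit(train_acc, val_acc, gap_tol=0.05, target_acc=0.90):
    """Classify a training run from per-epoch accuracy histories.

    train_acc, val_acc: lists of accuracies, one entry per epoch.
    gap_tol: largest acceptable train/validation gap (assumed value).
    target_acc: training accuracy the model is expected to reach (assumed value).
    """
    final_train, final_val = train_acc[-1], val_acc[-1]
    if final_train - final_val > gap_tol:
        # Training accuracy far above validation accuracy: memorizing the data.
        return "overfitting"
    if final_train < target_acc:
        # The model cannot fit even the training data well.
        return "underfitting"
    return "ok"


if __name__ == "__main__":
    print(diagnose_fit([0.80, 0.99], [0.78, 0.85]))
    print(diagnose_fit([0.60, 0.70], [0.60, 0.69]))
```

An "overfitting" result would point to the first bullet above (remove capacity), and an "underfitting" result to the second (add capacity).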

##### Transfer Learning

I would be keen to use transfer learning in this scenario (i.e. reusing knowledge learned on another dataset). The most important part of implementing transfer learning would be finding a comparable dataset and a model that performed well on it.

This would bring a few notable advantages:

1. The model and its architecture should require less adjusting and tuning, since it would already have demonstrated strong performance on a similar task
2. The risk of over- or underfitting would also be lower for the same reason