<a href="https://colab.research.google.com/github/marloncalvo/cap4630-spr2020/blob/master/Homework5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Concepts Learned From COP4630

## General Concepts

From this course, I gained an understanding of what Artificial Intelligence, Machine Learning, and Deep Learning entail and how they apply to real-world problems.


### Artificial Intelligence
Artificial Intelligence (AI) is the "big umbrella term" for techniques that allow machines to make decisions based on "sensory" data. AI is implemented through various algorithms that provide some means of making a decision based on environmental perceptions. Since this is a very broad term, it is used as a descriptive category, much like "computer science" is.

### Machine Learning
Machine learning is a type of AI technique that uses the idea of "learning" to solve problems automatically. ML models require no explicit description of how to solve the problem; instead, they find a solution by finding patterns in their "environment" (the data). Many ML models, including the ones discussed here, use networks loosely inspired by the architecture of the human brain, passing data through these networks to make decisions about given inputs.


### Deep Learning
Deep learning is a technique within machine learning that employs networks composed of multiple "inner" (hidden) layers, in contrast to shallow networks that use a single "inner" layer. Deep learning models can extract much richer information from the input data and often generalize concepts better, and as such they are among the most successful AI techniques in use today.


## Basic Concepts

These are several basic, but key, concepts that machine learning models use for decision making.




### Gradients and Gradient Descent

In Calculus 3, you learn about gradients and their use in 3D geometry. A gradient collects the derivative, or rate of change, along every axis "at the same time". With it, one can find the rate of change in any direction, as well as the direction of the greatest rate of change. This is important in machine learning because the gradient generalizes the 1D derivative: it tells us in which direction to update the model's weights so that the error shrinks. Concretely, gradients are used in gradient descent, where the ML model computes the gradient of its error (some function f that measures the difference between the predicted value and the expected value) and then moves the weights in the opposite direction, toward the point where the error is smallest (ideally zero).
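
As a minimal sketch of the idea above, the following applies gradient descent to the toy error function $f(x, y) = x^2 + y^2$, whose gradient is $(2x, 2y)$. The function, the starting point, and the learning rate are illustrative assumptions and are not taken from the course material.

```python
import numpy as np

def f(p):
    # Toy error surface: a bowl with its minimum (zero error) at the origin.
    return float(np.sum(p ** 2))

def grad_f(p):
    # Gradient of f: one partial derivative per axis.
    return 2 * p

def gradient_descent(start, lr=0.1, steps=100):
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * grad_f(p)  # step against the gradient
    return p

p_start = np.array([3.0, -4.0])
p_final = gradient_descent(p_start)
print(f(p_start), f(p_final))
```

Each step moves the point a little further "downhill", so the error after the loop is far smaller than the error at the starting point.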

### Linear Regression

A line can be described by the following formula:
$$ y = mx + b $$

For data that follows some linear relationship, ML models can perform linear regression on the input data so that predictions can be made for future inputs. For the model to perform well, it is important that the input features have a roughly linear relationship with the value being predicted. One way to perform linear regression is to find the error between the model's predicted value and the actual value, and then square that error. As a function of the model's weights, this squared error forms a bowl-shaped curve (an "upside-down" bell), so the gradient can be computed and used to adjust the weights until the error converges toward its minimum (0 means no error!). Here's how you can do so.

In [0]:
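import numpy as np

def train(ws, xs, ys, epochs=100, lr=0.5):
  # e.g. ws: (d, 1) weights, xs: (n, d) inputs, ys: (n, 1) targets
  n = ys.shape[0]

  for epoch in range(epochs):
    # Average gradient of the squared error: X^T (X w - y) / n
    gradient = 1/n * np.matmul(xs.T, np.matmul(xs, ws) - ys)
    # Step against the gradient, scaled by the learning rate
    ws = ws - lr * gradient

  return ws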

In this piece of code, we first compute the error between the predicted values and the expected values, `np.matmul(xs, ws) - ys`. Multiplying by `xs.T` and dividing by `n` turns this into the average gradient of the squared error with respect to the weights. We then update the weights by subtracting the gradient, scaled by the learning rate `lr`. This lets the weights slowly converge to values that best predict y for a given input x.
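
To make the behavior concrete, here is a small usage sketch that fits the line $y = 2x + 1$. The synthetic data, the bias column of ones, and the choice of 2000 epochs are assumptions made for illustration; the `train` function is repeated so that the snippet runs on its own.

```python
import numpy as np

def train(ws, xs, ys, epochs=100, lr=0.5):
    # Same update rule as the cell above, repeated so this snippet runs alone.
    n = ys.shape[0]
    for epoch in range(epochs):
        gradient = 1 / n * np.matmul(xs.T, np.matmul(xs, ws) - ys)
        ws = ws - lr * gradient
    return ws

# Synthetic data drawn from the line y = 2x + 1 (no noise).
x = np.linspace(0, 1, 50).reshape(-1, 1)
ys = 2 * x + 1

# Prepend a column of ones so the first weight acts as the intercept b.
xs = np.hstack((np.ones_like(x), x))
ws = train(np.zeros((2, 1)), xs, ys, epochs=2000)

b, m = ws.ravel()
print(f"b = {b:.3f}, m = {m:.3f}")
```

With noise-free data the learned intercept and slope should land very close to the true values of 1 and 2.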

### Logistic Regression

When the goal is to assign inputs to categories rather than to predict a continuous value, logistic regression can be employed to draw a "separating" line between the categories. This separation is useful for making decisions based on some input. The algorithm is similar to linear regression, but it passes the model's linear output through the logistic (sigmoid) function and uses a different error formula. Note that the separating line itself is still linear in the input features; the non-linearity lies in how the model's output is mapped to a probability.

The sigmoid function maps any input in $(-\infty, \infty)$ into $(0, 1)$, and is equal to $1/2$ at 0. This lets us map inputs into two categories, one on each side of $1/2$. The activation function is applied to the model's raw predictions, mapping every value to either $<0.5$ or $>0.5$, which the model can then use to make a decision. Furthermore, to use this activation function effectively, a cross-entropy loss is used to compute the gradient. It doesn't make much sense to update based only on the raw difference between a "TRUE" label and the computed probability. Instead, the cross-entropy function measures the error in terms of probabilities, penalizing confident wrong predictions heavily.
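
The two ingredients described above can be sketched in a few lines of NumPy. The function bodies below are a guess at what helpers such as `binary_crossentropy` might look like; their actual definitions are not shown in this notebook, and the `eps` clipping value is an assumption added to avoid `log(0)`.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(a, y, eps=1e-12):
    # Mean cross-entropy between predicted probabilities a and 0/1 labels y.
    a = np.clip(a, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(a) + (1 - y) * np.log(1 - a)))

scores = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(scores)

good_labels = np.array([0.0, 1.0, 1.0])  # agree with the confident predictions
bad_labels = np.array([1.0, 1.0, 0.0])   # contradict the confident predictions

print(probs)
print(binary_crossentropy(probs, good_labels))
print(binary_crossentropy(probs, bad_labels))
```

The loss is much larger when confident predictions turn out to be wrong, which is the behavior the cross-entropy is chosen for.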

In [0]:
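def train(self, X, Y, lr=0.1, epochs=10, validation_split=0):
    # Assumes `import numpy as np`; predict, binary_crossentropy, and prediction_accuracy are helpers defined elsewhere.
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # prepend a bias column of ones
    W = np.ones((X.shape[1], 1))                  # one weight per column, including the bias

    N = len(Y)

    for epoch in range(epochs):
        # Forward pass: predicted probabilities for every row of X
        A = predict(W, X)

        loss = binary_crossentropy(A, Y)
        acc  = prediction_accuracy(A, Y)

        gradient = (X.T @ (A - Y)) / N  # average gradient of the cross-entropy loss
        W -= gradient * lr

    return W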


Similar to linear regression, we compute the gradient in line 15, but here we use the new error term `A - Y`, the difference between the predicted probabilities and the labels.
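
To see where `A - Y` comes from, here is a short derivation for a single example. It assumes that `predict` applies the sigmoid to the linear score and that `binary_crossentropy` is the standard binary cross-entropy; neither definition is shown in this notebook, so treat the symbols below as assumptions.

$$
\begin{aligned}
z &= w^\top x, \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}} \\
L &= -\left[\, y \log a + (1 - y) \log (1 - a) \,\right] \\
\frac{\partial L}{\partial a} &= \frac{a - y}{a (1 - a)}, \qquad \frac{\partial a}{\partial z} = a (1 - a) \\
\frac{\partial L}{\partial w} &= \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = (a - y)\, x
\end{aligned}
$$

Averaging $(a - y)\,x$ over all $N$ examples gives, in matrix form, $\frac{1}{N} X^\top (A - Y)$, which matches the `gradient` expression in the code.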

## Building a Model

In the previous section, we discussed the ideas behind very basic ML concepts and implemented basic models based on them. Here, we will discuss much more complicated models.

### Convolutional Networks

For data that has many "sub-features" within a single input, such as an image, it often makes more sense for the model to capture the features automatically than to hand-craft features for it. Convolutional networks do this by generating features from "windows" of the original data and then making decisions based on what those windows contain. Building a convolutional network correctly takes a couple of steps.

#### "Fixing" Input

The first step is to clean the input, removing or repairing any data that would make it difficult for the model to converge efficiently. If the images are too large, noisy, poorly contrasted, etc., this is the stage where such issues are resolved. You can also normalize the image data so that all pixel values fall in the $[0,1]$ range. Additionally, this is where you separate the data into a training set, a validation set, and a test set.
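
The normalization and splitting steps can be sketched as follows. The helper names `normalize_images` and `split_dataset`, the 80/10/10 split, and the random stand-in images are all hypothetical choices made for illustration; they assume 8-bit pixel data.

```python
import numpy as np

def normalize_images(images):
    # Scale 8-bit pixel values from [0, 255] to [0, 1].
    return images.astype("float32") / 255.0

def split_dataset(x, y, val_frac=0.1, test_frac=0.1, seed=0):
    # Shuffle once, then carve out test, validation, and training subsets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    n_val = int(len(x) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((x[train_idx], y[train_idx]),
            (x[val_idx], y[val_idx]),
            (x[test_idx], y[test_idx]))

# Stand-in data: 100 random 8x8 grayscale "images" with labels 0-9.
data_rng = np.random.default_rng(0)
images = data_rng.integers(0, 256, size=(100, 8, 8), dtype=np.uint8)
labels = data_rng.integers(0, 10, size=100)

x = normalize_images(images)
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_dataset(x, labels)
print(train_x.shape, val_x.shape, test_x.shape)
```

Shuffling before splitting keeps the three sets comparable, and the test set is set aside here so that it is only touched once at the very end.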

#### Training Model

After the input has been improved to enhance the model's performance, you can start training your model. You have the choice to build a network architecture from scratch or to use one of several well-known network architectures. In general, the network will have at least one convolutional layer followed by a max pooling layer, whose output is piped to a classification layer that performs the actual classification (making a decision on your problem!). It is important that you train your model on your training data and use a separate validation set, which lets you evaluate your model's performance without biasing it toward your test set.





In [0]:
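from tensorflow.keras import layers, models

model = models.Sequential()
# Two convolution + pooling stages extract features from the image
model.add(layers.Conv2D(16, (4, 4), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (4, 4), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
# Flatten the feature maps and classify into 10 classes
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))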

This example is a simple convolutional network that extracts features from the original input data and classifies each input into one of 10 classes.

More advanced convolutional networks are organized into blocks of layers. The input flows through these blocks, each of which performs some sort of filtering / extraction, and each plays a major role in extracting the information needed for classification. Some examples of convolutional networks that employ such an architecture are the `MobileNetV2` and `VGG16` networks.

## Compiling a Model

Several important ideas were omitted from the previous model explanation, where we only discussed how network architectures are used in ML. It is also important to choose appropriate optimizers, activation functions, regularizers, and learning rates.

#### Optimizers

We previously made use of loss functions and weight updates without defining them explicitly. The squared error used in linear regression and the cross-entropy used in logistic regression are two different loss functions; the optimizer is the procedure, such as gradient descent, that uses the gradient of the loss to update the model's weights as it trains on the input data. Loss functions are specific to certain situations, such as `binary_crossentropy` for binary classification problems and `categorical_crossentropy` for multi-class classification problems. The differences between them, and their potential usage, can be studied in the Keras documentation or through a Google search. Learning rates are used in gradient descent to control how much a model can update its weights in one batch. This is very important because the learning rate scales the gradient that is applied to the model's weights: a well-chosen rate avoids, and a poorly chosen one causes, gradient descent updates that diverge or cycle.
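
The effect of the learning rate can be seen on a one-weight toy problem. The sketch below runs gradient descent on $f(w) = w^2$ (gradient $2w$) with three different rates; the function and the specific rates are assumptions picked to show the three behaviors mentioned above.

```python
def descend(lr, w=1.0, steps=20):
    # Gradient descent on f(w) = w**2, whose gradient is 2*w.
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = descend(lr=0.1)   # shrinks steadily toward the minimum at 0
cyclic = descend(lr=1.0)  # flips between +1 and -1 without ever improving
large = descend(lr=1.1)   # overshoots further on every step
print(small, cyclic, large)
```

Only the smallest rate makes progress here; the other two either bounce between the same two points or move away from the minimum.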

#### Activation Functions

We saw that for logistic regression we had to apply the sigmoid activation function to "morph" the model's output into a probability range, signifying the probability of belonging to some class. Layers can be instructed to use an activation function, which affects the output read in by the subsequent layer. A common activation function is the ReLU operation (`relu`) seen in the previous code example. It is very common in deep learning because it reduces a problem seen with the sigmoid activation function, the vanishing gradient (imagine how tiny the sigmoid's gradient becomes for inputs of very large magnitude!). These activation functions apply non-linearity to the data passing through the network, as seen before, so that we can perform classification on input data.
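
A quick numerical comparison illustrates the vanishing-gradient point. The sample inputs below are arbitrary, and the helper functions are plain NumPy sketches rather than the Keras implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), at most 0.25.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise.
    return (z > 0).astype(float)

z = np.array([0.5, 5.0, 20.0])
print(sigmoid_grad(z))  # shrinks rapidly as the input grows
print(relu_grad(z))     # stays at 1 for every positive input
```

Because the sigmoid's gradient is never larger than 0.25 and collapses toward zero for large inputs, multiplying many such gradients together across layers leaves very little signal, while ReLU passes the gradient through unchanged for positive inputs.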

#### Regularizers

While learning rates provide a means to reduce the magnitude of the gradient updates, sometimes it is useful to penalize very large weights directly. Since models are just a bunch of weights, the complexity of a model can be gauged by how many of its weights have a high magnitude. The idea behind regularizers is to control how a model updates its weights when it moves toward larger values. By limiting this complexity, models tend to overfit less (which will be discussed later!). Excess complexity in ML models is seen as bad, because overly complex models tend not to generalize well to new data.
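
One common regularizer is the L2 penalty, which adds the squared size of the weights to the loss. The sketch below is an assumed, minimal NumPy version (the names `l2_penalty`, `regularized_step`, and the strength `lam` are hypothetical); it sets the data gradient to zero to isolate what the penalty alone does.

```python
import numpy as np

def l2_penalty(w, lam):
    # Extra loss term that grows with the squared size of the weights.
    return lam * float(np.sum(w ** 2))

def regularized_step(w, grad_loss, lr=0.1, lam=0.01):
    # The penalty's gradient is 2 * lam * w, so it pulls each weight toward zero.
    return w - lr * (grad_loss + 2 * lam * w)

w = np.array([5.0, -3.0, 0.1])
no_data_grad = np.zeros_like(w)  # isolate the effect of the penalty
w_next = regularized_step(w, no_data_grad)
print(w_next)
```

Every weight shrinks slightly toward zero, and larger weights shrink by a larger absolute amount, which is how the penalty discourages high-magnitude weights.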

## Training a Model

After building a model and compiling it, you must now train the network and assess its performance!

### Assessing Model Performance

During the training of your model, you should consistently evaluate the model's loss and accuracy on the training and validation sets. First of all, the loss tells you how well the model is predicting the values for its inputs. A high loss means that it is not predicting close to the expected value, and a low loss means the opposite. This metric is important to follow, as it is a good indicator for assessing whether your model is improving. Accuracy, on the other hand, describes how often the model made the correct decision across the inputs. Notice that a model with high loss can still have high accuracy if it consistently makes the right decisions, but only by a narrow margin; it is deciding correctly without being confident in its predictions.

When assessing your model's performance, you should watch how the loss and accuracy change for both the validation and training data. If the validation loss has stopped changing, then you've hit the limit of your current options and must re-evaluate your parameters / architecture. If the loss on the training data is decreasing but the loss on the validation data is flat or increasing, then you are likely overfitting to your training data. Overfitting occurs when your model becomes too specific to the training data and fails to generalize the features that are important for solving other arbitrary inputs. Furthermore, if your model's loss is still dropping when training finishes, then you are likely underfitting. This occurs when your model has not successfully converged and requires more time to converge.
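
The difference between loss and accuracy is easy to see with made-up numbers. In this sketch, both sets of predictions (invented for illustration) classify every example correctly, yet their losses differ sharply.

```python
import math

def binary_crossentropy(probs, labels):
    # Mean cross-entropy between predicted probabilities and 0/1 labels.
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(probs, labels):
    # A prediction counts as correct when it falls on the label's side of 0.5.
    correct = sum((p > 0.5) == (y == 1) for p, y in zip(probs, labels))
    return correct / len(labels)

labels = [1, 0, 1, 0]
hesitant = [0.51, 0.49, 0.51, 0.49]   # right, but only barely
confident = [0.99, 0.01, 0.99, 0.01]  # right, and sure about it

print(accuracy(hesitant, labels), binary_crossentropy(hesitant, labels))
print(accuracy(confident, labels), binary_crossentropy(confident, labels))
```

Accuracy only counts which side of 0.5 a prediction falls on, while the loss also reflects how far each probability is from the true label.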

### Updating Model's Hyperparameters to Improve Performance

Depending on your model's state, you must tune several hyperparameters to improve performance.

#### During Underfitting

If your model is underfitting, the usual fix is to let it train for longer. By increasing the number of epochs, you allow the model to keep training until it successfully converges. Other, less safe, options include increasing the learning rate, which decreases the number of iterations required to converge but can lead to situations where the model is unable to converge (jumping from side to side during gradient descent). You may also give your model more capacity (more layers, more neurons per layer) so that it can represent the patterns it needs in order to converge.


#### During Overfitting

This typically occurs when your model has too much capacity or trains for too long, specializing its weights to fit mainly the training data. Here, you need to re-evaluate your model, specifically how much capacity you are providing and how long your model runs. One simple solution is to reduce how many neurons some of your layers use, so that your model must generalize to achieve the same performance. Techniques such as regularization and dropout can also be added to limit how much your model can "catch on" to specifics in the training data.
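
As an illustration of dropout, here is a small NumPy sketch of the "inverted dropout" formulation. It is an assumed stand-in for what a library layer such as Keras's `layers.Dropout` does during training, not its actual implementation; the function name and the all-ones activations are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # Inverted dropout: zero a random fraction of units during training and
    # rescale the survivors so the expected activation is unchanged.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 10))

dropped = dropout(acts, rate=0.5, rng=rng)
zero_fraction = float((dropped == 0).mean())
print(zero_fraction, float(dropped.mean()))

# At inference time the activations pass through untouched.
unchanged = dropout(acts, rate=0.5, rng=rng, training=False)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit to memorize a specific training example.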

#### Things to Watch Out For

When updating the hyperparameters, you must be careful not to overdo it. The more you tune, the more you are "fitting" to your training and validation sets. With enough rounds of hyperparameter tuning, you may get your model to perform well on the training and validation data, but because it may have overfit to both of those datasets (you indirectly caused overfitting to the validation data), it can perform poorly on the final test set.

## Finetuning a Pre-Trained Model

When discussing building a model, we mentioned that we can use pre-existing network architectures. I will describe several techniques for adapting an imported architecture to your problem.

### Unfreezing Layers

As the network you import is modifiable, you should train some of its layers to conform to the problem you are solving. It is important that you learn about the network architecture you are importing, and then unfreeze the layers that you deem important for the model to perform well on your problem. When you reuse a pre-trained model, its layers are typically kept frozen. This means that while you are training your model, the weights of the layers in the imported model are not updated. If you want to update them, you have to unfreeze them, after which they will be updated during training. A specific example of which layers to unfreeze can be seen in `MobileNetV2`, where I unfroze the compression and decompression layers that are critical for extracting most of the features from the input and filtering those which are most important. A model can only tell what is important, and how to properly extract that data, if it is trained well on your problem.

### Piping to Classification Layers

Another good idea is to pipe the last layer's output into your own network, so that you can use that data to perform classification. There are several techniques for classifying this data. One idea is to filter the data down some more to improve generalization, and then connect it to several dense layers that perform the final classification. You can also expand from the output layer and perform convolution again to extract certain features from the layer's output. In any case, you must examine how generalization or specificity will impact your model after using the pre-trained model's output.
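
The unfreezing and classification-head steps described in this section could be sketched in Keras roughly as follows. This is an illustration under assumptions, not the exact setup used in the course: the function name `build_finetune_model`, the choice to unfreeze the last 20 layers, the 96x96 input size, and the size of the dense head are all hypothetical. Note that passing `weights="imagenet"` downloads the pre-trained weights on first use.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_finetune_model(num_classes=10, input_shape=(96, 96, 3),
                         weights="imagenet", unfreeze_last=20):
    # Pre-trained feature extractor without its original classifier.
    base = MobileNetV2(input_shape=input_shape, include_top=False,
                       weights=weights)

    # Freeze everything except the last `unfreeze_last` layers.
    base.trainable = True
    for layer in base.layers[:-unfreeze_last]:
        layer.trainable = False

    # New classification head fed by the base network's output.
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),  # filter the feature maps down
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model
```

Which layers are worth unfreezing depends on the architecture and the problem, so the fixed "last 20 layers" rule here is only a placeholder for a choice you would make after studying the network.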