<a href="https://colab.research.google.com/github/mujahidrj/Artificial-Intelligence/blob/master/HW5/HW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Summarize and describe the different concepts/methods/algorithms that you have learned in this course.

# Section 1: General Concepts

### Artificial Intelligence, ML, and Deep Learning
On the first day of class, I remember Professor Wocjan explaining to us how the subject of Artificial Intelligence is very broad. He said that we were going to be studying deep learning, which is a branch of machine learning, which is in turn a branch of AI. According to John McCarthy, AI is the science and engineering of making intelligent machines. Programs that use machine learning (ML) adjust themselves in response to the data they are fed. There are three main types of learning in ML: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

### Supervised Learning
Supervised Learning is when a model is given training data that is pre-labeled. This way, the model can see the input and the correct output for each piece of data. Supervised learning is usually the format of choice for classification and regression problems. Classification is when the model receives some input and has to figure out which category it fits in, and regression is when the model has to predict a continuous value. A good application of regression is predicting the price of a plot of land given its location, city, and size.

### Unsupervised Learning
Unsupervised Learning is when the model is given training data that is **not** labeled, so it is up to the model to find patterns in the data without the labels. One example is grouping pictures of animals: even if you don't know what type of animal each picture contains, you can look at the features the animals have and group similar pictures together based on them.

### Reinforcement Learning
Reinforcement Learning is when a model interacts with an environment and, after every action it performs, receives feedback on whether the action was good or bad, so it learns by trial and error. One example is a self-driving vacuum that starts being used in a house. Initially, it might miss some spots or hit the wall often, but after some time it will learn where it can go and clean and where it cannot.

# Section 2: Basic Concepts

Some of the basic concepts I learned about in this class were linear regression, logistic regression, and gradient descent.

### Linear Regression

Linear regression is when a model predicts continuous values given some set of independent variables that can be either continuous or discrete. Here, the data is modeled using a straight line (line of best fit), which requires a linear relationship between the independent and dependent variables. One case where linear regression would be helpful is estimating the revenue a company makes given the amount of money it spends on ads. The more ads a company runs, the more people are exposed, and the higher the revenue. The equation of a simple linear regression is $y = b + w_{1}x_{1}$, where $y$ is the label that is being predicted, $b$ is the bias (intercept), $w_1$ is the weight (slope) of the first feature, and $x_1$ is an input or feature that is already known.
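As a minimal sketch of the equation above, the following fits $b$ and $w_1$ for a single feature using the closed-form least-squares solution. The helper name `fit_line` and the ad-spend and revenue numbers are made up for illustration; they are not from the course material.

```python
def fit_line(xs, ys):
    """Return (b, w1) minimizing the squared error of y = b + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov_xy / var_x
    b = mean_y - w1 * mean_x
    return b, w1


# Made-up data generated exactly by y = 5 + 2x (ad spend -> revenue)
ad_spend = [1.0, 2.0, 3.0, 4.0]
revenue = [7.0, 9.0, 11.0, 13.0]

b, w1 = fit_line(ad_spend, revenue)
predicted = b + w1 * 5.0  # predicted revenue for an ad spend of 5
```

Because the toy data lies exactly on a line, the fit recovers the bias 5 and slope 2; with noisy real data the line would only approximate the points.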

### Logistic Regression

Logistic regression is used when classification needs to be done with only 2 possible class values (binary classification). It is built on the logistic function, also known as the sigmoid function, $\sigma(t) = \frac{1}{1+e^{-t}}$, where $t$ is any real input. To make a prediction, the input features are plugged into the logistic regression equation and the output is a decimal between 0 and 1, which is interpreted as the probability of the positive class: the closer the output is to 1, the more likely the input belongs to class 1, and the closer it is to 0, the more likely it belongs to class 0. A really useful application of logistic regression is determining whether an email is spam so it can be automatically labeled as it gets delivered to a person.
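The prediction step described above can be sketched as follows. The function `predict_spam`, the two features, and the weights are hypothetical values chosen for illustration, not a trained spam filter.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)); maps any real t into (0, 1)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # equivalent form that avoids overflow for very negative t
    e = math.exp(t)
    return e / (1.0 + e)


def predict_spam(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one email; label 1 means spam."""
    t = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(t)
    return p, int(p >= threshold)


# Hypothetical weights for two made-up features
weights = [1.5, -2.0]
bias = -0.5
p, label = predict_spam([2.0, 0.5], weights, bias)
```

Here $t = -0.5 + 1.5 \cdot 2.0 - 2.0 \cdot 0.5 = 1.5$, so the probability is above the 0.5 threshold and the email is labeled as spam.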

### Gradient Descent

Gradient descent is arguably the heart of almost all ML algorithms. It is an optimization algorithm that is used to find the parameters (also known as coefficients) of a function that minimize the cost. The first coefficient tested is usually a small number like 0, which is plugged into the function to compute the cost. Then, the derivative of the cost is calculated to find the slope, and the coefficient is moved in the direction opposite to the slope by an amount scaled by the learning rate $\alpha$. This process repeats until the cost is very low or stops decreasing. The learning rate $\alpha$ is crucial: if it is too small, learning will take a very long time, and if it is too large, the updates can overshoot the minimum and fail to converge.
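To illustrate the effect of the learning rate, here is a small sketch on a toy one-parameter cost $J(w) = (w - 3)^2$. The cost function, starting point, and three learning rates are assumptions picked for illustration.

```python
def gradient_descent(grad, start, lr, steps):
    """Repeatedly step against the gradient and return every visited value."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(w)
        path.append(w)
    return path


def grad(w):
    """Derivative of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2 * (w - 3)


good = gradient_descent(grad, start=0.0, lr=0.1, steps=50)       # converges near 3
slow = gradient_descent(grad, start=0.0, lr=0.001, steps=50)     # barely moves
diverging = gradient_descent(grad, start=0.0, lr=1.1, steps=50)  # overshoots further each step
```

With `lr=0.1` the distance to the minimum shrinks by a factor of 0.8 per step, with `lr=0.001` it shrinks by only 0.998 per step, and with `lr=1.1` it grows by a factor of 1.2 per step.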

The following code block shows an example of mini-batch gradient descent with a learning rate (step size) of 0.01, where each gradient is computed from a small batch of training samples (here, 2 at a time) instead of the whole training set.

In [0]:
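import numpy as np

# Assumed setup (not part of the original cell): synthetic data for y = 4 + 3x + noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # prepend a column of ones for the bias term

epochs = 20
lr = 0.01  # learning rate
batch_size = 2

# random initial weights: one for the bias column, one for the feature
weight = np.random.randn(2, 1)
weight_path_mgd = [weight]  # record every weight visited during descent

for epoch in range(epochs):
    # reshuffle every epoch so the mini-batches differ between epochs
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, batch_size):
        xi = X_b_shuffled[i:i+batch_size]
        yi = y_shuffled[i:i+batch_size]
        # gradient of the halved mean squared error on this mini-batch
        gradient = 1 / len(xi) * xi.T.dot(xi.dot(weight) - yi)
        weight = weight - lr * gradient
        weight_path_mgd.append(weight)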

# Section 3: Building a Model

There are many types of networks currently used: Dense-layered models, Convolutional Neural Networks, Recurrent Neural Networks, and more. Each network has a purpose that it excels at. For example, Convolutional Neural Networks (CNNs) are a category of neural networks that have proven to be very effective at image recognition and classification. Every CNN has the same backbone: convolution, non-linearity, pooling, and classification.

### Structure of a CNN

1.   Convolution is used to extract features from an input image. It learns the features by using small squares of input data. Since every image is just a matrix of pixel values, convolution can be performed by sliding a filter (kernel) across the image, moving by a fixed stride each step. The resulting matrix is called the **Feature Map**.
2.   Non-Linearity is applied after every convolution operation. ReLU, the most common choice, replaces all negative values in the Feature Map with 0. Because convolution itself is a linear operation, this step is what lets the network model non-linear relationships.
3.   Pooling reduces the size of the feature map while retaining the most important information. There are three common types: Max, Average, and Sum. Max pooling is the most common and takes the largest element from a small window. Pooling is done so that the size of the input is reduced and more manageable.
4.   Classification is done by a fully connected layer, which means that each neuron is connected to every neuron in the previous layer, followed by a softmax activation function. This layer combines all the high-level features that the input image contains. The softmax function takes all the scores and converts them into a set of values that add up to 1, with each value being the probability that the image belongs to a given class.
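The four steps above can be sketched in plain Python on a tiny image. This is only an illustration under assumed values: the 5x5 image, the edge-detecting kernel, and the hand-picked fully connected weights are all hypothetical, and a real CNN would learn its kernels and weights during training.

```python
import math


def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    out_h, out_w = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# 5x5 toy image with a vertical edge between columns 1 and 2
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
kernel = [[-1.0, 1.0], [-1.0, 1.0]]  # responds to dark-to-bright vertical edges

feature_map = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool(relu(feature_map))     # 2x2 after ReLU and max pooling
flat = [v for row in pooled for v in row]

# Fully connected layer with hypothetical, hand-picked weights for 2 classes
dense_weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
scores = [sum(w * x for w, x in zip(ws, flat)) for ws in dense_weights]
probs = softmax(scores)
```

The kernel only produces a non-zero response where the edge is, pooling shrinks the 4x4 feature map to 2x2, and softmax turns the two class scores into probabilities.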

The code blocks below show examples of a linear (sequential) stack of layers. The first is a dense-layered model and the second is a CNN. In both, the hidden layers apply a ReLU (Rectified Linear Unit) transformation to their outputs so that nonlinearity is introduced into the model.


In [0]:
%tensorflow_version 2.x
import tensorflow as tf

def build_model():
    # build a small dense model for binary classification
    model = tf.keras.models.Sequential()

    model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(10000,)))
    model.add(tf.keras.layers.Dense(16, activation='relu'))
    # a single sigmoid unit outputs the probability of the positive class
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

    return model

In [0]:
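# Creating the Convolutional Neural Network
from keras import layers
from keras import models

cnn = models.Sequential()
# input_shape is an assumption (32x32 RGB images); adjust it to the dataset
cnn.add(layers.Dropout(rate=0.3, input_shape=(32, 32, 3)))
cnn.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
# a 2x2 window halves the height and width (a 1x1 window would leave them unchanged)
cnn.add(layers.MaxPooling2D((2, 2)))

cnn.add(layers.Dropout(rate=0.3))
# flatten the feature maps into a vector before the fully connected layers
cnn.add(layers.Flatten())
cnn.add(layers.Dense(128, activation='relu'))
cnn.add(layers.Dense(64, activation='relu'))
cnn.add(layers.Dense(10, activation='softmax'))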

Defining a Convolutional Neural Network (CNN) is different in that several specific kinds of layers have to be included to make a proper CNN.

### Components of a CNN

As mentioned earlier, the key components of a CNN are the Convolution layer and the Max Pooling layer. The CNN looks at the data on a "feature level" and uses these layers to filter for more specific features as the layers get deeper.

Dropout layers were also used in this model. Dropout helps prevent overfitting by randomly setting a fraction of the activations to zero during training, so the network cannot rely too heavily on any single neuron.
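A minimal sketch of what a dropout layer does, assuming the "inverted dropout" variant in which the surviving activations are scaled up during training. The function name `dropout` and the input values are hypothetical, and this is not the Keras implementation.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(activations)  # at inference time values pass through unchanged
    keep = 1.0 - rate
    # surviving activations are scaled by 1/keep so the expected total is unchanged
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.3, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)  # roughly 0.3
unchanged = dropout(activations, rate=0.3, training=False)
```

With `rate=0.3`, roughly 30% of the activations are zeroed on each training pass, and a different random subset is dropped every time.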

# Section 4: Compiling a Model

Once the model is made, the hyperparameters and arguments need to be chosen. Once that is complete, the model can be compiled. The arguments that must be chosen are the following:

* Learning rate: A tuning parameter that determines the step size of each iteration while trying to minimize the loss function. A good learning rate usually has to be found by experimenting with several values.
* Loss function: The function that is going to be minimized during training. It measures how far the model's prediction is from the true value. Some examples of loss functions are mean squared error, squared hinge, and binary cross-entropy.
* Optimizer: Uses the gradient of the loss function to determine how to adjust the model parameters to get a closer prediction. Some examples of optimizers are Stochastic Gradient Descent and Adagrad.
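To make the loss function concrete, here is a sketch of two of the losses named above written in plain Python. The labels and predictions are made-up values, and the function names are illustrative rather than the Keras implementations.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

mse_right = mean_squared_error(labels, mostly_right)
bce_right = binary_cross_entropy(labels, mostly_right)
bce_wrong = binary_cross_entropy(labels, mostly_wrong)
```

Predictions close to the true labels give a small loss and confident wrong predictions give a large one, which is the signal the optimizer uses to adjust the parameters.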



In [0]:
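# Model Compilation
from keras import optimizers

model.compile(
    loss='binary_crossentropy',
    # `learning_rate` replaces the deprecated `lr` argument name
    optimizer=optimizers.RMSprop(learning_rate=2e-5),
    metrics=['acc'])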


The above code sets the hyperparameters; these are always chosen before the model is compiled. Since the loss function is ```loss='binary_crossentropy'```, the type of problem is binary classification, and the last layer uses a sigmoid activation. The learning rate is fairly low at $2 \times 10^{-5}$, so each update changes the weights only slightly.

# Section 5: Training a Model


Once the model has been created and compiled, the training process may begin. During training, all 3 of the choices made at compile time come together: the loss function, optimizer, and learning rate. Together they slowly adjust the parameters of the model in order to get the best result.

The goal when training a model is for it to generalize well, meaning it performs well on data it has not seen before. Keras makes training extremely easy because we can just use the method `model.fit_generator()`. However, training comes with two potential pitfalls: Underfitting and Overfitting.

Each network usually trains in a similar manner: it starts with some arbitrary weights and biases at each neuron, and these are adjusted as the model is trained. The models are typically trained using examples and labels (supervised learning). However, models can also be trained using other methods like unsupervised learning and reinforcement learning, as mentioned earlier.

Normally there are two types of data that a model uses: Training and Validation. The training data is used to adjust the weights and biases so that the model learns the patterns in the data. After that, the model is evaluated on the validation set. This data is not used during training and is completely new to the model.
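As a small sketch of how the two sets can be kept separate, the following shuffles a dataset and holds out a fraction for validation. The helper `train_validation_split`, the 80/20 ratio, and the stand-in data are assumptions for illustration.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and return (training, validation) lists with no overlap."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


data = list(range(100))  # stand-in for 100 labeled examples
train, val = train_validation_split(data)
```

Shuffling before splitting keeps both sets representative of the whole dataset, and no example ends up in both sets.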

1.   Underfitting - The model is too simple and cannot capture the patterns in the training data, so it performs poorly even on the training set. This can be seen when the training accuracy remains stagnant at a low value or decreases. To reduce underfitting, more layers can be added to create a more complex model.

2.   Overfitting - The model is too complex, so it fits the training data too closely and essentially memorizes it. This can be recognized when the training accuracy continues to increase while the validation accuracy remains stagnant and ceases to improve. To combat overfitting, use more training data, lower the number of layers, or add dropout. Removing features that aren't important also helps by discarding irrelevant information.
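The two symptoms above can be turned into a rough check on the accuracy curves. This is only a sketch: the function `diagnose`, the thresholds `low` and `gap`, and the accuracy lists are all made up for illustration, and real curves usually need to be inspected by eye as well.

```python
def diagnose(train_acc, val_acc, low=0.7, gap=0.1):
    """Give a rough label for a training run from its final accuracies.

    `low` and `gap` are arbitrary illustrative thresholds, not standard values.
    """
    final_train, final_val = train_acc[-1], val_acc[-1]
    if final_train < low:
        return "underfitting"    # poor even on the training data
    if final_train - final_val > gap:
        return "overfitting"     # training keeps improving, validation does not
    return "reasonable fit"


# Made-up per-epoch accuracies for three hypothetical runs
stuck = diagnose([0.52, 0.55, 0.55], [0.50, 0.54, 0.53])
memorized = diagnose([0.80, 0.92, 0.99], [0.78, 0.80, 0.80])
balanced = diagnose([0.75, 0.85, 0.90], [0.74, 0.83, 0.87])
```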

In [0]:
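# train
# train_generator and validation_generator are assumed to have been created
# earlier (not shown), each yielding batches of (inputs, labels).
# Note: fit_generator is deprecated in newer TensorFlow/Keras versions,
# where model.fit accepts generators directly.
history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,   # batches drawn from the generator per epoch
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50    # validation batches evaluated after each epoch
)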

# Section 6: Finetuning a Pretrained Model

Before building a model from scratch to solve a problem, you can look to see whether a suitable model already exists and tailor it to your own needs. This is where fine tuning comes in. Fine tuning is similar to transfer learning in that both reuse a model that has previously been trained. First, new layers are added on top of the pre-trained model, and the pre-trained model is frozen so that its learned weights are not lost. Then, only the new final layers are trained. Once this is done, the pre-trained model is unfrozen and its weights and biases can be modified to create a more specific/accurate result.

Fine tuning and transfer learning are very useful because they don't require the programmer to create an entirely new model and train it from scratch. They allow the programmer to use tried and true models that have previously been trained, which can result in greater accuracy.

In [0]:
from keras.applications import Xception
from keras import layers
from keras import models
from keras import optimizers

conv_base = Xception(
    weights='imagenet', 
    include_top=False, 
    input_shape=(150, 150, 3))

# Shows all the available layers
conv_base.summary()

conv_base.trainable = False

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()

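# ... compile and train the model here with conv_base frozen (not shown) ...

# Unfreeze only the layers from 'block13_sepconv2' onward; earlier layers stay frozen
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block13_sepconv2':
        set_trainable = True
    layer.trainable = set_trainable

# Recompile the model afterwards so the new trainable settings take effect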

Here, the `conv_base` that contains the pre-trained Xception model is frozen. After that, more layers are appended, including the classification layer, and the combined model is then trained.

Once the model is trained (not shown), the later layers of `conv_base` (from `block13_sepconv2` onward) can then be unfrozen and tuned to properly fit and generalize to the new dataset.