[Link to this document's Jupyter Notebook](./0415-PROJECT_Part2.ipynb)

# Project Part 2: Benchmark and Optimization

In this milestone you will provide a report that includes the results of your benchmark and optimization.  Your report will include benchmark speeds on a single core, a description of what you did to speed up the code, and a graph of how much improvement you made over the benchmark.  Your final report should include the following:

- Project Title
- Motivating Image
- Abstract
- Methodology
- Timing Results
- Concluding Discussion and Future Work
- References


To help you out, your instructor has provided the following template:


---- START TEMPLATE ----

*Note:* This topic is different from my originally proposed topic of using reinforcement learning to teach a machine to play the first level of Super Mario Bros. Unfortunately, that project was not multi-GPU parallelizable, and since the default code already has GPU support, I would not have done any of the parallelization myself. However, this does not mean I did not get the code to work. Please see `./MariOh/README.md` for more information and instructions on how to make the HPCC play Mario and watch it while it does!

# Part 2 - PyTorch Classifier with Convolutional Neural Network

By "Brandon McIntyre"

 <img src="https://developers.google.com/machine-learning/practica/image-classification/images/cnn_architecture.svg" alt="Diagram of a convolutional neural network architecture" width="100%">

Image from: https://developers.google.com/machine-learning/practica/image-classification/images/cnn_architecture.svg

---
# Abstract

**Domain**  

Convolutional Neural Networks (CNNs) are a part of a broader class of machine learning called "Deep Learning". Deep learning is a type of learning, or more specifically a computational model, that involves creating what is known as an artificial neural network (ANN). An ANN is essentially a collection of perceptrons connected by weights. A perceptron is meant to abstractly mimic a neuron in the brain (which has an action potential and only fires when it passes a threshold): it only fires if the weights and values sent to it combine to pass a threshold. In our case this creates the unique opportunity for the perceptrons to work as a system that takes an input image and guesses which class that image belongs to. This takes the approach of trial and error and learning, rather than exhaustively searching the space of possibilities. The network can guess quickly and then gradually correct itself, slowly "learning" as it tries to guess the right class. There is obviously much more to NNs, but they are a fascinating bridge between computer science and cognitive science. What makes a CNN important to a classifier is that a CNN has an added ability to recognize spatial patterns, which makes it quite useful for pattern recognition in photos. This area has important implications in computer vision. It can allow automatic detection of objects in photographs, making it useful for real-life object detection and search efforts when there is no human to make the judgment call.

**Motivation**

My interest in this space is really what drove me to where I am now, here at MSU. The idea that we can create machines and programs that can learn their environment is really quite fascinating. The fact that we can create things that are artificially intelligent and could almost seem conscious is really one of the most bizarre things out there. That fascination has led me to learn Data Science and to become interested in Cognitive Science and all the ways we can use computation to accomplish feats that at one time it seemed only a human could do. Quite honestly, this field of study has fundamentally changed the way I look at life, and studying CNNs and PyTorch just seems like a natural extension of my fascinations now.  

**Computation in Convolutional Neural Networks**

In order for a Convolutional Neural Network to work and learn, a large amount of computation needs to be done. An ANN is really just a collection of linear algebra operations that produce a result. The base of the NN is a set of perceptrons, which are really just functions that take in a number of signals, add them together, and apply an activation function. The activation can be many things, such as a sigmoid function or ReLU. It determines whether the perceptron passes its signal on to the next perceptron. It is this activation and signal passing that turns an input into an output. That output is merely the guess of the NN. The learning comes from what are known as the "forward" and "backward" passes. Again, this is just linear algebra that adjusts the weights of the NN to change the strength of signals between perceptrons. All of these calculations require computation. This computation, however, is parallelizable, as most of the calculations can be teased apart and easily mapped to a GPU. Since a GPU is essentially a large number of very small cores, it is well suited to performing many simple multiplications and sums at once. In the case of our CNN, the image will be broken into pixels, some transformations will happen, and then the pixel values will be fed into the network, whose perceptrons together calculate the classification of the image.
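To make the perceptron description concrete, here is a minimal sketch in plain Python of one perceptron computing a weighted sum and passing it through an activation function. The inputs, weights, and bias are made-up illustrative values and do not come from the model in this report.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive signals through unchanged and block negative ones."""
    return max(0.0, z)

def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical values chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

print(perceptron(inputs, weights, bias, relu))
print(perceptron(inputs, weights, bias, sigmoid))
```

With these values the weighted sum is negative, so ReLU blocks the signal while the sigmoid still passes a small value on. A full layer repeats this same multiply-and-add for many perceptrons at once, which is the kind of work a GPU handles well.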

---- Need to remove below and replace with summary of results ----

**Software and Hardware**

The package that will be used for this Convolutional Network Classifier is `PyTorch`. This will act as the primary workhorse for the neural network. [`PyTorch`](https://pytorch.org/) is a popular package for 
tensor computation and construction of Neural Networks. It is a package for Python, 
as well as C++. The software is also [open source](https://github.com/pytorch/pytorch) and has a rich community 
that has a plethora of tutorials on how to use many of the features. One of 
`PyTorch`'s strengths is that it can utilize GPUs to perform calculations. 
This allows for significant speed up in training and computation. One thing 
that is interesting is that the code is "not a Python binding into a monolithic 
C++ framework. It is built to be deeply integrated into Python." `PyTorch` 
also has many libraries such as [`torchaudio`](https://pytorch.org/audio/stable/index.html) 
for audio, [`torchtext`](https://pytorch.org/text/stable/index.html) for text, 
[`torchvision`](https://pytorch.org/vision/stable/index.html) for computer vision, 
[`TorchElastic`](https://pytorch.org/elastic/0.2.1/index.html) for running on 
changing environments, and [`TorchServe`](https://pytorch.org/serve/) for serving 
`PyTorch` models. This package is great for projects that require large amounts
of computation through neural networks, such as neural networks that take in images. 

All of this code will be run using the HPCC at MSU. Specifically, I will be using developer node `dev-intel16-k80` for development. I will also be using the `Tesla K80` GPU in the submission scripts that will be used for benchmarking. 

**Benchmarking, Optimizing, and Defining Success**

The statistic that will be benchmarked is the time it takes to run 20 epochs of the classifier. The timing study will look at the time it takes for 1 GPU vs. 4 GPUs to run 20 epochs at varying batch sizes (more on that later). The code will be optimized by allowing it to run on more than 1 GPU. It should be noted that the code originally came in a serial version. I then transformed it into the GPU version that is in `pytorch_classifier/`. However, the serial version will not be fully tested, as it would take an extremely long time to test the code. To that point, from only 1 run at 20 epochs (meaning this is not an average of 10 runs like the rest of the timing study) it took 


---
# Methodology

The CNN classifier is built using a basic CNN structure, and it is trained and tested on the [CIFAR10 dataset](https://www.tensorflow.org/datasets/catalog/cifar10). The code comes directly from PyTorch's website tutorials [here](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py). The code provided is serial, utilizing only a CPU. My task was to parallelize it first with single GPU support and then with multiple GPU support. Fortunately, with PyTorch this is a pretty painless process.

In order to run the code, it is advised to visit [`./pytorch_classifier/README.md`](./pytorch_classifier/README.md). In that README file you will find everything that is needed to set up an environment on the HPCC that can run this Python code. In short, the README file will tell you how to install Anaconda 3 with Python 3.8 on the HPCC, walk you through how to set up the appropriate environment using a `.yml` file, and make sure you can activate the `pytorch_classifier` environment. This is imperative to getting any of the following code working.

I made many modifications to make this code easy to work with my submission scripts, but I will go over the main modifications I made to the original code to parallelize it with single GPU and then multi GPU support.

## Single GPU Parallelize

At the beginning of the code (Part 1 of the tutorial) I added the following line:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
This allows PyTorch to check if a GPU is available, and if not it will set this `device` variable to the CPU.

Then, after defining the network as a class called `Net` (Part 2 of the tutorial), the code instantiates `Net` and saves it in a variable `net`.
```python
net = Net()
```
Here is the first of the modifications we will make to send the code to the GPU. Luckily this is really easy with PyTorch: all we have to do is add `.to(device)` to move the network to the GPU.
```python
net = Net().to(device)
```

The next change we have to make is inside the main epoch loop (Part 4 of the tutorial). This change is to our data, as we will need to send both the inputs and labels to the GPU for training.
```python
# Original serial version
inputs, labels = data

# GPU support version
inputs, labels = data[0].to(device), data[1].to(device)
```

*Note: Part 5 of the tutorial has code that essentially reloads the model from the saved model. The tutorial notes that this is not necessary, and I did not include it in the code I ran because it is not needed for this model (and would make our job harder).*

When testing our CNN we will need to apply similar changes to send data to the GPU. The first part is when we are only testing a few images (Beginning of Part 5 of tutorial). We will need to grab the few images we are testing and send that to the GPU.
```python
# Original serial version
dataiter = iter(testloader)
images, labels = next(dataiter)  # built-in next() works on any iterator

# GPU support version
dataiter = iter(testloader)
images, labels = next(dataiter)
images, labels = images.to(device), labels.to(device)
```

The second part, a natural extension of the first, is testing with all images (Middle of Part 5). This is located in the first for-loop that loops through all of the data. We simply need to send the data to the GPU again with the following code (similar to what we used above).
```python
# Original serial version
images, labels = data

# GPU support version
images, labels = data[0].to(device), data[1].to(device)
```

Finally, the same process as in the second step is applied in the second for-loop through all the data, which checks the accuracy per classification group (End of Part 5 in tutorial).
```python
# Original serial version
images, labels = data

# GPU support version
images, labels = data[0].to(device), data[1].to(device)
```

Following the above steps will create code that is parallelized with 1 GPU. Now on to multiple GPUs (which is even easier).
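As a rough illustration of how the single GPU changes fit together, the sketch below runs one training step on whichever device is available. `TinyNet` is a hypothetical stand-in for the tutorial's `Net`, and random tensors take the place of a real batch from `trainloader`; both are assumptions made to keep the example small and self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyNet(nn.Module):
    """Hypothetical stand-in for the tutorial's `Net`; kept small on purpose."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 5)        # 3x32x32 -> 8x28x28
        self.pool = nn.MaxPool2d(2, 2)        # 8x28x28 -> 8x14x14
        self.fc = nn.Linear(8 * 14 * 14, 10)  # 10 CIFAR10 classes

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(torch.flatten(x, 1))

net = TinyNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Random tensors standing in for one batch from `trainloader`.
data = (torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))

# The model and the batch must live on the same device.
inputs, labels = data[0].to(device), data[1].to(device)

optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

print(outputs.shape, loss.item())
```

The only device-specific lines are the `device` definition and the two `.to(device)` calls, which is why the same script can run unchanged on a CPU-only node.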

## Multi GPU Parallelize

With PyTorch this really could not be any easier. The following will parallelize the code with multi GPU support.

There is, in fact, only one change that needs to be made to make this work on multiple GPUs. This occurs when we first instantiate our `Net` class. We simply pass our `Net` instance into `nn.DataParallel()`. This can be seen in the following:
```python
# Single GPU support version
net = Net().to(device)

# Multiple GPU support version
gpu_avail = torch.cuda.device_count()  # assumed definition: number of GPUs PyTorch can see
net = Net()
if gpu_avail > 1:
    net = nn.DataParallel(net)
net = net.to(device)
```
What this code does is break apart the instantiation so that we can pass `net` into `nn.DataParallel()` only if more than one GPU is available. Then, just like before, we send the network to the "device". Fortunately, `nn.DataParallel()` handles everything that deals with more than 1 GPU. Even though `device` just refers to `cuda`, the wrapper splits each batch and sends the pieces to the right GPUs.

Now we have created code that can run in serial if no GPUs are available, run with 1 GPU if only one is available, or run with multiple GPUs if multiple GPUs are available.
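The three-way fallback described here can be sketched as a small helper. `prepare_model` is a hypothetical name, the mode labels reuse the `CPU`, `single`, and `multi` compute types from this report, and counting GPUs with `torch.cuda.device_count()` is an assumption about how the check might be done.

```python
import torch
import torch.nn as nn

def prepare_model(net):
    """Move `net` to the best available device, wrapping it for multi-GPU use.

    Returns the (possibly wrapped) model and a label describing the mode.
    """
    gpu_avail = torch.cuda.device_count()  # 0 on a CPU-only machine
    device = torch.device("cuda" if gpu_avail > 0 else "cpu")
    if gpu_avail > 1:
        net = nn.DataParallel(net)  # splits each batch across the GPUs
        mode = "multi"
    elif gpu_avail == 1:
        mode = "single"
    else:
        mode = "CPU"
    return net.to(device), mode

# `nn.Linear` is only a placeholder for the real network.
model, mode = prepare_model(nn.Linear(8, 2))
print(mode, type(model).__name__)
```

Keeping the decision in one place means the training loop itself does not need to know how many GPUs are present.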

## Alterations to code for timing study

The timing study I conducted was to see how the time it took to run the model changed across computation types (CPU only, single GPU, multiple GPU) and batch sizes. The accuracy was also recorded and analyzed. The hope was to see whether multiple GPU parallelization provides a speed up over the CPU and single GPU models, and how much changing the batch size affected time and accuracy.

I will not go too deeply into my code alterations for the timing study as they could be implemented many ways, but I will give a general breakdown of what I did.

* Created a new Python file called `data_load.py` that contains the first couple of lines that download and load the data. I run this at the beginning of the submission scripts so that, if the data was not already downloaded, the download time is not included in the timed portion of the test.
* Created a way to pass in an argument from the command line that changes the batch size of the model. I ran my model with 5 different batch sizes (`50`, `100`, `150`, `200`, `250`) to compare the time it took to run the model.
* Altered the dimensions of the CNN in `Net` at the suggestion of the tutorial (found at the end of the tutorial). I changed the first Conv2d to `self.conv1 = nn.Conv2d(3, 64, 5)` and the second Conv2d to `self.conv2 = nn.Conv2d(64, 16, 5)`.
> **For reference:** Exercise: Try increasing the width of your network (argument 2 of the first nn.Conv2d, and argument 1 of the second nn.Conv2d – they need to be the same number), see what kind of speedup you get
* Changed the number of epochs to `20` with a variable. (If wanted, this could also be made an argument to pass in.)
* Finally, I created three different codes for ease of alteration. With each code I changed the output files to be saved in an `output` folder. I also changed the labels of all of the outputs to include the batch size and compute type (`CPU`, `single`, `multi`).
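One possible way to wire up the command-line batch size and the labeled output files from the list above is sketched below with `argparse`. The flag names, the defaults, and the file name pattern are assumptions made for illustration; the scripts in `./timing_study` may do this differently.

```python
import argparse

def parse_args(argv=None):
    """Read the batch size and epoch count from the command line."""
    parser = argparse.ArgumentParser(description="CIFAR10 classifier timing run")
    parser.add_argument("--batch-size", type=int, default=250,
                        help="number of images per training batch")
    parser.add_argument("--epochs", type=int, default=20,
                        help="number of passes over the training set")
    return parser.parse_args(argv)

def output_name(compute_type, batch_size):
    """Build an output file name that records compute type and batch size."""
    return f"output/{compute_type}_batch{batch_size}.txt"

# Passing a list mimics `python script.py --batch-size 100`.
args = parse_args(["--batch-size", "100"])
print(args.batch_size, args.epochs)
print(output_name("multi", args.batch_size))
```

Encoding the compute type and batch size in the file name makes it easy to collect results from many submitted jobs afterwards.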

## Running the provided code

You can run a test of the provided pre-modified code plus submission scripts with a few simple make commands. Each code will run for 20 epochs with a batch size of 250 (should take about 4 min for each). To run the multiple GPU code, make sure you are on a dev node with GPU access that is not `dev-intel14-k20`.

**NOTE: Make sure you have the appropriate environment installed and activated. See [`./pytorch_classifier/README.md`](./pytorch_classifier/README.md) on how to set up your Anaconda `pytorch_classifier` environment. It is imperative that this environment is active prior to running these cells.**

CPU only code

In [None]:
!cd ./timing_study && make -i clean
!cd ./timing_study && make cpu

Single GPU code

In [None]:
!cd ./timing_study && make -i clean
!cd ./timing_study && make single

Multiple GPU code

In [None]:
!cd ./timing_study && make -i clean
!cd ./timing_study && make multi

If on the HPCC you can submit the submission scripts with the following code

In [None]:
!cd ./timing_study && make -i clean
!cd ./timing_study && sbatch cpu_classifier.sb
!cd ./timing_study && sbatch single_classifier.sb
!cd ./timing_study && sbatch multi_classifier.sb

## Conducting Timing Study

To conduct the timing study I utilized the `cpu_classifier.py`, `single_classifier.py`, and `multi_classifier.py` code provided in `./timing_study`. Specifically, I ran the code using the HPCC job scheduler with the `cpu_classifier.sb`, `single_classifier.sb`, and `multi_classifier.sb` submission scripts. In the submission scripts I specifically requested the `Tesla K80` GPUs. This is because I wanted to make sure the `Tesla K20` GPUs were not used, as they are incompatible with the PyTorch CUDA code.

The study varied 2 things: the model (CPU only, single GPU, and multiple GPU) and the batch size. The reason is that the overhead of using multiple GPUs was observed to make the multiple GPU code run slower than the single GPU code at lower batch sizes. However, as batch size increased, the multiple GPU code was faster. So in the submission scripts I made the batch size changeable for each code and timed how long it took for that code to run 10 times at that given batch size. 
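A little arithmetic suggests why small batches can hurt the multiple GPU runs. The CIFAR10 training set has 50,000 images, so a smaller batch size means more batches per epoch, and `nn.DataParallel` has to split every one of those batches across the GPUs. The sketch below only counts batches and the approximate share each GPU receives; it is a plausible reading of the overhead, not a measurement.

```python
import math

TRAIN_IMAGES = 50_000  # size of the CIFAR10 training set
NUM_GPUS = 4           # GPUs used for the multi-GPU runs

def batches_per_epoch(batch_size, n_images=TRAIN_IMAGES):
    """Number of optimizer steps needed to see every training image once."""
    return math.ceil(n_images / batch_size)

def images_per_gpu(batch_size, n_gpus=NUM_GPUS):
    """Approximate share of one batch that each GPU receives."""
    return math.ceil(batch_size / n_gpus)

for batch_size in (50, 100, 150, 200, 250):
    print(batch_size, batches_per_epoch(batch_size), images_per_gpu(batch_size))
```

At the smallest batch size each GPU receives only about a dozen images per step while the splitting cost is paid many more times per epoch, which is consistent with the slowdown noticed at lower batch sizes.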

In order to make the timing fair between studies, make sure that the data is downloaded before conducting the timing studies. This way no run is thrown off by the one-time data download.

For my specific tests, I used batch sizes of 50, 100, 150, 200, and 250 and three computational types: CPU only, a single `K80` GPU, and 4 `K80` GPUs. I ran each code for 20 epochs, and I ran each code 10 times at each batch size to find the average time it took. I also collected the accuracy of each run and averaged it to obtain an average accuracy for each batch size. 
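The averaging procedure can be sketched as a small timing harness. This is only an illustration of the idea: `time_runs` and `fake_training_run` are hypothetical names, the placeholder run just sleeps briefly, and the accuracy it returns is made up rather than measured. The actual study timed the runs through the submission scripts.

```python
import statistics
import time

def time_runs(run_once, n_runs=10):
    """Call `run_once` n_runs times; return the mean wall time and mean accuracy.

    `run_once` must return the accuracy of that run.
    """
    times, accuracies = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        accuracy = run_once()
        times.append(time.perf_counter() - start)
        accuracies.append(accuracy)
    return statistics.mean(times), statistics.mean(accuracies)

def fake_training_run():
    """Hypothetical placeholder for 20 epochs of training plus evaluation."""
    time.sleep(0.01)
    return 0.5  # made-up accuracy, not a measured result

mean_time, mean_accuracy = time_runs(fake_training_run, n_runs=3)
print(f"mean time: {mean_time:.3f} s, mean accuracy: {mean_accuracy:.2f}")
```

Averaging over repeated runs smooths out noise from the shared cluster, so differences between batch sizes and compute types are easier to trust.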

---
# Timing Results

&#9989;  Show the results of a baseline experiment on a single core and after optimization.  Also include a graph of the results. 


&#9989;  Provide the results of a benchmark or scaling study for your project.  Make sure you include a description of the hardware that was used and graph the results.  Make sure you include detailed descriptions about the hardware that was used.  Graphs alone are not sufficient, explain the graphs. Did they meet expectations?  Were there any anomalies?

---
# Concluding Discussion and Future Work

&#9989;  Give another short description of the project and your final results.  Use this to talk about what you learned in this project.  Include what you found interesting and what would be a next step.  

---
# References

&#9989;  Include links to websites and resources used in this project.  

Convolutional Neural Networks
https://www.youtube.com/watch?v=YRhxdVk_sIs

Parallelizing CNN
https://core.ac.uk/download/pdf/229563237.pdf

---- END TEMPLATE ----

-----
### Congratulations, you are done!

Now, you just need to create a second directory in your git repository and include your report as an md or ipynb file in the directory along with any additional figures and files needed to reproduce the results.  Your instructor should already have your git repository and be able to pull in your changes. 

Written by Dr. Dirk Colbry, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

----