# A Personal Introduction: 
# Convolutional Neural Networks (CNN)
###### John Susnik - October 7, 2021

##### Tags: Convolutional Neural Networks (CNN), Image Recognition, Layering, Beginner

### Brief Introduction:

I am writing this blog to explain something I recently learned and, honestly, may not fully understand yet. This is my best attempt at explaining, in “layman's terms”, what a neural network is (more specifically, a convolutional neural network). I will try to keep the article light and casual.

### Convolutional Neural Networks - What are they?

I just recently learned what neural networks are (3 days ago as of writing this), and some of their general uses. 

As a very brief introduction, Convolutional Neural Networks, or CNNs, are most commonly used for analyzing visual imagery. They are structured as a repeating pattern of convolutional layers and pooling layers, usually followed by a flatten layer and one or more dense layers.
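
To make that layer pattern a little more concrete, here is a minimal sketch of my own that only tracks how the *shape* of the data changes from layer to layer. The 28x28 grayscale input, the filter counts, and the helper names are all assumptions I picked for illustration, and no actual learning happens here.

```python
def conv_shape(height, width, channels, filters, kernel=3):
    """A convolution with no padding and stride 1 trims kernel - 1 from each side length."""
    return height - kernel + 1, width - kernel + 1, filters

def pool_shape(height, width, channels, size=2):
    """Pooling with a size x size window keeps one value per window."""
    return height // size, width // size, channels

def flatten_shape(height, width, channels):
    """Flattening unrolls the whole grid into one long list of numbers."""
    return height * width * channels

shape = (28, 28, 1)                      # hypothetical 28x28 grayscale image
shape = conv_shape(*shape, filters=8)    # (26, 26, 8)
shape = pool_shape(*shape)               # (13, 13, 8)
shape = conv_shape(*shape, filters=16)   # (11, 11, 16)
shape = pool_shape(*shape)               # (5, 5, 16)
flat = flatten_shape(*shape)             # 400 numbers handed to the dense layer

print(shape, flat)
```

The image gets smaller at each step while the number of channels grows, and the dense layer at the end only ever sees the flattened list of 400 numbers.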

This part can be confusing.

At this point I would like to take a step back and describe (in general terms) what these layers are doing. You can think of this process as a way for a computer to “visually” identify what pictures are. I put “visually” in quotation marks because neural networks are widely perceived as a “black box”, where the “purpose” of each layer is hard to conceptualize.

Imagine you were handed an amazing painting, but you were only allowed to look at it through a magnifying glass at a very close distance. Your vision would be limited to a very small area, and you would have to physically "scan" the entire painting to see everything. You may have a hard time understanding what the "big picture" is, and may even get frustrated at the thought of doing this. This is essentially how a computer, or in this case a CNN, looks at an image. Where a human would use a magnifying glass to analyze a painting, a CNN reads in the data for each pixel of an image file.
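
The magnifying glass idea can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the tiny 4x4 "image" is made up, and `scan` is a hypothetical helper name, not something from a CNN library.

```python
# A made-up 4x4 grayscale image: a bright 2x2 square on a dark background.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

def scan(image, size=2):
    """Yield (row, col, patch) for every size x size window, left to right, top to bottom."""
    rows, cols = len(image), len(image[0])
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            patch = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, patch

patches = list(scan(image))
print(len(patches))   # 9 window positions on a 4x4 image
print(patches[4])     # (1, 1, [[9, 9], [9, 9]]) is the window sitting on the bright square
```

Each patch is one "look" through the magnifying glass. No single patch shows the whole square, which is why the looks have to be combined later.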

I mentioned above that a person may get frustrated at the thought of inspecting a potentially large painting with a magnifying glass without being able to take a step back and look at the entire picture at once. This is where the CNN has an advantage. The CNN reads in each pixel and memorizes the information while simultaneously "describing" the image.

### What do you mean exactly by "describing" the image?

In the same way that a human would take notes while inspecting a painting, such as:
- the color
- the texture
- the lines
- the brushwork
- any other artistic feature
- maybe you even want to record the time of day for each pixel as you inspect it
- maybe you want to record the temperature of the room as well
- maybe you make notes about what your favorite areas of the painting are

A CNN is doing the same thing, but it speaks a different language. A CNN dissects the image into multiple layers of numbers that seem meaningless to us humans, similar to how hard it would be for a human to explain to a computer how to "feel happy". When humans describe a painting as "abstract", a CNN can do the same thing, but it is restricted to using numbers (the language of computers). So how does a CNN do this?

Well, a CNN might have an attribute column that describes this property (the property being "how abstract is this image") using a value that ranges between -1 and 1, where -1 is abstract and 1 is realistic. If a human sees a cartoon of Scooby Doo, we would probably agree that it is not as abstract as a painting by Picasso, but also not as realistic as a photograph of a family and their dog. It would make sense to us that a CNN might rate the "abstract" attribute for the Scooby Doo picture as something like 0.25: it's not as abstract as a Picasso, but it's definitely not a realistic picture (remembering that a number close to 1 would be realistic, and a number close to -1 would be abstract). **Don't focus too much on the specific value (as it's subjective), but more on the idea that we picked a number somewhere between -1 and 1 for this example.**

### Okay, so how do CNNs actually work then?

In the section above I explained (vaguely) how a CNN reads in images and how it describes the images. The tricky part to conceptualize is that while humans can make sense of things like "color", "texture", and "realism", a CNN doesn't know what any of these things are - it only knows numbers. When we send an image to a CNN we tell the CNN what type of layering process we want it to use, which is the architecture for how the CNN will describe the image - or learn about the image.

When we provide an image to a CNN, we also provide a "label" that's associated with the image. For example, if we want our CNN to distinguish between cartoons and reality, the label might be "1" for a cartoon and "0" for a realistic picture. The CNN would look at the entire dataset of images that we provide, and try to find key differences between the two picture types. Similar to how a human could identify whether a picture is a cartoon or a photograph of their brother, a CNN would look at the pixels of the image, the color scale, and the line types, and create some numeric threshold to decide between the two.

As the CNN is "learning" about these images, it tries to predict the label of each picture, compares its prediction against the actual label, and adjusts its method for deciding. Using my example from above where I talked about "abstract" as a scale from -1 to 1, a CNN might say that any picture with an "abstract" value of 0.5 or lower is a cartoon, and anything greater than 0.5 is a real person or place (so the Scooby Doo picture at 0.25 lands on the cartoon side). The tricky part here is that while I'm using this "abstract" feature as an example, in actuality we don't fully understand how a CNN classifies images. We only specify the "architecture" of the process (how many layers, how large the layers are, etc.).
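
The predict, compare, adjust loop can be sketched with a single number. This is a toy of my own, not how a real CNN is trained: the scores and labels below are made up, the `predict` helper is hypothetical, and the "nudge the threshold after each mistake" rule is a stand-in I chose for simplicity.

```python
# Hypothetical (score, label) pairs. The score runs from -1 (abstract) to 1 (realistic);
# the label is 1 for a cartoon and 0 for a realistic picture. All values are made up.
examples = [(-0.8, 1), (-0.3, 1), (0.25, 1), (0.6, 0), (0.75, 0), (0.9, 0)]

def predict(score, threshold):
    """Guess 'cartoon' (1) when the score is at or below the threshold."""
    return 1 if score <= threshold else 0

threshold = -1.0   # arbitrary starting guess
step = 0.1         # how far to nudge the threshold after each mistake

for epoch in range(50):
    mistakes = 0
    for score, label in examples:
        guess = predict(score, threshold)
        if guess != label:
            mistakes += 1
            # Missed a cartoon: raise the threshold. Missed a photo: lower it.
            direction = 1 if label == 1 else -1
            threshold = round(threshold + direction * step, 2)
    if mistakes == 0:
        break  # every picture is labeled correctly, so stop adjusting

print(f"learned threshold: {threshold}")
```

The threshold the loop settles on depends on the starting guess and the step size. A real CNN adjusts a very large number of values at once, usually with gradient descent, but the rhythm of guessing, checking against the label, and adjusting is the same.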

If we tell a CNN to describe an image using 300 numbers, it will do its best to create its own categories for telling the images apart.

### You lost me again.

Another example: imagine you have images of simple shapes (triangles, circles, squares) and each shape is one of three colors (red, green, or blue). You give this data to a CNN, tell it that there are 3 different colors, and ask it to separate the images by color (in this example, the color is the "label" or the "target" of our CNN model). The CNN would learn very quickly that a "Red Square" is classified as "Red" because all of the pixels (the RGB colors) are red. Now you might think "well that's a very easy task, of course a CNN could do that!" but you might be surprised at how much time it may take for the CNN to distinguish a "Red Square" from a "Blue Circle", because the CNN might try to learn about some other attribute of the images, such as the shape. What if the first 3 images that the CNN saw were a "Red Square", a "Red Circle" and a "Red Triangle"? The CNN might start out treating the 3 categories as shapes rather than colors.
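
For comparison, here is what the color task looks like when a human writes the rule by hand. This is not a CNN and not something a CNN is given; it is a sketch, assuming each image is just a list of (R, G, B) pixels on a black background, of the rule the CNN would have to discover on its own. The `dominant_color` helper and the tiny images are hypothetical.

```python
def dominant_color(pixels):
    """Add up each color channel over all pixels and name the largest one."""
    totals = [0, 0, 0]
    for red, green, blue in pixels:
        totals[0] += red
        totals[1] += green
        totals[2] += blue
    names = ("red", "green", "blue")
    return names[totals.index(max(totals))]

BLACK = (0, 0, 0)
# Made-up 9-pixel images: the shapes cover different numbers of pixels.
red_square = [(255, 0, 0)] * 4 + [BLACK] * 5
red_circle = [(255, 0, 0)] * 5 + [BLACK] * 4
blue_circle = [(0, 0, 255)] * 5 + [BLACK] * 4

for name, pixels in [("red square", red_square),
                     ("red circle", red_circle),
                     ("blue circle", blue_circle)]:
    print(name, "->", dominant_color(pixels))
```

This rule never looks at *where* the colored pixels sit, so shape cannot distract it. A CNN gets no such guarantee. It only sees pixels and labels, and it has to work out for itself that shape is irrelevant.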

### The really tricky part.

This is the part that can get REALLY confusing. I will try to keep things simple and not dive too much into the math or actual process, but more outline the challenges in designing a CNN.

Continuing my color and shape example from above, it might seem clear to us that if I give a CNN a group of images that consist of 3 shapes and 3 colors, I may want to tell it to describe the images using 6 different categories: one for each color and one for each shape. But what if I asked the CNN to classify between cartoons and reality? I would still want a label of either "cartoon" or "reality" for each picture, but how many attributes would a CNN need to differentiate between the two categories? This is the part that becomes difficult for humans to understand, and it is the core challenge with CNNs.

#### Look at the picture below. Don't focus too much on the names of each layer, but understand that **Feature Maps or f.maps** are similar in concept to the "attributes" or columns I mentioned above.

![Typical CNN](data/Typical_cnn.png) 
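
To show where a feature map comes from, here is a small sketch of my own in plain Python. The 4x4 image and the two 2x2 filters are made-up values I chose so the result is easy to read. In a real CNN the filter values are learned rather than written by hand, and `convolve` is just an illustrative helper name.

```python
def convolve(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take a weighted sum at each spot."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Made-up image: dark left half, bright right half, so there is one vertical edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written filters; a CNN would learn its own values.
filters = {
    "vertical_edge": [[-1, 1], [-1, 1]],
    "horizontal_edge": [[-1, -1], [1, 1]],
}

# One feature map per filter.
feature_maps = {name: convolve(image, kernel) for name, kernel in filters.items()}

print(feature_maps["vertical_edge"])    # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(feature_maps["horizontal_edge"])  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Each filter produces its own grid of numbers. The "vertical edge" map lights up in the middle column where dark meets bright, while the "horizontal edge" map stays at zero because this image has no horizontal edge. That is the sense in which each feature map acts like one "attribute" describing the image.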

### Final questions and thoughts.

How many layers do we use? How large do we make each layer? Think of the layer "size" as the number of attributes or columns we allow the CNN to use when describing the image. For something like simple shapes and colors, we may not need many layers because the images are simple, but as the images get more detailed, where do we draw the line?

Using the shape and color example again, imagine we gave the CNN 100 columns or attributes to describe these images, and then told it to categorize the images. It's possible that the CNN would perform worse, because it may overcomplicate the process and start looking at things we didn't foresee. Maybe the CNN starts classifying the intensity of the color, the thickness of the lines around each shape, the angle at each vertex of the triangles - the possibilities are endless. When the classification problem becomes more complicated, like deciding if an image is a cartoon or a photograph of a real place, how many layers do we use then?
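
One rough way to see what these choices cost is to count how many adjustable numbers (weights) a stack of convolutional layers has. The sketch below is my own illustration: it assumes 3x3 filters and a 3-channel RGB input, and the two layer configurations are hypothetical examples, not recommendations.

```python
def conv_params(in_channels, filters, kernel=3):
    """Each filter has kernel * kernel * in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * filters

def count_conv_stack(filter_counts, in_channels=3, kernel=3):
    """Total weights for convolutional layers stacked one after another."""
    total = 0
    for filters in filter_counts:
        total += conv_params(in_channels, filters, kernel)
        in_channels = filters  # this layer's output feeds the next layer
    return total

small = count_conv_stack([8])          # one layer with 8 filters
large = count_conv_stack([100, 100])   # two layers with 100 filters each

print(small)  # 224
print(large)  # 92900
```

Going from one small layer to two layers of 100 filters multiplies the number of values the CNN has to tune by a few hundred. More adjustable numbers give the CNN more room to describe an image, but also more room to latch onto details we never intended.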

Another example of how CNNs (generally) work: imagine someone is introducing you to pizza for the first time. They show you hundreds of pictures of different types of pizza and tell you to classify them into 2 categories, but they don't give you any more information. You see pictures of square pizzas, deep-dish pizzas, pizzas with lots of toppings, some with no toppings at all, some that have BBQ sauce, some with tomato sauce. Where would you begin? Now imagine that as you try to sort each pizza into the correct category, the person tells you "correct!" or "wrong!" without any additional information or explanation. You would have to repeat the process many times (sometimes hundreds or thousands) before you might feel comfortable with the classification process. This is essentially how CNNs work.

### Summary

I hope this article helped explain the general idea behind Convolutional Neural Networks. I am no expert on CNNs, but after a few days of learning about their general framework, I am very interested in how these layering architectures are constructed and how to find the "sweet spot" for classifying images.