# Chapter 3 | Neural Nets Foundations
> If You're Feeling Intimidated Like Me, Let's Work Through This Together

This chapter focuses on the bare-bones fundamentals of how deep learning works, in particular the individual calculations done at each and every 'artificial neuron' of any deep learning network we build. I'm certainly not from a math background, and I'm generally intimidated but intrigued by learning math like this. I think my foundational math understanding is shaky, but I'm hoping that working through this course in great detail will bolster these core concepts so I can build on them.

Despite enjoying math in school, I did only the general curriculum in my final two years because the math teachers at my school were able to induce a coma purely via the sound of their voices. I have no doubt they were well-intentioned people, but many were phoning it in and I wasn't willing to spend half of my school day with them. I'm paying the price for it now by having to re-learn these concepts, but maybe it's for the best: if they'd explained that $y=mx+b$, derivatives, and quadratics could detect cancers and make self-driving cars, I probably would have been a more passionate student.

I feel very much a product of the "[Mathematician's Lament](https://www.maa.org/external_archive/devlin/LockhartsLament.pdf)" that is referenced in chapter 1 of the book. It's a lovely read and I took many lessons away from it, not only about how I want to teach things going forward but also a strong emotional response to how something as incredible as math is tarnished by the way it's taught. Think of where we could be. But this blog is about chapter 3 of the fastai course, not math education.

Nonetheless here I am, and in the spirit of the "Sidebar: Tenacity and Deep Learning" from the book, I'm hoping that writing this blog and working through the content is explicit evidence of me being tenacious and re-learning my math roots.

## Main Topics

The main concepts I want to have a 'mechanistic' & intuitive feeling for after this chapter are:

 - ReLU
 - Matrix Multiplication
 - Tensors
 - Gradient Descent
 
Hopefully after reading this blog you will feel as comfortable implementing and discussing these core concepts as I intend to be.

As mentioned in the lecture, the book's content for this chapter differs from the lecture's, and I'd like to work through both. I'm first going to follow along with Jeremy's lecture and re-write the functions and tools he builds (barring the Excel work, which I'd like to re-write in Python here). I will then work through the book content.

## Lecture Content

### Timm Module

Jeremy first talks through improving his pet classifier from the previous lesson, in particular looking at different architectures and using the 'timm' library of vision model architectures. Let's have a look at the timm module and what's available.

In [1]:
import timm

len(timm.list_models()), timm.list_models()[:20]

(964,
 ['adv_inception_v3',
  'bat_resnext26ts',
  'beit_base_patch16_224',
  'beit_base_patch16_224_in22k',
  'beit_base_patch16_384',
  'beit_large_patch16_224',
  'beit_large_patch16_224_in22k',
  'beit_large_patch16_384',
  'beit_large_patch16_512',
  'beitv2_base_patch16_224',
  'beitv2_base_patch16_224_in22k',
  'beitv2_large_patch16_224',
  'beitv2_large_patch16_224_in22k',
  'botnet26t_256',
  'botnet50ts_256',
  'cait_m36_384',
  'cait_m48_448',
  'cait_s24_224',
  'cait_s24_384',
  'cait_s36_384'])

There are a lot of models, nearly 1,000, which is kind of nuts. The count looks like it's beefed up by different sizes of what I think is the same architecture. Let's get a model down and have a look at the architecture.
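Before pulling one down, here is a quick sanity check of the "same architecture, different sizes" hunch. This is only a sketch: it reuses the 20 names printed above and assumes (my assumption, not a timm rule) that the text before the first underscore is a rough family name.

```python
from collections import Counter

# The first 20 names printed above, copied verbatim from the output.
names = [
    'adv_inception_v3', 'bat_resnext26ts', 'beit_base_patch16_224',
    'beit_base_patch16_224_in22k', 'beit_base_patch16_384',
    'beit_large_patch16_224', 'beit_large_patch16_224_in22k',
    'beit_large_patch16_384', 'beit_large_patch16_512',
    'beitv2_base_patch16_224', 'beitv2_base_patch16_224_in22k',
    'beitv2_large_patch16_224', 'beitv2_large_patch16_224_in22k',
    'botnet26t_256', 'botnet50ts_256', 'cait_m36_384', 'cait_m48_448',
    'cait_s24_224', 'cait_s24_384', 'cait_s36_384',
]

# Assumption: everything before the first underscore is the "family".
families = Counter(name.split('_')[0] for name in names)
print(families.most_common())
```

In this small sample, 16 of the 20 names belong to just three families (beit, beitv2, cait), which fits the idea that the long list is mostly size and resolution variants.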

In [2]:
resnet18 = timm.models.resnet18()
resnet18

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (drop_block): Identity()
      (act1): ReLU(inplace=True)
      (aa): Identity()
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act2): ReLU(inplace=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, m

Ok so I've brought down a 'tiny' model (in the scheme of today's models, which can have billions of parameters) called resnet18, which I used on my shark classifier in chapter 2. The resnet architecture is also what Jeremy references in the lecture.

It looks like there are many 'Sequential' layers with 'BasicBlocks' inside them, which in turn have a bunch of individual 'submodules', to copy the language of the model.get_submodule() API we're about to use. Let's now have a look at a particular submodule. The get_submodule() method allows us to step down the 'tree' and 'branches' of the layers with dot notation. We'll step all the way down to a leaf, so take particular note of the branch names contained within the round brackets '()'. First we go via the 'layer1' layer into the first BasicBlock, which has the notation '(0)'. I'm guessing that's because the layer is an array of BasicBlocks, the first index being 0. Then I'm going to pick the BatchNorm2d submodule, which has the notation '(bn1)' within the brackets.
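To convince myself how the dot path is built, here is a minimal sketch on a hypothetical toy model (the layer names and sizes are made up to mimic the printout, this is not the real resnet18). It assumes only that `torch` is installed.

```python
from collections import OrderedDict
import torch.nn as nn

# Hypothetical toy block, named like the BasicBlock entries printed above.
block = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False)),
    ("bn1", nn.BatchNorm2d(8)),
]))

# Children of an unnamed Sequential get the names "0", "1", ...
toy = nn.Sequential(OrderedDict([
    ("layer1", nn.Sequential(block)),
]))

# Walk down the tree: layer1 -> first block (index 0) -> bn1
bn = toy.get_submodule("layer1.0.bn1")
print(type(bn).__name__)  # BatchNorm2d

# named_modules() lists every valid dot path in the model
module_paths = [name for name, _ in toy.named_modules()]
print(module_paths)
```

So the '(0)' in the printout really is an index into a Sequential container, and the names in brackets are exactly the pieces you join with dots.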

In [3]:
layer = resnet18.get_submodule("layer1.0.bn1")
list(layer.parameters())

[Parameter containing:
 tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        requires_grad=True)]

OK so we've got a couple of tensors, one set to all ones and another set to all zeroes. Let's have a look at the docs to see if there are any hints.

In [4]:
from torch.nn import BatchNorm2d

BatchNorm2d?

Init signature:
BatchNorm2d(
    num_features: int,
    eps: float = 1e-05,
    momentum: float = 0.1,
    affine: bool = True,
    track_running_stats: bool = True,
    device=None,
    dtype=None,
) -> None
Docstring:
Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs
with additional channel dimension) as described in the paper
`Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift <

Ok deadset I'm not sure what a lot of this means, if not all of it, but there's a nice link to the paper that proposed this submodule. Maybe I'll revisit this later, or, as I'm guessing, we will discuss batch normalisation as part of the course. Nonetheless, credit is due to the PyTorch team for awesome docs and references. I'm certainly feeling comfortable picking apart a model and its submodules to then research or understand the pieces. And to Jeremy's point, it looks like each module is just tensors, which I'm assuming get matrix multiplied.

Let's have a look at another one.
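For what it's worth, here is my tentative reading of the ones and zeros, sketched in code. As far as I can tell from the PyTorch docs, they are the learnable scale (`weight`) and shift (`bias`) that `affine=True` switches on, and they start at 1 and 0 so the layer initially just normalises each channel. The input shape and the manual formula below are my own illustration, not something from the lecture.

```python
import torch
from torch.nn import BatchNorm2d

torch.manual_seed(0)
bn = BatchNorm2d(3)             # 3 channels; weight starts as ones, bias as zeros
x = torch.randn(8, 3, 4, 4)     # made-up batch: (batch, channels, height, width)

out = bn(x)                     # modules start in training mode, so batch statistics are used

# Manual version: normalise each channel, then scale by weight and shift by bias
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)
manual = manual * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

matches = torch.allclose(out, manual, atol=1e-5)
print(bn.weight.data, bn.bias.data)
print(matches)
```

If that reading is right, this particular module scales and shifts rather than matrix multiplies, and the ones and zeros are just its starting point before training changes them.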

In [5]:
layer = resnet18.get_submodule("layer1.0.conv2")
list(layer.parameters())

[Parameter containing:
 tensor([[[[-1.2502e-01,  2.1611e-02,  6.2673e-02],
           [ 4.5127e-02,  3.0546e-02, -4.1605e-04],
           [ 2.3156e-02, -3.6311e-02,  3.3031e-03]],
 
          [[ 1.4144e-02, -6.9305e-02, -4.6877e-02],
           [ 3.5611e-02,  7.7496e-02, -3.8019e-02],
           [ 2.2473e-02,  1.5468e-02, -8.8914e-02]],
 
          [[ 1.1887e-01,  3.5537e-02, -3.4866e-02],
           [-4.2336e-02, -9.2128e-02, -3.5970e-02],
           [ 2.1227e-02,  6.4185e-02,  1.1702e-02]],
 
          ...,
 
          [[ 4.3194e-02,  4.8263e-02, -4.6128e-02],
           [-2.4847e-02,  1.1796e-02, -4.1562e-02],
           [-1.4793e-03, -7.6328e-02, -2.8772e-03]],
 
          [[-3.0051e-02, -1.6366e-02, -1.0406e-02],
           [ 4.4385e-02, -6.6810e-02, -1.4850e-02],
           [-4.8456e-02,  1.0097e-02, -2.8386e-02]],
 
          [[ 7.7507e-03, -4.8761e-02,  1.6333e-02],
           [-2.8063e-02, -7.2671e-02,  4.1981e-02],
           [-5.1363e-03, -8.3430e-03,  1.6157e-02]]],
 
 
   

Ok let's stop that, this thing is big. But nonetheless it's interesting to see the different shapes and values. The batch norm module had zeroes and ones, whereas this one seems to have all sorts of values and the shape is very different.

As Jeremy mentions, these numbers can apparently figure out whether a dog is a basset hound or not or, in our previous example, whether a shark is a great white or a hammerhead. How they do that isn't clear yet. Again as Jeremy mentions, machine learning is the act of fitting a function to data, so let's investigate this further.
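Rather than printing every value, the shapes can be compared directly. This sketch builds stand-alone layers with the same settings shown in the model printout (it does not touch the actual resnet18 weights).

```python
import torch.nn as nn

# Same settings as printed for layer1.0.conv2 and layer1.0.bn1 above
conv = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
bn = nn.BatchNorm2d(64)

print(conv.weight.shape)    # torch.Size([64, 64, 3, 3])
print(conv.weight.numel())  # 36864 individual numbers
print(bn.weight.shape, bn.bias.shape)  # torch.Size([64]) torch.Size([64])
```

So the conv layer holds 64 x 64 little 3 x 3 grids, while the batch norm layer holds just one scale and one shift per channel.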

### How Do We Fit a Function to Data

Let's first build a general quadratic equation and plot it. I don't actually have an intuitive feeling for what makes this a 'quadratic', but in the spirit of 80% doing and 20% studying, I'm going to soldier on, see the 'ball game' played out, and circle back later to solidify my theory as part of the 20% reading principle outlined in Radek's Metalearning book, which I love. Note to Radek: I'm trusting you that this is a good plan. It's working so far, but as a product of a school system that did the opposite, I feel very conflicted moving on without actually 'knowing'.


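In [9]:
import torch
import matplotlib.pyplot as plt

# Assumed helper: a minimal stand-in for the plot_function used in the lecture notebook.
def plot_function(f, title=None, min=-2.1, max=2.1, color='r'):
    x = torch.linspace(min, max, 100)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)

def f(x): return 3*x**2 + 2*x + 1

plot_function(f, title="$3x^2 + 2x + 1$")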

This f(x) is fine for plotting that one particular function, but it'd be nice to be able to play with the parameters, so let's define a quad() function where we can pass in whatever we like.

Also, the two definitions I've written below are functionally the same. Writing a function on one line is a nice bit of Python syntax, but it's not very common in the general Python universe.

In [None]:
def quad(a,b,c,x): return a*x**2 + b*x + c

def quad(a,b,c,x):
    return a*x**2 + b*x + c

In [None]:
quad(3,2,1,1.5)
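To see what that call actually computes, here is the arithmetic spelled out term by term. As I understand it, the 'quadratic' label just means the highest power of x is 2. The intermediate variable names are my own.

```python
def quad(a, b, c, x):
    return a*x**2 + b*x + c

x = 1.5
squared_term = 3 * x**2   # 3 * 2.25 = 6.75
linear_term = 2 * x       # 2 * 1.5  = 3.0
constant = 1

total = squared_term + linear_term + constant
print(total)                      # 10.75
print(total == quad(3, 2, 1, x))  # True
```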

Let's introduce partial functions, as Jeremy does. He describes a partial as something along the lines of 'fixing' part of a function. I've thought of it as making a modified function from another function, but his description is simpler.

#### Partial Functions

In [None]:
from functools import partial

def mk_quad(a,b,c): return partial(quad,a,b,c)
f = mk_quad(3,2,1)
f(1.5)

In [None]:
doc(partial)

The [Python docs themselves are quite useful](https://docs.python.org/3/library/functools.html#functools.partial) at describing partials. For example, "Return a new partial object which when called will behave like *func* called with the positional arguments *args* and keyword arguments *keywords*."

"The partial() is used for partial function application which "freezes" some portion of a function's arguments and/or keywords resulting in a new object with a simplified signature."

It looks like Jeremy's description is more accurate, closer to the original Python docs, and probably a better analogy for partial objects. My understanding has improved, so I'll stop saying 'function from another function' and start espousing something closer to the docs and Jeremy.
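A small sketch of the 'freezing' idea, using the same quad function. The lambda is my own hand-written equivalent for comparison, not something from the lecture.

```python
from functools import partial

def quad(a, b, c, x):
    return a*x**2 + b*x + c

f = partial(quad, 3, 2, 1)       # a, b and c are frozen; only x is left to supply
g = lambda x: quad(3, 2, 1, x)   # hand-written equivalent

print(f.func is quad, f.args)    # True (3, 2, 1)
print(f(1.5), g(1.5))            # 10.75 10.75
```

The partial object remembers the original function and the frozen arguments, which is why the docs talk about a new object with a simplified signature rather than a brand new function.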

In [None]:
plot_function(f)

#### Adding Noise to Our Perfect Functions

Let's now make some 'real' looking data. We can add some noise to this function to more closely represent the data we're likely to see in a hypothetical real world where the generator function of our data is perfect like this, but where we also have to live with inaccurate measuring devices.

#### Nassim Taleb is an Awesome Writer

Side note: the books "Fooled by Randomness" and "Black Swan" by Nassim Taleb are genuinely inspiring works that changed how I think about and behave in the world. In particular, Nassim introduces the concept of 'invisible' generators that create the randomness in our world. The main problem is that we only ever observe a sample from these generators. Even over time frames that are 'long' relative to our lives, i.e. 20 years of data, that might simply be an insufficient sample from the 'generator' to make any worthwhile inference about the actual likelihood of your observations. It's irrelevant to the Python we're writing right now, but when imagining this hypothetical world where we only see a noisy version of a perfect function, I thought I'd share some of my favourite books.

In [None]:
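from numpy.random import normal, seed, uniform

# `seed` is numpy.random.seed, imported above, so no separate `np` import is needed
seed(42)

# Random draws from a zero-mean normal distribution, one per element of x
def noise(x, scale): return normal(scale=scale, size=x.shape)
# Multiplicative noise scaled by `mult`, plus additive noise scaled by `add`
def add_noise(x, mult, add): return x * (1+noise(x, mult)) + noise(x,add)

Let's investigate each of the functions that Jeremy uses in the next few lines. I want to understand what each one is doing.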

In [None]:
doc(normal)

Ok so the normal function draws random samples from a normal distribution. The scale and size arguments set the standard deviation and the number of samples we'd like.

In [None]:
# We can see 10 samples which are taken from a normal distribution with a standard deviation of 0.3

normal(scale=.3,size=10)
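Ten samples are too few to 'see' the standard deviation, so here is a rough check with many more. The sample count of 100,000 is an arbitrary choice of mine.

```python
from numpy.random import normal, seed

seed(42)
samples = normal(scale=0.3, size=100_000)

sample_mean = samples.mean()   # should land close to 0, the default `loc`
sample_std = samples.std()     # should land close to 0.3, the `scale` we asked for

print(samples.shape)           # (100000,)
print(sample_mean, sample_std)
```

The mean and standard deviation of the samples will never match 0 and 0.3 exactly, which is rather the point of the Taleb side note above: we only ever see a sample from the generator.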

In [None]:
torch.linspace?

torch.linspace looks like a really nice way to build a tensor that is 'linearly' spaced out based on the start, stop, and steps arguments you provide. So below we start at -2, go all the way to 2, and ask for 20 evenly spaced points in total.

In [None]:
test = torch.linspace(-2,2,steps=20)
test, test.shape
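A quick sketch to check the 'evenly spaced' claim: with 20 points there are 19 gaps, so each gap should be (stop - start) / (steps - 1). The variable names here are my own.

```python
import torch

start, stop, steps = -2.0, 2.0, 20
t = torch.linspace(start, stop, steps=steps)

gaps = t[1:] - t[:-1]                        # difference between neighbouring points
expected_gap = (stop - start) / (steps - 1)  # 4 / 19, roughly 0.2105

evenly_spaced = torch.allclose(gaps, torch.full_like(gaps, expected_gap))
print(t.shape, gaps.shape)
print(expected_gap, evenly_spaced)
```

Note that `steps` is the number of points, not the number of gaps, and both end points are included.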

Jeremy also runs a '[:,None]' index on this linspace, which seems like a cool trick but I'm not quite sure what it does. It looks like he wants everything along the first axis, hence the ':' colon which gives you all of it, but I'm not sure what the None does.

In [None]:
test[:,None], test[:,None].shape

Ok so it looks like it turns the tensor from a single row with many columns into one column with many rows. Strictly speaking this isn't a transpose: indexing with None adds a new axis of size 1, so the shape goes from (20,) to (20, 1). I think my language of 'rows' and 'columns' is also not quite right, since that's a data table / dataframe way of thinking and tensors are fundamentally different. I need to figure out better language, but I'm hoping we're at a simple enough state where this makes sense.
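Here is a small sketch of what the None index does to the shape, and one reason the extra axis can matter. The broadcasting example at the end is my own illustration, not something from the lecture.

```python
import torch

t = torch.linspace(-2, 2, steps=20)
col = t[:, None]                          # None inserts a new axis of size 1

print(t.shape, col.shape)                 # torch.Size([20]) torch.Size([20, 1])
print(torch.equal(col, t.unsqueeze(1)))   # True: same result as unsqueeze(1)

# With the extra axis, broadcasting pairs every element with every other element
print((col * t).shape)                    # torch.Size([20, 20])
```

So '[:,None]' changes how the tensor lines up against other tensors in later operations, even though the 20 values themselves are unchanged.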

In [None]:
add_noise(f(test),.3,1.5),add_noise(f(test)[:,None],.3,1.5)

Ok so it looks like the same kind of data, just with the extra axis we saw before.

In [None]:
x = torch.linspace(-2, 2, steps=20)[:,None]
y = add_noise(f(x), 0.3, 1.5)
plt.scatter(x,y)

Let's try it without the extra axis.

In [None]:
x = torch.linspace(-2,2,steps=20)
y = add_noise(f(x),0.3,1.5)
plt.scatter(x,y)

Ok, it looks the same. I'm not sure why Jeremy added the extra axis to a 1d tensor here, but it's certainly a neat trick. Let's move on and start playing with some parameters and IPython interactivity.

In [None]:
from ipywidgets import interact

@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a,b,c):
    plot_function(mk_quad(a,b,c))
    plt.scatter(x,y)
    

Now if you're reading on Quarto, I recognise that you won't be able to play with the plot I've written above, so I've re-written the plot with Altair so that you can play around with it on the blog.

In [None]:
a, b, c = 1.5, 1.5, 1

x = torch.linspace(-2,2,steps=20)
y = add_noise(f(x),0.3,1.5)
data = pd.DataFrame({"x":x.numpy(), "y":y.numpy()})
scatter = alt.Chart(data).mark_point().encode(
    x='x:Q',
    y='y:Q'
)

f = mk_quad(a,b,c)

selector_a =  alt.selection_single(name="selector_a", 
                                fields=['a'],  
                                bind=alt.binding_range(min=0, max=3, step=0.1, name='A'),
                                init={'a': a,}) 
selector_b = alt.selection_single(name="selector_b",
                                fields=['b'], 
                                bind=alt.binding_range(min=0, max=3, step=0.1, name='B'),
                                init={'b': b})
selector_c = alt.selection_single(name="selector_c",
                                fields=['c'], 
                                bind=alt.binding_range(min=0, max=3, step=0.1, name='C'),
                                init={'c': c})


line = alt.Chart(pd.DataFrame({"x":x.numpy(),"y":1})).mark_line(color="red").encode(
    x='x',
    y='y',
).transform_calculate(
    # Read all three coefficients from the sliders so that C also moves the curve
    y="selector_a.a*pow(datum.x,2) + selector_b.b*datum.x + selector_c.c"
).properties(title='a*x^2 + b*x + c').add_selection(selector_a).add_selection(selector_b).add_selection(selector_c)

(line + scatter)

## Book Content