# GenAI with Open-Source Models: Complete Tutorial
## Building RAG Systems with Local Models & PDF Processing

**üìö What you'll learn:**
- How to use **completely open-source** models (no API keys needed!)
- Process **real PDF documents** with practical use cases
- Build **in-memory vector stores** for fast prototyping
- Create **educational explanations** for each step

** Use Case:** Build a **Research Paper Assistant** that helps students understand academic papers

## Step 1: Environment Setup (No API Keys!)

**What we're doing:** Installing only open-source packages that work completely offline

In [1]:
# Install open-source packages only
!pip install -q sentence-transformers chromadb pypdf langchain langchain-community faiss-cpu transformers torch numpy pandas

# Verify installations
import subprocess
result = subprocess.run(['pip', 'list'], capture_output=True, text=True)
print("Installed packages:")
for line in result.stdout.split('\n'):
    if any(pkg in line for pkg in ['sentence-transformers', 'chromadb', 'langchain', 'faiss']):
        print(f"  {line.strip()}")

[?25l     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/67.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m67.3/67.3 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m19.5/19.5 MB[0m [31m44.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m313.2/313.2 kB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚î

In [2]:
# Import all required libraries
import os
import json
import time
import numpy as np
from typing import List, Dict, Any
from datetime import datetime

# PDF processing
from pypdf import PdfReader

# Vector stores and embeddings (all open-source)
import chromadb
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Document handling
from langchain.schema import Document

# Progress tracking
from tqdm.notebook import tqdm

print("All libraries imported successfully!")

All libraries imported successfully!


## Step 2: Understanding Our Use Case

**Scenario:** You're a student who needs to understand research papers quickly. We'll build a system that:
1. Reads PDF research papers
2. Creates a searchable knowledge base
3. Answers questions about the paper content
4. Explains complex concepts in simple terms

In [5]:
# Create sample research paper content (simulating a real PDF)
sample_research_content = """
Introduction to Convolutional Neural Networks
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com
May 1, 2017
Contents
1 Introduction 2
2 Preliminaries 3
2.1 Tensor and vectorization . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Vector calculus and the chain rule . . . . . . . . . . . . . . . . . 4
3 CNN in a nutshell 5
3.1 The architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 The forward run . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Stochastic gradient descent (SGD) . . . . . . . . . . . . . . . . . 6
3.4 Error back propagation . . . . . . . . . . . . . . . . . . . . . . . 8
4 Layer input, output and notations 9
5 The ReLU layer 10
6 The convolution layer 11
6.1 What is convolution? . . . . . . . . . . . . . . . . . . . . . . . . . 11
6.2 Why to convolve? . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6.3 Convolution as matrix product . . . . . . . . . . . . . . . . . . . 15
6.4 The Kronecker product . . . . . . . . . . . . . . . . . . . . . . . 17
6.5 Backward propagation: update the parameters . . . . . . . . . . 17
6.6 Even higher dimensional indicator matrices . . . . . . . . . . . . 19
6.7 Backward propagation: prepare supervision signal for the previous layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.8 Fully connected layer as a convolution layer . . . . . . . . . . . . 22
7 The pooling layer 23
1
8 A case study: the VGG-16 net 25
8.1 VGG-Verydeep-16 . . . . . . . . . . . . . . . . . . . . . . . . . . 25
8.2 Receptive field . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
9 Remarks 28
Exercises 28
1 Introduction
This is a note that describes how a Convolutional Neural Network (CNN) operates from a mathematical perspective. This note is self-contained, and the
focus is to make it comprehensible to beginners in the CNN field.
The Convolutional Neural Network (CNN) has shown excellent performance
in many computer vision and machine learning problems. Many solid papers
have been published on this topic, and quite some high quality open source CNN
software packages have been made available.
There are also well-written CNN tutorials or CNN software manuals. However, I believe that an introductory CNN material specifically prepared for beginners is still needed. Research papers are usually very terse and lack details.
It might be difficult for beginners to read such papers. A tutorial targeting
experienced researchers may not cover all the necessary details to understand
how a CNN runs.
This note tries to present a document that
‚Ä¢ is self-contained. It is expected that all required mathematical background
knowledge are introduced in this note itself (or in other notes for this
course);
‚Ä¢ has details for all the derivations. This note tries to explain all the necessary math in details. We try not to ignore an important step in a
derivation. Thus, it should be possible for a beginner to follow (although
an expert may feel this note tautological.)
‚Ä¢ ignores implementation details. The purpose is for a reader to understand how a CNN runs at the mathematical level. We will ignore those
implementation details. In CNN, making correct choices for various implementation details is one of the keys to its high accuracy (that is, ‚Äúthe
devil is in the details‚Äù). However, we intentionally left this part out,
in order for the reader to focus on the mathematics. After understanding the mathematical principles and details, it is more advantageous to
learn these implementation and design details with hands-on experience
by playing with CNN programming.
CNN is useful in a lot of applications, especially in image related tasks. Applications of CNN include image classification, image semantic segmentation,
2
object detection in images, etc. We will focus on image classification (or categorization) in this note. In image categorization, every image has a major object
which occupies a large portion of the image. An image is classified into one of
the classes based on the identity of its main object, e.g., dog, airplane, bird, etc.
2 Preliminaries
We start by a discussion of some background knowledge that are necessary in
order to understand how a CNN runs. One can ignore this section if he/she is
familiar with these basics.
2.1 Tensor and vectorization
Everybody is familiar with vectors and matrices. We use a symbol shown in
boldface to represent a vector, e.g., x ‚àà R
D is a column vector with D elements.
We use a capital letter to denote a matrix, e.g., X ‚àà R
H√óW is a matrix with
H rows and W columns. The vector x can also be viewed as a matrix with 1
column and D rows.
These concepts can be generalized to higher-order matrices, i.e., tensors. For
example, x ‚àà R
H√óW√óD is an order 3 (or third order) tensor. It contains HW D
elements, and each of them can be indexed by an index triplet (i, j, d), with
0 ‚â§ i < H, 0 ‚â§ j < W, and 0 ‚â§ d < D. Another way to view an order 3 tensor
is to treat it as containing D channels of matrices. Every channel is a matrix
with size H √ó W. The first channel contains all the numbers in the tensor that
are indexed by (i, j, 0). When D = 1, an order 3 tensor reduces to a matrix.
We have interacted with tensors day-to-day. A scalar value is a zeroth-order
(order 0) tensor; a vector is an order 1 tensor; and a matrix is a second order
tensor. A color image is in fact an order 3 tensor. An image with H rows and
W columns is a tensor with size H √ó W √ó 3: if a color image is stored in the
RGB format, it has 3 channels (for R, G and B, respectively), and each channel
is a H √óW matrix (second order tensor) that contains the R (or G, or B) values
of all pixels.
It is beneficial to represent images (or other types of raw data) as a tensor.
In early computer vision and pattern recognition, a color image (which is an
order 3 tensor) is often converted to the gray-scale version (which is a matrix)
because we know how to handle matrices much better than tensors. The color
information is lost during this conversion. But color is very important in various
image (or video) based learning and recognition problems, and we do want to
process color information in a principled way, e.g., as in CNN.
Tensors are essential in CNN. The input, intermediate representation, and
parameters in a CNN are all tensors. Tensors with order higher than 3 are
also widely used in a CNN. For example, we will soon see that the convolution
kernels in a convolution layer of a CNN form an order 4 tensor.
Given a tensor, we can arrange all the numbers inside it into a long vector, following a pre-specified order. For example, in Matlab, the (:) operator
3
converts a matrix into a column vector in the column-first order. An example
is:
A =

1 2
3 4 
, A(:) = (1, 3, 2, 4)T =
Ô£Æ
Ô£Ø
Ô£Ø
Ô£∞
1
3
2
4
Ô£π
Ô£∫
Ô£∫
Ô£ª
. (1)
In mathematics, we use the notation ‚Äúvec‚Äù to represent this vectorization
operator. That is, vec(A) = (1, 3, 2, 4)T
in the example in Equation 1. In order
to vectorize an order 3 tensor, we could vectorize its first channel (which is a
matrix and we already know how to vectorize it), then the second channel, . . . ,
till all channels are vectorized. The vectorization of the order 3 tensor is then
the concatenation of the vectorization of all the channels in this order.
The vectorization of an order 3 tensor is a recursive process, which utilizes
the vectorization of order 2 tensors. This recursive process can be applied to
vectorize an order 4 (or even higher order) tensor in the same manner.
2.2 Vector calculus and the chain rule
The CNN learning process depends on vector calculus and the chain rule. Suppose z is a scalar (i.e., z ‚àà R) and y ‚àà R
H is a vector. If z is a function of y,
then the partial derivative of z with respect to y is a vector, defined as

‚àÇz
‚àÇy

i
=
‚àÇz
‚àÇyi
. (2)
In other words, ‚àÇz
‚àÇy
is a vector having the same size as y, and its i-th element
is ‚àÇz
‚àÇyi
. Also note that ‚àÇz
‚àÇyT =

‚àÇz
‚àÇy
T
.
Furthermore, suppose x ‚àà RW is another vector, and y is a function of x.
Then, the partial derivative of y with respect to x is defined as

‚àÇy
‚àÇxT

ij
=
‚àÇyi
‚àÇxj
. (3)
This partial derivative is a H √ó W matrix, whose entry at the intersection of
the i-th row and j-th column is ‚àÇyi
‚àÇxj
.
It is easy to see that z is a function of x in a chain-like argument: a function
maps x to y, and another function maps y to z. The chain rule can be used to
compute ‚àÇz
‚àÇxT , as
‚àÇz
‚àÇxT
=
‚àÇz
‚àÇyT
‚àÇy
‚àÇxT
. (4)
A sanity check for Equation 4 is to check the matrix / vector dimensions.
Note that ‚àÇz
‚àÇyT is a row vector with H elements, or a 1√óH matrix. (Be reminded
that ‚àÇz
‚àÇy
is a column vector). Since ‚àÇy
‚àÇxT is an H √óW matrix, the vector / matrix
multiplication between them is valid, and the result should be a row vector with
W elements, which matches the dimensionality of ‚àÇz
‚àÇxT .
4
For specific rules to calculate partial derivatives of vectors and matrices,
please refer to the Matrix Cookbook.
3 CNN in a nutshell
In this section, we will see how a CNN trains and predicts in the abstract level,
with the details left out for later sections.
3.1 The architecture
A CNN usually takes an order 3 tensor as its input, e.g., an image with H
rows, W columns, and 3 channels (R, G, B color channels). Higher order tensor
inputs, however, can be handled by CNN in a similar fashion. The input then
sequentially goes through a series of processing. One processing step is usually
called a layer, which could be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a loss layer, etc. We will introduce the
details of these layers later in this note.1
For now, let us give an abstract description of the CNN structure first.
x
1 ‚àí‚Üí w1 ‚àí‚Üí x
2 ‚àí‚Üí ¬∑ ¬∑ ¬∑ ‚àí‚Üí x
L‚àí1 ‚àí‚Üí wL‚àí1 ‚àí‚Üí x
L ‚àí‚Üí wL ‚àí‚Üí z (5)
The above Equation 5 illustrates how a CNN runs layer by layer in a forward
pass. The input is x
1
, usually an image (order 3 tensor). It goes through the
processing in the first layer, which is the first box. We denote the parameters
involved in the first layer‚Äôs processing collectively as a tensor w1
. The output of
the first layer is x
2
, which also acts as the input to the second layer processing.
This processing proceeds till all layers in the CNN has been finished, which
outputs x
L. One additional layer, however, is added for backward error propagation, a method that learns good parameter values in the CNN. Let‚Äôs suppose
the problem at hand is an image classification problem with C classes. A commonly used strategy is to output x
L as a C dimensional vector, whose i-th
entry encodes the prediction (posterior probability of x
1
comes from the i-th
class). To make x
L a probability mass function, we can set the processing in the
(L ‚àí 1)-th layer as a softmax transformation of x
L‚àí1
(cf. the distance metric
and data transformation note). In other applications, the output x
L may have
other forms and interpretations.
The last layer is a loss layer. Let us suppose t is the corresponding target
(ground-truth) value for the input x
1
, then a cost or loss function can be used
to measure the discrepancy between the CNN prediction x
L and the target t.
For example, a simple loss function could be
z =
1
2
kt ‚àí x
L
k
2
, (6)
1We will give detailed introductions to three types of layers: convolution, pooling, and
ReLU, which are the key parts of almost all CNN models. Proper normalization, e.g., batch
normalization or cross-layer normalization is important in the optimization process for learning
good parameters in a CNN. I may add these contents in the next update.
5
although more complex loss functions are usually used. This squared `2 loss can
be used in a regression problem. In a classification problem, the cross entropy
loss is often used. The ground-truth in a classification problem is a categorical
variable t. We first convert the categorical variable t to a C dimensional vector
t (cf. the distance metric and data transformation note). Now both t and x
L
are probability mass functions, and the cross entropy loss measures the distance
between them. Hence, we can minimize the cross entropy (cf. the information
theory note.) Equation 5 explicitly models the loss function as a loss layer,
whose processing is modeled as a box with parameters wL.
Note that some layers may not have any parameters, that is, wi may be
empty for some i. The softmax layer is one such example.
3.2 The forward run
Suppose all the parameters of a CNN model w1
, . . . , wL‚àí1 have been learned,
then we are ready to use this model for prediction. Prediction only involves running the CNN model forward, i.e., in the direction of the arrows in Equation 5.
Let‚Äôs take the image classification problem as an example. Starting from
the input x
1
, we make it pass the processing of the first layer (the box with
parameters w1
), and get x
2
. In turn, x
2
is passed into the second layer, etc.
Finally, we achieve x
L ‚àà R
C , which estimates the posterior probabilities of x
1
belonging to the C categories. We can output the CNN prediction as
arg max
i
x
L
i
. (7)
Note that the loss layer is not needed in prediction. It is only useful when
we try to learn CNN parameters using a set of training examples. Now, the
problem is: how do we learn the model parameters?
3.3 Stochastic gradient descent (SGD)
As in many other learning systems, the parameters of a CNN model are optimized to minimize the loss z, i.e., we want the prediction of a CNN model to
match the ground-truth labels.
Let‚Äôs suppose one training example x
1
is given for training such parameters.
The training process involves running the CNN network in both directions. We
first run the network in the forward pass to get x
L to achieve a prediction using
the current CNN parameters. Instead of outputting a prediction, we need to
compare the prediction with the target t corresponding to x
1
, that is, continue
running the forward pass till the last loss layer. Finally, we achieve a loss z.
The loss z is then a supervision signal, guiding how the parameters of the
model should be modified (updated). And the SGD way of modifying the parameters is
wi ‚Üê‚àí wi ‚àí Œ∑
‚àÇz
‚àÇwi
. (8)
6
Figure 1: Illustration of the gradient descent method.
A cautious note about the notation. In most CNN materials, a superscript
indicates the ‚Äútime‚Äù (e.g., training epochs). But in this note, we use the superscript to denote the layer index. Please do not get confused. We do not
use an additional index variable to represent time. In Equation 8, the ‚Üê‚àí sign
implicitly indicates that the parameters wi
(of the i-layer) are updated from
time t to t + 1. If a time index t is explicitly used, this equation will look like

wi
t+1
=

wi
t
‚àí Œ∑
‚àÇz
‚àÇ (wi)
t
. (9)
In Equation 8, the partial derivative ‚àÇz
‚àÇwi measures the rate of increase of z
with respect to the changes in different dimensions of wi
. This partial derivative vector is called the gradient in mathematical optimization. Hence, in a
small local region around the current value of wi
, to move wi
in the direction
determined by the gradient will increase the objective value z. In order to minimize the loss function, we should update wi along the opposite direction of the
gradient. This updating rule is called the gradient descent. Gradient descent is
illustrated in Figure 1, in which the gradient is denoted by g.
If we move too far in the negative gradient direction, however, the loss
function may increase. Hence, in every update we only change the parameters
by a small proportion of the negative gradient, controlled by Œ∑ (the learning
rate). Œ∑ > 0 is usually set to a small number (e.g., Œ∑ = 0.001). One update
based on x
1 will make the loss smaller for this particular training example if the
learning rate is not too large. However, it is very possible that it will make the
loss of some other training examples become larger. Hence, we need to update
the parameters using all training examples. When all training examples have
been used to update the parameters, we say one epoch has been processed. One
epoch will in general reduce the average loss on the training set until the learning
system overfits the training data. Hence, we can repeat the gradient descent
updating epochs and terminate at some point to obtain the CNN parameters
(e.g., we can terminate when the average loss on a validation set increases).

Gradient descent may seem simple in its math form (Equation 8), but it is
a very tricky operation in practice. For example, if we update the parameters
using only gradient calculated from only one training example, we will observe
an unstable loss function: the average loss of all training examples will bounce
up and down at very high frequency. This is because the gradient is estimated
using only one training example instead of the entire training set. Updating
the parameters using the gradient estimated from a (usually) small subset of
training examples is called the stochastic gradient descent. Contrary to single
example based SGD, we can compute the gradient using all training examples
and then update the parameters. However, this batch processing requires a lot
of computations because the parameters are updated only once in an epoch, and
is hence impractical, especially when the number of training examples is large.
A compromise is to use a mini-batch of training examples, to compute the
gradient using this mini-batch, and to update the parameters correspondingly.
For example, we can set 32 or 64 examples as a mini-batch. Stochastic gradient
descent (SGD) (using the mini-batch strategy) is the mainstream method to
learn a CNN‚Äôs parameters. We also want to note that when mini-batch is used,
the input of the CNN becomes an order 4 tensor, e.g., H √ó W √ó 3 √ó 32 if the
mini-batch size is 32.
A new problem now becomes apparent: how to compute the gradient, which
seems a very complex task?
3.4 Error back propagation
The last layer‚Äôs partial derivatives are easy to compute. Because x
L is connected
to z directly under the control of parameters wL, it is easy to compute ‚àÇz
‚àÇwL .
This step is only needed when wL is not empty. In the same spirit, it is also
easy to compute ‚àÇz
‚àÇxL . For example, if the squared `2 loss is used, we have an
empty ‚àÇz
‚àÇwL , and ‚àÇz
‚àÇxL = x
L ‚àí t.
In fact, for every layer, we compute two sets of gradients: the partial derivatives of z with respect to the layer parameters wi
, and that layer‚Äôs input x
i
.
‚Ä¢ The term ‚àÇz
‚àÇwi
, as seen in Equation 8, can be used to update the current
(i-th) layer‚Äôs parameters;
‚Ä¢ The term ‚àÇz
‚àÇxi can be used to update parameters backwards, e.g., to the
(i ‚àí 1)-th layer. An intuitive explanation is: x
i
is the output of the
(i ‚àí 1)-th layer and ‚àÇz
‚àÇxi
is how x
i
should be changed to reduce the loss
function. Hence, we could view ‚àÇz
‚àÇxi as the part of the ‚Äúerror‚Äù supervision
information propagated from z backward till the current layer, in a layer
by layer fashion. Thus, we can continue the back propagation process,
and use ‚àÇz
‚àÇxi to propagate the errors backward to the (i ‚àí 1)-th layer.
This layer-by-layer backward updating procedure makes learning a CNN much
easier.
Let‚Äôs take the i-th layer as an example. When we are updating the i-th layer,
the back propagation process for the (i + 1)-th layer must have been finished.
8
That is, we already computed the terms ‚àÇz
‚àÇwi+1 and ‚àÇz
‚àÇxi+1 . Both are stored in
memory and ready for use.
Now our task is to compute ‚àÇz
‚àÇwi and ‚àÇz
‚àÇxi
. Using the chain rule, we have
‚àÇz
‚àÇ(vec(wi)
T )
=
‚àÇz
‚àÇ(vec(xi+1)
T )
‚àÇ vec(x
i+1)
‚àÇ(vec(wi)
T )
, (10)
‚àÇz
‚àÇ(vec(xi)
T )
=
‚àÇz
‚àÇ(vec(xi+1)
T )
‚àÇ vec(x
i+1)
‚àÇ(vec(xi)
T )
. (11)
Since ‚àÇz
‚àÇxi+1 is already computed and stored in memory, it requires just a
matrix reshaping operation (vec) and an additional transpose operation to get
‚àÇz
‚àÇ(vec(xi+1)T )
, which is the first term in the right hand side (RHS) of both equations. So long as we can compute ‚àÇ vec(x
i+1)
‚àÇ(vec(wi)T )
and ‚àÇ vec(x
i+1)
‚àÇ(vec(xi)T )
, we can easily get
what we want (the left hand side of both equations).
‚àÇ vec(x
i+1)
‚àÇ(vec(wi)T )
and ‚àÇ vec(x
i+1)
‚àÇ(vec(xi)T )
are much easier to compute than directly computing ‚àÇz
‚àÇ(vec(wi)T )
and ‚àÇz
‚àÇ(vec(xi)T )
, because x
i
is directly related to x
i+1, through
a function with parameters wi
. The details of these partial derivatives will be
discussed in the following sections.
4 Layer input, output and notations
Now that the CNN architecture is clear, we will discuss in detail the different
types of layers, starting from the ReLU layer, which is the simplest layer among
those we discuss in this note. But before we start, we need to further refine our
notations.
Suppose we are considering the l-th layer, whose inputs form an order 3
tensor x
l with x
l ‚àà R
Hl√óWl√óDl
. Thus, we need a triplet index set (i
l
, jl
, dl
) to
locate any specific element in x
l
. The triplet (i
l
, jl
, dl
) refers to one element in
x
l
, which is in the d
l
-th channel, and at spatial location (i
l
, jl
) (at the i
l
-th row,
and j
l
-th column). In actual CNN learning, the mini-batch strategy is usually
used. In that case, x
l becomes an order 4 tensor in R
Hl√óWl√óDl√óN where N
is the mini-batch size. For simplicity we assume that N = 1 in this note. The
results in this section, however, are easy to adopt to mini-batch versions.
In order to simplify the notations which will appear later, we follow the
zero-based indexing convention, which specifies that 0 ‚â§ i
l < Hl
, 0 ‚â§ j
l < Wl
,
and 0 ‚â§ d
l < Dl
.
In the l-th layer, a function will transform the input x
l
to an output y,
which is also the input to the next layer. Thus, we notice that y and x
l+1 in
fact refers to the same object, and it is very helpful to keep this point in mind.
We assume the output has size Hl+1√óWl+1√óDl+1, and an element in the output
is indexed by a triplet (i
l+1, jl+1, dl+1), 0 ‚â§ i
l+1 < Hl+1, 0 ‚â§ j
l+1 < Wl+1
,
0 ‚â§ d
l+1 < Dl+1
.
9
5 The ReLU layer
A ReLU layer does not change the size of the input, that is, x
l and y share the
same size. In fact, the Rectified Linear Unit (hence the name ReLU) can be
regarded as a truncation performed individually for every element in the input:
yi,j,d = max{0, xl
i,j,d} , (12)
with 0 ‚â§ i < Hl = Hl+1, 0 ‚â§ j < Wl = Wl+1, and 0 ‚â§ d < Dl = Dl+1
.
There is no parameter inside a ReLU layer, hence no need for parameter
learning in this layer.
Based on Equation 12, it is obvious that
dyi,j,d
dx
l
i,j,d
=
q
x
l
i,j,d > 0
y
, (13)
where J¬∑K is the indicator function, being 1 if its argument is true, and 0 otherwise.
Hence, we have

‚àÇz
‚àÇxl

i,j,d
=
Ô£±
Ô£¥Ô£≤
Ô£¥Ô£≥

‚àÇz
‚àÇy

i,j,d
if x
l
i,j,d > 0
0 otherwise
. (14)
Note that y is an alias for x
l+1
.
Strictly speaking, the function max(0, x) is not differentiable at x = 0, hence
Equation 13 is a little bit problematic in theory. In practice, it is not an issue
and ReLU is safe to use.
The purpose of ReLU is to increase the nonlinearity of the CNN. Since the
semantic information in an image (e.g., a person and a Husky dog sitting next
to each other on a bench in a garden) is obviously a highly nonlinear mapping
of pixel values in the input, we want the mapping from CNN input to its output
also be highly nonlinear. The ReLU function, although simple, is a nonlinear
function, as illustrated in Figure 2.
If we treat x
l
i,j,d as one of the HlWlDl
features extracted by CNN layers 1
to l ‚àí 1, which can be positive, zero or negative. For example, x
l
i,j,d may be
positive if a region inside the input image has certain patterns (like a dog‚Äôs head
or a cat‚Äôs head or some other patterns similar to that); and x
l
i,j,d is negative or
zero when that region does not exhibit these patterns. The ReLU layer will set
all negative values to 0, which means that y
l
i,j,d will be activated only for images
possessing these patterns at that particular region. Intuitively, this property is
useful for recognizing complex patterns and objects. For example, it is only
a weak evidence to support ‚Äúthe input image contains a cat‚Äù if a feature is
activated and that feature‚Äôs pattern looks like cat‚Äôs head. However, if we find
many activated features after the ReLU layer whose target patterns correspond
to cat‚Äôs head, torso, fur, legs, etc., we have higher confidence (at layer l + 1) to
say that a cat probably exists in the input image.
10
Figure 2: The ReLU function.
Other nonlinear transformations have been used in the neural network research to produce nonlinearity, for example, the logistic sigmoid function y =
œÉ(x) = 1
1+exp(‚àíx)
. However, logistic sigmoid works significantly worse than
ReLU in CNN learning. Note that 0 < y < 1 if a sigmoid function is used, and
dy
dx = y(1 ‚àí y), we have dy
dx ‚â§
1
4
. Hence, in the error back propagation process,
the gradient ‚àÇz
‚àÇx =
‚àÇz
‚àÇy
dy
dx will have much smaller magnitude than ‚àÇz
‚àÇy (at most
1
4
). In other words, a sigmoid layer will cause the magnitude of the gradient
to significantly reduce, and after several sigmoid layers, the gradient will vanish
(i.e., all its components will be close to 0). A vanishing gradient makes gradient
based learning (e.g., SGD) very difficult.
On the other hand, the ReLU layer sets the gradient of some features in the
l-th layer to 0, but these features are not activated (i.e., we are not interested
in them). For those activated features, the gradient is back propagated without
any change, which is beneficial for SGD learning. The introduction of ReLU to
replace sigmoid is an important change in CNN, which significantly reduces the
difficulty in learning CNN parameters and improves its accuracy. There are also
more complex variants of ReLU, for example, parametric ReLU and exponential
linear unit.
6 The convolution layer
Next, we turn to the convolution layer, which is the most involved one among
those we discuss in this note.
6.1 What is convolution?
Let us start by convolving a matrix with one single convolution kernel. Suppose
the input image is 3 √ó 4 and the convolution kernel size is 2 √ó 2, as illustrated
in Figure 3.
11
 
 
(a) A 2 √ó 2 kernel
   
   
  	 
  
  
(b) The convolution input and output
Figure 3: Illustration of the convolution operation.
If we overlap the convolution kernel on top of the input image, we can
compute the product between the numbers at the same location in the kernel
and the input, and we get a single number by summing these products together.
For example, if we overlap the kernel with the top left region in the input, the
convolution result at that spatial location is: 1 √ó 1 + 1 √ó 4 + 1 √ó 2 + 1 √ó 5 = 12.
We then move the kernel down by one pixel and get the next convolution result
as 1√ó4+ 1√ó7+ 1√ó5+ 1√ó8 = 24. We keep move the kernel down till it reaches
the bottom border of the input matrix (image). Then, we return the kernel to
the top, and move the kernel to its right by one element (pixel). We repeat the
convolution for every possible pixel location until we have moved the kernel to
the bottom right corner of the input image, as shown in Figure 3.
For order 3 tensors, the convolution operation is defined similarly. Suppose
the input in the l-th layer is an order 3 tensor with size Hl √ó Wl √ó Dl
. A
convolution kernel is also an order 3 tensor with size H √ó W √ó Dl
. When we
overlap the kernel on top of the input tensor at the spatial location (0, 0, 0),
we compute the products of corresponding elements in all the Dl
channels and
sum the HW Dl products to get the convolution result at this spatial location.
Then, we move the kernel from top to bottom and from left to right to complete
the convolution.
In a convolution layer, multiple convolution kernels are usually used. Assuming D kernels are used and each kernel is of spatial span H √ó W, we denote
all the kernels as f. f is an order 4 tensor in R
H√óW√óDl√óD. Similarly, we use
index variables 0 ‚â§ i < H, 0 ‚â§ j < W, 0 ‚â§ d
l < Dl and 0 ‚â§ d < D to pinpoint
a specific element in the kernels. Also note that the set of kernels f refers to
the same object as the notation wl
in Equation 5. We change the notation a
bit to make the derivation a little bit simpler. It is also clear that even if the
mini-batch strategy is used, the kernels remain unchanged.
As shown in Figure 3, the spatial extent of the output is smaller than that
of the input so long as the convolution kernel is larger than 1 √ó 1. Sometimes
we need the input and output images to have the same height and width, and a
simple padding trick can be used. If the input is Hl√óWl√óDl and the kernel size
is H√óW √óDl√óD, the convolution result has size (Hl‚àíH+1)√ó(Wl‚àíW +1)√óD.
For every channel of the input, if we pad (i.e., insert) b
H‚àí1
2
c rows above the first
12
row and b
H
2
c rows below the last row, and pad b
W‚àí1
2
c columns to the left of
the first column and b
W
2
c columns to the right of the last column of the input,
the convolution output will be Hl √óWl √óD in size, i.e., having the same spatial
extent as the input. b¬∑c is the floor functions. Elements of the padded rows and
columns are usually set to 0, but other values are also possible.
Stride is another important concept in convolution. In Figure 3, we convolve
the kernel with the input at every possible spatial location, which corresponds
to the stride s = 1. However, if s > 1, every movement of the kernel skip
s ‚àí 1 pixel locations (i.e., the convolution is performed once every s pixels both
horizontally and vertically).
In this section, we consider the simple case when the stride is 1 and no
padding is used. Hence, we have y (or x
l+1) in R
Hl+1√óWl+1√óDl+1 , with Hl+1 =
Hl ‚àí H + 1, Wl+1 = Wl ‚àí W + 1, and Dl+1 = D.
In precise mathematics, the convolution procedure can be expressed as an
equation:
yi
l+1,jl+1,d =
X
H
i=0
X
W
j=0
X
Dl
dl=0
fi,j,dl
,d √ó x
l
i
l+1+i,jl+1+j,dl . (15)
Equation 15 is repeated for all 0 ‚â§ d ‚â§ D = Dl+1, and for any spatial location
(i
l+1, jl+1) satisfying 0 ‚â§ i
l+1 < Hl ‚àí H + 1 = Hl+1
, 0 ‚â§ j
l+1 < Wl ‚àí W + 1 =
Wl+1. In this equation, x
l
i
l+1+i,jl+1+j,dl refers to the element of x
l
indexed by
the triplet (i
l+1 + i, jl+1 + j, dl
).
A bias term bd is usually added to yi
l+1,jl+1,d. We omit this term in this
note for clearer presentation.
6.2 Why to convolve?
Figure 4 shows a color input image (4a) and its convolution results using two
different kernels (4b and 4c). A 3 √ó 3 convolution matrix K =
h
1 2 1
0 0 0
‚àí1 ‚àí2 ‚àí1
i
is
used. The convolution kernel should be of size 3 √ó 3 √ó 3, in which we set every
channel to K. When there is a horizontal edge at location (x, y) (i.e., when the
pixels at spatial location (x + 1, y) and (x ‚àí 1, y) differ by a large amount), we
expect the convolution result to have high magnitude. As shown in Figure 4b,
the convolution results indeed highlight the horizontal edges. When we set every
channel of the convolution kernel to KT
(the transpose of K), the convolution
result amplifies vertical edges, as shown in Figure 4c. The matrix (or filter) K
and KT are called the Sobel operators.2
If we add a bias term to the convolution operation, we can make the convolution result positive at horizontal (vertical) edges in a certain direction (e.g.,
a horizontal edge with the pixels above it brighter than the pixels below it),
and negative at other locations. If the next layer is a ReLU layer, the output
of the next layer in fact defines many ‚Äúedge detection features‚Äù, which activate
2The Sobel operator is named after Irwin Sobel, an American researcher in digital image
processing.
13
(a) Lenna (b) Horizontal edge (c) Vertical edge
Figure 4: The Lenna image and the effect of different convolution kernels.
only at horizontal or vertical edges in certain directions. If we replace the Sobel kernel by other kernels (e.g., those learned by SGD), we can learn features
that activate for edges with different angles. When we move further down in the
deep network, subsequent layers can learn to activate only for specific (but more
complex) patterns, e.g., groups of edges that form a particular shape. These
more complex patterns will be further assembled by deeper layers to activate for
semantically meaningful object parts or even a particular type of object, e.g.,
dog, cat, tree, beach, etc.
One more benefit of the convolution layer is that all spatial locations share
the same convolution kernel, which greatly reduces the number of parameters
needed for a convolution layer. For example, if multiple dogs appear in an input
image, the same ‚Äúdog-head-like pattern‚Äù feature will be activated at multiple
locations, corresponding to heads of different dogs.
In a deep neural network setup, convolution also encourages parameter sharing. For example, suppose ‚Äúdog-head-like pattern‚Äù and ‚Äúcat-head-like pattern‚Äù
are two features learned by a deep convolutional network. The CNN does not
need to devote two sets of disjoint parameters (e.g., convolution kernels in multiple layers) for them. The CNN‚Äôs bottom layers can learn ‚Äúeye-like pattern‚Äù
and ‚Äúanimal-fur-texture pattern‚Äù, which are shared by both these more abstract
features. In short, the combination of convolution kernels and deep and hierarchical structures are very effective in learning good representations (features)
from images for visual recognition tasks.
We want to add a note here. Although we have used phrases such as ‚Äúdoghead-like pattern‚Äù, the representation or feature learned by a CNN may not
correspond exactly to semantic concepts such as ‚Äúdog‚Äôs head‚Äù. A CNN feature
may activate frequently for dogs‚Äô heads and often be deactivated for other types
of patterns. However, there are also possible false activations at other locations,
and possible deactivations at dogs‚Äô heads.
In fact, a key concept in CNN (or more generally deep learning) is distributed
representation. For example, suppose our task is to recognize N different types
of objects and a CNN extracts M features from any input image. It is most
14
likely that any one of the M features is useful for recognizing all N object
categories; and to recognize one object type requires the joint effort of all M
features.
6.3 Convolution as matrix product
Equation 15 seems pretty complex. There is a way to expand x
l and simplify
the convolution as a matrix product.
Let‚Äôs consider a special case with Dl = D = 1, H = W = 2, and Hl = 3,
Wl = 4. That is, we consider convolving a small single channel 3 √ó 4 matrix (or
image) with one 2 √ó 2 filter. Using the example in Figure 3, we have
Ô£Æ
Ô£∞
1 2 3 1
4 5 6 1
7 8 9 1
Ô£π
Ô£ª ‚àó

1 1
1 1 
=

12 16 11
24 28 17 
, (16)
where the first matrix is denoted as A, and ‚àó is the convolution operator.
Now let‚Äôs run a Matlab command B=im2col(A,[2 2]), we arrive at a B
matrix that is an expanded version of A:
B =
Ô£Æ
Ô£Ø
Ô£Ø
Ô£∞
1 4 2 5 3 6
4 7 5 8 6 9
2 5 3 6 1 1
5 8 6 9 1 1
Ô£π
Ô£∫
Ô£∫
Ô£ª
.
It is obvious that the first column of B corresponds to the first 2 √ó 2 region
in A, in a column-first order, corresponding to (i
l+1, jl+1) = (0, 0). Similarly,
the second to last column in B correspond to regions in A with (i
l+1, jl+1) being
(1, 0), (0, 1), (1, 1), (0, 2) and (1, 2), respectively. That is, the Matlab im2col
function explicitly expands the required elements for performing each individual
convolution into a column in the matrix B. The transpose of B, BT
, is called
the im2row expansion of A.
Now, if we vectorize the convolution kernel itself into a vector (in the same
column-first order) (1, 1, 1, 1)T
, we find that3
B
T
Ô£Æ
Ô£Ø
Ô£Ø
Ô£∞
1
1
1
1
Ô£π
Ô£∫
Ô£∫
Ô£ª
=
Ô£Æ
Ô£Ø
Ô£Ø
Ô£Ø
Ô£Ø
Ô£Ø
Ô£Ø
Ô£∞
12
24
16
28
11
17
Ô£π
Ô£∫
Ô£∫
Ô£∫
Ô£∫
Ô£∫
Ô£∫
Ô£ª
. (17)
3The notation and presentation of this note is heavily affected by the MatConvNet software
package‚Äôs manual (http://arxiv.org/abs/1412.4564, which is Matlab based). The transpose
of an im2col expansion is equivalent to an im2row expansion, in which the numbers involved
in one convolution is one row in the im2row expanded matrix. The derivation in this section
uses im2row, complying with the implementation in MatConvNet. Caffe, a widely used CNN
software package (http://caffe.berkeleyvision.org/, which is C++ based) uses im2col.
These formulations are mathematically equivalent to each other.
15
If we reshape this resulting vector in Equation 17 properly, we get the exact
convolution result matrix in Equation 16. That is, the convolution operator is
a linear one. We can multiply the expanded input matrix and the vectorized
filter to get a result vector, and by reshaping this vector properly we get the
correct convolution results.
We can generalize this idea to more complex situations and formalize them.
If Dl > 1 (that is, the input x
l has more than one channels), the expansion
operator could first expand the first channel of x
l
, then the second, . . . , till all
Dl
channels are expanded. The expanded channels will be stacked together,
that is, one row in the im2row expansion will have H √óW √óDl
elements, rather
than H √ó W.
More formally, suppose x
l
is a third order tensor in R
Hl√óWl√óDl
, with one
element in x
l being indexed by a triplet (i
l
, jl
, dl
). We also consider a set of
convolution kernels f, whose spatial extent are all H √óW. Then, the expansion
operator (im2row) converts x
l
into a matrix œÜ(x
l
). We use two indexes (p, q)
to index an element in this matrix. The expansion operator copies the element
at (i
l
, jl
, dl
) in x
l
to the (p, q)-th entry in œÜ(x
l
).
From the description of the expansion process, it is clear that given a fixed
(p, q), we can calculate its corresponding (i
l
, jl
, dl
) triplet, because obviously
p = i
l+1 + (Hl ‚àí H + 1) √ó j
l+1
, (18)
q = i + H √ó j + H √ó W √ó d
l
, (19)
i
l = i
l+1 + i , (20)
j
l = j
l+1 + j . (21)
In Equation 19, dividing q by HW and take the integer part of the quotient,
we can determine which channel (d
l
) does it belong to. Similarly, we can get the
offsets inside the convolution kernel as (i, j), in which 0 ‚â§ i < H and 0 ‚â§ j < W.
q completely determines one specific location inside the convolution kernel by
the triplet (i, j, dl
).
Note that the convolution result is x
l+1, whose spatial extent is Hl+1 =
Hl ‚àí H + 1 and Wl+1 = Wl ‚àí W + 1. Thus, in Equation 18, the remainder
and quotient of dividing p by Hl+1 = Hl ‚àí H + 1 will give us the offset in the
convolved result (i
l+1, jl+1), or, the top-left spatial location of the region in x
l
(which is to be convolved with the kernel).
Based on the definition of convolution, it is clear that we can use Equations 20 and 21 to find the offset in the input x
l as i
l = i
l+1+i and j
l = j
l+1+j.
That is, the mapping from (p, q) to (i
l
, jl
, dl
) is one-to-one. However, we want
to emphasize that the reverse mapping from (i
l
, jl
, dl
) to (p, q) is one-to-many, a
fact that is useful in deriving the back propagation rules in a convolution layer.
Now we use the standard vec operator to convert the set of convolution
kernels f (order 4 tensor) into a matrix. Let‚Äôs start from one kernel, which
can be vectorized into a vector in R
HWDl
. Thus, all convolution kernels can
be reshaped into a matrix with HW Dl
rows and D columns (remember that
Dl+1 = D.) Let‚Äôs call this matrix F.
16
Finally, with all these notations, we have a beautiful equation to calculate
convolution results (cf. Equation 17, in which œÜ(x
l
) is BT
):
vec(y) = vec(x
l+1) = vec
œÜ(x
l
)F

. (22)
Note that vec(y) ‚àà R
Hl+1Wl+1D, œÜ(x
l
) ‚àà R
(Hl+1Wl+1)√ó(HWDl
)
, and F ‚àà
R
(HWDl
)√óD. The matrix multiplication œÜ(x
l
)F results in a matrix of size
(Hl+1Wl+1) √ó D. The vectorization of this resultant matrix generates a vector
in R
Hl+1Wl+1D, which matches the dimensionality of vec(y).
6.4 The Kronecker product
A short detour to the Kronecker product is needed to compute the derivatives.
Given two matrices A ‚àà R
m√ón and B ‚àà R
p√óq
, the Kronecker product A‚äóB
is a mp √ó nq matrix, defined as a block matrix
A ‚äó B =
Ô£Æ
Ô£Ø
Ô£∞
a11B ¬∑ ¬∑ ¬∑ a1nB
.
.
.
.
.
.
.
.
.
am1B ¬∑ ¬∑ ¬∑ amnB
Ô£π
Ô£∫
Ô£ª . (23)
The Kronecker product has the following properties that will be useful for
us:
(A ‚äó B)
T = A
T ‚äó B
T
, (24)
vec(AXB) = (B
T ‚äó A) vec(X), (25)
for matrices A, X, and B with proper dimensions (e.g., when the matrix multiplication AXB is defined.) Note that Equation 25 can be utilized from both
directions.
With the help of ‚äó, we can write down
vec(y) = vec
œÜ(x
l
)F I
=

I ‚äó œÜ(x
l
)

vec(F), (26)
vec(y) = vec
IœÜ(x
l
)F

= (F
T ‚äó I) vec(œÜ(x
l
)), (27)
where I is an identity matrix of proper size. In Equation 26, the size of I is
determined by the number of columns in F, hence I ‚àà R
D√óD in Equation 26.
Similarly, in Equation 27, I ‚àà R
(Hl+1Wl+1)√ó(Hl+1Wl+1)
.
The derivation for gradient computation rules in a convolution layer involves
many variables and notations. We summarize the variables used in this derivation in Table 1. Note that some of these notations have not been introduced
yet.
6.5 Backward propagation: update the parameters
As previously mentioned, we need to compute two derivatives: ‚àÇz
‚àÇ vec(xl)
and
‚àÇz
‚àÇ vec(F )
, where the first term ‚àÇz
‚àÇ vec(xl) will be used for backward propagation
Table 1: Variables, their sizes and meanings. Note that ‚Äúalias‚Äù means a variable
has a different name or can be reshaped into another form.
Alias Size & Meaning
X x
l HlWl √ó D
l
, the input tensor
F f, wl HW Dl √ó D, D kernels, each H √ó W and D
l
channels
Y y, x
l+1 Hl+1Wl+1 √ó D
l+1, the output, D
l+1 = D
œÜ(x
l
) Hl+1Wl+1 √ó HW Dl
, the im2row expansion of x
l
M Hl+1Wl+1HW Dl √ó HlWlD
l
, the indicator matrix for œÜ(x
l
)
‚àÇz
‚àÇY
‚àÇz
‚àÇ vec(y) Hl+1Wl+1 √ó D
l+1, gradient for y
‚àÇz
‚àÇF
‚àÇz
‚àÇ vec(f) HW Dl √ó D, gradient to update the convolution kernels
‚àÇz
‚àÇX
‚àÇz
‚àÇ vec(xl) HlWl √ó D
l
, gradient for x
l
, useful for back propagation
to the previous ((l ‚àí 1)-th) layer, and the second term will determine how the
parameters of the current (l-th) layer will be updated. A friendly reminder
is to remember that f, F and wi
refer to the same thing (modulo reshaping
of the vector or matrix or tensor). Similarly, we can reshape y into a matrix
Y ‚àà R
(Hl+1Wl+1)√óD, then y, Y and x
l+1 refer to the same object (again modulo
reshaping).
From the chain rule (Equation 10), it is easy to compute ‚àÇz
‚àÇ vec(F )
as
‚àÇz
‚àÇ(vec(F))T
=
‚àÇz
‚àÇ(vec(Y )
T )
‚àÇ vec(y)
‚àÇ(vec(F)
T )
. (28)
The first term in the RHS is already computed in the (l+ 1)-th layer as (equivalently) ‚àÇz
‚àÇ(vec(xl+1))T . The second term, based on Equation 26, is pretty straightforward:
‚àÇ vec(y)
‚àÇ(vec(F)
T )
=
‚àÇ
I ‚äó œÜ(x
l
)

vec(F)

‚àÇ(vec(F)
T )
= I ‚äó œÜ(x
l
). (29)
Note that we have used the fact ‚àÇXa
T
‚àÇa = X or ‚àÇXa
‚àÇaT = X so long as the matrix
multiplications are well defined. This equation leads to
‚àÇz
‚àÇ(vec(F))T
=
‚àÇz
‚àÇ(vec(y)
T )
(I ‚äó œÜ(x
l
)). (30)
Making a transpose, we get
‚àÇz
‚àÇ vec(F)
=

I ‚äó œÜ(x
l
)
T ‚àÇz
‚àÇ vec(y)
(31)
=

I ‚äó œÜ(x
l
)
T

vec 
‚àÇz
‚àÇY 
(32)
= vec 
œÜ(x
l
)
T ‚àÇz
‚àÇY I

(33)
= vec 
œÜ(x
l
)
T ‚àÇz
‚àÇY 
. (34)
Note that both Equation 25 (from RHS to LHS) and Equation 24 are used in
the above derivation.
Thus, we conclude that
‚àÇz
‚àÇF = œÜ(x
l
)
T ‚àÇz
‚àÇY , (35)
which is a simple rule to update the parameters in the l-th layer: the gradient
with respect to the convolution parameters is the product between œÜ(x
l
)
T
(the
im2col expansion) and ‚àÇz
‚àÇY (the supervision signal transferred from the (l+1)-th
layer).
6.6 Even higher dimensional indicator matrices
The function œÜ(¬∑) has been very useful in our analysis. It is pretty high dimensional, e.g., œÜ(x
l
) has Hl+1Wl+1HW Dl
elements. From the above, we know
that an element in œÜ(x
l
) is indexed by a pair p and q.
A quick recap about œÜ(x
l
): 1) from q we can determine d
l
, which channel
of the convolution kernel is used; and can also determine i and j, the spatial
offsets inside the kernel; 2) from p we can determine i
l+1 and j
l+1, the spatial
offsets inside the convolved result x
l+1; and, 3) the spatial offsets in the input
x
l
can be determined as i
l = i
l+1 + i and j
l = j
l+1 + j.
That is, the mapping m : (p, q) 7‚Üí (i
l
, jl
, dl
) is one-to-one, and thus is
a valid function. The inverse mapping, however, is one-to-many (thus not a
valid function). If we use m‚àí1
to represent the inverse mapping, we know that
m‚àí1
(i
l
, jl
, dl
) is a set S, where each (p, q) ‚àà S satisfies that m(p, q) = (i
l
, jl
, dl
).
Now we take a look at œÜ(x
l
) from a different perspective. In order to fully
specify œÜ(x
l
), what information is required? It is obvious that the following
three types of information are needed (and only those). For every element of
œÜ(x
l
), we need to know
(A) Which region does it belong to, i.e., what is the value of p (0 ‚â§ p <
Hl+1Wl+1)?
(B) Which element is it inside the region (or equivalently inside the convolution
kernel), i.e., what is the value of q (0 ‚â§ q < HW Dl
)?
The above two types of information determines a location (p, q) inside œÜ(x
l
).
The only missing information is
(C) What is the value in that position, i.e.,
œÜ(x
l
)

pq
?
Since every element in œÜ(x
l
) is a verbatim copy of one element from x
l
, we
can turn [C] into a different but equivalent one:
(C.1) For
œÜ(x
l
)

pq
, where is this value copied from? Or, what is its original
location inside x
l
, i.e., an index u that satisfies 0 ‚â§ u < HlWlDl
?
(C.2) The entire x
l
.
19
It is easy to see that the collective information in [A, B, C.1] (for the entire range of p, q and u), and [C.2] (x
l
) contains exactly the same amount of
information as œÜ(x
l
).
Since 0 ‚â§ p < Hl+1Wl+1, 0 ‚â§ q < HW Dl
, and 0 ‚â§ u < HlWlDl
, we can
use a a matrix M ‚àà R
(Hl+1Wl+1HWDl
)√ó(HlWlDl
)
to encode the information in
[A, B, C.1]. One row index of this matrix corresponds to one location inside
œÜ(x
l
) (i.e., a (p, q) pair). One row of M has HlWlDl
elements, and each element
can be indexed by (i
l
, jl
, dl
). Thus, each element in this matrix is indexed by a
5-tuple: (p, q, il
, jl
, dl
).
Then, we can use the ‚Äúindicator‚Äù method to encode the function m(p, q) =
(i
l
, jl
, dl
) into M. That is, for any possible element in M, its row index x
determines a (p, q) pair, and its column index y determines a (i
l
, jl
, dl
) triplet,
and M is defined as
M(x, y) = (
1 if m(p, q) = (i
l
, jl
, dl
)
0 otherwise
. (36)
The M matrix has the following properties:
‚Ä¢ It is very high dimensional;
‚Ä¢ But it is also very sparse: there is only 1 non-zero entry in the HlWlDl
elements in one row, because m is a function;
‚Ä¢ M, which uses information [A, B, C.1], only encodes the one-to-one correspondence between any element in œÜ(x
l
) and any element in x
l
, it does
not encode any specific value in x
l
;
‚Ä¢ Most importantly, putting together the one-to-one correspondence information in M and the value information in x
l
, obviously we have
vec(œÜ(x
l
)) = M vec(x
l
). (37)
6.7 Backward propagation: prepare supervision signal for
the previous layer
In the l-th layer, we still need to compute ‚àÇz
‚àÇ vec(xl)
. For this purpose, we want to
reshape x
l
into a matrix X ‚àà R
(HlWl
)√óDl
, and use these two equivalent forms
(modulo reshaping) interchangeably.
The chain rule states that ‚àÇz
‚àÇ(vec(xl)T ) =
‚àÇz
‚àÇ(vec(y)T )
‚àÇ vec(y)
‚àÇ(vec(xl)T )
(cf. Equation 11). We will start by studying the second term in the RHS (utilizing
Equations 27 and 37):
‚àÇ vec(y)
‚àÇ(vec(xl)
T )
=
‚àÇ(F
T ‚äó I) vec(œÜ(x
l
))
‚àÇ(vec(xl)
T )
= (F
T ‚äó I)M . (38)
Thus,
‚àÇz
‚àÇ(vec(xl)
T )
=
‚àÇz
‚àÇ(vec(y)
T )
(F
T ‚äó I)M . (39)
20
Since (using Equation 25 from right to left)
‚àÇz
‚àÇ(vec(y)
T )
(F
T ‚äó I) = 
(F ‚äó I)
‚àÇz
‚àÇ vec(y)
T
(40)
=

(F ‚äó I) vec 
‚àÇz
‚àÇY T
(41)
= vec 
I
‚àÇz
‚àÇY F
T
T
(42)
= vec 
‚àÇz
‚àÇY F
T
T
, (43)
we have
‚àÇz
‚àÇ(vec(xl)
T )
= vec 
‚àÇz
‚àÇY F
T
T
M , (44)
or equivalently
‚àÇz
‚àÇ(vec(xl)) = MT
vec 
‚àÇz
‚àÇY F
T

. (45)
Let‚Äôs have a closer look at the RHS. ‚àÇz
‚àÇY F
T ‚àà R
(Hl+1Wl+1)√ó(HWDl
)
, and
vec
‚àÇz
‚àÇY F
T

is a vector in R
Hl+1Wl+1HWDl
. On the other hand, MT
is an
indicator matrix in R
(HlWlDl
)√ó(Hl+1Wl+1HWDl
)
.
In order to pinpoint one element in vec(x
l
) or one row in MT
, we need an
index triplet (i
l
, jl
, dl
), with 0 ‚â§ i
l < Hl
, 0 ‚â§ j
l < Wl
, and 0 ‚â§ d
l < Dl
.
Similarly, to locate a column in MT or an element in ‚àÇz
‚àÇY F
T
, we need an index
pair (p, q), with 0 ‚â§ p < Hl+1Wl+1 and 0 ‚â§ q < HW Dl
.
Thus, the (i
l
, jl
, dl
)-th entry of ‚àÇz
‚àÇ(vec(xl)) equals the multiplication of two
vectors: the row in MT
(or the column in M) that is indexed by (i
l
, jl
, dl
), and
vec
‚àÇz
‚àÇY F
T

.
Furthermore, since MT
is an indicator matrix, in the row vector indexed
by (i
l
, jl
, dl
), only those entries whose index (p, q) satisfies m(p, q) = (i
l
, jl
, dl
)
have a value 1, all other entries are 0. Thus, the (i
l
, jl
, dl
)-th entry of ‚àÇz
‚àÇ(vec(xl))
equals the sum of these corresponding entries in vec
‚àÇz
‚àÇY F
T

.
Transferring the above description into precise mathematical form, we get
the following succinct equation:

‚àÇz
‚àÇX 
(i
l
,jl
,dl)
=
X
(p,q)‚ààm‚àí1(i
l
,jl
,dl)

‚àÇz
‚àÇY F
T

(p,q)
. (46)
In other words, to compute ‚àÇz
‚àÇX , we do not need to explicitly use the extremely high dimensional matrix M. Instead, Equation 46 and Equations 18
to 21 can be used to efficiently find ‚àÇz
‚àÇX .
We use the simple convolution example in Figure 3 to illustrate the inverse
mapping m‚àí1
, which is shown in Figure 5.

   
   
   
Figure 5: Illustration of how to compute ‚àÇz
‚àÇX .
In the right half of Figure 5, the 6 √ó 4 matrix is ‚àÇz
‚àÇY F
T
. In order to compute
the partial derivative of z with respect to one element in the input X, we need
to find which elements in ‚àÇz
‚àÇY F
T
is involved and add them. In the left half of
Figure 5, we show that the input element 5 (shown in larger font) is involved
in 4 convolution operations, shown by the red, green, blue and black boxes,
respectively. These 4 convolution operations correspond to p = 1, 2, 3, 4. For
example, when p = 2 (the green box), 5 is the third element in the convolution,
hence q = 3 when p = 2 and we put a green circle in the (2, 3)-th element of
the ‚àÇz
‚àÇY F
T matrix. After all 4 circles are put in the ‚àÇz
‚àÇY F
T matrix, the partial
derivative is the sum of elements in these four locations of ‚àÇz
‚àÇY F
T
.
The set m‚àí1
(i
l
, jl
, dl
) contains at most HW Dl
elements. Hence, Equation 46
requires at most HW Dl
summations to compute one element of ‚àÇz
‚àÇX .
4
6.8 Fully connected layer as a convolution layer
As aforementioned, one benefit of the convolution layer is that convolution is a
local operation. The spatial extent of a kernel is often small (e.g., 3 √ó 3). One
element in x
l+1 is usually computed using only a small number of elements in
its input x
l
.
A fully connected layer refers to a layer if the computation of any element in
the output x
l+1 (or y) requires all elements in the input x
l
. A fully connected
layer is sometimes useful at the end of a deep CNN model. For example, if after
many convolution, ReLU and pooling (which will be discussed soon) layers, the
output of the current layer contain distributed representations for the input
image, we want to use all these features in the current layer to build features
with stronger capabilities in the next one. A fully connected layer is useful for
this purpose.
Suppose the input of a layer x
l has size Hl √óWl √ó Dl
. If we use convolution
kernels whose size is Hl √ó Wl √ó Dl
, then D such kernels form an order 4 tensor
4
In Caffe, this computation is implemented by a function called col2im. In MatConvNet,
this operation is operated in a row2im manner, although the name row2im is not explicitly
used.
22
in Hl √óWl √ó Dl √ó D. The output is y ‚àà R
D. It is obvious that to compute any
element in y, we need to use all elements in the input x
l
. Hence, this layer is
a fully connected layer, but can be implemented as a convolution layer. Hence,
we do not need to derive learning rules for a fully connected layer separately.
7 The pooling layer
We will use the same notation inherited from the convolution layer. Let x
l ‚àà
R
Hl√óWl√óDl
be the input to the l-th layer, which is now a pooling layer. The
pooling operation requires no parameter (i.e., wi
is null, hence parameter learning is not needed for this layer). The spatial extent of the pooling (H √ó W) is
specified in the design of the CNN structure. Assume that H divides Hl and W
divides Wl and the stride equals the pooling spatial extent,5
the output of pooling (y or equivalently x
l+1) will be an order 3 tensor of size Hl+1√óWl+1√óDl+1
,
with
Hl+1 =
Hl
H
, Wl+1 =
Wl
W
, Dl+1 = Dl
. (47)
A pooling layer operates upon x
l
channel by channel independently. Within
each channel, the matrix with Hl √ó Wl
elements are divided into Hl+1 √ó Wl+1
nonoverlapping subregions, each subregion being H √ó W in size. The pooling
operator then maps a subregion into a single number.
Two types of pooling operators are widely used: max pooling and average
pooling. In max pooling, the pooling operator maps a subregion to its maximum
value, while the average pooling maps a subregion to its average value. In precise
mathematics,
max : yi
l+1,jl+1,d = max
0‚â§i<H,0‚â§j<W
x
l
i
l+1√óH+i,jl+1√óW+j,d , (48)
average : yi
l+1,jl+1,d =
1
HW
X
0‚â§i<H,0‚â§j<W
x
l
i
l+1√óH+i,jl+1√óW+j,d , (49)
where 0 ‚â§ i
l+1 < Hl+1, 0 ‚â§ j
l+1 < Wl+1, and 0 ‚â§ d < Dl+1 = Dl
.
Pooling is a local operator, and its forward computation is pretty straightforward. Now we focus on the back propagation. Only max pooling is discussed
and we can resort to the indicator matrix again.6 All we need to encode in this
indicator matrix is: for every element in y, where does it come from in x
l
?
We need a triplet (i
l
, jl
, dl
) to pinpoint one element in the input x
l
, and
another triplet (i
l+1, jl+1, dl+1) to locate one element in y. The pooling output
yi
l+1,jl+1,dl+1 comes from x
l
i
l
,jl
,dl
, if and only if the following conditions are met:
‚Ä¢ They are in the same channel;
‚Ä¢ The (i
l
, jl
)-th spatial entry belongs to the (i
l+1, jl+1)-th subregion;
5That is, the strides in the vertical and horizontal direction are H and W, respectively.
The most widely used pooling setup is H = W = 2 with a stride 2.
6Average pooling can be dealt with using a similar idea.
23
‚Ä¢ The (i
l
, jl
)-th spatial entry is the largest one in that subregion.
Translating these conditions into equations, we get
d
l+1 = d
l
, (50)

i
l
H

= i
l+1
,

j
l
W

= j
l+1
, (51)
x
l
i
l
,jl
,dl ‚â• yi+i
l+1√óH,j+j
l+1√óW,dl , ‚àÄ 0 ‚â§ i < H, 0 ‚â§ j < W , (52)
where b¬∑c is the floor function. If the stride is not H (W) in the vertical (horizontal) direction, Equation 51 must be changed accordingly.
Given a (i
l+1, jl+1, dl+1) triplet, there is only one (i
l
, jl
, dl
) triplet that satisfies all these conditions. Thus, we define an indicator matrix
S(x
l
) ‚àà R
(Hl+1Wl+1Dl+1)√ó(HlWlDl
)
. (53)
One triplet of indexes (i
l+1, jl+1, dl+1) specifies a row in S, while (i
l
, jl
, dl
)
specifies a column. These two triplets together pinpoint one element in S(x
l
).
We set that element to 1 if Equations 50 to 52 are simultaneously satisfied, and
0 otherwise. One row of S(x
l
) corresponds to one element in y, and one column
corresponds to one element in x
l
.
With the help of this indicator matrix, we have
vec(y) = S(x
l
) vec(x
l
). (54)
Then, it is obvious that
‚àÇ vec(y)
‚àÇ(vec(xl)
T )
= S(x
l
),
‚àÇz
‚àÇ(vec(xl)
T )
=
‚àÇz
‚àÇ(vec(y)
T )
S(x
l
), (55)
and consequently
‚àÇz
‚àÇ vec(xl)
= S(x
l
)
T ‚àÇz
‚àÇ vec(y)
. (56)
S(x
l
) is very sparse. It has exactly one nonzero entry in every row. Thus, we
do not need to use the entire matrix in the computation. Instead, we just need
to record the locations of those nonzero entries‚Äîthere are only Hl+1Wl+1Dl+1
such entries in S(x
l
).
A simple example can explain the meaning of these equations. Let us consider a 2 √ó 2 max pooling with stride 2. For a given channel d
l
, the first spatial
subregion contains four elements in the input, with (i, j) = (0, 0), (1, 0), (0, 1)
and (1, 1), and let us suppose the element at spatial location (0, 1) is the largest
among them. In the forward pass, the value indexed by (0, 1, dl
) in the input
(i.e., x
l
0,1,dl ) will be assigned to the element in the (0, 0, dl
)-th element in the
output (i.e., y0,0,dl ).
One column in S(x
l
) contains at most one nonzero element if the strides are
H and W, respectively. In the above example, the column of S(x
l
) indexed by
24
(0, 0, dl
), (1, 0, dl
) and (1, 1, dl
) are all zero vectors. The column corresponding
to (0, 1, dl
) contains only one nonzero entry, whose row index is determined by
(0, 0, dl
). Hence, in the back propagation, we have

‚àÇz
‚àÇ vec(xl)

(0,1,dl)
=

‚àÇz
‚àÇ vec(y)

(0,0,dl)
,
and

‚àÇz
‚àÇ vec(xl)

(0,0,dl)
=

‚àÇz
‚àÇ vec(xl)

(1,0,dl)
=

‚àÇz
‚àÇ vec(xl)

(1,1,dl)
= 0 .
However, if the pooling strides are smaller than H and W in the vertical
and horizontal directions, respectively, one element in the input tensor may be
the largest element in several pooling subregions. Hence, there can have more
than one nonzero entries in one column of S(x
l
). Let us consider the example
input in Figure 5. If a 2 √ó 2 max pooling is applied to it and the stride is 1 in
both directions, the element 9 is the largest in two pooling regions: [ 5 6
8 9 ] and
[
6 1
9 1 ]. Hence, in the column of S(x
l
) corresponding to the element 9 (indexed by
(2, 2, dl
) in the input tensor), there are two nonzero entries whose row indexes
correspond to (i
l+1, jl+1, dl+1) = (1, 1, dl
) and (1, 2, dl
). Thus, in this example,
we have

‚àÇz
‚àÇ vec(xl)

(2,2,dl)
=

‚àÇz
‚àÇ vec(y)

(1,1,dl)
+

‚àÇz
‚àÇ vec(y)

(1,2,dl)
.
8 A case study: the VGG-16 net
We have introduced the convolution, pooling, ReLU and fully connected layers
till now, and have briefly mentioned the softmax layer. With these layers, we
can build many powerful deep CNN models.
8.1 VGG-Verydeep-16
The VGG-Verydeep-16 CNN model is a pretrained CNN model released by the
Oxford VGG group.7 We use it as an example to study the detailed structure
of CNN networks. The VGG-16 model architecture is listed in Table 2.
There are six types of layers in this model.
Convolution A convolution layer is abbreviated as ‚ÄúConv‚Äù. Its description
includes three parts: number of channels; kernel spatial extent (kernel
size); padding (‚Äòp‚Äô) and stride (‚Äòst‚Äô) size.
ReLU No description is needed for a ReLU layer.
7http://www.robots.ox.ac.uk/~vgg/research/very_deep/
25
Table 2: The VGG-Verydeep-16 architecture and receptive field
type description r. size type description r. size
1 Conv 64;3x3;p=1,st=1 212 20 Conv 512;3x3;p=1,st=1 20
2 ReLU 210 21 ReLU 18
3 Conv 64;3x3;p=1,st=1 210 22 Conv 512;3x3;p=1,st=1 18
4 ReLU 208 23 ReLU 16
5 Pool 2x2;st=2 208 24 Pool 2x2;st=2 16
6 Conv 128;3x3;p=1,st=1 104 25 Conv 512;3x3;p=1,st=1 8
7 ReLU 102 26 ReLU 6
8 Conv 128;3x3;p=1,st=1 102 27 Conv 512;3x3;p=1,st=1 6
9 ReLU 100 28 ReLU 4
10 Pool 2x2;st=2 100 29 Conv 512;3x3;p=1,st=1 4
11 Conv 256;3x3;p=1,st=1 50 30 ReLU 2
12 ReLU 48 31 Pool 2
13 Conv 256;3x3;p=1,st=1 48 32 FC (7x7x512)x4096 1
14 ReLU 46 33 ReLU
15 Conv 256;3x3;p=1,st=1 46 34 Drop 0.5
16 ReLU 44 35 FC 4096x4096
17 Pool 2x2;st=2 44 36 ReLU
18 Conv 512;3x3;p=1,st=1 22 37 Drop 0.5
19 ReLU 20 38 FC 4096x1000
39 œÉ (softmax layer)
Pool A pooling layer is abbreviated as ‚ÄúPool‚Äù. Only max pooling is used in
VGG-16. The pooling kernel size is always 2 √ó 2 and the stride is always
2 in VGG-16.
Fully connected A fully connected layer is abbreviated as ‚ÄúFC‚Äù. Fully connected layers are implemented using convolution in VGG-16. Its size is
shown in the format n1 √ó n2, where n1 is the size of the input tensor, and
n2 is the size of the output tensor. Although n1 can be a triplet (such as
7 √ó 7 √ó 512, n2 is always an integer.
Dropout A dropout layer is abbreviated as ‚ÄúDrop‚Äù. Dropout is a technique to
improve the generalization of deep learning methods. It sets the weights
connected to a certain percentage of nodes in the network to 0 (and VGG16 set the percentage to 0.5 in the two dropout layers).
Softmax It is abbreviated as ‚ÄúœÉ‚Äù.
We want to add a few notes about this example deep CNN architecture.
‚Ä¢ A convolution layer is always followed by a ReLU layer in VGG-16. The
ReLU layers increase the nonlinearity of the CNN model.
‚Ä¢ The convolution layers between two pooling layers have the same number
of channels, kernel size and stride. In fact, stacking two 3 √ó 3 convolution
layers is equivalent to one 5√ó5 convolution layer; and stacking three 3√ó3
convolution kernels replaces a 7 √ó 7 convolution layer. Stacking a few (2
or 3) smaller convolution kernels, however, computes faster than a large
26
convolution kernel. In addition, the number of parameters is also reduced,
e.g., 2 √ó 3 √ó 3 = 18 < 25 = 5 √ó 5. The ReLU layers inserted in between
small convolution layers are also helpful.
‚Ä¢ The input to VGG-16 is an image with size 224 √ó 224 √ó 3. Because the
padding is one in the convolution kernels (meaning one row or column is
added outside of the four edges of the input), convolution will not change
the spatial extent. The pooling layers will reduce the input size by a factor
of 2. Hence, the output after the last (5th) pooling layer has spatial extent
7 √ó 7 (and 512 channels). We may interpret this tensor as 7 √ó 7 √ó 512 =
25088 ‚Äúfeatures‚Äù. The first fully connected layer converts them into 4096
features. The number of features remains at 4096 after the second fully
connected layer.
‚Ä¢ The VGG-16 is trained for the ImageNet classification challenge, which is
an object recognition problem with 1000 classes. The last fully connected
layer (4096 √ó 1000) output a length 1000 vector for every input image,
and the softmax layer converts this length 1000 vector into the estimated
posterior probability for the 1000 classes.
8.2 Receptive field
Another important concept in CNN is the receptive field size (abbreviated as
‚Äúr. size‚Äù in Table 2). Let us look at one element in the input to the first fully
connected layer (32|FC). Because it is the output of a max pooling, we need
values in a 2 √ó 2 spatial extent in the input to the max pool layer to compute
this element (and we only need elements in this spatial extent). This 2 √ó 2
spatial extent is called the receptive field for this element. In Table 2, we listed
the spatial extent for any element in the output of the last pooling layer. Note
that because the receptive field is square, we only use one number (e.g., 48 for
48 √ó 48). The receptive field size listed for one layer is the spatial extent in the
input to that layer.
A 3 √ó 3 convolution layer will increase the receptive field by 2 and a pooling
layer will double the spatial extent. As shown in Table 2, receptive field size in
the input to the first layer is 212√ó212. In other words, in order to compute any
single element in the 7 √ó 7 √ó 512 output of the last pooling layer, a 212 √ó 212
image patch is required (including the padded pixels in all convolution layers).
It is obvious that the receptive field size increases when the network becomes
deeper, especially when a pooling layer is added to the deep net. Unlike traditional computer vision and image processing features which depend only on
a small receptive field (e.g., 16 √ó 16), deep CNN computes its representation
(or features) using large receptive fields. The larger receptive field characteristic is an important reason why CNN has achieved higher accuracy than classic
methods in image recognition.
27
9 Remarks
We hope this introductory note on CNN is clear, self-contained, and easy to
understand to our readers.
Once a reader is confident in his/her understanding of CNN at the mathematical level, in the next step it is very helpful to get some hands on CNN
experience. For example, one can validate what has been talked about in this
note using the MatConvNet software package if you prefer the Matlab environment.8 For C++ lovers, Caffe is a widely used tool.9 The Theano package
is a python package for deep learning.10 Many more resources for deep learning (not only CNN) are available, e.g., Torch,11, TensorFlow,12 and more will
emerge soon.
Exercises
1. Dropout is a very useful technique in training neural networks, which is
proposed by Srivastava et al. in a paper titled ‚ÄúDropout: A Simple Way
to Prevent Neural Networks from Overfitting‚Äù in JMLR.13 Carefully read
this paper and answer the following questions (please organize your answer
to every question in one brief sentence).
(a) How does dropout operate during training?
(b) How does dropout operate during testing?
(c) What is the benefit of dropout?
(d) Why dropout can achieve this benefit?
2. The VGG16 CNN model (also called VGG-Verydeep-16) was publicized
by Karen Simonyan and Andrew Zisserman in a paper titled ‚ÄúVery Deep
Convolutional Networks for Large-Scale Image Recognition‚Äù in the arXiv
preprint server.14 And, the GoogLeNet model was publicized by Szegedy
et al. in a paper titled ‚ÄúGoing Deeper with Convolutions‚Äù in the arXiv
preprint server.15 These two papers were publicized around the same time
and share some similar ideas. Carefully read both papers and answer the
following questions (please organize your answer to every question in one
brief sentence).
(a) Why do they use small convolution kernels (mainly 3 √ó 3) rather than
8http://www.vlfeat.org/matconvnet/
9http://caffe.berkeleyvision.org/
10http://deeplearning.net/software/theano/
11http://torch.ch/
12https://www.tensorflow.org/
13Available at http://jmlr.org/papers/v15/srivastava14a.html
14Available at https://arxiv.org/abs/1409.1556, later published in ICLR 2015 as a conference track paper.
15Available at https://arxiv.org/abs/1409.4842, later published in CVPR 2015.
28
larger ones?
(b) Why both networks are quite deep (i.e., with many layers, around 20)?
(c) Which difficulty is caused by the large depth? How are they solved in
these two networks?
3. Batch Normalization (BN) is another very useful technique in training
deep neural networks, which is proposed by Sergey Ioffe and Christian
Szegedy, in a paper titled ‚ÄúBatch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift‚Äù in ICML 2015.16
Carefully read this paper and answer the following questions (please organize your answer to every question in one brief sentence).
(a) What is internal covariate shift?
(b) How does BN deal with this?
(c) How does BN operate in a convolution layer?
(d) What is the benefit of using BN?
4. ResNet is a very deep neural network learning technique proposed by He
et al. in a paper titled ‚ÄúDeep Residual Learning for Image Recognition‚Äù in
CVPR 2016.17 Carefully read this paper and answer the following questions (please organize your answer to every question in one brief sentence).
(a) Although VGG16 and GoogLeNet have encountered difficulties in
training networks around 20‚Äì30 layers, what enables ResNet to train networks as deep as 1000 layers?
(b) VGG16 is a feed-forward network, where each layer has only one input
and only one output. While GoogLeNet and ResNet are DAGs (directed
acyclic graph), where one layer can have multiple inputs and multiple
outputs, so long as the data flow in the network structure does not form
a cycle. What is the benefit of DAG vs. feed-forward?
(c) VGG16 has two fully connected layers (fc6 and fc7), while ResNet and
GoogLeNet do not have fully connected layers (except the last layer for
classification). What is used to replace FC in them? What is the benefit?
5. AlexNet refers to the deep convolutional neural network trained on the
ILSVRC challenge data, which is a groundbreaking work of deep CNN
for computer vision tasks. The technical details of AlexNet is reported
in the paper ‚ÄúImageNet Classification with Deep Convolutional Neural
Networks‚Äù, by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton
in NIPS 25.18 It proposed the ReLU activation function and creatively
used GPUs to accelerate the computations. Carefully read this paper
16Available at http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
17Available at https://arxiv.org/pdf/1512.03385.pdf
18This paper is available at http://papers.nips.cc/paper/4824-imagenet-classification-withdeep-convolutional-neural-networks
29
and answer the following questions (please organize your answer to every
question in one brief sentence).
(a) Describe your understanding of how ReLU helps its success? And,
how do the GPUs help out?
(b) Using the average of predictions from several networks help reduce the
error rates. Why?
(c) Where is the dropout technique applied? How does it help? And what
is the cost of using dropout?
(d) How many parameters are there in AlexNet? Why the dataset size
(1.2 million) is important for the success of AlexNet?
6. We will try different CNN structures on the MNIST dataset. We denote
the ‚Äúbaseline‚Äù network in the MNIST example in MatConvNet as BASE
in this question.19 In this question, a convolution layer is denoted as
‚Äúx √ó y √ó nIn √ó nOut‚Äù, whose kernel size is x √ó y, with nIn input and nOut
output channels, with stride equal 1 and pad equal 0. The pooling layers
are 2 √ó 2 max pooling with stride equal 2. The BASE network has four
blocks. The first consists of a 5√ó5√ó1√ó20 convolution and a max pooling;
the second block is composed of a 5 √ó 5 √ó 20 √ó 50 convolution and a max
pooling; the third block is a 4 √ó 4 √ó 50 √ó 500 convolution (FC) plus a
ReLU layer; and the final block is the classification layer (1 √ó 1 √ó 500 √ó 10
convolution).
(a) The MNIST dataset is available at yann.lecun.com/exdb/mnist. Read
the instructions in that page, and write a program to transform the data
to formats that suit your favorite deep learning software.
(b) Learning deep learning models often involve random numbers. Before
the training starts, set the random number generator‚Äôs seed to 0. Then,
use the BASE network structure and the first 10000 training examples
to learn its parameters. What is test set error rate (on the 10000 test
examples) after 20 training epochs?
(c) From now on, if not otherwise specified, we assume the first 10000
training examples and 20 epochs are used. Now we define the BN network
structure, which adds a batch normalization layer after every convolution
layer in the first three blocks. What is its error rate? What will you say
about BN vs. BASE?
(d) If you add a dropout layer after the classification layer in the 4th block.
What is the new error rate of BASE and BN? What you will comment on
dropout?
(e) Now we define the SK network structure, which refers to small kernel
size. SK is based on BN. The first block (5 √ó 5 convolution plus pooling)
now is changed to two 3√ó3 convolutions, and BN + ReLU is applied after
19MatConvNet version 1.0-beta20. Please refer to MatConvNet for all the details of BASE,
such as parameter initialization and learning rate.
30
every convolution. For example, block 1 is now 3 √ó 3 √ó 1 √ó 20 convolution
+ BN + ReLU + 3√ó3√ó20√ó20 convolution + BN + ReLU + pool. What
is SK‚Äôs error rate? How will you comment on that (e.g., how and why the
error rate changes?)
(f) Now we define the SK-s networks structure. The notation ‚Äòs‚Äô refers to
a multiplier that changes the number of channels in convolution layers.
For example, SK is the same as SK-1. And, SK-2 means the number of
channels in all convolution layers (except the one in block 4) are multiplied
by 2. Train networks for SK-2, SK-1.5, SK-1, SK-0.5 and SK-0.2. Report
their error rates and comment on them.
(g) Now we experiment with different training set sizes using the SK-0.2
network structure. Using the first 500, 1000, 2000, 5000, 10000, 20000, and
60000 (all) training examples, what error rates do you achieve? Comment
on your observations.
(h) Using the SK-0.2 network structure, study how different training sets
affect its performance. Train 6 networks, and use the (10000√ó(i‚àí1)+ 1)-
th to (i √ó 10000)-th training examples in training the i-th network. Are
CNNs stable in terms of different training sets?
(i) Now we study how randomness affects CNN learning. Instead of set
the random number generator‚Äôs seed to 0, use 1, 12, 123, 1234, 12345 and
123456 as the seed to train 6 different SK-0.2 networks. What are their
error rates? Comment on your observations.
(j) Finally, in SK-0.2, change all ReLU layers to sigmoid layers. How do
you comment on the comparison on error rates of using ReLU and sigmoid
activation functions?
31
"""

# Save as a sample PDF file
with open('sample_research_paper.txt', 'w') as f:
    f.write(sample_research_content)

print("Sample research paper created!")
print("File: sample_research_paper.txt")
print(f"Size: {len(sample_research_content)} characters")

Sample research paper created!
File: sample_research_paper.txt
Size: 71671 characters


## Step 3: PDF Processing Explained

**What happens here:**
- We read the PDF file
- Split it into manageable chunks (like paragraphs)
- Add metadata so we know where each piece came from

In [8]:
class PDFProcessor:
    """Handles PDF reading and text processing"""

    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,  # Each chunk ~500 characters
            chunk_overlap=50,  # Overlap to maintain context
            separators=["\n\n", "\n", ". ", "! ", "? "]  # Smart splitting
        )

    def load_pdf_content(self, file_path: str) -> str:
        """Load content from PDF or text file"""
        try:
            if file_path.endswith('.pdf'):
                reader = PdfReader(file_path)
                text = ""
                for page in reader.pages:
                    text += page.extract_text() + "\n"
            else:
                with open(file_path, 'r') as f:
                    text = f.read()

            print(f"‚úÖ Loaded {len(text)} characters from {file_path}")
            return text
        except Exception as e:
            print(f"Error loading file: {e}")
            return ""

    def create_documents(self, text: str, source_name: str) -> List[Document]:
        """Convert text into LangChain documents"""
        # Split text into chunks
        chunks = self.text_splitter.split_text(text)

        # Create documents with metadata
        documents = []
        for i, chunk in enumerate(chunks):
            doc = Document(
                page_content=chunk,
                metadata={
                    "source": source_name,
                    "chunk_id": i,
                    "chunk_size": len(chunk),
                    "timestamp": datetime.now().isoformat()
                }
            )
            documents.append(doc)

        print(f"Created {len(documents)} document chunks")
        return documents

# Initialize processor
processor = PDFProcessor()

text = processor.load_pdf_content("/content/CNN.pdf")
# Process our sample research paper
documents = processor.create_documents(text, "research_paper_survey")

# Show first few chunks
print("\nüìã First 3 chunks:")
for i, doc in enumerate(documents[:3]):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Size: {doc.metadata['chunk_size']} characters")

‚úÖ Loaded 70780 characters from /content/CNN.pdf
Created 155 document chunks

üìã First 3 chunks:

--- Chunk 1 ---
Content: Introduction to Convolutional Neural Networks
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com
May 1, 2017
Contents
1 Introduction 2
2...
Size: 467 characters

--- Chunk 2 ---
Content: 3.2 The forward run . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Stochastic gradient descent (SGD) . . . . . . . . . . . . . . . . . 6
3.4 Error back propagation . . . . . . . . . . . ....
Size: 465 characters

--- Chunk 3 ---
Content: 6.3 Convolution as matrix product . . . . . . . . . . . . . . . . . . . 15
6.4 The Kronecker product . . . . . . . . . . . . . . . . . . . . . . . 17
6.5 Backward propagation: update the parameters . ...
Size: 439 characters


## Step 4: Open-Source Embeddings Explained

**What are embeddings?**
- Think of them as "smart fingerprints" for text
- Similar texts have similar fingerprints
- We use the **all-MiniLM-L6-v2** model (completely free and offline)

In [9]:
class EmbeddingManager:
    """Handles text embeddings using open-source models"""

    def __init__(self):
        # This is a small, fast, open-source model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.embedding_dim = 384  # Size of embedding vectors

        print(f"Loaded embedding model: all-MiniLM-L6-v2")
        print(f"Embedding dimension: {self.embedding_dim}")

    def create_embeddings(self, texts: List[str]) -> np.ndarray:
        """Convert texts to embedding vectors"""
        print("üîÑ Creating embeddings...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Created {len(embeddings)} embeddings")
        return embeddings

# Initialize embedding manager
embed_manager = EmbeddingManager()

# Test with a simple example
test_texts = ["Machine learning is amazing", "Deep learning uses neural networks", "AI transforms industries"]
test_query = "neural networks"

embeddings = embed_manager.create_embeddings(test_texts)
print(f"Created {len(embeddings)} embeddings for test texts")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loaded embedding model: all-MiniLM-L6-v2
Embedding dimension: 384
üîÑ Creating embeddings...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Created 3 embeddings
Created 3 embeddings for test texts


  return forward_call(*args, **kwargs)


In [10]:
embeddings

array([[-0.04862163, -0.08617098,  0.08960795, ...,  0.07205721,
        -0.00438847, -0.02266295],
       [-0.10114671, -0.0130303 ,  0.07322458, ...,  0.06143846,
         0.03814853, -0.03969791],
       [-0.07190882, -0.0214227 , -0.00817364, ..., -0.04696938,
         0.04265788, -0.03134049]], dtype=float32)

## Step 5: In-Memory Vector Store (ChromaDB)

**What is ChromaDB?**
- A lightweight, in-memory database for vectors
- Perfect for prototyping and learning
- No setup required - works immediately!

In [11]:
class VectorStoreManager:
    """Manages the in-memory vector database"""

    def __init__(self, embedding_manager: EmbeddingManager):
        self.embedding_manager = embedding_manager

        # Create in-memory Chroma client
        self.client = chromadb.Client()

        # Create or get collection
        self.collection = self.client.create_collection(
            name="research_papers",
            metadata={"description": "Academic paper chunks"}
        )

        print("Created in-memory vector store")

    def add_documents(self, documents: List[Document]):
        """Add documents to the vector store"""
        print("üîÑ Adding documents to vector store...")

        # Prepare data
        texts = [doc.page_content for doc in documents]
        metadatas = [doc.metadata for doc in documents]
        ids = [f"doc_{i}" for i in range(len(documents))]

        # Create embeddings
        embeddings = self.embedding_manager.create_embeddings(texts)

        # Add to collection
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=metadatas,
            ids=ids
        )

        print(f"Added {len(documents)} documents to vector store")

    def search(self, query: str, n_results: int = 3) -> List[Dict[str, Any]]:
        """Search for similar documents"""
        # Create query embedding
        query_embedding = self.embedding_manager.create_embeddings([query])[0]

        # Search
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results
        )

        # Format results
        formatted_results = []
        for i in range(len(results['documents'][0])):
            formatted_results.append({
                'content': results['documents'][0][i],
                'metadata': results['metadatas'][0][i],
                'distance': results['distances'][0][i]
            })

        return formatted_results

# Initialize vector store
vector_store = VectorStoreManager(embed_manager)
vector_store.add_documents(documents)

# Test search
search_results = vector_store.search("What is deep learning?", n_results=2)
print("\nSearch Results:")
for i, result in enumerate(search_results, 1):
    print(f"\n{i}. Content: {result['content'][:200]}...")
    print(f"   Distance: {result['distance']:.3f}")

Created in-memory vector store
üîÑ Adding documents to vector store...
üîÑ Creating embeddings...


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

  return forward_call(*args, **kwargs)


Created 155 embeddings
Added 155 documents to vector store
üîÑ Creating embeddings...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Created 1 embeddings

Search Results:

1. Content: deep network, subsequent layers can learn to activate only for speciÔ¨Åc (but more
complex) patterns, e.g., groups of edges that form a particular shape. These
more complex patterns will be further asse...
   Distance: 0.940

2. Content: understand to our readers.
Once a reader is conÔ¨Ådent in his/her understanding of CNN at the math-
ematical level, in the next step it is very helpful to get some hands on CNN
experience. For example, ...
   Distance: 1.000


## Step 6: Open-Source Language Model

**What we're using:**
- **Flan-T5** - Google's open-source model
- **Completely free** - runs on your machine
- **Good for educational purposes** - explains concepts clearly

In [None]:
from transformers import pipeline

class OpenSourceLLM:
    """Open-source language model for Q&A"""

    def __init__(self):
        # Using Flan-T5-small for educational purposes
        model_name = "google/flan-t5-small"

        print("Loading open-source language model...")
        self.qa_pipeline = pipeline(
            "text2text-generation",
            model=model_name,
            tokenizer=model_name,
            max_length=512,
            temperature=0.7
        )

        print(f"Loaded model: {model_name}")

    def generate_answer(self, question: str, context: str) -> str:
        """Generate answer using context"""
        # Create prompt
        prompt = f"""Answer the question based on the context provided.

        Context: {context}

        Question: {question}

        Answer:"""

        # Generate response
        response = self.qa_pipeline(prompt, max_length=200, do_sample=True)
        return response[0]['generated_text']

# Initialize LLM
llm = OpenSourceLLM()

# Test the model
test_context = "Deep learning is a subset of machine learning that uses neural networks with multiple layers."
test_question = "What is deep learning?"

test_answer = llm.generate_answer(test_question, test_context)
print(f"\nü§ñ Test Answer: {test_answer}")

## Step 7: Complete RAG Pipeline

**Putting it all together:**
- **R**etrieval: Find relevant chunks from PDF
- **A**ugmentation: Add context to the question
- **G**eneration: Create answer using open-source model

In [None]:
class ResearchAssistant:
    """Complete RAG system for research papers"""

    def __init__(self, vector_store: VectorStoreManager, llm: OpenSourceLLM):
        self.vector_store = vector_store
        self.llm = llm
        self.conversation_history = []

    def ask_question(self, question: str, n_contexts: int = 3) -> Dict[str, Any]:
        """Ask a question about the research paper"""
        print(f"Processing: {question}")

        start_time = time.time()

        # Step 1: Find relevant contexts
        relevant_docs = self.vector_store.search(question, n_results=n_contexts)

        # Step 2: Combine contexts
        combined_context = "\n\n".join([doc['content'] for doc in relevant_docs])

        # Step 3: Generate answer
        answer = self.llm.generate_answer(question, combined_context)

        processing_time = time.time() - start_time

        # Store conversation
        result = {
            "question": question,
            "answer": answer,
            "contexts_used": len(relevant_docs),
            "processing_time": round(processing_time, 2),
            "sources": [doc['metadata'] for doc in relevant_docs]
        }

        self.conversation_history.append(result)

        return result

# Initialize the research assistant
assistant = ResearchAssistant(vector_store, llm)

print("Research Assistant is ready!")
print("You can now ask questions about the research paper.")

## Step 8: Interactive Learning Session

**Let's test our system with educational questions!**

In [None]:
# Educational questions about the research paper
educational_questions = [
    "What is the difference between traditional NLP and deep learning NLP?",
    "Can you explain what a transformer is in simple terms?",
    "What are the main applications of deep learning in NLP?",
    "How do neural networks help with language understanding?",
    "What comes before deep learning in NLP history?"
]

print("üéì Educational Questions & Answers:")
print("=" * 50)

for question in educational_questions:
    print(f"\nQuestion: {question}")
    result = assistant.ask_question(question)
    print(f"Answer: {result['answer']}")
    print(f"Processing time: {result['processing_time']}s")
    print(f"Sources used: {result['contexts_used']} chunks")

## Final Summary & Next Steps

In [None]:
print("üéâ Congratulations! You've built a complete GenAI system!")
print("\nWhat you learned:")
print("How to process PDF documents into searchable chunks")
print("Using open-source embedding models (no API keys!)")
print("Building in-memory vector databases with ChromaDB")
print("Creating Q&A systems with open-source language models")
print("Adding educational features for better learning")

print("\n Next steps to explore:")
print("1. Try with your own PDF research papers")
print("2. Experiment with different embedding models")
print("3. Add conversation memory for follow-up questions")
print("4. Create a web interface using Streamlit")
print("5. Try larger open-source models like Llama-2")

# Save conversation history for review
with open('learning_session.json', 'w') as f:
    json.dump(assistant.conversation_history, f, indent=2)

print("\nConversation history saved to 'learning_session.json'")