In [63]:
import numpy as np
#from scipy.signal import convolve
from scipy.ndimage import convolve

#from imageio import imread
import matplotlib.pyplot as plt
from IPython import display
import tensorflow as tf

# **Convolutional Neural Networks (CNNs)**

A Convolutional Neural Network (CNN) is essentially a feedforward Multi-Layer Perceptron (MLP) that is designed to recognize local patterns and sparsity in input data. Like the MLP, each neuron is connected to others through learnable weights. These weights are adjusted during training to optimize the network's performance for a specific task.

The main difference between MLPs and CNNs is that the latter is developed for processing multidimensional data, such as images or videos. Also, CNNs have a more diverse set of specialized layers, including convolutional layers, pooling layers, and upsampling layers, which are optimized for processing spatial (image) and temporal data (video).

## **Convolutions**

Convolution is a mathematical operation that involves sliding a small matrix, known as a kernel or filter, across a larger matrix representing the input data, such as an image. At each position, the element-wise product $\odot$ is computed between the kernel and the local region (sub-matrix) it covers, and the products are summed into a single value. The result of this operation is a new matrix, called a feature map, which encodes information about the presence, absence, or strength of specific features in the input data.

Let's examine the following convolutional operations to illustrate this concept.


### **One Dimension**

Let's consider $\mathbf{x}$ as an input vector with $n$ elements and $\mathbf{w}$ as a weight vector, also known as a **filter**, with $k \leq n$ elements.

$$
\mathbf{x} =
\left( \begin{array}{c}
x_{1}\\
x_{2}\\
\vdots\\
x_{n}
\end{array} \right), ~~~~
\mathbf{w} =
\left( \begin{array}{c}
w_{1}\\
w_{2}\\
\vdots\\
w_{k}
\end{array} \right)
$$

Here $k$  is known as the  **window size** and indicates the size of the filter applied to the input vector $\mathbf{x}$. It defines the region of the local neighborhood within the input vector $\mathbf{x}$ used for computing output values. To proceed, we define a subvector of $\mathbf{x}$ with the same size as the filter vector. Let $\mathbf{x}_k(i)$ denote the window of $\mathbf{x}$  of size $k$ starting at position $i$:

$$\mathbf{x}_k(i) = \left( \begin{array}{c}
x_i \\
x_{i+1} \\
\vdots\\
x_{i+k-1} 
\end{array} \right).$$

Since the window must fit inside $\mathbf{x}$, we need $i+k-1 \leq n$, which implies $1 \leq i \leq n-k+1$. As a validity test, if we start at the last admissible position $i = n-k+1$, then the end position is $i+k-1 = n$, and the number of elements from position $i$ to position $n$ is $n - i + 1 = n - (n-k+1) + 1 = k$, which is the window size. For example, with $n = 5$ and $k = 3$:

$$
\mathbf{x} =
\left( \begin{array}{c}
x_{1}\\
x_{2}\\
x_{3}\\
x_{4}\\
x_{5}
\end{array} \right), ~~~~
\mathbf{w} =
\left( \begin{array}{c}
w_{1}\\
w_{2}\\
w_{3}
\end{array} \right)
$$

the window of $\mathbf{x}$ from $i = 2$ to $i+k-1 = 4$ is:

$$\mathbf{x}_3(2) = \left( \begin{array}{c}
x_2 \\
x_{3}\\
x_{4}
\end{array} \right)$$
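
The window notation can be checked numerically. The following is a minimal sketch, assuming NumPy 1.20 or later for `sliding_window_view`; the input values are hypothetical and chosen only so that each position is easy to recognize.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical input values (not from the figures), one per position
x = np.array([10, 20, 30, 40, 50])   # n = 5
k = 3                                # window size

# Row i-1 of `windows` is the window x_k(i) in the 1-based notation of the text
windows = sliding_window_view(x, k)

print(windows.shape)   # (n - k + 1, k)
print(windows[1])      # the window x_3(2) = (x_2, x_3, x_4)
```

Note that NumPy indexes from 0, so the window written $\mathbf{x}_3(2)$ in the text is `windows[1]` here.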

**Example**

Let's first consider a particular example with input vector $\mathbf{x}$ of size $n = 5$ and a weight vector with window size $k = 3$. The vectors are illustrated in the following figure:

<center><img src = "figures/1d-conv.png" width="800" height="400"/></center>


The convolution steps for the sliding windows of $\mathbf{x}$ with the filter $\mathbf{w}$ are:


$$
\sum \mathbf{x}_3(1) \odot \mathbf{w} = \sum (1, 3, -1)^T \odot (1, 0, 2)^T = \sum  (1 \cdot 1, 3 \cdot 0, -1 \cdot 2) = 1 + 0 - 2 = -1,
$$

$$
\sum \mathbf{x}_3(2) \odot \mathbf{w} = \sum (3, -1, 2)^T \odot (1, 0, 2)^T = \sum  (3 \cdot 1, -1 \cdot 0, 2 \cdot 2) = 3 + 0 + 4 = 7,
$$

$$
\sum \mathbf{x}_3(3) \odot \mathbf{w} = \sum (-1, 2, 3)^T \odot (1, 0, 2)^T = \sum  (-1 \cdot 1, 2 \cdot 0, 3 \cdot 2) = -1 + 0 + 6 = 5.
$$

The element-wise product $\odot$, also known as the Hadamard product, multiplies corresponding elements of two vectors and returns a vector of the same size. Unlike the inner product, which returns a single scalar, it needs an explicit summation to produce one output value. These steps give the convolution between the two vectors, resulting in a vector of size $n-k+1 = 3$. Thus, the convolution $\mathbf{x} * \mathbf{w}$ is:

$$\mathbf{x} * \mathbf{w} =
\left( \begin{array}{c}
-1\\
7\\
5
\end{array} \right)$$

In [64]:
# Code for the example

X = np.array([1, 3, -1, 2, 3])
# np.convolve reverses its second argument before sliding it, so we flip
# the filter W first to obtain the sliding dot product used in the
# machine learning and deep learning context
W = np.flip(np.array([1, 0, 2]))

# perform 1D convolution
output = np.convolve(X, W, mode='valid')

print("Input vector X:", X.shape)
print("filter W:", W.shape)
print("output X*W:", output)

Input vector X: (5,)
filter W: (3,)
output X*W: [-1  7  5]


In [42]:
W

array([2, 0, 1])
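
The effect of the flip can be made explicit with a short comparison. This is a sketch using only `np.convolve` and `np.correlate`; the variable names are illustrative and lowercase so they do not overwrite `X` and `W` from the cell above.

```python
import numpy as np

x = np.array([1, 3, -1, 2, 3])
w = np.array([1, 0, 2])

# np.convolve reverses its second argument before sliding it,
# so the unflipped filter gives a different result ...
conv_unflipped = np.convolve(x, w, mode="valid")

# ... while flipping first, or using np.correlate directly,
# reproduces the sliding dot product from the worked example
conv_flipped = np.convolve(x, np.flip(w), mode="valid")
corr = np.correlate(x, w, mode="valid")

print("convolve, unflipped filter:", conv_unflipped)
print("convolve, flipped filter:  ", conv_flipped)
print("correlate:                 ", corr)
```

In signal-processing terms, the operation that deep learning libraries call convolution is a cross-correlation, which is why the flip is needed when using `np.convolve`.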

This demonstrates that convolution is an element-wise product between a subvector and a weight vector of the same size, providing a scalar value when summed, which forms the result of the convolution operation.

To simplify the notation, for a vector $\mathbf{a} \in \mathbb{R}^k$ we define the summation operator as the one that adds all elements of the vector. That is,

$$\text{Sum}(\mathbf{a}) = \sum_{i=1}^{k} a_{i}$$ 

Then from the example, we would have

$$
\sum \mathbf{x}_3(1) \odot \mathbf{w} = \text{Sum}\bigg( \mathbf{x}_3(1) \odot \mathbf{w} \bigg)= 1 + 0 - 2 = -1.
$$

Then, we can define a general one dimensional convolution operation between $\mathbf{x}$ and $\mathbf{w}$, denoted by the asterisk symbol $\ast$, as 

$$\mathbf{x} \ast \mathbf{w} = \left( \begin{array}{c}
\text{Sum}(\mathbf{x}_k(1) \odot \mathbf{w})\\
\vdots\\
\text{Sum}(\mathbf{x}_k(i) \odot \mathbf{w})\\
\vdots\\
\text{Sum}(\mathbf{x}_k(n-k+1) \odot \mathbf{w})
\end{array} \right).$$

The convolution of $\mathbf{x} \in \mathbb{R}^{n}$ and $\mathbf{w} \in \mathbb{R}^{k}$ results in a vector of size $n-k+1$. The $i$-th element of this output vector can be written as

$$\text{Sum}(\mathbf{x}_k(i) \odot \mathbf{w}) = x_{i}w_1 + x_{i+1}w_2 + \cdots + x_{i+k-1}w_k =  \sum_{j=1}^{k} x_{i+j-1}w_j.$$

This shows that the sum runs over all $k$ elements of the subvector $\mathbf{x}_k(i)$: the last term pairs the last element of $\mathbf{x}_k(i)$, namely $x_{i+k-1}$, with the last element of $\mathbf{w}$, namely $w_k$. Repeating this for every admissible $i$ gives the convolution of $\mathbf{x}$ with $\mathbf{w}$ over the window of size $k$.
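
The general one-dimensional definition translates almost line by line into code. The function below is a minimal sketch; `conv1d_valid` is a hypothetical helper name, not a library function, and it uses 0-based indices where the text uses 1-based ones.

```python
import numpy as np

def conv1d_valid(x, w):
    """Sliding dot product: out[i] = sum_j x[i + j] * w[j] (0-based indices)."""
    n, k = len(x), len(w)
    out = np.empty(n - k + 1, dtype=np.result_type(x, w))
    for i in range(n - k + 1):
        # Sum(x_k(i) ⊙ w): element-wise product of the window and the filter, then sum
        out[i] = np.sum(x[i:i + k] * w)
    return out

x = np.array([1, 3, -1, 2, 3])
w = np.array([1, 0, 2])
result = conv1d_valid(x, w)
print(result)
```

Applied to the vectors of the worked example, the loop visits the three windows in order and returns one scalar per window.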

### **Two Dimensions**

We can extend the convolution operation to a matrix input instead of a vector. Let $\mathbf{X}$ be an $n \times n$ input matrix and $\mathbf{W}$ be a $k \times k$ weight matrix, also known as a **filter**, with $k \leq n$.

$$
\mathbf{X} = \begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,n}
\end{bmatrix},~~
\mathbf{W}=\begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,1} & w_{k,2} & \cdots & w_{k,k}
\end{bmatrix}
$$

Here, as in the one-dimensional case, $k$ is the window size and indicates the size of the filter applied to the input matrix $\mathbf{X}$. The notion of a subvector extends naturally to a submatrix. Let $\mathbf{X}_k(i,j)$ denote the $k \times k$ submatrix of $\mathbf{X}$ starting at row $i$ and column $j$:

$$\mathbf{X}_k(i,j) = \begin{bmatrix}
x_{i,j} & x_{i,(j+1)} & \cdots & x_{i,(j+k-1)} \\
x_{(i+1),j} & x_{(i+1),(j+1)} & \cdots & x_{(i+1),(j+k-1)} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j} & x_{(i+k-1),(j+1)} & \cdots & x_{(i+k-1),(j+k-1)}
\end{bmatrix}$$

where, given that both matrices are square, the two indices share the same range, $1 \leq i, j \leq n-k+1$.

As in the one-dimensional case, to simplify the notation, for a matrix $\mathbf{A} \in \mathbb{R}^{k \times k}$ we define the summation operator as the one that adds all elements of the matrix.

$$\text{Sum}(\mathbf{A}) = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{i,j}$$

Then, we can define  a general two dimensional convolution operation between matrices $\mathbf{X}$ and $\mathbf{W}$, as


$$\mathbf{X} \ast \mathbf{W} = \begin{bmatrix}
\text{Sum}(\mathbf{X}_k(1,1) \odot \mathbf{W}) &\text{Sum}(\mathbf{X}_k(1,2) \odot \mathbf{W}) & \cdots & \text{Sum}(\mathbf{X}_k(1,n-k+1) \odot \mathbf{W}) \\
\text{Sum}(\mathbf{X}_k(2,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_k(2,2) \odot \mathbf{W}) & \cdots &\text{Sum}(\mathbf{X}_k(2,n-k+1) \odot \mathbf{W})\\
\vdots & \vdots & \ddots & \vdots \\
\text{Sum}(\mathbf{X}_k(n-k+1,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_k(n-k+1,2) \odot \mathbf{W}) & \cdots & \text{Sum}(\mathbf{X}_k(n-k+1,n-k+1) \odot \mathbf{W})
\end{bmatrix}$$

where

$$\text{Sum}(\mathbf{X}_k(i,j) \odot \mathbf{W})=\sum_{a=1}^{k}\sum_{b=1}^{k} x_{(i+a-1),(j+b-1)} w_{a,b}$$

The convolution of $\mathbf{X} \in \mathbb{R}^{n \times n}$ and $\mathbf{W} \in \mathbb{R}^{k \times k}$ results in an $(n-k+1) \times (n-k+1)$ matrix.
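
The double sum can be written as two nested loops over the admissible window positions. This is a minimal sketch: `conv2d_valid` is a hypothetical helper name, and the demo matrices are illustrative values rather than data from the figures.

```python
import numpy as np

def conv2d_valid(X, W):
    """out[i, j] = sum_{a, b} X[i + a, j + b] * W[a, b] (0-based indices)."""
    n_rows, n_cols = X.shape
    k_rows, k_cols = W.shape
    out = np.empty((n_rows - k_rows + 1, n_cols - k_cols + 1),
                   dtype=np.result_type(X, W))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum(X_k(i, j) ⊙ W): element-wise product of the window and the filter
            out[i, j] = np.sum(X[i:i + k_rows, j:j + k_cols] * W)
    return out

# Illustrative input: a 4 x 4 matrix and a 3 x 3 filter of ones,
# so each output entry is simply the sum of one 3 x 3 window
X_demo = np.arange(16).reshape(4, 4)
W_demo = np.ones((3, 3), dtype=int)
out_demo = conv2d_valid(X_demo, W_demo)

print(out_demo.shape)   # (n - k + 1, n - k + 1)
print(out_demo)
```

With $n = 4$ and $k = 3$ the output has $n - k + 1 = 2$ rows and columns, matching the size formula.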

**Example**

Let's consider a particular example with input matrix $\mathbf{X}$ with dimension  $3 \times 3$ (n = 3) and a weight matrix with dimension  $2 \times 2$ (k = 2). The matrices are illustrated in the following figure:

<center><img src = "figures/2D-conv.png" width="800" height="400"/></center>

The convolution steps for the sliding windows of $\mathbf{X}$ with the filter $\mathbf{W}$ illustrated in the figure are mathematically translated to:


$$\text{Sum}(\mathbf{X}_2(1,1) \odot \mathbf{W})=\text{Sum}\bigg(
    \begin{bmatrix} 
    1 & 2 \\
    3 & 1 
    \end{bmatrix} 
    \odot 
    \begin{bmatrix} 
    1 & 0 \\
    0 & 1
    \end{bmatrix} \bigg) =  2$$

$$\text{Sum}(\mathbf{X}_2(1,2) \odot \mathbf{W}) = \text{Sum}\bigg(
    \begin{bmatrix} 
    2 & 2 \\
    1 & 4
    \end{bmatrix} 
    \odot
    \begin{bmatrix}
    1 & 0 \\
    0 & 1
    \end{bmatrix} \bigg)= 6$$


$$\text{Sum}(\mathbf{X}_2(2,1) \odot \mathbf{W}) = \text{Sum}\bigg( 
    \begin{bmatrix} 
    3 & 1 \\ 
    2 & 1 \end{bmatrix} 
    \odot \begin{bmatrix} 
    1 & 0 \\ 
    0 & 1 
    \end{bmatrix} \bigg)= 4$$


$$\text{Sum}(\mathbf{X}_2(2,2) \odot \mathbf{W}) = \text{Sum}\bigg( 
    \begin{bmatrix} 
    1 & 4 \\
    1 & 3 \end{bmatrix}
    \odot
    \begin{bmatrix}
    1 & 0 \\ 
    0 & 1
    \end{bmatrix} \bigg) = 4$$

The convolution $\mathbf{X}*\mathbf{W}$ has size $2 \times 2$, since  $n - k + 1 = 3 - 2 + 1 = 2$, and is given by

$$\mathbf{X}*\mathbf{W} = 
    \begin{bmatrix} 
        \text{Sum}(\mathbf{X}_2(1,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_2(1,2) \odot \mathbf{W}) \\
        \\
        \text{Sum}(\mathbf{X}_2(2,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_2(2,2) \odot \mathbf{W}) 
    \end{bmatrix} = 
    \begin{bmatrix} 
        2 & 6 \\
        4 & 4 
    \end{bmatrix}$$
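
The same result can be obtained in vectorized form. This sketch assumes NumPy 1.20 or later for `sliding_window_view`. The entries of `X` are an assumption: they are reconstructed from the windows shown for positions $(1,1)$, $(1,2)$ and $(2,1)$, since the figure's values are not reproduced in the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Input reconstructed from the windows in the worked example (an assumption)
X = np.array([[1, 2, 2],
              [3, 1, 4],
              [2, 1, 3]])
W = np.array([[1, 0],
              [0, 1]])

# windows[i, j] is the 2 x 2 submatrix starting at row i, column j (0-based)
windows = sliding_window_view(X, W.shape)

# Multiply every window element-wise by W and sum over the window axes a, b
feature_map = np.einsum("ijab,ab->ij", windows, W)
print(feature_map)
```

Because this filter is the $2 \times 2$ identity, each output entry is the sum of the main diagonal of the corresponding window.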

## **Three dimensional Convolution on CNNs**



We now extend the convolution operation to a three-dimensional array, also called a rank-3 tensor. In the context of convolutional neural networks, and in the nested layout shown below, the first (outermost) dimension comprises the depth (the number of 2D slices stacked along the depth axis), the second the rows (height), and the third the columns (width). In the subscripts the depth index is written last, so $x_{i,j,q}$ is the entry at row $i$ and column $j$ of slice $q$. A rank-3 tensor in the context of a CNN would be a single channel of a 3D volume. Here's how we can represent a single-channel 3D tensor mathematically:


$$
\mathbf{X}=
\begin{bmatrix}
\begin{bmatrix}
x_{1,1,1} & x_{1,2,1} & \cdots & x_{1,n,1} \\
x_{2,1,1} & x_{2,2,1} & \cdots & x_{2,n,1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1,1} & x_{n,2,1} & \cdots & x_{n,n,1}
\end{bmatrix}\\
\\
\begin{bmatrix}
x_{1,1,2} & x_{1,2,2} & \cdots & x_{1,n,2} \\
x_{2,1,2} & x_{2,2,2} & \cdots & x_{2,n,2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1,2} & x_{n,2,2} & \cdots & x_{n,n,2}
\end{bmatrix}\\
\\
\vdots\\
\\
\begin{bmatrix}
x_{1,1,m} & x_{1,2,m} & \cdots & x_{1,n,m} \\
x_{2,1,m} & x_{2,2,m} & \cdots & x_{2,n,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1,m} & x_{n,2,m} & \cdots & x_{n,n,m}
\end{bmatrix}
\end{bmatrix},~~
\mathbf{W}= \begin{bmatrix}
\begin{bmatrix}
w_{1,1,1} & w_{1,2,1} & \cdots & w_{1,k,1} \\
w_{2,1,1} & w_{2,2,1} & \cdots & w_{2,k,1} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,1,1} & w_{k,2,1} & \cdots & w_{k,k,1}
\end{bmatrix}\\
\\
\begin{bmatrix}
w_{1,1,2} & w_{1,2,2} & \cdots & w_{1,k,2} \\
w_{2,1,2} & w_{2,2,2} & \cdots & w_{2,k,2} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,1,2} & w_{k,2,2} & \cdots & w_{k,k,2}
\end{bmatrix}
\\
\vdots\\
\\
\begin{bmatrix}
w_{1,1,r} & w_{1,2,r} & \cdots & w_{1,k,r} \\
w_{2,1,r} & w_{2,2,r} & \cdots & w_{2,k,r} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,1,r} & w_{k,2,r} & \cdots & w_{k,k,r}
\end{bmatrix}
\end{bmatrix}
$$

As with convolutions in lower dimensions, the window size must satisfy $k \leq n$, and the total number of slice matrices along the depth of the filter tensor should not exceed the depth of the input tensor, meaning $r \leq m$. Although this mathematical representation of a tensor may seem complex at first, it closely resembles how Python libraries like NumPy represent a tensor.

Let's extend the idea of defining a sub-tensor from the input tensor $\mathbf{X}$. Let $\mathbf{X}_k(i,j,q)$ denote a $k \times k \times r$ sub-tensor of $\mathbf{X}$ that starts at row $i$, column $j$, and depth $q$ of $\mathbf{X}$. It is defined as:


$$
\mathbf{X}_k(i,j,q)=
\begin{bmatrix}
\begin{bmatrix}
x_{i,j,q} & x_{i,(j+1),q} & \cdots & x_{i,(j+k-1),q} \\
x_{(i+1),j,q} & x_{(i+1),(j+1),q} & \cdots & x_{(i+1),(j+k-1),q} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,q} & x_{(i+k-1),(j+1),q} & \cdots & x_{(i+k-1),(j+k-1),q}
\end{bmatrix}\\
\\
\\
\begin{bmatrix}
x_{i,j,(q+1)} & x_{i,(j+1),(q+1)} & \cdots & x_{i,(j+k-1),(q+1)} \\
x_{(i+1),j,(q+1)} & x_{(i+1),(j+1),(q+1)} & \cdots & x_{(i+1),(j+k-1),(q+1)} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,(q+1)} & x_{(i+k-1),(j+1),(q+1)} & \cdots & x_{(i+k-1),(j+k-1),(q+1)}
\end{bmatrix}\\
\\
\vdots\\
\\
\begin{bmatrix}
x_{i,j,(q+r-1)} & x_{i,(j+1),(q+r-1)} & \cdots & x_{i,(j+k-1),(q+r-1)} \\
x_{(i+1),j,(q+r-1)} & x_{(i+1),(j+1),(q+r-1)} & \cdots & x_{(i+1),(j+k-1),(q+r-1)} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,(q+r-1)} & x_{(i+k-1),(j+1),(q+r-1)} & \cdots & x_{(i+k-1),(j+k-1),(q+r-1)}
\end{bmatrix}
\end{bmatrix}
$$

where the rows and columns have the same range $1 \leq i, j \leq n-k+1$ as in two dimensions, and the depth index has the additional range $1 \leq q \leq m-r+1$.

Typically in CNNs, we use a 3D filter $\mathbf{W} \in \mathbb{R}^{k \times k \times m}$, with the number of channels $r=m$, the same as the number of channels of the input tensor $\mathbf{X} \in \mathbb{R}^{n \times n \times m}$. In this simplified case the only admissible depth position is $q = 1$, and the sub-tensor becomes:

$$
\mathbf{X}_k(i,j,1)=
\begin{bmatrix}
\begin{bmatrix}
x_{i,j,1} & x_{i,(j+1),1} & \cdots & x_{i,(j+k-1),1} \\
x_{(i+1),j,1} & x_{(i+1),(j+1),1} & \cdots & x_{(i+1),(j+k-1),1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,1} & x_{(i+k-1),(j+1),1} & \cdots & x_{(i+k-1),(j+k-1),1}
\end{bmatrix}\\
\\
\\
\begin{bmatrix}
x_{i,j,2} & x_{i,(j+1),2} & \cdots & x_{i,(j+k-1),2} \\
x_{(i+1),j,2} & x_{(i+1),(j+1),2} & \cdots & x_{(i+1),(j+k-1),2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,2} & x_{(i+k-1),(j+1),2} & \cdots & x_{(i+k-1),(j+k-1),2}
\end{bmatrix}\\
\\
\vdots\\
\\
\begin{bmatrix}
x_{i,j,m} & x_{i,(j+1),m} & \cdots & x_{i,(j+k-1),m} \\
x_{(i+1),j,m} & x_{(i+1),(j+1),m} & \cdots & x_{(i+1),(j+k-1),m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(i+k-1),j,m} & x_{(i+k-1),(j+1),m} & \cdots & x_{(i+k-1),(j+k-1),m}
\end{bmatrix}
\end{bmatrix}
$$


This way we fixed the value of the channels to be equal between filter and input. 

As for the other dimensions, given a tensor $\mathbf{A} \in \mathbb{R}^{k \times k \times m}$, we define the summation operator as the one that adds all elements of the tensor. That is,

$$\text{Sum}(\mathbf{A}) = \sum_{i=1}^{k}\sum_{j=1}^{k}\sum_{q=1}^{m}a_{i,j,q}$$

Before generalizing the convolution operation to three dimensions, let's consider an example to illustrate the logic behind all this dense mathematical notation.

**Example**

Consider an input tensor $\mathbf{X}$ with dimension $3 \times 3 \times 3$ ($n = 3$ and $m = 3$) and a filter with dimension $2 \times 2 \times 3$ (window size $k = 2$ and $r = m = 3$). The tensors are illustrated in the following figure:

<center><img src = "figures/3dd-conv.png" ></center>


The convolution steps for the sliding windows of $\mathbf{X}$ with the filter $\mathbf{W}$ illustrated in the figure are:

$$
\text{Sum}(\mathbf{X}_2(1,1,1) \odot \mathbf{W}) = \text{Sum}\bigg( 
\begin{bmatrix} 
    \begin{bmatrix} 
        1 & -1 \\ 
        2 & 1 
    \end{bmatrix} \\
    \\
    \begin{bmatrix} 
        2 & 1 \\
        3 & -1
    \end{bmatrix} \\
    \\
    \begin{bmatrix} 
        1 & -2 \\ 
        2 & 1 
    \end{bmatrix} 
\end{bmatrix} 
\odot 
\begin{bmatrix} 
    \begin{bmatrix} 
        1 & 1 \\
        2 & 0 
    \end{bmatrix} \\
    \\
    \begin{bmatrix} 
        1 & 0 \\ 
        0 & 1
    \end{bmatrix} \\
    \\
    \begin{bmatrix}
        0 & 1 \\
        1 & 0 
    \end{bmatrix} 
\end{bmatrix} \bigg) =  \text{Sum}\bigg(   
    \begin{bmatrix} 
    \begin{bmatrix} 
        1  & -1  \\ 
        4 & 0 
    \end{bmatrix} \\
    \\
    \begin{bmatrix} 
        2 & 0 \\
        0 & -1
    \end{bmatrix} \\
    \\
    \begin{bmatrix} 
        0 & -2 \\ 
        2 & 0 
    \end{bmatrix} 
\end{bmatrix} \bigg) = 1 - 1 + 4 +2 -1 -2+ 2 = 5 
$$


$$
\text{Sum}(\mathbf{X}_2(1,2,1) \odot \mathbf{W}) = \text{Sum}\bigg(
\begin{bmatrix}
    \begin{bmatrix}
        -1 & 3 \\
        1 & 4
    \end{bmatrix} \\
    \\
    \begin{bmatrix}
        1 & 3 \\
        -1 & 1
    \end{bmatrix} \\
    \\
    \begin{bmatrix}
        -2 & 4 \\
        1 & -2
    \end{bmatrix}
\end{bmatrix}
\odot
\begin{bmatrix}
    \begin{bmatrix}
        1 & 1 \\
        2 & 0
    \end{bmatrix} \\
    \\
    \begin{bmatrix}
        1 & 0 \\
        0 & 1
    \end{bmatrix} \\
    \\
    \begin{bmatrix}
        0 & 1 \\
        1 & 0
    \end{bmatrix}
\end{bmatrix} \bigg) = \text{Sum}\bigg(
\begin{bmatrix}
\begin{bmatrix}
    -1 & 3 \\
    2 & 0
\end{bmatrix} \\
\\
\begin{bmatrix}
    1 & 0 \\
    0 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    0 & 4 \\
    1 & 0
\end{bmatrix}
\end{bmatrix} \bigg) = -1 + 3 + 2 + 1 + 1 + 4 + 1 = 11 
$$


$$
\text{Sum}(\mathbf{X}_2(2,1,1) \odot \mathbf{W}) = \text{Sum}\bigg(
\begin{bmatrix}
\begin{bmatrix}
    2 & 1 \\
    3 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    3 & -1 \\
    1 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    2 & 1 \\
    1 & 3
\end{bmatrix}
\end{bmatrix}
\odot
\begin{bmatrix}
\begin{bmatrix}
    1 & 1 \\
    2 & 0
\end{bmatrix} \\
\\
\begin{bmatrix}
    1 & 0 \\
    0 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    0 & 1 \\
    1 & 0
\end{bmatrix}
\end{bmatrix} \bigg) = \text{Sum}\bigg(
\begin{bmatrix}
\begin{bmatrix}
    2 & 1 \\
    6 & 0
\end{bmatrix} \\
\\
\begin{bmatrix}
    3 & 0 \\
    0 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    0 & 1 \\
    1 & 0
\end{bmatrix}
\end{bmatrix} \bigg) = 2 + 1 + 6 + 3 + 1 + 1 + 1 = 15
$$


$$
\text{Sum}(\mathbf{X}_2(2,2,1) \odot \mathbf{W}) = \text{Sum}\bigg(
\begin{bmatrix}
\begin{bmatrix}
    1 & 4 \\
    1 & 2
\end{bmatrix} \\
\\
\begin{bmatrix}
    -1 & 1 \\
    1 & -2
\end{bmatrix} \\
\\
\begin{bmatrix}
    1 & -2 \\
    3 & -1
    \end{bmatrix}
\end{bmatrix}
\odot
\begin{bmatrix}
\begin{bmatrix}
    1 & 1 \\
    2 & 0
\end{bmatrix} \\
\\
\begin{bmatrix}
    1 & 0 \\
    0 & 1
\end{bmatrix} \\
\\
\begin{bmatrix}
    0 & 1 \\
    1 & 0
\end{bmatrix}
\end{bmatrix} \bigg) = \text{Sum}\bigg(
\begin{bmatrix}
\begin{bmatrix}
    1 & 4 \\
    2 & 0
\end{bmatrix} \\
\\
\begin{bmatrix}
    -1 & 0 \\
    0 & -2
\end{bmatrix} \\
\\
\begin{bmatrix}
    0 & -2 \\
    3 & 0
\end{bmatrix}
\end{bmatrix} \bigg) = 1 + 4 + 2 - 1 - 2 - 2 + 3 = 5
$$

The convolution $\mathbf{X}*\mathbf{W}$ has size $2 \times 2$, since $n - k + 1 = 3 - 2 + 1 = 2$ and $r = m = 3$; it is given by

 $$\mathbf{X}*\mathbf{W} = 
 \begin{bmatrix}
    \text{Sum}(\mathbf{X}_2(1,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_2(1,2) \odot \mathbf{W}) \\
    \\
    \text{Sum}(\mathbf{X}_2(2,1) \odot \mathbf{W}) & \text{Sum}(\mathbf{X}_2(2,2) \odot \mathbf{W})
\end{bmatrix}
=
\begin{bmatrix}
    5 & 11 \\
    15 & 5
\end{bmatrix}
$$

Note that because we fixed the third dimension as $r = m$, by restricting the input tensor $\mathbf{X} \in \mathbb{R}^{n \times n \times m}$ and the filter tensor $\mathbf{W} \in \mathbb{R}^{k \times k \times m}$ to have the same number of channels, we get an $(n-k+1) \times (n-k+1) \times 1$ tensor, whose last dimension can be dropped to leave a matrix, since there is no freedom to slide the filter along the third dimension. Because of this, the notation for the sub-tensor can be shortened to $\text{Sum}(\mathbf{X}_k(i,j,1) \odot \mathbf{W}) = \text{Sum}(\mathbf{X}_k(i,j) \odot \mathbf{W})$.

In [95]:
X = np.array([
    [[1, -1, 3],
     [2, 1, 4],
     [3, 1, 2]],

    [[2, 1, 3],
     [3, -1, 1],
     [1, 1, -2]],

    [[1, -2, 4],
     [2, 1, -2],
     [1, 3, -1]]
], dtype=np.float32)

# TensorFlow expects the input to have a shape of [batch, (depth, height, width), channels]
# Add a batch dimension and a channel dimension to X
X = X.reshape(1, *X.shape, 1) 
# create a simple 3D kernel
W = np.array([
    [[1, 1],
     [2, 0]],

    [[1, 0],
     [0, 1]],

    [[0, 1],
     [1, 0]]
], dtype=np.float32)

# TensorFlow expects the filter to have a shape of [(depth, height, width), in_channels = 1, out_channels = 1]
# Since our input has a single channel (in_channels = 1) and we want a single output channel ( out_channels = 1),
# Add those dimensions to W
W = W.reshape(*W.shape, 1, 1) 

# 3D convolution
output = tf.nn.conv3d(input=X, filters=W, strides=[1, 1, 1, 1, 1], padding="VALID")

# squeeze to remove the redundant dimensions of batch and channel
output_2d = output.numpy().squeeze()    

print("Shape of input tensor X:\n", X.shape)
print("Shape of filter tensor W:\n", W.shape)
print("Convolved output shape with channel and batch dimension:\n", output.shape)
print("Convolved output:\n", output_2d)

Shape of input tensor X:
 (1, 3, 3, 3, 1)
Shape of filter tensor W:
 (3, 2, 2, 1, 1)
Convolved output shape with channel and batch dimension:
 (1, 1, 2, 2, 1)
Convolved output:
 [[ 5. 11.]
 [15.  5.]]
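
The TensorFlow result can be cross-checked with plain NumPy. This is a minimal sketch that applies the definition directly; the names `X3`, `W3` and `out3` are illustrative, chosen so the reshaped `X` and `W` above are not overwritten, and the arrays use the same (depth, height, width) layout as the cell above.

```python
import numpy as np

# Same values as the input and filter above, in (depth, height, width) layout
X3 = np.array([
    [[1, -1, 3], [2, 1, 4], [3, 1, 2]],
    [[2, 1, 3], [3, -1, 1], [1, 1, -2]],
    [[1, -2, 4], [2, 1, -2], [1, 3, -1]],
])
W3 = np.array([
    [[1, 1], [2, 0]],
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
])

n, k = X3.shape[1], W3.shape[1]
out3 = np.empty((n - k + 1, n - k + 1), dtype=X3.dtype)
for i in range(n - k + 1):
    for j in range(n - k + 1):
        # The window spans all depth slices because the filter depth equals the input depth
        out3[i, j] = np.sum(X3[:, i:i + k, j:j + k] * W3)

print(out3)
```

Since the filter covers every depth slice, the loops only move along rows and columns, which is why the result is a $2 \times 2$ matrix.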




Typically in CNNs, we use a 3D filter $\mathbf{W} \in \mathbb{R}^{k \times k \times m}$, with the number of channels $r=m$, the same as the number of channels of the input tensor $\mathbf{X} \in \mathbb{R}^{n \times n \times m}$.

Each channel of the input tensor $\mathbf{X}$ represents a different aspect (or feature) of the input, and **the filter is used to extract features from $\mathbf{X}$ by convolving it with the filter tensor $\mathbf{W}$**. By using a filter $\mathbf{W}$ with the same number of channels as the input $\mathbf{X}$, we ensure that the filter is applied to every channel of the input data. The filter carries its own $k \times k$ slice of weights for each channel, and the per-channel products are summed into a single output value, so every channel contributes to the learned feature.




If instead $r < m$, the filter can also slide along the depth axis, and the resulting convolution would have dimension $(n-k+1) \times (n-k+1) \times (m-r+1)$ instead of $(n-k+1) \times (n-k+1) \times 1$. We will see that the simplification $r = m$ is important, because it lets each filter produce exactly one output channel when a tensor $\mathbf{X}$ is convolved with more than one filter.
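
A filter with fewer depth slices than the input can slide along the depth axis as well. The sketch below illustrates the resulting output shape, assuming NumPy 1.20 or later; the sizes and the random integer values are hypothetical. The arrays use a (depth, height, width) layout, so the depth size appears first in the printed shape.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

m, n = 4, 3   # hypothetical input depth and spatial size
r, k = 2, 2   # hypothetical filter depth and window size, with r < m

rng = np.random.default_rng(0)
X_deep = rng.integers(-3, 4, size=(m, n, n))      # (depth, height, width)
W_shallow = rng.integers(-1, 2, size=(r, k, k))

# windows[q, i, j] is the r x k x k sub-tensor starting at depth q, row i, column j
windows = sliding_window_view(X_deep, W_shallow.shape)

# Element-wise product with the filter, summed over the three window axes
out_deep = np.einsum("qijcab,cab->qij", windows, W_shallow)

print(out_deep.shape)   # (m - r + 1, n - k + 1, n - k + 1)
```

With $r = m$ the first axis would collapse to size 1, which recovers the single-matrix output of the earlier example.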