## 2.3. Linear Algebra

*Studying and coding along with the printed book __„Dive into Deep Learning“__ by Aston Zhang, Zachary C. Lipton, Mu Li & Alexander J. Smola. The accompanying website for the chapter Preliminaries > Linear Algebra can be found at [d2l.ai](https://d2l.ai/chapter_preliminaries/linear-algebra.html).*

__In order to build sophisticated models with tensors we will need some knowledge of linear algebra. *There's no way around it :)*__

In [70]:
import torch

### 2.3.1. Scalars

- The values in mathematical operations are called __scalars__
- Known values (like 5 or 9 in an equation) are __constant scalars__. Unknown variables (like c or f in an equation) represent __unknown scalars__
- Scalars are denoted by lower case letters like x, y or z
- The space of all (continuous) real-valued scalars is <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow data-mjx-texclass="ORD">
    <mi mathvariant="double-struck">R</mi>
  </mrow>
</math>
- <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>x</mi>
  <mo>&#x2208;</mo>
  <mrow data-mjx-texclass="ORD">
    <mi mathvariant="double-struck">R</mi>
  </mrow>
</math> is a formal way to say that x is a real-valued scalar
- The symbol <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mo>&#x2208;</mo>
</math> (pronounced “in”) denotes membership in a set
- <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>x</mi>
  <mo>,</mo>
  <mi>y</mi>
  <mo>&#x2208;</mo>
  <mo fence="false" stretchy="false">{</mo>
  <mn>0</mn>
  <mo>,</mo>
  <mn>1</mn>
  <mo fence="false" stretchy="false">}</mo>
</math> for example, indicates that x and y are variables that can only take values 0 or 1.

In [71]:
# scalars are implemented as tensors that contain only one element
x = torch.tensor(3.0)
y = torch.tensor(2.0)

In [72]:
# performing the addition, subtraction, multiplication, division, and exponentiation operations
x + y

tensor(5.)

In [73]:
x - y

tensor(1.)

In [74]:
x * x

tensor(9.)

In [75]:
x / y

tensor(1.5000)

In [76]:
x ** y

tensor(9.)

### 2.3.2. Vectors

- A vector is like a fixed-length array of scalars
- These scalars are the elements of the vector (synonyms: entries or components)
- As an example, studying the risk of heart attack: Each patient might get a vector assigned with the elements "most recent vital signs", "cholesterol levels" or "minutes of exercise per day"
- In the book vectors are denoted by bold lowercase letters like **x**, **y** or **z**
- Vectors are implemented as 1<sup>st</sup>-order tensors which can have arbitrary length
- Python vector indices start at 0 (zero-based indexing)
- In linear algebra subscripts begin at 1 (one-based indexing)
- By default vectors are visualized as column vectors, i.e. by stacking their elements __vertically__
- In general there are column vectors (elements stacked vertically) and row vectors (elements stacked horizontally)

In [77]:
x = torch.arange(3)
x

tensor([0, 1, 2])

The elements of a vector can be denoted by using a subscript.

<img src="../assets/images/0231_vector.png" style="width:150px;vertical-align:middle" />

x<sub>2</sub> (a scalar), denotes the second element of vector **x**. (But we would access it in Python with x[1].)

The vector contains n elements (n is the dimensionality of the vector): **x** ∈ ℝ<sup>n</sup>

In [78]:
# accessing a tensor's element via indexing
x[2]

tensor(2)

In [79]:
# a tensor’s length is accessible via Python’s built-in len function
len(x)

3

In [80]:
# accessing the length via the shape attribute
# it returns a tuple that indicates a tensor’s length along each axis
# tensors with just one axis have shapes with just one element
x.shape

torch.Size([3])

__Clarifying the use of the word “dimension”:__

- “dimension” is often used to mean both, the number of axes and the length along a particular axis
- in this book (or tutorial), <span style="color:red">
  - ***“order” is used to refer to the number of axes***
  - ***dimensionality exclusively is used to refer to the number of components***</span>
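
To make the distinction concrete, here is a minimal sketch in plain Python that uses nested lists instead of torch tensors. The helper `shape_of` is a hypothetical name introduced only for this illustration, and it assumes the nested list is rectangular and non-empty.

```python
def shape_of(nested):
    """Return the length along each axis of a rectangular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


vector = [0, 1, 2]                 # one axis
matrix = [[0, 1], [2, 3], [4, 5]]  # two axes

vector_shape = shape_of(vector)    # (3,)
matrix_shape = shape_of(matrix)    # (3, 2)

# "order" counts the axes
vector_order = len(vector_shape)   # 1
matrix_order = len(matrix_shape)   # 2

# "dimensionality" counts the components of the vector
vector_dimensionality = vector_shape[0]  # 3
```

The vector is a 1<sup>st</sup>-order tensor with dimensionality 3, while the matrix is a 2<sup>nd</sup>-order tensor with lengths 3 and 2 along its two axes.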

### 2.3.3. Matrices

- Scalars are 0<sup>th</sup>-order tensors
- Vectors are 1<sup>st</sup>-order tensors
- Matrices are 2<sup>nd</sup>-order tensors

- Matrices are denoted by bold capital letters (e.g., **X**, **Y** or **Z**)
- In code they are represented by tensors with two axes
- Matrices are often used for representing datasets with rows corresponding to individual records and columns corresponding to attributes

- A matrix **A** containing m × n real-valued scalars is expressed as **A** ∈ ℝ<sup>m × n</sup>
- The scalars are arranged as m rows and n columns
- A matrix is *square* when m = n

Visual representation of a matrix as a table:

<img src="../assets/images/0232_matrix.png" style="width:300px;vertical-align:middle" />

Referring to an individual element: a<sub>ij</sub> is the value in **A**'s i<sup>th</sup> row and j<sup>th</sup> column.

In code a matrix **A** ∈ ℝ<sup>m × n</sup> is represented by a 2<sup>nd</sup>-order tensor with shape (m, n).

In [81]:
A = torch.arange(6).reshape(3,2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

#### __Transpose__

In linear algebra, the transpose of a matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix **A** by producing another matrix, often denoted by **A<sup>T</sup>** (among other notations) (Source: https://en.wikipedia.org/wiki/Transpose).

The transpose of an *m* × *n* matrix is an *n* × *m* matrix.

A matrix **A**'s transpose is denoted by **A<sup>T</sup>**. If **B** = **A<sup>T</sup>** then b<sub>ij</sub> = a<sub>ji</sub>.

<img src="../assets/images/0233_matrix_transpose.png" style="width:400px;vertical-align:middle" />
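
The index rule b<sub>ij</sub> = a<sub>ji</sub> can be spelled out by hand. The following is a small sketch in plain Python with nested lists (not the torch implementation), using the same values as the 3 × 2 matrix from the previous section.

```python
A = [[0, 1],
     [2, 3],
     [4, 5]]  # 3 x 2 matrix as nested lists

rows, cols = len(A), len(A[0])

# B = A^T: the element in row i, column j of B is taken from row j, column i of A
B = [[A[j][i] for j in range(rows)] for i in range(cols)]

print(B)  # [[0, 2, 4], [1, 3, 5]]
```

The result has 2 rows and 3 columns, so the shape is swapped as well.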


In [82]:
# accessing any matrix’s transpose
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

Symmetric matrices are the subset of square matrices that are equal to their own transposes: **A** = **A<sup>T</sup>**. 

In [83]:
# example of a symmetric matrix
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A

tensor([[1, 2, 3],
        [2, 0, 4],
        [3, 4, 5]])

In [84]:
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

### 2.3.4. Tensors

- Tensors give us a generic way to describe extensions to n<sup>th</sup>-order arrays
- Software objects of the tensor class can have an arbitrary number of axes
- The word tensor is used for both the mathematical object and its realization in code (which might be a bit confusing for the novice learner)
- Tensors will become important when working with images
- Each image arrives as a 3<sup>rd</sup>-order tensor with axes corresponding to the height, width, and channel

In [85]:
torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

### 2.3.5. Basic Properties of Tensor Arithmetic

Elementwise operations with scalars, vectors, matrices and higher-order tensors produce outputs that have the same shape as their operands.

In [86]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3) # 2x3 matrix
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [87]:
B = A.clone() # cloning A and receiving another 2x3 matrix
B

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [88]:
# addition of two matrices
A + B

tensor([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]])

The elementwise product of two matrices is called their __Hadamard product__ (⊙ symbol).

In [89]:
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

When adding a scalar to a tensor, the scalar is added to each element of the tensor. The resulting tensor has the same shape as the original tensor.

In [90]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X

tensor([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],

        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]])

When multiplying a scalar and a tensor, each element of the tensor is multiplied by the scalar. The resulting tensor has the same shape as the original tensor.

In [91]:
b = 66
Y = torch.arange(66).reshape(2, 3, 11)
b * Y

tensor([[[   0,   66,  132,  198,  264,  330,  396,  462,  528,  594,  660],
         [ 726,  792,  858,  924,  990, 1056, 1122, 1188, 1254, 1320, 1386],
         [1452, 1518, 1584, 1650, 1716, 1782, 1848, 1914, 1980, 2046, 2112]],

        [[2178, 2244, 2310, 2376, 2442, 2508, 2574, 2640, 2706, 2772, 2838],
         [2904, 2970, 3036, 3102, 3168, 3234, 3300, 3366, 3432, 3498, 3564],
         [3630, 3696, 3762, 3828, 3894, 3960, 4026, 4092, 4158, 4224, 4290]]])

### 2.3.6. Reduction

Expressing the sum of the elements in a vector **x** of length *n*:

<img src="../assets/images/sum_vec_x_n.png" style="width:100px;vertical-align:middle" />

In [92]:
x = torch.arange(10, dtype=torch.float32)
x

tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [93]:
# sum of the elements in a vector x
x.sum()

tensor(45.)

Sums over the elements of tensors of arbitrary shape are expressed by summing over all of their axes.
The sum of an *m* × *n* matrix **A**:

<img src="../assets/images/sum_over_sum.png" style="width:200px;vertical-align:middle" />

In [94]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3) # example from above
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [95]:
A.shape

torch.Size([2, 3])

In [96]:
A.sum()

tensor(15.)

In [97]:
print(A[0])
print(A[1])
A[0].sum() + A[1].sum()

tensor([0., 1., 2.])
tensor([3., 4., 5.])


tensor(15.)

By default, invoking the `sum` function ***reduces a tensor along all of its axes and produces a scalar***. The `sum` function also accepts an `axis` argument naming the axes along which the tensor should be reduced (`axis=0` means we sum over all elements along the rows, which yields one sum per column).

If we specify axis=0 in sum, the input matrix ***reduces along axis 0*** to generate the output vector. Therefore axis 0 is missing from the shape of the output vector.

In [98]:
A.shape

torch.Size([2, 3])

In [99]:
A.sum(axis=0)

tensor([3., 5., 7.])

In [100]:
A.sum(axis=0).shape

torch.Size([3])

In [101]:
# passing axis=1 as a parameter will reduce the column dimension (axis 1)
# this reduction will be done by summing up elements of all the columns
A.sum(axis=1)

tensor([ 3., 12.])

In [102]:
A.sum(axis=1).shape

torch.Size([2])

In [103]:
# reducing a matrix along both rows and columns via summation
# equivalent to summing up all the elements of the matrix
A.sum(axis=[0,1]) == A.sum()

tensor(True)
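
To see which elements are combined for each axis, here is a rough sketch in plain Python (nested lists, no torch) that reproduces the two reductions from the cells above.

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]  # same values as the 2 x 3 matrix A above

# axis=0: the rows are collapsed, leaving one sum per column (length 3)
sum_axis0 = [sum(column) for column in zip(*A)]

# axis=1: the columns are collapsed, leaving one sum per row (length 2)
sum_axis1 = [sum(row) for row in A]

print(sum_axis0)  # [3.0, 5.0, 7.0]
print(sum_axis1)  # [3.0, 12.0]
```

In both cases the axis that was summed over disappears from the shape of the result.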

__Calculating the mean of a tensor:__

The mean (aka the average) is calculated by dividing the sum by the total number of elements.

In [104]:
mean = A.sum() / A.numel()
mean

tensor(2.5000)

In [105]:
# dedicated library function that works analogously to sum
A.mean() == mean

tensor(True)

In [106]:
# calculating the mean by reducing a tensor along specific axes
mean_axis_zero = A.sum(axis=0) / A.shape[0]
mean_axis_zero

tensor([1.5000, 2.5000, 3.5000])

In [107]:
A.mean(axis=0) == mean_axis_zero

tensor([True, True, True])

### 2.3.7. Non-Reduction Sum

Sometimes it is useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean, for example when we want to use the broadcasting mechanism.

In [108]:
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [109]:
A.shape

torch.Size([2, 3])

In [110]:
sum_A = A.sum(axis=1) # sum over axis 1 regular style, doesn't keep shape
sum_A

tensor([ 3., 12.])

In [111]:
sum_A = A.sum(axis=1, keepdims=True) # stays in shape with keepdims=True 
sum_A

tensor([[ 3.],
        [12.]])

In [112]:
sum_A.shape

torch.Size([2, 1])

In [113]:
# sum_A keeps its two axes after summing each row
# now let's divide A by sum_A with broadcasting to create a matrix where each row sums up to 1
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])

In [114]:
# calculating the cumulative sum of the elements of A along axis 0 (row by row)
# by calling the cumsum function
# the cumsum function doesn't reduce the input tensor along any axis
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

In [115]:
# once again A for comparison's sake
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [116]:
B = torch.arange(12).reshape(3,4)
B

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [117]:
B.cumsum(axis=0) # each row becomes the running sum of all rows up to and including it

tensor([[ 0,  1,  2,  3],
        [ 4,  6,  8, 10],
        [12, 15, 18, 21]])

In [118]:
C = torch.arange(12).reshape(3,4)
C

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

In [119]:
C.cumsum(axis=1) # each column becomes the running sum of all columns up to and including it

tensor([[ 0,  1,  3,  6],
        [ 4,  9, 15, 22],
        [ 8, 17, 27, 38]])
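
The running-sum behaviour can be reproduced with the standard library. This is only a sketch in plain Python using `itertools.accumulate` on nested lists; it is not how torch computes `cumsum` internally.

```python
from itertools import accumulate

C = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]

# along axis 1: running sum within each row
cumsum_axis1 = [list(accumulate(row)) for row in C]

# along axis 0: running sum down each column, then transpose back into rows
columns = [list(accumulate(column)) for column in zip(*C)]
cumsum_axis0 = [list(row) for row in zip(*columns)]

print(cumsum_axis1)  # [[0, 1, 3, 6], [4, 9, 15, 22], [8, 17, 27, 38]]
print(cumsum_axis0)  # [[0, 1, 2, 3], [4, 6, 8, 10], [12, 15, 18, 21]]
```

Both results keep the 3 × 4 shape of the input, which is why `cumsum` counts as a non-reduction operation.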

### 2.3.8. Dot Products

- The dot product is one of the most fundamental operations in linear algebra
- A dot product is a single number that reflects the commonalities between two objects (vectors, matrices, tensors, signals, images)
- Given two vectors **x**, **y** ∈ ℝ<sup>d</sup>, their dot product **x**<sup>T</sup>**y** is a sum over the products of the elements at the same position:

<img src="../assets/images/0238_dot_product.png" style="width:200px;vertical-align:middle" />

- The dot product is also known as the inner product, ⟨**x**, **y**⟩


In [120]:
x = torch.arange(3, dtype=torch.float32)
x

tensor([0., 1., 2.])

In [121]:
y = torch.ones(3, dtype = torch.float32)
y

tensor([1., 1., 1.])

In [122]:
torch.dot(x, y)

tensor(3.)

In [123]:
z = x * y
z

tensor([0., 1., 2.])

In [124]:
z.sum() # is this how it's calculated?
# first the product, then the sum over the result?

tensor(3.)

In [125]:
# calculating the dot product of two vectors by performing an elementwise multiplication followed by a sum
torch.sum(x * y)

tensor(3.)

__Now for some very abstract explanation of what can be done with the dot product:__

- Let's say we have a vector **x** ∈ ℝ<sup>n</sup> and a set of weights denoted by **w** ∈ ℝ<sup>n</sup>
- Now the weighted sum of the values in **x** according to the weights **w** could be expressed as the dot product **x**<sup>T</sup>**w**
- When the weights are nonnegative and sum to 1, then the dot product expresses a weighted average
- After normalizing two vectors to have unit length, their dot product expresses the cosine of the angle between them
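
The last two points can be tried out with small numbers. The following is a minimal sketch in plain Python; the helpers `dot` and `normalize` and all values are made up for illustration.

```python
import math


def dot(u, v):
    """Sum over the products of the elements at the same position."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]


# weighted average: the weights are nonnegative and sum to 1
x = [2.0, 4.0, 6.0]
w = [0.5, 0.25, 0.25]
weighted_average = dot(x, w)  # 1.0 + 1.0 + 1.5 = 3.5

# cosine of the angle between two vectors that are 45 degrees apart
u = [1.0, 0.0]
v = [1.0, 1.0]
cosine = dot(normalize(u), normalize(v))  # approximately 0.7071
```

The cosine agrees with `math.cos(math.pi / 4)` up to floating-point rounding.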

### 2.3.9. Matrix–Vector Products

We now calculate the product between an *m* × *n* matrix **A** and an *n*-dimensional vector **x**.

We begin by visualizing the matrix in terms of its row vectors, where each **a**<sub>*i*</sub><sup>T</sup> ∈ ℝ<sup>n</sup> is a row vector
representing the *i<sup>th</sup>* row of the matrix **A**:

<img src="../assets/images/0239_matrix_vec_prod_1.png" style="width:200px;vertical-align:middle" />

The matrix–vector product __Ax__ is a column vector of length *m* whose *i<sup>th</sup>* element is the dot product **a**<sub>*i*</sub><sup>T</sup>**x**:

<img src="../assets/images/0239_matrix_vec_prod_2.png" style="width:300px;vertical-align:middle" />

Multiplication with a matrix __A__ ∈ ℝ<sup>*m* × *n*</sup> can be thought of as a transformation that projects vectors from ℝ<sup>*n*</sup> to ℝ<sup>*m*</sup>.
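
The row-by-row definition can be written down directly. This is a sketch in plain Python with nested lists, using the same values as the matrix **A** and vector **x** from the earlier cells; the helper `dot` is introduced only for this illustration.

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]  # m x n = 2 x 3
x = [0.0, 1.0, 2.0]    # n = 3


def dot(u, v):
    """Sum over the products of the elements at the same position."""
    return sum(a * b for a, b in zip(u, v))


# the i-th element of Ax is the dot product of the i-th row of A with x
Ax = [dot(row, x) for row in A]

print(Ax)  # [5.0, 14.0]
```

The input vector has 3 components and the result has 2, which matches the projection from ℝ<sup>*n*</sup> to ℝ<sup>*m*</sup>.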

In [126]:
# expressing a matrix–vector product in code
# matrix A from above
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [127]:
A.shape

torch.Size([2, 3])

In [128]:
# vector x from above
x

tensor([0., 1., 2.])

In [129]:
x.shape

torch.Size([3])

In [130]:
# the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length)
# A[1] is the second row of A, so A[1].shape is torch.Size([A.shape[1]]), i.e. the length along axis 1
A[1].shape == x.shape

True

In [131]:
# using the mv function to express a matrix–vector product in code
torch.mv(A, x)

tensor([ 5., 14.])

In [132]:
# the @ operator executes both matrix–vector and matrix–matrix products
A@x

tensor([ 5., 14.])

### 2.3.10. Matrix–Matrix Multiplication

We have two matrices, __A__ ∈ ℝ<sup>*n* × *k*</sup> and __B__ ∈ ℝ<sup>*k* × *m*</sup>:

<img src="../assets/images/0239_matrix_matrix_prod_1.png" style="width:80%;vertical-align:middle" />

- __a__<sub>*i*</sub><sup>T</sup> ∈ ℝ<sup>*k*</sup> denotes the row vector representing the *i<sup>th</sup>* row of the matrix __A__
- __b__<sub>*j*</sub> ∈ ℝ<sup>*k*</sup> denotes the column vector representing the *j<sup>th</sup>* column of the matrix __B__

<img src="../assets/images/0239_matrix_matrix_prod_2.png" style="width:80%;vertical-align:middle" />

- To form the matrix product __C__ ∈ ℝ<sup>*n* × *m*</sup> we compute each element *c<sub>ij</sub>* as the dot product between the *i<sup>th</sup>* row of __A__ and the *j<sup>th</sup>* column of __B__:

<img src="../assets/images/0239_matrix_matrix_prod_3.png" style="width:80%;vertical-align:middle" />

(The screenshots with mathematical notation are taken from [d2l.ai](https://d2l.ai/chapter_preliminaries/linear-algebra.html) because I really don't know how to write them in markdown.)

The matrix–matrix multiplication __AB__ can be seen as performing *m* matrix–vector products or *m* × *n* dot products and stitching the results together to form an *n* × *m* matrix.
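
Written out by hand, each entry of the product is one row-times-column dot product. The following is a sketch in plain Python with nested lists (not the torch implementation); the values mirror the 2 × 3 matrix **A** used throughout and a 3 × 4 matrix of ones.

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]               # n x k = 2 x 3
B = [[1.0] * 4 for _ in range(3)]   # k x m = 3 x 4, all ones


def dot(u, v):
    """Sum over the products of the elements at the same position."""
    return sum(a * b for a, b in zip(u, v))


columns_of_B = list(zip(*B))  # the m columns of B, each of length k

# c_ij is the dot product of the i-th row of A and the j-th column of B
C = [[dot(row, column) for column in columns_of_B] for row in A]

print(C)  # [[3.0, 3.0, 3.0, 3.0], [12.0, 12.0, 12.0, 12.0]]
```

The result has 2 rows and 4 columns, so 2 × 4 = 8 dot products were computed.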

In [133]:
# code example for performing matrix multiplication on A and B
# we already have A, a matrix with two rows and three columns
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [134]:
# let's create another matrix B
# B is a matrix with three rows and four columns
B = torch.ones(3, 4)
B

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

In [135]:
# using the mm function to express a matrix–matrix product in code
# after multiplication, we obtain a matrix with two rows and four columns
torch.mm(A, B)

tensor([[ 3.,  3.,  3.,  3.],
        [12., 12., 12., 12.]])

In [136]:
# same with the @ operator
A@B

tensor([[ 3.,  3.,  3.,  3.],
        [12., 12., 12., 12.]])

- The term matrix–matrix multiplication is often simplified to matrix multiplication
- Matrix multiplication should not be confused with the Hadamard product
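
A side-by-side comparison on two small square matrices shows the difference. This is a sketch in plain Python with made-up values.

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Hadamard product: multiply the elements at the same position
hadamard = [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]

# matrix multiplication: dot products of the rows of A with the columns of B
matmul = [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
          for row in A]

print(hadamard)  # [[5, 12], [21, 32]]
print(matmul)    # [[19, 22], [43, 50]]
```

In the torch cells above, the Hadamard product corresponds to `A * B` and the matrix product to `A @ B`.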

### 2.3.11. Norms

- The norm of a vector tells us how big the vector is
- The $\ell_2$ norm measures the (Euclidean) length of a vector
- Size in this context concerns the magnitude of a vector’s components, not its dimensionality

A norm is a function $\| \cdot \|$ that maps a vector
to a scalar and satisfies the following three properties:

1. Given any vector $\mathbf{x}$, if we scale (all elements of) the vector 
   by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly:
   $$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.$$
2. For any vectors $\mathbf{x}$ and $\mathbf{y}$,
   norms satisfy the triangle inequality:
   $$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.$$
3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:
   $$\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0.$$

Many functions are valid norms and different norms 
encode different notions of size. 
The Euclidean norm that we all learned in elementary school geometry
when calculating the hypotenuse of a right triangle
is the square root of the sum of squares of a vector's elements.
Formally, this is called the $\ell_2$ *norm* and expressed as

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$

(Source incl. mathematical notation: [D2L.ai: Interactive Deep Learning Book with Multi-Framework Code, Math, and Discussions](https://github.com/d2l-ai/d2l-en/blob/master/chapter_preliminaries/linear-algebra.md))
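
The three properties listed above can be spot-checked numerically for the $\ell_2$ norm. This is a sketch in plain Python with made-up vectors; the helper `l2_norm` is introduced only for this illustration, and a check on a few values is of course not a proof.

```python
import math


def l2_norm(v):
    """Square root of the sum of squares of the elements of v."""
    return math.sqrt(sum(a * a for a in v))


x = [3.0, -4.0]
y = [1.0, 2.0]
alpha = -2.0

scaled = [alpha * a for a in x]          # alpha * x
summed = [a + b for a, b in zip(x, y)]   # x + y

# 1. scaling the vector scales the norm by |alpha|
scaling_holds = math.isclose(l2_norm(scaled), abs(alpha) * l2_norm(x))

# 2. triangle inequality
triangle_holds = l2_norm(summed) <= l2_norm(x) + l2_norm(y)

# 3. the norm is positive for a nonzero vector and zero for the zero vector
positivity_holds = l2_norm(x) > 0 and l2_norm([0.0, 0.0]) == 0.0

print(scaling_holds, triangle_holds, positivity_holds)  # True True True
```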

In PyTorch the `norm` function calculates the $\ell_2$ norm.

In [137]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)
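
As a quick check by hand, inserting the components of the vector u from the cell above into the formula for the $\ell_2$ norm gives the same value:

$$\|\mathbf{u}\|_2 = \sqrt{3^2 + (-4)^2} = \sqrt{9 + 16} = \sqrt{25} = 5.$$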