# Basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import ticker

# Display options
from IPython.display import display
pd.options.display.max_columns = None

np.set_printoptions(threshold=30)

# Plots style
from cycler import cycler

matplotlib.rcParams['lines.linewidth'] = 3
matplotlib.rcParams['lines.markersize'] = 10

matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['xtick.color'] = '#A9A9A9'
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['ytick.color'] = '#A9A9A9'

matplotlib.rcParams['grid.color'] = '#ffffff'

matplotlib.rcParams['axes.facecolor'] = '#ffffff'

matplotlib.rcParams['axes.spines.left'] = False
matplotlib.rcParams['axes.spines.right'] = False
matplotlib.rcParams['axes.spines.top'] = False
matplotlib.rcParams['axes.spines.bottom'] = False

matplotlib.rcParams['axes.prop_cycle'] = cycler(color=['#2EBCE7', '#84EE29', '#FF8177'])

$$
\def\var{{\text{Var}}} % Variance
\def\corr{{\text{Corr}}} % Correlation
\def\cov{{\text{Cov}}} % Covariance
\def\expval{{\mathbb{E}}} % Expected value
\newcommand\norm[1]{\lVert#1\rVert} % norm
\def\setR{{\rm I\!R}} % Sets
\def\rx{{\textrm{X}}} % Scalar random variables
\def\ry{{\textrm{Y}}}
\def\rz{{\textrm{Z}}}
\def\rvx{{\textbf{X}}} % Vector random variables
\def\rvy{{\textbf{Y}}}
\def\rvz{{\textbf{Z}}}
\def\vtheta{{\boldsymbol{\theta}}} % Vectors
\def\va{{\boldsymbol{a}}}
\def\vb{{\boldsymbol{b}}}
\def\vi{{\boldsymbol{i}}}
\def\vj{{\boldsymbol{j}}}
\def\vp{{\boldsymbol{p}}}
\def\vq{{\boldsymbol{q}}}
\def\vu{{\boldsymbol{u}}}
\def\vv{{\boldsymbol{v}}}
\def\vw{{\boldsymbol{w}}}
\def\vx{{\boldsymbol{x}}}
\def\vy{{\boldsymbol{y}}}
\def\vz{{\boldsymbol{z}}}
\def\evu{{u}} % Elements of vectors
\def\evv{{v}}
\def\evw{{w}}
\def\evx{{x}}
\def\evy{{y}}
\def\evz{{z}}
\def\mA{{\boldsymbol{A}}} % Matrices
\def\mB{{\boldsymbol{B}}}
\def\mC{{\boldsymbol{C}}}
\def\mD{{\boldsymbol{D}}}
\def\mI{{\boldsymbol{I}}}
\def\mQ{{\boldsymbol{Q}}}
\def\mS{{\boldsymbol{S}}}
\def\mT{{\boldsymbol{T}}}
\def\mU{{\boldsymbol{U}}}
\def\mV{{\boldsymbol{V}}}
\def\mW{{\boldsymbol{W}}}
\def\mX{{\boldsymbol{X}}}
\def\mLambda{{\boldsymbol{\Lambda}}}
\def\mSigma{{\boldsymbol{\Sigma}}}
\def\emA{{A}} % Elements of matrices
\def\emB{{B}}
\def\emX{{X}}
\def\tT{{T}} % Transformations
$$



Appendix B. Mathematical Notation
=================================

Understanding what each symbol denotes is an important step in improving
your math skills, and it is necessary to make the most of this book. For
this reason, this appendix lists all the mathematical notation used in
this book.

Note that mathematical notation can differ slightly from one book to
another. A large part of the notation conventions I use here comes from
the Deep Learning book (Goodfellow, Ian, Yoshua Bengio, and Aaron
Courville. Deep Learning. MIT Press, 2016). You can find the detailed
notation page corresponding to that book here:
https://github.com/goodfeli/dlbook_notation

Greek Letters
-------------

$\alpha$: Alpha. Regularization parameter.

$\Delta$: Capital delta. Distance between two points, used for instance
to calculate a derivative.

$\epsilon$: Epsilon.

$\lambda$: Lambda. Expected number of events (the rate parameter) in a
Poisson distribution. It can also refer to eigenvalues.

$\mu$: Mu. The mean of a distribution.

$\nabla$: Nabla. Applied to a function, it refers to its gradient.

$\prod$: Capital Pi. The product notation, or Pi notation, corresponds
to a repeated product of the expression following it. It is similar to
the Sigma notation, but for a product instead of a sum.

$\sigma$: Sigma. Standard deviation. It can also refer to the singular
values.

$\Sigma$: Capital sigma. It is used to write sums (Sigma notation): it
refers to a repeated sum of the expression following it. It can also
refer to the matrix containing all the singular values.

$\theta$: Theta. Model parameters.
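
As a minimal sketch of the Sigma and Pi notations described above, assuming a small hypothetical vector `x` with four components, the sum $\sum_{i=1}^{4} x_i$ and the product $\prod_{i=1}^{4} x_i$ can be computed with NumPy as follows:

```python
import numpy as np

# Hypothetical values x_1, ..., x_4, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])

# Sigma notation: repeated sum of the components of x
total = np.sum(x)

# Pi notation: repeated product of the components of x
product = np.prod(x)

print(total, product)
```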

Calculus
--------

$\lim_{a \to 0}$: Limit when $a$ approaches zero.

$\frac{d f(x)}{dx}$: Derivative of $f(x)$ with respect to $x$.

$f'(x)$: Derivative of $f(x)$.

$\partial$: Partial derivative.

$\int f(x) \: dx$: Integral of $f(x)$ with respect to $x$.
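
To connect the limit and derivative notations, here is a minimal sketch of a finite-difference approximation. The function `f` and the helper `finite_difference` are hypothetical names chosen for illustration; the small step `delta` plays the role of the distance $\Delta$ that shrinks toward zero in the limit.

```python
def f(x):
    # Example function whose exact derivative is f'(x) = 2x
    return x ** 2


def finite_difference(func, x, delta=1e-6):
    # Slope between the points x and x + delta; it approaches
    # the derivative of func at x as delta approaches zero
    return (func(x + delta) - func(x)) / delta


approx = finite_difference(f, 3.0)
print(approx)  # close to the exact derivative f'(3) = 6
```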

Dataset
-------

$x^{(i)}$: The $i$th observation in a dataset.

$\hat{y}$: In the context of cost functions, “y hat” refers to the value
of $y$ estimated by the model.

$L$: A loss function.

$J(\theta)$: A cost function.
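
The sketch below illustrates one common way these symbols fit together, under two assumptions that are not fixed by the notation itself: the loss $L$ is the squared error of a single observation, and the cost $J(\theta)$ is the mean of the losses over the dataset. The arrays `y` and `y_hat` are hypothetical.

```python
import numpy as np

# Hypothetical true values and model estimates ("y hat")
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# Loss: one value per observation (squared error is an assumed choice)
losses = (y - y_hat) ** 2

# Cost: the losses aggregated over the dataset (here, their mean)
cost = losses.mean()

print(losses, cost)
```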

Probability
-----------

$\bar{x}$: x bar. Arithmetic mean of the variable $x$.

$\var(\vx)$: Variance of the vector $\vx$.

$\cov(\vx, \vy)$: Covariance of the vectors $\vx$ and $\vy$.

$\corr(\vx, \vy)$: Correlation between the vectors $\vx$ and $\vy$.

$\expval_{\rx\sim P}[\rx]$: Expected value of the random variable
$\rx$ that has a distribution $P$.

$S$: Sample space.
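
As a rough illustration of the mean, variance, covariance, and correlation notations above, the following sketch uses two hypothetical vectors `x` and `y`. Note that NumPy's defaults differ between functions: `var` divides by $n$, while `np.cov` divides by $n - 1$.

```python
import numpy as np

# Hypothetical data vectors, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

x_bar = x.mean()                   # arithmetic mean of x
var_x = x.var()                    # variance of x (divides by n; use ddof=1 for n - 1)
cov_xy = np.cov(x, y)[0, 1]        # covariance of x and y (divides by n - 1 by default)
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation between x and y

print(x_bar, var_x, cov_xy, corr_xy)
```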

------------------------------------------------------------------------

$\rx \sim \mathcal{P}$: The random variable $\rx$ has a distribution
$\mathcal{P}$.

$\mathcal{N}$: Normal distribution.

$\mathcal{N}(\rx=x ; \mu,\,\sigma^{2})$: Normal distribution as a
function of $x$, parameterized by the distribution parameters $\mu$
and $\sigma^2$. Some resources use a pipe symbol ($|$) instead of the
semicolon, as for conditional probabilities; see for instance Bishop,
Christopher M. Pattern Recognition and Machine Learning. Springer,
2006; Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong.
Mathematics for Machine Learning. Cambridge University Press, 2020; or
Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT
Press, 2012.

$\text{Bern}$: Bernoulli distribution.

$\text{Bin}$: Binomial distribution.

$\text{Poi}$: Poisson distribution.

$\text{Exp}$: Exponential distribution.
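
To make the notation $\mathcal{N}(\rx=x ; \mu,\,\sigma^{2})$ and $\rx \sim \mathcal{N}$ more concrete, here is a minimal sketch. The helper `normal_pdf` is a hypothetical name that implements the standard normal density formula, and the parameter values are arbitrary.

```python
import numpy as np


def normal_pdf(x, mu, sigma2):
    # Density of the normal distribution with mean mu and variance sigma2,
    # evaluated at x
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)


# Density of the standard normal distribution at its mean
density_at_mean = normal_pdf(0.0, mu=0.0, sigma2=1.0)

# Drawing samples: X ~ N(0, 1). Note that `scale` is the standard
# deviation sigma, not the variance sigma^2.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(density_at_mean)
```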

------------------------------------------------------------------------

$\rx$: Random variable corresponding to a random experiment.

$x$: An outcome (also called state, or realization) of the random
variable $\rx$ (more precisely, it is the outcome of the random
experiment corresponding to the random variable).

$A$: An event (a set of outcomes).

------------------------------------------------------------------------

In the following definitions, the uppercase $P$ is used for discrete
random variables and the lowercase $p$ for continuous random variables.

$P(\rx)$: Probability Mass Function of the discrete random variable
$\rx$. This function takes a possible outcome of $\rx$ and returns a
probability.

$P(\rx=x)$ (or simply $P(x)$): Probability that the discrete random
variable $\rx$ takes the value $x$. This is a number between 0 and 1.

$P(A)$: Probability that the event $A$ occurs. For instance, if the
event corresponds to the outcome $x$, you have $P(A)=P(\rx=x)$.

$P(\rx=x, \ry=y)$ (or simply $P(x, y)$): Joint probability. Probability
that the discrete random variables $\rx$ and $\ry$ respectively take the
values $x$ and $y$.

$P(\rx, \ry)$: Joint probability distribution function of the discrete
random variables $\rx$ and $\ry$. This is a function that takes a
possible outcome of $\rx$ and $\ry$ and returns a probability.

$P(\rx=x | \ry=y)$ (or simply $P(x | y)$): Conditional probability.
Probability that the discrete random variable $\rx$ takes the value $x$
given that the discrete random variable $\ry$ takes the value $y$.
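
The joint and conditional probability notations can be sketched with a small table. The array `joint` below is a hypothetical joint probability mass function of two binary random variables, with rows indexing the outcomes of $\rx$ and columns the outcomes of $\ry$.

```python
import numpy as np

# Hypothetical joint PMF P(X, Y): rows are outcomes of X, columns outcomes of Y
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

# Joint probability P(X=0, Y=1)
p_x0_y1 = joint[0, 1]

# Marginal probability P(Y=1): sum over all outcomes of X
p_y1 = joint[:, 1].sum()

# Conditional probability P(X=0 | Y=1) = P(X=0, Y=1) / P(Y=1)
p_x0_given_y1 = p_x0_y1 / p_y1

print(p_x0_y1, p_y1, p_x0_given_y1)
```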

------------------------------------------------------------------------

$\mathcal{L}_x(\theta)$: Likelihood of observing some data $x$ drawn
from a distribution with parameter $\theta$.
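
As an illustration of $\mathcal{L}_x(\theta)$, the sketch below assumes that the data `x` are independent draws from a Bernoulli distribution with parameter $\theta$. Both the data and the helper name `likelihood` are hypothetical. The likelihood is evaluated on a grid of candidate values of $\theta$.

```python
import numpy as np

# Hypothetical observations (1 = success, 0 = failure)
x = np.array([1, 0, 1, 1])


def likelihood(theta, data):
    # Product over observations of theta^x * (1 - theta)^(1 - x),
    # assuming independent Bernoulli draws
    return np.prod(theta ** data * (1 - theta) ** (1 - data))


# Evaluate the likelihood on a grid of candidate parameter values
thetas = np.linspace(0.01, 0.99, 99)
values = np.array([likelihood(t, x) for t in thetas])

# Grid value of theta with the highest likelihood
best_theta = thetas[np.argmax(values)]
print(best_theta)
```

With three successes out of four observations, the grid maximum lands at the proportion of successes.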

------------------------------------------------------------------------

$\binom{N}{m}$: The binomial coefficient, read “$N$ choose $m$”.

$N!$: Factorial of $N$.
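
Both notations have direct counterparts in Python's standard library (`math.comb` requires Python 3.8 or later). The values of `N` and `m` below are arbitrary; the second computation uses the identity $\binom{N}{m} = \frac{N!}{m!\,(N-m)!}$.

```python
import math

N, m = 5, 2

# Binomial coefficient: N choose m
n_choose_m = math.comb(N, m)

# Same value written with factorials: N! / (m! * (N - m)!)
via_factorials = math.factorial(N) // (math.factorial(m) * math.factorial(N - m))

print(n_choose_m, via_factorials)
```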

------------------------------------------------------------------------

Information Theory
------------------

$I(x)$: Shannon information of the event $\rx=x$. The input is a single
outcome.

$H(P)$: Shannon entropy of the probability distribution $P$.

$H(P, Q)$: Cross-entropy between the probability distributions $P$ and
$Q$.

$D_{KL}$: Kullback-Leibler divergence (or KL divergence).
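
The four quantities above are closely related, as the following sketch suggests. It assumes two hypothetical discrete distributions `P` and `Q` over the same three outcomes and logarithms in base 2 (so results are in bits); the choice of base is an assumption, not something the notation fixes.

```python
import numpy as np

# Hypothetical probability distributions over the same three outcomes
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.25, 0.25, 0.5])

# Shannon information I(x) of each outcome under P, in bits
info = -np.log2(P)

# Shannon entropy H(P): expected information under P
entropy = np.sum(P * info)

# Cross-entropy H(P, Q)
cross_entropy = -np.sum(P * np.log2(Q))

# KL divergence of Q from P: H(P, Q) - H(P)
kl = cross_entropy - entropy

print(entropy, cross_entropy, kl)
```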

Sets
----

$\setR$: The set of real numbers.

Linear Algebra
--------------

$\vx$: A vector.

$\evx_i$: Component at index $i$ of the vector $\vx$ (first index is 1).

$\vi$ and $\vj$: The basis vectors corresponding to the $x$ and $y$ axes
in the Cartesian plane.

$\mA$: A matrix.

$\emA_{i, j}$: Component at row $i$ and column $j$ of the matrix $\mA$.

$\mA_{i, :}$: Row $i$ of the matrix $\mA$.

$\mA_{:, i}$: Column $i$ of the matrix $\mA$.

$\tT$: A linear transformation.

$\mA^{-1}$: The inverse of the matrix $\mA$.

$\mA^+$: The Moore-Penrose inverse (or pseudo-inverse) of the matrix
$\mA$.
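
As a minimal sketch of how this notation maps to NumPy, the example below uses two hypothetical matrices `A` and `B`. Keep in mind that the mathematical notation starts indexing at 1, while NumPy starts at 0, so $\emA_{1, 2}$ corresponds to `A[0, 1]`.

```python
import numpy as np

# Hypothetical invertible matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

a_12 = A[0, 1]   # component at row 1, column 2 (1-based notation)
row_1 = A[0, :]  # row 1 of A
col_2 = A[:, 1]  # column 2 of A

# Inverse of A
A_inv = np.linalg.inv(A)

# Hypothetical non-square matrix: it has no inverse,
# but it has a Moore-Penrose pseudo-inverse
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
B_pinv = np.linalg.pinv(B)

print(A @ A_inv)     # close to the identity matrix
print(B_pinv.shape)  # (2, 3)
```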