# Hello World

* Classifying the MNIST handwritten digits is the “Hello World” of deep learning.

***

# Basic Terms

* **Class**: A _category_ in a classification problem.

* **Sample**: A single data point.

* **Label**: The class associated with a specific _sample_ is called a _label_.

***

# Tensors

* **Scalars** are _rank-0_, **Vectors** are _rank-1_ and **Matrices** are _rank-2_ tensors.

* In deep learning, you mostly work with tensors of rank 0 to 4, although the rank may go up to 5 when video data is being processed.

* A tensor is defined with three key attributes: _rank_, _shape_ and _data type_.

  * **Rank**: The number of axes of the tensor. This is also called the tensor's **ndim** in Python libraries such as NumPy or TensorFlow.

  * **Shape**: This is a tuple of integers that describes how many dimensions the tensor has along each axis.

  * **Data Type**: This is the type of the data contained in the tensor; for instance, a tensor's type could be _float16_, _float32_, _float64_, _uint8_, and so on. TensorFlow also has _string_ tensors. The data type is usually called _dtype_ in Python libraries.
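
As a minimal sketch, the three attributes can be inspected on a NumPy array. The values in the array are arbitrary and chosen only for illustration.

```python
import numpy as np

# A rank-2 tensor (matrix) with 3 rows and 5 columns; the values are arbitrary.
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]], dtype=np.float32)

print(x.ndim)   # rank (number of axes)
print(x.shape)  # size along each axis
print(x.dtype)  # type of the contained data
```

Here `x.ndim` is 2, `x.shape` is `(3, 5)`, and `x.dtype` is `float32`.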

***

# Data Batches

* By convention, the first axis of every data tensor in deep learning is the _samples axis_ (sometimes called the _samples dimension_).

* In addition, deep learning models don’t process an entire dataset at once; rather, they break the data into small batches.
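
A small sketch of batching by slicing along the samples axis. The dataset shape, the batch size, and the helper name `get_batch` are assumptions made up for this example.

```python
import numpy as np

# Hypothetical dataset: 1,000 samples with 20 features each.
data = np.zeros((1000, 20), dtype=np.float32)
batch_size = 128

def get_batch(data, n, batch_size):
    """Return the n-th batch by slicing along the first (samples) axis."""
    return data[batch_size * n : batch_size * (n + 1)]

first = get_batch(data, 0, batch_size)
last = get_batch(data, 7, batch_size)  # the final batch may be smaller

print(first.shape)
print(last.shape)
```

The last batch holds only the 104 samples that remain after seven full batches of 128.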

***

# Data Tensors

* **Vector Data**: (samples, features)
* **Timeseries or Sequence**: (samples, timesteps, features)
* **Images**: (samples, height, width, channels)
* **Video**: (samples, frames, height, width, channels)
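
To make the layouts above concrete, here is a sketch that allocates one empty tensor per data kind. All sizes (100 samples, 28x28 images, and so on) are placeholders, not values from any particular dataset.

```python
import numpy as np

# Placeholder sizes chosen only to show the axis order of each layout.
vector_data = np.zeros((100, 8))           # (samples, features)
timeseries = np.zeros((100, 50, 8))        # (samples, timesteps, features)
images = np.zeros((100, 28, 28, 1))        # (samples, height, width, channels)
video = np.zeros((4, 16, 28, 28, 3))       # (samples, frames, height, width, channels)

for name, t in [("vector", vector_data), ("timeseries", timeseries),
                ("images", images), ("video", video)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```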

***

# Tensor Operations

* Dense layers are _fully connected_ layers: every input value is connected to every output unit.
* A relu operation: `relu(x)` is `max(x, 0)`; “**relu**” stands for “**rectified linear unit**”.

  ## Element-Wise Operations

  * Element-wise operations are applied independently to each entry in the tensors being considered.
  * This means these operations are highly amenable to massively parallel implementations.
  * In practice, when dealing with NumPy arrays, these operations are available as well-optimized built-in NumPy functions that run in compiled code. For linear algebra routines such as dot products, NumPy delegates the heavy lifting to a Basic Linear Algebra Subprograms (BLAS) implementation. BLAS are low-level, highly efficient tensor-manipulation routines that are typically implemented in Fortran or C.
  * Likewise, when running TensorFlow code on a GPU, element-wise operations are executed via fully vectorized CUDA implementations that can best utilize the highly parallel GPU chip architecture.
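
As an illustration, relu is an element-wise operation. The sketch below compares a naive Python loop (the helper name `naive_relu` is made up for this example) with the built-in `np.maximum`, which does the same work in compiled code.

```python
import numpy as np

def naive_relu(x):
    """Apply max(value, 0) to each entry of a rank-2 tensor, one entry at a time."""
    assert x.ndim == 2
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

x = np.array([[-1.0, 2.0],
              [3.0, -4.0]])

z_naive = naive_relu(x)
z_fast = np.maximum(x, 0.0)  # vectorized element-wise relu

print(z_naive)
print(z_fast)
```

Both versions give the same result; the vectorized call is the one to use on real data.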

  ## Broadcasting

  * What happens with addition when the shapes of the two tensors being added differ? When possible, and if there’s no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor. Broadcasting consists of two steps:

    1. Axes are added to the smaller tensor to match the ndim of the larger tensor.

    2. The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.
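
The two steps can be sketched explicitly in NumPy and compared with the implicit broadcasting that `+` performs. The shapes and values are arbitrary. Note that NumPy does not actually materialize the repeated tensor when broadcasting; the explicit version is only a mental model.

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)    # shape (2, 3)
y = np.array([10.0, 20.0, 30.0], dtype=np.float32)  # shape (3,)

# Step 1: add an axis so that y has the same ndim as x.
y_expanded = np.expand_dims(y, axis=0)               # shape (1, 3)

# Step 2: repeat y along the new axis to match the full shape of x.
y_repeated = np.repeat(y_expanded, x.shape[0], axis=0)  # shape (2, 3)

explicit = x + y_repeated
implicit = x + y  # NumPy broadcasts y automatically

print(explicit)
print(implicit)
```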

  ## Tensor Product

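  * It is also called the **dot product**. It should not be confused with the element-wise product (the `*` operator in NumPy).

  ```python
  import numpy as np

  x = np.random.random((32,))
  y = np.random.random((32,))
  z = np.dot(x, y)  # scalar: sum of the element-wise products of x and y
  ```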
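
To show what the dot product computes, here is a naive sketch for two vectors, followed by a matrix-vector case to show the resulting shape. The helper name `naive_vector_dot` and the sample values are made up for this example.

```python
import numpy as np

def naive_vector_dot(x, y):
    """Sum of the element-wise products of two vectors of equal length."""
    assert x.ndim == 1 and y.ndim == 1
    assert x.shape[0] == y.shape[0]
    z = 0.0
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

z_naive = naive_vector_dot(x, y)
z_numpy = np.dot(x, y)
print(z_naive, z_numpy)

# Matrix-vector dot product: (2, 3) . (3,) gives a vector of shape (2,).
m = np.ones((2, 3))
mv = np.dot(m, x)
print(mv.shape)
```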

  ## Tensor Reshaping

  * Reshaping a tensor means rearranging its rows and columns to match a target shape.
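
A short sketch of reshaping with NumPy. The total number of entries stays the same; only the arrangement into axes changes. The values are arbitrary.

```python
import numpy as np

x = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])   # shape (3, 2), six entries

a = x.reshape((6, 1))        # one column of six entries
b = x.reshape((2, 3))        # two rows of three entries

print(a.shape)
print(b)
```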

  ***

  # Gradient-Based Optimization

  ```python
  output = relu(dot(input, W) + b)
  ```

  * In this expression, W and b are tensors that are attributes of the layer. They’re called the **weights** or **trainable parameters** of the layer.

  * These weights contain the information learned by the model from exposure to training data.

  * Initially, these weight matrices are filled with small random values.

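  * What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called **training**, is the learning that machine learning is all about. This happens within what’s called a **training loop**, which repeats the following steps: draw a batch of training samples and their targets, run the model on the batch to get predictions (the forward pass), compute the loss between the predictions and the targets, and update the weights in a way that slightly reduces the loss on this batch.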

  * Given a differentiable function, it’s theoretically possible to find its minimum analytically: a function’s minimum is a point where the derivative is 0, so all you have to do is find all the points where the derivative goes to 0 and check for which of these points the function has the lowest value. Applied to a neural network, that means finding analytically the combination of weight values that yields the smallest possible loss value. In practice this is intractable for networks with thousands or millions of parameters, so the weights are instead moved a little at a time in the direction opposite to the gradient of the loss.

  * Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called **backpropagation**.
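
The ideas in this section can be put together in a minimal NumPy sketch: one dense layer computing `relu(dot(input, W) + b)`, trained with mini-batch gradient descent, with the gradients derived by hand through the chain rule. The toy data, the mean squared error loss, the learning rate, the batch size, and all helper names are assumptions chosen for illustration; a real framework computes the gradients automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 64 samples, 3 features, 1 target per sample.
x = rng.normal(size=(64, 3))
true_W = np.array([[1.0], [-2.0], [0.5]])
y_true = np.maximum(np.dot(x, true_W) + 0.5, 0.0)

# Trainable parameters: W starts as small random values, b as zeros.
W = rng.normal(scale=0.1, size=(3, 1))
b = np.zeros((1,))
learning_rate = 0.1
batch_size = 16

def forward(inputs, W, b):
    """Dense layer: returns the pre-activation z and relu(z)."""
    z = np.dot(inputs, W) + b
    return z, np.maximum(z, 0.0)

def mse(y_pred, y):
    return np.mean((y_pred - y) ** 2)

initial_loss = mse(forward(x, W, b)[1], y_true)

for step in range(300):
    # 1. Draw a batch of samples and targets.
    idx = rng.choice(x.shape[0], size=batch_size, replace=False)
    x_batch, y_batch = x[idx], y_true[idx]

    # 2. Forward pass, and 3. loss on the batch.
    z, y_pred = forward(x_batch, W, b)
    loss = mse(y_pred, y_batch)

    # 4. Backward pass: chain rule from the loss back to W and b.
    grad_y = 2.0 * (y_pred - y_batch) / y_pred.size  # d loss / d y_pred
    grad_z = grad_y * (z > 0)                        # relu passes gradient where z > 0
    grad_W = np.dot(x_batch.T, grad_z)               # d loss / d W
    grad_b = grad_z.sum(axis=0)                      # d loss / d b

    # 5. Move the weights a small step against the gradient.
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

final_loss = mse(forward(x, W, b)[1], y_true)
print("loss before:", initial_loss, "loss after:", final_loss)
```

The loss on the full toy dataset should be lower after training than before, which is the feedback signal at work.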