# 2024 LAB 7a Introduction to TensorFlow and Keras (Chollet)

# Data representations for neural networks 

In general, all current machine learning systems use tensors
as their basic data structure. Tensors are fundamental to the field—so fundamental
that TensorFlow was named after them. 

**So what’s a tensor?**

At its core, a tensor is a container for data, usually numerical data. So, it’s a container for numbers. You may already be familiar with matrices, which are rank-2 tensors: tensors are a generalization of matrices to an arbitrary number of dimensions
(note that in the context of tensors, a dimension is often called an axis).

**Scalars (rank-0 tensors)**

A tensor that contains only one number is called a scalar (or scalar tensor, or rank-0
tensor, or 0D tensor). In NumPy, a float32 or float64 number is a scalar tensor (or
scalar array). You can display the number of axes of a NumPy tensor via the ndim attribute; a scalar tensor has 0 axes (ndim == 0). The number of axes of a tensor is also
called its rank. 

Here’s a NumPy scalar:

In [None]:
import numpy as np
x = np.array(12)
x

In [None]:
x.ndim 

**Vectors (rank-1 tensors)**

An array of numbers is called a vector, or rank-1 tensor, or 1D tensor. A rank-1 tensor is
said to have exactly one axis. Following is a NumPy vector:

In [None]:
x = np.array([12, 8, 6, 14, 7])
x

In [None]:
x.ndim

This vector has five entries and so is called a 5-dimensional vector. Don’t confuse a 5D
vector with a 5D tensor! A 5D vector has only one axis and has five dimensions along
its axis, whereas a 5D tensor has five axes (and may have any number of dimensions
along each axis). Dimensionality can denote either the number of entries along a specific
axis (as in the case of our 5D vector) or the number of axes in a tensor (such as a
5D tensor), which can be confusing at times. In the latter case, it’s technically more
correct to talk about a tensor of rank 5 (the rank of a tensor being the number of axes),
but the ambiguous notation 5D tensor is common regardless. 
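
The distinction between a 5D vector and a rank-5 tensor can be checked directly in NumPy. The sketch below is only an illustration; the axis sizes of the rank-5 array are arbitrary values chosen for the example.

```python
import numpy as np

# A "5D vector": a single axis holding five entries.
vector_5d = np.array([12, 8, 6, 14, 7])

# A "5D tensor" (rank 5): five axes. The sizes along each axis are arbitrary here.
tensor_rank5 = np.zeros((2, 3, 4, 5, 6))

print(vector_5d.ndim, vector_5d.shape)        # 1 (5,)
print(tensor_rank5.ndim, tensor_rank5.shape)  # 5 (2, 3, 4, 5, 6)
```

In both cases `ndim` reports the rank (number of axes), while `shape` reports the number of entries along each axis.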

**Matrices (rank-2 tensors)**

An array of vectors is a matrix, or rank-2 tensor, or 2D tensor. A matrix has two axes
(often referred to as rows and columns). You can visually interpret a matrix as a rectangular
grid of numbers. This is a NumPy matrix:

In [None]:
x = np.array([[5, 78, 2, 34, 0],
                  [6, 79, 3, 35, 1],
                  [7, 80, 4, 36, 2]])
x

In [None]:
x.ndim

The entries from the first axis are called the rows, and the entries from the second axis
are called the columns. In the previous example, [5, 78, 2, 34, 0] is the first row of x,
and [5, 6, 7] is the first column. 

**Rank-3 and higher-rank tensors**

If you pack such matrices in a new array, you obtain a rank-3 tensor (or 3D tensor),
which you can visually interpret as a cube of numbers. Following is a NumPy rank-3
tensor:

In [None]:
x = np.array([[[5, 78, 2, 34, 0],
                   [6, 79, 3, 35, 1],
                   [7, 80, 4, 36, 2]],
                  [[5, 78, 2, 34, 0],
                   [6, 79, 3, 35, 1],
                   [7, 80, 4, 36, 2]],
                  [[5, 78, 2, 34, 0],
                   [6, 79, 3, 35, 1],
                   [7, 80, 4, 36, 2]]])
x

In [None]:
x.ndim

# Key attributes

A tensor is defined by three key attributes:

  -  **Number of axes (rank)** — For instance, a rank-3 tensor has three axes, and a matrix
has two axes. This is also called the tensor’s ndim in Python libraries such as
NumPy or TensorFlow.
  -  **Shape** — This is a tuple of integers that describes how many dimensions the tensor has along each axis. For instance, the previous matrix example has shape
(3, 5), and the rank-3 tensor example has shape (3, 3, 5). A vector has a shape
with a single element, such as (5,), whereas a scalar has an empty shape, ().
  - **Data type** (usually called dtype in Python libraries) — This is the type of the data
contained in the tensor; for instance, a tensor’s type could be float16, float32,
float64, uint8, and so on. In TensorFlow, you are also likely to come across
string tensors.
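
As a quick illustration of the three attributes, the sketch below reads them from a few NumPy arrays. The helper name `describe` is hypothetical and exists only for this example.

```python
import numpy as np

def describe(tensor):
    """Return (rank, shape, dtype) of a NumPy array."""
    return tensor.ndim, tensor.shape, tensor.dtype

scalar = np.array(12.0)                                # rank 0
vector = np.array([12, 8, 6, 14, 7], dtype=np.uint8)   # rank 1
matrix = np.zeros((3, 5), dtype=np.float32)            # rank 2

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    ndim, shape, dtype = describe(t)
    print(f"{name}: ndim={ndim}, shape={shape}, dtype={dtype}")
```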


To make this more concrete, let’s look back at the data we processed in the MNIST
example. First, we load the MNIST dataset:

In [None]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Next, we display the number of axes of the tensor train_images (its ndim attribute), its shape, and its data type:

In [None]:
train_images.ndim 

In [None]:
train_images.shape

In [None]:
train_images.dtype

So what we have here is a rank-3 tensor of 8-bit integers. More precisely, it’s an array of
60,000 matrices of 28 × 28 integers. Each such matrix is a grayscale image, with coefficients between 0 and 255.

Let’s display the digit at index 420 in this rank-3 tensor, using the Matplotlib library (a
well-known Python data visualization library, which comes preinstalled in Colab):

In [None]:
import matplotlib.pyplot as plt
digit = train_images[420]
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()

In [None]:
# The corresponding label is the integer class (0 to 9) of the digit shown above:
train_labels[420] 

# Manipulating tensors in NumPy

In the previous example, we selected a specific digit along the first axis using the
syntax train_images[i]. Selecting specific elements in a tensor is called tensor slicing.
Let’s look at the tensor-slicing operations you can do on NumPy arrays.

The following example selects digits #10 to #100 (#100 isn’t included) and puts
them in an array of shape (90, 28, 28):

In [None]:
my_slice = train_images[10:100]
my_slice.shape

In general, you may select slices between any two indices along each tensor axis. For
instance, in order to select 14 × 14 pixels in the bottom-right corner of all images, you
would do this:

In [None]:
my_slice = train_images[:, 14:28, 14:28] 
my_slice.shape
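
To experiment with slicing without downloading MNIST, the following sketch uses a small random stand-in array; the name `images` and its size of 200 samples are assumptions made for the example. It shows that a bare `:` selects a whole axis, and that negative indices count from the end of an axis.

```python
import numpy as np

# Stand-in for train_images: 200 random 28x28 "images" (not real MNIST data).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 28, 28), dtype=np.uint8)

# Three equivalent ways to select samples 10..99.
a = images[10:100]
b = images[10:100, :, :]        # ":" selects the entire axis
c = images[10:100, 0:28, 0:28]  # explicit bounds on every axis

# Negative indices are relative to the end of the axis:
# this crops a 14x14 patch centered in every image.
center = images[:, 7:-7, 7:-7]

print(a.shape, center.shape)  # (90, 28, 28) (200, 14, 14)
```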

**The notion of data batches**

In general, the first axis (axis 0, because indexing starts at 0) in all data tensors you’ll
come across in deep learning will be the samples axis (sometimes called the samples
dimension). In the MNIST example, “samples” are images of digits.

In addition, deep learning models don’t process an entire dataset at once; rather,
they break the data into small batches. Concretely, here’s one batch of our MNIST
digits, with a batch size of 128:

In [None]:
batch = train_images[:128]
batch.shape

And here’s the next batch:

In [None]:
batch = train_images[128:256]
batch.shape

And the nth batch:

In [None]:
n = 3 
batch = train_images[128 * n:128 * (n + 1)]
batch.shape

When considering such a batch tensor, the first axis (axis 0) is called the batch axis or
batch dimension. This is a term you’ll frequently encounter when using Keras and other
deep learning libraries.
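
The batch slicing pattern above can be wrapped in a small generator. This is a minimal sketch, not how Keras batches data internally; the function name `iterate_batches` and the stand-in array are assumptions for illustration.

```python
import numpy as np

def iterate_batches(data, batch_size=128):
    """Yield consecutive batches along axis 0; the last batch may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Stand-in dataset of 300 blank 28x28 images.
data = np.zeros((300, 28, 28), dtype=np.uint8)

shapes = [batch.shape for batch in iterate_batches(data)]
print(shapes)  # [(128, 28, 28), (128, 28, 28), (44, 28, 28)]
```

Note that when the dataset size is not a multiple of the batch size, the final batch is shorter (44 samples here).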

# Real-world examples of data tensors

Let’s make data tensors more concrete with a few examples similar to what you’ll
encounter later. The data you’ll manipulate will almost always fall into one of the following categories:
  - **Vector data** — Rank-2 tensors of shape (samples, features), where each sample
is a vector of numerical attributes (“features”)
  - **Timeseries data** or sequence data—Rank-3 tensors of shape (samples, timesteps,
features), where each sample is a sequence (of length timesteps) of feature
vectors
  - **Images** — Rank-4 tensors of shape (samples, height, width, channels), where
each sample is a 2D grid of pixels, and each pixel is represented by a vector of
values (“channels”)
  - **Video** — Rank-5 tensors of shape (samples, frames, height, width, channels),
where each sample is a sequence (of length frames) of images 

**1. Vector data**

This is one of the most common cases. In such a dataset, each single data point can be
encoded as a vector, and thus a batch of data will be encoded as a rank-2 tensor (that
is, an array of vectors), where the first axis is the samples axis and the second axis is the
features axis.

Let’s take a look at two examples:
  - An actuarial dataset of people, where we consider each person’s age, gender,
and income. Each person can be characterized as a vector of 3 values, and thus
an entire dataset of 100,000 people can be stored in a rank-2 tensor of shape
(100000, 3).
  - A dataset of text documents, where we represent each document by the counts
of how many times each word appears in it (out of a dictionary of 20,000 common words). Each document can be encoded as a vector of 20,000 values (one
count per word in the dictionary), and thus an entire dataset of 500 documents
can be stored in a tensor of shape (500, 20000). 
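
The word-count encoding can be sketched on a toy scale. The four-word dictionary and the two documents below are made up for the example; a real dictionary would hold thousands of words.

```python
import numpy as np

# Hypothetical 4-word dictionary (the text uses 20,000 words).
vocabulary = ["tensor", "matrix", "vector", "scalar"]
word_index = {word: i for i, word in enumerate(vocabulary)}

documents = [
    "tensor tensor matrix",
    "scalar vector vector vector",
]

# One row per document, one column per dictionary word.
counts = np.zeros((len(documents), len(vocabulary)), dtype=np.int64)
for row, doc in enumerate(documents):
    for word in doc.split():
        counts[row, word_index[word]] += 1

print(counts.shape)  # (2, 4), i.e. (samples, features)
print(counts)
```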

**2. Timeseries data or sequence data**

Whenever time matters in your data (or the notion of sequence order), it makes sense
to store it in a rank-3 tensor with an explicit time axis. Each sample can be encoded as
a sequence of vectors (a rank-2 tensor), and thus a batch of data will be encoded as a
rank-3 tensor

![chollet23.png](attachment:chollet23.png)

The time axis is always the second axis (axis of index 1) by convention. Let’s look at a
few examples:

  - A dataset of stock prices. Every minute, we store the current price of the stock,
the highest price in the past minute, and the lowest price in the past minute.
Thus, every minute is encoded as a 3D vector, an entire day of trading is
encoded as a matrix of shape (390, 3) (there are 390 minutes in a trading day),
and 250 days’ worth of data can be stored in a rank-3 tensor of shape (250,
390, 3). Here, each sample would be one day’s worth of data.
  - A dataset of tweets, where we encode each tweet as a sequence of 280 characters
out of an alphabet of 128 unique characters. In this setting, each character can
be encoded as a binary vector of size 128 (an all-zeros vector except for a 1 entry
at the index corresponding to the character). Then each tweet can be encoded
as a rank-2 tensor of shape (280, 128), and a dataset of 1 million tweets can be
stored in a tensor of shape (1000000, 280, 128). 
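
The tweet encoding can be sketched as follows, assuming the 128-character alphabet is plain ASCII (code points 0 to 127) and that shorter tweets are padded with all-zero rows. The helper name `one_hot_tweet` is hypothetical.

```python
import numpy as np

MAX_LEN = 280   # characters per tweet, as in the text
ALPHABET = 128  # assumed to be the ASCII code points 0..127

def one_hot_tweet(text):
    """Encode an ASCII string as a (280, 128) binary matrix, zero-padded."""
    encoded = np.zeros((MAX_LEN, ALPHABET), dtype=np.uint8)
    for position, char in enumerate(text[:MAX_LEN]):
        encoded[position, ord(char)] = 1  # a single 1 at the character's index
    return encoded

tweets = ["hello tensors", "rank-3 data"]
batch = np.stack([one_hot_tweet(t) for t in tweets])

print(batch.shape)  # (2, 280, 128), i.e. (samples, timesteps, features)
```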

**3. Image data**

Images typically have three dimensions: height, width, and color depth. Although
grayscale images (like our MNIST digits) have only a single color channel and could
thus be stored in rank-2 tensors, by convention image tensors are always rank-3, with a
one-dimensional color channel for grayscale images. A batch of 128 grayscale images
of size 256 × 256 could thus be stored in a tensor of shape (128, 256, 256, 1), and a
batch of 128 color images could be stored in a tensor of shape (128, 256, 256, 3).

![chollet24.png](attachment:chollet24.png)

There are two conventions for shapes of image tensors: the channels-last convention
(which is standard in TensorFlow) and the channels-first convention (which is increasingly falling out of favor).

The channels-last convention places the color-depth axis at the end: (samples,
height, width, color_depth). Meanwhile, the channels-first convention places the
color depth axis right after the batch axis: (samples, color_depth, height, width).
With the channels-first convention, the previous examples would become (128, 1,
256, 256) and (128, 3, 256, 256). The Keras API provides support for both formats. 
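
Converting between the two layouts is a matter of reordering axes. The sketch below uses `np.transpose` on a small stand-in batch (4 images rather than 128, to keep memory use low); it only illustrates the axis order and is not a statement about how Keras converts formats internally.

```python
import numpy as np

# Channels-last batch: (samples, height, width, color_depth).
channels_last = np.zeros((4, 256, 256, 3), dtype=np.uint8)
channels_last[0, 10, 20, 2] = 255  # mark one pixel in the last channel

# Move the channel axis right after the batch axis.
channels_first = np.transpose(channels_last, (0, 3, 1, 2))
print(channels_first.shape)  # (4, 3, 256, 256)

# The marked value is now addressed as [sample, channel, row, column].
print(channels_first[0, 2, 10, 20])  # 255

# Transposing back restores the original layout.
restored = np.transpose(channels_first, (0, 2, 3, 1))
```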

**4. Video data**

Video data is one of the few types of real-world data for which you’ll need rank-5 tensors.
A video can be understood as a sequence of frames, each frame being a color
image. Because each frame can be stored in a rank-3 tensor (height, width, color_depth), a sequence of frames can be stored in a rank-4 tensor (frames, height,
width, color_depth), and thus a batch of different videos can be stored in a rank-5
tensor of shape (samples, frames, height, width, color_depth).
 
 
For instance, a 60-second, 144 × 256 YouTube video clip sampled at 4 frames per
second would have 240 frames. A batch of four such video clips would be stored in a
tensor of shape (4, 240, 144, 256, 3). That’s a total of 106,168,320 values! If the dtype of the tensor were float32, each value would be stored in 32 bits (4 bytes), so the tensor
would occupy about 405 MB. Heavy! Videos you encounter in real life are much lighter,
because they aren’t stored in float32, and they’re typically compressed by a large factor
(such as in the MPEG format).
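
The arithmetic for the video batch can be reproduced in a few lines. The figure of 405 MB corresponds to counting 1 MB as 2**20 bytes; the variable names are chosen for this example only.

```python
import math
import numpy as np

shape = (4, 240, 144, 256, 3)  # (samples, frames, height, width, color_depth)
num_values = math.prod(shape)

bytes_float32 = num_values * np.dtype(np.float32).itemsize  # 4 bytes per value
bytes_uint8 = num_values * np.dtype(np.uint8).itemsize      # 1 byte per value

print(num_values)             # 106168320
print(bytes_float32 / 2**20)  # 405.0
print(bytes_uint8 / 2**20)    # 101.25
```

Storing the same batch as uint8 would already divide the footprint by four, before any compression.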

# Introduction to TensorFlow and Keras

This section gives a quick presentation of Keras (https://keras.io) and TensorFlow (https://tensorflow.org). TensorFlow is a Python-based, free, open source machine learning platform, developed primarily by Google. Keras is a deep learning API (Application Programming Interface) for Python, built on top of TensorFlow, that provides a convenient way to define and train any kind of deep learning model.

![chollet31.png](attachment:chollet31.png)

Keras is used at Google, Netflix, Uber, CERN, NASA, Yelp,
Instacart, Square, and hundreds of startups working on a wide range of problems
across every industry. Your YouTube recommendations originate from Keras models.

# First steps with TensorFlow

As we will see, training a neural network revolves around the following concepts:

  - First, low-level tensor manipulation — the infrastructure that underlies all modern machine learning. This translates to TensorFlow APIs:
    - Tensors, including special tensors that store the network’s state (variables)
    - Tensor operations such as addition, relu, matmul
    - Backpropagation, a way to compute the gradient of mathematical expressions
(handled in TensorFlow via the GradientTape object).

  - Second, high-level deep learning concepts. This translates to Keras APIs:
    - Layers, which are combined into a model
    - A loss function, which defines the feedback signal used for learning
    - An optimizer, which determines how learning proceeds
    - Metrics to evaluate model performance, such as accuracy
    - A training loop that performs mini-batch stochastic gradient descent
    
    ![chollet19.png](attachment:chollet19.png)
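
Since the GradientTape object is used later in the training step, here is a minimal sketch of what it does, assuming TensorFlow 2.x with eager execution. The tape records the operations run inside its scope and can then return the gradient of a result with respect to a variable.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Operations executed inside the "with" block are recorded on the tape.
with tf.GradientTape() as tape:
    y = x * x  # y = x^2

# dy/dx = 2x, evaluated at x = 3.
grad = tape.gradient(y, x)
print(grad.numpy())  # 6.0
```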

# Constant tensors and variables

To do anything in TensorFlow, we’re going to need some tensors. Tensors need to be
created with some initial value. For instance, you could create all-ones or all-zeros tensors, or tensors of values drawn from a random distribution. Let us see some examples.

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
x = tf.ones(shape=(2, 1)) # Equivalent to np.ones(shape=(2, 1))
print(x)

In [None]:
x = tf.zeros(shape=(2, 1))  # Equivalent to np.zeros(shape=(2, 1))
print(x)

**Random tensors**

In [None]:
x = tf.random.normal(shape=(3, 1), mean=0., stddev=1.)
print(x)

In [None]:
x = tf.random.uniform(shape=(3, 1), minval=0., maxval=1.)
print(x)

To train a model, we’ll need to update its state, which is a set of tensors. If tensors
are constant (aren’t assignable), how do we do it? That’s where variables come in. tf.Variable is the
class meant to manage modifiable state in TensorFlow. To create a variable, you need to provide some initial value, such as a random tensor.
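
The difference between NumPy arrays, constant tensors, and variables can be seen in a short sketch (assuming TensorFlow 2.x). The flag name `assignment_failed` is only there to make the outcome explicit.

```python
import numpy as np
import tensorflow as tf

# NumPy arrays can be modified in place.
array = np.ones((2, 2))
array[0, 0] = 0.0

# A TensorFlow tensor cannot: item assignment raises a TypeError.
tensor = tf.ones((2, 2))
try:
    tensor[0, 0] = 0.0
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print("Cannot assign to a tensor:", error)

# A tf.Variable holds mutable state and is updated through assign().
variable = tf.Variable(tf.ones((2, 2)))
variable[0, 0].assign(0.0)
print(variable.numpy())
```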

**Creating a TensorFlow variable**

In [None]:
v = tf.Variable(initial_value=tf.random.normal(shape=(3, 1)))
print(v)

The state of a variable can be modified via its assign method, as follows.

**Assigning a value to a TensorFlow variable**

In [None]:
v.assign(tf.ones((3, 1)))

It also works for a subset of the coefficients.

**Assigning a value to a subset of a TensorFlow variable**

In [None]:
v[0, 0].assign(3.)

Similarly, assign_add() and assign_sub() are efficient equivalents of += and -=, as
shown next. 

**Using `assign_add`**

In [None]:
v.assign_add(tf.ones((3, 1)))

# Tensor operations: Doing math in TensorFlow

**A few basic math operations**

In [None]:
a = tf.ones((2, 2))
b = tf.square(a)     # Take the element-wise square.
c = tf.sqrt(a)       # Take the element-wise square root.
d = b + c            # Add two tensors (element-wise).
e = tf.matmul(a, b)  # Take the matrix product of two tensors.
e *= d               # Multiply two tensors (element-wise).
print(a)
print(b)
print(c)
print(d)
print(e)

# An end-to-end example: A linear classifier in pure TensorFlow

You know about tensors, variables, and tensor operations, and you know how to compute gradients. That’s enough to build any machine learning model based on gradient descent.  
Let us implement a linear classifier from scratch in TensorFlow.

First, let’s come up with some nicely linearly separable synthetic data to work with: two classes of points in a 2D plane. We’ll generate each class of points by drawing their coordinates from a random distribution with a specific covariance matrix and a specific mean. Intuitively, the covariance matrix describes the shape of the point cloud,
and the mean describes its position in the plane (see figure below). We’ll reuse the same
covariance matrix for both point clouds, but we’ll use two different mean values: the point clouds will have the same shape, but different positions.

**Generating two classes of random points in a 2D plane**

In [None]:
import tensorflow as tf
import numpy as np

num_samples_per_class = 1000
negative_samples = np.random.multivariate_normal(
    mean=[0, 3],
    cov=[[1, 0.5],[0.5, 1]],
    size=num_samples_per_class)
positive_samples = np.random.multivariate_normal(
    mean=[3, 0],
    cov=[[1, 0.5],[0.5, 1]],
    size=num_samples_per_class)

  - The first call generates the first class of points: 1000 random 2D points. cov=[[1, 0.5],[0.5, 1]] corresponds to an oval-like point cloud oriented from bottom left to top right.
  - The second call generates the other class of points with a different mean and the same covariance matrix.
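
To see that the mean and covariance arguments really control the position and shape of the cloud, one can draw a large sample and compare the empirical statistics with the requested ones. This sketch uses NumPy's `default_rng` generator with an arbitrary seed and 100,000 points; both are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
samples = rng.multivariate_normal(
    mean=[0, 3],
    cov=[[1, 0.5], [0.5, 1]],
    size=100_000)

empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)  # rows are samples, columns are x and y

print(empirical_mean)  # close to [0, 3]
print(empirical_cov)   # close to [[1, 0.5], [0.5, 1]]
```

The positive off-diagonal term (0.5) means x and y tend to increase together, which is why the cloud is stretched from bottom left to top right.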

In the preceding code, negative_samples and positive_samples are both arrays
with shape (1000, 2). Let’s stack them into a single array with shape (2000, 2).

**Stacking the two classes into an array with shape (2000, 2)**

In [None]:
inputs = np.vstack((negative_samples, positive_samples)).astype(np.float32)

**Generating the corresponding targets (0 and 1)**

In [None]:
targets = np.vstack((np.zeros((num_samples_per_class, 1), dtype="float32"),
                     np.ones((num_samples_per_class, 1), dtype="float32")))

**Plotting the two point classes**

In [None]:
import matplotlib.pyplot as plt
plt.scatter(inputs[:, 0], inputs[:, 1], c=targets[:, 0])
plt.show()

**Creating the linear classifier variables**

Now let’s create a linear classifier that can learn to separate these two blobs. A linear
classifier is an affine transformation (prediction = W • input + b) trained to minimize
the square of the difference between predictions and the targets.
 As you’ll see, it’s actually a much simpler example than the end-to-end example of
a toy two-layer neural network. However, this time you
should be able to understand everything about the code, line by line.
 Let’s create our variables, W and b, initialized with random values and with zeros,
respectively.

In [None]:
input_dim = 2     # The inputs will be 2D points
output_dim = 1    # The output predictions will be a single score per sample (close to 0 if the sample is predicted to be in class 0, 
                  # and close to 1 if the sample is predicted to be in class 1).
W = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim, output_dim)))
b = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))

Here’s our forward pass function.

**The forward pass function**

In [None]:
def model(inputs):
    return tf.matmul(inputs, W) + b

Because our linear classifier operates on 2D inputs, W is really just two scalar coefficients, w1 and w2: W = [[w1], [w2]]. Meanwhile, b is a single scalar coefficient. As such,
for a given input point [x, y], its prediction value is prediction = [[w1], [w2]] • [x,
y] + b = w1 * x + w2 * y + b.
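
The identity above can be checked numerically. The sketch below uses NumPy with arbitrary illustrative values for w1, w2, and b (they are not trained weights) and compares the matrix product with the expression written out by hand.

```python
import numpy as np

W = np.array([[0.4], [-0.2]])  # [[w1], [w2]], arbitrary values
b = np.array([0.1])
points = np.array([[1.0, 2.0],
                   [3.0, -1.0]])  # two input points [x, y]

# Matrix form: shape (2, 2) @ (2, 1) -> (2, 1), then b is broadcast.
predictions = points @ W + b

# Same computation written out as w1 * x + w2 * y + b.
by_hand = W[0, 0] * points[:, 0] + W[1, 0] * points[:, 1] + b[0]

print(predictions[:, 0])  # approximately [0.1, 1.5]
print(by_hand)            # the same values
```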

**The mean squared error loss function**

In [None]:
def square_loss(targets, predictions):
    per_sample_losses = tf.square(targets - predictions)  
    return tf.reduce_mean(per_sample_losses)

  - per_sample_losses will be a tensor with the same shape as
targets and predictions, containing per-sample loss scores.
  - We need to average these per-sample loss scores into a
single scalar loss value: this is what reduce_mean does.
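
The same loss can be sketched in NumPy to make the shapes visible. The numbers below are made up. The second half shows a common pitfall that follows from broadcasting rules: if the predictions are flattened to shape (3,) while the targets keep shape (3, 1), the subtraction silently produces a (3, 3) array and the mean is no longer the intended loss.

```python
import numpy as np

targets = np.array([[0.0], [1.0], [1.0]])      # shape (3, 1)
predictions = np.array([[0.1], [0.8], [0.4]])  # shape (3, 1)

per_sample_losses = np.square(targets - predictions)  # shape (3, 1)
loss = per_sample_losses.mean()                        # single scalar
print(per_sample_losses.shape, loss)

# Pitfall: mismatched shapes broadcast to a (3, 3) array.
flat_predictions = predictions[:, 0]                   # shape (3,)
wrong = np.square(targets - flat_predictions)
print(wrong.shape)  # (3, 3)
```

Keeping targets and predictions in the same shape, as the code in this lab does, avoids the problem.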

Next is the training step, which receives some training data and updates the weights W
and b so as to minimize the loss on the data.

**The training step function**

In [None]:
learning_rate = 0.1

def training_step(inputs, targets):
    with tf.GradientTape() as tape:
        predictions = model(inputs)
        loss = square_loss(targets, predictions)
    grad_loss_wrt_W, grad_loss_wrt_b = tape.gradient(loss, [W, b])
    W.assign_sub(grad_loss_wrt_W * learning_rate)
    b.assign_sub(grad_loss_wrt_b * learning_rate)
    return loss

For simplicity, we’ll do batch training instead of mini-batch training: we’ll run each training
step (gradient computation and weight update) for all the data, rather than iterate over
the data in small batches. On one hand, this means that each training step will take
much longer to run, since we’ll compute the forward pass and the gradients for 2,000
samples at once. On the other hand, each gradient update will be much more effective
at reducing the loss on the training data, since it will encompass information from all
training samples instead of, say, only 128 random samples. As a result, we will need many
fewer steps of training, and we should use a larger learning rate than we would typically
use for mini-batch training (we’ll use learning_rate = 0.1, defined before).
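
To make the gradient computation and weight update concrete, here is a NumPy-only sketch of full-batch gradient descent on the same kind of data. It is not the TensorFlow implementation used in this lab: instead of GradientTape it uses the hand-derived gradients of the mean squared error for a linear model, grad_W = 2/N · Xᵀ(XW + b − t) and grad_b = 2 · mean(XW + b − t). The seed and variable names are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 1000
cov = [[1, 0.5], [0.5, 1]]
inputs = np.vstack([rng.multivariate_normal([0, 3], cov, n),
                    rng.multivariate_normal([3, 0], cov, n)])
targets = np.vstack([np.zeros((n, 1)), np.ones((n, 1))])

W = rng.uniform(size=(2, 1))
b = np.zeros((1,))
learning_rate = 0.1

losses = []
for step in range(40):
    error = inputs @ W + b - targets           # shape (2000, 1)
    losses.append(float(np.mean(error ** 2)))  # mean squared error
    grad_W = 2 * inputs.T @ error / len(inputs)
    grad_b = 2 * error.mean(axis=0)
    W -= learning_rate * grad_W                # gradient descent update
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Each iteration uses all 2,000 samples, which is what "batch training" means here.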

**The batch training loop**

In [None]:
for step in range(40):
    loss = training_step(inputs, targets)
    print(f"Loss at step {step}: {loss:.4f}")

After 40 steps, the training loss seems to have stabilized around 0.025. Let’s plot how
our linear model classifies the training data points. Because our targets are zeros and
ones, a given input point will be classified as “0” if its prediction value is below 0.5, and
as “1” if it is above 0.5 (see figure below):

In [None]:
predictions = model(inputs)
plt.scatter(inputs[:, 0], inputs[:, 1], c=predictions[:, 0] > 0.5)
plt.show()
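
The same 0.5 threshold can be used to measure how often the model is right. The sketch below uses four made-up predictions rather than the trained model's output, so the resulting number only illustrates the computation.

```python
import numpy as np

predictions = np.array([[0.1], [0.7], [0.4], [0.9]])  # made-up scores
targets = np.array([[0.0], [1.0], [1.0], [1.0]])

# Scores above 0.5 are classified as 1, the others as 0.
predicted_classes = (predictions > 0.5).astype(np.float32)

# Fraction of samples whose predicted class matches the target.
accuracy = float((predicted_classes == targets).mean())
print(accuracy)  # 0.75
```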

Recall that the prediction value for a given point [x, y] is simply prediction ==
[[w1], [w2]] • [x, y] + b == w1 * x + w2 * y + b. Thus, class 0 is defined as w1 * x + w2 * y + b < 0.5, and class 1 is defined as w1 * x + w2 * y + b > 0.5. You’ll notice that what
you’re looking at is really the equation of a line in the 2D plane: w1 * x + w2 * y + b = 0.5.
One side of the line is class 1 and the other side is class 0; which side is which depends on the sign of w2. For this dataset, class 1 is centered at (3, 0) and class 0 at (0, 3), so after training we expect class 1 to lie below the line. You may be used to seeing line
equations in the format y = a * x + b; in the same format, our line becomes y = - w1 / w2 * x + (0.5 - b) / w2 (assuming w2 is nonzero).
Let’s plot this line (shown in figure below):

In [None]:
x = np.linspace(-1, 4, 100)
y = - W[0] /  W[1] * x + (0.5 - b) / W[1]
plt.plot(x, y, "-r")
plt.scatter(inputs[:, 0], inputs[:, 1], c=predictions[:, 0] > 0.5)

The first line generates 100 regularly spaced numbers between -1 and 4, which
we use to plot our line. The second line computes the corresponding y values from the line equation. plt.plot draws the line ("-r" means “plot it as a red line”), and the final plt.scatter call shows our model’s predictions on the same plot.
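
As a sanity check on the line equation, one can verify that every point on the line y = - w1 / w2 * x + (0.5 - b) / w2 gets a prediction of exactly 0.5. The weights below are arbitrary illustrative numbers, not the trained values, and the formula assumes w2 is nonzero.

```python
import numpy as np

w1, w2, b = 0.2, -0.15, 0.4  # arbitrary values with w2 != 0

x = np.linspace(-1, 4, 5)
y = -w1 / w2 * x + (0.5 - b) / w2  # points on the decision line

# Plugging the points back into the model gives the threshold value.
scores = w1 * x + w2 * y + b
print(scores)  # all approximately 0.5

# With w2 < 0, a point slightly below the line scores above 0.5 (class 1).
score_below = w1 * x[0] + w2 * (y[0] - 1.0) + b
print(score_below > 0.5)  # True
```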