# MNIST classification with NNs
A first example of a simple neural network, applied to a well-known dataset.

We start from `tensorflow.keras.layers`, the neural network layer modules provided by TensorFlow's Keras API:

Input: Used to define the input layer of a neural network, specifying the shape of the input data.
Dense: A fully connected (dense) layer, one of the most commonly used layer types, in which every unit is connected to every output of the previous layer.

In [None]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model
from tensorflow.keras import utils
import numpy as np

Let us load the MNIST dataset.

tensorflow.keras.datasets: Common datasets provided by Keras.
mnist: A handwritten digit recognition dataset consisting of grayscale images (28×28 pixels) classified into 10 categories (digits 0-9).
Dataset Structure:
x_train: Image data for the training set, with shape (60000, 28, 28), representing 60,000 grayscale images of size 28×28.

y_train: Label data for the training set, with shape (60000,), representing 60,000 labels (each an integer from 0 to 9, indicating the digit class).

x_test: Image data for the test set, with shape (10000, 28, 28), representing 10,000 grayscale images of size 28×28.

y_test: Label data for the test set, with shape (10000,), representing 10,000 labels (each an integer from 0 to 9).

x_train and x_test are 3D NumPy arrays with the format: (number of samples, height, width).

y_train and y_test are 1D NumPy arrays with the format: (number of samples,), where each value is an integer between 0 and 9, representing the digit class.

In [None]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
print(x_train.shape)
print("pixel range is [{},{}]".format(np.min(x_train),np.max(x_train)))

(60000, 28, 28)
pixel range is [0,255]


We normalize the input to the range [0,1].

# Preprocessing MNIST Dataset for Deep Learning  

## Normalize Pixel Values to [0, 1]  

The pixel values in the MNIST dataset range from **0 to 255**, as it consists of grayscale images.  
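The images are stored as 8-bit unsigned integers (`uint8`). Calling `astype('float32')` before dividing by 255 converts them to floating point, so the scaled values are stored as 32-bit floats, the default type used by Keras layers.  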

### Why is this necessary?  

1. **Deep learning models are sensitive to numerical ranges:**  
   - Keeping pixel values in the range **0-255** produces large activations and gradients in the first layer, which can make training less stable.  
   - Normalizing to **[0, 1]** puts the inputs on a scale that suits common weight initializations and learning rates, so the model is **easier to train**.  

2. **Avoiding numerical issues:**  
   - Deep learning involves extensive **matrix computations**, and large unscaled inputs make problems such as **exploding gradients** or saturated activations more likely.  
   - For example, computing `exp(x)` (as in the **softmax layer**) with excessively large input values can cause **numerical overflow**.  
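A small NumPy sketch of the conversion step, using a made-up three-pixel array in place of real MNIST data (the values and variable names are illustrative assumptions). It shows that arithmetic on `uint8` arrays wraps around silently, while the `float32` version stays in [0, 1]:

```python
import numpy as np

# Hypothetical stand-in for a few MNIST pixels (the real images are uint8 too).
pixels = np.array([0, 128, 255], dtype=np.uint8)

# uint8 arithmetic wraps around modulo 256 instead of growing.
wrapped = pixels + pixels          # e.g. 255 + 255 = 510 -> 510 % 256 = 254

# Converting to float32 first and then scaling keeps the values in [0, 1].
scaled = pixels.astype('float32') / 255.

print(wrapped)
print(scaled.dtype, scaled.min(), scaled.max())
```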

---

## Convert 28×28 2D Image Data into a 1D Vector (784 Dimensions)  

### Why is this necessary?  

1. **The Dense (fully connected) layers used here expect one vector per sample:**  
   - A **Dense layer** acts on the last axis of its input, so to connect every pixel to every unit each image must be a single **one-dimensional vector**; `x_train` originally has the shape **(60000, 28, 28) (3D)**.  
   - `np.reshape(x_train, (60000, 28*28))` turns it into **(60000, 784)**, converting each **28×28** image into a **784-dimensional vector**.  


In [None]:
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

x_train = np.reshape(x_train,(60000,28*28))
x_test = np.reshape(x_test,(10000,28*28))
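The sample counts `60000` and `10000` are hardcoded in the cell above. A variant that infers them is sketched here on a small random array standing in for the images (the array `fake_images` is an assumption for illustration):

```python
import numpy as np

# Hypothetical stand-in: 5 "images" of 28x28 pixels.
fake_images = np.random.rand(5, 28, 28).astype('float32')

# -1 lets NumPy infer the number of samples from the array size.
flat = fake_images.reshape(-1, 28 * 28)

print(flat.shape)

# Row-major flattening: pixel (row r, column c) lands at index r * 28 + c.
same_pixel = flat[0, 3 * 28 + 7] == fake_images[0, 3, 7]
print(same_pixel)
```

The same call, `x_train.reshape(-1, 28 * 28)`, should work unchanged for both the training and the test arrays.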

# Output Layer Adaptation for Neural Networks (Softmax)  

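The network is trained with the **categorical crossentropy loss function**, which compares the softmax output with a one-hot encoded label:  
- Each class is treated as a separate category, so the numeric value of the integer labels carries no meaning for the model.  
- The loss reduces to the negative log-probability assigned to the correct class, which is cheap to compute.  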

## Why Use One-Hot Encoding?  

### **Issue 1: Neural Networks May Misinterpret Integer Label Relationships**  
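If `y_train` is represented as integers (e.g., **0-9**) and used directly as a numeric target, the model might **misinterpret the numerical relationships** between categories, for example by treating class 9 as "closer" to class 8 than to class 1.  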

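### **Issue 2: The Softmax Output Is a Vector, So the Target Should Be One Too**  
Typically, the final layer of a classification model is a **softmax layer**, which outputs a **probability distribution** over the different categories. A one-hot vector is the matching target: a distribution that puts probability 1 on the correct class.  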

### **Issue 3: Categorical Crossentropy Loss Requires One-Hot Encoding**  
In classification tasks, we commonly use the **categorical crossentropy loss function (`categorical_crossentropy`)** to measure the difference between the predicted and true distributions.  
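To make the loss concrete, here is a NumPy sketch of categorical crossentropy for a single sample with three classes (the probabilities are made up for illustration). With a one-hot target only the term for the correct class survives, so the loss equals the negative log-probability of that class; indexing with the integer label gives the same number.

```python
import numpy as np

# Made-up softmax output for one sample over 3 classes.
probs = np.array([0.1, 0.7, 0.2])

label = 1                          # integer class label
one_hot = np.eye(3)[label]         # one-hot version of the same label

# Categorical crossentropy: -sum_k y_k * log(p_k)
loss_one_hot = -np.sum(one_hot * np.log(probs))

# Same value, computed directly from the integer label.
loss_from_int = -np.log(probs[label])

print(loss_one_hot, loss_from_int)
```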

---

## **When Can We Skip One-Hot Encoding?**  

If your output layer uses **`sparse_categorical_crossentropy`**, **One-Hot encoding is not needed**:  
- This loss function works with **integer labels** (`y_train` remains in the range **0-9**).  
- It is suitable for **large-category classification tasks** (e.g., **1000-class ImageNet**).  
- However, in common classification tasks like **


In [None]:
print(y_train[0])
y_train_cat = utils.to_categorical(y_train)
print(y_train_cat[0])
y_test_cat = utils.to_categorical(y_test)

5
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
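For intuition, one-hot encoding can be reproduced in plain NumPy by indexing an identity matrix. This is a sketch of the idea, not a description of how `to_categorical` is implemented, and the label values are made up.

```python
import numpy as np

labels = np.array([5, 0, 4])       # made-up integer labels
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot[0])

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```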


# Our First Network: Implementing Logistic Regression  

## **Model Structure**  

- `Input(shape=(784,))`: Defines the symbolic input tensor.  
- `Dense(10, activation='softmax')(xin)`:  
  - Creates weight matrix **W** and bias **b**.  
  - Applies the **softmax activation function**.  
- `Model(inputs, outputs)`: Combines these layers to form a complete computational graph.  

---

## **Understanding the Input Layer**  

- **`Input()`**: Defines the input layer, specifying the shape of the input data.  
- **`shape=(28*28,)`**: Represents a **784-dimensional vector**.  

### **Why is the shape `(28*28,)`?**  
- The **MNIST dataset** consists of **28×28 grayscale images**.  
- In `x_train = np.reshape(x_train, (60000, 28*28))`, each **28×28** image is **flattened** into a **784-dimensional vector**.  
- The `Input` layer must match this shape to process the input correctly.  

---

## **Output Layer: Dense(10, activation='softmax')**  

Defines a **fully connected layer (Dense layer)** with **softmax activation**.  

### **Parameter Explanation:**  
- **`10`**: Number of neurons, meaning the output is a **10-dimensional vector**.  
- **`activation='softmax'`**:  
  - Computes the **probability distribution** over **10 categories**.  
  - Ensures that the sum of probabilities equals **1**.  
  - Suitable for **multi-class classification tasks** like MNIST (**digits 0-9**).  
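In practice the 10 probabilities are usually reduced to a single predicted digit by taking the most probable class. A minimal sketch with a made-up output vector:

```python
import numpy as np

# Made-up softmax output for one image (10 classes, sums to 1).
probs = np.array([0.01, 0.02, 0.05, 0.70, 0.02, 0.10, 0.03, 0.03, 0.02, 0.02])

predicted_digit = int(np.argmax(probs))     # index of the largest probability
confidence = float(probs[predicted_digit])  # probability assigned to that class

print(predicted_digit, confidence)
```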

---

## **Computation Process**  

### **Step 1: Dense(10) Computation**  
Applies the learned **weight matrix (W)** and **bias (b)** to the input:  

\[
z = XW + b
\]

- **X**: Input data (**shape: `(batch_size, 784)`**).  
- **W**: Weight matrix (**shape: `(784, 10)`**).  
- **b**: Bias vector (**shape: `(10,)`**).  
- The result **z** has shape **`(batch_size, 10)`**.  

### **Step 2: Softmax(z)**
Converts **z** into a **probability distribution**:

\[
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}
\]

This ensures that the final output is a **vector of probabilities**, one per class.  
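The two steps can be sketched in NumPy as follows. The weights here are random placeholders rather than trained values, and subtracting the row maximum before exponentiating is a common stability trick added by this sketch, not something stated above; it leaves the result unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size = 4
X = rng.random((batch_size, 784)).astype('float32')           # inputs in [0, 1)
W = rng.normal(scale=0.05, size=(784, 10)).astype('float32')  # placeholder weights
b = np.zeros(10, dtype='float32')                             # bias vector

# Step 1: affine transformation, result has shape (batch_size, 10).
z = X @ W + b

# Step 2: softmax along the class axis.
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # avoids overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(z)
print(probs.shape, probs.sum(axis=1))
```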


In [None]:
xin = Input(shape=(28*28,))
res = Dense(10,activation='softmax')(xin)

mynet = Model(inputs=xin,outputs=res)

In [None]:
mynet.summary()

Now we need to compile the network.
To do so, we need to specify two things:


*   the **optimizer**, which governs how the weights are updated from the gradients computed by backpropagation
*   the **loss function**

Several predefined optimizers exist, and you should just choose your favourite one. A common choice is Adam, which implements an adaptive learning rate with momentum.

# Configuring Model Training with `mynet.compile()`  

## **What Does `compile()` Do?**  
The `mynet.compile()` function is used to configure how the model will be trained. It defines:  

- **Optimizer (`optimizer`)**: Determines how weight parameters are updated to minimize loss.  
- **Loss Function (`loss`)**: Measures the error between predictions and true values.  
- **Evaluation Metrics (`metrics`)**: Performance indicators displayed during training (e.g., accuracy).  

---

## **Why Use Adam (Adaptive Moment Estimation)?**  

Adam is one of the most commonly used **deep learning optimizers**, combining:  

1. **Momentum**:  
   - Helps smooth the optimization process.  
   - Prevents oscillations in weight updates.  

2. **RMSProp (Adaptive Learning Rate)**:  
   - Automatically adjusts the learning rate for each parameter.  
   - Prevents updates from being too large or too small.  
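How the two ingredients combine can be sketched in a few lines of NumPy. This follows the standard Adam update rule with the commonly used defaults `beta1=0.9` and `beta2=0.999`; the function name `adam_step`, the toy objective f(x) = x², and the larger learning rate are assumptions chosen for illustration, not the Keras implementation.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter x given its gradient (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad            # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # RMSProp part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)   # step scaled per parameter
    return x, m, v

# Toy problem: minimise f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t)

print(x)  # ends up close to the minimum at 0
```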

### **Advantages of Adam:**  
✅ **Adaptive learning rate**: Works well for most tasks.  
✅ **Faster training**: Often converges faster than plain SGD (Stochastic Gradient Descent) with little tuning.  
✅ **Stable performance**: Suitable for various deep learning tasks (e.g., **image classification, NLP**).  

---

## **Alternative Optimizers**  

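- **`optimizer='sgd'` (Stochastic Gradient Descent)**:  
  - Simple, but often converges more slowly without a tuned learning rate or momentum.  
- **`optimizer='rmsprop'`**:  
  - Often used for **recurrent neural networks (RNNs)**.  
- **`optimizer='adamax'`**:  
  - A variant of Adam based on the infinity norm, sometimes preferred for models with **sparse updates** such as embeddings.  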


In [None]:
mynet.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

Finally, we fit the model over the training set.

Fitting requires just two arguments: the training data and the ground truth, that is, x and y. Additionally, we can specify epochs, batch_size, and many other arguments.

In particular, passing validation data allows the training procedure to measure loss and metrics on the validation set at the end of each epoch.
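The relation between dataset size, `batch_size` and the number of weight updates per epoch is simple arithmetic. With the 60,000 training images and `batch_size=32` used in the call below, each epoch consists of 1875 steps, which is the count Keras reports in its progress output.

```python
import math

num_samples = 60000   # size of the MNIST training set
batch_size = 32
epochs = 10

# The last batch may be smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 1875 18750
```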

In [None]:
mynet.fit(x_train,y_train_cat, shuffle=True, epochs=10, batch_size=32,validation_data=(x_test,y_test_cat))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.8097 - loss: 0.7262 - val_accuracy: 0.9144 - val_loss: 0.3078
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 2ms/step - accuracy: 0.9144 - loss: 0.3066 - val_accuracy: 0.9231 - val_loss: 0.2798
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9181 - loss: 0.2858 - val_accuracy: 0.9240 - val_loss: 0.2713
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9237 - loss: 0.2753 - val_accuracy: 0.9246 - val_loss: 0.2707
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9263 - loss: 0.2618 - val_accuracy: 0.9263 - val_loss: 0.2688
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9295 - loss: 0.2552 - val_accuracy: 0.9273 - val_loss: 0.2634
Epoch 7/10

<keras.src.callbacks.history.History at 0x7c5b226d1a90>

In [None]:
xin = Input(shape=(784,))
x = Dense(100,activation='relu')(xin)
res = Dense(10,activation='softmax')(x)

mynet2 = Model(inputs=xin,outputs=res)

In [None]:
mynet2.summary()
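The parameter counts reported by `summary()` can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a Dense layer: weights plus biases."""
    return n_in * n_out + n_out

# First network: 784 -> 10
mynet_params = dense_params(784, 10)

# Second network: 784 -> 100 -> 10
mynet2_params = dense_params(784, 100) + dense_params(100, 10)

print(mynet_params, mynet2_params)  # 7850 79510
```

The hidden layer accounts for almost all of the roughly tenfold increase in parameters.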

In [None]:
mynet2.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
mynet2.fit(x_train,y_train_cat, shuffle=True, epochs=10, batch_size=32,validation_data=(x_test,y_test_cat))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.8791 - loss: 0.4353 - val_accuracy: 0.9619 - val_loss: 0.1323
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9633 - loss: 0.1238 - val_accuracy: 0.9719 - val_loss: 0.0939
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9770 - loss: 0.0793 - val_accuracy: 0.9680 - val_loss: 0.0995
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9813 - loss: 0.0615 - val_accuracy: 0.9763 - val_loss: 0.0769
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9869 - loss: 0.0437 - val_accuracy: 0.9774 - val_loss: 0.0754
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.9894 - loss: 0.0338 - val_accuracy: 0.9765 - val_loss: 0.0823
Epoch 7/10

<keras.src.callbacks.history.History at 0x7c5b22653990>

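An amazing improvement: adding a single hidden layer of 100 ReLU units raises the validation accuracy at epoch 6 from 0.9273 to 0.9765.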

# Exercises

1.   Add additional Dense layers and check the performance of the network.
2.   Replace 'relu' with different activation functions.
3.   Adapt the network to work with the so-called sparse_categorical_crossentropy.
4.   The fit function returns a history of training, with temporal sequences for all the different metrics. Make a plot.

