# Traffic Sign Recognition – Data Preprocessing

In this notebook, we prepare a traffic sign dataset for training a Convolutional Neural Network (CNN).
This includes splitting the data, bringing the images to a common size and scale, and encoding the labels.

We cover the following steps:
1. Loading the dataset (simulated)
2. Train–test split
3. Image resizing
4. Normalizing pixel values
5. Converting labels to categorical (one-hot)
6. Saving the preprocessed data

In [19]:
import numpy as np
import cv2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

### Step 1: Load dataset (simulated)

In a real project, we would load the traffic sign images and labels here.  
Since this is a training/demo notebook, we'll simulate a small dataset to show the preprocessing workflow.

Each image is represented as a 3D array (height, width, channels), and labels are integers representing classes.


In [25]:
# Simulate 50 RGB images of varying sizes (28-39 px per side) with random labels from 10 classes (0-9)
num_samples = 50
images = [np.random.randint(0, 256, size=(np.random.randint(28, 40), np.random.randint(28, 40), 3), dtype=np.uint8)
          for _ in range(num_samples)]
labels = np.random.randint(0, 10, size=(num_samples,))

print("Simulated dataset. Number of images:", len(images))
print("Number of labels:", len(labels))
print("Example image shape:", images[0].shape)
print("Example label:", labels[0])



Simulated dataset. Number of images: 50
Number of labels: 50
Example image shape: (37, 28, 3)
Example label: 2


### Step 2: Train–test split

Before training a CNN, we split the dataset:

- **Training set**: Used to teach the model.
- **Test set**: Used to evaluate the model on unseen data.

**Important:** Some traffic sign classes have only a single image.  
`train_test_split` with `stratify` requires **at least 2 images per class**, so we remove these rare classes for this demo.  
The filtered data is referred to as `images_filtered` and `labels_filtered` below.

- Our filtered dataset has **29 classes** but only **88 images**.
- To perform a stratified split, the test set must have **at least 1 sample per class**, i.e. at least 29 images.
- We therefore raise `test_size` from the baseline of 0.2 to 29/88 ≈ 0.33 (29 images) to satisfy this condition.
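The filtering step itself is not shown in this notebook. A minimal sketch of how it could look is given here, assuming the images are held in a Python list and the labels are integers; the helper name `drop_rare_classes` and the toy data are hypothetical and only illustrate the idea.

```python
from collections import Counter

import numpy as np


def drop_rare_classes(images, labels, min_count=2):
    """Keep only samples whose class has at least `min_count` examples."""
    labels = np.asarray(labels)
    counts = Counter(labels.tolist())
    # Boolean mask: True for samples that belong to a sufficiently frequent class
    keep = np.array([counts[label] >= min_count for label in labels.tolist()], dtype=bool)
    images_kept = [img for img, k in zip(images, keep) if k]
    return images_kept, labels[keep]


# Toy data (hypothetical): class 7 has a single image and should be dropped.
toy_images = [np.zeros((30, 30, 3), dtype=np.uint8) for _ in range(5)]
toy_labels = [0, 0, 1, 1, 7]

images_filtered, labels_filtered = drop_rare_classes(toy_images, toy_labels)
print(len(images_filtered), sorted(set(labels_filtered.tolist())))
```

With `min_count=2`, every remaining class has enough images for a stratified split to place at least one in each subset.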


In [7]:
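# images_filtered / labels_filtered: the dataset after removing classes with fewer than 2 images
num_classes = len(set(labels_filtered))
# Use at least 20% of the data for testing, and at least one test sample per class
test_size = max(0.2, num_classes / len(labels_filtered))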

print(f"Using test_size = {test_size:.2f} to satisfy stratification constraints")

X_train, X_test, y_train, y_test = train_test_split(
    images_filtered,
    labels_filtered,
    test_size=test_size,
    random_state=42,
    stratify=labels_filtered
)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))


Using test_size = 0.33 to satisfy stratification constraints
Training set size: 59
Test set size: 29


#### Notes

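1. `test_size` is raised above the 0.2 baseline automatically when the number of classes is high relative to the dataset size.
2. A stratified split keeps class proportions similar in both subsets, so with at least one test sample per class, every class appears in both the train and test sets.
3. Without this adjustment, `train_test_split` raises a `ValueError` when the test set is smaller than the number of classes.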
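The arithmetic behind these notes can be checked without touching the data. The sketch below is an illustration only: the helper `stratified_split_sizes` and its `min_fraction` parameter are hypothetical names, and it works with absolute sample counts rather than a fraction to avoid floating-point rounding.

```python
import math


def stratified_split_sizes(num_samples, num_classes, min_fraction=0.2):
    """Return (n_train, n_test) with at least one sample per class on each side."""
    # The test set gets the larger of: the baseline fraction, or one sample per class
    n_test = max(math.ceil(min_fraction * num_samples), num_classes)
    n_train = num_samples - n_test
    if n_train < num_classes:
        raise ValueError(
            f"{num_samples} samples cannot cover {num_classes} classes in both splits"
        )
    return n_train, n_test


n_train, n_test = stratified_split_sizes(num_samples=88, num_classes=29)
print(n_train, n_test)  # 59 29
```

`train_test_split` also accepts an integer `test_size`, interpreted as an absolute number of test samples, so `n_test` could be passed directly instead of a fraction.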


### Step 3: Image resizing

CNNs require all input images to have the same size.  
Traffic sign images come in different shapes, so we need to resize them to a fixed size (e.g., 32x32 pixels) for consistency.


In [10]:
# Resize all images to 32x32
X_train_resized = np.array([cv2.resize(img, (32, 32)) for img in X_train])
X_test_resized = np.array([cv2.resize(img, (32, 32)) for img in X_test])

print("Training images resized shape:", X_train_resized.shape)
print("Test images resized shape:", X_test_resized.shape)


Training images resized shape: (59, 32, 32, 3)
Test images resized shape: (29, 32, 32, 3)


### Step 4: Normalize pixel values

Neural networks train more efficiently when input values are scaled.  
We convert pixel values from the range [0, 255] to [0, 1].


In [11]:
X_train_norm = X_train_resized / 255.0
X_test_norm = X_test_resized / 255.0

print("Pixel values normalized. Min:", X_train_norm.min(), "Max:", X_train_norm.max())


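Pixel values normalized. Min: 5.0322659173041016e-08 Max: 0.003921505220453935

**Note:** A maximum of about 0.0039 (≈ 1/255) instead of a value close to 1.0 suggests that the images were already scaled to [0, 1] before this cell ran, so they were divided by 255 a second time. The division should be applied only once, to raw 8-bit images.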
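Dividing by 255 is only correct for raw 8-bit data; applied to values that are already in [0, 1], it shrinks them to at most 1/255. A small guard can make the step safe to re-run. This is a sketch under the assumption that raw images are always `uint8`; the helper name `to_unit_range` is hypothetical.

```python
import numpy as np


def to_unit_range(batch):
    """Scale uint8 images to [0, 1]; leave already-scaled float images unchanged."""
    batch = np.asarray(batch)
    if batch.dtype == np.uint8:
        # Raw 8-bit pixels: scale exactly once
        return batch.astype(np.float32) / 255.0
    if batch.min() < 0.0 or batch.max() > 1.0:
        raise ValueError("float input is expected to lie in [0, 1] already")
    # Already normalized: only unify the dtype
    return batch.astype(np.float32)


raw = np.array([[0, 128, 255]], dtype=np.uint8)
once = to_unit_range(raw)
twice = to_unit_range(once)  # second call is a no-op on the values

print(once.max(), np.array_equal(once, twice))
```

Keying the decision on the dtype rather than on the observed maximum avoids misclassifying a dark `uint8` image whose brightest pixel happens to be 0 or 1.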


### Step 5: Convert labels to categorical (one-hot)

CNNs output probabilities for each class, so labels need to be one-hot encoded.  
Example: class 3 → [0, 0, 0, 1, 0, ...]


In [22]:
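# OneHotEncoder expects a 2D array of shape (n_samples, 1)
y_train_reshaped = np.array(y_train).reshape(-1, 1)
y_test_reshaped = np.array(y_test).reshape(-1, 1)

# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)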

# Fit on training labels and transform both train and test
y_train_cat = encoder.fit_transform(y_train_reshaped)
y_test_cat = encoder.transform(y_test_reshaped)
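To make the encoding concrete, the same result can be reproduced for integer labels `0 .. K-1` by indexing into an identity matrix. This is a NumPy illustration, not the method used in the cell above; the helper `one_hot` is a hypothetical name. Note that `OneHotEncoder` orders its columns by the sorted categories seen during `fit`, so a column index equals the label only when the labels are contiguous and start at 0.

```python
import numpy as np


def one_hot(labels, num_classes):
    """One-hot encode integer labels in the range 0 .. num_classes - 1."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]


encoded = one_hot([3, 0, 2], num_classes=5)
print(encoded[0])  # [0. 0. 0. 1. 0.]
```

Taking `np.argmax(encoded, axis=1)` recovers the original integer labels, which is also how predicted class indices are usually read off a model's probability output.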


### Step 6: Save preprocessed data

Saving the normalized images and one-hot labels to `.npy` files lets later notebooks load them directly instead of repeating every preprocessing step.  
In a real project, we would write them to disk right after resizing and normalizing.

Since this notebook is for demonstration, we'll simulate the save process.

In [26]:
# Normally, you would do:
# np.save("assets/images_preprocessed.npy", X_train_norm)
# np.save("assets/labels_preprocessed.npy", y_train_cat)

# Here, just print a message to show what would happen
print("Preprocessed images would be saved to 'assets/images_preprocessed.npy'")
print("Preprocessed labels would be saved to 'assets/labels_preprocessed.npy'")

Preprocessed images would be saved to 'assets/images_preprocessed.npy'
Preprocessed labels would be saved to 'assets/labels_preprocessed.npy'


Saving preprocessed data avoids repeating time-consuming steps like resizing and normalization.

With a real dataset, you would replace the print statements with `np.save` calls to write actual `.npy` files.

Printing the messages instead keeps the notebook lightweight for training/demo purposes.
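For reference, a save-and-reload round trip could look like the sketch below. It writes to a temporary directory rather than `assets/` so that it leaves no files behind, and the small random arrays are stand-ins for the preprocessed data, not the notebook's actual output.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays (assumed shapes): 4 normalized 32x32 RGB images, 3 classes
images = np.random.rand(4, 32, 32, 3).astype(np.float32)
labels = np.eye(3, dtype=np.float32)[[0, 2, 1, 0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    images_path = os.path.join(tmp_dir, "images_preprocessed.npy")
    labels_path = os.path.join(tmp_dir, "labels_preprocessed.npy")

    # Write both arrays to disk
    np.save(images_path, images)
    np.save(labels_path, labels)

    # Read them back, as a later notebook would
    images_loaded = np.load(images_path)
    labels_loaded = np.load(labels_path)

print(images_loaded.shape, labels_loaded.shape)
```

The `.npy` format stores dtype and shape alongside the data, so the arrays come back exactly as they were saved.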