
# **Image Preprocessing for CNN**

## **First Download Data from Kaggle**

In [2]:
# 1) install kaggle api
!pip install kaggle --quiet

In [3]:
# 2) upload kaggle.json
from google.colab import files
files.upload()  # choose kaggle.json from your computer

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"hamzabangash1","key":"<redacted>"}'}

In [4]:
# 3) move kaggle.json to the correct folder
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/

In [5]:
# 4) set correct file permissions
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
# 5) download dataset
!kaggle datasets download samuelcortinhas/apples-or-tomatoes-image-classification


Dataset URL: https://www.kaggle.com/datasets/samuelcortinhas/apples-or-tomatoes-image-classification
License(s): CC0-1.0
Downloading apples-or-tomatoes-image-classification.zip to /content
100% 2.33M/2.33M [00:00<00:00, 871MB/s]


In [7]:
# 6) unzip dataset into 'data' folder
!unzip -q apples-or-tomatoes-image-classification.zip -d data/

In [8]:
# 7) inspect the folder structure of the dataset
!find data -maxdepth 2 -type d | head -n 20

data
data/test
data/test/tomatoes
data/test/apples
data/train
data/train/tomatoes
data/train/apples


## **Checking for Corrupt files**

In [9]:
import os
from PIL import Image

root = "data"  # your dataset folder
bad = 0

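# 'filenames' avoids shadowing the google.colab 'files' module imported earlier
for dirpath, _, filenames in os.walk(root):
    for f in filenames:
        p = os.path.join(dirpath, f)
        try:
            # verify() checks file integrity without decoding the pixel data
            with Image.open(p) as img:
                img.verify()
        except Exception:
            bad += 1
            try:
                os.remove(p)  # delete corrupt image
            except OSError:
                pass  # file could not be deleted; leave it in place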

print("removed bad files:", bad)


removed bad files: 0
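
`verify()` confirms that Pillow can parse a file, but `image_dataset_from_directory` decodes only JPEG, PNG, BMP, and GIF content, so a file saved with a misleading extension can pass the check above and still fail when TensorFlow reads it. As a complementary check, the sketch below compares the first bytes of each file with the standard signatures of those four formats. The helper names `sniff_format` and `find_unsupported` are hypothetical, and the code only reports files rather than deleting them.

```python
import os

# Leading bytes ("magic numbers") of the formats TensorFlow can decode.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "bmp": (b"BM",),
}


def sniff_format(path):
    """Return the format name implied by the file's first bytes, or None."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for name, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return name
    return None


def find_unsupported(root):
    """List files under root whose content is not JPEG, PNG, GIF, or BMP."""
    unsupported = []
    for dirpath, _, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            if sniff_format(path) is None:
                unsupported.append(path)
    return sorted(unsupported)
```

Calling `find_unsupported("data")` after the unzip step would list any such files so they can be inspected before loading.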


## **Load Dataset into TensorFlow**

In [10]:
import tensorflow as tf

# parameters
img_size = (224, 224)  # resize all images to 224x224
batch_size = 32

# 1) training dataset
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",         # path to training images
    image_size=img_size,  # resize images
    batch_size=batch_size,
    shuffle=True           # shuffle data for better training
)

# 2) testing dataset
test_ds = tf.keras.utils.image_dataset_from_directory(
    "data/test",          # path to test images
    image_size=img_size,
    batch_size=batch_size,
    shuffle=False          # no need to shuffle test data
)

# 3) see class names
class_names = train_ds.class_names
print("Classes:", class_names)


Found 294 files belonging to 2 classes.
Found 97 files belonging to 2 classes.
Classes: ['apples', 'tomatoes']


## **Checking Balance of Data**

In [11]:
import numpy as np

# get all labels from the training dataset
all_labels = np.concatenate([y.numpy() for x, y in train_ds], axis=0)

# count images per class
unique, counts = np.unique(all_labels, return_counts=True)
class_counts = dict(zip(class_names, counts))

print("Class distribution in training set:", class_counts)


Class distribution in training set: {'apples': np.int64(164), 'tomatoes': np.int64(130)}


#### **If the Data Is Imbalanced**
You have a few options:

**1) Data Augmentation for the minority class**

Apply extra transformations (flip, rotate, zoom) more frequently to the minority class (tomatoes here) to increase the effective number of training samples.

**2) Class Weights in Training**

    class_weight = {
        0: 1.0,               # apples
        1: 164 / 130,         # tomatoes -> 1.26
    }

    model.fit(train_ds, validation_data=test_ds, epochs=10, class_weight=class_weight)

With these weights, errors on the minority class contribute more to the loss, so the model pays more attention to it during training.

**3) Oversampling**

Repeat minority class images to balance the dataset.

This is usually done on the files before calling `image_dataset_from_directory`. It is less common in TF 2.x pipelines because augmentation combined with class weighting usually works well.
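
For option 2, the weights can also be derived from the class counts instead of being typed by hand. The sketch below uses the common "balanced" heuristic, `total / (n_classes * count)`, which is an assumption and not the weighting shown above: it keeps the same 164/130 ratio between the classes but rescales the weights so that they average to 1 over the training samples. `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts):
    """Map class index -> total / (n_classes * count)."""
    total = sum(counts)
    n_classes = len(counts)
    return {i: total / (n_classes * c) for i, c in enumerate(counts)}


# counts reported above: 164 apples (index 0), 130 tomatoes (index 1)
class_weight = balanced_class_weights([164, 130])
print({k: round(v, 3) for k, v in class_weight.items()})
# prints {0: 0.896, 1: 1.131}
```

The resulting dictionary has the shape expected by the `class_weight` argument of `model.fit`.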

#### 1) Data Augmentation for the minority class

##### Step 1: Define data augmentation

In [12]:
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])


##### Step 2: Separate minority class images

In [13]:
# get all images and labels from training dataset
images = []
labels = []

for x, y in train_ds:  # train_ds from the "Load Dataset into TensorFlow" section
    images.append(x)
    labels.append(y)

images = tf.concat(images, axis=0)
labels = tf.concat(labels, axis=0)

# minority class index (tomatoes)
minority_index = 1

minority_images = tf.boolean_mask(images, labels==minority_index)
minority_labels = tf.boolean_mask(labels, labels==minority_index)


##### Step 3: Apply augmentation only to minority images

In [14]:
# training=True makes sure the random transformations are actually applied
augmented_images = data_augmentation(minority_images, training=True)
augmented_labels = minority_labels  # labels stay the same

##### Step 4: Combine with original training dataset

In [15]:
# combine original images + augmented minority images
all_images = tf.concat([images, augmented_images], axis=0)
all_labels = tf.concat([labels, augmented_labels], axis=0)

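# create new tf.data.Dataset
train_ds_balanced = tf.data.Dataset.from_tensor_slices((all_images, all_labels))
# use a shuffle buffer that covers the whole dataset so the shuffle is uniform
train_ds_balanced = train_ds_balanced.shuffle(all_images.shape[0]).batch(batch_size)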
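
Steps 2 to 4 add one augmented copy of every tomato image, so the combined set holds 164 apples and 260 tomatoes, which tips the imbalance the other way. One alternative, sketched here under the assumption that exact parity is the goal, is to augment only as many minority images as the gap between the two classes. `pick_indices_to_augment` is a hypothetical helper, not part of the notebook.

```python
import random


def pick_indices_to_augment(n_majority, n_minority, seed=0):
    """Choose which minority images to augment so both classes end up equal."""
    deficit = n_majority - n_minority
    if deficit <= 0:
        return []  # nothing to add
    rng = random.Random(seed)
    if deficit <= n_minority:
        # each chosen image is augmented exactly once
        return sorted(rng.sample(range(n_minority), deficit))
    # the gap is larger than the class itself: draw with replacement
    return sorted(rng.choices(range(n_minority), k=deficit))


idx = pick_indices_to_augment(n_majority=164, n_minority=130)
print(len(idx))  # prints 34
```

The indices could then be passed to `tf.gather(minority_images, idx)` before calling `data_augmentation`, which would give 164 images per class.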

## **Data Augmentation**

In [16]:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

# apply augmentation to the training dataset only (never to the test data)
train_ds_balanced = train_ds_balanced.map(lambda x, y: (data_augmentation(x, training=True), y))


## **Normalize Pixel Values**

In [17]:
# normalize images
normalizer = tf.keras.layers.Rescaling(1./255)

# apply normalization to datasets
train_ds_balanced = train_ds_balanced.map(lambda x, y: (normalizer(x), y))
test_ds  = test_ds.map(lambda x, y: (normalizer(x), y))

## **Improve Performance with Caching & Prefetch**

In [18]:
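# prefetch the pipeline that is actually used for training; it is not cached
# here because caching after random augmentation would replay the same
# augmented images in every epoch
train_ds_balanced = train_ds_balanced.prefetch(tf.data.AUTOTUNE)
test_ds  = test_ds.cache().prefetch(tf.data.AUTOTUNE)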

# **Models**

In [19]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

In [20]:
model = Sequential()

# Add an explicit Input layer
model.add(tf.keras.Input(shape=(224, 224, 3)))

# first convolutional layer
model.add(Conv2D(32, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

# second convolutional layer
model.add(Conv2D(64, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

# flatten and fully connected layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # binary classification
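
To see where this model's parameters go, the sketch below traces the shapes through the layers by hand. It assumes the Keras defaults that the code above relies on: `padding='valid'` and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


def conv_params(kernel, in_ch, out_ch):
    return kernel * kernel * in_ch * out_ch + out_ch  # weights + biases


def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


s = pool_out(conv_out(224))  # 224 -> 222 -> 111
s = pool_out(conv_out(s))    # 111 -> 109 -> 54
flat = s * s * 64            # length of the Flatten output

params = {
    "conv1": conv_params(3, 3, 32),
    "conv2": conv_params(3, 32, 64),
    "dense1": dense_params(flat, 128),
    "dense2": dense_params(128, 1),
}
total = sum(params.values())
print(flat)    # prints 186624
print(params)  # prints {'conv1': 896, 'conv2': 18496, 'dense1': 23888000, 'dense2': 129}
print(total)   # prints 23907521
```

Almost all of the roughly 23.9 million parameters sit in the first `Dense` layer, which is a lot for a training set of a few hundred images. `model.summary()` gives the same breakdown directly.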

In [21]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
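
As a reference point for reading the loss values, here is a minimal sketch of what `binary_crossentropy` computes for a single sample, where `p` is the sigmoid output (the predicted probability of class 1, tomatoes). It omits the clipping of `p` that Keras applies for numerical stability, and the function below is a plain Python illustration rather than the Keras implementation.

```python
import math


def binary_crossentropy(y_true, p):
    """Loss for one sample: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


print(round(binary_crossentropy(1, 0.5), 4))  # prints 0.6931
print(round(binary_crossentropy(1, 0.9), 4))  # prints 0.1054
print(round(binary_crossentropy(1, 0.1), 4))  # prints 2.3026
```

A loss near 0.69 (that is, ln 2) means the model is about as informative as always predicting 0.5, which is a useful baseline when reading `val_loss` in the training log.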


In [22]:
history = model.fit(
    train_ds_balanced,
    steps_per_epoch=100,
    epochs=10,
    validation_data=test_ds,
    validation_steps=10
)
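
`steps_per_epoch` and `validation_steps` tell Keras how many batches to draw per epoch. As a rough check of the values used in this call, the sketch below computes how many batches one pass over each dataset yields, using the image counts reported earlier (294 training images plus 130 augmented tomato copies, and 97 test images). `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches in one full pass; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)


n_train = 294 + 130  # original training images + augmented tomato copies
n_test = 97

print(batches_per_epoch(n_train, 32))  # prints 14
print(batches_per_epoch(n_test, 32))   # prints 4
```

Both requested values (100 and 10) exceed what a single pass provides. Dropping the two arguments, so that each epoch is one pass over the data, or adding `.repeat()` to the datasets would make the request consistent with the data.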


Epoch 1/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 20s 83ms/step - accuracy: 0.4953 - loss: 2.5174 - val_accuracy: 0.4433 - val_loss: 0.7014
Epoch 2/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 9s 48ms/step - accuracy: 0.6204 - loss: 0.6473 - val_accuracy: 0.5258 - val_loss: 0.6828
Epoch 3/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 5s 43ms/step - accuracy: 0.6798 - loss: 0.6176 - val_accuracy: 0.5361 - val_loss: 0.7220
Epoch 4/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 5s 44ms/step - accuracy: 0.6675 - loss: 0.6020 - val_accuracy: 0.6392 - val_loss: 0.7121
Epoch 5/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 6s 50ms/step - accuracy: 0.6903 - loss: 0.6057 - val_accuracy: 0.6598 - val_loss: 0.6785
Epoch 6/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 5s 50ms/step - accuracy: 0.6895 - loss: 0.5