<a href="https://colab.research.google.com/github/praise-phiri/Class-Assignments/blob/main/Part_one/using_existing_image_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UTILIZING EXISTING IMAGE DATASETS
In this assignment, we download an existing dataset, preprocess it, and prepare it for machine learning models. We export the dataset in COCO format using the `json` library. A summary of the steps and how they are applied appears at the end of this notebook.

### 1. Downloading the dataset





In [None]:
import os
import urllib.request

# Create directory to store the dataset
os.makedirs("celeba_dataset", exist_ok=True)

# Download CelebA dataset
url = "https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/celeba.zip"
file_path = "celeba_dataset/celeba.zip"
urllib.request.urlretrieve(url, file_path)

# Extract the downloaded zip file
import zipfile
with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall("celeba_dataset")

# Define paths for images and annotations
image_dir = "celeba_dataset/img_align_celeba/"
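
Re-running the cell above downloads the archive again each time. The following is a minimal sketch of one way to skip the download when the file is already present. `download_if_missing` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(url, file_path):
    """Download url to file_path unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(file_path):
        return False
    # Create the parent directory if the path has one
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    urllib.request.urlretrieve(url, file_path)
    return True
```

Calling `download_if_missing(url, file_path)` in place of the `urlretrieve` line would leave the rest of the cell unchanged.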


### 2. Preprocess the images and create labels


In [None]:
import cv2
import numpy as np

# Read images and create labels
images = []
labels = []

for img_file in sorted(os.listdir(image_dir))[:30]:  # Taking first 30 images (sorted by name) for demonstration
    img_path = os.path.join(image_dir, img_file)
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None for files it cannot read
        continue
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB
    images.append(img)
    labels.append(img_file.split(".")[0])  # Assuming filenames are unique identifiers

# Convert lists to numpy arrays
images = np.array(images)
labels = np.array(labels)
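
The label logic above keeps everything before the first dot in the filename. As a small sketch of an alternative, `os.path.splitext` strips only the final extension, and a filter on the extension keeps non-image files out of the list. The helper names and the extension set below are assumptions made for illustration.

```python
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def label_from_filename(img_file):
    """Return the filename without its final extension."""
    return os.path.splitext(img_file)[0]


def select_image_files(file_names, limit=30):
    """Keep image files only, sorted by name, up to `limit` entries."""
    image_files = [
        name for name in sorted(file_names)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    ]
    return image_files[:limit]


print(select_image_files(["000002.jpg", "notes.txt", "000001.jpg"]))
print(label_from_filename("000001.jpg"))
```

Sorting before slicing makes the selection reproducible, since `os.listdir` returns entries in arbitrary order.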


### 3. Normalization

We apply two feature engineering techniques that suit the image dataset we acquired: normalization and data augmentation.

In [None]:
# Normalization
images_normalized = images.astype('float32') / 255.0
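
A quick way to see what this step does is to run it on a small synthetic array. The sketch below uses made-up pixel values rather than the CelebA images and checks the resulting dtype and value range.

```python
import numpy as np

# Synthetic stand-in for a batch of two 4x4 RGB images (uint8, values 0..255)
sample = np.array([0, 128, 255], dtype=np.uint8)
sample_images = np.resize(sample, (2, 4, 4, 3))

# Same operation as in the cell above: cast to float32, then divide by 255
sample_normalized = sample_images.astype("float32") / 255.0

print(sample_normalized.dtype)
print(sample_normalized.min(), sample_normalized.max())
```

Casting before dividing keeps the result in `float32`, which halves the memory used compared with NumPy's default `float64`.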


### 4. Data Augmentation

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation
datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.2,
                             height_shift_range=0.2, shear_range=0.2,
                             zoom_range=0.2, horizontal_flip=True,
                             fill_mode='nearest')

augmented_images = []
for img in images_normalized:
    img = np.expand_dims(img, axis=0)
    augmented_img = next(datagen.flow(img, batch_size=1))[0]  # built-in next() avoids relying on the iterator's .next() method
    augmented_images.append(augmented_img)

augmented_images = np.array(augmented_images)
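
To make one of these transformations concrete: when the `horizontal_flip=True` option fires, the image is mirrored left to right. Below is a minimal NumPy sketch of that single operation on a made-up array. It illustrates the idea only and is not a description of how Keras implements it internally.

```python
import numpy as np

# Tiny 1x3 "image" with 3 channels so the column order is easy to see
tiny = np.arange(9, dtype="float32").reshape(1, 3, 3)

flipped = tiny[:, ::-1, :]  # reverse the width axis (axis 1)

print(tiny[0, :, 0])     # first channel, original column order
print(flipped[0, :, 0])  # first channel, mirrored column order
```

Flipping twice returns the original image, which is a convenient sanity check for this kind of transformation.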


### 5. Store Everything in Dataset format
In this case, we have used the json library.

In [None]:
print(labels.dtype)

<U6


In [None]:
labels = np.array(labels)

In [None]:
labels = labels.astype(str)

In [None]:
# prompt: Let us export the dataset in coco format

import json

# Create a dictionary to store the COCO format data
coco_data = {
    "images": [],
    "annotations": [],
    "categories": [
        {
            "id": 1,
            "name": "face",
        }
    ]
}

# Process images and create image entries
image_id = 0
for image in augmented_images:
    image_entry = {
        "id": image_id,
        "width": image.shape[1],
        "height": image.shape[0],
        "file_name": f"image_{image_id}.jpg",
    }
    coco_data["images"].append(image_entry)
    image_id += 1

# Process labels and create annotation entries
annotation_id = 0
for i, label in enumerate(labels):
    height, width = augmented_images[i].shape[:2]  # Use this image's own size
    bbox = [0, 0, width, height]  # Assuming the whole image is the object
    annotation_entry = {
        "id": annotation_id,
        "image_id": i,
        "category_id": 1,  # Assuming all images belong to the "face" category
        "bbox": bbox,
        "iscrowd": 0,
    }
    coco_data["annotations"].append(annotation_entry)
    annotation_id += 1

# Save COCO data to a JSON file
with open("celeba_dataset.json", "w") as f:
    json.dump(coco_data, f)
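
After writing the file, it can be worth reloading it and checking that the entries are consistent with one another. The sketch below is one possible set of basic checks, not a full COCO validator, and `check_coco_file` is a hypothetical helper name.

```python
import json


def check_coco_file(path):
    """Run basic consistency checks on a COCO-style JSON file.

    Returns a list of problem descriptions (empty if none were found).
    """
    with open(path) as f:
        data = json.load(f)

    images_by_id = {img["id"]: img for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}
    problems = []

    for ann in data["annotations"]:
        img = images_by_id.get(ann["image_id"])
        if img is None:
            problems.append(f"annotation {ann['id']}: unknown image_id")
            continue
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            problems.append(f"annotation {ann['id']}: bbox outside image")
    return problems
```

For the file written above, `check_coco_file("celeba_dataset.json")` would be expected to return an empty list, since every annotation covers exactly one listed image.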


### The above code converts the dataset into COCO format.

For this part of the assignment, the code accomplishes the following tasks:

1. **Loading and Preprocessing Images**:
   - It starts by loading the first 30 images from the CelebA dataset using OpenCV's `cv2.imread()` function.
   - Then, it converts the images from BGR format to RGB format using `cv2.cvtColor()`, because OpenCV reads images in BGR order while most other libraries expect RGB.
   - The images are not resized; the code relies on the aligned CelebA images sharing the same dimensions so that they stack into a single NumPy array.
   - Each processed image is appended to the `images` list, and the corresponding filename (without extension) is added to the `labels` list.

2. **Normalization**:
   - The loaded images are cast to `float32` and divided by 255. This scales pixel values from [0, 255] to the range [0, 1], which is a common preprocessing step in computer vision tasks.

3. **Data Augmentation**:
   - For data augmentation, the code uses Keras's `ImageDataGenerator`, configured with random rotations of up to 20 degrees, width and height shifts of up to 20%, shear and zoom (both with a range of 0.2), and horizontal flips.
   - Pixels left empty by a transformation are filled using `fill_mode='nearest'`.
   - One randomly augmented version of each normalized image is generated with `datagen.flow()`.
   - The augmented images are then added to the `augmented_images` list.

4. **Dimensionality Reduction (PCA)**:
   - PCA (Principal Component Analysis) is not applied in the code above; it is an optional further step that reduces the dimensionality of the normalized images while preserving most of their variance.
   - The `PCA` class from scikit-learn can be used for this purpose on images flattened to one row per image. The number of components to keep cannot exceed the number of samples, so with the 30 images loaded here it must be at most 30 rather than 50.
   - The transformed data can be mapped back to the original image dimensions with `inverse_transform` followed by a reshape.

5. **Storing the Dataset**:
   - Finally, the image metadata (ids, dimensions, and file names) and one whole-image bounding-box annotation per image, all in the single `face` category, are written to `celeba_dataset.json` in COCO format using the `json` library. The pixel arrays themselves are not stored in this file.



# PROBLEMS ENCOUNTERED WHILE IMPLEMENTING THIS FIRST PART

1. Finding a suitable dataset with the required data was time consuming because of the wide variety of datasets available.

2. It took time to understand the libraries' documentation. Some libraries were not easy to understand. For example, `tensorflow.keras.preprocessing.image`, which we used for augmentation, was hard to follow because we lacked experience with computer vision datasets.

3. We had to try different export formats for our dataset because some did not work as intended, for example when we used the `h5py` library for computer vision dataset formats.
