# Data preparation


To build a good classification model, we first need to prepare the data so that every image has the same size. We will rescale all images to the smallest size found in the dataset, so the first question is: what size is that?

## 1. Data inspection

In this part, we write a script that finds the minimum image size across both the train and test folders. The sizes are stored in dictionaries keyed by class, so that at the end we can print them and choose the smallest one.

In [11]:
import os
import shutil
import cv2
import skimage.feature
import numpy as np
import matplotlib.pyplot as plt
from Utils import calculate_min_size, resize_images_in_folder
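
`Utils.py` is not shown in this notebook, so the following is only a minimal sketch of what a helper like `calculate_min_size` might do, not the actual implementation. The names `calculate_min_size_sketch`, `read_shape_with_cv2`, and `IMAGE_EXTENSIONS`, the `read_shape` parameter, and the `(None, None)` return value for an empty folder are all assumptions made for illustration.

```python
import os

# Assumed set of image file extensions; the real helper may accept others
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')


def read_shape_with_cv2(path):
    """Return (height, width) of an image file, or None if it cannot be read."""
    import cv2  # imported lazily: only needed when this reader is used
    image = cv2.imread(path)
    return None if image is None else image.shape[:2]


def calculate_min_size_sketch(folder_path, read_shape=read_shape_with_cv2):
    """Return (min_width, min_height) over the readable images in folder_path.

    Returns (None, None) if the folder contains no readable image.
    """
    min_width = min_height = None
    for name in os.listdir(folder_path):
        # Skip anything that does not look like an image file
        if not name.lower().endswith(IMAGE_EXTENSIONS):
            continue
        shape = read_shape(os.path.join(folder_path, name))
        if shape is None:
            continue
        height, width = shape
        min_width = width if min_width is None else min(min_width, width)
        min_height = height if min_height is None else min(min_height, height)
    return min_width, min_height
```

Note that the minimum width and the minimum height are tracked independently, so they may come from two different images.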

In [12]:
import os
import cv2

# Specify the paths to the main folders containing subfolders for train and test
train_folder_path = 'data/train'
test_folder_path = 'data/test'

# Initialize dictionaries to store the minimum width and height for each class
min_width_by_class = {}
min_height_by_class = {}

# Process the train folder
for class_folder in os.listdir(train_folder_path):
    class_folder_path = os.path.join(train_folder_path, class_folder)
    min_width, min_height = calculate_min_size(class_folder_path)
    
    min_width_by_class[class_folder] = min_width
    min_height_by_class[class_folder] = min_height

# Process the test folder
for class_folder in os.listdir(test_folder_path):
    class_folder_path = os.path.join(test_folder_path, class_folder)
    min_width, min_height = calculate_min_size(class_folder_path)
    
    # Check if the class already exists in the dictionary
    if class_folder in min_width_by_class:
        # Update with the minimum values
        min_width_by_class[class_folder] = min(min_width_by_class[class_folder], min_width)
        min_height_by_class[class_folder] = min(min_height_by_class[class_folder], min_height)
    else:
        # Create new entries for the class
        min_width_by_class[class_folder] = min_width
        min_height_by_class[class_folder] = min_height

# Print the minimum width and height for each class from both folders
for class_name in min_width_by_class:
    min_width = min_width_by_class[class_name]
    min_height = min_height_by_class[class_name]
    print(f"Class: '{class_name}', Minimum Width: {min_width}, Minimum Height: {min_height}")


Class: 'bedroom', Minimum Width: 200, Minimum Height: 200
Class: 'Coast', Minimum Width: 200, Minimum Height: 200
Class: 'Forest', Minimum Width: 200, Minimum Height: 200
Class: 'Highway', Minimum Width: 200, Minimum Height: 200
Class: 'industrial', Minimum Width: 200, Minimum Height: 200
Class: 'Insidecity', Minimum Width: 200, Minimum Height: 200
Class: 'kitchen', Minimum Width: 200, Minimum Height: 200
Class: 'livingroom', Minimum Width: 200, Minimum Height: 200
Class: 'Mountain', Minimum Width: 200, Minimum Height: 200
Class: 'Office', Minimum Width: 200, Minimum Height: 200
Class: 'OpenCountry', Minimum Width: 200, Minimum Height: 200
Class: 'store', Minimum Width: 200, Minimum Height: 200
Class: 'Street', Minimum Width: 200, Minimum Height: 200
Class: 'Suburb', Minimum Width: 200, Minimum Height: 200
Class: 'TallBuilding', Minimum Width: 200, Minimum Height: 200
Class: 'coast', Minimum Width: 200, Minimum Height: 200
Class: 'forest', Minimum Width: 200, Minimum Height: 200
Class:

Conclusion: all images should be resized to 200x200, since 200 is the minimum width and the minimum height reported for every class listed above.
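
The listing above shows both 'Coast' and 'coast' (and 'Forest' and 'forest') as separate entries, which suggests that some class folders differ only in letter case between train and test. One way to merge such entries is to compare class names case-insensitively. The sketch below is an illustration under that assumption: `merge_min_sizes` is a hypothetical helper, it stores `(width, height)` tuples instead of the two dictionaries used in the notebook, and the sizes in the example are made up.

```python
def merge_min_sizes(*size_maps):
    """Merge {class_name: (min_width, min_height)} maps, ignoring letter case."""
    merged = {}
    for size_map in size_maps:
        for class_name, (width, height) in size_map.items():
            key = class_name.casefold()  # 'Coast' and 'coast' share one key
            if key in merged:
                old_width, old_height = merged[key]
                merged[key] = (min(old_width, width), min(old_height, height))
            else:
                merged[key] = (width, height)
    return merged


# Made-up sizes, only to show the merge
train_sizes = {'Coast': (256, 256), 'bedroom': (200, 267)}
test_sizes = {'coast': (220, 300), 'bedroom': (210, 200)}

merged = merge_min_sizes(train_sizes, test_sizes)
target_width = min(width for width, _ in merged.values())
target_height = min(height for _, height in merged.values())
print(merged, target_width, target_height)
```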

## 2. Rescaling images

Now that we know the target size, we can resize the images. This is done with the function resize_images_in_folder, from Utils.py.

In [13]:
# Specify the paths to the main folders containing subfolders for train and test
folder_paths = ['data/train', 'data/test']

# Specify the target size
target_width = 200
target_height = 200

# Resize images in both train and test folders
for folder_path in folder_paths:
    for class_folder in os.listdir(folder_path):
        class_folder_path = os.path.join(folder_path, class_folder)
        resize_images_in_folder(class_folder_path, target_width, target_height)

Now, we have resized all images to 200x200. Great! Let's find out if there are any more issues related to our data.

## 3. Data cleaning

We noticed something strange in the test folder: there is a 'livingRoom (Case Conflict)' folder whose images look identical to those in the original 'livingRoom' folder. We should check that. For each file in the conflict folder, we look for a file with the same name in the original folder. If one exists, we compare the two image matrices by subtracting them and counting the non-zero entries. Any non-zero entry means that a pixel differs between the two images, so the images are different.
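
The comparison described above can be tried on small synthetic arrays before touching the folders. The sketch below is a NumPy-only illustration, not the notebook's code: the helper names `images_identical` and `count_differing_pixels` are hypothetical. It widens the pixels to a signed type before subtracting, because unsigned 8-bit subtraction cannot represent negative differences (NumPy wraps around, and `cv2.subtract` clips at 0).

```python
import numpy as np


def images_identical(image1, image2):
    """True if both images were loaded and match in shape and in every pixel."""
    if image1 is None or image2 is None:
        return False
    return image1.shape == image2.shape and bool(np.array_equal(image1, image2))


def count_differing_pixels(image1, image2):
    """Count pixel positions of two same-shaped colour images where any channel differs."""
    # Signed 16-bit values can hold every difference between two uint8 pixels
    diff = image1.astype(np.int16) - image2.astype(np.int16)
    return int(np.count_nonzero(np.any(diff != 0, axis=-1)))


# Synthetic 4x4 three-channel images standing in for two files with the same name
original = np.full((4, 4, 3), 100, dtype=np.uint8)
duplicate = original.copy()
brighter = original.copy()
brighter[1, 2, 0] = 105  # one channel of one pixel is brighter

print(images_identical(original, duplicate))
print(images_identical(original, brighter))
print(count_differing_pixels(original, brighter))
```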

In [14]:
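test_living_room = 'data/test/livingRoom'
test_living_room2 = 'data/test/livingRoom (Case Conflict)'

# File names present in the original folder, for quick membership checks
original_names = set(os.listdir(test_living_room))

identical_count = 0
for filename in os.listdir(test_living_room2):
    # Skip files that have no counterpart in the original folder
    if filename not in original_names:
        continue
    print('Found: ' + filename)
    image1 = cv2.imread(os.path.join(test_living_room, filename))
    image2 = cv2.imread(os.path.join(test_living_room2, filename))
    if image1 is None or image2 is None:
        # cv2.imread returns None when a file cannot be decoded
        print("Could not read one of the images.")
    elif image1.shape != image2.shape:
        print("The images have different shapes.")
    else:
        # Compare pixel-wise equality. cv2.absdiff is used instead of
        # cv2.subtract, which clips negative results to 0 and would hide
        # pixels where image2 is brighter than image1.
        difference = cv2.absdiff(image1, image2)
        b, g, r = cv2.split(difference)

        if cv2.countNonZero(b) == 0 and cv2.countNonZero(g) == 0 and cv2.countNonZero(r) == 0:
            print("The images are identical.")
            identical_count += 1
        else:
            print("The images are not identical.")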



FileNotFoundError: [WinError 3] The system cannot find the path specified: 'data/test/livingRoom (Case Conflict)'

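On the original run, every comparison reported that the images were identical; the FileNotFoundError shown above appears to come from re-running the cell after the folder had already been deleted. We don't want duplicated data in either the train or the test set, so we delete the whole folder.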

In [None]:
shutil.rmtree(test_living_room2)

Finally, we can say that our data is clean and ready to be passed to a model.

## 4. Feature visualization

In this stage, we extract some features and visualize them to get an idea of which ones we should pick. We'll work with just one photo.

In [None]:
image = cv2.imread('data/train/bedroom/image_0001.jpg')


In [None]:
import cv2
from skimage.feature import canny, corner_fast, corner_harris, hessian_matrix, hog, local_binary_pattern
import matplotlib.pyplot as plt


# Convert the BGR image loaded by cv2.imread to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply Canny edge detection
canny_edges = canny(gray_image)

# Apply FAST corner detection
fast_corners = corner_fast(gray_image)

# Apply Harris corner detection
harris_corners = corner_harris(gray_image)

# Compute the Hessian matrix elements; for a 2D image this is the list [Hrr, Hrc, Hcc]
hessian_det = hessian_matrix(gray_image)

# Apply HOG (Histogram of Oriented Gradients)
hog_features, hog_image = hog(gray_image, visualize=True)

# Apply LBP (Local Binary Pattern)
radius = 3
n_points = 8 * radius
lbp_features = local_binary_pattern(gray_image, n_points, radius, method='uniform')

# Create subplots to display the transformed images
plt.figure(figsize=(18, 18))

# Original Image
plt.subplot(3, 4, 1)
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.title('Original Image')

# Canny Edge Detection
plt.subplot(3, 4, 2)
plt.imshow(canny_edges, cmap='gray')
plt.title('Canny Edge Detection')

# FAST Corner Detection
plt.subplot(3, 4, 3)
plt.imshow(fast_corners, cmap='gray')
plt.title('FAST Corner Detection')

# Harris Corner Detection
plt.subplot(3, 4, 4)
plt.imshow(harris_corners, cmap='gray')
plt.title('Harris Corner Detection')

# Hessian Matrix
plt.subplot(3, 4, 5)
plt.imshow(hessian_det[0], cmap='gray')
plt.title('Hessian Matrix')

# HOG Features
plt.subplot(3, 4, 6)
plt.imshow(hog_image, cmap='gray')
plt.title('HOG Features')

# LBP Features
plt.subplot(3, 4, 7)
plt.imshow(lbp_features, cmap='gray')
plt.title('LBP Features')

# Display the subplots
plt.tight_layout()
plt.show()
