## Dataset Summary

This dataset contains over 14,000 training images to be classified into 6 distinct categories. Here’s a quick breakdown:
- Total Images: 14,000+ in the training split (14,034), plus 3,000 test images and 7,301 unlabeled prediction images.
- Classes: 6 categories, each representing a unique label for classification.
- Image Format: JPEG (the loading code reads `*.jpg` files).
- Size and Quality: Varies slightly, but most images are about `150x150` pixels with 3 color channels.

### Typical Categories in Image Classification

Common image classification categories include:
- Natural Scenes: Different environments (e.g., mountains, forests).
- Urban and Rural Scenes: Different landscapes (e.g., streets, buildings).
- Objects: Specific items within scenes.

The six classes in this dataset are `buildings`, `forest`, `glacier`, `mountain`, `sea`, and `street`.

#### Use Case:

A model developed from this dataset would likely employ a convolutional neural network (CNN), since CNNs are effective at extracting image features and learning spatial hierarchies. The roughly 98% accuracy cited for the Intel image classification example is the target for that approach.

## Import Libraries

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import glob as gb
import cv2
import tensorflow as tf
import keras
import os
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from tensorflow.keras.preprocessing.image import ImageDataGenerator



## Import Data And Preprocessing

In [2]:
train_path = 'archive/seg_train/'
test_path = 'archive/seg_test/'
pred_path = 'archive/seg_pred/'

In [4]:
def open_folders(path, file, name = 'Training Data'):
    # Print the number of .jpg images in each class folder under path + file
    for folder in os.listdir(path + file):
        files = gb.glob(pathname = path + file + '/' + folder + '/*.jpg')
        print(f'For {name} : Found {len(files)} images in folder {folder}')

print('-' * 40 + ' Training Data ' + '-' * 46)
open_folders(train_path, 'seg_train')
print('\n' + '-' * 40 + ' Test Data ' + '-' * 50)
open_folders(test_path, 'seg_test', name = 'Test Data')
print('\n' +'-' * 40 + ' Prediction Data ' + '-' * 44)
files = gb.glob(pathname = pred_path + 'seg_pred' + '/*.jpg')
print(f'For Prediction Data : Found {len(files)} images in folder Prediction')

---------------------------------------- Training Data ----------------------------------------------
For Training Data : Found 2271 images in folder forest
For Training Data : Found 2191 images in folder buildings
For Training Data : Found 2404 images in folder glacier
For Training Data : Found 2382 images in folder street
For Training Data : Found 2512 images in folder mountain
For Training Data : Found 2274 images in folder sea

---------------------------------------- Test Data --------------------------------------------------
For Test Data : Found 474 images in folder forest
For Test Data : Found 437 images in folder buildings
For Test Data : Found 553 images in folder glacier
For Test Data : Found 501 images in folder street
For Test Data : Found 525 images in folder mountain
For Test Data : Found 510 images in folder sea

---------------------------------------- Prediction Data --------------------------------------------
For Prediction Data : Found 7301 images in folder Prediction
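
The per-folder counts printed above suggest the classes are roughly balanced. A quick check, using only the numbers from that output (the variable names here are illustrative):

```python
# Image counts per class, copied from the training-folder output above
train_counts = {
    'forest': 2271, 'buildings': 2191, 'glacier': 2404,
    'street': 2382, 'mountain': 2512, 'sea': 2274,
}

total = sum(train_counts.values())
# Share of the training set held by each class
shares = {name: count / total for name, count in train_counts.items()}
# Ratio between the largest and smallest class; close to 1 means balanced
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f'{name:<10} {share:.1%}')
print(f'total = {total}, largest/smallest = {imbalance_ratio:.2f}')
```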


### Shape of the images

Most of the images are approximately `150x150x3`. The models accept input of only one fixed dimension, so every image must be brought to a uniform size. We resize them to `100x100x3`, a size chosen to reduce memory use and feature count without losing significant information.
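
To put the resize in numbers, the arithmetic below compares the values per image before and after, assuming a source image of exactly 150x150x3 (the helper name is illustrative):

```python
def values_per_image(side, channels=3):
    """Number of pixel values in a square image with the given side length."""
    return side * side * channels

original = values_per_image(150)   # values in a 150x150x3 image
resized = values_per_image(100)    # values in a 100x100x3 image (the flattened feature count)
kept_fraction = resized / original

print(f'{original} -> {resized} values per image ({kept_fraction:.0%} of the original)')
```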

### Label mapping and image loading for each folder/class

In [5]:
code = {'buildings': 0, 'forest': 1, 'glacier': 2, 'mountain': 3, 'sea': 4, 'street': 5}

# Map a numeric class code back to its label name (returns None if the code is unknown)
def getcode(n):
    for x, y in code.items():
        if n == y:
            return x
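
`getcode` scans the whole dictionary on every call. An alternative sketch builds the inverse mapping once; `label_names` and `get_label` are hypothetical names, not part of the notebook:

```python
code = {'buildings': 0, 'forest': 1, 'glacier': 2, 'mountain': 3, 'sea': 4, 'street': 5}

# Invert the mapping once: numeric code -> class name
label_names = {index: name for name, index in code.items()}

def get_label(n):
    """Return the class name for a numeric code, or None if the code is unknown."""
    return label_names.get(int(n))

print(get_label(3))
```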

In [19]:
new_size = 100
def get_image_array(path, folder_name, new_size = new_size):
    # Load every .jpg under path + folder_name, resize it to new_size x new_size,
    # and return the images (BGR channel order, as read by cv2) with their class codes.
    X = []
    y = []
    if folder_name != 'seg_pred':
        # Labeled splits: one sub-folder per class
        for folder in os.listdir(path + folder_name):
            files = gb.glob(pathname= path + folder_name + '/' + folder + '/*.jpg')
            for file in files:
                image = cv2.imread(file)
                image_array = cv2.resize(image, (new_size, new_size))
                X.append(list(image_array))
                y.append(code[folder])
    else :
        # Prediction split: unlabeled images in a single folder, so y stays empty
        files = gb.glob(pathname= path + folder_name + '/*.jpg')
        for file in files:
            image = cv2.imread(file)
            image_array = cv2.resize(image, (new_size, new_size))
            X.append(list(image_array))
    return X, y

In [20]:
X_train, y_train = get_image_array(train_path, 'seg_train')
X_test, y_test = get_image_array(test_path, 'seg_test')
X_pred, _ = get_image_array(pred_path, 'seg_pred')

print('-' * 40 + ' Training Data ' + '-' * 46)
print(f'We Have {len(X_train)} Image In X_train')
print(f'We Have {len(y_train)} items In y_train ')

print('\n' +'-' * 40 + ' Test Data ' + '-' * 50)
print(f'We Have {len(X_test)} Image In X_test')
print(f'We Have {len(y_test)} items In y_test')

print('\n' +'-' * 40 + ' Prediction Data ' + '-' * 44)
print(f'We Have {len(X_pred)} Image In X_pred')

---------------------------------------- Training Data ----------------------------------------------
We Have 14034 Image In X_train
We Have 14034 items In y_train 

---------------------------------------- Test Data --------------------------------------------------
We Have 3000 Image In X_test
We Have 3000 items In y_test

---------------------------------------- Prediction Data --------------------------------------------
We Have 7301 Image In X_pred


In [21]:
X_train, y_train = np.array(X_train) , np.array(y_train) 
X_test, y_test = np.array(X_test) , np.array(y_test) 
X_pred  = np.array(X_pred) 

print(f'X_train shape  is {X_train.shape}') 
print(f'X_test shape  is {X_test.shape}')
print(f'y_train shape  is {y_train.shape}')
print(f'y_test shape  is {y_test.shape}')
print(f'X_pred shape  is {X_pred.shape}')

X_train shape  is (14034, 100, 100, 3)
X_test shape  is (3000, 100, 100, 3)
y_train shape  is (14034,)
y_test shape  is (3000,)
X_pred shape  is (7301, 100, 100, 3)
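
The shapes above also determine how much memory the arrays need. The estimate below assumes `uint8` pixels as loaded by `cv2.imread`, and `float32` (4 bytes per value) for any array converted to floating point; treat it as a back-of-the-envelope sketch with an illustrative helper name:

```python
def array_mib(shape, bytes_per_value):
    """Approximate size in MiB of a dense array with the given shape."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * bytes_per_value / 2**20

train_shape = (14034, 100, 100, 3)
uint8_mib = array_mib(train_shape, 1)
float32_mib = array_mib(train_shape, 4)

print(f'uint8:   {uint8_mib:.0f} MiB')
print(f'float32: {float32_mib:.0f} MiB')
```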


In [22]:
# Data Augmentation
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

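# Generate one randomly augmented copy of each training image, in order
# (X_train_augmented replaces the originals rather than adding to them)
augmented_images = []
augmented_labels = []

# datagen.flow yields batches indefinitely, so stop once one full pass is collected
for X_batch, y_batch in datagen.flow(X_train, y_train, batch_size=64, shuffle=False):
    augmented_images.append(X_batch)
    augmented_labels.append(y_batch)
    if len(augmented_images) * 64 >= len(X_train):
        break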

X_train_augmented = np.vstack(augmented_images)
y_train_augmented = np.hstack(augmented_labels)

# Flatten the images for traditional ML models
def flatten_images(X):
    return X.reshape(X.shape[0], -1)

X_train_flattened = flatten_images(X_train_augmented)
X_test_flattened = flatten_images(X_test)

## Random Forest

In [24]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_flattened, y_train_augmented)
rf_pred = rf_model.predict(X_test_flattened)
rf_accuracy = accuracy_score(y_test, rf_pred)
print(f'Random Forest Accuracy: {rf_accuracy}')

Random Forest Accuracy: 0.591
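
For context, it helps to compare this score with trivial baselines. Using the test-folder counts printed earlier, the sketch below computes the accuracy of always predicting the largest class and of uniform random guessing:

```python
# Test images per class, copied from the test-folder output earlier in the notebook
test_counts = {
    'forest': 474, 'buildings': 437, 'glacier': 553,
    'street': 501, 'mountain': 525, 'sea': 510,
}

total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)
majority_accuracy = test_counts[majority_class] / total  # always predict the largest class
chance_accuracy = 1 / len(test_counts)                   # uniform random guessing

print(f'majority ({majority_class}): {majority_accuracy:.3f}')
print(f'uniform chance: {chance_accuracy:.3f}')
```

Both baselines land below 0.2, so the Random Forest score above is well above chance.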


## Support Vector Machine

In [None]:
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_flattened, y_train_augmented)
svm_pred = svm_model.predict(X_test_flattened)
svm_accuracy = accuracy_score(y_test, svm_pred)
print(f'SVM Accuracy: {svm_accuracy}')
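
A linear-kernel `SVC` on 30,000 raw pixel features per image can take a long time to fit. One common alternative, shown here as a sketch on synthetic data rather than the notebook's images, is to standardize the features, reduce them with PCA, and fit `LinearSVC`. The component count and other settings are arbitrary assumptions, not tuned values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)

# Synthetic stand-in for flattened images: 3 classes, 200 features each
n_per_class, n_features = 60, 200
centers = rng.normal(scale=3.0, size=(3, n_features))
X_demo = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y_demo = np.repeat(np.arange(3), n_per_class)

# Scale -> project onto 20 principal components -> linear SVM
svm_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=42),
    LinearSVC(max_iter=5000, random_state=42),
)
svm_pipeline.fit(X_demo, y_demo)
demo_accuracy = svm_pipeline.score(X_demo, y_demo)
print(f'Training accuracy on synthetic data: {demo_accuracy:.2f}')
```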

## K Nearest Neighbors

In [None]:
# Train and evaluate k-NN
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_flattened, y_train_augmented)
knn_pred = knn_model.predict(X_test_flattened)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f'k-NN Accuracy: {knn_accuracy}')
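
For intuition about what `KNeighborsClassifier` does with the flattened pixels, here is a minimal NumPy sketch of k-NN with Euclidean distance and majority voting. It is a simplified illustration on toy 2-D points, not scikit-learn's implementation:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict a label for each query row by majority vote among its k nearest neighbors."""
    predictions = []
    for query in X_query:
        # Euclidean distance from this query to every training sample
        distances = np.linalg.norm(X_train - query, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Two well-separated toy clusters, one per class
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(X_toy, y_toy, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
print(toy_pred)
```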

In [None]:
# Generate classification report
# target_names must follow label order 0..5, which matches the insertion order of `code`
report_rf = classification_report(y_test, rf_pred, target_names=list(code.keys()))
report_svm = classification_report(y_test, svm_pred, target_names=list(code.keys()))
report_knn = classification_report(y_test, knn_pred, target_names=list(code.keys()))
print("Random Forest Classification Report:\n", report_rf)
print("SVM Classification Report:\n", report_svm)
print("k-NN Classification Report:\n", report_knn)

In [None]:
# Plot a row-normalized (percentage) confusion matrix as a heatmap
def plot_confusion_matrixPercentage(true_labels, pred_labels, class_names):
    cm = confusion_matrix(true_labels, pred_labels)
    
    # Normalize by the number of true labels in each row to get percentages
    cm_percentage = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100  # Convert to percentage
    cm_percentage = np.nan_to_num(cm_percentage)  # Replace NaN with 0 if division by zero occurs

    # Create custom annotations with a percent sign, rounded to the nearest whole percent
    annotations = np.array([[f'{value:.0f}%' for value in row] for row in cm_percentage])

    plt.figure(figsize=(8, 6))  
    sns.heatmap(cm_percentage, annot=annotations, fmt="", cmap="Blues", 
                xticklabels=class_names, yticklabels=class_names, cbar=False)
    
    plt.title("Confusion Matrix (Percentage)", fontsize=16)
    plt.xlabel("Predicted Labels", fontsize=12)
    plt.ylabel("True Labels", fontsize=12)
    plt.tight_layout()
    plt.show()

class_names = ['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']   
plot_confusion_matrixPercentage(y_test, rf_pred, class_names)
plot_confusion_matrixPercentage(y_test, svm_pred, class_names)
plot_confusion_matrixPercentage(y_test, knn_pred, class_names)

# Plot and save sample predictions
plt.figure(figsize=(30, 40))
for n, i in enumerate(list(np.random.randint(0, len(X_test), 36))):
    plt.subplot(6, 6, n+1)
    plt.imshow(cv2.cvtColor(X_test[i], cv2.COLOR_BGR2RGB))  # cv2 loads BGR; matplotlib expects RGB
    plt.axis('off')
    plt.title(f'Actual: {getcode(y_test[i])}\n Predict: {getcode(rf_pred[i])}', fontdict={'fontsize': 14, 'color': 'blue'})
plt.savefig('imagePrediction.png')
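
The row normalization inside `plot_confusion_matrixPercentage` can be checked on a small hand-made matrix. In the sketch below the counts are made up for illustration; each row is divided by its total so the diagonal reads as per-class recall in percent:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions
cm = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
])

row_totals = cm.sum(axis=1, keepdims=True)
cm_percentage = cm / row_totals * 100  # each row now sums to 100

per_class_recall = np.diag(cm_percentage)
print(cm_percentage)
print(per_class_recall)
```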

### Random Forest Accuracy: 0.556

#### Random Forest Classification Report:

| Class      | Precision | Recall | F1-Score | Support |
|------------|------------|--------|----------|---------|
| buildings  | 0.44       | 0.33   | 0.38     | 437     |
| forest     | 0.66       | 0.79   | 0.72     | 474     |
| glacier    | 0.54       | 0.58   | 0.56     | 553     |
| mountain   | 0.54       | 0.63   | 0.58     | 525     |
| sea        | 0.52       | 0.35   | 0.41     | 510     |
| street     | 0.58       | 0.64   | 0.61     | 501     |
| **accuracy** | **0.56** |        |          | 3000    |
| macro avg  | 0.55       | 0.55   | 0.54     | 3000    |
| weighted avg | 0.55     | 0.56   | 0.55     | 3000    |
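
The two summary rows can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the rounded values in the table, so small rounding differences are possible:

```python
# (f1, support) per class, copied from the Random Forest table above
rf_rows = {
    'buildings': (0.38, 437), 'forest': (0.72, 474), 'glacier': (0.56, 553),
    'mountain': (0.58, 525), 'sea': (0.41, 510), 'street': (0.61, 501),
}

total_support = sum(support for _, support in rf_rows.values())
# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in rf_rows.values()) / len(rf_rows)
# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in rf_rows.values()) / total_support

print(f'macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}')
```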

### k-NN Accuracy: 0.400

#### k-NN Classification Report:

| Class      | Precision | Recall | F1-Score | Support |
|------------|------------|--------|----------|---------|
| buildings  | 0.33       | 0.03   | 0.05     | 437     |
| forest     | 0.58       | 0.66   | 0.62     | 474     |
| glacier    | 0.47       | 0.45   | 0.46     | 553     |
| mountain   | 0.34       | 0.74   | 0.46     | 525     |
| sea        | 0.26       | 0.33   | 0.29     | 510     |
| street     | 0.71       | 0.14   | 0.24     | 501     |
| **accuracy** | **0.40** |        |          | 3000    |
| macro avg  | 0.45       | 0.39   | 0.35     | 3000    |
| weighted avg | 0.45     | 0.40   | 0.36     | 3000    |
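
Overall accuracy equals the support-weighted mean of per-class recall, which gives a quick consistency check on the table. The values below are the rounded recalls from the k-NN report, so the result is approximate:

```python
# (recall, support) per class, copied from the k-NN table above
knn_rows = {
    'buildings': (0.03, 437), 'forest': (0.66, 474), 'glacier': (0.45, 553),
    'mountain': (0.74, 525), 'sea': (0.33, 510), 'street': (0.14, 501),
}

total_support = sum(support for _, support in knn_rows.values())
# recall * support approximates the number of correctly classified images per class
correct = sum(recall * support for recall, support in knn_rows.values())
accuracy = correct / total_support

print(f'reconstructed accuracy = {accuracy:.3f}')
```

The low recall for `buildings` (0.03) and `street` (0.14) accounts for most of the gap to the Random Forest result.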