# Cats on That

This notebook contains the appendices submitted by Team 5 for the USD MSAAI Image Processing final project on 12/12/2022.

Team 5 consists of the following members:


1.   Ian Timmons
2.   Ian Feekes
3.   Jester Ugalde
4.   Yevginiya Okuneva

## Initial Configuration

The following cells perform initial configuration for the notebook: importing libraries, setting global variables, and mounting files and repositories as necessary.

### Imports

In [27]:
import keras
import tensorflow as tf
from keras.datasets import mnist, cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation, MaxPooling2D, Conv2D, BatchNormalization
import matplotlib.pyplot as plt
from keras.utils import np_utils
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras import backend as K
from PIL import Image, ImageChops, ImageFilter

# File operations
import os
import os.path
from pathlib import Path
import glob

# Image processing and displaying operations
import cv2

# Dataframe and series operations
import pandas as pd
import numpy as np

from keras.utils.vis_utils import plot_model

# Create callback to stop training early once the loss stops improving
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

### Global Variables

In [28]:
# Kaggle Token Path on Google Drive
kaggleTokenPath = '/content/drive/My Drive/Colab Notebooks/Image_Processing/kaggle.json'
datasetDownloadKaggle = True

# Path used for data exploration
GeorgianArchitectureImagesPath = '/content/train/arcDataset/Georgian architecture'
# Path used to contain data
dataDirPath = '/content/train/arcDataset'

### Mounting Drive

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Configuring & Installing Kaggle

The architecture dataset can be stored locally or downloaded from Kaggle. Running the Kaggle download commands requires first installing the kaggle CLI.
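The choice between a local copy and a Kaggle download can be made explicit with a small check. The following is a minimal sketch, not part of the original notebook: `dataset_is_available` and `should_download` are hypothetical helpers, and the flag is assumed to play the role of the `datasetDownloadKaggle` global defined above.

```python
import os

def dataset_is_available(data_dir):
    """Return True when data_dir exists and contains at least one entry."""
    return os.path.isdir(data_dir) and len(os.listdir(data_dir)) > 0

def should_download(data_dir, download_flag):
    """Download only when downloads are enabled and no local copy exists."""
    return download_flag and not dataset_is_available(data_dir)
```

For example, `should_download(dataDirPath, datasetDownloadKaggle)` could guard the download cells so that they are skipped when the data is already in place.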

In [32]:
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Now that the Kaggle CLI is installed, the only remaining requirement is an API access token. Generate one from the Kaggle Account page by clicking "generate API key", then move the downloaded kaggle.json file to a chosen location in Google Drive.

https://www.kaggle.com/general/74235
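The copy and permission steps can also be done from Python, which makes a missing token fail with a clear message before the kaggle CLI is invoked. This is a minimal sketch under stated assumptions rather than the notebook's own method: `install_kaggle_token` is a hypothetical helper, and the default destination assumes the CLI looks in `~/.kaggle`, as the error message later in this notebook indicates.

```python
import os
import shutil
import stat

def install_kaggle_token(token_path, dest_dir=os.path.expanduser("~/.kaggle")):
    """Copy kaggle.json into dest_dir and restrict it to owner read/write.

    Hypothetical helper: the name and defaults are assumptions, not part of
    the Kaggle CLI.
    """
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found at {token_path}")
    # exist_ok avoids the "File exists" error seen with a plain mkdir
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, "kaggle.json")
    shutil.copyfile(token_path, dest_path)
    # Equivalent to chmod 600: readable and writable by the owner only
    os.chmod(dest_path, stat.S_IRUSR | stat.S_IWUSR)
    return dest_path
```

In this notebook the call would presumably be `install_kaggle_token(kaggleTokenPath)`, using the global defined earlier.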

In [35]:
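# Shell commands need a leading "!" in a notebook cell; without it the line is
# parsed as Python and raises a SyntaxError
!kaggle datasets download -d wwymak/architecture-dataset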

In [33]:
!mkdir -p ~/.kaggle


In [34]:
!cp '/content/drive/My Drive/Colab Notebooks/Image_Processing/kaggle.json' ~/.kaggle/

cp: cannot stat '/content/drive/My Drive/Image_Processing/kaggle.json': No such file or directory


In [7]:
!chmod 600 ~/.kaggle/kaggle.json

chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


Download the actual dataset:

In [8]:
!kaggle datasets download -d wwymak/architecture-dataset

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.8/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.8/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


Create the training directory and extract the downloaded data into it:

In [9]:
!mkdir '/content/train'

In [21]:
#!unzip train.zip -d train

unzip:  cannot find or open train.zip, train.zip.zip or train.zip.ZIP.


In [22]:
!unzip /root/.kaggle/architecture-dataset -d /content/train

unzip:  cannot find or open /root/.kaggle/architecture-dataset, /root/.kaggle/architecture-dataset.zip or /root/.kaggle/architecture-dataset.ZIP.
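The extraction step can also be written in Python so that a missing archive is reported before anything is unpacked. This is a minimal sketch using the standard library `zipfile` module; `extract_dataset` is a hypothetical helper, and the archive location is left as a parameter because the notebook's outputs do not confirm where the download lands.

```python
import os
import zipfile

def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted top-level names."""
    # is_zipfile returns False for missing files as well as non-zip files
    if not zipfile.is_zipfile(zip_path):
        raise FileNotFoundError(f"No zip archive found at {zip_path}")
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```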


## EDA and Pre-Processing

Include a clear discussion that ensures all steps are clearly explained and addresses the following:
How did you make sure that you are ready to apply deep learning models?
What type of pre-processing is required on your data?
How can you define and refine various feature variables that you may potentially use for the modeling?
Have additional features been added to demonstrate the necessary image processing, image preparation, or image access for later AI computation?


In [11]:
# List of file/directory names which do not have data of interest
# These files describe architectural relationships and time frames, which are
#    not currently relevant to our application
badSubDirNames = ['ReadMe~', 'ReadMe', 'arcRelationship25.txt', 'arcNames25.txt',
                  'relationship.txt']

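# Get all entries in our data directory, skipping the ones without image data
subDirNames = [name for name in os.listdir(dataDirPath)
               if name not in badSubDirNames]

# Filtering instead of list.remove avoids a ValueError when a listed name is absent
assert all(name not in subDirNames for name in badSubDirNames)

subDirNames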

FileNotFoundError: ignored

In [25]:
# Collect one series of jpg paths per directory, keyed by directory name
columnsByDirectory = {}

for directoryName in subDirNames:
  fullPath = os.path.join(dataDirPath, directoryName)
  # Load files from the dataset
  fileList = sorted(Path(fullPath).glob("*.jpg"))
  # Convert the list of file paths into a series of strings named after the directory
  columnsByDirectory[directoryName] = pd.Series([str(p) for p in fileList], name=directoryName, dtype=object)

# Building the frame from a dict pads shorter columns with NaN instead of
# truncating longer ones to the length of the first column
MAIN_ARC = pd.DataFrame(columnsByDirectory)

# Data frame should have a column for each directory and should have many rows
assert MAIN_ARC.shape[0] > 0 and MAIN_ARC.shape[1] == len(subDirNames)

# Show the first 5 rows of our new data frame object
MAIN_ARC.head(5)

NameError: ignored

In [None]:
# Count the total number of image paths across all columns (NaN padding is excluded)
numImages = 0
for col in MAIN_ARC.columns:
  numImages = numImages + MAIN_ARC[col].count()

print(numImages)

Dropping bad entries

The following cell shows that the images do not all have the same dimensions, so they will need to be resized and normalized.

In [None]:
# Declare lists to illustrate the different widths and heights in the images
differentWidths = []
differentHeights = []

figure,axis = plt.subplots(3,3,figsize=(14,14))

for i_ind,i_ops in enumerate(axis.flat):
    # Process current frame as opencv structure and convert the proper color format
    IMAGE_READING = cv2.cvtColor(cv2.imread(MAIN_ARC[MAIN_ARC.columns[2]][i_ind]),cv2.COLOR_BGR2RGB)
    # Append dimensions to the array list for data analysis - all images should be rgb
    if IMAGE_READING.shape[0] not in differentHeights:
      differentHeights.append(IMAGE_READING.shape[0])
    if IMAGE_READING.shape[1] not in differentWidths:
      differentWidths.append(IMAGE_READING.shape[1])
    i_ops.axis("off")
    i_ops.imshow(IMAGE_READING)
    
plt.tight_layout()
plt.show()    

# Show that the images have different dimensions and will need to be resized
print()
print("The number of different widths for the first 9 images is:", len(differentWidths))
print("The number of different heights for the first 9 images is:", len(differentHeights))

In [None]:
# Start from the shape of the first image in the second column
smallestImageShape = cv2.imread(MAIN_ARC[MAIN_ARC.columns[1]][0]).shape
# OpenCV shapes are (height, width, channels): index 0 is the height, index 1 the width
smallestX = smallestImageShape[0]
smallestY = smallestImageShape[1]
# Iterate through the data frame to find the smallest height and width, which
# bound the largest resolution every image can be resized down to
for col in MAIN_ARC.columns:
  for row in MAIN_ARC[col]:
    # Columns for directories with fewer files are padded with NaN, so skip non-path entries
    if isinstance(row, str) and row != 'nan':
      image = cv2.imread(row)
      # cv2.imread returns None for files it cannot read
      if image is None:
        continue
      smallestX = min(smallestX, image.shape[0])
      smallestY = min(smallestY, image.shape[1])

smallestImageShape = (smallestX, smallestY)
smallestImageShape
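Once a common target shape such as `smallestImageShape` is known, each image can be resized to it and its pixel values scaled to the range [0, 1]. The sketch below illustrates the idea under stated assumptions: `resize_nearest` and `preprocess` are hypothetical helpers, and nearest-neighbour sampling in NumPy is swapped in purely to keep the example dependency-light. In this notebook, `cv2.resize` would be the more natural choice.

```python
import numpy as np

def resize_nearest(image, height, width):
    """Resize an H x W x C array to height x width by nearest-neighbour sampling."""
    # Map each output row/column back to a source row/column index
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows[:, None], cols]

def preprocess(image, shape):
    """Resize to shape (height, width) and scale uint8 pixels to [0, 1]."""
    resized = resize_nearest(image, shape[0], shape[1])
    return resized.astype(np.float32) / 255.0
```

The target shape is given as (height, width), matching the order of `image.shape[0]` and `image.shape[1]` used when computing the smallest shape.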

## Modeling Methods, Validation, and Performance Metrics

Perform modeling using the training dataset.
Evaluate the model(s) using the test dataset and validate as well.
Ensure all modeling methods are well-motivated, correctly implemented, and, to the extent appropriate, span the range of methods discussed in this course.
Cross-validation and/or held-out test sets are used in accordance with best practices to assess model performance.
Performance metrics are carefully tailored to the project objectives.
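A held-out test set can be prepared by splitting the image paths of each class separately, so that every architectural style appears in the training, validation, and test sets in similar proportions. This is a minimal sketch and not the team's confirmed method: `split_paths`, its default fractions, and the input dictionary (class name mapped to a list of image paths) are all assumptions for illustration.

```python
import random

def split_paths(paths_by_class, val_frac=0.15, test_frac=0.15, seed=0):
    """Split each class's paths into train/val/test lists of (path, label)."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label, paths in paths_by_class.items():
        shuffled = list(paths)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        n_val = int(len(shuffled) * val_frac)
        splits["test"] += [(p, label) for p in shuffled[:n_test]]
        splits["val"] += [(p, label) for p in shuffled[n_test:n_test + n_val]]
        splits["train"] += [(p, label) for p in shuffled[n_test + n_val:]]
    return splits
```

Splitting per class rather than over the pooled list matters here because the directories hold differing numbers of images.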


## Modeling Results and Findings

Discuss the results comparing different models and explain the differences and the challenges.
Ensure all project objectives are fully met, findings are clearly presented, and question(s) are technically addressed in the report.
Include tables/graphs comparing the different models, including their characteristics, performance, and accuracy.