Project Description: You are given a dataset which contains satellite images from Texas after Hurricane Harvey. There are damaged and non-damaged building images organized into respective folders. You can find the project 3 dataset on the course GitHub repository here.

Your goal is to build multiple neural networks based on different architectures to classify images as containing buildings that are either damaged or not damaged. You will evaluate each of the networks you develop and select the “best” network to “deploy”. Note that this is a binary classification problem, where the goal is to classify whether the structure in the image has damage or does not have damage.

Part 1: (3 points) Data preprocessing and visualization

You will need to perform data analysis and pre-processing to prepare the images for training. At a minimum, you should:

Write code to load the data into Python data structures

Investigate the datasets to determine basic attributes of the images

Ensure data is split for training, validation and testing and perform any additional preprocessing (e.g., rescaling, normalization, etc.) so that it can be used for training/evaluation of the neural networks you will build in Part 2.
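One way to produce the three splits is to shuffle the labeled file paths once with a fixed seed and then slice the shuffled list. The sketch below is a minimal illustration using only the standard library; the function name `split_dataset`, the 70/15/15 proportions, the seed, and the example file names are all assumptions, not requirements of the assignment.

```python
import random

def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle items reproducibly and slice them into train/val/test lists.

    The fractions and seed are illustrative assumptions; whatever remains
    after the train and validation slices becomes the test split.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test split")
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical (path, label) pairs: 1 = damage, 0 = no damage
samples = [(f"damage/img_{i}.jpeg", 1) for i in range(60)] + \
          [(f"no_damage/img_{i}.jpeg", 0) for i in range(40)]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

This sketch does not stratify by class. If the two classes are imbalanced, applying the same function to each class separately and concatenating the results would keep the class ratio similar across the three splits.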

Part 2: (10 points) Model design, training and evaluation

You will explore different model architectures that we have seen in class, including:

A dense (i.e., fully connected) ANN

The LeNet-5 CNN architecture

The Alternate-LeNet-5 CNN architecture, described in the paper/excerpt (Table 1, page 12 of the research paper https://arxiv.org/pdf/1807.01688.pdf; note that the dataset is not the same as the one analyzed in the paper)

You are free to experiment with different variants of all three architectures above. For example, for the fully connected ANN, feel free to experiment with different numbers of layers and perceptrons. Train and evaluate each model you build, and select the “best” performing model.

Note that the input and output dimensions are fixed, as the inputs (images) and the outputs (labels) have been given. These have important implications for your architecture. Make sure you understand the constraints these impose before beginning to design and implement your networks. Failure to implement these correctly will lead to incorrect architectures and a significant penalty on the project grade.
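The constraint can be made concrete with a little arithmetic. A dense network has to flatten each image into a vector of height × width × channels values, and a single-unit output layer for binary classification adds one weight per input plus one bias. The sketch below assumes, purely for illustration, 128×128 RGB images; the real dimensions should be read from the dataset, and the helper names are hypothetical.

```python
def dense_input_size(height, width, channels):
    """Number of values in one image after flattening."""
    return height * width * channels

def dense_layer_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

image_shape = (128, 128, 3)  # ASSUMED shape, not taken from the dataset
n_inputs = dense_input_size(*image_shape)
print(n_inputs)                         # 49152
print(dense_layer_params(n_inputs, 1))  # 49153
```

The same calculation explains why the first dense layer dominates the parameter count of a fully connected network on image data: every additional unit in that layer costs one weight per pixel value.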

Note: You can also try to run the VGG-16 architecture from class; however, you may run into long runtimes and/or memory limits on the VM. Depending on the architecture you choose, you could also run into memory constraints with any of the other architectures. If you are hitting memory issues, you can try decreasing the batch_size parameter in the .fit() function, as described in the notes.

Part 3: (7 points) Model deployment

For the best model built in part 2, persist the trained model to disk so that it can be reconstituted easily. Develop a simple inference server to serve your trained model over HTTP. There should be at least two endpoints:

A model summary endpoint providing metadata about the model

An inference endpoint that can perform classification on an image. Note: this endpoint will need to accept a binary message payload containing the image to classify and return a JSON response containing the results of the inference.
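The two endpoints can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not the required implementation: the paths `/summary` and `/inference`, the metadata fields, the port, and the placeholder `predict_bytes` function are all hypothetical, and a real server would load the persisted model and decode the image before predicting.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metadata; a real server would derive this from the loaded model.
MODEL_SUMMARY = {"name": "damage-classifier", "classes": ["damage", "no_damage"]}

def predict_bytes(image_bytes):
    """Placeholder for real inference (decode the image, run the model).

    Returns a fixed label so the sketch runs without a trained model.
    """
    return {"prediction": "no_damage", "num_bytes": len(image_bytes)}

class InferenceHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Model summary endpoint
        if self.path == "/summary":
            self._send_json(200, MODEL_SUMMARY)
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # Inference endpoint: the request body is the raw image bytes
        if self.path != "/inference":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        if not image_bytes:
            self._send_json(400, {"error": "empty request body"})
            return
        self._send_json(200, predict_bytes(image_bytes))

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch

def make_server(host="127.0.0.1", port=5000):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), InferenceHandler)
```

Calling `make_server().serve_forever()` starts the sketch locally. Inside a container the host would typically be `0.0.0.0` so that the published port is reachable from outside.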

Package your model inference server in a Docker container image and push the image to the Docker Hub. Provide instructions for starting and stopping your inference server using the docker-compose file. Provide examples of how to make requests to your inference server.


Part 4: (7 points) Write a 3-page report summarizing your work. Be sure to include something about the following:

Data preparation: what did you do? (1 pt)

Model design: which architectures did you explore and what decisions did you make for each? (2 pts)

Model evaluation: what model performed the best? How confident are you in your model? (1 pt)

Model deployment and inference: a brief description of how to deploy/serve your model and how to use the model for inference (this material should also be in the README with examples) (1 pt)

Submission guidelines: Part 1 and Part 2 should be submitted as one notebook file. Part 3 should include a Dockerfile, a docker image (prebuilt and pushed to Docker Hub) and a docker-compose.yml file for starting the container. It should also include a README with instructions for using the container image, docker-compose file and example requests. Part 4 should be submitted as a PDF file.

In-class Project Checkpoint Thursday, April 4th. We will devote the first portion of Thursday’s class to checking in on the project and answering questions.

# Data Preparation

In [19]:

import numpy as np
import os
import cv2  # OpenCV; installed from PyPI as the opencv-python package

base_data_path = "./coe379L-sp24/datasets/unit03/Project3/data_all_modified/"
damage_file_set = os.listdir(base_data_path + "damage/")
no_damage_file_set = os.listdir(base_data_path + "no_damage/")
print("damage file entry: ", damage_file_set[0])
print("No damage file entry: ", no_damage_file_set[0])
print()
print("Number of damage files: ", len(damage_file_set))
print("Number of no_damage files: ", len(no_damage_file_set))
print("Total number of files: ", len(damage_file_set) + len(no_damage_file_set))

ModuleNotFoundError: No module named 'cv2'

First, we gather preliminary information about the dataset: the number of files in each class, the size of each file on disk, and the dimensions of the images.

In [20]:
# updating file paths for each image belonging to both classes
damage_file_set = [base_data_path + "damage/" + file for file in os.listdir(base_data_path + "damage/")]
no_damage_file_set = [base_data_path + "no_damage/" + file for file in os.listdir(base_data_path + "no_damage/")]
print("damage file entry: ", damage_file_set[0])
print("No damage file entry: ", no_damage_file_set[0])
print()

# calculating file size statistics
damage_file_sizes = [os.path.getsize(file) for file in damage_file_set]
no_damage_file_sizes = [os.path.getsize(file) for file in no_damage_file_set]
print(f"Average damage file size : {np.mean(damage_file_sizes)/1000:0.4f}kB")
print(f"Average no_damage file size: {np.mean(no_damage_file_sizes)/1000:0.4f}kB")
print(f"Range of damage file sizes : {np.min(damage_file_sizes)/1000:0.4f}kB - {np.max(damage_file_sizes)/1000:0.4f}kB")
print(f"Range of no_damage file sizes: {np.min(no_damage_file_sizes)/1000:0.4f}kB - {np.max(no_damage_file_sizes)/1000:0.4f}kB")

# calculating file dimension statistics
# helper to make dimensions more readable; str.join needs strings, so convert each value first
make_readable = lambda dims: "x".join(str(d) for d in dims)
damage_file_dims = [cv2.imread(file).shape for file in damage_file_set]
no_damage_file_dims = [cv2.imread(file).shape for file in no_damage_file_set]
print(f"Average damage file dimensions : {make_readable(np.mean(damage_file_dims, axis=0))}")
print(f"Average no_damage file dimensions: {make_readable(np.mean(no_damage_file_dims, axis=0))}")
print(f"Range of damage file dimensions : {make_readable(np.min(damage_file_dims, axis=0))} - {make_readable(np.max(damage_file_dims, axis=0))}")
print(f"Range of no_damage file dimensions: {make_readable(np.min(no_damage_file_dims, axis=0))} - {make_readable(np.max(no_damage_file_dims, axis=0))}")


damage file entry:  ./coe379L-sp24/datasets/unit03/Project3/data_all_modified/damage/-95.613297_29.758008.jpeg
No damage file entry:  ./coe379L-sp24/datasets/unit03/Project3/data_all_modified/no_damage/-96.93535899999999_28.727813.jpeg

Average damage file size : 2.5123kB
Average no_damage file size: 3.0134kB
Range of damage file sizes : 0.9430kB - 5.6470kB
Range of no_damage file sizes: 1.0200kB - 5.1340kB


NameError: name 'cv2' is not defined
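The dimension summary attempted in the cell above reduces a list of (height, width, channels) tuples to per-axis statistics. As a minimal sketch of that reduction, independent of OpenCV, the following uses made-up shapes and hypothetical helper names (`summarize_dims`, `format_dims`); it does not reflect the actual dimensions of the dataset.

```python
def summarize_dims(shapes):
    """Return per-axis (min, mean, max) for a list of equal-length shape tuples."""
    if not shapes:
        raise ValueError("shapes must not be empty")
    axes = list(zip(*shapes))  # transpose: one tuple of values per axis
    return [(min(a), sum(a) / len(a), max(a)) for a in axes]

def format_dims(values):
    """Join numeric dimensions into a string; values are converted with str()."""
    return "x".join(str(v) for v in values)

# Made-up (height, width, channels) shapes, for illustration only
shapes = [(128, 128, 3), (128, 128, 3), (64, 96, 3)]
stats = summarize_dims(shapes)
print("min :", format_dims(s[0] for s in stats))  # min : 64x96x3
print("max :", format_dims(s[2] for s in stats))  # max : 128x128x3
```

If the minimum and maximum agree on every axis, all images share one shape and no resizing is needed before training.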

#### Now we can check the dimensions of each image in the datasets