# DS200A Computer Vision Project
---
## Summary
This project consists of a large dataset with images belonging to 20 different categories.
The goal is to classify the category of a given image using a machine learning model. We are
given 1501 images to train our models, and 716 images to evaluate the model accuracies. We
began by performing exploratory data analyses to give us insights into the feature selection
process. Then, we trained 4 models: Logistic Regression, KNN, SVM and Random Forest. For
each model, we used cross validation to select the best parameters, and a feature selection
function to select the top 10 features. In the end, we identified Logistic Regression as the best
model because it has one of the lowest validation errors, a fast computation time and is
relatively easy to interpret.
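
The workflow described above (cross validation to pick parameters, plus selection of the top 10 features) might be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses synthetic data in place of the image features, and the choice of `SelectKBest` with `f_classif` and the grid over `C` are hypothetical stand-ins for whatever the notebooks use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered image features (hypothetical data).
X, y = make_classification(
    n_samples=300, n_features=25, n_informative=10,
    n_classes=4, random_state=0,
)

# Scaling and feature selection live inside the pipeline so that each
# cross-validation fold fits them on its own training split only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength C.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["model__C"]
best_cv_accuracy = search.best_score_
n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(f"Best C: {best_C}, mean CV accuracy: {best_cv_accuracy:.3f}")
```

Keeping the selector inside the pipeline means the held-out fold never influences which features are chosen, which keeps the validation estimate honest.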

## Key questions explored
- How well can I design an image classifier using conventional (i.e. non-neural net) ML models?

## Techniques used
- logistic regression
- KNN
- SVM
- Random forest
- bootstrapping
- cross validation
- feature engineering

## Key findings
- Preprocessing of feature values was one of the most significant factors affecting model accuracy. Models trained on normalized data performed 10-50% worse than models trained on the original data, whereas models trained on scaled feature values performed significantly better than models trained on the original data.
- Our models achieved a validation accuracy of 40%-46%, with SVM the highest (46% accuracy) and KNN the lowest (40% accuracy). The 95% confidence interval for SVM accuracy was 38-54%. We selected the Logistic Regression model to predict the test set because it had decent accuracy, is a simple model that is easy to interpret, and did not overfit the training data as much as some of the other models. We expect this model to generalize best to the test data.
- In this project, we limited our features to numerical values only. However, it is possible to use a pixel-by-pixel matrix as a feature, which could significantly improve the final model. Despite the feature constraints in this project, we still achieved almost 50% accuracy on the validation set by focusing on transforming the data and optimizing model parameters. It would be interesting to see how much model accuracy improves if we combine the data cleaning and parameter optimization from this project with more powerful features.
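
A confidence interval like the one quoted for SVM can be estimated by bootstrapping the validation predictions. The sketch below is one common way to do this (a percentile bootstrap) and is not necessarily the procedure used in the notebooks; the labels and predictions are randomly generated placeholders, and `bootstrap_accuracy_ci` is a hypothetical helper name.

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for classification accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = (y_true == y_pred).astype(float)
    rng = np.random.default_rng(seed)
    n = len(correct)
    # Resample the per-image correctness indicators with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    accuracies = correct[idx].mean(axis=1)
    lower, upper = np.quantile(accuracies, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lower, upper

# Hypothetical labels and predictions for a 20-class validation set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 20, size=150)
y_pred = np.where(rng.random(150) < 0.45, y_true, (y_true + 1) % 20)
accuracy, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With a validation set of a few hundred images, an interval this wide is expected, which is why a 46% versus 40% difference between models should be read cautiously.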

This image shows a snapshot of the results from different ML models.
![Results comparison](summary_results.png)

This image shows the breakdown of categories in the dataset.
![Distribution of categories in train data](category_dist.png)

## Notebooks
This project is split into 4 notebooks. When running for the first time, the notebooks should be opened and run in sequential order. After all files have been downloaded, they can be opened and executed individually.
- __NB1: Data cleaning and preparation__
- __NB2a: Preliminary EDA__
- __NB2b: Feature selection__
- __NB3: Modeling__
---

## Data Cleaning and Preparation (NB1) - Table of contents
> ### Part 1. [Access zipped files](#zip)
> ### Part 2. [Read data and clean](#read)
> ### Part 3. [Confirm images in proper format](#confirm)

---

In [2]:
import numpy as np
import pandas as pd
import skimage
import os
import zipfile 
from zipfile import ZipFile
import re 
from skimage import io
from pathlib import Path

<a id="zip"></a>
### Part 1. Define functions to assist in data preparation/cleaning
As part of our class project, we are given two zip files that contain images, so our first task is to unzip these files for access. We need to ensure that (1) file names contain no spaces and (2) all images are in RGB format. In particular, file names in the test set contain spaces, which cause issues, and not all images are in RGB format.

#### 1.1 Remove spaces from specified file

In [3]:
def remove_directory_file_spaces_zip(dir_name='20_Validation'): 
    """
    Removes spaces in file names of specified directory in zip file.
    The zip file name should be specified from the root directory. 
    Extracts the zip file, trims the filenames and rezips the file.
    
    Keyword arguments: 
    dir_name (string) -- name of zip file in root folder 
    
    Returns: 
        None 
    """
    directory = os.path.join(os.getcwd(), dir_name)
    
    # Find and extract zip files
    with ZipFile(f"{directory}.zip") as val_files_zip:
        val_files_zip.extractall(directory)
    
    # Remove spaces from file names and lowercase them. Only the file name
    # is changed, so spaces or capitals in the parent path are left alone.
    for f in os.listdir(directory):
        os.rename(os.path.join(directory, f), 
                  os.path.join(directory, f.replace(' ', '').lower()))
    
    # Rezip files; the context manager closes the archive
    with ZipFile(f"{dir_name}.zip", "w") as zf:
        for dirname, subdirs, files in os.walk(directory): 
            for filename in files:
                zf.write(os.path.join(dirname, filename), filename)
    
remove_directory_file_spaces_zip() 
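
As an alternative to extracting, renaming, and rezipping, the archive members can be copied straight into a new archive under cleaned names. The following is a sketch using only the standard library; `normalized_name` and `rewrite_zip_without_spaces` are hypothetical helpers, and the demonstration archive is a throwaway file with fake contents.

```python
import os
import tempfile
from zipfile import ZipFile

def normalized_name(name):
    """Remove spaces and lowercase an archive member name."""
    return name.replace(" ", "").lower()

def rewrite_zip_without_spaces(src_zip, dst_zip):
    """Copy every member of src_zip into dst_zip under a normalized name."""
    with ZipFile(src_zip, "r") as src, ZipFile(dst_zip, "w") as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directories are implied by member paths
            dst.writestr(normalized_name(info.filename), src.read(info))

# Small self-contained demonstration with a throwaway archive.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "raw.zip")
    dst_path = os.path.join(tmp, "clean.zip")
    with ZipFile(src_path, "w") as zf:
        zf.writestr("Test 001.jpg", b"fake image bytes")
        zf.writestr("test002.jpg", b"more fake bytes")
    rewrite_zip_without_spaces(src_path, dst_path)
    with ZipFile(dst_path) as zf:
        cleaned_names = sorted(zf.namelist())
print(cleaned_names)
```

Writing to a separate destination leaves the original archive untouched. Note that two members whose names differ only by spaces or letter case would collide after normalization, so a real run may want to check for duplicates first.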

#### 1.2 Convert greyscale images into RGB
Some files are in greyscale. For consistency in the structure of data, convert these into a representative RGB format.

In [4]:
def convert_gray(images):
    """
    Creates RGB representation of gray-level images. 
    
    Keyword arguments: 
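    images (pd.Series) -- series containing image pixel arrays 
    
    Returns: 
        pd.Series of images all with RGB representation
        
    """
    # gray2rgb is the current name; the grey2rgb alias was removed from
    # recent scikit-image releases
    return images.apply(lambda image: skimage.color.gray2rgb(image) if len(image.shape) < 3 else image) 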
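
To make the conversion in `convert_gray` concrete: a grayscale image is a 2-D array, and its RGB representation repeats that array across three channels. The sketch below reproduces the idea with NumPy alone; `to_rgb` is a hypothetical helper and the 2x2 array is a made-up image.

```python
import numpy as np

def to_rgb(image):
    """Return an (H, W, 3) array; 2-D grayscale input is repeated per channel."""
    if image.ndim == 2:
        return np.stack([image, image, image], axis=-1)
    return image

gray = np.array([[0, 128], [255, 64]], dtype=np.uint8)
rgb = to_rgb(gray)
print(gray.shape, "->", rgb.shape)
```

Because every channel holds the same values, the converted image looks identical to the original but has the same array shape as the color images, which is what downstream feature code needs.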

<a id="read"></a>

### Part 2. Read image data and create cleaned dataframe
Take a given folder and create a dataframe with the picture object, and the encoding as listed below.

0=Airplanes, 1=Bear, 2=Blimp, 3=Comet, 4=Crab, 5=Dog, 6=Dolphin, 7=Giraffe, 8=Goat, 9=Gorilla, 10=Kangaroo, 11=Killer-Whale, 12=Leopards, 13=Llama, 14=Penguin, 15=Porcupine, 16=Teddy-Bear, 17=Triceratops, 18=Unicorn, 19=Zebra


#### 2.1 Create functions to assist in data cleaning/preparation process

In [18]:
def read_organize_data(folder_name='20_categories_training', isTest=False, isCache=True):    
    """
    Returns a dataframe with picture objects and category encodings of all images in folder. 
    
    Keyword arguments:
    folder_name (string) -- name of image folder in root directory 
    isTest (bool) -- Flag. True loads images from test set, False loads train set.
    isCache (bool) -- Flag. True loads images from cache, False overwrites cached images.
        
    Return: 
        pd.DataFrame with image objects and category encodings 
    """
    # Read in cached file, if it exists. The path is relative to the working
    # directory so that this lookup matches the location written to below.
    cache_path = Path('test/cleaned_test.pkl' if isTest else 'test/cleaned_train.pkl')
    if cache_path.is_file() and isCache: 
        print(f"Loading cache {'test' if isTest else 'train'} file")
        return pd.read_pickle(cache_path)    
    
    folder_name = '20_Validation' if isTest else folder_name
    img_dir_path = Path(f"{os.getcwd()}/{folder_name}.zip") 
    
    # Unzip folder
    with zipfile.ZipFile(img_dir_path) as category_zip:
        category_zip.extractall(f"{os.getcwd()}/{folder_name}/")
    
    # Get info about image to use in dataframe
    image_names, category_names, images, encoding = get_image_data(folder_name, img_dir_path, isTest)
    df = create_cleaned_df(image_names, images, category_names, encoding)          
    
    # Cache dataframe (reached only if the cache is missing or isCache is False)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    
    return df
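
The caching step in `read_organize_data` follows a load-or-rebuild pattern that can be isolated from the image logic. Below is a minimal sketch of that pattern; `load_or_build` and `build` are hypothetical names, and the small dataframe stands in for the real cleaned data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_build(cache_path, build_fn, use_cache=True):
    """Return the cached dataframe if allowed and present, else rebuild it."""
    cache_path = Path(cache_path)
    if use_cache and cache_path.is_file():
        return pd.read_pickle(cache_path)
    df = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df

# Demonstration: the builder runs once, the second call reads the pickle,
# and use_cache=False forces a rebuild that overwrites the cache.
calls = []
def build():
    calls.append(1)
    return pd.DataFrame({"name": ["a.jpg", "b.jpg"], "encoding": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "cleaned_train.pkl"
    first = load_or_build(path, build)
    second = load_or_build(path, build)
    third = load_or_build(path, build, use_cache=False)
print(len(calls))
```

Using the same `Path` object for both the existence check and the write avoids the situation where the cache is saved in one location and looked up in another.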

In [6]:
def get_image_data(folder_name, img_dir_path, isTest):
    """
    Returns lists of image names, category names, and image pixels of all images in folder. 
    
    Keyword arguments:
    folder_name (string) -- path of image folder in root directory 
    img_dir_path (Path) -- path of image directory in root directory 
    isTest (bool) -- Flag. True loads images from test set, False loads train set.
        
    Return: 
        lists with image names, category names, and image objects 
    """
    image_names, category_names, images = [], [], []
    
    with ZipFile(img_dir_path, 'r') as images_zip: 
        for filename in images_zip.namelist(): 
            img_name = re.search(r"^.*\.jpg", filename)
            
            # Skip entries that are not images, macOS metadata files ("._*"),
            # and names containing spaces, so that all lists stay aligned
            if img_name is None or "/._" in img_name.group() or ' ' in img_name.group(): 
                continue 
            img_path = Path(f"{folder_name}/{img_name.group()}")
            
            # Test files have no category: store the name and pixel data only
            if isTest: 
                image_names.append(img_name.group())
                images.append(io.imread(img_path))
                continue 
            
            # Get image category name and image data in pixels for train data
            category_name = re.search("(.*)/", filename)
            if category_name: 
                image_names.append(img_name.group())
                category_names.append(category_name.group().replace("/", ""))
                images.append(io.imread(img_path))
    
    # Translate category name into encoding for train data
    if not isTest:
        encoding = encode_categories(category_names)
        return image_names, category_names, images, encoding
    # Test data has no labels, so there is no encoding to return
    return image_names, category_names, images, None
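
The entry filtering in `get_image_data` can be checked in isolation, without reading any image. The sketch below applies the same rules (keep `.jpg` files, drop `._` metadata files, drop names with spaces) to a made-up archive listing; `keep_entry` and `category_of` are hypothetical helpers that use `PurePosixPath` rather than the regular expressions in the function above.

```python
import re
from pathlib import PurePosixPath

IMAGE_PATTERN = re.compile(r"\.jpg$", re.IGNORECASE)

def keep_entry(name):
    """True for .jpg members that are not macOS '._' files and have no spaces."""
    basename = PurePosixPath(name).name
    return (
        IMAGE_PATTERN.search(name) is not None
        and not basename.startswith("._")
        and " " not in name
    )

def category_of(name):
    """Name of the folder that directly contains the file ('' at top level)."""
    return PurePosixPath(name).parent.name

# Hypothetical archive listing.
entries = [
    "zebra/",
    "zebra/zebra_0001.jpg",
    "zebra/._zebra_0001.jpg",
    "bear/bear 0002.jpg",
    "bear/bear_0003.jpg",
    "notes.txt",
]
kept = [(name, category_of(name)) for name in entries if keep_entry(name)]
print(kept)
```

Deciding whether to keep an entry before touching any of the output lists is what keeps names, categories, and pixel arrays the same length.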

In [7]:
def encode_categories(category_names):
    """
    Returns a list of category encodings of all images in folder. 
    
    Keyword arguments:
    category_names (list) -- category names of all images 
        
    Return: 
        list with category name encodings of all images
    """
    encoding_dict = {0:'airplanes', 1:'bear', 2:'blimp', 3:'comet', 4:'crab', \
                    5:'dog', 6:'dolphin', 7:'giraffe', 8:'goat', 9:'gorilla', \
                    10:'kangaroo', 11:'killer-whale', 12:'leopards', 13:'llama', \
                    14:'penguin', 15:'porcupine', 16:'teddy-bear', 17:'triceratops', \
                    18:'unicorn', 19:'zebra'}
    # Flip key-value pairs 
    encoding_dict = {v: k for k, v in encoding_dict.items()}
    return [encoding_dict[cat_name] for cat_name in category_names]
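
Since the codes are simply the positions of the category names in alphabetical order, both directions of the lookup can also be derived from a single ordered list. This is an illustrative alternative, not the notebook's code; `CATEGORIES`, `NAME_TO_CODE`, and `CODE_TO_NAME` are hypothetical names.

```python
CATEGORIES = [
    "airplanes", "bear", "blimp", "comet", "crab", "dog", "dolphin",
    "giraffe", "goat", "gorilla", "kangaroo", "killer-whale", "leopards",
    "llama", "penguin", "porcupine", "teddy-bear", "triceratops",
    "unicorn", "zebra",
]

# Name -> integer code and the inverse, both derived from list position.
NAME_TO_CODE = {name: code for code, name in enumerate(CATEGORIES)}
CODE_TO_NAME = dict(enumerate(CATEGORIES))

encoded = [NAME_TO_CODE[name] for name in ["zebra", "bear", "killer-whale"]]
print(encoded)
```

Building the dictionaries once at module level avoids inverting the mapping on every call, and `CODE_TO_NAME` is handy later for turning model predictions back into readable labels.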

In [8]:
def create_cleaned_df(image_names, images, category_names=None, encoding=None):
    """
    Returns a dataframe with picture objects and category encodings of all images in folder. 
    
    Keyword arguments:
    image_names (list) -- name of all images
    images (list) -- image objects of all images
    category_names (list) -- category names of all images 
    encoding (list) -- category encodings of all images
    
    Return: 
        pd.DataFrame with image objects and category encodings 
    """
    df = pd.DataFrame() 
    df['name'] = image_names
    df['image'] = images
    df['image'] = convert_gray(df['image'])
    
    # For train data, include response variable
    if category_names and encoding:
        df['category'] = category_names
        df['encoding'] = encoding
    return df

#### 2.2 Clean data and measure time

In [19]:
import time 
t1 = time.time()
training_data = read_organize_data(isTest=False, isCache=True)
validation_data = read_organize_data(isTest=True, isCache=True)
t2 = time.time()
print(f"Seconds: {round(t2 - t1, 2)}")
print(f"Training images loaded: {len(training_data)}")
print(f"Test images loaded: {len(validation_data)}")

Seconds: 13.93
Training images loaded: 1501
Test images loaded: 716


<a id="confirm"></a>

### Part 3. Confirm images are in proper RGB format


In [22]:
# The filtered dataframe checked below will have 0 entries if data is properly cleaned
starting_data = read_organize_data(isTest=False, isCache=True)

# Check for images in greyscale format
if starting_data[starting_data['image'].apply(lambda x: len(x.shape) < 3)].empty:
    print("Images in correct RGB format!")
else:
    print("Images need to be correctly formatted!")

Images in correct RGB format!
