# ENGR418 Project Stage 1 Group 31

By: Jared Paull (63586572), Liam Ross (75469692)


---
---
---
---

## Single Function Call

First, run all of the functions at the bottom of this notebook, then run this single cell to demonstrate the algorithm.

In [16]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
import os
from sklearn.metrics import confusion_matrix
import PIL
# first param is the relative training data directory
# second param is the relative testing data directory
test_function("../data/training", "../data/testing/")

Testing Model on Training Data:

Shape Predicted  Circle  Rectangle  Square
Shape Actual                              
Circle               18          0       0
Rectangle             0         18       0
Square                0          0      18

Percentage of model errors from the training data: 0.00%



Testing Model on Testing Data:

Shape Predicted  Circle  Rectangle  Square
Shape Actual                              
Circle               17          0       1
Rectangle             0         17       1
Square                1          0      17

Percentage of model errors from the testing data: 5.56%


---
---
---
---

**The following code details the logic behind the algorithm. Additional functions used are at the bottom of this notebook (I would normally place them in an external script file, sorry). The code below is commented throughout and presented in a logical order.**

## Importing Libraries

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
import os
from sklearn.metrics import confusion_matrix
import PIL

## Scraping Image Data

Next, the training data must be loaded into a 2D numpy array. The first dimension of the array indexes the samples, which in this context are pictures. Each pixel of an image is a feature of the dataset, so the total number of features equals the total number of pixels in the resized image.

To do this, the training images are first loaded into an array. Each image is converted to grayscale, cropped to its central region, then scaled down to a fixed 64x64 size to reduce the memory requirements of the algorithm. Simultaneously, each sample is given a corresponding label so that the algorithm can later train a model from the data paired with the correct labels.

In [28]:
# pass in the relative directory that contains the training data.
# directory is relative to where this Jupyter notebook is
# Refer to get_image_data function at the bottom for detailed comments on the function.
x, y = get_image_data("../data/training")

# print statement to get a feel for the data
# Each row of x holds all 4096 pixels of one image
# Each value of y indicates the class, where the index in y corresponds to the row of x (linking image data and label).
print(x, "\n", y)


[[131 131 132 ... 159 159 160]
 [208 207 208 ... 199 199 200]
 [206 206 206 ... 196 195 196]
 ...
 [143 143 142 ... 197 196 196]
 [180 181 181 ... 206 206 205]
 [196 196 195 ... 220 220 222]] 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
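
The printed arrays above have one row per image and one label per row. As a rough illustration of that layout, the sketch below uses small synthetic grids instead of the real photos (so the pixel values are made up, and `x_demo` / `y_demo` are hypothetical names). It flattens each 64x64 grid into a 4096-long row and pairs it with a label:

```python
import numpy as np

# Synthetic stand-ins for three grayscale 64x64 images (hypothetical data,
# not the project's photos): one constant-valued grid per class.
images = [np.full((64, 64), fill, dtype=np.uint8) for fill in (10, 120, 250)]
labels = [0, 1, 2]  # 0 = circle, 1 = rectangle, 2 = square

# Flatten each 2D pixel grid row by row, then stack the rows into one matrix.
x_demo = np.array([img.reshape(-1) for img in images])
y_demo = np.array(labels)

print(x_demo.shape)  # one row per image, one column per pixel
print(y_demo.shape)  # one label per row of x_demo
```

Row `k` of the feature matrix and entry `k` of the label array always describe the same image, which is what lets the model pair pixels with the correct class.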


### Creating Logistic Regression Model

Now that all of the image data is collected and each image has a corresponding label, the data can be fit to a logistic regression model.

In [29]:
# Creating logistic regression model instance that implements a liblinear solver type
# liblinear solver implements a coordinate descent algorithm which works well with high dimension (4096 here)
log_regress = linear_model.LogisticRegression(solver = "liblinear")
# method to fit the logistic regression instance with the data collected in the previous cell
log_regress.fit(x,y);
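
For intuition about what the fitted model holds: a linear classifier of this kind keeps one weight vector and one intercept per class, and a prediction is the class with the largest linear score. The sketch below illustrates that scoring rule with tiny made-up weights. The names `W`, `b`, and `predict_linear` are hypothetical, and the numbers are not taken from the trained model:

```python
import numpy as np

def predict_linear(X, W, b):
    """Return the index of the highest-scoring class for each row of X."""
    scores = X @ W.T + b          # shape (n_samples, n_classes)
    return scores.argmax(axis=1)

# Made-up weights for 3 classes over 4 features (illustrative only).
W = np.array([[ 1.0, 0.0, 0.0, -1.0],
              [ 0.0, 1.0, 0.0,  0.0],
              [-1.0, 0.0, 1.0,  1.0]])
b = np.array([0.0, 0.1, -0.1])

X = np.array([[2.0, 0.0, 0.0, 0.0],   # strongest response on class 0
              [0.0, 3.0, 0.0, 0.0],   # strongest response on class 1
              [0.0, 0.0, 1.0, 2.0]])  # strongest response on class 2

print(predict_linear(X, W, b))
```

In this notebook, `log_regress.coef_` plays the role of `W`, with one row of 4096 weights per shape class. Those rows are what the weight visualization later draws.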

## Testing the algorithm

Now that a model exists, image data and labels are scraped from the testing folder, in the exact same fashion as the data collection from the training data folder.

In [30]:
# The code here is the same as that used to get image data from the training folder.
# This section will not be commented on, since the previous section covers all aspects of it.
# xt, yt represent the image data (xt) and the labels (yt) of the testing set

xt, yt = get_image_data("../data/testing")

## Prediction and Confusion Matrix

The data is fed into the model, which predicts a label for each image. The predicted labels are then compared with the correct labels to measure the model's accuracy. First, the model is evaluated on the training data.

In [31]:
# feed the training data into the model, pred is an array containing the output labels based on the model
pred =  log_regress.predict(x)

# These are two formatting calls to make the confusion matrix more readable. Refer to confusion_format function at the bottom.
predicted = confusion_format(pred)
actual = confusion_format(y)

# prints a confusion matrix, rows are true values, and columns are the model's guessed values.
print(pd.crosstab(actual, predicted, rownames=["Shape Actual"], colnames=["Shape Predicted"]))

# then the percentage of errors is the number of errors divided by the total number of image samples times 100 for percentage.
# The error_percentage function is described below in comment detail.
print(f"\nPercentage of model errors from the training data: {error_percentage(pred,y):.2f}%")

Shape Predicted  Circle  Rectangle  Square
Shape Actual                              
Circle               18          0       0
Rectangle             0         18       0
Square                0          0      18

Percentage of model errors from the training data: 0.00%


Next, the testing data is tested on the model.

In [32]:
# feed the testing data into the model, pred is an array containing the output labels based on the model
pred =  log_regress.predict(xt)

# These are two formatting calls to make the confusion matrix more readable. Refer to confusion_format function at the bottom.
predicted = confusion_format(pred)
# the actual labels must come from the testing set (yt), not the training set (y)
actual = confusion_format(yt)

# prints a confusion matrix, rows are true values, and columns are the model's guessed values.
print(pd.crosstab(actual, predicted, rownames=["Shape Actual"], colnames=["Shape Predicted"]))

# then the percentage of errors is the number of errors divided by the total number of image samples times 100 for percentage.
# The error_percentage function is described below in comment detail.
print(f"\nPercentage of model errors from the testing data: {error_percentage(pred,yt):.2f}%")

Shape Predicted  Circle  Rectangle  Square
Shape Actual                              
Circle               17          0       1
Rectangle             0         17       1
Square                1          0      17

Percentage of model errors from the testing data: 5.56%
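
The table above can be cross-checked by hand. The sketch below rebuilds label arrays that are consistent with the reported testing counts and tallies them with numpy. Which particular samples were misclassified is an assumption made purely for illustration, and `confusion_counts` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def confusion_counts(actual, predicted, n_classes=3):
    """Count samples per (actual, predicted) pair; rows are actual labels."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (actual, predicted), 1)
    return counts

# 18 images per class: 0 = circle, 1 = rectangle, 2 = square.
actual = np.repeat([0, 1, 2], 18)
predicted = actual.copy()
# Reproduce the three reported mistakes (which samples were wrong is assumed).
predicted[0] = 2    # one circle predicted as square
predicted[18] = 2   # one rectangle predicted as square
predicted[36] = 0   # one square predicted as circle

counts = confusion_counts(actual, predicted)
print(counts)
print(counts.sum(axis=1))  # number of samples in each actual class
```

The diagonal holds the correct predictions, so every off-diagonal entry is one misclassified image.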


## Weight Visualization

Now that a model has been trained and validated at over 94% accuracy on the testing data, the weights can be visualized. Each class has its own weight vector, and each vector can be plotted as an image. The function below calls several helper functions to stitch together three visualizations. From left to right, the images are: circle weights, rectangle weights, square weights.

In [33]:
# scaling factor for the weights so that they are visible on a monochrome scale
# ~250-500 works well
weight_scaling_factor = 500
show_weights(log_regress, weight_scaling_factor)

---
---
---
---

# **Functions**

All of these functions **must** be run before anything else. The purpose of each function is discussed, and each is commented in detail.

The first function loads all image data from a relative directory. It is used for both the training and the testing data, so images are read from each of their respective directories.

In [9]:
def get_image_data(rel_dir):

    # first, empty lists are created so that image pixel arrays (x) and labels (y) can later be added to them

    x = []
    y = []


    # create a for loop that will iterate through all items in the relative directory that contains the image data
    for pic in os.listdir(f"{rel_dir}/"):
        # import image using Pillow library, then convert the image to grayscale immediately
        image = PIL.Image.open(f"{rel_dir}/{pic}").convert("L")
        # crop the image, will crop vertically from height/4 to 3*height/4
        # and crop horizontally from width/4 to 3*width/4
        # this will crop the image to reduce memory to only relevant pixels
        image = PIL.Image.fromarray(np.array(image)[int(np.floor(image.height / 4)) : int(np.ceil(3 * image.height / 4)), int(np.floor(image.width / 4)) : int(np.ceil( 3 * image.width / 4))])
        # resizes the image to 64x64 pixels, ensures the number of features is constant regardless of the raw image size.
        # resizing also reduces the total memory requirements of the algorithm.
        image = image.resize((64,64))
        # converts from image format to a 2D array representing a pixel grid
        data = np.asarray(image)
        # converts from a 2D pixel grid to a 1D array of length 64^2=4096, where the rows are appended horizontally.
        vec = np.hstack(data)
        # add the image data to the container of all images.
        x.append(vec)
    
        # examine the name of the picture file, can find correct label based on first letter of the file name.
        # c indicates the picture is a circle
        if( str.lower(pic[0]) == "c"):
            # classify circles as a 0
            y.append(0)
        # r indicates the picture is a rectangle
        elif (str.lower(pic[0]) == "r"):
            # classify rectangle as a 1
            y.append(1)
        # only other situation is the image is a square
        else:
            # classify square as a 2
            y.append(2)
    
    # convert from python list to numpy array, format is required for sklearn logistic regression solver.
    x = np.array(x)
    y = np.array(y)
    return x,y
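
The cropping and labelling steps inside `get_image_data` can be looked at in isolation. The sketch below is a minimal, assumption-laden version: `center_crop` and `label_from_name` are hypothetical helper names, the input grid is synthetic, and the file name is invented for the example.

```python
import numpy as np

def center_crop(pixels):
    """Keep the middle half of a 2D pixel grid in each direction."""
    height, width = pixels.shape
    top, bottom = int(np.floor(height / 4)), int(np.ceil(3 * height / 4))
    left, right = int(np.floor(width / 4)), int(np.ceil(3 * width / 4))
    return pixels[top:bottom, left:right]

def label_from_name(filename):
    """Map the first letter of a file name to a class index."""
    first = filename[0].lower()
    if first == "c":
        return 0  # circle
    if first == "r":
        return 1  # rectangle
    return 2      # anything else is treated as a square

grid = np.arange(100 * 200).reshape(100, 200)   # synthetic 100x200 "image"
print(center_crop(grid).shape)                  # half the height, half the width
print(label_from_name("Circle_01.png"))         # hypothetical file name
```

Note that, as in the original function, an image whose name starts with any letter other than `c` or `r` is labelled as a square, so the naming convention of the data folders matters.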

The next function is used to format the confusion matrix. Instead of labelling the rows and columns with the numeric class value, it converts each numeric label to the corresponding string, following the class map defined in `get_image_data`.

In [10]:
# This function will convert from decimal label to strings.
# 0=>Circle, 1=>Rectangle, 2=>Square

def confusion_format(labels):
    test = []
    for i in labels:
        if i == 0:
            test.append("Circle")
        elif i == 1:
            test.append("Rectangle")
        else:
            test.append("Square")
    test = np.array(test)
    return test
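
The same mapping can also be written as a lookup into an array of names. This is only an alternative sketch (the name `SHAPE_NAMES` is hypothetical), not a replacement the notebook depends on:

```python
import numpy as np

SHAPE_NAMES = np.array(["Circle", "Rectangle", "Square"])  # index = class label

labels = np.array([0, 2, 1, 1, 0])
named = SHAPE_NAMES[labels]   # integer-array indexing picks one name per label
print(named)
```

One behavioural difference: the loop version maps every label other than 0 or 1 to "Square", whereas the lookup only accepts valid indices into the three-element array.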

The next function finds the percentage of entries that differ between two arrays. It is used to measure the error of the model's classification against the correct classification.

In [11]:
def error_percentage(pred, y):
    
    #print(pred)
    #print(y)
    # the number of errors is the number of differences between the model's labels and the correct labels
    errors = 0
    for i in range(pred.size):
        # pred is the predicted array labels, while y is the actual
        if pred[i] != y[i]:
            errors = errors + 1
            
    # then the percentage of errors is the number of errors divided by the total number of image samples times 100 for percentage.
    return errors / pred.size * 100
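
For comparison, the same percentage can be computed without an explicit loop. The sketch below assumes both inputs have the same length; `error_percentage_vectorized` is a hypothetical name, and the label arrays are constructed for the example rather than taken from the model:

```python
import numpy as np

def error_percentage_vectorized(pred, y):
    """Percentage of positions where pred and y disagree."""
    pred = np.asarray(pred)
    y = np.asarray(y)
    # (pred != y) is a boolean array; its mean is the fraction of mismatches.
    return np.mean(pred != y) * 100

y_true = np.repeat([0, 1, 2], 18)      # 54 labels, 18 per class
y_pred = y_true.copy()
y_pred[[0, 18, 36]] = [2, 2, 0]        # three deliberate mismatches
print(f"{error_percentage_vectorized(y_pred, y_true):.2f}%")  # 3 of 54 -> 5.56%
```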

Moving on to the visualization section, this function calls several helper functions to build the view. These helper functions are discussed below.

In [12]:
def show_weights(log_regress, scale):
    c0 = log_regress.coef_[0]
    c1 = log_regress.coef_[1]
    c2 = log_regress.coef_[2]
    
    i0 = get_weight_image(c0, scale)
    i1 = get_weight_image(c1, scale)
    i2 = get_weight_image(c2, scale)
    
    i = concate_horizontal_images(i0,i1,i2)
    i.show()

The function below converts a weight vector into an image. It normalizes the weights by the maximum weight, takes the absolute value, then scales the result by a factor to cover the monochrome colour range. Afterwards it converts the single array of length 4096 into a 2D array where both dimensions have length 64. Finally, the array is converted into an image and resized to be easily visible.

In [13]:
def get_weight_image(c, scale):
    # coefficients normalized and scaled by a factor
    c = np.abs(c/c.max()) * scale
    
    # initialize empty list to store rows
    a = []
    for i in range(64):
        temp = []
        for j in range(64):
            temp.append(c[64 * i + j])
        a.append(temp)
    a = np.array(a)
    
    # convert from array to image
    i = PIL.Image.fromarray(a)
    # return the resized image
    return i.resize((256,256))
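
The nested loop above fills the grid row by row, which is the same ordering a numpy reshape uses. The sketch below shows that equivalence on a random stand-in vector (`weights_to_grid` is a hypothetical name, and the random values are not real model weights):

```python
import numpy as np

def weights_to_grid(c, scale, side=64):
    """Normalize a flat weight vector and reshape it into a side x side grid."""
    scaled = np.abs(c / c.max()) * scale
    return scaled.reshape(side, side)   # row i holds c[side*i : side*(i+1)]

rng = np.random.default_rng(0)
c = rng.normal(size=64 * 64)            # random stand-in for one row of coef_

grid = weights_to_grid(c, scale=500)
print(grid.shape)
# Entry (i, j) matches index 64 * i + j of the flat vector, as in the loop.
i, j = 5, 7
print(np.isclose(grid[i, j], abs(c[64 * i + j] / c.max()) * 500))
```

Because the normalization divides by `c.max()` rather than by the largest absolute value, a weight vector whose most negative entry is larger in magnitude than its most positive entry would produce normalized values above 1.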

The last function for visualization will stitch together each of the three weighting images into one single image.

In [14]:
def concate_horizontal_images(im0, im1, im2):
    i = PIL.Image.new(mode = "L",size=(768, 256))
    i.paste(im0, (0, 0, 256, 256))
    i.paste(im1, (256, 0))
    i.paste(im2, (512, 0))
    return i
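
The same side-by-side layout can be illustrated on plain arrays. This sketch uses constant-valued stand-in panels rather than real weight images, and it is an alternative view of the idea, not the method the notebook uses:

```python
import numpy as np

# Three stand-in 256x256 grayscale panels (constant values, illustrative only).
panels = [np.full((256, 256), v, dtype=np.uint8) for v in (0, 128, 255)]

strip = np.hstack(panels)   # place the panels side by side, left to right
print(strip.shape)          # numpy reports (rows, columns)
print(strip[0, 0], strip[0, 256], strip[0, 512])  # first pixel of each panel
```

Note the differing conventions: Pillow sizes are given as (width, height), so the 768x256 canvas in `concate_horizontal_images` corresponds to a numpy array of shape (256, 768).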

In [15]:
# this function simply runs everything at once, refer to the code cells above for comments.
def test_function(train, test):
    x, y = get_image_data(train)
    log_regress = linear_model.LogisticRegression(solver = "liblinear")
    log_regress.fit(x,y)
    xt, yt = get_image_data(test)
    
    pred =  log_regress.predict(x)
    predicted = confusion_format(pred)
    actual = confusion_format(y)
    print("Testing Model on Training Data:\n")
    print(pd.crosstab(actual, predicted, rownames=["Shape Actual"], colnames=["Shape Predicted"]))
    print(f"\nPercentage of model errors from the training data: {error_percentage(pred,y):.2f}%\n\n\n")
    
    print("Testing Model on Testing Data:\n")
    pred =  log_regress.predict(xt)
    predicted = confusion_format(pred)
    # compare against the testing labels (yt), not the training labels (y)
    actual = confusion_format(yt)
    print(pd.crosstab(actual, predicted, rownames=["Shape Actual"], colnames=["Shape Predicted"]))
    print(f"\nPercentage of model errors from the testing data: {error_percentage(pred,yt):.2f}%")
    weight_scaling_factor = 500
    show_weights(log_regress, weight_scaling_factor)