<h1> DS200A Computer Vision Assignment</h1>

<h2>  Part Two: Feature Selection </h2>	


<h4> In this section, we would like you to select between 15 and 20 features to focus your model on. This will require significant explatoratory research. The first one is already implemented for you, and the next two are pre-specified.  </h4>

In [1]:
#Imports. NOTE: cv2 may need to be installed depending on environment. 
import pandas as pd
import numpy as np
import skimage.io as io
import cv2 as cv

In [2]:
starting_data = pd.read_pickle("./starting_data.pkl")

In [3]:
#Normalizes an image by making the 1st channel all 3 channels if the
#image is in grayscale (happens about 1% of the training set)
def normalize(image):
    if len(image.shape) == 2:
        temp = np.zeros((image.shape[0], image.shape[1], 3))
        for i in range(3):
            temp[:,:,i] = image
        image = temp
    return image

# Returns the average of the red-channel pictures for the images
def red_avg(image):
    return np.mean(image[:, :, 0])

# Returns the average of the green-channel pictures for the images
def green_avg(image):
    return np.mean(image[:, :, 1])

# Returns the average of the blue-channel pictures for the images
def blue_avg(image):
    return np.mean(image[:, :, 2])

# Argmax calculates, at every pixel, which channel had the highest intensity
# and then works out across the image, what percentage of pixels were 
# argmax for each of the three channels. This helps tell us which channel
# is dominating in different images, which may help us quantify color features
# better. 
# Returns: An array of 3 elements that sum to 1, consisting of % arg-max
# for each of the three channels. 
def argmax(image):
    orig = np.zeros(3)
    a = np.argmax(image, axis = 2)
    vals, counts = np.unique(a, return_counts=True)
    for i in range(3):
        if i in vals:
            orig[i] = counts[np.where(vals == i)]
    percent_max = orig/np.sum(orig)
    if len(percent_max) != 3:
        print('Help! Percent max length is not 3')        
    return percent_max

# Returns a cropped image that is defined as the middle 50% of the image
# in both direction (i.e. 25-75th percentile). We suspect that the actual
# focus of the image should be on this section. 
# Returns a cropped image of 1/2 size in each direction (so 0.25 size total)
def crop(image):
    m, n, k = image.shape
    patch_min = int(m/4)
    patch_max = int(3*m/4)
    cropped = image[patch_min:patch_max, patch_min:patch_max, :]    
    return cropped

# Contrast measures, across each channel, the range of intensities and then
# takes the mean of these values to get a percentage of max possible contrast
# Returns a scalar value of mean contrast
def contrast(image):
    vals = np.zeros(3)
    for i in range(3):
        vals[i] = (np.max(image[:, :, i]) - np.min(image[:, :, i]))
    return np.mean(vals)

# Edges uses cv2's Canny edge detector to detect edges in a reasonable
# threshold. Since we are not allowed to keep the location of the edges
# (as we need scalar features) we simply note the number of edges as a
# percentage of all pixels. 
def edges(image, threshold_low = 100, threshold_high = 200):
    # from float to uint8 to keep cv2 happy!!
    edges_image = cv.Canny(image.astype(np.uint8), threshold_low, threshold_high)
    vals, counts = np.unique(edges_image, return_counts=True)
    
    if len(counts)<2:
        return 1.0
    else:
        return counts[1]/sum(counts)

In [4]:
# getImageFeatures: populates our dataframe with the various features 
# described above. Where the function returns a vector of 3 values (i.e. 
# per-channel metrics) this method also separates them into scalars. 
def getImageFeatures(df):
    df["image"] = df["image"].apply(normalize)

    df["red"] = df["image"].apply(red_avg)
    df["green"] = df["image"].apply(green_avg)
    df["blue"] = df["image"].apply(blue_avg)
    df["argmax"] = df["image"].apply(argmax)
    df[["argmax_r","argmax_g","argmax_b"]] = pd.DataFrame(
        df.argmax.values.tolist(), index= df.index)
    df["contrast"] = df["image"].apply(contrast)
    df["edge"] = df["image"].apply(edges)

    df["cropped"] = df["image"].apply(crop)
    df["red_crop"] = df["cropped"].apply(red_avg)
    df["green_crop"] = df["cropped"].apply(green_avg)
    df["blue_crop"] = df["cropped"].apply(blue_avg)
    df["argmax_crop"] = df["cropped"].apply(argmax)
    df[["argmax__crop_r","argmax_crop_g","argmax_crop_b"]] = pd.DataFrame(
        df.argmax_crop.values.tolist(), index= df.index)
    df["contrast_crop"] = df["cropped"].apply(contrast)
    df["edge_crop"] = df["cropped"].apply(edges)

    #Uncomment the following line when ready to make the final dataframe for Part 3!
    # remove filename from the training df later... kept temporarily for easy debugging
    df = df.drop(labels = ['image',  'cropped','argmax', 'argmax_crop'], axis = 1) #'filename',
    return df

#test = getImageFeatures(starting_data)
#test.head()

Define more features above, performing any EDA research below. We expect all external sources sited, and a couple significant different graphs indicating some form of EDA. 

<h4> Graphs </h4>

In [4]:
# store -r starting_data

<h4> Sources </h4>

In [5]:
# B/W images with single channel.. keep to show for report
files=[]
for i,image in enumerate(starting_data["image"]):
    if len(image.shape) < 3:
        files.append(starting_data.iloc[i, 2])
print(files)

['blimp_0022.jpg', 'comet_0006.jpg', 'comet_0011.jpg', 'comet_0013.jpg', 'comet_0021.jpg', 'comet_0036.jpg', 'comet_0038.jpg', 'comet_0041.jpg', 'comet_0049.jpg', 'comet_0052.jpg', 'comet_0053.jpg', 'comet_0057.jpg', 'comet_0058.jpg', 'crab_0045.jpg', 'dolphin_0025.jpg', 'gorilla_0128.jpg']


<h4> DataFrame Creation </h4>

In [7]:
def feature_frame(df):
    return getImageFeatures(df)

In [9]:
train_df = feature_frame(starting_data)
train_df.head()

Unnamed: 0,class,filename,aspect,red,green,blue,argmax_r,argmax_g,argmax_b,contrast,edge,red_crop,green_crop,blue_crop,argmax__crop_r,argmax_crop_g,argmax_crop_b,contrast_crop,edge_crop
0,0,airplanes_0001.jpg,2.426829,183.357049,176.758482,149.242033,0.572327,0.421021,0.006653,254.463664,0.113464,161.131365,160.474151,137.0064,0.536377,0.460449,0.003174,250.598267,0.193604
1,0,airplanes_0002.jpg,2.179348,210.781639,189.322828,164.848389,0.971802,0.028198,0.0,249.801636,0.056641,172.329463,127.0986,104.797017,0.984375,0.015625,0.0,239.890218,0.149658
2,0,airplanes_0003.jpg,2.381818,169.915943,147.578112,111.457837,0.705322,0.056702,0.237976,253.397135,0.126892,155.995922,106.718251,49.301164,0.809326,0.06543,0.125244,220.446019,0.228027
3,0,airplanes_0004.jpg,2.311765,152.404667,132.895747,78.073232,0.835449,0.161743,0.002808,253.534108,0.159851,146.875142,103.978308,37.956971,0.943848,0.052734,0.003418,247.106567,0.236084
4,0,airplanes_0005.jpg,2.244318,147.112763,150.219843,86.000939,0.404968,0.515442,0.07959,254.519206,0.169128,100.401541,102.504867,46.679292,0.262695,0.493408,0.243896,253.630859,0.244629


In [10]:
train_df.describe()

Unnamed: 0,class,aspect,red,green,blue,argmax_r,argmax_g,argmax_b,contrast,edge,red_crop,green_crop,blue_crop,argmax__crop_r,argmax_crop_g,argmax_crop_b,contrast_crop,edge_crop
count,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0,1501.0
mean,9.57495,1.271221,118.674594,117.96934,109.286776,0.499326,0.232369,0.268305,241.695293,0.177461,116.009166,110.528402,101.724997,0.549557,0.186151,0.264293,229.225584,0.206324
std,5.54189,0.42786,44.630329,40.436065,45.336776,0.312318,0.242696,0.301187,20.051401,0.081827,41.266951,36.541116,41.342812,0.329532,0.219462,0.310762,28.169096,0.08739
min,0.0,0.525397,1.106049,0.560478,3.824446,0.0,0.0,0.0,50.916667,0.002747,1.067322,0.806227,6.372816,0.0,0.0,0.0,42.5,0.002686
25%,5.0,0.918333,88.656294,92.409164,74.885923,0.230652,0.030151,0.018738,237.276774,0.118652,89.898811,87.577558,70.761315,0.261719,0.016602,0.01001,218.103882,0.146973
50%,9.0,1.333333,117.803005,117.420675,105.316196,0.498779,0.146118,0.138184,249.582682,0.171936,117.124451,108.332845,95.893485,0.585449,0.092529,0.120117,237.810791,0.210449
75%,14.0,1.5,146.935043,142.000479,139.460025,0.763428,0.361694,0.427551,254.113444,0.240173,142.853776,132.894375,127.482779,0.852539,0.287598,0.445557,249.866933,0.268311
max,19.0,3.469027,243.682848,242.976391,241.722951,1.0,0.986145,1.0,255.0,0.374207,237.698004,235.436205,241.186424,1.0,0.99292,1.0,255.0,1.0


In [4]:
#Save the above dataframe for Part 3!
train_df.to_pickle("train_df.pkl")