# Visualizing Sample Color Histograms
In notebook 6 of this project, I collected sample images for several classes of objects (roofs, water, etc.) and plotted their color histograms. In this notebook, I'll calculate a color histogram for each sample and then use dimensionality reduction to visualize how the samples are distributed.

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time
import os
import glob
import cv2
from color_histograms import *
import IPython.display

%matplotlib inline

## Creating Pandas data structures
The first step is to decide how to store the data in Pandas data structures so that it can be saved to a spreadsheet (CSV) file for later use.

In [307]:
# Create data frame for image data: five metadata columns followed by
# the 48 histogram columns hist1 ... hist48
imageData = pd.DataFrame(columns=['class', 'imageID', 'filename',
                                  'width', 'height'] +
                         ['hist{}'.format(i) for i in range(1, 49)])
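
The 48 `hist` columns are consistent with a 16-bin histogram for each of three color channels (16 × 3 = 48). `calc_color_hist` lives in the project's `color_histograms` module, which is not shown in this notebook, so the following is only a sketch of what such a function might compute, written with NumPy alone. The name `sketch_color_hist`, the 0 to 256 value range, and the normalization are assumptions, not the project's actual implementation.

```python
import numpy as np

def sketch_color_hist(image, bins=16):
    """Hypothetical stand-in for calc_color_hist (assumed behavior).

    Returns a (3 * bins, 1) column vector: one normalized histogram
    per channel, stacked in channel order.
    """
    channels = []
    for c in range(image.shape[2]):
        counts, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        # Normalize so images of different sizes are comparable
        channels.append(counts / float(counts.sum()))
    return np.concatenate(channels).reshape(-1, 1)

# A random 8-bit, 3-channel "image" just to check the shape
rng = np.random.RandomState(0)
fake_image = rng.randint(0, 256, size=(20, 30, 3)).astype(np.uint8)
hist = sketch_color_hist(fake_image, bins=16)
print(hist.shape)  # (48, 1)
```

Transposing a column vector like this gives a single row of 48 values, which matches how the histogram is turned into one DataFrame row per image.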

In [308]:
# Directories containing samples
imageDir = '../images/'
samplesDir = imageDir + 'samples/'
roofsDir = samplesDir + 'roofs/'
waterDir = samplesDir + 'water/'
vegDir = samplesDir + 'vegetation/'

Let's start with the **roof** class:

In [309]:
className = 'roof'

In [310]:
# Iterate through images in directory and add image data to DataFrame
count = 0
for imagePath in glob.iglob(roofsDir + '*.png'):
    count += 1
    # print('Processing image {}'.format(count))
    # Get image basename with extension
    imageFn = os.path.basename(imagePath)
    # Get image ID (basename without the extension)
    imageID = os.path.splitext(imageFn)[0]
    # Read in image and get dimensions
    image = cv2.imread(imagePath)
    h, w = image.shape[:2]
    # Calculate color histogram and transpose it into a single row
    hist = calc_color_hist(image, bins=16)
    hist = np.transpose(hist)
    # Create temporary histogram DataFrame with columns hist1, hist2, ...
    histdf = pd.DataFrame(hist)
    histdf.columns = ['hist{}'.format(i + 1) for i in histdf.columns]
    # Build a one-row DataFrame with the image metadata
    img_data = pd.DataFrame({'class': [className], 'imageID': [imageID],
                             'filename': [imageFn], 'width': [w],
                             'height': [h]})
    # Add histogram DataFrame
    img_data = pd.concat([img_data, histdf], axis=1)
    # Append the row; ignore_index=True keeps the row index continuous
    imageData = pd.concat([imageData, img_data], ignore_index=True)
print('Processed {} images in total'.format(count))

Processed 16 images in total
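
The same loop is reused for each class, so it could be wrapped in a function. The sketch below is one way to do that; `collect_rows`, `load_fn`, and `hist_fn` are hypothetical names, and the loader and histogram function are passed in as arguments (in this notebook they would be `cv2.imread` and `calc_color_hist`) so that the example runs on its own. It also gathers plain dictionaries and builds the DataFrame once at the end, which avoids growing a DataFrame row by row.

```python
import os
import numpy as np
import pandas as pd

def collect_rows(image_paths, class_name, load_fn, hist_fn, bins=16):
    """Build one DataFrame row per image (hypothetical helper)."""
    rows = []
    for path in image_paths:
        filename = os.path.basename(path)
        image = load_fn(path)
        if image is None:
            # cv2.imread returns None when a file cannot be read
            continue
        h, w = image.shape[:2]
        row = {'class': class_name,
               'imageID': os.path.splitext(filename)[0],
               'filename': filename, 'width': w, 'height': h}
        # Flatten the histogram and store it as hist1, hist2, ...
        values = np.asarray(hist_fn(image, bins=bins)).ravel()
        for i, value in enumerate(values):
            row['hist{}'.format(i + 1)] = value
        rows.append(row)
    meta_cols = ['class', 'imageID', 'filename', 'width', 'height']
    n_hist = len(rows[0]) - len(meta_cols) if rows else 0
    hist_cols = ['hist{}'.format(i + 1) for i in range(n_hist)]
    return pd.DataFrame(rows, columns=meta_cols + hist_cols)

# Stand-ins so the sketch runs without image files (assumptions)
def fake_load(path):
    return np.zeros((4, 6, 3), dtype=np.uint8)

def fake_hist(image, bins=16):
    return np.ones((3 * bins, 1))

demo = collect_rows(['a/roof_01.png', 'a/roof_02.png'], 'roof',
                    fake_load, fake_hist)
print(demo.shape)  # (2, 53)
```

With the real functions, a call might look like `collect_rows(glob.glob(roofsDir + '*.png'), 'roof', cv2.imread, calc_color_hist)`, assuming `calc_color_hist` accepts a `bins` keyword as it does in the cells here.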


In [311]:
# Drop any rows with duplicated information
imageData.drop_duplicates(['class', 'imageID'], inplace=True)
# Reorganize columns: metadata first, then hist1 ... hist48
imageData = imageData[['class', 'imageID', 'filename', 'width', 'height'] +
                      ['hist{}'.format(i) for i in range(1, 49)]]

In [312]:
# Writing image data to csv
dataDir = '../data/'
imageDataFn = dataDir + 'imageData.csv'
imageData.to_csv(imageDataFn)

Now that the roof sample data is saved to a spreadsheet, we want to load it back from the spreadsheet and add data for the water and vegetation samples.

In [313]:
# Loading image data from csv
imageData = pd.read_csv(imageDataFn)
imageData.drop('Unnamed: 0', axis=1, inplace=True)
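
The `Unnamed: 0` column appears because `to_csv` writes the row index as an unlabeled first column by default, and `read_csv` then reads it back as data. One alternative, sketched here with a small made-up DataFrame and an in-memory buffer instead of a file, is to pass `index=False` when writing.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for imageData
df = pd.DataFrame({'class': ['roof', 'water'],
                   'imageID': ['r01', 'w01'],
                   'hist1': [0.25, 0.5]})

# Default: the index is written and comes back as 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If the file is written with `index=False`, the `drop('Unnamed: 0', ...)` step is no longer needed after loading.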

In [314]:
# Adding water samples
className = 'water'

# Iterate through images in directory and add image data to DataFrame
count = 0
for imagePath in glob.iglob(waterDir + '*.png'):
    count += 1
    # print('Processing image {}'.format(count))
    # Get image basename with extension
    imageFn = os.path.basename(imagePath)
    # Get image ID (basename without the extension)
    imageID = os.path.splitext(imageFn)[0]
    # Read in image and get dimensions
    image = cv2.imread(imagePath)
    h, w = image.shape[:2]
    # Calculate color histogram and transpose it into a single row
    hist = calc_color_hist(image, bins=16)
    hist = np.transpose(hist)
    # Create temporary histogram DataFrame with columns hist1, hist2, ...
    histdf = pd.DataFrame(hist)
    histdf.columns = ['hist{}'.format(i + 1) for i in histdf.columns]
    # Build a one-row DataFrame with the image metadata
    img_data = pd.DataFrame({'class': [className], 'imageID': [imageID],
                             'filename': [imageFn], 'width': [w],
                             'height': [h]})
    # Add histogram DataFrame
    img_data = pd.concat([img_data, histdf], axis=1)
    # Append the row; ignore_index=True keeps the row index continuous
    imageData = pd.concat([imageData, img_data], ignore_index=True)
print('Processed {} images in total'.format(count))

Processed 14 images in total


In [315]:
# New class of samples
className = 'vegetation'

# Iterate through images in directory and add image data to DataFrame
count = 0
for imagePath in glob.iglob(vegDir + '*.png'):
    count += 1
    # print('Processing image {}'.format(count))
    # Get image basename with extension
    imageFn = os.path.basename(imagePath)
    # Get image ID (basename without the extension)
    imageID = os.path.splitext(imageFn)[0]
    # Read in image and get dimensions
    image = cv2.imread(imagePath)
    h, w = image.shape[:2]
    # Calculate color histogram and transpose it into a single row
    hist = calc_color_hist(image, bins=16)
    hist = np.transpose(hist)
    # Create temporary histogram DataFrame with columns hist1, hist2, ...
    histdf = pd.DataFrame(hist)
    histdf.columns = ['hist{}'.format(i + 1) for i in histdf.columns]
    # Build a one-row DataFrame with the image metadata
    img_data = pd.DataFrame({'class': [className], 'imageID': [imageID],
                             'filename': [imageFn], 'width': [w],
                             'height': [h]})
    # Add histogram DataFrame
    img_data = pd.concat([img_data, histdf], axis=1)
    # Append the row; ignore_index=True keeps the row index continuous
    imageData = pd.concat([imageData, img_data], ignore_index=True)
print('Processed {} images in total'.format(count))

Processed 24 images in total


In [316]:
# Drop any rows with duplicated information
imageData.drop_duplicates(['class', 'imageID'], inplace=True)
# Reorganize columns: metadata first, then hist1 ... hist48
imageData = imageData[['class', 'imageID', 'filename', 'width', 'height'] +
                      ['hist{}'.format(i) for i in range(1, 49)]]

In [317]:
# Writing image data to csv
dataDir = '../data/'
imageDataFn = dataDir + 'imageData.csv'
imageData.to_csv(imageDataFn)

The next step is to read the data back into NumPy arrays so that I can use `sklearn` tools to visualize it in two dimensions.
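
As a sketch of that step, the histogram columns can be pulled out of the DataFrame as a NumPy array and projected onto two dimensions. The example below uses made-up data and a plain NumPy PCA (centering followed by SVD) so that it has no extra dependencies; with `sklearn`, `PCA(n_components=2).fit_transform(X)` should produce the same projection up to the sign of each axis. The choice of PCA is an assumption, since the notebook does not say which method it will use.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for imageData: 6 samples, 48 histogram features
rng = np.random.RandomState(0)
hist_cols = ['hist{}'.format(i) for i in range(1, 49)]
df = pd.DataFrame(rng.rand(6, 48), columns=hist_cols)
df.insert(0, 'class', ['roof', 'roof', 'water', 'water',
                       'vegetation', 'vegetation'])

# Feature matrix (n_samples, n_features) and class labels
X = df[hist_cols].values
labels = df['class'].values

# PCA by hand: center the features, then use the top two
# right singular vectors as the projection axes
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered.dot(Vt[:2].T)

print(X_2d.shape)  # (6, 2)
```

Each row of `X_2d` can then be drawn with `plt.scatter`, using `labels` to pick a color per class.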

One other thing that still needs to be figured out is dropping rows with duplicated information. If I want to process additional images, I don't want to duplicate data.
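
One possible approach, sketched with made-up rows: `drop_duplicates` on the `class` and `imageID` columns removes repeats after the fact (as the cells above already do), and a set of existing keys lets new images be skipped before any histogram is computed. The variable names here are illustrative.

```python
import pandas as pd

# Made-up rows; 'r01' was accidentally processed twice
imageData = pd.DataFrame({'class': ['roof', 'roof', 'water'],
                          'imageID': ['r01', 'r01', 'w01'],
                          'width': [64, 64, 32]})

# Keep the first row for each (class, imageID) pair
deduped = imageData.drop_duplicates(subset=['class', 'imageID'])
deduped = deduped.reset_index(drop=True)

# Keys that are already stored, for checking before processing
seen = set(zip(deduped['class'], deduped['imageID']))

candidates = [('roof', 'r01'), ('roof', 'r02'), ('water', 'w01')]
to_process = [key for key in candidates if key not in seen]
print(to_process)  # [('roof', 'r02')]
```

Checking against `seen` inside the image loop would avoid recomputing histograms for samples that are already in the spreadsheet.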