# Random Forest Classifier
In this notebook, I'm going to try dropping the template matching step of my building finder and instead slide a fixed-size window across each image, checking whether each window contains a roof. Because the candidate patches then have a constant size, I can also use the built-in HOG features.

The first step will be to grab a bunch of image samples and pull out roof (positive) and nonroof (negative) training examples. Hopefully we can get a somewhat balanced training set.

The next step will be to get a set of representative images randomly sampled from the DHS locations and to try to classify each window within them. In this step, it will be useful to save the image patches classified as roofs and nonroofs in separate folders so that we can identify cases where misclassification occurred. We can then add these hard cases to the training set and hopefully improve our classification algorithm.

We can repeat this process until we (hopefully) get a good image patch roof/nonroof classifier.

## Get some training samples

In [1]:
import pandas as pd
import numpy as np
import cv2
from skimage.feature import hog
from sklearn import ensemble
import time
import urllib
import os
import glob
import shutil
import matplotlib.pyplot as plt
from utils import *
from features import *

%matplotlib inline

In [83]:
# Import and preview csv data
dhs_fn = '../data/IMR1990-2000_NL1992-2012_thresh100.csv'
dhs_data = pd.read_csv(dhs_fn)

# Samples to take from each location
samples = 1
# Number of locations to sample from
locations = 100

# Output image directory
out_dir = '../images/forest/samples/'
out_csv = '../data/dhs_image_metadata.csv'

# Create DataFrame for image metadata
image_data = pd.DataFrame(columns=['image', 'cellid', 'cell_lat',
                                  'cell_lon', 'lat', 'lon'])

# Sampling images from DHS locations
for index, row in dhs_data.iterrows():
    if index + 1 > locations:
        break
    
    cell_id = int(row['cellid'])
    cell_lat = row['lat']
    cell_lon = row['lon']
    # Sample images
    image_data = sample_dhs(image_data, cell_id, cell_lat, cell_lon,
                           samples, out_dir)

# Save image metadata to csv
image_data.set_index('image', inplace=True)
image_data.to_csv(out_csv)

print 'Done. Sampled {} images total.'.format(samples * locations)

Sampled 1 images from DHS cell 100498 in 0.314904928207 seconds.
Sampled 1 images from DHS cell 101916 in 0.340812921524 seconds.
Sampled 1 images from DHS cell 101940 in 0.308054208755 seconds.
Sampled 1 images from DHS cell 101946 in 0.317348003387 seconds.
Sampled 1 images from DHS cell 101976 in 0.331739902496 seconds.
Sampled 1 images from DHS cell 103383 in 0.33606004715 seconds.
Sampled 1 images from DHS cell 104103 in 0.327265977859 seconds.
Sampled 1 images from DHS cell 106271 in 0.463610887527 seconds.
Sampled 1 images from DHS cell 106977 in 0.297461032867 seconds.
Sampled 1 images from DHS cell 106991 in 0.362490177155 seconds.
Sampled 1 images from DHS cell 109148 in 0.330276966095 seconds.
Sampled 1 images from DHS cell 110577 in 0.322143793106 seconds.
Sampled 1 images from DHS cell 110578 in 0.329143047333 seconds.
Sampled 1 images from DHS cell 111296 in 0.366881132126 seconds.
Sampled 1 images from DHS cell 111297 in 0.325039863586 seconds.
Sampled 1 images from DHS 

In [5]:
# Rename images first
in_dir = '../images/forest/samples/'
t0 = time.time()
count = 0
for image_fn in glob.glob(in_dir + '*'):
    out_fn = in_dir + str(count) + '.png'
    os.rename(image_fn, out_fn)
    count += 1
t1 = time.time()
print 'Renamed {} images in {} seconds.'.format(count, (t1-t0))

Renamed 17 images in 0.0105409622192 seconds.


Ok, now that we have some images, we need to split each of them into 81 (9x9) overlapping 80x80 pixel patches, with neighboring patches offset by half a patch (40 pixels). Let's write a few functions to do that:

In [2]:
def save_patch(image, out_fn, x=0, y=0, width=80, height=80):
    """
    This function saves a specified image patch from an image.
    :param image: Input 3-channel image
    :param out_fn: Filename for output image patch
    :param x: Top left row index of image patch (first array axis)
    :param y: Top left column index of image patch (second array axis)
    :param width: Width in pixels of image patch
    :param height: Height in pixels of image patch
    """
    # Rows span the patch height and columns span its width
    patch = image[x:x+height, y:y+width, :]
    cv2.imwrite(out_fn, patch)

In [36]:
def save_patches(image_fn, out_dir, width=80, height=80):
    """
    This function takes each input image and saves however many image
    patches of the specified size as can fit in the original image. 
    Image patches are offset by half the height and width specified.
    :param image_fn: Input image filename
    :param out_dir: Folder where image patches will be saved
    :param width: Width of each image patch
    :param height: Height of each image patch
    """
    image = cv2.imread(image_fn)
    # Get dimensions of input image
    max_x, max_y = image.shape[:2]
    # Pull image patches as long as they fit in the image
    count = 0
    top_left_x = 0
    while (top_left_x + (height - 1) < max_x):
        top_left_y = 0
        while (top_left_y + (width - 1) < max_y):
            out_fn = (out_dir + os.path.basename(image_fn)[:-4] + '_' +
                      str(count) + '.png')
            save_patch(image, out_fn, top_left_x, top_left_y, width,
                       height)
            count += 1
            top_left_y += (width/2)
        top_left_x += (height/2)
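
As a quick sanity check on the 81-patch figure, here is a minimal sketch that counts the patch positions along one axis. It assumes the sample images are 400x400 pixels (the size is not stated here; it is inferred from the 9x9 grid) and that patches step by half their size, as in `save_patches`. The helper name `patch_origins` is made up for this sketch.

```python
def patch_origins(length, patch, stride):
    """Return the top-left offsets of every patch that fits along one axis."""
    # A patch starting at offset o fits when o + patch <= length,
    # which is the same condition as the while loops in save_patches.
    return list(range(0, length - patch + 1, stride))


# Assumed 400-pixel axis, 80-pixel patches, 40-pixel stride
rows = patch_origins(400, 80, 40)
cols = patch_origins(400, 80, 40)
print(rows)                   # [0, 40, 80, 120, 160, 200, 240, 280, 320]
print(len(rows) * len(cols))  # 81
```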

Great, now that we can save overlapping patches from any image (81 for each of our sample images), let's go through our samples and save the patches for each of them so that we can separate them into positive and negative training examples:

In [39]:
# Save patches from all sample images
in_dir = '../images/forest/samples/'
out_dir = '../images/forest/patches/'
t0 = time.time()
count = 0
for image_fn in glob.glob(in_dir + '*'):
    save_patches(image_fn, out_dir)
    count += 1
t1 = time.time()
print 'Saved patches for {} images in {} seconds.'.format(count,(t1-t0))

Saved patches for 142 images.


In [71]:
# Copy roof patches into a new folder with sequential filenames
in_dir = '../images/forest/training/roof/'
out_dir = '../images/forest/training/new_roof/'
t0 = time.time()
count = 0
for image_fn in glob.glob(in_dir + '*'):
    out_fn = out_dir + str(count) + '.png'
    shutil.copyfile(image_fn, out_fn)
    count += 1
t1 = time.time()
print 'Copied {} images in {} seconds.'.format(count, (t1-t0))
# Copy nonroof patches into a new folder with sequential filenames
in_dir = '../images/forest/training/nonroof/'
out_dir = '../images/forest/training/new_nonroof/'
t0 = time.time()
count = 0
for image_fn in glob.glob(in_dir + '*'):
    out_fn = out_dir + str(count) + '.png'
    shutil.copyfile(image_fn, out_fn)
    count += 1
t1 = time.time()
print 'Copied {} images in {} seconds.'.format(count, (t1-t0))

Copied 722 images in 2.19266605377 seconds.
Copied 1058 images in 3.18941497803 seconds.


## Extracting features from training examples
Ok, now that we've organized those patches into roof and nonroof classes, let's first use the feature extractor functions that we wrote earlier and see how well they do:

In [3]:
sample_dir = '../images/forest/training/'
csv_out = '../data/forest_training_data.csv'
store_image_data(sample_dir, csv_out)

Processed 722 images for roof class.
Processed 1058 images for nonroof class.
Processed 1780 images total in 149.869764805 seconds.


Now let's load those features back into the workspace so that we can use them for classification:

In [4]:
csv_in = '../data/forest_training_data.csv'
(features, colors, hogs, mags, labels, label_encoder) = \
                import_image_data(csv_in, display=True)

['nonroof' 'roof']
Got class labels for 1780 training data points.
Got feature vectors for 1780 training data points.


Let's see how well a random forest classifier does when trained on 1,000 of these examples and tested on the remaining 780:

In [15]:
# Mix up the data
perm = np.random.permutation(labels.size)
features = features[perm]
labels = labels[perm]

In [16]:
t0 = time.time()
clf = ensemble.RandomForestClassifier(n_estimators=50, random_state=0,
                                      class_weight='auto')
# Training on training examples
num_train = 1000
clf.fit(features[:num_train], labels[:num_train])
accuracy = clf.score(features[num_train:], labels[num_train:])
print 'Overall classification accuracy: {}'.format(accuracy)
t1 = time.time()
print 'Took {} seconds.'.format(t1-t0)

Overall classification accuracy: 0.917948717949
Took 0.134510993958 seconds.


In [17]:
# Figuring out the true/false positive/negative rates
y_hat = clf.predict(features[num_train:])
y = labels[num_train:]
num_test = y_hat.shape[0]
positive = sum(y)
negative = num_test - positive
print '{} test examples: {} positive, {} negative'.format(num_test,
                                                     positive, negative)
positive_hat = sum(y_hat)
negative_hat = num_test - positive_hat
print 'Predicted: {} positive, {} negative.'.format(positive_hat,
                                                   negative_hat)
# Different types of mistakes:
# 0 = correct, -1 = false positive, 1 = false negative
mistakes = y - y_hat
false_neg = mistakes > 0
false_pos = mistakes < 0
false_neg_count = sum(false_neg)
false_pos_count = sum(false_pos)
true_pos = positive_hat - false_pos_count
true_neg = negative_hat - false_neg_count
print 'Prediction results:'
print '    {} true positive, {} false positive'.format(true_pos,
                                                      false_pos_count)
print '    {} true negative, {} false negative'.format(true_neg,
                                                      false_neg_count)
print 'Roof accuracy: {}'.format(float(true_pos) / positive)
print 'Nonroof accuracy: {}'.format(float(true_neg) / negative)

780 test examples: 312 positive, 468 negative
Predicted: 288 positive, 492 negative.
Prediction results:
    268 true positive, 20 false positive
    448 true negative, 44 false negative
Roof accuracy: 0.858974358974
Nonroof accuracy: 0.957264957265
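
The four counts printed above are enough to derive the usual summary metrics. The sketch below uses only those numbers; the function name `summarize` is made up here. Note that the "Roof accuracy" printed above is what is usually called recall, and the "Nonroof accuracy" is specificity.

```python
def summarize(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / float(tp + fp)    # predicted roofs that really are roofs
    recall = tp / float(tp + fn)       # actual roofs that were found
    specificity = tn / float(tn + fp)  # actual nonroofs that were rejected
    accuracy = (tp + tn) / float(total)
    f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall,
            'specificity': specificity, 'accuracy': accuracy, 'f1': f1}


# Counts taken from the output above
metrics = summarize(tp=268, fp=20, tn=448, fn=44)
for name in sorted(metrics):
    print('{}: {:.3f}'.format(name, metrics[name]))
```

Precision comes out near 0.93, so most patches flagged as roofs are real roofs; the weaker number is recall, which is where extra hard positive examples should help.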


This looks promising: roughly 92% of the held-out patches are classified correctly, including about 86% of the roofs and 96% of the nonroofs.

## Testing the classifier
It seems that this might actually result in a good roof detector! Let's get a new set of DHS images and see how it does classifying all the image patches. We can see where it makes mistakes and then add those "hard" examples to the training set--hopefully that will improve the classifier going forward.

Ok, we've used the code above to get a new set of DHS images. Let's break those up into 81 image patches each and classify each of them, saving them into separate roof and nonroof folders:

In [55]:
in_dir = '../images/forest/samples/'
roof_dir = '../images/forest/classify/roof/'
nonroof_dir = '../images/forest/classify/nonroof/'

In [56]:
# Train random forest classifier on all training examples
clf = ensemble.RandomForestClassifier(n_estimators=50, random_state=0,
                                      class_weight='auto')
clf.fit(features, labels)

RandomForestClassifier(bootstrap=True, class_weight='auto', criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [28]:
# Classify each image patch
width = 80
height = 80
image_count = 0
t0 = time.time()
for image_fn in glob.glob(in_dir + '*'):
    # Load in image
    image = cv2.imread(image_fn)
    image_count += 1
    print 'Classifying patches in image {}.'.format(image_count)
    # Get dimensions of input image
    max_x, max_y = image.shape[:2]
    # Pull image patches as long as they fit in the image
    count = 0
    top_left_x = 0
    while (top_left_x + (height - 1) < max_x):
        top_left_y = 0
        while (top_left_y + (width - 1) < max_y):
            # Get feature vector from image patch
            patch = image[top_left_x:top_left_x+height,
                          top_left_y:top_left_y+width, :]
            color = calc_color_hist(patch)
            color = color.flatten()
            (hog, hog_bins, magnitude_hist, magnitude_bins,
             max_magnitude) = compute_hog(patch)
            feature = np.concatenate((color, hog, magnitude_hist),
                                    axis=0)
            # Classify image
            predict = clf.predict(feature)[0] # 0 = nonroof, 1 = roof
            probs = clf.predict_proba(feature)[0]
            nonroof_prob = probs[0]
            roof_prob = probs[1]
            # Decide where to save image patch
            if predict == 1:
                out_fn = (roof_dir + os.path.basename(image_fn)[:-4] +
                          '_' + str(count) + '_' + str(roof_prob) +  '.png')
            else:
                out_fn = (nonroof_dir + os.path.basename(image_fn)[:-4] +
                          '_' + str(count) + '_' + str(nonroof_prob) + '.png')
            # Save image patch
            save_patch(image, out_fn, top_left_x, top_left_y,
                           width, height)
            count += 1
            top_left_y += (width/2)
        top_left_x += (height/2)
t1 = time.time()
print 'Classified image patches for {} images in {} seconds.'.format(
                                        image_count, (t1-t0))

Classifying patches in image 1.
Classifying patches in image 2.
Classifying patches in image 3.
Classifying patches in image 4.
Classifying patches in image 5.
Classifying patches in image 6.
Classifying patches in image 7.
Classifying patches in image 8.
Classifying patches in image 9.
Classifying patches in image 10.
Classifying patches in image 11.
Classifying patches in image 12.
Classifying patches in image 13.
Classifying patches in image 14.
Classifying patches in image 15.
Classifying patches in image 16.
Classifying patches in image 17.
Classifying patches in image 18.
Classifying patches in image 19.
Classifying patches in image 20.
Classifying patches in image 21.
Classifying patches in image 22.
Classifying patches in image 23.
Classifying patches in image 24.
Classifying patches in image 25.


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
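
The ValueError above means that a feature vector passed to the classifier contained a NaN or infinite value. Which patch triggered it is not shown; one possible cause (a guess, not something confirmed here) is a degenerate patch whose histogram normalization divides by zero. A minimal guard, assuming the feature vector is any sequence of numbers, could look like the sketch below. The helper name `is_finite_vector` is hypothetical.

```python
import math


def is_finite_vector(values):
    """Return True only if every entry is a finite number (no NaN or inf)."""
    return all(not (math.isnan(float(v)) or math.isinf(float(v)))
               for v in values)


# Illustrative vectors, not real feature data
good = [0.1, 0.5, 0.4]
bad = [0.1, float('nan'), 0.4]
print(is_finite_vector(good))  # True
print(is_finite_vector(bad))   # False
```

In the loop above, a check like this could run just before `clf.predict(feature)`, skipping or logging the offending patch instead of stopping the whole run.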

Let's do this a couple more times to get more hard training examples!

## Writing annotated classified images
Our next step will be to save annotated versions of our classified images so that we can visually inspect them to see how well our classifier is doing. Some things that we want to do here are:

- Draw bounding boxes around patches identified as roofs
- Use non-maximum suppression to keep only the best patch out of overlapping patches

### Draw all bounding boxes around patches classified as roofs

In [18]:
in_dir = '../images/forest/samples/'
roof_dir = '../images/forest/classify/roof/'
nonroof_dir = '../images/forest/classify/nonroof/'
annotated_dir = '../images/forest/classify/annotated/'

In [19]:
# Train random forest classifier on all training examples
clf = ensemble.RandomForestClassifier(n_estimators=50, random_state=0,
                                      class_weight='auto')
clf.fit(features, labels)

RandomForestClassifier(bootstrap=True, class_weight='auto', criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [20]:
# Classify each image patch
width = 80
height = 80
image_count = 0
font = cv2.FONT_HERSHEY_TRIPLEX
annotation_color = (0,255,0) # green
t0 = time.time()
for image_fn in glob.glob(in_dir + '*'):
    # Load in image and one to annotate
    image = cv2.imread(image_fn)
    image_out = cv2.imread(image_fn)
    image_count += 1
    print 'Classifying patches in image {}.'.format(image_count)
    # Get dimensions of input image
    max_x, max_y = image.shape[:2]
    # Pull image patches as long as they fit in the image
    count = 0
    top_left_x = 0
    while (top_left_x + (height - 1) < max_x):
        top_left_y = 0
        while (top_left_y + (width - 1) < max_y):
            # Get feature vector from image patch
            patch = image[top_left_x:top_left_x+height,
                          top_left_y:top_left_y+width, :]
            color = calc_color_hist(patch)
            color = color.flatten()
            (hog, hog_bins, magnitude_hist, magnitude_bins,
             max_magnitude) = compute_hog(patch)
            feature = np.concatenate((color, hog, magnitude_hist),
                                    axis=0)
            # Classify image patch
            predict = clf.predict(feature)[0] # 0 = nonroof, 1 = roof
            probs = clf.predict_proba(feature)[0]
            nonroof_prob = probs[0]
            roof_prob = probs[1]
            # Annotate image when patches classified as roofs
            if predict == 1:
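                # cv2 points are (column, row), so swap the loop indices,
                # which are (row, column). Note: cv2.CV_AA is the OpenCV 2
                # name; OpenCV 3+ calls it cv2.LINE_AA.
                cv2.putText(image_out, str(roof_prob),
                            (top_left_y + width/2, top_left_x + height/2),
                            font, 0.5, annotation_color, thickness=1,
                            lineType=cv2.CV_AA)
                cv2.rectangle(image_out, (top_left_y, top_left_x),
                             (top_left_y+width-1, top_left_x+height-1),
                             annotation_color)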
            # Decide where to save image patch
            if predict == 1:
                out_fn = (roof_dir + str(roof_prob) + '_' +
                          os.path.basename(image_fn)[:-4] +
                          '_' + str(count) +  '.png')
            else:
                out_fn = (nonroof_dir + str(nonroof_prob) + '_' +
                          os.path.basename(image_fn)[:-4] +
                          '_' + str(count) + '.png')
            # Save image patch
            save_patch(image, out_fn, top_left_x, top_left_y,
                           width, height)
            count += 1
            top_left_y += (width/2)
        top_left_x += (height/2)
    # Save annotated image
    image_out_fn = annotated_dir + os.path.basename(image_fn)
    cv2.imwrite(image_out_fn, image_out)
t1 = time.time()
print 'Classified image patches for {} images in {} seconds.'.format(
                                        image_count, (t1-t0))

Classifying patches in image 1.
Classifying patches in image 2.
Classifying patches in image 3.
Classifying patches in image 4.
Classifying patches in image 5.
Classifying patches in image 6.
Classifying patches in image 7.
Classifying patches in image 8.
Classifying patches in image 9.
Classifying patches in image 10.
Classifying patches in image 11.
Classifying patches in image 12.
Classifying patches in image 13.
Classifying patches in image 14.
Classifying patches in image 15.
Classifying patches in image 16.
Classifying patches in image 17.
Classified image patches for 17 images in 114.938821077 seconds.


### Adding non-maximum suppression
What we should do here is save the locations and probabilities of all the patches classified as roofs within each image. Then, once the whole image has been classified, we can keep only the highest-probability patch from each group of overlapping patches.
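
One common way to do this is greedy non-maximum suppression. The sketch below is a minimal version, assuming each detection is stored as a `(row, col, height, width)` box together with its roof probability. The function names, the box layout, and the 0.3 overlap threshold are illustrative choices, not part of the code above.

```python
def iou(a, b):
    """Intersection over union of two (row, col, height, width) boxes."""
    r0 = max(a[0], b[0])
    c0 = max(a[1], b[1])
    r1 = min(a[0] + a[2], b[0] + b[2])
    c1 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / float(union)


def non_max_suppression(detections, iou_threshold=0.3):
    """Keep the most probable box from each group of overlapping boxes.

    detections: list of (box, probability) pairs.
    Returns the kept pairs, highest probability first.
    """
    kept = []
    for box, prob in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap too much with a better one
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, prob))
    return kept


# Illustrative detections on the 80x80 patch grid with a 40-pixel stride
detections = [((0, 0, 80, 80), 0.9), ((0, 40, 80, 80), 0.8),
              ((0, 80, 80, 80), 0.7), ((200, 200, 80, 80), 0.6)]
kept = non_max_suppression(detections)
print([prob for _, prob in kept])  # [0.9, 0.7, 0.6]
```

With 80x80 patches and a 40-pixel stride, side-by-side neighbors have an IoU of 1/3 and diagonal neighbors an IoU of 1/7, so a threshold of 0.3 suppresses the former and keeps the latter. A lower threshold would suppress diagonal neighbors as well.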