# New DHS images
In this notebook, I will download a new set of DHS images. We'll see how well our current two-step classifier works on this new test set. We can also use these new images to collect more training data for future use.

### Importing DHS data
First, let's import the data created earlier when we were examining the correlation between nightlights and infant mortality rates.

In [1]:
import pandas as pd

In [2]:
# Import and preview csv data
dhs_fn = '../data/IMR1990-2000_NL1992-2012_thresh100.csv'
dhs_data = pd.read_csv(dhs_fn)
dhs_data.head()

Unnamed: 0,cellid,birthsAfter,birthsBefore,dIMR,dNL,imrAfter,imrBefore,lat,lon,nlAfter,nlBefore
0,100498,891,1392,6.864624,0,37.037037,30.172413,-20.25,28.75,5,5
1,101916,173,350,-33.95541,0,28.901733,62.857143,-19.25,17.75,1,1
2,101940,187,292,18.148853,0,69.518715,51.369862,-19.25,29.75,1,1
3,101946,162,146,48.114319,1,123.456787,75.342468,-19.25,32.75,2,1
4,101976,402,295,-51.0667,3,9.950249,61.016949,-19.25,47.75,4,1


### Sample images from each DHS cell
Next, we want to sample images from each half-degree DHS cell using the Google Static Maps API. Ideally, we want to be able to get an arbitrary number of images from each cell and store metadata about each image in a CSV file.

In [3]:
import urllib
import numpy as np
import time

In [8]:
def sample_image(out_fn, lat, lon, height=400, width=400, zoom=19):
    """
    This function uses the Google Static Maps API to download and save
    one satellite image.
    :param out_fn: Output filename for saved image
    :param lat: Latitude of image center
    :param lon: Longitude of image center
    :param height: Height of image in pixels
    :param width: Width of image in pixels
    :param zoom: Zoom level of image
    """
    # Google Static Maps API key
    api_key = 'AIzaSyAejgapvGncLMRiMlUoqZ2h6yRF-lwNYMM'
    
    # Save satellite image
    url_pattern = 'https://maps.googleapis.com/maps/api/staticmap?center=%0.6f,%0.6f&zoom=%s&size=%sx%s&maptype=satellite&key=%s'
    # The size parameter is ordered width x height
    url = url_pattern % (lat, lon, zoom, width, height, api_key)
    urllib.urlretrieve(url, out_fn)
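
To make the request format easier to check without downloading anything, here is a minimal sketch that only builds the URL. The helper name `build_static_map_url` and the environment variable `GOOGLE_MAPS_API_KEY` are assumptions for illustration and are not part of the original notebook. Note that the `size` parameter takes the width first, then the height.

```python
import os


def build_static_map_url(lat, lon, api_key, width=400, height=400, zoom=19):
    """Return a Google Static Maps URL for one satellite image.

    Hypothetical helper: it only formats the URL and makes no request.
    """
    pattern = ('https://maps.googleapis.com/maps/api/staticmap'
               '?center=%0.6f,%0.6f&zoom=%d&size=%dx%d'
               '&maptype=satellite&key=%s')
    # size is ordered width x height
    return pattern % (lat, lon, zoom, width, height, api_key)


# Assumed convention: read the key from the environment instead of
# hard-coding it; fall back to a placeholder if the variable is unset.
api_key = os.environ.get('GOOGLE_MAPS_API_KEY', 'YOUR_API_KEY')
url = build_static_map_url(37.421179, -122.157794, api_key,
                           width=640, height=400)
```

With a non-square size such as 640x400, swapping the two arguments would silently request a 400x640 image, which the default 400x400 setting hides.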

Let's test out the function and make sure that we can save one image at a time at the desired location:

In [22]:
# Get a picture of Rains 214
out_fn = '../images/practice/Rains214.png'
lat = 37.421179
lon = -122.157794
sample_image(out_fn, lat, lon)

Now we need a function to sample many images from each DHS location. Our goal is to sample multiple images at random from each DHS cell and to store metadata about each image: the DHS cell ID it belongs to, the cell's latitude and longitude, and the image's own latitude and longitude.

In [42]:
def sample_dhs(image_data, cell_id, cell_lat, cell_lon, samples,
               out_dir):
    """
    This function samples multiple images at random for a DHS location
    and saves them in the output directory.
    :param image_data: DataFrame containing image metadata
    :param cell_id: Cell ID of DHS location
    :param cell_lat: Latitude of DHS location
    :param cell_lon: Longitude of DHS location
    :param samples: Number of samples to get
    :param out_dir: Directory for sampled images
    :returns: DataFrame containing updated image metadata
    """
    t_start = time.time()
    
    # Sample images
    for i in range(samples):
        # Randomly sample within half-degree cell
        lat = cell_lat + np.random.uniform(-0.25, 0.25)
        lon = cell_lon + np.random.uniform(-0.25, 0.25)
        # Determine output filename
        fn = str(cell_id) + '_' + str(i) + '.png'
        out_fn = out_dir + fn
        # Save image
        sample_image(out_fn, lat, lon)
        # Update image metadata
        temp_data = pd.DataFrame({'image': [fn], 'cellid': [cell_id],
                                 'cell_lat': [cell_lat], 'cell_lon': 
                                 [cell_lon], 'lat': [lat], 'lon': 
                                 [lon]})
        image_data = pd.concat([image_data, temp_data])
    
    # Return updated image metadata
    t_end = time.time()
    print 'Sampled {} images from DHS cell {} in {} seconds.'.format(
                                                    samples, cell_id,
                                                    (t_end - t_start))
    return image_data
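
The random sampling step can also be tried on its own, without calling the API. The sketch below assumes, as the ±0.25 degree offsets in `sample_dhs` suggest, that `cell_lat` and `cell_lon` mark the center of a half-degree cell. The helper name `sample_points_in_cell` and its `seed` argument are hypothetical additions.

```python
import random


def sample_points_in_cell(cell_lat, cell_lon, samples, half_width=0.25,
                          seed=None):
    """Draw (lat, lon) pairs uniformly from a square cell.

    The cell is centered on (cell_lat, cell_lon) and extends half_width
    degrees in each direction. Passing a seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(samples):
        lat = cell_lat + rng.uniform(-half_width, half_width)
        lon = cell_lon + rng.uniform(-half_width, half_width)
        points.append((lat, lon))
    return points


# Five points from the first DHS cell in the preview table above
points = sample_points_in_cell(-20.25, 28.75, samples=5, seed=0)
```

Separating the coordinate draw from the download makes it possible to confirm that every point stays inside its cell before spending any API quota.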

To test this out, let's get some samples from DHS locations specified in our DHS data csv:

In [48]:
# Import csv data
dhs_fn = '../data/IMR1990-2000_NL1992-2012_thresh100.csv'
dhs_data = pd.read_csv(dhs_fn)

# Samples to take from each location
samples = 5
# Number of locations to sample from
locations = 100

# Output image directory
out_dir = '../images/practice/in/'
out_csv = '../data/dhs_image_metadata.csv'

# Create DataFrame for image metadata
image_data = pd.DataFrame(columns=['image', 'cellid', 'cell_lat',
                                  'cell_lon', 'lat', 'lon'])

# Sampling images from DHS locations
for index, row in dhs_data.iterrows():
    if index + 1 > locations:
        break
    
    cell_id = int(row['cellid'])
    cell_lat = row['lat']
    cell_lon = row['lon']
    # Sample images
    image_data = sample_dhs(image_data, cell_id, cell_lat, cell_lon,
                           samples, out_dir)

# Save image metadata to csv
image_data.set_index('image', inplace=True)
image_data.to_csv(out_csv)

print 'Done. Sampled {} images total.'.format(samples * locations)

Sampled 5 images from DHS cell 100498 in 1.73706912994 seconds.
Sampled 5 images from DHS cell 101916 in 1.61488604546 seconds.
Sampled 5 images from DHS cell 101940 in 1.67446708679 seconds.
Sampled 5 images from DHS cell 101946 in 1.70053792 seconds.
Sampled 5 images from DHS cell 101976 in 1.48707222939 seconds.
Sampled 5 images from DHS cell 103383 in 1.59422588348 seconds.
Sampled 5 images from DHS cell 104103 in 1.53859400749 seconds.
Sampled 5 images from DHS cell 106271 in 1.61935305595 seconds.
Sampled 5 images from DHS cell 106977 in 1.72392296791 seconds.
Sampled 5 images from DHS cell 106991 in 1.68160200119 seconds.
Sampled 5 images from DHS cell 109148 in 1.70569705963 seconds.
Sampled 5 images from DHS cell 110577 in 2.10972499847 seconds.
Sampled 5 images from DHS cell 110578 in 1.56889605522 seconds.
Sampled 5 images from DHS cell 111296 in 1.81126403809 seconds.
Sampled 5 images from DHS cell 111297 in 1.69076895714 seconds.
Sampled 5 images from DHS cell 112736 in 1.

Let's try to batch classify these practice images using the two-step batch classification function:

In [4]:
import sklearn

In [5]:
sklearn.__path__

['/afs/cs.stanford.edu/u/nealjean/.local/lib/python2.7/site-packages/sklearn']

In [9]:
from classify import *
import os
import time
import cv2
import glob
import numpy as np
import pandas as pd

In [49]:
# Rename images first
in_dir = '../images/practice/in/'
t0 = time.time()
count = 0
for image_fn in glob.glob(in_dir + '*'):
    image = cv2.imread(image_fn)
    out_fn = in_dir + str(count) + '.png'
    cv2.imwrite(out_fn, image)
    count += 1
t1 = time.time()
print 'Renamed {} images in {} seconds.'.format(count, (t1-t0))

Renamed 500 images in 15.0758049488 seconds.
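
The cell above decodes and re-encodes every image through OpenCV and leaves the original files in place. If only the filenames need to change, a sketch along the following lines renames the files directly and returns the old-to-new mapping, so the `image` column in the metadata CSV could still be matched afterwards. The helper `rename_sequentially` is hypothetical, and it assumes that no existing file in the directory starts with `__tmp_`.

```python
import os
import tempfile


def rename_sequentially(in_dir, ext='.png'):
    """Rename files ending in ext to 0.png, 1.png, ... within in_dir.

    Returns a dict mapping each old filename to its new filename.
    Assumes no existing file in in_dir starts with '__tmp_'.
    """
    names = sorted(fn for fn in os.listdir(in_dir) if fn.endswith(ext))
    # Pass 1: move everything to temporary names so that a target such
    # as 0.png can never overwrite a file that has not been moved yet.
    for i, fn in enumerate(names):
        tmp = '__tmp_%d%s' % (i, ext)
        os.rename(os.path.join(in_dir, fn), os.path.join(in_dir, tmp))
    # Pass 2: move the temporary names to their final numbered names.
    mapping = {}
    for i, fn in enumerate(names):
        tmp = '__tmp_%d%s' % (i, ext)
        new_fn = str(i) + ext
        os.rename(os.path.join(in_dir, tmp), os.path.join(in_dir, new_fn))
        mapping[fn] = new_fn
    return mapping


# Demonstration on empty placeholder files in a temporary directory
demo_dir = tempfile.mkdtemp()
for fn in ['100498_0.png', '100498_1.png', '101916_0.png']:
    open(os.path.join(demo_dir, fn), 'wb').close()
mapping = rename_sequentially(demo_dir)
renamed = sorted(os.listdir(demo_dir))
```

Because no pixel data is read or written, this should be much quicker than round-tripping each image through `cv2.imread` and `cv2.imwrite`.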


In [50]:
# Batch classify parameters
out_dir = '../images/practice/out/'
template_fn = '../images/templates/template1.png'
image_data_fn = '../data/all_image_data.csv'
hog_data_fn = '../data/hog_training_data.csv'

In [52]:
two_step_batch_classify(in_dir, out_dir, template_fn, image_data_fn,
                       hog_data_fn)

Classified 500 images in 211.244637012 seconds.


This two-step classification test worked better than the simple color histogram classifier from earlier, but definitely not as well as on the original DHS image set (as expected, since the training data was taken from that set).

Let's repeat this process to get a lot more training samples for the random forest classifier, then test again.

## New Idea: Use Random Forests for everything

In [50]:
csv_in = '../data/hog_training_data.csv'
(features, colors, hogs, mags, labels, label_encoder) = \
            import_image_data(csv_in, display=False)

In [51]:
features.shape

(384, 73)

In [52]:
from sklearn import ensemble

In [53]:
# Mix up the data
perm = np.random.permutation(labels.size)
features = features[perm]
labels = labels[perm]

In [54]:
forest = ensemble.RandomForestClassifier(random_state=0,
                                        class_weight='auto')
# Train on the first 300 shuffled examples, then score on the held-out remainder
forest.fit(features[:300], labels[:300])
accuracy = forest.score(features[300:], labels[300:])
print 'Overall classification accuracy: {}'.format(accuracy)

Overall classification accuracy: 0.964285714286
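
No seed is set for the permutation in the cells shown, so the 300/84 split can differ between runs even though the forest itself uses `random_state=0`. A minimal sketch of a repeatable split is below. The helper `shuffled_split` is hypothetical and uses only the standard library; the index lists it returns can be used to index NumPy arrays such as `features` and `labels`.

```python
import random


def shuffled_split(n_examples, n_train, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of the indices."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# 384 examples in total, 300 for training, as in the cells above
train_idx, test_idx = shuffled_split(384, 300, seed=0)

# With the arrays from this notebook, the split would be applied as:
#   forest.fit(features[train_idx], labels[train_idx])
#   forest.score(features[test_idx], labels[test_idx])
```

Fixing the seed makes it easier to tell whether a change in accuracy comes from the classifier or from a different draw of test examples.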


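The training samples here came from image patches that the color histogram classifier flagged as roofs. On that data, random forests look promising for classifying image patches, although the accuracy above comes from only 84 held-out examples (384 total minus the 300 used for training).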

Next, I will divide each 400x400 pixel satellite image into a 5x5 grid of 25 patches, each 80x80 pixels, and then use random forests to classify each patch as building or non-building. This strategy also lets me use the built-in HOG functions, which will probably be faster.
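
As a rough sketch of the patch layout described above, the helper below lists the pixel bounds of each patch in row-major order. The name `patch_boxes` is hypothetical, and the sketch assumes the image dimensions are exact multiples of the patch size, as they are for 400 and 80.

```python
def patch_boxes(height=400, width=400, patch=80):
    """Return (top, left, bottom, right) bounds for each square patch."""
    if height % patch or width % patch:
        raise ValueError('image dimensions must be multiples of patch size')
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            boxes.append((top, left, top + patch, left + patch))
    return boxes


boxes = patch_boxes()

# For an image loaded as a NumPy array (for example with cv2.imread),
# each patch could then be cut out with slicing:
#   top, left, bottom, right = boxes[0]
#   patch_pixels = image[top:bottom, left:right]
```

Keeping the bounds as plain tuples also makes it easy to record which part of the image each classified patch came from.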