# Image Classification

Satellite images often need to be classified (assigned to one of a fixed set of types) or searched for features of interest. Here we look at the classification case, using labelled satellite images in a range of land-use categories from the [UCMerced LandUse dataset](http://weegee.vision.ucmerced.edu/datasets/landuse.html). scikit-learn works well for general numeric data, but it has little support for working with images. Fortunately, several deep-learning and convolutional-network libraries do support images well, including Keras (backed by TensorFlow), which we use here. To run this notebook, you will first need to download the dataset and put it in `../data/`.

<!-- Direct link: http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip -->

In [None]:
import os
import intake
import glob

import numpy as np
import geoviews as gv
import holoviews as hv
import pandas as pd

gv.extension('bokeh')

#### Get the classes and files

In [None]:
path = '../data/UCMerced_LandUse/Images/'
# Sort the class names so that class indices are the same on every run and platform
classes = np.array(sorted(os.path.basename(f) for f in glob.glob(path+'*')))
files = {c: glob.glob(os.path.join(path, c, '*')) for c in classes}

In [None]:
classes

#### Split files into train and test sets

In [None]:
train_set = list(np.random.choice(np.arange(100), 80, False))
test_set = [i for i in range(100) if i not in train_set]

train_files = {c: [f for f in fs if int(f[-6:-4]) in train_set] for c, fs in files.items()}
test_files  = {c: [f for f in fs if int(f[-6:-4]) in test_set]  for c, fs in files.items()}
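
The random split above changes on every run. A minimal sketch of a reproducible variant follows, assuming each class has 100 images numbered 00 to 99 (as the filename parsing above implies). The seed value and the names `rng`, `train_ids`, and `test_ids` are illustrative and not part of the original notebook:

```python
import numpy as np

# Hypothetical seed; any fixed integer makes the split repeatable.
rng = np.random.default_rng(42)

# Draw 80 of the 100 image indices without replacement for training.
train_ids = set(rng.choice(100, size=80, replace=False).tolist())

# The remaining 20 indices form the test set.
test_ids = set(range(100)) - train_ids
```

Using sets also makes the membership checks in the file-filtering step constant-time, although with only 100 indices the difference is negligible.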

#### Define function to sample from train or test set

In [None]:
def get_sample(cls, subset='training'):
    "Loads a random image of the given class from the training or test set"
    files = train_files if subset == 'training' else test_files
    flist = list(files[cls])
    f = flist[np.random.randint(len(flist))]
    return gv.RGB.load_tiff(f).relabel(cls)

Samples are loaded as xarrays:

In [None]:
get_sample(classes[0]).data

Each sample can also be displayed directly as an RGB image:

In [None]:
gv.Layout([get_sample(s) for s in np.random.choice(classes, 4)]).cols(2)

## Define the model

A simple convolutional network using Keras:

In [None]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(100, 100, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3, 3)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(21))
model.add(Activation('softmax'))  # softmax gives a probability distribution over the 21 classes

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
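
To see how the three convolution and pooling blocks shrink the 100x100 input, the feature-map sizes can be worked out by hand. This is a rough sketch that assumes the Keras defaults apply (`'valid'` padding and stride 1 for `Conv2D`, and a pooling stride equal to the pool size for `MaxPooling2D`); the helper names `conv_out` and `pool_out` are made up for illustration:

```python
def conv_out(size, kernel):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of max pooling with stride equal to the pool size."""
    return (size - pool) // pool + 1

size = 100                               # input images are 100x100 crops
size = pool_out(conv_out(size, 3), 3)    # block 1: 100 -> 98 -> 32
size = pool_out(conv_out(size, 3), 2)    # block 2: 32 -> 30 -> 15
size = pool_out(conv_out(size, 3), 2)    # block 3: 15 -> 13 -> 6

flat_features = size * size * 64         # 64 channels after the last Conv2D
dense_params = flat_features * 64 + 64   # weights plus biases of Dense(64)
print(size, flat_features, dense_params) # prints: 6 2304 147520
```

Under those assumptions, calling `model.summary()` on the network defined above should report the same shapes, with most of the parameters sitting in the first dense layer.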

## Declare the data

We will define a generator that yields one sample at a time: a random 100x100 crop of an image, scaled to [0, 1], together with its one-hot class label:

In [None]:
ntraining = 10000

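def get_array(rgb):
    "Returns a random 100x100 crop of an RGB element as an array scaled to [0, 1]"
    h, w = rgb.interface.shape(rgb, True)
    b = np.random.randint(h-100)  # bottom row of the crop window
    l = np.random.randint(w-100)  # left column of the crop window
    bands = [np.flipud(rgb.dimension_values(d, flat=False)[b:b+100, l:l+100])/255 for d in rgb.vdims]
    return np.dstack(bands)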

choices = np.random.choice(classes, ntraining)
class_list = list(classes)

def gen_samples(choices, subset='training'):
    "Generates random arrays along with class labels"
    for c in choices:
        labels = np.zeros((len(class_list),))
        labels[class_list.index(c)] = 1
        yield get_array(get_sample(c, subset))[np.newaxis, :], labels[np.newaxis, :]
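
The two ingredients of the generator, random cropping and one-hot labelling, can be tried in isolation on synthetic data. This is only a sketch: `random_crop`, `one_hot`, and `fake_image` are hypothetical names, and the 256x256 input is an assumed stand-in for a real tile rather than data from the dataset:

```python
import numpy as np

def random_crop(image, size=100, rng=None):
    """Return a random size x size window from an (H, W, C) array."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def one_hot(index, n_classes=21):
    """Return a length-n_classes vector with a single 1 at `index`."""
    vec = np.zeros(n_classes)
    vec[index] = 1
    return vec

# Synthetic stand-in for an RGB tile already scaled to [0, 1].
fake_image = np.random.default_rng(0).random((256, 256, 3))

x = random_crop(fake_image)[np.newaxis, ...]   # add batch axis -> (1, 100, 100, 3)
y = one_hot(3)[np.newaxis, :]                  # add batch axis -> (1, 21)
```

The leading batch axis of length 1 matches what the generator yields, so each step of training sees a single image and label.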

## Run the model

In [None]:
%%time
history = model.fit_generator(gen_samples(choices), steps_per_epoch=100, epochs=100, verbose=1)

## Evaluate the model

In [None]:
(hv.Curve(history.history['loss'], 'Epoch', 'Loss'    ).options(width=400) +
 hv.Curve(history.history['acc'],  'Epoch', 'Accuracy').options(width=400))

Now let us test the predictions on the test set, first visually:

In [None]:
def get_prediction(cls):
    sample = get_sample(cls, 'test')
    array = get_array(sample)[np.newaxis, ...]
    p = model.predict(array).argmax()
    p = classes[p]
    return sample.relabel('Predicted: %s - Actual: %s' % (p, cls))

opts = dict(fontsize={'title': '8pt'}, xaxis=None, yaxis=None, width=250, height=250)
hv.Layout([get_prediction(cls).options(**opts) for cls in classes[:20]]).cols(3)

And now numerically for 500 predictions:

In [None]:
ntesting = 500
choices = np.random.choice(classes, ntesting)
class_list = list(classes)

# Draw the samples from the held-out test files rather than the default training set
prediction = model.predict_generator(gen_samples(choices, 'test'), steps=ntesting)
predictions = classes[prediction.argmax(axis=1)]

accuracy = (predictions==choices).sum()/ntesting

print(f'Accuracy on test set: {accuracy:.3f}')

Next we can see how well the classifier performs on the different categories. We'll run 20 predictions on each category:

In [None]:
def predict(cls, iterations=20):
    accurate, predictions = [], []
    for i in range(iterations):
        sample = get_sample(cls, 'test')
        array = get_array(sample)[np.newaxis, ...]
        p = model.predict(array).argmax()
        p = classes[p]
        predictions.append(p)
        accurate.append(p == cls)
    return np.sum(accurate)/float(iterations), predictions

accuracies = [(c, *predict(c)) for c in classes]

We can now visualize this data as a bar chart:

In [None]:
df = pd.DataFrame(accuracies, columns=['landuse', 'accuracy', 'predictions'])

hv.Bars(df, 'landuse', 'accuracy').options(width=700, xrotation=45, color_index='landuse', 
                                           cmap='Category20', show_legend=False)

Another interesting way of viewing this data is to look at which categories the classifier confused with one another. We will count how many times each actual category was assigned each predicted category and visualize the result as a Chord graph, where each edge is colored by the actual category. Clicking on a node highlights its edges, revealing which other categories were confused with it:

In [None]:
pdf = pd.DataFrame([(p, l) for (_, l, _, ps) in df.itertuples() for p in ps], columns=['Prediction', 'Actual'])
graph = pdf.groupby(['Prediction', 'Actual']).size().to_frame().reset_index()

hv.Chord(graph.rename(columns={0: 'Count'})).relabel('Misclassification Graph').options(
    node_color='index', cmap='Category20', edge_color_index='Actual', label_index='index',
    width=600, height=600)

Clicking on buildings, for instance, reveals a lot of confusion with overpasses, mediumresidential, and intersections, all of which share visual features. Conversely, a number of buildings were misidentified as parking lots, which is also reasonable. Forests, on the other hand, have many edges leading back to themselves, reflecting the high accuracy observed for that category in the bar chart above.
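
The confusion counts that feed the Chord graph can also be inspected directly as numbers. A minimal sketch using only the standard library follows; the `pairs` list is made-up illustrative data, not output from the trained model:

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs standing in for the classifier output.
pairs = [
    ('forest', 'forest'),
    ('forest', 'forest'),
    ('buildings', 'overpass'),
    ('buildings', 'buildings'),
    ('buildings', 'overpass'),
]

# Count every (actual, predicted) combination, then keep only the mistakes.
counts = Counter(pairs)
errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
print(errors)  # prints: {('buildings', 'overpass'): 2}
```

Sorting `errors` by count would list the most frequently confused category pairs first, which complements the interactive graph when there are many classes.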