![ACM SIGCHI Summer School on Computational Interaction  
Inference, optimization and modeling for the engineering of interactive systems  
27th August - 1st September 2018  
University of Cambridge, UK  ](imgs/logo_full.png)





# Learning control manifolds: data-driven HCI
$$
\newcommand{\vec}[1]{{\bf #1} } 
\newcommand{\real}{\mathbb{R}}
\newcommand{\expect}[1]{\mathbb{E}[#1]}
\DeclareMathOperator*{\argmin}{arg\,min}
\vec{x}
\real$$

 

## Outline 

### Modelling sensor vector streams
* <a href="#unsupervised"> Discuss computational approaches to modelling sensors as input devices </a>
* <a href="#inherent">Discuss the ideas of inherent structures in sensor streams </a>
* <a href="#keyboard"> Look at the keyboard from a different perspective </a>
* <a href="#capturing"> Try out the "Rewarding the Original" process for analysing sensor streams </a>

### Unsupervised learning
* <a href="#notation"> Introduce standard notation for machine learning </a>
* <a href="#clustering"> Discuss basic clustering </a>

### Clustering and manifold learning
* <a href="#clustering_ex"> Try out clustering to cluster images in pixel space. </a>
* <a href="#manifold"> Introduce manifold learning, along with PCA and self-organising maps. </a>
* <a href="#manifold_hci"> Show how the keyboard can be unraveled by unsupervised learning. </a>
* <a href="#beard_pointer"> Look at how ISOMAP can build a beard pointer without supervision. </a>
* <a href="#practical"> Challenge: implement unsupervised learning from a camera feed. </a>


<a id="unsupervised"> </a>
## Unsupervised learning for sensor data
In this part, we will explore how **unsupervised learning** (and semi-supervised learning) can pull out structure from **sensors**. We can use this "natural", latent structure to build interfaces without having to predefine our controls. This gives us a substrate to which we can then attach useful actions, in the knowledge that what we are designing is based on empirically derived control features.

### Why is this computational interaction?
Unsupervised learning learns a model of interaction directly from **data**. The way in which user action is interpreted is determined empirically through a rigorous algorithmic process. This gives us both a **computational process** for capturing user behaviour systematically and mapping it onto actions, and actionable computational models that can be applied to specific problems.

We **do not** start by designing an algorithm to recognise inputs. Instead we *computationally* analyse inputs and derive an algorithm from this analysis. 

#### Properties
The unsupervised learning approach is:

| Property | Why  |
|----------|------|
|**data-driven**  |  Captures how we should interact by observing how we can interact   |
|**generalisable**  |  Makes weak assumptions about the nature of interaction. |
|**quantitative** | Provides numerical tools to design and evaluate the *design process*. |
|**objective**| Provides an analytical base before assumptions and idioms are introduced. |


### Motivation
<a id="motivation"></a>
<img src="imgs/mainfold_labeled.png">

For many conventional UI sensors, we already have good mappings from **sensor measurements** to **interface actions**. This is largely because the sensors were designed specifically to have electrical outputs which are very close to the intended actions; a traditional mechanical mouse literally emits electrical pulses at a rate proportional to translation.

But with optical sensors like a Kinect, or with a high-degree-of-freedom flexible sensor, or tricky sensors like electromyography (which measures the electrical signals present as muscles contract), these mappings become tricky. Supervised learning lets us train a system to recognise patterns in these signals (e.g. to classify poses or gestures). 
**But what if you don't know what's even feasible or would make a good interface?**

## Inherent structures

If we take sensor measurements of a person doing "random stuff" (see [Rewarding the Original](http://www.dcs.gla.ac.uk/~jhw/motionexplorerdata/), CHI 2012, for ideas on how to make "random stuff" a formal process), we will end up with a set of feature vectors that were both **performable** and **measurable** (because we know someone did them and a sensor measured them). 



One way to look at this data is to recover **inherent structure** in these measurements. We can ask some pertinent questions:
* are there **regularities** or **stable points** which represent things which might be good controls? 
    * Can we find these empirically? 
* Can we link stable points to useful actions we want to be able to do? 
* Can we infer user intentions from these stable points robustly?


<a href="https://www.youtube.com/embed/tNQJHWVB_QA"> <img src="imgs/motion_video_frame.png"> </a>

# Sampling inherent structures

Desiderata:
* find control opportunities with minimal assumptions about sensors
* capture a parsimonious space for interaction -- only that which can be done and can be sensed
* efficiently and reproducibly capture interactions
* map out characteristics of sensor vectors

<img src="imgs/map.png">

<a id="keyboard"> </a>
## Making the familiar unfamiliar: keyboard vectors
This code will display a window. The output will change as keys are pressed.

In [None]:
# standard imports
import sys, os, time
sys.path.append("src")

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc('figure', figsize=(8.0, 4.0), dpi=180)
import scipy.stats
from scipy.stats import norm
import pykalman
import IPython

In [None]:
from key_display import key_tk
import keyboard
state = key_tk()
%gui tk
keyboard.restore_state(state)  

## Input as a stream of vectors
What is being visualised? The output is treating the keyboard as a 128 dimensional binary vector; **a point in $\real^{128}$** for each time $t$.

As keys go up and down, they switch the relevant elements of the vector on and off. This vector has some process noise and a bit of temporal smoothing applied.  The order of the elements is random but fixed. 
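
A minimal sketch of how such a vector stream could be produced is given here. It is not the code in `key_display` (which is not shown in this section); the function name `simulate_key_stream`, the exponential smoothing rule and the Gaussian noise level are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_key_stream(key_frames, n_keys=128, alpha=0.8, noise=0.02, seed=0):
    """Turn a sequence of sets of pressed key indices into noisy, smoothed vectors.

    Hypothetical helper: alpha is assumed to be an exponential smoothing factor.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_keys)      # element order is random but fixed
    state = np.zeros(n_keys)
    stream = []
    for pressed in key_frames:
        target = np.zeros(n_keys)
        idx = np.array(sorted(pressed), dtype=int)
        target[order[idx]] = 1.0         # switch on the elements for held keys
        # temporal smoothing: state changes cannot be instantaneous
        state = alpha * state + (1.0 - alpha) * target
        # additive measurement noise on every element
        stream.append(state + rng.normal(0.0, noise, n_keys))
    return np.array(stream), order

# hold key 3 for 30 samples, then release it for 30 samples
frames = [{3}] * 30 + [set()] * 30
stream, order = simulate_key_stream(frames)
print(stream.shape)
```

Each row of `stream` is one sample of the synchronous stream; the element for the held key rises smoothly towards 1 and decays back towards 0 after release.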

This is an unfamiliar way of looking at a keyboard input, where we might expect to consume key information asynchronously from an event loop, and key events would come as fully formed data structures.

However, this is typical for sensors that might be encountered:
* There is **noise**, or uncertainty in measurement.
* There is a **very high dimension** of state measured, but a low dimension of control exerted.
* There are **continuous dynamics**; instantaneous changes of state are not possible.
* Data comes as a **regular array**; but without much more structure than that.
* Input comes **synchronously**, as a sampled stream.

The "ordinary" keyboard input is a highly massaged, processed version of raw input (not that the visualised vector version is an authentic representation of the raw input either).


<a id="capturing"> </a>
## Rewarding the Original: Capturing the repertoire
We can run the code again, and use a simplified rewarding the original algorithm to capture the "interesting" vectors.

At the bottom, there is a count shown. This is a count of the number of *unique vectors* (for some sense of unique) seen so far. Try pressing a key a few times; the counter will increase then stop increasing.

A collection of vectors is being sampled as the process runs. This *repertoire* is augmented with a new input if the input is different enough from those seen before. This allows us to collect all of the distinct vectors that this user/input device combination is capable of generating.
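
The repertoire update described above can be sketched in a few lines. This is an illustrative version only, assuming a Euclidean metric and a fixed threshold; `update_repertoire` is a hypothetical name, and the implementation inside `key_display` may differ.

```python
import numpy as np

def update_repertoire(bag, vec, threshold=0.42):
    """Append vec to bag if it is further than threshold from every stored vector.

    Returns True if the vector was novel enough to be added.
    """
    vec = np.asarray(vec, dtype=float)
    if len(bag) == 0:
        bag.append(vec)
        return True
    # Euclidean distance from the new vector to every vector in the repertoire
    dists = np.linalg.norm(np.array(bag) - vec, axis=1)
    if dists.min() > threshold:
        bag.append(vec)
        return True
    return False

bag = []
candidates = [[0, 0], [0.1, 0], [1, 0], [1, 0.2]]
results = [update_repertoire(bag, v) for v in candidates]
print(results, len(bag))
```

Only the first and third candidates are added; the other two lie within the threshold of a vector that is already stored, which is why the on-screen counter stops increasing when a key is pressed repeatedly.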

In [None]:
vec_list = {"data":[], "time":[]} # this will hold the collected vectors
state = key_tk(rwo_kwargs={"bag":vec_list, "threshold":0.42, "metric":'euclidean'}, alpha=0.8)
%gui tk  

In [None]:
keyboard.restore_state(state)  
try:
    os.mkdir("captured_data")
except OSError:
    pass

# save the state to a file
fname = "captured_data/rwo_{0}.npy".format(time.asctime().replace(" ", "_").replace(":", "_"))
print(fname)
np.save(fname, np.array(vec_list["data"]))
print(np.load(fname).shape)

#### Viewing the vectors
We can view this as a matrix, showing each captured vector as a row, in time order.

In [None]:
plt.imshow(np.array(vec_list["data"]).T)

### Measuring results

We can use these results to capture something about "how much" of the possible space is explored. For example, we could compare the diversity of results using a single finger running over the keyboard to a fist running over the keyboard.

Some simple measures are:
* the number of vectors generated in total
* the "invention rate"; number of vectors per second
* the "volume" of the space that is explored (computed as the log-determinant of the covariance matrix)


In [None]:
print("No. vectors: {0}".format(len(vec_list["data"])))
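# each captured vector is a row, so tell np.cov that variables are in columns (rowvar=False)
print("Volume: {0:.2e}".format(np.log(np.linalg.det(np.cov(np.array(vec_list["data"]), rowvar=False)))))
# np.diff gives the interval between captures (assumed to be in seconds); invert the median to get vectors per second
print("Median invention rate: {0:.2f} vecs/second".format(1.0 / np.median(np.diff(vec_list["time"]))))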
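
When there are fewer captured vectors than dimensions, the covariance matrix is singular and the log-determinant is minus infinity. One possible workaround, sketched here as an assumption rather than as part of the original method, is to add a small ridge `eps` to the diagonal and use `np.linalg.slogdet`, which avoids overflow and underflow in the determinant. The name `log_volume` and the value of `eps` are illustrative.

```python
import numpy as np

def log_volume(vectors, eps=1e-6):
    """Log-determinant of the (ridge-regularised) covariance of a set of row vectors."""
    X = np.asarray(vectors, dtype=float)
    cov = np.atleast_2d(np.cov(X, rowvar=False))   # d x d covariance
    cov = cov + eps * np.eye(cov.shape[0])         # keep the matrix positive definite
    sign, logdet = np.linalg.slogdet(cov)
    return logdet

rng = np.random.default_rng(0)
narrow = rng.normal(scale=1.0, size=(200, 5))
wide = 2.0 * narrow                                # same shape, twice the spread
print(log_volume(narrow), log_volume(wide))
```

The regularised value is only meaningful for relative comparisons between captures that use the same `eps`.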


# Practical: Mash your keyboard
Try capturing two datasets. One precise (e.g. dragging a single finger) and one less precise (e.g. dragging a fist) over the surface.
See if you can exhaust the capture when `threshold=0.4`.

Compare the statistics for the results.

## Rewarding the original

In summary:

* We have a process for capturing sensor vectors that correspond to possible and measurable states of a sensing device.
* We can analyse the results comparing possibilities with different users, sensors and tasks, including measures such as total number of vectors, "volume", "overlap", "time of invention" and so on.
* The set we capture can be conditioned on specific variables (e.g. those vectors generated by a whole wrist, or just one finger, or while holding a coffee cup (don't try this!))
* We can look at many types of input as streams of vectors, and process these using standard machine learning tools.
* **However** this process does not distinguish noise from intentional control. A noisy sensor looks more "innovative" than a clean one. These metrics are useful relative comparisons, but must be treated cautiously.


### From analysis to synthesis
This is a powerful *analytic* tool, which we could use to analyse potential input devices or the effect of user impairments on interaction. It would be nice to be able to use these ideas for interaction *synthesis* as well. 

But one problem is that we end up with a collection of *very* high dimensional vectors. This is hard to work with -- how would we use the information captured to design?


# Unsupervised learning

<a id="notation"> </a>
## Some mathematical notation

We will be considering datasets which consist of a series of measurements. We learn from a *training set* of data.
Each measurement is called a *sample* or *datapoint*, and each measurement type is called a *feature*. 

If we have $n$ samples and $d$ features, we form a matrix $X$ of size $n \times d$, which has $n$ rows of $d$ measurements. $d$ is the **dimension** of the measurements. $n$ is the **sample size**.  Each row of $X$ is called a *feature vector*. For example, we might have 200 images of digits, each of which is a sequence of $8\times8=64$ measurements of brightness, giving us a $200 \times 64$ dataset. Each row of brightness values is a *feature vector*; each pixel position is a *feature*.

### Geometry of feature vectors
Each feature vector is a point in an $\real^d$ space. Typically the ordering of $n$ samples is not relevant; the information is represented **geometrically**.
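
As a small illustration of this geometric view, the sketch below builds a synthetic $200 \times 64$ matrix (random values standing in for real measurements) and checks that the pairwise distances between feature vectors are unchanged, up to relabelling, when the rows are shuffled. The helper name `pairwise_distances` is an assumption made for this example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X (n x d) -> n x n matrix."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # n=200 samples, d=64 features (synthetic)
D = pairwise_distances(X)

# shuffling the samples only relabels the points; the geometry is identical
perm = rng.permutation(len(X))
D_perm = pairwise_distances(X[perm])
print(D.shape, np.allclose(D[np.ix_(perm, perm)], D_perm))
```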



In [None]:
import sklearn.manifold, sklearn.cluster, sklearn.datasets, sklearn.decomposition
import scipy.stats
import cv2

## Supervised learning

Supervised learning involves learning a relationship between attribute variables and target variables; in other words learning a function which maps input measurements to target values. This can be in the context of making discrete decisions (is this image a car or not?) or learning continuous relationships (how loud will this aircraft wing be if I make the shape like this?). Most, but not all, common machine learning problems are framed as supervised learning problems.

We're going to focus on **unsupervised** learning for the rest of this section.

## Unsupervised learning
Unsupervised learning learns "interesting things" about the structure of data without any explicit labeling of points. The key idea is that datasets may have a simple underlying or *latent* representation which can be determined simply by looking at the data itself.

Two common unsupervised learning tasks are *clustering* and *dimensional reduction*. Clustering can be thought of as the unsupervised analogue of classification -- finding discrete classes in data. Dimensional reduction can be thought of as the analogue of regression -- finding a small set of continuous variables which "explain" a higher dimensional set. 



<a id="clustering"> </a>
### Clustering

Clustering tries to find well-separated (in some sense) **partitions** of a data set. It is essentially a search for natural boundaries in the data. 


<img src="imgs/cluster_img.png">

There are many, *many* clustering approaches. A simple one is *k-means*, which finds clusters via an iterative algorithm. The number of clusters must be chosen in advance. In general, it is hard to estimate the number of clusters, although there are algorithms for estimating this. k-means proceeds by choosing a set of $k$ random points as initial cluster seed points; classifying each data point according to its nearest seed point; then moving each cluster point to the mean position of all the data points that belong to it. The classify and move steps are repeated until the assignments stop changing. 

We can use this to find *dense, disconnected* regions of a dataset. In a sensor stream example, this might be a sequence of sensor inputs that occur commonly because they represent a particular state. A simple switch, for example, could be measured as a sampled signal indicating resistance. Although there would be some (very little) noise, there would be two clear clusters corresponding to the on and off states.

The k-means algorithm does not guarantee to find the best possible clustering -- it falls into *local minima*. But it often works very well.
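
The k-means procedure described above can be written directly in NumPy. This is a minimal sketch for illustration (the notebook itself uses `sklearn.cluster.KMeans`); the function name `kmeans` and the synthetic "switch" data, with invented noise levels, are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: returns (centres, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # choose k distinct data points as the initial seed points
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        new_centres = centres.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centres[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centres, centres):
            break                        # converged: centres stopped moving
        centres = new_centres
    return centres, labels

# a noisy switch: samples cluster around "off" (0) and "on" (1)
rng = np.random.default_rng(1)
switch = np.concatenate([rng.normal(0.0, 0.05, 100), rng.normal(1.0, 0.05, 100)])
centres, labels = kmeans(switch[:, None], k=2)
found = np.sort(centres.ravel())
print(found)
```

The two recovered centres sit close to the off and on levels, without the algorithm ever being told which samples came from which state.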

<img src="imgs/cluster_boundary.png">

In [None]:
digits = sklearn.datasets.load_digits()
digit_data = digits.data


selection = np.random.randint(0,200,(10,))

digit_seq = [digit_data[s].reshape(8,8) for s in selection]
plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
for i, d in enumerate(selection):    
    plt.text(4+8*i,10,"%s"%digits.target[d])
plt.axis("off")
plt.title("Some random digits from the sklearn digits dataset (8x8 images)")
plt.figure()

In [None]:
# apply principal component analysis
pca = sklearn.decomposition.PCA(n_components=2).fit(digit_data)
digits_2d = pca.transform(digit_data)

# plot each digit with a different color (these are the true labels)
plt.scatter(digits_2d[:,0], digits_2d[:,1], c=digits.target, cmap='jet', s=6)
plt.title("A 2D plot of the digits, colored by true label")
# show a few random draws from the examples, and their labels
plt.figure()

In [None]:
## now cluster the data
kmeans = sklearn.cluster.KMeans(n_clusters=10)
kmeans_target = kmeans.fit_predict(digits.data)
plt.scatter(digits_2d[:,0], digits_2d[:,1], c=kmeans_target, cmap='jet', s=6)
plt.title("Points colored by cluster inferred")

# plot some items in the same cluster
# (which should be the same digit or similar!)
def plot_same_target(target):
    plt.figure()
    selection = np.where(kmeans_target==target)[0][0:20]
    digit_seq = [digit_data[s].reshape(8,8) for s in selection]
    plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
    for i, d in enumerate(selection):    
        plt.text(4+8*i,10,"%s"%digits.target[d])
    plt.axis("off")
    plt.title("Images from cluster %d" % target)
    
for i in range(10):    
    plot_same_target(i)    


In [None]:
## now cluster the data, but do it with too few and too many clusters

for clusters in [3,20]:
    plt.figure()
    kmeans = sklearn.cluster.KMeans(n_clusters=clusters)
    kmeans_target = kmeans.fit_predict(digits.data)
    plt.scatter(digits_2d[:,0], digits_2d[:,1], c=kmeans_target, cmap='jet')
    plt.title("%d clusters is not good" % clusters)
    # plot some items in the same cluster
    # (which should be the same digit or similar!)
    def plot_same_target(target):
        plt.figure()
        # index the tuple returned by np.where *before* permuting, so the indices are actually shuffled
        selection = np.random.permutation(np.where(kmeans_target==target)[0])[0:20]
        digit_seq = [digit_data[s].reshape(8,8) for s in selection]
        plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
        for i, d in enumerate(selection):    
            plt.text(4+8*i,10,"%s"%digits.target[d])
        plt.axis("off")

    for i in range(clusters):
        plot_same_target(i)    



# Practical: Day and night
<a id="clustering_ex"></a>

Use a clustering algorithm (choose one from [sklearn](http://scikit-learn.org/stable/modules/clustering.html#clustering)) to cluster a set of images of street footage, some filmed at night, some during the day.

The images are available by loading `data/daynight.npz` using `np.load()`. This file has 512 images of size 160x65, RGB color, 8-bit unsigned integer. You can access these as:

    images = np.load("data/daynight.npz")['data']

There are also **true labels** for each image in `['target']`. **Obviously, don't use these in the clustering process!**


You should be able to cluster the images according to the time of day without using any labels. The raw pixel values can be used as features for clustering, but a more sensible approach is to summarise the image as a **color histogram**. 

This essentially splits the color space into coarse bins, and counts the occurrence of each color type. You need to choose a value for $n$ (number of bins per channel) for the histogram; smaller numbers (like 3 or 4) are usually good.
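
To make the binning concrete, here is a NumPy-only sketch of a coarse color histogram. It works directly on the three stored channels rather than converting to YUV as the notebook's `color_histogram()` does, so it is an illustration of the idea and not a replacement; the name `rgb_histogram` and the random test image are assumptions.

```python
import numpy as np

def rgb_histogram(img, n):
    """Count pixels in an n x n x n grid of color bins; returns a flat vector of length n**3."""
    pixels = img.reshape(-1, 3).astype(float)      # one row per pixel, one column per channel
    hist, _ = np.histogramdd(pixels, bins=(n, n, n), range=[(0, 256)] * 3)
    return hist.ravel()

# a synthetic 160x65 image of random 8-bit colors, standing in for one video frame
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(65, 160, 3), dtype=np.uint8)
hist = rgb_histogram(img, 3)
print(hist.shape, hist.sum())
```

With $n=3$ each image is summarised by 27 numbers instead of 31200 raw pixel values, which is a far more manageable feature vector for clustering.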

Make a function that can show the images and the corresponding cluster labels, to test how well clustering has worked; you might also see if there are additional meaningful clusters in the imagery.

### Steps
1. Load the imagery
1. Check you can plot it (use `plt.imshow`)
1. Create a set of features using `color_histogram()`
1. Try clustering it and plotting the result
1. Experiment with clustering algorithms and `color_histogram()` settings and see how this affects clustering performance.


In [None]:
def color_histogram(img, n):
    """Return the color histogram of the 2D color image img, which should have dtype np.uint8.
    n specifies the number of bins **per channel**. The histogram is computed in YUV space. """
    # compute 3 channel colour histogram using openCV
    # we convert to YUV space to make the histogram better spaced
    chroma_img = cv2.cvtColor(img, cv2.COLOR_BGR2YUV) 
    # compute histogram and reduce to a flat array
    return cv2.calcHist([chroma_img.astype(np.uint8)], channels=[0,1,2], mask=None, histSize=[n,n,n], ranges=[0,256,0,256,0,256]).ravel()
    

In [None]:
images = np.load("data/daynight.npz")['data']
plt.imshow(cv2.cvtColor(images[1,:,:,:], cv2.COLOR_BGR2RGB))
plt.grid(False)

In [None]:
## Solution goes here

<a id="manifold"></a>
# Dimensional reduction
A very common unsupervised learning task is *dimensional reduction*; taking a dataset whose points lie in $\real^h$ and mapping it to $\real^l$, with $l < h$, while retaining as much of the useful information as possible, for some definition of "useful information". The most common application is for **visualisation**, because humans are best at interpreting 2D data and struggle with higher dimensions.

**Even 3D structure can be tricky for humans to get their heads around!**
<img src="imgs/topologic.jpg">

Dimensional reduction can be thought of as a form of lossy compression -- finding a "simpler" representation of the data which captures its essential properties. This of course depends upon what the "essential properties" that we want to keep are, but generally we want to reject *noise* and keep non-random structure. We find a **subspace** that captures the meaningful variation of a dataset.

One way of viewing this process is finding *latent variables*; variables we did not directly observe, but which are simple explanations of the ones we did observe. For example, if we measure a large number of weather measurements (rainfall, pressure, humidity, windspeed), these might be a very redundant representation of a few simple variables (e.g. is there a storm?). If features correlate or cluster in the measured data we can learn this structure *even without knowing training labels*.

##### Manifold learning
One way of looking at this problem is learning a *manifold* on which the data lies (or lies close to). A *manifold* is a geometrical structure which is locally like a low-dimensional Euclidean space. Imagine data points lying on the surface of a sheet of paper crumpled into a ball, or a 1D filament or string tangled up in a 3D space. 

Manifold approaches attempt to automatically find these smooth embedded structures by examining the local structure of datapoints (often by analysing the nearest neighbour graph of points). This is more flexible than linear dimensional reduction as it can in theory unravel very complex or tangled datasets. 

However, the algorithms are usually approximate, they do not give guarantees that they will find a given manifold, and can be computationally intensive to run. 

<img src="imgs/isomap.jpg">



### Principal component analysis
One very simple method of dimensional reduction is *principal component analysis*. This is a linear method; in other words it finds rigid rotations and scalings of the data to project it onto a lower dimension. That is, it finds a matrix $A$ such that $y=Ax$ gives a mapping from $d$ dimensional $x$ to $d^\prime$ dimensional $y$.

The PCA algorithm effectively looks for the rotation that makes the dataset look "fattest" (maximises the variance), chooses that as the first dimension, then removes that dimension, rotates again to make it look "fattest" and repeats. Linear algebra makes it efficient to do this process in a single step by extracting the *eigenvectors* of the *covariance matrix*. 

PCA always finds a matrix $A$ such that $y = Ax$, where the dimension of $y$ is smaller than the dimension of $x$. PCA is exact, repeatable and very efficient, but it can only find rigid transformations of the data. This is a limitation of any linear dimensional reduction technique.
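
The eigenvector view of PCA can be sketched in a few lines of NumPy. This is a minimal illustration (the notebook uses `sklearn.decomposition.PCA`); `pca_fit` is a hypothetical helper, and the data is synthetic: points scattered along one direction in 3D with a little noise. Note that the data is centred before projection, so the mapping is $y = A(x - \mu)$.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (A, mean, eigenvalues); rows of A are the top principal directions."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)           # d x d covariance matrix
    evals, evecs = np.linalg.eigh(cov)             # eigh: symmetric matrices, ascending order
    order = np.argsort(evals)[::-1]                # sort by descending variance
    evals, evecs = evals[order], evecs[:, order]
    A = evecs[:, :n_components].T                  # shape (n_components, d)
    return A, mean, evals

rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
X = rng.normal(scale=3.0, size=(500, 1)) * direction + rng.normal(scale=0.1, size=(500, 3))

A, mean, evals = pca_fit(X, 2)
Y = (X - mean) @ A.T                               # project: y = A (x - mean)
print(A[0], Y.shape)
```

The first row of `A` lines up (up to sign) with the direction along which the data was generated, which is the "fattest" direction of the cloud.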




In [None]:

digits = sklearn.datasets.load_digits()
digit_data = digits.data

# plot a single digit data element
def show_digit(d):
    fig = plt.figure(figsize=(3,3))
    ax1 = fig.add_subplot(2,1,1)
    ax1.imshow(d.reshape(8,8), cmap='gray', interpolation='nearest')
    
    ax2 = fig.add_subplot(2,1,2)
    ax2.bar(np.arange(len(d)), d)
    fig.subplots_adjust()
    
# show a couple of raw digits
for i in range(3):
    show_digit(digit_data[np.random.randint(0,1000)])


In [None]:
plt.figure(figsize=(15,15))
plt.title("PCA")
# apply principal component analysis
pca = sklearn.decomposition.PCA(n_components=2).fit(digit_data)
digits_2d = pca.transform(digit_data)

# plot each digit with a different color
plt.scatter(digits_2d[:,0], digits_2d[:,1], c=digits.target, cmap='rainbow')

## Explaining the projections
One useful property of PCA is that we can compute exactly how "fat" each of these learned dimensions is. The *explained variance* ratio tells us how much of the original variation in the dataset is captured by each learned dimension. 

If most of the variance is in the first couple of components, we know that a 2D representation will capture much of the original dataset. If the ratios of variance are spread out over many dimensions, we will need many dimensions to represent the data well. 
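
One way to turn this into a decision rule is to accumulate the ratios and count how many components are needed to reach a chosen fraction of the variance. The sketch below does this with an SVD of the centred data; the helper names, the 90% and 99% targets and the synthetic data are illustrative assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component, largest first."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values, descending
    var = s ** 2 / (len(X) - 1)                    # variance along each component
    return var / var.sum()

def components_needed(ratios, target=0.9):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    return int(np.searchsorted(np.cumsum(ratios), target) + 1)

# synthetic data: one dominant direction, one minor one, two that are nearly noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)) * np.array([5.0, 1.0, 0.1, 0.1])
ratios = explained_variance_ratio(X)
print(ratios, components_needed(ratios, 0.9), components_needed(ratios, 0.99))
```

Here a single component already exceeds 90% of the variance and two components exceed 99%, so a 2D plot would lose very little.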

In [None]:
# We can see how many dimensions we need to represent the data well using the eigenspectrum
# here we show the first 32 components
pca = sklearn.decomposition.PCA(n_components=32).fit(digit_data)
plt.bar(np.arange(32), pca.explained_variance_ratio_)
plt.xlabel("Component")
plt.ylabel("Proportion of variance explained")

### Limitations of linearity
One example where PCA does badly is the "swiss roll dataset" -- a plane rolled up into a spiral in 3D. This has a very simple structure; a simple plane with some distortion. But PCA can never unravel the spiral to find this simple explanation because it cannot be unravelled via a linear transformation.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
swiss_pos, swiss_val = sklearn.datasets.make_swiss_roll(800, noise=0.0)
fig = plt.figure(figsize=(4,4))
# make a 3D figure
ax = fig.add_subplot(111, projection="3d")
ax.scatter(swiss_pos[:,0], swiss_pos[:,1], swiss_pos[:,2], c=swiss_val, cmap='gist_heat', s=10)


In [None]:
# Apply PCA to learn this structure (which doesn't help much)
plt.figure()
pca = sklearn.decomposition.PCA(2).fit(swiss_pos)
pca_pos = pca.transform(swiss_pos)
plt.scatter(pca_pos[:,0], pca_pos[:,1], c=swiss_val, cmap='gist_heat')

###  Nonlinear manifold learning
Other approaches to dimensional reduction look at the problem in terms of learning a *manifold*. A *manifold* is a geometrical structure which is *locally like* a low-dimensional Euclidean space. Examples are the plane rolled up in the swiss roll, or a 1D "string" tangled up in a 3D space. 

Some manifold approaches attempt to automatically find these smooth embedded structures by examining the local structure of datapoints (often by analysing the nearest neighbour graph of points). This is more flexible than linear dimensional reduction as it can in theory unravel very complex or tangled datasets. 

However, the algorithms are usually approximate, they do not give guarantees that they will find a given manifold, and can be computationally intensive to run.

A popular manifold learning algorithm is *ISOMAP* which uses nearest neighbour graphs to identify locally connected parts of a dataset.
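
The three stages of ISOMAP (neighbour graph, shortest-path distances along the graph, then a classical multidimensional scaling of those distances) can be sketched in NumPy. This is a simplified illustration, not the `sklearn.manifold.Isomap` implementation: it uses a dense Floyd-Warshall pass, so it only suits a few hundred points, and it assumes the neighbour graph is connected. The test curve is a 1D spiral embedded in 2D.

```python
import numpy as np

def isomap(X, n_neighbors=5, n_components=2):
    """Simplified ISOMAP; assumes the k-nearest-neighbour graph is connected."""
    n = len(X)
    # 1. pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # 2. keep only edges to the k nearest neighbours (made symmetric)
    G = np.full((n, n), np.inf)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    G[rows, nearest.ravel()] = D[rows, nearest.ravel()]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)
    # 3. geodesic distances: Floyd-Warshall shortest paths through the graph
    for k in range(n):
        G = np.minimum(G, G[:, k, None] + G[None, k, :])
    # 4. classical MDS on the geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n                     # centring matrix
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# a 1D spiral in 2D: the "true" coordinate is the distance travelled along the curve
t = np.linspace(0, 3 * np.pi, 150)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
embedding = isomap(X, n_neighbors=4, n_components=1)[:, 0]

steps = np.linalg.norm(np.diff(X, axis=0), axis=1)
arc = np.concatenate([[0.0], np.cumsum(steps)])
corr = abs(np.corrcoef(embedding, arc)[0, 1])
print(corr)
```

The recovered 1D coordinate is almost perfectly correlated with arc length along the spiral (the sign of the embedding is arbitrary), which is the "unravelling" that a linear method cannot perform.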



In [None]:
plt.figure()
np.random.seed(2018)
swiss_pos, swiss_val = sklearn.datasets.make_swiss_roll(800, noise=0.0)
isomap_pos = sklearn.manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(swiss_pos)
plt.scatter(isomap_pos[:,0], isomap_pos[:,1], c=swiss_val, cmap='gist_heat')

In [None]:
plt.figure()

# note that isomap is sensitive to noise!
noisy_swiss_pos, swiss_val = sklearn.datasets.make_swiss_roll(800, noise=0.5)
isomap_pos = sklearn.manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(noisy_swiss_pos)
plt.scatter(isomap_pos[:,0], isomap_pos[:,1], c=swiss_val, cmap='gist_heat')

-----------------
## Self organising maps
<a id="som"></a>

Self-organising maps are a nice half way house between clustering and manifold learning approaches. They create a dense "net" of clusters in the original (high-dimensional) space, and force the cluster points to **also** lie in a low-dimensional space with local structure, for example, on a regular 2D grid. This maps a **discretized** low-dimensional space into the high-dimensional space.

The algorithm causes the clusters to have local smoothness in both the high and the low dimensional space; it does this by forcing cluster points on the grid to move closer (in the high-d space) to their neighbours (in the low-d grid).

<img src="imgs/somtraining.png"> [Image from https://en.wikipedia.org/wiki/Self-organizing_map]

In other words: **clusters that are close together in the high-dimensional space should be close together in the low dimensional space**. This "unravels" high dimensional structure into a simple low-dimensional approximation.
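
A bare-bones training loop for a self-organising map is sketched below. It is not the `som` module used in this notebook; `train_som`, `quantisation_error`, the linear decay schedules and the synthetic data (a curved 2D sheet in 3D) are all assumptions made for illustration.

```python
import numpy as np

def train_som(data, grid_shape=(8, 8), epochs=3000, lr0=0.5, sigma0=None, seed=0):
    """Train a SOM; returns a codebook of shape (h, w, d)."""
    rng = np.random.default_rng(seed)
    h, w = grid_shape
    codebook = rng.normal(scale=0.01, size=(h, w, data.shape[1]))
    # low-dimensional grid coordinates of every node, shape (h, w, 2)
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([gy, gx], axis=-1).astype(float)
    if sigma0 is None:
        sigma0 = max(h, w) / 2.0
    for t in range(epochs):
        frac = t / float(epochs)
        lr = lr0 * (1.0 - frac)                        # learning rate decays to zero
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac     # neighbourhood shrinks to 0.5
        x = data[rng.integers(len(data))]              # pick a random sample
        # best matching unit: the node closest to x in the high-d space
        dist2 = np.sum((codebook - x) ** 2, axis=2)
        bmu = np.unravel_index(np.argmin(dist2), dist2.shape)
        # Gaussian neighbourhood around the BMU, measured on the low-d grid
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # pull the BMU and its grid neighbours towards x
        codebook += lr * influence[:, :, None] * (x - codebook)
    return codebook

def quantisation_error(codebook, data):
    """Mean distance from each sample to its nearest codebook vector."""
    flat = codebook.reshape(-1, codebook.shape[-1])
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
uv = rng.uniform(size=(500, 2))
data = np.column_stack([uv[:, 0], uv[:, 1], uv[:, 0] * uv[:, 1]])   # 2D sheet in 3D

before = quantisation_error(rng.normal(scale=0.01, size=(8, 8, 3)), data)
codebook = train_som(data)
after = quantisation_error(codebook, data)
print(before, after)
```

The quantisation error falls sharply after training, because the grid of nodes has spread out to cover the sheet on which the data lies.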


In [None]:
## Self organising maps
digits = sklearn.datasets.load_digits()
digits.data -= 8.0

In [None]:
# !conda install -c conda-forge weave
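import som
try:
    from importlib import reload  # Python 3: reload is no longer a builtin
except ImportError:
    pass  # Python 2: reload is already a builtin
som = reload(som)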
som_map = som.SOM(48,48,64)
som_map.learn(digits.data, epochs=50000)

In [None]:
# show SOM response for each *pixel* in the input image
for v in [20,30,40,50]:
    plt.figure()
    plt.imshow(som_map.codebook[:,:,v], cmap="magma", interpolation="nearest")
    plt.axis("off")

In [None]:
# Show the SOM response for one node, across *all* pixels
plt.imshow(som_map.codebook[20,20,:].reshape(8,8), cmap="gray", interpolation="nearest")
plt.grid(False)

In [None]:
def show_codebook_images():
    plt.figure(figsize=(32,32))
    for i in range(0,48,2):
        for j in range(0,48,2):
            img = som_map.codebook[i,j,:].reshape(8,8)        
            plt.imshow(img, cmap="gray", extent=[i,i+2,j,j+2])
    plt.xlim(0,48)
    plt.ylim(0,48)
    plt.axis("off")
show_codebook_images()

## The U-Matrix

One very nice aspect of the self-organising map is that we can extract the **U-matrix** which captures how close together in the **high-dimensional space** points in the **low-dimensional** map are. This lets us see whether there are natural **partitions** in the layout; wrinkles in the layout that might be good clustering points.

In [None]:
import scipy.spatial.distance

def umatrix(codebook):
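    ## take the average (squared) HD distance to all neighbours within
    ## a certain radius in the 2D grid distance
    # indexing="ij" keeps the grid coordinates in the same order as the reshape below
    x_code, y_code = np.meshgrid(np.arange(codebook.shape[0]), np.arange(codebook.shape[1]), indexing="ij")
    hdmatrix = codebook.reshape(codebook.shape[0]*codebook.shape[1], codebook.shape[2])    
    hd_distance = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(hdmatrix))**2
    ld_distance = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(np.vstack([x_code.ravel(), y_code.ravel()]).T))
    # neighbours are the grid nodes in the 8-neighbourhood (grid distance < 1.5), excluding the node itself
    neighbours = np.logical_and(ld_distance>0, ld_distance<1.5)
    # divide by the neighbour count so that edge nodes (with fewer neighbours) are not biased low
    return (np.sum(hd_distance * neighbours, axis=1) / np.sum(neighbours, axis=1)).reshape(codebook.shape[0], codebook.shape[1])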
    
plt.figure(figsize=(14,14))    
um = umatrix(som_map.codebook)    
show_codebook_images()
plt.imshow(um, interpolation="nearest", cmap="inferno", alpha=0.75, extent=[0,48,48,0])

plt.grid(False)


# Applying to HCI
These are standard machine learning techniques. How do we apply this practically to interaction? How can we solve the analysis -> synthesis problem?

<a id="manifold_hci"></a>
## Laying out the keyboard vectors

The keyboard vectors we captured at the start of this section look pretty noisy and unstructured. We can visualise them as a matrix:


In [None]:
keyboard_data = np.load("data/rwo_Sun_Aug_26_20_47_40_2018.npy")
print(keyboard_data.shape)
plt.imshow(keyboard_data.T)

Since the ordering of elements is random, there is no spatial relation among keys visible. However, because there will be correlation between keys that were pressed close in time (because of the temporal smoothing, in this case), we would expect there to be some spatial information left. 

We can apply the self-organising map to this data:

In [None]:
np.random.seed(2018)
som_map = som.SOM(8,8,128)
som_map.learn(keyboard_data, epochs=25000)

And we can view the output live, as we move across the keyboard:

In [None]:
def live_som(k_vec):
    # transform a keyboard vector to an output vector to display
    z = som_map.score(np.zeros_like(k_vec), width=1.0) # remove constant part
    result = np.clip(som_map.score(k_vec, width=1.0) -z, 0, 100)
    return np.fliplr(result.T) * 3


from key_display import key_tk
import keyboard
state = key_tk(transform_fn = live_som, shape=som_map.codebook.shape[0:2], alpha=0.8)
%gui tk
keyboard.restore_state(state) 

Note that we did **not** train the system to map keys to physical locations. We simply captured the sensor stream, and identified the **control manifold** -- the space of sensor vectors that correspond to useful control signals. We were able to return to a 2D space and recover the spatial (or rather topological) structure of the vectors and use that as a useful control input. 



<a id="beard_pointer"></a>
## Learning a pointer


### ISOMAP: The face-direction example
<a id="isomap"></a>
A well known manifold learning algorithm is *ISOMAP* which uses nearest neighbour graphs to identify locally connected parts of a dataset. This examines local neighbor graphs to find an "unraveling" of the space to a 1D or 2D subspace, which can deal with very warped high-dimensional data, and doesn't get confused by examples like the swiss roll above (assuming parameters are set correctly!).

Let's use ISOMAP (a local neighbours embedding approach) to build a real, working vision based interface.

In [None]:
# load a video of my head in different orientations
face_frames = np.load("data/face_frames.npz")['arr_0']

In [None]:
# show the video in opencv -- it's just a raw sequence of values
# the video is 700 frames of 64x64 imagery
frame_ctr = 0
# play the video back 
while frame_ctr<face_frames.shape[1]:
    frame = face_frames[:,frame_ctr].reshape(64,64)
    cv2.imshow('Face video', cv2.resize(frame, (512,512), interpolation=cv2.INTER_NEAREST))
    frame_ctr += 1
    key = cv2.waitKey(5) & 0xff
    if key  == 27:
        break
        
# clean up
cv2.destroyAllWindows()        

In [None]:
# fit isomap to the face data (this takes a few minutes)

faces = face_frames.T
np.random.seed(2018)
isomap = sklearn.manifold.Isomap(n_neighbors=25)
isomap.fit(faces)
xy = isomap.transform(faces)
orig_xy = np.array(xy)

In [None]:
## the following code just plots images on the plot without overlap
overlaps = []

def is_overlap(ra,rb):
    P1X, P2X, P1Y, P2Y = ra
    P3X, P4X, P3Y, P4Y = rb
    
    return not ( P2X <= P3X or P1X >= P4X or P2Y <= P3Y or P1Y >= P4Y )

def overlap_test(r):
    if any([is_overlap(r,rb) for rb in overlaps]):
        return False
    overlaps.append(r)
    return True

def plot_some_faces(xy, faces, thin=1.0, sz=8):
    global overlaps
    overlaps = []
    q = sz/4
    for i in range(len(xy)):
        x, y = xy[i,0], xy[i,1]
        image = faces[i,:].copy()
        
        if np.random.random()<thin:
            for j in range(10):
                x, y = xy[i,0], xy[i,1]
                x += np.random.uniform(-q,q)
                y += np.random.uniform(-q, q)
                x *= q
                y *= q
                extent = [x, x+sz, y, y+sz]
                if overlap_test(extent):                    
                    img = image.reshape(64,64)
                    img[:,0] = 1
                    img[:,-1] = 1
                    img[0,:] = 1
                    img[-1,:] = 1                    
                    plt.imshow(img, vmin=0, vmax=1, cmap="gray",interpolation="lanczos",extent=extent, zorder=100)
                    break

In [None]:
## make a 2D plot of the faces
# tweak co-ordinates

xy[:,0] = -orig_xy[:,0] / 2.5
xy[:,1] = -orig_xy[:,1] 
plt.figure(figsize=(20,20))

# plot the faces

plot_some_faces(xy, faces, sz=8)

# set the axes correctly (fixed limits, so the plot is the same on every run)
plt.gca().patch.set_facecolor('gray')
plt.xlim(-70,70)
plt.ylim(-70,70)
# pass a boolean: in recent matplotlib the string "off" is truthy and turns the grid on
plt.grid(False)


In [None]:
frame_ctr = 0
# play the video back, but show the projected dimension on the screen

while frame_ctr<face_frames.shape[1]:
    frame = face_frames[:,frame_ctr].reshape(64,64)
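    # scale to 0..255; multiplying by 256 would wrap a value of 1.0 around to 0 in uint8
    frame = np.clip(frame*255, 0, 255).astype(np.uint8)    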
    frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2RGB)
    xy = isomap.transform([face_frames[:,frame_ctr]])
    cx, cy = 256, 256
    s = 6
    x,y = xy[0]
    y = -y
    resized_frame = cv2.resize(frame, (512,512), interpolation=cv2.INTER_NEAREST)
    cv2.circle(resized_frame, (int(cx-x*s), int(cy-y*s)), 10, (0,255,0), -1)
    cv2.line(resized_frame, (cx,cy), (int(cx-x*s), int(cy-y*s)), (0,255,0))
    cv2.imshow('Face video', resized_frame)
    
    frame_ctr += 1
    key = cv2.waitKey(1) & 0xff
    if key  == 27:
        break
        
cv2.destroyAllWindows()

<a id="practical"></a>
    
--------------------
## Mapping UI controls to unsupervised structures
<a id="mapping"></a>

The point of all of this is to find **control structures** in **sensor data**. That is, to find regularities in measured values that we could use to control a user interface.

To do this, we need to map unsupervised structure onto the interface itself. We could at this point move to a supervised approach, now that we have likely candidates to target. But a simpler approach is just to hand-map unsupervised structure to controls.

#### Clusters
For example, if we have clustered a set of data (e.g. measurements of the joint angles of the hand), and extracted a set of fundamental poses, we can then create a mapping table from cluster indices to actions.

| cluster    | 1       | 2      | 3        | 4        |
|------------|---------|--------|----------|----------|
| **action** | confirm | cancel | increase | decrease |
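
A mapping table like this is only a few lines of code. The sketch below assumes the cluster centres have already been found (e.g. by k-means); the centre values, the three-dimensional feature space and the function name `pose_to_action` are invented for illustration.

```python
import numpy as np

# hypothetical cluster centres, one row per pose (e.g. from k-means on joint angles)
centres = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

# mapping table from (1-based) cluster index to action
actions = {1: "confirm", 2: "cancel", 3: "increase", 4: "decrease"}

def pose_to_action(x, centres, actions):
    """Return the action attached to the cluster centre nearest to x."""
    distances = np.linalg.norm(centres - x, axis=1)
    cluster = int(np.argmin(distances)) + 1   # table indices start at 1
    return actions[cluster]

action = pose_to_action(np.array([0.1, 0.9, 0.05]), centres, actions)
print(action)
```

The example point is closest to the third centre, so it selects the third action in the table.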

<img src="imgs/handposes.jpg" width="400px">


#### Distance transform
Sometimes it is useful to have some continuous elements in an otherwise discrete interface (e.g. to support animation on state-transitions). A useful trick is to use a **distance transform**, which takes a datapoint in the original measured space $D_H$ and returns the distances to all cluster centres. (The `transform` method of certain `sklearn` clustering algorithms, such as `KMeans`, does this transformation for you.)

This could be used, for example, to find the top two candidates for a hand pose, and show a smooth transition between actions as the hand interpolates between them.

The most obvious use of this is to **disable** any action when the distance to all clusters is too great. This implements a quiescent state and is part of solving the **Midas touch** problem; you only spend a small amount of time on a UI actively interacting and don't want to trigger actions all the time!
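
Both ideas, the top-two blend and the quiescent state, can be sketched together. The threshold `max_distance`, the blend formula and the function names are assumptions for this example, not a prescribed method; the centres are made-up values.

```python
import numpy as np

def distance_transform(x, centres):
    """Distances from x to every cluster centre."""
    return np.linalg.norm(centres - x, axis=1)

def select_action(x, centres, actions, max_distance=0.5):
    """Return (best action, runner-up action, blend), or (None, None, 0.0) when quiescent."""
    d = distance_transform(x, centres)
    order = np.argsort(d)
    best, second = order[0], order[1]
    if d[best] > max_distance:
        # too far from every known pose: do nothing
        return None, None, 0.0
    # 0.0 when x sits on the best centre, approaching 0.5 as x nears the midpoint
    blend = d[best] / (d[best] + d[second])
    return actions[best], actions[second], blend

centres = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
actions = ["confirm", "cancel", "increase", "decrease"]

print(select_action(np.array([0.25, 0.0, 0.0]), centres, actions))  # between two poses
print(select_action(np.array([5.0, 5.0, 5.0]), centres, actions))   # far from all poses
```

The `blend` value could drive an animation between the two candidate actions, while the `None` result gives the interface a resting state in which nothing is triggered.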

#### Manifolds

In the continuous case, using a dimensional reduction approach, the mapping can often be a simple transformation of the inferred manifold. This usually requires that the manifold be **oriented** correctly; for example, in the head pointing example, I adjusted the signs of the resulting 2D manifold to match the direction my nose points in. More generally, it might be necessary to scale or rotate the output with a linear transform:

$$ x_l = f(x_h)\\
x_c = Ax_l,
$$
where $x_l$ is the low-dimensional vector, $x_h$ is the high-dimensional sensor vector, $x_c$ is the vector (e.g. a cursor position) we pass to the UI, and $A$ is a hand-tuned or learned transformation matrix.

As an example, $A = \begin{bmatrix}0 & 1 \\ -1 & 0\end{bmatrix}$ exchanges the $x$ and $y$ co-ordinates and then flips the sign of the new $y$ co-ordinate, so $(x, y)$ becomes $(y, -x)$.
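
In NumPy, applying such an orienting transform is a single matrix product. This sketch uses a matrix that swaps the two co-ordinates and negates the second output; the input values are arbitrary.

```python
import numpy as np

# swap the two co-ordinates, then negate the second output: (x, y) -> (y, -x)
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

x_l = np.array([2.0, 3.0])     # one low-dimensional point
x_c = A @ x_l
print(x_c)

# a whole batch of points, one per row, is transformed with the transpose
X_l = np.array([[2.0, 3.0],
                [1.0, 0.0],
                [0.0, -4.0]])
X_c = X_l @ A.T
print(X_c)
```

Scaling can be folded into the same matrix by multiplying its rows by the desired gains.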

<img src="imgs/orienting.png">


In more complex examples, we may need to learn a more sophisticated nonlinear mapping. For example, we might apply supervised learning to map output vectors to spatial locations. This might seem like cheating -- why bother with the unsupervised part?

But the key insight is that we need vastly less training data to make this reliable. Moreover, we can factor the design process into:
* capturing a representative dataset (e.g. rewarding the original)
* estimating a good manifold  (e.g. using tSNE)
* pinning it to useful actions (e.g. using a deep neural network)

We can intervene at any part of these design processes and build on them.
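
To illustrate how little labelled data the "pinning" step can need, here is a sketch that fits a mapping from manifold co-ordinates to screen positions using only five calibration points. It uses an affine least-squares fit as a deliberately simple stand-in for the neural network mentioned above; the calibration values, the screen size and the function names are all hypothetical.

```python
import numpy as np

def fit_affine(x_low, targets):
    """Least-squares affine map from manifold co-ordinates to UI co-ordinates."""
    design = np.hstack([x_low, np.ones((len(x_low), 1))])   # append a bias column
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights                                           # shape (d_low + 1, d_ui)

def apply_affine(weights, x_low):
    """Map one point or a batch of points (one per row) to UI co-ordinates."""
    x_low = np.atleast_2d(x_low)
    design = np.hstack([x_low, np.ones((len(x_low), 1))])
    return design @ weights

# five hypothetical calibration points: where each pose landed on the manifold,
# and where the user was asked to point on a 1920x1080 screen
x_low = np.array([[-20.0, -10.0], [20.0, -10.0], [0.0, 0.0],
                  [-20.0, 10.0], [20.0, 10.0]])
screen = np.array([[0.0, 0.0], [1920.0, 0.0], [960.0, 540.0],
                   [0.0, 1080.0], [1920.0, 1080.0]])

weights = fit_affine(x_low, screen)
cursor = apply_affine(weights, [10.0, 5.0])
print(cursor)
```

Because the unsupervised step has already reduced the problem to two dimensions, the supervised step only has six parameters to estimate.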

-----
## Challenge
<a id="challenge"></a>
In this practical, you will capture images from your webcam, and build a UI **control** using unlabeled data. Without providing **any** class labels or values, you have to build an interaction that can do "something interesting" from the image data. 

You have complete freedom to choose the configuration space you want to use: you could take images of your face or hands; take images of drawn figures; image an object rotating or moving across a surface; or anything else you want.

As an illustrative example, the unsupervised approach could be used to image a soft drinks can at different rotations, and recover the rotation angle as an input (i.e. as a physical "dial").

<img src="imgs/can.jpg">

The criterion is the most **interesting** but **functional** interface. The control can be discrete (using **clustering**) or continuous (using **manifold learning**). **You don't have to map the controls onto a real UI, just extract and visualise a useful signal from the image data**.

The final system should be able to take a webcam image and output either a class or a (possibly $n$-dimensional) continuous value.

## Tips

* The webcam capture code is provided for you. `cam = Webcam()` creates a camera object and `img = cam.snap()` captures a single image from the first video device; if you have several, then you can use `cam = Webcam(1)` etc. to select the device to use. The result will be an $H\times W\times 3$ NumPy array (rows first, as OpenCV returns it), with colours **in the BGR order**.

* You should resize your image (using `scipy.ndimage.zoom`) to something small (e.g. 32x48 or 64x64) so that the learning is feasible in the time available.

* Your "interface" should probably show a 2D or 1D layout of the data in the training set, and have a mode where a new webcam image can be captured and plotted on the layout. You should consider colouring the data points by their attributes (e.g. cluster label) and/or showing some small images on the plot to get an idea of what is going on.

* You can preprocess features as you like, but a good clustering/manifold learning algorithm will be able to capture much of the structure without this. **The simplicity of the processing applied will be considered in judging!** Minimise the amount of hand-tweaking that you do.

* Remember that some layout algorithms (e.g. t-SNE) are **unstable**. You may want to run the dimensional reduction several times and choose a good result, and use a repeatable random number seed (e.g. set it using `np.random.seed` or pass a custom `RandomState` to `sklearn`).
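
As a starting point, a captured frame can be turned into a feature vector along these lines. This is a sketch under assumptions: it uses a random array in place of a real `cam.snap()` result, a plain channel average for greyscale, and nearest-neighbour subsampling in NumPy as a stand-in for `scipy.ndimage.zoom`; the function name `preprocess` is made up.

```python
import numpy as np

def preprocess(frame_bgr, size=(32, 48)):
    """Turn an (H, W, 3) uint8 image into a flat float vector with values in [0, 1]."""
    # greyscale by averaging the channels (so the BGR order does not matter here)
    grey = frame_bgr.astype(np.float64).mean(axis=2) / 255.0
    h, w = grey.shape
    # pick evenly spaced rows and columns: crude nearest-neighbour downsampling
    rows = np.linspace(0, h - 1, size[0]).round().astype(int)
    cols = np.linspace(0, w - 1, size[1]).round().astype(int)
    small = grey[np.ix_(rows, cols)]
    return small.ravel()

# synthetic stand-ins for captured frames (a real frame would come from cam.snap())
rng = np.random.default_rng(2018)
fake_frames = [rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8) for _ in range(5)]

# one row per image: the layout clustering and manifold learning code expects
features = np.array([preprocess(f) for f in fake_frames])
print(features.shape)
```

Each row of `features` plays the same role as a row of `faces` in the ISOMAP example above.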


In [None]:
# simple OpenCV image capture from the video device
class Webcam(object):
    def __init__(self, cam_id=0):
        self.cap = cv2.VideoCapture(cam_id)        
        
    def snap(self):
        ret, frame = self.cap.read()
        return frame    
    
# snap(), snap(), snap()...

In [None]:
# Solution

## More advanced unsupervised learning
The algorithms we have seen so far are relatively old but well supported without tricky dependencies. There are many more modern approaches that can be used; unfortunately these are harder to set up for a one-day course and often much slower to train. These include:

* Deep autoencoder structures, which learn latent spaces by back propagating through a "bottleneck layer". [tSNE, for example, can be cast as a deep learning structure](https://github.com/johnhw/tsne_demo) which is very flexible.
![Parametric tSNE](imgs/kyle_tsne_mnist.png)
*[From: https://www.flickr.com/photos/kylemcdonald/25478228166 by Kyle McDonald]*

* [Variational autoencoders (VAEs)](https://arxiv.org/abs/1606.05908), which are very powerful deep learning models for learning latent spaces

* The outstanding [UMAP](https://umap-learn.readthedocs.io/en/latest/api.html) algorithm, which is somewhat similar to tSNE but often gives better results in disentangling complex spaces. [See this talk for details](https://www.youtube.com/embed/nq6iPZVUxZU)