Home

Amy de Buitléir edited this page Jan 13, 2016 · 30 revisions
Clone this wiki locally

A Kohonen Self-organising Map (SOM) maps input patterns onto a regular grid (usually two-dimensional) where each node in the grid is a model of the input data, and does so using a method which ensures that any topological relationships within the input data are also represented in the grid. This implementation supports the use of non-numeric patterns.

In layman's terms, a SOM can be useful when you you want to discover the underlying structure of some data.

How it works

Suppose you want to get a picture of the cost of housing in your neighbourhood. You have the sales price of every house sold in the last year, but now what? You could take the average price, but that alone doesn't tell you much about the distribution. For example, if 99 houses sold for €100,000, and one mansion sold for €2,500,000, the mean price is €124,000. That might give you the impression that €124,000 is a typical price to pay for a home, but that's nearly 25% higher than most people pay. There are other things you can measure, like the median, mode, range, and standard distribution. These will give you more information, but what if the distribution of prices is more complicated? What you might want in this circumstance is to group the data into clusters by price. Each cluster could be labelled with a representative price. You might have one cluster for low-income housing, one for middle-income housing, one for upper-income housing, one for mansions, etc.

A SOM can be useful for situations like this. This example will give you a rough idea how it works. (Note that even if you have multiple values in each record, or non-numeric data, you can probably still use a SOM, as explained in the next section.)

We could start with a 3x3 grid, and assign random numbers to each cell.

6494523 9404885 5322480
1340617 5284246 692037
8517022 5952110 4373129

Now we go through the data set. For each value, we:

  • Identify the value in the grid that is closest to the input value.
  • Adjust that cell to be a bit closer to the input value.
  • Adjust the neighbouring cells to make them a little closer to the input value as well, but by a smaller amount.

For example, suppose the first house sold for €125,500. The last cell in the middle row is the closest. If we adjust it by 10%, 692037 becomes 635383. If we then adjust the neighbouring cells by 5%, the grid will look like this:

6494523 8940916 5062631
1340617 5026309 635383
8517022 5660780 4160748

As we process more and more data, something cool happens: The grid will become a sort of "map" of the housing prices. You can think of each cell as a bin containing prices. The value of the cell gives you an idea of the typical price in that bin. If you process the data several times, the map will become an even better representation of the input data. On each pass, we would probably reduce the amount of adjustment, so that the map gradually "settles". When working with a larger grid, we might adjust not only best matching cell and its neighbours, but cells that are a bit further away. With each pass, we would reduce the size of the "neighbourhood", again allowing the map to "settle".

Once you have finished training the map, you can use it to classify prices. I.e., given a price, you could query the map to find out which bin it belongs to.

Since there's only one variable in this example (the price of a house), you could simply graph the data to see where the peaks are. But what if you had lots of variables? You may not know a priori how the variables relate to each other. You might have a table with a hundred fields for each house sale (such as number of bedrooms, number of bathrooms, whether or not there's a swimming pool, the postcode, how close it is to the nearest railway station, the seller's name, the buyer's name, the colour of the house, land area, whether or not a solicitor was involved, the name of the escrow agent). The price of the house probably depends on many of these variables, but to different degrees. A SOM can help you see the underlying structure in complex data.

The explanation above was only intended to give you a general insight into how SOMs work. There are many excellent tutorials on SOMs (see "Recommended reading", below) that explain the maths behind it.

How to use the SOM package

To use the Som module, of course you'll need some data, i.e., a bunch of records. A record could be just about anything, from a single numeric value to a set of fields which might be alphanumeric, boolean, or whatever.

You need a way to measure how different two records are. This is your distance metric. The choice of distance metric is up to you; choosing a good metric is a bit of an art form. Assuming your records have multiple fields, you'll probably want to calculate some sort of difference between each corresponding pair of fields, and then somehow combine those numbers into one number representing the difference between the records. When comparing two numeric values, you'll often want to subtract one from the other and use the absolute value of the result. When comparing two non-numeric values, you'll need to be a bit more creative. If you're comparing two street addresses, for example, you might use the distance between them.

You also need a way to alter a record to make it slightly more similar to another record, which we will call the target. The amount of adjustment is controlled by the learning rate. Assuming again that your records have multiple fields, you'll probably want to adjust the value in each field to make it more similar to the corresponding field in the target. If a field is numeric, you might tweak it by a percentage related to the learning rate. If the field is non-numeric, you can choose some other way that is appropriate to the type of data you're working with. For example, you might replace a street address with another address that is slightly closer to the target.

Store your records using a type that implements the Pattern class (from the Som module). This requires you to implement difference and makeSimilar. Next, you need to choose a grid design with a suitable size and shape. Hexagonal grids are commonly used; see the HexGrid type in the Grid module. Finally, build a GridMap (see the GridMap module) and fill it with some suitable initial values.

Now you are ready to train the SOM and use it to classify additional data. See the functions train, trainBatch, classify and classifyAndTrain in the Som module. The following example illustrates the process.

Example: Housing prices

The file housePrices.hs builds a Self-Organising Map (SOM) based on some (fake) housing prices. As the SOM learns the data, it builds a map of the data. The map here is a square grid with 4 tiles in it, so it divides the prices into 4 clusters. The value associated with each tile is the model ("typical" price) of the houses in that cluster. Since we only have one variable to analyse, simply graphing the data might be a better way to visualise it. However, this is a nice simple example of how to use the som package.

The test data was randomly generated as follows:

  • 80 numbers from a Gaussian distribution with mean 75000
  • 200 numbers from a Gaussian distribution with mean 150000
  • 150 numbers from a Gaussian distribution with mean 350000
  • 40 numbers from a Gaussian distribution with mean 900000

In each case, the standard deviation is 1000. With a SOM of four clusters, we might expect to see model values of approximately 75k, 150k, 350k and 900k (in any order), based on how the data was generated. Here are the results from one sample run, which demonstrates that we did get a good "picture" of the data.

[74707,150241,899627,349980]

Example: Pixel colours

The file colour.hs builds a Self-Organising Map (SOM) based on the colours in an image. As the SOM learns the colours in the image, it builds a map of the colour distribution of the image. The map here is a hexagonal grid with 7 tiles in it, so it divides the colours into 7 clusters. The value associated with each tile is the RGB value of the typical colour for that cluster.

The program may take a minute or two to run. The output is the list of indices for the hexagonal grid, paired with the colour associated with that cluster.

Here is some sample output for the image shown.

SOM illustration

The SOM does a good job of representing the colours in the image. However, the gold in the cat's medallion, the yellow around its spots, and the pink of the cat's nose aren't shown. If we had used a bigger grid, it would probably show those colours as well. (You might think that the white of the cat's body is also missing, but those pixels are actually light blue because the cat's body is shiny and reflects the surroundings.)

To do

Discuss classification

Discuss scaling and normalising data

Advice on how to improve results

Recommended reading