Self Organizing Maps in Scala and on Spark

Self Organizing Map - Spark / Standalone

I am reacquainting myself with Kohonen self-organizing maps (SOMs) after 20 years. Check out this excellent tutorial and the references below, which I found useful in my rediscovery of SOMs. I never intended this project to be a general-purpose library; there are plenty of those. I learn by explicitly doing, so the applications below are explicit and concrete implementations of SOM-based solutions as standalone Scala and Spark applications. Along the way, I stole code from some of the references below, whose authors I thank profusely.

Building the Project

The project build is based on Maven:

mvn clean package

The Applications

All applications accept a configuration file in HOCON format.

SOM App

This application is based on this tutorial. It organizes an input set of random RGB triplets (or weights) into a 2-dimensional grid, where similar colors are grouped together into neighboring cells. The similarity is based on the Euclidean distance between the color components.

Generate the application.conf file as follows:
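The grouping described above can be sketched as a single training step: find the cell whose weight is closest to the input color (the best matching unit), then pull that cell and its grid neighbors toward the input. This is a minimal sketch and not the project's actual code; the object name `SomStepSketch`, the helper names, and the Gaussian neighborhood are assumptions made for illustration.

```scala
object SomStepSketch {
  type Vec = Array[Double]

  // Squared Euclidean distance between two color vectors of equal length.
  def dist2(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // (row, col) of the best matching unit: the cell whose weight is closest to x.
  def bmu(grid: Array[Array[Vec]], x: Vec): (Int, Int) = {
    var best = (0, 0)
    var bestDist = Double.MaxValue
    for (r <- grid.indices; c <- grid(r).indices) {
      val d = dist2(grid(r)(c), x)
      if (d < bestDist) { bestDist = d; best = (r, c) }
    }
    best
  }

  // One training step: move the BMU and its grid neighbors toward the input.
  // The Gaussian neighborhood is an assumption, not taken from the README.
  def train(grid: Array[Array[Vec]], x: Vec, alpha: Double, radius: Double): Unit = {
    val (br, bc) = bmu(grid, x)
    for (r <- grid.indices; c <- grid(r).indices) {
      val d2 = (r - br) * (r - br) + (c - bc) * (c - bc)
      val h = math.exp(-d2 / (2.0 * radius * radius))
      val w = grid(r)(c)
      for (i <- w.indices) w(i) += alpha * h * (x(i) - w(i))
    }
  }
}
```

Cells far from the best matching unit get a neighborhood factor close to zero, so they barely move; that is what makes similar colors settle into adjacent cells.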

cat << EOF > /tmp/application.conf
numWeights = 1000

numIterations = 2000
alpha = 0.2
somSize = 10

somImageCell = 20
somImagePath = "/tmp/som.png"

errImagePath = "/tmp/err.png"
errImageWidth = 400
errImageHeight = 300
EOF

Run the SOM application with Maven. It organizes 1000 (numWeights) random RGB colors into a SOM grid of 10x10 cells (somSize), training over 2000 (numIterations) epochs with a starting learning rate of 0.2 (alpha).
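A starting value for alpha implies that it shrinks as training proceeds. The README does not state the schedule, so the following is only a sketch assuming the common exponential decay for both the learning rate and the neighborhood radius; `DecaySketch`, `alphaAt`, and `radiusAt` are hypothetical names, and the initial radius of half the grid size is an assumption (it requires somSize greater than 2).

```scala
object DecaySketch {
  // Learning rate at a given epoch: starts at alpha0 and decays exponentially.
  def alphaAt(alpha0: Double, epoch: Int, numIterations: Int): Double =
    alpha0 * math.exp(-epoch.toDouble / numIterations)

  // Neighborhood radius: starts at half the grid size (assumed) and
  // shrinks to exactly one cell at the final epoch. Requires somSize > 2.
  def radiusAt(somSize: Int, epoch: Int, numIterations: Int): Double = {
    val r0 = somSize / 2.0
    val timeConstant = numIterations / math.log(r0)
    r0 * math.exp(-epoch / timeConstant)
  }
}
```

With the configuration above, `DecaySketch.alphaAt(0.2, 0, 2000)` is the starting rate of 0.2, and the radius for a 10x10 grid falls from 5 cells to 1 cell over the 2000 epochs.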

mvn -q exec:java\
 -Djava.awt.headless=true\
 -Dconfig.file=/tmp/application.conf\
 -Dexec.mainClass=com.esri.SOMApp

At the end of the execution, two files will be generated:

  • /tmp/som.png is an image of the final color coded grid.
  • /tmp/err.png is an image of a plot of the training error for each epoch.

Spark App

For a larger input set, we can take advantage of Spark's scheduling of distributed tasks to train the SOM in parallel.
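One way to train in parallel is a batch-style update: each partition finds the winning node for its own inputs and accumulates per-node sums and counts, and the partial results are then merged before the nodes are moved. The sketch below shows only that aggregate-and-merge shape using plain Scala collections in place of Spark; it is not the project's implementation, all names are hypothetical, and the neighborhood smoothing of a full SOM update is left out for brevity. On Spark the same shape would map onto `RDD.mapPartitions` followed by `reduce`.

```scala
object BatchSomSketch {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length.
  def dist2(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the node (flattened grid) closest to x.
  def bmu(nodes: Array[Vec], x: Vec): Int = {
    var best = 0
    var bestDist = Double.MaxValue
    for (k <- nodes.indices) {
      val d = dist2(nodes(k), x)
      if (d < bestDist) { bestDist = d; best = k }
    }
    best
  }

  // Partial result of one partition: per-node sum of won inputs and win count.
  final case class Partial(sums: Array[Vec], counts: Array[Long]) {
    def merge(that: Partial): Partial = Partial(
      sums.zip(that.sums).map { case (a, b) => a.zip(b).map { case (p, q) => p + q } },
      counts.zip(that.counts).map { case (a, b) => a + b })
  }

  // Work done independently on each partition.
  def partial(nodes: Array[Vec], part: Seq[Vec]): Partial = {
    val sums = Array.fill(nodes.length)(new Array[Double](nodes.head.length))
    val counts = new Array[Long](nodes.length)
    for (x <- part) {
      val k = bmu(nodes, x)
      counts(k) += 1
      for (i <- x.indices) sums(k)(i) += x(i)
    }
    Partial(sums, counts)
  }

  // One epoch: merge the partials, then move each node to the mean of the
  // inputs it won. Nodes that won nothing keep their current weight.
  def epoch(nodes: Array[Vec], partitions: Seq[Seq[Vec]]): Array[Vec] = {
    val total = partitions.map(p => partial(nodes, p)).reduce((a, b) => a.merge(b))
    nodes.indices.map { k =>
      if (total.counts(k) == 0L) nodes(k)
      else total.sums(k).map(s => s / total.counts(k))
    }.toArray
  }
}
```

Because `merge` is associative and commutative, the partials can be combined in any order, which is what lets the per-partition work run independently.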

Create a random set of RGB colors using AWK. Note that the color components are between 0.0 and 1.0:

cat << EOF > /tmp/rgb.awk
BEGIN{
 OFS=","
 srand()
 for(i=0;i<1000000;i++){
    print rand(),rand(),rand()
 }
}
EOF
awk -f /tmp/rgb.awk > /tmp/rgb.csv

Configure the application to use the newly generated /tmp/rgb.csv as an input:

cat << EOF > /tmp/application.conf
rgbPath = "/tmp/rgb.csv"

numIterations = 200
numPartitions = 8
alpha = 0.2
somSize = 10

somImageCell = 20
somImagePath = "/tmp/som.png"

EOF

Submit the job:

spark-submit\
 --driver-java-options "-Djava.awt.headless=true -Dconfig.file=/tmp/application.conf"\
 --executor-memory 16G\
 target/spark-som-0.1-jar-with-dependencies.jar

The following is a sample output of /tmp/som.png:

TSP App

The final application solves the traveling salesman problem (TSP) using a SOM. There is a lot of literature on this approach. The attached TSPApp is my specific implementation: it uses a 1-dimensional circular network, with the Manhattan distance between the cities as the proximity evaluator.
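The idea behind a circular network can be sketched as follows: nodes sit on a ring, each city pulls its nearest node (and that node's ring neighbors) toward itself, and the tour is read off by visiting cities in the order of their winning nodes. This is a rough sketch under assumptions, not the TSPApp code; `TspRingSketch` and its helpers are hypothetical names, and the Gaussian neighborhood is assumed. The numNodesPerCity setting suggests the ring holds numCities x numNodesPerCity nodes, but that is an inference from the configuration.

```scala
object TspRingSketch {
  type Point = (Double, Double)

  // Manhattan (L1) distance between two points.
  def manhattan(a: Point, b: Point): Double =
    math.abs(a._1 - b._1) + math.abs(a._2 - b._2)

  // Distance between two node indices on a ring of n nodes (wraps around).
  def ringDist(i: Int, j: Int, n: Int): Int = {
    val d = math.abs(i - j)
    math.min(d, n - d)
  }

  // Index of the ring node nearest to the city.
  def winner(ring: Array[Point], city: Point): Int = {
    var best = 0
    var bestDist = Double.MaxValue
    for (i <- ring.indices) {
      val d = manhattan(ring(i), city)
      if (d < bestDist) { bestDist = d; best = i }
    }
    best
  }

  // One training step: pull the winner and its ring neighbors toward the city.
  def train(ring: Array[Point], city: Point, alpha: Double, radius: Double): Unit = {
    val w = winner(ring, city)
    for (i <- ring.indices) {
      val d = ringDist(i, w, ring.length)
      val h = math.exp(-(d * d) / (2.0 * radius * radius))
      val (x, y) = ring(i)
      ring(i) = (x + alpha * h * (city._1 - x), y + alpha * h * (city._2 - y))
    }
  }

  // Tour: city indices ordered by the position of their winning ring node.
  def tour(ring: Array[Point], cities: Seq[Point]): Seq[Int] =
    cities.indices.sortBy(i => winner(ring, cities(i)))
}
```

The wrap-around in `ringDist` is what makes the network circular: the first and last nodes are neighbors, so the resulting tour closes on itself.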

Create a configuration file named tsp.conf:

cat << EOF > /tmp/tsp.conf
numIterations = 3000
alpha = 0.2

somImagePath = "/tmp/tsp.png"
somImageSize = 400

errImagePath = "/tmp/err.png"
errImageWidth = 400
errImageHeight = 300

numCities = 50
numNodesPerCity = 2

EOF

Run the job:

mvn -q exec:java\
 -Djava.awt.headless=true\
 -Dconfig.file=/tmp/tsp.conf\
 -Dexec.mainClass=com.esri.TSPApp

Output images: /tmp/tsp.png and /tmp/err.png

References