# Kmeans over a set of GeoTiffs

This notebook loads a set of GeoTiffs into an **RDD** of Tiles, with each Tile being a band in the GeoTiff. Each GeoTiff file contains the **SpringIndex** or **LastFreeze** value for one year over the entire USA.

Kmeans treats each year as a dimension, so the input matrix must have cells as rows and years as columns. Because every GeoTiff is loaded as one row per year, the matrix needs to be transposed before clustering on all years. The notebook offers two flavors of matrix transpose: locally on the Spark driver, or distributed across the Spark workers. Once transposed, the matrix is converted to an **RDD** of dense vectors to be used by the **Kmeans** algorithm from **Spark-MLlib**. The end result is a grid where each cell has a cluster ID, which is then saved into a SingleBand GeoTiff. Saving the result as a GeoTiff lets the reader plot it with a Python notebook such as the one in the [python examples](../examples/python).

## Dependencies

In [1]:
import geotrellis.proj4.CRS
import geotrellis.raster.{ArrayTile, DoubleArrayTile, Tile}
import geotrellis.raster.io.geotiff._
import geotrellis.raster.io.geotiff.writer.GeoTiffWriter
import geotrellis.raster.io.geotiff.{GeoTiff, SinglebandGeoTiff}
import geotrellis.spark.io.hadoop._
import geotrellis.vector.{Extent, ProjectedExtent}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

## Load multiple GeoTiffs into a RDD

In [None]:
val single_band = true

val local_mode = true

var projected_extent = new ProjectedExtent(new Extent(0,0,0,0), CRS.fromName("EPSG:3857"))
var num_cols_rows :(Int, Int) = (0, 0)
var band_RDD: RDD[Array[Double]] = sc.emptyRDD
var band_vec: RDD[Vector] = sc.emptyRDD
var band0: RDD[(Long, Double)] = sc.emptyRDD
var band0_index: RDD[Long] = sc.emptyRDD

val pattern: String = "tif"
var filepath: String = ""

if (single_band) {
    //Single band GeoTiff
    filepath = "hdfs:///user/hadoop/spring-index/LastFreeze/"
} else {
    //Multi band GeoTiff
    filepath = "hdfs:///user/hadoop/spring-index/BloomFinal/"
}

//Both vals are declared outside the if/else so that later cells can use them.
val tiles_RDD: RDD[Tile] = if (single_band) {
    //Let's load Singleband GeoTiffs and return an RDD with just the tiles.
    sc.hadoopGeoTiffRDD(filepath, pattern).values
} else {
    //Let's load Multiband GeoTiffs and return an RDD with just the tiles.
    //The average of the Spring-Index is stored in the 4th band.
    sc.hadoopMultibandGeoTiffRDD(filepath, pattern).values.map(m => m.band(3))
}

//One Array[Double] per GeoTiff, i.e. one row per year
val bands_RDD = tiles_RDD.map(m => m.toArrayDouble())

## Read metadata and create indexes

In [None]:
//Retrieve the ProjectedExtent, which contains metadata such as the CRS and the bounding box
val extents_withIndex = sc.hadoopGeoTiffRDD(filepath, pattern).keys.zipWithIndex().map{case (e,v) => (v,e)}
projected_extent = (extents_withIndex.filter(m => m._1 == 0).values.collect())(0)

//Retrieve the number of cols and rows of the Tile's grid
val tiles_withIndex = tiles_RDD.zipWithIndex().map{case (e,v) => (v,e)}
val tile0 = (tiles_withIndex.filter(m => m._1==0).values.collect())(0)
num_cols_rows = (tile0.cols,tile0.rows)

//Get Index for each Cell
val bands_withIndex = bands_RDD.zipWithIndex().map { case (e, v) => (v, e) }
band0_index = bands_withIndex.filter(m => m._1 == 0).values.flatMap(m => m).zipWithIndex.filter(m => !m._1.isNaN).map { case (v, i) => (i) }

//Get the Tile's grid
band0 = bands_withIndex.filter(m => m._1 == 0).values.flatMap( m => m).zipWithIndex.map{case (v,i) => (i,v)}

//Lets filter out NaN
band_RDD = bands_RDD.map(m => m.filter(!_.isNaN))

## Matrix transpose

We need to transpose the matrix to obtain clusters per cell rather than per year. With each GeoTiff representing a single year, the loaded data looks like this:
```
bands_RDD.map(s => Vectors.dense(s)).cache()

//The vectors are rows and therefore the matrix will look like this:
[
Vectors.dense(0.0, 1.0, 2.0),
Vectors.dense(3.0, 4.0, 5.0),
Vectors.dense(6.0, 7.0, 8.0),
Vectors.dense(9.0, 0.0, 1.0)
]
```

The information was gathered from the blog [how to convert a matrix to a RDD of vectors](http://jacob119.blogspot.nl/2015/11/how-to-convert-matrix-to-rddvector-in.html) and a stackoverflow post on [how to transpose an rdd in spark](https://stackoverflow.com/questions/29390717/how-to-transpose-an-rdd-in-spark).
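
To make the two flavors concrete, here is a minimal sketch that uses plain Scala collections instead of RDDs, on the same toy values as above. The names `byYear`, `localT` and `distributedT` are hypothetical and only serve the illustration; the grouping and sorting steps imitate what the distributed variant does with `groupByKey` and `sortByKey`.

```scala
// Rows are years, columns are grid cells (same toy values as above).
val byYear: Array[Array[Double]] = Array(
  Array(0.0, 1.0, 2.0),
  Array(3.0, 4.0, 5.0),
  Array(6.0, 7.0, 8.0),
  Array(9.0, 0.0, 1.0)
)

// Local flavor: the standard library transposes the collected matrix directly.
val localT: Array[Array[Double]] = byYear.transpose

// Distributed flavor, imitated on local collections:
// 1. emit (columnIndex, (rowIndex, value)) for every number,
// 2. group by column index and order the groups,
// 3. order every group by row index and keep only the values.
val byColumnAndRow: Seq[(Int, (Int, Double))] =
  byYear.toSeq.zipWithIndex.flatMap { case (row, rowIndex) =>
    row.toSeq.zipWithIndex.map { case (number, columnIndex) =>
      (columnIndex, (rowIndex, number))
    }
  }

val distributedT: Seq[Seq[Double]] =
  byColumnAndRow
    .groupBy(_._1)
    .toSeq
    .sortBy(_._1)
    .map { case (_, cells) => cells.map(_._2).sortBy(_._1).map(_._2) }

// Each row now holds one cell across all years; this prints the first cell.
println(localT(0).mkString(", "))
```

Both variants yield three rows (one per cell) of four values (one per year), which is the shape Kmeans expects.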




In [None]:
//The result is declared outside the if/else so that the next cell can use it.
val transposed: RDD[Array[Double]] = if (local_mode) {
    //A) RDDs with a small memory footprint can simply be brought to the Driver node and transposed there.
    //First transpose and then parallelize otherwise you get:
    //error: polymorphic expression cannot be instantiated to expected type;
    val band_vec_T = band_RDD.collect().transpose

    //Convert to a RDD
    sc.parallelize(band_vec_T)
} else {
    //B) RDDs with a large memory footprint need to be transposed in distributed mode.
    // Split the matrix into one number per line.
    val byColumnAndRow = band_RDD.zipWithIndex.flatMap {
        case (row, rowIndex) => row.zipWithIndex.map {
            case (number, columnIndex) => columnIndex -> (rowIndex, number)
        }
    }

    // Build up the transposed matrix. Group and sort by column index first.
    val byColumn = byColumnAndRow.groupByKey.sortByKey().values

    // Then sort by row index and keep only the values.
    byColumn.map {
        indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2).toArray
    }
}

## Kmeans training

In [None]:
//Create a RDD of dense vectors and cache it
band_vec = transposed.map(m => Vectors.dense(m.toArray)).cache()
    
/*
 Here we will train kmeans
*/

val numClusters = 3
val numIterations = 5
val clusters = {
    KMeans.train(band_vec, numClusters, numIterations)
}

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(band_vec)
println("Within Set Sum of Squared Errors = " + WSSSE)

//Un-persist it to save memory
band_vec.unpersist()

## Cluster model's result management

In [None]:
// Let's show the result.
println("Cluster Centers: ")
clusters.clusterCenters.foreach(println)

//Lets save the model into HDFS. If the file already exists it will abort and report error.
if (single_band) {
    clusters.save(sc, "hdfs:///user/emma/spring_index/LastFreeze/all_kmeans_model")
} else {
    clusters.save(sc, "hdfs:///user/emma/spring_index/BloomFinal/all_kmeans_model")
}

## Run Kmeans clustering

Apply the trained Kmeans model to obtain the cluster ID of each cell.
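
As a rough illustration of what prediction and the WSSSE cost amount to, the sketch below assigns each toy point to its nearest center and sums the squared distances. It is plain Scala on hypothetical data (`toyCenters`, `toyCells`) with hypothetical helpers (`sqDist`, `nearest`), not the MLlib implementation.

```scala
// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the closest center: the cluster ID assigned to one vector.
def nearest(centers: Seq[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy(i => sqDist(centers(i), p))

// Hypothetical cluster centers and cells, with two "years" each.
val toyCenters: Seq[Array[Double]] = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
val toyCells: Seq[Array[Double]] =
  Seq(Array(1.0, 0.0), Array(9.0, 10.0), Array(0.0, 2.0))

// One cluster ID per cell, in the same order as the input vectors.
val toyIds: Seq[Int] = toyCells.map(p => nearest(toyCenters, p))

// Within Set Sum of Squared Errors: squared distance of every cell
// to the center it was assigned to, summed over all cells.
val toyCost: Double =
  toyCells.zip(toyIds).map { case (p, id) => sqDist(toyCenters(id), p) }.sum

println(toyIds.mkString(", "))
println(toyCost)
```

The order of the predicted IDs follows the order of the input vectors, which is what the later index join relies on.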

In [None]:
//Cache it so kmeans is more efficient
band_vec.cache()

val res = clusters.predict(band_vec)
res.repartition(1).getNumPartitions

//Un-persist it to save memory
band_vec.unpersist()

### Show the cluster ID for the first 50 cells

In [None]:
val res_out = res.take(50)
res_out.foreach(println)

//Total number of clustered cells
println(res.count())

## Assign cluster ID to each grid cell

To assign the cluster ID to each grid cell, it is necessary to get the indices of the grid cells they belong to. The process is not straightforward because the ArrayDouble used to create each dense Vector does not contain the NaN values; therefore there is no direct mapping between the indices in the Tile's grid and the ones in **res** (kmeans result).

The approach for joining the two RDDs is based on a stackoverflow post on [how to perform basic joins of two rdd tables in spark using python](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python).
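
The index bookkeeping can be sketched on a tiny grid with plain Scala collections. All names and values here (`toyBand`, `toyRes`, and so on) are hypothetical; local collections can simply be zipped, whereas the RDD version has to index both sides and join them.

```scala
// First band of a toy grid with six cells; NaN marks cells without data.
val toyBand: Array[Double] =
  Array(10.0, Double.NaN, 12.0, Double.NaN, 14.0, 15.0)

// Grid indices of the cells that survive the NaN filter
// (the role band0_index plays in the notebook).
val validIndex: Seq[Int] =
  toyBand.toSeq.zipWithIndex.collect { case (v, i) if !v.isNaN => i }

// Hypothetical cluster IDs, one per surviving cell, in the same order.
val toyRes: Seq[Int] = Seq(2, 0, 1, 1)

// Pair each grid index with the cluster ID at the same position.
val clusterByCell: Map[Int, Int] = validIndex.zip(toyRes).toMap

// Left-join style lookup over the full grid: cells without a
// cluster ID (the NaN cells) stay NaN.
val toyGrid: Seq[Double] = toyBand.indices.map { i =>
  clusterByCell.get(i).map(_.toDouble).getOrElse(Double.NaN)
}

println(toyGrid.mkString(", "))
```

The result has one entry per grid cell again, so it can be reshaped with the original number of columns and rows.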

### Merge two RDDs, one containing the clusters_ID indices and the other one the indices of a Tile's grid cells.

In [None]:
//The zip operator would be the most appropriate operator for this operation.
//However, it requires the RDDs to have the same number of partitions and each partition to have the same number of records.
//val cluster_cell_pos = res.zip(band0_index)

//Since we can't use zip, we index each RDD and then join them.
val cluster_cell_pos = ((res.zipWithIndex().map{ case (v,i) => (i,v)}).join(band0_index.zipWithIndex().map{ case (v,i) => (i,v)})).map{ case (k,(v,i)) => (v,i)}

### Associate a Cluster_IDs to respective Grid_cell

In [None]:
//We use a left outer join: if a grid cell has no Cluster_ID, its value is set to None.
val grid_clusters = band0.leftOuterJoin(cluster_cell_pos.map{ case (c,i) => (i.toLong, c)})

//Convert all None to NaN
val grid_clusters_res = grid_clusters.sortByKey(true).map{case (k, (v, c)) => if (c == None) (k, Double.NaN) else (k, c.get.toDouble)}//.collect().foreach(println)

## Store the results in a SingleBand GeoTiff

The grid with the cluster IDs is stored in a SingleBand GeoTiff.

In [None]:
//Keep the values as Double: NaN.toInt is 0, which would turn cells without a cluster into cluster 0
val cluster_cells :Array[Double] = grid_clusters_res.values.collect()

//Define a Tile
val cluster_tile = DoubleArrayTile(cluster_cells, num_cols_rows._1, num_cols_rows._2)

//Generate GeoTiff with the same header as the one used by the input GeoTiffs
val geoTiff = SinglebandGeoTiff(cluster_tile, projected_extent.extent, projected_extent.crs, Tags.empty, GeoTiffOptions.DEFAULT)

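//Convert the byte array to an RDD and save it as an object file; if the path already exists it will report an error.
//Note that saveAsObjectFile writes a Hadoop SequenceFile of serialized objects, not a plain .tif file.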
sc.parallelize(geoTiff.toByteArray).saveAsObjectFile("hdfs:///user/emma/spring-index/LastFreeze/clusters.tif")

//Another option is to write to the local file system of the node where JupyterHub is running
//val path = "~/clusters.tif"
//GeoTiffWriter.write(geoTiff, path)