# Kmeans over a set of GeoTiffs

This notebook loads a set of GeoTiffs into an **RDD** of Tiles, with each Tile being a band in the GeoTiff. Each GeoTiff file contains the **SpringIndex** or **LastFreeze** values for one year over the entire USA.

Kmeans takes years as dimensions, so the matrix it receives must have cells as rows and years as columns. Because each loaded GeoTiff holds one year, the data arrives with years as rows and has to be transposed first. The notebook has two flavors of matrix transpose: locally by the Spark driver or distributed using the Spark workers. Once transposed, the matrix is converted to an **RDD** of dense vectors to be used by the **Kmeans** algorithm from **Spark-MLlib**. The end result is a grid where each cell has a cluster ID, which is then saved into a SingleBand GeoTiff. With the result saved as a GeoTiff, the reader can plot it using a Python notebook such as the one in the [python examples](../examples/python).

<span style="color:red">In this notebook the reader only needs to modify the variables in **Mode of Operation Setup**</span>.

## Dependencies

In [1]:
import sys.process._

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}


import geotrellis.proj4.CRS
import geotrellis.raster.{CellType, ArrayTile, DoubleArrayTile, Tile, UByteCellType}
import geotrellis.raster.io.geotiff._
import geotrellis.raster.io.geotiff.writer.GeoTiffWriter
import geotrellis.raster.io.geotiff.{GeoTiff, SinglebandGeoTiff}
import geotrellis.spark.io.hadoop._
import org.apache.hadoop.io._
import geotrellis.vector.{Extent, ProjectedExtent}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import org.apache.hadoop.io.{IOUtils, SequenceFile}
import org.apache.hadoop.io.SequenceFile.Writer

//Spire is a numeric library for Scala which is intended to be generic, fast, and precise.
import spire.syntax.cfor._

## Mode of operation

Here the user can define the mode of operation.
* **rdd_offline_mode**: If false, the notebook creates all data from scratch and stores grid0, grid0_index, projected_extent and num_cols_rows (from grid0) into HDFS. Otherwise, these data structures are read from HDFS.
* **matrix_offline_mode**: If false, the notebook creates a matrix, transposes it and saves it to HDFS. Otherwise, the matrix is read from HDFS.
* **kmeans_offline_mode**: If false, the notebook trains kmeans, runs it and stores the kmeans model into HDFS. Otherwise, the model is read from HDFS.

It is also possible to define which directory of GeoTiffs is to be used and on which **band** to run Kmeans. The options are
* **BloomFinal** or **LeafFinal** which are multi-band (**4 bands**)
* **DamageIndex** and **LastFreeze** which are single-band; if **band_num** is set higher than 0, it will be reset to 0

For kmeans the user can define the **number of iterations** and the **number of clusters** as an inclusive range. The range is defined using **minClusters**, **maxClusters**, and **stepClusters**. These variables set up a loop that starts at **minClusters** and stops at **maxClusters** (inclusive), advancing by **stepClusters** each time. <span style="color:red">Note that when using a range, **kmeans offline mode** is not possible and it will be reset to **online mode**</span>.
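
As a rough illustration of how the three range variables interact, the sketch below enumerates the cluster counts for a hypothetical setting. The names `exampleMin`, `exampleMax` and `exampleStep` are made up for this example and are deliberately different from the notebook's variables; the notebook itself iterates with `cfor` rather than a Scala `Range`.

```scala
// Hypothetical values, chosen only for illustration.
val exampleMin = 5
val exampleMax = 20
val exampleStep = 5

// Inclusive range of cluster counts: one Kmeans run per element.
val clusterCounts: List[Int] = (exampleMin to exampleMax by exampleStep).toList

// Number of runs, computed with the same formula the validation cell uses for num_kmeans.
val exampleNumKmeans: Int = ((exampleMax - exampleMin) / exampleStep) + 1
```

With these values `clusterCounts` holds 5, 10, 15 and 20, and `exampleNumKmeans` is 4, so four models would be trained.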

### Mode of Operation setup

In [2]:
//Operation mode
var rdd_offline_mode = true
var matrix_offline_mode = true
var kmeans_offline_mode = true

//GeoTiffs to be read from "hdfs:///user/hadoop/spring-index/"
var dir_path = "hdfs:///user/hadoop/spring-index/"
var offline_dir_path = "hdfs:///user/emma/spring-index/"
var geoTiff_dir = "BloomFinal"
var band_num = 4

//Kmeans number of iterations and clusters
var numIterations = 35
var minClusters = 5
var maxClusters = 5
var stepClusters = 1
var save_kmeans_model = true

rdd_offline_mode = true
matrix_offline_mode = true
kmeans_offline_mode = true
dir_path = hdfs:///user/hadoop/spring-index/
offline_dir_path = hdfs:///user/emma/spring-index/
geoTiff_dir = BloomFinal
band_num = 4
numIterations = 35
minClusters = 5
maxClusters = 5
stepClusters = 1
save_kmeans_model = true


true


<span style="color:red">DON'T MODIFY ANY PIECE OF CODE FROM HERE ON!!!</span>.


### Mode of operation validation

In [3]:
//Validation, do not modify these lines.
var single_band = false
if (geoTiff_dir == "BloomFinal" || geoTiff_dir == "LeafFinal") {
    single_band = false
} else if (geoTiff_dir == "LastFreeze" || geoTiff_dir == "DamageIndex") {
    single_band = true
    if (band_num > 0) {
        println("Since LastFreeze and DamageIndex are single band, we will use band 0!!!")
        band_num  = 0
    }
} else {
    println("Directory unknown, please set either BloomFinal, LeafFinal, LastFreeze or DamageIndex!!!")
}

if (minClusters > maxClusters) {
    maxClusters = minClusters
}
if (stepClusters < 1) {
    stepClusters = 1
}

//Paths to store data structures for Offline runs
var grid0_path = offline_dir_path + geoTiff_dir + "/grid0"
var grid0_index_path = offline_dir_path + geoTiff_dir + "/grid0_index"
var grids_noNaN_path = offline_dir_path + geoTiff_dir + "/grids_noNaN"
var metadata_path = offline_dir_path + geoTiff_dir + "/metadata"
var grids_matrix_path = offline_dir_path + geoTiff_dir + "/grids_matrix"

//Check offline modes
var conf = sc.hadoopConfiguration
var fs = org.apache.hadoop.fs.FileSystem.get(conf)

val rdd_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(grid0_path))
val matrix_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(grids_matrix_path))
                                      
if (rdd_offline_mode != rdd_offline_exists) {
    println("\"Load GeoTiffs\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + rdd_offline_exists.toString())
    rdd_offline_mode = rdd_offline_exists
} 
if (matrix_offline_mode != matrix_offline_exists) {
    println("\"Matrix\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + matrix_offline_exists.toString())
    matrix_offline_mode = matrix_offline_exists
}

var num_kmeans :Int  = 1
if (minClusters != maxClusters) {
    num_kmeans = ((maxClusters - minClusters) / stepClusters) + 1
}
println(num_kmeans)
var kmeans_model_paths :Array[String] = Array.fill[String](num_kmeans)("")
var wssse_path :String = offline_dir_path + geoTiff_dir + "/wssse"
var geotiff_hdfs_paths :Array[String] = Array.fill[String](num_kmeans)("")
var geotiff_tmp_paths :Array[String] = Array.fill[String](num_kmeans)("")

if (num_kmeans > 1) {
    var numClusters_id = 0
    cfor(minClusters)(_ <= maxClusters, _ + stepClusters) { numClusters =>
        kmeans_model_paths(numClusters_id) = offline_dir_path + geoTiff_dir + "/kmeans_model_" + numClusters + "_" + numIterations
        
        //Check if the file exists
        if (fs.exists(new org.apache.hadoop.fs.Path(kmeans_model_paths(numClusters_id)))) {
            println("The kmeans model path " + kmeans_model_paths(numClusters_id) + " exists, please remove it.")
        }
        
        geotiff_hdfs_paths(numClusters_id) = offline_dir_path + geoTiff_dir + "/clusters_" + numClusters + "_" + numIterations + ".tif"
        geotiff_tmp_paths(numClusters_id) = "/tmp/clusters_" + geoTiff_dir + "_" + numClusters + "_" + numIterations + ".tif"
        if (fs.exists(new org.apache.hadoop.fs.Path(geotiff_hdfs_paths(numClusters_id)))) {
            println("There is already a GeoTiff with the path: " + geotiff_hdfs_paths(numClusters_id) + ". Please make either a copy or move it to another location, otherwise, it will be over-written.")
        }
        numClusters_id += 1
    }
    kmeans_offline_mode = false
} else { 
    kmeans_model_paths(0) = offline_dir_path + geoTiff_dir + "/kmeans_model_" + minClusters + "_" + numIterations
    val kmeans_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(kmeans_model_paths(0)))
    if (kmeans_offline_mode != kmeans_offline_exists) {
        println("\"Kmeans\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + kmeans_offline_exists.toString())
        kmeans_offline_mode = kmeans_offline_exists
    }
    geotiff_hdfs_paths(0) = offline_dir_path + geoTiff_dir + "/clusters_" + minClusters + "_" + numIterations + ".tif"
    geotiff_tmp_paths(0) = "/tmp/clusters_" + geoTiff_dir + "_" + minClusters + "_" + numIterations + ".tif"
    if (fs.exists(new org.apache.hadoop.fs.Path(geotiff_hdfs_paths(0)))) {
        println("There is already a GeoTiff with the path: " + geotiff_hdfs_paths(0) + ". Please make either a copy or move it to another location, otherwise, it will be over-written.")
    }
}

Waiting for a Spark session to start...

1
There is already a GeoTiff with the path: hdfs:///user/emma/spring-index/BloomFinal/clusters_5_35.tif. Please make either a copy or move it to another location, otherwise, it will be over-written.


single_band = false
grid0_path = hdfs:///user/emma/spring-index/BloomFinal/grid0
grid0_index_path = hdfs:///user/emma/spring-index/BloomFinal/grid0_index
grids_noNaN_path = hdfs:///user/emma/spring-index/BloomFinal/grids_noNaN
metadata_path = hdfs:///user/emma/spring-index/BloomFinal/metadata
grids_matrix_path = hdfs:///user/emma/spring-index/BloomFinal/grids_matrix
conf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/usr/lib/spark-2.1.1-bin-without-hadoop/conf/hive-site.xml
fs = DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1637340016_36, ugi=emma (auth:SI...


DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1637340016_36, ugi=emma (auth:SIMPLE)]]

## Functions to (de)serialize any structure into Array[Byte]

In [4]:
def serialize(value: Any): Array[Byte] = {
    val out_stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val obj_out_stream = new ObjectOutputStream(out_stream)
    obj_out_stream.writeObject(value)
    obj_out_stream.close
    out_stream.toByteArray
}

def deserialize(bytes: Array[Byte]): Any = {
    val obj_in_stream = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val value = obj_in_stream.readObject
    obj_in_stream.close
    value
}

serialize: (value: Any)Array[Byte]
deserialize: (bytes: Array[Byte])Any
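
To make the round trip concrete, here is a small standalone sketch. It re-declares the two helpers under the hypothetical names `toBytes` and `fromBytes` so the block runs on its own, and it narrows the parameter to `java.io.Serializable`, which is an assumption of this sketch and not how the notebook declares it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Same idea as serialize above: write the object into an in-memory byte buffer.
def toBytes(value: java.io.Serializable): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(value)
  out.close()
  buffer.toByteArray
}

// Same idea as deserialize above: the caller has to cast the result back.
def fromBytes(bytes: Array[Byte]): Any = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject() finally in.close()
}

// Round trip of a (cols, rows) pair similar to the one kept in the metadata file.
val originalDims: (Int, Int) = (7808, 3892)
val restoredDims: (Int, Int) = fromBytes(toBytes(originalDims)).asInstanceOf[(Int, Int)]
```

Because the result is typed as `Any`, every read needs an `asInstanceOf` cast, which is why the loading cell casts each metadata entry explicitly.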


## Load GeoTiffs

Using GeoTrellis, all GeoTiffs in a directory are loaded into an RDD. From the RDD we extract a grid from the first file to later store the Kmeans cluster IDs, build an index to populate that grid, and filter out all NaN values.
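
The index and the NaN filter can be pictured on a tiny local array. This is only a sketch on plain Scala collections with made-up values; the notebook performs the equivalent steps on RDDs, and the names `exampleGrid`, `validIndex` and `gridNoNaN` are hypothetical.

```scala
// Hypothetical 2x3 grid flattened row by row; NaN marks cells without data.
val exampleGrid: Array[Double] = Array(1.0, Double.NaN, 3.0, Double.NaN, 5.0, 6.0)

// Positions of the cells that hold data (the role grid0_index plays in the notebook).
val validIndex: Array[Long] =
  exampleGrid.zipWithIndex.collect { case (v, i) if !v.isNaN => i.toLong }

// The same grid with the NaN cells dropped (the role of one element of grids_noNaN_RDD).
val gridNoNaN: Array[Double] = exampleGrid.filter(!_.isNaN)
```

Here `validIndex` keeps positions 0, 2, 4 and 5. The filtered arrays only line up across years if every grid has NaN in the same cells, which the notebook appears to assume.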

In [5]:
//Global variables
var projected_extent = new ProjectedExtent(new Extent(0,0,0,0), CRS.fromName("EPSG:3857"))
var grid0: RDD[(Long, Double)] = sc.emptyRDD
var grid0_index: RDD[Long] = sc.emptyRDD
var grids_noNaN_RDD: RDD[Array[Double]] = sc.emptyRDD
var num_cols_rows :(Int, Int) = (0, 0)
var cellT :CellType = UByteCellType
var grids_RDD :RDD[Array[Double]] = sc.emptyRDD

//Local variables
val pattern: String = "tif"
val filepath: String = dir_path + geoTiff_dir

if (rdd_offline_mode) {
    grids_noNaN_RDD = sc.objectFile(grids_noNaN_path)
    grid0 = sc.objectFile(grid0_path)
    grid0_index = sc.objectFile(grid0_index_path)

    val metadata = sc.sequenceFile(metadata_path, classOf[IntWritable], classOf[BytesWritable]).map(_._2.copyBytes()).collect()
    projected_extent = deserialize(metadata(0)).asInstanceOf[ProjectedExtent]
    num_cols_rows = (deserialize(metadata(1)).asInstanceOf[Int], deserialize(metadata(2)).asInstanceOf[Int])
} else {
    if (single_band) {
        //Let's load Singleband GeoTiffs and return an RDD with just the tiles.
        val tiles_RDD = sc.hadoopGeoTiffRDD(filepath, pattern).values
    
        //Retrieve the number of cols and rows of the Tile's grid
        val tiles_withIndex = tiles_RDD.zipWithIndex().map{case (e,v) => (v,e)}
        val tile0 = (tiles_withIndex.filter(m => m._1==0).values.collect())(0)
        num_cols_rows = (tile0.cols,tile0.rows)
        cellT = tile0.cellType
    
        grids_RDD = tiles_RDD.map(m => m.toArrayDouble())
    } else {
        //Let's load Multiband GeoTiffs and return an RDD with just the tiles.
        val tiles_RDD = sc.hadoopMultibandGeoTiffRDD(filepath, pattern).values
    
        //Retrieve the number of cols and rows of the Tile's grid
        val tiles_withIndex = tiles_RDD.zipWithIndex().map{case (e,v) => (v,e)}
        val tile0 = (tiles_withIndex.filter(m => m._1==0).values.collect())(0)
        num_cols_rows = (tile0.cols,tile0.rows)
        cellT = tile0.cellType
    
        //Lets read the average of the Spring-Index which is stored in the 4th band
        grids_RDD = tiles_RDD.map(m => m.band(3).toArrayDouble())
    }

    //Retrieve the ProjectedExtent which contains metadata such as CRS and bounding box
    val projected_extents_withIndex = sc.hadoopGeoTiffRDD(filepath, pattern).keys.zipWithIndex().map{case (e,v) => (v,e)}
    projected_extent = (projected_extents_withIndex.filter(m => m._1 == 0).values.collect())(0)

    //Get Index for each Cell
    val grids_withIndex = grids_RDD.zipWithIndex().map { case (e, v) => (v, e) }
    grid0_index = grids_withIndex.filter(m => m._1 == 0).values.flatMap(m => m).zipWithIndex.filter(m => !m._1.isNaN).map { case (v, i) => (i) }

    //Get the Tile's grid
    grid0 = grids_withIndex.filter(m => m._1 == 0).values.flatMap( m => m).zipWithIndex.map{case (v,i) => (i,v)}

    //Lets filter out NaN
    grids_noNaN_RDD = grids_RDD.map(m => m.filter(!_.isNaN))
    
    //Store data in HDFS
    grid0.saveAsObjectFile(grid0_path)
    grid0_index.saveAsObjectFile(grid0_index_path)
    grids_noNaN_RDD.saveAsObjectFile(grids_noNaN_path)
    
    val writer: SequenceFile.Writer = SequenceFile.createWriter(conf,
        Writer.file(metadata_path),
        Writer.keyClass(classOf[IntWritable]),
        Writer.valueClass(classOf[BytesWritable])
    )

    writer.append(new IntWritable(1), new BytesWritable(serialize(projected_extent)))
    writer.append(new IntWritable(2), new BytesWritable(serialize(num_cols_rows._1)))
    writer.append(new IntWritable(3), new BytesWritable(serialize(num_cols_rows._2)))
    writer.hflush()
    writer.close()
}

projected_extent = ProjectedExtent(Extent(-126.30312894720473, 14.29219617034159, -56.162671563152486, 49.25462702827337),geotrellis.proj4.CRS$$anon$3@41d0d1b7)
grid0 = MapPartitionsRDD[7] at objectFile at <console>:87
grid0_index = MapPartitionsRDD[9] at objectFile at <console>:88
grids_noNaN_RDD = MapPartitionsRDD[5] at objectFile at <console>:86
num_cols_rows = (7808,3892)
cellT = uint8raw
grids_RDD = EmptyRDD[3] at emptyRDD at <console>:77
pattern = tif
filepath = hdfs:///user/hadoop/spring-index/BloomFinal


hdfs:///user/hadoop/spring-index/BloomFinal

## Matrix

We need to do a Matrix transpose to have clusters per cell and not per year. With a GeoTiff representing a single year, the loaded data looks like this:
```
bands_RDD.map(s => Vectors.dense(s)).cache()

//The vectors are rows and therefore the matrix will look like this:
[
Vectors.dense(0.0, 1.0, 2.0),
Vectors.dense(3.0, 4.0, 5.0),
Vectors.dense(6.0, 7.0, 8.0),
Vectors.dense(9.0, 0.0, 1.0)
]
```

To achieve that we convert the **RDD[Vector]** into a distributed Matrix, a [**CoordinateMatrix**](https://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix), which has a **transpose** method.
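
The entry-based transpose can be followed on the small example matrix above. The sketch below uses plain Scala collections instead of Spark, so it only illustrates the idea; `byYear`, `entries` and `byCell` are hypothetical names.

```scala
// Example matrix from above: one row per year, one column per cell (4 years x 3 cells).
val byYear: Seq[Array[Double]] = Seq(
  Array(0.0, 1.0, 2.0),
  Array(3.0, 4.0, 5.0),
  Array(6.0, 7.0, 8.0),
  Array(9.0, 0.0, 1.0)
)

// Explode the matrix into (row, column, value) entries, the role MatrixEntry plays.
val entries: Seq[(Int, Int, Double)] = for {
  (row, i) <- byYear.zipWithIndex
  (value, j) <- row.zipWithIndex
} yield (i, j, value)

// Transposing swaps the two indices; grouping by the new row index rebuilds the rows.
val byCell: Seq[Array[Double]] = entries
  .map { case (i, j, v) => (j, i, v) }
  .groupBy(_._1)
  .toSeq
  .sortBy(_._1)
  .map { case (_, cellEntries) => cellEntries.sortBy(_._2).map(_._3).toArray }
```

The result has one row per cell, for example `0.0, 3.0, 6.0, 9.0` for the first cell across the four years. The sketch sorts by index explicitly so its output order is deterministic; it makes no claim about how Spark orders the rows of the distributed result.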

In [6]:
//Global variables
var grids_matrix: RDD[Vector] = sc.emptyRDD

if (matrix_offline_mode) {
    grids_matrix = sc.objectFile(grids_matrix_path)
} else {
    val mat :RowMatrix = new RowMatrix(grids_noNaN_RDD.map(m => Vectors.dense(m)))

    // Split the matrix into one number per line.
    val byColumnAndRow = mat.rows.zipWithIndex.map {
        case (row, rowIndex) => row.toArray.zipWithIndex.map {
            case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
        }   
    }.flatMap(x => x)
    
    val matt: CoordinateMatrix = new CoordinateMatrix(byColumnAndRow)
    val matt_T = matt.transpose()
    grids_matrix = matt_T.toRowMatrix().rows
    grids_matrix.saveAsObjectFile(grids_matrix_path)
}


grids_matrix = MapPartitionsRDD[14] at objectFile at <console>:69


MapPartitionsRDD[14] at objectFile at <console>:69

## Kmeans

We use Kmeans from Spark-MLlib. The user should only modify the Kmeans variables in **Mode of Operation setup**.

### Kmeans Training

In [7]:
//Global variables
var kmeans_models :Array[KMeansModel] = new Array[KMeansModel](num_kmeans)
var wssse_data :List[(Int, Int, Double)] = List.empty

if (kmeans_offline_mode) {
    kmeans_models(0) = KMeansModel.load(sc, kmeans_model_paths(0))
    val wssse_data_RDD :RDD[(Int, Int, Double)]  = sc.objectFile(wssse_path)
    wssse_data  = wssse_data_RDD.collect().toList
} else {
    var numClusters_id = 0
    if (fs.exists(new org.apache.hadoop.fs.Path(wssse_path))) {
        val wssse_data_RDD :RDD[(Int, Int, Double)]  = sc.objectFile(wssse_path)
        wssse_data  = wssse_data_RDD.collect().toList
    }
    grids_matrix.cache()
    cfor(minClusters)(_ <= maxClusters, _ + stepClusters) { numClusters =>
        println(numClusters)
        kmeans_models(numClusters_id) = {
            KMeans.train(grids_matrix, numClusters, numIterations)
        }

        // Evaluate clustering by computing Within Set Sum of Squared Errors
        val WSSSE = kmeans_models(numClusters_id).computeCost(grids_matrix)
        println("Within Set Sum of Squared Errors = " + WSSSE)
                
        wssse_data = wssse_data :+ (numClusters, numIterations, WSSSE)
        
        //Save kmeans model
        if (save_kmeans_model) {
            kmeans_models(numClusters_id).save(sc, kmeans_model_paths(numClusters_id))
            
        }
        numClusters_id += 1
    }

    //Un-persist it to save memory
    grids_matrix.unpersist()
    
    if (fs.exists(new org.apache.hadoop.fs.Path(wssse_path))) {
        println("We will delete the wssse file")
        try { fs.delete(new org.apache.hadoop.fs.Path(wssse_path), true) } catch { case _ : Throwable => { } }
    }
    
    println("Lets create it with the new data")
    sc.parallelize(wssse_data, 1).saveAsObjectFile(wssse_path)
}



kmeans_models = Array(org.apache.spark.mllib.clustering.KMeansModel@3d6c5739)
wssse_data = List((35,35,2.5339807158420963E10), (40,35,2.386049786386734E10), (45,35,2.2600763533288258E10), (50,35,2.1799386581334805E10), (55,35,2.098570400859793E10), (60,35,2.0395817151504257E10), (65,35,1.9603798291774227E10), (70,35,1.9086929807489525E10), (75,35,1.854631924299215E10), (80,35,1.8068770812709595E10), (85,35,1.769766672502065E10), (90,35,1.729243792828873E10), (95,35,1.7151593959877056E10), (100,35,1.6705043556647955E10), (110,35,1.6139365756995707E10), (120,35,1.5427827712581413E10), (130,35,1.5008362712794022E10), (140,35,1.4641235988739805E10), (150,35,1.4268992624340197E10), (160,35,1.3952870530548647E10),...


List((35,35,2.5339807158420963E10), (40,35,2.386049786386734E10), (45,35,2.2600763533288258E10), (50,35,2.1799386581334805E10), (55,35,2.098570400859793E10), (60,35,2.0395817151504257E10), (65,35,1.9603798291774227E10), (70,35,1.9086929807489525E10), (75,35,1.854631924299215E10), (80,35,1.8068770812709595E10), (85,35,1.769766672502065E10), (90,35,1.729243792828873E10), (95,35,1.7151593959877056E10), (100,35,1.6705043556647955E10), (110,35,1.6139365756995707E10), (120,35,1.5427827712581413E10), (130,35,1.5008362712794022E10), (140,35,1.4641235988739805E10), (150,35,1.4268992624340197E10), (160,35,1.3952870530548647E10), (170,35,1.3517124877948017E10), (180,35,1.3335754460133108E10), (190,35,1.290478451111284E10), (200,35,1.2687839076375637E10), (210,35,1.2390782393825275E10), (220,35,1.2222396246071983E10), (230,35,1.1960885750938011E10), (240,35,1.1828089797301386E10), (250,35,1.1646639148858463E10), (260,35,1.1481614897674076E10), (270,35,1.1314988131089611E10), (280,35,1.113068574558

### Inspect WSSSE

In [8]:
//current
println(wssse_data)

//from disk
if (fs.exists(new org.apache.hadoop.fs.Path(wssse_path))) {
    var wssse_data_tmp :RDD[(Int, Int, Double)] = sc.objectFile(wssse_path)//.collect()//.toList
    println(wssse_data_tmp.collect().toList)    
}

List((35,35,2.5339807158420963E10), (40,35,2.386049786386734E10), (45,35,2.2600763533288258E10), (50,35,2.1799386581334805E10), (55,35,2.098570400859793E10), (60,35,2.0395817151504257E10), (65,35,1.9603798291774227E10), (70,35,1.9086929807489525E10), (75,35,1.854631924299215E10), (80,35,1.8068770812709595E10), (85,35,1.769766672502065E10), (90,35,1.729243792828873E10), (95,35,1.7151593959877056E10), (100,35,1.6705043556647955E10), (110,35,1.6139365756995707E10), (120,35,1.5427827712581413E10), (130,35,1.5008362712794022E10), (140,35,1.4641235988739805E10), (150,35,1.4268992624340197E10), (160,35,1.3952870530548647E10), (170,35,1.3517124877948017E10), (180,35,1.3335754460133108E10), (190,35,1.290478451111284E10), (200,35,1.2687839076375637E10), (210,35,1.2390782393825275E10), (220,35,1.2222396246071983E10), (230,35,1.1960885750938011E10), (240,35,1.1828089797301386E10), (250,35,1.1646639148858463E10), (260,35,1.1481614897674076E10), (270,35,1.1314988131089611E10), (280,35,1.113068574558

### Run Kmeans clustering

Run Kmeans and obtain the cluster for each cell.
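
Conceptually, prediction assigns each cell to its nearest cluster centre, and WSSSE sums the squared distances from each cell to its assigned centre. The following is a minimal local sketch of both ideas on made-up data; it is not MLlib's implementation, and all names in it are hypothetical.

```scala
// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the closest centroid: the cluster ID assigned to a point.
def nearestCentroid(point: Array[Double], centroids: Seq[Array[Double]]): Int =
  centroids.indices.minBy(k => squaredDistance(point, centroids(k)))

// Within Set Sum of Squared Errors: squared distance of each point to its own centroid, summed.
def wssse(points: Seq[Array[Double]], centroids: Seq[Array[Double]]): Double =
  points.map(p => squaredDistance(p, centroids(nearestCentroid(p, centroids)))).sum

// Hypothetical data: four cells with two years each, and two fixed centroids.
val examplePoints = Seq(Array(1.0, 1.0), Array(2.0, 2.0), Array(9.0, 9.0), Array(10.0, 10.0))
val exampleCentroids = Seq(Array(1.5, 1.5), Array(9.5, 9.5))

val exampleClusterIds: Seq[Int] = examplePoints.map(nearestCentroid(_, exampleCentroids))
val exampleCost: Double = wssse(examplePoints, exampleCentroids)
```

In this toy case the first two cells fall into cluster 0 and the last two into cluster 1, and each cell contributes 0.5 to the cost, for a total of 2.0. A lower WSSSE for a given number of clusters means the cells sit closer to their centres.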

In [34]:
//Cache it so kmeans is more efficient
grids_matrix.cache()

var kmeans_res: Array[RDD[Int]] = Array.fill(num_kmeans)(sc.emptyRDD)
var numClusters_id = 0
cfor(minClusters)(_ <= maxClusters, _ + stepClusters) { numClusters =>
    kmeans_res(numClusters_id) = kmeans_models(numClusters_id).predict(grids_matrix)
    numClusters_id += 1
}

//Un-persist it to save memory
grids_matrix.unpersist()

kmeans_res = Array(MapPartitionsRDD[79] at map at KMeansModel.scala:69)
numClusters_id = 1


MapPartitionsRDD[14] at objectFile at <console>:69

#### Sanity test

This step can be skipped; it only shows the cluster ID for the first 50 cells.

In [79]:
val kmeans_res_out = kmeans_res(0).take(50)
kmeans_res_out.foreach(print)

println(kmeans_res_out.size)

2311111312133221212021142403412312123331333141214050


kmeans_res_out = Array(2, 3, 1, 1, 1, 1, 1, 3, 1, 2, 1, 3, 3, 2, 2, 1, 2, 1, 2, 0, 2, 1, 1, 4, 2, 4, 0, 3, 4, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 1, 3, 3, 3, 1, 4, 1, 2, 1, 4, 0)


[2, 3, 1, 1, 1, 1, 1, 3, 1, 2, 1, 3, 3, 2, 2, 1, 2, 1, 2, 0, 2, 1, 1, 4, 2, 4, 0, 3, 4, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 1, 3, 3, 3, 1, 4, 1, 2, 1, 4, 0]

## Build GeoTiff with Kmeans cluster_IDs

The Grid with the cluster IDs is stored in a SingleBand GeoTiff and uploaded to HDFS.

### Assign cluster ID to each grid cell and save the grid as SingleBand GeoTiff

To assign a cluster ID to each grid cell, it is necessary to get the indices of the grid cells they belong to. The process is not straightforward because the ArrayDouble used to create each dense Vector does not contain the NaN values, so there is no direct mapping between the indices in the Tile's grid and the ones in **kmeans_res** (kmeans result).

The approach for joining the two RDDs was taken from a Stack Overflow post on [how to perform basic joins of two rdd tables in spark using python](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python).
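
The index bookkeeping is easier to see on a toy grid. The sketch below replaces the RDD joins with a local `Map` lookup, purely for illustration; the values and the names `cellCount`, `clusteredPositions`, `toyClusterIds` and `clusterGrid` are hypothetical.

```scala
// Hypothetical 2x3 grid: positions 1 and 3 are NaN, so only four cells were clustered.
val cellCount = 6
val clusteredPositions: Seq[Long] = Seq(0L, 2L, 4L, 5L) // role of grid0_index
val toyClusterIds: Seq[Int] = Seq(2, 0, 1, 2)           // role of kmeans_res, same order

// Pair the n-th cluster ID with the n-th clustered grid position.
val clusterByPosition: Map[Long, Int] = clusteredPositions.zip(toyClusterIds).toMap

// Walk every grid position; cells that were never clustered become NaN,
// which mirrors the left outer join followed by the None-to-NaN conversion.
val clusterGrid: Array[Double] = Array.tabulate(cellCount) { pos =>
  clusterByPosition.get(pos.toLong).map(_.toDouble).getOrElse(Double.NaN)
}
```

The resulting `clusterGrid` has the full grid length again, with cluster IDs at positions 0, 2, 4 and 5 and NaN at positions 1 and 3. This pairing is only correct if the cluster IDs come back in the same order as the stored index.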

In [78]:
var numClusters_id = 0

cfor(minClusters)(_ <= maxClusters, _ + stepClusters) { numClusters =>
    //Merge two RDDs, one containing the clusters_ID indices and the other one the indices of a Tile's grid cells
    val cluster_cell_pos = ((kmeans_res(numClusters_id).zipWithIndex().map{ case (v,i) => (i,v)}).join(grid0_index.zipWithIndex().map{ case (v,i) => (i,v)})).map{ case (k,(v,i)) => (v,i)}

    //Associate a Cluster_IDs to respective Grid_cell
    val grid_clusters = grid0.leftOuterJoin(cluster_cell_pos.map{ case (c,i) => (i.toLong, c)})

    //Convert all None to NaN
    val grid_clusters_res = grid_clusters.sortByKey(true).map{case (k, (v, c)) => if (c == None) (k, Double.NaN) else (k, c.get.toDouble)}
    
    //Define a Tile
    val cluster_cells :Array[Double] = grid_clusters_res.values.collect()
    val cluster_cellsD = DoubleArrayTile(cluster_cells, num_cols_rows._1, num_cols_rows._2)
    val cluster_tile = geotrellis.raster.DoubleArrayTile.empty(num_cols_rows._1, num_cols_rows._2)
    cfor(0)(_ < num_cols_rows._1, _ + 1) { col =>
        cfor(0)(_ < num_cols_rows._2, _ + 1) { row =>
            val v = cluster_cellsD.getDouble(col, row)
            //NaN never compares equal to anything, so use isNaN to skip empty cells
            if (!v.isNaN)
                cluster_tile.setDouble(col, row, v)
        }
    }

    val geoTif = new SinglebandGeoTiff(cluster_tile, projected_extent.extent, projected_extent.crs, Tags.empty, GeoTiffOptions.DEFAULT)
    
    //Save to /tmp/
    GeoTiffWriter.write(geoTif, geotiff_tmp_paths(numClusters_id))

    //Upload to HDFS
    var cmd = "hdfs dfs -copyFromLocal -f " + geotiff_tmp_paths(numClusters_id) + " " + geotiff_hdfs_paths(numClusters_id)
    Process(cmd).!
    
    //Remove from /tmp/
    cmd = "rm -fr " + geotiff_tmp_paths(numClusters_id)
    Process(cmd).!
    
    numClusters_id += 1
}




numClusters_id = 1




1