# Correlation between MODIS and Spring-Index


## Dependencies

In [41]:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

import geotrellis.proj4.CRS
import geotrellis.raster.io.geotiff.writer.GeoTiffWriter
import geotrellis.raster.io.geotiff.{SinglebandGeoTiff, _}
import geotrellis.raster.{CellType, DoubleArrayTile, Tile, UByteCellType}
import geotrellis.spark.io.hadoop._
import geotrellis.vector.{Extent, ProjectedExtent}
import org.apache.hadoop.io.SequenceFile.Writer
import org.apache.hadoop.io.{SequenceFile, _}
import org.apache.spark.broadcast.Broadcast
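import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics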
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.sys.process._

## Mode of operation

Here the user can define the mode of operation.
* **rdd_offline_mode**: If false, the notebook creates all data from scratch and stores **projected_extent** and **num_cols_rows** in HDFS. Otherwise, these data structures are read from HDFS.

It is also possible to define which directory of GeoTiffs is to be used and on which **band** to run the analysis. The options are:
* **all**, which are multi-band (**8 bands**) GeoTiffs
* Or choose single-band ones:
    0. Onset_Greenness_Increase
    1. Onset_Greenness_Maximum
    2. Onset_Greenness_Decrease
    3. Onset_Greenness_Minimum
    4. NBAR_EVI_Onset_Greenness_Minimum
    5. NBAR_EVI_Onset_Greenness_Maximum
    6. NBAR_EVI_Area
    7. Dynamics_QC

<span style="color:red">Note that when using a range, **kmeans offline mode** is not possible and it will be reset to **online mode**</span>.

### Mode of Operation setup
<a id='mode_of_operation_setup'></a>

In [42]:
//Operation mode
var rdd_offline_mode = true

//GeoTiffs to be read from "hdfs:///user/hadoop/spring-index/"
var spring_path = "hdfs:///user/hadoop/spring-index/"
var spring_dir = "BloomFinal"
var modis_path = "hdfs:///user/hadoop/avhrr/"
var modis_dir = "SOST"
var out_path = "hdfs:///user/emma/correlation/"
var band_num = 3

//Years between (inclusive) 1989 - 2014
var modis_first_year = 1989
var modis_last_year = 2014

//Years between (inclusive) 1980 - 2015
var spring_first_year = 1989
var spring_last_year = 2014

//Mask
val toBeMasked = true
val mask_path = "hdfs:///user/hadoop/usa_mask.tif"

rdd_offline_mode = true
spring_path = hdfs:///user/hadoop/spring-index/
spring_dir = BloomFinal
modis_path = hdfs:///user/hadoop/avhrr/
modis_dir = SOST
out_path = hdfs:///user/emma/correlation/
band_num = 3
toBeMasked = true
mask_path = hdfs:///user/hadoop/usa_mask.tif


hdfs:///user/hadoop/usa_mask.tif


<span style="color:red">DON'T MODIFY ANY PIECE OF CODE FROM HERE ON!!!</span>.


### Mode of operation validation

In [43]:
//Check offline modes
var conf = sc.hadoopConfiguration
var fs = org.apache.hadoop.fs.FileSystem.get(conf)

var spring_grid_path = out_path + spring_dir + "_spring_grid"
var modis_grid_path = out_path + modis_dir + "_modis_grid"
var metadata_path = out_path + "/metadata"

val rdd_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(metadata_path))
if (rdd_offline_mode != rdd_offline_exists) {
    println("\"Load GeoTiffs\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + rdd_offline_exists.toString())
    rdd_offline_mode = rdd_offline_exists
} 

var corr_tif = out_path + "_" + modis_dir + "_" + spring_dir + ".tif"
var corr_tif_tmp = "/tmp/correlation_" + modis_dir + "_" + spring_dir + ".tif"

//Years (valid ranges, matching the comments in the setup cell)
val modis_years = 1989 to 2014
val spring_years = 1980 to 2015

if (!modis_years.contains(modis_first_year) || !(modis_years.contains(modis_last_year))) {
    println("Invalid range of years for " + modis_dir + ". It should be between " + modis_years.head + " and " + modis_years.last)
    System.exit(0)
}

if (!spring_years.contains(spring_first_year) || !(spring_years.contains(spring_last_year))) {
    println("Invalid range of years for " + spring_dir + ". It should be between " + spring_years.head + " and " + spring_years.last)
    System.exit(0)
}

//Both data sets must cover the same number of years
if ((modis_last_year - modis_first_year) != (spring_last_year - spring_first_year)) {
    println("The range of years for each data set should be of the same length.")
    System.exit(0)
}
    
var spring_years_range = (spring_years.indexOf(spring_first_year), spring_years.indexOf(spring_last_year))
var modis_years_range = (modis_years.indexOf(modis_first_year), modis_years.indexOf(modis_last_year))

//Global variables
var projected_extent = new ProjectedExtent(new Extent(0,0,0,0), CRS.fromName("EPSG:3857"))
var spring_grids_RDD: RDD[Array[Double]] = sc.emptyRDD
var modis_grids_RDD: RDD[Array[Double]] = sc.emptyRDD
var num_cols_rows :(Int, Int) = (0, 0)
var cellT :CellType = UByteCellType
var mask_tile0 :Tile = new SinglebandGeoTiff(geotrellis.raster.ArrayTile.empty(cellT, num_cols_rows._1, num_cols_rows._2), projected_extent.extent, projected_extent.crs, Tags.empty, GeoTiffOptions.DEFAULT).tile

"Load GeoTiffs" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to false


conf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/usr/lib/spark-2.1.1-bin-without-hadoop/conf/hive-site.xml
fs = DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-427935819_36, ugi=emma (auth:SIMPLE)]]
spring_grid_path = hdfs:///user/emma/correlation/BloomFinal_spring_grid
modis_grid_path = hdfs:///user/emma/correlation/SOST_modis_grid
metadata_path = hdfs:///user/emma/correlation//metadata
rdd_offline_exists = false
corr_tif = hdfs:///user/emma/correlation/_SOST_BloomFinal.tif
corr_tif_tmp = /tmp/correlation_SOST_BloomFinal.tif


projected_extent: geotrellis....


/tmp/correlation_SOST_BloomFinal.tif
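
The year checks in the cell above can also be expressed as a small pure function, which makes them easy to try outside the notebook. The following is only a sketch: `yearIndices` is a hypothetical helper, not part of the notebook, and the year ranges are illustrative values.

```scala
// Returns the (start, end) indices of the requested years within the available
// range, or a message describing why the request is invalid.
// `yearIndices` is a hypothetical helper introduced for illustration only.
def yearIndices(first: Int, last: Int, available: Range): Either[String, (Int, Int)] =
  if (first > last) Left(s"first year $first is after last year $last")
  else if (!available.contains(first) || !available.contains(last))
    Left(s"years must lie between ${available.head} and ${available.last}")
  else Right((available.indexOf(first), available.indexOf(last)))

// Illustrative ranges: one data set available for 1989-2014, the other for 1980-2015.
val modisIdx = yearIndices(1989, 2014, 1989 to 2014)
val springIdx = yearIndices(1989, 2014, 1980 to 2015)

// Both selections must span the same number of years to be correlated.
val sameLength = (modisIdx, springIdx) match {
  case (Right((m0, m1)), Right((s0, s1))) => (m1 - m0) == (s1 - s0)
  case _ => false
}

println(s"modis: $modisIdx, spring: $springIdx, same length: $sameLength")
```

Returning an `Either` instead of calling `System.exit` keeps the kernel alive when a range is wrong, which may be preferable in an interactive session.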

## Functions to (de)serialize any structure into Array[Byte]

In [44]:
def serialize(value: Any): Array[Byte] = {
    val out_stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val obj_out_stream = new ObjectOutputStream(out_stream)
    obj_out_stream.writeObject(value)
    obj_out_stream.close
    out_stream.toByteArray
}

def deserialize(bytes: Array[Byte]): Any = {
    val obj_in_stream = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val value = obj_in_stream.readObject
    obj_in_stream.close
    value
}

serialize: (value: Any)Array[Byte]
deserialize: (bytes: Array[Byte])Any
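
A quick round trip shows how these helpers are meant to be used for the metadata stored in HDFS. The helpers are repeated so the snippet runs on its own; the grid size is a made-up value, not one taken from the data.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Same helpers as above, repeated so the snippet is self-contained.
def serialize(value: Any): Array[Byte] = {
  val out_stream = new ByteArrayOutputStream()
  val obj_out_stream = new ObjectOutputStream(out_stream)
  obj_out_stream.writeObject(value)
  obj_out_stream.close()
  out_stream.toByteArray
}

def deserialize(bytes: Array[Byte]): Any = {
  val obj_in_stream = new ObjectInputStream(new ByteArrayInputStream(bytes))
  val value = obj_in_stream.readObject()
  obj_in_stream.close()
  value
}

// Hypothetical grid size, used only to illustrate the round trip.
val cols_rows: (Int, Int) = (120, 80)
val restored = deserialize(serialize(cols_rows)).asInstanceOf[(Int, Int)]

// deserialize returns Any, so the caller has to cast back to the expected type.
println(restored == cols_rows)
```

Because `deserialize` returns `Any`, a wrong cast only fails at run time, so the order in which values are written and read back has to match.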


## Load GeoTiffs

Using GeoTrellis, all GeoTiffs of a directory are loaded into an RDD. From the RDD we extract the grid of the first file, which is later used to store the results, we build an index to populate that grid, and we filter out all NaN values.

In [None]:
//Load Mask
if (toBeMasked) {
    val mask_tiles_RDD = sc.hadoopGeoTiffRDD(mask_path).values
    val mask_tiles_withIndex = mask_tiles_RDD.zipWithIndex().map{case (e,v) => (v,e)}
    //Keep the first (and only) mask tile
    mask_tile0 = (mask_tiles_withIndex.filter(m => m._1==0).values.collect())(0)
}

//Local variables
val pattern: String = "tif"
val spring_filepath: String = spring_path + "/" + spring_dir
val modis_filepath: String = modis_path + "/" + modis_dir
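
The mask is applied with `localInverseMask(mask, 1, -1000)` followed by a filter on `-1000`. As far as I understand that call, cells whose mask value differs from 1 are overwritten with the marker and then dropped. The plain-Scala sketch below mimics that behaviour on arrays; `applyMask` and the sample values are hypothetical and only illustrate the idea.

```scala
// Marker written into masked-out cells before they are filtered away.
val NoDataMarker = -1000.0

// Keep a cell where the mask equals readMask, mark it otherwise, then drop marked cells.
// `applyMask` is a hypothetical stand-in for localInverseMask + filter.
def applyMask(values: Array[Double], mask: Array[Int], readMask: Int = 1): Array[Double] = {
  require(values.length == mask.length, "grid and mask must have the same number of cells")
  values.zip(mask)
    .map { case (v, m) => if (m == readMask) v else NoDataMarker }
    .filter(_ != NoDataMarker)
}

val grid = Array(10.0, 20.0, 30.0, 40.0) // hypothetical cell values
val mask = Array(1, 0, 1, 0)             // 1 = keep, anything else = drop

println(applyMask(grid, mask).mkString(", ")) // prints "10.0, 30.0"
```

Note that the masked array is shorter than the original grid, so cell positions in the result no longer match positions in the tile.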

### Satellite data

In [50]:
if (rdd_offline_mode) {
    spring_grids_RDD = sc.objectFile(spring_grid_path)
    val metadata = sc.sequenceFile(metadata_path, classOf[IntWritable], classOf[BytesWritable]).map(_._2.copyBytes()).collect()
    projected_extent = deserialize(metadata(0)).asInstanceOf[ProjectedExtent]
    num_cols_rows = (deserialize(metadata(1)).asInstanceOf[Int], deserialize(metadata(2)).asInstanceOf[Int])
} else {
    //Let's load the Spring-Index multiband GeoTiffs and keep an RDD with just the tiles.
    val spring_geos_RDD = sc.hadoopMultibandGeoTiffRDD(spring_filepath, pattern)
    val spring_tiles_RDD = spring_geos_RDD.values

    //Retrieve the number of cols and rows of the Tile's grid
    val tiles_withIndex = spring_tiles_RDD.zipWithIndex().map{case (v,i) => (i,v)}
    val tile0 = (tiles_withIndex.filter(m => m._1==0).values.collect())(0)

    num_cols_rows = (tile0.cols, tile0.rows)
    cellT = tile0.cellType

    //Retrieve the ProjectedExtent, which contains metadata such as CRS and bounding box
    val projected_extents_withIndex = spring_geos_RDD.keys.zipWithIndex().map{case (e,v) => (v,e)}
    projected_extent = (projected_extents_withIndex.filter(m => m._1 == 0).values.collect())(0)

    //Let's read the average of the Spring-Index, which is stored in the 4th band (band_num = 3)
    if (toBeMasked) {
      val mask_tile_broad :Broadcast[Tile] = sc.broadcast(mask_tile0)
      spring_grids_RDD = spring_tiles_RDD.map(m => m.band(band_num).localInverseMask(mask_tile_broad.value, 1, -1000).toArrayDouble().filter(_ != -1000))
    } else {
      spring_grids_RDD = spring_tiles_RDD.map(m => m.band(band_num).toArrayDouble())
    }
        
    //Store data in HDFS
    spring_grids_RDD.saveAsObjectFile(spring_grid_path)

    val writer: SequenceFile.Writer = SequenceFile.createWriter(conf,
        Writer.file(metadata_path),
        Writer.keyClass(classOf[IntWritable]),
        Writer.valueClass(classOf[BytesWritable])
    )

    writer.append(new IntWritable(1), new BytesWritable(serialize(projected_extent)))
    writer.append(new IntWritable(2), new BytesWritable(serialize(num_cols_rows._1)))
    writer.append(new IntWritable(3), new BytesWritable(serialize(num_cols_rows._2)))
    writer.hflush()
    writer.close()
}



lastException = null


Name: org.apache.spark.SparkException
Message: Task not serializable
StackTrace:   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
  ... 74 elided
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
	- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default

### Model data

In [50]:
if (rdd_offline_mode) {
    modis_grids_RDD = sc.objectFile(modis_grid_path)
} else {
    //Let's load the MODIS singleband GeoTiffs and keep an RDD with just the tiles.
    val modis_geos_RDD = sc.hadoopGeoTiffRDD(modis_filepath, pattern)
    val modis_tiles_RDD = modis_geos_RDD.values

    if (toBeMasked) {
        val mask_tile_broad :Broadcast[Tile] = sc.broadcast(mask_tile0)
        modis_grids_RDD = modis_tiles_RDD.map(m => m.localInverseMask(mask_tile_broad.value, 1, -1000).toArrayDouble().filter(_ != -1000))
    } else {
        modis_grids_RDD = modis_tiles_RDD.map(m => m.toArrayDouble())
    }
        
    //Store in HDFS
    modis_grids_RDD.saveAsObjectFile(modis_grid_path)
}



lastException = null


Name: org.apache.spark.SparkException
Message: Task not serializable
StackTrace:   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
  ... 74 elided
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
	- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default

## Matrix

We need to transpose the matrix so that each row holds the values of one grid cell across all years, instead of one year across all cells. With a GeoTiff representing a single year, the loaded data looks like this:
```
bands_RDD.map(s => Vectors.dense(s)).cache()

//The vectors are rows and therefore the matrix will look like this:
[
Vectors.dense(0.0, 1.0, 2.0),
Vectors.dense(3.0, 4.0, 5.0),
Vectors.dense(6.0, 7.0, 8.0),
Vectors.dense(9.0, 0.0, 1.0)
]
```

To achieve that, we convert the **RDD[Vector]** into a distributed matrix, a [**CoordinateMatrix**](https://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix), which has a **transpose** method.
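
The idea behind the coordinate-based transpose can be tried without Spark. The sketch below uses plain Scala collections and the example values from above: every value becomes a (row, column, value) entry, the transpose swaps the two indices, and the rows are rebuilt afterwards. It only illustrates the technique and is not the distributed implementation.

```scala
// One row per year, one column per grid cell (values from the example above).
val byYear: Seq[Array[Double]] = Seq(
  Array(0.0, 1.0, 2.0),
  Array(3.0, 4.0, 5.0),
  Array(6.0, 7.0, 8.0),
  Array(9.0, 0.0, 1.0)
)

// Explode the matrix into (rowIndex, columnIndex, value) entries.
val entries: Seq[(Int, Int, Double)] =
  byYear.zipWithIndex.flatMap { case (row, rowIndex) =>
    row.zipWithIndex.map { case (value, colIndex) => (rowIndex, colIndex, value) }.toSeq
  }

// Transposing swaps the row and column index of every entry.
val transposedEntries = entries.map { case (r, c, v) => (c, r, v) }

// Rebuild dense rows: group by the new row index, order by the new column index.
val byCell: Seq[Array[Double]] = transposedEntries
  .groupBy(_._1)
  .toSeq
  .sortBy(_._1)
  .map { case (_, cellEntries) => cellEntries.sortBy(_._2).map(_._3).toArray }

byCell.foreach(row => println(row.mkString("[", ", ", "]")))
// [0.0, 3.0, 6.0, 9.0]
// [1.0, 4.0, 7.0, 0.0]
// [2.0, 5.0, 8.0, 1.0]
```

Each resulting row is the time series of one grid cell, which is the layout needed to correlate two data sets cell by cell.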

### Satellite data

In [9]:
val t0 = System.nanoTime()

//Global variables
var spring_matrix: RDD[Vector] = sc.emptyRDD

val grid_cells_sizeB = sc.broadcast(spring_cells_size)
if (matrix_offline_mode) {
    spring_matrix = sc.objectFile(spring_matrix_path)
} else {
    //Build sparse rows from the loaded Spring-Index grids, skipping NaN values
    val mat :RowMatrix = new RowMatrix(spring_grids_RDD.map(m => m.zipWithIndex).map(m => m.filter(!_._1.isNaN)).map(m => Vectors.sparse(grid_cells_sizeB.value.toInt, m.map(v => v._2), m.map(v => v._1))))
    // Split the matrix into one number per line.
    val byColumnAndRow = mat.rows.zipWithIndex.map {
        case (row, rowIndex) => row.toArray.zipWithIndex.map {
            case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
        }   
    }.flatMap(x => x)
    
    val matt: CoordinateMatrix = new CoordinateMatrix(byColumnAndRow)
    val matt_T = matt.transpose()
    
    spring_matrix = matt_T.toIndexedRowMatrix().rows.sortBy(_.index).map(_.vector)
    spring_matrix.saveAsObjectFile(spring_matrix_path)
}

val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")

Elapsed time: 105526406ns


t0 = 46708214638977
grids_matrix = MapPartitionsRDD[22] at objectFile at <console>:75
grid_cells_sizeB = Broadcast(10)
t1 = 46708320165383


46708320165383

### Model Data

In [9]:
val t0 = System.nanoTime()
//Global variables
var modis_matrix: RDD[Vector] = sc.emptyRDD

val grid_cells_sizeB = sc.broadcast(grid_cells_size)
if (matrix_offline_mode) {
    modis_matrix = sc.objectFile(modis_matrix_path)
} else {
    //Build sparse rows from the loaded MODIS grids, skipping NaN values
    val mat :RowMatrix = new RowMatrix(modis_grids_RDD.map(m => m.zipWithIndex).map(m => m.filter(!_._1.isNaN)).map(m => Vectors.sparse(grid_cells_sizeB.value.toInt, m.map(v => v._2), m.map(v => v._1))))
    // Split the matrix into one number per line.
    val byColumnAndRow = mat.rows.zipWithIndex.map {
        case (row, rowIndex) => row.toArray.zipWithIndex.map {
            case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
        }   
    }.flatMap(x => x)
    
    val matt: CoordinateMatrix = new CoordinateMatrix(byColumnAndRow)
    val matt_T = matt.transpose()
    
    modis_matrix = matt_T.toIndexedRowMatrix().rows.sortBy(_.index).map(_.vector)
    modis_matrix.saveAsObjectFile(modis_matrix_path)
}

val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")


### Validation

In [None]:
var spring_matrix_rows = spring_matrix.count()
var modis_matrix_rows = modis_matrix.count()

if (spring_matrix_rows != modis_matrix_rows) {
    println("For correlation it is necessary to have matrices with the same number of rows and columns!!!")
    println("Spring matrix has " + spring_matrix_rows + " rows while modis matrix has " + modis_matrix_rows + " rows!!!")
    System.exit(0)
}

## Correlation

In [51]:
val modis = modis_matrix.zipWithIndex().map{ case (v, i) => (i,v)}
val spring = spring_matrix.zipWithIndex().map{case (v,i) => (i, v)}    

val numCells = modis.count()
//One correlation value per grid cell
var corr_res :Array[Double] = new Array[Double](numCells.toInt)
var cell = 0
while (cell < numCells) {
    //Flatten the cell's vector into an RDD[Double] holding its time series
    val modis_array = modis.filter(_._1 == cell).flatMap(_._2.toArray)
    val spring_array = spring.filter(_._1 == cell).flatMap(_._2.toArray)
    corr_res(cell) = Statistics.corr(modis_array, spring_array, "pearson")
    cell += 1
}



lastException = null


Name: java.lang.IllegalArgumentException
Message: Can't zip RDDs with unequal numbers of partitions: List(26, 0)
StackTrace:   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at
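
`Statistics.corr` on two `RDD[Double]` inputs expects both RDDs to have the same number of partitions and the same number of elements per partition, which is what the exception above complains about. One possible alternative, sketched here in plain Scala, is to compute the Pearson coefficient per cell on local arrays once the two time series of a cell are paired up (on Spark that pairing could presumably be done with a `join` on the cell index). The `pearson` helper and the sample series are hypothetical.

```scala
// Pearson correlation of two equally long series; yields NaN when a series is constant.
// `pearson` is a hypothetical helper, not part of the notebook.
def pearson(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length && x.nonEmpty, "series must be non-empty and of equal length")
  val meanX = x.sum / x.length
  val meanY = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  val varX = x.map(a => math.pow(a - meanX, 2)).sum
  val varY = y.map(b => math.pow(b - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// Hypothetical per-cell time series keyed by cell index (stand-ins for the two matrices).
val modisByCell = Map(0L -> Array(100.0, 110.0, 120.0), 1L -> Array(90.0, 95.0, 85.0))
val springByCell = Map(0L -> Array(80.0, 90.0, 100.0), 1L -> Array(70.0, 60.0, 80.0))

// Pair the series by cell index and correlate each pair.
val corrByCell: Map[Long, Double] =
  modisByCell.map { case (cell, series) => cell -> pearson(series, springByCell(cell)) }

corrByCell.toSeq.sortBy(_._1).foreach { case (cell, r) => println(s"cell $cell: $r") }
```

Computing the coefficient per pair avoids launching one Spark job per cell, which the `while` loop above does.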

## Build GeoTiff with Kmeans cluster_IDs

The Grid with the cluster IDs is stored in a SingleBand GeoTiff and uploaded to HDFS.

### Assign cluster ID to each grid cell and save the grid as SingleBand GeoTiff

To assign the clusterID to each grid cell, it is necessary to get the indices of the grid cells they belong to. The process is not straightforward because the ArrayDouble used for the creation of each dense Vector does not contain the NaN values, so there is no direct mapping between the indices in the Tile's grid and the ones in **kmeans_res** (kmeans result).

To join the two RDDs, the approach was taken from a Stack Overflow post on [how to perform basic joins of two rdd tables in spark using python](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python).
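
The index mismatch described above can be illustrated on plain arrays. The sketch below assumes a mask in which 1 marks the cells that were kept; it records their positions and scatters the per-cell results back into a full-size grid, leaving NaN everywhere else. `keptIndices`, `toFullGrid` and all values are hypothetical and only show the idea.

```scala
// Positions of the cells that survived masking, in row-major order.
def keptIndices(mask: Array[Int], readMask: Int = 1): Array[Int] =
  mask.zipWithIndex.collect { case (m, i) if m == readMask => i }

// Scatter per-cell results back into a full grid; masked cells become NaN.
def toFullGrid(results: Array[Double], mask: Array[Int]): Array[Double] = {
  val kept = keptIndices(mask)
  require(kept.length == results.length, "one result per unmasked cell expected")
  val full = Array.fill(mask.length)(Double.NaN)
  kept.zip(results).foreach { case (i, v) => full(i) = v }
  full
}

val gridMask = Array(1, 0, 1, 0, 0, 1)   // hypothetical 2 x 3 mask, row-major
val cellResults = Array(0.8, -0.2, 0.5)  // hypothetical results for the kept cells
val fullGrid = toFullGrid(cellResults, gridMask)

println(fullGrid.mkString(", "))
```

The resulting array has one entry per tile cell, so its length matches `cols * rows` when it is turned into a tile.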

In [48]:
//Define a Tile
val corr_cells :Array[Double] = corr_res //corr_res is already a local array, no collect() needed
val corr_cellsD = DoubleArrayTile(corr_cells, num_cols_rows._1, num_cols_rows._2)

val geoTif = new SinglebandGeoTiff(corr_cellsD, projected_extent.extent, projected_extent.crs, Tags.empty, GeoTiffOptions(compression.DeflateCompression))

//Save to /tmp/
GeoTiffWriter.write(geoTif, corr_tif_tmp)

//Upload to HDFS
var cmd = "hadoop dfs -copyFromLocal -f " + corr_tif_tmp + " " + corr_tif
Process(cmd)!

//Remove from /tmp/
cmd = "rm -fr " + corr_tif_tmp
Process(cmd)!

# [Visualize results](plot_kmeans_clusters.ipynb)