# SVD between a Model and Satellite data

This notebook shows how to multiply two matrices and calculate SVD. Each matrix is created out a set of GeoTiffs for a series of years. Both matrices should have the same dimension.

For demonstration we will use from a model (spring-index) and from a satellite (AVHRR).

## Dependencies

In [81]:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

import geotrellis.proj4.CRS
import geotrellis.raster.io.geotiff.writer.GeoTiffWriter
import geotrellis.raster.io.geotiff.{SinglebandGeoTiff, _}
import geotrellis.raster.{CellType, DoubleArrayTile, Tile, UByteCellType}
import geotrellis.spark.io.hadoop._
import geotrellis.vector.{Extent, ProjectedExtent}
import org.apache.hadoop.io.SequenceFile.Writer
import org.apache.hadoop.io.{SequenceFile, _}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

import scala.sys.process._

## Mode of operation

Here the user can define the mode of operation.
* **rdd_offline_mode**: If false it means the notebook will create all data from scratch and store protected_extent and num_cols_rows into HDFS. Otherwise, these data structures are read from HDFS.

It is also possible to define which directory of GeoTiffs is to be used and on which **band** to run Kmeans. The options are
* **all** which are a multi-band (**8 bands**) GeoTiffs
* Or choose single band ones:
    0. Onset_Greenness_Increase
    1. Onset_Greenness_Maximum
    2. Onset_Greenness_Decrease
    3. Onset_Greenness_Minimum
    4. NBAR_EVI_Onset_Greenness_Minimum
    5. NBAR_EVI_Onset_Greenness_Maximum
    6. NBAR_EVI_Area
    7. Dynamics_QC

<span style="color:red">Note that when using a range **kemans offline mode** is not possible and it will be reset to **online mode**</span>.

### Mode of Operation setup
<a id='mode_of_operation_setup'></a>

In [82]:
var model_rdd_offline_mode = true
var model_matrix_offline_mode = true
var satellite_rdd_offline_mode = true
var satellite_matrix_offline_mode = true

//Using spring-index model
var model_path = "hdfs:///user/hadoop/spring-index/"
var model_dir = "BloomGridmet"

//Using AVHRR Satellite data
//var satellite_path = "hdfs:///user/hadoop/avhrr/"
//var satellite_dir = "SOSTLowPR"
var satellite_path = "hdfs:///user/hadoop/spring-index/"
var satellite_dir = "LeafGridmet"

var out_path = "hdfs:///user/pheno/svd/spark/" + model_dir + satellite_dir + "/"
//First band is 0
var band_num = 0

//Years between (inclusive) 1989 - 2014
var satellite_first_year = 1989
var satellite_last_year = 2014

//Years between (inclusive) 1980 - 2015
var model_first_year = 1989
var model_last_year = 2014

//Mask
val toBeMasked = false
val mask_path = "hdfs:///user/hadoop/usa_mask_low.tif"

val save_rdds = true
val save_matrix = true

model_rdd_offline_mode = true
model_matrix_offline_mode = true
satellite_rdd_offline_mode = true
satellite_matrix_offline_mode = true
model_path = hdfs:///user/hadoop/spring-index/
model_dir = BloomGridmet
satellite_path = hdfs:///user/hadoop/spring-index/
satellite_dir = LeafGridmet
out_path = hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet/
band_num = 0
satellite_first_year = 1989
satellite_last_year = 2014
model_first_year = 1989
model_last_year = 2014
toBeMasked = false
mask_path = hdfs:///user/hadoop/usa_mask_low.tif
save_rdds = true
save_matrix = true


true


<span style="color:red">DON'T MODIFY ANY PIECE OF CODE FROM HERE ON!!!</span>.


### Mode of operation validation

In [83]:
//Check offline modes
var conf = sc.hadoopConfiguration
var fs = org.apache.hadoop.fs.FileSystem.get(conf)

//Paths to store data structures for Offline runs
var mask_str = ""
if (toBeMasked)
  mask_str = "_mask"
var model_grid0_path = out_path + model_dir + "_grid0"
var model_grid0_index_path = out_path + model_dir + "_grid0_index"

var model_grid_path = out_path + model_dir + "_grid"
var satellite_grid_path = out_path + satellite_dir + "_grid"
var model_matrix_path = out_path + model_dir + "_matrix"
var satellite_matrix_path = out_path + satellite_dir + "_matrix"
var metadata_path = out_path + model_dir + "_metadata"

var sc_path = out_path + model_dir + "_sc"
var mc_path = out_path + model_dir + "_mc"

val model_rdd_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(model_grid_path))
val model_matrix_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(model_matrix_path))
val satellite_rdd_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(satellite_grid_path))
val satellite_matrix_offline_exists = fs.exists(new org.apache.hadoop.fs.Path(satellite_matrix_path))

if (model_rdd_offline_mode != model_rdd_offline_exists) {
  println("\"Load GeoTiffs\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + model_rdd_offline_exists.toString())
  model_rdd_offline_mode = model_rdd_offline_exists
}

if (model_matrix_offline_mode != model_matrix_offline_exists) {
  println("\"Matrix\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + model_matrix_offline_exists.toString())
  model_matrix_offline_mode = model_matrix_offline_exists
}

var model_skip_rdd = false
if (model_matrix_offline_exists) {
    println("Since we have a matrix, the load of the grids RDD will be skipped!!!")
    model_skip_rdd = true
}

if (satellite_rdd_offline_mode != satellite_rdd_offline_exists) {
  println("\"Load GeoTiffs\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + satellite_rdd_offline_exists.toString())
  satellite_rdd_offline_mode = satellite_rdd_offline_exists
}

if (satellite_matrix_offline_mode != satellite_matrix_offline_exists) {
  println("\"Matrix\" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to " + satellite_matrix_offline_exists.toString())
  satellite_matrix_offline_mode = satellite_matrix_offline_exists
}

var satellite_skip_rdd = false
if (satellite_matrix_offline_exists) {
    println("Since we have a matrix, the load of the grids RDD will be skipped!!!")
    satellite_skip_rdd = true
}

var corr_tif = out_path + "_" + satellite_dir + "_" + model_dir + ".tif"
var corr_tif_tmp = "/tmp/svd_" + satellite_dir + "_" + model_dir + ".tif"

//Years
val model_years = 1989 to 2014
val satellite_years = 1989 to 2014

if (!satellite_years.contains(satellite_first_year) || !(satellite_years.contains(satellite_last_year))) {
  println("Invalid range of years for " + satellite_dir + ". I should be between " + satellite_first_year + " and " + satellite_last_year)
  System.exit(0)
}

if (!model_years.contains(model_first_year) || !(model_years.contains(model_last_year))) {
  println("Invalid range of years for " + model_dir + ". I should be between " + model_first_year + " and " + model_last_year)
  System.exit(0)
}

if ( ((satellite_last_year - model_first_year) > (model_last_year - model_first_year)) || ((satellite_last_year - model_first_year) > (model_last_year - model_first_year))) {
  println("The range of years for each data set should be of the same length.");
  System.exit(0)
}

var model_years_range = (model_years.indexOf(model_first_year), model_years.indexOf(model_last_year))
var satellite_years_range = (satellite_years.indexOf(satellite_first_year), satellite_years.indexOf(satellite_last_year))

//Global variables
var projected_extent = new ProjectedExtent(new Extent(0,0,0,0), CRS.fromName("EPSG:3857"))
var model_grid0: RDD[(Long, Double)] = sc.emptyRDD
var model_grid0_index: RDD[Long] = sc.emptyRDD
var grids_RDD: RDD[Array[Double]] = sc.emptyRDD
var model_grids_RDD: RDD[Array[Double]] = sc.emptyRDD
var model_grids: RDD[Array[Double]] = sc.emptyRDD
val rows = sc.parallelize(Array[Double]()).map(m => Vectors.dense(m))
var model_summary: MultivariateStatisticalSummary = new RowMatrix(rows).computeColumnSummaryStatistics()
var model_std :Array[Double] = new Array[Double](0)

var satellite_grids_RDD: RDD[Array[Double]] = sc.emptyRDD
var satellite_grids: RDD[Array[Double]] = sc.emptyRDD
var satellite_summary: MultivariateStatisticalSummary = new RowMatrix(rows).computeColumnSummaryStatistics()
var satellite_std :Array[Double] = new Array[Double](0)

var num_cols_rows :(Int, Int) = (0, 0)
var cellT :CellType = UByteCellType
var mask_tile0 :Tile = new SinglebandGeoTiff(geotrellis.raster.ArrayTile.empty(cellT, num_cols_rows._1, num_cols_rows._2), projected_extent.extent, projected_extent.crs, Tags.empty, GeoTiffOptions.DEFAULT).tile
var satellite_cells_size :Long = 0
var model_cells_size :Long = 0
var t0 : Long = 0
var t1 : Long = 0

"Load GeoTiffs" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to false
"Matrix" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to false
"Load GeoTiffs" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to false
"Matrix" offline mode is not set properly, i.e., either it was set to false and the required file does not exist or vice-versa. We will reset it to false

conf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/usr/lib/spark-2.1.1-bin-without-hadoop/conf/hive-site.xml
fs = DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_907625744_41, ugi=pheno (auth:SIMPLE)]]
model_grid0_path = hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet/BloomGridmet_grid0
model_grid0_index_path = hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet/BloomGridmet_grid0_index
model_grid_path = hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet/BloomGridmet_grid
satellite_grid_path = hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet...


mask_str: String = ""


hdfs:///user/pheno/svd/spark/BloomGridmetLeafGridmet/LeafGridmet_grid

## Functions to (de)serialize any structure into Array[Byte]

In [84]:
def serialize(value: Any): Array[Byte] = {
    val out_stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val obj_out_stream = new ObjectOutputStream(out_stream)
    obj_out_stream.writeObject(value)
    obj_out_stream.close
    out_stream.toByteArray
}

def deserialize(bytes: Array[Byte]): Any = {
    val obj_in_stream = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val value = obj_in_stream.readObject
    obj_in_stream.close
    value
}

serialize: (value: Any)Array[Byte]
deserialize: (bytes: Array[Byte])Any


## Load GeoTiffs

Using GeoTrellis all GeoTiffs of a directory will be loaded into a RDD. Using the RDD, we extract a grid from the first file to lated store the Kmeans cluster_IDS, we build an Index for populate such grid and we filter out here all NaN values.

In [85]:
t0 = System.nanoTime()

//Load Mask
if (toBeMasked) {
  val mask_tiles_RDD = sc.hadoopGeoTiffRDD(mask_path).values
  val mask_tiles_withIndex = mask_tiles_RDD.zipWithIndex().map{case (e,v) => (v,e)}
  mask_tile0 = (mask_tiles_withIndex.filter(m => m._1==0).filter(m => !m._1.isNaN).values.collect())(0)
}

//Local variables
val pattern: String = "tif"
val satellite_filepath: String = satellite_path + satellite_dir
val model_filepath: String = model_path + "/" + model_dir

t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")

Elapsed time: 198799ns


t0 = 10986953876449
pattern = tif
satellite_filepath = hdfs:///user/hadoop/spring-index/LeafGridmet
model_filepath = hdfs:///user/hadoop/spring-index//BloomGridmet
t1 = 10986954075248


10986954075248

### Satellite data

In [86]:
t0 = System.nanoTime()

if (satellite_rdd_offline_mode) {
  satellite_grids_RDD = sc.objectFile(satellite_grid_path)
} else {
  //Lets load MODIS Singleband GeoTiffs and return RDD just with the tiles.
  val satellite_geos_RDD = sc.hadoopGeoTiffRDD(satellite_filepath, pattern)
  val satellite_tiles_RDD = satellite_geos_RDD.values

  if (toBeMasked) {
    val mask_tile_broad :Broadcast[Tile] = sc.broadcast(mask_tile0)
    satellite_grids_RDD = satellite_tiles_RDD.map(m => m.localInverseMask(mask_tile_broad.value, 1, -1000).toArrayDouble().filter(_ != -1000))
  } else {
    satellite_grids_RDD = satellite_tiles_RDD.map(m => m.toArrayDouble())
  }

  //Store in HDFS
  if (save_rdds) {
      satellite_grids_RDD.saveAsObjectFile(satellite_grid_path)
  }
}
val satellite_grids_withIndex = satellite_grids_RDD.zipWithIndex().map { case (e, v) => (v, e) }

//Filter out the range of years:
satellite_grids = satellite_grids_withIndex.filterByRange(satellite_years_range._1, satellite_years_range._2).values
satellite_grids.persist(StorageLevel.DISK_ONLY)

//Collect Stats:
satellite_summary = Statistics.colStats(satellite_grids.map(m => Vectors.dense(m)))
satellite_std = satellite_summary.variance.toArray.map(m => scala.math.sqrt(m))

var satellite_grid0_index: RDD[Double] = satellite_grids_withIndex.filter(m => m._1 == 0).values.flatMap(m => m)
satellite_cells_size = satellite_grid0_index.count().toInt
println("Number of cells is: " + satellite_cells_size)

t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")

Number of cells is: 1414560                                                     
Elapsed time: 9940263382ns


t0 = 10990578985885
satellite_grids_withIndex = MapPartitionsRDD[1075] at map at <console>:198
satellite_grids = MapPartitionsRDD[1077] at values at <console>:201
satellite_summary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@470166fc
satellite_std = [D@1ab30351
satellite_grid0_index = MapPartitionsRDD[1085] at flatMap at <console>:208
satellite_cells_size = 1414560
t1 = 11000519249267


11000519249267

### Model data

In [87]:
t0 = System.nanoTime()

if (model_rdd_offline_mode) {
  model_grids_RDD = sc.objectFile(model_grid_path)
  model_grid0 = sc.objectFile(model_grid0_path)
  model_grid0_index = sc.objectFile(model_grid0_index_path)
  val metadata = sc.sequenceFile(metadata_path, classOf[IntWritable], classOf[BytesWritable]).map(_._2.copyBytes()).collect()
  projected_extent = deserialize(metadata(0)).asInstanceOf[ProjectedExtent]
  num_cols_rows = (deserialize(metadata(1)).asInstanceOf[Int], deserialize(metadata(2)).asInstanceOf[Int])
  cellT = deserialize(metadata(3)).asInstanceOf[CellType]
} else {
  if (band_num != 0) {
    val model_geos_RDD = sc.hadoopMultibandGeoTiffRDD(model_filepath, pattern)
    val model_tiles_RDD = model_geos_RDD.values

    //Retrieve the number of cols and rows of the Tile's grid
    val tiles_withIndex = model_tiles_RDD.zipWithIndex().map { case (v, i) => (i, v) }
    val tile0 = (tiles_withIndex.filter(m => m._1 == 0).values.collect()) (0)

    num_cols_rows = (tile0.cols, tile0.rows)
    cellT = tile0.cellType

    //Retrieve the ProjectExtent which contains metadata such as CRS and bounding box
    val projected_extents_withIndex = model_geos_RDD.keys.zipWithIndex().map { case (e, v) => (v, e) }
    projected_extent = (projected_extents_withIndex.filter(m => m._1 == 0).values.collect()) (0)

    val band_numB: Broadcast[Int] = sc.broadcast(band_num)
    if (toBeMasked) {
      val mask_tile_broad: Broadcast[Tile] = sc.broadcast(mask_tile0)
      grids_RDD = model_tiles_RDD.map(m => m.band(band_numB.value).localInverseMask(mask_tile_broad.value, 1, -1000).toArrayDouble())
    } else {
      grids_RDD = model_tiles_RDD.map(m => m.band(band_numB.value).toArrayDouble())
    }
  } else {
    val model_geos_RDD = sc.hadoopGeoTiffRDD(model_filepath, pattern)
    val model_tiles_RDD = model_geos_RDD.values

    //Retrieve the number of cols and rows of the Tile's grid
    val tiles_withIndex = model_tiles_RDD.zipWithIndex().map { case (v, i) => (i, v) }
    val tile0 = (tiles_withIndex.filter(m => m._1 == 0).values.collect()) (0)

    num_cols_rows = (tile0.cols, tile0.rows)
    cellT = tile0.cellType

    //Retrieve the ProjectExtent which contains metadata such as CRS and bounding box
    val projected_extents_withIndex = model_geos_RDD.keys.zipWithIndex().map { case (e, v) => (v, e) }
    projected_extent = (projected_extents_withIndex.filter(m => m._1 == 0).values.collect()) (0)

    if (toBeMasked) {
      val mask_tile_broad: Broadcast[Tile] = sc.broadcast(mask_tile0)
      grids_RDD = model_tiles_RDD.map(m => m.localInverseMask(mask_tile_broad.value, 1, -1000).toArrayDouble())
    } else {
      grids_RDD = model_tiles_RDD.map(m => m.toArrayDouble())
    }
  }

  //Get Index for each Cell
  val grids_withIndex = grids_RDD.zipWithIndex().map { case (e, v) => (v, e) }
  if (toBeMasked) {
    model_grid0_index = grids_withIndex.filter(m => m._1 == 0).values.flatMap(m => m).zipWithIndex.filter(m => m._1 != -1000.0).map { case (v, i) => (i) }
  } else {
    model_grid0_index = grids_withIndex.filter(m => m._1 == 0).values.flatMap(m => m).zipWithIndex.map { case (v, i) => (i) }
  }

  //Get the Tile's grid
  model_grid0 = grids_withIndex.filter(m => m._1 == 0).values.flatMap( m => m).zipWithIndex.map{case (v,i) => (i,v)}

  //Lets filter out NaN
  if (toBeMasked) {
    model_grids_RDD = grids_RDD.map(m => m.filter(m => m != -1000.0))
  } else {
    model_grids_RDD = grids_RDD
  }

  //Store data in HDFS
  model_grids_RDD.saveAsObjectFile(model_grid_path)
  model_grid0.saveAsObjectFile(model_grid0_path)
  model_grid0_index.saveAsObjectFile(model_grid0_index_path)

  val writer: SequenceFile.Writer = SequenceFile.createWriter(conf,
    Writer.file(metadata_path),
    Writer.keyClass(classOf[IntWritable]),
    Writer.valueClass(classOf[BytesWritable])
  )

  writer.append(new IntWritable(1), new BytesWritable(serialize(projected_extent)))
  writer.append(new IntWritable(2), new BytesWritable(serialize(num_cols_rows._1)))
  writer.append(new IntWritable(3), new BytesWritable(serialize(num_cols_rows._2)))
  writer.append(new IntWritable(4), new BytesWritable(serialize(cellT)))
  writer.hflush()
  writer.close()
}

val model_grids_withIndex = model_grids_RDD.zipWithIndex().map { case (e, v) => (v, e) }

//Filter out the range of years:
model_grids = model_grids_withIndex.filterByRange(model_years_range._1, model_years_range._2).values
model_grids.persist(StorageLevel.DISK_ONLY)

//Collect Stats:
model_summary = Statistics.colStats(model_grids.map(m => Vectors.dense(m)))
//model_std = model_summary.variance.toArray.map(m => scala.math.sqrt(m))

var model_tile0_index: RDD[Double] = model_grids_withIndex.filter(m => m._1 == 0).values.flatMap(m => m)
model_cells_size = model_tile0_index.count().toInt

t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")

Elapsed time: 41297494747ns                                                     


t0 = 11006417639401
model_grids_withIndex = MapPartitionsRDD[1118] at map at <console>:279
model_grids = MapPartitionsRDD[1120] at values at <console>:282
model_summary = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@55d0e57c
model_tile0_index = MapPartitionsRDD[1128] at flatMap at <console>:289
model_cells_size = 1414560
t1 = 11047715134148


11047715134148

## Matrices

### Satellite

In [88]:
//Satellite
val satellite_cells_sizeB = sc.broadcast(satellite_cells_size)
val satellite_mat: RowMatrix = new RowMatrix(satellite_grids.map(m => m.zipWithIndex).map(m => m.filter(!_._1.isNaN)).map(m => Vectors.sparse(satellite_cells_sizeB.value.toInt, m.map(v => v._2), m.map(v => v._1))))

// Split the matrix into one number per line.
val sat_byColumnAndRow = satellite_mat.rows.zipWithIndex.map {
  case (row, rowIndex) => row.toArray.zipWithIndex.map {
    case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
  }
}.flatMap(x => x)
val satellite_blockMatrix: BlockMatrix = new CoordinateMatrix(sat_byColumnAndRow).toBlockMatrix()



satellite_cells_sizeB = Broadcast(571)
satellite_mat = org.apache.spark.mllib.linalg.distributed.RowMatrix@56d1a612
sat_byColumnAndRow = MapPartitionsRDD[1134] at flatMap at <console>:162
satellite_blockMatrix = org.apache.spark.mllib.linalg.distributed.BlockMatrix@41714a00


org.apache.spark.mllib.linalg.distributed.BlockMatrix@41714a00

### SC

In [89]:
//SC
val sc_exists = fs.exists(new org.apache.hadoop.fs.Path(sc_path))
var Sc :BlockMatrix = null
if (sc_exists) {
  val rdd_indexed_rows :RDD[IndexedRow]= sc.objectFile(sc_path)
  Sc = new IndexedRowMatrix(rdd_indexed_rows).toBlockMatrix()
} else {
  val satellite_M_1_Gc = sc.parallelize(Array[Vector](satellite_summary.mean)).map(m => Vectors.dense(m.toArray))
  val satellite_M_1_Gc_RowM: RowMatrix = new RowMatrix(satellite_M_1_Gc)
  val sat_M_1_Gc_byColumnAndRow = satellite_M_1_Gc_RowM.rows.zipWithIndex.map {
    case (row, rowIndex) => row.toArray.zipWithIndex.map {
      case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
    }
  }.flatMap(x => x)
  val satellite_M_1_Gc_blockMatrix = new CoordinateMatrix(sat_M_1_Gc_byColumnAndRow).toBlockMatrix()

  val sat_matrix_Nt_1 = new Array[Double](satellite_grids.count().toInt)
  satellite_grids.unpersist(false)
  for (i <- 0 until sat_matrix_Nt_1.length)
    sat_matrix_Nt_1(i) = 1
  val satellite_M_Nt_1 = sc.parallelize(sat_matrix_Nt_1).map(m => Vectors.dense(m))
  val satellite_M_Nt_1_RowM: RowMatrix = new RowMatrix(satellite_M_Nt_1)
  val sat_M_Nt_1_byColumnAndRow = satellite_M_Nt_1_RowM.rows.zipWithIndex.map {
    case (row, rowIndex) => row.toArray.zipWithIndex.map {
      case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
    }
  }.flatMap(x => x)
  val satellite_M_Nt_1_blockMatrix = new CoordinateMatrix(sat_M_Nt_1_byColumnAndRow).toBlockMatrix()
  val satellite_M_Nt_Gc_blockMatrix = satellite_M_Nt_1_blockMatrix.multiply(satellite_M_1_Gc_blockMatrix)

  //Sc = satellite_blockMatrix.subtract(satellite_M_Nt_Gc_blockMatrix)
  val joined_mat :RDD[ (Long, (Array[Double], Array[Double]))] = satellite_blockMatrix.toCoordinateMatrix().toRowMatrix().rows.map(_.toArray).zipWithUniqueId().map{case (v,i) => (i,v)}.join(satellite_M_Nt_Gc_blockMatrix.toCoordinateMatrix().toRowMatrix().rows.map(_.toArray).zipWithUniqueId().map{case (v,i) => (i,v)})
  Sc = (new CoordinateMatrix(joined_mat.map {case (row_index, (a,b)) => a.zip(b).map(m => m._1*m._2).zipWithIndex.map{ case (v,col_index) => new MatrixEntry(row_index, col_index,v)}}.flatMap(m => m))).toBlockMatrix() 

  //save to disk
  Sc.toIndexedRowMatrix().rows.saveAsObjectFile(sc_path)
}
Sc.persist(StorageLevel.DISK_ONLY)



sc_exists = false
Sc = org.apache.spark.mllib.linalg.distributed.BlockMatrix@af400ca


org.apache.spark.mllib.linalg.distributed.BlockMatrix@af400ca

### Model Data

In [90]:
//Model
val model_cells_sizeB = sc.broadcast(model_cells_size)
val model_mat: RowMatrix = new RowMatrix(model_grids.map(m => m.zipWithIndex).map(m => m.filter(!_._1.isNaN)).map(m => Vectors.sparse(model_cells_sizeB.value.toInt, m.map(v => v._2), m.map(v => v._1))))

// Split the matrix into one number per line.
val mod_byColumnAndRow = model_mat.rows.zipWithIndex.map {
  case (row, rowIndex) => row.toArray.zipWithIndex.map {
    case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
  }
}.flatMap(x => x)
val model_blockMatrix: BlockMatrix = new CoordinateMatrix(mod_byColumnAndRow).transpose().toBlockMatrix()

model_cells_sizeB = Broadcast(595)
model_mat = org.apache.spark.mllib.linalg.distributed.RowMatrix@273dfe93
mod_byColumnAndRow = MapPartitionsRDD[1202] at flatMap at <console>:162
model_blockMatrix = org.apache.spark.mllib.linalg.distributed.BlockMatrix@6eebebcb


org.apache.spark.mllib.linalg.distributed.BlockMatrix@6eebebcb

### Mc

In [91]:
//MC
val mc_exists = fs.exists(new org.apache.hadoop.fs.Path(mc_path))
var Mc :BlockMatrix = null
if (mc_exists) {
  val rdd_indexed_rows :RDD[IndexedRow]= sc.objectFile(mc_path)
  Mc = new IndexedRowMatrix(rdd_indexed_rows).toBlockMatrix()
} else {
  val model_M_1_Gc = sc.parallelize(Array[Vector](model_summary.mean)).map(m => Vectors.dense(m.toArray))
  val model_M_1_Gc_RowM: RowMatrix = new RowMatrix(model_M_1_Gc)
  val mod_M_1_Gc_byColumnAndRow = model_M_1_Gc_RowM.rows.zipWithIndex.map {
    case (row, rowIndex) => row.toArray.zipWithIndex.map {
      case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
    }
  }.flatMap(x => x)
  val model_M_1_Gc_blockMatrix = new CoordinateMatrix(mod_M_1_Gc_byColumnAndRow).toBlockMatrix()

  val model_matrix_Nt_1 = new Array[Double](model_grids.count().toInt)
  model_grids.unpersist(false)

  for (i <- 0 until model_matrix_Nt_1.length)
    model_matrix_Nt_1(i) = 1
  val model_M_Nt_1 = sc.parallelize(model_matrix_Nt_1).map(m => Vectors.dense(m))
  val model_M_Nt_1_RowM: RowMatrix = new RowMatrix(model_M_Nt_1)
  val mod_M_Nt_1_byColumnAndRow = model_M_Nt_1_RowM.rows.zipWithIndex.map {
    case (row, rowIndex) => row.toArray.zipWithIndex.map {
      case (number, columnIndex) => new MatrixEntry(rowIndex, columnIndex, number)
    }
  }.flatMap(x => x)
  val model_M_Nt_1_blockMatrix = new CoordinateMatrix(mod_M_Nt_1_byColumnAndRow).toBlockMatrix()
  val model_M_Nt_Gc_blockMatrix = model_M_Nt_1_blockMatrix.multiply(model_M_1_Gc_blockMatrix)
  val model_M_Gc_Nt_blockMatrix = model_M_Nt_Gc_blockMatrix.transpose
  
  //Mc = model_blockMatrix.subtract(model_M_Gc_Nt_blockMatrix)
  val joined_mat :RDD[ (Long, (Array[Double], Array[Double]))] = model_blockMatrix.toCoordinateMatrix().toRowMatrix().rows.map(_.toArray).zipWithUniqueId().map{case (v,i) => (i,v)}.join(model_M_Gc_Nt_blockMatrix.toCoordinateMatrix().toRowMatrix().rows.map(_.toArray).zipWithUniqueId().map{case (v,i) => (i,v)})
  Mc = (new CoordinateMatrix(joined_mat.map {case (row_index, (a,b)) => a.zip(b).map(m => m._1*m._2).zipWithIndex.map{ case (v,col_index) => new MatrixEntry(row_index, col_index,v)}}.flatMap(m => m))).toBlockMatrix()

  //save to disk
  Mc.toIndexedRowMatrix().rows.saveAsObjectFile(mc_path)
}
Mc.persist(StorageLevel.DISK_ONLY)



mc_exists = false
Mc = org.apache.spark.mllib.linalg.distributed.BlockMatrix@370c2642


org.apache.spark.mllib.linalg.distributed.BlockMatrix@370c2642

## Matrix Multiplication

In [92]:
//Matrix Multiplication
//val matrix_mul = model_blockMatrix.multiply(satellite_blockMatrix)
val matrix_mul = Mc.multiply(Sc)
matrix_mul.persist(StorageLevel.DISK_ONLY)

//val resRowMatrix: RowMatrix = new RowMatrix(matrix_mul.toIndexedRowMatrix().rows.sortBy(_.index).map(_.vector))




matrix_mul = org.apache.spark.mllib.linalg.distributed.BlockMatrix@6032e8cc


org.apache.spark.mllib.linalg.distributed.BlockMatrix@6032e8cc

In [93]:
println(matrix_mul.numCols)
println(matrix_mul.numRows)

1414560
493832


## SVD

In [94]:
//SVD
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix_mul.toCoordinateMatrix().toRowMatrix().computeSVD(10, computeU = true)

//val U: RowMatrix = svd.U // The U factor is a RowMatrix.
//val s: Vector = svd.s // The singular values are stored in a local dense vector.
//val V: Matrix = svd.V // The V factor is a local dense matrix.

[Stage 2466:>                                                     (0 + 34) / 36]

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 29 in stage 2466.0 failed 4 times, most recent failure: Lost task 29.3 in stage 2466.0 (TID 200652, 145.100.59.233, executor 24): ExecutorLostFailure (executor 24 exited caused by one of the running tasks) Reason: Command exited with code 52
Driver stacktrace:
StackTrace: Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anon

In [95]:
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val S = Matrices.diag(s)

U = org.apache.spark.mllib.linalg.distributed.RowMatrix@13a81f4
s = [1.7294690472100357E13,3.5417440832692556E9,2.218091441578333E9,1.4102137597250912E9,1.0501195763116548E9,7.490420148329376E8,5.105869757765809E8,4.219423656858187E8,3.515090255611161E8,2.966443139452226E8]
V = 


lastException: Throwable = null
-0.00718766236...


-0.008214439989956223   -0.0024723820007321044  ... (10 total)
-0.00821443998995622    -0.0024723820007328673  ...
-0.007214179882276797   -0.004419063578345915   ...
-0.0074317588149111925  -0.004702615520864071   ...
-0.00737423750464065    -0.004590520013245656   ...
-0.007080193275179696   -0.00374039318980654    ...
-0.007080193275179695   -0.003740393189807042   ...
-0.007187662361606361   -0.005214828820355853   ...
-0.007196118383966567   -0.005165290038956709   ...
-0.00719611838396657    -0.005165290038956005   ...
-0.007196118383966571   -0.005165290038956264   ...
-0.007393471035251325   -0.005381283570195753   ...
-0.007196118383966569   -0.005165290038956245   ...
-0.007223990876863853   -0.00422548179952084    ...
-0.007043827597506955   -0.004331001303692338   ...
-0.007043827597506951   -0.004331001303692099   ...
-0.007035059608884771   -0.003689546775446188   ...
-0.0072138165626801955  -0.005279059954993707   ...
-0.0072519146908152365  -0.004725102219028031   ...
-

# Save results

### R

In [96]:
U.rows.map(m => m.toArray.mkString(",")).repartition(1).saveAsTextFile(out_path + "U.csv")



### V

In [97]:
sc.parallelize(V.rowIter.toVector.map(m => m.toArray.mkString(",")),1).saveAsTextFile(out_path + "V.csv")

### S

In [98]:
sc.parallelize(S.rowIter.toVector.map(m => m.toArray.mkString(",")),1).saveAsTextFile(out_path + "S.csv")