# Logistic Regression

In this tutorial we build a logistic regression model with BigDL, using the *MNIST* dataset for the experiments. For more information about MNIST, please refer to this [site](http://yann.lecun.com/exdb/mnist/). The first step is to import the necessary packages and initialize the engine.

The following cell defines a helper function, `load`, that reads the raw MNIST image and label files into an array of `ByteRecord`s.

In [2]:
import java.nio.ByteBuffer
import java.nio.file.{Files, Paths}

import com.intel.analytics.bigdl.dataset.ByteRecord

def load(featureFile: String, labelFile: String): Array[ByteRecord] = {
    val featureBuffer = ByteBuffer.wrap(Files.readAllBytes(Paths.get(featureFile)))
    val labelBuffer = ByteBuffer.wrap(Files.readAllBytes(Paths.get(labelFile)))
    
    // Each file starts with a magic number: 2049 for label files, 2051 for image files.
    val labelMagicNumber = labelBuffer.getInt()
    require(labelMagicNumber == 2049)
    val featureMagicNumber = featureBuffer.getInt()
    require(featureMagicNumber == 2051)

    val labelCount = labelBuffer.getInt()
    val featureCount = featureBuffer.getInt()
    require(labelCount == featureCount)

    val rowNum = featureBuffer.getInt()
    val colNum = featureBuffer.getInt()

    val result = new Array[ByteRecord](featureCount)
    var i = 0
    while (i < featureCount) {
      val img = new Array[Byte]((rowNum * colNum))
      var y = 0
      while (y < rowNum) {
        var x = 0
        while (x < colNum) {
          img(x + y * colNum) = featureBuffer.get()
          x += 1
        }
        y += 1
      }
      // Shift labels from 0-9 to 1-10: ClassNLLCriterion expects 1-based class indices.
      result(i) = ByteRecord(img, labelBuffer.get().toFloat + 1.0f)
      i += 1
    }

    result
}
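
The header fields that `load` checks are stored as big-endian 32-bit integers, followed by one unsigned byte per pixel or label. The sketch below builds a tiny image file in memory and reads it back the same way. The buffer `tinyIdx` and its single 2x2 image are made up for illustration and are not part of the tutorial's data.

```scala
import java.nio.ByteBuffer

// Hypothetical 1-image, 2x2 image file built in memory for illustration.
// Header: magic number, image count, rows, cols (ByteBuffer is big-endian by default).
val tinyIdx: Array[Byte] = {
  val buf = ByteBuffer.allocate(16 + 4)
  buf.putInt(2051).putInt(1).putInt(2).putInt(2)
  buf.put(Array[Byte](0, 127, -128, -1)) // raw pixels; -1 is the unsigned byte 255
  buf.array()
}

val header = ByteBuffer.wrap(tinyIdx)
val magic = header.getInt() // 2051 marks an image file
val count = header.getInt()
val rows = header.getInt()
val cols = header.getInt()

// Bytes are signed on the JVM, so mask with 0xff to recover the 0-255 pixel value.
val pixels = Array.fill(rows * cols)(header.get() & 0xff)

println(s"magic=$magic count=$count size=${rows}x$cols pixels=${pixels.mkString(",")}")
```

The masking step only matters once the bytes are turned into numbers; `load` itself keeps the raw bytes and leaves the conversion to the later transformers.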

In [3]:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

import com.intel.analytics.bigdl._
import com.intel.analytics.bigdl.utils._
import com.intel.analytics.bigdl.dataset.DataSet
import com.intel.analytics.bigdl.dataset.image.{BytesToGreyImg, GreyImgNormalizer, GreyImgToBatch, GreyImgToSample}
import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, Linear, LogSoftMax, Module, Reshape, Sequential}
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.optim._
import com.intel.analytics.bigdl.utils.{Engine, LoggerFilter, T, Table}
import com.intel.analytics.bigdl.models.lenet.Utils._
import com.intel.analytics.bigdl.tensor._

Next, we specify where the MNIST training and validation files are stored. Edit the paths below to match your system.

In [4]:
val trainData = "../../datasets/mnist/train-images-idx3-ubyte"
val trainLabel = "../../datasets/mnist/train-labels-idx1-ubyte"
val validationData = "../../datasets/mnist/t10k-images-idx3-ubyte"
val validationLabel = "../../datasets/mnist/t10k-labels-idx1-ubyte"

In [5]:
//Parameters
val batchSize = 2048
val learningRate = 0.2
val maxEpochs = 15

//Network Parameters
val nInput = 784 //MNIST data input (img shape: 28*28)
val nClasses = 10  //MNIST total classes (0-9 digits)
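
As a rough sanity check on these settings, the sketch below works out how many mini-batches make up one epoch and how many trainable parameters the model will have. It assumes the standard MNIST training set of 60,000 images. The names `trainSize`, `batchesPerEpoch`, and `paramCount` are illustrative and not part of the BigDL API, and the iteration count BigDL reports may differ depending on how it distributes batches.

```scala
val batchSize = 2048
val nInput = 784
val nClasses = 10
val trainSize = 60000 // assumption: the standard MNIST training set size

// Ceiling division: the last batch of an epoch may be only partially filled.
val batchesPerEpoch = (trainSize + batchSize - 1) / batchSize

// One weight per (input, class) pair plus one bias per class.
val paramCount = nInput * nClasses + nClasses

println(s"$batchesPerEpoch batches per epoch, $paramCount parameters")
```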

In [6]:
Engine.init

In [7]:
val trainSet = 
    DataSet.array(load(trainData, trainLabel), sc) -> BytesToGreyImg(28, 28) -> GreyImgNormalizer(trainMean, trainStd) -> GreyImgToBatch(batchSize)
val validationSet = 
    DataSet.array(load(validationData, validationLabel), sc) -> BytesToGreyImg(28, 28) -> GreyImgNormalizer(testMean, testStd) -> GreyImgToBatch(batchSize)
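
The pipeline above turns raw bytes into normalized grey images. A minimal plain-Scala sketch of that arithmetic follows, assuming that `BytesToGreyImg` scales each unsigned byte to the range [0, 1] and that `GreyImgNormalizer(mean, std)` computes `(x - mean) / std`. The `mean` and `std` values below are placeholders chosen for illustration; they are not the `trainMean` and `trainStd` constants imported from `com.intel.analytics.bigdl.models.lenet.Utils`.

```scala
// Placeholder statistics for illustration only.
val mean = 0.13f
val std = 0.31f

// Scale an unsigned byte to [0, 1], then standardize it.
def normalize(raw: Byte): Float = ((raw & 0xff) / 255.0f - mean) / std

val black = normalize(0.toByte)   // background pixel
val white = normalize(255.toByte) // 255 wraps to the signed byte -1

println(f"black=$black%.3f white=$white%.3f")
```

Standardizing the inputs this way keeps the features on a comparable scale, which generally helps gradient-based training.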

In [8]:
val model = Sequential().add(Reshape(Array(28 * 28))).add(Linear(nInput, nClasses)).add(LogSoftMax())
model

Sequential[9857d67b]{
  [input -> (1) -> (2) -> (3) -> output]
  (1): Reshape[3dec9b34](784)
  (2): Linear[c5ea5213](784 -> 10)
  (3): LogSoftMax[5b4a673]
}
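
These three layers amount to multinomial logistic regression: flatten the image, apply one affine map, and take a log-softmax. The sketch below spells that out with plain arrays for a toy case with 3 inputs and 2 classes. The names `forward`, `weights`, and `bias` are illustrative, not BigDL API, and the numbers are made up.

```scala
// Log-softmax with the usual max-shift for numerical stability.
def logSoftMax(z: Array[Double]): Array[Double] = {
  val m = z.max
  val logSum = m + math.log(z.map(v => math.exp(v - m)).sum)
  z.map(_ - logSum)
}

// weights(k) holds the weight row for class k; bias(k) is its offset.
def forward(x: Array[Double], weights: Array[Array[Double]], bias: Array[Double]): Array[Double] = {
  val logits = weights.zip(bias).map { case (w, b) =>
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
  }
  logSoftMax(logits)
}

val x = Array(1.0, 0.0, 2.0)
val weights = Array(Array(0.5, -1.0, 0.25), Array(-0.5, 1.0, 0.0))
val bias = Array(0.0, 0.1)

val logProbs = forward(x, weights, bias)
// Exponentiating the log-probabilities gives class probabilities that sum to 1.
println(logProbs.map(v => math.exp(v)).mkString(", "))
```

In the real model the same computation runs with 784 inputs and 10 classes, and the weights are learned rather than fixed.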

In [9]:
val optimizer = Optimizer(model = model, dataset = trainSet, criterion = ClassNLLCriterion[Float]())
optimizer.setValidation(trigger = Trigger.everyEpoch, dataset = validationSet, vMethods = Array(new Top1Accuracy[Float], new Top5Accuracy[Float], new Loss[Float]))
optimizer.setOptimMethod(new SGD(learningRate=learningRate))
optimizer.setEndWhen(Trigger.maxEpoch(maxEpochs))

com.intel.analytics.bigdl.optim.DistriOptimizer@7ed045fb
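
`ClassNLLCriterion` takes the log-probabilities produced by `LogSoftMax` together with a 1-based class label, which is why `load` adds 1 to every MNIST label. A minimal sketch of that loss for a single sample, along with the plain SGD update rule, is shown below. The helpers `nll` and `sgdStep` are illustrative, not BigDL API, and they ignore options such as momentum, weight decay, and batch averaging.

```scala
// Negative log-likelihood of the true class; `target` is 1-based.
def nll(logProbs: Array[Double], target: Int): Double = -logProbs(target - 1)

// Plain SGD: move each parameter against its gradient.
def sgdStep(params: Array[Double], grads: Array[Double], learningRate: Double): Array[Double] =
  params.zip(grads).map { case (p, g) => p - learningRate * g }

val sampleLogProbs = Array(math.log(0.7), math.log(0.2), math.log(0.1))
val loss = nll(sampleLogProbs, target = 1) // equals -log(0.7)

val updated = sgdStep(Array(1.0, -2.0), Array(0.5, -0.5), learningRate = 0.2)
println(s"loss=$loss updated=${updated.mkString(",")}")
```

The loss is small when the model assigns high probability to the correct class and grows without bound as that probability approaches zero.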

In [10]:
val trainedModel = optimizer.optimize()
trainedModel

can't find locality partition for partition 0 Partition locations are (ArrayBuffer(172.168.2.109)) Candidate partition locations are
(0,List()).


Sequential[9857d67b]{
  [input -> (1) -> (2) -> (3) -> output]
  (1): Reshape[3dec9b34](784)
  (2): Linear[c5ea5213](784 -> 10)
  (3): LogSoftMax[5b4a673]
}

In [11]:
val rddData = sc.parallelize(load(validationData, validationLabel), batchSize)
val transformer = BytesToGreyImg(28, 28) -> GreyImgNormalizer(testMean, testStd) -> GreyImgToSample()
val evaluationSet = transformer(rddData)
        
val result = model.evaluate(evaluationSet, Array(new Top1Accuracy[Float]), Some(batchSize))

result.foreach(r => println(s"${r._2} is ${r._1}"))

Top1Accuracy is Accuracy(correct: 9223, count: 10000, accuracy: 0.9223)


In [12]:
val predictions = model.predict(evaluationSet)
val preLabels = predictions.take(20).map(_.toTensor.max(1)._2.valueAt(1)).mkString(",")
val labels = evaluationSet.take(20).map(_.label.valueAt(1)).mkString(",")
println(preLabels)
println(labels)

8.0,3.0,2.0,1.0,7.0,2.0,5.0,10.0,7.0,1.0,1.0,7.0,10.0,1.0,4.0,6.0,10.0,8.0,4.0,6.0
8.0,3.0,2.0,1.0,5.0,2.0,5.0,10.0,6.0,10.0,1.0,7.0,10.0,1.0,2.0,6.0,10.0,8.0,4.0,5.0
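
Because `load` stores each digit `d` as the label `d + 1`, both printed lines are shifted by one: `8.0` stands for the digit 7 and `10.0` for the digit 9. The sketch below copies the two printed lines, maps them back to digits, and counts how often they agree. `toDigits` is an illustrative helper, not part of BigDL.

```scala
val predictedLine = "8.0,3.0,2.0,1.0,7.0,2.0,5.0,10.0,7.0,1.0,1.0,7.0,10.0,1.0,4.0,6.0,10.0,8.0,4.0,6.0"
val labelLine = "8.0,3.0,2.0,1.0,5.0,2.0,5.0,10.0,6.0,10.0,1.0,7.0,10.0,1.0,2.0,6.0,10.0,8.0,4.0,5.0"

// Undo the +1 shift applied in `load` to recover the actual digits.
def toDigits(line: String): Array[Int] = line.split(",").map(s => s.toFloat.toInt - 1)

val predictedDigits = toDigits(predictedLine)
val trueDigits = toDigits(labelLine)
val correct = predictedDigits.zip(trueDigits).count { case (p, t) => p == t }

println(s"predicted: ${predictedDigits.mkString(" ")}")
println(s"actual:    ${trueDigits.mkString(" ")}")
println(s"$correct of ${trueDigits.length} correct") // prints "15 of 20 correct"
```

A sample of 20 images is too small to estimate accuracy reliably, so the figure reported over the full validation set is the one to rely on.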


In [13]:
sc.stop()