# Scala Spark: Simulating Data

Simulating data is an important technique to:

1. Explore simple assumptions and how they play out statistically
2. Sanity-check your data processing and analysis


Suppose you build a complex data processing pipeline. How do you know that it's working, i.e. that it's not misprocessing the data?

If you know ahead of time which assumptions went into a simulation, your analytical pipeline should produce those assumptions as its conclusions.

It can be a good idea to build a simulation of your data sets *before* using real-world data. This lets you think through the various attributes you will be provided with, and lets you build a data processing pipeline you can unit-test against known, fixed outputs.
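
As a minimal sketch of that idea: generate data with a known mean from a seeded generator, run it through a stand-in pipeline, and check that the result lands near the value you built in. The names `KnownAnswer` and `pipeline`, the seed, and the tolerance are illustrative assumptions, not part of the original notebook.

```scala
import scala.util.Random

object KnownAnswer {
  // A fixed seed makes the simulated data reproducible between runs.
  private val rng = new Random(42)

  val trueMean = 10.0
  val trueSD = 5.0

  // Simulated input: 10,000 normal draws with a known mean and SD.
  val data: Vector[Double] =
    Vector.fill(10000)(trueSD * rng.nextGaussian() + trueMean)

  // Hypothetical stand-in for a real pipeline: here it just averages its input.
  def pipeline(xs: Seq[Double]): Double = xs.sum / xs.length

  val estimate: Double = pipeline(data)

  // The pipeline passes if it recovers the mean we built in, within a tolerance.
  val recovered: Boolean = math.abs(estimate - trueMean) < 0.5
}
```

A real pipeline would replace `pipeline`; the point is only that the expected answer is known before the code runs.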

## Discrete Random Numbers: Flips and Trials

* A 1/5 chance of picking true:

In [3]:
val choices = true +: Array.fill(4)(false)

choices: Array[Boolean] = Array(true, false, false, false, false)

* A random index:

In [5]:
import scala.util.Random

Random.nextInt(choices.length)

import scala.util.Random

res4_1: Int = 1

* A random choice, true with probability 20%:

In [6]:
choices(Random.nextInt(choices.length))

res5: Boolean = false

* Let's make these methods:

In [8]:
object Choose {
    def pick[A](choices: Seq[A]) = choices(Random.nextInt(choices.length))
    def flip(odds: Int) = pick(true +: Array.fill(odds - 1)(false))
}


defined [32mobject[39m [36mChoose[39m

In [9]:
Choose.flip(2) // probability 1 in 2, i.e. 50%

res8: Boolean = false

* 10 flips:

In [10]:
Array.fill(10)(Choose.flip(2))

res9: Array[Boolean] = Array(
  true,
  true,
  false,
  true,
  false,
  false,
  true,
  false,
  true,
  true
)

#### Illustrative uses

In [13]:
Array.fill(10)(Choose.flip(2)).count( _ == true ) 

res12: Int = 6

In [14]:
Array.fill(1000)(Choose.flip(2)).count( _ == true) / 1000.0

res13: Double = 0.518

In [17]:
Array.fill(1000)(Choose.flip(3)).count( _ == true) / 1000.0

res16: Double = 0.337
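
How close to the true probability should these estimates be? For n independent flips that each come up true with probability p, the observed proportion has a standard error of sqrt(p(1 - p) / n). The helper below is a sketch of that formula; the names `FlipError` and `standardError` are illustrative, not from the notebook.

```scala
object FlipError {
  // Standard error of an observed proportion after n independent flips
  // that each come up true with probability p.
  def standardError(p: Double, n: Int): Double = math.sqrt(p * (1 - p) / n)

  val se10: Double = standardError(0.5, 10)     // about 0.158
  val se1000: Double = standardError(0.5, 1000) // about 0.0158
}
```

With 1000 flips at 50%, an estimate of 0.518 is only a little over one standard error from 0.5, so it is well within what chance alone produces.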

In [18]:
Choose.pick(18 to 30)

res17: Int = 25

In [19]:
Choose.pick(List("London", "Leeds"))

res18: String = "London"

In [20]:
Choose.pick(List("Sherlock", "Watson"))

res19: String = "Sherlock"

## Continuous Random Numbers: The Normal Distribution

In [24]:
val knownMean = 10

knownMean: Int = 10

In [25]:
val knownSD = 5

knownSD: Int = 5

* Five random numbers with standard deviation knownSD and mean knownMean

In [22]:
new java.util.Random().nextGaussian()

res21: Double = 0.977131313671385

In [23]:
val r = new java.util.Random()

[36mr[39m: [32mRandom[39m = java.util.Random@5c0a7148

In [26]:
Vector.fill(5)(knownSD * r.nextGaussian() + knownMean)

res25: Vector[Double] = Vector(
  10.491342385193198,
  15.85099201249532,
  17.135115394351068,
  13.073481748942566,
  15.122401291768236
)

* Let's calculate the mean, variance and standard deviation of 1000 such numbers:

In [27]:
object Normal {
    val _r = new java.util.Random()
    // 1000 draws from a normal distribution with the known mean and SD
    val normals = Vector.fill(1000)(knownSD * _r.nextGaussian() + knownMean)

    val total_ = normals.reduce( _ + _ )
    val mean_ = total_ / normals.length
    // population variance: the mean squared deviation from the sample mean
    val var_ = normals.map( _ - mean_ ).map(math.pow(_, 2)).reduce(_ + _) / normals.length
    val sd_ = math.sqrt(var_)
}

defined [32mobject[39m [36mNormal[39m

* The sample of 1000 has approximately the known mean and known std. dev.

In [28]:
Normal.sd_

res27: Double = 5.204307819337279

In [29]:
Normal.mean_

res28: Double = 9.840070049726146
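
The same check can be written as reusable helpers, which makes it easier to apply to other samples. This is a sketch under assumptions: the names `Stats`, `mean`, `variance` and `sd`, the seed, and the tolerances are all illustrative. Like the calculation above, `variance` divides by n; dividing by n - 1 would give the unbiased sample variance, a negligible difference at 1000 values.

```scala
object Stats {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Population variance (divides by n), matching the calculation above.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, 2)).sum / xs.length
  }

  def sd(xs: Seq[Double]): Double = math.sqrt(variance(xs))

  // Seeded so the check is repeatable; the seed value is arbitrary.
  private val rng = new java.util.Random(7L)
  val sample: Vector[Double] = Vector.fill(1000)(5 * rng.nextGaussian() + 10)

  // Loose tolerances: the sample should be near mean 10 and SD 5.
  val meanOk: Boolean = math.abs(mean(sample) - 10) < 1.0
  val sdOk: Boolean = math.abs(sd(sample) - 5) < 1.0
}
```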

* Let's package this up into something useful:

In [35]:
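object Choose {
    val _r = new java.util.Random()

    // pick one element uniformly at random
    def pick[A](choices: Seq[A]) = choices(_r.nextInt(choices.length))
    // true with probability 1 in `odds`
    def flip(odds: Int) = pick(true +: Array.fill(odds - 1)(false))

    // a normally distributed value with mean `point` and standard deviation `within`;
    // picking one of ten such draws has the same distribution as a single draw
    def around(point: Double, within: Double) = pick(Vector.fill(10)(within * _r.nextGaussian() + point) )
}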




defined [32mobject[39m [36mChoose[39m

In [36]:
Choose.around(10, 2)

res35: Double = 9.964221743283609
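
Note that `within` behaves as a standard deviation, not a hard limit. The sketch below re-implements the same formula with a seeded generator (the object name `AroundCheck` and the seed are assumptions for illustration) and measures how many draws actually land between `point - within` and `point + within`.

```scala
object AroundCheck {
  // Seeded so the result is repeatable; the seed value is arbitrary.
  private val rng = new java.util.Random(1L)

  // Same formula as Choose.around: scale a standard normal draw and shift it.
  def around(point: Double, within: Double): Double =
    within * rng.nextGaussian() + point

  val draws: Vector[Double] = Vector.fill(100000)(around(10, 2))

  // Fraction of draws landing between point - within and point + within.
  val withinOne: Double =
    draws.count(x => math.abs(x - 10) <= 2) / draws.length.toDouble
}
```

For a normal distribution roughly 68% of values fall within one standard deviation of the mean, so about a third of the values returned by `around(10, 2)` should be expected to lie outside the range 8 to 12.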

## A Data Generator

* Suppose we want to generate many objects:

In [37]:
case class Product(
    level: String,
    price: Double,
    weight: Double,
    isElectrical: Boolean,
    isDomestic: Boolean,
    isProfessional: Boolean
)

defined [32mclass[39m [36mProduct[39m

* Some fields are discrete (a small set of options); others are continuous (the number can fall anywhere in a range).
* Let's define a Generator to generate a Product:

In [38]:
object Generator {
    val options = Array("Gold", "Silver", "Bronze")
    val prices = 0 to 10
    
    def genProduct() = {
        val isExpensive = Choose.flip(10)
        
        Product(
            level  = Choose.pick(options),
            price  = Choose.pick(prices) * (if(isExpensive) 10 else 1),
            weight = Choose.around(10, 2),
            isElectrical = Choose.flip(2),
            isDomestic = Choose.flip(3),
            isProfessional = (isExpensive || Choose.flip(5)),
        )
    }
    
    
    def genMany(amount: Int) = Array.fill(amount) { genProduct() }
}

defined [32mobject[39m [36mGenerator[39m

In [39]:
Generator.genMany(2)

res38: Array[Product] = Array(
  Product("Gold", 6.0, 10.794630306972964, false, true, false),
  Product("Gold", 10.0, 9.380459485937807, true, false, false)
)

* Let's test the generator by calculating various statistical measures and checking that they make sense:

In [40]:
object Product {
    val weights = Generator.genMany(100).map( _.weight )
    val meanWeight_ = weights.reduce( _ + _ ) / weights.length
    
    val isPro = Generator.genMany(100).map( _.isProfessional )
    val probPro_ = isPro.count( _ == true) / isPro.length.toDouble
}

defined [32mobject[39m [36mProduct[39m

In [77]:
Product.meanWeight_

res76: Double = 10.31417706806951

In [41]:
Product.probPro_

res40: Double = 0.33

* Notice that the probability of being professional is c. 0.3, which is roughly 0.2 + 0.1
    * Why?
    * HINT: when do probabilities add? When do they multiply?
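
One way to explore the hint empirically is to simulate the two flips behind `isProfessional` many times and count how often either, or both, come up true. This is a sketch under assumptions: `OrCheck` and `frac` are illustrative names, the seed is arbitrary, and `flip` here uses `nextInt(odds) == 0` as a shortcut that is true with the same 1-in-`odds` probability as `Choose.flip`.

```scala
import scala.util.Random

object OrCheck {
  // Seeded so the result is repeatable; the seed value is arbitrary.
  private val rng = new Random(3)

  // True with probability 1 in `odds`.
  private def flip(odds: Int): Boolean = rng.nextInt(odds) == 0

  // Pairs of independent flips, mirroring isExpensive (1 in 10) and Choose.flip(5).
  val trials: Vector[(Boolean, Boolean)] =
    Vector.fill(100000)((flip(10), flip(5)))

  // Fraction of trials satisfying a condition.
  private def frac(p: ((Boolean, Boolean)) => Boolean): Double =
    trials.count(p) / trials.length.toDouble

  val either: Double = frac { case (a, b) => a || b } // at least one is true
  val both: Double = frac { case (a, b) => a && b }   // both are true
}
```

Compare `OrCheck.either` with 0.1 + 0.2, and look at what role `OrCheck.both` plays in the difference.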