# Functional Programming for Data Analysis

### Jim Pivarski

Fourth notebook: Scala

Scala is a functional programming language, like Haskell but not as strict, and it runs on the Java Virtual Machine (JVM).

This last point makes it harder to integrate into physics applications, but easier to integrate into business analytics, since most computing infrastructure in industry is based on Java, rather than C++.

Scala is also Spark's native tongue: Spark was written in Scala and provides Java, Python, and R interfaces as a convenience.

Python code in Spark tends to be less efficient, because data must be serialized between the JVM and separate Python worker processes. If you're going to be doing any architectural work in Spark, you should use Scala.

Scala is also an example of type-safe functional programming, which gives better error checking and safety when building large applications.

It also has pattern-matching, which my "functional playground" in Python lacks.

In [29]:
// Crash course in Scala syntax

val xs = List(1, 2, 3, 4, 5)    // statically typed, but inferred (like C++'s "auto")

val ys = 1 :: 2 :: 3 :: Nil     // some syntax is meant to appeal to Haskell fans

def squared(x: Int) = x * x     // when required, types are capitalized and after colons

xs.map(squared)

xs.map(x => x + 1)              // short lambda syntax

xs.map(_ * 2)                   // even shorter for special cases (one argument, no type)

xs map {x => x + 1}             // dots and parentheses aren't always needed (again, like Haskell)

xs: List[Int] = List(1, 2, 3, 4, 5)
ys: List[Int] = List(1, 2, 3)
defined function squared
res28_3: List[Int] = List(1, 4, 9, 16, 25)
res28_4: List[Int] = List(2, 3, 4, 5, 6)
res28_5: List[Int] = List(2, 4, 6, 8, 10)
res28_6: List[Int] = List(2, 3, 4, 5, 6)

## Getting some data

These are the same CMS public data events as last time, now viewed as Scala objects.

(Also note that we're installing software and loading it interactively; one of Java's features is zero-hassle installation.)

In [22]:
import $ivy.`org.diana-hep:histogrammar_2.11:1.0.4`

import $ivy.$

In [23]:
import org.dianahep.histogrammar.tutorial.cmsdata
val events = cmsdata.EventIterator().take(1000).toList

import org.dianahep.histogrammar.tutorial.cmsdata

events: List[cmsdata.Event] = List(
  Event(
    List(),
    List(
      Muon(
        4.8594961166381845,
        -30.2398738861084,
        137.7764892578125,
        141.13978576660156,
        -1,
        0.0
      )
...

Not coincidentally, the Scala functionals have mostly the same names as mine.

In [24]:
// sequential calculation
events.flatMap(_.muons).map(_.pt).take(10)

res23: List[Double] = List(
  30.62784150336687,
  31.641719444653546,
  25.997956733809474,
  40.514054143148,
  36.94010454322916,
  24.424106610105476,
  39.758086508574195,
  36.73538886104245,
  69.39617109916814,
  46.84325839229087
)

In [25]:
// parallel calculation (just add ".par"!)
events.par.flatMap(_.muons).map(_.pt).take(10)

res24: collection.parallel.immutable.ParSeq[Double] = ParVector(30.62784150336687, 31.641719444653546, 25.997956733809474, 40.514054143148, 36.94010454322916, 24.424106610105476, 39.758086508574195, 36.73538886104245, 69.39617109916814, 46.84325839229087)

## Pattern matching

The most idiomatic Scala code compares values against patterns, using `match` expressions.
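
As a minimal sketch of what pattern matching looks like, the cell below matches on case classes (with guards) and on list structure. The `Particle`, `Lepton`, and `Photon` types and the `describe` and `total` functions are hypothetical stand-ins made up for illustration; they are not the `cmsdata` classes used elsewhere in this notebook.

```scala
// Hypothetical types for illustration only (not part of cmsdata).
sealed trait Particle
case class Lepton(name: String, charge: Int) extends Particle
case class Photon(energy: Double) extends Particle

// Case classes can be taken apart in a pattern; "if" adds a guard condition.
def describe(p: Particle): String = p match {
  case Lepton(name, q) if q < 0 => s"negative $name"
  case Lepton(name, q) if q > 0 => s"positive $name"
  case Lepton(name, _)          => s"neutral $name"
  case Photon(e)                => s"photon with energy $e"
}

// Lists match against Nil (empty) or head :: tail (first element and the rest).
def total(values: List[Int]): Int = values match {
  case Nil          => 0
  case head :: tail => head + total(tail)
}

val descriptions = List(Lepton("muon", -1), Lepton("neutrino", 0)).map(describe)
val sum = total(List(1, 2, 3, 4, 5))
```

Because `Particle` is declared `sealed`, the compiler knows every subclass and can warn when a `match` leaves a case uncovered.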

## Immutable data

Scala collections are immutable by default, though there are versions that can be modified in-place to optimize special cases.
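
A small sketch of the difference, using the standard library's immutable `List` and mutable `ListBuffer` (the variable names are only for illustration): operations on the immutable list return new lists and leave the original alone, while the buffer is changed in place.

```scala
import scala.collection.mutable.ListBuffer

// Immutable: every "modification" returns a new List.
val original = List(1, 2, 3)
val extended = original :+ 4              // append; original is unchanged
val replaced = original.updated(0, 100)   // replace index 0; original is unchanged

// Mutable: the same object is modified in place.
val buffer = ListBuffer(1, 2, 3)
buffer += 4                               // append in place
buffer(0) = 100                           // overwrite index 0 in place
```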







## Structural sharing

<table>
<tr style="background-color: white;"><td><span style="font-family: Lato, sans-serif; font-size: 35.84px">When <i>all</i> values are immutable, we can dramatically reduce the memory required for tree-like data structures by refusing to copy the ones that don't change in a transformation.</span></td><td style="width: 600px;"><img src="http://2.bp.blogspot.com/_r-NJO1NMiu4/TRA69XdCU8I/AAAAAAAAAnM/Re0VElAeLc4/s1600/ds_2_new.gif" style="margin-left: auto; margin-right: auto; width: 100%"></td></tr>
</table>

Fully immutable data with structural sharing hits a different performance sweet spot than traditional transform-in-place, and it adds safety and parallelizability.
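
The built-in `List` already works this way, which gives a quick way to see sharing before building a tree by hand. In this sketch (the names `shared`, `a`, and `b` are only for illustration), two lists are built by prepending to the same tail, and `eq` (reference equality) checks that neither one copied it.

```scala
val shared = List(2, 3, 4)

// Prepending allocates one new cell that points at the existing list.
val a = 1 :: shared
val b = 0 :: shared

// "eq" compares object identity, not contents.
val aReusesTail = a.tail eq shared
val bothReuseTail = a.tail eq b.tail
```

Sharing like this is only safe because nothing can modify `shared` after the fact; with a mutable list, a change made through `a` would silently show up in `b`.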

In [None]:
var identifier = 'A'

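// Flags nodes whose id comes after 'G': with the counter reset to 'A',
// those are the nodes created after an original seven-node tree was built.
def message(id: Char) =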
    if (id.toByte > 'G'.toByte)
        "    <-- new node"
    else
        ""

object TreeList {
    def apply[T](values: T*): TreeList[T] = {
        val (value, children) = values.toList match {
            case Nil => throw new Exception("cannot be empty")
            case one :: Nil => (one, List())
            case first :: rest =>
                val (left, right) = rest.splitAt(rest.size / 2)
                (first, List(left, right).flatMap({
                    case Nil => List()
                    case x => List(TreeList(x: _*))
                }))
        }

        new TreeList(value, children)
    }
}
class TreeList[T](val value: T, val children: List[TreeList[T]]) {
    val id = identifier
    identifier = (identifier.toByte + 1).toChar

    def toString(indent: String): String = {
        val prefix = "\n%s%s: value %s%s".format(indent, id, value, message(id))
        val subtrees = children.map(_.toString(indent + "    "))
        (prefix :: subtrees).mkString
    }
    override def toString() = toString("")

    def size: Int = 1 + children.map(_.size).sum
    
    def toList: List[T] = value +: children.flatMap(_.toList)
    
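    // Elements are indexed in the same order as toList: this node's value,
    // then the first subtree, then the last subtree.
    def get(index: Int): T = index match {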
        case 0 => value
        case i if i - 1 < children.head.size => children.head.get(i - 1)
        case i => children.last.get(i - 1 - children.head.size)
    }
    
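    // Returns a new tree with newval placed immediately after position `index`.
    // Only nodes on the path to the insertion point are rebuilt;
    // untouched subtrees are reused (shared) from the original tree.
    def inserted(index: Int, newval: T): TreeList[T] = index match {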
        case 0 =>
            new TreeList(value, List(new TreeList(newval, children)))
        case i if i - 1 < children.head.size =>
            new TreeList(value, children.head.inserted(i - 1, newval) :: children.tail)
        case i =>
            new TreeList(value, List(children.head, children.last.inserted(i - 1 - children.head.size, newval)))
    }
}

In [None]:
identifier = 'A'
val xs = TreeList(1, 2, 3, 4, 5, 6, 7)
xs.toList

In [None]:
val ys = xs.inserted(5, 999)
ys.toList