In [3]:
case class Usage(uid: Int, uname: String, usage: Int)

val r = new scala.util.Random(42)
// Generate 1001 records (uid 0 to 1000, inclusive)
val data = for(i <- 0 to 1000)
    yield (Usage(i, "user-"+r.alphanumeric.take(5).mkString(""), r.nextInt(1000)))

defined class Usage
r: scala.util.Random = scala.util.Random@45ff33cc
data: scala.collection.immutable.IndexedSeq[Usage] = Vector(Usage(0,user-Gpi2C,525), Usage(1,user-DgXDi,502), Usage(2,user-M66yO,170), Usage(3,user-xTOn6,913), Usage(4,user-3xGSz,246), Usage(5,user-2aWRN,727), Usage(6,user-EzZY1,65), Usage(7,user-ZlZMZ,935), Usage(8,user-VjxeG,756), Usage(9,user-iqf1P,3), Usage(10,user-91S1q,794), Usage(11,user-qHNj0,501), Usage(12,user-7hb94,460), Usage(13,user-bz0WF,142), Usage(14,user-71nwy,479), Usage(15,user-7GZz1,823), Usage(16,user-1CSk6,140), Usage(17,user-WPzlL,246), Usage(18,user-VaEit,451), Usage(19,user-PSaRq,679), Usage(20,user-0Kkzu,332), Usage(21,user-UN3MG,172), Usage(22,user-KwwER,442), Usage(23,user-ZnltJ,923), Usage(24,user-IRA17,741), Usage(25,user-yNHRT,299), Us...


In [6]:
val dsUsage = data.toDS()
val dsUsage2 = spark.createDataset(data)
dsUsage.show(10)
dsUsage2.show(10)

dsUsage: org.apache.spark.sql.Dataset[Usage] = [uid: int, uname: string ... 1 more field]
dsUsage2: org.apache.spark.sql.Dataset[Usage] = [uid: int, uname: string ... 1 more field]


In [20]:
// Works in the spark-shell
dsUsage.filter(d => d.usage > 900)
    .orderBy(desc("usage"))
    .show(5, false)
// Defining a named function to pass to filter
def filterWithUsage(u: Usage) = u.usage > 900
val test = dsUsage.filter(filterWithUsage(_)).orderBy(desc("usage"))
/*
dsUsage.filter($"usage" > 900)
    .orderBy(desc("usage"))
    .show(5, false)
*/

+---+----------+-----+
|uid|uname     |usage|
+---+----------+-----+
|605|user-NL6c4|999  |
|561|user-5n2xY|999  |
|113|user-nnAXr|999  |
|634|user-L0wci|999  |
|345|user-QKrVb|996  |
+---+----------+-----+
only showing top 5 rows



filterWithUsage: (u: Usage)Boolean
test: org.apache.spark.sql.Dataset[Usage] = [uid: int, uname: string ... 1 more field]


In [27]:
// Fails in the notebook / works in the spark-shell
dsUsage.map(u => {if (u.usage > 750) u.usage * .15 else u.usage * .50}).show(5,false)

def computeCostUsage(usage: Int): Double = {
    if(usage > 750) usage * 0.15 else usage * 0.50
}
dsUsage.map(u => {computeCostUsage(u.usage)}).show(5,false)

In [30]:
case class UsageCost(uid: Int, uname: String, usage: Int, cost: Double)

defined class UsageCost


In [None]:
def computeUserCostUsage(u: Usage): UsageCost = {
    val v = if(u.usage > 750) u.usage * 0.15 else u.usage * 0.50
    UsageCost(u.uid, u.uname, u.usage, v)
}

dsUsage.map(u => {computeUserCostUsage(u)}).show(5)

In [None]:
// Convert a DataFrame to a Dataset of SomeCaseClass
// dataframe.as[CaseClass]
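
The conversion noted above can be sketched as follows. This is a minimal example, not the notebook's own cell: the sample rows and the local `SparkSession` settings are assumptions, and `Usage` is redefined here only so the snippet stands on its own.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Reuses the notebook's session if one exists; the local master is an assumption.
val spark = SparkSession.builder().master("local[*]").appName("as-example").getOrCreate()
import spark.implicits._

case class Usage(uid: Int, uname: String, usage: Int)

// An untyped DataFrame whose column names and types match the case class fields.
// The two rows are made-up sample data.
val df: DataFrame = Seq((1, "user-a", 120), (2, "user-b", 910)).toDF("uid", "uname", "usage")

// as[Usage] gives a typed view of the same data; columns are matched to fields by name.
val ds: Dataset[Usage] = df.as[Usage]

// Typed access to the fields is now available.
ds.filter(_.usage > 900).show()
```

The column names must line up with the case class fields; if they do not, `as[...]` is expected to fail at analysis time rather than silently reorder columns.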

To keep costs down, avoid excessive use of lambdas. Prefer DSL expressions, which avoid the serialization and deserialization that lambdas require.

// Example with a lambda (to avoid)\
`personDS.filter(x => x.firstName == "Nell").distinct().count`\
// Example with a DSL query\
`personDS.filter($"firstName" === "Nell").distinct().count`
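
Applied to the cost computation from the earlier cells, the same idea might look like the sketch below. It replaces the `map` lambda with a `when`/`otherwise` column expression; the sample rows and the local session settings are assumptions, and the case classes are repeated so the snippet is self-contained.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.when

// Reuses the notebook's session if one exists; the local master is an assumption.
val spark = SparkSession.builder().master("local[*]").appName("dsl-cost").getOrCreate()
import spark.implicits._

case class Usage(uid: Int, uname: String, usage: Int)
case class UsageCost(uid: Int, uname: String, usage: Int, cost: Double)

// Made-up sample data standing in for dsUsage.
val dsUsage: Dataset[Usage] = Seq(Usage(1, "user-a", 800), Usage(2, "user-b", 100)).toDS()

// Same rule as computeUserCostUsage, expressed as a column expression:
// usage above 750 is charged at 0.15 per unit, anything else at 0.50.
val dsCost: Dataset[UsageCost] = dsUsage
  .withColumn("cost", when($"usage" > 750, $"usage" * 0.15).otherwise($"usage" * 0.50))
  .as[UsageCost]

dsCost.show(5, false)
```

Because the rule is a column expression, Catalyst can see the condition itself, and the rows only need to become `UsageCost` objects if they are later collected or passed to a lambda.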

- Advantages of using lambdas:
    - Good for semi-structured data
    - Very powerful
- Disadvantages:
    - Catalyst can't interpret lambdas until runtime. 
    - Lambdas are opaque to Catalyst. Since it doesn't know what a lambda is doing, it can't move it elsewhere in the processing.
    - Jumping between lambdas and the DataFrame query API can hurt performance.
    - Working with lambdas means that we need to `deserialize` from Tungsten's format to an object and then reserialize back to Tungsten format when the lambda is done.
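
One way to observe the opacity described in the list above is to print the query plans of a lambda filter and of the equivalent DSL filter. The sketch below assumes a local session and made-up sample rows; the exact plan text varies between Spark versions, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; the local master is an assumption.
val spark = SparkSession.builder().master("local[*]").appName("plan-compare").getOrCreate()
import spark.implicits._

case class Usage(uid: Int, uname: String, usage: Int)

// Made-up sample data.
val dsUsage = Seq(Usage(1, "user-a", 950), Usage(2, "user-b", 100)).toDS()

// Typed filter: Catalyst only receives a compiled Scala function.
val viaLambda = dsUsage.filter(u => u.usage > 900)

// Column expression: Catalyst receives the predicate itself (usage > 900).
val viaDsl = dsUsage.filter($"usage" > 900)

// Print both physical plans to compare how each predicate is represented.
viaLambda.explain()
viaDsl.explain()
```

In the lambda plan the predicate typically shows up as a reference to a function object, while the DSL plan shows the comparison on `usage` directly, which is what lets the optimizer reason about it.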
    
If you _have_ to use lambdas, chaining them together can help.

`personDS
  .filter(x => x.birthDate.split("-")(0).toInt > earliestYear) // everyone above 40
  .filter($"salary" > 80000) // everyone earning more than 80K
  .filter(x => x.lastName.startsWith("J")) // last name starts with J
  .filter($"firstName".startsWith("D")) // first name starts with D
  .count()`

The same query written only with DSL expressions is more efficient, because no deserialization and reserialization is needed:\
`personDS
  .filter(year($"birthDate") > earliestYear) // everyone above 40
  .filter($"salary" > 80000) // everyone earning more than 80K
  .filter($"lastName".startsWith("J")) // last name starts with J
  .filter($"firstName".startsWith("D")) // first name starts with D
  .count()`
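
To try both variants side by side, a runnable sketch might look like this. The `Person` case class, its field types, the sample rows, and the value of `earliestYear` are all assumptions made for illustration; only the field names come from the snippets above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.year

// Reuses the notebook's session if one exists; the local master is an assumption.
val spark = SparkSession.builder().master("local[*]").appName("person-filters").getOrCreate()
import spark.implicits._

// Hypothetical schema: birthDate is assumed to be a "yyyy-MM-dd" string.
case class Person(firstName: String, lastName: String, birthDate: String, salary: Int)

val personDS = Seq(
  Person("Dana", "Jones", "1990-05-17", 95000),
  Person("Dirk", "Jensen", "1960-01-02", 120000),
  Person("Nell", "Smith", "1985-07-30", 70000)
).toDS()

val earliestYear = 1980 // assumed cutoff year

// Lambdas interleaved with DSL calls.
val mixedCount = personDS
  .filter(x => x.birthDate.split("-")(0).toInt > earliestYear)
  .filter($"salary" > 80000)
  .filter(x => x.lastName.startsWith("J"))
  .filter($"firstName".startsWith("D"))
  .count()

// DSL expressions only; year() casts the string column to a date implicitly.
val dslCount = personDS
  .filter(year($"birthDate") > earliestYear)
  .filter($"salary" > 80000)
  .filter($"lastName".startsWith("J"))
  .filter($"firstName".startsWith("D"))
  .count()

println(s"mixed: $mixedCount, DSL only: $dslCount")
```

Both chains should select the same rows; the difference lies in how much of the predicate logic Catalyst can inspect and optimize.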
