# Data Visualization with Vega

A picture is worth a thousand words. In machine learning, we usually work with high-dimensional data, which cannot be drawn on a display directly. Still, a variety of statistical plots are tremendously valuable for grasping the characteristics of many data points. Smile provides data visualization tools such as plots and maps to help researchers understand information more easily and quickly.

Vega is a visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG.

Vega-Lite is a high-level grammar of interactive graphics. It provides a concise JSON syntax for rapidly generating visualizations to support analysis. Vega-Lite specifications can be compiled to Vega specifications.
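
To give a feel for how concise the syntax is, here is a minimal hand-written Vega-Lite specification for a scatter plot, held in a plain Scala string. The data values and the field names `x` and `y` are made up for illustration, and the snippet does not depend on Smile.

```scala
// A hand-written Vega-Lite specification, held in a plain Scala string.
// The data values and field names ("x", "y") are made up for illustration.
val spec: String =
  """{
    |  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    |  "data": {"values": [{"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": 3, "y": 8}]},
    |  "mark": "point",
    |  "encoding": {
    |    "x": {"field": "x", "type": "quantitative"},
    |    "y": {"field": "y", "type": "quantitative"}
    |  }
    |}""".stripMargin

println(spec)
```

The `mark` entry chooses the geometric shape, and each `encoding` channel maps a data field to a visual property.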

Smile provides a set of interactive statistical plots built with the latest Vega-Lite.

First, let's import Smile and add an `implicit` function to display a Vega-Lite specification (a JSON object) in almond.

In [None]:
import $ivy.`com.github.haifengl::smile-scala:2.2.2`
import $ivy.`org.slf4j:slf4j-simple:1.7.30`  

import java.lang.Math._
import smile.json._
import smile.plot.vega._
import smile.plot.show
import smile._

implicit def render(spec: JsObject): Unit = {
  publish.html(iframe(spec))
}

Now let's plot a heart. Math is beautiful, isn't it?

In [2]:
val heart = -314 to 314 map { i =>
    val t = i / 100.0
    val x = 16 * pow(sin(t), 3)
    val y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    Array(x, y)
}

show(plot(heart.toArray))

heart: IndexedSeq[Array[Double]] = Vector(
  Array(-6.463732966847141E-8, -16.999960683595546),
  Array(-2.492524155144247E-5, -16.997917101644497),
  Array(-1.6104112289149345E-4, -16.992774931671576),
  Array(-5.042681728047771E-4, -16.984537273625897),
  Array(-0.001150255171237311, -16.973209090042378),
  Array(-0.002194348113494316, -16.958797198160877),
  Array(-0.0037314948126257923, -16.941310259102387),
  Array(-0.005856149864141234, -16.920758764121906),
  Array(-0.008662180059536085, -16.897155017962813),
  Array(-0.012242770334712043, -16.87051311934288),
  Array(-0.01669033033888591, -16.84084893860716
...

Note that the function `plot` returns a `JsObject` that encapsulates the plot specification. The function `show` does the rendering (with the help of the implicit function `render` that we defined earlier).
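
The division of labor between building a specification and rendering it can be sketched in plain Scala. The snippet below only illustrates the implicit-renderer pattern: `Spec`, `showSpec`, and `recordingRenderer` are hypothetical stand-ins, not Smile's actual types or signatures.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a plot specification (Smile uses JsObject).
final case class Spec(json: String)

// Assumed shape of a `show`-like function: it hands the spec to
// whatever renderer is available in implicit scope.
def showSpec(spec: Spec)(implicit renderer: Spec => Unit): Unit = renderer(spec)

// Collects what a notebook front end would otherwise display.
val rendered = ArrayBuffer.empty[String]

// A renderer that records the HTML instead of publishing it.
implicit val recordingRenderer: Spec => Unit = spec => {
  rendered += s"<iframe>${spec.json}</iframe>"
  ()
}

showSpec(Spec("""{"mark": "point"}"""))
println(rendered.mkString("\n"))
```

Keeping the renderer implicit means the same plotting code can run in different environments; only the renderer in scope changes.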

## Scatter Plot

A scatter plot displays data as a collection of points. The points can be color-coded, which is very useful for classification tasks. You can use the `plot` function to draw a scatter plot:
```
  def plot(data: Array[Array[Double]],
           fields: (String, String) = ("x", "y"),
           color: Option[(String, Either[Array[Int], Array[String]])] = None,
           shape: Option[(String, Either[Array[Int], Array[String]])] = None,
           sizeOrText: Option[(String, Either[Array[Double], Array[String]])] = None,
           properties: JsObject = JsObject()): JsObject
```
The optional arguments `color`, `shape`, and `sizeOrText` each take a tuple of two elements. The first element is the field name, which is used in the plot legend. The second element is either an array of integers (e.g. class labels) or an array of strings (e.g. "red", "blue").
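
As a small sketch of how such an argument is built, the snippet below constructs both variants using only standard Scala types. The alias `Channel` and the sample arrays are hypothetical names introduced for illustration; the type itself is copied from the signature above.

```scala
// The type that `color` and `shape` expect, copied from the signature above.
// `Channel` is a hypothetical alias used only in this sketch.
type Channel = Option[(String, Either[Array[Int], Array[String]])]

// Integer values (e.g. class labels) are wrapped in a Left ...
val labels = Array(0, 0, 1, 2)
val byLabel: Channel = Some("class" -> Left(labels))

// ... while string values are wrapped in a Right.
val names = Array("red", "red", "blue", "green")
val byName: Channel = Some("class" -> Right(names))

println(byLabel.map(_._1))
```

Either value can then be passed as `color = byLabel` or `shape = byName`; each array should have one entry per data point.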

In [3]:
val iris = read.arff("data/weka/iris.arff")
val attributes = iris.names
val x = iris.select(0, 1).toArray
val y = iris("class").toIntArray
show(plot(x, fields = (attributes(0), attributes(1)), color = Some("class" -> Left(y))))

[scala-interpreter-1] INFO smile.io.Arff - Read ARFF relation iris


iris: data.DataFrame = [sepallength: float, sepalwidth: float, petallength: float, petalwidth: float, class: byte nominal[Iris-setosa, Iris-versicolor, Iris-virginica]]
+-----------+----------+-----------+----------+-----------+
|sepallength|sepalwidth|petallength|petalwidth|      class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|         3|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|          5|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|          5|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
+-----------+--------

In this example, we plot the first two columns of the Iris data. We use the class label for the legend and for color coding. However, the Iris data has four attributes. A common practice is to plot all the attribute pairs.

In [4]:
show(spm(iris, clazz=Some("class")))
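
For four attributes, "all the attribute pairs" means six distinct unordered pairs. The plain-Scala sketch below enumerates them; it is independent of Smile, and the attribute names are taken from the Iris schema printed earlier.

```scala
// Attribute names from the Iris schema, excluding the class label.
val features = Seq("sepallength", "sepalwidth", "petallength", "petalwidth")

// Every unordered pair of distinct attributes.
val pairs = features.combinations(2).map(p => (p(0), p(1))).toList

pairs.foreach { case (a, b) => println(s"$a vs $b") }
```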

## Box Plot

The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

Box plots can be useful to display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers.
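
The five-number summary itself is easy to compute. The sketch below uses linear interpolation between order statistics for the quartiles, which is only one of several common conventions and is not necessarily the one Vega-Lite applies. The helper names `quantile` and `fiveNumberSummary` are hypothetical.

```scala
// Quantile by linear interpolation between order statistics.
// This is one common convention; other tools may use a different one.
def quantile(sorted: Array[Double], p: Double): Double = {
  val pos = p * (sorted.length - 1)
  val lo = pos.toInt
  val hi = if (lo + 1 < sorted.length) lo + 1 else lo
  sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
}

// Returns (minimum, first quartile, median, third quartile, maximum).
def fiveNumberSummary(data: Array[Double]): (Double, Double, Double, Double, Double) = {
  require(data.nonEmpty, "data must not be empty")
  val s = data.sorted
  (s.head, quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s.last)
}

// The first ten sepallength values shown in the table above.
val sepalLength = Array(5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9)
val (minimum, q1, median, q3, maximum) = fiveNumberSummary(sepalLength)
println(s"min=$minimum q1=$q1 median=$median q3=$q3 max=$maximum")
```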

```
def boxplot(data: (String, Array[Double]), group: (String, Array[String]), properties: JsObject = JsObject()): JsObject
```

In [5]:
show(boxplot("sepallength" -> iris("sepallength").toDoubleArray, "class" -> iris("class").toStringArray))

## Histogram

A histogram is a graphical representation of the distribution of numerical data. The range of values is divided into a series of consecutive, non-overlapping intervals (bins). The bins must be adjacent, and are usually of equal size.
```
    def hist(x: (String, Array[Double]), k: Int, properties: JsObject = JsObject()): JsObject
``` 
where `k` is the number of bins (10 by default). Alternatively, you can specify an array of the breakpoints between bins.
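
To make the binning concrete, here is a simplified sketch that counts values in `k` equal-width bins spanning the range of the data. It illustrates the idea only; Smile and Vega-Lite may choose bin boundaries differently. The function name `binCounts` is hypothetical.

```scala
// Count how many values fall into each of k equal-width bins spanning [min, max].
// A simplified sketch; Smile's own binning may differ in its details.
def binCounts(data: Array[Double], k: Int): Array[Int] = {
  require(data.nonEmpty && k > 0, "need data and at least one bin")
  val lo = data.min
  val width = (data.max - lo) / k
  val counts = new Array[Int](k)
  for (x <- data) {
    // The maximum value would land in bin k, so clamp it into the last bin.
    val i = if (width == 0.0) 0 else scala.math.min(((x - lo) / width).toInt, k - 1)
    counts(i) += 1
  }
  counts
}

val counts = binCounts(Array(0.0, 1.0, 2.0, 3.0, 4.0), 2)
println(counts.mkString(", "))  // prints "2, 3"
```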

Let's apply the histogram to an interesting data set: the wisdom of crowds. The original experiment took place about a hundred years ago at a county fair in England, which held a contest to guess the weight of an ox. Francis Galton calculated the average of all the guesses, which was within one pound of the true weight.

Recently, NPR's Planet Money ran the experiment again. NPR posted a couple of pictures of a cow (named Penelope) and asked people to guess her weight. They got over 17,000 responses. The average guess was 1,287 pounds, which is within about 5% of Penelope's actual weight of 1,355 pounds.

In [6]:
val cow = read.csv("data/npr/cow.txt", header = false)("V1").toDoubleArray
show(hist("Weight" -> cow.filter(_ <= 3500), 50))

cow: Array[Double] = Array(
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.321,
  2.0,
  3.0,
  3.141592654,
  2.489913124,
  2.642729366,
  2.795545607,
  2.948361848,
  60.0,
  69.0,
  100.0,
  100.0,
  102.0,
  109.0,
  144.0,
  146.0,
  146.0,
  155.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
  165.0,
...