# Data Visualization with Vega

A picture is worth a thousand words. In machine learning, we usually work with high-dimensional data, which cannot be drawn on a display directly. Still, a variety of statistical plots are tremendously valuable for grasping the characteristics of many data points. Smile provides data visualization tools such as plots and maps to help researchers understand information more easily and quickly.

Vega is a visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG.

Vega-Lite is a high-level grammar of interactive graphics. It provides a concise JSON syntax for rapidly generating visualizations to support analysis. Vega-Lite specifications can be compiled to Vega specifications.
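To make the "concise JSON syntax" concrete, here is a minimal sketch of a Vega-Lite scatter plot specification assembled as a plain string. It does not use Smile at all; the sample points and the `v4` schema version are assumptions chosen for illustration.

```scala
// Plain Scala, no Smile dependency: build a minimal Vega-Lite scatter spec as a JSON string.
// The sample points below are made up for illustration.
val points = Seq((1.0, 2.0), (2.0, 3.5), (3.0, 1.5))

// Each data point becomes one JSON object inside "data.values".
val values = points.map { case (x, y) => s"""{"x": $x, "y": $y}""" }.mkString(", ")

// "$$" escapes a literal dollar sign inside an interpolated string.
// The schema version (v4) is an assumption; use the one matching your Vega-Lite release.
val spec = s"""{
  "$$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {"values": [$values]},
  "mark": "point",
  "encoding": {
    "x": {"field": "x", "type": "quantitative"},
    "y": {"field": "y", "type": "quantitative"}
  }
}"""

println(spec)
```

The three top-level keys `data`, `mark`, and `encoding` are the core of most Vega-Lite specifications: what to draw, which graphical mark to use, and how fields map to visual channels.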

Smile provides some interactive statistical plots built on the latest Vega-Lite.

First, let's import Smile and add an `implicit` function to display a Vega-Lite specification (a JSON object) in almond.

In [None]:
import $ivy.`com.github.haifengl::smile-scala:2.1.0`
import $ivy.`org.slf4j:slf4j-simple:1.7.26`  

import java.lang.Math._
import smile.json._
import smile.plot.vega._
import smile.plot.show
import smile._

implicit def display(spec: JsObject): Unit = {
  publish.html(iframe(spec))
}

Now let's plot a heart. Math is beautiful, isn't it?

In [None]:
val heart = -314 to 314 map { i =>
    val t = i / 100.0
    val x = 16 * pow(sin(t), 3)
    val y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    Array(x, y)
}

show(plot(heart.toArray))

Note that the function `plot` returns a `JsObject` that encapsulates the plot specification. The function `show` does the rendering (with the help of the implicit argument `display` that we defined earlier).
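The division of labor between a function that builds a specification and one that renders it through an implicit argument can be sketched in plain Scala. The names `Spec`, `render`, and `record` below are hypothetical stand-ins and are not Smile's actual types or signatures.

```scala
// Hypothetical stand-in for a plot specification; not Smile's JsObject.
final case class Spec(json: String)

// A renderer that receives its display function implicitly,
// mirroring the pattern described above (the signature is an assumption).
def render(spec: Spec)(implicit display: Spec => Unit): Unit = display(spec)

// Collect what was "displayed" so the effect is observable outside a notebook.
val shown = scala.collection.mutable.Buffer.empty[String]

// The implicit display function in scope decides how a spec is shown.
implicit val record: Spec => Unit = spec => shown += spec.json

render(Spec("""{"mark": "point"}"""))
println(shown.mkString("\n"))
```

With this pattern, the plotting code stays the same in every environment; only the implicit display function in scope changes.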

## Scatter Plot

A scatter plot displays data as a collection of points. The points can be color-coded, which is very useful for classification tasks. You can use the `plot` functions to draw a scatter plot easily.
```
def plot(data: Array[Array[Double]], legend: Char = '*', color: Color = BLACK): PlotCanvas

def plot(data: Array[Array[Double]], labels: Array[String]): PlotCanvas

def plot(data: Array[Array[Double]], label: Array[Int], legend: Char, palette: Array[Color]): PlotCanvas

def plot(data: Array[Array[Double]], label: Array[Int], legend: Array[Char], palette: Array[Color]): PlotCanvas
```

In [None]:
val iris = read.arff("data/weka/iris.arff")
/*
val x = iris.select(0, 1).toArray
val y = iris("class").toIntArray
val canvas = plot(x, y, Array('*', '+', 'o'), Array(RED, BLUE, CYAN))
val names = iris.names
canvas.setAxisLabels(names(0), names(1))
show(canvas)
*/

In this example, we plot the first two columns of the Iris data. We use the class label for the legend and color coding. It is also easy to draw a 3D plot.

In [None]:
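val x = iris.select(0, 1, 2).toArray // take first three columns
// y and names are defined here because the previous cell leaves them commented out
val y = iris("class").toIntArray
val names = iris.names
val canvas = plot(x, y, Array('*', '+', 'o'), Array(RED, BLUE, CYAN))
canvas.setAxisLabels(names(0), names(1), names(2))
show(canvas)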

However, the Iris data has four attributes, so even a 3D plot is not sufficient to see the whole picture. A general practice is to plot all the attribute pairs. For example,

In [None]:
show(spm(iris, clazz=Some("class")))
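To see how many panels "all the attribute pairs" implies, the following sketch enumerates the pairs in plain Scala. The attribute names are assumed to match those in the Weka `iris.arff` file, and the code is independent of how `spm` is implemented.

```scala
// Assumed attribute names from the Weka iris.arff file (the class column is excluded).
val attributes = Seq("sepallength", "sepalwidth", "petallength", "petalwidth")

// Every unordered pair of distinct attributes: one scatter plot per pair.
val pairs = attributes.combinations(2).map { case Seq(a, b) => (a, b) }.toList

pairs.foreach { case (a, b) => println(s"$a vs $b") }
println(s"${pairs.size} distinct pairs")
```

Four attributes give six distinct pairs. A scatter plot matrix is usually laid out as a full 4 x 4 grid, so each pair appears twice, once with the axes swapped.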

## Box Plot

The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
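As a minimal sketch, the five-number summary can be computed in plain Scala as follows. Quartiles have several competing definitions; this version assumes linear interpolation between order statistics, which may differ slightly from what a given plotting library uses. The helper names `quantile` and `fiveNumberSummary` are illustrative only.

```scala
// Quantile by linear interpolation between order statistics (one of several conventions).
// Expects a non-empty array that is already sorted in ascending order.
def quantile(sorted: Array[Double], p: Double): Double = {
  val pos = p * (sorted.length - 1)
  val lo = math.floor(pos).toInt
  val hi = math.ceil(pos).toInt
  sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
}

// Returns (minimum, first quartile, median, third quartile, maximum).
def fiveNumberSummary(data: Array[Double]): (Double, Double, Double, Double, Double) = {
  require(data.nonEmpty, "data must not be empty")
  val s = data.sorted
  (s.head, quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s.last)
}

val summary = fiveNumberSummary(Array(7.0, 1.0, 3.0, 9.0, 5.0))
println(summary) // (1.0,3.0,5.0,7.0,9.0)
```

In a box plot, the box spans the first to the third quartile and the line inside the box marks the median.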

Box plots can be useful to display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers.