vignettes/huffman-coding.Rmd

---
title: "Huffman coding"
author: "Thomas Mailund"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Huffman coding}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

[Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding) is a compression approach based on character frequencies. To code a string, we work out the frequency of each letter in the string and then build a tree where we put the letters in the leaves and structure the tree such that the most frequent letters are closest to the root. We then encode each letter by the path we must take in that tree to get to the leaf that contain that letter. We need one bit for each edge we follow; zero, say, if we go left and one if we go right.

In [Haskell: the Craft of Functional Programming](https://amzn.to/2GHf6b6), there is an example of implementing Huffman coding in Haskell. In this vignette, I will adapt that to R with `pmatch` patterns. I have structured the vignette such that I switch between showing the original Haskell code and then the corresponding R code. If you are not familiar with Haskell, I think you should still be able to follow along.

## Data types

The data structures used in the Haskell implementation are these:

```haskell
data Tree = Leaf Char Int | Node Int Tree Tree
data Bit = L | R deriving (Eq, Show)
type HCode = [ Bit ]
type Table = [ (Char, HCode) ]
```

The first two are data types created from constructors, and we can define the corresponding R data structures using a `:=` declaration:

```{r}
library(pmatch)

Tree := Leaf(char : character, count : integer) | 
        Node(count : integer, left : Tree, right : Tree)
Bit := L | R
```

The other two, `HCode` and `Table` are lists over `Bit` and pairs of characters and `HCode`, respectively. We won't define types for these. In R, we don't have static types so we don't really name types. We need a list representation, however, and to make one that matches lists in Haskell, we define a linked list:

```{r}
llist := NIL | CONS(car, cdr : llist)
```

On occasions, we have to create singleton lists, and rather than writing `CONS(val,NIL)` for those, I will define a function for it. I will define a type to handle pairs in pattern matching.

```{r}
single <- function(val) CONS(val, NIL)
pair := ..(x, y)
```

## Computing frequencies

Before we can encode a string we need to calculate the frequency of the characters in it. In the Haskell implementation this is done using two merge sorts, one that sorts the string characters alphabetically and count the number of occurrences of each and then another sort that orders the characters according to their frequency. Later on, we will build a tree from a list of characters, and by having the least frequent characters at the front of the list, and building from that end, we will propagate the less frequent characters to the bottom of the tree and the more frequent characters to the top.

The Haskell `mergeSort` function is implemented like this:

```haskell
mergeSort :: ([a] -> [a] -> [a]) -> [a] -> [a]
mergeSort merge xs
  | length xs < 2 = xs
  | otherwise
      = merge (mergeSort merge first) (mergeSort merge second)
        where
          first = take half xs
          second = drop half xs
          half = (length xs) `div` 2
```

This function takes another function, `merge`, as a parameter. It is via this function we will implement the two different sort functions we need.

We don't have `where` blocks in R, so we need to compute `first`, `second` and `half` before the recursion, but otherwise the the R implementation follows the Haskell implementation. We need some list functions first, though, for `length`, `take` and `drop`. We can implement them like this:

```{r}
llength <- case_func(
    acc = 0,
    NIL -> acc,
    CONS(car, cdr) -> llength(cdr, acc + 1)
)

llrev <- case_func(
    acc = NIL,
        NIL -> acc,
        CONS(car, cdr) -> llrev(cdr, CONS(car, acc))
)

lltake <- case_func(acc = NIL,
    ..(., 0) -> llrev(acc),
    ..(NIL, k) -> stop("There were less than k elements in the list"),
    ..(CONS(car, cdr), k) -> lltake(..(cdr, k - 1), CONS(car, acc))
)
lldrop <- case_func(
    ..(lst, 0) -> lst,
    ..(NIL, k) -> stop("There were less than k elements in the list"),
    ..(CONS(., cdr), k) -> lldrop(..(cdr, k - 1))
)
```

The `llrev` function, for reversing a linked list, is a helper function that lets us write tail-recursive take and drop functions. It isn't terribly important here since we do not use any tail-recursion optimisation, but once you get into the habit of writing your functions tail recurse it is hard not to. Later on in the vignette I will use simpler functions rather than tail-recursive ones to make the examples easier to follow.

Anyway, with the list functions in place, the merge sort can be implemented like this:

```{r}
merge_sort <- function(merge, xs) {
    n <- llength(xs)
    if (n < 2) return(xs)
    half <- n %/% 2 
    first <- lltake(..(xs, half))
    second <- lldrop(..(xs, half))
    merge(..(merge_sort(merge, first), merge_sort(merge, second)))
}
```

### Sorting alphabetically and counting characters

To sort characters alphabetically and at the same time count the number of occurrences for each, we need a merge function. The Haskell function looks like this:

```haskell
alphaMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
alphaMerge xs [] = xs
alphaMerge [] ys = ys
alphaMerge ((p,n) : xs) ((q,m) : ys)
  | (p == q)  = (p,n+m) : alphaMerge xs ys
  | (p < q)   = (p,n) : alphaMerge xs ((q,m) : ys)
  | otherwise = (q,m) : alphaMerge ((p,n) : xs) ys
```

We do not have conditions like `| (p == q)` in R, so we handle those using `if`-`else`-statements, but otherwise we translate the function statement by statement into this:

```{r}
alpha_merge <- case_func(
    ..(xs, NIL) -> xs,
    ..(NIL, ys) -> ys,
    ..(CONS(..(p,n), xs), CONS(..(q,m), ys)) -> {
        if (p == q) {
            CONS(..(p,n+m), alpha_merge(..(xs, ys)))
        } else if (p < q) {
            CONS(..(p,n), alpha_merge(..(xs, CONS(..(q,m), ys))))
        } else {
            CONS(..(q,m), alpha_merge(..(CONS(..(p,n), xs), ys)))
        }
    }
)
```

To test the function I want to make it a little easier to build linked lists, so I have written this helper function:

```{r}
llist_from_list <- function(x) {
    llist <- NIL
    n <- length(x)
    while (n > 0) {
        llist <- CONS(x[[n]], llist)
        n <- n - 1
    }
    llist
}
```

Ok, let us see if we can merge two lists:

```{r}
xs <- llist_from_list(list(
    ..("a", 1), ..("b", 2)
))
ys <- llist_from_list(list(
    ..("a", 3), ..("c", 3)
))
alpha_merge(..(xs, ys))
```

Yup, that went okay.

What about sorting? First I want a function for concatenating two linked lists, and then I will test it on the concatenation of `xs` and `ys` from above.

Concatenation can look like this:

```{r}
llconcat <- case_func( 
    ..(NIL,ys) -> ys,
    ..(CONS(x,xs),ys) -> CONS(x, llconcat(..(xs, ys)))
)
```

(Here, I did choose a simple version over a tail-recursive one).

Onwards to the test:

```{r}
zs <- llconcat(..(xs, ys))
merge_sort(alpha_merge, zs)
merge_sort(alpha_merge, llrev(zs))
```

Yup, that works.


### Sorting based on character frequency

To compute the character frequency we need to write this function:

```haskell
freqMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
freqMerge xs [] = xs
freqMerge [] ys = ys
freqMerge ((p,n):xs) ((q,m):ys)
  | (n < m || (n == m && p < q)) = (p,n) : freqMerge xs ((q,m):ys)
  | otherwise                    = (q,m) : freqMerge((p,n):xs) ys
```

The translation is done statement by statement and with `if`-statements instead of `|` conditions:

```{r}
freq_merge <- case_func(
    ..(xs, NIL) -> xs,
    ..(NIL, ys) -> ys,
    ..(CONS(..(p,n), xs), CONS(..(q,m), ys)) -> {
            if (n < m || (n == m && p < q)) {
                CONS(..(p,n), freq_merge(..(xs, CONS(..(q,m), ys))))
            } else {
                CONS(..(q,m), freq_merge(..(CONS(..(p,n), xs), ys)))
            }      
    }
)
```

```{r}
xs <- llist_from_list(list(
    ..("a", 5), ..("b", 3), ..("c", 2)
))
merge_sort(freq_merge, xs)
```

### Computing frequencies

To compute a list of character frequencies for a string, we need to combine the sort and merge functions. In Haskell, the `frequency` function looks like this:

```haskell
frequency :: [Char] -> [(Char,Int)]
frequency
  = mergeSort freqMerge . mergeSort alphaMerge . map start
    where
      start ch = (ch, 1)
```

We see that we need another list function, `map`. I have implemented it this way:

```{r}
llmap <- case_func(f, acc = NIL,
    NIL -> llrev(acc),
    CONS(car, cdr) -> llmap(cdr, f, CONS(f(car), acc))
)
```

I want to work on lists of characters, so as a helper function I have written `string_to_list` that translates a string into a list containing the characters in the string:

```{r}
string_to_list <- function(x)
    llist_from_list(strsplit(x, "")[[1]])

string_to_list("foobar")
```

For composing functions, we can use `magrittr` pipelines. If we do, we just have to remember that the functions are evaluated from left-to-right, whereas in the Haskell code, they are evaluated right-to-left. So

```Haskell
mergeSort freqMerge . mergeSort alphaMerge . map start
```

becomes

```{r}
library(magrittr)
frequency = . %>% 
    string_to_list() %>% 
    llmap(function(char) ..(char, 1L)) %>%
    merge_sort(alpha_merge, .) %>%
    merge_sort(freq_merge, .)
```

If we call `frequency` on a string, we get a list of pairs of characters and integers. We should interpret that as a list of characters and the number of occurrences each have in the string.

```{r}
frequency("foo")
```

## Building trees

From a frequency list we need to build tree. This is the Haskell code:

```haskell
makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList

toTreeList :: [ (Char,Int) ] -> [Tree]
toTreeList = map (uncurry Leaf)
```

The `uncurl Leaf` stuff is a technicality needed in Haskell. To translate a list of character-count pairs into a list of trees, we simply do this:

```{r}
to_tree_list <- . %>% llmap(
    case_func(..(char,count) -> Leaf(char, count))
)
xs <- "foo" %>% string_to_list() %>% llmap(function(char) ..(char, 1L))
to_tree_list(xs)
```

The `makeCodes` function looks like this in Haskell:

```haskell
makeCodes :: [ Tree ] -> Tree
makeCodes [t] = t
makeCodes ts = makeCodes (amalgamate ts)
```

and we can translate it almost mechanically:

```{r}
make_codes <- case_func(
    CONS(t, NIL) -> t,
    ts -> make_codes(amalgamate(ts))
)
```

The same goes for `amalgamate`:

```haskell
amalgamate :: [ Tree ] -> [ Tree ]
amalgamate (t1:t2:ts) = insTree (pair t1 t2) ts
```

```{r}
amalgamate <- case_func(
    CONS(t1, CONS(t2, ts)) -> ins_tree(..(pair(t1, t2), ts))
)
```

We can see that we need functions `pair` and `ins_tree` for these functions. The `pair` function shouldn't be confused with the `PAIR` constructor—one of the reasons I put the latter in all uppercase. Other than that, the entire translation of the Haskell code is just done statement by statement:

```haskell
pair :: Tree -> Tree -> Tree
pair t1 t2 = Node (v1 + v2) t1 t2
             where v1 = value t1
                   v2 = value t2

value :: Tree -> Int
value (Leaf _ n)   = n
value (Node n _ _) = n
```

```{r}
value <- case_func(
    Leaf(., n) -> n,
    Node(n, ., .) -> n
)
pair <- function(t1, t2) Node(value(t1) + value(t2), t1, t2)
```

```haskell
insTree :: Tree -> [ Tree ] -> [ Tree ]
insTree t1 []          = [t1]
insTree t1 (t2 : rest) =
  if v1 < v2 then t1 : t2 : rest else t2 : insTree t1 rest
  where v1 = value t1
        v2 = value t2
```

```{r}
ins_tree <- case_func(
    ..(t, NIL) -> single(t),
    ..(t1, CONS(t2, rest)) -> {
        v1 <- value(t1)
        v2 <- value(t2)
        if (v1 < v2) {
            CONS(t1, CONS(t2, rest))
        } else {
            CONS(t2, ins_tree(..(t1, rest)))
        }
    }
)
```

The `makeTree` function was defined as a composition:

```haskell
makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList
```

We do the same using a pipeline:

```{r}
make_tree <- . %>%
    to_tree_list() %>%
    make_codes()
```

```{r}
tree <- "foo" %>% frequency() %>% make_tree()
tree
```

I have written some code for plotting trees, in the function `plot_tree`. I don't show it here, it is a simple translation of the data structure into a table I can plot using `tidy graph` and `ggraph`, but we can use it to display trees:

```{r, echo=FALSE}
library(tidygraph)
library(ggraph)

plot_tree <- function(tree) {
    dfn <- case_func(n,
            Leaf(char,count) ->
                list(structure(Leaf(char,count), number = n), n + 1L),
            Node(count, left, right) -> {
                bind[left,n] <- dfn(left, n)
                bind[right,n] <- dfn(right, n)
                list(structure(Node(count, left, right), number = n), n + 1L)
            }
    )
    bind[tree,size] <- dfn(tree, 0L)
    
    node_label <- rep("", size)
    
    from <- vector("integer", length = size - 1)
    to <-  vector("integer", length = size - 1)
    edge_label <- vector("character", length = size - 1)
    
    index <- 1
    populate_table <- case_func(
        Leaf(char, count) ~ {
            node_label[attr(tree, "number") + 1] <<- char
        },
        Node(., left, right) ~ {
            this_node_n <- attr(tree, "number")
            left_node_n <- attr(left, "number")
            right_node_n <- attr(right, "number")
            from[index] <<- this_node_n ; to[index] <<- left_node_n
            edge_label[index] <<- "L"
            index <<- index + 1
            from[index] <<- this_node_n ; to[index] <<- right_node_n
            edge_label[index] <<- "R"
                
            index <<- index + 1
            populate_table(left)
            populate_table(right)
        }
    )
    populate_table(tree)
    
    nodes <- data.frame(node_label = node_label, 
                        stringsAsFactors = FALSE)
    # add one because ggraph doesn't like zero
    edges <- data.frame(from = from + 1, to = to + 1, 
                        edge_label = edge_label,
                        stringsAsFactors = FALSE)
    tidygraph::tbl_graph(nodes, edges) %>%
        ggraph(layout = "tree") +
        geom_node_point(aes(filter = (node_label == "")), size = 5) +
        geom_node_text(aes(filter = (node_label != ""), label = node_label)) +
        geom_edge_link(aes(label = edge_label, hjust = 2),
                       arrow = arrow(length = unit(4, 'mm')), 
                       end_cap = circle(3, 'mm')) +
        theme_graph()
}
```

```{r}
"foobarbaz" %>% frequency() %>% make_tree() %>% plot_tree()
```

## Code tables

The result of an encoding will be a sequence of `L` and `R` bits. We will represent these in linked lists, so to make it simpler to display these I have written these helper functions:

```{r}
as.list.llist <- function(x, all.names = FALSE, sorted = FALSE, ...) {
    n <- llength(x)
    v <- vector("list", length = n)
    i <- 1
    while (i <= n) {
        v[i] <- x$car
        i <- i + 1
        x <- x$cdr
    }
    v
}
as.vector.llist <- function(x, mode = "any") {
    unlist(as.list(x))
}

toString.code <- function(x, ...) {
    x %>% llmap(toString) %>% as.vector() %>% paste0(collapse = "")
}

code <- llist_from_list(
    list(L, L, R, L)
)
code
class(code) <- c("code", class(code))
code %>% toString() %>% cat()
```

I do not simply set the class of a list to `"code"`. I could, and most of the time that wouldn't be a problem. However, in the specification of linked lists we required that `cdr` should be of type `llist`, so if we want to concatenate two codes—and I can reveal that we do later—then we need codes to have that type. I therefore make codes specialisations of `llist` with `c("code", class(code))`.

The encoding of a string will be implemented as follows: for each character in the string, we look up the code for that character, i.e. the sequence of left and right edges we need to reach the character in a tree, and we concatenate all these. In the implementation of the encoding we want a table that maps characters to their code, and in Haskell that table is computed using this function:

```haskell
convert :: HCode -> Tree -> Table
convert cd (Leaf c n) = [(c,cd)]
convert cd (Node n t1 t2) = (convert (cd++[L]) t1) ++
                            (convert (cd++[R]) t2)

codeTable :: Tree -> Table
codeTable = convert []
```

Since we pattern match on the tree and not the code, we will flip the order of the arguments. We can implement it in R like this:

```{r}
convert <- case_func(code,
    Leaf(char, count) -> single(
        ..(char, structure(code, class = c("code", class(code))))
    ),
    Node(count, t1, t2) -> llconcat(
        ..(convert(t1, llconcat(..(code, single(L)))),
           convert(t2, llconcat(..(code, single(R)))))
    )
)

code_table <- . %>% convert(NIL) %>% 
    structure(class = c("code_table", class(.)))
```

I have added a class to tables to make them easier to print.

```{r}
toString.code_table <- function(x, ...) {
    x %>% llmap(case_func(..(x,y) ~ paste0(x, " => ", toString(y)))
    ) %>% as.vector() %>% paste0(collapse = "\n")
}
```

```{r}
tree <- "foobaar" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()

tree <- "foobarbaz" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()
```

## Coding and decoding

Now, we have everything we need in order to encode and decode strings/codes. Both encoding and decoding requires a tree, so we define a function for building a tree from a string:

```haskell
codes :: [Char] -> Tree
codes = makeTree . frequency
```

```{r}
codes <- . %>% frequency() %>% make_tree()
```

```{r}
tree <- codes("foobarbaz")
tree %>% code_table() %>% toString() %>% cat()
```

As I have already explained, encoding combines looking up the code for individual characters and concatenating all of them. The lookup is implemented in Haskell like this:

```haskell
lookupTable :: Table -> Char -> HCode
lookupTable []            c = error "lookupTable"
lookupTable ((ch,n) : tb) c
  | ch == c                 = n
  | otherwise               = lookupTable tb c
```

And in R like this:

```{r}
lookup_table <- case_func(char,
   NIL -> stop(paste(char, "was not found in the table")),
   CONS(..(ch,n), tb) -> {
       if (ch == char) {
           n
       } else {
           lookup_table(tb, char)
       }
   }
) 
```

The encoding combines that with concatenation. We map over the characters and look up their code. We then concatenate the result.

```Haskell
codeMessage :: Table -> [Char] -> HCode
codeMessage tbl = concat . map (lookupTable tbl)
```

Our `llconcat` function only works on two linked lists, however, so we need to handle concatenation slightly differently. I have handled it by writing a reduce function that I can use to concatenate lists pairwise:

```{r}
llreduce <- case_func(f, acc,
    NIL -> acc,
    CONS(val, rest) -> llreduce(rest, f, f(acc, val))  
)
```

The `code_message` is very similar to the Haskell version, but I have added a translation from strings to lists before the main code (this isn't needed in Haskell because strings are already lists of characters) and I set the class of the result to make it easier to print coded messages.

```{r}
code_message <- function(message, table) {
    message %>% string_to_list() %>%
        llmap(function(char) lookup_table(table, char)) %>%
        llreduce(function(x,y) llconcat(..(x,y)), NIL) %>% 
        structure(class = c("code", class(.)))
}

tree <- codes("foobarbaz")
table <- tree %>% code_table()
code <- "foobar" %>% code_message(table)
code %>% toString() %>% cat()
```

Decoding works by following paths down the tree until we get to a character, which we then add to the list as the decoding of a part of the code, and doing this as long as we have path-bits in our encoded message.

```haskell
decodeMessage :: Tree -> HCode -> [Char]
decodeMessage tr =
  decodeByTree tr
  where
    decodeByTree (Node n t1 t2) (L : rest) = decodeByTree t1 rest
    decodeByTree (Node n t1 t2) (R : rest) = decodeByTree t2 rest
    decodeByTree (Leaf c n)     rest       = c : decodeByTree tr rest
    decodeByTree t              []         = []
```

For this function, the R code closely follows the Haskell code, except that we have to put the `decode_by_tree` function at the beginning of the function instead of in a `where` block:

```{r}
decode_message <- function(tree, code) {
    decode_by_tree <- case_func(
        ..(Node(n, t1, t2), CONS(L, rest)) -> decode_by_tree(..(t1, rest)),
        ..(Node(n, t1, t2), CONS(R, rest)) -> decode_by_tree(..(t2, rest)),
        ..(Leaf(char, n), rest) -> CONS(char, decode_by_tree(..(tree, rest))),
        ..(t, NIL) -> NIL
    )
    decode_by_tree(..(tree, code)) %>% as.vector() %>% paste0(collapse = "")
}

decode_message(tree, code)
```

That's it. Huffman coding in R.