-
Notifications
You must be signed in to change notification settings - Fork 1
/
huffman-coding.Rmd
665 lines (530 loc) · 20.7 KB
/
huffman-coding.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
---
title: "Huffman coding"
author: "Thomas Mailund"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Huffman coding}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
[Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding) is a compression approach based on character frequencies. To code a string, we work out the frequency of each letter in the string and then build a tree where we put the letters in the leaves and structure the tree such that the most frequent letters are closest to the root. We then encode each letter by the path we must take in that tree to get to the leaf that contain that letter. We need one bit for each edge we follow; zero, say, if we go left and one if we go right.
In [Haskell: the Craft of Functional Programming](https://amzn.to/2GHf6b6), there is an example of implementing Huffman coding in Haskell. In this vignette, I will adapt that to R with `pmatch` patterns. I have structured the vignette such that I switch between showing the original Haskell code and then the corresponding R code. If you are not familiar with Haskell, I think you should still be able to follow along.
## Data types
The data structures used in the Haskell implementation are these:
```haskell
data Tree = Leaf Char Int | Node Int Tree Tree
data Bit = L | R deriving (Eq, Show)
type HCode = [ Bit ]
type Table = [ (Char, HCode) ]
```
The first two are data types created from constructors, and we can define the corresponding R data structures using a `:=` declaration:
```{r}
library(pmatch)
Tree := Leaf(char : character, count : integer) |
Node(count : integer, left : Tree, right : Tree)
Bit := L | R
```
The other two, `HCode` and `Table` are lists over `Bit` and pairs of characters and `HCode`, respectively. We won't define types for these. In R, we don't have static types so we don't really name types. We need a list representation, however, and to make one that matches lists in Haskell, we define a linked list:
```{r}
llist := NIL | CONS(car, cdr : llist)
```
On occasions, we have to create singleton lists, and rather than writing `CONS(val,NIL)` for those, I will define a function for it. I will define a type to handle pairs in pattern matching.
```{r}
single <- function(val) CONS(val, NIL)
pair := ..(x, y)
```
## Computing frequencies
Before we can encode a string we need to calculate the frequency of the characters in it. In the Haskell implementation this is done using two merge sorts, one that sorts the string characters alphabetically and count the number of occurrences of each and then another sort that orders the characters according to their frequency. Later on, we will build a tree from a list of characters, and by having the least frequent characters at the front of the list, and building from that end, we will propagate the less frequent characters to the bottom of the tree and the more frequent characters to the top.
The Haskell `mergeSort` function is implemented like this:
```haskell
mergeSort :: ([a] -> [a] -> [a]) -> [a] -> [a]
mergeSort merge xs
| length xs < 2 = xs
| otherwise
= merge (mergeSort merge first) (mergeSort merge second)
where
first = take half xs
second = drop half xs
half = (length xs) `div` 2
```
This function takes another function, `merge`, as a parameter. It is via this function we will implement the two different sort functions we need.
We don't have `where` blocks in R, so we need to compute `first`, `second` and `half` before the recursion, but otherwise the the R implementation follows the Haskell implementation. We need some list functions first, though, for `length`, `take` and `drop`. We can implement them like this:
```{r}
llength <- case_func(
acc = 0,
NIL -> acc,
CONS(car, cdr) -> llength(cdr, acc + 1)
)
llrev <- case_func(
acc = NIL,
NIL -> acc,
CONS(car, cdr) -> llrev(cdr, CONS(car, acc))
)
lltake <- case_func(acc = NIL,
..(., 0) -> llrev(acc),
..(NIL, k) -> stop("There were less than k elements in the list"),
..(CONS(car, cdr), k) -> lltake(..(cdr, k - 1), CONS(car, acc))
)
lldrop <- case_func(
..(lst, 0) -> lst,
..(NIL, k) -> stop("There were less than k elements in the list"),
..(CONS(., cdr), k) -> lldrop(..(cdr, k - 1))
)
```
The `llrev` function, for reversing a linked list, is a helper function that lets us write tail-recursive take and drop functions. It isn't terribly important here since we do not use any tail-recursion optimisation, but once you get into the habit of writing your functions tail recurse it is hard not to. Later on in the vignette I will use simpler functions rather than tail-recursive ones to make the examples easier to follow.
Anyway, with the list functions in place, the merge sort can be implemented like this:
```{r}
merge_sort <- function(merge, xs) {
n <- llength(xs)
if (n < 2) return(xs)
half <- n %/% 2
first <- lltake(..(xs, half))
second <- lldrop(..(xs, half))
merge(..(merge_sort(merge, first), merge_sort(merge, second)))
}
```
### Sorting alphabetically and counting characters
To sort characters alphabetically and at the same time count the number of occurrences for each, we need a merge function. The Haskell function looks like this:
```haskell
alphaMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
alphaMerge xs [] = xs
alphaMerge [] ys = ys
alphaMerge ((p,n) : xs) ((q,m) : ys)
| (p == q) = (p,n+m) : alphaMerge xs ys
| (p < q) = (p,n) : alphaMerge xs ((q,m) : ys)
| otherwise = (q,m) : alphaMerge ((p,n) : xs) ys
```
We do not have conditions like `| (p == q)` in R, so we handle those using `if`-`else`-statements, but otherwise we translate the function statement by statement into this:
```{r}
alpha_merge <- case_func(
..(xs, NIL) -> xs,
..(NIL, ys) -> ys,
..(CONS(..(p,n), xs), CONS(..(q,m), ys)) -> {
if (p == q) {
CONS(..(p,n+m), alpha_merge(..(xs, ys)))
} else if (p < q) {
CONS(..(p,n), alpha_merge(..(xs, CONS(..(q,m), ys))))
} else {
CONS(..(q,m), alpha_merge(..(CONS(..(p,n), xs), ys)))
}
}
)
```
To test the function I want to make it a little easier to build linked lists, so I have written this helper function:
```{r}
llist_from_list <- function(x) {
llist <- NIL
n <- length(x)
while (n > 0) {
llist <- CONS(x[[n]], llist)
n <- n - 1
}
llist
}
```
Ok, let us see if we can merge two lists:
```{r}
xs <- llist_from_list(list(
..("a", 1), ..("b", 2)
))
ys <- llist_from_list(list(
..("a", 3), ..("c", 3)
))
alpha_merge(..(xs, ys))
```
Yup, that went okay.
What about sorting? First I want a function for concatenating two linked lists, and then I will test it on the concatenation of `xs` and `ys` from above.
Concatenation can look like this:
```{r}
llconcat <- case_func(
..(NIL,ys) -> ys,
..(CONS(x,xs),ys) -> CONS(x, llconcat(..(xs, ys)))
)
```
(Here, I did choose a simple version over a tail-recursive one).
Onwards to the test:
```{r}
zs <- llconcat(..(xs, ys))
merge_sort(alpha_merge, zs)
merge_sort(alpha_merge, llrev(zs))
```
Yup, that works.
### Sorting based on character frequency
To compute the character frequency we need to write this function:
```haskell
freqMerge :: [(Char,Int)] -> [(Char,Int)] -> [(Char,Int)]
freqMerge xs [] = xs
freqMerge [] ys = ys
freqMerge ((p,n):xs) ((q,m):ys)
| (n < m || (n == m && p < q)) = (p,n) : freqMerge xs ((q,m):ys)
| otherwise = (q,m) : freqMerge((p,n):xs) ys
```
The translation is done statement by statement and with `if`-statements instead of `|` conditions:
```{r}
freq_merge <- case_func(
..(xs, NIL) -> xs,
..(NIL, ys) -> ys,
..(CONS(..(p,n), xs), CONS(..(q,m), ys)) -> {
if (n < m || (n == m && p < q)) {
CONS(..(p,n), freq_merge(..(xs, CONS(..(q,m), ys))))
} else {
CONS(..(q,m), freq_merge(..(CONS(..(p,n), xs), ys)))
}
}
)
```
```{r}
xs <- llist_from_list(list(
..("a", 5), ..("b", 3), ..("c", 2)
))
merge_sort(freq_merge, xs)
```
### Computing frequencies
To compute a list of character frequencies for a string, we need to combine the sort and merge functions. In Haskell, the `frequency` function looks like this:
```haskell
frequency :: [Char] -> [(Char,Int)]
frequency
= mergeSort freqMerge . mergeSort alphaMerge . map start
where
start ch = (ch, 1)
```
We see that we need another list function, `map`. I have implemented it this way:
```{r}
llmap <- case_func(f, acc = NIL,
NIL -> llrev(acc),
CONS(car, cdr) -> llmap(cdr, f, CONS(f(car), acc))
)
```
I want to work on lists of characters, so as a helper function I have written `string_to_list` that translates a string into a list containing the characters in the string:
```{r}
string_to_list <- function(x)
llist_from_list(strsplit(x, "")[[1]])
string_to_list("foobar")
```
For composing functions, we can use `magrittr` pipelines. If we do, we just have to remember that the functions are evaluated from left-to-right, whereas in the Haskell code, they are evaluated right-to-left. So
```Haskell
mergeSort freqMerge . mergeSort alphaMerge . map start
```
becomes
```{r}
library(magrittr)
frequency = . %>%
string_to_list() %>%
llmap(function(char) ..(char, 1L)) %>%
merge_sort(alpha_merge, .) %>%
merge_sort(freq_merge, .)
```
If we call `frequency` on a string, we get a list of pairs of characters and integers. We should interpret that as a list of characters and the number of occurrences each have in the string.
```{r}
frequency("foo")
```
## Building trees
From a frequency list we need to build tree. This is the Haskell code:
```haskell
makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList
toTreeList :: [ (Char,Int) ] -> [Tree]
toTreeList = map (uncurry Leaf)
```
The `uncurl Leaf` stuff is a technicality needed in Haskell. To translate a list of character-count pairs into a list of trees, we simply do this:
```{r}
to_tree_list <- . %>% llmap(
case_func(..(char,count) -> Leaf(char, count))
)
xs <- "foo" %>% string_to_list() %>% llmap(function(char) ..(char, 1L))
to_tree_list(xs)
```
The `makeCodes` function looks like this in Haskell:
```haskell
makeCodes :: [ Tree ] -> Tree
makeCodes [t] = t
makeCodes ts = makeCodes (amalgamate ts)
```
and we can translate it almost mechanically:
```{r}
make_codes <- case_func(
CONS(t, NIL) -> t,
ts -> make_codes(amalgamate(ts))
)
```
The same goes for `amalgamate`:
```haskell
amalgamate :: [ Tree ] -> [ Tree ]
amalgamate (t1:t2:ts) = insTree (pair t1 t2) ts
```
```{r}
amalgamate <- case_func(
CONS(t1, CONS(t2, ts)) -> ins_tree(..(pair(t1, t2), ts))
)
```
We can see that we need functions `pair` and `ins_tree` for these functions. The `pair` function shouldn't be confused with the `PAIR` constructor—one of the reasons I put the latter in all uppercase. Other than that, the entire translation of the Haskell code is just done statement by statement:
```haskell
pair :: Tree -> Tree -> Tree
pair t1 t2 = Node (v1 + v2) t1 t2
where v1 = value t1
v2 = value t2
value :: Tree -> Int
value (Leaf _ n) = n
value (Node n _ _) = n
```
```{r}
value <- case_func(
Leaf(., n) -> n,
Node(n, ., .) -> n
)
pair <- function(t1, t2) Node(value(t1) + value(t2), t1, t2)
```
```haskell
insTree :: Tree -> [ Tree ] -> [ Tree ]
insTree t1 [] = [t1]
insTree t1 (t2 : rest) =
if v1 < v2 then t1 : t2 : rest else t2 : insTree t1 rest
where v1 = value t1
v2 = value t2
```
```{r}
ins_tree <- case_func(
..(t, NIL) -> single(t),
..(t1, CONS(t2, rest)) -> {
v1 <- value(t1)
v2 <- value(t2)
if (v1 < v2) {
CONS(t1, CONS(t2, rest))
} else {
CONS(t2, ins_tree(..(t1, rest)))
}
}
)
```
The `makeTree` function was defined as a composition:
```haskell
makeTree :: [ (Char,Int) ] -> Tree
makeTree = makeCodes . toTreeList
```
We do the same using a pipeline:
```{r}
make_tree <- . %>%
to_tree_list() %>%
make_codes()
```
```{r}
tree <- "foo" %>% frequency() %>% make_tree()
tree
```
I have written some code for plotting trees, in the function `plot_tree`. I don't show it here, it is a simple translation of the data structure into a table I can plot using `tidy graph` and `ggraph`, but we can use it to display trees:
```{r, echo=FALSE}
library(tidygraph)
library(ggraph)
plot_tree <- function(tree) {
dfn <- case_func(n,
Leaf(char,count) ->
list(structure(Leaf(char,count), number = n), n + 1L),
Node(count, left, right) -> {
bind[left,n] <- dfn(left, n)
bind[right,n] <- dfn(right, n)
list(structure(Node(count, left, right), number = n), n + 1L)
}
)
bind[tree,size] <- dfn(tree, 0L)
node_label <- rep("", size)
from <- vector("integer", length = size - 1)
to <- vector("integer", length = size - 1)
edge_label <- vector("character", length = size - 1)
index <- 1
populate_table <- case_func(
Leaf(char, count) ~ {
node_label[attr(tree, "number") + 1] <<- char
},
Node(., left, right) ~ {
this_node_n <- attr(tree, "number")
left_node_n <- attr(left, "number")
right_node_n <- attr(right, "number")
from[index] <<- this_node_n ; to[index] <<- left_node_n
edge_label[index] <<- "L"
index <<- index + 1
from[index] <<- this_node_n ; to[index] <<- right_node_n
edge_label[index] <<- "R"
index <<- index + 1
populate_table(left)
populate_table(right)
}
)
populate_table(tree)
nodes <- data.frame(node_label = node_label,
stringsAsFactors = FALSE)
# add one because ggraph doesn't like zero
edges <- data.frame(from = from + 1, to = to + 1,
edge_label = edge_label,
stringsAsFactors = FALSE)
tidygraph::tbl_graph(nodes, edges) %>%
ggraph(layout = "tree") +
geom_node_point(aes(filter = (node_label == "")), size = 5) +
geom_node_text(aes(filter = (node_label != ""), label = node_label)) +
geom_edge_link(aes(label = edge_label, hjust = 2),
arrow = arrow(length = unit(4, 'mm')),
end_cap = circle(3, 'mm')) +
theme_graph()
}
```
```{r}
"foobarbaz" %>% frequency() %>% make_tree() %>% plot_tree()
```
## Code tables
The result of an encoding will be a sequence of `L` and `R` bits. We will represent these in linked lists, so to make it simpler to display these I have written these helper functions:
```{r}
as.list.llist <- function(x, all.names = FALSE, sorted = FALSE, ...) {
n <- llength(x)
v <- vector("list", length = n)
i <- 1
while (i <= n) {
v[i] <- x$car
i <- i + 1
x <- x$cdr
}
v
}
as.vector.llist <- function(x, mode = "any") {
unlist(as.list(x))
}
toString.code <- function(x, ...) {
x %>% llmap(toString) %>% as.vector() %>% paste0(collapse = "")
}
code <- llist_from_list(
list(L, L, R, L)
)
code
class(code) <- c("code", class(code))
code %>% toString() %>% cat()
```
I do not simply set the class of a list to `"code"`. I could, and most of the time that wouldn't be a problem. However, in the specification of linked lists we required that `cdr` should be of type `llist`, so if we want to concatenate two codes—and I can reveal that we do later—then we need codes to have that type. I therefore make codes specialisations of `llist` with `c("code", class(code))`.
The encoding of a string will be implemented as follows: for each character in the string, we look up the code for that character, i.e. the sequence of left and right edges we need to reach the character in a tree, and we concatenate all these. In the implementation of the encoding we want a table that maps characters to their code, and in Haskell that table is computed using this function:
```haskell
convert :: HCode -> Tree -> Table
convert cd (Leaf c n) = [(c,cd)]
convert cd (Node n t1 t2) = (convert (cd++[L]) t1) ++
(convert (cd++[R]) t2)
codeTable :: Tree -> Table
codeTable = convert []
```
Since we pattern match on the tree and not the code, we will flip the order of the arguments. We can implement it in R like this:
```{r}
convert <- case_func(code,
Leaf(char, count) -> single(
..(char, structure(code, class = c("code", class(code))))
),
Node(count, t1, t2) -> llconcat(
..(convert(t1, llconcat(..(code, single(L)))),
convert(t2, llconcat(..(code, single(R)))))
)
)
code_table <- . %>% convert(NIL) %>%
structure(class = c("code_table", class(.)))
```
I have added a class to tables to make them easier to print.
```{r}
toString.code_table <- function(x, ...) {
x %>% llmap(case_func(..(x,y) ~ paste0(x, " => ", toString(y)))
) %>% as.vector() %>% paste0(collapse = "\n")
}
```
```{r}
tree <- "foobaar" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()
tree <- "foobarbaz" %>% frequency() %>% make_tree()
table <- tree %>% code_table()
tree %>% plot_tree()
table %>% toString() %>% cat()
```
## Coding and decoding
Now, we have everything we need in order to encode and decode strings/codes. Both encoding and decoding requires a tree, so we define a function for building a tree from a string:
```haskell
codes :: [Char] -> Tree
codes = makeTree . frequency
```
```{r}
codes <- . %>% frequency() %>% make_tree()
```
```{r}
tree <- codes("foobarbaz")
tree %>% code_table() %>% toString() %>% cat()
```
As I have already explained, encoding combines looking up the code for individual characters and concatenating all of them. The lookup is implemented in Haskell like this:
```haskell
lookupTable :: Table -> Char -> HCode
lookupTable [] c = error "lookupTable"
lookupTable ((ch,n) : tb) c
| ch == c = n
| otherwise = lookupTable tb c
```
And in R like this:
```{r}
lookup_table <- case_func(char,
NIL -> stop(paste(char, "was not found in the table")),
CONS(..(ch,n), tb) -> {
if (ch == char) {
n
} else {
lookup_table(tb, char)
}
}
)
```
The encoding combines that with concatenation. We map over the characters and look up their code. We then concatenate the result.
```Haskell
codeMessage :: Table -> [Char] -> HCode
codeMessage tbl = concat . map (lookupTable tbl)
```
Our `llconcat` function only works on two linked lists, however, so we need to handle concatenation slightly differently. I have handled it by writing a reduce function that I can use to concatenate lists pairwise:
```{r}
llreduce <- case_func(f, acc,
NIL -> acc,
CONS(val, rest) -> llreduce(rest, f, f(acc, val))
)
```
The `code_message` is very similar to the Haskell version, but I have added a translation from strings to lists before the main code (this isn't needed in Haskell because strings are already lists of characters) and I set the class of the result to make it easier to print coded messages.
```{r}
code_message <- function(message, table) {
message %>% string_to_list() %>%
llmap(function(char) lookup_table(table, char)) %>%
llreduce(function(x,y) llconcat(..(x,y)), NIL) %>%
structure(class = c("code", class(.)))
}
tree <- codes("foobarbaz")
table <- tree %>% code_table()
code <- "foobar" %>% code_message(table)
code %>% toString() %>% cat()
```
Decoding works by following paths down the tree until we get to a character, which we then add to the list as the decoding of a part of the code, and doing this as long as we have path-bits in our encoded message.
```haskell
decodeMessage :: Tree -> HCode -> [Char]
decodeMessage tr =
decodeByTree tr
where
decodeByTree (Node n t1 t2) (L : rest) = decodeByTree t1 rest
decodeByTree (Node n t1 t2) (R : rest) = decodeByTree t2 rest
decodeByTree (Leaf c n) rest = c : decodeByTree tr rest
decodeByTree t [] = []
```
For this function, the R code closely follows the Haskell code, except that we have to put the `decode_by_tree` function at the beginning of the function instead of in a `where` block:
```{r}
decode_message <- function(tree, code) {
decode_by_tree <- case_func(
..(Node(n, t1, t2), CONS(L, rest)) -> decode_by_tree(..(t1, rest)),
..(Node(n, t1, t2), CONS(R, rest)) -> decode_by_tree(..(t2, rest)),
..(Leaf(char, n), rest) -> CONS(char, decode_by_tree(..(tree, rest))),
..(t, NIL) -> NIL
)
decode_by_tree(..(tree, code)) %>% as.vector() %>% paste0(collapse = "")
}
decode_message(tree, code)
```
That's it. Huffman coding in R.