- Why coding?
-
Source coding
- Convert source messages to channel symbols (for example 0,1)
- Minimize number of symbols needed
- (Adapt probabilities of symbols to maximize mutual information)
-
Error control
- Protection against channel errors / Adds new (redundant) symbols
Source-channel separation theorem (informal):
-
Optimal reliable communication can be obtained by performing the two tasks separately:
- Source coding: to minimize number of symbols needed
- Error control coding (channel coding): to provide protection against noise
-
Assume we code for transmission over ideal channels with no noise
-
Transmitted symbols are perfectly recovered at the receiver
-
Main concerns:
- minimize the number of symbols needed to represent the messages
- make sure we can decode the messages
-
Advantages:
- Efficiency
- Short communication times
- Can decode easily
-
Let $S = \left\lbrace s_1, s_2, ... s_N \right\rbrace$ = an input discrete memoryless source
-
Let $X = \left\lbrace x_1, x_2, ... x_M \right\rbrace$ = the alphabet of the code
- Example: binary: {0,1}
-
A code is a mapping from $S$ to the set of all codewords: $$C = \left\lbrace c_1, c_2, ... c_N \right\rbrace$$
- Each message $s_i$ is mapped to its codeword $c_i$
- Codeword length $l_i$ = the number of symbols in $c_i$
-
Encoding: given a sequence of messages, replace each message with its codeword
-
Decoding: given a sequence of symbols, deduce the original sequence of messages
-
Example: at blackboard
-
How to measure representation efficiency of a code?
-
Average code length = average of the codeword lengths: $$\overline{l} = \sum_i p(s_i) l_i$$
The probability of a codeword = the probability of the corresponding message
-
Smaller average length: code more efficient (better)
-
How small can the average length be?
A code can be:
- non-singular: all codewords are different
- uniquely decodable: for any received sequence of symbols, there is only one corresponding sequence of messages
- i.e. no sequence of messages produces the same sequence of symbols
- i.e. there is never a confusion at decoding
- instantaneous (also known as prefix-free): no codeword is a prefix of another codeword
- A prefix = a codeword which is the beginning of another codeword
Examples: at the blackboard
-
Theorem:
- An instantaneous code is uniquely decodable
-
Proof:
- There is exactly one codeword matching the beginning of the sequence:
- Suppose the true initial codeword is c
- There can't be a shorter codeword c', since c' would be a prefix of c
- There can't be a longer codeword c'', since c would be a prefix of c''
- Remove the first codeword from the sequence
- By the same argument, there is exactly one codeword matching the new beginning, and so on ...
-
Note: the converse is not necessarily true; there exist uniquely decodable codes which are not instantaneous
-
How to decode an instantaneous code using the graph
- Start at the root; each bit means go left or go right
- Illustrate at whiteboard (see the sketch below)
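A minimal sketch of this tree walk in Python; the code table and the bit sequence are made-up illustrations, not from the lecture:

```python
# Decoding an instantaneous (prefix-free) binary code by walking a tree.
# Hypothetical code: a -> 0, b -> 10, c -> 110, d -> 111
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Build the decoding tree: each node is a dict with branches "0"/"1";
# a leaf stores the decoded message under the key "msg".
root = {}
for msg, word in code.items():
    node = root
    for bit in word:
        node = node.setdefault(bit, {})
    node["msg"] = msg

def decode(bits):
    """Walk the tree bit by bit; emit a message at every leaf, then restart at the root."""
    messages, node = [], root
    for bit in bits:
        node = node[bit]
        if "msg" in node:
            messages.append(node["msg"])
            node = root
    return messages

print(decode("0101100111"))   # ['a', 'b', 'c', 'a', 'd']
```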
-
Advantage of instantaneous codes over merely uniquely decodable ones: simple decoding
-
Why the name instantaneous?
- The codeword can be decoded as soon as it is fully received
- Counter-example: Uniquely decodable, non-instantaneous, delay 6: {0, 01, 011, 1110}
-
Theorem:
- A uniquely decodable code is non-singular
-
Proof: by contradiction
- Assume the uniquely-decodable code is singular
- This means some codewords are not unique (two different messages have the same codeword)
- When decoding that codeword, you obviously can't decide which is the corresponding message
- Contradiction: the code is not uniquely decodable
- So if the code is uniquely-decodable, it must be non-singular
-
Note: the converse is not necessarily true; there exist non-singular codes which are not uniquely decodable
-
Relation between code types:
- Instantaneous $\subset$ uniquely decodable $\subset$ non-singular
-
When can an instantaneous code exist?
-
Kraft inequality theorem:
- There exists an instantaneous code with $D$ symbols and codeword lengths $\{l_1, l_2, \ldots, l_n\}$ if and only if the lengths satisfy the following inequality: $$ \sum_i D^{-l_i} \leq 1.$$
-
Proof: At blackboard
-
Comments:
- If lengths do not satisfy this, no instantaneous code exists
- If the lengths of a code satisfy this, that code can be instantaneous or not (there exists an instantaneous code, but not necessarily that one)
- Kraft inequality means that the codeword lengths cannot all be very small (see the check below)
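A quick numerical check in Python (the two sets of lengths are made-up illustrations):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^(-l_i); an instantaneous code with these lengths exists iff this is <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> OK, with equality (no unused branches)
print(kraft_sum([1, 1, 2]))      # 1.25 -> > 1, no instantaneous code has these lengths
```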
-
From the proof => we have equality in the relation $$ \sum_i D^{-l_i} = 1$$ only if the lowest level is fully covered <=> no unused branches
-
For an instantaneous code which satisfies Kraft with equality, all the graph branches terminate with codewords (there are no unused branches)
- This is most economical: codewords are as short as they can be
-
Instantaneous codes must obey Kraft inequality
-
How about uniquely decodable codes?
-
McMillan theorem (no proof given):
- Any uniquely decodable code also satisfies the Kraft inequality: $$ \sum_i D^{-l_i} \leq 1.$$
-
Consequence:
- For every uniquely decodable code, there exists an instantaneous code with the same lengths!
- Even though the class of uniquely decodable codes is larger than that of instantaneous codes, it brings no benefit in codeword length
- We can always use just instantaneous codes.
-
How to find an instantaneous code with code lengths $\{l_i\}$
- Check that the lengths satisfy the Kraft relation
- Draw the graph
- Assign nodes in a certain order (e.g. descending probability)
-
Easy, standard procedure
-
Example: at blackboard
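A sketch of one standard way to do this in Python (a canonical assignment in order of increasing length); the lengths are a made-up example:

```python
def instantaneous_binary_code(lengths):
    """Assign binary codewords for the given lengths (they must satisfy Kraft).
    Process lengths in increasing order; each codeword is the previous one plus 1,
    padded with zeros up to the required length (a walk down the code tree)."""
    assert sum(2 ** (-l) for l in lengths) <= 1, "lengths violate the Kraft inequality"
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)                  # go deeper in the tree: append zeros
        codewords.append(format(value, f"0{l}b"))
        value += 1                                # next unused node at this depth
        prev_len = l
    return codewords

print(instantaneous_binary_code([2, 1, 3, 3]))   # ['0', '10', '110', '111'] (sorted by length)
```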
-
We want to minimize the average length of a code: $$\overline{l} = \sum_i p(s_i) l_i$$
-
But the lengths must obey the Kraft inequality (for the code to be uniquely decodable)
-
So we reach the following constrained optimization problem:
-
Method of Lagrange multipliers: standard mathematical tool
-
To solve the following constrained optimization problem $$\begin{aligned} \textbf{minimize } & f(x) \\ \textrm{subject to } & g(x) = 0 \end{aligned}$$ one must build a new function $L(x, \lambda)$ (the Lagrangian function): $$L(x, \lambda) = f(x) - \lambda g(x)$$ and the solution $x$ is among the solutions of the system: $$\begin{aligned} & \frac{\partial L(x, \lambda)}{\partial x} = 0 \\ & \frac{\partial L(x, \lambda)}{\partial \lambda} = 0 \end{aligned}$$
-
If there are multiple variables $x_i$, differentiation is done with respect to each one
-
In our case:
- The unknowns $x$ are the lengths $l_i$
- The function is $f(x) = \overline{l} = \sum_i p(s_i) l_i$
- The constraint is $g(x) = \sum_i D^{-l_i} - 1 = 0$
-
(Solve at blackboard)
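A sketch of that derivation, treating the lengths $l_i$ as real numbers: the Lagrangian is $$L(l, \lambda) = \sum_i p(s_i) l_i - \lambda \left( \sum_i D^{-l_i} - 1 \right)$$ Setting $\frac{\partial L}{\partial l_i} = p(s_i) + \lambda \ln(D) D^{-l_i} = 0$ gives $D^{-l_i} = -\frac{p(s_i)}{\lambda \ln(D)}$. Summing over $i$ and using $\sum_i D^{-l_i} = 1$ and $\sum_i p(s_i) = 1$ gives $\lambda \ln(D) = -1$, so $D^{-l_i} = p(s_i)$, i.e. $l_i = -\log_D(p(s_i))$.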
-
The optimal values are: $$\boxed{l_i = -\log(p(s_i))}$$
-
Intuition: using $l_i = -\log(p(s_i))$ satisfies Kraft with equality, so the lengths cannot be any shorter, in general
-
The optimal codeword lengths are: $$\boxed{l_i = -\log(p(s_i))}$$
Higher probability => smaller codeword
- more efficient
- language examples: "da", "nu", "the", "le" ...
-
Smaller probability => longer codeword
- it appears rarely => no problem
-
Overall, we obtain the minimum average length
-
If the optimal values are: $$l_i = -\log(p(s_i))$$
-
Then the minimal average length is: $$\min \overline{l} = \sum_i p(s_i) l_i = -\sum_i p(s_i) \log(p(s_i)) = H(S)$$
The entropy of a source = the minimum average length necessary to encode the messages
- e.g. the minimum number of bits required to represent the data in binary form
-
This tells us something about entropy
- This is what entropy means in practice
- Small entropy => can be written (encoded) with few bits
- Large entropy => requires more bits for encoding
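A small numerical illustration in Python (the two example distributions are made up):

```python
from math import log2

def entropy(probs):
    """H(S) in bits = the minimum average number of bits per message."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0   -> needs 2 bits per message
print(entropy([0.9, 0.05, 0.03, 0.02]))    # ~0.62 -> can be encoded well below 2 bits/message
```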
-
This tells us something about the average length of codes
- The average length of a uniquely decodable code must be at least as large as the source entropy $$H(S) \leq \overline{l}$$
-
One can never represent messages, on average, with a code having average length less than the entropy
- Analogy: 1 liter of water
- 1 liter of water = the quantity of water that can fit in any bottle of size $\geq$ 1 liter, but not in any bottle $<$ 1 liter: $$\text{Bottle} \geq \text{Water}$$
- Information of the source = the water
- The code used for representing the messages = the bottle that carries the water $$\overline{l} \geq H(S)$$
-
Efficiency of a code ($M$ = size of code alphabet): $$\eta = \frac{H(S)}{\overline{l} \log{M}}$$
- usually $M = 2$, so $\eta = \frac{H(S)}{\overline{l}}$
- but if $M > 2$, a factor of $\log{M}$ is needed because $H(S)$ is in bits (binary) but $\overline{l}$ is not in bits ($M$ symbols)
-
Redundancy of a code: $$\rho = 1 - \eta$$
-
These measures indicate how close the average length is to the optimal value
-
When $\eta = 1$: optimal code
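A minimal Python check of these formulas, using a made-up dyadic source and a matching binary code:

```python
from math import log2

p       = [0.5, 0.25, 0.125, 0.125]    # message probabilities (dyadic)
lengths = [1, 2, 3, 3]                 # codeword lengths of some binary code

H    = -sum(pi * log2(pi) for pi in p)             # source entropy, in bits
lbar = sum(pi * li for pi, li in zip(p, lengths))  # average codeword length

eta = H / lbar        # efficiency (M = 2, so no log M factor is needed)
rho = 1 - eta         # redundancy
print(H, lbar, eta, rho)   # 1.75 1.75 1.0 0.0 -> this code is optimal
```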
-
Problem: $l_i = -\log(p(s_i))$ might not be an integer number
- but the codeword lengths must be natural numbers
-
An optimal code = a code that attains the minimum average length $\overline{l} = H(S)$
-
An optimal code can always be found for a source where all $p(s_i)$ are powers of 2
- e.g. $1/2$, $1/4$, ..., $1/2^n$, known as a dyadic distribution
- the lengths $l_i = -\log(p(s_i))$ are all natural numbers => can be attained
- the code with lengths $l_i$ can be found with the graph-based procedure
-
What if $-\log(p(s_i))$ is not a natural number? i.e. $p(s_i)$ is not a power of 2
Shannon's solution: round up to the next natural number $$l_i = \lceil -\log(p(s_i)) \rceil$$
- e.g. $-\log(p(s_i)) = 2.15$ => $l_i = 3$
-
Shannon coding:
- Arrange probabilities in descending order
- Use codeword lengths $l_i = \lceil -\log(p(s_i)) \rceil$
- Find any instantaneous code for these lengths$^{*}$
- $^{*}$ One way to find such a code: for every message $s_i$:
- compute the sum of all the probabilities up to this message
- multiply this value with $2^{l_i}$
- floor the result and convert to binary
-
The code obtained = a "Shannon code"
-
Simple scheme, better algorithms are available
- Example: compute lengths for $S: (0.9, 0.1)$
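A sketch of the procedure above in Python; the first input is the example source above, the second is an extra made-up one:

```python
from math import ceil, log2

def shannon_code(probs):
    """Shannon coding: sort descending, take l_i = ceil(-log2 p_i); the codeword is
    the first l_i bits of the binary expansion of the cumulative probability."""
    probs = sorted(probs, reverse=True)
    cumulative, code = 0.0, []
    for p in probs:
        l = ceil(-log2(p))
        code.append(format(int(cumulative * 2 ** l), f"0{l}b"))   # floor, then to binary
        cumulative += p
    return code

print(shannon_code([0.9, 0.1]))          # ['0', '1110'] : average length 1.3, entropy ~0.47
print(shannon_code([0.5, 0.25, 0.25]))   # ['0', '10', '11'] : optimal (dyadic source)
```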
-
But still enough to prove fundamental results
Theorem:
- The average length of a Shannon code satisfies
$$H(S) \leq \overline{l} < H(S) + 1$$
Proof:
-
The first inequality holds because H(S) is the minimum average length
-
The second inequality:
1. Use the Shannon code lengths: $$l_i = \lceil -\log(p(s_i)) \rceil = -\log(p(s_i)) + \epsilon_i$$ where $0 \leq \epsilon_i < 1$
2. Compute the average length: $$\overline{l} = \sum_i p(s_i) l_i = H(S) + \underbrace{\sum_i p(s_i) \epsilon_i}_{< 1}$$
3. Since $\epsilon_i < 1$ => $\sum_i p(s_i) \epsilon_i < \sum_i p(s_i) = 1$ \qed
-
Average length of Shannon code is at most 1 bit longer than the minimum possible value
- That's quite efficient
- There exist even better codes, in general
-
Q: Can we get even closer to the minimum length?
-
A: Yes, as close as we want!
- In theory, at least ... :)
- See next slide.
Shannon's first theorem (coding theorem for noiseless channels):
- It is possible to encode an infinitely long sequence of messages from a source S with an average length as close as desired to H(S), but never below H(S)
Key points:
- we can always obtain $\overline{l} \to H(S)$
- for an infinitely long sequence
Proof:
- Average length can never go below H(S) because this is minimum
- How can it get very close to H(S) (from above)?
- Use
$n$ -th order extension$S^n$ of S - Use Shannon coding for
$S^n$ , so it satisfies$$H(S^n) \leq \overline{l_{S^n}} < H(S^n) + 1$$ - But
$H(S^n) = n H(S)$ , and average length per message of$S$ is$$\overline{l_{S}} = \frac{\overline{l_{S^n}}}{n}$$ because messages of$S^n$ are just$n$ messages of$S$ glued together - So, dividing by
$n$ :$$\boxed{H(S) \leq \overline{l_{S}} < H(S) + \frac{1}{n}}$$ - If extension order
$n \to \infty$ , then$$\overline{l_{S}} \to H(S)$$ \qed
- Use
- Analogy: how to buy things online without paying for delivery :)
- FanCourier taxes 15 lei per delivery
- not efficient to buy something worth a few lei
- How to improve efficiency? Buy
$n$ things bundled together! - The delivery cost per unit is now
$\frac{15}{n}$ - As
$n \to \infty$ , the delivery cost per unit$\to 0$ - What's 15 lei when you pay
$\infty$ lei...
- What's 15 lei when you pay
- FanCourier taxes 15 lei per delivery
Comments:
- Shannon's first theorem shows that we can approach H(S)
to any desired accuracy using extensions of large order of the source
- This is not practical: the size of
$S^n$ gets too large for large$n$ - Other (better) algorithms than Shannon coding are used in practice to approach
$H(S)$
- This is not practical: the size of
-
Consider a source with probabilities
$p(s_i)$ -
We use a code designed for a different source:
$l_i = -\log(q(s_i))$ -
The message probabilities are
$p(s_i)$ but the code is designed for$q(s_i)$ \ -
Examples:
- design a code based on a sample data file (like in lab)
- but we use it to encode various other files => probabilities might differ slightly
- e.g. design a code based a Romanian text, but encode a text in English \
-
What happens?
-
We lose some efficiency:
- Codeword lengths
$\overline{l_i}$ are not optimal for our source => increased$\overline{l}$
- Codeword lengths
-
If code were optimal, best average length = entropy
$H(S)$ :$$\overline{l_{optimal}} = -\sum{p(s_i) \log{p(s_i)}}$$ -
But the actual average length we obtain is:
$$\overline{l_{actual}} = \sum{p(s_i) l_i} = -\sum{p(s_i) \log{q(s_i)}}$$
- Difference between average lengths is:
-
The difference = the Kullback-Leibler distance between the two distributions
- is always
$\geq 0$ => improper code means increased$\overline{l}$ (bad) - distributions more different => larger average length (worse)
- is always
-
The KL distance between the distributions = the number of extra bits used because of a code optimized for a different distribution $q(s_i)$ than the true distribution of our data $p(s_i)$
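A small numerical illustration in Python (both distributions are made up):

```python
from math import log2

def avg_length(p, q):
    """Average length with ideal (non-integer) lengths -log2 q_i, under true probabilities p."""
    return sum(pi * -log2(qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]    # true message probabilities
q = [0.4, 0.4, 0.2]    # probabilities the code was designed for

H        = avg_length(p, p)    # entropy H(S): best possible average length
l_actual = avg_length(p, q)    # what the mismatched code actually gives
print(l_actual - H)                                       # ~0.265 extra bits per message
print(sum(pi * log2(pi / qi) for pi, qi in zip(p, q)))    # D_KL(p||q): the same value
```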
Reminder: where is the Kullback–Leibler distance used
-
Here: Using a code optimized for a different distribution:
- Average length is increased with $D_{KL}(p || q)$
-
In chapter IV (Channels): Definition of mutual information:
- Distance between $p(x_i \cap y_j)$ and the distribution of two independent variables $p(x_i) \cdot p(y_j)$: $$I(X,Y) = \sum_{i,j} p(x_i \cap y_j) \log\left(\frac{p(x_i \cap y_j)}{p(x_i)p(y_j)}\right)$$
Shannon-Fano (binary) coding procedure:
- Sort the message probabilities in descending order
- Split into two subgroups as nearly equal as possible
- Assign first bit $0$ to the first group, first bit $1$ to the second group
- Repeat on each subgroup
- When reaching one single message => that is the codeword
Example: blackboard
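A recursive sketch of the procedure in Python (the probabilities are a made-up example):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability), sorted by descending probability.
    Returns {symbol: codeword} for a binary Shannon-Fano code."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # choose the split that makes the two groups' total probabilities as equal as possible
    total = sum(p for _, p in symbols)
    acc, best_i, best_diff = 0.0, 1, float("inf")
    for i in range(1, len(symbols)):
        acc += symbols[i - 1][1]
        diff = abs(acc - (total - acc))
        if diff < best_diff:
            best_i, best_diff = i, diff
    code = {}
    for prefix, group in (("0", symbols[:best_i]), ("1", symbols[best_i:])):
        for sym, word in shannon_fano(group).items():
            code[sym] = prefix + word
    return code

print(shannon_fano([("a", 0.35), ("b", 0.25), ("c", 0.2), ("d", 0.15), ("e", 0.05)]))
# {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```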
Comments:
- Shannon-Fano coding does not always produce the shortest code lengths
- Connection: yes-no answers (example from first chapter)
Huffman coding procedure (binary):
- Sort the message probabilities in descending order
- Join the last two probabilities, insert result into existing list, preserve descending order
- Repeat until only two messages are remaining
- Assign first bits $0$ and $1$ to the final two messages
- Go back step by step: every time we had a sum, append $0$ and $1$ to the end of the existing codeword
Example: blackboard
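A compact sketch in Python using `heapq` (the probabilities are a made-up example); repeatedly merging the two smallest probabilities is equivalent to the list-based description above:

```python
import heapq
from itertools import count

def huffman(probs):
    """probs: {symbol: probability}. Returns {symbol: binary codeword}."""
    tiebreak = count()   # avoids comparing dicts when two probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # the two smallest probabilities
        p1, _, code1 = heapq.heappop(heap)
        # prepend 0 to one group's codewords and 1 to the other's, then merge the groups
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}))
# {'a': '0', 'b': '10', 'd': '110', 'c': '111'} -> lengths 1, 2, 3, 3
```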
Properties of Huffman coding:
- Produces a code with the smallest average length (better than Shannon-Fano)
- Assigning $0$ and $1$ can be done in any order => different codes, same lengths
- When inserting a sum into an existing list, it may be equal to another value => options
- we can insert above, below or in-between equal values
- leads to codes with different individual lengths, but same average length
- Some better algorithms exist which do not assign a codeword to every single message (they code a whole sequence at once, not every message)
General Huffman coding procedure for codes with $M$ symbols
-
Have $M$ symbols $\left\lbrace x_1, x_2, ... x_M \right\rbrace$
-
Add together the last $M$ symbols
-
When assigning symbols, assign all $M$ symbols
-
Important: at the final step we must have exactly $M$ remaining values
- It may be necessary to add virtual messages with probability 0 at the end of the initial list, to end up with exactly $M$ messages in the last step
-
Example : blackboard
Example: compare binary Huffman and Shannon-Fano for:
-
For every symbol $x_i$ we can compute the average number of symbols $x_i$ in a code: $$\overline{l_{x_i}} = \sum_i p(s_i) l_{x_i}(s_i)$$
- $l_{x_i}(s_i) =$ number of symbols $x_i$ in the codeword of $s_i$
- e.g.: average number of 0's and 1's in a code
-
-
Divide by the average length => probability (frequency) of symbol $x_i$: $$p(x_i) = \frac{\overline{l_{x_i}}}{\overline{l}}$$
These are the probabilities of the input symbols for the transmission channel
- they play an important role in Chapter IV (transmission channels)
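A small Python check of these two formulas (the source and code are a made-up example):

```python
p    = {"a": 0.5, "b": 0.25, "c": 0.25}     # message probabilities
code = {"a": "0", "b": "10", "c": "11"}     # a binary code for them

lbar = sum(p[s] * len(code[s]) for s in p)              # average codeword length
for x in "01":
    lx = sum(p[s] * code[s].count(x) for s in p)        # average number of symbol x per message
    print(x, lx / lbar)                                  # p(x): frequency of x in the coded stream
# 0 0.5 and 1 0.5 -> this code uses the two symbols equally often
```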
-
Consider that the messages are already written in a binary code
- Example: characters in ASCII code
-
Source coding = remapping the original codewords to other codewords
- The new codewords are shorter, on average
-
This means data compression
- Just like the example in lab session
-
What does data compression remove?
- Removes redundancy: unused bits, patterns, regularities etc.
- If you can guess somehow the next bit in a sequence, it means the bit is not really necessary, so compression will remove it
- The compressed sequence looks like random data: impossible to guess, no discernible patterns
- Consider data compression with Shannon or Huffman coding, like we did in lab
- What property do we exploit in order to obtain compression?
- What does compressible data look like?
- What does incompressible data look like?
- What are the limitations of our data compression method?
- How could it be improved?
- Other types of coding do exist (info only)
- Arithmetic coding
- Adaptive schemes
- etc.
- Average length: $\overline{l} = \sum_i p(s_i) l_i$
- Code types: instantaneous $\subset$ uniquely decodable $\subset$ non-singular
- All instantaneous or uniquely decodable codes must obey Kraft: $$ \sum_i D^{-l_i} \leq 1$$
- Optimal codes: $l_i = -\log(p(s_i))$, $\overline{l_{min}} = H(S)$
- Shannon's first theorem: use the $n$-th order extension of $S$, $S^n$: $$\boxed{H(S) \leq \overline{l_{S}} < H(S) + \frac{1}{n}}$$
- average length always larger, but as close as desired to $H(S)$
- Coding techniques:
- Shannon: ceil the optimal codeword lengths (round to upper)
- Shannon-Fano: split in two groups approx. equal
- Huffman: group the last two. It is the best of all.