# TP2 - Compression, Prediction, Generation: Text Entropy

#### Stella Douka, Guillaume Charpiat 

#### Credits: Gaétan Marceau-Caron, Francesco Pezzicoli

### Introduction

In this TP we are interested in compressing and generating texts written in natural languages.
Given a text of length $n$, a sequence of symbols is just a vector $(x_1, \dots, x_n)$ where each $x_i$ is a symbol, e.g. $x_i \in \{a, b, c, \dots\}$. We define the alphabet of possible symbols as $\mathcal{A} = \{a_1,a_2,\dots,a_M\}$, so each $x_i$ can take $M$ values.

In order to model the sequence of symbols we need a joint probability distribution over the whole sequence, namely $p(X_1 = x_1, X_2 = x_2, \dots , X_n = x_n)$. If our alphabet has $M$ symbols, modelling a sequence of length $n$ in full generality requires $M^n$ probabilities. Some assumptions are therefore needed to reduce this dimensionality. Here we will use two different models for $p$: the IID model and the Markov Chain model.

### IID Model
The IID model assumes:

$$ p(X_1 = x_1, X_2 = x_2, \dots , X_n = x_n) = \prod_{i=1}^n p(X_i = x_i)$$

i.e. that the symbols in a sequence are independent and identically distributed. With this model we need only $M$ probabilities, one for each symbol. One can generalize and use symbols made of several characters rather than one. For example, using 3 characters per symbol, the symbols would be of the form $aaa,aab,\dots,zzz$. When using $k$ characters per symbol with an alphabet of $M$ characters, $M^k$ probabilities are needed.
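
As an illustration of $k$-character symbols, here is a minimal sketch of how the symbol probabilities could be estimated. It assumes the text is cut into non-overlapping chunks of $k$ characters; the helper names `fit_iid` and `entropy_bits` and the toy string are hypothetical and not part of the TP.

```python
from collections import Counter
import math

def fit_iid(text, k=1):
    """Estimate p(symbol) for symbols made of k consecutive characters.

    Assumption: the text is split into non-overlapping chunks of length k;
    a trailing chunk shorter than k is dropped.
    """
    chunks = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(chunks)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probs.values() if p > 0)

toy = "abababab"
p1 = fit_iid(toy, k=1)   # two symbols: 'a' and 'b'
p2 = fit_iid(toy, k=2)   # a single symbol: 'ab'
print(p1, entropy_bits(p1))
print(p2, entropy_bits(p2))
```

On this toy string, single characters look maximally uncertain (two equally likely symbols), while two-character symbols make the text fully predictable, which is the effect a longer symbol is meant to capture.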


### Markov Chain Model

The Markov Chain model assumes a limited range of dependence between the symbols. Indeed, for an order $k$ Markov Chain:


$$p(X_i | X_{i-1},X_{i-2},\dots,X_1) = p(X_i | X_{i-1},X_{i-2},\dots,X_{i-k})$$


The meaning of the above structure is that the $i$-th symbol in the sequence depends only on the previous $k$ symbols. We add the *time-invariance assumption*, meaning that the conditional probabilities do not depend on the time index $i$, i.e. $p(X_i | X_{i-1},X_{i-2},\dots,X_{i-k}) = p(X_{k+1} | X_{k},X_{k-1},\dots,X_{1})$. The most common and widely used Markov Chain is the Markov Chain of order 1:

$$p(X_i | X_{i-1},X_{i-2},\dots,X_1) = p(X_i | X_{i-1})$$

In this case the conditional probability $p(X_i|X_{i-1})$ can be expressed using $M^2$ numbers, usually arranged in the *transition matrix*. Given an alphabet $\mathcal{A} = \{a_1,a_2,\dots,a_M\}$ the transition matrix can be written as:

$$ \mathbb{M}_{kl} = p(X_i = a_k| X_{i-1} = a_l) $$
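
A minimal sketch of how such a matrix could be estimated from counts, using the convention above (column $l$ holds the distribution of the next symbol given the previous symbol $a_l$). The function name `fit_transition_matrix` and the toy string are hypothetical, and plain nested lists are used instead of NumPy to keep the example dependency-free.

```python
def fit_transition_matrix(text):
    """Return (alphabet, M) with M[k][l] = p(X_i = a_k | X_{i-1} = a_l)."""
    alphabet = sorted(set(text))
    index = {ch: i for i, ch in enumerate(alphabet)}
    size = len(alphabet)

    # counts[k][l] = number of times a_l is immediately followed by a_k
    counts = [[0] * size for _ in range(size)]
    for prev, cur in zip(text, text[1:]):
        counts[index[cur]][index[prev]] += 1

    # Normalise each column so that it sums to 1.
    M = [[0.0] * size for _ in range(size)]
    for l in range(size):
        col_total = sum(counts[k][l] for k in range(size))
        for k in range(size):
            # A column with no observed successor is left at zero.
            if col_total > 0:
                M[k][l] = counts[k][l] / col_total
    return alphabet, M

alphabet, M = fit_transition_matrix("abbabbab")
print(alphabet)
print(M)
```

In the toy text, `a` is always followed by `b`, while `b` is followed by `a` or `b` equally often, so the first column is deterministic and the second is uniform.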

### Entropy and Cross-Entropy


- For the IID model of order 1 the entropy computation is straightforward:
$$ H_{IID} = -\sum_{i=1}^M p(a_i) \log p(a_i)$$
and consequently, starting from two distributions $p,q$ fitted on two different texts, the cross-entropy:
$$ CE_{IID} = -\sum_{i=1}^M p(a_i) \log q(a_i)$$


- For the MC model of order 1 the entropy (rate) is defined as follows:
$$ H_{MC} = - \sum_{kl} \pi(a_l)\, p(X_i = a_k| X_{i-1} = a_l) \log \left(p(X_i = a_k| X_{i-1} = a_l)\right)= - \sum_{kl} \pi_l\,\mathbb{M}_{kl} \log \mathbb{M}_{kl}$$
where $\pi$ is the stationary distribution of the Markov Chain, i.e. $\pi_k = \sum_l \mathbb{M}_{kl} \pi_l$: each conditional distribution $p(\cdot \mid a_l)$ is weighted by the probability $\pi_l$ of being in the conditioning state $a_l$. The code to compute the stationary distribution is already given.
The cross-entropy:
$$ CE_{MC} = - \sum_{kl} \pi_l\,\mathbb{M}_{kl} \log \mathbb{M'}_{kl}$$
where $\mathbb{M}$ and $\mathbb{M'}$ are fitted on two different texts.
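
To make the Markov Chain formulas concrete, here is a small sketch that evaluates them on a two-state chain, weighting each column $l$ by the stationary probability of the conditioning state $a_l$. The function names and the second matrix `M_other` are hypothetical; logarithms are taken in base 2, so the results are in bits.

```python
import math

def mc_entropy(M, pi):
    """Entropy rate in bits: -sum_{k,l} pi[l] * M[k][l] * log2(M[k][l])."""
    n = len(M)
    return -sum(pi[l] * M[k][l] * math.log2(M[k][l])
                for k in range(n) for l in range(n) if M[k][l] > 0)

def mc_cross_entropy(M, pi, M_other):
    """Cross-entropy in bits of M_other measured against the chain (M, pi).

    Assumes M_other[k][l] > 0 wherever M[k][l] > 0.
    """
    n = len(M)
    return -sum(pi[l] * M[k][l] * math.log2(M_other[k][l])
                for k in range(n) for l in range(n) if M[k][l] > 0)

M = [[0.7, 0.5], [0.3, 0.5]]        # column l holds p(. | a_l)
pi = [0.625, 0.375]                 # stationary distribution of M
M_other = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical second model (uniform)

H = mc_entropy(M, pi)
CE = mc_cross_entropy(M, pi, M_other)
print(H, CE)
```

Because `M_other` is uniform over two states, every transition costs exactly one bit, so the cross-entropy is 1 bit while the entropy of the chain itself is slightly lower.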


### Theoretical Questions: 

1) Interpret the time invariant assumption associated to our Markov chains in the context of text generation.

2) How can we rewrite a Markov chain of higher order as a Markov chain of order 1?

3) Given a probability distribution over symbols, how to use it for generating sentences?

1.  **Interpret the time invariant assumption associated to our Markov chains in the context of text generation.**
    The time-invariant assumption means that the probability of a particular symbol appearing depends *only* on the preceding *k* symbols (for an order *k* Markov chain), regardless of *where* in the sequence (or text) this pattern occurs. In text generation, this implies that the rules governing which character is likely to follow a specific sequence of *k* characters (e.g., the probability of 'e' following 'th') are considered constant throughout the entire text being generated. The likelihood of a transition doesn't change whether it's at the beginning, middle, or end of the text.

2.  **How can we rewrite a Markov chain of higher order as a Markov chain of order 1?**
    In a Markov chain of order *k*, the current symbol depends on the *k* preceding symbols, $p(X_i | X_{i-1},X_{i-2},\dots,X_{i-k})$, whereas in an order 1 chain it depends only on the single preceding symbol, $p(X_i | X_{i-1})$. An order *k* chain becomes an order 1 chain by redefining the state as the block of the last *k* symbols, $Y_i = (X_{i-k+1},\dots,X_i)$: then $Y_i$ depends only on $Y_{i-1}$, and a transition $Y_{i-1} \to Y_i$ is possible only if the two blocks agree on their $k-1$ shared symbols. The price is a state space of size $M^k$ instead of $M$.

3.  **Given a probability distribution over symbols, how to use it for generating sentences?**
    A fitted distribution is turned into text by sampling symbols one at a time and concatenating them:
    * **IID Model:** Draw each symbol independently from $p(X_i = a_i)$. The output reproduces symbol frequencies but has no local structure.
    * **Markov Chain Model:** Draw an initial state (for example one of the states seen in training), then repeatedly draw the next symbol from $p(X_i | X_{i-1},X_{i-2},\dots,X_{i-k})$ (or just $p(X_i | X_{i-1})$ for order 1) and shift the context by one symbol.
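
The rewriting of a higher-order chain as an order 1 chain can be sketched for order 2, where the new states are pairs of consecutive characters. The function name `fit_pair_chain` and the toy string are hypothetical.

```python
from collections import Counter, defaultdict

def fit_pair_chain(text):
    """First-order chain on pair states Y_i = (X_{i-1}, X_i).

    Returns {state: {next_state: probability}}; the transition
    (a, b) -> (b, c) carries the order-2 probability p(c | a, b).
    """
    counts = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[(a, b)][(b, c)] += 1
    return {state: {nxt: n / sum(ctr.values()) for nxt, n in ctr.items()}
            for state, ctr in counts.items()}

chain = fit_pair_chain("abcabdabc")
print(chain[("a", "b")])
```

Every reachable next state starts with the second character of the current state, which is the overlap constraint that makes the pair chain equivalent to the original order 2 chain.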

### Practical questions

In order to construct our IID and Markov Chain models we need some text. Our source will be a set of classical novels available at: https://www.lri.fr/~gcharpia/informationtheory/TP2_texts.zip

We will use the symbols in each text to learn the probabilities of each model. The alphabet we suggest for the characters is `string.printable`, which contains 100 characters (see below).

For both models, perform the following steps:

1) For different orders of dependencies, train the model on a novel and compute the associated entropy. What do you observe as the order increases? Explain your observations.

2) Use the other novels as test sets and compute the cross-entropy for each model trained previously. How to handle symbols (or sequences of symbols) not seen in the training set?

3) For each order of dependencies, compare the cross-entropy with the entropy. Explain and interpret the differences.

4) Choose the order of dependencies with the lowest cross-entropy and generate some sentences.

5) Train one model per novel and use the KL divergence in order to cluster the novels.


<b>Hints</b> : 

- In the MC case limit yourself to order $2$ (the computation can become quite expensive). If you have $ M \sim 100$ characters, for order $1$ you will need a $\sim 100 \times 100$ matrix, for order $2$ a $\sim 10^4 \times 10^4$ matrix.

- For the second order MC model you need to compute: $p(X_{i+1},X_{i}|X_{i},X_{i-1})$

- The two models can be implemented efficiently with dictionaries in Python. For the IID model, a key of the dictionary is simply a symbol and the value is the number of occurrences of the symbol in the text. For a Markov chain, a key of the dictionary is also a symbol, but the value is a vector that contains the number of occurrences of each character of the alphabet. Notice that a symbol may consist of one or several characters. Note also that there is no need to explicitly consider all possible symbols; the ones that are observed in the training set are sufficient.

- A low probability can be assigned to symbols not observed in the training set.
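
One possible way to follow the last two hints is additive (Laplace) smoothing on a dictionary of counts. This is a swapped-in technique shown as a sketch, not the method prescribed by the TP: the function name `fit_smoothed` and the pseudo-count `alpha` are assumptions. Unlike a fixed probability floor, it keeps the distribution normalized.

```python
from collections import Counter
import string

def fit_smoothed(text, alphabet=string.printable, alpha=1.0):
    """Additive smoothing: p(a) = (count(a) + alpha) / (n + alpha * |alphabet|).

    Every character of the alphabet gets a strictly positive probability,
    including those never observed in the text.
    """
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values()) + alpha * len(alphabet)
    return {ch: (counts[ch] + alpha) / total for ch in alphabet}

probs = fit_smoothed("hello world")
print(probs["l"], probs["z"], sum(probs.values()))
```

A smaller `alpha` moves less probability mass to unseen characters; which value works best is an empirical question that can be checked with the cross-entropy on a held-out text.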

#### Computing stationary distribution 

Here we provide two versions of the function that computes the stationary distribution of a Markov chain, together with a small example.

In [1]:
#direct way to find pi (can be slow)
import  numpy  as np

def Compute_stationary_distribution(P_kl):
    ## P_kl must be the transition matrix from state l to state k!
    evals , evecs = np.linalg.eig(P_kl)   
    evec1 = evecs[:,np.isclose(evals , 1)]
    evec1 = evec1 [:,0]
    pi = evec1 / evec1.sum()
    pi = pi.real #stationary  probability
    
    return pi 

#iterative way (should be faster)
def Compute_stationary_distribution_it(P_kl, n_it):
    pi = np.random.uniform(size=P_kl.shape[0]) #initial state, can be a random one!
    pi /= pi.sum()
    #print(pi,pi.sum())
    for t in range(n_it):   
        pi = np.matmul(P_kl,pi)
    
    return pi

In [2]:
##simple example of computation of stationary distribution 

n_it = 1000  # remember to check that n_it is large enough to reach convergence
P_kl = np.array([[0.7,0.5],[0.3,0.5]])
Compute_stationary_distribution_it(P_kl,n_it)

array([0.625, 0.375])
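
Since the number of iterations has to be checked by hand, one option is to stop the power iteration automatically once the update becomes negligible. The sketch below is an assumption-laden variant, not the provided function: the name `stationary_until_converged`, the tolerance `tol` and the cap `max_it` are hypothetical, and plain lists are used so that it runs without NumPy.

```python
def stationary_until_converged(P_kl, tol=1e-12, max_it=10_000):
    """Power iteration pi <- P_kl @ pi, stopped once the update is below tol.

    P_kl[k][l] is the probability of moving from state l to state k.
    Returns (pi, number_of_iterations).
    """
    n = len(P_kl)
    pi = [1.0 / n] * n                       # start from the uniform distribution
    for it in range(1, max_it + 1):
        new_pi = [sum(P_kl[k][l] * pi[l] for l in range(n)) for k in range(n)]
        total = sum(new_pi)
        new_pi = [x / total for x in new_pi]  # renormalise to guard against drift
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi, it
        pi = new_pi
    raise RuntimeError("power iteration did not converge")

pi, n_it = stationary_until_converged([[0.7, 0.5], [0.3, 0.5]])
print(pi, n_it)
```

On the example matrix this reaches the same stationary distribution as above in far fewer than 1000 iterations, because the error shrinks geometrically at each step.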

#### Defining the Alphabet

Example of loading a text and filtering out the characters that are not in the chosen alphabet

In [4]:
import  string

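def import_text(file_name):
    """Read a text file and drop characters outside string.printable."""
    with open(file_name, encoding='utf-8') as f:
        lines = f.readlines()
    # readlines() keeps each line's trailing '\n', so joining with '\n'
    # doubles every line break; use ''.join(lines) to keep single breaks.
    text = '\n'.join(lines)
    printable = set(string.printable)
    return ''.join(ch for ch in text if ch in printable)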

text = import_text('./texts/Alighieri.txt')

#### IID - MODEL

In [11]:
import numpy as np
import math
from collections import defaultdict, Counter 

class IIDModel:
    def __init__(self, order=1):
        self.order = order  # stored only: process() always counts single characters
        self.prob_dict = {}
        self.alphabet = None
        self.total_count = 0

    def process(self, text):
        counts = Counter(text)
        self.total_count = sum(counts.values())

        self.alphabet = list(set(text))

        self.prob_dict = {}
        for sym in self.alphabet:
            self.prob_dict[sym] = counts[sym] / self.total_count

    def getEntropy(self):
        H = 0.0
        for sym, p in self.prob_dict.items():
            if p > 0:
                H -= p * math.log(p, 2)
        return H

    def getCrossEntropy(self, text):
        n = len(text)
        if n == 0:
            return 0.0

        eps = 1e-8  # probability floor for symbols unseen in training
        logsum = 0.0
        for x in text:
            if x in self.prob_dict:
                px = self.prob_dict[x]
            else:
                px = eps
            logsum += math.log(px, 2)
        CE = -logsum / n
        return CE

    def generate(self, length):
        if not self.prob_dict:
            return ""
        import random
        symbols = list(self.prob_dict.keys())
        probs   = list(self.prob_dict.values())
        out = []
        for _ in range(length):
            out.append(random.choices(symbols, weights=probs, k=1)[0])
        return ''.join(out)

In [12]:
##clustering texts 

def KL_divergence(dist1, dist2):
    kl = 0.0
    for x, p1 in dist1.items():
        if p1 <= 0:
            continue
        p2 = dist2.get(x, 1e-12)
        kl += p1 * math.log(p1 / p2, 2)
    return kl
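
For the clustering question, the KL divergence is not symmetric, so one common workaround is to symmetrize it before comparing novels. The sketch below is illustrative only: the three distributions are made-up stand-ins for fitted models, and the names `kl_bits`, `symmetric_kl` and `models` are hypothetical.

```python
import math
from itertools import combinations

def kl_bits(p, q, floor=1e-12):
    """KL(p || q) in bits; symbols missing from q get the probability `floor`."""
    return sum(pv * math.log2(pv / q.get(x, floor))
               for x, pv in p.items() if pv > 0)

def symmetric_kl(p, q):
    """Symmetrised divergence, so that d(p, q) == d(q, p)."""
    return kl_bits(p, q) + kl_bits(q, p)

# Hypothetical character distributions standing in for fitted novels.
models = {
    "novel_A": {"a": 0.5, "b": 0.3, "c": 0.2},
    "novel_B": {"a": 0.45, "b": 0.35, "c": 0.2},
    "novel_C": {"a": 0.1, "b": 0.1, "c": 0.8},
}

# Pairwise distance table; the closest pair would be merged first
# by an agglomerative clustering procedure.
distances = {(m, n): symmetric_kl(models[m], models[n])
             for m, n in combinations(models, 2)}
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```

The resulting distance table can be passed to any clustering method that accepts pairwise dissimilarities; with real novels one would expect texts in the same language to end up closest.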

#### MARKOV CHAIN - MODEL

In [15]:
class MarkovModel:
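    def __init__(self, order=2):
        # Only orders 1 and 2 are implemented by the methods below.
        if order not in (1, 2):
            raise ValueError("MarkovModel supports order 1 or 2 only")
        self.order = order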
        self.transitions = defaultdict(Counter)
        self.alphabet = None
        self.transition_probs = defaultdict(dict)

    def process(self, text):
        if len(text) < self.order:
            return

        self.alphabet = list(set(text))

        if self.order == 1:
            for i in range(len(text) - 1):
                state = text[i]
                nxt   = text[i+1]
                self.transitions[state][nxt] += 1

        elif self.order == 2:
            for i in range(len(text) - 2):
                state = (text[i], text[i+1])
                nxt   = text[i+2]
                self.transitions[state][nxt] += 1

        for state, counter_next in self.transitions.items():
            total = sum(counter_next.values())
            for nxt_sym, cnt in counter_next.items():
                self.transition_probs[state][nxt_sym] = cnt / total

    def getEntropy(self):
        if not self.transition_probs:
            return 0.0

        states = list(self.transition_probs.keys())
        idx_map = {s: i for i, s in enumerate(states)}

        n_states = len(states)
        P_kl = np.zeros((n_states, n_states))

        if self.order == 1:
            for s in states:
                i_s = idx_map[s]
                for nxt_sym, pval in self.transition_probs[s].items():
                    i_next = idx_map[nxt_sym] if nxt_sym in idx_map else None
                    if i_next is not None:
                        P_kl[i_next, i_s] = pval

        elif self.order == 2:
            for s in states:
                i_s = idx_map[s]
                (a, b) = s
                for nxt_sym, pval in self.transition_probs[s].items():
                    new_state = (b, nxt_sym)
                    i_next = idx_map[new_state] if new_state in idx_map else None
                    if i_next is not None:
                        P_kl[i_next, i_s] = pval

        pi = Compute_stationary_distribution_it(P_kl, 1000)

        H = 0.0
        for s in states:
            i_s = idx_map[s]
            pi_s = pi[i_s]
            for nxt_sym, pval in self.transition_probs[s].items():
                if pval > 0:
                    H -= pi_s * pval * math.log(pval, 2)
        return H

    def getCrossEntropy(self, text):
        n = len(text)
        if self.order == 1:
            if n < 2:
                return 0.0
            log_sum = 0.0
            count_trans = 0
            for i in range(n - 1):
                state = text[i]
                nxt   = text[i+1]
                p = self.transition_probs.get(state, {}).get(nxt, 1e-8)
                log_sum += math.log(p, 2)
                count_trans += 1
            return -log_sum / count_trans

        elif self.order == 2:
            if n < 3:
                return 0.0
            log_sum = 0.0
            count_trans = 0
            for i in range(n - 2):
                state = (text[i], text[i+1])
                nxt   = text[i+2]
                p = self.transition_probs.get(state, {}).get(nxt, 1e-8)
                log_sum += math.log(p, 2)
                count_trans += 1
            return -log_sum / count_trans

    def generate(self, length):
        if not self.transition_probs:
            return ""
        import random

        states = list(self.transition_probs.keys())
        out = []

        if self.order == 1:
            state = random.choice(states)
            out.append(state)
            for _ in range(length - 1):
                next_dict = self.transition_probs[state]
                if not next_dict:
                    break
                symbols = list(next_dict.keys())
                probs   = list(next_dict.values())
                nxt = random.choices(symbols, weights=probs, k=1)[0]
                out.append(nxt)
                state = nxt
            return ''.join(out)

        elif self.order == 2:
            state = random.choice(states)
            out.append(state[0])
            out.append(state[1])
            for _ in range(length - 2):
                next_dict = self.transition_probs[state]
                if not next_dict:
                    break
                symbols = list(next_dict.keys())
                probs   = list(next_dict.values())
                nxt = random.choices(symbols, weights=probs, k=1)[0]
                out.append(nxt)
                state = (state[1], nxt)
            return ''.join(out)

In [17]:
iid_model = IIDModel(order=1)
iid_model.process(text)
H_iid = iid_model.getEntropy()
print("IID Entropy (training):", H_iid)

CE_iid_self = iid_model.getCrossEntropy(text)
print("IID Cross-Entropy on itself:", CE_iid_self)

gen_iid = iid_model.generate(200)
print("\nIID Generated Text (first 200 chars):\n", gen_iid)

markov1 = MarkovModel(order=1)
markov1.process(text)
H_m1 = markov1.getEntropy()
print("\nMarkov(1) Entropy (training):", H_m1)
CE_m1_self = markov1.getCrossEntropy(text)
print("Markov(1) Cross-Entropy on itself:", CE_m1_self)

gen_m1 = markov1.generate(200)
print("Markov(1) Generated Text (first 200 chars):\n", gen_m1)

markov2 = MarkovModel(order=2)
markov2.process(text)
H_m2 = markov2.getEntropy()
print("\nMarkov(2) Entropy (training):", H_m2)
CE_m2_self = markov2.getCrossEntropy(text)
print("Markov(2) Cross-Entropy on itself:", CE_m2_self)

gen_m2 = markov2.generate(200)
print("Markov(2) Generated Text (first 200 chars):\n", gen_m2)

test_text = import_text("./texts/Dostoevsky.txt")

CE_iid_test = iid_model.getCrossEntropy(test_text)
CE_m1_test = markov1.getCrossEntropy(test_text)
CE_m2_test = markov2.getCrossEntropy(test_text)

print("\nCross-Entropy on Dostoevsky (IID):   ", CE_iid_test)
print("Cross-Entropy on Dostoevsky (M1):    ", CE_m1_test)
print("Cross-Entropy on Dostoevsky (M2):    ", CE_m2_test)

iid_model_dost = IIDModel(order=1)
iid_model_dost.process(test_text)
kl_alighieri_dost = KL_divergence(iid_model.prob_dict, iid_model_dost.prob_dict)
kl_dost_alighieri = KL_divergence(iid_model_dost.prob_dict, iid_model.prob_dict)

print("\nKL divergence Alighieri -> Dostoevsky:", kl_alighieri_dost)
print("KL divergence Dostoevsky -> Alighieri:", kl_dost_alighieri)

IID Entropy (training): 4.1898567285596116
IID Cross-Entropy on itself: 4.189856728559201

IID Generated Text (first 200 chars):
 ia danEaaroVtars og:a i u d oe
smeOi
col  nd eon 
 te lcza  tfdnori
eieu teetzacFsrio n

ls

 ohnltuicnu,hsioial
clile ll

m ogeosrlher lUmnpad
 o
c tfomt
cc r pet
ooolle
me dusi mdtnoi urdiu anrl;com

Markov(1) Entropy (training): 3.171667169429363
Markov(1) Cross-Entropy on itself: 3.1716779288317674
Markov(1) Generated Text (first 200 chars):
 Atrte co pemonalce colui roi.

 Quagr  covovenoin dagndome i  panzale

 frimmo manciaranterdior de
 nocel prllegncheraio,     Ditiano   l ola moici pa vora.
 pe so ve da che linacher a naso ma
  ca pe

Markov(2) Entropy (training): 2.611189845967297
Markov(2) Cross-Entropy on itself: 2.611204100825836
Markov(2) Generated Text (first 200 chars):
 be quarvea ca nossiersia que chEsta guanto ria!;

  cognon s i suov , andon dalto milava vogna n se surgatte, quel per ch lidistre

  pancia cote, se: Comento,


  no ventu 

1) As the Markov order goes from IID (order=0) to Markov(1) to Markov(2), the **entropy decreases** (e.g., from ~4.19 bits to ~3.17 bits to ~2.61 bits per character). The **higher the order**, the more context the model uses to predict the next symbol. This **increases predictability** (reduces uncertainty) and hence **lowers the entropy**. An IID model knows nothing about context; Markov(1) uses the previous symbol, and Markov(2) uses the previous two symbols, thereby reducing uncertainty even more.

2) When testing the Alighieri-trained model on a *different* author (e.g., Dostoevsky), the cross-entropy can become **quite high**, especially for **Markov(2)**, because the **second-order** context patterns learned from Alighieri’s text rarely match Dostoevsky’s. If the test text contains **new** symbols or transitions not present in training, the model would naively assign probability **zero**, making $-\log p$ infinite. A possible remedy is **smoothing**: assigning a small positive probability to unseen events, which keeps the cross-entropy finite. The code above uses the simplest variant, a fixed floor of 1e-8, which does not renormalize the distribution, so the model's probabilities no longer sum exactly to 1.

3) **Entropy** is computed from the model itself, so it measures how uncertain the model is about text drawn from its **own** training distribution. **Cross-entropy** is measured on a given text, possibly from a different distribution. On the **training** text, cross-entropy matches entropy: up to floating-point rounding for IID (4.1899 in both cases), and up to a small gap for the Markov models (3.17167 vs. 3.17168 for order 1), where the entropy weights contexts by the stationary distribution rather than by their empirical counts. On **unseen** text with distribution $p$ and a trained model $q$, the cross-entropy decomposes as $H(p, q) = H(p) + KL(p \| q)$, so it is at least the entropy of the test text itself, and the excess is the **KL divergence**, i.e. the **distribution mismatch**.

   The IID cross-entropy on Dostoevsky is 5.59 bits/char, vs. 7.90 for Markov(1) and 11.90 for Markov(2). Markov(2) does worst because the second-order model overfits patterns specific to Alighieri’s text that do not appear in Dostoevsky: every unseen transition receives the floor probability 1e-8 and costs $-\log_2 10^{-8} \approx 26.6$ bits, so even a modest share of unseen transitions dominates the average. The IID model does not specialize as heavily and therefore generalizes somewhat better to a different style of text, hence its lower cross-entropy.
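
The relation between entropy, cross-entropy and KL divergence can be checked numerically for the IID case. The two distributions below are hypothetical and chosen so that every logarithm is exact; the function names are illustrative.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pv * math.log2(pv) for pv in p.values() if pv > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return -sum(pv * math.log2(q[x]) for x, pv in p.items() if pv > 0)

def kl(p, q):
    """KL(p || q) in bits; same assumption on q."""
    return sum(pv * math.log2(pv / q[x]) for x, pv in p.items() if pv > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical test-text distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # hypothetical trained model

print(entropy(p), cross_entropy(p, q), kl(p, q))   # 1.5 1.75 0.25
```

The cross-entropy (1.75 bits) equals the entropy of $p$ (1.5 bits) plus the divergence (0.25 bits), which is the decomposition $H(p, q) = H(p) + KL(p \| q)$.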

Overall, the results illustrate that the Markov(2) model is highly specialized to Alighieri’s text: it captures that text's regularities well but models a different text (Dostoevsky) poorly. A higher-order model achieves lower entropy on its training text (more context means less uncertainty), whereas on unseen text its cross-entropy can be much higher unless unseen events are handled by smoothing. In every case the cross-entropy on a text is at least that text's own entropy, the gap being the KL divergence between the text's distribution and the model.
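
A quick diagnostic for how much of a cross-entropy comes from unseen events is to measure the share of test transitions that never occurred in training. The sketch below uses toy strings; the helper name `unseen_fraction` is hypothetical, and the 1e-8 floor matches the one used in the models above.

```python
import math

def unseen_fraction(train_text, test_text, order=1):
    """Share of test (context, next symbol) events never observed in training."""
    seen = {train_text[i:i + order + 1]
            for i in range(len(train_text) - order)}
    events = [test_text[i:i + order + 1]
              for i in range(len(test_text) - order)]
    return sum(e not in seen for e in events) / len(events)

floor_cost = -math.log2(1e-8)   # bits charged for each floored event
frac = unseen_fraction("abababab", "ababcbab", order=1)

# Contribution of unseen events alone to the average bits per character.
print(floor_cost, frac, frac * floor_cost)
```

Running the same function with `order=2` on real novels would show how quickly the unseen share grows with the order, which is one way to decide whether a better smoothing scheme is worth the effort.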