# Chapter 6: Topics in Information Theory and Machine Learning
Computing entropies, cross-entropies, and the Kullback-Leibler divergence

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## The Modules

In [1]:
import regex as re
from collections import Counter
from math import log2

## The Dataset

In [2]:
PATH_CORPUS = '../datasets/'

# Computation of entropy with file
salammbo_f = PATH_CORPUS + 'salammbo_chapters/salammbo_wikisource.txt'

# Computation of cross-entropies with files P and M
FILES_P = ['salammbo_chapters/salammbo_train_wikisource.txt',
           'salammbo_chapters/salammbo_ch15.txt',
           'modern_classics/notredame.txt',
           'modern_classics/1984.txt']
FILE_M = 'salammbo_chapters/salammbo_train_wikisource.txt'

files_p = [PATH_CORPUS + file for file in FILES_P]
file_m = PATH_CORPUS + FILE_M

## Corpus Processing
We define functions to normalize the corpus and compute the relative frequencies of its characters.

In [3]:
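def normalize(corpus: str,
              upper: bool = False) -> str:
    """
    Normalizes the text to ease the counts:
    replaces each whitespace character with a space,
    then removes the remaining control characters (code points below 32)
    :param corpus: the text to normalize
    :param upper: if True, sets the text in uppercase
    :return: the normalized text
    """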
    # Comment the next line to have the results from the 2014 edition
    corpus = re.sub(r'\s', ' ', corpus)
    corpus = re.sub(r'[\x00-\x1F]', '', corpus)
    if upper:
        corpus = corpus.upper()
    return corpus


def rel_freqs(corpus: str) -> dict[str, float]:
    """
    Computes the relative frequencies of the chars in a corpus
    """
    counts = Counter(corpus)
    total = counts.total()
    return {key: val/total
            for key, val in counts.items()}
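
As a quick check on a toy string, the sketch below restates `rel_freqs()` in a self-contained form and applies it. The string `'abracadabra'` and the name `toy_freqs` are illustrative assumptions, not part of the corpus. Here `sum(counts.values())` stands in for `counts.total()`, which requires Python 3.10 or later.

```python
from collections import Counter


def rel_freqs(corpus: str) -> dict[str, float]:
    """Relative frequencies of the chars in a corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {key: val / total for key, val in counts.items()}


# Hypothetical toy input: 11 chars, 'a' occurs 5 times
toy_freqs = rel_freqs('abracadabra')
print(toy_freqs['a'])            # 5/11
print(sum(toy_freqs.values()))   # the relative frequencies sum to 1, up to rounding
```

The relative frequencies form a probability distribution over the characters, which is the input the entropy computations expect.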

In [4]:
# The with statement closes the file once it is read
with open(salammbo_f) as f:
    text_s = normalize(f.read()).strip()

chars_s_upper = Counter(text_s.upper())
print('Frequencies of characters. Letters set in uppercase')
total_chars = 0
for char in sorted(chars_s_upper):
    print(char, chars_s_upper[char])
    if re.search(r'[\p{L} ]', char):
        total_chars += chars_s_upper[char]
print('Total chars:', total_chars)
print()

Frequencies of characters. Letters set in uppercase
  103496
! 939
' 5992
( 21
) 21
, 9643
- 1518
. 4584
: 363
; 1765
? 178
A 42439
B 5757
C 14202
D 18907
E 71186
F 4993
G 5148
H 5293
I 33627
J 1220
K 92
L 30960
M 13090
N 32911
O 22647
P 13161
Q 3964
R 33555
S 46753
T 35084
U 29268
V 6916
W 1
X 2206
Y 1232
Z 413
« 563
» 559
À 1884
Â 605
Æ 9
Ç 452
È 2002
É 7709
Ê 898
Ë 6
Î 277
Ï 66
Ô 397
Ù 179
Û 213
Œ 96
Total chars: 593314



## Entropy
We define functions to compute the entropy of a distribution and the cross-entropy between two distributions.
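
For reference, with $P$ the distribution of the characters in a text and $M$ the distribution given by a model, the standard definitions that these functions follow are given here, with logarithms in base 2 so that the results are in bits:

```latex
\begin{align*}
H(P)    &= -\sum_{x} P(x) \log_2 P(x), \\
H(P, M) &= -\sum_{x} P(x) \log_2 M(x).
\end{align*}
```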

In [5]:
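def entropy(rel_frequency: dict[str, float]) -> float:
    """
    Computes the entropy, in bits, of a distribution of relative frequencies
    """
    # The local names avoid shadowing the rel_freqs() and entropy() functions
    entropy_v = 0.0
    for char in rel_frequency:
        entropy_v -= rel_frequency[char] * log2(rel_frequency[char])
    return entropy_v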


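def cross_entropy(rel_frequency_p: dict[str, float],
                  rel_frequency_m: dict[str, float]) -> float:
    """
    Computes the cross-entropy, in bits, of P with respect to the model M
    """
    cross_entropy = 0.0
    for char in rel_frequency_p:
        # The chars of P unseen in M are skipped: their term would be infinite.
        # When this happens, the result underestimates the cross-entropy
        if rel_frequency_m.get(char, 0.0) != 0.0:
            cross_entropy -= rel_frequency_p[char] * \
                log2(rel_frequency_m[char])
    return cross_entropy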
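
A minimal sketch on toy distributions may help to see what these functions return. It restates the two functions compactly so that it runs on its own. The distributions `fair`, `biased`, and `partial` are hypothetical and not taken from the corpus.

```python
from math import log2


def entropy(rel_frequency: dict[str, float]) -> float:
    return -sum(p * log2(p) for p in rel_frequency.values())


def cross_entropy(rel_frequency_p: dict[str, float],
                  rel_frequency_m: dict[str, float]) -> float:
    # As in the function above, the symbols of P unseen in M are skipped
    return -sum(p * log2(rel_frequency_m[char])
                for char, p in rel_frequency_p.items()
                if rel_frequency_m.get(char, 0.0) != 0.0)


# Hypothetical toy distributions
fair = {'a': 0.5, 'b': 0.5}
biased = {'a': 0.25, 'b': 0.75}
partial = {'a': 0.5, 'c': 0.5}   # has never seen 'b'

h_fair = entropy(fair)                       # 1 bit
h_fair_biased = cross_entropy(fair, biased)  # about 1.2075 bits
h_fair_partial = cross_entropy(fair, partial)
print(h_fair, h_fair_biased, h_fair_partial)
```

With `partial` as model, the term for `'b'` is skipped and the result, 0.5, falls below the entropy of `fair`. A true cross-entropy is never lower than the entropy, so such a value signals that some symbols of P were missing from M.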

## Results

We compute the character entropy of our corpus, *Salammbô*, this time keeping the distinction between lowercase and uppercase letters.

In [6]:
print('Frequencies of characters')
chars_s = Counter(text_s)
for char in sorted(chars_s):
    print(char, chars_s[char])
print()

rel_frequency_s = rel_freqs(text_s)
entropy_s = entropy(rel_frequency_s)
print('Entropy Salammbô:', entropy_s)

Frequencies of characters
  103496
! 939
' 5992
( 21
) 21
, 9643
- 1518
. 4584
: 363
; 1765
? 178
A 589
B 389
C 807
D 486
E 525
F 15
G 180
H 449
I 751
J 93
K 53
L 993
M 806
N 226
O 281
P 345
Q 177
R 143
S 812
T 321
U 178
V 53
X 2
Y 3
Z 48
a 41850
b 5368
c 13395
d 18421
e 70661
f 4978
g 4968
h 4844
i 32876
j 1127
k 39
l 29967
m 12284
n 32685
o 22366
p 12816
q 3787
r 33412
s 45941
t 34763
u 29090
v 6863
w 1
x 2204
y 1229
z 365
« 563
» 559
À 1
Ç 4
à 1883
â 605
æ 9
ç 448
è 2002
é 7709
ê 898
ë 6
î 277
ï 66
ô 397
ù 179
û 213
œ 96

Entropy Salammbô: 4.370305083521603


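We then compute the cross-entropies of the different texts, P, with respect to the model, M, estimated from the *Salammbô* training file, as well as the Kullback-Leibler divergences.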

In [7]:
print('Cross-entropies')
print("{:7} {:7} {:7} {:7}".format('H(P)', 'H(M)', 'H(P, M)', 'D(P||M)'))
# The model M is the same for all the texts P. We compute it once
text_m = normalize(open(file_m).read().strip())
rel_frequency_m = rel_freqs(text_m)
entropy_m = entropy(rel_frequency_m)

for file_p in files_p:
    text_p = normalize(open(file_p).read().strip())
    rel_frequency_p = rel_freqs(text_p)

    entropy_p = entropy(rel_frequency_p)
    cross_entropy_v = cross_entropy(rel_frequency_p, rel_frequency_m)
    # D(P||M) = H(P, M) - H(P)
    divergence = - entropy_p + cross_entropy_v
    print("{:6.5f} {:6.5f} {:6.5f} {:6.5f}".format(entropy_p, entropy_m,
                                                   cross_entropy_v, divergence))

Cross-entropies
H(P)    H(M)    H(P, M) D(P||M)
4.37168 4.37168 4.37168 0.00000
4.31338 4.37168 4.32544 0.01206
4.42285 4.37168 4.44187 0.01902
4.34982 4.37168 4.79617 0.44635
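
The last column follows from the identity relating the Kullback-Leibler divergence to the cross-entropy and the entropy. As a check, the rows of the table, in the order of `FILES_P`, give:

```latex
\begin{align*}
D(P \| M) &= \sum_{x} P(x) \log_2 \frac{P(x)}{M(x)} = H(P, M) - H(P) \\
\text{salammbo\_train\_wikisource:} \quad & 4.37168 - 4.37168 = 0.00000 \\
\text{salammbo\_ch15:} \quad & 4.32544 - 4.31338 = 0.01206 \\
\text{notredame:} \quad & 4.44187 - 4.42285 = 0.01902 \\
\text{1984:} \quad & 4.79617 - 4.34982 = 0.44635
\end{align*}
```

The divergence is zero in the first row because P and M are estimated from the same file.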


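And the perplexities and cross-perplexities, defined as 2 raised to the power of the entropy or cross-entropy. The column headers reuse the entropy names, H(P), H(M), and H(P, M), but the values printed are the corresponding perplexities.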

In [8]:
print('Cross-perplexity')
print("{:6} {:6} {:6}".format('H(P)', 'H(M)', 'H(P, M)'))
# The model M is the same for all the texts P. We compute it once
text_m = normalize(open(file_m).read().strip())
rel_frequency_m = rel_freqs(text_m)
entropy_m = entropy(rel_frequency_m)

for file_p in files_p:
    text_p = normalize(open(file_p).read().strip())
    rel_frequency_p = rel_freqs(text_p)

    entropy_p = entropy(rel_frequency_p)
    cross_entropy_v = cross_entropy(rel_frequency_p, rel_frequency_m)
    print("{:4.2f} {:4.2f} {:4.2f}".format(2**entropy_p, 2**entropy_m,
                                           2**cross_entropy_v))

Cross-perplexity
H(P)   H(M)   H(P, M)
20.70 20.70 20.70
19.88 20.70 20.05
21.45 20.70 21.73
20.39 20.70 27.78
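
As a sanity check, the sketch below recomputes the first and last columns of this table from the entropies and cross-entropies printed earlier. The values are copied from that output, already rounded to five decimals, so a difference in the last digit is possible. The list names are illustrative.

```python
# H(P) and H(P, M) in bits, copied from the cross-entropy table
entropies_p = [4.37168, 4.31338, 4.42285, 4.34982]
cross_entropies = [4.37168, 4.32544, 4.44187, 4.79617]

# The perplexity is 2 raised to the power of the entropy
perplexities = [2 ** h for h in entropies_p]
cross_perplexities = [2 ** h for h in cross_entropies]

for pp, cpp in zip(perplexities, cross_perplexities):
    print(f'{pp:4.2f} {cpp:4.2f}')
```

A perplexity of about 20.7 can be read as the model hesitating, on average, between roughly 21 equally likely characters.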
