In [11]:
import numpy as np
from scipy.stats import entropy

We want to measure the 'peakiness' of the distribution of PMI values for a given word. Given a PMI matrix in which each row holds the PMI of one word with every other word in the sentence, define peakiness as a function of a row of PMI values:
$$\text{peakiness}(\text{row}) = 1 - \frac{S(\text{row})}{\log_2(n)}$$
where $n$ is the number of entries in the row (the sentence length, or one less once the diagonal entry is removed, as in the loop below) and
$$S(\text{row}) = -\sum_{p \in \hat{\text{row}}} p \log_2(p)$$
is the entropy of $\hat{\text{row}}$, the row normalized to sum to 1 and treated as a probability vector. Peakiness is 1 when all the mass sits on a single entry and 0 when the row is uniform.

In [139]:
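# examples
def peakiness(vec):
    # entropy() normalizes vec to sum to 1; dividing by log2(len) scales the result to [0, 1]
    return 1 - entropy(vec, base=2) / np.log2(len(vec))

examples = ([1, 0, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1])
for row in examples:
    print(row, peakiness(row))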

[1, 0, 0, 0] 1.0
[0, 1, 1, 1] 0.20751874963942196
[1, 1, 1, 1] 0.0
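
As a cross-check of the definition, the sketch below recomputes peakiness by hand with NumPy instead of `scipy.stats.entropy`. The name `peakiness_manual` is hypothetical and not part of the notebook. It assumes a nonnegative vector with at least two entries and a positive sum, and uses the convention that $0 \log_2 0 = 0$.

```python
import numpy as np

def peakiness_manual(vec):
    """1 - H(p) / log2(n), where p is vec normalized to sum to 1 (hypothetical helper)."""
    p = np.asarray(vec, dtype=float)
    p = p / p.sum()                  # normalize to a probability vector
    nz = p[p > 0]                    # drop zeros: 0 * log2(0) is taken to be 0
    h = -np.sum(nz * np.log2(nz))    # Shannon entropy in bits
    return 1 - h / np.log2(len(p))

for row in ([1, 0, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]):
    print(row, peakiness_manual(row))
```

The printed values should agree with the three results above up to floating-point rounding. The middle case works out to $1 - \log_2(3)/2$.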


In [150]:
# example: peakiness of every row of the PMI matrices from one run
RESULTS_DIR = "results/distilbert-base-cased(2)_pad10_2020-06-30-12-38/"
npz = np.load(RESULTS_DIR + 'pmi_matrices.npz')

for sentence, matrix in npz.items():
    print(sentence)
    matrix = matrix + np.transpose(matrix)  # symmetrize
    for i, row in enumerate(matrix):
        row -= min(row)  # shift so the smallest value (diagonal included) becomes 0
        row = row[np.arange(len(row)) != i]  # remove the diagonal entry
        print(peakiness(row))
    print()

We 're about to see if advertising works .
0.5184355844078246
0.2486014563695923
0.15419496887546558
0.3623848885319584
0.29113989940705765
0.2479175793387406
0.18143920109804623
0.2830238145592764
0.21864999485332337

Odds and Ends
0.02699802995067968
0.016740503786128125
0.0825769775345
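
The per-row steps in the loop above (symmetrize, shift, drop the diagonal, score) can be collected into one function and tried on a small made-up matrix, which may be useful when the results directory is not at hand. This is only a sketch: `row_peakiness` and the 3×3 `toy` matrix are hypothetical and not taken from the results. Unlike the loop above, it works on copies rather than shifting rows in place.

```python
import numpy as np
from scipy.stats import entropy

def peakiness(vec):
    return 1 - entropy(vec, base=2) / np.log2(len(vec))

def row_peakiness(matrix):
    """Return one peakiness score per row of a square PMI matrix (hypothetical helper)."""
    sym = matrix + matrix.T                 # symmetrize; the input is left untouched
    scores = []
    for i, row in enumerate(sym):
        shifted = row - row.min()           # shift so all values are >= 0 (min includes the diagonal)
        off_diag = np.delete(shifted, i)    # drop the diagonal entry
        scores.append(peakiness(off_diag))
    return np.array(scores)

# Made-up values, for illustration only.
toy = np.array([[0.0, 2.0, -1.0],
                [1.0, 0.0, 0.5],
                [-1.0, 0.5, 0.0]])
print(row_peakiness(toy))
```

Note that a two-word sentence would leave rows of length 1, for which $\log_2(1) = 0$ and the score is undefined. Neither this sketch nor the loop above guards against that case.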

