# Calculating perplexity

In [1]:
import numpy as np

# Setting random seeds
np.random.seed(32)

In [18]:
# Load from .npy files
predictions = np.load('predictions.npy')
targets = np.load('targets.npy')

# Print shapes
print(f'predictions has shape: {predictions.shape}')
predictions[0]

predictions has shape: (32, 64, 256)


array([[-15.579997, -25.735575, -15.576893, ..., -15.574669, -15.571493,
        -15.569425],
       [-24.01082 , -35.80076 , -23.743649, ..., -23.807941, -23.727554,
        -23.804428],
       [-15.783699, -14.416848, -15.512791, ..., -15.729168, -15.671564,
        -15.53212 ],
       ...,
       [-22.37673 , -29.096514, -22.266487, ..., -22.157543, -22.212416,
        -22.285917],
       [-23.18771 , -39.62314 , -23.07188 , ..., -23.058746, -22.928747,
        -23.131004],
       [-21.843483, -26.035233, -21.877586, ..., -21.576801, -21.74238 ,
        -21.694439]], dtype=float32)

The predictions array has shape (32, 64, 256): there are 32 sentences (sequences), each padded to a length of 64 tokens, and each position holds a vector of length 256, one entry per vocabulary index from 0 to 255. Each element of that vector is the log probability that the corresponding vocabulary entry is the token at that position in the sentence. If a sentence contains fewer than 64 tokens, the remaining positions are padding; in the targets array these positions hold the index 0.
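To make the "log probability" reading concrete, here is a minimal sketch on synthetic data (not the notebook's `predictions.npy`). It builds a log-softmax array of the same shape and checks that exponentiating one token's vector gives a distribution that sums to 1. The names `demo_logits` and `demo_log_probs` are hypothetical and only used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for raw model scores: (batch, sequence length, vocabulary size)
demo_logits = rng.normal(size=(32, 64, 256))

# Log-softmax over the vocabulary axis, computed in a numerically stable way
shifted = demo_logits - demo_logits.max(axis=-1, keepdims=True)
demo_log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Exponentiating one token's log probabilities gives a distribution over the vocabulary
token_probs = np.exp(demo_log_probs[1, 1])
print(token_probs.shape)
print(np.isclose(token_probs.sum(), 1.0))
```

Log probabilities are never positive, which matches the negative values printed for `predictions[0]`.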

The code below calculates the probabilities for the second token in the second sentence (`predictions[1, 1]`), showing which vocabulary entry the model predicts this token to be:


In [21]:
prob_preds = np.exp(predictions[1, 1])

As shown below, the token with the highest probability (the predicted token) has index 110 in the vocabulary.

In [23]:
np.argmax(prob_preds)

110

In [24]:
print(f'targets has shape: {targets.shape}')

targets has shape: (32, 64)


Since the targets array has a shape of (32, 64), `targets[1, 1]`, for example, represents the second token in the second sentence as a number between 0 and 255. We need to transform this representation from a single number into a one-hot vector of length 256. In this vector, the position corresponding to the number (for instance, index 110) has a value of 1, and all other positions are 0.

In [26]:
targets[1, 1]

110

To transform it from a number between 0 and 255 to a one-hot vector, we can use the function `np.eye()`, which generates an identity matrix: row `i` of that matrix is exactly the one-hot vector for index `i`.
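For a single token, the idea can be sketched as follows. The value 110 is taken from `targets[1, 1]` above; the variable names are illustrative only.

```python
import numpy as np

vocab_size = 256    # matches the last dimension of predictions
token_index = 110   # the value of targets[1, 1] shown above

identity = np.eye(vocab_size)
one_hot = identity[token_index]   # row 110 of the identity matrix

print(one_hot.shape)
print(one_hot[token_index], one_hot.sum())
```

The vector has a single 1 at position 110, so its sum is 1 and `np.argmax` returns the original index.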

In [40]:
predictions.shape

(32, 64, 256)

In [31]:
targets

array([[105, 110,  32, ...,   0,   0,   0],
       [ 97, 110, 110, ...,   0,   0,   0],
       [111, 102,  32, ...,   0,   0,   0],
       ...,
       [105,  32,  97, ...,   0,   0,   0],
       [101, 100, 103, ...,   0,   0,   0],
       [121, 111, 117, ...,   0,   0,   0]], dtype=int32)

In [10]:
reshaped_targets = np.eye(predictions.shape[-1])[targets]

In [11]:
reshaped_targets.shape

(32, 64, 256)

## Simplified example of above

In [59]:
fake_targets = np.array([
                [0, 2],
                [3, 1]
               ])

In [60]:
fake_targets.shape

(2, 2)

This means that we have two sentences with two tokens each. In the first sentence, the first token has index 0 in the vocabulary and the second token has index 2; in the second sentence, the tokens have indices 3 and 1. Let's assume that our vocabulary has a size of 4.

### Problem

We now want to replace the number that represents the index of a token in the vocabulary with a one-hot vector. The one-hot vector will have the same length as the vocabulary. For example:

- `fake_targets[0,0]` would then be `[1, 0, 0, 0]`.
- `fake_targets[0,1]` would be `[0, 0, 1, 0]`.
- `fake_targets[1,0]` would be `[0, 0, 0, 1]`.
- `fake_targets[1,1]` would be `[0, 1, 0, 0]`.

### Solution

We can do this with NumPy's fancy indexing, using an identity matrix whose size equals the vocabulary size.

In [61]:
I = np.eye(4)
fake_targets = I[fake_targets]

In [62]:
fake_targets[0, 0]

array([1., 0., 0., 0.])

In [63]:
fake_targets[0, 1]

array([0., 0., 1., 0.])

In [64]:
fake_targets[1, 0]

array([0., 0., 0., 1.])


Confused? See: https://chat.openai.com/share/4b65586d-625f-4509-8d5a-3d09ce7229d4

In [55]:
fake_targets[1, 1]

array([0., 1., 0., 0.])
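A quick way to check the conversion is to go back from one-hot vectors to indices with `np.argmax`. The sketch below keeps the indices and the one-hot array under separate, hypothetical names (`token_ids`, `one_hot`) instead of overwriting `fake_targets`.

```python
import numpy as np

token_ids = np.array([[0, 2], [3, 1]])   # same values as the original fake_targets
one_hot = np.eye(4)[token_ids]           # shape (2, 2, 4): one vector per token

# argmax over the last axis recovers the vocabulary index of each token
recovered = np.argmax(one_hot, axis=-1)

print(one_hot.shape)
print(np.array_equal(recovered, token_ids))
```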

### More on fancy indexing

#### Basic Concept
- Fancy Indexing: In NumPy, fancy indexing refers to the capability to index an array with another array, list, or sequence of integers or boolean values. It's particularly useful when you need to retrieve or modify a non-contiguous subset of an array's data.


#### How It Works with Row and Column Indices
1. Index Arrays: Suppose you have two arrays: row_indices and col_indices. These arrays specify the rows and columns you want to access. For each pair `(row_indices[i], col_indices[i])`, NumPy will fetch the element at that position in the source array.

2. Order of Execution: NumPy pairs the index arrays element by element; it does not run a nested loop over every row and every column. It takes the first element of row_indices and the first element of col_indices as one index pair, then the second elements as the next pair, and so on. The index arrays must therefore have the same shape or be broadcastable to a common shape.

### Practical Example
Imagine you have the following array and you want to select specific elements using fancy indexing:

```python
import numpy as np

# A sample 4x4 matrix
A = np.array([
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44]
])

# Row and column indices
row_indices = np.array([0, 2])
col_indices = np.array([1, 3])

# Applying fancy indexing
selected_elements = A[row_indices, col_indices]  # Outputs: array([12, 34])
```
In this example:

- `row_indices` contains `[0, 2]` and `col_indices` contains `[1, 3]`.
- The resulting `selected_elements` includes `A[0, 1]` (which is `12`) and `A[2, 3]` (which is `34`).
- The indexing first takes row 0 and column 1, then row 2 and column 3.

This method provides a powerful way to access and manipulate elements of an array based on multiple index arrays.
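For contrast, the sketch below compares the paired selection with `np.ix_`, which selects every combination of the given rows and columns. This comparison is an addition for illustration and is not part of the original example.

```python
import numpy as np

A = np.array([
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44]
])

rows = np.array([0, 2])
cols = np.array([1, 3])

paired = A[rows, cols]         # elements (0, 1) and (2, 3)
grid = A[np.ix_(rows, cols)]   # every combination of rows {0, 2} and columns {1, 3}

print(paired)   # [12 34]
print(grid)     # [[12 14]
                #  [32 34]]
```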

### Example 2

#### Combining Indices into a Single Matrix

Suppose you have the following `2x2` matrix of indices:

```python
indices = np.array([[0, 2], [1, 3]])
```

This matrix seems to combine row and column indices, but it is easy to misinterpret. Using it directly to index a 2D array like `A` will not give you the elements at `(0,1)` and `(2,3)`. Instead, NumPy treats every entry of the matrix as a row index, which can lead to unexpected behavior.

##### Example of Misinterpretation
Let's see what happens if you try to use this matrix directly:

```python
import numpy as np

# A sample 4x4 matrix
A = np.array([
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44]
])

# Combined indices matrix
indices = np.array([[0, 2], [1, 3]])

# Attempt to use the indices matrix
selected_elements = A[indices]
```

Here, `selected_elements` will not be what you might expect. Every entry of `indices` is treated as a row index, so whole rows are selected and the result has shape `(2, 2, 4)`: the shape of `indices` followed by the length of a row.

```python
array([[[11, 12, 13, 14],    # Row 0
        [31, 32, 33, 34]],   # Row 2

       [[21, 22, 23, 24],    # Row 1
        [41, 42, 43, 44]]])  # Row 3
```

This result occurs because an index array applied to the first axis selects full rows, not individual elements. It is the same mechanism that turns `targets` of shape `(32, 64)` into one-hot vectors of shape `(32, 64, 256)` when it is used to index `np.eye(256)`.

##### Correct Way to Use Combined Indices for Specific Elements

To correctly use combined indices for selecting specific elements like `(0,1)` and `(2,3)`, you should separate the indices into two separate arrays for rows and columns:

```python
# Correct row and column indices
row_indices = np.array([0, 2])
col_indices = np.array([1, 3])

# Correct indexing to select specific elements
correct_selected_elements = A[row_indices, col_indices]  # Outputs: array([12, 34])
```

In this way, row_indices and col_indices are used to directly specify the exact elements to retrieve, yielding the correct results `[12, 34]`.

In summary, combining indices into a single matrix can lead to confusion unless you correctly separate and apply them as individual arrays for rows and columns when dealing with multi-dimensional arrays.
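If the indices already live in one matrix, they can be split without retyping them. The sketch below assumes, as in the example above, that the first row of `indices` holds the row indices and the second row holds the column indices.

```python
import numpy as np

A = np.array([
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44]
])

indices = np.array([[0, 2], [1, 3]])

# Option 1: unpack the two rows into separate index arrays
row_indices, col_indices = indices
via_unpacking = A[row_indices, col_indices]

# Option 2: a tuple of index arrays is read as one index array per axis
via_tuple = A[tuple(indices)]

print(via_unpacking)   # [12 34]
print(via_tuple)       # [12 34]
```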

In [69]:
reshaped_targets = np.eye(predictions.shape[-1])[targets]
print(f'reshaped_targets has shape: {reshaped_targets.shape}')

reshaped_targets has shape: (32, 64, 256)


In [70]:
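# Multiplying by the one-hot targets zeroes every entry except the target token's
# log probability, so summing over the vocabulary axis picks that value out
log_p = np.sum(predictions * reshaped_targets, axis=-1)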

In [71]:
log_p

array([[ -5.39654493,  -1.03111839,  -0.66916656, ..., -22.37672997,
        -23.18770981, -21.84348297],
       [ -4.58577061,  -1.13412857,  -8.53803253, ..., -20.15686035,
        -26.83709717, -23.57501984],
       [ -5.22238874,  -1.28241444,  -0.17312431, ..., -21.328228  ,
        -19.85441208, -33.88444138],
       ...,
       [ -5.39654493, -17.29168129,  -4.36076593, ..., -20.82580185,
        -21.06583786, -22.44311523],
       [ -5.93131638, -14.24741745,  -0.26373291, ..., -26.74324799,
        -18.38433075, -22.35527802],
       [ -5.67053604,  -0.10595131,   0.        , ..., -23.33252335,
        -28.08737564, -23.87880707]])

In [72]:
log_p.shape

(32, 64)
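The one-hot multiplication used for `log_p` can also be written as a gather along the vocabulary axis. The sketch below uses small synthetic arrays (hypothetical names `demo_predictions` and `demo_targets`, arbitrary values rather than real log probabilities) to check that `np.take_along_axis` returns the same numbers without building the one-hot array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small inputs: (batch, sequence length, vocabulary size) and token ids
demo_predictions = rng.normal(size=(4, 5, 8))
demo_targets = rng.integers(0, 8, size=(4, 5))

# Approach used in this notebook: multiply by one-hot targets and sum
one_hot = np.eye(demo_predictions.shape[-1])[demo_targets]
log_p_one_hot = np.sum(demo_predictions * one_hot, axis=-1)

# Alternative: pick the target entry directly along the last axis
log_p_gather = np.take_along_axis(
    demo_predictions, demo_targets[..., None], axis=-1
).squeeze(-1)

print(log_p_gather.shape)
print(np.allclose(log_p_one_hot, log_p_gather))
```

For a large vocabulary the gather avoids allocating an array the size of `predictions`, which may matter for memory; the one-hot version stays closer to the explanation above.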

In [73]:
# 1.0 where the target is a real token, 0.0 where it is padding (index 0)
non_pad = 1.0 - np.equal(targets, 0)
print(f'non_pad has shape: {non_pad.shape}\n')
print(f'non_pad looks like this: \n\n {non_pad}')

non_pad has shape: (32, 64)

non_pad looks like this: 

 [[1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 ...
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]]


In [74]:
np.equal(targets, 0)

array([[False, False, False, ...,  True,  True,  True],
       [False, False, False, ...,  True,  True,  True],
       [False, False, False, ...,  True,  True,  True],
       ...,
       [False, False, False, ...,  True,  True,  True],
       [False, False, False, ...,  True,  True,  True],
       [False, False, False, ...,  True,  True,  True]])

In [75]:
non_pad = 1-np.equal(targets, 0)

In [76]:
non_pad

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [77]:
# Zero out the log probabilities at padded positions
real_log_p = log_p * non_pad
print(f'real log probabilities still have shape: {real_log_p.shape}')

real log probabilities still have shape: (32, 64)


In [78]:
real_log_p

array([[ -5.39654493,  -1.03111839,  -0.66916656, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -4.58577061,  -1.13412857,  -8.53803253, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.22238874,  -1.28241444,  -0.17312431, ...,  -0.        ,
         -0.        ,  -0.        ],
       ...,
       [ -5.39654493, -17.29168129,  -4.36076593, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.93131638, -14.24741745,  -0.26373291, ...,  -0.        ,
         -0.        ,  -0.        ],
       [ -5.67053604,  -0.10595131,   0.        , ...,  -0.        ,
         -0.        ,  -0.        ]])

In [79]:
print(f'log probabilities before filtering padding: \n\n {log_p}\n')
print(f'log probabilities after filtering padding: \n\n {real_log_p}')

log probabilities before filtering padding: 

 [[ -5.39654493  -1.03111839  -0.66916656 ... -22.37672997 -23.18770981
  -21.84348297]
 [ -4.58577061  -1.13412857  -8.53803253 ... -20.15686035 -26.83709717
  -23.57501984]
 [ -5.22238874  -1.28241444  -0.17312431 ... -21.328228   -19.85441208
  -33.88444138]
 ...
 [ -5.39654493 -17.29168129  -4.36076593 ... -20.82580185 -21.06583786
  -22.44311523]
 [ -5.93131638 -14.24741745  -0.26373291 ... -26.74324799 -18.38433075
  -22.35527802]
 [ -5.67053604  -0.10595131   0.         ... -23.33252335 -28.08737564
  -23.87880707]]

log probabilities after filtering padding: 

 [[ -5.39654493  -1.03111839  -0.66916656 ...  -0.          -0.
   -0.        ]
 [ -4.58577061  -1.13412857  -8.53803253 ...  -0.          -0.
   -0.        ]
 [ -5.22238874  -1.28241444  -0.17312431 ...  -0.          -0.
   -0.        ]
 ...
 [ -5.39654493 -17.29168129  -4.36076593 ...  -0.          -0.
   -0.        ]
 [ -5.93131638 -14.24741745  -0.26373291 ...  -0.        

In [80]:
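# Average log probability per sentence, counting only non-padding tokens
log_ppx = np.sum(real_log_p, axis=1) / np.sum(non_pad, axis=1)
# Log perplexity is the negative of that average, here averaged over all sentences
log_ppx = np.mean(-log_ppx)
print(f'The log perplexity and perplexity of the model are respectively: {log_ppx} and {np.exp(log_ppx)}')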

The log perplexity and perplexity of the model are respectively: 2.6211854987065033 and 13.752016923578548
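The steps above can be collected into one function. This is a sketch rather than a definitive implementation: the name `perplexity` and the `pad_id` parameter are hypothetical, and it assumes every sentence contains at least one non-padding token (otherwise the division yields `nan`). A convenient sanity check is a uniform model, whose perplexity should equal the vocabulary size.

```python
import numpy as np

def perplexity(log_probs, targets, pad_id=0):
    """Hypothetical helper: exponentiated mean of the per-sentence log perplexity."""
    one_hot = np.eye(log_probs.shape[-1])[targets]
    log_p = np.sum(log_probs * one_hot, axis=-1)     # log probability of each target token
    non_pad = 1.0 - np.equal(targets, pad_id)        # mask out padding positions
    per_sentence = np.sum(log_p * non_pad, axis=1) / np.sum(non_pad, axis=1)
    return np.exp(np.mean(-per_sentence))

# Uniform model over 8 tokens: every log probability is -log(8)
vocab_size = 8
uniform_log_probs = np.full((2, 4, vocab_size), -np.log(vocab_size))
demo_targets = np.array([[3, 5, 1, 0],
                         [2, 7, 0, 0]])   # trailing zeros are padding

ppx = perplexity(uniform_log_probs, demo_targets)
print(np.isclose(ppx, vocab_size))
```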
