torch.tensor vs. torch.Tensor:

torch.tensor infers the dtype automatically, while torch.Tensor is an alias for torch.FloatTensor (with the default dtype) and always returns a torch.FloatTensor. I would recommend sticking to torch.tensor, which also accepts arguments like dtype.
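
A minimal sketch of the difference, assuming PyTorch's default dtype has not been changed from float32 (the variable names are only for illustration):

```python
import torch

inferred = torch.tensor([1, 2, 3])                        # dtype inferred from the Python ints
legacy = torch.Tensor([1, 2, 3])                          # always the default float type
explicit = torch.tensor([1, 2, 3], dtype=torch.float64)   # dtype requested explicitly

print(inferred.dtype, legacy.dtype, explicit.dtype)
```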

torch.Tensor vs. torch.from_numpy:

from_numpy() inherits the dtype of the ndarray. torch.Tensor, on the other hand, is an alias for torch.FloatTensor and can only return a FloatTensor. Therefore, if you pass an int64 array to torch.Tensor, the output is a FloatTensor and it does not share storage with the array. torch.from_numpy gives you a torch.LongTensor, as expected, and the storage is shared.
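
The behaviour described above can be checked with a small experiment. This is only a sketch; arr, shared and copied are made-up names:

```python
import numpy as np
import torch

arr = np.array([1, 2, 3], dtype=np.int64)

shared = torch.from_numpy(arr)   # keeps int64 and reuses the array's memory
copied = torch.Tensor(arr)       # converted to float32, so new storage is needed

arr[0] = 100                     # modify the numpy array in place

print(shared)   # sees the change
print(copied)   # still holds the old value
```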

torch.tensor vs. torch.as_tensor:
torch.tensor always copies the data. For example, if x is already a tensor, torch.tensor(x) is equivalent to x.clone().detach().

torch.as_tensor tries to avoid copying the data. One case where it avoids the copy is when the original data is a numpy array and no dtype or device conversion is needed.

When a numpy array and a tensor share the storage, then changing one will change the other.

Bottom line:

keys_and_values = np.random.rand(16 * 512, 3, 2, 10).astype('float32')

keys_and_values = torch.tensor(keys_and_values) -> This copies the data, no sharing

keys_and_values = torch.as_tensor(keys_and_values) -> This does NOT copy the data, sharing

keys_and_values = torch.from_numpy(keys_and_values) -> This does NOT copy the data, sharing
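
One way to verify the three cases is to ask numpy whether the buffers overlap. This is a sketch on a much smaller array than the one above, using np.shares_memory as the check:

```python
import numpy as np
import torch

arr = np.random.rand(4, 3, 2, 10).astype('float32')

via_tensor = torch.tensor(arr)         # copy
via_as_tensor = torch.as_tensor(arr)   # no copy (same dtype, CPU)
via_from_numpy = torch.from_numpy(arr) # no copy

# Tensor.numpy() on a CPU tensor exposes the tensor's own buffer,
# so comparing it with arr tells us whether the tensor shares arr's memory.
copy_shares = np.shares_memory(arr, via_tensor.numpy())
as_tensor_shares = np.shares_memory(arr, via_as_tensor.numpy())
from_numpy_shares = np.shares_memory(arr, via_from_numpy.numpy())

print(copy_shares, as_tensor_shares, from_numpy_shares)
```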

############ Other way round ##############


Tensor.numpy(*, force=False) → numpy.ndarray
Returns the tensor as a NumPy ndarray.

If force is False (the default), the conversion is performed only if the tensor is on the CPU, does not require grad, does not have its conjugate bit set, and is a dtype and layout that NumPy supports. The returned ndarray and the tensor will share their storage, so changes to the tensor will be reflected in the ndarray and vice versa.
If force is True this is equivalent to calling t.detach().cpu().resolve_conj().resolve_neg().numpy(). If the tensor isn’t on the CPU or the conjugate or negative bit is set, the tensor won’t share its storage with the returned ndarray. Setting force to True can be a useful shorthand.

Bottom line: you should only call numpy() once the Tensor has been moved to the CPU, and the Tensor and the ndarray will then SHARE the storage!

Note that cpu() already creates a copy of the tensor if the tensor was on the GPU (which will typically be the case)

To create a copy of a tensor, always use .detach().clone(); it creates the copy wherever the tensor lives (CPU or GPU).
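
A small CPU-only sketch of the points above (the names are illustrative, and the GPU case is not exercised here):

```python
import torch

t = torch.zeros(3)
view = t.numpy()                        # shares storage with t
snapshot = t.detach().clone().numpy()   # independent copy taken before the change
t.add_(1)                               # in-place update of the tensor

# A tensor that requires grad cannot be converted directly.
g = torch.zeros(3, requires_grad=True)
try:
    g.numpy()
    raised = False
except RuntimeError:
    raised = True
g_array = g.detach().numpy()            # detach first, then convert

print(view, snapshot, raised)
```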


Quick note on np.random.rand vs. np.random.random: the latter is an alias for np.random.random_sample.
With numpy.random.rand, the length of each dimension of the output array is a separate argument. 
With numpy.random.random_sample, the shape argument is a single tuple.
For example, to create an array of samples with shape (3, 5), you can write
sample = np.random.rand(3, 5)
or
sample = np.random.random_sample((3, 5)) 

When selecting a vector DB, consider the following:

What operations do we need?
- Add?
- Store?
- Search?
- Remove?
- Reconstruct?

Frequency of each operation?
Accuracy vs. speed vs. required mem footprint?
Size of index (how many vectors will we be storing)?
Size of query (how many vectors are we querying with)?
GPU or CPU?
Is retraining the index with new additions required?


In [None]:
# NOTE: mem_dim CAN BE WHATEVER DIM, this can be embed_dim or this can be the attn head dim, depends how you implement it.

In [1]:
import faiss
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# We need a simple vector DB where we can search, add and remove (forget) - Faiss can do all this

# The size of the vectors we'll be storing.
mem_dim = 64

# The max amount of (key,value) pairs we're storing in our memory
max_mems = 10000

# index object is basically the DB - L2 is the distance metric (Euclidean). Faiss index types trade off speed/accuracy/memory footprint.
# FlatL2 supports add/remove/search so it will do.

index = faiss.IndexFlatL2(mem_dim)

# For each of the query vectors (10 in the example below), we want the 2 most similar ones in the index (via Euclidean distance).
top_k = 2

# Anything below this point in this cell is the initial explanation, comment out to teach, before you go to next cell
#
## Let's ADD some fake vectors
#vectors = np.random.random((10000,mem_dim)).astype('float32')
#index.add(vectors)
##index.ntotal -> 10000
#
## Let's REMOVE some vectors
##index.remove_ids(np.arange(10)) # Remove vectors with id's 0 through 9
##index.ntotal -> 9990
#
## Let's do a query
#query_vectors = np.random.random((10,mem_dim)).astype('float32')     # Batch of 10 query vectors
#
#
#
#distances, ids = index.search(query_vectors, top_k)
#
##distances
##array([[5.5422297, 5.679163 ],    <<<<<< distances for 2 closest vectors for first vector in our search
##       [5.946422 , 5.969053 ],
##       [5.812578 , 6.1550074],
##       [5.0106854, 5.0456457],
##       [5.7170763, 5.8991313],
##       [5.5098534, 5.654152 ],
##       [6.250798 , 6.3652353],
##       [5.3992786, 5.7259717],
##       [5.371162 , 5.6683345],
##       [5.3517475, 5.617537 ]], dtype=float32)
#
##ids
##array([[1285, 8094],             <<<<<< ids of the 2 closest vectors (0 -> 9999) here
##       [4660, 8888],
##       [6929, 4617],
##       [9606,   99],
##       [3304, 3644],
##       [6760,  265],
##       [8001, 1974],
##       [9486, 2416],
##       [5023, 2839],
##       [4886,  580]])
#
##distances, ids, vectors = index.search_and_reconstruct(query_vectors, top_k)
### This also contains the actual matching vectors
##vectors.shape
###(10, 2, 64)
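
To make clear what a flat L2 search does, here is a brute-force numpy stand-in. It is not FAISS itself; flat_l2_search is a made-up helper, and it returns squared Euclidean distances, which as far as I know is also what IndexFlatL2 reports:

```python
import numpy as np

def flat_l2_search(stored, queries, top_k):
    """Exhaustive nearest-neighbour search with squared L2 distances."""
    # ||q - s||^2 = ||q||^2 - 2 q.s + ||s||^2, computed for all pairs at once
    d2 = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ stored.T
        + (stored ** 2).sum(axis=1)
    )
    ids = np.argsort(d2, axis=1)[:, :top_k]           # closest first
    distances = np.take_along_axis(d2, ids, axis=1)   # matching distances
    return distances, ids

rng = np.random.default_rng(0)
stored = rng.random((100, 8), dtype=np.float32)
queries = stored[:5].copy()    # query with vectors that are already stored

distances, ids = flat_l2_search(stored, queries, top_k=2)
print(ids[:, 0])               # each query finds itself first
```

Both outputs have shape (number of queries, top_k), the same layout that index.search returns.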

In the last layer, which is the kNN-augmented layer, we'll be searching in FAISS using the key vectors (which will be the same as the
query vectors, unless there is recurrence, in which case we have more keys than queries). For each key vector, we'll search for the top_k
matching key vectors that we have in memory.

Now we also need the corresponding value vectors, so we need to store these as well. We'll do vector similarity matching with the keys
so the keys must go in a vector DB, but we can store the matching values in a regular DB as long as we can retrieve each value that 
corresponds with a matching key (from the vector DB).

In this case we're going to store it on disk with a memmap numpy array. This is nice because it allows us to talk to our data using
the numpy interface (just as if it were all in memory) - but it's backed by a file.
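
A tiny sketch of the "backed by a file" part: write through the numpy interface, flush, and read the same file back. The scratch path and the small shape are made up for the demo:

```python
import os
import tempfile
import numpy as np

max_mems, mem_dim = 4, 3
path = os.path.join(tempfile.mkdtemp(), "demo.memmap")   # hypothetical scratch file

# mode='w+' creates (or overwrites) the file, filled with zeros
db = np.memmap(path, mode="w+", dtype=np.float32, shape=(max_mems, 2, mem_dim))
db[1] = np.ones((2, mem_dim), dtype=np.float32)   # store one (key, value) pair
db.flush()                                        # push pending changes to disk

# Re-open read-only: the data comes from the file, not from the first object
reopened = np.memmap(path, mode="r", dtype=np.float32, shape=(max_mems, 2, mem_dim))
print(reopened[1])
print(os.path.getsize(path))   # max_mems * 2 * mem_dim * 4 bytes
```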

In [2]:
db_filepath = "./numpy.memmap"
db = np.memmap(db_filepath, mode='w+', dtype=np.float32, shape = (max_mems,2,mem_dim))

# db[1] = np.random.rand(1,2,mem_dim) <<<<<< store 2 vectors in position 1 (second position)
# db
# memmap([[[0.        , 0.        , 0.        , ..., 0.        ,
#          0.        , 0.        ],
#         [0.        , 0.        , 0.        , ..., 0.        ,
#          0.        , 0.        ]],
#
#        [[0.7254081 , 0.07150084, 0.92323035, ..., 0.28860936,
#          0.20353778, 0.72763807],
#         [0.00908635, 0.2388858 , 0.34646714, ..., 0.73859435,
#          0.35705045, 0.8246553 ]],
#
#        [[0.        , 0.        , 0.        , ..., 0.        ,
#          0.        , 0.        ],
#         [0.        , 0.        , 0.        , ..., 0.        ,
#          0.        , 0.        ]],
#
#        ...,

Now we put this all in a class - object must be able to:
- add to index and database
- query the index
- retrieve in database based on query results
- remove from index and database ("forgetting the oldest memories")

In [3]:
# ADDING

# Let's first create some fake data that we can push through - and to understand requirements as well

batch_size = 2

keys = np.random.rand(batch_size, 512, mem_dim).astype('float32')    # 512 is seq_len

# Now we want to feed in the values as well, so we create an extra dimension

keys_and_values = np.random.rand(batch_size, 512, 2, mem_dim).astype('float32')

# Say we have 16 as batch_size => 16 * 512 vector PAIRS (key and value) go into the db, while 16*512 key vectors go into the index
# Note: this means we're storing a memory for each of the tokens in the sequence - but we only really use the last token to predict the
# next one ... is this right?

# Let's slam batch_size and seq_len together:
# Alternative:
# keys_and_values = keys_and_values.flatten(0,1)   # torch.Tensor only; ndarray.flatten() takes no start/end dims

keys_and_values = keys_and_values.reshape(-1, 2, mem_dim)         # -1 means infer the first dimension

# Now for the index we just need to store the keys, so let's grab those

keys = keys_and_values[:,0,:]
# Btw:
# keys.shape
# (1024, 64)   i.e. (batch_size * seq_len, mem_dim)

index.add(keys)
# If you ever get error here on non contiguous array: index.add(np.ascontiguousarray(keys))  -> this is FAISS thing, not related to tensors

# For the db numpy array we have to keep track of where we are ourselves

db_offset = 0 # initialize once at the level of the object

# Then each time the add function is called:

num_added = keys_and_values.shape[0]

indexes_added = (np.arange(num_added) + db_offset)

db_offset = db_offset + num_added

db[indexes_added] = keys_and_values
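
The offset bookkeeping above silently assumes the db never fills up. A possible guard, as a sketch only (append_rows is a hypothetical helper and a plain array stands in for the memmap):

```python
import numpy as np

def append_rows(db, db_offset, rows):
    """Write rows at db_offset and return the new offset; refuse to overflow."""
    num_added = rows.shape[0]
    if db_offset + num_added > db.shape[0]:
        raise ValueError("memory is full: clear it or raise max_mems")
    db[db_offset:db_offset + num_added] = rows
    return db_offset + num_added

db = np.zeros((4, 2, 3), dtype=np.float32)   # stand-in for the memmap
db_offset = 0
db_offset = append_rows(db, db_offset, np.ones((3, 2, 3), dtype=np.float32))

try:
    append_rows(db, db_offset, np.ones((2, 2, 3), dtype=np.float32))
    overflow_detected = False
except ValueError:
    overflow_detected = True

print(db_offset, overflow_detected)
```

A slice is used here instead of an index array because the added rows are always consecutive.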


In [4]:
# QUERYING THE INDEX

# Create fake vectors that we want to query with

queries = np.random.rand(batch_size, 512, mem_dim).astype('float32')    # We're going to query with EVERY token from every sequence in the batch 

# Again we slam batch_size and seq_len together
queries = queries.reshape(-1, mem_dim)

# Let's try a search

distances, ids = index.search(queries, top_k)

# Since we store things in the db in the same order as for the index, we can simply query the db using the returned ids (which are indexes).

result_keys_and_values = db[ids]


#result_keys_and_values.shape # -> (1024, 2, 2, 64) for batch_size x seq_len, top_k, and then the key and the value vector

# Now when we return this to the layer, we want to restore the batch and seq_len dimension.

result_keys_and_values = result_keys_and_values.reshape(batch_size, 512, top_k, 2, mem_dim)
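
The db[ids] lookup relies on numpy fancy indexing: a 2-D array of ids selects whole (key, value) pairs and prepends its own shape. The sketch below uses made-up ids and also masks -1 entries, which, as far as I know, is what FAISS returns when it finds fewer than top_k neighbours; left alone, db[-1] would silently pick the last row:

```python
import numpy as np

num_stored, mem_dim, top_k = 6, 4, 2
db = np.arange(num_stored * 2 * mem_dim, dtype=np.float32).reshape(num_stored, 2, mem_dim)

ids = np.array([[0, 5], [3, -1]])   # made-up search result; -1 marks a missing hit
gathered = db[ids]                  # shape (num_queries, top_k, 2, mem_dim)

valid = ids >= 0
# Zero out the pairs that belong to missing hits
gathered = np.where(valid[:, :, None, None], gathered, 0.0)

print(gathered.shape)
```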


REMOVAL (forgetting)

In the paper they have one memory for each batch dimension (the dimension in which a paper is fed in). So if batch_size is 8 then
there are 8 memories (index and DB). This makes the whole thing more efficient since the best hits will come from the same paper
anyway. In this implementation we'll just do a shared memory across all batches.

Also in our implementation we're going to clear the database after each batch. In the batch dimension there are 10 chunks of 512
tokens and we'll use the memory as we work through those 10, but after that we start with an empty memory again.

So we're implementing a clear() function that simply clears out the index and the DB.

So there's no real "forgetting" here, it's more like restarting from scratch with each batch.

In [None]:
Now we need to add all of the above in a class form.

In [5]:
import faiss
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class Memory:

    def __init__(self, mem_dim=64, max_mems=10000):

        self.index = faiss.IndexFlatL2(mem_dim)
        db_filepath = "./numpy.memmap"          
        self.db = np.memmap(db_filepath, mode='w+', dtype=np.float32, shape = (max_mems,2,mem_dim))

        self.db_offset = 0

    def add(self, keys_and_values): # keys_and_values is (b_s, seq_len, 2, mem_dim) so for each token the key and the value vectors

        # Note: what comes in is a tensor

        # Vid: keys_and_values = keys_and_values.flatten(0,1)
        # Note that reshape reuses the buffer because the order of the elements is not changed.
        keys_and_values = keys_and_values.reshape(-1, 2, keys_and_values.shape[-1]) # Slam batch and seq_len together so (b_s * seq_len, 2, mem_dim) now

        # Vid: keys, values = keys_and_values.unbind(dim=-2)
        keys = keys_and_values[:,0,:]

        # Vid does this first:
        #keys = np.ascontiguousarray(keys.numpy())
        # ascontiguousarray() should not be needed, but the conversion to numpy is, since FAISS only accepts numpy arrays, so:
        keys = keys.numpy()

        # Note that with the above, the tensor and the numpy array share storage. This should be ok since we already should have 
        # created a copy with cpu() before getting to this point.
        
        self.index.add(keys)
        
        # Now the DB

        num_added = keys_and_values.shape[0]

        indexes_added = (np.arange(num_added) + self.db_offset)

        self.db[indexes_added] = keys_and_values # Automatic conversion from Tensor (right) to Numpy array (left)

        self.db_offset = self.db_offset + num_added

        self.db.flush() # needed?

    def query(self, queries, top_k=2):  # queries is (b_s, seq_len, mem_dim)

        # Need to fetch b_s and seq_len because we need to restore to them later:

        b_s = queries.shape[0]
        seq_len = queries.shape[1]

        queries = queries.reshape(-1, queries.shape[-1]) # queries is now (b_s * seq_len, mem_dim)

        # queries = np.ascontiguousarray(queries.numpy())
        # ascontiguousarray() should not be needed, but converting the Tensor to a numpy array is, as FAISS only accepts numpy:
        queries = queries.numpy()

        # Same remark here: the query tensors will already be copies because cpu() should already have been called.
        
        distances, ids = self.index.search(queries, top_k)
        
        result_keys_and_values = self.db[ids]

        # Now restore the b_s and seq_len dimensions

        result_keys_and_values = result_keys_and_values.reshape(b_s, seq_len, top_k, 2, -1)
        
        # Now we need to convert Numpy back to Tensor

        #result_keys_and_values = torch.tensor(result_keys_and_values) <<<< this one is giving issues, maybe due to copying
        result_keys_and_values = torch.from_numpy(result_keys_and_values)

        # Regarding the above: tensor and numpy array share the storage but assumption is this is ok because to() will create copy
        
        # Note that in the vid the final reshape is done AFTER the conversion to Tensor, like this:
        # result_keys_and_values = torch.unflatten(result_keys_and_values, 0, (b_s, seq_len))
        
        return result_keys_and_values # (b_s, seq_len, top_k, 2, mem_dim)
    
    def clear(self):

        self.index.reset()
        self.db[:] = 0       # Optional but clean
        self.db_offset = 0
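
The "Vid:" alternatives mentioned in the comments of the class can be compared against the reshape/slicing version on a toy tensor. A sketch with small, made-up sizes:

```python
import torch

b_s, seq_len, mem_dim = 2, 4, 3
kv = torch.randn(b_s, seq_len, 2, mem_dim)

flat_reshape = kv.reshape(-1, 2, mem_dim)    # what the class does
flat_flatten = kv.flatten(0, 1)              # the alternative: merge dims 0 and 1

keys, values = flat_flatten.unbind(dim=-2)   # same as [:, 0, :] and [:, 1, :]
restored = flat_flatten.unflatten(0, (b_s, seq_len))   # undo the flatten

print(keys.shape, values.shape, restored.shape)
print(keys.is_contiguous())   # the key slice skips over the values in memory
```

The last line is why np.ascontiguousarray shows up in the comments: the keys taken out of the (key, value) pairs are a strided view, not a contiguous block.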


In [6]:
# Test

batch_size = 8
mem_dim = 10
segments = 10
seq_len = 512
max_mems = batch_size * segments * seq_len

mem = Memory(mem_dim=mem_dim, max_mems=max_mems)

#keys_and_values = np.random.rand(batch_size, 512, 2, mem_dim).astype('float32')
# We need the torch version of this:
keys_and_values = torch.randn(batch_size, seq_len, 2, mem_dim)    # Default type is torch.FloatTensor == 32-bit float

queries = torch.randn(batch_size, seq_len, mem_dim)

# 1. Add keys and values to the memory

mem.add(keys_and_values)

#mem.index.ntotal -> 4096 (batch_size * seq_len = 8 * 512)
#mem.db[4095] -> at the last written location we indeed have 2 vectors stored (values will differ per run):
#memmap([[ 0.5223559 ,  0.82290304, -0.24579963,  0.33095127, -0.19270769,
#         -0.09946475, -1.3664232 ,  0.87740934, -1.7536417 ,  1.4747871 ],
#        [ 0.30017638, -1.134407  ,  1.0770321 ,  0.76789665, -0.6984901 ,
#         -0.49128056,  0.8477661 , -0.98776984,  0.02807933, -0.5398302 ]],
#       dtype=float32)

result_keys_and_values = mem.query(queries,3)

result_keys_and_values.shape

torch.Size([8, 512, 3, 2, 10])

In [7]:
import numpy as np
import torch

keys_and_values = np.random.rand(16 * 512, 3, 2, 10).astype('float32')
#keys_and_values = np.random.rand(1, 5, 2, 10).astype('float32')

#keys_and_values = torch.tensor(keys_and_values)
keys_and_values = torch.from_numpy(keys_and_values)
keys_and_values.shape


torch.Size([8192, 3, 2, 10])

In [None]:
# From here go to stage 6, which is stage 5 but with PosEncoding and Memory