# Make Visual Words

## Requirements

k-means clustering is a memory-intensive process.
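To get a feel for the memory involved, here is a rough back-of-the-envelope estimate. It uses the descriptor count, dimensionality and `k` that appear later in this notebook. The choice of float64 for a converted copy and float32 for a full distance matrix is an assumption for illustration, not something the notebook fixes.

```python
# Sizes taken from later cells of this notebook.
n_descriptors = 16_334_970   # number of descriptors printed after loading
dim = 128                    # SIFT descriptor length
k = 131_072                  # number of visual words (2**17)

# Raw descriptors stored as uint8 (1 byte per value).
raw_uint8_bytes = n_descriptors * dim * 1
# The same data converted to float64 (8 bytes per value), an assumed dtype.
as_float64_bytes = n_descriptors * dim * 8
# A full n x k distance matrix in float32 (assumed), as a naive assignment step would need.
full_distance_matrix_bytes = n_descriptors * k * 4

for name, nbytes in [
    ("raw uint8 descriptors", raw_uint8_bytes),
    ("float64 copy", as_float64_bytes),
    ("full distance matrix", full_distance_matrix_bytes),
]:
    print(f"{name}: {nbytes / 1e9:.2f} GB")
```

The full distance matrix is far too large to hold at once, which is why the assignment step has to be chunked, approximated, or run on compressed codes.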

## Input

* n-dimensional descriptors
    * For building visual words, it does not matter which image each descriptor came from. We only quantize the descriptors.

* Number of visual words

## Process

Build the clusters.
Say we use k-means clustering: first choose the number of clusters, then run k-means.
We can speed this up with approximate nearest neighbor search, such as a k-d tree, at the cost of some accuracy.
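The process can be sketched as plain Lloyd-style k-means in NumPy. This is a minimal illustration on toy data, not the `clustering_utils` implementation used in this notebook. The function name `kmeans_sketch`, the initialisation from random data points, and the stopping rule (stop when no assignment changes) are assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=20, seed=0):
    """Toy k-means: exact nearest-center assignment, mean update."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    # Assumed initialisation: pick k distinct data points as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)
    for step in range(max_iter):
        # Assignment step: squared Euclidean distance to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        num_same = int((new_labels == labels).sum())
        labels = new_labels
        if num_same == len(points):
            break  # no assignment changed, so we treat this as convergence
        # Update step: move each non-empty center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two well-separated toy blobs.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 1.0, size=(50, 2))
blob_b = rng.normal(20.0, 1.0, size=(50, 2))
centers, labels = kmeans_sketch(np.vstack([blob_a, blob_b]), k=2)
print(centers)
```

The assignment step builds an n x k distance table, so this form only works for small inputs. At the scale of this notebook that step is where approximate search or compressed codes come in.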

## Output

The index of each visual word and the center point of its cluster.

Given a feature, we can use this output to find the closest visual word.
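As a hedged sketch of that lookup, the snippet below assigns each feature to its nearest center with exact brute-force search, processed in chunks so the distance table stays small. The function name `assign_visual_words` and the `chunk_size` parameter are hypothetical, and the tiny arrays are made-up examples.

```python
import numpy as np

def assign_visual_words(features, centers, chunk_size=4096):
    """Return the index of the nearest center for every feature."""
    features = np.asarray(features)
    centers = np.asarray(centers, dtype=np.float32)
    center_sq = (centers ** 2).sum(axis=1)  # ||c||^2 for each center, shape (k,)
    words = np.empty(len(features), dtype=np.int64)
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size].astype(np.float32)
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. The ||x||^2 term is the same
        # for every center, so it can be dropped without changing the argmin.
        scores = center_sq[None, :] - 2.0 * chunk @ centers.T
        words[start:start + chunk_size] = scores.argmin(axis=1)
    return words

centers = np.array([[0, 0], [10, 10], [0, 10]], dtype=np.uint8)
features = np.array([[1, 1], [9, 8], [2, 9]], dtype=np.uint8)
words = assign_visual_words(features, centers)
print(words)
```

Casting to float32 before the arithmetic matters here: subtracting or squaring uint8 values directly would wrap around.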



In [1]:
# Load all features in the dataset.
# Q. What if exactly the same feature appears many times? Do we add it to the feature pool?
# Q. Do many identical feature points affect the result of k-means clustering?

# An image may contain several features, and we need to get all of them.


%load_ext autoreload
%autoreload 2
# from utils import oxf5k_feature_reader as feature_reader

import numpy as np
import gc

def feature_reader(feature_bin_path="./data/feature/feat_oxc1_hesaff_sift.bin"):
    """
    Read the official oxf5k descriptor file.

    The binary file stores 128-byte SIFT descriptors, each a 128-d vector of uint8 values.
    """
    descriptor_size = 128  # bytes per descriptor
    dt = np.dtype(np.uint8)
    features = []
    # Read the file one descriptor at a time and collect the descriptors in a list.
    with open(feature_bin_path, "rb") as f:
        raw_binary = f.read(descriptor_size)
        # Stop at end of file, or at a trailing chunk shorter than one descriptor.
        while len(raw_binary) == descriptor_size:
            descriptor = np.frombuffer(raw_binary, dtype=dt)
            features.append(descriptor)
            raw_binary = f.read(descriptor_size)

    return features


def load_features_from_dataset():
    """
    """
    pass

all_features = feature_reader()
print(len(all_features))
gc.collect()

16334970


233
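Reading 16 million descriptors one at a time in a Python loop is slow. A possible alternative is to read the whole file in one call and reshape it. The sketch below makes the same assumption as `feature_reader` above, namely that the file is a flat sequence of 128-byte descriptors with nothing else in it. The function name `read_descriptors` and the demo file are hypothetical.

```python
import os
import tempfile

import numpy as np

def read_descriptors(path, dim=128):
    """Read a flat uint8 file into an (n, dim) array, dropping any trailing partial row."""
    raw = np.fromfile(path, dtype=np.uint8)
    n = raw.size // dim
    return raw[:n * dim].reshape(n, dim)

# Demo on a small made-up file: 300 bytes give 2 full descriptors plus 44 leftover bytes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    (np.arange(300) % 256).astype(np.uint8).tofile(demo_path)
    demo = read_descriptors(demo_path)

print(demo.shape, demo.dtype)
```

This returns a single 2-D array directly, so the later `np.array(all_features)` conversion would no longer be needed.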

In [2]:
%load_ext autoreload
%autoreload 2
from utils import clustering_utils
import numpy as np

dim = 128

# n = 80000
# k = 20000
# data_points = np.random.randint(256, size=(n, dim), dtype=np.uint8)

data_points = np.array(all_features)
k = 131072 # 2^17 = 131072

# Timing History:
# n = 80,000 k=20,000 => single k-d tree with FLANN, 13 step, 6 min 


print('data_points:', data_points)
print('data_points shape:', data_points.shape)
print('data_points dtype:', data_points.dtype)
print('k:', k)

max_val = 2**8 - 1  # 255, the uint8 maximum for a SIFT descriptor
print('max_val:', max_val)
# np.random.randint excludes the upper bound, so use max_val + 1 to allow 255.
center_points = np.random.randint(max_val + 1, size=(k, dim), dtype=np.uint8)
# center_points = center_points.astype(float)
print('center_points:', center_points)
print('center_points type:', center_points.dtype)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
data_points: [[ 38  18   4 ...,  45  49  10]
 [110  50   4 ...,   4  44  43]
 [102  28   1 ...,  50 110   2]
 ..., 
 [ 65  83  58 ...,   0   0   3]
 [ 34  64  70 ...,   1   1  13]
 [ 22   8  36 ...,  27  12  13]]
data_points shape: (16334970, 128)
data_points dtype: uint8
k: 131072
max_val: 255
center_points: [[214 200 217 ...,  78  78 195]
 [151  23 150 ..., 167  57 240]
 [237 189  49 ..., 107 213 182]
 ..., 
 [ 11 188  40 ..., 155  39 167]
 [ 11 197 223 ...,  40 135  77]
 [ 68 251 116 ..., 179  21  89]]
center_points type: uint8


In [7]:
%%time
import pqkmeans
import pickle

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
print('encode with subset')
M = 4
encoder = pqkmeans.encoder.PQEncoder(num_subdim=M, Ks=256)
# Q. Do we have to use only a subset? If we have a train/test split, use train for this and test for queries.
# encoder.fit(data_points[:1000])
# time complexity for PQ: O(DL) where D is number of points, L is ???
encoder.fit(data_points[:1000000])  # Use the first 1M descriptors to train the visual-word codebook for oxford5k
with open('encoder.pkl', 'wb') as f:
    pickle.dump(encoder, f)

encode with subset
CPU times: user 18min 45s, sys: 18h 12min 45s, total: 18h 31min 31s
Wall time: 29min 5s
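To make the encoding step concrete, here is a small NumPy sketch of what a product-quantization code is: each 128-d vector is split into `M = 4` sub-vectors of 32 dimensions, and each sub-vector is replaced by the index of its nearest codeword. This is not the `pqkmeans` implementation. The function `pq_encode` is hypothetical, and the codebooks are random placeholders rather than trained ones.

```python
import numpy as np

def pq_encode(vectors, codebooks):
    """vectors: (N, D). codebooks: (M, Ks, D // M) with Ks <= 256. Returns (N, M) uint8 codes."""
    M, Ks, sub_dim = codebooks.shape
    N, D = vectors.shape
    assert D == M * sub_dim and Ks <= 256
    codes = np.empty((N, M), dtype=np.uint8)
    for m in range(M):
        # Slice out the m-th sub-vector of every input vector.
        sub = vectors[:, m * sub_dim:(m + 1) * sub_dim].astype(np.float32)
        # Squared distance from each sub-vector to each codeword, shape (N, Ks).
        d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
# Placeholder codebooks: a real encoder would learn these from training descriptors.
codebooks = rng.integers(0, 256, size=(4, 256, 32)).astype(np.float32)
vectors = rng.integers(0, 256, size=(10, 128), dtype=np.uint8)
codes = pq_encode(vectors, codebooks)
print(codes.shape, codes.dtype)
```

Each descriptor shrinks from 128 bytes to 4 bytes, which is what lets all 16 million codes fit comfortably in memory.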


In [8]:
%%time
print('transform whole set')
# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and transform the input vectors to PQ codes in advance.
from tqdm import tqdm

# For big data that cannot fit in memory, we can use a generator.
pqcode_generator = encoder.transform_generator(data_points)
N, _ = data_points.shape
data_points_pqcode = np.empty([N, M], dtype=encoder.code_dtype)
print("data_points_pqcode.shape:\n{}".format(data_points_pqcode.shape))
print("data_points_pqcode.nbytes:\n{} bytes".format(data_points_pqcode.nbytes))
print("data_points_pqcode.dtype:\n{}".format(data_points_pqcode.dtype))
for n, code in enumerate(tqdm(pqcode_generator, total=N)):
    data_points_pqcode[n, :] = code

# For small data that fits in memory, simply run this.
# data_points_pqcode = encoder.transform(data_points)

np.save('data_points_pqcode.npy', data_points_pqcode)

  0%|          | 1/16334970 [00:00<514:13:56,  8.82it/s]

transform whole set
data_points_pqcode.shape:
(16334970, 4)
data_points_pqcode.nbytes:
65339880 bytes
data_points_pqcode.dtype:
uint8


100%|██████████| 16334970/16334970 [02:54<00:00, 93489.15it/s]

CPU times: user 22min 15s, sys: 1h 34min 8s, total: 1h 56min 23s
Wall time: 2min 54s





In [9]:
%%time
print('run k-means with k:', k)
# Run clustering with k clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=k)
clustered = kmeans.fit_predict(data_points_pqcode)
# clustered[0] is the id of the center assigned to the first input PQ code (data_points_pqcode[0]).

print('clustered len:', len(clustered))

clustering_centers_numpy = np.array(kmeans.cluster_centers_, dtype=encoder.code_dtype)  # Convert to np.array with the proper dtype
np.save('clustering_centers_numpy.npy', clustering_centers_numpy)

run k-means with k: 131072


TypeError: can't pickle _pqkmeans.PQKMeans objects

In [12]:
clustering_centers_numpy = np.array(kmeans.cluster_centers_, dtype=encoder.code_dtype)  # Convert to np.array with the proper dtype
np.save('clustering_centers_numpy.npy', clustering_centers_numpy)

print(len(clustered))
print(clustered[:100])

16334970
[63282, 50010, 92089, 81189, 21063, 66520, 46672, 76169, 49071, 67414, 48013, 29133, 26552, 119182, 82376, 65436, 44761, 34527, 65975, 13034, 37923, 9074, 62471, 77480, 112837, 48880, 48533, 120092, 18133, 5037, 42087, 88979, 80304, 87754, 131045, 16786, 115494, 106030, 74411, 58582, 121602, 45224, 76870, 73495, 10629, 69107, 78043, 113028, 25101, 5110, 7054, 85535, 87005, 42624, 113283, 20710, 65208, 16873, 5370, 33726, 55063, 120055, 1947, 24614, 39952, 82136, 108034, 67621, 18490, 118866, 25119, 20558, 52050, 50779, 27128, 5015, 19313, 91604, 92100, 60879, 280, 8280, 82264, 94363, 90620, 86251, 109833, 77385, 24939, 37156, 81603, 81470, 11455, 75787, 115062, 2887, 90975, 40920, 46697, 94815]


In [None]:
# Q. How to recover the kmeans object from kmeans.cluster_centers_?

In [None]:
%%time
clustering_utils.run_kmeans_clustering(data_points, k, init_center_points=center_points, knn_method="naive")

# Ways to speed up codebook learning

* Annoy to assign centroids for all features
* FLANN vs. Annoy


In [None]:
%%time
center_points = clustering_utils.run_kmeans_clustering(data_points, k, max_iter=15, init_center_points=center_points, knn_method="kdtree")

k means clustering with k=131072, knn_method:kdtree
k-means clustering. start step:0
num_same_centroid: 0
k-means clustering. start step:1
num_same_centroid: 1070631
k-means clustering. start step:2
num_same_centroid: 6305761
k-means clustering. start step:3
num_same_centroid: 8634853
k-means clustering. start step:4
num_same_centroid: 9524823
k-means clustering. start step:5
num_same_centroid: 9975007
k-means clustering. start step:6
num_same_centroid: 10253542
k-means clustering. start step:7
num_same_centroid: 10525599
k-means clustering. start step:8
num_same_centroid: 10715520
k-means clustering. start step:9
num_same_centroid: 10820592
k-means clustering. start step:10
num_same_centroid: 10881514
k-means clustering. start step:11
num_same_centroid: 10961578
k-means clustering. start step:12
num_same_centroid: 11051719
k-means clustering. start step:13
num_same_centroid: 11078194
k-means clustering. start step:14
num_same_centroid: 11135441
k-means clustering. start step:15
num_sa

In [111]:
%%time
clustering_utils.run_kmeans_clustering(data_points.astype(float), k, init_center_points=center_points, knn_method="kdtree")

k means clustering with k=20000, knn_method:kdtree
k-means clustering. start step:0
num_same_centroid: 0
k-means clustering. start step:1
num_same_centroid: 74836
k-means clustering. start step:2
num_same_centroid: 78470
k-means clustering. start step:3
num_same_centroid: 79362
k-means clustering. start step:4
num_same_centroid: 79690
k-means clustering. start step:5
num_same_centroid: 79870
k-means clustering. start step:6
num_same_centroid: 79931
k-means clustering. start step:7
num_same_centroid: 79974
k-means clustering. start step:8
num_same_centroid: 79985
k-means clustering. start step:9
num_same_centroid: 79995
k-means clustering. start step:10
num_same_centroid: 80000
cluster assignment is not changed. It means convergence!
CPU times: user 47min 37s, sys: 4.91 s, total: 47min 42s
Wall time: 7min 47s


In [24]:
%%time
clustering_utils.run_kmeans_clustering(data_points, k, init_center_points=center_points, knn_method="kdtree")

k means clustering with k=20000, knn_method:kdtree
k-means clustering. start step:0
num_same_centroid: 0
k-means clustering. start step:1
num_same_centroid: 74767
k-means clustering. start step:2
num_same_centroid: 78475
k-means clustering. start step:3
num_same_centroid: 79350
k-means clustering. start step:4
num_same_centroid: 79707
k-means clustering. start step:5
num_same_centroid: 79870
k-means clustering. start step:6
num_same_centroid: 79921
k-means clustering. start step:7
num_same_centroid: 79954
k-means clustering. start step:8
num_same_centroid: 79980
k-means clustering. start step:9
num_same_centroid: 79993
k-means clustering. start step:10
num_same_centroid: 79994
k-means clustering. start step:11
num_same_centroid: 79997
k-means clustering. start step:12
num_same_centroid: 79999
k-means clustering. start step:13
num_same_centroid: 80000
cluster assignment is not changed. It means convergence!
CPU times: user 36min 40s, sys: 4.51 s, total: 36min 44s
Wall time: 6min 2s


In [5]:

def get_visual_words(kmeans_info, features):
    """
    Convert features to a list of visual word indices.
    """
    

for image in images:
    visual_words = get_visual_words(kmeans_info, features)
    

def save_visual_words_to_file(visual_words, output_dir):
    pass
    

NameError: name 'images' is not defined
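The stubs above could be filled in along these lines. This is only a sketch under assumptions: `images` is taken to be a dict from image name to that image's descriptors, the first argument is taken to be the array of cluster centers, and the words are saved as one `.npy` file per image. None of these choices come from the notebook, and the toy data is made up.

```python
import os
import tempfile

import numpy as np

def get_visual_words(centers, features):
    """Return the index of the nearest center for each feature (brute force)."""
    diffs = features[:, None, :].astype(np.float32) - centers[None, :, :].astype(np.float32)
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def save_visual_words_to_file(visual_words, output_path):
    """Save one image's visual word indices as a .npy file (assumed format)."""
    np.save(output_path, visual_words)

# Toy stand-ins for the real centers and per-image descriptors.
centers = np.array([[0, 0], [100, 100]], dtype=np.uint8)
images = {
    "img_a": np.array([[1, 2], [90, 95]], dtype=np.uint8),
    "img_b": np.array([[99, 99]], dtype=np.uint8),
}

with tempfile.TemporaryDirectory() as output_dir:
    for name, feats in images.items():
        words = get_visual_words(centers, feats)
        save_visual_words_to_file(words, os.path.join(output_dir, name + ".npy"))
    # Read the files back to check the round trip.
    reloaded = {name: np.load(os.path.join(output_dir, name + ".npy")) for name in images}

print(reloaded)
```

Unlike the clustering step, this step does need to know which image each descriptor belongs to, so the per-image grouping has to be kept when the features are loaded.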