## Estimate memory usage of co-clustering

This notebook demonstrates how to estimate the memory usage of co-clustering and compares the estimate with the actual memory usage.

NOTE: This notebook makes use of [memory_profiler](https://github.com/pythonprofilers/memory_profiler). Run the following block to install it using `pip`:

In [1]:
# !pip install memory-profiler

In [2]:
import matplotlib.pyplot as plt
import numpy as np 
from memory_profiler import memory_usage, profile
from cgc import coclustering_numpy
from cgc.utils import mem_estimate_coclustering_numpy
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

First, we define a function that compares the actual and estimated memory usage for a given matrix size (m, n) and number of row/column clusters (k, l):

In [3]:
def memory_usage_compare(input, m, n, k, l):
    # Compare real vs estimated memory usage of coclustering_numpy

    # Generate data
    Z = np.random.random(size=(m, n))
    
    # Initialize cluster
    row_clusters = coclustering_numpy._initialize_clusters(m, k)
    col_clusters = coclustering_numpy._initialize_clusters(n, l)
    input['nclusters_row'] = k
    input['nclusters_col'] = l
    input['row_clusters_init'] = row_clusters
    input['col_clusters_init'] = col_clusters

    # Real memory peak
    mem_profile = np.array(memory_usage((coclustering_numpy.coclustering, (Z,), input), interval=0.01))
    mem_peak_real = np.max(mem_profile - mem_profile[0]) + Z.nbytes/2**20 # Add the size of Z because it's initialized before profiling

    # Estimated memory peak
    mem_peak_est = mem_estimate_coclustering_numpy(m, n, k, l, 'MB')[0]
    mem_est_diff = mem_peak_real - mem_peak_est # difference between peak and estimation
    results = np.array([[mem_peak_real, mem_peak_est, mem_est_diff]])

    return results
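
The `Z.nbytes/2**20` term in the function above is the size of the input matrix in MB (counted as 2**20 bytes). As a rough cross-check, this size can be computed by hand from the matrix shape. The sketch below assumes 8-byte float64 elements, which is what `np.random.random` returns; `matrix_size_mb` is a hypothetical helper name used only for illustration and is not part of `cgc`.

```python
def matrix_size_mb(m, n, itemsize=8):
    """Size in MB (2**20 bytes) of a dense m x n matrix.

    itemsize=8 assumes float64 elements, as returned by np.random.random.
    """
    return m * n * itemsize / 2**20


# A few of the matrix shapes used in this notebook
for m, n in [(300000, 50), (500000, 50), (5000, 5000)]:
    print(f"({m}, {n}): {matrix_size_mb(m, n):.2f} MB")
```

For the (300000, 50) matrix this gives about 114.4 MB, which is the amount added to the profiled peak to account for `Z` being allocated before profiling starts.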
  

Then we run the comparison for several combinations of (m, n) and (k, l):

In [4]:
mn_list = [(300000, 50), (500000, 50), (50, 500000), (100, 300000), (5000,5000)]
kl_list = [(100, 20), (250, 20), (20,250), (10, 100), (20, 20)]

input = {
    'errobj': 1.e-5, 
    'niters': 1,
    'low_memory': False,
    'numba_jit': False
}

comp_res = np.empty((0, 3))  # columns: mem_peak_real, mem_peak_est, mem_est_diff
for (m, n), (k, l) in zip(mn_list, kl_list):
    comp_res = np.append(comp_res, memory_usage_compare(input, m, n, k, l), axis=0)


INFO:cgc.utils:Estimated memory usage: 600.85MB, peak number: 1
INFO:cgc.utils:Estimated memory usage: 2217.39MB, peak number: 1
INFO:cgc.utils:Estimated memory usage: 2217.39MB, peak number: 2
INFO:cgc.utils:Estimated memory usage: 715.33MB, peak number: 2
INFO:cgc.utils:Estimated memory usage: 193.21MB, peak number: 1


The comparison output: measured memory peak, estimated memory peak, and the difference between the two (all in MB).

In [5]:
# [mem_peak_real, mem_peak_est, mem_est_diff]
comp_res

array([[ 632.69482422,  600.85391998,   31.84090424],
       [2214.80517578, 2217.38910675,   -2.58393097],
       [2162.32861328, 2217.38910675,  -55.06049347],
       [ 745.27636719,  715.33298492,   29.94338226],
       [ 195.83642578,  193.2144165 ,    2.62200928]])

If we divide the difference by the actual usage, we see that the relative error is small, at most about 5% in absolute value:

In [6]:
# Relative difference (fraction of the measured peak)
comp_res[:,2]/comp_res[:,0]

array([ 0.05032585, -0.00116666, -0.02546352,  0.04017756,  0.01338877])
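
To make the result easier to read, the relative differences can be expressed in percent and summarized by the largest absolute value. The following is a minimal sketch that uses only the measured and estimated peaks printed above, copied in by hand; the variable names are illustrative and the numbers will differ from run to run.

```python
# Measured and estimated memory peaks (MB), copied from the output above
measured = [632.69482422, 2214.80517578, 2162.32861328, 745.27636719, 195.83642578]
estimated = [600.85391998, 2217.38910675, 2217.38910675, 715.33298492, 193.2144165]

# Signed relative error in percent: positive means the estimate is too low
rel_err_pct = [100 * (real - est) / real for real, est in zip(measured, estimated)]

# Largest deviation regardless of sign
worst = max(abs(err) for err in rel_err_pct)

for real, est, err in zip(measured, estimated, rel_err_pct):
    print(f"measured {real:8.1f} MB, estimated {est:8.1f} MB, error {err:+.2f}%")
print(f"largest absolute error: {worst:.2f}%")
```

A positive error means the estimate was below the measured peak, and a negative error means it was above.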