## Estimate memory usage of co-clustering

This notebook demonstrates how to estimate the memory usage of co-clustering and compares the estimate with the actual memory usage.

NOTE: This notebook uses `memory_profiler`. If it is not installed, uncomment and run the following cell to install it with pip:

In [1]:
# !pip install memory-profiler

In [2]:
import matplotlib.pyplot as plt
import numpy as np 
from memory_profiler import memory_usage, profile
from cgc import coclustering_numpy
from cgc.utils import mem_estimate_coclustering_numpy
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

First we define a function that compares the actual and estimated memory usage for a given set of (m, n, k, l), i.e. the matrix dimensions and the numbers of row and column clusters.

In [3]:
def memory_usage_comapre(input, m, n, k, l):
    # Compare the real vs. estimated memory usage of coclustering_numpy

    # Generate data
    Z = np.random.random(size=(m, n))
    
    # Initialize cluster
    row_clusters = coclustering_numpy._initialize_clusters(m, k)
    col_clusters = coclustering_numpy._initialize_clusters(n, l)
    input['nclusters_row'] = k
    input['nclusters_col'] = l
    input['row_clusters_init'] = row_clusters
    input['col_clusters_init'] = col_clusters

    # Real memory peak
    mem_profile = np.array(memory_usage((coclustering_numpy.coclustering, (Z,), input), interval=0.01))
    mem_peak_real = np.max(mem_profile - mem_profile[0]) + Z.nbytes/2**20 # Add the size of Z because it's initialized before profiling

    # Estimated memory peak
    mem_peak_est = mem_estimate_coclustering_numpy(m, n, k, l, 'MB')[0]
    mem_est_diff = mem_peak_real - mem_peak_est # real peak minus estimated peak
    results = np.array([[mem_peak_real, mem_peak_est, mem_est_diff]])

    return results
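
The peak calculation in the function above can be illustrated in isolation. The following is a minimal sketch in plain Python, not part of `cgc` or `memory_profiler`: the helper name `peak_above_baseline` and the sample values are hypothetical. It takes the largest increase over the first sample and then adds the size of the matrix allocated before profiling started, converted from bytes to MB by dividing by 2**20.

```python
def peak_above_baseline(samples_mb, preallocated_bytes=0):
    """Peak increase over the first sample, plus memory allocated beforehand (in MB)."""
    baseline = samples_mb[0]
    peak_increase = max(sample - baseline for sample in samples_mb)
    return peak_increase + preallocated_bytes / 2**20


# Hypothetical memory samples in MB (illustrative values, not measured data)
samples = [200.0, 210.5, 480.0, 350.25, 205.0]

# A float64 matrix of shape (m, n) occupies m * n * 8 bytes
m, n = 300000, 50
z_bytes = m * n * 8

print(f"Z size: {z_bytes / 2**20:.2f} MB")
print(f"Peak including Z: {peak_above_baseline(samples, z_bytes):.2f} MB")
```

Subtracting the first sample removes the memory already held by the Python process, which is why the size of `Z` has to be added back explicitly.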
  

Then we run the comparison for several combinations of (m, n) and (k, l).
Because of initialisation inside `memory_profiler`, the first comparison may report artificially high memory usage. Execute the cell below twice to avoid this effect.

In [4]:
mn_list = [(300000, 50), (500000, 50), (50, 500000), (100, 300000), (5000,5000)]
kl_list = [(100, 20), (250, 20), (20,250), (10, 100), (20, 20)]

input = {'errobj':1.e-5, 
         'niters':1, 
         'epsilon':1.e-8}
input['low_memory'] = False
input['numba_jit'] = False

comp_res = np.empty((0,3)) # [row: mem_peak_real, mem_peak_est, mem_est_diff]
for (m, n), (k, l) in zip(mn_list, kl_list):
    comp_res = np.append(comp_res, memory_usage_comapre(input, m, n, k, l), axis=0)


INFO:cgc.utils:Estimated memory usage: 686.69MB, peak number: 1
INFO:cgc.utils:Estimated memory usage: 686.69MB, peak number: 1
INFO:cgc.utils:Estimated memory usage: 2575.02MB, peak number: 1
INFO:cgc.utils:Estimated memory usage: 2575.02MB, peak number: 2
INFO:cgc.utils:Estimated memory usage: 801.17MB, peak number: 2
INFO:cgc.utils:Estimated memory usage: 193.79MB, peak number: 1


The comparison output: real memory peak, estimated memory peak, and real minus estimated (all in MB):

In [5]:
# [mem_peak_real, mem_peak_est, mem_est_diff]
comp_res

array([[ 758.35107422,  686.68746948,   71.66360474],
       [ 670.96435547,  686.68746948,  -15.72311401],
       [2589.81689453, 2575.01983643,   14.79705811],
       [2552.00830078, 2575.01983643,  -23.01153564],
       [ 813.09667969,  801.16653442,   11.93014526],
       [ 190.75830078,  193.78662109,   -3.02832031]])

Dividing the difference by the actual usage gives the relative error of the estimate. In absolute value it is below 10% for the first run and within about 2.5% for the remaining runs, so we can conclude that the estimate is quite accurate.

In [6]:
# Relative difference: (real - estimated) / real
comp_res[:,2]/comp_res[:,0]

array([ 0.09449925, -0.02343361,  0.00571355, -0.00901703,  0.01467248,
       -0.01587517])
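
To summarise these relative differences in a single number, one option is to take the largest absolute value. The sketch below uses plain Python and the (real, estimated) pairs copied from the comparison output above. The variable names are illustrative only.

```python
# (real, estimated) peak memory in MB, copied from the comparison output above
pairs = [
    (758.35107422, 686.68746948),
    (670.96435547, 686.68746948),
    (2589.81689453, 2575.01983643),
    (2552.00830078, 2575.01983643),
    (813.09667969, 801.16653442),
    (190.75830078, 193.78662109),
]

# Relative difference of each run: (real - estimated) / real
rel_diff = [(real - est) / real for real, est in pairs]

# Largest deviation, regardless of sign
worst = max(abs(r) for r in rel_diff)
print(f"Largest relative difference: {worst:.1%}")
```

A positive value means the estimate was below the measured peak, and a negative value means it was above.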