# Description

This notebook runs exactly the same code as `09`, but with numba's JIT compilation disabled.

# Disable numba

In [1]:
%env NUMBA_DISABLE_JIT=1

env: NUMBA_DISABLE_JIT=1
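Outside a notebook, the same switch can be set from plain Python. This is a minimal sketch, assuming numba reads `NUMBA_DISABLE_JIT` when it is first imported, so the variable has to be set before any module that imports numba (such as `ccc.coef`) is loaded. The `try`/`except` is only there so the snippet also runs where numba is not installed.

```python
import os

# Must run before the first `import numba` (directly or via another package).
os.environ["NUMBA_DISABLE_JIT"] = "1"

try:
    import numba

    # With the variable set, decorated functions run as plain Python.
    jit_disabled = bool(numba.config.DISABLE_JIT)
except ImportError:
    # numba is not installed, so there is nothing to disable.
    jit_disabled = None

print("NUMBA_DISABLE_JIT =", os.environ["NUMBA_DISABLE_JIT"])
print("numba reports JIT disabled:", jit_disabled)
```

If the variable is set after numba has already been imported in the same process, it will most likely have no effect, which is why the notebook sets it in the very first cell.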


# Remove pycache dir

In [2]:
!echo ${CODE_DIR}




In [3]:
!find ${CODE_DIR} -regex '^.*\(__pycache__\)$' -print

In [4]:
!find ${CODE_DIR} -regex '^.*\(__pycache__\)$' -prune -exec rm -rf {} \;

In [5]:
!find ${CODE_DIR} -regex '^.*\(__pycache__\)$' -print
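The three `find` calls above list the `__pycache__` directories, delete them, and list again to confirm that nothing is left. A rough Python equivalent is sketched below; `remove_pycache` and the throwaway directory layout are hypothetical names used only for illustration, not part of the project.

```python
import shutil
import tempfile
from pathlib import Path


def remove_pycache(root):
    """Delete every `__pycache__` directory under `root`; return the removed paths."""
    # Collect the targets first so the tree is not modified while it is walked.
    targets = [p for p in Path(root).rglob("__pycache__") if p.is_dir()]
    removed = []
    for path in targets:
        if path.exists():  # may already be gone if nested inside a removed one
            shutil.rmtree(path)
            removed.append(path)
    return removed


# Demonstration on a temporary directory instead of the real code directory.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "pkg" / "__pycache__").mkdir(parents=True)
(demo_root / "pkg" / "__pycache__" / "mod.cpython-39.pyc").write_bytes(b"")
(demo_root / "pkg" / "mod.py").write_text("x = 1\n")

removed = remove_pycache(demo_root)
remaining = list(demo_root.rglob("__pycache__"))
print("removed:", len(removed), "| remaining:", len(remaining))
```

Source files are left untouched; only the bytecode cache directories are removed.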

# Modules

In [6]:
import numpy as np

from ccc.coef import ccc

# Settings

In [7]:
N_REPS = 10

In [8]:
np.random.seed(0)

# Setup

In [9]:
# warm-up call: in `09` this lets numba compile all the code before profiling;
# here the JIT is disabled, so nothing is compiled
ccc(np.random.rand(10), np.random.rand(10))

0.15625

# Run with `n_samples` small

## `n_samples=50`

In [10]:
N_SAMPLES = 50

In [11]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [12]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [13]:
%%timeit func()
func()

40.2 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
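To take the same kind of measurement outside IPython, the standard-library `timeit` module can be used. The sketch below uses a stand-in workload in place of the `ccc(x, y)` loop so that it runs on its own. The use of the population standard deviation is an assumption about how the `%%timeit` summary is computed.

```python
import statistics
import timeit


def func():
    # Stand-in workload; in the notebook this loops over `ccc(x, y)`.
    sum(i * i for i in range(1000))


# 7 runs of 10 loops each, mirroring the `%%timeit` report above.
N_RUNS, N_LOOPS = 7, 10
totals = timeit.repeat(func, repeat=N_RUNS, number=N_LOOPS)

# `timeit.repeat` returns the total time of each run; divide to get time per loop.
per_loop = [t / N_LOOPS for t in totals]
mean = statistics.mean(per_loop)
std = statistics.pstdev(per_loop)

print(f"{mean * 1e3:.3f} ms ± {std * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of {N_RUNS} runs, {N_LOOPS} loops each)")
```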


In [14]:
%%prun -s cumulative -l 20 -T 10-n_samples_small_50.txt
func()

 
*** Profile printout saved to text file '10-n_samples_small_50.txt'. 


         6320 function calls (6310 primitive calls) in 0.044 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.044    0.044 {built-in method builtins.exec}
        1    0.000    0.000    0.044    0.044 <string>:1(<module>)
        1    0.000    0.000    0.044    0.044 454136789.py:1(func)
       10    0.000    0.000    0.044    0.004 impl.py:307(ccc)
      139    0.000    0.000    0.040    0.000 threading.py:280(wait)
      546    0.040    0.000    0.040    0.000 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    0.036    0.004 impl.py:492(compute_coef)
       10    0.000    0.000    0.035    0.004 impl.py:485(cdist_func)
       10    0.000    0.000    0.035    0.004 impl.py:192(cdist_parts_parallel)
       69    0.000    0.000    0.035    0.001 threading.py:563(wait)
       70    0.000    0.000    0.034    0.000

## `n_samples=100`

In [15]:
N_SAMPLES = 100

In [16]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [17]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [18]:
%%timeit func()
func()

121 ms ± 566 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [19]:
%%prun -s cumulative -l 20 -T 10-n_samples_small_100.txt
func()

 
*** Profile printout saved to text file '10-n_samples_small_100.txt'. 


         8447 function calls (8437 primitive calls) in 0.124 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.124    0.124 {built-in method builtins.exec}
        1    0.000    0.000    0.124    0.124 <string>:1(<module>)
        1    0.000    0.000    0.124    0.124 454136789.py:1(func)
       10    0.000    0.000    0.124    0.012 impl.py:307(ccc)
      196    0.000    0.000    0.118    0.001 threading.py:280(wait)
      774    0.118    0.000    0.118    0.000 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    0.113    0.011 impl.py:492(compute_coef)
       10    0.000    0.000    0.112    0.011 impl.py:485(cdist_func)
       10    0.000    0.000    0.112    0.011 impl.py:192(cdist_parts_parallel)
       97    0.000    0.000    0.110    0.001 threading.py:563(wait)
      100    0.000    0.000    0.110    0.001

## `n_samples=500`

In [20]:
N_SAMPLES = 500

In [21]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [22]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [23]:
%%timeit func()
func()

134 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [24]:
%%prun -s cumulative -l 20 -T 10-n_samples_small_500.txt
func()

 
*** Profile printout saved to text file '10-n_samples_small_500.txt'. 


         8534 function calls (8524 primitive calls) in 0.137 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.137    0.137 {built-in method builtins.exec}
        1    0.000    0.000    0.137    0.137 <string>:1(<module>)
        1    0.000    0.000    0.137    0.137 454136789.py:1(func)
       10    0.001    0.000    0.137    0.014 impl.py:307(ccc)
      200    0.000    0.000    0.130    0.001 threading.py:280(wait)
      790    0.130    0.000    0.130    0.000 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    0.121    0.012 impl.py:492(compute_coef)
       10    0.000    0.000    0.120    0.012 impl.py:485(cdist_func)
       10    0.000    0.000    0.120    0.012 impl.py:192(cdist_parts_parallel)
      100    0.000    0.000    0.119    0.001 threading.py:563(wait)
      100    0.000    0.000    0.119    0.001

## `n_samples=1000`

In [25]:
N_SAMPLES = 1000

In [26]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [27]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [28]:
%%timeit func()
func()

154 ms ± 893 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [29]:
%%prun -s cumulative -l 20 -T 10-n_samples_small_1000.txt
func()

 
*** Profile printout saved to text file '10-n_samples_small_1000.txt'. 


         8534 function calls (8524 primitive calls) in 0.156 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.156    0.156 {built-in method builtins.exec}
        1    0.000    0.000    0.156    0.156 <string>:1(<module>)
        1    0.000    0.000    0.156    0.156 454136789.py:1(func)
       10    0.001    0.000    0.156    0.016 impl.py:307(ccc)
      200    0.000    0.000    0.148    0.001 threading.py:280(wait)
      790    0.148    0.000    0.148    0.000 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    0.138    0.014 impl.py:492(compute_coef)
       10    0.000    0.000    0.137    0.014 impl.py:485(cdist_func)
       10    0.001    0.000    0.137    0.014 impl.py:192(cdist_parts_parallel)
      100    0.000    0.000    0.135    0.001 threading.py:563(wait)
      100    0.000    0.000    0.135    0.001

**CONCLUSION:** as expected, with relatively small samples, the numba-compiled version (`09-cdist_parts_v04`) performs much better than the non-compiled one.
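Note that the `%%timeit` figures in this section are for `func()`, which calls `ccc` `N_REPS` times. The short calculation below converts them to a per-call cost. It uses only the numbers reported in this notebook; the corresponding figures from `09` are not reproduced here.

```python
N_REPS = 10

# Mean `%%timeit` results for `func()` in this notebook, in milliseconds.
timings_ms = {50: 40.2, 100: 121, 500: 134, 1000: 154}

# Each `func()` call runs `ccc` N_REPS times.
per_call_ms = {n: t / N_REPS for n, t in timings_ms.items()}

for n, t in per_call_ms.items():
    print(f"n_samples={n:>5}: {t:.2f} ms per ccc call")
```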

# Run with `n_samples` large

## `n_samples=50000`

In [30]:
N_SAMPLES = 50000

In [31]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [32]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [33]:
%%timeit func()
func()

2.35 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%%prun -s cumulative -l 20 -T 10-n_samples_large_50000.txt
func()

 
*** Profile printout saved to text file '10-n_samples_large_50000.txt'. 


         8534 function calls (8524 primitive calls) in 2.349 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.349    2.349 {built-in method builtins.exec}
        1    0.000    0.000    2.349    2.349 <string>:1(<module>)
        1    0.000    0.000    2.349    2.349 454136789.py:1(func)
       10    0.002    0.000    2.349    0.235 impl.py:307(ccc)
      200    0.001    0.000    2.326    0.012 threading.py:280(wait)
      790    2.325    0.003    2.325    0.003 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    1.487    0.149 impl.py:492(compute_coef)
       10    0.000    0.000    1.486    0.149 impl.py:485(cdist_func)
       10    0.001    0.000    1.486    0.149 impl.py:192(cdist_parts_parallel)
      100    0.001    0.000    1.479    0.015 _base.py:201(as_completed)
      100    0.000    0.000    1.478    0

## `n_samples=100000`

In [35]:
N_SAMPLES = 100000

In [36]:
x = np.random.rand(N_SAMPLES)
y = np.random.rand(N_SAMPLES)

In [37]:
def func():
    for i in range(N_REPS):
        ccc(x, y)

In [38]:
%%timeit func()
func()

4.7 s ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [39]:
%%prun -s cumulative -l 20 -T 10-n_samples_large_100000.txt
func()

 
*** Profile printout saved to text file '10-n_samples_large_100000.txt'. 


         8534 function calls (8524 primitive calls) in 4.763 seconds

   Ordered by: cumulative time
   List reduced from 125 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.763    4.763 {built-in method builtins.exec}
        1    0.000    0.000    4.763    4.763 <string>:1(<module>)
        1    0.007    0.007    4.763    4.763 454136789.py:1(func)
       10    0.004    0.000    4.756    0.476 impl.py:307(ccc)
      200    0.001    0.000    4.727    0.024 threading.py:280(wait)
      790    4.726    0.006    4.726    0.006 {method 'acquire' of '_thread.lock' objects}
       10    0.000    0.000    2.934    0.293 impl.py:492(compute_coef)
       10    0.000    0.000    2.932    0.293 impl.py:485(cdist_func)
       10    0.002    0.000    2.932    0.293 impl.py:192(cdist_parts_parallel)
      100    0.001    0.000    2.923    0.029 _base.py:201(as_completed)
      100    0.000    0.000    2.922    0

**CONCLUSION:** this is unexpected. With very large samples, the pure-Python version performs better than the numba-compiled one, which is something to look at in the future. The profiling file for 100,000 samples () shows that `cdist_parts_parallel` takes more time in the numba-compiled version than in the Python version. The compiled ARI implementation could perhaps be improved for these large-sample cases.
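Two quick checks on the large-sample runs of this notebook may help when comparing against `09`: how the total time grows when `n_samples` doubles, and what share of the cumulative time is spent under `cdist_parts_parallel`. The values are copied from the `%%timeit` and `%%prun` outputs above; nothing here comes from the numba-compiled run.

```python
# Mean `%%timeit` results for `func()` (10 calls to `ccc`), in seconds.
timings_s = {50_000: 2.35, 100_000: 4.7}

# Cumulative seconds from the `%%prun` tables: (cdist_parts_parallel, total).
profile_s = {50_000: (1.486, 2.349), 100_000: (2.932, 4.763)}

# Growth factor when the number of samples doubles.
ratio = timings_s[100_000] / timings_s[50_000]
print(f"time ratio for 2x samples: {ratio:.2f}")

# Fraction of the profiled time spent under `cdist_parts_parallel`.
share = {n: part / total for n, (part, total) in profile_s.items()}
for n, s in share.items():
    print(f"n_samples={n}: {s:.1%} of cumulative time in cdist_parts_parallel")
```

In this range the non-compiled run time grows roughly in proportion to `n_samples`, and a bit over 60% of it falls under `cdist_parts_parallel` in both cases.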

Haoyu: On my machine, however, the JITed version is slower than the non-JITed version with small sample sizes and faster with large sample sizes. This makes more sense: with small samples, the overhead of JIT compilation outweighs its benefit, whereas with large samples the JITed version can take advantage of the runtime-compiled code.