
fix: optimize scores avoiding array creation #2163

Merged — 12 commits merged into master on Mar 16, 2021

Conversation

@davidbp (Contributor) commented Mar 12, 2021

This PR avoids creating many intermediate arrays at ranker runtime.

The change can be benchmarked locally with the Python script minimal_working_example_score.py attached below. The script is self-contained and generates dummy data to measure performance, so it can be copied and executed anywhere.

The results of python minimal_working_example_score.py are

Current Jina _group_by
	n_repetitions=100000, n_retrieved_items=10, time=4.09 sec
	n_repetitions=100000, n_retrieved_items=20, time=4.82 sec
	n_repetitions=100000, n_retrieved_items=30, time=5.33 sec
	n_repetitions=100000, n_retrieved_items=40, time=6.03 sec
	n_repetitions=100000, n_retrieved_items=50, time=6.64 sec
	n_repetitions=100000, n_retrieved_items=60, time=7.23 sec
	n_repetitions=100000, n_retrieved_items=70, time=7.84 sec
	n_repetitions=100000, n_retrieved_items=80, time=8.29 sec
	n_repetitions=100000, n_retrieved_items=90, time=9.19 sec
	n_repetitions=100000, n_retrieved_items=100, time=9.9 sec

Optimized proposal _group_by_optimized
	n_repetitions=100000, n_retrieved_items=10, time=1.38 sec
	n_repetitions=100000, n_retrieved_items=20, time=2.08 sec
	n_repetitions=100000, n_retrieved_items=30, time=2.79 sec
	n_repetitions=100000, n_retrieved_items=40, time=3.59 sec
	n_repetitions=100000, n_retrieved_items=50, time=4.26 sec
	n_repetitions=100000, n_retrieved_items=60, time=5.4 sec
	n_repetitions=100000, n_retrieved_items=70, time=6.15 sec
	n_repetitions=100000, n_retrieved_items=80, time=7.08 sec
	n_repetitions=100000, n_retrieved_items=90, time=7.88 sec
	n_repetitions=100000, n_retrieved_items=100, time=8.65 sec

Table results
                   original execution times (sec)  optimized execution times (sec)
n_retrieved_items                                                                 
10                                           4.09                             1.38
20                                           4.82                             2.08
30                                           5.33                             2.79
40                                           6.03                             3.59
50                                           6.64                             4.26
60                                           7.23                             5.40
70                                           7.84                             6.15
80                                           8.29                             7.08
90                                           9.19                             7.88
100                                          9.90                             8.65
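Reading the table, the speedup shrinks as the batch grows (roughly 3x at 10 items, about 1.1x at 100), which is consistent with the Python-level loop in the optimized version costing relatively more as n grows. A quick sketch (not part of the PR) computing the ratios from the table above:

```python
# Speedup ratios computed from the benchmark table above.
original = [4.09, 4.82, 5.33, 6.03, 6.64, 7.23, 7.84, 8.29, 9.19, 9.90]
optimized = [1.38, 2.08, 2.79, 3.59, 4.26, 5.40, 6.15, 7.08, 7.88, 8.65]
speedups = [round(t_old / t_new, 2) for t_old, t_new in zip(original, optimized)]
for n, s in zip(range(10, 101, 10), speedups):
    print(f'n_retrieved_items={n}: {s}x')
```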

Script minimal_working_example_score.py:

import numpy as np

COL_STR_TYPE = 'U64'  
        
def get_data_batch(n,
                   n_parent_ids=10,
                   n_max=100):
    """Generate a structured array of ``n`` dummy matches."""
    doc_id = np.random.randint(0, n_parent_ids, n)
    chunk_id1 = np.random.randint(0, n_max, n)
    chunk_id2 = np.random.randint(0, n_max, n)
    scores = np.random.rand(n)

    r = [(a1, a2, a3, a4) for a1, a2, a3, a4 in zip(doc_id, chunk_id1, chunk_id2, scores)]
    return np.array(r, dtype=[
                ('c0', COL_STR_TYPE),
                ('c1', COL_STR_TYPE),
                ('c2', COL_STR_TYPE),
                ('c3', np.float64)])

######### Current Jina code

def _sort_doc_by_score(r):
    r = np.array(
        r,
        dtype=[
            ('ids', COL_STR_TYPE),
            ('scores', np.float64),
        ],
    )
    return np.sort(r, order='scores')[::-1]

def _score_list(_groups):    
    r = []
    for _g in _groups:
        match_id = _g[0]['c0']
        score = np.random.rand()
        r.append((match_id, score))

    return _sort_doc_by_score(r)

def _group_by(match_idx, col_name):
    # sort by ``col_name``; np.unique then re-derives the group sizes
    _sorted_m = np.sort(match_idx, order=col_name)
    _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True)
    return np.split(_sorted_m, np.cumsum(_doc_counts))[:-1]


######### Proposed Jina code

def _score_list_optimized(_groups):
    # preallocate a single structured array instead of building a list of tuples
    n_groups = len(_groups)
    res = np.empty((n_groups,), dtype=[('ids', COL_STR_TYPE), ('scores', np.float64)])
    for i, _g in enumerate(_groups):
        res[i] = (_g['c0'][0], np.random.rand())
    return res

def _score_optimized(_groups):
    res = _score_list_optimized(_groups)
    res[::-1].sort(order='scores')
    return res

def _group_by_optimized(match_idx, col_name):
    # single pass over the sorted column: emit a slice each time the key changes
    _sorted_m = np.sort(match_idx, order=col_name)
    list_numpy_arrays = []
    prev_val = _sorted_m[col_name][0]
    prev_index = 0
    for i, current_val in enumerate(_sorted_m[col_name]):
        if current_val != prev_val:
            list_numpy_arrays.append(_sorted_m[prev_index:i])
            prev_index = i
            prev_val = current_val
    # the last run never triggers the branch above, so close it here
    list_numpy_arrays.append(_sorted_m[prev_index:])
    return list_numpy_arrays

if __name__ == '__main__':

    import timeit    
    import pandas as pd
    
    n_repetitions = 100000
    n_retrieved_items_tests = [10,20,30,40,50,60,70,80,90,100]

    times_original = []
    times_optimized = []

    print('\nCurrent Jina _group_by')
    for n_retrieved_items in n_retrieved_items_tests:    
        setup = (f"from __main__ import _group_by, get_data_batch; "
                 f"match_idx = get_data_batch({n_retrieved_items})")
        t_original = timeit.timeit('_group_by(match_idx, "c0")',
                                   number=n_repetitions, setup=setup)
        t_original = round(t_original, 2)

        print(f'\tn_repetitions={n_repetitions}, n_retrieved_items={n_retrieved_items}, time={t_original} sec')
        times_original.append(t_original)

    print('\nOptimized proposal _group_by_optimized')
    for n_retrieved_items in n_retrieved_items_tests:    
        setup = (f"from __main__ import _group_by_optimized, get_data_batch; "
                 f"match_idx = get_data_batch({n_retrieved_items})")
        t_optimized = timeit.timeit('_group_by_optimized(match_idx, "c0")',
                                    number=n_repetitions, setup=setup)
        t_optimized = round(t_optimized, 2)
        print(f'\tn_repetitions={n_repetitions}, n_retrieved_items={n_retrieved_items}, time={t_optimized} sec')
        times_optimized.append(t_optimized)

    ## Gather results and plot a table
    result = pd.DataFrame({'original execution times (sec)':times_original, 
                           'optimized execution times (sec)':times_optimized},
                           index = n_retrieved_items_tests)
    result.index.name = 'n_retrieved_items'
    print('\nTable results')
    print(result)
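As a standalone sanity check (my own sketch, not part of the PR), the unique+split grouping and the single-pass grouping can be compared on a tiny structured array to confirm they produce identical groups; the function names here are illustrative:

```python
import numpy as np

def group_by_unique_split(arr, col):
    # np.unique sorts internally, so this adds a second O(N log N) pass
    s = np.sort(arr, order=col)
    _, counts = np.unique(s[col], return_counts=True)
    return np.split(s, np.cumsum(counts))[:-1]

def group_by_single_pass(arr, col):
    # one linear scan over the already-sorted column
    s = np.sort(arr, order=col)
    out, prev_val, prev_idx = [], s[col][0], 0
    for i, val in enumerate(s[col]):
        if val != prev_val:
            out.append(s[prev_idx:i])
            prev_idx, prev_val = i, val
    out.append(s[prev_idx:])  # close the final run
    return out

data = np.array([('1', 0.3), ('0', 0.1), ('1', 0.9), ('0', 0.5)],
                dtype=[('c0', 'U64'), ('c3', np.float64)])
a = group_by_unique_split(data, 'c0')
b = group_by_single_pass(data, 'c0')
assert len(a) == len(b) == 2
assert all(np.array_equal(x, y) for x, y in zip(a, b))
```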

@jina-bot jina-bot added size/S area/core This issue/PR affects the core codebase component/driver labels Mar 12, 2021
@@ -72,30 +72,35 @@ def _insert_query_matches(

@staticmethod
def _group_by(match_idx, col_name):
# sort by ``col``
"""
Create a list of numpy arrays with the same ``doc_id`` in each position of the list
Member commented:
I guess it depends on the col_name, right?

r.append((match_id, score))
return self._sort_doc_by_score(r)
n_groups = len(_groups)
res = np.empty((n_groups,), dtype=[('ids','U64'), ('scores', np.float64)] )
Member commented:
instead of 'U64' use the Chunk2DocRanker variable to define the type

Member commented:
'scores' is also a name found in Chunk2DocRanker namespace


for i,_g in enumerate(_groups):
#res[i] = (match_id, score)
res[i] = (_g['c0'][0], self.exec_fn(_g, query_chunk_meta, match_chunk_meta))
Member commented:
what is this 'c0'?

@github-actions bot commented Mar 12, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1054, delta to last 3 avg.: +1%
  • 😶 query QPS at 17, delta to last 3 avg.: -1%

Breakdown

Version   Index QPS   Query QPS
current        1054          17
1.0.10         1051          16
1.0.9          1034          17

Backed by latency-tracking. Further commits will update this comment.

@codecov bot commented Mar 12, 2021

Codecov Report

Merging #2163 (8c7f595) into master (e80b383) will increase coverage by 1.46%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2163      +/-   ##
==========================================
+ Coverage   89.00%   90.46%   +1.46%     
==========================================
  Files         211      211              
  Lines       11269    11278       +9     
==========================================
+ Hits        10030    10203     +173     
+ Misses       1239     1075     -164     
Flag Coverage Δ
daemon 50.20% <0.00%> (-0.05%) ⬇️
jina 90.92% <100.00%> (+1.57%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
jina/drivers/rank/aggregate/__init__.py 100.00% <100.00%> (ø)
jina/helper.py 83.65% <0.00%> (+0.84%) ⬆️
jina/types/message/__init__.py 88.20% <0.00%> (+1.53%) ⬆️
jina/drivers/control.py 95.08% <0.00%> (+1.63%) ⬆️
jina/peapods/zmq/__init__.py 82.01% <0.00%> (+2.13%) ⬆️
jina/peapods/runtimes/jinad/client.py 82.48% <0.00%> (+2.18%) ⬆️
jina/drivers/convertdriver.py 97.22% <0.00%> (+2.77%) ⬆️
jina/flow/mixin/crud.py 89.65% <0.00%> (+3.44%) ⬆️
jina/peapods/runtimes/zmq/zed.py 91.48% <0.00%> (+3.54%) ⬆️
jina/peapods/runtimes/jinad/__init__.py 95.91% <0.00%> (+4.08%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e80b383...8c7f595. Read the comment docs.

@JoanFM mentioned this pull request Mar 12, 2021
_, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True)
# group by ``col``
return np.split(_sorted_m, np.cumsum(_doc_counts))[:-1]
list_numpy_arrays = []
Member commented:
Just curious, why do u think this is an optimization?

Contributor commented:
I think it's because _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True) will do the sorting again, so at least O(NlogN)?
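To illustrate the reviewer's point (a sketch of mine, not from the PR): np.unique sorts its input internally and returns the unique values in sorted order, so calling it on a column that was already sorted repeats an O(N log N) pass:

```python
import numpy as np

# np.unique sorts internally, so its output is sorted regardless of input order;
# on a column that is already sorted this sorting is redundant work.
col = np.array(['2', '0', '1', '0', '2'])
uniques, counts = np.unique(col, return_counts=True)
print(uniques.tolist())  # ['0', '1', '2']
print(counts.tolist())   # [2, 1, 2]
```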

jina/drivers/rank/aggregate/__init__.py Outdated Show resolved Hide resolved
jina/drivers/rank/aggregate/__init__.py Outdated Show resolved Hide resolved
@Yongxuanzhang (Contributor) left a comment:
This is really cool! Can we check which function contributes most to the optimization?


@Yongxuanzhang (Contributor) left a comment:
I think most of the optimization comes from _group_by; the gain from the other function is not so obvious.

list_numpy_arrays.append(_sorted_m[prev_val:i])
prev_val = i
current = val
if i == n_elements-1 and val == current:
Member commented:
can this be handled outside the for loop, once u are sure u handled the last element? what is the problem here?
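One way to address this comment (a sketch with hypothetical names, not the PR's final code) is to drop the in-loop end-of-array check and close the final run unconditionally after the loop, since the last group always extends to the end of the array:

```python
import numpy as np

def group_runs(sorted_col, sorted_arr):
    """Split sorted_arr into runs of equal sorted_col values, closing the
    final run after the loop instead of testing a condition inside it."""
    groups, prev_idx = [], 0
    for i in range(1, len(sorted_col)):
        if sorted_col[i] != sorted_col[i - 1]:
            groups.append(sorted_arr[prev_idx:i])
            prev_idx = i
    groups.append(sorted_arr[prev_idx:])  # last run always ends at the array end
    return groups

arr = np.array([1, 1, 2, 3, 3, 3])
parts = group_runs(arr, arr)
print([p.tolist() for p in parts])  # [[1, 1], [2], [3, 3, 3]]
```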

@davidbp marked this pull request as ready for review March 16, 2021 10:37
@davidbp requested a review from a team as a code owner March 16, 2021 10:37
@davidbp merged commit fc0888e into master Mar 16, 2021
@davidbp deleted the optimize-score-aggregate-matches-ranker branch March 16, 2021 11:00