
fix: optimize scores avoiding array creation #2163

Merged — 12 commits merged into master on Mar 16, 2021

Conversation

@davidbp (Contributor) commented Mar 12, 2021

This PR avoids creating many intermediate arrays at ranker runtime.

The change can be benchmarked locally with the Python script minimal_working_example_score.py attached below. The script is self-contained and generates dummy data to measure performance, so it can be copied and executed anywhere.

The results of python minimal_working_example_score.py are

Current Jina _group_by
	n_repetitions=100000, n_retrieved_items=10, time=4.09 sec
	n_repetitions=100000, n_retrieved_items=20, time=4.82 sec
	n_repetitions=100000, n_retrieved_items=30, time=5.33 sec
	n_repetitions=100000, n_retrieved_items=40, time=6.03 sec
	n_repetitions=100000, n_retrieved_items=50, time=6.64 sec
	n_repetitions=100000, n_retrieved_items=60, time=7.23 sec
	n_repetitions=100000, n_retrieved_items=70, time=7.84 sec
	n_repetitions=100000, n_retrieved_items=80, time=8.29 sec
	n_repetitions=100000, n_retrieved_items=90, time=9.19 sec
	n_repetitions=100000, n_retrieved_items=100, time=9.9 sec

Optimized proposal _group_by_optimized
	n_repetitions=100000, n_retrieved_items=10, time=1.38 sec
	n_repetitions=100000, n_retrieved_items=20, time=2.08 sec
	n_repetitions=100000, n_retrieved_items=30, time=2.79 sec
	n_repetitions=100000, n_retrieved_items=40, time=3.59 sec
	n_repetitions=100000, n_retrieved_items=50, time=4.26 sec
	n_repetitions=100000, n_retrieved_items=60, time=5.4 sec
	n_repetitions=100000, n_retrieved_items=70, time=6.15 sec
	n_repetitions=100000, n_retrieved_items=80, time=7.08 sec
	n_repetitions=100000, n_retrieved_items=90, time=7.88 sec
	n_repetitions=100000, n_retrieved_items=100, time=8.65 sec

Table results
                   original execution times (sec)  optimized execution times (sec)
n_retrieved_items                                                                 
10                                           4.09                             1.38
20                                           4.82                             2.08
30                                           5.33                             2.79
40                                           6.03                             3.59
50                                           6.64                             4.26
60                                           7.23                             5.40
70                                           7.84                             6.15
80                                           8.29                             7.08
90                                           9.19                             7.88
100                                          9.90                             8.65
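Reading the table, the speedup shrinks as the batch grows (roughly 3x at 10 items, about 1.1x at 100), which is consistent with the Python-level loop in the optimized version costing relatively more as n grows. A quick sketch (not part of the PR) computing the ratios from the table above:

```python
# Speedup ratios computed from the benchmark table above.
original = [4.09, 4.82, 5.33, 6.03, 6.64, 7.23, 7.84, 8.29, 9.19, 9.90]
optimized = [1.38, 2.08, 2.79, 3.59, 4.26, 5.40, 6.15, 7.08, 7.88, 8.65]
speedups = [round(t_old / t_new, 2) for t_old, t_new in zip(original, optimized)]
for n, s in zip(range(10, 101, 10), speedups):
    print(f'n_retrieved_items={n}: {s}x')
```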

Script minimal_working_example_score.py:

import numpy as np

COL_STR_TYPE = 'U64'  
        
def get_data_batch(n,
                   n_parent_ids=10,
                   n_max=100):
    """Generate a structured array of ``n`` dummy matches."""
    doc_id = np.random.randint(0, n_parent_ids, n)
    chunk_id1 = np.random.randint(0, n_max, n)
    chunk_id2 = np.random.randint(0, n_max, n)
    scores = np.random.rand(n)

    r = [(a1, a2, a3, a4) for a1, a2, a3, a4 in zip(doc_id, chunk_id1, chunk_id2, scores)]
    return np.array(r, dtype=[
                ('c0', COL_STR_TYPE),
                ('c1', COL_STR_TYPE),
                ('c2', COL_STR_TYPE),
                ('c3', np.float64)])

######### Current Jina code

def _sort_doc_by_score(r):
    r = np.array(
        r,
        dtype=[
            ('ids', COL_STR_TYPE),
            ('scores', np.float64),
        ],
    )
    return np.sort(r, order='scores')[::-1]

def _score_list(_groups):    
    r = []
    for _g in _groups:
        match_id = _g[0]['c0']
        score = np.random.rand()
        r.append((match_id, score))

    return _sort_doc_by_score(r)

def _group_by(match_idx, col_name):
    # sort by ``col_name``; np.unique then re-derives the group sizes
    _sorted_m = np.sort(match_idx, order=col_name)
    _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True)
    return np.split(_sorted_m, np.cumsum(_doc_counts))[:-1]


######### Proposed Jina code

def _score_list_optimized(_groups):
    # preallocate a single structured array instead of building a list of tuples
    n_groups = len(_groups)
    res = np.empty((n_groups,), dtype=[('ids', COL_STR_TYPE), ('scores', np.float64)])
    for i, _g in enumerate(_groups):
        res[i] = (_g['c0'][0], np.random.rand())
    return res

def _score_optimized(_groups):
    res = _score_list_optimized(_groups)
    res[::-1].sort(order='scores')
    return res

def _group_by_optimized(match_idx, col_name):
    # single pass over the sorted column: emit a slice each time the key changes
    _sorted_m = np.sort(match_idx, order=col_name)
    list_numpy_arrays = []
    prev_val = _sorted_m[col_name][0]
    prev_index = 0
    for i, current_val in enumerate(_sorted_m[col_name]):
        if current_val != prev_val:
            list_numpy_arrays.append(_sorted_m[prev_index:i])
            prev_index = i
            prev_val = current_val
    # the last run never triggers the branch above, so close it here
    list_numpy_arrays.append(_sorted_m[prev_index:])
    return list_numpy_arrays

if __name__ == '__main__':

    import timeit    
    import pandas as pd
    
    n_repetitions = 100000
    n_retrieved_items_tests = [10,20,30,40,50,60,70,80,90,100]

    times_original = []
    times_optimized = []

    print('\nCurrent Jina _group_by')
    for n_retrieved_items in n_retrieved_items_tests:    
        setup = (f"from __main__ import _group_by, get_data_batch; "
                 f"match_idx = get_data_batch({n_retrieved_items})")
        t_original = timeit.timeit('_group_by(match_idx, "c0")',
                                   number=n_repetitions, setup=setup)
        t_original = round(t_original, 2)

        print(f'\tn_repetitions={n_repetitions}, n_retrieved_items={n_retrieved_items}, time={t_original} sec')
        times_original.append(t_original)

    print('\nOptimized proposal _group_by_optimized')
    for n_retrieved_items in n_retrieved_items_tests:    
        setup = (f"from __main__ import _group_by_optimized, get_data_batch; "
                 f"match_idx = get_data_batch({n_retrieved_items})")
        t_optimized = timeit.timeit('_group_by_optimized(match_idx, "c0")',
                                    number=n_repetitions, setup=setup)
        t_optimized = round(t_optimized, 2)
        print(f'\tn_repetitions={n_repetitions}, n_retrieved_items={n_retrieved_items}, time={t_optimized} sec')
        times_optimized.append(t_optimized)

    ## Gather results and plot a table
    result = pd.DataFrame({'original execution times (sec)':times_original, 
                           'optimized execution times (sec)':times_optimized},
                           index = n_retrieved_items_tests)
    result.index.name = 'n_retrieved_items'
    print('\nTable results')
    print(result)
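As a standalone sanity check (my own sketch, not part of the PR), the unique+split grouping and the single-pass grouping can be compared on a tiny structured array to confirm they produce identical groups; the function names here are illustrative:

```python
import numpy as np

def group_by_unique_split(arr, col):
    # np.unique sorts internally, so this adds a second O(N log N) pass
    s = np.sort(arr, order=col)
    _, counts = np.unique(s[col], return_counts=True)
    return np.split(s, np.cumsum(counts))[:-1]

def group_by_single_pass(arr, col):
    # one linear scan over the already-sorted column
    s = np.sort(arr, order=col)
    out, prev_val, prev_idx = [], s[col][0], 0
    for i, val in enumerate(s[col]):
        if val != prev_val:
            out.append(s[prev_idx:i])
            prev_idx, prev_val = i, val
    out.append(s[prev_idx:])  # close the final run
    return out

data = np.array([('1', 0.3), ('0', 0.1), ('1', 0.9), ('0', 0.5)],
                dtype=[('c0', 'U64'), ('c3', np.float64)])
a = group_by_unique_split(data, 'c0')
b = group_by_single_pass(data, 'c0')
assert len(a) == len(b) == 2
assert all(np.array_equal(x, y) for x, y in zip(a, b))
```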

@jina-bot jina-bot added size/S area/core This issue/PR affects the core codebase component/driver labels Mar 12, 2021
@@ -72,30 +72,35 @@ def _insert_query_matches(

@staticmethod
def _group_by(match_idx, col_name):
# sort by ``col``
"""
Create a list of numpy arrays with the same ``doc_id`` in each position of the list
Member commented:
I guess it depends on the col_name, right?

r.append((match_id, score))
return self._sort_doc_by_score(r)
n_groups = len(_groups)
res = np.empty((n_groups,), dtype=[('ids','U64'), ('scores', np.float64)] )
Member commented:
instead of 'U64' use the Chunk2DocRanker variable to define the type

Member commented:
'scores' is also a name found in Chunk2DocRanker namespace


for i,_g in enumerate(_groups):
#res[i] = (match_id, score)
res[i] = (_g['c0'][0], self.exec_fn(_g, query_chunk_meta, match_chunk_meta))
Member commented:
what is this 'c0'?

@github-actions bot commented Mar 12, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1054, delta to last 3 avg.: +1%
  • 😶 query QPS at 17, delta to last 3 avg.: -1%

Breakdown

Version   Index QPS   Query QPS
current        1054          17
1.0.10         1051          16
1.0.9          1034          17

Backed by latency-tracking. Further commits will update this comment.

@codecov bot commented Mar 12, 2021

Codecov Report

Merging #2163 (8c7f595) into master (e80b383) will increase coverage by 1.46%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2163      +/-   ##
==========================================
+ Coverage   89.00%   90.46%   +1.46%     
==========================================
  Files         211      211              
  Lines       11269    11278       +9     
==========================================
+ Hits        10030    10203     +173     
+ Misses       1239     1075     -164     
Flag Coverage Δ
daemon 50.20% <0.00%> (-0.05%) ⬇️
jina 90.92% <100.00%> (+1.57%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
jina/drivers/rank/aggregate/__init__.py 100.00% <100.00%> (ø)
jina/helper.py 83.65% <0.00%> (+0.84%) ⬆️
jina/types/message/__init__.py 88.20% <0.00%> (+1.53%) ⬆️
jina/drivers/control.py 95.08% <0.00%> (+1.63%) ⬆️
jina/peapods/zmq/__init__.py 82.01% <0.00%> (+2.13%) ⬆️
jina/peapods/runtimes/jinad/client.py 82.48% <0.00%> (+2.18%) ⬆️
jina/drivers/convertdriver.py 97.22% <0.00%> (+2.77%) ⬆️
jina/flow/mixin/crud.py 89.65% <0.00%> (+3.44%) ⬆️
jina/peapods/runtimes/zmq/zed.py 91.48% <0.00%> (+3.54%) ⬆️
jina/peapods/runtimes/jinad/__init__.py 95.91% <0.00%> (+4.08%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e80b383...8c7f595. Read the comment docs.

@JoanFM mentioned this pull request Mar 12, 2021
_, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True)
# group by ``col``
return np.split(_sorted_m, np.cumsum(_doc_counts))[:-1]
list_numpy_arrays = []
Member commented:
Just curious, why do u think this is an optimization?

Contributor commented:
I think it's because _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True) will do the sorting again, so at least O(NlogN)?
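To illustrate the reviewer's point (a sketch of mine, not from the PR): np.unique sorts its input internally and returns the unique values in sorted order, so calling it on a column that was already sorted repeats an O(N log N) pass:

```python
import numpy as np

# np.unique sorts internally, so its output is sorted regardless of input order;
# on a column that is already sorted this sorting is redundant work.
col = np.array(['2', '0', '1', '0', '2'])
uniques, counts = np.unique(col, return_counts=True)
print(uniques.tolist())  # ['0', '1', '2']
print(counts.tolist())   # [2, 1, 2]
```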

jina/drivers/rank/aggregate/__init__.py Outdated Show resolved Hide resolved
jina/drivers/rank/aggregate/__init__.py Outdated Show resolved Hide resolved
@Yongxuanzhang (Contributor) left a comment:
This is really cool! Can we check which function contributes most to the optimization?


@Yongxuanzhang (Contributor) left a comment:
I think most of the optimization comes from _group_by; the gain from the other function is not so obvious.

list_numpy_arrays.append(_sorted_m[prev_val:i])
prev_val = i
current = val
if i == n_elements-1 and val == current:
Member commented:
can this be handled outside the for loop, once u are sure u handled the last element? what is the problem here?
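One way to address this comment (a sketch with hypothetical names, not the PR's final code) is to drop the in-loop end-of-array check and close the final run unconditionally after the loop, since the last group always extends to the end of the array:

```python
import numpy as np

def group_runs(sorted_col, sorted_arr):
    """Split sorted_arr into runs of equal sorted_col values, closing the
    final run after the loop instead of testing a condition inside it."""
    groups, prev_idx = [], 0
    for i in range(1, len(sorted_col)):
        if sorted_col[i] != sorted_col[i - 1]:
            groups.append(sorted_arr[prev_idx:i])
            prev_idx = i
    groups.append(sorted_arr[prev_idx:])  # last run always ends at the array end
    return groups

arr = np.array([1, 1, 2, 3, 3, 3])
parts = group_runs(arr, arr)
print([p.tolist() for p in parts])  # [[1, 1], [2], [3, 3, 3]]
```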

@davidbp marked this pull request as ready for review March 16, 2021 10:37
@davidbp requested a review from a team as a code owner March 16, 2021 10:37
@davidbp merged commit fc0888e into master Mar 16, 2021
@davidbp deleted the optimize-score-aggregate-matches-ranker branch March 16, 2021 11:00