-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Improvement] speed up confusion matrix calculation #465
Conversation
Codecov Report
@@ Coverage Diff @@
## master #465 +/- ##
=======================================
Coverage 84.67% 84.67%
=======================================
Files 118 118
Lines 8347 8348 +1
Branches 1366 1365 -1
=======================================
+ Hits 7068 7069 +1
Misses 932 932
Partials 347 347
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Any numbers on benchmark? |
classes = 10
length = 100
times = 10000
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
new_confusion_matrix(pred, labels) # using np.bincount
end = time.time()
print(end - start)
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
confusion_matrix(pred, labels) # original one
end = time.time()
print(end - start) 0.5623 vs 1.1669 |
classes = 100
length = 10
times = 10000
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
new_confusion_matrix(pred, labels)
end = time.time()
print(end - start)
start = time.time()
for i in range(times):
labels = np.random.randint(0, classes, length).astype(np.int64)
pred = np.random.randint(0, classes, length).astype(np.int64)
confusion_matrix(pred, labels)
end = time.time()
print(end - start) 0.5215 vs 0.5446 |
The modification can truly speed up the calculation |
index_pred = label_map[plabel] | ||
confusion_mat[index_real][index_pred] += 1 | ||
max_label = label_set[-1] | ||
label_map = np.zeros(max_label + 1, dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to use a dictionary instead of np array here, as the original one:
label_map = {label: i for i, label in enumerate(label_set)}
so that 4 lines -> 1 line
any performance concern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label_map = {label: i for i, label in enumerate(label_set)}
y_pred_mapped = [label_map[i] for i in y_pred]
y_real_mapped = [label_map[i] for i in y_real]
This is much slower even than the original one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numbers would be appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, the for loop is much slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal by kenny VS The original one before this PR:
for classes = 10, length = 100, times = 10000
, 1.2643 vs 1.1462
for classes = 10, length = 100, times = 10000
, 0.6333 vs 0.5554
even worse
This PR uses
np.bincount
to speed up confusion matrix calculation