Generate results directly from feature_dict instead of pd.Series #334
While I was working on bursts vectorization, I was checking the profiling results and noticed an unusual amount of time being spent in `pd.Series.__init__` and also in `_add_timestamp`. In total, they accounted for 6 to 7% of the running time.

- `pd.Series.__init__`: 2.5%
- `_add_timestamp`: 3.8%

I realized this was due to the transformation of the dictionary of feature results `feature_dict` to a pd.Series `feature_series`, as well as the modification of a pd.Series instead of a `dict`, which comes at a much greater cost due to Pandas overhead, possibly having to move the entire series around in memory, etc. This was done, I imagine, to facilitate some of the computations that were done on that series, mainly the cortex/subcortex projection. But paying 7% of the running time is too high a cost imho.

Also, using a pd.Series to store a row works against Pandas logic, which actually stores one Series per column, not per row.

So what I did was remove the intermediate transformation of `feature_dict` to `feature_series`. Instead of the `Stream.run` function generating a list of pd.Series, it generates a list of `feature_dict`, then transforms that list into the final result dataframe with `pd.DataFrame.from_records()`.

To make this work, I had to make some very small syntax changes, but the logic of everything is the same, just using `dict` instead of `pd.Series`. Of course, `Projection.project()` is now a bit slower than before after the very naive syntax conversion (there might be a way to speed it up), but overall, shaving 7% off the runtime basically for free compensates for it.

Also, I think removing the reliance on Pandas when it comes to handling data during processing is a positive, as it comes with too much overhead. In fact, I'm a proponent of even ditching `dict` and going `np.array` all the way, but getting rid of pd.df is a good step in that direction.

So, in conclusion: `dict` until the final dataframe generation.
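For reviewers who want to see the pattern in isolation, here is a minimal sketch of the dict-per-row approach described above. It is not the actual code from this PR: `compute_features`, `run`, the feature names, and the timestamp values are hypothetical stand-ins. Only `pd.DataFrame.from_records()` is the real call the change relies on.

```python
import pandas as pd


def compute_features(sample_id):
    """Hypothetical stand-in for one processing step; returns a plain dict."""
    feature_dict = {"sample": sample_id, "mean": sample_id * 0.5}
    # Adding a key to a dict is cheap: no pandas index or Series is rebuilt.
    feature_dict["timestamp"] = 1000 + sample_id
    return feature_dict


def run(n_samples):
    """Collect one dict per row, then build the DataFrame once at the end."""
    records = [compute_features(i) for i in range(n_samples)]
    return pd.DataFrame.from_records(records)


results = run(3)
print(results)
```

The point of the sketch is that pandas is only touched once, when the full list of rows is available, so no per-row `pd.Series` is ever constructed or modified.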