Generate results directly from feature_dict instead of pd.Series #334

toni-neurosc · 2024-06-13T09:55:32Z

While I was working on bursts vectorization, I was checking the profiling results and I noticed an unusual amount of time being spent in pd.Series.__init__ and also in _add_timestamp. In total, they accounted for 6 to 7% of the running time.

pd.Series.__init__: 2.5%
_add_timestamp: 3.8%

I realized this was due to converting the dictionary of feature results, feature_dict, into a pd.Series, feature_series, and then modifying that pd.Series instead of a dict, which costs much more because of Pandas overhead (possibly having to move the entire series around in memory, etc.). This was done, I imagine, to facilitate some of the computations performed on that series, mainly the cortex/subcortex projection. But paying 7% of the running time is too high a cost imho.
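A rough way to see the overhead described above is to time both approaches on a small feature dictionary. This is only a minimal sketch, not the project's code: the feature names, the `time` key, and both helper functions are hypothetical, and the absolute numbers will vary by machine.

```python
import timeit

import pandas as pd

# Hypothetical feature results for one batch; the names are illustrative only.
feature_dict = {f"ch{i}_feature": float(i) for i in range(50)}


def row_as_series(features: dict, timestamp: float) -> pd.Series:
    """Convert the dict to a pd.Series, then add the timestamp to the Series."""
    series = pd.Series(features)
    series["time"] = timestamp  # setting a new label enlarges the Series
    return series


def row_as_dict(features: dict, timestamp: float) -> dict:
    """Keep the row as a plain dict and add the timestamp as a key."""
    row = dict(features)  # shallow copy so the input is not mutated
    row["time"] = timestamp
    return row


if __name__ == "__main__":
    n = 2000
    t_series = timeit.timeit(lambda: row_as_series(feature_dict, 1.0), number=n)
    t_dict = timeit.timeit(lambda: row_as_dict(feature_dict, 1.0), number=n)
    print(f"pd.Series: {t_series:.4f} s, dict: {t_dict:.4f} s for {n} rows")
```

Both helpers produce the same row contents, so the comparison isolates the cost of the container rather than of the feature computation.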

Also, using a pd.Series to store a row works against Pandas logic, which actually stores one Series per column, not per row. So what I did was remove the intermediate transformation of feature_dict to feature_series: instead of generating a list of pd.Series, the Stream.run function now generates a list of feature_dict and then transforms that list into the final result dataframe with pd.DataFrame.from_records().
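The pattern described above could look roughly like this. It is a sketch under assumptions, not the actual Stream.run implementation: `compute_features`, `run`, the feature names, and the timestamp values are hypothetical stand-ins. Only `pd.DataFrame.from_records()` comes from the change itself.

```python
import pandas as pd


def compute_features(batch_index: int) -> dict:
    """Stand-in for the per-batch feature computation (hypothetical)."""
    return {
        "ch1_bandpower": 0.1 * batch_index,
        "ch2_bandpower": 0.2 * batch_index,
    }


def run(n_batches: int) -> pd.DataFrame:
    """Collect one dict per batch and build the DataFrame once at the end."""
    rows = []
    for i in range(n_batches):
        feature_dict = compute_features(i)
        feature_dict["time"] = i * 100  # hypothetical timestamp in ms
        rows.append(feature_dict)
    # Each dict becomes one row; the dict keys become the column names.
    return pd.DataFrame.from_records(rows)


df = run(3)
print(df)
```

Pandas is touched only once, after the loop, so the per-batch work stays on plain Python dicts.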

To make this work, I had to make some very small syntax changes, but the logic is the same everywhere, just using dict instead of pd.Series. Of course, Projection.project() is now a bit slower than before after the very naive syntax conversion (there might be a way to speed it up), but overall, shaving 7% off the runtime basically for free compensates for it.

Also, I think removing the reliance on Pandas when it comes to handling data during processing is a positive, as it comes with too much overhead. In fact, I'm a proponent of even ditching dict and going np.array all the way, but getting rid of pd.df is a good step in that direction.
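For illustration, the "np.array all the way" idea might be sketched as below. This is not part of the PR. It assumes the feature names and the number of batches are known before the run starts, which may not hold for a real-time stream, and all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, fixed set of feature names known before the run starts.
feature_names = ["ch1_bandpower", "ch2_bandpower", "time"]
n_batches = 4

# Preallocate one row per batch and one column per feature.
results = np.empty((n_batches, len(feature_names)), dtype=np.float64)

for i in range(n_batches):
    # Write each batch's values directly into its row; no per-row objects.
    results[i] = (0.1 * i, 0.2 * i, i * 100.0)

# Convert to a DataFrame only once, at the very end.
df = pd.DataFrame(results, columns=feature_names)
print(df)
```

The trade-off is that a single float array loses the self-describing keys of a dict, so the column order has to be tracked separately.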

So, in conclusion:

No pd.Series, just dict until the final dataframe generation.
6-7% of the running time saved when cortical projection is not enabled.
Dependency on Pandas for data handling reduced.

timonmerk · 2024-06-13T10:07:40Z

Great addition @toni-neurosc! It goes in a similar direction to what I just mentioned here: #322 (reply in thread)

Ideally we will move away from the csv write, and only use the sqlite database for that.

timonmerk

Agree with all of those changes, renaming run_analysis to data_processor also makes a lot of sense!

timonmerk · 2024-06-14T13:53:29Z

Many thanks @toni-neurosc!

toni-neurosc added 4 commits June 12, 2024 21:37

Construct result df from dicts instead of pd.Series

a4031e1

Rename FeatureReader, fix some tests

f38f591

Fix projection

0144f58

remove print

90a46b4

toni-neurosc requested a review from timonmerk June 13, 2024 09:57

timonmerk approved these changes Jun 14, 2024

timonmerk merged commit 6e0b022 into main Jun 14, 2024
6 checks passed

toni-neurosc deleted the result_from_dicts_pr branch June 14, 2024 14:30
