Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate results directly from feature_dict instead of pd.Series #334

Merged
merged 4 commits into from
Jun 14, 2024

Conversation

toni-neurosc
Copy link
Collaborator

@toni-neurosc toni-neurosc commented Jun 13, 2024

While I was working on bursts vectorization, I was checking the profiling results and I noticed an unusual amount of time being spent in pd.Series.__init__ and also in _add_timestamp. In total, they accounted for 6 to 7% of the running time.

  • pd.Series.__init__: 2.5%
  • _add_timestamp: 3.8%

I realized this was due to the transformation of the dictionary of feature results feature_dict to a pd.Series feature_series, as well as the modification of a pd.Series instead of a dict which comes at a much greater cost due to Pandas overhead, possibly having to move the entire series around in memory, etc. This was done, I imagine, to facilitate some of the computations that were done on that series, mainly the cortex/subcortex projection. But paying a 7% of the running time is too high a cost imho.

Also, using a pd.Series to store a row works in opposition to Pandas logic, which actually stores one Series per column, not per row. So what I did was remove the intermediate transformation of feature_dict to feature_series and instead of the Stream.run function generating a list of pd.Series, it generates a list of feature_dict, then it transforms the dict into the final result dataframe with pd.DataFrame.from_records()

To make this work, I had to make some very small syntax changes, but the logic of everything is the same, just using dict instead of pd.Series. Of course now Projection.project() is a bit slower than before after the very naif syntax conversion (there might be a way to speed it up) but overall shaving off a 7% of the runtime basically for free compensates for it.

Also, I think removing the reliance on Pandas when it comes to handling data during processing is a positive, as it comes with too much overhead. In fact, I'm a proponent of even ditching dict and going np.array all the way, but getting rid of pd.df is a good step in that direction.

So, in conclusion:

  • No pd.Series, just dict until the final dataframe generation.
  • 6-7% of the running time saved when cortical projection not enabled.
  • Dependency on Pandas for data handling reduced.

@timonmerk
Copy link
Contributor

Great addition @toni-neurosc! It goes into a similar direction that I just mentioned here: #322 (reply in thread)

Ideally we will move away from the csv write, and only use the sqlite database for that.

Copy link
Contributor

@timonmerk timonmerk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with all of those changes, renaming run_analysis to data_prcoessor makes also lots of sense!

@timonmerk
Copy link
Contributor

Many thanks @toni-neurosc!

@timonmerk timonmerk merged commit 6e0b022 into main Jun 14, 2024
6 checks passed
@toni-neurosc toni-neurosc deleted the result_from_dicts_pr branch June 14, 2024 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants