How to speed the data in lot of data #293

Closed
Benjamin-zhangjb opened this issue Dec 10, 2019 · 6 comments
Labels
feature request 💬 Requests for new features help wanted 🙋 Contributions are welcome!

Comments

@Benjamin-zhangjb

No description provided.

@Benjamin-zhangjb Benjamin-zhangjb added the feature request 💬 Requests for new features label Dec 10, 2019
@Benjamin-zhangjb
Author

Where should plot={'histogram': {'bayesian_blocks_bins': False}} be added?

@neomatrix369

neomatrix369 commented Dec 12, 2019

Here is an example:

training_profile = train.profile_report(
    title='Pandas Profiling on training set',
    plot={'histogram': {'bins': 8}},
    style={'full_width': True}, minify_html=True,
    pool_size=no_processors,  # no_processors: number of worker processes, defined elsewhere
)

That's how you pass the parameters.

@rmokros

rmokros commented Feb 13, 2020

df.shape: (6370599, 33)

profile = ProfileReport(
    df,
    # plot={'histogram': {'bins': None}},
    plot={'histogram': {'bayesian_blocks_bins': False}},
    check_correlation_pearson=False,
    correlations={
        "pearson": False,
        "spearman": False,
        "kendall": False,
        "phi_k": False,
        "cramers": False,
        "recoded": False,
    },
)

Machine: AWS ml.m5.24xlarge (384 GB memory)
Result: with 4 float columns I get a memory error; after converting those floats to strings it runs OK.
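
For anyone wanting to try the same workaround, here is a minimal sketch of the float-to-string conversion described above. The helper name and the column names are hypothetical, not taken from the original dataframe, and the copy it makes costs extra memory on a frame of this size, so treat it as an illustration rather than a recommendation.

```python
import pandas as pd


def floats_to_strings(df, columns):
    """Return a copy of df with the given columns cast to strings.

    Assumption: `columns` lists the float columns that trigger the
    memory error; the names used below are made up for illustration.
    """
    out = df.copy()
    out[columns] = out[columns].astype(str)
    return out


# Tiny stand-in frame; the real one had 33 columns.
df = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [1.5, 2.5, 3.5], "n": [1, 2, 3]})
converted = floats_to_strings(df, ["x", "y"])
```

Note that a column converted this way will presumably no longer be profiled as numeric, so its histogram and numeric statistics are lost.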

@sbrugman
Collaborator

The minimal mode uses even less computation as of v2.8.0. The release before that contained numerous performance optimizations as well, including disabling bayesian_blocks by default.
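
A rough sketch of turning on minimal mode, assuming pandas-profiling v2.8.0 or later is installed. The helper names, the report title and the output path are made up for illustration; only ProfileReport, minimal=True and to_file come from the library.

```python
def fast_profile_kwargs(minimal=True):
    """Keyword arguments for a cheaper report (hypothetical helper)."""
    return {"title": "Profile (minimal mode)", "minimal": minimal}


def build_report(df, output_path="report.html"):
    """Build a minimal-mode report and write it to output_path.

    Assumption: pandas-profiling >= 2.8.0. The import is done lazily so
    the kwargs helper above can be used without the library installed.
    """
    from pandas_profiling import ProfileReport

    report = ProfileReport(df, **fast_profile_kwargs())
    report.to_file(output_path)
    return report
```

Calling build_report(df) on your own dataframe should then write the HTML file; nothing here is specific to any particular dataset.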

@mthomp89

I'm running with minimal=True on a dataframe with shape (326878, 38) and writing the profile report to an HTML file, but the report is never produced. On a 5% sample of the original dataframe, the report was produced in 2-3 minutes. What other configuration settings can be turned off?

pandas_profiling v2.4
pandas v1.0.1
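
The 5% sample mentioned above can be drawn with plain pandas. This is only a sketch: the helper name, the seed and the toy dataframe are illustrative assumptions, and the sample would still need to be passed to the profiler separately.

```python
import pandas as pd


def sample_for_profiling(df, frac=0.05, seed=0):
    """Return a reproducible random sample of rows (hypothetical helper)."""
    return df.sample(frac=frac, random_state=seed)


# Toy frame standing in for the real (326878, 38) dataframe.
df = pd.DataFrame({"a": range(1000), "b": [x * 0.5 for x in range(1000)]})
small = sample_for_profiling(df)
print(small.shape)
```

Fixing the seed keeps the sample the same between runs, which makes timing comparisons between configurations easier to interpret.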

@sbrugman
Collaborator

Upgrading to the latest version will speed this up significantly.


5 participants