Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option for downloading CSV with no rows limit #3810

Closed
ant1j opened this issue Nov 24, 2016 · 26 comments
Closed

Option for downloading CSV with no rows limit #3810

ant1j opened this issue Nov 24, 2016 · 26 comments
Assignees
Labels
Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness Querying/Processor Type:Bug Product defects
Milestone

Comments

@ant1j
Copy link

ant1j commented Nov 24, 2016

It would be great to facilitate data extraction through Metabase.

A few potential enhancements to do so:

  • Choose columns to display/export (already covered by Allow users to select fields to be displayed in the table #1445)
  • Have a LIMIT to the query for visualisation update, and choose to remove this LIMIT when exporting
    => this is possible today (but not so user-friendly) by setting the limit, hit "get answer" to check data, and remove the LIMIT options before clicking on the export to CSV icon
  • Allow additional options to CSV export (remove LIMIT condition)
@camsaul camsaul changed the title Ease data subset export Option for CSV download without limit Dec 13, 2016
@camsaul camsaul changed the title Option for CSV download without limit Option for downloading CSV ignoring rows limit of query Dec 13, 2016
@camsaul camsaul changed the title Option for downloading CSV ignoring rows limit of query Option for downloading CSV with no rows limit Dec 13, 2016
@ibushong
Copy link

Not being able to download all rows is a pretty big issue for us. Having something like a "download all" button would be great.

@axb21
Copy link

axb21 commented Dec 21, 2016

I'd also like an option to download CSV files with at least a higher row limit than the current 10,000. I can see the rationale for the limit, but I am exploring using metabase as a way to deliver insights to non-technical people in my company, and every now and then they'll generate a result set with slightly more than 10,000 rows. The most recent one was 13,000 rows or so, and it was frustrating that they could not hit a button and download the whole thing. Just a little too big.

I am not familiar at all with the metabase codebase, having literally just looked at it minutes ago. However, I hacked a bit and found that changing one line in src/metabase/api/dataset.clj (as of commit 4844ecc) sufficed for my purposes, at least as far as I'm able to tell so far. I have attached a patch if you're curious. The runnable jar file is easy to build following the developer notes and so far it's running fine in my environment and people are happily downloading CSV files with >10k rows. Again, I have no clue whether this will wreak havoc on the rest of the codebase, so word of warning.

up-download-row-limit..txt

@hugocar
Copy link

hugocar commented Dec 23, 2016

The fact the limit wasn't enforced in version 0.20 is causing troubles for me. Users, where I work, were beginning to get used to export big data sets and are seeing this limit as a bug/regression. I'm not sure what to tell them.

@tlrobinson
Copy link
Contributor

Confirmed this is a regression in v0.21.0. Downloads of the Orders table in the sample dataset are limited to 2,000 rows even though the copy in the UI says "The maximum download size is 1 million rows.". In v0.20.3 I get all 17,624 rows.

@tlrobinson tlrobinson added the Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness label Dec 23, 2016
@tlrobinson tlrobinson added this to the 0.22.0 milestone Dec 23, 2016
@salsakran
Copy link
Contributor

@tlrobinson I think this is happening on the frontend.

I'm fairly certain the query dictionary being serialized for POST params in https://github.com/metabase/metabase/blob/master/frontend/src/metabase/query_builder/components/DownloadWidget.jsx#L32 includes the limits.

@tlrobinson
Copy link
Contributor

@salsakran I think you're right for unsaved questions. It looks like we started using json_query from the results, which includes the constraints object. If we strip out that object it downloads all rows. PR at #4037.

However, saved questions are still limited to 2k rows by the backend (the frontend doesn't send the query at all in that case).

@axb21
Copy link

axb21 commented Jan 1, 2017

@tlrobinson I think you're right about that. If you have a look at the patch I posted earlier, you'll see I changed code on the backend that allowed me to retrieve >2,000 rows from a saved question; I neglected to mention in my previous comment that my observations concerned saved questions. Anyway, I'm not steeped in this codebase, but this all suggests to me that at least in some cases the backend is controlling how many rows are ever sent to the frontend.

@tlrobinson
Copy link
Contributor

@axb21 Thanks for looking into this. The constant you modified is the limit for normal queries (in the query builder and dashboards), but we want to ignore that for CSV downloads and use the 1 million row limit instead. Also I think your change will only increase the limit for aggregated queries, not "bare rows", which has a separate limit of 2000. We don't actually want to modify either of those limits.

@axb21
Copy link

axb21 commented Jan 1, 2017

@tlrobinson gotcha, that makes sense. One thing I can say that might be of interest: I searched through the source for the number 10,000--the largest number of rows I was ever able to download--and only found that number in a handful of places:

$ find src/ -type f -exec egrep -nil '\b10000\b' {} \;
src/metabase/util/stats.clj
src/metabase/api/dataset.clj
src/metabase/driver/googleanalytics/query_processor.clj
src/metabase/driver.clj

The patch I posted changes that number in src/metabase/api/dataset.clj, which I understand isn't going to work. The other places where 10000 appears did not look (to my naive eyes) like the right places to make a change like this.

@salsakran
Copy link
Contributor

salsakran commented Jan 1, 2017

@tlrobinson I don't think the front end should control how many rows get sent back. This is a limit in place for the backend to remain stable, and it should control how many rows get sent back.

I'll keep poking, I stopped there as I thought that was the root bug. =(

@tlrobinson
Copy link
Contributor

tlrobinson commented Jan 1, 2017

@salsakran My PR already prevents the frontend from sending the constraints object. Are you saying the backend should not respect constraints passed in the request as well?

The following endpoints override constraints with the hard-coded query-constraints (10k grouped rows / 2k bare rows)

  • POST /api/dataset - unsaved question qb/dashboard endpoint
  • POST /api/card/:card-id/query - saved question qb/dashboard endpoint
  • POST /api/card/:card-id/query/csv - saved question CSV download
  • POST /api/card/:card-id/query/json - saved question JSON download

(those last 2 are why the saved question download endpoints are still limited to 10k/2k rows)

And the following do not (allowing constraints to pass through from the request)

  • POST /api/dataset/csv - unsaved question CSV download
  • POST /api/dataset/json - unsaved question JSON download

Perhaps we should just change all the CSV/JSON download endpoints to discard constraints and not replace with query-constraints, so the query processor will use absolute-max-results (1048576, apparently the same as Excel https://github.com/metabase/metabase/blob/master/src/metabase/query_processor.clj#L28-L33 )

@tlrobinson
Copy link
Contributor

@axb21 The 10000 constant you found (max-results) is the correct one, but I'm saying that limit shouldn't be applied for the download endpoints at all.

@tlrobinson tlrobinson self-assigned this Jan 3, 2017
@mmerchant
Copy link

@tlrobinson Just to clarify -- on the CSV export it is intended to limit the export to 10,000 rows? Or is that constraint removed? I tried on the v0.22.0-rc1 tag and do still run into the 10K row limit on CSV exports.

@tlrobinson
Copy link
Contributor

@mmerchant Yes but I think this was merged after RC1. We'll put out another RC later today.

@luqmansyauqi
Copy link

@tlrobinson is this already implemented in v0.22.1?

@SchwiftyOne
Copy link

SchwiftyOne commented Aug 1, 2017

@tlrobinson @axb21 .I know this is an old issue, but does changing (max-results) allow the front-end and/or the CSV export to send more than 10,000 rows? I'm very interested in using Metabase in our corporate environment but the 10,000 row limit is major problem for how we'd want to use it and if changing the constant in the source is fixes this that would solve a big problem for us.

@salsakran
Copy link
Contributor

How do you want to use it?

We're kind of at a mostly stable settling point where most people find 10k enough, while some people on small instances still manage to crash their server once every so often. In general, we think of Metabase as a tool for interactive BI vs pushing big files around. There are a lot of little assumptions on that regard.

If you can give us a clear sense of what you're trying to do with it, we can perhaps find a way to get it done. In general though, we're not trying to be a "pull massive files out of databases" play, and at a certain size (depending on your data warehouse, server specs and browser), you'll want to have a dedicated data export tool in conjunction with us.

@SchwiftyOne
Copy link

We're not looking to download huge datasets (1million rows+), but the queries we'd like to have exported are around 200k to 300k rows or so. I understand that Metabase is primarily a BI visualization tool, so I don't expect it to do huge data set exports but something in that range would be great for us.

@mazameli
Copy link
Contributor

mazameli commented Aug 1, 2017

Wait, so the limit for the number of downloadable rows is 10k and not 1 million rows? If so, I'm going to change this message in the UI:

screen shot 2017-08-01 at 12 45 38 pm

@axb21
Copy link

axb21 commented Aug 1, 2017

@SchwiftyOne yes. I've been happily using a modified version of metabase for over a year now and a higher limit than 10k rows. I couldn't tell you the largest result set someone's downloaded, but it's well over 10k rows and metabase has worked fine.

@salsakran Yes, it makes sense that metabase is not tuned to query and return millions of rows. But an arbitrary, software-selected-at-deploy-time, hard limit on the number of returned rows seems unnecessarily constraining. My preference would be that I, as the person installing and administering a metabase instance, have the power to override this limit if I so choose, and it's on me if I create havoc. Really the request here is to turn a hardcoded limit into a configurable option and it's hard to understand why that'd be controversial unless there's no one available to implement it.

@SchwiftyOne
Copy link

@axb21 Thanks for the feedback! That's great that you can modify it. My only problem now is how to compile it. Unfortunately I don't have a mac, and reading up on how to compile the metabase source to jar, it seems you need a mac to do so. You wouldn't happen to have a compiled metabase jar with your modifications available for download somewhere?

@salsakran
Copy link
Contributor

Sorry, I misspoke.

The limits are
2k for raw table in-browser display
10k for aggregations in-browser display
1M rows for downloads.

@SchwiftyOne it seems like the number you mention (200-300k) should happen without too much drama. You will want to make sure your instance has enough memory, and you might need to tweak some config settings if you start seeing download failures.

@axb21
Copy link

axb21 commented Aug 1, 2017

@SchwiftyOne according to @salsakran's last response you shouldn't need my version of the jar. Honestly I'm several minor versions behind and I hacked in what I needed, so I'd suggest sticking with an official metabase release if it can handle what you need. I'd be happy to share offline what I hacked if it turns out the official release doesn't work for you. Good luck!

@SchwiftyOne
Copy link

I just tested it out and it worked!. Didn't realize that the display limit was 10k but the CSV limit was 1M. Thanks @salsakran and @axb21

@MiBaDK
Copy link

MiBaDK commented Aug 17, 2017

Only getting 100.000 rows. Version 0.25.2. Has this been changed from the 1M?

@paoliniluis
Copy link
Contributor

#28144

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority:P1 Security holes w/o exploit, crashing, setup/upgrade, login, broken common features, correctness Querying/Processor Type:Bug Product defects
Projects
None yet
Development

No branches or pull requests