select rows based on simple criteria #8

Open
chimeno opened this issue Oct 3, 2018 · 7 comments

@chimeno

chimeno commented Oct 3, 2018

Would it be possible to select rows based on very simple criteria?

I mean, I have a database with one very large table, and I would like to
select all rows from every table except that data table, where I want only 1000000 rows, ORDER BY timestamp DESC.

Thanks for the library.

@mla
Owner

mla commented Oct 5, 2018

I could see that. What do you think the command syntax should look like for that?

@chimeno
Author

chimeno commented Oct 5, 2018

I'm currently using:
./pg_sample --limit="*=*, nodes_data=1000000"

I guess something like:

./pg_sample --limit="*=*, nodes_data=1000000;order by timestamp DESC"

or

./pg_sample --limit="*=*, nodes_data=1000000(order by timestamp DESC)"

Either form should be easy to parse and is extensible in case other criteria are added.
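To illustrate the "easy to parse" claim, here is a minimal Python sketch of how the first proposed form (`pattern=limit;order by ...`) could be split into rules. This is only an illustration of the proposal: pg_sample itself is written in Perl, and the `LimitRule` type and `parse_limit_spec` function are hypothetical names, not part of pg_sample.

```python
from typing import List, NamedTuple, Optional


class LimitRule(NamedTuple):
    pattern: str              # table name or wildcard, e.g. "nodes_data" or "*"
    limit: str                # row count, or "*" for all rows
    order_by: Optional[str]   # text after ";" (the proposed extension), if any


def parse_limit_spec(spec: str) -> List[LimitRule]:
    """Parse the hypothetical 'pattern=limit;order by ...' syntax.

    Deliberately naive: rules are split on commas, so an ORDER BY clause
    that itself contains a comma would need a smarter tokenizer.
    """
    rules = []
    for chunk in spec.split(","):
        pattern, _, rest = chunk.partition("=")      # split at the first "="
        limit, _, order_by = rest.partition(";")     # optional ";order by ..."
        rules.append(
            LimitRule(pattern.strip(), limit.strip(), order_by.strip() or None)
        )
    return rules


rules = parse_limit_spec("*=*, nodes_data=1000000;order by timestamp DESC")
for rule in rules:
    print(rule)
```

The comma caveat in the docstring is one reason the parenthesized form might be the safer of the two proposals, since the parentheses would delimit the clause explicitly.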

@mla
Owner

mla commented Oct 4, 2021

You should be able to specify a WHERE condition after the =, e.g.:

--limit="users=(user_id < 10)"

@lustickd

lustickd commented May 17, 2022

@mla I had a similar question: is it possible to sample EVERY table in DESC order? I think all (most?) tables in Rails, for example, have a "created_at" column, so it'd be nice to sample rows with ORDER BY created_at DESC by default, since the early rows in a big database are usually mostly inactive. I'm trying --random, but it might be too slow for my purposes.

@mla
Owner

mla commented May 22, 2022

Hey @lustickd. Sorry for the delay in responding.

You can try this patch, which should just force that ORDER BY for every table.

diff --git a/pg_sample b/pg_sample
index a73af39..a1b5ec8 100755
--- a/pg_sample
+++ b/pg_sample
@@ -630,6 +630,7 @@ while (my $row = lower_keys($sth->fetchrow_hashref)) {
       notice "No candidate key found for '$table'; ignoring --ordered";
     }
   }
+  $order = 'ORDER BY created_at DESC';

We'd have to look at how we can express that for general use. Rails doesn't automatically create an index on all created_at columns, does it? That would be my worry, if you have really large tables.

@mla
Owner

mla commented May 22, 2022

You might try this:

--- a/pg_sample
+++ b/pg_sample
@@ -624,7 +624,11 @@ while (my $row = lower_keys($sth->fetchrow_hashref)) {
   } elsif ($opt{ordered}) {
     my @cols = find_candidate_key($table);
     if (@cols) {
-      my $cols = join ', ', map { $dbh->quote_identifier($_) } @cols;
+      my $cols = join ', ',
+        map { "$_ DESC" }
+        map { $dbh->quote_identifier($_) }
+        @cols
+      ;
       $order = "ORDER BY $cols";
     } else {
       notice "No candidate key found for '$table'; ignoring --ordered";

Then pass the --ordered option. We order by the first candidate key we find. Rails usually has its "id" column, which should roughly match created_at order, I would think. The patch above just adds DESC to each of those columns. That seems like a reasonable default for that option anyway.
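For readers less familiar with Perl's chained `map`, the following Python sketch restates what the patched lines compute: quote each candidate-key column, append DESC, and join the result into an ORDER BY clause. It is a rough equivalent, not pg_sample's code. The `quote_identifier` function here is a simplified stand-in for the DBI method used in the patch, and `build_order_clause` is a hypothetical name.

```python
def quote_identifier(name: str) -> str:
    """Simplified stand-in for DBI's quote_identifier (an assumption):
    wrap the name in double quotes and double any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_order_clause(key_columns):
    """Return an ORDER BY clause sorting every key column descending,
    or an empty string when no candidate key was found."""
    if not key_columns:
        return ""
    cols = ", ".join(f"{quote_identifier(col)} DESC" for col in key_columns)
    return f"ORDER BY {cols}"


print(build_order_clause(["id"]))               # ORDER BY "id" DESC
print(build_order_clause(["tenant_id", "id"]))  # ORDER BY "tenant_id" DESC, "id" DESC
```

Note that DESC has to be attached to each column individually, since a single trailing DESC would apply only to the last column of a composite key.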

@lustickd

Ah that makes sense thanks. Yeah I think created_at doesn't have an index so I'll go with the id method 👍

I did mess around a little with tsm_system_rows for random sampling, and it's significantly faster than ORDER BY random() on a table with 40 million rows: about 300 milliseconds per table instead of 30 seconds. Apparently ORDER BY random() has to read and sort the entire table, which makes it extremely slow.
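For reference, the two sampling queries being compared would look roughly like the strings built below. This is a sketch under assumptions: the helper names are hypothetical, the table name is borrowed from earlier in the thread, and the first query only works once the tsm_system_rows extension has been installed in the database (`CREATE EXTENSION tsm_system_rows`).

```python
def quote_ident(name: str) -> str:
    """Minimal identifier quoting for the illustration (an assumption)."""
    return '"' + name.replace('"', '""') + '"'


def system_rows_query(table: str, rows: int) -> str:
    """Block-level sample of about `rows` rows via tsm_system_rows."""
    return f"SELECT * FROM {quote_ident(table)} TABLESAMPLE SYSTEM_ROWS({int(rows)})"


def random_order_query(table: str, rows: int) -> str:
    """Uniform random sample: every row gets a random sort key first."""
    return f"SELECT * FROM {quote_ident(table)} ORDER BY random() LIMIT {int(rows)}"


print(system_rows_query("nodes_data", 1000))
print(random_order_query("nodes_data", 1000))
```

The speed difference comes with a trade-off: SYSTEM_ROWS samples whole blocks, so the rows it returns can be clustered rather than uniformly random, especially for small samples.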
