Longitudinal Study: Phase 1 #149

Closed
johnbent opened this issue Oct 30, 2023 · 6 comments

Comments

@johnbent
Contributor

We want to enable a longitudinal study. This requires multiple phases. This issue is about the first phase. Subsequent issues will be created to describe subsequent phases so we can keep the conversations nicely organized.

We need to decide on what to capture in a longitudinal snapshot. One option would be merely to create a snapshot of the full GUFI tree but that's very large and includes a lot of information that wouldn't be necessary for a longitudinal study. Another option is to create a tool which copies a subset of the GUFI tree. It could either copy a subset of the GUFI tree files or it could do GUFI queries and save the output of those queries. That particular question is perhaps a subsequent phase.

Regardless of how we capture a longitudinal snapshot, we have also realized we first want to add additional information into the GUFI tree. In particular, we want to capture histograms describing attributes of the entries within a directory into the summary tables. For example, how many files are size [0,1), [1,2), [2,4), etc. To collaboratively decide on the histograms required, we are defining them in this spreadsheet:

https://docs.google.com/spreadsheets/d/17PSZxHLVj731bI9PKY3E1V-TzSAHJiIahioc2nujo1U/edit#gid=0
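
For concreteness, here is a sketch of what such histogram columns on the summary table could look like (column names are purely illustrative; the real set is whatever the spreadsheet converges on):

ALTER TABLE summary ADD COLUMN hist_size_0_1 INT64 DEFAULT 0; -- files with size in [0,1)
ALTER TABLE summary ADD COLUMN hist_size_1_2 INT64 DEFAULT 0; -- files with size in [1,2)
ALTER TABLE summary ADD COLUMN hist_size_2_4 INT64 DEFAULT 0; -- files with size in [2,4)
-- ...one column per bucket, with similar sets for other attributes (uid, atime, ...),
-- or a single serialized histogram column if the buckets are still in flux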

@johnbent
Contributor Author

"It’s a horrible waste of space to put full paths on every record isn’t it, it probably doubles the space needed for the longit table

If you put just the dir name in, you can build the path with recurse.

And – we won't be publicizing data with paths. [jb made mild edits here]

But I suppose you might want to do group by for each tree, but you can do that with the recurse view so there is a way to do it.

There might be a case for it being more efficient on query because recurse is a join and if you had full path you wouldn’t ever need to do a join.

If you put path in you don’t need inode or pinode really but having them might be useful

One downside is you have to get good at substring ops for pulling out subdirs but [that's not horribly difficult --jb edit].
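
[Illustrative sketch, not from the email: assuming a hypothetical full-path column named path on the longit table, the substring ops in SQLite are the same trick the vrpentries view (quoted below) uses:]

select replace(path, rtrim(path, replace(path, '/', '')), '') as basename from longit; -- last path component
select rtrim(path, replace(path, '/', '')) as parent from longit;                      -- prefix up to and including the last '/'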

I guess it might be a wash unless it really is expensive on space. We could be far smarter about space on the longit table. Everything in GUFI is int64 and that was me being lazy, but we know not everything needs that.

Even so, I bet it’s more space for the path than all the rest of the fields combined if you weren’t lazy

I guess if you have path, you could split the longit up into N DB’s (one for every FS or something) and get more parallelism

But of course [snip - jb] it's more than just select… it would be some parallel select/Hive-like thing. That probably exists, but then you have to do the sort/etc. as a second step. We could write that code, but it's code.

Maybe it's just a question of space. You could find out how much space is actually used with a query 😊

I am sure there is a substring trick to find the length of a string so it would be some fancy string thing on path(). And then find the biggest one of those using sort and limit, and then you know the longest path; multiply by the number of dirs, plus some pad."
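
The sizing estimate above is one query, since SQLite has length() built in (sketch, assuming a hypothetical longit table with a full-path column named path):

select max(length(path))            as longest_path,
       count(*)                     as num_dirs,
       max(length(path)) * count(*) as padded_worst_case_bytes,
       sum(length(path))            as actual_path_bytes
from longit;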

@johnbent
Contributor Author

Info about how we can do a recursive query to build the tree when the table just has pinode in it:

So when we get this one large table with supersummary records in it, with inode and pinode…

This is an example of that. The way it's written, it assumes the top level dir has a pinode of zero, so it knows where to start.

This is the way HPSS DB2 and Robinhood Postgres work. HPSS has this isroot field on directories so when traversing it knows where to start.

I suspect we will need to create indexes on inode and pinode because we will be doing joins and stuff on them; maybe the first time we can actually use an index 😊. It's just create index ….
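
For the summarylong example below, that would be just:

create index if not exists idx_summarylong_inode  on summarylong(inode);
create index if not exists idx_summarylong_pinode on summarylong(pinode);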

This is how you recurse; notice it starts at pinode=0 – you have to have a place to point it to start.

And I put in the SQL so you can play along.

Go into sqlite3 like this:

sqlite3 testdb

CREATE TABLE summarylong(name TEXT, type TEXT, inode INT64, pinode INT64, isroot INT64);

INSERT INTO summarylong VALUES('test','d',55762961,0,1);

INSERT INTO summarylong VALUES('l1.2','d',55762965,55762961,1);

INSERT INTO summarylong VALUES('l1.3','d',55762966,55762961,1);

INSERT INTO summarylong VALUES('l1.1','d',55762967,55762961,1);

INSERT INTO summarylong VALUES('l2.2','d',55762986,55762965,1);

INSERT INTO summarylong VALUES('l2.3','d',55762987,55762965,1);

INSERT INTO summarylong VALUES('l2.1','d',55762989,55762965,1);

INSERT INTO summarylong VALUES('l2.2','d',55762979,55762966,1);

INSERT INTO summarylong VALUES('l2.3','d',55762980,55762966,1);

INSERT INTO summarylong VALUES('l2.1','d',55762982,55762966,1);

INSERT INTO summarylong VALUES('l2.2','d',55762968,55762967,1);

INSERT INTO summarylong VALUES('l2.3','d',55762969,55762967,1);

INSERT INTO summarylong VALUES('l2.1','d',55762971,55762967,1);

Now we check our work

select * from summarylong;

test|d|55762961|0|1

l1.2|d|55762965|55762961|1

l1.3|d|55762966|55762961|1

l1.1|d|55762967|55762961|1

l2.2|d|55762986|55762965|1

l2.3|d|55762987|55762965|1

l2.1|d|55762989|55762965|1

l2.2|d|55762979|55762966|1

l2.3|d|55762980|55762966|1

l2.1|d|55762982|55762966|1

l2.2|d|55762968|55762967|1

l2.3|d|55762969|55762967|1

l2.1|d|55762971|55762967|1

And now the magic recursive dust and some magic field concatenation, and you can draw pretty trees and calculate levels:

with recursive under(parent,child,level,mypath,myname) as ( select 0,inode,0,name,name from summarylong where pinode=0 union all select pinode,inode,under.level+1,mypath||'/'||name,name from summarylong, under where pinode=under.child ) select substr('......',1,level*3),parent,child,level,mypath,myname from under;

|0|55762961|0|test|test

...|55762961|55762967|1|test/l1.1|l1.1

...|55762961|55762965|1|test/l1.2|l1.2

...|55762961|55762966|1|test/l1.3|l1.3

......|55762967|55762971|2|test/l1.1/l2.1|l2.1

......|55762967|55762968|2|test/l1.1/l2.2|l2.2

......|55762967|55762969|2|test/l1.1/l2.3|l2.3

......|55762965|55762989|2|test/l1.2/l2.1|l2.1

......|55762965|55762986|2|test/l1.2/l2.2|l2.2

......|55762965|55762987|2|test/l1.2/l2.3|l2.3

......|55762966|55762982|2|test/l1.3/l2.1|l2.1

......|55762966|55762979|2|test/l1.3/l2.2|l2.2

......|55762966|55762980|2|test/l1.3/l2.3|l2.3

We should write one of these and distribute it as an example against the longit table with the real field names in it. You might even be able to create a view on this query:

Create view blah as fancy recursion thing

And then the data scientist thinks the world is wonderfully magical

Let's try it so we can make this data-scientist-proof.

sqlite> create view magicrecurs as with recursive under(parent,child,level,mypath,myname) as ( select 0,inode,0,name,name from summarylong where pinode=0 union all select pinode,inode,under.level+1,mypath||'/'||name,name from summarylong, under where pinode=under.child ) select substr('......',1,level*3),parent,child,level,mypath,myname from under ;

sqlite> select * from magicrecurs;

|0|55762961|0|test|test

...|55762961|55762967|1|test/l1.1|l1.1

...|55762961|55762965|1|test/l1.2|l1.2

...|55762961|55762966|1|test/l1.3|l1.3

......|55762967|55762971|2|test/l1.1/l2.1|l2.1

......|55762967|55762968|2|test/l1.1/l2.2|l2.2

......|55762967|55762969|2|test/l1.1/l2.3|l2.3

......|55762965|55762989|2|test/l1.2/l2.1|l2.1

......|55762965|55762986|2|test/l1.2/l2.2|l2.2

......|55762965|55762987|2|test/l1.2/l2.3|l2.3

......|55762966|55762982|2|test/l1.3/l2.1|l2.1

......|55762966|55762979|2|test/l1.3/l2.2|l2.2

......|55762966|55762980|2|test/l1.3/l2.3|l2.3

Whee, so they can go against the summarylong table or they can go against the hierarchical view magicrecurs.
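
For example, the per-tree group-by mentioned earlier works straight against the view, since level and mypath are columns of magicrecurs:

select level, count(*) from magicrecurs group by level;
select * from magicrecurs where mypath like 'test/l1.1%';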

@johnbent
Contributor Author

johnbent commented Nov 4, 2023

Seems like maybe post-processing with GUFI-only queries will work. And this is exactly why GUFI was created! To do super amazing powerful queries very quickly across giant namespaces!

From the big boss man himself:

"So lets say you come to the conclusion that I am right and merging all this stuff into the summary record for each dir is really misleading and you ride that downhill to, this is just post processing to create an historical record which can have age in it, one per dir…

Then this is just a post processing step.

If you buy that, then why would you need Python at all?

It's just a GUFI query.

Look

This is a little tree; it's not terribly representative, only one file per dir, but good enough.

ggrider@pn2201328 bin % ls -lR test | grep -i drw
drwxr-xr-x 7 ggrider staff 224 Apr 6 2023 l1.1
drwxr-xr-x 6 ggrider staff 192 Apr 6 2023 l1.2
drwxr-xr-x 6 ggrider staff 192 Apr 6 2023 l1.3
drwxr-xr-x 4 ggrider staff 128 Apr 6 2023 l2.1
drwxr-xr-x 4 ggrider staff 128 Apr 6 2023 l2.2
drwxr-xr-x 4 ggrider staff 128 Apr 6 2023 l2.3
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.1
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.2
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.3
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.1
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.2
drwxr-xr-x 3 ggrider staff 96 Apr 6 2023 l2.3

Remember the vrpentries view has all the summary fields and all the entries fields in one row, one for every file.
So you have everything you need to just spit out the longit record from a single gufi_query pass.

So do a query on vrpentries and include each field that came from the summary table; for each integer variable just say max(field) (or min, doesn't matter, they are all the same for every entries record since those come from the summary record). For the entries fields you would need to calculate using UDFs or case or …. So for each entries record you generate whatever you need.
Then you use group by, and group by pinode (the inode of the parent – it's the same in all records too), so you end up with one record per directory as output, with the summary record info and whatever you want to do with the file records: sum, avg, count…. So you would create the histogram fields and then sum each in the group (which is the directory). Presto, you can create a single record output with all the fields you need, no code.
In fact you could even create a temp db and a final db, and the query could do an insert into (with the below), and it will insert the stuff into the per-thread db and then concat all the threads' output.
Presto, you have your longit – just a fancy run of gufi_query and out pops the records, one pass, threaded; should be very fast.

ggrider@pn2201328 bin % ./gufi_query -n 1 -E 'select dname, max(duid), min(size),max(size) from vrpentries group by pinode;' -d '|' testi
l1.2|2078|2|2|
l1.3|2078|2|2|
l1.1|2078|2|2|
l2.2|2078|0|0|
l2.3|2078|0|0|
l2.1|2078|0|0|
l2.2|2078|7|7|
l2.3|2078|4|4|
l2.1|2078|10|10|
l2.2|2078|0|0|
l2.3|2078|0|0|
l2.1|2078|0|0|

See vrpentries, it has all the summary fields and all the entries fields in one record, so the above is kind of trivial. If you need to add a run date to the records, that's pretty easy too with a UDF or just a constant or whatever.
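
[Illustrative sketch: the same -E query with a run timestamp tacked on, using SQLite's built-in strftime:]

select strftime('%s','now') as snap_time, dname, max(duid), min(size), max(size) from vrpentries group by pinode;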

This just became dirt simple. Vrpentries is cool! - wonder who thought of that 😊

Not that I hate Python, but gufi_query is just so powerful, and this is a page of SQL. If you need a UDF to help the SQL be simpler (without so many case statements), add one – it's a dozen lines of C and probably useful in its own right.

This is how you would do a histogram:
ggrider@pn2201328 bin % ./gufi_query -n 1 -E 'select dname, max(duid), min(size),max(size), sum (case when size < 4 then 1 else 0 end) from vrpentries group by pinode;' -d '|' testi
l1.2|2078|2|2|1|
l1.3|2078|2|2|1|
l1.1|2078|2|2|2|
l2.2|2078|0|0|1|
l2.3|2078|0|0|1|
l2.1|2078|0|0|1|
l2.2|2078|7|7|0|
l2.3|2078|4|4|0|
l2.1|2078|10|10|0|
l2.2|2078|0|0|2|
l2.3|2078|0|0|2|
l2.1|2078|0|0|2|
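
[Illustrative sketch: the same sum(case ...) pattern extended to the [0,1), [1,2), [2,4) buckets from the top of this issue, still one gufi_query pass:]

select dname,
       sum(case when size < 1 then 1 else 0 end)               as sz_0_1,
       sum(case when size >= 1 and size < 2 then 1 else 0 end) as sz_1_2,
       sum(case when size >= 2 and size < 4 then 1 else 0 end) as sz_2_4,
       sum(case when size >= 4 then 1 else 0 end)              as sz_4_up
from vrpentries group by pinode;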

See the vrpentries definition:

CREATE VIEW vrpentries as select replace(summary.name,rtrim(summary.name,replace(summary.name,"/","")),"") as dname,summary.name as sname, summary.mode as dmode, summary.nlink as dnlink, summary.uid as duid, summary.gid as dgid, summary.size as dsize, summary.blksize as dblksize, summary.blocks as dblocks, summary.atime as datime, summary.mtime as dmtime, summary.ctime as dctime, summary.linkname as dlinkname, totfiles as dtotfile , totlinks as dtotlinks, minuid as dminuid, maxuid as dmaxuid, mingid as dmingid, maxgid as dmaxgidI, minsize as dminsize, maxsize as dmaxsize, totltk as dtotltk, totmtk as dtotmtk, totltm as totltm, totmtm as dtotmtm, totmtg as dtotmtg, totmtt as dtotmtt, totsize as dtotsize, minctime as dminctime, maxctime as dmaxctime, minmtime as dminmtime, maxmtime as dmaxmtime, minatime as dminatime, maxatime as dmaxatime, minblocks as dminblocks, maxblocks as dmaxblocks, totxattr as dtotxattr,depth as ddepth, mincrtime as dmincrtime, maxcrtime as dmaxcrtime, rollupscore as sroll,pentries.* from summary,pentries where summary.inode=pentries.pinode
/* vrpentries(dname,sname,dmode,dnlink,duid,dgid,dsize,dblksize,dblocks,datime,dmtime,dctime,dlinkname,dtotfile,dtotlinks,dminuid,dmaxuid,dmingid,dmaxgidI,
dminsize,dmaxsize,dtotltk,dtotmtk,totltm,dtotmtm,dtotmtg,dtotmtt,dtotsize,dminctime,dmaxctime,dminmtime,dmaxmtime,dminatime,dmaxatime,dminblocks,dmaxblocks,
dtotxattr,ddepth,dmincrtime,dmaxcrtime,sroll,name,type,inode,mode,nlink,uid,gid,size,blksize,blocks,atime,mtime,ctime,linkname,xattr_names,crtime,ossint1,
ossint2,ossint3,ossint4,osstext1,osstext2,pinode) */;

Did I get that right – no coding needed 😊

@johnbent
Contributor Author

johnbent commented Nov 4, 2023

"It should be fast. In fact, if the tree is rolled up it will use the roll up ( that’s what the r in vrpentries is). This should fly, n maybe 10-20 min at most for the entire smash all fs’s
I do think we still need an fsid so when we release data without names ( like to a university) there will be some way to know that. It would be nice to have level/depth as well but I think it’s there, and a snap time stamp.
We need to make sure the top has a pinode of zero or marked somehow in case we need the recurse without the path or name ( like releasing it to a univ)

I think you run it and make sure it's fast enough, but it's pretty clear it will be, especially if the tree is rolled up. It's just a full query, all threaded.
The only issue might be space in memory. We might need to allow forcing the interims to some real fs (a nice option for gufi_query anyway if it needs it). I think you can do that when you opendb (it's like mem:tempdbname if you want mem, and put a path in if you want to place the temp db file). Trivial option to add if it's not there, like -worktmp or something.

Run it with full path in the longit table.
Get the size, compress it and get the size, run a scan and see how fast it is (single threaded).
Then run it with just name (not path), since the recurse code will build the path if needed. See size and compress. Run a recurse and see if it's just way too slow compared to the scan.
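
[Sketch of the scan-vs-recurse comparison, reusing the toy names from the earlier comment; the real longit table and its recursive view would substitute in:]

-- plain scan over the table (names only, no path reconstruction)
select count(*), sum(length(name)) from summarylong;
-- the same rows through the recursive view, which rebuilds full paths via the join
select count(*), sum(length(mypath)) from magicrecurs;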

Then decide
Put in full path or not
And
Do you need an incremental post-process?

I suspect it would compress by a big factor.

It should be very fast to run it. If not maybe the tree isn’t rolled up.
Might be like 10-40 GB and a compression maybe a factor of 10 :-)"
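
To make the fsid / depth / snap-timestamp point concrete, here is a sketch of what a longit record might carry (column names purely illustrative, not an agreed schema):

create table longit (
    fsid      INT64,  -- which file system the snapshot row came from
    snap_time INT64,  -- when the snapshot was taken
    inode     INT64,  -- this directory
    pinode    INT64,  -- parent directory; 0 (or an isroot flag) marks the top
    depth     INT64,  -- level below the root
    name      TEXT    -- dir name only; full path is rebuilt by recursion if needed
    -- ...plus the per-directory summary and histogram fields
);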

@johnbent
Contributor Author

johnbent commented Nov 7, 2023

The current plan is that longit snaps will be stored as SQLite files. I think the motivation for this is simply the same motivation for why the GUFI live tree is stored in SQLite tables: it's a nice format for querying. Also, as GUFI people, we have developed expertise with SQLite, so it's a natural choice.

Anyway, we have now thought of an additional motivation to store them in SQLite files, so I'm adding it here in case we ever start second-guessing this choice. This new motivation is:

"This is a great argument for making the longit snaps be in an SQL table. That was always the plan but for different reasons. Being able to join a latest longit snap with the live tree is very cool. It might even be the case that you’d join the love tree with an old longit snap. I can think of a possible reason now but maybe! Or …. A three-way join. Join an old longit snap with a new one to discover dirs with large (or small) change and then join with live free to learn more about those files.

However, if we need to join with live tree then doesn’t that mean our histograms aren’t expressive enough? I suppose this is an amazing power in case there is ever a future admin who needs it. But, right now at least, I can’t imagine a use case. And, if we can discover a use case, I think that just means we need to make our histograms more expressive. Maybe an admin might want info about a specific file. Maybe an admin notices a “weird” directory and wants to use gufi_ls to look at all the entries. "
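
As a sketch of that snap-to-snap join (file names and column names hypothetical), attach two snapshot DBs and rank directories by change:

-- sqlite3 longit_new.db
attach 'longit_old.db' as prev;
select cur.inode, cur.name, cur.dtotsize - p.dtotsize as growth
from longit as cur
join prev.longit as p on p.inode = cur.inode
order by growth desc
limit 20;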

@calccrypto
Collaborator

Updates to the longitudinal snapshot script: 5f1e426, 0e2ee2f, ff970a0, 113f83a, 5ba7eda
