Longitudinal Study: Phase 1 #149
Comments
"It's a horrible waste of space to put full paths on every record, isn't it? It probably doubles the space needed for the longit table. If you put just the dir name in, you can build the path with recurse. And we won't be publicizing data with paths. [jb made mild edits here]

But I suppose you might want to do group-by for each tree; you can do that with the recurse view, so there is a way to do it. There might be a case for it being more efficient on query, because recurse is a join, and if you had the full path you wouldn't ever need to do a join. If you put the path in you don't really need inode or pinode, but having them might be useful. One downside is you have to get good at substring ops for pulling out subdirs, but [that's not horribly difficult --jb edit]. I guess it might be a wash unless it really is expensive on space.

We could be far smarter about space on the longit table. Everything in GUFI is int64, and that was me being lazy, but we know not everything needs that. Even so, I bet it's more space for the path than all the rest of the fields combined if you weren't lazy.

I guess if you have the path, you could split the longit up into N DBs (one for every FS or something) and get more parallelism. But of course [snip - jb] it's more than just select; it would be some parallel select/Hive-like thing. That probably exists, but then you have to do the sort/etc. as a second step. We could write that code, but it's code.

Maybe it's just a question of space. You could find out how much space is actually used with a query 😊. I am sure there is a substring trick to find the length of a string, so it would be some fancy string thing on path(). Then find the biggest one of those using sort and limit, and then you know the longest path; multiply by the number of dirs, plus some pad."
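The "sort and limit on string length" space estimate suggested above can be sketched with Python's built-in sqlite3 module. The table, column names, paths, and pad value below are toy stand-ins, not the real longit schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE longit(path TEXT, inode INT)")
db.executemany("INSERT INTO longit VALUES(?, ?)",
               [("/proj/a", 1), ("/proj/a/deep/subdir", 2), ("/proj/b", 3)])

# "sort and limit" to find the longest path, as the note suggests
(maxlen,) = db.execute(
    "SELECT length(path) FROM longit ORDER BY length(path) DESC LIMIT 1"
).fetchone()
(ndirs,) = db.execute("SELECT count(*) FROM longit").fetchone()

pad = 16  # arbitrary slop, per the "plus some pad" suggestion
print("rough upper bound on path bytes:", (maxlen + pad) * ndirs)
```

On a real longit table the same two queries would give a quick answer to whether full paths are actually expensive on space.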
Info about how we can do a recursive query to build the tree when the table just has pinode in it:

So when we get this one large table with supersummary records in it, with inode and pinode, this is an example of that. The way it's written, it assumes the top-level dir has a pinode of zero, so it knows where to start. This is the way HPSS DB2 and Robinhood Postgres work; HPSS has this isroot field on directories so when traversing it knows where to start. I suspect we will need to create indexes on inode and pinode because we will be doing joins and such on them; maybe the first time we can actually use an index 😊. It's just `create index ...`.

This is how you recurse. Notice it starts at pinode=0; you have to have a place to point it to start. And I put in the SQL so you can play along. Go into sqlite3 like this:

```
sqlite3 testdb
```

```sql
CREATE TABLE summarylong(name TEXT, type TEXT, inode INT64, pinode INT64, isroot INT64);
INSERT INTO summarylong VALUES('test','d',55762961,0,1);
INSERT INTO summarylong VALUES('l1.2','d',55762965,55762961,1);
INSERT INTO summarylong VALUES('l1.3','d',55762966,55762961,1);
INSERT INTO summarylong VALUES('l1.1','d',55762967,55762961,1);
INSERT INTO summarylong VALUES('l2.2','d',55762986,55762965,1);
INSERT INTO summarylong VALUES('l2.3','d',55762987,55762965,1);
INSERT INTO summarylong VALUES('l2.1','d',55762989,55762965,1);
INSERT INTO summarylong VALUES('l2.2','d',55762979,55762966,1);
INSERT INTO summarylong VALUES('l2.3','d',55762980,55762966,1);
INSERT INTO summarylong VALUES('l2.1','d',55762982,55762966,1);
INSERT INTO summarylong VALUES('l2.2','d',55762968,55762967,1);
INSERT INTO summarylong VALUES('l2.3','d',55762969,55762967,1);
INSERT INTO summarylong VALUES('l2.1','d',55762971,55762967,1);
```

Now we check our work:

```sql
select * from summarylong;
```

```
test|d|55762961|0|1
l1.2|d|55762965|55762961|1
l1.3|d|55762966|55762961|1
l1.1|d|55762967|55762961|1
l2.2|d|55762986|55762965|1
l2.3|d|55762987|55762965|1
l2.1|d|55762989|55762965|1
l2.2|d|55762979|55762966|1
l2.3|d|55762980|55762966|1
l2.1|d|55762982|55762966|1
l2.2|d|55762968|55762967|1
l2.3|d|55762969|55762967|1
l2.1|d|55762971|55762967|1
```

And now the magic recursive dust, and some magic field concatenation, and you can draw pretty trees and calculate levels:

```sql
with recursive under(parent,child,level,mypath,myname) as (
    select 0,inode,0,name,name from summarylong where pinode=0
    union all
    select pinode,inode,under.level+1,mypath||'/'||name,name
    from summarylong, under where pinode=under.child
)
select substr('......',1,level*3),parent,child,level,mypath,myname from under;
```

```
|0|55762961|0|test|test
...|55762961|55762967|1|test/l1.1|l1.1
...|55762961|55762965|1|test/l1.2|l1.2
...|55762961|55762966|1|test/l1.3|l1.3
......|55762967|55762971|2|test/l1.1/l2.1|l2.1
......|55762967|55762968|2|test/l1.1/l2.2|l2.2
......|55762967|55762969|2|test/l1.1/l2.3|l2.3
......|55762965|55762989|2|test/l1.2/l2.1|l2.1
......|55762965|55762986|2|test/l1.2/l2.2|l2.2
......|55762965|55762987|2|test/l1.2/l2.3|l2.3
......|55762966|55762982|2|test/l1.3/l2.1|l2.1
......|55762966|55762979|2|test/l1.3/l2.2|l2.2
......|55762966|55762980|2|test/l1.3/l2.3|l2.3
```

We should write one of these and distribute it as an example against the longit table with the real field names in it.

You might even be able to create a view on this query (`create view blah as <fancy recursion thing>`), and then the data scientist thinks the world is wonderfully magical. Let's try it so we can make this data-scientist-proof:

```sql
sqlite> create view magicrecurs as
with recursive under(parent,child,level,mypath,myname) as (
    select 0,inode,0,name,name from summarylong where pinode=0
    union all
    select pinode,inode,under.level+1,mypath||'/'||name,name
    from summarylong, under where pinode=under.child
)
select substr('......',1,level*3),parent,child,level,mypath,myname from under;
sqlite> select * from magicrecurs;
```

```
|0|55762961|0|test|test
...|55762961|55762967|1|test/l1.1|l1.1
...|55762961|55762965|1|test/l1.2|l1.2
...|55762961|55762966|1|test/l1.3|l1.3
......|55762967|55762971|2|test/l1.1/l2.1|l2.1
......|55762967|55762968|2|test/l1.1/l2.2|l2.2
......|55762967|55762969|2|test/l1.1/l2.3|l2.3
......|55762965|55762989|2|test/l1.2/l2.1|l2.1
......|55762965|55762986|2|test/l1.2/l2.2|l2.2
......|55762965|55762987|2|test/l1.2/l2.3|l2.3
......|55762966|55762982|2|test/l1.3/l2.1|l2.1
......|55762966|55762979|2|test/l1.3/l2.2|l2.2
......|55762966|55762980|2|test/l1.3/l2.3|l2.3
```

Whee, so they can go against the summarylong table or they can go against the hierarchical view magicrecurs.
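The same magicrecurs idea can be exercised from Python's built-in sqlite3 module; this sketch uses a reduced three-row summarylong chain so the output is easy to check by hand:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE summarylong(name TEXT, type TEXT, "
           "inode INT64, pinode INT64, isroot INT64)")
# A single chain test -> l1.1 -> l2.1 (inodes are made up)
db.executemany("INSERT INTO summarylong VALUES(?,?,?,?,?)", [
    ("test", "d", 100, 0, 1),
    ("l1.1", "d", 101, 100, 1),
    ("l2.1", "d", 102, 101, 1),
])
# Recursive view: seed at pinode=0, then join children onto parents
db.execute("""
CREATE VIEW magicrecurs AS
WITH RECURSIVE under(parent, child, level, mypath, myname) AS (
    SELECT 0, inode, 0, name, name FROM summarylong WHERE pinode = 0
    UNION ALL
    SELECT pinode, inode, under.level + 1, mypath || '/' || name, name
    FROM summarylong, under WHERE pinode = under.child
)
SELECT parent, child, level, mypath, myname FROM under
""")
rows = db.execute("SELECT level, mypath FROM magicrecurs ORDER BY level").fetchall()
print(rows)  # [(0, 'test'), (1, 'test/l1.1'), (2, 'test/l1.1/l2.1')]
```

This confirms the full path for every directory can be rebuilt from just name and pinode, which is the whole argument for leaving paths out of the longit table.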
Seems like maybe post-processing with GUFI-only queries will work. And this is exactly why GUFI was created: to do super amazing powerful queries very quickly across giant namespaces! From the big boss man himself:

"So let's say you come to the conclusion that I am right, and merging all this stuff into the summary record for each dir is really misleading, and you ride that downhill to: this is just post-processing to create a historical record which can have age in it, one per dir. Then this is just a post-processing step. If you buy that, then why would you need Python at all? It's just a GUFI query. Look, this is a little tree; it's not terribly representative (only one file per dir) but good enough:

```
ggrider@pn2201328 bin % ls -lR test | grep -i drw
```

Remember, the vrpentries view has all the summary fields and all the entries fields in one row, one for every file. So do a query on vrpentries and include each field that came from the summary table; for each integer variable just say max(field) (or min, doesn't matter: they are all the same for every entries record, since those come from the summary record). For the entries fields you would need to calculate using UDFs or case or whatever; for each entries record you generate whatever you need:

```
ggrider@pn2201328 bin % ./gufi_query -n 1 -E 'select dname, max(duid), min(size),max(size) from vrpentries group by pinode;' -d '|' testi
```

See vrpentries: it has all the summary fields and all the entries fields in one record, so the above is kind of trivial. If you need to add a run date to the records, that's pretty easy too with a UDF or just a constant or whatever. This just became dirt simple. Vrpentries is cool; wonder who thought of that 😊. Not that I hate Python, but gufi_query is just so powerful; this is a page of SQL. If you need a UDF to help the SQL be simpler (without so many case statements), add one. It's a dozen lines of C and probably useful in its own right."
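The group-by-pinode aggregation that gufi_query runs can be reproduced in miniature against plain sqlite3. The rows below are an invented stand-in for the vrpentries view, just to show the shape of the query:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Toy stand-in for vrpentries: summary-derived fields (dname, duid)
# repeated on every per-file row, plus a per-file field (size).
db.execute("CREATE TABLE vrpentries(dname TEXT, duid INT, size INT, pinode INT)")
db.executemany("INSERT INTO vrpentries VALUES(?,?,?,?)", [
    ("dirA", 1000, 10, 7),
    ("dirA", 1000, 90, 7),
    ("dirB", 1001,  5, 8),
])
# max(duid) is safe because duid is identical within a directory group;
# min/max(size) genuinely aggregate the per-file field.
rows = db.execute(
    "SELECT dname, max(duid), min(size), max(size) "
    "FROM vrpentries GROUP BY pinode ORDER BY pinode"
).fetchall()
print(rows)  # [('dirA', 1000, 10, 90), ('dirB', 1001, 5, 5)]
```

One output row per directory, which is exactly the "one historical record per dir" shape the post-processing step needs.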
This is how you would do a histogram; see vrpentries:

```sql
CREATE VIEW vrpentries as select
  replace(summary.name,rtrim(summary.name,replace(summary.name,"/","")),"") as dname,
  summary.name as sname, summary.mode as dmode, summary.nlink as dnlink,
  summary.uid as duid, summary.gid as dgid, summary.size as dsize,
  summary.blksize as dblksize, summary.blocks as dblocks,
  summary.atime as datime, summary.mtime as dmtime, summary.ctime as dctime,
  summary.linkname as dlinkname,
  totfiles as dtotfiles, totlinks as dtotlinks,
  minuid as dminuid, maxuid as dmaxuid, mingid as dmingid, maxgid as dmaxgid,
  minsize as dminsize, maxsize as dmaxsize,
  totltk as dtotltk, totmtk as dtotmtk, totltm as dtotltm, totmtm as dtotmtm,
  totmtg as dtotmtg, totmtt as dtotmtt, totsize as dtotsize,
  minctime as dminctime, maxctime as dmaxctime,
  minmtime as dminmtime, maxmtime as dmaxmtime,
  minatime as dminatime, maxatime as dmaxatime,
  minblocks as dminblocks, maxblocks as dmaxblocks,
  totxattr as dtotxattr, depth as ddepth,
  mincrtime as dmincrtime, maxcrtime as dmaxcrtime,
  rollupscore as sroll,
  pentries.*
from summary, pentries where summary.inode = pentries.pinode;
```

Did I get that right? No coding needed 😊
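The dname expression at the start of that view (the nested replace/rtrim) pulls the last path component out of summary.name without any UDF. Here is the trick in isolation, with a made-up path:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# rtrim(n, <n with slashes removed>) strips trailing non-slash chars,
# leaving the parent prefix; replace() then deletes that prefix.
(dname,) = db.execute(
    "SELECT replace(n, rtrim(n, replace(n, '/', '')), '') "
    "FROM (SELECT '/gufi/test/l1.1' AS n)"
).fetchone()
print(dname)  # l1.1
```

Note the inner replace builds the character set for rtrim, so any character that also appears in the last component is what stops the trim at the final '/'.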
"It should be fast. In fact, if the tree is rolled up it will use the roll-up (that's what the r in vrpentries is). This should fly: maybe 10-20 min at most for the entire smash of all FSs, I think. You run it and make sure it's fast enough, but it's pretty clear it will be, especially if the tree is rolled up. It's just a full query, all threaded. Run it with full path in the longit table, then decide. I suspect it would compress by a big factor. It should be very fast to run; if not, maybe the tree isn't rolled up."
The current plan is that longit snaps should be in the format of SQLite files. I think the motivation for this is simply the same motivation for storing the GUFI live tree in SQLite tables: it's a nice format for querying. Also, as GUFI people, we have developed expertise with SQLite, so it's a natural choice. Anyway, we have now thought of an additional motivation to store them in SQLite files, so I'm adding it here in case we ever start second-guessing this choice. This new motivation is:

"This is a great argument for making the longit snaps be an SQL table. That was always the plan, but for different reasons. Being able to join the latest longit snap with the live tree is very cool. It might even be the case that you'd join the live tree with an old longit snap; I can't think of a possible reason now, but maybe! Or a three-way join: join an old longit snap with a new one to discover dirs with large (or small) change, and then join with the live tree to learn more about those files.

However, if we need to join with the live tree, doesn't that mean our histograms aren't expressive enough? I suppose this is an amazing power in case there is ever a future admin who needs it. But, right now at least, I can't imagine a use case. And, if we can discover a use case, I think that just means we need to make our histograms more expressive. Maybe an admin might want info about a specific file. Maybe an admin notices a "weird" directory and wants to use gufi_ls to look at all the entries."
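The snap-vs-live join idea follows naturally from SQLite's ATTACH. This is only a sketch: the file names, table names, and columns below are invented for illustration, not the real longit or GUFI schemas:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
snap_path = os.path.join(tmp, "longit_snap.db")  # hypothetical snapshot file
live_path = os.path.join(tmp, "live.db")         # hypothetical live-tree file

# An old longit snap: one per-directory record.
snap = sqlite3.connect(snap_path)
snap.execute("CREATE TABLE longit(inode INT, totsize INT)")
snap.execute("INSERT INTO longit VALUES(42, 100)")
snap.commit()
snap.close()

# The live tree's summary-like table, where the same dir has grown.
live = sqlite3.connect(live_path)
live.execute("CREATE TABLE summary(inode INT, name TEXT, totsize INT)")
live.execute("INSERT INTO summary VALUES(42, 'projA', 900)")
live.execute("ATTACH DATABASE ? AS snap", (snap_path,))

# Directories whose size grew a lot since the snapshot:
rows = live.execute("""
    SELECT s.name, s.totsize - l.totsize AS growth
    FROM summary s JOIN snap.longit l ON s.inode = l.inode
    WHERE s.totsize - l.totsize > 500
""").fetchall()
print(rows)  # [('projA', 800)]
```

The three-way variant would simply ATTACH two snaps and the live tree into one connection and join across all three on inode.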
We want to enable a longitudinal study. This requires multiple phases. This issue is about the first phase. Subsequent issues will be created to describe subsequent phases so we can keep the conversations nicely organized.
We need to decide what to capture in a longitudinal snapshot. One option would be simply to snapshot the full GUFI tree, but that's very large and includes a lot of information that wouldn't be necessary for a longitudinal study. Another option is to create a tool which captures a subset of the GUFI tree: it could either copy a subset of the GUFI tree files, or run GUFI queries and save their output. That particular question is perhaps for a subsequent phase.
Regardless of how we capture a longitudinal snapshot, we have also realized we first want to add additional information to the GUFI tree. In particular, we want to capture, in the summary tables, histograms describing attributes of the entries within each directory; for example, how many files have size in [0,1), [1,2), [2,4), etc. To collaboratively decide on the required histograms, we are defining them in this spreadsheet:
https://docs.google.com/spreadsheets/d/17PSZxHLVj731bI9PKY3E1V-TzSAHJiIahioc2nujo1U/edit#gid=0
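As a concrete sketch of that bucket scheme, a CASE expression over a toy entries table yields one histogram row per bucket. The real implementation might use a UDF instead of the CASE chain, and the bucket boundaries here are just the first few from the example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries(name TEXT, size INT)")
db.executemany("INSERT INTO entries VALUES(?,?)",
               [("a", 0), ("b", 1), ("c", 2), ("d", 3)])
# Power-of-two size buckets, labeled with half-open intervals.
rows = db.execute("""
    SELECT CASE WHEN size < 1 THEN '[0,1)'
                WHEN size < 2 THEN '[1,2)'
                WHEN size < 4 THEN '[2,4)'
                ELSE '[4,inf)' END AS bucket,
           count(*)
    FROM entries GROUP BY bucket ORDER BY bucket
""").fetchall()
print(rows)  # [('[0,1)', 1), ('[1,2)', 1), ('[2,4)', 2)]
```

Adding `GROUP BY pinode, bucket` would produce per-directory histograms, which is the shape we would store in the summary tables.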