data request (if possible) #1

Open
hrbrmstr opened this issue Mar 25, 2018 · 11 comments

Comments

@hrbrmstr
Owner

hrbrmstr commented Mar 25, 2018

@cboettig / @benmarwick y'all wouldn't have some sample JSON I can use, would you? I'm technically not allowed to put "a lot" of work-work JSON data out in the wild since it can enable attackers (it saves them the $ of doing recon scans, at least).

It'd also help me tailor what I think will be chapter 7/recipe 6 more specifically to your needs.

No worries if not. I'll either convert some data, ask for work-forgiveness (er, I mean, 'permission'), go on a JSON data hunt, or use some CVE data that isn't confidential but may not be the best example of JSON to help non-cyber folks.

@hrbrmstr
Owner Author

So https://rud.is/books/drill-sergeant-rstats/reading-a-streaming-json-ndjson-data-file-with-drill-r.html is a boilerplate recipe that's a bit more involved but may help, and I can add more recipes for other examples if needed.

@cboettig

@hrbrmstr Sure thing! A bunch of example JSON files here: https://gitlab.carlboettiger.info/cboettig/supertreebase/tree/master/json

These are JSON-LD representations of phylogenetic trees originally published in XML formats in the public scientific repository http://treebase.org, all CC0 / public domain.

@hrbrmstr
Owner Author

#ty!

@hrbrmstr
Owner Author

ZOMGOSH THOSE ARE PERFECT!

@benmarwick

I'm afraid I'm mostly using CSV and TSV files, so I'm very grateful to see chapter 4!

@hrbrmstr
Owner Author

@benmarwick If you have some specific ones that are shareable, I can make topic-specific recipes as well.

@benmarwick

Thanks, most recently I've been working with these https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/, and wondering if Drill might make them easier to work with. As they are, those files are a bit impractical for an example. How about I get a small excerpt from one of those and share it here?

@hrbrmstr
Owner Author

@benmarwick take a look at https://rud.is/books/drill-sergeant-rstats/working-with-custom-delimited-format-files.html and lemme know if that's tracking towards "helpful". Dealing with that last column will require a bit of Java work (to define a UDF, i.e. a user-defined function), but I was going to cover that anyway and this is a nice example for it. And it's not as scary as it sounds (if it does, indeed, sound scary :-). Most Drill UDFs are really simple Java functions based on a template that's easy to modify.

@hrbrmstr
Owner Author

https://rud.is/books/drill-sergeant-rstats/writing-simple-drill-custom-functions-udfs-for-field-transformations.html now has the Drill UDF necessary to make the last column more usable.

@hrbrmstr
Owner Author

@cboettig What are some "typical" operations one would be performing on said phylogenetic tree data? I was able to tease out the "tree", but this is one area where I've not handled enough SO questions to be familiar enough with the data to whip up examples (yes, I may answer SO questions both to help folks and to try to get a handle on other disciplines at the same time :-)

SELECT
  version, id,
  b.tree.node AS nodes,
  b.tree.edge AS edges
FROM (
  SELECT
    a.version,
    a.`@id` AS id,
    FLATTEN(a.trees.tree) AS tree
  FROM dfs.supertreebase.`/S100.json` a
  LIMIT 10
) b


@cboettig

Great question. Common tasks might be:

  • identifying all trees which contain a given otu or set of otus (think "species"). Note that the "otu" given on the edge is a reference; one needs to check it against the corresponding otu "label" or, more ideally, an identifier URI for said otu.

  • computing the evolutionary distance between two otus: identify trees containing both otus that also include length data on the edges, then sum the lengths of the edges back to the common ancestor. (A variation of this involves identifying "time trees", in which lengths on edges are scaled such that all tips are the same distance from the overall root of the tree.)

  • more pie-in-the-sky is the notion of constructing supertrees from existing trees. Some details here: Use case: linking big cboettig/nexld#3 (comment) (where we are exploring doing this via RDF/SPARQL, but it is non-trivial). This is the example I originally had in mind, which gave me the idea that Drill may be a more performant / practical approach for this.
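
For concreteness, here is a rough sketch of the first two tasks in plain Python rather than Drill. It runs on a simplified, hypothetical in-memory structure: the field names (`otus`, `edges`, `source`, `target`, `length`) and the sample trees are assumptions made up for illustration and are not the actual supertreebase JSON-LD schema. Each edge is assumed to point from parent (`source`) to child (`target`), and `otus` maps a node id to its otu label.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical sample data: T1 has edge lengths, T2 does not.
trees: List[Dict[str, Any]] = [
    {
        "id": "T1",
        "otus": {"n2": "A", "n3": "B", "n4": "C"},  # node id -> otu label
        "edges": [
            {"source": "n0", "target": "n1", "length": 1.0},
            {"source": "n1", "target": "n2", "length": 2.0},
            {"source": "n1", "target": "n3", "length": 3.0},
            {"source": "n0", "target": "n4", "length": 4.0},
        ],
    },
    {
        "id": "T2",
        "otus": {"n1": "A", "n2": "B"},
        "edges": [
            {"source": "n0", "target": "n1", "length": None},
            {"source": "n0", "target": "n2", "length": None},
        ],
    },
]


def trees_containing(all_trees: List[Dict[str, Any]], labels: List[str]) -> List[str]:
    """Return ids of trees whose otu labels include every requested label."""
    wanted = set(labels)
    return [t["id"] for t in all_trees if wanted <= set(t["otus"].values())]


def path_to_root(parent: Dict[str, Tuple[str, float]], node: str) -> Dict[str, float]:
    """Map each ancestor of `node` (and `node` itself) to the summed edge length from `node`."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        dist[node] = total
    return dist


def tip_distance(tree: Dict[str, Any], label_a: str, label_b: str) -> Optional[float]:
    """Sum edge lengths from both tips back to their common ancestor.

    Returns None if either label is missing or any edge lacks a length.
    """
    node_for = {label: node for node, label in tree["otus"].items()}
    if label_a not in node_for or label_b not in node_for:
        return None
    if any(e.get("length") is None for e in tree["edges"]):
        return None
    # child node id -> (parent node id, edge length)
    parent = {e["target"]: (e["source"], e["length"]) for e in tree["edges"]}
    up_a = path_to_root(parent, node_for[label_a])
    up_b = path_to_root(parent, node_for[label_b])
    # Dicts keep insertion order, so the first node on b's upward path that
    # also lies on a's path is the most recent common ancestor.
    for node, d_b in up_b.items():
        if node in up_a:
            return up_a[node] + d_b
    return None


print(trees_containing(trees, ["A", "C"]))
print(tip_distance(trees[0], "A", "B"))
```

On this toy data, only T1 contains both A and C, and the A to B distance in T1 is 2.0 + 3.0 = 5.0 via their shared parent. A Drill version would presumably need to FLATTEN the node and edge arrays first and join edge references to otu labels, which is the step flagged in the first bullet.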
