
Welcome to DataGristle's wiki

First off, why DataGristle?

The answer is that easy access to vast volumes of data has changed the nature of data analysis. Fifteen years ago I might get one new source of data to analyze every month, or every three months. Now I get them every few days. But most of our tooling and methods were built for that earlier pace. So we have ETL tools, for example, that simply take far too long to configure for new feeds, and are of little help in the initial analysis of the data.

What I wanted instead was something that could automate, or at least quickly and easily perform, some of the common analysis drudgery - and ideally let me iteratively refine analysis and transformation, then smoothly transition that work into full automation. It would probably never have the feature set of a large ETL package, but then again it might build the prototype for one in a fraction of the time.

For me this meant command-line utilities, which can easily be run interactively or embedded within shell scripts for automation.
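
As a sketch of what that embedding might look like, here is a small shell script that profiles every new feed landing in a directory. This is illustrative only: the directory layout and the assumption that gristle_profiler takes its input file as a positional argument are mine, not the project's documented interface.

```bash
#!/usr/bin/env bash
# A minimal sketch of embedding a DataGristle tool in a shell script.
# Assumptions for illustration: the incoming/ and reports/ directories
# exist, and gristle_profiler accepts its input file as a positional
# argument - check gristle_profiler --help for the actual interface.
set -euo pipefail

for feed in incoming/*.csv; do
    # Profile each new feed and keep the report alongside the data.
    gristle_profiler "$feed" > "reports/$(basename "$feed" .csv).txt"
done
```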

Oh, and I chose the name one night while laughing with my family about how my very literate teenage children didn't know the word 'gristle' - because in a house of vegetarians it simply never comes up. And not surprisingly, it's not very common in software titles either. Of course, it's got some baggage, but hey, it's a funny word and works great as a metaphor.

Usage Scenarios

  • Operational Diagnostics 1 - a marketing sentiment-analysis company uses it to quickly discover problems in spreadsheets sent to them by their customers. These spreadsheets are often subtly malformed, or contain invalid values that are difficult to spot; gristle_profiler quickly sanity-checks them and finds the outliers.
  • Operational Diagnostics 2 - a large data warehousing team uses it whenever their bulk-load process breaks on invalid data. Their database's bulk loader provides little information in such cases, so they use gristle_freaker to quickly size up the nature of the data in a few problematic columns, gristle_viewer to examine individual records, and gristle_profiler to sanity-check the file structure (see the sketch after this list). This has sped up problem determination and resolution enormously.
  • Feed Analysis - a large data warehousing team uses it whenever they have new potential data sources to analyze. gristle_profiler quickly finds data quality issues and identifies characteristics useful for data modeling. On large, complex feeds it can sometimes perform 8-20 hours of initial analysis in just five minutes.
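
The diagnostic workflow in the second scenario maps onto a short shell session like the one below. This is a sketch only: the file name, the column and record numbers, and the -c/-r flags are assumptions for illustration - consult each tool's --help for its actual options.

```bash
# Hypothetical triage after a bulk-load failure. Flag names are assumed.
gristle_profiler suspect_feed.csv            # sanity-check the file structure
gristle_freaker  suspect_feed.csv -c 4       # frequency distribution of a problematic column
gristle_viewer   suspect_feed.csv -r 1072    # inspect the record that broke the load
```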

What's Included

  • gristle_profiler - Analyzes csv files and prints information about the file structure and each field within it.
  • gristle_slicer - Selects rows and columns out of a csv file (see the composition sketch after this list).
  • gristle_differ - Compares two csv files and writes out their differences.
  • gristle_freaker - Creates frequency distributions of one or more columns of a csv file.
  • gristle_viewer - Displays a single record from a csv file organized in two columns, with labels to the left and values to the right.
  • gristle_validator - Validates csv files using jsonschema.
  • as well as some other minor tools
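
To give a flavor of how these tools compose, here is a hypothetical two-step pipeline: carve a subset out of a file, then build a frequency distribution over it. The -c/-r flags and the Python-style slice syntax are assumptions suggested by the descriptions above, not documented behavior - verify against each tool's --help.

```bash
# Hypothetical composition: slice first, then count. Flags are assumed.
gristle_slicer  sample.csv -c ':5' -r '10:' > subset.csv   # first five columns, rows 10 onward
gristle_freaker subset.csv -c 0                            # frequency counts for the first column
```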

About the Code
