Datagristle is a toolbox of tough and flexible data connectors and analyzers.
It is an interactive mix of ETL and data analysis, optimized for rapid analysis and manipulation of a wide variety of data.

It's neither an enterprise ETL tool nor an enterprise analysis, reporting, or data mining tool. It's intended to be an easily adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities needed to do 80% of the work. Its open source Python codebase allows it to be easily extended with custom code to handle that always-challenging last 20%.

Current Status: Strong support for easy analysis and simple transformations of CSV files.

#Next Steps:

  • attractive PDF output of gristle_determinator.py
  • metadata database population

#Its objectives include:

  • multi-platform (Unix, Linux, Mac OS, Windows with effort)
  • multi-language (primarily Python)
  • free - no cripple-licensing
  • primary audience is programming data analysts - not non-technical analysts
  • primary environment is the command line rather than Windows, a graphical desktop, or Eclipse
  • extensible
  • allow bi-directional iteration between ETL & data analysis
  • can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools

#Installation

```pip install datagristle```

```easy_install datagristle```
  • Or download the tarball from PyPI

#Dependencies

  • Python 2.6 or Python 2.7

#Mature Utilities Provided in This Release:

  • gristle_determinator.py
    • Identifies file formats, generates metadata, prints file analysis report
    • This is the most mature utility. The other utilities also use it, so you generally do not need to enter file structure info.
  • gristle_freaker.py
    • Produces a frequency distribution of multiple columns from an input file.
  • gristle_slicer.py
    • Used to extract a subset of columns and rows out of an input file.
  • gristle_viewer.py
    • Shows one record from a file at a time - formatted based on metadata.
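To make the ideas behind these utilities concrete, here is a minimal sketch using only the Python standard library. It is not datagristle's implementation and does not use its command-line options. It assumes Python 3, and the sample data and function names (`sniff_dialect`, `frequency`, `slice_rows`) are hypothetical. It shows dialect detection, a frequency distribution of one column, and extraction of a subset of rows and columns.

```python
import csv
import io
from collections import Counter

# Hypothetical pipe-delimited sample; not taken from datagristle.
SAMPLE = "id|state|amount\n1|CO|10\n2|NM|25\n3|CO|7\n"


def sniff_dialect(text, candidates="|,\t"):
    """Guess the CSV dialect of the text, choosing among candidate delimiters."""
    return csv.Sniffer().sniff(text, delimiters=candidates)


def read_rows(text, dialect):
    """Parse the text into a list of rows using the given dialect."""
    return list(csv.reader(io.StringIO(text), dialect))


def frequency(rows, col):
    """Count how often each value occurs in one column (0-based index)."""
    return Counter(row[col] for row in rows)


def slice_rows(rows, row_range, cols):
    """Keep only the rows in row_range and the columns listed in cols."""
    return [[row[c] for c in cols] for row in rows[row_range]]


if __name__ == "__main__":
    dialect = sniff_dialect(SAMPLE)
    rows = read_rows(SAMPLE, dialect)
    print("delimiter:", repr(dialect.delimiter))
    print("state counts:", frequency(rows[1:], 1))   # skip the header row
    print("slice:", slice_rows(rows, slice(1, 3), [0, 2]))
```

Detecting the dialect first and passing it to the later steps mirrors the point made above: once the file structure is known, the other operations do not need it to be entered by hand.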

#Immature Utilities Provided in This Release:

  • gristle_differ.py
    • Shows differences between two files
  • gristle_file_converter.py
    • Converts a CSV file from one dialect to another. Can handle multi-character field delimiters as well as record delimiters.
  • gristle_filter.py
    • Applies simple filter logic to a file.
  • gristle_scalar.py
    • Performs scalar operations (min, max, avg, count unique, etc.) on a file.
  • gristle_validator.py
    • Validates a file - currently just confirms number of fields for each row.
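As a rough illustration of two of the operations listed above, the sketch below checks the field count of each row and rewrites a file with a different delimiter. It is an assumption-laden example, not the code behind `gristle_validator.py` or `gristle_file_converter.py`: it assumes Python 3, the function names are hypothetical, and the standard `csv` module only accepts single-character delimiters, so the multi-character case is not covered.

```python
import csv
import io


def invalid_rows(text, expected_fields, delimiter=","):
    """Return (row_number, field_count) for rows with the wrong field count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]


def convert_dialect(text, in_delimiter, out_delimiter):
    """Re-write delimited text using a different single-character delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=in_delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=out_delimiter, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    data = "a,b,c\n1,2\n3,4,5\n"  # hypothetical input with one short row
    print(invalid_rows(data, expected_fields=3))
    print(convert_dialect("a,b\n1,2\n", ",", "|"))
```

Going through a reader and a writer, rather than replacing delimiter characters in the raw text, keeps quoted fields intact when a value happens to contain the new delimiter.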

#Future utilities:

  • gristle_metadata.py
    • Manages metadata - allows users to query, add, update, and delete file, field, transformation, and reporting descriptions.
  • gristle_generator
    • Generates test data based on gristle metadata
  • gristle_validator
    • Confirms validity of database and file structure and contents.
  • gristle_file_joiner.py
    • joins two files on their common keys and produces a new file
  • gristle_grouper.py
    • reads a file, aggregates on a given set of fields, produces a new file
  • gristle_db_loader.py
    • loads a file into a database
  • gristle_db_extractor.py
    • extracts data from a database into a file
  • gristle_field_merge.py
    • prints the matched values from multiple files side by side along with counts
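Since the join and grouping utilities are only planned, the following is purely a hypothetical sketch of the underlying operations in Python 3, not a preview of their design. The function names (`join_on_key`, `sum_by`), the inner-join behavior, and the choice of summing as the aggregate are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def join_on_key(left_text, right_text, key):
    """Inner-join two CSV texts (with header rows) on a shared key column."""
    right_by_key = defaultdict(list)
    for row in csv.DictReader(io.StringIO(right_text)):
        right_by_key[row[key]].append(row)

    joined = []
    for left_row in csv.DictReader(io.StringIO(left_text)):
        for right_row in right_by_key.get(left_row[key], []):
            merged = dict(left_row)
            merged.update(right_row)  # right-hand values win on name clashes
            joined.append(merged)
    return joined


def sum_by(rows, group_fields, value_field):
    """Sum value_field for each distinct combination of group_fields."""
    totals = defaultdict(float)
    for row in rows:
        group = tuple(row[field] for field in group_fields)
        totals[group] += float(row[value_field])
    return dict(totals)


if __name__ == "__main__":
    left = "id,state\n1,CO\n2,NM\n3,CO\n"      # hypothetical data
    right = "id,amount\n1,10\n3,7\n4,99\n"     # hypothetical data
    joined = join_on_key(left, right, "id")
    print(joined)
    print(sum_by(joined, ["state"], "amount"))
```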

#Licensing

  • Gristle uses the BSD license - see the separate LICENSE file for further information