A curated list of data wrangling resources with a bias towards command line tools without steep learning curves.
xlsx2csv Command line tool to convert xlsx to CSV. Fast and works for large xlsx files. Doesn't handle password-protected files.
libreoffice Use the GUI or headless mode to convert xlsx to CSV. Doesn't seem to handle passwords or multiple sheets in headless mode.
Apache POI Java APIs for manipulating various file formats based upon the Office Open XML standards. Can be used to extract text from spreadsheets, supports passwords etc., but can be memory intensive.
Excel Streaming Reader Java streaming Excel reader using Apache POI - use for reading in large files without exhausting memory
Excelize Golang library that reads and writes XLSX files generated by Microsoft Excel 2007 and later.
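As a rough illustration of scripting the two command line converters above, the sketch below builds their argument lists and runs whichever tool is found on the PATH. It assumes `xlsx2csv` is invoked as `xlsx2csv -s SHEET input.xlsx output.csv` and that LibreOffice is available as `soffice`; the function names are hypothetical helpers, not part of either tool.

```python
import shutil
import subprocess
from pathlib import Path


def xlsx2csv_command(xlsx_path, csv_path, sheet=1):
    """Build the argv for xlsx2csv; -s selects the sheet by number."""
    return ["xlsx2csv", "-s", str(sheet), str(xlsx_path), str(csv_path)]


def libreoffice_command(xlsx_path, out_dir):
    """Build the argv for a headless LibreOffice conversion to CSV."""
    return ["soffice", "--headless", "--convert-to", "csv",
            "--outdir", str(out_dir), str(xlsx_path)]


def convert(xlsx_path, out_dir):
    """Convert with xlsx2csv if installed, otherwise fall back to LibreOffice."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # LibreOffice names its output <input stem>.csv inside out_dir,
    # so the same target path is used for both tools.
    csv_path = out_dir / (Path(xlsx_path).stem + ".csv")
    if shutil.which("xlsx2csv"):
        cmd = xlsx2csv_command(xlsx_path, csv_path)
    elif shutil.which("soffice"):
        cmd = libreoffice_command(xlsx_path, out_dir)
    else:
        raise RuntimeError("neither xlsx2csv nor soffice found on PATH")
    subprocess.run(cmd, check=True)
    return csv_path
```

Calling `convert("report.xlsx", "out")` would write `out/report.csv` if either tool is installed. Password-protected workbooks would still need something like Apache POI.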
ripgrep Extremely fast grep alternative.
jq A lightweight and flexible command-line JSON processor
xmlstarlet Command line tools to transform, query, validate, and edit XML documents
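When a query has to live inside a script rather than a shell pipeline, the same kind of selection can be approximated with Python's standard library. This is only a sketch on made-up sample documents; the comments name the jq filter and XPath expression each function roughly corresponds to.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample documents, used only for illustration.
json_doc = '{"users": [{"name": "Ada"}, {"name": "Linus"}]}'
xml_doc = "<users><user name='Ada'/><user name='Linus'/></users>"


def json_names(text):
    """Roughly what the jq filter `.users[].name` selects."""
    return [user["name"] for user in json.loads(text)["users"]]


def xml_names(text):
    """Roughly what the XPath `/users/user/@name` selects."""
    root = ET.fromstring(text)  # root is the <users> element
    return [user.get("name") for user in root.findall("user")]


print(json_names(json_doc))
print(xml_names(xml_doc))
```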
XSV CSV Toolkit Fast, command line toolkit for CSV manipulation and analysis. Written in Rust.
awesomecsv A curated list of resources for dealing with CSV data.
awesome-csv A collection about the comma-separated values (CSV) world for rich structured data in (plain) text
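To give a feel for the operations these CSV toolkits automate, here is a minimal standard-library sketch of two common ones: selecting columns and counting value frequencies. It is not how xsv works internally, and the sample data and function names are made up for illustration.

```python
import csv
import io
from collections import Counter


def select_columns(csv_text, columns):
    """Return rows containing only the named columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in columns} for row in reader]


def frequency(csv_text, column):
    """Count how often each value appears in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Hypothetical sample data.
sample = "name,city,age\nAda,London,36\nLinus,Helsinki,28\nGrace,London,45\n"
print(select_columns(sample, ["name", "city"]))
print(frequency(sample, "city").most_common())
```

For files that do not fit in memory, the dedicated tools above are likely a better fit than this in-memory approach.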
zstandard Fast, efficient compressor for small data sets (less than 100MB) leveraging pre-trained dictionaries.
shoco A fast, ASCII-biased entropy encoder for short strings using trained bigrams. Trained on English by default, supports training custom models.
smaz Dictionary based compressor for very small strings (less than 100 bytes). By default uses an English dictionary but can be customized via code.
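The idea these tools share, priming the compressor with byte sequences that are expected to recur, can be tried with nothing but the standard library. The sketch below uses zlib's preset dictionary (`zdict`) as a stand-in; it is not zstandard, shoco, or smaz, and the dictionary and message are hypothetical.

```python
import zlib

# Hypothetical dictionary: byte sequences expected to recur in the messages.
DICTIONARY = b'{"user_id": , "country": "", "status": "active"}'


def compress_with_dict(data, zdict=DICTIONARY):
    """Compress data with a preset dictionary shared by both sides."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, zdict=DICTIONARY):
    """Decompress data produced by compress_with_dict with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


message = b'{"user_id": 1234, "country": "DE", "status": "active"}'
plain = zlib.compress(message, 9)
primed = compress_with_dict(message)
# Sizes: original, plain zlib, zlib with preset dictionary.
print(len(message), len(plain), len(primed))
```

The decompressing side must hold exactly the same dictionary, which is the main operational cost of this approach.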
pdftabextract A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents. For an overview see the blog here
Redash Connect to any data source, easily visualize and share your data via dashboard. Open source and self hostable.
Superset - previously known as caravel Data exploration platform designed to be visual, intuitive, and interactive. From Airbnb, Python based, supports Druid along with other SQL sources.
Metabase Visual analytics and dashboards from a wide range of SQL sources. Java based, SQL and non SQL modes, easy to share.
miller Like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
libpostal Libpostal is an open source project designed to provide fast, global address expansion and parsing using natural language processing techniques. Unlike several open source address parsing systems which rely on explicit patterns and complex regular expressions, libpostal relies on a trained model derived from a corpus of global place names, national address patterns, and different languages' terminology. It was created by Al Barrentine, initially for the Mapzen OpenVenues project. See intro post for details.
pelias.io A modular, open-source geocoder built on top of ElasticSearch for fast geocoding. Works with OpenStreetMap, OpenAddresses, Geonames, and Who's on First and can leverage libpostal for address parsing and expansion.
Global Chain Store Names List 131k chain store names from OpenVenues
Generate fake data for testing and demos
phoney Command line program that accepts a template and outputs fake data. golang based.
faker.js and faker cli Generate fake data from a browser, cli or REST call. Node.js based and includes avatars. See demo
faker Python based module and command line tool for generating fake data
elizabeth Python module for generating fake data profiles. Claims to be faster than alternatives, simpler and more self contained.
Mockaroo Web app for generating realistic test data. Free up to 1000 records.
generatedata.com Web app for generating test data. Generate up to 100 for free, 5000 for $20 or download from http://benkeen.github.io/generatedata/
dsgen-big Dataset generator for producing dirty data with duplicates, typos etc. Based on the original Febrl dbgen code.
ranger An open source fake data generator.
kolpa An open source fake data generator in go
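For small test fixtures where installing a generator is overkill, a seeded standard-library sketch along these lines may be enough. The name pools, field layout, and `dirty_rate` parameter are all assumptions made for illustration; the typo step only loosely mimics the kind of dirty data dsgen-big produces.

```python
import random

# Hypothetical sample pools; real generators ship far larger ones.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Ritchie"]
DOMAINS = ["example.com", "example.org", "example.net"]


def fake_person(rng):
    """Build one fake record from the sample pools."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@{rng.choice(DOMAINS)}".lower(),
        "age": rng.randint(18, 90),
    }


def add_typo(text, rng):
    """Swap two adjacent characters to simulate a typing error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def fake_people(count, seed=0, dirty_rate=0.0):
    """Generate records; the same seed always yields the same output."""
    rng = random.Random(seed)
    people = []
    for _ in range(count):
        person = fake_person(rng)
        if rng.random() < dirty_rate:
            person["name"] = add_typo(person["name"], rng)
        people.append(person)
    return people
```

Seeding the generator keeps test data reproducible across runs, for example `fake_people(100, seed=42, dirty_rate=0.1)`.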
Use parallel versions of gzip, bzip2 etc. where possible. The difference in compression throughput can be large, especially on modern multi-core servers.
lbzip2 Parallel bzip2 compression utility
pigz A parallel implementation of gzip for modern multi-processor, multi-core machines.
xz General-purpose data compression software with a high compression ratio and parallel support.
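The reason parallel compressors scale is that the input can be cut into chunks that are compressed independently. The sketch below illustrates that idea with the standard library by emitting one gzip member per chunk; it is a simplification and not how pigz is implemented (pigz, as far as I know, produces a single stream). The chunk size and worker count are arbitrary assumptions.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data, chunk_size=1 << 20, workers=4):
    """Compress chunks concurrently and join them as a multi-member gzip stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid, empty gzip member
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so members stay in sequence.
        members = pool.map(gzip.compress, chunks)
        return b"".join(members)
```

The gzip format allows concatenated members, so `gzip.decompress` and the `gunzip` command should read the result back as one file. Compressing chunks independently costs some compression ratio, since matches cannot span chunk boundaries.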
talisman Javascript NLP library that includes a large selection of phonetic fingerprints, fuzzy matching keys and distance metrics
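Two of the building blocks such libraries provide, a fingerprint key and an edit distance, are small enough to sketch in Python. These are generic textbook versions written for illustration, not talisman's implementations; the fingerprint follows the common "fold, lowercase, strip punctuation, sort unique tokens" recipe.

```python
import re
import unicodedata


def fingerprint(text):
    """Normalize a string into a key so that near-identical values collide."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    tokens = re.sub(r"[^\w\s]", "", folded.lower()).split()
    return " ".join(sorted(set(tokens)))


def levenshtein(a, b):
    """Edit distance over insertions, deletions and substitutions (two-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Fingerprints are cheap and work well for grouping exact-after-normalization duplicates; edit distance is more expensive but catches typos that change the key.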
visidata A curses interface for exploring and arranging tabular data.
conductor Conductor is an orchestration engine from Netflix. Workflows are defined using a JSON based DSL and are either control tasks (fork, conditional etc) or application tasks (e.g. encode a file) that are executed on a remote machine. Process tasks are executed by remote (any language) workers that poll the workflow state. Java core but workers are just http clients.
camunda An open source Business Process and Decision Automation platform with support for BPMN and modelling. Process tasks are executed by task clients that are executed by the process engine. Java focused.
apache airflow Airflow is a platform to programmatically author, schedule and monitor workflows. Workflows are directed acyclic graphs (DAGs) of tasks that are executed, by the airflow scheduler, on an array of workers while following the specified dependencies. Python focused.
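The core idea behind DAG-based workflow engines, running each task only after the tasks it depends on, can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the Airflow, Conductor, or Camunda API; `run_workflow` and the task names are hypothetical, and real engines add scheduling, retries, and distributed workers on top.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run callables in an order that respects the dependency graph.

    tasks: mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of task names it depends on
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name]()
    return order, results


# Hypothetical three-step pipeline.
tasks = {
    "extract": lambda: "raw rows",
    "transform": lambda: "clean rows",
    "load": lambda: "loaded",
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_workflow(tasks, dependencies)
print(order)
```

Every dependency named in `dependencies` is assumed to also appear in `tasks`; otherwise the lookup fails with a `KeyError`.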