A parser for the Virginia State Corporation Commission's business registration records.

Crump

A parser for the Virginia State Corporation Commission's business entity records, which are provided as a single, enormous fixed-width file. Named for Beverley T. Crump, the first member of the State Corporation Commission.
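As a rough illustration of what parsing a fixed-width file involves, here is a minimal sketch that slices each line at known offsets and writes CSV. The field names and offsets in `LAYOUT` are invented for the example (this README does not give the real layout), and `parse_line` and `to_csv` are hypothetical helpers, not Crump's own functions.

```python
import csv
import io

# Hypothetical layout: (field name, start offset, length).
# These offsets are made up for illustration only.
LAYOUT = [
    ("id", 0, 8),
    ("name", 8, 20),
    ("status", 28, 2),
]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of whitespace-stripped values."""
    return {
        name: line[start:start + length].strip()
        for name, start, length in layout
    }


def to_csv(lines, layout=LAYOUT):
    """Parse every line and return the result as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[name for name, _, _ in layout],
        lineterminator="\n",
    )
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_line(line, layout))
    return buffer.getvalue()
```

Because every field sits at a fixed position, no delimiter or quoting logic is needed on the input side; padding is simply stripped from each slice.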

Crump retrieves the current SCC records (updated weekly) and turns them into CSV and JSON. Optionally, it can also improve the quality of the data (formatting dates and ZIP codes, replacing internal status codes with human-readable translations, etc.), atomize the data into millions of individual JSON files, or create Elasticsearch-compatible bulk API data.
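The kind of cleanup described above might look something like the following sketch. Everything specific here is an assumption made for illustration: the raw date format (`YYYYMMDD`), the nine-digit ZIP representation, the field names, and the status codes in `STATUS_LABELS` are invented, and none of these functions are taken from Crump itself.

```python
from datetime import datetime

# Invented status codes; the real SCC codes are not listed in this README.
STATUS_LABELS = {"00": "Active", "01": "Terminated"}


def format_date(raw):
    """Turn an assumed YYYYMMDD string into ISO 8601, or None if unparseable."""
    raw = raw.strip()
    if not raw:
        return None
    try:
        return datetime.strptime(raw, "%Y%m%d").date().isoformat()
    except ValueError:
        return None


def format_zip(raw):
    """Normalize an assumed 5- or 9-digit ZIP to 12345 or 12345-6789."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 9:
        # Treat an all-zero suffix as "no ZIP+4 available".
        if digits[5:] == "0000":
            return digits[:5]
        return digits[:5] + "-" + digits[5:]
    if len(digits) == 5:
        return digits
    return raw.strip()


def transform(record):
    """Return a cleaned copy of a record (hypothetical field names)."""
    cleaned = dict(record)
    cleaned["incorporated"] = format_date(record.get("incorporated", ""))
    cleaned["zip"] = format_zip(record.get("zip", ""))
    code = record.get("status", "")
    # Fall back to the raw code when no translation is known.
    cleaned["status"] = STATUS_LABELS.get(code, code)
    return cleaned
```

Returning `None` for unparseable dates, rather than raising, keeps one bad value from aborting a run over millions of records.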

The most recent copy of the raw SCC data can be found at https://s3.amazonaws.com/virginia-business/current.zip.
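For anyone who wants to fetch that archive by hand, a minimal sketch using only the Python standard library follows. The function names are hypothetical, the contents of the archive are not assumed, and this is not necessarily how Crump's own download option works.

```python
import io
import os
import urllib.request
import zipfile

DATA_URL = "https://s3.amazonaws.com/virginia-business/current.zip"


def fetch(url=DATA_URL):
    """Download the archive and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def unpack(zip_bytes, output_dir):
    """Extract every member of the archive and return the member names."""
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(output_dir)
        return archive.namelist()
```

Calling `unpack(fetch(), "data")` would leave the extracted file or files in a `data` directory.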

Usage

usage: crump [-h] [-a] [-i file.txt] [-o output_dir] [-t] [-d] [-e] [-m]

optional arguments:
  -h, --help            show this help message and exit
  -a, --atomize         generate millions of per-record JSON files
  -i file.txt, --input file.txt
                        raw SCC data (default: cisbemon.txt)
  -o output_dir, --output output_dir
                        directory for JSON and CSV
  -t, --transform       format properly date, ZIP, etc. fields
  -d, --download        download the data file, if missing
  -e, --elasticsearch   create Elasticsearch bulk API data
  -m, --map             generate Elasticsearch index map

For most purposes, ./crump -td is probably the best way to invoke Crump: it downloads the current data file if it is missing and transforms the data so that it adheres to basic data quality norms.
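To give a sense of what the atomize and Elasticsearch options in the usage listing produce, here is a sketch of both output shapes. The bulk format (an action line followed by a source line, newline-delimited) is the standard Elasticsearch bulk API layout; the index name `business`, the `id` field, and the function names are assumptions for illustration, not Crump's actual choices.

```python
import json
import os


def atomize(records, output_dir, id_field="id"):
    """Write one JSON file per record, named after its ID; return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for record in records:
        path = os.path.join(output_dir, f"{record[id_field]}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
        paths.append(path)
    return paths


def to_bulk(records, index="business", id_field="id"):
    """Build newline-delimited JSON suitable for the Elasticsearch bulk API."""
    lines = []
    for record in records:
        # Action line: tells Elasticsearch where to index the next document.
        lines.append(json.dumps({"index": {"_index": index, "_id": record[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(record))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

The bulk payload pairs each document with its own action line, which is why the output has twice as many lines as there are records.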

License

Released under the MIT License.