Data cleaning made easy
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.pytest_cache/v/cache
automator
script
test_env
tests
.DS_Store
.gitignore
.travis.yml
CHANGELOG.md
CONTRIBUTING.md
LICENSE
README.md
pytest.ini
requirements.txt
setup.py

README.md

Automator

Build Status

πŸ”₯ Data cleaning made easy

80% of data scientists' time is spent finding, cleansing and organizing data, which leaves only 20 percent to actually perform analysis. Keep a pipeline of SQL statements in a Postgres table that can be automatically called to clean data.

You get:

  • Easy management for everyone on your team from engineering to product to marketing
  • Table for organizing SQL queries from category to order (rank) to status (ex: Draft, Live, etc.)
  • Log of every time the SQL queries are run

Battle-tested at Keel, where it's categorized over $700 million in user transactions.

Installation

The latest stable release (and older versions) can be installed from PyPI:

pip install automator

And run:

python
from automator import Cleaner
myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name')
# Creates two tables (automator_queries and automator_logs)
myAutomator.runInstaller()

If you wish to uninstall, run:

python
from automator import Cleaner
myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name')
myAutomator.runUninstaller()

Usage

Execute the SQL queries that have a status of 'Live' ordered by 'rank':

from automator import Cleaner
myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name')
# All Queries
myAutomator.runQueries()
# Queries with a category of 'Users'
myAutomator.runQueries('Users')

To save an SQL query (Web UI coming Soon):

INSERT INTO public.automator_queries (code, rank, status, category)
VALUES ('DELETE FROM users WHERE email ILIKE ''%test.com''', 1, 'Draft', 'Users');

To update an SQL query (Web UI coming Soon):

UPDATE automator_queries
SET status = 'Live'
WHERE id = 1;

Automator_queries

Queries are stores in your database with the following fields.

  • id - primary key
  • title
  • category
  • description
  • status - automator_queries are only run if they have a status of 'Live'
  • code
  • created_at
  • updated_at
  • rank - ascending integers

Automator_logs

When you run your SQL queries, Automator creates a log with useful information (more details coming soon).

  • category
  • start_time
  • end_time
  • error

Security

It's a great practice to create a separate Postgres user that Automator uses. Once created, you can restrict which tables it has access to (see tutorial).

Package

Basic structure of package is

β”œβ”€β”€ README.md
β”œβ”€β”€ automator
β”‚Β Β  β”œβ”€β”€ __init__.py
β”‚Β Β  β”œβ”€β”€ Cleaner.py
β”‚Β Β  └── version.py
β”œβ”€β”€ pytest.ini
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ setup.py
└── tests
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ fixtures
    β”‚Β Β  β”œβ”€β”€ state_1-5.py
    β”œβ”€β”€ helpers
    β”‚Β Β  β”œβ”€β”€ __init__.py
    β”‚Β Β  └── my_helper.py
    β”œβ”€β”€ tests_helper.py
    └── unit
        β”œβ”€β”€ __init__.py
        β”œβ”€β”€ test_cleaner.py
        └── test_version.py

Tests

Testing is set up using pytest and coverage is handled with the pytest-cov plugin.

Run your tests with py.test in the root directory.

Coverage is ran by default and is set in the pytest.ini file. To see an html output of coverage open htmlcov/index.html after running the tests.

Travis CI

There is a .travis.yml file that is set up to run your tests for python 2.7 and python 3.2, should you choose to use it.