80% of data scientists' time is spent finding, cleansing and organizing data, which leaves only 20 percent to actually perform analysis. Keep a pipeline of SQL statements in a Postgres table that can be automatically called to clean data.
- Easy management for everyone on your team from engineering to product to marketing
- Table for organizing SQL queries from category to order (rank) to status (ex: Draft, Live, etc.)
- Log of every time the SQL queries are run
Battle-tested at Keel, where it's categorized over $700 million in user transactions.
The latest stable release (and older versions) can be installed from PyPI:
pip install automator
python from automator import Cleaner myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name') # Creates two tables (automator_queries and automator_logs) myAutomator.runInstaller()
If you wish to uninstall, run:
python from automator import Cleaner myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name') myAutomator.runUninstaller()
Execute the SQL queries that have a status of 'Live' ordered by 'rank':
from automator import Cleaner myAutomator = Cleaner('postgres://database_username:database_password@database_url:5432/database_name') # All Queries myAutomator.runQueries() # Queries with a category of 'Users' myAutomator.runQueries('Users')
To save an SQL query (Web UI coming Soon):
INSERT INTO public.automator_queries (code, rank, status, category) VALUES ('DELETE FROM users WHERE email ILIKE ''%test.com''', 1, 'Draft', 'Users');
To update an SQL query (Web UI coming Soon):
UPDATE automator_queries SET status = 'Live' WHERE id = 1;
Queries are stores in your database with the following fields.
- id - primary key
- status - automator_queries are only run if they have a status of 'Live'
- rank - ascending integers
When you run your SQL queries, Automator creates a log with useful information (more details coming soon).
It's a great practice to create a separate Postgres user that Automator uses. Once created, you can restrict which tables it has access to (see tutorial).
Basic structure of package is
├── README.md ├── automator │ ├── __init__.py │ ├── Cleaner.py │ └── version.py ├── pytest.ini ├── requirements.txt ├── setup.py └── tests ├── __init__.py ├── fixtures │ ├── state_1-5.py ├── helpers │ ├── __init__.py │ └── my_helper.py ├── tests_helper.py └── unit ├── __init__.py ├── test_cleaner.py └── test_version.py
Testing is set up using pytest and coverage is handled with the pytest-cov plugin.
Run your tests with
py.test in the root directory.
Coverage is ran by default and is set in the
To see an html output of coverage open
htmlcov/index.html after running the tests.
There is a
.travis.yml file that is set up to run your tests for python 2.7
and python 3.2, should you choose to use it.