-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create dataset script #252
Conversation
thanks!!
|
|
Adding it to the CLI commands is a cool idea! I'm down for that if other people want it too! |
Looks good! Just a couple thoughts: There will probably need to be various checks in place to prevent things like overwriting to existing files, which happened when I ran the script for an existing dataset (hpd_violations.csv). Could also take this further and allow users to add info via cli inputs, like the URL. I would also like to see it on the command line! But just wondering if it is good practice to allow modifications of the source code via cli? Maybe just require a path to the local repo of nycdb that will hold the skeleton files? |
good point @vukevint. Perhaps this directory should use the current working directory instead? |
this reminds me of something I've wanted before: to be able to declare the transformations in schema:
- table_name: some_city_dataset
transformations:
- bbl
- csv
fields:
boro: int
block: int
lot: int
[...] with this #236 it would be very cool if people could submit datasets with just a yaml file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM + some feedback.
First of, holistically: Overall this looks like definitely very positive, well-considered effort that scratches addresses a significant project need. Very nice work indeed.
Also, I haven't taken the time to thoroughly analyze the scripts inner workings, or to test it beyond running it once (on a freshly pulled dob_certificate_occupancy
dataset).
My suggestions below are mostly of the nitpicking sort, in this context (though a couple of the suggested changes are of the strongly recommended). As such, they are "voluntary" suggestions (subject to whatever the admins might say).
Suggested Changes
(1) the script should definitely move to src/scripts
.
I'd also like to boldy suggest we use hyphens rather than underscores in script names at least. So the new path becomes:
src/scripts/create-dataset.py
Both are safe, permitted and widely used in Unix-land -- I happen to think hyphenated filenames in general are much easier on the eyes. Others may disagree of course.
(2) Since this is a "script" it should be have perms set accordingly (+ogx
) and should have canonical script header:
#!/usr/bin/env python3
(3) imports should be done at the top of the script
Currently we have on lines 277-278:
import random
import subprocess
I would also suggest we order stdlib imports (and all other natural groupings of imports) by length, for readability:
import re
import sys
import csv
import random
import argparse
import textwrap
import subprocess
(4) in non-testing context, prefer exceptions over asserts
Right now we have the block:
assert DATASETS_DIR.exists()
assert TRANSFORMATIONS_PY_PATH.exists()
assert SQL_DIR.exists()
assert TEST_DIR.exists()
assert NYCDB_TEST_PY_PATH.exists()
assert TEST_DATA_DIR.exists()
Strictly speaking -- though the assert
keyword can certainly be used this way, it's really intended for testing contexts (and in fact can be a liability in runtime scripts, because it gets disabled if the -O
flag is invoked.
In runtime contexts it's generally better to use exceptions or some other kind of failure handling, e.g.:
if not DATASETS_DIR.exists():
fail("fatal error: cannot verify datasets directory")
(5) Lines 47-61 - remove spaces between these lines (so they contract to a single block)
(6) Prefer triple-double (rather than single) quote block delimiters
As this seems more standard (or at least more consistent with the rest of the codebase).
Applies both to the comment block at the very top, and to various docstrings throughout the script.
(7) Prefer wider over taller text blurbs
It seems these mostly tap out at 55-60 chars per line.
In my view this ends up making the code (and shell output) look choppy, and "taller" than it needs to be.
In the view of many style guides -- paragraphs should in general be delimited consistnetly to 78-79 chars if they are intended for output to the shell (for example the "Scaffolding created ..." blurb).
In docstrings -- people have differing views (I not only don't mind, much in general prefer longer docstrings) but in general it seems there's a consensus these should also fill out to 78-79 characters as well.
Added a review (with requested changes), hopefully in the right spirit. |
Hey @wstlabs Thanks for your really thorough review on this, appreciate all the improvements you suggested. I've made all the changes. If everyone else thinks this looks ok let me know and I can merge it in |
Resolves: #98
Builds off of https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd and https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8
Tested it out and made a couple of small tweaks from the version @austensen had. I think it looks good.
Questions:
I put it in
/src/
, but does anyone feel like it belongs insrc/scripts
since it's.... you know.. a script? If folks agree, I can move it and update the path variables in the script.There's a
to_camel_case()
function in the script file, but it is not actually transforming the text to camelCase, it's transforming it to PascalCase. Maybe it's a nitpick, but I feel like we should be accurate. Feels confusing to me to use the wrong terminology. What do other folks think?