Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create dataset script #252

Merged
merged 6 commits into from
Jul 9, 2024
Merged

Create dataset script #252

merged 6 commits into from
Jul 9, 2024

Conversation

steve52
Copy link
Member

@steve52 steve52 commented Feb 23, 2023

Resolves: #98

Builds off of https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd and https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8

Tested it out and made a couple of small tweaks from the version @austensen had. I think it looks good.

Questions:

  1. I put it in /src/, but does anyone feel like it belongs in src/scripts since it's.... you know.. a script? If folks agree, I can move it and update the path variables in the script.

  2. There's a to_camel_case() function in the script file, but it is not actually transforming the text to camelCase, it's transforming it to PascalCase. Maybe it's a nitpick, but I feel like we should be accurate. Feels confusing to me to use the wrong terminology. What do other folks think?

@steve52 steve52 self-assigned this Feb 23, 2023
@aepyornis
Copy link
Collaborator

thanks!!

  1. in my opinion this might as well just go in src/nycdb and usable from the cli so you can do something like nycdb --scaffold but if others don't want helpers like that in the main program, keeping it as a script is fine too.

  2. DoesntMatterToMe :)

@mimiflynn
Copy link

  1. Yes, I agree regarding src/scripts as it makes the most sense for standalone scripts.
  2. lol isn't it our job to be pedantic? I totally agree. If its PascalCasing a thing then I would hate someone to question their sanity when it doesn't camelCase a thing.

@steve52
Copy link
Member Author

steve52 commented Feb 28, 2023

Adding it to the CLI commands is a cool idea! I'm down for that if other people want it too!

@vukevint
Copy link
Collaborator

Looks good! Just a couple thoughts:

There will probably need to be various checks in place to prevent things like overwriting to existing files, which happened when I ran the script for an existing dataset (hpd_violations.csv). Could also take this further and allow users to add info via cli inputs, like the URL.

I would also like to see it on the command line! But just wondering if it is good practice to allow modifications of the source code via cli? Maybe just require a path to the local repo of nycdb that will hold the skeleton files?

@aepyornis
Copy link
Collaborator

good point @vukevint. Perhaps this directory should use the current working directory instead?

@aepyornis
Copy link
Collaborator

aepyornis commented Mar 1, 2023

Could also take this further and allow users to add info via cli inputs, like the URL.

this reminds me of something I've wanted before: to be able to declare the transformations in datasets.yml and not need the file dataset_transformations.py at all so a schema entry could look like

schema:
  - table_name: some_city_dataset
    transformations:
      - bbl
      - csv
    fields:
      boro: int
      block: int
      lot: int
      [...]

with this #236 it would be very cool if people could submit datasets with just a yaml file.

Copy link
Collaborator

@wstlabs wstlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM + some feedback.

First of, holistically: Overall this looks like definitely very positive, well-considered effort that scratches addresses a significant project need. Very nice work indeed.

Also, I haven't taken the time to thoroughly analyze the scripts inner workings, or to test it beyond running it once (on a freshly pulled dob_certificate_occupancy dataset).

My suggestions below are mostly of the nitpicking sort, in this context (though a couple of the suggested changes are of the strongly recommended). As such, they are "voluntary" suggestions (subject to whatever the admins might say).

Suggested Changes

(1) the script should definitely move to src/scripts.

I'd also like to boldy suggest we use hyphens rather than underscores in script names at least. So the new path becomes:

src/scripts/create-dataset.py

Both are safe, permitted and widely used in Unix-land -- I happen to think hyphenated filenames in general are much easier on the eyes. Others may disagree of course.

(2) Since this is a "script" it should be have perms set accordingly (+ogx) and should have canonical script header:

#!/usr/bin/env python3

(3) imports should be done at the top of the script

Currently we have on lines 277-278:

    import random
    import subprocess

I would also suggest we order stdlib imports (and all other natural groupings of imports) by length, for readability:

import re
import sys
import csv
import random
import argparse
import textwrap
import subprocess

(4) in non-testing context, prefer exceptions over asserts

Right now we have the block:

assert DATASETS_DIR.exists()
assert TRANSFORMATIONS_PY_PATH.exists()
assert SQL_DIR.exists()
assert TEST_DIR.exists()
assert NYCDB_TEST_PY_PATH.exists()
assert TEST_DATA_DIR.exists()

Strictly speaking -- though the assert keyword can certainly be used this way, it's really intended for testing contexts (and in fact can be a liability in runtime scripts, because it gets disabled if the -O flag is invoked.

In runtime contexts it's generally better to use exceptions or some other kind of failure handling, e.g.:

if not DATASETS_DIR.exists():
    fail("fatal error: cannot verify datasets directory")

(5) Lines 47-61 - remove spaces between these lines (so they contract to a single block)

(6) Prefer triple-double (rather than single) quote block delimiters

As this seems more standard (or at least more consistent with the rest of the codebase).

Applies both to the comment block at the very top, and to various docstrings throughout the script.

(7) Prefer wider over taller text blurbs

It seems these mostly tap out at 55-60 chars per line.

In my view this ends up making the code (and shell output) look choppy, and "taller" than it needs to be.

In the view of many style guides -- paragraphs should in general be delimited consistnetly to 78-79 chars if they are intended for output to the shell (for example the "Scaffolding created ..." blurb).

In docstrings -- people have differing views (I not only don't mind, much in general prefer longer docstrings) but in general it seems there's a consensus these should also fill out to 78-79 characters as well.

@wstlabs
Copy link
Collaborator

wstlabs commented Nov 17, 2023

Added a review (with requested changes), hopefully in the right spirit.

@austensen
Copy link
Member

Added a review (with requested changes), hopefully in the right spirit.

Hey @wstlabs Thanks for your really thorough review on this, appreciate all the improvements you suggested. I've made all the changes.

If everyone else thinks this looks ok let me know and I can merge it in

@austensen austensen merged commit 712811e into main Jul 9, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding a script that creates scaffolding for a new dataset
6 participants