Create dataset script #252

steve52 · 2023-02-23T20:53:37Z

Resolves: #98

Builds off of https://gist.github.com/toolness/ff8d00f36234442d650c63311178f8bd and https://gist.github.com/austensen/9d1b4eda9ca82ec220f07c7f0245e6a8

Tested it out and made a couple of small tweaks from the version @austensen had. I think it looks good.

Questions:

I put it in /src/, but does anyone feel like it belongs in src/scripts since it's.... you know.. a script? If folks agree, I can move it and update the path variables in the script.
There's a to_camel_case() function in the script file, but it is not actually transforming the text to camelCase, it's transforming it to PascalCase. Maybe it's a nitpick, but I feel like we should be accurate. Feels confusing to me to use the wrong terminology. What do other folks think?
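
To make the naming question concrete, here is a minimal sketch of the two conventions. The function names `to_pascal_case` and `to_lower_camel_case` are illustrative and not taken from the script, and the sketch assumes snake_case input.

```python
def to_pascal_case(name: str) -> str:
    """Convert snake_case to PascalCase: every word is capitalized."""
    return "".join(word.capitalize() for word in name.split("_") if word)


def to_lower_camel_case(name: str) -> str:
    """Convert snake_case to camelCase: the first word stays lowercase."""
    pascal = to_pascal_case(name)
    return pascal[:1].lower() + pascal[1:]


print(to_pascal_case("hpd_violations"))       # HpdViolations
print(to_lower_camel_case("hpd_violations"))  # hpdViolations
```

If the script's output is the first form, renaming the helper to something like `to_pascal_case` would match what it actually does.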

aepyornis · 2023-02-24T17:52:02Z

thanks!!

in my opinion this might as well just go in src/nycdb and be usable from the cli, so you can do something like nycdb --scaffold, but if others don't want helpers like that in the main program, keeping it as a script is fine too.
DoesntMatterToMe :)

mimiflynn · 2023-02-24T19:07:50Z

Yes, I agree regarding src/scripts as it makes the most sense for standalone scripts.
lol isn't it our job to be pedantic? I totally agree. If it's PascalCasing a thing then I would hate for someone to question their sanity when it doesn't camelCase a thing.

steve52 · 2023-02-28T16:32:19Z

Adding it to the CLI commands is a cool idea! I'm down for that if other people want it too!

vukevint · 2023-02-28T23:23:59Z

Looks good! Just a couple thoughts:

There will probably need to be various checks in place to prevent things like overwriting existing files, which happened when I ran the script for an existing dataset (hpd_violations.csv). Could also take this further and allow users to add info via cli inputs, like the URL.
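
One possible shape for such a check, as a sketch only: the helper name `write_new_file` is hypothetical and not part of the script. It relies on Python's exclusive-creation mode (`"x"`), which refuses to open a file that already exists.

```python
import tempfile
from pathlib import Path


def write_new_file(path: Path, content: str) -> bool:
    """Write content to path only if the file does not exist yet.

    Returns True if the file was written, False if it was skipped.
    """
    try:
        # Mode "x" raises FileExistsError instead of truncating the file.
        with path.open("x", encoding="utf-8") as f:
            f.write(content)
    except FileExistsError:
        print(f"skipping {path.name}: already exists")
        return False
    return True


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hpd_violations.csv"
    first = write_new_file(target, "a,b\n")
    second = write_new_file(target, "overwritten\n")
    remaining = target.read_text(encoding="utf-8")
```

Skipping with a message is one choice; failing outright or requiring an explicit force option would be equally reasonable.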

I would also like to see it on the command line! But just wondering if it is good practice to allow modifications of the source code via cli? Maybe just require a path to the local repo of nycdb that will hold the skeleton files?

aepyornis · 2023-03-01T15:31:06Z

good point @vukevint. Perhaps this directory should use the current working directory instead?
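
A rough sketch of that idea, assuming a hypothetical `--root` option (this flag is not an existing nycdb option): the target directory defaults to the current working directory unless a path is given.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the scaffold arguments; --root is a hypothetical option."""
    parser = argparse.ArgumentParser(description="Scaffold dataset files")
    parser.add_argument("dataset_name")
    parser.add_argument(
        "--root",
        type=Path,
        default=None,
        help="directory that will hold the skeleton files (default: cwd)",
    )
    args = parser.parse_args(argv)
    # Resolve the default at call time rather than at import time.
    if args.root is None:
        args.root = Path.cwd()
    args.root = args.root.resolve()
    return args


args = parse_args(["my_dataset"])
print(args.root)
```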

aepyornis · 2023-03-01T15:47:33Z

> Could also take this further and allow users to add info via cli inputs, like the URL.

this reminds me of something I've wanted before: to be able to declare the transformations in datasets.yml and not need the file dataset_transformations.py at all so a schema entry could look like

schema:
  - table_name: some_city_dataset
    transformations:
      - bbl
      - csv
    fields:
      boro: int
      block: int
      lot: int
      [...]

with this #236 it would be very cool if people could submit datasets with just a yaml file.
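
A minimal sketch of how named transformations could be looked up and chained. Everything here is an assumption for illustration: the registry, the function names, and the `strip` step are hypothetical, and the `schema` dict stands in for an already-parsed entry from datasets.yml.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]


def add_bbl(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical 'bbl' step: build a 10-digit BBL from boro/block/lot."""
    for row in rows:
        boro, block, lot = int(row["boro"]), int(row["block"]), int(row["lot"])
        yield {**row, "bbl": f"{boro}{block:05d}{lot:04d}"}


def strip_whitespace(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical 'strip' step: trim whitespace from every value."""
    for row in rows:
        yield {key: value.strip() for key, value in row.items()}


# Assumed registry mapping the names used in the YAML to functions.
REGISTRY: Dict[str, Callable[[Iterable[Row]], Iterator[Row]]] = {
    "bbl": add_bbl,
    "strip": strip_whitespace,
}


def apply_transformations(schema: dict, rows: Iterable[Row]) -> Iterable[Row]:
    """Chain the transformations a schema entry names, in order."""
    for name in schema.get("transformations", []):
        if name not in REGISTRY:
            raise KeyError(f"unknown transformation: {name}")
        rows = REGISTRY[name](rows)
    return rows


# Stand-in for one parsed schema entry from datasets.yml.
schema = {"table_name": "some_city_dataset", "transformations": ["strip", "bbl"]}
rows = [{"boro": " 1", "block": "123 ", "lot": "45"}]
result = list(apply_transformations(schema, rows))
print(result[0]["bbl"])  # 1001230045
```

Rejecting unknown names up front means a typo in the YAML surfaces before any rows are processed.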

wstlabs

LGTM + some feedback.

First off, holistically: overall this looks like a very positive, well-considered effort that addresses a significant project need. Very nice work indeed.

Also, I haven't taken the time to thoroughly analyze the script's inner workings, or to test it beyond running it once (on a freshly pulled dob_certificate_occupancy dataset).

My suggestions below are mostly of the nitpicking sort in this context (though a couple of the suggested changes are of the strongly recommended sort). As such, they are "voluntary" suggestions (subject to whatever the admins might say).

Suggested Changes

(1) the script should definitely move to `src/scripts`.

I'd also like to boldly suggest we use hyphens rather than underscores, in script names at least. So the new path becomes:

src/scripts/create-dataset.py

Both are safe, permitted and widely used in Unix-land -- I happen to think hyphenated filenames in general are much easier on the eyes. Others may disagree of course.

(2) Since this is a "script" it should have perms set accordingly (`ugo+x`) and should have the canonical script header:

#!/usr/bin/env python3

(3) imports should be done at the top of the script

Currently we have on lines 277-278:

    import random
    import subprocess

I would also suggest we order stdlib imports (and all other natural groupings of imports) by length, for readability:

import re
import sys
import csv
import random
import argparse
import textwrap
import subprocess

(4) in non-testing context, prefer exceptions over asserts

Right now we have the block:

assert DATASETS_DIR.exists()
assert TRANSFORMATIONS_PY_PATH.exists()
assert SQL_DIR.exists()
assert TEST_DIR.exists()
assert NYCDB_TEST_PY_PATH.exists()
assert TEST_DATA_DIR.exists()

Strictly speaking -- though the assert keyword can certainly be used this way, it's really intended for testing contexts (and in fact can be a liability in runtime scripts, because assert statements are stripped when Python runs with the -O flag).

In runtime contexts it's generally better to use exceptions or some other kind of failure handling, e.g.:

if not DATASETS_DIR.exists():
    fail("fatal error: cannot verify datasets directory")
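
The snippet above leaves `fail` undefined, so here is one way the whole check could look. This is a sketch under assumptions: `fail`, `missing_paths`, and `check_layout` are hypothetical helpers, and the labels in `required` are placeholders for the script's path constants.

```python
import sys
from pathlib import Path
from typing import Dict, List


def fail(message: str, code: int = 1) -> None:
    """Print a message to stderr and exit with a non-zero status."""
    print(message, file=sys.stderr)
    sys.exit(code)


def missing_paths(required: Dict[str, Path]) -> List[str]:
    """Return the labels of required paths that do not exist."""
    return [label for label, path in required.items() if not path.exists()]


def check_layout(required: Dict[str, Path]) -> None:
    """Exit with an error naming every missing path; unaffected by -O."""
    missing = missing_paths(required)
    if missing:
        fail("fatal error: cannot verify " + ", ".join(missing))


# Placeholder mapping; the real script would list DATASETS_DIR, SQL_DIR, etc.
required = {"current directory": Path.cwd()}
check_layout(required)
```

Collecting all missing paths before exiting reports every problem in one run instead of one per invocation.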

(5) Lines 47-61 - remove spaces between these lines (so they contract to a single block)

(6) Prefer triple-double (rather than single) quote block delimiters

As this seems more standard (or at least more consistent with the rest of the codebase).

Applies both to the comment block at the very top, and to various docstrings throughout the script.

(7) Prefer wider over taller text blurbs

It seems these mostly tap out at 55-60 chars per line.

In my view this ends up making the code (and shell output) look choppy, and "taller" than it needs to be.

In the view of many style guides -- paragraphs should in general be wrapped consistently to 78-79 chars if they are intended for output to the shell (for example the "Scaffolding created ..." blurb).

In docstrings -- people have differing views (I not only don't mind, but in general much prefer, longer docstrings) but it seems there's a consensus these should also fill out to 78-79 characters.
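
For shell output, one way to get consistent width without hand-wrapping is the standard library's `textwrap`. This is a sketch: `format_blurb` is a hypothetical helper, and the message text is made up for the example rather than copied from the script.

```python
import textwrap

WIDTH = 78


def format_blurb(text: str, width: int = WIDTH) -> str:
    """Dedent a triple-quoted blurb and re-wrap each paragraph to width."""
    paragraphs = textwrap.dedent(text).strip().split("\n\n")
    wrapped = (textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs)
    return "\n\n".join(wrapped)


message = """
    This first paragraph was written
    with short, choppy lines that make
    the output taller than it needs to be.

    A second paragraph is kept separate
    and wrapped on its own.
"""
out = format_blurb(message)
print(out)
```

With this approach the source can be indented and wrapped however is convenient, and the output width is decided in one place.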

wstlabs · 2023-11-17T19:33:42Z

Added a review (with requested changes), hopefully in the right spirit.

austensen · 2024-05-29T15:04:57Z

> Added a review (with requested changes), hopefully in the right spirit.

Hey @wstlabs, thanks for your really thorough review on this. I appreciate all the improvements you suggested, and I've made all the changes.

If everyone else thinks this looks ok, let me know and I can merge it in.

steve52 added 2 commits February 23, 2023 15:38

Add script to create datasets

6c2813c

Remove import nycdb, add ()s to lower calls and include postcode in zipcode array

193d343

steve52 self-assigned this Feb 23, 2023

austensen mentioned this pull request Feb 28, 2023

create contributing guide #255

Open

respoect the PascalCase

9f3cc4a

wstlabs mentioned this pull request Nov 17, 2023

[documentation] better file layout needed #277

Open

wstlabs requested changes Nov 17, 2023

austensen added 3 commits May 28, 2024 18:59

address (most) review comments

3341e54

update cursor in test for psycopg3

f159891

Merge branch 'main' into create-dataset-script

eeb6977

austensen merged commit 712811e into main Jul 9, 2024
12 checks passed
