Wrong count in LP yaml file #29
Comments
This is actually intentional, although poorly thought out. The SCC's file layout describes the first few fields like so:
So they're basically nesting data within a field: both the unique number and where the organization was formed. With the YAML, we're capturing that position as its own field, too, figuring that it can always be discarded if somebody doesn't want it. That said, the poorly thought-out bit is that this transforms the data somewhat. After all, a pure capture of the data would keep the field count the same. I have not settled on the point at which we should begin making transformations. Perhaps it would be best to first do a literal translation of the raw data to CSV and save those files, and then embark on a transformative process in which columns are renamed (e.g., lowercased, unnecessary prefixes removed), nested data is broken out into its own fields, geocoding is done, and so on? You have more experience than anybody else in this area. What do you think is the right way to handle this?
From looking at the website, I think the 'L' or 'M' is actually part of the identifier (see also the CORP-ID in the main corporate table, where some identifiers have an 'F' as the first character and some are numeric). I would suggest that it's probably best to extract the full identifier and then extract intelligence from it. If you decompose it into constituent parts when storing it as data (as opposed to inferring other attributes from it), I think you'll have difficulties, for example when building up URLs such as https://sccefile.scc.virginia.gov/Business/L014744 from the identifier. Hope this helps.
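To illustrate the suggestion above, here is a minimal sketch of keeping the identifier whole and deriving other attributes from it on demand. The function names `business_url` and `leading_letter` are hypothetical, and it is only an assumption that every identifier fits the URL pattern quoted in the comment.

```python
from typing import Optional

# URL prefix taken from the example in the comment; whether every
# identifier resolves under this pattern is an assumption.
BASE_URL = "https://sccefile.scc.virginia.gov/Business/"


def business_url(corp_id: str) -> str:
    """Build a lookup URL from the full, undecomposed identifier."""
    return BASE_URL + corp_id.strip()


def leading_letter(corp_id: str) -> Optional[str]:
    """Infer the leading letter (e.g. 'L', 'M', 'F'), if there is one.

    The stored identifier is not modified; this only reads from it.
    """
    first = corp_id.strip()[:1]
    return first if first.isalpha() else None
```

Because the stored value is never split, the URL can be rebuilt directly, and the letter remains available as a derived attribute for anyone who wants it.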
I'm actually leaving the identifier intact.
    - name: CORP-FORMED
      type: A
      start: 2
      length: 1
      description: Unique number assigned to LP
    - name: CORP-ID
      type: A
      start: 2
      length: 7
      description: Unique number assigned to LP
D'oh, I just noticed that the
Anyhow, I think (hope?) this is the best of both worlds: the identifier is intact, but a little more data is gathered at the same time.
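A rough sketch of how two overlapping field definitions like these could be applied to a fixed-width record. It assumes `start` is a 1-based character position, which this thread does not confirm. The sample line and the `parse_record` helper are invented for illustration and are not taken from crump or from real SCC data.

```python
# Field definitions mirroring the two YAML entries above.
FIELDS = [
    {"name": "CORP-FORMED", "start": 2, "length": 1},
    {"name": "CORP-ID", "start": 2, "length": 7},
]


def parse_record(line, fields=FIELDS):
    """Slice one fixed-width line into named fields.

    Assumes 'start' is 1-based, so it is shifted by one to get a
    Python slice offset. Overlapping fields are allowed.
    """
    record = {}
    for field in fields:
        begin = field["start"] - 1
        record[field["name"]] = line[begin:begin + field["length"]]
    return record


# Hypothetical record: an arbitrary character in position 1,
# then the seven-character identifier in positions 2 through 8.
sample = "XL014744EXAMPLE LP"
parsed = parse_record(sample)
```

Under these assumptions, `parsed["CORP-ID"]` keeps the whole identifier while `parsed["CORP-FORMED"]` duplicates its first character, which is why the YAML yields one more field than the source layout.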
Yes, starting at position 2 for two things is not ideal, and what confused
Chris
On 7 May 2014 19:53, Waldo Jaquith notifications@github.com wrote:
OpenCorporates :: The Open Database of the Corporate World
The second entry in https://github.com/openva/crump/blob/master/table_maps/3_lp.yaml is wrong -- it's not a separate field but part of the CORP-ID