Wrong count in LP yaml file #29

Closed
CountCulture opened this issue May 7, 2014 · 4 comments

Comments

@CountCulture

The second entry in https://github.com/openva/crump/blob/master/table_maps/3_lp.yaml is wrong -- it's not a separate field but part of the CORP-ID

@waldoj
Member

waldoj commented May 7, 2014

This is actually intentional, although poorly thought-out.

The SCC's file layout describes the first few fields like this:

2 REC-TYPE          (A2)      Record Type = '03' 
2 CORP-ID           (A7)      Unique number assigned to LP 
                              Pos. 1 L = Formed in VA
                              Pos. 1 M = Formed outside VA 

So they're basically nesting data within a field—both the unique number and where the organization was formed. With the YAML, we're capturing that position as its own field, too, figuring that it can always be discarded if somebody doesn't want it.

That said, the poorly thought-out bit is that this transforms the data somewhat. After all, a pure capture of the data would keep the field count the same. I have not settled on the point at which we should begin making transformations. Perhaps it would be best to first just do a literal translation of the raw data to CSV, and save those files? And then embark on a transformative process, in which columns are renamed (e.g., lowercased, unnecessary prefixes removed), nested data is broken out into its own fields, geocoding is done, etc., etc?

You have more experience than anybody else in this area—what do you think the right way to handle this is?
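The two-stage approach floated above (a literal translation first, then a separate transformation pass) might look roughly like the sketch below. The `LAYOUT` tuples, the function names, the zero-based offsets, and the sample line are all assumptions made for illustration; this is not code from the project.

```python
# Hypothetical layout: (name, zero-based start, length) for the first two fields.
LAYOUT = [("REC-TYPE", 0, 2), ("CORP-ID", 2, 7)]

def literal_translation(line):
    """Stage 1: one output column per field in the layout, nothing added."""
    return {name: line[start:start + length] for name, start, length in LAYOUT}

def transform(row):
    """Stage 2: rename columns and break nested data out into its own field."""
    out = {name.lower().replace("-", "_"): value.strip()
           for name, value in row.items()}
    # The first character of the identifier carries the formation code.
    out["corp_formed"] = out["corp_id"][:1]
    return out

raw = literal_translation("03L014744")  # made-up sample record
clean = transform(raw)
```

Keeping the stages separate means the stage 1 output has the same field count as the SCC layout, while anything derived lives only in the stage 2 output.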

@CountCulture
Author

From looking at the website I think the 'L' or 'M' is actually part of the identifier (see also what happens in the CORP-ID in the main corporate table, where some identifiers have an 'F' as the first character and some have a digit there). I would suggest that it's probably best to extract the full identifier, and then to extract intelligence from it. If you decompose it into constituent parts when storing as data (as opposed to inferring other attributes from it), I think you'll have difficulties. For example, you need the full identifier to build URLs such as https://sccefile.scc.virginia.gov/Business/L014744

Hope this helps
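As a rough illustration of "extract the full identifier, then infer from it", the sketch below stores only the identifier and derives everything else on demand. The prefix meanings come from the SCC layout quoted earlier in the thread; the URL pattern is inferred from the single example URL above, and the function names are hypothetical.

```python
# Prefix meanings taken from the SCC file layout quoted earlier in the thread.
FORMATION_BY_PREFIX = {"L": "Formed in VA", "M": "Formed outside VA"}

# Assumed pattern, inferred from the one example URL in this comment.
BASE_URL = "https://sccefile.scc.virginia.gov/Business/"

def formed_where(corp_id):
    """Infer where the LP was formed; None if the prefix is not L or M."""
    return FORMATION_BY_PREFIX.get(corp_id[:1])

def business_url(corp_id):
    """Build the lookup URL from the complete, undecomposed identifier."""
    return BASE_URL + corp_id
```

For example, `business_url("L014744")` gives back the URL quoted above, and `formed_where("L014744")` returns "Formed in VA".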

@waldoj
Member

waldoj commented May 7, 2014

I'm actually leaving the identifier intact. L014744 is extracted as a single unit of data. It's just that there's another field that contains L, since that has its own additional meaning. You can see in the YAML that both of these fields start at position 2.

- name:        CORP-FORMED
  type:        A
  start:       2
  length:      1
  description: Unique number assigned to LP
- name:        CORP-ID
  type:        A
  start:       2
  length:      7
  description: Unique number assigned to LP

D'oh—I just noticed that the description field is unhelpfully duplicated. :-/

Anyhow, I think (hope?) this is the best of both worlds—the identifier is intact, but a little more data is gathered at the same time.
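A minimal sketch of how a parser could read the two overlapping entries above by slicing on `start` and `length`. It assumes `start` is a zero-based offset (consistent with a two-character REC-TYPE preceding CORP-ID); the REC-TYPE entry, the `parse_record` name, and the sample line are added for illustration only.

```python
# Field definitions mirroring the YAML entries; REC-TYPE at offset 0 is an assumption.
FIELDS = [
    {"name": "REC-TYPE", "start": 0, "length": 2},
    {"name": "CORP-FORMED", "start": 2, "length": 1},
    {"name": "CORP-ID", "start": 2, "length": 7},
]

def parse_record(line, fields=FIELDS):
    """Slice each field independently, so overlapping fields are not a problem."""
    return {
        f["name"]: line[f["start"]:f["start"] + f["length"]].strip()
        for f in fields
    }

# Made-up sample record, not real SCC data.
record = parse_record("03L014744REST OF RECORD")
# record["CORP-FORMED"] is "L" and record["CORP-ID"] is "L014744"
```

Because each field is sliced on its own, the identifier stays intact even though CORP-FORMED reads the same first character.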

@CountCulture
Author

Yes, starting at position 2 for two things is not ideal, and is what confused us to start off with. It also means you can't use #unpack to split the line into an array of elements. But don't worry if it works for you. We're doing it in Ruby, so we needed some changes in any case.

Chris
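To illustrate the point about sequential unpacking, here is a sketch using Python's `struct` module rather than Ruby's `#unpack`. A sequential format consumes each byte once, so two fields that start at the same position cannot both appear in it; the overlapping value has to be derived after the split. The format string, function name, and sample bytes are assumptions for illustration.

```python
import struct

# Sequential layout: 2 bytes of REC-TYPE followed by 7 bytes of CORP-ID.
RECORD_HEAD = struct.Struct("2s7s")

def split_head(line):
    """Split the start of a record into non-overlapping fields."""
    rec_type, corp_id = RECORD_HEAD.unpack(line[:RECORD_HEAD.size])
    return rec_type.decode("ascii"), corp_id.decode("ascii")

# Made-up sample record, not real SCC data.
rec_type, corp_id = split_head(b"03L014744REST OF RECORD")

# The overlapping value is derived afterwards instead of being unpacked.
corp_formed = corp_id[:1]
```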

OpenCorporates :: The Open Database of the Corporate World
http://opencorporates.com
OpenlyLocal :: Making Local Government More Transparent
http://openlylocal.com
Blog: http://countculture.wordpress.com
Twitter: http://twitter.com/CountCulture
