Wrong count in LP yaml file #29

CountCulture · 2014-05-07T07:36:44Z

The second entry in https://github.com/openva/crump/blob/master/table_maps/3_lp.yaml is wrong -- it's not a separate field but part of the CORP-ID.

waldoj · 2014-05-07T14:43:09Z

This is actually intentional, although poorly thought-out.

The SCC's file layout describes the first few fields as follows:

2 REC-TYPE          (A2)      Record Type = '03' 
2 CORP-ID           (A7)      Unique number assigned to LP 
                              Pos. 1 L = Formed in VA
                              Pos. 1 M = Formed outside VA

So they're basically nesting data within a field—both the unique number and where the organization was formed. With the YAML, we're capturing that position as its own field, too, figuring that it can always be discarded if somebody doesn't want it.

That said, the poorly thought-out bit is that this transforms the data somewhat. After all, a pure capture of the data would keep the field count the same. I have not settled on the point at which we should begin making transformations. Perhaps it would be best to first just do a literal translation of the raw data to CSV, and save those files? And then embark on a transformative process, in which columns are renamed (e.g., lowercased, unnecessary prefixes removed), nested data is broken out into its own fields, geocoding is done, etc., etc?
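The two-stage idea can be sketched roughly as follows. This is only an illustration, not crump's actual code: the function names, the zero-based slice offsets, the made-up sample record, and the renamed columns (`rec_type`, `id`, `formed`) are all assumptions.

```python
import csv
import io


def literal_rows(records):
    """Stage 1: copy each raw field as-is, keeping the SCC's field names."""
    for rec in records:
        # Assumed zero-based offsets: REC-TYPE is 2 chars, CORP-ID is 7 chars.
        yield {"REC-TYPE": rec[0:2], "CORP-ID": rec[2:9]}


def to_csv(rows):
    """Serialize the literal rows to CSV text, one column per raw field."""
    rows = list(rows)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def transform(row):
    """Stage 2: rename columns and break nested data out into its own field."""
    corp_id = row["CORP-ID"]
    return {
        "rec_type": row["REC-TYPE"],
        "id": corp_id,           # identifier kept intact
        "formed": corp_id[:1],   # nested position 1: L or M
    }


raw = list(literal_rows(["03L014744"]))  # made-up sample record
literal_csv = to_csv(raw)
transformed = [transform(r) for r in raw]
```

Keeping stage 1 free of any interpretation means the field count of the saved CSV matches the SCC layout, and every later change lives in stage 2.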

You have more experience than anybody else in this area—what do you think the right way to handle this is?

CountCulture · 2014-05-07T16:02:50Z

From looking at the website, I think the 'L' or 'M' is actually part of the identifier (see also what happens in the CORP-ID in the main corporate table, where some identifiers have an 'F' in the first character, and some have a numeric field). I would suggest that it's probably best to extract the full identifier, and then to extract intelligence from it. If you decompose it into constituent parts when storing as data (as opposed to inferring other attributes from it), I think you'll have difficulties, for example when building up URLs from the identifier, such as https://sccefile.scc.virginia.gov/Business/L014744

Hope this helps
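A rough sketch of that approach: store the identifier whole and infer everything else from it. The `describe` helper and its return shape are hypothetical; the L/M meanings come from the SCC layout quoted earlier in the thread, and the URL pattern is assumed from the single example above.

```python
# URL prefix taken from the example URL in this thread; assumed to hold for other IDs.
BASE_URL = "https://sccefile.scc.virginia.gov/Business/"

# Meanings of position 1, as given in the SCC file layout for LP records.
FORMED = {"L": "Formed in VA", "M": "Formed outside VA"}


def describe(corp_id):
    """Infer attributes from the full identifier without splitting it in storage."""
    return {
        "corp_id": corp_id,
        "formed": FORMED.get(corp_id[:1]),  # None for prefixes not listed above
        "url": BASE_URL + corp_id,
    }


info = describe("L014744")
```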

waldoj · 2014-05-07T18:53:39Z

I'm actually leaving the identifier intact. L014744 is extracted as a single unit of data. It's just that there's another field that contains L, since that has its own additional meaning. You can see in the YAML that both of these fields start at position 2.

- name:        CORP-FORMED
  type:        A
  start:       2
  length:      1
  description: Unique number assigned to LP
- name:        CORP-ID
  type:        A
  start:       2
  length:      7
  description: Unique number assigned to LP

D'oh—I just noticed that the description field is unhelpfully duplicated. :-/

Anyhow, I think (hope?) this is the best of both worlds—the identifier is intact, but a little more data is gathered at the same time.
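To make the overlap concrete, here is a minimal sketch of reading both fields from one record. It is not crump's implementation: the `extract` helper and the sample record are made up, and it assumes `start` is a zero-based offset, which would be consistent with the two-character REC-TYPE coming first.

```python
# Field map mirroring the two YAML entries above (only the keys used here).
FIELDS = [
    {"name": "CORP-FORMED", "start": 2, "length": 1},
    {"name": "CORP-ID", "start": 2, "length": 7},
]


def extract(record, fields):
    """Slice each field independently, so overlapping fields are allowed."""
    return {
        f["name"]: record[f["start"]:f["start"] + f["length"]]
        for f in fields
    }


sample = "03L014744"  # made-up record: REC-TYPE "03" followed by the CORP-ID
row = extract(sample, FIELDS)
```

Because every field is sliced from the original record by its own offset, the same character can feed two fields without either one being altered.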

CountCulture · 2014-05-07T22:27:51Z

Yes, starting at position 2 for two things is not ideal, and is what confused
us to start with. It also means you can't use #unpack to split the
array into elements. But don't worry about it if it works for you. We're doing
it in Ruby, so we needed some changes in any case.

Chris
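The constraint with a simple sequential unpack format is that each byte is consumed once, so a field map with two entries at the same offset does not translate directly. A sketch of the usual workaround, using Python's `struct` module as a stand-in for Ruby's `#unpack`: unpack the non-overlapping fields, then derive the extra one. The sample record and variable names are made up.

```python
import struct

record = b"03L014744"  # made-up record: 2-char REC-TYPE, then 7-char CORP-ID

# "2s7s" reads two consecutive, non-overlapping byte strings.
rec_type, corp_id = (
    part.decode("ascii") for part in struct.unpack("2s7s", record)
)

# CORP-FORMED is derived afterwards from the intact identifier.
corp_formed = corp_id[:1]
```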


waldoj mentioned this issue May 7, 2014

Fix the description of CORP-FORMED #30

Closed

waldoj closed this as completed May 14, 2014

waldoj mentioned this issue May 19, 2014

Establish an optional transformations stanza #45

Closed
