This repository has been archived by the owner on Mar 6, 2019. It is now read-only.

Major overhaul of source mapping #75

Open
saizai opened this issue Apr 21, 2015 · 4 comments


@saizai
Contributor

saizai commented Apr 21, 2015

I noticed that a lot of mappings were
a) just wrong (e.g. linked to the wrong record, like col a vs col b, or the wrong version number / line item)
b) missing (e.g. no field to capture some data in a record)
c) duplicated (e.g. multiple fields mapped to the same name)
d) inconsistently named
e) not well delimited (e.g. unescaped commas or newlines inside fields, in data that is itself comma/newline separated)
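As a rough illustration of problem (c), here is a minimal Python sketch of a check for names that are mapped more than once within one record type. The field list and the helper name `duplicated_names` are made up for the example and are not taken from the actual mapping.

```python
from collections import Counter


def duplicated_names(field_names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(field_names)
    return [name for name in counts if counts[name] > 1]


# Hypothetical mapping for a single record type (not real data).
fields = ["form_type", "filer_id", "total_receipts", "total_receipts", "filer_id"]
print(duplicated_names(fields))
```

A check like this could run over every form/version combination before an import, so duplicates are caught in the mapping rather than in the database.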

So I'm working on a major overhaul of the source mapping, deriving it directly from the eFilingFormats spreadsheet "e-filing headers all versions.xlsx". While I'm at it, I'm making it support versions 1 & 2 as well as deprecated forms.

Since the data import will have to be redone anyway (because of a-c above), I'm being a bit aggressive about making the names consistent and semantic, e.g. total_receipts_ytd instead of col_b_total_receipts. I'm hoping to reduce the total number of canonical field names from the current ~1.2k to something a bit more sane. ;-)

The new version will have a regex-based mapping file that uses unit separator (US, ASCII 31) delimiters and includes field type/size data, both to make it easier to edit in the future and to make it possible to generate a database migration file automatically.
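To make the idea concrete, here is a minimal Python sketch of what such a mapping file and a generator for it might look like. Everything in it is an assumption for illustration: the six-column layout (form regex, version regex, position, name, SQL type, size), the sample patterns and field names, and the use of a plain CREATE TABLE statement as a stand-in for a migration file. It is not the format that will be in the pull request.

```python
import re

US = "\x1f"  # ASCII 31, the unit separator

# Hypothetical mapping lines: form regex, version regex, column position,
# canonical name, SQL type, size (empty when the type takes none).
MAPPING_TEXT = "\n".join([
    US.join([r"^F3[ANT]?$", r"^(6\.[1-3]|7\.0)$", "1", "form_type", "VARCHAR", "8"]),
    US.join([r"^F3[ANT]?$", r"^(6\.[1-3]|7\.0)$", "2", "filer_committee_id", "VARCHAR", "9"]),
    US.join([r"^F3[ANT]?$", r"^(6\.[1-3]|7\.0)$", "3", "total_receipts_ytd", "DECIMAL", "12,2"]),
])


def parse_mapping(text):
    """Parse US-delimited mapping lines into a list of dicts."""
    rows = []
    for line in text.split("\n"):
        if not line:
            continue
        form_re, version_re, position, name, sql_type, size = line.split(US)
        rows.append({
            "form": re.compile(form_re),
            "version": re.compile(version_re),
            "position": int(position),
            "name": name,
            "type": sql_type,
            "size": size,
        })
    return rows


def fields_for(rows, form_type, version):
    """Select the fields whose regexes match, ordered by column position."""
    matched = [r for r in rows
               if r["form"].search(form_type) and r["version"].search(version)]
    return sorted(matched, key=lambda r: r["position"])


def create_table_sql(table, fields):
    """Render a CREATE TABLE statement from the type/size data."""
    cols = []
    for f in fields:
        col_type = f"{f['type']}({f['size']})" if f["size"] else f["type"]
        cols.append(f"  {f['name']} {col_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"


rows = parse_mapping(MAPPING_TEXT)
print(create_table_sql("f3", fields_for(rows, "F3N", "6.2")))
```

Note that the size "12,2" contains a comma without any escaping. That is the point of the US delimiter: commas in regexes or in field data can't be mistaken for column breaks.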

I'm expecting to be done in about a week and will make a pull request then. Right now it's not in a fully consistent state.

So @dwillis et al, please hold off on working on this part of the code for the moment.

(Also, I'll be publishing an .sql.gz dump of the full import to date.)

@dwillis
Contributor

dwillis commented Apr 21, 2015

This is a pretty significant undertaking; I appreciate the effort. I do want to say one thing about conventions: using something like ytd all the time isn't correct. In some cases col_b is cycle-to-date, and we want to reflect that where we can. Reducing the list of canonical field names is something I'm very interested in, but I want to make sure we don't lose anything we actually need.

@saizai
Contributor Author

saizai commented Apr 21, 2015 via email

@dwillis
Contributor

dwillis commented Apr 21, 2015

Makes sense. e-Filing formats 5.0-6.3 use cycle-to-date for form F10, and Form 3 uses it as well for Schedule A.
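One way to keep that distinction in the canonical names is to choose the suffix per form and version instead of hard-coding ytd. The Python sketch below is only an illustration. The single F10 rule is copied from the comment above and has not been checked against the format spec. The "ctd" suffix, the base name total_amount, and the default of ytd outside the listed range are all assumptions.

```python
def parse_version(version):
    """Turn "6.3" into (6, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# (form, first version, last version) ranges where column B is cycle-to-date.
# The one entry comes from this thread; a real table would need more rules.
CYCLE_TO_DATE = [("F10", "5.0", "6.3")]


def period_suffix(form, version):
    """Return "ctd" when a cycle-to-date rule matches, else "ytd" (assumed default)."""
    v = parse_version(version)
    for rule_form, first, last in CYCLE_TO_DATE:
        if form == rule_form and parse_version(first) <= v <= parse_version(last):
            return "ctd"
    return "ytd"


def canonical_name(base, form, version):
    """Build a semantic field name such as total_amount_ctd."""
    return f"{base}_{period_suffix(form, version)}"


print(canonical_name("total_amount", "F10", "6.1"))
print(canonical_name("total_amount", "F10", "8.0"))
```

Schedule-level rules (such as the Form 3 / Schedule A case) would need an extra key in the rule table; that is left out here to keep the sketch short.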

@saizai
Contributor Author

saizai commented Apr 21, 2015 via email
