OpenTaxForms opens and automates US tax forms--it reads PDF tax forms (currently from IRS.gov only, not state forms), converts them to more feature-rich HTML5, and offers a database and API for developers to create their own tax applications. The converted forms will be available to test (and ultimately use) at OpenTaxForms.org.
pip install opentaxforms
- [issue tracker link forthcoming]
- [milestones link forthcoming]
The script reports a status for each form. Current status categories are:
- layout means textboxes and checkboxes--they should not overlap.
- refs are references to other forms--they should all be recognized (i.e., in the list of all forms).
- math is the computed fields and their dependencies--each computed field should have at least one dependency, or else what is it computed from?
Each status error has a corresponding warning in the log file, so they're easy to find. Each bugfix will likely reduce errors across many forms.
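The three status checks above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Box` class, the function names, and the input shapes are all hypothetical, chosen only to show what each category tests.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A textbox or checkbox rectangle (hypothetical structure)."""
    name: str
    x: float
    y: float
    w: float
    h: float


def overlaps(a, b):
    # Two rectangles overlap only if they intersect on both axes;
    # boxes that merely share an edge do not count as overlapping.
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def layout_errors(boxes):
    # layout: report every pair of boxes that overlap.
    return [(a.name, b.name)
            for i, a in enumerate(boxes)
            for b in boxes[i + 1:]
            if overlaps(a, b)]


def ref_errors(refs, known_forms):
    # refs: report references that are not in the list of all forms.
    return sorted(set(refs) - set(known_forms))


def math_errors(computed):
    # math: report computed fields that have no dependencies.
    # `computed` maps a field name to a list of the fields it depends on.
    return sorted(name for name, deps in computed.items() if not deps)
```

A form would then be clean in a category when the corresponding function returns an empty list.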
The RESTful API is read-only and provides a complete accounting of form fields: data type, size and position on page, and role in field groupings like dollars-and-cents fields, fields on the same line, fields in the same table, fields on the same page, and fields involved in the same formula. The API will also provide status information and tester feedback for each form.
[API docs forthcoming, for now see examples in test/run_apiclient.sh]
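To make the field groupings concrete, here is a sketch of how a client might group field records once it has them. The record layout and every key name below (`name`, `type`, `page`, `line`, and so on) are assumptions for illustration only; they are not the API's documented schema.

```python
from collections import defaultdict

# Hypothetical field records; the real API's keys and values may differ.
fields = [
    {"name": "line7_dollars", "type": "dollars", "page": 1, "line": "7",
     "x": 400, "y": 300, "w": 80, "h": 12},
    {"name": "line7_cents", "type": "cents", "page": 1, "line": "7",
     "x": 482, "y": 300, "w": 20, "h": 12},
    {"name": "line8_dollars", "type": "dollars", "page": 1, "line": "8",
     "x": 400, "y": 314, "w": 80, "h": 12},
]


def group_fields(records, key):
    """Group field names by a shared attribute, such as line or page."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["name"])
    return dict(groups)


same_line = group_fields(fields, "line")
same_page = group_fields(fields, "page")
```

Grouping by `line` pairs each dollars field with its cents field, while grouping by `page` collects every field on a page.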
How it works
- relationships among fields (such as dollar and cent fields; fields on the same line; columns and rows of a table).
- math formulas, including which fields are computed vs user-entered (such as "Subtract line 37 from line 35. If line 37 is greater than line 35, enter -0-").
- references to other forms
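As an illustration of extracting a math formula from instruction text, the sketch below handles only the single "Subtract line X from line Y" pattern quoted above. The regular expression, function names, and returned dictionary are hypothetical and are not taken from the project's parser.

```python
import re

# Matches one instruction pattern only; real forms use many more phrasings.
SUBTRACT = re.compile(
    r"Subtract line (\w+) from line (\w+)\.\s*"
    r"If line (\w+) is greater\s*than line (\w+), enter -0-",
    re.IGNORECASE,
)


def parse_subtract(text):
    """Return a small formula description, or None if the text does not match."""
    m = SUBTRACT.search(text)
    if m is None:
        return None
    subtrahend, minuend, bigger, smaller = m.groups()
    return {
        "op": "-",
        "terms": [minuend, subtrahend],  # the computed field depends on these
        # The "enter -0-" clause clamps a negative result to zero.
        "floor_at_zero": (bigger, smaller) == (subtrahend, minuend),
    }


def evaluate(formula, values):
    """Compute the field from user-entered values keyed by line number."""
    a, b = (values[t] for t in formula["terms"])
    result = a - b
    return max(result, 0) if formula["floor_at_zero"] else result


formula = parse_subtract(
    "Subtract line 37 from line 35. "
    "If line 37 is greater than line 35, enter -0-"
)
```

The `terms` list doubles as the computed field's dependency list, which is the information the math status check relies on.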
- Move lower-level ToDo items to github/issues.
- Refactor toward a less script-ish architecture that will scale to more developers. [architecturePlease]
- Switch to a pdf-to-svg converter that preserves text (rather than converting text to paths), perhaps pdfjs, so that testers can easily copy and paste text from forms. [copyableText]
- Should extractFillableFields.py be a separate project called xfadump? This might provide a cleaner target output interface for an OCR effort. [xfadump]
- Replace allpdfnames.txt with a more detailed form dictionary via a preprocess step. [formDictionary]
- Offer an entire-form HTML interface (each page is currently presented separately). [formAsSingleHtmlPage]
- Incorporate instructions and publications, especially extracting the worksheets from instructions. [worksheets]
- Add the ability to process US state forms. [stateForms]
- Fix countless bugs, especially in forms that contain tables (see [issues]).
- Don't look in a separate file for a schedule that is embedded within a form. [refsToEmbeddedSchedules]
- Split the dirName command-line option into pdfInputDir and htmlOutputDir. [splitIoDirs]
Other tax- and PDF-related projects