Madgrades-extractor Design #101
Replies: 1 comment
I agree with this breakdown, and that the current Java implementation mixes all of these stages together, which makes it rigid and difficult to maintain.
Instead of keeping everything inside the legacy extractor, I want to do away with it entirely and split this into two modules:
I also think your idea to log and output all skipped rows is fine. What I am missing is why there are skipped rows in the first place; I do not remember why.
Now, I am not married to the Madgrades/reports repo or idea, but it seems convenient to build from there, unless we really want entirely different infrastructure. With that in mind, there are just a couple of design choices I've made so far in the new Madgrades/reports repo:
I'd encourage anyone to take a look at Madgrades/reports and try iterating from there.
This discussion is prompted by the desire to allow madgrades-extractor to output data from the DIR and Grades Reports in a "raw," tabular format: a delimited but non-relational file, similar to or even potentially exactly as the data are shown on the pages of the PDFs. It also comes from the desire to make the extractor, as Keenan put it well, "carry context like College and Subject down into each row, creating a flat, self-contained record."
The extractor currently only ever has the data in this format exactly as Tabula sees the PDF after the page-wide information such as College or Subject. The "human-interpretation" process happens after the data are converted into Java objects; "human-interpretation" refers to assigning the page-wide information to every record on a page of the PDF, etc. Therefore, outputting the data in this form would leave much to be desired, in my opinion.
I think it is most productive to frame this discussion like this: What steps should the extractor take? In what order should these steps be done? And when should the user be able to output results?
In my opinion, this is ideal:
Again, IMO, the user could output the data after Human-interpretation and/or after Organize, but I can see the argument for "after Fix bad data and/or after Organize". I'd imagine we don't have to choose one or the other either.
From this perspective, I would describe the current extractor as: Extract, Parse, Human-interpretation, Organize, with an output option at the end.
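To make the Human-interpretation and Organize stages a bit more concrete, here is a rough Python sketch of what "carry context like College and Subject down into each row" could look like, with an output point after either stage. Everything in it is an assumption for illustration: the row layout, the sample values, and the names `interpret`, `organize`, `run`, and `stop_after` are hypothetical and are not taken from the existing extractor or from Tabula's actual output.

```python
from collections import defaultdict

# Hypothetical parsed rows with made-up sample values. "college" and "subject"
# rows carry page-wide context; "data" rows carry per-course values. This
# layout is an assumption, not what Tabula or the extractor actually produces.
ROWS = [
    {"kind": "college", "value": "Letters & Science"},
    {"kind": "subject", "value": "MATH"},
    {"kind": "data", "fields": {"course": "221", "a_count": 40}},
    {"kind": "data", "fields": {"course": "222", "a_count": 31}},
    {"kind": "subject", "value": "STAT"},
    {"kind": "data", "fields": {"course": "301", "a_count": 25}},
]


def interpret(rows):
    """Human-interpretation: copy the current College/Subject into each data row."""
    context = {"college": None, "subject": None}
    flat = []
    for row in rows:
        if row["kind"] in context:
            # A context row updates the page-wide state for the rows that follow.
            context[row["kind"]] = row["value"]
        elif row["kind"] == "data":
            # A data row becomes a flat, self-contained record.
            flat.append({**context, **row["fields"]})
    return flat


def organize(records):
    """Organize: group the flat records by (college, subject)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["college"], record["subject"])].append(record)
    return dict(grouped)


def run(rows, stop_after="organize"):
    """Run the stages in order and return the result at the requested output point."""
    records = interpret(rows)
    if stop_after == "interpret":
        return records  # the flat "raw" records
    return organize(records)
```

Calling `run(ROWS, stop_after="interpret")` would give the flat records, while `run(ROWS)` would give the grouped form, so supporting both output points is mostly a matter of keeping the stages separate.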
In fact, maybe the "madgrades-extractor" shouldn't do anything more than actually just "extracting" (and maybe human interpretation); should there be a separate tool doing subsequent processing for Madgrades' use? My initial thought is that it suffices to have the existing madgrades-extractor have a default behavior of outputting as it historically has (after organization), with a flag to alternatively output raw results, but maybe it is worth discussing too.
One last thing to note: I find being able to output all skipped rows to be very valuable for verifying extractor accuracy too.
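For the skipped-rows idea, a minimal sketch of how skipped rows might be collected with a reason and written out for review is below. The skip rule used here (a row with the wrong number of cells) is only a stand-in, since I don't know the real reasons the extractor skips rows; `parse_rows`, `write_skipped`, and the column names are hypothetical.

```python
import csv
import io


def parse_rows(raw_rows, expected_columns):
    """Split rows into parsed and skipped, recording why each row was skipped.

    raw_rows is an iterable of (page, line_no, cells) tuples; this shape is
    an assumption for the sketch.
    """
    parsed, skipped = [], []
    for page, line_no, cells in raw_rows:
        if len(cells) != expected_columns:
            skipped.append({
                "page": page,
                "line": line_no,
                "reason": f"expected {expected_columns} cells, got {len(cells)}",
                "raw": " | ".join(cells),
            })
            continue
        parsed.append(cells)
    return parsed, skipped


def write_skipped(skipped, out):
    """Write the skipped rows as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["page", "line", "reason", "raw"])
    writer.writeheader()
    writer.writerows(skipped)


# Made-up sample input: the second row is missing a cell and gets skipped.
raw = [
    (1, 1, ["221", "CALCULUS", "40"]),
    (1, 2, ["222", "31"]),
]
parsed, skipped = parse_rows(raw, expected_columns=3)
report = io.StringIO()
write_skipped(skipped, report)
```

Keeping the page and line number next to the raw text would let someone check each skipped row against the PDF by hand.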