Madgrades-extractor Design #101

CadenKruckeberg · 2026-06-23T16:21:29Z

This discussion is prompted by the desire to allow madgrades-extractor to output data from the DIR and Grades Reports in a "raw," tabular format: a delimited but non-relational file, similar to, or potentially exactly matching, how the data are shown on the pages of the PDFs. It also comes from the desire to make the extractor, as Keenan put it well, "carry context like College and Subject down into each row, creating a flat, self-contained record."

The extractor currently only ever has the data in this format exactly as Tabula sees the PDF, that is, the table that comes after the page-wide information such as College or Subject. The "human-interpretation" process, meaning assigning the page-wide information to every record on a page of the PDF and so on, happens only after the data are converted into Java objects. Therefore, outputting the data in its current tabular form would leave much to be desired, in my opinion.

I think it is most productive to frame this discussion like this: What steps should the extractor take? What order should these steps be done in? And when should the user be able to output results?

In my opinion, this is ideal:

Extract: Convert information on the PDF to Java objects (using Tabula)
Human-interpretation: Assign PDF-wide and page-wide information to each row/record, recognize that Course Names are only shown on the last section in the Grades PDF, recognize that text with differing vertical positions can still be contents from the same cell, etc.
Fix bad data: e.g. 1032-grades.pdf's Section column has seemingly nonsensical values following the course and section number
Parse: Convert text to useful objects or otherwise transform the extracted data
Organize: In the case of Madgrades, convert the data to the existing relational format
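
To make the Human-interpretation step concrete, here is a minimal Python sketch. It is not the extractor's actual code (which is Java). The helper names `flatten_page` and `backfill_course_names`, the dictionary keys, and the sample values are all made up for illustration. It assumes only that each page yields some page-wide context plus a list of rows, and that rows arrive in PDF order.

```python
def flatten_page(page_context, rows):
    """Copy page-wide fields (e.g. College, Subject) into every row of the page."""
    return [{**page_context, **row} for row in rows]


def backfill_course_names(rows):
    """Fill blank course names from the last section of the same course.

    Assumes (hypothetically) that only the last section of each course
    carries a non-empty "course_name".
    """
    filled = [dict(row) for row in rows]
    name = None
    current_course = None
    # Walk backwards so the name on the last section reaches earlier sections.
    for row in reversed(filled):
        if row["course"] != current_course:
            current_course = row["course"]
            name = None
        if row.get("course_name"):
            name = row["course_name"]
        elif name is not None:
            row["course_name"] = name
    return filled


# Placeholder values, not real report data.
page_context = {"college": "Example College", "subject": "EXAMPLE SUBJECT"}
rows = [
    {"course": "101", "section": "001", "course_name": ""},
    {"course": "101", "section": "002", "course_name": "Course A"},
    {"course": "102", "section": "001", "course_name": "Course B"},
]
records = backfill_course_names(flatten_page(page_context, rows))
```

After this step every record carries its own College, Subject, and Course Name, so it can be written out without reference to the page it came from.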

Again, IMO, the user could output the data after Human-interpretation and/or after Organize, but I can see the argument for "after Fix bad data and/or after Organize". I'd imagine we don't have to choose one or the other either.
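
One way to avoid choosing a single output point is to treat the steps as an ordered list of stages and snapshot the data after whichever stages the user asks for. The Python sketch below is only an illustration of that idea. `run_pipeline`, `make_stage`, and the stage names are hypothetical, and the stages are placeholders that just record that they ran.

```python
def run_pipeline(data, stages, output_after):
    """Run (name, function) stages in order; keep a snapshot after each requested stage."""
    outputs = {}
    for name, stage in stages:
        data = stage(data)
        if name in output_after:
            outputs[name] = data
    return outputs


def make_stage(name):
    # Placeholder stage: appends its own name to a trace list.
    # A real stage would transform the extracted data instead.
    return lambda trace: trace + [name]


STAGE_NAMES = ["extract", "human_interpretation", "fix_bad_data", "parse", "organize"]
STAGES = [(name, make_stage(name)) for name in STAGE_NAMES]

# Ask for output after Human-interpretation and after Organize.
outputs = run_pipeline([], STAGES, output_after={"human_interpretation", "organize"})
```

Switching to "after Fix bad data and/or after Organize" would then be a change to `output_after` rather than to the pipeline itself.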

From this perspective, I would describe the current extractor as: Extract, Parse, Human-interpretation, Organize, with an output option at the end.

In fact, maybe the "madgrades-extractor" shouldn't do anything more than actually just "extracting" (and maybe human interpretation). Should there be a separate tool doing subsequent processing for Madgrades' use? My initial thought is that it suffices for the existing madgrades-extractor to default to outputting as it historically has (after organization), with a flag to alternatively output raw results, but maybe it is worth discussing too.

One last thing to note: I find being able to output all skipped rows to be very valuable for verifying extractor accuracy too.
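
A minimal Python sketch of what collecting skipped rows could look like, whatever the reason a row ends up skipped. `split_rows` and `parse_grade_count` are hypothetical helpers, and the sample rows are invented. The point is only that each rejected row is kept with its position and a reason instead of being dropped silently.

```python
def split_rows(rows, try_parse):
    """Return (parsed, skipped); skipped entries keep the raw row and the failure reason."""
    parsed, skipped = [], []
    for line_number, row in enumerate(rows, start=1):
        try:
            parsed.append(try_parse(row))
        except ValueError as error:
            skipped.append({"line": line_number, "row": row, "reason": str(error)})
    return parsed, skipped


def parse_grade_count(row):
    # Hypothetical parser: expects exactly [section, count] with an integer count.
    section, count = row
    return {"section": section, "count": int(count)}


# Invented sample rows: one valid, one with a non-numeric count, one too short.
sample_rows = [["001", "25"], ["002", "n/a"], ["003"]]
parsed, skipped = split_rows(sample_rows, parse_grade_count)
```

The `skipped` list can then be written to its own file and compared against the PDF by hand.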

thekeenant (Maintainer) · 2026-07-01T01:24:38Z

I agree with this breakdown, and that the current Java implementation mixes all of these stages together, which makes it rigid and difficult to maintain. Instead of keeping everything inside the legacy extractor, I want to do away with it entirely and split this into two modules:

Madgrades/reports (The Extraction/Interpretation Layer): A tool dedicated strictly to your first few steps: extracting the raw PDF text, handling the human-interpretation layer (carrying the College, Subject, and Course context down into every row), fixing the bad data anomalies, and outputting clean, flat, deterministically sorted files.
reports -> Madgrades app tool: Later on, build a completely separate tool that takes those standardized report files and handles the Parse and Organize steps to convert them into the specific relational DB structure needed for the Madgrades app.
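
For the "flat, deterministically sorted files" part, a minimal Python sketch using only the standard library is below. `write_flat_csv`, the column names, and the sample records are assumptions for illustration, not anything taken from Madgrades/reports. The idea is that sorting on an explicit key before writing makes the output independent of the order in which rows were extracted.

```python
import csv
import io


def write_flat_csv(records, fieldnames, sort_keys):
    """Return CSV text with rows sorted by sort_keys, so repeated runs give identical output."""
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in sort_keys))
    buffer = io.StringIO()
    # A fixed line terminator keeps the bytes identical across platforms.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# Invented records in two different extraction orders.
columns = ["subject", "course", "section"]
run_a = [
    {"subject": "B", "course": "101", "section": "001"},
    {"subject": "A", "course": "102", "section": "002"},
    {"subject": "A", "course": "102", "section": "001"},
]
run_b = list(reversed(run_a))

text_a = write_flat_csv(run_a, columns, sort_keys=columns)
text_b = write_flat_csv(run_b, columns, sort_keys=columns)
```

Both runs produce the same text, which makes diffs between extractor versions meaningful.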

I also think your idea to log and output all skipped rows is fine. I guess what I am missing is: why are there skipped rows anyway? I do not remember why.

Now, I am not married to the Madgrades/reports repo or idea, but it seems convenient to build from there, unless we really want entirely different infrastructure. With that in mind, there are just a couple of design choices I've made so far in the new Madgrades/reports repo:

Linting and testing: The new repo has linting and encourages thorough testing of core components to keep the codebase maintainable from the start.
Camelot for table extraction: I've been using Camelot here because it claims to offer better extraction accuracy and ease of use. I'm open to other options, however.

I'd encourage anyone to take a look at Madgrades/reports and try iterating from there.
