Support loading XML datasets #5001
base: main
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
I should have some time to look at this on Friday :)
@albertvillanova I've tried this with a few different XML datasets. One issue I've run into shows up when attributes are missing from some elements. When parsing a file, this instance has no 'STYLE' attribute:

<TextLine HEIGHT="39" WIDTH="295" VPOS="926" HPOS="247">
  <String WC="0.4600000083" CONTENT="jufqu’en" HEIGHT="39" WIDTH="117" VPOS="926" HPOS="247"/>
  <SP WIDTH="14" VPOS="928" HPOS="365"/>
  <String WC="0.6075000167" CONTENT="l’an" HEIGHT="26" WIDTH="50" VPOS="928" HPOS="380"/>
  <SP WIDTH="24" VPOS="936" HPOS="431"/>
  <String WC="0.4300000072" CONTENT="1" HEIGHT="16" WIDTH="9" VPOS="936" HPOS="456"/>
  <String STYLE="italics" WC="0.5774999857" CONTENT="361." HEIGHT="25" WIDTH="68" VPOS="933" HPOS="474"/>
</TextLine>

Whereas this one, which appears later in the file, does have this field:

<TextLine HEIGHT="39" WIDTH="712" VPOS="966" HPOS="297">
  <String STYLE="italics" WC="0.6999999881" CONTENT="I" HEIGHT="17" WIDTH="9" VPOS="977" HPOS="297"/>
  <String WC="0.5" CONTENT="I." HEIGHT="18" WIDTH="25" VPOS="976" HPOS="318"/>
  <SP WIDTH="24" VPOS="971" HPOS="344"/>
  <String STYLE="italics" WC="0.3359999955" CONTENT="Crade" HEIGHT="26" WIDTH="91" VPOS="967" HPOS="369"/>
  <SP WIDTH="31" VPOS="971" HPOS="461"/>
  <String STYLE="italics" WC="0.6060000062" CONTENT="Pétri" HEIGHT="26" WIDTH="71" VPOS="968" HPOS="493"/>
  <SP WIDTH="23" VPOS="968" HPOS="565"/>
  <String STYLE="italics" WC="0.612857163" CONTENT="Candidi" HEIGHT="27" WIDTH="111" VPOS="967" HPOS="589"/>
  <SP WIDTH="19" VPOS="967" HPOS="701"/>
  <String STYLE="italics" WC="0.4088888764" CONTENT="Decembrii" HEIGHT="28" WIDTH="144" VPOS="966" HPOS="721"/>
  <SP WIDTH="10" VPOS="968" HPOS="866"/>
  <String STYLE="italics" WC="0.4600000083" CONTENT="in" HEIGHT="25" WIDTH="27" VPOS="968" HPOS="877"/>
  <SP WIDTH="9" VPOS="967" HPOS="905"/>
  <String STYLE="italics" WC="0.5099999905" CONTENT="funere" HEIGHT="38" WIDTH="94" VPOS="967" HPOS="915"/>
</TextLine>

Since the first-seen fields define what gets passed along, and since it's important to support streaming, I'm not sure there is a nice way to automatically detect the attributes for the whole file. I can see two potential ways of doing it.
I think the other way of doing this would be to allow users to define the expected/wanted attributes as another loading argument. This could then be used to extract the described attributes (and set them to None if not found). This requires a bit more work from the user but could be helpful. For example, in the XML above, most users will likely only want the
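As a rough illustration of the user-defined attributes idea, here is a minimal sketch using only the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_records` and its `tag` and `attributes` parameters are hypothetical; they are not part of this PR or of the `datasets` API. The sample is a shortened, ASCII-only excerpt of the first TextLine above.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag, attributes):
    """Stream elements named `tag`, yielding one dict per element.

    Hypothetical helper: every dict has exactly the keys listed in
    `attributes`, and an attribute missing from an element becomes None.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # dict.get returns None for attributes the element lacks,
            # so every record has the same keys regardless of file order.
            yield {name: elem.attrib.get(name) for name in attributes}
        # Drop the element's attributes and children once it is fully parsed.
        # The emptied element stays attached to its parent.
        elem.clear()


# Shortened excerpt of the first TextLine: one String without STYLE, one with.
SAMPLE = b"""<TextLine HEIGHT="39" WIDTH="295" VPOS="926" HPOS="247">
  <String WC="0.4300000072" CONTENT="1" HEIGHT="16" WIDTH="9" VPOS="936" HPOS="456"/>
  <String STYLE="italics" WC="0.5774999857" CONTENT="361." HEIGHT="25" WIDTH="68" VPOS="933" HPOS="474"/>
</TextLine>"""

records = list(iter_records(io.BytesIO(SAMPLE), "String", ["CONTENT", "STYLE"]))
for record in records:
    print(record)
```

Because the keys come from the user-supplied list rather than from the first element seen, the record shape is known before any of the file is read, which keeps this compatible with streaming.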
CC: @davanstrien