Support loading XML datasets #5001

Draft: wants to merge 3 commits into base: main

albertvillanova
Member

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@davanstrien
Member

CC: @davanstrien

I should have some time to look at this on Friday :)

@davanstrien
Member

@albertvillanova I've tried this with a few different XML datasets. One issue I've run into is a KeyError when the attributes of an element differ from those in the first parsed row. Unfortunately, this comes up in the ALTO XML format, for example, if you want to parse the 'String' element, which contains the text in ALTO XML files.

When parsing a file, the first 'String' element in this line has no 'STYLE' attribute:

<TextLine HEIGHT="39" WIDTH="295" VPOS="926" HPOS="247"><String WC="0.4600000083" CONTENT="jufqu’en" HEIGHT="39" WIDTH="117" VPOS="926" HPOS="247"/><SP WIDTH="14" VPOS="928" HPOS="365"/><String WC="0.6075000167" CONTENT="l’an" HEIGHT="26" WIDTH="50" VPOS="928" HPOS="380"/><SP WIDTH="24" VPOS="936" HPOS="431"/><String WC="0.4300000072" CONTENT="1" HEIGHT="16" WIDTH="9" VPOS="936" HPOS="456"/><String STYLE="italics" WC="0.5774999857" CONTENT="361." HEIGHT="25" WIDTH="68" VPOS="933" HPOS="474"/></TextLine>

Whereas this one, which appears later in the file, does have this attribute on its first 'String' element:

<TextLine HEIGHT="39" WIDTH="712" VPOS="966" HPOS="297"><String STYLE="italics" WC="0.6999999881" CONTENT="I" HEIGHT="17" WIDTH="9" VPOS="977" HPOS="297"/><String WC="0.5" CONTENT="I." HEIGHT="18" WIDTH="25" VPOS="976" HPOS="318"/><SP WIDTH="24" VPOS="971" HPOS="344"/><String STYLE="italics" WC="0.3359999955" CONTENT="Crade" HEIGHT="26" WIDTH="91" VPOS="967" HPOS="369"/><SP WIDTH="31" VPOS="971" HPOS="461"/><String STYLE="italics" WC="0.6060000062" CONTENT="Pétri" HEIGHT="26" WIDTH="71" VPOS="968" HPOS="493"/><SP WIDTH="23" VPOS="968" HPOS="565"/><String STYLE="italics" WC="0.612857163" CONTENT="Candidi" HEIGHT="27" WIDTH="111" VPOS="967" HPOS="589"/><SP WIDTH="19" VPOS="967" HPOS="701"/><String STYLE="italics" WC="0.4088888764" CONTENT="Decembrii" HEIGHT="28" WIDTH="144" VPOS="966" HPOS="721"/><SP WIDTH="10" VPOS="968" HPOS="866"/><String STYLE="italics" WC="0.4600000083" CONTENT="in" HEIGHT="25" WIDTH="27" VPOS="968" HPOS="877"/><SP WIDTH="9" VPOS="967" HPOS="905"/><String STYLE="italics" WC="0.5099999905" CONTENT="funere" HEIGHT="38" WIDTH="94" VPOS="967" HPOS="915"/></TextLine>

Since the first-seen attributes define what is passed to arrow_writer, this causes a KeyError when an element with the extra attribute is encountered, because the writer doesn't expect this column.

Since it's important to support streaming, I'm not sure there is a nice way to automatically detect the attributes for the whole file. I can see two potential ways of doing it:

  • Do an initial pass on a batch of data to have a higher chance of encountering variations in attributes before doing the arrow write.
  • Do a full pass on one file (and assume that the attributes won't change across files).
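
The first option could be sketched roughly as follows. This is only an illustration built on the standard library's xml.etree.ElementTree, not the code in this PR: infer_attribute_names, tag, and sample_size are hypothetical names, and SAMPLE is a shortened stand-in for the ALTO snippets above.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Shortened stand-in for the ALTO lines quoted above (assumption, not real data).
SAMPLE = (
    b'<TextLine>'
    b'<String WC="0.46" CONTENT="l\'an"/>'
    b'<SP WIDTH="14"/>'
    b'<String STYLE="italics" WC="0.57" CONTENT="361."/>'
    b'</TextLine>'
)


def infer_attribute_names(source, tag, sample_size=1000):
    """Return the union of attribute names seen on the first
    `sample_size` elements named `tag`, in first-seen order."""
    names = {}  # dict used as an ordered set
    matching = (
        elem
        for _, elem in ET.iterparse(source, events=("end",))
        if elem.tag == tag
    )
    for elem in itertools.islice(matching, sample_size):
        for name in elem.attrib:
            names.setdefault(name, None)
        elem.clear()  # release the element's contents once inspected
    return list(names)


# A larger sample sees the element carrying STYLE; a sample of one does not.
columns_full = infer_attribute_names(io.BytesIO(SAMPLE), "String")
columns_first = infer_attribute_names(io.BytesIO(SAMPLE), "String", sample_size=1)
```

A sample only raises the chance of seeing every variation; an attribute that first appears after the sampled elements would still be missed, so this would not remove the need for a fallback.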

I think the other way of doing this would be to allow users to define the expected/wanted attributes as another loading argument. This could then be used to extract the specified attributes (and set them to None if not found). This requires a bit more work from the user but could be helpful. For example, in the XML above, most users will likely only want the WC and CONTENT attributes, so they could specify this upfront and avoid loading extra data they don't need or want. I suspect this option would make more sense than making this operation automatic for the case where attributes might change. WDYT?
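
A minimal sketch of the user-specified variant, assuming a plain generator over the file rather than the actual builder in this PR. iter_rows and its attributes parameter are hypothetical names, not an existing loading argument, and SAMPLE is again a shortened stand-in for the XML above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the ALTO lines quoted above (assumption, not real data).
SAMPLE = (
    b'<TextLine>'
    b'<String WC="0.46" CONTENT="l\'an"/>'
    b'<SP WIDTH="14"/>'
    b'<String STYLE="italics" WC="0.57" CONTENT="361."/>'
    b'</TextLine>'
)


def iter_rows(source, tag, attributes):
    """Yield one dict per element named `tag`, restricted to the requested
    `attributes`; attributes missing on an element become None."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Every row has the same keys, so the schema is fixed upfront.
            yield {name: elem.attrib.get(name) for name in attributes}
            elem.clear()  # keep memory flat while streaming


rows = list(iter_rows(io.BytesIO(SAMPLE), "String", ["WC", "CONTENT", "STYLE"]))
wc_content_only = list(iter_rows(io.BytesIO(SAMPLE), "String", ["WC", "CONTENT"]))
```

Because the column set comes from the caller, every row has the same keys regardless of which element is parsed first, and this works in a single streaming pass.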
