Docx reader: support extracting embedded non-image files with --extract-media #7115

@orthoxerox

I often receive docx documentation with embedded files. It would be useful if Pandoc could extract them and replace them with hyperlinks, similar to what it does with images.

The files are stored in word\embeddings and their representations in document.xml look like this:

<!-- Object embedded as icon -->
<w:object w:dxaOrig="1543" w:dyaOrig="998">
  <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
    <v:stroke joinstyle="miter"/>
    <v:formulas>
      <v:f eqn="if lineDrawn pixelLineWidth 0"/>
      <v:f eqn="sum @0 1 0"/>
      <v:f eqn="sum 0 0 @1"/>
      <v:f eqn="prod @2 1 2"/>
      <v:f eqn="prod @3 21600 pixelWidth"/>
      <v:f eqn="prod @3 21600 pixelHeight"/>
      <v:f eqn="sum @0 0 1"/>
      <v:f eqn="prod @6 1 2"/>
      <v:f eqn="prod @7 21600 pixelWidth"/>
      <v:f eqn="sum @8 21600 0"/>
      <v:f eqn="prod @7 21600 pixelHeight"/>
      <v:f eqn="sum @10 21600 0"/>
    </v:formulas>
    <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
    <o:lock v:ext="edit" aspectratio="t"/>
  </v:shapetype>
  <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:77.25pt;height:49.9pt" o:ole="">
    <v:imagedata r:id="rId6" o:title=""/>
  </v:shape>
  <o:OLEObject Type="Embed" ProgID="Word.Document.12" ShapeID="_x0000_i1025" DrawAspect="Icon" ObjectID="_1675674465" r:id="rId7">
    <o:FieldCodes>\s</o:FieldCodes>
  </o:OLEObject>
</w:object>

<!-- Inline embedding -->
<w:object w:dxaOrig="9355" w:dyaOrig="450">
 <v:shape id="_x0000_i1028" type="#_x0000_t75" style="width:467.65pt;height:22.5pt" o:ole="">
   <v:imagedata r:id="rId8" o:title=""/>
 </v:shape>
 <o:OLEObject Type="Embed" ProgID="Word.Document.12" ShapeID="_x0000_i1028" DrawAspect="Content" ObjectID="_1675674466" r:id="rId9">
   <o:FieldCodes>\s</o:FieldCodes>
 </o:OLEObject>
</w:object>

The r:id attribute of the o:OLEObject element (not the one on v:imagedata, which points at the preview image) can be used to locate the correct embedded file via word\_rels\document.xml.rels:

<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="embeddings/Microsoft_Word_Document.docx"/>
<Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="embeddings/Microsoft_Word_Document1.docx"/>
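
To make the lookup concrete, here is a rough Python sketch of the resolution step. It is not Pandoc code (Pandoc's reader is written in Haskell), and the function names `embedded_targets` and `extract_embedded` are made up for illustration. It assumes the relationship targets are internal and, when relative, relative to the `word/` directory, as in the examples above.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS_O = "urn:schemas-microsoft-com:office:office"
NS_R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_targets(document_xml, rels_xml):
    """Return (ProgID, DrawAspect, path inside the package) per o:OLEObject.

    Hypothetical helper: takes the contents of word/document.xml and
    word/_rels/document.xml.rels as str or bytes.
    """
    # Map each relationship Id to its Target.
    rels = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{NS_REL}}}Relationship")
    }
    found = []
    for ole in ET.fromstring(document_xml).iter(f"{{{NS_O}}}OLEObject"):
        target = rels.get(ole.get(f"{{{NS_R}}}id"))
        if target is None:
            continue  # r:id with no matching relationship; skip it
        if target.startswith("/"):
            # Absolute target: already relative to the package root.
            path = target.lstrip("/")
        else:
            # Assumption: relative targets resolve against word/.
            path = posixpath.normpath(posixpath.join("word", target))
        found.append((ole.get("ProgID"), ole.get("DrawAspect"), path))
    return found


def extract_embedded(docx_path, out_dir):
    """Extract every embedded file into out_dir and return the new paths."""
    with zipfile.ZipFile(docx_path) as z:
        doc = z.read("word/document.xml")
        rels = z.read("word/_rels/document.xml.rels")
        return [z.extract(path, out_dir)
                for _, _, path in embedded_targets(doc, rels)]
```

The `DrawAspect` value (`Icon` or `Content`) is returned alongside the path because a reader might want to treat the two cases differently, for example a plain hyperlink for icons versus something richer for inline embeddings. That choice is left open here.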
