Extract_Text extract character of embedded PDF #981

flashpixx · 2023-09-04T10:51:55Z

Hello,

I have a LaTeX-generated PDF that uses the \includepdf command to embed other PDFs. I'm using pdfplumber to extract the text (this is a test case for arbitrary PDFs later). In general, extract_text works fine, but on the embedded pages I get "character garbage" back, e.g. (partial extract):

 oo(cid:212)iiioiIIIII
               2nov2
       .glofrE   
   dnu           
     eleiv       
    .trh(cid:159)feghcrud suarK
 erawtfos III -snoitargetnI reb(cid:159)negeg
     tiebrA      
      red      (cid:209)
   ssezorpgnitseT dnu leiv
  iiil sesaC-tseT ruz nrreH 99-23 ed.erawtfos-vdp.~
      tieZ       
     red         
     tztesegmu nihretiew
     renies     hcsiweiN
   neigolonhceT  
 1 II           75)12350(
     gnuressebreV ezruk
      ehcsn(cid:159)w
    gidn(cid:138)tsnegie netlahreV

How can I avoid getting this text back from extract_text for a whole page that contains another PDF? Could this be caused by the font information?

Thanks

jsvine · 2023-09-11T16:46:21Z

Maintainer

Hi @flashpixx, and very interesting. I'm not familiar with LaTeX's \includepdf feature. Would you be able to attach an example PDF that reproduces this problem?
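
Until a sample file is available, one possible stopgap (a sketch, not a confirmed fix) is to flag pages whose extracted text contains `(cid:N)` placeholders. Those placeholders are what pdfminer.six, which pdfplumber builds on, emits when it cannot map a character ID to Unicode, so they do point at the font information of the embedded page. The helper names `cid_ratio`, `looks_garbled`, and `extract_clean_pages` and the 2% threshold are assumptions made for illustration; they are not part of pdfplumber.

```python
import re

# Matches the placeholder emitted for glyphs without a Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Share of non-whitespace characters that belong to (cid:N) placeholders."""
    stripped = re.sub(r"\s+", "", text)
    if not stripped:
        return 0.0
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(stripped))
    return cid_chars / len(stripped)


def looks_garbled(text, threshold=0.02):
    """Heuristic only: the threshold is an arbitrary value chosen for this sketch."""
    return cid_ratio(text) > threshold


def extract_clean_pages(path, threshold=0.02):
    """Return (page_number, text) pairs, skipping pages that look garbled."""
    import pdfplumber  # imported here so the helpers above work without it

    results = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if not looks_garbled(text, threshold):
                results.append((page.page_number, text))
    return results
```

This only hides the affected pages; it does not recover their text. It also would not catch the second symptom in the sample above, where words come out reversed ("erawtfos" for "software") without any `(cid:N)` marker.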
