Problem in extracting text from PDF (font firasans) #962

danltw · 2023-08-10T04:01:11Z

I'm not sure whether the issue is that the PDF's font is Fira Sans (embedded). I've inspected the fonts, as shown below:

The file opens as a PDF object with pdf = pdfplumber.open(), but no text is extracted. I can't get any text out of pdfminer's pdf2txt.py either.

pdf in question:
test.pdf

pdfminer says it supports the font types this PDF uses, per the inspection above. I don't have the source file.

Any advice on this?

cmdlineluser · 2023-08-10T10:24:16Z

It appears to be an issue with pdfminer (pdfplumber uses pdfminer internally).

pypdfium2 can read it:

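# pypdfium2 is importable wherever pdfplumber.display is (pdfplumber.display imports it)
import pypdfium2 as pdfium

# Read the first page's text with pypdfium2 instead of pdfminer
text = (
    pdfium.PdfDocument("Downloads/test.pdf")
    .get_page(0)
    .get_textpage()
    .get_text_range()
)

print(text)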

# 64
# HMM Leadership Topic Summaries
# 1. CHANGE MANAGEMENT
# • Foster skills for adapting to continual change
# • Identify and carry out opportunities for improvement
# • Implement formal change programs
# • Address factors that can derail change
# 2. COACHING
# • Identify and act on coaching opportunities
# • Listen and question effectively during coaching
# • Give constructive feedback during coaching
# • Coach employees to become agile learners
# • Develop awareness and skills to coach all employees
# 3. DEVELOPING EMPLOYEES
# • Tailor development strategies to individual employees
# • Help employees create and implement development plans
# • Identify and design experiences that foster individual development
# • Build your team members’ global skills
# 4. DIFFICULT INTERACTIONS
# • Determine which conflicts to resolve
# • Address the negative emotions conflict raises
# • Clarify the facts of an interpersonal conflict
# • Solve the problem underlying a difficult interaction
# • Manage conflict between direct reports
# 5. DIGITAL INTELLIGENCE
# • Adopt a digital mindset—and foster one in others
# • Cultivate a team culture that thrives in today’s digital world
# • Use data responsibly and effectively
# • Prioritize and act on digital opportunities
# 6. FEEDBACK ESSENTIALS
# • Give effective feedback
# • Tailor feedback to the individual
# • Create an environment that encourages improvement through feedback
# • Seek feedback to improve your performance

4 replies

danltw Aug 11, 2023
Author

hmm, interesting... any idea on the underlying cause?

cmdlineluser Aug 11, 2023

I'm not too qualified when it comes to debugging PDF internals; hopefully someone with more knowledge can give some advice.

It should probably be filed as a bug at https://github.com/pdfminer/pdfminer.six

>>> from pdfminer.high_level import extract_text
>>> extract_text("Downloads/test.pdf")
'\x0c\x0c\x0c'
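
For what it's worth, the three `\x0c` characters are form feeds. As far as I can tell, pdfminer's text output ends each page with one, so this result reads as three pages with no recognised characters rather than a failure to open the file. A minimal sketch for spotting that case (both helper names are made up for illustration):

```python
def looks_empty(text: str) -> bool:
    """True when the extracted text holds nothing but whitespace or form feeds."""
    # str.strip() treats "\x0c" (form feed) as whitespace
    return not text.strip()


def form_feed_count(text: str) -> int:
    """Number of form feeds; assumed here to be one per page in pdfminer output."""
    return text.count("\x0c")


result = "\x0c\x0c\x0c"  # the value returned by extract_text above
print(looks_empty(result), form_feed_count(result))
```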

Out of interest, I tried to see if you could pass the information from pypdfium2 into page.objects["char"] manually:

import ctypes
import pdfplumber
import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c

def get_chars(page):
    _page = pdfium.PdfDocument(page.pdf.stream).get_page(page.page_number - 1)
    textpage = _page.get_textpage()
    
    n_bytes = 1024  # how do we know the right size?
    buf = ctypes.create_string_buffer(n_bytes)
    
    x0, x1, y0, y1 = (
        ctypes.c_double(),
        ctypes.c_double(),
        ctypes.c_double(),
        ctypes.c_double(),
    )
    
    fs_matrix = pdfium_c.FS_MATRIX()
    
    initial_doctop = 0
    height = _page.get_height()
    
    for char in range(textpage.count_chars()):
        ok = pdfium_c.FPDFText_GetCharBox(textpage.raw, char, x0, x1, y0, y1)
        top = height - y1.value
        bottom = height - y0.value
        doctop = initial_doctop + top
        
        ok = pdfium_c.FPDFText_GetFontInfo(textpage.raw, char, buf, n_bytes, None)
        fontname = buf.value.decode()
        
        ok = pdfium_c.FPDFText_GetMatrix(textpage.raw, char, fs_matrix)
        matrix = pdfium.PdfMatrix.from_raw(fs_matrix)
        matrix = matrix.a, matrix.b, matrix.c, matrix.d, matrix.e, matrix.f
        
        size = pdfium_c.FPDFText_GetFontSize(textpage.raw, char)
        weight = pdfium_c.FPDFText_GetFontWeight(textpage.raw, char)
        character = pdfium_c.FPDFText_GetUnicode(textpage.raw, char)
        
        yield dict(
           matrix = matrix, 
           fontname = fontname, 
           upright = 1,
           x0 = x0.value, 
           y0 = y0.value, 
           x1 = x1.value, 
           y1 = y1.value, 
           width = x1.value - x0.value,
           height = y1.value - y0.value,
           size = size, 
           object_type = "char", 
           text = chr(character), 
           top = top, 
           bottom = bottom, 
           doctop = doctop
        )

with pdfplumber.open("Downloads/test.pdf") as pdf:
    for page in pdf.pages:
        page.objects["char"] = list(get_chars(page))
        print(page.extract_text(layout=True, use_text_flow=True))

It seems to work okay apart from the header. I'm probably missing some steps in how pdfplumber processes the char objects.

                                                                              64  
                                                                                  
                                                                                  
        H  M  M    L ead         h  i   T      i   S             ri               
                           e  rs     p    o  p  c     u m   m  a    es            
                                                                                  
                                                                                  
                                                                                  
                                                                                  
                                                                                  
                                                                                  
        1  CHANGE MANAGEMENT                                                      
         .                                                                        
           • Foster skills for adapting to continual change                       
           • Identify and carry out opportunities for improvement                 
           • Implement formal change programs                                     
           • Address factors that can derail change                               
                                                                                  
        2  COACHING                                                               
         .                                                                        
           • Identify and act on coaching opportunities                           
             Listen and question effectively during coaching                      
           •                                                                      
             Give constructive feedback during coaching                           
           •                                                                      
           • Coach employees to become agile learners                             
           • Develop awareness and skills to coach all employees                  
        3  DEVELOPING EMPLOYEES                                                   
         .                                                                        
           • Tailor development strategies to individual employees                
           • Help employees create and implement development plans                
             Identify and design experiences that foster individual development   
           •                                                                      
           • Build your team members’ global skills                               
        4  DIFFICULT INTERACTIONS                                                 
         .                                                                        
           • Determine which conflicts to resolve                                 
             Address the negative emotions conflict raises                        
           •                                                                      
           • Clarify the facts of an interpersonal conflict                       
           • Solve the problem underlying a difficult interaction                 
           • Manage conflict between direct reports                               
        5  DIGITAL INTELLIGENCE                                                   
         .                                                                        
           • Adopt a digital mindset —and foster one in others                    
           • Cultivate a team culture that thrives in today’s digital world       
           • Use data responsibly and effectively                                 
           • Prioritize and act on digital opportunities                          
                                                                                  
        6  FEEDBACK ESSENTIALS                                                    
         .                                                                        
             Give effective feedback                                              
           •                                                                      
           • Tailor feedback to the individual                                    
           • Create an environment that encourages improvement through feedback   
           • Seek feedback to improve your performance
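
On the `n_bytes = 1024  # how do we know the right size?` question in `get_chars`: a common pattern with C APIs is to call once, read back the required length, and retry with a buffer of that size. The sketch below assumes `FPDFText_GetFontInfo` returns the font name's length in bytes including the trailing NUL (and 0 on failure), which is my reading of the PDFium header and worth double-checking. `read_c_string` and its `fill` callback are hypothetical names, not part of pypdfium2.

```python
import ctypes


def read_c_string(fill, initial_size=64):
    """Read a NUL-terminated string via fill(buf, size) -> required size.

    `fill` is a hypothetical callback wrapping a C call that writes into
    `buf` when it is large enough and returns the size it needs (0 on failure).
    """
    buf = ctypes.create_string_buffer(initial_size)
    needed = fill(buf, initial_size)
    if needed > initial_size:
        # First buffer was too small: retry with exactly the reported size
        buf = ctypes.create_string_buffer(needed)
        needed = fill(buf, needed)
    if not needed:
        return ""
    return buf.value.decode("utf-8", errors="replace")


# Stand-in for the C function so the sketch runs without a PDF
def fake_font_info(buf, size):
    name = b"FiraSans-Regular"
    if size >= len(name) + 1:  # room for the trailing NUL
        buf.value = name
    return len(name) + 1


print(read_c_string(fake_font_info, initial_size=4))
```

Inside `get_chars`, the callback would presumably look like `lambda buf, n: pdfium_c.FPDFText_GetFontInfo(textpage.raw, char, buf, n, None)`, under the same assumption about the return value.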

jsvine Aug 11, 2023
Maintainer

@cmdlineluser Wow, this is fantastic, and a great demonstration of the potential for us to swap in pypdfium2 as the backend in the future. Thank you for sharing!

@danltw I'm not sure what's causing this, but if I had to guess, it might be due to an apparently unusual lack of newlines in the internal PDF commands. E.g. (via PDF Object Browser):

danltw Aug 14, 2023
Author

thanks @cmdlineluser !

@jsvine it would be great if we could swap between PDF backends: either fall back to another when one fails, or expose an option to call a specific backend from the dev end 😄 Looking forward to this if it gets included! Also, interesting work on the interoperability of the page object; it may prove useful.
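
Until something like that exists in pdfplumber itself, the fallback idea can be approximated in user code. This is only a sketch, not a pdfplumber feature: the function names are made up, the backends are imported lazily so a missing library just counts as a failed backend, and "failed" is taken to mean "raised an exception or returned only whitespace or form feeds".

```python
def extract_with_pdfminer(path):
    # Same call as in the thread; returned '\x0c\x0c\x0c' for test.pdf
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_pdfium(path):
    # Same pypdfium2 calls as in the thread, applied to every page
    import pypdfium2 as pdfium
    doc = pdfium.PdfDocument(path)
    pages = (doc.get_page(i) for i in range(len(doc)))
    return "\n".join(p.get_textpage().get_text_range() for p in pages)


def extract_text_with_fallback(path, backends=(extract_with_pdfminer, extract_with_pdfium)):
    """Return the first non-empty result from the given backends, else ''."""
    for backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # backend missing or crashed: try the next one
        if text.strip():  # form feeds alone count as empty
            return text
    return ""
```

With the file from this thread, `extract_text_with_fallback("Downloads/test.pdf")` would be expected to skip pdfminer's empty result and return the pypdfium2 text.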
