word-level font names and heights #28

jsfenfen · 2017-03-08T06:06:26Z

Having a font for an entire word helps parsing. A lot. Height also helps some.

I took a crack at this here, with some settings. Defaults also may need adjustment.

If you've got thoughts, @jsvine, lemme know and I can clean this up into a pr. Haven't gotten the testing set up yet.

jsfenfen@847a3bb

jsfenfen · 2017-03-08T06:10:06Z

I guess with word heights I'm going back and forth on averaging them or taking the mode; left the latter in for the moment.
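To make the averaging-versus-mode trade-off concrete, here is a minimal sketch. It assumes each char is a dict with a `height` key (as pdfplumber char objects have); the helper names `word_height_mode` and `word_height_mean` are made up for illustration and are not from the linked commit.

```python
from collections import Counter
from statistics import mean


def word_height_mode(chars):
    """Return the most common char height in the word."""
    heights = [c["height"] for c in chars]
    return Counter(heights).most_common(1)[0][0]


def word_height_mean(chars):
    """Return the arithmetic mean of the char heights in the word."""
    return mean(c["height"] for c in chars)


# Hypothetical word where the last char is taller than the rest.
chars = [
    {"text": "H", "height": 12.0},
    {"text": "e", "height": 12.0},
    {"text": "l", "height": 12.0},
    {"text": "l", "height": 12.0},
    {"text": "o", "height": 18.0},
]

print(word_height_mode(chars))
print(word_height_mean(chars))
```

The mode ignores the single outlier and reports the height most chars share, while the mean is pulled toward the outlier and can yield a height that no char actually has.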

jsvine · 2017-03-08T16:07:26Z

Thanks! I like this. For testing's sake: Do you have shareable examples of PDFs where chars that should belong to the same word either have different heights or fontnames?

jsfenfen · 2017-03-13T17:37:38Z

So I still haven't heard back about the files that originally required this. I could pretty easily just make up a sample pdf that failed the font height test, though obviously having an example would be better... The other time this stuff (can) come up is when the word tolerance is set too high and words run together inadvertently--though only if adjacent cells have different fonts. Will look around a bit.

jsvine · 2017-03-15T03:21:40Z

No worries. Thinking through this a bit. I'm tempted to, by default, group words by fonts, size, and color. (Yes, upcoming versions of pdfplumber will include font color!) Boolean params could turn them off. I.e., defaults would be:

def extract_words(chars,
    x_tolerance=DEFAULT_X_TOLERANCE,
    y_tolerance=DEFAULT_Y_TOLERANCE,
    keep_blank_chars=False,
    match_fontsize=True,
    match_fontcolor=True,
    match_fontname=True
):
    ...

That'd mean losing some of the flexibility of, e.g., DEFAULT_FONT_HEIGHT_TOLERANCE, but might make the options clearer. It'd also mean avoiding having to calculate the average/mode values for tolerance-ed attributes. For instance, this ...

page.extract_words()

... might return ...

[ {
  "text": "Hello",
  "fontsize": 12,
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

... while ...

page.extract_words(match_fontsize=False)

... would return ...

[ {
  "text": "Hello",
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

What do you think? Too inflexible?
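A rough sketch of the grouping idea above, not pdfplumber's actual implementation: chars on one line are split into words at blanks, at horizontal gaps wider than `x_tolerance`, and wherever a matched attribute changes. The function `group_words`, the `match_attrs` parameter, and the `make_chars` helper are all hypothetical; the char keys (`text`, `x0`, `x1`, `fontname`, `size`) are assumed to follow pdfplumber's char dicts.

```python
def group_words(chars, x_tolerance=3, match_attrs=("fontname", "size")):
    """Split a left-to-right run of chars on a single line into words.

    A new word starts at a blank char, when the gap to the previous char
    exceeds x_tolerance, or when any attribute in match_attrs changes.
    Only the matched attributes are copied onto the resulting word dicts.
    """
    words, current = [], []

    def flush():
        # Emit the word collected so far, if any, and reset the buffer.
        if current:
            word = {"text": "".join(c["text"] for c in current)}
            for attr in match_attrs:
                word[attr] = current[0][attr]
            words.append(word)
            current.clear()

    for char in chars:
        if char["text"].isspace():
            flush()
            continue
        if current:
            prev = current[-1]
            gap = char["x0"] - prev["x1"]
            changed = any(char[a] != prev[a] for a in match_attrs)
            if gap > x_tolerance or changed:
                flush()
        current.append(char)
    flush()
    return words


def make_chars(text, fontname, x_start, width=5, size=12):
    """Build fake chars of equal width, laid out from x_start."""
    return [
        {"text": ch, "fontname": fontname, "size": size,
         "x0": x_start + i * width, "x1": x_start + (i + 1) * width}
        for i, ch in enumerate(text)
    ]


# Two adjacent table cells only 1pt apart, set in different fonts.
chars = make_chars("Total", "Arial-Bold", 0) + make_chars("42", "Arial", 26)

print([w["text"] for w in group_words(chars)])
print([w["text"] for w in group_words(chars, match_attrs=())])
```

With font matching on, the font change separates the two cells even though the gap is within tolerance; with `match_attrs=()` they run together, and the word dicts carry no font attributes, mirroring the `match_fontsize=False` example above.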

jsfenfen · 2017-03-19T16:52:21Z

I think that's great!

Also, I think whatever adjustments might be needed will become more obvious the more pdfs we trawl through...

jsfenfen · 2017-04-26T06:40:02Z

I got a different sample of the docs with the font height thing! Going through them, uh, soonish.

jsfenfen · 2017-04-27T19:31:00Z

Ok, I have this working in the word_fonts branch here using made up pdfs as tests. Trying to dig up the sample observed in the wild.

Am doing this with a custom WordFontError subclassed from RuntimeError, but am open to suggestions...

No idea if this will be at all helpful ahead of 0.60 rewrite, but...

jsvine · 2017-04-28T20:56:39Z

Ooh, thanks! Will definitely aim to incorporate this (or something close to it) into the next big release.

problemsniper · 2017-09-28T21:37:50Z

Is this in the current version? I am looking for font name and font size per word and not per letter.

jsfenfen · 2017-09-28T22:32:16Z

hey @krishnakt031990 I don't think so, though the version I did of it is still here: https://github.com/jsfenfen/pdfplumber/tree/master . I guess there's a minor release that's been added since, I will update when I've got a sec.
@jsvine it looks like the pr doesn't have squashed commits? this isn't a big change, though would be clearer if I could squash those. Hmm.

problemsniper · 2017-09-28T23:11:33Z

Works perfectly! thanks @jsfenfen. Just have another question regarding the document. Did you try to reverse engineer to build a pdf out of the extracted properties of text? Just wanted some tips to create one if you did look into doing it.

jsfenfen · 2017-09-29T05:32:03Z

"Did you try to reverse engineer to build a pdf out of the extracted properties of text?"
No.... I'm not sure I get the use case--couldn't you just use the original pdf? But if you really want to create a pdf from objects of your choosing, maybe https://bitbucket.org/rptlab/reportlab ?

jsfenfen · 2017-10-03T18:49:23Z

@krishnakt031990 is this a pdf that's been OCR'ed? Fonts aren't very reliable in most of the OCR I've seen--could this have been set there? Also possible this is a pdfminer thing? Can you share a doc that does this?

problemsniper · 2017-10-23T17:09:55Z

For the font size, the point size is about 4-5 pts more than the actual font. I can give an example with an image here.

See that extra spacing on top of My?

Saqhas · 2020-03-19T14:03:55Z

@jsvine Is this issue resolved and the functionality added?

jsvine · 2020-04-01T13:09:49Z

This functionality has not yet been added. I'm certainly open to adding it, but haven't had the time quite yet.

Saqhas · 2020-04-01T13:12:52Z

I wanted this functionality in one of my projects. I have made some changes to the repo code to support it. Should I push them to a branch and create a pull request, so that we can discuss and add it?

jsvine · 2020-04-01T13:20:55Z

Thanks, @Saqhas! It's definitely worth a discussion and opening a pull request. I'm not certain I'll use your code, but it could definitely be helpful inspiration and I would certainly credit you for that.

ibrahimshuail · 2020-07-22T15:49:41Z

Can we capture words based on the font size? E.g., if my font size is 12, I need the relevant words at that size.

jsvine · 2020-07-24T13:20:00Z

@ibrahimshuail See my response to the separate issue you opened, #234

jsvine · 2021-01-27T23:32:44Z

Closing this now-done issue. Per merged PR above, this feature was added last year! 🎉
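For anyone landing here later, a minimal sketch of using the `extra_attrs` argument that the merged PR added to `.extract_words(...)`, including the filter-by-size case asked about above. The helper `words_with_size` and the `FakePage` stand-in are hypothetical and exist only so the example runs without a PDF; it assumes word dicts carry the requested attributes under the same keys as the chars (`fontname`, `size`).

```python
def words_with_size(page, size, tolerance=0.5):
    """Return words whose font size is within `tolerance` of `size`.

    `page` is assumed to behave like a pdfplumber page: calling
    extract_words(extra_attrs=[...]) yields one dict per word that
    includes the requested attributes.
    """
    words = page.extract_words(extra_attrs=["fontname", "size"])
    return [w for w in words if abs(w["size"] - size) <= tolerance]


class FakePage:
    """Hypothetical stand-in for a pdfplumber page (no PDF needed)."""

    def extract_words(self, extra_attrs=None):
        return [
            {"text": "Heading", "fontname": "Arial-Bold", "size": 18.0},
            {"text": "body", "fontname": "Arial", "size": 12.0},
            {"text": "text", "fontname": "Arial", "size": 11.8},
        ]


matches = words_with_size(FakePage(), 12)
print([w["text"] for w in matches])
```

With a real document, the page would come from something like `pdfplumber.open(path).pages[0]` in place of `FakePage()`. The tolerance is there because extracted sizes are often not round numbers.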

jsfenfen mentioned this issue Apr 28, 2017

font heights, names optionally returned from utils.extract_words #32

Closed

jsvine added the enhancement label Apr 1, 2020

jsvine mentioned this issue Aug 29, 2020

Refactor several complex methods and add extra_attrs to .extract_words(...) #260

Merged

jsvine closed this as completed Jan 27, 2021
