DOC: OCR vs PDF text extraction #1081

MartinThoma · 2022-07-09T12:58:42Z

New Features (ENH):
- Add PageObject._get_fonts (#1083)
- Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
- Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
- Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
- Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
- Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
- Stamps and watermarks (#1082)
- OCR vs PDF text extraction (#1081)
- Python Version support
- Formatting of CHANGELOG

Developer Experience (DEV):
- Cache downloaded files (#1070)
- Speed-up for CI (#1069)

Maintenance (MAINT):
- Set page.rotate(angle: int) (#1092)
- Issue #416 was fixed by #1015 (#1078)

Testing (TST):
- Image extraction (#1080)
- Image extraction (#1077)

Code Style (STY):
- Apply black
- Typo in Changelog

Full Changelog: 2.4.2...2.4.3

DOC: OCR vs PDF text extraction

8df1a16

Closes #1073

MartinThoma force-pushed the ocr branch from cae9a5e to 8df1a16 Compare July 9, 2022 13:03

MartinThoma merged commit 9794ef6 into main Jul 9, 2022

MartinThoma deleted the ocr branch July 9, 2022 13:04
