Traditional Chinese recognition anomaly: the exported PDF looks perfect, but the data contains many extra garbage characters. #516

Closed
1 task done
maxin9966 opened this issue May 20, 2024 · 3 comments

Comments

@maxin9966

Issues

  • I have browsed through the Issues and confirmed that this is not a duplicate.

Umi-OCR version

2.1.1

Windows version

win11

OCR plugins used

No response

Reproduction steps

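PDF conversion
Traditional Chinese
Multi-column, line breaks by natural paragraph
Force full-page OCR

Problem description:

Recognition only succeeds with [Force full-page OCR]; every other mode exports blank output.
With [Force full-page OCR], the exported PDF looks perfect, but in the txt or JSON raw data most sentences end with a few garbage characters, as shown below.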

Problem screenshots or related files (optional)

The exported PDF displays perfectly:

image

But in the raw data, every sentence has a few extra characters:

{
"code": 100,
"data": [{
"box": [
[61.58888888888889, 90.59814814814816],
[155.3111111111111, 90.15185185185186],
[155.3111111111111, 101.75555555555556],
[61.58888888888889, 102.20185185185186]
],
"score": 0.8367864489555359,
"text": "檢視當時的想法國",
"from": "ocr",
"end": "\n"
}, {
"box": [
[84.79629629629629, 121.83888888888889],
[418.17962962962963, 121.83888888888889],
[418.17962962962963, 131.65740740740742],
[84.79629629629629, 131.65740740740742]
],
"score": 0.8287469744682312,
"text": "發覺情緒後先停下來·檢視自己當時的想法為何·求思考這樣的想法對間砲砲",
"from": "ocr",
"end": "\n"
}, {
"box": [
[62.92777777777778, 143.7074074074074],
[404.34444444444443, 143.7074074074074],
[404.34444444444443, 152.63333333333333],
[62.92777777777778, 152.63333333333333]
],
"score": 0.8929890394210815,
"text": "題是否有幫助·提醒自己·是否願意讓不佳的情緒影響對孩子的問題處理:",
"from": "ocr",
"end": ""
}, {
"box": [
[62.035185185185185, 174.50185185185185],
[191.4611111111111, 173.60925925925926],
[191.4611111111111, 182.9814814814815],
[62.035185185185185, 183.8740740740741]
],
"score": 0.8536674976348877,
"text": "修正不適應的歸因想法祐",
"from": "ocr",
"end": "\n"
}, {
"box": [
[83.9037037037037, 205.7425925925926],
[415.94814814814816, 205.7425925925926],
[415.94814814814816, 214.22222222222223],
[83.9037037037037, 214.22222222222223]
],
"score": 0.4056791663169861,
"text": "·多主裂器晶1已月鍵·孚詳節口舉5具早彈(T呈雲·彩歌詳電器旱送為",
"from": "ocr",
"end": "\n"
}, {
"box": [
[62.92777777777778, 225.82592592592593],
[215.5611111111111, 225.82592592592593],
[215.5611111111111, 235.64444444444445],
[62.92777777777778, 235.64444444444445]
],
"score": 0.8782615661621094,
"text": "再重新面對孩子的問題立做處理·",
"from": "ocr",
"end": " "
}, {
"box": [
[62.92777777777778, 252.6037037037037],
[221.80925925925925, 252.6037037037037],
[221.80925925925925, 268.6703703703704],
[62.92777777777778, 268.6703703703704]
],
"score": 0.8285078406333923,
"text": "大、如何用適應的歸因想法國國",
"from": "ocr",
"end": "\n"
}, {
"box": [
[102.64814814814815, 290.5388888888889],
[182.53518518518518, 290.5388888888889],
[182.53518518518518, 299.9111111111111],
[102.64814814814815, 299.9111111111111]
],
"score": 0.8204410076141357,
"text": "不適應的歸因想法砲",
"from": "ocr",
"end": "\n"
}, {
"box": [
[65.60555555555555, 306.1592592592593],
[172.2703703703704, 306.1592592592593],
[172.2703703703704, 315.5314814814815],
[65.60555555555555, 315.5314814814815]
],
"score": 0.869182825088501,
"text": "這個孩子怎麼這麼不乖?",
"from": "ocr",
"end": ""
}, {
"box": [
[65.60555555555555, 321.77962962962965],
[152.63333333333333, 321.77962962962965],
[152.63333333333333, 331.15185185185186],
[65.60555555555555, 331.15185185185186]
],
"score": 0.9081498980522156,
"text": "他根本就是故意的!國",
"from": "ocr",
"end": "\n"
}, {
"box": [
[65.60555555555555, 337.4],
[182.53518518518518, 337.4],
[182.53518518518518, 346.77222222222224],
[65.60555555555555, 346.77222222222224]
],
"score": 0.7975088357925415,
"text": "我對這個孩子實在沒撤了!發〇",
"from": "ocr",
"end": "\n"
}, {
"box": [
[65.60555555555555, 359.7148148148148],
[216.9, 359.7148148148148],
[216.9, 369.087037037037],
[65.60555555555555, 369.087037037037]
],
"score": 0.876326858997345,
"text": "除了吃藥·應該沒有其他的法子了!發〇",
"from": "ocr",
"end": "\n"
}, {
"box": [
[65.60555555555555, 388.27777777777777],
[181.1962962962963, 388.27777777777777],
[181.1962962962963, 397.65],
[65.60555555555555, 397.65]
],
"score": 0.8257303237915039,
"text": "這個孩子是有門缺陷(的·",
"from": "ocr",
"end": " "
}, {
"box": [
[65.60555555555555, 410.5925925925926],
[171.82407407407408, 410.5925925925926],
[171.82407407407408, 420.4111111111111],
[65.60555555555555, 420.4111111111111]
],
"score": 0.8381812572479248,
"text": "這個孩子什麼都做不好·嶋",
"from": "ocr",
"end": ""
}, {
"box": [
[65.60555555555555, 426.212962962963],
[162.00555555555556, 426.212962962963],
[162.00555555555556, 436.0314814814815],
[65.60555555555555, 436.0314814814815]
],
"score": 0.9026196002960205,
"text": "我真是個失敗的父母!砲",
"from": "ocr",
"end": "\n"
}, {
"box": [
[64.71296296296296, 442.72592592592594],
[191.4611111111111, 442.72592592592594],
[191.4611111111111, 452.09814814814814],
[64.71296296296296, 452.09814814814814]
],
"score": 0.9279829263687134,
"text": "這個孩子會這樣都是我的錯!砲",
"from": "ocr",
"end": ""
}, {
"box": [
[61.58888888888889, 467.27222222222224],
[146.8314814814815, 470.3962962962963],
[146.38518518518518, 489.587037037037],
[60.696296296296296, 486.01666666666665]
],
"score": 0.6885399222373962,
"text": "大大作業練習區",
"from": "ocr",
"end": "\n"
}, {
"box": [
[286.52222222222224, 290.5388888888889],
[355.69814814814816, 290.5388888888889],
[355.69814814814816, 299.9111111111111],
[286.52222222222224, 299.9111111111111]
],
"score": 0.7950518131256104,
"text": "適應的歸囚想法發",
"from": "ocr",
"end": "\n"
}, {
"box": [
[227.1648148148148, 306.1592592592593],
[382.47592592592594, 306.1592592592593],
[382.47592592592594, 315.0851851851852],
[227.1648148148148, 315.0851851851852]
],
"score": 0.9082364439964294,
"text": "很多事情不是這個孩子能約控制的+",
"from": "ocr",
"end": "\n"
}, {
"box": [
[227.1648148148148, 321.77962962962965],
[413.27037037037036, 321.77962962962965],
[413.27037037037036, 330.7055555555556],
[227.1648148148148, 330.7055555555556]
],
"score": 0.8847109079360962,
"text": "他其實也不是故意·這些都是立狀造成的·國",
"from": "ocr",
"end": ""
}, {
"box": [
[227.61111111111111, 337.8462962962963],
[413.27037037037036, 337.8462962962963],
[413.27037037037036, 346.77222222222224],
[227.61111111111111, 346.77222222222224]
],
"score": 0.9650014638900757,
"text": "應該有其他的方法來解決·我該再試看看·",
"from": "ocr",
"end": " "
}, {
"box": [
[226.2722222222222, 353.02037037037036],
[415.94814814814816, 352.1277777777778],
[415.94814814814816, 361.9462962962963],
[226.2722222222222, 362.3925925925926]
],
"score": 0.9278918504714966,
"text": "吃藥尺是治療計畫的一個部分·而非下答",
"from": "ocr",
"end": ""
}, {
"box": [
[227.1648148148148, 366.4092592592593],
[253.9425925925926, 366.4092592592593],
[253.9425925925926, 376.22777777777776],
[227.1648148148148, 376.22777777777776]
],
"score": 0.22002696990966797,
"text": "業北·",
"from": "ocr",
"end": "\n"
}, {
"box": [
[227.1648148148148, 382.02962962962965],
[414.60925925925926, 382.02962962962965],
[414.60925925925926, 390.9555555555556],
[227.1648148148148, 390.9555555555556]
],
"score": 0.8889887928962708,
"text": "我該接受孩子真實的樣子·其實他也有很多園業",
"from": "ocr",
"end": ""
}, {
"box": [
[227.1648148148148, 394.97222222222223],
[264.2074074074074, 394.97222222222223],
[264.2074074074074, 404.7907407407408],
[227.1648148148148, 404.7907407407408]
],
"score": 0.8324491381645203,
"text": "優點的·國",
"from": "ocr",
"end": "\n"
}, {
"box": [
[227.1648148148148, 410.5925925925926],
[413.27037037037036, 410.5925925925926],
[413.27037037037036, 420.4111111111111],
[227.1648148148148, 420.4111111111111]
],
"score": 0.9247992634773254,
"text": "我應該著重孩子的優點·列尺看他的缺點、",
"from": "ocr",
"end": "\n"
}, {
"box": [
[227.1648148148148, 426.212962962963],
[393.6333333333333, 426.212962962963],
[393.6333333333333, 436.0314814814815],
[227.1648148148148, 436.0314814814815]
],
"score": 0.9061799645423889,
"text": "這個孩子比起其他孩子是更具挑戰的“國",
"from": "ocr",
"end": ""
}, {
"box": [
[226.2722222222222, 441.8333333333333],
[342.7555555555556, 441.8333333333333],
[342.7555555555556, 450.75925925925924],
[226.2722222222222, 450.75925925925924]
],
"score": 0.9286754131317139,
"text": "誰都不知道孩子會出問題、",
"from": "ocr",
"end": "\n"
}, {
"box": [
[83.9037037037037, 503.8685185185185],
[418.6259259259259, 503.8685185185185],
[418.6259259259259, 513.2407407407408],
[83.9037037037037, 513.2407407407408]
],
"score": 0.9499539136886597,
"text": "現在我們已經知道了感受與想法間的關聯性·在這一個星期中·我們可以佔",
"from": "ocr",
"end": "\n"
}, {
"box": [
[62.92777777777778, 524.8444444444444],
[416.8407407407407, 524.8444444444444],
[416.8407407407407, 533.7703703703704],
[62.92777777777778, 533.7703703703704]
],
"score": 0.9447243809700012,
"text": "試著當面對孩子的間題情境而引發負面情緒時·去辦識當時的想法及其合理性·",
"from": "ocr",
"end": " "
}, {
"box": [
[62.92777777777778, 545.3740740740741],
[415.94814814814816, 545.3740740740741],
[415.94814814814816, 554.7462962962964],
[62.92777777777778, 554.7462962962964]
],
"score": 0.9341756701469421,
"text": "芷思考和取代以較合宜的想法·之後再感受看看是否能讓自已的情緒較為緩和·",
"from": "ocr",
"end": " "
}, {
"box": [
[63.37407407407407, 565.4574074074075],
[194.13888888888889, 565.4574074074075],
[194.13888888888889, 575.2759259259259],
[63.37407407407407, 575.2759259259259]
],
"score": 0.9160681962966919,
"text": "更能理性的面對孩子的間題!",
"from": "ocr",
"end": "\n"
}, {
"box": [
[37.93518518518518, 617.674074074074],
[293.662962962963, 617.674074074074],
[293.662962962963, 626.6],
[37.93518518518518, 626.6]
],
"score": 0.7932961583137512,
"text": "oi2JADHD兒童認知行為親子團體治療·父母手冊·(精簡版)臨",
"from": "ocr",
"end": "\n"
}],
"time": 0.9359145164489746,
"timestamp": 1716194101.0420148,
"page": 14,
"fileName": "14",
"path": "C:/Users/ma/Desktop/txt/ADHD兒童認知行為親子團體治療:父母手冊(精簡版).pdf"
}
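
For anyone post-processing raw data of this shape, here is a minimal sketch of how the blocks could be joined back into plain text, optionally skipping low-confidence lines. It assumes only the fields visible in the dump above (`data`, `text`, `end`, `score`); the function name `blocks_to_text` and the `min_score` threshold are hypothetical and are not part of Umi-OCR.

```python
def blocks_to_text(page: dict, min_score: float = 0.0) -> str:
    """Join each block's "text" and "end" fields, skipping blocks below min_score."""
    parts = []
    for block in page.get("data", []):
        if block.get("score", 0.0) < min_score:
            continue  # drop the whole line if the engine was not confident
        parts.append(block["text"] + block.get("end", ""))
    return "".join(parts)


# Two blocks copied from the output above, with the "box" field omitted.
sample = {
    "code": 100,
    "data": [
        {"score": 0.8367864489555359, "text": "檢視當時的想法國", "from": "ocr", "end": "\n"},
        {"score": 0.22002696990966797, "text": "業北·", "from": "ocr", "end": "\n"},
    ],
}

print(blocks_to_text(sample))                 # both lines
print(blocks_to_text(sample, min_score=0.5))  # only the first line
```

A score threshold only removes whole low-confidence lines, such as the 0.22 and 0.41 blocks in the dump. It cannot remove the stray trailing characters, because they sit inside lines that otherwise score above 0.8.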

@maxin9966
Author

Oh, I see now. It turns out this is a multi-layer PDF, and what I was looking at was the original image overlaid on top.

@maxin9966
Author

Is there a good solution for this recognition problem?

@hiroi-sora
Owner

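You can use the ignore-area feature (click the file name in the table to open its settings, then right-click and drag to create a selection) to mark everything outside the main content as an ignore area. Repeated headers and footers can be excluded the same way. This reduces interference from irrelevant text in the recognition results.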
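
If the data has already been exported, a similar effect can be approximated afterwards by dropping blocks whose `box` lies inside a chosen rectangle. The sketch below illustrates that idea only and is not Umi-OCR's implementation. The helper names and the footer rectangle `(0, 600, 500, 700)` are hypothetical, and the rectangle is assumed to use the same coordinate units as the `box` values in the JSON.

```python
def box_bounds(box):
    """Return (x0, y0, x1, y1) for a four-point box like the ones in the JSON."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def in_any_area(box, areas):
    """True if the box lies entirely inside one of the (x0, y0, x1, y1) rectangles."""
    bx0, by0, bx1, by1 = box_bounds(box)
    return any(
        x0 <= bx0 and y0 <= by0 and bx1 <= x1 and by1 <= y1
        for x0, y0, x1, y1 in areas
    )


def drop_ignored(blocks, areas):
    """Keep only the blocks that are not fully covered by an ignore rectangle."""
    return [b for b in blocks if not in_any_area(b["box"], areas)]


# Two blocks copied from the issue's JSON: a body line and the page footer.
blocks = [
    {"box": [[63.37407407407407, 565.4574074074075], [194.13888888888889, 565.4574074074075],
             [194.13888888888889, 575.2759259259259], [63.37407407407407, 575.2759259259259]],
     "text": "更能理性的面對孩子的間題!"},
    {"box": [[37.93518518518518, 617.674074074074], [293.662962962963, 617.674074074074],
             [293.662962962963, 626.6], [37.93518518518518, 626.6]],
     "text": "oi2JADHD兒童認知行為親子團體治療·父母手冊·(精簡版)臨"},
]

footer_area = [(0, 600, 500, 700)]  # hypothetical band covering the footer line
for block in drop_ignored(blocks, footer_area):
    print(block["text"])
```

This only helps with text that sits in a separate region, such as headers and footers. Stray characters that are recognized as part of a body line share that line's box and are not affected.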

image
