Merge pull request #1 from x4base/master
Use tesseract for preliminary OCR
miaoski committed Sep 2, 2014
2 parents ca78c02 + 25f5725 commit 1d64cde
Showing 5 changed files with 77 additions and 0 deletions.
9 changes: 9 additions & 0 deletions toufu/README.md
@@ -33,4 +33,13 @@
done


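Use tesseract to run a preliminary OCR pass in Chinese and English, so that everyone has less to type.

tesseract must be installed on the system first, along with the Traditional Chinese language data (chi_tra). In addition, the toufu table in the database needs two columns: ocr_eng and ocr_cht.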

for n in *.jpg; do
php ocr_guess.php "$n";
done


4. Write the front end, borrowing some of the design from https://github.com/ctiml/campaign-finance.g0v.ctiml.tw
5 changes: 5 additions & 0 deletions toufu/cell.js
@@ -55,6 +55,9 @@ $(document).ready(function(){
.text("")
.append($('<span></span>').text("第 "+res.p+" 頁, 第 "+res.line+" 行"));

$('#ocrEng').val(res.ocr_eng); // textareas are form fields: set them with .val(), not .text()
$('#ocrCht').val(res.ocr_cht);

if (res.ans !== null) {
$('.cell-info').append($('<span></span>').text(" 已經有" +res.cnt + "人填寫確認了。"));
$('.confirm').show();
@@ -72,6 +75,8 @@ $(document).ready(function(){
$('.confirm').hide();
$('.cell-image').html("");
$('#unclear').hide();
$('#ocrEng').val(''); // clear the textareas with .val(); .text() is not reliable on form fields
$('#ocrCht').val('');

if (question_pools.length) {
set_question(question_pools.shift());
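The two hunks above copy `res.ocr_eng` and `res.ocr_cht` into the page. Rows that `ocr_guess.php` has not processed yet presumably carry NULL in those columns (an assumption, since the schema is not shown in this commit). A minimal sketch of a normalizing step follows; `ocrFieldValues` is a hypothetical helper name, not something this commit defines.

```javascript
// Hypothetical helper, not part of the commit: normalize the OCR fields of a
// question response so that missing or null values become empty strings and
// surrounding whitespace is dropped.
function ocrFieldValues(res) {
  const clean = (value) => (typeof value === 'string' ? value.trim() : '');
  return { eng: clean(res.ocr_eng), cht: clean(res.ocr_cht) };
}
```

In `cell.js` the result could then be written to the textareas with `$('#ocrEng').val(values.eng)` and `$('#ocrCht').val(values.cht)`, where `values` is the object returned by the helper.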
10 changes: 10 additions & 0 deletions toufu/index.php
@@ -44,6 +44,16 @@
<div class="bar" style="width: 80%;"></div>
<span id="progress_text"></span>
</div>
<div><b>以下是供參考的機器自動辨識結果,可以複製正確的部份到答案欄,可以少打一些字</b></div>
<div class="ocr_block">
<label for="ocrEng">英文OCR</label>
<textarea name="" id="ocrEng" cols="50" rows="4"></textarea>
</div>
<div class="ocr_block">
<label for="ocrCht">中文OCR</label>
<textarea name="" id="ocrCht" cols="50" rows="4"></textarea>
</div>
<div style="clear: both;"></div>
</div>
<script src="cell.js"></script>
<link rel="stylesheet" href="cell.css">
43 changes: 43 additions & 0 deletions toufu/ocr_guess.php
@@ -0,0 +1,43 @@
<?php

if (!isset($argv[1])) {
echo "please supply an image file name\n";
exit(1);
}

$imgFn = $argv[1];
$tmpFileBase = '/tmp/amisOcrGuess';
$tmpFile = $tmpFileBase . ".txt";

function getOcr($lang)
{
global $imgFn, $tmpFileBase, $tmpFile;

$cmd = 'tesseract ' . escapeshellarg($imgFn) . ' ' . escapeshellarg($tmpFileBase) . ' -l ' . escapeshellarg($lang) . ' > /dev/null 2>&1';
exec($cmd);
$ocr = file_get_contents($tmpFile);
$ocr = trim($ocr);

return $ocr;
}

preg_match('/([0-9]{3})_([0-9]{3})\.jpg$/', $imgFn, $matches);
if (!$matches) {
die('wrong format of the image file name ');
}
$p = $matches[1];
$line = $matches[2];

$ocrEng = getOcr('eng');
$ocrCht = getOcr('chi_tra');

$pdo = new PDO("sqlite:toufu.sq3");
$st = $pdo->prepare("UPDATE toufu SET ocr_eng=:eng, ocr_cht=:cht WHERE p=:p AND line=:line ");
$st->execute(array(
'eng' => $ocrEng,
'cht' => $ocrCht,
'p' => $p,
'line' => $line
));

echo "$p $line \n";
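The script above does two things before touching the database: it derives the page and line from a file name of the form `NNN_NNN.jpg`, and it invokes `tesseract <image> <outputbase> -l <lang>` once per language. The sketch below restates those two steps in JavaScript, under the assumption that the file names always end in `.jpg`. Both function names are hypothetical and do not appear in this commit.

```javascript
// Hypothetical helpers mirroring the steps of ocr_guess.php; names are illustrative only.

// Extract the page and line from a file name such as "012_034.jpg".
// The dot is escaped and the match is anchored to the end of the name,
// so a name like "012_034xjpg" is rejected.
function parseCellFileName(fileName) {
  const m = /([0-9]{3})_([0-9]{3})\.jpg$/.exec(fileName);
  return m ? { p: m[1], line: m[2] } : null;
}

// Build the argument vector for `tesseract <image> <outputbase> -l <lang>`.
// Passing arguments as an array (rather than one shell string) means file
// names containing spaces or quotes need no escaping.
function tesseractArgs(imagePath, outputBase, lang) {
  return [imagePath, outputBase, '-l', lang];
}
```

In Node.js such a vector could be handed to `child_process.execFile('tesseract', args, callback)`, after which the recognized text would be read from `<outputbase>.txt`, as the PHP script does.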
10 changes: 10 additions & 0 deletions toufu/style.css
@@ -1,3 +1,13 @@
body {
padding-top: 20px;
}

.ocr_block{
float: left;
margin: 10px;
}

.ocr_block textarea{
width: 400px;
height: 130px;
}
