## Pre-processing Images for OCR
https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#page-segmentation-method
1) Inverted Images
2) Sharpness
3) Rescaling
4) Binarization
5) Noise Removal
6) Dilation and Erosion
7) Rotation / Deskewing
  https://gist.github.com/endolith/334196bac1cac45a4893#
8) Removing Borders
9) Missing Borders
10) Transparency / Alpha Channel

### Page segmentation method
  0    Orientation and script detection (OSD) only.<br>
  1    Automatic page segmentation with OSD.<br>
  2    Automatic page segmentation, but no OSD, or OCR.<br>
  3    Fully automatic page segmentation, but no OSD. (Default)<br>
  4    Assume a single column of text of variable sizes.<br>
  5    Assume a single uniform block of vertically aligned text.<br>
  6    Assume a single uniform block of text.<br>
  7    Treat the image as a single text line.<br>
  8    Treat the image as a single word.<br>
  9    Treat the image as a single word in a circle.<br>
 10    Treat the image as a single character.<br>
 11    Sparse text. Find as much text as possible in no particular order.<br>
 12    Sparse text with OSD.<br>
 13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
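
As a rough illustration of how one of these modes is selected, the sketch below assembles a `tesseract` command line with `--psm` and `-l`, the same flags used in the commands later in this notebook. The helper name `build_tesseract_cmd` and the output path are hypothetical; the actual call is left commented out because it needs tesseract on the PATH.

```python
import subprocess

def build_tesseract_cmd(image_path, output_base, lang="kor", psm=4):
    """Return the argument list for a tesseract CLI call (hypothetical helper)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # tesseract appends ".txt" to output_base on its own
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]

cmd = build_tesseract_cmd("./test/news_contents.png", "./test/news_contents1")
print(cmd)

# To actually run OCR (requires tesseract to be installed):
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths, which is one reason to prefer it over building a single command string.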

#### ref
https://youtu.be/ADV-AjAXHdc

In [8]:
#!pip install Pillow

from PIL import Image
# import PIL.ImageOps as ops

import os

https://www.ytn.co.kr/_cs/_ln_0103_202211022306145466_005.html
- A typical image of single-column running text
- Contains no other recognizable text or graphics besides the body text

## Grayscale

In [9]:
filename = './test/news_contents.png'
# img = Image.open(filename).convert('L')
img = Image.open(filename)
img.show(title=None)
# img.save('./test/news_contents_grayscale.png')
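
For reference, a minimal sketch of the grayscale conversion that the commented-out line performs. It uses a tiny synthetic image in place of the news screenshot (an assumption, so the snippet runs without the test file).

```python
from PIL import Image

# Synthetic 2x1 RGB image standing in for the real screenshot (assumption)
rgb = Image.new("RGB", (2, 1), (255, 255, 255))
rgb.putpixel((0, 0), (0, 0, 0))

# "L" is Pillow's 8-bit single-channel (luminance) mode
gray = rgb.convert("L")
print(gray.mode, gray.getpixel((0, 0)), gray.getpixel((1, 0)))
```

Note that pixels of an `L` image are single integers rather than `(r, g, b)` tuples, so code that unpacks three channels needs an RGB image.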

## Resize

TODO: bring over the pyvips code from work.

In [14]:
width, height = img.size
resized = img.resize((width*2, height*2), Image.Resampling.BILINEAR)
# resized.save('./test/news_contents_resized.png')
resized.size

(1578, 1718)

In [11]:
import PIL.ImageOps as ops

# factor: > 1 upsamples, between 0 and 1 downsamples
# resampling: NEAREST gives poor results; BICUBIC and BILINEAR look similar
respimg1 = ops.scale(img, 1.5, resample=Image.Resampling.BILINEAR)
# respimg1.show(title=None)


## Blackpoints

In [15]:
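width, height = resized.size
pixel_map = resized.load()

# Force every near-black pixel (all channels below 50) to pure black.
# Assumes an RGB image; `resized` is modified in place.
for i in range(width):
    for j in range(height):
        r, g, b = pixel_map[i, j]
        if r < 50 and g < 50 and b < 50:
            pixel_map[i, j] = (0, 0, 0)

resized.save('./test/news_contents_blackpoints.png')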
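
The per-pixel loop is slow on large images. One possible alternative, sketched here with Pillow's band operations, builds a mask of pixels whose three channels are all below the threshold and pastes black through it. The function name `clamp_blackpoints` is hypothetical, and unlike the loop above it returns a new image instead of editing in place.

```python
from PIL import Image, ImageChops

def clamp_blackpoints(image, threshold=50):
    """Return a copy where pixels with R, G and B all below threshold are pure black."""
    rgb = image.convert("RGB")  # returns a copy, so the input is left untouched
    # One mask per channel: 255 where the channel is below the threshold, else 0
    masks = [band.point(lambda v: 255 if v < threshold else 0) for band in rgb.split()]
    # darker() keeps the per-pixel minimum, so the result is 255 only where all masks are 255
    mask = ImageChops.darker(ImageChops.darker(masks[0], masks[1]), masks[2])
    rgb.paste((0, 0, 0), mask=mask)
    return rgb

# Tiny demo image: one near-black pixel, one pixel with a bright blue channel
demo = Image.new("RGB", (2, 1))
demo.putpixel((0, 0), (10, 20, 30))
demo.putpixel((1, 0), (10, 20, 200))
out = clamp_blackpoints(demo)
print(out.getpixel((0, 0)), out.getpixel((1, 0)))
```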

## Sharpness

In [16]:
from PIL import ImageEnhance
enhancer = ImageEnhance.Sharpness(resized)
sharp = enhancer.enhance(2)
sharp.save('./test/news_contents_sharpness.png')
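
To make the meaning of the `enhance` factor concrete, the sketch below applies three factors to a small synthetic image (an assumption, used so the snippet runs on its own). In Pillow, a factor of 1.0 returns a copy of the original, values below 1.0 blur, and values above 1.0 sharpen.

```python
from PIL import Image, ImageEnhance

# Synthetic 4x4 white image with a single black pixel (assumption)
sample = Image.new("L", (4, 4), 255)
sample.putpixel((1, 1), 0)

enhancer = ImageEnhance.Sharpness(sample)
unchanged = enhancer.enhance(1.0)  # copy of the original
blurred = enhancer.enhance(0.0)    # fully smoothed version
sharpened = enhancer.enhance(2.0)  # the factor used in this notebook

print(unchanged.tobytes() == sample.tobytes())
print(blurred.getpixel((1, 1)))
```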

## Execute 1
- Result of running tesseract with no pre-processing at all
- About 95% accurate, apart from a few misrecognized characters

In [4]:
os.system("tesseract D:/Code/ocr/test/news_contents.png D:/Code/ocr/test/news_contents1.txt -l kor --psm 4")

0

## Execute 2
- Aiming for 100% accuracy
- Result of running tesseract on the grayscale image
- Similar to Execute 1

In [9]:
os.system("tesseract D:/Code/ocr/test/news_contents_grayscale.png D:/Code/ocr/test/news_contents2.txt -l kor --psm 4")

0

## Execute 3
- Aiming for 100% accuracy
- Result of running tesseract on the grayscale image resized to 2x
- 98% accurate

In [20]:
os.system("tesseract D:/Code/ocr/test/news_contents_resized.png D:/Code/ocr/test/news_contents3.txt -l kor --psm 4")

0

## Execute 4
- Aiming for 100% accuracy
- Result of running tesseract on the grayscale image resized to 2x with sharpness 2.0
- Counterproductive: accuracy dropped to 96%

In [28]:
os.system("tesseract D:/Code/ocr/test/news_contents_sharpness.png D:/Code/ocr/test/news_contents4.txt -l kor --psm 4")

0

## Execute 5
- Neither blackpoints (6) nor sharpness (5) had any effect

In [18]:
os.system("tesseract D:/Code/ocr/test/news_contents_blackpoints.png D:/Code/ocr/test/news_contents.txt -l kor --psm 4")

0
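
The notebook does not say how the accuracy percentages were obtained. One way to estimate such a figure, sketched below under that caveat, is a character-level similarity ratio between a hand-typed ground truth and the OCR output using `difflib` from the standard library. The function name `char_accuracy` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def char_accuracy(reference, recognized):
    """Similarity between ground truth and OCR output as a percentage (whitespace ignored)."""
    ref = "".join(reference.split())
    hyp = "".join(recognized.split())
    return 100.0 * SequenceMatcher(None, ref, hyp).ratio()

# Made-up strings; in practice, read the ground truth and the tesseract .txt output
truth = "pre-processing images for OCR"
ocr_output = "pre-processing imagas for OCR"
print(f"{char_accuracy(truth, ocr_output):.1f}%")
```

This ratio is only a similarity measure, not a formal character error rate, but it is enough to compare pre-processing variants against the same reference text.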