### Text recognition (OCR) with Tesseract and Python

###### Install OpenCV and pytesseract

In [1]:
# pip install opencv-python
# pip install pytesseract

In [4]:
import cv2
import numpy as np
import pytesseract

In [5]:
# Tell pytesseract where the Tesseract engine is installed
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
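
The path above is specific to one Windows installation. As a minimal sketch, assuming Tesseract may already be on the system `PATH`, the executable can be looked up first and the hard-coded path kept as a fallback. The helper name `find_tesseract` is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(fallback):
    """Return the tesseract executable found on PATH, or the fallback path."""
    return shutil.which("tesseract") or fallback


tesseract_path = find_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
print(tesseract_path)
# The result could then be assigned as in the cell above:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```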

In [6]:
# Read the image
img = cv2.imread("bigsleep.jpg")

In [7]:
# Show the image
cv2.imshow("Img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [8]:
# Extract the text 
text = pytesseract.image_to_string(img)
text

"THE BIG SLEEP\nby Raymond Chandler\n\nIt was about eleven o'clock in the morning, mid October, with the sun not\nshining and a look of hard wet rain in the clearness of the foothills. I was\nwearing my powder-blue suit, with dark blue shirt, tie and display\nhandkerchief, black brogues, black wool socks with dark blue clocks on\nthem. I was neat, clean, shaved and sober, and I didn’t care who knew it. I\nwas everything the well-dressed private detective ought to be. T was\ncalling on four million dollars.\n\n‘The main hallway of the Sternwood place was two stories high. Over the\nentrance doors, which would have let in a troop of Indian elephants, there\n‘was a broad stained-glass panel showing a knight in dark armor rescuing\n"

###### Image preprocessing
Here I preprocess the image before passing it to the OCR engine.
Tasks to be completed:
* Resize the image
* Convert the image to grayscale
* Convert the image to black and white

In [10]:
img2 = cv2.imread("book_page.jpg")

In [11]:
cv2.imshow("Image2", img2)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [9]:
text = pytesseract.image_to_string(img2)
text

'Design before you implement\n\nparticolasly if the project involves designing a product or service, ensure\n\na bave the best possible answer in the design phase before you start\nimplementation: Another 80/20 rule says that 20 per cent of the prob-\n\nJems with any design project cause 80 per cent of the costs or overruns;\nfe\n\n4 that 80 per cent of these critical problems arise in the design phase\nni ‘ = :\n\n2 are hugely expensive to correct later, requiring massive rework and\nas :\n\nin some cases retooling.\n\ni\n\n'

###### Resize the Image
Tesseract does not work well with very large images, so it is better to resize the image to a resolution at which the text is still easy to read.

In [12]:
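# Halve both dimensions; fx and fy are the horizontal and vertical scale factors
img2 = cv2.resize(img2, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
# Equivalent call with an explicit (width, height) target size:
# img2 = cv2.resize(img2, (img2.shape[1] // 2, img2.shape[0] // 2), interpolation=cv2.INTER_AREA)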
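
To give an intuition for what shrinking by a factor of 0.5 with area-based interpolation does, here is a pure-Python sketch that averages each 2x2 block of a tiny grayscale image. It is a conceptual illustration only, not OpenCV's implementation; the function name `downscale_half` and the sample pixel values are made up for the example.

```python
def downscale_half(pixels):
    """Halve a grayscale image (list of rows) by averaging each 2x2 block."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for r in range(0, height - 1, 2):
        row = []
        for c in range(0, width - 1, 2):
            block_sum = (pixels[r][c] + pixels[r][c + 1]
                         + pixels[r + 1][c] + pixels[r + 1][c + 1])
            row.append(round(block_sum / 4))
        out.append(row)
    return out


image = [
    [10, 20, 200, 220],
    [30, 40, 240, 180],
    [0, 0, 100, 100],
    [0, 0, 100, 100],
]
small = downscale_half(image)
print(small)  # [[25, 210], [0, 100]]
```

Each output pixel summarizes all four source pixels it covers, which is why area-based shrinking tends to keep thin strokes of text visible instead of dropping them.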

###### Convert Image to Grayscale
Only the text matters for extraction; the colors do not. Text is easier to identify in a grayscale image.

In [35]:
img_gray = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
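
As a rough sketch of what the BGR-to-grayscale conversion computes per pixel, the snippet below applies the standard luminance weights (0.299 for red, 0.587 for green, 0.114 for blue) to a single pixel given in OpenCV's blue, green, red order. The helper `bgr_to_gray` is hypothetical; OpenCV's own fixed-point arithmetic may round slightly differently.

```python
def bgr_to_gray(pixel):
    """Weighted sum of the channels of one (blue, green, red) pixel."""
    b, g, r = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


print(bgr_to_gray((255, 0, 0)))  # pure blue  -> 29
print(bgr_to_gray((0, 255, 0)))  # pure green -> 150
print(bgr_to_gray((0, 0, 255)))  # pure red   -> 76
```

Green contributes most to the gray value and blue least, so blue ink on white paper still ends up clearly darker than the background.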

###### Convert image to black and white (using adaptive threshold)
There are two types of thresholding: simple thresholding and adaptive thresholding.  
* In simple thresholding, the threshold value is global, that is, the same for every pixel in the image. Pixels above the threshold become white and all other pixels become black. A sample of simple thresholding: `_, threshold = cv2.threshold(img, 155, 255, cv2.THRESH_BINARY)`

* In adaptive thresholding, the threshold value is calculated for smaller regions, so different regions of the image get different threshold values. Adaptive thresholding works only on grayscale images.
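
The simple (global) rule can be sketched in a few lines of pure Python. This only illustrates the binary rule "greater than the threshold becomes the maximum value, everything else becomes 0"; the function name `threshold_binary` and the sample values are assumptions for the example.

```python
def threshold_binary(pixels, thresh, max_value=255):
    """Apply one global threshold to every pixel of a grayscale image."""
    return [[max_value if p > thresh else 0 for p in row] for row in pixels]


gray = [
    [100, 155, 156],
    [200, 30, 255],
]
print(threshold_binary(gray, 155))  # [[0, 0, 255], [255, 0, 255]]
```

Note that a pixel exactly equal to the threshold (155 here) is mapped to black, because the comparison is strictly "greater than".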

###### adaptiveThreshold(src, maxValue, adaptiveMethod, thresholdType, blockSize, C)
* src = Source image (8-bit, single-channel)
* maxValue = Value assigned to pixels whose value is greater than the threshold.
* adaptiveMethod = One of two methods: 1) ADAPTIVE_THRESH_MEAN_C 2) ADAPTIVE_THRESH_GAUSSIAN_C
* thresholdType = Type of thresholding to apply (THRESH_BINARY or THRESH_BINARY_INV)
* blockSize = Size of the pixel neighborhood used to calculate the threshold value; must be odd (3, 5, 7, ...).
* C = Constant used in both methods (subtracted from the mean or weighted mean)
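
To make the roles of blockSize and C concrete, here is a pure-Python sketch of the mean variant: each pixel is compared with the mean of its blockSize x blockSize neighborhood minus C. It is an illustration under simplifying assumptions (border pixels are repeated at the edges, plain floating-point means), not OpenCV's implementation, and the function name and the sample scanline are made up.

```python
def adaptive_threshold_mean(pixels, max_value, block_size, c):
    """Binary threshold where each pixel gets its own local threshold."""
    if block_size % 2 == 0 or block_size < 3:
        raise ValueError("block_size must be odd and at least 3")
    half = block_size // 2
    height, width = len(pixels), len(pixels[0])
    out = []
    for r in range(height):
        out_row = []
        for col in range(width):
            total = 0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    # Clamp indices so edge pixels are repeated at the border
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(col + dc, 0), width - 1)
                    total += pixels[rr][cc]
            local_threshold = total / (block_size * block_size) - c
            out_row.append(max_value if pixels[r][col] > local_threshold else 0)
        out.append(out_row)
    return out


# One scanline whose background brightens from left to right;
# the two "ink" pixels (30 and 90) are darker than their surroundings.
scanline = [[40, 60, 30, 100, 120, 90, 160, 180]]

global_result = [[255 if p > 100 else 0 for p in row] for row in scanline]
adaptive_result = adaptive_threshold_mean(scanline, 255, block_size=3, c=10)

print(global_result)    # [[0, 0, 0, 0, 255, 0, 255, 255]]
print(adaptive_result)  # [[255, 255, 0, 255, 255, 0, 255, 255]]
```

With the global threshold the dark left half of the scanline turns entirely black, while the local thresholds keep the background white and mark only the two ink pixels as black.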

In [20]:
# Global threshold on the grayscale image: pixels above 155 become white
_, simple_thresholding = cv2.threshold(img_gray, 155, 255, cv2.THRESH_BINARY)
cv2.imshow("Simple thresholding", simple_thresholding)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [32]:
adaptive_thresh_mean_c = cv2.adaptiveThreshold(img_gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 51, 20)
cv2.imshow("Adaptive thresholding mean c ", adaptive_thresh_mean_c)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [25]:
adap_thresh_gaussain_c = cv2.adaptiveThreshold(img_gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 91, 12)
cv2.imshow("Adaptive thresholding gaussian c", adap_thresh_gaussain_c)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [33]:
text = pytesseract.image_to_string(adap_thresh_gaussain_c)
text

'Designs before you implement\n\nsf che project ivolves designing produc ot service, cose\nresin yrs pomblesoawer inthe dein phe below you sar\nwe "Another 80/20 rule sya that 20 per cent of the prsb-\nrreany design project cae OD pr cent of the cu rei,\nes wg pe crak of thse cra probs arse inthe devign phe\ney caer crete eng ae reve and\n\nproms ose\n\n“3\n\nee\nGit:\n\nvd,\n\n'

In [34]:
text = pytesseract.image_to_string(img2)
text

'Design\nign before you implement\n\nerent fe projet inva d\neae igning a product or servic\n\nMe bet posible a\n\nity an oe eon peo Perce ep\n\nwe eat 0 per cent of pecomaced to ocr cy\n\nron Pc aed goin the cous over\n\naod aE rgely expensive to correc lt ms ase the desi se\n\nJp vome cases x00 eng ase ee ina\nre ework and\n\n'