# Pytesseract | Orientation and Script Detection (OSD)

This example shows how to use the orientation and script detection (OSD) functions in pytesseract.

OSD detects the orientation of the input image and its apparent script (alphabet). This information helps improve accuracy with Tesseract/pytesseract, as the examples below demonstrate.

In [None]:
from PIL import Image
import pytesseract

In [None]:
# Load languages and scripts
!git clone --recurse-submodules https://github.com/tesseract-ocr/tessdata_fast.git 2> /dev/null || (cd tessdata_fast; git pull)
!cp tessdata_fast/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
!cp -r tessdata_fast/script /usr/share/tesseract-ocr/4.00/tessdata/

# Image Rotation

If the input image is rotated, Tesseract gives poor results by default. It does not apply any preprocessing to rotate images; it is up to the end user to rotate them before processing.

In [None]:
path = '/kaggle/input/binder-datasets/ocr/images/letter_rotated.jpg'
im = Image.open(path)
# Show a 30% thumbnail; resize expects a (width, height) tuple
display(im.resize(tuple(int(0.3*s) for s in im.size)))

Let's see the results as-is.

In [None]:
print(pytesseract.image_to_string(im))

The extracted text is completely inaccurate. You can see what Tesseract was trying to do: read the image top to bottom and extract the text.

# Orientation and Script Detection (OSD)

OSD can help here: it provides the information needed to fix the rotation, plus additional information such as the script.

We can get OSD information with pytesseract by using ```image_to_osd```.

It provides this information:

* **page_num** the page index of the current item
* **orientation** the detected rotation of the image, in degrees
* **rotate** the *required* rotation angle, in degrees, to bring the text upright
* **orientation_conf** Tesseract's confidence that the orientation was detected correctly; higher is better
* **script** the script family to which the detected text belongs
* **script_conf** Tesseract's confidence that the script was detected correctly; higher is better

[According to the official documentation](https://tesseract-ocr.github.io/tessapi/5.x/a02438.html#aca4e9a0d9cf388510168d9b58864d1e5), a confidence score of 15.0 is 'reasonably confident' for orientation and script detection.

It is helpful to set ```output_type``` to ```'dict'``` so that the values can be accessed by key.

In [None]:
osd = pytesseract.image_to_osd(im, output_type='dict')
print(osd)
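
Before acting on an OSD result, it can be worth checking the confidence values. The following is a minimal sketch, assuming the 15.0 figure quoted above is used as a cut-off for both fields; the `MIN_CONFIDENCE` constant, the `confident_fields` helper, and the sample numbers are illustrative and not part of pytesseract.

```python
# Assumed cut-off, taken from the "reasonably confident" value quoted above.
MIN_CONFIDENCE = 15.0

def confident_fields(osd, threshold=MIN_CONFIDENCE):
    """Report which OSD results meet the confidence threshold."""
    return {
        'orientation': osd.get('orientation_conf', 0.0) >= threshold,
        'script': osd.get('script_conf', 0.0) >= threshold,
    }

# Hand-written sample in the shape returned by
# image_to_osd(..., output_type='dict'); the numbers are made up.
sample_osd = {
    'page_num': 0,
    'orientation': 270,
    'rotate': 90,
    'orientation_conf': 21.3,
    'script': 'Latin',
    'script_conf': 4.2,
}
print(confident_fields(sample_osd))
```

In this notebook you would pass the real `osd` dict instead of `sample_osd` and, for instance, skip the rotation step when the orientation is not trusted.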

# Correcting the rotation

Let's correct the rotation. It is easy using ```Pillow```.

In [None]:
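rotate = osd['rotate']
# OSD reports a clockwise angle, while Pillow rotates counter-clockwise, so negate it.
# expand=True grows the canvas so that a 90 or 270 degree turn is not cropped.
im_fixed = im.rotate(-rotate, expand=True)
display(im_fixed.resize(tuple(int(0.3*s) for s in im_fixed.size)))
print(pytesseract.image_to_string(im_fixed))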
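
The correction can also be wrapped in a small reusable helper. This is a sketch under two assumptions: that ```rotate``` is a clockwise angle (the Tesseract API documentation describes the detected orientation as a clockwise rotation), and that the image is a Pillow image, whose ```rotate``` method turns counter-clockwise. The names `to_pillow_angle` and `make_upright` are hypothetical.

```python
def to_pillow_angle(rotate):
    """Convert a clockwise OSD 'rotate' value to a counter-clockwise angle."""
    return (360 - rotate) % 360

def make_upright(im, osd):
    """Return an upright copy of `im`, which is expected to be a PIL image."""
    angle = to_pillow_angle(osd['rotate'])
    if angle == 0:
        # Nothing to correct; return a copy so the caller never mutates the input.
        return im.copy()
    # expand=True enlarges the canvas so 90/270 degree turns are not cropped.
    return im.rotate(angle, expand=True)

# Usage in this notebook (im and osd come from the cells above):
# im_fixed = make_upright(im, osd)
```

If the result comes out upside down on your images, the direction assumption does not hold for your Tesseract build and the conversion in `to_pillow_angle` should be dropped.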

# Use case for Script Detection

Where does script detection come into play? Here's one potential example: what if you are creating a global OCR API? In this case you may not know the language or script of the input image. 

In this example, I have an image in Hebrew. I can extract text from it without knowing in advance that it is Hebrew by using the script trained data that comes with ```tessdata_fast```.

In [None]:
path = '/kaggle/input/binder-datasets/ocr/images/hebrew_text.png'
im = Image.open(path)
display(im)

In [None]:
osd = pytesseract.image_to_osd(im, output_type='dict')
print(osd)

In [None]:
print(pytesseract.image_to_string(im, lang='script/'+osd['script'], config='--psm 6'))

This is not perfect. There are two potential issues with this approach.

1. Some languages share a script. For example, English, Spanish, and French are all classified as 'Latin'.
2. The script name returned by ```image_to_osd``` does not map one-to-one to the script models. For example, for Simplified and Traditional Chinese the output might be 'Han', but the ```tessdata``` scripts include 'HanS', 'HanS_vert', 'HanT', and 'HanT_vert'.

For the first issue, one resolution is to extract the text with the 'Latin' script, then use a separate Python library, such as [langdetect](https://github.com/Mimino666/langdetect), to find the best language match. You could then run OCR again with the detected language for more accurate results.
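
A minimal sketch of that two-pass idea, assuming the ```langdetect``` package is installed. ```langdetect``` returns two-letter ISO 639-1 codes while Tesseract's language models use three-letter codes, so some mapping is needed; the `ISO1_TO_TESSERACT` table and both function names here are hypothetical and cover only a few languages.

```python
# Hypothetical, deliberately incomplete mapping from langdetect's
# ISO 639-1 codes to Tesseract language codes.
ISO1_TO_TESSERACT = {
    'en': 'eng',
    'es': 'spa',
    'fr': 'fra',
    'de': 'deu',
    'it': 'ita',
    'pt': 'por',
}

def tesseract_lang(iso1_code, fallback='script/Latin'):
    """Map a langdetect code to a Tesseract `lang` value, or fall back to the script model."""
    return ISO1_TO_TESSERACT.get(iso1_code, fallback)

def ocr_two_pass(im):
    """OCR with the Latin script model, guess the language, then OCR again."""
    # Imported here so the mapping above can be used without these packages.
    import pytesseract
    from langdetect import detect  # third-party: pip install langdetect

    rough_text = pytesseract.image_to_string(im, lang='script/Latin')
    # detect() raises an exception when the text has no usable features,
    # e.g. when the first pass returned an empty string.
    lang = tesseract_lang(detect(rough_text))
    return pytesseract.image_to_string(im, lang=lang)
```

The second pass only helps if the matching ```.traineddata``` file is installed, which is the case here for the languages copied from ```tessdata_fast```.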

For the second issue, you may need to combine several methods to make an educated guess.
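
One starting point is to list the script models that could correspond to an OSD script name and try each in turn. The sketch below is illustrative only: the `SCRIPT_CANDIDATES` table and `candidate_langs` function are hypothetical, and only the 'Han' entry is taken from the text above.

```python
# Hypothetical lookup: OSD script name -> script models to try, in order.
# Only the 'Han' entry comes from the text above; extend it for your own data.
SCRIPT_CANDIDATES = {
    'Han': ['HanS', 'HanT', 'HanS_vert', 'HanT_vert'],
}

def candidate_langs(osd_script):
    """Return the `lang` values worth trying for a script name reported by OSD."""
    # Names without an entry are assumed to match a script model directly.
    models = SCRIPT_CANDIDATES.get(osd_script, [osd_script])
    return ['script/' + model for model in models]

print(candidate_langs('Han'))
print(candidate_langs('Hebrew'))
```

One possible way to choose between the candidates, not tested here, is to run OCR with each and compare the word confidences that ```pytesseract.image_to_data``` reports.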