# Introduction

OCR (Optical Character Recognition) is a class of computer-vision algorithms that seek to recognize text within images.

In a real-world scenario, OCR is only a small part of a document analysis pipeline. The overall pipeline involves many other stages to consider:

* Preprocessing (Image Cleaning): This step can make or break your application; the rest of the pipeline depends on clean input.

* Classification: Your system will likely encounter extraneous pages, so you need a way to filter them out and home in on the information you need.

* Structure Segmentation: If you deal with complex documents, you may need to recognize their structure (headers, tables, body text, etc.) to refine your extraction efforts.

* OCR (Recognizing the Text): This can also involve extracting metadata (bounding boxes, etc.).

* Postprocessing (Text Cleaning): Like any other model, OCR models make mistakes, and these need to be corrected in postprocessing.

* Reconstruction: If you need structured data out of your system, you'll need to rebuild that structure from the OCR output (tables, for example).

* Information Retrieval: Getting the information you want out of the document (text search, regex, or more complicated NLP modelling).
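
To make two of the later stages concrete, here is a minimal sketch of postprocessing and information retrieval using only Python's `re` module. The character-confusion table, the helper names (`fix_numeric_token`, `clean_text`, `find_dates`), and the date pattern are illustrative assumptions, not part of any OCR library; a real system would tune them to its own documents.

```python
import re

# Assumed OCR confusions (letter read in place of a digit). Hypothetical table.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def fix_numeric_token(token):
    """Apply digit fixes only when at least half the characters are already digits."""
    digits = sum(ch.isdigit() for ch in token)
    if digits and digits * 2 >= len(token):
        return token.translate(DIGIT_FIXES)
    return token


def clean_text(raw):
    """Postprocessing: rejoin hyphenated line breaks, fix digits, collapse whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # "to-\nbacco" -> "tobacco"
    return " ".join(fix_numeric_token(t) for t in text.split())


# Information retrieval: a simple (assumed) pattern for dates such as 10/12/1998.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def find_dates(text):
    """Return every date-like string found in the cleaned text."""
    return DATE_PATTERN.findall(text)


raw = "Memo dated 1O/12/1998 re: to-\nbacco   research\nfunding"
cleaned = clean_text(raw)
print(cleaned)              # Memo dated 10/12/1998 re: tobacco research funding
print(find_dates(cleaned))  # ['10/12/1998']
```

Restricting the character fixes to mostly numeric tokens is a deliberate choice here: applying them everywhere would corrupt ordinary words such as "Solid" or "loop".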

## Purpose of the Intro

This introduction differs from the others in that it is just a list of problems that would need to be solved in a real-world application, with some short demos using the Tesseract OCR engine:

https://github.com/tesseract-ocr/

## Document Examples

The documents presented here are a very small sample of the Tobacco3482 dataset:

https://lampsrv02.umiacs.umd.edu/projdb/project.php?id=72

This dataset is a sample of a larger set, which is itself a sample of an even larger collection of documents from legal proceedings against the tobacco industry.

In [None]:
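import os

import numpy as np
import pandas as pd
import pytesseract
from IPython.display import display  # explicit import; notebooks also expose it implicitly
from PIL import Image

import preprocess  # local helper module that accompanies this notebook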

In [None]:
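def demo_ocr(directory):
    """OCR every TIFF in `directory`, optionally rerunning on a cleaned image."""
    filenames = [x for x in os.listdir(directory)
                 if x.lower().endswith(('.tif', '.tiff'))]

    for fn in filenames:
        pilimg = Image.open(os.path.join(directory, fn))

        # First pass: OCR the raw scan and show it next to the extracted text.
        text = pytesseract.image_to_string(pilimg)
        display(pilimg)
        print(text)

        cont = input('Clean and rerun? (y or n) ')

        # Second pass: clean the image and OCR it again for comparison.
        if cont.strip().lower() == 'y':
            pilimg = preprocess.clean(pilimg)
            text = pytesseract.image_to_string(pilimg)
            display(pilimg)
            print(text)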

In [None]:
demo_ocr('images/real-world')
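
The `preprocess` module used in the demo is not shown in this notebook, so its `clean` function is opaque here. As a rough idea of what an image-cleaning step might involve, the sketch below uses Pillow to convert to grayscale, stretch the contrast, remove speckles with a median filter, and binarize with a global threshold. The function name `clean_sketch` and the default `threshold=128` are assumptions made for illustration; this is not the notebook's actual implementation.

```python
from PIL import Image, ImageFilter, ImageOps


def clean_sketch(img, threshold=128):
    """Hypothetical cleaning step; NOT the notebook's preprocess.clean."""
    gray = ImageOps.grayscale(img)                         # drop colour information
    gray = ImageOps.autocontrast(gray)                     # stretch values to the full 0-255 range
    gray = gray.filter(ImageFilter.MedianFilter(size=3))   # suppress isolated speckles
    # Global threshold: every pixel becomes pure black (0) or pure white (255).
    return gray.point(lambda p: 255 if p > threshold else 0)


# Synthetic "scan": a light grey page with a dark grey block standing in for ink.
page = Image.new("RGB", (20, 20), (200, 200, 200))
page.paste((60, 60, 60), (5, 5, 15, 15))

cleaned = clean_sketch(page)
print(cleaned.mode, cleaned.size)
```

A single global threshold is the simplest choice and tends to struggle with unevenly lit scans; adaptive thresholding or deskewing would be natural next steps, but they are beyond this sketch.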