<a href="https://colab.research.google.com/github/hawc2/Text-Analysis-with-Python/blob/master/BERT_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Resources

Tutorial on BERT OCR: 

https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/

# Libraries


In [None]:
!pip install pytesseract
!pip install pytorch-pretrained-bert
!pip install pyenchant

In [None]:
!sudo apt-get -y install libenchant1c2a
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [11]:
from PIL import Image
import pytesseract
from pytesseract import image_to_string
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
import re
import nltk
from difflib import SequenceMatcher

In [12]:
# If the pyenchant import fails, this issue may help: https://github.com/pyenchant/pyenchant/issues/214
import enchant
from enchant.checker import SpellChecker

# File Import

In [13]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sample6.png to sample6.png
User uploaded file "sample6.png" with length 193492 bytes


In [15]:
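filename = './sample6.png'
# Run Tesseract OCR on the uploaded image
text = image_to_string(Image.open(filename))
# Keep an untouched copy of the OCR output; `text` is cleaned in later cells
text_original = str(text)
print(text_original)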

Most District reports indicate somewhat stronger regional economic
activity on balance in December and early January than at the time of the last
reports in November, with much of the growth centered in the retail and
industrial sectors. It would appear, on the basis of these reports, that the
national economy gained momentum in recent weeks as con@gmer spending
strengthened, manufacturing activity cont@™Med to rise, and producers

scheduled more investment in plant and equipment.



# Clean Text

In [16]:
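# Cleanup table: flatten newlines and hyphens, pad punctuation with spaces so each
# mark becomes its own token, and expand common contractions.
# (The '"' key appeared several times in the original; a dict keeps only the last value.)
rep = { '\n': ' ', '\\': ' ', '-': ' ', '"': ' " ',
        ',':' , ', '.':' . ', '!':' ! ',
        '?':' ? ', "n't": " not" , "'ll": " will", '*':' * ',
        '(': ' ( ', ')': ' ) ', "s'": "s '"}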

In [17]:
# Escape the keys so characters such as "." and "(" are matched literally,
# then apply every replacement in a single regex pass
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text)


def get_personslist(text):
    """Return the first token of every PERSON entity found by NLTK's NE chunker."""
    personslist = []
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'PERSON':
                personslist.append(chunk.leaves()[0][0])
    return list(set(personslist))


# Person names and punctuation should not be flagged as misspellings
personslist = get_personslist(text)
ignorewords = personslist + ["!", ",", ".", "\"", "?", '(', ')', '*', "'"]
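
The single-pass replacement used above can be tried on a small string. This is a minimal sketch: `toy_rep`, `toy_pattern`, and `clean` are names made up for illustration, and the table is only a subset of `rep`.

```python
import re

# Toy replacement table: a small subset of the notebook's `rep` (illustrative only).
toy_rep = {"\n": " ", ",": " , ", ".": " . ", "n't": " not"}

# Escape each key so that "." is matched literally rather than as a wildcard.
toy_pattern = re.compile("|".join(re.escape(k) for k in toy_rep))


def clean(sample):
    """Apply every replacement in one pass over the string."""
    return toy_pattern.sub(lambda m: toy_rep[m.group(0)], sample)


cleaned = clean("It didn't rise,\nit fell.")
print(cleaned.split())
```

Because punctuation ends up surrounded by spaces, a plain `split()` afterwards yields words and punctuation marks as separate tokens, which is what the spell-checking step relies on.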


In [18]:
# Use enchant.checker.SpellChecker to identify misspelled words
# TODO: consider replacing SpellChecker with another spell checker
d = SpellChecker("en_US")
words = text.split()

incorrectwords = [w for w in words if not d.check(w) and w not in ignorewords]
# Get the spell checker's suggested replacements for each misspelled word
suggestedwords = [d.suggest(w) for w in incorrectwords]

# Replace misspelled words with [MASK] in both the cleaned and the original text
# (str.replace matches substrings, so a short misspelling can also hit longer words)
for w in incorrectwords:
    text = text.replace(w, '[MASK]')
    text_original = text_original.replace(w, '[MASK]')

In [19]:
print(text)

Most District reports indicate somewhat stronger regional economic activity on balance in December and early January than at the time of the last reports in November ,  with much of the growth centered in the retail and industrial sectors .  It would appear ,  on the basis of these reports ,  that the national economy gained momentum in recent weeks as [MASK] spending strengthened ,  manufacturing activity [MASK] to rise ,  and producers  scheduled more investment in plant and equipment .  
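
Masking with `str.replace` works on substrings, which is fine for this sample but could mask part of a longer, correctly spelled word in other texts. A possible alternative is to mask whole tokens. The sketch below assumes a hypothetical `KNOWN` word set standing in for the spell checker, and `mask_unknown` is an illustrative helper, not part of the notebook.

```python
# Hypothetical stand-in for the spell checker: a tiny set of known words.
KNOWN = {"as", "spending", "strengthened", "activity", "to", "rise"}


def mask_unknown(sample, known=KNOWN):
    """Replace each whitespace-separated token not in `known` with [MASK]."""
    out, masked = [], []
    for token in sample.split():
        if token.lower() in known:
            out.append(token)
        else:
            masked.append(token)
            out.append("[MASK]")
    return " ".join(out), masked


masked_text, masked_words = mask_unknown("as con@gmer spending strengthened")
print(masked_text)
print(masked_words)
```

Returning the masked tokens in order keeps them aligned with the `[MASK]` positions, the same alignment the notebook depends on between `incorrectwords`, `suggestedwords`, and the mask indices.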


# Load BERT Model

In [20]:
# Load the pre-trained tokenizer and model, then predict the masked tokens (no training happens here)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Positions of the [MASK] tokens in the tokenized text
MASKIDS = [i for i, e in enumerate(tokenized_text) if e == '[MASK]']
# Create the segments tensors: one segment per sentence, split at "."
# bert-base-uncased only has embeddings for segment ids 0 and 1, so cap the id at 1
segs = [i for i, e in enumerate(tokenized_text) if e == "."]
segments_ids = []
prev = -1
for k, s in enumerate(segs):
    segments_ids += [min(k, 1)] * (s - prev)
    prev = s
segments_ids += [min(len(segs), 1)] * (len(tokenized_text) - len(segments_ids))
segments_tensors = torch.tensor([segments_ids])
# Prepare Torch inputs
tokens_tensor = torch.tensor([indexed_tokens])
# Load pre-trained model and switch to evaluation mode (disables dropout)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

100%|██████████| 231508/231508 [00:00<00:00, 4437302.61B/s]
100%|██████████| 407873900/407873900 [00:05<00:00, 77878635.24B/s]
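
The segment ids above are built sentence by sentence, splitting at each `.` token. Since `bert-base-uncased` only provides segment embeddings for ids 0 and 1, a text with three or more sentences needs the id capped. The following is a minimal sketch of that idea on toy tokens; `sentence_segment_ids` and `toy_tokens` are illustrative names, not part of the notebook.

```python
def sentence_segment_ids(tokens, max_id=1):
    """Assign one id per sentence (a sentence ends at "."), capped at max_id."""
    ids = []
    current = 0
    for tok in tokens:
        ids.append(min(current, max_id))
        if tok == ".":
            current += 1  # the next token starts a new sentence
    return ids


toy_tokens = ["prices", "rose", ".", "sales", "fell", ".", "jobs", "grew", "."]
print(sentence_segment_ids(toy_tokens))
print(sentence_segment_ids(toy_tokens, max_id=2))
```

With the default cap, every token after the first sentence gets id 1; raising `max_id` shows the uncapped per-sentence numbering for comparison.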


# Refine BERT predictions

In [21]:
# Predict words for each mask using BERT;
# refine the prediction by matching against proposals from SpellChecker
def predict_word(text_original, predictions, maskids):
    for i, maskid in enumerate(maskids):
        # Top 50 BERT candidates for this masked position
        preds = torch.topk(predictions[0, maskid], k=50)
        indices = preds.indices.tolist()
        list1 = tokenizer.convert_ids_to_tokens(indices)
        list2 = suggestedwords[i]
        simmax = 0
        predicted_token = ''
        # Keep the BERT candidate most similar to any spell-checker suggestion
        for word1 in list1:
            for word2 in list2:
                s = SequenceMatcher(None, word1, word2).ratio()
                if s > simmax:
                    simmax = s
                    predicted_token = word1
        text_original = text_original.replace('[MASK]', predicted_token, 1)
    return text_original

text_original = predict_word(text_original, predictions, MASKIDS)
print(text_original)

Most District reports indicate somewhat stronger regional economic
activity on balance in December and early January than at the time of the last
reports in November, with much of the growth centered in the retail and
industrial sectors. It would appear, on the basis of these reports, that the
national economy gained momentum in recent weeks as consumer spending
strengthened, manufacturing activity continued to rise, and producers

scheduled more investment in plant and equipment.
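
The refinement step can be looked at in isolation: for each mask, it picks the BERT candidate whose spelling is closest to any spell-checker suggestion, using `difflib.SequenceMatcher`. The sketch below uses made-up candidate lists (`toy_bert`, `toy_spell`) rather than real model or spell-checker output, and `best_candidate` is an illustrative helper.

```python
from difflib import SequenceMatcher


def best_candidate(model_tokens, spell_suggestions):
    """Return the model token most similar to any suggestion, with its score."""
    best, best_score = "", 0.0
    for cand in model_tokens:
        for sugg in spell_suggestions:
            score = SequenceMatcher(None, cand, sugg).ratio()
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


# Hypothetical inputs, for illustration only.
toy_bert = ["business", "consumer", "government"]
toy_spell = ["conger", "consumer", "comer"]

word, score = best_candidate(toy_bert, toy_spell)
print(word, score)
```

If no pair has a similarity above zero, the function returns an empty string, which mirrors the notebook's behavior of replacing `[MASK]` with `''` when nothing matches.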

