##  Training an OCR using RNN + CTC on Synthetic Images ##
- The goal is to train a neural network to do seq2seq mapping when the input sequence and the output sequence are not aligned
- The input sequence is a sequence of image features and the output is a sequence of characters
- Images are resized to a fixed width, though we could use varying widths since an RNN can handle variable length sequences. The fixed width makes batch learning faster
- Training images are rendered on the fly for the task
- The network is also tested on synthetic images, but these are rendered from out-of-vocabulary words
- We train a network with a bidirectional RNN and a CTC loss for the task
     - Each column of a word image is treated as a timestep, so inputDim = height of the word image and seqLen = width of the image
     - To make sure the network learns the mappings, we first overfit it to 3 letter words
     - Then we train the network on a larger dataset, comprising images rendered from 90k English words
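
The column-as-timestep idea can be sketched with a plain NumPy array. This is only an illustration: the random array is a hypothetical stand-in for a rendered word image, and the sizes 32 and 100 are the image height and width used later in this notebook.

```python
import numpy as np

# Hypothetical stand-in for one rendered word image: height 32, width 100,
# pixel values in the 0-1 range.
imHeight, imWidth = 32, 100
image = np.random.rand(imHeight, imWidth)

# Transposing H x W into W x H turns every image column into one timestep.
sequence = image.T

seqLen, inputDim = sequence.shape
print(seqLen, inputDim)  # 100 32

# The feature vector at timestep t is exactly column t of the image.
t = 7
assert np.array_equal(sequence[t], image[:, t])
```

So the RNN sees 100 timesteps, each carrying a 32-dimensional feature vector.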


In [1]:
# =============================================================================
# Use a BRNN + CTC to recognize given word image 
# Network is trained on images rendered using PIL 
# =============================================================================

from __future__ import print_function
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
import numpy as np
from time import sleep
import random
import sys,codecs,glob 
import torch
import torch.autograd as autograd
import torch.nn as nn
import time,math
import torch.nn.functional as F
import torch.optim as optim
from warpctc_pytorch import CTCLoss
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
random.seed(0)
# TODO - MAKE SURE CTC IS INSTALLED IN ALL MACHINES
use_cuda = torch.cuda.is_available()

if use_cuda:
    print ('CUDA is available')
use_cuda=False # force training on the CPU; comment this line out to use the GPU

CUDA is available


#### Vocabulary and the fonts ####
- load the lexicon of 90k words
- get the list of fonts to be used


In [3]:
#all word images are resized to a height of 32 pixels
imHeight=32 
# The image width is also set to a fixed size.
# Though RNNs can handle variable length sequences, we resize the images to a fixed width.
# This is for the ease of batch learning.
imWidth=100
#65 google fonts are used
fontsList=glob.glob('../../../data/lab2/googleFonts/'+'*.ttf')
vocabFile=codecs.open('../../../data/lab2/lexicon.txt','r')
#90k vocabulary
words = vocabFile.read().split()
vocabSize=len(words)
fontSizeOptions=['16','20','24','28','30','32','36','38'] # a list, since random.sample() needs a sequence
batchSize=5 
alphabet='0123456789abcdefghijklmnopqrstuvwxyz-'
#alphabet="(3)-"
dict={}
for i, char in enumerate(alphabet):
	dict[char] = i + 1


    

In [4]:
## a simple helper function to compute time since some 'start time'
def time_since(since):
	s = time.time() - since
	m = math.floor(s / 60)
	s -= m * 60
	return '%dm %ds' % (m, s)

### Converting from strings to labels ###
- The ground truth (GT) for each image is a word. We convert the word into a sequence of labels.<br>
- Each character in the alphabet is mapped to a number <br>
- So the GT becomes a sequence of labels <br>
- Note that the class labels begin from 1, not 0<br>
- 0 is reserved for the blank label in CTC
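
As a small worked example of this mapping, the sketch below rebuilds the character-to-label table from the notebook's alphabet and encodes one word. The names `char_to_label` and `str_to_labels` are illustrative only; the notebook's own helper is `Str2Labels`.

```python
alphabet = '0123456789abcdefghijklmnopqrstuvwxyz-'

# Labels start at 1 because label 0 is reserved for the CTC blank.
char_to_label = {char: i + 1 for i, char in enumerate(alphabet)}

def str_to_labels(text):
    """Return the label sequence for a word, together with its length."""
    labels = [char_to_label[char.lower()] for char in text]
    return labels, len(labels)

labels, length = str_to_labels("bee")
print(labels, length)  # [12, 15, 15] 3
```

The digit '0' gets label 1 and the last character '-' gets label 37, so 38 output classes are needed once the blank is counted.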

### Converting from labels to strings - argMax decoding ###
- The network gives an activation vector at each time step: a probability distribution over the output class labels. At each time step we find the class/label with the maximum probability.
- Doing this at all time steps is called argMax decoding
- So after decoding you have one label for each time step (this is your predictedRawString)
- Removing the recurring characters and then the blank labels from this sequence gives the final predicted string
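
The two decoding steps can be sketched in plain Python. This is a simplified illustration rather than the notebook's `Labels2Str`: the helper names `argmax_labels` and `collapse` are made up, and the probabilities are toy numbers over only three classes.

```python
alphabet = '0123456789abcdefghijklmnopqrstuvwxyz-'
BLANK = 0

def argmax_labels(probabilities):
    """Step 1: pick the most probable class at every timestep."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]

def collapse(raw_labels):
    """Step 2: merge repeated labels, then drop the blanks."""
    decoded = []
    previous = None
    for label in raw_labels:
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)

# Toy distribution over 3 classes (blank, '0', '1') for 4 timesteps.
probabilities = [[0.1, 0.8, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.9, 0.05, 0.05],
                 [0.2, 0.1, 0.7]]
print(argmax_labels(probabilities))                    # [1, 1, 0, 2]
print(collapse(argmax_labels(probabilities)))          # 01
print(collapse([12, 12, 0, 0, 15, 0, 15, 15, 0, 0]))   # bee
```

Note that the blank between the two 15s is what allows the double "e" in "bee" to survive the merging of repeats.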


In [5]:
# return the class labels for each character in the targetsequence 
def Str2Labels(text):
	global dict
	text = [dict[char.lower()] for char in text]
	#print (text)
	length=len(text)
	return text, length
#Str2Labels("0-1")









### from the predicted sequence of labels for an image, decode the string
# the function returns the raw string and also the decoded string after removing blanks and duplicates

# e.g. if the label sequence you get after an argmax on the output activation matrix is [12,12,0,0,15,0,15,15,0,0]
# then the raw label string would be "b~~~e~e~~~" (blanks and repeated labels are both shown as "~") and the output string "bee"
def Labels2Str(predictedLabelSequences):
    bz=predictedLabelSequences.size(0)
    predictedRawStrings=[]
    predictedStrings=[]
    for i in range(0,bz):
        predictedRawString=""
        predictedString=""
        predictedLabelSeq=predictedLabelSequences.data[i,:]
        prevId=-1 # sentinel value which is not a valid label
        for j in range (0, predictedLabelSeq.size(0)):
            idx=int(predictedLabelSeq[j])
            if idx==0 or idx==prevId:
                # a blank, or a repeat of the previous label:
                # shown as "~" in the raw string and dropped from the decoded string
                character_raw="~"
                character=""
            else:
                character_raw=alphabet[idx-1]
                character=alphabet[idx-1]
            prevId=idx

            predictedString+=character
            predictedRawString+=character_raw
        predictedRawStrings.append(predictedRawString)
        predictedStrings.append(predictedString)
        
    return predictedRawStrings, predictedStrings



def image2tensor(im):
    #returns the pixel values of a greyscale PIL image (in 0-1 range) as a 2D torch tensor of shape height x width

    (width, height) = im.size
    greyscale_map = list(im.getdata())
    greyscale_map = np.array(greyscale_map, dtype = np.uint8)
    greyscale_map=greyscale_map.astype(float)
    greyscale_map = torch.from_numpy(greyscale_map.reshape((height, width))).float()/255.0
    return greyscale_map


### Render the images, prepare a training batch ###
- GetBatch() renders a batch of word images from the list of words supplied
- Rendering with a single font is useful when you want to test overfitting the network to easy examples. The function below picks a random font and font size for every word; to keep them fixed, comment out the two random.sample lines
- Along with the rendered images, the target strings are converted to the corresponding sequences of labels; for example the word "bee" is converted to [12,15,15]
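
GetBatch() does not keep one label list per word: it concatenates the labels of the whole batch into a single flat sequence and returns the per-word lengths alongside it. The sketch below illustrates that layout in plain Python; `flatten_targets` and `split_targets` are hypothetical helper names used only here.

```python
alphabet = '0123456789abcdefghijklmnopqrstuvwxyz-'
char_to_label = {char: i + 1 for i, char in enumerate(alphabet)}

def flatten_targets(words):
    """Concatenate the label sequences of all words and record their lengths."""
    flat_labels, lengths = [], []
    for word in words:
        labels = [char_to_label[c] for c in word.lower()]
        flat_labels += labels
        lengths.append(len(labels))
    return flat_labels, lengths

def split_targets(flat_labels, lengths):
    """Recover the per-word label sequences from the flat layout."""
    sequences, start = [], 0
    for n in lengths:
        sequences.append(flat_labels[start:start + n])
        start += n
    return sequences

flat, lengths = flatten_targets(["bee", "ab"])
print(flat)                          # [12, 15, 15, 11, 12]
print(lengths)                       # [3, 2]
print(split_targets(flat, lengths))  # [[12, 15, 15], [11, 12]]
```

The lengths are what let the loss function tell where one word's labels end and the next word's begin.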

In [6]:
def GetBatch ( batchOfWords ):
	"""
	Renders a batch of word images and returns the images along with the corresponding GTs
	Uses PIL to render word images
	font is randomly picked from a set of freely available google fonts
	word is picked from a vocabulary of English words

	"""
	wordImages=[]
	labelSequences=[]
	labelSeqLengths=[]

	for  i,text in enumerate (batchOfWords):
		wordText=text
		#print('text is', text)
		fontName=fontsList[0]
		fontSize='26'
		#fontSize=fontSizeOptions[0]
		fontName=random.sample(fontsList,1)[0]
		fontSize=random.sample(fontSizeOptions,1)[0]
		imageFont = ImageFont.truetype(fontName,int(fontSize))
		textSize=imageFont.getsize(wordText)
		img=Image.new("L", textSize,(255))
		draw = ImageDraw.Draw(img)
		draw.text((0, 0),wordText,(0),font=imageFont)
		img=img.resize((imWidth,imHeight), Image.ANTIALIAS)
		#img.save(text+'.jpeg')

		imgTensor=image2tensor(img)
		imgTensor=imgTensor.unsqueeze(0) # a new dimension is added at position 0

		wordImages.append(imgTensor)

		labelSeq,l=Str2Labels(wordText)
		labelSequences+=labelSeq
		labelSeqLengths.append(l)
	batchImageTensor=torch.cat(wordImages,0) # now all the image tensors are combined (we did the unsqueeze earlier for this cat)
	batchImageTensor=torch.transpose(batchImageTensor,1,2)# BxHxW -> BxWxH for it to be in BxTxD shape
	labelSequencesTensor=torch.IntTensor(labelSequences)
	labelSeqLengthsTensor=torch.IntTensor(labelSeqLengths)
	return batchImageTensor, labelSequencesTensor, labelSeqLengthsTensor
		

### Exercise 1 ###
1. For the words 'me', 'you' and 'i', call the function GetBatch() and display the rendered images along with their label sequences
    - note that the input to GetBatch() is a list of words



### Model Definition  ###
![OCR Architecture](blstm.jpg)
- The input image here is of shape 100*32 (width*height). Hence seqLen=100 and featDim at each timestep=32
- The network below has two BLSTM layers with #neurons in each layer = hiddenDim
- The outputs of both the forward and backward recurrent layers in the second hidden layer are connected to a linear layer. There are hiddenDim*2 connections coming into this layer and its output is of size outputDim=nClasses+1 (one extra class for the blank label of CTC)
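
The shapes described above can be checked with a small NumPy sketch. It does not run an LSTM: the random arrays are stand-ins for the forward and backward hidden states of the last BLSTM layer and for the weights of the linear layer, and the value hiddenDim=80 matches the nHidden used in the training cell.

```python
import numpy as np

seqLen, hiddenDim, nClasses = 100, 80, 37
outputDim = nClasses + 1  # one extra class for the CTC blank

# Stand-ins for the hidden states of the last BLSTM layer for one image.
forwardStates = np.random.rand(seqLen, hiddenDim)
backwardStates = np.random.rand(seqLen, hiddenDim)

# At every timestep the states of the two directions are concatenated.
blstmOut = np.concatenate([forwardStates, backwardStates], axis=1)

# Random stand-in weights for the linear output layer.
weights = np.random.rand(2 * hiddenDim, outputDim)
bias = np.zeros(outputDim)
activations = blstmOut @ weights + bias

print(blstmOut.shape)     # (100, 160)
print(activations.shape)  # (100, 38)
```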


In [7]:
class rnnocr (nn.Module):
    def __init__(self, inputDim, hiddenDim, outputDim,  numLayers, numDirections):
        super(rnnocr, self).__init__()
        self.inputDim=inputDim
        self.hiddenDim=hiddenDim
        self.outputDim=outputDim
        self.numLayers=numLayers
        self.numDirections=numDirections
        # the LSTM takes the image features as inputs; numLayers layers are stacked
        # and it is bidirectional when numDirections==2
        self.blstm1=nn.LSTM(inputDim, hiddenDim,numLayers, bidirectional=(numDirections==2), batch_first=True)
        
        self.linearLayer2=nn.Linear(hiddenDim*numDirections, outputDim) # linear layer at the output
        
    def forward(self, x ):
        lstmOut1, _  =self.blstm1(x ) #x has three dimensions batchSize* seqLen * FeatDim
        B,T,D  = lstmOut1.size(0), lstmOut1.size(1), lstmOut1.size(2)
        lstmOut1=lstmOut1.contiguous()

        outputLayerActivations=self.linearLayer2(lstmOut1.view(B*T,D))
        # the raw (pre-softmax) activations are returned, since the warp-ctc loss
        # applies the softmax itself
        return outputLayerActivations.view(B,T,-1).transpose(0,1) # seqLen * batchSize * outputDim

### Exercise 2 ###
1. In the above model definition, why are there  <i> hiddenDim*numDirections</i> connections to the output layer ?
    - write your answer here itself in one sentence

In [10]:
###########
# Prepare the synthetic validation data
##############

valWords=['9446567456','hyderabad','golconda','charminar','gachibowli']
valImages, valLabelSeqs, valLabelSeqlens=GetBatch(valWords)
valImages=autograd.Variable(valImages)
valImages=valImages.contiguous()
if use_cuda:
    valImages=valImages.cuda()
valLabelSeqs=autograd.Variable(valLabelSeqs)
#print(valLabelSeqs.data)
valLabelSeqlens=autograd.Variable(valLabelSeqlens)

### To handle out of memory error ###
 - First try making the batchSize smaller
 - Then you can make the network unidirectional by setting numDirections=1
 - or reduce the number of hidden units
 - in the worst case make imWidth smaller (this is set at the beginning of the code)
 - if nothing above works, set use_cuda=False at the beginning so that training runs on the CPU
 - If you are unable to do the training even after trying the above, you should load a pretrained model and see the results (at the end of this notebook)

### Exercise 3 ###
1. Why does reducing the width of the image help us save memory?
    
    
2. It was mentioned earlier that you can pass variable length sequences to the RNN, which means we may use varying width images to train our network. But can we feed images of varying height to our network?



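### CTC loss ###
- See [this page](https://github.com/SeanNaren/warp-ctc/tree/pytorch_bindings/pytorch_binding) to understand the arguments for the CTC loss
- [Original warp_ctc implementation](https://github.com/baidu-research/warp-ctc)


The CTC loss function takes 4 arguments: the network activations, the target label sequences, the lengths of the activation sequences and the lengths of the label sequences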
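
To make concrete what the loss measures, here is a from-scratch sketch of the CTC forward algorithm for a single sample in plain Python. It is not warp-ctc's implementation and it is not numerically stabilised; the function name `ctc_neg_log_likelihood` and the toy probabilities are assumptions made for illustration. It sums the probabilities of all frame-level alignments that collapse to the target and returns the negative log of that sum.

```python
import math

def ctc_neg_log_likelihood(probs, labels, blank=0):
    """probs[t][k]: softmax probability of class k at timestep t.
    labels: target label sequence without blanks."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    S, T = len(extended), len(probs)

    # alpha[s]: total probability of all partial alignments that are at
    # position s of the extended sequence at the current timestep.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, T):
        new_alpha = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # The blank between two labels may be skipped only if they differ.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][extended[s]]
        alpha = new_alpha

    # A valid alignment ends on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    if likelihood == 0.0:
        return math.inf  # no alignment of this length can produce the target
    return -math.log(likelihood)

# Two timesteps, two classes (blank and one character), uniform probabilities.
probs = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(probs, [1])
print(loss, math.exp(-loss))  # the likelihood is approximately 0.75
```

In the toy case three alignments collapse to the single character (character-character, character-blank, blank-character), each with probability 0.25, which is where the 0.75 comes from.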


In [11]:
###########################################
# TRAINING
##################################################
"""
a batch of words are sequentially fetched from the vocabulary
one epoch runs until all the words in the vocabulary are seen once
then the word list is shuffled and above process is repeated
"""
nHidden=80
batchSize=5 #if you have more gpu memory you may increase it and your training will be faster
nClasses= len(alphabet)
criterion = CTCLoss()

numLayers=2 # the 2 BLSTM layers are stacked using the numLayers option of nn.LSTM
numDirections=2 # 2 since we need to use a bidirectional LSTM
outputDim=nClasses+1 # one extra output for the blank label (label 0) of CTC
model = rnnocr(imHeight,nHidden,outputDim,numLayers,numDirections)

optimizer=optim.Adam(model.parameters(), lr=0.001)
start = time.time()
if use_cuda:
    model=model.cuda()
    criterion=criterion.cuda()


for iter in range (0,200):
    avgTrainCost=0
    random.shuffle(words)

    for i in range (0,vocabSize-batchSize+1,batchSize):
    
        model.zero_grad()
        
        batchOfWords=words[i:i+batchSize]
        images,labelSeqs,labelSeqlens =GetBatch(batchOfWords)
        images=autograd.Variable(images)
        #images=autograd.Variable(images)
        images=images.contiguous()
        if use_cuda:
            images=images.cuda()
        labelSeqs=autograd.Variable(labelSeqs)

        labelSeqlens=autograd.Variable(labelSeqlens)
        outputs=model(images)
        outputs=outputs.contiguous()
        outputsSize=autograd.Variable(torch.IntTensor([outputs.size(0)] * batchSize))
        trainCost = criterion(outputs, labelSeqs, outputsSize, labelSeqlens) / batchSize

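        avgTrainCost+=trainCost.data[0] # accumulate a plain number so that the graph is not kept alive
        if i%500==0:
            avgTrainCost=avgTrainCost/(500/batchSize) # i advances by batchSize, so this block runs every 500 samples
            #print ('avgTraincost for last 500 samples is',avgTrainCost)
            avgTrainCost=0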
            valOutputs=model(valImages)
            #print (valOutputs.size()) # seqLen x nValSamples x outputDim
            valOutputs=valOutputs.contiguous()
            valOutputsSize=autograd.Variable(torch.IntTensor([valOutputs.size(0)] * len(valWords)))
            valCost=criterion(valOutputs, valLabelSeqs, valOutputsSize, valLabelSeqlens) / len(valWords)
            print ('validaton Cost is',valCost.data[0])


            ### get the actual predictions and print them ################
            valOutputs=valOutputs.transpose(0,1)
            # the second output of max() is the argmax along the required dimension
            _, argMaxActivations= valOutputs.max(2)
            # in the tensor below, each row is the sequence of labels predicted for one sample in the batch
            predictedSeqLabels=argMaxActivations.squeeze(2) #batchSize * seqLen
            predictedRawStrings,predictedStrings=Labels2Str(predictedSeqLabels)
            for ii in range(len(valWords)):

                print (predictedRawStrings[ii]+"==>"+predictedStrings[ii])

            #   print (predictedSeqLabels[0,:].transpose(0,0))
            #print(valOutputs_batchFirst[0,0,:])
            print('Time since we began trainiing [%s]' % (time_since(start)))

        optimizer.zero_grad()
        trainCost.backward()
        optimizer.step()
    #iterString=str(iter)
    #torch.save(model.state_dict(), iterString+'.pth')


validaton Cost is 317.145751953
y~~~~~~~~~~f~4y~f~y~f~~4~yf~y~f~~~y~~~~~~~~~~~~~~~~f~~y~~~~~~~~f~~y~~~f~~y~f~~yf~~y~~~~~~~f~~y~~~~4~==>yf4yfyf4yfyfyfyfyfyfyfyfy4
y~~~~~~~~~~~~~~~~~~~~~~4y~~~~~~~~~~~~~~~~~~~~fy~~~~~4y~~~~~~~~~~~~~~~~~~4~y~~~~~~~~~~~~4~y~~~~~~4~~~==>y4yfy4y4y4y4
y~~~~~~~~~~4~~y~~~~~~~f4~~y~~~4~~~y~~~~~~4~~y~~~~~~~~~4y~~~~~~~~~~~~4~~~y~~~~~~~~~f4~~~~~~~~~~y~~4~~==>y4yf4y4y4y4y4yf4y4
yf4~~~~y~~4~y~~~~~4y~~~~4y~~~4~y~4~~~~~~~~~~~~~~~~~~~~~~~~~y4~~~~y~~4~~~~~~~~~y~~~4y~~~4~y~4~~~~~~~~==>yf4y4y4y4y4y4y4y4y4y4y4
y~4f~~~~y~~~f~~~~4~~~f~4~~~~~~~f~~4y~4~f4y~f~~~yf~~~4~~~~~~~~~~~~~~~~~~~~~~~~f4~~~~f4~~~f~~y~~f~~4~~==>y4fyf4f4f4y4f4yfyf4f4f4fyf4
Time since we began trainiing [0m 0s]
validaton Cost is 36.8741111755
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~==>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~==>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

KeyboardInterrupt: 

### Exercise 4 ###
1. When the train cost is computed, what are the dimensions of the 4 arguments supplied to the loss function?

### Loading a pretrained model and testing the validation data on it ###
In case your network takes a lot of time to converge, we have a pretrained model for you. <br>


In [26]:
#  load a saved model and test our test/validation data on it #
use_cuda=True #since the saved model is a gpu model
model = torch.load("../../../data/lab2/ocr_valE5_blstm.pt")
criterion = CTCLoss()
if use_cuda:
    model=model.cuda()
    criterion=criterion.cuda()
    

#optimizer=optim.Adam(model.parameters(), lr=0.001)
#model.load_state_dict(torch.load("../../../data/lab2/ocrmodel_iter_40.pt


valWords=['9446567456','hyderabad','golconda','charminar','gachibowli']
valImages, valLabelSeqs, valLabelSeqlens=GetBatch(valWords)
valImages=autograd.Variable(valImages)
valImages=valImages.contiguous()
if use_cuda:
    valImages=valImages.cuda()
valLabelSeqs=autograd.Variable(valLabelSeqs)
#print(valLabelSeqs.data)
valLabelSeqlens=autograd.Variable(valLabelSeqlens)




valOutputs=model(valImages)
valOutputs=valOutputs.contiguous()
valOutputsSize=autograd.Variable(torch.IntTensor([valOutputs.size(0)] * len(valWords)))
valCost=criterion(valOutputs, valLabelSeqs, valOutputsSize, valLabelSeqlens) / len(valWords)
print ('validaton Cost is',valCost.data[0])


# valOutputs is of size TxBxoutputDim; we make it BxTxoutputDim
valOutputs_batchFirst=valOutputs.transpose(0,1)
# the second output of max() is the argmax along the required dimension
_, argMaxActivations= valOutputs_batchFirst.max(2)
# in the tensor below, each row is the sequence of labels predicted for one sample in the batch
predictedSeqLabels=argMaxActivations.squeeze(2) #batchSize * seqLen 
predictedRawStrings,predictedStrings=Labels2Str(predictedSeqLabels)
#print the predicted raw string and the decoded string for the valimages
for ii in range(len(valWords)):

    print (predictedRawStrings[ii]+"==>"+predictedStrings[ii])

validaton Cost is 11.0239067078
w~~~~~~~~~~d~~~~~~~~~d~~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~f~~~~~~~d~~~~~~~~~~~~~~~~~~~~~~~~~~~~~==>wdd6fd
~h~~~~~~~~~~~~~~~~y~~~~~d~~~~~~~~~~e~~~~~~~~~~r~~~~~~~o~~~~~~~~~~b~~~~~~~~~~~o~~~~~~~~~~d~~~~~~~~~~~==>hyderobod
g~~~~~~~~~~~o~~~~~~~~~~~~~l~~~~~e~~~~~~~~~~~o~~~~~~~~~~~~~n~~~~~~~~~~~~~t~~~~~~~~l~~~~~e~~~~~~~~~~~~==>goleontle
c~~~~~~~~~h~~~~~~~~~~~~a~~~~~~~~~~~r~~~~~~~m~~~~~~~~~~~~~~~~~~i~~~~~n~~~~~~~~~~~a~~~~~~~~~~~r~~~~~~~==>charminar
g~~~~~~~~~~a~~~~~~~~~~~c~~~~~~~~~h~~~~~~~~~~~i~~~~~b~~~~~~~~~~~o~~~~~~~~w~~~~~~~~~~~~~~~l~~~~~i~~~~~==>gachibowli
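
The notebook prints the predictions but does not score them. As a minimal sketch, the decoded strings from the output above can be compared with the validation words to get a word error rate and a per-word edit distance. The `edit_distance` helper is written here for illustration and is not part of the notebook.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

groundTruth = ['9446567456', 'hyderabad', 'golconda', 'charminar', 'gachibowli']
# Decoded strings copied from the pretrained model's output above.
predicted = ['wdd6fd', 'hyderobod', 'goleontle', 'charminar', 'gachibowli']

wordErrors = sum(gt != pred for gt, pred in zip(groundTruth, predicted))
print('word error rate:', wordErrors / len(groundTruth))  # 0.6
for gt, pred in zip(groundTruth, predicted):
    print(gt, pred, edit_distance(gt, pred))
```

Three of the five words are wrong, although 'hyderobod' is only two substitutions away from 'hyderabad'.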


### Exercise 5 ####
1. For the image of 'hyderabad', plot the output activations of all the classes for timestep=2 (class labels along the x axis and probabilities along the y axis) and find the probabilities for classes h and b
    - hint - apply a softmax to the activations if it is not already applied
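
The softmax mentioned in the hint can be sketched with NumPy as follows. The activation values are made up for illustration; in the exercise they would be the model's activations for one timestep of one image.

```python
import numpy as np

def softmax(activations):
    """Turn a vector of raw activations into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = activations - np.max(activations)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up activations for 4 classes at a single timestep.
activations = np.array([2.0, 1.0, 0.1, -1.0])
probabilities = softmax(activations)
print(probabilities, probabilities.sum())
```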
