# Representation

In [1]:
import pandas as pd
import numpy as np

### OCR Factor $\psi_o$

These factors capture the predictions of a character-based OCR system, so one exists between every image variable and its corresponding character variable. A word $w$ therefore has $len(w)$ of these factors. The value of the factor between the image variable and the character variable at position $i$ depends on $img(i)$ and $char(i)$:

$$\psi_o (img(i)=id,char(i)=a) = prob$$

In [2]:
ocr_table_factor = pd.read_table("dataset/ocr.dat", names=["id", "character", "p"], index_col=["id", "character"])
ocr_table_factor

id,character,p
0,d,0.153411
0,o,0.088148
0,i,0.011325
0,r,0.074593
0,a,0.096568
0,h,0.130159
0,t,0.131382
0,n,0.012166
0,s,0.073045
0,e,0.229204


In [3]:
def flat(X):
    return [item for sublist in X for item in sublist]   

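def ocr_factor(X):
    """OCR factor.
    
    :type X: list of list
    :param list X: list of assignments for each sentence
    :return: product of the OCR probabilities of all assignments
    """
    X = flat(X)
    # Summing logs and exponentiating is equivalent to multiplying the probabilities.
    # .loc with an explicit (id, character) tuple replaces the deprecated .ix indexer.
    return np.exp(np.sum(np.log([ocr_table_factor.loc[(item["img"], item["char"]), "p"] for item in X])))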
print ocr_factor([[{"img": 1, "char": "e"}, {"img": 2, "char": "d"}]])
print ocr_factor([[{"img": 1, "char": "e"}, {"img": 2, "char": "d"}]])

0.031205540136
0.031205540136


### Transition Factors $\psi_t$

Since we also want to represent the co-occurrence frequencies of characters in our model, we add these factors between all consecutive character variables. A word $w$ has $len(w)-1$ of these factors. The value of the factor between the character variables at positions $i$ and $i+1$ depends on $char(i)$ and $char(i+1)$, and is high if $char(i)$ is frequently followed by $char(i+1)$ in English words:

$$\psi_t(char(i)=a, char(i+1)=b) = prob$$

In [4]:
transition_table_factor = pd.read_table("dataset/trans.dat", names=["current", "next", "p"], index_col=["current", "next"])
transition_table_factor

current,next,p
n,s,0.356108
t,r,0.405734
t,s,0.286997
s,o,0.391206
i,e,0.386189
s,r,0.187065
r,a,0.458026
n,d,0.401417
e,t,0.425656
o,n,0.452152


In [5]:
def trans_factor(X):
    """Transition factor over all sentences.
    
    :type X: list of list
    :param list X: list of assignments for each sentence
    
    :return: product of the per-sentence transition factors
    """
    p = [trans_factor_sentence(s) for s in X]
    return np.exp(np.sum(np.log(p)))
        
def trans_factor_sentence(X):
    """Transition factor of a single sentence.
    
    :type X: list of dict
    :param list X: assignments of one sentence, in reading order
    
    :return: product of the transition probabilities of consecutive characters
    """
    # .loc with an explicit (current, next) tuple replaces the deprecated .ix indexer.
    return np.exp(np.sum(np.log([transition_table_factor.loc[(X[i]["char"], X[i+1]["char"]), "p"] for i in range(0, len(X) - 1)])))

print trans_factor_sentence([{"img": 1, "char": "e"}, {"img": 2, "char": "d"}, {"img": 1, "char": "e"}, {"img": 2, "char": "d"}])
print trans_factor([[{"img": 1, "char": "e"}, {"img": 2, "char": "d"}], [{"img": 1, "char": "e"}, {"img": 2, "char": "d"}]])

0.0906615863474
0.190888600464


### Skip Factors $\psi_s$

We would also like our model to capture that similar images in a word always represent the same character. Thus the model score should be higher if it predicts the same character for similar images. These factors exist between every pair of image variables that have the same id, i.e. one exists between all $i$, $j$ with $i \ne j$ such that $img(i)==img(j)$. The value of this factor depends on $char(i)$ and $char(j)$, and is $5.0$ if $char(i)==char(j)$, and $1.0$ otherwise.
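
Written in the same notation as $\psi_o$ and $\psi_t$, the rule described above can be stated as follows, for positions $i \ne j$ with $img(i)==img(j)$:

$$\psi_s(char(i)=a, char(j)=b) = \begin{cases} 5.0 & \text{if } a = b \\ 1.0 & \text{otherwise} \end{cases}$$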

In [6]:
import itertools
def skip_factor(X):
    """Skip factor.
    
    :type X: list of list
    :param list X: list of assignments for each sentence
    
    :return: 5 raised to the number of pairs that share both image id and character
    """
    X = flat(X)
    result = 1
    pairs = itertools.combinations(X, 2)
    for i, j in pairs:
        if i['img'] == j['img'] and i["char"] == j["char"]:
            result = result * 5
    return result
print skip_factor([[{"img": 1, "char": "e"}, {"img": 1, "char": "d"}]]) 
print skip_factor([[{"img": 1, "char": "e"}, {"img": 1, "char": "e"}]]) 
print skip_factor([[{"img": 1, "char": "e"}, {"img": 1, "char": "e"}], [{"img": 2, "char": "d"}, {"img": 2, "char": "e"}]]) 
print skip_factor([[{"img": 1, "char": "e"}, {"img": 1, "char": "e"}], [{"img": 2, "char": "d"}, {"img": 2, "char": "d"}]]) 

1
5
5
25


In [7]:
class Model:
    def __init__(self, factors=None):
        # Factor functions; each maps an assignment X to a non-negative value.
        self.factors = factors
        
    def score(self, X):
        # Unnormalized score of X: the product of all factor values.
        p = [factor(X) for factor in self.factors]
        return np.exp(np.sum(np.log(p)))
    
    @staticmethod
    def make_assignment(list_char, word_image):
        # Pair each character with its image id, wrapped as a one-word assignment.
        return [[{"char": a[0], "img": a[1]} for a in zip(list_char, word_image)]]
    
    def predict(self, word_image):
        # Exhaustive search: score all 10**len(word_image) assignments, keep the best.
        chars = "etaoinshrd"
        assignments = list(itertools.product(chars, repeat=len(word_image)))
        assignments = [self.make_assignment(list_char, word_image) for list_char in assignments]
        scores = [(assignment, self.score(assignment)) for assignment in assignments]
        best = max(scores, key=lambda score: score[1])
        return best

In [8]:
model_1 = Model(factors=[ocr_factor])
model_2 = Model(factors=[ocr_factor, trans_factor])
model_3 = Model(factors=[ocr_factor, trans_factor, skip_factor])
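
For a single word $w$, the score computed by `model_3` can be read as the product of the three factor types defined above. This is a restatement of the code rather than a new definition, and the score is unnormalized:

$$score(w) = \prod_{i=1}^{len(w)} \psi_o(img(i), char(i)) \cdot \prod_{i=1}^{len(w)-1} \psi_t(char(i), char(i+1)) \cdot \prod_{i<j,\ img(i)=img(j)} \psi_s(char(i), char(j))$$

`model_1` keeps only the first product, and `model_2` keeps the first two.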

In [9]:
print model_1.score([[{'char': 'd', 'img': 582}, {'char': 'a', 'img': 969}, {'char': 'd', 'img': 582}]])
print model_1.score([[{'char': 'd', 'img': 582}, {'char': 'd', 'img': 969}, {'char': 'd', 'img': 582}]])
print model_1.score([[{'char': 'a', 'img': 582}, {'char': 'a', 'img': 969}, {'char': 'd', 'img': 582}]])
print model_1.score([[{'char': 'a', 'img': 582}, {'char': 'a', 'img': 969}, {'char': 'a', 'img': 582}]])

0.00534435069649
0.0050133738879
0.00528119593478
0.00521878747962


In [10]:
X1 = [[{"img": 582, "char": "e"}, {"img": 969, "char": "d"}], [{"img": 582, "char": "a"}, {"img": 969, "char": "d"}]]
X2 = [[{"img": 582, "char": "a"}, {"img": 969, "char": "d"}], [{"img": 582, "char": "a"}, {"img": 969, "char": "e"}]]
X3 = [[{"img": 582, "char": "a"}, {"img": 969, "char": "d"}], [{"img": 582, "char": "a"}, {"img": 969, "char": "d"}]]

In [11]:
print model_1.score(X1)
print model_2.score(X1)
print model_3.score(X1)

0.000181006539499
3.35512079554e-05
0.000167756039777


In [12]:
print model_1.score(X2)
print model_2.score(X2)
print model_3.score(X2)

0.000701101885302
9.36406864829e-05
0.000468203432415


In [13]:
print model_1.score(X3)
print model_2.score(X3)
print model_3.score(X3)

0.000696681170396
0.00012539547631
0.00313488690776


# Inference

Using the graphical model, write code to perform exhaustive inference, i.e. code that can calculate the probability of any assignment of the character and image variables. To calculate the normalization constant $Z$ for a word $w$, you need to go through all possible assignments to the character variables (there are $10^{len(w)}$ of these).
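
The normalization step described above can be sketched as follows. This is a minimal, self-contained illustration and not the notebook's implementation: `toy_score`, `partition` and `probability` are hypothetical names, and the toy OCR values (0.5 and 0.05) are made up for the example rather than read from `dataset/ocr.dat`. Only the ten-character alphabet and the skip-factor value of 5.0 come from the text.

```python
import itertools

CHARS = "etaoinshrd"  # the ten-character alphabet used in this notebook


def toy_score(chars, word_image):
    """Unnormalized score of one assignment, using made-up factor values."""
    score = 1.0
    for c, img in zip(chars, word_image):
        # Toy OCR factor (assumption): favour the character at index img % 10.
        score *= 0.5 if c == CHARS[img % 10] else 0.05
    pairs = itertools.combinations(list(zip(chars, word_image)), 2)
    for (c1, img1), (c2, img2) in pairs:
        # Skip factor: 5.0 when two identical images get the same character.
        if img1 == img2 and c1 == c2:
            score *= 5.0
    return score


def partition(word_image, score_fn=toy_score):
    """Z: sum of the scores of all 10**len(word_image) assignments."""
    return sum(score_fn(chars, word_image)
               for chars in itertools.product(CHARS, repeat=len(word_image)))


def probability(chars, word_image, score_fn=toy_score):
    """Normalized probability of one assignment: score / Z."""
    return score_fn(chars, word_image) / partition(word_image, score_fn)
```

Calling `probability(("a", "d", "a"), [582, 969, 582])` enumerates $10^3$ assignments to obtain $Z$. Because the number of assignments grows as $10^{len(w)}$, this approach is only practical for short words.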

In [14]:
X = [[int(t) for t in s.split()] for s in open("dataset/data.dat", "r").readlines()]
X[:5]

[[582, 969, 582, 969],
 [322, 959, 959, 880, 555, 959],
 [92, 809, 794, 92, 231, 990],
 [942, 979, 793],
 [942, 567, 942]]

In [15]:
word_image = X[0]
%timeit -r 1 model_1.predict(word_image)
%timeit -r 1 model_2.predict(word_image)
%timeit -r 1 model_3.predict(word_image)

1 loop, best of 1: 6.59 s per loop
1 loop, best of 1: 11.4 s per loop
1 loop, best of 1: 11.8 s per loop


In [16]:
word_image = X[1]
%timeit -r 1 model_1.predict(word_image)
%timeit -r 1 model_2.predict(word_image)
%timeit -r 1 model_3.predict(word_image)

1 loop, best of 1: 16min 45s per loop
1 loop, best of 1: 30min 3s per loop
1 loop, best of 1: 29min 33s per loop


In [20]:
word_image = X[3]
print model_1.predict(word_image)
print model_2.predict(word_image)
print model_3.predict(word_image)

([[{'char': 'a', 'img': 942}, {'char': 'd', 'img': 979}, {'char': 's', 'img': 793}]], 0.014222639030861513)
([[{'char': 'a', 'img': 942}, {'char': 'd', 'img': 979}, {'char': 'o', 'img': 793}]], 0.0023411548440003252)
([[{'char': 'a', 'img': 942}, {'char': 'd', 'img': 979}, {'char': 'o', 'img': 793}]], 0.0023411548440003252)


In [5]:
import time
%timeit time.sleep(5)

1 loop, best of 3: 5 s per loop
