# Chapter 4 - Word-Based Models

The chapter begins with a brief introduction to word-based (lexical) translation. It delves into the alignment sub-problem of SMT and why it is especially challenging for language pairs that do not share the same word order or sentence structure. It introduces the NULL trick, which links words in the translation that have no counterpart in the source sentence to a special NULL token.

## IBM Model 1

Translating whole sentences as units is not feasible. There is not nearly enough data to support every sentence pair, and such a model would never generalize to new sentences with a different structure. Thus, Model 1 focuses on translating through lexical translation distributions.

### The Problem:
Get the joint probability of the English sentence **e** and the alignment function *a* given the foreign sentence **f**. The model assumes that each output word *e* in sentence **e** is generated from a single input word *f* in sentence **f** (or from the NULL token). The alignment function *a* maps the position *j* of each output word in **e** to the position *a(j)* of the input word in **f** that generates it.

The complete generative model is defined as: 
$$ p(\textbf{e}, a | \textbf{f}) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e}t(e_j | f_{a(j)}) $$
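where *t()* is the translation probability of an English word given its **aligned** foreign word, $l_e$ and $l_f$ are the lengths of **e** and **f** (the $+1$ accounts for the NULL token), and $\epsilon$ is a normalization constant.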
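
To make the formula concrete, here is a minimal sketch that evaluates $p(\textbf{e}, a | \textbf{f})$ for one toy sentence pair. The translation table `t`, the sentence pair, the alignment and the helper name `model1_probability` are all hypothetical and chosen only for illustration. Position 0 of the foreign sentence is assumed to hold the NULL token.

```python
# Toy lexical translation table t(e | f). Every probability here is made up
# purely for illustration.
t = {
    ("the", "das"): 0.7,
    ("house", "Haus"): 0.8,
    ("is", "ist"): 0.8,
    ("small", "klein"): 0.4,
}


def model1_probability(e_words, f_words, alignment, t, epsilon=1.0):
    """Return p(e, a | f) under the IBM Model 1 formula.

    alignment[j] is the position in the foreign sentence that generates
    e_words[j]; position 0 is reserved for the NULL token.
    """
    f_with_null = ["NULL"] + list(f_words)
    l_e, l_f = len(e_words), len(f_words)
    prob = epsilon / (l_f + 1) ** l_e
    for j, e_word in enumerate(e_words):
        f_word = f_with_null[alignment[j]]
        # Unseen word pairs get probability 0 in this sketch.
        prob *= t.get((e_word, f_word), 0.0)
    return prob


f_sentence = ["das", "Haus", "ist", "klein"]
e_sentence = ["the", "house", "is", "small"]
alignment = [1, 2, 3, 4]  # monotone one-to-one alignment, NULL unused

p = model1_probability(e_sentence, f_sentence, alignment, t)
print(p)
```

With $\epsilon = 1$, the result is the product of the four table entries divided by $(4 + 1)^4$.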

### The ML Solution:

The model described above is useless without the alignment function. While it is relatively easy to find parallel sentence-aligned texts, word-aligned corpora are rarely available. The catch, however, is that the texts are in fact aligned in *some* way - we just need to learn this hidden alignment from the data.

***Enter EXPECTATION MAXIMIZATION***
1. Initialize model
2. Apply model params to get the likelihood of data (E)
3. Learn model params from data (M)
4. Repeat 2,3 till convergence
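
The four steps above can be sketched for Model 1 as follows. This is a minimal illustration, not a reference implementation: the function name `train_model1`, the toy corpus, and the use of a fixed number of iterations in place of a real convergence check are all assumptions.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t(e | f) with EM from a list of (e_words, f_words) pairs."""
    e_vocab = {e for e_words, _ in corpus for e in e_words}
    # Step 1: initialize t(e | f) uniformly.
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    # Step 4: a fixed iteration count stands in for a convergence test.
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e being linked to f
        total = defaultdict(float)  # expected count of f being linked at all
        # Step 2 (E): spread each English word over the candidate foreign
        # words in proportion to the current t(e | f).
        for e_words, f_words in corpus:
            f_with_null = ["NULL"] + list(f_words)
            for e in e_words:
                norm = sum(t[(e, f)] for f in f_with_null)
                for f in f_with_null:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # Step 3 (M): re-estimate t(e | f) from the expected counts.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_model1(corpus)
print(t[("book", "Buch")], t[("the", "Buch")])
```

Even on this tiny corpus, the co-occurrence pattern should be enough for `t[("book", "Buch")]` to end up larger than `t[("the", "Buch")]`, although no word alignment was ever given.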

### DIGRESSION: A simple EM example

Two labelled coins ($A$, $B$) with unknown biases $\theta = (\theta_A, \theta_B)$ are put in a bag. The experiment involves drawing a coin out at random and then recording the outcomes of 10 tosses, repeated over several draws. When the identity of each drawn coin is recorded, it is relatively easy to estimate $\theta$ using MLE - we simply maximize the log-likelihood function:

$$\log L(\theta; x, z) = \log p(x, z; \theta)$$
where $L$ is the likelihood function, $\theta$ the parameters, $x$ the number of heads and $z$ the coin identity.

But how would this problem be solved if ***z*** is hidden? Since the identity of the coin is hidden, we cannot simply count the number of heads per coin, as we do not know which series of tosses belongs to which coin!

However, this information can be inferred from the data. Since each coin will perform consistently across draws (performance here refers to the number of heads in 10 tosses), we can try to infer which tosses belong to A and which to B. What this boils down to is a chicken-and-egg problem: if we knew the true identity of the coins, we could accurately estimate the $\theta$ parameters; if we knew the true parameters, we could try to determine the identity of the draws. The EM algorithm is an iterative way of determining the maximum-likelihood estimates of model parameters with hidden data (latent variables).

By beginning with random values of $\theta$, we can determine the "most likely" coin identity of each draw by computing the likelihood of the tosses given the random $\theta$. Since we now know (or at the very least have some idea of) the latent coin identities ***z***, we can compute new estimates of $\theta$ using MLE.

We'll now solve the two-coin problem posed in the paper.
<img src="../images/nature_coin_data.png">

The image represents a best-case scenario - one where we know the identity of every draw (***z***). Armed with this info, one can easily estimate $\theta$ using MLE, as depicted on the right.
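
The complete-data case can be sketched in a few lines. The head counts below are the ones used in this notebook's data. The coin labels in `coins` are an assumption added for illustration, since the figure is not reproduced in text, and the helper name `mle_bias` is hypothetical.

```python
heads = [5, 9, 8, 4, 7]            # number of heads in each draw of 10 tosses
coins = ["B", "A", "A", "B", "A"]  # ASSUMED coin identity z of each draw
tosses_per_draw = 10


def mle_bias(heads, coins, coin, n=tosses_per_draw):
    """MLE of one coin's bias: its total heads divided by its total tosses."""
    coin_heads = [h for h, z in zip(heads, coins) if z == coin]
    return sum(coin_heads) / (n * len(coin_heads))


theta_A = mle_bias(heads, coins, "A")
theta_B = mle_bias(heads, coins, "B")
print(theta_A, theta_B)
```

With known labels, the estimate is just a ratio of counts per coin, which is exactly what becomes impossible once ***z*** is hidden.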

But if the identity is not known, then we are left with estimating the parameters from the data alone. Let's begin with initial estimates of $\theta_A = 0.3$ and $\theta_B = 0.4$. Knowing these, we can find the probability of each coin identity given the data and the parameters, i.e.

$$P(z \mid x, \theta) = 
\frac{P(x, z, \theta)}{P(x, \theta)} = 
\frac{P(x, z \mid \theta) \, P(\theta)}{P(x \mid \theta) \, P(\theta)} = 
\frac{P(x, z \mid \theta)}{\sum_{z'} P(x, z' \mid \theta)}$$ 

In the expression above, the numerator is the joint probability of **x, z** given $\theta$, e.g. the probability of seeing 5 heads in draw 1 with coin A, given that the current estimate of coin A's parameter is 0.3. The denominator is the normalizing constant, obtained by summing the numerator over both coin identities.
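
As a worked instance, take the first draw ($x = 5$ heads in 10 tosses) with estimates $\theta_A = 0.3$ and $\theta_B = 0.4$. This sketch assumes that both coins are equally likely to be drawn, so $P(z = A) = P(z = B) = \frac{1}{2}$. The binomial coefficient $\binom{10}{5}$ and the prior appear in every term and cancel:

$$P(z = A \mid x = 5, \theta) = 
\frac{\theta_A^5 (1 - \theta_A)^5}{\theta_A^5 (1 - \theta_A)^5 + \theta_B^5 (1 - \theta_B)^5} = 
\frac{0.21^5}{0.21^5 + 0.24^5} \approx 0.339$$

Under the same assumption, $P(z = B \mid x = 5, \theta) \approx 0.661$, so with these estimates the first draw is more likely to have come from coin B.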

In [1]:
import numpy as np
data = [5, 9, 8, 4, 7]  # random variable x: number of heads in each draw of 10 tosses
theta_A, theta_B = 0.3, 0.4