# Text segmentation using Hidden Markov Models

# Q1

Since every email begins with a header, the chain always starts in state 1 (header): the probability of starting in state 1 is $1$, and the probability of starting in state 2 (body) is $0$.

Therefore, the initial probability vector $\pi$ is:
\begin{equation*}
    \pi = 
    \begin{bmatrix}
        1\\
        0
    \end{bmatrix}
\end{equation*}


# Q2

The probability of moving from state 1 to state 2 is $A(1,2)=0.000781921964187974$.

The probability of staying in state 2 is $A(2,2)=1$.

The probability of remaining in state 2 is far higher than the probability of moving from state 1 to state 2. This matches the structure of an email: once the header ends and the body begins, the text never returns to the header, so state 2 is absorbing and $A(2,2)=1$. The transition from the header to the body, by contrast, has a very low probability at each step. The header is left exactly once per email, but only after many observations have been emitted in state 1. With $A(1,2)=0.000781921964187974$, the time spent in the header follows a geometric distribution with mean $1/A(1,2) \approx 1279$ observations, which reflects a long header section.
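
The reasoning above can be checked numerically. The sketch below rebuilds the transition matrix from the values given in Q2 (the first row is completed with $A(1,1) = 1 - A(1,2)$), verifies that each row sums to 1, and computes the mean header length under the geometric dwell-time model. The horizon `n = 1000` is an arbitrary illustration, not a value taken from the data.

```python
import numpy as np

# Transition matrix from Q2 (row i = current state, column j = next state).
A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Each row is a probability distribution over the next state.
row_sums = A.sum(axis=1)

# The time spent in the header is geometric with parameter A(1,2),
# so its mean is 1 / A(1,2) observations.
expected_header_length = 1.0 / A[0, 1]

# Probability that the chain is still in the header after n transitions
# (n is an arbitrary illustrative horizon).
n = 1000
p_still_header = A[0, 0] ** n

print(row_sums)
print(round(expected_header_length))
print(round(p_still_header, 3))
```

The mean comes out at roughly 1279 observations, consistent with a header that is long compared to a single step of the chain.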

# Q3

$B \in \mathbb{R}^{2\times N}$
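
To make the shape concrete, here is a minimal sketch with a toy emission matrix for 2 states and $N = 4$ symbols. The values are hypothetical and chosen only so that each row sums to 1. The later cells of this notebook index the loaded matrix as `emission_prob[obs[t], :]`, which suggests the file stores it as symbol by state, i.e. the transpose of $B$; that reading is an assumption.

```python
import numpy as np

# Toy emission matrix B with 2 states and N = 4 symbols (hypothetical values).
B = np.array([[0.7, 0.1, 0.1, 0.1],   # state 1 (header)
              [0.1, 0.2, 0.3, 0.4]])  # state 2 (body)

# Each row must be a probability distribution over the N symbols.
rows_ok = np.allclose(B.sum(axis=1), 1.0)

# Indexing by [symbol, state] corresponds to the transpose of B.
B_by_symbol = B.T
symbol = 2
probs_for_symbol = B_by_symbol[symbol, :]

print(rows_ok)
print(probs_for_symbol)  # emission probability of `symbol` in each state
```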

In [132]:
import os
import glob
import numpy as np

In [133]:
ROOT = os.path.abspath('.')

PERL_DIR = os.path.join(ROOT,'PerlScriptAndModel')
RES_DIR = os.path.join(ROOT,'res')

### Coding/Decoding Mails

In [134]:
DATA_DIR = os.path.join(ROOT,'dat')

# Iterate through files and load the text 
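def files_iter(data_dir, with_name=False):
    # Sort the paths so the mails are visited in a reproducible order
    files = sorted(glob.glob(os.path.join(data_dir, '*.dat')))
    for f in files:
        # Load the mail as a sequence of integer observations
        data = np.loadtxt(f, dtype=int)
        if with_name:
            # Get the filename without directory or extension (works on any OS)
            name = os.path.splitext(os.path.basename(f))[0]
            # Return filename and associated data
            yield name, data
        else:
            yield data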

In [135]:
# And we get a generator that will allow us to iterate through the mails
mail_iter = files_iter(DATA_DIR, with_name=True)

### Distribution files

In [136]:
PERL_DIR = os.path.join(ROOT,'PerlScriptAndModel')

# Writing a function to get the probability data
def get_emission_prob(perl_dir):
    return np.loadtxt(os.path.join(perl_dir, 'P.text'), dtype=float)

In [137]:
# Inputs to the Viterbi function
trans = np.array([[0.999218078035812, 0.000781921964187974], [0, 1]])
emission_prob = get_emission_prob(PERL_DIR)
states = list(range(trans.shape[0]))
start_prob = np.array([1, 0])

### To implement:

In [138]:
# Viterbi function
def viterbi(obs, states, start_prob, trans, emission_prob):
    """
        Viterbi Algorithm Implementation

        Keyword arguments:
            - obs: sequence of observations (integer symbol indices)
            - states: list of states
            - start_prob: vector of the initial probabilities
            - trans: transition matrix, trans[i, j] = P(next state j | current state i)
            - emission_prob: emission probability matrix, indexed as [symbol, state]
        Returns:
            - path: most likely sequence of states (indices into `states`)
    """

    # Avoid underflow: use the logarithm!
    # Avoid 0 in logarithm: use a small constant!
    small = 1e-10

    start_prob = np.log(start_prob + small)
    trans = np.log(trans + small)
    emission_prob = np.log(emission_prob + small)

    T = len(obs)
    N = len(states)

    # Initialisation
    log_l = np.full((T, N), -np.inf)
    bcktr = np.zeros((T, N), 'int')

    # Viterbi

    # Forward loop:
    log_l[0, :] = start_prob + emission_prob[obs[0], :]
    for t in range(1, T):
        # scores[j, i]: best log-likelihood of being in state i at t - 1, then moving to state j
        scores = log_l[t - 1, :] + trans.T
        # The emission term depends only on the destination state j
        log_l[t, :] = np.max(scores, 1) + emission_prob[obs[t], :]
        bcktr[t, :] = np.argmax(scores, 1)
    # Backward loop
    path = np.zeros(T, 'int')
    path[-1] = np.argmax(log_l[-1, :])
    for t in range(T - 2, -1, -1):
        # bcktr[t + 1, j] is the best predecessor (at time t) of state j at time t + 1
        path[t] = bcktr[t + 1, path[t + 1]]

    return path
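
One way to gain confidence in a Viterbi implementation is to compare it with an exhaustive search on a model small enough to enumerate. The sketch below is not part of the assignment: `brute_force_path` and the `toy_*` arrays are hypothetical names and values, and the emission matrix is assumed to be indexed as `[symbol, state]`, as in the cell above.

```python
import itertools
import numpy as np

def brute_force_path(obs, start_prob, trans, emission_prob):
    """Return the most likely state path by scoring every possible path."""
    n_states = len(start_prob)
    best_path, best_score = None, -np.inf
    for path in itertools.product(range(n_states), repeat=len(obs)):
        # Joint probability of this state path and the observations
        score = start_prob[path[0]] * emission_prob[obs[0], path[0]]
        for t in range(1, len(obs)):
            score *= trans[path[t - 1], path[t]] * emission_prob[obs[t], path[t]]
        if score > best_score:
            best_path, best_score = path, score
    return np.array(best_path)

# Hypothetical toy model: 2 states, 2 symbols, emission indexed [symbol, state]
toy_start = np.array([1.0, 0.0])
toy_trans = np.array([[0.9, 0.1],
                      [0.0, 1.0]])
toy_emission = np.array([[0.8, 0.2],   # symbol 0
                         [0.2, 0.8]])  # symbol 1
toy_obs = np.array([0, 0, 1, 1])

toy_best = brute_force_path(toy_obs, toy_start, toy_trans, toy_emission)
print(toy_best)  # [0 0 1 1]
```

A correct Viterbi implementation called on the same toy inputs, with `states = [0, 1]`, should return the same path. The exhaustive search costs $N^T$ evaluations, so it is only usable for very short sequences.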

In [139]:
RES_DIR = os.path.join(ROOT,'res')

# Creating a directory to put the result of the viterbi function
if not os.path.exists(RES_DIR):
    os.mkdir(RES_DIR)
    
# Function that will write a viterbi path for a mail in a dedicated result file
def create_viterbi_path_file(mail_name, viterbi_path):
    with open('{}/{}_path.txt'.format(RES_DIR, mail_name), 'w') as f: 
        f.write(''.join([str(c) for c in viterbi_path]))   

In [140]:
# Using our generator, we get the mail names and data
for name_file, data in mail_iter:
    # Find out the viterbi path using viterbi
    viterbi_path = viterbi(data, states, start_prob, trans, emission_prob)
    # print(name_file, viterbi_path)
    # Put it in the result file
    create_viterbi_path_file(name_file, viterbi_path)

mail1 [1 1 1 ... 2 2 2]
mail10 [1 1 1 ... 2 2 2]
mail11 [1 1 1 ... 2 2 2]
mail12 [1 1 1 ... 2 2 2]
mail13 [1 1 1 ... 2 2 2]
mail14 [1 1 1 ... 2 2 2]
mail15 [1 1 1 ... 2 2 2]
mail16 [1 1 1 ... 2 2 2]
mail17 [1 1 1 ... 2 2 2]
mail18 [1 1 1 ... 2 2 2]
mail19 [1 1 1 ... 2 2 2]
mail2 [1 1 1 ... 2 2 2]
mail20 [1 1 1 ... 2 2 2]
mail21 [1 1 1 ... 2 2 2]
mail22 [1 1 1 ... 2 2 2]
mail23 [1 1 1 ... 2 2 2]
mail24 [1 1 1 ... 2 2 2]
mail25 [1 1 1 ... 2 2 2]
mail26 [1 1 1 ... 2 2 2]
mail27 [1 1 1 ... 2 2 2]
mail28 [1 1 1 ... 2 2 2]
mail29 [1 1 1 ... 2 2 2]
mail3 [1 1 1 ... 2 2 2]
mail30 [1 1 1 ... 2 2 2]
mail4 [1 1 1 ... 2 2 2]
mail5 [1 1 1 ... 2 2 2]
mail6 [1 1 1 ... 2 2 2]
mail7 [1 1 1 ... 2 2 2]
mail8 [1 1 1 ... 2 2 2]
mail9 [1 1 1 ... 2 2 2]
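
Since every printed path is a run of header labels followed by a run of body labels, the segmentation reduces to a single cut position. The sketch below shows one way to locate it; `header_length` is a hypothetical helper, and the label `2` for the body is an assumption taken from the printed output above.

```python
import numpy as np

def header_length(path, body_label=2):
    """Return the number of observations before the first body label.

    Returns len(path) if the path never reaches the body state.
    """
    path = np.asarray(path)
    body_positions = np.flatnonzero(path == body_label)
    return int(body_positions[0]) if body_positions.size else len(path)

# Hypothetical path shaped like the printed ones: header (1), then body (2)
toy_path = np.array([1, 1, 1, 2, 2])
print(header_length(toy_path))          # 3
print(header_length(np.array([1, 1])))  # 2
```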


### Visualizing segmentation

In [141]:
import visualize_segmentation
# Writing a function to go into the directory and execute the perl script "segment.pl" on the mail in the given path
def exec_perl_script(mail, path):
    res = !cd {PERL_DIR}; perl segment.pl {mail} {path}
    return res

# Writing a function getting the original mail, the result of viterbi, and applying the segmentation script
# Then putting the result
def segment_mail(mail_name, data_dir, output_dir):
    # Get the full path of the mail
    mail = os.path.join(data_dir, '{}.txt'.format(mail_name))
    # Get the full path of the result
    path = os.path.join(output_dir, '{}_path.txt'.format(mail_name))
    # Execute the visualization script
    formatted_mail = visualize_segmentation(mail, path, segmented_filename)

    # Get the results
    formatted_mail_text = ...
    # Go through the resulting text until the cutting line
    ...
    # If this was not the last line, return the text cut into two parts: header and body
    ...
    # If not, it's just a header
    ...

In [142]:
# Getting mails names
...
# Call the function and look at the result of segmentation
...