<a href="https://colab.research.google.com/github/matthiasweidlich/promi_course/blob/master/case_labelling/case_labelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple notebook to run the individual steps of the EM-based approach to case labelling

## Initialisation

First, we define helper functions that add dedicated symbols for the start and end of a trace. A further function prints the traces that result from splitting an input sequence according to a given sequence of case identifiers. This is useful for inspecting how the current state of the EM-based approach to case labelling partitions the input sequence.

In [0]:
CONST_START = "<"
CONST_END = ">"


def add_start_end(seq: list) -> list:
    """ Add dedicated symbols for start and end to the sequence """
    return [CONST_START] + seq + [CONST_END]

def print_cases(input_sequence: list, case_sequence: list):
    """ Extracts the actual traces from the input sequence, based on the given case sequence """
    cases = {}
    for idx, val in enumerate(input_sequence):
        if val == CONST_START or val == CONST_END:
            continue
        cases.setdefault(case_sequence[idx], []).append(val)
    for case in cases.values():
        print(case)
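
To make the relation between the input sequence and the case sequence concrete, here is a minimal sketch on a toy sequence. The function `split_cases` is a hypothetical variant of `print_cases` that returns the traces instead of printing them; the toy sequence and its labels are made up for illustration and are not the example from the exercise sheet.

```python
START, END = "<", ">"  # same marker symbols as CONST_START / CONST_END

def split_cases(input_sequence, case_sequence):
    """Hypothetical variant of print_cases: return the traces as a list of lists."""
    cases = {}
    for symbol, case_id in zip(input_sequence, case_sequence):
        if symbol in (START, END):
            continue  # the markers do not belong to any case
        cases.setdefault(case_id, []).append(symbol)
    return list(cases.values())

# toy input: five events framed by the start and end markers
toy_sequence = [START] + ['A', 'B', 'A', 'C', 'B'] + [END]
# one case identifier per position; -1 is used for the two markers
toy_labels = [-1, 0, 0, 1, 0, 1, -1]

traces = split_cases(toy_sequence, toy_labels)
print(traces)
```

Each position of the case sequence labels the symbol at the same position of the input sequence, so interleaved events are separated into one trace per case identifier.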

Second, helper functions to initialise the transition matrix.


In [0]:

def get_empty_matrix(symbols: set) -> dict:
    """ Create a nested dict that maps every ordered pair of symbols to 0 """
    matrix = {}
    # initialise matrix with 0 values for all pairs
    for first in symbols:
        matrix[first] = {}
        for second in symbols:
            matrix[first][second] = 0

    return matrix


def init_matrix(input_sequence: list) -> dict:
    """ Initialise transition matrix based on ratio of counts for directly-follows relation """
    symbols = set(input_sequence)
    matrix = get_empty_matrix(symbols)

    # count directly-follows relation entries
    for idx, val in enumerate(input_sequence):
        if idx == 0:
            continue
        first = input_sequence[idx-1]
        second = input_sequence[idx]
        matrix[first][second] += 1

    # normalise counts of directly-follows relation entries
    for first in symbols:
        count = 0.0
        for second in symbols:
            count += matrix[first][second]
        if count == 0.0:
            continue
        for second in symbols:
            matrix[first][second] = matrix[first][second] / count

    return matrix
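
The initialisation can be checked by hand on a small sequence. The sketch below is a compact, hypothetical alternative (`directly_follows_probabilities`) that computes the same ratios with `collections.Counter`. Unlike `init_matrix`, it only stores pairs that were actually observed, so it is meant as an illustration rather than a drop-in replacement.

```python
from collections import Counter, defaultdict

def directly_follows_probabilities(sequence):
    """Relative frequency of each directly-follows pair, per first symbol."""
    # count how often `second` directly follows `first`
    pair_counts = Counter(zip(sequence, sequence[1:]))

    # total number of successors observed for each first symbol
    row_totals = Counter()
    for (first, _), n in pair_counts.items():
        row_totals[first] += n

    # normalise the counts row by row
    probs = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        probs[first][second] = n / row_totals[first]
    return dict(probs)

toy_probs = directly_follows_probabilities(['<', 'A', 'B', 'A', 'C', '>'])
print(toy_probs['A'])
```

In the toy sequence, `A` is followed once by `B` and once by `C`, so both transitions receive probability 0.5.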

Let's define the specific example from the exercise sheet. 

In [0]:
# define the input sequence to which the case labels shall be assigned
input_sequence = ['G','A','G','A','R','B','B','R','B','G','R','G','X','A','B','R']

In [0]:
# pre-processing step: assign dedicated labels to the start and end of the input sequence
input_sequence = add_start_end(input_sequence)

## Definition of the EM-approach

The following code snippet defines the function for the expectation step. It also provides a skeleton for the maximisation step.

**Task:** Complete the code snippet for the maximisation step.

In [0]:
def expectation_step(input_sequence: list, matrix: dict) -> list:
    """ Expectation step: estimate the case sequence based on the input sequence and the transition matrix """
    case_sequence = [-1]

    next_case_id = 0
    last_symbols_open_cases = {}

    for idx, symbol in enumerate(input_sequence):
        if symbol == CONST_START or symbol == CONST_END:
            continue

        max_prob = 0
        max_prob_case_id = -1
        for case_id, last_symbol in last_symbols_open_cases.items():
            if matrix[last_symbol][symbol] > max_prob:
                max_prob = matrix[last_symbol][symbol]
                max_prob_case_id = case_id

        if max_prob_case_id == -1 or matrix[CONST_START][symbol] == max({matrix[s][symbol] for s in matrix.keys()}):
            case_sequence.append(next_case_id)
            last_symbols_open_cases[next_case_id] = symbol
            next_case_id += 1
        else:
            case_sequence.append(max_prob_case_id)
            last_symbols_open_cases[max_prob_case_id] = symbol
            # close the case if the end symbol is the most likely successor of the current symbol
            if matrix[symbol][CONST_END] == max(matrix[symbol].values()):
                last_symbols_open_cases.pop(max_prob_case_id, None)

    case_sequence.append(-1)

    return case_sequence

def maximisation_step(input_sequence: list, case_sequence: list) -> dict:
    """ Maximisation step: estimate the transition matrix based on the input sequence and the case sequence """

    symbols = set(input_sequence)
    matrix = get_empty_matrix(symbols)
    
    ##########################
    # Your code
    ##########################

    return matrix
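
As a hint for the task, the following is a minimal sketch of one possible way to re-estimate the transition matrix. It is not necessarily the intended solution. It assumes that the input sequence has already been framed with the start and end symbols, that the markers carry a placeholder label such as -1, and that transitions are counted within each case, with every trace framed by start and end. The name `maximisation_sketch` and its `start`/`end` parameters are hypothetical.

```python
def maximisation_sketch(input_sequence, case_sequence, start="<", end=">"):
    """Sketch: estimate transition probabilities from per-case directly-follows counts."""
    symbols = set(input_sequence)
    matrix = {a: {b: 0.0 for b in symbols} for a in symbols}

    # group the symbols into traces, skipping the start and end markers
    traces = {}
    for symbol, case_id in zip(input_sequence, case_sequence):
        if symbol in (start, end):
            continue
        traces.setdefault(case_id, []).append(symbol)

    # count directly-follows pairs within each trace, framed by start and end
    for trace in traces.values():
        framed = [start] + trace + [end]
        for first, second in zip(framed, framed[1:]):
            matrix[first][second] += 1

    # normalise each row; rows without any successor stay at zero
    for row in matrix.values():
        total = sum(row.values())
        if total > 0:
            for second in row:
                row[second] /= total
    return matrix

toy_sequence = ['<', 'A', 'B', 'A', 'C', 'B', '>']
toy_labels = [-1, 0, 0, 1, 0, 1, -1]   # cases: [A, B, C] and [A, B]
toy_matrix = maximisation_sketch(toy_sequence, toy_labels)
print(toy_matrix['B'])
```

In the toy example, `B` is followed by `C` in one case and ends the other case, so the row for `B` splits its probability evenly between `C` and the end symbol.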


## Run the approach for the example

Run the EM-approach to case labelling for the given input sequence.

In [0]:
# initialise the transition matrix based on the input sequence
matrix = init_matrix(input_sequence)

In [0]:
# print the initial transition matrix
matrix

In [0]:
# one iteration of the EM-approach
case_sequence = expectation_step(input_sequence, matrix)
matrix = maximisation_step(input_sequence, case_sequence)
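
A single iteration can be repeated until the labelling stabilises. The sketch below shows one way to organise such a loop. The function `run_em`, the stopping criterion (the case sequence no longer changes) and the iteration cap are assumptions for illustration, and the two toy step functions only stand in for `expectation_step` and `maximisation_step` to show the control flow.

```python
def run_em(input_sequence, matrix, e_step, m_step, max_iterations=10):
    """Alternate the two steps until the case sequence stops changing."""
    case_sequence = None
    for _ in range(max_iterations):
        new_case_sequence = e_step(input_sequence, matrix)
        if new_case_sequence == case_sequence:
            break  # assumed stopping criterion: the labelling is stable
        case_sequence = new_case_sequence
        matrix = m_step(input_sequence, case_sequence)
    return case_sequence, matrix

# toy stand-ins that record the order in which the steps are called
calls = []

def toy_e_step(seq, matrix):
    calls.append("E")
    return [0] * len(seq)

def toy_m_step(seq, cases):
    calls.append("M")
    return {}

toy_labels, toy_matrix = run_em(['A', 'B'], {}, toy_e_step, toy_m_step)
print(calls)
```

With the notebook's own functions, the call would read `run_em(input_sequence, matrix, expectation_step, maximisation_step)`, provided the maximisation step has been completed.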