# Predicting future states with Markov Chains
We have a set of sequences followed by users in the system. We need to implement a Markov Chain representation that we can use to predict future states. We also want this representation to be of order $>1$.  
Let's start by loading the data and having a look at it.

In [1]:
# load all dependencies
from support import *
# load the data
data = pd.read_csv("../data/paths_final_cass.csv", sep=";")[["element", "exit_codes", "type"]]

# let's peek into the dataframe...
data.head()

Unnamed: 0,element,exit_codes,type
0,6114|421|2814|1478|2040|3563|3622|1850|1032|20...,489|397|397|397|218|580|465|397|218|397|397|77...,0.0
1,6114|421|2814|1478|2040|3563|3622|1850|1032|20...,489|397|397|397|218|580|465|397|218|397|397|77...,2.0
2,6114|421|2814|1478|2040|3563|3622|1850|1032|20...,489|397|397|397|218|580|465|397|218|397|397|77...,2.0
3,6114|421|2814|1478|2040|3563|3622|1850|1032|20...,489|397|397|397|218|580|465|397|218|397|397|77...,2.0
4,6114|421|2814|1478|2040|3563|3622|1850|1032|20...,489|397|397|397|218|580|465|397|218|397|397|77...,2.0


### Data format
The data contain the sequence of items each user touched in a session (`element`) and the exit status for each item (`exit_codes`). `type` represents the goodness/badness of the sequence.  
We now need to determine the set of all possible states, so that we can build a transition matrix from the sequences in our dataset.  
Since the original labels may have gaps, we will remap them to consecutive integers: elements first, then exit codes.  

We also want to fit a second-order Markov Chain on the dataset, so that we can relate each (element, exit code) pair to the next (element, exit code) pair.
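
For reference, this is the textbook definition of a second-order chain, not something specific to this dataset: the next state depends only on the two states that precede it. The symbols $X_t$ below are introduced here for illustration only.

$$P(X_{t+1} \mid X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} \mid X_t, X_{t-1})$$

Equivalently, the pair $(X_{t-1}, X_t)$ can be treated as a single composite state of a first-order chain, which is why the number of states grows with the square of the number of single labels.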

In [2]:
# get ALL the elements (pipe separated)
all_codes = set(itertools.chain(*data['element'].apply(lambda x: str(x).split("|")).values))
print(f"{len(all_codes)} unique elements found in dataset")
# build a label map
codes_map = {item: n for n, item in enumerate(sorted(map(int, all_codes)))}
# map to the new labels and join them back into a pipe-separated string
data['new_elements'] = data['element'].apply(lambda x: "|".join(str(codes_map.get(int(i), 'N/A')) for i in str(x).split("|")))

# same thing on exit states
all_codes = set(itertools.chain(*data['exit_codes'].apply(lambda x: str(x).split("|")).values))
print(f"{len(all_codes)} unique exit_codes found in dataset")
_codes_map = {item: n for n, item in enumerate(sorted(map(int, all_codes)), start=len(codes_map))}
data['new_exits'] = data['exit_codes'].apply(lambda x: "|".join(str(_codes_map.get(int(i), 'N/A')) for i in str(x).split("|")))

print(f"\n{max(_codes_map.values()) + 1} possible single states")

288 unique elements found in dataset
97 unique exit_codes found in dataset

385 possible single states


In [3]:
data.drop(["element", "exit_codes"], axis=1, inplace=True)
data.head()

Unnamed: 0,type,new_elements,new_exits
0,0.0,287|25|91|49|82|108|112|72|42|79|50|1|12|256|4...,371|361|361|361|325|384|366|361|325|361|361|30...
1,2.0,287|25|91|49|82|108|112|72|42|79|50|1|12|256|4...,371|361|361|361|325|384|366|361|325|361|361|30...
2,2.0,287|25|91|49|82|108|112|72|42|79|50|1|12|256|4...,371|361|361|361|325|384|366|361|325|361|361|30...
3,2.0,287|25|91|49|82|108|112|72|42|79|50|1|12|256|4...,371|361|361|361|325|384|366|361|325|361|361|30...
4,2.0,287|25|91|49|82|108|112|72|42|79|50|1|12|256|4...,371|361|361|361|325|384|366|361|325|361|361|30...


Let's now associate each element with its exit code in the sequence:

In [4]:
# this mixes and matches items and exit codes
pair_sequences = make_sequences(data)
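
`make_sequences` comes from the `support` module and is not shown in this notebook. The sketch below is only a guess at what such a helper might do, assuming it interleaves each element with its exit code into one flat sequence per row. The name `interleave_sequences` and its two-column signature are hypothetical, not the real helper.

```python
def interleave_sequences(elements_column, exits_column):
    """Yield one [element, exit, element, exit, ...] list per row.

    Hypothetical stand-in for support.make_sequences (an assumption, not the
    actual implementation). Both inputs are iterables of pipe-separated strings.
    """
    for elements, exits in zip(elements_column, exits_column):
        element_ids = [int(i) for i in str(elements).split("|")]
        exit_ids = [int(i) for i in str(exits).split("|")]
        sequence = []
        # zip stops at the shorter list if the two columns disagree in length
        for element_id, exit_id in zip(element_ids, exit_ids):
            sequence.extend([element_id, exit_id])
        yield sequence


# toy row in the same shape as the `new_elements` / `new_exits` columns
toy_sequences = list(interleave_sequences(["287|25|91"], ["371|361|361"]))
print(toy_sequences)
```

On the real dataframe this would be called as `interleave_sequences(data["new_elements"], data["new_exits"])`.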

### Implementation
Luckily for us, an initial implementation of high-order Markov Chains already exists in the `HOMarkov` package.  
The implementation uses sparse matrices to represent the very high-dimensional space of all possible states.
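
To see why sparsity matters, here is a minimal sketch of second-order transition counting. It does not use HOMarkov and is not its implementation. A `Counter` keyed by `(row, column)` stands in for a sparse matrix, and both the row-major pair encoding and the overlapping sliding window are assumptions made for illustration.

```python
from collections import Counter


def encode_pair(first, second, number_of_states):
    """Map an ordered pair of single states to one composite index (row-major)."""
    return first * number_of_states + second


def count_second_order(sequences, number_of_states):
    """Count transitions between consecutive overlapping pairs of states.

    Only cells that were actually observed are stored, so memory scales with
    the number of distinct transitions rather than with the full matrix.
    """
    counts = Counter()
    for sequence in sequences:
        for a, b, c in zip(sequence, sequence[1:], sequence[2:]):
            row = encode_pair(a, b, number_of_states)
            column = encode_pair(b, c, number_of_states)
            counts[(row, column)] += 1
    return counts


# toy data: one sequence over 3 single states
toy_counts = count_second_order([[0, 2, 1, 2, 1]], number_of_states=3)
total_cells = (3 ** 2) ** 2  # composite states squared
print(len(toy_counts), "of", total_cells, "cells are non-zero")
```

Even in this tiny example only 3 of the 81 possible cells are populated. With 385 single states the dense matrix would be far too large to store, which is the motivation for a sparse representation.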

In [5]:
# install the High Order Markov chains module
! pip install --upgrade HOMarkov

Requirement already up-to-date: HOMarkov in /Users/pmascolo/anaconda/lib/python3.6/site-packages
Requirement already up-to-date: scikit-learn>=0.18.1 in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from HOMarkov)
Requirement already up-to-date: pandas>=0.20.0 in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from HOMarkov)
Requirement already up-to-date: numpy>=1.13.0 in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from HOMarkov)
Requirement already up-to-date: python-dateutil>=2 in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from pandas>=0.20.0->HOMarkov)
Requirement already up-to-date: pytz>=2011k in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from pandas>=0.20.0->HOMarkov)
Requirement already up-to-date: six>=1.5 in /Users/pmascolo/anaconda/lib/python3.6/site-packages (from python-dateutil>=2->pandas>=0.20.0->HOMarkov)


In [6]:
# from HOMarkov import markov
import markov
number_of_states = max(_codes_map.values()) + 1 # zero based :)
mc = markov.MarkovChain(number_of_states, order=2)

In [7]:
%%time
# this will take about 30 seconds for the big dataset...
mc.fit(list(pair_sequences))

CPU times: user 233 ms, sys: 3.87 ms, total: 237 ms
Wall time: 243 ms


In [8]:
print(f"Number of non-zero transitions: {mc.transition_matrix.count_nonzero()}")
# for order 2 each state is a pair of single states, so this is the number of possible pairs
print(f"Number of possible order-2 states: {mc.number_of_states ** 2}")

Number of non-zero transitions: 886
Number of possible order-2 states: 148225


We can now get all non-zero elements in the transition matrix and represent them with their states:

In [9]:
# get coordinates of non-zero elements in transition matrix
non_zero_indices = list(zip(*mc.transition_matrix.nonzero()))
state_lookup = mc.possible_states_lookup()

state_transitions = [(state_lookup.get(i), state_lookup.get(j)) for i, j in non_zero_indices]

state_transitions[:5]

[((0, 361), (200, 323)),
 ((1, 295), (253, 361)),
 ((1, 302), (12, 356)),
 ((1, 302), (12, 369)),
 ((1, 302), (12, 373))]

In [44]:
possible_start_states = set(i for i, _ in state_transitions)

import ast

@interact(initial_state=[str(s) for s in sorted(possible_start_states)], steps=IntSlider(1, 1, 3))
def draw_future(initial_state, steps):
    # NOTE: `steps` is not used yet; only the immediate next states are drawn
    graph = nx.DiGraph()

    # the dropdown passes the state as a string such as "(0, 361)"; parse it safely
    state = ast.literal_eval(initial_state)
    graph.add_node(state)
    state_id = mc.possible_states.get(state)

    next_states = list(zip(*mc.predict_state(mc.transition_matrix[state_id]).nonzero()))
    for next_state in next_states:
        # add to graph; the keyword form works across networkx versions
        graph.add_edge(state, next_state, weight=2)

    nx.draw_networkx(graph)
    plt.axis("off")
    plt.show()
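
The `steps` slider in the cell above is accepted but never used. One possible way to expand more than one step is a breadth-first walk over the observed transitions. This is a sketch under assumptions: `expand_future` is a hypothetical helper, and `toy_transitions` is made-up data in the same `(source, target)` shape as `state_transitions`.

```python
from collections import defaultdict


def expand_future(transitions, initial_state, steps):
    """Return the edges reachable from initial_state within `steps` hops."""
    successors = defaultdict(set)
    for source, target in transitions:
        successors[source].add(target)

    edges = []
    frontier = {initial_state}
    visited = {initial_state}
    for _ in range(steps):
        next_frontier = set()
        for source in sorted(frontier):
            for target in sorted(successors[source]):
                edges.append((source, target))
                # expand each state only once, so cycles terminate
                if target not in visited:
                    visited.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return edges


# toy transitions, for illustration only
toy_transitions = [
    ((1, 302), (12, 356)),
    ((1, 302), (12, 369)),
    ((12, 356), (0, 361)),
]
future_edges = expand_future(toy_transitions, (1, 302), steps=2)
print(future_edges)
```

The resulting edge list could then be handed to `graph.add_edges_from(future_edges)` before drawing.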


In [11]:
import collections

collections.Counter(i for i, _ in state_transitions).most_common(10)

[((282, 361), 6),
 ((361, 153), 6),
 ((66, 361), 5),
 ((302, 12), 5),
 ((361, 198), 5),
 ((1, 302), 4),
 ((89, 344), 4),
 ((104, 380), 4),
 ((144, 371), 4),
 ((277, 361), 4)]

In [12]:
possible_start_states

{(0, 361),
 (1, 295),
 (1, 302),
 (2, 321),
 (2, 327),
 (2, 342),
 (3, 361),
 (4, 288),
 (4, 361),
 (5, 361),
 (6, 361),
 (7, 361),
 (8, 361),
 (8, 373),
 (9, 379),
 (10, 369),
 (10, 373),
 (10, 379),
 (11, 369),
 (11, 373),
 (11, 379),
 (12, 356),
 (12, 369),
 (12, 373),
 (12, 379),
 (13, 361),
 (14, 361),
 (14, 363),
 (15, 361),
 (16, 379),
 (17, 379),
 (18, 361),
 (19, 288),
 (19, 361),
 (20, 383),
 (21, 290),
 (21, 293),
 (22, 361),
 (23, 361),
 (24, 288),
 (25, 361),
 (26, 325),
 (26, 346),
 (27, 288),
 (28, 372),
 (29, 365),
 (30, 361),
 (31, 361),
 (32, 288),
 (33, 361),
 (34, 288),
 (34, 361),
 (34, 367),
 (35, 363),
 (35, 365),
 (36, 380),
 (37, 380),
 (38, 308),
 (38, 344),
 (39, 362),
 (40, 333),
 (41, 372),
 (42, 325),
 (42, 353),
 (43, 372),
 (44, 288),
 (44, 361),
 (44, 366),
 (45, 361),
 (46, 361),
 (47, 361),
 (48, 380),
 (49, 361),
 (50, 361),
 (51, 361),
 (52, 361),
 (53, 361),
 (54, 361),
 (55, 374),
 (55, 380),
 (56, 361),
 (57, 361),
 (58, 361),
 (59, 361),
 (60, 3

In [13]:
state_lookup

{0: (0, 0),
 1: (0, 1),
 2: (0, 2),
 3: (0, 3),
 4: (0, 4),
 5: (0, 5),
 6: (0, 6),
 7: (0, 7),
 8: (0, 8),
 9: (0, 9),
 10: (0, 10),
 11: (0, 11),
 12: (0, 12),
 13: (0, 13),
 14: (0, 14),
 15: (0, 15),
 16: (0, 16),
 17: (0, 17),
 18: (0, 18),
 19: (0, 19),
 20: (0, 20),
 21: (0, 21),
 22: (0, 22),
 23: (0, 23),
 24: (0, 24),
 25: (0, 25),
 26: (0, 26),
 27: (0, 27),
 28: (0, 28),
 29: (0, 29),
 30: (0, 30),
 31: (0, 31),
 32: (0, 32),
 33: (0, 33),
 34: (0, 34),
 35: (0, 35),
 36: (0, 36),
 37: (0, 37),
 38: (0, 38),
 39: (0, 39),
 40: (0, 40),
 41: (0, 41),
 42: (0, 42),
 43: (0, 43),
 44: (0, 44),
 45: (0, 45),
 46: (0, 46),
 47: (0, 47),
 48: (0, 48),
 49: (0, 49),
 50: (0, 50),
 51: (0, 51),
 52: (0, 52),
 53: (0, 53),
 54: (0, 54),
 55: (0, 55),
 56: (0, 56),
 57: (0, 57),
 58: (0, 58),
 59: (0, 59),
 60: (0, 60),
 61: (0, 61),
 62: (0, 62),
 63: (0, 63),
 64: (0, 64),
 65: (0, 65),
 66: (0, 66),
 67: (0, 67),
 68: (0, 68),
 69: (0, 69),
 70: (0, 70),
 71: (0, 71),
 72: (0, 72)