### Practice 1 - Using logging info and MMs to discover website user trends

Markos Flavio B. G. O.

__Context: Markov Models (MMs).__

__Course: Unsupervised Machine Learning Hidden Markov Models in Python (Udemy, LazyProgrammer)__

This code is a practice study about MMs. It's an adaptation of the code found at https://github.com/lazyprogrammer/machine_learning_examples/tree/master/hmm_class/sites.py
    
__Specific objectives__

1. From a data set of site log records, discover which page is most likely to receive an incoming user and which page has the highest bounce rate.

In [2]:
import numpy as np
import pandas as pd

In [18]:
loc = './Raw repo/hmm_class/site_data.csv'
df = pd.read_csv(loc, names=['last_page_id', 'next_page_id']) 
print(df.shape)
df.head()

(100000, 2)


|   | last_page_id | next_page_id |
|---|--------------|--------------|
| 0 | -1           | 8            |
| 1 | 4            | 8            |
| 2 | -1           | 2            |
| 3 | 1            | B            |
| 4 | -1           | 5            |


Additional info:
  - 10 pages (IDs from 0 to 9);
  - Start pages have last_page_id = -1;
  - End pages will have B (bounce) or C (close) as next_page_id.
  * Several factors may differentiate B from C. Time is a natural one: the longer a user spends on a page before leaving, the more likely the exit is a close rather than a bounce.

In [16]:
transitions = {} # storing state (page) transitions
row_sums = {} # storing the sums of transitions from each state s

# collecting counts
for index, row in df.iterrows():
    s, e = row['last_page_id'], row['next_page_id']
    transitions[(s,e)] = transitions.get((s, e), 0.) + 1 
    row_sums[s] = row_sums.get(s, 0.) + 1
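
The same counts can probably be obtained without a Python-level loop. The sketch below is only an illustration on a small made-up frame (`toy`, `pair_counts`, `toy_transitions` and `toy_row_sums` are hypothetical names, and the rows are not taken from `site_data.csv`); it uses `groupby(...).size()` to count each `(last_page_id, next_page_id)` pair.

```python
import pandas as pd

# Toy stand-in for site_data.csv (hypothetical rows, not the real data)
toy = pd.DataFrame({
    'last_page_id': [-1, 4, -1, 1, -1, 4],
    'next_page_id': ['8', '8', '2', 'B', '8', 'C'],
})

# Count every (last_page_id, next_page_id) pair in one pass
pair_counts = toy.groupby(['last_page_id', 'next_page_id']).size()

# Same structures as the loop version: a dict keyed by (s, e) and per-state totals
toy_transitions = {k: float(v) for k, v in pair_counts.items()}
toy_row_sums = {k: float(v) for k, v in toy.groupby('last_page_id').size().items()}

print(toy_transitions)
print(toy_row_sums)
```

On 100,000 rows this should be noticeably faster than `iterrows()`, while producing dictionaries that the later cells can use unchanged.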

In [55]:
for key in df.next_page_id.unique():
    try:
        trans = transitions[-1, key]
    except KeyError: # no transition from -1 directly to B or C exists
        continue
    print('Transition from -1 to {0}: {1}'.format(key, trans))

Transition from -1 to 8: 2016.0
Transition from -1 to 2: 1888.0
Transition from -1 to 5: 1942.0
Transition from -1 to 9: 2062.0
Transition from -1 to 0: 2045.0
Transition from -1 to 3: 1889.0
Transition from -1 to 6: 1946.0
Transition from -1 to 7: 1980.0
Transition from -1 to 4: 2034.0
Transition from -1 to 1: 2055.0


In [57]:
print(row_sums[-1])

19857.0


In [60]:
# normalizing transition counts so they represent probabilities
for k, v in transitions.items():
    s, e = k
    transitions[k] = v/row_sums[s]
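
Note that the cell above overwrites `transitions` in place, so running it a second time would divide by the row sums again. A minimal sketch of a re-runnable variant follows, with a check that each state's outgoing probabilities sum to 1. The counts in `toy_counts` are made up for illustration, and `totals`, `toy_probs` and `out_mass` are hypothetical names.

```python
import math
from collections import defaultdict

# Hypothetical counts in the same (state, next_state) -> count format
toy_counts = {(-1, '8'): 2.0, (-1, '2'): 1.0, (4, '8'): 1.0, (4, 'C'): 1.0}

# Row totals: number of transitions leaving each state
totals = defaultdict(float)
for (s, e), c in toy_counts.items():
    totals[s] += c

# Build a new dict instead of overwriting, so re-running is harmless
toy_probs = {(s, e): c / totals[s] for (s, e), c in toy_counts.items()}

# Each state's outgoing probabilities should sum to 1
out_mass = defaultdict(float)
for (s, e), p in toy_probs.items():
    out_mass[s] += p
assert all(math.isclose(m, 1.0) for m in out_mass.values())
```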

In [66]:
# initial state distributions
for key in df.next_page_id.unique():
    try:
        trans = transitions[-1, key]
    except KeyError: # no transition from -1 directly to B or C exists
        continue
    print('Transition from -1 to {0}: {1}'.format(key, trans))

Transition from -1 to 8: 0.10152591025834719
Transition from -1 to 2: 0.09507982071813466
Transition from -1 to 5: 0.09779926474291183
Transition from -1 to 9: 0.10384247368686106
Transition from -1 to 0: 0.10298635241980159
Transition from -1 to 3: 0.09513018079266758
Transition from -1 to 6: 0.09800070504104345
Transition from -1 to 7: 0.09971294757516241
Transition from -1 to 4: 0.10243239159993957
Transition from -1 to 1: 0.10348995316513068


In [82]:
# finding the page with highest bouncing rate (page with highest probability to end up in state B)
for key in df.last_page_id.unique():
    try:
        trans = transitions[key, 'B']
    except KeyError: # the start marker -1 has no transition to B
        continue
    print('Transition from {0} to B: {1}'.format(key, trans))

Transition from 4 to B: 0.1255756067205974
Transition from 1 to B: 0.125939617991374
Transition from 7 to B: 0.12371650388179314
Transition from 0 to B: 0.1279673590504451
Transition from 2 to B: 0.12649551345962112
Transition from 3 to B: 0.12743384922616077
Transition from 8 to B: 0.12529550827423167
Transition from 6 to B: 0.1208153180975911
Transition from 9 to B: 0.13176232104396302
Transition from 5 to B: 0.12369559684398065
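
Rather than reading the maximum off the printed list, the page with the highest bounce rate can be picked programmatically. The sketch below copies the probabilities from the output above into a plain dict (`bounce` and `worst_page` are names introduced here for illustration).

```python
# Values copied from the output above: page -> P(next = B | page)
bounce = {
    4: 0.1255756067205974,
    1: 0.125939617991374,
    7: 0.12371650388179314,
    0: 0.1279673590504451,
    2: 0.12649551345962112,
    3: 0.12743384922616077,
    8: 0.12529550827423167,
    6: 0.1208153180975911,
    9: 0.13176232104396302,
    5: 0.12369559684398065,
}

# Page whose bounce probability is largest
worst_page = max(bounce, key=bounce.get)
print(worst_page, bounce[worst_page])
```

The same `max(..., key=...)` pattern would apply to the initial state distribution to find the most likely entry page.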


### Conclusions
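Page 9 is both the most likely entry page (probability of about 0.104) and the page with the highest bounce rate (about 0.132). The margins are small, though: across the ten pages, entry probabilities range from about 0.095 to 0.104 and bounce rates from about 0.121 to 0.132.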