# Sample analysis of GetSmartLook data

This document demonstrates what we can potentially extract from the mouse-movement data provided by _GetSmartLook_.

## Data source and scope

We received over `10GB` of data in `JSON` format from _GetSmartLook_, representing individual "sessions".
In total we have `88634` "sessions".

As a sample for this first analysis we use the first `4880` "sessions" from **6 May 2016**.

In [1]:
import pandas as pd
from gsl_parser import *
# load an already exported sample CSV
sample_csv = 'sample.csv'
sample_data = pd.read_csv(sample_csv, index_col=0)
sample_data.head(5)

Unnamed: 0,batch_timestamp,batch_uid,callback,dom_classes,dom_height,dom_id,dom_tag,dom_width,dom_x,dom_y,...,position_x,position_y,scroll,server,session_uid,timestamp,url,user_uid,window_height,window_width
0,0,0,,,0,,,0,0,0,...,0.0,0.0,0,,0,0,,0,0.0,0.0
1,0,0,,,0,,,0,0,0,...,0.0,0.0,0,,0,0,,0,1271.0,1785.0
2,0,0,,,0,,,0,0,0,...,700.0,446.0,0,www.getsmartlook.com,572a16a9fff33de157fe5cb0,0,https://www.getsmartlook.com/,0,1271.0,1785.0
3,21,0,,,0,,,0,0,0,...,1601.0,825.0,0,www.getsmartlook.com,572a16a9fff33de157fe5cb0,21,https://www.getsmartlook.com/,0,1271.0,1785.0
4,30,0,,,0,,,0,0,0,...,1761.0,962.0,0,www.getsmartlook.com,572a16a9fff33de157fe5cb0,30,https://www.getsmartlook.com/,0,1271.0,1785.0


In this first sample we loaded 120101 events (mouse movements, clicks, ...). We have not used all 4880 sessions yet; we will get to that later.

For now we only show what can be done with these sessions.

In [2]:
print(sample_data.count())

# list the raw event types present in the sample
sample_data.event_type.unique()

# collapse the raw event types into two categories: 'move' and 'click'
def rename(x):
    renamer = {
        'mousemove': 'move',
        'blur': 'click',
        'focus': 'click',
        'change': 'move',
        'resize': 'move',
        'scroll': 'move',
        'click': 'click',
        'scrollel': 'move'
    }
    # unknown or missing event types default to 'move'
    return renamer.get(x, 'move')

sample_data['event_type'] = sample_data.event_type.apply(rename)

batch_timestamp    120101
batch_uid          120101
callback                0
dom_classes           950
dom_height         120101
dom_id                283
dom_tag              4206
dom_width          120101
dom_x              120101
dom_y              120101
event_type         118949
position_x         120101
position_y         119646
scroll             120101
server             118949
session_uid        120101
timestamp          120101
url                118949
user_uid           120101
window_height      120101
window_width       120101
dtype: int64
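
The renaming step above can also be written without a per-row Python function. The following is a minimal sketch, assuming that only `click`, `blur` and `focus` should count as clicks, as in the mapping in cell `In [2]`; the names `CLICK_LIKE` and `collapse_event_types` are hypothetical and not part of `gsl_parser`.

```python
import pandas as pd

# Assumption: these are the only raw event types treated as clicks.
CLICK_LIKE = {'click', 'blur', 'focus'}

def collapse_event_types(event_type: pd.Series) -> pd.Series:
    """Return 'click' for click-like events and 'move' for everything else.

    Missing values are not in CLICK_LIKE, so they become 'move',
    which matches the default of the rename() function above.
    """
    is_click = event_type.isin(CLICK_LIKE)
    return is_click.map({True: 'click', False: 'move'})
```

Applied to the sample, this would read `sample_data['event_type'] = collapse_event_types(sample_data.event_type)`.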


## Data aggregation

The first step of the analysis is to compute aggregated data for each "session".
That is, for each "session" we compute figures such as
- distance travelled
- number of clicks
- velocity, acceleration
- path complexity
...

All aggregated variables are described in the statistical documentation.
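
To make the idea concrete, here is a minimal sketch of a per-session aggregation for three of the quantities listed above. It is not the implementation behind `aggregate_by_session`; the function name `aggregate_sketch` is hypothetical, and it assumes the event table has the columns `session_uid`, `timestamp`, `position_x`, `position_y` and `event_type` shown in the sample, with `event_type` already reduced to `move`/`click`.

```python
import numpy as np
import pandas as pd

def aggregate_sketch(events: pd.DataFrame) -> pd.DataFrame:
    """Per-session distance, click count and mean velocity (illustrative only)."""
    events = events.sort_values(['session_uid', 'timestamp'])
    session = events['session_uid']
    grouped = events.groupby('session_uid')

    # Euclidean step length between consecutive events of the same session.
    dx = grouped['position_x'].diff()
    dy = grouped['position_y'].diff()
    step = np.hypot(dx, dy).fillna(0.0)

    # Elapsed time between consecutive events of the same session.
    dt = grouped['timestamp'].diff()

    out = pd.DataFrame({
        'distance': step.groupby(session).sum(),
        'clicks': (events['event_type'] == 'click').groupby(session).sum(),
        'duration': dt.groupby(session).sum(),
    })
    # Mean velocity = total distance / total elapsed time (NaN if no time elapsed).
    out['mean_velocity'] = out['distance'] / out['duration'].where(out['duration'] > 0)
    return out
```

The units of `mean_velocity` depend on the units of `timestamp`, which the sample does not state.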

In [3]:
#aggregates = aggregate_by_session(sample_data)
#aggregates.head(10)

# load already made aggregates
aggregates = pd.read_csv('aggregate_sample.csv', index_col=0)
aggregates.head(5)

Unnamed: 0,click_freq,distance,double_click_freq,empty_click_freq,freq_change,latency,mean_acc,mean_deac,mean_miss,mean_velocity,...,scroll_mean_acc_up,scroll_mean_vel_down,scroll_mean_vel_up,scroll_mean_velocity,scroll_peak_acc_down,scroll_peak_acc_up,scroll_peak_vel_down,scroll_peak_vel_up,scroll_peak_velocity,session_uid
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,6.8e-05,15156.96556,1.3e-05,0.0,0.001517,21.0,0.147085,-0.16192,,6.88953,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,572a16a9fff33de157fe5cb0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,572a41aafff33de157fe6dd5
4,0.000497,2117.446719,0.000311,0.0,0.015585,0.0,0.022224,-0.019165,,0.407201,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,572a882cfff33de157fe81b1


### Visualization

After aggregating the data we can browse these sessions and see how much the values differ between individual sessions.

In [4]:
from bokeh.charts import Histogram, output_file, show, hplot
from bokeh.io import output_notebook

output_notebook()

output_file('graphs.html')

vel = Histogram(aggregates.loc[aggregates.mean_velocity.notnull(), 'mean_velocity'], title='Mean mouse velocity')
show(vel)

acc = Histogram(aggregates.loc[aggregates.mean_acc.notnull(), 'mean_acc'], title='Mean acceleration')
show(acc)

# plot the distance column here (the original plotted mean_acc by mistake)
dis = Histogram(aggregates.loc[aggregates.distance.notnull(), 'distance'], title='Distance travelled')
show(dis)

fre = Histogram(aggregates.loc[aggregates.click_freq.notnull(), 'click_freq'], title='Click frequency')
show(fre)
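
When no plot output is available, the same distributions can be inspected as a table of bin counts. This is a minimal sketch using `numpy.histogram`; the helper name `histogram_table` and the default of 10 bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def histogram_table(values: pd.Series, bins: int = 10) -> pd.DataFrame:
    """Bin the non-null values and return one row per bin."""
    clean = values.dropna().to_numpy(dtype=float)
    counts, edges = np.histogram(clean, bins=bins)
    return pd.DataFrame({
        'left': edges[:-1],    # lower edge of each bin
        'right': edges[1:],    # upper edge of each bin
        'count': counts,       # number of values falling in the bin
    })
```

For example, `histogram_table(aggregates['mean_velocity'])` would list how many sessions fall into each velocity range.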

### Baselines and deviations

The next step is to compute a kind of baseline behaviour: the mean values of the individual aggregated variables, as well as their spread and mutual dependence (`variance`, `covariance`).

Based on these values we can then look for interesting, unusual "sessions".
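
The baseline quantities named above can be computed directly with pandas. The sketch below is not the `compute_baselines` function from the `baselines` module, whose internals are not shown here; `baselines_sketch` is a hypothetical stand-in that returns values in the same order as the call in cell `In [6]`, and it assumes that `count` means the number of non-null observations per variable.

```python
import pandas as pd

def baselines_sketch(numeric: pd.DataFrame):
    """Means, variances, covariance matrix and counts for numeric columns."""
    means = numeric.mean()        # per-variable mean, NaN values skipped
    variances = numeric.var()     # sample variance (ddof=1)
    covariance = numeric.cov()    # pairwise covariance matrix
    count = numeric.count()       # assumption: non-null observations per variable
    return means, variances, covariance, count
```

Because pandas skips missing values by default, variables with many gaps are summarised over fewer sessions than the others; `count` makes that visible.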

In [5]:
from gsl_parser import load_file, get_events
# count the number of event logs per session
def log_counter(session):
    """
    Count the number of event logs in a session.
    """
    folder = '../data/new_gsl'
    filename = session + '.vt'
    try:
        loaded = load_file(folder, filename)
        events = get_events(loaded)
    except IOError:
        events = []
    return len(events)

aggregates['log_count'] = aggregates.session_uid.apply(log_counter)


In [6]:
from baselines import compute_baselines

# remove string/id variables that are not needed for baselines
no_strings = aggregates.drop(['session_uid', 'recalc_uid', 'recalc_timestamp'], axis=1)
# fix some wrong values like '[0.0]'
for key in no_strings:
    no_strings.loc[:,key] = no_strings[key].apply(lambda x: 0.0 if x == '[0.0]' else x)
no_strings = no_strings.astype(float)
# compute baselines
means, variances, covariance, count = compute_baselines(no_strings)

In [7]:
import numpy as np
# compute deviations from means
deviations = no_strings - means
# standardize by dividing by std deviations
std_dev = np.sqrt(variances)
standardized = deviations / std_dev
standardized.head(10)

Unnamed: 0,click_freq,distance,double_click_freq,empty_click_freq,freq_change,latency,mean_acc,mean_deac,mean_miss,mean_velocity,...,scroll_mean_acc_up,scroll_mean_vel_down,scroll_mean_vel_up,scroll_mean_velocity,scroll_peak_acc_down,scroll_peak_acc_up,scroll_peak_vel_down,scroll_peak_vel_up,scroll_peak_velocity,log_count
0,-0.514034,-0.411125,-0.469857,,-0.724966,-0.023378,-1.20333,1.162229,,-1.313298,...,,,,,,,,,,-0.716428
1,-0.514034,-0.411125,-0.469857,,-0.724966,-0.023378,-1.20333,1.162229,,-1.313298,...,,,,,,,,,,-0.716428
2,-0.502571,-0.124177,-0.466472,,-0.607416,-0.023194,3.824225,-5.298518,,9.814621,...,,,,,,,,,,-0.614148
3,-0.514034,-0.411125,-0.469857,,-0.724966,-0.023378,-1.20333,1.162229,,-1.313298,...,,,,,,,,,,-0.689678
4,-0.430691,-0.371038,-0.38911,,0.482316,-0.023378,-0.443674,0.397524,,-0.655589,...,,,,,,,,,,-0.527603
5,-0.514034,-0.411125,-0.469857,,-0.724966,-0.023378,-1.20333,1.162229,,-1.313298,...,,,,,,,,,,-0.714854
6,28.620858,-0.397887,37.166129,,0.161292,-0.023273,-0.339874,0.644136,,-0.33118,...,,,,,,,,,,-0.661354
7,-0.025621,,-0.30161,,,-0.003845,-1.20333,1.162229,,,...,,,,,,,,,,-0.692825
8,-0.499232,-0.360783,-0.46412,,0.55682,-0.023317,-0.777408,0.759529,,-0.716763,...,,,,,,,,,,-0.483544
9,-0.514034,-0.411125,-0.469857,,-0.724966,-0.023378,-1.20333,1.162229,,-1.313298,...,,,,,,,,,,-0.710134
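
Several columns in the standardized table are empty in every row shown. One possible cause, assuming those variables have the same value in every session of the sample, is a zero standard deviation: dividing a zero deviation by zero yields `NaN`. The sketch below reproduces this on toy data; `standardize` and `constant_columns` are hypothetical helper names.

```python
import pandas as pd

def standardize(numeric: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column; constant columns come out as all-NaN (0 / 0)."""
    return (numeric - numeric.mean()) / numeric.std()

def constant_columns(numeric: pd.DataFrame) -> list:
    """Names of columns whose non-null values never vary."""
    return [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]
```

Listing `constant_columns(no_strings)` before standardizing would show which variables carry no information in this sample.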


### Largest deviations

Using the computed means and deviations we can find the sessions in which individual aggregated variables were farthest from the mean. Because we divided each deviation by the standard deviation of its variable, we can also compare variables of different scales and orders of magnitude with one another.
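
As an illustration of this lookup, the following sketch returns, for every variable, the row label of its largest and smallest standardized value. The helper name `extreme_sessions` is hypothetical; it drops columns that are entirely `NaN` first, on the assumption that such columns cannot identify any session.

```python
import pandas as pd

def extreme_sessions(standardized: pd.DataFrame) -> pd.DataFrame:
    """For each variable, the row label of its largest and smallest z-score."""
    # Columns that are entirely NaN have no maximum or minimum to report.
    usable = standardized.dropna(axis=1, how='all')
    return pd.DataFrame({
        'max_id': usable.idxmax(),
        'min_id': usable.idxmin(),
    })
```

The union of the `max_id` and `min_id` columns gives the set of sessions worth a closer look.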

In [8]:
# maximum and minimum standardized deviations from mean
maxes = standardized.max()
mins = standardized.min()
# get indices for these maxima/minima
max_ids = set(standardized.idxmax().dropna().values.astype(int))
min_ids = set(standardized.idxmin().dropna().values.astype(int))
interesting_ids = max_ids.union(min_ids)
interesting_ids

{0, 10, 15, 185, 212, 863, 970, 1011, 1124}

In [9]:
# select the rows by index label (.ix is deprecated in pandas)
top_deviants = aggregates.loc[list(interesting_ids), :]
print('Session ID:\n')
print(top_deviants.session_uid)
print('\n')
print('Variables they belong to:\n')
print('Maxima:')
print(standardized.idxmax().dropna().astype(int))
print('\nMinima:')
print(standardized.idxmin().dropna().astype(int))

Session ID:

0                              0
1124    572c09c481c84cd16ed37877
10      572adb51afc06abd6eb55bbc
15      572b43a381c84cd16ed35109
1011    572c0095afc06abd6eb5a45e
212     572bca5281c84cd16ed3738b
185     572bc83481c84cd16ed37350
970     572bfdc8afc06abd6eb5a419
863     572bf45a81c84cd16ed37706
Name: session_uid, dtype: object


Variables they belong to:

Maxima:
click_freq            212
distance               10
double_click_freq     212
freq_change          1011
latency                15
mean_acc              185
mean_deac               0
mean_velocity         185
peak_acc              970
peak_deac               0
peak_velocity         970
log_count              10
dtype: int64

Minima:
click_freq              0
distance                0
double_click_freq       0
freq_change             0
latency              1124
mean_acc                0
mean_deac             863
mean_velocity           0
peak_acc                0
peak_deac             970
peak_velocity           