# Web session clustering

The objective of this project is to compare the clusters of web sessions produced by different web session representations.

About the data set
* The data set is composed of page sessions 

<u>Data dictionary</u>
* ip
* network_visitor_id
* event_time
* utc_year
* utc_month
* utc_day
* utc_hour
* page_url
* page_hostname
* doc_title
* req_method
* ref_url
* ref_hostname
* browser_family

### Two-step methodology

#### Step 1: Session representation

Extract description tags from the page_url column. Since people's interests may change over time, I will also define temporal tags.

Sessions will consist of description tags extracted from the page_url column (or doc_title), temporal tags extracted from the event_time column, and a weight assigned to each tag.

<u>Description tag</u>

Using verbosity alpha, a URL of the form http://T1/T2/.../Tn is split into the tags P = {P1, P2, ... , Pn}

Example: https://en.wikipedia.org/wiki/Data_science

Tags: en.wikipedia.org, wiki, Data_science
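
The tag extraction in the example can be sketched in a few lines of Python. This is a minimal sketch, not the project's final implementation: the function name `description_tags` is hypothetical, and it assumes that the hostname is the first tag and that query strings and fragments are ignored.

```python
from urllib.parse import urlsplit

def description_tags(page_url):
    """Split a URL of the form scheme://T1/T2/.../Tn into the tags [T1, ..., Tn]."""
    parts = urlsplit(page_url)
    # T1 is the hostname; the remaining tags are the non-empty path segments.
    # Assumption: the query string and fragment are dropped.
    path_tags = [segment for segment in parts.path.split("/") if segment]
    return [parts.netloc] + path_tags

print(description_tags("https://en.wikipedia.org/wiki/Data_science"))
# ['en.wikipedia.org', 'wiki', 'Data_science']
```

On the data set, the same function could be applied to every row with `file['page_url'].apply(description_tags)`.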

Model: Natural Language Processing (NLP) - Lesson 14

<u>Temporal tag</u>


Extract temporal tags from the event_time column to create tags T = {T1, T2, ... , Tn}
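
The notebook does not yet define which temporal tags to use, so the following is only an illustrative sketch. The function name `temporal_tags`, the four part-of-day buckets and the weekday/weekend split are all assumptions, and the timestamp is assumed to be a string in the `YYYY-MM-DD HH:MM:SS` format shown in the data preview.

```python
from datetime import datetime

def temporal_tags(event_time):
    """Return illustrative temporal tags for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    ts = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
    # Hypothetical buckets: the source does not say how T1..Tn are chosen.
    if ts.hour < 6:
        part_of_day = "night"
    elif ts.hour < 12:
        part_of_day = "morning"
    elif ts.hour < 18:
        part_of_day = "afternoon"
    else:
        part_of_day = "evening"
    # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"
    return [part_of_day, day_type]

print(temporal_tags("2017-07-21 01:18:42"))
# ['night', 'weekday']
```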

Model: Time series / Exponential smoothing - Lesson 15

<u>Session representation</u>

Each session s will be represented as a set of tag weights, where w(Pi,s) is the weight of tag Pi in session s:

S = {w(P1,s), w(P2,s), ... , w(Pn,s)}
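
The weighting function w is not specified above. As a sketch under that caveat, the example below assumes w(P,s) is the share of the session's tags that equal P (a simple relative frequency); the name `session_vector` and the toy vocabulary are hypothetical.

```python
from collections import Counter

def session_vector(session_tags, vocabulary):
    """Return [w(P1, s), ..., w(Pn, s)] for one session s over a fixed tag vocabulary."""
    counts = Counter(session_tags)
    total = sum(counts.values())
    # Assumed weight: relative frequency of the tag within the session
    # (0.0 for every tag when the session is empty).
    return [counts[tag] / total if total else 0.0 for tag in vocabulary]

vocabulary = ["en.wikipedia.org", "wiki", "Data_science", "Machine_learning"]
session = ["en.wikipedia.org", "wiki", "Data_science",
           "en.wikipedia.org", "wiki", "Machine_learning"]
print(session_vector(session, vocabulary))
```

Other weightings, such as TF-IDF or time spent on a page, would fit the same vector layout.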


#### Step 2: Session clustering
Models:
* K-means algorithm (K = 20, based on the literature)
    - To be used with different distance measures, e.g. Euclidean, cosine and Manhattan distances
* Expectation-Maximization algorithm
    - To identify the association between users and pages
    - Provide user profiles
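
To make the distance-measure comparison concrete, here is a small pure-Python sketch of K-means with a pluggable distance function. It is a toy illustration under stated assumptions, not the planned implementation: the function names, the four two-dimensional toy sessions and K = 2 are hypothetical (the plan above uses K = 20). Note also that the mean update is only the exact minimiser for squared Euclidean distance; for Manhattan distance a per-dimension median (k-medians) would be the matching update.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as maximally distant from everything.
    return 1.0 - dot / norm if norm else 1.0

def kmeans(vectors, k, distance, iterations=20, seed=0):
    """Lloyd-style K-means with a pluggable distance; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each session goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy session vectors: two sessions dominated by the first tag, two by the second.
sessions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for name, dist in [("euclidean", euclidean), ("cosine", cosine), ("manhattan", manhattan)]:
    labels, _ = kmeans(sessions, k=2, distance=dist)
    print(name, labels)
```

On this toy input all three distances group the first two sessions together and the last two together; which cluster gets which label depends on the random seed.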



# Evaluation:

In [1]:
import pandas as pd

In [2]:
df = pd.ExcelFile('~/Documents/GMTPalma/acess_log.xlsx')
df.sheet_names

['Sheet1']

In [3]:
file = df.parse('Sheet1')
file.head()

Unnamed: 0,ip,network_visitor_id,event_time,utc_year,utc_month,utc_day,utc_hour,page_url,page_hostname,doc_title,req_method,ref_url,ref_hostname,browser_family
0,120.18.102.193,433e16c0-1a79-4712-b4d4-6549f829085c,2017-07-21 01:18:42,2017,7,21,1,http://candidthat.com/2017/06/nivea-creme-beau...,candidthat.com,NIVEA CREME BEAUTY HACKS YOU NEED TO KNOW - Ca...,GET,http://paid.outbrain.com/network/redir?p=_kswN...,paid.outbrain.com,Facebook
1,1.136.97.15,38293c02-6dc3-4442-b430-ed79738c2281,2017-07-21 01:14:40,2017,7,21,1,http://thewifelife.com.au/2017/06/20/our-gener...,thewifelife.com.au,Our generations with NIVEA Creme | the wife life,GET,http://paid.outbrain.com/network/redir?p=1ay7w...,paid.outbrain.com,Mobile Safari
2,139.130.199.58,068b4a3b-8528-4ee5-82f5-65743e678f5d,2017-07-21 01:11:00,2017,7,21,1,http://thewifelife.com.au/2017/06/20/our-gener...,thewifelife.com.au,Our generations with NIVEA Creme | the wife life,GET,http://paid.outbrain.com/network/redir?p=1ay7w...,paid.outbrain.com,Chrome
3,73.166.227.158,acacd4e8-1dde-4054-916b-6f361086eda9,2017-07-21 01:08:32,2017,7,21,1,https://m.nivea.com.au/products/face-care/Dail...,m.nivea.com.au,Daily Essentials Refreshing Facial Wash Gel - ...,GET,android-app://com.google.android.googlequickse...,com.google.android.googlequicksearchbox,Chrome Mobile
4,49.197.186.77,3a1b812a-e884-450a-9d20-5aafab46e425,2017-07-21 01:04:10,2017,7,21,1,http://candidthat.com/2017/06/nivea-creme-beau...,candidthat.com,NIVEA CREME BEAUTY HACKS YOU NEED TO KNOW - Ca...,GET,http://paid.outbrain.com/network/redir?p=_kswN...,paid.outbrain.com,Mobile Safari


In [4]:
file.describe()

Unnamed: 0,utc_year,utc_month,utc_day,utc_hour
count,246747.0,246747.0,246747.0,246747.0
mean,2017.0,7.290545,14.68815,10.635598
std,0.0,0.619733,8.914797,6.505787
min,2017.0,6.0,1.0,0.0
25%,2017.0,7.0,7.0,6.0
50%,2017.0,7.0,14.0,10.0
75%,2017.0,8.0,23.0,16.0
max,2017.0,8.0,31.0,23.0


In [5]:
# Check the unique values of categorical (string) variables:
print(file['ref_hostname'].unique())
print(file['page_hostname'].unique())
print(file['req_method'].unique())
print(file['doc_title'].unique())
print(file['browser_family'].unique())

['paid.outbrain.com' 'com.google.android.googlequicksearchbox'
 'www.google.com' 'm.nivea.co.nz' nan 'm.facebook.com' 'm.nivea.com.au'
 'www.google.co.uk' 'www.google.co.nz' 'thewifelife.com.au'
 'www.nivea.com.au' 'www.google.com.au' 'www.mademois-elle.com'
 'www.nivea.co.nz' 'www.google.co.za' 'candidthat.com' 'l.instagram.com'
 'www.google.dk' 'frame.bloglovin.com' 'www.google.ro' 'fashionhyper.com'
 'stylista.no' 'www.google.ca' 'www.bing.com' 'uk.pinterest.com'
 'www.google.com.pk' 'www.youtube.com' 'www.google.es' 'www.google.de'
 'www.google.co.in' 'ashowens.com' 'www.google.si' 'www.bodyandsoul.com.au'
 'www.pinterest.com' 'pinterest.com' 'the-curious-button.com'
 'mademoisellejaime.tictail.com' 'www.google.co.jp' 'www.google.hu'
 'yandex.ru' 'www.google.it' 'www.google.no' 'www.nivea.com'
 'www.google.fr' 'www.google.com.kh' 'l.facebook.com' 'www.tango.me'
 'stylendipity.org' 'www.google.com.co' 'www.temporaryhousewifey.com'
 'www.google.com.sa' 'lm.facebook.com' 'www.google.i

In [6]:
# Convert categorical columns to binary (one-hot) representations:
categorical_cols = ['ref_hostname', 'page_hostname', 'req_method', 'doc_title', 'browser_family']
file_dummies = pd.get_dummies(data=file, columns=categorical_cols, prefix=categorical_cols)
file_dummies.head()

Unnamed: 0,ip,network_visitor_id,event_time,utc_year,utc_month,utc_day,utc_hour,page_url,ref_url,ref_hostname_10.255.200.17,...,browser_family_Samsung Internet,browser_family_SeaMonkey,browser_family_Sogou Explorer,browser_family_Sogou web spider,browser_family_UC Browser,browser_family_Vienna,browser_family_Vivaldi,browser_family_WebKit Nightly,browser_family_WordPress,browser_family_Yandex Browser
0,120.18.102.193,433e16c0-1a79-4712-b4d4-6549f829085c,2017-07-21 01:18:42,2017,7,21,1,http://candidthat.com/2017/06/nivea-creme-beau...,http://paid.outbrain.com/network/redir?p=_kswN...,0,...,0,0,0,0,0,0,0,0,0,0
1,1.136.97.15,38293c02-6dc3-4442-b430-ed79738c2281,2017-07-21 01:14:40,2017,7,21,1,http://thewifelife.com.au/2017/06/20/our-gener...,http://paid.outbrain.com/network/redir?p=1ay7w...,0,...,0,0,0,0,0,0,0,0,0,0
2,139.130.199.58,068b4a3b-8528-4ee5-82f5-65743e678f5d,2017-07-21 01:11:00,2017,7,21,1,http://thewifelife.com.au/2017/06/20/our-gener...,http://paid.outbrain.com/network/redir?p=1ay7w...,0,...,0,0,0,0,0,0,0,0,0,0
3,73.166.227.158,acacd4e8-1dde-4054-916b-6f361086eda9,2017-07-21 01:08:32,2017,7,21,1,https://m.nivea.com.au/products/face-care/Dail...,android-app://com.google.android.googlequickse...,0,...,0,0,0,0,0,0,0,0,0,0
4,49.197.186.77,3a1b812a-e884-450a-9d20-5aafab46e425,2017-07-21 01:04:10,2017,7,21,1,http://candidthat.com/2017/06/nivea-creme-beau...,http://paid.outbrain.com/network/redir?p=_kswN...,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
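# Create a correlation matrix from the dummy-encoded frame
import matplotlib.pyplot as plt
import seaborn as sns

# numeric_only=True skips string columns such as ip and page_url
corr = file_dummies.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr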