# Part 4: Web Mining the Log Data for a Website
## Task 6 Web Mining
### First data mining operation: Association Mining

a. The rationale for selecting the specific operation/method.

Sequence/association mining was chosen to gain a better understanding of users' pathways through the site's pages. Usage patterns can help an organisation improve site design by identifying which pages are commonly viewed in the same session.

b. What variables did you include in the analysis and what were their roles and measurement level set? Justify your choice.

For this task we only need to use session and request. Sessions are numbered 1 to 1939, while requests are strings listing the page or file requested by the user.
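
As a rough illustration of why only these two variables are needed, the sketch below groups requests by session into "transactions", the input shape that association mining expects. The rows are hypothetical stand-ins for the web log, and `build_transactions` is an illustrative helper, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical (session, request) rows standing in for the web log
rows = [
    (3, "/"),
    (3, "/favicon.ico"),
    (12, "/eaglefarm"),
    (12, "/eaglefarm/pricelist"),
    (12, "/eaglefarm"),
]

def build_transactions(rows):
    """Group requests by session, keeping each page once per session."""
    pages_by_session = defaultdict(set)
    for session, request in rows:
        pages_by_session[session].add(request)
    # One sorted list of distinct pages per session
    return {s: sorted(pages) for s, pages in pages_by_session.items()}

transactions = build_transactions(rows)
print(transactions)
```

Each session becomes one basket of pages, so the session acts as the transaction ID and the request as the item.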

c. Can you identify data quality issues in order to perform web mining?

The main data quality problem for this exercise was the variation in the request values. For example, we can see values like '/eaglefarm', '/eaglefarm/', and '/eaglefarm.html'. Without knowing the website structure intimately, we cannot be sure whether these are the same requests or not.
For our purposes here, we have assumed that values without a file extension, and values ending with a slash (/) are the same. Values with file extensions can be left separate for now.
We handled this by removing the slash (/) if the value ended with it, making them the same as other requests, so '/eaglefarm/' becomes '/eaglefarm'.
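
One side effect of stripping the trailing slash is that the home page request '/' becomes an empty string. A minimal sketch of a variant that keeps the site root intact is shown below; `normalise_request` is a hypothetical helper and not the approach used in the notebook cells.

```python
import re

def normalise_request(request):
    """Strip one trailing slash, but leave the site root '/' unchanged."""
    if request == "/":
        return request
    return re.sub(r"/$", "", request)

for value in ["/eaglefarm/", "/eaglefarm", "/eaglefarm.html", "/"]:
    print(repr(value), "->", repr(normalise_request(value)))
```

Requests with a file extension are left untouched, in line with the assumption stated above.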

d. Discuss the results obtained. Discuss also the applicability of findings of the method. Should include a high-level managerial kind of discussion on the findings, should not be just interpretation of results as shown in the results panel.

We can see that most results with enough support are around the '/eaglefarm' part of the site, which we could have predicted from the initial data exploration (value_counts). There are a couple of exceptions, namely '/services.html' and '/robots.txt'.
Sorting by confidence shows us the rules with the highest probability of the user requesting the right side, given that they also request the left side in the same session.
Some of those results include /eaglefarm, /eaglefarm/javascript/menu.js, /eaglefarm/pricelist, and /eaglefarm/pdf/Web_Price_List.pdf.
The high confidence of these top results suggests that many of the site's visitors are there to get the PDF document at '/eaglefarm/pdf/Web_Price_List.pdf', but their navigation path may take several steps.
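
To make the support, confidence and lift columns concrete, here is a small sketch that computes them for a single rule using the standard definitions. The sessions are made up for illustration, `rule_metrics` is a hypothetical helper, and it assumes the left side occurs in at least one session.

```python
def rule_metrics(transactions, left, right):
    """Return (support, confidence, lift) for the rule left -> right."""
    n = len(transactions)
    left, right = set(left), set(right)
    n_left = sum(1 for t in transactions if left <= set(t))
    n_right = sum(1 for t in transactions if right <= set(t))
    n_both = sum(1 for t in transactions if (left | right) <= set(t))
    support = n_both / n              # share of sessions containing both sides
    confidence = n_both / n_left      # P(right | left)
    lift = confidence / (n_right / n) # confidence relative to P(right)
    return support, confidence, lift

# Hypothetical sessions, not taken from the real log
sessions = [
    ["/eaglefarm", "/eaglefarm/pricelist", "/eaglefarm/pdf/Web_Price_List.pdf"],
    ["/eaglefarm", "/eaglefarm/pricelist"],
    ["/robots.txt"],
    ["/eaglefarm/pricelist", "/eaglefarm/pdf/Web_Price_List.pdf"],
]

support, confidence, lift = rule_metrics(
    sessions, ["/eaglefarm/pricelist"], ["/eaglefarm/pdf/Web_Price_List.pdf"]
)
print(support, confidence, lift)
```

A lift above 1 indicates that the right side is requested more often in sessions containing the left side than in sessions overall.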

I would recommend a review of the site pathways (user experience stories) leading to this PDF document. The decision makers may want to either:
a) place more targeted advertisements on the pathways leading to the document, which could lead users to other offerings or increase revenue.
b) place a link to this PDF document on the home page so users can reach it with fewer clicks, improving the user experience.

In [1]:
import pandas as pd
import numpy as np
from apyori import apriori

#define function to print apriori results neatly
def convert_apriori_results_to_pandas_df(results):
    rules = []
    
    for rule_set in results:
        for rule in rule_set.ordered_statistics:
            # items_base = left side of rules, items_add = right side
            # support, confidence and lift for respective rules
            rules.append([','.join(rule.items_base), ','.join(rule.items_add),
                         rule_set.support, rule.confidence, rule.lift]) 
    
    # typecast it to pandas df
    return pd.DataFrame(rules, columns=['Left_side', 'Right_side', 'Support', 'Confidence', 'Lift']) 

# load the dataset
df = pd.read_csv('database/web_log_data.csv')
# random state
rs = 42
# explore the dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5866 entries, 0 to 5865
Data columns (total 6 columns):
ip           5866 non-null object
date_time    5866 non-null object
request      5866 non-null object
step         5866 non-null int64
session      5866 non-null int64
user_id      5866 non-null int64
dtypes: int64(3), object(3)
memory usage: 275.0+ KB


Unnamed: 0,ip,date_time,request,step,session,user_id
0,c210-49-32-6.rochd2.,18/Apr/2005:21:25:07,/,1,3,3
1,visp.inabox.telstra.,19/Apr/2005:08:24:28,/,1,12,12
2,dsl-61-95-54-84.requ,19/Apr/2005:08:33:01,/,1,13,13
3,d220-236-91-52.dsl.n,19/Apr/2005:09:16:06,/,1,15,15
4,allptrs.eq.edu.au,19/Apr/2005:09:47:54,/,1,22,22


In [2]:
df.request.value_counts()

/                                                                821
/favicon.ico                                                     554
/robots.txt                                                      395
/eaglefarm/javascript/menu.js                                    370
/eaglefarm/pdf/Web_Price_List.pdf                                296
/eaglefarm/                                                      286
/services.html                                                   244
/eaglefarm/pricelist/                                            189
/eaglefarm/pricelist                                             187
/more.html                                                       145
/direct.html                                                     107
/eaglefarm/specials/                                             103
/eaglefarm/contact                                                95
/eaglefarm/contact/                                               93
/eaglefarm                        

In [3]:
# remove the ending / for values so that '/eaglefarm/' and '/eaglefarm' are treated as the same
# note: this also turns the home page request '/' into an empty string
df['request'] = df['request'].str.replace(r'/$', '', regex=True)
df.request.value_counts()

                                                                 821
/favicon.ico                                                     554
/robots.txt                                                      395
/eaglefarm                                                       378
/eaglefarm/pricelist                                             376
/eaglefarm/javascript/menu.js                                    370
/eaglefarm/pdf/Web_Price_List.pdf                                296
/services.html                                                   244
/eaglefarm/contact                                               188
/eaglefarm/specials                                              174
/richlands                                                       169
/more.html                                                       145
/richlands/contact                                               126
/direct.html                                                     107
/eaglefarm/fileupload             

In [4]:
# sort the rows by date_time (ascending by default; the column holds strings, so the order is lexicographic)
df.sort_values(by='date_time', inplace=True)
#group by sessions and list requested resources
sessions = df.groupby(['session'])['request'].apply(list)
# type cast the sessions from pandas into normal list format
session_list = list(sessions)
# run apriori with minimum support of 7.5%
results = list(apriori(session_list, min_support=0.075))
# sort results by confidence and print top 30
result_df = convert_apriori_results_to_pandas_df(results)
result_df = result_df.sort_values(by='Confidence', ascending=False)
print(result_df.head(30))

                            Left_side                         Right_side  \
16  /eaglefarm/pdf/Web_Price_List.pdf               /eaglefarm/pricelist   
15               /eaglefarm/pricelist      /eaglefarm/javascript/menu.js   
9                      /services.html                                      
17               /eaglefarm/pricelist  /eaglefarm/pdf/Web_Price_List.pdf   
10                         /eaglefarm      /eaglefarm/javascript/menu.js   
13               /eaglefarm/pricelist                         /eaglefarm   
11      /eaglefarm/javascript/menu.js                         /eaglefarm   
14      /eaglefarm/javascript/menu.js               /eaglefarm/pricelist   
12                         /eaglefarm               /eaglefarm/pricelist   
0                                                                          
8                                                         /services.html   
6                                                            /robots.txt   
2           

### Second data mining operation: Clustering

a. The rationale for selecting the specific operation/method.

Clustering can help identify groups of users by their activity on the site. Once clusters are established, you can learn more about the mix of users visiting your site and create targeted advertisements, or direct more or less new content at particular user groups.

b. What variables did you include in the analysis and what were their roles and measurement level set? Justify your choice.

For this task we only need to use user_id and request. Users are numbered 1 to 1939, while requests are strings listing the page or file requested by the user.

c. Can you identify data quality issues in order to perform web mining?

As before, we will remove the slash (/) if the request value ends with it, making it the same as other requests, so '/eaglefarm/' becomes '/eaglefarm'.
Since this is an object type column, we will also one-hot encode it to create binary values.
A separate issue was discovered where user_id matched session exactly. This indicates that individual users were not actually identified (for lack of information or other reasons), so each individual session is treated as a separate user. We could have grouped the sessions by the ip column and assumed that sessions from the same ip belong to one user, but this is also not a recommended approach. So for now, we have left this attribute as is.
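
The check behind this observation can be sketched as follows. The rows and host names are hypothetical, and both helpers are illustrative rather than taken from the notebook.

```python
from collections import defaultdict

# Hypothetical log rows; the ip values are made up
rows = [
    {"ip": "a.example", "session": 1, "user_id": 1},
    {"ip": "a.example", "session": 2, "user_id": 2},
    {"ip": "b.example", "session": 3, "user_id": 3},
]

def user_id_mirrors_session(rows):
    """True when every row has the same value for user_id and session."""
    return all(r["user_id"] == r["session"] for r in rows)

def sessions_per_ip(rows):
    """Collect the distinct sessions seen for each ip value."""
    by_ip = defaultdict(set)
    for r in rows:
        by_ip[r["ip"]].add(r["session"])
    return dict(by_ip)

print(user_id_mirrors_session(rows))
print(sessions_per_ip(rows))
```

Grouping by ip would merge sessions 1 and 2 into one "user" here, which shows why that shortcut is risky when several people share an address.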

d. Discuss the results obtained. Discuss also the applicability of findings of the method. Should include a high-level managerial kind of discussion on the findings, should not be just interpretation of results as shown in the results panel.

We can see from the final cluster sizes and the data points closest to the cluster centres that the user grouping is not very different from what we might have expected from the initial data exploration. There is a significant group of users that only visit the site to get '/robots.txt' and another group that doesn't make it past the home/index page.
The remaining groups are all pulled towards the 'eaglefarm' section of the site, and are separated by whether they are retrieving price lists or contact information.
Now that we've sorted this set of visitors into user groups, we can discuss how better to target them with messaging or advertisements, whether that is to expose them to other parts of the site or to focus efforts on increasing the content or value of these parts of the site.

In [5]:
# load the dataset
df = pd.read_csv('database/web_log_data.csv')
# random state
rs = 42
# as before, remove the ending / for values so that '/eaglefarm/' and '/eaglefarm' are treated as the same
# note: this also turns the home page request '/' into an empty string
df['request'] = df['request'].str.replace(r'/$', '', regex=True)
# drop attributes we don't need
df = df.drop(['ip', 'date_time', 'step', 'session'],axis=1)
# group by user_id
df_group = df.groupby(['user_id'])['request'].apply(list)
# explore the dataset
df_group.head()

user_id
1                                        [/robots.txt]
2                        [/code/Global/code/menu.html]
3    [, /favicon.ico, /guarantee.html, /more.html, ...
4                                        [/robots.txt]
5                           [/code/Ultra/services.htm]
Name: request, dtype: object

In [6]:
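# convert the lists into a binary dataframe (one hot encoding)
# note: each list position is encoded separately, so the same page can appear
# under several identically named columns (see the repeated request_/vicpoint columns in the output)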
df_grouped = pd.get_dummies(df_group.apply(pd.Series),prefix='request')
df_grouped.info()
df_grouped.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1939 entries, 1 to 1939
Columns: 731 entries, request_ to request_/whoare.html
dtypes: uint8(731)
memory usage: 1.4 MB


user_id,request_,request_/acacia.html,request_/carindale.html,request_/cbd.html,request_/cgi-bin/FormMail.pl,request_/code/Global/code/emailform.html,request_/code/Global/code/isearch.html,request_/code/Global/code/location.html,request_/code/Global/code/mainframe.html,request_/code/Global/code/menu.html,...,request_/vicpoint,request_/vicpoint,request_/vicpoint,request_/vicpoint,request_/vicpoint,request_/victoriapoint,request_/victoriapoint,request_/whoare.htm,request_/whoare.htm,request_/whoare.html
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
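
The repeated column names in the output above (for example request_/vicpoint) come from encoding each list position separately. A minimal sketch of an alternative that yields exactly one binary column per page is given below. The input is hypothetical, `multi_hot` is an illustrative helper, and this is not the encoding the clustering results in this notebook were produced with.

```python
def multi_hot(pages_by_user):
    """Build one 0/1 column per distinct page, one row per user."""
    columns = sorted({p for pages in pages_by_user.values() for p in pages})
    matrix = {}
    for user, pages in pages_by_user.items():
        visited = set(pages)
        matrix[user] = [1 if col in visited else 0 for col in columns]
    return columns, matrix

# Hypothetical users and requests
pages_by_user = {
    1: ["/robots.txt"],
    2: ["/eaglefarm", "/eaglefarm/pricelist", "/eaglefarm"],
}

columns, matrix = multi_hot(pages_by_user)
print(columns)
print(matrix)
```

With pandas, something like `pd.crosstab(df['user_id'], df['request']).clip(upper=1)` should give a comparable table, although that would change the number of columns and therefore the clustering results reported here.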


In [7]:
# find optimal number of clusters
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min

# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
X = df_grouped.values

# list to save the clusters and cost
clusters = []
inertia_vals = []

# this whole process may take a while
for k in range(2, 15, 2):
    # train clustering with the specified K
    # (n_jobs is omitted because recent scikit-learn versions no longer accept it for KMeans)
    model = KMeans(n_clusters=k, random_state=rs)
    model.fit(X)
    
    # append model to cluster list
    clusters.append(model)
    inertia_vals.append(model.inertia_)

# plot the inertia vs K values
plt.plot(range(2,15,2), inertia_vals, marker='*')
plt.show()

  


[Figure: inertia plotted against the number of clusters K, for K = 2, 4, ..., 14]

In [8]:
# we've chosen 6 as the number of clusters
model = KMeans(n_clusters=6, random_state=rs).fit(X)
# inertia: sum of squared distances from each point to its nearest cluster centre
print("Sum of intra-cluster distance:", model.inertia_)

Sum of intra-cluster distance: 4190.219519991431
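
For reference, the inertia reported by KMeans is the sum of squared distances from each point to its nearest cluster centre. The sketch below computes that quantity by hand on a toy example; the points and centroids are made up and `inertia` is an illustrative helper.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total

# Four toy points and two centroids, each point 0.5 away from its centroid
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
centroids = [(0, 0.5), (1, 0.5)]

print(inertia(points, centroids))
```

Lower values mean tighter clusters, but inertia always falls as K grows, which is why we look for the point where the curve flattens rather than its minimum.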


In [9]:
y = model.predict(X)
df_grouped['Cluster_ID'] = y
# how many records are in each cluster
print("Cluster membership")
print(df_grouped['Cluster_ID'].value_counts())

Cluster membership
0    753
2    715
3    276
5     68
1     64
4     63
Name: Cluster_ID, dtype: int64


In [10]:
# show index of data points closest to the cluster centroids
closest, _ = pairwise_distances_argmin_min(model.cluster_centers_, X)
closest

array([ 97, 107,  14,   0,  78, 237], dtype=int64)
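
The array above holds, for each cluster centre, the positional index of the nearest data point. A plain-Python sketch of the same idea on toy data follows; `closest_point_index` is a hypothetical helper, not the scikit-learn function.

```python
def closest_point_index(centroids, points):
    """For each centroid, return the position of the nearest point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(points)), key=lambda i: sq_dist(points[i], c))
        for c in centroids
    ]

# Toy data: two obvious groups and one centroid near each
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
centroids = [(0, 0.4), (5.8, 5)]

nearest = closest_point_index(centroids, points)
print(nearest)
```

Because these are positions rather than user_id labels, they need positional lookup (or an offset) when used to fetch the matching rows.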

In [11]:
# print the request list of the point closest to each cluster centre
# (closest holds positional indices, so use iloc rather than user_id labels)
for i in closest:
    print(df_group.iloc[i])

['/eaglefarm']
['/eaglefarm', '/eaglefarm', '/eaglefarm/javascript/menu.js', '/eaglefarm/javascript/menu.js', '/eaglefarm/pdf/Web_Price_List.pdf', '/eaglefarm/pdf/Web_Price_List.pdf', '/eaglefarm/pricelist', '/eaglefarm/pricelist']
['']
['/robots.txt']
['/eaglefarm/contact', '/eaglefarm/contact']
['/eaglefarm', '/eaglefarm/javascript/menu.js', '/eaglefarm/pdf/Web_Price_List.pdf', '/eaglefarm/pricelist', '/eaglefarm/pricelist']
