# Practical 9: Web Mining

### In this practical
1. [Loading data](#load)
2. [Cleaning web logs](#prep)
3. [Identifying user sessions](#group)
4. [Grouping users according to navigational similarity](#cluster)

---
**Written by Hendi Lie (h2.lie@qut.edu.au) and Richi Nayak (r.nayak@qut.edu.au). All rights reserved.**

This practical introduces web data processing for mining in Python. The input file for this exercise is `datasets/wdata.txt`, which contains web log data in text format. You will learn to clean the logs and identify user sessions before applying one of the data mining techniques covered in the previous practicals.

## 1. Loading Data

Web mining is a branch of data mining that concentrates on extracting useful information from the web. Its main tasks include resource finding, information selection, preprocessing, generalisation and analysis.

There are several types of web mining, including web usage mining, web structure mining and web content mining. Web usage mining, in particular, allows organisations to analyse user usage patterns, yielding insights that help them improve site design, identify potential customers and improve search results.

Log analysis is generally part of the web usage mining process. It takes raw web data and processes it to extract statistical information, such as:
* Key statistical figures (number of visitors, average number of hits, view time, etc)
* Diagnostic statistics (server reports and page not found errors)
* Server statistics (top pages visited, entry/exit pages)
* Referrer statistics (top referring sites, search engines, keywords)
* User demographics, client statistics and so on.

Web usage mining commonly uses web log data, which contains raw information about the pages served, as recorded by the web server. On its own, this raw data is neither sufficient nor accurate enough to infer user behaviour. Thus, we need to preprocess it to extract meaningful information.

In this practical, we will be using the `wdata.txt` log dataset. First, load it with Python's built-in file handling (not pandas) as follows:

In [5]:
# load logs from wdata; the with block closes the file once the lines are read
with open('datasets/wdata.txt', 'r') as f:
    wdata = f.readlines()

# print the first 3 lines
print('\n'.join(wdata[:3]))

web_logs

j2439.inktomisearch.com - - [18/Apr/2005:21:16:54 +1000] "GET /robots.txt HTTP/1.0" 404 204 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

lj2559.inktomisearch.com - - [18/Apr/2005:21:16:55 +1000] "GET /code/Global/code/menu.html HTTP/1.0" 200 6092 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"



The first few lines of the file show the structure of the dataset. After the `web_logs` header line, each record contains the following fields:

1. Host: The first field is the host. This is either the IP address (202.183.101.13) or the host name (lj2439.inktomisearch.com) of the remote user requesting the page. For performance reasons, many web servers are configured to log the IP address rather than resolve the host name.
2. Identd result: The dash (-) after the host is the response returned by the remote user's identd service. This is almost never used; in every web log I've ever seen, this field is always just a dash (-).
3. Authuser: The authenticated user name, if the website requires authentication; otherwise just a dash (-) is displayed.
4. Date and time: Next comes the date and time inside square brackets, [18/Apr/2005:21:16:54 +1000]. The date is in day/month/year format, followed by the time in 24-hour format and the time zone offset at the end. The offset is relative to Universal Time/Greenwich Mean Time.
5. Request: The request sent by the user, enclosed in double quotes. Normally it looks something like "GET /robots.txt HTTP/1.0". Here GET is the request method, /robots.txt is the path of the requested URL, and HTTP/1.0 is the protocol version.
6. Status code: A 3-digit code returned by the server indicating the outcome of the request. For example, 200 means the request completed successfully and 404 means the requested page could not be found.
7. Bytes sent: The amount of data delivered from the server, excluding the response headers.
 
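The extended version of this log format is called the combined log format, which adds two more fields:

8. Referrer: The page from which the user followed a link to the requested resource, or a dash (-) if none was sent. In the two sample lines above the referrer is "-"; a later record in this dataset has the referrer http://www.copyspecialists.com.au/.
9. Agent: The user agent reported by the remote user's browser. Typically, this is a string describing the type and version of the browser software being used. In the sample lines above, the agent identifies the Yahoo! Slurp crawler and includes its help URL, http://help.yahoo.com/help/us/ysearch/slurp.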
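
The field layout described above can be sketched as a regular expression. This is an illustrative parser, not part of the practical's workflow: the pattern `LOG_PATTERN` and the helper `parse_log_line` are hypothetical names, and the pattern assumes well-formed combined-format lines with no escaped quotes inside the quoted fields.

```python
import re
from datetime import datetime

# Hypothetical pattern for one combined-log-format line (nine fields).
# Assumes no escaped double quotes inside the quoted fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identd>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # %z reads the +1000 offset, so the result is a timezone-aware datetime
    record['timestamp'] = datetime.strptime(
        record['timestamp'], '%d/%b/%Y:%H:%M:%S %z')
    record['status'] = int(record['status'])
    return record


# First sample line from the log shown earlier
sample = ('j2439.inktomisearch.com - - [18/Apr/2005:21:16:54 +1000] '
          '"GET /robots.txt HTTP/1.0" 404 204 "-" '
          '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
          'http://help.yahoo.com/help/us/ysearch/slurp)"')

record = parse_log_line(sample)
print(record['host'], record['status'], record['timestamp'].isoformat())
print(parse_log_line('web_logs'))  # the header line is not a log record
```

Lines that do not follow the format, such as the `web_logs` header, return `None` instead of raising an error, which makes them easy to filter out.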

Now that we know which fields are available, we can reload the dataset with the pandas `read_csv` function. Fields are separated by spaces, so we set the `sep` parameter to `' '`. Quoted fields such as the request and agent are kept intact, but the unquoted timestamp contains a space and is therefore split into two columns. We also pass a list of column names to the `names` parameter so that pandas sets the column names while reading.

In [12]:
import pandas as pd

# set names of pandas dataframe
names=['Host', 'Identd', 'Authuser', 'Date and time', 'Date and time2', 'Request',
       'Status code', 'Bytes Sent', 'Referrer', 'Agent']
# read the dataframe
df = pd.read_csv('datasets/wdata.txt', sep=' ', names=names, header=None)
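
To see why ten column names are needed for nine log fields, the following minimal sketch reads the two sample lines shown earlier from an in-memory string instead of the file. The variable names `sample` and `demo` are illustrative only, and the columns are left with their default integer labels.

```python
import io

import pandas as pd

# The two sample log lines shown earlier, held in memory instead of a file
sample = io.StringIO(
    'j2439.inktomisearch.com - - [18/Apr/2005:21:16:54 +1000] '
    '"GET /robots.txt HTTP/1.0" 404 204 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"\n'
    'lj2559.inktomisearch.com - - [18/Apr/2005:21:16:55 +1000] '
    '"GET /code/Global/code/menu.html HTTP/1.0" 200 6092 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; '
    'http://help.yahoo.com/help/us/ysearch/slurp)"\n'
)

demo = pd.read_csv(sample, sep=' ', header=None)

print(demo.shape)                         # rows, columns
print(demo.iloc[0, 3], demo.iloc[0, 4])   # the timestamp is split at its space
print(demo.iloc[0, 5])                    # the quoted request stays in one column
```

Double-quoted fields survive because `read_csv` treats `"` as its default quote character, while the bracketed timestamp has no quotes and is split into two columns.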

In [14]:
# preview
df.head()

| | Host | Identd | Authuser | Date and time | Date and time2 | Request | Status code | Bytes Sent | Referrer | Agent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | web_logs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | j2439.inktomisearch.com | - | - | [18/Apr/2005:21:16:54 | +1000] | GET /robots.txt HTTP/1.0 | 404.0 | 204.0 | - | Mozilla/5.0 (compatible; Yahoo! Slurp; http://... |
| 2 | lj2559.inktomisearch.com | - | - | [18/Apr/2005:21:16:55 | +1000] | GET /code/Global/code/menu.html HTTP/1.0 | 200.0 | 6092.0 | - | Mozilla/5.0 (compatible; Yahoo! Slurp; http://... |
| 3 | c210-49-32-6.rochd2.qld.optusnet.com.au | - | - | [18/Apr/2005:21:25:07 | +1000] | GET / HTTP/1.1 | 200.0 | 7138.0 | http://www.google.com.au/search?hl=en&q=snap+p... | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us... |
| 4 | c210-49-32-6.rochd2.qld.optusnet.com.au | - | - | [18/Apr/2005:21:25:07 | +1000] | GET /images/index3_01.gif HTTP/1.1 | 200.0 | 382.0 | http://www.copyspecialists.com.au/ | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us... |


The first row of this data is the `web_logs` header line, not a web log record, so we drop it.

In [15]:
df.drop(0, inplace=True)  # drop the row with index 0, on axis 0 (row-wise)
df.head()  # preview after

| | Host | Identd | Authuser | Date and time | Date and time2 | Request | Status code | Bytes Sent | Referrer | Agent |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | j2439.inktomisearch.com | - | - | [18/Apr/2005:21:16:54 | +1000] | GET /robots.txt HTTP/1.0 | 404.0 | 204 | - | Mozilla/5.0 (compatible; Yahoo! Slurp; http://... |
| 2 | lj2559.inktomisearch.com | - | - | [18/Apr/2005:21:16:55 | +1000] | GET /code/Global/code/menu.html HTTP/1.0 | 200.0 | 6092 | - | Mozilla/5.0 (compatible; Yahoo! Slurp; http://... |
| 3 | c210-49-32-6.rochd2.qld.optusnet.com.au | - | - | [18/Apr/2005:21:25:07 | +1000] | GET / HTTP/1.1 | 200.0 | 7138 | http://www.google.com.au/search?hl=en&q=snap+p... | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us... |
| 4 | c210-49-32-6.rochd2.qld.optusnet.com.au | - | - | [18/Apr/2005:21:25:07 | +1000] | GET /images/index3_01.gif HTTP/1.1 | 200.0 | 382 | http://www.copyspecialists.com.au/ | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us... |
| 5 | c210-49-32-6.rochd2.qld.optusnet.com.au | - | - | [18/Apr/2005:21:25:07 | +1000] | GET /images/index3_02.gif HTTP/1.1 | 200.0 | 1284 | http://www.copyspecialists.com.au/ | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us... |


After the loading process is completed, we should explore this dataset. Run the following code cells for exploration.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51327 entries, 1 to 51327
Data columns (total 10 columns):
Host              51327 non-null object
Identd            51327 non-null object
Authuser          51327 non-null object
Date and time     51327 non-null object
Date and time2    51327 non-null object
Request           51327 non-null object
Status code       51327 non-null float64
Bytes Sent        51327 non-null object
Referrer          51327 non-null object
Agent             51327 non-null object
dtypes: float64(1), object(9)
memory usage: 4.3+ MB


There are **51327** rows and 10 columns in this dataset. Every column except `Status code` is of object (string) type.

## 2. Cleaning web logs

### 2.1. Removing useless requests

In this step, we remove requests for resources that are not analysed, such as embedded objects, graphics and sound files, as well as requests generated by spider (web crawler) navigation. The list may change depending on the planned analysis. For example, to analyse the performance of a web cache application, image and graphic files need to stay in the dataset. The main reasons for this cleaning are to reduce storage space and to simplify the upcoming tasks, because such requests generally make up 30% to 40% of the total dataset.
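
Spider requests can be flagged in more than one way. The sketch below is one possible heuristic on toy records, not the method used in this practical: the keyword list `ROBOT_KEYWORDS`, the example hosts and the rule that any host requesting `/robots.txt` is a crawler are all assumptions made for illustration.

```python
import pandas as pd

# Toy records for illustration only; they are not taken from wdata.txt
logs = pd.DataFrame({
    'Host': ['crawler-a.example.com', 'crawler-b.example.com',
             'crawler-b.example.com', 'visitor.example.net'],
    'Request': ['GET /robots.txt HTTP/1.0', 'GET /robots.txt HTTP/1.0',
                'GET /menu.html HTTP/1.0', 'GET / HTTP/1.1'],
    'Agent': ['Mozilla/5.0 (compatible; Yahoo! Slurp)', 'Mozilla/4.0',
              'Mozilla/4.0', 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us)'],
})

# Assumed keyword list; real crawlers may not match any of these words
ROBOT_KEYWORDS = 'slurp|bot|crawler|spider'
agent_is_robot = logs['Agent'].str.contains(ROBOT_KEYWORDS, case=False, regex=True)

# Assumption: a host that asks for /robots.txt is a crawler in all its requests
asked_for_robots = logs['Request'].str.contains('/robots.txt', regex=False)
robot_hosts = logs.loc[asked_for_robots, 'Host'].unique()
host_is_robot = logs['Host'].isin(robot_hosts)

# Keep only the records that neither rule flags
human_logs = logs[~(agent_is_robot | host_is_robot)]
print(human_logs['Host'].tolist())
```

Combining the two rules catches crawlers that do not announce themselves in the agent string, at the cost of occasionally dropping a human visitor who opened `/robots.txt` directly.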

Use the following code cell to remove requests for image files (`.gif`, `.jpg` and `.jpeg`) from the log.

In [17]:
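# match the extensions literally; by default str.contains treats the pattern
# as a regular expression, in which '.' matches any character
mask = (df['Request'].str.contains('.gif', regex=False)
        | df['Request'].str.contains('.jpg', regex=False)
        | df['Request'].str.contains('.jpeg', regex=False))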
print("Before:", len(df))

# invert the mask, only keep records without .gif, .jpg and .jpeg in the request column
df = df[~mask]
print("After:", len(df))

Before: 51327
After: 5866


After the useless requests are removed, 5866 of the 51327 records remain, roughly 11% of the logs.
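
When filtering by file extension, keep in mind that `str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so an unescaped `.` matches any character. The sketch below compares three ways of matching on toy requests; the third path is hypothetical and was chosen only because it contains the letters "gif" without being an image.

```python
import pandas as pd

requests = pd.Series([
    'GET /images/index3_01.gif HTTP/1.1',
    'GET /photo.jpeg HTTP/1.1',
    'GET /biggift/list.html HTTP/1.1',  # hypothetical path, not an image
])

# Default regex matching: '.' matches any character, so "ggif" also matches
loose = requests.str.contains('.gif')

# Literal matching: only a real ".gif" substring matches
literal = requests.str.contains('.gif', regex=False)

# One escaped pattern covering .gif, .jpg and .jpeg in any letter case
strict = requests.str.contains(r'\.(?:gif|jpe?g)\b', case=False)

print(pd.DataFrame({'loose': loose, 'literal': literal, 'strict': strict}))
```

The single escaped pattern is shorter than chaining three masks with `|` and also catches upper-case extensions, but whether that is wanted depends on the analysis.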

**To be continued**