# 04 Evaluating the trained self-organising map (SOM)

## 4.1 Results

A weight matrix is trained using the parameters selected in Notebook 3, and its response to 100 vectors drawn at random from the test set is generated.


In [1]:
from pickle import load
data_path = 'data'

# load training and testing data, closing each file once it has been read
with open(data_path + '/training_data.pkl', 'rb') as fh:
    training_data = load(fh)
with open(data_path + '/testing_data.pkl', 'rb') as fh:
    testing_data = load(fh)

18

In [3]:
from featureextractionsom.functions.somap import get_trained_som
from featureextractionsom.functions.matrix_operations import build_node_matrix
from featureextractionsom.functions.evaluation import get_test_vectors
import random as rd

# generate and train a matrix of weight vectors
trained_matrix = get_trained_som(training_data)

rd.seed(42)
# choose a subset of 100 testing vectors at random
test_indices, test_subset = zip(*get_test_vectors(testing_data, 100))

In [6]:
from featureextractionsom.functions.matrix_operations import distance
from featureextractionsom.functions.utils import try_make_folder
from featureextractionsom.functions.evaluation import record_response
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# set image filepath
save_outputs_path = 'output/4_1'
try_make_folder(save_outputs_path)

# Create list of titles for plots indicating which vector generated the response
plot_titles = [f'Vector {v}' for v in test_indices]

# Create figure to display results
fig = make_subplots(
    rows=10, cols=10,
    subplot_titles = plot_titles)

# iterate through test vectors as well as through rows and columns of figure
i=0
for r in range(1,11):
    for c in range(1,11):
        # record response between test vector and SOM
        response = np.array([[distance(test_subset[i], node) for node in row] for row in trained_matrix])
        record_response(response, save_outputs_path + f'/vector{test_indices[i]}.png')
        
        # Add response to figure
        fig.add_trace(go.Heatmap(z=response, colorscale='Viridis_r', coloraxis="coloraxis"),
                      row=r, col=c)
        # Move to next test vector in list
        i+=1

# Display figure
fig.update_layout(
    height=1000, 
    width=1000,
    showlegend=False,
    font = dict(size=1),
    coloraxis = dict(colorscale='Viridis_r', colorbar = dict(ticks='outside')))
fig.show()

While some responses are certainly stronger than others, the matrix does seem to respond differently to different groups of vectors. Examining the saved images shows that most vectors generate a response from one or two small areas of the matrix.

With a shared colour axis it is difficult to distinguish the clusters for some vectors, so an overview and a selection of individual vectors, each with its own colour axis, are displayed below.

![all_responses](notebook_images/responses.png)

Vector 1516 | Vector 1494 | Vector 952 | Vector 130
---|---|---|---
![image](output/4_1/vector1516.png)|![image](output/4_1/vector1494.png)|![image](output/4_1/vector952.png)|![image](output/4_1/vector130.png)

All of the responses can be viewed under `output/4_1`. 
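
The response computed in the loop above can be sketched in vectorised form. This is a minimal sketch, not the project's implementation: it assumes that `distance` is the Euclidean distance and that the trained matrix has shape `(rows, cols, n_features)`. The name `response_map` and the toy values are hypothetical.

```python
import numpy as np

def response_map(weights, vector):
    """Return the distance from `vector` to every node of the map.

    Assumes `weights` has shape (rows, cols, n_features) and that the
    response is the Euclidean distance (an assumption, see lead-in).
    """
    weights = np.asarray(weights, dtype=float)
    vector = np.asarray(vector, dtype=float)
    # Broadcasting subtracts the vector from every node's weight vector
    return np.linalg.norm(weights - vector, axis=-1)

# Toy 2x2 map with 2 features per node (illustrative values only)
toy_weights = np.array([[[0.0, 0.0], [1.0, 0.0]],
                        [[0.0, 1.0], [1.0, 1.0]]])
toy_response = response_map(toy_weights, [1.0, 0.0])
print(toy_response)
```

Under this definition a small value means a close match, which is consistent with the reversed colour scale (`Viridis_r`) used for the response plots.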



## 4.2 Comparing the features of the dataset

The matrix can be used to explore the features of the dataset. The testing data is first converted to a pandas DataFrame as a reminder of the original features.

In [5]:
# rebuild pandas DataFrame from testing data
import pandas as pd

with open(data_path + '/colnames.pkl', 'rb') as fh:
    colnames = load(fh).tolist()

features = pd.DataFrame(data=testing_data, columns=colnames)
features.head()

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.041667 | 0.002354 | 0.070922 | 0.020416 | 0.020408 | 0.061224 | 0.014202 | 0.0 | 0.3 | 0.428571 | 0.0 | 0.375 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.37037 | 0.170396 | 0.0 | 0.0 | 0.031206 | 0.008438 | 0.0 | 0.02381 | 0.070579 | 0.0 | 0.8 | 0.142857 | 0.083333 | 0.25 | 0.052632 | 0.5 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.011348 | 0.011348 | 0.375 | 0.541667 | 0.0 | 0.6 | 0.3 | 0.142857 | 0.083333 | 0.0 | 0.947368 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.024113 | 0.01 | 0.0 | 0.058824 | 0.175415 | 0.0 | 0.8 | 0.142857 | 0.083333 | 0.25 | 0.157895 | 0.5 | 0.0 | 1.0 |
| 4 | 0.037037 | 0.002501 | 0.0 | 0.0 | 0.019858 | 0.008501 | 0.0 | 0.038462 | 0.0 | 0.0 | 0.3 | 0.142857 | 0.083333 | 0.25 | 0.210526 | 0.5 | 1.0 | 0.0 |


The closest weight vector to each observation in the testing set is found, and a matrix `bmu_count` is constructed to show, for each node of the SOM, the number of observations for which that node is the BMU (best matching unit).

In [6]:
# Add column to show the BMU of the trained matrix for each observation to identify patterns.
from featureextractionsom.functions.matrix_operations import find_closest
import numpy as np
bmu = []

# Find closest weight vector to each test vector
for i in range(len(testing_data)):
    bmu.append(find_closest(trained_matrix, testing_data[i]))
    
# Add to Pandas DataFrame
features['bmu'] = bmu

# Construct matrix to visualise count of each weight vector that is a BMU
bmu_count = np.zeros((15,15))
for x,y in bmu:
    bmu_count[x][y] +=1
    
bmu_count

array([[ 13.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   1.,   5.,
          0.,   4.,   0.,   0.],
       [  0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   3.,   0.,
          0.,   0.,   0.,   0.],
       [130.,  14.,   0.,  32.,   0.,  29.,   0.,   0.,  18.,  29.,   2.,
          0.,   0.,   0.,   0.],
       [  0.,   8.,   0.,   0.,   0.,   0.,  13.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [  0.,  12.,   0.,   0.,  93.,  82.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [  0.,   5.,   0., 116.,  30.,  31.,   0.,   0.,   0.,   0.,   2.,
          1.,  12.,   0.,   0.],
       [  1.,  23.,   2.,   0.,  18.,   2.,   0.,   0.,   0.,   0.,   1.,
         88.,   9.,   0.,   0.],
       [ 86.,  44.,  37.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   1.,   0.,   0.],
       [  0.,   2.,   8.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   5.,
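
The BMU search and count can be sketched with NumPy alone. This is a hedged illustration rather than the body of `find_closest`: it assumes Euclidean distance and a weight array of shape `(rows, cols, n_features)`, and the names `find_bmu` and `count_bmus` are hypothetical.

```python
import numpy as np

def find_bmu(weights, vector):
    """Return the (row, col) of the node whose weights are nearest to `vector`."""
    diff = np.asarray(weights, dtype=float) - np.asarray(vector, dtype=float)
    distances = np.linalg.norm(diff, axis=-1)
    # argmin works on the flattened grid, so convert back to 2-D coordinates
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    return int(row), int(col)

def count_bmus(weights, data):
    """Count how many observations select each node as their BMU."""
    weights = np.asarray(weights, dtype=float)
    counts = np.zeros(weights.shape[:2])
    for vector in data:
        counts[find_bmu(weights, vector)] += 1
    return counts

# Toy 2x2 map and three observations (illustrative values only)
toy_weights = np.array([[[0.0, 0.0], [1.0, 0.0]],
                        [[0.0, 1.0], [1.0, 1.0]]])
toy_data = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.8]])
toy_counts = count_bmus(toy_weights, toy_data)
print(toy_counts)
```

Taking the grid shape from the weights, rather than hard-coding it, keeps the count matrix in step with the map if its size changes.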

The 'Revenue' feature is isolated to see if it is correlated with a particular cluster.

For every observation where Revenue (`testing_data[i][17]`) is 1, the element of a zero matrix at the coordinates of that observation's BMU is increased by 1. This produces a matrix that counts, for each node, the purchase sessions that had that node as their BMU.

In [7]:
bmu_rev = np.zeros((15,15))
# isolate revenue feature (column 17)
revenues = testing_data[:,17]

# Iterate through the list of best matching units.
# If a BMU corresponds to a revenue-generating browsing session,
# add one to the bmu_rev matrix at that position.
for i, (x,y) in enumerate(bmu):
    if revenues[i]==1:
        bmu_rev[x][y] += 1

# Visualise the BMU count across the matrix for revenue-generating visits
fig = go.Figure(data=go.Heatmap(z=bmu_rev,colorscale='Viridis'))
fig.show()

It seems that those sessions in which revenue was generated almost exclusively respond best to the weight vector at node (0,1).
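
One way to check how concentrated the revenue sessions are is to compute the share that falls on the busiest node. The sketch below uses toy data and hypothetical names (`count_where`, `toy_bmu`, `toy_revenue`); it is not taken from the project code.

```python
import numpy as np

def count_where(bmu_coords, mask, grid_shape):
    """Count BMU hits per node, using only observations where `mask` is True."""
    coords = np.asarray(bmu_coords)[np.asarray(mask, dtype=bool)]
    counts = np.zeros(grid_shape)
    # np.add.at accumulates correctly when a coordinate appears more than once
    np.add.at(counts, (coords[:, 0], coords[:, 1]), 1)
    return counts

# Toy BMU list and revenue flags (illustrative values only)
toy_bmu = [(0, 1), (0, 1), (1, 0), (0, 1)]
toy_revenue = np.array([1.0, 0.0, 1.0, 1.0])

toy_rev_counts = count_where(toy_bmu, toy_revenue == 1, (2, 2))
# Matrix coordinates of the node with the most revenue sessions
top_row, top_col = np.unravel_index(np.argmax(toy_rev_counts), toy_rev_counts.shape)
# Fraction of all revenue sessions that map to that single node
top_share = toy_rev_counts.max() / toy_rev_counts.sum()
print((int(top_row), int(top_col)), top_share)
```

When comparing node coordinates with a plot, note that `go.Heatmap` draws `z[i][j]` at `x=j`, `y=i`, so the matrix row index appears on the vertical axis.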

This exercise is repeated for all features. Instead of adding 1 when the feature is 1, the value of the feature for each observation (between 0 and 1) is added to the element at the position of the BMU. Because a shared colour scale can obscure detail, each response is also saved in the folder `output/4_2`.

In [8]:
# Set output filepath
save_featureresponses_path = 'output/4_2'
try_make_folder(save_featureresponses_path)

sp_rows = 6
sp_cols = 3

# Generate figure to visualise responses
fig = make_subplots(
    rows=sp_rows, cols=sp_cols,
    subplot_titles = features.columns)

# Iterate through features in parallel with rows and columns of figure
f=0
for r in range(1,sp_rows+1):
    for c in range(1,sp_cols+1):
        bmu_feature = np.zeros((15,15))
        # Iterate through the list of best matching units.
        # Add the value of the given feature (features.columns[f]) to the matrix
        # at the position of the BMU for each test vector.
        for i, (x,y) in enumerate(bmu):
            bmu_feature[x][y] += testing_data[i][f]
        record_response(bmu_feature, save_featureresponses_path + f'/{features.columns[f]}.png', reverse_colourscale=False)
        fig.add_trace(go.Heatmap(z=bmu_feature, colorscale='Viridis', coloraxis = "coloraxis"),
                      row=r, col=c)
        f+=1

# Display results
fig.update_layout(
    height=1600,
    width=800,
    showlegend=False,
    coloraxis = {'colorscale':'viridis'})
fig.show()
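
The per-feature accumulation can also be done for all features in one pass. This is a sketch under the assumption that the BMU list holds integer `(row, col)` pairs; `feature_maps` and the toy values are hypothetical.

```python
import numpy as np

def feature_maps(bmu_coords, data, grid_shape):
    """Sum each feature's values at the BMU of every observation.

    Returns an array of shape (rows, cols, n_features), so that
    maps[:, :, f] is the map for feature f.
    """
    coords = np.asarray(bmu_coords)
    data = np.asarray(data, dtype=float)
    maps = np.zeros(tuple(grid_shape) + (data.shape[1],))
    # Each observation adds its whole feature vector to its BMU's cell
    np.add.at(maps, (coords[:, 0], coords[:, 1]), data)
    return maps

# Toy BMU list and two-feature observations (illustrative values only)
toy_bmu = [(0, 1), (0, 1), (1, 0)]
toy_data = np.array([[0.5, 1.0], [0.25, 0.0], [1.0, 0.5]])
toy_maps = feature_maps(toy_bmu, toy_data, (2, 2))
print(toy_maps[:, :, 0])
```

Because these maps are sums, a node that is the BMU for many observations can appear strong even when the individual feature values are small, which is worth keeping in mind when reading the plots.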

### Observations

Unsurprisingly, other features that are grouped with the revenue cluster are ProductRelated and ProductRelated_Duration. More surprisingly, Administrative also appears here, possibly due to customers viewing their orders after making a purchase.

PageValues | ProductRelated | ProductRelated_Duration | Administrative
---|---|---|---
![image](output/4_2/PageValues.png) | ![image](output/4_2/ProductRelated.png) | ![image](output/4_2/ProductRelated_Duration.png) |  ![image](output/4_2/Administrative.png)

The features that most influence the top-right area of the matrix (farthest from the revenue responses) are:

Informational_Duration | BounceRates | ExitRates | Weekend 
---|---|---|---
![image](output/4_2/Informational_Duration.png) | ![image](output/4_2/BounceRates.png) | ![image](output/4_2/ExitRates.png) | ![image](output/4_2/Weekend.png)

This suggests that the weekend is not a particularly good time for sales; weekend visitors tend to gather information instead.

Another surprising observation is that the node that responds most strongly to SpecialDay is adjacent to the one for TrafficType. This might relate to a change in buying habits on special occasions, with people searching for gifts on a desktop rather than impulse buying on a phone, for example.

SpecialDay | TrafficType
---|---
![image](output/4_2/SpecialDay.png) | ![image](output/4_2/TrafficType.png)
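
Adjacency between the strongest nodes of two feature maps can be checked numerically instead of by eye. The sketch below is illustrative only: the toy maps are invented, and `peak_node` and `grid_distance` are hypothetical helpers.

```python
import numpy as np

def peak_node(feature_map):
    """Return the (row, col) of the largest value in a feature map."""
    feature_map = np.asarray(feature_map)
    row, col = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return int(row), int(col)

def grid_distance(node_a, node_b):
    """Chebyshev distance on the grid: 1 means the nodes touch, diagonals included."""
    return max(abs(node_a[0] - node_b[0]), abs(node_a[1] - node_b[1]))

# Toy 3x3 feature maps (illustrative values only)
toy_special_day = np.array([[0.0, 0.1, 0.0],
                            [0.0, 2.5, 0.3],
                            [0.0, 0.0, 0.0]])
toy_traffic_type = np.array([[0.2, 0.0, 0.0],
                             [0.0, 0.4, 3.1],
                             [0.0, 0.0, 0.0]])

peak_a = peak_node(toy_special_day)
peak_b = peak_node(toy_traffic_type)
print(peak_a, peak_b, grid_distance(peak_a, peak_b))
```

A distance of 1 corresponds to neighbouring nodes; larger values indicate that the two features peak in separate regions of the map.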

## 4.3 Conclusions

A large number of clusters were found in the dataset, one of which appeared to correspond with revenue-generating visits. This information and technique could be applied to identify which features are most useful to track to predict revenue, to forecast sales based on web traffic activity, and to deploy a chatbot that could engage website visitors most likely to purchase based on which cluster their browsing activity falls into.

### Applications

The hyper-parameters of the matrix could potentially be fine-tuned further to ensure that no vector produces a high response from two separate clusters. The project could also be scaled up to data with many more features, which could then be reduced using the SOM for feature extraction, replacing 50-60 features with a more manageable number.

The matrix could be retrained using data from a specific e-commerce business to see whether it forms the same clusters, and to inform the business about its visitors' browsing behaviour.