# Visualizing with Image Mouseovers

This notebook shows an example of using bokeh to make a beautiful plot of clusters, complete with colour by cluster and image previews on mouseover.

In this example from Frederick Lemay and Valerie Poulin, we cluster phishing webpages based on their alt text. We then want to view images with our cluster graph to get a sense of how effective the clustering is.

In section <b>3. Plot with Mouseover</b> we make a bokeh plot from the data.

## 1. Import and Initialize

For plotting alone, you only need the bokeh imports (lines 1-4 below).

In [1]:
from bokeh.plotting import figure, output_file, output_notebook, show, ColumnDataSource
from bokeh.transform import factor_cmap
from bokeh.palettes import viridis
from bokeh.models import HoverTool
import logging
from logging import handlers
import mysql.connector
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import hdbscan
import umap
from matplotlib import pyplot as plt
import os
import shutil
import sys

### 1.1 Path Variables

(for Frederick G's local machine).

<b>output_file</b> sets the location of the HTML plot written by bokeh. To display the plot inline in Jupyter, use <b>output_notebook</b> instead.

<b>IMAGE_PATH</b> is a folder containing every image in the set, named by hash.

<b>DATABASE</b> holds the credentials for accessing the SQL database from GeekWeekV.

<b>LOG_PATH</b> is the directory where the logger writes its files (see 1.2).

<b>CLUSTER_PATH</b> is the folder that main() fills with one subfolder of images per cluster.
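
The paths and credentials in the next cell are hard-coded for one machine. As a hedged alternative, the sketch below reads the same settings from environment variables with fallbacks. The variable names (`PHISH_BASE_DIR`, `PHISH_DB_USER`, and so on) and the helper `load_settings` are hypothetical and not part of the original notebook.

```python
import os

def load_settings(env=None):
    """Build path and database settings from environment variables.

    Every environment variable name used here is an illustrative assumption.
    """
    env = os.environ if env is None else env
    base = env.get("PHISH_BASE_DIR", os.getcwd())
    paths = {
        "SOURCE_PATH": os.path.join(base, "html"),
        "CLUSTER_PATH": os.path.join(base, "cluster"),
        "LOG_PATH": os.path.join(base, "log"),
        "IMAGE_PATH": os.path.join(base, "image"),
    }
    database = {
        "user": env.get("PHISH_DB_USER", ""),
        "password": env.get("PHISH_DB_PASSWORD", ""),
        "host": env.get("PHISH_DB_HOST", "localhost"),
        "database": env.get("PHISH_DB_NAME", ""),
    }
    return paths, database

paths, database_from_env = load_settings()
```

Keeping the password out of the notebook this way means the file can be shared without editing it first.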

In [2]:
#output_file(r"C:\Users\nafmo\Desktop\GeekWeekV\Frederick_phishing\clusters.html")
SOURCE_PATH = r'C:\Users\nafmo\Desktop\GeekWeekV\Frederick_phishing\html'
CLUSTER_PATH = r'C:\Users\nafmo\Desktop\GeekWeekV\Frederick_phishing\cluster'
LOG_PATH = r"C:\Users\nafmo\Desktop\GeekWeekV\Frederick_phishing\log"
IMAGE_PATH = r"C:\Users\nafmo\Desktop\GeekWeekV\Frederick_phishing\image"
DATABASE = {
    'user': 'geekweek',
    'password': 'g33kw33k@LAC!',
    'host': 'databank.ccirc.geekweek',
    'database': 'geekweek32'
}

In [3]:
output_notebook()

### 1.2 Logger & Database

Functions to create a logger and connect to SQL database.

In [4]:
def create_logger(name):
    LOG_FORMAT = '%(asctime)s - %(name)8s - %(levelname)8s - %(lineno)4d - %(message)s'
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    fh = handlers.TimedRotatingFileHandler('%s/%s.log' % (LOG_PATH, name), when='midnight', interval=1, backupCount=14)
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter(LOG_FORMAT)
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    if (logger.hasHandlers()):
        logger.handlers.clear()
    logger.addHandler(fh)
    logger.addHandler(ch)
    return logger

def create_database():
    global logger
    logger.info("Connecting to database.")
    db = mysql.connector.connect(user=DATABASE['user'],
                                 password=DATABASE['password'],
                                 host=DATABASE['host'],
                                 database=DATABASE['database'],
                                 auth_plugin='mysql_native_password',
                                 buffered=True)
    return db
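
`create_logger` clears existing handlers before adding new ones because `logging.getLogger(name)` returns the same logger object on every call. Without the clearing step, re-running the cell would attach a second set of handlers and each message would be printed more than once. A minimal sketch of that behaviour, using only a stream handler and a hypothetical logger name `demo`:

```python
import logging

def make_stream_logger(name):
    """Return a logger with exactly one stream handler, even when called repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # getLogger returns the same object each time, so drop handlers left by earlier calls.
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter('%(asctime)s - %(name)8s - %(levelname)8s - %(message)s'))
    logger.addHandler(handler)
    return logger

first = make_stream_logger('demo')
second = make_stream_logger('demo')  # simulates re-running the notebook cell
second.info("Logged once, not twice.")
```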

## 2. Vectorize and Cluster

Here we define a function that clusters text "documents" (each a string of a website's alt text) based on n-gram frequency.

The function takes the values and hashes retrieved from the database in section 4.
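
To make "n-gram frequency" concrete, the sketch below counts word n-grams of length 1 to 3 in a single document using only the standard library. It approximates what `CountVectorizer(ngram_range=(1, 3), lowercase=True)` counts per document; the helper name `ngram_counts` and the sample string are assumptions made for illustration, and the `min_df=2` filter (which drops n-grams seen in fewer than two documents) is not reproduced here.

```python
import re
from collections import Counter

def ngram_counts(text, n_min=1, n_max=3):
    """Count word n-grams of length n_min..n_max in one document."""
    # Lowercase and keep tokens of two or more word characters,
    # which approximates CountVectorizer's default tokenization.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("Sign in to PayPal")
print(sorted(counts))
```

Each distinct n-gram becomes one column of the matrix `X`, and each document becomes one row of counts.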

In [5]:
def vectorize(values, hashes):
    '''Input lists of values (text strs) and hashes (text strs) corresponding to phishing sites.
    Plot a visualization of the clustering using bokeh, see pretty_plot()
    Output a list of cluster labels, used for sorting files in main()'''
    ngram_vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, lowercase=True)
    X = ngram_vectorizer.fit_transform(values)
    clusterer = hdbscan.HDBSCAN()
    clusterer.fit(X)
    model = umap.UMAP(n_components=2)
    embedding = model.fit_transform(X)
    labels = clusterer.labels_
    #
    #
    pretty_plot(embedding, labels, values, hashes)
    #
    #
    return clusterer.labels_

## 3. Plot with Mouseover

In [6]:
def pretty_plot(embedding, labels, values, hashes):
    global viridis #use your palette of choice as imported from bokeh
    
    '''First, we define the data for our plot. 
        x,y     are coordinates from UMAP reduction.
        images  are paths for display (can also be web addresses). 
        labels  are numbers of clusters courtesy of HDBSCAN.
        values  are strings of website's alt text.'''
    x = embedding[:,0] #define x, y as the two columns from UMAP's output, keeping the indices consistent to values, hashes, etc.
    y = embedding[:,1]
    images = [os.path.join(IMAGE_PATH, "%s.png" % (hashes[i])) for i in range(0,len(hashes))] #define images by their path on the local drive
    strlabels = [str(x) for x in labels] #bokeh requires str for labels, not int
    values = [x[:40] for x in values] #shorten values for easy display
    
    '''Next, put the data in a ColumnDataSource object for easy reading by bokeh.'''
    datadict = dict(x=x, y=y,values=values,hashes=hashes,images=images,labels=strlabels)
    source = ColumnDataSource(datadict)
    
    '''The "fun" part: supply HTML formatting for the mouseover tooltip.
    Ours shows the image in a 150x150 thumbnail (try also height="auto").
    We show the index from our lists, the hash for finding the image locally, and the alt text [:40] used for clustering.
    We include the location and the cluster ID.
    See also https://bokeh.pydata.org/en/latest/docs/user_guide/tools.html#custom-tooltip '''
    TOOLTIPS = """
    <div>
        <div>
            <img
                src="file://@images" height="150" width="150" alt="file://@images"
                style="float: left; margin: 0px 10px 10px 0px;"
                border="2"
            ></img>
        </div>
        <div>
            <span style="font-size: 15px; color: #966;">[$index]</span>
            <span style="font-size: 10px;">@hashes</span>
        </div>
        <div>
            <span style="font-size: 10px;">@values</span>
        </div>
        <div>
            <span style="font-size: 15px;">Location</span>
            <span style="font-size: 10px; color: #696;">($x, $y)</span>
        </div>
        <div>
            <span style="font-size: 15px;">Cluster ID </span>
            <span style="font-size: 15px; color: #855; font-weight: bold;">@labels</span>
        </div>
    </div>
"""
    
    '''Set our color palette from bokeh's viridis, adjusted to number of clusters.
    This lets us give each cluster a unique color.'''
    lsslabels = list(set(strlabels)) #bokeh throws errors at certain datatypes... I'm sure there is a cleaner way to do this in the future.
    n_outlier_labels = 1 if '-1' in lsslabels else 0 #HDBSCAN may label every point, in which case there is no '-1'
    palette = list(viridis(len(lsslabels) - n_outlier_labels)) #one color per real cluster (max 256); a new name keeps the imported viridis function callable on the next run
    if n_outlier_labels:
        palette.insert(lsslabels.index('-1'), '#000000') #make outliers (label=='-1') pure black (or bright red is #f00000)
    
    '''Define the bokeh plot.
    Fiddle with the visual settings to your liking!'''
    p = figure(title='Clustering Phishing Sites', active_scroll='wheel_zoom', tooltips=TOOLTIPS)
    p.background_fill_color = "beige"
    p.background_fill_alpha = 0.75
    p.circle('x', 'y', source=source, size=6.66, color=factor_cmap('labels', palette=palette, factors=lsslabels))

    show(p)
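
The palette step in `pretty_plot` gives every cluster its own color and reserves black for the unclustered label `-1`. The sketch below shows the same idea without importing bokeh, so it can be tested on its own. The helper `build_palette` and the short hard-coded color list are assumptions for illustration; in the notebook the colors come from bokeh's `viridis`.

```python
def build_palette(labels, colors, outlier_color='#000000'):
    """Return (factors, palette) with one color per distinct label.

    labels: cluster labels, where -1 marks unclustered points.
    colors: candidate colors for the real clusters (a stand-in for a bokeh palette).
    """
    # Sort numerically so the factor order is stable between runs.
    factors = [str(label) for label in sorted(set(labels))]
    cluster_factors = [f for f in factors if f != '-1']
    if len(cluster_factors) > len(colors):
        raise ValueError("not enough colors for the number of clusters")
    mapping = dict(zip(cluster_factors, colors))
    mapping['-1'] = outlier_color
    palette = [mapping[f] for f in factors]
    return factors, palette

# A few viridis-like hex codes, hard-coded for illustration only.
demo_colors = ['#440154', '#2E6B8E', '#27AD80', '#FDE724']
factors, palette = build_palette([0, 1, -1, 1, 2], demo_colors)
```

The two returned lists line up index by index, which is the shape `factor_cmap` expects for its `factors` and `palette` arguments.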

## 4. main() 

This function retrieves a sample (1k) of hashes and alt text from the database.

Then it calls vectorize(values, hashes) to cluster and plot in one line.

Finally, it copies the image files into per-cluster folders under CLUSTER_PATH for viewing.
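
The file-sorting step can be tried in isolation before pointing it at real data. The sketch below is a standalone version run against a temporary directory; the helper name `sort_images` and the fake hashes are assumptions, and unlike main() it returns the list of missing files instead of silently skipping them.

```python
import os
import shutil
import tempfile

def sort_images(hashes, labels, image_dir, cluster_dir):
    """Copy <hash>.png from image_dir into cluster_dir/<label>/; return the missing files."""
    missing = []
    for file_hash, label in zip(hashes, labels):
        target_dir = os.path.join(cluster_dir, str(label))
        os.makedirs(target_dir, exist_ok=True)
        source_file = os.path.join(image_dir, "%s.png" % file_hash)
        if os.path.isfile(source_file):
            shutil.copy(source_file, target_dir)
        else:
            missing.append(source_file)
    return missing

# Demo in a throwaway directory with two fake images and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = os.path.join(tmp, "image")
    cluster_dir = os.path.join(tmp, "cluster")
    os.makedirs(image_dir)
    for name in ("aaa", "bbb"):
        open(os.path.join(image_dir, name + ".png"), "wb").close()
    missing = sort_images(["aaa", "bbb", "ccc"], [0, -1, 0], image_dir, cluster_dir)
    copied = sorted(os.listdir(os.path.join(cluster_dir, "0")))
```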

In [17]:
def main():
    global logger, db

    logger = create_logger('htmlstrip')
    db = create_database()

    # reading information
    cursor = db.cursor()
    cursor.execute("""SELECT p.hash, GROUP_CONCAT(r.value ORDER BY r.value SEPARATOR '  ')
                      FROM phish_resource r
                      JOIN phish_page_has_resource pr
                        ON pr.phish_resource_id = r.id AND r.type LIKE 'alt' AND LENGTH(r.value) < 20
                           JOIN phish_page p
                             ON pr.phish_page_id = p.id
                      GROUP BY p.id, p.hash
                      LIMIT 1000;""")
    hashes = list()
    values = list()
    for (hash, value) in cursor:
        values.append(value)
        hashes.append(hash)

    vector = vectorize(values, hashes)

    if os.path.exists(CLUSTER_PATH):
        shutil.rmtree(CLUSTER_PATH)
    os.mkdir(CLUSTER_PATH)

    for i in range(0, len(hashes)):
        target_path = os.path.join(CLUSTER_PATH, str(vector[i]))
        if not os.path.exists(target_path):
            os.mkdir(target_path)

        source_file = os.path.join(IMAGE_PATH, "%s.png" % (hashes[i],))
        if os.path.isfile(source_file):
            shutil.copy(source_file, target_path)
        else:
            #logger.warning("%s not found." % (source_file,))
            #uncomment logger.warning to warn when missing image files.
            pass
    db.close()

Call main() to do it all: retrieve data, vectorize by alt text, plot, and make a directory of clustered images.

In [18]:
main()

2018-10-25 12:06:06,578 - htmlstrip -     INFO -   20 - Connecting to database.


Alternatively, use section 5 to poke around with vectorization and plotting, without calling main to move files.

## APPENDIX

## 5. Vectorize and plot only

In [9]:
logger = create_logger('htmlstrip')
db = create_database()

cursor = db.cursor()
cursor.execute("""SELECT p.hash, GROUP_CONCAT(r.value ORDER BY r.value SEPARATOR '  ')
                      FROM phish_resource r
                      JOIN phish_page_has_resource pr
                        ON pr.phish_resource_id = r.id AND r.type LIKE 'alt' AND LENGTH(r.value) < 20
                           JOIN phish_page p
                             ON pr.phish_page_id = p.id
                      GROUP BY p.id, p.hash
                      LIMIT 1000;""")
hashes = list()
values = list()
for (hash, value) in cursor:
    values.append(value)
    hashes.append(hash)

2018-10-25 12:04:31,845 - htmlstrip -     INFO -   20 - Connecting to database.


In [10]:
ngram_vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, lowercase=True)
X = ngram_vectorizer.fit_transform(values)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
model = umap.UMAP(n_components=2)
embedding = model.fit_transform(X)
labels = clusterer.labels_


In [11]:
x = embedding[:,0]
y = embedding[:,1]
images = [os.path.join(IMAGE_PATH, "%s.png" % (hashes[i])) for i in range(0,len(hashes))]
strlabels = [str(x) for x in labels]
values = [x[:40] for x in values]

In [12]:
datadict = dict(x=x, y=y,values=values,hashes=hashes,images=images,labels=strlabels)
source = ColumnDataSource(datadict)

Who's afraid of the big bad tooltip?
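
The tooltip template in this notebook prefixes the raw local path with `file://`. With a Windows path that produces a string containing a drive letter and backslashes, which some browsers may not resolve. One possible alternative, sketched here as an assumption rather than as the authors' method, is to convert each path to a proper file URI first, store those URIs in the `images` column, and use `src="@images"` in the template. `PureWindowsPath` is used so the sketch runs on any operating system; the path shown is hypothetical.

```python
from pathlib import PureWindowsPath

def to_file_uri(windows_path):
    """Convert an absolute Windows path to a file: URI."""
    return PureWindowsPath(windows_path).as_uri()

# Hypothetical path, used only for illustration.
uri = to_file_uri(r"C:\data\image\abc123.png")
print(uri)  # file:///C:/data/image/abc123.png
```

Whether the browser is allowed to load local files from a notebook page depends on its security settings, so web addresses may be the more reliable choice when the images can be served.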

In [13]:
TOOLTIPS = """
    <div>
        <div>
            <img
                src="file://@images" height="150" width="150" alt="file://@images"
                style="float: left; margin: 0px 10px 10px 0px;"
                border="2"
            ></img>
        </div>
        <div>
            <span style="font-size: 15px; color: #966;">[$index]</span>
            <span style="font-size: 10px;">@hashes</span>
        </div>
        <div>
            <span style="font-size: 10px;">@values</span>
        </div>
        <div>
            <span style="font-size: 15px;">Location</span>
            <span style="font-size: 10px; color: #696;">($x, $y)</span>
        </div>
        <div>
            <span style="font-size: 15px;">Cluster ID </span>
            <span style="font-size: 15px; color: #855; font-weight: bold;">@labels</span>
        </div>
    </div>
"""

In [14]:
from bokeh.palettes import viridis
lsslabels = list(set(strlabels))
n_outlier_labels = 1 if '-1' in lsslabels else 0 #there may be no unclustered points
palette = list(viridis(len(lsslabels) - n_outlier_labels)) #a new name, so the imported viridis function stays callable
if n_outlier_labels:
    palette.insert(lsslabels.index('-1'), '#000000') #outliers in black

In [15]:
p = figure(title='Clustering Phishing Sites', active_scroll='wheel_zoom', tooltips=TOOLTIPS)
p.background_fill_color = "beige"
p.background_fill_alpha = 0.75
p.circle('x', 'y', source=source, size=6.66, color=factor_cmap('labels', palette=palette, factors=lsslabels))

show(p)

Wow, so <b><font color="#440154">b</font><font color="#3D4A89">e</font><font color="#2E6B8E">a</font><font color="#208F8C">u</font><font color="#1F958B">t</font><font color="#27AD80">i</font><font color="#42BE71">f</font><font color="#69CC5B">u</font><font color="#97D83E">l</font><font color="#FDE724">!</font></b>