## Get embeddings

We have generated embeddings (number vectors) for both tags and class names. The raw embeddings are text files that look like `name num1 num2 ...`, one line per object; below we load them from a single JSON file that maps each name to its vector.
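
As an illustration of the `name num1 num2 ...` layout, a minimal parser might look like the following. The function name `parse_embedding_lines` is hypothetical, and the sketch assumes that names contain no whitespace.

```python
def parse_embedding_lines(lines):
    """Parse lines of the form `name num1 num2 ...` into {name: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(p) for p in parts[1:]]
    return vectors

# Made-up sample lines, shortened to two numbers each.
sample = ["nobr 1.215299 0.857834", "div -0.5 0.25"]
parsed = parse_embedding_lines(sample)
print(parsed["nobr"])
```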

In [166]:
%load_ext autoreload
%autoreload
import json
from pha.lazygetter import lazyget
lazyget("https://raw.githubusercontent.com/ianb/personal-history-data/master/html-vectors.json", "data/html-vectors.json")
with open("data/html-vectors.json") as f:
    data = json.load(f)
classname_vectors = data["classes"]
tag_vectors = data["tags"]
print("Class example:", list(classname_vectors.items())[0])
print("Tag example:", list(tag_vectors.items())[0])

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Creating directory data/
Reading 21132kb into data/html-vectors.json... done.
Class example: ('-comment-item-option-tool', [0.267649, 0.060728, 0.372383, -0.167402, -0.42405, -0.213464, -0.18574, 0.22672, 0.159343, 0.14346, 0.225321, 0.350789, 0.218112, 0.276691, 0.447376, -0.192947, -0.290775, -0.259949, 0.388473, -0.310735, -0.424629, -0.472496, 0.188353, 0.150703, 0.112088, 0.265407, 0.222766, -0.08162, 0.402486, 0.190056, 0.430345, -0.476926, 0.11799, -0.422889, 0.111582, 0.288491, 0.013147, -0.005348, -0.018681, -0.291235, -0.016522, 0.251612, -0.322645, 0.318849, -0.48921, -0.307299, 0.004299, -0.581498, 0.13047, -0.641503])
Tag example: ('nobr', [1.215299, 0.857834, 0.162706, 0.966696, 0.232542, 1.218955, 0.733365, 0.797194, -0.145894, -0.231142, -0.262402, -0.961403, -0.097864, -1.0834, 0.766679, 0.968583, 1.385928, 0.209797, -0.425753, -1.923545])


## Create vectors for elements

Now, given an element, we need to make a vector. 

An element has a few properties:
* One tag name
* Zero or more class names
* A number of immediate children
* A number of descendants (including grandchildren, etc.)
* A number of words
* A number of links

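We'll turn these into several vectors, all of which are concatenated into one. Several of them use the scaling function `min(n / range, 1.0)`, which we'll call `trunc_count(n, range)`: it maps a count into the range 0 to 1 and saturates at 1.0 once `n` reaches `range`.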

* The tag vector
* The sum of the class name vectors (or the vector for `no-class`)
* `trunc_count(children, 10)`
* `trunc_count(desc, 100)`
* `trunc_count(words, 250)`
* `links / words` or 0 if there are no words

FIXME: add `display` property (i.e., block, block-ish like inline-block, or not-block)
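
To make the layout concrete, here is a small sketch that builds a vector of the same shape from toy values. The embedding values are made up; only the lengths (20 for a tag vector, 50 for a class vector) are taken from the sample output printed earlier.

```python
import numpy as np

def trunc_count(n, max_count):
    # Scale a count into [0, 1], saturating at max_count.
    return min(n / max_count, 1.0)

# Toy stand-ins for the real embeddings (values are made up; only the lengths match).
toy_tag_vector = np.zeros(20)
toy_class_vectors = [np.full(50, 0.1), np.full(50, 0.2)]

# Class vectors are summed element-wise, so the result still has length 50.
class_sum = np.sum(toy_class_vectors, axis=0)
counts = [
    trunc_count(3, 10),     # 3 children
    trunc_count(40, 100),   # 40 descendants
    trunc_count(600, 250),  # 600 words, saturates at 1.0
    12 / 600,               # 12 links over 600 words
]
toy_vector = np.concatenate([toy_tag_vector, class_sum, counts])
print(toy_vector.shape)
```

The result has 20 + 50 + 4 = 74 entries, regardless of how many class names the element carries.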

In [125]:
import numpy as np
from pha import htmltools

def trunc_count(n, max_count):
    # Scale a count into [0, 1], saturating once n reaches max_count.
    return min(n / max_count, 1.0)

def element_to_vector(el):
    children = len(el)
    # Note: el.iter() also yields el itself, so this is descendants + 1.
    desc = len(list(el.iter()))
    words = len(el.text_content().split())
    links = len(el.cssselect('a'))
    classnames = htmltools.normalize_classes(el)
    # Look up known class names; fall back to the 'no-class' vector if none match.
    classnames = [classname_vectors[c] for c in classnames if c in classname_vectors] or [classname_vectors['no-class']]
    classname_sum = np.sum(classnames, axis=0)
    # Unknown tags are treated as a div.
    tag_vector = tag_vectors.get(el.tag) or tag_vectors["div"]
    word_links = links / words if words else 0.0
    vector = np.concatenate([
        tag_vector,
        classname_sum,
        [trunc_count(children, 10), trunc_count(desc, 100), trunc_count(words, 250), word_links],
    ])
    return vector

In [126]:
# Let's test it out:
import lxml.html

example_el = lxml.html.fromstring('<div class="login">this is an element</div>')
print(element_to_vector(example_el), element_to_vector(example_el).shape)

[-1.342614e+00 -9.486100e-02  4.825000e-02 -2.420997e+00  1.879200e-01
  1.059590e+00  2.564651e+00  4.462840e-01  6.541920e-01 -8.095740e-01
 -6.323590e-01 -3.787137e+00  1.598670e-01 -2.616020e-01  3.036870e-01
 -1.347139e+00  1.822867e+00 -1.205438e+00  9.460170e-01 -1.036174e+00
 -2.597010e-01 -3.733580e-01  1.566220e-01  1.588957e+00  1.118785e+00
 -1.518593e+00 -7.373980e-01  6.349290e-01  5.760940e-01  2.792740e-01
 -1.000190e-01 -3.212480e-01 -1.515810e-01  1.517445e+00 -4.100120e-01
  4.545120e-01 -1.409812e+00  2.038540e-01 -7.124000e-02 -1.194100e-02
 -4.407500e-01 -7.227330e-01  1.052250e+00 -8.168990e-01  1.433370e-01
 -1.351080e-01  2.169070e-01 -8.799660e-01  2.570460e-01 -4.959670e-01
 -4.564300e-02 -6.928400e-02  1.093841e+00 -1.262750e-01  9.825410e-01
 -1.089900e-01 -3.273710e-01  1.234727e+00 -1.384890e-01  1.434000e-03
  3.838530e-01 -1.019180e-01  8.018710e-01  7.520770e-01 -1.827950e-01
  4.175690e-01 -5.354360e-01  7.710520e-01 -4.671780e-01 -2.141180e-01
  0.00

## Training data

The training (Y) data is fairly simple: is the element readable or not? We'll use a vector of `[is_not_readable, is_readable]` for each item (so always `[1, 0]` for non-readable elements, and `[0, 1]` for readable elements).

But not all elements are either! Specifically, elements *inside* a readable element are readable, but they aren't what we are selecting. These elements will be filtered out.
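
As a quick sketch of that encoding (the helper name `readable_label` is hypothetical), a boolean flag maps to a two-element vector, and `argmax` recovers the flag:

```python
import numpy as np

def readable_label(is_readable):
    # Index 0 means "not readable", index 1 means "readable".
    return [0.0, 1.0] if is_readable else [1.0, 0.0]

labels = np.array([readable_label(flag) for flag in (False, True, False)])
# The position of the 1 in each row is the class index.
decoded = labels.argmax(axis=1).astype(bool)
print(labels.tolist(), decoded.tolist())
```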

In [131]:
def iter_testable_elements(doc):
    body = list(doc.iter("body"))[0]
    def iter_children(el):
        for child in el:
            if child.get("data-isreadable"):
                yield True, child
            else:
                yield False, child
                yield from iter_children(child)
    return iter_children(body)

In [132]:
example_doc = lxml.html.document_fromstring("""
<html><head><title>Test</title></head>
<body><div><ul data-isreadable="1"><li>1<li>2</ul>
<ol><li>3<li>4</ol>
</div></body></html>
""")
print(list(iter_testable_elements(example_doc)))

[(False, <Element div at 0x496f5eae8>), (True, <Element ul at 0x496f5e458>), (False, <Element ol at 0x496f5eea8>), (False, <Element li at 0x496f5e098>), (False, <Element li at 0x496f5eb88>)]


## Actual training data

To get real training data we have to load the history, get all the elements, convert them to vectors, and construct results along the way.

In [140]:
import pha
import pha.htmltools
import random
archive = pha.Archive.default_location()
print(archive)
histories = archive.histories_with_page()
random.shuffle(histories)
print(len(histories))

<Archive at '/Users/ianbicking/src/personal-history-archive' 19596/52900 fetched, 31046 errored>
14995


In [153]:
X = []
Y = []
total = 0
with_readable = 0
for h in histories:
    for is_readable, el in iter_testable_elements(h.page.lxml):
        if not is_readable and random.random() > 0.02:
            continue
        total += 1
        X.append(element_to_vector(el))
        if is_readable:
            with_readable += 1
        Y.append([0., 1.] if is_readable else [1., 0.])
X = np.array(X)
Y = np.array(Y)
print("Total:", total, "with readable:", with_readable)

Total: 256890 with readable: 8640
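
The loop above keeps every readable element but only about 2% of the non-readable ones (the `random.random() > 0.02` check). Even so, the counts show readable elements are only about 3% of the sampled set. A self-contained sketch of the same downsampling on toy data, with a hypothetical helper name and a seeded generator used only to make it repeatable:

```python
import random

def downsample_negatives(items, keep_fraction, rng):
    """Keep every positive item and roughly keep_fraction of the negatives."""
    kept = []
    for is_positive, value in items:
        if not is_positive and rng.random() > keep_fraction:
            continue
        kept.append((is_positive, value))
    return kept

rng = random.Random(0)  # seeded only so the sketch is repeatable
toy_items = [(i % 50 == 0, i) for i in range(5000)]  # 100 positives, 4900 negatives
sampled = downsample_negatives(toy_items, 0.02, rng)
positives = sum(1 for is_positive, _ in sampled if is_positive)
print(len(sampled), positives)
```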


In [97]:
import h5py
h5f = h5py.File('readable-training-data.h5', 'w')
h5f.create_dataset('X_1', data=X, dtype=np.float32, compression="gzip")
h5f.create_dataset('Y_1', data=Y, dtype=np.int8, compression="gzip")
h5f.close()

In [154]:
print("X:", type(X), type(X[0]), len(X), X[0].shape, "sub-length:", len(X[0]), "size:", len(X) * len(X[0]) // 1000000, "Y:", len(Y))

X: <class 'numpy.ndarray'> <class 'numpy.ndarray'> 256890 (74,) sub-length: 74 size: 19 Y: 256890


In [64]:
h5f = h5py.File('readable-training-data.h5', 'r')
print(len(h5f['X_1']), len(h5f['Y_1']))
h5f.close()

14996243 14996243


In [155]:
split = 8 * len(X) // 10
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

In [156]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

def make_model(input_shape):
    model = Sequential()
    model.add(Dense(input_shape[0], input_shape=input_shape))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dense(2, activation='softmax')) # [is not readable, is readable]
    model.compile(
        optimizer='rmsprop', 
        loss='categorical_crossentropy',
        metrics=['accuracy'])
    return model

model = make_model(X[0].shape)

In [157]:
from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(
    filepath='readable-model-best.h5', 
    verbose=1, save_best_only=True)
hist = model.fit(
    np.array(X_train), np.array(Y_train), 
    batch_size=128, epochs=50,
    validation_split=0.2, 
    callbacks=[checkpointer],
    verbose=0, shuffle=True)
model.load_weights('readable-model-best.h5')
score = model.evaluate(np.array(X_test), np.array(Y_test), verbose=0)
print("Test score:", score, model.metrics_names)


Epoch 00001: val_loss improved from inf to 0.05133, saving model to readable-model-best.h5

Epoch 00002: val_loss did not improve

Epoch 00003: val_loss did not improve

Epoch 00004: val_loss did not improve

Epoch 00005: val_loss did not improve

Epoch 00006: val_loss did not improve

Epoch 00007: val_loss did not improve

Epoch 00008: val_loss did not improve

Epoch 00009: val_loss did not improve

Epoch 00010: val_loss improved from 0.05133 to 0.05114, saving model to readable-model-best.h5

Epoch 00011: val_loss did not improve

Epoch 00012: val_loss did not improve

Epoch 00013: val_loss did not improve

Epoch 00014: val_loss did not improve

Epoch 00015: val_loss did not improve

Epoch 00016: val_loss did not improve

Epoch 00017: val_loss did not improve

Epoch 00018: val_loss improved from 0.05114 to 0.05059, saving model to readable-model-best.h5

Epoch 00019: val_loss did not improve

Epoch 00020: val_loss did not improve

Epoch 00021: val_loss did not improve

Epoch 00022: 

In [158]:
right_readable = wrong_readable = right_unreadable = wrong_unreadable = 0
for el, result in zip(X_test, Y_test):
    predict = model.predict(np.array([el]))
    predict_readable = predict[0][0] < predict[0][1]
    is_readable = result[0] < result[1]
    if is_readable:
        if predict_readable:
            right_readable += 1
        else:
            wrong_readable += 1
    elif predict_readable:
        wrong_unreadable += 1
    else:
        right_unreadable += 1
print("Got readable right", right_readable, "and wrong", wrong_readable)
print("Got unreadable right", right_unreadable, "and wrong", wrong_unreadable)

Got readable right 826 and wrong 977
Got unreadable right 49378 and wrong 197
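
Because readable elements are rare, raw accuracy hides how the model does on them. A short calculation from the four counts printed above (here "unreadable wrong" means an unreadable element that was predicted readable) gives roughly 97.7% accuracy, against a 96.5% baseline for always answering "not readable", with about 81% precision and 46% recall on readable elements:

```python
# Counts copied from the output above.
right_readable, wrong_readable = 826, 977        # readable: predicted right / missed
right_unreadable, wrong_unreadable = 49378, 197  # unreadable: predicted right / flagged as readable

total = right_readable + wrong_readable + right_unreadable + wrong_unreadable
accuracy = (right_readable + right_unreadable) / total
precision = right_readable / (right_readable + wrong_unreadable)
recall = right_readable / (right_readable + wrong_readable)
baseline = (right_unreadable + wrong_unreadable) / total  # always answer "not readable"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```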
