# KB demo

Explore a set of knowledge base (KB) articles to see what spaCy can determine about them, and how that could be used to improve the customer experience.


### Set up cache

We'll use the GOV.UK VAT pages (https://www.gov.uk/topic/business-tax/vat) and crawl to a depth of 2 (this page and the pages it links to directly).

If a previous crawl exists, its data is loaded from the cache.

In [1]:
from crawler.crawler import Crawler, CrawlerCache

cache = CrawlerCache('crawler.db')
domain = 'www.gov.uk'  # TODO: Move this earlier 
base_url = 'topic/business-tax/vat'

### Set up an initial crawl to populate the cache for later

*** ONLY run this if the cache is empty or it's a new domain!

In [2]:
#crawler = Crawler(cache, depth=2)
#crawler.crawl('https://{}/{}'.format(domain,base_url))

Crawl www.gov.uk
retrieving url... [www.gov.uk] /topic/business-tax/vat
set: self.is_cacheable(/topic/business-tax/vat)=True 
retrieving url... [www.gov.uk] 
set: self.is_cacheable()=True 
retrieving url... [www.gov.uk] /vat-retail-schemes
set: self.is_cacheable(/vat-retail-schemes)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/email-signup
set: self.is_cacheable(/topic/business-tax/vat/email-signup)=True 
retrieving url... [www.gov.uk] /browse/driving
set: self.is_cacheable(/browse/driving)=True 
retrieving url... [www.gov.uk] /vat-building-new-home
set: self.is_cacheable(/vat-building-new-home)=True 
retrieving url... [www.gov.uk] /government/organisations/government-digital-service
set: self.is_cacheable(/government/organisations/government-digital-service)=True 
retrieving url... [www.gov.uk] /government/publications
set: self.is_cacheable(/government/publications)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/latest
set: self.is_cacheable(/topic/busin

set: self.is_cacheable(/government/collections/vat-forms)=True 
retrieving url... [www.gov.uk] /government/organisations
set: self.is_cacheable(/government/organisations)=True 
retrieving url... [www.gov.uk] /browse/housing-local-services
set: self.is_cacheable(/browse/housing-local-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
set: self.is_cacheable(/guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country)=True 
retrieving url... [www.gov.uk] /government/organisations/department-for-work-pensions
set: self.is_cacheable(/government/organisations/department-for-work-pensions)=True 
retrieving url... [www.gov.uk] /guidance/vat-how-to-work-out-your-place-of-supply-of-services
set: self.is_cacheable(/guidance/vat-how-to-work-out-your-place-of-supply-of-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-capital-goods-scheme-and-capital-assets
set: self.is_cacheable(/guidance/vat-capital-goods-s

List the pages we have found and indexed

In [2]:
# Read the URLs from the cache, so this works even when the crawl cell above is skipped
for url in cache.get_urls(domain):
    print(url)

View one of the pages as an example

In [4]:
# Fetch the cached HTML; `crawler` only exists if the crawl cell above was run
page = cache.get(domain, '/vat-record-keeping')
print(page)

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en">
<!--<![endif]-->
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>VAT record keeping - GOV.UK</title>

    <!--[if gt IE 8]><!--><link href="https://assets.publishing.service.gov.uk/static/govuk-template-2775f99eaec64ff8121bfbfb3eb67b0c2b4b7c3fc78d25da30e12db2a09d30d6.css" media="screen" rel="stylesheet">
<!--<![endif]-->
    <!--[if IE 6]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie6-5bb08c355a12ac38b0ac9d2446da122ec0f81c78e02dcd2a98766f53c23793c8.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 7]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie7-be1ea757827710f20eae59ae3ebfd172b7dbeabb171a79945ba610947eebb3cc.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 8]><link href="https://assets.publishing.service.go

### Now get the text from the page to parse

In [3]:
#import spacy
import numpy as np
from bs4 import BeautifulSoup

Stop words from http://xpo6.com/download-stop-word-list/

* TODO: extend this list, perhaps with site-specific terms such as gov.uk and hm.

In [4]:
stoplist = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

In [5]:
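def get_text_from_html(html):
    """Return the visible text of an HTML page, without scripts, styles or navigation."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements that carry no article content
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.extract()
    return soup.get_text()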

In [6]:
def remove_stop_words(doc):
    # Lower-case, split on whitespace and drop stop words; punctuation is kept as-is
    return [word for word in doc.lower().split() if word not in stoplist]

### Build an array of parsed documents

In [7]:
pages=[]
for url in cache.get_urls(domain):
    pages.append(remove_stop_words(get_text_from_html(cache.get(domain,url))))

In [8]:
print (pages[5])

['building', 'new', 'home', 'vat', '-', 'gov.uk', 'skip', 'main', 'content', 'gov.uk', 'uses', 'cookies', 'make', 'site', 'simpler.', 'find', 'out', 'more', 'cookies', 'gov.uk', 'search', 'search', 'home', 'money', 'tax', 'vat', 'building', 'new', 'home', 'vat', '1.', 'overview', 'apply', 'vat', 'refund', 'building', 'materials', 'services', 'youre:', 'building', 'new', 'home', 'converting', 'property', 'home', 'building', 'non-profit', 'communal', 'residence', '-', 'eg', 'hospice', 'building', 'property', 'charity', 'building', 'work', 'materials', 'qualify', 'apply', 'hm', 'revenue', 'customs', '(hmrc)', 'within', '3', 'months', 'completing', 'work.', 'separate', 'guide', 'vat', 'youre', 'working', 'construction', 'industry.', 'print', 'entire', 'guide', 'vat', 'elsewhere', 'gov.uk', 'help', 'improve', 'gov.uk', 'dont', 'include', 'personal', 'financial', 'information', 'national', 'insurance', 'number', 'credit', 'card', 'details.', 'doing', 'went', 'wrong', 'send']
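The tokens above still carry punctuation ('simpler.', '(hmrc)', '-') and site boilerplate such as 'gov.uk'. A minimal sketch of one way to clean them, using only the standard library; the `clean_tokens` helper and the `SITE_STOP_WORDS` set are hypothetical names for illustration and are not part of the notebook:

```python
import string

# Hypothetical site-specific stop words; tune these against real crawl output.
SITE_STOP_WORDS = {"gov.uk", "hm", "skip", "cookies"}

def clean_tokens(text, stop_words):
    """Lower-case, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        # Strips punctuation at the edges only: 'simpler.' -> 'simpler', '(hmrc)' -> 'hmrc'
        word = raw.strip(string.punctuation)
        # A bare '-' becomes an empty string and is skipped
        if word and word not in stop_words:
            tokens.append(word)
    return tokens

sample = "GOV.UK uses cookies to make the site simpler. Apply to HM Revenue and Customs (HMRC)"
print(clean_tokens(sample, SITE_STOP_WORDS | {"to", "the", "and"}))
```

Using a set for the stop words also makes each membership test constant time, which matters more as the list grows.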


### Let's introduce Gensim

In [9]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora


Using TensorFlow backend.
2017-08-15 15:39:58,593 : INFO : 'pattern' package not found; tag filters are not available for English


Remove words that only appear once

In [10]:
from collections import defaultdict

frequency = defaultdict(int)
for page in pages:
    for token in page:
        frequency[token] += 1

pages = [[token for token in page if frequency[token] > 1]
          for page in pages]
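Note that this filter counts occurrences across the whole corpus, not the number of pages a token appears in, so a token repeated within a single page survives. A small sketch on made-up data (the `toy_pages` list is an assumption for illustration) shows the effect:

```python
from collections import Counter

# Toy corpus: 'hospice' appears twice, but only in one page.
toy_pages = [
    ["vat", "refund", "building"],
    ["vat", "import", "goods"],
    ["import", "goods", "hospice", "hospice"],
]

# Total occurrences of each token across all pages
frequency = Counter(token for page in toy_pages for token in page)

# Keep tokens seen more than once anywhere in the corpus
filtered = [[token for token in page if frequency[token] > 1] for page in toy_pages]
print(filtered)
```

Here 'refund' and 'building' are dropped, while 'hospice' is kept even though it occurs in a single page.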

In [11]:
pages[5]

['building',
 'new',
 'home',
 'vat',
 '-',
 'gov.uk',
 'skip',
 'main',
 'content',
 'gov.uk',
 'uses',
 'cookies',
 'make',
 'site',
 'simpler.',
 'find',
 'out',
 'more',
 'cookies',
 'gov.uk',
 'search',
 'search',
 'home',
 'money',
 'tax',
 'vat',
 'building',
 'new',
 'home',
 'vat',
 '1.',
 'overview',
 'apply',
 'vat',
 'refund',
 'building',
 'materials',
 'services',
 'building',
 'new',
 'home',
 'converting',
 'property',
 'home',
 'building',
 'non-profit',
 'communal',
 'residence',
 '-',
 'eg',
 'hospice',
 'building',
 'property',
 'charity',
 'building',
 'work',
 'materials',
 'qualify',
 'apply',
 'hm',
 'revenue',
 'customs',
 '(hmrc)',
 'within',
 '3',
 'months',
 'completing',
 'separate',
 'guide',
 'vat',
 'youre',
 'working',
 'construction',
 'print',
 'entire',
 'guide',
 'vat',
 'elsewhere',
 'gov.uk',
 'help',
 'improve',
 'gov.uk',
 'dont',
 'include',
 'personal',
 'financial',
 'information',
 'national',
 'insurance',
 'number',
 'credit',
 'card',
 'd

Now create a dictionary (initially just a bag of words)

In [12]:
dictionary = corpora.Dictionary(pages)
dictionary.save('./tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary.token2id)

2017-08-15 15:39:58,676 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-08-15 15:39:58,749 : INFO : built Dictionary(4401 unique tokens: ['usual', 'children,', 'climate', 'health,', 'being']...) from 103 documents (total 59407 corpus positions)
2017-08-15 15:39:58,751 : INFO : saving Dictionary object under ./tmp/deerwester.dict, separately None
2017-08-15 15:39:58,757 : INFO : saved ./tmp/deerwester.dict


{'usual': 3151, 'children,': 61, 'climate': 510, 'health,': 512, 'being': 381, 'films': 3841, 'armenia': 560, 'charitable': 2249, 'emergencies': 519, 'books,': 2244, '732:': 3852, 'so,': 2187, 'llysoedd': 4340, 'functions': 3207, 'beginning': 250, 'exclude': 4146, 'clearance': 1862, 'establish': 4281, 'code': 534, 'sterling.': 2795, 'liverpool': 536, 'arrival': 3703, 'sitpro': 666, 'gyda': 4350, 'organisations': 2922, 'white': 419, 'niger': 543, 'chancellor': 2997, 'highways': 1270, 'rail': 548, 'personalised': 552, 'simplifications': 3692, 'animals,': 1874, 'leisure': 554, 'carers,': 106, '42kb)': 3740, 'heavy': 3592, 'reviewing': 1836, 'supreme': 517, 'adviser': 561, 'character': 3730, 'twitter': 483, 'supplies.': 1879, 'appoint': 3341, 'customer': 269, 'scan': 3281, 'template': 2564, 'mobility': 855, 'green': 572, 'conflict': 4329, 'arab': 576, 'regularly': 2570, 'ago': 2959, 'montserrat': 579, 'cattle': 582, 'schemes': 276, 'worth.': 3387, '(this': 3470, 'dates': 113, 'zero': 1888,

Test against a sample question...

In [13]:
question = "Import VAT"
q_vec = dictionary.doc2bow(question.lower().split())
print (q_vec)

[(47, 1), (1958, 1)]
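Each pair in the output is a (token id, count) entry: both query words were found in the dictionary, once each. The idea behind a bag-of-words vector can be sketched in plain Python; this is an illustration of the concept, not Gensim's implementation, and the `build_token2id` and `to_bow` helpers are hypothetical:

```python
from collections import Counter

def build_token2id(pages):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for page in pages:
        for token in page:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token id, count) pairs; tokens not in the mapping are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_pages = [["vat", "refund"], ["import", "vat", "goods"]]
token2id = build_token2id(toy_pages)
print(to_bow("import vat on unknownword".split(), token2id))
```

Words that never occurred in the crawled pages contribute nothing to the vector, so a question phrased with unseen vocabulary cannot match on those words.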


In [14]:
corpus = [dictionary.doc2bow(page) for page in pages]
corpora.MmCorpus.serialize('./tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

2017-08-15 15:39:58,915 : INFO : storing corpus in Matrix Market format to ./tmp/deerwester.mm
2017-08-15 15:39:58,918 : INFO : saving sparse matrix to ./tmp/deerwester.mm
2017-08-15 15:39:58,919 : INFO : PROGRESS: saving document #0
2017-08-15 15:39:58,977 : INFO : saved 103x4401 matrix, density=5.262% (23852/453303)
2017-08-15 15:39:58,978 : INFO : saving MmCorpus index to ./tmp/deerwester.mm.index


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 3), (23, 1), (24, 1), (25, 1), (26, 2), (27, 2), (28, 2), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 4), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 2), (47, 2), (48, 2), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1)], [(3, 3), (6, 1), (7, 1), (8, 1), (12, 1), (13, 1), (14, 1), (17, 1), (18, 6), (19, 1), (20, 1), (21, 1), (22, 1), (24, 4), (25, 2), (26, 2), (27, 4), (29, 1), (30, 1), (31, 1), (33, 1), (34, 1), (36, 1), (37, 1), (39, 2), (40, 8), (43, 1), (44, 4), (45, 1), (46, 1), (47, 1), (48, 7), (49, 1), (50, 1), (51, 1), (54, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 3), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (7

### Transformations
Based on https://radimrehurek.com/gensim/tut2.html

In [52]:
import os
from gensim import corpora, models, similarities
if (os.path.exists("./tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('./tmp/deerwester.dict')
    corpus = corpora.MmCorpus('./tmp/deerwester.mm')
    print("Used files generated above")
else:
    print("Please run code above to generate data set")

2017-08-15 16:40:34,954 : INFO : loading Dictionary object from ./tmp/deerwester.dict
2017-08-15 16:40:34,958 : INFO : loaded ./tmp/deerwester.dict
2017-08-15 16:40:34,960 : INFO : loaded corpus index from ./tmp/deerwester.mm.index
2017-08-15 16:40:34,961 : INFO : initializing corpus reader from ./tmp/deerwester.mm
2017-08-15 16:40:34,963 : INFO : accepted corpus with 103 documents, 4401 features, 23852 non-zero entries


Used files generated above


In [53]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2017-08-15 16:40:35,531 : INFO : collecting document frequencies
2017-08-15 16:40:35,535 : INFO : PROGRESS: processing document #0
2017-08-15 16:40:35,626 : INFO : calculating IDF weights for 103 documents and 4400 features (23852 matrix non-zeros)
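As a rough guide to what the model computes, the sketch below weights each term count by log2(N / document frequency) and scales each document to unit length. This is assumed to approximate Gensim's default settings; it is a plain-Python illustration, and `tfidf_vectors` and `toy_corpus` are hypothetical names:

```python
import math
from collections import Counter

def tfidf_vectors(bow_corpus):
    """Weight (term id, count) pairs by count * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # Number of documents each term id appears in
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)
    idf = {term_id: math.log2(n_docs / df) for term_id, df in doc_freq.items()}
    vectors = []
    for doc in bow_corpus:
        weighted = [(term_id, count * idf[term_id]) for term_id, count in doc]
        # Terms present in every document get weight 0 and are dropped
        weighted = [(term_id, w) for term_id, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        vectors.append([(term_id, w / norm) for term_id, w in weighted] if norm else [])
    return vectors

# Term 0 occurs in every document, like 'vat' might in this crawl.
toy_corpus = [[(0, 2), (1, 1)], [(0, 1), (1, 2), (2, 3)], [(0, 1), (3, 1)]]
for vec in tfidf_vectors(toy_corpus):
    print(vec)
```

The practical consequence is that a term appearing on every page carries no weight, however often it is repeated.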


In [54]:
print(tfidf[q_vec]) # step 2 -- use the model to transform vectors

[(0, 0.70223096914338468), (1, 0.46262469114467708), (2, 0.12787414448588894), (4, -0.099524188551776416), (5, -0.00029142399245634949), (9, 0.042588727575982047), (10, 0.003738630013232378), (11, -0.095939011430362264), (15, -0.21179062786125252), (16, -0.12621523184562175), (21, -0.03897754351751842), (22, 0.00015629763924897617), (23, -0.084944884383754388), (24, 0.037989412523652608), (28, 0.017738262998218037), (29, 0.061829253588616219), (32, 0.013121835997555738), (35, 0.0420801025190864), (36, 0.025560122298658297), (38, 0.11446933612763543), (41, -0.14451704552288608), (42, -0.040265855465817811), (47, -0.0046828557712736739), (48, -0.0056212837345615135), (52, -0.029772086170304871), (53, -0.027247902872119058), (55, -0.00024100352275913274), (56, -0.0031728306032371054), (57, -0.0018100263569231607), (58, -0.086322441468719507), (60, 0.054539220415762965), (61, 0.024202232626511423), (62, 0.018705558761275679), (63, -0.071484769851182101), (64, -0.075313133739338775), (65, -

In [57]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.24028567464054193), (1, 0.29155755079777423), (2, 0.2615654259041742), (4, 0.2615654259041742), (5, 0.002186455186415519), (9, 0.2615654259041742), (10, 0.032922000221756655), (11, 0.1011143753694164), (15, 0.29155755079777423), (16, 0.14251581617711098), (21, 0.05962604239725152), (22, 0.06169518837332447), (23, 0.18030142485334183), (24, 0.11089791855448385), (28, 0.06584400044351331), (29, 0.07984146333480929), (32, 0.1890137984833096), (35, 0.18030142485334183), (36, 0.17250794107071102), (38, 0.17250794107071102), (41, 0.1988910730618725), (42, 0.2615654259041742), (47, 0.05504070849093269), (48, 0.019959877938092653), (52, 0.14761919690464018), (53, 0.0629141251153567), (55, 0.11418600127258902), (56, 0.006763359090215163), (57, 0.29155755079777423), (58, 0.33091575485964264)]
[(21, 0.015953376022157757), (22, 0.005502330297892731), (24, 0.11868613938770205), (29, 0.021362157130157742), (36, 0.04615573900366969), (47, 0.007363268496269477), (48, 0.018691430598973366), (60,

### Use LSI
LSI supports incremental training: documents can be added to an existing model later (`add_documents`) without retraining from scratch.

In [67]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
index.save('./tmp/deerwester.index')

2017-08-15 16:41:57,760 : INFO : using serial LSI version on this node
2017-08-15 16:41:57,765 : INFO : updating model with new documents
2017-08-15 16:41:57,892 : INFO : preparing a new chunk of documents
2017-08-15 16:41:57,897 : INFO : using 100 extra samples and 2 power iterations
2017-08-15 16:41:57,899 : INFO : 1st phase: constructing (4401, 400) action matrix
2017-08-15 16:41:57,919 : INFO : orthonormalizing (4401, 400) action matrix
2017-08-15 16:41:58,271 : INFO : 2nd phase: running dense svd on (400, 103) matrix
2017-08-15 16:41:58,287 : INFO : computing the final decomposition
2017-08-15 16:41:58,288 : INFO : keeping 102 factors (discarding 0.000% of energy spectrum)
2017-08-15 16:41:58,302 : INFO : processed documents up to #103
2017-08-15 16:41:58,304 : INFO : topic #0(2.491): 0.322*"vat" + 0.238*"goods" + 0.141*"includes" + 0.118*"notice" + 0.112*"scheme" + 0.107*"claim" + 0.107*"uk" + 0.104*"hmrc" + 0.102*"reclaim" + 0.100*"living"
2017-08-15 16:41:58,306 : INFO : topic 

### Now add a more complex question

In [68]:
#question = "What VAT applies to import and export"
#vec_bow = dictionary.doc2bow(question.lower().split())
#vec_lsi = lsi[vec_bow]
#print(vec_lsi)

In [69]:
#index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
#index.save('./tmp/deerwester.index')


Load the index if saved previously

In [70]:
#index = similarities.MatrixSimilarity.load('./tmp/deerwester.index')

In [71]:
#sims = index[vec_lsi]

In [72]:
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

In [73]:
#sims = sorted(enumerate(sims), key=lambda item: -item[1])
#print(sims) # print sorted (document number, similarity score) 2-tuples

In [74]:
def get_predicted_urls(sims, n=5):
    """Return the URLs of the n pages most similar to the question."""
    urls = cache.get_urls(domain)  # fetch once; the order must match the corpus order
    # Sort (document number, similarity score) pairs by descending score
    ranked = sorted(enumerate(sims), key=lambda item: -item[1])
    return [urls[page_id] for page_id, _ in ranked[:n]]

In [75]:
#print (get_predicted_urls(sims))

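Create an LSI vector from the question (the context argument is accepted but not used yet)

In [76]:
def create_question_vector(question, context):
    # Tokenise the question the same way as the pages, then project it into LSI space
    vec_bow = dictionary.doc2bow(question.lower().split())
    vec_lsi = lsi[vec_bow]
    return vec_lsi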

In [77]:
question = "What VAT applies to import and export"
context = "import export"
q_vec = create_question_vector(question, context)


Determine the most appropriate answers

In [80]:
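def get_similarity(lsi, q_vec):
    # Index the corpus in LSI space, then score every page against the question vector
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[q_vec]  # use the argument rather than a notebook global
    return sims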

In [81]:
sims = get_similarity(lsi, q_vec )
print (get_predicted_urls(sims))

2017-08-15 16:42:24,018 : INFO : creating matrix with 103 documents and 102 features


['/guidance/vat-imports-acquisitions-and-purchases-from-abroad', '/guidance/how-to-value-your-imports-for-customs-duty-and-trade-statistics', '/vat-annual-accounting-scheme', '/government/collections/vat-manuals', '/vat-returns']
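The ranking above rests on cosine similarity between the question vector and each page vector, which is the measure `MatrixSimilarity` is documented to use. A minimal plain-Python sketch of that ranking step follows; the `cosine` and `top_matches` helpers, the toy vectors and the toy URLs are all assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_matches(query_vec, doc_vecs, urls, n=5):
    """Return the n best (score, url) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), url) for vec, url in zip(doc_vecs, urls)]
    scored.sort(key=lambda pair: -pair[0])
    return scored[:n]

# Hypothetical two-topic vectors for three pages
toy_urls = ["/vat-imports", "/vat-returns", "/browse/driving"]
toy_docs = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]

for score, url in top_matches([1.0, 0.0], toy_docs, toy_urls, n=2):
    print(f"{score:.3f} {url}")
```

Returning the scores alongside the URLs makes it easier to apply a cut-off, so that weak matches such as unrelated navigation pages are not offered as answers.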
