# KB demo

Work through a set of KB articles to see what spaCy can determine about them, and how we could use this to improve the customer experience.


### Set up cache

We'll use the HMRC VAT pages on GOV.UK (https://www.gov.uk/topic/business-tax/vat) and crawl to a depth of 2 (this page and the ones it links to directly).

This reuses data from a previous crawl, if there is one.

In [1]:
from crawler.crawler import Crawler, CrawlerCache

cache = CrawlerCache('crawler.db')
domain = 'www.gov.uk'  # TODO: Move this earlier 
base_url = 'topic/business-tax/vat'
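
For orientation, a cache like this could be backed by a single SQLite table. The sketch below is an assumption about the general shape, not the actual `CrawlerCache` code: the class name `PageCache`, its table layout and the `set` method are hypothetical, and only the `get(domain, url)` and `get_urls(domain)` call shapes mirror how the cache is used later in this notebook.

```python
import sqlite3


class PageCache:
    """Hypothetical stand-in for CrawlerCache; not its real implementation."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "domain TEXT, url TEXT, content TEXT, "
            "PRIMARY KEY (domain, url))"
        )

    def set(self, domain, url, content):
        # Store or overwrite the HTML for one page
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                (domain, url, content),
            )

    def get(self, domain, url):
        # Return the cached HTML, or None if the page was never crawled
        row = self.conn.execute(
            "SELECT content FROM pages WHERE domain = ? AND url = ?",
            (domain, url),
        ).fetchone()
        return row[0] if row else None

    def get_urls(self, domain):
        # List every cached path for the domain, in insertion order
        rows = self.conn.execute(
            "SELECT url FROM pages WHERE domain = ? ORDER BY rowid", (domain,)
        )
        return [row[0] for row in rows]


demo_cache = PageCache()  # in-memory database, nothing is written to disk
demo_cache.set('www.gov.uk', '/vat-registration', '<html>VAT registration</html>')
print(demo_cache.get_urls('www.gov.uk'))
```

Keying the table on `(domain, url)` means a second crawl of the same page overwrites the old copy instead of duplicating it.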

### Set up an initial crawl to populate the cache for later

**Only run this if the cache is empty or it's a new domain!**

In [2]:
crawler = Crawler(cache, depth=2)
crawler.crawl('https://{}/{}'.format(domain,base_url))

Crawl www.gov.uk
retrieving url... [www.gov.uk] /topic/business-tax/vat
set: self.is_cacheable(/topic/business-tax/vat)=True 
retrieving url... [www.gov.uk] 
set: self.is_cacheable()=True 
retrieving url... [www.gov.uk] /vat-retail-schemes
set: self.is_cacheable(/vat-retail-schemes)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/email-signup
set: self.is_cacheable(/topic/business-tax/vat/email-signup)=True 
retrieving url... [www.gov.uk] /browse/driving
set: self.is_cacheable(/browse/driving)=True 
retrieving url... [www.gov.uk] /vat-building-new-home
set: self.is_cacheable(/vat-building-new-home)=True 
retrieving url... [www.gov.uk] /government/organisations/government-digital-service
set: self.is_cacheable(/government/organisations/government-digital-service)=True 
retrieving url... [www.gov.uk] /government/publications
set: self.is_cacheable(/government/publications)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/latest
set: self.is_cacheable(/topic/busin

set: self.is_cacheable(/government/collections/vat-forms)=True 
retrieving url... [www.gov.uk] /government/organisations
set: self.is_cacheable(/government/organisations)=True 
retrieving url... [www.gov.uk] /browse/housing-local-services
set: self.is_cacheable(/browse/housing-local-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
set: self.is_cacheable(/guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country)=True 
retrieving url... [www.gov.uk] /government/organisations/department-for-work-pensions
set: self.is_cacheable(/government/organisations/department-for-work-pensions)=True 
retrieving url... [www.gov.uk] /guidance/vat-how-to-work-out-your-place-of-supply-of-services
set: self.is_cacheable(/guidance/vat-how-to-work-out-your-place-of-supply-of-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-capital-goods-scheme-and-capital-assets
set: self.is_cacheable(/guidance/vat-capital-goods-s
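
The depth limit can be pictured as a breadth-first traversal that stops expanding links once it reaches the requested level. This is a sketch under that assumption, not the `Crawler` implementation; `crawl_paths`, `demo_links` and the tiny `LINKS` graph are hypothetical, and depth is counted as in the setup notes (depth 2 means the start page plus the pages it links to directly).

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP requests
LINKS = {
    '/topic/business-tax/vat': ['/vat-registration', '/vat-record-keeping'],
    '/vat-registration': ['/vat-corrections', '/topic/business-tax/vat'],
    '/vat-record-keeping': [],
}


def demo_links(path):
    """Return the outgoing links of a page (empty if the page is unknown)."""
    return LINKS.get(path, [])


def crawl_paths(start, get_links, depth=2):
    """Visit pages breadth-first, starting at level 1 and stopping at `depth`."""
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])
    while queue:
        path, level = queue.popleft()
        if level >= depth:
            continue  # deepest allowed level: record the page, do not follow its links
        for link in get_links(path):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, level + 1))
    return order


print(crawl_paths('/topic/business-tax/vat', demo_links, depth=2))
```

The `seen` set keeps pages that link back to each other from being fetched twice.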

Show which pages we have found and indexed

In [3]:
for key in crawler.content[domain].keys():
    print(key)


/topic/business-tax/vat/email-signup
/browse/driving
/vat-building-new-home
/government/organisations/government-digital-service
/government/publications
/topic/business-tax/vat/latest
/help/cookies
/guidance/vat-lost-stolen-damaged-or-destroyed-goods
/guidance/rates-of-vat-on-different-goods-and-services
/browse/visas-immigration
/contact
/guidance/foreign-currency-transactions-vat-and-tour-operators
/government/collections/vat-tribunal-reports-and-appeal-updates
/government/organisations/hm-revenue-customs
/browse/benefits
/government/news/webinars-emails-and-videos-on-vat
/government/announcements
/government/publications/notification-of-multiple-heavy-commercial-vehicles-brought-into-the-uk
/government/organisations/hm-treasury
/vat-record-keeping
/guidance/vat-exports-dispatches-and-supplying-goods-abroad
/vat-corrections
/guidance/vat-registration-for-groups-divisions-and-joint-ventures
/browse/abroad
/government/policies
/importing-vehicles-into-the-uk
/vat-margin-schemes
/guid

View one of the pages as an example

In [4]:
page = crawler.content[domain]['/vat-record-keeping']
print(page)

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en">
<!--<![endif]-->
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>VAT record keeping - GOV.UK</title>

    <!--[if gt IE 8]><!--><link href="https://assets.publishing.service.gov.uk/static/govuk-template-2775f99eaec64ff8121bfbfb3eb67b0c2b4b7c3fc78d25da30e12db2a09d30d6.css" media="screen" rel="stylesheet">
<!--<![endif]-->
    <!--[if IE 6]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie6-5bb08c355a12ac38b0ac9d2446da122ec0f81c78e02dcd2a98766f53c23793c8.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 7]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie7-be1ea757827710f20eae59ae3ebfd172b7dbeabb171a79945ba610947eebb3cc.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 8]><link href="https://assets.publishing.service.go

### Now get the text from the page to parse

In [5]:
#import spacy
import numpy as np
from bs4 import BeautifulSoup

Stop words are taken from http://xpo6.com/download-stop-word-list/

* The list needs extending, perhaps with site-specific terms such as gov.uk, hm, etc.

In [17]:
stoplist = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

In [6]:
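def get_text_from_html(html):
    """Return the text of an HTML page without scripts, styles, navigation or footer."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements that carry no article content
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.extract()
    return soup.get_text()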
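
To see what the tag stripping does without needing a crawl, here is a small standard-library sketch of the same idea. It swaps BeautifulSoup for `html.parser.HTMLParser`, so whitespace handling will differ from `get_text_from_html`; the names `VisibleTextParser` and `visible_text` and the sample HTML are illustrative only.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "nav", "footer"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE_HTML = (
    "<html><head><title>VAT record keeping</title>"
    "<style>p { color: red; }</style></head>"
    "<body><nav>Home</nav><p>Keep your VAT records.</p>"
    "<script>var x = 1;</script><footer>Footer links</footer></body></html>"
)
print(visible_text(SAMPLE_HTML))
```

Only the title and the paragraph survive; the navigation, footer, script and style content are dropped.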

In [26]:
def remove_stop_words(doc):
    # Lowercase, split on whitespace and drop stop words (punctuation stays attached to tokens)
    return [word for word in doc.lower().split() if word not in stoplist]

### Build an array of parsed documents

In [22]:
pages = [remove_stop_words(get_text_from_html(cache.get(domain, url)))
         for url in cache.get_urls(domain)]

In [24]:
print (pages[5])

['building', 'new', 'home', 'vat', '-', 'gov.uk', 'skip', 'main', 'content', 'gov.uk', 'uses', 'cookies', 'make', 'site', 'simpler.', 'find', 'out', 'more', 'cookies', 'gov.uk', 'search', 'search', 'home', 'money', 'tax', 'vat', 'building', 'new', 'home', 'vat', '1.', 'overview', 'apply', 'vat', 'refund', 'building', 'materials', 'services', 'youre:', 'building', 'new', 'home', 'converting', 'property', 'home', 'building', 'non-profit', 'communal', 'residence', '-', 'eg', 'hospice', 'building', 'property', 'charity', 'building', 'work', 'materials', 'qualify', 'apply', 'hm', 'revenue', 'customs', '(hmrc)', 'within', '3', 'months', 'completing', 'work.', 'separate', 'guide', 'vat', 'youre', 'working', 'construction', 'industry.', 'print', 'entire', 'guide', 'vat', 'elsewhere', 'gov.uk', 'help', 'improve', 'gov.uk', 'dont', 'include', 'personal', 'financial', 'information', 'national', 'insurance', 'number', 'credit', 'card', 'details.', 'doing', 'went', 'wrong', 'send']
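
The tokens above still carry punctuation (`'simpler.'`, `'(hmrc)'`, a bare `'-'`), so the same word can end up as several dictionary entries. One possible refinement, sketched here as an assumption and not what the notebook does, is to tokenise with a regular expression before filtering. `TOKEN_RE`, `tokenize` and the short `demo_stoplist` are hypothetical names.

```python
import re

# Runs of letters/digits, optionally joined by '-', "'" or '.'
# (keeps 'gov.uk' and 'non-profit' whole, drops trailing full stops and brackets)
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'.][a-z0-9]+)*")


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, extract word-like tokens and drop stop words."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in stop_words]


demo_stoplist = frozenset({"to", "and", "the"})  # tiny list for the demo only
sample = ("GOV.UK uses cookies to make the site simpler. "
          "Apply to HM Revenue and Customs (HMRC) - eg non-profit hospice.")
print(tokenize(sample, demo_stoplist))
```

Using a `frozenset` for the stop words also makes each membership test constant-time, which matters more as the list grows.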


### Let's introduce Gensim

In [31]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora


Using TensorFlow backend.
2017-08-04 11:49:11,835 : INFO : 'pattern' package not found; tag filters are not available for English


Remove words that only appear once

In [33]:
from collections import defaultdict

frequency = defaultdict(int)
for page in pages:
    for token in page:
        frequency[token] += 1

pages = [[token for token in page if frequency[token] > 1]
         for page in pages]

In [35]:
pages[5]

['building',
 'new',
 'home',
 'vat',
 '-',
 'gov.uk',
 'skip',
 'main',
 'content',
 'gov.uk',
 'uses',
 'cookies',
 'make',
 'site',
 'simpler.',
 'find',
 'out',
 'more',
 'cookies',
 'gov.uk',
 'search',
 'search',
 'home',
 'money',
 'tax',
 'vat',
 'building',
 'new',
 'home',
 'vat',
 '1.',
 'overview',
 'apply',
 'vat',
 'refund',
 'building',
 'materials',
 'services',
 'building',
 'new',
 'home',
 'converting',
 'property',
 'home',
 'building',
 'non-profit',
 'communal',
 'residence',
 '-',
 'eg',
 'hospice',
 'building',
 'property',
 'charity',
 'building',
 'work',
 'materials',
 'qualify',
 'apply',
 'hm',
 'revenue',
 'customs',
 '(hmrc)',
 'within',
 '3',
 'months',
 'completing',
 'separate',
 'guide',
 'vat',
 'youre',
 'working',
 'construction',
 'print',
 'entire',
 'guide',
 'vat',
 'elsewhere',
 'gov.uk',
 'help',
 'improve',
 'gov.uk',
 'dont',
 'include',
 'personal',
 'financial',
 'information',
 'national',
 'insurance',
 'number',
 'credit',
 'card',
 'd

Now create a dictionary (initially just a bag of words)

In [37]:
import os
os.makedirs('./tmp', exist_ok=True)  # make sure the output directory exists
dictionary = corpora.Dictionary(pages)
dictionary.save('./tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary.token2id)

2017-08-04 11:54:37,284 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-08-04 11:54:37,352 : INFO : built Dictionary(4401 unique tokens: ['44', 'out', 'half', 'products', 'territory']...) from 103 documents (total 59407 corpus positions)
2017-08-04 11:54:37,355 : INFO : saving Dictionary object under ./tmp/deerwester.dict, separately None
2017-08-04 11:54:37,360 : INFO : saved ./tmp/deerwester.dict


{'44': 3652, 'out': 49, 'half': 3557, 'products': 513, 'territory': 514, "soane's": 517, 'conditions': 519, 'registry': 581, 'slow': 2982, 'buy': 249, 'thailand': 525, 'instalments,': 4292, '38': 3666, '(as': 2254, 'cabinet': 398, 'dartmoor': 530, 'requirements': 3027, 'postgraduate': 1470, '15th': 3516, 'point': 255, 'intangible': 4081, 'invictus': 537, 'very': 3404, 'trustee': 601, 'liverpool': 543, 'youtube': 407, 'support.': 2864, 'money': 176, 'equipment': 547, 'iceland': 548, 'personnel': 550, 'nautical': 552, 'repair': 1409, 'units': 2276, 'expensive': 4094, 'each': 272, 'amendments': 2557, 'awards': 562, 'ethics': 563, 'barrow': 564, 'voluntarily,': 3377, 'register.': 2563, 'yourself': 3016, 'customer,': 3019, 'acquire': 3493, 'horserace': 570, 'horticultural': 571, 'mozambique': 573, 'accounts': 574, 'lucia': 576, 'meets': 3260, '709/3:': 3905, 'eastern': 578, 'review': 579, 'russia': 580, 'dead': 2301, 'client.': 4150, 'controls': 2305, 'deduction': 3686, 'characters.': 3759,

Test against a sample question...

In [39]:
question = "Import VAT"
q_vec = dictionary.doc2bow(question.lower().split())
print (q_vec)

[(40, 1), (2034, 1)]
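
Each pair in that output is (token id, count). The idea behind a bag-of-words conversion can be sketched in plain Python as below. This is an illustration, not Gensim's code: `build_token2id` and `to_bow` are hypothetical helpers, and ids are assigned here in first-seen order, which need not match the ids Gensim assigns.

```python
from collections import Counter


def build_token2id(documents):
    """Give every distinct token an integer id, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


demo_docs = [["import", "vat", "vat"], ["export", "vat"]]
demo_token2id = build_token2id(demo_docs)
print(to_bow("import vat unknown".split(), demo_token2id))
```

Tokens that are not in the dictionary (such as `unknown` here) are silently dropped, so a question made up entirely of unseen words produces an empty vector.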


In [41]:
corpus = [dictionary.doc2bow(page) for page in pages]
corpora.MmCorpus.serialize('./tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

2017-08-04 11:56:37,778 : INFO : storing corpus in Matrix Market format to ./tmp/deerwester.mm
2017-08-04 11:56:37,781 : INFO : saving sparse matrix to ./tmp/deerwester.mm
2017-08-04 11:56:37,784 : INFO : PROGRESS: saving document #0
2017-08-04 11:56:37,834 : INFO : saved 103x4401 matrix, density=5.262% (23852/453303)
2017-08-04 11:56:37,838 : INFO : saving MmCorpus index to ./tmp/deerwester.mm.index


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 2), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 2), (30, 1), (31, 1), (32, 1), (33, 2), (34, 1), (35, 4), (36, 2), (37, 1), (38, 1), (39, 1), (40, 2), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 3), (55, 1), (56, 2), (57, 1), (58, 1), (59, 1)], [(0, 2), (1, 1), (3, 1), (5, 1), (6, 1), (8, 1), (9, 1), (10, 1), (12, 6), (13, 4), (15, 1), (16, 1), (19, 1), (24, 1), (28, 1), (29, 7), (30, 1), (33, 2), (35, 8), (38, 2), (39, 1), (40, 1), (41, 4), (43, 1), (44, 1), (45, 1), (47, 1), (49, 3), (50, 1), (51, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 4), (58, 1), (59, 1), (60, 9), (61, 1), (62, 3), (63, 1), (64, 1), (65, 1), (66, 1), (67, 3), (68, 2), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 

### Transformations
Based on https://radimrehurek.com/gensim/tut2.html

In [44]:
import os
from gensim import corpora, models, similarities
if os.path.exists("./tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('./tmp/deerwester.dict')
    corpus = corpora.MmCorpus('./tmp/deerwester.mm')
    print("Used files generated above")
else:
    print("Please run code above to generate data set")

2017-08-04 12:00:38,455 : INFO : loading Dictionary object from ./tmp/deerwester.dict
2017-08-04 12:00:38,461 : INFO : loaded ./tmp/deerwester.dict
2017-08-04 12:00:38,465 : INFO : loaded corpus index from ./tmp/deerwester.mm.index
2017-08-04 12:00:38,466 : INFO : initializing corpus reader from ./tmp/deerwester.mm
2017-08-04 12:00:38,471 : INFO : accepted corpus with 103 documents, 4401 features, 23852 non-zero entries


Used files generated above


In [45]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2017-08-04 12:00:57,866 : INFO : collecting document frequencies
2017-08-04 12:00:57,869 : INFO : PROGRESS: processing document #0
2017-08-04 12:00:57,971 : INFO : calculating IDF weights for 103 documents and 4400 features (23852 matrix non-zeros)


In [46]:
print(tfidf[q_vec]) # step 2 -- use the model to transform vectors

[(40, 0.20859533946076006), (2034, 0.9780020369893154)]
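
Assuming Gensim's default `TfidfModel` settings (raw count times log2 of N over document frequency, then scaling to unit length), the weighting can be sketched in plain Python. The unit-length step is visible in the output above: 0.2086² + 0.9780² ≈ 1. The function `tfidf_vector` and its `doc_freq` argument are hypothetical names, and the numbers in the demo are made up for illustration.

```python
import math


def tfidf_vector(bow, doc_freq, num_docs, eps=1e-12):
    """Weight (term_id, count) pairs by count * log2(N / df), then L2-normalise."""
    weighted = []
    for term_id, count in bow:
        df = doc_freq.get(term_id, 0)
        if df == 0:
            continue  # term never seen in the corpus
        weight = count * math.log2(num_docs / df)
        if abs(weight) > eps:  # terms present in every document carry no weight
            weighted.append((term_id, weight))
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(term_id, w / norm) for term_id, w in weighted] if norm else []


# Illustrative corpus statistics: 4 documents; term 0 is in all of them,
# term 1 in two, term 2 in one
demo_doc_freq = {0: 4, 1: 2, 2: 1}
print(tfidf_vector([(0, 1), (1, 1), (2, 1)], demo_doc_freq, num_docs=4))
```

The rarer a term is across the corpus, the larger its share of the normalised vector.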


In [47]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(2, 0.006763359090215163), (4, 0.14761919690464018), (7, 0.18030142485334183), (11, 0.18030142485334183), (14, 0.1011143753694164), (17, 0.1988910730618725), (18, 0.002186455186415519), (20, 0.032922000221756655), (21, 0.1890137984833096), (22, 0.11418600127258902), (23, 0.2615654259041742), (25, 0.0629141251153567), (26, 0.2615654259041742), (27, 0.33091575485964264), (29, 0.019959877938092653), (31, 0.2615654259041742), (32, 0.14251581617711098), (34, 0.29155755079777423), (36, 0.06584400044351331), (37, 0.2615654259041742), (40, 0.05504070849093269), (42, 0.29155755079777423), (44, 0.07984146333480929), (46, 0.29155755079777423), (48, 0.24028567464054193), (50, 0.05962604239725152), (52, 0.17250794107071102), (54, 0.06169518837332447), (55, 0.17250794107071102), (57, 0.11089791855448385)]
[(29, 0.01869143059897337), (40, 0.007363268496269478), (44, 0.021362157130157745), (50, 0.01595337602215776), (54, 0.005502330297892732), (55, 0.046155739003669705), (57, 0.11868613938770207), (6

### Use LSI
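LSI (latent semantic indexing) supports incremental training: new documents can be added to an existing model with `add_documents`, without retraining from scratch.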

In [50]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)

2017-08-04 12:07:00,198 : INFO : using serial LSI version on this node
2017-08-04 12:07:00,201 : INFO : updating model with new documents
2017-08-04 12:07:00,321 : INFO : preparing a new chunk of documents
2017-08-04 12:07:00,327 : INFO : using 100 extra samples and 2 power iterations
2017-08-04 12:07:00,327 : INFO : 1st phase: constructing (4401, 400) action matrix
2017-08-04 12:07:00,342 : INFO : orthonormalizing (4401, 400) action matrix
2017-08-04 12:07:00,676 : INFO : 2nd phase: running dense svd on (400, 103) matrix
2017-08-04 12:07:00,688 : INFO : computing the final decomposition
2017-08-04 12:07:00,689 : INFO : keeping 102 factors (discarding 0.000% of energy spectrum)
2017-08-04 12:07:00,703 : INFO : processed documents up to #103
2017-08-04 12:07:00,705 : INFO : topic #0(2.491): 0.322*"vat" + 0.238*"goods" + 0.141*"includes" + 0.118*"notice" + 0.112*"scheme" + 0.107*"claim" + 0.107*"uk" + 0.104*"hmrc" + 0.102*"reclaim" + 0.100*"living"
2017-08-04 12:07:00,706 : INFO : topic 
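
The log line "keeping 102 factors" follows from the shape of the data, even though `num_topics=300` was requested. In the usual formulation of LSI (stated here as background, not taken from the notebook), the term-document matrix $X$ with $m$ terms and $n$ documents is approximated by a truncated SVD:

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad k \le \operatorname{rank}(X) \le \min(m, n)
$$

With $m = 4401$ terms and $n = 103$ documents, at most 103 factors can exist, so the model cannot return 300 topics for this corpus. A bag-of-words query $q$ is then projected into topic space as $\hat{q} = U_k^{\top} q$, optionally rescaled by $\Sigma_k^{-1}$.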

### Now add a more complex question

In [51]:
question = "Tell me how to file VAT for import and export"
vec_bow = dictionary.doc2bow(question.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

[(0, 0.48119610439034577), (1, 0.25994149245796588), (2, 0.076082591435192859), (3, -0.051504420799863573), (4, -0.023168871452661412), (5, -0.05089904282473981), (6, -0.15852924315936828), (7, -0.0027993215374255212), (8, -0.1427505225014048), (9, -0.042011186631463353), (10, 0.0014453531931023049), (11, 0.30072237866434931), (12, 0.014070740059230039), (13, 0.15111767300311374), (14, 0.16208341221576406), (15, 0.052707231438354457), (16, 0.11234543139905373), (17, 0.034574461987621409), (18, 0.073252654880957724), (19, 0.056238215853928984), (20, 0.15004451694452894), (21, 0.068057329599508568), (22, 0.060826915824870029), (23, 0.054418123113996332), (24, -0.021347383632959394), (25, 0.011647333466497559), (26, 0.023807910930421283), (27, 0.0074855655852057237), (28, -0.056350320936544566), (29, -0.12741423931405657), (30, 0.014436062981731376), (31, -0.00073606104530137819), (32, -0.018703464739265541), (33, -0.037613342793211865), (34, 0.035757252593924448), (35, -0.059743978969219

In [55]:
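# Transform the corpus to LSI space and index it. Note: lsi was trained on corpus_tfidf,
# but corpus (and vec_bow above) are raw counts, so both skip the TF-IDF step.
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('./tmp/deerwester.index')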


2017-08-04 13:07:29,497 : INFO : creating matrix with 103 documents and 102 features
2017-08-04 13:07:29,586 : INFO : saving MatrixSimilarity object under ./tmp/deerwester.index, separately None
2017-08-04 13:07:29,589 : INFO : saved ./tmp/deerwester.index


Load the index if saved previously

In [None]:
index = similarities.MatrixSimilarity.load('./tmp/deerwester.index')

In [56]:
sims = index[vec_lsi]

In [57]:
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.14245054), (1, 0.034225542), (2, 0.39325169), (3, 0.20444174), (4, 0.0097444504), (5, 0.27023506), (6, 0.017110713), (7, 0.028758645), (8, 0.24716249), (9, 0.018415801), (10, 0.050054908), (11, 0.39741176), (12, 0.4859674), (13, 0.013322771), (14, 0.00046807528), (15, 0.42774644), (16, 0.31227803), (17, 0.36688143), (18, 0.08773388), (19, -0.0083340947), (20, 0.1424399), (21, 0.026853286), (22, 0.023034785), (23, 0.38692591), (24, 0.49379462), (25, 0.3481552), (26, 0.41757524), (27, 0.23552638), (28, 0.38455695), (29, 0.20469952), (30, 0.40674463), (31, 0.011883866), (32, 0.29844257), (33, 0.32790247), (34, 0.37167236), (35, 0.34008205), (36, 0.35128781), (37, 0.12713417), (38, 0.36512741), (39, 0.00022258144), (40, 0.6705566), (41, 0.4015823), (42, 0.034225542), (43, 0.34120247), (44, 0.036710106), (45, 0.01227235), (46, 0.48092169), (47, 0.36596948), (48, 0.30134642), (49, 0.17780286), (50, 0.16011497), (51, -0.0047097486), (52, 0.16806617), (53, 0.26889855), (54, 0.40439928),
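
Each score is the cosine similarity between the question and one document in LSI space, which is why the values fall between -1 and 1. A minimal sketch of that measure over sparse `(id, weight)` vectors follows; `cosine_similarity` is a hypothetical helper written for illustration, not the routine `MatrixSimilarity` runs internally.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse (id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([(0, 1.0), (1, 1.0)], [(0, 1.0)]))
```

Because the measure depends only on direction, a long page and a short page about the same topics can score equally well against a question.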

In [58]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(40, 0.6705566), (97, 0.62273836), (61, 0.56415594), (77, 0.49929014), (24, 0.49379462), (12, 0.4859674), (89, 0.48495665), (46, 0.48092169), (82, 0.44793889), (65, 0.43265408), (15, 0.42774644), (92, 0.42496577), (26, 0.41757524), (84, 0.41104642), (30, 0.40674463), (54, 0.40439928), (41, 0.4015823), (11, 0.39741176), (2, 0.39325169), (87, 0.39217973), (95, 0.38893798), (23, 0.38692591), (28, 0.38455695), (99, 0.38452324), (79, 0.38130826), (34, 0.37167236), (17, 0.36688143), (47, 0.36596948), (38, 0.36512741), (68, 0.35327187), (36, 0.35128781), (94, 0.3490333), (25, 0.3481552), (43, 0.34120247), (35, 0.34008205), (91, 0.33915207), (81, 0.3336615), (69, 0.32809725), (33, 0.32790247), (73, 0.31360465), (16, 0.31227803), (80, 0.31224293), (57, 0.30777568), (67, 0.30315095), (48, 0.30134642), (72, 0.29922992), (32, 0.29844257), (66, 0.28229317), (74, 0.27186039), (5, 0.27023506), (53, 0.26889855), (93, 0.26876965), (55, 0.26798639), (8, 0.24716249), (78, 0.24222277), (64, 0.23739362), 

In [62]:
print (cache.get_urls(domain)[40])

/guidance/vat-imports-acquisitions-and-purchases-from-abroad


In [70]:
urls = cache.get_urls(domain)
for page_id, _ in sims[:5]:
    print(urls[page_id])

/guidance/vat-imports-acquisitions-and-purchases-from-abroad
/guidance/how-to-value-your-imports-for-customs-duty-and-trade-statistics
/guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
/vat-registration
/guidance/vat-exports-dispatches-and-supplying-goods-abroad
