# KB demo

Work through a set of knowledge base (KB) articles to see what spaCy can determine about them, and how we could use this to improve the customer experience.


### Set up cache

We'll use the HMRC VAT pages on GOV.UK for this (https://www.gov.uk/topic/business-tax/vat) and crawl to a depth of 2 (this page, and the ones it directly links to).

This will load data from a previous crawl, if there is any.
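To make "depth of 2" concrete, here is a minimal sketch of a depth-limited traversal over an in-memory link graph. The `LINKS` dictionary and the `crawl_paths` function are hypothetical stand-ins for illustration only; how the real `Crawler` counts depth is an assumption based on the description above.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: path -> paths it links to.
LINKS = {
    "/topic/business-tax/vat": ["/vat-retail-schemes", "/vat-registration"],
    "/vat-retail-schemes": ["/vat-margin-schemes"],
    "/vat-registration": [],
    "/vat-margin-schemes": ["/reclaim-vat"],
}


def crawl_paths(start, depth, links=LINKS):
    """Return the paths reachable from start; depth=1 means the start page only."""
    seen = {start}
    queue = deque([(start, 1)])
    while queue:
        path, level = queue.popleft()
        if level >= depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for target in links.get(path, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return sorted(seen)


found = crawl_paths("/topic/business-tax/vat", depth=2)
print(found)
```

With `depth=2` the result holds the start page and the two pages it links to directly; `/vat-margin-schemes` is only reached at `depth=3`.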

In [1]:
from crawler.crawler import Crawler, CrawlerCache

cache = CrawlerCache('crawler.db')
domain = 'www.gov.uk'  # TODO: Move this earlier 
base_url = 'topic/business-tax/vat'

### Set up an initial crawl to populate the cache for later

**Only run this if the cache is empty or it's a new domain!**
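One way to make this check explicit is sketched below. `StubCache` and `needs_crawl` are hypothetical names; the sketch assumes only that the cache exposes `get_urls(domain)` and that it returns an empty sequence when nothing has been stored for that domain.

```python
class StubCache:
    """Hypothetical stand-in for CrawlerCache, exposing only get_urls()."""

    def __init__(self, urls_by_domain):
        self._urls = urls_by_domain

    def get_urls(self, domain):
        return self._urls.get(domain, [])


def needs_crawl(cache, domain):
    """Return True when the cache holds no pages for this domain."""
    return len(cache.get_urls(domain)) == 0


empty = StubCache({})
filled = StubCache({"www.gov.uk": ["/vat-registration"]})
print(needs_crawl(empty, "www.gov.uk"), needs_crawl(filled, "www.gov.uk"))
```

A guard like `if needs_crawl(cache, domain):` around the crawl call would avoid re-crawling by accident.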

In [2]:
#crawler = Crawler(cache, depth=2)
#crawler.crawl('https://{}/{}'.format(domain,base_url))

Crawl www.gov.uk
retrieving url... [www.gov.uk] /topic/business-tax/vat
set: self.is_cacheable(/topic/business-tax/vat)=True 
retrieving url... [www.gov.uk] 
set: self.is_cacheable()=True 
retrieving url... [www.gov.uk] /vat-retail-schemes
set: self.is_cacheable(/vat-retail-schemes)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/email-signup
set: self.is_cacheable(/topic/business-tax/vat/email-signup)=True 
retrieving url... [www.gov.uk] /browse/driving
set: self.is_cacheable(/browse/driving)=True 
retrieving url... [www.gov.uk] /vat-building-new-home
set: self.is_cacheable(/vat-building-new-home)=True 
retrieving url... [www.gov.uk] /government/organisations/government-digital-service
set: self.is_cacheable(/government/organisations/government-digital-service)=True 
retrieving url... [www.gov.uk] /government/publications
set: self.is_cacheable(/government/publications)=True 
retrieving url... [www.gov.uk] /topic/business-tax/vat/latest
set: self.is_cacheable(/topic/busin

set: self.is_cacheable(/government/collections/vat-forms)=True 
retrieving url... [www.gov.uk] /government/organisations
set: self.is_cacheable(/government/organisations)=True 
retrieving url... [www.gov.uk] /browse/housing-local-services
set: self.is_cacheable(/browse/housing-local-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
set: self.is_cacheable(/guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country)=True 
retrieving url... [www.gov.uk] /government/organisations/department-for-work-pensions
set: self.is_cacheable(/government/organisations/department-for-work-pensions)=True 
retrieving url... [www.gov.uk] /guidance/vat-how-to-work-out-your-place-of-supply-of-services
set: self.is_cacheable(/guidance/vat-how-to-work-out-your-place-of-supply-of-services)=True 
retrieving url... [www.gov.uk] /guidance/vat-capital-goods-scheme-and-capital-assets
set: self.is_cacheable(/guidance/vat-capital-goods-s

Show what pages we have found and indexed

In [2]:
#for key in crawler.content['www.gov.uk'].keys():
#    print (key)

View one of the pages as an example

In [3]:
#page = crawler.content['www.gov.uk']['/vat-record-keeping']
#print (page)

## Now get the text from the page to parse

In [4]:
#import spacy
import numpy as np
from bs4 import BeautifulSoup

Stop words from http://xpo6.com/download-stop-word-list/

* The list needs extending, perhaps to also remove gov.uk, hm, etc.
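A minimal sketch of that extension follows. The base set is a tiny sample rather than the full list, and the site-specific additions in `SITE_STOPS` are hypothetical choices. It also strips leading and trailing punctuation, so that tokens such as `content.` and `(hmrc)` are normalised before the lookup.

```python
import string

BASE_STOPS = {"a", "the", "to", "of", "for", "and", "you"}  # tiny sample, not the full list
SITE_STOPS = {"gov.uk", "hm", "skip", "main", "content", "cookies"}  # hypothetical additions
STOPS = BASE_STOPS | SITE_STOPS


def clean_tokens(text):
    """Lower-case, split on whitespace, strip edge punctuation and drop stop words."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # 'content.' -> 'content', 'gov.uk' is unchanged
        if token and token not in STOPS:
            tokens.append(token)
    return tokens


sample = ("Skip to main content. GOV.UK uses cookies. "
          "Apply to HM Revenue and Customs (HMRC) for a VAT refund.")
cleaned = clean_tokens(sample)
print(cleaned)
```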

In [5]:
stoplist = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

In [6]:
def get_text_from_html(html):
    """Return the text of an HTML page, without scripts, styles or navigation."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements that carry no article content
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.extract()
    return soup.get_text()

In [7]:
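def remove_stop_words(doc):
    """Lower-case the text, split on whitespace and drop stop words."""
    stop_words = set(stoplist)  # set lookup is faster than scanning the list
    return [word for word in doc.lower().split() if word not in stop_words]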

### Build an array of parsed documents

In [8]:
pages=[]
for url in cache.get_urls(domain):
    pages.append(remove_stop_words(get_text_from_html(cache.get(domain,url))))

In [9]:
print (pages[5])

['building', 'new', 'home', 'vat', '-', 'gov.uk', 'skip', 'main', 'content', 'gov.uk', 'uses', 'cookies', 'make', 'site', 'simpler.', 'find', 'out', 'more', 'cookies', 'gov.uk', 'search', 'search', 'home', 'money', 'tax', 'vat', 'building', 'new', 'home', 'vat', '1.', 'overview', 'apply', 'vat', 'refund', 'building', 'materials', 'services', 'youre:', 'building', 'new', 'home', 'converting', 'property', 'home', 'building', 'non-profit', 'communal', 'residence', '-', 'eg', 'hospice', 'building', 'property', 'charity', 'building', 'work', 'materials', 'qualify', 'apply', 'hm', 'revenue', 'customs', '(hmrc)', 'within', '3', 'months', 'completing', 'work.', 'separate', 'guide', 'vat', 'youre', 'working', 'construction', 'industry.', 'print', 'entire', 'guide', 'vat', 'elsewhere', 'gov.uk', 'help', 'improve', 'gov.uk', 'dont', 'include', 'personal', 'financial', 'information', 'national', 'insurance', 'number', 'credit', 'card', 'details.', 'doing', 'went', 'wrong', 'send']


### Let's introduce Gensim

In [10]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora


Using TensorFlow backend.
2017-08-15 16:51:46,081 : INFO : 'pattern' package not found; tag filters are not available for English


Remove words that only appear once

In [11]:
from collections import defaultdict

frequency = defaultdict(int)
for page in pages:
    for token in page:
        frequency[token] += 1

# Keep only tokens that occur more than once across the whole corpus
pages = [[token for token in page if frequency[token] > 1]
         for page in pages]

In [12]:
pages[5]

['building',
 'new',
 'home',
 'vat',
 '-',
 'gov.uk',
 'skip',
 'main',
 'content',
 'gov.uk',
 'uses',
 'cookies',
 'make',
 'site',
 'simpler.',
 'find',
 'out',
 'more',
 'cookies',
 'gov.uk',
 'search',
 'search',
 'home',
 'money',
 'tax',
 'vat',
 'building',
 'new',
 'home',
 'vat',
 '1.',
 'overview',
 'apply',
 'vat',
 'refund',
 'building',
 'materials',
 'services',
 'building',
 'new',
 'home',
 'converting',
 'property',
 'home',
 'building',
 'non-profit',
 'communal',
 'residence',
 '-',
 'eg',
 'hospice',
 'building',
 'property',
 'charity',
 'building',
 'work',
 'materials',
 'qualify',
 'apply',
 'hm',
 'revenue',
 'customs',
 '(hmrc)',
 'within',
 '3',
 'months',
 'completing',
 'separate',
 'guide',
 'vat',
 'youre',
 'working',
 'construction',
 'print',
 'entire',
 'guide',
 'vat',
 'elsewhere',
 'gov.uk',
 'help',
 'improve',
 'gov.uk',
 'dont',
 'include',
 'personal',
 'financial',
 'information',
 'national',
 'insurance',
 'number',
 'credit',
 'card',
 'd

Now create a dictionary (initially just a bag of words)
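As a rough illustration of what the dictionary and its bag-of-words conversion do, here is a plain-Python sketch. `build_token_ids` and `doc_to_bow` are hypothetical helpers, not Gensim functions, and the ids Gensim assigns will differ from these.

```python
def build_token_ids(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Return sorted (token id, count) pairs; tokens not in the dictionary are ignored."""
    counts = {}
    for token in tokens:
        if token in token2id:
            idx = token2id[token]
            counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())


docs = [["vat", "refund", "building", "vat"], ["vat", "import", "goods"]]
token2id = build_token_ids(docs)
bow = doc_to_bow(["import", "vat", "vat", "export"], token2id)
print(token2id)
print(bow)
```

Here `export` is dropped because it never appeared in the documents, which is also why a question can only match on words the crawled pages contain.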

In [13]:
import os

os.makedirs('./tmp', exist_ok=True)  # make sure the output directory exists
dictionary = corpora.Dictionary(pages)
dictionary.save('./tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary.token2id)

2017-08-15 16:51:46,172 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-08-15 16:51:46,243 : INFO : built Dictionary(4401 unique tokens: ['zones', 'peak', 'surveillance', 'rates', '727/3']...) from 103 documents (total 59407 corpus positions)
2017-08-15 16:51:46,245 : INFO : saving Dictionary object under ./tmp/deerwester.dict, separately None
2017-08-15 16:51:46,249 : INFO : saved ./tmp/deerwester.dict


{'zones': 3460, 'peak': 514, 'surveillance': 1453, 'rates': 64, '727/3': 3842, 'falsely': 3409, 'happen,': 3232, 'cycle': 2251, 'exports': 2964, '700/56:': 1858, 'organisation:': 14, 'publications': 389, 'thinking': 2068, 'devolution': 3704, 'shop': 2608, 'women': 4329, 'flagging': 3714, 'products': 534, 'mac': 535, 'administration': 536, 'last': 2125, 'closure': 4060, 'lines.': 3799, 'promotion': 709, 'curriculum,': 542, 'lwfans': 4355, 'discounts': 3713, 'make.': 3475, 'cleveland': 547, 'apart': 3433, 'free': 3510, 'registering,': 4227, 'germany': 551, 'november': 2884, 'passports': 78, 'collecting': 3888, 'information': 36, 'establishments.': 2549, 'parenting': 81, 'netherlands': 555, 'fund': 556, 'usa': 558, 'workers': 2758, 'june': 422, 'excluding': 269, 'nato': 565, '316': 3413, 'quality': 567, 'parts': 2606, 'boost': 2942, 'unserviceable': 4009, 'privately,': 4132, 'procurement': 571, 'albania': 572, 'sign': 2565, 'litigation': 574, 'deduction': 3702, 'private': 1887, 'overpayme

Test against a sample question...

In [15]:
#question = "Import VAT"
#q_vec = dictionary.doc2bow(question.lower().split())
#print (q_vec)

In [16]:
corpus = [dictionary.doc2bow(page) for page in pages]
corpora.MmCorpus.serialize('./tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

2017-08-15 16:51:56,639 : INFO : storing corpus in Matrix Market format to ./tmp/deerwester.mm
2017-08-15 16:51:56,640 : INFO : saving sparse matrix to ./tmp/deerwester.mm
2017-08-15 16:51:56,641 : INFO : PROGRESS: saving document #0
2017-08-15 16:51:56,694 : INFO : saved 103x4401 matrix, density=5.262% (23852/453303)
2017-08-15 16:51:56,696 : INFO : saving MmCorpus index to ./tmp/deerwester.mm.index


[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 2), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 4), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 1)], [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (8, 1), (9, 1), (10, 8), (11, 4), (12, 4), (13, 1), (16, 1), (19, 1), (22, 1), (23, 1), (24, 2), (26, 1), (27, 1), (31, 1), (33, 2), (34, 1), (35, 1), (36, 4), (37, 1), (39, 1), (40, 1), (41, 7), (43, 6), (47, 1), (49, 1), (50, 1), (51, 1), (53, 1), (55, 1), (56, 1), (58, 3), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 

### Transformations
Based on https://radimrehurek.com/gensim/tut2.html
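The first transformation used here is TF-IDF. The sketch below shows the idea in plain Python, assuming the common scheme of term count times log2(N / document frequency) followed by unit-length normalisation. I believe this is close to Gensim's default, but treat the exact formula as an assumption; `tfidf_weights` is a hypothetical helper.

```python
import math


def tfidf_weights(corpus):
    """Weight each (term id, count) pair by inverse document frequency, then L2-normalise."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted = []
    for doc in corpus:
        vec = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        vec = [(t, w) for t, w in vec if w > 0.0]  # terms found in every document score zero
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
weights = tfidf_weights(corpus)
print(weights)
```

Term 0 occurs in both documents, so it carries no weight and disappears from the output, much as words like "vat" or "gov.uk" contribute little when they appear on almost every page.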

In [17]:
import os
from gensim import corpora, models, similarities
if os.path.exists("./tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('./tmp/deerwester.dict')
    corpus = corpora.MmCorpus('./tmp/deerwester.mm')
    print("Used files generated above")
else:
    print("Please run code above to generate data set")

2017-08-15 16:52:01,043 : INFO : loading Dictionary object from ./tmp/deerwester.dict
2017-08-15 16:52:01,048 : INFO : loaded ./tmp/deerwester.dict
2017-08-15 16:52:01,050 : INFO : loaded corpus index from ./tmp/deerwester.mm.index
2017-08-15 16:52:01,051 : INFO : initializing corpus reader from ./tmp/deerwester.mm
2017-08-15 16:52:01,053 : INFO : accepted corpus with 103 documents, 4401 features, 23852 non-zero entries


Used files generated above


In [18]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2017-08-15 16:52:07,750 : INFO : collecting document frequencies
2017-08-15 16:52:07,761 : INFO : PROGRESS: processing document #0
2017-08-15 16:52:07,862 : INFO : calculating IDF weights for 103 documents and 4400 features (23852 matrix non-zeros)


In [19]:
#print(tfidf[q_vec]) # step 2 -- use the model to transform vectors

In [20]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.2915575507977742), (1, 0.172507941070711), (6, 0.18901379848330957), (7, 0.3309157548596426), (12, 0.11089791855448383), (14, 0.26156542590417414), (15, 0.172507941070711), (17, 0.2915575507977742), (18, 0.0021864551864155188), (20, 0.24028567464054187), (21, 0.06291412511535668), (25, 0.1803014248533418), (28, 0.11418600127258899), (29, 0.26156542590417414), (30, 0.10111437536941639), (32, 0.26156542590417414), (34, 0.07984146333480928), (37, 0.06169518837332445), (38, 0.1803014248533418), (41, 0.01995987793809265), (42, 0.14761919690464015), (44, 0.26156542590417414), (45, 0.006763359090215162), (46, 0.19889107306187245), (47, 0.055040708490932684), (48, 0.2915575507977742), (51, 0.05962604239725151), (52, 0.14251581617711098), (54, 0.03292200022175665), (57, 0.0658440004435133)]
[(1, 0.046155739003669684), (12, 0.11868613938770202), (34, 0.02136215713015774), (37, 0.005502330297892729), (41, 0.018691430598973362), (47, 0.007363268496269475), (51, 0.015953376022157753), (60, 0

### Use LSI
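LSI (latent semantic indexing) supports online training, so the model can be updated with new documents later without retraining from scratch.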

In [21]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
index.save('./tmp/deerwester.index')

2017-08-15 16:52:34,687 : INFO : using serial LSI version on this node
2017-08-15 16:52:34,689 : INFO : updating model with new documents
2017-08-15 16:52:34,815 : INFO : preparing a new chunk of documents
2017-08-15 16:52:34,820 : INFO : using 100 extra samples and 2 power iterations
2017-08-15 16:52:34,821 : INFO : 1st phase: constructing (4401, 400) action matrix
2017-08-15 16:52:34,838 : INFO : orthonormalizing (4401, 400) action matrix
2017-08-15 16:52:35,153 : INFO : 2nd phase: running dense svd on (400, 103) matrix
2017-08-15 16:52:35,168 : INFO : computing the final decomposition
2017-08-15 16:52:35,169 : INFO : keeping 102 factors (discarding 0.000% of energy spectrum)
2017-08-15 16:52:35,182 : INFO : processed documents up to #103
2017-08-15 16:52:35,185 : INFO : topic #0(2.491): 0.322*"vat" + 0.238*"goods" + 0.141*"includes" + 0.118*"notice" + 0.112*"scheme" + 0.107*"claim" + 0.107*"uk" + 0.104*"hmrc" + 0.102*"reclaim" + 0.100*"living"
2017-08-15 16:52:35,187 : INFO : topic 

### Now add a more complex question

In [22]:
#question = "What VAT applies to import and export"
#vec_bow = dictionary.doc2bow(question.lower().split())
#vec_lsi = lsi[vec_bow]
#print(vec_lsi)

In [23]:
#index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
#index.save('./tmp/deerwester.index')


Load the index if saved previously

In [24]:
#index = similarities.MatrixSimilarity.load('./tmp/deerwester.index')

In [25]:
#sims = index[vec_lsi]

In [26]:
#print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

In [27]:
#sims = sorted(enumerate(sims), key=lambda item: -item[1])
#print(sims) # print sorted (document number, similarity score) 2-tuples

In [28]:
def get_predicted_urls(sims, n=5):
    """Return the URLs of the n pages with the highest similarity scores."""
    all_urls = cache.get_urls(domain)  # fetch once rather than on every iteration
    # Sort (document number, similarity score) pairs, best match first
    ranked = sorted(enumerate(sims), key=lambda item: -item[1])
    return [all_urls[page_id] for page_id, _ in ranked[:n]]

In [29]:
#print (get_predicted_urls(sims))

Create a vector based on the question and the context
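A small sketch of how the question and context can be combined into tokens before vectorising. `combine_question_and_context` is a hypothetical helper. Note the separating space: gluing the two strings together directly would merge the last question word with the first context word.

```python
def combine_question_and_context(question, context):
    """Join question and context with a space, then lower-case and split into tokens."""
    return " ".join([question, context]).lower().split()


question_text = "What VAT schemes are available"
context_text = "import export"

tokens = combine_question_and_context(question_text, context_text)
glued = (question_text + context_text).lower().split()  # no separator between the two strings
print(tokens)
print(glued)
```

In the second list the token `availableimport` matches nothing in the dictionary, so the `import` context would be lost.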

In [30]:
import_export = 'import export'
new_business = 'register for vat'
small_business = 'vat schemes'

In [41]:
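def get_similarity(lsi, q_vec):
    """Return the similarity of q_vec to every document in the corpus."""
    # The index is rebuilt from the corpus on every call
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[q_vec]
    return sims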

In [31]:
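def create_question_vector(question, context):
    # Join with a space so the last question word and the first context word stay separate tokens
    question = '{} {}'.format(question, context)
    vec_bow = dictionary.doc2bow(question.lower().split())
    vec_lsi = lsi[vec_bow]
    return vec_lsi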

In [40]:
question = "What VAT schemes are available"
context = import_export
q_vec = create_question_vector(question, context)

Determine the most appropriate answers
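The ranking step amounts to scoring every page against the question vector by cosine similarity and keeping the best few. The sketch below shows this in plain Python with made-up two-dimensional vectors; `cosine` and `top_matches` are hypothetical helpers and the vectors are illustrative, not values from the model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_matches(query, doc_vectors, urls, n=5):
    """Return the n URLs whose document vectors are most similar to the query."""
    scored = [(cosine(query, vec), url) for vec, url in zip(doc_vectors, urls)]
    scored.sort(key=lambda pair: -pair[0])
    return [url for _, url in scored[:n]]


# Illustrative vectors only: one row per page, in the same order as the URLs.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
urls = ["/vat-retail-schemes", "/vat-registration", "/reclaim-vat"]
best = top_matches([1.0, 0.1], doc_vectors, urls, n=2)
print(best)
```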

In [42]:
sims = get_similarity(lsi, q_vec )
urls = get_predicted_urls(sims)

2017-08-15 16:53:48,736 : INFO : creating matrix with 103 documents and 102 features


In [43]:
for url in urls:
    print ('http://{}{}'.format(domain, url))

http://www.gov.uk/vat-retail-schemes
http://www.gov.uk/vat-margin-schemes
http://www.gov.uk/guidance/register-and-use-the-vat-mini-one-stop-shop
http://www.gov.uk/vat-registration
http://www.gov.uk/reclaim-vat
