# KB demo

Work on a set of KB articles to see what spaCy can determine about them, and how we could use this to assist in customer experience


### Start by crawling some text pages

We'll use the hmrc.gov.uk VAT pages for this (https://www.gov.uk/topic/business-tax/vat) and search for a depth of 2 (this page, and the ones it directly links to)

In [None]:
from crawler.crawler import Crawler, CrawlerCache

crawler = Crawler(CrawlerCache('crawler.db'), depth=2)

In [2]:
domain = 'www.gov.uk'  # TODO: Move this earlier 
base_url = 'topic/business-tax/vat'
crawler.crawl('https://{}/{}'.format(domain,base_url))

Crawl www.gov.uk
retrieving url... [www.gov.uk] /topic/business-tax/vat
retrieving url... [www.gov.uk] /importing-vehicles-into-the-uk
retrieving url... [www.gov.uk] /browse/citizenship
retrieving url... [www.gov.uk] /guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
retrieving url... [www.gov.uk] 
retrieving url... [www.gov.uk] /guidance/vat-how-to-work-out-your-place-of-supply-of-services
retrieving url... [www.gov.uk] /vat-businesses
retrieving url... [www.gov.uk] /government/publications/vat-notice-700-the-vat-guide
retrieving url... [www.gov.uk] /government/collections/exchange-rates-for-customs-and-vat
retrieving url... [www.gov.uk] /browse/abroad
retrieving url... [www.gov.uk] /government/organisations/government-digital-service
retrieving url... [www.gov.uk] /government/collections/vat-manuals
retrieving url... [www.gov.uk] /guidance/vat-refunds-for-non-eu-businesses-visiting-the-uk
retrieving url... [www.gov.uk] /send-vat-return
retrieving url... [www.gov.

Show what pages we have found and indexed

In [3]:
for key in crawler.content['www.gov.uk'].keys():
    print (key)

/importing-vehicles-into-the-uk
/browse/citizenship
/guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
/government/publications/vat-notice-700-the-vat-guide

/government/collections/exchange-rates-for-customs-and-vat
/browse/visas-immigration
/browse/abroad
/guidance/vat-how-to-work-out-your-place-of-supply-of-services
/government/collections/vat-manuals
/guidance/vat-refunds-for-non-eu-businesses-visiting-the-uk
/send-vat-return
/guidance/vat-get-clearance-on-the-rules-for-complex-transactions
/government/collections/vat-forms
/guidance/vat-imports-acquisitions-and-purchases-from-abroad
/vat-record-keeping
/government/organisations/hm-revenue-customs
/government/collections/vat-notes
/starting-to-export
/government/collections/vat-moss-vat-on-sales-of-digital-services-in-the-eu
/help/terms-conditions
/browse/childcare-parenting
/topic/business-tax/vat/email-signup
/vat-motor-dealers
/government/organisations/driver-and-vehicle-licensing-agency
/government/news/web

View one of the pages as an example

In [14]:
page = crawler.content['www.gov.uk']['/vat-record-keeping']
print (page)

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en">
<!--<![endif]-->
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>VAT record keeping - GOV.UK</title>

    <!--[if gt IE 8]><!--><link href="https://assets.publishing.service.gov.uk/static/govuk-template-2775f99eaec64ff8121bfbfb3eb67b0c2b4b7c3fc78d25da30e12db2a09d30d6.css" media="screen" rel="stylesheet">
<!--<![endif]-->
    <!--[if IE 6]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie6-5bb08c355a12ac38b0ac9d2446da122ec0f81c78e02dcd2a98766f53c23793c8.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 7]><link href="https://assets.publishing.service.gov.uk/static/govuk-template-ie7-be1ea757827710f20eae59ae3ebfd172b7dbeabb171a79945ba610947eebb3cc.css" media="screen" rel="stylesheet" /><![endif]-->
    <!--[if IE 8]><link href="https://assets.publishing.service.go




## Now get the text from the page to parse

In [15]:
import spacy
import numpy as np
from bs4 import BeautifulSoup

In [16]:
nlp=spacy.load('en')
soup = BeautifulSoup(page, 'html.parser')

In [35]:
def get_text_from_html(html):
    for script in soup(["script", "style", "nav", "footer"]):
        script.extract()    # rip it out
    text = soup.get_text()
    return text

## Set up training data

In [45]:
X=[]
y=[]
count = 0

for url in crawler.content[domain].keys():
    print ('Extracting and getting vector for {}'.format(url))
    text = nlp(get_text_from_html(crawler.content[domain][url]))
    vector = text.vector
    X.append(vector)
    y.append(count)
    count += 1
    
X=np.array(X)
y=np.array(y)
print('Training samples: {}, labels: {}'.format(X.shape[0], y.shape[0]))

Extracting and getting vector for /importing-vehicles-into-the-uk
Extracting and getting vector for /browse/citizenship
Extracting and getting vector for /guidance/vat-relief-goods-imported-then-supplied-to-another-eu-country
Extracting and getting vector for /government/publications/vat-notice-700-the-vat-guide
Extracting and getting vector for 
Extracting and getting vector for /government/collections/exchange-rates-for-customs-and-vat
Extracting and getting vector for /browse/visas-immigration
Extracting and getting vector for /browse/abroad
Extracting and getting vector for /guidance/vat-how-to-work-out-your-place-of-supply-of-services
Extracting and getting vector for /government/collections/vat-manuals
Extracting and getting vector for /guidance/vat-refunds-for-non-eu-businesses-visiting-the-uk
Extracting and getting vector for /send-vat-return
Extracting and getting vector for /guidance/vat-get-clearance-on-the-rules-for-complex-transactions
Extracting and getting vector for /go

## Let's do some analysis on the data
This is based on my work for MLND P3: https://github.com/markstrefford/P3---Customer-Segments

In [47]:
# TODO: Apply PCA with the same number of dimensions as variables in the dataset
from sklearn.decomposition import PCA
pca = PCA(n_components=6).fit(X)

# Print the components and the amount of variance in the data contained in each dimension
print (pca.components_)
print (pca.explained_variance_ratio_)

  explained_variance_ratio_ = explained_variance_ / total_var


[[ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
[ nan  nan  nan  nan  nan  nan]


In [48]:
# Import clustering modules
from sklearn.cluster import KMeans
from sklearn.mixture import GMM

In [70]:
def plot_clusters(reduced_data, centroids):
    # Plot the decision boundary by building a mesh grid to populate a graph.
    x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
    y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
    hx = (x_max-x_min)/1000.
    hy = (y_max-y_min)/1000.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, hx), np.arange(y_min, y_max, hy))
    
    # Obtain labels for each point in mesh. Use last trained model.
    print (np.c_[xx.ravel().shape, yy.ravel()].shape)
    Z = clusters.predict(np.c_[xx.ravel(), yy.ravel()]) 
    print (Z.shape)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    print (Z.shape)
    
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

    plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
    plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
    plt.title('Clustering on the wholesale grocery dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()

In [71]:
# TODO: First we reduce the data to two dimensions using PCA to capture variation
def reduce_data_with_pca(data, n_components = 2):
    reduced_data = PCA(n_components).fit(X).transform(data)
    return reduced_data

In [72]:
reduced_data = reduce_data_with_pca(X, n_components = 20)
print (reduced_data[:10])  # print upto 10 elements
clusters = KMeans(n_clusters=20).fit(reduced_data)
print (clusters)

  explained_variance_ratio_ = explained_variance_ / total_var


[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]]
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=20, n_init=10, n_jobs=1, precompute_distances='auto',
    random

In [73]:
centroids = clusters.cluster_centers_
print (centroids)

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.

In [78]:
print (X)
print (reduced_data.shape)
print (centroids.shape)
plot_clusters(reduced_data, centroids)

[[-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]
 [-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]
 [-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]
 ..., 
 [-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]
 [-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]
 [-0.07266881  0.11562211 -0.17159715 ..., -0.07926504  0.09211879
   0.03359894]]
(103, 20)
(20, 20)


ValueError: all the input array dimensions except for the concatenation axis must match exactly

In [76]:
reduced_data

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [77]:
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
hx = (x_max-x_min)/1000.
hy = (y_max-y_min)/1000.
xx, yy = np.meshgrid(np.arange(x_min, x_max, hx), np.arange(y_min, y_max, hy))

In [82]:
xx.shape

(1000, 1000)

In [81]:
yy.shape

(1000, 1000)